I'm attempting to put a MongoDB database that I've imported with PyMongo into a pandas dataframe and then plot it by time with a "date" column of type datetime64 with matplotlib. However, I'm getting randomly connected dates. Does anyone know how I might fix this problem?
The date column seems to be unsorted. To reproduce consider e.g.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.rand(15,5), columns=list("ABCDE"))
a = np.arange("2018-05-05", "2018-05-20", dtype="datetime64[D]")
np.random.shuffle(a)
df["date"] = a
plt.plot("date", "C", data=df)
plt.show()
If we sort the dataframe by the date column now,
df.sort_values(by="date", inplace=True)
the result looks much nicer.
A tangential remark here: I would recommend deciding on one style, either
plt.plot("date", "C", data=df)
or
plt.plot(df["date"], df["C"])
and not mixing the two by supplying the x argument as a Series and the y as a string.
Related
I have recently started a project where I need to evaluate and plot data using Python. The CSV file that I have to plot is structured like this:
date,ch1,ch2,ch3,date2
11:56:20.149766,0.909257531,0.909420371,1.140183687, 13:56:20.149980
11:56:20.154008,0.895447016,0.895601869,1.122751355, 13:56:20.154197
11:56:20.157245,0.881764293,0.881911397,1.105638862, 13:56:20.157404
11:56:20.160590,-0.009178977,-0.000108901,-1.486875653, 13:56:20.160750
11:56:20.190473,-1.473576546,-1.477073431,-1.846657276, 13:56:20.190605
11:56:20.193810,-1.460405469,-1.463766813,-1.8300246, 13:56:20.193933
11:56:20.197139,-1.447362065,-1.450844049,-1.813711882, 13:56:20.197262
11:56:20.200480,-1.434574604,-1.437921286,-1.797878742, 13:56:20.200604
11:56:20.203803,-1.422042727,-1.425382376,-1.782045603, 13:56:20.203926
11:56:20.207136,-1.40951097,-1.412971258,-1.7663728, 13:56:20.207258
11:56:20.210472,-0.436505407,-0.438260257,-0.54675138, 13:56:20.210595
11:56:20.213804,0.953246772,0.953690529,1.19551909, 13:56:20.213921
11:56:20.217136,0.93815738,0.938464701,1.176487565, 13:56:20.217252
11:56:20.220472,0.923707485,0.924006522,1.158255577, 13:56:20.220590
11:56:20.223807,0.909385324,0.909676254,1.140343547, 13:56:20.223922
11:56:20.227132,0.895447016,0.895729899,1.122911215, 13:56:20.227248
11:56:20.230466,0.881892085,0.882039428,1.105798721, 13:56:20.230582
I can already read the file and print it using pandas:
df = pd.read_csv (r'F:\Schule\HTL\Diplomarbeit\aw_python\datei_meas.csv')
print (df)
But now I want to plot the file using matplotlib. The first column, date, should be on the x axis, and columns 2, 3 and 4 should be the y-values of different graphs.
I hope that anyone can help me with my problem.
Kind regards
Matthias
Edit:
This is what I have tried to convert the date-column into a readable file-format:
import matplotlib.pyplot as plt
import numpy as np
import mplcursors
import pandas as pd
import matplotlib.dates as mdates
df = pd.read_csv (r'F:\Schule\HTL\Diplomarbeit\aw_python\datei_meas.csv')
print (df)
x_list = df.date
y = df.ch1
x = mdates.date2num(x_list)
plt.scatter(x,y)
plt.show()
And this is the occurring error message:
d = d.astype('datetime64[us]')
ValueError: Error parsing datetime string " 11:56:20.149766" at position 3
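One possible way around this error, sketched under a few assumptions: the values in the date column are plain clock times, so they can be parsed with pd.to_datetime and an explicit format once stray whitespace has been stripped (pandas then attaches the placeholder date 1900-01-01). The inline sample stands in for datei_meas.csv, and the format string is an assumption based on the rows shown above.

```python
import io

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# Inline sample standing in for datei_meas.csv (first rows from the question)
raw = """date,ch1,ch2,ch3,date2
11:56:20.149766,0.909257531,0.909420371,1.140183687, 13:56:20.149980
11:56:20.154008,0.895447016,0.895601869,1.122751355, 13:56:20.154197
11:56:20.157245,0.881764293,0.881911397,1.105638862, 13:56:20.157404
"""
df = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

# Strip stray whitespace, then parse the clock times with an explicit format
# (assumed format: hours:minutes:seconds.microseconds)
df["date"] = pd.to_datetime(df["date"].str.strip(), format="%H:%M:%S.%f")

# One line per channel, all against the parsed time column
fig, ax = plt.subplots()
for channel in ["ch1", "ch2", "ch3"]:
    ax.plot(df["date"], df[channel], label=channel)
ax.xaxis.set_major_formatter(mdates.DateFormatter("%H:%M:%S.%f"))
ax.legend()
```

Calling plt.show() afterwards displays the figure.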
I was wondering why the x-axis plots the dates in the wrong order: it begins at 05/02 when it should start at 30/01. I'm not sure where I went wrong.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
cols = ['Time','Water Usage']
A = pd.read_csv("CSVFile", names=cols, parse_dates=[0])
A.plot(x='Time',y='Water Usage')
# xlabel/ylabel are functions; call them after plotting instead of assigning to them
plt.xlabel("Time")
plt.ylabel("Water Usage")
plt.show()
The file is in the format
30/01/2018 16:00:00 , 50091
05/02/2018 14:00:00, 50890
so ideally it should plot the 30/01 first followed by the 05/02, whereas currently it's doing the opposite.
If you just need to put the data in chronological order, you can simply sort the dataframe prior to plotting. You can use:
A = A.sort_index()
if the date column is set as the index. If not, then the following will do the trick:
A = A.set_index('Time').sort_index().reset_index()
Since the index is of datetime type, sorting it orders the whole dataframe chronologically.
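Note that sorting only helps if the dates were parsed as intended. With day-first strings such as 30/01/2018 and 05/02/2018, a parser that defaults to month-first can read the second value as 2 May, which would also explain the order seen in the question. A minimal sketch, assuming the file really is day-first and using an inline sample in place of the CSV file:

```python
import io

import pandas as pd

# Inline sample standing in for the CSV file from the question
raw = "30/01/2018 16:00:00 , 50091\n05/02/2018 14:00:00, 50890\n"
cols = ["Time", "Water Usage"]
A = pd.read_csv(io.StringIO(raw), names=cols)

# Parse explicitly as day/month/year (assumption: the data is day-first)
A["Time"] = pd.to_datetime(A["Time"].str.strip(), format="%d/%m/%Y %H:%M:%S")

# Sorting now gives true chronological order
A = A.sort_values("Time").reset_index(drop=True)
```

With the dates parsed this way, A.plot(x='Time', y='Water Usage') should start at 30 January.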
I want to shorten the x-axis tick labels: I'm plotting datetime information, and the labels take up so much space on the x axis that they are hard to read.
So I think I need some way to scale them.
dates = pd.read_csv("EURUSDtest.csv")
dates = dates["Date"]+" " + dates["Time"]
plt.title("EUR/USD")
plt.plot(dates, data_pred)
plt.xticks(rotation="vertical")
plt.tick_params(labelsize=10)
plt.plot(forecasting)
The problem...
IIUC: You need to convert the dates column to pandas datetime type by calling pd.to_datetime.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# To reproduce the issue you have, let's create a date column as strings
df = pd.DataFrame({"Dates":pd.date_range(start='2018-1-1', end='2019-1-1', freq='15min').strftime("%m-%d-%Y %H-%M-%S")})
# Convert the date strings to datetime type (the format is passed explicitly because it is non-standard)
df["Dates"] = pd.to_datetime(df["Dates"], format="%m-%d-%Y %H-%M-%S")
# Add column to assign some dummy values
df = df.assign(VAL=np.linspace(10, 110, len(df)))
# Plot the graph
# Now the graph automatically adjusts the XLIM based on the size of the graph
plt.title("eur/usd")
plt.plot(df["Dates"], df["VAL"])
plt.xticks(rotation="vertical")
plt.show()
However, if you need finer control over the x-axis limits and tick labels, you will need to go through the matplotlib tutorials.
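As a rough starting point for that kind of control, the sketch below uses the locator and formatter classes from matplotlib.dates to place one tick every two months and to restrict the x-axis range. The data, the tick interval and the label format are illustrative assumptions, not values from the question.

```python
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Dummy daily data for one year (illustrative only)
dates = pd.date_range("2018-01-01", "2019-01-01", freq="D")
values = np.linspace(10, 110, len(dates))

fig, ax = plt.subplots()
ax.plot(dates, values)

# One major tick every second month, labelled like "Jan 2018"
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=2))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%b %Y"))

# Limit the visible range explicitly, then rotate the labels to avoid overlap
ax.set_xlim(dates[0], dates[-1])
fig.autofmt_xdate()
```

Calling plt.show() afterwards displays the figure.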
I have a csv file with data in the following format -
Issue_Type DateTime
Issue1 03/07/2011 11:20:44
Issue2 01/05/2011 12:30:34
Issue3 01/01/2011 09:44:21
... ...
I'm able to read this csv file, but what I'm unable to achieve is to plot a graph, or rather a trend, based on the data.
For instance, I'm trying to plot a graph with X-axis as Datetime (only Month) and Y-axis as # of Issues. So I would show the trend in a line graph with 3 lines indicating the pattern of issues under each category for the month.
I really don't have any code for plotting the graph and hence can't share any; so far I'm only reading the csv file. I'm not sure how to proceed further to plot a graph.
PS: I'm not bent on using Python. Since I've parsed csv using Python earlier I thought of using the language, but if there is an easier approach using some other language, I would be open to exploring that as well.
A way to do this is to use dataframes with pandas.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
df = pd.read_csv(r"D:\Programmes Python\Data\Data_csv.txt",sep=";") #Reads the csv (raw string so the backslashes are not treated as escapes)
df.index = pd.to_datetime(df["DateTime"]) #Set the index of the dataframe to the DateTime column
del df["DateTime"] #The DateTime column is now useless
fig, ax = plt.subplots()
ax.plot(df.index,df["Issue_Type"])
ax.xaxis.set_major_formatter(mdates.DateFormatter('%m')) #This will only show the month number on the graph
This assumes that Issue1/2/3 are integers; I assumed they were, as I didn't really understand what they were supposed to be.
Edit: This should do the trick then, it's not pretty and can probably be optimised, but it works well:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
df = pd.read_csv(r"D:\Programmes Python\Data\Data_csv.txt",sep=";")
df.index = pd.to_datetime(df["DateTime"])
del df["DateTime"]
# Extract the numeric suffix of each label, e.g. "Issue3" -> 3
issue_numbers=[]
for issue in df["Issue_Type"]:
    issue_numbers.append(int(issue[5:]))
df["Issue_number"]=issue_numbers
fig, ax = plt.subplots()
ax.plot(df.index,df["Issue_number"])
ax.xaxis.set_major_formatter(mdates.DateFormatter('%m'))
plt.show()
The first thing you need to do is to parse the datetime fields as dates/times. Try using dateutil.parser for that.
Next, you will need to count the number of issues of each type in each month. The naive way to do that would be to maintain lists of lists for each issue type, and just iterate through each row, see which month and which issue type it is, and then increment the appropriate counter.
When you have such a frequency count of issues, sorted by issue types, you can simply plot them against dates like this:
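The counting step can also be sketched with pandas instead of hand-maintained lists. This is only one possible approach and rests on assumptions: the file is semicolon-separated, the dates are day-first, and the fourth sample row is made up so that one count exceeds 1.

```python
import io

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# Inline sample; the last row is invented for illustration
raw = """Issue_Type;DateTime
Issue1;03/07/2011 11:20:44
Issue2;01/05/2011 12:30:34
Issue3;01/01/2011 09:44:21
Issue1;15/07/2011 08:00:00
"""
df = pd.read_csv(io.StringIO(raw), sep=";")
df["DateTime"] = pd.to_datetime(df["DateTime"], format="%d/%m/%Y %H:%M:%S")

# Count issues per (month, issue type); one column per issue type
counts = (
    df.groupby([df["DateTime"].dt.to_period("M"), "Issue_Type"])
    .size()
    .unstack(fill_value=0)
)

# One line per issue type, months on the x axis
fig, ax = plt.subplots()
for issue_type in counts.columns:
    ax.plot(counts.index.to_timestamp(), counts[issue_type], marker="o", label=issue_type)
ax.xaxis.set_major_formatter(mdates.DateFormatter("%b"))
ax.set_ylabel("Number of issues")
ax.legend()
```

Calling plt.show() afterwards displays the figure.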
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import datetime as dt
# starting_year, ending_year and issues (one list of monthly counts per issue type)
# are assumed to be defined already
dates = []
for year in range(starting_year, ending_year):
    for month in range(1, 13):  # range() excludes the upper bound, so 13 includes December
        dates.append(dt.datetime(year=year, month=month, day=1))
date_format = mdates.DateFormatter('%b') # Format dates to only show month names
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(dates, issues[0]) # To plot just issues of type 1
ax.plot(dates, issues[1]) # To plot just issues of type 2
ax.plot(dates, issues[2]) # To plot just issues of type 3
ax.xaxis.set_major_formatter(date_format) # Format X tick labels
plt.show()
plt.close()
Honestly, I would just use R. Check this link out on downloading / setting up R & RStudio.
data <- read.csv(file="c:/yourdatafile.csv", header=TRUE, sep=",")
attach(data)
data$Month <- format(as.Date(data$DateTime), "%m")
plot(DateTime, Issue_Type)
I create a simple pandas dataframe with some random values and a DatetimeIndex like so:
import pandas as pd
from numpy.random import randint
import datetime as dt
import matplotlib.pyplot as plt
# create a random dataframe with datetimeindex
dateRange = pd.date_range('1/1/2011', '3/30/2011', freq='D')
randomInts = randint(1, 50, len(dateRange))
df = pd.DataFrame({'RandomValues' : randomInts}, index=dateRange)
Then I plot it in two different ways:
# plot with pandas own matplotlib wrapper
df.plot()
# plot directly with matplotlib pyplot
plt.plot(df.index, df.RandomValues)
plt.show()
(Do not use both statements at the same time as they plot on the same figure.)
I use Python 3.4 64bit and matplotlib 1.4. With pandas 0.14, both statements give me the expected plot (they use slightly different formatting of the x-axis which is okay; note that data is random so the plots do not look the same):
However, when using pandas 0.15, the pandas plot looks alright but the matplotlib plot has some strange tick format on the x-axis:
Is there any good reason for this behaviour and why it has changed from pandas 0.14 to 0.15?
Note that this bug was fixed in pandas 0.15.1 (https://github.com/pandas-dev/pandas/pull/8693), and plt.plot(df.index, df.RandomValues) now just works again.
The reason for this change in behaviour is that starting from 0.15, the pandas Index object is no longer a numpy ndarray subclass. But the real reason is that matplotlib does not support the datetime64 dtype.
As a workaround, if you want to use the matplotlib plot function, you can convert the index to Python datetime objects using to_pydatetime:
plt.plot(df.index.to_pydatetime(), df.RandomValues)
More detailed explanation:
Because Index is no longer a ndarray subclass, matplotlib will convert the index to a numpy array with datetime64 dtype (while before, it retained the Index object, of which scalars are returned as Timestamp values, a subclass of datetime.datetime, which matplotlib can handle). In the plot function, it calls np.atleast_1d() on the input which now returns a datetime64 array, which matplotlib handles as integers.
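The difference between the two representations can be inspected directly. The following is a small illustration only (it does not reproduce the old plotting bug); the three-day index is made up.

```python
import datetime as dt

import numpy as np
import pandas as pd

# A small DatetimeIndex for illustration
idx = pd.date_range("2011-01-01", periods=3, freq="D")

# What np.atleast_1d() hands back: a plain NumPy array of datetime64 values
as_array = np.atleast_1d(idx)
print(type(as_array), as_array.dtype)

# Scalars taken from the index are Timestamps, a datetime.datetime subclass
print(isinstance(idx[0], dt.datetime))

# The workaround converts to an array of plain datetime.datetime objects
as_pydatetime = idx.to_pydatetime()
print(type(as_pydatetime[0]))
```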
I opened an issue about this (as this possibly gets a lot of use): https://github.com/pydata/pandas/issues/8614
With matplotlib 1.5.0 this 'just works':
import pandas as pd
from numpy.random import randint
import datetime as dt
import matplotlib.pyplot as plt
# create a random dataframe with datetimeindex
dateRange = pd.date_range('1/1/2011', '3/30/2011', freq='D')
randomInts = randint(1, 50, len(dateRange))
df = pd.DataFrame({'RandomValues' : randomInts}, index=dateRange)
fig, ax = plt.subplots()
ax.plot('RandomValues', data=df)