Plotting using Pandas and datetime format - python

I have a dataframe with just two columns, Date, and ClosingPrice. I am trying to plot them using df.plot() but keep getting this error:
ValueError: view limit minimum -36785.37852 is less than 1 and is an invalid Matplotlib date value. This often happens if you pass a non-datetime value to an axis that has datetime units
I have found documentation about this from matplotlib but that says how to make sure that the format is datetime. Here is code that I have to make sure the format is datetime and also printing the data type for each column before attempting to plot.
df.Date = pd.to_datetime(df.Date)
print(df['ClosingPrice'].dtypes)
print(df['Date'].dtypes)
The output of these print statements is:
float64
datetime64[ns]
I am not sure what the problem is since I am verifying the data type before plotting. Here is also what the first few rows of the data set look like:
Date ClosingPrice
0 2013-09-10 64.7010
1 2013-09-11 61.1784
2 2013-09-12 61.8298
3 2013-09-13 60.8108
4 2013-09-16 58.8776
5 2013-09-17 59.5577
6 2013-09-18 60.7821
7 2013-09-19 61.7788
Any help is appreciated.

EDIT 2, after seeing more people end up here: to be clear for people new to Python, you should first import pandas for the code below to work:
import pandas as pd
EDIT 1: (short quick answer)
If you don't want to drop your original index (this makes sense after reading the original and long answer below) you could:
df[['Date','ClosingPrice']].plot('Date', figsize=(15,8))
Original and long answer:
Try setting your index as your Datetime column first:
df.set_index('Date', inplace=True, drop=True)
Just to be sure, try setting the index dtype (edit: this probably won't be needed as you did it previously):
df.index = pd.to_datetime(df.index)
And then plot it
df.plot()
If this solves the issue it's because when you use the .plot() from DataFrame object, the X axis will automatically be the DataFrame's index.
If your DataFrame had a DatetimeIndex and 2 other columns (say ['Currency','pct_change_1']) and you wanted to plot just one of them (maybe pct_change_1) you could:
# single [ ] transforms the column into series, double [[ ]] into DataFrame
df[['pct_change_1']].plot(figsize=(15,8))
Here figsize=(15,8) sets the size of the plot as (width, height), in inches.
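The two approaches above (naming the x column explicitly, or moving the dates into the index) can be sketched side by side. This is only a minimal illustration on the first rows of the question's data; the variable names ax_by_column, indexed and ax_by_index are made up here, and the Agg backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no window is needed
import pandas as pd

# First rows of the sample data from the question.
df = pd.DataFrame({
    "Date": ["2013-09-10", "2013-09-11", "2013-09-12", "2013-09-13"],
    "ClosingPrice": [64.7010, 61.1784, 61.8298, 60.8108],
})
df["Date"] = pd.to_datetime(df["Date"])

# Option 1: keep the default integer index and name the x column explicitly.
ax_by_column = df.plot(x="Date", y="ClosingPrice", figsize=(15, 8))

# Option 2: move the dates into the index; plot() then uses the index as the x axis.
indexed = df.set_index("Date")
ax_by_index = indexed.plot(figsize=(15, 8))
```

Either way the dates end up on the x axis; the second form is convenient if you also plan to slice or resample by date later.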

Here is a simple solution:
my_dict = {'Date':['2013-09-10', '2013-09-11', '2013-09-12', '2013-09-13', '2013-09-16', '2013-09-17', '2013-09-18',
'2013-09-19'], 'ClosingPrice': [ 64.7010, 61.1784, 61.8298, 60.8108, 58.8776, 59.5577, 60.7821, 61.7788]}
df = pd.DataFrame(my_dict)
df.set_index('Date', inplace=True)
df.plot()

Related

Plot value versus date for each rowname in Python using pandas and matplotlib

I have a dataframe with three columns and almost 800,000 rows. I want to draw a line plot where the x axis is Datetime and the y axis is Value. The problem is that I want a separate line for EACH code (there are 6 different codes) in the same plot. The codes do NOT all have the same number of rows, but that does not matter. In the end, I want a plot with 6 different lines where the x axis is Datetime and the y axis is Value. I tried many things but I cannot plot it.
Here is a sample of my dataframe
import pandas as pd
# initialise data of lists.
data = {'Code':['AABB', 'AABC', 'AABB', 'AABC','AABD', 'AABC', 'AABB', 'AABC'],
'Value':[1, 1, 2, 2,1,3,3,4],
'Datetime': ['2022-03-29','2022-03-29','2022-03-30','2022-03-30','2022-03-30','2022-03-31',
'2022-03-31','2022-03-31']}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
print(df)
I tried this but it plots something that does not make any sense
plt.plot(df["Datetime"], df["Value"], linewidth=2.0, color='b', alpha=0.5, marker='o')
Your data
There is a duplicate record, as mentioned by @claudio: there are two rows for AABC on 2022-03-31, with values 3 and 4. In such cases I have taken the average of the two values (3.5 here). Also, there is only one entry for AABD, and a line needs two points, so I have added another entry at the end. Finally, the Datetime column has been converted from strings to datetime using the pandas function pd.to_datetime().
The method
You can use the pivot_table() to change the data you have provided to a format that can be converted to line plot. Here, I have used the datetime to be the index, each of the unique Code to be a column (so that each column can be converted to a separate line) and the values as values. Note that I have used aggfunc='mean' to handle the cases of duplicate entry. This will take the mean if there are multiple datapoints. Once the pivot_table is created, that can be plotted as line plot using pandas plot.
Code
import pandas as pd
# initialise data of lists.
data = {'Code':['AABB', 'AABC', 'AABB', 'AABC','AABD', 'AABC', 'AABB', 'AABC', 'AABD'],
'Value':[1, 1, 2, 2,1,3,3,4, 4],
'Datetime': ['2022-03-29','2022-03-29','2022-03-30','2022-03-30','2022-03-30','2022-03-31','2022-03-31','2022-03-31','2022-03-31']}
# Create DataFrame
df = pd.DataFrame(data)
df['Datetime'] = pd.to_datetime(df['Datetime'])
df1 = df.pivot_table(index='Datetime', columns='Code', values='Value', aggfunc='mean')
#print the pivoted data
print(df1)
df1.plot()
Output
>>> df1
Code AABB AABC AABD
Datetime
2022-03-29 1.0 1.0 NaN
2022-03-30 2.0 2.0 1.0
2022-03-31 3.0 3.5 NaN
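To see why aggfunc='mean' matters here, the sketch below runs the question's original eight rows through both pivot() and pivot_table(). It is only an illustration; the names pivot_failed and wide are made up for this example.

```python
import pandas as pd

# The eight rows from the question, including the duplicate AABC entry on 2022-03-31.
df = pd.DataFrame({
    "Code": ["AABB", "AABC", "AABB", "AABC", "AABD", "AABC", "AABB", "AABC"],
    "Value": [1, 1, 2, 2, 1, 3, 3, 4],
    "Datetime": pd.to_datetime([
        "2022-03-29", "2022-03-29", "2022-03-30", "2022-03-30",
        "2022-03-30", "2022-03-31", "2022-03-31", "2022-03-31",
    ]),
})

# pivot() refuses duplicate (Datetime, Code) pairs and raises ValueError.
try:
    df.pivot(index="Datetime", columns="Code", values="Value")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates the duplicates instead (here: their mean).
wide = df.pivot_table(index="Datetime", columns="Code", values="Value", aggfunc="mean")
```

With the unmodified data, AABD has a single point, so its column holds one value and two NaN entries; calling wide.plot() would then draw lines only where consecutive points exist.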

What is the equivalent function of ts from R language in python? [duplicate]

I have a dataframe with various attributes, including one datetime column. I want to extract one of the attribute columns as a time series indexed by the datetime column. This seemed pretty straightforward, and I can construct time series with random values, as all the pandas docs show.. but when I do so from a dataframe, my attribute values all convert to NaN.
Here's an analogous example.
df = pd.DataFrame({'a': [0,1], 'date':[pd.to_datetime('2017-04-01'),
pd.to_datetime('2017-04-02')]})
s = pd.Series(df.a, index=df.date)
In this case, the series will have correct time series index, but all the values will be NaN.
I can do the series in two steps, as below, but I don't understand why this should be required.
s = pd.Series(df.a)
s.index = df.date
What am I missing? I assume it has to do with series references, but don't understand at all why the values would go to NaN.
I am also able to get it to work by copying the index column.
s = pd.Series(df.a, df.date.copy())
The problem is that when you pass a Series as the data together with an index, pd.Series() does not relabel the values; it uses the labels given in index to look up values in the data (a reindex). df.a is labelled 0 and 1, none of the dates appear among those labels, so every lookup returns NaN.
You can set the index to the date column and then select the one data column you want. This returns a Series with the dates as the index:
import pandas as pd
df = pd.DataFrame({'a': [0,1], 'date':[pd.to_datetime('2017-04-01'),
pd.to_datetime('2017-04-02')]})
s = df.set_index('date')['a']
Examining s gives:
In [1]: s
Out[1]:
date
2017-04-01 0
2017-04-02 1
Name: a, dtype: int64
And you can confirm that s is a Series:
In [2]: isinstance(s, pd.Series)
Out[2]: True
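The alignment behaviour described in this answer can be checked directly. The following is a small sketch, with the names aligned and from_values chosen only for illustration; passing the raw values (instead of a labelled Series) is one way to avoid the reindexing step.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [0, 1],
    "date": pd.to_datetime(["2017-04-01", "2017-04-02"]),
})

# Passing a Series as data: pandas looks up the new index labels (the dates)
# in the existing labels (0 and 1), finds none, and fills with NaN.
aligned = pd.Series(df["a"], index=df["date"])

# Passing a plain array as data: there are no labels to align on,
# so the values are simply paired with the dates in order.
from_values = pd.Series(df["a"].to_numpy(), index=df["date"], name="a")
```

from_values holds the same data as df.set_index('date')['a'].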

How to pop out the error-causing date records using pandas?

I have a dataframe like as shown below
df = pd.DataFrame({'date': ['45:42.7','11/1/2012 0:00','20/1/2012 2:48','15/1/2012 0:00',np.nan]})
I would like to convert the date column to type datetime.
So, I tried the below
df['date'] = pd.to_datetime(df['date'])
I get the below error
ValueError: hour must be in 0..23
As we can see from the sample dataframe NA is not causing this error but the 1st record which is 45:42.7.
When I open the raw Excel file, the cell displays 45:42.7, but when I double-click the cell it shows the actual date correctly.
How can I filter the dataframe to pop-out the first record as output (which is the error causing record)?
I expect my output to be like shown in sample dataframe below
df = pd.DataFrame({'error_date': ['45:42.7']})
First, if you need to see the bad values, convert to datetimes with errors='coerce' and keep the rows that come back missing:
print(df[pd.to_datetime(df['date'], format='%d/%m/%Y %H:%M',errors='coerce').isna()])
I think None is not a problem. You need to specify the column format; with the errors='coerce' parameter, rows that do not match it become NaT:
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y %H:%M',errors='coerce')
print (df)
date
0 2012-03-06 08:57:00
1 2012-01-11 00:00:00
2 2012-01-20 02:48:00
3 2012-01-15 00:00:00
4 NaT
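The filter above also returns rows that were already missing in the input. A possible refinement, sketched here on the question's sample data, keeps only the rows that had a value but failed to parse; the names parsed, failed and error_rows are made up for this example, and the format string is the one assumed in this answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"date": ["45:42.7", "11/1/2012 0:00", "20/1/2012 2:48",
                            "15/1/2012 0:00", np.nan]})

# Unparseable strings become NaT instead of raising.
parsed = pd.to_datetime(df["date"], format="%d/%m/%Y %H:%M", errors="coerce")

# A row is an error only if it had a value originally but has none after parsing.
failed = parsed.isna() & df["date"].notna()

error_rows = df.loc[failed, ["date"]].rename(columns={"date": "error_date"})
print(error_rows)
```

The original NaN row is excluded because it was missing before parsing, not because parsing failed.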
The Error is caused by using something like 24:00.
Testing with (note the change in the second entry to 24:00):
df = pd.DataFrame({'date': ['6/3/2012 8:57','11/1/2012 24:00','20/1/2012 2:48','15/1/2012 0:00',np.nan]})
We get the same error as with your big dataframe. Going through the rows with a for loop may be a bit slower, but this way we can catch the errors.
wrong_datetime_list = []
for index, value in enumerate(df['date']):
    try:
        df.loc[index, 'date'] = pd.to_datetime(df.loc[index, 'date'])
    except Exception:
        wrong_datetime_list.append((index, value))

DataFrame does not allow Timestamps conversion for resampling

I have a CSV file with 12 million entries that I imported as a dataframe with pandas. It looks like this:
pair time open close
0 AUD/JPY 20170102 00:00:08.238 83.774002 84.626999
1 AUD/JPY 20170102 00:00:08.352 83.774002 84.626999
2 AUD/JPY 20170102 00:00:13.662 84.184998 84.324997
3 AUD/JPY 20170102 00:00:13.783 84.184998 84.324997
The time column is a string but I need a datetime object in order to downsample the dataframe and get OHLC values. The df.resample function requires datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex). I tried
df['time'] = pd.to_datetime(df['time'])
but this creates Timestamp, and for some reason I cannot convert the Timestamps into datetime object.
time = df['time'].dt.to_pydatetime()
df['time'] = time
This works creating a separate array and assigning the resulting list but as soon as I incorporate it into the dataframe it is converted back into Timestamps automatically. It does not work even creating a new dataframe with dtype = 'object' and then adding the datetime list as before.
A way around would be that of converting each row individually but given the size of the dataframe it would take ages. Any suggestions?
EDIT: with
time = pd.DataFrame(dtype = 'datetime64')
time = pd.to_datetime(df['time'])
time = time.dt.to_pydatetime()
new = pd.DataFrame({'pair': df['pair'],'time': pd.Series(time, dtype='object'), 'open': df['open'], 'close': df['close']}, dtype ='object')
I am now able to receive a datetime object when calling new['time'][0], however
new['time'].resample('5T')
still raises the error: "Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'"
EDIT: Ok, so apparently I just had to set the timestamp as index of the dataframe and then resample applies without issues.
Can you try:
import datetime as dt
df['time'] = pd.to_datetime(df['time'], format="%Y%m%d %H:%M:%S.%f")
df['timeconvert'] = df['time'].dt.date
Ok, so apparently I just had to set the timestamp as index of the dataframe and then resample applies without issues. There is no need to bother with timestamp conversion or anything else, thanks anyway for the reply.
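For later readers, the resolution (parse the strings, set them as the index, then resample) might look like the sketch below on the four sample rows. The format string is inferred from the sample shown in the question, the names ticks and bars are made up, and building the bars from the close column alone is an assumption made only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "pair": ["AUD/JPY"] * 4,
    "time": ["20170102 00:00:08.238", "20170102 00:00:08.352",
             "20170102 00:00:13.662", "20170102 00:00:13.783"],
    "open": [83.774002, 83.774002, 84.184998, 84.184998],
    "close": [84.626999, 84.626999, 84.324997, 84.324997],
})

# An explicit format avoids per-row format guessing on a large file.
df["time"] = pd.to_datetime(df["time"], format="%Y%m%d %H:%M:%S.%f")

# resample() needs a DatetimeIndex, so the parsed column becomes the index.
ticks = df.set_index("time")

# Five-minute OHLC bars computed from the close column.
bars = ticks["close"].resample("5min").ohlc()
print(bars)
```

The pandas Timestamp values stay as they are; no conversion to Python datetime objects is involved.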

Setting freq of pandas DatetimeIndex after DataFrame creation

I'm using pandas datareader to get stock data.
import pandas as pd
import pandas_datareader.data as web
ABB = web.DataReader(name='ABB.ST',
data_source='yahoo',
start='2000-1-1')
However by default freq is not set on the resulting dataframe.
I need freq to be able to navigate using the index like this:
for index, row in ABB.iterrows():
    ABB.loc[[index + 1]]
If freq is not set on the DatetimeIndex, I'm not able to use +1 etc. to navigate.
I have found two functions, astype and resample. Since I already know the freq, resample looks like overkill; I just want to set freq to daily.
Now my question is: how can I use astype on ABB to set freq to daily?
Try:
ABB = ABB.asfreq('d')
This should change the frequency to daily with NaN for days without data.
Also, you should rewrite your for-loop as follows:
for index, row in ABB.iterrows():
    print(ABB.loc[[index + pd.Timedelta(days = 1)]])
Thanks!
ABB is pandas DataFrame, whose index type is DatetimeIndex.
DatetimeIndex has freq attribute which can be set as below
ABB.index.freq = 'd'
Check out the change
ABB.index
If you need to change the frequency of the index, resample is for you, but then you need to aggregate the columns with a function such as mean or sum:
print (ABB.resample('d').mean())
print (ABB.resample('d').sum())
If you need to select another row, use iloc with get_loc to find the position of a value in the DatetimeIndex:
print (ABB.iloc[ABB.index.get_loc('2001-05-09') + 1])
Open 188.00
High 192.00
Low 187.00
Close 191.00
Volume 764200.00
Adj Close 184.31
Name: 2001-05-10 00:00:00, dtype: float64
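The two ideas from these answers (asfreq to get a regular daily index, and get_loc with iloc to step to the next available row) can be tried without downloading anything. The sketch below uses a small made-up price table on business days instead of the DataReader result; the values and the names prices, daily, pos and next_row are all invented for illustration.

```python
import pandas as pd

# Made-up closing prices on seven consecutive business days (no weekend rows).
idx = pd.bdate_range("2001-05-07", periods=7)
prices = pd.DataFrame(
    {"Close": [188.0, 189.5, 190.0, 191.0, 190.5, 192.0, 193.0]},
    index=idx,
)

# asfreq('D') builds a calendar-daily index; days without data become NaN rows.
daily = prices.asfreq("D")

# Positional navigation works regardless of freq: find the row, then step by one.
pos = prices.index.get_loc(pd.Timestamp("2001-05-09"))
next_row = prices.iloc[pos + 1]
print(next_row)
```

Stepping by position returns the next trading day that actually has data, whereas adding one calendar day to a label can land on a date that is missing from the original index.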
