I'm using pandas-datareader to get stock data.
import pandas as pd
import pandas_datareader.data as web
ABB = web.DataReader(name='ABB.ST',
                     data_source='yahoo',
                     start='2000-1-1')
However, by default freq is not set on the resulting DataFrame.
I need freq to be able to navigate using the index like this:
for index, row in ABB.iterrows():
    ABB.loc[[index + 1]]
If freq is not set on the DatetimeIndex, I'm not able to use + 1 etc. to navigate.
What I have found are two functions, astype and resample. Since I already know the freq, resample looks like overkill; I just want to set freq to daily.
Now my question is: how can I use astype on ABB to set freq to daily?
Try:
ABB = ABB.asfreq('d')
This should change the frequency to daily with NaN for days without data.
Also, you should rewrite your for-loop as follows:
for index, row in ABB.iterrows():
    print(ABB.loc[[index + pd.Timedelta(days=1)]])
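As a minimal sketch of what asfreq does, the snippet below uses a small hand-made price table instead of the Yahoo download. The dates, the Close values, and the names prices and daily are assumptions for illustration only. It shows freq going from None to daily, the missing day being filled with NaN, and the index's own freq being used to step forward one row.

```python
import pandas as pd

# Hypothetical stand-in for the downloaded prices (2000-01-06 is missing on purpose).
idx = pd.to_datetime(["2000-01-03", "2000-01-04", "2000-01-05", "2000-01-07"])
prices = pd.DataFrame({"Close": [10.0, 10.5, 10.2, 10.8]}, index=idx)

print(prices.index.freq)      # no frequency is attached to the raw index

# Reindex onto a full daily calendar; days without data become NaN rows.
daily = prices.asfreq("D")
print(daily.index.freq)
print(daily)

# With freq set, the offset itself can be used to move to the next row.
next_day = daily.index[0] + daily.index.freq
print(daily.loc[next_day])
```

Stepping past the final date still raises a KeyError, so a loop over every row has to stop one row early or handle that case.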
Thanks!
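ABB is a pandas DataFrame whose index is a DatetimeIndex.
DatetimeIndex has a freq attribute, which can be set as below (pandas raises a ValueError if the existing dates do not already conform to that frequency, for example when weekends are missing):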
ABB.index.freq = 'd'
Check out the change
ABB.index
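The following sketch illustrates when assigning to freq succeeds and when it does not. The two indexes (gappy and regular) are made-up examples, not the ABB data; the point is only that the assignment is validated against the dates already in the index.

```python
import pandas as pd

# Hypothetical index with a gap: 2000-01-05 and 2000-01-06 are missing.
gappy = pd.DatetimeIndex(["2000-01-03", "2000-01-04", "2000-01-07"])
try:
    gappy.freq = "D"
    assigned = True
except ValueError as exc:
    # The dates do not form an unbroken daily sequence, so pandas refuses.
    assigned = False
    print(exc)

# Hypothetical index that really is daily: the assignment is accepted.
regular = pd.DatetimeIndex(["2000-01-03", "2000-01-04", "2000-01-05"])
regular.freq = "D"
print(regular.freq)
```

For data with gaps, asfreq or resample is therefore the safer route, since both build a new index that actually has the requested frequency.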
If you need to change the frequency of the index, resample is for you, but then you need to aggregate the columns with a function such as mean or sum:
print(ABB.resample('d').mean())
print(ABB.resample('d').sum())
If you need to select another row, use iloc with get_loc to find the position of a value in the DatetimeIndex:
print(ABB.iloc[ABB.index.get_loc('2001-05-09') + 1])
Open 188.00
High 192.00
Low 187.00
Close 191.00
Volume 764200.00
Adj Close 184.31
Name: 2001-05-10 00:00:00, dtype: float64
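A small sketch of the same positional lookup, wrapped in a helper that also copes with the last row. The helper name next_row and the sample prices are assumptions for illustration, and the index is assumed to be unique so that get_loc returns a single integer.

```python
import pandas as pd

# Hypothetical sample; only the dates matter for the lookup.
idx = pd.to_datetime(["2001-05-08", "2001-05-09", "2001-05-10"])
prices = pd.DataFrame({"Close": [189.0, 190.0, 191.0]}, index=idx)


def next_row(frame, date):
    """Return the row that follows `date` in index order, or None at the end."""
    # For a unique index, get_loc with a Timestamp returns an integer position.
    pos = frame.index.get_loc(pd.Timestamp(date))
    if pos + 1 >= len(frame):
        return None
    return frame.iloc[pos + 1]


print(next_row(prices, "2001-05-09"))
print(next_row(prices, "2001-05-10"))
```

Because the lookup is positional, it works whether or not freq is set on the index.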
I have a dataframe with various attributes, including one datetime column. I want to extract one of the attribute columns as a time series indexed by the datetime column. This seemed pretty straightforward, and I can construct time series with random values, as all the pandas docs show, but when I do so from a dataframe, my attribute values all convert to NaN.
Here's an analogous example.
df = pd.DataFrame({'a': [0,1], 'date':[pd.to_datetime('2017-04-01'),
                                       pd.to_datetime('2017-04-02')]})
s = pd.Series(df.a, index=df.date)
In this case, the series will have the correct time series index, but all the values will be NaN.
I can do the series in two steps, as below, but I don't understand why this should be required.
s = pd.Series(df.a)
s.index = df.date
What am I missing? I assume it has to do with series references, but don't understand at all why the values would go to NaN.
I am also able to get it to work by copying the index column.
s = pd.Series(df.a, df.date.copy())
The problem is that when pd.Series() is given an existing Series together with an index, it does not relabel the values; it reindexes them. It looks up each entry of index (the dates) among the existing labels of df.a (0 and 1), finds no match, and fills in NaN.
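The alignment behaviour can be seen in a short sketch that reuses the question's dataframe. The variable names aligned and from_values are made up for illustration; passing the raw values via to_numpy() is one way, among others, to avoid the label lookup.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [0, 1],
    "date": pd.to_datetime(["2017-04-01", "2017-04-02"]),
})

# Passing a Series plus an index reindexes: the dates are looked up among
# the existing labels 0 and 1, nothing matches, so every value is NaN.
aligned = pd.Series(df["a"], index=df["date"])
print(aligned)

# Passing the bare values skips the lookup and simply attaches the dates.
from_values = pd.Series(df["a"].to_numpy(), index=df["date"], name="a")
print(from_values)
```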
You can set the index to the date column and then select the one data column you want. This will return a Series with the dates as the index:
import pandas as pd
df = pd.DataFrame({'a': [0,1], 'date':[pd.to_datetime('2017-04-01'),
                                       pd.to_datetime('2017-04-02')]})
s = df.set_index('date')['a']
Examining s gives:
In [1]: s
Out[1]:
date
2017-04-01 0
2017-04-02 1
Name: a, dtype: int64
And you can confirm that s is a Series:
In [2]: isinstance(s, pd.Series)
Out[2]: True
I have a dataframe like the one shown below
df = pd.DataFrame({'date': ['45:42.7','11/1/2012 0:00','20/1/2012 2:48','15/1/2012 0:00',np.nan]})
I would like to convert the date column to type datetime.
So, I tried the below
df['date'] = pd.to_datetime(df['date'])
I get the below error
ValueError: hour must be in 0..23
As we can see from the sample dataframe, the NA is not causing this error; the first record, 45:42.7, is.
The raw Excel file displays this date value as 45:42.7 when I open the file, but when I double-click the cell it correctly displays the actual date.
How can I filter the dataframe to pull out the first record (the one causing the error) as output?
I expect my output to be like shown in sample dataframe below
df = pd.DataFrame({'error_date': ['45:42.7']})
First, if you need to see the wrong values, convert to datetimes with errors='coerce' and filter the rows that come back missing (this also lists rows that were already missing in the input):
print(df[pd.to_datetime(df['date'], format='%d/%m/%Y %H:%M', errors='coerce').isna()])
I think the missing value is not the problem. You need to specify the column format; rows that do not match it become NaT if you add the errors='coerce' parameter:
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y %H:%M', errors='coerce')
print (df)
date
0 2012-03-06 08:57:00
1 2012-01-11 00:00:00
2 2012-01-20 02:48:00
3 2012-01-15 00:00:00
4 NaT
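To get exactly the output the question asks for, the coerced result can be combined with a check that the original cell was not empty. This is a sketch on the question's sample data; the names parsed, bad and errors are assumptions, and the format string is the one used above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"date": ["45:42.7", "11/1/2012 0:00", "20/1/2012 2:48",
                            "15/1/2012 0:00", np.nan]})

# Unparseable strings and missing cells both become NaT here.
parsed = pd.to_datetime(df["date"], format="%d/%m/%Y %H:%M", errors="coerce")

# Keep only rows that had a value but failed to parse.
bad = df.loc[parsed.isna() & df["date"].notna(), "date"]
errors = bad.rename("error_date").to_frame()
print(errors)
```

The original row labels are kept in errors, which makes it easy to find the offending cells in the source file.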
The Error is caused by using something like 24:00.
Testing with (note the change in the second entry to 24:00):
df = pd.DataFrame({'date': ['6/3/2012 8:57','11/1/2012 24:00','20/1/2012 2:48','15/1/2012 0:00',np.nan]})
We receive the same error as in your big dataframe. Going through it with a for loop may be a bit slower, but this way we can catch the errors.
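wrong_datetime_list = []
for index, value in enumerate(df['date']):
    try:
        df.loc[index, 'date'] = pd.to_datetime(df.loc[index, 'date'])
    except (ValueError, TypeError):
        # remember the position and raw value of every entry that fails to parse
        wrong_datetime_list.append((index, value))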
I have a dataframe with just two columns, Date, and ClosingPrice. I am trying to plot them using df.plot() but keep getting this error:
ValueError: view limit minimum -36785.37852 is less than 1 and is an invalid Matplotlib date value. This often happens if you pass a non-datetime value to an axis that has datetime units
I have found documentation about this from matplotlib but that says how to make sure that the format is datetime. Here is code that I have to make sure the format is datetime and also printing the data type for each column before attempting to plot.
df.Date = pd.to_datetime(df.Date)
print(df['ClosingPrice'].dtypes)
print(df['Date'].dtypes)
The output of these print statements is:
float64
datetime64[ns]
I am not sure what the problem is since I am verifying the data type before plotting. Here is also what the first few rows of the data set look like:
         Date  ClosingPrice
0  2013-09-10       64.7010
1  2013-09-11       61.1784
2  2013-09-12       61.8298
3  2013-09-13       60.8108
4  2013-09-16       58.8776
5  2013-09-17       59.5577
6  2013-09-18       60.7821
7  2013-09-19       61.7788
Any help is appreciated.
EDIT 2, after seeing more people ending up here: to be clear for people new to Python, you should first import pandas for the code below to work:
import pandas as pd
EDIT 1: (short quick answer)
If you don't want to drop your original index (this makes sense after reading the original and long answer below) you could:
df[['Date','ClosingPrice']].plot('Date', figsize=(15,8))
Original and long answer:
Try setting your index as your Datetime column first:
df.set_index('Date', inplace=True, drop=True)
Just to be sure, try setting the index dtype (edit: this probably won't be needed as you did it previously):
df.index = pd.to_datetime(df.index)
And then plot it
df.plot()
If this solves the issue, it's because when you use .plot() on a DataFrame object, the x axis is automatically the DataFrame's index.
If your DataFrame had a DatetimeIndex and 2 other columns (say ['Currency','pct_change_1']) and you wanted to plot just one of them (maybe pct_change_1) you could:
# single [ ] transforms the column into series, double [[ ]] into DataFrame
df[['pct_change_1']].plot(figsize=(15,8))
Here figsize=(15,8) sets the size of the plot as (width, height) in inches.
Here is a simple solution:
my_dict = {'Date':['2013-09-10', '2013-09-11', '2013-09-12', '2013-09-13', '2013-09-16', '2013-09-17', '2013-09-18',
'2013-09-19'], 'ClosingPrice': [ 64.7010, 61.1784, 61.8298, 60.8108, 58.8776, 59.5577, 60.7821, 61.7788]}
df = pd.DataFrame(my_dict)
df.set_index('Date', inplace=True)
df.plot()
I am working with some financial data that is organized as a df with a MultiIndex that contains the ticker and the date, and a column that contains the return. I am wondering whether one should convert the index to a PeriodIndex instead of a DatetimeIndex, since returns are really over a period rather than an instant in time. Besides the philosophical argument, what practical functionality does PeriodIndex provide that may be useful in this particular use case vs DatetimeIndex?
There are some functions available on DatetimeIndex (such as is_month_start, is_quarter_end) which are not available on PeriodIndex. I use PeriodIndex when it is not possible to get the format I need with DatetimeIndex. For example, if I need a monthly frequency in the format yyyy-mm, I use the PeriodIndex.
Example:
Assume that df has an index as
df.index
'2020-02-26 13:50:00', '2020-02-27 14:20:00',
'2020-02-28 11:10:00', '2020-02-29 13:50:00'],
dtype='datetime64[ns]', name='peak_time', length=1025, freq=None)
The monthly minimum can be obtained via the following code
dfg = df.groupby([df.index.year, df.index.month]).min()
whose index is a MultiIndex
dfg.index
MultiIndex([(2017, 1),
...
(2020, 1),
(2020, 2)],
names=['peak_time', 'peak_time'])
Now I convert it to a PeriodIndex:
dfg["date"] = pd.PeriodIndex (dfg.index.map(lambda x: "{0}{1:02d}".format(*x)),freq="M")
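As an alternative sketch, the year/month MultiIndex step can be skipped by grouping directly on the index converted with to_period("M"). The sample timestamps, the peak column and the name monthly_min are made up for illustration; this is not the answer's own data.

```python
import pandas as pd

# Hypothetical irregular timestamps spread over two months.
idx = pd.to_datetime(["2020-01-15 13:50", "2020-01-20 09:00",
                      "2020-02-26 13:50", "2020-02-29 11:10"])
df = pd.DataFrame({"peak": [3.0, 1.0, 5.0, 2.0]}, index=idx)

# Group by calendar month; the result is indexed by a monthly PeriodIndex.
monthly_min = df.groupby(df.index.to_period("M")).min()
print(monthly_min)
print(monthly_min.index)
```

The resulting labels print as yyyy-mm, which is the format mentioned above.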
For me, the advantage of PeriodIndex is that downsampled results are automatically displayed as the corresponding month, quarter, or year.
import pandas as pd
# https://github.com/jiahe224/bug_report/blob/main/resample_test.csv
temp = pd.read_csv('resample_test.csv',dtype={'stockcode':str, 'A股代码':str})
temp['date'] = pd.to_datetime(temp['date'])
temp = temp.set_index(['date'])
result = temp['北向占自由流通比'].resample('Q',closed='left').first()
result
result = temp['北向占自由流通比'].resample('Q',closed='left').first().to_period()
result
Off topic: there is a problem with resample that has not been fixed yet; see the bug report at https://github.com/pandas-dev/pandas/issues/45869
Behavior on partial periods.
When the specified start and end do not cover a whole period, date_range returns an empty index, while period_range returns an index of length 1.
(Also, the timezone information is lost for monthly periods.)
date_range:
dates = pd.core.indexes.datetimes.date_range("2022-12-01 0:00:00+00:00", "2022-12-05 14:26:00+00:00", inclusive="both", freq="1M")
dates
DatetimeIndex([], dtype='datetime64[ns, UTC]', freq='M')
period_range:
periods = pd.core.indexes.period.period_range("2022-12-01 0:00:00+00:00", "2022-12-05 14:26:00+00:00", freq="1M")
periods
PeriodIndex(['2022-12'], dtype='period[M]')
I am using pandas to deal with monthly data that have some missing values. I would like to be able to use the resample method to compute annual statistics, but only for years with no missing data.
Here is some code and output to demonstrate :
import pandas as pd
import numpy as np
dates = pd.date_range(start = '1980-01', periods = 24,freq='M')
df = pd.DataFrame([np.nan] * 10 + list(range(14)), index=dates)
Here is what I obtain if I resample :
In [18]: df.resample('A')
Out[18]:
0
1980-12-31 0.5
1981-12-31 7.5
I would like to have np.nan for the 1980-12-31 index, since that year does not have a value for every month. I tried to play with the 'how' argument but had no luck.
How can I accomplish this?
I'm sure there's a better way, but in this case you can use:
df.resample('A', how=[np.mean, pd.Series.count, len])
and then drop all rows where count != len
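Recent pandas releases no longer accept the how argument of resample, so here is a hedged sketch of the same count-versus-length idea. It swaps resample for a plain groupby on the calendar year, names the column value, and builds the month-end dates with an explicit offset object; all three are choices made for this illustration rather than part of the original answer.

```python
import numpy as np
import pandas as pd

# 24 month-end dates starting January 1980; the first 10 values are missing.
dates = pd.date_range(start="1980-01-31", periods=24, freq=pd.offsets.MonthEnd())
df = pd.DataFrame({"value": [np.nan] * 10 + list(range(14))}, index=dates)

# Per year: the mean, the number of non-missing values, and the number of rows.
yearly = df["value"].groupby(df.index.year).agg(["mean", "count", "size"])

# A year is complete when every row has a value; mask the mean otherwise.
complete = yearly["count"] == yearly["size"]
annual_mean = yearly["mean"].where(complete)
print(annual_mean)
```

Here 1980 comes out as NaN because only 2 of its 12 months have data, while 1981 keeps its mean of 7.5.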