Preface: I'm newish, but have searched for hours here and in the pandas documentation without success. I've also read Wes's book.
I am modeling stock market data for a hedge fund and have a simple MultiIndexed DataFrame with tickers, dates (daily), and fields. The sample here is from Bloomberg: 3 months (Dec. 2016 through Feb. 2017) and 3 tickers (AAPL, IBM, MSFT).
import numpy as np
import pandas as pd
import os
# get data from Excel
curr_directory = os.getcwd()
filename = 'Sample Data File.xlsx'
filepath = os.path.join(curr_directory, filename)
df = pd.read_excel(filepath, sheet_name='Sheet1', index_col=[0, 1], usecols='A:D')
# sort
df.sort_index(inplace=True)
# sample of the data
df.head(15)
Out[4]:
PX_LAST PX_VOLUME
Security Name date
AAPL US Equity 2016-12-01 109.49 37086862
2016-12-02 109.90 26527997
2016-12-05 109.11 34324540
2016-12-06 109.95 26195462
2016-12-07 111.03 29998719
2016-12-08 112.12 27068316
2016-12-09 113.95 34402627
2016-12-12 113.30 26374377
2016-12-13 115.19 43733811
2016-12-14 115.19 34031834
2016-12-15 115.82 46524544
2016-12-16 115.97 44351134
2016-12-19 116.64 27779423
2016-12-20 116.95 21424965
2016-12-21 117.06 23783165
df.tail(15)
Out[5]:
PX_LAST PX_VOLUME
Security Name date
MSFT US Equity 2017-02-07 63.43 20277226
2017-02-08 63.34 18096358
2017-02-09 64.06 22644443
2017-02-10 64.00 18170729
2017-02-13 64.72 22920101
2017-02-14 64.57 23108426
2017-02-15 64.53 17005157
2017-02-16 64.52 20546345
2017-02-17 64.62 21248818
2017-02-21 64.49 20655869
2017-02-22 64.36 19292651
2017-02-23 64.62 20273128
2017-02-24 64.62 21796800
2017-02-27 64.23 15871507
2017-02-28 63.98 23239825
When I calculate daily price changes, it seems to work; only the first day is NaN, as it should be:
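For reference, a per-ticker daily change like the px_change_% column below can be computed with groupby plus pct_change; this is a sketch with made-up numbers, since the original calculation isn't shown:

```python
import pandas as pd

# toy frame mimicking the ticker/date MultiIndex (prices are illustrative)
idx = pd.MultiIndex.from_product(
    [['AAPL US Equity', 'IBM US Equity'],
     pd.to_datetime(['2016-12-01', '2016-12-02', '2016-12-05'])],
    names=['Security Name', 'date'])
df = pd.DataFrame({'PX_LAST': [109.49, 109.90, 109.11,
                               160.02, 160.35, 159.84]}, index=idx)

# grouping by the ticker level restarts the change at each ticker,
# so each ticker's first day is NaN
df['px_change_%'] = df.groupby(level=0)['PX_LAST'].pct_change()
```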
df.head(5)
Out[7]:
PX_LAST PX_VOLUME px_change_%
Security Name date
AAPL US Equity 2016-12-01 109.49 37086862 NaN
2016-12-02 109.90 26527997 0.003745
2016-12-05 109.11 34324540 -0.007188
2016-12-06 109.95 26195462 0.007699
2016-12-07 111.03 29998719 0.009823
But the daily 30-day volume doesn't. It should be NaN only for the first 29 days of each ticker, but it is NaN everywhere:
# daily change from 30 day volume - doesn't work
df['30_day_volume'] = df.groupby(level=0,group_keys=True)['PX_VOLUME'].rolling(window=30).mean()
df['volume_change_%'] = (df['PX_VOLUME'] - df['30_day_volume']) / df['30_day_volume']
df.iloc[:,3:].tail(40)
Out[12]:
30_day_volume volume_change_%
Security Name date
MSFT US Equity 2016-12-30 NaN NaN
2017-01-03 NaN NaN
2017-01-04 NaN NaN
2017-01-05 NaN NaN
2017-01-06 NaN NaN
2017-01-09 NaN NaN
2017-01-10 NaN NaN
2017-01-11 NaN NaN
2017-01-12 NaN NaN
2017-01-13 NaN NaN
2017-01-17 NaN NaN
2017-01-18 NaN NaN
2017-01-19 NaN NaN
2017-01-20 NaN NaN
2017-01-23 NaN NaN
2017-01-24 NaN NaN
2017-01-25 NaN NaN
2017-01-26 NaN NaN
2017-01-27 NaN NaN
2017-01-30 NaN NaN
2017-01-31 NaN NaN
2017-02-01 NaN NaN
2017-02-02 NaN NaN
2017-02-03 NaN NaN
2017-02-06 NaN NaN
2017-02-07 NaN NaN
2017-02-08 NaN NaN
2017-02-09 NaN NaN
2017-02-10 NaN NaN
2017-02-13 NaN NaN
2017-02-14 NaN NaN
2017-02-15 NaN NaN
2017-02-16 NaN NaN
2017-02-17 NaN NaN
2017-02-21 NaN NaN
2017-02-22 NaN NaN
2017-02-23 NaN NaN
2017-02-24 NaN NaN
2017-02-27 NaN NaN
2017-02-28 NaN NaN
As pandas seems to have been designed specifically for finance, I'm surprised this isn't more straightforward.
Edit: I've tried some other approaches as well.
I tried converting the data into a Panel (3D), but didn't find any built-in window functions there except by converting back to a DataFrame, so no advantage.
I also tried to create a pivot table, but couldn't find a way to reference just the first level of the MultiIndex; df.index.levels[0] and ...levels[1] weren't working.
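As an aside, the per-row labels of a MultiIndex level come from Index.get_level_values rather than df.index.levels (which only lists the unique categories); a minimal sketch with made-up prices:

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [['AAPL US Equity', 'MSFT US Equity'],
     pd.to_datetime(['2016-12-01', '2016-12-02'])],
    names=['Security Name', 'date'])
df = pd.DataFrame({'PX_LAST': [109.49, 109.90, 59.20, 59.25]}, index=idx)

# one label per row, usable as a pivot_table or groupby key
tickers = df.index.get_level_values(0)
print(tickers.unique().tolist())  # ['AAPL US Equity', 'MSFT US Equity']
```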
Thanks!
Can you try the following to see if it works?
df['30_day_volume'] = df.groupby(level=0)['PX_VOLUME'].rolling(window=30).mean().values
df['volume_change_%'] = (df['PX_VOLUME'] - df['30_day_volume']) / df['30_day_volume']
I can verify Allen's answer works with pandas_datareader as well; the only change needed is the groupby index level, since the datareader's MultiIndex puts the ticker in a different level.
import pandas_datareader.data as web
import datetime
start = datetime.datetime(2016, 12, 1)
end = datetime.datetime(2017, 2, 28)
data = web.DataReader(['AAPL', 'IBM', 'MSFT'], 'yahoo', start, end).to_frame()
data['30_day_volume'] = data.groupby(level=1).rolling(window=30)['Volume'].mean().values
data['volume_change_%'] = (data['Volume'] - data['30_day_volume']) / data['30_day_volume']
# double-check that it computed starting at 30 trading days.
data.loc['2017-1-17':'2017-1-30']
The original poster might try editing this line:
df['30_day_volume'] = df.groupby(level=0,group_keys=True)['PX_VOLUME'].rolling(window=30).mean()
to the following, using mean().values:
df['30_day_volume'] = df.groupby(level=0,group_keys=True)['PX_VOLUME'].rolling(window=30).mean().values
Without .values, the data don't get properly aligned, resulting in NaNs.
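The misalignment comes from groupby(...).rolling(...) returning a Series whose index has an extra outer level (the group key), so label-based assignment back into df matches nothing. Dropping that level is an alternative to .values that keeps alignment by label; a small sketch with toy data:

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product(
    [['A', 'B'], pd.date_range('2016-12-01', periods=4)],
    names=['Security Name', 'date'])
df = pd.DataFrame({'PX_VOLUME': np.arange(8, dtype=float)}, index=idx)

rolled = df.groupby(level=0)['PX_VOLUME'].rolling(window=2).mean()
print(rolled.index.nlevels)  # 3 -- group key plus the original two levels

# drop the duplicated group level so the indexes line up again
df['2_day_volume'] = rolled.droplevel(0)
```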
Related
I am trying to reindex the dates in pandas. This is because there are dates which are missing, such as weekends or national holidays.
To do this I am using the following code:
import pandas as pd
import yfinance as yf
import datetime
start = datetime.date(2015,1,1)
end = datetime.date.today()
df = yf.download('F', start, end, interval ='1d', progress = False)
df.index = df.index.strftime('%Y-%m-%d')
full_dates = pd.date_range(start, end)
df.reindex(full_dates)
This code is producing this dataframe:
Open High Low Close Adj Close Volume
2015-01-01 NaN NaN NaN NaN NaN NaN
2015-01-02 NaN NaN NaN NaN NaN NaN
2015-01-03 NaN NaN NaN NaN NaN NaN
2015-01-04 NaN NaN NaN NaN NaN NaN
2015-01-05 NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ...
2023-01-13 NaN NaN NaN NaN NaN NaN
2023-01-14 NaN NaN NaN NaN NaN NaN
2023-01-15 NaN NaN NaN NaN NaN NaN
2023-01-16 NaN NaN NaN NaN NaN NaN
2023-01-17 NaN NaN NaN NaN NaN NaN
Could you please advise why it is not reindexing the data and is showing NaN values instead?
===Edit ===
Could it be a Python version issue? I ran the same code in Python 3.7 and 3.10.
In both Python 3.7 and 3.10, the index returned by yf.download('F', start, end, interval='1d', progress=False) is a DatetimeIndex before the strftime call (screenshots omitted).
Remove the conversion of the DatetimeIndex to strings (df.index = df.index.strftime('%Y-%m-%d')) so that you can reindex by datetimes:
df = yf.download('F', start, end, interval ='1d', progress = False)
full_dates = pd.date_range(start, end)
df = df.reindex(full_dates)
print (df)
Open High Low Close Adj Close Volume
2015-01-01 NaN NaN NaN NaN NaN NaN
2015-01-02 15.59 15.65 15.18 15.36 10.830517 24777900.0
2015-01-03 NaN NaN NaN NaN NaN NaN
2015-01-04 NaN NaN NaN NaN NaN NaN
2015-01-05 15.12 15.13 14.69 14.76 10.407450 44079700.0
... ... ... ... ... ...
2023-01-13 12.63 12.82 12.47 12.72 12.720000 96317800.0
2023-01-14 NaN NaN NaN NaN NaN NaN
2023-01-15 NaN NaN NaN NaN NaN NaN
2023-01-16 NaN NaN NaN NaN NaN NaN
2023-01-17 NaN NaN NaN NaN NaN NaN
[2939 rows x 6 columns]
print (df.index)
DatetimeIndex(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04',
'2015-01-05', '2015-01-06', '2015-01-07', '2015-01-08',
'2015-01-09', '2015-01-10',
...
'2023-01-08', '2023-01-09', '2023-01-10', '2023-01-11',
'2023-01-12', '2023-01-13', '2023-01-14', '2023-01-15',
'2023-01-16', '2023-01-17'],
dtype='datetime64[ns]', length=2939, freq='D')
EDIT: There is a timezone difference; to remove it, use DatetimeIndex.tz_convert:
df = yf.download('F', start, end, interval ='1d', progress = False)
df.index= df.index.tz_convert(None)
full_dates = pd.date_range(start, end)
df = df.reindex(full_dates)
print (df)
You need to use strings in reindex to keep a homogeneous type; otherwise pandas doesn't match the string (e.g., '2015-01-02') with the Timestamp (e.g., pd.Timestamp('2015-01-02')):
df.reindex(full_dates.astype(str))
#or
df.reindex(full_dates.strftime('%Y-%m-%d'))
Output:
Open High Low Close Adj Close Volume
2015-01-01 NaN NaN NaN NaN NaN NaN
2015-01-02 15.59 15.65 15.18 15.36 10.830517 24777900.0
2015-01-03 NaN NaN NaN NaN NaN NaN
2015-01-04 NaN NaN NaN NaN NaN NaN
2015-01-05 15.12 15.13 14.69 14.76 10.407451 44079700.0
... ... ... ... ... ... ...
2023-01-13 12.63 12.82 12.47 12.72 12.720000 96317800.0
2023-01-14 NaN NaN NaN NaN NaN NaN
2023-01-15 NaN NaN NaN NaN NaN NaN
2023-01-16 NaN NaN NaN NaN NaN NaN
2023-01-17 NaN NaN NaN NaN NaN NaN
[2939 rows x 6 columns]
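To see the underlying rule, reindex matches labels by equality, and a string label never equals a Timestamp, so mixing the two yields all-NaN rows; a minimal sketch:

```python
import pandas as pd

s = pd.Series([1.0], index=['2015-01-02'])           # string labels
dates = pd.date_range('2015-01-01', '2015-01-03')    # Timestamp labels

# Timestamps never equal strings, so nothing matches
print(s.reindex(dates).isna().all())  # True

# with string labels on both sides, the value is kept
print(s.reindex(dates.strftime('%Y-%m-%d')))
```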
I have a dataframe named zz.
zz has the column names ['Ancolmekar','Cidurian','Dayeuhkolot','Hantap','Kertasari','Meteolembang','Sapan'].
for col in zz.columns:
df = pd.DataFrame(zz[col],index=pd.date_range('2017-01-01 00:00:00', '2021-12-31 23:50:00', freq='10T'))
df.resample('1M').mean()
error: invalid syntax
I want the monthly mean of data recorded at a 10-minute interval. When I run this, only the Sapan values appear, and they are all NaN. Beforehand, I had replaced the NaN data with 1, else 0.
Sapan
2017-01-31 NaN
2017-02-28 NaN
2017-03-31 NaN
2017-04-30 NaN
2017-05-31 NaN
2017-06-30 NaN
2017-07-31 NaN
2017-08-31 NaN
2017-09-30 NaN
2017-10-31 NaN
2017-11-30 NaN
2017-12-31 NaN
2018-01-31 NaN
2018-02-28 NaN
2018-03-31 NaN
2018-04-30 NaN
2018-05-31 NaN
2018-06-30 NaN
2018-07-31 NaN
2018-08-31 NaN
2018-09-30 NaN
2018-10-31 NaN
2018-11-30 NaN
2018-12-31 NaN
2019-01-31 NaN
2019-02-28 NaN
2019-03-31 NaN
2019-04-30 NaN
2019-05-31 NaN
2019-06-30 NaN
2019-07-31 NaN
2019-08-31 NaN
2019-09-30 NaN
2019-10-31 NaN
2019-11-30 NaN
2019-12-31 NaN
2020-01-31 NaN
2020-02-29 NaN
2020-03-31 NaN
2020-04-30 NaN
2020-05-31 NaN
2020-06-30 NaN
2020-07-31 NaN
2020-08-31 NaN
2020-09-30 NaN
2020-10-31 NaN
2020-11-30 NaN
2020-12-31 NaN
2021-01-31 NaN
2021-02-28 NaN
2021-03-31 NaN
2021-04-30 NaN
2021-05-31 NaN
2021-06-30 NaN
2021-07-31 NaN
2021-08-31 NaN
2021-09-30 NaN
2021-10-31 NaN
2021-11-30 NaN
2021-12-31 NaN
What should I do? Thanks in advance.
You are re-assigning the variable df to a dataframe with a single column during each pass through the for loop. The last column is Sapan; hence, only this column is shown.
Additionally, you are setting an index on df that probably doesn't match the index in zz, so you get NaN (Not a Number) for non-existing values.
If the index in zz is corresponding to the one you are setting, this should work:
df = zz.copy()
df['new_column'] = pd.Series(pd.date_range('2017-01-01 00:00:00', '2021-12-31 23:50:00', freq='10T'))
df = df.set_index('new_column')
df.resample('1M').mean()
I'm trying to use the usual times I take my medication (each dose lasting 4 hours) to fill a DataFrame column with a label of 2, 1, or 0: 1 for when I am on the medication, 2 for the hour after the medication (just being off of it), and 0 otherwise.
As an example of the dataframe I am trying to add this column too,
id  sentiment  magnitude angry disgusted fearful \
created
2020-05-21 12:00:00 23.0 -0.033333 0.5 NaN NaN NaN
2020-05-21 12:15:00 NaN NaN NaN NaN NaN NaN
2020-05-21 12:30:00 NaN NaN NaN NaN NaN NaN
2020-05-21 12:45:00 NaN NaN NaN NaN NaN NaN
2020-05-21 13:00:00 NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ...
2021-04-20 00:45:00 NaN NaN NaN NaN NaN NaN
2021-04-20 01:00:00 NaN NaN NaN NaN NaN NaN
2021-04-20 01:15:00 NaN NaN NaN NaN NaN NaN
2021-04-20 01:30:00 NaN NaN NaN NaN NaN NaN
2021-04-20 01:45:00 46022.0 -1.000000 1.0 NaN NaN NaN
happy neutral sad surprised
created
2020-05-21 12:00:00 NaN NaN NaN NaN
2020-05-21 12:15:00 NaN NaN NaN NaN
2020-05-21 12:30:00 NaN NaN NaN NaN
2020-05-21 12:45:00 NaN NaN NaN NaN
2020-05-21 13:00:00 NaN NaN NaN NaN
... ... ... ... ...
2021-04-20 00:45:00 NaN NaN NaN NaN
2021-04-20 01:00:00 NaN NaN NaN NaN
2021-04-20 01:15:00 NaN NaN NaN NaN
2021-04-20 01:30:00 NaN NaN NaN NaN
2021-04-20 01:45:00 NaN NaN NaN NaN
[32024 rows x 10 columns]
And the data for the timestamps for when i usually take my medication,
['09:00 AM', '12:00 PM', '03:00 PM']
How would I use those time stamps to get this sort of column information?
Update
So, building on the question: how would I make sure the medication label is only added where there is data available, and that the one-hour after-medication window is applied correctly?
Thanks
Use np.select() to choose the appropriate label for a given condition.
First dropna() if all values after created are null (subset=df.columns[1:]). You can change the subset depending on your needs (e.g., subset=['id'] if rows should be dropped just for having a null id).
Then generate datetime arrays for taken-, active-, and after-medication periods based on the duration of the medication. Check whether the created times match any of the times in active (label 1) or after (label 2), otherwise default to 0.
# drop rows that are empty except for column 0 (i.e., except for df.created)
df.dropna(subset=df.columns[1:], inplace=True)
# convert times to datetime
df.created = pd.to_datetime(df.created)
taken = pd.to_datetime(['09:00:00', '12:00:00', '15:00:00'])
# generate time arrays
duration = 2 # hours
active = np.array([(taken + pd.Timedelta(f'{h}H')).time for h in range(duration)]).ravel()
after = (taken + pd.Timedelta(f'{duration}H')).time
# define boolean masks by label
conditions = {
1: df.created.dt.floor('H').dt.time.isin(active),
2: df.created.dt.floor('H').dt.time.isin(after),
}
# create medication column with np.select()
df['medication'] = np.select(conditions.values(), conditions.keys(), default=0)
Here is the output with some slightly modified data that better demonstrates the active / after / nan scenarios:
created id sentiment magnitude medication
0 2020-05-21 12:00:00 23.0 -0.033333 0.5 1
3 2020-05-21 12:45:00 39.0 -0.500000 0.5 1
4 2020-05-21 13:00:00 90.0 -0.500000 0.5 1
5 2020-05-21 13:15:00 100.0 -0.033333 0.1 1
9 2020-05-21 14:15:00 1000.0 0.033333 0.5 2
10 2020-05-21 14:30:00 3.0 0.001000 1.0 2
17 2021-04-20 01:00:00 46022.0 -1.000000 1.0 0
20 2021-04-20 01:45:00 46022.0 -1.000000 1.0 0
How can I create empty rows, going back from 7 days before 2016-01-01 to January 2015? I tried reindexing:
df
date value
0 2016-01-01 4.0
1 2016-01-08 5.0
2 2016-01-15 1.0
Expected Output
date value
2015-01-02 NaN
....
2015-12-25 NaN
2016-01-01 4.0
2016-01-08 5.0
2016-01-15 1.0
First create DatetimeIndex:
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
Then use DataFrame.reindex with a date_range starting from your minimal value, combined with the original index via Index.union to avoid losing the original index values:
rng = pd.date_range('2015-01-02', df.index.min(), freq='7d').union(df.index)
df = df.reindex(rng)
print (df)
value
2015-01-02 NaN
2015-01-09 NaN
2015-01-16 NaN
2015-01-23 NaN
2015-01-30 NaN
2015-02-06 NaN
2015-02-13 NaN
2015-02-20 NaN
2015-02-27 NaN
2015-03-06 NaN
2015-03-13 NaN
2015-03-20 NaN
2015-03-27 NaN
2015-04-03 NaN
2015-04-10 NaN
2015-04-17 NaN
2015-04-24 NaN
2015-05-01 NaN
2015-05-08 NaN
2015-05-15 NaN
2015-05-22 NaN
2015-05-29 NaN
2015-06-05 NaN
2015-06-12 NaN
2015-06-19 NaN
2015-06-26 NaN
2015-07-03 NaN
2015-07-10 NaN
2015-07-17 NaN
2015-07-24 NaN
2015-07-31 NaN
2015-08-07 NaN
2015-08-14 NaN
2015-08-21 NaN
2015-08-28 NaN
2015-09-04 NaN
2015-09-11 NaN
2015-09-18 NaN
2015-09-25 NaN
2015-10-02 NaN
2015-10-09 NaN
2015-10-16 NaN
2015-10-23 NaN
2015-10-30 NaN
2015-11-06 NaN
2015-11-13 NaN
2015-11-20 NaN
2015-11-27 NaN
2015-12-04 NaN
2015-12-11 NaN
2015-12-18 NaN
2015-12-25 NaN
2016-01-01 4.0
2016-01-08 5.0
2016-01-15 1.0
I have a time series that I want to lag and predict on, for future data one year ahead; it looks like:
Date Energy Pred Energy Lag Error
.
2017-09-01 9 8.4
2017-10-01 10 9
2017-11-01 11 10
2017-12-01 12 11.5
2018-01-01 1 1.3
NaT (pred-true)
NaT
NaT
NaT
.
.
All I want to do is impute dates into the NaT entries, continuing from 2018-01-01 to 2019-01-01 (just fill them down, like drag-and-fill in Excel), because there are enough NaT positions to reach that point.
I've tried model['Date'].fillna() with various methods, and it either just repeats the same previous date or drops things I don't want to drop.
Is there any way to fill these NaTs with 1-month increments, continuing like the previous data?
Make the df and set the index (there are better ways to set the index):
"""
Date,Energy,Pred Energy,Lag Error
2017-09-01,9,8.4
2017-10-01,10,9
2017-11-01,11,10
2017-12-01,12,11.5
2018-01-01,1,1.3
"""
import pandas as pd
df = pd.read_clipboard(sep=",", parse_dates=True)
df.set_index(pd.DatetimeIndex(df['Date']), inplace=True)
df.drop("Date", axis=1, inplace=True)
df
Reindex to a new date_range:
idx = pd.date_range(start='2017-09-01', end='2019-01-01', freq='MS')
df = df.reindex(idx)
Output:
Energy Pred Energy Lag Error
2017-09-01 9.0 8.4 NaN
2017-10-01 10.0 9.0 NaN
2017-11-01 11.0 10.0 NaN
2017-12-01 12.0 11.5 NaN
2018-01-01 1.0 1.3 NaN
2018-02-01 NaN NaN NaN
2018-03-01 NaN NaN NaN
2018-04-01 NaN NaN NaN
2018-05-01 NaN NaN NaN
2018-06-01 NaN NaN NaN
2018-07-01 NaN NaN NaN
2018-08-01 NaN NaN NaN
2018-09-01 NaN NaN NaN
2018-10-01 NaN NaN NaN
2018-11-01 NaN NaN NaN
2018-12-01 NaN NaN NaN
2019-01-01 NaN NaN NaN
Help from:
Pandas Set DatetimeIndex