My code is as follows:
import pandas as pd
from pandas_datareader import data as web
import datetime
start = datetime.datetime(2021, 1, 1)
end = datetime.datetime.today()
df = web.DataReader('goog', 'yahoo', start, end)
df['pct'] = df['Close'].pct_change()
Simple enough, produces:
High Low Open Close Volume Adj Close pct
Date
2020-12-31 1758.930054 1735.420044 1735.420044 1751.880005 1011900 1751.880005 NaN
2021-01-04 1760.650024 1707.849976 1757.540039 1728.239990 1901900 1728.239990 -0.013494
2021-01-05 1747.670044 1718.015015 1725.000000 1740.920044 1145300 1740.920044 0.007337
2021-01-06 1748.000000 1699.000000 1702.630005 1735.290039 2602100 1735.290039 -0.003234
2021-01-07 1788.400024 1737.050049 1740.060059 1787.250000 2265000 1787.250000 0.029943
... ... ... ... ... ... ... ...
2021-08-13 2773.479980 2760.100098 2767.149902 2768.120117 628600 2768.120117 0.000119
2021-08-16 2779.810059 2723.314941 2760.000000 2778.320068 902000 2778.320068 0.003685
2021-08-17 2774.370117 2735.750000 2763.820068 2746.010010 1063600 2746.010010 -0.011629
2021-08-18 2765.879883 2728.419922 2742.310059 2731.399902 746700 2731.399902 -0.005320
2021-08-19 2748.925049 2707.120117 2709.350098 2738.270020 856623 2738.270020 0.002515
160 rows × 7 columns
So the last row says the pct is 0.002515
My objective was to reproduce the same result without pct_change. To do this I have this code:
(1- (df['Close'] / df['Close'].shift(-1))).shift(1)
which produces this:
Date
2020-12-31 NaN
2021-01-04 -0.013679
2021-01-05 0.007284
2021-01-06 -0.003244
2021-01-07 0.029073
...
2021-08-13 0.000119
2021-08-16 0.003671
2021-08-17 -0.011766
2021-08-18 -0.005349
2021-08-19 0.002509
Name: Close, Length: 160, dtype: float64
The last value I get is 0.002509, not 0.002515. Could you please explain why the last two digits are off in each calculation?
Percent change is normally the change relative to the initial value:
(final - initial) / initial = final / initial - 1
Your expression instead computes the change relative to the final value. Try
df['Close'] / df['Close'].shift(1) - 1
By the way, you only need to shift once in your original expression as well.
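As a quick check, here is a minimal sketch on a made-up series (not the GOOG data above) showing that this expression matches pct_change exactly:
import pandas as pd

s = pd.Series([100.0, 102.0, 99.0, 99.5])   # made-up closing prices
manual = s / s.shift(1) - 1                  # (final - initial) / initial
print(manual.equals(s.pct_change()))         # should print True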
I have a DataFrame that tracks the 'Adj Closing' price for several global markets, which causes dates to repeat. To clean this up I use .set_index(['Index Ticker', 'Date']).
DataFrame sample
My issue is that the closing prices run as far back as 1997-07-02, but I only need 2020-01-01 and forward. I tried idx = pd.IndexSlice followed by df.loc[idx[:, '2020-01-01':], :], as well as df.loc[(slice(None), '2020-01-01':), :], but both methods return a syntax error on the : that I'm using to slice across a range of dates. Any tips on getting the data past a specific date? Thank you in advance!
Try:
import pandas as pd

# create a dataframe to approximate your data
df = pd.DataFrame({'ticker': ['A']*5 + ['M']*5,
                   'Date': pd.date_range(start='2021-01-01', periods=5).tolist()
                           + pd.date_range(start='2021-01-01', periods=5).tolist(),
                   'high': range(10)}
                  ).groupby(['ticker', 'Date']).sum()
high
ticker Date
A 2021-01-01 0
2021-01-02 1
2021-01-03 2
2021-01-04 3
2021-01-05 4
M 2021-01-01 5
2021-01-02 6
2021-01-03 7
2021-01-04 8
2021-01-05 9
# evaluate conditions against level 1 (Date) of your MultiIndex; level 0 is ticker
df[df.index.get_level_values(1) > '2021-01-03']
high
ticker Date
A 2021-01-04 3
2021-01-05 4
M 2021-01-04 8
2021-01-05 9
Alternatively, if possible, remove the unwanted dates before setting your MultiIndex.
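For reference, the pd.IndexSlice approach from the question can also be made to work; a sketch on the toy frame above, assuming the index is sorted (which the groupby already guarantees):
idx = pd.IndexSlice
# slice level 1 (Date) from a given date onward, across all tickers
df.loc[idx[:, '2021-01-04':], :]
# equivalent spelled out with slice(); a bare '2021-01-04': is only valid
# Python inside square brackets, which is why the tuple version in the
# question raises a SyntaxError
df.loc[(slice(None), slice('2021-01-04', None)), :]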
I have a dataframe containing time series of hourly measurements, with the structure name, time, output. For each name the measurements come from roughly the same time period. I am trying to fill in the missing values, such that all 24 hours of each day appear in the time column.
So I'm expecting a table like this:
name time output
x 2018-02-22 00:00:00 100
...
x 2018-02-22 23:00:00 200
x 2018-02-24 00:00:00 300
...
x 2018-02-24 23:00:00 300
y 2018-02-22 00:00:00 100
...
y 2018-02-22 23:00:00 200
y 2018-02-25 00:00:00 300
...
y 2018-02-25 23:00:00 300
For this I group by name and then apply a custom function that adds the missing timestamps to the corresponding dataframe.
def add_missing_hours(df):
    start_date = df.time.iloc[0].date()
    end_date = df.time.iloc[-1].date()
    dates_range = pd.date_range(start_date, end_date, freq='1H')
    new_dates = set(dates_range) - set(df.time)
    name = df["name"].iloc[0]
    df = df.append(pd.DataFrame({'GSRN': [name]*len(new_dates), 'time': new_dates}))
    return df
For some reason the name column is dropped when I create the DataFrame, but I can't understand why. Does anyone know why or have a better idea how to fill in the missing timestamps?
Edit 1:
This is different from the [question here][1] because they didn't need all 24 values per day -- resampling between 2pm and 10pm will only give the values in between.
Edit 2:
I found a (not great) solution by creating a MultiIndex with all name-timestamp pairs and combining it with the table. Code below for anyone interested, but I'm still interested in a better solution:
start_date = datetime.datetime.combine(df.time.min().date(), datetime.time(0, 0))
end_date = datetime.datetime.combine(df.time.max().date(), datetime.time(23, 0))
new_idx = pd.date_range(start_date, end_date, freq='1H')
mux = pd.MultiIndex.from_product([df['name'].unique(), new_idx], names=('name', 'time'))
df_complete = pd.DataFrame(index=mux).reset_index().combine_first(df)
df_complete = df_complete.groupby(["name", df_complete.time.dt.date]).filter(lambda g: g["output"].count() > 0)
The last line removes any days that were completely missing for the specific name in the initial dataframe.
Try this:
First create a date range from the min date to the max date at an hourly interval, then join your data onto it.
df.time = pd.to_datetime(df.time)
min_date = df.time.min()
max_date = df.time.max()
dates_range = pd.date_range(min_date, max_date, freq='1H')
df.set_index('time', inplace=True)
# full hourly index as a one-column frame, indexed by the dates
df3 = pd.DataFrame(dates_range).set_index(0)
# left-join the original data onto the complete hourly index
df4 = df3.join(df)
df4:
name output
2018-02-22 00:00:00 x 100.0
2018-02-22 00:00:00 y 100.0
2018-02-22 01:00:00 NaN NaN
2018-02-22 02:00:00 NaN NaN
2018-02-22 03:00:00 NaN NaN
... ... ...
2018-02-25 19:00:00 NaN NaN
2018-02-25 20:00:00 NaN NaN
2018-02-25 21:00:00 NaN NaN
2018-02-25 22:00:00 NaN NaN
2018-02-25 23:00:00 y 300.0
98 rows × 2 columns
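Note that the join above uses a single global index and loses the per-name grouping from the question. A per-group reindex is one alternative; this is a sketch, assuming df has the name/time/output columns from the question with time already parsed as datetimes:
def fill_hours(g):
    # hourly range spanning this name's own first and last day
    start = g['time'].min().normalize()
    end = g['time'].max().normalize() + pd.Timedelta(hours=23)
    return g.set_index('time').reindex(pd.date_range(start, end, freq='1H'))

out = (df.groupby('name', group_keys=True).apply(fill_hours)
         .drop(columns='name')              # recover name from the index instead
         .rename_axis(['name', 'time'])
         .reset_index())
Days with no measurements at all can then be dropped with the groupby/filter step from Edit 2.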
I have a pandas column with timestamps in increasing order.
I use to_datetime() to work with that column, but it automatically fills in the same date throughout the column, without incrementing the day when a timestamp crosses midnight.
How can I tell it to increment the day when the time crosses midnight?
rail[8].iloc[121]
rail[8].iloc[100]
printing these values outputs:
TIME 2020-11-19 00:18:00
Name: DSG, dtype: datetime64[ns]
TIME 2020-11-19 21:12:27
Name: KG, dtype: datetime64[ns]
whereas iloc[121] should be 2020-11-20
My code, followed by the resulting sample data:
df1.columns = df1.iloc[0]
ids = df1.loc['TRAIN NO'].unique()
df1.drop('TRAIN NO', axis=0, inplace=True)
rail = {}
for i in range(len(ids)):
    rail[i] = df1.filter(like=ids[i])
    rail[i] = rail[i].reset_index()
    rail[i].rename(columns={0: 'TRAIN NO'}, inplace=True)
    rail[i] = pd.melt(rail[i], id_vars='TRAIN NO', value_name='TIME', var_name='trainId')
    rail[i].drop(columns='trainId', inplace=True)
    rail[i].rename(columns={'TRAIN NO': 'CheckPoints'}, inplace=True)
    rail[i].set_index('CheckPoints', inplace=True)
    rail[i].dropna(inplace=True)
    rail[i]['TIME'] = pd.to_datetime(rail[i]['TIME'], infer_datetime_format=True)
CheckPoints TIME
DEPOT 2020-11-19 05:10:00
KG 2020-11-19 05:25:00
RI 2020-11-19 05:51:11
RI 2020-11-19 06:00:00
KG 2020-11-19 06:25:44
... ...
DSG 2020-11-19 23:41:50
ATHA 2020-11-19 23:53:56
NBAA 2020-11-19 23:58:00
NBAA 2020-11-19 00:01:00
DSG 2020-11-19 00:18:00
Could someone help me out?
You can check where the timedelta between subsequent timestamps is less than 0 (= the date changed). Take the cumulative sum of that and add it as a timedelta (days) to your datetime column:
import pandas as pd
df = pd.DataFrame({'time': ["23:00", "00:00", "12:00", "23:00", "01:00"]})
# cast time string to datetime, will automatically add today's date by default
df['datetime'] = pd.to_datetime(df['time'])
# get timedelta between subsequent timestamps in the column; df['datetime'].diff()
# compare to get a boolean mask where the change in time is negative (= new date)
m = df['datetime'].diff() < pd.Timedelta(0)
# m
# 0 False
# 1 True
# 2 False
# 3 False
# 4 True
# Name: datetime, dtype: bool
# the cumulative sum of that mask accumulates the booleans as 0/1:
# m.cumsum()
# 0 0
# 1 1
# 2 1
# 3 1
# 4 2
# Name: datetime, dtype: int32
# ...so we can use that as the date offset, which we add as timedelta to the datetime column:
df['datetime'] += pd.to_timedelta(m.cumsum(), unit='d')
df
time datetime
0 23:00 2020-11-19 23:00:00
1 00:00 2020-11-20 00:00:00
2 12:00 2020-11-20 12:00:00
3 23:00 2020-11-20 23:00:00
4 01:00 2020-11-21 01:00:00
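Applied to the TIME column built in the loop from the question, the same pattern would look like this (a sketch, assuming each rail[i]['TIME'] is sorted in travel order as shown):
# the day offset grows by 1 every time the clock rolls past midnight
days = (rail[i]['TIME'].diff() < pd.Timedelta(0)).cumsum()
rail[i]['TIME'] += pd.to_timedelta(days, unit='d')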
I am trying to create a proper bin for a timestamp interval column,
using code such as
df['Bin'] = pd.cut(df['interval_length'],
                   bins=pd.to_timedelta(['00:00:00', '00:10:00', '00:20:00',
                                         '00:30:00', '00:40:00', '00:50:00',
                                         '00:60:00']))
The resulting df looks like:
time_interval | bin
00:17:00 (0 days 00:10:00, 0 days 00:20:00]
01:42:00 NaN
00:15:00 (0 days 00:10:00, 0 days 00:20:00]
00:00:00 NaN
00:06:00 (0 days 00:00:00, 0 days 00:10:00]
This is a little off: I want just the time value, not the days, and I also want the upper limit of the last bin to be 60 minutes to inf (or more).
Desired Output:
time_interval | bin
00:17:00 (00:10:00,00:20:00]
01:42:00 (00:60:00,inf]
00:15:00 (00:10:00,00:20:00]
00:00:00 (00:00:00,00:10:00]
00:06:00 (00:00:00,00:10:00]
Thanks for looking!
Pandas has no inf for timedeltas, so the maximal value is used instead. Also, to include the lowest value, the parameter include_lowest=True is used. If you want the bins filled with timedeltas:
b = pd.to_timedelta(['00:00:00', '00:10:00', '00:20:00',
                     '00:30:00', '00:40:00',
                     '00:50:00', '00:60:00'])
b = b.append(pd.Index([pd.Timedelta.max]))
df['Bin'] = pd.cut(df['time_interval'], include_lowest=True, bins=b)
print(df)
time_interval Bin
0 00:17:00 (0 days 00:10:00, 0 days 00:20:00]
1 01:42:00 (0 days 01:00:00, 106751 days 23:47:16.854775]
2 00:15:00 (0 days 00:10:00, 0 days 00:20:00]
3 00:00:00 (-1 days +23:59:59.999999, 0 days 00:10:00]
4 00:06:00 (-1 days +23:59:59.999999, 0 days 00:10:00]
If you want strings instead of timedeltas, use zip to create the labels, after appending 'inf':
vals = ['00:00:00', '00:10:00', '00:20:00',
        '00:30:00', '00:40:00', '00:50:00', '00:60:00']
b = pd.to_timedelta(vals).append(pd.Index([pd.Timedelta.max]))
vals.append('inf')
labels = ['{}-{}'.format(i, j) for i, j in zip(vals[:-1], vals[1:])]
df['Bin'] = pd.cut(df['time_interval'], include_lowest=True, bins=b, labels=labels)
print(df)
time_interval Bin
0 00:17:00 00:10:00-00:20:00
1 01:42:00 00:60:00-inf
2 00:15:00 00:10:00-00:20:00
3 00:00:00 00:00:00-00:10:00
4 00:06:00 00:00:00-00:10:00
You could just use labels to solve it:
df['Bin'] = pd.cut(df['interval_length'],
                   bins=pd.to_timedelta(['00:00:00', '00:10:00', '00:20:00',
                                         '00:30:00', '00:40:00', '00:50:00',
                                         '00:60:00', '24:00:00']),
                   labels=['(00:00:00,00:10:00]', '(00:10:00,00:20:00]',
                           '(00:20:00,00:30:00]', '(00:30:00,00:40:00]',
                           '(00:40:00,00:50:00]', '(00:50:00,00:60:00]',
                           '(00:60:00,inf]'])
I have 7 columns of data indexed by datetime (30-minute frequency), starting 2017-05-31 and ending 2018-05-25. I want to plot the mean of specific date ranges (seasons). I have been trying groupby, but I can't group by a specific range; I get wrong results if I do df.groupby(df.date.dt.month).mean().
A few lines from the dataset (the date range is 2017-05-31 to 2018-05-25):
50 51 56 58
date
2017-05-31 00:00:00 200.213542 276.929198 242.879051 NaN
2017-05-31 00:30:00 200.215478 276.928229 242.879051 NaN
2017-05-31 01:00:00 200.215478 276.925324 242.878083 NaN
2017-06-01 01:00:00 200.221288 276.944691 242.827729 NaN
2017-06-01 01:30:00 200.221288 276.944691 242.827729 NaN
2017-08-31 09:00:00 206.961886 283.374453 245.041349 184.358250
2017-08-31 09:30:00 206.966727 283.377358 245.042317 184.360187
2017-12-31 09:00:00 212.925877 287.198416 247.455413 187.175144
2017-12-31 09:30:00 212.926846 287.196480 247.465097 187.179987
2018-03-31 23:00:00 213.304498 286.933093 246.469647 186.887548
2018-03-31 23:30:00 213.308369 286.938902 246.468678 186.891422
2018-04-30 23:00:00 215.496812 288.342024 247.522230 188.104749
2018-04-30 23:30:00 215.497781 288.340086 247.520294 188.103780
I have created these variables (these are the ranges I need):
increment_rates_winter = df['2017-08-30'].mean() - df['2017-06-01'].mean()
increment_rates_spring = df['2017-11-30'].mean() - df['2017-09-01'].mean()
increment_rates_summer = df['2018-02-28'].mean() - df['2017-12-01'].mean()
increment_rates_fall = df['2018-05-24'].mean() - df['2018-03-01'].mean()
Concatenated them:
df_seasons = pd.concat([increment_rates_winter, increment_rates_spring, increment_rates_summer, increment_rates_fall], axis=1)
After plotting I got the column ids on the x-axis with one line per season; however, I've been trying to get this:
df_seasons
Out[664]:
Winter Spring Summer Fall
50 6.697123 6.948447 -1.961549 7.662622
51 6.428329 4.760650 -2.188402 5.927087
52 5.580953 6.667529 1.136889 12.939295
53 6.406259 2.506279 -2.105125 6.964549
54 4.332826 3.678492 -2.574769 6.569398
56 2.222032 3.359607 -2.694863 5.348258
58 NaN 1.388535 -0.035889 4.213046
That is, the seasons on the x-axis and the mean plotted for each column. The season ranges are:
Winter = df['2017-06-01':'2017-08-30']
Spring = df['2017-09-01':'2017-11-30']
Summer = df['2017-12-01':'2018-02-28']
Fall = df['2018-03-01':'2018-05-30']
Thank you in advance!
We can get a specific date range in the following way; you can then define the range however you want and take the mean:
import pandas as pd
df = pd.read_csv('test.csv')
df['date'] = pd.to_datetime(df['date'])
start_date = "2017-12-31 09:00:00"
end_date = "2018-04-30 23:00:00"
mask = (df['date'] > start_date) & (df['date'] <= end_date)
f_df = df.loc[mask]
This gives the output
date 50 ... 58
8 2017-12-31 09:30:00 212.926846 ... 187.179987 NaN
9 2018-03-31 23:00:00 213.304498 ... 186.887548 NaN
10 2018-03-31 23:30:00 213.308369 ... 186.891422 NaN
11 2018-04-30 23:00:00 215.496812 ... 188.104749 NaN
Hope this helps
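To build the whole df_seasons table in one pass, here is a sketch that repackages the increment calculation from the question (season boundaries taken from there):
seasons = {
    'Winter': ('2017-06-01', '2017-08-30'),
    'Spring': ('2017-09-01', '2017-11-30'),
    'Summer': ('2017-12-01', '2018-02-28'),
    'Fall':   ('2018-03-01', '2018-05-24'),
}
# df.loc['2017-06-01'] selects every 30-minute row of that day,
# so .mean() is the per-column mean for that day
df_seasons = pd.concat(
    {name: df.loc[end].mean() - df.loc[start].mean()
     for name, (start, end) in seasons.items()},
    axis=1,
)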
How about transposing it:
df_seasons.T.plot()
Output: (plot with the seasons on the x-axis and one line per column)