My code is as follows:
import pandas as pd
from pandas_datareader import data as web
import datetime
start = datetime.datetime(2021, 1, 1)
end = datetime.datetime.today()
df = web.DataReader('goog', 'yahoo', start, end)
df['pct'] = df['Close'].pct_change()
Simple enough, produces:
High Low Open Close Volume Adj Close pct
Date
2020-12-31 1758.930054 1735.420044 1735.420044 1751.880005 1011900 1751.880005 NaN
2021-01-04 1760.650024 1707.849976 1757.540039 1728.239990 1901900 1728.239990 -0.013494
2021-01-05 1747.670044 1718.015015 1725.000000 1740.920044 1145300 1740.920044 0.007337
2021-01-06 1748.000000 1699.000000 1702.630005 1735.290039 2602100 1735.290039 -0.003234
2021-01-07 1788.400024 1737.050049 1740.060059 1787.250000 2265000 1787.250000 0.029943
... ... ... ... ... ... ... ...
2021-08-13 2773.479980 2760.100098 2767.149902 2768.120117 628600 2768.120117 0.000119
2021-08-16 2779.810059 2723.314941 2760.000000 2778.320068 902000 2778.320068 0.003685
2021-08-17 2774.370117 2735.750000 2763.820068 2746.010010 1063600 2746.010010 -0.011629
2021-08-18 2765.879883 2728.419922 2742.310059 2731.399902 746700 2731.399902 -0.005320
2021-08-19 2748.925049 2707.120117 2709.350098 2738.270020 856623 2738.270020 0.002515
160 rows × 7 columns
So the last row says the pct is 0.002515
My objective was to reproduce the same result without pct_change. To do this I have this code:
(1- (df['Close'] / df['Close'].shift(-1))).shift(1)
which produces this:
Date
2020-12-31 NaN
2021-01-04 -0.013679
2021-01-05 0.007284
2021-01-06 -0.003244
2021-01-07 0.029073
...
2021-08-13 0.000119
2021-08-16 0.003671
2021-08-17 -0.011766
2021-08-18 -0.005349
2021-08-19 0.002509
Name: Close, Length: 160, dtype: float64
The last value I get is 0.002509, not 0.002515. Could you please explain why the last two digits are off in each calculation?
Percent change is normally the change relative to the initial value:
(final - initial) / initial = final / initial - 1
Your expression instead computes the change relative to the final value. Try
df['Close'] / df['Close'].shift(1) - 1
By the way, you only need to shift once in your original expression as well.
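As a quick check, here is a minimal sketch on a made-up series (not the GOOG data above) showing that this expression matches pct_change exactly:
import pandas as pd

s = pd.Series([100.0, 102.0, 99.0, 99.5])   # made-up closing prices
manual = s / s.shift(1) - 1                  # (final - initial) / initial
print(manual.equals(s.pct_change()))         # should print True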
I have a DataFrame that tracks the 'Adj Closing' price for several global markets, which causes dates to repeat. To clean this up I use .set_index(['Index Ticker', 'Date']).
DataFrame sample
My issue is that the closing prices run as far back as 1997-07-02, but I only need 2020-01-01 and forward. I tried idx = pd.IndexSlice followed by df.loc[idx[:, '2020-01-01':], :], as well as df.loc[(slice(None), '2020-01-01':), :], but both methods return a syntax error on the : that I'm using to slice across a range of dates. Any tips on getting the data past a specific date? Thank you in advance!
Try:
import pandas as pd

# create a dataframe to approximate your data
df = pd.DataFrame({'ticker': ['A']*5 + ['M']*5,
                   'Date': pd.date_range(start='2021-01-01', periods=5).tolist()
                           + pd.date_range(start='2021-01-01', periods=5).tolist(),
                   'high': range(10)}
                  ).groupby(['ticker', 'Date']).sum()
high
ticker Date
A 2021-01-01 0
2021-01-02 1
2021-01-03 2
2021-01-04 3
2021-01-05 4
M 2021-01-01 5
2021-01-02 6
2021-01-03 7
2021-01-04 8
2021-01-05 9
# evaluate conditions against level 1 (Date) of your MultiIndex; level 0 is ticker
df[df.index.get_level_values(1) > '2021-01-03']
high
ticker Date
A 2021-01-04 3
2021-01-05 4
M 2021-01-04 8
2021-01-05 9
Alternatively, if possible, remove the unwanted dates before setting your MultiIndex.
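For reference, the pd.IndexSlice approach from the question can also be made to work; a sketch on the toy frame above, assuming the index is sorted (which the groupby already guarantees):
idx = pd.IndexSlice
# slice level 1 (Date) from a given date onward, across all tickers
df.loc[idx[:, '2021-01-04':], :]
# equivalent spelled out with slice(); a bare '2021-01-04': is only valid
# Python inside square brackets, which is why the tuple version in the
# question raises a SyntaxError
df.loc[(slice(None), slice('2021-01-04', None)), :]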
I have a dataframe containing time series of hourly measurements, with the structure name, time, output. For each name the measurements come from roughly the same time period. I am trying to fill in the missing values, such that all 24 hours of each day appear in the time column.
So I'm expecting a table like this:
name time output
x 2018-02-22 00:00:00 100
...
x 2018-02-22 23:00:00 200
x 2018-02-24 00:00:00 300
...
x 2018-02-24 23:00:00 300
y 2018-02-22 00:00:00 100
...
y 2018-02-22 23:00:00 200
y 2018-02-25 00:00:00 300
...
y 2018-02-25 23:00:00 300
For this I group by name and then apply a custom function that adds the missing timestamps to the corresponding dataframe.
def add_missing_hours(df):
    start_date = df.time.iloc[0].date()
    end_date = df.time.iloc[-1].date()
    dates_range = pd.date_range(start_date, end_date, freq='1H')
    new_dates = set(dates_range) - set(df.time)
    name = df["name"].iloc[0]
    df = df.append(pd.DataFrame({'GSRN': [name]*len(new_dates), 'time': new_dates}))
    return df
For some reason the name column is dropped when I create the DataFrame, but I can't understand why. Does anyone know why or have a better idea how to fill in the missing timestamps?
Edit 1:
This is different from the [question here][1] because they didn't need all 24 values per day -- resampling between 2pm and 10pm will only give the values in between.
Edit 2:
I found a (not great) solution by creating a MultiIndex with all name-timestamp pairs and combining it with the table. Code below for anyone interested, but I'm still interested in a better solution:
start_date = datetime.datetime.combine(df.time.min().date(), datetime.time(0, 0))
end_date = datetime.datetime.combine(df.time.max().date(), datetime.time(23, 0))
new_idx = pd.date_range(start_date, end_date, freq='1H')
mux = pd.MultiIndex.from_product([df['name'].unique(), new_idx], names=('name', 'time'))
df_complete = pd.DataFrame(index=mux).reset_index().combine_first(df)
df_complete = df_complete.groupby(["name", df_complete.time.dt.date]).filter(lambda g: g["output"].count() > 0)
The last line removes any days that were completely missing for the specific name in the initial dataframe.
Try this:
First create a date range from the min date to the max date at an hourly interval, then join your data onto it.
df.time = pd.to_datetime(df.time)
min_date = df.time.min()
max_date = df.time.max()
dates_range = pd.date_range(min_date, max_date, freq='1H')
df.set_index('time', inplace=True)
# full hourly index as a one-column frame, indexed by the dates
df3 = pd.DataFrame(dates_range).set_index(0)
# left-join the original data onto the complete hourly index
df4 = df3.join(df)
df4:
name output
2018-02-22 00:00:00 x 100.0
2018-02-22 00:00:00 y 100.0
2018-02-22 01:00:00 NaN NaN
2018-02-22 02:00:00 NaN NaN
2018-02-22 03:00:00 NaN NaN
... ... ...
2018-02-25 19:00:00 NaN NaN
2018-02-25 20:00:00 NaN NaN
2018-02-25 21:00:00 NaN NaN
2018-02-25 22:00:00 NaN NaN
2018-02-25 23:00:00 y 300.0
98 rows × 2 columns
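Note that the join above uses a single global index and loses the per-name grouping from the question. A per-group reindex is one alternative; this is a sketch, assuming df has the name/time/output columns from the question with time already parsed as datetimes:
def fill_hours(g):
    # hourly range spanning this name's own first and last day
    start = g['time'].min().normalize()
    end = g['time'].max().normalize() + pd.Timedelta(hours=23)
    return g.set_index('time').reindex(pd.date_range(start, end, freq='1H'))

out = (df.groupby('name', group_keys=True).apply(fill_hours)
         .drop(columns='name')              # recover name from the index instead
         .rename_axis(['name', 'time'])
         .reset_index())
Days with no measurements at all can then be dropped with the groupby/filter step from Edit 2.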
I have a pandas column with timestamps in increasing order.
I use to_datetime() to work with that column, but it automatically fills in the same date throughout the column, without incrementing the day when a timestamp crosses midnight.
How can I tell it to increment the day when the time crosses midnight?
rail[8].iloc[121]
rail[8].iloc[100]
printing these values outputs:
TIME 2020-11-19 00:18:00
Name: DSG, dtype: datetime64[ns]
TIME 2020-11-19 21:12:27
Name: KG, dtype: datetime64[ns]
whereas iloc[121] should be 2020-11-20
My code, followed by the resulting sample data:
df1.columns = df1.iloc[0]
ids = df1.loc['TRAIN NO'].unique()
df1.drop('TRAIN NO', axis=0, inplace=True)
rail = {}
for i in range(len(ids)):
    rail[i] = df1.filter(like=ids[i])
    rail[i] = rail[i].reset_index()
    rail[i].rename(columns={0: 'TRAIN NO'}, inplace=True)
    rail[i] = pd.melt(rail[i], id_vars='TRAIN NO', value_name='TIME', var_name='trainId')
    rail[i].drop(columns='trainId', inplace=True)
    rail[i].rename(columns={'TRAIN NO': 'CheckPoints'}, inplace=True)
    rail[i].set_index('CheckPoints', inplace=True)
    rail[i].dropna(inplace=True)
    rail[i]['TIME'] = pd.to_datetime(rail[i]['TIME'], infer_datetime_format=True)
CheckPoints TIME
DEPOT 2020-11-19 05:10:00
KG 2020-11-19 05:25:00
RI 2020-11-19 05:51:11
RI 2020-11-19 06:00:00
KG 2020-11-19 06:25:44
... ...
DSG 2020-11-19 23:41:50
ATHA 2020-11-19 23:53:56
NBAA 2020-11-19 23:58:00
NBAA 2020-11-19 00:01:00
DSG 2020-11-19 00:18:00
Could someone help me out?
You can check where the timedelta between subsequent timestamps is less than 0 (= the date changed). Take the cumulative sum of that and add it as a timedelta (days) to your datetime column:
import pandas as pd
df = pd.DataFrame({'time': ["23:00", "00:00", "12:00", "23:00", "01:00"]})
# cast time string to datetime, will automatically add today's date by default
df['datetime'] = pd.to_datetime(df['time'])
# get timedelta between subsequent timestamps in the column; df['datetime'].diff()
# compare to get a boolean mask where the change in time is negative (= new date)
m = df['datetime'].diff() < pd.Timedelta(0)
# m
# 0 False
# 1 True
# 2 False
# 3 False
# 4 True
# Name: datetime, dtype: bool
# the cumulative sum of that mask accumulates the booleans as 0/1:
# m.cumsum()
# 0 0
# 1 1
# 2 1
# 3 1
# 4 2
# Name: datetime, dtype: int32
# ...so we can use that as the date offset, which we add as timedelta to the datetime column:
df['datetime'] += pd.to_timedelta(m.cumsum(), unit='d')
df
time datetime
0 23:00 2020-11-19 23:00:00
1 00:00 2020-11-20 00:00:00
2 12:00 2020-11-20 12:00:00
3 23:00 2020-11-20 23:00:00
4 01:00 2020-11-21 01:00:00
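Applied to the TIME column built in the loop from the question, the same pattern would look like this (a sketch, assuming each rail[i]['TIME'] is sorted in travel order as shown):
# the day offset grows by 1 every time the clock rolls past midnight
days = (rail[i]['TIME'].diff() < pd.Timedelta(0)).cumsum()
rail[i]['TIME'] += pd.to_timedelta(days, unit='d')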
I am trying to create a proper bin for a timestamp interval column,
using code such as
df['Bin'] = pd.cut(df['interval_length'],
                   bins=pd.to_timedelta(['00:00:00', '00:10:00', '00:20:00',
                                         '00:30:00', '00:40:00', '00:50:00',
                                         '00:60:00']))
The resulting df looks like:
time_interval | bin
00:17:00 (0 days 00:10:00, 0 days 00:20:00]
01:42:00 NaN
00:15:00 (0 days 00:10:00, 0 days 00:20:00]
00:00:00 NaN
00:06:00 (0 days 00:00:00, 0 days 00:10:00]
This is a little off: I want just the time value, not the days, and I also want the upper limit of the last bin to be 60 minutes to inf (or more).
Desired Output:
time_interval | bin
00:17:00 (00:10:00,00:20:00]
01:42:00 (00:60:00,inf]
00:15:00 (00:10:00,00:20:00]
00:00:00 (00:00:00,00:10:00]
00:06:00 (00:00:00,00:10:00]
Thanks for looking!
Pandas has no inf for timedeltas, so the maximal value is used instead. Also, to include the lowest value, the parameter include_lowest=True is used. If you want the bins filled with timedeltas:
b = pd.to_timedelta(['00:00:00', '00:10:00', '00:20:00',
                     '00:30:00', '00:40:00',
                     '00:50:00', '00:60:00'])
b = b.append(pd.Index([pd.Timedelta.max]))
df['Bin'] = pd.cut(df['time_interval'], include_lowest=True, bins=b)
print(df)
time_interval Bin
0 00:17:00 (0 days 00:10:00, 0 days 00:20:00]
1 01:42:00 (0 days 01:00:00, 106751 days 23:47:16.854775]
2 00:15:00 (0 days 00:10:00, 0 days 00:20:00]
3 00:00:00 (-1 days +23:59:59.999999, 0 days 00:10:00]
4 00:06:00 (-1 days +23:59:59.999999, 0 days 00:10:00]
If you want strings instead of timedeltas, use zip to create the labels, after appending 'inf':
vals = ['00:00:00', '00:10:00', '00:20:00',
        '00:30:00', '00:40:00', '00:50:00', '00:60:00']
b = pd.to_timedelta(vals).append(pd.Index([pd.Timedelta.max]))
vals.append('inf')
labels = ['{}-{}'.format(i, j) for i, j in zip(vals[:-1], vals[1:])]
df['Bin'] = pd.cut(df['time_interval'], include_lowest=True, bins=b, labels=labels)
print(df)
time_interval Bin
0 00:17:00 00:10:00-00:20:00
1 01:42:00 00:60:00-inf
2 00:15:00 00:10:00-00:20:00
3 00:00:00 00:00:00-00:10:00
4 00:06:00 00:00:00-00:10:00
You could just use labels to solve it:
df['Bin'] = pd.cut(df['interval_length'],
                   bins=pd.to_timedelta(['00:00:00', '00:10:00', '00:20:00',
                                         '00:30:00', '00:40:00', '00:50:00',
                                         '00:60:00', '24:00:00']),
                   labels=['(00:00:00,00:10:00]', '(00:10:00,00:20:00]',
                           '(00:20:00,00:30:00]', '(00:30:00,00:40:00]',
                           '(00:40:00,00:50:00]', '(00:50:00,00:60:00]',
                           '(00:60:00,inf]'])
I have 7 columns of data indexed by datetime (30-minute frequency), starting 2017-05-31 and ending 2018-05-25. I want to plot the mean of specific date ranges (seasons). I have been trying groupby, but I can't group by a specific range; I get wrong results if I do df.groupby(df.date.dt.month).mean().
A few lines from the dataset (the date range is 2017-05-31 to 2018-05-25):
50 51 56 58
date
2017-05-31 00:00:00 200.213542 276.929198 242.879051 NaN
2017-05-31 00:30:00 200.215478 276.928229 242.879051 NaN
2017-05-31 01:00:00 200.215478 276.925324 242.878083 NaN
2017-06-01 01:00:00 200.221288 276.944691 242.827729 NaN
2017-06-01 01:30:00 200.221288 276.944691 242.827729 NaN
2017-08-31 09:00:00 206.961886 283.374453 245.041349 184.358250
2017-08-31 09:30:00 206.966727 283.377358 245.042317 184.360187
2017-12-31 09:00:00 212.925877 287.198416 247.455413 187.175144
2017-12-31 09:30:00 212.926846 287.196480 247.465097 187.179987
2018-03-31 23:00:00 213.304498 286.933093 246.469647 186.887548
2018-03-31 23:30:00 213.308369 286.938902 246.468678 186.891422
2018-04-30 23:00:00 215.496812 288.342024 247.522230 188.104749
2018-04-30 23:30:00 215.497781 288.340086 247.520294 188.103780
I have created these variables (these are the ranges I need):
increment_rates_winter = df['2017-08-30'].mean() - df['2017-06-01'].mean()
increment_rates_spring = df['2017-11-30'].mean() - df['2017-09-01'].mean()
increment_rates_summer = df['2018-02-28'].mean() - df['2017-12-01'].mean()
increment_rates_fall = df['2018-05-24'].mean() - df['2018-03-01'].mean()
Concatenated them:
df_seasons = pd.concat([increment_rates_winter, increment_rates_spring, increment_rates_summer, increment_rates_fall], axis=1)
After plotting I got the column ids on the x-axis with one line per season; however, I've been trying to get this:
df_seasons
Out[664]:
Winter Spring Summer Fall
50 6.697123 6.948447 -1.961549 7.662622
51 6.428329 4.760650 -2.188402 5.927087
52 5.580953 6.667529 1.136889 12.939295
53 6.406259 2.506279 -2.105125 6.964549
54 4.332826 3.678492 -2.574769 6.569398
56 2.222032 3.359607 -2.694863 5.348258
58 NaN 1.388535 -0.035889 4.213046
That is, the seasons on the x-axis and the mean plotted for each column. The season ranges are:
Winter = df['2017-06-01':'2017-08-30']
Spring = df['2017-09-01':'2017-11-30']
Summer = df['2017-12-01':'2018-02-28']
Fall = df['2018-03-01':'2018-05-30']
Thank you in advance!
We can get a specific date range in the following way; you can then define the range however you want and take the mean:
import pandas as pd
df = pd.read_csv('test.csv')
df['date'] = pd.to_datetime(df['date'])
start_date = "2017-12-31 09:00:00"
end_date = "2018-04-30 23:00:00"
mask = (df['date'] > start_date) & (df['date'] <= end_date)
f_df = df.loc[mask]
This gives the output
date 50 ... 58
8 2017-12-31 09:30:00 212.926846 ... 187.179987 NaN
9 2018-03-31 23:00:00 213.304498 ... 186.887548 NaN
10 2018-03-31 23:30:00 213.308369 ... 186.891422 NaN
11 2018-04-30 23:00:00 215.496812 ... 188.104749 NaN
Hope this helps
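To build the whole df_seasons table in one pass, here is a sketch that repackages the increment calculation from the question (season boundaries taken from there):
seasons = {
    'Winter': ('2017-06-01', '2017-08-30'),
    'Spring': ('2017-09-01', '2017-11-30'),
    'Summer': ('2017-12-01', '2018-02-28'),
    'Fall':   ('2018-03-01', '2018-05-24'),
}
# df.loc['2017-06-01'] selects every 30-minute row of that day,
# so .mean() is the per-column mean for that day
df_seasons = pd.concat(
    {name: df.loc[end].mean() - df.loc[start].mean()
     for name, (start, end) in seasons.items()},
    axis=1,
)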
How about transposing it:
df_seasons.T.plot()
Output: (plot with the seasons on the x-axis and one line per column)