I want to resample a pandas DataFrame from hourly to annual/daily frequency with the how=mean method. However, some hourly values are of course missing during the year.
How can I set a threshold for the ratio of allowed NaNs before the mean is set to NaN too? I couldn't find anything about that in the docs...
Thanks in advance!
Here is a simple solution using groupby.
# Test data
import numpy as np
import pandas as pd
from pandas.tseries.offsets import Hour

start_date = pd.to_datetime('2015-01-01')
number = 365 * 24
df = pd.DataFrame(np.random.randint(1, 10, number),
                  index=pd.date_range(start=start_date, periods=number, freq='H'),
                  columns=['values'])
# Generate some NaNs to simulate missing values on the first day
na_range = pd.date_range(start=start_date, end=start_date + 3 * Hour(), freq='H')
df.loc[na_range, 'values'] = np.nan
# Group by day, computing the mean and the count of non-NaN values
df = df.groupby(df.index.date).agg(['mean', 'count'])
df.columns = df.columns.droplevel()
# Populate the mean only if the number of values (count) exceeds the threshold
df['values'] = np.nan
df.loc[df['count'] > 20, 'values'] = df['mean']
print(df.head())
# Result
mean count values
2015-01-01 4.947368 20 NaN
2015-01-02 5.125000 24 5.125
2015-01-03 4.875000 24 4.875
2015-01-04 5.750000 24 5.750
2015-01-05 4.875000 24 4.875
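For reference, the same thresholded mean can be computed in a single pass with resample and a custom aggregation. This is a minimal sketch assuming the original hourly df from above (before it is overwritten by the groupby result); the helper name mean_if_enough and the cutoff of 20 valid hours are illustrative choices, not part of the original answer.

def mean_if_enough(x, min_count=20):
    # Daily mean, but only if more than min_count hourly values are present
    return x.mean() if x.count() > min_count else np.nan

daily = df['values'].resample('D').apply(mean_if_enough)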
Here is an alternative solution, based on resampling.
# Test data (taken from Romain's answer)
import numpy as np
import pandas as pd

start_date = pd.to_datetime('2015-01-01')
number = 365 * 24
df = pd.DataFrame(np.random.randint(1, 10, number),
                  index=pd.date_range(start=start_date, periods=number, freq='H'),
                  columns=['values'])
# Generate some NaNs to simulate missing values on the first day
na_range = pd.date_range(start=start_date, end='2015-01-01 12:00', freq='H')
df.loc[na_range, 'values'] = np.nan
# Add a column with 1 if data is not NaN, 0 if data is NaN
df['data coverage'] = df['values'].notna().astype(int)
# Daily mean: 'data coverage' becomes the fraction of valid hours per day
df = df.resample('D').mean()
# Specify a threshold on data coverage of 80%
threshold = 0.8
df.loc[df['data coverage'] < threshold, 'values'] = np.nan
print(df.head(7))
# Result
values data coverage
2015-01-01 NaN 0.458333
2015-01-02 5.708333 1.000000
2015-01-03 5.083333 1.000000
2015-01-04 4.958333 1.000000
2015-01-05 5.125000 1.000000
2015-01-06 4.791667 1.000000
2015-01-07 5.625000 1.000000
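The question also asked about annual frequency; the same coverage trick works there by changing the resample rule. A sketch assuming the hourly df with its 'data coverage' column, taken just before the daily resample ('A' is the year-end alias in older pandas; newer versions also accept 'YE'):

annual = df.resample('A').mean()
annual.loc[annual['data coverage'] < threshold, 'values'] = np.nan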
I have a pandas DataFrame which stores stock prices and times; the time column's dtype is datetime64.
Here is a demo:
import pandas as pd
df = pd.DataFrame([['2022-09-01 09:33:00', 100.],
                   ['2022-09-01 09:33:14', 101.],
                   ['2022-09-01 09:33:16', 99.4],
                   ['2022-09-01 09:33:30', 100.9]],
                  columns=['time', 'price'])
df['time'] = pd.to_datetime(df['time'])
In [11]: df
Out[11]:
time price
0 2022-09-01 09:33:00 100.0
1 2022-09-01 09:33:14 101.0
2 2022-09-01 09:33:16 99.4
3 2022-09-01 09:33:30 100.9
I want to calculate the future return over 15 s (the first price at least 15 seconds later minus the current price).
What I want is:
In [13]: df
Out[13]:
time price return
0 2022-09-01 09:33:00 100.0 -0.6 // the future price is 99.4, period is 16s
1 2022-09-01 09:33:14 101.0 -0.1 // the future price is 100.9, period is 16s
2 2022-09-01 09:33:16 99.4 NaN
3 2022-09-01 09:33:30 100.9 NaN
I know df.diff can get the difference between consecutive rows; are there any good methods to do this?
merge_asof to the rescue
Subtract a timedelta of 15s from the time column of the right DataFrame, then self-merge on time using merge_asof with direction='forward', which selects the first row in the right DataFrame whose on key is greater than or equal to the on key in the left DataFrame. Finally, subtract the price columns to calculate the return:
df1 = pd.merge_asof(
    left=df,
    right=df.assign(time=df['time'] - pd.Timedelta('15s')),
    on='time', direction='forward', suffixes=['', '_r']
)
df1['return'] = df1.pop('price_r') - df1['price']
Result
time price return
0 2022-09-01 09:33:00 100.0 -0.6
1 2022-09-01 09:33:14 101.0 -0.1
2 2022-09-01 09:33:16 99.4 NaN
3 2022-09-01 09:33:30 100.9 NaN
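If matches arbitrarily far in the future should not count, merge_asof also accepts a tolerance parameter that caps the forward search. A sketch; the 60-second cap is an assumed horizon, not part of the original answer:

df1 = pd.merge_asof(
    left=df,
    right=df.assign(time=df['time'] - pd.Timedelta('15s')),
    on='time', direction='forward', suffixes=['', '_r'],
    tolerance=pd.Timedelta('60s'),  # assumed cap: ignore matches more than 60s ahead
)
df1['return'] = df1.pop('price_r') - df1['price']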
Please try this (though I don't believe the output is very meaningful). Is it what you expected? I realize this code assigns the return for the previous "15" seconds, not the next "15" seconds, but that is how a return is usually indexed: by the time when it is realized, not when it is still expected in the future.
import numpy as np
import pandas as pd

df = pd.DataFrame([['2022-09-01 09:33:00', 100.],
                   ['2022-09-01 09:33:14', 101.],
                   ['2022-09-01 09:33:16', 99.4],
                   ['2022-09-01 09:33:30', 100.9]],
                  columns=['time', 'price'])
df['time'] = pd.to_datetime(df['time'])
df = df.sort_values('time').reset_index(drop=True)
df.loc[:, 'return'] = df['price'].diff()
df['time_diff'] = df['time'].diff()
df['15sec_or_more'] = (df['time_diff'] >= np.timedelta64(15, 's'))
for k, i in enumerate(df.index):
    if k:
        if not df.loc[i, '15sec_or_more']:
            # Accumulate returns and time gaps from this row onward
            temp = df.iloc[k:].loc[:, ['return', 'time_diff']].cumsum(axis=0)
            conds = (temp['time_diff'] >= np.timedelta64(15, 's'))
            if conds.sum():
                # First row where at least 15 seconds have accumulated
                true_return_index = conds.idxmax()
                df.loc[i, 'return'] = df.loc[true_return_index, 'return']
            else:
                df.loc[i, 'return'] = np.nan
df = df[['time', 'price', 'return']]
print(df)
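For larger frames, the row-wise loop can be replaced by a vectorized lookup. Below is a sketch of the question's "first price at least 15 seconds later" formulation using numpy.searchsorted, assuming df is the original frame sorted by time; it is not part of either answer above:

import numpy as np

times = df['time'].to_numpy()
prices = df['price'].to_numpy()
# For each row, index of the first row at least 15 seconds later
pos = np.searchsorted(times, times + np.timedelta64(15, 's'), side='left')
future = np.where(pos < len(df), prices[pos.clip(max=len(df) - 1)], np.nan)
df['return'] = future - prices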
I have a big data frame. Below you will find an extract of it:
import pandas as pd

lst = [['31122020', 'A', 12], ['31012021', 'A', 14], ['28022021', 'A', 15], ['31032021', 'A', 17]]
df2 = pd.DataFrame(lst, columns=['Date', 'FN', 'AuM'])
I would like to calculate the year-to-date (YTD) % change of the column AuM. The new column should look like this:
lst = [['31122020', 'A', 12, 'NaN'], ['31012021', 'A', 14, 0.167],
       ['28022021', 'A', 15, 0.25], ['31032021', 'A', 17, 0.417]]
df2 = pd.DataFrame(lst, columns=['Date', 'FN', 'AuM', 'AuM_YTD_%Change'])
Do you know any pandas function that can achieve my goal?
You can create a mask for dates inside one year, then use diff + cumsum for the changes, and div for the change rates:
df2['Date'] = pd.to_datetime(df2['Date'], format='%d%m%Y')
# Mask for dates within one year of the first (base) date
msk = df2['Date'] < df2.loc[0, 'Date'] + pd.to_timedelta(365, unit='D')
# Cumulative change from the base AuM, divided by the base AuM
df2['AuM_YTD_%Change'] = df2.loc[msk, 'AuM'].diff().cumsum().div(df2.loc[0, 'AuM'])
Output:
Date FN AuM AuM_YTD_%Change
0 2020-12-31 A 12 NaN
1 2021-01-31 A 14 0.166667
2 2021-02-28 A 15 0.250000
3 2021-03-31 A 17 0.416667
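Since diff followed by cumsum reduces to "current AuM minus the base AuM", the same logic can be written more directly. A sketch equivalent to the code above (the where() call keeps the base row as NaN, matching the output); this is a rewrite of the same logic, not a different method:

base = df2.loc[0, 'AuM']
ytd = (df2.loc[msk, 'AuM'] - base) / base
df2['AuM_YTD_%Change'] = ytd.where(ytd.index > 0)  # base row stays NaN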
I have a dataframe in long format with speed data, sampled at varying intervals and frequencies, for two observation locations (A and B). If I apply the resample method to get the average daily value, I get the average over all rows in a given time interval, not a per-location average.
Does anyone know how to resample the dataframe, keeping the two locations separate, to produce daily average speed data?
import pandas as pd
import numpy as np

# Location A: average speed in miles per hour, sampled every 15 minutes
dti = pd.date_range('2015-01-01', '2015-12-31', freq='15min')
df = pd.DataFrame(index=dti)
df['Location'] = 'A'
df['speed'] = np.random.randint(low=0, high=60, size=len(df.index))
# Location B: sampled every 30 minutes over a longer period
dti2 = pd.date_range('2015-01-01', '2016-06-05', freq='30min')
df2 = pd.DataFrame(index=dti2)
df2['Location'] = 'B'
df2['speed'] = np.random.randint(low=0, high=60, size=len(df2.index))
df = pd.concat([df, df2])  # df.append is removed in pandas 2.0
df2 = df.resample('d').mean(numeric_only=True)  # averages across both locations, which is not what I want
Use groupby and resample:
>>> df.groupby("Location").resample("D").mean().reset_index(0)
Location speed
2015-01-01 A 29.114583
2015-01-02 A 27.083333
2015-01-03 A 31.135417
2015-01-04 A 30.354167
2015-01-05 A 29.427083
... ...
2016-06-01 B 33.770833
2016-06-02 B 28.979167
2016-06-03 B 29.812500
2016-06-04 B 31.270833
2016-06-05 B 42.000000
If you instead want separate columns for A and B, you can use unstack:
>>> df.groupby("Location").resample("D").mean().unstack(0)
speed
Location A B
2015-01-01 29.114583 29.520833
2015-01-02 27.083333 27.291667
2015-01-03 31.135417 30.375000
2015-01-04 30.354167 31.645833
2015-01-05 29.427083 26.645833
... ...
2016-06-01 NaN 33.770833
2016-06-02 NaN 28.979167
2016-06-03 NaN 29.812500
2016-06-04 NaN 31.270833
2016-06-05 NaN 42.000000
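In recent pandas versions, mean() on a frame with non-numeric columns can complain; selecting the speed column explicitly before resampling avoids that. A defensive variant of the same approach, not from the original answer:

daily = df.groupby("Location")["speed"].resample("D").mean().unstack(0)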
I have a data frame as follows.
import pandas as pd

df = pd.DataFrame({'Date': ['2020-08-01', '2020-08-01', '2020-09-01'],
                   'value': [10, 12, 9], 'item': ['a', 'd', 'b']})
I want to convert this to weekly data keeping all the columns apart from the Date column constant.
Expected output
pd.DataFrame({'Date': ['2020-08-01', '2020-08-08', '2020-08-15', '2020-08-22', '2020-08-29',
                       '2020-08-01', '2020-08-08', '2020-08-15', '2020-08-22', '2020-08-29',
                       '2020-09-01', '2020-09-08', '2020-09-15', '2020-09-22', '2020-09-29'],
              'value': [10, 10, 10, 10, 10, 12, 12, 12, 12, 12, 9, 9, 9, 9, 9],
              'item': ['a', 'a', 'a', 'a', 'a', 'd', 'd', 'd', 'd', 'd', 'b', 'b', 'b', 'b', 'b']})
It should be able to convert any month data to weekly data. Date in the input data frame is always the first day of that month.
How do I make this happen?
Thanks in advance.
Since the desired new datetime index is irregular (re-starts at the 1st of each month), an iterative creation of the index is an option:
df = pd.DataFrame({'Date': ['2020-08-01', '2020-09-01'], 'value': [10, 9], 'item': ['a', 'b']})
df = df.set_index(pd.to_datetime(df['Date'])).drop(columns='Date')

dti = pd.to_datetime([])  # start with an empty datetime index
for month in df.index:
    # for each month, add a 7-day-step datetime index to the previous ones
    dti = dti.union(pd.date_range(month, month + pd.DateOffset(months=1), freq='7d'))

# just reindex and forward-fill, no resampling needed
df = df.reindex(dti).ffill()
df
value item
2020-08-01 10.0 a
2020-08-08 10.0 a
2020-08-15 10.0 a
2020-08-22 10.0 a
2020-08-29 10.0 a
2020-09-01 9.0 b
2020-09-08 9.0 b
2020-09-15 9.0 b
2020-09-22 9.0 b
2020-09-29 9.0 b
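Note that the question's sample data has two items for August ('a' and 'd'), and reindex fails on duplicate dates. One sketch that handles several rows per month is to expand each row independently and concatenate the pieces; the helper name expand_row is illustrative, not from the original answer:

import pandas as pd

def expand_row(row):
    # 7-day steps from the month's first day, stopping before the next month begins
    weeks = pd.date_range(row['Date'], row['Date'] + pd.DateOffset(months=1), freq='7d')
    return pd.DataFrame({'Date': weeks, 'value': row['value'], 'item': row['item']})

df = pd.DataFrame({'Date': ['2020-08-01', '2020-08-01', '2020-09-01'],
                   'value': [10, 12, 9], 'item': ['a', 'd', 'b']})
df['Date'] = pd.to_datetime(df['Date'])
out = pd.concat([expand_row(r) for _, r in df.iterrows()], ignore_index=True)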
I added one more date to your data and then used resample (note that the 'W' frequency anchors weeks on Sundays, which is why the output dates start at 2020-08-02):
df = pd.DataFrame({'Date':['2020-08-01', '2020-09-01'],'value':[10, 9],'item':['a', 'b']})
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
df = df.resample('W').ffill().reset_index()
print(df)
Date value item
0 2020-08-02 10 a
1 2020-08-09 10 a
2 2020-08-16 10 a
3 2020-08-23 10 a
4 2020-08-30 10 a
5 2020-09-06 9 b
I have a dataframe like the following:
Index Diff
2019-03-14 11:32:21.583000+00:00 0
2019-03-14 11:32:21.583000+00:00 2
2019-04-14 11:32:21.600000+00:00 13
2019-04-14 11:32:21.600000+00:00 14
2019-05-14 11:32:21.600000+00:00 19
2019-05-14 11:32:21.600000+00:00 27
What would be the best approach to group by the month and take the difference inside of those months?
Using the .diff() option I am able to find the difference between each row, but I am trying to use the df.groupby(pd.Grouper(freq='M')) with no success.
Expected Output:
Index Diff
0 2019-03-31 00:00:00+00:00 2.0
1 2019-04-30 00:00:00+00:00 1.0
2 2019-05-31 00:00:00+00:00 8.0
Any help would be much appreciated!!
Depending on whether or not your dates are on the index, you may be able to skip df1 = df.reset_index(). Also check that the index is a DatetimeIndex; if it is not, convert it with df.index = pd.to_datetime(df.index). You can then compute the within-month differences with df1.groupby(pd.Grouper(key='Index', freq='M'))['Diff'].diff() and aggregate the full dataframe per month afterwards:
input:
import pandas as pd
df = pd.DataFrame(
    {'Diff': [0, 2, 13, 14, 19, 27]},
    # An explicit index keeps the duplicate timestamps that a dict would silently collapse
    index=['2019-03-14 11:32:21.583000+00:00',
           '2019-03-14 11:32:21.583000+00:00',
           '2019-04-14 11:32:21.600000+00:00',
           '2019-04-14 11:32:21.600000+00:00',
           '2019-05-14 11:32:21.600000+00:00',
           '2019-05-14 11:32:21.600000+00:00'])
df.index.name = 'Index'
df.index = pd.to_datetime(df.index)
code:
df1 = df.reset_index()
# within-month differences between consecutive readings
df1['Diff'] = df1.groupby(pd.Grouper(key='Index', freq='M'))['Diff'].diff()
# keep one row per month: the (only) non-NaN difference
df1 = df1.groupby(pd.Grouper(key='Index', freq='M'))['Diff'].max().reset_index()
df1
output:
Index Diff
0 2019-03-31 00:00:00+00:00 2.0
1 2019-04-30 00:00:00+00:00 1.0
2 2019-05-31 00:00:00+00:00 8.0
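For this particular data (two readings per month), the same result can also be written as a single grouped aggregation over the original df. A sketch equivalent to the two-step code above, not a separate method:

out = (df['Diff']
       .groupby(pd.Grouper(freq='M'))
       .agg(lambda s: s.diff().max())
       .reset_index())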