pandas Grouper not upsampling as expected - python

Consider a Series with a MultiIndex that provides a natural grouping value on level 0 and a time series on level 1:
import pandas as pd

s = pd.Series(range(12), index=pd.MultiIndex.from_product(
    [['a', 'b', 'c'], pd.date_range(start='2019-01-01', freq='3D', periods=4)],
    names=['grp', 'ts']))
print(s)
grp ts
a 2019-01-01 0
2019-01-04 1
2019-01-07 2
2019-01-10 3
b 2019-01-01 4
2019-01-04 5
2019-01-07 6
2019-01-10 7
c 2019-01-01 8
2019-01-04 9
2019-01-07 10
2019-01-10 11
Length: 12, dtype: int64
I want to upsample the time series for each outer index value, say with a simple forward fill action:
s.groupby(['grp', pd.Grouper(level=1, freq='D')]).ffill()
This produces an unexpected result; namely, it doesn't do anything. The result is exactly s, rather than what I want, which would be:
grp ts
a 2019-01-01 0
2019-01-02 0
2019-01-03 0
2019-01-04 1
2019-01-05 1
2019-01-06 1
2019-01-07 2
2019-01-08 2
2019-01-09 2
2019-01-10 3
b 2019-01-01 4
2019-01-02 4
2019-01-03 4
2019-01-04 5
2019-01-05 5
2019-01-06 5
2019-01-07 6
2019-01-08 6
2019-01-09 6
2019-01-10 7
c 2019-01-01 8
2019-01-02 8
2019-01-03 8
2019-01-04 9
2019-01-05 9
2019-01-06 9
2019-01-07 10
2019-01-08 10
2019-01-09 10
2019-01-10 11
Length: 30, dtype: int64
I can change the Grouper freq or the resampling function to the same effect. The one workaround I found goes through some creative trickery to force a simple time series index on each group (thank you Allen for providing the answer https://stackoverflow.com/a/44719843/3109201):
s.reset_index(level=1).groupby('grp').apply(lambda s: s.set_index('ts').resample('D').ffill())
which is slightly different from what I was originally asking for, because it returns a DataFrame:
0
grp ts
a 2019-01-01 0
2019-01-02 0
2019-01-03 0
2019-01-04 1
2019-01-05 1
2019-01-06 1
2019-01-07 2
2019-01-08 2
2019-01-09 2
2019-01-10 3
b 2019-01-01 4
2019-01-02 4
2019-01-03 4
2019-01-04 5
2019-01-05 5
2019-01-06 5
2019-01-07 6
2019-01-08 6
2019-01-09 6
2019-01-10 7
c 2019-01-01 8
2019-01-02 8
2019-01-03 8
2019-01-04 9
2019-01-05 9
2019-01-06 9
2019-01-07 10
2019-01-08 10
2019-01-09 10
2019-01-10 11
[30 rows x 1 columns]
I can and will use this workaround, but I'd like to know why the simpler (and frankly more elegant) method is not working.

Use Series.asfreq(), which fills in the missing dates.
def filldates(s_in):
    s_in.reset_index(level="grp", drop=True, inplace=True)
    s_in = s_in.asfreq("1D", method='ffill')
    return s_in

s.groupby(level=0).apply(filldates)
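As for why the groupby version appears to do nothing: groupby(...).ffill() is a transform, so its output stays aligned to the existing index, and a finer Grouper frequency never creates new rows; adding rows is what resample is for. Below is a hedged sketch of mine (not part of the answer above, assuming a pandas version that has Series.droplevel) that keeps the result a Series by resampling each group on a plain DatetimeIndex:
# Sketch: inside each group, drop the 'grp' level so only the DatetimeIndex
# remains, then upsample to daily and forward fill; apply reattaches 'grp'
# as the outer index level.
upsampled = s.groupby(level='grp').apply(
    lambda g: g.droplevel('grp').resample('D').ffill()
)
print(upsampled)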

Related

pandas consecutive Boolean event rollup time series

Here's some made-up time series data on 1-minute intervals:
import pandas as pd
import numpy as np
np.random.seed(5)  # seed numpy's RNG so np.random.rand below is reproducible
rows, cols = 8760, 3
data = np.random.rand(rows, cols)
tidx = pd.date_range('2019-01-01', periods=rows, freq='1T')
df = pd.DataFrame(data, columns=['condition1', 'condition2', 'condition3'], index=tidx)
This is just some code to create some Boolean columns
df['condition1_bool'] = df['condition1'].lt(.1)
df['condition2_bool'] = df['condition2'].lt(df['condition1']) & df['condition2'].gt(df['condition3'])
df['condition3_bool'] = df['condition3'].gt(.9)
df = df[['condition1_bool','condition2_bool','condition3_bool']]
df = df.astype(int)
On my screen this prints:
condition1_bool condition2_bool condition3_bool
2019-01-01 00:00:00 0 0 0
2019-01-01 00:01:00 0 0 1 <---- Count as same event!
2019-01-01 00:02:00 0 0 1 <---- Count as same event!
2019-01-01 00:03:00 1 0 0
2019-01-01 00:04:00 0 0 0
What I am trying to figure out is how to roll up cumulative events (True or 1) per hour, but if there is no 0 between events, it's the same event! Hopefully that makes sense given what I was describing above with the <---- Count as same event!
If I do:
df = df.resample('H').sum()
This will just resample and count all events, regardless of the consecutive-event rule I was trying to highlight with the <---- Count as same event!
Thanks for any tips!!
Check if the current row ("2019-01-01 00:02:00") equals 1 and the previous row ("2019-01-01 00:01:00") does not equal 1. This removes consecutive 1s from the sum.
>>> df.resample('H').apply(lambda x: (x.eq(1) & x.shift().ne(1)).sum())
condition1_bool condition2_bool condition3_bool
2019-01-01 00:00:00 4 8 4
2019-01-01 01:00:00 9 7 6
2019-01-01 02:00:00 7 14 4
2019-01-01 03:00:00 2 8 7
2019-01-01 04:00:00 4 9 5
... ... ... ...
2019-01-06 21:00:00 4 8 2
2019-01-06 22:00:00 3 11 4
2019-01-06 23:00:00 6 11 4
2019-01-07 00:00:00 8 7 8
2019-01-07 01:00:00 4 9 6
[146 rows x 3 columns]
Using your code:
>>> df.resample('H').sum()
condition1_bool condition2_bool condition3_bool
2019-01-01 00:00:00 5 8 5
2019-01-01 01:00:00 9 8 6
2019-01-01 02:00:00 7 14 5
2019-01-01 03:00:00 2 9 7
2019-01-01 04:00:00 4 11 5
... ... ... ...
2019-01-06 21:00:00 5 11 3
2019-01-06 22:00:00 3 15 4
2019-01-06 23:00:00 6 12 4
2019-01-07 00:00:00 8 7 10
2019-01-07 01:00:00 4 9 7
[146 rows x 3 columns]
Check:
dti = pd.date_range('2021-11-15 21:00:00', '2021-11-15 22:00:00',
                    closed='left', freq='T')
df1 = pd.DataFrame({'c1': 1}, index=dti)
>>> df1.resample('H').apply(lambda x: (x.eq(1) & x.shift().ne(1)).sum())
c1
2021-11-15 21:00:00 1
>>> df1.resample('H').sum()
c1
2021-11-15 21:00:00 60
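As a hedged variant of the same rising-edge idea (my addition, not part of the answer above): the event starts can be flagged once over the whole series and then bucketed; this can differ slightly from the per-hour lambda at hour boundaries, because the shift here does not restart inside each bucket.
# Flag the start of each run of 1s: the row is 1 and the previous row is not.
bools = df.astype(bool)
starts = bools & ~bools.shift(fill_value=False)
# Count event starts per hour.
hourly_events = starts.resample('H').sum()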

How to refer to other rows in Pandas DataFrame in context of a single row?

I have the following example Pandas DataFrame
df
UserID Total Date
1 20 2019-01-01
1 18 2019-01-02
1 22 2019-01-03
1 16 2019-01-04
1 17 2019-01-05
1 26 2019-01-06
1 30 2019-01-07
1 28 2019-01-08
1 28 2019-01-09
1 28 2019-01-10
2 22 2019-01-01
2 11 2019-01-02
2 23 2019-01-03
2 14 2019-01-04
2 19 2019-01-05
2 29 2019-01-06
2 21 2019-01-07
2 22 2019-01-08
2 30 2019-01-09
2 16 2019-01-10
3 27 2019-01-01
3 13 2019-01-02
3 12 2019-01-03
3 27 2019-01-04
3 26 2019-01-05
3 26 2019-01-06
3 30 2019-01-07
3 19 2019-01-08
3 27 2019-01-09
3 29 2019-01-10
4 29 2019-01-01
4 12 2019-01-02
4 25 2019-01-03
4 11 2019-01-04
4 19 2019-01-05
4 20 2019-01-06
4 33 2019-01-07
4 24 2019-01-08
4 22 2019-01-09
4 24 2019-01-10
What I'm trying to achieve is to add a column TotalPast3Days that is basically the sum of Total of the previous 3 days (excluding the current date in the row) for that particular UserID
How can this be done?
For the first 3 days you will get NaN because there are no "previous 3 days (excluding the current date in the row)"; but, for the rest, you can use shift within each UserID group, like df['TotalPast3Days'] = df.groupby('UserID')['Total'].shift(1) + df.groupby('UserID')['Total'].shift(2) + df.groupby('UserID')['Total'].shift(3)
totals = []
for i in range(len(df.index)):
    if i < 3:
        totals.append(0)
    elif df['UserID'].iloc[i] == df['UserID'].iloc[i-3]:
        total = (df['Total'].iloc[i-1] +
                 df['Total'].iloc[i-2] +
                 df['Total'].iloc[i-3])
        totals.append(total)
    else:
        totals.append(0)
df['Sum of past 3'] = totals
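For completeness, a hedged vectorized sketch (my addition, assuming one row per UserID per date and rows sorted by date within each user): shift by one row so the current day is excluded, then take a 3-row rolling sum within each UserID group.
# Per user: exclude the current row with shift(1), then sum the previous 3 rows;
# the first three rows of each user come out as NaN.
df['TotalPast3Days'] = (
    df.groupby('UserID')['Total']
      .transform(lambda s: s.shift(1).rolling(3).sum())
)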

How to return CUMSUM of Days in Dates - Python

How do I return the CumSum number of days for the dates provided?
import pandas as pd
df = pd.DataFrame({
    'date': ['2019-01-01', '2019-01-03', '2019-01-05',
             '2019-01-06', '2019-01-07', '2019-01-08',
             '2019-01-09', '2019-01-12', '2019-01-013']})
df['date'].cumsum() does not work here.
Desired dataframe:
date Cumsum days
0 2019-01-01 0
1 2019-01-03 2
2 2019-01-05 4
3 2019-01-06 5
4 2019-01-07 6
5 2019-01-08 7
6 2019-01-09 8
7 2019-01-12 9
8 2019-01-013 11
Another way is to call diff, fillna and cumsum, after converting the strings to datetime:
df['date'] = pd.to_datetime(df['date'])
df['cumsum days'] = df['date'].diff().dt.days.fillna(0).cumsum()
Out[2044]:
date cumsum days
0 2019-01-01 0.0
1 2019-01-03 2.0
2 2019-01-05 4.0
3 2019-01-06 5.0
4 2019-01-07 6.0
5 2019-01-08 7.0
6 2019-01-09 8.0
7 2019-01-12 11.0
8 2019-01-13 12.0
Thanks, this should work:
def generate_time_delta_column(df, time_column, date_first_online_column):
    return (df[time_column] - df[date_first_online_column]).dt.days
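A hedged alternative sketch (my addition, assuming the date strings parse cleanly and are already sorted): measure every date against the first one instead of accumulating day differences.
# Parse once, then take the day offset from the first date.
dates = pd.to_datetime(df['date'])
df['Cumsum days'] = (dates - dates.iloc[0]).dt.days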

Efficiently creating columns from GroupBy Operations

Given is a DataFrame like this:
kind seen
0 tiger 2019-01-01
1 tiger 2019-01-02
2 bird 2019-01-03
3 whale 2019-01-04
4 bird 2019-01-05
5 tiger 2019-01-06
6 bird 2019-01-07
The goal is to group the DataFrame by the kind of animal and have the two latest dates as column values:
last_seen second_last_seen
bird 2019-01-07 2019-01-05
tiger 2019-01-06 2019-01-02
whale 2019-01-04 NaT
My current solution is vastly inefficient; it goes like this:
1. Creating the Dataframe
import pandas as pd
data = {"kind": ["tiger", "tiger", "bird", "whale", "bird", "tiger", "bird"],
"seen": pd.date_range('2019-01-01', periods = 7)}
df = pd.DataFrame(data)
Dataframe:
kind seen
0 tiger 2019-01-01
1 tiger 2019-01-02
2 bird 2019-01-03
3 whale 2019-01-04
4 bird 2019-01-05
5 tiger 2019-01-06
6 bird 2019-01-07
2. Calculating the latest dates with groupby
df = df.groupby('kind')['seen'].nlargest(2)
Dataframe:
kind
bird 6 2019-01-07
4 2019-01-05
tiger 5 2019-01-06
1 2019-01-02
whale 3 2019-01-04
Here lies the problem: the second level of the MultiIndex keeps the original row indices of the dates as its values.
Meaning, if I now df.unstack() the Dataframe it looks like this:
1 3 4 5 6
kind
bird NaT NaT 2019-01-05 NaT 2019-01-07
tiger 2019-01-02 NaT NaT 2019-01-06 NaT
whale NaT 2019-01-04 NaT NaT NaT
the goal is to look like this:
last_seen second_last_seen
bird 2019-01-07 2019-01-05
tiger 2019-01-06 2019-01-02
whale 2019-01-04 NaT
3. Transform the Dataframe in a really ugly way
I change the second level of the MultiIndex to values that allow df.unstack() to unstack the DataFrame into the goal DataFrame:
# Keeping track of the latest animal seen
predecessor_id = None
counter = 1
result = list()
for row in df.index:
    if predecessor_id != row[0]:
        counter = 1
    else:
        counter += 1
    result.append((row[0], counter))
    predecessor_id = row[0]
df.index = pd.MultiIndex.from_tuples(result)
Dataframe:
bird 1 2019-01-07
2 2019-01-05
tiger 1 2019-01-06
2 2019-01-02
whale 1 2019-01-04
df.unstack() and renaming the columns then gives us the goal DataFrame:
last_seen second_last_seen
bird 2019-01-07 2019-01-05
tiger 2019-01-06 2019-01-02
whale 2019-01-04 NaT
Needless to say, this solution is overkill and unpythonic to the core.
Thank you for your time and happy holidays!
Here is a way:
grp = df.groupby('kind')['seen'].nlargest(2).droplevel(1).to_frame()
grp = grp.set_index(grp.groupby(grp.index).cumcount(), append=True).unstack()
grp.columns = ['last_seen', 'second_last_seen']
print(grp)
last_seen second_last_seen
kind
bird 2019-01-07 2019-01-05
tiger 2019-01-06 2019-01-02
whale 2019-01-04 NaT
s = df.groupby('kind')['seen'].tail(2)
new_df = df.loc[df['seen'].isin(s)].groupby('kind').agg(['last','first'])
Then we just need to remove values where first and last are the same, indicating there was only one value in the original data frame.
new_df.columns = new_df.columns.droplevel()
new_df.loc[new_df['first'] == new_df['last'], 'last'] = pd.NaT
new_df.columns = new_df.columns.map(lambda x: x + '_seen')
last_seen first_seen
kind
bird 2019-01-07 2019-01-05
tiger 2019-01-06 2019-01-02
whale NaT 2019-01-04
You can do something like this:
g = df.sort_values('seen').groupby('kind')['seen']
df2 = g.nth(-1).rename('last_seen').to_frame()
df2['second_last_seen'] = g.nth(-2)
The result will be:
last_seen second_last_seen
kind
bird 2019-01-07 2019-01-05
tiger 2019-01-06 2019-01-02
whale 2019-01-04 NaT
And you can use this solution if you want more columns:
g = df.sort_values('seen').groupby('kind')['seen']
df2 = g.nth(-1).rename('last_seen').to_frame()
for k in range(2, 4):
    df2[str(k) + '_last_seen'] = g.nth(-k)
Which results in:
last_seen 2_last_seen 3_last_seen
kind
bird 2019-01-07 2019-01-05 2019-01-03
tiger 2019-01-06 2019-01-02 2019-01-01
whale 2019-01-04 NaT NaT
UPD: added a sort by the 'seen' column because it is necessary in the general case. Thanks #aitak
Yet another solution (if "seen" is of Timestamp dtype):
s=df.groupby("kind")["seen"].agg(lambda t: t.nlargest(2).to_list())
s
kind
bird [2019-01-07 00:00:00, 2019-01-05 00:00:00]
tiger [2019-01-06 00:00:00, 2019-01-02 00:00:00]
whale [2019-01-04 00:00:00]
Name: seen, dtype: object
pd.DataFrame( s.to_list(),index=s.index).rename(columns={0:"last_seen",1:"second_last_seen"})
last_seen second_last_seen
kind
bird 2019-01-07 2019-01-05
tiger 2019-01-06 2019-01-02
whale 2019-01-04 NaT
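One more hedged sketch (my own addition, not one of the answers above): rank rows within each kind by recency with cumcount, keep the top two, and pivot the rank into columns.
# Sort newest-first, number rows within each kind, keep the two most recent,
# then pivot the rank into the last_seen / second_last_seen columns.
tmp = df.sort_values('seen', ascending=False)
tmp = tmp.assign(rank=tmp.groupby('kind').cumcount())
out = (tmp[tmp['rank'] < 2]
       .pivot(index='kind', columns='rank', values='seen')
       .rename(columns={0: 'last_seen', 1: 'second_last_seen'}))
print(out)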

Pandas Rolling mean based on groupby multiple columns

I have a long-format DataFrame with repeated values in two columns and data in another column. I want to find SMAs (simple moving averages) for each group. My problem is: rolling() simply ignores the fact that the data is grouped by two columns.
Here is some dummy data and code.
import numpy as np
import pandas as pd
dtix = pd.Series(pd.date_range(start='1/1/2019', periods=4))
df = pd.DataFrame({'ix1': np.repeat([0, 1], 4),
                   'ix2': pd.concat([dtix, dtix]),
                   'data': np.arange(0, 8)})
df
ix1 ix2 data
0 0 2019-01-01 0
1 0 2019-01-02 1
2 0 2019-01-03 2
3 0 2019-01-04 3
0 1 2019-01-01 4
1 1 2019-01-02 5
2 1 2019-01-03 6
3 1 2019-01-04 7
Now when I perform a grouped rolling mean on this data, I am getting an output like this:
df.groupby(['ix1','ix2']).agg({'data':'mean'}).rolling(2).mean()
data
ix1 ix2
0 2019-01-01 NaN
2019-01-02 0.5
2019-01-03 1.5
2019-01-04 2.5
1 2019-01-01 3.5
2019-01-02 4.5
2019-01-03 5.5
2019-01-04 6.5
Desired Output:
Whereas, what I would actually like to have is this:
sma
ix1 ix2
0 2019-01-01 NaN
2019-01-02 0.5
2019-01-03 1.5
2019-01-04 2.5
1 2019-01-01 NaN
2019-01-02 4.5
2019-01-03 5.5
2019-01-04 6.5
Will appreciate your help with this.
Use another groupby by the first level (ix1) with rolling:
df1 = (df.groupby(['ix1', 'ix2'])
         .agg({'data': 'mean'})
         .groupby(level=0, group_keys=False)
         .rolling(2)
         .mean())
print(df1)
data
ix1 ix2
0 2019-01-01 NaN
2019-01-02 0.5
2019-01-03 1.5
2019-01-04 2.5
1 2019-01-01 NaN
2019-01-02 4.5
2019-01-03 5.5
2019-01-04 6.5
In your solution, the aggregation returns a one-column DataFrame, so the chained rolling works over all rows, not per group as needed:
print(df.groupby(['ix1','ix2']).agg({'data':'mean'}))
data
ix1 ix2
0 2019-01-01 0
2019-01-02 1
2019-01-03 2
2019-01-04 3
1 2019-01-01 4
2019-01-02 5
2019-01-03 6
2019-01-04 7
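An equivalent hedged sketch (my addition, not from the answer above) that avoids the second chained groupby by aggregating first and then transforming per ix1 group:
# Aggregate to one value per (ix1, ix2), then compute the 2-period SMA within
# each ix1 group; transform keeps the result aligned with the aggregated index.
sma = df.groupby(['ix1', 'ix2'])['data'].mean().to_frame('sma')
sma['sma'] = sma.groupby(level='ix1')['sma'].transform(lambda s: s.rolling(2).mean())
print(sma)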
