Efficiently creating columns from GroupBy Operations - python

Given a DataFrame like this:
kind seen
0 tiger 2019-01-01
1 tiger 2019-01-02
2 bird 2019-01-03
3 whale 2019-01-04
4 bird 2019-01-05
5 tiger 2019-01-06
6 bird 2019-01-07
The goal is to group the dataframe by the kind of animal and have the two latest dates as column values:
last_seen second_last_seen
bird 2019-01-07 2019-01-05
tiger 2019-01-06 2019-01-02
whale 2019-01-04 NaT
My current solution is vastly inefficient; it goes like this:
1. Creating the Dataframe
import pandas as pd
data = {"kind": ["tiger", "tiger", "bird", "whale", "bird", "tiger", "bird"],
"seen": pd.date_range('2019-01-01', periods = 7)}
df = pd.DataFrame(data)
Dataframe:
kind seen
0 tiger 2019-01-01
1 tiger 2019-01-02
2 bird 2019-01-03
3 whale 2019-01-04
4 bird 2019-01-05
5 tiger 2019-01-06
6 bird 2019-01-07
2. Calculating the latest dates with groupby
df = df.groupby('kind')['seen'].nlargest(2)
Dataframe:
kind
bird 6 2019-01-07
4 2019-01-05
tiger 5 2019-01-06
1 2019-01-02
whale 3 2019-01-04
Here lies the problem: the second level of the MultiIndex keeps the original row indices of the dates as its values. Meaning, if I now df.unstack() the DataFrame, it looks like this:
1 3 4 5 6
kind
bird NaT NaT 2019-01-05 NaT 2019-01-07
tiger 2019-01-02 NaT NaT 2019-01-06 NaT
whale NaT 2019-01-04 NaT NaT NaT
The goal is for it to look like this:
last_seen second_last_seen
bird 2019-01-07 2019-01-05
tiger 2019-01-06 2019-01-02
whale 2019-01-04 NaT
3. Transform the Dataframe in a really ugly way
I change the second level of the MultiIndex to values that allow df.unstack() to produce the goal DataFrame:
# Keeping track of the latest animal seen
predecessor_id = None
counter = 1
result = list()
for row in df.index:
    if predecessor_id != row[0]:
        counter = 1
    else:
        counter += 1
    result.append((row[0], counter))
    predecessor_id = row[0]
df.index = pd.MultiIndex.from_tuples(result)
Dataframe:
bird 1 2019-01-07
2 2019-01-05
tiger 1 2019-01-06
2 2019-01-02
whale 1 2019-01-04
df.unstack() and renaming the columns then gives us the goal DataFrame:
last_seen second_last_seen
bird 2019-01-07 2019-01-05
tiger 2019-01-06 2019-01-02
whale 2019-01-04 NaT
Needless to say, this solution is overkill and unpythonic to the core.
Thank you for your time and happy holidays!

Here is a way:
grp = df.groupby('kind')['seen'].nlargest(2).droplevel(1).to_frame()
grp = grp.set_index(grp.groupby(grp.index).cumcount(), append=True).unstack()
grp.columns = ['last_seen', 'second_last_seen']
print(grp)
last_seen second_last_seen
kind
bird 2019-01-07 2019-01-05
tiger 2019-01-06 2019-01-02
whale 2019-01-04 NaT
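The same droplevel/cumcount/unstack trick generalizes to the N most recent dates. A minimal sketch (n, grpn and the seen_i column names are hypothetical, not from the answer above):
n = 3  # hypothetical: how many of the most recent sightings to keep per kind
grpn = df.groupby('kind')['seen'].nlargest(n).droplevel(1).to_frame()
grpn = grpn.set_index(grpn.groupby(grpn.index).cumcount(), append=True).unstack()
grpn.columns = [f'seen_{i + 1}' for _, i in grpn.columns]  # rank 0 = latest
print(grpn)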

s = df.groupby('kind')['seen'].tail(2)
new_df = df.loc[df['seen'].isin(s)].groupby('kind').agg(['last','first'])
Then we just need to remove values where first and last are the same, indicating there was only one value in the original DataFrame:
new_df.columns = new_df.columns.droplevel()
new_df.loc[new_df['first'] == new_df['last'], 'last'] = pd.NaT
new_df.columns = new_df.columns.map(lambda x: x + '_seen')
last_seen first_seen
kind
bird 2019-01-07 2019-01-05
tiger 2019-01-06 2019-01-02
whale NaT 2019-01-04
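Note that this approach puts the single whale sighting in first_seen while last_seen becomes NaT, which is reversed relative to the question's goal. A small follow-up sketch (assuming new_df as built above) that renames the columns and swaps those single-sighting cases:
goal = new_df.rename(columns={'first_seen': 'second_last_seen'})
single = goal['last_seen'].isna()                        # kinds seen only once
goal.loc[single, 'last_seen'] = goal.loc[single, 'second_last_seen']
goal.loc[single, 'second_last_seen'] = pd.NaT
print(goal)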

You can do something like this:
g = df.sort_values('seen').groupby('kind')['seen']
df2 = g.nth(-1).rename('last_seen').to_frame()
df2['second_last_seen'] = g.nth(-2)
The result will be:
last_seen second_last_seen
kind
bird 2019-01-07 2019-01-05
tiger 2019-01-06 2019-01-02
whale 2019-01-04 NaT
And you can use this solution if you want more columns:
g = df.sort_values('seen').groupby('kind')['seen']
df2 = g.nth(-1).rename('last_seen').to_frame()
for k in range(2, 4):
    df2[str(k) + '_last_seen'] = g.nth(-k)
Which results in:
last_seen 2_last_seen 3_last_seen
kind
bird 2019-01-07 2019-01-05 2019-01-03
tiger 2019-01-06 2019-01-02 2019-01-01
whale 2019-01-04 NaT NaT
UPD: added a sort by the 'seen' column because it is necessary in the general case. Thanks #aitak
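To see why the sort matters, a small hedged illustration (the exact return format of nth varies between pandas versions, but the order dependence is the point):
shuffled = df.sample(frac=1, random_state=0)             # scramble the row order
print(shuffled.groupby('kind')['seen'].nth(-1))                       # order-dependent result
print(shuffled.sort_values('seen').groupby('kind')['seen'].nth(-1))   # always the latest date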

Yet another solution (if "seen" is of Timestamp dtype):
s = df.groupby("kind")["seen"].agg(lambda t: t.nlargest(2).to_list())
s
kind
bird [2019-01-07 00:00:00, 2019-01-05 00:00:00]
tiger [2019-01-06 00:00:00, 2019-01-02 00:00:00]
whale [2019-01-04 00:00:00]
Name: seen, dtype: object
pd.DataFrame(s.to_list(), index=s.index).rename(columns={0: "last_seen", 1: "second_last_seen"})
last_seen second_last_seen
kind
bird 2019-01-07 2019-01-05
tiger 2019-01-06 2019-01-02
whale 2019-01-04 NaT

Related

How to return CUMSUM of Days in Dates - Python

How do I return the CumSum number of days for the dates provided?
import pandas as pd
df = pd.DataFrame({
    'date': ['2019-01-01', '2019-01-03', '2019-01-05',
             '2019-01-06', '2019-01-07', '2019-01-08',
             '2019-01-09', '2019-01-12', '2019-01-013']})
df['date'].cumsum() does not work here.
Desired dataframe:
date Cumsum days
0 2019-01-01 0
1 2019-01-03 2
2 2019-01-05 4
3 2019-01-06 5
4 2019-01-07 6
5 2019-01-08 7
6 2019-01-09 8
7 2019-01-12 9
8 2019-01-013 11
Another way is calling diff, fillna and cumsum
df['cumsum days'] = df['date'].diff().dt.days.fillna(0).cumsum()
Out[2044]:
date cumsum days
0 2019-01-01 0.0
1 2019-01-03 2.0
2 2019-01-05 4.0
3 2019-01-06 5.0
4 2019-01-07 6.0
5 2019-01-08 7.0
6 2019-01-09 8.0
7 2019-01-12 11.0
8 2019-01-13 12.0
Thanks, this should work:
def generate_time_delta_column(df, time_column, date_first_online_column):
    return (df[time_column] - df[date_first_online_column]).dt.days
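For completeness, a minimal end-to-end sketch of the diff/fillna/cumsum answer above; the 'date' column in the question is a plain string column, so it needs pd.to_datetime first (the last date is written here as 2019-01-13):
import pandas as pd

df = pd.DataFrame({'date': ['2019-01-01', '2019-01-03', '2019-01-05',
                            '2019-01-06', '2019-01-07', '2019-01-08',
                            '2019-01-09', '2019-01-12', '2019-01-13']})
df['date'] = pd.to_datetime(df['date'])       # strings -> Timestamps
df['cumsum days'] = (df['date'].diff()        # gap in days to the previous row
                               .dt.days
                               .fillna(0)     # the first row has no gap
                               .cumsum()
                               .astype(int))  # optional: back to integers
print(df)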

Fill multiple rows in between pandas dataframe rows on condition

I have a dataset like below:
pd.DataFrame({'Date':['2019-01-01','2019-01-03','2019-01-01','2019-01-04','2019-01-01','2019-01-03'],'Name':['A','A','B','B','C','C'],'Open Price':[100,200,300,400,500,600],'Close Price':[200,300,400,500,600,700]})
Now we can see that we have a few day entries missing in this table, i.e. 2019-01-02 for A; 2019-01-02 and 2019-01-03 for B; and 2019-01-02 for C.
What I'm looking to do is add dummy rows in the dataframe for these dates, with the Close Price column set to the next day's Open Price entry. I don't care about the Open Price; it can be either NaN or 0.
Expected output
pd.DataFrame({'Date':['2019-01-01','2019-01-02','2019-01-03','2019-01-01','2019-01-02','2019-01-03','2019-01-04','2019-01-01','2019-01-02','2019-01-03'],'Name':['A','A','A','B','B','B','B','C','C','C'],'Open Price':[50,'nan',150,250,'nan','nan',350,450,'nan',550],'Close Price':[200,150,300,400,350,350,500,600,550,700]})
Any help would be appreciated !
Your logic for how the prices should be interpolated is fuzzy, but to get you started, consider this, remembering to get Date into a datetime dtype:
df['Date'] = pd.to_datetime(df['Date'])
df = (df.groupby('Name')
        .resample('D', on='Date')
        .mean()
        .swaplevel()
        .interpolate()
      )
print(df)
Open Price Close Price
Date Name
2019-01-01 A 100.000000 200.000000
2019-01-02 A 150.000000 250.000000
2019-01-03 A 200.000000 300.000000
2019-01-01 B 300.000000 400.000000
2019-01-02 B 333.333333 433.333333
2019-01-03 B 366.666667 466.666667
2019-01-04 B 400.000000 500.000000
2019-01-01 C 500.000000 600.000000
2019-01-02 C 550.000000 650.000000
2019-01-03 C 600.000000 700.000000
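If you want to follow the question's literal rule instead (dummy rows with an empty Open Price and a Close Price copied from the next available day's Open Price), a possible sketch; it assumes real rows always have an Open Price, and fill_days is a hypothetical helper name:
import pandas as pd

df = pd.DataFrame({'Date': ['2019-01-01', '2019-01-03', '2019-01-01',
                            '2019-01-04', '2019-01-01', '2019-01-03'],
                   'Name': ['A', 'A', 'B', 'B', 'C', 'C'],
                   'Open Price': [100, 200, 300, 400, 500, 600],
                   'Close Price': [200, 300, 400, 500, 600, 700]})
df['Date'] = pd.to_datetime(df['Date'])

def fill_days(g):
    # reindex this Name onto a full daily range; the new rows are all-NaN
    full = pd.date_range(g['Date'].min(), g['Date'].max(), freq='D')
    g = g.set_index('Date').reindex(full)
    gap = g['Open Price'].isna()              # rows that were missing originally
    # close price of a dummy row = next available day's open price
    g.loc[gap, 'Close Price'] = g['Open Price'].bfill()[gap]
    return g.drop(columns='Name')

out = df.groupby('Name').apply(fill_days)
print(out)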

pandas Grouper not upsampling as expected

Consider a Series with a MultiIndex that provides a natural grouping value on level 0 and time series on level 1:
s = pd.Series(range(12),
              index=pd.MultiIndex.from_product(
                  [['a', 'b', 'c'], pd.date_range(start='2019-01-01', freq='3D', periods=4)],
                  names=['grp', 'ts']))
print(s)
grp ts
a 2019-01-01 0
2019-01-04 1
2019-01-07 2
2019-01-10 3
b 2019-01-01 4
2019-01-04 5
2019-01-07 6
2019-01-10 7
c 2019-01-01 8
2019-01-04 9
2019-01-07 10
2019-01-10 11
Length: 12, dtype: int64
I want to upsample the time series for each outer index value, say with a simple forward fill action:
s.groupby(['grp', pd.Grouper(level=1, freq='D')]).ffill()
Which produces unexpected results; namely, it doesn't do anything. The result is exactly s rather than what I desire, which would be:
grp ts
a 2019-01-01 0
2019-01-02 0
2019-01-03 0
2019-01-04 1
2019-01-05 1
2019-01-06 1
2019-01-07 2
2019-01-08 2
2019-01-09 2
2019-01-10 3
b 2019-01-01 4
2019-01-02 4
2019-01-03 4
2019-01-04 5
2019-01-05 5
2019-01-06 5
2019-01-07 6
2019-01-08 6
2019-01-09 6
2019-01-10 7
c 2019-01-01 8
2019-01-02 8
2019-01-03 8
2019-01-04 9
2019-01-05 9
2019-01-06 9
2019-01-07 10
2019-01-08 10
2019-01-09 10
2019-01-10 11
Length: 30, dtype: int64
I can change the Grouper freq or the resample function to same effect. The one workaround I found was through creative trickery to force a simple time series index on each group (thank you Allen for providing the answer https://stackoverflow.com/a/44719843/3109201):
s.reset_index(level=1).groupby('grp').apply(lambda s: s.set_index('ts').resample('D').ffill())
which is slightly different from what I was originally asking for, because it returns a DataFrame:
0
grp ts
a 2019-01-01 0
2019-01-02 0
2019-01-03 0
2019-01-04 1
2019-01-05 1
2019-01-06 1
2019-01-07 2
2019-01-08 2
2019-01-09 2
2019-01-10 3
b 2019-01-01 4
2019-01-02 4
2019-01-03 4
2019-01-04 5
2019-01-05 5
2019-01-06 5
2019-01-07 6
2019-01-08 6
2019-01-09 6
2019-01-10 7
c 2019-01-01 8
2019-01-02 8
2019-01-03 8
2019-01-04 9
2019-01-05 9
2019-01-06 9
2019-01-07 10
2019-01-08 10
2019-01-09 10
2019-01-10 11
[30 rows x 1 columns]
I can and will use this workaround, but I'd like to know why the simpler (and frankly more elegant) method is not working.
Use Series.asfreq(), which fills in the missing dates.
def filldates(s_in):
    s_in.reset_index(level="grp", drop=True, inplace=True)
    s_in = s_in.asfreq("1D", method='ffill')
    return s_in

s.groupby(level=0).apply(filldates)
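A slightly more compact variant of the same idea, as a sketch using the level names from the question:
out = s.groupby(level='grp', group_keys=True).apply(
    lambda g: g.droplevel('grp').asfreq('D', method='ffill'))
print(out)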

How to perform a cumulative sum of distinct values in pandas dataframe

I have a dataframe like this:
id date company ......
123 2019-01-01 A
224 2019-01-01 B
345 2019-01-01 B
987 2019-01-03 C
334 2019-01-03 C
908 2019-01-04 C
765 2019-01-04 A
554 2019-01-05 A
482 2019-01-05 D
and I want to get the cumulative number of unique values over time for the 'company' column. So if a company appears at a later date they are not counted again.
My expected output is:
date cumulative_count
2019-01-01 2
2019-01-03 3
2019-01-04 3
2019-01-05 4
I've tried:
df.groupby(['date']).company.nunique().cumsum()
but this double counts if the same company appears on a different date.
Using duplicated + cumsum + last
m = df.duplicated('company')
d = df['date']
(~m).cumsum().groupby(d).last()
date
2019-01-01 2
2019-01-03 3
2019-01-04 3
2019-01-05 4
dtype: int32
Another way, trying to fix anky_91's approach:
(df.company.map(hash)).expanding().apply(lambda x: len(set(x)),raw=True).groupby(df.date).max()
Out[196]:
date
2019-01-01 2.0
2019-01-03 3.0
2019-01-04 3.0
2019-01-05 4.0
Name: company, dtype: float64
From anky_91
(df.company.astype('category').cat.codes).expanding().apply(lambda x: len(set(x)),raw=True).groupby(df.date).max()
This takes more code than anky's answer, but still works for the sample data:
df = df.sort_values('date')
(df.drop_duplicates(['company'])
   .groupby('date')
   .size().cumsum()
   .reindex(df['date'].unique())
   .ffill()
)
Output:
date
2019-01-01 2.0
2019-01-03 3.0
2019-01-04 3.0
2019-01-05 4.0
dtype: float64
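A closely related variant of the duplicated idea, as a sketch: count each company the first time it appears, per date, then take a running total (this assumes the frame is already ordered by date, as in the sample):
out = (~df['company'].duplicated()).groupby(df['date']).sum().cumsum()
print(out)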

Pandas Rolling mean based on groupby multiple columns

I have a long-format dataframe with repeated values in two columns and data in another column. I want to find SMAs for each group. My problem is: rolling() simply ignores the fact that the data is grouped by two columns.
Here is some dummy data and code.
import numpy as np
import pandas as pd
dtix = pd.Series(pd.date_range(start='1/1/2019', periods=4))
df = pd.DataFrame({'ix1': np.repeat([0, 1], 4), 'ix2': pd.concat([dtix, dtix]), 'data': np.arange(0, 8)})
df
ix1 ix2 data
0 0 2019-01-01 0
1 0 2019-01-02 1
2 0 2019-01-03 2
3 0 2019-01-04 3
0 1 2019-01-01 4
1 1 2019-01-02 5
2 1 2019-01-03 6
3 1 2019-01-04 7
Now when I perform a grouped rolling mean on this data, I am getting an output like this:
df.groupby(['ix1','ix2']).agg({'data':'mean'}).rolling(2).mean()
data
ix1 ix2
0 2019-01-01 NaN
2019-01-02 0.5
2019-01-03 1.5
2019-01-04 2.5
1 2019-01-01 3.5
2019-01-02 4.5
2019-01-03 5.5
2019-01-04 6.5
Desired Output:
Whereas, what I would actually like to have is this:
sma
ix1 ix2
0 2019-01-01 NaN
2019-01-02 0.5
2019-01-03 1.5
2019-01-04 2.5
1 2019-01-01 NaN
2019-01-02 4.5
2019-01-03 5.5
2019-01-04 6.5
Will appreciate your help with this.
Use another groupby by the first level (ix1) with rolling:
df1 = (df.groupby(['ix1', 'ix2'])
         .agg({'data': 'mean'})
         .groupby(level=0, group_keys=False)
         .rolling(2)
         .mean())
print (df1)
data
ix1 ix2
0 2019-01-01 NaN
2019-01-02 0.5
2019-01-03 1.5
2019-01-04 2.5
1 2019-01-01 NaN
2019-01-02 4.5
2019-01-03 5.5
2019-01-04 6.5
In your solution, the aggregation returns a one-column DataFrame, so the chained rolling works over all rows rather than per group, which is not what you need:
print(df.groupby(['ix1','ix2']).agg({'data':'mean'}))
data
ix1 ix2
0 2019-01-01 0
2019-01-02 1
2019-01-03 2
2019-01-04 3
1 2019-01-01 4
2019-01-02 5
2019-01-03 6
2019-01-04 7
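For reference, the same per-group SMA can also be computed directly from the long frame with a grouped rolling window; a sketch assuming one row per (ix1, ix2) pair, as in the example data:
sma = (df.set_index('ix2')
         .groupby('ix1')['data']
         .rolling(2)
         .mean()
         .rename('sma'))
print(sma)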
