Suppose I have the following DataFrame:
df = pd.DataFrame({'Event': ['A', 'B', 'A', 'A', 'B', 'C', 'B', 'B', 'A', 'C'],
                   'Date': ['2019-01-01', '2019-02-01', '2019-03-01', '2019-03-01', '2019-02-15',
                            '2019-03-15', '2019-04-05', '2019-04-05', '2019-04-15', '2019-06-10'],
                   'Sale': [100, 200, 150, 200, 150, 100, 300, 250, 500, 400]})
df['Date'] = pd.to_datetime(df['Date'])
df
Event Date
A 2019-01-01
B 2019-02-01
A 2019-03-01
A 2019-03-01
B 2019-02-15
C 2019-03-15
B 2019-04-05
B 2019-04-05
A 2019-04-15
C 2019-06-10
I would like to obtain the following result:
Event Date Previous_Event_Count
A 2019-01-01 0
B 2019-02-01 0
A 2019-03-01 1
A 2019-03-01 1
B 2019-02-15 1
C 2019-03-15 0
B 2019-04-05 2
B 2019-04-05 2
A 2019-04-15 3
C 2019-06-10 1
where df['Previous_Event_Count'] is the number of rows in which the same event (df['Event']) took place before the date in the current row (df['Date']). For instance,
the number of times event A takes place before 2019-01-01 is 0,
the number of times event A takes place before 2019-03-01 is 1, and
the number of times event A takes place before 2019-04-15 is 3.
I am able to obtain the desired result using this line:
df['Previous_Event_Count'] = [df.loc[(df.loc[i, 'Event'] == df['Event']) & (df.loc[i, 'Date'] > df['Date']),
                                     'Date'].count() for i in range(len(df))]
It works fine, but it is slow. I believe there is a better way to do that. I have tried this line:
df['Previous_Event_Count'] = df.query('Date < Date').groupby(['Event', 'Date']).cumcount()
but it produces NaNs.
groupby + rank
Dates can be treated as numeric. Use method='min' so that tied dates share the lowest rank, which gives your counting logic.
df['PEC'] = (df.groupby('Event').Date.rank(method='min')-1).astype(int)
Event Date PEC
0 A 2019-01-01 0
1 B 2019-02-01 0
2 A 2019-03-01 1
3 A 2019-03-01 1
4 B 2019-02-15 1
5 C 2019-03-15 0
6 B 2019-04-05 2
7 B 2019-04-05 2
8 A 2019-04-15 3
9 C 2019-06-10 1
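As a sanity check, the rank-based column can be compared with the slow list comprehension from the question; a small sketch:
# the question's O(n^2) approach, kept only to verify the rank-based result
slow = [df.loc[(df.loc[i, 'Event'] == df['Event']) & (df.loc[i, 'Date'] > df['Date']), 'Date'].count()
        for i in range(len(df))]
assert df['PEC'].tolist() == slow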
First get the counts with GroupBy.size over both columns, then per the first level apply shift and a cumulative sum, and finally join back to the original:
s = (df.groupby(['Event', 'Date'])
       .size()
       .groupby(level=0)
       .apply(lambda x: x.shift(1).cumsum())
       .fillna(0)
       .astype(int))
df = df.join(s.rename('Previous_Event_Count'), on=['Event','Date'])
print (df)
Event Date Previous_Event_Count
0 A 2019-01-01 0
1 B 2019-02-01 0
2 A 2019-03-01 1
3 A 2019-03-01 1
4 B 2019-02-15 1
5 C 2019-03-15 0
6 B 2019-04-05 2
7 B 2019-04-05 2
8 A 2019-04-15 3
9 C 2019-06-10 1
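A minor variation on the same idea, not part of the original answer: within each event, the shifted cumulative sum equals the running total of the per-date counts minus the current date's count. A minimal sketch, assuming df is still the original frame without the joined column:
sizes = df.groupby(['Event', 'Date']).size()
# rows with a strictly earlier date = running total within the event minus this date's count
prev = (sizes.groupby(level=0).cumsum() - sizes).rename('Previous_Event_Count')
out = df.join(prev, on=['Event', 'Date'])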
Finally, I found a better and faster way to get the desired result. It turns out to be very easy. One can try:
df['Total_Previous_Sale'] = df.groupby('Event').cumcount() \
                            - df.groupby(['Event', 'Date']).cumcount()
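Note that this relies on the dates being in ascending order within each event, which happens to hold for the sample data. A minimal sketch that makes that assumption explicit by sorting first (using the question's column name Previous_Event_Count):
# sort by event and date so cumcount only counts earlier dates; a stable sort keeps ties in place
tmp = df.sort_values(['Event', 'Date'], kind='mergesort')
df['Previous_Event_Count'] = (tmp.groupby('Event').cumcount()
                              - tmp.groupby(['Event', 'Date']).cumcount())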
Related
I am using the following code to filter the dataframe on the type and value columns and then delete any entries with a tdiff > 0 and < 3.
import pandas as pd
d = {'Timestamp': ['2020-09-02 07:00:00','2020-09-02 07:10:00', '2020-09-02 07:30:00', '2020-09-02 08:00:00', '2020-09-02 10:00:00', '2020-09-02 11:10:00', '2020-09-02 11:30:00'],
'type': ['A','A','B','A', 'A','A','B'], 'value': [1,2,3,1,1,2,3]}
df = pd.DataFrame(data=d)
df3 = pd.DataFrame()
unique_type = pd.unique(df['type']).astype(str)
for i in range(0, len(unique_type)):
    df1 = df[df.type == unique_type[i]]
    unique_val = pd.unique(df1['value']).astype(int)
    for j in range(0, len(unique_val)):
        df2 = df1[df1.value == unique_val[j]]
        trange = pd.to_datetime(df2.Timestamp)
        tdiff = (trange - min(trange)).dt.total_seconds() / 3600
        df2['tdiff'] = tdiff  # .round(1)
        df3 = df3.append(df2, ignore_index=True)
df4 = df3[~((df3.tdiff>0) & (df3.tdiff<3))]
print(df)
df4.sort_values(by=['Timestamp'])
While this works, I would like to move away from for loops and use more efficient code.
Try this simple code, using a lambda function.
Data:
d = {'type': ['A','A','B','C'], 'col1': [1,2,3,9]}
df = pd.DataFrame(data=d)
df:
type col1
0 A 1
1 A 2
2 B 3
3 C 9
import numpy as np

f = pd.Series(df['type'].unique()).apply(lambda type_: (type_, df[df['type'] == type_].col1.sum())).to_list()
df['New-column'] = df.type.replace(dict(f))
df.loc[df['New-column'] >= 9, 'New-column'] = np.nan
df:
type col1 New-column
0 A 1 3.0
1 A 2 3.0
2 B 3 3.0
3 C 9 NaN
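The same result can also be reached with groupby.transform, which broadcasts the per-type sum back to each row; a sketch, not part of the original answer:
import numpy as np
import pandas as pd

d = {'type': ['A', 'A', 'B', 'C'], 'col1': [1, 2, 3, 9]}
df = pd.DataFrame(data=d)

# sum col1 per type and broadcast the sum to every row of that type
df['New-column'] = df.groupby('type')['col1'].transform('sum')
# mask sums of 9 or more, as in the answer above
df.loc[df['New-column'] >= 9, 'New-column'] = np.nan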
You can use groupby and transform applied with min:
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df['tmin'] = df.groupby(['type','value'])['Timestamp'].transform(min)
df['tdiff'] = (df['Timestamp'] - df['tmin']).dt.total_seconds()/3600
df[~((df.tdiff>0) & (df.tdiff<3))]
output
Timestamp type value tmin tdiff
-- ------------------- ------ ------- ------------------- -------
0 2020-09-02 07:00:00 A 1 2020-09-02 07:00:00 0
1 2020-09-02 07:10:00 A 2 2020-09-02 07:10:00 0
2 2020-09-02 07:30:00 B 3 2020-09-02 07:30:00 0
4 2020-09-02 10:00:00 A 1 2020-09-02 07:00:00 3
5 2020-09-02 11:10:00 A 2 2020-09-02 07:10:00 4
6 2020-09-02 11:30:00 B 3 2020-09-02 07:30:00 4
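To mirror df4 from the question (the filtered frame sorted by Timestamp), the helper columns can be dropped afterwards; a small follow-up sketch:
df4 = (df[~((df.tdiff > 0) & (df.tdiff < 3))]
       .drop(columns=['tmin', 'tdiff'])
       .sort_values(by=['Timestamp']))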
For a dataframe of:
import pandas as pd
df = pd.DataFrame({
'dt':[
'2019-01-01',
'2019-01-02',
'2019-01-03',
'2020-01-01',
'2020-01-02',
'2020-01-03',
'2019-01-01',
'2019-01-02',
'2019-01-03',
'2020-01-01',
'2020-01-02',
'2020-01-03'
],
'foo': [1,2,3, 4,5,6, 1,5,3, 4,10,6],
'category': [1,1,1,1,1,1, 2,2,2,2,2,2]
})
How can I find the lagged value from the previous year for each category?
df['dt'] = pd.to_datetime(df['dt'])
display(df)
Shifting the index only returns an empty result, and thus fails assignment.
df['last_year'] = df[df.dt == df.dt - pd.offsets.Day(365)]
Obviously, a join with the data from 2019 on the month and day would work, but that seems rather cumbersome. Is there a better way?
Edit
The desired result:
dt foo category last_year
2020-01-01 4 1 1
2020-01-02 5 1 2
2020-01-03 6 1 3
2020-01-01 4 2 1
2020-01-02 10 2 5
2020-01-03 6 2 3
You can merge df with itself after assigning the column dt shifted by the offset you want with pd.DateOffset:
print (df.merge(df.assign(dt=lambda x: x['dt'] + pd.DateOffset(years=1)),
                on=['dt', 'category'],
                suffixes=('', '_lastYear'),
                how='left'))
dt foo category foo_lastYear
0 2019-01-01 1 1 NaN
1 2019-01-02 2 1 NaN
2 2019-01-03 3 1 NaN
3 2020-01-01 4 1 1.0
4 2020-01-02 5 1 2.0
5 2020-01-03 6 1 3.0
6 2019-01-01 1 2 NaN
7 2019-01-02 5 2 NaN
8 2019-01-03 3 2 NaN
9 2020-01-01 4 2 1.0
10 2020-01-02 10 2 5.0
11 2020-01-03 6 2 3.0
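If only the rows with a previous-year match are wanted, as in the desired result above, the unmatched rows can be dropped and the column renamed; a small follow-up sketch:
merged = df.merge(df.assign(dt=lambda x: x['dt'] + pd.DateOffset(years=1)),
                  on=['dt', 'category'],
                  suffixes=('', '_lastYear'),
                  how='left')
result = (merged.dropna(subset=['foo_lastYear'])
                .rename(columns={'foo_lastYear': 'last_year'}))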
I have a dataframe like as shown below
df = pd.DataFrame({
'subject_id':[1,1,1,1,2,2,2,2],
'date':['2173/04/11','2173/04/12','2173/04/11','2173/04/12','2173/05/14','2173/05/15','2173/05/14','2173/05/15'],
'time_1':['2173/04/11 12:35:00','2173/04/12 12:50:00','2173/04/11 12:59:00','2173/04/12 13:14:00','2173/05/14 13:37:00','2173/05/15 13:39:00','2173/05/14 18:37:00','2173/05/15 19:39:00'],
'val' :[5,5,40,40,7,7,38,38],
'iid' :[12,12,12,12,21,21,21,21]
})
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
I tried the stack, unstack, pivot and melt approaches, but they don't seem to help:
pd.melt(df, id_vars =['subject_id','val'], value_vars =['date','val']) #1
df.unstack().reset_index() #2
df.pivot(index='subject_id', columns='time_1', values='val') #3
I expect my output dataframe to look as shown in the updated screenshot: one row per subject_id, with the time_1 and val values spread across numbered columns.
The idea is to create a helper Series with GroupBy.cumcount keyed on the same column(s) used for the new index (here subject_id), build a MultiIndex, reshape with DataFrame.unstack, and finally flatten the MultiIndex in the columns:
cols = ['time_1','val']
df = df.set_index(['subject_id', df.groupby('subject_id').cumcount().add(1)])[cols].unstack()
df.columns = [f'{a}{b}' for a, b in df.columns]
df = df.reset_index()
print (df)
subject_id time_11 time_12 time_13 \
0 1 2173-04-11 12:35:00 2173-04-12 12:50:00 2173-04-11 12:59:00
1 2 2173-05-14 13:37:00 2173-05-15 13:39:00 2173-05-14 18:37:00
time_14 val1 val2 val3 val4
0 2173-04-12 13:14:00 5 5 40 40
1 2173-05-15 19:39:00 7 7 38 38
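If the wide columns should come out in chronological order regardless of the original row order, the frame can be sorted before the cumcount is computed; a minimal sketch, not part of the original answer, applied to the original long-format df:
tmp = df.sort_values(['subject_id', 'time_1'])
wide = tmp.set_index(['subject_id', tmp.groupby('subject_id').cumcount().add(1)])[['time_1', 'val']].unstack()
wide.columns = [f'{a}{b}' for a, b in wide.columns]
wide = wide.reset_index()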
Missing values are expected if the groups do not all have the same number of rows - unstack pads every group to the maximum count, and the gaps become missing values:
df = pd.DataFrame({
'subject_id':[1,1,1,2,2,3],
'date':['2173/04/11','2173/04/12','2173/04/11','2173/04/12','2173/05/14','2173/05/15'],
'time_1':['2173/04/11 12:35:00','2173/04/12 12:50:00','2173/04/11 12:59:00',
'2173/04/12 13:14:00','2173/05/14 13:37:00','2173/05/15 13:39:00'],
'val' :[5,5,40,40,7,7],
'iid' :[12,12,12,12,21,21]
})
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
print (df)
subject_id date time_1 val iid day
0 1 2173/04/11 2173-04-11 12:35:00 5 12 11
1 1 2173/04/12 2173-04-12 12:50:00 5 12 12
2 1 2173/04/11 2173-04-11 12:59:00 40 12 11
3 2 2173/04/12 2173-04-12 13:14:00 40 12 12
4 2 2173/05/14 2173-05-14 13:37:00 7 21 14
5 3 2173/05/15 2173-05-15 13:39:00 7 21 15
cols = ['time_1','val']
df = df.set_index(['subject_id', df.groupby('subject_id').cumcount().add(1)])[cols].unstack()
df.columns = [f'{a}{b}' for a, b in df.columns]
df = df.reset_index()
print (df)
subject_id time_11 time_12 time_13 \
0 1 2173-04-11 12:35:00 2173-04-12 12:50:00 2173-04-11 12:59:00
1 2 2173-04-12 13:14:00 2173-05-14 13:37:00 NaT
2 3 2173-05-15 13:39:00 NaT NaT
val1 val2 val3
0 5.0 5.0 40.0
1 40.0 7.0 NaN
2 7.0 NaN NaN
My goal is to fill in the missing date entries per project_id with 0 in the data column.
For example
df = pd.DataFrame({
'project_id': ['A', 'A', 'A', 'B', 'B'],
'timestamp': ['2018-01-01', '2018-03-01', '2018-04-01', '2018-03-01', '2018-06-01'],
'data': [100, 28, 45, 64, 55]})
which is
project_id timestamp data
0 A 2018-01-01 100
1 A 2018-03-01 28
2 A 2018-04-01 45
3 B 2018-03-01 64
4 B 2018-06-01 55
shall become
project_id timestamp data
0 A 2018-01-01 100
1 A 2018-02-01 0
2 A 2018-03-01 28
3 A 2018-04-01 45
4 B 2018-03-01 64
5 B 2018-04-01 0
6 B 2018-05-01 0
7 B 2018-06-01 55
where indices 1, 5, and 6 are added.
My current approach :
df.groupby('project_id').apply(lambda x: x[['timestamp', 'data']].set_index('timestamp').asfreq('M', how='start', fill_value=0))
is obviously wrong, because it sets everything to 0 and resamples to the last date of the month rather than the first, although I thought this should be handled by how.
How do I expand/complement missing datetime entries after groupby to get a continuous time series for each group?
You are close:
df.timestamp = pd.to_datetime(df.timestamp)
# notice 'MS'
new_df = df.groupby('project_id').apply(lambda x: x[['timestamp', 'data']]
.set_index('timestamp').asfreq('MS'))
new_df.data = df.set_index(['project_id', 'timestamp']).data
df = new_df.fillna(0).reset_index()
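As a quick check, the result can be compared against the expected frame from the question; a sketch using pandas' testing helper (check_dtype=False because the filled column comes back as float):
import pandas.testing as tm

expected = pd.DataFrame({
    'project_id': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'timestamp': pd.to_datetime(['2018-01-01', '2018-02-01', '2018-03-01', '2018-04-01',
                                 '2018-03-01', '2018-04-01', '2018-05-01', '2018-06-01']),
    'data': [100, 0, 28, 45, 64, 0, 0, 55]})
tm.assert_frame_equal(df, expected, check_dtype=False)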
You can use groupby in combination with pandas.Grouper:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df_new = pd.concat([
    d.groupby(pd.Grouper(freq='MS'))['data'].sum().reset_index().assign(project_id=n)
    for n, d in df.set_index('timestamp').groupby('project_id')
])
df_new = (df_new.sort_values('project_id')
                .reset_index(drop=True)[['timestamp', 'project_id', 'data']])
Output
print(df_new)
timestamp project_id data
0 2018-01-01 A 100.0
1 2018-02-01 A 0.0
2 2018-03-01 A 28.0
3 2018-04-01 A 45.0
4 2018-03-01 B 64.0
5 2018-04-01 B 0.0
6 2018-05-01 B 0.0
7 2018-06-01 B 55.0
With this DataFrame:
import pandas as pd
df = pd.DataFrame([[1,1],[1,2],[1,3],[1,5],[1,7],[1,9]], index=pd.date_range('2015-01-01', periods=6), columns=['a', 'b'])
i.e.
a b
2015-01-01 1 1
2015-01-02 1 2
2015-01-03 1 3
2015-01-04 1 5
2015-01-05 1 7
2015-01-06 1 9
using df = df.groupby(df.b // 4).last() makes the datetime index disappear. Why?
a b
b
0 1 3
1 1 7
2 1 9
Expected result instead:
a b
2015-01-03 1 3
2015-01-05 1 7
2015-01-06 1 9
With groupby, the resulting index always comes from the grouping values. For your case you could use reset_index and then set_index:
df['c'] = df.b // 4
result = df.reset_index().groupby('c').last().set_index('index')
In [349]: result
Out[349]:
a b
index
2015-01-03 1 3
2015-01-05 1 7
2015-01-06 1 9
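An alternative that avoids the helper column is to take the last row of each group by position, which keeps the original DatetimeIndex; a sketch, not from the original answer, applied to the original df (before c is added):
# tail(1) returns whole rows from the original frame, so the DatetimeIndex survives
result = df.groupby(df.b // 4).tail(1)
Note that .last() takes the last non-NaN value per column, while tail(1) takes whole rows; for this data the two coincide.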