I want to add a column containing the difference, in days, between the max and min action_date for each customer_id in this table
Input:
action_date customer_id
2017-08-15 1
2017-08-21 1
2017-08-21 1
2017-09-02 1
2017-08-28 2
2017-09-29 2
2017-10-15 3
2017-10-30 3
2017-12-05 3
And get this table
Output:
action_date customer_id diff
2017-08-15 1 18
2017-08-21 1 18
2017-08-21 1 18
2017-09-02 1 18
2017-08-28 2 32
2017-09-29 2 32
2017-10-15 3 51
2017-10-30 3 51
2017-12-05 3 51
I tried this code, but it puts lots of NaNs:
group = df.groupby(by='customer_id')
df['diff'] = (group['action_date'].max() - group['action_date'].min()).dt.days
You can use the transform method. The groupby aggregation you tried returns one row per customer_id, so it doesn't align with the original index (hence the NaNs); transform broadcasts the group result back to every row:
In [23]: df['diff'] = df.groupby('customer_id')['action_date'] \
                        .transform(lambda x: (x.max() - x.min()).days)
In [24]: df
Out[24]:
action_date customer_id diff
0 2017-08-15 1 18
1 2017-08-21 1 18
2 2017-08-21 1 18
3 2017-09-02 1 18
4 2017-08-28 2 32
5 2017-09-29 2 32
6 2017-10-15 3 51
7 2017-10-30 3 51
8 2017-12-05 3 51
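An equivalent version that avoids the Python-level lambda, and is usually faster on large frames, computes the two transforms separately (a minimal sketch against the same df):
g = df.groupby('customer_id')['action_date']
# transform('max') / transform('min') broadcast the group extremes back to every row
df['diff'] = (g.transform('max') - g.transform('min')).dt.days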
I have a dataframe as below:
F_Time BP BQ BO0 BO1 BO2 BO3 BO4
0 2020-07-10 09:30:00 10780.00 8550 1 28 1 1 2
1 2020-07-10 10:15:00 10788.00 8700 1 5 10 2 1
2 2020-07-10 10:20:00 10780.00 12150 1 1 1 3 76
3 2020-07-10 10:30:00 10770.00 15675 3 2 8 4 94
4 2020-07-10 10:35:00 10760.60 8100 2 1 1 1 29
5 2020-07-10 10:40:00 10750.00 18825 8 9 154 1 1
6 2020-07-10 11:05:00 10725.00 9825 3 4 94 1 1
I want to find the maximum value across the columns BO0, BO1, BO2, BO3 and BO4 for every row:
Expected Output as below:
F_Time BP BQ BO
0 2020-07-10 09:30:00 10780.00 8550 28
1 2020-07-10 10:15:00 10788.00 8700 10
2 2020-07-10 10:20:00 10780.00 12150 76
3 2020-07-10 10:30:00 10770.00 15675 94
4 2020-07-10 10:35:00 10760.60 8100 29
5 2020-07-10 10:40:00 10750.00 18825 154
6 2020-07-10 11:05:00 10725.00 9825 94
Try this:
df["BO"] = df[["BO0", "BO1", "BO2", "BO3", "BO4"]].max(axis=1)
I have df1
Id Data Group_Id
0 1 A 1
1 2 B 2
2 3 B 3
...
100 4 A 101
101 5 A 102
...
and df2
Timestamp Group_Id
2012-01-01 00:00:05.523 1
2013-07-01 00:00:10.757 2
2014-01-12 00:00:15.507 3
...
2016-03-05 00:00:05.743 101
2017-12-24 00:00:10.407 102
...
I want to match the two datasets by Group_Id, then copy only the date part of Timestamp from df2 into a new column in df1, named day1, based on the corresponding Group_Id.
Then I want to add six more columns next to day1, named day2, ..., day7, holding the next six days after day1. So it looks like:
Id Data Group_Id day1 day2 day3 ... day7
0 1 A 1 2012-01-01 2012-01-02 2012-01-03 ...
1 2 B 2 2013-07-01 2013-07-02 2013-07-03 ...
2 3 B 3 2014-01-12 2014-01-13 2014-01-14 ...
...
100 4 A 101 2016-03-05 2016-03-06 2016-03-07 ...
101 5 A 102 2017-12-24 2017-12-25 2017-12-26 ...
...
Thanks.
First we need a merge here, then we can build the seven day columns from each Timestamp:
df1 = df1.merge(df2, how='left')
# normalize() drops the time-of-day; periods=7 yields day1 through day7
s = pd.DataFrame([pd.date_range(x.normalize(), periods=7, freq='D') for x in df1.Timestamp], index=df1.index)
s.columns += 1
df1 = df1.join(s.add_prefix('day'))
Another approach: merge the two frames, take the date from the timestamp as day1, then add six more columns, each one day later:
import pandas as pd

df1 = pd.read_csv('df1.csv')
df2 = pd.read_csv('df2.csv')

df3 = df1.merge(df2, on='Group_Id')
df3['Timestamp'] = pd.to_datetime(df3['Timestamp'])  # only necessary if not already datetime
df3['day1'] = df3['Timestamp'].dt.date
for i in range(1, 7):
    df3['day' + str(i + 1)] = df3['day1'] + pd.Timedelta(i, unit='d')
output:
Id Data Group_Id Timestamp day1 day2 day3 day4 day5 day6 day7
0 1 A 1 2012-01-01 00:00:05.523 2012-01-01 2012-01-02 2012-01-03 2012-01-04 2012-01-05 2012-01-06 2012-01-07
1 2 B 2 2013-07-01 00:00:10.757 2013-07-01 2013-07-02 2013-07-03 2013-07-04 2013-07-05 2013-07-06 2013-07-07
2 3 B 3 2014-01-12 00:00:15.507 2014-01-12 2014-01-13 2014-01-14 2014-01-15 2014-01-16 2014-01-17 2014-01-18
3 4 A 101 2016-03-05 00:00:05.743 2016-03-05 2016-03-06 2016-03-07 2016-03-08 2016-03-09 2016-03-10 2016-03-11
4 5 A 102 2017-12-24 00:00:10.407 2017-12-24 2017-12-25 2017-12-26 2017-12-27 2017-12-28 2017-12-29 2017-12-30
Note that I copied your data frame into a csv and only had the 5 entries shown, so the index is not the same as in your example (i.e. 100, 101).
You can delete the Timestamp column if it is not needed.
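If the per-column loop ever becomes a bottleneck, the same seven columns can be built in a single assign call (a sketch reusing the merged df3 from above; dt.normalize() keeps the values as datetimes at midnight rather than date objects):
days = {'day' + str(i + 1): df3['Timestamp'].dt.normalize() + pd.Timedelta(days=i) for i in range(7)}
df3 = df3.assign(**days)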
My data looks something like this:
ID1 ID2 Date Values
1 1 2018-01-05 75
1 1 2018-01-06 83
1 1 2018-01-07 17
1 1 2018-01-08 15
1 2 2018-02-01 85
1 2 2018-02-02 98
2 1 2018-02-15 54
2 1 2018-02-16 17
2 1 2018-02-17 83
2 1 2018-02-18 94
2 2 2017-12-18 16
2 2 2017-12-19 84
2 2 2017-12-20 47
2 2 2017-12-21 28
2 2 2017-12-22 38
All the operations must be done within groups of ['ID1', 'ID2'].
What I want to do is upsample the dataframe so that I end up with a sub-dataframe for each 'Date' index which includes all previous dates, including the current one, from its own ['ID1', 'ID2'] group. The resulting dataframe should look like this:
ID1 ID2 DateGroup Date Values
1 1 2018-01-05 2018-01-05 75
1 1 2018-01-06 2018-01-05 75
1 1 2018-01-06 2018-01-06 83
1 1 2018-01-07 2018-01-05 75
1 1 2018-01-07 2018-01-06 83
1 1 2018-01-07 2018-01-07 17
1 1 2018-01-08 2018-01-05 75
1 1 2018-01-08 2018-01-06 83
1 1 2018-01-08 2018-01-07 17
1 1 2018-01-08 2018-01-08 15
1 2 2018-02-01 2018-02-01 85
1 2 2018-02-02 2018-02-01 85
1 2 2018-02-02 2018-02-02 98
2 1 2018-02-15 2018-02-15 54
2 1 2018-02-16 2018-02-15 54
2 1 2018-02-16 2018-02-16 17
2 1 2018-02-17 2018-02-15 54
2 1 2018-02-17 2018-02-16 17
2 1 2018-02-17 2018-02-17 83
2 1 2018-02-18 2018-02-15 54
2 1 2018-02-18 2018-02-16 17
2 1 2018-02-18 2018-02-17 83
2 1 2018-02-18 2018-02-18 94
2 2 2017-12-18 2017-12-18 16
2 2 2017-12-19 2017-12-18 16
2 2 2017-12-19 2017-12-19 84
2 2 2017-12-20 2017-12-18 16
2 2 2017-12-20 2017-12-19 84
2 2 2017-12-20 2017-12-20 47
2 2 2017-12-21 2017-12-18 16
2 2 2017-12-21 2017-12-19 84
2 2 2017-12-21 2017-12-20 47
2 2 2017-12-21 2017-12-21 28
2 2 2017-12-22 2017-12-18 16
2 2 2017-12-22 2017-12-19 84
2 2 2017-12-22 2017-12-20 47
2 2 2017-12-22 2017-12-21 28
2 2 2017-12-22 2017-12-22 38
The dataframe I'm working with is quite big (~20 million rows), thus I would like to avoid iterating through each row.
Is it possible to use a function or combination of pandas functions like resample/apply/reindex to achieve what I need?
Assuming ID1 and ID2 form your original index: reset the index, set Date as the index, resample, then set the index back to ['ID1', 'ID2']:
df = df.reset_index().set_index(['Date']).resample('d').ffill().reset_index().set_index(['ID1','ID2'])
If your 'Date' field is a string, convert it to datetime before resampling on that field:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
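Note that resample('d').ffill() fills in missing calendar days; it does not build the cumulative DateGroup blocks shown in the expected output. A minimal per-group sketch of that expanding layout (assuming ID1/ID2 are regular columns; it is quadratic in group size, so try it on a sample of the 20 million rows first) could look like this:
def expand(g):
    # for each date, keep every row of the group up to and including that date
    g = g.sort_values('Date')
    parts = []
    for d in g['Date'].unique():
        sub = g[g['Date'] <= d].copy()
        sub.insert(2, 'DateGroup', d)  # place DateGroup after ID1/ID2
        parts.append(sub)
    return pd.concat(parts)

result = (df.groupby(['ID1', 'ID2'], group_keys=False)
            .apply(expand)
            .reset_index(drop=True))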
I have a dataframe with a date+time and a label, which I want to reshape into per-month columns holding the label frequencies for that month:
date_time label
1 2017-09-26 17:08:00 0
3 2017-10-03 13:27:00 2
4 2017-10-04 19:04:00 0
11 2017-10-11 18:28:00 1
27 2017-10-13 11:22:00 0
28 2017-10-13 21:43:00 0
39 2017-10-16 14:43:00 0
40 2017-10-16 21:39:00 0
65 2017-10-21 21:53:00 2
...
98 2017-11-01 20:08:00 3
99 2017-11-02 12:00:00 3
100 2017-11-02 12:01:00 2
109 2017-11-02 12:03:00 3
110 2017-11-03 22:24:00 0
111 2017-11-04 09:05:00 3
112 2017-11-06 12:36:00 3
113 2017-11-06 12:48:00 2
128 2017-11-07 15:20:00 2
143 2017-11-10 16:36:00 3
144 2017-11-10 20:00:00 0
145 2017-11-10 20:02:00 0
I group the label frequency by month with this line (thanks partially to this post):
df2 = df.groupby([pd.Grouper(key='date_time', freq='M'), 'label'])['label'].count()
which outputs
date_time label
2017-09-30 0 1
2017-10-31 0 6
1 1
2 8
3 2
2017-11-30 0 25
4 2
5 1
2 4
3 11
2017-12-31 0 14
5 3
2 5
3 7
2018-01-31 0 8
4 1
5 1
2 2
3 3
but, as mentioned before, I would like to get the data by month/date columns:
2017-09-30 2017-10-31 2017-11-30 2017-12-31 2018-01-31
0 1 6 25 14 8
1 0 1 0 0 0
2 0 8 4 5 2
3 0 2 11 7 3
4 0 0 2 0 1
5 0 0 1 3 1
Currently I can sort of split the data with
pd.concat([df2[m] for m in df2.index.levels[0]], axis=1).fillna(0)
but I lose the column names:
label label label label label
0 1.0 6.0 25.0 14.0 8.0
1 0.0 1.0 0.0 0.0 0.0
2 0.0 8.0 4.0 5.0 2.0
3 0.0 2.0 11.0 7.0 3.0
4 0.0 0.0 2.0 0.0 1.0
5 0.0 0.0 1.0 3.0 1.0
So I have to do a longer version where I generate a series, rename it, concatenate and then fill in the blanks:
m_list = []
for m in df2.index.levels[0]:
    m_labels = df2[m]
    m_labels = m_labels.rename(m)
    m_list.append(m_labels)
pd.concat(m_list, axis=1).fillna(0)
resulting in
2017-09-30 2017-10-31 2017-11-30 2017-12-31 2018-01-31
0 1.0 6.0 25.0 14.0 8.0
1 0.0 1.0 0.0 0.0 0.0
2 0.0 8.0 4.0 5.0 2.0
3 0.0 2.0 11.0 7.0 3.0
4 0.0 0.0 2.0 0.0 1.0
5 0.0 0.0 1.0 3.0 1.0
Is there a shorter/more elegant way to get to this last dataframe from my original one?
You just need unstack here:
df.groupby([pd.Grouper(key='date_time', freq='M'), 'label'])['label'].count().unstack(0,fill_value=0)
Out[235]:
date_time 2017-09-30 2017-10-31 2017-11-30
label
0 1 5 3
1 0 1 0
2 0 2 3
3 0 0 6
Based on your groupby output:
s.unstack(0,fill_value=0)
Out[240]:
date_time 2017-09-30 2017-10-31 2017-11-30 2017-12-31 2018-01-31
label
0 1 6 25 14 8
1 0 1 0 0 0
2 0 8 4 5 2
3 0 2 11 7 3
4 0 0 2 0 1
5 0 0 1 3 1
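An even shorter route, assuming date_time is already a datetime column, is pd.crosstab with a monthly period; note the columns then come out as periods like 2017-09 rather than month-end dates:
# crosstab counts label/month combinations, filling absent pairs with 0
pd.crosstab(df['label'], df['date_time'].dt.to_period('M'))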
I have a dataframe with PERIOD_START_TIME every 15 minutes, and now I need to aggregate to 1 hour and calculate the sum and avg for almost every column in the dataframe (it has about 20 columns):
PERIOD_START_TIME ID val1 val2
06.21.2017 22:15:00 12 3 0
06.21.2017 22:30:00 12 5 6
06.21.2017 22:45:00 12 0 3
06.21.2017 23:00:00 12 5 2
...
06.21.2017 22:15:00 15 9 2
06.21.2017 22:30:00 15 0 2
06.21.2017 22:45:00 15 1 5
06.21.2017 23:00:00 15 0 1
...
Desired output:
PERIOD_START_TIME ID val1(avg) val1(sum) val1(max) ...
06.21.2017 22:00:00 12 3.25 13 5
...
06.21.2017 23:00:00 15 2.25 10 9 ...
And the same for val2 and every other column in the dataframe.
I have no idea how to group by period start time for every hour, rather than for the whole day, or how to start.
I believe you need Series.dt.floor to round the timestamps down to the hour, and then aggregate by agg:
df = df.groupby([df['PERIOD_START_TIME'].dt.floor('H'),'ID']).agg(['mean','sum', 'max'])
#for columns from MultiIndex
df.columns = df.columns.map('_'.join)
print (df)
val1_mean val1_sum val1_max val2_mean val2_sum \
PERIOD_START_TIME ID
2017-06-21 22:00:00 12 2.666667 8 5 3 9
15 3.333333 10 9 3 9
2017-06-21 23:00:00 12 5.000000 5 5 2 2
15 0.000000 0 0 1 1
val2_max
PERIOD_START_TIME ID
2017-06-21 22:00:00 12 6
15 5
2017-06-21 23:00:00 12 2
15 1
df = df.reset_index()
print (df)
PERIOD_START_TIME ID val1_mean val1_sum val1_max val2_mean val2_sum \
0 2017-06-21 22:00 12 2.666667 8 5 3 9
1 2017-06-21 22:00 15 3.333333 10 9 3 9
2 2017-06-21 23:00 12 5.000000 5 5 2 2
3 2017-06-21 23:00 15 0.000000 0 0 1 1
val2_max
0 6
1 5
2 2
3 1
Very similarly you can convert PERIOD_START_TIME to a pandas Period.
df['PERIOD_START_TIME'] = df['PERIOD_START_TIME'].dt.to_period('H')
df.groupby(['PERIOD_START_TIME', 'ID']).agg(['max', 'min', 'mean']).reset_index()
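Both snippets assume PERIOD_START_TIME is already a datetime. Given the MM.DD.YYYY sample format, a conversion along these lines (assuming month-first dates) is needed first:
df['PERIOD_START_TIME'] = pd.to_datetime(df['PERIOD_START_TIME'], format='%m.%d.%Y %H:%M:%S')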