pandas dataframe aggregate by ID and date - python

I'm trying to aggregate a dataframe by both ID and date. Suppose I had a dataframe:
Publish date ID Price
0 2000-01-02 0 10
1 2000-01-03 0 20
2 2000-02-17 0 30
3 2000-01-04 1 40
I would like to aggregate the value by ID and date (frequency = 1W) and get a dataframe like:
Publish date ID Price
0 2000-01-02 0 30
1 2000-02-17 0 30
2 2000-01-04 1 40
I understand it can be achieved by iterating over the IDs and using Grouper to aggregate the price. Is there a more efficient way that avoids iterating over the IDs? Many thanks.

Use Grouper with an aggregate sum, but the right Grouper frequency is not obvious, because the dates in the expected output don't correspond to any single frequency exactly:
df['Publish date'] = pd.to_datetime(df['Publish date'])
df = (df.groupby([pd.Grouper(freq='W', key='Publish date'), 'ID'], sort=False)['Price']
        .sum()
        .reset_index())
print (df)
Publish date ID Price
0 2000-01-02 0 10
1 2000-01-09 0 20
2 2000-02-20 0 30
3 2000-01-09 1 40
df['Publish date'] = pd.to_datetime(df['Publish date'])
df = (df.groupby([pd.Grouper(freq='W-Mon', key='Publish date'), 'ID'], sort=False)['Price']
        .sum()
        .reset_index())
print (df)
Publish date ID Price
0 2000-01-03 0 30
1 2000-02-21 0 30
2 2000-01-10 1 40
Or:
df['Publish date'] = pd.to_datetime(df['Publish date'])
df = (df.groupby([pd.Grouper(freq='7D', key='Publish date'), 'ID'], sort=False)['Price']
        .sum()
        .reset_index())
print (df)
Publish date ID Price
0 2000-01-02 0 30
1 2000-02-13 0 30
2 2000-01-02 1 40
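Since none of the frequencies above reproduces the exact dates in the question's expected output, one option for keeping the first original publish date in each weekly bucket (rather than the Grouper bin label) is a named aggregation over a helper copy of the date column. This is only a sketch under that assumption (pandas >= 0.25 for named aggregation; the helper column name week is made up here):
df['Publish date'] = pd.to_datetime(df['Publish date'])
# helper copy used only as the resampling key, so the original column stays available
df['week'] = df['Publish date']
out = (df.groupby([pd.Grouper(freq='7D', key='week'), 'ID'], sort=False)
         .agg(**{'Publish date': ('Publish date', 'first'),
                 'Price': ('Price', 'sum')})
         .reset_index(level='ID')
         .reset_index(drop=True))
print(out)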

Related

Pandas: Combine rows with consecutive dates (with NaT) within groups of same id

I would like to combine rows with the same id, consecutive dates, and the same feature values.
I have the following dataframe:
Id Start End Feature1 Feature2
0 A 2020-01-01 2020-01-15 1 1
1 A 2020-01-16 2020-01-30 1 1
2 A 2020-01-31 2020-02-15 0 1
3 A 2020-07-01 2020-07-15 0 1
4 B 2020-01-31 2020-02-15 0 0
5 B 2020-02-16 NaT 0 0
And the expected result is:
Id Start End Feature1 Feature2
0 A 2020-01-01 2020-01-30 1 1
1 A 2020-01-31 2020-02-15 0 1
2 A 2020-07-01 2020-07-15 0 1
3 B 2020-01-31 NaT 0 0
I have been trying answers from other posts, but they don't really match my use case.
Thanks in advance!
You can approach this by:
Get the day difference between consecutive entries within the same group by subtracting the previous End from the current Start using GroupBy.shift().
Set a group number group_no such that a new group number is issued whenever the day difference from the previous entry within the group is greater than 1.
Then group by Id and group_no and aggregate the Start and End dates for each group using .groupby() and .agg().
As there is NaT data within the grouping, we need to specify dropna=False during grouping. Furthermore, to get the last entry of End within the group, we use x.iloc[-1] instead of 'last'.
# convert to datetime format if not already in datetime
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])
# sort by columns `Id` and `Start` if not already in this sequence
df = df.sort_values(['Id', 'Start'])
day_diff = (df['Start'] - df['End'].groupby([df['Id'], df['Feature1'], df['Feature2']]).shift()).dt.days
group_no = (day_diff.isna() | day_diff.gt(1)).cumsum()
df_out = (df.groupby(['Id', group_no], dropna=False, as_index=False)
            .agg({'Id': 'first',
                  'Start': 'first',
                  'End': lambda x: x.iloc[-1],
                  'Feature1': 'first',
                  'Feature2': 'first'}))
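For the sample data, the intermediate helper series come out as follows (shown only to illustrate the first two steps above):
print(day_diff.tolist())   # [nan, 1.0, nan, 137.0, nan, 1.0]
print(group_no.tolist())   # [1, 1, 2, 3, 4, 4]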
Result:
print(df_out)
Id Start End Feature1 Feature2
0 A 2020-01-01 2020-01-30 1 1
1 A 2020-01-31 2020-02-15 0 1
2 A 2020-07-01 2020-07-15 0 1
3 B 2020-01-31 NaT 0 0
Extract months from both date columns:
df['sMonth'] = df['Start'].apply(pd.to_datetime).dt.month
df['eMonth'] = df['End'].apply(pd.to_datetime).dt.month
Now group the dataframe by ['Id','Feature1','Feature2','sMonth','eMonth'] and we get the result:
(df.groupby(['Id', 'Feature1', 'Feature2', 'sMonth', 'eMonth'])
   .agg({'Start': 'min', 'End': 'max'})
   .reset_index()
   .drop(['sMonth', 'eMonth'], axis=1))
Result
Id Feature1 Feature2 Start End
0 A 0 1 2020-01-31 2020-02-15
1 A 0 1 2020-07-01 2020-07-15
2 A 1 1 2020-01-01 2020-01-30
3 B 0 0 2020-01-31 2020-02-15

How to aggregate in pivot_table in pandas

I have following dataframe in pandas
code date tank nozzle qty amount
123 2018-01-01 1 1 100 0
123 2018-01-01 1 2 0 50
123 2018-01-01 1 2 0 50
123 2018-01-01 1 2 100 0
123 2018-01-02 1 1 0 70
123 2018-01-02 1 1 0 50
123 2018-01-02 1 2 100 0
My desired dataframe is
code date tank nozzle_1_qty nozzle_2_qty nozzle_1_amount nozzle_2_amount
123 2018-01-01 1 100 100 0 100
123 2018-01-02 1 0 100 120 0
I am doing the following in pandas:
df = (df.pivot_table(index=['date', 'tank'], columns='nozzle',
                     values=['qty', 'amount'])
        .add_prefix('nozzle_')
        .reset_index())
But this does not give me my desired output.
The default aggregation function in pivot_table is np.mean, so it is necessary to change it to sum and then flatten the MultiIndex columns with a list comprehension:
df = df.pivot_table(index=['code', 'date', 'tank'],
                    columns='nozzle',
                    values=['qty', 'amount'], aggfunc='sum')
#python 3.6+
df.columns = [f'nozzle_{b}_{a}' for a, b in df.columns]
#python below 3.6
#df.columns = ['nozzle_{}_{}'.format(b, a) for a, b in df.columns]
df = df.reset_index()
print (df)
code date tank nozzle_1_amount nozzle_2_amount nozzle_1_qty \
0 123 2018-01-01 1 0 100 100
1 123 2018-01-02 1 120 0 0
nozzle_2_qty
0 100
1 100
I don't use pivot_table much in pandas, but you can get your result using groupby and some reshaping.
df = df.groupby(['code', 'date', 'tank', 'nozzle']).sum().unstack()
The columns will be a MultiIndex that you may want to rename.
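A minimal sketch of one way to do that renaming on the result above (after unstack the column tuples are (value, nozzle) pairs):
df.columns = [f'nozzle_{nozzle}_{value}' for value, nozzle in df.columns]
df = df.reset_index()
print(df)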

Count total work hours of the employee per date in pandas

I have a pandas dataframe like this:
Employee_id timestamp
1 2017-06-21 04:47:45
1 2017-06-21 04:48:45
1 2017-06-21 04:49:45
For each employee, I get a ping every minute while he/she is in the office.
I have pings for around 2000 employees, and I need output like:
Employee_id date Total_work_hour
1 2018-06-21 8
1 2018-06-22 7
2 2018-06-21 6
2 2018-06-22 8
for all 2000 employees.
Use groupby with a lambda function that applies diff and sums all the differences, then convert to seconds with total_seconds and divide by 3600 to get hours:
df1 = (df.groupby(['Employee_id', df['timestamp'].dt.date])['timestamp']
         .apply(lambda x: x.diff().sum())
         .dt.total_seconds()
         .div(3600)
         .reset_index(name='Total_work_hour'))
print (df1)
Employee_id timestamp Total_work_hour
0 1 2017-06-21 0.033333
But if some consecutive minutes may be missing, it is possible to use a custom function that ignores gaps longer than one minute:
print (df)
Employee_id timestamp
0 1 2017-06-21 04:47:45
1 1 2017-06-21 04:48:45
2 1 2017-06-21 04:49:45
3 1 2017-06-21 04:55:45
def f(x):
    vals = x.diff()
    return vals.mask(vals > pd.Timedelta(60, unit='s')).sum()
df1 = (df.groupby(['Employee_id', df['timestamp'].dt.date])['timestamp']
         .apply(f)
         .dt.total_seconds()
         .div(3600)
         .reset_index(name='Total_work_hour'))
print (df1)
Employee_id timestamp Total_work_hour
0 1 2017-06-21 0.033333
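Note that both snippets assume timestamp is already a datetime64 column; if it was read from CSV as strings, convert it first:
df['timestamp'] = pd.to_datetime(df['timestamp'])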

Pandas AVG() function between two date columns

Have a df like that:
Client Status Dat_Start Dat_End
1 A 2015-01-01 2015-01-19
1 B 2016-01-01 2016-02-02
1 A 2015-02-12 2015-02-20
1 B 2016-01-30 2016-03-01
I'd like to get the average difference between the two dates (Dat_End and Dat_Start) for Status='A', grouping by the Client column, using Pandas syntax.
So it would be something like this SQL:
Select Client, AVG (Dat_end-Dat_Start) as Date_Diff
from Table
where Status='A'
Group by Client
Thanks!
Calculate the timedeltas:
df['duration'] = df.Dat_End-df.Dat_Start
df
Out[92]:
Client Status Dat_Start Dat_End duration
0 1 A 2015-01-01 2015-01-19 18 days
1 1 B 2016-01-01 2016-02-02 32 days
2 1 A 2015-02-12 2015-02-20 8 days
3 1 B 2016-01-30 2016-03-01 31 days
Filter and ask for sum and count for pandas <0.20:
df[df.Status=='A'].groupby('Client').duration.agg(['sum', 'count'])
Out[98]:
sum count
Client
1 26 days 2
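For those older versions, the mean can also be derived from the sum and count computed above; a quick sketch (untested on old releases):
agg = df[df.Status=='A'].groupby('Client').duration.agg(['sum', 'count'])
agg['mean'] = agg['sum'] / agg['count']  # timedelta divided by int gives timedelta
print(agg['mean'])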
For the upcoming pandas 0.20, mean support was added to groupby for timedeltas, so this will work:
df[df.Status=='A'].groupby('Client').duration.mean()
In [10]: df.loc[df.Status == 'A'].groupby('Client') \
.apply(lambda x: (x.Dat_End-x.Dat_Start).mean()).reset_index()
Out[10]:
Client 0
0 1 13 days

Pandas: count some values in a column

I have a dataframe; this is part of it:
ID,"url","app_name","used_at","active_seconds","device_connection","device_os","device_type","device_usage"
e990fae0f48b7daf52619b5ccbec61bc,"",Phone,2015-05-01 09:29:11,13,3g,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Phone,2015-05-01 09:33:00,3,unknown,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Phone,2015-06-01 09:33:07,1,unknown,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Phone,2015-06-01 09:34:30,5,unknown,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Messaging,2015-06-01 09:36:22,133,3g,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Messaging,2015-05-02 09:38:40,5,3g,android,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",Yandex.Navigator,2015-05-01 11:04:48,70,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",VK Client,2015-6-01 12:02:27,248,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",Viber,2015-07-01 12:06:35,7,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",VK Client,2015-08-01 12:23:26,86,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",Talking Angela,2015-08-02 12:24:52,0,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",My Talking Angela,2015-08-03 12:24:52,167,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",Talking Angela,2015-08-04 12:27:39,34,3g,ios,smartphone,home
I need to count the number of days in every month for every ID.
If I try df.groupby('ID')['used_at'].count() I get the number of visits; how can I count the days per month?
I think you need to group by ID, month and day and aggregate with size:
df1 = df.used_at.groupby([df['ID'], df.used_at.dt.month,df.used_at.dt.day ]).size()
print (df1)
ID                                used_at  used_at
574c4969b017ae6481db9a7c77328bc3  5        1          1
                                  6        1          1
                                  7        1          1
                                  8        1          1
                                           2          1
                                           3          1
                                           4          1
e990fae0f48b7daf52619b5ccbec61bc  5        1          2
                                           2          1
                                  6        1          3
dtype: int64
Or by date - it is the same as grouping by year, month and day:
df1 = df.used_at.groupby([df['ID'], df.used_at.dt.date]).size()
print (df1)
ID                                used_at
574c4969b017ae6481db9a7c77328bc3  2015-05-01    1
                                  2015-06-01    1
                                  2015-07-01    1
                                  2015-08-01    1
                                  2015-08-02    1
                                  2015-08-03    1
                                  2015-08-04    1
e990fae0f48b7daf52619b5ccbec61bc  2015-05-01    2
                                  2015-05-02    1
                                  2015-06-01    3
dtype: int64
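If what is actually needed is the number of distinct days per month for every ID (rather than the number of pings per day), nunique over the normalized dates is one option; a sketch in the same style, assuming used_at is already datetime:
df1 = (df['used_at'].dt.normalize()
         .groupby([df['ID'], df['used_at'].dt.to_period('M')])
         .nunique())
print(df1)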
Differences between count and size:
size counts NaN values, count does not.
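A tiny illustration of that difference on a toy frame (the names here are only for demonstration):
import numpy as np
import pandas as pd
tmp = pd.DataFrame({'g': ['a', 'a', 'b'], 'v': [1.0, np.nan, 2.0]})
print(tmp.groupby('g')['v'].size())   # a: 2, b: 1 - size counts the NaN row
print(tmp.groupby('g')['v'].count())  # a: 1, b: 1 - count ignores the NaN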
