Pandas AVG() function between two date columns - python

I have a df like this:
Client Status Dat_Start Dat_End
1 A 2015-01-01 2015-01-19
1 B 2016-01-01 2016-02-02
1 A 2015-02-12 2015-02-20
1 B 2016-01-30 2016-03-01
I'd like to get the average difference between the two date columns (Dat_End and Dat_Start) for Status='A', grouping by the Client column, using Pandas syntax.
So it will be something SQL-like:
Select Client, AVG(Dat_End - Dat_Start) as Date_Diff
from Table
where Status='A'
Group by Client
Thanks!

Calculate the timedeltas:
df['duration'] = df.Dat_End-df.Dat_Start
df
Out[92]:
Client Status Dat_Start Dat_End duration
0 1 A 2015-01-01 2015-01-19 18 days
1 1 B 2016-01-01 2016-02-02 32 days
2 1 A 2015-02-12 2015-02-20 8 days
3 1 B 2016-01-30 2016-03-01 31 days
For pandas < 0.20, filter and aggregate with sum and count:
df[df.Status=='A'].groupby('Client').duration.agg(['sum', 'count'])
Out[98]:
sum count
Client
1 26 days 2
In pandas 0.20, mean support for timedeltas was added to groupby, so from that version on this will work:
df[df.Status=='A'].groupby('Client').duration.mean()
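Note that the subtraction only yields timedeltas if both columns have datetime64 dtype. If they were read in as strings, convert them first; a minimal sketch, rebuilding the question's data:
import pandas as pd
df = pd.DataFrame({'Client': [1, 1, 1, 1],
                   'Status': ['A', 'B', 'A', 'B'],
                   'Dat_Start': ['2015-01-01', '2016-01-01', '2015-02-12', '2016-01-30'],
                   'Dat_End': ['2015-01-19', '2016-02-02', '2015-02-20', '2016-03-01']})
# Parse the string columns to datetime64 so the subtraction yields timedelta64
df['Dat_Start'] = pd.to_datetime(df['Dat_Start'])
df['Dat_End'] = pd.to_datetime(df['Dat_End'])
df['duration'] = df.Dat_End - df.Dat_Start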

In [10]: df.loc[df.Status == 'A'].groupby('Client') \
             .apply(lambda x: (x.Dat_End - x.Dat_Start).mean()).reset_index()
Out[10]:
Client 0
0 1 13 days
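The result column comes back unnamed (0); to mirror the SQL alias you can name it in reset_index (Date_Diff here is just the alias from the question):
df.loc[df.Status == 'A'].groupby('Client') \
  .apply(lambda x: (x.Dat_End - x.Dat_Start).mean()).reset_index(name='Date_Diff')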

Related

pandas dataframe aggregate by ID and date

I'm trying to aggregate a dataframe by both ID and date. Suppose I had a dataframe:
Publish date ID Price
0 2000-01-02 0 10
1 2000-01-03 0 20
2 2000-02-17 0 30
3 2000-01-04 1 40
I would like to aggregate the value by ID and date (frequency = 1W) and get a dataframe like:
Publish date ID Price
0 2000-01-02 0 30
1 2000-02-17 0 30
2 2000-01-04 1 40
I understand it can be achieved by iterating over the IDs and using Grouper to aggregate the price. Is there a more efficient way that avoids iterating over the IDs? Many thanks.
Use Grouper with an aggregate sum; the exact frequency is unclear (the expected output does not match a single frequency exactly), so here are a few variants. First with the default weekly frequency 'W' (weeks ending on Sunday):
df['Publish date'] = pd.to_datetime(df['Publish date'])
df = (df.groupby([pd.Grouper(freq='W', key='Publish date'),'ID'], sort=False)['Price']
.sum()
.reset_index())
print (df)
Publish date ID Price
0 2000-01-02 0 10
1 2000-01-09 0 20
2 2000-02-20 0 30
3 2000-01-09 1 40
Or with weeks anchored to end on Monday:
df['Publish date'] = pd.to_datetime(df['Publish date'])
df = (df.groupby([pd.Grouper(freq='W-Mon', key='Publish date'),'ID'], sort=False)['Price']
.sum()
.reset_index())
print (df)
Publish date ID Price
0 2000-01-03 0 30
1 2000-02-21 0 30
2 2000-01-10 1 40
Or with fixed 7-day windows (the labels are the window start dates):
df['Publish date'] = pd.to_datetime(df['Publish date'])
df = (df.groupby([pd.Grouper(freq='7D', key='Publish date'),'ID'], sort=False)['Price']
.sum()
.reset_index())
print (df)
Publish date ID Price
0 2000-01-02 0 30
1 2000-02-13 0 30
2 2000-01-02 1 40
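If you need to control where the fixed 7-day windows begin, pandas 1.1+ also accepts an origin argument in Grouper; a sketch under that version assumption:
df['Publish date'] = pd.to_datetime(df['Publish date'])
df = (df.groupby([pd.Grouper(freq='7D', key='Publish date', origin='start'),'ID'], sort=False)['Price']
        .sum()
        .reset_index())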

Pandas unable to filter rows by quarter in specific year

I have a dataset like below-
Store Date Weekly_Sales
0 1 2010-05-02 1643690.90
1 1 2010-12-02 1641957.44
2 1 2010-02-19 1611968.17
3 1 2010-02-26 1409727.59
4 1 2010-05-03 1554806.68
It has 100 stores in all. I want to filter the 2012 data by quarter.
# Filter out only the data in 2012 from the dataset
import datetime as dt
df['Date'] = pd.to_datetime(df['Date'])
ds_2012 = df[df['Date'].dt.year == 2012]
# Calculate Q on the dataset
ds_2012 = ds_2012.sort_values(['Date'],ascending=True)
quarterly_sales = ds_2012.groupby(['Store', pd.Grouper(key='Date', freq='Q')])['Weekly_Sales'].sum()
quarterly_sales.head(20)
Output Received
Store Date
1 2012-03-31 18951097.69
2012-06-30 21036965.58
2012-09-30 18633209.98
2012-12-31 9580784.77
The sums for Q2 (2012-06-30) and Q3 (2012-09-30) are both incorrect when checked against Excel. I am a newbie to Pandas.
You can group by store and resample the DataFrame quarterly:
import pandas as pd
# Two stores, 12 months of sales each
df = pd.concat([pd.DataFrame({'Store': [i]*12,
                              'Date': pd.date_range(start='2020-01-01', periods=12, freq='M'),
                              'Sales': list(range(12))})
                for i in [1, 2]])
df.groupby('Store').resample('Q', on='Date').sum().drop('Store', axis=1)
Sales
Store Date
1 2020-03-31 3
2020-06-30 12
2020-09-30 21
2020-12-31 30
2 2020-03-31 3
2020-06-30 12
2020-09-30 21
2020-12-31 30
Maybe check the groupby and resample docs as well.
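Applied to the question's own columns, the same pattern would look like this (a sketch, assuming Date has already been parsed with pd.to_datetime):
quarterly_sales = df.groupby('Store').resample('Q', on='Date')['Weekly_Sales'].sum()
If the quarterly sums still disagree with Excel, it is also worth checking that pd.to_datetime guessed day/month as intended; dates such as 2010-05-02 are ambiguous, and passing format= or dayfirst= removes the guesswork.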

How can I join columns by DatetimeIndex, matching day, month and hour from data from different years?

I have a dataset with meteorological features for 2019, to which I want to join two columns of power consumption datasets for 2017, 2018. I want to match them by hour, day and month, but the data belongs to different years. How can I do that?
The meteo dataset is a similar 6-column dataframe with a DatetimeIndex belonging to 2019.
You can create 3 additional columns from the index that represent the hour, day and month, and use them for a later join. DatetimeIndex has attributes for the different parts of the timestamp:
import pandas as pd
ind = pd.date_range(start='2020-01-01', end='2020-01-20', periods=10)
df = pd.DataFrame({'number' : range(10)}, index = ind)
df['hour'] = df.index.hour
df['day'] = df.index.day
df['month'] = df.index.month
print(df)
number hour day month
2020-01-01 00:00:00 0 0 1 1
2020-01-03 02:40:00 1 2 3 1
2020-01-05 05:20:00 2 5 5 1
2020-01-07 08:00:00 3 8 7 1
2020-01-09 10:40:00 4 10 9 1
2020-01-11 13:20:00 5 13 11 1
2020-01-13 16:00:00 6 16 13 1
2020-01-15 18:40:00 7 18 15 1
2020-01-17 21:20:00 8 21 17 1
2020-01-20 00:00:00 9 0 20 1
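With the same three helper columns on both frames, the join itself is an ordinary merge; a minimal sketch, where df_power stands for a hypothetical 2017/2018 consumption dataframe that also has a DatetimeIndex:
df_power['hour'] = df_power.index.hour
df_power['day'] = df_power.index.day
df_power['month'] = df_power.index.month
# Merge on the calendar parts that repeat every year, ignoring the year itself
joined = df.merge(df_power, on=['month', 'day', 'hour'], how='left')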

How to count by time frequency using groupby - pandas

I'm trying to count the frequency of 2 events by month using 2 columns from my df. What I have done so far counts all events by their unique timestamp, which is not efficient enough as there are too many results. I wish to create a graph with the results afterwards.
I've tried adapting my code based on the answers to these SO questions:
How to groupby time series by 10 minutes using pandas?
Counting frequency of occurrence by month-year using python panda
Pandas Groupby using time frequency
but cannot seem to get the command working when I pass freq='day' within the groupby command.
My code is:
print(df.groupby(['Priority', 'Create Time']).Priority.count())
which initially produced something like 170000 results in the structure of the following:
Priority Create Time
1.0 2011-01-01 00:00:00 1
2011-01-01 00:01:11 1
2011-01-01 00:02:10 1
...
2.0 2011-01-01 00:01:25 1
2011-01-01 00:01:35 1
...
But now for some reason (I'm using Jupyter Notebook) it only produces:
Priority Create Time
1.0 2011-01-01 00:00:00 1
2011-01-01 00:01:11 1
2011-01-01 00:02:10 1
2.0 2011-01-01 00:01:25 1
2011-01-01 00:01:35 1
Name: Priority, dtype: int64
No idea why the output has changed to only 5 results (maybe I unknowingly changed something).
I would like the results to be in the following format:
Priority month Count
1.0 2011-01 a
2011-02 b
2011-03 c
...
2.0 2011-01 x
2011-02 y
2011-03 z
...
Top points for showing how to change the frequency correctly to other values as well, for example hour/day/month/year. With your answers, please could you explain what is going on in the code, as I am new to pandas and wish to understand the process. Thank you.
One possible solution is to convert the datetime column to monthly periods with Series.dt.to_period:
print(df.groupby(['Priority', df['Create Time'].dt.to_period('m')]).Priority.count())
Or use Grouper:
print(df.groupby(['Priority', pd.Grouper(key='Create Time', freq='MS')]).Priority.count())
Sample:
import numpy as np
import pandas as pd
np.random.seed(123)
df = pd.DataFrame({'Create Time': pd.date_range('2019-01-01', freq='10D', periods=10),
                   'Priority': np.random.choice([0, 1], size=10)})
print (df)
Create Time Priority
0 2019-01-01 0
1 2019-01-11 1
2 2019-01-21 0
3 2019-01-31 0
4 2019-02-10 0
5 2019-02-20 0
6 2019-03-02 0
7 2019-03-12 1
8 2019-03-22 1
9 2019-04-01 0
print(df.groupby(['Priority', df['Create Time'].dt.to_period('m')]).Priority.count())
Priority Create Time
0 2019-01 3
2019-02 2
2019-03 1
2019-04 1
1 2019-01 1
2019-03 2
Name: Priority, dtype: int64
print(df.groupby(['Priority', pd.Grouper(key='Create Time', freq='MS')]).Priority.count())
Priority Create Time
0 2019-01-01 3
2019-02-01 2
2019-03-01 1
2019-04-01 1
1 2019-01-01 1
2019-03-01 2
Name: Priority, dtype: int64
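For other frequencies the pattern is identical, only the alias changes; a short sketch with the standard pandas offset aliases ('H' hour, 'D' day, 'M' month, 'Y' year):
print(df.groupby(['Priority', df['Create Time'].dt.to_period('D')]).Priority.count())
print(df.groupby(['Priority', df['Create Time'].dt.to_period('Y')]).Priority.count())
# With Grouper, use freq='H', 'D', 'MS' (month start) or 'YS' (year start)
print(df.groupby(['Priority', pd.Grouper(key='Create Time', freq='D')]).Priority.count())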

Count total work hours of the employee per date in pandas

I have the pandas dataframe like this:
Employee_id timestamp
1 2017-06-21 04:47:45
1 2017-06-21 04:48:45
1 2017-06-21 04:49:45
For each employee I get a ping every minute while he/she is in the office.
I have pings for around 2000 employees, and I need output like:
Employee_id date Total_work_hour
1 2018-06-21 8
1 2018-06-22 7
2 2018-06-21 6
2 2018-06-22 8
for all 2000 employees.
Use groupby with a lambda function that takes diff and sums the differences, then convert to seconds with total_seconds and divide by 3600 to get hours:
df1 = (df.groupby(['Employee_id', df['timestamp'].dt.date])['timestamp']
.apply(lambda x: x.diff().sum())
.dt.total_seconds()
.div(3600)
.reset_index(name='Total_work_hour'))
print (df1)
Employee_id timestamp Total_work_hour
0 1 2017-06-21 0.033333
But if some consecutive minutes can be missing, it is possible to use a custom function that ignores gaps longer than a minute:
print (df)
Employee_id timestamp
0 1 2017-06-21 04:47:45
1 1 2017-06-21 04:48:45
2 1 2017-06-21 04:49:45
3 1 2017-06-21 04:55:45
def f(x):
    vals = x.diff()
    # Treat gaps longer than one minute as missing pings, not work time
    return vals.mask(vals > pd.Timedelta(60, unit='s')).sum()
df1 = (df.groupby(['Employee_id', df['timestamp'].dt.date])['timestamp']
.apply(f)
.dt.total_seconds()
.div(3600)
.reset_index(name='Total_work_hour')
)
print (df1)
Employee_id timestamp Total_work_hour
0 1 2017-06-21 0.033333
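An alternative, if the pings are dense enough, is to take the span between the first and last ping per day instead of summing the diffs; a sketch, noting that unlike the mask approach above this counts idle gaps inside the day as work time:
df1 = (df.groupby(['Employee_id', df['timestamp'].dt.date])['timestamp']
         .apply(lambda x: (x.max() - x.min()).total_seconds() / 3600)
         .reset_index(name='Total_work_hour'))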
