Calculating average differences with groupby in Python

I'm new to Python and I want to aggregate (groupby) the IDs in my first column.
The values in the second column are timestamps (datetime format). By aggregating on ID, I want to get the average difference between the timestamps (in days) for each ID. My table looks like df1 and I want something like df2, but since I'm an absolute beginner, I have no idea how to do this.
import pandas as pd
import numpy as np
from datetime import datetime
In[1]:
# df1
ID = np.array([1,1,1,2,2,3])
Timestamp = np.array([
datetime.strptime('2018-01-01 18:07:02', "%Y-%m-%d %H:%M:%S"),
datetime.strptime('2018-01-08 18:07:02', "%Y-%m-%d %H:%M:%S"),
datetime.strptime('2018-03-15 18:07:02', "%Y-%m-%d %H:%M:%S"),
datetime.strptime('2018-01-01 18:07:02', "%Y-%m-%d %H:%M:%S"),
datetime.strptime('2018-02-01 18:07:02', "%Y-%m-%d %H:%M:%S"),
datetime.strptime('2018-01-01 18:07:02', "%Y-%m-%d %H:%M:%S")])
df = pd.DataFrame({'ID': ID, 'Timestamp': Timestamp})
Out[1]:
ID Timestamp
0 1 2018-01-01 18:07:02
1 1 2018-01-08 18:07:02
2 1 2018-03-15 18:07:02
3 2 2018-01-01 18:07:02
4 2 2018-02-01 18:07:02
5 3 2018-01-01 18:07:02
In[2]:
#df2
ID = np.array([1,2,3])
Avg_Difference = np.array([7, 1, "nan"])
df2 = pd.DataFrame({'ID': ID, 'Avg_Difference': Avg_Difference})
Out[2]:
ID Avg_Difference
0 1 7
1 2 1
2 3 nan

You could do something like this:
df.groupby('ID')['Timestamp'].apply(lambda x: x.diff().mean())
In your case, it looks like:
>>> df
ID Timestamp
0 1 2018-01-01 18:07:02
1 1 2018-01-08 18:07:02
2 1 2018-03-15 18:07:02
3 2 2018-01-01 18:07:02
4 2 2018-02-01 18:07:02
5 3 2018-01-01 18:07:02
>>> df.groupby('ID')['Timestamp'].apply(lambda x: x.diff().mean())
ID
1 36 days 12:00:00
2 31 days 00:00:00
3 NaT
Name: Timestamp, dtype: timedelta64[ns]
If you want it as a dataframe with the column named Avg_Difference, just add to_frame at the end:
df.groupby('ID')['Timestamp'].apply(lambda x: x.diff().mean()).to_frame('Avg_Difference')
Avg_Difference
ID
1 36 days 12:00:00
2 31 days 00:00:00
3 NaT
Edit: Based on your comment, if you want to drop the time component and keep only the number of days, you can do the following:
df.groupby('ID')['Timestamp'].apply(lambda x: x.diff().mean()).dt.days.to_frame('Avg_Difference')
Avg_Difference
ID
1 36.0
2 31.0
3 NaN
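Note that .dt.days keeps only the whole-day part, so 36 days 12:00:00 becomes 36. If the fraction matters, a minimal sketch (assuming the same data as in the question; the name avg_days is only illustrative) is to divide the total seconds by the number of seconds in a day:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 3],
    'Timestamp': pd.to_datetime([
        '2018-01-01 18:07:02', '2018-01-08 18:07:02', '2018-03-15 18:07:02',
        '2018-01-01 18:07:02', '2018-02-01 18:07:02', '2018-01-01 18:07:02',
    ]),
})

# Mean gap between consecutive timestamps within each ID
# (an ID with a single row has no gap, so it yields NaT)
avg_diff = df.groupby('ID')['Timestamp'].apply(lambda x: x.diff().mean())

# Convert the timedeltas to fractional days; NaT becomes NaN
avg_days = (pd.to_timedelta(avg_diff).dt.total_seconds() / 86400).to_frame('Avg_Difference')
print(avg_days)
```

With this data, ID 1 comes out as 36.5 rather than 36.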

Related

int64 to HHMM string

I have the following data frame where the column hour shows hours of the day as int64. I'm trying to convert that into a time format, so that hour 1 would show up as '01:00'. I then want to add this to the date column and convert the result into a timestamp index.
Using the datetime function in pandas resulted in the column "hr2", which is not what I need. I'm not sure I can even apply datetime directly, as the original data (i.e. in column "hr") is not really a date time format to begin with. Google searches so far have been unproductive.

I am still in the dark about the format of your date column, so I will assume that the Date column holds strings and the Hr column is int64. To create a TimeStamp column in pandas timestamp format, this is how I would proceed:
Given df:
Date Hr
0 12/01/2010 1
1 12/01/2010 2
2 12/01/2010 3
3 12/01/2010 4
4 12/02/2010 1
5 12/02/2010 2
6 12/02/2010 3
7 12/02/2010 4
df['TimeStamp'] = df.apply(lambda row: pd.to_datetime(row['Date']) + pd.to_timedelta(row['Hr'], unit='h'), axis = 1)
yields:
Date Hr TimeStamp
0 12/01/2010 1 2010-12-01 01:00:00
1 12/01/2010 2 2010-12-01 02:00:00
2 12/01/2010 3 2010-12-01 03:00:00
3 12/01/2010 4 2010-12-01 04:00:00
4 12/02/2010 1 2010-12-02 01:00:00
5 12/02/2010 2 2010-12-02 02:00:00
6 12/02/2010 3 2010-12-02 03:00:00
7 12/02/2010 4 2010-12-02 04:00:00
The timestamp column can then be used as your index.
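The row-by-row apply can also be written as whole-column operations. This is a sketch under the same assumptions (Date holds month/day/year strings, Hr holds integer hours); the HHMM column name is made up here to show the '01:00' form asked for in the question:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['12/01/2010'] * 4 + ['12/02/2010'] * 4,
    'Hr': [1, 2, 3, 4, 1, 2, 3, 4],
})

# Parse the dates once (month/day/year assumed) and add the hours as timedeltas
df['TimeStamp'] = (pd.to_datetime(df['Date'], format='%m/%d/%Y')
                   + pd.to_timedelta(df['Hr'], unit='h'))

# Zero-padded 'HH:MM' string for each row
df['HHMM'] = df['TimeStamp'].dt.strftime('%H:%M')

# Use the timestamps as the index
df = df.set_index('TimeStamp')
print(df)
```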

Create a date counter variable starting with a particular date

I have a variable as:
start_dt = 201901 which is basically Jan 2019
I have an initial data frame as:
month
0
1
2
3
4
I want to add a new column (date) to the dataframe where for month 0, the date is the start_dt - 1 month, and for subsequent months, the date is a month + 1 increment.
I want the resulting dataframe as:
month date
0 12/1/2018
1 1/1/2019
2 2/1/2019
3 3/1/2019
4 4/1/2019

You can subtract 1 from the month column, add the start date converted to a monthly period with Timestamp.to_period, and then convert the result back to timestamps with to_timestamp:
start_dt = 201801
start_dt = pd.to_datetime(start_dt, format='%Y%m')
s = df['month'].sub(1).add(start_dt.to_period('m')).dt.to_timestamp()
print (s)
0 2017-12-01
1 2018-01-01
2 2018-02-01
3 2018-03-01
4 2018-04-01
Name: month, dtype: datetime64[ns]
Alternatively, convert the column to month offsets (subtracting 1) and add the start datetime:
s = df['month'].apply(lambda x: pd.DateOffset(months=x-1)).add(start_dt)
print (s)
0 2017-12-01
1 2018-01-01
2 2018-02-01
3 2018-03-01
4 2018-04-01
Name: month, dtype: datetime64[ns]
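Arithmetic between an integer column and a Period can behave differently across pandas versions, so here is a more explicit sketch of the same period idea. It assumes start_dt = 201901 as in the question and adds the month counter to a scalar Period one value at a time:

```python
import pandas as pd

start_dt = 201901  # value from the question (January 2019)
df = pd.DataFrame({'month': [0, 1, 2, 3, 4]})

# Monthly period for the start date
start = pd.to_datetime(str(start_dt), format='%Y%m').to_period('M')

# Month 0 maps to one month before the start; each step adds one month.
# to_timestamp() returns the first day of the period.
df['date'] = [(start + (int(m) - 1)).to_timestamp() for m in df['month']]
print(df)
```

This also handles month values that are not consecutive, since each date is computed from its own counter.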

Here is how you can use the third-party library dateutil to increment a datetime by one month:
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta
start_dt = '201801'
number_of_rows = 10
start_dt = datetime.strptime(start_dt, '%Y%m')
df = pd.DataFrame({'date': [start_dt+relativedelta(months=+n)
for n in range(-1, number_of_rows-1)]})
print(df)
Output:
date
0 2017-12-01
1 2018-01-01
2 2018-02-01
3 2018-03-01
4 2018-04-01
5 2018-05-01
6 2018-06-01
7 2018-07-01
8 2018-08-01
9 2018-09-01
As you can see, each iteration of the loop adds the corresponding number of months (starting at -1) to the initial datetime.
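When the dates are simply consecutive months, pandas can generate them without dateutil. A minimal sketch, assuming the same start_dt and number_of_rows as above, using pd.date_range with the month-start frequency 'MS':

```python
import pandas as pd

start_dt = pd.to_datetime('201801', format='%Y%m')
number_of_rows = 10

# Begin one month before start_dt and step by month start ('MS')
dates = pd.date_range(start=start_dt - pd.DateOffset(months=1),
                      periods=number_of_rows, freq='MS')
df = pd.DataFrame({'date': dates})
print(df)
```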

python to_date wrong values

Command:
dataframe.date.head()
Result:
0 12-Jun-98
1 7-Aug-2005
2 28-Aug-66
3 11-Sep-1954
4 9-Oct-66
5 NaN
Command:
pd.to_datetime(dataframe.date.head())
Result:
0 1998-06-12 00:00:00
1 2005-08-07 00:00:00
2 2066-08-28 00:00:00
3 1954-09-11 00:00:00
4 2066-10-09 00:00:00
5 NaN
I don't want to get 2066; it should be 1966. What should I do?
The years are supposed to range from 1920 to 2017. The dataframe contains null values.

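You can subtract 100 years where dt.year is greater than 2017. Use pd.DateOffset(years=100) for the shift, since it moves the date by exact calendar years (a Timedelta with unit='Y' is an average-length year, which shifts the time of day, and recent pandas versions reject it):
df['date'] = pd.to_datetime(df['date'])
df['date'] = df['date'].mask(df['date'].dt.year > 2017,
df['date'] - pd.DateOffset(years=100))
print (df)
date
0 1998-06-12
1 2005-08-07
2 1966-08-28
3 1954-09-11
4 1966-10-09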
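The sample mixes two-digit and four-digit years, which some pandas versions will not parse in a single to_datetime call. The following is a sketch, assuming the column contains only the two layouts shown in the question and English month abbreviations; the variable names are illustrative. It parses each layout explicitly and then shifts any year after 2017 back by exactly 100 calendar years:

```python
import pandas as pd

raw = pd.Series(['12-Jun-98', '7-Aug-2005', '28-Aug-66',
                 '11-Sep-1954', '9-Oct-66', None])

# Parse each layout separately; rows that do not match become NaT
two_digit = pd.to_datetime(raw, format='%d-%b-%y', errors='coerce')
four_digit = pd.to_datetime(raw, format='%d-%b-%Y', errors='coerce')

# Prefer the four-digit parse, fall back to the two-digit one
parsed = four_digit.fillna(two_digit)

# %y maps 00-68 to 20xx, so move anything after 2017 back a century
too_late = parsed.dt.year > 2017
parsed = parsed.mask(too_late, parsed - pd.DateOffset(years=100))
print(parsed)
```

Missing values stay NaT throughout, because the year comparison is False for them.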

Pandas AVG () function between two date columns

I have a df like this:
Client Status Dat_Start Dat_End
1 A 2015-01-01 2015-01-19
1 B 2016-01-01 2016-02-02
1 A 2015-02-12 2015-02-20
1 B 2016-01-30 2016-03-01
I'd like to get the average difference between the two dates (Dat_End and Dat_Start) for Status='A', grouped by the Client column, using pandas syntax.
In SQL it would be something like:
Select Client, AVG (Dat_end-Dat_Start) as Date_Diff
from Table
where Status='A'
Group by Client
Thanks!

Calculate the timedeltas:
df['duration'] = df.Dat_End-df.Dat_Start
df
Out[92]:
Client Status Dat_Start Dat_End duration
0 1 A 2015-01-01 2015-01-19 18 days
1 1 B 2016-01-01 2016-02-02 32 days
2 1 A 2015-02-12 2015-02-20 8 days
3 1 B 2016-01-30 2016-03-01 31 days
Filter, then ask for sum and count (for pandas < 0.20):
df[df.Status=='A'].groupby('Client').duration.agg(['sum', 'count'])
Out[98]:
sum count
Client
1 26 days 2
From pandas 0.20 on, mean is available on groupby for timedeltas, so this will work:
df[df.Status=='A'].groupby('Client').duration.mean()

In [10]: df.loc[df.Status == 'A'].groupby('Client') \
.apply(lambda x: (x.Dat_End-x.Dat_Start).mean()).reset_index()
Out[10]:
Client 0
0 1 13 days
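Putting the pieces together, here is a self-contained sketch of the SQL-like query on the sample data. It assumes a pandas version where groupby mean supports timedeltas; Date_Diff is the alias from the question's SQL:

```python
import pandas as pd

df = pd.DataFrame({
    'Client': [1, 1, 1, 1],
    'Status': ['A', 'B', 'A', 'B'],
    'Dat_Start': pd.to_datetime(['2015-01-01', '2016-01-01', '2015-02-12', '2016-01-30']),
    'Dat_End': pd.to_datetime(['2015-01-19', '2016-02-02', '2015-02-20', '2016-03-01']),
})

# Dat_End - Dat_Start for every row
df['duration'] = df['Dat_End'] - df['Dat_Start']

# WHERE Status = 'A' ... GROUP BY Client ... AVG(duration) AS Date_Diff
date_diff = (df.loc[df['Status'] == 'A']
               .groupby('Client')['duration']
               .mean()
               .rename('Date_Diff')
               .reset_index())
print(date_diff)
```

For client 1 the two status A rows last 18 and 8 days, so the mean is 13 days.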

PYTHON: Pandas datetime index range to change column values

I have a dataframe indexed using a 12hr frequency datetime:
id mm ls
date
2007-09-27 00:00:00 1 0 0
2007-09-27 12:00:00 1 0 0
2007-09-28 00:00:00 1 15 0
2007-09-28 12:00:00 NaN NaN 0
2007-09-29 00:00:00 NaN NaN 0
Timestamp('2007-09-27 00:00:00', offset='12H')
I use column 'ls' as a binary variable with default value '0' using:
data['ls'] = 0
I have a list of days in the form '2007-09-28' from which I wish to update all 'ls' values from 0 to 1.
id mm ls
date
2007-09-27 00:00:00 1 0 0
2007-09-27 12:00:00 1 0 0
2007-09-28 00:00:00 1 15 1
2007-09-28 12:00:00 NaN NaN 1
2007-09-29 00:00:00 NaN NaN 0
Timestamp('2007-09-27 00:00:00', offset='12H')
I understand how this can be done using another column, i.e.:
data.loc[data.id == 1, 'ls'] = 1
yet this does not work with the datetime index.
Could you let me know what the method for a datetime index is?

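You have a list of days in the form '2007-09-28':
days = ['2007-09-28', ...]
You can then modify your df using .loc with a boolean mask (this avoids chained assignment, which may silently fail to update df):
df.loc[pd.DatetimeIndex(df.index.date).isin(days), 'ls'] = 1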
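A self-contained sketch of this approach on the sample data from the question. As a small variation, it builds the mask with index.normalize(), which sets every timestamp to midnight so it can be compared with the parsed day list:

```python
import pandas as pd

idx = pd.to_datetime(['2007-09-27 00:00:00', '2007-09-27 12:00:00',
                      '2007-09-28 00:00:00', '2007-09-28 12:00:00',
                      '2007-09-29 00:00:00'])
data = pd.DataFrame({'id': [1, 1, 1, None, None],
                     'mm': [0, 0, 15, None, None]}, index=idx)
data.index.name = 'date'
data['ls'] = 0

days = ['2007-09-28']

# True for every row whose calendar day is in the list
mask = data.index.normalize().isin(pd.to_datetime(days))

# Single .loc assignment: row selector and column label together
data.loc[mask, 'ls'] = 1
print(data)
```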
