How to count by time frequency using groupby - pandas - python

I'm trying to count the frequency of two events by month using two columns from my df. What I have done so far counts every event by its unique timestamp, which is not useful because there are too many results. I want to create a graph from the results afterwards.
I've tried adapting my code based on the answers to these SO questions:
How to groupby time series by 10 minutes using pandas?
Counting frequency of occurrence by month-year using python panda
Pandas Groupby using time frequency
but I cannot get the command working when I pass freq='day' inside the groupby command.
My code is:
print(df.groupby(['Priority', 'Create Time']).Priority.count())
which initially produced something like 170000 results in the structure of the following:
Priority Create Time
1.0 2011-01-01 00:00:00 1
2011-01-01 00:01:11 1
2011-01-01 00:02:10 1
...
2.0 2011-01-01 00:01:25 1
2011-01-01 00:01:35 1
...
But now for some reason (I'm using Jupyter Notebook) it only produces:
Priority Create Time
1.0 2011-01-01 00:00:00 1
2011-01-01 00:01:11 1
2011-01-01 00:02:10 1
2.0 2011-01-01 00:01:25 1
2011-01-01 00:01:35 1
Name: Priority, dtype: int64
No idea why the output has changed to only 5 results (maybe I unknowingly changed something).
I would like the results to be in the following format:
Priority month Count
1.0 2011-01 a
2011-02 b
2011-03 c
...
2.0 2011-01 x
2011-02 y
2011-03 z
...
Top points for showing how to change the frequency correctly for other values as well, for example hour/day/month/year. In your answers, please could you explain what is going on in your code, as I am new to pandas and wish to understand the process. Thank you.

One possible solution is to convert the datetime column to monthly periods with Series.dt.to_period:
print(df.groupby(['Priority', df['Create Time'].dt.to_period('m')]).Priority.count())
Or use Grouper:
print(df.groupby(['Priority', pd.Grouper(key='Create Time', freq='MS')]).Priority.count())
Sample:
np.random.seed(123)
df = pd.DataFrame({'Create Time':pd.date_range('2019-01-01', freq='10D', periods=10),
'Priority':np.random.choice([0,1], size=10)})
print (df)
Create Time Priority
0 2019-01-01 0
1 2019-01-11 1
2 2019-01-21 0
3 2019-01-31 0
4 2019-02-10 0
5 2019-02-20 0
6 2019-03-02 0
7 2019-03-12 1
8 2019-03-22 1
9 2019-04-01 0
print(df.groupby(['Priority', df['Create Time'].dt.to_period('m')]).Priority.count())
Priority Create Time
0 2019-01 3
2019-02 2
2019-03 1
2019-04 1
1 2019-01 1
2019-03 2
Name: Priority, dtype: int64
print(df.groupby(['Priority', pd.Grouper(key='Create Time', freq='MS')]).Priority.count())
Priority Create Time
0 2019-01-01 3
2019-02-01 2
2019-03-01 1
2019-04-01 1
1 2019-01-01 1
2019-03-01 2
Name: Priority, dtype: int64
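The same pattern extends to other granularities. A minimal sketch, reusing the sample frame above: to_period takes period aliases ('H' hour, 'D' day, 'M' month, 'Y' year), while Grouper takes offset aliases ('D' day, 'MS' month start, 'YS' year start).

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({'Create Time': pd.date_range('2019-01-01', freq='10D', periods=10),
                   'Priority': np.random.choice([0, 1], size=10)})

# to_period uses period aliases: 'H' = hour, 'D' = day, 'M' = month, 'Y' = year
by_day = df.groupby(['Priority', df['Create Time'].dt.to_period('D')]).Priority.count()
by_year = df.groupby(['Priority', df['Create Time'].dt.to_period('Y')]).Priority.count()

# Grouper uses offset aliases instead: 'D' = day, 'MS' = month start, 'YS' = year start
by_year_g = df.groupby(['Priority', pd.Grouper(key='Create Time', freq='YS')]).Priority.count()

print(by_year)
```

Whichever alias you pick, each original row lands in exactly one bucket, so the counts always sum back to the number of rows.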

Related

int64 to HHMM string

I have the following data frame where the column hour shows hours of the day in int64 form. I'm trying to convert that into a time format; so that hour 1 would show up as '01:00'. I then want to add this to the date column and convert it into a timestamp index.
Using the datetime function in pandas resulted in the column "hr2", which is not what I need. I'm not sure I can even apply datetime directly, as the original data (i.e. in column "hr") is not really a date time format to begin with. Google searches so far have been unproductive.
While I am still in the dark concerning the format of your date column, I will assume the Date column is a string object and the Hr column is an int64 object. To create the column TimeStamp in pandas Timestamp format, this is how I would proceed.
Given df:
Date Hr
0 12/01/2010 1
1 12/01/2010 2
2 12/01/2010 3
3 12/01/2010 4
4 12/02/2010 1
5 12/02/2010 2
6 12/02/2010 3
7 12/02/2010 4
df['TimeStamp'] = df.apply(lambda row: pd.to_datetime(row['Date']) + pd.to_timedelta(row['Hr'], unit='h'), axis=1)
yields:
Date Hr TimeStamp
0 12/01/2010 1 2010-12-01 01:00:00
1 12/01/2010 2 2010-12-01 02:00:00
2 12/01/2010 3 2010-12-01 03:00:00
3 12/01/2010 4 2010-12-01 04:00:00
4 12/02/2010 1 2010-12-02 01:00:00
5 12/02/2010 2 2010-12-02 02:00:00
6 12/02/2010 3 2010-12-02 03:00:00
7 12/02/2010 4 2010-12-02 04:00:00
The timestamp column can then be used as your index.
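The row-by-row apply works, but on larger frames the whole thing can be done in one vectorized pass, since both pd.to_datetime and pd.to_timedelta accept entire columns. A minimal sketch on the same shape of data:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['12/01/2010', '12/01/2010', '12/02/2010'],
                   'Hr': [1, 2, 3]})

# Convert the whole Date column once, add the whole Hr column as hours once:
# no per-row lambda needed
df['TimeStamp'] = pd.to_datetime(df['Date']) + pd.to_timedelta(df['Hr'], unit='h')
print(df)
```

This produces the same TimeStamp column as the apply version and avoids calling to_datetime once per row.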

Calculating readmission rate

I am fairly new to Python and I am trying to calculate if a patient was readmitted to the hospital within 30 days or not.
The data is in the form of a Pandas dataframe with columns for Patient Id, Arrival Date, Departure Date and Status (Discharged, Admitted, Did Not Wait). The question is similar to this past question, with the same requirements, but I need the code in Python.
Calculate readmission rate
I only need one column of readmission (30 day readmission status). Any help in the code's translation is appreciated. Thanks in advance.
@anky_91 Please do correct me if I am wrong in my understanding.
You can use the below:
df.groupby('Patient').apply(lambda x : (x['Admission Date'].\
shift(-1)-x['Discharge date']).dt.days.le(30).astype(int)).reset_index(drop=True)
Full code:
Considering the df looks like:
Visit Patient Admission Date Discharge date
0 1 1 2015-01-01 2015-01-02
1 2 2 2015-01-01 2015-01-01
2 3 3 2015-01-01 2015-01-02
3 4 1 2015-01-09 2015-01-09
4 5 2 2015-04-01 2015-04-05
5 6 1 2015-05-01 2015-05-01
df[['Admission Date','Discharge date']] = df[['Admission Date','Discharge date']].apply(pd.to_datetime)
df = df.sort_values(['Patient','Admission Date']) # Thanks @Jondiedoop
df['Readmit30'] = df.groupby('Patient').apply(lambda x: (x['Admission Date'].\
shift(-1)-x['Discharge date']).dt.days.le(30).astype(int)).reset_index(0).drop('Patient', axis=1)
print(df)
Visit Patient Admission Date Discharge date Readmit30
0 1 1 2015-01-01 2015-01-02 1
3 4 1 2015-01-09 2015-01-09 0
5 6 1 2015-05-01 2015-05-01 0
1 2 2 2015-01-01 2015-01-01 0
4 5 2 2015-04-01 2015-04-05 0
2 3 3 2015-01-01 2015-01-02 0
You can also try this one (I don't know why the solution above was giving false readmission flags for me). After sorting on visit_start_date:
visits_pandas_df.groupby('PatientId').apply(lambda x: (((x['visit_start_date'].shift(-1)-x['visit_end_date']).dt.days.shift(1).le(30)) ).astype(int)).values
Visits that are only a day apart are not counted as readmissions, so you will also need to check for that in your logic.
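Note that the answers above flag the visit that is *followed by* a readmission. If you instead want to flag the visit that *is* the readmission (admitted within 30 days of the same patient's previous discharge), a shift without the extra re-shift is enough. A minimal sketch with a hypothetical IsReadmission column:

```python
import pandas as pd

df = pd.DataFrame({
    'Patient': [1, 1, 1, 2],
    'Admission Date': pd.to_datetime(['2015-01-01', '2015-01-09', '2015-05-01', '2015-01-01']),
    'Discharge date': pd.to_datetime(['2015-01-02', '2015-01-09', '2015-05-01', '2015-01-01']),
})
df = df.sort_values(['Patient', 'Admission Date'])

# Gap between this admission and the same patient's previous discharge;
# the first visit per patient has no previous discharge (NaT), so le(30) is False
gap = df['Admission Date'] - df.groupby('Patient')['Discharge date'].shift()
df['IsReadmission'] = gap.dt.days.le(30).astype(int)
print(df)
```

Here only patient 1's second visit (7 days after the first discharge) is flagged; the first visit of each patient and any visit more than 30 days out get 0.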

Pandas Difference Between Dates in Months

I have a dataframe date column with the below values:
2015-01-01
2015-02-01
2015-03-01
2015-07-01
2015-08-01
2015-10-01
2015-11-01
2016-02-01
I want to find the difference between these values in months, as below:
date_dt diff_mnts
2015-01-01 0
2015-02-01 1
2015-03-01 1
2015-07-01 4
2015-08-01 1
2015-10-01 2
2015-11-01 1
2016-02-01 3
I tried using the diff() method to calculate the days and then converting with astype('timedelta64[M]'), but when the gap is less than 30 days it shows a month difference of 0. Please let me know if there is an easy built-in function I can try in this case.
Option 1
Change the period and call diff.
df
Date
0 2015-01-01
1 2015-02-01
2 2015-03-01
3 2015-07-01
4 2015-08-01
5 2015-10-01
6 2015-11-01
7 2016-02-01
df.Date.dtype
dtype('<M8[ns]')
df.Date.dt.to_period('M').diff().fillna(0)
0 0
1 1
2 1
3 4
4 1
5 2
6 1
7 3
Name: Date, dtype: int64
Option 2
Alternatively, call diff on dt.month, but you'll need to account for gaps of over a year (solution improved thanks to @galaxyan!):
i = df.Date.dt.year.diff() * 12
j = df.Date.dt.month.diff()
(i + j).fillna(0).astype(int)
0 0
1 1
2 1
3 4
4 1
5 2
6 1
7 3
Name: Date, dtype: int64
The caveat (thanks for spotting it) is that the month-only diff, without the year term, would not work for gaps of over a year.
Try the following steps:
1. Cast the column into datetime format.
2. Use the .dt.month attribute to get the month number.
3. Use the shift() method in pandas to calculate the difference.
Example code will look something like this:
df['diff_mnts'] = df['date_dt'].dt.month - df['date_dt'].shift().dt.month
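As noted above, a month-only subtraction breaks as soon as consecutive dates straddle a year boundary. A minimal self-contained sketch of the year-aware version (each year of gap contributes 12 months, the month diff covers the remainder):

```python
import pandas as pd

s = pd.to_datetime(pd.Series(['2015-01-01', '2015-02-01', '2015-11-01', '2016-02-01']))

# year.diff() * 12 converts whole-year gaps to months; month.diff() may be
# negative across a year boundary (e.g. Feb - Nov = -9), and the two terms
# combine to the true month count (12 - 9 = 3)
diff_mnts = (s.dt.year.diff() * 12 + s.dt.month.diff()).fillna(0).astype(int)
print(diff_mnts.tolist())
```

This gives 3 months for the 2015-11 to 2016-02 step, where a bare month diff would give -9.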

Grouping by date range with pandas

I am looking to group by two columns: user_id and date; however, if the dates are close enough, I want to be able to consider the two entries part of the same group and group accordingly. Date is m-d-y
user_id date val
1 1-1-17 1
2 1-1-17 1
3 1-1-17 1
1 1-1-17 1
1 1-2-17 1
2 1-2-17 1
2 1-10-17 1
3 2-1-17 1
The grouping would group by user_id and dates +/- 3 days from each other. so the group by summing val would look like:
user_id date sum(val)
1 1-2-17 3
2 1-2-17 2
2 1-10-17 1
3 1-1-17 1
3 2-1-17 1
Can anyone think of a way this could be done (somewhat) easily? I know there are some problematic aspects, for example what to do if the dates string together endlessly, each three days apart, but the exact data I'm using only has two values per person.
Thanks!
I'd convert this to a datetime column and then use pd.TimeGrouper:
dates = pd.to_datetime(df.date, format='%m-%d-%y')
print(dates)
0 2017-01-01
1 2017-01-01
2 2017-01-01
3 2017-01-01
4 2017-01-02
5 2017-01-02
6 2017-01-10
7 2017-02-01
Name: date, dtype: datetime64[ns]
df = (df.assign(date=dates).set_index('date')
.groupby(['user_id', pd.TimeGrouper('3D')])
.sum()
.reset_index())
print(df)
user_id date val
0 1 2017-01-01 3
1 2 2017-01-01 2
2 2 2017-01-10 1
3 3 2017-01-01 1
4 3 2017-01-31 1
Similar solution using pd.Grouper:
df = (df.assign(date=dates)
.groupby(['user_id', pd.Grouper(key='date', freq='3D')])
.sum()
.reset_index())
print(df)
user_id date val
0 1 2017-01-01 3
1 2 2017-01-01 2
2 2 2017-01-10 1
3 3 2017-01-01 1
4 3 2017-01-31 1
Update: TimeGrouper will be deprecated in future versions of pandas, so Grouper would be preferred in this scenario (thanks for the heads up, Vaishali!).
I came up with a very ugly solution, but it still works...
df=df.sort_values(['user_id','date'])
df['Key']=df.sort_values(['user_id','date']).groupby('user_id')['date'].diff().dt.days.lt(3).ne(True).cumsum()
df.groupby(['user_id','Key'],as_index=False).agg({'val':'sum','date':'first'})
Out[586]:
user_id Key val date
0 1 1 3 2017-01-01
1 2 2 2 2017-01-01
2 2 3 1 2017-01-10
3 3 4 1 2017-01-01
4 3 5 1 2017-02-01
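The Key column built above is easier to follow broken into its three steps on a single user's dates. A minimal sketch: diff measures the day gap to the previous row, lt(3).ne(True) marks rows that start a new group (the first row, whose gap is NaT, and any gap of 3 or more days), and cumsum turns those markers into a running group label.

```python
import pandas as pd

s = pd.to_datetime(pd.Series(['2017-01-01', '2017-01-02', '2017-01-10']))

gaps = s.diff().dt.days           # NaN for the first row, then 1, 8
new_group = gaps.lt(3).ne(True)   # True where a new group starts (NaN or gap >= 3)
key = new_group.cumsum()          # running count of group starts = group label
print(key.tolist())
```

The first two dates share a label because they are a day apart; the third date, 8 days later, starts a new group.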

How to calculate time difference by group using pandas?

Problem
I want to calculate diff by group, and I don't know how to sort the time column so that each group's results are sorted and positive.
The original data :
In [37]: df
Out[37]:
id time
0 A 2016-11-25 16:32:17
1 A 2016-11-25 16:36:04
2 A 2016-11-25 16:35:29
3 B 2016-11-25 16:35:24
4 B 2016-11-25 16:35:46
The result I want
Out[40]:
id time
0 A 00:35
1 A 03:12
2 B 00:22
Note: the type of the time column is timedelta64[ns].
Trying
In [38]: df['time'].diff(1)
Out[38]:
0 NaT
1 00:03:47
2 -1 days +23:59:25
3 -1 days +23:59:55
4 00:00:22
Name: time, dtype: timedelta64[ns]
This doesn't give the desired result.
Hope
I hope not only to solve the problem but also to have the code run fast, because there are 50 million rows.
You can use sort_values with groupby and aggregating diff:
df['diff'] = df.sort_values(['id','time']).groupby('id')['time'].diff()
print (df)
id time diff
0 A 2016-11-25 16:32:17 NaT
1 A 2016-11-25 16:36:04 00:00:35
2 A 2016-11-25 16:35:29 00:03:12
3 B 2016-11-25 16:35:24 NaT
4 B 2016-11-25 16:35:46 00:00:22
If need remove rows with NaT in column diff use dropna:
df = df.dropna(subset=['diff'])
print (df)
id time diff
2 A 2016-11-25 16:35:29 00:03:12
1 A 2016-11-25 16:36:04 00:00:35
4 B 2016-11-25 16:35:46 00:00:22
You can also overwrite column:
df.time = df.sort_values(['id','time']).groupby('id')['time'].diff()
print (df)
id time
0 A NaT
1 A 00:00:35
2 A 00:03:12
3 B NaT
4 B 00:00:22
df.time = df.sort_values(['id','time']).groupby('id')['time'].diff()
df = df.dropna(subset=['time'])
print (df)
id time
1 A 00:00:35
2 A 00:03:12
4 B 00:00:22
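Since the diff column comes back as timedelta64[ns], downstream numeric work or plotting often wants plain seconds. A minimal sketch, chaining the sort/groupby/diff from above with dropna and total_seconds (the sample frame here is an assumption matching the question's data):

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['A', 'A', 'A', 'B', 'B'],
    'time': pd.to_datetime(['2016-11-25 16:32:17', '2016-11-25 16:36:04',
                            '2016-11-25 16:35:29', '2016-11-25 16:35:24',
                            '2016-11-25 16:35:46']),
})

# Sort within each id, take consecutive differences, drop each group's leading NaT,
# then convert the timedeltas to float seconds
diffs = df.sort_values(['id', 'time']).groupby('id')['time'].diff().dropna()
seconds = diffs.dt.total_seconds()
print(seconds.tolist())
```

All the heavy lifting here is vectorized groupby machinery, so the same chain scales to tens of millions of rows far better than any per-group Python loop would.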
