Calculating rolling retention with Python [duplicate] - python

This question already has answers here:
Python: get a frequency count based on two columns (variables) in pandas dataframe some row appears
(3 answers)
Closed 3 years ago.
I am having trouble calculating rolling retention.
I was trying to figure out how to make groupby work, but it seems to be suited only to calculating classic retention.
Rolling retention: count the number of users from each cohort who logged in during the given month OR later.
data = {'id':[1, 1, 1, 2, 2, 2, 2, 3, 3],
'group_month': ['2013-05', '2013-05', '2013-05', '2013-06', '2013-06', '2013-06', '2013-06', '2013-06', '2013-06'],
'login_month': ['2013-05', '2013-06', '2013-07', '2013-06', '2013-07', '2013-09', '2013-10', '2013-09', '2013-10']}
Transforming data:
import pandas as pd

data = pd.DataFrame(data)
data['group_month'] = pd.to_datetime(data['group_month'], format='%Y-%m', errors='coerce')
data['login_month'] = pd.to_datetime(data['login_month'], format='%Y-%m', errors='coerce')
To calculate classic retention (count of users from each cohort who logged in during the exact month), I used the following code:
classic_ret = pd.DataFrame(data[(data['login_month'] >= data['group_month'])].groupby(['group_month', 'login_month'])['id'].count())
classic_ret.unstack()
Rolling retention should have the following output:
+-------------+---------+---------+---------+---------+---------+---------+
| group_month | 2013-05 | 2013-06 | 2013-07 | 2013-08 | 2013-09 | 2013-10 |
+-------------+---------+---------+---------+---------+---------+---------+
| 2013-05 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2013-06 | 0 | 1 | 1 | 1 | 2 | 2 |
+-------------+---------+---------+---------+---------+---------+---------+

With crosstab, I could only manage the table below.
a = data.set_index('login_month').groupby('id').resample('M').last().ffill().drop('id', axis=1).reset_index()
pd.crosstab(a.group_month, a.login_month)
Output
login_month 2013-05-31 2013-06-30 2013-07-31 2013-08-31 2013-09-30 2013-10-31
group_month
2013-05-01 1 1 1 0 0 0
2013-06-01 0 1 1 1 2 2
However, you can get the values you need as follows.
a = data.set_index('login_month').groupby('id').resample('M').last().ffill().drop('id', axis=1).reset_index()
pd.DataFrame(a[(a['login_month'] >= a['group_month'])].groupby(['group_month', 'login_month'])['id'].count()).unstack().fillna(method='ffill',axis=1).fillna(value=0)
Output
login_month 2013-05-31 2013-06-30 2013-07-31 2013-08-31 2013-09-30 2013-10-31
group_month
2013-05-01 1.0 1.0 1.0 1.0 1.0 1.0
2013-06-01 0.0 1.0 1.0 1.0 2.0 2.0
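If you want integer counts and plain 'YYYY-MM' labels instead of floats and month-end timestamps, a small clean-up sketch on top of the result above (assuming the a DataFrame from the previous step) could look like this:
ret = (pd.DataFrame(a[a['login_month'] >= a['group_month']]
                    .groupby(['group_month', 'login_month'])['id'].count())
       .unstack()
       .fillna(method='ffill', axis=1)
       .fillna(value=0)
       .astype(int))
# flatten the ('id', <month-end timestamp>) column MultiIndex into 'YYYY-MM' labels
ret.columns = [ts.strftime('%Y-%m') for ts in ret.columns.get_level_values(1)]
ret.index = ret.index.strftime('%Y-%m')
print(ret)
This is only cosmetic; the retention numbers are the same as in the output above.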

Related

how to fill date column in one dataframe with nearest dates from another dataframe

I have a dataframe visit =
visit_occurrence_id visit_start_date person_id
1 2016-06-01 1
2 2019-05-01 2
3 2016-01-22 1
4 2017-02-14 2
5 2018-05-11 3
and another dataframe measurement =
measurement_date person_id visit_occurrence_id
2017-09-04 1 NaN
2018-04-24 2 NaN
2018-05-22 2 NaN
2019-02-02 1 NaN
2019-01-28 3 NaN
2019-05-07 1 NaN
2018-12-11 3 NaN
2017-04-28 3 NaN
I want to fill visit_occurrence_id in the measurement table with the visit_occurrence_id from the visit table, matching on person_id and the nearest possible date.
I have written code for this, but it takes a lot of time.
measurement has 7*10^5 rows.
Note: visit_start_date and measurement_date are of object dtype.
My code:
import datetime as dt

unique_person_list = measurement['person_id'].unique().tolist()

def nearest_date(row, date_list):
    date_list = [dt.datetime.strptime(date, '%Y-%m-%d').date() for date in date_list]
    row = min(date_list, key=lambda x: abs(x - row))
    return row

modified_measurement = pd.DataFrame(columns=measurement.columns)

for person in unique_person_list:
    near_visit_dates = visit[visit['person_id'] == person]['visit_start_date'].tolist()
    if near_visit_dates:
        near_visit_dates = list(filter(None, near_visit_dates))
        near_visit_dates = [i.strftime('%Y-%m-%d') for i in near_visit_dates]
        store_dates = measurement.loc[measurement['person_id'] == person]['measurement_date']
        store_dates = store_dates.apply(nearest_date, args=(near_visit_dates,))
        modified_measurement = modified_measurement.append(store_dates)
My code's execution time is quite high. Can you help me either reduce the time complexity or suggest another solution?
Edit: adding DataFrame constructors.
import numpy as np
measurement = {'measurement_date':['2017-09-04', '2018-04-24', '2018-05-22', '2019-02-02',
'2019-01-28', '2019-05-07', '2018-12-11','2017-04-28'],
'person_id':[1, 2, 2, 1, 3, 1, 3, 3],'visit_occurrence_id':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]}
visit = {'visit_occurrence_id':[1, 2, 3, 4, 5],
'visit_start_date':['2016-06-01', '2019-05-01', '2016-01-22', '2017-02-14', '2018-05-11'],
'person_id':[1, 2, 1, 2, 3]}
# Create DataFrame
measurement = pd.DataFrame(measurement)
visit = pd.DataFrame(visit)
You can do the following:
import datetime

df = pd.merge(measurement[["person_id", "measurement_date"]], visit, on="person_id", how="inner")
df["dt_diff"] = df[["visit_start_date", "measurement_date"]].apply(
    lambda x: abs(datetime.datetime.strptime(x["visit_start_date"], '%Y-%m-%d').date()
                  - datetime.datetime.strptime(x["measurement_date"], '%Y-%m-%d').date()),
    axis=1)
df = pd.merge(df, df.groupby(["person_id", "measurement_date"])["dt_diff"].min(),
              on=["person_id", "dt_diff", "measurement_date"], how="inner")
res = pd.merge(measurement, df, on=["measurement_date", "person_id"],
               suffixes=["", "_2"])[["measurement_date", "person_id", "visit_occurrence_id_2"]]
Output:
measurement_date person_id visit_occurrence_id_2
0 2017-09-04 1 1
1 2018-04-24 2 2
2 2018-05-22 2 2
3 2019-02-02 1 1
4 2019-01-28 3 5
5 2019-05-07 1 1
6 2018-12-11 3 5
7 2017-04-28 3 5
Here's what I've come up with:
# The dates are object dtype in the question, so convert them before subtracting
measurement['measurement_date'] = pd.to_datetime(measurement['measurement_date'])
visit['visit_start_date'] = pd.to_datetime(visit['visit_start_date'])
# Get all visit start dates for each measurement
df = measurement.drop('visit_occurrence_id', axis=1).merge(visit, on='person_id')
df['date_difference'] = abs(df.measurement_date - df.visit_start_date)
# Find the smallest date difference for each person_id - measurement_date pair
df['smallest_difference'] = df.groupby(['person_id', 'measurement_date'])['date_difference'].transform(min)
df = df[df.date_difference == df.smallest_difference]
df = df[['measurement_date', 'person_id', 'visit_occurrence_id']]
# Fill in visit_occurrence_id from the original dataframe
measurement.drop("visit_occurrence_id", axis=1).merge(
    df, on=["measurement_date", "person_id"]
)
This produces:
| | measurement_date | person_id | visit_occurrence_id |
|---:|:-------------------|------------:|----------------------:|
| 0 | 2017-09-04 | 1 | 1 |
| 1 | 2018-04-24 | 2 | 2 |
| 2 | 2018-05-22 | 2 | 2 |
| 3 | 2019-02-02 | 1 | 1 |
| 4 | 2019-01-28 | 3 | 5 |
| 5 | 2019-05-07 | 1 | 1 |
| 6 | 2018-12-11 | 3 | 5 |
| 7 | 2017-04-28 | 3 | 5 |
I believe there's probably a cleaner way of writing this using sklearn: https://scikit-learn.org/stable/modules/neighbors.html
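Another option worth sketching here (my suggestion, not part of the answers above) is pd.merge_asof with direction='nearest', which does the nearest-date matching in one vectorized pass; both frames must be sorted by the date key, and the dates must be datetime dtype:
# a sketch using merge_asof, assuming the measurement/visit frames built above
measurement['measurement_date'] = pd.to_datetime(measurement['measurement_date'])
visit['visit_start_date'] = pd.to_datetime(visit['visit_start_date'])

res = pd.merge_asof(
    measurement.drop(columns='visit_occurrence_id').sort_values('measurement_date'),
    visit.sort_values('visit_start_date'),
    left_on='measurement_date',
    right_on='visit_start_date',
    by='person_id',           # match only within the same person
    direction='nearest')      # nearest visit date, whether earlier or later
Because it avoids a row-wise apply, this should also scale reasonably to ~7*10^5 measurement rows; note the result comes back sorted by measurement_date rather than in the original row order.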

How to copy values from one df to the original df with a certain condition?

Currently I am working on a clustering problem, and I have trouble copying values from one dataframe to the original dataframe.
  | CustomerID | Date       | Time     | TotalSum | CohortMonth | CohortIndex
------------------------------------------------------------------------------
0 | 17850.0    | 2017-11-29 | 08:26:00 | 15.30    | 2017-11-01  | 1
1 | 17850.0    | 2017-11-29 | 08:26:00 | 20.34    | 2017-11-01  | 1
2 | 17850.0    | 2017-11-29 | 08:26:00 | 22.00    | 2017-11-01  | 1
3 | 17850.0    | 2017-11-29 | 08:26:00 | 20.34    | 2017-11-01  | 1
And the dataframe with values (clusters) to copy:
CustomerID | Cluster
--------------------
12346.0    | 1
12346.0    | 1
12346.0    | 1
Please help me with the problem: how to copy values from the second df to the first dataframe, matching on CustomerID.
I tried code like this:
df.merge(ic,left_on='CustomerID',right_on='Cluster',how='left').drop('CustomerID',1).fillna('')
But it doesn't work and I get an error...
Besides, I tried a version of code like this:
df, ic = [d.reset_index(drop=True) for d in (df, ic)]
ic.join(df[['CustomerID']])
But it gives the same error, or an error like 'CustomerID' not in df...
Sorry if this is an unclear and badly formatted question... it is my first question on Stack Overflow. Thank you all.
UPDATE
I have tried this
df1 = df.merge(ic, left_on='CustomerID', right_on='Cluster', how='left')
if ic['CustomerID'].values != df1['CustomerID_x'].values:
    df1.Cluster = ic.Cluster
else:
    df1.Cluster = 'NaN'
But I got different clusters for the same customer.
  | CustomerID_x | Date       | Time     | TotalSum | CohortMonth | CohortIndex | CustomerID_y | Cluster
----------------------------------------------------------------------------------------------------------
0 | 17850.0      | 2017-11-29 | 08:26:00 | 15.30    | 2017-11-01  | 1           | NaN          | 1.0
1 | 17850.0      | 2017-11-29 | 08:26:00 | 20.34    | 2017-11-01  | 1           | NaN          | 0.0
2 | 17850.0      | 2017-11-29 | 08:26:00 | 22.00    | 2017-11-01  | 1           | NaN          | 1.0
3 | 17850.0      | 2017-11-29 | 08:26:00 | 20.34    | 2017-11-01  | 1           | NaN          | 2.0
4 | 17850.0      | 2017-11-29 | 08:26:00 | 20.34    | 2017-11-01  | 1           | NaN          | 1.0
Given what you've written, I think you want:
>>> df1 = pd.DataFrame({"CustomerID": [17850.0] * 4, "CohortIndex": [1,1,1,1] })
>>> df1
CustomerID CohortIndex
0 17850.0 1
1 17850.0 1
2 17850.0 1
3 17850.0 1
>>> df2
CustomerID Cluster
0 12346.0 1
1 17850.0 1
2 12345.0 1
>>> pd.merge(df1, df2, 'left', 'CustomerID')
CustomerID CohortIndex Cluster
0 17850.0 1 1
1 17850.0 1 1
2 17850.0 1 1
3 17850.0 1 1
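One caveat worth adding (my observation, not from the answer above): if the cluster table repeats a CustomerID, as it does with 12346.0 in the question, a plain left merge will duplicate the matching rows of the original frame. Dropping duplicates first keeps one cluster label per customer (this assumes the cluster really is the same on every repeated row):
clusters = ic.drop_duplicates(subset='CustomerID')          # one row per customer
df_with_clusters = df.merge(clusters, on='CustomerID', how='left')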

Pandas Aggregate by Month with 2 columns as index

Sample dataframe:
Date | ID | Type 1 | Type 2 | Type 3
-----------------------------------------
2017-06-05 | 1 | 2 | 1 | 0
2017-08-05 | 1 | 0 | 1 | 0
2017-10-05 | 1 | 2 | 1 | 1
2017-06-05 | 2 | 0 | 1 | 0
2017-07-05 | 2 | 2 | 0 | 0
2017-09-15 | 3 | 0 | 0 | 5
I want to group by month such that each ID has a row per month up to its last available data. For example, in this case, ID=1 has data from the 6th to the 10th month, so ID=1 gets one row per month from the 6th through the 10th month.
Expected output for ID=1:
Date | ID | Type 1 | Type 2 | Type 3
-----------------------------------------
2017-06-05 | 1 | 2 | 1 | 0
2017-07-05 | 1 | 2 | 1 | 0
2017-08-05 | 1 | 0 | 1 | 0
2017-09-05 | 1 | 0 | 1 | 0
2017-10-05 | 1 | 2 | 1 | 1
Note that the Type columns are not summed; instead, the previous month's data fills the new row. For example, the row for month 7 uses the same data as month 6.
The scenario below is out of scope for this question: the input dataframe has multiple rows within the same month.
Date | ID | Type 1 | Type 2 | Type 3
-----------------------------------------
2017-06-05 | 1 | 2 | 1 | 0
2017-06-19 | 1 | 0 | 1 | 0
2017-10-05 | 1 | 2 | 1 | 1
2017-06-05 | 2 | 0 | 1 | 0
2017-06-25 | 2 | 2 | 0 | 0
2017-09-15 | 3 | 0 | 0 | 5
How to aggregate in this case such that each month only has a single row per ID?
The main problem is adding the day offsets back, because resampling by MS aligns everything to the start of the month:
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
#replace day with the 1st of the month
t1 = df['Date'].dt.to_period('m').dt.to_timestamp()
a = df['Date'] - t1
#create MultiIndex Series with the difference in days from the 1st day of the month
s = pd.Series(a.values, index=[df['ID'], t1])
print (s)
ID Date
1 2017-06-01 4 days
2017-08-01 4 days
2017-10-01 4 days
2 2017-06-01 4 days
2017-07-01 4 days
3 2017-09-01 14 days
dtype: timedelta64[ns]
#helper df2 with the original rows, used later to restore them with combine_first
df2 = df.set_index(['ID','Date'])
#add missing dates by resampling to start of month and forward filling NaNs
df1 = df.set_index(['Date']).groupby('ID').resample('MS').ffill()
print (df1)
ID Type 1 Type 2 Type 3
ID Date
1 2017-06-01 NaN NaN NaN NaN
2017-07-01 1.0 2.0 1.0 0.0
2017-08-01 1.0 2.0 1.0 0.0
2017-09-01 1.0 0.0 1.0 0.0
2017-10-01 1.0 0.0 1.0 0.0
2 2017-06-01 NaN NaN NaN NaN
2017-07-01 2.0 0.0 1.0 0.0
3 2017-09-01 NaN NaN NaN NaN
#align the timedeltas with the rows added in df1 by forward filling
s1 = s.reindex(df1.index, method='ffill')
print (s1)
ID Date
1 2017-06-01 4 days
2017-07-01 4 days
2017-08-01 4 days
2017-09-01 4 days
2017-10-01 4 days
2 2017-06-01 4 days
2017-07-01 4 days
3 2017-09-01 14 days
dtype: timedelta64[ns]
#create the final MultiIndex by adding the timedeltas back to the month starts
mux = [df1.index.get_level_values('ID'),
df1.index.get_level_values('Date') + s1.values]
#fill the NaN rows from the original data with combine_first
df = df1.drop('ID', 1).set_index(mux).combine_first(df2).reset_index()
print (df)
ID Date Type 1 Type 2 Type 3
0 1 2017-06-05 2.0 1.0 0.0
1 1 2017-07-05 2.0 1.0 0.0
2 1 2017-08-05 2.0 1.0 0.0
3 1 2017-09-05 0.0 1.0 0.0
4 1 2017-10-05 0.0 1.0 0.0
5 2 2017-06-05 0.0 1.0 0.0
6 2 2017-07-05 0.0 1.0 0.0
7 3 2017-09-15 0.0 0.0 5.0
EDIT:
#set days to 1
df['Date'] = df['Date'] - pd.offsets.MonthBegin()
#aggregate for unique months
df1 = df.groupby(['Date','ID']).sum()
print (df1)
Type 1 Type 2 Type 3
Date ID
2017-06-01 1 2 2 0
2 2 1 0
2017-09-01 3 0 0 5
2017-10-01 1 2 1 1
#add missing months by resample
df1 = df1.reset_index(['ID']).groupby('ID').resample('MS').ffill()
print (df1)
ID Type 1 Type 2 Type 3
ID Date
1 2017-06-01 1 2 2 0
2017-07-01 1 2 2 0
2017-08-01 1 2 2 0
2017-09-01 1 2 2 0
2017-10-01 1 2 1 1
2 2017-06-01 2 2 1 0
3 2017-09-01 3 0 0 5
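A more compact alternative (my sketch, not part of the answer above, starting again from the original df in the question) that keeps each ID's original day of month is to build a monthly date range per group and reindex it with forward fill:
def fill_months(g):
    # one row per month from the group's first to its last date,
    # keeping the original day of month
    g = g.set_index('Date').sort_index()
    idx = pd.date_range(g.index.min(), g.index.max(), freq=pd.DateOffset(months=1))
    # drop the duplicated grouping column if pandas passed it through
    return g.reindex(idx, method='ffill').drop(columns='ID', errors='ignore')

df['Date'] = pd.to_datetime(df['Date'])
out = (df.groupby('ID')
         .apply(fill_months)
         .rename_axis(['ID', 'Date'])
         .reset_index())
This sketch assumes one row per month in the input (the first scenario); for the multi-row-per-month case you would aggregate per month first, as in the EDIT above.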

Python pandas - remove groups based on NaN count threshold

I have a dataset based on different weather stations,
stationID | Time | Temperature | ...
----------+------+-------------+-------
123 | 1 | 30 |
123 | 2 | 31 |
202 | 1 | 24 |
202 | 2 | 24.3 |
202 | 3 | NaN |
...
And I would like to remove 'stationID' groups which have more than a certain number of NaNs. For instance, if I type:
>>> df.groupby('stationID')
then, I would like to drop groups that have (at least) a certain number of NaNs (say 30) within a group. As I understand it, I cannot use dropna(thresh=10) with groupby:
>>> df2.groupby('station').dropna(thresh=30)
AttributeError: Cannot access callable attribute 'dropna' of 'DataFrameGroupBy' objects...
So, what would be the best way to do that with Pandas?
IIUC you can do df2.loc[df2.groupby('station')['Temperature'].filter(lambda x: len(x[pd.isnull(x)] ) < 30).index]
Example:
In [59]:
df = pd.DataFrame({'id':[0,0,0,1,1,1,2,2,2,2], 'val':[1,1,np.nan,1,np.nan,np.nan, 1,1,1,1]})
df
Out[59]:
id val
0 0 1.0
1 0 1.0
2 0 NaN
3 1 1.0
4 1 NaN
5 1 NaN
6 2 1.0
7 2 1.0
8 2 1.0
9 2 1.0
In [64]:
df.loc[df.groupby('id')['val'].filter(lambda x: len(x[pd.isnull(x)] ) < 2).index]
Out[64]:
id val
0 0 1.0
1 0 1.0
2 0 NaN
6 2 1.0
7 2 1.0
8 2 1.0
9 2 1.0
So this will filter out the groups that have more than 1 NaN value.
You can create a column giving the number of null values per stationID, and then use loc to select the relevant data for further processing.
df['station_id_null_count'] = \
    df.groupby('stationID').Temperature.transform(lambda group: group.isnull().sum())
df.loc[df.station_id_null_count <= 30, :]  # keep stations with at most 30 NaNs
Using @EdChum's setup: since you don't mention your final output, adding this.
vals = df.groupby(['id'])['val'].apply(lambda x: (np.size(x)-x.count()) < 2 )
vals[vals]
id
0 True
2 True
Name: val, dtype: bool
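As a further option (my sketch, using the question's stationID/Temperature column names rather than the example's id/val), groupby.filter can drop the offending groups directly:
# keep only stations with fewer than 30 NaN temperatures
kept = df.groupby('stationID').filter(lambda g: g['Temperature'].isnull().sum() < 30)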

Pandas data frame: adding columns based on previous time periods

I am trying to work through a problem in pandas, being more accustomed to R.
I have a data frame df with three columns: person, period, value
df.head() or the top few rows look like:
| person | period | value
0 | P22 | 1 | 0
1 | P23 | 1 | 0
2 | P24 | 1 | 1
3 | P25 | 1 | 0
4 | P26 | 1 | 1
5 | P22 | 2 | 1
Notice the last row records a value for period 2 for person P22.
I would now like to add a new column that provides the value from the previous period. So if for P22 the value in period 1 is 0, then this new column would look like:
| person | period | value | lastperiod
5 | P22 | 2 | 1 | 0
I believe I need to do something like the following command, having loaded pandas:
for p in df.period.unique():
    df['lastperiod'] = [???]
How should this be formulated?
You could groupby person and then apply a shift to the values:
In [11]: g = df.groupby('person')
In [12]: g['value'].apply(lambda s: s.shift())
Out[12]:
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 0
dtype: float64
Adding this as a column:
In [13]: df['lastPeriod'] = g['value'].apply(lambda s: s.shift())
In [14]: df
Out[14]:
person period value lastPeriod
1 P22 1 0 NaN
2 P23 1 0 NaN
3 P24 1 1 NaN
4 P25 1 0 NaN
5 P26 1 1 NaN
6 P22 2 1 0
Here the NaNs signify missing data (i.e. there was no entry in the previous period).
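A small aside (my note, not from the answer): the lambda isn't strictly needed, since shift can be called on the grouped column directly:
df['lastPeriod'] = df.groupby('person')['value'].shift()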
