Generating a retention cohort from a pandas dataframe

I have a pandas dataframe that looks like this:
+-----------+------------------+---------------+------------+
| AccountID | RegistrationWeek | Weekly_Visits | Visit_Week |
+-----------+------------------+---------------+------------+
| ACC1 | 2015-01-25 | 0 | NaT |
| ACC2 | 2015-01-11 | 0 | NaT |
| ACC3 | 2015-01-18 | 0 | NaT |
| ACC4 | 2014-12-21 | 14 | 2015-02-12 |
| ACC5 | 2014-12-21 | 5 | 2015-02-15 |
| ACC6 | 2014-12-21 | 0 | 2015-02-22 |
+-----------+------------------+---------------+------------+
It's essentially a visit log of sorts, as it holds all the necessary data for creating a cohort analysis.
Each registration week is a cohort.
To know how many people are part of the cohort I can use:
visit_log.groupby('RegistrationWeek').AccountID.nunique()
What I want to do is create a pivot table with the registration weeks as keys. The columns should be the visit_weeks and the values should be the count of unique account ids who have more than 0 weekly visits.
Together with the total accounts in each cohort, I will then be able to show percentages instead of absolute values.
The end product would look something like this:
+-------------------+-------------+-------------+-------------+
| Registration Week | Visit_week1 | Visit_Week2 | Visit_week3 |
+-------------------+-------------+-------------+-------------+
| week1 | 70% | 30% | 20% |
| week2 | 70% | 30% | |
| week3 | 40% | | |
+-------------------+-------------+-------------+-------------+
I tried pivoting the dataframe like this:
visit_log.pivot_table(index='RegistrationWeek', columns='Visit_Week')
But I haven't nailed down the value part. I'll need to somehow count account Id and divide the sum by the registration week aggregation from above.
I'm new to pandas so if this isn't the best way to do retention cohorts, please enlighten me!
Thanks

There are several aspects to your question.
What you can build with the data you have
There are several kinds of retention. For simplicity, we’ll mention only two :
Day-N retention : if a user registered on day 0, did she log in on day N ? (Logging on day N+1 does not affect this metric). To measure it, you need to keep track of all the logs of your users.
Rolling retention : if a user registered on day 0, did she log in on day N or any day after that ? (Logging in on day N+1 affects this metric). To measure it, you just need the last known log of each of your users.
If I understand your table correctly, you have two relevant variables to build your cohort table : registration date, and last log (visit week). The number of weekly visits seems irrelevant.
So with this you can only go with option 2, rolling retention.
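To make the difference between the two definitions concrete, here is a minimal sketch. The function names and the sample dates are made up for illustration; they are not part of the question's data.

```python
import datetime as dt

def day_n_retained(signup, visits, n):
    """Day-N retention: the user was seen exactly n days after signing up."""
    return any((v - signup).days == n for v in visits)

def rolling_retained(signup, last_visit, n):
    """Rolling retention: the last known visit is n or more days after signup."""
    return (last_visit - signup).days >= n

# Hypothetical user: signed up on Jan 1, visited on Jan 2 and Jan 8
signup = dt.date(2015, 1, 1)
visits = [dt.date(2015, 1, 2), dt.date(2015, 1, 8)]

seen_on_day_3 = day_n_retained(signup, visits, 3)              # needs the full visit log
still_around_day_3 = rolling_retained(signup, max(visits), 3)  # needs only the last visit
```

With only a registration date and a last visit per account, the second function is the only one that can be evaluated.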
How to build the table
First, let's build a dummy data set so that we have enough to work on and you can reproduce it :
import pandas as pd
import numpy as np
import math
import datetime as dt
from dateutil.relativedelta import relativedelta
np.random.seed(0)  # so that we all have the same results
def random_date(start, end, p=None):
    # Return a date randomly chosen between two dates
    if p is None:
        p = np.random.random()
    return start + dt.timedelta(seconds=math.ceil(p * (end - start).days*24*3600))
n_samples = 1000  # How many users do we want ?
index = range(1, n_samples+1)
# A range of signup dates, say, one year.
end = dt.datetime.today()
start = end - relativedelta(years=1)
# Create the dataframe
users = pd.DataFrame(np.random.rand(n_samples),
                     index=index, columns=['signup_date'])
users['signup_date'] = users['signup_date'].apply(lambda x: random_date(start, end, x))
# last logs randomly distributed within 10 weeks of signing up, so that we can see the retention drop in our table
users['last_log'] = users['signup_date'].apply(lambda x: random_date(x, x + relativedelta(weeks=10)))
So now we should have something that looks like this :
users.head()
Here is some code to build a cohort table :
### Some useful functions
def add_weeks(sourcedate, weeks):
    return sourcedate + dt.timedelta(days=7*weeks)
def first_day_of_week(sourcedate):
    return sourcedate - dt.timedelta(days=sourcedate.weekday())
def last_day_of_week(sourcedate):
    return sourcedate + dt.timedelta(days=(6 - sourcedate.weekday()))
def retained_in_interval(users, signup_week, n_weeks, end_date):
    '''
    For a given list of users, returns the number of users
    that signed up in the week of signup_week (the cohort)
    and that are retained after n_weeks
    end_date is just here to control that we do not un-necessarily fill the bottom right of the table
    '''
    # Define the span of the given week
    cohort_start = first_day_of_week(signup_week)
    cohort_end = last_day_of_week(signup_week)
    if n_weeks == 0:
        # If this is our first week, we just take the number of users that signed up on the given period of time
        return len(users[(users['signup_date'] >= cohort_start)
                         & (users['signup_date'] <= cohort_end)])
    elif pd.to_datetime(add_weeks(cohort_end, n_weeks)) > pd.to_datetime(end_date):
        # If adding n_weeks brings us later than the end date of the table (the bottom right of the table),
        # we return some easily recognizable value (not 0 as it would cause confusion)
        return float("Inf")
    else:
        # Otherwise, we count the number of users that signed up on the given period of time,
        # and whose last known log was later than the number of weeks added (rolling retention)
        return len(users[(users['signup_date'] >= cohort_start)
                         & (users['signup_date'] <= cohort_end)
                         & (pd.to_datetime(users['last_log']) >= pd.to_datetime(users['signup_date'].map(lambda x: add_weeks(x, n_weeks))))
                         ])
With this we can create the actual function :
def cohort_table(users, cohort_number=6, period_number=6, cohort_span='W', end_date=None):
    '''
    For a given dataframe of users, return a cohort table with the following parameters :
    cohort_number : the number of lines of the table
    period_number : the number of columns of the table
    cohort_span : the span of every period of time between the cohort (D, W, M)
    end_date = the date after which we stop counting the users
    '''
    # the last column of the table will end today :
    if end_date is None:
        end_date = dt.datetime.today()
    # The index of the dataframe will be a list of dates ranging
    dates = pd.date_range(add_weeks(end_date, -cohort_number), periods=cohort_number, freq=cohort_span)
    cohort = pd.DataFrame(columns=['Sign up'])
    cohort['Sign up'] = dates
    # We will compute the number of retained users, column-by-column
    # (There probably is a more pythonesque way of doing it)
    range_dates = range(0, period_number+1)
    for p in range_dates:
        # Name of the column
        s_p = 'Week ' + str(p)
        cohort[s_p] = cohort.apply(lambda row: retained_in_interval(users, row['Sign up'], p, end_date), axis=1)
    cohort = cohort.set_index('Sign up')
    # absolute values to percentage by dividing by the value of week 0 :
    cohort = cohort.astype('float').div(cohort['Week 0'].astype('float'), axis='index')
    return cohort
Now you can call it and see the result :
cohort_table(users)
Hope it helps

Using the same format of users data from rom_j's answer, this will be cleaner/faster, but only works assuming there is at least one signup/churn per week. Not a terrible assumption on large enough data.
users = users.applymap(lambda d: d.strftime('%Y-%m-%V') if pd.notnull(d) else d)
tab = pd.crosstab(users['signup_date'], users['last_log'])
totals = tab.T.sum()
retention_counts = ((tab.T.cumsum().T * -1)
                    .replace(0, pd.NaT)
                    .add(totals, axis=0)
                    )
retention = retention_counts.div(totals, axis=0)
realigned = [retention.loc[a].dropna().values for a in retention.index]
realigned_retention = pd.DataFrame(realigned, index=retention.index)
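The pivot described in the question (unique accounts with more than 0 weekly visits, divided by the cohort size) can also be sketched directly on the visit log. This is only an illustration under the assumption that the column names are exactly those shown in the question; `retention_pivot` is a hypothetical helper name.

```python
import pandas as pd

def retention_pivot(visit_log):
    """Share of each registration cohort with at least one visit, per visit week."""
    # Cohort size: unique accounts per registration week
    cohort_sizes = visit_log.groupby("RegistrationWeek")["AccountID"].nunique()
    # Keep only rows that represent real activity
    active = visit_log[visit_log["Weekly_Visits"] > 0]
    counts = active.pivot_table(index="RegistrationWeek", columns="Visit_Week",
                                values="AccountID", aggfunc="nunique")
    # Divide each row by its cohort size (aligned on RegistrationWeek)
    return counts.div(cohort_sizes, axis=0)

# The six sample rows from the question
visit_log = pd.DataFrame({
    "AccountID": ["ACC1", "ACC2", "ACC3", "ACC4", "ACC5", "ACC6"],
    "RegistrationWeek": pd.to_datetime(["2015-01-25", "2015-01-11", "2015-01-18",
                                        "2014-12-21", "2014-12-21", "2014-12-21"]),
    "Weekly_Visits": [0, 0, 0, 14, 5, 0],
    "Visit_Week": pd.to_datetime([None, None, None,
                                  "2015-02-12", "2015-02-15", "2015-02-22"]),
})
table = retention_pivot(visit_log)
```

Cohorts with no active accounts show up as rows of NaN, and multiplying the result by 100 gives the percentages from the question's mock-up.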

Related

Count over PySpark dataframe by Running-window using a combination of two columns

I have a Spark DataFrame (v.2.2.0) in which I want to count (by a group key) all events that occurred within a certain time frame, for example 5 days, measured from the start_date of the event in each row to the end_date of the other rows. For example:
# For the sake of simplicity I included only 1 user, but there are multiple users
+-------------------+-------------------+-----+
|end_date |start_date |uid |
+-------------------+-------------------+-----+
|2020-11-26 09:30:28|2020-11-26 08:30:22|user1|
|2020-11-26 10:41:00|2020-11-26 10:00:00|user1|
|2020-11-22 12:40:27|2020-11-22 08:37:18|user1|
|2020-11-22 15:22:20|2020-11-22 13:32:30|user1|
|2020-11-20 17:20:07|2020-11-20 16:04:04|user1|
I defined a window
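from pyspark.sql import Window
from pyspark.sql import functions as f
from pyspark.sql.functions import col
days = lambda i: i * 86400  # 60*60*24 = number of seconds in a day
w = (Window
     .partitionBy(col("uid"))
     .orderBy(col("end_date").cast("timestamp").cast("long"))
     .rangeBetween(-days(5), 0))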
And I calculated over the window:
by_end = df.select(col("start_date"), f.count("end_date").over(w).alias("count"))
df = df.join(by_end, 'start_date', how='left')
I'll get this:
+-------------------+-------------------+-----+-------+
|end_date |start_date |uid |count |
+-------------------+-------------------+-----+-------+
|2020-11-26 09:30:28|2020-11-26 08:30:22|user1| 4 |
|2020-11-26 10:41:00|2020-11-26 10:00:00|user1| 3 |
|2020-11-22 12:40:27|2020-11-22 08:37:18|user1| 3 |
|2020-11-22 15:22:20|2020-11-22 13:32:30|user1| 2 |
|2020-11-20 17:20:07|2020-11-20 16:04:04|user1| 1 |
But, to my understanding, this will do the rolling-window count by end-date, which is almost correct, since I need it from the start_date of current event to the end_date of all others.
Any suggestions?

How can I get pandas to adjust my formula based on a specific value in a dataframe?

I have a pandas dataframe that looks like this:
Emp_ID | Weekly_Hours | Hire_Date | Termination_Date | Salary_Paid | Multiplier | Hourly_Pay
A1 | 35 | 01/01/1990 | 06/04/2020 | 5000 | 0.229961 | 32.85
B2 | 35 | 02/01/2020 | NaN | 10000 | 0.229961 | 65.70
C3 | 30 | 23/03/2020 | NaN | 5800 | 0.229961 | 44.46
The multiplier is a static figure for all employees, calculated as 7 / 30.44. The hourly pay is worked out by multiplying the monthly salary by the multiplier and dividing by the weekly contracted hours.
Now my challenge is to get Pandas to recognise a date in the Termination Date field, and adjust the calculation. For instance, the first record would need to be updated to show that the employee was actually paid 5k through the payroll for 4 business days, not the full month, given that they resigned on 06/04/2020. So the expected hourly pay figure would be (5000 / 4 * 7 / 35) = 250.
I can code the calculation quite easily; my struggle is adding a fresh column that reflects the business days (4 in the above example) for all April leavers (I'm not interested in any other months). So far I have tried:
df['T_Mth_Workdays'] = np.where(df['Termination_Date'].notnull(), np.busday_count('2020-04-01', df['Termination_Date']), 0)
However the above approach returns an error stating that:
iterator operand 0 dtype could not be cast from dtype('<M8[ns]') to dtype('<M8[D]')
I should add here that I had to change the dates to datetime64[ns] format manually.
Any pointers gratefully received. Thanks!

The issue with your np.where call is that it passes the entire series df["Termination_Date"] to np.busday_count. The count function fails because it requires its arguments in the np.datetime64[D] format (i.e., values specified only to the day), while the Series holds datetime64[ns] values, including NaT, and NumPy will not cast those to day precision automatically.
One solution is to write a custom function that only calls np.busday_count on elements that are not NaT, converting those to the datetime64[D] type before calling np.busday_count. Then, you can apply the custom function to the df["Termination_Date"] series, as below:
#!/usr/bin/env python3
import numpy as np
import pandas as pd
DATE_FORMAT = "%d-%m-%Y"
# Reproduce raw data
raw_data = [
    ["A1", 35, "01/01/1990", "06/04/2020", 5000, 0.229961, 32.85],
    ["B2", 35, "02/01/2020", None, 10000, 0.229961, 65.70],
    ["C3", 30, "23/03/2020", "NAT", 5800, 0.229961, 44.46],
]
# Convert raw dates to ISO format, then np.datetime64
def parse_raw_dates(s):
    try:
        spl = s.split("/")
        ds = "%s-%s-%s" % (spl[2], spl[1], spl[0])
    except (AttributeError, IndexError):
        # None or anything that is not dd/mm/yyyy becomes NaT
        ds = "NAT"
    return np.datetime64(ds)
for line in raw_data:
    # Both date columns need parsing: hire date and termination date
    line[2] = parse_raw_dates(line[2])
    line[3] = parse_raw_dates(line[3])
# Create dataframe
df = pd.DataFrame(
    data = raw_data,
    columns = [
        "Emp_ID", "Weekly_Hours", "Hire_Date", "Termination_Date",
        "Salary_Paid", "Multiplier", "Hourly_Pay"],
)
# Create special conversion function
def myfunc(d):
    d = d.to_numpy().astype('datetime64[D]')
    if np.isnat(d):
        return 0
    else:
        return np.busday_count('2020-04-01', d)
df['T_Mth_Workdays'] = df["Termination_Date"].apply(myfunc)
def format_date(d):
    d = d.to_numpy().astype('datetime64[D]')
    if np.isnat(d):
        return ""
    else:
        return pd.to_datetime(d).strftime(DATE_FORMAT)
df["Hire_Date"] = df["Hire_Date"].apply(format_date)
df["Termination_Date"] = df["Termination_Date"].apply(format_date)
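As an alternative to the per-element apply, the whole column can be handled at once by masking out the NaT entries before calling np.busday_count. This is a sketch rather than the answerer's method; `april_workdays` and its `month_start` parameter are names invented here.

```python
import numpy as np
import pandas as pd

def april_workdays(termination_dates, month_start="2020-04-01"):
    """Business days from month_start up to (not including) each date; 0 for NaT."""
    # Day precision is what np.busday_count expects
    days = termination_dates.to_numpy().astype("datetime64[D]")
    result = np.zeros(len(days), dtype=np.int64)
    valid = ~np.isnat(days)
    # Only the real dates are passed on, since NaT cannot be counted
    result[valid] = np.busday_count(np.datetime64(month_start), days[valid])
    return pd.Series(result, index=termination_dates.index)

terminations = pd.Series(pd.to_datetime(["06/04/2020", None, None], format="%d/%m/%Y"))
workdays = april_workdays(terminations)
```

Like the function above, this counts up to but not including the termination date, so 06/04/2020 gives 3 rather than 4.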

Posting my approach here in case it helps others in the future. Firstly code for creating the dataframe:
import datetime as dt
import numpy as np
import pandas as pd
d = {'Emp_ID': ['A1', 'B2', 'C3'], 'Weekly Hours': [35.0, 35.0, 30.0], 'Hire_Date': ['01/01/1990', '02/01/2020', '23/03/2020'],
     'Termination_Date': ['06/04/2020', np.nan, np.nan], 'Salary_Paid': [5000, 10000, 5800]}
df = pd.DataFrame(data=d)
df
The first step was to convert the dates to a more usable format. This is where pd.to_datetime() comes in handy; the only adjustment needed was to specify the format.
df['Hire_Date'] = pd.to_datetime(df['Hire_Date'], format='%d/%m/%Y')
df['Termination_Date'] = pd.to_datetime(df['Termination_Date'], format='%d/%m/%Y')
This has the desired effect: the dates are correctly represented, and April is picked up as the month of termination for employee A1.
I now (slightly) adjusted Ken's custom solution for calculating the working days in April:
def workday_calc(d):
    d = d.to_numpy().astype('datetime64[D]')
    if np.isnat(d):
        return 30.44
    else:
        d = d.astype(str)
        d = dt.datetime.strptime(d, '%Y-%m-%d')
        e = (d + dt.timedelta(1)).strftime('%Y-%m-%d')
        return np.busday_count('2020-04-01', e, weekmask=[1,1,1,1,1,0,0])
I spotted the error while reviewing the numpy documentation on np.busday_count(). There are a few useful pointers to note:
The conversion to datetime64[D] in the first line of the function is mandatory; you can't use pd.to_datetime() instead. np.isnat() only accepts NumPy datetime64 (or timedelta64) values, and np.busday_count() works at day precision.
However, once the NaT case has been dealt with, we need to switch to a string, which is what the datetime.strptime() function takes as input.
With datetime.strptime(), we tell Python that the string is in ISO format (%Y-%m-%d) so it can be parsed, shifted by one day, and written back out as a string with strftime(). This works because datetime.strptime() takes a string and np.busday_count() accepts date strings.
Also, np.busday_count() excludes the end date, so I used timedelta() to increment the end date by one, so that all the dates in the interim are counted. This may or may not be appropriate given what you're trying to do, but I wanted an inclusive count of days worked in April. So in this case, the employee has worked for 4 business days in April.
We then simply apply the custom function and create a new column.
df['Days_Worked_April'] = df['Termination_Date'].apply(workday_calc)
I was now able to use the freshly created column to derive my multiplier - using the same old approach. The rest is simple, but I'm including the code and results below for completeness.
df['Multiplier'] = df.apply(lambda x: 7 / x['Days_Worked_April'], axis=1)
df['Hourly_Pay_Calc'] = round((df.apply(lambda x: x['Salary_Paid'] * x['Multiplier'] / x['Weekly Hours'], axis=1)), 2)
Output:
  Emp_ID  Weekly Hours  Hire_Date   Termination_Date  Salary_Paid  Days_Worked_April  Multiplier  Hourly_Pay_Calc
0 A1      35.0          1990-01-01  2020-04-06        5000         4.00               1.750000    250.00
1 B2      35.0          2020-01-02  NaT               10000        30.44              0.229961    65.70
2 C3      30.0          2020-03-23  NaT               5800         30.44              0.229961    44.46
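The arithmetic behind these figures can be checked in isolation. The two helpers below are hypothetical names used only for this sketch; they restate the inclusive business-day count and the salary * 7 / days / weekly hours formula from the posts above.

```python
import numpy as np

def inclusive_busdays(start, end):
    """Business days from start through end, counting the end date itself."""
    # np.busday_count excludes its end date, so shift the end by one day
    end_exclusive = np.datetime64(end) + np.timedelta64(1, "D")
    return int(np.busday_count(start, end_exclusive))

def hourly_pay(salary_paid, days_paid, weekly_hours):
    """Salary scaled to one week (7 / days_paid), divided by weekly hours."""
    return salary_paid * 7 / days_paid / weekly_hours

days_a1 = inclusive_busdays("2020-04-01", "2020-04-06")  # Wed, Thu, Fri, Mon
pay_a1 = hourly_pay(5000, days_a1, 35)
pay_b2 = round(hourly_pay(10000, 30.44, 35), 2)
pay_c3 = round(hourly_pay(5800, 30.44, 30), 2)
```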

How do I select the specific data in a data frame based on the contents of other columns?

I'm new to pandas and I'm currently trying to use it on a data set I have on my tablet using qPython (temporary situation, laptop's being fixed). I have a csv file with a set of data organised by country, region, market and item label, with additional columns price, year and month. These are set out in the following manner:
Country | Region | Market | Item Label | ... | Price | Year | Month |
Canada | Quebec | Market No. | Item Name | ... | $$$ | 2002 | 1 |
Canada | Quebec | Market No. | Item Name | ... | $$$ | 2002 | 2 |
Canada | Quebec | Market No. | Item Name | ... | $$$ | 2002 | 3 |
Canada | Quebec | Market No. | Item Name | ... | $$$ | 2002 | 4 |
and so on. I'm looking for a way to plot these prices against time (I've taken to adding the month/12 to the year to effectively merge the last columns).
Originally I had a code to take the csv data and put it in a Dictionary, like so:
{Country_Name: {Region_Name: {Market_Name: {Item_Name: {"Price": price_list, "Time": time_list}}}}}
and used for loops over the keys to access each price and time list.
However, I'm having difficulty using pandas to get a similar result: I've tried a fair few different approaches, such as iloc, data[data.Country == "Canada"][data.Region == "Quebec"][..., etc. to filter the data for each country, region, market and item, but all of them were particularly slow. The data set is fairly hefty (approx. 12000 by 12), so I wouldn't expect instant results, but is there something obvious I'm missing? Or should I just wait til I have my laptop back?
Edit: to try and provide more context, I'm trying to get the prices over the course of the years and months, to plot how the prices fluctuate. I want to separate them based on the country, region, market and item label, so each line plotted will be a different item in a market in a region in a country. So far, I have the following code:
def abs_join_paths(*args):
    return os.path.abspath(os.path.join(*args))
def get_csv_data_frame(*path, memory = True):
    return pandas.read_csv(abs_join_paths(*path[:-1], path[-1] + ".csv"), low_memory = memory)
def get_food_data(*path):
    food_price_data = get_csv_data_frame(*path, memory = False)
    return food_price_data[food_price_data.cm_name != "Fuel (diesel) - Retail"]
food_data = get_food_data(data_path, food_price_file_name)
def plot_food_price_time_data(data, title, ylabel, xlabel, plot_style = 'k-'):
    plt.clf()
    plt.hold(True)
    data["mp_year"] += data["mp_month"]/12
    for country in data["adm0_name"].unique():
        for region in data[data.adm0_name == country]["adm1_name"].unique():
            for market in data[data.adm0_name == country][data.adm1_name == region]["mkt_name"]:
                for item_label in data[data.adm0_name == country][data.adm1_name == region][data.mkt_name == market]["cm_name"]:
                    current_data = data[data.adm0_name == country][data.adm1_name == region][data.mkt_name == market][data.cm_name == item_label]
                    #year = list(current_data["mp_year"])
                    #month = list(current_data["mp_month"])
                    #time = [float(y) + float(m)/12 for y, m in zip(year, month)]
                    plt.plot(list(current_data["mp_year"]), list(current_data["mp_price"]), plot_style)
                    print(list(current_data["mp_price"]))
    plt.savefig(abs_join_paths(imagepath, title + ".png"))
Edit2/tl;dr: I have a bunch of prices and times, one after the other in one long list. How do I use pandas to split them up based on the contents of the other columns?
Cheers!

I hesitate to guess, but it seems that you are probably iterating through rows (you said you were using iloc). This is the slowest operation in pandas. Pandas data frames are optimized for series access.
If you're plotting, you can use matplotlib directly with pandas data frames and use the groupby method to combine data, without having to iterate through the rows of your data frame.
Without more information it's difficult to answer your question specifically. Please take a look at the comments on your question.

The groupby function did the trick:
def plot_food_price_time_data(data, title, ylabel, xlabel, plot_style = 'k-'):
    plt.clf()
    plt.hold(True)
    group_data = data.groupby(["adm0_name", "adm1_name", "mkt_name", "cm_name"])
    for i in range(len(data)):
        print(data.iloc[i, [1, 3, 5, 7]])
        specific_data = group_data.get_group(tuple(data.iloc[i, [1, 3, 5, 7]]))
        plt.plot(specific_data["mp_price"], specific_data["mp_year"] + specific_data["mp_month"]/12)
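A groupby object can also be iterated directly, which visits each (country, region, market, item) combination once instead of once per row. The sketch below assumes the column names used in the question; `price_series_by_item` and the sample values are made up for illustration, and plotting is left out so the example stays self-contained.

```python
import pandas as pd

def price_series_by_item(data):
    """Yield ((country, region, market, item), times, prices) once per group."""
    keys = ["adm0_name", "adm1_name", "mkt_name", "cm_name"]
    for key, group in data.groupby(keys):
        group = group.sort_values(["mp_year", "mp_month"])
        # Same time axis as in the question: year plus month/12
        times = group["mp_year"] + group["mp_month"] / 12
        yield key, times.tolist(), group["mp_price"].tolist()

# Hypothetical sample rows
sample = pd.DataFrame({
    "adm0_name": ["Canada", "Canada", "Canada"],
    "adm1_name": ["Quebec", "Quebec", "Quebec"],
    "mkt_name": ["Market A", "Market A", "Market A"],
    "cm_name": ["Bread", "Bread", "Milk"],
    "mp_year": [2002, 2002, 2002],
    "mp_month": [1, 2, 1],
    "mp_price": [1.0, 1.1, 2.0],
})
series = {key: (times, prices) for key, times, prices in price_series_by_item(sample)}
```

Each (times, prices) pair can then be passed to plt.plot(times, prices, plot_style), giving one line per item without filtering the full frame repeatedly.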

RuntimeWarning: invalid value encountered in longlong_scalars

What I'm trying to do
I want to report the weekly rejection rate for multiple users. I use a for loop to go through a monthly dataset to get the numbers for every user. The final dataframe, rates, should look something like this:
[Image: The end product, rates]
Description
I have an initial dataframe (numbers), that contains only the ACCEPT, REJECT and REVIEW numbers, where I added these rows and columns:
Rows: Grand Total, Rejection Rate
Columns: Grand Total
Here's what numbers looks like:
|---|--------|--------|--------|--------|-------------|
| | Week 1 | Week 2 | Week 3 | Week 4 | Grand Total |
|---|--------|--------|--------|--------|-------------|
| 0 | 994 | 699 | 529 | 877 | 3099 |
|---|--------|--------|--------|--------|-------------|
| 1 | 27 | 7 | 8 | 13 | 55 |
|---|--------|--------|--------|--------|-------------|
| 2 | 100 | 86 | 64 | 107 | 357 |
|---|--------|--------|--------|--------|-------------|
| 3 | 1121 | 792 | 601 | 997 | 3511 |
|---|--------|--------|--------|--------|-------------|
The indexes represent the following values:
0 - ACCEPT
1 - REJECT
2 - REVIEW
3 - TOTAL (Accept+Reject+Review)
I wrote 2 pre-defined functions:
get_decline_rates(df): To get the decline rates by week in the numbers dataframe.
copy(empty_df, data): To transfer all data to a new dataframe with "double" headers (for reporting purposes).
Here's my code where I add rows and columns to numbers, then re-format it:
# Adding "Grand Total" column and rows
totals = numbers.sum(axis=0) # column sum
numbers = numbers.append(totals, ignore_index=True)
grand_total = numbers.sum(axis=1) # row sum
numbers.insert(len(numbers.columns), "Grand Total", grand_total)
# Adding "Rejection Rate" and re-indexing numbers
decline_rates = get_decline_rates(numbers)
numbers = numbers.append(decline_rates, ignore_index=True)
numbers.index = ["ACCEPT","REJECT","REVIEW","Grand Total","Rejection Rate"]
# Creating a new df with report format requirements
final = pd.DataFrame(0, columns=numbers.columns, index=["User A"]+list(numbers.index))
final.ix["User A",:] = final.columns
# Copying data from numbers to newly formatted df
copy(final,numbers)
# Append final df of this user to the final dataframe
rates = rates.append(final)
I'm using Python 3.5.2 and Pandas 0.19.2. If it helps, here's what the initial dataset looks like:
[Image: Data format]
I do a resampling on the date column to get the data by week.
What's going wrong
Here's the funny part - the code runs fine and I get all the required information in rates. However, I'm seeing this warning message:
RuntimeWarning: invalid value encountered in longlong_scalars
If I break down the code and run it line by line, this message does not appear. Even the message looks weird (what does longlong_scalars even mean?). Does anyone know what this warning message means, and what's causing it?
UPDATE:
I just ran a similar script that takes in exactly the same input and produces a similar output (except I get daily rejection rates instead of weekly). I get the same Runtime warning, except more information is given:
RuntimeWarning: invalid value encountered in longlong_scalars
rej_rate = str(int(round((col.ix[1]/col.ix[3])*100))) + "%"
I suspect something must have gone wrong when I was trying to calculate the decline rates with my pre-defined function, get_decline_rates(df). Could it be due to the dtype of the values? All columns on the input df, numbers, are int64.
Here's the code for my pre-defined function (the input, numbers, can be found under Description):
# Description: Get rejection rates for all weeks.
# Parameters: Pandas Dataframe with ACCEPT, REJECT, REVIEW count by week.
# Output: Pandas Series with rejection rates for all days in input df.
def get_decline_rates(df):
    decline_rates = []
    for i in range(len(df.columns)):
        col = df.ix[:,i]
        try:
            rej_rate = str(int(round((col[1]/col[3])*100))) + "%"
        except ValueError:
            rej_rate = "0%"
        decline_rates.append(rej_rate)
    return pd.Series(decline_rates, index=df.columns)

I had the same RuntimeWarning, and after looking into the data, it turned out to be caused by a division by zero. I did not have the time to look into your sample, but you could look around id=0, or some other records where a zero denominator (for example 0/0) could occur.
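A minimal sketch of how the division could be guarded, assuming the warning comes from a week whose total is 0 (the full dataset is not shown, so this is a guess). Dividing 0 by 0 with NumPy integer scalars gives nan together with a RuntimeWarning, and converting that nan with int() raises the ValueError that the question's try/except turns into "0%", which would explain why the output still looks right. `rejection_rate` is a hypothetical helper name.

```python
import numpy as np

def rejection_rate(rejected, total):
    """Format rejected/total as a whole-number percentage; an empty week is 0%."""
    if total == 0:
        # Handle the empty case before dividing, so no 0/0 is ever evaluated
        return "0%"
    return str(int(round(rejected / total * 100))) + "%"

week1 = rejection_rate(np.int64(27), np.int64(1121))  # Week 1 from the table above
empty_week = rejection_rate(np.int64(0), np.int64(0))
```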

find a user/customer max count of continuous date using Python

Scenario: I have a sample data frame like below
user_id | date_login
--------|-----------
101 | 2015-10-11
101 | 2015-10-12
101 | 2015-11-01
101 | 2015-11-02
101 | 2015-11-03
102 | 2015-10-12
102 | 2015-10-13
...
I would like to know each user's max active days, meaning the longest run of consecutive days on which he/she logged into the system. For the sample data frame above, the desired result should look like this:
user_id | max_continuous_login_count
--------|-----------
101|3
102|2
I'm thinking of converting the dates into numbers to compare them. Is that necessary? Is there any good practice?
Thanks for the help,

Solution:
import operator
import datetime
from collections import defaultdict
from functools import reduce
dataset = [(101, "2015-10-11"), (101, "2015-10-12"), (102, "2015-10-13")]
data = defaultdict(list)
for user, date in dataset:
    data[user].append(datetime.datetime.strptime(date, "%Y-%m-%d").date())
    data[user].sort()
def count_days(data, new_date):
    max_days, current_max, last_date = data
    # Check if there's one day difference, else, reset back to 1.
    if abs((new_date - last_date).days) != 1:
        current_max = 0
    current_max += 1
    return max(max_days, current_max), current_max, new_date
result = {}
for user, dates in data.items():
    result[user] = reduce(count_days, dates, (0, 0, datetime.date.min))[0]
What I did here, was to first convert the dataset into a dict mapping a user and his login dates. On the way, I converted the dates to date objects and sorted them in the correct order (just in case the dataset is garbled).
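I then created a function count_days() which checks whether the difference between 2 consecutive dates is 1 day. If it is, it extends the current streak by one; otherwise it resets the streak to 1. In both cases it keeps track of the longest streak seen so far. Then, by using reduce, I created a new results dict mapping user id to max_days.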
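The same streak logic can be sketched in pandas, assuming a data frame with the user_id and date_login columns from the question. This is an alternative illustration, not the code above; `max_consecutive_logins` is a name chosen here.

```python
import pandas as pd

def max_consecutive_logins(df):
    """Return a Series mapping user_id to the longest run of consecutive login days."""
    df = df.drop_duplicates().sort_values(["user_id", "date_login"])
    # Gap to the previous login of the same user (NaT for a user's first login)
    gap = df.groupby("user_id")["date_login"].diff()
    # A new streak starts whenever the gap is not exactly one day
    streak_id = (gap != pd.Timedelta(days=1)).cumsum().rename("streak_id")
    streak_len = df.groupby(["user_id", streak_id]).size()
    return streak_len.groupby(level="user_id").max()

# The sample rows from the question
logins = pd.DataFrame({
    "user_id": [101, 101, 101, 101, 101, 102, 102],
    "date_login": pd.to_datetime([
        "2015-10-11", "2015-10-12", "2015-11-01", "2015-11-02", "2015-11-03",
        "2015-10-12", "2015-10-13",
    ]),
})
longest = max_consecutive_logins(logins)
```

Keeping the dates as datetimes and taking differences avoids converting them to numbers by hand.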
