Date difference (in days) using a cutoff value in pyspark - python

I am learning Spark, and I am trying to create a column of the difference in days between a date and a cutoff value.
Here is some data along with my solution using pandas.
lst = ['2018-11-21',
'2018-11-01',
'2018-10-09',
'2018-11-23',
'2018-11-08',
'2018-10-06',
'2018-11-27',
'2018-10-07',
'2018-10-23',
'2018-11-02']
d = pd.DataFrame({'event':np.arange(len(lst)),'ts':lst})
d['ts'] = d['ts'].apply(pd.to_datetime) # only needed because I have alist of strings
d['new_ts'] = d.ts - (d.ts.max() - pd.to_timedelta(15, unit='d'))
Unfortunately I can't find a way to adapt this logic to pyspark. I think the issue is in the subtraction of a static date that is not part of the DataFrame.
Assuming that df is the "Spark version" of the above dataset "d", here is one of the things I tried:
calculator = udf(lambda x: datediff(datediff(date_sub(max(x),30),x)))
c = df.withColumn('Recency',calculator(col('ts')))
However, he followings give me a long error
c.select(col('Recency')).show(1)
c.show(1)
Thanks in advance to everyone who is gonna help.

The logic is:
Compute max date.
Subtract given number of days to get cutoff date.
Find difference in days from cutoff date.
df = spark.createDataFrame(data=[["2018-11-21"],["2018-11-01"],["2018-10-09"],["2018-11-23"],["2018-11-08"],["2018-10-06"],["2018-11-27"],["2018-10-07"],["2018-10-23"],["2018-11-02"]], schema=["ts"])
df = df.withColumn("ts", F.to_date("ts", "yyyy-MM-dd"))
cutoff_dt = df.select(F.date_sub(F.max("ts"), 15).alias("cutoff_dt")).first().asDict()["cutoff_dt"]
df = df.withColumn("new_ts", F.datediff("ts", F.lit(cutoff_dt)))
df.show(truncate=False)
+----------+------+
|ts |new_ts|
+----------+------+
|2018-11-21|9 |
|2018-11-01|-11 |
|2018-10-09|-34 |
|2018-11-23|11 |
|2018-11-08|-4 |
|2018-10-06|-37 |
|2018-11-27|15 |
|2018-10-07|-36 |
|2018-10-23|-20 |
|2018-11-02|-10 |
+----------+------+

Related

Get just one value of a nested list inside a dictionary to create a Dataframe Update #1

I am using an API that returns a dictionary with a nested list inside, lets name it coins_best The result looks like this:
{'bitcoin': [[1603782192402, 13089.646908288987],
[1603865643028, 13712.070136258053]],
'ethereum': [[1603782053064, 393.6741989091851],
[1603865024078, 404.86117057956386]]}
The first value on the list is a timestamp, while the second is a price in dollars. I want to create a DataFrame with the prices and having the timestamps as index. I tried with this code to do it in just one step:
d = pd.DataFrame()
for id, obj in coins_best.items():
for i in range(0,len(obj)):
temp = pd.DataFrame({
obj[i][1]
}
)
d = pd.concat([d, temp])
d
This attempt gave me a DataFrame with just one column and not the two required, because using the columns argument threw errors (TypeError: Index(...) must be called with a collection of some kind, 'bitcoin' was passed) when I tried with id
Then I tried with comprehensions to preprocess the dictionary and their lists:
for k in coins_best.keys():
inner_lists = (coins_best[k] for inner_dict in coins_best.values())
items = (item[1] for ls in inner_lists for item in ls)
I could not obtain the both elements in the dictionary, just the last.
I know is possible to try:
df = pd.DataFrame(coins_best, columns=coins_best.keys())
Which gives me:
bitcoin ethereum
0 [1603782192402, 13089.646908288987] [1603782053064, 393.6741989091851]
1 [1603785693143, 13146.275972229188] [1603785731599, 394.6174435303511]
And then try to remove the first element in every list of every row, but was even harder to me. The required answer is:
bitcoin ethereum
1603782192402 13089.646908288987 393.6741989091851
1603785693143 13146.275972229188 394.6174435303511
Do you know how to process the dictionary before creating the DataFrame in order the get this result?
Is my first question, I tried to be as clear as possible. Thank you very much.
Update #1
The answer by Sander van den Oord also solved the problem of timestamps and is useful for its purpose. However, the sample code while correct (as it used the info provided) was limited to these two keys. This is the final code that solved the problem for every key in the dictionary.
for k in coins_best:
df_coins1 = pd.DataFrame(data=coins_best[k], columns=['timestamp', k])
df_coins1['timestamp'] = pd.to_datetime(df_coins1['timestamp'], unit='ms')
df_coins = pd.concat([df_coins1, df_coins], sort=False)
df_coins_resampled = df_coins.set_index('timestamp').resample('d').mean()
Thank you very much for your answers.
I think you shouldn't ignore the fact that values of coins are taken at different times. You could do something like this:
import pandas as pd
import hvplot.pandas
coins_best = {
'bitcoin': [[1603782192402, 13089.646908288987],
[1603865643028, 13712.070136258053]],
'ethereum': [[1603782053064, 393.6741989091851],
[1603865024078, 404.86117057956386]],
}
df_bitcoin = pd.DataFrame(data=coins_best['bitcoin'], columns=['timestamp', 'bitcoin'])
df_bitcoin['timestamp'] = pd.to_datetime(df_bitcoin['timestamp'], unit='ms')
df_ethereum = pd.DataFrame(data=coins_best['ethereum'], columns=['timestamp', 'ethereum'])
df_ethereum['timestamp'] = pd.to_datetime(df_ethereum['timestamp'], unit='ms')
df_coins = pd.concat([df_ethereum, df_bitcoin], sort=False)
Your df_coins will now look like this:
+----+----------------------------+------------+-----------+
| | timestamp | ethereum | bitcoin |
|----+----------------------------+------------+-----------|
| 0 | 2020-10-27 07:00:53.064000 | 393.674 | nan |
| 1 | 2020-10-28 06:03:44.078000 | 404.861 | nan |
| 0 | 2020-10-27 07:03:12.402000 | nan | 13089.6 |
| 1 | 2020-10-28 06:14:03.028000 | nan | 13712.1 |
+----+----------------------------+------------+-----------+
Now if you want values to be on the same line, you could use resampling, here I do it per day: all values of the same day for a coin type are averaged:
df_coins_resampled = df_coins.set_index('timestamp').resample('d').mean()
df_coins_resampled will look like this:
+---------------------+------------+-----------+
| timestamp | ethereum | bitcoin |
|---------------------+------------+-----------|
| 2020-10-27 00:00:00 | 393.674 | 13089.6 |
| 2020-10-28 00:00:00 | 404.861 | 13712.1 |
+---------------------+------------+-----------+
I like to use hvplot to get an interactive plot of the result:
df_coins_resampled.hvplot.scatter(
x='timestamp',
y=['bitcoin', 'ethereum'],
s=20, padding=0.1
)
Resulting plot:
There are different timestamps, so the correct output looks differently, than what you presented, but other than that, its a oneliner (where d is your input dictionary):
pd.concat([pd.DataFrame(val, columns=['timestamp', key]).set_index('timestamp') for key, val in d.items()], axis=1)

How can I get pandas to adjust my formula based on a specific value in a dataframe?

I have a pandas dataframe that looks like this:
Emp_ID | Weekly_Hours | Hire_Date | Termination_Date | Salary_Paid | Multiplier | Hourly_Pay
A1 | 35 | 01/01/1990 | 06/04/2020 | 5000 | 0.229961 | 32.85
B2 | 35 | 02/01/2020 | NaN | 10000 | 0.229961 | 65.70
C3 | 30 | 23/03/2020 | NaN | 5800 | 0.229961 | 44.46
The multiplier is a static figure for all employees, calculated as 7 / 30.44. The hourly pay is worked out by multiplying the monthly salary by the multiplier and dividing by the weekly contracted hours.
Now my challenge is to get Pandas to recognise a date in the Termination Date field, and adjust the calculation. For instance, the first record would need to be updated to show that the employee was actually paid 5k through the payroll for 4 business days, not the full month, given that they resigned on 06/04/2020. So the expected hourly pay figure would be (5000 / 4 * 7 / 35) = 250.
I can code the calculation quite easily; my struggle is adding a column to reflect the business days (4 in the above example) in a fresh column for all April leavers (not interested in any other months). So far I have tried.
df['T_Mth_Workdays'] = np.where(df['Termination_Date'].notnull(), np.busday_count('2020-04-01', df['Termination_Date']), 0)
However the above approach returns an error stating that:
iterator operand 0 dtype could not be cast from dtype(' m8 [ns] ') to dtype(' m8 [d] ')
I should add here that I had to change the dates to datetime[ns64] format manually.
Any pointers gratefully received. Thanks!
The issue with your np.where function call is that it is trying to pass the entire series df["Termination_Date"] as an argument to np.busday_count. The count function fails because it requires arguments to be in the np.datetime64[D] format (i.e., value only specified to the day), and the Series cannot be easily converted to this format.
One solution is to write a custom function that only calls that np.busday_count on elements that are not NaTs, converting those to the datetime64[D] type before calling np.busday_count. Then, you can apply the custom function to the df["Termination_Date"] series, as below:
#!/usr/bin/env python3
import numpy as np
import pandas as pd
DATE_FORMAT = "%d-%m-%Y"
# Reproduce raw data
raw_data = [
["A1", 35, "01/01/1990", "06/04/2020", 5000, 0.229961, 32.85],
["B2", 35, "02/01/2020", None, 10000, 0.229961, 65.70],
["C3", 35, "23/03/2020", "NAT", 5800, 0.229961, 44.46],
]
# Convert raw dates to ISO format, then np.datetime64
def parse_raw_dates(s):
try:
spl = s.split("/")
ds = "%s-%s-%s" %(spl[2], spl[1], spl[0])
except:
ds = "NAT"
return np.datetime64(ds)
for line in raw_data:
line[2] = parse_raw_dates(line[2])
# Create dataframe
df = pd.DataFrame(
data = raw_data,
columns = [
"Emp_ID", "Weekly_Hours", "Hire_Date", "Termination_Date",
"Salary_Paid", "Multiplier", "Hourly_Pay"],
)
# Create special conversion function
def myfunc(d):
d = d.to_numpy().astype('datetime64[D]')
if np.isnat(d):
return 0
else:
return np.busday_count('2020-04-01', d)
df['T_Mth_Workdays'] = df["Termination_Date"].apply(myfunc)
def format_date(d):
d = d.to_numpy().astype('datetime64[D]')
if np.isnat(d):
return ""
else:
return pd.to_datetime(d).strftime(DATE_FORMAT)
df["Hire_Date"] = df["Hire_Date"].apply(format_date)
df["Termination_Date"] = df["Termination_Date"].apply(format_date)
Posting my approach here in case it helps others in the future. Firstly code for creating the dataframe:
d = {'Emp_ID': ['A1', 'B2', 'C3'], 'Weekly Hours': ['35', '35', '30'], 'Hire_Date': ['01/01/1990', '02/01/2020', '23/03/2020'],
'Termination_Date': ['06/04/2020', np.nan, np.nan], 'Salary_Paid': [5000, 10000, 5800]}
df = pd.DataFrame(data=d)
df
The first step was to convert the dates to a more useable format - this is where pd.to_datetime() comes in handy -the adjustment needed was to specify the format.
df['Hire_Date'] = pd.to_datetime(df['Hire_Date'], format='%d/%m/%Y')
df['Termination_Date'] = pd.to_datetime(df['Termination_Date'], format='%d/%m/%Y')
This has the desired effect; whereby the dates are correctly represented and April is picked up as the right month of termination for employee A1.
I now (slightly) adjusted Ken's custom solution for calculating the working days in April:
def workday_calc(d):
d = d.to_numpy().astype('datetime64[D]')
if np.isnat(d):
return 30.44
else:
d = d.astype(str)
d = dt.datetime.strptime(d, '%Y-%m-%d')
e = (d + dt.timedelta(1)).strftime('%Y-%m-%d')
return np.busday_count('2020-04-01', e, weekmask=[1,1,1,1,1,0,0])
I spotted the error while reviewing numpy documentation on np.busday_count(). There are two useful pointers to note:
The use of the datetime64[D] is mandatory in the first line of the function - you can't use pd.to_datetime(). This is because the datetime64[D] format is a pre-requisite to being able to call the np.isnat() function.
However, the minute we deal with the NaT in the dataframe, we need to switch back to a string format, which is needed for the datetime.strptime() function.
Using the datetime.strptime() feature, we tell Python that the date is a) represented in the ISO format, and we need to retain it as a string. The advantage with both datetime.strptime() and np.busday_count() is that they are both built to handle strings.
Also, the np.busday_count() excludes the end date, so I used timedelta() to increment the end date by one, so that all the dates in the interim are counted. This may or may not be appropriate given what you're trying to do, but I wanted an inclusive count of days worked in April. So in this case, the employee has worked for 4 business days in April.
We then simply apply the custom function and create a new column.
df['Days_Worked_April'] = df['Termination_Date'].apply(workday_calc)
I was now able to use the freshly created column to derive my multiplier - using the same old approach. The rest is simple, but I'm including the code and results below for completeness.
df['Multiplier'] = df.apply(lambda x: 7 / x['Days_Worked_April'], axis=1)
df['Hourly_Pay_Calc'] = round((df.apply(lambda x: x['Salary_Paid'] * x['Multiplier'] / x['Weekly Hours'], axis=1)), 2)
Output:
Emp_ID Weekly Hours Hire_Date Termination_Date Salary_Paid Days_Worked_April Multiplier Hourly_Pay_Calc
0 A1 35.0 1990-01-01 2020-04-06 5000 4.00 1.750000 250.00
1 B2 35.0 2020-01-02 NaT 10000 30.44 0.229961 65.70
2 C3 30.0 2020-03-23 NaT 5800 30.44 0.229961 44.46

Eliminate perceived index value from data frame concatenation

I'm trying to concatenate two data frames and write said data-frame to an excel file. The concatenation is performed somewhat successfully, but I'm having a difficult time eliminating the index row that also gets appended.
I would appreciate it if someone could highlight what it is I'm doing wrong. I thought providing the "index = False" argument at every excel call would eliminate the issue, but it has not.
enter image description here
Hopefully you can see the image, if not please let me know.
# filenames
file_name = "C:\\Users\\ga395e\\Desktop\\TEST_FILE.xlsx"
file_name2 = "C:\\Users\\ga395e\\Desktop\\TEST_FILE_2.xlsx"
#create data frames
df = pd.read_excel(file_name, index = False)
df2 = pd.read_excel(file_name2,index =False)
#filter frame
df3 = df2[['WDDT', 'Part Name', 'Remove SN']]
#concatenate values
df4 = df3['WDDT'].map(str) + '-' +df3['Part Name'].map(str) + '-' + 'SN:'+ df3['Remove SN'].map(str)
test=pd.DataFrame(df4)
test=test.transpose()
df = pd.concat([df, test], axis=1)
df.to_excel("C:\\Users\\ga395e\\Desktop\\c.xlsx", index=False)
Thanks
so as the other users also wrote I dont see the index in your image as well because in this case you would have an output which would be like the following:
| Index | Column1 | Column2 |
|-------+----------+----------|
| 0 | Entry1_1 | Entry1_2 |
| 1 | Entry2_1 | Entry2_2 |
| 2 | Entry3_1 | Entry3_2 |
if you pass the index=False option the index will be removed:
| Column1 | Column2 |
|----------+----------|
| Entry1_1 | Entry1_2 |
| Entry2_1 | Entry2_2 |
| Entry3_1 | Entry3_2 |
| | |
which looks like it your case. Your problem be could related to the concatenation and the transposed matrix.
Did you check here you temporary dataframe before exporting it?
You might want to check if pandas imports the time column as a time index
if you want to delete those time columns you could use df.drop and pass an array of columns into this function, e.g. with df.drop(df.columns[:3]). Does this maybe solve your problem?

How to filter records for fixed, regular blocks of time in PySpark?

I'm interested in capturing user behaviour during specific hours of the day, everyday. Suppose I have a dataframe with the columns
+-------+-----+----------+
| start | end | activity |
+-------+-----+----------+
Both start and end are in Unix timestamps. Is there any way to filter for a specific interval of time, like 10 a.m. to 11 a.m. every day in PySpark? Note that start may start before 10 and end may end after 11. I want to find all time periods which overlap.
You can get hour from timestamp.
from pyspark.sql import functions as F
day_timestamps = 24*60*60
hour_timestamps = 60*60
hour = 10
tmstmp2hour = lambda tm: (tm%day_timestamp)/hour_timestamp
df.filter(
(tmstmp2hour(F.col('start'))<hour)
&
(tmstmp2hour(F.col('end'))>hour)
)
The below solution works for your problem
Create a DataFrame, assuming you have unix timestamp
l1 = [(1541585700,1541585750,'playing'), (1531305900,1541589300, 'fishing'), (1541589400,1541589500,'working'),(1530919800, 1530923400, 'across-night')]
df = sqlContext.createDataFrame(l1, ['start','end','activity'])
df.show()
Initialize your range in 24 hour format, you can make it as a part of dataframe to speed up
capStart = 23 #11 pm,
capEnd = 1 # 1am
df = df.withColumn('startSec', df.start%(24*60*60))
df = df.withColumn('endSec', df.end%(24*60*60))
df = df.withColumn('match', (df.startSec >= capStart*60*60) & (df.endSec <= capEnd*60*60))
df.show()
+----------+----------+--------+--------+------+-----+
| start| end|activity|startSec|endSec|match|
+----------+----------+--------+--------+------+-----+
|1541585700|1541585750| playing| 36900| 36950|false|
|1531305900|1541589300| fishing| 38700| 40500|false|
|1541589400|1541589500| working| 40600| 40700|false|
|1530919800|1530923400| night| 84600| 1800| true|
+----------+----------+--------+--------+------+-----+
#Filter the result
df = df.filter(df.match == True)
df.show()
+----------+----------+--------+--------+------+-----+
| start| end|activity|startSec|endSec|match|
+----------+----------+--------+--------+------+-----+
|1530919800|1530923400| night| 84600| 1800| true|
+----------+----------+--------+--------+------+-----+

Generating a retention cohort from a pandas dataframe

I have a pandas dataframe that looks like this:
+-----------+------------------+---------------+------------+
| AccountID | RegistrationWeek | Weekly_Visits | Visit_Week |
+-----------+------------------+---------------+------------+
| ACC1 | 2015-01-25 | 0 | NaT |
| ACC2 | 2015-01-11 | 0 | NaT |
| ACC3 | 2015-01-18 | 0 | NaT |
| ACC4 | 2014-12-21 | 14 | 2015-02-12 |
| ACC5 | 2014-12-21 | 5 | 2015-02-15 |
| ACC6 | 2014-12-21 | 0 | 2015-02-22 |
+-----------+------------------+---------------+------------+
It's essentially a visit log of sorts, as it holds all the necessary data for creating a cohort analysis.
Each registration week is a cohort.
To know how many people are part of the cohort I can use:
visit_log.groupby('RegistrationWeek').AccountID.nunique()
What I want to do is create a pivot table with the registration weeks as keys. The columns should be the visit_weeks and the values should be the count of unique account ids who have more than 0 weekly visits.
Together with the total accounts in each cohort, I will then be able to show percentages instead of absolute values.
The end product would look something like this:
+-------------------+-------------+-------------+-------------+
| Registration Week | Visit_week1 | Visit_Week2 | Visit_week3 |
+-------------------+-------------+-------------+-------------+
| week1 | 70% | 30% | 20% |
| week2 | 70% | 30% | |
| week3 | 40% | | |
+-------------------+-------------+-------------+-------------+
I tried pivoting the dataframe like this:
visit_log.pivot_table(index='RegistrationWeek', columns='Visit_Week')
But I haven't nailed down the value part. I'll need to somehow count account Id and divide the sum by the registration week aggregation from above.
I'm new to pandas so if this isn't the best way to do retention cohorts, please enlighten me!
Thanks
There are several aspects to your question.
What you can build with the data you have
There are several kinds of retention. For simplicity, we’ll mention only two :
Day-N retention : if a user registered on day 0, did she log in on day N ? (Logging on day N+1 does not affect this metric). To measure it, you need to keep track of all the logs of your users.
Rolling retention : if a user registered on day 0, did she log in on day N or any day after that ? (Logging in on day N+1 affects this metric). To measure it, you just need the last know logs of your users.
If I understand your table correctly, you have two relevant variables to build your cohort table : registration date, and last log (visit week). The number of weekly visits seems irrelevant.
So with this you can only go with option 2, rolling retention.
How to build the table
First, let's build a dummy data set so that we have enough to work on and you can reproduce it :
import pandas as pd
import numpy as np
import math
import datetime as dt
np.random.seed(0) # so that we all have the same results
def random_date(start, end,p=None):
# Return a date randomly chosen between two dates
if p is None:
p = np.random.random()
return start + dt.timedelta(seconds=math.ceil(p * (end - start).days*24*3600))
n_samples = 1000 # How many users do we want ?
index = range(1,n_samples+1)
# A range of signup dates, say, one year.
end = dt.datetime.today()
from dateutil.relativedelta import relativedelta
start = end - relativedelta(years=1)
# Create the dataframe
users = pd.DataFrame(np.random.rand(n_samples),
index=index, columns=['signup_date'])
users['signup_date'] = users['signup_date'].apply(lambda x : random_date(start, end,x))
# last logs randomly distributed within 10 weeks of singing up, so that we can see the retention drop in our table
users['last_log'] = users['signup_date'].apply(lambda x : random_date(x, x + relativedelta(weeks=10)))
So now we should have something that looks like this :
users.head()
Here is some code to build a cohort table :
### Some useful functions
def add_weeks(sourcedate,weeks):
return sourcedate + dt.timedelta(days=7*weeks)
def first_day_of_week(sourcedate):
return sourcedate - dt.timedelta(days = sourcedate.weekday())
def last_day_of_week(sourcedate):
return sourcedate + dt.timedelta(days=(6 - sourcedate.weekday()))
def retained_in_interval(users,signup_week,n_weeks,end_date):
'''
For a given list of users, returns the number of users
that signed up in the week of signup_week (the cohort)
and that are retained after n_weeks
end_date is just here to control that we do not un-necessarily fill the bottom right of the table
'''
# Define the span of the given week
cohort_start = first_day_of_week(signup_week)
cohort_end = last_day_of_week(signup_week)
if n_weeks == 0:
# If this is our first week, we just take the number of users that signed up on the given period of time
return len( users[(users['signup_date'] >= cohort_start)
& (users['signup_date'] <= cohort_end)])
elif pd.to_datetime(add_weeks(cohort_end,n_weeks)) > pd.to_datetime(end_date) :
# If adding n_weeks brings us later than the end date of the table (the bottom right of the table),
# We return some easily recognizable date (not 0 as it would cause confusion)
return float("Inf")
else:
# Otherwise, we count the number of users that signed up on the given period of time,
# and whose last known log was later than the number of weeks added (rolling retention)
return len( users[(users['signup_date'] >= cohort_start)
& (users['signup_date'] <= cohort_end)
& pd.to_datetime((users['last_log']) >= pd.to_datetime(users['signup_date'].map(lambda x: add_weeks(x,n_weeks))))
])
With this we can create the actual function :
def cohort_table(users,cohort_number=6,period_number=6,cohort_span='W',end_date=None):
'''
For a given dataframe of users, return a cohort table with the following parameters :
cohort_number : the number of lines of the table
period_number : the number of columns of the table
cohort_span : the span of every period of time between the cohort (D, W, M)
end_date = the date after which we stop counting the users
'''
# the last column of the table will end today :
if end_date is None:
end_date = dt.datetime.today()
# The index of the dataframe will be a list of dates ranging
dates = pd.date_range(add_weeks(end_date,-cohort_number), periods=cohort_number, freq=cohort_span)
cohort = pd.DataFrame(columns=['Sign up'])
cohort['Sign up'] = dates
# We will compute the number of retained users, column-by-column
# (There probably is a more pythonesque way of doing it)
range_dates = range(0,period_number+1)
for p in range_dates:
# Name of the column
s_p = 'Week '+str(p)
cohort[s_p] = cohort.apply(lambda row: retained_in_interval(users,row['Sign up'],p,end_date), axis=1)
cohort = cohort.set_index('Sign up')
# absolute values to percentage by dividing by the value of week 0 :
cohort = cohort.astype('float').div(cohort['Week 0'].astype('float'),axis='index')
return cohort
Now you can call it and see the result :
cohort_table(users)
Hope it helps
Using the same format of users data from rom_j's answer, this will be cleaner/faster, but only works assuming there is at least one signup/churn per week. Not a terrible assumption on large enough data.
users = users.applymap(lambda d: d.strftime('%Y-%m-%V') if pd.notnull(d) else d)
tab = pd.crosstab(signup_date, last_log)
totals = tab.T.sum()
retention_counts = ((tab.T.cumsum().T * -1)
.replace(0, pd.NaT)
.add(totals, axis=0)
)
retention = retention_counts.div(totals, axis=0)
realined = [retention.loc[a].dropna().values for a in retention.index]
realigned_retention = pd.DataFrame(realined, index=retention.index)

Categories

Resources