I've been performing a cohort analysis for a SaaS company, following Greg Reda's example, and I ran into some trouble looking up a cohort's retention.
Right now, I have a dataframe set up as:
import numpy as np
from pandas import DataFrame, Series
import sys
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
pd.set_option('max_columns', 50)
mpl.rcParams['lines.linewidth'] = 2
%matplotlib inline
df = DataFrame({
    'Customer_ID': ['QWT19CLG2QQ','URL99FXP9VV','EJO15CUP4TO','ZDJ11ZPO5LX','QQW13PUF3HL','SIJ98IQH0GW','EBH36UPB2XR','BED40SMW5NQ','NYW11ZKC8WK','YLV60ERT0VT'],
    'Plan_Start_Date': ['2014-01-30', '2014-03-04', '2014-01-27', '2014-02-10', '2014-01-02', '2014-04-15', '2014-05-28', '2014-05-03', '2014-02-09', '2014-06-09'],
    'Plan_Cancel_Date': ['2014-09-19', '2014-10-29', '2015-01-19', '2015-01-21', '2014-08-19', '2014-08-26', '2014-10-01', '2015-01-03', '2015-01-23', '2015-09-02'],
    'Monthly_Pay': [14.99, 14.99, 14.99, 14.99, 29.99, 29.99, 29.99, 74.99, 74.99, 74.99],
    'Plan_ID': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
})
So far, what I have done is...
# Convert the dates from objects to datetime
df.Plan_Start_Date = pd.to_datetime(df.Plan_Start_Date)
df.Plan_Cancel_Date = pd.to_datetime(df.Plan_Cancel_Date)

# Create a cohort based on the start date's month and year
df['Cohort'] = df.Plan_Start_Date.map(lambda x: x.strftime('%Y-%m'))

# Calculate the total lifetime (in months) of each customer
df['Lifetime'] = ((df.Plan_Cancel_Date.dt.year - df.Plan_Start_Date.dt.year) * 12
                  + (df.Plan_Cancel_Date.dt.month - df.Plan_Start_Date.dt.month))

# Calculate the total revenue of each customer
df['Lifetime_Revenue'] = df['Monthly_Pay'] * df['Lifetime']

dfsort = df.sort_values(['Cohort'])
dfsort.head(10)
I have tried to create a retention column from the Plan_Start_Date, similar to how Greg structured his:
dfsort['Retention'] = dfsort.groupby(level=0)['Plan_Start_Date'].min().apply(
    lambda x: x.strftime('%Y-%m'))
But that just repeats the value of the 'Cohort' column in my dataset.
And in turn, when I try to create an index hierarchy to map out retention by:
grouped = dfsort.groupby(['Cohort', 'Retention'])
cohorts = grouped.agg({'Customer_ID': pd.Series.nunique})
cohorts.head()
instead of looking like:
Total_Users
Cohort Retention
-------------------------------
2014-01 2014-01 3
2014-02 3
2014-03 3
...
2015-01 1
2014-02 2014-01 2
2014-02 2
It looks like:
Total_Users
Cohort Retention
-------------------------------
2014-1 2014-1 3
2014-2 2014-2 2
2014-3 2014-3 1
...
I know I am grouping wrong and creating the retention column incorrectly, but I am at a loss on how to fix it. Anyone able to help a rookie out?
You can use multi-indexing and then group on the two columns.
dfsort = dfsort.set_index(['Cohort', 'Retention'])
dfsort.groupby(['Cohort', 'Retention']).count()
However, in your data, you only have one 'Retention' date for each cohort, which is why you don't see different Retention dates.
Cohort Retention
---------------------
2014-01 2014-01
2014-01
2014-01
2014-02 2014-02
2014-02
Maybe you want to look at how you calculated the Cohorts and Retentions.
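If the goal is the table shown in the question (how many customers are still active in each month after the cohort starts), here is one rough sketch, reusing the df from the question after the datetime conversions: expand each customer into one row per month the plan was active, then count distinct customers per (Cohort, Retention) pair. The plain Python loop is just for illustration:
# Sketch only: one row per customer per active month, then count users
rows = []
for _, r in df.iterrows():
    for period in pd.period_range(r.Plan_Start_Date, r.Plan_Cancel_Date, freq='M'):
        rows.append({'Cohort': r.Cohort,
                     'Retention': str(period),
                     'Customer_ID': r.Customer_ID})

active = pd.DataFrame(rows)
cohorts = (active.groupby(['Cohort', 'Retention'])['Customer_ID']
                 .nunique()
                 .to_frame('Total_Users'))
cohorts.head(10)
With this, the Retention level varies within each cohort instead of repeating the cohort month.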
Related
I have a dataset with 2 columns that are on completely different scales.
I need to do a log transformation on both columns to be able to do some visualization on them.
I cannot find Python code that lets me do the log transformation on several columns.
Can anybody help me?
I have a dataset with qualitative and quantitative columns, and I wish to take the log of the RealizedPL and Volume columns.
My dataset looks a bit like this:
Date Name Country Product RealizedPL Volume
0 2019.01.01 Charles Country1 ProductA 100 10200
1 2019.02.20 Pierre Country2 ProductB 150 20500
2 2019.03.02 Chiara Country1 ProductA 200 15300
How can I do the log transformation and keep the other columns as well? Either by creating new columns for the log or directly replacing the columns with the log.
Thank you
You may wish to try:
df[["RealizedPL","Volume"]] = df[["RealizedPL","Volume"]].apply(np.log)
print(df)
Date Name Country Product RealizedPL Volume
0 2019.01.01 Charles Country1 ProductA 4.605170 9.230143
1 2019.02.20 Pierre Country2 ProductB 5.010635 9.928180
2 2019.03.02 Chiara Country1 ProductA 5.298317 9.635608
or:
df[["RealizedPL_log", "Volume_log"]] = df[["RealizedPL","Volume"]].apply(np.log)
to have logs as separate columns.
Also note, if this is simply for visualization purposes, you may wish to try df.plot.scatter(..., logx=True, logy=True).
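For instance, with the columns from the question (just a sketch):
import matplotlib.pyplot as plt

# Log-log scatter of the two skewed columns (illustrative only)
df.plot.scatter(x="Volume", y="RealizedPL", logx=True, logy=True)
plt.show()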
You can use FunctionTransformer from scikit-learn for this and choose which columns you want to apply the transformation to. As a second step, you can add these transformed columns to your original dataframe.
On a dummy example, it would look like this:
import pandas as pd
import numpy as np
from sklearn.preprocessing import FunctionTransformer
df = pd.DataFrame({"a": [1, 2, 3], "b": [3, 44, 2], "c": [4, 4, 3]})
transformer = FunctionTransformer(np.log)
df[["a_log", "b_log"]] = transformer.fit_transform(df[["a", "b"]])
I have a dataset that I want to use to calculate the average quarterly growth rate, broken down by each year in the dataset.
Right now I have a dataframe with a multi-level grouping, and I'd like to apply the gmean function from scipy.stats to each year within the dataset.
The code I use to get the quarterly growth rates looks like this:
df.groupby(df.index.year).resample('Q')['Sales'].sum() / df.groupby(df.index.year).resample('Q')['Sales'].sum().shift(1)
Which gives me this as a result:
So basically I want the geometric mean of (1.162409, 1.659756, 1.250600) for 2014, and the other quarterly growth rates for every other year.
Instinctively, I want to do something like this:
(df.groupby(df.index.year).resample('Q')['Sales'].sum() / df.groupby(df.index.year).resample('Q')['Sales'].sum().shift(1)).apply(gmean, level=0)
But this doesn't work.
I don't know what your data looks like, so I'm going to make some random sample data:
import numpy as np
import pandas as pd

dates = pd.date_range('2014-01-01', '2017-12-31')
n = 5000
np.random.seed(1)
df = pd.DataFrame({
    'Order Date': np.random.choice(dates, n),
    'Sales': np.random.uniform(1, 100, n)
})
Order Date Sales
0 2016-11-27 82.458720
1 2014-08-24 66.790309
2 2017-01-01 75.387001
3 2016-06-24 9.272712
4 2015-12-17 48.278467
And the code:
# Total sales per quarter
q = df.groupby(pd.Grouper(key='Order Date', freq='Q'))['Sales'].sum()
# Q-over-Q growth rate
q = (q / q.shift()).fillna(1)
# Y-over-Y growth rate
from scipy.stats import gmean
y = q.groupby(pd.Grouper(freq='Y')).agg(gmean) - 1
y.index = y.index.year
y.index.name = 'Year'
y.to_frame('Avg. Quarterly Growth').style.format('{:.1%}')
Result:
Avg. Quarterly Growth
Year
2014 -4.1%
2015 -0.7%
2016 3.5%
2017 -1.1%
I have some data of different products and its corresponding sales with Datetime index. I managed to group them by product using:
grouped_df = data.loc[:, ['ProductID', 'Sales']].groupby('ProductID')
for key, item in grouped_df:
    print(grouped_df.get_group(key), "\n\n")
And the output I got is:
ProductID Sales
Datetime
2014-03-31 1 2475.03
2014-09-27 1 10033.06
2015-02-03 1 5329.33
ProductID Sales
Datetime
2014-12-17 2 1960.0
2015-06-17 2 1400.0
2016-08-29 2 230.0
.
.
.
I would like to be able to plot each grouped data on the same graph to show a time-series of sales.
I also want to get the monthly mean sales of each productID and plot them on a bar chart. I have tried re-sampling but it did not work out for me.
How do I go about doing the above?
Help will be greatly appreciated!
You can use seaborn.lineplot here:
import seaborn as sns
ax = sns.lineplot(x=data.index, y="Sales", hue='ProductID', data=data)
You don't even need to group them.
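For the monthly mean sales per product as a bar chart, one possible sketch (assuming data has the DatetimeIndex named 'Datetime' shown in the question) is:
# Monthly mean sales per ProductID, one column per product (sketch)
monthly = (data.groupby('ProductID')['Sales']
               .resample('M')
               .mean()
               .unstack('ProductID'))
monthly.plot(kind='bar', figsize=(12, 4))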
I have a multicolumn pandas dataframe, with rows for each day.
Now I would like to replace each weekend with its mean values in one row, i.e. (Fr, Sa, Su).resample().mean() --> (Weekend).
Not sure where to start even.
Thank you in advance.
import pandas as pd
from datetime import timedelta
# make some data
df = pd.DataFrame({'dt': pd.date_range("2018-11-27", "2018-12-12"), "val": range(0,16)})
# adjust the weekend dates to fall on the friday
df['shifted'] = [d - timedelta(days = max(d.weekday() - 4, 0)) for d in df['dt']]
# calc the mean
df2 = df.groupby(df['shifted']).val.mean()
df2
#Out[105]:
#shifted
#2018-11-27 0
#2018-11-28 1
#2018-11-29 2
#2018-11-30 4
#2018-12-03 6
#2018-12-04 7
#2018-12-05 8
#2018-12-06 9
#2018-12-07 11
#2018-12-10 13
#2018-12-11 14
#2018-12-12 15
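If you also want those collapsed rows to read 'Weekend' instead of the Friday date, a small follow-up sketch on df2 could be:
# Relabel the collapsed Friday dates as 'Weekend <date>' (sketch)
df2.index = ['Weekend ' + d.strftime('%Y-%m-%d') if d.weekday() == 4
             else d.strftime('%Y-%m-%d')
             for d in df2.index]
df2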
I create a pandas dataframe with a DatetimeIndex like so:
import datetime
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# create datetime index and random data column
todays_date = datetime.datetime.now().date()
index = pd.date_range(todays_date-datetime.timedelta(10), periods=14, freq='D')
data = np.random.randint(1, 10, size=14)
columns = ['A']
df = pd.DataFrame(data, index=index, columns=columns)
# initialize new weekend column, then set all values to 'yes' where the index corresponds to a weekend day
df['weekend'] = 'no'
df.loc[(df.index.weekday == 5) | (df.index.weekday == 6), 'weekend'] = 'yes'
print(df)
Which gives
A weekend
2014-10-13 7 no
2014-10-14 6 no
2014-10-15 7 no
2014-10-16 9 no
2014-10-17 4 no
2014-10-18 6 yes
2014-10-19 4 yes
2014-10-20 7 no
2014-10-21 8 no
2014-10-22 8 no
2014-10-23 1 no
2014-10-24 4 no
2014-10-25 3 yes
2014-10-26 8 yes
I can easily plot the A column with pandas by doing:
df.plot()
plt.show()
which plots a line of the A column but leaves out the weekend column as it does not hold numerical data.
How can I put a "marker" on each spot of the A column where the weekend column has the value yes?
In the meantime I found out that it is as simple as using boolean indexing in pandas and doing the plot directly with pyplot instead of pandas' own plot wrapper (which is more convenient to me):
plt.plot(df.index, df.A)
plt.plot(df[df.weekend=='yes'].index, df[df.weekend=='yes'].A, 'ro')
Now the red dots mark all weekend days, i.e. the rows where df.weekend == 'yes'.
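If you prefer to stay with the pandas plot wrapper, an equivalent sketch reuses the Axes it returns:
# Same idea via pandas' own plot wrapper: line first, then red dots on weekends
ax = df['A'].plot()
df.loc[df.weekend == 'yes', 'A'].plot(ax=ax, style='ro')
plt.show()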