Building churn prediction logic for monthly renewals - python

I have a subscription based business dataset which looks like this:
Company RenewalMonth Year Month Metrics
ABC 10 2018 1 ...
DEF 1 2018 1 ...
GHI 7 2018 1 ...
ABC 10 2018 2 ...
DEF 1 2018 2 ...
GHI 7 2018 2 ...
ABC 10 2018 3 ...
DEF 1 2018 3 ...
GHI 7 2018 3 ...
ABC 10 2018 4 ...
DEF 1 2018 4 ...
GHI 7 2018 4 ...
ABC 10 2018 5 ...
DEF 1 2018 5 ...
GHI 7 2018 5 ...
and so on. There are around 10k accounts, and I have their monthly usage data for the last 5 years.
Here RenewalMonth is the month of the year in which that account's renewal takes place.
Year and Month identify the period over which the usage is aggregated; the usage metrics consist of parameters such as sessions, content, region, products, etc.
I am building a churn model, but since the renewal month is not the same for each account, this poses a unique problem. If I aggregate the measures for 2017 and use that as training data to predict on 2018, the model implicitly assumes every account renews on 1 January 2018, because I would be predicting from the last 12 calendar months of data.
But since renewals happen in different months, the alternative is to compute each account's rolling 12-month usage and use that for prediction.
For example, for an account 'xyz' whose renewal happens in November, I would take its last 12 months of usage as test data, and my training data would contain the rolling 12-month usage of all accounts whose renewal has already happened, i.e. any account whose renewal falls before November.
But this looks like a very big task: with about 10,000 accounts, computing rolling aggregates account by account is very difficult.
Could someone help me map this logic to create a rolling 12-month churn prediction model?
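One way to avoid a per-account loop is to compute trailing 12-month aggregates for every account-month in one vectorised pass with groupby + rolling, then keep, per account, the snapshot taken just before its renewal month. A minimal sketch, assuming one row per Company per calendar month; the metric names ('sessions', 'content') are placeholders for your actual columns:

import pandas as pd

metrics = ['sessions', 'content']  # placeholder metric columns

# Index each row by calendar month and sort chronologically per account.
df['period'] = pd.to_datetime(
    df[['Year', 'Month']].rename(columns=str.lower).assign(day=1)
).dt.to_period('M')
df = df.sort_values(['Company', 'period'])

# groupby + rolling computes the trailing 12-month window for every
# account at once, so no explicit loop over 10k accounts is needed.
rolled = (df.set_index('period')
            .groupby('Company')[metrics]
            .rolling(12, min_periods=12)
            .sum()
            .reset_index())

# Per account, keep the snapshot from the month just before its renewal
# month: that row summarises exactly the 12 months of usage leading
# into the renewal decision.
rolled = rolled.merge(df[['Company', 'RenewalMonth']].drop_duplicates(),
                      on='Company')
snapshot = rolled[rolled['period'].dt.month
                  == (rolled['RenewalMonth'] - 2) % 12 + 1]

Training data would then be the snapshot rows of accounts whose renewal has already happened, labelled with the observed renewal outcome, and the November account's snapshot becomes the test row.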

Related

Check how many IDs dropped out of multi-index dataframe over time

I have the following multi-index data frame, with ID and Year being part of the index. The Solvency column is based on whether or not there are NaNs in both Profit/Loss and Total Sales for that year.
ID Year Profit/Loss Total Sales Solvency
0 2008 300. 2000. 1
0 2009 NaN NaN 0
0 2010 500. 2000. 1
1 2008 300. 2000. 1
1 2009 NaN NaN 0
1 2010 NaN NaN 0
However, it is sometimes the case that a company has NaNs in one year but not in the one after, so it is in fact not insolvent and did not disappear from the data set. For my analysis I need to know how many companies drop out over the time period. I am guessing that I need a groupby-based function that checks whether a 0 appears in the Solvency column and then whether a 1 ever appears again in later years for that specific company. The final output should tell how many companies dropped out in each year.
Year  Dropouts
2008  0
2009  1
2010  1
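A reversed cumulative max per ID gives exactly that check: it tells you whether a 1 ever appears again in Solvency from a given year onward. A sketch, under the assumption that a company counts as a dropout in every year in which it is insolvent and never becomes solvent again (which reproduces the expected output above):

import pandas as pd

# Reproduce the example frame with (ID, Year) as the index.
df = pd.DataFrame({
    'ID':       [0, 0, 0, 1, 1, 1],
    'Year':     [2008, 2009, 2010, 2008, 2009, 2010],
    'Solvency': [1, 0, 1, 1, 0, 0],
}).set_index(['ID', 'Year'])

# Walking backwards through each company's years, a cumulative max says
# whether Solvency ever reaches 1 again from that year onward.
solvent_again = df['Solvency'][::-1].groupby(level='ID').cummax()[::-1]

# Dropout = insolvent now and never solvent again.
df['dropped'] = solvent_again.eq(0)

dropouts = (df.groupby(level='Year')['dropped'].sum()
              .reset_index(name='Dropouts'))
print(dropouts)
#    Year  Dropouts
# 0  2008         0
# 1  2009         1
# 2  2010         1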

How to group by column and take average of column weighted by another column?

I am trying to carry out what I thought would be a typical groupby-and-average problem on a DataFrame, but it has become a bit more complex than I anticipated, since the problem deals with string/ordinal years and float values. I am using Python. I will explain below.
I have a data frame showing different model years for different models of refrigerators across several counties in a state. I want to find the average model year of refrigerator for each county.
I have this example dataframe (abbreviated since the full dataframe would be far too long to show):
County_ID Type Year Population
--------------------------------------------
1 A 2022 54355
1 A 2021 54645
1 A 2020 14554
...
1 B 2022 23454
1 B 2021 34657
1 B 2020 12343
...
1 C 2022 23454
1 C 2021 34537
1 C 2020 23323
...
2 A 2022 54355
2 A 2021 54645
2 A 2020 14554
...
2 B 2022 23454
2 B 2021 34657
2 B 2020 12343
...
2 C 2022 23454
2 C 2021 34537
2 C 2020 23323
...
3 A 2022 54355
3 A 2021 54645
3 A 2020 14554
...
3 B 2022 23454
3 B 2021 34657
3 B 2020 12343
...
3 C 2022 23454
3 C 2021 34537
3 C 2020 23323
...
I kept this abbreviated for space, but the idea is that my data covers many counties, with county IDs running from 1 all the way to 50. In this example there are 3 types of refrigerators, and for each type the model-year vintages are shown, i.e. how old each refrigerator is. Population shows how many physical units (each a unique pair of type and year) are found in each county. What I am trying to find is the average year for each County_ID.
And so I want to produce the following DataFrame:
County_ID Average_vintage
--------------------------------
1 XXXX.XX
2 XXXX.XX
3 XXXX.XX
4 XXXX.XX
5 XXXX.XX
6 XXXX.XX
...
But here is what confuses me: I want the average year, yet year is ordinal data, not float, so I am a bit stuck conceptually. What I want is a population-weighted average: when averaging the vintages, a vintage with a higher population should have more influence on the result. So I want to weight the vintages by population and treat the years as floats, so the average can carry a decimal, e.g. the average refrigerator vintage for County 22 might come out as 2015.48. That is what I am going for. I am trying this:
avg_vintage = df.groupby(['County_ID']).mean()
but I don't think this will make much sense, since I need to account for how many of each refrigerator (the population) there actually are in each county. How can I find the average year/vintage for each county, weighted by population, using Python?
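A population-weighted mean does exactly this: cast Year to a number, then use np.average with Population as the weights within each county. A sketch on made-up rows; the astype(int) cast handles years stored as strings:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'County_ID':  [1, 1, 1, 2, 2, 2],
    'Type':       ['A', 'B', 'C', 'A', 'B', 'C'],
    'Year':       ['2022', '2021', '2020', '2022', '2021', '2020'],
    'Population': [54355, 34657, 23323, 14554, 12343, 23454],
})

# Treat the ordinal year as a number and weight it by population:
# average = sum(Year * Population) / sum(Population) per county.
df['Year'] = df['Year'].astype(int)
avg_vintage = (df.groupby('County_ID')[['Year', 'Population']]
                 .apply(lambda g: np.average(g['Year'],
                                             weights=g['Population']))
                 .rename('Average_vintage')
                 .reset_index())
print(avg_vintage)
#    County_ID  Average_vintage
# 0          1          2021.28  (approx.)
# 1          2          2020.82  (approx.)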

I want to filter rows from data frame where the year is 2020 and 2021 using re.search and re.match functions

Data Frame:
Unnamed: 0 date target insult tweet year
0 1 2014-10-09 thomas-frieden fool Can you believe this fool, Dr. Thomas Frieden ... 2014
1 2 2014-10-09 thomas-frieden DOPE Can you believe this fool, Dr. Thomas Frieden ... 2014
2 3 2015-06-16 politicians all talk and no action Big time in U.S. today - MAKE AMERICA GREAT AG... 2015
3 4 2015-06-24 ben-cardin It's politicians like Cardin that have destroy... Politician #SenatorCardin didn't like that I s... 2015
4 5 2015-06-24 neil-young total hypocrite For the nonbeliever, here is a photo of #Neily... 2015
I want a data frame that contains only the rows where the year is 2020 or 2021, using the search and match methods.
df_filtered = df.loc[df['year'].astype(str).str.contains('2020|2021', regex=True)]
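That works once the year column is a string. With re.search / re.match as asked, one option is to map a compiled pattern over the column; a sketch, assuming year may be stored as numbers, with the pattern anchored so a year embedded in a longer token cannot match:

import re

years = df['year'].astype(str)

# re.match anchors at the start of the string; the trailing $ ensures
# only the bare year matches.
pattern = re.compile(r'^(?:2020|2021)$')
df_filtered = df[years.map(lambda y: pattern.match(y) is not None)]

# re.search scans the whole string, so with the same anchors it
# behaves identically here.
df_filtered = df[years.map(lambda y: re.search(r'^(?:2020|2021)$', y)
                           is not None)]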

How to find the number of rows in a column that are above the mean?

I have a dataset in which, among the columns, column A has the release year of each product and column B has its sales.
I want to know how many products have sales above the mean for each year.
The dataset is a pandas dataframe.
Thank you and I hope my question is clear
Compute the per-product yearly averages with groupby.transform() and compare them against the individual sales, e.g.:
import numpy as np
import pandas as pd

df = pd.DataFrame({'product': np.random.choice(['foo', 'bar'], size=10),
                   'year': np.random.choice([2019, 2020, 2021], size=10),
                   'sales': np.random.randint(10000, size=10)})
# product year sales
# 0 foo 2019 7507
# 1 bar 2019 9186
# 2 foo 2021 6234
# 3 foo 2021 7375
# 4 bar 2020 9934
# 5 foo 2021 6403
# 6 foo 2021 7729
# 7 foo 2021 1875
# 8 bar 2020 7148
# 9 foo 2019 8163
df['above_mean'] = df.sales > df.groupby(['product','year']).sales.transform('mean')
df.groupby('year', as_index=False).above_mean.sum()
# year above_mean
# 0 2019 1
# 1 2020 1
# 2 2021 4

How to fill dataframe's empty/nan cell with conditional column mean

I am trying to fill the (pandas) dataframe's null/empty values using the mean of that specific column.
The data looks like this:
ID Name Industry Year Revenue
1 Treslam Financial Services 2009 $5,387,469
2 Rednimdox Construction 2013
3 Lamtone IT Services 2009 $11,757,018
4 Stripfind Financial Services 2010 $12,329,371
5 Openjocon Construction 2013 $4,273,207
6 Villadox Construction 2012 $1,097,353
7 Sumzoomit Construction 2010 $7,703,652
8 Abcddd Construction 2019
.
.
I am trying to fill that empty cell with the mean of the Revenue column where Industry == 'Construction'.
To get our numerical mean value I did:
df.groupby(['Industry'], as_index = False).mean()
I am trying to do something like this to fill up that empty cell in-place:
(df[df['Industry'] == "Construction"]['Revenue']).fillna("$21212121.01", inplace = True)
...but it is not working. Can anyone tell me how to achieve this? Thanks a lot.
Expected Output:
ID Name Industry Year Revenue
1 Treslam Financial Services 2009 $5,387,469
2 Rednimdox Construction 2013 $21212121.01
3 Lamtone IT Services 2009 $11,757,018
4 Stripfind Financial Services 2010 $12,329,371
5 Openjocon Construction 2013 $4,273,207
6 Villadox Construction 2012 $1,097,353
7 Sumzoomit Construction 2010 $7,703,652
8 Abcddd Construction 2019 $21212121.01
.
.
Although the resulting fill values differ, here are two types of averages: the normal average (NaN excluded) and the average computed over the total number of rows, NaN included.
df['Revenue'] = df['Revenue'].replace({r'\$': '', ',': ''}, regex=True)
df['Revenue'] = df['Revenue'].astype(float)
df_mean = df.groupby(['Industry'], as_index = False)['Revenue'].mean()
df_mean
Industry Revenue
0 Construction 4.358071e+06
1 Financial Services 8.858420e+06
2 IT Services 1.175702e+07
df_mean_nan = df.groupby('Industry', as_index=False)['Revenue'].agg(Sum='sum', Size='size')
df_mean_nan['Mean_nan'] = df_mean_nan['Sum'] / df_mean_nan['Size']
df_mean_nan
Industry Sum Size Mean_nan
0 Construction 13074212.0 5.0 2614842.4
1 Financial Services 17716840.0 2.0 8858420.0
2 IT Services 11757018.0 1.0 11757018.0
Average taking the NaN rows into account (NaN counted in the denominator):
df.loc[df['Revenue'].isna(),['Revenue']] = df_mean_nan.loc[df_mean_nan['Industry'] == 'Construction',['Mean_nan']].values
df
ID Name Industry Year Revenue
0 1 Treslam Financial Services 2009 5387469.0
1 2 Rednimdox Construction 2013 2614842.4
2 3 Lamtone IT Services 2009 11757018.0
3 4 Stripfind Financial Services 2010 12329371.0
4 5 Openjocon Construction 2013 4273207.0
5 6 Villadox Construction 2012 1097353.0
6 7 Sumzoomit Construction 2010 7703652.0
7 8 Abcddd Construction 2019 2614842.4
Normal average (NaN excluded):
df.loc[df['Revenue'].isna(),['Revenue']] = df_mean.loc[df_mean['Industry'] == 'Construction',['Revenue']].values
df
ID Name Industry Year Revenue
0 1 Treslam Financial Services 2009 5.387469e+06
1 2 Rednimdox Construction 2013 4.358071e+06
2 3 Lamtone IT Services 2009 1.175702e+07
3 4 Stripfind Financial Services 2010 1.232937e+07
4 5 Openjocon Construction 2013 4.273207e+06
5 6 Villadox Construction 2012 1.097353e+06
6 7 Sumzoomit Construction 2010 7.703652e+06
7 8 Abcddd Construction 2019 4.358071e+06
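For reference, the "normal average" fill can also be written in one step with groupby.transform, which fills every industry's missing values with that industry's own mean instead of hard-coding 'Construction'. A sketch, assuming Revenue has already been converted to float as above:

# Fill each missing Revenue with the mean of its own Industry group
# (NaN excluded from the mean, as in the normal-average variant).
df['Revenue'] = df['Revenue'].fillna(
    df.groupby('Industry')['Revenue'].transform('mean'))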
