Sort Plots by year - python

I have this data frame from which I want to graph 3 plots based on year, with x and y being Unspsc Desc and Total_Price. For example, plot one will be specific to the year 2018 and only contain the Unspsc Desc and Total_Price values for 2018:
Material  Total_Price  Year_Purchase
Gasket    50,000       2018
Washer    6,000        2019
Bolts     7,000        2019
Nut       3,000        2020
Gasket    25,000       2019
Gasket    2,500        2020
Washer    33,500       2018
Nuts      7,000        2019
The code I was using:
dw.groupby(['Unspsc Desc', 'Total_Price']).Year_Purchase.sort_values().plot.bar()
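No working answer is recorded here, so the following is a minimal sketch of one way to do it, assuming the column names from the sample table (`Material`, `Total_Price`, `Year_Purchase`) and inline sample data: iterate over `groupby('Year_Purchase')` and draw one bar chart per year.

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs headless

# Sample data from the question (column names assumed from the table above)
dw = pd.DataFrame({
    'Material': ['Gasket', 'Washer', 'Bolts', 'Nut', 'Gasket', 'Gasket', 'Washer', 'Nuts'],
    'Total_Price': [50000, 6000, 7000, 3000, 25000, 2500, 33500, 7000],
    'Year_Purchase': [2018, 2019, 2019, 2020, 2019, 2020, 2018, 2019],
})

# One bar chart per year: select that year's rows, then plot material vs. price
for year, group in dw.groupby('Year_Purchase'):
    group.plot.bar(x='Material', y='Total_Price', title=f'Purchases in {year}')
```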

How can I derive a "today's money" compounded inflation rate using python?

I know the year-on-year inflation rates for the past 5 years, but I want to derive another column containing compounded inflation relative to the current year.
To illustrate, I have the below table where compound_inflation_to_2022 is the product of all yoy_inflation instances from each year prior to 2022.
So, for 2021 this is simply 2021's yoy_inflation rate.
For 2020 the compound rate is 2020 x 2021.
For 2019 the compound rate is 2019 x 2020 x 2021, and so on.
year   yoy_inflation   compound_inflation_to_2022
2021   1.048           1.048
2020   1.008           1.056
2019   1.014           1.071
2018   1.02            1.093
2017   1.027           1.122
2016   1.018           1.142
Does anyone have an elegant solution for calculating this compound inflation column in python?
Pandas has a feature called .cumprod() and I think it can be of great help to you:
df['compound_inflation_to_2022'] = df['yoy_inflation'].cumprod()
I hope this was what you were looking for ^_^
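One caveat worth adding: .cumprod() only reproduces the compounded figures in the table when the rows are ordered newest-first, as they are here. A self-contained version of the answer's one-liner:

```python
import pandas as pd

# Rows ordered newest-first, matching the table in the question
df = pd.DataFrame({
    'year': [2021, 2020, 2019, 2018, 2017, 2016],
    'yoy_inflation': [1.048, 1.008, 1.014, 1.02, 1.027, 1.018],
})

# Running product down the column: each row compounds all rates from that year up to 2022
df['compound_inflation_to_2022'] = df['yoy_inflation'].cumprod()
print(df.round(3))
```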

Splitting single text column into multiple columns Pandas

I am working on extraction of raw data from various sources. After a process, I could form a dataframe that looked like this.
data
0 ₹ 16,50,000\n2014 - 49,000 km\nJaguar XF 2.2\nJAN 16
1 ₹ 23,60,000\n2017 - 28,000 km\nMercedes-Benz CLA 200 CDI Style, 2017, Diesel\nNOV 26
2 ₹ 26,00,000\n2016 - 44,000 km\nMercedes Benz C-Class Progressive C 220d, 2016, Diesel\nJAN 03
I want to split this raw dataframe into columns in the order the raw data occurs: Price, Year, Mileage, Name, Date.
I have tried df.data.str.split('-', expand=True) with other delimiter options, along with some lambda functions, but without much success.
I need assistance splitting this data into the relevant columns.
Expected output:
price year mileage name date
16,50,000 2014 49000 Jaguar 2.2 XF Luxury Jan-17
23,60,000 2017 28000 CLA CDI Style Nov-26
26,00,000 2016 44000 Mercedes C-Class C220d Jan-03
Try splitting on '\n' first, then on '-':
df[["Price", "Year-Mileage", "Name", "Date"]] = df.data.str.split('\n', expand=True)
df[["Year", "Mileage"]] = df["Year-Mileage"].str.split('-', expand=True)
df.drop(columns=["data", "Year-Mileage"], inplace=True)
print(df)
Price Name Date Year Mileage
0 ₹ 16,50,000 Jaguar XF 2.2 JAN 16 2014 49,000 km
2 ₹ 26,00,000 Mercedes Benz C-Class Progressive C 220d, 2016, Diesel JAN 03 2016 44,000 km
1 ₹ 23,60,000 Mercedes-Benz CLA 200 CDI Style, 2017, Diesel NOV 26 2017 28,000 km
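The two-step split can be run end-to-end as below (a self-contained sketch using two of the question's rows; splitting on ' - ' rather than '-' also keeps the surrounding spaces out of Year and Mileage):

```python
import pandas as pd

df = pd.DataFrame({'data': [
    '₹ 16,50,000\n2014 - 49,000 km\nJaguar XF 2.2\nJAN 16',
    '₹ 23,60,000\n2017 - 28,000 km\nMercedes-Benz CLA 200 CDI Style, 2017, Diesel\nNOV 26',
]})

# First split on newlines: price / "year - mileage" / name / date
df[['Price', 'Year-Mileage', 'Name', 'Date']] = df['data'].str.split('\n', expand=True)
# Then split only the combined column on ' - ', so hyphens in
# names like "Mercedes-Benz" are untouched
df[['Year', 'Mileage']] = df['Year-Mileage'].str.split(' - ', expand=True)
df = df.drop(columns=['data', 'Year-Mileage'])
```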

How to fill dataframe's empty/nan cell with conditional column mean

I am trying to fill the (pandas) dataframe's null/empty value using the mean of that specific column.
The data looks like this:
ID Name Industry Year Revenue
1 Treslam Financial Services 2009 $5,387,469
2 Rednimdox Construction 2013
3 Lamtone IT Services 2009 $11,757,018
4 Stripfind Financial Services 2010 $12,329,371
5 Openjocon Construction 2013 $4,273,207
6 Villadox Construction 2012 $1,097,353
7 Sumzoomit Construction 2010 $7,703,652
8 Abcddd Construction 2019
.
.
I am trying to fill that empty cell with the mean of Revenue column where Industry is == 'Construction'.
To get our numerical mean value I did:
df.groupby(['Industry'], as_index = False).mean()
I am trying to do something like this to fill up that empty cell in-place:
(df[df['Industry'] == "Construction"]['Revenue']).fillna("$21212121.01", inplace = True)
..but it is not working. Can anyone tell me how to achieve this? Thanks a lot.
Expected Output:
ID Name Industry Year Revenue
1 Treslam Financial Services 2009 $5,387,469
2 Rednimdox Construction 2013 $21212121.01
3 Lamtone IT Services 2009 $11,757,018
4 Stripfind Financial Services 2010 $12,329,371
5 Openjocon Construction 2013 $4,273,207
6 Villadox Construction 2012 $1,097,353
7 Sumzoomit Construction 2010 $7,703,652
8 Abcddd Construction 2019 $21212121.01
.
.
The two kinds of average produce different numbers, so both are shown below: the normal mean (NaN excluded) and the mean whose denominator also counts the NaN rows.
df['Revenue'] = df['Revenue'].replace({r'\$': '', ',': ''}, regex=True)
df['Revenue'] = df['Revenue'].astype(float)
df_mean = df.groupby(['Industry'], as_index = False)['Revenue'].mean()
df_mean
Industry Revenue
0 Construction 4.358071e+06
1 Financial Services 8.858420e+06
2 IT Services 1.175702e+07
df_mean_nan = df.groupby(['Industry'], as_index=False)['Revenue'].agg(Sum='sum', Size='size')
df_mean_nan['Mean_nan'] = df_mean_nan['Sum'] / df_mean_nan['Size']
df_mean_nan
Industry Sum Size Mean_nan
0 Construction 13074212.0 5.0 2614842.4
1 Financial Services 17716840.0 2.0 8858420.0
2 IT Services 11757018.0 1.0 11757018.0
Average taking the NaN rows into account:
df.loc[df['Revenue'].isna(),['Revenue']] = df_mean_nan.loc[df_mean_nan['Industry'] == 'Construction',['Mean_nan']].values
df
ID Name Industry Year Revenue
0 1 Treslam Financial Services 2009 5387469.0
1 2 Rednimdox Construction 2013 2614842.4
2 3 Lamtone IT Services 2009 11757018.0
3 4 Stripfind Financial Services 2010 12329371.0
4 5 Openjocon Construction 2013 4273207.0
5 6 Villadox Construction 2012 1097353.0
6 7 Sumzoomit Construction 2010 7703652.0
7 8 Abcddd Construction 2019 2614842.4
Normal average (NaN excluded):
df.loc[df['Revenue'].isna(),['Revenue']] = df_mean.loc[df_mean['Industry'] == 'Construction',['Revenue']].values
df
ID Name Industry Year Revenue
0 1 Treslam Financial Services 2009 5.387469e+06
1 2 Rednimdox Construction 2013 4.358071e+06
2 3 Lamtone IT Services 2009 1.175702e+07
3 4 Stripfind Financial Services 2010 1.232937e+07
4 5 Openjocon Construction 2013 4.273207e+06
5 6 Villadox Construction 2012 1.097353e+06
6 7 Sumzoomit Construction 2010 7.703652e+06
7 8 Abcddd Construction 2019 4.358071e+06
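If the goal is simply to fill every missing Revenue with its own industry's NaN-excluding mean (rather than hard-coding 'Construction'), a shorter alternative is groupby + transform; a sketch using a few of the question's rows:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Industry': ['Financial Services', 'Construction', 'IT Services', 'Construction'],
    'Revenue': [5387469.0, np.nan, 11757018.0, 4273207.0],
})

# transform('mean') broadcasts each industry's mean back onto its own rows,
# so fillna only touches the NaN cells
df['Revenue'] = df['Revenue'].fillna(
    df.groupby('Industry')['Revenue'].transform('mean'))
```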

How do I groupby two columns and create a loop to subplots?

I have a large dataframe (df) with this structure:
year person purchase
2016 Peter 0
2016 Peter 223820
2016 Peter 0
2017 Peter 261740
2017 Peter 339987
2018 Peter 200000
2016 Carol 256400
2017 Carol 33083820
2017 Carol 154711
2018 Carol 3401000
2016 Frank 824043
2017 Frank 300000
2018 Frank 214416259
2018 Frank 4268825
2018 Frank 463080
2016 Rita 0
To see how much each person spent per year I do groupby year and person, which gives me what I want.
code:
df1 = df.groupby(['person','year']).sum().reset_index()
How do I create a loop to create subplots for each person containing what he/she spent on purchase each year?
So a subplot for each person where x = year and y = purchase.
I've tried a lot of different things explained here but none seems to work.
Thanks!
You can either use pivot_table or groupby().sum().unstack('person'), then plot:
(df.pivot_table(index='year',
                columns='person',
                values='purchase',
                aggfunc='sum')
   .plot(subplots=True));
Or:
(df.groupby(['person', 'year'])['purchase']
   .sum()
   .unstack('person')
   .plot(subplots=True));
Output:
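Since the question asks for an explicit loop, here is a sketch of the same idea with a manual loop over people, one axis per person (sample rows adapted from the question):

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless backend
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'year': [2016, 2017, 2018, 2016, 2017, 2018],
    'person': ['Peter', 'Peter', 'Peter', 'Carol', 'Carol', 'Carol'],
    'purchase': [223820, 601727, 200000, 256400, 33238531, 3401000],
})

# Total spent per person per year
totals = df.groupby(['person', 'year'])['purchase'].sum()

# One subplot per person: x = year, y = purchase
people = totals.index.get_level_values('person').unique()
fig, axes = plt.subplots(len(people), 1, sharex=True)
for ax, person in zip(axes, people):
    totals.loc[person].plot.bar(ax=ax, title=person)
fig.tight_layout()
```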

Discussion on Churn prediction logic building for monthly renewals

I have a subscription based business dataset which looks like this:
Company RenewalMonth Year Month Metrics
ABC 10 2018 1 ...
DEF 1 2018 1 ...
GHI 7 2018 1 ...
ABC 10 2018 2 ...
DEF 1 2018 2 ...
GHI 7 2018 2 ...
ABC 10 2018 3 ...
DEF 1 2018 3 ...
GHI 7 2018 3 ...
ABC 10 2018 4 ...
DEF 1 2018 4 ...
GHI 7 2018 4 ...
ABC 10 2018 5 ...
DEF 1 2018 5 ...
GHI 7 2018 5 ...
and so on. There are around 10,000 accounts, and I have their data usage per month for the last 5 years.
Here RenewalMonth represents the month of each year in which the renewal takes place for that account.
Year and Month identify the period of the aggregated usage; the usage metrics consist of parameters such as sessions, content, region, products, etc.
I am building a churn model, but since the renewal month of each account is not the same, this poses a unique problem. If I aggregate the measures for 2017 and use them as training data to predict on 2018, the model assumes that every account renews on 1st January 2018, because I am predicting from the last 12 months of data.
But since renewals happen in different months, the alternative is to find the rolling 12-month usage of each account and map that for prediction.
For example, for an account 'xyz' whose renewal happens in November, I would map its last 12 months of usage as test data, and my training data would contain the rolling 12-month usage of all accounts whose renewal has already happened, i.e. any account whose renewal falls before November.
But this is a very big task: there are about 10,000 accounts, and computing individual rolling aggregates for each of them is difficult.
Could someone help me map this logic to create a rolling 12 months churn prediction model?
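No answer is recorded here, but one way to avoid per-account loops is a grouped rolling window: sort by account and month, then compute a trailing 12-month aggregate for all accounts in one pass. A sketch, assuming a single Usage column standing in for the real metrics:

```python
import pandas as pd

# Toy data: one account, 14 consecutive months (Usage stands in for the real metrics)
df = pd.DataFrame({
    'Company': ['ABC'] * 14,
    'Year': [2018] * 12 + [2019] * 2,
    'Month': list(range(1, 13)) + [1, 2],
    'Usage': list(range(1, 15)),
})

# Build a month-start timestamp and sort so each account's window runs chronologically
df['date'] = pd.to_datetime(
    df[['Year', 'Month']].rename(columns={'Year': 'year', 'Month': 'month'}).assign(day=1))
df = df.sort_values(['Company', 'date'])

# Trailing 12-month usage per account, computed for every account at once
df['usage_12m'] = (df.groupby('Company')['Usage']
                     .rolling(window=12, min_periods=12)
                     .sum()
                     .reset_index(level=0, drop=True))
```

Each account's row for its renewal month then carries exactly the 12 months of usage preceding that renewal, which can be sliced out as training or test data.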
