Calculating growth rates on specific level of multilevel index in Pandas - python

I have a dataset that I want to use to calculate the average quarterly growth rate, broken down by each year in the dataset.
Right now I have a dataframe with a multi-level grouping, and I'd like to apply the gmean function from scipy.stats to each year within the dataset.
The code I use to get the quarterly growth rates looks like this:
df.groupby(df.index.year).resample('Q')['Sales'].sum() / df.groupby(df.index.year).resample('Q')['Sales'].sum().shift(1)
That gives me quarter-over-quarter growth factors indexed by year and quarter. So basically I want the geometric mean of (1.162409, 1.659756, 1.250600) for 2014, and likewise for the quarterly growth rates of every other year.
Instinctively, I want to do something like this:
(df.groupby(df.index.year).resample('Q')['Sales'].sum() / df.groupby(df.index.year).resample('Q')['Sales'].sum().shift(1)).apply(gmean, level=0)
But this doesn't work.

I don't know what your data looks like so I'm gonna make some random sample data:
import numpy as np
import pandas as pd

dates = pd.date_range('2014-01-01', '2017-12-31')
n = 5000
np.random.seed(1)
df = pd.DataFrame({
    'Order Date': np.random.choice(dates, n),
    'Sales': np.random.uniform(1, 100, n)
})
  Order Date      Sales
0 2016-11-27  82.458720
1 2014-08-24  66.790309
2 2017-01-01  75.387001
3 2016-06-24   9.272712
4 2015-12-17  48.278467
And the code:
# Total sales per quarter
q = df.groupby(pd.Grouper(key='Order Date', freq='Q'))['Sales'].sum()
# Q-over-Q growth rate
q = (q / q.shift()).fillna(1)
# Y-over-Y growth rate
from scipy.stats import gmean
y = q.groupby(pd.Grouper(freq='Y')).agg(gmean) - 1
y.index = y.index.year
y.index.name = 'Year'
y.to_frame('Avg. Quarterly Growth').style.format('{:.1%}')
Result:
Avg. Quarterly Growth
Year
2014 -4.1%
2015 -0.7%
2016 3.5%
2017 -1.1%
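As a side note on why the asker's attempt fails: Series.apply works element-wise and has no level argument, so gmean never receives a whole year of values at once. A minimal sketch of the idea the asker was reaching for (assuming df has a DatetimeIndex, as in the question) is to group the resulting MultiIndex series on its first level instead:
from scipy.stats import gmean

q = df.groupby(df.index.year).resample('Q')['Sales'].sum()
growth = q / q.shift(1)

# Level 0 of the MultiIndex is the year: take the geometric mean of each
# year's non-null quarterly growth factors
avg_growth = growth.dropna().groupby(level=0).agg(gmean)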

Related

Calculating trend per customer level in Python dataframe

I have 4 columns in my dataset: cid (customer level), month, spending and transaction (max cid = 10000). df.head() is shown below.
   cid  month  spending  transaction
0    1      3     61.94           28
1    1      4     73.02           23
2    1      7     59.34           25
3    1      8     48.69           24
4    1      9    121.79           26
I use the following function to calculate the trend (slope) in outflow spending per customer. However, I get one identical number for the whole dataset, whereas I expected the trend to be calculated at the customer level (one trend value per customer).
Is there a way to iterate over each customer level in the dataset and obtain individual trends per customer? Thanks in advance!
df = pd.read_csv("/content/case_data.csv")
import numpy as np
def trendline(df, order=1):
    coeffs = np.polyfit(df.index.values, list(df), order)
    slope = coeffs[-2]
    return float(slope)
outflow = df['spending']
cid = df['cid']
df_ = pd.DataFrame({'cid': cid, 'outflow': outflow})
slope_outflow = trendline(df_['cid'])
slope_outflow
Output : 0.13377820413729283
Expected Output: (Trend1), (Trend2), (Trend3), ......, (Trend10000)
def trendline(x, y, order=1):
    return np.polyfit(x, y, order)[-2]

df.groupby('cid').apply(lambda subdf: trendline(subdf['month'].values, subdf['spending'].values))
You can use groupby to calculate the trend for each cid value. In the example above it is the trend of spending against month.
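As a small follow-up sketch (the name spending_trend is just illustrative), the result can be kept as a labelled DataFrame with one row per customer:
trends = (
    df.groupby('cid')
      .apply(lambda subdf: trendline(subdf['month'].values, subdf['spending'].values))
      .rename('spending_trend')
      .reset_index()
)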

Change data interval and add the average to the original data

I have a dataset that includes a country's temperature in 2020 and the projected temperature rise by 2050. I'm hoping to create a dataset that assumes linear growth of temperature between 2020 and 2050 for this country. Take the sample df as an example: the temperature in 2020 for country A is 5 degrees; by 2050, the temperature is projected to rise by 3 degrees. In other words, the temperature would rise by 0.1 degrees per year.
Country  Temperature 2020  Temperature 2050
A        5                 3
The desired output is df2
Country  Year  Temperature
A        2020  5
A        2021  5.1
A        2022  5.2
I tried to use resample, but it seems to only work when the frequency is within a year (month, quarter). I also tried interpolate, but neither works.
df = df.reindex(pd.date_range(start='20211231', end='20501231', freq='12MS'))
df2 = df.interpolate(method='linear')
You can use something like this:
import numpy as np
import pandas as pd
def interpolate(df, start, stop):
    a = np.empty((stop - start, df.shape[0]))
    a[1:-1] = np.nan
    a[0] = df[f'Temperature {start}']
    a[-1] = df[f'Temperature {stop}']
    df2 = pd.DataFrame(a, index=pd.date_range(start=f'{start+1}', end=f'{stop+1}', freq='Y'))
    return df2.interpolate(method='linear')
df = pd.DataFrame([["A", 5, 3]], columns=["Country", f"Temperature 2020", f"Temperature 2050"])
df[f"Temperature 2050"] += df[f"Temperature 2020"]
print(interpolate(df, 2020, 2050))
This will output
2021-01-01 5.000000
2022-01-01 5.103448
2023-01-01 5.206897
2024-01-01 5.310345
2025-01-01 5.413793
2026-01-01 5.517241
2027-01-01 5.620690
2028-01-01 5.724138
2029-01-01 5.827586
2030-01-01 5.931034
2031-01-01 6.034483
2032-01-01 6.137931
2033-01-01 6.241379
2034-01-01 6.344828
2035-01-01 6.448276
2036-01-01 6.551724
2037-01-01 6.655172
2038-01-01 6.758621
2039-01-01 6.862069
2040-01-01 6.965517
2041-01-01 7.068966
2042-01-01 7.172414
2043-01-01 7.275862
2044-01-01 7.379310
2045-01-01 7.482759
2046-01-01 7.586207
2047-01-01 7.689655
2048-01-01 7.793103
2049-01-01 7.896552
2050-01-01 8.000000
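If you want the long Country/Year/Temperature layout shown in the question, a small follow-up sketch (the names out and df2 are just illustrative) melts the interpolated frame:
out = interpolate(df, 2020, 2050)
out.columns = df['Country']  # one interpolated column per input row

df2 = (out.rename_axis('Year')
          .reset_index()
          .melt(id_vars='Year', var_name='Country', value_name='Temperature'))
df2['Year'] = df2['Year'].dt.year  # keep just the calendar year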

Average of n lowest priced hourly intervals in a day pandas dataframe

I have a dataframe that is made up of hourly electricity price data. What I am trying to do is find a way to calculate the average of the n lowest-priced hourly periods in a day. The data spans many years, and I am aiming to get the average of the n lowest-priced periods for each day. Synthetic data can be created using the following:
np.random.seed(0)
rng = pd.date_range('2020-01-01', periods=24, freq='H')
df = pd.DataFrame({'Date': rng, 'Price': np.random.randn(len(rng))})
I have managed to get the lowest price for each day by using:
df_max = df.groupby([pd.Grouper(key='Date', freq='D')]).min()
Is there a way to get the average of the n lowest periods in a day?
Thanks in advance for any help.
We can group the dataframe with a Grouper object at daily frequency, then aggregate Price using nsmallest to obtain the n smallest values; finally, calculate the mean on level=0 to get the average of the n smallest values in each day.
df.groupby(pd.Grouper(key='Date', freq='D'))['Price'].nsmallest(5).mean(level=0)
Result of calculating the average of 5 smallest values daily
Date
2020-01-01 -1.066337
Name: Price, dtype: float64
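Note that Series.mean(level=...) has been deprecated and removed in newer pandas releases; an equivalent spelling, assuming the same df, groups on the first index level explicitly:
(df.groupby(pd.Grouper(key='Date', freq='D'))['Price']
   .nsmallest(5)
   .groupby(level=0)
   .mean())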
You can also try the following:
bottom_5_prices_mean = df.sort_values('Price').head(5)['Price'].mean()
top_5_prices_mean = df.sort_values('Price').tail(5)['Price'].mean()

Calculating weighted average from my dataframe

I am trying to calculate the weighted average of the number of times a social media post was made on a given weekday between 2009 and 2018.
This is the code I have:
weight = fb_posts2[fb_posts2['title']=='status'].groupby('year',as_index=False).apply(lambda x: (x.count())/x.sum())
What I am trying to do is group by year and weekday, count the number of times each weekday occurred in a year, and divide that by the total number of posts in that year. The idea is to return a dataframe with a weighted average of how many times each weekday occurred between 2009 and 2018.
This is a sample of the dataframe I am interacting with:
Use .value_counts() with the normalize argument, grouping only on year.
Sample Data
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({'year': np.random.choice([2010, 2011], 1000),
                   'weekday': np.random.choice(list('abcdefg'), 1000),
                   'val': np.random.normal(1, 10, 1000)})
Code:
df.groupby('year').weekday.value_counts(normalize=True)
Output:
year  weekday
2010  d          0.152083
      f          0.147917
      g          0.147917
      c          0.143750
      e          0.139583
      b          0.137500
      a          0.131250
2011  d          0.182692
      a          0.163462
      e          0.153846
      b          0.148077
      c          0.128846
      f          0.111538
      g          0.111538
Name: weekday, dtype: float64
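If a wide year-by-weekday table is easier to read, the same result can be reshaped by unstacking the weekday level:
df.groupby('year').weekday.value_counts(normalize=True).unstack()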

Calculating monthly retention

I've been performing a cohort analysis for a SaaS company using Greg Rada's example, and I ran into some trouble looking up a cohort's retention.
Right now, I have a dataframe set up as:
import numpy as np
from pandas import DataFrame, Series
import sys
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
pd.set_option('max_columns', 50)
mpl.rcParams['lines.linewidth'] = 2
%matplotlib inline
df = DataFrame({
    'Customer_ID': ['QWT19CLG2QQ', 'URL99FXP9VV', 'EJO15CUP4TO', 'ZDJ11ZPO5LX', 'QQW13PUF3HL',
                    'SIJ98IQH0GW', 'EBH36UPB2XR', 'BED40SMW5NQ', 'NYW11ZKC8WK', 'YLV60ERT0VT'],
    'Plan_Start_Date': ['2014-01-30', '2014-03-04', '2014-01-27', '2014-02-10', '2014-01-02',
                        '2014-04-15', '2014-05-28', '2014-05-03', '2014-02-09', '2014-06-09'],
    'Plan_Cancel_Date': ['2014-09-19', '2014-10-29', '2015-01-19', '2015-01-21', '2014-08-19',
                         '2014-08-26', '2014-10-01', '2015-01-03', '2015-01-23', '2015-09-02'],
    'Monthly_Pay': [14.99, 14.99, 14.99, 14.99, 29.99, 29.99, 29.99, 74.99, 74.99, 74.99],
    'Plan_ID': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
})
So far, what I have done is...
df.Plan_Start_Date = pd.to_datetime(df.Plan_Start_Date)
df.Plan_Cancel_Date = pd.to_datetime(df.Plan_Cancel_Date)
#Convert the dates from objects to datetime
df['Cohort'] = df.Plan_Start_Date.map(lambda x: x.strftime('%Y-%m'))
#Create a cohort based on the start dates month and year
df['Lifetime'] = ((df.Plan_Cancel_Date.dt.year - df.Plan_Start_Date.dt.year) * 12
                  + (df.Plan_Cancel_Date.dt.month - df.Plan_Start_Date.dt.month))
# Calculate the total lifetime (in months) of each customer
df['Lifetime_Revenue'] = df['Monthly_Pay'] * df['Lifetime']
dfsort = df.sort_values(['Cohort'])
dfsort.head(10)
#Calculate the total revenue of each customer
I have tried to create a Retention column from the Plan_Start_Date, similar to how Greg structured his:
dfsort['Retention'] = dfsort.groupby(level=0)['Plan_Start_Date'].min().apply(lambda x: x.strftime('%Y-%m'))
But that just repeats the value of the ['Cohort'] column in my dataset.
And in turn, when I try to create an index hierarchy to map out retention by:
grouped = dfsort.groupby(['Cohort', 'Retention'])
cohorts = grouped.agg({'Customer_ID': pd.Series.nunique})
cohorts.head()
instead of looking like:
                   Total_Users
Cohort   Retention
-------------------------------
2014-01  2014-01   3
         2014-02   3
         2014-03   3
         ...
         2015-01   1
2014-02  2014-01   2
         2014-02   2
It looks like:
                  Total_Users
Cohort  Retention
-------------------------------
2014-1  2014-1    3
2014-2  2014-2    2
2014-3  2014-3    1
...
I know I am grouping incorrectly and creating the Retention column wrong, but I am at a loss as to how to fix it. Anyone able to help a rookie out?
You can use multi-indexing and then group on the 2 columns.
dfsort = dfsort.set_index(['Cohort', 'Retention'])
dfsort.groupby(['Cohort', 'Retention']).count()
However, in your data, you only have one 'Retention' date for each cohort, which is why you don't see different Retention dates.
Cohort   Retention
---------------------
2014-01  2014-01
         2014-01
         2014-01
2014-02  2014-02
         2014-02
Maybe you want to look at how you calculated the Cohorts and Retentions.
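For the month-by-month breakdown the asker is after, one possible approach (a sketch only, not the original answer's method; the helper active_months and the names expanded/cohorts are illustrative) is to expand each customer into one row per month of their active lifetime, then count distinct customers per cohort and month:
def active_months(row):
    # Months from plan start through cancellation, as 'YYYY-MM' strings
    return [str(p) for p in pd.period_range(row['Plan_Start_Date'],
                                            row['Plan_Cancel_Date'], freq='M')]

expanded = (
    df.assign(Retention=df.apply(active_months, axis=1))
      .explode('Retention')
)

cohorts = (
    expanded.groupby(['Cohort', 'Retention'])['Customer_ID']
            .nunique()
            .to_frame('Total_Users')
)
This yields a Cohort/Retention MultiIndex with a Total_Users count for every month a cohort still had active customers, which matches the shape of the expected output above.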
