I have a dataset from which I want to calculate the average quarterly growth rate, broken down by year.
Right now I have a dataframe with a multi-level grouping, and I'd like to apply the gmean function from scipy.stats to each year within the dataset.
The code I use to get the quarterly growth rates looks like this:
df.groupby(df.index.year).resample('Q')['Sales'].sum() / df.groupby(df.index.year).resample('Q')['Sales'].sum().shift(1)
This gives me the quarter-over-quarter growth rates as a Series indexed by year and quarter. So basically I want the geometric mean of (1.162409, 1.659756, 1.250600) for 2014, and likewise for the quarterly growth rates of every other year.
Instinctively, I want to do something like this:
(df.groupby(df.index.year).resample('Q')['Sales'].sum() / df.groupby(df.index.year).resample('Q')['Sales'].sum().shift(1)).apply(gmean, level=0)
But this doesn't work.
I don't know what your data looks like, so I'll make some random sample data:
dates = pd.date_range('2014-01-01', '2017-12-31')
n = 5000
np.random.seed(1)
df = pd.DataFrame({
    'Order Date': np.random.choice(dates, n),
    'Sales': np.random.uniform(1, 100, n)
})
Order Date Sales
0 2016-11-27 82.458720
1 2014-08-24 66.790309
2 2017-01-01 75.387001
3 2016-06-24 9.272712
4 2015-12-17 48.278467
And the code:
# Total sales per quarter
q = df.groupby(pd.Grouper(key='Order Date', freq='Q'))['Sales'].sum()
# Q-over-Q growth rate
q = (q / q.shift()).fillna(1)
# Y-over-Y growth rate
from scipy.stats import gmean
y = q.groupby(pd.Grouper(freq='Y')).agg(gmean) - 1
y.index = y.index.year
y.index.name = 'Year'
y.to_frame('Avg. Quarterly Growth').style.format('{:.1%}')
Result:
Avg. Quarterly Growth
Year
2014 -4.1%
2015 -0.7%
2016 3.5%
2017 -1.1%
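If you'd rather stay with the groupby / resample expression from your question, the same per-year geometric mean can be taken directly on that Series with groupby(level=0). A minimal sketch, assuming df is the original frame with a DatetimeIndex as in the question:
from scipy.stats import gmean
# quarterly sales per year, as in the question
quarterly = df.groupby(df.index.year).resample('Q')['Sales'].sum()
ratios = quarterly / quarterly.shift(1)   # quarter-over-quarter growth
# drop the leading NaN produced by shift(1), then take the geometric mean per year (level 0)
avg_quarterly_growth = ratios.dropna().groupby(level=0).apply(gmean) - 1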
I have a dataset with 4 columns: cid (customer ID; max cid = 10000), month, spending, and transaction. df.head() is shown below.
cid month spending transaction
0 1 3 61.94 28
1 1 4 73.02 23
2 1 7 59.34 25
3 1 8 48.69 24
4 1 9 121.79 26
I use the following function to calculate the trend (slope) of spending per customer. However, I get a single identical number for the whole dataset, when I expected a trend value for each customer. Is there a way to iterate over each customer in the dataset and obtain an individual trend per customer? Thanks in advance!
import numpy as np
import pandas as pd

df = pd.read_csv("/content/case_data.csv")

def trendline(df, order=1):
    coeffs = np.polyfit(df.index.values, list(df), order)
    slope = coeffs[-2]
    return float(slope)

outflow = df['spending']
cid = df['cid']
df_ = pd.DataFrame({'cid': cid, 'outflow': outflow})
slope_outflow = trendline(df_['cid'])
slope_outflow
Output : 0.13377820413729283
Expected Output: (Trend1), (Trend2), (Trend3), ......, (Trend10000)
def trendline(x, y, order=1):
    return np.polyfit(x, y, order)[-2]

df.groupby('cid').apply(lambda subdf: trendline(subdf['month'].values, subdf['spending'].values))
You can use groupby to compute the trend for each cid value. In the example above it is the trend of spending against month.
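If you'd rather end up with a tidy frame (one row per customer), a small follow-up sketch reusing trendline and df from above; the column name spending_trend is just illustrative:
trends = (
    df.groupby('cid')
      .apply(lambda subdf: trendline(subdf['month'].values, subdf['spending'].values))
      .rename('spending_trend')   # illustrative column name
      .reset_index()              # columns: cid, spending_trend
)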
I have a dataset that includes a country's temperature in 2020 and the projected temperature rise by 2050. I'm hoping to create a dataset that assumes linear growth of temperature between 2020 and 2050 for this country. Take the sample df as an example: the temperature in 2020 for country A is 5 degrees; by 2050, the temperature is projected to rise by 3 degrees. In other words, the temperature would rise by 0.1 degrees per year.
Country Temperature 2020 Temperature 2050
A 5 3
The desired output is df2
Country Year Temperature
A 2020 5
A 2021 5.1
A 2022 5.2
I tried to use resample, but it seems to work only when the frequency is within a year (month, quarter). I also tried interpolate, but neither works:
df = df.reindex(pd.date_range(start='20211231', end='20501231', freq='12MS'))
df2 = df.interpolate(method='linear')
You can use something like this:
import numpy as np
import pandas as pd
def interpolate(df, start, stop):
    # (stop - start) rows: the first row holds the start-year temperature,
    # the last row the stop-year temperature, everything in between is NaN
    a = np.empty((stop - start, df.shape[0]))
    a[1:-1] = np.nan
    a[0] = df[f'Temperature {start}']
    a[-1] = df[f'Temperature {stop}']
    # put the values on a yearly DatetimeIndex and fill the gaps linearly
    df2 = pd.DataFrame(a, index=pd.date_range(start=f'{start+1}', end=f'{stop+1}', freq='Y'))
    return df2.interpolate(method='linear')
df = pd.DataFrame([["A", 5, 3]], columns=["Country", "Temperature 2020", "Temperature 2050"])
# convert the projected rise into the absolute 2050 temperature
df["Temperature 2050"] += df["Temperature 2020"]
print(interpolate(df, 2020, 2050))
This will output:
2021-01-01 5.000000
2022-01-01 5.103448
2023-01-01 5.206897
2024-01-01 5.310345
2025-01-01 5.413793
2026-01-01 5.517241
2027-01-01 5.620690
2028-01-01 5.724138
2029-01-01 5.827586
2030-01-01 5.931034
2031-01-01 6.034483
2032-01-01 6.137931
2033-01-01 6.241379
2034-01-01 6.344828
2035-01-01 6.448276
2036-01-01 6.551724
2037-01-01 6.655172
2038-01-01 6.758621
2039-01-01 6.862069
2040-01-01 6.965517
2041-01-01 7.068966
2042-01-01 7.172414
2043-01-01 7.275862
2044-01-01 7.379310
2045-01-01 7.482759
2046-01-01 7.586207
2047-01-01 7.689655
2048-01-01 7.793103
2049-01-01 7.896552
2050-01-01 8.000000
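If the goal is exactly the desired df2 layout (Country / Year / Temperature in 0.1-degree steps), a minimal alternative sketch that skips the DatetimeIndex and interpolates on an integer year axis with np.linspace; it assumes Temperature 2050 has already been converted to the absolute 2050 value as above:
years = np.arange(2020, 2051)
frames = []
for _, row in df.iterrows():
    # 31 evenly spaced values from the 2020 temperature to the 2050 temperature
    temps = np.linspace(row['Temperature 2020'], row['Temperature 2050'], len(years))
    frames.append(pd.DataFrame({'Country': row['Country'],
                                'Year': years,
                                'Temperature': temps}))

df2 = pd.concat(frames, ignore_index=True)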
I have a dataframe made up of hourly electricity price data. What I am trying to do is find a way to calculate the average of the n lowest-price hourly periods in a day. The data spans many years, and I am aiming to get the average of the n lowest-price periods for each day. Synthetic data can be created using the following:
np.random.seed(0)
rng = pd.date_range('2020-01-01', periods=24, freq='H')
df = pd.DataFrame({ 'Date': rng, 'Price': np.random.randn(len(rng)) })
I have managed to get the lowest price for each day by using:
df_min = df.groupby([pd.Grouper(key='Date', freq='D')]).min()
Is there a way to get the average of the n lowest periods in a day?
Thanks in advance for any help.
We can group the dataframe with a Grouper object at daily frequency, then aggregate Price using nsmallest to obtain the n smallest values per day, and finally calculate the mean on level=0 to get the average of the n smallest values in each day.
df.groupby(pd.Grouper(key='Date', freq='D'))['Price'].nsmallest(5).mean(level=0)
Result of calculating the average of 5 smallest values daily
Date
2020-01-01 -1.066337
Name: Price, dtype: float64
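A side note: Series.mean(level=...) has been deprecated and later removed in newer pandas releases. If that applies to your version, an equivalent sketch of the same idea is:
(
    df.groupby(pd.Grouper(key='Date', freq='D'))['Price']
      .nsmallest(5)        # 5 lowest prices per day, MultiIndex (Date, original row)
      .groupby(level=0)    # regroup on the Date level
      .mean()
)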
You can also try the following, which sorts by price and averages the 5 cheapest and 5 most expensive rows (note this only works when the frame covers a single day; for data spanning several days you still need the daily groupby shown above):
bottom_5_prices_mean = df.sort_values('Price', ascending=True).head(5)['Price'].mean()
top_5_prices_mean = df.sort_values('Price', ascending=True).tail(5)['Price'].mean()
I am trying to calculate the weighted average of the number of times a social media post was made on a given weekday between 2009 and 2018.
This is the code I have:
weight = fb_posts2[fb_posts2['title']=='status'].groupby('year',as_index=False).apply(lambda x: (x.count())/x.sum())
What I am trying to do is group by year and weekday, count the number of times each weekday occurs in a year, and divide that by the total number of posts in each year. The idea is to return a dataframe with a weighted average of how many times each weekday occurred between 2009 and 2018.
Use .value_counts() with the normalize argument, grouping only on year.
Sample Data
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({'year': np.random.choice([2010, 2011], 1000),
                   'weekday': np.random.choice(list('abcdefg'), 1000),
                   'val': np.random.normal(1, 10, 1000)})
Code:
df.groupby('year').weekday.value_counts(normalize=True)
Output:
year weekday
2010 d 0.152083
f 0.147917
g 0.147917
c 0.143750
e 0.139583
b 0.137500
a 0.131250
2011 d 0.182692
a 0.163462
e 0.153846
b 0.148077
c 0.128846
f 0.111538
g 0.111538
Name: weekday, dtype: float64
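If you'd rather see one row per year with a column per weekday, a small follow-up sketch using unstack:
# pivot the normalized counts into a year x weekday table of shares
shares = (
    df.groupby('year').weekday.value_counts(normalize=True)
      .unstack(fill_value=0)
)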
I've been performing a cohort analysis for a SaaS company using Greg Reda's example, and I ran into some trouble looking up a cohort's retention.
Right now, I have a dataframe set up as:
import numpy as np
from pandas import DataFrame, Series
import sys
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
pd.set_option('max_columns', 50)
mpl.rcParams['lines.linewidth'] = 2
%matplotlib inline
df = DataFrame({
    'Customer_ID': ['QWT19CLG2QQ', 'URL99FXP9VV', 'EJO15CUP4TO', 'ZDJ11ZPO5LX', 'QQW13PUF3HL',
                    'SIJ98IQH0GW', 'EBH36UPB2XR', 'BED40SMW5NQ', 'NYW11ZKC8WK', 'YLV60ERT0VT'],
    'Plan_Start_Date': ['2014-01-30', '2014-03-04', '2014-01-27', '2014-02-10', '2014-01-02',
                        '2014-04-15', '2014-05-28', '2014-05-03', '2014-02-09', '2014-06-09'],
    'Plan_Cancel_Date': ['2014-09-19', '2014-10-29', '2015-01-19', '2015-01-21', '2014-08-19',
                         '2014-08-26', '2014-10-01', '2015-01-03', '2015-01-23', '2015-09-02'],
    'Monthly_Pay': [14.99, 14.99, 14.99, 14.99, 29.99, 29.99, 29.99, 74.99, 74.99, 74.99],
    'Plan_ID': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
})
So far, what I have done is...
df.Plan_Start_Date = pd.to_datetime(df.Plan_Start_Date)
df.Plan_Cancel_Date = pd.to_datetime(df.Plan_Cancel_Date)
# Convert the dates from objects to datetime

df['Cohort'] = df.Plan_Start_Date.map(lambda x: x.strftime('%Y-%m'))
# Create a cohort based on the start date's month and year

df['Lifetime'] = ((df.Plan_Cancel_Date.dt.year - df.Plan_Start_Date.dt.year) * 12
                  + (df.Plan_Cancel_Date.dt.month - df.Plan_Start_Date.dt.month))
# Calculate the total lifetime (in months) of each customer

df['Lifetime_Revenue'] = df['Monthly_Pay'] * df['Lifetime']
# Calculate the total revenue of each customer

dfsort = df.sort_values(['Cohort'])
dfsort.head(10)
I have tried to create a Retention column from the Plan_Start_Date, similar to how Greg structured his:
dfsort['Retention'] = dfsort.groupby(level=0)['Plan_Start_Date'].min().apply(
    lambda x: x.strftime('%Y-%m'))
But that just repeats the value of the ['Cohort'] column in my dataset.
And in turn, when I try to create an index hierarchy to map out retention by:
grouped = dfsort.groupby(['Cohort', 'Retention'])
cohorts = grouped.agg({'Customer_ID': pd.Series.nunique})
cohorts.head()
instead of looking like:
Total_Users
Cohort Retention
-------------------------------
2014-01 2014-01 3
2014-02 3
2014-03 3
...
2015-01 1
2014-02 2014-01 2
2014-02 2
It looks like:
Total_Users
Cohort Retention
-------------------------------
2014-1 2014-1 3
2014-2 2014-2 2
2014-3 2014-3 1
...
I know I am grouping wrong and creating the retention column wrong, but I am at a loss on how to fix it. Anyone able to help a rookie out?
You can use multi-indexing and then group on the 2 columns.
dfsort = dfsort.set_index(['Cohort', 'Retention'])
dfsort.groupby(['Cohort', 'Retention']).count()
However, in your data, you only have one 'Retention' date for each cohort, which is why you don't see different Retention dates.
Cohort Retention
---------------------
2014-01 2014-01
2014-01
2014-01
2014-02 2014-02
2014-02
Maybe you want to look at how you calculated the Cohorts and Retentions.
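For what it's worth, here is a sketch of one way to build the Cohort x Retention counts you are after: expand each customer into one row per active month, then count unique customers per (Cohort, active month). It assumes Plan_Start_Date and Plan_Cancel_Date have already been run through pd.to_datetime as in your code; the intermediate names (active, exploded) are only illustrative.
# cohort = month of plan start (replaces the string version with a monthly Period)
df['Cohort'] = df['Plan_Start_Date'].dt.to_period('M')

# one list of active months (as monthly Periods) per customer
active = df.apply(
    lambda r: pd.period_range(r['Plan_Start_Date'], r['Plan_Cancel_Date'], freq='M').tolist(),
    axis=1,
)

# one row per customer per active month
exploded = (
    df[['Customer_ID', 'Cohort']]
    .join(active.rename('Retention'))
    .explode('Retention')
)

# unique customers still active in each month, per cohort
cohorts = (
    exploded.groupby(['Cohort', 'Retention'])['Customer_ID']
            .nunique()
            .to_frame('Total_Users')
)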