group pandas time series by month across years - python

I have the following working example where I calculate a normal distribution for every month of this time series. What I am looking for is an aggregated distribution that gives back 12 values, one for every month calculated across years. In other words, the subset for January includes the data from January 2011, 2012, 2013 and 2014, from which the distribution is calculated.
from scipy.stats import norm
import pandas as pd
import numpy as np

def some_function(data):
    mu, std = norm.fit(data)
    a = mu * 3
    b = std * 5
    return a, b

rng = pd.date_range('1/1/2011', periods=4*365, freq='D')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts.groupby(pd.TimeGrouper('M')).apply(some_function).apply(pd.Series).rename(columns={0: 'mu', 1: 'std'})
Cheers

You can use the year attribute on the datetime index:
In [11]: ts.groupby(ts.index.year).apply(some_function).apply(pd.Series).rename(columns={0: 'mu', 1: 'std'})
Out[11]:
            mu       std
2011  0.110566  4.827900
2012 -0.094430  4.950958
2013 -0.097986  4.965611
2014 -0.078819  4.709263
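If you want the 12 per-month values the question actually asks for (January pooled across 2011-2014, and so on), the same pattern works with the month attribute instead of year; a minimal sketch, reusing some_function from above:

# group by calendar month (1-12), pooling all years together
monthly = (ts.groupby(ts.index.month)
             .apply(some_function)
             .apply(pd.Series)
             .rename(columns={0: 'mu', 1: 'std'}))
monthly.index.name = 'month'
print(monthly)  # 12 rows, one per calendar month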

Related

Change data interval and add the average to the original data

I have a dataset that includes a country's temperature in 2020 and the projected temperature rise by 2050. I'm hoping to create a dataset that assumes linear growth of temperature between 2020 and 2050 for this country. Take the sample df as an example. The temperature in 2020 for country A is 5 degrees; by 2050, the temperature is projected to rise by 3 degrees. In other words, the temperature would rise by 0.1 degrees per year.
Country Temperature 2020 Temperature 2050
A 5 3
The desired output is df2
Country Year Temperature
A 2020 5
A 2021 5.1
A 2022 5.2
I tried to use resample, but it seems to only work when the frequency is within a year (month, quarter). I also tried interpolate, but neither works.
df = df.reindex(pd.date_range(start='20211231', end='20501231', freq='12MS'))
df2 = df.interpolate(method='linear')
You can use something like this:
import numpy as np
import pandas as pd

def interpolate(df, start, stop):
    # one row per year from start+1 to stop, one column per country row in df
    a = np.empty((stop - start, df.shape[0]))
    a[1:-1] = np.nan
    a[0] = df[f'Temperature {start}']
    a[-1] = df[f'Temperature {stop}']
    # yearly timestamps 2021-01-01 .. 2050-01-01, matching the output below
    df2 = pd.DataFrame(a, index=pd.date_range(start=f'{start+1}', periods=stop - start, freq='YS'))
    # fill the NaN rows linearly between the two endpoints
    return df2.interpolate(method='linear')

df = pd.DataFrame([["A", 5, 3]], columns=["Country", "Temperature 2020", "Temperature 2050"])
df["Temperature 2050"] += df["Temperature 2020"]  # turn the projected rise into an absolute temperature
print(interpolate(df, 2020, 2050))
This will output
2021-01-01 5.000000
2022-01-01 5.103448
2023-01-01 5.206897
2024-01-01 5.310345
2025-01-01 5.413793
2026-01-01 5.517241
2027-01-01 5.620690
2028-01-01 5.724138
2029-01-01 5.827586
2030-01-01 5.931034
2031-01-01 6.034483
2032-01-01 6.137931
2033-01-01 6.241379
2034-01-01 6.344828
2035-01-01 6.448276
2036-01-01 6.551724
2037-01-01 6.655172
2038-01-01 6.758621
2039-01-01 6.862069
2040-01-01 6.965517
2041-01-01 7.068966
2042-01-01 7.172414
2043-01-01 7.275862
2044-01-01 7.379310
2045-01-01 7.482759
2046-01-01 7.586207
2047-01-01 7.689655
2048-01-01 7.793103
2049-01-01 7.896552
2050-01-01 8.000000
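If you want exactly the steps shown in the question's desired output (2020 = 5, 2021 = 5.1, ..., a constant 0.1 per year), a simpler alternative is to build the series directly with numpy.linspace over the 31 year points; a sketch, with start_year, t0 and rise hard-coded from the sample row for country A:

import numpy as np
import pandas as pd

start_year, stop_year = 2020, 2050
t0, rise = 5.0, 3.0  # temperature in 2020 and projected rise by 2050 (sample row)

years = np.arange(start_year, stop_year + 1)    # 2020 .. 2050 inclusive (31 years)
temps = np.linspace(t0, t0 + rise, len(years))  # constant step of 0.1 per year here

df2 = pd.DataFrame({'Country': 'A', 'Year': years, 'Temperature': temps})
print(df2.head(3))  # 2020 -> 5.0, 2021 -> 5.1, 2022 -> 5.2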

Calculating growth rates on specific level of multilevel index in Pandas

I have a dataset that I want to use to calculate the average quarterly growth rate, broken down by each year in the dataset.
Right now I have a dataframe with a multi-level grouping, and I'd like to apply the gmean function from scipy.stats to each year within the dataset.
The code I use to get the quarterly growth rates looks like this:
df.groupby(df.index.year).resample('Q')['Sales'].sum() / df.groupby(df.index.year).resample('Q')['Sales'].sum().shift(1)
Which gives me this as a result:
So basically I want the geometric mean of (1.162409, 1.659756, 1.250600) for 2014, and the other quarterly growth rates for every other year.
Instinctively, I want to do something like this:
(df.groupby(df.index.year).resample('Q')['Sales'].sum() / df.groupby(df.index.year).resample('Q')['Sales'].sum().shift(1)).apply(gmean, level=0)
But this doesn't work.
I don't know what your data looks like so I'm gonna make some random sample data:
dates = pd.date_range('2014-01-01', '2017-12-31')
n = 5000
np.random.seed(1)
df = pd.DataFrame({
    'Order Date': np.random.choice(dates, n),
    'Sales': np.random.uniform(1, 100, n)
})
  Order Date      Sales
0 2016-11-27  82.458720
1 2014-08-24  66.790309
2 2017-01-01  75.387001
3 2016-06-24   9.272712
4 2015-12-17  48.278467
And the code:
# Total sales per quarter
q = df.groupby(pd.Grouper(key='Order Date', freq='Q'))['Sales'].sum()
# Q-over-Q growth rate
q = (q / q.shift()).fillna(1)
# Y-over-Y growth rate
from scipy.stats import gmean
y = q.groupby(pd.Grouper(freq='Y')).agg(gmean) - 1
y.index = y.index.year
y.index.name = 'Year'
y.to_frame('Avg. Quarterly Growth').style.format('{:.1%}')
Result:
Avg. Quarterly Growth
Year
2014 -4.1%
2015 -0.7%
2016 3.5%
2017 -1.1%
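If you prefer to stay closer to the question's own formulation (a MultiIndexed Series of quarterly growth factors per year), the geometric mean can also be taken by grouping on the first index level; a sketch, assuming df is indexed by date as in the question rather than carrying Order Date as a column (avg_quarterly_growth is just an illustrative name):

from scipy.stats import gmean

# quarter-over-quarter growth factors, indexed by (year, quarter end)
quarterly = df.groupby(df.index.year).resample('Q')['Sales'].sum()
ratios = (quarterly / quarterly.shift(1)).dropna()

# geometric mean of each year's factors: group on level 0 of the MultiIndex
avg_quarterly_growth = ratios.groupby(level=0).apply(gmean) - 1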

Calculating weighted average from my dataframe

I am trying to calculate the weighted average of the number of times a social media post was made on a given weekday between 2009 and 2018.
This is the code I have:
weight = fb_posts2[fb_posts2['title'] == 'status'].groupby('year', as_index=False).apply(lambda x: x.count() / x.sum())
What I am trying to do is to group by year and weekday, count the number of times each weekday occurred in a year, and divide that by the total number of posts in that year. The idea is to return a dataframe with a weighted average of how many times each weekday occurred between 2009 and 2018.
This is a sample of the dataframe I am interacting with:
Use .value_counts() with the normalize argument, grouping only on year.
Sample Data
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({'year': np.random.choice([2010, 2011], 1000),
                   'weekday': np.random.choice(list('abcdefg'), 1000),
                   'val': np.random.normal(1, 10, 1000)})
Code:
df.groupby('year').weekday.value_counts(normalize=True)
Output:
year weekday
2010 d 0.152083
f 0.147917
g 0.147917
c 0.143750
e 0.139583
b 0.137500
a 0.131250
2011 d 0.182692
a 0.163462
e 0.153846
b 0.148077
c 0.128846
f 0.111538
g 0.111538
Name: weekday, dtype: float64
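If a wide table (one row per year, one column per weekday) is easier to read, the same result can be reshaped with unstack, for example:

# one row per year, one column per weekday; values are within-year shares
shares = df.groupby('year').weekday.value_counts(normalize=True).unstack(fill_value=0)
print(shares.round(3))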

Month,Year with Value Plot,Pandas and MatPlotLib

I have a DataFrame with Month, Year and Value, and I want to make a time series plot.
Sample:
month year Value
12 2016 0.006437804129357764
1 2017 0.013850880792606646
2 2017 0.013330349031207292
3 2017 0.07663058273768052
4 2017 0.7822831457266424
5 2017 0.8089573099244689
6 2017 1.1634845000200715
I'm trying to plot this Value data with Year and Month on the x-axis and Value on the y-axis.
One way is this:
import pandas as pd
import matplotlib.pyplot as plt
df['date'] = df['month'].map(str) + '-' + df['year'].map(str)
df['date'] = pd.to_datetime(df['date'], format='%m-%Y').dt.strftime('%m-%Y')
fig, ax = plt.subplots()
plt.plot_date(df['date'], df['Value'])
plt.show()
You need to set a DatetimeIndex for pandas to plot the axis properly. A one-line modification of your dataframe (assuming you don't need year and month as columns anymore and that the first day of each month is acceptable) would do:
df.set_index(pd.to_datetime({
    'day': 1,
    'month': df.pop('month'),
    'year': df.pop('year')
}), inplace=True)
df.Value.plot()
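If you then want the tick labels rendered as month-year, one option is to plot with pandas' x_compat=True so matplotlib date units are kept on the axis, and attach a date formatter; a minimal sketch:

import matplotlib.dates as mdates
import matplotlib.pyplot as plt

ax = df.Value.plot(x_compat=True)  # keep matplotlib date units so the formatter applies
ax.xaxis.set_major_formatter(mdates.DateFormatter('%m-%Y'))
plt.show()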

How to find out about what N the resample function in pandas did its job?

I use the Python module pandas and its resample function to calculate means of a dataset. I wonder how I can find out over what N the resampling for each day/each month takes place.
In the example below I calculate means for the three months January, February and March.
The answer to my question in that case is: N for January = 31, N for February = 29, N for March = 31. Is there a way to get that information about N for more complex data?
import pandas as pd
import numpy as np
# create dates as index
dates = pd.date_range('1/1/2000', periods=91)
index = pd.Index(dates, name='dates')

# create DataFrame df
df = pd.DataFrame(np.random.randn(91, 1), index, columns=['A'])
print df['A']

# calculate monthly mean
monthly_mean = df.resample('M', how='mean')
Thanks in advance.
You could use how='count', IIUC:
>>> df.resample('M', how='count')
2000-01-31 A 31
2000-02-29 A 29
2000-03-31 A 31
dtype: int64
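On recent pandas versions the how= keyword has been removed, so the same information comes from calling the aggregation as a method; roughly equivalent modern calls would be:

# modern pandas: call the aggregation on the resampler instead of passing how=
monthly_mean = df.resample('M').mean()
monthly_count = df.resample('M').count()  # the N behind each monthly mean

# or both at once for column A
monthly_stats = df.resample('M')['A'].agg(['mean', 'count'])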
