I want to create a dataframe that is grouped by region and date and shows the average age of each region during specific years, so my columns would look something like:
region, year, average age
so far I have:
# specify aggregation functions for column 'age'
ageAverage = {'age':{'average age':'mean'}}
#groupby and apply functions
ageDataFrame = data.groupby(['Region', data.Date.dt.year]).agg(ageAverage)
This works great, but how can I make it so that I only group data from specific years? say for example between 2010 and 2015?
You need to filter first with between:
ageDataFrame = (data[data.Date.dt.year.between(2010, 2015)]
                    .groupby(['Region', data.Date.dt.year])
                    .agg(ageAverage))
Also, in the latest pandas version (0.22.0) this raises:
SpecificationError: cannot perform renaming for age with a nested dictionary
The correct solution is to select the column after groupby and aggregate with a list of tuples, where the first value is the new column name and the second is the aggregate function:
import numpy as np
import pandas as pd

np.random.seed(123)
rng = pd.date_range('2009-04-03', periods=10, freq='13M')
data = pd.DataFrame({'Date': rng,
'Region':['reg1'] * 3 + ['reg2'] * 7,
'average age': np.random.randint(20, size=10)})
print (data)
Date Region average age
0 2009-04-30 reg1 13
1 2010-05-31 reg1 2
2 2011-06-30 reg1 2
3 2012-07-31 reg2 6
4 2013-08-31 reg2 17
5 2014-09-30 reg2 19
6 2015-10-31 reg2 10
7 2016-11-30 reg2 1
8 2017-12-31 reg2 0
9 2019-01-31 reg2 17
ageAverage = [('age', 'mean')]
#groupby and apply functions
ageDataFrame = (data[data.Date.dt.year.between(2010, 2015)]
                    .groupby(['Region', data.Date.dt.year])['average age']
                    .agg(ageAverage))
print (ageDataFrame)
age
Region Date
reg1 2010 2
2011 2
reg2 2012 6
2013 17
2014 19
2015 10
Two variations using @jezrael's data (thanks)
These are very close to what @jezrael has already shown; view this only as a demonstration of what else can be done. As pointed out in the comments by @jezrael, it is better to pre-filter first, as it reduces overall processing.
pandas.IndexSlice
Instead of pre-filtering with between:
data.groupby(
    ['Region', data.Date.dt.year]
)['average age'].agg(
    [('age', 'mean')]
).loc[pd.IndexSlice[:, 2010:2015], :]
age
Region Date
reg1 2010 2
2011 2
reg2 2012 6
2013 17
2014 19
2015 10
between as part of the groupby
data.groupby(
    [data.Date.dt.year.between(2010, 2015),
     'Region', data.Date.dt.year]
)['average age'].agg(
    [('age', 'mean')]
).loc[True]
age
Region Date
reg1 2010 2
2011 2
reg2 2012 6
2013 17
2014 19
2015 10
Month Year Open High Low Close/Price Volume
6 2019 86.78 87.11 86.06 86.55 1507828
6 2019 86.63 87.23 84.81 85.06 2481284
6 2019 85.38 85.81 84.75 85.33 2034693
6 2019 85.65 86.86 85.13 86.43 1394847
6 2019 86.66 87.74 86.66 87.55 3025379
7 2019 88.84 89.72 87.77 88.45 4017249
7 2019 89.21 90 87.95 88.87 2237183
7 2019 89.14 91.08 89.14 90.67 1647124
7 2019 90.39 90.95 89.07 90.59 3227673
I want to get the monthly averages of Open, High, Low, and Close/Price.
How do I set two values (Month, Year) as parameters for getting a value that is in another column?
df = pd.read_excel('DatosUnited.xlsx')
month = df.groupby('Month')
year = df.groupby('Year')
june2019 = month.get_group("6")
year2019 = year.get_group('2019')
I tried something like this, but I don't know how to use both as a filter simultaneously.
You can use .groupby() with multiple columns, and then you can use .mean() to get the desired averages:
df.groupby(["Month", "Year"]).mean()
This outputs:
Open High Low Close/Price Volume
Month Year
6 2019 86.220 86.9500 85.4820 86.184 2088806.20
7 2019 89.395 90.4375 88.4825 89.645 2782307.25
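If you then need the values for one specific (Month, Year) pair, you can index the grouped result by its MultiIndex, or use get_group. A minimal sketch (the inline frame below is a made-up stand-in for DatosUnited.xlsx):
import pandas as pd

# Hypothetical stand-in for the spreadsheet data
df = pd.DataFrame({
    'Month': [6, 6, 7, 7],
    'Year': [2019, 2019, 2019, 2019],
    'Open': [86.78, 86.63, 88.84, 89.21],
    'Close/Price': [86.55, 85.06, 88.45, 88.87],
})

monthly_means = df.groupby(['Month', 'Year']).mean()

# Select one (Month, Year) combination from the MultiIndex
print(monthly_means.loc[(6, 2019)])

# Or pull out the raw rows for that combination
june_2019 = df.groupby(['Month', 'Year']).get_group((6, 2019))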
I am new to pandas and am trying to figure out how to calculate the percentage change (difference) between two years, given that sometimes there is no previous year.
I am given a dataframe as follows:
company date amount
1 Company 1 2020 3
2 Company 1 2021 1
3 COMPANY2 2020 7
4 Company 3 2020 4
5 Company 3 2021 4
.. ... ... ...
766 Company N 2021 9
765 Company N 2020 1
767 Company XYZ 2021 3
768 Company X 2021 3
769 Company Z 2020 2
I wrote something like this:
for company in unique(df2.company):
    company_df = df2[df2.company == company]
    company_df.sort_values(by="date")
    company_df_year = company_df.amount.tolist()
    company_df_year.pop()
    company_df_year.insert(0, 0)
    company_df["value_year_before"] = company_df_year
    if any in company_df.value_year_before == None:
        company_df["diff"] = 0
    else:
        company_df["diff"] = (company_df.amount - company_df.value_year_before) / company_df.value_year_before
    df2["ratio"] = company_df["diff"]
But I keep getting NaN. Where did I make a mistake?
The main issue is that you are overwriting company_df in each iteration of the loop and only keeping the last one.
However, when using pandas, reaching for a for loop usually means there is an easier way to accomplish the goal. Here you could use groupby and pct_change to compute the ratio within each group.
df = df.sort_values(['company', 'date'])
df['ratio'] = df.groupby('company')['amount'].pct_change()
df['ratio'] = df['ratio'].fillna(0.0)
groupby keeps the order of the rows within each group, so we sort beforehand to ensure that the dates are in the correct order, and fillna replaces any NaNs with 0.
Result:
company date amount ratio
3 COMPANY2 2020 7 0.000000
1 Company 1 2020 3 0.000000
2 Company 1 2021 1 -0.666667
4 Company 3 2020 4 0.000000
5 Company 3 2021 4 0.000000
765 Company N 2020 1 0.000000
766 Company N 2021 9 8.000000
768 Company X 2021 3 0.000000
767 Company XYZ 2021 3 0.000000
769 Company Z 2020 2 0.000000
Apply an anonymous function that calculates the percentage change and returns it only if there is more than one value. Use:
df = pd.DataFrame({'company': [1,1,3], 'date':[2020,2021,2020], 'amount': [4,5,7]})
df.groupby('company')['amount'].apply(lambda x: (list(x)[1]-list(x)[0])/list(x)[0] if len(x)>1 else 'not enough values')
Input df:
   company  date  amount
0        1  2020       4
1        1  2021       5
2        3  2020       7
Output:
company
1                 0.25
3    not enough values
Name: amount, dtype: object
I have a dataframe with historical market caps for which I need to compute their 5-year compound annual growth rates (CAGRs). However, the dataframe has hundreds of companies with 20 years of values each, so I need to be able to isolate each company's data to compute their CAGRs. How do I go about doing this?
The function to calculate a CAGR is: (end/start)^(1/# years)-1. I have never used .groupby() or .apply(), so I don't know how to implement the CAGR equation for rolling values.
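For example, on made-up numbers the formula works out like this (hypothetical values, not from my data):
start, end, years = 100.0, 150.0, 5
cagr = (end / start) ** (1 / years) - 1
print(cagr)  # roughly 0.0845, i.e. about 8.45% per year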
Here is a screenshot of part of the dataframe so you have a visual representation of what I am trying to use:
Screenshot of dataframe.
Any guidance would be greatly appreciated!
Assuming there is one value per company per year, you can reduce the date to the year. This is a lot simpler; no need for groupby or apply.
Say your dataframe is named df. First, reduce the date to the year:
df['year'] = df['Date'].dt.year
Second, add a year+5 column:
df['year+5'] = df['year'] + 5
Third, merge the 'df' with itself:
df_new = pd.merge(df, df, how='inner', left_on=['Instrument', 'year+5'], right_on=['Instrument', 'year'], suffixes=['_start', '_end'])
Finally, calculate the rolling 5-year CAGR:
df_new['CAGR'] = (df_new['Company Market Cap_end']/df_new['Company Market Cap_start'])**(0.2)-1
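Since 'Instrument' and 'Company Market Cap' come from the screenshot that isn't reproduced here, a self-contained sketch of the same self-merge idea on made-up data (column names assumed) could look like this:
import pandas as pd

# Hypothetical data: one market cap per company per year
df = pd.DataFrame({
    'Instrument': ['A'] * 7 + ['B'] * 7,
    'year': list(range(2013, 2020)) * 2,
    'Company Market Cap': [100, 110, 130, 150, 160, 180, 210,
                           50, 55, 52, 60, 70, 80, 90],
})
df['year+5'] = df['year'] + 5

# Pair each starting year (left) with the row five years later (right)
df_new = pd.merge(df, df, how='inner',
                  left_on=['Instrument', 'year+5'],
                  right_on=['Instrument', 'year'],
                  suffixes=['_start', '_end'])

df_new['CAGR'] = (df_new['Company Market Cap_end']
                  / df_new['Company Market Cap_start']) ** (1 / 5) - 1
print(df_new[['Company Market Cap_start', 'Company Market Cap_end', 'CAGR']])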
Setting up a toy example:
import numpy as np
import pandas as pd
idx_level_0 = np.repeat(["company1", "company2", "company3"], 5)
idx_level_1 = np.tile([2015, 2016, 2017, 2018, 2019], 3)
values = np.random.randint(low=1, high=100, size=15)
df = pd.DataFrame({"values": values}, index=[idx_level_0, idx_level_1])
df.index.names = ["company", "year"]
print(df)
values
company year
company1 2015 19
2016 61
2017 87
2018 55
2019 46
company2 2015 1
2016 68
2017 50
2018 93
2019 84
company3 2015 11
2016 84
2017 54
2018 21
2019 55
I suggest using groupby to group by individual companies. You can then apply your computation via a lambda function. The result is basically a one-liner.
# actual computation for a two-year period
cagr_period = 2
df["cagr"] = df.groupby("company").apply(lambda x, period: ((x.pct_change(period) + 1) ** (1/period)) - 1, cagr_period)
print(df)
values cagr
company year
company1 2015 19 NaN
2016 61 NaN
2017 87 1.139848
2018 55 -0.050453
2019 46 -0.272858
company2 2015 1 NaN
2016 68 NaN
2017 50 6.071068
2018 93 0.169464
2019 84 0.296148
company3 2015 11 NaN
2016 84 NaN
2017 54 1.215647
2018 21 -0.500000
2019 55 0.009217
I have a panel dataset with a set of countries [Italy and US] for 3 years and two numeric variables ['Var1', 'Var2']. I would like to calculate the rate of change over the last three years, e.g. the value of Var1 in 2019 minus the value of Var1 in 2017, divided by Var1 in 2017.
I do not understand why my code (below) returns NaN values.
data = {'Year':[2017, 2018, 2019, 2017, 2018, 2019], 'Country':['Italy', 'Italy', 'Italy', 'US' , 'US', 'US'], 'Var1':[23,75,45, 32,13,14], 'Var2':[21,75,47, 30,11,18]}
trend = pd.DataFrame(data)
list = ['Var1', 'Var2']
for col in list:
    trend[col + ' (3 Year % Change)'] = ((trend.loc[trend['Year']==2019][col] - trend.loc[trend['Year']==2017][col])/trend.loc[trend['Year']==2017][col])*100
trend
Check if this gives what you want. It is much simpler to understand.
trend['Var1_3_Year_%_Change'] = trend.groupby('Country')['Var1'].apply(lambda x : ((x-x.iloc[0]))/x.iloc[0]*100)
trend['Var2_3_Year_%_Change'] = trend.groupby('Country')['Var2'].apply(lambda x : ((x-x.iloc[0]))/x.iloc[0]*100)
trend['Var1_yearly'] = trend.groupby('Country')['Var1'].apply(lambda x : ((x-x.shift()))/x.shift()*100)
trend['Var2_yearly'] = trend.groupby('Country')['Var2'].apply(lambda x : ((x-x.shift()))/x.shift()*100)
Output
Year Country Var1 Var2 Var1_3_Year_%_Change Var2_3_Year_%_Change Var1_yearly Var2_yearly
2017 Italy 23 21 0.000000 0.000000 NaN NaN
2018 Italy 75 75 226.086957 257.142857 226.086957 257.142857
2019 Italy 45 47 95.652174 123.809524 -40.000000 -37.333333
2017 US 32 30 0.000000 0.000000 NaN NaN
2018 US 13 11 -59.375000 -63.333333 -59.375000 -63.333333
2019 US 14 18 -56.250000 -40.000000 7.692308 63.636364
If it has to be done with a for loop, use:
var = ['Var1', 'Var2']
for col in var:
    trend[col + ' (3 Year % Change)'] = trend.groupby('Country')[col].apply(lambda x : ((x - x.iloc[0])) / x.iloc[0] * 100)
There are a few things going wrong here with your code:
You are trying to divide pd.Series objects, not just their underlying arrays; they carry their indexes, and the misaligned indexes make the division produce NaN.
If you actually pass the values, for instance by using .values after the column filters, you will bump into a ValueError, because you would be inserting two values into the entire DataFrame and pandas won't like that (the lengths should match). This exemplifies it:
trend.loc['Var1' + ' (3 Year % Change)'] = ((trend.loc[trend['Year'] == 2019, 'Var1'].values -
                                             trend.loc[trend['Year'] == 2017, 'Var1'].values) /
                                            trend.loc[trend['Year'] == 2017, 'Var1'].values) * 100
ValueError: cannot set a row with mismatched columns
Not sure if you are using list as an actual variable name, but that is a built-in Python name, and shadowing it is not the best idea.
If you want to compare the values with 2017 values in your sample, you can use
groupby+shift, based on how many years to shift:
for col in ['Var1','Var2']:
    trend[col + ' (3 Year % Change)'] = (trend[col] - trend.groupby('Country').shift(2)[col])/trend.groupby('Country').shift(2)[col]
Out[1]:
Year Country Var1 Var2 Var1 (3 Year % Change) Var2 (3 Year % Change)
0 2017 Italy 23 21 NaN NaN
1 2018 Italy 75 75 NaN NaN
2 2019 Italy 45 47 0.956522 1.238095
3 2017 US 32 30 NaN NaN
4 2018 US 13 11 NaN NaN
5 2019 US 14 18 -0.562500 -0.400000
I have time-series data of a yearly sports tournament, with the date when each game was played. I want to group the games by the season (year) they were played in. Each season starts in August and ends in July of the NEXT year.
How would I go about grouping the games by season, like
season(2016-2017), season(2017-2018), etc.?
This answer involving df.resample() may be related, but I'm not sure how I'd go about applying it here.
This is what the date column looks like:
DATE
26/09/09
04/10/09
17/10/09
25/10/09
31/10/09
...
29/09/18
07/10/18
28/10/18
03/11/18
I want to group by seasons so that I can perform visualization operations over the aggregated data.
UPDATE: For the time being my solution is to split up the dataframe into groups of 32 as I know each season has 32 games. This is the code I've used:
split_df = np.array_split(df, np.arange(0, len(df),32))
But I'd prefer something more elegant that makes better use of the time-series data, so I'll keep the question open.
The key to success is proper grouping, in your case pd.Grouper(key='DATA', freq='AS-AUG').
Note that freq='AS-AUG' states that your groups should start from the start of
August each year.
Look at the following script:
import numpy as np
import pandas as pd
# Source columns
dates = [ '01/04/09', '31/07/09', '01/08/09', '26/09/09', '04/10/09', '17/12/09',
'25/01/10', '20/04/10', '31/07/10', '01/08/10', '28/10/10', '03/11/10',
'25/12/10', '20/04/11', '31/07/11' ]
scores_x = np.random.randint(0, 20, len(dates))
scores_y = np.random.randint(0, 20, len(dates))
# Source DataFrame
df = pd.DataFrame({'DATA': dates, 'SCORE_X': scores_x, 'SCORE_Y': scores_y})
# Convert string date to datetime
df.DATA = pd.to_datetime(df.DATA, format='%d/%m/%y')
# Grouping
gr = df.groupby(pd.Grouper(key='DATA', freq='AS-AUG'))
If you print the results:
for name, group in gr:
    print()
    print(name)
    print(group)
you will get:
2008-08-01 00:00:00
DATA SCORE_X SCORE_Y
0 2009-04-01 16 11
1 2009-07-31 10 7
2009-08-01 00:00:00
DATA SCORE_X SCORE_Y
2 2009-08-01 19 6
3 2009-09-26 14 5
4 2009-10-04 8 11
5 2009-12-17 12 19
6 2010-01-25 0 0
7 2010-04-20 17 6
8 2010-07-31 18 2
2010-08-01 00:00:00
DATA SCORE_X SCORE_Y
9 2010-08-01 15 18
10 2010-10-28 2 4
11 2010-11-03 8 16
12 2010-12-25 13 1
13 2011-04-20 19 7
14 2011-07-31 8 3
As you can see, each group starts on the 1st of August and ends on the 31st of July.
Then you can do whatever you want with your groups.
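For example, a small sketch that sums the scores per season and builds the 'season(2016-2017)'-style labels asked about (it reuses the gr object from the script above):
# Sum the scores within each season; the index holds each season's start date
season_totals = gr[['SCORE_X', 'SCORE_Y']].sum()

# Label each season as 'YYYY-YYYY' using its start year
season_totals.index = ['{}-{}'.format(d.year, d.year + 1) for d in season_totals.index]
print(season_totals)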
Use -
df.groupby(df['DATE'].dt.year).count()
Output
DATE
DATE
2009 5
2018 4
Custom Season Grouping
min_year = df['DATE'].dt.year.min()
max_year = df['DATE'].dt.year.max()
rng = pd.date_range(start='{}-07'.format(min_year), end='{}-08'.format(max_year), freq='12M').to_series()
df.groupby(pd.cut(df['DATE'], rng)).count()
Output
DATE
DATE
(2009-07-31, 2010-07-31] 3
(2010-07-31, 2011-07-31] 0
(2011-07-31, 2012-07-31] 0
(2012-07-31, 2013-07-31] 0
(2013-07-31, 2014-07-31] 0
(2014-07-31, 2015-07-31] 0
(2015-07-31, 2016-07-31] 0
(2016-07-31, 2017-07-31] 0
(2017-07-31, 2018-07-31] 1
Resampling using 'A-JUL' as an anchored offset alias should do the trick:
>>> df
SAMPLE
DATE
2009-01-30 1
2009-07-10 4
2009-11-20 3
2010-01-01 5
2010-05-13 1
2010-08-01 1
>>> df.resample('A-JUL').sum()
SAMPLE
DATE
2009-07-31 5
2010-07-31 9
2011-07-31 1
'A' indicates a yearly interval; '-JUL' indicates it ends in July.
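A sketch of how such a frame can be built and resampled end to end (using the same sample values as in the printout above):
import pandas as pd

df = pd.DataFrame(
    {'SAMPLE': [1, 4, 3, 5, 1, 1]},
    index=pd.to_datetime(['2009-01-30', '2009-07-10', '2009-11-20',
                          '2010-01-01', '2010-05-13', '2010-08-01']))
df.index.name = 'DATE'

# Yearly bins that end in July, i.e. August-to-July seasons
print(df.resample('A-JUL').sum())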
You could build a season column and group by that. In the code below, I used pandas.DateOffset() to move all dates back 7 months, so a game that happened in August looks as if it happened in January; this aligns the season year with the calendar year. Building the season string is fairly straightforward after that.
import pandas as pd
from datetime import date
dates = pd.date_range(date(2009, 8, 1), date(2018, 7, 30), freq='17d')
df = pd.DataFrame(dates, columns=['date'])
# copy the date column to a separate dataframe to do the work
df_tmp = df[['date']].copy()
df_tmp['season_start_year'] = (df_tmp['date'] - pd.DateOffset(months=7)).dt.year
df_tmp['season_end_year'] = df_tmp['season_start_year'] + 1
df_tmp['season'] = df_tmp['season_start_year'].map(str) + '-' + df_tmp['season_end_year'].map(str)
# copy season column to the main dataframe
df['season'] = df_tmp['season']
df.groupby('season').count()