Pandas Method Chaining: getting KeyError on calculated column - python

I'm scraping web data to get US college football poll top 25 information that I store in a Pandas dataframe. The data has multiple years of poll information, with preseason and final polls for each year. Each poll ranks teams from 1 to 25. Team ranks are determined by the voting points each team received; the team with the most points is ranked 1, etc. Both rank and points are included in the dataset. Here's the head of the raw data df:
cols = ['Year','Type', 'Team (FPV)', 'Rank', 'Pts']
all_wks_raw[cols].head()
The dataframe has columns for Rank and Pts (Points). The Rank column (dtype object) contains numeric ranks of 1-25 plus "RV" for teams that received points but did not rank in the top 25. The Pts column is dtype int64. Since Pts for teams that did not make the top 25 are included in the data, I'm able to re-rank the teams based on Pts and thus extend rankings beyond the top 25. The resulting revrank column ranks teams from 1 to between 37 and 61, depending on how many teams received points in that poll. Revrank is the first new column I create.
The revrank column should equal the Rank column for the first 25 teams, but before I can test it I need to create a new column that converts Rank to numeric. The result is rank_int, which is my second created column. Then I try to create a third column that calculates the difference between the two created columns, and this is where I get the KeyError. Here's the chain:
all_wks_clean = (all_wks_raw
    # create new column that converts Rank to numeric - this works
    .assign(rank_int = pd.to_numeric(all_wks_raw['Rank'], errors='coerce').fillna(0))
    # create new column that re-ranks teams based on Points: extends rankings beyond original 25 - this works
    .assign(gprank = all_wks_raw.reset_index(drop=True).groupby(['Year','Type'])['Pts'].rank(ascending=0, method='min'))
    # create new column that takes the difference between the gprank and rank_int columns created above - this fails with KeyError: 'gprank'
    .assign(ck_rank = all_wks_raw['gprank'] - all_wks_raw['rank_int'])
)
Are the results of the first two assignments not being passed to the third? Am I missing something in the syntax? Thanks for the help.
Edited 7/20/2022 to add complete code; note that this code scrapes data from the College Poll Archive web site:
dict = {1119: [2016, '2016 Final AP Football Poll', 'Final'],
        1120: [2017, '2017 Preseason AP Football Poll', 'Preseason'],
        1135: [2017, '2017 Final AP Football Poll', 'Final'],
        1136: [2018, '2018 Preseason AP Football Poll', 'Preseason'],
        1151: [2018, '2018 Final AP Football Poll', 'Final'],
        1152: [2019, '2019 Preseason AP Football Poll', 'Preseason']}
# get one week of poll data from College Poll Archive ID parameter
def getdata(id):
    coldefs = {'ID': key, 'Year': value[0], 'Title': value[1], 'Type': value[2]}  # define dictionary of scalar columns to add to dataframe
    urlseg = 'https://www.collegepollarchive.com/football/ap/seasons.cfm?appollid='
    url = urlseg + str(id)
    dfs = pd.read_html(url)
    df = dfs[0].assign(**coldefs)
    return df

all_wks_raw = pd.DataFrame()
for key, value in dict.items():
    print(key, value[0], value[2])
    onewk = getdata(key)
    all_wks_raw = all_wks_raw.append(onewk)
all_wks_clean = (all_wks_raw
    # create new column that converts Rank to numeric - this works
    .assign(rank_int = pd.to_numeric(all_wks_raw['Rank'], errors='coerce').fillna(0))
    # create new column that re-ranks teams based on Points: extends rankings beyond original 25 - this works
    .assign(gprank = all_wks_raw.reset_index(drop=True).groupby(['Year','Type'])['Pts'].rank(ascending=0, method='min'))
    # create new column that takes the difference between the gprank and rank_int columns created above - this fails with KeyError: 'gprank'
    .assign(ck_rank = all_wks_raw['gprank'] - all_wks_raw['rank_int'])
)

If you access a column that doesn't yet exist at that point in the chain, you must do so through a lambda:
dfs = pd.read_html('https://www.collegepollarchive.com/football/ap/seasons.cfm?seasonid=2019')
df = dfs[0][['Team (FPV)', 'Rank', 'Pts']].copy()
df['Year'] = 2016
df['Type'] = 'final'
df = df.assign(rank_int = pd.to_numeric(df['Rank'], errors='coerce').fillna(0).astype(int),
               gprank = df.groupby(['Year','Type'])['Pts'].rank(ascending=0, method='min'),
               ck_rank = lambda x: x['gprank'].sub(x['rank_int']))
print(df)
Output:
Team (FPV) Rank Pts Year Type rank_int gprank ck_rank
0 LSU (62) 1 1550 2016 final 1 1.0 0.0
1 Clemson 2 1487 2016 final 2 2.0 0.0
2 Ohio State 3 1426 2016 final 3 3.0 0.0
3 Georgia 4 1336 2016 final 4 4.0 0.0
4 Oregon 5 1249 2016 final 5 5.0 0.0
5 Florida 6 1211 2016 final 6 6.0 0.0
6 Oklahoma 7 1179 2016 final 7 7.0 0.0
7 Alabama 8 1159 2016 final 8 8.0 0.0
8 Penn State 9 1038 2016 final 9 9.0 0.0
9 Minnesota 10 952 2016 final 10 10.0 0.0
10 Wisconsin 11 883 2016 final 11 11.0 0.0
11 Notre Dame 12 879 2016 final 12 12.0 0.0
12 Baylor 13 827 2016 final 13 13.0 0.0
13 Auburn 14 726 2016 final 14 14.0 0.0
14 Iowa 15 699 2016 final 15 15.0 0.0
15 Utah 16 543 2016 final 16 16.0 0.0
16 Memphis 17 528 2016 final 17 17.0 0.0
17 Michigan 18 468 2016 final 18 18.0 0.0
18 Appalachian State 19 466 2016 final 19 19.0 0.0
19 Navy 20 415 2016 final 20 20.0 0.0
20 Cincinnati 21 343 2016 final 21 21.0 0.0
21 Air Force 22 209 2016 final 22 22.0 0.0
22 Boise State 23 188 2016 final 23 23.0 0.0
23 UCF 24 78 2016 final 24 24.0 0.0
24 Texas 25 69 2016 final 25 25.0 0.0
25 Texas A&M RV 54 2016 final 0 26.0 26.0
26 Florida Atlantic RV 46 2016 final 0 27.0 27.0
27 Washington RV 39 2016 final 0 28.0 28.0
28 Virginia RV 28 2016 final 0 29.0 29.0
29 USC RV 16 2016 final 0 30.0 30.0
30 San Diego State RV 13 2016 final 0 31.0 31.0
31 Arizona State RV 12 2016 final 0 32.0 32.0
32 SMU RV 10 2016 final 0 33.0 33.0
33 Tennessee RV 8 2016 final 0 34.0 34.0
34 California RV 6 2016 final 0 35.0 35.0
35 Kansas State RV 2 2016 final 0 36.0 36.0
36 Kentucky RV 2 2016 final 0 36.0 36.0
37 Louisiana RV 2 2016 final 0 36.0 36.0
38 Louisiana Tech RV 2 2016 final 0 36.0 36.0
39 North Dakota State RV 2 2016 final 0 36.0 36.0
40 Hawaii NR 0 2016 final 0 41.0 41.0
41 Louisville NR 0 2016 final 0 41.0 41.0
42 Oklahoma State NR 0 2016 final 0 41.0 41.0

Adding to BeRT2me's answer: when chaining, lambdas are pretty much always the way to go. When you use the original dataframe's name, pandas looks at the dataframe as it was before the statement was executed. To avoid confusion, go with:
df = df.assign(rank_int = lambda x: pd.to_numeric(x['Rank'], errors='coerce').fillna(0).astype(int),
               gprank = lambda x: x.groupby(['Year','Type'])['Pts'].rank(ascending=0, method='min'),
               ck_rank = lambda x: x['gprank'].sub(x['rank_int']))
The x you define is the dataframe at that state in the chain.
This helps especially when your chains get longer. E.g., if you filter out some rows or aggregate, you get different results (or maybe an error) depending on what you're trying to do.
For example, if you were just looking at the relative rank of 3 teams:
df = pd.DataFrame({
    'Team (FPV)': list('abcde'),
    'Rank': list(range(5)),
    'Pts': list(range(5)),
})
df['Year'] = 2016
df['Type'] = 'final'
df = (df
    .loc[lambda x: x['Team (FPV)'].isin(["b", "c", "d"])]
    .assign(bcd_rank = lambda x: x.groupby(['Year','Type'])['Pts'].rank(ascending=0, method='min'))
)
print(df)
gives:
Team (FPV) Rank Pts Year Type bcd_rank
1 b 1 1 2016 final 3.0
2 c 2 2 2016 final 2.0
3 d 3 3 2016 final 1.0
Whereas:
df = pd.DataFrame({
    'Team (FPV)': list('abcde'),
    'Rank': list(range(5)),
    'Pts': list(range(5)),
})
df['Year'] = 2016
df['Type'] = 'final'
df = (df
    .loc[lambda x: x['Team (FPV)'].isin(["b", "c", "d"])]
    .assign(bcd_rank = df.groupby(['Year','Type'])['Pts'].rank(ascending=0, method='min'))
)
print(df)
gives a different ranking:
Team (FPV) Rank Pts Year Type bcd_rank
1 b 1 1 2016 final 4.0
2 c 2 2 2016 final 3.0
3 d 3 3 2016 final 2.0
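As a minimal, self-contained illustration of the difference (a sketch, not taken from the question's data):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

# Referencing df by name inside the chain looks at df as it existed before
# the chain started, so 'b' does not exist there yet and a KeyError is raised.
try:
    out = (df.assign(b=df['a'] * 2)
             .assign(c=df['b'] + 1))  # KeyError: 'b'
except KeyError as err:
    print('KeyError:', err)

# A lambda receives the intermediate frame, which already has 'b'.
out = (df.assign(b=df['a'] * 2)
         .assign(c=lambda x: x['b'] + 1))
print(out)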
If you want to go deeper, I'd recommend https://tomaugspurger.github.io/method-chaining.html to go on your reading list.

Related

Python Program to Sum of each Row and each Column of a Matrix

I want to create a Python script to calculate the row and column totals (like a matrix).
Content of the "budget.txt" file:
Budget Jan Feb Mar Apr May Jun Sum
Milk 10 20 31 52 7 11
Eggs 1 5 1 16 4 58
Bread 22 36 17 8 21 16
Butter 4 5 8 11 36 2
Total
The script should read the budget.txt file and fill in the results in the "Sum" column and the "Total" row.
Here is my code:
import sys
budget_file = sys.argv[1]
df = open(budget_file).read()
print(df)
Output: I can read the file. Now my question is: how do I sum the values row- and column-wise?
Given your dataframe, I would approach the calculations in two steps:
Compute the Sum of each row of the dataframe (see How To Sum DataFrame Rows)
Compute the Sum of each column
Given the following dataframe:
Budget Jan Feb Mar Apr May Jun
0 Milk 10 20 31 52 7 11
1 Eggs 1 5 1 16 4 58
2 Bread 22 36 17 8 21 16
3 Butter 4 5 8 11 36 2
To compute the Sum across the rows of the dataframe:
# list columns to sum across
df['Sum'] = df[list(df.columns)[1:]].sum(axis=1)
This produces the following df:
Budget Jan Feb Mar Apr May Jun Sum
0 Milk 10.0 20.0 31.0 52.0 7.0 11.0 131.0
1 Eggs 1.0 5.0 1.0 16.0 4.0 58.0 85.0
2 Bread 22.0 36.0 17.0 8.0 21.0 16.0 120.0
3 Butter 4.0 5.0 8.0 11.0 36.0 2.0 66.0
To add a row with the sum of all columns:
df.append(pd.Series(df.sum(),name='Total'))
This will yield a dataframe like:
Budget Jan Feb Mar Apr May Jun Sum
0 Milk 10 20 31 52 7 11 262.0
1 Eggs 1 5 1 16 4 58 170.0
2 Bread 22 36 17 8 21 16 240.0
3 Butter 4 5 8 11 36 2 132.0
Total MilkEggsBreadButter 37 66 57 87 68 87 804.0
If the string concatenation under Budget is bothersome, you can reset that specific cell to an empty string.
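For a runnable end-to-end version, here is a sketch that reads the file and adds both the Sum column and the Total row. It assumes budget.txt is whitespace-delimited exactly as shown, and it uses pd.concat because DataFrame.append was removed in pandas 2.0:
import pandas as pd

# Sketch, assuming budget.txt is whitespace-delimited as shown above,
# with a header line (skipped) and a trailing 'Total' line (skipped)
df = pd.read_csv('budget.txt', sep=r'\s+', engine='python',
                 skiprows=1, skipfooter=1,
                 names=['Budget', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'])

# Row-wise sums over the month columns
df['Sum'] = df.iloc[:, 1:].sum(axis=1)

# Column-wise totals appended as a new row; pd.concat replaces DataFrame.append,
# which was removed in pandas 2.0
total = df.iloc[:, 1:].sum()
total['Budget'] = 'Total'
df = pd.concat([df, total.to_frame().T], ignore_index=True)
print(df)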

Python: Merge on 2 columns

I'm working with a large dataset. The following is an example, calculated with a smaller dataset.
In this example I've got measurements of the pollution of 3 rivers over different timespans. Each year, the pollution of a river is measured at a measuring station downstream ("pollution"). It has already been calculated in which year the river water was polluted upstream ("year_of_upstream_pollution"). My goal is to create a new column ["result_of_upstream_pollution"] that contains the amount of pollution connected to the "year_of_upstream_pollution". For this, the data from the "pollution" column has to be reassigned.
import numpy as np
import pandas as pd

ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3]
year = [2000,2001,2002,2003,2004,2005,1990,1991,1992,1993,1994,1995,2000,2001,2002,2003,2004,2005]
y1 = [2002,2002,2003,2005,2005,np.NaN,1991,1992,1993,1994,np.NaN,np.NaN,2012,2012,2013,2014,2015,np.NaN]
poll = [10,14,20,11,8,11,
        20,22,20,25,18,21,
        30,19,15,10,26,28]
dictr1 = {"river_id": ids, "year": year, "pollution": poll, "year_of_upstream_pollution": y1}
dfr1 = pd.DataFrame(dictr1)
print(dfr1)
river_id year pollution year_of_upstream_pollution
0 1 2000 10 2002.0
1 1 2001 14 2002.0
2 1 2002 20 2003.0
3 1 2003 11 2005.0
4 1 2004 8 2005.0
5 1 2005 11 NaN
6 2 1990 20 1991.0
7 2 1991 22 1992.0
8 2 1992 20 1993.0
9 2 1993 25 1994.0
10 2 1994 18 NaN
11 2 1995 21 NaN
12 3 2000 30 2002.0
13 3 2001 19 2002.0
14 3 2002 15 2003.0
15 3 2003 10 2004.0
16 3 2004 26 2005.0
17 3 2005 28 NaN
Example: river_id = 1, year = 2000, year_of_upstream_pollution = 2002
value of the pollution-column in year 2002 = 20
Therefore: result_of_upstream_pollution = 20
The resulting column should look like this:
result_of_upstream_pollution
0 20.0
1 20.0
2 11.0
3 11.0
4 11.0
5 NaN
6 22.0
7 20.0
8 25.0
9 18.0
10 NaN
11 NaN
12 15.0
13 15.0
14 10.0
15 26.0
16 28.0
17 NaN
My own approach:
### My approach
# Split dfr1 in two
dfr3 = pd.DataFrame(dfr1, columns = ["river_id","year","pollution"])
dfr4 = pd.DataFrame(dfr1, columns = ["river_id","year_of_upstream_pollution"])
# Merge the two dataframes on the "year" and "year_of_upstream_pollution"-column
arrayr= dfr4.merge(dfr3, left_on = "year_of_upstream_pollution", right_on = "year", how = "left").pollution.values
listr = arrayr.tolist()
dfr1["result_of_upstream_pollution"] = listr
print(dfr1)
len(listr) # = 28
This results in the following ValueError:
"Length of values does not match length of index"
My explanation for this is that the values in the "year" column of "dfr3" are not unique, which leads to several rows being matched for each year and explains why len(listr) = 28.
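A quick length check makes this visible (a small diagnostic sketch, not part of the original code):
# Merging on the year columns alone matches rows across rivers, so the
# result has more rows than dfr1 and cannot be assigned back directly.
m = dfr4.merge(dfr3, left_on='year_of_upstream_pollution',
               right_on='year', how='left')
print(len(dfr1), len(m))  # the merge result is longer than dfr1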
I haven't been able to find a way around this error yet. Please keep in mind that the real dataset is much larger than this one. Any help would be much appreciated!
As you said in the title, this is a merge on two columns:
dfr1['result_of_upstream_pollution'] = dfr1.merge(dfr1, left_on=['river_id','year'],
                                                  right_on=['river_id','year_of_upstream_pollution'],
                                                  how='right')['pollution_x']
print(dfr1[['result_of_upstream_pollution']])
Output:
result_of_upstream_pollution
0 20.0
1 20.0
2 11.0
3 11.0
4 11.0
5 NaN
6 22.0
7 20.0
8 25.0
9 18.0
10 NaN
11 NaN
12 15.0
13 15.0
14 10.0
15 26.0
16 28.0
17 NaN
I just realized that this solution doesn't seem to be working for me.
When I execute the code, this is what happens:
dfr1['result_of_upstream_pollution'] = dfr1.merge(dfr1, left_on=['river_id','year'],
                                                  right_on=['river_id','year_of_upstream_pollution'],
                                                  how='right')['pollution_x']
print(dfr1)
river_id year pollution year_of_upstream_pollution \
0 1 2000 10 2002.0
1 1 2001 14 2002.0
2 1 2002 20 2003.0
3 1 2003 11 2005.0
4 1 2004 8 2005.0
5 1 2005 11 NaN
6 2 1990 20 1991.0
7 2 1991 22 1992.0
8 2 1992 20 1993.0
9 2 1993 25 1994.0
10 2 1994 18 NaN
11 2 1995 21 NaN
12 3 2000 30 2002.0
13 3 2001 19 2002.0
14 3 2002 15 2003.0
15 3 2003 10 2004.0
16 3 2004 26 2005.0
17 3 2005 28 NaN
result_of_upstream_pollution
0 20.0
1 20.0
2 11.0
3 11.0
4 11.0
5 22.0
6 20.0
7 25.0
8 18.0
9 15.0
10 15.0
11 10.0
12 26.0
13 28.0
14 NaN
15 NaN
16 NaN
17 NaN
For some reason, this code doesn't seem to be handling the "NaN" values in the right way.
If there is an "NaN"-value (in the column: "year_of_upstream_pollution"), there shouldnt be a value in "result_of_upstream_pollution".
Equally, the ids 14,15 and 16 all have values for the "year_of_upstream_pollution" which has matching data in the "pollution-column" and therefore should also have values in the result-column.
On top of that, it seems that all values after the first "NaN" (at id = 5) are assigned the wrong values.
#Quang Hoang Thank you very much for trying to solve my problem! Could you maybe explain why my results differ from yours?
Does anyone know how i can get this code to work?
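One way that might sidestep the misalignment (a sketch, not from the original thread; the renamed lookup columns are made up for illustration) is to keep dfr1 as the left frame, so row order and length are preserved and NaN years simply find no match:
# Sketch: look up pollution by (river_id, year_of_upstream_pollution)
lookup = dfr1[['river_id', 'year', 'pollution']].rename(
    columns={'year': 'y_up', 'pollution': 'pollution_up'})  # hypothetical names
dfr1['result_of_upstream_pollution'] = dfr1.merge(
    lookup,
    left_on=['river_id', 'year_of_upstream_pollution'],
    right_on=['river_id', 'y_up'],
    how='left')['pollution_up'].values
print(dfr1[['river_id', 'year', 'result_of_upstream_pollution']])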

How to append keys which are not previously in dataframe 1 but are in dataframe 2 against each name

I have a Dataframe df1 like this
id name day marks mean_marks
0 1 John Wed 28 28
1 1 John Fri 30 30
2 2 Alex Fri 40 50
3 2 Alex Fri 60 50
and another dataframe df2 as:
day we
0 Mon 29
1 Wed 21
2 Fri 31
Now when I do:
z = pd.merge(df1, df2, how='outer', on=['day']).fillna(0)
I get:
id name day marks mean_marks we
0 1.0 John Wed 28.0 28.0 21
1 1.0 John Fri 30.0 30.0 31
2 2.0 Alex Fri 40.0 50.0 31
3 2.0 Alex Fri 60.0 50.0 31
4 0.0 0 Mon 0.0 0.0 29
but I wanted something that looks like:
id name day marks mean_marks we
0 1.0 John Wed 28.0 28.0 21
1 1.0 John Mon 0.0 0.0 29
2 1.0 John Fri 30.0 30.0 31
3 2.0 Alex Mon 0.0 0.0 29
4 2.0 Alex Wed 0.0 0.0 21
5 2.0 Alex Fri 40.0 50.0 31
6 2.0 Alex Fri 60.0 50.0 31
That is, 'day' values that are not previously in df1 but are in df2 should be appended against each name.
Can someone please help me with this?
You might need a cross join to create all combinations of days per id and name, then merge should work:
u = df1[['id','name']].drop_duplicates().assign(k=1).merge(df2.assign(k=1), on='k')
out = df1.merge(u.drop('k', axis=1), on=['day','name','id'], how='outer').fillna(0)
print(out.sort_values(['id','name']))
id name day marks mean_marks we
0 1 John Wed 28.0 28.0 21
1 1 John Fri 30.0 30.0 31
4 1 John Mon 0.0 0.0 29
2 2 Alex Fri 40.0 50.0 31
3 2 Alex Fri 60.0 50.0 31
5 2 Alex Mon 0.0 0.0 29
6 2 Alex Wed 0.0 0.0 21
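On pandas 1.2 or later, the cross join above can be written with merge(how='cross') instead of the dummy k column (a sketch of the equivalent call):
# Sketch, assuming pandas >= 1.2 where merge supports how='cross'
u = df1[['id', 'name']].drop_duplicates().merge(df2, how='cross')
out = df1.merge(u, on=['day', 'name', 'id'], how='outer').fillna(0)
print(out.sort_values(['id', 'name']))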
The following code should do it:
z = (df1.groupby(['name'])
        .apply(lambda grp: grp.merge(df2, how='outer', on='day')
                              .fillna({'name': grp.name, 'id': grp.id}))
        .reset_index(drop=True)
        .fillna(0))
It gives the following output:
id name day marks mean_marks we
0 2.0 Alex Fri 40 50 31
1 2.0 Alex Fri 60 50 31
2 2.0 Alex Mon 0 0 29
3 2.0 Alex Wed 0 0 21
4 1.0 John Wed 28 28 21
5 1.0 John Fri 30 30 31
6 1.0 John Mon 0 0 29
You can create df3 with all name and day combinations:
df3 = pd.DataFrame([[name, day] for name in df1.name.unique() for day in df2.day.unique()],
                   columns=['name', 'day'])
Then add id's from df1:
df3 = df3.merge(df1[['id', 'name']]).drop_duplicates()[['id', 'name', 'day']]
Then add marks and mean marks from df1:
df3 = df3.merge(df1, how='left')
Then merge:
z = df3.merge(df2, how='outer', on=['day']).fillna(0).sort_values('id')
Out:
id name day marks mean_marks we
0 1 John Mon 0.0 0.0 29
2 1 John Wed 28.0 28.0 21
4 1 John Fri 30.0 30.0 31
1 2 Alex Mon 0.0 0.0 29
3 2 Alex Wed 0.0 0.0 21
5 2 Alex Fri 40.0 50.0 31
6 2 Alex Fri 60.0 50.0 31
To have the result ordered by weekday (within each group by id), we should convert the day column in both DataFrames to a Categorical type.
I think it is better than your original concept, where you don't care about the day order.
To do it, run:
To do it, run:
wDays = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
day_type = pd.CategoricalDtype(categories=wDays, ordered=True)
df1.day = df1.day.astype(day_type)
df2.day = df2.day.astype(day_type)
Then define the following function, performing merge within a group
by id and filling NaN values (using either ffill or fillna):
def myMerge(grp):
    res = pd.merge(grp, df2, how='right', on=['day'])
    res[['id', 'name']] = res[['id', 'name']].ffill()
    res[['marks', 'mean_marks']] = res[['marks', 'mean_marks']].fillna(0)
    return res.sort_values('day')
Then group df1 by id and apply the above function to each group:
df1.groupby('id', sort=False).apply(myMerge).reset_index(drop=True)
The final step above is reset_index, to re-create an "ordinary" index.
I also added sort=False to keep your desired (original) order of groups.

Nested for loop python pandas not functioning as desired

Code to generate random database for question (minimum reproducible issue):
import numpy as np
import pandas as pd

df_random = pd.DataFrame(np.random.random((2000, 3)))
df_random['order_date'] = pd.date_range(start='1/1/2015', periods=len(df_random), freq='D')
df_random['customer_id'] = np.random.randint(1, 20, df_random.shape[0])
df_random
Output df_random
0 1 2 order_date customer_id
0 0.018473 0.970257 0.605428 2015-01-01 12
... ... ... ... ... ...
1999 0.800139 0.746605 0.551530 2020-06-22 11
Code to extract mean unique transactions month and year wise
for y in (2015,2019):
    for x in (1,13):
        df2 = df_random[(df_random['order_date'].dt.month == x) & (df_random['order_date'].dt.year == y)]
        df2.sort_values(['customer_id','order_date'], inplace=True)
        df2["days"] = df2.groupby("customer_id")["order_date"].apply(lambda x: (x - x.shift()) / np.timedelta64(1, "D"))
        df_mean = round(df2['days'].mean(), 2)
        data2 = data.append(pd.DataFrame({'Mean': df_mean, 'Month': x, 'Year': y}, index=[0]), ignore_index=True)
print(data2)
Expected output
Mean Month Year
0 5.00 1 2015
.......................
11 6.62 12 2015
..............Mean values of days after which one transaction occurs in order_date for years 2016 and 2017 Jan to Dec
36 6.03 1 2018
..........................
47 6.76 12 2018
48 8.40 1 2019
.......................
48 8.40 12 2019
Basically I want a single dataframe covering January 2015 through December 2019.
Instead of the expected output I am getting a dataframe from Jan 2015 to Dec 2018, then Jan 2015 data again, and then the entire dataset repeats from 2015 to 2018 many more times.
Please help
Try this:
data2 = pd.DataFrame([])
for y in range(2015, 2020):
    for x in range(1, 13):
        df2 = df_random[(df_random['order_date'].dt.month == x) & (df_random['order_date'].dt.year == y)]
        df_mean = df2.groupby("customer_id")["order_date"].apply(lambda x: (x - x.shift()) / np.timedelta64(1, "D")).mean().round(2)
        data2 = data2.append(pd.DataFrame({'Mean': df_mean, 'Month': x, 'Year': y}, index=[0]), ignore_index=True)
print(data2)
Try this:
df_random.order_date = pd.to_datetime(df_random.order_date)
df_random = df_random.set_index(pd.DatetimeIndex(df_random['order_date']))
output = df_random.groupby(pd.Grouper(freq="M"))[[0,1,2]].agg(np.mean).reset_index()
output['month'] = output.order_date.dt.month
output['year'] = output.order_date.dt.year
output = output.drop('order_date', axis=1)
output
Output
0 1 2 month year
0 0.494818 0.476514 0.496059 1 2015
1 0.451611 0.437638 0.536607 2 2015
2 0.476262 0.567519 0.528129 3 2015
3 0.519229 0.475887 0.612433 4 2015
4 0.464781 0.430593 0.445455 5 2015
... ... ... ... ... ...
61 0.416540 0.564928 0.444234 2 2020
62 0.553787 0.423576 0.422580 3 2020
63 0.524872 0.470346 0.560194 4 2020
64 0.530440 0.469957 0.566077 5 2020
65 0.584474 0.487195 0.557567 6 2020
Avoid any looping and simply include year and month in groupby calculation:
np.random.seed(1022020)
...
# ASSIGN MONTH AND YEAR COLUMNS, THEN SORT COLUMNS
df_random = (df_random.assign(month = lambda x: x['order_date'].dt.month,
                              year = lambda x: x['order_date'].dt.year)
                      .sort_values(['customer_id', 'order_date']))
# GROUP BY CALCULATION
df_random["days"] = (df_random.groupby(["customer_id", "year", "month"])["order_date"]
                              .apply(lambda x: (x - x.shift()) / np.timedelta64(1, "D")))
# FINAL MEAN AGGREGATION BY YEAR AND MONTH
final_df = (df_random.groupby(["year", "month"], as_index=False)["days"].mean().round(2)
                     .rename(columns={"days":"mean"}))
print(final_df.head())
# year month mean
# 0 2015 1 8.43
# 1 2015 2 5.87
# 2 2015 3 4.88
# 3 2015 4 10.43
# 4 2015 5 8.12
print(final_df.tail())
# year month mean
# 61 2020 2 8.27
# 62 2020 3 8.41
# 63 2020 4 8.81
# 64 2020 5 9.12
# 65 2020 6 7.00
For multiple aggregates, replace the single groupby.mean() with groupby.agg():
final_df = (df_random.groupby(["year", "month"], as_index=False)["days"]
                     .agg(['count', 'min', 'mean', 'median', 'max'])
                     .rename(columns={"days":"mean"}))
print(final_df.head())
# count min mean median max
# year month
# 2015 1 14 1.0 8.43 5.0 25.0
# 2 15 1.0 5.87 5.0 17.0
# 3 16 1.0 4.88 5.0 9.0
# 4 14 1.0 10.43 7.5 23.0
# 5 17 2.0 8.12 8.0 17.0
print(final_df.tail())
# count min mean median max
# year month
# 2020 2 15 1.0 8.27 6.0 21.0
# 3 17 1.0 8.41 7.0 16.0
# 4 16 1.0 8.81 7.0 20.0
# 5 16 1.0 9.12 7.0 22.0
# 6 7 2.0 7.00 7.0 17.0

Find a differencing between year for sales

I have data for analysing sales. I've made some progress, and this is the last part I did, which shows each store's total sales for each year (2016, 2017, 2018).
Store_Key Year count Total_Sales
0 5.0 2016 28 6150.0
1 5.0 2017 39 8350.0
2 5.0 2018 27 5150.0
3 7.0 2016 3664 105370.0
4 7.0 2017 3736 116334.0
5 7.0 2018 3863 99375.0
6 10.0 2016 3930 79904.0
7 10.0 2017 3981 91227.0
8 10.0 2018 4432 97226.0
9 11.0 2016 4084 91156.0
10 11.0 2017 4220 99565.0
11 11.0 2018 4735 113584.0
12 16.0 2016 4257 135655.0
13 16.0 2017 4422 144725.0
14 16.0 2018 4630 133820.0
I want to see each store's sales difference between years. So I used a pivot table to show each year, with a difference column.
Store_Key 2016 2017 2018
5.0 6150.0 8350.0 5150.0
7.0 105370.0 116334.0 99375.0
10.0 79904.0 91227.0 97226.0
11.0 91156.0 99565.0 113584.0
16.0 135655.0 144725.0 133820.0
18.0 237809.0 245645.0 88167.0
20.0 110225.0 131999.0 83302.0
24.0 94087.0 101062.0 108888.0
If the set of stores were constant, I could quickly find the difference between the year columns, but unfortunately many new stores open and shut down each year.
So my question is: is there any way to get the per-store difference while also showing new stores and closed stores?
I can find the stores with NULL values and separate them, but I would love to check if there are better options.
To get the difference between 2017 and 2016, you can do:
df['evolution'] = df['2017'] - df['2016']
If you would like to drop lines where there is at least one NaN value, you can remove those lines like this:
df = df.dropna(axis=0, how='any')
If you have 0 instead of NaN, you can first do:
import numpy as np
df = df.replace(0, np.nan)
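If you want all year-over-year differences at once while keeping new and closed stores visible, a sketch along these lines might help (the small pivot frame here is hypothetical, just to make it runnable):
import numpy as np
import pandas as pd

# Hypothetical mini version of the pivoted table above (Store_Key index, year columns)
pivot = pd.DataFrame({2016: [6150.0, np.nan, 94087.0],
                      2017: [8350.0, 4000.0, 101062.0],
                      2018: [np.nan, 5500.0, 108888.0]},
                     index=[5.0, 99.0, 24.0])
pivot.index.name = 'Store_Key'

# Year-over-year differences; NaN marks years where the store had no sales
diffs = pivot.diff(axis=1).iloc[:, 1:]
diffs.columns = [f'diff_{c}' for c in diffs.columns]

# Stores that appear or disappear between consecutive years
opened = (pivot.notna() & pivot.shift(axis=1).isna()).iloc[:, 1:]
closed = (pivot.isna() & pivot.shift(axis=1).notna()).iloc[:, 1:]
print(diffs)
print(opened)
print(closed)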
