Finding the difference in sales between years - Python

I have data that I am analysing for sales. I have made some progress, and this is the last part I did, which shows each store's total sales for each year (2016, 2017, 2018).
Store_Key Year count Total_Sales
0 5.0 2016 28 6150.0
1 5.0 2017 39 8350.0
2 5.0 2018 27 5150.0
3 7.0 2016 3664 105370.0
4 7.0 2017 3736 116334.0
5 7.0 2018 3863 99375.0
6 10.0 2016 3930 79904.0
7 10.0 2017 3981 91227.0
8 10.0 2018 4432 97226.0
9 11.0 2016 4084 91156.0
10 11.0 2017 4220 99565.0
11 11.0 2018 4735 113584.0
12 16.0 2016 4257 135655.0
13 16.0 2017 4422 144725.0
14 16.0 2018 4630 133820.0
I want to see each store's sales difference between years, so I used a pivot table to show each year as a column and add a difference column.
Store_Key 2016 2017 2018
5.0 6150.0 8350.0 5150.0
7.0 105370.0 116334.0 99375.0
10.0 79904.0 91227.0 97226.0
11.0 91156.0 99565.0 113584.0
16.0 135655.0 144725.0 133820.0
18.0 237809.0 245645.0 88167.0
20.0 110225.0 131999.0 83302.0
24.0 94087.0 101062.0 108888.0
If the set of stores were constant, I could quickly find the difference by subtracting the columns, but unfortunately many new stores open and many shut down each year.
So my question is: is there any way to get the difference per store while also showing which stores are new and which have closed?
I can find the stores with NULL values and separate them, but I would love to know if there are better options.

To get the difference between 2017 and 2016, you can do:
df['evolution'] = df['2017'] - df['2016']
If you would like to drop rows where there is at least one NaN value, you can remove them like this:
df = df.dropna(axis=0, how='any')
If you have 0 instead of NaN, you can do:
import numpy as np
df = df.replace(0, np.nan)
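Going one step further, the pivoted table can also answer the new/closed-store part of the question: diff(axis=1) gives the year-over-year change per store, and NaN checks flag stores that appear or disappear. A minimal sketch, assuming df is the grouped per-store, per-year frame from the question and that the pivot produces integer year columns (use '2017'-style strings if yours are text):
import pandas as pd

# pivot total sales into one column per year, as in the question
pivoted = df.pivot_table(index='Store_Key', columns='Year', values='Total_Sales')

# year-over-year change per store; NaN marks a year with no sales for that store
evolution = pivoted.diff(axis=1)

# a store present in 2018 but not 2017 is new; the reverse means it closed
pivoted['opened_2018'] = pivoted[2017].isna() & pivoted[2018].notna()
pivoted['closed_2018'] = pivoted[2017].notna() & pivoted[2018].isna()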


Pandas Method Chaining: getting KeyError on calculated column

I’m scraping web data to get US college football poll top 25 information that I store in a Pandas dataframe. The data has multiple years of poll information, with preseason and final polls for each year. Each poll ranks teams from 1 to 25. Team ranks are determined by the voting points each team received; the team with most points is ranked 1, etc. Both rank and points are included in the dataset. Here's the head of the raw data df:
cols = ['Year','Type', 'Team (FPV)', 'Rank', 'Pts']
all_wks_raw[cols].head()
The dataframe has columns for Rank and Pts (Points). The Rank column (dtype object) contains numeric ranks of 1-25 plus “RV” for teams that received points but did not rank in the top 25. The Pts column is dtype int64. Since Pts for teams that did not make the top 25 are included in the data, I’m able to re-rank the teams based on Pts and thus extend rankings beyond the top 25. The resulting revrank column ranks teams from 1 to between 37 and 61, depending on how many teams received points in that poll. Revrank is the first new column I create.
The revrank column should equal the Rank column for the first 25 teams, but before I can test it I need to create a new column that converts Rank to numeric. The result is rank_int, which is my second created column. Then I try to create a third column that calculates the difference between the two created columns, and this is where I get the KeyError. Here's the chain:
all_wks_clean = (all_wks_raw
    #create new column that converts Rank to numeric-this works
    .assign(rank_int = pd.to_numeric(all_wks_raw['Rank'], errors='coerce').fillna(0))
    #create new column that re-ranks teams based on Points: extends rankings beyond original 25-this works
    .assign(gprank = all_wks_raw.reset_index(drop=True).groupby(['Year','Type'])['Pts'].rank(ascending=0,method='min'))
    #create new column that takes the difference between gprank and rank_int columns created above-this fails with KeyError: 'gprank'
    .assign(ck_rank = all_wks_raw['gprank'] - all_wks_raw['rank_int'])
)
Are the results of the first two assignments not being passed to the third? Am I missing something in the syntax? Thanks for the help.
Edited 7/20/2022 to add complete code; note that this code scrapes data from the College Poll Archive web site:
dict = {1119: [2016, '2016 Final AP Football Poll', 'Final'], 1120: [2017, '2017 Preseason AP Football Poll', 'Preseason'],
        1135: [2017, '2017 Final AP Football Poll', 'Final'], 1136: [2018, '2018 Preseason AP Football Poll', 'Preseason'],
        1151: [2018, '2018 Final AP Football Poll', 'Final'], 1152: [2019, '2019 Preseason AP Football Poll', 'Preseason']}

#get one week of poll data from College Poll Archive ID parameter
def getdata(id):
    coldefs = {'ID':key, 'Year': value[0], 'Title': value[1], 'Type':value[2]} #define dictionary of scalar columns to add to dataframe
    urlseg = 'https://www.collegepollarchive.com/football/ap/seasons.cfm?appollid='
    url = urlseg + str(id)
    dfs = pd.read_html(url)
    df = dfs[0].assign(**coldefs)
    return df

all_wks_raw = pd.DataFrame()
for key, value in dict.items():
    print(key, value[0], value[2])
    onewk = getdata(key)
    all_wks_raw = all_wks_raw.append(onewk)
all_wks_clean = (all_wks_raw
    #create new column that converts Rank to numeric-this works
    .assign(rank_int = pd.to_numeric(all_wks_raw['Rank'], errors='coerce').fillna(0))
    #create new column that re-ranks teams based on Points: extends rankings beyond original 25-this works
    .assign(gprank = all_wks_raw.reset_index(drop=True).groupby(['Year','Type'])['Pts'].rank(ascending=0,method='min'))
    #create new column that takes the difference between gprank and rank_int columns created above-this fails with KeyError: 'gprank'
    .assign(ck_rank = all_wks_raw['gprank'] - all_wks_raw['rank_int'])
)
If accessing a column that doesn't yet exist, that must be done through a lambda:
dfs = pd.read_html('https://www.collegepollarchive.com/football/ap/seasons.cfm?seasonid=2019')
df = dfs[0][['Team (FPV)', 'Rank', 'Pts']].copy()
df['Year'] = 2016
df['Type'] = 'final'
df = df.assign(rank_int = pd.to_numeric(df['Rank'], errors='coerce').fillna(0).astype(int),
gprank = df.groupby(['Year','Type'])['Pts'].rank(ascending=0,method='min'),
ck_rank = lambda x: x['gprank'].sub(x['rank_int']))
print(df)
Output:
Team (FPV) Rank Pts Year Type rank_int gprank ck_rank
0 LSU (62) 1 1550 2016 final 1 1.0 0.0
1 Clemson 2 1487 2016 final 2 2.0 0.0
2 Ohio State 3 1426 2016 final 3 3.0 0.0
3 Georgia 4 1336 2016 final 4 4.0 0.0
4 Oregon 5 1249 2016 final 5 5.0 0.0
5 Florida 6 1211 2016 final 6 6.0 0.0
6 Oklahoma 7 1179 2016 final 7 7.0 0.0
7 Alabama 8 1159 2016 final 8 8.0 0.0
8 Penn State 9 1038 2016 final 9 9.0 0.0
9 Minnesota 10 952 2016 final 10 10.0 0.0
10 Wisconsin 11 883 2016 final 11 11.0 0.0
11 Notre Dame 12 879 2016 final 12 12.0 0.0
12 Baylor 13 827 2016 final 13 13.0 0.0
13 Auburn 14 726 2016 final 14 14.0 0.0
14 Iowa 15 699 2016 final 15 15.0 0.0
15 Utah 16 543 2016 final 16 16.0 0.0
16 Memphis 17 528 2016 final 17 17.0 0.0
17 Michigan 18 468 2016 final 18 18.0 0.0
18 Appalachian State 19 466 2016 final 19 19.0 0.0
19 Navy 20 415 2016 final 20 20.0 0.0
20 Cincinnati 21 343 2016 final 21 21.0 0.0
21 Air Force 22 209 2016 final 22 22.0 0.0
22 Boise State 23 188 2016 final 23 23.0 0.0
23 UCF 24 78 2016 final 24 24.0 0.0
24 Texas 25 69 2016 final 25 25.0 0.0
25 Texas A&M RV 54 2016 final 0 26.0 26.0
26 Florida Atlantic RV 46 2016 final 0 27.0 27.0
27 Washington RV 39 2016 final 0 28.0 28.0
28 Virginia RV 28 2016 final 0 29.0 29.0
29 USC RV 16 2016 final 0 30.0 30.0
30 San Diego State RV 13 2016 final 0 31.0 31.0
31 Arizona State RV 12 2016 final 0 32.0 32.0
32 SMU RV 10 2016 final 0 33.0 33.0
33 Tennessee RV 8 2016 final 0 34.0 34.0
34 California RV 6 2016 final 0 35.0 35.0
35 Kansas State RV 2 2016 final 0 36.0 36.0
36 Kentucky RV 2 2016 final 0 36.0 36.0
37 Louisiana RV 2 2016 final 0 36.0 36.0
38 Louisiana Tech RV 2 2016 final 0 36.0 36.0
39 North Dakota State RV 2 2016 final 0 36.0 36.0
40 Hawaii NR 0 2016 final 0 41.0 41.0
41 Louisville NR 0 2016 final 0 41.0 41.0
42 Oklahoma State NR 0 2016 final 0 41.0 41.0
Adding to BeRT2me's answer, when chaining, lambdas are pretty much always the way to go. When you use the original dataframe name, pandas looks at the dataframe as it was before the statement was executed. To avoid confusion, go with:
df = df.assign(rank_int = lambda x: pd.to_numeric(x['Rank'], errors='coerce').fillna(0).astype(int),
gprank = lambda x: x.groupby(['Year','Type'])['Pts'].rank(ascending=0,method='min'),
ck_rank = lambda x: x['gprank'].sub(x['rank_int']))
The x you define is the dataframe at that point in the chain.
This helps especially when your chains get longer, e.g. if you filter out some rows or aggregate, you get different results (or maybe an error) depending on what you're trying to do.
For example, if you were just looking at the relative rank of 3 teams:
df = pd.DataFrame({
'Team (FPV)': list('abcde'),
'Rank': list(range(5)),
'Pts': list(range(5)),
})
df['Year'] = 2016
df['Type'] = 'final'
df = (df
.loc[lambda x: x['Team (FPV)'].isin(["b", "c", "d"])]
.assign(bcd_rank = lambda x: x.groupby(['Year','Type'])['Pts'].rank(ascending=0,method='min'))
)
print(df)
gives:
Team (FPV) Rank Pts Year Type bcd_rank
1 b 1 1 2016 final 3.0
2 c 2 2 2016 final 2.0
3 d 3 3 2016 final 1.0
Whereas:
df = pd.DataFrame({
'Team (FPV)': list('abcde'),
'Rank': list(range(5)),
'Pts': list(range(5)),
})
df['Year'] = 2016
df['Type'] = 'final'
df = (df
.loc[lambda x: x['Team (FPV)'].isin(["b", "c", "d"])]
.assign(bcd_rank = df.groupby(['Year','Type'])['Pts'].rank(ascending=0,method='min'))
)
print(df)
gives a different ranking:
Team (FPV) Rank Pts Year Type bcd_rank
1 b 1 1 2016 final 4.0
2 c 2 2 2016 final 3.0
3 d 3 3 2016 final 2.0
If you want to go deeper, I'd recommend https://tomaugspurger.github.io/method-chaining.html to go on your reading list.
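Putting both answers together, the chain from the original question can be rewritten so that every derived column is accessed through a lambda. A sketch, assuming all_wks_raw was built as in the question (the reset_index clears the duplicate index left over from appending the per-poll frames):
all_wks_clean = (
    all_wks_raw
    .reset_index(drop=True)
    .assign(
        # convert Rank to numeric; 'RV' and similar become 0
        rank_int=lambda x: pd.to_numeric(x['Rank'], errors='coerce').fillna(0).astype(int),
        # re-rank by points within each poll, extending beyond the top 25
        gprank=lambda x: x.groupby(['Year', 'Type'])['Pts'].rank(ascending=False, method='min'),
        # difference between the two columns created above
        ck_rank=lambda x: x['gprank'] - x['rank_int'],
    )
)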

Python: Merge on 2 columns

I'm working with a large dataset. The following is an example, calculated with a smaller dataset.
In this example I have measurements of the pollution of 3 rivers over different timespans. Each year, the amount of pollution in a river is measured at a measuring station downstream ("pollution"). It has already been calculated in which year the river water was polluted upstream ("year_of_upstream_pollution"). My goal is to create a new column ["result_of_upstream_pollution"], which contains the amount of pollution associated with the "year_of_upstream_pollution". For this, the data from the "pollution" column has to be reassigned.
import numpy as np
import pandas as pd

ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3]
year = [2000,2001,2002,2003,2004,2005,1990,1991,1992,1993,1994,1995,2000,2001,2002,2003,2004,2005]
y1 = [2002,2002,2003,2005,2005,np.nan,1991,1992,1993,1994,np.nan,np.nan,2002,2002,2003,2004,2005,np.nan]
poll = [10,14,20,11,8,11,
        20,22,20,25,18,21,
        30,19,15,10,26,28]
dictr1 ={"river_id":ids,"year":year,"pollution": poll,"year_of_upstream_pollution":y1}
dfr1 = pd.DataFrame(dictr1)
print(dfr1)
river_id year pollution year_of_upstream_pollution
0 1 2000 10 2002.0
1 1 2001 14 2002.0
2 1 2002 20 2003.0
3 1 2003 11 2005.0
4 1 2004 8 2005.0
5 1 2005 11 NaN
6 2 1990 20 1991.0
7 2 1991 22 1992.0
8 2 1992 20 1993.0
9 2 1993 25 1994.0
10 2 1994 18 NaN
11 2 1995 21 NaN
12 3 2000 30 2002.0
13 3 2001 19 2002.0
14 3 2002 15 2003.0
15 3 2003 10 2004.0
16 3 2004 26 2005.0
17 3 2005 28 NaN
Example: river_id = 1, year = 2000, year_of_upstream_pollution = 2002
value of the pollution-column in year 2002 = 20
Therefore: result_of_upstream_pollution = 20
The resulting column should look like this:
result_of_upstream_pollution
0 20.0
1 20.0
2 11.0
3 11.0
4 11.0
5 NaN
6 22.0
7 20.0
8 25.0
9 18.0
10 NaN
11 NaN
12 15.0
13 15.0
14 10.0
15 26.0
16 28.0
17 NaN
My own approach:
### My approach
# Split dfr1 in two
dfr3 = pd.DataFrame(dfr1, columns = ["river_id","year","pollution"])
dfr4 = pd.DataFrame(dfr1, columns = ["river_id","year_of_upstream_pollution"])
# Merge the two dataframes on the "year" and "year_of_upstream_pollution"-column
arrayr= dfr4.merge(dfr3, left_on = "year_of_upstream_pollution", right_on = "year", how = "left").pollution.values
listr = arrayr.tolist()
dfr1["result_of_upstream_pollution"] = listr
print(dfr1)
len(listr) # = 28
This results in the following ValueError:
"Length of values does not match length of index"
My explanation for this is that the values in the "year" column of "dfr3" are not unique, which leads to several rows being matched for each year and explains why len(listr) = 28.
I haven't been able to find a way around this error yet. Please keep in mind that the real dataset is much larger than this one. Any help would be much appreciated!
As you said in the title, this is a merge on two columns:
dfr1['result_of_upstream_pollution'] = dfr1.merge(dfr1, left_on=['river_id','year'],
right_on=['river_id','year_of_upstream_pollution'],
how='right')['pollution_x']
print(dfr1[['result_of_upstream_pollution']])
Output:
result_of_upstream_pollution
0 20.0
1 20.0
2 11.0
3 11.0
4 11.0
5 NaN
6 22.0
7 20.0
8 25.0
9 18.0
10 NaN
11 NaN
12 15.0
13 15.0
14 10.0
15 26.0
16 28.0
17 NaN
I just realized that this solution doesn't seem to be working for me.
When I execute the code, this is what happens:
dfr1['result_of_upstream_pollution'] = dfr1.merge(dfr1, left_on=['river_id','year'],
right_on=['river_id','year_of_upstream_pollution'],
how='right')['pollution_x']
print(dfr1)
river_id year pollution year_of_upstream_pollution \
0 1 2000 10 2002.0
1 1 2001 14 2002.0
2 1 2002 20 2003.0
3 1 2003 11 2005.0
4 1 2004 8 2005.0
5 1 2005 11 NaN
6 2 1990 20 1991.0
7 2 1991 22 1992.0
8 2 1992 20 1993.0
9 2 1993 25 1994.0
10 2 1994 18 NaN
11 2 1995 21 NaN
12 3 2000 30 2002.0
13 3 2001 19 2002.0
14 3 2002 15 2003.0
15 3 2003 10 2004.0
16 3 2004 26 2005.0
17 3 2005 28 NaN
result_of_upstream_pollution
0 20.0
1 20.0
2 11.0
3 11.0
4 11.0
5 22.0
6 20.0
7 25.0
8 18.0
9 15.0
10 15.0
11 10.0
12 26.0
13 28.0
14 NaN
15 NaN
16 NaN
17 NaN
For some reason, this code doesn't seem to be handling the NaN values in the right way.
If there is a NaN value (in the column "year_of_upstream_pollution"), there shouldn't be a value in "result_of_upstream_pollution".
Equally, the ids 14, 15 and 16 all have values for "year_of_upstream_pollution" which have matching data in the "pollution" column and therefore should also have values in the result column.
On top of that, it seems that all values after the first NaN (at id = 5) are assigned the wrong values.
@Quang Hoang Thank you very much for trying to solve my problem! Could you maybe explain why my results differ from yours?
Does anyone know how I can get this code to work?
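The shifted values suggest the right merge returns its rows in a different order with a fresh index, so assigning the column back no longer lines up with dfr1's rows. One alignment-safe alternative (a sketch, not the answer above, using the dfr1 from the question before any result column was added) is to build a lookup of (river_id, year) -> pollution, rename its key to match the upstream-year column, and left-merge it back, so NaN upstream years simply stay NaN:
# lookup copy: its 'year' will be matched against 'year_of_upstream_pollution'
lookup = dfr1[['river_id', 'year', 'pollution']].rename(
    columns={'year': 'year_of_upstream_pollution',
             'pollution': 'result_of_upstream_pollution'})

# cast the key to float so it matches the NaN-bearing float column in dfr1
lookup = lookup.astype({'year_of_upstream_pollution': 'float64'})

# left merge keeps dfr1's row order; unmatched (NaN) keys stay NaN
dfr1 = dfr1.merge(lookup, on=['river_id', 'year_of_upstream_pollution'], how='left')
print(dfr1[['result_of_upstream_pollution']])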

How to append keys which are not in dataframe 1 but are in dataframe 2, against each name

I have a Dataframe df1 like this
id name day marks mean_marks
0 1 John Wed 28 28
1 1 John Fri 30 30
2 2 Alex Fri 40 50
3 2 Alex Fri 60 50
and another dataframe df2 as:
day we
0 Mon 29
1 Wed 21
2 Fri 31
Now when I do:
z = pd.merge(df1, df2, how='outer', on=['day']).fillna(0)
I got:
id name day marks mean_marks we
0 1.0 John Wed 28.0 28.0 21
1 1.0 John Fri 30.0 30.0 31
2 2.0 Alex Fri 40.0 50.0 31
3 2.0 Alex Fri 60.0 50.0 31
4 0.0 0 Mon 0.0 0.0 29
but I wanted something that would look like:
id name day marks mean_marks we
0 1.0 John Wed 28.0 28.0 21
1 1.0 John Mon 0.0 0.0 29
2 1.0 John Fri 30.0 30.0 31
3 2.0 Alex Mon 0.0 0.0 29
4 2.0 Alex Wed 0.0 0.0 21
5 2.0 Alex Fri 40.0 50.0 31
6 2.0 Alex Fri 60.0 50.0 31
That is, days which are not previously in df1 but are in df2 should be appended against each name.
Can someone please help me with this?
You might need a cross join to create all combinations of days per id and name, then a merge should work:
u = df1[['id','name']].drop_duplicates().assign(k=1).merge(df2.assign(k=1),on='k')
out = df1.merge(u.drop('k',1),on=['day','name','id'],how='outer').fillna(0)
print(out.sort_values(['id','name']))
id name day marks mean_marks we
0 1 John Wed 28.0 28.0 21
1 1 John Fri 30.0 30.0 31
4 1 John Mon 0.0 0.0 29
2 2 Alex Fri 40.0 50.0 31
3 2 Alex Fri 60.0 50.0 31
5 2 Alex Mon 0.0 0.0 29
6 2 Alex Wed 0.0 0.0 21
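On pandas 1.2 or newer, the dummy k=1 column is unnecessary because merge supports how='cross' directly; a sketch of the same idea:
# cross join of unique (id, name) pairs with every day in df2
u = df1[['id','name']].drop_duplicates().merge(df2, how='cross')
out = df1.merge(u, on=['day','name','id'], how='outer').fillna(0)
print(out.sort_values(['id','name']))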
The following code should do it:
z = df1.groupby(['name']).apply(lambda grp: grp.merge(df2, how='outer', on='day').
fillna({'name': grp.name, 'id': grp.id})).reset_index(drop=True).fillna(0)
It gives the following output:
id name day marks mean_marks we
0 2.0 Alex Fri 40 50 31
1 2.0 Alex Fri 60 50 31
2 2.0 Alex Mon 0 0 29
3 2.0 Alex Wed 0 0 21
4 1.0 John Wed 28 28 21
5 1.0 John Fri 30 30 31
6 1.0 John Mon 0 0 29
You can create df3 with all name and day combinations:
df3 = pd.DataFrame([[name, day] for name in df1.name.unique() for day in df2.day.unique()], columns=['name', 'day'])
Then add id's from df1:
df3 = df3.merge(df1[['id', 'name']]).drop_duplicates()[['id', 'name', 'day']]
Then add marks and mean marks from df1:
df3 = df3.merge(df1, how='left')
Then merge:
z = df3.merge(df2, how='outer', on=['day']).fillna(0).sort_values('id')
Out:
id name day marks mean_marks we
0 1 John Mon 0.0 0.0 29
2 1 John Wed 28.0 28.0 21
4 1 John Fri 30.0 30.0 31
1 2 Alex Mon 0.0 0.0 29
3 2 Alex Wed 0.0 0.0 21
5 2 Alex Fri 40.0 50.0 31
6 2 Alex Fri 60.0 50.0 31
To have the result ordered by weekdays (within each group by id), we should convert the day column in both DataFrames to a Categorical type.
I think it is better than your original concept, where you don't care about the order of the days.
To do it, run:
wDays = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
day_type = pd.CategoricalDtype(categories=wDays, ordered=True)
df1.day = df1.day.astype(day_type)
df2.day = df2.day.astype(day_type)
Then define the following function, performing merge within a group
by id and filling NaN values (using either ffill or fillna):
def myMerge(grp):
    res = pd.merge(grp, df2, how='right', on=['day'])
    res[['id', 'name']] = res[['id', 'name']].ffill()
    res[['marks', 'mean_marks']] = res[['marks', 'mean_marks']].fillna(0)
    return res.sort_values('day')
Then group df1 by id and apply the above function to each group:
df1.groupby('id', sort=False).apply(myMerge).reset_index(drop=True)
The final step above is reset_index, to re-create an "ordinary" index.
I also added sort=False to keep your desired (original) order of groups.

Looking to merge/concatenate/groupby different rows in Pandas dataframe

I will be iterating through a large list of dataframes of baseball statistics for different players. This data is indexed by year. What I am looking to do is group by year while keeping the salary the same and adding up WAR. Also, I am looking to drop rows that are not single years; in my data set these entries are strings.
to group
for x in clean_stats_list:
    x.groupby("Year")
to eliminate rows
for x in clean_stats_list:
    for i in x['Year']:
        if len(i) > 4:
            x['Year'][i].drop()
WAR Year Salary
0 1.4 2008 $390,000
1 0.9 2009 $418,000
2 2.4 2010 $445,000
3 3.6 2011 $3,400,000
4 5.2 2012 $5,400,000
5 1.3 2013 $7,400,000
6 6.8 2014 $10,000,000
7 3.8 2015 $10,000,000
9 0.2 2015 $10,000,000
11 5.5 2016 $15,833,333
12 2.0 2017 $21,833,333
13 1.3 2018 $21,833,333
14 34.3 11 Seasons $96,952,999
16 25.4 CIN (8 yrs) $37,453,000
17 8.8 SFG (3 yrs) $59,499,999
This is what I am expecting to achieve:
WAR Year Salary
0 1.4 2008 $390,000
1 0.9 2009 $418,000
2 2.4 2010 $445,000
3 3.6 2011 $3,400,000
4 5.2 2012 $5,400,000
5 1.3 2013 $7,400,000
6 6.8 2014 $10,000,000
7 4.0 2015 $10,000,000
11 5.5 2016 $15,833,333
12 2.0 2017 $21,833,333
13 1.3 2018 $21,833,333
To filter based on the length of the Year column, why don't you try creating a mask and then selecting rows based on it?
Code:
mask_df = your_df['Year'].str.len() == 4
your_df_cleaned = your_df.loc[mask_df]
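This relies on Year being stored as strings (as the question states); if the column ever mixes strings and numbers, casting to string first keeps the length check meaningful:
# cast to string so .str.len() behaves the same for every row
mask_df = your_df['Year'].astype(str).str.len() == 4
your_df_cleaned = your_df.loc[mask_df]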
You can use a regex to validate years, to avoid keeping length-4 values that are not actually years, with Series.str.contains and boolean indexing:
#https://stackoverflow.com/a/4374209
#validate between 1000-2999
df1 = df[df['Year'].str.contains('^[12][0-9]{3}$')]
#validate between 0000-9999
#df1 = df[df['Year'].str.contains('^\d{4}$')]
print (df1)
WAR Year Salary
0 1.4 2008 $390,000
1 0.9 2009 $418,000
2 2.4 2010 $445,000
3 3.6 2011 $3,400,000
4 5.2 2012 $5,400,000
5 1.3 2013 $7,400,000
6 6.8 2014 $10,000,000
7 3.8 2015 $10,000,000
9 0.2 2015 $10,000,000
11 5.5 2016 $15,833,333
12 2.0 2017 $21,833,333
13 1.3 2018 $21,833,333
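The filtering above handles the second half of the question; for the first half (summing WAR within a year while keeping a single Salary, e.g. the two 2015 rows collapsing to WAR 4.0), a groupby aggregation on the filtered frame could look like this sketch (it assumes Salary is identical within a year, as in the sample data):
# sum WAR per year, keep the first Salary, and restore the original column order
per_year = (df1.groupby('Year', as_index=False)
               .agg(WAR=('WAR', 'sum'), Salary=('Salary', 'first'))
               [['WAR', 'Year', 'Salary']])
print(per_year)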

Comparing daily value in each year in DataFrame to same day-number's value in another specific year

I have a daily time series of closing prices of a financial instrument going back to 1990.
I am trying to compare the daily percentage change for each trading day of the previous years to its respective trading day in 2019. I have 41 trading days of data for 2019 at this time.
I get so far as filtering down and creating a new DataFrame with only the first 41 dates, closing prices, daily percentage changes, and the "trading day of year" ("tdoy") classifier for each day in the set, but am not having luck from there.
I've found other Stack Overflow questions that help people compare datetime days, weeks, years, etc. but I am not able to recreate this because of the arbitrary value each "tdoy" represents.
I won't bother creating a sample DataFrame because of the number of rows so I've linked the CSV I've come up with to this point: Sample CSV.
I think the easiest approach would just be to create a new column that returns what the 2019 percentage change is for each corresponding "tdoy" (Trading Day of Year) using df.loc, and if I could figure this much out I could then create yet another column to take the simple difference between that year/day's percentage change and 2019's respective value. Below is what I tried (among other variations), to no avail.
df['2019'] = df['perc'].loc[((df.year == 2019) & (df.tdoy == df.tdoy))]
I've tried to search Stack and Google in probably 20 different variations of my problem and can't seem to find an answer that fits my issue of arbitrary "Trading Day of Year" classification.
I'm sure the answer is right in front of my face somewhere but I am still new to data wrangling.
The first step is to import the CSV properly. I'm not sure if you made the adjustment, but your data's date column is a string object.
# import the csv and assign to df. parse dates to datetime
df = pd.read_csv('TimeSeriesEx.csv', parse_dates=['Dates'])
# filter the dataframe so that you only have 2019 and 2018 data
df=df[df['year'] >= 2018]
df.tail()
Unnamed: 0 Dates last perc year tdoy
1225 7601 2019-02-20 29.96 0.007397 2019 37
1226 7602 2019-02-21 30.49 0.017690 2019 38
1227 7603 2019-02-22 30.51 0.000656 2019 39
1228 7604 2019-02-25 30.36 -0.004916 2019 40
1229 7605 2019-02-26 30.03 -0.010870 2019 41
Put the tdoy and year into a multiindex.
# create a multiindex
df.set_index(['tdoy','year'], inplace=True)
df.tail()
Dates last perc
tdoy year
37 2019 7601 2019-02-20 29.96 0.007397
38 2019 7602 2019-02-21 30.49 0.017690
39 2019 7603 2019-02-22 30.51 0.000656
40 2019 7604 2019-02-25 30.36 -0.004916
41 2019 7605 2019-02-26 30.03 -0.010870
Make pivot table
# make a pivot table and assign it to a variable
df1 = df.pivot_table(values='last', index='tdoy', columns='year')
df1.head()
year 2018 2019
tdoy
1 33.08 27.55
2 33.38 27.90
3 33.76 28.18
4 33.74 28.41
5 33.65 28.26
Create calculated column
# create the new column
df1['pct_change'] = (df1[2019]-df1[2018])/df1[2018]
df1
year 2018 2019 pct_change
tdoy
1 33.08 27.55 -0.167170
2 33.38 27.90 -0.164170
3 33.76 28.18 -0.165284
4 33.74 28.41 -0.157973
5 33.65 28.26 -0.160178
6 33.43 28.18 -0.157045
7 33.55 28.32 -0.155887
8 33.29 27.94 -0.160709
9 32.97 28.17 -0.145587
10 32.93 28.11 -0.146371
11 32.93 28.24 -0.142423
12 32.79 28.23 -0.139067
13 32.51 28.77 -0.115042
14 32.23 29.01 -0.099907
15 32.28 29.01 -0.101301
16 32.16 29.06 -0.096393
17 32.52 29.38 -0.096556
18 32.68 29.51 -0.097001
19 32.50 30.03 -0.076000
20 32.79 30.30 -0.075938
21 32.87 30.11 -0.083967
22 33.08 30.42 -0.080411
23 33.07 30.17 -0.087693
24 32.90 29.89 -0.091489
25 32.51 30.13 -0.073208
26 32.50 30.38 -0.065231
27 33.16 30.90 -0.068154
28 32.56 30.81 -0.053747
29 32.21 30.87 -0.041602
30 31.96 30.24 -0.053817
31 31.85 30.33 -0.047724
32 31.57 29.99 -0.050048
33 31.80 29.89 -0.060063
34 31.70 29.95 -0.055205
35 31.54 29.95 -0.050412
36 31.54 29.74 -0.057070
37 31.86 29.96 -0.059636
38 32.07 30.49 -0.049267
39 32.04 30.51 -0.047753
40 32.36 30.36 -0.061805
41 32.62 30.03 -0.079399
Altogether without comments and data, the codes looks like:
df = pd.read_csv('TimeSeriesEx.csv', parse_dates=['Dates'])
df=df[df['year'] >= 2018]
df.set_index(['tdoy','year'], inplace=True)
df1 = df.pivot_table(values='last', index='tdoy', columns='year')
df1['pct_change'] = (df1[2019]-df1[2018])/df1[2018]
[EDIT] The poster requested all dates compared to 2019.
df = pd.read_csv('TimeSeriesEx.csv', parse_dates=['Dates'])
df.set_index(['tdoy','year'], inplace=True)
Ignore the year filter above, and create the pivot table:
df1 = df.pivot_table(values='last', index='tdoy', columns='year')
Create a loop going through the years/columns and create a new field for each year comparing to 2019.
for y in df1.columns:
    df1[str(y) + '_pct_change'] = (df1[2019]-df1[y])/df1[y]
To view some data...
df1.loc[1:4, "1990_pct_change":"1994_pct_change"]
year 1990_pct_change 1991_pct_change 1992_pct_change 1993_pct_change 1994_pct_change
tdoy
1 0.494845 0.328351 0.489189 0.345872 -0.069257
2 0.496781 0.364971 0.516304 0.361640 -0.045828
3 0.523243 0.382050 0.527371 0.369956 -0.035262
4 0.524960 0.400888 0.531536 0.367838 -0.034659
Final code for all years:
df = pd.read_csv('TimeSeriesEx.csv', parse_dates=['Dates'])
df.set_index(['tdoy','year'], inplace=True)
df1 = df.pivot_table(values='last', index='tdoy', columns='year')
for y in df1.columns:
    df1[str(y) + '_pct_change'] = (df1[2019]-df1[y])/df1[y]
df1
I also came up with my own answer, more along the lines of what I was originally trying to accomplish. Here is the DataFrame I'll work with for the example, df:
Dates last perc year tdoy
0 2016-01-04 29.93 -0.020295 2016 2
1 2016-01-05 29.63 -0.010023 2016 3
2 2016-01-06 29.59 -0.001350 2016 4
3 2016-01-07 29.44 -0.005069 2016 5
4 2017-01-03 34.57 0.004358 2017 2
5 2017-01-04 34.98 0.011860 2017 3
6 2017-01-05 35.00 0.000572 2017 4
7 2017-01-06 34.77 -0.006571 2017 5
8 2018-01-02 33.38 0.009069 2018 2
9 2018-01-03 33.76 0.011384 2018 3
10 2018-01-04 33.74 -0.000592 2018 4
11 2018-01-05 33.65 -0.002667 2018 5
12 2019-01-02 27.90 0.012704 2019 2
13 2019-01-03 28.18 0.010036 2019 3
14 2019-01-04 28.41 0.008162 2019 4
15 2019-01-07 28.26 -0.005280 2019 5
I created a DataFrame with only the 2019 values for tdoy and perc
df19 = df[['tdoy','perc']].loc[df['year'] == 2019]
and then zipped a dictionary for those values
perc19 = dict(zip(df19.tdoy,df19.perc))
to end up with
perc19=
{2: 0.012704174228675058,
3: 0.010035842293906852,
4: 0.008161816891412365,
5: -0.005279831045406497}
Then map these keys with the tdoy column in the original DataFrame to create a column titled 2019 that has the corresponding 2019 percentage change value for that trading day
df['2019'] = df['tdoy'].map(perc19)
and then create a vs2019 column where I take the difference between 2019 and perc and square it, yielding:
Dates last perc year tdoy 2019 vs2019
0 2016-01-04 29.93 -0.020295 2016 2 0.012704 6.746876
1 2016-01-05 29.63 -0.010023 2016 3 0.010036 3.995038
2 2016-01-06 29.59 -0.001350 2016 4 0.008162 1.358162
3 2016-01-07 29.44 -0.005069 2016 5 -0.005280 0.001590
4 2017-01-03 34.57 0.004358 2017 2 0.012704 0.431608
5 2017-01-04 34.98 0.011860 2017 3 0.010036 0.033038
6 2017-01-05 35.00 0.000572 2017 4 0.008162 0.864802
7 2017-01-06 34.77 -0.006571 2017 5 -0.005280 0.059843
8 2018-01-02 33.38 0.009069 2018 2 0.012704 0.081880
9 2018-01-03 33.76 0.011384 2018 3 0.010036 0.018047
10 2018-01-04 33.74 -0.000592 2018 4 0.008162 1.150436
From here I can group by in various ways and do further calculations to find the most similarly trending percentage changes vs. the year I am comparing against (2019).
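The code for the vs2019 column isn't shown; judging from the printed values, it is the squared relative difference against the 2019 change, which could be reproduced with something like this (a reconstruction consistent with the output above, not necessarily the poster's exact code):
# squared relative difference of each day's change versus its 2019 counterpart
df['vs2019'] = ((df['2019'] - df['perc']) / df['2019']) ** 2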
