Moving average by column / year - python, pandas

I need to build a moving average over column "total_medals" by country [noc] for all previous years - my data looks like:
medal      Bronze  Gold  Medal  Silver  total_medals
noc  year
ALG  1984     2.0   NaN    NaN     NaN           2.0
     1992     4.0   2.0    NaN     NaN           6.0
     1996     2.0   1.0    NaN     4.0           7.0
ANZ  1984     2.0  15.0    NaN     2.0          19.0
     1992     3.0   5.0    NaN     2.0          10.0
     1996     1.0   2.0    NaN     2.0           5.0
ARG  1984     2.0   6.0    NaN     3.0          11.0
     1992     5.0   3.0    NaN    24.0          32.0
     1996     3.0   7.0    NaN     5.0          15.0
I want to have a moving average per country and year (i.e. for ALG: 1984 Avg(total_medals) = 2.0; 1992 Avg(total_medals) = (2.0+6.0)/2 = 4.0; 1996 Avg(total_medals) = (2.0+6.0+7.0)/3 = 5.0) - the moving average should appear in a new column (next to total_medals).
Additionally, for each country & year combination, a new column called "performance" should hold the fraction of "total_medals" divided by the "moving average".

Sample dataframe:
print(df)
medal Bronze Gold Medal Silver total_medals
noc year
ALG 1984 2.0 NaN NaN NaN 2.0
1992 4.0 2.0 NaN NaN 6.0
1996 2.0 1.0 NaN 4.0 7.0
ANZ 1984 2.0 15.0 NaN 2.0 19.0
1992 3.0 5.0 NaN 2.0 10.0
1996 1.0 2.0 NaN 2.0 5.0
ARG 1984 2.0 6.0 NaN 3.0 11.0
1992 5.0 3.0 NaN 24.0 32.0
1996 3.0 7.0 NaN 5.0 15.0
Use DataFrame.groupby + expanding:
df['total_mean'] = df.groupby(level=0, sort=False)['total_medals'].apply(lambda x: x.expanding(1).mean())
print(df)
medal Bronze Gold Medal Silver total_medals total_mean
noc year
ALG 1984 2.0 NaN NaN NaN 2.0 2.000000
1992 4.0 2.0 NaN NaN 6.0 4.000000
1996 2.0 1.0 NaN 4.0 7.0 5.000000
ANZ 1984 2.0 15.0 NaN 2.0 19.0 19.000000
1992 3.0 5.0 NaN 2.0 10.0 14.500000
1996 1.0 2.0 NaN 2.0 5.0 11.333333
ARG 1984 2.0 6.0 NaN 3.0 11.0 11.000000
1992 5.0 3.0 NaN 24.0 32.0 21.500000
1996 3.0 7.0 NaN 5.0 15.0 19.333333
Bronze lagged:
s = df.groupby('noc').apply(lambda x: x['Bronze'] / x['total_medals'].shift())
s.index = s.index.droplevel()
df['bronze_lagged'] = s
You could create a function for this:
def lagged_medals(type_of_medal):
    s = df.groupby('noc').apply(lambda x: x[type_of_medal] / x['total_medals'].shift())
    s.index = s.index.droplevel()
    df[f'{type_of_medal}_lagged'] = s

lagged_medals('Silver')
#print(df)
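To make the accepted approach concrete, here is a self-contained sketch that also adds the "performance" column the question asks for (the expanding-mean code is the answer's; the ratio column and the reconstructed frame are my additions, following the question's definition):

```python
import pandas as pd

# Rebuild the question's sample frame (only 'total_medals' is needed here)
df = pd.DataFrame(
    {"total_medals": [2.0, 6.0, 7.0, 19.0, 10.0, 5.0, 11.0, 32.0, 15.0]},
    index=pd.MultiIndex.from_tuples(
        [("ALG", 1984), ("ALG", 1992), ("ALG", 1996),
         ("ANZ", 1984), ("ANZ", 1992), ("ANZ", 1996),
         ("ARG", 1984), ("ARG", 1992), ("ARG", 1996)],
        names=["noc", "year"],
    ),
)

# Expanding mean per country over all years up to and including the current one
df["total_mean"] = (
    df.groupby(level=0, sort=False)["total_medals"]
      .apply(lambda x: x.expanding(1).mean())
      .values
)

# "performance" = total_medals divided by its moving average
df["performance"] = df["total_medals"] / df["total_mean"]
print(df)
```

Assigning via `.values` sidesteps index-alignment differences across pandas versions, since the groupby preserves the original row order.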


How to use grouped rows in pandas

Hello, I have a table with a MultiIndex:
Lang C++ java python All
Corp Name
ASW ASW 0.0 0.0 5.0 5
Facebook Facebook 8.0 1.0 5.0 14
Google Google 2.0 24.0 1.0 27
ASW Cristiano NaN NaN 5.0 5
Facebook Cristiano NaN NaN 3.0 3
Michael NaN 1.0 2.0 3
Piter 8.0 NaN NaN 8
Google Cristiano NaN NaN 1.0 1
Michael NaN 24.0 NaN 24
Piter 2.0 NaN NaN 2
I am trying to use this code:
out = df.groupby(level=0).apply(lambda g: g.sort_values('All', ascending=False))
But it adds one more index level. How can I do this without adding an index?
I don't want to add and then delete indexes.
Thank you in advance!
Add the group_keys=False parameter in DataFrame.groupby:
out = (df.groupby(level=0, group_keys=False)
.apply(lambda g: g.sort_values('All', ascending=False)))
print (out)
C++ java python All
Corp Name
ASW ASW 0.0 0.0 5.0 5
Cristiano NaN NaN 5.0 5
Facebook Facebook 8.0 1.0 5.0 14
Piter 8.0 NaN NaN 8
Cristiano NaN NaN 3.0 3
Michael NaN 1.0 2.0 3
Google Google 2.0 24.0 1.0 27
Michael NaN 24.0 NaN 24
Piter 2.0 NaN NaN 2
Cristiano NaN NaN 1.0 1
A better/faster/simpler solution is to sort by the MultiIndex level and the column:
out = df.sort_values(['Corp','All'], ascending=[True, False])
print (out)
C++ java python All
Corp Name
ASW ASW 0.0 0.0 5.0 5
Cristiano NaN NaN 5.0 5
Facebook Facebook 8.0 1.0 5.0 14
Piter 8.0 NaN NaN 8
Cristiano NaN NaN 3.0 3
Michael NaN 1.0 2.0 3
Google Google 2.0 24.0 1.0 27
Michael NaN 24.0 NaN 24
Piter 2.0 NaN NaN 2
Cristiano NaN NaN 1.0 1
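Both variants can be checked on a reconstruction of the question's frame (the index tuples below are my guess at the original layout):

```python
import pandas as pd

# Reconstructed frame: (Corp, Name) MultiIndex with an 'All' column
df = pd.DataFrame(
    {"All": [5, 14, 27, 5, 3, 3, 8, 1, 24, 2]},
    index=pd.MultiIndex.from_tuples(
        [("ASW", "ASW"), ("Facebook", "Facebook"), ("Google", "Google"),
         ("ASW", "Cristiano"), ("Facebook", "Cristiano"),
         ("Facebook", "Michael"), ("Facebook", "Piter"),
         ("Google", "Cristiano"), ("Google", "Michael"), ("Google", "Piter")],
        names=["Corp", "Name"],
    ),
)

# group_keys=False keeps the original two-level index after apply
out = (df.groupby(level=0, group_keys=False)
         .apply(lambda g: g.sort_values("All", ascending=False)))

# Equivalent and faster: sort by the index level name plus the column
out2 = df.sort_values(["Corp", "All"], ascending=[True, False])
print(out)
```

sort_values has accepted index level names alongside column names since pandas 0.23, which is what makes the second form work.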

How to find the cumprod for a dataframe by reversing the row values?

I have dataframe df:
0 1 2 3 4 5 6
Row Labels
2017 A1 2.0 2.0 NaN 2.0 NaN 2.0 NaN
2017 A2 2.0 2.0 2.0 NaN 2.0 2.0 NaN
2017 A3 2.0 2.0 2.0 2.0 2.0 2.0 NaN
2017 A4 2.0 2.0 2.0 2.0 2.0 2.0 NaN
2018 A1 2.0 2.0 2.0 2.0 NaN NaN NaN
2019 A2 2.0 2.0 2.0 NaN NaN NaN NaN
2020 A3 2.0 2.0 NaN NaN NaN NaN NaN
2021 A4 2.0 NaN NaN NaN NaN NaN NaN
I have to find the cumprod of the dataframe by reversing the row values.
I tried this code:
df1 = df[::-1].cumprod(axis=1)[::-1]
I got output like this:
0 1 2 3 4 5 6
Row Labels
2017 A1 2.0 4.0 NaN 8.0 NaN 16.0 NaN
2017 A2 2.0 4.0 8.0 NaN 16.0 32.0 NaN
2017 A3 2.0 4.0 8.0 16.0 32.0 64.0 NaN
2017 A4 2.0 4.0 8.0 16.0 32.0 64.0 NaN
2018 A1 2.0 4.0 8.0 16.0 NaN NaN NaN
2019 A2 2.0 4.0 8.0 NaN NaN NaN NaN
2020 A3 2.0 4.0 NaN NaN NaN NaN NaN
2021 A4 2.0 NaN NaN NaN NaN NaN NaN
But the expected output is:
0 1 2 3 4 5 6
Row Labels
2017 A1 16.0 8.0 NaN 4.0 NaN 2.0 NaN
2017 A2 32.0 16.0 8.0 NaN 4.0 2.0 NaN
2017 A3 64.0 32.0 16.0 8.0 4.0 2.0 NaN
2017 A4 64.0 32.0 16.0 8.0 4.0 2.0 NaN
2018 A1 16.0 8.0 4.0 2.0 NaN NaN NaN
2019 A2 8.0 4.0 2.0 NaN NaN NaN NaN
2020 A3 4.0 2.0 NaN NaN NaN NaN NaN
2021 A4 2.0 NaN NaN NaN NaN NaN NaN
Thank You For Your Time :)
Use DataFrame.iloc with a first : to select all rows and ::-1 to reverse the column order:
df1 = df.iloc[:, ::-1].cumprod(axis=1).iloc[:, ::-1]
print (df1)
0 1 2 3 4 5 6
Row Labels
2017 A1 16.0 8.0 NaN 4.0 NaN 2.0 NaN
2017 A2 32.0 16.0 8.0 NaN 4.0 2.0 NaN
2017 A3 64.0 32.0 16.0 8.0 4.0 2.0 NaN
2017 A4 64.0 32.0 16.0 8.0 4.0 2.0 NaN
2018 A1 16.0 8.0 4.0 2.0 NaN NaN NaN
2019 A2 8.0 4.0 2.0 NaN NaN NaN NaN
2020 A3 4.0 2.0 NaN NaN NaN NaN NaN
2021 A4 2.0 NaN NaN NaN NaN NaN NaN
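The key point is that df[::-1] reverses the rows, which does not affect a cumprod running along axis=1; reversing the columns does. A tiny runnable check on a two-row toy frame (not the question's full data):

```python
import numpy as np
import pandas as pd

# Toy frame with NaN holes, like the question's rows
df = pd.DataFrame([[2.0, 2.0, np.nan, 2.0],
                   [2.0, 2.0, 2.0, np.nan]])

# Reverse the COLUMNS, take the cumulative product, reverse them back:
# this yields a right-to-left cumprod along each row, skipping NaNs.
df1 = df.iloc[:, ::-1].cumprod(axis=1).iloc[:, ::-1]
print(df1)
```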

How to filter pandas table based on multiple values from different columns? [duplicate]

This question already has answers here:
Select rows in pandas MultiIndex DataFrame
(5 answers)
Closed 3 years ago.
I have a pandas table in the following format [df], indexed by 'noc' and 'year'. How can I access a 'noc, year' combination and save the entry of 'total_medals' to a list?
medal Bronze Gold Medal Silver total_medals
noc year
ALG 1984 2.0 NaN NaN NaN 2.0 2.000000
1992 4.0 2.0 NaN NaN 6.0 4.000000
1996 2.0 1.0 NaN 4.0 7.0 5.000000
ANZ 1984 2.0 15.0 NaN 2.0 19.0 19.000000
1992 3.0 5.0 NaN 2.0 10.0 14.500000
1996 1.0 2.0 NaN 2.0 5.0 11.333333
ARG 1984 2.0 6.0 NaN 3.0 11.0 11.000000
1992 5.0 3.0 NaN 24.0 32.0 21.500000
1996 3.0 7.0 NaN 5.0 15.0 19.333333
For example: I want to access the 'total_medals' of ARG in 1992 (which is 21.5) and save this to a new list.
There is a MultiIndex in the index values, so you can select values by tuples in DataFrame.loc:
a = df.loc[('ARG',1992), 'total_medals']
print (a)
21.5
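A minimal runnable version of the tuple lookup, appending the scalar to a list as the question asks. The frame below holds only the raw ARG totals, so the lookup returns 32.0 here rather than the question's 21.5 (which came from the moving-average column):

```python
import pandas as pd

# Just the ARG rows, indexed by (noc, year)
df = pd.DataFrame(
    {"total_medals": [11.0, 32.0, 15.0]},
    index=pd.MultiIndex.from_tuples(
        [("ARG", 1984), ("ARG", 1992), ("ARG", 1996)],
        names=["noc", "year"],
    ),
)

# Select a single (noc, year) cell with a tuple and save it to a list
medals = []
medals.append(df.loc[("ARG", 1992), "total_medals"])
print(medals)  # [32.0]
```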

Why is pandas df.diff(2) different than df.diff().diff()?

According to Enders' Applied Econometric Time Series, the second difference of a variable y is defined as:
Δ²y_t = (y_t - y_{t-1}) - (y_{t-1} - y_{t-2}) = y_t - 2y_{t-1} + y_{t-2}
Pandas provides the diff function, which receives "periods" as an argument. Nevertheless, df.diff(2) gives a different result than df.diff().diff().
Code excerpt showing the above:
In [8]: df
Out[8]:
C.1 C.2 C.3 C.4 C.5 C.6
C.0
1990 16.0 6.0 256.0 216.0 65536 4352
1991 17.0 7.0 289.0 343.0 131072 5202
1992 6.0 -4.0 36.0 -64.0 64 252
1993 7.0 -3.0 49.0 -27.0 128 392
1994 8.0 -2.0 64.0 -8.0 256 576
1995 13.0 3.0 169.0 27.0 8192 2366
1996 10.0 0.5 100.0 0.5 1024 1100
1997 11.0 1.0 121.0 1.0 2048 1452
1998 4.0 -6.0 16.0 -216.0 16 80
1999 5.0 -5.0 25.0 -125.0 32 150
2000 18.0 8.0 324.0 512.0 262144 6156
2001 3.0 -7.0 9.0 -343.0 8 36
2002 0.5 -10.0 0.5 -1000.0 48 20
2003 1.0 -9.0 1.0 -729.0 2 2
2004 14.0 4.0 196.0 64.0 16384 2940
2005 15.0 5.0 225.0 125.0 32768 3600
2006 12.0 2.0 144.0 8.0 4096 1872
2007 9.0 -1.0 81.0 -1.0 512 810
2008 2.0 -8.0 4.0 -512.0 4 12
2009 19.0 9.0 361.0 729.0 524288 7220
In [9]: df.diff(2)
Out[9]:
C.1 C.2 C.3 C.4 C.5 C.6
C.0
1990 NaN NaN NaN NaN NaN NaN
1991 NaN NaN NaN NaN NaN NaN
1992 -10.0 -10.0 -220.0 -280.0 -65472.0 -4100.0
1993 -10.0 -10.0 -240.0 -370.0 -130944.0 -4810.0
1994 2.0 2.0 28.0 56.0 192.0 324.0
1995 6.0 6.0 120.0 54.0 8064.0 1974.0
1996 2.0 2.5 36.0 8.5 768.0 524.0
1997 -2.0 -2.0 -48.0 -26.0 -6144.0 -914.0
1998 -6.0 -6.5 -84.0 -216.5 -1008.0 -1020.0
1999 -6.0 -6.0 -96.0 -126.0 -2016.0 -1302.0
2000 14.0 14.0 308.0 728.0 262128.0 6076.0
2001 -2.0 -2.0 -16.0 -218.0 -24.0 -114.0
2002 -17.5 -18.0 -323.5 -1512.0 -262096.0 -6136.0
2003 -2.0 -2.0 -8.0 -386.0 -6.0 -34.0
2004 13.5 14.0 195.5 1064.0 16336.0 2920.0
2005 14.0 14.0 224.0 854.0 32766.0 3598.0
2006 -2.0 -2.0 -52.0 -56.0 -12288.0 -1068.0
2007 -6.0 -6.0 -144.0 -126.0 -32256.0 -2790.0
2008 -10.0 -10.0 -140.0 -520.0 -4092.0 -1860.0
2009 10.0 10.0 280.0 730.0 523776.0 6410.0
In [10]: df.diff().diff()
Out[10]:
C.1 C.2 C.3 C.4 C.5 C.6
C.0
1990 NaN NaN NaN NaN NaN NaN
1991 NaN NaN NaN NaN NaN NaN
1992 -12.0 -12.0 -286.0 -534.0 -196544.0 -5800.0
1993 12.0 12.0 266.0 444.0 131072.0 5090.0
1994 0.0 0.0 2.0 -18.0 64.0 44.0
1995 4.0 4.0 90.0 16.0 7808.0 1606.0
1996 -8.0 -7.5 -174.0 -61.5 -15104.0 -3056.0
1997 4.0 3.0 90.0 27.0 8192.0 1618.0
1998 -8.0 -7.5 -126.0 -217.5 -3056.0 -1724.0
1999 8.0 8.0 114.0 308.0 2048.0 1442.0
2000 12.0 12.0 290.0 546.0 262096.0 5936.0
2001 -28.0 -28.0 -614.0 -1492.0 -524248.0 -12126.0
2002 12.5 12.0 306.5 198.0 262176.0 6104.0
2003 3.0 4.0 9.0 928.0 -86.0 -2.0
2004 12.5 12.0 194.5 522.0 16428.0 2956.0
2005 -12.0 -12.0 -166.0 -732.0 2.0 -2278.0
2006 -4.0 -4.0 -110.0 -178.0 -45056.0 -2388.0
2007 0.0 0.0 18.0 108.0 25088.0 666.0
2008 -4.0 -4.0 -14.0 -502.0 3076.0 264.0
2009 24.0 24.0 434.0 1752.0 524792.0 8006.0
In [11]: df.diff(2) - df.diff().diff()
Out[11]:
C.1 C.2 C.3 C.4 C.5 C.6
C.0
1990 NaN NaN NaN NaN NaN NaN
1991 NaN NaN NaN NaN NaN NaN
1992 2.0 2.0 66.0 254.0 131072.0 1700.0
1993 -22.0 -22.0 -506.0 -814.0 -262016.0 -9900.0
1994 2.0 2.0 26.0 74.0 128.0 280.0
1995 2.0 2.0 30.0 38.0 256.0 368.0
1996 10.0 10.0 210.0 70.0 15872.0 3580.0
1997 -6.0 -5.0 -138.0 -53.0 -14336.0 -2532.0
1998 2.0 1.0 42.0 1.0 2048.0 704.0
1999 -14.0 -14.0 -210.0 -434.0 -4064.0 -2744.0
2000 2.0 2.0 18.0 182.0 32.0 140.0
2001 26.0 26.0 598.0 1274.0 524224.0 12012.0
2002 -30.0 -30.0 -630.0 -1710.0 -524272.0 -12240.0
2003 -5.0 -6.0 -17.0 -1314.0 80.0 -32.0
2004 1.0 2.0 1.0 542.0 -92.0 -36.0
2005 26.0 26.0 390.0 1586.0 32764.0 5876.0
2006 2.0 2.0 58.0 122.0 32768.0 1320.0
2007 -6.0 -6.0 -162.0 -234.0 -57344.0 -3456.0
2008 -6.0 -6.0 -126.0 -18.0 -7168.0 -2124.0
2009 -14.0 -14.0 -154.0 -1022.0 -1016.0 -1596.0
Why are they different? Which one corresponds to the one defined in Enders' book?
This is precisely because
Δ²y_t = y_t - 2y_{t-1} + y_{t-2} ≠ y_t - y_{t-2}.
The left-hand side is df.diff().diff(), whereas the right-hand side is df.diff(2). For the second difference, you want the left-hand side.
Consider a series with values:
df
a
b
c
d
df.diff() is
NaN
b - a
c - b
d - c
df.diff(2) is
NaN
NaN
c - a
d - b
df.diff().diff() is
NaN
NaN
(c - b) - (b - a) = c - 2b + a
(d - c) - (c - b) = d - 2c + b
They're not the same, mathematically.
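The algebra above is easy to verify numerically. With y_t = t², the second difference is the constant 2 while the lag-2 difference keeps growing:

```python
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0, 16.0, 25.0])  # y_t = t**2

d2 = s.diff(2)        # lag-2 difference: y_t - y_{t-2}
dd = s.diff().diff()  # second difference: y_t - 2*y_{t-1} + y_{t-2}
print(d2.tolist())  # [nan, nan, 8.0, 12.0, 16.0]
print(dd.tolist())  # [nan, nan, 2.0, 2.0, 2.0]
```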

How to access prior rows within a MultiIndex pandas dataframe

How do I reach inside a datetime-indexed multilevel DataFrame such as the following? (This is downloaded financial data.)
The tough part is getting inside the frame and accessing non-adjacent rows of a particular inner level, without explicitly specifying the outer-level date, since I have thousands of such rows.
ABC DEF GHI \
Date STATS
2012-07-19 00:00:00 NaN NaN NaN
investment 4 9 13
price 5 8 1
quantity 12 9 8
So the two formulas I am searching for could be summarized as
X(today's row) = quantity(prior row) * price(prior row)
or
X(today's row) = quantity(prior row) * price(today)
The difficulty is how to formulate the access to those rows using numpy or pandas with a multilevel index, given that the rows are not adjacent.
In the end I would end up with this:
ABC DEF GHI XN
Date STATS
2012-07-19 00:00:00 NaN NaN NaN
investment 4 9 13 X1
price 5 8 1
quantity 12 9 8
2012-07-18 00:00:00 NaN NaN NaN
investment 1 2 3 X2
price 2 3 4
quantity 18 6 7
X1= (18*2)+(6*3)+(7*4) (quantity_day_2 *price_day_2 data)
or for the other formula
X1= (18*5)+(6*8)+(7*1) (quantity_day_2 *price_day_1 data)
Could I use a groupby?
If you need to add the output to the original DataFrame, it is more complicated:
print (df)
ABC DEF GHI
Date STATS
2012-07-19 NaN NaN NaN
investment 4.0 9.0 13.0
price 5.0 8.0 1.0
quantity 12.0 9.0 8.0
2012-07-18 NaN NaN NaN
investment 1.0 2.0 3.0
price 2.0 3.0 4.0
quantity 18.0 6.0 7.0
2012-07-17 NaN NaN NaN
investment 1.0 2.0 3.0
price 0.0 1.0 4.0
quantity 5.0 1.0 0.0
df.sort_index(inplace=True)
#rename the level value to 'investment' to align data in the final concat
idx = pd.IndexSlice
p = df.loc[idx[:,'price'],:].rename(index={'price':'investment'})
q = df.loc[idx[:,'quantity'],:].rename(index={'quantity':'investment'})
print (p)
ABC DEF GHI
Date STATS
2012-07-17 investment 0.0 1.0 4.0
2012-07-18 investment 2.0 3.0 4.0
2012-07-19 investment 5.0 8.0 1.0
print (q)
ABC DEF GHI
Date STATS
2012-07-17 investment 5.0 1.0 0.0
2012-07-18 investment 18.0 6.0 7.0
2012-07-19 investment 12.0 9.0 8.0
#multiply and concat to the original df
print (p * q)
ABC DEF GHI
Date STATS
2012-07-17 investment 0.0 1.0 0.0
2012-07-18 investment 36.0 18.0 28.0
2012-07-19 investment 60.0 72.0 8.0
a = (p * q).sum(axis=1).rename('col1')
print (pd.concat([df, a], axis=1))
ABC DEF GHI col1
Date STATS
2012-07-17 NaN NaN NaN NaN
investment 1.0 2.0 3.0 1.0
price 0.0 1.0 4.0 NaN
quantity 5.0 1.0 0.0 NaN
2012-07-18 NaN NaN NaN NaN
investment 1.0 2.0 3.0 82.0
price 2.0 3.0 4.0 NaN
quantity 18.0 6.0 7.0 NaN
2012-07-19 NaN NaN NaN NaN
investment 4.0 9.0 13.0 140.0
price 5.0 8.0 1.0 NaN
quantity 12.0 9.0 8.0 NaN
#shift with a MultiIndex is not supported yet - first create a DatetimeIndex with unstack,
#then shift, and last reshape back to the original with stack
#multiply and concat to the original df
print (p.unstack().shift(-1, freq='D').stack() * q)
ABC DEF GHI
Date STATS
2012-07-16 investment NaN NaN NaN
2012-07-17 investment 10.0 3.0 0.0
2012-07-18 investment 90.0 48.0 7.0
2012-07-19 investment NaN NaN NaN
b = (p.unstack().shift(-1, freq='D').stack() * q).sum(axis=1).rename('col2')
print (pd.concat([df, b], axis=1))
ABC DEF GHI col2
Date STATS
2012-07-16 investment NaN NaN NaN 0.0
2012-07-17 NaN NaN NaN NaN
investment 1.0 2.0 3.0 13.0
price 0.0 1.0 4.0 NaN
quantity 5.0 1.0 0.0 NaN
2012-07-18 NaN NaN NaN NaN
investment 1.0 2.0 3.0 145.0
price 2.0 3.0 4.0 NaN
quantity 18.0 6.0 7.0 NaN
2012-07-19 NaN NaN NaN NaN
investment 4.0 9.0 13.0 0.0
price 5.0 8.0 1.0 NaN
quantity 12.0 9.0 8.0 NaN
You can use:
#add new datetime with data for better testing
print (df)
ABC DEF GHI
Date STATS
2012-07-19 NaN NaN NaN
investment 4.0 9.0 13.0
price 5.0 8.0 1.0
quantity 12.0 9.0 8.0
2012-07-18 NaN NaN NaN
investment 1.0 2.0 3.0
price 2.0 3.0 4.0
quantity 18.0 6.0 7.0
2012-07-17 NaN NaN NaN
investment 1.0 2.0 3.0
price 0.0 1.0 4.0
quantity 5.0 1.0 0.0
#lexsorted MultiIndex
df.sort_index(inplace=True)
#select the data and remove the last level, because:
#1. we need to shift
#2. it is easier to work with
idx = pd.IndexSlice
p = df.loc[idx[:,'price'],:]
p.index = p.index.droplevel(-1)
q = df.loc[idx[:,'quantity'],:]
q.index = q.index.droplevel(-1)
print (p)
ABC DEF GHI
Date
2012-07-17 0.0 1.0 4.0
2012-07-18 2.0 3.0 4.0
2012-07-19 5.0 8.0 1.0
print (q)
ABC DEF GHI
Date
2012-07-17 5.0 1.0 0.0
2012-07-18 18.0 6.0 7.0
2012-07-19 12.0 9.0 8.0
print (p * q)
ABC DEF GHI
Date
2012-07-17 0.0 1.0 0.0
2012-07-18 36.0 18.0 28.0
2012-07-19 60.0 72.0 8.0
print ((p * q).sum(axis=1).to_frame().rename(columns={0:'col1'}))
col1
Date
2012-07-17 1.0
2012-07-18 82.0
2012-07-19 140.0
#shift rows by -1, because the df is lexsorted
print (p.shift(-1, freq='D') * q)
ABC DEF GHI
Date
2012-07-16 NaN NaN NaN
2012-07-17 10.0 3.0 0.0
2012-07-18 90.0 48.0 7.0
2012-07-19 NaN NaN NaN
print ((p.shift(-1, freq='D') * q).sum(axis=1).to_frame().rename(columns={0:'col2'}))
col2
Date
2012-07-16 0.0
2012-07-17 13.0
2012-07-18 145.0
2012-07-19 0.0
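A condensed, runnable version of the second answer, reconstructing the sample data and using xs (equivalent to the answer's IndexSlice selection plus droplevel):

```python
import pandas as pd

# Rebuild the sample frame: (Date, STATS) MultiIndex, already lexsorted
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(["2012-07-17", "2012-07-18", "2012-07-19"]),
     ["investment", "price", "quantity"]],
    names=["Date", "STATS"],
)
df = pd.DataFrame(
    {"ABC": [1, 0, 5, 1, 2, 18, 4, 5, 12],
     "DEF": [2, 1, 1, 2, 3, 6, 9, 8, 9],
     "GHI": [3, 4, 0, 3, 4, 7, 13, 1, 8]},
    index=idx, dtype=float,
)

# Cross-sections drop the STATS level, leaving a plain DatetimeIndex
p = df.xs("price", level="STATS")
q = df.xs("quantity", level="STATS")

col1 = (p * q).sum(axis=1)                      # same-day price * quantity
col2 = (p.shift(-1, freq="D") * q).sum(axis=1)  # prior row's (next day's) price
print(col1.tolist())
print(col2.tolist())
```

Shifting with freq="D" moves the index labels rather than the data, so the multiplication aligns each day's quantity with the adjacent day's price, exactly as in the answer's output.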
