Groupby with condition - python

I am trying to create a new variable which performs the SALES_AMOUNT difference between years-month on the following dataframe. I think my code should be think with this groupby but i dont know how to add the condition [df2 df.Control - df.Control.shift(1) == 12] after the groupby so as to perform a correct difference between years
df['LY'] = df.groupby(['month']).SALES_AMOUNT.shift(1)
Dataframe:
SALES_AMOUNT Store Control year month
0 16793.14 A 3 2013 3
1 42901.61 A 5 2013 5
2 63059.72 A 6 2013 6
3 168471.43 A 10 2013 10
4 58570.72 A 11 2013 11
5 67526.71 A 12 2013 12
6 50649.07 A 14 2014 2
7 48819.97 A 18 2014 6
8 97100.77 A 19 2014 7
9 67778.40 A 21 2014 9
10 90327.52 A 22 2014 10
11 75703.12 A 23 2014 11
12 26098.50 A 24 2014 12
13 81429.36 A 25 2015 1
14 19539.85 A 26 2015 2
15 71727.66 A 27 2015 3
16 20117.79 A 28 2015 4
17 44252.19 A 29 2015 6
18 68578.82 A 30 2015 7
19 91483.39 A 31 2015 8
20 39220.87 A 32 2015 10
21 12224.11 A 33 2015 11
result should look like this:
SALES_AMOUNT Store Control year month year_diff
0 16793.14 A 3 2013 3 Nan
1 42901.61 A 5 2013 5 Nan
2 63059.72 A 6 2013 6 Nan
3 168471.43 A 10 2013 10 Nan
4 58570.72 A 11 2013 11 Nan
5 67526.71 A 12 2013 12 Nan
6 50649.07 A 14 2014 2 Nan
7 48819.97 A 18 2014 6 -14239.75
8 97100.77 A 19 2014 7 Nan
9 67778.40 A 21 2014 9 Nan
10 90327.52 A 22 2014 10 -78143.91
11 75703.12 A 23 2014 11 17132.4
12 26098.50 A 24 2014 12 -41428.21
13 81429.36 A 25 2015 1 Nan
14 19539.85 A 26 2015 2 -31109.22
15 71727.66 A 27 2015 3 Nan
16 20117.79 A 28 2015 4 Nan
17 44252.19 A 29 2015 6 -4567.78
18 68578.82 A 30 2015 7 -28521.95
19 91483.39 A 31 2015 8 Nan
20 39220.87 A 32 2015 10 -51106.65
21 12224.11 A 33 2015 11 -63479.01

I think what you're looking for is the below:
df = df.sort_values(by=['month', 'year'])
df['SALES_AMOUNT_shifted'] = df.groupby(['month'])['SALES_AMOUNT'].shift(1).tolist()
df['LY'] = df['SALES_AMOUNT'] - df['SALES_AMOUNT_shifted']
Once you sort by month and year, the month groups will be organized in a consistent way and then the shift makes sense.
-- UPDATE --
After applying the solution above, you could set to None all instances where the year difference is greater than 1.
df['year_diff'] = df['year'] - df.groupby(['month'])['year'].shift()
df['year_diff'] = df['year_diff'].fillna(0)
df.loc[df['year_diff'] != 1, 'LY'] = None
Using this I'm getting the desired output that you added.

Does this work? I would also greatly appreciate a pandas-centric solution, as I spent some time on this and could not come up with one.
df = pd.read_clipboard().set_index('Control')
df['yoy_diff'] = np.nan
for i in df.index:
for j in df.index:
if j - i == 12:
df['yoy_diff'].loc[j] = df.loc[j, 'SALES_AMOUNT'] - df.loc[i, 'SALES_AMOUNT']
df
Output:
SALES_AMOUNT Store year month yoy_diff
Control
3 16793.14 A 2013 3 NaN
5 42901.61 A 2013 5 NaN
6 63059.72 A 2013 6 NaN
10 168471.43 A 2013 10 NaN
11 58570.72 A 2013 11 NaN
12 67526.71 A 2013 12 NaN
14 50649.07 A 2014 2 NaN
18 48819.97 A 2014 6 -14239.75
19 97100.77 A 2014 7 NaN
21 67778.40 A 2014 9 NaN
22 90327.52 A 2014 10 -78143.91
23 75703.12 A 2014 11 17132.40
24 26098.50 A 2014 12 -41428.21
25 81429.36 A 2015 1 NaN
26 19539.85 A 2015 2 -31109.22
27 71727.66 A 2015 3 NaN
28 20117.79 A 2015 4 NaN
29 44252.19 A 2015 6 NaN
30 68578.82 A 2015 7 19758.85
31 91483.39 A 2015 8 -5617.38
32 39220.87 A 2015 10 NaN
33 12224.11 A 2015 11 -55554.29

Related

How to unpack a list of tuple in various length in a panda dataframe?

ID LIST_OF_TUPLE (2col)
1 [('2012','12'), ('2012','33'), ('2014', '82')]
2 NA
3 [('2012','12')]
4 [('2012','12'), ('2012','33'), ('2014', '82'), ('2022', '67')]
Result:
ID TUP_1 TUP_2(3col)
1 2012 12
1 2012 33
1 2014 82
3 2012 12
4 2012 12
4 2012 33
4 2014 82
4 2022 67
Thanks in advance.
This is explode then create a dataframe and then join:
s = df['LIST_OF_TUPLE'].explode()
out = (df[['ID']].join(pd.DataFrame(s.tolist(),index=s.index)
.add_prefix("TUP_")).reset_index(drop=True)) #you can chain a dropna if reqd
print(out)
ID TUP_0 TUP_1
0 1 2012 12
1 1 2012 33
2 1 2014 82
3 2 NaN None
4 3 2012 12
5 4 2012 12
6 4 2012 33
7 4 2014 82
8 4 2022 67

How to add a column with the growth rate in a budget table in Pandas?

I would like to know how can I add a growth rate year to year in the following data in Pandas.
Date Total Managed Expenditure
0 2001 503.2
1 2002 529.9
2 2003 559.8
3 2004 593.2
4 2005 629.5
5 2006 652.1
6 2007 664.3
7 2008 688.2
8 2009 732.0
9 2010 759.2
10 2011 769.2
11 2012 759.8
12 2013 760.6
13 2014 753.3
14 2015 757.6
15 2016 753.9
Use Series.pct_change():
df['Total Managed Expenditure'].pct_change()
Out:
0 NaN
1 0.053060
2 0.056426
3 0.059664
4 0.061194
5 0.035902
6 0.018709
7 0.035978
8 0.063644
9 0.037158
10 0.013172
11 -0.012220
12 0.001053
13 -0.009598
14 0.005708
15 -0.004884
Name: Total Managed Expenditure, dtype: float64
To assign it back:
df['Growth Rate'] = df['Total Managed Expenditure'].pct_change()

Rolling Mean in Pandas

I have this initial DataFrame in Pandas
A B C D E
0 23 2015 1 14937 16.25
1 23 2015 1 19054 7.50
2 23 2015 2 14937 16.75
3 23 2015 2 19054 17.25
4 23 2015 3 14937 71.75
5 23 2015 3 19054 15.00
6 23 2015 4 14937 13.00
7 23 2015 4 19054 37.75
8 23 2015 5 14937 4.25
9 23 2015 5 19054 18.25
10 23 2015 6 14937 16.50
11 23 2015 6 19054 1.00
If I want to obtain this result, how could I do it?
A B C D E
0 23 2015 1 14937 NaN
1 23 2015 2 14937 NaN
2 23 2015 2 14937 16.6
3 23 2015 1 14937 35.1
4 23 2015 2 14937 33.8
5 23 2015 3 14937 29.7
6 23 2015 4 14937 11.3
7 23 2015 4 19054 NaN
8 23 2015 5 19054 NaN
9 23 2015 5 19054 13.3
10 23 2015 6 19054 23.3
11 23 2015 6 19054 23.7
12 23 2015 6 19054 19.0
I tried a GroupBy but I dind't get it
DfMean = pd.DataFrame(DfGby.rolling(center=False,window=3)['E'].mean())
I think you can use groupby with rolling (need at least pandas 0.18.1):
s = df.groupby('D').rolling(3)['E'].mean()
print (s)
D
14937 0 NaN
2 NaN
4 34.916667
6 33.833333
8 29.666667
10 11.250000
19054 1 NaN
3 NaN
5 13.250000
7 23.333333
9 23.666667
11 19.000000
Name: E, dtype: float64
Then set_index by D with swaplevel for same order for matching output:
df = df.set_index('D', append=True).swaplevel(0,1)
df['E'] = s
Last reset_index and reorder columns:
df = df.reset_index(level=0).sort_values(['D','C'])
df = df[['A','B','C','D','E']]
print (df)
A B C D E
0 23 2015 1 14937 NaN
2 23 2015 2 14937 NaN
4 23 2015 3 14937 34.916667
6 23 2015 4 14937 33.833333
8 23 2015 5 14937 29.666667
10 23 2015 6 14937 11.250000
1 23 2015 1 19054 NaN
3 23 2015 2 19054 NaN
5 23 2015 3 19054 13.250000
7 23 2015 4 19054 23.333333
9 23 2015 5 19054 23.666667
11 23 2015 6 19054 19.000000

How can I get this series to a pandas dataframe?

I have some data and after using a groupby function I now have a series that looks like this:
year
1997 15
1998 22
1999 24
2000 24
2001 28
2002 11
2003 15
2004 19
2005 10
2006 10
2007 21
2008 26
2009 23
2010 16
2011 33
2012 19
2013 26
2014 25
How can I create a pandas dataframe from here with year as one column and the other column named sightings ?
I am a pandas novice so don't really know what I am doing. I have tried the reindex and unstack functions but haven't been able to get what I want...
You can use reset_index and rename columns:
print (df.reset_index())
index year
0 1997 15
1 1998 22
2 1999 24
3 2000 24
4 2001 28
5 2002 11
6 2003 15
7 2004 19
8 2005 10
9 2006 10
10 2007 21
11 2008 26
12 2009 23
13 2010 16
14 2011 33
15 2012 19
16 2013 26
17 2014 25
print (df.reset_index().rename(columns=({'index':'year','year':'sightings'})))
year sightings
0 1997 15
1 1998 22
2 1999 24
3 2000 24
4 2001 28
5 2002 11
6 2003 15
7 2004 19
8 2005 10
9 2006 10
10 2007 21
11 2008 26
12 2009 23
13 2010 16
14 2011 33
15 2012 19
16 2013 26
17 2014 25
Another solution is set column names by list of names:
df1 = df.reset_index()
df1.columns = ['year','sightings']
print (df1)
year sightings
0 1997 15
1 1998 22
2 1999 24
3 2000 24
4 2001 28
5 2002 11
6 2003 15
7 2004 19
8 2005 10
9 2006 10
10 2007 21
11 2008 26
12 2009 23
13 2010 16
14 2011 33
15 2012 19
16 2013 26
17 2014 25
EDIT:
Sometimes help add parameter as_index=False to groupby for returning DataFrame:
import pandas as pd
df = pd.DataFrame({'A':[1,1,3],
'B':[4,5,6]})
print (df)
A B
0 1 4
1 1 5
2 3 6
print (df.groupby('A')['B'].sum())
A
1 9
3 6
Name: B, dtype: int64
print (df.groupby('A', as_index=False)['B'].sum())
A B
0 1 9
1 3 6
s.rename('sightings').reset_index()
I've also used this method during the groupby stage to put the results straight into a dataframe:
df2 = df1.groupby(['Year']).count()
df3 = pd.DataFrame(df2).reset_index()
If your original dataframe - df1 - had "Year" and "Sightings" as it's two columns then df3 should have each year listed under "Year" and the count (or sum, average, whatever) listed under "Sightings".
If not, you can change the column names by doing the following:
df3.columns = ['Year','Sightings']
or
df3 = df3.rename(columns={'oldname_A': 'Year', 'oldname_B': 'Sightings'})

Mean value based on another column group

I have a dataframe (2000 rows, 5 columns):
year month day GroupBy_Day
0 2013 11 6 3
1 2013 11 7 10
2 2013 11 8 4
3 2013 11 9 4
4 2013 11 10 4
...
24 2013 12 1 5
25 2013 12 2 4
26 2013 12 3 5
27 2013 12 4 2
28 2013 12 5 7
29 2013 12 6 1
I already grouped my elements and got the count for each days (column GroupBy_Day). I need to get the mean count by day (e.g, for all days 6, we have a mean of (3+1)/2 = 2 occurence), and substract this value to GroupBy_Day in a new column.

Categories

Resources