I have a dataframe:
import pandas as pd
import numpy as np
ycap = [2015, 2016, 2017]
df = pd.DataFrame({'a': np.repeat(ycap, 5),
'b': np.random.randn(15)})
a b
0 2015 0.436967
1 2015 -0.539453
2 2015 -0.450282
3 2015 0.907723
4 2015 -2.279188
5 2016 1.468736
6 2016 -0.169522
7 2016 0.003501
8 2016 0.182321
9 2016 0.647310
10 2017 0.679443
11 2017 -0.154405
12 2017 -0.197271
13 2017 -0.153552
14 2017 0.518803
I would like to add a column c that looks like the following:
a b c
0 2015 -0.826946 2014
1 2015 0.275072 2013
2 2015 0.735353 2012
3 2015 1.391345 2011
4 2015 0.389524 2010
5 2016 -0.944750 2015
6 2016 -1.192546 2014
7 2016 -0.247521 2013
8 2016 0.521094 2012
9 2016 0.273950 2011
10 2017 -1.199278 2016
11 2017 0.839705 2015
12 2017 0.075951 2014
13 2017 0.663696 2013
14 2017 0.398995 2012
I tried the following, but it only subtracts 1 from every row; the amount subtracted needs to increase by 1 within each group. How could I do it? Thanks
gp = df.groupby('a')
df['c'] = gp['a'].apply(lambda x: x-1)
Subtract the Series created by cumcount from column a, then subtract 1:
df['c'] = df['a'] - df.groupby('a').cumcount() - 1
print (df)
a b c
0 2015 0.285832 2014
1 2015 -0.223318 2013
2 2015 0.620920 2012
3 2015 -0.891164 2011
4 2015 -0.719840 2010
5 2016 -0.106774 2015
6 2016 -1.230357 2014
7 2016 0.747803 2013
8 2016 -0.002320 2012
9 2016 0.062715 2011
10 2017 0.805035 2016
11 2017 -0.385647 2015
12 2017 -0.457458 2014
13 2017 -1.589365 2013
14 2017 0.013825 2012
Detail:
print (df.groupby('a').cumcount())
0 0
1 1
2 2
3 3
4 4
5 0
6 1
7 2
8 3
9 4
10 0
11 1
12 2
13 3
14 4
dtype: int64
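As a small sanity check, here is a sketch on a made-up frame where the groups are interleaved rather than contiguous. It suggests that cumcount numbers rows in order of appearance within each group, so the rows do not need to be sorted by a first. The sample values are an assumption for illustration only.

```python
import pandas as pd

# Hypothetical data: the 2015 and 2016 rows are deliberately interleaved
df = pd.DataFrame({'a': [2015, 2016, 2015, 2016, 2015]})

# cumcount gives 0, 1, 2, ... per group, in order of appearance
df['c'] = df['a'] - df.groupby('a').cumcount() - 1
print(df)
```

Each group still counts down from its own year minus one, regardless of where its rows sit in the frame.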
You can do it this way:
In [8]: df['c'] = df.groupby('a')['a'].transform(lambda x: x-np.arange(1, len(x)+1))
In [9]: df
Out[9]:
a b c
0 2015 0.436967 2014
1 2015 -0.539453 2013
2 2015 -0.450282 2012
3 2015 0.907723 2011
4 2015 -2.279188 2010
5 2016 1.468736 2015
6 2016 -0.169522 2014
7 2016 0.003501 2013
8 2016 0.182321 2012
9 2016 0.647310 2011
10 2017 0.679443 2016
11 2017 -0.154405 2015
12 2017 -0.197271 2014
13 2017 -0.153552 2013
14 2017 0.518803 2012
I have this function written in Python. I want the maximum value to be shown only once.
Here's the code
def show_data():
    df = pd.DataFrame(myresult, columns=['Year', 'Production (Ton)'])
    df['Max Prod'] = df['Production (Ton)'].max()
    print(df)
And of course the output is this
Year Production (Ton) Max Prod
0 2010 339491 366999
1 2011 366999 366999
2 2012 361986 366999
3 2013 329461 366999
4 2014 355464 366999
5 2015 344998 366999
6 2016 274317 366999
7 2017 200916 366999
8 2018 217246 366999
9 2019 119830 366999
10 2020 66640 366999
Since every row has the same value, I want the output to look like this:
Year Production (Ton) Max Prod
0 2010 339491 366999
1 2011 366999
2 2012 361986
3 2013 329461
4 2014 355464
5 2015 344998
6 2016 274317
7 2017 200916
8 2018 217246
9 2019 119830
10 2020 66640
What should I change or add to my code?
You can use shift to generate a mask that can be used to replace duplicate consecutive values:
df.loc[df['Max Prod'] == df['Max Prod'].shift(1), 'Max Prod'] = ''
Output:
>>> df
Year Production (Ton) Max Prod
0 2010 339491 366999
1 2011 366999
2 2012 361986
3 2013 329461
4 2014 355464
5 2015 344998
6 2016 274317
7 2017 200916
8 2018 217246
9 2019 119830
10 2020 66640
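One caveat: newer pandas versions may warn about, or refuse, writing a string into an integer column. A minimal sketch that sidesteps this by casting the column to object first, using the first three rows of the question's data (the name repeats is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Year': [2010, 2011, 2012],
                   'Production (Ton)': [339491, 366999, 361986]})
df['Max Prod'] = df['Production (Ton)'].max()

# True wherever a value repeats the one directly above it
repeats = df['Max Prod'].eq(df['Max Prod'].shift())

# Cast to object so the column can hold both numbers and ''
df['Max Prod'] = df['Max Prod'].astype(object)
df.loc[repeats, 'Max Prod'] = ''
print(df)
```

The mask is computed before the cast so the comparison runs on plain integers.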
You could also have the function as:
def show_data():
    df = pd.DataFrame(myresult, columns=['Year', 'Production (Ton)'])
    df['Max Prod'] = ''
    df.iloc[0, -1] = df['Production (Ton)'].max()
    print(df)
Given what you have now:
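def show_data():
    df = pd.DataFrame(myresult, columns=['Year', 'Production (Ton)'])
    df['Max Prod'] = df['Production (Ton)'].max()
    # Keep only the first occurrence; the other rows become NaN on assignment
    df['Max Prod'] = df['Max Prod'].drop_duplicates().astype(object)
    df = df.fillna('')
    print(df)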
Output:
Year Production (Ton) Max Prod
0 2010 339491 366999
1 2011 366999
2 2012 361986
3 2013 329461
4 2014 355464
5 2015 344998
6 2016 274317
7 2017 200916
8 2018 217246
9 2019 119830
10 2020 66640
I want to merge or join two DataFrames on different date columns: each Completed date should be joined with every Start date that is the same or earlier. I have the following dataframes:
df1:
Complted_date
2015
2017
2020
df2:
Start_date
2001
2010
2012
2015
2016
2017
2018
2019
2020
2021
And desired output is:
Complted_date Start_date
2015 2001
2015 2010
2015 2012
2015 2015
2017 2001
2017 2010
2017 2012
2017 2015
2017 2016
2017 2017
2020 2001
2020 2010
2020 2012
2020 2015
2020 2016
2020 2017
2020 2018
2020 2019
2020 2020
I've tried but I'm not getting the output I want.
Thank you for your help!!
Check out merge, which gives you the expected output:
(df1.assign(key=1)
    .merge(df2.assign(key=1), on='key')
    .query('Complted_date>=Start_date')
    .drop('key', axis=1)
)
Output:
Complted_date Start_date
0 2015 2001
1 2015 2010
2 2015 2012
3 2015 2015
10 2017 2001
11 2017 2010
12 2017 2012
13 2017 2015
14 2017 2016
15 2017 2017
20 2020 2001
21 2020 2010
22 2020 2012
23 2020 2015
24 2020 2016
25 2020 2017
26 2020 2018
27 2020 2019
28 2020 2020
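On pandas 1.2 or later, the helper key column can be skipped by passing how='cross' to merge. A minimal sketch, with the question's two frames rebuilt by hand:

```python
import pandas as pd

df1 = pd.DataFrame({'Complted_date': [2015, 2017, 2020]})
df2 = pd.DataFrame({'Start_date': [2001, 2010, 2012, 2015, 2016,
                                   2017, 2018, 2019, 2020, 2021]})

# Build every (Complted_date, Start_date) pair, then keep the valid ones
res = (df1.merge(df2, how='cross')
          .query('Complted_date >= Start_date')
          .reset_index(drop=True))
print(res)
```

This still materialises all 30 pairs before filtering, so it has the same memory cost as the key-column version.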
However, you might want to check out merge_asof, which pairs each Start_date with only the nearest Complted_date at or after it:
pd.merge_asof(df2, df1,
              right_on='Complted_date',
              left_on='Start_date',
              direction='forward')
Output:
Start_date Complted_date
0 2001 2015.0
1 2010 2015.0
2 2012 2015.0
3 2015 2015.0
4 2016 2017.0
5 2017 2017.0
6 2018 2020.0
7 2019 2020.0
8 2020 2020.0
9 2021 NaN
You can do a cross join and keep the records where Complted_date >= Start_date.
Use df.merge with df.query:
In [101]: df1['tmp'] = 1
In [102]: df2['tmp'] = 1
In [107]: res = df1.merge(df2, how='outer').query("Complted_date >= Start_date").drop(columns='tmp')
In [108]: res
Out[108]:
Complted_date Start_date
0 2015 2001
1 2015 2010
2 2015 2012
3 2015 2015
10 2017 2001
11 2017 2010
12 2017 2012
13 2017 2015
14 2017 2016
15 2017 2017
20 2020 2001
21 2020 2010
22 2020 2012
23 2020 2015
24 2020 2016
25 2020 2017
26 2020 2018
27 2020 2019
28 2020 2020
Here is another way, using pd.Series() and explode():
df1['Start_date'] = pd.Series([df2['Start_date'].tolist()])
df1['Start_date'] = df1['Start_date'].ffill()
df1.explode('Start_date').loc[lambda x: x['Complted_date'].ge(x['Start_date'])].reset_index(drop=True)
You could use conditional_join from pyjanitor to get the rows where Complted_date is >= Start_date:
# pip install pyjanitor
import pandas as pd
import janitor
df1.conditional_join(df2, ('Complted_date', 'Start_date', '>='))
Out[1163]:
left right
Complted_date Start_date
0 2015 2001
1 2015 2010
2 2015 2012
3 2015 2015
4 2017 2001
5 2017 2010
6 2017 2012
7 2017 2015
8 2017 2016
9 2017 2017
10 2020 2001
11 2020 2010
12 2020 2012
13 2020 2015
14 2020 2016
15 2020 2017
16 2020 2018
17 2020 2019
18 2020 2020
Under the hood, it is just binary search (searchsorted) - the aim is to avoid a cartesian join, and hopefully, reduce memory usage.
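To make the binary-search idea concrete, here is a rough NumPy sketch of how searchsorted can produce the same pairs without building the full cross product first. This only illustrates the idea and is not pyjanitor's actual implementation; the variable names are my own.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Complted_date': [2015, 2017, 2020]})
df2 = pd.DataFrame({'Start_date': [2001, 2010, 2012, 2015, 2016,
                                   2017, 2018, 2019, 2020, 2021]})

starts = np.sort(df2['Start_date'].to_numpy())
completed = df1['Complted_date'].to_numpy()

# Binary search: for each completed date, how many start dates are <= it
counts = np.searchsorted(starts, completed, side='right')

# Repeat each completed date once per matching start date
left = np.repeat(completed, counts)
right = np.concatenate([starts[:n] for n in counts])

res = pd.DataFrame({'Complted_date': left, 'Start_date': right})
print(res)
```

Only the matching pairs are ever allocated, which is where the memory saving comes from.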
I have two pandas dataframes:
df1=pd.DataFrame({'month':['jun', 'jul', 'aug'],'a':[3,4,5], 'b':[2,3,4], 'c':[4,5,5]}).set_index('month')
a b c
month
jun 3 2 4
jul 4 3 5
aug 5 4 5
and
df2=pd.DataFrame({'year':[2009,2009,2009, 2010,2010,2010,2011,2011,2011],'month':['jun', 'jul', 'aug','jun', 'jul', 'aug','jun', 'jul', 'aug'],'a':[2,2,2,2,2,2,2,2,2], 'b':[1,2,3,4,5,6,7,8,9], 'c':[3,3,3,3,3,3,3,3,3]}).set_index('year')
month a b c
year
2009 jun 2 1 3
2009 jul 2 2 3
2009 aug 2 3 3
2010 jun 2 4 3
2010 jul 2 5 3
2010 aug 2 6 3
2011 jun 2 7 3
2011 jul 2 8 3
2011 aug 2 9 3
I would like to multiply df2's elements with df1's according to the months. Is there quick way to do it?
Thanks in advance.
Use DataFrame.mul, matching on the months after appending them to the index as a MultiIndex level with DataFrame.set_index:
df = df2.set_index('month', append=True).mul(df1, level=1).reset_index(level=1)
print (df)
month a b c
year
2009 jun 6 2 12
2009 jul 8 6 15
2009 aug 10 12 15
2010 jun 6 8 12
2010 jul 8 15 15
2010 aug 10 24 15
2011 jun 6 14 12
2011 jul 8 24 15
2011 aug 10 36 15
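If the MultiIndex route feels opaque, the same product can be sketched by looking up df1's row for each month in df2 and multiplying the raw arrays. This assumes every month in df2 exists in df1's index; the names cols and out are just for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'month': ['jun', 'jul', 'aug'],
                    'a': [3, 4, 5], 'b': [2, 3, 4], 'c': [4, 5, 5]}).set_index('month')
df2 = pd.DataFrame({'year': [2009] * 3 + [2010] * 3 + [2011] * 3,
                    'month': ['jun', 'jul', 'aug'] * 3,
                    'a': [2] * 9,
                    'b': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                    'c': [3] * 9}).set_index('year')

cols = ['a', 'b', 'c']
out = df2.copy()

# One row of df1 factors per row of df2, aligned by position
factors = df1.loc[df2['month'].to_numpy(), cols].to_numpy()
out[cols] = df2[cols].to_numpy() * factors
print(out)
```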
I have the following dataframe:
Year Month Booked
0 2016 Aug 55999.0
6 2017 Aug 60862.0
1 2016 Jul 54062.0
7 2017 Jul 58417.0
2 2016 Jun 42044.0
8 2017 Jun 48767.0
3 2016 May 39676.0
9 2017 May 40986.0
4 2016 Oct 39593.0
10 2017 Oct 41439.0
5 2016 Sep 49677.0
11 2017 Sep 53969.0
I want to obtain the percentage change with respect to the same month from last year. I have tried the following code:
df['pct_ch'] = df.groupby(['Month','Year'])['Booked'].pct_change()
but I get the following, which is not at all what I want:
Year Month Booked pct_ch
0 2016 Aug 55999.0 NaN
6 2017 Aug 60862.0 0.086841
1 2016 Jul 54062.0 -0.111728
7 2017 Jul 58417.0 0.080556
2 2016 Jun 42044.0 -0.280278
8 2017 Jun 48767.0 0.159904
3 2016 May 39676.0 -0.186417
9 2017 May 40986.0 0.033017
4 2016 Oct 39593.0 -0.033987
10 2017 Oct 41439.0 0.046624
5 2016 Sep 49677.0 0.198798
11 2017 Sep 53969.0 0.086398
Do not group by Year; otherwise you won't get, for instance, Aug 2016 and Aug 2017 in the same group. Also, use transform to broadcast the results back to the original index.
Try:
df['pct_ch'] = df.groupby(['Month'])['Booked'].transform(lambda s: s.pct_change())
Year Month Booked pct_ch
0 2016 Aug 55999.0 NaN
6 2017 Aug 60862.0 0.086841
1 2016 Jul 54062.0 NaN
7 2017 Jul 58417.0 0.080556
2 2016 Jun 42044.0 NaN
8 2017 Jun 48767.0 0.159904
3 2016 May 39676.0 NaN
9 2017 May 40986.0 0.033017
4 2016 Oct 39593.0 NaN
10 2017 Oct 41439.0 0.046624
5 2016 Sep 49677.0 NaN
11 2017 Sep 53969.0 0.086398
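Note that pct_change compares each row with the previous row in its group, so this relies on the rows within each month being in year order. A minimal sketch that sorts first, using only the Aug and Jul rows from the question, deliberately placed out of order:

```python
import pandas as pd

df = pd.DataFrame({
    'Year':   [2017, 2016, 2017, 2016],
    'Month':  ['Aug', 'Aug', 'Jul', 'Jul'],
    'Booked': [60862.0, 55999.0, 58417.0, 54062.0],
})

# Sort so that, within each month, 2016 comes before 2017
df = df.sort_values(['Month', 'Year'])
df['pct_ch'] = df.groupby('Month')['Booked'].pct_change()
print(df)
```

Without the sort, the 2016 rows here would be compared against 2017 instead of the other way round.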
I am trying to create a new variable that holds the difference in SALES_AMOUNT between the same month of consecutive years in the following dataframe. I think my code should start from this groupby, but I don't know how to add the condition [df.Control - df.Control.shift(1) == 12] after the groupby so that the difference is only computed between consecutive years.
df['LY'] = df.groupby(['month']).SALES_AMOUNT.shift(1)
Dataframe:
SALES_AMOUNT Store Control year month
0 16793.14 A 3 2013 3
1 42901.61 A 5 2013 5
2 63059.72 A 6 2013 6
3 168471.43 A 10 2013 10
4 58570.72 A 11 2013 11
5 67526.71 A 12 2013 12
6 50649.07 A 14 2014 2
7 48819.97 A 18 2014 6
8 97100.77 A 19 2014 7
9 67778.40 A 21 2014 9
10 90327.52 A 22 2014 10
11 75703.12 A 23 2014 11
12 26098.50 A 24 2014 12
13 81429.36 A 25 2015 1
14 19539.85 A 26 2015 2
15 71727.66 A 27 2015 3
16 20117.79 A 28 2015 4
17 44252.19 A 29 2015 6
18 68578.82 A 30 2015 7
19 91483.39 A 31 2015 8
20 39220.87 A 32 2015 10
21 12224.11 A 33 2015 11
result should look like this:
SALES_AMOUNT Store Control year month year_diff
0 16793.14 A 3 2013 3 Nan
1 42901.61 A 5 2013 5 Nan
2 63059.72 A 6 2013 6 Nan
3 168471.43 A 10 2013 10 Nan
4 58570.72 A 11 2013 11 Nan
5 67526.71 A 12 2013 12 Nan
6 50649.07 A 14 2014 2 Nan
7 48819.97 A 18 2014 6 -14239.75
8 97100.77 A 19 2014 7 Nan
9 67778.40 A 21 2014 9 Nan
10 90327.52 A 22 2014 10 -78143.91
11 75703.12 A 23 2014 11 17132.4
12 26098.50 A 24 2014 12 -41428.21
13 81429.36 A 25 2015 1 Nan
14 19539.85 A 26 2015 2 -31109.22
15 71727.66 A 27 2015 3 Nan
16 20117.79 A 28 2015 4 Nan
17 44252.19 A 29 2015 6 -4567.78
18 68578.82 A 30 2015 7 -28521.95
19 91483.39 A 31 2015 8 Nan
20 39220.87 A 32 2015 10 -51106.65
21 12224.11 A 33 2015 11 -63479.01
I think what you're looking for is the below:
df = df.sort_values(by=['month', 'year'])
df['SALES_AMOUNT_shifted'] = df.groupby(['month'])['SALES_AMOUNT'].shift(1).tolist()
df['LY'] = df['SALES_AMOUNT'] - df['SALES_AMOUNT_shifted']
Once you sort by month and year, the month groups will be organized in a consistent way and then the shift makes sense.
-- UPDATE --
After applying the solution above, you can set LY to None wherever the year difference is not exactly 1.
df['year_diff'] = df['year'] - df.groupby(['month'])['year'].shift()
df['year_diff'] = df['year_diff'].fillna(0)
df.loc[df['year_diff'] != 1, 'LY'] = None
Using this I'm getting the desired output that you added.
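A different way to get the same effect, sketched here as an alternative rather than as the approach above, is to self-merge on the previous year: shift a copy of the frame forward by one year and left-join it on year and month. The four sample rows are taken from the question; the names prev and prev_amount are hypothetical, and with several stores you would presumably add Store to the join keys.

```python
import pandas as pd

df = pd.DataFrame({
    'SALES_AMOUNT': [63059.72, 48819.97, 16793.14, 71727.66],
    'year':  [2013, 2014, 2013, 2015],
    'month': [6, 6, 3, 3],
})

# Copy of the data relabelled as "next year", so it lines up with the
# row it should be subtracted from
prev = (df[['year', 'month', 'SALES_AMOUNT']]
        .assign(year=lambda d: d['year'] + 1)
        .rename(columns={'SALES_AMOUNT': 'prev_amount'}))

out = df.merge(prev, on=['year', 'month'], how='left')
out['year_diff'] = out['SALES_AMOUNT'] - out['prev_amount']
print(out)
```

Rows without a matching month exactly one year earlier (such as March 2015, whose only predecessor is March 2013) come out as NaN without any extra masking step.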
Does this work? I would also greatly appreciate a pandas-centric solution, as I spent some time on this and could not come up with one.
df = pd.read_clipboard().set_index('Control')
df['yoy_diff'] = np.nan
for i in df.index:
    for j in df.index:
        if j - i == 12:
            df.loc[j, 'yoy_diff'] = df.loc[j, 'SALES_AMOUNT'] - df.loc[i, 'SALES_AMOUNT']
df
Output:
SALES_AMOUNT Store year month yoy_diff
Control
3 16793.14 A 2013 3 NaN
5 42901.61 A 2013 5 NaN
6 63059.72 A 2013 6 NaN
10 168471.43 A 2013 10 NaN
11 58570.72 A 2013 11 NaN
12 67526.71 A 2013 12 NaN
14 50649.07 A 2014 2 NaN
18 48819.97 A 2014 6 -14239.75
19 97100.77 A 2014 7 NaN
21 67778.40 A 2014 9 NaN
22 90327.52 A 2014 10 -78143.91
23 75703.12 A 2014 11 17132.40
24 26098.50 A 2014 12 -41428.21
25 81429.36 A 2015 1 NaN
26 19539.85 A 2015 2 -31109.22
27 71727.66 A 2015 3 NaN
28 20117.79 A 2015 4 NaN
29 44252.19 A 2015 6 NaN
30 68578.82 A 2015 7 19758.85
31 91483.39 A 2015 8 -5617.38
32 39220.87 A 2015 10 NaN
33 12224.11 A 2015 11 -55554.29
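For a loop-free version of the same Control-based rule, one possible sketch is to look up the amount at Control - 12 with map. It assumes Control values are unique; the four rows are taken from the data above, and by_control is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'Control':      [6, 18, 19, 30],
    'SALES_AMOUNT': [63059.72, 48819.97, 97100.77, 68578.82],
})

# Lookup table: Control -> SALES_AMOUNT
by_control = df.set_index('Control')['SALES_AMOUNT']

# Amount 12 periods earlier, or NaN when that Control does not exist
df['yoy_diff'] = df['SALES_AMOUNT'] - (df['Control'] - 12).map(by_control)
print(df)
```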