I have the following dataframe:
Year Month Booked
0 2016 Aug 55999.0
6 2017 Aug 60862.0
1 2016 Jul 54062.0
7 2017 Jul 58417.0
2 2016 Jun 42044.0
8 2017 Jun 48767.0
3 2016 May 39676.0
9 2017 May 40986.0
4 2016 Oct 39593.0
10 2017 Oct 41439.0
5 2016 Sep 49677.0
11 2017 Sep 53969.0
I want to obtain the percentage change with respect to the same month from last year. I have tried the following code:
df['pct_ch'] = df.groupby(['Month','Year'])['Booked'].pct_change()
but I get the following, which is not at all what I want:
Year Month Booked pct_ch
0 2016 Aug 55999.0 NaN
6 2017 Aug 60862.0 0.086841
1 2016 Jul 54062.0 -0.111728
7 2017 Jul 58417.0 0.080556
2 2016 Jun 42044.0 -0.280278
8 2017 Jun 48767.0 0.159904
3 2016 May 39676.0 -0.186417
9 2017 May 40986.0 0.033017
4 2016 Oct 39593.0 -0.033987
10 2017 Oct 41439.0 0.046624
5 2016 Sep 49677.0 0.198798
11 2017 Sep 53969.0 0.086398
Do not group by Year, otherwise you won't get, for instance, Aug 2017 and Aug 2016 into the same group. Also, use transform to broadcast the results back to the original indices.
Try:
df['pct_ch'] = df.groupby(['Month'])['Booked'].transform(lambda s: s.pct_change())
Year Month Booked pct_ch
0 2016 Aug 55999.0 NaN
6 2017 Aug 60862.0 0.086841
1 2016 Jul 54062.0 NaN
7 2017 Jul 58417.0 0.080556
2 2016 Jun 42044.0 NaN
8 2017 Jun 48767.0 0.159904
3 2016 May 39676.0 NaN
9 2017 May 40986.0 0.033017
4 2016 Oct 39593.0 NaN
10 2017 Oct 41439.0 0.046624
5 2016 Sep 49677.0 NaN
11 2017 Sep 53969.0 0.086398
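One caveat: pct_change compares consecutive rows within each group, so this relies on the rows already being ordered by Year inside each Month (as they happen to be here). If that is not guaranteed, a safer sketch is to sort first, shown on a small illustrative frame:

import pandas as pd

df = pd.DataFrame({
    'Year':   [2017, 2016, 2017, 2016],
    'Month':  ['Aug', 'Aug', 'Jul', 'Jul'],
    'Booked': [60862.0, 55999.0, 58417.0, 54062.0],
})

# Order by Year inside each Month so pct_change always compares a month
# to the same month of the previous year, regardless of the input order.
df = df.sort_values(['Month', 'Year'])
df['pct_ch'] = df.groupby('Month')['Booked'].transform(lambda s: s.pct_change())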
I want to merge or join two DataFrames based on different dates: join each Completed date with any earlier (or equal) Start date. I have the following dataframes:
df1:
Complted_date
2015
2017
2020
df2:
Start_date
2001
2010
2012
2015
2016
2017
2018
2019
2020
2021
And desired output is:
Complted_date Start_date
2015 2001
2015 2010
2015 2012
2015 2015
2017 2001
2017 2010
2017 2012
2017 2015
2017 2016
2017 2017
2020 2001
2020 2010
2020 2012
2020 2015
2020 2016
2020 2017
2020 2018
2020 2019
2020 2020
I've tried but I'm not getting the output I want.
Thank you for your help!!
Check out merge, which gives you the expected output:
(df1.assign(key=1)
    .merge(df2.assign(key=1), on='key')
    .query('Complted_date >= Start_date')
    .drop('key', axis=1)
)
Output:
Complted_date Start_date
0 2015 2001
1 2015 2010
2 2015 2012
3 2015 2015
10 2017 2001
11 2017 2010
12 2017 2012
13 2017 2015
14 2017 2016
15 2017 2017
20 2020 2001
21 2020 2010
22 2020 2012
23 2020 2015
24 2020 2016
25 2020 2017
26 2020 2018
27 2020 2019
28 2020 2020
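On pandas 1.2 or newer you can skip the dummy key entirely with how='cross', which performs the same cartesian product. A minimal sketch, rebuilding the frames from the question:

import pandas as pd

df1 = pd.DataFrame({'Complted_date': [2015, 2017, 2020]})
df2 = pd.DataFrame({'Start_date': [2001, 2010, 2012, 2015, 2016,
                                   2017, 2018, 2019, 2020, 2021]})

# Cartesian product, then keep pairs where completion is on or after start.
out = (df1.merge(df2, how='cross')
          .query('Complted_date >= Start_date'))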
However, you might want to check out merge_asof:
pd.merge_asof(df2, df1,
              right_on='Complted_date',
              left_on='Start_date',
              direction='forward')
Output:
Start_date Complted_date
0 2001 2015.0
1 2010 2015.0
2 2012 2015.0
3 2015 2015.0
4 2016 2017.0
5 2017 2017.0
6 2018 2020.0
7 2019 2020.0
8 2020 2020.0
9 2021 NaN
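Note that merge_asof requires both frames to be sorted on their key columns and returns only the single nearest match per row; that is why 2021 gets NaN (no completion date falls on or after it). It answers "which completion follows each start?" rather than producing every valid pair.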
You can do a cross join and pick the records which have Complted_date >= Start_date.
Use df.merge with df.query:
In [101]: df1['tmp'] = 1
In [102]: df2['tmp'] = 1
In [107]: res = df1.merge(df2, how='outer').query("Complted_date >= Start_date").drop(columns='tmp')
In [108]: res
Out[108]:
Complted_date Start_date
0 2015 2001
1 2015 2010
2 2015 2012
3 2015 2015
10 2017 2001
11 2017 2010
12 2017 2012
13 2017 2015
14 2017 2016
15 2017 2017
20 2020 2001
21 2020 2010
22 2020 2012
23 2020 2015
24 2020 2016
25 2020 2017
26 2020 2018
27 2020 2019
28 2020 2020
Here is another way using pd.Series() and explode(). The first line stores the whole list of start dates in row 0; ffill() then broadcasts that list to every row so explode() can expand it:
df1['Start_date'] = pd.Series([df2['Start_date'].tolist()])
df1['Start_date'] = df1['Start_date'].ffill()
df1.explode('Start_date').loc[lambda x: x['Complted_date'].ge(x['Start_date'])].reset_index(drop=True)
You could use conditional_join from pyjanitor to get rows where Complted_date is >= Start_date:
# pip install pyjanitor
import pandas as pd
import janitor
df1.conditional_join(df2, ('Complted_date', 'Start_date', '>='))
Out[1163]:
left right
Complted_date Start_date
0 2015 2001
1 2015 2010
2 2015 2012
3 2015 2015
4 2017 2001
5 2017 2010
6 2017 2012
7 2017 2015
8 2017 2016
9 2017 2017
10 2020 2001
11 2020 2010
12 2020 2012
13 2020 2015
14 2020 2016
15 2020 2017
16 2020 2018
17 2020 2019
18 2020 2020
Under the hood, it is just binary search (searchsorted) - the aim is to avoid a cartesian join, and hopefully, reduce memory usage.
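For intuition, a rough sketch of that searchsorted idea (my own illustration, not pyjanitor's actual code) could look like this:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Complted_date': [2015, 2017, 2020]})
df2 = pd.DataFrame({'Start_date': [2001, 2010, 2012, 2015, 2016,
                                   2017, 2018, 2019, 2020, 2021]})

# For each completion year, count how many sorted start years are <= it,
# then pair it with exactly that prefix of the sorted start years.
starts = np.sort(df2['Start_date'].to_numpy())
counts = np.searchsorted(starts, df1['Complted_date'].to_numpy(), side='right')
out = pd.DataFrame({
    'Complted_date': np.repeat(df1['Complted_date'].to_numpy(), counts),
    'Start_date': np.concatenate([starts[:c] for c in counts]),
})

No cartesian product is ever materialized, which is where the memory savings come from.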
For example, I have the following map (a dict of DataFrames):
{'df1': Jan Feb Mar
1 3 5
2 4 6
'df2': Jan Feb Mar
7 9 11
8 10 12
......}
And I want the following output:
Jan 1
Jan 2
Feb 3
Feb 4
Mar 5
Mar 6
Jan 7
Jan 8
Feb 9
Feb 10
Mar 11
Mar 12
Does anyone know if it's possible to do it this way?
What I have tried is iterating through the DataFrames to get
{'df1': Jan 1
Jan 2
Feb 3
Feb 4
Mar 5
Mar 6
'df2': Jan 7
Jan 8
Feb 9
Feb 10
Mar 11
Mar 12
by using
for x in dfMap:
    df = pd.melt(list(x.values()))
Then I tried to concat it with
df1m = pd.concat(df.values(), ignore_index=True)
which gave me the error
AttributeError: 'list' object has no attribute 'columns'
I am fairly new to programming and really want to learn; it would be nice if someone could explain how this works, and why a list or dict_values object has no attribute 'columns'.
Thanks in advance!
The AttributeError happens because pd.melt expects a single DataFrame, while a plain list (or dict_values) has no .columns attribute; combine the frames with pd.concat first. You can concat and stack:
out = pd.concat(d.values()).stack().droplevel(0)
Or:
out = pd.concat(d.values()).melt()
Example:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1,10).reshape(-1,3), columns=['Jan','Feb','Mar'])
d = {}
for e,i in df.iterrows():
    d[f"df{e+1}"] = i.to_frame().T
print(d,'\n')
out = pd.concat(d.values()).stack().droplevel(0)
print(out)
{'df1': Jan Feb Mar
0 1 2 3, 'df2': Jan Feb Mar
1 4 5 6, 'df3': Jan Feb Mar
2 7 8 9}
Jan 1
Feb 2
Mar 3
Jan 4
Feb 5
Mar 6
Jan 7
Feb 8
Mar 9
dtype: int32
With melt:
out = pd.concat(d.values()).melt()
print(out)
variable value
0 Jan 1
1 Jan 4
2 Jan 7
3 Feb 2
4 Feb 5
5 Feb 8
6 Mar 3
7 Mar 6
8 Mar 9
EDIT: for the edited question, try:
out = pd.concat(d).stack().sort_index(level=[0,-1]).droplevel([0,1])
Example below:
df = pd.DataFrame(np.arange(1,13).reshape(3,-1).T,columns=['Jan','Feb','Mar'])
d = {}
for e,i in df.groupby(df.index//2):
    d[f"df{e+1}"] = i
print(d,'\n')
out = pd.concat(d).stack().sort_index(level=[0,-1]).droplevel([0,1])
print(out)
{'df1': Jan Feb Mar
0 1 5 9
1 2 6 10, 'df2': Jan Feb Mar
2 3 7 11
3 4 8 12}
Jan 1
Jan 2
Feb 5
Feb 6
Mar 9
Mar 10
Jan 3
Jan 4
Feb 7
Feb 8
Mar 11
Mar 12
dtype: int32
Or you can convert the DataFrame names to int and then sort:
out = (pd.concat(d.values(), keys=[int(key[2:]) for key in d.keys()])
         .stack().sort_index(level=[0,-1]).droplevel([0,1]))
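In all of these variants, pd.concat(d) (or pd.concat with keys=...) places the dict keys in an extra outer index level, and stack moves the column labels into the innermost index level; droplevel([0, 1]) then strips the key and the original row number so only the month labels remain.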
I'm trying to sort grouped data using pandas.
My code :
df = pd.read_csv("./data3.txt")
grouped = df.groupby(['cust','year','month'])['price'].count()
print(grouped)
My data:
cust,year,month,price
astor,2015,Jan,100
astor,2015,Jan,122
astor,2015,Feb,200
astor,2016,Feb,234
astor,2016,Feb,135
astor,2016,Mar,169
astor,2017,Mar,321
astor,2017,Apr,245
tor,2015,Jan,100
tor,2015,Feb,122
tor,2015,Feb,200
tor,2016,Mar,234
tor,2016,Apr,135
tor,2016,May,169
tor,2017,Mar,321
tor,2017,Apr,245
This is my result.
cust year month
astor 2015 Feb 1
Jan 2
2016 Feb 2
Mar 1
2017 Apr 1
Mar 1
tor 2015 Feb 2
Jan 1
2016 Apr 1
Mar 1
May 1
2017 Apr 1
Mar 1
How can I get the output sorted by month in calendar order?
Add parameter sort=False to groupby:
grouped = df.groupby(['cust','year','month'], sort=False)['price'].count()
print (grouped)
cust year month
astor 2015 Jan 2
Feb 1
2016 Feb 2
Mar 1
2017 Mar 1
Apr 1
tor 2015 Jan 1
Feb 2
2016 Mar 1
Apr 1
May 1
2017 Mar 1
Apr 1
Name: price, dtype: int64
If the first solution is not possible, convert the months to datetimes and, at the end, convert them back:
df['month'] = pd.to_datetime(df['month'], format='%b')
f = lambda x: x.strftime('%b')
grouped = df.groupby(['cust','year','month'])['price'].count().rename(f, level=2)
print (grouped)
cust year month
astor 2015 Jan 2
Feb 1
2016 Feb 2
Mar 1
2017 Mar 1
Apr 1
tor 2015 Jan 1
Feb 2
2016 Mar 1
Apr 1
May 1
2017 Mar 1
Apr 1
Name: price, dtype: int64
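A third option (my own addition, not part of the original answers) is to make month an ordered categorical, so that groupby sorts it in calendar order rather than alphabetically:

import pandas as pd

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
df['month'] = pd.Categorical(df['month'], categories=months, ordered=True)

# observed=True keeps only the month values that actually occur per group;
# without it, a categorical groupby emits all 12 months for every group.
grouped = df.groupby(['cust', 'year', 'month'], observed=True)['price'].count()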
I have one dataframe which looks like below:
Date_1 Date_2 DR CR Bal
0 5 Dec 2017 5 Dec 2017 500 NaN 1000
1 14 Dec 2017 14 Dec 2017 NaN NaN 1500
2 15 Dec 2017 15 Dec 2017 NaN NaN 1200
3 18 Dec 2017 18 Dec 2017 NaN NaN 1700
4 21 Dec 2017 21 Dec 2017 NaN NaN 2000
5 22 Dec 2017 22 Dec 2017 NaN NaN 1000
In the above dataframe, the "Bal" column contains balance values, and I want to fill in the DR/CR values based on the change in the "Bal" amount from the previous row.
I did it using plain Python, but it seems pandas can perform this task in a much more elegant manner.
Expected Output:
Date_1 Date_2 DR CR Bal
0 5 Dec 2017 5 Dec 2017 500 NaN 1000
1 14 Dec 2017 14 Dec 2017 NaN 500 1500
2 15 Dec 2017 15 Dec 2017 300 NaN 1200
3 18 Dec 2017 18 Dec 2017 NaN 500 1700
4 21 Dec 2017 21 Dec 2017 NaN 300 2000
5 22 Dec 2017 22 Dec 2017 1000 NaN 1000
You could use Series.mask. First calculate the row-over-row difference of the balance using diff. Then, with mask, fill DR with the absolute difference where it is negative (the balance fell), and fill CR with the difference where it is positive (the balance rose); everywhere else the existing values, including the NaNs, are kept.
diff = df['Bal'].diff()
df['DR'] = df['DR'].mask(diff < 0, diff.abs())
df['CR'] = df['CR'].mask(diff > 0, diff)
#Output
# Date_1 Date_2 DR CR Bal
#0 5 Dec 2017 5 Dec 2017 500.0 NaN 1000
#1 14 Dec 2017 14 Dec 2017 NaN 500.0 1500
#2 15 Dec 2017 15 Dec 2017 300.0 NaN 1200
#3 18 Dec 2017 18 Dec 2017 NaN 500.0 1700
#4 21 Dec 2017 21 Dec 2017 NaN 300.0 2000
#5 22 Dec 2017 22 Dec 2017 1000.0 NaN 1000
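For reference, a self-contained version of the same idea; the frame below is reconstructed from the question, with the date columns dropped for brevity:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'DR':  [500, np.nan, np.nan, np.nan, np.nan, np.nan],
    'CR':  np.nan,
    'Bal': [1000, 1500, 1200, 1700, 2000, 1000],
})

diff = df['Bal'].diff()
df['DR'] = df['DR'].mask(diff < 0, diff.abs())  # balance fell: debit
df['CR'] = df['CR'].mask(diff > 0, diff)        # balance rose: credit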
I have a dataframe:
import pandas as pd
import numpy as np
ycap = [2015, 2016, 2017]
df = pd.DataFrame({'a': np.repeat(ycap, 5),
                   'b': np.random.randn(15)})
a b
0 2015 0.436967
1 2015 -0.539453
2 2015 -0.450282
3 2015 0.907723
4 2015 -2.279188
5 2016 1.468736
6 2016 -0.169522
7 2016 0.003501
8 2016 0.182321
9 2016 0.647310
10 2017 0.679443
11 2017 -0.154405
12 2017 -0.197271
13 2017 -0.153552
14 2017 0.518803
I would like to add a column c that would look like the following:
a b c
0 2015 -0.826946 2014
1 2015 0.275072 2013
2 2015 0.735353 2012
3 2015 1.391345 2011
4 2015 0.389524 2010
5 2016 -0.944750 2015
6 2016 -1.192546 2014
7 2016 -0.247521 2013
8 2016 0.521094 2012
9 2016 0.273950 2011
10 2017 -1.199278 2016
11 2017 0.839705 2015
12 2017 0.075951 2014
13 2017 0.663696 2013
14 2017 0.398995 2012
I tried to achieve this using the following; however, the 1 needs to increment within each group. How could I do it? Thanks
gp = df.groupby('a')
df['c'] = gp['a'].apply(lambda x: x-1)
Subtract the Series created by cumcount from column a, and finally subtract 1:
df['c'] = df['a'] - df.groupby('a').cumcount() - 1
print (df)
a b c
0 2015 0.285832 2014
1 2015 -0.223318 2013
2 2015 0.620920 2012
3 2015 -0.891164 2011
4 2015 -0.719840 2010
5 2016 -0.106774 2015
6 2016 -1.230357 2014
7 2016 0.747803 2013
8 2016 -0.002320 2012
9 2016 0.062715 2011
10 2017 0.805035 2016
11 2017 -0.385647 2015
12 2017 -0.457458 2014
13 2017 -1.589365 2013
14 2017 0.013825 2012
Detail:
print (df.groupby('a').cumcount())
0 0
1 1
2 2
3 3
4 4
5 0
6 1
7 2
8 3
9 4
10 0
11 1
12 2
13 3
14 4
dtype: int64
You can do it this way:
In [8]: df['c'] = df.groupby('a')['a'].transform(lambda x: x-np.arange(1, len(x)+1))
In [9]: df
Out[9]:
a b c
0 2015 0.436967 2014
1 2015 -0.539453 2013
2 2015 -0.450282 2012
3 2015 0.907723 2011
4 2015 -2.279188 2010
5 2016 1.468736 2015
6 2016 -0.169522 2014
7 2016 0.003501 2013
8 2016 0.182321 2012
9 2016 0.647310 2011
10 2017 0.679443 2016
11 2017 -0.154405 2015
12 2017 -0.197271 2014
13 2017 -0.153552 2013
14 2017 0.518803 2012
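As a design note, both answers produce the same c column; the cumcount version is fully vectorized, while the transform version runs a Python lambda once per group, so the former will generally be faster when there are many groups.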