How to join or merge two dataframes based on different condition?

How to join or merge two dataframes based on different condition? - python

I want to merge or join two DataFrames based on different date. Join Completed date with any earlier Start date. I have the following dataframes:
df1:
Complted_date
2015
2017
2020
df2:
Start_date
2001
2010
2012
2015
2016
2017
2018
2019
2020
2021
And desired output is:
Complted_date Start_date
2015 2001
2015 2010
2015 2012
2015 2015
2017 2001
2017 2010
2017 2012
2017 2015
2017 2016
2017 2017
2020 2001
2020 2010
2020 2012
2020 2015
2020 2016
2020 2017
2020 2018
2020 2019
2020 2020
I've tried but I'm not getting the output I want.
Thank you for your help!!

Check out merge, which gives you the expected output:
(df1.assign(key=1)
.merge(df2.assign(key=1), on='key')
.query('Complted_date>=Start_date')
.drop('key', axis=1)
)
Output:
Complted_date Start_date
0 2015 2001
1 2015 2010
2 2015 2012
3 2015 2015
10 2017 2001
11 2017 2010
12 2017 2012
13 2017 2015
14 2017 2016
15 2017 2017
20 2020 2001
21 2020 2010
22 2020 2012
23 2020 2015
24 2020 2016
25 2020 2017
26 2020 2018
27 2020 2019
28 2020 2020
However, you might want to check out merge_asof:
pd.merge_asof(df2, df1,
right_on='Complted_date',
left_on='Start_date',
direction='forward')
Output:
Start_date Complted_date
0 2001 2015.0
1 2010 2015.0
2 2012 2015.0
3 2015 2015.0
4 2016 2017.0
5 2017 2017.0
6 2018 2020.0
7 2019 2020.0
8 2020 2020.0
9 2021 NaN

You can do cross-join and pick records which have Completed_date > Start_date:
Use df.merge with df.query:
In [101]: df1['tmp'] = 1
In [102]: df2['tmp'] = 1
In [107]: res = df1.merge(df2, how='outer').query("Complted_date >= Start_date").drop('tmp', 1)
In [108]: res
Out[108]:
Complted_date Start_date
0 2015 2001
1 2015 2010
2 2015 2012
3 2015 2015
10 2017 2001
11 2017 2010
12 2017 2012
13 2017 2015
14 2017 2016
15 2017 2017
20 2020 2001
21 2020 2010
22 2020 2012
23 2020 2015
24 2020 2016
25 2020 2017
26 2020 2018
27 2020 2019
28 2020 2020

Here is another way using pd.Series() and explode()
df1['Start_date'] = pd.Series([df2['Start_date'].tolist()])
df1['Start_date'] = df1['Start_date'].fillna(method='ffill')
df1.explode('Start_date').loc[lambda x: x['Complted_date'].ge(x['Start_date'])].reset_index(drop=True)

You could use conditional_join from pyjanitor to get rows where compltd_date is >= start_date:
# pip install pyjanitor
import pandas as pd
import janitor
df1.conditional_join(df2, ('Complted_date', 'Start_date', '>='))
Out[1163]:
left right
Complted_date Start_date
0 2015 2001
1 2015 2010
2 2015 2012
3 2015 2015
4 2017 2001
5 2017 2010
6 2017 2012
7 2017 2015
8 2017 2016
9 2017 2017
10 2020 2001
11 2020 2010
12 2020 2012
13 2020 2015
14 2020 2016
15 2020 2017
16 2020 2018
17 2020 2019
18 2020 2020
Under the hood, it is just binary search (searchsorted) - the aim is to avoid a cartesian join, and hopefully, reduce memory usage.

Related

Pandas : Groupby sum values

I am using this data frame in excel :
I'd like to show the total sales per year.
Year Sales
2021 7
2018 6
2018 787
2018 935
2018 1 059
2018 5
2018 72
2018 2
2018 3
2019 218
2019 256
2020 2
2018 4
2021 8
2019 14
2020 3
2018 3
2018 1
2020 34
I'm using this :
df.groupby(['Year'])['Sales'].agg('sum')
And the result :
2018.0 67879351 05957223431
2019.0 21825614
2020.0 2334
2021.0 78
Do you know why I don't have the sum of the values ?
Thanks

'Sales' column is of dtype object so convert it to numeric:
df['Sales']=pd.to_numeric(df['Sales'].replace(r"\s+",'',regex=True),errors='coerce')
#df['Sales'].replace(r"\s+",'',regex=True).astype(float)
Now calculte sum():
out=df.groupby(['Year'])['Sales'].sum()
output of out:
Year
2018 2877
2019 488
2020 39
2021 15
Name: Sales, dtype: int64

How to decide if value in row is valid or not based on multiple conditions?

I have some data in dataframe and want to check if the Year is valid or not if present in between start_year AND end_year
Year start_year end_year
2010 2012 2014
2013 2012 2015
2015 2015 2016
2009 2010 2012
2017 2016 2019
I want to add one more column (valid/invalid) specifying that the Year is valid or not
Year start_year end_year valid/invalid
2010 2012 2014 invalid
2013 2012 2015 valid
2015 2015 2016 valid
2009 2010 2012 invalid
2017 2016 2019 valid
How can we achieve this using python?

You can use np.where with Series.between
df["valid/invalid"] = np.where(df.Year.between(df.start_year,df.end_year),'valid','invalid')
df
Year start_year end_year valid/invalid
0 2010 2012 2014 invalid
1 2013 2012 2015 valid
2 2015 2015 2016 valid
3 2009 2010 2012 invalid
4 2017 2016 2019 valid

Check np.where
df['v/inv'] = np.where((df.Year>=df.start_year) & (df.Year<=df.end_year), 'valid','invalid')
df
Out[360]:
Year start_year end_year v/inv
0 2010 2012 2014 invalid
1 2013 2012 2015 valid
2 2015 2015 2016 valid
3 2009 2010 2012 invalid
4 2017 2016 2019 valid

If you want to stick to only using Pandas, then try the following solution which uses apply and replace -
df['valid/invalid'] = df.apply(lambda x: (x.Year>=x.start_year) and (x.Year<=x.end_year), axis=1).replace({True:'Valid',False:'Invalid'})
Year start_year end_year valid/invalid
0 2010 2012 2014 Invalid
1 2013 2012 2015 Valid
2 2015 2015 2016 Valid
3 2009 2010 2012 Invalid
4 2017 2016 2019 Valid
The first apply step gets you True or False if the year is in between (inclusive on both ends) the start and end year. Second step replaces the True and False with Valid or Invalid strings.

How to calculate bank statement debit/credit columns using balance column of pandas dataframe?

I have one dataframe which looks like below:
Date_1 Date_2 DR CR Bal
0 5 Dec 2017 5 Dec 2017 500 NaN 1000
1 14 Dec 2017 14 Dec 2017 NaN NaN 1500
2 15 Dec 2017 15 Dec 2017 NaN NaN 1200
3 18 Dec 2017 18 Dec 2017 NaN NaN 1700
4 21 Dec 2017 21 Dec 2017 NaN NaN 2000
5 22 Dec 2017 22 Dec 2017 NaN NaN 1000
In the above dataframe "Bal" column contains balance values and want to fill up the DR/CR values based on the next "Bal" amount.
I did it using simple python but seems like pandas can perform this action in very intelligent manner.
Expected Output:
Date_1 Date_2 DR CR Bal
0 5 Dec 2017 5 Dec 2017 500 NaN 1000
1 14 Dec 2017 14 Dec 2017 NaN 500 1500
2 15 Dec 2017 15 Dec 2017 300 NaN 1200
3 18 Dec 2017 18 Dec 2017 NaN 500 1700
4 21 Dec 2017 21 Dec 2017 NaN 300 2000
5 22 Dec 2017 22 Dec 2017 1000 NaN 1000

You could use a pd.mask. First calculate the difference of the balance by using diff. By using mask, fill one column by its absolute value if it's negative, and mask the np.nan values in the other column where it's positive.
diff = df['Bal'].diff()
df['DR'] = df['DR'].mask(diff < 0, diff.abs())
df['CR'] = df['CR'].mask(diff > 0, diff)
#Output
# Date_1 Date_2 DR CR Bal
#0 5 Dec 2017 5 Dec 2017 500.0 NaN 1000
#1 14 Dec 2017 14 Dec 2017 NaN 500.0 1500
#2 15 Dec 2017 15 Dec 2017 300.0 NaN 1200
#3 18 Dec 2017 18 Dec 2017 NaN 500.0 1700
#4 21 Dec 2017 21 Dec 2017 NaN 300.0 2000
#5 22 Dec 2017 22 Dec 2017 1000.0 NaN 1000

Percentage change with groupby python

I have the following dataframe:
Year Month Booked
0 2016 Aug 55999.0
6 2017 Aug 60862.0
1 2016 Jul 54062.0
7 2017 Jul 58417.0
2 2016 Jun 42044.0
8 2017 Jun 48767.0
3 2016 May 39676.0
9 2017 May 40986.0
4 2016 Oct 39593.0
10 2017 Oct 41439.0
5 2016 Sep 49677.0
11 2017 Sep 53969.0
I want to obtain the percentage change with respect to the same month from last year. I have tried the following code:
df['pct_ch'] = df.groupby(['Month','Year'])['Booked'].pct_change()
but I get the following, which is not at all what I want:
Year Month Booked pct_ch
0 2016 Aug 55999.0 NaN
6 2017 Aug 60862.0 0.086841
1 2016 Jul 54062.0 -0.111728
7 2017 Jul 58417.0 0.080556
2 2016 Jun 42044.0 -0.280278
8 2017 Jun 48767.0 0.159904
3 2016 May 39676.0 -0.186417
9 2017 May 40986.0 0.033017
4 2016 Oct 39593.0 -0.033987
10 2017 Oct 41439.0 0.046624
5 2016 Sep 49677.0 0.198798
11 2017 Sep 53969.0 0.086398

Do not groupby Year otherwise you won't get, for instance, Aug 2017 and Aug 2016 together. Also, use transform to broadcast back results to original indices
Try:
df['pct_ch'] = df.groupby(['Month'])['Booked'].transform(lambda s: s.pct_change())
Year Month Booked pct_ch
0 2016 Aug 55999.0 NaN
6 2017 Aug 60862.0 0.086841
1 2016 Jul 54062.0 NaN
7 2017 Jul 58417.0 0.080556
2 2016 Jun 42044.0 NaN
8 2017 Jun 48767.0 0.159904
3 2016 May 39676.0 NaN
9 2017 May 40986.0 0.033017
4 2016 Oct 39593.0 NaN
10 2017 Oct 41439.0 0.046624
5 2016 Sep 49677.0 NaN
11 2017 Sep 53969.0 0.086398

substract incremented value in groupby

I have a dataframe:
import pandas as pd
import numpy as np
ycap = [2015, 2016, 2017]
df = pd.DataFrame({'a': np.repeat(ycap, 5),
'b': np.random.randn(15)})
a b
0 2015 0.436967
1 2015 -0.539453
2 2015 -0.450282
3 2015 0.907723
4 2015 -2.279188
5 2016 1.468736
6 2016 -0.169522
7 2016 0.003501
8 2016 0.182321
9 2016 0.647310
10 2017 0.679443
11 2017 -0.154405
12 2017 -0.197271
13 2017 -0.153552
14 2017 0.518803
I would like to add column c, that would look like following:
a b c
0 2015 -0.826946 2014
1 2015 0.275072 2013
2 2015 0.735353 2012
3 2015 1.391345 2011
4 2015 0.389524 2010
5 2016 -0.944750 2015
6 2016 -1.192546 2014
7 2016 -0.247521 2013
8 2016 0.521094 2012
9 2016 0.273950 2011
10 2017 -1.199278 2016
11 2017 0.839705 2015
12 2017 0.075951 2014
13 2017 0.663696 2013
14 2017 0.398995 2012
I try to achieve this using following, however 1, need to increment within the group. How could I do it? Thanks
gp = df.groupby('a')
df['c'] = gp['a'].apply(lambda x: x-1)

Subtract column a by Series created by cumcount and last subtract 1:
df['c'] = df['a'] - df.groupby('a').cumcount() - 1
print (df)
a b c
0 2015 0.285832 2014
1 2015 -0.223318 2013
2 2015 0.620920 2012
3 2015 -0.891164 2011
4 2015 -0.719840 2010
5 2016 -0.106774 2015
6 2016 -1.230357 2014
7 2016 0.747803 2013
8 2016 -0.002320 2012
9 2016 0.062715 2011
10 2017 0.805035 2016
11 2017 -0.385647 2015
12 2017 -0.457458 2014
13 2017 -1.589365 2013
14 2017 0.013825 2012
Detail:
print (df.groupby('a').cumcount())
0 0
1 1
2 2
3 3
4 4
5 0
6 1
7 2
8 3
9 4
10 0
11 1
12 2
13 3
14 4
dtype: int64

you can do it this way:
In [8]: df['c'] = df.groupby('a')['a'].transform(lambda x: x-np.arange(1, len(x)+1))
In [9]: df
Out[9]:
a b c
0 2015 0.436967 2014
1 2015 -0.539453 2013
2 2015 -0.450282 2012
3 2015 0.907723 2011
4 2015 -2.279188 2010
5 2016 1.468736 2015
6 2016 -0.169522 2014
7 2016 0.003501 2013
8 2016 0.182321 2012
9 2016 0.647310 2011
10 2017 0.679443 2016
11 2017 -0.154405 2015
12 2017 -0.197271 2014
13 2017 -0.153552 2013
14 2017 0.518803 2012

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to join or merge two dataframes based on different condition? - python

Here is another way using pd.Series() and explode() df1['Start_date'] = pd.Series([df2['Start_date'].tolist()]) df1['Start_date'] = df1['Start_date'].fillna(method='ffill') df1.explode('Start_date').loc[lambda x: x['Complted_date'].ge(x['Start_date'])].reset_index(drop=True)

Related

Pandas : Groupby sum values

How to decide if value in row is valid or not based on multiple conditions?

How to calculate bank statement debit/credit columns using balance column of pandas dataframe?

Percentage change with groupby python

substract incremented value in groupby

Categories

Resources