Pandas conditional outer join based on timedelta (merge_asof) - python

I have multiple dataframes that I need to merge into a single dataset based on a unique identifier (uid), and on the timedelta between dates in each dataframe.
Here's a simplified example of the dataframes:
df1
uid tx_date first_name last_name meas_1
0 60 2004-01-11 John Smith 1.3
1 60 2016-12-24 John Smith 2.4
2 61 1994-05-05 Betty Jones 1.2
3 63 2006-07-19 James Wood NaN
4 63 2008-01-03 James Wood 2.9
5 65 1998-10-08 Tom Plant 4.2
6 66 2000-02-01 Helen Kerr 1.1
df2
uid rx_date first_name last_name meas_2
0 60 2004-01-14 John Smith A
1 60 2017-01-05 John Smith AB
2 60 2017-03-31 John Smith NaN
3 63 2006-07-21 James Wood A
4 64 2002-04-18 Bill Jackson B
5 65 1998-10-08 Tom Plant AA
6 65 2005-12-01 Tom Plant B
7 66 2013-12-14 Helen Kerr C
Basically I am trying to merge records for the same person from two separate sources, where the link between records for unique individuals is the 'uid', and the link between rows (where it exists) for each individual is a fuzzy relationship between 'tx_date' and 'rx_date' that can (usually) be accommodated by a specific time delta. There won't always be an exact or fuzzy match between dates, data could be missing from any column except 'uid', and each dataframe will contain a different but intersecting subset of 'uid's.
I need to be able to concatenate rows where the 'uid' columns match, and where the absolute time delta between 'tx_date' and 'rx_date' is within a given range (e.g. max delta of 14 days). Where the time delta is outside that range, or one of either 'tx_date' or 'rx_date' is missing, or where the 'uid' exists in only one of the dataframes, I still need to retain the data in that row. The end result should be something like:
uid tx_date rx_date first_name last_name meas_1 meas_2
0 60 2004-01-11 2004-01-14 John Smith 1.3 A
1 60 2016-12-24 2017-01-05 John Smith 2.4 AB
2 60 NaT 2017-03-31 John Smith NaN NaN
3 61 1994-05-05 NaT Betty Jones 1.2 NaN
4 63 2006-07-19 2006-07-21 James Wood NaN A
5 63 2008-01-03 NaT James Wood NaN NaN
6 64 2002-04-18 NaT Bill Jackson NaN B
7 65 1998-10-08 1998-10-08 Tom Plant 4.2 AA
8 65 NaT 2005-12-01 Tom Plant NaN B
9 66 2000-02-01 NaT Helen Kerr 1.1 NaN
10 66 NaT 2013-12-14 Helen Kerr NaN C
Seems like pandas.merge_asof should be useful here, but I've not been able to get it to do quite what I need.
Trying merge_asof on two of the real dataframes gave an error: ValueError: left keys must be sorted
As per this question, the problem was actually due to NaT values in the 'date' column for some rows. I dropped the rows with NaT values and sorted the 'date' columns in each dataframe, but the result still isn't quite what I need.
The code below shows the steps taken.
import pandas as pd

# Copy each date column to a common key and parse it
df1['date'] = pd.to_datetime(df1['tx_date'])
df2['date'] = pd.to_datetime(df2['rx_date'])

# Drop the rows whose key is NaT and sort on it. (Note that assigning
# df1['date'].dropna() back to the column is a no-op, since assignment
# realigns on the index - the rows have to be dropped from the frame itself.)
df1 = df1.dropna(subset=['date']).sort_values('date')
df2 = df2.dropna(subset=['date']).sort_values('date')

df_merged = pd.merge_asof(df1, df2, on='date', by='uid',
                          tolerance=pd.Timedelta('14 days')).sort_values('uid')
Result:
uid tx_date rx_date first_name_x last_name_x meas_1 meas_2
3 60 2004-01-11 2004-01-14 John Smith 1.3 A
6 60 2016-12-24 2017-01-05 John Smith 2.4 AB
0 61 1994-05-05 NaT Betty Jones 1.2 NaN
4 63 2006-07-19 2006-07-21 James Wood NaN A
5 63 2008-01-03 NaT James Wood 2.9 NaN
1 65 1998-10-08 1998-10-08 Tom Plant 4.2 AA
2 66 2000-02-01 NaT Helen Kerr 1.1 NaN
It looks like a left join rather than a full outer join, so any row in df2 without a match on 'uid' and 'date' in df1 is lost (and it's not really clear from this simplified example, but I also need to add the rows back in where the date was NaT).
Is there some way to achieve a lossless merge, either by somehow doing an outer join with merge_asof, or using some other approach?
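One approach that seems to work (a sketch, not a definitive implementation): do the one-sided merge_asof, then anti-join the df2 rows it never consumed and concatenate them back, along with any rows that were dropped for having NaT dates. The direction='nearest' argument and the combine_first coalescing of the duplicated name columns are my additions; adjust to taste.

import pandas as pd

# The example frames from the question
df1 = pd.DataFrame({
    'uid': [60, 60, 61, 63, 63, 65, 66],
    'tx_date': pd.to_datetime(['2004-01-11', '2016-12-24', '1994-05-05',
                               '2006-07-19', '2008-01-03', '1998-10-08',
                               '2000-02-01']),
    'first_name': ['John', 'John', 'Betty', 'James', 'James', 'Tom', 'Helen'],
    'last_name': ['Smith', 'Smith', 'Jones', 'Wood', 'Wood', 'Plant', 'Kerr'],
    'meas_1': [1.3, 2.4, 1.2, None, 2.9, 4.2, 1.1]})
df2 = pd.DataFrame({
    'uid': [60, 60, 60, 63, 64, 65, 65, 66],
    'rx_date': pd.to_datetime(['2004-01-14', '2017-01-05', '2017-03-31',
                               '2006-07-21', '2002-04-18', '1998-10-08',
                               '2005-12-01', '2013-12-14']),
    'first_name': ['John', 'John', 'John', 'James', 'Bill', 'Tom', 'Tom', 'Helen'],
    'last_name': ['Smith', 'Smith', 'Smith', 'Wood', 'Jackson', 'Plant', 'Plant', 'Kerr'],
    'meas_2': ['A', 'AB', None, 'A', 'B', 'AA', 'B', 'C']})

# merge_asof needs sorted, non-null keys. direction='nearest' lets rx_date
# fall on either side of tx_date (the default 'backward' would miss
# 2016-12-24 -> 2017-01-05).
left = df1.dropna(subset=['tx_date']).sort_values('tx_date')
right = df2.dropna(subset=['rx_date']).sort_values('rx_date')
merged = pd.merge_asof(left, right,
                       left_on='tx_date', right_on='rx_date', by='uid',
                       tolerance=pd.Timedelta('14 days'),
                       direction='nearest', suffixes=('', '_df2'))

# Anti-join: df2 rows whose (uid, rx_date) the asof join never matched
unmatched = right.merge(
    merged.loc[merged['rx_date'].notna(), ['uid', 'rx_date']],
    on=['uid', 'rx_date'], how='left', indicator=True
).query("_merge == 'left_only'").drop(columns='_merge')

# Coalesce the duplicated name columns, then stack everything back together,
# including any rows dropped above for having NaT dates
for col in ['first_name', 'last_name']:
    merged[col] = merged[col].combine_first(merged[col + '_df2'])
merged = merged.drop(columns=['first_name_df2', 'last_name_df2'])

result = pd.concat([merged, unmatched,
                    df1[df1['tx_date'].isna()], df2[df2['rx_date'].isna()]],
                   ignore_index=True)
result = result.sort_values(['uid', 'tx_date', 'rx_date']).reset_index(drop=True)
print(result)

On the example data this reproduces the desired output above: matched rows are paired within the 14-day tolerance, and the unmatched df2 rows (uid 60, 64, and the second rows for 65 and 66) are retained rather than lost.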

Related

Identify value based on matching Rows and Columns across two Dataframes

I'm very new to Python and I have two dataframes. I'm trying to match the "Names" (the columns of dataframe 1) with the rows of dataframe 2 and collect the value for the year 2022, with the hoped-for output looking like Dataframe 3. I've tried looking through other queries but haven't found anything to help; any help would be greatly appreciated!
Dataframe 1 (Money):
Date  Alex  Rob  Kev  Ben
2022    29   45   65   12
2021    11   32   11   19
2019    45   12   22   76

Dataframe 2:
Name
James
Alex
Carl
Rob
Kev

Dataframe 3 (desired output):
Name   Amount
James
Alex   29
Carl
Rob    45
Kev    65
There are many different ways to achieve this.
One option is using map:
# The 2022 row: index = the names, values = the amounts
s = df1.set_index('Date').loc[2022]
# Look each Name up in that row
df2['Amount'] = df2['Name'].map(s)
output:
Name Amount
0 James NaN
1 Alex 29.0
2 Carl NaN
3 Rob 45.0
4 Kev 65.0
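For clarity, s in the code above is the year-2022 row of df1, indexed by name, which is exactly the lookup table that map needs (my illustration):
>>> df1.set_index('Date').loc[2022]
Alex    29
Rob     45
Kev     65
Ben     12
Name: 2022, dtype: int64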
Another option is using merge, which produces the same output:
s = df1.set_index('Date').loc[2022]
# Join the 2022 row (renamed to 'Amount') onto df2 by Name
df3 = df2.merge(s.rename('Amount'), left_on='Name', right_index=True, how='left')

Merging three dataframes with three similar indexes:

I have three dataframes
df1 :
Date ID Number ID2 info_df1
2021-12-11 1 34 36 60
2021-12-10 2 33 35 57
2021-12-09 3 32 34 80
2021-12-08 4 31 33 55
df2:
Date ID Number ID2 info_df2
2021-12-10 2 18 20 50
2021-12-11 1 34 36 89
2021-12-10 2 33 35 40
2021-12-09 3 32 34 92
df3:
Date ID Number ID2 info_df3
2021-12-10 2 18 20 57
2021-12-10 2 18 20 63
2021-12-11 1 34 36 52
2021-12-10 2 33 35 33
I need a dataframe with the info column from each of df1, df2 and df3, and with Date, ID, Number, ID2 as the index.
The merged dataframe should have these columns:
Date ID Number ID2 info_df1 info_df2 info_df3
If you're trying to merge the dataframes based on Date alone, what you need is the merge function:
mergedDf = df1.merge(df2, on="Date").merge(df3, on="Date")
mergedDf.set_index("ID2", inplace=True)
But if you're trying to merge based on multiple columns, you can pass a list of column names to the on argument:
mergedDf = df1.merge(df2, on=["Date", "ID", "ID2"]).merge(df3, on=["Date", "ID", "ID2"])
mergedDf.set_index("ID2", inplace=True)
Note that merge defaults to an inner join, so keys missing from any one frame are dropped; pass how="outer" to keep them.
Two steps:
first, pandas.concat(<DFs-list>) all those DFs into a single df;
then, define a multi-index with df.set_index(<col-names-list>).
That will do it. Sure, you have to read some docs (linked below), but those two steps should be about it.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.set_levels.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.from_frame.html
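A minimal sketch of those two steps (my code; with the caveat that concat stacks rows rather than aligning them side by side, so each stacked row keeps only its own info_df{n} value):

import pandas as pd

# Stack the three frames, then index by the key columns
combined = pd.concat([df1, df2, df3])
combined = combined.set_index(['Date', 'ID', 'Number', 'ID2'])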
As others have mentioned, you need to merge the dataframes together. Using the built-in functools.reduce, we can do this dynamically (for any number of dataframes) and easily:
import functools as ft

# Each frame already carries its own distinct info_df{n} column, so a chain of
# outer merges on the four key columns is all that is needed
dfs = [df1, df2, df3]
new_df = ft.reduce(
    lambda left, right: left.merge(right, on=['Date', 'ID', 'Number', 'ID2'], how='outer'),
    dfs)
Output:
>>> new_df
        Date  ID  Number  ID2  info_df1  info_df2  info_df3
0 2021-12-11   1      34   36      60.0      89.0      52.0
1 2021-12-10   2      33   35      57.0      40.0      33.0
2 2021-12-09   3      32   34      80.0      92.0       NaN
3 2021-12-08   4      31   33      55.0       NaN       NaN
4 2021-12-10   2      18   20       NaN      50.0      57.0
5 2021-12-10   2      18   20       NaN      50.0      63.0
(The (2021-12-10, 2, 18, 20) key appears twice because df3 contains two rows with that key.)

Pandas: Combine two data-frames with different shape based on one common column

I have a df with columns:
Student_id subject marks
1 English 70
1 math 90
1 science 60
1 social 80
2 English 90
2 math 50
2 science 70
2 social 40
I have another df1 with columns
Student_id Year_of_join column_with_info
1 2020 some_info1
1 2020 some_info2
1 2020 some_info3
2 2019 some_info4
2 2019 some_info5
I want to combine two of the above data frames(.csv files) something like below res_df:
Student_id subject marks year_of_join column_with_info
1 English 70 2020 some_info1
1 math 90 2020 some_info2
1 science 60 2020 some_info3
1 social 80 NaN NaN
2 English 90 2019 some_info4
2 math 50 2019 some_info5
2 science 70 NaN NaN
2 social 40 NaN NaN
Note:
I want to join the datasets based on Student_id. Both datasets have the same unique Student_ids, but the shape of the data differs between them.
P.S: The resulting df res_df is just an example of how the data might look after combining two data-frames, It can also be like this:
Student_id subject marks year_of_join column_with_info
1 English 70 NaN NaN
1 math 90 2020 some_info1
1 science 60 2020 some_info2
1 social 80 2020 some_info3
2 English 90 NaN NaN
2 math 50 NaN NaN
2 science 70 2019 some_info4
2 social 40 2019 some_info5
Thanks in advance for the help!
Use GroupBy.cumcount to build a helper column, then merge with a left join:
# Number the rows within each Student_id (0, 1, 2, ...) in both frames
df['g'] = df.groupby('Student_id').cumcount()
df1['g'] = df1.groupby('Student_id').cumcount()
# ('Student_id', 'g') is now a unique key on both sides
df = df.merge(df1, on=['Student_id','g'], how='left').drop('g', axis=1)
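For reference, a quick look at what the helper column contains (my own illustration, using the df1 above):
>>> df1.groupby('Student_id').cumcount().tolist()
[0, 1, 2, 0, 1]
cumcount numbers the rows within each Student_id from zero, so the left join pairs the rows up positionally within each student.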

Pandas rolling apply to df where filter based on values in current row

I have a Pandas dataframe with a datetime column (which I've used as a DatetimeIndex), a categorical column, and a numerical column. I'd like to apply a complex function to the numerical column over rows where the categorical column matches the current row, within a short (ten-day) window lagging the current row (non-inclusive).
As a contrived example:
import pandas as pd

names = ['steve', 'bob', 'harry', 'jeff'] * 5
df = pd.DataFrame(
    index=pd.date_range(start='2018-10-10', end='2018-10-29', freq='D'),
    data={'value': list(range(20)),
          'name': names}
)
produces a simple dataframe, to which I'd like to add another column (result) that calculates the number of rows * the sum of the values in 'value' (or something similar - just a formula that Pandas has no built-in function for). So for the dataframe above, I'd like the following:
value name result
2018-10-10 0 steve NaN
2018-10-11 1 bob NaN
2018-10-12 2 harry NaN
2018-10-13 3 jeff NaN
2018-10-14 4 steve 0
2018-10-15 5 bob 1
2018-10-16 6 harry 2
2018-10-17 7 jeff 3
2018-10-18 8 steve 8
2018-10-19 9 bob 12
2018-10-20 10 harry 16
2018-10-21 11 jeff 20
2018-10-22 12 steve 24
2018-10-23 13 bob 28
2018-10-24 14 harry 32
2018-10-25 15 jeff 36
2018-10-26 16 steve 40
2018-10-27 17 bob 44
2018-10-28 18 harry 48
2018-10-29 19 jeff 52
I can write my own function for this and use it in pandas.apply:
from datetime import timedelta

def rolling_apply(df, time, window_size=timedelta(days=10)):
    event_name = df.loc[time, 'name']
    # Rows with the same name, strictly before the current time and within
    # the lagging window
    return df[
        (df['name'] == event_name) &
        (df.index < time) &
        (df.index >= time - window_size)
    ]

df['result'] = df.apply(
    lambda x: rolling_apply(df, x.name)['value'].sum()
              * rolling_apply(df, x.name)['value'].count(),
    axis=1)
but performance gets pretty terrible pretty quickly as my data grows. pandas.rolling.apply seems sort of appropriate, but I can't quite make it fit what I want to do.
Any suggestions or help would be very much appreciated!
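One possibility (a sketch, assuming a pandas version where time-based rolling windows accept closed='left'): group by name, then use a lagging ten-day window that excludes the current row, which matches the (index < t) & (index >= t - 10 days) filter above.

import pandas as pd

names = ['steve', 'bob', 'harry', 'jeff'] * 5
df = pd.DataFrame(
    index=pd.date_range(start='2018-10-10', end='2018-10-29', freq='D'),
    data={'value': list(range(20)), 'name': names})

def lagging_formula(s):
    # closed='left' makes each window [t - 10 days, t), i.e. lagging and
    # non-inclusive of the current row
    win = s.rolling('10D', closed='left')
    return win.sum() * win.count()

# group_keys=False keeps the original DatetimeIndex so assignment aligns
df['result'] = df.groupby('name', group_keys=False)['value'].apply(lagging_formula)
print(df)

On the example data this reproduces the result column above (with NaN for each name's first row, where the window is empty), and it avoids the row-by-row apply.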

How long are Pandas groupby objects remembered?

I have the following example Python 3.4 script. It does the following:
creates a dataframe,
converts the date variable to datetime64 format,
creates a groupby object based on two categorical variables,
produces a dataframe that contains a count of the number items in each group,
merges the count dataframe back with the original dataframe to create a column containing the number of rows in each group,
creates a column containing the difference in dates between sequential rows.
Here is the script:
import numpy as np
import pandas as pd
# Create dataframe consisting of id, date and two categories (gender and age)
tempDF = pd.DataFrame({ 'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})
# Convert date to datetime
tempDF['date'] = pd.to_datetime(tempDF['date'])
# Create groupby object based on two categorical variables
tempGroupby = tempDF.sort_values(['gender','age','id']).groupby(['gender','age'])
# Count number in each group and merge with original dataframe to create 'count' column
tempCountsDF = tempGroupby['id'].count().reset_index(drop=False)
tempCountsDF = tempCountsDF.rename(columns={'id': 'count'})
tempDF = tempDF.merge(tempCountsDF, on=['gender','age'])
# Calculate difference between consecutive rows in each group. (First row in each
# group should have date difference = NaT)
tempGroupby = tempDF.sort_values(['gender','age','id']).groupby(['gender','age'])
tempDF['diff'] = tempGroupby['date'].diff()
print(tempDF)
This script produces the following output:
age date gender id count diff
0 young 2015-02-04 02:34:00 male 1 2 NaT
1 young 2015-10-05 08:52:00 male 10 2 243 days 06:18:00
2 old 2015-06-04 12:34:00 female 2 3 NaT
3 old 2015-09-04 23:03:00 female 3 3 92 days 10:29:00
4 old 2015-04-21 12:59:00 female 6 3 -137 days +13:56:00
5 old 2015-12-04 01:00:00 male 4 6 NaT
6 old 2015-04-15 07:12:00 male 5 6 -233 days +06:12:00
7 old 2015-06-05 11:12:00 male 9 6 51 days 04:00:00
8 old 2015-05-19 19:22:00 male 12 6 -17 days +08:10:00
9 old 2015-04-06 12:57:00 male 15 6 -44 days +17:35:00
10 old 2015-06-15 03:23:00 male 17 6 69 days 14:26:00
11 young 2015-12-05 14:19:00 female 11 4 NaT
12 young 2015-05-27 22:31:00 female 13 4 -192 days +08:12:00
13 young 2015-01-06 11:09:00 female 14 4 -142 days +12:38:00
14 young 2015-06-19 05:37:00 female 18 4 163 days 18:28:00
And this exactly what I'd expect. However, it seems to rely on creating the groupby object twice (in exactly the same way). If the second groupby definition is commented out, it seems to lead to a very different output in the diff column:
import numpy as np
import pandas as pd
# ... (identical to the script above, up to and including the merge) ...
# Calculate difference between consecutive rows in each group. (First row in each
# group should have date difference = NaT)
# ****** THIS TIME THE FOLLOWING GROUPBY DEFINITION IS COMMENTED OUT *****
# tempGroupby = tempDF.sort_values(['gender','age','id']).groupby(['gender','age'])
tempDF['diff'] = tempGroupby['date'].diff()
print(tempDF)
And this time the output is very different (and NOT what I wanted at all):
age date gender id count diff
0 young 2015-02-04 02:34:00 male 1 2 NaT
1 young 2015-10-05 08:52:00 male 10 2 NaT
2 old 2015-06-04 12:34:00 female 2 3 92 days 10:29:00
3 old 2015-09-04 23:03:00 female 3 3 NaT
4 old 2015-04-21 12:59:00 female 6 3 -233 days +06:12:00
5 old 2015-12-04 01:00:00 male 4 6 -137 days +13:56:00
6 old 2015-04-15 07:12:00 male 5 6 NaT
7 old 2015-06-05 11:12:00 male 9 6 NaT
8 old 2015-05-19 19:22:00 male 12 6 51 days 04:00:00
9 old 2015-04-06 12:57:00 male 15 6 243 days 06:18:00
10 old 2015-06-15 03:23:00 male 17 6 NaT
11 young 2015-12-05 14:19:00 female 11 4 -17 days +08:10:00
12 young 2015-05-27 22:31:00 female 13 4 -192 days +08:12:00
13 young 2015-01-06 11:09:00 female 14 4 -142 days +12:38:00
14 young 2015-06-19 05:37:00 female 18 4 -44 days +17:35:00
(In my real-life script the results seem to be a little erratic, sometimes it works and sometimes it doesn't. But in the above script, the different outputs seem to occur consistently.)
Why is it necessary to recreate the groupby object on what is, essentially, the same dataframe (albeit with an additional column added) immediately before using the .diff() function? This seems very dangerous to me.
The dataframe is not the same: the merge changed the index. For example:
tempDF.loc[1].id # before the merge
2
tempDF.loc[1].id # after the merge
10
So if you compute tempGroupby from the old tempDF, and the indexes in tempDF then change when you do this:
tempDF['diff'] = tempGroupby['date'].diff()
the indexes do not match the way you expect: each row is assigned the difference that belonged to whichever row had that index label in the old tempDF.
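One way to sidestep the problem entirely (my suggestion, not part of the answer above): use transform for the count, which returns a result aligned to the original index, so no merge (and therefore no re-indexing) happens before the diff:

# Count per group, broadcast back onto the original rows
grouped = tempDF.sort_values(['gender', 'age', 'id']).groupby(['gender', 'age'])
tempDF['count'] = grouped['id'].transform('count')
tempDF['diff'] = grouped['date'].diff()

Note that rows with NaN in gender or age fall out of the groupby and get NaN here, whereas the merge-based version dropped those rows from the frame altogether.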
