Fuzzy matching between columns of different dataframes (of different lengths) [duplicate]

Fuzzy matching between columns of different dataframes (of different lengths) [duplicate] - python

I have two DataFrames which I want to merge based on a column. However, due to alternate spellings, different number of spaces, absence/presence of diacritical marks, I would like to be able to merge as long as they are similar to one another.
Any similarity algorithm will do (soundex, Levenshtein, difflib's).
Say one DataFrame has the following data:
df1 = DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])
number
one 1
two 2
three 3
four 4
five 5
df2 = DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])
letter
one a
too b
three c
fours d
five e
Then I want to get the resulting DataFrame
number letter
one 1 a
two 2 b
three 3 c
four 4 d
five 5 e

Similar to #locojay suggestion, you can apply difflib's get_close_matches to df2's index and then apply a join:
In [23]: import difflib
In [24]: difflib.get_close_matches
Out[24]: <function difflib.get_close_matches>
In [25]: df2.index = df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])
In [26]: df2
Out[26]:
letter
one a
two b
three c
four d
five e
In [31]: df1.join(df2)
Out[31]:
number letter
one 1 a
two 2 b
three 3 c
four 4 d
five 5 e
.
If these were columns, in the same vein you could apply to the column then merge:
df1 = DataFrame([[1,'one'],[2,'two'],[3,'three'],[4,'four'],[5,'five']], columns=['number', 'name'])
df2 = DataFrame([['a','one'],['b','too'],['c','three'],['d','fours'],['e','five']], columns=['letter', 'name'])
df2['name'] = df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])[0])
df1.merge(df2)

Using fuzzywuzzy
Since there are no examples with the fuzzywuzzy package, here's a function I wrote which will return all matches based on a threshold you can set as a user:
Example datframe
df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})
# df1
Key
0 Apple
1 Banana
2 Orange
3 Strawberry
# df2
Key
0 Aple
1 Mango
2 Orag
3 Straw
4 Bannanna
5 Berry
Function for fuzzy matching
def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
"""
:param df_1: the left table to join
:param df_2: the right table to join
:param key1: key column of the left table
:param key2: key column of the right table
:param threshold: how close the matches should be to return a match, based on Levenshtein distance
:param limit: the amount of matches that will get returned, these are sorted high to low
:return: dataframe with boths keys and matches
"""
s = df_2[key2].tolist()
m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
df_1['matches'] = m
m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
df_1['matches'] = m2
return df_1
Using our function on the dataframes: #1
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
fuzzy_merge(df1, df2, 'Key', 'Key', threshold=80)
Key matches
0 Apple Aple
1 Banana Bannanna
2 Orange Orag
3 Strawberry Straw, Berry
Using our function on the dataframes: #2
df1 = pd.DataFrame({'Col1':['Microsoft', 'Google', 'Amazon', 'IBM']})
df2 = pd.DataFrame({'Col2':['Mcrsoft', 'gogle', 'Amason', 'BIM']})
fuzzy_merge(df1, df2, 'Col1', 'Col2', 80)
Col1 matches
0 Microsoft Mcrsoft
1 Google gogle
2 Amazon Amason
3 IBM
Installation:
Pip
pip install fuzzywuzzy
Anaconda
conda install -c conda-forge fuzzywuzzy

I have written a Python package which aims to solve this problem:
pip install fuzzymatcher
You can find the repo here and docs here.
Basic usage:
Given two dataframes df_left and df_right, which you want to fuzzy join, you can write the following:
from fuzzymatcher import link_table, fuzzy_left_join
# Columns to match on from df_left
left_on = ["fname", "mname", "lname", "dob"]
# Columns to match on from df_right
right_on = ["name", "middlename", "surname", "date"]
# The link table potentially contains several matches for each record
fuzzymatcher.link_table(df_left, df_right, left_on, right_on)
Or if you just want to link on the closest match:
fuzzymatcher.fuzzy_left_join(df_left, df_right, left_on, right_on)

I would use Jaro-Winkler, because it is one of the most performant and accurate approximate string matching algorithms currently available [Cohen, et al.], [Winkler].
This is how I would do it with Jaro-Winkler from the jellyfish package:
def get_closest_match(x, list_strings):
best_match = None
highest_jw = 0
for current_string in list_strings:
current_score = jellyfish.jaro_winkler(x, current_string)
if(current_score > highest_jw):
highest_jw = current_score
best_match = current_string
return best_match
df1 = pandas.DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])
df2 = pandas.DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])
df2.index = df2.index.map(lambda x: get_closest_match(x, df1.index))
df1.join(df2)
Output:
number letter
one 1 a
two 2 b
three 3 c
four 4 d
five 5 e

For a general approach: fuzzy_merge
For a more general scenario in which we want to merge columns from two dataframes which contain slightly different strings, the following function uses difflib.get_close_matches along with merge in order to mimic the functionality of pandas' merge but with fuzzy matching:
import difflib
def fuzzy_merge(df1, df2, left_on, right_on, how='inner', cutoff=0.6):
df_other= df2.copy()
df_other[left_on] = [get_closest_match(x, df1[left_on], cutoff)
for x in df_other[right_on]]
return df1.merge(df_other, on=left_on, how=how)
def get_closest_match(x, other, cutoff):
matches = difflib.get_close_matches(x, other, cutoff=cutoff)
return matches[0] if matches else None
Here are some use cases with two sample dataframes:
print(df1)
key number
0 one 1
1 two 2
2 three 3
3 four 4
4 five 5
print(df2)
key_close letter
0 three c
1 one a
2 too b
3 fours d
4 a very different string e
With the above example, we'd get:
fuzzy_merge(df1, df2, left_on='key', right_on='key_close')
key number key_close letter
0 one 1 one a
1 two 2 too b
2 three 3 three c
3 four 4 fours d
And we could do a left join with:
fuzzy_merge(df1, df2, left_on='key', right_on='key_close', how='left')
key number key_close letter
0 one 1 one a
1 two 2 too b
2 three 3 three c
3 four 4 fours d
4 five 5 NaN NaN
For a right join, we'd have all non-matching keys in the left dataframe to None:
fuzzy_merge(df1, df2, left_on='key', right_on='key_close', how='right')
key number key_close letter
0 one 1.0 one a
1 two 2.0 too b
2 three 3.0 three c
3 four 4.0 fours d
4 None NaN a very different string e
Also note that difflib.get_close_matches will return an empty list if no item is matched within the cutoff. In the shared example, if we change the last index in df2 to say:
print(df2)
letter
one a
too b
three c
fours d
a very different string e
We'd get an index out of range error:
df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])
IndexError: list index out of range
In order to solve this the above function get_closest_match will return the closest match by indexing the list returned by difflib.get_close_matches only if it actually contains any matches.

http://pandas.pydata.org/pandas-docs/dev/merging.html does not have a hook function to do this on the fly. Would be nice though...
I would just do a separate step and use difflib getclosest_matches to create a new column in one of the 2 dataframes and the merge/join on the fuzzy matched column

I used Fuzzymatcher package and this worked well for me. Visit this link for more details on this.
use the below command to install
pip install fuzzymatcher
Below is the sample Code (already submitted by RobinL above)
from fuzzymatcher import link_table, fuzzy_left_join
# Columns to match on from df_left
left_on = ["fname", "mname", "lname", "dob"]
# Columns to match on from df_right
right_on = ["name", "middlename", "surname", "date"]
# The link table potentially contains several matches for each record
fuzzymatcher.link_table(df_left, df_right, left_on, right_on)
Errors you may get
ZeroDivisionError: float division by zero---> Refer to this
link to resolve it
OperationalError: No Such Module:fts4 --> downlaod the sqlite3.dll
from here and replace the DLL file in your python or anaconda
DLLs folder.
Pros :
Works faster. In my case, I compared one dataframe with 3000 rows with anohter dataframe with 170,000 records . This also uses SQLite3 search across text. So faster than many
Can check across multiple columns and 2 dataframes. In my case, I was looking for closest match based on address and company name. Sometimes, company name might be same but address is the good thing to check too.
Gives you score for all the closest matches for the same record. you choose whats the cutoff score.
cons:
Original package installation is buggy
Required C++ and visual studios installed too
Wont work for 64 bit anaconda/Python

There is a package called fuzzy_pandas that can use levenshtein, jaro, metaphone and bilenco methods. With some great examples here
import pandas as pd
import fuzzy_pandas as fpd
df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})
results = fpd.fuzzy_merge(df1, df2,
left_on='Key',
right_on='Key',
method='levenshtein',
threshold=0.6)
results.head()
Key Key
0 Apple Aple
1 Banana Bannanna
2 Orange Orag

As a heads up, this basically works, except if no match is found, or if you have NaNs in either column. Instead of directly applying get_close_matches, I found it easier to apply the following function. The choice of NaN replacements will depend a lot on your dataset.
def fuzzy_match(a, b):
left = '1' if pd.isnull(a) else a
right = b.fillna('2')
out = difflib.get_close_matches(left, right)
return out[0] if out else np.NaN

You can use d6tjoin for that
import d6tjoin.top1
d6tjoin.top1.MergeTop1(df1.reset_index(),df2.reset_index(),
fuzzy_left_on=['index'],fuzzy_right_on=['index']).merge()['merged']
index number index_right letter
0 one 1 one a
1 two 2 too b
2 three 3 three c
3 four 4 fours d
4 five 5 five e
It has a variety of additional features such as:
check join quality, pre and post join
customize similarity function, eg edit distance vs hamming distance
specify max distance
multi-core compute
For details see
MergeTop1 examples - Best match join examples notebook
PreJoin examples - Examples for diagnosing join problems

I have used fuzzywuzz in a very minimal way whilst matching the existing behaviour and keywords of merge in pandas.
Just specify your accepted threshold for matching (between 0 and 100):
from fuzzywuzzy import process
def fuzzy_merge(df, df2, on=None, left_on=None, right_on=None, how='inner', threshold=80):
def fuzzy_apply(x, df, column, threshold=threshold):
if type(x)!=str:
return None
match, score, *_ = process.extract(x, df[column], limit=1)[0]
if score >= threshold:
return match
else:
return None
if on is not None:
left_on = on
right_on = on
# create temp column as the best fuzzy match (or None!)
df2['tmp'] = df2[right_on].apply(
fuzzy_apply,
df=df,
column=left_on,
threshold=threshold
)
merged_df = df.merge(df2, how=how, left_on=left_on, right_on='tmp')
del merged_df['tmp']
return merged_df
Try it out using the example data:
df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})
fuzzy_merge(df, df2, on='Key', threshold=80)

Using thefuzz
Using SeatGeek's great package thefuzz, which makes use of Levenshtein distance. This works with data held in columns. It adds matches as rows rather than columns, to preserve a tidy dataset, and allows additional columns to be easily pulled through to the output dataframe.
Sample data
df1 = pd.DataFrame({'col_a':['one','two','three','four','five'], 'col_b':[1, 2, 3, 4, 5]})
col_a col_b
0 one 1
1 two 2
2 three 3
3 four 4
4 five 5
df2 = pd.DataFrame({'col_a':['one','too','three','fours','five'], 'col_b':['a','b','c','d','e']})
col_a col_b
0 one a
1 too b
2 three c
3 fours d
4 five e
Function used to do the matching
def fuzzy_match(
df_left, df_right, column_left, column_right, threshold=90, limit=1
):
# Create a series
series_matches = df_left[column_left].apply(
lambda x: process.extract(x, df_right[column_right], limit=limit) # Creates a series with id from df_left and column name _column_left_, with _limit_ matches per item
)
# Convert matches to a tidy dataframe
df_matches = series_matches.to_frame()
df_matches = df_matches.explode(column_left) # Convert list of matches to rows
df_matches[
['match_string', 'match_score', 'df_right_id']
] = pd.DataFrame(df_matches[column_left].tolist(), index=df_matches.index) # Convert match tuple to columns
df_matches.drop(column_left, axis=1, inplace=True) # Drop column of match tuples
# Reset index, as in creating a tidy dataframe we've introduced multiple rows per id, so that no longer functions well as the index
if df_matches.index.name:
index_name = df_matches.index.name # Stash index name
else:
index_name = 'index' # Default used by pandas
df_matches.reset_index(inplace=True)
df_matches.rename(columns={index_name: 'df_left_id'}, inplace=True) # The previous index has now become a column: rename for ease of reference
# Drop matches below threshold
df_matches.drop(
df_matches.loc[df_matches['match_score'] < threshold].index,
inplace=True
)
return df_matches
Use function and merge data
import pandas as pd
from thefuzz import process
df_matches = fuzzy_match(
df1,
df2,
'col_a',
'col_a',
threshold=60,
limit=1
)
df_output = df1.merge(
df_matches,
how='left',
left_index=True,
right_on='df_left_id'
).merge(
df2,
how='left',
left_on='df_right_id',
right_index=True,
suffixes=['_df1', '_df2']
)
df_output.set_index('df_left_id', inplace=True) # For some reason the first merge operation wrecks the dataframe's index. Recreated from the value we have in the matches lookup table
df_output = df_output[['col_a_df1', 'col_b_df1', 'col_b_df2']] # Drop columns used in the matching
df_output.index.name = 'id'
id col_a_df1 col_b_df1 col_b_df2
0 one 1 a
1 two 2 b
2 three 3 c
3 four 4 d
4 five 5 e
Tip: Fuzzy matching using thefuzz is much quicker if you optionally install the python-Levenshtein package too.

For more complex use cases to match rows with many columns you can use recordlinkage package. recordlinkage provides all the tools to fuzzy match rows between pandas data frames which helps to deduplicate your data when merging. I have written a detailed article about the package here

if the join axis is numeric this could also be used to match indexes with a specified tolerance:
def fuzzy_left_join(df1, df2, tol=None):
index1 = df1.index.values
index2 = df2.index.values
diff = np.abs(index1.reshape((-1, 1)) - index2)
mask_j = np.argmin(diff, axis=1) # min. of each column
mask_i = np.arange(mask_j.shape[0])
df1_ = df1.iloc[mask_i]
df2_ = df2.iloc[mask_j]
if tol is not None:
mask = np.abs(df2_.index.values - df1_.index.values) <= tol
df1_ = df1_.loc[mask]
df2_ = df2_.loc[mask]
df2_.index = df1_.index
out = pd.concat([df1_, df2_], axis=1)
return out

TheFuzz is the new version of a fuzzywuzzy
In order to fuzzy-join string-elements in two big tables you can do this:
Use apply to go row by row
Use swifter to parallel, speed up and visualize default apply function (with colored progress bar)
Use OrderedDict from collections to get rid of duplicates in the output of merge and keep the initial order
Increase limit in thefuzz.process.extract to see more options for merge (stored in a list of tuples with % of similarity)
'*' You can use thefuzz.process.extractOne instead of thefuzz.process.extract to return just one best-matched item (without specifying any limit). However, be aware that several results could have same % of similarity and you will get only one of them.
'**' Somehow the swifter takes a minute or two before starting the actual apply. If you need to process small tables you can skip this step and just use progress_apply instead
from thefuzz import process
from collections import OrderedDict
import swifter
def match(x):
matches = process.extract(x, df1, limit=6)
matches = list(OrderedDict((x, True) for x in matches).keys())
print(f'{x:20} : {matches}')
return str(matches)
df1 = df['name'].values
df2['matches'] = df2['name'].swifter.apply(lambda x: match(x))

Related

How to select top N columns in a dataframe with a criteria

Here is my dataframe, it has high dimensionality (big number of columns) more than 10000 columns
The columns in my data are split into 3 categories
columns start with "Basic"
columns end with "_T"
and everything else
a sample of my dataframe is this
RowID Basic1011 Basic2837 Lemon836 Car92_T Manf3953 Brat82 Basic383_T Jot112 ...
1 2 8 4 3 1 5 6 7
2 8 3 5 0 9 7 0 5
I want to have in my dataframe all "Basic" & "_T" columns and only TOP N (variable could be 3, 5, 10, 100, etc) of other columns
I have this code to give me top N for all columns. but what I am looking for just the top N for columns are not "Basic" or "_T"
and by Top I mean the greatest values
Top = 20
df = df.where(df.apply(lambda x: x.eq(x.nlargest(Top)), axis=1), 0)
How can I achieve that?

Step 1: You can use .filter() with regex to filter the columns with the following 2 conditions:
start with "Basic", or
end with "_T"
The regex used is r'(?:^Basic)|(?:_T$)' where:
(?: ) is a non-capturing group of regex. It is for a temporary grouping.
^ is the start of text anchor to indicate start position of text
Basic matches with the text Basic (together with ^, this Basic must be at the beginning of column label)
| is the regex meta-character for or
_T matches the text _T
$ is the end of text anchor to indicate end of text position (together with _T, _T$ indicate _T at the end of column name.
We name these columns as cols_Basic_T
Step 2: Then, use Index.difference() to find others columns. We name these other columns as cols_others.
Step 3: Then, we apply the similar code you used to give you top N for all columns on these selected columns col_others.
Full set of codes:
## Step 1
cols_Basic_T = df.filter(regex=r'(?:^Basic)|(?:_T$)').columns
## Step 2
cols_others = df.columns.difference(cols_Basic_T)
## Step 3
#Top = 20
Top = 3 # use fewer columns here for smaller sample data here
df_others = df[cols_others].where(df[cols_others].apply(lambda x: x.eq(x.nlargest(Top)), axis=1), 0)
# To keep the original column sequence
df_others = df_others[df.columns.intersection(cols_others)]
Results:
cols_Basic_T
print(cols_Basic_T)
Index(['Basic1011', 'Basic2837', 'Car92_T', 'Basic383_T'], dtype='object')
cols_others
print(cols_others)
Index(['Brat82', 'Jot112', 'Lemon836', 'Manf3953', 'RowID'], dtype='object')
df_others
print(df_others)
## With Top 3 shown as non-zeros. Other non-Top3 masked as zeros
RowID Lemon836 Manf3953 Brat82 Jot112
0 0 4 0 5 7
1 0 0 9 7 5

Try something like this, you may have to play around with column selection on outset to be sure you're filtering correctly.
# this gives you column names with Basic or _T anywhere in the column name.
unwanted = df.filter(regex='Basic|_T').columns.tolist()
# the tilda takes the opposite of the criteria, so no Basic or _T
dfn = df[df.columns[~df.columns.isin(unwanted)]]
#apply your filter
Top = 2
df_ranked = dfn.where(dfn.apply(lambda x: x.eq(x.nlargest(Top)), axis=1), 0)
#then merge dfn with df_ranked

How to merge and get the count of non-matching entries efficiently?

I have two data frames like as shown below
import pandas as pd
import numpy as np
source_df = pd.DataFrame({'source_value':['21ABCDE1','22CDEF2','23DEF3','24FGH4']})
client_df = pd.DataFrame({'source_value':['21ABCDE1','29SB','21ABCDE1','29SB','25FG','25FG','31DE','35DE']})
What I would like to do is find the number of source_values which are present in client_df but not present in the source_df.
Please note that there can be duplicates in the client_df but not in source_df
Basically I have to identify them as they are not valid (because they are missing in the parent dataframe which is source_df)
I tried the below but it is for matching entries.
pd.merge(
source_df,client_df,
how="inner",
on = 'source_value').groupby('source_value').size()
How can I elegantly do it for non-matching entries because I have several millions of data (At least 10 million records and can go up to 15 million).
I expect my output to be like as shown below

One option is to get the source_value from source_df as a set and then filter and count values in client_df:
source_values = set(source_df.source_value.to_list())
client_df.source_value[lambda x: ~x.isin(source_values)].value_counts()
#29SB 2
#25FG 2
#35DE 1
#31DE 1
#Name: source_value, dtype: int64

In []: client_df[~client_df.source_value.isin(source_df.source_value)].value_counts()
Out[]:
source_value
29SB 2
25FG 2
35DE 1
31DE 1
dtype: int64

Using Left Merge with Indicator to identify ones in client only
In[]: merge_df = client_df.merge(source_df, how='left', indicator=True)
merge_df[merge_df['_merge']=='left_only'].groupby('source_value').size()
Out[]:
source_value
25FG 2
29SB 2
31DE 1
35DE 1

client_df.groupby('source_value')['source_value'].count().loc[pd.Index(client_df.source_value).difference(source_df.source_value)]

Pandas Inner Join with Lambda

I have the following two frames:
frame1:
id
0 111-111-111
1 111-111-222
2 222-222-222
3 333-333-333
frame2:
data id
0 ones 111-111
1 threes 333-333
And, I have a lambda function that maps the frame1.id to frame2.id:
id_map = lambda x: x[:7]
My goal is to perform an inner join between these two tables, but to have the id go through the lambda. So that the output is:
id data
0 111-111-111 ones
1 111-111-222 ones
2 333-333-333 threes
I've come up with a rather non-elegant solution that almost does what I'm trying to do, however it messes up when the inner join removes rows:
# Save a copy the original ids of frame1
frame1_ids = frame1['id'].copy()
# Apply the id change to frame1
frame1['id'] = frame1['id'].apply(id_map)
# Merge
frame1 = frame1.merge(frame2, how='inner', on='id')
# Set the ids back to what they originally were
frame1['id'] = frame1_ids
Is there a elegant solution for this?

Could use assign to create a dummy id column (newid) to join on like this:
frame1.assign(newid=frame1['id'].str[:7])
.merge(frame2, left_on='newid', right_on='id', suffixes=('','_y'))
.drop(['id_y','newid'], axis=1)
Output:
id data
0 111-111-111 ones
1 111-111-222 ones
2 333-333-333 threes

use of .apply() for comparing elements

I have a dataframe df of thousands of items where the value of the column "group" repeats from two to ten times. The dataframe has seven columns, one of them is named "url"; another one "flag". All of them are strings.
I would like to use Pandas in order to traverse through these groups. For each group I would like to find the longest item in the "url" column and store a "0" or "1" in the "flag" column that corresponds to that item. I have tried the following but I can not make it work. I would like to 1) get rid of the loop below, and 2) be able to compare all items in the group through df.apply(...)
all_groups = df["group"].drop_duplicates.tolist()
for item in all_groups:
df[df["group"]==item].apply(lambda x: Here I would like to compare the items within one group)
Can apply() and lambda be used in this context? Any faster way to implement this?
Thank you!

Using groupby() and .transform() you could do something like:
df['flag'] = df.groupby('group')['url'].transform(lambda x: x.str.len() == x.map(len).max())
Which provides a boolean value for df['flag']. If you need it as 0, 1 then just add .astype(int) to the end.

Unless you write code and find it's running slowly don't sweat optimizing it. In the words of Donald Knuth "Premature optimization is the root of all evil."
If you want to use apply and lambda (as mentioned in the question):
df = pd.DataFrame({'url': ['abc', 'de', 'fghi', 'jkl', 'm'], 'group': list('aaabb'), 'flag': 0})
Looks like:
flag group url
0 0 a abc
1 0 a de
2 0 a fghi
3 0 b jkl
4 0 b m
Then figure out which elements should have their flag variable set.
indices = df.groupby('group')['url'].apply(lambda s: s.str.len().idxmax())
df.loc[indices, 'flag'] = 1
Note this only gets the first url with maximal length. You can compare the url lengths to the maximum if you want different behavior.
So df now looks like:
flag group url
0 0 a abc
1 0 a de
2 1 a fghi
3 1 b jkl
4 0 b m

Equivalent of R data.table rolling join in Python and PySpark

Does anyone know how to do an R data.table rolling join in PySpark?
Borrowing the example and nice explanation of rolling joins from Ben here;
sales<-data.table(saleID=c("S1","S2","S3","S4","S5"),
saleDate=as.Date(c("2014-2-20","2014-5-1","2014-6-15","2014-7- 1","2014-12-31")))
commercials<-data.table(commercialID=c("C1","C2","C3","C4"),
commercialDate=as.Date(c("2014-1-1","2014-4-1","2014-7-1","2014-9-15")))
setkey(sales,"saleDate")
setkey(commercials,"commercialDate")
sales[commercials, roll=TRUE]
Result being;
saleDate saleID commercialID
1: 2014-01-01 NA C1
2: 2014-04-01 S1 C2
3: 2014-07-01 S4 C3
4: 2014-09-15 S4 C4
Many thanks for the help.

Rolling join is not join + fillna
First of all a rolling join is not the same as a join and a fillna! That would only be the case if the key of the table that is joined against (in terms of data.table that would be the left table and a right-join) has equivalents in the main table. A data.table rolling join does not require this.
There is no direct equivalent, as far as I know and I searched for quite a while. There is even an issue for it https://github.com/pandas-dev/pandas/issues/7546. However:
Solution in pandas:
There is a solution though in pandas. Let's assume your right data.table is table A and your left data.table is table B.
Sort the table A and and B each by key.
Add a column tag to A which are all 0 and a column tag to B that are all 1.
Delete all columns except the key and tagfrom B (can be omitted, but it is clearer this way) and call the table B'. Keep B as an original - we are going to need it later.
Concatenate A with B' to C and ignore the fact that the rows from B' has many NAs.
Sort C by key.
Make a new cumsum column with C = C.assign(groupNr = np.cumsum(C.tag))
Using filtering (query) on tag get rid of all B'-rows.
Add a running counter column groupNr to the original B (integers from 0 to N-1 or from 1 to N, depending on whether you want forward or backward rolling join).
Join B with C on groupNr.
Programming code
#0. 'date' is the key for the rolling join. It does not have to be a date.
A = pd.DataFrame.from_dict(
{'date': pd.to_datetime(["2014-3-1", "2014-5-1", "2014-6-1", "2014-7-1", "2014-12-1"]),
'value': ["a1", "a2", "a3", "a4", "a5"]})
B = pd.DataFrame.from_dict(
{'date': pd.to_datetime(["2014-1-15", "2014-3-15", "2014-6-15", "2014-8-15", "2014-11-15", "2014-12-15"]),
'value': ["b1", "b2", "b3", "b4", "b5", "b6"]})
#1. Sort the table A and and B each by key.
A = A.sort_values('date')
B = B.sort_values('date')
#2. Add a column tag to A which are all 0 and a column tag to B that are all 1.
A['tag'] = 0
B['tag'] = 1
#3. Delete all columns except the key and tagfrom B (can be omitted, but it is clearer this way) and call the table B'. Keep B as an original - we are going to need it later.
B_ = B[['date','tag']] # You need two [], because you get a series otherwise.
#4. Concatenate A with B' to C and ignore the fact that the rows from B' has many NAs.
C = pd.concat([A, B_])
#5. Sort C by key.
C = C.sort_values('date')
#6. Make a new cumsum column with C = C.assign(groupNr = np.cumsum(C.tag))
C = C.assign(groupNr = np.cumsum(C.tag))
#7. Using filtering (query) on tag get rid of all B'-rows.
C = C[C.tag == 0]
#8. Add a running counter column groupNr to the original B (integers from 0 to N-1 or from 1 to N, depending on whether you want forward or backward rolling join).
B['groupNr'] = range(len(B)+1)[1:] # B's values are carried forward to A's values
B['groupNr'] = range(len(B)) # B's values are carried backward to A's values
#9. Join B with C on groupNr to D.
D = C.set_index('groupNr').join(B.set_index('groupNr'), lsuffix='_A', rsuffix='_B')

I also had a similar problem, solved with pandas.merge_asof.
Here is a quick solution for the exposed case:
sales = pd.DataFrame.from_dict(
{'saleDate': pd.to_datetime(["2014-02-20","2014-05-01","2014-06-15","2014-07-01","2014-12-31"]),
'saleID': ["S1","S2","S3","S4","S5"]})
commercials = pd.DataFrame.from_dict(
{'commercialDate': pd.to_datetime(["2014-01-01","2014-04-01","2014-07-01","2014-09-15"]),
'commercialID': ["C1","C2","C3","C4"]}
result = pd.merge_asof(commercials,
sales,
left_on='commercialDate',
right_on='saleDate')
# Ordering for easier comparison
result = result[['commercialDate','saleID','commercialID' ]]
The result is the same as expected:
commercialDate saleID commercialID
0 2014-01-01 NaN C1
1 2014-04-01 S1 C2
2 2014-07-01 S4 C3
3 2014-09-15 S4 C4

This might be a simpler solution.
sales.asfreq("D",method="ffill").join(commercials,how="outer").dropna(subset= ["commercialID"])
I tested this on the first example in https://gormanalysis.com/r-data-table-rolling-joins/ and it works. A similar approach can potentially be used for other rolling joins.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Fuzzy matching between columns of different dataframes (of different lengths) [duplicate] - python

http://pandas.pydata.org/pandas-docs/dev/merging.html does not have a hook function to do this on the fly. Would be nice though... I would just do a separate step and use difflib getclosest_matches to create a new column in one of the 2 dataframes and the merge/join on the fuzzy matched column

For more complex use cases to match rows with many columns you can use recordlinkage package. recordlinkage provides all the tools to fuzzy match rows between pandas data frames which helps to deduplicate your data when merging. I have written a detailed article about the package here

Related

How to select top N columns in a dataframe with a criteria

How to merge and get the count of non-matching entries efficiently?

Pandas Inner Join with Lambda

use of .apply() for comparing elements

Equivalent of R data.table rolling join in Python and PySpark

Categories

Resources