Fuzzy matching two dataframes and joining on result

I'm trying to join two dataframes on string columns that are not identical. I realise this has been asked a lot, but I am struggling to find anything relevant to my need. The code I have is as follows:
import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
xls = pd.ExcelFile(filename)
df_1 = pd.read_excel(xls, sheet_name="Sheet 1")
df_2 = pd.read_excel(xls, sheet_name="Sheet 2")
df_2['key'] = df_2['Name'].apply(lambda x : [process.extract(x, df_1['Name'], limit=1)][0][0][0])
The idea would then be to join the two dataframes based on df_2['key']. However, when I run this code, it runs but does not return anything. The df sizes are as follows: df_1 (3366, 8) and df_2 (1771, 6).
Is there a better way to do this?

This code prints nothing because an assignment is all it does.
df_2['key'] = ... just adds a 'key' column to the df_2 dataframe; it neither merges anything nor produces output.
If you want to merge the dataframes, your code should look similar to this:
name_list_1 = df_1['Name'].tolist()
name_list_2 = df_2['Name'].tolist()
matches = list(map(lambda x: process.extractOne(
x, name_list_1, scorer=fuzz.token_set_ratio)[:2], name_list_2))
df_keys = pd.DataFrame(matches, columns=['key', 'score'])
df_2 = pd.merge(df_2, df_keys, left_index=True, right_index=True)
df_2 = df_2[df_2['score'] > 70]
df_3 = pd.merge(df_1, df_2, left_on='Name', right_on='key', how='outer')
print(df_3)
I use the extractOne method, which I think suits your situation better. It is important to experiment with the scorer parameter, as it heavily affects the matching result.
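For anyone who wants to try the match-then-merge pattern without installing fuzzywuzzy, here is a minimal sketch that swaps in difflib.get_close_matches from the standard library. This is not the scorer used above, so the matches will differ on real data. The sample frames, the extra columns, the best_match helper and the 0.6 cutoff are all made up for illustration.

```python
import difflib

import pandas as pd

# Hypothetical stand-ins for the two Excel sheets.
df_1 = pd.DataFrame({"Name": ["Acme Corporation", "Globex Inc", "Initech LLC"],
                     "city": ["Paris", "Berlin", "Austin"]})
df_2 = pd.DataFrame({"Name": ["Acme Corp", "Globex", "Umbrella"],
                     "revenue": [10, 20, 30]})

choices = df_1["Name"].tolist()


def best_match(name, cutoff=0.6):
    """Return the closest name in `choices`, or None if nothing clears the cutoff."""
    hits = difflib.get_close_matches(name, choices, n=1, cutoff=cutoff)
    return hits[0] if hits else None


# Build the join key on df_2, drop rows with no acceptable match, then merge.
df_2["key"] = df_2["Name"].map(best_match)
matched = df_2.dropna(subset=["key"])
df_3 = df_1.merge(matched, left_on="Name", right_on="key",
                  how="left", suffixes=("_1", "_2"))
print(df_3)
```

The cutoff plays the same role as the score > 70 filter above: rows whose best candidate is too dissimilar are left unmatched rather than joined to a wrong name.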

You would do better to use process.extractOne() instead. Your code will look like:
name_list_1 = df_1['Name'].tolist()
name_list_2 = df_2['Name'].tolist()
# One best match from df_1 for every name in df_2
key = list(map(lambda x: process.extractOne(x, name_list_1)[0], name_list_2))
# key has one entry per row of df_2, so it is stored on df_2
df_2['key'] = key
Then you can make the join on the key column.

Related

How to speed up an ordinary dataframe loop in python? vectorisation? multiprocess?

I have a simple piece of code.
Essentially, I want to speed up my loop that creates a dataframe using dataframes.
I haven't found an example and would appreciate anyone's help.
df_new = []
for df_i in df:
    df_selected = df[df['good_value'] == df_i_list]
    df_new = pd.concat([df_new,df_selected])

Given your code does not work, this is the best I can come up with.
Start with a list of dataframes, then select the rows in your dataframes to another list and then concat in one step.
Since concat is the heavy operation, this makes sure you call it only once, which is how it's meant to be used.
import pandas as pd
dfs = [df1, df2, df3, df4, ...]
sel = [df[df['column_to_filter'] == 'good_value'] for df in dfs]
df_new = pd.concat(sel) # might be useful to add `ignore_index=True`

df_new = df[df['good_value'].isin(df_i_list)]
The pd.concat is 4x slower than .isin()
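A small sketch of why the two approaches are interchangeable here, assuming df_i_list is a list of wanted values (the question does not say what it holds, so the data below is invented). It makes no timing claim; it only checks that both select the same rows.

```python
import pandas as pd

# Hypothetical data: "good_value" is the column being filtered.
df = pd.DataFrame({"good_value": [1, 2, 3, 2, 5],
                   "x": list("abcde")})
df_i_list = [2, 5]  # assumed: the values to keep

# Loop-style: one filter per wanted value, concatenated in a single call.
looped = pd.concat([df[df["good_value"] == v] for v in df_i_list])

# Vectorised: one boolean mask for all wanted values at once.
vectorised = df[df["good_value"].isin(df_i_list)]

# Same rows either way; the loop version may order them differently.
print(looped.sort_index().equals(vectorised))
```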

Import multiple CSV files into pandas and merge those based on column values

I have 4 dataframes:
import pandas as pd
df_inventory_parts = pd.read_csv('inventory_parts.csv')
df_colors = pd.read_csv('colors.csv')
df_part_categories = pd.read_csv('part_categories.csv')
df_parts = pd.read_csv('parts.csv')
Now I have merged them into 1 new dataframe like:
merged = pd.merge(
left=df_inventory_parts,
right=df_colors,
how='left',
left_on='color_id',
right_on='id')
merged = pd.merge(
left=merged,
right=df_parts,
how='left',
left_on='part_num',
right_on='part_num')
merged = pd.merge(
left=merged,
right=df_part_categories,
how='left',
left_on='part_cat_id',
right_on='id')
merged.head(20)
This gives the correct dataset that I'm looking for. However, I was wondering if there's a shorter or faster way of writing this. Using pd.merge 3 times seems a bit excessive.

You have a pretty clear section of code that does exactly what you want. You want to do three merges so using merge() three times is adequate rather than excessive.
You can make your code a bit shorter by using the fact that DataFrames have a merge method, so you don't need the left argument. You can also chain the calls, but I would point out that my example does not look as neat and readable as your longer-form code.
merged = df_inventory_parts.merge(
    right=df_colors,
    how='left',
    left_on='color_id',
    right_on='id').merge(
    right=df_parts,
    how='left',
    left_on='part_num',
    right_on='part_num').merge(
    right=df_part_categories,
    how='left',
    left_on='part_cat_id',
    right_on='id')
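If the repetition is what bothers you, another option is to describe each merge as data and loop over the descriptions. This is only a sketch: the tiny frames below are invented stand-ins for the CSV files, and the steps list is a made-up name.

```python
import pandas as pd

# Hypothetical miniature versions of the four CSV files.
df_inventory_parts = pd.DataFrame({"part_num": ["p1", "p2"], "color_id": [1, 2]})
df_colors = pd.DataFrame({"id": [1, 2], "color": ["red", "blue"]})
df_parts = pd.DataFrame({"part_num": ["p1", "p2"], "part_cat_id": [10, 20]})
df_part_categories = pd.DataFrame({"id": [10, 20], "category": ["brick", "plate"]})

# Each step: (right-hand frame, key on the left, key on the right).
steps = [
    (df_colors, "color_id", "id"),
    (df_parts, "part_num", "part_num"),
    (df_part_categories, "part_cat_id", "id"),
]

merged = df_inventory_parts
for right, left_on, right_on in steps:
    merged = merged.merge(right, how="left", left_on=left_on, right_on=right_on)

print(merged)
```

Note that both colors and part_categories carry an id column, so the result ends up with id_x and id_y, just as the three explicit merges do.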

Better pattern for storing results in loop?

When I work with data, very often I will have a bunch of similar objects I want to iterate over to do some processing and store the results.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randint(0, 1000, 20))
df2 = pd.DataFrame(np.random.randint(0, 1000, 20))
results = []
for df in [df1, df2]:
    tmp_result = df.median() # do some processing
    results.append(tmp_result) # append results
The problem I have with this is that it's not clear which dataframe the results correspond to. I thought of using the objects as keys for a dict, but this won't always work as dataframes are not hashable objects and can't be used as keys to dicts:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randint(0, 1000, 20))
df2 = pd.DataFrame(np.random.randint(0, 1000, 20))
results = {}
for df in [df1, df2]:
    tmp_result = df.median() # do some processing
    results[df] = tmp_result # doesn't work
I can think of a few hacks to get around this, like defining unique keys for the input objects before the loop, or storing the input and the result as a tuple in the results list. But in my experience those approaches are rather unwieldy and error prone, and I suspect they're not terribly great for memory usage either. Mostly, I just end up using the first example and take care to manually keep track of the position of the results.
Are there any obvious solutions or best practices to this problem here?

You can keep the original dataframe and the result together in a class:
class Whatever:
    def __init__(self, df):
        self.df = df
        self.result = None
whatever1 = Whatever(pd.DataFrame(...))
whatever2 = Whatever(pd.DataFrame(...))
for whatever in [whatever1, whatever2]:
    whatever.result = whatever.df.median()
There are many ways to improve this depending on your situation: generate the result right in the constructor, add a method to generate and store it, compute it on the fly from a property, and so on.
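As one illustration of the "compute it on the fly from a property" variant, here is a minimal sketch. The name attribute and the sample data are additions for the example, not part of the class above.

```python
import pandas as pd


class Whatever:
    def __init__(self, name, df):
        self.name = name  # hypothetical label so results can be identified
        self.df = df

    @property
    def result(self):
        # Recomputed on every access, so it always reflects the current df.
        return self.df.median()


items = [Whatever("first", pd.DataFrame({"x": [1, 2, 3]})),
         Whatever("second", pd.DataFrame({"x": [10, 20, 30]}))]

for item in items:
    print(item.name, item.result["x"])
```

The trade-off is that nothing is cached: if the processing is expensive, storing the result once (as in the loop above) may be the better choice.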

I would concatenate your data frames, adding an index for each data frame, then use a group-by operation.
pd.concat([df1, df2], keys=['df1', 'df2']).groupby(level=0).median()
If your actual processing is more complex, you could use .apply() instead of .median().
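A short sketch of the .apply() variant, using a made-up "spread" calculation (max minus min) as the stand-in for more complex processing. The sample frames are invented.

```python
import pandas as pd

df1 = pd.DataFrame({"v": [1, 5, 9]})
df2 = pd.DataFrame({"v": [10, 20]})

# keys= labels each block of rows with the frame it came from.
stacked = pd.concat([df1, df2], keys=["df1", "df2"])

# Any function that takes a DataFrame (one group) can go here.
spread = stacked.groupby(level=0).apply(lambda g: g.max() - g.min())
print(spread)
```

The result is indexed by the labels passed to keys, so it stays clear which row belongs to which input frame.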

You can try something like this:
dd = {'df1': df1,
      'df2': df2}
results_dict = {}
for k, v in dd.items():
    results_dict[k] = v.mean()
results_df = pd.concat(results_dict, keys=results_dict.keys(), axis=1)
print(results_df)
Output:
      df1     df2
0  561.65  549.85

If you want correspondingly named output dfs, search SO for using globals() in a loop and see if you can name the outputs after the inputs.
For df1, you could set
df1.name = 'df1_output'
then use globals() to set the name of the output df to df1.name. Then you'd have df1 and df1_output.

Python Pandas Group by

I have the code below:
import pandas as pd
Orders = pd.read_excel (r"C:\Users\Bharath Shana\Desktop\Python\Sample.xls", sheet_name='Orders')
Returns = pd.read_excel (r"C:\Users\Bharath Shana\Desktop\Python\Sample.xls", sheet_name='Returns')
Sum_value = pd.DataFrame(Orders['Sales']).sum
Orders_Year = pd.DatetimeIndex(Orders['Order Date']).year
Orders.merge(Returns, how="inner", on="Order ID")
which gives the output as below
My requirement is that I would like to use groupby and see the output as below
Can someone please help me use groupby in my code above? That is, I would like to get everything in a single line by using groupby.
Regards,
Bharath

You can do this by selecting the columns and assigning them to a new dataframe:
grouped = pd.DataFrame()
groupby = ['Year','Segment','Sales']
for i in groupby:
    grouped[i] = Orders[i]
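Since the desired output is not visible in the question, the following is only a guess at what was wanted: total Sales per Year and Segment, computed with an actual groupby. The sample rows are invented; the column names come from the question and the answer above.

```python
import pandas as pd

# Hypothetical rows standing in for the Orders sheet.
Orders = pd.DataFrame({
    "Order Date": pd.to_datetime(["2016-01-05", "2016-03-01",
                                  "2017-02-11", "2017-06-30"]),
    "Segment": ["Consumer", "Consumer", "Corporate", "Consumer"],
    "Sales": [100.0, 50.0, 200.0, 25.0],
})

# Derive the year, then aggregate in one line.
Orders["Year"] = Orders["Order Date"].dt.year
summary = Orders.groupby(["Year", "Segment"], as_index=False)["Sales"].sum()
print(summary)
```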

Iterating over different data frames using an iterator

Suppose I have n number of data frames df_1, df_2, df_3, ... df_n, containing respectively columns named SPEED1 ,SPEED2, SPEED3, ..., SPEEDn, for instance:
import numpy as np
df_1 = pd.DataFrame({'SPEED1':np.random.uniform(0,600,100)})
df_2 = pd.DataFrame({'SPEED2':np.random.uniform(0,600,100)})
and I want to make the same changes to all of the data frames. How do I do so by defining a function on similar lines?
def modify(df,nr):
    df_invalid_nr=df_nr[df_nr['SPEED'+str(nr)]>500]
    df_valid_nr=~df_invalid_nr
    Invalid_cycles_nr=df[df_invalid]
    df=df[df_valid]
    print(Invalid_cycles_nr)
    print(df)
So, when I try to run the above function
modify(df_1,1)
It returns the entire data frame without modification and the invalid cycles as an empty array. I am guessing I need to define the modification on the global dataframe somewhere in the function for this to work.
I am also not sure if I could do this another way, say just looping an iterator through all the data frames. But, I am not sure it will work.
for i in range(1,n+1):
    df_invalid_i=df_i[df_i['SPEED'+str(i)]>500]
    df_valid_i=~df_invalid_i
    Invalid_cycles_i=df[df_invalid]
    df=df[df_valid]
    print(Invalid_cycles_i)
    print(df)
How do I, in general, access df_1 using an iterator? It seems to be a problem.
Any help would be appreciated, thanks!

Solution
Inputs
import pandas as pd
import numpy as np
df_1 = pd.DataFrame({'SPEED1':np.random.uniform(1,600,100)})
df_2 = pd.DataFrame({'SPEED2':np.random.uniform(1,600,100)})
Code
To my mind, a better approach is to store your dfs in a list and enumerate over it, adding a valid column to each df:
for idx, df in enumerate([df_1, df_2]):
    col = 'SPEED'+str(idx+1)
    df['valid'] = df[col] <= 500
print(df_1)
       SPEED1  valid
0  516.395756  False
1   14.643694   True
2  478.085372   True
3  592.831029  False
4    1.431332   True
You can then filter for valid or invalid rows with df_1[df_1.valid] or df_1[df_1.valid == False]
This solution fits your problem as posed; see "Another (better?) solution" for a cleaner approach and the Notes below for the explanations you need.
Another (better?) solution
If it is possible for you, rethink your code. Each DataFrame has a single speed column, so name it SPEED in all of them:
dfs = dict(df_1=pd.DataFrame({'SPEED':np.random.uniform(0,600,100)}),
           df_2=pd.DataFrame({'SPEED':np.random.uniform(0,600,100)}))
That allows you to write the following one-liner:
dfs = dict(map(lambda key_val: (key_val[0],
                                key_val[1].assign(valid = key_val[1]['SPEED'] <= 500)),
               dfs.items()))
print(dfs['df_1'])
        SPEED  valid
0  516.395756  False
1   14.643694   True
2  478.085372   True
3  592.831029  False
4    1.431332   True
Explanations:
dfs.items() returns the key/value pairs, i.e. the names and the DataFrames.
map(foo, bar) applies the function foo (see this answer, and DataFrame assign) to all the elements of bar, i.e. to all the key/value pairs of dfs.items().
dict() converts the result of map back into a dict.
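For comparison, the same transformation can be written as a dict comprehension, which many people find easier to read than dict(map(lambda ...)). This is just an equivalent sketch; the seeded generator is only there to make the example repeatable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # seeded only for repeatability
dfs = {"df_1": pd.DataFrame({"SPEED": rng.uniform(0, 600, 100)}),
       "df_2": pd.DataFrame({"SPEED": rng.uniform(0, 600, 100)})}

# assign() returns a new frame with the extra column; the originals are not mutated.
dfs = {name: frame.assign(valid=frame["SPEED"] <= 500)
       for name, frame in dfs.items()}

print(dfs["df_1"].head())
```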
Notes
About modify
Notice that your function modify is not returning anything. I suggest you read more about mutability and immutability in Python. This article is interesting.
You can then test the following for instance:
def modify(df):
    df = df[df.SPEED1 < 0.5]
    # The change to df is local to the function;
    # it will not modify your input, so return the df...
    return df
# ... and assign the output to apply the changes
df_1 = modify(df_1)
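To see the scoping point in action, here is a small self-contained check. The two function names and the three-row frame are invented for the demonstration.

```python
import pandas as pd


def filter_without_return(df):
    # Rebinds the local name only; the caller's frame is untouched.
    df = df[df["SPEED"] <= 500]


def filter_with_return(df):
    # Hands the filtered frame back so the caller can rebind its own name.
    return df[df["SPEED"] <= 500]


df_1 = pd.DataFrame({"SPEED": [100.0, 550.0, 499.0]})

filter_without_return(df_1)
rows_after_no_return = len(df_1)   # still all rows

df_1 = filter_with_return(df_1)
rows_after_return = len(df_1)      # the row above 500 is gone

print(rows_after_no_return, rows_after_return)
```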
About access df_1 using an iterator
Notice that when you do:
for i in range(1,n+1):
    df_i something
df_i in your loop refers to a variable literally named df_i on every iteration (and not to df_1, df_2, etc.).
To get an object by its name, use globals()['df_'+str(i)] instead (assuming that df_1 to df_n are located in globals()) - from this answer.
To my mind this is not a clean approach. I don't know how you create your DataFrames, but if possible I suggest you store them in a dictionary instead of assigning each one to its own variable manually:
dfs = {}
dfs['df_1'] = ...
or, a bit more automatically, if df_1 to df_n already exist (following the first part of vestland's answer):
dfs = dict((var, eval(var)) for
           var in dir() if
           isinstance(eval(var), pd.core.frame.DataFrame) and 'df_' in var)
Then it would be easier for you to iterate over your DataFrames:
for i in range(1,n+1):
    dfs['df_'+str(i)] something
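Putting the dictionary idea together with the original goal (splitting each frame into valid and invalid cycles at 500), a sketch could look like this. The value of n, the seed and the valid/invalid dict names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seeded only for repeatability
n = 3                            # assumed number of frames

# Build df_1 .. df_n inside a dict instead of as separate variables.
dfs = {f"df_{i}": pd.DataFrame({f"SPEED{i}": rng.uniform(0, 600, 100)})
       for i in range(1, n + 1)}

valid, invalid = {}, {}
for i in range(1, n + 1):
    name = f"df_{i}"
    mask = dfs[name][f"SPEED{i}"] > 500   # True where the cycle is invalid
    invalid[name] = dfs[name][mask]
    valid[name] = dfs[name][~mask]

print({name: (len(valid[name]), len(invalid[name])) for name in dfs})
```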

You can use the globals() function, which lets you get a variable by its name.
I just add df_i = globals()["df_"+str(i)] at the beginning of the for loop and filter with a boolean mask:
for i in range(1,n+1):
    df_i = globals()["df_"+str(i)]
    # Boolean mask marking the rows whose speed is above 500
    invalid_i = df_i['SPEED'+str(i)] > 500
    Invalid_cycles_i = df_i[invalid_i]
    df_i = df_i[~invalid_i]
    print(Invalid_cycles_i)
    print(df_i)

Your code sample leaves me a little confused, but focusing on
I want to make the same changes to all of the data frames.
and
How do I, in general, access df_1 using an iterator?
you can do exactly that by organizing your dataframes (dfs) in a dictionary (dict).
Here's how:
Assuming you've got a bunch of variables in your namespace...
# Imports
import pandas as pd
import numpy as np
# A few dataframes with random numbers
# df_1
np.random.seed(123)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_1 = pd.DataFrame(np.random.randint(100,150,size=(rows, 2)), columns=['a', 'b'])
df_1 = df_1.set_index(rng)
# df_2
np.random.seed(456)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_2 = pd.DataFrame(np.random.randint(100,150,size=(rows, 2)), columns=['c', 'd'])
df_2 = df_2.set_index(rng)
# df_3
np.random.seed(789)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_3 = pd.DataFrame(np.random.randint(100,150,size=(rows, 2)), columns=['e', 'f'])
df_3 = df_3.set_index(rng)
...you can identify all that are dataframes using:
alldfs = [var for var in dir() if isinstance(eval(var), pd.core.frame.DataFrame)]
If you've got a lot of different dataframes but would only like to focus on those that have a prefix like 'df_', you can identify those by...
dfNames = []
for elem in alldfs:
    if str(elem)[:3] == 'df_':
        dfNames.append(elem)
... and then organize them in a dict using:
myFrames = {}
for dfName in dfNames:
    myFrames[dfName] = eval(dfName)
From that dict of interesting dataframes, you can subset those that you'd like to do something with. Here's how you focus only on df_1 and df_2:
invalid = ['df_3']
for inv in invalid:
    myFrames.pop(inv, None)
Now you can reference ALL your valid dfs by looping through them:
for key in myFrames.keys():
    print(myFrames[key])
And that should cover the...
How do I, in general, access df_1 using an iterator?
...part of the question.
And you can of course reference a single dataframe by its name / key in the dict:
print(myFrames['df_1'])
From here you can do something with ALL columns in ALL dataframes.
for key in myFrames.keys():
    myFrames[key] = myFrames[key]*10
    print(myFrames[key])
Or, being a bit more pythonic, you can specify a lambda function and apply that to a subset of columns
# A function
decimator = lambda x: x/10
# A subset of columns:
myCols = ['SPEED1', 'SPEED2']
Apply that function to your subset of columns in your dataframes of interest:
for key in myFrames.keys():
    for col in list(myFrames[key]):
        if col in myCols:
            myFrames[key][col] = myFrames[key][col].apply(decimator)
            print(myFrames[key][col])
So, back to your function...
modify(df_1,1)
... here's my take on it wrapped in a function.
First we'll redefine the dataframes and the function.
Oh, and with this setup, you're going to have to obtain all dfs OUTSIDE your function with alldfs = [var for var in dir() if isinstance(eval(var), pd.core.frame.DataFrame)].
Here's the datasets and the function for an easy copy-paste:
# Imports
import pandas as pd
import numpy as np
# A few dataframes with random numbers
# df_1
np.random.seed(123)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_1 = pd.DataFrame(np.random.randint(100,150,size=(rows, 3)), columns=['SPEED1', 'SPEED2', 'SPEED3'])
df_1 = df_1.set_index(rng)
# df_2
np.random.seed(456)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_2 = pd.DataFrame(np.random.randint(100,150,size=(rows, 3)), columns=['SPEED1', 'SPEED2', 'SPEED3'])
df_2 = df_2.set_index(rng)
# df_3
np.random.seed(789)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_3 = pd.DataFrame(np.random.randint(100,150,size=(rows, 3)), columns=['SPEED1', 'SPEED2', 'SPEED3'])
df_3 = df_3.set_index(rng)
# A function that divides columns by 10
decimator = lambda x: x/10
# A reference to all available dataframes
alldfs = [var for var in dir() if isinstance(eval(var), pd.core.frame.DataFrame)]
# A function as per your request
def modify(dfs, cols, fx):
    """ Define a subset of available dataframes and list of interesting columns, and
    apply a function on those columns.
    """
    # Subset all dataframes with names that start with df_
    dfNames = []
    for elem in alldfs:
        if str(elem)[:3] == 'df_':
            dfNames.append(elem)
    # Organize those dfs in a dict if they match the dataframe names of interest
    myFrames = {}
    for dfName in dfNames:
        if dfName in dfs:
            myFrames[dfName] = eval(dfName)
    print(myFrames)
    # Apply fx to the cols of your dfs subset
    for key in myFrames.keys():
        for col in list(myFrames[key]):
            if col in cols:
                myFrames[key][col] = myFrames[key][col].apply(fx)
# A testrun. Results in screenshots below
modify(dfs = ['df_1', 'df_2'], cols = ['SPEED1', 'SPEED2'], fx = decimator)
Here are dataframes df_1 and df_2 before manipulation: [screenshot not included]
Here are the dataframes after manipulation: [screenshot not included]
Anyway, this is how I would approach it.
Hope you'll find it useful!
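If you would rather avoid dir() and eval() altogether, a variation is to hand the function a dict of frames directly. This is only a sketch of that alternative, not the approach above: the modify signature is changed, the frames are tiny invented examples, and the function returns new frames instead of changing the inputs.

```python
import pandas as pd


def modify(frames, cols, fx):
    """Return a new dict where fx has been applied to each listed column a frame has."""
    out = {}
    for name, frame in frames.items():
        present = [c for c in cols if c in frame.columns]
        # assign() builds a new frame, so the input frames stay as they were.
        out[name] = frame.assign(**{c: frame[c].apply(fx) for c in present})
    return out


frames = {
    "df_1": pd.DataFrame({"SPEED1": [100, 120], "SPEED3": [130, 140]}),
    "df_2": pd.DataFrame({"SPEED1": [110, 150], "SPEED2": [105, 125]}),
}

scaled = modify(frames, cols=["SPEED1", "SPEED2"], fx=lambda x: x / 10)
print(scaled["df_1"])
print(scaled["df_2"])
```

Columns that are not listed (SPEED3 here) pass through unchanged, and frames that lack a listed column are simply left alone for that column.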
