running fuzzywuzzy ratio over pandas single column - python

I have a big list of full names, for example:
datafile.csv:
full_name,dob
Jerry Smith,21/01/2010
Morty Smith,18/06/2008
Rick Sanchez,27/04/1993
Jery Smith,27/12/2012
Morti Smith,13/03/2012
I'm trying to use fuzz.ratio to see if the names in the 'full_name' column have any similarities, but the code takes forever, mostly because of the nested for loop.
sample code:
import pandas as pd
from fuzzywuzzy import fuzz

dataframe = pd.read_csv('datafile.csv')
_list = []
for row1 in dataframe['full_name']:
    for row2 in dataframe['full_name']:
        x = fuzz.ratio(row1, row2)
        if x > 90:
            _list.append([row1, row2, x])
print(_list)
Is there a better method to iterate over a single pandas column to get a ratio for potential duplicate data?
Thanks
Jim

You can first build a table of fuzzy-match scores for every pair of names:
import pandas as pd
from io import StringIO
from fuzzywuzzy import fuzz
data = StringIO("""
Jerry Smith
Morty Smith
Rick Sanchez
Jery Smith
Morti Smith
""")
df = pd.read_csv(data, names=['full_name'])
for index, row in df.iterrows():
    # add one column per name holding its score against every row
    df[row['full_name']] = df['full_name'].apply(lambda x: fuzz.ratio(row['full_name'], x))
print(df.to_string())
Output:
      full_name  Jerry Smith  Morty Smith  Rick Sanchez  Jery Smith  Morti Smith
0   Jerry Smith          100           73            26          95           64
1   Morty Smith           73          100            26          76           91
2  Rick Sanchez           26           26           100          27           35
3    Jery Smith           95           76            27         100           67
4   Morti Smith           64           91            35          67          100
Then find best match for selected name:
data_rows = df[df['Jerry Smith'] > 90]
print(data_rows)
Output:
     full_name  Jerry Smith  Morty Smith  Rick Sanchez  Jery Smith  Morti Smith
0  Jerry Smith          100           73            26          95           64
3   Jery Smith           95           76            27         100           67

import pandas as pd
from io import StringIO
from fuzzywuzzy import process
s = """full_name,dob
Jerry Smith,21/01/2010
Morty Smith,18/06/2008
Rick Sanchez,27/04/1993
Jery Smith,27/12/2012
Morti Smith,13/03/2012"""
df = pd.read_csv(StringIO(s))
# Use fuzzywuzzy.process.extract in a list comprehension.
# You still have to iterate once, but this avoids apply, which can be very slow.
# Each row is matched against every other row; limit=1 keeps only the best match.
# You can adjust the limit as you see fit.
matches = [
    process.extract(df['full_name'][i], df[~df.index.isin([i])]['full_name'], limit=1)[0]
    for i in range(len(df))
]
# convert the list comprehension results to a dataframe
df2 = pd.DataFrame(matches, index=df.index, columns=['match_name', 'match_percent', 'match_index'])
# join the new dataframe to the original
final = df.join(df2)
print(final)
Output:
      full_name         dob   match_name  match_percent  match_index
0   Jerry Smith  21/01/2010   Jery Smith             95            3
1   Morty Smith  18/06/2008  Morti Smith             91            4
2  Rick Sanchez  27/04/1993  Morti Smith             43            4
3    Jery Smith  27/12/2012  Jerry Smith             95            0
4   Morti Smith  13/03/2012  Morty Smith             91            1

Your comparison method does double work, since running fuzz.ratio between "Jerry Smith" and "Morti Smith" gives the same result as between "Morti Smith" and "Jerry Smith".
If you compare each name only against the names that come after it, you will finish faster.
import pandas as pd
from fuzzywuzzy import fuzz, process

dataframe = pd.read_csv('datafile.csv')
names = dataframe['full_name']
_list = []
for i_dataframe in range(len(names) - 1):
    comparison_fullname = names.iloc[i_dataframe]
    # limit=None returns every candidate instead of only the top 5
    matches = process.extract(comparison_fullname, names.iloc[i_dataframe + 1:], scorer=fuzz.ratio, limit=None)
    # with a Series as choices, each result is (name, score, index label)
    for entry_fullname, entry_score, _ in matches:
        if entry_score >= 90:
            _list.append((comparison_fullname, entry_fullname, entry_score))
print(_list)
This will prevent any duplicate work.
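The "compare each pair only once" idea can also be sketched with itertools.combinations. This is only an illustration under stated assumptions: it uses difflib.SequenceMatcher from the standard library as a stand-in scorer so that it runs without fuzzywuzzy installed (scores may differ from fuzz.ratio depending on the backend), and the helper name similarity is hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

names = ["Jerry Smith", "Morty Smith", "Rick Sanchez", "Jery Smith", "Morti Smith"]

def similarity(a, b):
    # hypothetical stand-in for fuzz.ratio: a 0-100 score based on difflib
    return round(100 * SequenceMatcher(None, a, b).ratio())

# combinations() yields every unordered pair exactly once
# and never pairs a name with itself
scored = [(a, b, similarity(a, b)) for a, b in combinations(names, 2)]
matches = [m for m in scored if m[2] >= 90]
print(matches)
# [('Jerry Smith', 'Jery Smith', 95), ('Morty Smith', 'Morti Smith', 91)]
```

For 5 names this scores 10 pairs instead of the 25 visited by the nested loop.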

There are generally two things that can help you improve the performance:
reduce the number of comparisons
use a faster way to match the strings
In your implementation you're performing a lot of unnecessary comparisons, since you always compare A <-> B and later B <-> A. You're comparing A <-> A as well, which always scores 100. So you can reduce the number of comparisons by more than 50%. Since you only want to keep matches with a score over 90, this information can also be used to speed up each comparison. While this cannot be done in FuzzyWuzzy, it can be done in RapidFuzz (I am the author). RapidFuzz implements the same algorithms as FuzzyWuzzy with a relatively similar interface, but has a lot of performance improvements.
Your code could be written the following way to apply those two changes, which should be a lot faster. On my machine your code takes around 12 seconds, while this improved version only requires 1.7 seconds.
import pandas as pd
from io import StringIO
from rapidfuzz import fuzz
# generate a bigger list of examples to show the performance benefits
s = "fullname,dob"
s+='''
Jerry Smith,21/01/2010
Morty Smith,18/06/2008
Rick Sanchez,27/04/1993
Jery Smith,27/12/2012
Morti Smith,13/03/2012'''*500
dataframe = pd.read_csv(StringIO(s))
# only create the data series once
full_names = dataframe['fullname']
_list = []
for index, row1 in full_names.items():
    # skip elements that are already compared
    for row2 in full_names.iloc[index+1:]:
        # use a score_cutoff to improve the runtime for bad matches
        score = fuzz.ratio(row1, row2, score_cutoff=90)
        if score:
            _list.append([row1, row2, score])
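To give a feel for why a cutoff can save time, here is a rough standard-library sketch of the general idea: check a cheap upper bound first and skip the expensive comparison when the bound is already below the threshold. This is not how RapidFuzz is implemented; it only illustrates the principle using difflib, and the function name ratio_with_cutoff is made up for this example.

```python
from difflib import SequenceMatcher

def ratio_with_cutoff(a, b, cutoff=90):
    """Return a 0-100 similarity, or 0 if it cannot reach `cutoff`."""
    matcher = SequenceMatcher(None, a, b)
    # real_quick_ratio() and quick_ratio() are cheap upper bounds on ratio(),
    # so a bound below the cutoff means the pair can be rejected early
    if matcher.real_quick_ratio() * 100 < cutoff:
        return 0
    if matcher.quick_ratio() * 100 < cutoff:
        return 0
    score = matcher.ratio() * 100
    return round(score) if score >= cutoff else 0

print(ratio_with_cutoff("Jerry Smith", "Jery Smith"))    # 95
print(ratio_with_cutoff("Jerry Smith", "Rick Sanchez"))  # 0
```

The second pair is rejected by the character-count bound alone, so the full comparison never runs for it.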

Related

How do I append a column in a dataframe and give each unique string a number?

I'm looking to append a column in a pandas data frame that is similar to the following "Identifier" column:
Name         Age  Identifier
Peter Pan    13   PanPe
James Jones  24   JonesJa
Peter Pan    22   PanPe
Chris Smith  19   SmithCh
I need the "Identifier" column to look like:
Identifier
PanPe01
JonesJa01
PanPe02
SmithCh01
How would I number each original string with 01? And if there are duplicates (for example Peter Pan), then the following duplicate strings (after the original 01) will have 02, 03, and so forth?
I've been referred to the following idea:
combo = "PanPe"
counts = {}
if combo in counts:
    count = counts[combo]
    counts[combo] = count + 1
else:
    counts[combo] = 1
However, getting a good example of code would be ideal, as I am relatively new to Python, and would love to know the syntax as how to implement an entire column iterated through this process, instead of just one string as shown above with "PanPe".
You can use cumcount here:
df['new_Identifier'] = df['Identifier'] + (df.groupby('Identifier').cumcount() + 1).astype(str).str.pad(2, 'left', '0')  # thanks @dm2 for the str.pad part
Output:
          Name  Age Identifier new_Identifier
0    Peter Pan   13      PanPe        PanPe01
1  James Jones   24    JonesJa      JonesJa01
2    Peter Pan   22      PanPe        PanPe02
3  Chris Smith   19    SmithCh      SmithCh01
Thank you @dm2 and @Bushmaster
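For readers who want to see the dictionary-counting idea from the question applied to a whole column, a plain-Python sketch could look like this. The function name number_duplicates is hypothetical; it does by hand what the cumcount one-liner does.

```python
def number_duplicates(identifiers):
    """Append a two-digit running count to each identifier."""
    counts = {}
    numbered = []
    for combo in identifiers:
        # first sighting gets 1, every repeat gets the next number
        counts[combo] = counts.get(combo, 0) + 1
        numbered.append(f"{combo}{counts[combo]:02d}")
    return numbered

print(number_duplicates(["PanPe", "JonesJa", "PanPe", "SmithCh"]))
# ['PanPe01', 'JonesJa01', 'PanPe02', 'SmithCh01']
```

With a dataframe this could be used as df['new_Identifier'] = number_duplicates(df['Identifier']), although the groupby/cumcount version is usually the better fit for large frames.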

How do I increase the values of a column in Pandas?

Hello, I have this Pandas code (see below), but it gives me this error: TypeError: can only concatenate str (not "int") to str
import pandas as pd
import numpy as np
import os
_data0 = pd.read_excel("C:\\Users\\HP\\Documents\\DataScience task\\Gender_Age.xlsx")
_data0['Age' + 1]
I want to change the values in the 'Age' column. For example, if I want to increase every value in 'Age' by 1, how do I do that? (The same goes for 'Number of Children'.)
The output I wanted:
  First Name Last Name  Age  Number of Children
0   Kimberly    Watson   36                   2
1     Victor    Wilson   35                   6
2     Adrian   Elliott   35                   2
3    Richard    Bailey   36                   5
4      Blake   Roberts   35                   6
Original output:
  First Name Last Name  Age  Number of Children
0   Kimberly    Watson   24                   1
1     Victor    Wilson   23                   5
2     Adrian   Elliott   23                   1
3    Richard    Bailey   24                   4
4      Blake   Roberts   23                   5
Try (df is your dataframe, _data0 in the question):
df['Age'] = df['Age'] + 12
df['Number of Children'] = df['Number of Children'] + 1
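The TypeError in the question comes from 'Age' + 1 being evaluated before pandas is involved: Python tries to add an int to the string 'Age'. A minimal sketch of selecting the column first and then adding to its values, using a small made-up frame with two rows from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "First Name": ["Kimberly", "Victor"],
    "Age": [24, 23],
    "Number of Children": [1, 5],
})

# Select the column, then add; the addition applies to every element
df["Age"] = df["Age"] + 1

# The in-place operator does the same thing
df["Number of Children"] += 1

print(df)
```

Note that the result has to be assigned back to the column, otherwise the dataframe is left unchanged.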

pandas sort by row value

I am quite new to pandas (working with 3rd party code, forced to use it!), and I have a dataframe which looks like so:
name_id  cookie_id  file_name_id
John     56         /some/loc
Doe      45         /some/loc2
John     67         /some/loc3
hilary   768        /some/loc4
wendy    8          /some/loc3
hilary   4          /some/loc4
I would like to sort them by the name_id like so:
name_id  cookie_id  file_name_id
Doe      45         /some/loc2
John     56         /some/loc
John     67         /some/loc3
hilary   768        /some/loc4
hilary   4          /some/loc4
wendy    8          /some/loc3
I am looking at:
df.sort_values(by=['name_id'])
and it does seem to give me the correct answer, but since I am new to pandas, I am afraid there might be some gotchas I need to be aware of.
df.sort_values(by=['name_id']) should be perfectly fine to use. Watch out for spaces at the beginning of the name_id string, as those would be sorted first. For example, " wendy" would be placed at the top in your case.
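A small sketch of that gotcha, and one possible way around it, assuming pandas 1.1 or newer (where sort_values accepts a key argument). The sample frame is made up from the names in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "name_id": ["John", " wendy", "Doe", "hilary"],
    "cookie_id": [56, 8, 45, 768],
})

# Default sort compares raw strings:
# leading spaces first, then uppercase, then lowercase
plain = df.sort_values(by=["name_id"])
print(plain["name_id"].tolist())       # [' wendy', 'Doe', 'John', 'hilary']

# Normalising inside `key` changes the ordering only;
# the stored values are left untouched
normalized = df.sort_values(by=["name_id"], key=lambda s: s.str.strip().str.lower())
print(normalized["name_id"].tolist())  # ['Doe', 'hilary', 'John', ' wendy']
```

Also note that sort_values returns a new dataframe rather than sorting in place, so the result needs to be assigned to a variable.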

Pandas find values present in at least two groups

I have a multiindex dataframe like this:
                            Distance
Company Driver Document_id
Salt    Fred   1               592.0
               2               550.0
        John   3               961.0
               4               346.0
Bricks  James  10              244.0
               20              303.0
               30              811.0
        Fred   40              449.0
        James  501             265.0
Sand    Donald 15              378.0
               800             359.0
How can I slice that df to see only drivers, who worked for different companies? So the result should be like this:
                            Distance
Company Driver Document_id
Salt    Fred   1               592.0
               2               550.0
Bricks  Fred   40              449.0
UPD: My original dataframe is 400k rows long, so I can't just slice it by index. I'm trying to find a general solution for problems like this.
To get the number of unique companies a person has worked for, use groupby and nunique:
v = (df.index.get_level_values(0)
       .to_series()
       .groupby(df.index.get_level_values(1))
       .nunique())
# Alternative involving resetting the index, may not be as efficient.
# v = df.reset_index().groupby('Driver').Company.nunique()
v
Driver
Donald    1
Fred      2
James     1
John      1
Name: Company, dtype: int64
Now, you can run a query:
names = v[v.gt(1)].index.tolist()
df.query("Driver in @names")
                            Distance
Company Driver Document_id
Salt    Fred   1               592.0
               2               550.0
Bricks  Fred   40              449.0
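A variation on the same idea, sketched here as one possible alternative, uses transform('nunique') so that the per-driver company count lines up with every row and can be used directly as a filter. The frame is rebuilt from the data in the question; the variable names flat and companies_per_driver are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Company": ["Salt"] * 4 + ["Bricks"] * 5 + ["Sand"] * 2,
    "Driver": ["Fred", "Fred", "John", "John", "James", "James",
               "James", "Fred", "James", "Donald", "Donald"],
    "Document_id": [1, 2, 3, 4, 10, 20, 30, 40, 501, 15, 800],
    "Distance": [592.0, 550.0, 961.0, 346.0, 244.0, 303.0,
                 811.0, 449.0, 265.0, 378.0, 359.0],
}).set_index(["Company", "Driver", "Document_id"])

# Work on ordinary columns, then restore the MultiIndex afterwards
flat = df.reset_index()

# For every row: how many distinct companies does this row's driver have?
companies_per_driver = flat.groupby("Driver")["Company"].transform("nunique")

result = (flat[companies_per_driver > 1]
          .set_index(["Company", "Driver", "Document_id"]))
print(result)
```

Resetting the index copies the data, so on a 400k-row frame it may be worth timing this against the index-based version above.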

Python Pandas: How do I return members of groupby

I'm doing some research on a dataframe of people who are related. When I find brothers, I can't find a way to write them all down in a specific column. Here is an example:
import pandas as pd
cols = ['Name','Father','Brother']
df = pd.DataFrame({'Brother':'',
                   'Father':['Erick Moon','Ralph Docker','Erick Moon','Stewart Adborn'],
                   'Name':['John Smith','Rodolph Ruppert','Mathew Common',"Patrick French"]
                   },columns=cols)
df
              Name          Father Brother
0       John Smith      Erick Moon
1  Rodolph Ruppert    Ralph Docker
2    Mathew Common      Erick Moon
3   Patrick French  Stewart Adborn
What I want is this:
              Name          Father        Brother
0       John Smith      Erick Moon  Mathew Common
1  Rodolph Ruppert    Ralph Docker
2    Mathew Common      Erick Moon     John Smith
3   Patrick French  Stewart Adborn
I appreciate any help!
Here is an idea you can try: first create a Brother column that holds all brothers as a list, including the person themselves, and then remove the person from their own list in a separate step. The code could probably be optimized, but it is somewhere to start:
import numpy as np
import pandas as pd
# every row gets the array of all names that share its father
df['Brother'] = df.groupby('Father')['Name'].transform(lambda g: [g.values])
def deleteSelf(row):
    # drop the row's own name from its list of brothers
    row.Brother = np.delete(row.Brother, np.where(row.Brother == row.Name))
    return(row)
df.apply(deleteSelf, axis = 1)
#               Name          Father          Brother
# 0       John Smith      Erick Moon  [Mathew Common]
# 1  Rodolph Ruppert    Ralph Docker               []
# 2    Mathew Common      Erick Moon     [John Smith]
# 3   Patrick French  Stewart Adborn               []
def same_father(me, data):
    hasdad = data.Father == data.at[me, 'Father']
    notme = data.index != me
    isbro = hasdad & notme
    return data.loc[isbro].index.tolist()
df2 = df.set_index('Name')
getbro = lambda x: same_father(x.name, df2)
df2['Brother'] = df2.apply(getbro, axis=1)
I think this should work (untested).
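Another possible approach, sketched under the assumption that a comma-separated string of brothers is acceptable in the Brother column, is to collect the names per father once and then look them up for each row. The variable name siblings is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John Smith", "Rodolph Ruppert", "Mathew Common", "Patrick French"],
    "Father": ["Erick Moon", "Ralph Docker", "Erick Moon", "Stewart Adborn"],
})

# One list of children per father, indexed by the father's name
siblings = df.groupby("Father")["Name"].apply(list)

# For each row, join everyone with the same father except the person themselves
df["Brother"] = [
    ", ".join(n for n in siblings[father] if n != name)
    for name, father in zip(df["Name"], df["Father"])
]
print(df)
```

Rows without brothers end up with an empty string, which matches the blank cells in the desired output.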
