If I have two dataframes with name columns (John, Alex, harry) and (ryan, kane, king), how can I use fuzzywuzzy in Python to get the following output?
fuzz.Ratio
John ryan 25
John kane 54
John king 44
alex ryan 23
alex kane 14
alex king 55
harry ryan 47
harry kane 47
harry king 50
Your ratios are wrong. What you are looking for is the Cartesian product of the corresponding columns of the two dataframes.
Sample code:
import itertools
import pandas as pd
from fuzzywuzzy import fuzz

df1 = pd.DataFrame({'name': ['John', 'Alex', 'harry']})
df2 = pd.DataFrame({'name': ['ryan', 'kane', 'king']})

# iterate over the Cartesian product of the lower-cased name columns
for w1, w2 in itertools.product(
        df1['name'].apply(str.lower).values, df2['name'].apply(str.lower).values):
    print(f"{w1}, {w2}, {fuzz.ratio(w1, w2)}")
Output:
john, ryan, 25
john, kane, 25
john, king, 25
alex, ryan, 25
alex, kane, 50
alex, king, 0
harry, ryan, 44
harry, kane, 22
harry, king, 0
IIUC, you could do:
from fuzzywuzzy import fuzz
from itertools import product
import pandas as pd
a = ('John','Alex','harry')
b = ('ryan', 'kane', 'king')
# compute the ratios for each pair
res = ((ai, bi, fuzz.ratio(ai, bi)) for ai, bi in product(a, b))
# create a DataFrame, filtering out the pairs whose ratio is 0
out = pd.DataFrame([e for e in res if e[2] > 0], columns=['name_a', 'name_b', 'fuzz_ratio'])
print(out)
Output
name_a name_b fuzz_ratio
0 John ryan 25
1 John kane 25
2 John king 25
3 Alex kane 25
4 harry ryan 44
5 harry kane 22
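For completeness: if you'd rather stay inside pandas, the same Cartesian product can be built with a cross merge (available since pandas 1.2). This is a sketch of that alternative, using the same names as above:
import pandas as pd
from fuzzywuzzy import fuzz

df1 = pd.DataFrame({'name_a': ['John', 'Alex', 'harry']})
df2 = pd.DataFrame({'name_b': ['ryan', 'kane', 'king']})

# a cross merge produces every (name_a, name_b) pair in one frame
out = df1.merge(df2, how='cross')
out['fuzz_ratio'] = out.apply(lambda r: fuzz.ratio(r['name_a'], r['name_b']), axis=1)
print(out)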
I have a dataset with unique names. Another dataset contains several rows with the same names as in the first dataset.
I want to add a column of unique ids to the first dataset, and a matching id column to the second dataset, so that every row with a given name gets that name's id from the first dataset.
For example:
Dataframe 1:
player_id Name
1 John Dosh
2 Michael Deesh
3 Julia Roberts
Dataframe 2:
player_id Name
1 John Dosh
1 John Dosh
2 Michael Deesh
2 Michael Deesh
2 Michael Deesh
3 Julia Roberts
3 Julia Roberts
I want to use both data frames to run deep feature synthesis with featuretools.
To be able to do something like this:
entity_set = ft.EntitySet("basketball_players")
entity_set.add_dataframe(dataframe_name="players_set",
dataframe=players_set,
index='name'
)
entity_set.add_dataframe(dataframe_name="season_stats",
dataframe=season_stats,
index='season_stats_id'
)
entity_set.add_relationship("players_set", "player_id", "season_stats", "player_id")
This should do what your question asks:
import pandas as pd
df1 = pd.DataFrame([
    'John Dosh',
    'Michael Deesh',
    'Julia Roberts'], columns=['Name'])
df2 = pd.DataFrame([
    ['John Dosh'],
    ['John Dosh'],
    ['Michael Deesh'],
    ['Michael Deesh'],
    ['Michael Deesh'],
    ['Julia Roberts'],
    ['Julia Roberts']], columns=['Name'])
print('inputs:', '\n')
print(df1)
print(df2)
# give df1 a 1-based id column, then map those ids onto df2 by joining on Name
df1 = df1.reset_index().rename(columns={'index': 'id'}).assign(id=df1.index + 1)
df2 = df2.join(df1.set_index('Name'), on='Name')[['id'] + list(df2.columns)]
print('\noutputs:', '\n')
print(df1)
print(df2)
Input/output:
inputs:
Name
0 John Dosh
1 Michael Deesh
2 Julia Roberts
Name
0 John Dosh
1 John Dosh
2 Michael Deesh
3 Michael Deesh
4 Michael Deesh
5 Julia Roberts
6 Julia Roberts
outputs:
id Name
0 1 John Dosh
1 2 Michael Deesh
2 3 Julia Roberts
id Name
0 1 John Dosh
1 1 John Dosh
2 2 Michael Deesh
3 2 Michael Deesh
4 2 Michael Deesh
5 3 Julia Roberts
6 3 Julia Roberts
UPDATE:
An alternative solution which should give the same result is:
df1 = df1.assign(id=list(range(1, len(df1) + 1)))[['id'] + list(df1.columns)]
df2 = df2.merge(df1)[['id'] + list(df2.columns)]
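With the id columns in place, the EntitySet from your question should hook up directly. Here is a rough sketch reusing your own calls; note it renames the new id column to player_id to match them, and it assumes a season_stats_id surrogate key is added to the second frame:
import featuretools as ft

players_set = df1.rename(columns={'id': 'player_id'})
season_stats = df2.rename(columns={'id': 'player_id'})
season_stats['season_stats_id'] = range(len(season_stats))  # assumed surrogate key

entity_set = ft.EntitySet("basketball_players")
entity_set.add_dataframe(dataframe_name="players_set",
                         dataframe=players_set,
                         index='player_id')
entity_set.add_dataframe(dataframe_name="season_stats",
                         dataframe=season_stats,
                         index='season_stats_id')
entity_set.add_relationship("players_set", "player_id", "season_stats", "player_id")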
Hello, I have this pandas code (see below), but it turns out it gives me this error: TypeError: can only concatenate str (not "int") to str
import pandas as pd
import numpy as np
import os
_data0 = pd.read_excel("C:\\Users\\HP\\Documents\\DataScience task\\Gender_Age.xlsx")
_data0['Age' + 1]  # this line raises the TypeError
I want to change the element values in the column 'Age'; imagine I wanted to increase the elements of 'Age' by 1, how do I do that? (With Number of Children as well.)
The output I wanted:
First Name Last Name Age Number of Children
0 Kimberly Watson 36 2
1 Victor Wilson 35 6
2 Adrian Elliott 35 2
3 Richard Bailey 36 5
4 Blake Roberts 35 6
Original output:
First Name Last Name Age Number of Children
0 Kimberly Watson 24 1
1 Victor Wilson 23 5
2 Adrian Elliott 23 1
3 Richard Bailey 24 4
4 Blake Roberts 23 5
Try:
df['Age'] = df['Age'] + 12
df['Number of Children'] = df['Number of Children'] + 1
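If you want both adjustments in one step, here is a small sketch (the offsets 12 and 1 come from comparing the two outputs above; the list is broadcast across the two columns):
# add 12 to Age and 1 to Number of Children in one aligned operation
df[['Age', 'Number of Children']] = df[['Age', 'Number of Children']] + [12, 1]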
I'm a relative python noob and also new to natural language processing (NLP).
I have a dataframe containing names and sales. I want to: 1) break out all the tokens, and 2) aggregate sales by each token.
Here's an example of the dataframe:
name sales
Mike Smith 5
Mike Jones 3
Mary Jane 4
Here's the desired output:
token sales
mike 8
mary 4
Smith 5
Jones 3
Jane 4
Thoughts on what to do? I'm using Python.
Assumption: you have a function tokenize that takes a string as input and returns a list of tokens.
I'll use this function as a tokenizer for now:
def tokenize(word):
    return word.casefold().split()
Solution
df.assign(tokens=df['name'].apply(tokenize)).explode('tokens').groupby('tokens')['sales'].sum().reset_index()
In [45]: df
Out[45]:
name sales
0 Mike Smith 5
1 Mike Jones 3
2 Mary Jane 4
3 Mary Anne Jane 1
In [46]: df.assign(tokens=df['name'].apply(tokenize)).explode('tokens').groupby('tokens')['sales'].sum().reset_index()
Out[46]:
tokens sales
0 anne 1
1 jane 5
2 jones 3
3 mary 5
4 mike 8
5 smith 5
Explanation
The assign step creates a column called tokens by applying the tokenize function.
Note: for this particular tokenize function you could use df['name'].str.lower().str.split(); however, that won't generalize to custom tokenizers, hence the .apply(tokenize).
This generates a df that looks like:
name sales tokens
0 Mike Smith 5 [mike, smith]
1 Mike Jones 3 [mike, jones]
2 Mary Jane 4 [mary, jane]
3 Mary Anne Jane 1 [mary, anne, jane]
Use df.explode on this to get:
name sales tokens
0 Mike Smith 5 mike
0 Mike Smith 5 smith
1 Mike Jones 3 mike
1 Mike Jones 3 jones
2 Mary Jane 4 mary
2 Mary Jane 4 jane
3 Mary Anne Jane 1 mary
3 Mary Anne Jane 1 anne
3 Mary Anne Jane 1 jane
The last step is just a groupby-agg step, spelled out below.
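In isolation, that final step on the exploded frame looks like this:
exploded = df.assign(tokens=df['name'].apply(tokenize)).explode('tokens')
# sum the sales for each token and lift the group key back into a column
out = exploded.groupby('tokens')['sales'].sum().reset_index()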
You can use the str.split() method and keep item 0 for the first name, use that as the groupby key, and take the sum; then do the same for item -1 (the last name) and concatenate the two results.
import pandas as pd

df = pd.DataFrame({'name': {0: 'Mike Smith', 1: 'Mike Jones', 2: 'Mary Jane'},
                   'sales': {0: 5, 1: 3, 2: 4}})
# group on the first token, then on the last token, and stack the two results
df = pd.concat([df.groupby(df.name.str.split().str[0])['sales'].sum(),
                df.groupby(df.name.str.split().str[-1])['sales'].sum()]).reset_index()
df.rename(columns={'name': 'token'}, inplace=True)
df[["fname", "lname"]] = df["name"].str.split(expand=True) # getting tokens,considering separated by space
tokens_df = pd.concat([df[['fname', 'sales']].rename(columns = {'fname': 'tokens'}),
df[['lname', 'sales']].rename(columns = {'lname': 'tokens'})])
pd.DataFrame(tokens_df.groupby('tokens')['sales'].apply(sum), columns=['sales'])
I have a big list of full names, for example:
datafile.csv:
full_name,dob
Jerry Smith,21/01/2010
Morty Smith,18/06/2008
Rick Sanchez,27/04/1993
Jery Smith,27/12/2012
Morti Smith,13/03/2012
I'm trying to use fuzz.ratio to see if the names in the fullname column have any similarities, but the code takes forever, mostly because of the nested for loop.
sample code:
dataframe = pd.read_csv('datafile.csv')
_list = []
for row1 in dataframe['fullname']:
    for row2 in dataframe['fullname']:
        x = fuzz.ratio(row1, row2)
        if x > 90:
            _list.append([row1, row2, x])
print(_list)
Is there a better way to iterate over a single pandas column to get a ratio for potential duplicate data?
Thanks
Jim
You can first build a full matrix of fuzzy scores:
import pandas as pd
from io import StringIO
from fuzzywuzzy import fuzz
data = StringIO("""
Jerry Smith
Morty Smith
Rick Sanchez
Jery Smith
Morti Smith
""")
df = pd.read_csv(data, names=['full_name'])
# add one score column per name: the ratio of that name against every row
for index, row in df.iterrows():
    df[row['full_name']] = df['full_name'].apply(lambda x: fuzz.ratio(row['full_name'], x))
print(df.to_string())
Output:
full_name Jerry Smith Morty Smith Rick Sanchez Jery Smith Morti Smith
0 Jerry Smith 100 73 26 95 64
1 Morty Smith 73 100 26 76 91
2 Rick Sanchez 26 26 100 27 35
3 Jery Smith 95 76 27 100 67
4 Morti Smith 64 91 35 67 100
Then find the best matches for a selected name:
data_rows = df[df['Jerry Smith'] > 90]
print(data_rows)
Output:
full_name Jerry Smith Morty Smith Rick Sanchez Jery Smith Morti Smith
0 Jerry Smith 100 73 26 95 64
3 Jery Smith 95 76 27 100 67
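To collect every near-duplicate pair from the matrix in one pass (instead of querying one name at a time), a possible follow-up sketch on the df built above; note each pair shows up in both directions:
pairs = []
for name in df['full_name']:
    # off-diagonal scores above 90 in this name's column
    hits = df[(df[name] > 90) & (df['full_name'] != name)]
    for _, row in hits.iterrows():
        pairs.append((name, row['full_name'], row[name]))
print(pairs)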
import pandas as pd
from io import StringIO
from fuzzywuzzy import process
s = """full_name,dob
Jerry Smith,21/01/2010
Morty Smith,18/06/2008
Rick Sanchez,27/04/1993
Jery Smith,27/12/2012
Morti Smith,13/03/2012"""
df = pd.read_csv(StringIO(s))
# 1 - use fuzzywuzzy.process.extract with a list comprehension
# 2 - you still have to iterate once, but this avoids apply, which can be very slow
# 3 - convert the list comprehension results to a dataframe
# Note that I am limiting the results to one match. You can adjust the code as you see fit.
df2 = pd.DataFrame([process.extract(df['full_name'][i], df[~df.index.isin([i])]['full_name'], limit=1)[0]
                    for i in range(len(df))],
                   index=df.index, columns=['match_name', 'match_percent', 'match_index'])
# join the new dataframe to the original
final = df.join(df2)
Output:
full_name dob match_name match_percent match_index
0 Jerry Smith 21/01/2010 Jery Smith 95 3
1 Morty Smith 18/06/2008 Morti Smith 91 4
2 Rick Sanchez 27/04/1993 Morti Smith 43 4
3 Jery Smith 27/12/2012 Jerry Smith 95 0
4 Morti Smith 13/03/2012 Morty Smith 91 1
This comparison method does double work, since running fuzz.ratio between "Jerry Smith" and "Morti Smith" is the same as running it between "Morti Smith" and "Jerry Smith". If you iterate over only the sub-array that follows each row, you can finish faster.
import pandas as pd
from fuzzywuzzy import fuzz, process

dataframe = pd.read_csv('datafile.csv')
_list = []
for i_dataframe in range(len(dataframe) - 1):
    comparison_fullname = dataframe['fullname'][i_dataframe]
    # compare only against the rows after this one; extract on a Series
    # yields (value, score, key) triples
    for entry_fullname, entry_score, _ in process.extract(comparison_fullname,
                                                          dataframe['fullname'][i_dataframe + 1::],
                                                          scorer=fuzz.ratio):
        if entry_score >= 90:
            _list.append((comparison_fullname, entry_fullname, entry_score))
print(_list)
This will prevent any duplicate work.
There are generally two parts that can help you to improve the performance:
reduce the number of comparisons
use a faster way to match the strings
In your implementation you're performing a lot of unneeded comparisons, since you always compare A <-> B and later on B <-> A. You're comparing A <-> A as well, which is always 100. So you can reduce the number of comparisons by more than 50%. Since you only want to keep matches with a score over 90, that threshold can be used to speed up the comparisons. While this cannot be done in FuzzyWuzzy, it can be done in RapidFuzz (I am the author). RapidFuzz implements the same algorithms as FuzzyWuzzy with a very similar interface, but has a lot of performance improvements.
Your code could be rewritten the following way to implement those two changes, which should be a lot faster: on my machine your version runs in around 12 seconds, while this improved version needs only 1.7 seconds.
import pandas as pd
from io import StringIO
from rapidfuzz import fuzz

# generate a bigger list of examples to show the performance benefits
s = "fullname,dob"
s += '''
Jerry Smith,21/01/2010
Morty Smith,18/06/2008
Rick Sanchez,27/04/1993
Jery Smith,27/12/2012
Morti Smith,13/03/2012''' * 500

dataframe = pd.read_csv(StringIO(s))
_list = []

# only create the data series once
full_names = dataframe['fullname']
for index, row1 in full_names.items():
    # skip elements that are already compared
    for row2 in full_names.iloc[index + 1::]:
        # use a score_cutoff to improve the runtime for bad matches
        score = fuzz.ratio(row1, row2, score_cutoff=90)
        if score:
            _list.append([row1, row2, score])
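If your installed RapidFuzz is recent enough to include process.cdist, the whole score matrix can also be computed in one vectorized call; this sketch assumes that function is available in your version:
import numpy as np
from rapidfuzz import process, fuzz

names = dataframe['fullname'].tolist()
# scores below the cutoff are reported as 0
scores = process.cdist(names, names, scorer=fuzz.ratio, score_cutoff=90)
# keep only the upper triangle so each pair is reported once
rows, cols = np.where(np.triu(scores, k=1) >= 90)
matches = [(names[r], names[c], scores[r, c]) for r, c in zip(rows, cols)]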
I have a pandas dataframe that looks like this:
df = pd.DataFrame(
    [
        ['JoeSmith', 5],
        ['CathySmith', 3],
        ['BrianSmith', 12],
        ['MarySmith', 67],
        ['JoeJones', 23],
        ['CathyJones', 98],
        ['BrianJones', 438],
        ['MaryJones', 75],
        ['JoeCollins', 56],
        ['CathyCollins', 125],
        ['BrianCollins', 900],
        ['MaryCollins', 321],
    ], columns=['Name', 'Value']
)
print(df)
Name Value
0 JoeSmith 5
1 CathySmith 3
2 BrianSmith 12
3 MarySmith 67
4 JoeJones 23
5 CathyJones 98
6 BrianJones 438
7 MaryJones 75
8 JoeCollins 56
9 CathyCollins 125
10 BrianCollins 900
11 MaryCollins 321
The first column 'Name' needs to be split into First and Last names and put into a MultiIndex.
Value
Joe Smith 5
Cathy Smith 3
Brian Smith 12
Mary Smith 67
Joe Jones 23
Cathy Jones 98
Brian Jones 438
Mary Jones 75
Joe Collins 56
Cathy Collins 125
Brian Collins 900
Mary Collins 321
I think you can use extract to pull out the name and surname, then set_index, and finally drop the column Name:
df[['name','surname']] = df.Name.str.extract(r'([A-Z][a-z]*)([A-Z][a-z]*)', expand=True)
df = df.set_index(['name','surname']).drop('Name', axis=1)
print(df)
Value
name surname
Joe Smith 5
Cathy Smith 3
Brian Smith 12
Mary Smith 67
Joe Jones 23
Cathy Jones 98
Brian Jones 438
Mary Jones 75
Joe Collins 56
Cathy Collins 125
Brian Collins 900
Mary Collins 321
Solution
import pandas as pd
pattern = r'.*\b([A-Z][a-z]*)([A-Z][a-z]*)\b.*'
names = df.Name.str.extract(pattern, expand=True)
midx = pd.MultiIndex.from_tuples(names.values.tolist())
df.index = midx
df[['Value']]
Explanation
pattern grabs a group of letters that starts with a capital A-Z followed by any number of lower-case a-z, then a second group starting with another capital A-Z followed by any number of lower-case a-z; this splits each name into two.
pd.MultiIndex.from_tuples creates the MultiIndex.
names.values.tolist() turns the converted DataFrame into a list of lists that will be interpreted as tuples.
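As a side note, on pandas 0.24+ the tolist() round trip can be skipped with pd.MultiIndex.from_frame; a small alternative sketch (the level names 'first' and 'last' are just illustrative):
names = df.Name.str.extract(pattern, expand=True)
names.columns = ['first', 'last']  # hypothetical level names
# build the MultiIndex directly from the extracted two-column frame
df.index = pd.MultiIndex.from_frame(names)
print(df[['Value']])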