Fuzzy match for 2 lists with very similar names - python

I know this question has been asked in some form before, so apologies. I'm trying to fuzzy match list 1 (sample_name) to list 2 (actual_name). actual_name has significantly more names than list 1, and I keep running into fuzzy match not working well. I've tried multiple fuzzy match methods (partial, token_set) but keep running into issues, since there are many names in list 2 that are very similar. Is there any way to improve the matching here? Ideally I want a new dataframe with list 1, the matched name from list 2, and the match score in a third column. Any help would be much appreciated. Thanks.
Have used this so far:
df1 = sample_df['sample_name'].to_list()
df2 = actual_df['actual_name'].to_list()
response = {}
for name_to_find in df1:
    for name_master in df2:
        if fuzz.partial_ratio(name_to_find, name_master) > 90:
            response[name_to_find] = name_master
            break
for key, value in response.items():
    print('sample name: ' + key + ', actual_name: ' + value)
sample_name          actual_name
jtsports             JT Sports LLC
tombaseball          Tom Baseball Inc.
context express      Context Express LLC
zb sicily            ZB Sicily LLC
lightening express   Lightening Express LLC
fire roads           Fire Road Express
N/A                  Earth Treks
N/A                  TS Sports LLC
N/A                  MM Baseball Inc.
N/A                  Contact Express LLC
N/A                  AB Sicily LLC
N/A                  Lightening Roads LLC

Not sure if this is your expected output (and you may need to adjust the threshold), but I think this is what you are looking for?
import pandas as pd
from fuzzywuzzy import process

threshold = 50
list1 = ['jtsports', 'tombaseball', 'context express', 'zb sicily',
         'lightening express', 'fire roads']
list2 = ['JT Sports LLC', 'Tom Baseball Inc.', 'Context Express LLC',
         'ZB Sicily LLC', 'Lightening Express LLC', 'Fire Road Express',
         'Earth Treks', 'TS Sports LLC', 'MM Baseball Inc.', 'Contact Express LLC',
         'AB Sicily LLC', 'Lightening Roads LLC']

response = []
for name_to_find in list1:
    resp_match = process.extractOne(name_to_find, list2)
    if resp_match[1] > threshold:
        row = {'sample_name': name_to_find, 'actual_name': resp_match[0], 'score': resp_match[1]}
        response.append(row)
        print(row)

results = pd.DataFrame(response)

# If you need all the 'actual_name' values to be in the dataframe, continue below.
# Otherwise don't include these last 2 lines of code.
unmatched = pd.DataFrame([x for x in list2 if x not in list(results['actual_name'])], columns=['actual_name'])
results = pd.concat([results, unmatched], sort=False).reset_index(drop=True)  # DataFrame.append was removed in pandas 2.0
Output:
print(results)
sample_name actual_name score
0 jtsports JT Sports LLC 79.0
1 tombaseball Tom Baseball Inc. 81.0
2 context express Context Express LLC 95.0
3 zb sicily ZB Sicily LLC 95.0
4 lightening express Lightening Express LLC 95.0
5 fire roads Fire Road Express 86.0
6 NaN Earth Treks NaN
7 NaN TS Sports LLC NaN
8 NaN MM Baseball Inc. NaN
9 NaN Contact Express LLC NaN
10 NaN AB Sicily LLC NaN
11 NaN Lightening Roads LLC NaN
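If installing fuzzywuzzy is not an option, the standard library's difflib can do a similar job. Below is a minimal sketch (the helper name best_match and the 0.4 cutoff are my own choices, and the lists are shortened for brevity); SequenceMatcher.ratio() plays the role of the fuzz score, on a 0-1 scale:

```python
import difflib

list1 = ['jtsports', 'tombaseball', 'context express']
list2 = ['JT Sports LLC', 'Tom Baseball Inc.', 'Context Express LLC',
         'Contact Express LLC']

def best_match(name, candidates, cutoff=0.4):
    # Compare lowercased strings so case differences don't drag the score down.
    scored = [(c, difflib.SequenceMatcher(None, name.lower(), c.lower()).ratio())
              for c in candidates]
    candidate, score = max(scored, key=lambda t: t[1])
    return (candidate, round(score, 2)) if score >= cutoff else (None, round(score, 2))

for name in list1:
    print(name, '->', best_match(name, list2))
```

The same threshold logic as above applies: raise the cutoff if you get false positives, lower it if real matches are missed.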

It won't be the most efficient way to do it, since it compares every sample name against every actual name, but you could calculate the Levenshtein distance between the left and right strings and then match based on the closest result.
That is how a lot of naive spell-check systems work.
I'm suggesting that you run this calculation against each of the correct names and return the match with the best score.
Adjusting the code you have posted, I would follow something like the following. Bear in mind that for Levenshtein distance, lower is closer, so it would need some adjusting; with the function you are using, higher means closer, so the following should work as-is.
df1 = sample_df['sample_name'].to_list()
df2 = actual_df['actual_name'].to_list()
response = {}
for name_to_find in df1:
    highest_so_far = ("", 0)
    for name_master in df2:
        score = fuzz.partial_ratio(name_to_find, name_master)
        if score > highest_so_far[1]:
            highest_so_far = (name_master, score)
    response[name_to_find] = highest_so_far[0]
for key, value in response.items():
    print('sample name: ' + key + ', actual_name: ' + value)
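For reference, the Levenshtein distance mentioned above can be computed in pure Python with the classic dynamic-programming recurrence. This is a minimal, unoptimized sketch (the function name is mine):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic edit-distance DP: prev[j] holds the distance
    # between the first i-1 characters of a and the first j of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]
```

With this, the loop above would keep the candidate with the *lowest* distance instead of the highest fuzz score.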

Related

Why there are null values when I use group by?

I have data on Amazon's 50 best-selling books (from Kaggle).
There are no null values in the data.
First, I find the mean of the reviews given by users. Then I use a groupby, but it gives null values for User Rating and its mean.
In the next step, I filter all the rows where the reviews are greater than the average.
My question is: why did I get null values in the first case, since there were no null values in the dataset?
Why did I get null values when I used groupby?
ipynb file
This 'Answer' is an attempt at reproducibility.
The OP's result cannot be reproduced.
PS: The groupby returns the grouping as expected.
#TANNU, it appears your NaN values might have come from your data cleansing. Kindly show the relevant code.
NB: The 'Amazon Top 50 Bestselling Books 2009 - 2019' dataset has 550 rows (data.shape: (550, 7)).
[For noting] Your book_review groupby has a whopping 269010 rows, while my reproduction of book_review yielded 351 rows × 5 columns.
PS: Updated based on #Siva Shanmugam's edit.
## import libraries
import pandas as pd
import numpy as np
## read dataset
data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/Amazon%20Top%2050%20Bestselling%20Books%202009%20-%202019.csv')
data.head(2)
''' [out]
Name Author User Rating Reviews Price Year Genre
0 10-Day Green Smoothie Cleanse JJ Smith 4.7 17350 8 2016 Non Fiction
1 11/22/63: A Novel Stephen King 4.6 2052 22 2011 Fiction
'''
## check shape
data.shape
''' [out]
(550, 7)
'''
## check dataset
data.describe()
''' [out]
User Rating Reviews Price Year
count 550.000000 550.000000 550.000000 550.000000
mean 4.618364 11953.281818 13.100000 2014.000000
std 0.226980 11731.132017 10.842262 3.165156
min 3.300000 37.000000 0.000000 2009.000000
25% 4.500000 4058.000000 7.000000 2011.000000
50% 4.700000 8580.000000 11.000000 2014.000000
75% 4.800000 17253.250000 16.000000 2017.000000
max 4.900000 87841.000000 105.000000 2019.000000
'''
## check NaN
data.Reviews.isnull().any().any()
''' [out]
False
'''
## mean of reviews (np.math was removed in NumPy 2.0; np.ceil works the same here)
mean_reviews = int(np.ceil(data.Reviews.mean()))
mean_reviews
''' [out]
11954
'''
## group by mean of `User Rating` and `Reviews`
book_review = data.groupby(['Name', 'Author', 'Genre'], as_index=False)[['User Rating', 'Reviews']].mean()
book_review
''' [out]
Name Author Genre User Rating Reviews
0 10-Day Green Smoothie Cleanse JJ Smith Non Fiction 4.7 17350.0
2 12 Rules for Life: An Antidote to Chaos Jordan B. Peterson Non Fiction 4.7 18979.0
3 1984 (Signet Classics) George Orwell Fiction 4.7 21424.0
5 A Dance with Dragons (A Song of Ice and Fire) George R. R. Martin Fiction 4.4 12643.0
6 A Game of Thrones / A Clash of Kings / A Storm... George R. R. Martin Fiction 4.7 19735.0
... ... ... ... ... ...
341 When Breath Becomes Air Paul Kalanithi Non Fiction 4.8 13779.0
342 Where the Crawdads Sing Delia Owens Fiction 4.8 87841.0
345 Wild: From Lost to Found on the Pacific Crest ... Cheryl Strayed Non Fiction 4.4 17044.0
348 Wonder R. J. Palacio Fiction 4.8 21625.0
350 You Are a Badass: How to Stop Doubting Your Gr... Jen Sincero Non Fiction 4.7 14331.0
83 rows × 5 columns
'''
## get book reviews that are less than the mean(reviews)
book_review[book_review.Reviews < mean_reviews]
''' [out]
Name Author Genre User Rating Reviews
​
'''
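To back up the point that the groupby itself behaves as expected: grouping and taking the mean cannot introduce NaN when the grouped values contain none. A tiny self-contained check (toy data, not the Kaggle set):

```python
import pandas as pd

toy = pd.DataFrame({
    'Name': ['Book A', 'Book A', 'Book B'],
    'Author': ['X', 'X', 'Y'],
    'Genre': ['Fiction', 'Fiction', 'Non Fiction'],
    'User Rating': [4.5, 4.7, 4.2],
    'Reviews': [100, 200, 300],
})

# One row per (Name, Author, Genre); means of the numeric columns.
out = toy.groupby(['Name', 'Author', 'Genre'], as_index=False)[['User Rating', 'Reviews']].mean()
print(out)
print(out.isna().any().any())  # no NaN introduced by the groupby
```

If NaN shows up after a groupby on the real data, the likelier culprits are the grouping keys themselves or an earlier cleaning/merge step.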

How can I compare one column of a dataframe to multiple other columns using SequenceMatcher?

I have a dataframe with 6 columns, the first two are an id and a name column, the remaining 4 are potential matches for the name column.
id name match1 match2 match3 match4
1 NXP Semiconductors NaN NaN NaN NaN
2 Cincinnati Children's Hospital Medical Center Montefiore Medical center Children's Hospital Los Angeles Cincinnati Children's Hospital Medical Center SSM Health SLU Hospital
3 Seminole Tribe of Florida The State Board of Administration of Florida NaN NaN NaN
4 Miami-Dade County County of Will County of Orange NaN NaN
5 University of California California Teacher's Association Yale University University of Toronto University System of Georgia
6 Bon Appetit Management Waste Management Sculptor Capital NaN NaN
I'd like to use SequenceMatcher to compare the name column with each match column if there is a value and return the match value with the highest ratio, or closest match, in a new column at the end of the dataframe.
So the output would be something like this:
id name match1 match2 match3 match4 best match
1 NXP Semiconductors NaN NaN NaN NaN NaN
2 Cincinnati Children's Hospital Medical Center Montefiore Medical center Children's Hospital Los Angeles Cincinnati Children's Hospital Medical Center SSM Health SLU Hospital Cincinnati Children's Hospital Medical Center
3 Seminole Tribe of Florida The State Board of Administration of Florida NaN NaN NaN The State Board of Administration of Florida
4 Miami-Dade County County of Will County of Orange NaN NaN County of Orange
5 University of California California Teacher's Association Yale University University of Toronto University System of Georgia California Teacher's Association
6 Bon Appetit Management Waste Management Sculptor Capital NaN NaN Waste Management
I've gotten the data into the dataframe and have been able to compare one column to a single other column using the apply method:
df['diff'] = df.apply(lambda x: diff.SequenceMatcher(None, x[0].strip(), x[1].strip()).ratio(), axis=1)
However, I'm not sure how to loop over multiple columns in the same row. I also thought about trying to reformat my data so it that the method above would work, something like this:
name match
name1 match1
name1 match2
name1 match3
However, I was running into issues dealing with the NaN values. Open to suggestions on the best route to accomplish this.
I ended up solving this using the second idea of reformatting the table. Using the melt function I was able to get a two column table of the name field with each possible match. From there I used the original lambda function to compare the two columns and output a ratio. From there it was relatively easy to go through and see the most likely matches, although it did require some manual effort.
df = pd.read_csv('output.csv')
df1 = df.melt(id_vars=['id', 'name'], var_name='match').dropna().drop(columns='match').sort_values('name')
df1['diff'] = df1.apply(lambda x: diff.SequenceMatcher(None, x['name'].strip(), x['value'].strip()).ratio(), axis=1)
df1.to_csv('comparison-output.csv', encoding='utf-8')
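Alternatively, the NaN handling can be done directly, without reshaping, with a row-wise apply that only scores the non-null match columns. A sketch using difflib (the helper best_match is mine, and the data is a cut-down version of the question's table):

```python
import difflib
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2],
    'name': ["Cincinnati Children's Hospital Medical Center", 'NXP Semiconductors'],
    'match1': ['Montefiore Medical center', np.nan],
    'match2': ["Cincinnati Children's Hospital Medical Center", np.nan],
})

match_cols = ['match1', 'match2']

def best_match(row):
    # Score only the non-null candidates; all-NaN rows get NaN back.
    candidates = [row[c] for c in match_cols if pd.notna(row[c])]
    if not candidates:
        return np.nan
    return max(candidates,
               key=lambda c: difflib.SequenceMatcher(None, row['name'], c).ratio())

df['best match'] = df.apply(best_match, axis=1)
print(df[['name', 'best match']])
```

Note that SequenceMatcher's ratio may not rank candidates the way a human would, so the chosen "best" match is still worth spot-checking.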

How to separate Numbers from string and move them to next column in Python?

I am working on share market data, and in some rows the market cap has shifted into the previous column. I am trying to fetch it into the next column, but the value it returns is completely different.
This is the code I am using -
data['Market Cap (Crores)'] = data['Sub-Sector'].astype('str').str.extractall(r'(\d+)').unstack().fillna('').sum(axis=1).astype(int)
data['Market Cap (Crores)']
But the output I am getting is
968 NaN
969 NaN
970 -2.147484e+09
971 -2.147484e+09
972 -2.147484e+09
How do I get the correct values?
You just do it, step by step. First, pick out the rows that need fixing (where the market cap is NaN). Then I create two functions: one to pull the market cap from the string and one to remove it. I use apply to fix up the rows and substitute the values into the original dataframe.
import pandas as pd
import numpy as np

data = [
    ['GNA Axles Ltd', 'Auto Parts', 1138.846797],
    ['Andhra Paper Ltd', 'Paper Products', 1135.434614],
    ['Tarc', 'Real Estate 1134.645409', np.nan],
    ['Udaipur Cement Works', 'Cement 1133.531734', np.nan],
    ['Pnb Gifts', 'Investment Banking 1130.463641', np.nan],
]

def getprice(row):
    return float(row['Sub-Sector'].split()[-1])

def removeprice(row):
    return ' '.join(row['Sub-Sector'].split()[:-1])

df = pd.DataFrame(data, columns=['Company', 'Sub-Sector', 'Market Cap (Crores)'])
print(df)

picks = df['Market Cap (Crores)'].isna()
rows = df[picks]
print(rows)

df.loc[picks, 'Sub-Sector'] = rows.apply(removeprice, axis=1)
df.loc[picks, 'Market Cap (Crores)'] = rows.apply(getprice, axis=1)
print(df)
Output:
Company Sub-Sector Market Cap (Crores)
0 GNA Axles Ltd Auto Parts 1138.846797
1 Andhra Paper Ltd Paper Products 1135.434614
2 Tarc Real Estate 1134.645409 NaN
3 Udaipur Cement Works Cement 1133.531734 NaN
4 Pnb Gifts Investment Banking 1130.463641 NaN
Company Sub-Sector Market Cap (Crores)
2 Tarc Real Estate 1134.645409 NaN
3 Udaipur Cement Works Cement 1133.531734 NaN
4 Pnb Gifts Investment Banking 1130.463641 NaN
Company Sub-Sector Market Cap (Crores)
0 GNA Axles Ltd Auto Parts 1138.846797
1 Andhra Paper Ltd Paper Products 1135.434614
2 Tarc Real Estate 1134.645409
3 Udaipur Cement Works Cement 1133.531734
4 Pnb Gifts Investment Banking 1130.463641
df['Sub-Sector Number'] = df['Sub-Sector'].astype('str').str.extractall(r'(\d+)').unstack().fillna('').sum(axis=1).astype(int)
df['Sub-Sector final'] = df[['Sub-Sector Number', 'Sub-Sector']].ffill(axis=1).iloc[:, -1]
df
Hi there,
Here is a method you can try: use your code to create a numeric field, then select the non-missing value from Sub-Sector Number and Sub-Sector to create your final field, Sub-Sector final.
Please try it, and if it doesn't work let me know.
Thanks, Leon
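A regex-based sketch of the same repair in one pass is shown below. It assumes the stray market cap, when present, is always a single trailing decimal number (the group names text and num are my own):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Sub-Sector': ['Auto Parts', 'Real Estate 1134.645409'],
    'Market Cap (Crores)': [1138.846797, np.nan],
})

# Split each value into leading text and an optional trailing decimal number.
extracted = df['Sub-Sector'].str.extract(r'^(?P<text>.*?)\s*(?P<num>\d+(?:\.\d+)?)?$')

# Fill the missing market caps from the extracted numbers, then drop
# the numbers from the Sub-Sector strings.
df['Market Cap (Crores)'] = df['Market Cap (Crores)'].fillna(pd.to_numeric(extracted['num']))
df['Sub-Sector'] = extracted['text']
print(df)
```

Because str.extract aligns on the index, fillna only touches the rows where the market cap was missing.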

Using groupby and to create a column with percentage frequency

Working in python, in a Jupyter notebook. I am given this dataframe
congress chamber state party
80 house TX D
80 house TX D
80 house NJ D
80 house TX R
80 senate KY R
of every congressperson since the 80th congressional term, with a bunch of information. I've narrowed it down to what's needed for this question. I want to alter the dataframe so that I have a single row for every unique combination of congressional term, chamber, state, and party affiliation, then a new column with the number of rows of the associated party divided by the number of rows where everything else besides party is the same. For example, this
congress chamber state party perc
80 house TX D 0.66
80 house NJ D 1
80 house TX R 0.33
80 senate KY R 1
is what I'd want my result to look like. The perc column is the percentage of, for example, democrats elected to congress in TX in the 80th congressional election.
I've tried a few different methods I've found on here, but most of them divide the number of rows by the number of rows in the entire dataframe, rather than by just the rows that meet the 3 given criteria. Here's the latest thing I've tried:
term=80
newdf = pd.crosstab(index=df['party'], columns=df['state']).stack()/len(df[df['congress']==term])
I define term because I'll only care about one term at a time for each dataframe.
A method I tried using groupby involved the following:
newdf = df.groupby(['congress', 'chamber', 'state']).agg({'party': 'count'})
state_pcts = newdf.groupby('party').apply(lambda x: 100 * x / float(x.sum()))
And it does group by term, chamber, state, but it returns a number that doesn't mean anything to me, when I check what the actual results should be.
Basically, you can do the following using value_counts for each group:
def func(f):
    return f['party'].value_counts(normalize=True)

df = (df
      .groupby(['congress', 'chamber', 'state'])
      .apply(func)
      .reset_index()
      .rename(columns={'party': 'perc', 'level_3': 'party'}))
print(df)
congress chamber state party perc
0 80 house NJ D 1.000000
1 80 house TX D 0.666667
2 80 house TX R 0.333333
3 80 senate KY R 1.000000
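Since the original attempt used pd.crosstab, it may be worth noting that crosstab can produce the same percentages directly with normalize='index'. A sketch on the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'congress': [80, 80, 80, 80, 80],
    'chamber': ['house', 'house', 'house', 'house', 'senate'],
    'state': ['TX', 'TX', 'NJ', 'TX', 'KY'],
    'party': ['D', 'D', 'D', 'R', 'R'],
})

# normalize='index' divides each crosstab row by its row total,
# i.e. by the count within each (congress, chamber, state) group.
perc = (pd.crosstab([df['congress'], df['chamber'], df['state']], df['party'],
                    normalize='index')
          .stack()
          .rename('perc')
          .reset_index())
perc = perc[perc['perc'] > 0]  # drop party/group combinations that never occur
print(perc)
```

This sidesteps dividing by len(df[df['congress'] == term]), which was the bug in the crosstab attempt: that denominator counts the whole term, not the matching (chamber, state) group.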

Pandas read_html returned column with NaN values in Python

I am trying to parse the table located here using Pandas' read_html function. I was able to parse the table; however, the Capacity column returned NaN. I am not sure what the reason could be. I would like to parse the entire table and use it for further research, so any help is appreciated. Below is my code so far:
wiki_url='Above url'
df1=pd.read_html(wiki_url,index_col=0)
Try something like this (include flavor as bs4):
df = pd.read_html(r'https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_FBS_football_stadiums',header=[0],flavor='bs4')
df = df[0]
print(df.head())
Image Stadium City State \
0 NaN Aggie Memorial Stadium Las Cruces NM
1 NaN Alamodome San Antonio TX
2 NaN Alaska Airlines Field at Husky Stadium Seattle WA
3 NaN Albertsons Stadium Boise ID
4 NaN Allen E. Paulson Stadium Statesboro GA
Team Conference Capacity \
0 New Mexico State Independent 30,343[1]
1 UTSA C-USA 65000
2 Washington Pac-12 70,500[2]
3 Boise State Mountain West 36,387[3]
4 Georgia Southern Sun Belt 25000
.............................
.............................
To replace anything under square brackets use:
df.Capacity = df.Capacity.str.replace(r"\[.*\]", "", regex=True)  # regex=True is required in newer pandas
print(df.Capacity.head())
0 30,343
1 65000
2 70,500
3 36,387
4 25000
Hope this helps.
Pandas is only able to get the superscript (for whatever reason) rather than the actual value. If you print all of df1 and check the Capacity column, you will see that some of the values are [1], [2], etc. (where the cells have footnotes) and NaN otherwise.
You may want to look into alternative ways of fetching the data, or scrape it yourself using BeautifulSoup, since Pandas is picking up the wrong element and therefore returning the wrong data.
The answer posted by #anky_91 was correct. I wanted to try another approach without using regex. Below was my solution without regex.
df4=pd.read_html('https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_FBS_football_stadiums',header=[0],flavor='bs4')
df4 = df4[0]
The change was to take out the "r" prefix presented by #anky_91 in line 1 and line 4.
print(df4.Capacity.head())
0 30,343
1 65000
2 70,500
3 36,387
4 25000
Name: Capacity, dtype: object
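Whichever way the footnote markers are removed, Capacity remains a string column with thousands separators (dtype: object). A small sketch, on hand-typed sample values rather than the live page, of finishing the conversion to numbers:

```python
import pandas as pd

capacity = pd.Series(['30,343[1]', '65000', '70,500[2]', '36,387[3]', '25000'])

# Strip the [n] footnote markers, then the thousands separators,
# and convert what remains to numbers.
clean = (capacity.str.replace(r'\[.*?\]', '', regex=True)
                 .str.replace(',', '', regex=False))
numeric = pd.to_numeric(clean)
print(numeric.tolist())
```

After this, the column can be used for sorting, filtering, and arithmetic like any other numeric column.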
