I want to compare this dataframe df1:
Product Price
0 Waterproof Liner 40
1 Phone Tripod 50
2 Waterproof Pants 0
3 baby Kids play Mat 985
4 Hiking BACKPACKS 34
5 security Camera 160
with df2 as shown below:
Product Id
0 Home Security IP Camera 508760
1 Hiking Backpacks – Spring Products 287950
2 Waterproof Eyebrow Liner 678897
3 Waterproof Pants – Winter Product 987340
4 Baby Kids Water Play Mat – Summer Product 111500
I want to compare the Product column in df1 with the Product column in df2 in order to find the correct Id for each product. If the similarity is below 80, 'Remove' should be written in the ID field instead.
NB: the Product texts in df1 and df2 do not match 100%.
Can anyone help me with this, or tell me how I can use fuzzywuzzy to get the correct id?
Here is my code:
import pandas as pd
from fuzzywuzzy import process
data1 = {'Product1': ['Waterproof Liner','Phone Tripod','Waterproof Pants','baby Kids play Mat','Hiking BACKPACKS','security Camera'],
'Price':[40,50,0,985,34,160]}
data2 = {'Product2': ['Home Security IP Camera','Hiking Backpacks – Spring Products','Waterproof Eyebrow Liner',
'Waterproof Pants – Winter Product','Baby Kids Water Play Mat – Summer Product'],
'Id': [508760,287950,678897,987340,111500],}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
dfm = pd.DataFrame(df1["Product1"].apply(lambda x: process.extractOne(x, df2["Product2"]))
.tolist(), columns=['Product1',"match_comp", "Id"])
What I got:
Product1 match_comp Id
0 Waterproof Eyebrow Liner 86 2
1 Waterproof Eyebrow Liner 50 2
2 Waterproof Pants – Winter Product 90 3
3 Baby Kids Water Play Mat – Summer Product 86 4
4 Hiking Backpacks – Spring Products 90 1
5 Home Security IP Camera 86 0
What I expect:
Product Price ID
0 Waterproof Liner 40 678897
1 Phone Tripod 50 Remove
2 Waterproof Pants 0 987340
3 baby Kids play Mat 985 111500
4 Hiking BACKPACKS 34 287950
5 security Camera 160 508760
You can make a wrapper function:
def extract(s):
    name, score, _ = process.extractOne(s, df2["Product2"], score_cutoff=0)
    if score < 80:
        return 'Remove'
    return df2.set_index('Product2').loc[name, 'Id']

df1['ID'] = df1["Product1"].apply(extract)
output:
Product1 Price ID
0 Waterproof Liner 40 678897
1 Phone Tripod 50 Remove
2 Waterproof Pants 0 987340
3 baby Kids play Mat 985 111500
4 Hiking BACKPACKS 34 287950
5 security Camera 160 508760
NB: the output is not exactly what you expect; you would have to explain why rows 4/5 should be dropped.
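As a side note, fuzzywuzzy has been superseded by thefuzz/rapidfuzz. The same idea can be written with rapidfuzz, whose process.extractOne returns None when nothing reaches the score_cutoff, so the threshold check folds into the call. A minimal sketch, assuming rapidfuzz is installed (best_id is just an illustrative helper name):
from rapidfuzz import process, fuzz

names = df2['Product2'].tolist()

def best_id(product, cutoff=80):
    # returns (match, score, index) or None if no choice reaches the cutoff
    match = process.extractOne(product, names, scorer=fuzz.WRatio, score_cutoff=cutoff)
    return 'Remove' if match is None else df2['Id'].iloc[match[2]]

df1['ID'] = df1['Product1'].apply(best_id)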
Related
I have a dataframe of actor names:
df1
actor_id actor_name
1 Brad Pitt
2 Nicole Kidman
3 Matthew Goode
4 Uma Thurman
5 Ethan Hawke
And another dataframe of movies that the actors were in:
df2
actor_id actor_movie movie_revenue_m
1 Once Upon a Time in Hollywood 150
2 The Others 50
2 Moulin Rouge 200
3 Stoker 75
4 Kill Bill 125
5 Gattaca 85
I want to merge the two dataframes together to show the actors with their movie names and movie revenues, so I use the merge function:
df3 = df1.merge(df2, on = 'actor_id', how = 'left')
df3
actor_id actor_name actor_movie movie_revenue
1 Brad Pitt Once Upon a Time in Hollywood 150
2 Nicole Kidman Moulin Rouge 50
2 Nicole Kidman The Others 200
3 Matthew Goode Stoker 75
4 Uma Thurman Kill Bill 125
5 Ethan Hawke Gattaca 85
But this pulls in all movies, so Nicole Kidman gets duplicated, and I only want to show one movie per actor. How can I merge the dataframes without "duplicating" my list of actors?
How would I merge the movie title that is alphabetically first?
How would I merge the movie title with the highest revenue?
Thank you!
One way is to continue with the merge and then filter the result set:
movie title that is alphabetically first
# sort by name and movie, then pick the first row while grouping by actor
df3.sort_values(['actor_name', 'actor_movie']).groupby('actor_id', as_index=False).first()
actor_id actor_name actor_movie movie_revenue
0 1 Brad Pitt Once Upon a Time in Hollywood 150
1 2 Nicole Kidman Moulin Rouge 50
2 3 Matthew Goode Stoker 75
3 4 Uma Thurman Kill Bill 125
4 5 Ethan Hawke Gattaca 85
movie title with the highest revenue
# sort by name and revenue (descending), group by actor and pick the first row
df3.sort_values(['actor_name', 'movie_revenue'], ascending=[True, False]).groupby('actor_id', as_index=False).first()
actor_id actor_name actor_movie movie_revenue
0 1 Brad Pitt Once Upon a Time in Hollywood 150
1 2 Nicole Kidman The Others 200
2 3 Matthew Goode Stoker 75
3 4 Uma Thurman Kill Bill 125
4 5 Ethan Hawke Gattaca 85
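For completeness, the same two results can also be obtained without groupby().first(), by deduplicating or indexing into the merged frame directly; a sketch, using the column names as printed in the merged output above:
# alphabetically first movie per actor: sort, then keep the first row per actor_id
df3.sort_values(['actor_id', 'actor_movie']).drop_duplicates('actor_id')

# highest-revenue movie per actor: take the row holding each group's maximum revenue
df3.loc[df3.groupby('actor_id')['movie_revenue'].idxmax()]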
How can I remove only the first 2 characters of a string when it starts with 11?
My df:
Product1 Id
0 Waterproof Liner 114890
1 Phone Tripod 981150
2 Waterproof Pants 0
3 baby Kids play Mat 1198547
4 Hiking BACKPACKS 113114
5 security Camera 111160
Product1 object
Id object
dtype: object
Expected output:
Product1 Id
0 Waterproof Liner 4890
1 Phone Tripod 981150
2 Waterproof Pants 0
3 baby Kids play Mat 98547
4 Hiking BACKPACKS 3114
5 security Camera 1160
I wrote this:
df1['Id'] = df1['Id'].str.replace("11", "")
But I got this output:
Product1 Id
0 Waterproof Liner 4890
1 Phone Tripod 9850
2 Waterproof Pants 0
3 baby Kids play Mat 98547
4 Hiking BACKPACKS 34
5 security Camera 60
Force the match at the beginning of the string (note that recent pandas versions treat the pattern as a literal string by default, so pass regex=True):
df1['Id'] = df1['Id'].str.replace("^11", "", regex=True)
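If you prefer to avoid regex, an equivalent approach (a sketch, assuming Id is a string/object column as the dtypes above indicate) is to slice off the first two characters only where the value starts with "11":
# only rows whose Id begins with "11" get the first two characters removed
mask = df1['Id'].str.startswith('11')
df1.loc[mask, 'Id'] = df1.loc[mask, 'Id'].str[2:]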
I want to merge two dataframes by partial string match.
I have two data frames to combine. The first, df1, consists of 130,000 rows like this:
id text xc1 xc2
1 adidas men shoes 52465 220
2 vakko men suits 49220 224
3 burberry men shirt 78248 289
4 prada women shoes 45780 789
5 lcwaikiki men sunglasses 34788 745
and the second, df2, consists of 8,000 rows like this:
id keyword abc1 abc2
1 men shoes 1000 11
2 men suits 2000 12
3 men shirt 3000 13
4 women socks 4000 14
5 men sunglasses 5000 15
After matching keyword against text, the output should look like this:
id text xc1 xc2 keyword abc1 abc2
1 adidas men shoes 52465 220 men shoes 1000 11
2 vakko men suits 49220 224 men suits 2000 12
3 burberry men shirt 78248 289 men shirt 3000 13
4 lcwaikiki men sunglasses 34788 745 men sunglasses 5000 15
Let's approach this by cross joining the two dataframes and then filtering for the rows where the keyword matches inside the text, as follows:
import re

df3 = df1.merge(df2, how='cross')  # for Pandas version >= 1.2.0 (released in Dec 2020)
mask = df3.apply(lambda x: re.search(rf"\b{x['keyword']}\b", str(x['text'])) is not None, axis=1)
df_out = df3.loc[mask]
If your Pandas version is older than 1.2.0 (released in Dec 2020) and does not support merge with how='cross', you can replace the merge statement with:
# For Pandas version < 1.2.0
df3 = df1.assign(key=1).merge(df2.assign(key=1), on='key').drop('key', axis=1)
After the cross join, we create a boolean mask with re.search inside .apply() to keep only the rows where keyword is found within text.
We have to use re.search instead of a simple Python substring test like stringA in stringB, which most similar StackOverflow answers use. Such a test produces a false match of the keyword 'men suits' against the text 'women suits', because 'men suits' in 'women suits' returns True.
We use regex with a pair of word-boundary \b meta-characters around the keyword (regex pattern: rf"\b{x['keyword']}\b") to ensure we only match whole words in the text of df1, i.e. men suits in df2 will not match women suits in df1, since there is no word boundary between the letters wo and men in women.
Result:
print(df_out)
id_x text xc1 xc2 id_y keyword abc1 abc2
0 1 adidas men shoes 52465 220 1 men shoes 1000 11
6 2 vakko men suits 49220 224 2 men suits 2000 12
12 3 burberry men shirt 78248 289 3 men shirt 3000 13
24 5 lcwaikiki men sunglasses 34788 745 5 men sunglasses 5000 15
Here, columns id_x and id_y are the original id columns of df1 and df2 respectively. These are just row numbers of the dataframes that you may not care about, so we can remove these 2 columns and reset the index to clean up the layout:
df_out = df_out.drop(['id_x', 'id_y'], axis=1).reset_index(drop=True)
Final outcome
print(df_out)
text xc1 xc2 keyword abc1 abc2
0 adidas men shoes 52465 220 men shoes 1000 11
1 vakko men suits 49220 224 men suits 2000 12
2 burberry men shirt 78248 289 men shirt 3000 13
3 lcwaikiki men sunglasses 34788 745 men sunglasses 5000 15
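As a quick sanity check on the word-boundary point above, here is the difference between a plain substring test and the \b pattern on a made-up sample text:
import re

text = 'prada women suits'
# plain substring test: false positive, because 'women suits' contains 'men suits'
print('men suits' in text)                   # True
# word-boundary pattern: no match, since 'wo' and 'men' are not separated by a boundary
print(re.search(r"\bmen suits\b", text))     # None
print(re.search(r"\bwomen suits\b", text))   # <re.Match object; span=(6, 17), match='women suits'>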
Let's start by ordering the keywords longest-first, so that "women suits" matches before "men suits":
lkeys = df2.keyword.reindex(df2.keyword.str.len().sort_values(ascending=False).index)
Now define a matching function; each text value from df1 will be passed as s to find a matching keyword:
def is_match(arr, s):
    for a in arr:
        if a in s:
            return a
    return None
Now we can extract the keyword from each text in df1, and add it to a new column:
df1['keyword'] = df1['text'].apply(lambda x: is_match(lkeys, x))
We now have everything we need for a standard merge:
pd.merge(df1, df2, on='keyword')
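Note that is_match returns None for a text that contains none of the keywords; if you want to keep such rows (with NaN in the df2 columns) rather than drop them, switch the final merge to a left join:
pd.merge(df1, df2, on='keyword', how='left')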
I am trying to run a hypothesis test using an OLS model. I want to fit the model for tweet count based on the four groups in my data frame: Athletes, CEOs, Politicians, and Celebrities. Each name in one column is labeled with its group.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

frames = [CEO_df, athletes_df, Celebrity_df, politicians_df]
final_df = pd.concat(frames)
final_df = final_df.reindex(columns=["name", "group", "tweet_count", "retweet_count", "favorite_count"])
final_df

model = ols("tweet_count ~ C(group)", data=final_df).fit()
table = sm.stats.anova_lm(model, typ=2)
print(table)
I want to do something along the lines of:
model=ols("tweet_count ~ C(Athlete) + C(Celebrity) + C(CEO) + C(Politicians)", data=final_df).fit()
table=sm.stats.anova_lm(model, typ=2)
print(table)
Is that even possible? How else will I be able to run a hypothesis test with those conditions?
Here is my printed final_df:
name group tweet_count retweet_count favorite_count
0 #aws_cloud # #ReInvent R “Ray” Wang 王瑞光 #1A CEO 6 6 0
1 Aaron Levie CEO 48 1140 18624
2 Andrew Mason CEO 24 0 0
3 Bill Gates CEO 114 78204 439020
4 Bill Gross CEO 36 486 1668
... ... ... ... ... ...
56 Tim Kaine Politician 48 8346 50898
57 Tim O'Reilly Politician 14 28 0
58 Trey Gowdy Politician 12 1314 6780
59 Vice President Mike Pence Politician 84 1146408 0
60 klay thompson Politician 48 41676 309924
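For context on the formula question: C(group) expands the group column into per-group indicator variables, so a single C(group) term already represents all four categories (the formula interface drops one level as the reference). A small, purely illustrative sketch of that encoding using pandas:
import pandas as pd

demo = pd.DataFrame({'group': ['CEO', 'Athlete', 'Celebrity', 'Politician']})
# roughly the indicator columns that C(group) generates behind the scenes
print(pd.get_dummies(demo['group'], prefix='group'))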
My dataset is based on the results of Food Inspections in the City of Chicago.
import pandas as pd
df = pd.read_csv("C:/~/Food_Inspections.csv")
df.head()
Out[1]:
Inspection ID DBA Name \
0 1609238 JR'SJAMAICAN TROPICAL CAFE,INC
1 1609245 BURGER KING
2 1609237 DUNKIN DONUTS / BASKIN ROBINS
3 1609258 CHIPOTLE MEXICAN GRILL
4 1609244 ATARDECER ACAPULQUENO INC.
AKA Name License # Facility Type Risk \
0 NaN 2442496.0 Restaurant Risk 1 (High)
1 BURGER KING 2411124.0 Restaurant Risk 2 (Medium)
2 DUNKIN DONUTS / BASKIN ROBINS 1717126.0 Restaurant Risk 2 (Medium)
3 CHIPOTLE MEXICAN GRILL 1335044.0 Restaurant Risk 1 (High)
4 ATARDECER ACAPULQUENO INC. 1910118.0 Restaurant Risk 1 (High)
Here is how often each of the facilities appear in the dataset:
df['Facility Type'].value_counts()
Out[3]:
Restaurant 14304
Grocery Store 2647
School 1155
Daycare (2 - 6 Years) 367
Bakery 316
Children's Services Facility 262
Daycare Above and Under 2 Years 248
Long Term Care 169
Daycare Combo 1586 142
Catering 123
Liquor 78
Hospital 68
Mobile Food Preparer 67
Golden Diner 65
Mobile Food Dispenser 51
Special Event 25
Shared Kitchen User (Long Term) 22
Daycare (Under 2 Years) 18
I am trying to create a new set of data containing those rows where its Facility Type has over 50 occurrences in the dataset. How would I approach this?
Please note the list of facility counts is MUCH LARGER as I have cut out most of the information as it did not contribute to the question at hand (so simply removing occurrences of "Special Event", " Shared Kitchen User", and "Daycare" is not what I'm looking for).
IIUC then you want to filter:
df.groupby('Facility Type').filter(lambda x: len(x) > 50)
Example:
In [9]:
df = pd.DataFrame({'type':list('aabcddddee'), 'value':np.random.randn(10)})
df
Out[9]:
type value
0 a -0.160041
1 a -0.042310
2 b 0.530609
3 c 1.238046
4 d -0.754779
5 d -0.197309
6 d 1.704829
7 d -0.706467
8 e -1.039818
9 e 0.511638
In [10]:
df.groupby('type').filter(lambda x: len(x) > 1)
Out[10]:
type value
0 a -0.160041
1 a -0.042310
4 d -0.754779
5 d -0.197309
6 d 1.704829
7 d -0.706467
8 e -1.039818
9 e 0.511638
Not tested, but should work.
FT = df['Facility Type'].value_counts()
df[df['Facility Type'].isin(FT.index[FT > 50])]
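Another way to express the same filter, often faster than groupby().filter() with a Python lambda, is a vectorized group-size transform (a sketch, assuming the same df as above):
# keep rows whose Facility Type occurs more than 50 times, preserving row order
df[df.groupby('Facility Type')['Facility Type'].transform('size') > 50]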