Fuzzy matching inside a column - python

Suppose I have a list of sports like this:
sports = ["futball", "fitbal", "football", "tennis", "tenis", "tenisse", "footbal", "zennis", "ping-pong"]
I would like to create a dataframe that matches each element of sports with its closest match if the fuzzy score is above 0.5 (i.e. fuzz.ratio above 50), and with itself otherwise. (I want to use the function fuzzywuzzy.fuzz.ratio(x, y) for that.)
The result should look like:
pd.DataFrame({"sport":sports,"closest_match":["futball","futball","football","tennis","tennis","tennis","futball","tennis","ping-pong"]})
       sport closest_match
0    futball       futball
1     fitbal       futball
2   football      football
3     tennis        tennis
4      tenis        tennis
5    tenisse        tennis
6    footbal       futball
7     zennis        tennis
8  ping-pong     ping-pong
Thanks

Here is a solution using itertools.combinations:
import itertools
from fuzzywuzzy import fuzz
import pandas as pd

sports = ["futball", "fitbal", "football", "tennis", "tenis", "tenisse", "footbal", "zennis", "ping-pong"]
# keep only pairs whose ratio clears the 50 threshold, together with that ratio
dist = [(x, y, fuzz.ratio(x, y)) for x, y in itertools.combinations(sports, 2) if fuzz.ratio(x, y) > 50]
df = pd.DataFrame(dist, columns=["sport", "closest", "ratio"])
print(df)
df = df.groupby(['sport'])[['closest', 'ratio']].agg('max').reset_index()
output:
      sport   closest  ratio
0    fitbal  football     77
1  football   footbal     93
2   futball  football     80
3     tenis    zennis     83
4   tenisse    zennis     62
5    tennis    zennis     91
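Note that the grouped output above is not yet the single closest_match column the question asks for. A minimal sketch of a more direct route, assuming "closest" means the best-scoring other element (the closest_match helper below is my name, not part of the answer), scores each sport against all the others with process.extractOne and falls back to the sport itself below the threshold:
import pandas as pd
from fuzzywuzzy import fuzz, process

sports = ["futball", "fitbal", "football", "tennis", "tenis", "tenisse", "footbal", "zennis", "ping-pong"]

def closest_match(s, threshold=50):
    # best-scoring element among the other sports, or s itself if nothing clears the threshold
    others = [x for x in sports if x != s]
    match, score = process.extractOne(s, others, scorer=fuzz.ratio)
    return match if score > threshold else s

df = pd.DataFrame({"sport": sports, "closest_match": [closest_match(s) for s in sports]})
print(df)
This answers the "closest element or itself" formulation; it will not necessarily reproduce the exact expected column above, which maps every variant to one canonical spelling.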

Related

How to reverse the order of a specific column in python

I have extracted some data online and I would like to reverse the order of the first column.
Here is my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

data = []
soup = BeautifulSoup(requests.get("https://kworb.net/spotify/country/us_weekly.html").content, 'html.parser')
for e in soup.select('#spotifyweekly tr:has(td)'):
    data.append({
        'Frequency': e.td.text,
        'Artists': e.a.text,
        'Songs': e.a.find_next_sibling('a').text
    })
data2 = data[:100]
print(data2)
data = pd.DataFrame(data2).to_excel('Kworb_Weekly.xlsx', index=False)
And here is my output:
(screenshot of the output: https://i.stack.imgur.com/TmGmI.png)
I've used [::-1], but it reversed all the columns, and I only want to reverse the first one.
Your first column is 'Frequency', so you can select that column from the dataframe and assign its reversed values back to it with slicing on both sides:
data = pd.DataFrame(data2)
print(data)
data['Frequency'][::1] = data['Frequency'][::-1]
print(data)
Got this as the output:
Frequency Artists Songs
0 1 SZA Kill Bill
1 2 PinkPantheress Boy's a liar Pt. 2
2 3 Miley Cyrus Flowers
3 4 Morgan Wallen Last Night
4 5 Lil Uzi Vert Just Wanna Rock
.. ... ... ...
95 96 Lizzo Special
96 97 Glass Animals Heat Waves
97 98 Frank Ocean Pink + White
98 99 Foo Fighters Everlong
99 100 Meghan Trainor Made You Look
[100 rows x 3 columns]
Frequency Artists Songs
0 100 SZA Kill Bill
1 99 PinkPantheress Boy's a liar Pt. 2
2 98 Miley Cyrus Flowers
3 97 Morgan Wallen Last Night
4 96 Lil Uzi Vert Just Wanna Rock
.. ... ... ...
95 5 Lizzo Special
96 4 Glass Animals Heat Waves
97 3 Frank Ocean Pink + White
98 2 Foo Fighters Everlong
99 1 Meghan Trainor Made You Look
[100 rows x 3 columns]
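A minimal variant of the same idea (a sketch, assuming the data2 list built by the scraping code above) avoids the chained slice assignment by writing the reversed values back through a NumPy array, so no index alignment takes place:
data = pd.DataFrame(data2)
# .to_numpy()[::-1] yields the values in reverse order without the index,
# so the assignment is purely positional
data['Frequency'] = data['Frequency'].to_numpy()[::-1]
print(data)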

partial match to compare 2 columns from different dataframes using fuzzy wuzzy

I want to compare this dataframe df1:
Product Price
0 Waterproof Liner 40
1 Phone Tripod 50
2 Waterproof Pants 0
3 baby Kids play Mat 985
4 Hiking BACKPACKS 34
5 security Camera 160
with df2 as shown below:
Product Id
0 Home Security IP Camera 508760
1 Hiking Backpacks – Spring Products 287950
2 Waterproof Eyebrow Liner 678897
3 Waterproof Pants – Winter Product 987340
4 Baby Kids Water Play Mat – Summer Product 111500
I want to compare the Product column in df1 with the Product column in df2 in order to find the correct Id for each product, and if the similarity is < 80 it should put 'Remove' in the Id field.
NB: the text of the Product column in df1 and df2 does not match 100%.
Can anyone help me with this, or show me how to use fuzzywuzzy to get the correct Id?
Here is my code:
import pandas as pd
from fuzzywuzzy import process

data1 = {'Product1': ['Waterproof Liner', 'Phone Tripod', 'Waterproof Pants',
                      'baby Kids play Mat', 'Hiking BACKPACKS', 'security Camera'],
         'Price': [40, 50, 0, 985, 34, 160]}
data2 = {'Product2': ['Home Security IP Camera', 'Hiking Backpacks – Spring Products',
                      'Waterproof Eyebrow Liner', 'Waterproof Pants – Winter Product',
                      'Baby Kids Water Play Mat – Summer Product'],
         'Id': [508760, 287950, 678897, 987340, 111500]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
dfm = pd.DataFrame(df1["Product1"].apply(lambda x: process.extractOne(x, df2["Product2"]))
                   .tolist(), columns=['Product1', "match_comp", "Id"])
What I got:
Product1 match_comp Id
0 Waterproof Eyebrow Liner 86 2
1 Waterproof Eyebrow Liner 50 2
2 Waterproof Pants – Winter Product 90 3
3 Baby Kids Water Play Mat – Summer Product 86 4
4 Hiking Backpacks – Spring Products 90 1
5 Home Security IP Camera 86 0
What is expected:
Product Price ID
0 Waterproof Liner 40 678897
1 Phone Tripod 50 Remove
2 Waterproof Pants 0 987340
3 baby Kids play Mat 985 111500
4 Hiking BACKPACKS 34 287950
5 security Camera 160 508760
You can make a wrapper function:
def extract(s):
    # extractOne on a Series returns (match, score, index)
    name, score, _ = process.extractOne(s, df2["Product2"], score_cutoff=0)
    if score < 80:
        return 'Remove'
    return df2.set_index('Product2').loc[name, 'Id']

df1['ID'] = df1["Product1"].apply(extract)
output:
Product1 Price ID
0 Waterproof Liner 40 678897
1 Phone Tripod 50 Remove
2 Waterproof Pants 0 987340
3 baby Kids play Mat 985 111500
4 Hiking BACKPACKS 34 287950
5 security Camera 160 508760
NB: the output is not exactly what you expect; you would have to explain why rows 4/5 should be dropped.
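A close variant of the wrapper (a sketch using the same df1/df2; best_id and id_by_product are names I introduced) lets score_cutoff do the thresholding, since extractOne returns None when no choice scores at least 80:
from fuzzywuzzy import process

# look up Ids by product name once, instead of re-indexing df2 on every call
id_by_product = df2.set_index('Product2')['Id']

def best_id(s):
    match = process.extractOne(s, df2['Product2'], score_cutoff=80)
    return id_by_product[match[0]] if match else 'Remove'

df1['ID'] = df1['Product1'].apply(best_id)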

How to delete row in Pandas Dataframe using 2 columns as condition?

Basically, I got a table like the following:
Name Sport Frequency
Jonas Soccer 3
Jonas Tennis 5
Jonas Boxing 4
Mathew Soccer 2
Mathew Tennis 1
John Boxing 2
John Boxing 3
John Soccer 1
Let's say this is a standard table that I transform into a pandas DataFrame, using the groupby function like this:
table = df.groupby(['Name'])
After the dataframe is created, I want to delete all rows where the frequency of a sport other than Soccer is greater than that name's Soccer frequency.
So I need to apply the following conditions:
Identify where Soccer is present; and then
if so, identify whether any other sport is present; and finally
delete rows where the sport is anything other than Soccer and its frequency is greater than the Soccer frequency associated with that name (used in the groupby function).
So, the output would be something like:
Name Sport Frequency
Jonas Soccer 3
Mathew Soccer 2
Mathew Tennis 1
John Soccer 1
Thank you for your support
This is one way to go about it, by iterating through the groups:
pd.concat(
    [
        value.assign(temp=lambda x: x.loc[x.Sport == "Soccer", "Frequency"])
             .bfill()
             .ffill()
             .query("Frequency <= temp")
             .drop('temp', axis=1)
        for key, value in df.groupby("Name")
    ]
)
Name Sport Frequency
7 John Soccer 1
0 Jonas Soccer 3
3 Mathew Soccer 2
4 Mathew Tennis 1
You could also create a categorical type for the Sport column, sort the dataframe, then group:
sport_dtype = pd.api.types.CategoricalDtype(categories=df.Sport.unique(), ordered=True)
df = df.astype({"Sport": sport_dtype})
(
    df.sort_values(["Name", "Sport"], ascending=[False, True])
      .assign(temp=lambda x: x.loc[x.Sport == "Soccer", "Frequency"])
      .ffill()
      .query("Frequency <= temp")
      .drop('temp', axis=1)
)
Name Sport Frequency
3 Mathew Soccer 2
4 Mathew Tennis 1
0 Jonas Soccer 3
7 John Soccer 1
Note that this works because Soccer is the first entry in the Sport column; if it is not, you have to reorder the categories so that Soccer comes first.
Another option is to get the index of rows that meet our criteria and filter the dataframe:
index = (
    df.assign(temp=lambda x: x.loc[x.Sport == "Soccer", "Frequency"])
      .groupby("Name")
      .pipe(lambda x: x.ffill().bfill())
      .query("Frequency <= temp")
      .index
)
df.loc[index]
Name Sport Frequency
0 Jonas Soccer 3
3 Mathew Soccer 2
4 Mathew Tennis 1
7 John Soccer 1
A bit surprised that I lost the grouping index though.
UPDATE: Gave this some thought; this may be a simpler solution: find the rows where the sport is Soccer or the average is greater than or equal to 0.5. The average ensures that Soccer is not less than the others.
(df.assign(temp=df.Sport == "Soccer",
           temp2=lambda x: x.groupby("Name").temp.transform("mean"),
           )
   .query('Sport == "Soccer" or temp2 >= 0.5')
   .iloc[:, :3]
)
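A further compact sketch of the rule from the question (not part of the answer above; soccer_freq is my name, and it assumes at most one Soccer row per name): look up each name's Soccer frequency and keep only the rows that do not exceed it.
# Soccer frequency per name; names with no Soccer row map to NaN and are dropped
soccer_freq = df.loc[df["Sport"] == "Soccer"].set_index("Name")["Frequency"]
out = df[df["Frequency"] <= df["Name"].map(soccer_freq)]
print(out)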

Folium FeatureGroup in Python

I am trying to create maps using Folium FeatureGroup. The feature groups come from rows of a pandas dataframe. I am able to achieve this when there is a single row in the dataframe, but when there is more than one row and I loop through it in a for loop, I am not able to achieve what I want. Please find the Python code below.
from folium import Map, FeatureGroup, Marker, LayerControl

mapa = Map(location=[35.11567262307692, -89.97423444615382], zoom_start=12,
           tiles='Stamen Terrain')
feature_group1 = FeatureGroup(name='Tim')
feature_group2 = FeatureGroup(name='Andrew')
feature_group1.add_child(Marker([35.035075, -89.89969], popup='Tim'))
feature_group2.add_child(Marker([35.821835, -90.70503], popup='Andrew'))
mapa.add_child(feature_group1)
mapa.add_child(feature_group2)
mapa.add_child(LayerControl())
mapa
My dataframe contains the following:
Name Address
0 Dollar Tree #2020 3878 Goodman Rd.
1 Dollar Tree #2020 3878 Goodman Rd.
2 National Guard Products Inc 4985 E Raines Rd
3 434 SAVE A LOT C MID WEST 434 Kelvin 3240 Jackson Ave
4 WALGREENS 06765 108 E HIGHLAND DR
5 Aldi #69 4720 SUMMER AVENUE
6 Richmond, Christopher 1203 Chamberlain Drive
City State Zipcode Group
0 Horn Lake MS 38637 Johnathan Shaw
1 Horn Lake MS 38637 Tony Bonetti
2 Memphis TN 38118 Tony Bonetti
3 Memphis TN 38122 Tony Bonetti
4 JONESBORO AR 72401 Josh Jennings
5 Memphis TN 38122 Josh Jennings
6 Memphis TN 38119 Josh Jennings
full_address Color sequence \
0 3878 Goodman Rd.,Horn Lake,MS,38637,USA blue 1
1 3878 Goodman Rd.,Horn Lake,MS,38637,USA cadetblue 1
2 4985 E Raines Rd,Memphis,TN,38118,USA cadetblue 2
3 3240 Jackson Ave,Memphis,TN,38122,USA cadetblue 3
4 108 E HIGHLAND DR,JONESBORO,AR,72401,USA yellow 1
5 4720 SUMMER AVENUE,Memphis,TN,38122,USA yellow 2
6 1203 Chamberlain Drive,Memphis,TN,38119,USA yellow 3
Latitude Longitude
0 34.962637 -90.069019
1 34.962637 -90.069019
2 35.035367 -89.898428
3 35.165115 -89.952624
4 35.821835 -90.705030
5 35.148707 -89.903760
6 35.098829 -89.866838
When I try to do the same in a for loop, I am not able to achieve what I need:
from folium import Map, FeatureGroup, Marker, LayerControl
from folium import plugins

mapa = Map(location=[35.11567262307692, -89.97423444615382], zoom_start=12, tiles='Stamen Terrain')
#mapa.add_tile_layer()
for i in range(0, len(df_addresses)):
    feature_group = FeatureGroup(name=df_addresses.iloc[i]['Group'])
    feature_group.add_child(Marker([df_addresses.iloc[i]['Latitude'], df_addresses.iloc[i]['Longitude']],
                                   popup=('Address: ' + str(df_addresses.iloc[i]['full_address']) + '<br>'
                                          'Tech: ' + str(df_addresses.iloc[i]['Group'])),
                                   icon=plugins.BeautifyIcon(
                                       number=str(df_addresses.iloc[i]['sequence']),
                                       border_width=2,
                                       iconShape='marker',
                                       inner_icon_style='margin-top:2px',
                                       background_color=df_addresses.iloc[i]['Color'],
                                   )))
    mapa.add_child(feature_group)
mapa.add_child(LayerControl())
This is an example dataset because I didn't want to format your df. That said, I think you'll get the idea.
print(df_addresses)
Latitude Longitude Group
0 34.962637 -90.069019 B
1 34.962637 -90.069019 B
2 35.035367 -89.898428 A
3 35.165115 -89.952624 B
4 35.821835 -90.705030 A
5 35.148707 -89.903760 A
6 35.098829 -89.866838 A
After I create the map object (mapa), I perform a groupby on the Group column and iterate through each group. I first create a FeatureGroup with the grp_name (A or B), and for each group I iterate through that group's dataframe, create Markers, and add them to the FeatureGroup.
import folium

mapa = folium.Map(location=[35.11567262307692, -89.97423444615382], zoom_start=12,
                  tiles='Stamen Terrain')
for grp_name, df_grp in df_addresses.groupby('Group'):
    feature_group = folium.FeatureGroup(grp_name)
    for row in df_grp.itertuples():
        folium.Marker(location=[row.Latitude, row.Longitude]).add_to(feature_group)
    feature_group.add_to(mapa)
folium.LayerControl().add_to(mapa)
mapa
Regarding the Stamen Terrain query: if you're referring to its appearance in the control box, you can remove it by declaring your map with tiles=None and adding the TileLayer separately with control set to False: folium.TileLayer('Stamen Terrain', control=False).add_to(mapa)
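Putting the groupby approach together with the popups from the question and the tiles=None suggestion gives something like the following sketch (it assumes df_addresses has the Latitude, Longitude, full_address and Group columns shown above):
import folium

mapa = folium.Map(location=[35.11567262307692, -89.97423444615382], zoom_start=12, tiles=None)
# keep the base tile layer out of the LayerControl box
folium.TileLayer('Stamen Terrain', control=False).add_to(mapa)

for grp_name, df_grp in df_addresses.groupby('Group'):
    feature_group = folium.FeatureGroup(name=grp_name)
    for row in df_grp.itertuples():
        folium.Marker(
            location=[row.Latitude, row.Longitude],
            popup=f"Address: {row.full_address}<br>Tech: {row.Group}",
        ).add_to(feature_group)
    feature_group.add_to(mapa)

folium.LayerControl().add_to(mapa)
mapa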

Selecting data based on number of occurrences using Python / Pandas

My dataset is based on the results of Food Inspections in the City of Chicago.
import pandas as pd
df = pd.read_csv("C:/~/Food_Inspections.csv")
df.head()
Out[1]:
Inspection ID DBA Name \
0 1609238 JR'SJAMAICAN TROPICAL CAFE,INC
1 1609245 BURGER KING
2 1609237 DUNKIN DONUTS / BASKIN ROBINS
3 1609258 CHIPOTLE MEXICAN GRILL
4 1609244 ATARDECER ACAPULQUENO INC.
AKA Name License # Facility Type Risk \
0 NaN 2442496.0 Restaurant Risk 1 (High)
1 BURGER KING 2411124.0 Restaurant Risk 2 (Medium)
2 DUNKIN DONUTS / BASKIN ROBINS 1717126.0 Restaurant Risk 2 (Medium)
3 CHIPOTLE MEXICAN GRILL 1335044.0 Restaurant Risk 1 (High)
4 ATARDECER ACAPULQUENO INC. 1910118.0 Restaurant Risk 1 (High)
Here is how often each of the facilities appear in the dataset:
df['Facility Type'].value_counts()
Out[3]:
Restaurant 14304
Grocery Store 2647
School 1155
Daycare (2 - 6 Years) 367
Bakery 316
Children's Services Facility 262
Daycare Above and Under 2 Years 248
Long Term Care 169
Daycare Combo 1586 142
Catering 123
Liquor 78
Hospital 68
Mobile Food Preparer 67
Golden Diner 65
Mobile Food Dispenser 51
Special Event 25
Shared Kitchen User (Long Term) 22
Daycare (Under 2 Years) 18
I am trying to create a new set of data containing only those rows whose Facility Type has over 50 occurrences in the dataset. How would I approach this?
Please note the list of facility counts is MUCH LARGER; I have cut out most of it as it did not contribute to the question at hand (so simply removing occurrences of "Special Event", "Shared Kitchen User", and "Daycare" is not what I'm looking for).
If I understand correctly, you want to filter:
df.groupby('Facility Type').filter(lambda x: len(x) > 50)
Example:
In [9]:
df = pd.DataFrame({'type':list('aabcddddee'), 'value':np.random.randn(10)})
df
Out[9]:
type value
0 a -0.160041
1 a -0.042310
2 b 0.530609
3 c 1.238046
4 d -0.754779
5 d -0.197309
6 d 1.704829
7 d -0.706467
8 e -1.039818
9 e 0.511638
In [10]:
df.groupby('type').filter(lambda x: len(x) > 1)
Out[10]:
type value
0 a -0.160041
1 a -0.042310
4 d -0.754779
5 d -0.197309
6 d 1.704829
7 d -0.706467
8 e -1.039818
9 e 0.511638
Not tested, but should work.
FT = df['Facility Type'].value_counts()
df[df['Facility Type'].isin(FT.index[FT > 50])]
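As a quick sanity check of that idea (a sketch reusing the toy frame from the filter example above, so 'type' stands in for 'Facility Type' and the threshold is 1 instead of 50; counts and out are my names):
import numpy as np
import pandas as pd

df = pd.DataFrame({'type': list('aabcddddee'), 'value': np.random.randn(10)})
counts = df['type'].value_counts()
# keep rows whose type occurs more than once; same rows as groupby('type').filter(lambda x: len(x) > 1)
out = df[df['type'].isin(counts.index[counts > 1])]
print(out)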
