Compare two columns with fuzzy match - Python

I am using Google Colab to perform a fuzzy match between two columns of a dataframe.
I want to check every value in the first column for a complete or partial match in the second column and put EXISTS if there is a match.
I have tried the code below, but it takes very long to execute on 5000 x 2 records.
Below is my code:
# pip install fuzzywuzzy
import pandas as pd
from fuzzywuzzy import process

data = pd.read_csv('/content/mydata')
df1 = pd.DataFrame(data['ColA'])
df2 = pd.DataFrame(data['ColB'])
def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    :param df_1: the left table to join
    :param df_2: the right table to join
    :param key1: key column of the left table
    :param key2: key column of the right table
    :param threshold: how close the matches should be to return a match, based on Levenshtein distance
    :param limit: the number of matches that will be returned, sorted high to low
    :return: dataframe with both keys and matches
    """
    s = df_2[key2].tolist()
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    df_1['matches'] = m
    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2
    return df_1

fuzzy_merge(df1, df2, 'ColA', 'ColB')
Below is my dataframe, with the desired result column:
|ColA|ColB|result|
|-|-|-|
|aaabc.eval.moc|abcde|EXISTS|
|abcde.eval|abc.123|EXISTS|
|def.gcd.xyz|def.gc|EXISTS|
|abc.123.moc|xyz123.eval.moc.facebook.google|EXISTS|
|xyz123.eval.moc|google.facebook.apple.chromebook|EXISTS|
|google.facebook.apple|435|NOT EXISTS|
|Testing435||NOT EXISTS|
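For the speed problem itself: fuzzywuzzy scores every pair in pure Python, so 5000 x 5000 comparisons are slow. A minimal sketch of a faster route, assuming the rapidfuzz package is available (pip install rapidfuzz); its process.cdist computes the whole score matrix in compiled code with multithreading:

import numpy as np
import pandas as pd
from rapidfuzz import fuzz, process

data = pd.read_csv('/content/mydata')  # same input as above

# Score every ColA value against every ColB value in one vectorized call;
# partial_ratio covers the partial matches, workers=-1 uses all CPU cores.
scores = process.cdist(data['ColA'], data['ColB'],
                       scorer=fuzz.partial_ratio, workers=-1)

# A value EXISTS if its best score across all of ColB clears the threshold.
data['result'] = np.where(scores.max(axis=1) >= 90, 'EXISTS', 'NOT EXISTS')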

Related

Why do I get a KeyError when I do a merge?

Hi, please help me. I am trying to fuzzy merge two datasets with pandas and fuzzywuzzy, using two columns from each, but I get a traceback at the line before the print call that says KeyError: ('name', 'lasntname'). I do not know if I am referencing something wrong; I have tried double brackets and parentheses with no luck.
Here is the code:
import pandas as pd
from fuzzywuzzy import fuzz, process
from itertools import product

N = 80

names = {tup: fuzz.ratio(*tup) for tup in
         product(df1["Name"].tolist(), df2["name"].tolist())}
s1 = pd.Series(names)
s1 = s1[s1 > N]
s1 = s1[s1.groupby(level=0).idxmax()]

surnames = {tup: fuzz.ratio(*tup) for tup in
            product(df1["Last_name"].tolist(), df2["lasntname"].tolist())}
s2 = pd.Series(surnames)
s2 = s2[s2 > N]
s2 = s2[s2.groupby(level=0).idxmax()]

# map and fill nulls
df2["name"] = df2["name"].map(s1).fillna(df2["name"])
df2["lasntname"] = df2["lasntname"].map(s2).fillna(df2["lasntname"])

df = df1.merge(df2, on=["name", "lasntname"], how='outer')
print(df)
Hi. Just make your column names uniform on both tables and it should work: the merge keys ["name", "lasntname"] must exist in both frames, but df1 uses "Name" and "Last_name", which is what raises the KeyError.
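A minimal sketch of that fix, renaming df1's columns so both frames share the merge keys:

# Rename df1's columns to match df2 before merging.
df1 = df1.rename(columns={"Name": "name", "Last_name": "lasntname"})

df = df1.merge(df2, on=["name", "lasntname"], how='outer')
print(df)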

Consolidation of consecutive rows by condition with Python Pandas

I am trying to handle the following data issue. I have a dataframe of values and their label lists (this is multi-class, so the labels are a list).
The dataframe looks like:
      | value | labels
------------------------------
row_1 | A     | [label1]
row_2 | B     | [label2]
row_3 | C     | [label3, label4]
row_4 | D     | [label4, label5]
I want to find all rows that have a specific label and then:
Firstly, concatenate the row's value with the next row's value - the string is concatenated before the next row's value.
Secondly, append the row's labels to the beginning of the next row's label list.
For example, if I want to do that for label2, the desired output will be:
      | value | labels
------------------------------
row_1 | A     | [label1]
row_3 | BC    | [label2, label3, label4]
row_4 | D     | [label4, label5]
The value "B" is joined before the next row's value, and the label "label2" is appended to the beginning of the next row's label list. The indexes are not relevant for me.
I would greatly appreciate help with this. I tried merge, join, shift, and cumsum, but without success so far.
The following code creates the data in the example:
import pandas as pd

data = {'row_1': ["A", ["label1"]], 'row_2': ["B", ["label2"]],
        'row_3': ["C", ["label3", "label4"]], 'row_4': ["D", ["label4", "label5"]]}
df = pd.DataFrame.from_dict(data, orient='index').rename(columns={0: "value", 1: "labels"})
You could create a grouping variable and use that to aggregate the columns:

import pandas as pd
import numpy as np

def my_combine(data, value):
    # Rows whose label list contains the target value
    index = data['labels'].apply(lambda x: np.isin(value, x))
    if all(~index):
        return data
    # Each matched row merges with the row after it, so mark both
    idx = (index | index.shift()).to_numpy()
    vals = (np.arange(idx.size) + 1) * (~idx)
    # Build consecutive group ids over the marked runs
    gr = np.r_[np.where(vals[1:] != vals[:-1])[0], vals.size - 1]
    groups = np.repeat(gr, np.diff(np.r_[-1, gr]))
    # sum concatenates both the value strings and the label lists
    return data.groupby(groups).agg(sum)

my_combine(df, 'label2')
value labels
0 A [label1]
2 BC [label2, label3, label4]
3 D [label4, label5]
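Since the question mentions trying shift and cumsum, here is a shorter sketch along those lines (combine_label is a hypothetical helper name; when marked rows are consecutive, it chains them all into one group):

import pandas as pd

def combine_label(df, label):
    # Mark rows whose label list contains the target label.
    has = df['labels'].apply(lambda ls: label in ls)
    # Start a new group at every row that does NOT follow a marked row,
    # so each marked row lands in the same group as its successor.
    grp = (~has.shift(fill_value=False)).cumsum()
    return df.groupby(grp.values).agg({'value': ''.join,
                                       'labels': lambda s: sum(s, [])})

combine_label(df, 'label2')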

Python/Pandas: Figuring out the source of SettingWithCopyWarning in a Function

I cannot find the source of a SettingWithCopyWarning. I tried to fix the assignment operations as suggested in the documentation, but it still gives me the SettingWithCopyWarning. Any help would be greatly appreciated.
def fuzzy_merge(df_left, df_right, key_left, key_right, threshold=84, limit=1):
    """
    df_left: the left table to join
    df_right: the right table to join
    key_left: the key column of the left table
    key_right: the key column of the right table
    threshold: how close the matches should be to return a match, based on Levenshtein distance
    limit: the number of matches that will be returned, sorted high to low
    """
    s = df_right.loc[:, key_right].tolist()
    m = df_left.loc[:, key_left].apply(lambda x: process.extract(x, s, limit=limit))
    df_left.loc[:, "matches"] = m
    m2 = df_left.loc[:, "matches"].apply(
        lambda x: ", ".join([i[0] for i in x if i[1] >= threshold])
    )
    df_left.loc[:, "matches"] = m2
    return df_left
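For context, the warning usually does not originate in these lines but at the call site: if df_left is a slice (view) of another dataframe, assigning a new column to it triggers the SettingWithCopyWarning. A minimal sketch of the usual remedy, taking an explicit copy inside the function:

def fuzzy_merge(df_left, df_right, key_left, key_right, threshold=84, limit=1):
    # Work on an explicit copy so assignments never touch a view of the caller's frame.
    df_left = df_left.copy()
    s = df_right[key_right].tolist()
    matches = df_left[key_left].apply(lambda x: process.extract(x, s, limit=limit))
    df_left["matches"] = matches.apply(
        lambda x: ", ".join(i[0] for i in x if i[1] >= threshold)
    )
    return df_left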

Python: Parse CSV by combining all values for a key into 1 row and store a new dataframe

I have a csv file that contains key-value pairs, and I am being asked to combine all the values associated with the same key into one row.
For example:
Key,Col1,Col2,Col3
A, 1, A1, C9
A, 2, C9, C1
A, 5, C1, C4
B, 7, A8, C5
D, 10, A2, C3
UPDATED: corrected the results, since there was a mistake in the first row.
This should result in the following records for the dataframe:
Key,NewCol
A,A1:1:C9:C9:2:C1:C1:5:C4
B,A8:7:C5
D,A2:10:C3
As you can see, I need them in order of continuity by Key.
For the records with Key = A, the series should be ordered so that the first record has Col2-Col3 values A1 -- C9, the next record has Col2-Col3 values C9 -- XX, and so on.
The records are not always in the right row sequence, so I need to make sure this is accomplished as I store each record.
I started doing this by reading the csv, checking each row for an existing entry with the same key, and then appending to or creating the value accordingly:
import csv

records = {}
with open('example.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    next(readCSV)  # skip the header row
    for row in readCSV:
        key, chunk = row[0], row[2] + ":" + row[1] + ":" + row[3]
        if key in records:
            records[key] = records[key] + ":" + chunk
        else:
            records[key] = chunk
Question: Is there a more efficient way of doing this? I have a big file to read and more processing to do afterwards.
import pandas as pd
df = pd.read_csv('waka.csv', header=None)
result = df.groupby(0).agg(lambda x: ':'.join(x.apply(str))).apply(lambda x: ':'.join(x), axis=1)
result
How it works:
- import pandas as pd - import the pandas library
- df = pd.read_csv('waka.csv', header=None) - read the csv file into a dataframe
- df.groupby(0) - group by column 0 (there are no headers, so you have to use column indices)
- agg(lambda x: ':'.join(x.apply(str))) - join all rows within every grouped block
- apply(lambda x: ':'.join(x), axis=1) - join all columns of the resulting row into one all-containing cell
The result is a Series object whose indices are the grouped elements.
Edit 1: update for the question's ordering requirement.
I didn't find any simple solution for ordering the joined rows first. I can recommend only this code:
import pandas as pd

df = pd.read_csv('waka.csv', header=None)
grouped = df.groupby(0)
headers = []
bodies = []
for group in grouped.groups:
    headers.append(group)
    bodies.append(grouped.get_group(group)
                         .drop(columns=0)
                         .apply(lambda x: ':'.join([str(e) if type(e) != str else e for e in x]), axis=1)
                         .str.cat(sep=':'))
pd.Series(bodies, index=headers)
It is mostly the same, but the main line generating bodies is a bit different:
- grouped - the grouped df
- .get_group(group) - one particular group
- .drop(columns=0) - remove the column holding the grouped index (A, B or D)
- .apply(lambda x: ':'.join(WAKA), axis=1) - join rows into strings, where WAKA = [str(e) if type(e) != str else e for e in x] handles non-str elements
- .str.cat(sep=':') - concatenate the rows into one string
Will return:
B 7:A8:C5
D 10:A2:C3
A 1:A1:C9:2:C9:C1:5:C1:C4
dtype: object
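Note that neither version enforces the continuity ordering the question asks for (each row's Col3 matching the next row's Col2). A sketch of one way to add it, using a hypothetical chain_rows helper and assuming exactly one unbroken chain per key:

import pandas as pd

def chain_rows(g):
    # Index each row by its Col2 start value; the chain head is the row
    # whose Col2 never appears as any other row's Col3.
    by_start = {row[2]: row for row in g.itertuples(index=False, name=None)}
    ends = {row[3] for row in by_start.values()}
    start = next(s for s in by_start if s not in ends)
    parts = []
    while start in by_start:
        row = by_start.pop(start)
        parts.append(f"{row[2]}:{row[1]}:{row[3]}")  # Col2:Col1:Col3
        start = row[3]  # follow the link to the next row
    return ':'.join(parts)

df = pd.read_csv('example.csv', skipinitialspace=True)
print(df.groupby('Key').apply(chain_rows))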

Replace string in pandas dataframe if it contains specific substring

I have a dataframe generated from a .csv (I use Python 3.5). df['category'] contains only strings. What I want is to check this column, and if a string contains a specific substring (I am not really interested in where it is in the string, as long as it exists), replace the value. I am using this script:
import pandas as pd
df=pd.read_csv('lastfile.csv')
df.dropna(inplace=True)
g='Drugs'
z='Weapons'
c='Flowers'
df.category = df.category.str.lower().apply(lambda x: g if ('mdma' or 'xanax' or 'kamagra' or 'weed' or 'tabs' or 'lsd' or 'heroin' or 'morphine' or 'hci' or 'cap' or 'mda' or 'hash' or 'kush' or 'wax'or 'klonop'or\
'dextro'or'zepam'or'amphetamine'or'ketamine'or 'speed' or 'xtc' or 'XTC' or 'SPEED' or 'crystal' or 'meth' or 'marijuana' or 'powder' or 'afghan'or'cocaine'or'haze'or'pollen'or\
'sativa'or'indica'or'valium'or'diazepam'or'tablet'or'codeine'or \
'mg' or 'dmt'or'diclazepam'or'zepam'or 'heroin' ) in x else(z if ('weapon'or'milit'or'gun'or'grenades'or'submachine'or'rifle'or'ak47')in x else c) )
print(df['category'])
My problem is that some records, though they contain some of the substrings I defined, do not get replaced. Is this a regex-related problem?
Thank you in advance.
The check fails because an expression like ('mdma' or 'xanax' or ...) in x evaluates the or-chain first, which returns just 'mdma' (the first truthy operand), so only that one substring is ever tested. Instead, create a dictionary of substring lists keyed by the replacement string, loop over it, join each list with | for a regex OR, test the column with contains, and replace the matched rows with loc:
df = pd.DataFrame({'category': ['sss mdma df', 'milit ss aa', 'aa ss']})

a = ['mdma', 'xanax', 'kamagra']
b = ['weapon', 'milit', 'gun']

g = 'Drugs'
z = 'Weapons'
c = 'Flowers'

d = {g: a, z: b}

df['new_category'] = c
for k, v in d.items():
    pat = '|'.join(v)
    mask = df.category.str.contains(pat, case=False)
    df.loc[mask, 'new_category'] = k

print(df)
category new_category
0 sss mdma df Drugs
1 milit ss aa Weapons
2 aa ss Flowers
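An equivalent one-pass sketch with numpy.select, reusing the lists above (note that np.select keeps the first matching condition, whereas the loop lets later dictionary entries win):

import numpy as np

conditions = [df.category.str.contains('|'.join(a), case=False),
              df.category.str.contains('|'.join(b), case=False)]
df['new_category'] = np.select(conditions, [g, z], default=c)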
