I have a column of strings. I want to fuzzy-match each string against the rest of the column and mark, in a new column next to it, the rows that have a match above 80%. I can do this with the following code on a smaller dataset, but my original dataset is too big for it to run efficiently. Is there a better way to do this? This is what I have done.
import pandas as pd
from fuzzywuzzy import fuzz
l = [[1,'a','b','c','help pls'],[2,'a','c','c','yooo'],[3,'a','c','c','you will not pass'],[4,'a','b','b','You shall not pass'],[5,'a','c','c','You shall not pass!']]
df = pd.DataFrame(l, columns=['Serial No','one','two','three','four'])
df['yes/no 2'] = ""
# compare every row's string against every other row's string
for i in range(0, df.shape[0]):
    for j in range(0, df.shape[0]):
        if (i != j):
            if (fuzz.token_sort_ratio(df.iloc[i, df.shape[1]-2], df.iloc[j, df.shape[1]-2]) > 80):
                df.iloc[i, df.shape[1]-1] = "yes"
import pandas as pd
from fuzzywuzzy import fuzz
l = [[1,'a','b','c','help pls'],[2,'a','c','c','yooo'],[3,'a','c','c','you will not pass'],[4,'a','b','b','You shall not pass'],[5,'a','c','c','You shall not pass!']]
df = pd.DataFrame(l,columns = ['Serial No','one','two','three','four'])
def match(row):
    thresh = 80
    return fuzz.token_sort_ratio(row["two"], row["three"]) > thresh
df["Yes/No"] = df.apply(match,axis=1)
print(df)
Serial No one two three four Yes/No
0 1 a b c help pls False
1 2 a c c yooo True
2 3 a c c you will not pass True
3 4 a b b You shall not pass True
4 5 a c c You shall not pass! True
import pandas as pd
from fuzzywuzzy import fuzz,process
l = [[1,'a','b','c','help pls'],[2,'a','c','c','yooo'],[3,'a','c','c','you will not pass'],[4,'a','b','b','You shall not pass'],[5,'a','c','c','You shall not pass!']]
df = pd.DataFrame(l,columns = ['Serial No','one','two','three','four']).reset_index()
def match(df, col):
    thresh = 80
    # for each (index, string) pair, look for at least one other row scoring above the threshold
    return df[col].apply(lambda x: "Yes" if len(process.extractBests(
        x[1], [xx[1] for i, xx in enumerate(df[col]) if i != x[0]],
        scorer=fuzz.token_sort_ratio, score_cutoff=thresh + 1, limit=1)) > 0 else "No")
df["five"] = df.apply(lambda x:(x["index"],x["four"]),axis=1)
df["Yes/No"] = df.pipe(match,"five")
print(df)
index Serial No one two three four five Yes/No
0 0 1 a b c help pls (0, help pls) No
1 1 2 a c c yooo (1, yooo) No
2 2 3 a c c you will not pass (2, you will not pass) Yes
3 3 4 a b b You shall not pass (3, You shall not pass) Yes
4 4 5 a c c You shall not pass! (4, You shall not pass!) Yes
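For a much larger column, a sketch of a faster variant (assuming the rapidfuzz package, a faster drop-in replacement for fuzzywuzzy with the same scorers; it is not part of the code above) that computes the whole pairwise score matrix in compiled code:
import numpy as np
import pandas as pd
from rapidfuzz import fuzz, process

l = [[1,'a','b','c','help pls'],[2,'a','c','c','yooo'],[3,'a','c','c','you will not pass'],[4,'a','b','b','You shall not pass'],[5,'a','c','c','You shall not pass!']]
df = pd.DataFrame(l, columns=['Serial No','one','two','three','four'])

# all-vs-all token_sort_ratio scores for column 'four', computed in parallel
scores = process.cdist(df['four'].tolist(), df['four'].tolist(),
                       scorer=fuzz.token_sort_ratio, workers=-1)
np.fill_diagonal(scores, 0)                      # ignore self-matches
df['yes/no'] = np.where((scores > 80).any(axis=1), 'yes', '')
print(df)
This is still O(n^2) comparisons, just done in C across cores; for really big data you would additionally need to deduplicate or block the strings first.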
I am trying to count how many characters from the first column appear in the second one. They may appear in a different order, and they should not be counted twice.
For example, in this df
import pandas as pd
df = pd.DataFrame(data=[["AL0", "CP1", "NM3", "PK9", "RM2"],
                        ["AL0X24", "CXP44", "MLN", "KKRR9", "22MMRRS"]]).T
the result should be:
result = [3,2,2,2,3]
Looks like set.intersection after zipping the 2 columns:
[len(set(a).intersection(set(b))) for a,b in zip(df[0],df[1])]
#[3, 2, 2, 2, 3]
The other solutions will fail when you compare names that both contain the same character multiple times, e.g. AAL0 and AAL0X24. The result here should be 4.
from collections import Counter
df = pd.DataFrame(data=[["AL0","CP1","NM3","PK9","RM2", "AAL0"],
["AL0X24", "CXP44", "MLN", "KKRR9", "22MMRRS", "AAL0X24"]]).T
def num_shared_chars(char_counter1, char_counter2):
    shared_chars = set(char_counter1.keys()).intersection(char_counter2.keys())
    return sum([min(char_counter1[k], char_counter2[k]) for k in shared_chars])
df_counter = df.applymap(Counter)
df['shared_chars'] = df_counter.apply(lambda row: num_shared_chars(row[0], row[1]), axis = 'columns')
Result:
0 1 shared_chars
0 AL0 AL0X24 3
1 CP1 CXP44 2
2 NM3 MLN 2
3 PK9 KKRR9 2
4 RM2 22MMRRS 3
5 AAL0 AAL0X24 4
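As a side note, Counter already supports & for exactly this kind of intersection (it keeps the minimum count of each shared element), so the helper above can be collapsed; a small sketch:
from collections import Counter

def num_shared_chars(s1, s2):
    # Counter & Counter keeps min(count1, count2) for every shared character
    return sum((Counter(s1) & Counter(s2)).values())

num_shared_chars('AAL0', 'AAL0X24')   # 4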
Sticking to the dataframe data structure, you could do:
>>> def count_common(s1, s2):
...     return len(set(s1) & set(s2))
...
>>> df["result"] = df.apply(lambda x: count_common(x[0], x[1]), axis=1)
>>> df
0 1 result
0 AL0 AL0X24 3
1 CP1 CXP44 2
2 NM3 MLN 2
3 PK9 KKRR9 2
4 RM2 22MMRRS 3
I have a dataframe with sorted values labeled by ids, and I want to take the difference between the value of the first element of an id and the value of the last element of each of the previous ids. The code below does what I want:
import pandas as pd
a = 'a'; b = 'b'; c = 'c'
df = pd.DataFrame(data=[*zip([a, a, a, b, b, c, a], [1, 2, 3, 5, 6, 7, 8])],
                  columns=['id', 'value'])
print(df)
# # take the last value for a particular id
# last_value_for_id = df.loc[df.id.shift(-1) != df.id, :]
# print(last_value_for_id)
current_id = ''; prev_values = {}; diffs = {}
for t in df.itertuples(index=False):
    prev_values[t.id] = t.value
    if current_id != t.id:
        current_id = t.id
    else:
        continue
    for k, v in prev_values.items():
        if k == current_id:
            continue
        diffs[(k, current_id)] = t.value - v
print(pd.DataFrame(data=diffs.values(), columns=['diff'], index=diffs.keys()))
prints:
id value
0 a 1
1 a 2
2 a 3
3 b 5
4 b 6
5 c 7
6 a 8
diff
a b 2
c 4
b c 1
a 2
c a 1
However, I want to do this in a vectorized manner. I have found a way of getting the series of last elements:
# take the last value for a particular id
last_value_for_id = df.loc[df.id.shift(-1) != df.id, :]
print(last_value_for_id)
which gives me:
id value
2 a 3
4 b 6
5 c 7
but I can't find a way of using this to take the diffs in a vectorized manner.
Depending on how many ids you have, this works for up to a few thousand:
import numpy as np
# enumerate ids, should be careful
ids = [a, b, c]
num_ids = len(ids)
# compute first and last
f = df.groupby('id').value.agg(['first','last'])
# lower triangle mask
mask = np.array([[i>=j for j in range(num_ids)] for i in range(num_ids)])
# compute diff of first and last, then mask
diff = np.where(mask, None, f['first'].values[None, :] - f['last'].values[:, None])
diff = pd.DataFrame(diff,
                    index=ids,
                    columns=ids)
# stack
diff.stack()
output:
a b 2
c 4
b c 1
dtype: object
Edit for updated data:
For the updated data, the approach is similar once we create the f table:
# create blocks of consecutive id
blocks = df['id'].ne(df['id'].shift()).cumsum()
# groupby
groups = df.groupby(blocks)
# create first and last values
df['fv'] = groups.value.transform('first')
df['lv'] = groups.value.transform('last')
# the above f and ids
# note the column name change
f = df[['id','fv', 'lv']].drop_duplicates()
ids = f['id'].values
num_ids = len(ids)
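The masking and stacking steps are the same as above; for completeness, a sketch with the renamed fv/lv columns:
import numpy as np
# lower triangle mask, same as before
mask = np.array([[i >= j for j in range(num_ids)] for i in range(num_ids)])
diff = np.where(mask, None, f['fv'].values[None, :] - f['lv'].values[:, None])
diff = pd.DataFrame(diff, index=ids, columns=ids)
diff.stack()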
Output:
a b 2
c 4
a 5
b c 1
a 2
c a 1
dtype: object
If you want to go further and drop the index (a,a), well, I'm so lazy :D.
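For the record, that last step is a one-liner on the stacked result; a sketch reusing diff from above, keeping only the pairs whose two index levels differ:
stacked = diff.stack()
stacked[stacked.index.get_level_values(0) != stacked.index.get_level_values(1)]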
My method
s=df.groupby(df.id.shift().ne(df.id).cumsum()).agg({'id':'first','value':['min','max']})
s.columns=s.columns.droplevel(0)
t=s['min'].values[:,None]-s['max'].values
t=t.astype(float)
Below is all reshaping, to match your output:
t[np.triu_indices(t.shape[1], 0)] = np.nan
newdf=pd.DataFrame(t,index=s['first'],columns=s['first'])
newdf.values[newdf.index.values[:,None]==newdf.index.values]=np.nan
newdf=newdf.T.stack()
newdf
Out[933]:
first first
a b 2.0
c 4.0
b c 1.0
a 2.0
c a 1.0
dtype: float64
How do I write a function that checks whether two input dataframes are equal as long as the rows in both dataframes are equal? It should disregard index positions and column order. I can't use df.equals() since it enforces that data types be equal, which is not what I need.
import pandas as pd
from io import StringIO
canonical_in_csv = """,c,a,b
2,hat,x,1
0,rat,y,4
3,cat,x,2
1,bat,x,2"""
with StringIO(canonical_in_csv) as fp:
    df1 = pd.read_csv(fp, index_col=0)
canonical_soln_csv = """,a,b,c
0,x,1,hat
1,x,2,bat
2,x,2,cat
3,y,4,rat"""
with StringIO(canonical_soln_csv) as fp:
    df2 = pd.read_csv(fp, index_col=0)
df1:
c a b
2 hat x 1
0 rat y 4
3 cat x 2
1 bat x 2
df2:
a b c
0 x 1 hat
1 x 2 bat
2 x 2 cat
3 y 4 rat
My attempt:
temp1 = (df1 == df2).all()
temp2 = temp1.all()
temp2
ValueError: Can only compare identically-labeled DataFrame objects
You can sort by index and column labels with sort_index first, then merge and compare with eq (==) or equals:
df11 = df1.sort_index().sort_index(axis=1)
df22 = df2.sort_index().sort_index(axis=1)
print (df11.merge(df22))
a b c
0 y 4 rat
1 x 2 bat
2 x 1 hat
3 x 2 cat
print (df11.merge(df22).eq(df11))
a b c
0 True True True
1 True True True
2 True True True
3 True True True
a = df11.merge(df22).eq(df11).values.all()
#alternative
#a = df11.merge(df22).equals(df11)
print (a)
True
Your function should be rewritten:
def checkequality(A, B):
    df11 = A.sort_index(axis=1)
    df11 = df11.sort_values(df11.columns.tolist()).reset_index(drop=True)
    df22 = B.sort_index(axis=1)
    df22 = df22.sort_values(df22.columns.tolist()).reset_index(drop=True)
    return (df11 == df22).values.all()
a = checkequality(df1, df2)
print (a)
True
Your request to disregard the row index is difficult to handle efficiently, as the DataFrame is not optimized for that kind of operation. For the column-order issue, fortunately, this will help you:
df1.values == df2[df1.columns].values
where df1.columns syncs the column order and .values converts to NumPy for comparison. I still recommend not re-ordering and matching rows, as that can be very taxing for a bigger dataset.
Based on index matching, this may be what you are looking for:
df1.values==df2.reindex(df1.index.values.tolist())[df1.columns].values
Update
As pointed out by @Dark, a cleaner, in-place comparison can be done like this:
df1.loc[df2.index,df2.columns] == df2
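Either of these elementwise comparisons can be reduced to a single boolean if that is what your function should return, e.g. (a sketch):
(df1.loc[df2.index, df2.columns] == df2).values.all()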
I figured it out,
def checkequality(A, B):
    var_names = sorted(A.columns)
    Y = A[var_names].copy()
    Y.sort_values(by=var_names, inplace=True)
    Y.set_index([list(range(0, len(Y)))], inplace=True)
    var_names2 = sorted(B.columns)
    Y2 = B[var_names2].copy()
    Y2.sort_values(by=var_names2, inplace=True)
    Y2.set_index([list(range(0, len(Y2)))], inplace=True)
    if (Y == Y2).all().all() == True:
        return True
    else:
        return False
I need to filter a data frame with a dict, constructed with the key being the column name and the value being the value that I want to filter:
filter_v = {'A':1, 'B':0, 'C':'This is right'}
# this would be the normal approach
df[(df['A'] == 1) & (df['B'] ==0)& (df['C'] == 'This is right')]
But I want to do something along the lines of:
for column, value in filter_v.items():
    df[df[column] == value]
but this will filter the data frame several times, one value at a time, and not apply all filters at the same time. Is there a way to do it programmatically?
EDIT: an example:
df1 = pd.DataFrame({'A':[1,0,1,1, np.nan], 'B':[1,1,1,0,1], 'C':['right','right','wrong','right', 'right'],'D':[1,2,2,3,4]})
filter_v = {'A':1, 'B':0, 'C':'right'}
df1.loc[df1[filter_v.keys()].isin(filter_v.values()).all(axis=1), :]
gives
A B C D
0 1 1 right 1
1 0 1 right 2
3 1 0 right 3
but the expected result was
A B C D
3 1 0 right 3
only the last one should be selected.
IIUC, you should be able to do something like this:
>>> df1.loc[(df1[list(filter_v)] == pd.Series(filter_v)).all(axis=1)]
A B C D
3 1 0 right 3
This works by making a Series to compare against:
>>> pd.Series(filter_v)
A 1
B 0
C right
dtype: object
Selecting the corresponding part of df1:
>>> df1[list(filter_v)]
A C B
0 1 right 1
1 0 right 1
2 1 wrong 1
3 1 right 0
4 NaN right 1
Finding where they match:
>>> df1[list(filter_v)] == pd.Series(filter_v)
A B C
0 True False True
1 False False True
2 True False False
3 True True True
4 False False True
Finding where they all match:
>>> (df1[list(filter_v)] == pd.Series(filter_v)).all(axis=1)
0 False
1 False
2 False
3 True
4 False
dtype: bool
And finally using this to index into df1:
>>> df1.loc[(df1[list(filter_v)] == pd.Series(filter_v)).all(axis=1)]
A B C D
3 1 0 right 3
An abstraction of the above for the case of passing an array of filter values rather than a single value (analogous to pandas.core.series.Series.isin()). Using the same example:
df1 = pd.DataFrame({'A':[1,0,1,1, np.nan], 'B':[1,1,1,0,1], 'C':['right','right','wrong','right', 'right'],'D':[1,2,2,3,4]})
filter_v = {'A':[1], 'B':[1,0], 'C':['right']}
##Start with array of all True
ind = [True] * len(df1)
##Loop through filters, updating index
for col, vals in filter_v.items():
    ind = ind & (df1[col].isin(vals))
##Return filtered dataframe
df1[ind]
##Returns
A B C D
0 1.0 1 right 1
3 1.0 0 right 3
Here is a way to do it:
df.loc[df[filter_v.keys()].isin(filter_v.values()).all(axis=1), :]
UPDATE:
With values being the same across columns you could then do something like this:
# Create your filtering function:
def filter_dict(df, dic):
    return df[df[dic.keys()].apply(
        lambda x: x.equals(pd.Series(dic.values(), index=x.index, name=x.name)), axis=1)]
# Use it on your DataFrame:
filter_dict(df1, filter_v)
Which yields:
A B C D
3 1 0 right 3
If it is something that you do frequently, you could go as far as patching DataFrame for easy access to this filter:
pd.DataFrame.filter_dict_ = filter_dict
And then use this filter like this:
df1.filter_dict_(filter_v)
Which would yield the same result.
BUT, it is not the right way to do it, clearly.
I would use DSM's approach.
@primer's answer is fine for Python 2, but you should be careful in Python 3 because of dict_keys. For instance,
>> df.loc[df[filter_v.keys()].isin(filter_v.values()).all(axis=1), :]
>> TypeError: unhashable type: 'dict_keys'
The correct way for Python 3:
df.loc[df[list(filter_v.keys())].isin(list(filter_v.values())).all(axis=1), :]
Here's another way:
filterSeries = pd.Series(np.ones(df.shape[0],dtype=bool))
for column, value in filter_v.items():
    filterSeries = ((df[column] == value) & filterSeries)
This gives:
>>> df[filterSeries]
A B C D
3 1 0 right 3
To follow up on DSM's answer, you can also use any() to turn your query into an OR operation (instead of AND):
df1.loc[(df1[list(filter_v)] == pd.Series(filter_v)).any(axis=1)]
You can also create a query
query_string = ' and '.join(
    [f'({key} == "{val}")' if type(val) == str else f'({key} == {val})' for key, val in filter_v.items()]
)
df1.query(query_string)
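For the example filter_v above, the generated string quotes string values and leaves numbers bare; a sketch of what ends up being passed to query:
print(query_string)
# (A == 1) and (B == 0) and (C == "right")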
Combining previous answers, here's a function you can feed to df1.loc. Allows for AND/OR (using how='all'/'any'), plus it allows comparisons other than == using the op keyword, if desired.
import operator
def quick_mask(df, filters, how='all', op=operator.eq) -> pd.Series:
    if how == 'all':
        comb = pd.Series.all
    elif how == 'any':
        comb = pd.Series.any
    return comb(op(df[[*filters]], pd.Series(filters)), axis=1)
# Usage
df1.loc[quick_mask(df1, filter_v)]
I had an issue due to my dictionary having multiple values for the same key.
I was able to change DSM's query to:
df1.loc[df1[list(filter_v)].isin(filter_v).all(axis=1), :]
I am trying to find out which edges of a graph are bidirectional. Each row is an edge. For each starting node A, I check whether each corresponding end node B also has node A as one of its ending points:
for ending_point_B in nodeA:
    nodeA in ending_points_of_B
Disregard repeated entries in df['S'] for now. How can I optimize this search? I suspect something along the lines of groupby. This way takes too much time on my real graph.
Thank you
from pandas import *
def missing_node(node):
    set1 = set(df[df.E == node].S.values)
    set2 = set(df.E[df.S == node].values)
    return list(set1.difference(set2))
x = [1,1,2,2,3]
y = [2,3,1,3,1]
df = DataFrame([x,y]).T
df.columns = ['S','E'] #Start & End
df['Missing'] = df.S.apply(missing_node)
df:
S E Missing
0 1 2 []
1 1 3 []
2 2 1 []
3 2 3 []
4 3 1 [2]
Pandas is great, but I'm not sure you need it here. Something like the following should give you the links that aren't bidirectional:
x = [1,1,2,2,3]
y = [2,3,1,3,1]
fwd = set( zip(x,y) )
rev = set( zip(y,x) )
print('not bi:', fwd.difference(rev))
This returns:
not bi: {(2, 3)}
If I understand your problem correctly, you need to find all node pairs that are not bidirectional. In the example above, the only such pair of nodes is 2 and 3. Given this, you could do the following:
In [1]: df['is_bi'] = df.index.map(lambda x: np.any([np.all(y) for y in (df.loc[x, ['E', 'S']].values == df[['S', 'E']].values)]))
In [2]: df
Out[2]:
S E is_bi
0 1 2 True
1 1 3 True
2 2 1 True
3 2 3 False
4 3 1 True
So df[~df.is_bi] will give you all pairs of nodes that are not bidirectional:
In [3]: df[~df.is_bi][['S', 'E']]
Out[3]:
S E
3 2 3
I have the feeling that I overly complicated this and there must be a way to do this with pandas-native functions, but the above solution does the trick.
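For what it's worth, one pandas-native way (a sketch, not taken from the answers above): reverse every edge with a column swap, then self-merge with an indicator, which marks the edges whose reverse also exists.
import pandas as pd

x = [1, 1, 2, 2, 3]
y = [2, 3, 1, 3, 1]
df = pd.DataFrame({'S': x, 'E': y})

rev = df.rename(columns={'S': 'E', 'E': 'S'})              # every edge reversed
merged = df.merge(rev, on=['S', 'E'], how='left', indicator=True)
df['is_bi'] = merged['_merge'].eq('both').to_numpy()       # assumes edges are unique
print(df[~df['is_bi']])                                    # the non-bidirectional edges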