Python pandas apply too slow Fuzzy Match

import pandas as pd

def fuzzy_clean(i, dfr, merge_list, key):
    # Narrow dfr down to the rows that match this row on every exact-match column
    for col in range(len(merge_list)):
        if col == 0:
            scaled_down = dfr[dfr[merge_list[col]] == i[merge_list[col]]]
        else:
            scaled_down = scaled_down[scaled_down[merge_list[col]] == i[merge_list[col]]]
    if len(scaled_down) > 0:
        if i[key] in scaled_down[key].values.tolist():
            return i[key]
        # Otherwise return the candidate timestamp closest to this row's timestamp
        return pd.to_datetime(scaled_down[key].loc[(scaled_down[key] - i[key]).abs().idxmin()])
    return i[key]

df[key] = df.apply(lambda i: fuzzy_clean(i, dfr, merge_list, key), axis=1)
I'm trying to eventually merge together two dataframes, dfr and df. The issue is that I need to merge on about 9 columns, one of which is a timestamp that doesn't quite match up between the two dataframes: sometimes it lags slightly, sometimes it leads. I wrote the function above, which works, but in practice it is just too slow when running through hundreds of thousands of rows.
merge_list is a list of columns that both dataframes share and that match up 100%.
key is a string naming a column, 'timestamp', that both dataframes share but that doesn't match up well.
Any suggestions for speeding this up would be greatly appreciated!
The data looks like the following:
df:
timestamp A B C
0 100 x y z
1 101 y i u
2 102 r a e
3 103 q w e
dfr:
timestamp A B C
0 100.01 x y z
1 100.99 y i u
2 101.05 y i u
3 102 r a e
4 103.01 q w e
5 103.20 q w e
I want df to look like the following:
timestamp A B C
0 100.01 x y z
1 100.99 y i u
2 102 r a e
3 103.01 q w e
Adding the final merge for reference:
def fuzzy_merge(df_left, df_right, on, key, how='outer'):
    df_right[key] = df_right.apply(lambda i: fuzzy_clean(i, df_left, on, key), axis=1)
    return pd.merge(df_left, df_right, on=on + [key], how=how, indicator=True).sort_values(key)

I've found a solution that I believe works. Pandas has merge_asof, shown below; I'm still verifying possible double counting, but it seems to do a decent job.
pd.merge_asof(left_df, right_df, on='timestamp', by=merge_list, direction='nearest')
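For reference, here is a minimal sketch of merge_asof on the sample frames above. The tolerance value and the helper column timestamp_dfr are my own additions, not part of the original answer; the tolerance should reflect how far the timestamps can actually drift.

import pandas as pd

df = pd.DataFrame({'timestamp': [100.0, 101.0, 102.0, 103.0],
                   'A': list('xyrq'), 'B': list('yiaw'), 'C': list('zuee')})
dfr = pd.DataFrame({'timestamp': [100.01, 100.99, 101.05, 102.0, 103.01, 103.20],
                    'A': list('xyyrqq'), 'B': list('yiiaww'), 'C': list('zuueee')})
merge_list = ['A', 'B', 'C']

# Keep a copy of dfr's timestamp so the matched value shows up in the result
dfr = dfr.assign(timestamp_dfr=dfr['timestamp'])

# Both frames must be sorted by the key used in `on`
out = pd.merge_asof(df.sort_values('timestamp'),
                    dfr.sort_values('timestamp'),
                    on='timestamp',
                    by=merge_list,
                    direction='nearest',
                    tolerance=0.5)   # assumed maximum drift between the two timestamps
print(out)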

Related

Data problem: identifying data rows where colleagues have reached a consensus

I have a table that shows the results of four colleagues trying to classify several objects as either a, b, c or d. If the colleagues agree on the classification, or if only one colleague is able to classify the object, then I want to show that classification in a new column. If the colleagues disagree, I want to create a separate dataframe that displays those objects. For each object, at most two colleagues are assigned to classify it, so there won't be a situation where three colleagues cannot agree on the classification.
It is easy to show an object's classification if only one colleague is able to identify it, but I am struggling when there are two. Given my novice Python skills, I can only get as far as the following.
The end result I am looking for is 'a' for the first row, 'b' for the third, and 'd' for the fourth. The second row would be singled out for manual classification by a more experienced colleague.
df_test = pd.DataFrame({'check1': ['a', 'a', 'unknown', 'd'],
                        'check2': ['unknown', 'b', 'unknown', 'unknown'],
                        'check3': ['unknown', 'unknown', 'c', 'd'],
                        'check4': ['unknown', 'unknown', 'c', 'unknown']})
cols = ['check_ind', 'check1_ind', 'check2_ind', 'check3_ind', 'check4_ind']
for col in cols:
    df_test[col] = 0
checks = [('check1', 'check1_ind'), ('check2', 'check2_ind'),
          ('check3', 'check3_ind'), ('check4', 'check4_ind')]
rows = df_test.shape[0]
for r in range(rows):
    for c in checks:
        if df_test.iloc[r, df_test.columns.get_loc(c[0])] != 'unknown':
            df_test.iloc[r, df_test.columns.get_loc(c[1])] = 1
sumcolumn = df_test['check1_ind'] + df_test['check2_ind'] + df_test['check3_ind'] + df_test['check4_ind']
df_test['body_check'] = sumcolumn
df.replace('unknown', np.nan, inplace=True)
df.apply(lambda x: x.dropna().unique()[0] if x.nunique() == 1 else 'No Consensus', axis=1)
Output:
0 a
1 No Consensus
2 c
3 d
dtype: object
In use:
df['consensus'] = df.apply(lambda x: x.dropna().unique()[0] if x.nunique() == 1 else np.nan, axis=1)
print(df)
...
check1 check2 check3 check4 consensus
0 a NaN NaN NaN a
1 a b NaN NaN NaN
2 NaN NaN c c c
3 d NaN d NaN d
Something like this should do the trick:
def function(series):
    val_counts = series.value_counts()
    if val_counts.size > 1:
        return 'No Consensus'
    else:
        return val_counts.index[0]

df_test.replace({'unknown': np.nan}).apply(function, axis=1)
For an efficient, vectorized approach, use mode:
df2 = (df_test
       .mask(df_test.eq('unknown'))
       .mode(1)
       # ensure having a "1" column
       .reindex(columns=[0, 1])
       )
print(df2)
# 0 1
# 0 a NaN
# 1 a b
# 2 c NaN
# 3 d NaN
m = df2[1].notna()
df_test['consensus'] = df2[0].mask(m, 'No consensus')
print(df_test)
Output:
check1 check2 check3 check4 consensus
0 a unknown unknown unknown a
1 a b unknown unknown No consensus
2 unknown unknown c c c
3 d unknown d unknown d
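For completeness, the same consensus column can also be built without mode, by counting distinct answers per row. This is only a rough sketch, restricted to the four check columns (that restriction is my assumption about the real data):

import numpy as np

checks = df_test[['check1', 'check2', 'check3', 'check4']].replace('unknown', np.nan)
first_answer = checks.bfill(axis=1).iloc[:, 0]   # first non-missing classification per row
# at most one distinct answer -> consensus; otherwise flag the row
df_test['consensus'] = np.where(checks.nunique(axis=1) <= 1, first_answer, 'No consensus')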

How to drill down data using pandas - pythonic way?

I have a dataframe like the one shown below
import numpy as np
import pandas as pd
from numpy.random import default_rng

rng = default_rng(100)
cdf = pd.DataFrame({'Id': [1, 2, 3, 4, 5],
                    'grade': rng.choice(list('ACD'), size=(5)),
                    'dash': rng.choice(list('PQRS'), size=(5)),
                    'dumeel': rng.choice(list('QWER'), size=(5)),
                    'dumma': rng.choice((1234), size=(5)),
                    'target': rng.choice([0, 1], size=(5))
                    })
My objective is to compute the drill down info for each column
Let me explain by an example.
If we filter the dataframe with df[df['grade']=='A'], we get 2 records as the result. Let's consider the filtered column grade as the parent_variable. Out of those 2 records, how much do the dumeel column (child_variable) values and dash column (child_variable) values account for the target column values (which are 0 and 1)? All categorical/object columns other than the parent variable are called child variables.
We have to repeat the above example procedure for all the categorical/object variables in our dataset.
As a first step, I made use of the code below from an SO post
funcs = {
    'cnt of records': 'count',
    'target met': lambda x: sum(x),
    'target met %': lambda x: f"{round(100 * sum(x) / len(x), 2):.2f}%"
}
out = df.select_dtypes('object').melt(ignore_index=False).join(df['target']) \
        .groupby(['variable', 'value'])['target'].agg(**funcs).reset_index()
out.rename(columns={'variable': 'parent_variable', 'value': 'parent_value'}, inplace=True)
But the above only gets me the % and count of target grouped by each parent variable. I would also like to get the breakdown by child variables (for each parent variable).
%_contrib is obtained by computing the percentage of that record relative to the target value. For example, for dash=P we have one grade value A (for target = 1), so it has to be 100%. Hope this helps.
I expect my output to look like the sample shown below. I have shown a sample only for a couple of columns under parent_variable, but in my real data there will be more than 20 categorical variables, so any efficient approach is welcome and useful.
As you are using a random function to generate the DataFrame it is hard for me to reproduce your example, but I think you are looking for value_counts -
This is the DataFrame I generated with your code -
grade dash dumeel dumma target
0 D P W 50 1
1 D S R 595 0
2 C P E 495 1
3 A Q Q 690 0
4 B P W 653 1
5 D R E 554 0
6 C P Q 392 1
7 D Q Q 186 0
8 B Q E 1228 1
9 C P E 14 0
When I do a value_counts() on the two columns -
df[(df['dash']=='P') & (df['target'] == 1)]['dumeel'].value_counts(normalize=True)
W 0.50
Q 0.25
E 0.25
Name: dumeel, dtype: float64
df[(df['dash']=='P') & (df['target'] == 1)]['grade'].value_counts(normalize=True)
C 0.50
D 0.25
B 0.25
Name: grade, dtype: float64
If you want to loop over all the child_columns - you can do
excl_cols = ['dash', 'target']
child_cols = [col for col in df.columns if col not in excl_cols]
for col in child_cols:
    print(df[(df['dash'] == 'P') & (df['target'] == 1)][col].value_counts(normalize=True))
If you want to loop over all the columns - then you can use:
loop_columns = set(df.columns) - {'target'}
for parent_col in loop_columns:
    print(f'Parent column is {parent_col}\n')
    parent_vals = df[parent_col].unique()
    child_cols = loop_columns - {parent_col}
    for parent_val in parent_vals:
        for child_col in child_cols:
            print(df[(df[parent_col] == parent_val) & (df['target'] == 1)][child_col].value_counts(normalize=True))
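If the nested print loops become unwieldy with 20+ categorical columns, the per-parent breakdown can also be collected into one tidy frame. This is only a rough sketch, assuming df holds the categorical columns plus a 0/1 target column; the result column names are my own choices:

import pandas as pd

cat_cols = df.select_dtypes('object').columns
pieces = []
for parent in cat_cols:
    children = [c for c in cat_cols if c != parent]
    # one long frame per parent: (parent value, child variable, child value, target)
    long = df.melt(id_vars=[parent, 'target'], value_vars=children,
                   var_name='child_variable', value_name='child_value')
    long = long.rename(columns={parent: 'parent_value'})
    long['parent_variable'] = parent
    pieces.append(long)

combo = pd.concat(pieces, ignore_index=True)
out = (combo.groupby(['parent_variable', 'parent_value',
                      'child_variable', 'child_value'])['target']
            .agg(cnt_of_records='count', target_met='sum')
            .reset_index())
out['target met %'] = (100 * out['target_met'] / out['cnt_of_records']).round(2)
print(out)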

Comparing pandas DataFrames where column values are lists

I have some chemical data that I'm trying to process using Pandas. I have two dataframes:
C_atoms_all.head()
id_all index_all label_all species_all position
0 217 1 C C [6.609, 6.6024, 19.3301]
1 218 2 C C [4.8792, 11.9845, 14.6312]
2 219 3 C C [4.8373, 10.7563, 13.9466]
3 220 4 C C [4.7366, 10.9327, 12.5408]
4 6573 5 C C [1.9482, -3.8747, 19.6319]
C_atoms_a.head()
id_a index_a label_a species_a position
0 55 1 C C [6.609, 6.6024, 19.3302]
1 56 2 C C [4.8792, 11.9844, 14.6313]
2 57 3 C C [4.8372, 10.7565, 13.9467]
3 58 4 C C [4.7367, 10.9326, 12.5409]
4 59 5 C C [5.1528, 15.5976, 14.1249]
What I want to do is get a mapping of all of the id_all values to the id_a values where their position matches. You can see that for C_atoms_all.iloc[0] (id_all 217) and C_atoms_a.iloc[0] (id_a 55), the position values match (within a small fudge factor), which I should also include in the query.
The problem I'm having is that I can't merge or filter on the position columns because lists aren't hashable in Python.
I'd ideally like to return a dataframe that looks like so:
id_all id_a position
217 55 [6.609, 6.6024, 19.3301]
... ... ...
for every row where the position values match.
You can do it like below:
I named your C_atoms_all as df_all and C_atoms_a as df_a:
# First we try to extract different values in "position" columns for both dataframes.
df_all["val0"] = df_all["position"].str[0]
df_all["val1"] = df_all["position"].str[1]
df_all["val2"] = df_all["position"].str[2]
df_a["val0"] = df_a["position"].str[0]
df_a["val1"] = df_a["position"].str[1]
df_a["val2"] = df_a["position"].str[2]
# Then because the position values match (within a small fudge factor)
# we round them to three decimals
df_all.loc[:, ["val0", "val1", "val2"]] = df_all[["val0", "val1", "val2"]].round(3)
df_a.loc[:, ["val0", "val1", "val2"]]= df_a[["val0", "val1", "val2"]].round(3)
# We use loc to modify the original dataframe, instead of a copy of it.
# Then we use merge on three extracted values from position column
df = df_all.merge(df_a, on=["val0", "val1", "val2"], left_index=False, right_index=False,
                  suffixes=(None, "_y"))
# Finally we just keep the desired columns
df = df[["id_all", "id_a", "position"]]
print(df)
id_all id_a position
0 217 55 [6.609, 6.6024, 19.3301]
1 218 56 [4.8792, 11.9845, 14.6312]
2 219 57 [4.8373, 10.7563, 13.9466]
3 220 58 [4.7366, 10.9327, 12.5408]
This isn't pretty, but it might work for you
def do(x, df_a):
    try:
        return next((df_a.iloc[i]['id_a'] for i in df_a.index if df_a.iloc[i]['position'] == x))
    except StopIteration:
        return np.nan
match = pd.DataFrame(C_atoms_all[['id_all', 'position']])
match['id_a'] = C_atoms_all['position'].apply(do, args=(C_atoms_a,))
You can create a new column in both datasets that contains the hash of the position column and then merge both datasets by that new column.
# Custom hash function
def hash_position(position):
    return hash(tuple(position))
# Create the hash column "hashed_position"
C_atoms_all['hashed_position'] = C_atoms_all['position'].apply(hash_position)
C_atoms_a['hashed_position'] = C_atoms_a['position'].apply(hash_position)
# merge datasets
C_atoms_a.merge(C_atoms_all, how='inner', on='hashed_position')
# ... keep the columns you need
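Note that hashing only matches lists that are exactly equal, so the small fudge factor gets lost. A rough sketch of combining the hash idea with rounding (the number of decimals is an assumption, and values that straddle a rounding boundary can still land in different buckets):

import numpy as np

def rounded_key(position, decimals=3):
    # round each coordinate before building the key so near-equal positions collide
    return tuple(np.round(position, decimals))

C_atoms_all['pos_key'] = C_atoms_all['position'].apply(rounded_key)
C_atoms_a['pos_key'] = C_atoms_a['position'].apply(rounded_key)

mapping = (C_atoms_all.merge(C_atoms_a, on='pos_key', how='inner')
                      [['id_all', 'id_a', 'position_x']]
                      .rename(columns={'position_x': 'position'}))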
Your question is not entirely clear, though it seems an interesting one. For that reason I have reproduced your data in a more useful format, in case someone can help more than I can.
Data
C_atoms_all = pd.DataFrame({
    'id_all': [217, 218, 219, 220, 6573],
    'index_all': [1, 2, 3, 4, 5],
    'label_all': ['C', 'C', 'C', 'C', 'C'],
    'species_all': ['C', 'C', 'C', 'C', 'C'],
    'position': [[6.609, 6.6024, 19.3301], [4.8792, 11.9845, 14.6312], [4.8373, 10.7563, 13.9466],
                 [4.7366, 10.9327, 12.5408], [1.9482, -3.8747, 19.6319]]})
C_atoms_a = pd.DataFrame({
    'id_a': [55, 56, 57, 58, 59],
    'index_a': [1, 2, 3, 4, 5],
    'label_a': ['C', 'C', 'C', 'C', 'C'],
    'species_a': ['C', 'C', 'C', 'C', 'C'],
    'position': [[6.609, 6.6024, 19.3302], [4.8792, 11.9844, 14.6313], [4.8372, 10.7565, 13.9467],
                 [4.7367, 10.9326, 12.5409], [5.1528, 15.5976, 14.1249]]})
Solution
# New dataframe bringing together the two position columns
df3 = C_atoms_all.set_index('index_all').join(
    C_atoms_a.set_index('index_a').loc[:, 'position'].to_frame(), rsuffix='_r').reset_index()
# Create a temp column holding the rounded element-wise differences between the two positions
df3['temp'] = df3.filter(regex='^position').apply(
    lambda x: np.round(np.array(x.iloc[0]) - np.array(x.iloc[1]), 4), axis=1)
# Treat a row as a match when at most one of the three coordinate differences is non-zero
C_atoms_all[df3.explode('temp').groupby(level=0)['temp'].apply(lambda x: x.eq(0).sum()).gt(1)]
id_all index_all label_all species_all position
0 217 1 C C [6.609, 6.6024, 19.3301]

Splitting and copying a row in pandas

I have a task that is completely driving me mad. Let's suppose we have this df:
import pandas as pd
k = {'random_col': {0: 'a', 1: 'b', 2: 'c'},
     'isin': {0: 'ES0140074008', 1: 'ES0140074008ES0140074010', 2: 'ES0140074008ES0140074016ES0140074024'},
     'n_isins': {0: 1, 1: 2, 2: 3}}
k = pd.DataFrame(k)
What I want to do is duplicate or triplicate a row a number of times governed by col n_isins, which is the length of col isin divided by 12, as isins are always strings of 12 characters.
So I need row 0 once, row 1 twice, and row 2 three times. My real numbers are capped at 6, so it is a hard task. I began by using booleans and slicing the col isin, but that got me nowhere. Hopefully my explanation is good enough. I also need the col isin sliced like [0:11] + ' ' + [12:23]..., splitting at each 'E', but I think I know how to do that; I just mention it because it is the criterion that governs how many times I have to copy each row. Thanks in advance!
I think you need numpy.repeat with loc, then remove the duplicated index with reset_index. Last, for the new column, use a custom splitting function with numpy.concatenate:
n = np.repeat(k.index, k['n_isins'])
k = k.loc[n].reset_index(drop=True)
print (k)
isin n_isins random_col
0 ES0140074008 1 a
1 ES0140074008ES0140074010 2 b
2 ES0140074008ES0140074010 2 b
3 ES0140074008ES0140074016ES0140074024 3 c
4 ES0140074008ES0140074016ES0140074024 3 c
5 ES0140074008ES0140074016ES0140074024 3 c
# https://stackoverflow.com/a/7111143/2901002
def chunks(s, n):
    """Produce `n`-character chunks from `s`."""
    for start in range(0, len(s), n):
        yield s[start:start + n]
s = np.concatenate(k['isin'].apply(lambda x: list(chunks(x, 12))))
k['new'] = pd.Series(s, index=k.index)
print (k)
isin n_isins random_col new
0 ES0140074008 1 a ES0140074008
1 ES0140074008ES0140074010 2 b ES0140074008
2 ES0140074008ES0140074010 2 b ES0140074010
3 ES0140074008ES0140074016ES0140074024 3 c ES0140074008
4 ES0140074008ES0140074016ES0140074024 3 c ES0140074016
5 ES0140074008ES0140074016ES0140074024 3 c ES0140074024
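As a side note, on recent pandas versions the repeat-and-split can be written more compactly with str.findall and explode (explode's ignore_index needs pandas 1.1+). A sketch, starting from the original three-row k (before the np.repeat step) and assuming every chunk is exactly 12 characters:

out = (k.assign(new=k['isin'].str.findall('.{12}'))   # split isin into 12-character chunks
        .explode('new', ignore_index=True))           # one row per chunk
print(out)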

pandas merge by coordinates

I am trying to merge two pandas tables where I find all rows in df2 which have coordinates close to each row in df1. Example follows.
df1:
x y val
0 0 1 A
1 1 3 B
2 2 9 C
df2:
x y val
0 1.2 2.8 a
1 0.9 3.1 b
2 2.0 9.5 c
desired result:
x y val_x val_y
0 0 1 A NaN
1 1 3 B a
2 1 3 B b
3 2 9 C c
Each row in df1 can have 0, 1, or many corresponding entries in df2, and finding the match should be done with a cartesian distance:
(x1 - x2)^2 + (y1 - y2)^2 < 1
The input dataframes have different sizes, even though they don't in this example. I can get close by iterating over the rows in df1 and finding the close values in df2, but am not sure what to do from there:
for i, row in df1.iterrows():
    df2_subset = df2.loc[(df2.x - row.x)**2 + (df2.y - row.y)**2 < 1.0]
    # ?? What now?
Any help would be very much appreciated. I made this example in an IPython notebook, which you can view/access here: http://nbviewer.ipython.org/gist/anonymous/49a3d821420c04169f02
I found an answer, though I am not really happy with having to loop over the rows in df1. In this case there are only a few hundred rows so I can deal with it, but it won't scale as well as something vectorized. Solution:
df2_list = []
df1['merge_row'] = df1.index.values  # Make a column to merge on using the index values
for i, row in df1.iterrows():
    df2_subset = df2.loc[(df2.x - row.x)**2 + (df2.y - row.y)**2 < 1.0]
    df2_subset['merge_row'] = i  # Add a merge row
    df2_list.append(df2_subset)
df2_found = pd.concat(df2_list)
result = pd.merge(df1, df2_found, on='merge_row', how='left')
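If the loop over df1 ever becomes the bottleneck, a spatial index can find all neighbours within the radius in one vectorized query. Here is a sketch using SciPy's cKDTree (assuming SciPy is available; the merge_row bookkeeping mirrors the loop above):

import pandas as pd
from scipy.spatial import cKDTree

# Build a KD-tree on df2's coordinates, then find every df2 row within
# Euclidean distance 1 of each df1 row in a single call
tree = cKDTree(df2[['x', 'y']].to_numpy())
neighbours = tree.query_ball_point(df1[['x', 'y']].to_numpy(), r=1.0)

# Expand the list-of-lists into one (merge_row, df2 position) pair per match
pairs = pd.DataFrame([(i, j) for i, hits in enumerate(neighbours) for j in hits],
                     columns=['merge_row', 'df2_pos'])

df1 = df1.reset_index(drop=True)
df1['merge_row'] = df1.index
df2_found = df2.reset_index(drop=True).iloc[pairs['df2_pos']].assign(merge_row=pairs['merge_row'].values)
result = pd.merge(df1, df2_found, on='merge_row', how='left')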
