Python/Pandas: Figuring out the source of SettingWithCopyWarning in a function

I cannot find the source of a SettingWithCopyWarning. I tried to fix the assignment operations as suggested in the documentation, but it still gives me the SettingWithCopyWarning. Any help would be greatly appreciated.
def fuzzy_merge(df_left, df_right, key_left, key_right, threshold=84, limit=1):
    """
    df_left: the left table to join
    df_right: the right table to join
    key_left: the key column of the left table
    key_right: the key column of the right table
    threshold: how close the matches should be to return a match, based on Levenshtein distance
    limit: the amount of matches that will get returned, these are sorted high to low
    """
    s = df_right.loc[:, key_right].tolist()
    m = df_left.loc[:, key_left].apply(lambda x: process.extract(x, s, limit=limit))
    df_left.loc[:, "matches"] = m
    m2 = df_left.loc[:, "matches"].apply(
        lambda x: ", ".join([i[0] for i in x if i[1] >= threshold])
    )
    df_left.loc[:, "matches"] = m2
    return df_left
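In cases like this, the warning usually does not come from the .loc assignments inside the function but from df_left itself being a slice of another DataFrame (for example df_left = big_df[big_df['col'] > 0]) taken before fuzzy_merge is called. A minimal sketch of one common fix, assuming that is indeed the situation: work on an explicit copy inside the function.

def fuzzy_merge(df_left, df_right, key_left, key_right, threshold=84, limit=1):
    df_left = df_left.copy()  # detach from any parent DataFrame so the assignments are unambiguous
    s = df_right[key_right].tolist()
    matches = df_left[key_left].apply(lambda x: process.extract(x, s, limit=limit))
    df_left["matches"] = matches.apply(
        lambda x: ", ".join(i[0] for i in x if i[1] >= threshold)
    )
    return df_left

To locate the offending line instead of silencing it, pd.set_option('mode.chained_assignment', 'raise') turns the warning into an exception with a full traceback.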

Related

Compare two columns on Fuzzy match

I am using Google Colab to perform a fuzzy match between 2 columns of a dataframe.
I want to list all values in the first column based on a complete or partial match and put EXISTS if there is a match.
I have tried the code below, but it takes very long to execute on 5000 x 2 records.
Below is my code:
#pip install fuzzywuzzy
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import difflib

data = pd.read_csv('/content/mydata')
Df = pd.DataFrame(data[['ColA', 'ColB']])
df1 = pd.DataFrame(data['ColA'])
df2 = pd.DataFrame(data['ColB'])
def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    :param df_1: the left table to join
    :param df_2: the right table to join
    :param key1: key column of the left table
    :param key2: key column of the right table
    :param threshold: how close the matches should be to return a match, based on Levenshtein distance
    :param limit: the amount of matches that will get returned, these are sorted high to low
    :return: dataframe with both keys and matches
    """
    s = df_2[key2].tolist()
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    df_1['matches'] = m
    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2
    return df_1

fuzzy_merge(df1, df2, 'ColA', 'ColB')
Below is my Dataframe
|ColA|ColB|result|
|-|-|-|
|aaabc.eval.moc|abcde|EXISTS|
|abcde.eval|abc.123|EXISTS|
|def.gcd.xyz|def.gc|EXISTS|
|abc.123.moc|xyz123.eval.moc.facebook.google|EXISTS|
|xyz123.eval.moc|google.facebook.apple.chromebook|EXISTS|
|google.facebook.apple|435|NOT EXISTS|
|Testing435||NOT EXISTS|
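For the EXISTS / NOT EXISTS column above, here is a minimal sketch of one way to score each ColA value against all of ColB with process.extractOne and a score cutoff. The scorer (fuzz.partial_ratio) and the threshold of 75 are assumptions to tune for your definition of a partial match; rapidfuzz exposes largely the same API and is much faster on 5000-row columns.

from fuzzywuzzy import fuzz, process

choices = df2['ColB'].dropna().astype(str).tolist()

def exists(value, threshold=75):
    # extractOne returns None when no choice reaches score_cutoff
    match = process.extractOne(str(value), choices,
                               scorer=fuzz.partial_ratio,
                               score_cutoff=threshold)
    return 'EXISTS' if match is not None else 'NOT EXISTS'

df1['result'] = df1['ColA'].apply(exists)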

How to filter this dataframe?

I have a large dataframe (sample). I was filtering the data according to this code:
A = [f"A{i}" for i in range(50)]
B = [f"B{i}" for i in range(50)]
C = [f"C{i}" for i in range(50)]
for i in A:
cond_A = (df[i]>= -0.0423) & (df[i]<=3)
filt_df = df[cond_A]
for i in B:
cond_B = (filt_df[i]>= 15) & (filt_df[i]<=20)
filt_df2 = filt_df[cond_B]
for i in C:
cond_C = (filt_df2[i]>= 15) & (filt_df2[i]<=20)
filt_df3 = filt_df2[cond_B]
When I print filt_df3, I am getting only an empty dataframe - why?
How can I improve the code, perhaps with other approaches or more advanced techniques?
I am not sure the code above works as outlined in the edit below; how can I change it so that it does?
Edit:
I want to remove rows based on columns A0 - A49 using cond_A.
Then filter the dataframe from step 1 based on columns B0 - B49 with cond_B.
Then filter the dataframe from step 2 based on columns C0 - C49 with cond_C.
Thank you very much in advance.
It seems to me that there is an issue with your code when you use iteration to do the filtering. For example, filt_df is overwritten in every iteration of the first loop, so when the loop ends, filt_df only contains the data filtered with the condition from the last iteration. Is this what you intend to do?
If you want to do the filtering efficiently, you can use pandas.DataFrame.query (see documentation here). For example, if you want to keep only rows whose columns B0 to B49 all contain values between 0 and 200 inclusive, you can use the Python code below (assuming the raw data has been loaded into the variable df).
condition_list = [f'B{i} >= 0 & B{i} <= 200' for i in range(50)]
filter_str = ' & '.join(condition_list)
subset_df = df.query(filter_str)
print(subset_df)
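Applied to the actual ranges in the question, the same pattern can chain all three blocks of columns into one query string; a sketch, assuming columns A0-A49, B0-B49 and C0-C49 all exist in df:

cond_A = [f'A{i} >= -0.0423 & A{i} <= 3' for i in range(50)]
cond_B = [f'B{i} >= 15 & B{i} <= 20' for i in range(50)]
cond_C = [f'C{i} >= 15 & C{i} <= 20' for i in range(50)]
filt_df = df.query(' & '.join(cond_A + cond_B + cond_C))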
Since the column A1 contains only -0.057, which is outside [-0.0423, 3], everything gets filtered out.
Nevertheless, you also do not carry the filter over within each loop, because filt_df, filt_df2 and filt_df3 are reset from the unfiltered frame on every iteration.
This should work:
import pandas as pd

A = [f"A{i}" for i in range(50)]
B = [f"B{i}" for i in range(50)]
C = [f"C{i}" for i in range(50)]

filt_df = df.copy()
for i in A:
    cond_A = (filt_df[i] >= -0.0423) & (filt_df[i] <= 3)
    filt_df = filt_df[cond_A]

filt_df2 = filt_df.copy()
for i in B:
    cond_B = (filt_df2[i] >= 15) & (filt_df2[i] <= 20)
    filt_df2 = filt_df2[cond_B]

filt_df3 = filt_df2.copy()
for i in C:
    cond_C = (filt_df3[i] >= 15) & (filt_df3[i] <= 20)
    filt_df3 = filt_df3[cond_C]

print(filt_df3)
Of course, you will find many filter tools in the pandas library that can be applied to multiple columns,
for example this answer:
https://stackoverflow.com/a/39820329/6139079
You can filter by all columns together with DataFrame.all to test whether all values in a row match:
A = [f"A{i}" for i in range(50)]
cond_A = ((df[A] >= -0.0423) & (df[A]<=3)).all(axis=1)
B = [f"B{i}" for i in range(50)]
cond_B = ((df[B]>= 15) & (df[B]<=20)).all(axis=1)
C = [f"C{i}" for i in range(50)]
cond_C = ((df[C]>= 15) & (df[C]<=20)).all(axis=1)
And lastly, chain all masks with & for bitwise AND:
filt_df = df[cond_A & cond_B & cond_C]
If you get an empty DataFrame, it seems that no row satisfies all conditions.
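If the result is empty, a quick way to see which block of conditions is responsible is to count how many rows pass each mask (cond_A, cond_B, cond_C as defined above):

print(cond_A.sum(), cond_B.sum(), cond_C.sum())                    # rows passing each block alone
print((cond_A & cond_B).sum(), (cond_A & cond_B & cond_C).sum())   # rows surviving the chained masks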

Apply Function under Group == condition

I have a DataFrame as follows:
position_latitude position_longitude geohash
0 53.398940 10.069293 u1
1 53.408875 10.052669 u1
2 48.856350 9.171759 u0
3 48.856068 9.170798 u0
4 48.856350 9.171759 u0
What I want to do now is get the nearest node to these positions, using different shapefiles based on the geohash.
So for every group in geohash (e.g. u1) I want to load the graph from a file and then use that graph in a function to get the nearest node.
I could do it in a for loop, however I think there are more efficient ways of doing so.
I thought of something like this:
df['nearestNode'] = geoSub.apply(lambda x: getDistanceToEdge(x.position_latitude, x.position_longitude, x.geohash), axis=1)
However, I can't figure out how to load the graph only once per group, since it will take some time to get it from the file.
What I came up with so far:
groupHashed = geoSub.groupby('geohash')
geoSub['distance'] = np.nan

for name, group in groupHashed:
    G = osmnx.graph.graph_from_xml('geohash/' + name + '.osm', simplify=True, retain_all=False)
    geoSub['distance'] = geoSub.apply(lambda x: getDistanceToEdge(x.position_latitude, x.position_longitude, G) if x.geohash == name else x.distance, axis=1)

This definitely seems to work, however I feel like the if condition slows it down drastically.
Update:
I just changed:
geoSub['distance'] = geoSub.apply(lambda x: getDistanceToEdge(x.position_latitude, x.position_longitude, G) if x.geohash == name else x.distance, axis=1)
to:
geoSub['distance'] = geoSub[geoSub['geohash'] == name].apply(lambda x: getDistanceToEdge(x.position_latitude, x.position_longitude, G), axis=1)
It's a lot faster now. Is there an even better method?
You can use transform.
I am stubbing out G and getDistanceToEdge (as x + y + geohash[-1]) to show a working example.
import pandas as pd
from io import StringIO

data = StringIO("""
,position_latitude,position_longitude,geohash
0,53.398940,10.069293,u1
1,53.408875,10.052669,u1
2,48.856350,9.171759,u0
3,48.856068,9.170798,u0
4,48.856350,9.171759,u0
""")
df = pd.read_csv(data, index_col=0).fillna('')

def getDistanceToEdge(x, y, G):
    return x + y + G

def fun(pos):
    G = int(pos.values[0][-1][-1])
    return pos.apply(lambda x: getDistanceToEdge(x[0], x[1], G))

df['pos'] = list(zip(df['position_latitude'], df['position_longitude'], df['geohash']))
df['distance'] = df.groupby(['geohash'])['pos'].transform(fun)
df = df.drop(['pos'], axis=1)
print(df)
Output:
position_latitude position_longitude geohash distance
0 53.398940 10.069293 u1 64.468233
1 53.408875 10.052669 u1 64.461544
2 48.856350 9.171759 u0 58.028109
3 48.856068 9.170798 u0 58.026866
4 48.856350 9.171759 u0 58.028109
As you can see, you can get the name of the group using pos.values[0][-1] inside the function fun. This is because we are framing the pos column as a tuple of (lat, long, geohash), and each geohash within a group after groupby is the same. So within a group we can grab the geohash by taking the last value of the tuple (pos) of any row; pos.values[0][-1] gives the last value of the tuple of the first row.
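If the goal is really to load one graph per geohash and write the distances back, a per-group loop with a boolean .loc assignment is another option that avoids packing the columns into tuples. This is only a sketch, assuming getDistanceToEdge and the per-geohash .osm files exist as in the question:

import osmnx

for name, group in geoSub.groupby('geohash'):
    # the graph is loaded once per geohash, not once per row
    G = osmnx.graph.graph_from_xml('geohash/' + name + '.osm', simplify=True, retain_all=False)
    geoSub.loc[group.index, 'distance'] = group.apply(
        lambda x: getDistanceToEdge(x.position_latitude, x.position_longitude, G), axis=1)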

How to get the row index for pandas apply function on a Series

I have a DataFrame that I split into column Series (col_series in the snippet below) and use apply to test each value in each Series. But I would like to report which row in the Series is affected when I detect an error.
...
col_series.apply(self.testdatelimits, args= \
    (datetime.strptime('2018-01-01', '%Y-%m-%d'), key))

def testlimits(self, row_id, x, lowerlimit, col_name):
    low_error = None
    d = float(x)
    if lowerlimit != 'NA' and d < float(lowerlimit):
        low_error = 'Following record has column ' + col_name + ' lower than range check'
    if low_error is not None:
        self.set_error(col_index, row_id, low_error)
Of course the above fails because x is a str and does not have the name property. I am thinking that maybe I can pass in the row index of the Series, but I am not clear on how to do that.
Edit:
I switched to building the calls with list(map(...)) rather than pandas apply to solve this issue. It is significantly faster, too.
col_series = col_series.apply(pd.to_datetime, errors='ignore')
dfwithrow = pd.DataFrame(col_series)
dfwithrow.insert(0, 'rowid', range(0, len(dfwithrow)))
dfwithrow['lowerlimit'] = lowlimit
dfwithrow['colname'] = 'fred'
list(map(self.testdatelimits, dfwithrow['rowid'], dfwithrow[colvalue[0]],
         dfwithrow['lowerlimit'], dfwithrow['colname']))
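If you would rather keep working on the original Series, the row label is also available directly; a minimal sketch, where check(row_id, value) stands in for your validation call:

# option 1: plain iteration over (index label, value) pairs
for row_id, value in col_series.items():
    check(row_id, value)

# option 2: row-wise apply on a one-column frame; row.name is the index label
col_series.to_frame('val').apply(lambda row: check(row.name, row['val']), axis=1)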

pandas multiindex shift on filtered values

I want to get the time differences between rows of interest.
t = pd.date_range('1/1/2000', periods=6, freq='D')
d = pd.DataFrame({'sid': ['a'] * 3 + ['b'] * 3,
                  'src': ['m'] * 3 + ['t'] * 3,
                  'alert_v': [1, 0, 0, 0, 1, 1]}, index=t)
I want to get the time difference between rows where alert_v == 1.
I've tried shifting, but are there other ways to take the difference between two rows in a column?
I have tried simple lambdas and more complex .loc:
def deltat(g):
    g['d1'] = g[g['alert_v'] == 1]['timeindex'].shift(1)
    g['d0'] = g[g['alert_v'] == 1]['timeindex']
    g['td'] = g['d1'] - g['d0']
    return g

d['td'] = d.groupby(['src', 'sid']).apply(lambda x: deltat(x))

def indx(g):
    d0 = g.loc[g['alert_v'] == 1]
    d1[0] = d0[0]
    d1.append(d0[:-1])
    g['tavg'] = g.apply(g.ix[d1, 'timeindex'] - g.ix[d0, 'timeindex'])
    return g
After trying a bunch of approaches, I can't seem to get past either the multi-group or the filtering issues...
What's the best way to do this?
Edit:
diff(1) produces this error:
raise TypeError('incompatible index of inserted column '
TypeError: incompatible index of inserted column with frame index
while shift(1) produces this error:
ZeroDivisionError: integer division or modulo by zero
Attempting to clean the data did not help:
if any(pd.isnull(g['timeindex'])):
    print '## timeindex not null'
    g['timeindex'].fillna(method='ffill')
For the multiindex group, select rows, diff, and insert new column paradigm: this is how I got it to work with clean output.
Some groups have 0 relevant rows, which throws an exception.
shift throws a KeyError, so I am sticking with diff().
# -- get the interarrival time
def deltat(g):
    try:
        g['tavg'] = g[g['alert_v'] == 1]['timeindex'].diff(1)
        return g
    except:
        pass

d.sort_index(axis=0, inplace=True)
d = d.groupby(['source', 'subject_id', 'alert_t', 'variable'], as_index=False, group_keys=False).apply(lambda x: deltat(x))
print d[d['alert_v'] == 1][['timeindex', 'tavg']]
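With the toy frame from the setup above (dates on the index rather than a separate timeindex column), the same diff-per-group idea can be written without the bare except; a minimal, self-contained sketch for illustration:

import pandas as pd

t = pd.date_range('1/1/2000', periods=6, freq='D')
d = pd.DataFrame({'sid': ['a'] * 3 + ['b'] * 3,
                  'src': ['m'] * 3 + ['t'] * 3,
                  'alert_v': [1, 0, 0, 0, 1, 1]}, index=t)

def deltat(g):
    g = g.copy()
    alerts = g.index[g['alert_v'] == 1]
    # time difference between consecutive alert rows within the group
    g.loc[alerts, 'td'] = alerts.to_series().diff()
    return g

d = d.groupby(['src', 'sid'], group_keys=False).apply(deltat)
print(d)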
