fuzzywuzzy process.extractOne giving different results - Python

I have a data frame and I am trying to map the values of one of its columns to the values present in a set.
The data frame is:
Name CallType Location
ABC IN SFO
DEF OUT LHR
PQR INCOMING AMS
XYZ OUTGOING BOM
TYR A_IN DEL
OMN A_OUT DXB
I have a constant set of call types, and each CallType value should be replaced by the matching value from that set:
call_types = {'IN', 'OUT'}
Desired data frame
Name CallType Location
ABC IN SFO
DEF OUT LHR
PQR IN AMS
XYZ OUT BOM
TYR IN DEL
OMN OUT DXB
I wrote the code below to check the result, but process.extractOne sometimes gives IN for OUTGOING (which is wrong) and sometimes gives OUT for OUTGOING (which is right).
Here is my code:
import pandas as pd
from fuzzywuzzy import process

data = [('ABC', 'IN', 'SFO'),
        ('DEF', 'OUT', 'LHR'),
        ('PQR', 'INCOMING', 'AMS'),
        ('XYZ', 'OUTGOING', 'BOM'),
        ('TYR', 'A_IN', 'DEL'),
        ('OMN', 'A_OUT', 'DXB')]
df = pd.DataFrame(data, columns=['Name', 'CallType', 'Location'])

call_types = set(['IN', 'OUT'])
df['CallType'] = df['CallType'].apply(lambda x: process.extractOne(x, list(call_types))[0])

total_rows = len(df)
for row_no in range(total_rows):
    row = df.iloc[row_no]
    print(row)  # Here OUTGOING sometimes becomes OUT and sometimes IN. Shouldn't the result be consistent?
I am not sure if there is a better way. Can someone please tell me if I am missing something?

Looks like Series.str.extract is a good fit for this:
df['CallType'] = df.CallType.str.extract(r'(OUT|IN)')
print(df)
Name CallType Location
0 ABC IN SFO
1 DEF OUT LHR
2 PQR IN AMS
3 XYZ OUT BOM
4 TYR IN DEL
5 OMN OUT DXB
Or, if you want to use call_types explicitly, you can do:
df['CallType'] = df.CallType.str.extract(fr"({'|'.join(call_types)})")
# same result
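As for why process.extractOne flips between IN and OUT for OUTGOING: extractOne returns the highest-scoring choice, and when two choices tie it keeps whichever came first in the choices list. Since list(call_types) is built from a set, its order can change from one run to the next, which would explain the inconsistency. A quick, hypothetical way to inspect the scores:
from fuzzywuzzy import process

# Show the score each candidate gets for one problematic value;
# if the two scores tie, the (unstable) set ordering decides the winner.
print(process.extract('OUTGOING', ['IN', 'OUT']))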

A possible solution is to use difflib.get_close_matches:
import difflib
df['CallType'] = df['CallType'].apply(
    lambda x: difflib.get_close_matches(x, call_types)[0])
Output:
Name CallType Location
0 ABC IN SFO
1 DEF OUT LHR
2 PQR IN AMS
3 XYZ OUT BOM
4 TYR IN DEL
5 OMN OUT DXB
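One caveat, noted here as an assumption to verify on your data: difflib.get_close_matches uses a similarity cutoff of 0.6 by default and returns an empty list when nothing clears it (OUT vs OUTGOING scores only about 0.55), in which case indexing with [0] raises an IndexError. A guarded variant with a lower, tunable cutoff:
import difflib

def to_call_type(x, choices=('IN', 'OUT'), cutoff=0.4):
    # cutoff=0.4 is an assumed value; tune it for your data
    matches = difflib.get_close_matches(x, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else x  # keep the original value if nothing matches

df['CallType'] = df['CallType'].apply(to_call_type)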
Another possible solution:
import numpy as np
df['CallType'] = np.where(df['CallType'].str.contains('OUT'), 'OUT', 'IN')
Output:
# same

Related

More effective method to test a large dataframe and add values based on another value (different size / not merge)

There are lots of answers on merging and on whole-column operations, but I can't figure out a more effective method for my situation.
Current versions of Python, pandas and numpy; the file format is parquet.
Simply put: if col1 == x, then col10 = 1, col11 = 2, etc.
look1 = 'EMPLOYEE'
look2 = 'CHESTER'
look3 = "TONY'S"
look4 = "VICTOR'S"
tgt1 = 'inv_group'
tgt2 = 'acc_num'
for x in range(len(df['ph_name'])):
    if df['ph_name'][x] == look1:
        df[tgt1][x] = 'MEMORIAL'
        df[tgt2][x] = 12345
    elif df['ph_name'][x] == look2:
        df[tgt1][x] = 'WALMART'
        df[tgt2][x] = 45678
    elif df['ph_name'][x] == look3:
        df[tgt1][x] = 'TONYS'
        df[tgt2][x] = 27359
    elif df['ph_name'][x] == look4:
        df[tgt1][x] = 'VICTOR'
        df[tgt2][x] = 45378
basic sample:
unit_name tgt1 tgt2
0 EMPLOYEE NaN NaN
1 EMPLOYEE NaN NaN
2 TONY'S NaN NaN
3 CHESTER NaN NaN
4 VICTOR'S NaN NaN
5 EMPLOYEE NaN NaN
GOAL:
unit_name tgt1 tgt2
0 EMPLOYEE MEMORIAL 12345
1 EMPLOYEE MEMORIAL 12345
2 TONY'S TONYS 27359
3 CHESTER WALMART 45678
4 VICTOR'S VICTOR 45378
5 EMPLOYEE MEMORIAL 12345
So this works... I get the custom column values added. It's not the fastest under the sun, but it works.
It takes 6.2429744 seconds on 28,896 rows. I'm concerned that when I put it to the grind, it's going to start dragging me down.
The other downside is that I get this annoyance... Yes, I can silence it, but I feel like it might be due to a bad practice that I should know how to curtail.
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
Basically...
Is there a way to optimize this?
Is this warning due to a bad habit, my ignorance, or do I just need to silence it?
Given: (It's silly to have all NaN columns)
unit_name
0 EMPLOYEE
1 EMPLOYEE
2 TONY'S
3 CHESTER
4 VICTOR'S
5 EMPLOYEE
df = pd.DataFrame({'unit_name': {0: 'EMPLOYEE', 1: 'EMPLOYEE', 2: "TONY'S", 3: 'CHESTER', 4: "VICTOR'S", 5: 'EMPLOYEE'}})
Doing: (Let's use pd.Series.map and create a dictionary for easier future modification)
looks = ['EMPLOYEE', 'CHESTER', "TONY'S", "VICTOR'S"]
new_cols = {
    'inv_group': ["MEMORIAL", "WALMART", "TONYS", "VICTOR"],
    'acc_num': [12345, 45678, 27359, 45378]
}

for col, values in new_cols.items():
    df[col] = df['unit_name'].map(dict(zip(looks, values)))

print(df)
Output: (I assumed you'd typed the column names wrong)
unit_name inv_group acc_num
0 EMPLOYEE MEMORIAL 12345
1 EMPLOYEE MEMORIAL 12345
2 TONY'S TONYS 27359
3 CHESTER WALMART 45678
4 VICTOR'S VICTOR 45378
5 EMPLOYEE MEMORIAL 12345
Flying blind here since I don't see your data:
import numpy as np

cond_list = [df["ph_name"] == look for look in [look1, look2, look3, look4]]
# Rows whose ph_name is outside the list keep their original values
df[tgt1] = np.select(cond_list, ["MEMORIAL", "WALMART", "TONYS", "VICTOR"], default=df[tgt1])
df[tgt2] = np.select(cond_list, [12345, 45678, 27359, 45378], default=df[tgt2])
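On the SettingWithCopyWarning: it usually comes from chained indexing such as df[tgt1][x] = ..., which may write into a temporary object rather than df itself. Writing through a single .loc call on the original frame avoids both the warning and the Python-level loop; a minimal sketch, assuming the column and constant names from the question:
# One boolean-mask assignment per lookup value, filling both columns at once
df.loc[df['ph_name'] == look1, [tgt1, tgt2]] = ['MEMORIAL', 12345]
df.loc[df['ph_name'] == look2, [tgt1, tgt2]] = ['WALMART', 45678]
df.loc[df['ph_name'] == look3, [tgt1, tgt2]] = ['TONYS', 27359]
df.loc[df['ph_name'] == look4, [tgt1, tgt2]] = ['VICTOR', 45378]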

Check if multiple columns are filled and if not, blank all values

I have the following dataframe:
ID Name Date Code Manager
1 Paulo % 10 Jonh's Peter
2 Pedro 2001-01-20 20
3 James 30
4 Sofia 2001-01-20 40 -
I need a way to check whether multiple columns (in this case Date, Code and Manager) are filled with any value. If there is a blank in any of those 3 columns, blank all the values in all 3 columns, returning this data:
ID Name Date Code Manager
1 Paulo % 10 Jonh's Peter
2 Pedro
3 James
4 Sofia 2001-01-20 40 -
What is the best solution for this case?
You can use pandas.DataFrame.loc to replace values in certain columns based on a condition.
Considering that your dataframe is named df, you can use this code to get the expected output:
df.loc[df[["Date", "Code", "Manager"]].isna().any(axis=1), ["Date", "Code", "Manager"]] = ''
print(df)
You can use DataFrame.isnull() with .any(axis=1), then use DataFrame.loc to set whatever value you want.
col_chk = ['Date', 'Code', 'Manager']
m = df[col_chk].isnull().any(axis=1)
df.loc[m , col_chk] = '' # or pd.NA
print(df)
ID Name Date Code Manager
0 1 Paulo % 10.0 Jonh's Peter
1 2 Pedro
2 3 James
3 4 Sofia 2001-01-20 40.0 -

Python Pandas fill missing zipcode with values from another dataframe based on conditions

I have a dataset in which I add coordinates to cities based on zip-codes but several of these zip-codes are missing. Also, in some cases cities are missing, states are missing, or both are missing. For example:
ca_df[['OWNER_CITY', 'OWNER_STATE', 'OWNER_ZIP']]
OWNER_CITY OWNER_STATE OWNER_ZIP
495 MIAMI SHORE PA
496 SEATTLE
However, a second dataset has city, state & the matching zip-codes. This one is complete without any missing values.
df_coord.head()
OWNER_ZIP CITY STATE
0 71937 Cove AR
1 72044 Edgemont AR
2 56171 Sherburn MN
I want to fill in the missing zip-codes in the first dataframe if:
Zip-code is empty
City is present
State is present
This is an all-or-nothing operation: either all three criteria are met and the zip-code gets filled, or nothing changes.
However, this is a fairly large dataset with > 50 million records so ideally I want to vectorize the operation by working column-wise.
Technically, that would fit np.where, but as far as I know, np.where only takes one condition in the following format:
df1['OWNER_ZIP'] = np.where(df["cond"] ==X, df_coord['OWNER_ZIP'], "")
How do I ensure I only fill missing zip-codes when all conditions are met?
Given ca_df:
OWNER_CITY OWNER_STATE OWNER_ZIP
0 Miami Shore Florida 111
1 Los Angeles California NaN
2 Houston NaN NaN
and df_coord:
OWNER_ZIP CITY STATE
0 111 Miami Shore Florida
1 222 Los Angeles California
2 333 Houston Texas
You can use pd.notna along with the DataFrame's index like this:
inferrable_zips_df = pd.notna(ca_df["OWNER_CITY"]) & pd.notna(ca_df["OWNER_STATE"])
is_inferrable_zip = ca_df.index.isin(df_coord[inferrable_zips_df].index)
ca_df.loc[is_inferrable_zip, "OWNER_ZIP"] = df_coord["OWNER_ZIP"]
with ca_df resulting as:
OWNER_CITY OWNER_STATE OWNER_ZIP
0 Miami Shore Florida 111
1 Los Angeles California 222
2 Houston NaN NaN
I've changed the "" to np.nan, but if you still wish to use "" then you just need to change pd.notna(ca_df[...]) to ca_df[...] != "".
You can combine multiple conditions in a single numpy.where call with the & operator. This gives you the array of row indices that satisfy all three rules:
np.where((df["OWNER_ZIP"] == X) & (df["CITY"] == Y) & (df["STATE"] == Z))
Use:
print (df_coord)
OWNER_ZIP CITY STATE
0 71937 Cove AR
1 72044 Edgemont AR
2 56171 Sherburn MN
3 123 MIAMI SHORE PA
4 789 SEATTLE AA
print (ca_df)
OWNER_ZIP OWNER_CITY OWNER_STATE
0 NaN NaN NaN
1 72044 Edgemont AR
2 56171 NaN MN
3 NaN MIAMI SHORE PA
4 NaN SEATTLE NaN
First, it is necessary to check that the matching columns have the same dtypes:
#or convert ca_df['OWNER_ZIP'] to integers
df_coord['OWNER_ZIP'] = df_coord['OWNER_ZIP'].astype(str)
print (df_coord.dtypes)
OWNER_ZIP object
CITY object
STATE object
dtype: object
print (ca_df.dtypes)
OWNER_ZIP object
OWNER_CITY object
OWNER_STATE object
dtype: object
Then filter for each combination of missing and non-missing columns, add the new data by merge, set the index to match the filtered data, and assign back:
mask1 = ca_df['OWNER_CITY'].notna() & ca_df['OWNER_STATE'].notna() & ca_df['OWNER_ZIP'].isna()
df1 = ca_df[mask1].drop('OWNER_ZIP', axis=1).merge(df_coord.rename(columns={'CITY':'OWNER_CITY','STATE':'OWNER_STATE'})).set_index(ca_df.index[mask1])
ca_df.loc[mask1, ['OWNER_ZIP','OWNER_CITY','OWNER_STATE']] = df1
mask2 = ca_df['OWNER_CITY'].notna() & ca_df['OWNER_STATE'].isna() & ca_df['OWNER_ZIP'].isna()
df2 = ca_df[mask2].drop(['OWNER_ZIP','OWNER_STATE'], axis=1).merge(df_coord.rename(columns={'CITY':'OWNER_CITY','STATE':'OWNER_STATE'})).set_index(ca_df.index[mask2])
ca_df.loc[mask2, ['OWNER_ZIP','OWNER_CITY','OWNER_STATE']] = df2
mask3 = ca_df['OWNER_CITY'].isna() & ca_df['OWNER_STATE'].notna() & ca_df['OWNER_ZIP'].notna()
df3 = ca_df[mask3].drop(['OWNER_CITY'], axis=1).merge(df_coord.rename(columns={'CITY':'OWNER_CITY','STATE':'OWNER_STATE'})).set_index(ca_df.index[mask3])
ca_df.loc[mask3, ['OWNER_ZIP','OWNER_CITY','OWNER_STATE']] = df3
print (ca_df)
OWNER_ZIP OWNER_CITY OWNER_STATE
0 NaN NaN NaN
1 72044 Edgemont AR
2 56171 Sherburn MN
3 123 MIAMI SHORE PA
4 789 SEATTLE AA
You can do a left join on these dataframes, joining on the columns 'city' and 'state'. That gives you the zip-code corresponding to a city and state whenever both values are non-null in the first dataframe (OWNER_CITY, OWNER_STATE, OWNER_ZIP), and since it is a left join, it also preserves the rows that either don't have a zip-code or have null/empty city and state values.
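A minimal sketch of that left join, assuming the column names from the question and renaming df_coord's columns to match:
merged = ca_df.merge(
    df_coord.rename(columns={'CITY': 'OWNER_CITY', 'STATE': 'OWNER_STATE'}),
    on=['OWNER_CITY', 'OWNER_STATE'],
    how='left',
    suffixes=('', '_lookup'))
# keep the existing zip where present, otherwise take the looked-up one
merged['OWNER_ZIP'] = merged['OWNER_ZIP'].fillna(merged['OWNER_ZIP_lookup'])
merged = merged.drop(columns='OWNER_ZIP_lookup')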

Pandas convert columns of one dataframe to index in another dataframe

I have some text files that are in .txt format.
I'm trying to create a .csv file with them so that the .txt files are in the index column.
I will add columns with demographic and statistical information (such as L1, Prompt, and Level) later when editing the dataframe, but I want to align the txt files in the index so that I can do some NLTK analysis.
The desired output is:
L1 Prompt Level
FileName
data1.txt Japanese P1 High
data2.txt Korean P1 High
data3.txt Chinese P1 High
data4.txt Japanese P2 Med
data5.txt Korean P2 Med
data6.txt Chinese P2 Med
data7.txt Arabic P1 High
data8.txt German P1 High
data9.txt Spanish P1 High
data10.txt Arabic P2 Med
data11.txt German P2 Med
data12.txt Spanish P2 Med
The code I tried is as follows:
df1 = pd.read_csv('data1.txt', names=['data1'])
df2 = pd.read_csv('data2.txt', names=['data2'])
df3 = pd.read_csv('data3.txt', names=['data3'])
result = pd.concat([df1, df2, df3], axis=1)
result.to_csv('mergedfile.txt', index=False)
but this, of course, creates columns:
data1.txt data2.txt data3.txt
0 XYZ GHI PQR
1 ABC JKL STU
2 DEF MNO VWX
XYZ and ABC are all sentences, such as, "One of the differences between my home country and the US is convenient stores." or "One difference is public transportation, everyone took public transportation in my home country, not so much in the US."
I have over 100,000 utterances for each txt file, so I don't want to put all of the data in the dataframe; if I can get the txt file names into the index column, that would be ideal.
Ultimately, I want to export this to .csv, and then use it for further analysis.
You can just use the columns from your dataframe as the index of a new dataframe:
df1 = pd.DataFrame({'data1': ['XYZ', 'ABC', 'DEF']})
df2 = pd.DataFrame({'data2': ['GHI', 'JKL', 'MNO']})
df3 = pd.DataFrame({'data3': ['PQR', 'STU', 'VWX']})
df = pd.concat([df1, df2, df3], axis=1)
print(df)
# data1 data2 data3
# 0 XYZ GHI PQR
# 1 ABC JKL STU
# 2 DEF MNO VWX
res = pd.DataFrame(index=[k + '.txt' for k in df],
                   columns=['L1', 'Prompt', 'Level'])
print(res)
# L1 Prompt Level
# data1.txt NaN NaN NaN
# data2.txt NaN NaN NaN
# data3.txt NaN NaN NaN
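If the goal is only to get the file names into the index without reading the contents of each file, here is a hypothetical variant using glob (the 'data*.txt' pattern and output file name are assumptions):
import glob
import pandas as pd

files = sorted(glob.glob('data*.txt'))  # note: lexical sort puts data10.txt before data2.txt
res = pd.DataFrame(index=pd.Index(files, name='FileName'),
                   columns=['L1', 'Prompt', 'Level'])
res.to_csv('mergedfile.csv')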

How to merge two pandas DataFrames based on a similarity function?

Given dataset 1
name,x,y
st. peter,1,2
big university portland,3,4
and dataset 2
name,x,y
saint peter,3,4
uni portland,5,6
The goal is to merge on
d1.merge(d2, on="name", how="left")
There are no exact matches on name though, so I'm looking to do a kind of fuzzy matching. The technique does not matter in this case; what matters is how to incorporate it efficiently into pandas.
For example, st. peter might match saint peter in the other table, but big university portland might deviate too much for us to match it with uni portland.
One way to think of it is to allow joining with the lowest Levenshtein distance, but only if it is below 5 edits (st. --> saint is 4).
The resulting dataframe should only contain the row st. peter, and contain both "name" variations, and both x and y variables.
Is there a way to do this kind of merging using pandas?
Did you look at fuzzywuzzy?
You might do something like:
import pandas as pd
import fuzzywuzzy.process as fwp

choices = list(df2.name)

def fmatch(row):
    minscore = 95  # or whatever score works for you
    # use row['name'] rather than row.name, which is the row's index label
    choice, score = fwp.extractOne(row['name'], choices)
    return choice if score > minscore else None

df1['df2_name'] = df1.apply(fmatch, axis=1)
merged = pd.merge(df1,
                  df2,
                  left_on='df2_name',
                  right_on='name',
                  suffixes=['_df1', '_df2'],
                  how='outer')  # assuming you want to keep unmatched records
Caveat Emptor: I haven't tried to run this.
Let's say you have that function which returns the best match if any, None otherwise:
def best_match(s, candidates):
    ''' Return the item in candidates that best matches s.
    Will return None if a good enough match is not found.
    '''
    # Some code here.
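For illustration, a minimal sketch of such a function built on difflib (the 0.8 cutoff is an assumed threshold, not part of the original answer):
import difflib

def best_match(s, candidates, cutoff=0.8):
    '''Return the candidate that best matches s, or None if nothing is close enough.'''
    matches = difflib.get_close_matches(s, list(candidates), n=1, cutoff=cutoff)
    return matches[0] if matches else None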
Then you can join on the values returned by it, but you can do it in different ways that would lead to different output (I think so; I did not look into this much):
(df1.assign(name=df1['name'].apply(lambda x: best_match(x, df2['name'])))
 .merge(df2, on='name', how='left'))

(df1.merge(df2.assign(name=df2['name'].apply(lambda x: best_match(x, df1['name']))),
           on='name', how='left'))
The simplest idea I can come up with now is to create a special dataframe with the distances between all names:
>>> from Levenshtein import distance
>>> df1['dummy'] = 1
>>> df2['dummy'] = 1
>>> merger = pd.merge(df1, df2, on=['dummy'], suffixes=['1','2'])[['name1','name2', 'x2', 'y2']]
>>> merger
name1 name2 x2 y2
0 st. peter saint peter 3 4
1 st. peter uni portland 5 6
2 big university portland saint peter 3 4
3 big university portland uni portland 5 6
>>> merger['res'] = merger.apply(lambda x: distance(x['name1'], x['name2']), axis=1)
>>> merger
name1 name2 x2 y2 res
0 st. peter saint peter 3 4 4
1 st. peter uni portland 5 6 9
2 big university portland saint peter 3 4 18
3 big university portland uni portland 5 6 11
>>> merger = merger[merger['res'] <= 5]
>>> merger
name1 name2 x2 y2 res
0 st. peter saint peter 3 4 4
>>> del df1['dummy']
>>> del merger['res']
>>> pd.merge(df1, merger, how='left', left_on='name', right_on='name1')
name x y name1 name2 x2 y2
0 st. peter 1 2 st. peter saint peter 3 4
1 big university portland 3 4 NaN NaN NaN NaN
