I have a dataframe with the columns state, county, and agency_name, and I want to fuzzy-match the agency name against another dataframe that has more variables about each agency. However, I only want to fuzzy-match names within the same state and county.
Dataset #1 looks like this:
State County Agency_Name
FL Broward ~name1
FL Dade name2#
MN Hennepin name11
MN Hennepin name3#
Dataset #2 has names that almost match the Agency_Names in Dataset #1
State County Agency_Name Address agency_code
FL Broward name1 address1 345
FL Dade name2 address2 654
MN Hennepin name1 address3 234
MN Hennepin name3 address4 776
I can select the best fuzzy match out of all names in the dataset using this:
from fuzzywuzzy import process
import rapidfuzz
df1['agency_match'] = df1['Agency_Name'].map(lambda x: process.extractOne(x,df2['Agency_Name'], scorer=rapidfuzz.string_metric.normalized_levenshtein)[0])
However, this match doesn't work because it matches Agency_Names from different states and counties. I need the fuzzy match to pick the best match only from within the same state and county.
What would be an elegant way to do this?
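One possible approach (not from the original question; the df1/df2 frame and column names below simply mirror the example) is to build per-(State, County) candidate lists from Dataset #2 and run extractOne against only those candidates. This sketch uses rapidfuzz.process, whose extractOne returns a (match, score, index) tuple:
from rapidfuzz import process, fuzz

# Candidate agency names from df2, keyed by (State, County)
choices = df2.groupby(['State', 'County'])['Agency_Name'].apply(list).to_dict()

def best_match(row):
    # Only consider names from the same state and county
    candidates = choices.get((row['State'], row['County']), [])
    if not candidates:
        return None
    match, score, _ = process.extractOne(row['Agency_Name'], candidates, scorer=fuzz.ratio)
    return match

df1['agency_match'] = df1.apply(best_match, axis=1)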
I have seen a number of similar questions but cannot find a straightforward solution to my issue.
I am working with a pandas dataframe containing contact information for constituent donors to a nonprofit. The data has Households and Individuals. Most Households have member Individuals, but not all Individuals are associated with a Household. There is no data that links the Household to its member Individuals, so I am attempting to match them up based on other data - Home Street Address, Phone Number, Email, etc.
A simplified version of the dataframe looks something like this:
Constituent Id Type Home Street
1234567 Household 123 Main St.
2345678 Individual 123 Main St.
3456789 Individual 123 Main St.
4567890 Individual 433 Elm Rd.
0123456 Household 433 Elm Rd.
1357924 Individual 500 Stack Ln.
1344444 Individual 500 Stack Ln.
I am using groupby in order to group the constituents. In this case, by Home Street. I'm trying to ensure that I only get groupings with more than one record (to exclude Individuals unassociated with a Household). I am using something like:
df1 = df.groupby('Home Street').filter(lambda x: len(x) > 1)
What I would like to do is somehow export the grouped dataframe to a new dataframe that includes the Household Constituent Id first, then any Individual Constituent Ids. And in the case that there is no Household in the grouping, place the Individual Constituents in the appropriate locations. The output for my data set above would look like:
Household Individual Individual
1234567 2345678 3456789
0123456 4567890
1357924 1344444
I have toyed with iterating through the groupby object, but I feel like I'm missing some easy way to accomplish my task.
This should do it
# Number repeated Types within each Home Street (Household_0, Individual_0, Individual_1, ...)
df['Type'] = df['Type'] + '_' + (df.groupby(['Home Street','Type']).cumcount().astype(str))
# Pivot so each numbered Type becomes a column holding its Constituent Id
df.pivot_table(index='Home Street', columns='Type', values='Constituent Id', aggfunc=lambda x: ' '.join(x)).reset_index(drop=True)
Output
Type Household_0 Individual_0 Individual_1
0 1234567 2345678 3456789
1 0123456 4567890 NaN
2 NaN 1357924 1344444
IIUC, we can use groupby agg(list) and some re-shaping using .join & explode
s = df.groupby(["Home Street", "Type"]).agg(list).unstack(1).reset_index(
    drop=True
).droplevel(level=0, axis=1).explode("Household")
df1 = s.join(pd.DataFrame(s["Individual"].tolist()).add_prefix("Individual_")).drop(
    "Individual", axis=1
)
print(df1.fillna(' '))
Household Individual_0 Individual_1
0 1234567 2345678 3456789
1 0123456 4567890
2 1357924 1344444
Or we can ditch the join and set Household as the index.
df1 = pd.DataFrame(s["Individual"].tolist(), index=s["Household"])\
.add_prefix("Individual_")
print(df1)
Individual_0 Individual_1
Household
1234567 2345678 3456789
0123456 4567890 None
NaN 1357924 1344444
I have a dataframe that holds addresses, which are split in multiple columns:
address postalcode city province country
-----------------------------------------------------------------
123 Fake St F1A2K3 Fakeville ON CA
I want to split the address column into two separate columns, one for house number and one for street name. Therefore, after running it, the above df would look like:
house_no street postalcode city province country
----------------------------------------------------------------------------
123 Fake St F1A2K3 Fakeville ON CA
I have been doing this by simply using df[['house_no', 'street']] = df['address'].str.split(' ', 1, expand=True), which was working fine until I noticed that some addresses under the address column are structured as Apt 316 555 Fake Drive (or Unit 316 555 Fake Drive). Therefore, when I run what I am currently using on those, I get:
house_no street postalcode city province country
---------------------------------------------------------------------------------
Apt 316 555 Fake Drive F1A2K3 Fakeville ON CA
Obviously, this is not what I want.
So essentially, I need an algorithm that splits the string after the first number, unless it starts with "Unit" or "Apt", in which case it will take the second number it sees and split that out into the house_no column.
I need to do this without losing any information, therefore keeping the Unit/Apt number as well (that can be stored in the house_no column, but ideally would have its own unit_no column). Therefore, ideally, the output would look like:
unit_no house_no street postalcode city province country
---------------------------------------------------------------------------------
Apt 316 555 Fake Drive F1A2K3 Fakeville ON CA
Given that the original address column contained Apt 316 555 Fake Drive, it is now split into unit_no, house_no, and street.
I am not sure where to start with this, so any help would be appreciated.
Let's try this data:
df = pd.DataFrame({'address':['123 Fake Street', 'Apt 316 555 Fake Drive']})
# df
# address
# 0 123 Fake Street
# 1 Apt 316 555 Fake Drive
Since you did not specify whether you want to capture the Unit/Apt number, I assume you do not:
df.address.str.extract(r'(?:(?:Unit|Apt) \d+ )?(?P<house_no>\d+) (?P<street>.*)$')
Output:
house_no street
0 123 Fake Street
1 555 Fake Drive
Only slight modification needed if you want to keep Unit/Apt:
df.address.str.extract(r'(?P<unit_no>(?:Unit|Apt) \d+ )?(?P<house_no>\d+) (?P<street>.*)$')
Output:
unit_no house_no street
0 NaN 123 Fake Street
1 Apt 316 555 Fake Drive
You can use the df.loc function; this should work:
df.loc[~df['address'].str.contains('Unit|Apt'), 'house_no'] = df['address'].str.split(' ').str[0]
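A fuller sketch of that .loc idea (not from the original answer; the address/house_no/street/unit_no column names are assumed from the question), handling the Unit/Apt rows as well:
# Rows without a Unit/Apt prefix: first token is the house number, the rest is the street
plain = ~df['address'].str.contains('Unit|Apt')
df.loc[plain, 'house_no'] = df.loc[plain, 'address'].str.split(' ').str[0]
df.loc[plain, 'street'] = df.loc[plain, 'address'].str.split(' ', n=1).str[1]

# Rows with a Unit/Apt prefix: tokens look like ['Apt', '316', '555', 'Fake', 'Drive']
parts = df.loc[~plain, 'address'].str.split(' ')
df.loc[~plain, 'unit_no'] = parts.str[:2].str.join(' ')  # 'Apt 316'
df.loc[~plain, 'house_no'] = parts.str[2]                # '555'
df.loc[~plain, 'street'] = parts.str[3:].str.join(' ')   # 'Fake Drive'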
If you always have a number followed by a space and then the street name, you could use the str.split(' ') function on the data in address.
For example, make a new column with the street name and a new column with the street number.
Create two arrays, one with the street number, by using for example
number = address.split(' ')
number[0] will always be the street number
Since some street names have spaces, join number[1:] together and that is your data for the street name column.
Sorry for the pseudocode, in a rush.
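A rough runnable version of that idea (not from the original answer; it assumes a df with an address column as in the question):
import pandas as pd

def split_address(address):
    # First token is the house number; the rest joined back together is the street name
    tokens = address.split(' ')
    return pd.Series({'house_no': tokens[0], 'street': ' '.join(tokens[1:])})

df[['house_no', 'street']] = df['address'].apply(split_address)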
I am not sure I understood the question, but if you want to eliminate the words Apt or Unit, this will do it. (Here df and df2 are read from two .xlsx files I made; df2 is just another dataframe with the columns you need, house_no and street, and with as many rows as df but with empty values.)
import pandas as pd

df = pd.read_excel('raspuns_so.xlsx')
df2 = pd.read_excel('sol.xlsx')
tmp = df['address'].str.split(' ', 1, expand=True)
for i, row_series in df2.iterrows():
    if tmp[0][i].isdigit():
        # Address starts with the house number, so a plain split works
        df2[['house_no', 'street']] = df['address'].str.split(' ', 1, expand=True)
    else:
        # Address starts with Apt/Unit: drop that word, keep its number with the house number
        var = tmp[1][i].split(' ')
        df2.at[i, 'house_no'] = " ".join(var[:2])
        df2.at[i, 'street'] = " ".join(var[2:])
print(df2)
My df:
address pc city province country
0 123 Fake ST F1A2K3 Fakeville ON CA
1 Apt 123 555 FakeST 300000 Fakeville OFF USA
My df2:
house_no street pc city province country
0 0 0 0 0 0 0
1 0 0 0 0 0 0
df2 after I ran the code:
house_no street pc city province country
0 123 Fake ST 0 0 0 0
1 123 555 FakeST 0 0 0 0
let's say I have the following pandas dataframe called example:
city state school_lvl schl_name elem_name middle_name highschoo_name
Orlando fl 1 Union Park Union Park
Orlando fl 2 Legacy Legacy
Orlando fl 3 Colonial Colonial
where columns like elem_name were generated using if conditions on school_lvl and schl_name
what I would like instead is
city state elem_name middle_name highschoo_name
Orlando fl Union Park Legacy Colonial
How would I go about doing this? It's not really a groupby since there is no aggregate function. I'd greatly appreciate any help.
Use groupby with a lambda function for forward and back filling, and then drop_duplicates by the first 2 and last 3 columns:
c = example.columns[:2].tolist() + example.columns[-3:].tolist()
print (c)
['city', 'state', 'elem_name', 'middle_name', 'highschoo_name']
df = example.groupby(['city', 'state']).apply(lambda x: x.ffill().bfill()).drop_duplicates(c)
print (df)
city state school_lvl schl_name elem_name middle_name \
0 Orlando fl 1 Union Park Union Park Legacy
highschoo_name
0 Colonial
If you want to remove those columns, it is simpler to drop them first and then remove duplicates by all columns:
example = example.drop(['school_lvl','schl_name'], axis=1)
df = example.groupby(['city', 'state']).apply(lambda x: x.ffill().bfill()).drop_duplicates()
print (df)
city state elem_name middle_name highschoo_name
0 Orlando fl Union Park Legacy Colonial
I have two pandas dataframes as given below:
df1
Name City Postal_Code State
James Phoenix 85003 AZ
John Scottsdale 85259 AZ
Jeff Phoenix 85003 AZ
Jane Scottsdale 85259 AZ
df2
Postal_Code Income Category
85003 41038 Two
85259 104631 Four
I would like to insert two columns, Income and Category, to df1 by capturing the values for Income and Category from df2 corresponding to the postal_code for each row in df1.
The closest question that I could find in SO was this - Fill DataFrame row values based on another dataframe row's values pandas. But, the pd.merge solution does not solve the problem for me. Specifically, I used
pd.merge(df1,df2,on='postal_code',how='outer')
All I got was nan values in the two new columns. Not sure whether this is because the number of rows for df1 and df2 is different. Any suggestions to solve this problem?
You just have the wrong how; use 'inner' instead, which keeps only the keys that exist in both dataframes:
# make sure the join key has the same dtype in both frames before merging
df1.Postal_Code = df1.Postal_Code.astype(int)
df2.Postal_Code = df2.Postal_Code.astype(int)
df1.merge(df2, on='Postal_Code', how='inner')
Name City Postal_Code State Income Category
0 James Phoenix 85003 AZ 41038 Two
1 Jeff Phoenix 85003 AZ 41038 Two
2 John Scottsdale 85259 AZ 104631 Four
3 Jane Scottsdale 85259 AZ 104631 Four
I have a dataframe df with two of the columns being 'city' and 'zip_code':
df = pd.DataFrame({'city': ['Cambridge','Washington','Miami','Cambridge','Miami',
'Washington'], 'zip_code': ['12345','67891','23457','','','']})
As shown above, a particular city contains the zip code in one of the rows, but the zip_code is missing for the same city in some other row. I want to fill those missing values based on the zip_code values of that city in another row. Basically, wherever there is a missing zip_code, it should check the zip_code for that city in the other rows and, if found, fill in that value. If not found, it fills in 'NA'.
How do I accomplish this task using pandas?
You can go for:
import numpy as np
df['zip_code'] = df.replace(r'', np.nan).groupby('city')['zip_code'].fillna(method='ffill').fillna(method='bfill')
>>> df
city zip_code
0 Cambridge 12345
1 Washington 67891
2 Miami 23457
3 Cambridge 12345
4 Miami 23457
5 Washington 67891
You can check the string length using str.len and, for those rows, filter the main df to those with valid zip_codes, set the index on those, and call map on the 'city' column, which will perform the lookup and fill those values:
In [255]:
df.loc[df['zip_code'].str.len() == 0, 'zip_code'] = df['city'].map(df[df['zip_code'].str.len() == 5].set_index('city')['zip_code'])
df
Out[255]:
city zip_code
0 Cambridge 12345
1 Washington 67891
2 Miami 23457
3 Cambridge 12345
4 Miami 23457
5 Washington 67891
If your real data has lots of repeating values then you'll need to additionally call drop_duplicates first:
df.loc[df['zip_code'].str.len() == 0, 'zip_code'] = df['city'].map(df[df['zip_code'].str.len() == 5].drop_duplicates(subset='city').set_index('city')['zip_code'])
The reason you need to do this is that it will raise an error if there are duplicate index entries.
My suggestion would be to first create a dictionary that maps from the city to the zip code. You can create this dictionary from the one DataFrame, and then use that dictionary to fill in all missing zip code values.
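A minimal sketch of that suggestion (not from the original answer; it assumes the df with city and zip_code columns from the question, where missing zip codes are empty strings):
# Build a city -> zip_code dictionary from the rows that do have a zip code
lookup = (df.loc[df['zip_code'] != '', ['city', 'zip_code']]
            .drop_duplicates('city')
            .set_index('city')['zip_code']
            .to_dict())

# Fill missing zip codes from the dictionary, falling back to 'NA'
df['zip_code'] = df.apply(
    lambda row: row['zip_code'] if row['zip_code'] != '' else lookup.get(row['city'], 'NA'),
    axis=1)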