Split columns after first number with condition - python

I have a dataframe that holds addresses, which are split in multiple columns:
address      postalcode  city       province  country
-----------------------------------------------------
123 Fake St  F1A2K3      Fakeville  ON        CA
I want to split the address column into two separate columns, one for house number and one for street name. Therefore, after running it, the above df would look like:
house_no  street   postalcode  city       province  country
------------------------------------------------------------
123       Fake St  F1A2K3      Fakeville  ON        CA
I have been doing this by simply using df[['house_no', 'street']] = df['address'].str.split(' ', n=1, expand=True), which was working fine until I noticed that some addresses in the address column are structured as Apt 316 555 Fake Drive (or Unit 316 555 Fake Drive). When I run my current code on those, I get:
house_no  street              postalcode  city       province  country
-----------------------------------------------------------------------
Apt       316 555 Fake Drive  F1A2K3      Fakeville  ON        CA
Obviously, this is not what I want.
So essentially, I need an algorithm that splits the string after the first number, unless it starts with "Unit" or "Apt", in which case it will take the second number it sees and split that out into the house_no column.
I need to do this without losing any information, keeping the Unit/Apt number as well (it could be stored in the house_no column, but ideally it would have its own unit_no column). Ideally, then, the output would look like:
unit_no  house_no  street      postalcode  city       province  country
------------------------------------------------------------------------
Apt 316  555       Fake Drive  F1A2K3      Fakeville  ON        CA
Here the original address column contained Apt 316 555 Fake Drive, now split across unit_no, house_no, and street.
I am not sure where to start with this, so any help would be appreciated.

Let's try this data:
import pandas as pd

df = pd.DataFrame({'address': ['123 Fake Street', 'Apt 316 555 Fake Drive']})
# df
#                   address
# 0         123 Fake Street
# 1  Apt 316 555 Fake Drive
Since you did not specify whether you want to capture the Unit/Apt number, I assume you do not (note the grouping: the optional prefix is the whole "Unit/Apt plus number" chunk):
df.address.str.extract(r'(?:(?:Unit|Apt) \d+ )?(?P<house_no>\d+) (?P<street>.*)$')
Output:
  house_no       street
0      123  Fake Street
1      555   Fake Drive
Only a slight modification is needed if you want to keep the Unit/Apt part:
df.address.str.extract(r'(?P<unit_no>(?:Unit|Apt) \d+ )?(?P<house_no>\d+) (?P<street>.*)$')
Output:
   unit_no house_no       street
0      NaN      123  Fake Street
1  Apt 316      555   Fake Drive
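As a quick check, the same pattern also handles a "Unit"-style address (a made-up row, not from the question):
df2 = pd.DataFrame({'address': ['Unit 12 777 Fake Blvd']})
df2.address.str.extract(r'(?P<unit_no>(?:Unit|Apt) \d+ )?(?P<house_no>\d+) (?P<street>.*)$')
#    unit_no house_no     street
# 0  Unit 12      777  Fake Blvd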

You can use the df.loc function; this should fill house_no for the addresses without a Unit/Apt prefix:
df.loc[~df['address'].str.contains('Unit|Apt'), 'house_no'] = df['address'].str.split(' ').str[0]
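That only fills house_no for the plain rows. A possible extension covering the street and the Unit/Apt rows (a sketch, not part of the original answer; the token positions are assumed from the question's examples):
import pandas as pd

df = pd.DataFrame({'address': ['123 Fake Street', 'Apt 316 555 Fake Drive']})
for col in ['unit_no', 'house_no', 'street']:  # pre-create the target columns
    df[col] = pd.NA

mask = df['address'].str.contains('Unit|Apt')

# plain addresses: first token is the house number, the rest is the street
df.loc[~mask, ['house_no', 'street']] = df.loc[~mask, 'address'].str.split(' ', n=1, expand=True).values

# Unit/Apt addresses: tokens are <Unit|Apt> <unit no> <house no> <street ...>
parts = df.loc[mask, 'address'].str.split(' ', n=3, expand=True)
df.loc[mask, 'unit_no'] = parts[0] + ' ' + parts[1]
df.loc[mask, ['house_no', 'street']] = parts[[2, 3]].values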

If you always have a number followed by a space and then the street name, you could use str.split(' ') on the data in address. For example, make a new column with the street name and a new column with the street number:
parts = address.split(' ')
parts[0] will always be the street number. Since some street names have spaces, join parts[1:] back together and that is your data for the street name column.
Sorry for the pseudocode, in a rush.
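A runnable version of that sketch (column names borrowed from the question):
import pandas as pd

df = pd.DataFrame({'address': ['123 Fake Street']})
parts = df['address'].str.split(' ')
df['house_no'] = parts.str[0]               # first token is the street number
df['street'] = parts.str[1:].str.join(' ')  # rejoin the rest as the street name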

I am not sure I understood the question, but if you want to eliminate the words Apt or Unit, this will do it. Here df and df2 come from two .xlsx files I made; df2 is just another dataframe with the columns you need (house_no and street), with as many rows as df but with empty values:
import pandas as pd

df = pd.read_excel('raspuns_so.xlsx')
df2 = pd.read_excel('sol.xlsx')
tmp = df['address'].str.split(' ', n=1, expand=True)
for i, row_series in df2.iterrows():
    if tmp[0][i].isdigit():
        # plain address: first token is the house number
        df2.at[i, 'house_no'] = tmp[0][i]
        df2.at[i, 'street'] = tmp[1][i]
    else:
        # Apt/Unit address: keep both numbers in house_no
        var = tmp[1][i].split(' ')
        df2.at[i, 'house_no'] = " ".join(var[:2])
        df2.at[i, 'street'] = " ".join(var[2:])
print(df2)
My df:
              address      pc       city province country
0         123 Fake ST  F1A2K3  Fakeville       ON      CA
1  Apt 123 555 FakeST  300000  Fakeville      OFF     USA
My df2:
   house_no  street  pc  city  province  country
0         0       0   0     0         0        0
1         0       0   0     0         0        0
df2 after I ran the code:
   house_no   street  pc  city  province  country
0       123  Fake ST   0     0         0        0
1   123 555   FakeST   0     0         0        0

Related

How to do fuzzy matching on nested subsets of a dataframe?

I have a dataframe with columns state, county, and agency_name, and I want to fuzzy match the agency name against another dataframe that has more variables about agencies. But I want to fuzzy match only names within the same state and county.
Dataset #1 looks like this:
State  County    Agency_Name
FL     Broward   ~name1
FL     Dade      name2#
MN     Hennepin  name11
MN     Hennepin  name3#
Dataset #2 has names that almost match the Agency_Names in Dataset #1
State  County    Agency_Name  Address   agency_code
FL     Broward   name1        address1  345
FL     Dade      name2        address2  654
MN     Hennepin  name1        address3  234
MN     Hennepin  name3        address4  776
I can select the best fuzzy match out of all names in the dataset using this:
from fuzzywuzzy import process
import rapidfuzz
df1['agency_match'] = df1['Agency_Name'].map(lambda x: process.extractOne(x,df2['Agency_Name'], scorer=rapidfuzz.string_metric.normalized_levenshtein)[0])
However, this match doesn't work because it matches Agency_Names from different states and counties. I need the fuzzy match to pick the best match only from within the same state and county.
What would be an elegant way to do this?
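No answer was posted for this one; one possible sketch (the helper name is mine, and I swap in rapidfuzz's fuzz.ratio as the scorer) is to do the lookup row by row, restricting the candidate pool to the same State and County. Note that rapidfuzz's process.extractOne returns a (match, score, index) tuple:
from rapidfuzz import fuzz, process

def match_within_group(row, choices):
    # keep only candidate names from the same state and county
    pool = choices.loc[(choices['State'] == row['State'])
                       & (choices['County'] == row['County']), 'Agency_Name']
    if pool.empty:
        return None
    return process.extractOne(row['Agency_Name'], pool, scorer=fuzz.ratio)[0]

df1['agency_match'] = df1.apply(match_within_group, axis=1, choices=df2)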

Pandas Split 1 Column into Multiple Columns where Delimited Split size Can Vary

I have some address data like:
Address
Buffalo, NY, 14201
Stackoverflow Street, New York, NY, 99999
I'd like to split these into columns like:
Street                City      State  Zip
NaN                   Buffalo   NY     14201
Stackoverflow Street  New York  NY     99999
Essentially, I'd like to shift my strings over by one in each column in the result.
With Pandas I know I can split columns like:
import pandas as pd

df = pd.DataFrame(
    data={'Address': ['Buffalo, NY, 14201', 'Stackoverflow Street, New York, NY, 99999']}
)
df[['Street', 'City', 'State', 'Zip']] = (
    df['Address']
    .str.split(',', expand=True)
    .applymap(lambda col: col.strip() if col else col)
)
but need to figure out how to conditionally shift columns when my result is only 3 columns.
First, create a function that reverses the split for each row. If you split normally, the missing piece (NaN) ends up in the last column; reversing the token order before building the Series puts the NaN at the end of the reversed list, which is exactly the Street slot. Then apply it to all rows, rename the columns (they will be integers), and finally put them back in the right order.
fn = lambda x: pd.Series([i.strip() for i in reversed(x.split(','))])
pad = df['Address'].apply(fn)
pad looks like this right now,
       0   1         2                     3
0  14201  NY   Buffalo                   NaN
1  99999  NY  New York  Stackoverflow Street
Just need to rename the columns and flip the order back.
pad.rename(columns={0:'Zip',1:'State',2:'City',3:'Street'},inplace=True)
df = pad[['Street','City','State','Zip']]
Output:
                 Street      City State    Zip
0                   NaN   Buffalo    NY  14201
1  Stackoverflow Street  New York    NY  99999
Use a bit of numpy magic to reorder the columns with None on the left:
import numpy as np

df2 = df['Address'].str.split(',', expand=True)
df[['Street', 'City', 'State', 'Zip']] = df2.to_numpy()[np.arange(len(df))[:, None], np.argsort(df2.notna())]
Output:
                                     Address                Street      City State    Zip
0                         Buffalo, NY, 14201                  None   Buffalo    NY  14201
1  Stackoverflow Street, New York, NY, 99999  Stackoverflow Street  New York    NY  99999
Another idea: prepend as many commas as needed so that every row has n-1 commas (here 3) before splitting. For example, 'Buffalo, NY, 14201' has two commas, so one comma is prepended, giving ',Buffalo, NY, 14201', which then splits into four fields:
df[['Street', 'City', 'State', 'Zip']] = (
    df['Address'].str.count(',')
    .rsub(4-1).map(lambda x: ','*x)
    .add(df['Address'])
    .str.split(',', expand=True)
)
Output:
                                     Address                Street      City State    Zip
0                         Buffalo, NY, 14201                         Buffalo    NY  14201
1  Stackoverflow Street, New York, NY, 99999  Stackoverflow Street  New York    NY  99999
Well, I found a solution, but I am not sure if there is something more performant out there. Open to other ideas.
def split_shift(s: str) -> list[str]:
    split_str: list[str] = s.split(',')
    # If the split has only 3 items, shift things over by inserting an NA in front
    if len(split_str) == 3:
        split_str.insert(0, pd.NA)
    return split_str

df[['Street', 'City', 'State', 'Zip']] = pd.DataFrame(df['Address'].apply(lambda x: split_shift(x)).tolist())

Pandas groupby results - move some grouped column values into rows of new data frame

I have seen a number of similar questions but cannot find a straightforward solution to my issue.
I am working with a pandas dataframe containing contact information for constituent donors to a nonprofit. The data has Households and Individuals. Most Households have member Individuals, but not all Individuals are associated with a Household. There is no data that links a Household to its member Individuals, so I am attempting to match them up based on other data: Home Street Address, Phone Number, Email, etc.
A simplified version of the dataframe looks something like this:
Constituent Id  Type        Home Street
1234567         Household   123 Main St.
2345678         Individual  123 Main St.
3456789         Individual  123 Main St.
4567890         Individual  433 Elm Rd.
0123456         Household   433 Elm Rd.
1357924         Individual  500 Stack Ln.
1344444         Individual  500 Stack Ln.
I am using groupby to group the constituents, in this case by Home Street, and trying to ensure that I only get groupings with more than one record (to exclude Individuals unassociated with a Household). I am using something like:
df1 = df.groupby('Home Street').filter(lambda x: len(x) > 1)
What I would like to do is somehow export the grouped dataframe to a new dataframe that includes the Household Constituent Id first, then any Individual Constituent Ids. And in the case that there is no Household in the grouping, place the Individual Constituents in the appropriate locations. The output for my data set above would look like:
Household  Individual  Individual
1234567    2345678     3456789
0123456    4567890
           1357924     1344444
I have toyed with iterating through the groupby object, but I feel like I'm missing some easy way to accomplish my task.
This should do it:
df['Type'] = df['Type'] + '_' + df.groupby(['Home Street', 'Type']).cumcount().astype(str)
df.pivot_table(index='Home Street', columns='Type', values='Constituent Id',
               aggfunc=lambda x: ' '.join(x)).reset_index(drop=True)
Output
Type  Household_0 Individual_0 Individual_1
0         1234567      2345678      3456789
1         0123456      4567890          NaN
2             NaN      1357924      1344444
IIUC, we can use groupby agg(list) and some re-shaping using .join & explode:
s = df.groupby(["Home Street", "Type"]).agg(list).unstack(1).reset_index(
    drop=True
).droplevel(level=0, axis=1).explode("Household")

df1 = s.join(pd.DataFrame(s["Individual"].tolist()).add_prefix("Individual_")).drop(
    "Individual", axis=1
)

print(df1.fillna(' '))

  Household Individual_0 Individual_1
0   1234567      2345678      3456789
1   0123456      4567890
2             1357924      1344444
Or we can ditch the join and cast Household to the index:
df1 = pd.DataFrame(s["Individual"].tolist(), index=s["Household"]).add_prefix("Individual_")
print(df1)

          Individual_0 Individual_1
Household
1234567        2345678      3456789
0123456        4567890         None
NaN            1357924      1344444

Grouping by street address and splitting it into street name and number

I have a dataset that contains the following fields:
building guid (abcd-efgh-5678-1234, ..., etc)
street address (1256 Grant St, 500 wall st, etc)
price ($5000, $10000, etc)
Based on this, I want to add two new columns to my pandas DataFrame:
street name (wall st)
street number (500)
Until now, I've been able to fetch specific instances of 'Wall St' as follows:
str_street = 'Wall St'
wall_st = dataset.loc[dataset['street_address'].str.lower().str.endswith(str_street.lower()), :]
wall_st['street_name'] = ???
wall_st['street_address_number'] = ???
How do I go about doing this?
I think you need extract:
df = pd.DataFrame({'street address': ['500 wall street', '123 blafoo']})
print(df)

    street address
0  500 wall street
1       123 blafoo

df1 = df['street address'].str.extract(r'(?P<number>\d+)(?P<name>.*)', expand=True)
print(df1)

  number          name
0    500   wall street
1    123        blafoo
Solution with split:
df[['number', 'name']] = df['street address'].str.split(n=1, expand=True)
print(df)

    street address number         name
0  500 wall street    500  wall street
1       123 blafoo    123       blafoo
df = pd.DataFrame({'street address': ['500 wall street', '123 blafoo']})
df['street address'].apply(lambda x: pd.Series(x.split(None, 1)))
will result in:

     0            1
0  500  wall street
1  123       blafoo

You can then just rename the columns and pd.concat this to your original data frame.
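For instance, a small sketch of those last two steps:
split_cols = df['street address'].apply(lambda x: pd.Series(x.split(None, 1)))
split_cols.columns = ['number', 'name']   # rename the integer columns
df = pd.concat([df, split_cols], axis=1)  # attach to the original frame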

Update missing values in a column using pandas

I have a dataframe df with two of the columns being 'city' and 'zip_code':
df = pd.DataFrame({'city': ['Cambridge', 'Washington', 'Miami', 'Cambridge', 'Miami', 'Washington'],
                   'zip_code': ['12345', '67891', '23457', '', '', '']})
As shown above, a particular city has a zip code in one row, but the zip_code is missing for the same city in some other rows. I want to fill those missing values based on the zip_code value of that city in another row: wherever a zip_code is missing, look up the zip_code for that city in the other rows and, if found, fill it in; if not found, fill in 'NA'.
How do I accomplish this task using pandas?
You can go for:
import numpy as np

df['zip_code'] = (df['zip_code'].replace('', np.nan)
                  .groupby(df['city'])
                  .transform(lambda s: s.ffill().bfill()))
>>> df
         city zip_code
0   Cambridge    12345
1  Washington    67891
2       Miami    23457
3   Cambridge    12345
4       Miami    23457
5  Washington    67891
You can check the string length using str.len; for those rows, filter the main df to those with valid zip codes, set the index to the city, and call map on the 'city' column, which will perform the lookup and fill in those values:
In [255]:
df.loc[df['zip_code'].str.len() == 0, 'zip_code'] = df['city'].map(
    df[df['zip_code'].str.len() == 5].set_index('city')['zip_code'])
df

Out[255]:
         city zip_code
0   Cambridge    12345
1  Washington    67891
2       Miami    23457
3   Cambridge    12345
4       Miami    23457
5  Washington    67891
If your real data has lots of repeating values, you'll additionally need to call drop_duplicates first:
df.loc[df['zip_code'].str.len() == 0, 'zip_code'] = df['city'].map(
    df[df['zip_code'].str.len() == 5].drop_duplicates(subset='city').set_index('city')['zip_code'])
The reason you need to do this is that the lookup will raise an error if there are duplicate index entries.
My suggestion would be to first create a dictionary that maps from the city to the zip code; you can build it from the one DataFrame. Then use that dictionary to fill in all missing zip code values.
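A minimal sketch of that idea, using the df from the question (empty strings mark the missing zip codes):
zip_by_city = (df.loc[df['zip_code'] != '']   # rows that actually carry a zip code
                 .set_index('city')['zip_code']
                 .to_dict())                  # e.g. {'Cambridge': '12345', ...}
df['zip_code'] = df['city'].map(zip_by_city).fillna('NA')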
