I have a dataset that contains the following fields:
building guid (abcd-efgh-5678-1234, ..., etc)
street address (1256 Grant St, 500 wall st, etc)
price ($5000, $10000, etc)
Based on this, I want to add two new columns to my pandas DataFrame:
street name (wall st)
street number (500)
So far, I've been able to fetch specific instances of the word wall st as follows:
str_street = 'Wall St'
wall_st = dataset.loc[dataset['street_address'].str.lower().str.endswith(str_street.lower()), :]
wall_st['street_name'] = ???
wall_st['street_address_number'] = ???
How do I go about doing this?
I think you need extract:
df = pd.DataFrame({'street address': ['500 wall street', '123 blafoo']})
print (df)
street address
0 500 wall street
1 123 blafoo
df1 = df['street address'].str.extract(r'(?P<number>\d+)(?P<name>.*)', expand=True)
print (df1)
number name
0 500 wall street
1 123 blafoo
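Note that with this pattern the name group keeps the leading space; if that matters, a small optional cleanup (my addition, not part of the original answer) is to tighten the regex or strip afterwards:
df1 = df['street address'].str.extract(r'(?P<number>\d+)\s*(?P<name>.*)', expand=True)
# or, equivalently, clean up after the fact
df1['name'] = df1['name'].str.strip()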
Solution with split:
df[['number','name']] = df['street address'].str.split(n=1, expand=True)
print (df)
street address number name
0 500 wall street 500 wall street
1 123 blafoo 123 blafoo
df = pd.DataFrame({'street address': ['500 wall street', '123 blafoo']})
df['street address'].apply(lambda x: pd.Series(x.split(None, 1)))
will result in:
0 1
0 500 wall street
1 123 blafoo
You can then just rename the columns and pd.concat this to your original data frame.
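A minimal sketch of that last step (the column names are my choice, nothing here is prescribed by the answer):
parts = df['street address'].apply(lambda x: pd.Series(x.split(None, 1)))
parts.columns = ['number', 'name']   # rename the default 0/1 columns
df = pd.concat([df, parts], axis=1)  # attach them to the original frame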
Related
I keep running into this use case and I haven't found a good solution. I am asking for a solution in Python, but a solution in R would also be helpful.
I've been getting data that looks something like this:
import pandas as pd
data = {'Col1': ['Bob', '101', 'First Street', '', 'Sue', '102', 'Second Street', '', 'Alex' , '200', 'Third Street', '']}
df = pd.DataFrame(data)
Col1
0 Bob
1 101
2 First Street
3
4 Sue
5 102
6 Second Street
7
8 Alex
9 200
10 Third Street
11
The pattern in my real data does repeat like this. Sometimes there is a blank row (or more than one), and sometimes there are not any blank rows. The important part here is that I need to convert each group in this column into its own row.
I want the data to look like this.
Name Address Street
0 Bob 101 First Street
1 Sue 102 Second Street
2 Alex 200 Third Street
I have tried playing around with this, but nothing has worked. My thought was to iterate through a few rows at a time, assign the values to the appropriate columns, and just build the data frame row by row.
x = len(df['Col1'])
holder = pd.DataFrame()
new_df = pd.DataFrame()
while x < 4:
    temp = df.iloc[:5]
    holder['Name'] = temp['Col1'].iloc[0]
    holder['Address'] = temp['Col1'].iloc[1]
    holder['Street'] = temp['Col1'].iloc[2]
    new_df = pd.concat([new_df, holder])
    df = temp[5:]
    df.reset_index()
    holder = pd.DataFrame()
    x = len(df['Col1'])
new_df.head(10)
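For reference, a working version of that idea (this is my sketch, not from the thread, and it assumes the blank rows are only ever separators):
# drop the blank separator rows, then take the remaining values three at a time
vals = df.loc[df['Col1'].ne(''), 'Col1'].tolist()
new_df = pd.DataFrame([vals[i:i + 3] for i in range(0, len(vals), 3)],
                      columns=['Name', 'Address', 'Street'])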
In R,
data <- data.frame(
Col1 = c('Bob', '101', 'First Street', '', 'Sue', '102', 'Second Street', '', 'Alex' , '200', 'Third Street', '')
)
k <- which(grepl("Street", data$Col1))
j <- k - 1
i <- k - 2
data.frame(
  Name = data[i, ],
  Address = data[j, ],
  Street = data[k, ]
)
Name Address Street
1 Bob 101 First Street
2 Sue 102 Second Street
3 Alex 200 Third Street
Or, if the street does not always end with "Street" but the address is always a number, you can also try
j <- which(apply(data, 1, function(x) !is.na(as.numeric(x))))
i <- j-1
k <- j+1
Python3
In Python 3, you can convert your DataFrame into an array and then reshape it (note that this assumes the row count is an exact multiple of 4):
n = df.shape[0]
df2 = pd.DataFrame(
    data=df.to_numpy().reshape((n//4, 4), order='C'),
    columns=['Name', 'Address', 'Street', 'Empty'])
For your sample data, this produces:
Name Address Street Empty
0 Bob 101 First Street
1 Sue 102 Second Street
2 Alex 200 Third Street
If you like, you can remove the last column:
df2 = df2.drop(['Empty'], axis=1)
Name Address Street
0 Bob 101 First Street
1 Sue 102 Second Street
2 Alex 200 Third Street
One-liner code
df2 = pd.DataFrame(data=df.to_numpy().reshape((df.shape[0]//4, 4), order='C'), columns=['Name', 'Address', 'Street', 'Empty']).drop(['Empty'], axis=1)
Name Address Street
0 Bob 101 First Street
1 Sue 102 Second Street
2 Alex 200 Third Street
In Python, I believe this may help you.
import pandas as pd

data = {'Col1': ['Bob', '101', 'First Street', '', 'Sue', '102', 'Second Street', '', 'Alex', '200', 'Third Street', '']}

var = list(data.values())[0]
var2 = []
for aux in range(int(len(var)/4)):
    var2.append(var[aux*4: aux*4+3])
data = pd.DataFrame(var2, columns=['Name', 'Address', 'Street'])
print(data)
Another R solution, based on the tidyverse package. The example data frame data is from Park's post (https://stackoverflow.com/a/69833814/7669809).
library(tidyverse)
data2 <- data %>%
mutate(ID = cumsum(Col1 %in% "")) %>%
filter(!Col1 %in% "") %>%
group_by(ID) %>%
mutate(Type = case_when(
row_number() == 1L ~"Name",
row_number() == 2L ~"Address",
row_number() == 3L ~"Street",
TRUE ~NA_character_
)) %>%
pivot_wider(names_from = "Type", values_from = "Col1") %>%
ungroup()
data2
# # A tibble: 3 x 4
# ID Name Address Street
# <int> <chr> <chr> <chr>
# 1 0 Bob 101 First Street
# 2 1 Sue 102 Second Street
# 3 2 Alex 200 Third Street
The values of the DataFrame are reshaped into an array with 4 columns (and as many rows as needed). The first 3 columns of that array are taken out by slicing and converted into a DataFrame, and finally the column names are set with set_axis.
result = pd.DataFrame(df.values.reshape(-1, 4)[:, :-1])\
.set_axis(['Name', 'Address', 'Street'], axis=1)
result
>>>
Name Address Street
0 Bob 101 First Street
1 Sue 102 Second Street
2 Alex 200 Third Street
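The reshape-based answers above assume exactly four rows per record. A pandas analogue of the tidyverse answer (my sketch, not from the thread) tolerates extra blank rows between records, though it still needs at least one blank to separate consecutive records:
# record id: bump a counter at every blank line, then drop the blanks
rec = df['Col1'].eq('').cumsum()
tidy = df[df['Col1'].ne('')].copy()
tidy['rec'] = rec[tidy.index]
# label each row by its position within its record and pivot to wide form
tidy['field'] = tidy.groupby('rec').cumcount().map({0: 'Name', 1: 'Address', 2: 'Street'})
out = (tidy.pivot(index='rec', columns='field', values='Col1')
           .reindex(columns=['Name', 'Address', 'Street'])
           .reset_index(drop=True)
           .rename_axis(None, axis=1))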
I have this data set: df1 has 70,000 rows and df2 has ~30 rows. I want to check whether the addresses in df2 appear in df1, and if they do, I want to show the match and also pull info from df1 to create a new df3. Sometimes the address info is off by a bit, for example road = rd, street = st, etc. Here's an example:
df1 =
address unique key (and more columns)
123 nice road Uniquekey1
150 spring drive Uniquekey2
240 happy lane Uniquekey3
80 sad parkway Uniquekey4
etc
df2 =
address (and more columns)
123 nice rd
150 spring dr
240 happy lane
80 sad parkway
etc
And this is what I'd want in a new dataframe:
df3=
address(from df2) addressed matched(from df1) unique key(comes from df1) (and more columns)
123 nice rd 123 nice road Uniquekey1
150 spring dr 150 spring drive Uniquekey2
240 happy lane 240 happy lane Uniquekey3
80 sad parkway 80 sad parkway Uniquekey4
etc
Here's what I've tried so far using difflib:
df1['key'] = df1['address']
df2['key'] = df2['address']
df2['key'] = df2['key'].apply(lambda x: difflib.get_close_matches(x, df1['key'], n=1))
This returns what looks like a list (the answer is wrapped in []'s), so I then convert df2['key'] into a string using df2['key'] = df2['key'].apply(str).
Then I try to merge using df2.merge(df1, on='key'), and no addresses match.
I'm not sure what it could be, but any help would be greatly appreciated. I am also playing around with the fuzzywuzzy package.
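For what it's worth, the apply(str) step is the likely culprit: it turns a match like ['123 nice road'] into the literal string "['123 nice road']", which never equals any key in df1. A minimal fix if you stay with difflib (my sketch, not from the answers below):
import difflib

df2['key'] = df2['address'].apply(
    lambda x: next(iter(difflib.get_close_matches(x, df1['address'], n=1)), None))
df3 = df2.merge(df1, left_on='key', right_on='address', suffixes=('', '_matched'))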
My answer is similar to one of your old questions that I answered.
I slightly modified your dataframe:
>>> df1
address unique key
0 123 nice road Uniquekey1
1 150 spring drive Uniquekey2
2 240 happy lane Uniquekey3
3 80 sad parkway Uniquekey4
>>> df2 # shuffle rows
address
0 80 sad parkway
1 240 happy lane
2 150 winter dr # change the season :-)
3 123 nice rd
Use the extractOne function from fuzzywuzzy.process:
from fuzzywuzzy import process
THRESHOLD = 90
best_match = df2['address'].apply(
    lambda x: process.extractOne(x, df1['address'], score_cutoff=THRESHOLD))
The output of extractOne is:
>>> best_match
0 (80 sad parkway, 100, 3)
1 (240 happy lane, 100, 2)
2 None
3 (123 nice road, 92, 0)
Name: address, dtype: object
Now you can merge your 2 dataframes, keying each df2 row on the index of its df1 match (the None for the unmatched row propagates as NaN):
matched_idx = best_match.apply(lambda m: m[2] if m is not None else None)
df3 = pd.merge(df2, df1, left_on=matched_idx, right_index=True, how='left')
>>> df3
        address_x       address_y  unique key
0  80 sad parkway  80 sad parkway  Uniquekey4
1  240 happy lane  240 happy lane  Uniquekey3
2   150 winter dr             NaN         NaN
3     123 nice rd   123 nice road  Uniquekey1
This answer is longer, but I'll post it because maybe you can follow along better, seeing the steps as they happen.
Set up the frames:
import pandas as pd
#pip install fuzzywuzzy
#pip install python-Levenshtein
from fuzzywuzzy import fuzz, process
# matching threshold; may need altering (45-95 etc.). Higher is stricter:
# better matches, but too strict and things aren't matched. Fiddle as required.
threshold = 75
df1 = pd.DataFrame({'address': {0: '123 nice road',
                                1: '150 spring drive',
                                2: '240 happy lane',
                                3: '80 sad parkway'},
                    'unique key (and more columns)': {0: 'Uniquekey1',
                                                      1: 'Uniquekey2',
                                                      2: 'Uniquekey3',
                                                      3: 'Uniquekey4'}})
df2 = pd.DataFrame({'address': {0: '123 nice rd',
                                1: '150 spring dr',
                                2: '240 happy lane',
                                3: '80 sad parkway'},
                    'unique key (and more columns)': {0: 'Uniquekey1',
                                                      1: 'Uniquekey2',
                                                      2: 'Uniquekey3',
                                                      3: 'Uniquekey4'}})
Then the main code:
# function used for fuzzywuzzy matching
def match_addresses(add, list_add, min_score=0):
    max_score = -1
    max_add = ''
    for x in list_add:
        score = fuzz.ratio(add, x)
        if score > min_score and score > max_score:
            max_add = x
            max_score = score
    return (max_add, max_score)

# return the fuzzywuzzy score
def scoringMatches(x, s):
    o = process.extractOne(x, s, score_cutoff=threshold)
    if o is not None:
        return o[1]
# creating two lists from address column of both dataframes
df1_addresses = list(df1.address.unique())
df2_addresses = list(df2.address.unique())
# via fuzzywuzzy matching and using match_addresses() above
# return a dictionary of addresses where there is a match
names = []
for x in df1_addresses:
    match = match_addresses(x, df2_addresses, threshold)
    if match[1] >= threshold:
        name = (str(x), str(match[0]))
        names.append(name)
name_dict = dict(names)
# create new frame from fuzzywuzzy address matches dictionary
match_df = pd.DataFrame(name_dict.items(), columns=['df1_address', 'df2_address'])
# create new frame
df3 = pd.concat([df1, match_df], axis=1)
del df3['df1_address']
# shuffle the matched address column to be next to the original address of df1
c = df3.columns.tolist()
c.insert(1, c.pop(c.index('df2_address')))
df3 = df3.reindex(columns=c)
# add fuzzywuzzy scoring as a new column
df3['fuzzywuzzy_score'] = df3.apply(lambda x: scoringMatches(x['address'], df2['address']), axis=1)
print(df3)
Output:
address df2_address unique key (and more columns) fuzzywuzzy_score
0 123 nice road 123 nice rd Uniquekey1 92
1 150 spring drive 150 spring dr Uniquekey2 90
2 240 happy lane 240 happy lane Uniquekey3 100
3 80 sad parkway 80 sad parkway Uniquekey4 100
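If the default scorer over- or under-matches on your real data, extractOne also accepts an alternative scorer; a hedged variant of scoringMatches (my addition, not part of the answer above):
def scoringMatches(x, s):
    # token_sort_ratio ignores word order, which can help with shuffled address parts
    o = process.extractOne(x, s, scorer=fuzz.token_sort_ratio, score_cutoff=threshold)
    if o is not None:
        return o[1]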
I have a dataframe of addresses as below:
main_df =
address
0 3, my_street, Mumbai, Maharashtra
1 Bangalore Karnataka 45th Avenue
2 TelanganaHyderabad some_street, some apartment
And I have a dataframe with city and state as below (note that a few states have cities with the same names too):
city_state_df =
city state
0 Mumbai Maharashtra
1 Ahmednagar Maharashtra
2 Ahmednagar Bihar
3 Bangalore Karnataka
4 Hyderabad Telangana
I want to have a mapping of city and state next to each address. I am able to do so with iterrows() and nested for loops. However, they take more than an hour for a mere 15k records. What is the optimal way of achieving this, considering addresses are written free-form and multiple states have cities with the same name?
My code below:
import numpy as np
import pandas as pd

main_df = pd.DataFrame({'address': ['3, my_street, Mumbai, Maharashtra',
                                    'Bangalore Karnataka 45th Avenue',
                                    'TelanganaHyderabad some_street, some apartment']})
city_state_df = pd.DataFrame({'city': ['Mumbai', 'Ahmednagar', 'Ahmednagar', 'Bangalore', 'Hyderabad'],
                              'state': ['Maharashtra', 'Maharashtra', 'Bihar', 'Karnataka', 'Telangana']})

main_df['city'] = np.nan
main_df['state'] = np.nan
for i, df_row in main_df.iterrows():
    for j, city_row in city_state_df.iterrows():
        if city_row['city'] in df_row['address']:
            city_filtered = city_state_df[city_state_df['city'] == city_row['city']]
            for k, fil_row in city_filtered.iterrows():
                if fil_row['state'] in df_row['address']:
                    main_df.loc[i, 'city'] = fil_row['city']
                    main_df.loc[i, 'state'] = fil_row['state']
                    break
            break
Hello, maybe something like this:
main_df = main_df.reindex(columns=[*main_df.columns.tolist(), 'state', 'city'], fill_value=None)
for i, row in city_state_df.iterrows():
    main_df.loc[main_df.address.str.contains(row.city) &
                main_df.address.str.contains(row.state),
                ['city', 'state']] = [row.city, row.state]
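One caveat with this approach (my note, not the answer's): str.contains interprets its pattern as a regular expression by default, so city or state names containing regex metacharacters need escaping, or pass regex=False:
import re

for i, row in city_state_df.iterrows():
    mask = (main_df.address.str.contains(re.escape(row.city)) &
            main_df.address.str.contains(re.escape(row.state)))
    main_df.loc[mask, ['city', 'state']] = [row.city, row.state]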
I have a dataframe that holds addresses, which are split in multiple columns:
address postalcode city province country
-----------------------------------------------------------------
123 Fake St F1A2K3 Fakeville ON CA
I want to split the address column into two separate columns, one for house number and one for street name. Therefore, after running it, the above df would look like:
house_no street postalcode city province country
----------------------------------------------------------------------------
123 Fake St F1A2K3 Fakeville ON CA
I have been doing this by simply using df[['house_no', 'street']] = df['address'].str.split(' ', n=1, expand=True), which was working fine until I noticed that some addresses under the address column are structured as Apt 316 555 Fake Drive (or Unit 316 555 Fake Drive). Therefore, when I run what I am currently using on those, I get:
house_no street postalcode city province country
---------------------------------------------------------------------------------
Apt 316 555 Fake Drive F1A2K3 Fakeville ON CA
Obviously, this is not what I want.
So essentially, I need an algorithm that splits the string after the first number, unless it starts with "Unit" or "Apt", in which case it will take the second number it sees and split that out into the house_no column.
I need to do this without losing any information, keeping the Unit/Apt number as well (it could be stored in the house_no column, but ideally would have its own unit_no column). Ideally, then, the output would look like:
unit_no house_no street postalcode city province country
---------------------------------------------------------------------------------
Apt 316 555 Fake Drive F1A2K3 Fakeville ON CA
Given that the original address column contained Apt 316 555 Fake Drive, it is now split into unit_no, house_no, and street.
I am not sure where to start with this, so any help would be appreciated.
Let's try this data:
df = pd.DataFrame({'address':['123 Fake Street', 'Apt 316 555 Fake Drive']})
# df
# address
# 0 123 Fake Street
# 1 Apt 316 555 Fake Drive
Since you did not specify whether you want to capture the Unit/Apt number, I assume you do not:
df.address.str.extract(r'(?:(?:Unit|Apt) \d+ )?(?P<house_no>\d+) (?P<street>.*)$')
Output:
house_no street
0 123 Fake Street
1 555 Fake Drive
Only a slight modification is needed if you want to keep Unit/Apt:
df.address.str.extract(r'(?P<unit_no>(?:Unit|Apt) \d+ )?(?P<house_no>\d+) (?P<street>.*)$')
Output:
unit_no house_no street
0 NaN 123 Fake Street
1 Apt 316 555 Fake Drive
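To attach the extracted columns back to the original frame, one option (my sketch, not part of the answer) is:
parts = df.address.str.extract(r'(?P<unit_no>(?:Unit|Apt) \d+ )?(?P<house_no>\d+) (?P<street>.*)$')
df = pd.concat([df.drop(columns='address'), parts], axis=1)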
You can use the df.loc function; this should work.
df.loc[~df['address'].str.contains('Unit|Apt'), 'house_no'] = df['address'].str.split(' ').str[0]
If you always have a number followed by a space and then the street name, you could use the str.split(' ')
function on the data in address.
For example, make a new column with the street name, and a new column with the street number.
Create two arrays, one with the street number, for example
number = address.split(' ')
number[0] will always be the street number.
Since some street names have spaces, join number[1:] together and that is your data for the street name column.
Sorry for the pseudocode, in a rush; see the runnable sketch below.
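A runnable version of that pseudocode (my sketch; it assumes the number always comes first and uses the frame/column names from the question):
parts = df['address'].str.split(' ')
df['house_no'] = parts.str[0]               # first token is the street number
df['street'] = parts.str[1:].str.join(' ')  # rejoin the rest as the street name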
I am not sure I understood the question, but if you want to eliminate the words Apt or Unit, this will do it. Here df and df2 are two .xlsx files I made; df2 is just another dataframe with the columns you need (house_no and street) and with as many rows as df, but with empty values:
import pandas as pd

df = pd.read_excel('raspuns_so.xlsx')
df2 = pd.read_excel('sol.xlsx')
tmp = df['add'].str.split(' ', n=1, expand=True)
for i, row_series in df2.iterrows():
    if tmp[0][i].isdigit():
        df2[['house_no', 'street']] = df['add'].str.split(' ', n=1, expand=True)
    else:
        var = tmp[1][i].split(' ')
        arr = [var[0], var[1]]
        df2.at[i, 'house_no'] = " ".join(arr)
        df2.at[i, 'street'] = var[2]
print(df2)
My df:
address pc city province country
0 123 Fake ST F1A2K3 Fakeville ON CA
1 Apt 123 555 FakeST 300000 Fakeville OFF USA
My df2:
house_no street pc city province country
0 0 0 0 0 0 0
1 0 0 0 0 0 0
df2 after I ran the code:
house_no street pc city province country
0 123 Fake ST 0 0 0 0
1 123 555 FakeST 0 0 0 0
For example, if I have a home address like this:
71 Pilgrim Avenue, Chevy Chase, MD
in a column named 'address'. I would like to split it into columns 'street', 'city', 'state', respectively.
What is the best way to achieve this using Pandas ?
I have tried df[['street', 'city', 'state']] = df['address'].str.findall(r"myregex").
But the error I got is Must have equal len keys and value when setting with an iterable.
Thank you for your help :)
You can use split with the regex ,\s+ (a comma followed by one or more whitespace characters):
#borrowing sample from `Allen`
df[['street', 'city', 'state']] = df['address'].str.split(r',\s+', expand=True)
print (df)
address id street city \
0 71 Pilgrim Avenue, Chevy Chase, MD a 71 Pilgrim Avenue Chevy Chase
1 72 Main St, Chevy Chase, MD b 72 Main St Chevy Chase
state
0 MD
1 MD
And if you need to remove the address column, add drop:
df[['street', 'city', 'state']] = df['address'].str.split(r',\s+', expand=True)
df = df.drop('address', axis=1)
print (df)
id street city state
0 a 71 Pilgrim Avenue Chevy Chase MD
1 b 72 Main St Chevy Chase MD
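One caveat (my addition): if an address ever contains more than two commas, the split yields extra columns and the three-column assignment fails. Capping the number of splits guards against that:
df[['street', 'city', 'state']] = df['address'].str.split(r',\s+', n=2, expand=True)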
df = pd.DataFrame({'address': {0: '71 Pilgrim Avenue, Chevy Chase, MD',
1: '72 Main St, Chevy Chase, MD'},
'id': {0: 'a', 1: 'b'}})
#if your address format is consistent, you can simply use a split function.
df2 = df.join(pd.DataFrame(df.address.str.split(',').tolist(),
                           columns=['street', 'city', 'state']))
df2 = df2.applymap(lambda x: x.strip())  # strip the whitespace left around each cell after splitting on ','