I have a list of addresses that I would like to put into a dataframe where each row is a new address and the columns are the units of the address (title, street, city).
However, the way the list is structured, some addresses are longer than others. For example:
address = ['123 Some Street, City','45 Another Place, PO Box 123, City']
I have a pandas dataframe with the following columns:
Index  Court    Address                            Zipcode  Phone
0      Court 1  123 Court Dr, Springfield          12345    11111
1      Court 2  45 Court Pl, PO Box 45, Pawnee     54321    11111
2      Court 3  1725 Slough Ave, Scranton          18503    11111
3      Court 4  101 Court Ter, Unit 321, Eagleton  54322    11111
I would like to split the Address column into up to three columns depending on how many comma separators there are in the address, with NaN filling in where values will be missing.
For example, I hope the data will look like this:
Index  Court    Address          Address2   City         Zip  Phone
0      Court 1  123 Court Dr     NaN        Springfield  ...  ...
1      Court 2  45 Court Pl      PO Box 45  Pawnee       ...  ...
2      Court 3  1725 Slough Ave  NaN        Scranton     ...  ...
3      Court 4  101 Court Ter    Unit 321   Eagleton     ...  ...
I have plowed through and tried a ton of different solutions on StackOverflow to no avail. The closest I have gotten is with this code:
df2 = pd.concat([df, df['Address'].str.split(', ', expand=True)], axis=1)
But that returns a dataframe that adds the following three columns to the end structured as such:
...  0             1            2
...  123 Court Dr  Springfield  None
...  45 Court Pl   PO Box 45    Pawnee
This is close, but as you can see, for the shorter entries, the city lines up with the second address line for the longer entries.
Ideally, column 2 should populate every single row with a city, and column 1 should alternate between "None" and the second address line if applicable.
I hope this makes sense -- this is a tough one to put into words. Thanks!
You could do something like this:
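# first piece before any comma
df['Address1'] = df['Address'].str.split(',').str[0].str.strip()
# whatever sits between the first and last comma (NaN when there is only one comma)
df['Address2'] = df['Address'].str.extract(r',(.*),', expand=False).str.strip()
# last piece after the final comma
df['City'] = df['Address'].str.split(',').str[-1].str.strip()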
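Another way to keep the city in its own column, sketched here under two assumptions: the city is always the text after the last comma, and there is at most one segment between the street and the city. The idea is to split once from the right to peel off the city, then split the remainder once from the left. The small frame below is a cut-down stand-in for the one in the question.

```python
import pandas as pd

# Cut-down stand-in for the question's dataframe
df = pd.DataFrame({
    "Court": ["Court 1", "Court 2"],
    "Address": ["123 Court Dr, Springfield", "45 Court Pl, PO Box 45, Pawnee"],
})

# Split once from the right: column 1 is always the city
head_city = df["Address"].str.rsplit(",", n=1, expand=True)
df["City"] = head_city[1].str.strip()

# Split the remainder once from the left: street, then optional second line.
# Note: column 1 only exists if at least one address has a second line.
parts = head_city[0].str.split(",", n=1, expand=True)
df["Address1"] = parts[0].str.strip()
df["Address2"] = parts[1].str.strip()

print(df[["Address1", "Address2", "City"]])
```

Rows without a second address line end up with a missing value in Address2, which is what the question asks for.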
Addresses, especially those produced by human input, can be tricky. But if your addresses only fit those two formats, this will work:
Note: If there is an additional format you have to account for, this will print the culprit.
def split_address(df):
    for index, row in df.iterrows():
        full_address = row['Address']
        n_commas = full_address.count(',')
        if n_commas == 2:
            # street, second address line, city
            split = full_address.split(',')
            df.loc[index, 'address_1'] = split[0].strip()
            df.loc[index, 'address_2'] = split[1].strip()
            df.loc[index, 'city'] = split[2].strip()
        elif n_commas == 1:
            # street, city
            split = full_address.split(',')
            df.loc[index, 'address_1'] = split[0].strip()
            df.loc[index, 'city'] = split[1].strip()
        else:
            print("address does not fit known formats {0}".format(full_address))
    return df
Essentially, the two things that should help you are str.count(), which tells you the number of commas in a string, and str.split(), which you already found and which splits the input into a list. You can index into that list to assign each piece to the correct column.
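The same comma-counting check can be done on the whole column at once. This is only a sketch with made-up sample strings; it uses the pandas string accessor to count commas and list any address that does not have one or two of them.

```python
import pandas as pd

# Made-up sample; the last entry deliberately fits neither format
addresses = pd.Series([
    "123 Court Dr, Springfield",
    "45 Court Pl, PO Box 45, Pawnee",
    "No commas here",
])

# Number of commas per address
comma_counts = addresses.str.count(",")

# Addresses that have neither one nor two commas
unexpected = addresses[~comma_counts.isin([1, 2])]
print(unexpected)
```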
You can look into creating a function using the package usaddress. It has been very helpful for me when I need to split address into parts:
import pandas as pd
import usaddress
df = pd.DataFrame(['123 Main St. Suite 100 Chicago, IL', '123 Main St. PO Box 100 Chicago, IL'], columns=['Address'])
Then create functions for how you want to split the data:
def Address1(x):
    try:
        data = usaddress.tag(x)
        if 'AddressNumber' in data[0].keys() and 'StreetName' in data[0].keys() and 'StreetNamePostType' in data[0].keys():
            return data[0]['AddressNumber'] + ' ' + data[0]['StreetName'] + ' ' + data[0]['StreetNamePostType']
    except Exception:
        # the address could not be tagged; fall through and return None
        pass
def Address2(x):
    try:
        data = usaddress.tag(x)
        if 'OccupancyType' in data[0].keys() and 'OccupancyIdentifier' in data[0].keys():
            return data[0]['OccupancyType'] + ' ' + data[0]['OccupancyIdentifier']
        elif 'USPSBoxType' in data[0].keys() and 'USPSBoxID' in data[0].keys():
            return data[0]['USPSBoxType'] + ' ' + data[0]['USPSBoxID']
    except Exception:
        pass
def PlaceName(x):
    try:
        data = usaddress.tag(x)
        if 'PlaceName' in data[0].keys():
            return data[0]['PlaceName']
    except Exception:
        pass
df['Address1'] = df.apply(lambda x: Address1(x['Address']), axis=1)
df['Address2'] = df.apply(lambda x: Address2(x['Address']), axis=1)
df['City'] = df.apply(lambda x: PlaceName(x['Address']), axis=1)
out:
                               Address      Address1    Address2     City
0   123 Main St. Suite 100 Chicago, IL  123 Main St.   Suite 100  Chicago
1  123 Main St. PO Box 100 Chicago, IL  123 Main St.  PO Box 100  Chicago
I have a dataframe that looks like this:
| Premise | Thoroughfare | Locality | PostalCode | Country | FullAddress |
|  | Yew Tree Lane | Holmbridge | HD9 2NR | N Ireland | Old Thorn, Yew Tree Lane, Holmbridge HD9 2NR, N Ireland |
| 3 | Cysgod Y Castell | Llandudno Junction | LL31 9LJ | Uk | 3 Cysgod Y Castell, Llandudno Junction LL31 9LJ |
| 1168 | Christchurch Road | Bournemouth | BH7 6DY | Wales UK | 1168 Christchurch Road, BH7 6DY Bournemouth |
I want to create another column or dataframe that looks like this:
| FullAddress | FullAdressWithTag |
| Old Thorn, Yew Tree Lane, Holmbridge HD9 2NR, N Ireland | Old^Others Thorn^Others, Yew^Thoroughfare Tree^Thoroughfare Lane^Thoroughfare, Holmbridge^Locality HD9^PostalCode 2NR^PostalCode, N^Country Ireland^Country |
| 3 Cysgod Y Castell, Llandudno Junction LL31 9LJ | 3^Premise Cysgod^Thoroughfare Y^Thoroughfare Castell^Thoroughfare, Llandudno^Locality Junction^Locality LL31^PostalCode 9LJ^PostalCode |
| 1168 Christchurch Road, BH7 6DY Bournemouth | 1168^Premise Christchurch^Thoroughfare Road^Thoroughfare, BH7^PostalCode 6DY^PostalCode Bournemouth^Locality |
I am trying to build the FullAdressWithTag column by tagging each word of FullAddress with the name of the column (Locality, Premise, PostalCode, etc.) that contains it. Note that the order of the elements in FullAddress may vary.
For example, it can be:
Premise -> Thoroughfare -> Postalcode -> Locality
Premise -> Thoroughfare -> Locality -> PostalCode
Thoroughfare -> Premise -> Postalcode -> Locality
An element can appear in a different position depending on how the FullAddress was given. If an element in FullAddress doesn't match any column, it is tagged as "Others".
I have a million records to map this way.
Here is a long and bulky code that could get your job done:
import pandas as pd

def getDFwithFullTag(df1):
    cols_to_check = ['Premise', 'Thoroughfare', 'Locality', 'PostalCode', 'Country']

    def checkAddr(row):
        tagged = []
        for st in row['FullAddress'].split():
            # set a trailing comma aside so that 'Lane,' can match 'Lane'
            comma = ',' if st.endswith(',') else ''
            word = st.rstrip(',')
            tag = 'Others'  # used when no column contains the word
            for col in cols_to_check:
                # str() guards against numeric values such as Premise = 3
                if word in str(row[col]).split(' '):
                    tag = col
                    break
            tagged.append(word + '^' + tag + comma)
        return ' '.join(tagged)

    ftag = [checkAddr(row) for _, row in df1.iterrows()]
    df_new = pd.DataFrame(data=df1.FullAddress)
    df_new.insert(1, 'FullAdressWithTag', ftag, True)
    return df_new
Apart from being large and ugly, this function would take around 20-30 minutes to process 1 million rows. This function takes a dataframe as input and outputs a dataframe of your desired format.
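A possible variation, offered only as a sketch: build a word-to-column lookup once per row and iterate over plain Python values with zip instead of indexing the dataframe row by row. Whether this is actually faster on a million rows has not been measured here. The names TAG_COLS and tag_address are hypothetical, and the sample rows are copied from the question.

```python
import pandas as pd

TAG_COLS = ["Premise", "Thoroughfare", "Locality", "PostalCode", "Country"]

def tag_address(full_address, *fields):
    # Map each word found in a tag column to that column (first column wins)
    lookup = {}
    for col, value in zip(TAG_COLS, fields):
        for word in str(value).split():
            lookup.setdefault(word, col)
    tagged = []
    for token in full_address.split():
        # Keep a trailing comma outside the tag, as in the expected output
        comma = "," if token.endswith(",") else ""
        word = token.rstrip(",")
        tagged.append(f"{word}^{lookup.get(word, 'Others')}{comma}")
    return " ".join(tagged)

df = pd.DataFrame({
    "Premise": ["", "3"],
    "Thoroughfare": ["Yew Tree Lane", "Cysgod Y Castell"],
    "Locality": ["Holmbridge", "Llandudno Junction"],
    "PostalCode": ["HD9 2NR", "LL31 9LJ"],
    "Country": ["N Ireland", "Uk"],
    "FullAddress": [
        "Old Thorn, Yew Tree Lane, Holmbridge HD9 2NR, N Ireland",
        "3 Cysgod Y Castell, Llandudno Junction LL31 9LJ",
    ],
})

# Iterate over column values directly rather than over dataframe rows
df["FullAdressWithTag"] = [
    tag_address(full, *fields)
    for full, *fields in zip(df["FullAddress"], *(df[c] for c in TAG_COLS))
]
print(df["FullAdressWithTag"].tolist())
```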
I need to make a dataframe from two txt files.
Each line of the first txt file looks like this: Street_name space id.
Each line of the second txt file looks like this: City_name space id.
Example:
text file 1:
Roseberry st 1234
Brooklyn st 4321
Wolseley 1234567
text file 2:
Winnipeg 4321
Winnipeg 1234
Ste Anne 1234567
I need to make one dataframe out of this. Sometimes there is just one word for Street_name, and sometimes more. The same goes for City_name.
I get an error: ParserError: Error tokenizing data. C error: Expected 2 fields in line 5, saw 3. This happens because I'm trying to put both words of the street name into the same column but don't know how to do it. I want one column for street name (no matter whether it consists of one or more words), one for city name, and one for id.
I want a df with 3 rows and 3 cols.
Thanks!
Edit: both text files are huge (50 million+ rows each), so I need this code not to break and to be optimised for large files.
These files are NOT valid CSV, so you may need to parse them yourself.
You can use the normal open() and read(), then split on newlines to create a list of lines. After that, use a for-loop with line.rsplit(" ", 1) to split each line on its last space.
Minimal working example:
I use io to simulate a file in memory, so everyone can simply copy and test it, but you should use open().
text = '''Roseberry st 1234
Brooklyn st 4321
Wolseley 1234567'''
import io
#with open('filename') as fh:
with io.StringIO(text) as fh:
    lines = fh.read().splitlines()
print(lines)
# split each line on its last space: [name, number]
lines = [line.rsplit(" ", 1) for line in lines]
print(lines)
import pandas as pd
df = pd.DataFrame(lines, columns=['name', 'number'])
print(df)
Result:
['Roseberry st 1234', 'Brooklyn st 4321', 'Wolseley 1234567']
[['Roseberry st', '1234'], ['Brooklyn st', '4321'], ['Wolseley', '1234567']]
name number
0 Roseberry st 1234
1 Brooklyn st 4321
2 Wolseley 1234567
EDIT:
read_csv can take a regex as the separator (e.g. sep="\s+" for runs of whitespace), and it can even use lookahead/lookbehind ((?=...)/(?<=...)) to check whether a digit follows the space without consuming it as part of the separator.
text = '''Roseberry st 1234
Brooklyn st 4321
Wolseley 1234567'''
import io
import pandas as pd
#df = pd.read_csv('filename', names=['name', 'number'], sep=r'\s(?=\d)', engine='python')
df = pd.read_csv(io.StringIO(text), names=['name', 'number'], sep=r'\s(?=\d)', engine='python')
print(df)
Result:
name number
0 Roseberry st 1234
1 Brooklyn st 4321
2 Wolseley 1234567
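One caveat with the lookahead, shown with plain re.split on a made-up line: a space followed by any digit counts as a separator, so a street name that itself contains a word starting with a digit would be split too. Anchoring the lookahead to the end of the line restricts the split to the space before the trailing id. The stricter pattern is a suggestion and has not been tested inside read_csv here.

```python
import re

loose = r"\s(?=\d)"     # the separator used above
strict = r"\s(?=\d+$)"  # only the space right before the trailing id

line = "Main 4th st 1234"  # made-up line with a digit inside the name

loose_fields = re.split(loose, line)
strict_fields = re.split(strict, line)

print(loose_fields)   # three fields: the name is split at "4th"
print(strict_fields)  # two fields: name and id
```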
Later you can combine both dataframes using .join() or .merge() with the on= parameter (or something similar), like a join in an SQL query.
text1 = '''Roseberry st 1234
Brooklyn st 4321
Wolseley 1234567'''
text2 = '''Winnipeg 4321
Winnipeg 1234
Ste Anne 1234567'''
import io
import pandas as pd
df1 = pd.read_csv(io.StringIO(text1), names=['street name', 'id'], sep=r'\s(?=\d)', engine='python')
df2 = pd.read_csv(io.StringIO(text2), names=['city name', 'id'], sep=r'\s(?=\d)', engine='python')
print(df1)
print(df2)
df = df1.merge(df2, on='id')
print(df)
Result:
street name id
0 Roseberry st 1234
1 Brooklyn st 4321
2 Wolseley 1234567
city name id
0 Winnipeg 4321
1 Winnipeg 1234
2 Ste Anne 1234567
street name id city name
0 Roseberry st 1234 Winnipeg
1 Brooklyn st 4321 Winnipeg
2 Wolseley 1234567 Ste Anne
Pandas doc: Merge, join, concatenate and compare
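With files this large it may be worth checking that every id really has a partner. A minimal sketch, assuming ids are unique within each file: an outer merge with indicator=True adds a _merge column saying where each row came from, and validate="one_to_one" raises an error if an id is duplicated on either side. One city row is left out below on purpose to show an unmatched id.

```python
import pandas as pd

df1 = pd.DataFrame({
    "street name": ["Roseberry st", "Brooklyn st", "Wolseley"],
    "id": [1234, 4321, 1234567],
})
# The row for id 1234567 is deliberately missing here
df2 = pd.DataFrame({
    "city name": ["Winnipeg", "Winnipeg"],
    "id": [4321, 1234],
})

checked = df1.merge(df2, on="id", how="outer",
                    indicator=True, validate="one_to_one")

# Rows present in only one of the two inputs
unmatched = checked[checked["_merge"] != "both"]
print(unmatched)
```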
There's nothing that I'm aware of in pandas that does this automatically.
Below, I built a script that will merge those addresses (addy + st) into a single column, then merges the two data frames into one based on the "id".
I assume your actual text files are significantly larger, so assuming they follow the pattern set in the two examples, this script should work fine.
Basically, this code turns each line of text in the file into a list, then combines lists of length 3 into length 2 by combining the first two list items.
After that, it turns the "list of lists" into a dataframe and merges those dataframes on column "id".
Couple caveats:
Make sure you set the correct text file paths
Make sure the first line of the text files contains 2, single string column headers (ie: "address id") or (ie: "city id")
Make sure each text file id column header is named "id"
import pandas as pd
import numpy as np
# set both text file paths (you may need full path i.e. C:\Users\Name\bla\bla\bla\text1.txt)
text_path_1 = r'text1.txt'
text_path_2 = r'text2.txt'
# declares first text file
with open(text_path_1) as f1:
    text_file_1 = f1.readlines()
# declares second text file
with open(text_path_2) as f2:
    text_file_2 = f2.readlines()
# function that massages data into two columns (to put "st" into same column as address name)
def data_massager(text_file_lines):
    data_list = []
    for item in text_file_lines:
        stripped_item = item.strip('\n')
        split_stripped_item = stripped_item.split(' ')
        if len(split_stripped_item) == 3:
            split_stripped_item[0:2] = [' '.join(split_stripped_item[0 : 2])]
        data_list.append(split_stripped_item)
    return data_list
# runs function on both text files
data_list_1 = data_massager(text_file_1)
data_list_2 = data_massager(text_file_2)
# creates dataframes on both text files
df1 = pd.DataFrame(data_list_1[1:], columns = data_list_1[0])
df2 = pd.DataFrame(data_list_2[1:], columns = data_list_2[0])
# merges data based on id (make sure both text files' id is named "id")
merged_df = df1.merge(df2, how='left', on='id')
# prints dataframe (assuming you're using something like jupyter-lab)
merged_df
pandas has strong support for strings. You can make the lines of each file into a Series and then use a regular expression to separate the fields into separate columns. I assume that "id" is the common value that links the two datasets, so it can become the dataframe index and the columns can just be added together.
import pandas as pd
street_series = pd.Series([line.strip() for line in open("text1.txt")])
street_df = street_series.str.extract(r"(.*?) (\d+)$")
del street_series
street_df.rename({0:"street", 1:"id"}, axis=1, inplace=True)
street_df.set_index("id", inplace=True)
print(street_df)
city_series = pd.Series([line.strip() for line in open("text2.txt")])
city_df = city_series.str.extract(r"(.*?) (\d+)$")
del city_series
city_df.rename({0:"city", 1:"id"}, axis=1, inplace=True)
city_df.set_index("id", inplace=True)
print(city_df)
street_df["city"] = city_df["city"]
print(street_df)
I've got a df that's been merged, and I want to apply some logic to it so that I capture issues from the data sources.
I want to capture both the situation where the Areacodes match but the T's do not
AND the one where neither the Areacodes nor the T's match.
Here's a merged_df before the filter.
Name t_1 Areacode_1 t_2 Areacode_2
Jerry New Jersey 12674 Texas 12674
Elaine New York 98765 Alaska 78654
George New York 12345 New York 12345
Is there a way to do this all in one filter? This is what I have so far, but it would be nice to put it as one line:
m = merged_df.loc[(merged_df['t_1'] != merged_df['t_2']) & (merged_df['Areacode_1'] == merged_df['Areacode_2']) ]
m2 = merged_df.loc[(merged_df['t_1'] != merged_df['t_2']) & (merged_df['Areacode_1'] != merged_df['Areacode_2']) ]
After the filter I'd expect George to be removed because all columns matched.
Expected merged_df:
Name t_1 Areacode_1 t_2 Areacode_2
Jerry New Jersey 12674 Texas 12674
Elaine New York 98765 Alaska 78654
You could do it like this:
import pandas as pd
merged_df = pd.DataFrame({'Name':['Jerry','Elaine','George'],
't_1':['New Jersey', 'New York','New York'],
'Areacode_1': [12674,98765,12345],
't_2':['Texas','Alaska','New York'],
'Areacode_2':[12674,78654,12345]})
filtered1 = merged_df.loc[~((merged_df.t_1 == merged_df.t_2) & (merged_df.Areacode_1 == merged_df.Areacode_2))]
display(filtered1)
filtered2 = merged_df.loc[(merged_df.t_1 != merged_df.t_2)]
display(filtered2)
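Note that, for this sample data, filtered1 shows the same output as filtered2, and both match your 'Expected merged_df'.
They are not identical in general: filtered1 also keeps rows where the t values match but the area codes differ, while filtered2 keeps exactly the rows where the t values differ, which is the union of your m and m2. For the two cases you describe, filtered2 is enough.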
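To see why a single condition covers both cases from the question, here is a small check on the sample data. The two original filters share the "t values differ" condition and only disagree on the area codes, so together they select every row where the t values differ. The mask names are made up for the illustration.

```python
import pandas as pd

merged_df = pd.DataFrame({
    "Name": ["Jerry", "Elaine", "George"],
    "t_1": ["New Jersey", "New York", "New York"],
    "Areacode_1": [12674, 98765, 12345],
    "t_2": ["Texas", "Alaska", "New York"],
    "Areacode_2": [12674, 78654, 12345],
})

t_differs = merged_df["t_1"] != merged_df["t_2"]
area_matches = merged_df["Areacode_1"] == merged_df["Areacode_2"]

m = merged_df[t_differs & area_matches]    # first filter from the question
m2 = merged_df[t_differs & ~area_matches]  # second filter from the question
combined = merged_df[t_differs]            # single-condition version

# The two separate filters together give the same rows as the single one
print(pd.concat([m, m2]).sort_index().equals(combined))
```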
I used np.where to solve this.
merged_df2 = merged_df.assign(Filter=np.where(
    ((merged_df['Salesforce_Territory'] != merged_df['Snowflake Territory'])
     & (merged_df['Salesforce_Zip_Code'] != merged_df['Snowflake Zip']))
    | (merged_df['Salesforce_Territory'] != merged_df['Snowflake Territory']),
    True, False))
I am doing a triple for loop on a dataframe with almost 70 thousand entries. How do I optimize it?
My ultimate goal is to create a new column that has the country of a seismic event. I have a latitude, longitude and 'place' (ex: '17km N of North Nenana, Alaska') column. I tried to reverse geocode, but with 68,488 entries, there is no free service that lets me do that. And as a student, I cannot afford it.
So I am using a dataframe with a list of countries and a dataframe with a list of states to compare to USGS['place']'s values. To do that, I ultimately settled on using 3 for loops.
As you can assume, it takes a long time. I was hoping there is a way to speed things up. I am using python, but I use r as well. The for loops just run better on python.
Any better options I'll take.
USGS = pd.DataFrame(data = {'latitude':[64.7385, 61.116], 'longitude':[-149.136, -138.655], 'place':['17km N of North Nenana, Alaska', '74km WNW of Haines Junction, Canada'], 'country':[None, None]})
states = pd.DataFrame(data = {'state':['AK', 'AL'], 'name':['Alaska', 'Alabama']})
countries = pd.DataFrame(data = {'country':['Afghanistan', 'Canada']})
for head in states:
    for state in states[head]:
        for p in USGS['place']:
            if state in p:
                USGS['country'] = USGS['country'].map({p : 'United States'})
# I have not finished the code for the countries dataframe
You do have options to do geocoding. Mapquest offers a free 15,000 calls per month. You can also look at using geopy which is what I use.
import pandas as pd
import geopy
from geopy.geocoders import Nominatim
USGS_df = pd.DataFrame(data = {'latitude':[64.7385, 61.116], 'longitude':[-149.136, -138.655], 'place':['17km N of North Nenana, Alaska', '74km WNW of Haines Junction, Canada'], 'country':[None, None]})
geopy.geocoders.options.default_user_agent = "locations-application"
geolocator=Nominatim(timeout=10)
for i, row in USGS_df.iterrows():
    lat = row['latitude']
    lon = row['longitude']
    try:
        location = geolocator.reverse('%s, %s' % (lat, lon))
        country = location.raw['address']['country']
        print('Found: ' + location.address)
        USGS_df.loc[i, 'country'] = country
    except Exception:
        print('Location not identified: %s, %s' % (lat, lon))
Input:
print (USGS_df)
latitude longitude place country
0 64.7385 -149.136 17km N of North Nenana, Alaska None
1 61.1160 -138.655 74km WNW of Haines Junction, Canada None
Output:
print (USGS_df)
latitude longitude place country
0 64.7385 -149.136 17km N of North Nenana, Alaska USA
1 61.1160 -138.655 74km WNW of Haines Junction, Canada Canada
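If geocoding is not an option, the string matching from the question can be done without nested loops. This is a sketch only, and it assumes that the state or country name is always the text after the last comma in 'place'. It compares that text against the states and countries frames from the question with isin.

```python
import pandas as pd

USGS = pd.DataFrame({
    "place": ["17km N of North Nenana, Alaska",
              "74km WNW of Haines Junction, Canada"],
})
states = pd.DataFrame({"state": ["AK", "AL"], "name": ["Alaska", "Alabama"]})
countries = pd.DataFrame({"country": ["Afghanistan", "Canada"]})

# Assumption: the region is whatever follows the last comma
region = USGS["place"].str.rsplit(",", n=1).str[-1].str.strip()

is_state = region.isin(states["name"]) | region.isin(states["state"])
is_country = region.isin(countries["country"])

# Keep the region where it is a known country, otherwise leave it missing
USGS["country"] = region.where(is_country)
# US states map to the United States
USGS.loc[is_state, "country"] = "United States"

print(USGS)
```

Rows whose region matches neither list keep a missing country and can be inspected separately.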
I have a pandas data frame with zip codes, city, state and country of ~ 600,000 locations. Let's call it my_df
I'd like to look up the corresponding longitude and latitude for each of these locations. Thankfully, there is a database for this. Let's call this dataframe zipdb.
zipdb has, among others, columns for zip codes, city, state and country.
So, I'd like to look up all of the locations (zip, city, state and country) in zipdb.
def zipdb_lookup(zipcode, city, state, country):
    countries_mapping = { "UNITED STATES":"US"
                        , "CANADA":"CA"
                        , "KOREA REP OF":"KR"
                        , "ITALY":"IT"
                        , "AUSTRALIA":"AU"
                        , "CHILE":"CL"
                        , "UNITED KINGDOM":"GB"
                        , "BERMUDA":"BM"
                        }
    try:
        slc = zipdb[ (zipdb.Zipcode == str(zipcode)) &
                     (zipdb.City == str(city).upper()) &
                     (zipdb.State == str(state).upper()) &
                     (zipdb.Country == countries_mapping[country].upper()) ]
        if slc.shape[0] == 1:
            return np.array(slc["Lat"])[0], np.array(slc["Long"])[0]
        else:
            return None
    except:
        return None
I have tried pandas' .apply as well as a for loop to do this.
Both are very slow. I recognize there are a large number of rows, but I can't help but think something faster must be possible.
zipdb = pandas.read_csv("free-zipcode-database.csv") #linked to above
Note: I've also performed this transformation on zipdb:
zipdb["Zipcode"] = zipdb["Zipcode"].astype(str)
Function Call:
#Defined a wrapper function:
def lookup(row):
    """
    :param row:
    :return:
    """
    lnglat = zipdb_lookup(
        zipcode = my_df["organization_zip"][row]
        , city = my_df["organization_city"][row]
        , state = my_df["organization_state"][row]
        , country = my_df["organization_country"][row]
    )
    return lnglat
lnglat = list()
for l in range(0, my_df.shape[0]):
    # if l % 5000 == 0: print(round((float(l) / my_df.shape[0])*100, 2), "%")
    lnglat.append(lookup(row = l))
Sample Data from my_df:
organization_zip organization_city organization_state organization_country
0 60208 EVANSTON IL United Sates
1 77555 GALVESTON TX United Sates
2 23284 RICHMOND VA United Sates
3 53233 MILWAUKEE WI United Sates
4 10036 NEW YORK NY United Sates
5 33620 TAMPA FL United Sates
6 10029 NEW YORK NY United Sates
7 97201 PORTLAND OR United Sates
8 97201 PORTLAND OR United Sates
9 53715 MADISON WI United Sates
Using merge() will be a lot faster than calling a function on every row. Make sure the field types match and strings are stripped:
# prepare your dataframe
data['organization_zip'] = data.organization_zip.astype(str)
data['organization_city'] = data.organization_city.apply(lambda v: v.strip())
# get the zips database
zips = pd.read_csv('/path/to/free-zipcode-database.csv')
zips['Zipcode'] = zips.Zipcode.astype(str)
# left join
# -- prepare common join columns
zips.rename(columns=dict(Zipcode='organization_zip',
City='organization_city'),
inplace=True)
# specify join columns along with zips' columns to copy
cols = ['organization_zip', 'organization_city', 'Lat', 'Long']
data.merge(zips[cols], how='left')
Note you may need to extend the merge columns and/or add more columns to copy from the zips dataframe.
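A sketch of extending the join to zip, city and state while keeping the "exactly one match" rule from the original lookup function. The frames below are tiny stand-ins, and the Lat/Long numbers are placeholders rather than real coordinates. Country is left out to keep the example short; it would need the name-to-code mapping from the question applied first.

```python
import pandas as pd

# Stand-in for my_df (note the stray whitespace and mixed case)
my_df = pd.DataFrame({
    "organization_zip": [60208, 77555],
    "organization_city": [" Evanston", "GALVESTON "],
    "organization_state": ["il", "TX"],
})
# Stand-in for zipdb; Lat/Long are placeholder values, not real coordinates
zipdb = pd.DataFrame({
    "Zipcode": ["60208", "77555", "77555"],
    "City": ["EVANSTON", "GALVESTON", "GALVESTON"],
    "State": ["IL", "TX", "TX"],
    "Lat": [1.0, 2.0, 3.0],
    "Long": [-1.0, -2.0, -3.0],
})

keys = ["Zipcode", "City", "State"]

# Normalise the join columns so they compare equal to zipdb's
left = my_df.assign(
    Zipcode=my_df["organization_zip"].astype(str),
    City=my_df["organization_city"].str.strip().str.upper(),
    State=my_df["organization_state"].str.strip().str.upper(),
)

# Mirror "slc.shape[0] == 1": drop every key that occurs more than once
unique_zipdb = zipdb.drop_duplicates(subset=keys, keep=False)

result = left.merge(unique_zipdb[keys + ["Lat", "Long"]], on=keys, how="left")
print(result[["organization_zip", "Lat", "Long"]])
```

Rows with no match, or with an ambiguous match, end up with missing Lat and Long instead of None.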