I am using the library usaddress to parse addresses from a set of files I have. I would like my final output to be a data frame where column names represent parts of the address (e.g. street, city, state) and rows represent each individual address I've extracted. For example:
Suppose I have a list of addresses:
addr = ['123 Pennsylvania Ave NW Washington DC 20008',
'652 Polk St San Francisco, CA 94102',
'3711 Travis St #800 Houston, TX 77002']
and I extract them using usaddress
info = [usaddress.parse(loc) for loc in addr]
"info" is a list of a list of tuples that looks like this:
[[('123', 'AddressNumber'),
('Pennsylvania', 'StreetName'),
('Ave', 'StreetNamePostType'),
('NW', 'StreetNamePostDirectional'),
('Washington', 'PlaceName'),
('DC', 'StateName'),
('20008', 'ZipCode')],
[('652', 'AddressNumber'),
('Polk', 'StreetName'),
('St', 'StreetNamePostType'),
('San', 'PlaceName'),
('Francisco,', 'PlaceName'),
('CA', 'StateName'),
('94102', 'ZipCode')],
[('3711', 'AddressNumber'),
('Travis', 'StreetName'),
('St', 'StreetNamePostType'),
('#', 'OccupancyIdentifier'),
('800', 'OccupancyIdentifier'),
('Houston,', 'PlaceName'),
('TX', 'StateName'),
('77002', 'ZipCode')]]
I would like each list (there are 3 lists within the object "info") to represent a row, with the second value of each tuple denoting a column name and the first value of the tuple being the cell value. Note: the length of the inner lists will not always be the same, as not every address will have every bit of information.
Any help would be much appreciated!
Thanks
Not sure if there is a DataFrame constructor that can handle info exactly as you have it now. (Maybe from_records or from_items?--still don't think this structure would be directly compatible.)
Here's a bit of manipulation to get what you're looking for:
cols = [j for _, j in info[0]]
# Could use nested list comprehension here, but this is probably
# more readable.
info2 = []
for row in info:
    info2.append([i for i, _ in row])
pd.DataFrame(info2, columns=cols)
AddressNumber StreetName StreetNamePostType StreetNamePostDirectional PlaceName StateName ZipCode
0 123 Pennsylvania Ave NW Washington DC 20008
1 652 Polk St San Francisco, CA 94102
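Since not every address produces the same tags (the third one has OccupancyIdentifier tokens the first two lack), the fixed cols list above may not line up for every row. One way around that is to build a dict per address and let pandas align the columns; a rough sketch (it also joins tokens that share a tag, such as 'San' and 'Francisco,'):
import pandas as pd
import usaddress

addr = ['123 Pennsylvania Ave NW Washington DC 20008',
        '652 Polk St San Francisco, CA 94102',
        '3711 Travis St #800 Houston, TX 77002']

rows = []
for loc in addr:
    row = {}
    for token, tag in usaddress.parse(loc):
        # Append tokens that share a tag (e.g. 'San' + 'Francisco,')
        row[tag] = (row[tag] + ' ' + token) if tag in row else token
    rows.append(row)

df = pd.DataFrame(rows)  # tags missing from an address simply become NaN
Because each row is a dict keyed by tag, addresses with different sets of tags still land in the right columns.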
Thank you for your responses! I ended up doing a completely different workaround as follows:
I checked the documentation to see all possible parse_tags from usaddress, created a DataFrame with all possible tags as columns, and one other column with the extracted addresses. Then I proceeded to parse and extract information from the columns using regex. Code below!
parse_tags = ['Recipient','AddressNumber','AddressNumberPrefix','AddressNumberSuffix',
'StreetName','StreetNamePreDirectional','StreetNamePreModifier','StreetNamePreType',
'StreetNamePostDirectional','StreetNamePostModifier','StreetNamePostType','CornerOf',
'IntersectionSeparator','LandmarkName','USPSBoxGroupID','USPSBoxGroupType','USPSBoxID',
'USPSBoxType','BuildingName','OccupancyType','OccupancyIdentifier','SubaddressIdentifier',
'SubaddressType','PlaceName','StateName','ZipCode']
addr = ['123 Pennsylvania Ave NW Washington DC 20008',
'652 Polk St San Francisco, CA 94102',
'3711 Travis St #800 Houston, TX 77002']
df = pd.DataFrame({'Addresses': addr})
df = pd.concat([df, pd.DataFrame(columns=parse_tags)])
Then I created a new column that made a string out of the usaddress parse list and called it "Info"
df['Info'] = df['Addresses'].apply(lambda x: str(usaddress.parse(x)))
Now here's the major workaround. I looped through each column name and looked for it in the corresponding "Info" cell and applied regular expressions to extract information where they existed!
import re

for colname in parse_tags:
    df[colname] = df['Info'].apply(
        lambda x: re.findall(r"\('(\S+)', '{}'\)".format(colname), x)[0]
        if re.search(colname, x) else "")
This is probably not the most efficient way, but it worked for my purposes. Thanks everyone for providing suggestions!
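For what it's worth, usaddress also has a tag() function that returns the components already joined up in an OrderedDict, which could replace the string/regex round-trip entirely. A rough sketch, assuming usaddress.tag() and its RepeatedLabelError exception behave as documented:
import pandas as pd
import usaddress

def tag_address(address):
    try:
        tagged, _ = usaddress.tag(address)  # OrderedDict of component -> value
        return pd.Series(tagged)
    except usaddress.RepeatedLabelError:
        return pd.Series(dtype=object)      # ambiguous address: leave the row empty

df = df.join(df['Addresses'].apply(tag_address))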
I have a pandas dataset like below:
import pandas as pd
data = {'id': ['001', '002', '003'],
'address': ["William J. Clare\n290 Valley Dr.\nCasper, WY 82604\nUSA, United States",
"1180 Shelard Tower\nMinneapolis, MN 55426\nUSA, United States",
"William N. Barnard\n145 S. Durbin\nCasper, WY 82601\nUSA, United States"]
}
df = pd.DataFrame(data)
print(df)
I need to split the address column on the \n delimiter and create new columns like name, address line 1, City, State, Zipcode, and Country, like below:
id Name addressline1 City State Zipcode Country
1 William J. Clare 290 Valley Dr. Casper WY 82604 United States
2 null 1180 Shelard Tower Minneapolis MN 55426 United States
3 William N. Barnard 145 S. Durbin Casper WY 82601 United States
I am learning Python and have been working on this since this morning. Any help will be greatly appreciated.
Thanks,
Right now, pandas is returning a table with 2 columns. If you look at the value in the second column, the essential pieces of information are separated by commas. Therefore, if you saved your dataframe to df, you can do the following:
df['address_and_city'] = df['address'].apply(lambda x: x.split(',')[0])
df['state_and_postal'] = df['address'].apply(lambda x: x.split(',')[1])
df['country'] = df['address'].apply(lambda x: x.split(',')[2])
Now you have three additional columns in your dataframe; the last one already contains the full country information. From the first two columns you created, you can extract the info you need in a similar way.
df['address_first_line'] = df['address_and_city'].apply(lambda x: ' '.join(x.split('\n')[:-1]))
df['city'] = df['address_and_city'].apply(lambda x: x.split('\n')[-1])
df['state'] = df['state_and_postal'].apply(lambda x: x.split(' ')[1])
df['postal'] = df['state_and_postal'].apply(lambda x: x.split(' ')[2].split('\n')[0])
Now you should have all the columns you need. You can remove the excess columns with:
df.drop(columns=['address','address_and_city','state_and_postal'], inplace=True)
Of course, it can all be done faster and with fewer lines of code, but I think this is the clearest way of doing it, which I hope you will find useful. If you don't understand what I did there, check the documentation for the split and join string methods, and also for the apply method, native to pandas.
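If you want a shorter route, a single regular expression with named groups via str.extract can pull all of the columns in one pass. This is only a sketch and assumes the addresses always end with "USA, <country>" and use a two-letter state plus a five-digit ZIP, like the sample data:
import pandas as pd

pattern = (r'(?:(?P<Name>[^\n]+)\n)?'     # optional name line
           r'(?P<addressline1>[^\n]+)\n'  # street / building line
           r'(?P<City>[^,]+), '
           r'(?P<State>[A-Z]{2}) '
           r'(?P<Zipcode>\d{5})\n'
           r'USA, (?P<Country>.+)')

out = df[['id']].join(df['address'].str.extract(pattern))
print(out)
When the optional name line is absent, str.extract simply leaves that column as NaN for the row.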
I am having trouble with pandas replace-function. Let's say we have an example dataframe like this:
df = pd.DataFrame({'State': ['Georgia', 'Alabama', 'Tennessee'],
'Cities': [['Atlanta', 'Albany'], ['Montgomery', 'Huntsville', 'Birmingham'], ['Nashville', 'Knoxville']]})
>>> df
State Cities
0 Georgia [Atlanta, Albany]
1 Alabama [Montgomery, Huntsville, Birmingham]
2 Tennessee [Nashville, Knoxville]
Now I want to replace the state names and city names all by abbreviations. I have two dictionaries that define the replacement values:
state_abbrv = {'Alabama': 'AL', 'Georgia': 'GA', 'Tennessee': 'TN'}
city_abbrv = {'Albany': 'Alb.', 'Atlanta': 'Atl.', 'Birmingham': 'Birm.',
'Huntsville': 'Htsv.', 'Knoxville': 'Kxv.',
'Montgomery': 'Mont.', 'Nashville': 'Nhv.'}
When using pd.DataFrame.replace() on the "State" column (which only contains one value per row) it works as expected and replaces all state names:
>>> df.replace({'State': state_abbrv})
State Cities
0 GA [Atlanta, Albany]
1 AL [Montgomery, Huntsville, Birmingham]
2 TN [Nashville, Knoxville]
I was hoping that it would also individually replace all matching names within the lists of the "Cities" column, but unfortunately it does not seem to work as all cities remain unabbreviated:
>>> df.replace({'Cities': city_abbrv})
State Cities
0 Georgia [Atlanta, Albany]
1 Alabama [Montgomery, Huntsville, Birmingham]
2 Tennessee [Nashville, Knoxville]
How do I get the pd.DataFrame.replace() function to individually circle through all list elements in the column per row and replace accordingly?
Try:
explode to split the list into individual rows
replace each column using the relevant dictionary
groupby and agg to get back the original structure
>>> output = df.explode("Cities").replace({"State": state_abbrv, "Cities": city_abbrv}).groupby("State", as_index=False)["Cities"].agg(list)
>>> output
State Cities
0 AL [Mont., Htsv., Birm.]
1 GA [Atl., Alb.]
2 TN [Nhv., Kxv.]
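If you'd rather not explode and regroup (for instance, to keep the original row order), mapping each list element directly also works; a small sketch, not part of the answer above:
df_abbrv = df.assign(
    State=df['State'].map(state_abbrv),
    # replace each city inside the list, leaving unknown names untouched
    Cities=df['Cities'].apply(lambda cities: [city_abbrv.get(c, c) for c in cities]),
)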
Currently, each column that I want to delimit stores address information (street, city, zip code), and I want to separate each section into its own column, e.g. streets, cities, and zip codes will each have their own column. However, I need to repeat this process for all the columns stored in my list address_columns.
What the data looks like:
address_columns = ['QUESTION_47', 'QUESTION_56', 'QUESTION_65', 'QUESTION_83', 'QUESTION_92']
(How each column looks, using fake addresses)
QUESTION 47
64 Fordham St, Toms River, NJ 08753
7352 Poor House St. Hartford, CT 06106
8591 Peninsula Lane, Copperas Cove, TX 76522
Rough idea of how to approach my problem:
Step One.
all strings before the first comma go in the first column
all strings before the second comma go in the second column
all strings before the third comma go in the third column
etc.
Step Two.
identify the text after the last comma and put it in the last additional column
Step Three.
Repeat for all columns in the list
You can use the pandas.Series.str.split method with the expand=True argument. For example:
import pandas as pd
df = pd.DataFrame({'q47': ['11 4 st,seattle,22222', '11 9st,chicago,23456']})
dfnew = df['q47'].str.split(',', expand=True).rename(columns={0:'street', 1:'city', 2:'zip'})
Let's say this is your data:
df = pd.DataFrame({'QUESTION_47': {0: '64 Fordham St, Toms River, NJ 08753',
1: '7352 Poor House St, Hartford, CT 06106',
2: '8591 Peninsula Lane, Copperas Cove, TX 76522'},
'QUESTION_56': {0: '1234 Some House St, City, CT 11111',
1: '234 Some Other St, City, AB 23456',
2: '90 Yet Other St, City, AB 12345'}})
You can expand each column into three, suffix the new columns with the question number, and then stack them horizontally:
# holder dataframe
df_all = pd.DataFrame()
# loop over columns in dataframe
for c in df.columns:
    df_ = pd.DataFrame()
    ext = c[8:]  # extract the question number from the column name
    df_[['street'+ext, 'city'+ext, 'zipcode'+ext]] = df[c].str.split(',', expand=True)
    # concatenate the new expansion to the previously accumulated questions
    df_all = pd.concat([df_all, df_], axis=1)
output:
I have a column in a dataframe that is like this:
OWNER
--------------
OTTO J MAYER
OTTO MAYER
DANIEL J ROSEN
DANIEL ROSSY
LISA CULLI
LISA CULLY
LISA CULLY
CITY OF BELMONT
CITY OF BELMONT CITY
Some of the names in my data frame are misspelled or have extra/missing characters. I need a column where each name is replaced by the closest match in the same column, so that all similar names end up grouped under one single name.
For example, this is what I expect from the data frame above:
NAME
--------------
OTTO J MAYER
OTTO J MAYER
DANIEL J ROSEN
DANIEL ROSSY
LISA CULLY
LISA CULLY
LISA CULLY
CITY OF BELMONT
CITY OF BELMONT
OTTO MAYER is replaced with OTTO J MAYER because they are very similar. The DANIELs stay the same because they do not match closely enough. The LISA CULLs all end up with the same value, etc.
I have some code I got from another post on Stack Overflow that was trying to solve something similar, but it uses a dictionary of names. However, I'm having trouble reworking their code to produce the output that I need.
Here is what I have currently:
d = pd.DataFrame({'OWNER' : pd.Series(['OTTO J MAYER', 'OTTO MAYER','DANIEL J ROSEN','DANIEL ROSSY',
'LISA CULLI', 'LISA CULLY'])})
names = d['OWNER']
names = names.values
names
import difflib
def best_match(tokens, names):
    for i, t in enumerate(tokens):
        closest = difflib.get_close_matches(t, names, n=1)
        if len(closest) > 0:
            return i, closest[0]
    return None
def fuzzy_replace(x, y):
    names = y  # just a simple replacement list
    tokens = x.split()
    res = best_match(tokens, y)
    if res is not None:
        pos, replacement = res
        return u" ".join(tokens)
    return x
d["OWNER"].apply(lambda x: fuzzy_replace(x, names))
Indeed difflib.get_close_matches is fit for the task, but splitting the name into tokens does no good. In order to differentiate the names as specified, we have to raise the cutoff score to about 0.8, and to make sure that all possible names are returned, raise the maximum number to len(names). Then we have two cases to decide which name to prefer:
If a name occurs more often than the others, choose that one.
Otherwise choose the one occurring first.
def fuzzy_replace(x, names):
    # every name within 0.8 similarity of x (x itself is always included)
    aliases = difflib.get_close_matches(x, names, len(names), .8)
    # pick the most common spelling among the aliases
    closest = pd.Series(aliases).mode()
    closest = aliases[0] if closest.empty else closest[0]
    # rewrite all aliases to that single spelling
    d['OWNER'].replace(aliases, closest, inplace=True)
for x in d["OWNER"]: fuzzy_replace(x, d['OWNER'])
Let's say I have some customer data. The data is generated by the customer and it's messy, so they put their city into either the city or county field or both! That means I may need to check both columns to find out which city they are from.
mydf = pd.DataFrame({'name':['jim','jon'],
'city':['new york',''],
'county':['','los angeles']})
print(mydf)
name city county
0 jim new york
1 jon los angeles
And I am using an api to get their zipcode. There is a different api function for each city, and it returns the zipcode for the customer's address, e.g. 123 main street, new york. I haven't included the full address here to save time.
# api for new york addresses
def get_NY_zipcode_api():
return 'abc123'
# api for chicago addresses
def get_CH_zipcode_api():
return 'abc124'
# api for los angeles addresses
def get_LA_zipcode_api():
return 'abc125'
# api for miami addresses
def get_MI_zipcode_api():
return 'abc126'
Depending on the city, I will call a different api. So for now, I am checking: if city == x or county == x, call api_x:
def myfunc(row):
city = row['city']
county = row['county']
if city == 'chicago' or county == 'chicago':
# call chicago api
zipcode = get_CH_zipcode_api()
return zipcode
elif city == 'new york' or county == 'new york':
# call new york api
zipcode = get_NY_zipcode_api()
return zipcode
elif city == 'los angeles' or county == 'los angeles':
# call los angeles api
zipcode = get_LA_zipcode_api()
return zipcode
elif city == 'miami' or county == 'miami':
# call miami api
zipcode = get_MI_zipcode_api()
return zipcode
And I apply() this to the df and get my results:
mydf['result'] = mydf.apply(myfunc,axis=1)
print(mydf)
name city county result
0 jim new york abc123
1 jon los angeles abc125
I actually have about 30 cities, and therefore 30 conditions to check, so I want to avoid a long list of elif statements. What would be the most efficient way to do this?
I found some suggestions from a similar Stack Overflow question, such as creating a dictionary with key: city and value: function, and calling the right function based on the city:
operationFuncs = {
'chicago': get_CH_zipcode_api,
'new york': get_NY_zipcode_api,
'los angeles': get_LA_zipcode_api,
'miami': get_MI_zipcode_api
}
But as far as I can see this only works if I am checking a single column / single condition. I can't see how it can work with if city == x or county == x
mydf['result'] = mydf.apply(lambda row: (operationFuncs.get(row['county']) or operationFuncs.get(row['city']))(), axis=1)
I think you are referring to this. You can just perform this operation twice, once for city and once for county, and save the result in two different variables, one zipcode each. You can then compare the results and decide what to do if they differ (I am not sure whether this can happen with your dataset).
Since the dictionary lookup is O(1) and I assume your get_MI_zipcode_api() isn't any more expensive, this will have no real performance drawbacks.
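A small sketch of that idea (hypothetical helper name), looking the function up under both columns and flagging the presumably impossible case where they point at different cities:
def lookup_zipcode(row):
    city_fn = operationFuncs.get(row['city'])
    county_fn = operationFuncs.get(row['county'])
    if city_fn and county_fn and city_fn is not county_fn:
        # city and county name different places; decide how to handle this
        raise ValueError(f"conflicting city/county for {row['name']}")
    fn = city_fn or county_fn
    return fn() if fn else None

mydf['result'] = mydf.apply(lookup_zipcode, axis=1)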
Maybe not the most elegant solution, but you could use the dict approach and just call it twice, once on city and once on county. The second call would overwrite the first, but the same is true of your if block, and this would only be a problem if you had city = 'new york' and county = 'chicago', for example, which I assume cannot occur.
Or you could use the dict and iterate through it, this seems unnecessary though.
for key, fn in operationFuncs.items():
    if key in (city, county):
        zipcode = fn()
I'd do this join in SQL before reading in the data. I'm sure there's a way to do the same in pandas, but I was trying to make suggestions that build on your existing research, even if they are not the best.
If it's guaranteed that the value will be present in either city or county and not in both, then you can merge both columns into one.
df['region'] = df['city'] + df['county']
Then create a mapping from region to zipcode, instead of a mapping from city to api function. Since there are only 30 unique values, you can store the zipcode for each city once rather than calling the zipcode functions each time, as making an api call is expensive.
mappings = {
'chicago': 'abc123',
'new york': 'abc234',
'los angeles': 'abc345',
'miami': 'abc456'}
Create a dataframe using this dictionary & then merge with the original dataframe
mappings_df = pd.DataFrame(list(mappings.items()), columns=['region', 'zipcode'])
df.merge(mappings_df, how='left', on='region')
Hope this helps!!
You need a relation table which can be represented by a dict.
df = pd.DataFrame({'name':['jim','jon'],
'city':['new york',''],
'county':['','los angeles']})
df['region'] = df['city'] + df['county']
table = {'new york': 'abc123', 'chicago': 'abc124', 'los angeles': 'abc125', 'miami': 'abc126'}
df['region'] = df.region.apply(lambda row: table[row])
print(df)
Output
name city county region
0 jim new york abc123
1 jon los angeles abc125