I've spent the last few months developing a program that my company uses to clean and geocode addresses on a large scale (~5,000/day). It works reasonably well; however, certain address formats that I see daily are causing problems.
Addresses in a format such as park avenue 1 break my geocoding. My plan to tackle this issue is as follows:
Split the address into a list
Find the index of my delimiter word in the list. The delimiter words are words such as avenue, street, road, etc. I have a list of these delimiters called patterns.
Check whether the word immediately following the delimiter consists only of digits and is at most 4 characters long. If it is longer than 4 digits it is likely to be a zip code, which I do not need. If it has 4 digits or fewer it is most likely the house number.
If the word meets the criteria that I explained in the previous step, I need to move it to the first position in the list.
Finally, I will put the list back together into a string.
Here is my initial attempt at putting my thoughts into code:
patterns = ['my list of delimiters']
address = 'park avenue 1' # this is an example address
address = address.split(' ')
for pattern in patterns:
    location = address.index(pattern) + 1
    if address[location].isdigit() and len(address[location]) <= 4:
        # here is where i'm getting a bit confused
        # what would be a good way to go about moving the word to the first position in the list
        pass
address = ' '.join(address)
Any help would be appreciated. Thank you folks in advance.
Make the string address[location] into a list by wrapping it in brackets, then concatenate the other pieces.
address = [address[location]] + address[:location] + address[location+1:]
An example:
address = ['park', 'avenue', '1']
location = 2
address = [address[location]] + address[:location] + address[location+1:]
print(' '.join(address)) # => '1 park avenue'
Here's a modified version of your code. It uses simple list slicing to rearrange the parts of the address list.
Rather than using a for loop to search for a matching road type it uses set operations.
This code isn't perfect: it won't catch "numbers" like 12a, and it won't handle weird street names like "Avenue Road".
road_patterns = {'avenue', 'street', 'road', 'lane'}
def fix_address(address):
    address_list = address.split()
    road = road_patterns.intersection(address_list)
    if len(road) == 0:
        print("Can't find a road pattern in ", address_list)
    elif len(road) > 1:
        print("Ambiguous road pattern in ", address_list, road)
    else:
        road = road.pop()
        index = address_list.index(road) + 1
        if index < len(address_list):
            number = address_list[index]
            if number.isdigit() and len(number) <= 4:
                address_list = [number] + address_list[:index] + address_list[index + 1:]
                address = ' '.join(address_list)
    return address
addresses = (
'42 tobacco road',
'park avenue 1 a',
'penny lane 17',
'nonum road 12345',
'strange street 23 london',
'baker street 221b',
'37 gasoline alley',
'83 avenue road',
)
for address in addresses:
    fixed = fix_address(address)
    print('{!r} -> {!r}'.format(address, fixed))
output
'42 tobacco road' -> '42 tobacco road'
'park avenue 1 a' -> '1 park avenue a'
'penny lane 17' -> '17 penny lane'
'nonum road 12345' -> 'nonum road 12345'
'strange street 23 london' -> '23 strange street london'
'baker street 221b' -> 'baker street 221b'
Can't find a road pattern in ['37', 'gasoline', 'alley']
'37 gasoline alley' -> '37 gasoline alley'
Ambiguous road pattern in ['83', 'avenue', 'road'] {'avenue', 'road'}
'83 avenue road' -> '83 avenue road'
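One way to relax the first limitation mentioned above is to test the candidate word with a regular expression instead of str.isdigit(), so that a number with one trailing letter such as 221b also counts. This is only a minimal sketch: the names HOUSE_NUMBER and move_house_number are made up for illustration, and the rule "one to four digits plus an optional letter" is an assumption about what a house number looks like, not something derived from real data.

```python
import re

road_patterns = {'avenue', 'street', 'road', 'lane'}

# Assumption: a house number is 1-4 digits, optionally followed by one letter.
HOUSE_NUMBER = re.compile(r'\d{1,4}[a-z]?', re.IGNORECASE)


def move_house_number(address):
    """Move a house number that follows the road word to the front."""
    words = address.split()
    roads = road_patterns.intersection(words)
    if len(roads) != 1:
        return address  # no road word, or an ambiguous one: leave untouched
    index = words.index(roads.pop()) + 1
    if index < len(words) and HOUSE_NUMBER.fullmatch(words[index]):
        words = [words[index]] + words[:index] + words[index + 1:]
    return ' '.join(words)


print(move_house_number('baker street 221b'))  # 221b baker street
print(move_house_number('nonum road 12345'))   # nonum road 12345
```

Street names like "Avenue Road" are still left alone, because two road words remain ambiguous.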
Related
I have a list with address information
The placement of words in the list can be random.
address = [' South region', ' district KTS', ' 4', ' app. 106', ' ent. 1', ' st. 15']
I want to extract each item of the list into its own string.
r = re.compile(".*region")
region = list(filter(r.match, address))
It works, but there is more than one spelling of "region". For example, there can be "South reg." or "South r-n".
How can I combine multiple patterns?
Also, the digit 4 in the list is the building number. It can be only digits, or something like 4k1.
How can I extract the building number?
Hopefully I understood the requirement correctly.
For extracting the region, I chose to get it by the first word, but if you can be sure of the regions which are accepted, it would be better to construct the regex based on the valid values, not first word.
Also, for the building extraction, I am not sure of which are the characters you want to keep, versus the ones which you may want to remove. In this case I chose to keep only alphanumeric, meaning that everything else would be stripped.
CODE
import re
list1 = [' South region', ' district KTS', ' -4k-1.', ' app. 106', ' ent. 1', ' st. 15']
def GetFirstWord(list2,column):
    return re.search(r'\w+', list2[column].strip()).group()
def KeepAlpha(list2,column):
    return re.sub(r'[^A-Za-z0-9 ]+', '', list2[column].strip())
print(GetFirstWord(list1,0))
print(KeepAlpha(list1,2))
OUTPUT
South
4k1
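For the part of the question about combining several spellings of "region", regex alternation is one option, and the same loop can pick out the building number. The sketch below makes two assumptions that are not in the question: that region, reg. and r-n are the only spellings that occur, and that the building number is the only item consisting of digits optionally followed by letters and digits. The names region_re, building_re and extract are hypothetical.

```python
import re

address = [' South region', ' district KTS', ' 4', ' app. 106', ' ent. 1', ' st. 15']

# Assumption: these three spellings are the only region markers that occur.
region_re = re.compile(r'\b(?:region|reg\.|r-n)(?!\w)', re.IGNORECASE)
# Assumption: a building number is an item made of digits, optionally followed
# by letters and digits (4, 4k1), with nothing else in the item.
building_re = re.compile(r'\d+[A-Za-z0-9]*')


def extract(items):
    """Return (region, building); either may be None if not found."""
    region = building = None
    for item in items:
        item = item.strip()
        if region is None and region_re.search(item):
            region = item
        elif building is None and building_re.fullmatch(item):
            building = item
    return region, building


print(extract(address))  # ('South region', '4')
```

Because each item is tested on its own, the order of the items in the list does not matter.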
I have a list of strings as such (amount, address, payment):
"44.53 54 orchard rd Cash"
"32.34 600 sprout brook lane Card"
I am just trying to get the address from each string. It seems to me the best way to go about this would be to split at the first and last occurrence of a space. Is there any way to do this?
Python's split method is defined like this: str.split(sep=None, maxsplit=-1).
Similarly, there is str.rsplit(sep=None, maxsplit=-1).
This means that you can split off just the beginning and the ending:
>>> s = "44.53 54 orchard rd Cash"
>>> s.split(maxsplit=1)
['44.53', '54 orchard rd Cash']
>>> s.rsplit(maxsplit=1)
['44.53 54 orchard rd', 'Cash']
Then, to simply split the string into 3, you can write a simple function:
>>> def purchase_parts(purchase):
...     lsplit = purchase.split(maxsplit=1)
...     rsplit = lsplit[1].rsplit(maxsplit=1)
...     return (lsplit[0], rsplit[0], rsplit[1])
...
>>> purchase_parts("44.53 54 orchard rd Cash")
('44.53', '54 orchard rd', 'Cash')
>>> purchase_parts("32.34 600 sprout brook lane Card")
('32.34', '600 sprout brook lane', 'Card')
Still, I would suggest switching to a separated-value format: you can then split on that separator directly, and you also get direct support for importing and exporting CSV (comma-separated values) files.
Manual solution:
>>> [p.strip() for p in "32.34, 600 sprout brook lane, Card".split(',')]
['32.34', '600 sprout brook lane', 'Card']
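If the data does move to comma-separated values, the standard csv module could do the splitting instead of a hand-written split. A small sketch, assuming the two sample lines have been rewritten with commas (the in-memory io.StringIO stands in for a real file):

```python
import csv
import io

# Assumption: the data has been re-exported as comma-separated values.
data = io.StringIO(
    "44.53, 54 orchard rd, Cash\n"
    "32.34, 600 sprout brook lane, Card\n"
)

# skipinitialspace=True drops the blank that follows each comma.
rows = list(csv.reader(data, skipinitialspace=True))
addresses = [address for amount, address, payment in rows]

print(addresses)  # ['54 orchard rd', '600 sprout brook lane']
```

Unlike a plain split(','), csv.reader also copes with quoted fields that contain a comma themselves.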
You could potentially do something like:
line = "44.53 54 orchard rd Cash"
line_parts = line.split(" ")
address = " ".join(line_parts[1:-1])
It's a bit untidy and definitely brittle to changes in line format, but would do the job.
You can use your method, splitting at the first and last spaces, but you need to join back the rest (in the middle):
def get_address(s):
    s = s.split()
    return ' '.join(s[1:-1])
    # s[1:-1] will remove the first (amount) and the last (payment) values
    # ' '.join will then put back the spaces that were removed from the address by s.split
Input:
print(get_address("44.53 54 orchard rd Cash"))
print(get_address("32.34 600 sprout brook lane Cash"))
Output:
54 orchard rd
600 sprout brook lane
You could also use a regular expression to be a bit more flexible and robust. Here, \d+\.\d+ looks for two runs of digits separated by a dot (the amount), followed by a space. The address is the captured result (the part in parentheses): [\w\W]+ matches any run of characters, because it accepts both word characters (\w) and non-word characters (\W). Since that group is greedy, it extends up to the last space, which must be followed by a final run of word characters (\w+), the payment type.
import re
addresses = [
"44.53 54 orchard rd Cash",
"32.34 600 sprout brook lane Card"
]
addresses = [re.findall(r'\d+\.\d+ ([\w\W]+) \w+', address)[0] for address in addresses]
print(addresses) # ['54 orchard rd', '600 sprout brook lane']
You could get the first and last values using unpacking and then reassemble the rest to form the address:
amount,*rest,payment = s.split()
address = " ".join(rest)
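Put together as a complete function, the unpacking idea might look like the sketch below. The name split_purchase is made up for illustration, and the sketch assumes every line has at least an amount and a payment type; a line with fewer than two words raises ValueError.

```python
def split_purchase(line):
    """Split 'amount address... payment' into its three parts."""
    # Extended unpacking: the starred name collects everything in the middle.
    amount, *rest, payment = line.split()
    return amount, " ".join(rest), payment


print(split_purchase("44.53 54 orchard rd Cash"))
# ('44.53', '54 orchard rd', 'Cash')
```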
I'm having difficulty using regex to solve this problem.
For example, given the call below:
regex_exp(address, "OG 56432")
It should return
"OG 56432: Middle Street Pollocksville | 686"
address is an array of strings:
address = [
"622 Gordon Lane St. Louisville OH 52071",
"432 Main Long Road St. Louisville OH 43071",
"686 Middle Street Pollocksville OG 56432"
]
My solution currently looks like this (Python):
import re
def regex_exp(address, zipcode):
    for x in address:
        if zipcode in x:
            postal_code = (re.search(r"[A-Z]{2}\s[0-9]{5}", x)).group(0)
            # returns "OG 56432"
            digits = (re.search(r"\d+", x)).group(0)
            # returns "686"
            address = (re.search(r"\D+", x)).group(0)
            # returns "Middle Street Pollocksville OG"
            print(postal_code + ":" + address + "| " + digits)
regex_exp(address, "OG 56432")
# returns OG 56432: Middle Street Pollocksville OG | 686
As you can see from my second paragraph, this is not the correct answer - I need the returned value to be
"OG 56432: Middle Street Pollocksville | 686"
How do I change the regex search for my address variable so that it excludes the two consecutive capital letters? I've tried things like
address = (re.search("?!\D+", x)).group(0)
to remove the two consecutive capitals based on A regular expression to exclude a word/string but I think this is a step in the wrong direction.
PS: I understand there are easier methods to solve this, but I want to use regex to improve my fundamentals
If you just want to remove the two consecutive capital letters that immediately precede the zip code (a 5-digit number), then use this:
import re
text = "432 Main Long PC Market Road St. Louisville OG 43071"
address = re.sub(r'([A-Z]{2}[\s]{1})(?=[\d]{5})','',text)
print(address)
# Output: 432 Main Long PC Market Road St. Louisville 43071
For removing all occurrences of two consecutive Capital Letters:
import re
text = "432 Main Long PC Market Road St. Louisville OG 43071"
address = re.sub(r'([A-Z]{2}[\s]{1})','',text)
print(address)
# Output: 432 Main Long Market Road St. Louisville 43071
With re.sub() and group capturing you can use:
s="686 Middle Street Pollocksville OG 56432"
re.sub(r"(\d+)\s+(.*)\s+([A-Z]+\s+\d+)",r"\3: \2 | \1",s)
Out: 'OG 56432: Middle Street Pollocksville | 686'
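The same capturing-group idea can be folded back into the regex_exp function from the question. This is a sketch rather than the one right answer: the compiled pattern LINE is a hypothetical name, it assumes every line ends with a two-letter code and a five-digit zip, and the function returns the string instead of printing it so the result can be reused.

```python
import re

address = [
    "622 Gordon Lane St. Louisville OH 52071",
    "432 Main Long Road St. Louisville OH 43071",
    "686 Middle Street Pollocksville OG 56432",
]

# house number, street and town, then the state code and zip at the end
LINE = re.compile(r"(\d+)\s+(.*?)\s+([A-Z]{2}\s\d{5})$")


def regex_exp(lines, zipcode):
    """Return 'ZIP: street | number' for the first line ending in zipcode."""
    for line in lines:
        match = LINE.match(line)
        if match and match.group(3) == zipcode:
            number, street, postal_code = match.groups()
            return "{}: {} | {}".format(postal_code, street, number)
    return None  # no line carries that postal code


print(regex_exp(address, "OG 56432"))
# OG 56432: Middle Street Pollocksville | 686
```

Capturing the three parts in one match avoids the separate \D+ search that dragged the state code into the street name.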
I am working on some address cleaning/geocoding software, and I recently ran into a specific address format that is causing some problems for me.
My external geocoding module is having trouble finding addresses such as 30 w 60th new york (30 w 60th street new york is the proper format of the address).
Essentially what I would need to do is parse the string and check the following:
Are there any numbers followed by th or st or nd or rd (plus a space following them)? E.g. 33rd 34th 21st 24th
If so, is the word street following it?
If yes, do nothing.
If no, add the word street immediately after the specific pattern?
Would regex be the best way to approach this situation?
Further Clarification: I am not having any issues with other address suffixes, such as avenue, road, etc etc etc. I have analyzed very large data sets (I'm running about 12,000 addresses/day through my application), and instances where street is left out is what is causing the biggest headaches for me. I have looked into address parsing modules, such as usaddress, smartystreets, and others. I really just need to come up with a clean (hopefully regex?) solution to the specific problem that I have described.
I'm thinking something along the lines of:
Converting the string to a list.
Find the index of the element in the list that meets the criteria that I've explained.
Check to see if the next element is street. If so, do nothing.
If not, reconstruct the list with [:targetword + len(targetword)] + 'street' + [targetword + len(targetword):]. (targetword would be 47th or whatever is in the string)
Join the list back into a string.
I'm not exactly the best with regex, so I'm looking for some input.
Thanks.
It seems that you're looking for a regexp. = P
Here is some code I built specially for you:
import re
def check_th_add_street(address):
    # compile regexp rule
    has_th_st_nd_rd = re.compile(r"(?P<number>[\d]{1,3}(st|nd|rd|th)\s)(?P<following>.*)")
    # first check if the address has number followed by something like 'th, st, nd, rd'
    has_number = has_th_st_nd_rd.search(address)
    if has_number is not None:
        # then check if not followed by 'street'
        if re.match('street', has_number.group('following')) is None:
            # then add the 'street' word
            new_address = re.sub(r'(?P<number>[\d]{1,3}(st|nd|rd|th)\s)', r'\g<number>street ', address)
            return new_address
        else:
            return True # the format is good (followed by 'street')
    else:
        return True # there is no number like 'th, st, nd, rd'
I'm still learning Python myself, so please let me know whether it solves your issue.
Tested on a small list of addresses.
Hope it helps or leads you to a solution.
Thank you!
EDIT
Improved to handle the case where the number is followed by "avenue" or "road" as well as "street":
import re
def check_th_add_street(address):
    # compile regexp rule
    has_th_st_nd_rd = re.compile(r'(?P<number>[\d]{1,3}(th|st|nd|rd)\s)(?P<following>.*)')
    # first check if the address has number followed by something like 'th, st, nd, rd'
    has_number = has_th_st_nd_rd.search(address)
    if has_number is not None:
        # check if followed by "avenue" or "road" or "street"
        if re.match(r'(avenue|road|street)', has_number.group('following')):
            return True # do nothing
        # else add the "street" word
        else:
            # then add the 'street' word
            new_address = re.sub(r'(?P<number>[\d]{1,3}(st|nd|rd|th)\s)', r'\g<number>street ', address)
            return new_address
    else:
        return True # there is no number like 'th, st, nd, rd'
RE-EDIT
I made some improvements for your needs and added a usage example:
import re
# build the original address list includes bad format
address_list = [
'30 w 60th new york',
'30 w 60th new york',
'30 w 21st new york',
'30 w 23rd new york',
'30 w 1231st new york',
'30 w 1452nd new york',
'30 w 1300th new york',
'30 w 1643rd new york',
'30 w 22nd new york',
'30 w 60th street new york',
'30 w 60th street new york',
'30 w 21st street new york',
'30 w 22nd street new york',
'30 w 23rd street new york',
'30 w brown street new york',
'30 w 1st new york',
'30 w 2nd new york',
'30 w 116th new york',
'30 w 121st avenue new york',
'30 w 121st road new york',
'30 w 123rd road new york',
'30 w 12th avenue new york',
'30 w 151st road new york',
'30 w 15th road new york',
'30 w 16th avenue new york'
]
def check_th_add_street(address):
    # compile regexp rule
    has_th_st_nd_rd = re.compile(r'(?P<number>[\d]{1,4}(th|st|nd|rd)\s)(?P<following>.*)')
    # first check if the address has number followed by something like 'th, st, nd, rd'
    has_number = has_th_st_nd_rd.search(address)
    if has_number is not None:
        # check if followed by "avenue" or "road" or "street"
        if re.match(r'(avenue|road|street)', has_number.group('following')):
            return address # return original address
        # else add the "street" word
        else:
            new_address = re.sub(r'(?P<number>[\d]{1,4}(st|nd|rd|th)\s)', r'\g<number>street ', address)
            return new_address
    else:
        return address # there is no number like 'th, st, nd, rd' -> return original address
# initialisation of the new list
new_address_list = []
# build the new clean list
for address in address_list:
    new_address_list.append(check_th_add_street(address))
    # or you could use it straight here i.e. :
    # address = check_th_add_street(address)
    # print(address)
# use the new list to do your work
for address in new_address_list:
    print("Formated address is : %s" % address) # or whatever you want to do with 'address'
Will output :
Formated address is : 30 w 60th street new york
Formated address is : 30 w 60th street new york
Formated address is : 30 w 21st street new york
Formated address is : 30 w 23rd street new york
Formated address is : 30 w 1231st street new york
Formated address is : 30 w 1452nd street new york
Formated address is : 30 w 1300th street new york
Formated address is : 30 w 1643rd street new york
Formated address is : 30 w 22nd street new york
Formated address is : 30 w 60th street new york
Formated address is : 30 w 60th street new york
Formated address is : 30 w 21st street new york
Formated address is : 30 w 22nd street new york
Formated address is : 30 w 23rd street new york
Formated address is : 30 w brown street new york
Formated address is : 30 w 1st street new york
Formated address is : 30 w 2nd street new york
Formated address is : 30 w 116th street new york
Formated address is : 30 w 121st avenue new york
Formated address is : 30 w 121st road new york
Formated address is : 30 w 123rd road new york
Formated address is : 30 w 12th avenue new york
Formated address is : 30 w 151st road new york
Formated address is : 30 w 15th road new york
Formated address is : 30 w 16th avenue new york
RE-RE-EDIT
The final function : added the count parameter to re.sub()
def check_th_add_street(address):
    # compile regexp rule
    has_th_st_nd_rd = re.compile(r'(?P<number>[\d]{1,4}(th|st|nd|rd)\s)(?P<following>.*)')
    # first check if the address has number followed by something like 'th, st, nd, rd'
    has_number = has_th_st_nd_rd.search(address)
    if has_number is not None:
        # check if followed by "avenue" or "road" or "street"
        if re.match(r'(avenue|road|street)', has_number.group('following')):
            return address # do nothing
        # else add the "street" word
        else:
            # then add the 'street' word
            new_address = re.sub(r'(?P<number>[\d]{1,4}(st|nd|rd|th)\s)', r'\g<number>street ', address, 1) # the last parameter is the maximum number of pattern occurrences to be replaced
            return new_address
    else:
        return address # there is no number like 'th, st, nd, rd'
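For comparison, the search-then-substitute logic can probably be collapsed into a single substitution with a negative lookahead. This is a sketch of an alternative technique, not the function above rewritten line for line: the names ORDINAL and add_street are made up, it assumes street, avenue and road are the only suffixes that matter, and unlike the function above it also matches an ordinal at the very end of the string.

```python
import re

# An ordinal (1st, 22nd, 60th ...) that is NOT already followed by a street type.
# Assumption: street, avenue and road are the only suffixes that matter.
ORDINAL = re.compile(r'\b(\d{1,4}(?:st|nd|rd|th))\b(?!\s+(?:street|avenue|road)\b)')


def add_street(address):
    """Insert 'street' after the first bare ordinal, if there is one."""
    return ORDINAL.sub(r'\1 street', address, count=1)


print(add_street('30 w 60th new york'))          # 30 w 60th street new york
print(add_street('30 w 60th street new york'))   # 30 w 60th street new york
print(add_street('30 w 121st avenue new york'))  # 30 w 121st avenue new york
```

The lookahead only inspects what follows the ordinal and consumes nothing, so the rest of the address is left exactly as it was.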
While you could certainly use regex for this sort of problem, I can't help but think that there's most likely a Python library out there that has already solved this problem for you. I've never used these, but some quick searching turns up the following:
https://github.com/datamade/usaddress
https://pypi.python.org/pypi/postal-address
https://github.com/SwoopSearch/pyaddress
PyParsing also has an address sample here you might look at: http://pyparsing.wikispaces.com/file/view/streetAddressParser.py
You might also take a look at this former question: is there a library for parsing US addresses?
Any reason you can't just use a 3rd party library to solve the problem?
edit: Pyparsing moved their url: https://github.com/pyparsing/pyparsing
You could possibly do this by turning each one of those strings into lists, and looking for certain groups of characters in those lists. For example:
def check_th(address):
    addressList = list(address)
    # enumerate gives the position of the current character; list.index would
    # always return the position of the first 't' in the string
    for charIndex, character in enumerate(addressList[:-1]):
        if character == 't':
            if addressList[charIndex + 1] == 'h':
                numberList = [addressList[charIndex - 2], addressList[charIndex - 1]]
                return int(''.join(str(x) for x in numberList))
This looks very messy, but it should get the job done, as long as the number is two digits long. However, if there are many things you need to look for, you should probably look for a more convenient and simpler way to do this.
To check and add the word street, the following function should work as long as the street number comes before its name:
def check_add_street(address):
    addressList = list(address)
    newIndex = None
    # find the first ordinal suffix (th, st, nd, rd) that directly follows a digit
    for i in range(1, len(addressList) - 1):
        pair = addressList[i] + addressList[i + 1]
        if pair in ('th', 'st', 'nd', 'rd') and addressList[i - 1].isdigit():
            newIndex = i + 1 # index of the last character of the suffix
            break
    if newIndex is None:
        return address # no ordinal number found
    following = ''.join(addressList[newIndex + 1:])
    if following == ' street' or following.startswith(' street '):
        return address # 'street' is already there
    # insert ' street' right after the ordinal suffix
    newAddressList = addressList[:newIndex + 1] + list(' street') + addressList[newIndex + 1:]
    return ''.join(newAddressList)
This will add the word "street" if it is not already present, given that the format that you gave above is consistent.
I'm sorry if the title isn't very descriptive. I don't exactly know how to sum up my problem in a few words.
Here's my issue. I'm cleaning addresses and some of them are causing some issues.
I have a list of delimiters (avenue, street, road, place, etc etc etc) named patterns.
Let's say I have this address for example: SUITE 1603 200 PARK AVENUE SOUTH NEW YORK
I would like the output to be SUITE 200 PARK AVENUE SOUTH NEW YORK
Is there any way I could somehow look to see if there are 2 batches of numbers (in this case 1603 and 200) before one of my patterns and if so, strip the first batch of numbers from my string? i.e remove 1603 and keep 200.
Update: I've added this line to my code:
address = re.sub("\d+", "", address); however, it's currently removing all the numbers. I thought that by putting , 1 after address it would only remove the first occurrence, but that wasn't the case.
If you want to apply this replacement only when one of your "separator" words is used, and only when there are two numbers, you can use a fancier regular expression.
import re
pattern = r"\d+ +(\d+ .*(STREET|AVENUE|ROAD|WHATEVER))"
input = "SUITE 1603 200 PARK AVENUE SOUTH NEW YORK"
output = re.sub(pattern, "\\1", input)
print(output) #SUITE 200 PARK AVENUE SOUTH NEW YORK
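Since the question already keeps its delimiters in a list, the alternation could be generated from that list instead of being typed by hand. A sketch under assumptions: the four words in patterns are placeholders for the real list, the names two_numbers and drop_first_number are hypothetical, and the match is case-sensitive like the pattern above.

```python
import re

# Assumption: 'patterns' stands in for the list of delimiter words from the question.
patterns = ['STREET', 'AVENUE', 'ROAD', 'PLACE']

# re.escape keeps any punctuation in a delimiter from being read as regex syntax.
delimiters = '|'.join(re.escape(word) for word in patterns)
two_numbers = re.compile(r'\b\d+ +(\d+ .*\b(?:' + delimiters + r')\b)')


def drop_first_number(address):
    """Remove the first of two consecutive numbers that precede a delimiter."""
    return two_numbers.sub(r'\1', address, count=1)


print(drop_first_number('SUITE 1603 200 PARK AVENUE SOUTH NEW YORK'))
# SUITE 200 PARK AVENUE SOUTH NEW YORK
```

Addresses with only one number, or with no delimiter word at all, do not match and are returned unchanged.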
Your description of what you want to do isn't very clear, but if I understand correctly, you want to delete the first occurrence of a number sequence?
You could do this without using a regex,
s = 'SUITE 1603 200 PARK AVENUE SOUTH NEW YORK'
l = s.split(' ')
for i, w in enumerate(l):
    if any(c.isdigit() for c in w):
        del l[i]
        break  # stop after removing the first word that contains a digit
print(' '.join(l))
Output: >>> SUITE 200 PARK AVENUE SOUTH NEW YORK