Search for pattern in string and add characters if found


I am working on some address cleaning/geocoding software, and I recently ran into a specific address format that is causing some problems for me.
My external geocoding module is having trouble finding addresses such as 30 w 60th new york (30 w 60th street new york is the proper format of the address).
Essentially what I would need to do is parse the string and check the following:
Are there any numbers followed by th, st, nd, or rd (plus a space following them)? E.g. 33rd, 34th, 21st, 24th.
If so, is the word street following it?
If yes, do nothing.
If no, add the word street immediately after the specific pattern.
Would regex be the best way to approach this situation?
Further Clarification: I am not having any issues with other address suffixes, such as avenue, road, etc etc etc. I have analyzed very large data sets (I'm running about 12,000 addresses/day through my application), and instances where street is left out is what is causing the biggest headaches for me. I have looked into address parsing modules, such as usaddress, smartystreets, and others. I really just need to come up with a clean (hopefully regex?) solution to the specific problem that I have described.
I'm thinking something along the lines of:
Converting the string to a list.
Find the index of the element in the list that meets the criteria that I've explained.
Check to see if the next element is street. If so, do nothing.
If not, reconstruct the list with words[:i + 1] + ['street'] + words[i + 1:], where i is the index of the target word (47th or whatever is in the string).
Join the list back into a string.
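The list-based steps above could be sketched roughly as follows. This is only an illustration of the idea, not a tested production routine: the name add_missing_street, the ORDINAL pattern (digits plus st/nd/rd/th) and the KNOWN_SUFFIXES set are assumptions made for the example.

```python
import re

# Assumption: a "target word" is one or more digits followed by st/nd/rd/th.
ORDINAL = re.compile(r'^\d+(st|nd|rd|th)$')
# Assumption: these suffixes mean the address is already complete.
KNOWN_SUFFIXES = {'street', 'avenue', 'road'}

def add_missing_street(address):
    """Insert 'street' after the first ordinal word that lacks a suffix."""
    words = address.split()
    for i, word in enumerate(words):
        if ORDINAL.match(word):
            next_word = words[i + 1] if i + 1 < len(words) else ''
            if next_word not in KNOWN_SUFFIXES:
                # rebuild the list with 'street' right after the target word
                words = words[:i + 1] + ['street'] + words[i + 1:]
            break  # only the first ordinal word is examined
    return ' '.join(words)

print(add_missing_street('30 w 60th new york'))
print(add_missing_street('30 w 60th street new york'))
```

Splitting on whitespace also collapses repeated spaces, which may or may not be what you want for your data.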
I'm not exactly the best with regex, so I'm looking for some input.
Thanks.

It seems that you're looking for a regexp. =P
Here is some code I built specially for you:
import re
def check_th_add_street(address):
    # compile regexp rule
    has_th_st_nd_rd = re.compile(r"(?P<number>[\d]{1,3}(st|nd|rd|th)\s)(?P<following>.*)")
    # first check if the address has number followed by something like 'th, st, nd, rd'
    has_number = has_th_st_nd_rd.search(address)
    if has_number is not None:
        # then check if not followed by 'street'
        if re.match('street', has_number.group('following')) is None:
            # then add the 'street' word
            new_address = re.sub(r'(?P<number>[\d]{1,3}(st|nd|rd|th)\s)', r'\g<number>street ', address)
            return new_address
        else:
            return True  # the format is good (followed by 'street')
    else:
        return True  # there is no number like 'th, st, nd, rd'
I'm a Python learner, so please let me know whether it solves your issue.
Tested on a small list of addresses.
Hope it helps or leads you to a solution.
Thank you!
EDIT
Improved to also handle the case where the number is followed by "avenue" or "road" as well as "street":
import re
def check_th_add_street(address):
    # compile regexp rule
    has_th_st_nd_rd = re.compile(r'(?P<number>[\d]{1,3}(th|st|nd|rd)\s)(?P<following>.*)')
    # first check if the address has number followed by something like 'th, st, nd, rd'
    has_number = has_th_st_nd_rd.search(address)
    if has_number is not None:
        # check if followed by "avenue" or "road" or "street"
        if re.match(r'(avenue|road|street)', has_number.group('following')):
            return True  # do nothing
        # else add the "street" word
        else:
            new_address = re.sub(r'(?P<number>[\d]{1,3}(st|nd|rd|th)\s)', r'\g<number>street ', address)
            return new_address
    else:
        return True  # there is no number like 'th, st, nd, rd'
RE-EDIT
I made some improvements for your needs and added a usage example:
import re
# build the original address list includes bad format
address_list = [
'30 w 60th new york',
'30 w 60th new york',
'30 w 21st new york',
'30 w 23rd new york',
'30 w 1231st new york',
'30 w 1452nd new york',
'30 w 1300th new york',
'30 w 1643rd new york',
'30 w 22nd new york',
'30 w 60th street new york',
'30 w 60th street new york',
'30 w 21st street new york',
'30 w 22nd street new york',
'30 w 23rd street new york',
'30 w brown street new york',
'30 w 1st new york',
'30 w 2nd new york',
'30 w 116th new york',
'30 w 121st avenue new york',
'30 w 121st road new york',
'30 w 123rd road new york',
'30 w 12th avenue new york',
'30 w 151st road new york',
'30 w 15th road new york',
'30 w 16th avenue new york'
]
def check_th_add_street(address):
    # compile regexp rule
    has_th_st_nd_rd = re.compile(r'(?P<number>[\d]{1,4}(th|st|nd|rd)\s)(?P<following>.*)')
    # first check if the address has number followed by something like 'th, st, nd, rd'
    has_number = has_th_st_nd_rd.search(address)
    if has_number is not None:
        # check if followed by "avenue" or "road" or "street"
        if re.match(r'(avenue|road|street)', has_number.group('following')):
            return address  # return original address
        # else add the "street" word
        else:
            new_address = re.sub(r'(?P<number>[\d]{1,4}(st|nd|rd|th)\s)', r'\g<number>street ', address)
            return new_address
    else:
        return address  # there is no number like 'th, st, nd, rd' -> return original address
# initialisation of the new list
new_address_list = []
# build the new clean list
for address in address_list:
    new_address_list.append(check_th_add_street(address))
    # or you could use it straight here i.e. :
    # address = check_th_add_street(address)
    # print(address)
# use the new list to do your work
for address in new_address_list:
    print("Formated address is : %s" % address)  # or whatever you want to do with 'address'
Will output :
Formated address is : 30 w 60th street new york
Formated address is : 30 w 60th street new york
Formated address is : 30 w 21st street new york
Formated address is : 30 w 23rd street new york
Formated address is : 30 w 1231st street new york
Formated address is : 30 w 1452nd street new york
Formated address is : 30 w 1300th street new york
Formated address is : 30 w 1643rd street new york
Formated address is : 30 w 22nd street new york
Formated address is : 30 w 60th street new york
Formated address is : 30 w 60th street new york
Formated address is : 30 w 21st street new york
Formated address is : 30 w 22nd street new york
Formated address is : 30 w 23rd street new york
Formated address is : 30 w brown street new york
Formated address is : 30 w 1st street new york
Formated address is : 30 w 2nd street new york
Formated address is : 30 w 116th street new york
Formated address is : 30 w 121st avenue new york
Formated address is : 30 w 121st road new york
Formated address is : 30 w 123rd road new york
Formated address is : 30 w 12th avenue new york
Formated address is : 30 w 151st road new york
Formated address is : 30 w 15th road new york
Formated address is : 30 w 16th avenue new york
RE-RE-EDIT
The final function: added the count parameter to re.sub()
def check_th_add_street(address):
    # compile regexp rule
    has_th_st_nd_rd = re.compile(r'(?P<number>[\d]{1,4}(th|st|nd|rd)\s)(?P<following>.*)')
    # first check if the address has number followed by something like 'th, st, nd, rd'
    has_number = has_th_st_nd_rd.search(address)
    if has_number is not None:
        # check if followed by "avenue" or "road" or "street"
        if re.match(r'(avenue|road|street)', has_number.group('following')):
            return address  # do nothing
        # else add the "street" word
        else:
            # count is the maximum number of pattern occurrences to be replaced
            new_address = re.sub(r'(?P<number>[\d]{1,4}(st|nd|rd|th)\s)', r'\g<number>street ', address, count=1)
            return new_address
    else:
        return address  # there is no number like 'th, st, nd, rd'
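For comparison, the same check can probably be folded into a single substitution with a negative lookahead. This is a sketch rather than a drop-in replacement: the names MISSING_SUFFIX and add_street are made up for the example, and it assumes that street, avenue and road are the only suffixes that should suppress the insertion.

```python
import re

# Assumption: an ordinal is digits + st/nd/rd/th, and street/avenue/road are
# the only suffixes that should suppress the insertion.
MISSING_SUFFIX = re.compile(
    r'\b(\d+(?:st|nd|rd|th))\b(?!\s+(?:street|avenue|road)\b)',
    re.IGNORECASE,
)

def add_street(address):
    # count=1 limits the substitution to the first matching ordinal
    return MISSING_SUFFIX.sub(r'\1 street', address, count=1)

print(add_street('30 w 60th new york'))
print(add_street('30 w 121st avenue new york'))
```

Unlike the function above, this version also handles an ordinal at the very end of the string, because it does not require a trailing space after the suffix.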

While you could certainly use regex for this sort of problem, I can't help but think that there's most likely a Python library out there that has already solved this problem for you. I've never used these, but some quick searching turned up the following:
https://github.com/datamade/usaddress
https://pypi.python.org/pypi/postal-address
https://github.com/SwoopSearch/pyaddress
PyParsing also has an address sample here you might look at: http://pyparsing.wikispaces.com/file/view/streetAddressParser.py
You might also take a look at this former question: is there a library for parsing US addresses?
Any reason you can't just use a 3rd party library to solve the problem?
edit: Pyparsing moved their url: https://github.com/pyparsing/pyparsing

You could possibly do this by turning each one of those strings into lists, and looking for certain groups of characters in those lists. For example:
def check_th(address):
    addressList = list(address)
    # look at each character together with the one after it
    for charIndex in range(2, len(addressList) - 1):
        if addressList[charIndex] == 't' and addressList[charIndex + 1] == 'h':
            numberList = [addressList[charIndex - 2], addressList[charIndex - 1]]
            # only accept 'th' when the two characters before it are digits
            if all(x.isdigit() for x in numberList):
                return int(''.join(numberList))
This looks very messy, but it should get the job done, as long as the number is two digits long. However, if there are many things you need to look for, you should probably look for a more convenient and simpler way to do this.

To check and add the word street, the following function should work as long as the street number comes before its name:
def check_add_street(address):
    addressList = list(address)
    newIndex = None
    # find the first 'th', 'st', 'nd' or 'rd' that directly follows a digit
    for charIndex in range(1, len(addressList) - 1):
        pair = addressList[charIndex] + addressList[charIndex + 1]
        if pair in ('th', 'st', 'nd', 'rd') and addressList[charIndex - 1].isdigit():
            newIndex = charIndex + 1  # index of the last letter of the suffix
            break
    if newIndex is None:
        return address  # no ordinal number found
    following = ''.join(addressList[newIndex + 1:])
    if following == ' street' or following.startswith(' street '):
        return address  # 'street' is already present
    # otherwise insert ' street' right after the suffix
    newAddressList = addressList[:newIndex + 1] + list(' street') + addressList[newIndex + 1:]
    return ''.join(newAddressList)
This will add the word "street" if it is not already present, given that the format that you gave above is consistent.

Related

Replace equal sub-strings with different words

I have a string:
s = '96 ST MARY ST'
Now the first occurrence of 'ST' is Saint, and the second occurrence is Street i.e. Saint Mary Street.
I want to replace the first ST with Saint, and the second ST with Street. For this I tried to use find() and rfind():
# index of ST
ind = s.find('ST')
s[ind:(ind+2)] = 'Saint'
# index of last ST
ind2 = s.rfind('ST')
s[ind2:(ind2+2)] = 'Street'
TypeError: 'str' object does not support item assignment
I don't know how to get around this.
Is there a way to extract these sub-strings somehow and replace them?

Two replacements:
s = s.replace("ST", "Saint", 1).replace("ST", "Street", 1)

You might be OK with using re.sub along with its count parameter, to target the first replacement:
import re
s = '96 ST MARY ST'
print(s)
out = re.sub(r'\bST\b', 'Saint', s, count=1)
print(out)
out = re.sub(r'\bST\b', 'Street', out)
print(out)
This prints:
96 ST MARY ST
96 Saint MARY ST
96 Saint MARY Street
However, while the above coincidentally works for your exact sample input, there are many edge cases where this would fail. It assumes that Saint comes before Street, and this may not always be the case, nor may there always be only two occurrences of ST.
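One way to be a little less dependent on order is to decide by position instead. The sketch below uses a deliberately simple heuristic, which is an assumption of this example and not something from the question: an ST that is the last word is read as Street, any other ST as Saint. The name expand_st is hypothetical.

```python
def expand_st(address):
    """Expand 'ST' to 'Street' when it is the final word, else to 'Saint'."""
    words = address.split()
    last = len(words) - 1
    expanded = []
    for i, word in enumerate(words):
        if word == 'ST':
            # heuristic (assumed): a trailing ST is a street type
            expanded.append('Street' if i == last else 'Saint')
        else:
            expanded.append(word)
    return ' '.join(expanded)

print(expand_st('96 ST MARY ST'))
```

This still gets inputs such as 12 MAIN ST APT 4 wrong, so real data would likely need a list of street types or a proper address parser.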

A simple way, assuming there are always 2 occurrences of 'ST':
p1, p2, _ = s.split('ST')
res = f"{p1}Saint{p2}Street"
If your input strings happen to be more complex, you should go for regex (as Tim Biegeleisen proposed).

How to find a word inside a list and search backwards from that word?

How can I search for a specific word inside a list? Once I have searched for that specific word, how can I search backwards from there? In my example, I need to search for the city, and then search backwards to find the street type (ex: Rd, St, Ave, etc.)
I first allow a user to input an address, like 123 Fakeville St SW San Francisco CA 90215:
searchWord = 'San Francisco'
searchWord = searchWord.upper()
address = raw_input("Type an address: ").upper()
Once the address is entered, I split it using address = address.split(), which results in:
['123', 'Fakeville', 'St', 'SW', 'San Francisco', 'CA', '90215']
I then search for city in the list:
for items in address:
    if searchWord in items:
        print(searchWord)
But I'm not sure how to count backwards to find the street type (ex: St).

You can use list.index method to search the index of an item in a list.
There is no list.rindex method to search backward.
You need to use:
rev_idx = len(my_list) - my_list[::-1].index(item) - 1
I don't really understand what your aim is, but I can explain how to search backward for the "St" string in the address list of strings:
address = ['123', 'Fakeville', 'St', 'SW', 'San Francisco', 'CA', '90215']
town_idx = address.index('San Francisco')
print(town_idx)
# You'll get: 4
before = address[:town_idx]
st_index = len(before) - before[::-1].index("St") - 1
print(st_index)
# You'll get: 2
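If the reversed-slice arithmetic feels hard to read, the same backward search can be wrapped in a small helper. The function name rindex below is hypothetical (lists have no such method); it is just a sketch of the idea.

```python
def rindex(items, value):
    """Return the index of the last occurrence of value in items."""
    for i in range(len(items) - 1, -1, -1):
        if items[i] == value:
            return i
    raise ValueError('%r is not in list' % (value,))

address = ['123', 'Fakeville', 'St', 'SW', 'San Francisco', 'CA', '90215']
town_idx = address.index('San Francisco')
# search only the part of the list that comes before the city
print(rindex(address[:town_idx], 'St'))
```

Like list.index, the helper raises ValueError when the value is missing, so the caller can handle addresses without a street type.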

for items in address:
    if searchWord in items:
        for each in reversed(address[0:address.index(searchWord)]):
            if each == 'St':
                print(each)
Once you find the city, traverse the list in reverse using reversed.

Remove numbers conditionally?

I'm sorry if the title isn't very descriptive. I don't exactly know how to sum up my problem in a few words.
Here's my issue. I'm cleaning addresses and some of them are causing some issues.
I have a list of delimiters (avenue, street, road, place, etc etc etc) named patterns.
Let's say I have this address for example: SUITE 1603 200 PARK AVENUE SOUTH NEW YORK
I would like the output to be SUITE 200 PARK AVENUE SOUTH NEW YORK
Is there any way I could somehow look to see if there are 2 batches of numbers (in this case 1603 and 200) before one of my patterns and if so, strip the first batch of numbers from my string? i.e remove 1603 and keep 200.
Update: I've added this line to my code:
address = re.sub("\d+", "", address) however it's currently removing all the numbers. I thought that by putting ,1 after address it would only remove the first occurrence but that wasn't the case

If you want to apply this replacement only when one of your "separator" words is used, and only when there are two numbers, you can use a fancier regular expression.
import re
pattern = r"\d+ +(\d+ .*(STREET|AVENUE|ROAD|WHATEVER))"
input = "SUITE 1603 200 PARK AVENUE SOUTH NEW YORK"
output = re.sub(pattern, "\\1", input)
print(output) #SUITE 200 PARK AVENUE SOUTH NEW YORK
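For the simpler case from the question's update (drop only the first run of digits, whatever follows), passing count=1 to re.sub should be enough. A minimal sketch, assuming the first number is always the one to remove:

```python
import re

address = "SUITE 1603 200 PARK AVENUE SOUTH NEW YORK"
# count=1 stops after the first run of digits; \s+ also removes the
# whitespace that followed it, so no double space is left behind.
cleaned = re.sub(r"\d+\s+", "", address, count=1)
print(cleaned)
```

Unlike the pattern above, this removes the first number unconditionally, even when the address contains only one number.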

Your description of what you want to do isn't very clear, but if I understand correctly, you want to delete the first occurrence of a number sequence?
You could do this without using a regex:
s = 'SUITE 1603 200 PARK AVENUE SOUTH NEW YORK'
l = s.split(' ')
for i, w in enumerate(l):
    if any(c.isdigit() for c in w):
        del l[i]
        break  # stop after removing the first word that contains a digit
print(' '.join(l))
Output: SUITE 200 PARK AVENUE SOUTH NEW YORK

Rearrange position of words in string conditionally

I've spent the last few months developing a program that my company is using to clean and geocode addresses on a large scale (~5,000/day). It is functioning adequately well, however, there are certain address formats that I see daily that are causing issues for me.
Addresses with a format such as this park avenue 1 are causing issues with my geocoding. My thought process to tackle this issue is as follows:
Split the address into a list
Find the index of my delimiter word in the list. The delimiter words are words such as avenue, street, road, etc. I have a list of these delimiters called patterns.
Check to see if the word immediately following the delimiter is composed of digits with a length of 4 or less. If the number has a length of higher than 4 it is likely to be a zip code, which I do not need. If it's less than 4 it will most likely be the house number.
If the word meets the criteria that I explained in the previous step, I need to move it to the first position in the list.
Finally, I will put the list back together into a string.
Here is my initial attempt at putting my thoughts into code:
patterns = ['my list of delimiters']
address = 'park avenue 1' # this is an example address
address = address.split(' ')
for pattern in patterns:
    location = address.index(pattern) + 1
    if address[location].isdigit() and len(address[location]) <= 4:
        # here is where I'm getting a bit confused
        # what would be a good way to go about moving the word to the first position in the list
        pass
address = ' '.join(address)
Any help would be appreciated. Thank you folks in advance.

Make the string address[location] into a list by wrapping it in brackets, then concatenate the other pieces.
address = [address[location]] + address[:location] + address[location+1:]
An example:
address = ['park', 'avenue', '1']
location = 2
address = [address[location]] + address[:location] + address[location+1:]
print(' '.join(address)) # => '1 park avenue'

Here's a modified version of your code. It uses simple list slicing to rearrange the parts of the address list.
Rather than using a for loop to search for a matching road type it uses set operations.
This code isn't perfect: it won't catch "numbers" like 12a, and it won't handle weird street names like "Avenue Road".
road_patterns = {'avenue', 'street', 'road', 'lane'}
def fix_address(address):
    address_list = address.split()
    road = road_patterns.intersection(address_list)
    if len(road) == 0:
        print("Can't find a road pattern in ", address_list)
    elif len(road) > 1:
        print("Ambiguous road pattern in ", address_list, road)
    else:
        road = road.pop()
        index = address_list.index(road) + 1
        if index < len(address_list):
            number = address_list[index]
            if number.isdigit() and len(number) <= 4:
                address_list = [number] + address_list[:index] + address_list[index + 1:]
                address = ' '.join(address_list)
    return address
addresses = (
    '42 tobacco road',
    'park avenue 1 a',
    'penny lane 17',
    'nonum road 12345',
    'strange street 23 london',
    'baker street 221b',
    '37 gasoline alley',
    '83 avenue road',
)
for address in addresses:
    fixed = fix_address(address)
    print('{!r} -> {!r}'.format(address, fixed))
output
'42 tobacco road' -> '42 tobacco road'
'park avenue 1 a' -> '1 park avenue a'
'penny lane 17' -> '17 penny lane'
'nonum road 12345' -> 'nonum road 12345'
'strange street 23 london' -> '23 strange street london'
'baker street 221b' -> 'baker street 221b'
Can't find a road pattern in ['37', 'gasoline', 'alley']
'37 gasoline alley' -> '37 gasoline alley'
Ambiguous road pattern in ['83', 'avenue', 'road'] {'avenue', 'road'}
'83 avenue road' -> '83 avenue road'
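To also accept "numbers" like 12a or 221b, the isdigit() test could be swapped for a small regex check. This is only a sketch: the rule that a house number is 1 to 4 digits optionally followed by a single letter is an assumption, and is_house_number is a hypothetical helper name.

```python
import re

# Assumption: a house number is 1-4 digits optionally followed by one letter.
HOUSE_NUMBER = re.compile(r'\d{1,4}[a-z]?', re.IGNORECASE)

def is_house_number(token):
    # fullmatch requires the whole token to fit the pattern
    return HOUSE_NUMBER.fullmatch(token) is not None

for token in ('221b', '17', '12345', 'london'):
    print(token, is_house_number(token))
```

The five-digit limit still keeps things that look like zip codes out.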

Inserting ',' based on conditionals within a STR - Python

I am working with a very long list of street names that look like this:
1820 W 9000 SWest Jordan
455 S 500 ESalt Lake City
555 S 200 WBountiful
1000 N Green Valley PkwyHenderson
10100 W Tropicana AveLas Vegas
10305 S 1300 ESandy
10600 Southern Highlands PkwyLas Vegas
10616 S Eastern AveHenderson
111 Coors Blvd NWAlbuquerque
1170 E Gentile StLayton
1174 W 600 NSalt Lake City
1200 W Main StRiverton
....
....
I am trying to insert a ',' before the city name, which it appears is always after a lowercase character followed by NO SPACE and an UPPERCASE character.
So this is my thinking:
How do I write something that says, more or less,:
for cities in lst:
if [char] is lower and [nextchar] is UPPER:
[insert] ',' before UPPER

Following @Martijn's suggestion to take the last uppercase letter in a group, maybe:
import re
def fix(s):
    return re.sub("([a-z]|[A-Z]+)([A-Z])", r"\1,\2", s)
which gives
>>> for line in lines:
...     print(fix(line))
...
1820 W 9000 S,West Jordan
455 S 500 E,Salt Lake City
555 S 200 W,Bountiful
1000 N Green Valley Pkwy,Henderson
10100 W Tropicana Ave,Las Vegas
10305 S 1300 E,Sandy
10600 Southern Highlands Pkwy,Las Vegas
10616 S Eastern Ave,Henderson
111 Coors Blvd NW,Albuquerque
1170 E Gentile St,Layton
1174 W 600 N,Salt Lake City
1200 W Main St,Riverton
[Disclaimer: I'm terrible with regexes.]

Something like this?
for big_index, cities in enumerate(lst):
    for index, char in enumerate(cities[:-1]):
        if char.islower() and cities[index + 1].isupper():
            lst[big_index] = cities[:index + 1] + "," + cities[index + 1:]
            break
Disclaimer** Not tested. Since I don't have all your data, I won't attempt it, but this should give the output you're describing
**Edit: In fact, it doesn't look like your data follows these rules at all. Like the example in the comments, what about Coors Blvd NWAlbuquerque? Anyway, I'll keep the code here unless you change your question

cities = [re.sub(r'(?<=[a-z])(?=[A-Z])', ',', x) for x in cities]
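The one-liner above uses a lookbehind and a lookahead, so it matches the empty position between a lowercase and an uppercase letter and inserts the comma there. On its own that rule misses rows like 1820 W 9000 SWest Jordan. A possible extension, sketched below with a second alternative that is my own assumption about the data (the city starts with a capital followed by a lowercase letter), might look like this; BOUNDARY and insert_comma are hypothetical names.

```python
import re

# First alternative: the boundary between a lowercase and an uppercase letter.
# Second alternative (an added assumption): the boundary before the last
# capital in a run of capitals, when that capital starts a capitalised word.
BOUNDARY = re.compile(r'(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])')

def insert_comma(line):
    return BOUNDARY.sub(',', line)

for line in ('1820 W 9000 SWest Jordan', '10100 W Tropicana AveLas Vegas'):
    print(insert_comma(line))
```

Names with an internal capital (McDonald, for instance) would pick up a stray comma, so this only suits data shaped like the sample.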

This would be a solution without regex, basically implementing your logic:
new_list = []
for line in big_list:
    for c in range(len(line)-1):
        if 97 <= ord(line[c]) <= 122 and 65 <= ord(line[c+1]) <= 90:
            line = line[:c+1]+","+line[c+1:]
            break
    new_list.append(line)
>>> new_list
['1820 W 9000 SWest Jordan', '455 S 500 ESalt Lake City', '555 S 200 WBountiful', '1000 N Green Valley Pkwy,Henderson', '10100 W Tropicana Ave,Las Vegas', '10305 S 1300 ESandy', '10600 Southern Highlands Pkwy,Las Vegas', '10616 S Eastern Ave,Henderson', '111 Coors Blvd NWAlbuquerque', '1170 E Gentile St,Layton', '1174 W 600 NSalt Lake City', '1200 W Main St,Riverton']
In case you're wondering what the ord function does: It translates a character into ASCII code. In ASCII, lower-case letters are bound to 97-122, so if ord(char) is in that range, it's a lower-case letter. Same goes for upper-case letters, except they're bound to 65-90.

Your core problem is the missing space between the street and the state portions. Assuming a solution where you have already split this string into component address parts (perhaps using @jazzpi's solution), you can solve this secondary problem by building a collection of strings that match postal designations, such as ['Ave', 'E', 'Pkwy'] and so on, then look for matches to that collection on the left end of the state string.
Once you find a match, check to see if removing that sub-string leaves the state with an initial capital letter. If it leaves an initial capital letter intact, then you are free to truncate the substring and append the truncated street designation to the street portion.
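A rough sketch of that idea, assuming the address has already been split into a street part and a city part. The DESIGNATIONS list is only a small sample and split_designation is a hypothetical helper name; neither comes from the question.

```python
# Assumption: a small sample of postal designations; a real list would be longer.
DESIGNATIONS = ['Pkwy', 'Blvd', 'Ave', 'St', 'NW', 'N', 'S', 'E', 'W']

def split_designation(street, city):
    """Move a designation fused onto the front of city back onto street."""
    # try longer designations first so 'NW' wins over 'N'
    for d in sorted(DESIGNATIONS, key=len, reverse=True):
        rest = city[len(d):]
        # only truncate if what remains still starts with a capital letter
        if city.startswith(d) and rest[:1].isupper():
            return (street + ' ' + d).strip(), rest
    return street, city

print(split_designation('455 S 500', 'ESalt Lake City'))
print(split_designation('1200 W Main St', 'Riverton'))
```

The capital-letter check is what keeps a city such as Salt Lake City from losing its leading S.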
