Split at first and last occurrence of a character? - python

I have a list of strings as such (amount, address, payment):
"44.53 54 orchard rd Cash"
"32.34 600 sprout brook lane Card"
I am just trying to get the address from each string. It seems to me the best way to go about this would be to split at the first and last occurrence of a space. Is there any way to do this?

Python's split method is defined like this: str.split(sep=None, maxsplit=-1).
Similarly, there is str.rsplit(sep=None, maxsplit=-1).
This means that you can split off just the beginning and the ending:
>>> s = "44.53 54 orchard rd Cash"
>>> s.split(maxsplit=1)
['44.53', '54 orchard rd Cash']
>>> s.rsplit(maxsplit=1)
['44.53 54 orchard rd', 'Cash']
Then, to simply split the string into 3, you can write a simple function:
>>> def purchase_parts(purchase):
...     lsplit = purchase.split(maxsplit=1)
...     rsplit = lsplit[1].rsplit(maxsplit=1)
...     return (lsplit[0], rsplit[0], rsplit[1])
...
>>> purchase_parts("44.53 54 orchard rd Cash")
('44.53', '54 orchard rd', 'Cash')
>>> purchase_parts("32.34 600 sprout brook lane Card")
('32.34', '600 sprout brook lane', 'Card')
Still, I would suggest switching to a delimiter-separated format, because then you can simply split on that delimiter, and you also get direct support for importing and exporting CSV (comma-separated values) files.
Manual solution:
>>> [p.strip() for p in "32.34, 600 sprout brook lane, Card".split(',')]
['32.34', '600 sprout brook lane', 'Card']
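If the data can be stored comma-separated as suggested, the standard csv module can take over the parsing. This is a minimal sketch, assuming rows laid out like the manual example above; the sample data and variable names are illustrative only.

```python
import csv
import io

# Illustrative sample data in the suggested comma-separated layout
raw = "44.53, 54 orchard rd, Cash\n32.34, 600 sprout brook lane, Card\n"

# skipinitialspace=True ignores the whitespace that follows each delimiter
reader = csv.reader(io.StringIO(raw), skipinitialspace=True)
rows = [row for row in reader]

# Column 1 holds the address in this assumed layout
addresses = [row[1] for row in rows]
print(addresses)
```

With a real file, the io.StringIO object would be replaced by an open file handle.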

You could potentially do something like:
line = "44.53 54 orchard rd Cash"
line_parts = line.split(" ")
address = " ".join(line_parts[1:-1])
It's a bit untidy and definitely brittle to changes in line format, but would do the job.

You can follow your idea of dropping the first and last parts: split on whitespace, then join the middle parts back together:
def get_address(s):
    s = s.split()
    return ' '.join(s[1:-1])
# s[1:-1] removes the first (amount) and the last (payment) values
# ' '.join then puts back the spaces that s.split removed from the address
Input:
print(get_address("44.53 54 orchard rd Cash"))
print(get_address("32.34 600 sprout brook lane Cash"))
Output:
54 orchard rd
600 sprout brook lane

You could also use a regular expression to be a bit more flexible and robust. Here, the two \d+ elements around \. match the amount: two runs of digits separated by a dot, followed by a space. The parenthesized group ([\w\W]+) captures the address: the character class [\w\W] combines word characters (\w) and non-word characters (\W), so it matches any character, spaces included. The final space and \w+ match the last word (the payment type), which stays outside the captured group.
import re
addresses = [
"44.53 54 orchard rd Cash",
"32.34 600 sprout brook lane Card"
]
addresses = [re.findall(r'\d+\.\d+ ([\w\W]+) \w+', address)[0] for address in addresses]
print(addresses) # ['54 orchard rd', '600 sprout brook lane']
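If all three fields are needed rather than just the address, named groups keep the result readable. This is only a sketch under the assumption that every line has the shape "amount address payment"; LINE_RE and parse_line are hypothetical names, not part of the answer above.

```python
import re

# Assumed layout: "<amount> <address> <payment>", with the amount as digits.digits
LINE_RE = re.compile(r'(?P<amount>\d+\.\d+) (?P<address>.+) (?P<payment>\w+)')

def parse_line(line):
    # fullmatch requires the whole line to fit the pattern; otherwise return None
    match = LINE_RE.fullmatch(line)
    return match.groupdict() if match else None

print(parse_line("32.34 600 sprout brook lane Card"))
```

Returning None for lines that do not fit makes malformed input easy to spot instead of raising an IndexError.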

You could get the first and last parts using unpacking and then reassemble the rest to form the address:
amount, *rest, payment = s.split()
address = " ".join(rest)
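Wrapped in a function, the unpacking approach might look like the sketch below. The name split_purchase is made up for illustration, and the code assumes each line has at least two whitespace-separated tokens; shorter lines raise a ValueError.

```python
def split_purchase(line):
    # Star-unpacking collects every middle token into `rest`
    amount, *rest, payment = line.split()
    return amount, " ".join(rest), payment

print(split_purchase("44.53 54 orchard rd Cash"))
```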

Related

Remove numbers conditionally?

I'm sorry if the title isn't very descriptive. I don't exactly know how to sum up my problem in a few words.
Here's my issue. I'm cleaning addresses and some of them are causing some issues.
I have a list of delimiters (avenue, street, road, place, etc etc etc) named patterns.
Let's say I have this address for example: SUITE 1603 200 PARK AVENUE SOUTH NEW YORK
I would like the output to be SUITE 200 PARK AVENUE SOUTH NEW YORK
Is there any way I could somehow look to see if there are 2 batches of numbers (in this case 1603 and 200) before one of my patterns and if so, strip the first batch of numbers from my string? i.e remove 1603 and keep 200.
Update: I've added this line to my code:
address = re.sub("\d+", "", address)
However, it's currently removing all the numbers. I thought that putting ,1 after address would make it remove only the first occurrence, but that wasn't the case.
If you want to apply this replacement only when one of your "separator" words is used, and only when there are two numbers, you can use a fancier regular expression.
import re
pattern = r"\d+ +(\d+ .*(STREET|AVENUE|ROAD|WHATEVER))"
input = "SUITE 1603 200 PARK AVENUE SOUTH NEW YORK"
output = re.sub(pattern, "\\1", input)
print(output) #SUITE 200 PARK AVENUE SOUTH NEW YORK
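The same idea can be driven by a list of delimiter words instead of a hard-coded alternation. In the sketch below, the patterns list is a hypothetical stand-in for the asker's list and drop_first_number is an illustrative name. The count keyword of re.sub limits the number of replacements, which is the mechanism for "only the first occurrence".

```python
import re

# Hypothetical delimiter list standing in for the asker's `patterns`
patterns = ["STREET", "AVENUE", "ROAD", "PLACE"]

# re.escape guards against delimiters that contain regex metacharacters
alternation = "|".join(re.escape(p) for p in patterns)
two_numbers = re.compile(r"\d+ +(\d+ .*\b(?:" + alternation + r")\b)")

def drop_first_number(address):
    # count=1 limits the substitution to the first match
    return two_numbers.sub(r"\1", address, count=1)

print(drop_first_number("SUITE 1603 200 PARK AVENUE SOUTH NEW YORK"))
print(drop_first_number("200 PARK AVENUE SOUTH NEW YORK"))
```

Addresses with only one number before the delimiter do not match the pattern and come back unchanged.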
Your description of what you want to do isn't very clear, but if I understand correctly, you want to delete the first occurrence of a number sequence?
You could do this without using a regex:
s = 'SUITE 1603 200 PARK AVENUE SOUTH NEW YORK'
words = s.split(' ')
for i, w in enumerate(words):
    if any(c.isdigit() for c in w):
        # remove only the first word that contains a digit, then stop
        del words[i]
        break
print(' '.join(words))
Output: SUITE 200 PARK AVENUE SOUTH NEW YORK

Rearrange position of words in string conditionally

I've spent the last few months developing a program that my company is using to clean and geocode addresses on a large scale (~5,000/day). It is functioning adequately well, however, there are certain address formats that I see daily that are causing issues for me.
Addresses in a format such as park avenue 1 are causing issues with my geocoding. My thought process to tackle this issue is as follows:
Split the address into a list
Find the index of my delimiter word in the list. The delimiter words are words such as avenue, street, road, etc. I have a list of these delimiters called patterns.
Check to see if the word immediately following the delimiter is composed of digits with a length of 4 or less. If the number is longer than 4 digits, it is likely to be a zip code, which I do not need. If it has 4 digits or fewer, it will most likely be the house number.
If the word meets the criteria that I explained in the previous step, I need to move it to the first position in the list.
Finally, I will put the list back together into a string.
Here is my initial attempt at putting my thoughts into code:
patterns = ['my list of delimiters']
address = 'park avenue 1' # this is an example address
address = address.split(' ')
for pattern in patterns:
    location = address.index(pattern) + 1
    if address[location].isdigit() and len(address[location]) <= 4:
        # here is where i'm getting a bit confused
        # what would be a good way to go about moving the word to the first position in the list
        pass
address = ' '.join(address)
Any help would be appreciated. Thank you folks in advance.
Make the string address[location] into a list by wrapping it in brackets, then concatenate the other pieces.
address = [address[location]] + address[:location] + address[location+1:]
An example:
address = ['park', 'avenue', '1']
location = 2
address = [address[location]] + address[:location] + address[location+1:]
print(' '.join(address)) # => '1 park avenue'
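An alternative to slicing is to pop the element and insert it at the front. The helper below is a sketch with a made-up name (move_to_front); it copies the list first so the caller's list is not modified.

```python
def move_to_front(words, location):
    # Work on a copy so the caller's list is left unchanged
    words = list(words)
    # pop removes the element at `location`; insert puts it at index 0
    words.insert(0, words.pop(location))
    return words

print(" ".join(move_to_front(["park", "avenue", "1"], 2)))
```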
Here's a modified version of your code. It uses simple list slicing to rearrange the parts of the address list.
Rather than using a for loop to search for a matching road type it uses set operations.
This code isn't perfect: it won't catch "numbers" like 12a, and it won't handle weird street names like "Avenue Road".
road_patterns = {'avenue', 'street', 'road', 'lane'}
def fix_address(address):
    address_list = address.split()
    road = road_patterns.intersection(address_list)
    if len(road) == 0:
        print("Can't find a road pattern in ", address_list)
    elif len(road) > 1:
        print("Ambiguous road pattern in ", address_list, road)
    else:
        road = road.pop()
        index = address_list.index(road) + 1
        if index < len(address_list):
            number = address_list[index]
            if number.isdigit() and len(number) <= 4:
                address_list = [number] + address_list[:index] + address_list[index + 1:]
                address = ' '.join(address_list)
    return address
addresses = (
    '42 tobacco road',
    'park avenue 1 a',
    'penny lane 17',
    'nonum road 12345',
    'strange street 23 london',
    'baker street 221b',
    '37 gasoline alley',
    '83 avenue road',
)
for address in addresses:
    fixed = fix_address(address)
    print('{!r} -> {!r}'.format(address, fixed))
output
'42 tobacco road' -> '42 tobacco road'
'park avenue 1 a' -> '1 park avenue a'
'penny lane 17' -> '17 penny lane'
'nonum road 12345' -> 'nonum road 12345'
'strange street 23 london' -> '23 strange street london'
'baker street 221b' -> 'baker street 221b'
Can't find a road pattern in ['37', 'gasoline', 'alley']
'37 gasoline alley' -> '37 gasoline alley'
Ambiguous road pattern in ['83', 'avenue', 'road'] {'avenue', 'road'}
'83 avenue road' -> '83 avenue road'
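One possible way to also accept "numbers" like 12a or 221b is to replace the isdigit() test with a small regular expression. This is a sketch, not part of the answer above: the rule "1 to 4 digits plus an optional single letter" is an assumption, and HOUSE_NUMBER and looks_like_house_number are hypothetical names.

```python
import re

# Assumption: a house number is 1-4 digits with an optional single trailing letter
HOUSE_NUMBER = re.compile(r"\d{1,4}[a-zA-Z]?")

def looks_like_house_number(token):
    # fullmatch rejects tokens with anything before or after the number
    return HOUSE_NUMBER.fullmatch(token) is not None

for token in ("17", "221b", "12345", "london"):
    print(token, looks_like_house_number(token))
```

Swapping this check into fix_address would change the result for 'baker street 221b', so it is worth testing against real data first.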

How can I extract address from raw text using NLTK in python?

I have this text
'''Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, New
York, NY 12345. Can you contact him now? If you need any help, call
me on 12345678'''
How can the address part be extracted from the above text using NLTK? I have tried the Stanford NER Tagger, which gives me only New York as a Location. How can I solve this?
Definitely regular expressions :)
Something like
import re
txt = ...
regexp = "[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"
address = re.findall(regexp, txt)
# address = ['44 West 22nd Street, New York, NY 12345']
Explanation:
[0-9]{1,3}: 1 to 3 digits, the address number
(space): a space between the number and the street name
.+: street name, any character for any number of occurrences
,: a comma and a space before the city
.+: city, any character for any number of occurrences
,: a comma and a space before the state
[A-Z]{2}: exactly 2 uppercase chars from A to Z
[0-9]{5}: 5 digits
re.findall(expr, string) will return a list with all the occurrences found.
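One caveat when running this on the multi-line text from the question: "." does not match a newline by default, so the line break between "New" and "York" prevents a match. The sketch below collapses all whitespace runs into single spaces first; that normalization step is an addition, not part of the answer above.

```python
import re

raw = '''Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, New
York, NY 12345. Can you contact him now? If you need any help, call
me on 12345678'''

regexp = r"[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"

# Without normalization the newline inside the address blocks the match
print(re.findall(regexp, raw))

# Collapse every whitespace run (including newlines) into a single space
txt = " ".join(raw.split())
print(re.findall(regexp, txt))
```

Passing re.DOTALL would also let "." cross the line break, but the returned address would then still contain the newline.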
pyap works well, not just for this particular example but also for other addresses embedded in text.
import pyap
text = ...
addresses = pyap.parse(text, country='US')
Check out libpostal, a library dedicated to address parsing and normalization.
It cannot extract addresses from raw text, but it may help in related tasks.
For US address extraction from bulk text:
For US addresses in bulk text I have had pretty good, though not perfect, luck with the regex below. It won't work on many of the odder address formats, and it only captures the first 5 digits of the zip.
Explanation:
([0-9]{1,5}) - a string of 1-5 digits to start off
(.{5,75}) - Any character 5-75 times. I looked at the addresses I was interested in and the vast majority were over 5 and under 60 characters for the address line 1, address 2 and city.
(BIG LIST OF AMERICAN STATES AND ABBREVIATIONS) - This is to match on states. Assumes state names will be Title Case.
.{1,2} - designed to accommodate many permutations of ,\s or just \s between the state and the zip
([0-9]{5}) - captures the first 5 digits of the zip.
import re
text = "is an individual maintaining a residence at 175 Fox Meadow, Orchard Park, NY 14127. 2. other,"
address_regex = r"([0-9]{1,5})(.{5,75})((?:Ala(?:(?:bam|sk)a)|American Samoa|Arizona|Arkansas|(?:^(?!Baja )California)|Colorado|Connecticut|Delaware|District of Columbia|Florida|Georgia|Guam|Hawaii|Idaho|Illinois|Indiana|Iowa|Kansas|Kentucky|Louisiana|Maine|Maryland|Massachusetts|Michigan|Minnesota|Miss(?:(?:issipp|our)i)|Montana|Nebraska|Nevada|New (?:Hampshire|Jersey|Mexico|York)|North (?:(?:Carolin|Dakot)a)|Ohio|Oklahoma|Oregon|Pennsylvania|Puerto Rico|Rhode Island|South (?:(?:Carolin|Dakot)a)|Tennessee|Texas|Utah|Vermont|Virgin(?:ia| Island(s?))|Washington|West Virginia|Wisconsin|Wyoming|A[KLRSZ]|C[AOT]|D[CE]|FL|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]|N[CDEHJMVY]|O[HKR]|P[AR]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])).{1,2}([0-9]{5})"
addresses = re.findall(address_regex, text)
addresses is then: [('175', ' Fox Meadow, Orchard Park, ', 'NY', '', '14127')]
You can combine these and remove spaces like so:
for address in addresses:
    out_address = " ".join(address)
    out_address = " ".join(out_address.split())
To then break this into a proper line 1, line 2, etc., I suggest using an address validation API like Google or Lob. These can take a string and break it into parts. There are also some Python solutions for this, like usaddress.

Split long string of addresses into list of addresses in Python

I have a string of a couple thousand addresses in python, like such:
'123 Chestnut Way 4567 Oak Lane 890 South Pine Court'
What is the easiest way to split this long string into separate addresses? I'm trying to write a program that splits based on 3 or 4 characters in a row where 47 < ord(i) < 58, but I'm having trouble.
Assuming all of the addresses are like those given, you can use re.findall:
>>> from re import findall
>>> string = '123 Chestnut Way 4567 Oak Lane 890 South Pine Court'
>>> findall(r"\d+\D+(?=\s\d|$)", string)
['123 Chestnut Way', '4567 Oak Lane', '890 South Pine Court']
All of the Regex syntax used above is explained here, but below is a quick breakdown:
\d+ # One or more digits
\D+ # One or more non-digits
(?= # The start of a lookahead assertion
\s\d # A whitespace character followed by a digit
|$ # Or the end of the string
) # The end of the lookahead assertion
You can do this with regular expressions fairly easily,
import re
txt = '123 Chestnut Way 4567 Oak Lane 890 South Pine Court'
re.findall( r'\d+', txt )
The last line returns all runs of digits:
['123', '4567', '890']
You can then use that information to parse the string. There are lots of ways, but you could just find the index of each number in the original string and take the text in between. You could also make the regular expression a little more advanced. The following will match any number of digits, followed by a space, followed by any number of non-digits (including spaces):
re.findall( r'\d+ \D+', txt )
and will return,
['123 Chestnut Way ', '4567 Oak Lane ', '890 South Pine Court']
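To avoid the trailing spaces, one option is to split instead of match, using a lookahead so the digits are not consumed. This is a sketch that assumes digits appear only at the start of each address; street names such as "22nd" or unit numbers would break it.

```python
import re

txt = '123 Chestnut Way 4567 Oak Lane 890 South Pine Court'

# Split on whitespace only where the next character is a digit;
# the lookahead (?=\d) checks for the digit without consuming it
addresses = re.split(r'\s+(?=\d)', txt)
print(addresses)
```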

Inserting ',' based on conditionals within a STR - Python

I am working with a very long list of street names that look like this:
1820 W 9000 SWest Jordan
455 S 500 ESalt Lake City
555 S 200 WBountiful
1000 N Green Valley PkwyHenderson
10100 W Tropicana AveLas Vegas
10305 S 1300 ESandy
10600 Southern Highlands PkwyLas Vegas
10616 S Eastern AveHenderson
111 Coors Blvd NWAlbuquerque
1170 E Gentile StLayton
1174 W 600 NSalt Lake City
1200 W Main StRiverton
....
....
I am trying to insert a ',' before the city name, which it appears is always after a lowercase character followed by NO SPACE and an UPPERCASE character.
So this is my thinking. How do I write something that says, more or less:
for cities in lst:
    if [char] is lower and [nextchar] is UPPER:
        [insert] ',' before UPPER
Following @Martijn's suggestion to take the last uppercase letter in a group, maybe:
import re
def fix(s):
    return re.sub("([a-z]|[A-Z]+)([A-Z])", r"\1,\2", s)
which gives
>>> for line in lines:
...     print(fix(line))
...
1820 W 9000 S,West Jordan
455 S 500 E,Salt Lake City
555 S 200 W,Bountiful
1000 N Green Valley Pkwy,Henderson
10100 W Tropicana Ave,Las Vegas
10305 S 1300 E,Sandy
10600 Southern Highlands Pkwy,Las Vegas
10616 S Eastern Ave,Henderson
111 Coors Blvd NW,Albuquerque
1170 E Gentile St,Layton
1174 W 600 N,Salt Lake City
1200 W Main St,Riverton
[Disclaimer: I'm terrible with regexes.]
Something like this?
for big_index, cities in enumerate(lst):
    for index, char in enumerate(cities[:-1]):
        # islower()/isupper() are False for digits and spaces, unlike comparing with lower()
        if char.islower() and cities[index + 1].isupper():
            lst[big_index] = cities[:index + 1] + "," + cities[index + 1:]
            break
Disclaimer** Not tested. Since I don't have all your data, I won't attempt it, but this should give the output you're describing
**Edit: In fact, it doesn't look like your data follows these rules at all. Like the example in the comments, what about Coors Blvd NWAlbuquerque? Anyway, I'll keep the code here unless you change your question
cities = [re.sub(r'(?<=[a-z])(?=[A-Z])', ',', x) for x in cities]
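The one-liner above relies on two zero-width assertions: (?<=[a-z]) requires a lowercase letter just before the position and (?=[A-Z]) an uppercase letter just after it, so only the comma is inserted and no character is replaced. Below is a small runnable sketch on two of the sample lines; the variable names are illustrative. Lines where the street part ends in an uppercase letter, such as the NW case, are left unchanged by this pattern.

```python
import re

lines = ["1200 W Main StRiverton", "111 Coors Blvd NWAlbuquerque"]

# Both lookarounds are zero-width, so the match itself is an empty string
fixed = [re.sub(r'(?<=[a-z])(?=[A-Z])', ',', x) for x in lines]
print(fixed)
```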
This would be a solution without regex, basically implementing your logic:
new_list = []
for line in big_list:
    for c in range(len(line)-1):
        # lowercase letter (97-122) directly followed by an uppercase letter (65-90)
        if 97 <= ord(line[c]) <= 122 and 65 <= ord(line[c+1]) <= 90:
            line = line[:c+1]+","+line[c+1:]
            break
    new_list.append(line)
>>> new_list
['1820 W 9000 SWest Jordan', '455 S 500 ESalt Lake City', '555 S 200 WBountiful', '1000 N Green Valley Pkwy,Henderson', '10100 W Tropicana Ave,Las Vegas', '10305 S 1300 ESandy', '10600 Southern Highlands Pkwy,Las Vegas', '10616 S Eastern Ave,Henderson', '111 Coors Blvd NWAlbuquerque', '1170 E Gentile St,Layton', '1174 W 600 NSalt Lake City', '1200 W Main St,Riverton']
In case you're wondering what the ord function does: it returns the integer code point of a character, which for these characters is the same as the ASCII code. In ASCII, lower-case letters occupy 97-122, so if ord(char) is in that range, it's a lower-case letter. The same goes for upper-case letters, except they occupy 65-90.
Your core problem is the missing space between the street and the state portions. Assuming a solution where you have already split this string into component address parts (perhaps using @jazzpi's solution), you can solve this secondary problem by building a collection of strings that match postal designations, such as ['Ave', 'E', 'Pkwy'] and so on, then look for matches to that collection on the left end of the state string.
Once you find a match, check to see if removing that sub-string leaves the state with an initial capital letter. If it leaves an initial capital letter intact, then you are free to truncate the substring and append the truncated street designation to the street portion.
