I am working with a very long list of street names that look like this:
1820 W 9000 SWest Jordan
455 S 500 ESalt Lake City
555 S 200 WBountiful
1000 N Green Valley PkwyHenderson
10100 W Tropicana AveLas Vegas
10305 S 1300 ESandy
10600 Southern Highlands PkwyLas Vegas
10616 S Eastern AveHenderson
111 Coors Blvd NWAlbuquerque
1170 E Gentile StLayton
1174 W 600 NSalt Lake City
1200 W Main StRiverton
....
....
I am trying to insert a ',' before the city name, which it appears is always after a lowercase character followed by NO SPACE and an UPPERCASE character.
So this is my thinking:
How do I write something that says, more or less:
for cities in lst:
    if [char] is lower and [nextchar] is UPPER:
        [insert] ',' before UPPER
Following @Martijn's suggestion to take the last uppercase letter in a group, maybe:
import re
def fix(s):
    # a lowercase letter, or a run of uppercase letters, followed by one more uppercase letter
    return re.sub(r"([a-z]|[A-Z]+)([A-Z])", r"\1,\2", s)
which gives
>>> for line in lines:
...     print(fix(line))
...
1820 W 9000 S,West Jordan
455 S 500 E,Salt Lake City
555 S 200 W,Bountiful
1000 N Green Valley Pkwy,Henderson
10100 W Tropicana Ave,Las Vegas
10305 S 1300 E,Sandy
10600 Southern Highlands Pkwy,Las Vegas
10616 S Eastern Ave,Henderson
111 Coors Blvd NW,Albuquerque
1170 E Gentile St,Layton
1174 W 600 N,Salt Lake City
1200 W Main St,Riverton
[Disclaimer: I'm terrible with regexes.]
Something like this?
for big_index, cities in enumerate(lst):
    for index in range(len(cities) - 1):
        # a lowercase letter directly followed by an uppercase letter marks the boundary
        if cities[index].islower() and cities[index + 1].isupper():
            lst[big_index] = cities[:index + 1] + "," + cities[index + 1:]
            break
Disclaimer: Not tested. Since I don't have all your data, I won't attempt it, but this should give the output you're describing.
Edit: In fact, it doesn't look like your data follows these rules at all. Like the example in the comments, what about Coors Blvd NWAlbuquerque? Anyway, I'll keep the code here unless you change your question.
cities = [re.sub(r'(?<=[a-z])(?=[A-Z])', ',', x) for x in cities]
This would be a solution without regex, basically implementing your logic:
new_list = []
for line in big_list:
    for c in range(len(line) - 1):
        # lowercase letter (97-122) directly followed by an uppercase letter (65-90)
        if 97 <= ord(line[c]) <= 122 and 65 <= ord(line[c+1]) <= 90:
            line = line[:c+1] + "," + line[c+1:]
            break
    new_list.append(line)
>>> new_list
['1820 W 9000 SWest Jordan', '455 S 500 ESalt Lake City', '555 S 200 WBountiful', '1000 N Green Valley Pkwy,Henderson', '10100 W Tropicana Ave,Las Vegas', '10305 S 1300 ESandy', '10600 Southern Highlands Pkwy,Las Vegas', '10616 S Eastern Ave,Henderson', '111 Coors Blvd NWAlbuquerque', '1170 E Gentile St,Layton', '1174 W 600 NSalt Lake City', '1200 W Main St,Riverton']
In case you're wondering what the ord function does: it translates a character into its integer code point, which for plain English letters is the same as the ASCII code. In ASCII, lower-case letters occupy 97-122, so if ord(char) is in that range, it's a lower-case letter. The same goes for upper-case letters, except they occupy 65-90.
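As a small illustration of the ranges described above, here is a minimal sketch. The helper name is_boundary is hypothetical and only restates the test used in the loop.

```python
# ord() returns the integer code point of a single character.
print(ord("a"), ord("z"))  # 97 122
print(ord("A"), ord("Z"))  # 65 90


def is_boundary(left, right):
    """Hypothetical helper: True if `left` is a lowercase ASCII letter
    and `right` is an uppercase ASCII letter."""
    return 97 <= ord(left) <= 122 and 65 <= ord(right) <= 90


print(is_boundary("y", "H"))  # True  (as in "PkwyHenderson")
print(is_boundary("S", "W"))  # False (as in "SWest Jordan")
```

The second call shows why rows such as '1820 W 9000 SWest Jordan' come back unchanged: there is no lowercase letter in front of the city name.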
Your core problem is the missing space between the street and the city portions. Assuming a solution where you have already split this string into component address parts (perhaps using @jazzpi's solution), you can solve this secondary problem by building a collection of strings that match postal designations, such as ['Ave', 'E', 'Pkwy'] and so on, then looking for matches to that collection at the left end of the city string.
Once you find a match, check whether removing that substring leaves the city with an initial capital letter. If it does, you are free to cut the substring off the city and append it to the street portion.
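The idea above could be sketched roughly as follows. This is only an illustration under assumptions: the DESIGNATIONS list is a made-up sample rather than a complete set of postal designations, the function names are hypothetical, and the check is applied to each whitespace-separated word of the raw line instead of to an already separated field.

```python
# Assumed sample of designations; a real list would be much longer.
DESIGNATIONS = ["Pkwy", "Blvd", "Ave", "St", "NW", "NE", "SW", "SE", "N", "S", "E", "W"]


def split_fused(token):
    """Return (designation, remainder) if `token` is a designation glued to a
    capitalised word, otherwise None."""
    # Try longer designations first so that "NW" is preferred over "N".
    for d in sorted(DESIGNATIONS, key=len, reverse=True):
        rest = token[len(d):]
        if token.startswith(d) and rest[:1].isupper():
            return d, rest
    return None


def insert_comma(line):
    """Put a comma between the designation and the word fused to it."""
    tokens = line.split(" ")
    for i, tok in enumerate(tokens):
        parts = split_fused(tok)
        if parts:
            tokens[i] = parts[0] + "," + parts[1]
            break
    return " ".join(tokens)


print(insert_comma("1820 W 9000 SWest Jordan"))      # 1820 W 9000 S,West Jordan
print(insert_comma("111 Coors Blvd NWAlbuquerque"))  # 111 Coors Blvd NW,Albuquerque
```

The capital-letter check is what keeps ordinary words such as "Southern" or "Eastern" from being split after their first letter.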
I have a pandas dataframe column street_address with strings that look like this:
id | street_address
----------------------
1 | 3510 WILSHIRE BLVD #1500
2 | PO BOX 29043
3 | RE HIAM S ABU QARTUMI 4676 ADMIRALTY WAY STE 632
4 | RE: SOON, LEE YEE 3510 WILSHIRE BLVD #150
5 | LAW OFFICES OF JOE M DOE 133 SANDSTONE ST STE 901
6 | SUITE 940, 1500 N CENTRAL AVE
I want to remove the text before the numeric values (the actual address), but need to exclude PO Boxes and addresses that begin with a suite number.
I want the output to be something like this:
street_address
----------------------
3510 WILSHIRE BLVD #1500
PO BOX 29043
4676 ADMIRALTY WAY STE 632
3510 WILSHIRE BLVD #150
133 SANDSTONE ST STE 901
SUITE 940, 1500 N CENTRAL AVE
Thanks for your help!
EDIT
Thanks everyone for the help!
However, for my example I made it work by using replace
# When an address starts with a string,
# remove that string though the first number
# unless that string is similar to 'PO BOX' or 'SUITE'.
# This catches variants like
# PO BOX, P.O BOX, PMB, STE, Suite, ste, etc.
pattern = r"^(?![PO.\sBX]{2,}|[PMB]{2,}|[\d]|[SUITE])(\D+)(.+)"
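# regex=True is required in recent pandas versions, where str.replace defaults to literal matching
df['str_addr'] = df['street_address'].str.replace(pattern, r'\2', regex=True)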
Use the following regex:
r'^(?:(?!(PO BOX|SUITE|\d+)))([a-zA-Z :,]+)'
The first part, (?!(PO BOX|SUITE|\d+)), is a negative lookahead (wrapped in a non-capturing group) that only lets the match proceed on rows that don't start with a number, "PO BOX", or "SUITE". The second part, ([a-zA-Z :,]+), captures the leading text of the rows that pass that check, up to the first character outside the class, which is typically the first digit of the address. You can use this capture group to strip that prefix from the offending rows, leaving only the address. If the prefixes contain characters other than [a-zA-Z :,], add them to the bracketed list.
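A minimal sketch of how that pattern might be applied, using plain re.sub on a list of strings rather than a dataframe so that it runs without pandas. The function name strip_prefix is hypothetical; with pandas, the same pattern could presumably be passed to Series.str.replace with regex=True.

```python
import re

# The pattern from the answer above, unchanged.
PREFIX = re.compile(r'^(?:(?!(PO BOX|SUITE|\d+)))([a-zA-Z :,]+)')


def strip_prefix(street_address):
    """Hypothetical helper: drop the leading non-address text, if any."""
    return PREFIX.sub('', street_address)


rows = [
    "3510 WILSHIRE BLVD #1500",
    "PO BOX 29043",
    "RE HIAM S ABU QARTUMI 4676 ADMIRALTY WAY STE 632",
    "RE: SOON, LEE YEE 3510 WILSHIRE BLVD #150",
    "LAW OFFICES OF JOE M DOE 133 SANDSTONE ST STE 901",
    "SUITE 940, 1500 N CENTRAL AVE",
]
for row in rows:
    print(strip_prefix(row))
```

Rows that start with a digit, "PO BOX" or "SUITE" fail the lookahead at position 0, so the substitution leaves them untouched.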
I have a list of strings as such (amount, address, payment):
"44.53 54 orchard rd Cash"
"32.34 600 sprout brook lane Card"
I am just trying to get the address from each string. It seems to me the best way to go about this would be to split at the first and last occurrence of a space. Is there any way to do this?
Python's split method is defined like this: str.split(sep=None, maxsplit=-1).
Similarly, there is str.rsplit(sep=None, maxsplit=-1).
This means that you can split off just the beginning and the ending:
>>> s = "44.53 54 orchard rd Cash"
>>> s.split(maxsplit=1)
['44.53', '54 orchard rd Cash']
>>> s.rsplit(maxsplit=1)
['44.53 54 orchard rd', 'Cash']
Then, to simply split the string into 3, you can write a simple function:
>>> def purchase_parts(purchase):
...     lsplit = purchase.split(maxsplit=1)
...     rsplit = lsplit[1].rsplit(maxsplit=1)
...     return (lsplit[0], rsplit[0], rsplit[1])
...
>>> purchase_parts("44.53 54 orchard rd Cash")
('44.53', '54 orchard rd', 'Cash')
>>> purchase_parts("32.34 600 sprout brook lane Card")
('32.34', '600 sprout brook lane', 'Card')
Still, I would suggest switching to a delimiter-separated format, because then you can simply split on that separator, and you also get direct support for importing and exporting CSV (comma-separated values) files.
Manual solution:
>>> [p.strip() for p in "32.34, 600 sprout brook lane, Card".split(',')]
['32.34', '600 sprout brook lane', 'Card']
You could potentially do something like:
line = "44.53 54 orchard rd Cash"
line_parts = line.split(" ")
address = " ".join(line_parts[1:-1])
It's a bit untidy and definitely brittle to changes in line format, but would do the job.
You can use your method, splitting at the first and last spaces, but you need to join back the rest (in the middle):
def get_address(s):
    s = s.split()
    return ' '.join(s[1:-1])
    # s[1:-1] will remove the first (amount) and the last (payment) values
    # ' '.join will then put back the spaces that were removed from the address by s.split
Input:
print(get_address("44.53 54 orchard rd Cash"))
print(get_address("32.34 600 sprout brook lane Cash"))
Output:
54 orchard rd
600 sprout brook lane
You could also use a regular expression to be a bit more flexible and robust. Here, \d+\.\d+ first matches two runs of digits separated by a dot (the amount), followed by a space. The parenthesized group ([\w\W]+) is the returned result: its character class combines word characters (\w) and non-word characters (\W), so it matches any character at all. The match ends with a space and a final run of word characters (\w+), the payment type.
import re
addresses = [
"44.53 54 orchard rd Cash",
"32.34 600 sprout brook lane Card"
]
addresses = [re.findall(r'\d+\.\d+ ([\w\W]+) \w+', address)[0] for address in addresses]
print(addresses) # ['54 orchard rd', '600 sprout brook lane']
You could get the first and last parts using unpacking and then reassemble the rest to form the address:
amount,*rest,payment = s.split()
address = " ".join(rest)
I'm having difficulty using regex to solve this problem.
For example, given the call below:
regex_exp(address, "OG 56432")
It should return
"OG 56432: Middle Street Pollocksville | 686"
address is an array of strings:
address = [
"622 Gordon Lane St. Louisville OH 52071",
"432 Main Long Road St. Louisville OH 43071",
"686 Middle Street Pollocksville OG 56432"
]
My solution currently looks like this (Python):
import re
def regex_exp(address, zipcode):
    for i in address:
        if zipcode in i:
            postal_code = re.search(r"[A-Z]{2}\s[0-9]{5}", i).group(0)
            # returns "OG 56432"
            digits = re.search(r"\d+", i).group(0)
            # returns "686"
            address = re.search(r"\D+", i).group(0)
            # returns " Middle Street Pollocksville OG "
            print(postal_code + ":" + address + "| " + digits)
regex_exp(address, "OG 56432")
# prints OG 56432: Middle Street Pollocksville OG | 686
As you can see from my second paragraph, this is not the correct answer - I need the returned value to be
"OG 56432: Middle Street Pollocksville | 686"
How do I change the regex search for my address variable to exclude the two consecutive capital letters? I've tried things like
address = (re.search("?!\D+", x)).group(0)
to remove the two consecutive capitals based on A regular expression to exclude a word/string but I think this is a step in the wrong direction.
PS: I understand there are easier methods to solve this, but I want to use regex to improve my fundamentals
If you just want to remove the two consecutive capital letters that precede the zip code (a 5-digit number), use this:
import re
text = "432 Main Long PC Market Road St. Louisville OG 43071"
address = re.sub(r'([A-Z]{2}[\s]{1})(?=[\d]{5})','',text)
print(address)
# Output: 432 Main Long PC Market Road St. Louisville 43071
For removing all occurrences of two consecutive Capital Letters:
import re
text = "432 Main Long PC Market Road St. Louisville OG 43071"
# no lookahead here, so every pair of capitals followed by whitespace is removed
address = re.sub(r'([A-Z]{2}[\s]{1})','',text)
print(address)
# Output: 432 Main Long Market Road St. Louisville 43071
With re.sub() and group capturing you can use:
s = "686 Middle Street Pollocksville OG 56432"
# \s+ after the house number keeps the leading space out of group 2
re.sub(r"(\d+)\s+(.*)\s+([A-Z]+\s+\d+)", r"\3: \2 | \1", s)
Out: 'OG 56432: Middle Street Pollocksville | 686'
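To tie this back to the regex_exp(address, zipcode) call from the question, here is one possible sketch built on the same capturing-group idea. It assumes every line has the shape "number, street text, two capital letters, five digits", and it returns the string rather than printing it; both choices are assumptions, not something stated in the question.

```python
import re

# house number, street text, then "XX 12345" at the end of the line
ADDRESS_RE = re.compile(r"^(\d+)\s+(.*)\s+([A-Z]{2}\s\d{5})$")


def regex_exp(addresses, zipcode):
    """Return 'ZIP: street | number' for the first line ending in `zipcode`."""
    for line in addresses:
        m = ADDRESS_RE.match(line)
        if m and m.group(3) == zipcode:
            return "{}: {} | {}".format(m.group(3), m.group(2), m.group(1))
    return None  # no line matched


address = [
    "622 Gordon Lane St. Louisville OH 52071",
    "432 Main Long Road St. Louisville OH 43071",
    "686 Middle Street Pollocksville OG 56432",
]
print(regex_exp(address, "OG 56432"))
# OG 56432: Middle Street Pollocksville | 686
```

Because the postal code has its own group, the street text in group 2 never includes the two capital letters.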
I have a dataframe with 6000 records and need to split the street name column into "Streetname", "Housingnumber" and "Adjectives". Unfortunately, I have not been able to solve this with regex functions because the notation in df["streetname"] has no consistent structure:
**Input from df["Streetname"]**
St. edward's Lane 26
Vineyardlane3a
High Street 0-9
ParkRoad near #33
Queens Road ??
s-Georgelane9abc
Kings Road 9b
1st Park Avenue 67 near cyclelane
**Output that I would like:**
df["Street"] | df["housingnumber"] | df["adjective"]
----------------------
St. Edward's lane | 26 |
Vineyardlane | 3 | a
High Street | 0-9 |
ParkRoad | 33 |
Queens Road | |
s-Georgelane | 9 | abc
Kings Road | 9 | b
1st Park Avenue | 67 |
I tried this:
Filter = r'(?P<S>.*)(?P<H>\s[0-9].*)'
df["Streetname"] = df["Streetname"].str.extract(Filter)
I lose a lot of data and the result is only written into one column... Hope that someone can help!
Not 100% perfect (I doubt that this will be possible without a database or machine learning algorithms) but a starting point:
^ # start of line/string
(?P<street>\w+?\D+) # [a-zA-Z0-9_]+? followed by not a number
(?P<nr>\d*[-\d]*) # optional digits, possibly followed by - and further digits
(?P<adjective>[a-zA-Z]*) # optional letters directly after the number
.* # consume the rest of the string
See a demo on regex101.com.
You might want to strip off #, whitespace or ? from the end of street afterwards.
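As a rough sketch of how the pattern above might be used from Python, the following compiles it with re.VERBOSE and applies the clean-up step mentioned afterwards. The function name split_street is hypothetical, and plain strings are used instead of a dataframe so the example is self-contained.

```python
import re

PATTERN = re.compile(r"""
    ^                          # start of string
    (?P<street>\w+?\D+)        # word characters, then non-digits
    (?P<nr>\d*[-\d]*)          # optional house number, e.g. 26 or 0-9
    (?P<adjective>[a-zA-Z]*)   # optional letters right after the number
    .*                         # consume the rest of the string
""", re.VERBOSE)


def split_street(raw):
    """Return (street, number, adjective), or None if nothing matches."""
    m = PATTERN.match(raw)
    if m is None:
        return None
    # Strip trailing spaces, '#' and '?' left over at the end of the street.
    street = m.group("street").rstrip(" #?")
    return street, m.group("nr"), m.group("adjective")


print(split_street("Vineyardlane3a"))     # ('Vineyardlane', '3', 'a')
print(split_street("ParkRoad near #33"))  # ('ParkRoad near', '33', '')
print(split_street("Queens Road ??"))     # ('Queens Road', '', '')
```

Note that words such as "near" stay attached to the street, so the result is still only a starting point compared with the desired output.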
I'm sorry if the title isn't very descriptive. I don't exactly know how to sum up my problem in a few words.
Here's my issue. I'm cleaning addresses and some of them are causing some issues.
I have a list of delimiters (avenue, street, road, place, etc etc etc) named patterns.
Let's say I have this address for example: SUITE 1603 200 PARK AVENUE SOUTH NEW YORK
I would like the output to be SUITE 200 PARK AVENUE SOUTH NEW YORK
Is there any way I could somehow look to see if there are 2 batches of numbers (in this case 1603 and 200) before one of my patterns and if so, strip the first batch of numbers from my string? i.e remove 1603 and keep 200.
Update: I've added this line to my code:
address = re.sub("\d+", "", address) however it's currently removing all the numbers. I thought that by putting ,1 after address it would only remove the first occurrence but that wasn't the case
If you want to apply this replacement only when one of your "separator" words is used, and only when there are two numbers, you can use a fancier regular expression.
import re
pattern = r"\d+ +(\d+ .*(STREET|AVENUE|ROAD|WHATEVER))"
text = "SUITE 1603 200 PARK AVENUE SOUTH NEW YORK"  # avoid shadowing the built-in input()
output = re.sub(pattern, r"\1", text)
print(output)  # SUITE 200 PARK AVENUE SOUTH NEW YORK
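Since the question mentions an existing list of delimiters named patterns, the alternation could also be built from that list instead of being typed by hand. The sketch below assumes the list holds plain upper-case words; its contents and the function name drop_first_number are made up for illustration.

```python
import re

# Assumed contents; the question only says the list is named `patterns`.
patterns = ["AVENUE", "STREET", "ROAD", "PLACE"]


def drop_first_number(address, delimiters=patterns):
    """Remove the first of two consecutive numbers that precede a delimiter."""
    # re.escape guards against delimiters containing regex metacharacters.
    alternation = "|".join(re.escape(d) for d in delimiters)
    regex = re.compile(r"\d+ +(\d+ .*\b(?:%s)\b)" % alternation)
    return regex.sub(r"\1", address, count=1)


print(drop_first_number("SUITE 1603 200 PARK AVENUE SOUTH NEW YORK"))
# SUITE 200 PARK AVENUE SOUTH NEW YORK
print(drop_first_number("200 PARK AVENUE SOUTH NEW YORK"))
# 200 PARK AVENUE SOUTH NEW YORK
```

Addresses with only one number, or with no delimiter word at all, are returned unchanged.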
Your description of what you want to do isn't very clear, but if I understand correctly, you want to delete the first occurrence of a number sequence?
You could do this without using a regex,
s = 'SUITE 1603 200 PARK AVENUE SOUTH NEW YORK'
l = s.split(' ')
for i, w in enumerate(l):
    # delete the first word that contains a digit, then stop
    if any(c.isdigit() for c in w):
        del l[i]
        break
print(' '.join(l))
Output: SUITE 200 PARK AVENUE SOUTH NEW YORK