Hey,
I was wondering how I can find a Street Address in a string in Python/Ruby?
Perhaps by a regex?
Also, it's gonna be in the following format (US)
420 Fanboy Lane, Cupertino CA
Thanks!
Maybe you want to have a look at pypostal, the official Python bindings to libpostal.
Using the examples from Mike Bethany, I made this little example:
from postal.parser import parse_address
addresses = [
"420 Fanboy Lane, Cupertino CA 12345",
"1829 William Tell Oveture, by Gioachino Rossini 88421",
"114801 Western East Avenue Apt. B32, Funky Township CA 12345",
"1 Infinite Loop, Cupertino CA 12345-1234",
"420 time!",
]
for address in addresses:
    print(parse_address(address))
    print("*" * 60)
> [(u'420', u'house_number'), (u'fanboy lane', u'road'), (u'cupertino', u'city'), (u'ca', u'state'), (u'12345', u'postcode')]
> ************************************************************
> [(u'1829', u'house_number'), (u'william tell', u'road'), (u'oveture by gioachino', u'house'), (u'rossini', u'road'), (u'88421', u'postcode')]
> ************************************************************
> [(u'114801', u'house_number'), (u'western east avenue apt.', u'road'), (u'b32', u'postcode'), (u'funky', u'road'), (u'township', u'city'), (u'ca', u'state'), (u'12345', u'postcode')]
> ************************************************************
> [(u'1', u'house_number'), (u'infinite loop', u'road'), (u'cupertino', u'city'), (u'ca', u'state'), (u'12345-1234', u'postcode')]
> ************************************************************
> [(u'420', u'house_number'), (u'time !', u'house')]
> ************************************************************
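One way to turn that output into a yes/no answer is to check which labels libpostal assigned. The sketch below is only an illustration: looks_like_street_address is a hypothetical helper, not part of pypostal, and the parsed lists are copied from the output above so the snippet runs without libpostal installed.

```python
# Hypothetical helper (not part of pypostal): treat a parse as a street
# address when libpostal labelled both a house number and a road.
REQUIRED_LABELS = {"house_number", "road"}


def looks_like_street_address(parsed):
    labels = {label for _value, label in parsed}
    return REQUIRED_LABELS <= labels


# Parse results copied from the pypostal output shown earlier.
fanboy = [("420", "house_number"), ("fanboy lane", "road"),
          ("cupertino", "city"), ("ca", "state"), ("12345", "postcode")]
time_420 = [("420", "house_number"), ("time !", "house")]
rossini = [("1829", "house_number"), ("william tell", "road"),
           ("oveture by gioachino", "house"), ("rossini", "road"),
           ("88421", "postcode")]

print(looks_like_street_address(fanboy))    # True
print(looks_like_street_address(time_420))  # False
print(looks_like_street_address(rossini))   # True (a false positive)
```

The last case shows the limit of such a check: the parser always assigns labels, so a non-address that happens to get a house number and a road still passes.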
Using your example this is what I came up with in Ruby (I edited it to include ZIP code and an optional +4 ZIP):
regex = Regexp.new(/^[0-9]* (.*), (.*) [a-zA-Z]{2} [0-9]{5}(-[0-9]{4})?$/)
addresses = ["420 Fanboy Lane, Cupertino CA 12345"]
addresses << "1829 William Tell Oveture, by Gioachino Rossini 88421"
addresses << "114801 Western East Avenue Apt. B32, Funky Township CA 12345"
addresses << "1 Infinite Loop, Cupertino CA 12345-1234"
addresses << "420 time!"
addresses.each do |address|
  print address
  if address.match(regex)
    puts " is an address"
  else
    puts " is not an address"
  end
end
# Outputs:
> 420 Fanboy Lane, Cupertino CA 12345 is an address
> 1829 William Tell Oveture, by Gioachino Rossini 88421 is not an address
> 114801 Western East Avenue Apt. B32, Funky Township CA 12345 is an address
> 1 Infinite Loop, Cupertino CA 12345-1234 is an address
> 420 time! is not an address
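Since the question also asks about Python, here is a rough port of the same check. It is a sketch, not the answerer's code: the pattern is copied unchanged, and the name ADDRESS_RE is my own. For single-line strings like these, ^ and $ behave the same in both languages.

```python
import re

# Same pattern as the Ruby version above.
ADDRESS_RE = re.compile(r'^[0-9]* (.*), (.*) [a-zA-Z]{2} [0-9]{5}(-[0-9]{4})?$')

addresses = [
    "420 Fanboy Lane, Cupertino CA 12345",
    "1829 William Tell Oveture, by Gioachino Rossini 88421",
    "114801 Western East Avenue Apt. B32, Funky Township CA 12345",
    "1 Infinite Loop, Cupertino CA 12345-1234",
    "420 time!",
]

for address in addresses:
    verdict = "is an address" if ADDRESS_RE.match(address) else "is not an address"
    print(address, verdict)
```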
Here's what I used:
(\d{1,10}( \w+){1,10}( ( \w+){1,10})?( \w+){1,10}[,.](( \w+){1,10}(,)? [A-Z]{2}( [0-9]{5})?)?)
It's not perfect and doesn't match edge cases but it works for most regularly typed addresses and partial addresses.
It finds addresses in text such as
Hi! I'm at 12567 Some St. Fairfax, VA. Come get me!
some text 12567 Some St. is my home
something else 123 My Street Drive, Fairfax VA 22033
Hope this helps someone
\d{1,4}( \w+){1,3},( \w+){1,3} [A-Z]{2}
Not fully tested, but should work. Just use it with your favorite function from re (e.g. re.findall). Assumptions:
A house number can be between 1 and 4 digits long
1-3 words follow a house number, and they're all separated by spaces
City name is 1-3 words (needs to match Cupertino, Los Angeles, and San Luis Obispo)
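A small sketch of using that pattern with re.findall. One caveat: with capturing groups, findall returns the groups rather than the whole match, so this version switches them to non-capturing groups (my change, not part of the answer). The sample text is made up for illustration.

```python
import re

# Non-capturing groups so re.findall returns whole matches
# instead of the last repetition of each group.
ADDRESS_RE = re.compile(r'\d{1,4}(?: \w+){1,3},(?: \w+){1,3} [A-Z]{2}')

text = ("Cool cafe at 420 Fanboy Lane, Cupertino CA is way too cool. "
        "We moved to 12 Main Street, San Luis Obispo CA last year.")

print(ADDRESS_RE.findall(text))
# ['420 Fanboy Lane, Cupertino CA', '12 Main Street, San Luis Obispo CA']
```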
Okay, based on the very helpful responses from Mike Bethany and Rafe Kettler (thanks!),
I got this regex, which works in both Python and Ruby.
/[0-9]{1,4} (.*), (.*) [a-zA-Z]{2} [0-9]{5}/
Ruby Code - Results in 12 Argonaut Lane, Lexington MA 02478
myregex=Regexp.new(/[0-9]{1,4} (.*), (.*) [a-zA-Z]{2} [0-9]{5}(-[0-9]{4})?/)
print "We're Having a pizza party at 12 Argonaut Lane, Lexington MA 02478 Come join the party!".match(myregex)
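Python Code - Python patterns take no surrounding slashes (they would be matched literally), and findall returns the capture groups rather than the whole match, so use search to get the full address.
import re
myregex = re.compile(r'[0-9]{1,4} (.*), (.*) [a-zA-Z]{2} [0-9]{5}(-[0-9]{4})?')
match = myregex.search("We're Having a pizza party at 12 Argonaut Lane, Lexington MA 02478 Come join the party!")
print(match.group(0))  # 12 Argonaut Lane, Lexington MA 02478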
As stated, addresses are very free-form. Rather than the REGEX approach how about a service that provides accurate, standardized address data? I work for SmartyStreets, where we provide an API that does this very thing. One simple GET request and you've got your address parsed for you. Try this python sample out (you'll need to start a trial):
https://github.com/smartystreets/smartystreets-python-sdk/blob/master/examples/us_street_single_address_example.py
I need to find a regex that will extract the city name from strings below.
The order of string is the restaurant name, address, city, phone, cuisine type
Chinois on Main 2709 Main St. Santa Monica 310-392-9025 Pacific New Wave
Benita's Frites 1433 Third St. Promenade Santa Monica 310-458-2889 Fast Food
Indo Cafe 10428 1/2 National Blvd. LA 310-815-1290 Indonesian
Diaghilev 1020 N. San Vicente Blvd. W. Hollywood 310-854-1111 Russian
Jody Maroni's Sausage Kingdom 2011 Ocean Front Walk Venice 310-306-1995 Hot Dogs
I tried this regex, but it doesn't work:
zagat['city'] = zagat['raw'].str.extract("""
((?<=Ave.|Rd.|St.|Blvd.|Dr.|Way.|Pl.|Ln.|Ct.|Beach|Way ).+(?=...-...-....))
""", expand=True)
Can you help?
You may use
rx = r'(?:(?:Ave|Rd|St|Blvd|Dr|Way|Pl|Ln|Ct)\.|Beach|Way|Walk)\s*(.+?)\s*\d{3}-\d{3}-\d{4}'
zagat['city'] = zagat['raw'].str.extract(rx, expand=False)
Details
(?:(?:Ave|Rd|St|Blvd|Dr|Way|Pl|Ln|Ct)\.|Beach|Way|Walk) - Ave, Rd, St, Blvd, Dr, Way, Pl, Ln or Ct followed by a ., or Beach, Way or Walk
\s* - 0+ whitespaces
(.+?) - Group 1 (this value will be returned by .extract): any one or more chars other than line break chars, as few as possible
\s* - 0+ whitespaces
\d{3}-\d{3}-\d{4} - 3 digits, -, 3 digits, - and 4 digits.
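To see what the pattern captures on the rows from the question, here is a sketch using plain re instead of pandas, so it runs without any dependencies. extract_city is a hypothetical helper; it returns group 1 of the first match, which is what I would expect .str.extract to return for each row.

```python
import re

rx = re.compile(
    r'(?:(?:Ave|Rd|St|Blvd|Dr|Way|Pl|Ln|Ct)\.|Beach|Way|Walk)'
    r'\s*(.+?)\s*\d{3}-\d{3}-\d{4}'
)

rows = [
    "Chinois on Main 2709 Main St. Santa Monica 310-392-9025 Pacific New Wave",
    "Benita's Frites 1433 Third St. Promenade Santa Monica 310-458-2889 Fast Food",
    "Indo Cafe 10428 1/2 National Blvd. LA 310-815-1290 Indonesian",
    "Diaghilev 1020 N. San Vicente Blvd. W. Hollywood 310-854-1111 Russian",
    "Jody Maroni's Sausage Kingdom 2011 Ocean Front Walk Venice 310-306-1995 Hot Dogs",
]


def extract_city(raw):
    """Return the text between the street suffix and the phone number."""
    match = rx.search(raw)
    return match.group(1) if match else None


for raw in rows:
    print(extract_city(raw))
# Santa Monica
# Promenade Santa Monica
# LA
# W. Hollywood
# Venice
```

Note the second row: "Promenade" follows "St." and is therefore captured as part of the city, so rows like that need extra handling.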
So, the file has about 57,000 book titles, author names and an ETEXT No. for each. I am trying to parse the file to get only the ETEXT Nos.
The File is like this:
TITLE and AUTHOR ETEXT NO.
Aspects of plant life; with special reference to the British flora, 56900
by Robert Lloyd Praeger
The Vicar of Morwenstow, by Sabine Baring-Gould 56899
[Subtitle: Being a Life of Robert Stephen Hawker, M.A.]
Raamatun tutkisteluja IV, mennessä Charles T. Russell 56898
[Subtitle: Harmagedonin taistelu]
[Language: Finnish]
Raamatun tutkisteluja III, mennessä Charles T. Russell 56897
[Subtitle: Tulkoon valtakuntasi]
[Language: Finnish]
Tom Thatcher's Fortune, by Horatio Alger, Jr. 56896
A Yankee Flier in the Far East, by Al Avery 56895
and George Rutherford Montgomery
[Illustrator: Paul Laune]
Nancy Brandon's Mystery, by Lillian Garis 56894
Nervous Ills, by Boris Sidis 56893
[Subtitle: Their Cause and Cure]
Pensées sans langage, par Francis Picabia 56892
[Language: French]
Helon's Pilgrimage to Jerusalem, Volume 2 of 2, by Frederick Strauss 56891
[Subtitle: A picture of Judaism, in the century
which preceded the advent of our Savior]
Fra Tommaso Campanella, Vol. 1, di Luigi Amabile 56890
[Subtitle: la sua congiura, i suoi processi e la sua pazzia]
[Language: Italian]
The Blue Star, by Fletcher Pratt 56889
Importanza e risultati degli incrociamenti in avicoltura, 56888
di Teodoro Pascal
[Language: Italian]
And this is what I tried:
def search_by_etext():
    fhand = open('GUTINDEX.ALL')
    print("Search by ETEXT:")
    for line in fhand:
        if not line.startswith(" [") and not line.startswith("~"):
            if not line.startswith(" ") and not line.startswith("TITLE"):
                words = line.rstrip()
                words = line.lstrip()
                words = words[-7:]
                print (words)

search_by_etext()
Well, the code mostly works. However, for some lines it gives me part of a title or other text.
For example, the output contains 'decott', which is part of an author name and shouldn't be there.
For this:
The Bashful Earthquake, by Oliver Herford 56765
[Subtitle: and Other Fables and Verses]
The House of Orchids and Other Poems, by George Sterling 56764
North Italian Folk, by Alice Vansittart Strettel Carr 56763
and Randolph Caldecott
[Subtitle: Sketches of Town and Country Life]
Wild Life in New Zealand. Part 1, Mammalia, by George M. Thomson 56762
[Subtitle: New Zealand Board of Science and Art, Manual No. 2]
Universal Brotherhood, Volume 13, No. 10, January 1899, by Various 56761
De drie steden: Lourdes, door Émile Zola 56760
[Language: Dutch]
Another example:
For
Rhandensche Jongens, door Jan Lens 56702
[Illustrator: Tjeerd Bottema]
[Language: Dutch]
The Story of The Woman's Party, by Inez Haynes Irwin 56701
Mormon Doctrine Plain and Simple, by Charles W. Penrose 56700
[Subtitle: Or Leaves from the Tree of Life]
The Stone Axe of Burkamukk, by Mary Grant Bruce 56699
[Illustrator: J. Macfarlane]
The Latter-Day Prophet, by George Q. Cannon 56698
[Subtitle: History of Joseph Smith Written for Young People]
Here, Life] shouldn't be there. Lines starting with a blank space have been filtered out with this:
if not line.startswith(" [") and not line.startswith("~"):
But I am still getting those stray values in my output.
Simple solution: regexps to the rescue!
import re
with open("etext.txt") as f:
    for line in f:
        match = re.search(r" (\d+)$", line.strip())
        if match:
            print(match.group(1))
The regular expression " (\d+)$" matches a space followed by one or more digits at the end of the string, and captures only the digits.
You can tighten the regexp further: e.g. if you know all ETEXT codes are exactly 5 digits long, you can change it to " (\d{5})$".
This works with the example text you posted. If it doesn't properly work on your own file then we need enough of the real data to find out what you really have.
It could be that those extra lines that are not being filtered out start with whitespace other than a " " char, like a tab for example. As a minimal change that might work, try filtering out lines that start with any whitespace rather than specifically a space char?
To check for whitespace in general rather than a space char, you can use a regular expression. Try if not re.match(r'^\s', line) and ...
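Putting the two suggestions together, a minimal sketch might look like this. The sample text and its indentation (a space on one continuation line, a tab on the other) are assumptions made up to exercise the whitespace check, and etext_numbers is a hypothetical name.

```python
import re

# Made-up excerpt: continuation lines start with a space or a tab.
sample = (
    "TITLE and AUTHOR                                                  ETEXT NO.\n"
    "North Italian Folk, by Alice Vansittart Strettel Carr                 56763\n"
    " and Randolph Caldecott\n"
    "\t[Subtitle: Sketches of Town and Country Life]\n"
    "Universal Brotherhood, Volume 13, No. 10, January 1899, by Various    56761\n"
)


def etext_numbers(text):
    numbers = []
    for line in text.splitlines():
        # Skip continuation lines that start with any whitespace (space, tab, ...).
        if re.match(r"\s", line):
            continue
        # Keep only a run of digits at the very end of the line.
        match = re.search(r"\s(\d+)$", line.rstrip())
        if match:
            numbers.append(match.group(1))
    return numbers


print(etext_numbers(sample))  # ['56763', '56761']
```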
I'm sorry if the title isn't very descriptive. I don't exactly know how to sum up my problem in a few words.
Here's my issue. I'm cleaning addresses and some of them are causing some issues.
I have a list of delimiters (avenue, street, road, place, etc etc etc) named patterns.
Let's say I have this address for example: SUITE 1603 200 PARK AVENUE SOUTH NEW YORK
I would like the output to be SUITE 200 PARK AVENUE SOUTH NEW YORK
Is there any way I could somehow look to see if there are 2 batches of numbers (in this case 1603 and 200) before one of my patterns and if so, strip the first batch of numbers from my string? i.e remove 1603 and keep 200.
Update: I've added this line to my code:
address = re.sub("\d+", "", address) however it's currently removing all the numbers. I thought that by putting ,1 after address it would only remove the first occurrence but that wasn't the case
If you want to apply this replacement only when one of your "separator" words is used, and only when there are two numbers, you can use a fancier regular expression.
import re
pattern = r"\d+ +(\d+ .*(STREET|AVENUE|ROAD|WHATEVER))"
address = "SUITE 1603 200 PARK AVENUE SOUTH NEW YORK"
cleaned = re.sub(pattern, r"\1", address)
print(cleaned)  # SUITE 200 PARK AVENUE SOUTH NEW YORK
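If the delimiters live in a list, as in the question, the alternation can be built from it. This is a sketch under assumptions: the list contents are placeholders, drop_first_number is a hypothetical name, and the \b word boundaries are my addition so that ROAD does not match inside BROADWAY.

```python
import re

# Placeholder delimiter list standing in for the asker's `patterns`.
patterns = ["STREET", "AVENUE", "ROAD", "PLACE"]

delimiters = "|".join(re.escape(p) for p in patterns)
# Two numbers in a row, followed somewhere later by a whole-word delimiter.
two_numbers = re.compile(r"\d+ +(\d+ .*\b(?:" + delimiters + r")\b)")


def drop_first_number(address):
    """Remove the first number only when a second number and a delimiter follow."""
    return two_numbers.sub(r"\1", address)


print(drop_first_number("SUITE 1603 200 PARK AVENUE SOUTH NEW YORK"))
# SUITE 200 PARK AVENUE SOUTH NEW YORK
print(drop_first_number("200 PARK AVENUE SOUTH NEW YORK"))
# 200 PARK AVENUE SOUTH NEW YORK
print(drop_first_number("SUITE 1603 200 BROADWAY"))
# SUITE 1603 200 BROADWAY
```

Addresses with a single number, or with no delimiter word, come back unchanged.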
Your description of what you want to do isn't very clear, but if I understand correctly, you want to delete the first occurrence of a number sequence?
You could do this without using a regex:
s = 'SUITE 1603 200 PARK AVENUE SOUTH NEW YORK'
words = s.split(' ')
for i, w in enumerate(words):
    if any(c.isdigit() for c in w):
        del words[i]  # remove only the first token that contains a digit
        break
print(' '.join(words))
Output: >>> SUITE 200 PARK AVENUE SOUTH NEW YORK
I have a feature class which contains 40,000 mailing addresses. Each address contains the street address, city, state and zipcode separated by spaces.
Example 1: 123 Northwest Johnson St Cleveland Ohio 12345
Example 2: PO Box 3 Pine Springs Ohio 12345
I want to pull out just the street addresses. How do I say: trim off the string starting at the 3rd or 4th to last space?
Thanks. Any help would be appreciated. I'm trying combinations of split, trim, etc. but can't get it right.
This is how you can do it in pure Python; I am not sure about differences when using ArcGIS:
ad1 = "123 Northwest Johnson St Cleveland Ohio 12345"
ad2 = "PO Box 3 Pine Springs Ohio 12345"
ad1split = ad1.split(" ")
ad2split = ad2.split(" ")
# Drop the last three tokens (city, state, ZIP code).
print(' '.join(ad1split[:-3]))  # 123 Northwest Johnson St
print(' '.join(ad2split[:-3]))  # PO Box 3 Pine
This, however, only works if all addresses have the same format: with a two-word city such as Pine Springs, "Pine" is left behind and you would have to drop four tokens instead.
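One possible way to handle the "3rd or 4th to last space" problem is to split from the right with rsplit and consult a list of known multi-word cities. The sketch below assumes such a list is available (MULTI_WORD_CITIES is a made-up placeholder) and that the state is always a single word; street_part is a hypothetical name.

```python
# Placeholder lookup of multi-word city names; a real one would come from your data.
MULTI_WORD_CITIES = ("Pine Springs",)


def street_part(address):
    # Peel off the state and ZIP code first (the last two tokens).
    head, _state, _zipcode = address.rsplit(" ", 2)
    # If the remainder ends with a known multi-word city, cut that off.
    for city in MULTI_WORD_CITIES:
        if head.endswith(" " + city):
            return head[: -(len(city) + 1)]
    # Otherwise assume a one-word city and drop it.
    return head.rsplit(" ", 1)[0]


print(street_part("123 Northwest Johnson St Cleveland Ohio 12345"))
# 123 Northwest Johnson St
print(street_part("PO Box 3 Pine Springs Ohio 12345"))
# PO Box 3
```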
Disclaimer: I read very carefully this thread:
Street Address search in a string - Python or Ruby
and many other resources.
Nothing works for me so far.
In more detail, here is what I am looking for:
The rules are relaxed and I definitely am not asking for a perfect code that covers all cases; just a few simple basic ones with assumptions that the address should be in the format:
a) Street number (1...N digits);
b) Street name : one or more words capitalized;
b-2) (optional) would be best if it could be prefixed with abbrev. "S.", "N.", "E.", "W."
c) (optional) unit/apartment/etc can be any (incl. empty) number of arbitrary characters
d) Street "type": one of ("st.", "ave.", "way");
e) City name : 1 or more Capitalized words;
f) (optional) state abbreviation (2 letters)
g) (optional) zip which is any 5 digits.
None of the above needs to be a valid thing (e.g. an existing city or zip).
I am trying expressions like these so far:
pat = re.compile(r'\d{1,4}( \w+){1,5}, (.*), ( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?', re.IGNORECASE)
>>> pat.search("123 East Virginia avenue, unit 123, San Ramondo, CA, 94444")
This doesn't work, and it's not easy for me to understand why. Specifically: how do I separate, in my pattern, a group of arbitrary words from one of the specific words that should follow, like a state abbrev. or a street "type" ("st.", "ave.")?
Anyhow: here is an example of what I am hoping to get:
Given
def ex_addr(text):
    # does the re magic
    # returns 1st address (all addresses?) or None if nothing found
    pass

for t in [
    'The meeting will be held at 22 West Westin st., South Carolina, 12345 on Nov.-18',
    'The meeting will be held at 22 West Westin street, SC, 12345 on Nov.-18',
    'Hi there,\n How about meeting tomorr. #10am-sh in Chadds # 123 S. Vancouver ave. in Ottawa? \nThanks!!!',
    'Hi there,\n How about meeting tomorr. #10am-sh in Chadds # 123 S. Vancouver avenue in Ottawa? \nThanks!!!',
    'This was written in 1999 in Montreal',
    "Cool cafe at 420 Funny Lane, Cupertino CA is way too cool",
    "We're at a party at 12321 Mammoth Lane, Lexington MA 77777; Come have a beer!"
]:
    print(ex_addr(t))
I would like to get:
'22 West Westin st., South Carolina, 12345'
'22 West Westin street, SC, 12345'
'123 S. Vancouver ave. in Ottawa'
'123 S. Vancouver avenue in Ottawa'
None # for 'This was written in 1999 in Montreal',
"420 Funny Lane, Cupertino CA",
"12321 Mammoth Lane, Lexington MA 77777"
Could you please help?
I just ran across this on GitHub, as I am having a similar problem. It appears to work and to be more robust than your current solution.
https://github.com/madisonmay/CommonRegex
Looking at the code, the regex for street address accounts for many more scenarios. '\d{1,4} [\w\s]{1,20}(?:street|st|avenue|ave|road|rd|highway|hwy|square|sq|trail|trl|drive|dr|court|ct|parkway|pkwy|circle|cir|boulevard|blvd)\W?(?=\s|$)'
\d{1,4}( \w+){1,5}, (.*), ( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?
In this regex, you have one space too many before the second ( \w+){1,5}, which already begins with a space. With that space removed, it matches your example.
I don't think you can assume that a "unit 123" or similar will be there, or there might be several of them (e.g. "building A, apt 3"). Note that in your initial regex, the . in (.*) can also match a comma, which could lead to very long (and unwanted) matches.
You should probably accept several such groups with a limit on their number (e.g. replace , (.*) with something like (, [^,]{1,20}){0,5}).
In any case, you will probably never get something 100% accurate that accepts every variation people might throw at it. Do lots of tests! Good luck.
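To make the difference concrete, here is a sketch comparing the three variants discussed above: the original pattern, the one with the extra space removed, and the one with a bounded number of optional comma-separated groups. The variable names and the second test string (the same address without a unit) are mine.

```python
import re

FLAGS = re.IGNORECASE
TAIL = r'( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?'

# Original: a literal space before a group that already starts with a space.
original = re.compile(r'\d{1,4}( \w+){1,5}, (.*), ' + TAIL, FLAGS)
# Extra space removed.
fixed = re.compile(r'\d{1,4}( \w+){1,5}, (.*),' + TAIL, FLAGS)
# ", (.*)" replaced by zero to five short comma-free groups.
bounded = re.compile(r'\d{1,4}( \w+){1,5}(, [^,]{1,20}){0,5},' + TAIL, FLAGS)

with_unit = "123 East Virginia avenue, unit 123, San Ramondo, CA, 94444"
without_unit = "123 East Virginia avenue, San Ramondo, CA, 94444"

for name, pat in [("original", original), ("fixed", fixed), ("bounded", bounded)]:
    print(name, bool(pat.search(with_unit)), bool(pat.search(without_unit)))
# original False False
# fixed True False
# bounded True True
```

Only the bounded variant accepts the address both with and without the unit part.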