FInd a US street address in text (preferably using Python regex) - python

Disclaimer: I read very carefully this thread:
Street Address search in a string - Python or Ruby
and many other resources.
Nothing works for me so far.
In some more details here is what I am looking for is:
The rules are relaxed and I definitely am not asking for a perfect code that covers all cases; just a few simple basic ones with assumptions that the address should be in the format:
a) Street number (1...N digits);
b) Street name : one or more words capitalized;
b-2) (optional) would be best if it could be prefixed with abbrev. "S.", "N.", "E.", "W."
c) (optional) unit/apartment/etc can be any (incl. empty) number of arbitrary characters
d) Street "type": one of ("st.", "ave.", "way");
e) City name : 1 or more Capitalized words;
f) (optional) state abbreviation (2 letters)
g) (optional) zip which is any 5 digits.
None of the above needs to be a valid thing (e.g. an existing city or zip).
I am trying expressions like these so far:
pat = re.compile(r'\d{1,4}( \w+){1,5}, (.*), ( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?', re.IGNORECASE)
>>> pat.search("123 East Virginia avenue, unit 123, San Ramondo, CA, 94444")
Don't work, and for me it's not easy to understand why. Specifically: how do I separate in my pattern a group of any words from one of specific words that should follow, like state abbrev. or street "type ("st., ave.)?
Anyhow: here is an example of what I am hoping to get:
Given
def ex_addr(text):
# does the re magic
# returns 1st address (all addresses?) or None if nothing found
for t in [
'The meeting will be held at 22 West Westin st., South Carolina, 12345 on Nov.-18',
'The meeting will be held at 22 West Westin street, SC, 12345 on Nov.-18',
'Hi there,\n How about meeting tomorr. #10am-sh in Chadds # 123 S. Vancouver ave. in Ottawa? \nThanks!!!',
'Hi there,\n How about meeting tomorr. #10am-sh in Chadds # 123 S. Vancouver avenue in Ottawa? \nThanks!!!',
'This was written in 1999 in Montreal',
"Cool cafe at 420 Funny Lane, Cupertino CA is way too cool",
"We're at a party at 12321 Mammoth Lane, Lexington MA 77777; Come have a beer!"
] print ex_addr(t)
I would like to get:
'22 West Westin st., South Carolina, 12345'
'22 West Westin street, SC, 12345'
'123 S. Vancouver ave. in Ottawa'
'123 S. Vancouver avenue in Ottawa'
None # for 'This was written in 1999 in Montreal',
"420 Funny Lane, Cupertino CA",
"12321 Mammoth Lane, Lexington MA 77777"
Could you please help?

I just ran across this in GitHub as I am having a similar problem. Appears to work and be more robust than your current solution.
https://github.com/madisonmay/CommonRegex
Looking at the code, the regex for street address accounts for many more scenarios. '\d{1,4} [\w\s]{1,20}(?:street|st|avenue|ave|road|rd|highway|hwy|square|sq|trail|trl|drive|dr|court|ct|parkway|pkwy|circle|cir|boulevard|blvd)\W?(?=\s|$)'

\d{1,4}( \w+){1,5}, (.*), ( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?
In this regex, you have one too many spaces (before ( \w+){1,5}, which already begins with one). Removing it, it matches your example.
I don't think you can assume that a "unit 123" or similar will be there, or there might be several ones (e.g. "building A, apt 3"). Note that in your initial regex, the . might match , which could lead to very long (and unwanted) matches.
You should probably accept several such groups with a limitation on the number (e.g. replace , (.*) with something like (, [^,]{1,20}){0,5}.
In any case, you will probably never get something 100% accurate that will accept any variation people might throw at them. Do lots of tests! Good luck.

Related

Identify and replace using regex some strings, stored within a list, within a string that may or may not contain them

import re
#list of names to identify in input strings
result_list = ['Thomas Edd', 'Melissa Clark', 'Ada White', 'Louis Pasteur', 'Edd Thomas', 'Clark Melissa', 'White Eda', 'Pasteur Louis', 'Thomas', 'Melissa', 'Ada', 'Louis', 'Edd', 'Clark', 'White', 'Pasteur']
result_list.sort() # sorts normally by alphabetical order (optional)
result_list.sort(key=len, reverse=True) # sorts by descending length
#example 1
input_text = "Melissa went for a walk in the park, then Melisa Clark went to the cosmetics store. There Thomas showed her a wide variety of cosmetic products. Edd Thomas is a great salesman, even so Thomas Edd is a skilled but responsible salesman, as Edd is always honest with his customers. White is a new client who came to Edd's business due to the good social media reviews she saw from Melissa, her co-worker."
#In this example 2, it is almost the same however, some of the names were already encapsulated
# under the ((PERS)name) structure, and should not be encapsulated again.
input_text = "((PERS)Melissa) went for a walk in the park, then Melisa Clark went to the cosmetics store. There Thomas showed her a wide variety of cosmetic products. Edd Thomas is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as Edd is always honest with his customers. White is a new client who came to Edd's business due to the good social media reviews she saw from Melissa, her co-worker." #example 1
for i in result_list:
input_text = re.sub(r"\(\(PERS\)" + r"(" + str(i) + r")" + r"\)",
lambda m: (f"((PERS){m[1]})"),
input_text)
print(repr(input_text)) # --> output
Note that the names meet certain conditions under which they must be identified, that is, they must be in the middle of 2 whitespaces \s*the searched name\s* or be at the beginning (?:(?<=\s)|^) or/and at the end of the input string.
It may also be the case that a name is followed by a comma, for example "Ada White, Melissa and Louis went shopping" or if spaces are accidentally omitted "Ada White,Melissa and Louis went shopping".
For this reason it is important that after [.,;] the possibility that it does find a name.
Cases where the names should NOT be encapsulated, would be for example...
"the Edd's business"
"The whitespace"
"the pasteurization process takes time"
"Those White-spaces in that text are unnecessary"
, since in these cases the name is followed or preceded by another word that should not be part of the name that is being searched for.
For examples 1 and 2 (note that example 2 is the same as example 1 but already has some encapsulated names and you have to prevent them from being encapsulated again), you should get the following output.
"((PERS)Melissa) went for a walk in the park, then ((PERS)Melisa Clark) went to the cosmetics store. There ((PERS)Thomas) showed her a wide variety of cosmetic products. ((PERS)Edd Thomas) is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as ((PERS)Edd) is always honest with his customers. ((PERS)White) is a new client who came to Edd's business due to the good social media reviews she saw from ((PERS)Melissa), her co-worker."
You can use lookarounds to exclude already encapsulated names and those followed by ', an alphanumeric character or -:
import re
result_list = ['Thomas Edd', 'Melissa Clark', 'Ada White', 'Louis Pasteur', 'Edd Thomas', 'Clark Melissa', 'White Eda', 'Pasteur Louis', 'Thomas', 'Melissa', 'Ada', 'Louis', 'Edd', 'Clark', 'White', 'Pasteur']
result_list.sort(key=len, reverse=True) # sorts by descending length
input_text = "((PERS)Melissa) went for a walk in the park, then Melissa Clark went to the cosmetics store. There Thomas showed her a wide variety of cosmetic products. Edd Thomas is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as Edd is always honest with his customers. White is a new client who came to Edd's business due to the good social media reviews she saw from Melissa, her co-worker." #example 1
pat = re.compile(rf"(?<!\(PERS\))({'|'.join(result_list)})(?!['\w)-])")
input_text = re.sub(pat, r'((PERS)\1)', input_text)
Output:
((PERS)Melissa) went for a walk in the park, then ((PERS)Melissa Clark) went to the cosmetics store. There ((PERS)Thomas) showed her a wide variety of cosmetic products. ((PERS)Edd Thomas) is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as ((PERS)Edd) is always honest with his customers. ((PERS)White) is a new client who came to Edd's business due to the good social media reviews she saw from ((PERS)Melissa), her co-worker.
Of course you can refine the content of your lookahead based on further edge cases.

How can I use a regex function to extract european streetnames, housing numbers and adjectives with pandas?

I have a dataframe with 6000 records and need to extract/split the column with streetname into: "Streetname", "Housingnumber" and "Adjectives". Unfortunately, the problem is not solved yet using regex functions because there is no structure in the notation of df["streetname"]:
**Input from df["Streetname"]**
St. edward's Lane 26
Vineyardlane3a
High Street 0-9
ParkRoad near #33
Queens Road ??
s-Georgelane9abc
Kings Road 9b
1st Park Avenue 67 near cyclelane
**Output that I would like:
df["Street"] df["housingnumber"] df["adjective"]**
St. Edward's lane 26
Vineyardlane 3 a
High Street 0-9
ParkRoad 33
Queens Road
s-Georgelane 9 abc
Kings Road 9 b
1st Park Avenue 67
I tried this:
Filter = r'(?P<S>.*)(?P<H>\s[0-9].*)'
df["Streetname"] = df["Streetname"].str.extract(Filter)
I lose a lot of data and the result is only written into one column... Hope that someone can help!
Not 100% perfect (I doubt that this will be possible without a database or machine learning algorithms) but a starting point:
^ # start of line/string
(?P<street>\w+?\D+) # [a-zA-Z0-9_]+? followed by not a number
(?P<nr>\d*[-\d]*) # a digit, followed by - and other digits, eventually
(?P<adjective>[a-zA-Z]*) # a-z
.* # consume the rest of the string
See a demo on regex101.com.
You might want to strip of #, whitespaces or ? from the end of street afterwards.

Remove numbers conditionally?

I'm sorry if the title isn't very descriptive. I don't exactly know how to sum up my problem in a few words.
Here's my issue. I'm cleaning addresses and some of them are causing some issues.
I have a list of delimiters (avenue, street, road, place, etc etc etc) named patterns.
Let's say I have this address for example: SUITE 1603 200 PARK AVENUE SOUTH NEW YORK
I would like the output to be SUITE 200 PARK AVENUE SOUTH NEW YORK
Is there any way I could somehow look to see if there are 2 batches of numbers (in this case 1603 and 200) before one of my patterns and if so, strip the first batch of numbers from my string? i.e remove 1603 and keep 200.
Update: I've added this line to my code:
address = re.sub("\d+", "", address) however it's currently removing all the numbers. I thought that by putting ,1 after address it would only remove the first occurrence but that wasn't the case
If you want to apply this replacement only when one of your "separator" words is used, and only when there are two numbers, you can use a fancier regular expression.
import re
pattern = r"\d+ +(\d+ .*(STREET|AVENUE|ROAD|WHATEVER))"
input = "SUITE 1603 200 PARK AVENUE SOUTH NEW YORK"
output = re.sub(pattern, "\\1", input)
print(output) #SUITE 200 PARK AVENUE SOUTH NEW YORK
Your description of what you want to do isn't very clear, but if I understand correctly you want to is to delete the first occurrence of a number sequence?
You could do this without using a regex,
s = 'SUITE 1603 200 PARK AVENUE SOUTH NEW YORK'
l = s.split(' ')
for i, w in enumerate(l):
for c in w:
if c.isdigit():
del l[i]
break
print ' '.join(l)
Output: >>> SUITE 200 PARK AVENUE SOUTH NEW YORK

How can I extract address from raw text using NLTK in python?

I have this text
'''Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, New
York, NY 12345. Can you contact him now? If you need any help, call
me on 12345678'''
. How the address part can be extracted from the above text using NLTK? I have tried Stanford NER Tagger, which gives me only New York as Location. How to solve this?
Definitely regular expressions :)
Something like
import re
txt = ...
regexp = "[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"
address = re.findall(regexp, txt)
# address = ['44 West 22nd Street, New York, NY 12345']
Explanation:
[0-9]{1,3}: 1 to 3 digits, the address number
(space): a space between the number and the street name
.+: street name, any character for any number of occurrences
,: a comma and a space before the city
.+: city, any character for any number of occurrences
,: a comma and a space before the state
[A-Z]{2}: exactly 2 uppercase chars from A to Z
[0-9]{5}: 5 digits
re.findall(expr, string) will return an array with all the occurrences found.
Pyap works best not just for this particular example but also for other addresses contained in texts.
text = ...
addresses = pyap.parse(text, country='US')
Checkout libpostal, a library dedicated to address extraction
It cannot extract address from raw text but may help in related tasks
For US address extraction from bulk text:
For US addresses in bulks of text I have pretty good luck, though not perfect with the below regex. It wont work on many of the oddity type addresses and only captures first 5 of the zip.
Explanation:
([0-9]{1,6}) - string of 1-5 digits to start off
(.{5,75}) - Any character 5-75 times. I looked at the addresses I was interested in and the vast vast majority were over 5 and under 60 characters for the address line 1, address 2 and city.
(BIG LIST OF AMERICAN STATS AND ABBERVIATIONS) - This is to match on states. Assumes state names will be Title Case.
.{1,2} - designed to accomodate many permutations of ,/s or just /s between the state and the zip
([0-9]{5}) - captures first 5 of the zip.
text = "is an individual maintaining a residence at 175 Fox Meadow, Orchard Park, NY 14127. 2. other,"
address_regex = r"([0-9]{1,5})(.{5,75})((?:Ala(?:(?:bam|sk)a)|American Samoa|Arizona|Arkansas|(?:^(?!Baja )California)|Colorado|Connecticut|Delaware|District of Columbia|Florida|Georgia|Guam|Hawaii|Idaho|Illinois|Indiana|Iowa|Kansas|Kentucky|Louisiana|Maine|Maryland|Massachusetts|Michigan|Minnesota|Miss(?:(?:issipp|our)i)|Montana|Nebraska|Nevada|New (?:Hampshire|Jersey|Mexico|York)|North (?:(?:Carolin|Dakot)a)|Ohio|Oklahoma|Oregon|Pennsylvania|Puerto Rico|Rhode Island|South (?:(?:Carolin|Dakot)a)|Tennessee|Texas|Utah|Vermont|Virgin(?:ia| Island(s?))|Washington|West Virginia|Wisconsin|Wyoming|A[KLRSZ]|C[AOT]|D[CE]|FL|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]|N[CDEHJMVY]|O[HKR]|P[AR]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])).{1,2}([0-9]{5})"
addresses = re.findall(address_regex, text)
addresses is then: [('175', ' Fox Meadow, Orchard Park, ', 'NY', '', '14127')]
You can combine these and remove spaces like so:
for address in addresses:
out_address = " ".join(address)
out_address = " ".join(out_address.split())
To then break this into a proper line 1, line 2 etc. I suggest using an address validation API like Google or Lob. These can take a string and break it into parts. There are also some python solutions for this like usaddress

Street Address search in a string - Python or Ruby

Hey,
I was wondering how I can find a Street Address in a string in Python/Ruby?
Perhaps by a regex?
Also, it's gonna be in the following format (US)
420 Fanboy Lane, Cupertino CA
Thanks!
Maybe you want to have a look at pypostal. pypostal are the official Python bindings to libpostal.
With the Examples from Mike Bethany i made this little Example:
from postal.parser import parse_address
addresses = [
"420 Fanboy Lane, Cupertino CA 12345",
"1829 William Tell Oveture, by Gioachino Rossini 88421",
"114801 Western East Avenue Apt. B32, Funky Township CA 12345",
"1 Infinite Loop, Cupertino CA 12345-1234",
"420 time!",
]
for address in addresses:
print parse_address(address)
print "*" * 60
> [(u'420', u'house_number'), (u'fanboy lane', u'road'), (u'cupertino', u'city'), (u'ca', u'state'), (u'12345', u'postcode')]
> ************************************************************
> [(u'1829', u'house_number'), (u'william tell', u'road'), (u'oveture by gioachino', u'house'), (u'rossini', u'road'), (u'88421',
> u'postcode')]
> ************************************************************
> [(u'114801', u'house_number'), (u'western east avenue apt.', u'road'), (u'b32', u'postcode'), (u'funky', u'road'), (u'township',
> u'city'), (u'ca', u'state'), (u'12345', u'postcode')]
> ************************************************************
> [(u'1', u'house_number'), (u'infinite loop', u'road'), (u'cupertino', u'city'), (u'ca', u'state'), (u'12345-1234',
> u'postcode')]
> ************************************************************
> [(u'420', u'house_number'), (u'time !', u'house')]
> ************************************************************
Using your example this is what I came up with in Ruby (I edited it to include ZIP code and an optional +4 ZIP):
regex = Regexp.new(/^[0-9]* (.*), (.*) [a-zA-Z]{2} [0-9]{5}(-[0-9]{4})?$/)
addresses = ["420 Fanboy Lane, Cupertino CA 12345"]
addresses << "1829 William Tell Oveture, by Gioachino Rossini 88421"
addresses << "114801 Western East Avenue Apt. B32, Funky Township CA 12345"
addresses << "1 Infinite Loop, Cupertino CA 12345-1234"
addresses << "420 time!"
addresses.each do |address|
print address
if address.match(regex)
puts " is an address"
else
puts " is not an address"
end
end
# Outputs:
> 420 Fanboy Lane, Cupertino CA 12345 is an address
> 1829 William Tell Oveture, by Gioachino Rossini 88421 is not an address
> 114801 Western East Avenue Apt. B32, Funky Township CA 12345 is an address
> 1 Infinite Loop, Cupertino CA 12345-1234 is an address
> 420 time! is not an address
Here's what I used:
(\d{1,10}( \w+){1,10}( ( \w+){1,10})?( \w+){1,10}[,.](( \w+){1,10}(,)? [A-Z]{2}( [0-9]{5})?)?)
It's not perfect and doesn't match edge cases but it works for most regularly typed addresses and partial addresses.
It finds addresses in text such as
Hi! I'm at 12567 Some St. Fairfax, VA. Come get me!
some text 12567 Some St. is my home
something else 123 My Street Drive, Fairfax VA 22033
Hope this helps someone
\d{1,4}( \w+){1,3},( \w+){1,3} [A-Z]{2}
Not fully tested, but should work. Just use it with your favorite function from re (e.g. re.findall. Assumptions:
A house number can be between 1 and 4 digits long
1-3 words follow a house number, and they're all separated by spaces
City name is 1-3 words (needs to match Cupertino, Los Angeles, and San Luis Obispo)
Okay, Based on the very helpful Mike Bethany and Rafe Kettler responses ( thanks!)
I get this REGEX works for python and ruby.
/[0-9]{1,4} (.), (.) [a-zA-Z]{2} [0-9]{5}/
Ruby Code - Results in 12 Argonaut Lane, Lexington MA 02478
myregex=Regexp.new(/[0-9]{1,4} (.*), (.*) [a-zA-Z]{2} [0-9]{5}(-[0-9]{4})?/)
print "We're Having a pizza party at 12 Argonaut Lane, Lexington MA 02478 Come join the party!".match(myregex)
Python Code - doesnt work quite the same, but this is the base code.
import re
myregex = re.compile(r'/[0-9]{1,4} (.*), (.*) [a-zA-Z]{2} [0-9]{5}(-[0-9]{4})?/')
search = myregex.findall("We're Having a pizza party at 12 Argonaut Lane, Lexington MA 02478 Come join the party!")
As stated, addresses are very free-form. Rather than the REGEX approach how about a service that provides accurate, standardized address data? I work for SmartyStreets, where we provide an API that does this very thing. One simple GET request and you've got your address parsed for you. Try this python sample out (you'll need to start a trial):
https://github.com/smartystreets/smartystreets-python-sdk/blob/master/examples/us_street_single_address_example.py

Categories

Resources