I have this text
'''Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, New
York, NY 12345. Can you contact him now? If you need any help, call
me on 12345678'''
How can the address part be extracted from the above text using NLTK? I have tried the Stanford NER Tagger, which gives me only New York as a Location. How can I solve this?
Definitely regular expressions :)
Something like
import re
txt = ...
regexp = "[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"
address = re.findall(regexp, txt)
# address = ['44 West 22nd Street, New York, NY 12345']
Explanation:
[0-9]{1,3}: 1 to 3 digits, the address number
(space): a space between the number and the street name
.+: street name, any character for any number of occurrences
,: a comma and a space before the city
.+: city, any character for any number of occurrences
,: a comma and a space before the state
[A-Z]{2}: exactly 2 uppercase chars from A to Z
[0-9]{5}: 5 digits
re.findall(expr, string) returns a list with all the occurrences found.
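One caveat, shown as a minimal sketch: in the text exactly as quoted in the question, the address wraps across a line break, and "." does not match a newline by default. Collapsing the whitespace first (the variable name flat is only illustrative) lets the same expression find the address:

```python
import re

text = '''Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, New
York, NY 12345. Can you contact him now? If you need any help, call
me on 12345678'''

regexp = r"[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"

# "." stops at the line break after "New", so the raw text gives no match
print(re.findall(regexp, text))   # []

# Collapse line breaks and repeated spaces into single spaces, then retry
flat = " ".join(text.split())
print(re.findall(regexp, flat))   # ['44 West 22nd Street, New York, NY 12345']
```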
Pyap works best not just for this particular example but also for other addresses contained in texts.
import pyap
text = ...
addresses = pyap.parse(text, country='US')
Check out libpostal, a library dedicated to address parsing and normalization.
It cannot extract addresses from raw text, but it may help with related tasks.
For US address extraction from bulk text:
For US addresses in bulk text I have had pretty good, though not perfect, luck with the regex below. It won't work on many of the oddball addresses, and it only captures the first 5 digits of the zip.
Explanation:
([0-9]{1,5}) - a string of 1-5 digits to start off
(.{5,75}) - any character, 5-75 times. I looked at the addresses I was interested in, and the vast majority were over 5 and under 60 characters for address line 1, address line 2 and city.
(BIG LIST OF AMERICAN STATES AND ABBREVIATIONS) - this matches the state. It assumes state names are in Title Case.
.{1,2} - designed to accommodate the permutations of ,\s or just \s between the state and the zip
([0-9]{5}) - captures the first 5 digits of the zip.
import re
text = "is an individual maintaining a residence at 175 Fox Meadow, Orchard Park, NY 14127. 2. other,"
address_regex = r"([0-9]{1,5})(.{5,75})((?:Ala(?:(?:bam|sk)a)|American Samoa|Arizona|Arkansas|(?:^(?!Baja )California)|Colorado|Connecticut|Delaware|District of Columbia|Florida|Georgia|Guam|Hawaii|Idaho|Illinois|Indiana|Iowa|Kansas|Kentucky|Louisiana|Maine|Maryland|Massachusetts|Michigan|Minnesota|Miss(?:(?:issipp|our)i)|Montana|Nebraska|Nevada|New (?:Hampshire|Jersey|Mexico|York)|North (?:(?:Carolin|Dakot)a)|Ohio|Oklahoma|Oregon|Pennsylvania|Puerto Rico|Rhode Island|South (?:(?:Carolin|Dakot)a)|Tennessee|Texas|Utah|Vermont|Virgin(?:ia| Island(s?))|Washington|West Virginia|Wisconsin|Wyoming|A[KLRSZ]|C[AOT]|D[CE]|FL|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]|N[CDEHJMVY]|O[HKR]|P[AR]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])).{1,2}([0-9]{5})"
addresses = re.findall(address_regex, text)
addresses is then: [('175', ' Fox Meadow, Orchard Park, ', 'NY', '', '14127')]
You can combine these and remove spaces like so:
for address in addresses:
    out_address = " ".join(address)
    out_address = " ".join(out_address.split())
To then break this into a proper line 1, line 2, etc., I suggest using an address validation API like Google or Lob. These can take a string and break it into parts. There are also some Python solutions for this, like usaddress.
Using Python and pdf2text, I'm trying to extract a postcode from 4000-odd single-page PDF files I have received to print and mail. Unfortunately I do not have access to the original files, so I can't adjust this when the files are created.
My end goal here is to rename all the PDF files as Postalcode_ExistingFilename.pdf so I can sort them for the postal network. I'll also need to combine PDFs for the same customer into one file, but that's another problem.
In the PDF we have the word "Dear" and the postal code is before that (albeit a few lines up):
04 Jul 2018
Mr Sam Sample
123 Sample Street
Sample Suburb
Sample City 1234
Dear Sam
I've managed to get it to work with
(\d+)\s*Dear
until the number of address lines changes which causes the conversion to text to add a block of text between the Dear and postcode.
04 Jul 2018
Mr Sam Sample
123 Sample Street
Sample City 1234
PO Box 1234
Sample City
Phone: 01234567
Fax: 01234568
Email: email#email.com
Website: email.com
Dear Sam
I tried to get this working from the top by looking for the first 4-digit number excluding 2018; however, any 4-digit street numbers were being matched, which isn't what I'm after.
Any advice you can give would be awesome.
You can use regular expression:
\b\d{4}$\b(?<!2018)
\b Open word boundary.
\d{4}$ Match exactly four digits at the end of line.
\b Close word boundary.
(?<!2018) Negative lookbehind to check that the group of four digits is not 2018.
The regular expression assumes, as per the comments, that the postcode occurs at the end of a line. If you are expecting different years, you can adjust the negative lookbehind to cover them. For example:
(?<!2018|2017) will exclude 2017 or 2018.
(?<!201[0-9]) will exclude years from 2010 to 2019.
In Python, pass the re.MULTILINE flag so that $ matches at the end of every line rather than only at the end of the string.
>>> import re
>>> text = """04 Jul 2018
... Mr Sam Sample
... 1235 Sample Street
... Sample City 1234
... PO Box 1237
... Sample City
... Phone: 01234567
... Fax: 01234568
... Email: email#email.com
... Website: email.com
... Dear Sam"""
>>> re.findall(r"\b\d{4}$\b(?<!2018)", text, re.MULTILINE)
['1234', '1237']
How about matching 4-digit numbers at the end of a line, but only on lines that don't contain the date (that is, lines that don't begin with a number)?
import re
re.findall(r'^[^\d].*?\s+(\d{4})\s*$', data, re.MULTILINE)
# ['1234']
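To tie this back to the renaming goal in the question, here is a rough sketch that takes the first postcode found and prefixes it to a filename. The helper name prefixed_name and the filename letter_0001.pdf are made up for illustration, and the page text is assumed to have already been extracted from the PDF:

```python
import re

# Same expression as above: a 4-digit number ending a line that does not start with a digit
POSTCODE_RE = re.compile(r'^[^\d].*?\s+(\d{4})\s*$', re.MULTILINE)

def prefixed_name(page_text, filename):
    """Return 'Postalcode_ExistingFilename', or the unchanged name if no postcode is found."""
    match = POSTCODE_RE.search(page_text)
    if match is None:
        return filename
    return "%s_%s" % (match.group(1), filename)

# Sample text taken from the question
page_text = """04 Jul 2018
Mr Sam Sample
123 Sample Street
Sample City 1234
PO Box 1234
Sample City
Dear Sam"""

print(prefixed_name(page_text, "letter_0001.pdf"))   # 1234_letter_0001.pdf
```

Using search rather than findall keeps only the first hit, so the later PO Box number is ignored.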
I have spent the last 2 months working on a script that cleans, formats, and geocodes addresses. It is quite successful right now, however, there are some addresses that are giving me problems.
For example:
Addresses such as 7TH AVENUE 530 xxxxxxxxxxx are causing my geolocation module to fail. You can assume that the x's are other text. The other text isn't causing errors, it's purely due to the street number coming after avenue. I currently have filters in my program to truncate the address after street suffixes, such as avenue, street, etc. Due to this the program will ultimately only send 7th avenue to the cleaning module, which isn't accurate.
How can I account for instances where there is a group of numbers immediately following the street suffix, and then move them to the front of the address? Then I can continue as I already am and truncate the string after the suffix.
You can assume that I have a list of all of the street suffixes named patterns.
Thank you. Any help is greatly appreciated.
FURTHER CLARIFICATION: I would only need to perform this rearrangement of the string if the group of numbers was 3 digits or less, because the zip code will frequently come after the address suffix, and in cases like that I wouldn't want to rearrange the string.
I am not sure if this helps, but you can start with this:
import re
address = '7TH AVENUE 530 xxxxxxxxxxx'
m = re.search(r'(?<=AVENUE )\d{1,3}', address)
print(m.group(0))
# 530
Edit based on your comment:
import re
original = '7TH AVENUE 530 xxxxxxxxxxx'
patterns = ['street', 'avenue', 'road', 'place']
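# re.escape keeps any special characters in the suffixes or the address literal
regex = r'(.*)(' + '|'.join(map(re.escape, patterns)) + r')(.*)'
# Truncate everything after the street suffix
address = re.sub(regex, r'\1\2', original.lower()).lstrip()
# Look for a 1-3 digit number (followed by a space) right after the truncated address
new_addr = re.search(r'(?<=%s )\d{1,3} ' % re.escape(address), original.lower())
resulting_address = ' '.join([new_addr.group(0).strip(), address]) if new_addr else address
address = ' '.join(map(str.capitalize, resulting_address.split()))
print(address)  # 530 7th Avenue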
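The same idea can be wrapped in a single function. This is only a sketch: the name move_number_to_front is made up, and it assumes the rule from the clarification in the question, namely that only a number of at most 3 digits directly after the suffix is moved, so zip codes are left alone.

```python
import re

def move_number_to_front(address, suffixes):
    """Move a 1-3 digit number that directly follows a street suffix to the front."""
    suffix_alt = "|".join(re.escape(s) for s in suffixes)
    # group 1: everything up to and including the first suffix
    # group 2: the number after it; the trailing \b rejects runs longer than 3 digits
    pattern = re.compile(r"^(.*?\b(?:%s))\s+(\d{1,3})\b" % suffix_alt, re.IGNORECASE)
    match = pattern.match(address)
    if match is None:
        return address  # nothing to rearrange
    # Text after the number is dropped, like the truncation described in the question
    return "%s %s" % (match.group(2), match.group(1))

patterns = ["street", "avenue", "road", "place"]
print(move_number_to_front("7TH AVENUE 530 xxxxxxxxxxx", patterns))    # 530 7TH AVENUE
print(move_number_to_front("7TH AVENUE 10001 xxxxxxxxxxx", patterns))  # unchanged
```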
I'm sorry if the title isn't very descriptive. I don't exactly know how to sum up my problem in a few words.
Here's my issue. I'm cleaning addresses and some of them are causing some issues.
I have a list of delimiters (avenue, street, road, place, etc etc etc) named patterns.
Let's say I have this address for example: SUITE 1603 200 PARK AVENUE SOUTH NEW YORK
I would like the output to be SUITE 200 PARK AVENUE SOUTH NEW YORK
Is there any way I could somehow look to see if there are 2 batches of numbers (in this case 1603 and 200) before one of my patterns and if so, strip the first batch of numbers from my string? i.e remove 1603 and keep 200.
Update: I've added this line to my code:
address = re.sub("\d+", "", address) however it's currently removing all the numbers. I thought that by putting ,1 after address it would only remove the first occurrence but that wasn't the case
If you want to apply this replacement only when one of your "separator" words is used, and only when there are two numbers, you can use a fancier regular expression.
import re
pattern = r"\d+ +(\d+ .*(STREET|AVENUE|ROAD|WHATEVER))"
address = "SUITE 1603 200 PARK AVENUE SOUTH NEW YORK"  # avoid shadowing the built-in input()
output = re.sub(pattern, r"\1", address)
print(output)  # SUITE 200 PARK AVENUE SOUTH NEW YORK
Your description of what you want to do isn't very clear, but if I understand correctly, you want to delete the first occurrence of a number sequence?
You could do this without using a regex,
s = 'SUITE 1603 200 PARK AVENUE SOUTH NEW YORK'
words = s.split(' ')
for i, w in enumerate(words):
    # Remove the first word that contains a digit, then stop
    if any(c.isdigit() for c in w):
        del words[i]
        break
print(' '.join(words))
Output: SUITE 200 PARK AVENUE SOUTH NEW YORK
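For completeness, the re.sub attempt from the question's update can also be limited to the first match through the count argument. A minimal sketch (note that, unlike the first answer, it removes the first number even when the address contains only one):

```python
import re

address = "SUITE 1603 200 PARK AVENUE SOUTH NEW YORK"
# count=1 limits the substitution to the first match; the trailing \s* also
# removes the space after the number so no double space is left behind
cleaned = re.sub(r"\d+\s*", "", address, count=1)
print(cleaned)  # SUITE 200 PARK AVENUE SOUTH NEW YORK
```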
I have 1 billion addresses which are in a fairly bad format, like:
'12-as FS street, 456 DLGG Area, Rand. District, Sydney, Australia 32 1020203'
I need the output like
Column1:12AS
Column2: FS 456 DLGG Area
Column3: Rand
Column4: Sydney
Column5: Australia
Column6: 32
Column7: 1020203
So basically I need them to be separated into house number, address line, state, country, state code and pincode, with words like street, district, countryside, road, etc. removed.
Also I need to search for the most frequent words above a particular threshold.
You just need to write a parser. Its code will depend on the data, unless somebody has already written a parser for your specific data format.
List of immediate questions (incomplete):
1) Is comma the separator for all lines?
2) Is comma used inside values (e.g. inside street name)?
3) List of all words to be removed (road, rd., blvd. etc.)
4) Can address be in the form of "house name" instead of street with number?
This is a random example of address parser with some learning functionality:
https://github.com/datamade/usaddress
If your format and requirements do not exactly match an existing parser, then you have to write your own.
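The "most frequent words above a threshold" part of the question does not need a parser at all. A rough sketch with collections.Counter follows; the function name frequent_words and the second sample row are invented for illustration, and a word is assumed to be a run of ASCII letters compared case-insensitively:

```python
import re
from collections import Counter

def frequent_words(addresses, threshold):
    """Count words across all addresses and keep those seen at least `threshold` times."""
    counts = Counter()
    for address in addresses:  # any iterable works, so rows can be streamed from a file
        counts.update(word.lower() for word in re.findall(r"[A-Za-z]+", address))
    return {word: n for word, n in counts.items() if n >= threshold}

sample = [
    "12-as FS street, 456 DLGG Area, Rand. District, Sydney, Australia 32 1020203",
    "7 High street, Sydney, Australia 32 1020204",  # made-up second row
]
print(frequent_words(sample, threshold=2))
# {'street': 2, 'sydney': 2, 'australia': 2}
```

The resulting list is also a reasonable starting point for deciding which filler words (street, district, road and so on) to strip.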
Disclaimer: I read very carefully this thread:
Street Address search in a string - Python or Ruby
and many other resources.
Nothing works for me so far.
In more detail, here is what I am looking for:
The rules are relaxed and I definitely am not asking for a perfect code that covers all cases; just a few simple basic ones with assumptions that the address should be in the format:
a) Street number (1...N digits);
b) Street name : one or more words capitalized;
b-2) (optional) would be best if it could be prefixed with abbrev. "S.", "N.", "E.", "W."
c) (optional) unit/apartment/etc can be any (incl. empty) number of arbitrary characters
d) Street "type": one of ("st.", "ave.", "way");
e) City name : 1 or more Capitalized words;
f) (optional) state abbreviation (2 letters)
g) (optional) zip which is any 5 digits.
None of the above needs to be a valid thing (e.g. an existing city or zip).
I am trying expressions like these so far:
pat = re.compile(r'\d{1,4}( \w+){1,5}, (.*), ( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?', re.IGNORECASE)
>>> pat.search("123 East Virginia avenue, unit 123, San Ramondo, CA, 94444")
This doesn't work, and for me it's not easy to understand why. Specifically: how do I separate, in my pattern, a group of arbitrary words from one of the specific words that should follow, like the state abbreviation or the street "type" ("st.", "ave.")?
Anyhow: here is an example of what I am hoping to get:
Given
def ex_addr(text):
    # does the re magic
    # returns 1st address (all addresses?) or None if nothing found
    ...

for t in [
    'The meeting will be held at 22 West Westin st., South Carolina, 12345 on Nov.-18',
    'The meeting will be held at 22 West Westin street, SC, 12345 on Nov.-18',
    'Hi there,\n How about meeting tomorr. #10am-sh in Chadds # 123 S. Vancouver ave. in Ottawa? \nThanks!!!',
    'Hi there,\n How about meeting tomorr. #10am-sh in Chadds # 123 S. Vancouver avenue in Ottawa? \nThanks!!!',
    'This was written in 1999 in Montreal',
    "Cool cafe at 420 Funny Lane, Cupertino CA is way too cool",
    "We're at a party at 12321 Mammoth Lane, Lexington MA 77777; Come have a beer!"
]:
    print(ex_addr(t))
I would like to get:
'22 West Westin st., South Carolina, 12345'
'22 West Westin street, SC, 12345'
'123 S. Vancouver ave. in Ottawa'
'123 S. Vancouver avenue in Ottawa'
None # for 'This was written in 1999 in Montreal',
"420 Funny Lane, Cupertino CA",
"12321 Mammoth Lane, Lexington MA 77777"
Could you please help?
I just ran across this on GitHub, as I am having a similar problem. It appears to work and to be more robust than your current solution.
https://github.com/madisonmay/CommonRegex
Looking at the code, the regex for street address accounts for many more scenarios:
'\d{1,4} [\w\s]{1,20}(?:street|st|avenue|ave|road|rd|highway|hwy|square|sq|trail|trl|drive|dr|court|ct|parkway|pkwy|circle|cir|boulevard|blvd)\W?(?=\s|$)'
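A quick way to try that pattern on one of the sentences from the question is sketched below. The pattern is copied from this answer; compiling it with re.IGNORECASE is my assumption, not something confirmed from the library's source:

```python
import re

# Pattern copied from the answer above, split across two string literals for readability
street_address = re.compile(
    r'\d{1,4} [\w\s]{1,20}(?:street|st|avenue|ave|road|rd|highway|hwy|square|sq'
    r'|trail|trl|drive|dr|court|ct|parkway|pkwy|circle|cir|boulevard|blvd)\W?(?=\s|$)',
    re.IGNORECASE,  # assumption: match suffixes regardless of case
)

text = 'The meeting will be held at 22 West Westin street, SC, 12345 on Nov.-18'
print(street_address.findall(text))  # ['22 West Westin street,']
```

Note that the optional \W? keeps the trailing comma, and that this pattern stops at the street suffix, so city, state and zip are not part of the match.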
\d{1,4}( \w+){1,5}, (.*), ( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?
In this regex, you have one too many spaces (before ( \w+){1,5}, which already begins with one). Removing it, it matches your example.
I don't think you can assume that a "unit 123" or similar will be there, or there might be several ones (e.g. "building A, apt 3"). Note that in your initial regex, the . might match , which could lead to very long (and unwanted) matches.
You should probably accept several such groups, with a limit on their number (e.g. replace , (.*) with something like (, [^,]{1,20}){0,5}).
In any case, you will probably never get something 100% accurate that accepts every variation people might throw at it. Do lots of tests! Good luck.
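The points above can be checked with a short sketch. The variable names original, fixed and relaxed are only labels, and relaxed is one possible reading of the suggested replacement that keeps the city group in place:

```python
import re

example = "123 East Virginia avenue, unit 123, San Ramondo, CA, 94444"
tail = r", (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?"

# Pattern from the question: ", ( \w+)" demands two spaces before the city
original = re.compile(r"\d{1,4}( \w+){1,5}, (.*), ( \w+){1,5}" + tail, re.IGNORECASE)
# Extra space removed
fixed = re.compile(r"\d{1,4}( \w+){1,5}, (.*),( \w+){1,5}" + tail, re.IGNORECASE)
# ", (.*)" replaced by up to five short comma-separated parts
relaxed = re.compile(r"\d{1,4}( \w+){1,5}(, [^,]{1,20}){0,5},( \w+){1,5}" + tail, re.IGNORECASE)

print(original.search(example))        # None
print(fixed.search(example).group(0))  # the whole example string
# The relaxed version also accepts an address without the unit part
print(relaxed.search("123 East Virginia avenue, San Ramondo, CA, 94444").group(0))
```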