Understanding Pyparsing for street addresses

While searching for ways to build a better address locator for processing a single-field address table, I came across the Pyparsing module. On the Examples page there is a script called "streetAddressParser" (author unknown) that I've copied in full below. I've read the documentation and looked at O'Reilly's recursive descent parser tutorials, but I'm still confused about the code for this address parser. I'm aware that this parser would be just one component of an address locator application, but my Python experience is limited to GIS scripting and I'm struggling to understand certain parts of this code.
First, what is the purpose of defining numbers as "Zero One Two Three...Eleven Twelve Thirteen...Ten Twenty Thirty..."? If we know an address field starts with integers representing the street number why not just extract that as the first token?
Second, why does this script use so many bitwise operators (^, |, ~)? Is this because of performance gains or are they treated differently in the Pyparsing module? Could other operators be used in place of them and produce the same result?
I'm grateful for any guidance offered and I appreciate your patience in reading this.
Thank you!
from pyparsing import *
# define number as a set of words
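# note the trailing space after "Ten": adjacent string literals are concatenated,
# so without it the list would contain the bogus word "TenEleven"
units = oneOf("Zero One Two Three Four Five Six Seven Eight Nine Ten "
    "Eleven Twelve Thirteen Fourteen Fifteen Sixteen Seventeen Eighteen Nineteen",
    caseless=True)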
tens = oneOf("Ten Twenty Thirty Forty Fourty Fifty Sixty Seventy Eighty Ninety",caseless=True)
hundred = CaselessLiteral("Hundred")
thousand = CaselessLiteral("Thousand")
OPT_DASH = Optional("-")
numberword = ((( units + OPT_DASH + Optional(thousand) + OPT_DASH +
Optional(units + OPT_DASH + hundred) + OPT_DASH +
Optional(tens)) ^ tens )
+ OPT_DASH + Optional(units) )
# number can be any of the forms 123, 21B, 222-A or 23 1/2
housenumber = originalTextFor( numberword | Combine(Word(nums) +
Optional(OPT_DASH + oneOf(list(alphas))+FollowedBy(White()))) +
Optional(OPT_DASH + "1/2")
)
numberSuffix = oneOf("st th nd rd").setName("numberSuffix")
streetnumber = originalTextFor( Word(nums) +
Optional(OPT_DASH + "1/2") +
Optional(numberSuffix) )
# just a basic word of alpha characters, Maple, Main, etc.
name = ~numberSuffix + Word(alphas)
# types of streets - extend as desired
# note the trailing space after "Sq": without it the split() would yield "SqLoop"
type_ = Combine( MatchFirst(map(Keyword,"Street St Boulevard Blvd Lane Ln Road Rd Avenue Ave "
    "Circle Cir Cove Cv Drive Dr Parkway Pkwy Court Ct Square Sq "
    "Loop Lp".split())) + Optional(".").suppress())
# street name
nsew = Combine(oneOf("N S E W North South East West NW NE SW SE") + Optional("."))
streetName = (Combine( Optional(nsew) + streetnumber +
Optional("1/2") +
Optional(numberSuffix), joinString=" ", adjacent=False )
^ Combine(~numberSuffix + OneOrMore(~type_ + Combine(Word(alphas) + Optional("."))), joinString=" ", adjacent=False)
^ Combine("Avenue" + Word(alphas), joinString=" ", adjacent=False)).setName("streetName")
# PO Box handling
acronym = lambda s : Regex(r"\.?\s*".join(s)+r"\.?")
poBoxRef = ((acronym("PO") | acronym("APO") | acronym("AFP")) +
Optional(CaselessLiteral("BOX"))) + Word(alphanums)("boxnumber")
# basic street address
streetReference = streetName.setResultsName("name") + Optional(type_).setResultsName("type")
direct = housenumber.setResultsName("number") + streetReference
intersection = ( streetReference.setResultsName("crossStreet") +
( '#' | Keyword("and",caseless=True)) +
streetReference.setResultsName("street") )
streetAddress = ( poBoxRef("street")
^ direct.setResultsName("street")
^ streetReference.setResultsName("street")
^ intersection )
tests = """\
3120 De la Cruz Boulevard
100 South Street
123 Main
221B Baker Street
10 Downing St
1600 Pennsylvania Ave
33 1/2 W 42nd St.
454 N 38 1/2
21A Deer Run Drive
256K Memory Lane
12-1/2 Lincoln
23N W Loop South
23 N W Loop South
25 Main St
2500 14th St
12 Bennet Pkwy
Pearl St
Bennet Rd and Main St
19th St
1500 Deer Creek Lane
186 Avenue A
2081 N Webb Rd
2081 N. Webb Rd
1515 West 22nd Street
2029 Stierlin Court
P.O. Box 33170
The Landmark # One Market, Suite 200
One Market, Suite 200
One Market
One Union Square
One Union Square, Apt 22-C
""".split("\n")
# how to add Apt, Suite, etc.
suiteRef = (
oneOf("Suite Ste Apt Apartment Room Rm #", caseless=True) +
Optional(".") +
Word(alphanums+'-')("suitenumber"))
streetAddress = streetAddress + Optional(Suppress(',') + suiteRef("suite"))
for t in map(str.strip,tests):
    if t:
        #~ print "1234567890"*3
        print t
        addr = streetAddress.parseString(t, parseAll=True)
        #~ # use this version for testing
        #~ addr = streetAddress.parseString(t)
        print "Number:", addr.street.number
        print "Street:", addr.street.name
        print "Type:", addr.street.type
        if addr.street.boxnumber:
            print "Box:", addr.street.boxnumber
        print addr.dump()
        print

In some addresses, the primary number is spelt out as a word, as you can see from a few addresses in their tests, near the end of the list. Your statement, "If we know an address field starts with integers representing the street number..." is a big "if". Many, many addresses do not start with a number.
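The operators are not doing bitwise arithmetic here. Pyparsing overloads them on its parser elements so that grammars can be written as expressions: + builds an And (match in sequence), | builds a MatchFirst (take the first alternative that matches), ^ builds an Or (try every alternative and take the longest match), and ~ builds a NotAny (negative lookahead). The equivalent classes can be called directly, as the script itself does with MatchFirst(...) for type_, so the choice is about readability, not performance.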
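To make the second question concrete, here is a toy sketch of how a library can give +, |, ^ and ~ grammar meanings through operator overloading, which is the mechanism pyparsing relies on. The classes below (Parser, Lit, Seq, First, Longest, Not) are hypothetical stand-ins written for this illustration; they are not pyparsing's implementation and they skip whitespace handling and result collection.

```python
class Parser:
    """Toy parser element: match() returns the number of characters consumed, or None."""

    def match(self, text):
        raise NotImplementedError

    def __add__(self, other):   # a + b : a followed by b
        return Seq(self, other)

    def __or__(self, other):    # a | b : first alternative that matches
        return First(self, other)

    def __xor__(self, other):   # a ^ b : alternative with the longest match
        return Longest(self, other)

    def __invert__(self):       # ~a : succeed only if a does NOT match, consume nothing
        return Not(self)


class Lit(Parser):
    def __init__(self, s):
        self.s = s

    def match(self, text):
        return len(self.s) if text.startswith(self.s) else None


class Seq(Parser):
    def __init__(self, a, b):
        self.a, self.b = a, b

    def match(self, text):
        n = self.a.match(text)
        if n is None:
            return None
        m = self.b.match(text[n:])
        return None if m is None else n + m


class First(Parser):
    def __init__(self, a, b):
        self.a, self.b = a, b

    def match(self, text):
        n = self.a.match(text)
        return n if n is not None else self.b.match(text)


class Longest(Parser):
    def __init__(self, a, b):
        self.a, self.b = a, b

    def match(self, text):
        hits = [n for n in (self.a.match(text), self.b.match(text)) if n is not None]
        return max(hits) if hits else None


class Not(Parser):
    def __init__(self, a):
        self.a = a

    def match(self, text):
        return 0 if self.a.match(text) is None else None


st, street = Lit("St"), Lit("Street")
first = (st | street).match("Street")               # "St" is tried first and wins
longest = (st ^ street).match("Street")             # both are tried, "Street" is longer
guarded = (~Lit("th") + Lit("Main")).match("Main")  # lookahead passes, then "Main" matches
print(first, longest, guarded)
```

The difference between first and longest is the reason the script mixes | and ^: with ^ the order of alternatives does not matter, at the cost of trying all of them.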
It's refreshing to see a parser that attempts to parse street addresses without using a regular expression... also see this page about some of the challenges of parsing freeform addresses.
However, it's worth noting that this parser looks like it will miss a wide variety of addresses. It doesn't seem to consider some of the special address formats common in Utah, Wisconsin, and rural areas. It also is missing a significant number of secondary designators and street suffixes.

Related

how to find all matches in a string with regex in python?

I have a string and I want to find all the 13-digit numbers in it.
I wrote the code below, but the list I get back contains only the first 13-digit number.
Does anyone know where the problem is?
my text:
9311105005816 POTTING MIX OSMOCOTE PRO 501 PREMIUM 107899 4711414284189 BBQ ACC CLEANING SMALL GRILL BRUSH^ 9312566048022 SPRAY PAINT FIDDLY BITS 250G GREY PRIMER 9312324001115 GARDEN BASICS 25L ALL PURPOSE POTTING MIX 2 # $3.50 8711167004368 FIRE IGNITION FIRELIGHTR SAMBA 36PK WHITE BRICK SAKF36 6 #
my code:
import re
with open("Receipt.txt") as f:
    lines = f.readlines()
index_subtotal = lines[0].find("SubTotal")
index_tax_invoice = lines[0].find("TAX INVOICE 'Kip' ")
len_tax_invoice = len("TAX INVOICE 'Kip' ")
print(lines[0][index_tax_invoice + len_tax_invoice:index_subtotal])
print("*" * 30)
my_pattern = lines[0][index_tax_invoice + len_tax_invoice:index_subtotal]
my_pattern_list = re.findall("^(?:.*?(\d{10,13}).*|.*)$", my_pattern)
print(my_pattern_list)
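A likely explanation for the single result: the pattern is anchored with ^ and $ and its .* parts consume the whole line, so re.findall sees one match with one capturing group. A minimal sketch (Python 3) that matches only the digit runs themselves, assuming the barcodes are exactly 13 digits and stand alone as words; receipt_text is a hypothetical variable holding the sample text from the question:

```python
import re

# Sample text copied from the question.
receipt_text = (
    "9311105005816 POTTING MIX OSMOCOTE PRO 501 PREMIUM 107899 "
    "4711414284189 BBQ ACC CLEANING SMALL GRILL BRUSH^ "
    "9312566048022 SPRAY PAINT FIDDLY BITS 250G GREY PRIMER "
    "9312324001115 GARDEN BASICS 25L ALL PURPOSE POTTING MIX 2 # $3.50 "
    "8711167004368 FIRE IGNITION FIRELIGHTR SAMBA 36PK WHITE BRICK SAKF36 6 #"
)

# \b on both sides keeps shorter or longer digit runs (such as 107899) out.
barcodes = re.findall(r"\b\d{13}\b", receipt_text)
print(barcodes)
```

Because the pattern has no anchors and no greedy .*, findall can resume scanning after each match and return all five numbers.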

regular expression to exclude 2 consecutive capital letters

I'm having difficulty using regex to solve this expression,
e.g when given below:
regex_exp(address, "OG 56432")
It should return
"OG 56432: Middle Street Pollocksville | 686"
address is an array of strings:
address = [
"622 Gordon Lane St. Louisville OH 52071",
"432 Main Long Road St. Louisville OH 43071",
"686 Middle Street Pollocksville OG 56432"
]
My solution currently looks like this (Python):
import re
def regex_exp(address, zipcode):
    for x in address:
        if zipcode in x:
            postal_code = (re.search(r"[A-Z]{2}\s[0-9]{5}", x)).group(0)
            # returns "OG 56432"
            digits = (re.search(r"\d+", x)).group(0)
            # returns "686"
            address = (re.search(r"\D+", x)).group(0)
            # returns "Middle Street Pollocksville OG"
            print(postal_code + ":" + address + "| " + digits)
regex_exp(address, "OG 56432")
# returns OG 56432: Middle Street Pollocksville OG | 686
As you can see from my second paragraph, this is not the correct answer - I need the returned value to be
"OG 56432: Middle Street Pollocksville | 686"
How do I change the regex search for my address variable so that it excludes the two consecutive capital letters? I've tried things like
address = (re.search("?!\D+", x)).group(0)
to remove the two consecutive capitals based on A regular expression to exclude a word/string but I think this is a step in the wrong direction.
PS: I understand there are easier methods to solve this, but I want to use regex to improve my fundamentals

If you just want to remove the two consecutive capital letters that immediately precede the zip code (a 5-digit number), use this:
import re
text = "432 Main Long PC Market Road St. Louisville OG 43071"
address = re.sub(r'([A-Z]{2}[\s]{1})(?=[\d]{5})','',text)
print(address)
# Output: 432 Main Long PC Market Road St. Louisville 43071
To remove every occurrence of two consecutive capital letters that form a word on their own:
import re
text = "432 Main Long PC Market Road St. Louisville OG 43071"
address = re.sub(r'\b[A-Z]{2}\s','',text)
print(address)
# Output: 432 Main Long Market Road St. Louisville 43071

With re.sub() and capturing groups you can use:
import re
s="686 Middle Street Pollocksville OG 56432"
re.sub(r"(\d+)\s+(.*)\s+([A-Z]+\s+\d+)",r"\3: \2 | \1",s)
Out: 'OG 56432: Middle Street Pollocksville | 686'
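The capture-group idea can also be folded back into the regex_exp function from the question. This is only a sketch (Python 3), assuming every address has the shape "number, street text, two capital letters, five digits"; ADDRESS_RE and the group names are hypothetical names chosen for the example:

```python
import re

# number, then street text (lazy), then the code at the very end of the line
ADDRESS_RE = re.compile(
    r"^(?P<number>\d+)\s+(?P<street>.+?)\s+(?P<postal>[A-Z]{2}\s\d{5})$"
)


def regex_exp(addresses, zipcode):
    """Return 'CODE: street | number' for the first address ending in zipcode."""
    for line in addresses:
        m = ADDRESS_RE.match(line)
        if m and m.group("postal") == zipcode:
            return "{postal}: {street} | {number}".format(**m.groupdict())
    return None


address = [
    "622 Gordon Lane St. Louisville OH 52071",
    "432 Main Long Road St. Louisville OH 43071",
    "686 Middle Street Pollocksville OG 56432",
]
print(regex_exp(address, "OG 56432"))
```

Capturing the street and the code in separate groups means the two capital letters never end up in the street text, so nothing has to be removed afterwards.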

cutting strings onto new lines

My string is too long to fit in Tkinter, so I'm trying to break it onto a new line after every 15 spaces.
So far I count the spaces, and every time I reach 15 I add the string '\n', which should start a new line. Instead, the '\n' just shows up inside the string.
How can I fix this?
def stringCutter(movie):
    n = 0
    strings = []
    spaces = 0
    curFilms = db.CurrentFilm(movie)
    tempOverview = curFilms[5]
    for i in tempOverview:
        n += 1
        if i == ' ':
            spaces += 1
            if (spaces % 15)== 0:
                string = tempOverview[:n]
                tempOverview = tempOverview[n:]
                strings.append(string)
                n = 0
                spaces = 0
        if n == len(tempOverview):
            strings.append(tempOverview)
    overview = '\n'.join(strings)
    return overview
curFilms holds lots of movie info, and the element at index 5 is the overview, which is a long string.
I want it to return the overview like this:
After a global war the seaside kingdom known as the Valley Of The Wind remains
one of the last strongholds on Earth untouched by a poisonous jungle and the powerful
insects that guard it. Led by the courageous Princess Nausicaa the people of the Valley
engage in an epic struggle to restore the bond between humanity and Earth.
Instead of that though, it does this:
After a global war the seaside kingdom known as the Valley Of The Wind remains \none of the last strongholds on Earth untouched by a poisonous jungle and the powerful \ninsects that guard it. Led by the courageous Princess Nausicaa the people of the Valley \nengage in an epic struggle to restore the bond between humanity and Earth.
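A visible \n often means the string's repr, or a list or tuple containing the string, is being displayed instead of the string itself; that is only a guess here, since the display code is not shown. Independently of that, the splitting can be done on words instead of by counting characters. A minimal sketch (Python 3), where wrap_every_n_words is a hypothetical helper name:

```python
def wrap_every_n_words(text, words_per_line=15):
    """Join words back together with a newline after every words_per_line words."""
    words = text.split()
    lines = [
        " ".join(words[start:start + words_per_line])
        for start in range(0, len(words), words_per_line)
    ]
    return "\n".join(lines)


overview = (
    "After a global war the seaside kingdom known as the Valley Of The Wind "
    "remains one of the last strongholds on Earth untouched by a poisonous "
    "jungle and the powerful insects that guard it."
)
wrapped = wrap_every_n_words(overview)
print(wrapped)
```

Printing the returned string, or passing it directly as a widget's text, shows real line breaks. The standard library's textwrap.fill is an alternative when the limit should be a character width instead of a word count.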

Python, Canadian Address RegEx validation

I'm trying to write a Python script that validates Canadian addresses using RegEx.
For example this address is valid:
" 123 4th Street, Toronto, Ontario, M1A 1A1 "
But this one is not valid:
" 56 Winding Way, Thunder Bay, Ontario, D56 4A3"
I have tried many different combinations that follow the rules for Canadian postal codes, such as the rule that the six-character code cannot contain the letters D, F, I, O, Q, U, W or Z, but every entry comes out as invalid. I also tried
" ('^[ABCEGHJKLMNPRSTVXY]{1}\d{1}[A-Z]{1} *\d{1}[A-Z]{1}\d{1}$') " but still invalid
This is what I have so far:
import re
postalCode = " 123 4th Street, Toronto, Ontario, M1A 1A1 "
#read First Line
line = postalCode
#Validation Statement
test=re.compile('^\d{1}[ABCEGHJKLMNPRSTVXY]{1}\d{1}[A-Z]{1} *\d{1}[A-Z]{1}\d{1}$')
if test.match(line) is not None:
    print 'Match found - valid Canadian address: ', line
else:
    print 'Error - no match - invalid Canadian address:', line

Canadian postal codes can't contain the letters D, F, I, O, Q, or U, and cannot start with W or Z.
This will work for you:
import re
postalCode = " 123 4th Street, Toronto, Ontario, M1A 1A1 "
#read First Line
line = postalCode
if re.search("[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ] ?[0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]", line):
    print 'Match found - valid Canadian address: ', line
else:
    print 'Error - no match - invalid Canadian address:', line
WRONG - 56 Winding Way, Thunder Bay, Ontario, D56 4A3
CORRECT - 123 4th Street, Toronto, Ontario, M1A 1A1
Demo
https://ideone.com/OyVB9h
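One reason the attempts in the question report every address as invalid is that re.match and the ^ anchor both require the pattern to begin at the start of the string, where the street number sits, not the postal code. A minimal sketch (Python 3) that checks only the end of the address, assuming the postal code is always the last item; POSTAL_CODE_RE and has_valid_postal_code are hypothetical names:

```python
import re

# First letter: no D, F, I, O, Q, U, W, Z. Other letters: no D, F, I, O, Q, U.
POSTAL_CODE_RE = re.compile(
    r"\b[ABCEGHJKLMNPRSTVXY]\d[ABCEGHJKLMNPRSTVWXYZ] ?"
    r"\d[ABCEGHJKLMNPRSTVWXYZ]\d$"
)


def has_valid_postal_code(address):
    """True if the address, ignoring surrounding spaces, ends in a plausible postal code."""
    return POSTAL_CODE_RE.search(address.strip()) is not None


for candidate in (
    " 123 4th Street, Toronto, Ontario, M1A 1A1 ",
    " 56 Winding Way, Thunder Bay, Ontario, D56 4A3",
):
    print(has_valid_postal_code(candidate), candidate.strip())
```

This only checks the shape of the code; it does not confirm that the code is actually in use or that it belongs to the city and province given.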

It's been like this... forever:
/[A-Z][0-9][A-Z] ?[0-9][A-Z][0-9]/
If you want to restrict the first letter to only valid first letters then that's fine, but the rest is too complex to vary from that very much.

[ABCEGHJKLMNPRSTVXY]\d[A-Z] \d[A-Z]\d
Maybe it will work :D

Inserting ',' based on conditionals within a STR - Python

I am working with a very long list of street names that look like this:
1820 W 9000 SWest Jordan
455 S 500 ESalt Lake City
555 S 200 WBountiful
1000 N Green Valley PkwyHenderson
10100 W Tropicana AveLas Vegas
10305 S 1300 ESandy
10600 Southern Highlands PkwyLas Vegas
10616 S Eastern AveHenderson
111 Coors Blvd NWAlbuquerque
1170 E Gentile StLayton
1174 W 600 NSalt Lake City
1200 W Main StRiverton
....
....
I am trying to insert a ',' before the city name, which it appears is always after a lowercase character followed by NO SPACE and an UPPERCASE character.
So this is my thinking:
How do I write something that says, more or less,:
for cities in lst:
    if [char] is lower and [nextchar] is UPPER:
        [insert] ',' before UPPER

Following @Martijn's suggestion to take the last uppercase letter in a group, maybe:
import re
def fix(s):
    return re.sub("([a-z]|[A-Z]+)([A-Z])",r"\1,\2", s)
which gives
>>> for line in lines:
...     print fix(line)
...
1820 W 9000 S,West Jordan
455 S 500 E,Salt Lake City
555 S 200 W,Bountiful
1000 N Green Valley Pkwy,Henderson
10100 W Tropicana Ave,Las Vegas
10305 S 1300 E,Sandy
10600 Southern Highlands Pkwy,Las Vegas
10616 S Eastern Ave,Henderson
111 Coors Blvd NW,Albuquerque
1170 E Gentile St,Layton
1174 W 600 N,Salt Lake City
1200 W Main St,Riverton
[Disclaimer: I'm terrible with regexes.]

Something like this?
for big_index, cities in enumerate(lst):
    for index, char in enumerate(cities):
        if char == char.lower() and cities[index+1] != cities[index+1].lower():
            lst[big_index] = cities[:index] + "," + cities[index:]
Disclaimer** Not tested. Since I don't have all your data, I won't attempt it, but this should give the output you're describing
**Edit: In fact, it doesn't look like your data follows these rules at all. Like the example in the comments, what about Coors Blvd NWAlbuquerque? Anyway, I'll keep the code here unless you change your question

cities = [re.sub(r'(?<=[a-z])(?=[A-Z])', ',', x) for x in cities]

This would be a solution without regex, basically implementing your logic:
new_list = []
for line in big_list:
    for c in xrange(len(line)-1):
        if 97 <= ord(line[c]) <= 122 and 65 <= ord(line[c+1]) <= 90:
            line = line[:c+1]+","+line[c+1:]
            break
    new_list.append(line)
>>> new_list
['1820 W 9000 SWest Jordan', '455 S 500 ESalt Lake City', '555 S 200 WBountiful', '1000 N Green Valley Pkwy,Henderson', '10100 W Tropicana Ave,Las Vegas', '10305 S 1300 ESandy', '10600 Southern Highlands Pkwy,Las Vegas', '10616 S Eastern Ave,Henderson', '111 Coors Blvd NWAlbuquerque', '1170 E Gentile St,Layton', '1174 W 600 NSalt Lake City', '1200 W Main St,Riverton']
In case you're wondering what the ord function does: it translates a character into its character code, which for these letters is the ASCII code. In ASCII, lower-case letters occupy codes 97-122, so if ord(char) is in that range, char is a lower-case letter. The same goes for upper-case letters, except that they occupy 65-90.
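The same lower-then-upper test can be written with the string methods islower() and isupper() instead of numeric ranges. A small sketch (Python 3), where insert_comma is a hypothetical helper name; like the loop above, it leaves lines such as "555 S 200 WBountiful" unchanged because they contain no lower-case letter directly before the city:

```python
def insert_comma(line):
    """Insert ',' at the first place a lower-case letter is directly followed by an upper-case one."""
    for pos in range(len(line) - 1):
        if line[pos].islower() and line[pos + 1].isupper():
            return line[:pos + 1] + "," + line[pos + 1:]
    return line


print(ord("a"), ord("z"), ord("A"), ord("Z"))
print(insert_comma("1200 W Main StRiverton"))
print(insert_comma("555 S 200 WBountiful"))
```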

Your core problem is the missing space between the street and the state portions. Assuming a solution where you have already split this string into component address parts (perhaps using @jazzpi's solution), you can solve this secondary problem by building a collection of strings that match postal designations, such as ['Ave', 'E', 'Pkwy'] and so on, then look for matches to that collection on the left end of the state string.
Once you find a match, check to see if removing that sub-string leaves the state with an initial capital letter. If it leaves an initial capital letter intact, then you are free to truncate the substring and append the truncated street designation to the street portion.
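The designator-list idea can be sketched roughly as follows (Python 3). This is one interpretation under stated assumptions: DESIGNATORS is a hypothetical and deliberately incomplete list, split_fused_name is a hypothetical helper, and the trailing place name is assumed to start with a capital letter followed by a lower-case one.

```python
# Hypothetical, incomplete list of street designators and directionals.
DESIGNATORS = ("Pkwy", "Blvd", "Ave", "St", "NW", "NE", "SW", "SE", "N", "S", "E", "W")


def split_fused_name(line, designators=DESIGNATORS):
    """Insert ',' between a known designator and the capitalized name fused onto it."""
    words = line.split(" ")
    # Try longer designators first so that "NW" is preferred over "N".
    ordered = sorted(designators, key=len, reverse=True)
    for idx, word in enumerate(words):
        for designator in ordered:
            rest = word[len(designator):]
            # The remainder must still look like a name: capital, then lower-case.
            if word.startswith(designator) and rest[:1].isupper() and rest[1:2].islower():
                words[idx] = designator + "," + rest
                return " ".join(words)
    return line


for sample in (
    "1820 W 9000 SWest Jordan",
    "111 Coors Blvd NWAlbuquerque",
    "10100 W Tropicana AveLas Vegas",
):
    print(split_fused_name(sample))
```

The check on the remainder is what keeps "SWest" from being split after "SW": removing "SW" leaves "est", which does not start with a capital letter, so the shorter designator "S" is used instead.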
