Python, Canadian Address RegEx validation

I'm trying to write a Python script that validates Canadian addresses using a regex.
For example, this address is valid:
" 123 4th Street, Toronto, Ontario, M1A 1A1 "
But this one is not:
" 56 Winding Way, Thunder Bay, Ontario, D56 4A3"
I have tried many different combinations that follow the rules for Canadian postal codes, such as the rule that the last 6 alphanumeric characters cannot contain the letters D, F, I, O, Q, U, W, Z, but every entry comes out as invalid. I also tried
" ('^[ABCEGHJKLMNPRSTVXY]{1}\d{1}[A-Z]{1} *\d{1}[A-Z]{1}\d{1}$') " but it was still invalid.
This is what I have so far:
import re
postalCode = " 123 4th Street, Toronto, Ontario, M1A 1A1 "
# read first line
line = postalCode
# validation statement
test = re.compile(r'^\d{1}[ABCEGHJKLMNPRSTVXY]{1}\d{1}[A-Z]{1} *\d{1}[A-Z]{1}\d{1}$')
if test.match(line) is not None:
    print('Match found - valid Canadian address: ', line)
else:
    print('Error - no match - invalid Canadian address:', line)

Canadian postal codes can't contain the letters D, F, I, O, Q, or U, and can't start with W or Z. The postal code is only one part of the address line, so search for it with re.search instead of anchoring the pattern to the whole line.
This will work for you:
import re
postalCode = " 123 4th Street, Toronto, Ontario, M1A 1A1 "
# read first line
line = postalCode
if re.search(r"[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ] ?[0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]", line):
    print('Match found - valid Canadian address: ', line)
else:
    print('Error - no match - invalid Canadian address:', line)
WRONG - 56 Winding Way, Thunder Bay, Ontario, D56 4A3
CORRECT - 123 4th Street, Toronto, Ontario, M1A 1A1
Demo
https://ideone.com/OyVB9h
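To see why the original attempt fails: re.match with ^...$ has to match the whole line, but the postal code is only the last part of the address (and that pattern also starts with a stray \d{1}). Here is a minimal sketch of one alternative, assuming the postal code is always the last comma-separated field. The helper name has_valid_postal_code is made up for illustration.

```python
import re

# First letter: no D, F, I, O, Q, U, W, Z. Other letters: no D, F, I, O, Q, U.
POSTAL_CODE = re.compile(
    r"[ABCEGHJ-NPRSTVXY][0-9][ABCEGHJ-NPRSTV-Z] ?[0-9][ABCEGHJ-NPRSTV-Z][0-9]"
)

def has_valid_postal_code(address):
    """Return True if the last comma-separated field is a valid postal code."""
    last_field = address.rsplit(",", 1)[-1].strip()
    return POSTAL_CODE.fullmatch(last_field) is not None

print(has_valid_postal_code(" 123 4th Street, Toronto, Ontario, M1A 1A1 "))    # True
print(has_valid_postal_code(" 56 Winding Way, Thunder Bay, Ontario, D56 4A3"))  # False
```

Using fullmatch on the isolated field keeps the check strict without forcing the pattern to describe the street, city, and province as well.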

It's been like this... forever:
/[A-Z][0-9][A-Z] ?[0-9][A-Z][0-9]/
If you want to restrict the first letter to only valid first letters then that's fine, but the rest is too complex to vary from that very much.

[ABCEGHJKLMNPRSTVXY]\d[A-Z] \d[A-Z]\d
Maybe it will work :D

Related

regular expression to exclude 2 consecutive capital letters

I'm having difficulty using regex to solve this problem.
For example, given the call below:
regex_exp(address, "OG 56432")
It should return
"OG 56432: Middle Street Pollocksville | 686"
address is an array of strings:
address = [
"622 Gordon Lane St. Louisville OH 52071",
"432 Main Long Road St. Louisville OH 43071",
"686 Middle Street Pollocksville OG 56432"
]
My solution currently looks like this (Python):
import re
def regex_exp(address, zipcode):
    for x in address:
        if zipcode in x:
            postal_code = (re.search(r"[A-Z]{2}\s[0-9]{5}", x)).group(0)
            # returns "OG 56432"
            digits = (re.search(r"\d+", x)).group(0)
            # returns "686"
            address = (re.search(r"\D+", x)).group(0)
            # returns " Middle Street Pollocksville OG "
            print(postal_code + ":" + address + "| " + digits)
regex_exp(address, "OG 56432")
# prints OG 56432: Middle Street Pollocksville OG | 686
As you can see from my second paragraph, this is not the correct answer - I need the returned value to be
"OG 56432: Middle Street Pollocksville | 686"
How do I change the regex search for my address variable to exclude the two consecutive capital letters? I've tried things like
address = (re.search("?!\D+", x)).group(0)
to remove the two consecutive capitals, based on the question "A regular expression to exclude a word/string", but I think this is a step in the wrong direction.
PS: I understand there are easier ways to solve this, but I want to use regex to improve my fundamentals.

If you just want to remove the two consecutive capital letters that come right before the ZIP code (a 5-digit number), use this:
import re
text = "432 Main Long PC Market Road St. Louisville OG 43071"
address = re.sub(r'[A-Z]{2}\s(?=\d{5})', '', text)
print(address)
# Output: 432 Main Long PC Market Road St. Louisville 43071
To remove all occurrences of two consecutive capital letters, drop the lookahead:
import re
text = "432 Main Long PC Market Road St. Louisville OG 43071"
address = re.sub(r'\b[A-Z]{2}\s', '', text)
print(address)
# Output: 432 Main Long Market Road St. Louisville 43071

With re.sub() and capturing groups you can use:
import re
s = "686 Middle Street Pollocksville OG 56432"
re.sub(r"(\d+)\s+(.*)\s+([A-Z]+\s+\d+)", r"\3: \2 | \1", s)
Out: 'OG 56432: Middle Street Pollocksville | 686'
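The capturing-group idea can be folded back into the function from the question. This is only a sketch: it assumes every entry has the shape "number, street and city, two-letter state, 5-digit ZIP", and it returns a list instead of printing, which is my own choice and not part of the original code.

```python
import re

# Three groups: house number, street and city, state plus ZIP.
ADDRESS = re.compile(r"(\d+)\s+(.*)\s+([A-Z]{2}\s\d{5})")

def regex_exp(addresses, zipcode):
    """Return 'STATE ZIP: street city | number' for each matching address."""
    results = []
    for line in addresses:
        m = ADDRESS.fullmatch(line)
        if m and m.group(3) == zipcode:
            results.append(f"{m.group(3)}: {m.group(2)} | {m.group(1)}")
    return results

address = [
    "622 Gordon Lane St. Louisville OH 52071",
    "432 Main Long Road St. Louisville OH 43071",
    "686 Middle Street Pollocksville OG 56432",
]
print(regex_exp(address, "OG 56432"))
# ['OG 56432: Middle Street Pollocksville | 686']
```

Because the state and ZIP get their own group, the street part never contains the two capital letters, so nothing has to be removed afterwards.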

How can I extract address from raw text using NLTK in python?

I have this text
'''Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, New
York, NY 12345. Can you contact him now? If you need any help, call
me on 12345678'''
How can the address part be extracted from the text above using NLTK? I have tried the Stanford NER Tagger, which gives me only New York as a location. How can I solve this?

Definitely regular expressions :)
Something like
import re
txt = ...
regexp = "[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"
address = re.findall(regexp, txt)
# address = ['44 West 22nd Street, New York, NY 12345']
Explanation:
[0-9]{1,3}: 1 to 3 digits, the street number
(space): a space between the number and the street name
.+: the street name, one or more of any character
, (comma and space): the separator before the city
.+: the city, one or more of any character
, (comma and space): the separator before the state
[A-Z]{2}: exactly 2 uppercase letters from A to Z
[0-9]{5}: 5 digits
re.findall(expr, string) returns a list of all the matches found.
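One catch with the sample text from the question: the address wraps across a line break, and "." does not match a newline by default. A small sketch of the same pattern, assuming it is fine to collapse all whitespace before matching:

```python
import re

txt = '''Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, New
York, NY 12345. Can you contact him now? If you need any help, call
me on 12345678'''

# "." does not match a newline by default, so join the wrapped lines first.
flat = " ".join(txt.split())

regexp = r"[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"
address = re.findall(regexp, flat)
print(address)
# ['44 West 22nd Street, New York, NY 12345']
```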

Pyap works best, not just for this particular example but also for other addresses contained in text.
import pyap
text = ...
addresses = pyap.parse(text, country='US')

Check out libpostal, a library dedicated to parsing and normalizing addresses.
It cannot extract an address from raw text, but it may help with related tasks.

For US address extraction from bulk text:
For US addresses in bulk text I have had pretty good, though not perfect, luck with the regex below. It won't work on many unusual addresses, and it only captures the first 5 digits of the ZIP code.
Explanation:
([0-9]{1,5}) - a string of 1-5 digits to start off
(.{5,75}) - any character, 5-75 times. I looked at the addresses I was interested in, and the vast majority were over 5 and under 60 characters for address line 1, address line 2, and city combined.
(big list of American states and abbreviations) - this matches the state. It assumes state names are in Title Case.
.{1,2} - designed to accommodate the usual separators between the state and the ZIP: a comma followed by whitespace, or whitespace alone
([0-9]{5}) - captures the first 5 digits of the ZIP.
import re
text = "is an individual maintaining a residence at 175 Fox Meadow, Orchard Park, NY 14127. 2. other,"
address_regex = r"([0-9]{1,5})(.{5,75})((?:Ala(?:(?:bam|sk)a)|American Samoa|Arizona|Arkansas|(?:^(?!Baja )California)|Colorado|Connecticut|Delaware|District of Columbia|Florida|Georgia|Guam|Hawaii|Idaho|Illinois|Indiana|Iowa|Kansas|Kentucky|Louisiana|Maine|Maryland|Massachusetts|Michigan|Minnesota|Miss(?:(?:issipp|our)i)|Montana|Nebraska|Nevada|New (?:Hampshire|Jersey|Mexico|York)|North (?:(?:Carolin|Dakot)a)|Ohio|Oklahoma|Oregon|Pennsylvania|Puerto Rico|Rhode Island|South (?:(?:Carolin|Dakot)a)|Tennessee|Texas|Utah|Vermont|Virgin(?:ia| Island(s?))|Washington|West Virginia|Wisconsin|Wyoming|A[KLRSZ]|C[AOT]|D[CE]|FL|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]|N[CDEHJMVY]|O[HKR]|P[AR]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])).{1,2}([0-9]{5})"
addresses = re.findall(address_regex, text)
addresses is then: [('175', ' Fox Meadow, Orchard Park, ', 'NY', '', '14127')]
You can combine these and collapse the extra spaces like so:
for address in addresses:
    out_address = " ".join(address)
    out_address = " ".join(out_address.split())
To then break this into a proper line 1, line 2, etc., I suggest using an address validation API such as Google's or Lob's. These can take a string and break it into parts. There are also Python solutions for this, such as usaddress.
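The same pieces can be written with named groups, which makes the result easier to read and avoids the empty tuple entry that the inner capturing group produces. This is a simplified sketch: STATES is a deliberately short, hypothetical list standing in for the full alternation, and the group names are my own.

```python
import re

# Simplified sketch: STATES is a short stand-in for the full state list.
STATES = r"New York|NY|Ohio|OH|Texas|TX"
address_regex = re.compile(
    r"(?P<number>[0-9]{1,5})"      # house number
    r"(?P<street_city>.{5,75})"    # address lines and city
    r"(?P<state>" + STATES + r")"  # state name or abbreviation
    r".{1,2}"                      # separator such as ", " or " "
    r"(?P<zip>[0-9]{5})"           # first five digits of the ZIP
)

text = "is an individual maintaining a residence at 175 Fox Meadow, Orchard Park, NY 14127. 2. other,"

found = []
for m in address_regex.finditer(text):
    joined = " ".join(m.groupdict().values())
    found.append(" ".join(joined.split()))  # collapse repeated spaces
print(found)
# ['175 Fox Meadow, Orchard Park, NY 14127']
```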

Canadian postal code validation - python - regex

Below is the code I have written for a Canadian postal code validation script. It's supposed to read in a file:
123 4th Street, Toronto, Ontario, M1A 1A1
12456 Pine Way, Montreal, Quebec H9Z 9Z9
56 Winding Way, Thunder Bay, Ontario, D56 4A3
34 Cliff Drive, Bishop's Falls, Newfoundland B7E 4T
and output whether each postal code is valid or not. All of my postal codes are returned as invalid, when postal codes 1 and 2 are valid and 3 and 4 are invalid.
import re
filename = input("Please enter the name of the file containing the input Canadian postal code: ")
fo = open(filename, "r")
for line in fo:
    regex = '^(?!.*[DFIOQU])[A-VXY][0-9][A-Z]●?[0-9][A-Z][0-9]$'
    m = re.match(regex, line)
    if m is not None:
        print("Valid: ", line)
    else:
        print("Invalid: ", line)
fo.close()

I do not guarantee that I fully understand the format, but this seems to work:
\b(?!.{0,7}[DFIOQU])[A-VXY]\d[A-Z][^-\w\d]\d[A-Z]\d\b
You can also fix yours (at least for the example) with this change:
(?!.*[DFIOQU])[A-VXY][0-9][A-Z].?[0-9][A-Z][0-9]
(except that it accepts a hyphen, which is forbidden)
But in this case, an explicit pattern may be best:
\b[ABCEGHJ-NPRSTVXY]\d[ABCEGHJ-NPRSTV-Z]\s\d[ABCEGHJ-NPRSTV-Z]\d\b
It completes in about 1/4 of the steps of the others.
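For what it's worth, here is a small sketch that runs the explicit pattern against the four sample lines from the question. It uses re.search rather than re.match, because the postal code sits at the end of each line and not at the start. The lines are hard-coded here instead of read from a file, purely for illustration.

```python
import re

# Explicit pattern from the answer above; re.search lets it match the
# postal code at the end of a full address line.
postal = re.compile(r"\b[ABCEGHJ-NPRSTVXY]\d[ABCEGHJ-NPRSTV-Z]\s\d[ABCEGHJ-NPRSTV-Z]\d\b")

lines = [
    "123 4th Street, Toronto, Ontario, M1A 1A1",
    "12456 Pine Way, Montreal, Quebec H9Z 9Z9",
    "56 Winding Way, Thunder Bay, Ontario, D56 4A3",
    "34 Cliff Drive, Bishop's Falls, Newfoundland B7E 4T",
]

results = {line: bool(postal.search(line)) for line in lines}
for line, ok in results.items():
    print("Valid:  " if ok else "Invalid:", line)
```

The first two lines come out as valid and the last two as invalid, which is the result the question expected.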

This generic code can help you:
import re
PIN = input("Enter your Address")
PIN1 = PIN.upper()
if len(re.findall(r'[A-Z]{1}[0-9]{1}[A-Z]{1}\s*[0-9]{1}[A-Z]{1}[0-9]{1}', PIN1)) == 1:
    print("valid")
else:
    print("invalid")
Because we are taking input from the user, there is a good chance that the user types the postal code without a space or in lowercase letters. This code handles both cases:
1) Improper spacing
2) Lowercase letters

Understanding Pyparsing for street addresses

While searching for ways to build a better address locator for processing a single field address table I came across the Pyparsing module. On the Examples page there is a script called "streetAddressParser" (author unknown) that I've copied in full below. While I've read the documentation and looked at O'Reilly Recursive Descent Parser tutorials I'm still confused about the code for this address parser. I'm aware that this parser would represent just one component of an address locator application, but my Python experience is limited to GIS scripting and I'm struggling to understand certain parts of this code.
First, what is the purpose of defining numbers as "Zero One Two Three...Eleven Twelve Thirteen...Ten Twenty Thirty..."? If we know an address field starts with integers representing the street number why not just extract that as the first token?
Second, why does this script use so many bitwise operators (^, |, ~)? Is this because of performance gains or are they treated differently in the Pyparsing module? Could other operators be used in place of them and produce the same result?
I'm grateful for any guidance offered and I appreciate your patience in reading this.
Thank you!
from pyparsing import *
# define number as a set of words
units = oneOf("Zero One Two Three Four Five Six Seven Eight Nine Ten "
"Eleven Twelve Thirteen Fourteen Fifteen Sixteen Seventeen Eighteen Nineteen",
caseless=True)
tens = oneOf("Ten Twenty Thirty Forty Fourty Fifty Sixty Seventy Eighty Ninety",caseless=True)
hundred = CaselessLiteral("Hundred")
thousand = CaselessLiteral("Thousand")
OPT_DASH = Optional("-")
numberword = ((( units + OPT_DASH + Optional(thousand) + OPT_DASH +
Optional(units + OPT_DASH + hundred) + OPT_DASH +
Optional(tens)) ^ tens )
+ OPT_DASH + Optional(units) )
# number can be any of the forms 123, 21B, 222-A or 23 1/2
housenumber = originalTextFor( numberword | Combine(Word(nums) +
Optional(OPT_DASH + oneOf(list(alphas))+FollowedBy(White()))) +
Optional(OPT_DASH + "1/2")
)
numberSuffix = oneOf("st th nd rd").setName("numberSuffix")
streetnumber = originalTextFor( Word(nums) +
Optional(OPT_DASH + "1/2") +
Optional(numberSuffix) )
# just a basic word of alpha characters, Maple, Main, etc.
name = ~numberSuffix + Word(alphas)
# types of streets - extend as desired
type_ = Combine( MatchFirst(map(Keyword,"Street St Boulevard Blvd Lane Ln Road Rd Avenue Ave "
"Circle Cir Cove Cv Drive Dr Parkway Pkwy Court Ct Square Sq "
"Loop Lp".split())) + Optional(".").suppress())
# street name
nsew = Combine(oneOf("N S E W North South East West NW NE SW SE") + Optional("."))
streetName = (Combine( Optional(nsew) + streetnumber +
Optional("1/2") +
Optional(numberSuffix), joinString=" ", adjacent=False )
^ Combine(~numberSuffix + OneOrMore(~type_ + Combine(Word(alphas) + Optional("."))), joinString=" ", adjacent=False)
^ Combine("Avenue" + Word(alphas), joinString=" ", adjacent=False)).setName("streetName")
# PO Box handling
acronym = lambda s : Regex(r"\.?\s*".join(s)+r"\.?")
poBoxRef = ((acronym("PO") | acronym("APO") | acronym("AFP")) +
Optional(CaselessLiteral("BOX"))) + Word(alphanums)("boxnumber")
# basic street address
streetReference = streetName.setResultsName("name") + Optional(type_).setResultsName("type")
direct = housenumber.setResultsName("number") + streetReference
intersection = ( streetReference.setResultsName("crossStreet") +
( '#' | Keyword("and",caseless=True)) +
streetReference.setResultsName("street") )
streetAddress = ( poBoxRef("street")
^ direct.setResultsName("street")
^ streetReference.setResultsName("street")
^ intersection )
tests = """\
3120 De la Cruz Boulevard
100 South Street
123 Main
221B Baker Street
10 Downing St
1600 Pennsylvania Ave
33 1/2 W 42nd St.
454 N 38 1/2
21A Deer Run Drive
256K Memory Lane
12-1/2 Lincoln
23N W Loop South
23 N W Loop South
25 Main St
2500 14th St
12 Bennet Pkwy
Pearl St
Bennet Rd and Main St
19th St
1500 Deer Creek Lane
186 Avenue A
2081 N Webb Rd
2081 N. Webb Rd
1515 West 22nd Street
2029 Stierlin Court
P.O. Box 33170
The Landmark # One Market, Suite 200
One Market, Suite 200
One Market
One Union Square
One Union Square, Apt 22-C
""".split("\n")
# how to add Apt, Suite, etc.
suiteRef = (
oneOf("Suite Ste Apt Apartment Room Rm #", caseless=True) +
Optional(".") +
Word(alphanums+'-')("suitenumber"))
streetAddress = streetAddress + Optional(Suppress(',') + suiteRef("suite"))
for t in map(str.strip, tests):
    if t:
        #~ print("1234567890"*3)
        print(t)
        addr = streetAddress.parseString(t, parseAll=True)
        #~ # use this version for testing
        #~ addr = streetAddress.parseString(t)
        print("Number:", addr.street.number)
        print("Street:", addr.street.name)
        print("Type:", addr.street.type)
        if addr.street.boxnumber:
            print("Box:", addr.street.boxnumber)
        print(addr.dump())
        print()

In some addresses, the primary number is spelt out as a word, as you can see from a few addresses in their tests, near the end of the list. Your statement, "If we know an address field starts with integers representing the street number..." is a big "if". Many, many addresses do not start with a number.
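The operators are not doing bitwise arithmetic here. Pyparsing overloads them to combine parser expressions: + matches expressions in sequence (And), | takes the first alternative that matches (MatchFirst), ^ takes the longest matching alternative (Or), and ~ is a negative lookahead (NotAny). They are there for readability, not performance; you could build the same parser by calling those classes directly.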
It's refreshing to see a parser that attempts to parse street addresses without using a regular expression... also see this page about some of the challenges of parsing freeform addresses.
However, it's worth noting that this parser looks like it will miss a wide variety of addresses. It doesn't seem to consider some of the special address formats common in Utah, Wisconsin, and rural areas. It also is missing a significant number of secondary designators and street suffixes.
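On the question about ^, | and ~: in pyparsing these are overloaded operators that build combined parser expressions (Or, MatchFirst and NotAny), not bit manipulation. The toy class below sketches how a library can overload them. It is not pyparsing's implementation, and the Expr class and its description strings are invented for illustration.

```python
# Toy sketch of operator overloading; this is NOT pyparsing's implementation.
class Expr:
    def __init__(self, description):
        self.description = description

    def __add__(self, other):    # a + b  -> "a then b"
        return Expr(f"({self.description} then {other.description})")

    def __or__(self, other):     # a | b  -> "first alternative that matches"
        return Expr(f"({self.description} or-first {other.description})")

    def __xor__(self, other):    # a ^ b  -> "longest alternative that matches"
        return Expr(f"({self.description} or-longest {other.description})")

    def __invert__(self):        # ~a     -> "not followed by a"
        return Expr(f"not-{self.description}")

number = Expr("number")
suffix = Expr("suffix")
word = Expr("word")

# Same shape as "name = ~numberSuffix + Word(alphas)" in the parser above.
name = ~suffix + word
print(name.description)              # (not-suffix then word)
print((number | word).description)   # (number or-first word)
print((number ^ word).description)   # (number or-longest word)
```

Each operator returns a new expression object, which is why a grammar can be written as one readable formula.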

pywikipedia (python) regex to add string if lacking

I have a set of records like:
Name
Name Paul Berry: present
Address George Necky: not present
Name Bob van Basten: present
Name Richard Von Rumpy: not present
Name Daddy Badge: not present
Name Paul Berry: present
Street George Necky: not present
Street Bob van Basten: present
Name Richard Von Rumpy: not present
City Daddy Badge: not present
and I want that all the records beginning with Name be in the form
Name Name Surname: not present
leaving the records that begin with any other word untouched.
That is, I want to add the string "not" to the records beginning with Name where it is missing. I'm working with Python (pywikipediabot).
Trying
python replace.py -dotall -regex 'Name ((?!not ).*?)present' 'Name \1not present'
but it adds the "not" even where it is already present.
Perhaps I haven't understood the negative lookahead syntax?

Just look for : present and replace it with : not present.
Edit: Improved answer:
import re
for line in lines:
    m = re.match('^Name[^:]*: present', line)
    if m:
        print(re.sub(': present', ': not present', line))
    else:
        print(line)

You need a "negative look-behind" expression. This substitution will work:
'Name (.*)(?<!not )present' -> 'Name \1not present'
The .* matches everything between "Name" and "present", but the whole regexp matches only if "present" is not preceded by "not".
And are you sure you need -dotall? It looks like you want .* to match within a line only.

The following will do it:
re.sub(r'(Name.*?)(not )?present$', r'\1not present', s)
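If the records arrive as one multi-line string, a variant of the same idea works line by line. This is a sketch under that assumption: the (?m) flag and the [^:\n]* class are my additions, and the sample records are a subset of the ones in the question.

```python
import re

records = """\
Name Paul Berry: present
Address George Necky: not present
Name Richard Von Rumpy: not present
Street Bob van Basten: present"""

# (?m) makes ^ and $ match at every line; [^:\n]* keeps the match on one line.
fixed = re.sub(r"(?m)^(Name[^:\n]*: )(?:not )?present$", r"\1not present", records)
print(fixed)
```

Only the first record changes. Records that already say "not present" are rewritten to the same text, and records that start with another word are left alone.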
