Canadian postal code validation - python - regex

Below is the code I have written for a Canadian postal code validation script. It's supposed to read in a file:
123 4th Street, Toronto, Ontario, M1A 1A1
12456 Pine Way, Montreal, Quebec H9Z 9Z9
56 Winding Way, Thunder Bay, Ontario, D56 4A3
34 Cliff Drive, Bishop's Falls, Newfoundland B7E 4T
and output whether each postal code is valid or not. All of my postal codes come back as invalid, although postal codes 1 and 2 are valid and 3 and 4 are invalid.
import re
filename = input("Please enter the name of the file containing the input Canadian postal code: ")
fo = open(filename, "r")
for line in open(filename):
    regex = '^(?!.*[DFIOQU])[A-VXY][0-9][A-Z]●?[0-9][A-Z][0-9]$'
    m = re.match(regex, line)
    if m is not None:
        print("Valid: ", line)
    else: print("Invalid: ", line)
fo.close

I do not guarantee that I fully understand the format, but this seems to work:
\b(?!.{0,7}[DFIOQU])[A-VXY]\d[A-Z][^-\w\d]\d[A-Z]\d\b
You can also fix yours (at least for the example) with this change:
(?!.*[DFIOQU])[A-VXY][0-9][A-Z].?[0-9][A-Z][0-9]
(except that it accepts a hyphen, which is forbidden)
But in this case, an explicit pattern may be best:
\b[ABCEGHJ-NPRSTVXY]\d[ABCEGHJ-NPRSTV-Z]\s\d[ABCEGHJ-NPRSTV-Z]\d\b
It completes in about 1/4 of the steps of the others.
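As a rough sketch of how the explicit pattern could be applied to the four sample lines, the snippet below uses re.search instead of re.match. A likely reason the original script rejects everything is that re.match starts at the beginning of the line and the ^...$ anchors require the whole line to be a postal code, while the lookahead also trips over letters such as the O in Ontario. The helper name has_valid_postal_code and the inline sample list are illustrative assumptions, not part of the original script.

```python
import re

# Explicit pattern from the answer above: D, F, I, O, Q and U are excluded
# everywhere, and W and Z are additionally excluded from the first position.
POSTAL_CODE = re.compile(
    r"\b[ABCEGHJ-NPRSTVXY]\d[ABCEGHJ-NPRSTV-Z]\s\d[ABCEGHJ-NPRSTV-Z]\d\b"
)


def has_valid_postal_code(line):
    """Return True if the line contains a well-formed postal code (hypothetical helper)."""
    # search() scans the whole line; match() would only try position 0.
    return POSTAL_CODE.search(line) is not None


sample_lines = [
    "123 4th Street, Toronto, Ontario, M1A 1A1",
    "12456 Pine Way, Montreal, Quebec H9Z 9Z9",
    "56 Winding Way, Thunder Bay, Ontario, D56 4A3",
    "34 Cliff Drive, Bishop's Falls, Newfoundland B7E 4T",
]

for line in sample_lines:
    label = "Valid:  " if has_valid_postal_code(line) else "Invalid:"
    print(label, line)
```

With these inputs the first two lines should be reported as valid and the last two as invalid.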

This generic code can help you
import re
PIN = input("Enter your Address")
PIN1= PIN.upper()
if (len(re.findall(r'[A-Z]{1}[0-9]{1}[A-Z]{1}\s*[0-9]{1}[A-Z]{1}[0-9]{1}',PIN1)))==1:
    print("valid")
else:
    print("invalid")
Because the input comes from a user, there is a good chance the postal code will be typed without the space or in lower case. This code handles both cases:
1) Improper spacing
2) Lower-case letters
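The same idea can be taken one step further by returning the code in a canonical form. This is only a sketch under the same loose letter-digit-letter rule as the answer above (it does not exclude letters such as D or O), and the function name normalize_postal_code is an assumption made for illustration.

```python
import re

# Loose pattern from the answer above, with the two halves captured
# separately so they can be re-joined with exactly one space.
LOOSE_POSTAL_CODE = re.compile(r"([A-Z][0-9][A-Z])\s*([0-9][A-Z][0-9])")


def normalize_postal_code(user_input):
    """Return the first postal code as 'A1A 1A1', or None if none is found."""
    match = LOOSE_POSTAL_CODE.search(user_input.upper())
    if match is None:
        return None
    return f"{match.group(1)} {match.group(2)}"


print(normalize_postal_code("123 4th street, toronto, ontario, m1a1a1"))
```

Normalizing once at the input boundary means any stricter validation that follows only has to deal with one format.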

Related

RegEx for extracting specific variables and values

I am using Google Vision API to extract the text (handwritten plus computer-written) from images of application forms. The response is a long string like the following.
The string:
"A. Bank Challan
Bank Branch
ca
ABC muitce
Deposit ID VOSSÁETM-0055
Deposit Date 16 al 19
ate
B. Personal Information: Use CAPITAL letters and leave spaces between words.
Name: MUHAMMAD HANIE
Father's Name: MUHAMMAD Y AQOOB
Computerized NIC No. 44 603-5 284 355-3
D D M m rrrr
Gender: Male Age: (in years) 22 Date of Birth ( 4-08-1999
Domicile (District): Mirpuskhas Contact No. 0333-7078758
(Please do not mention converted No.)
Postal Address: Raheel Book Depo Naukot Taluka jhuddo Disstri mes.
Sindh.
Are You Government Servant: Yes
(If yes, please attach NOC)
No
✓
Religion: Muslim
✓
Non-Muslimo
C. Academic Information:
B
Intermediate/HSSC ENG Mirpuskhas Bise Match
Seience BISEmirpuskhas Match
2016
2014
Matric/SSC"
The whole response isn't useful for me, however I need to parse the response to get specific fields like Name, Father's Name, NIC No., Gender, Age, DoB, Domicile, and Contact No.
I am defining patterns for each of these fields using regular expression library (re) in Python. For example:
import re
name ='Name: \w+\s\w+'
fatherName = 'Father\'s Name: \w+\s\w+\s\w+'
age ='Age: \D+\d+'
print(re.search(name,string).group())
print(re.search(fatherName, string).group())
print(re.search(age,string).group())
Output:
"Name: MUHAMMAD HANIE
Father's Name: MUHAMMAD Y AQOOB
Age: (in years) 22"
However, these are not robust patterns, and I don't know whether this approach is a good one. I also cannot extract the fields that are on the same line, like Gender and Age.
How do I solve this problem?
It may not be robust, but it is possible to design an expression that extracts the three fields you want. You might want an expression with several boundaries:
(?=[A-Z])((Name:[A-Z-a-z\s]+\n|\s)|(Father\x27s\sName[A-Z-a-z\s\.]+\n|\s)|(Age:\s\(in\syears\)\s[0-9]+))
It might be good to focus on the text you wish to extract.
Variances
Age: This variable seems to be simple to extract
Name and Father's Name: You might want to check what the values of these two variables can look like, so that you can add the right characters to a character class. I've assumed that the class would be [A-Z-a-z\s\.]. However, you can change or simplify it as you wish.
Python Test
# -*- coding: UTF-8 -*-
import re
string = """
A. Bank Challan
Bank Branch
ca
ABC muitce
Deposit ID VOSSÁETM-0055
Deposit Date 16 al 19
ate
B. Personal Information: Use CAPITAL letters and leave spaces between words.
Name: MUHAMMAD HANIE
Father's Name: MUHAMMAD Y AQOOB
Computerized NIC No. 44 603-5 284 355-3
D D M m rrrr
Gender: Male Age: (in years) 22 Date of Birth ( 4-08-1999
Domicile (District): Mirpuskhas Contact No. 0333-7078758
(Please do not mention converted No.)
Postal Address: Raheel Book Depo Naukot Taluka jhuddo Disstri mes.
Sindh.
Are You Government Servant: Yes
(If yes, please attach NOC)
No
✓
Religion: Muslim
✓
Non-Muslimo
C. Academic Information:
B
Intermediate/HSSC ENG Mirpuskhas Bise Match
Seience BISEmirpuskhas Match
2016
2014
Matric/SSC"""
expression = r'(?=[A-Z])((Name:[A-Z-a-z\s]+\n|\s)|(Father\x27s\sName[A-Z-a-z\s\.]+\n|\s)|(Age:\s\(in\syears\)\s[0-9]+))'
match = re.search(expression, string)
if match:
    print("YAAAY! \"" + match.group(2) + "\" is a match 💚💚💚 ")
else:
    print('🙀 Sorry! No matches!')
Output
YAAAY! "Name: MUHAMMAD HANIE" is a match 💚💚💚
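For the fields that share a line, such as Gender and Age, one option is a separate pattern per field, each with a single capture group, searched independently. The sketch below assumes the labels appear exactly as in the OCR sample; the dictionary FIELD_PATTERNS and the function extract_fields are hypothetical names, and real OCR output will probably need looser patterns.

```python
import re

# One pattern per field; each captures only the value in group 1.
# Labels are copied from the OCR sample above and are an assumption
# about what the API returns for other forms.
FIELD_PATTERNS = {
    "name": r"^Name:\s*(.+)$",
    "father_name": r"^Father's Name:\s*(.+)$",
    "gender": r"Gender:\s*(\w+)",
    "age": r"Age:\s*\(in years\)\s*(\d+)",
    "domicile": r"Domicile \(District\):\s*(\w+)",
    "contact_no": r"Contact No\.\s*([\d-]+)",
}


def extract_fields(text):
    """Return a dict of field name -> extracted value (None if not found)."""
    fields = {}
    for key, pattern in FIELD_PATTERNS.items():
        # MULTILINE lets ^ and $ anchor to individual lines of the response.
        match = re.search(pattern, text, re.MULTILINE)
        fields[key] = match.group(1).strip() if match else None
    return fields


sample = """Name: MUHAMMAD HANIE
Father's Name: MUHAMMAD Y AQOOB
Gender: Male Age: (in years) 22 Date of Birth ( 4-08-1999
Domicile (District): Mirpuskhas Contact No. 0333-7078758
(Please do not mention converted No.)"""

for key, value in extract_fields(sample).items():
    print(key, "=", value)
```

Because each field is searched on its own, a label that the OCR mangles only costs that one field rather than breaking a single large expression.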

Regex to extract digits before word while ignoring certain lines

Using Python and pdf2text, I'm trying to extract a postcode from 4000-odd single-page PDF files I have received to print and mail. Unfortunately, I do not have access to the original files, so I can't adjust them at the point of creation.
My end goal is to rename all the PDF files to Postalcode_ExistingFilename.pdf so I can sort them for the postal network. I'll also need to combine PDFs for the same customer into one file, but that's another problem.
In the PDF we have the word "Dear" and the postal code is before that (albeit a few lines up):
04 Jul 2018
Mr Sam Sample
123 Sample Street
Sample Suburb
Sample City 1234
Dear Sam
I've managed to get it work with
(\d+)\s*Dear
until the number of address lines changes which causes the conversion to text to add a block of text between the Dear and postcode.
04 Jul 2018
Mr Sam Sample
123 Sample Street
Sample City 1234
PO Box 1234
Sample City
Phone: 01234567
Fax: 01234568
Email: email#email.com
Website: email.com
Dear Sam
I tried to get this working from the top by looking for the first 4-digit number excluding 2018, but any 4-digit street numbers were being matched, which isn't what I'm after.
Any advice you can give would be awesome.
You can use regular expression:
\b\d{4}$\b(?<!2018)
\b Open word boundary.
\d{4}$ Match exactly four digits at the end of line.
\b Close word boundary.
(?<!2018) Negative lookbehind to check that the group of four digits is not 2018.
The regular expression assumes, as per the comments, that the postcode occurs at the end of the line. If you are expecting different years, you can adjust the negative lookbehind to deal with additional years. For example:
(?<!2018|2017) will exclude 2017 or 2018.
(?<!201[0-9]) will exclude years from 2010 to 2019.
In Python you need to pass the re.MULTILINE flag so that $ matches at the end of every line rather than only at the end of the string.
>>> str = """04 Jul 2018
Mr Sam Sample
1235 Sample Street
Sample City 1234
PO Box 1237
Sample City
Phone: 01234567
Fax: 01234568
Email: email#email.com
Website: email.com
Dear Sam"""
>>> re.findall(r"\b\d{4}$\b(?<!2018)",str,re.MULTILINE)
['1234', '1237']
How about matching 4-digit numbers at the end of a line, but only on lines that don't contain the date (that is, lines that don't begin with a digit)?
import re
re.findall(r'^[^\d].*?\s+(\d{4})\s*$', data, re.MULTILINE)
# ['1234']
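Building on the first answer's pattern, the renaming step could look roughly like the sketch below. It assumes the customer's address always comes before the sender's contact block, so the first match is the postcode wanted. The function name postcode_filename and the file name letter_0001.pdf are made up for illustration, and reading the PDF text itself is left out.

```python
import re

# Pattern from the first answer: four digits at the end of a line,
# excluding the year 2018 (a postcode of 2018 would be skipped too).
POSTCODE = re.compile(r"\b\d{4}$\b(?<!2018)", re.MULTILINE)


def postcode_filename(page_text, existing_name):
    """Prefix existing_name with the first postcode found in page_text."""
    match = POSTCODE.search(page_text)
    if match is None:
        return existing_name  # leave the name alone if nothing matched
    return f"{match.group()}_{existing_name}"


page_text = """04 Jul 2018
Mr Sam Sample
123 Sample Street
Sample City 1234
PO Box 1234
Sample City
Phone: 01234567
Dear Sam"""

print(postcode_filename(page_text, "letter_0001.pdf"))
```

Returning the unchanged name when nothing matches keeps such files easy to spot for manual handling.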

How can I extract address from raw text using NLTK in python?

I have this text
'''Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, New
York, NY 12345. Can you contact him now? If you need any help, call
me on 12345678'''
How can the address part be extracted from the above text using NLTK? I have tried the Stanford NER Tagger, which gives me only New York as a location. How can I solve this?
Definitely regular expressions :)
Something like
import re
txt = ...
regexp = "[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"
address = re.findall(regexp, txt)
# address = ['44 West 22nd Street, New York, NY 12345']
Explanation:
[0-9]{1,3}: 1 to 3 digits, the address number
(space): a space between the number and the street name
.+: street name, any character for any number of occurrences
,: a comma and a space before the city
.+: city, any character for any number of occurrences
,: a comma and a space before the state
[A-Z]{2}: exactly 2 uppercase chars from A to Z
[0-9]{5}: 5 digits
re.findall(expr, string) will return an array with all the occurrences found.
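One caveat when trying this on the text from the question: the address wraps across a line break, and . does not match a newline by default, so the pattern finds nothing on the raw string. A minimal sketch of one workaround, collapsing whitespace before matching, is shown below; the variable name flat is an illustrative choice.

```python
import re

text = '''Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, New
York, NY 12345. Can you contact him now? If you need any help, call
me on 12345678'''

# '.' does not match a newline by default, so collapse every run of
# whitespace (including the line breaks) into a single space first.
flat = " ".join(text.split())

regexp = r"[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"
addresses = re.findall(regexp, flat)
print(addresses)
```

Passing re.DOTALL would also let . cross the line break, but the extracted address would then still contain the newline.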
Pyap works best not just for this particular example but also for other addresses contained in texts.
import pyap
text = ...
addresses = pyap.parse(text, country='US')
Check out libpostal, a library dedicated to address parsing and normalization.
It cannot extract an address from raw text, but it may help in related tasks.
For US address extraction from bulk text:
For US addresses in bulk text I have had pretty good luck, though not perfect, with the regex below. It won't work on many of the odd address formats, and it only captures the first 5 digits of the zip.
Explanation:
([0-9]{1,5}) - a string of 1-5 digits to start off
(.{5,75}) - any character 5-75 times. I looked at the addresses I was interested in, and the vast majority were over 5 and under 60 characters for address line 1, address line 2 and city.
(BIG LIST OF AMERICAN STATES AND ABBREVIATIONS) - this matches the state. It assumes state names are in Title Case.
.{1,2} - designed to accommodate the permutations of ,\s or just \s between the state and the zip
([0-9]{5}) - captures the first 5 digits of the zip.
import re
text = "is an individual maintaining a residence at 175 Fox Meadow, Orchard Park, NY 14127. 2. other,"
address_regex = r"([0-9]{1,5})(.{5,75})((?:Ala(?:(?:bam|sk)a)|American Samoa|Arizona|Arkansas|(?:^(?!Baja )California)|Colorado|Connecticut|Delaware|District of Columbia|Florida|Georgia|Guam|Hawaii|Idaho|Illinois|Indiana|Iowa|Kansas|Kentucky|Louisiana|Maine|Maryland|Massachusetts|Michigan|Minnesota|Miss(?:(?:issipp|our)i)|Montana|Nebraska|Nevada|New (?:Hampshire|Jersey|Mexico|York)|North (?:(?:Carolin|Dakot)a)|Ohio|Oklahoma|Oregon|Pennsylvania|Puerto Rico|Rhode Island|South (?:(?:Carolin|Dakot)a)|Tennessee|Texas|Utah|Vermont|Virgin(?:ia| Island(s?))|Washington|West Virginia|Wisconsin|Wyoming|A[KLRSZ]|C[AOT]|D[CE]|FL|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]|N[CDEHJMVY]|O[HKR]|P[AR]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])).{1,2}([0-9]{5})"
addresses = re.findall(address_regex, text)
addresses is then: [('175', ' Fox Meadow, Orchard Park, ', 'NY', '', '14127')]
You can combine these and remove spaces like so:
for address in addresses:
    out_address = " ".join(address)
    out_address = " ".join(out_address.split())
To then break this into a proper line 1, line 2 etc. I suggest using an address validation API like Google or Lob. These can take a string and break it into parts. There are also some python solutions for this like usaddress

How to Python split by a character yet maintain that character?

Google Maps results are often displayed thus:
'\n113 W 5th St\nEureka, MO, United States\n(636) 938-9310\n'
Another variation:
'Clayton Village Shopping Center, 14856 Clayton Rd\nChesterfield, MO, United States\n(636) 227-2844'
And another:
'Wildwood, MO\nUnited States\n(636) 458-7707'
Notice the variation in the placement of the \n characters.
I'm looking to extract the first X lines as address, and the last line as phone number. A regex such as (.*\n.*)\n(.*) would suffice for the first example, but falls short for the other two. The only thing I can rely on is that the phone number will be in the form (ddd) ddd-dddd.
I think a regex that will allow for each and every possible variation will be hard to come by. Is it possible to use split(), but maintain the character we have split by? So in this example, split by "(", to split out the address and phone number, but retain this character in the phone number? I could concatenate the "(" back into split("(")[1], but is there a neater way?
Don't use regex. Just split the string on the '\n'. The last index is a phone number, the other indexes are the address.
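import re
REGEX_PHONE_US = r'\(\d{3}\) \d{3}-\d{4}$'  # assumed pattern for (ddd) ddd-dddd
lines = inputString.strip().split('\n')  # strip() drops the leading/trailing '\n'
phone = lines[-1] if re.match(REGEX_PHONE_US, lines[-1]) else None
address = '\n'.join(lines[:-1]) if phone else inputString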
Python has a lot of great built in tools for manipulating strings in a more... human way... than regex allows.
If I understand you correctly, you want to "extract the first X lines as address". Assuming that all the addresses you need are in the US this regex code should work for you. In any case, it works on the 3 examples you provided:
import re
x = 'Wildwood, MO\nUnited States\n(636) 458-7707'
print(re.findall(r'.*\n+.*States', x))
The output is:
['Wildwood, MO\nUnited States']
If you want to print it later without the \n you can do it this way:
x = '\n113 W 5th St\nEureka, MO, United States\n(636) 938-9310\n'
y = re.findall(r'.*\n+.*States', x)
y = y[0].rstrip()
When you print y the output:
113 W 5th St
Eureka, MO, United States
And, if you want to extract the phone number separately you can do this:
tel = '\n113 W 5th St\nEureka, MO, United States\n(636) 938-9310\n'
num = re.findall(r'.*\d+\-\d+', tel)
num = num[0].rstrip()
When you print num the output:
(636) 938-9310
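On the original question of splitting while keeping the delimiter: re.split can split on a newline that is merely followed by the phone number, using a lookahead so the "(" is never consumed. The sketch below assumes the phone number is always the last line and always has the form (ddd) ddd-dddd, as stated in the question; the function name split_address_phone is illustrative.

```python
import re


def split_address_phone(raw):
    """Return (address, phone); phone is None if no trailing phone number is found."""
    # Split on the newline that precedes the phone number. The lookahead
    # (?=...) checks for the number without consuming it, so the "(" stays
    # attached to the phone part.
    parts = re.split(r"\n(?=\(\d{3}\) \d{3}-\d{4}$)", raw.strip(), maxsplit=1)
    address = parts[0]
    phone = parts[1] if len(parts) == 2 else None
    return address, phone


samples = [
    '\n113 W 5th St\nEureka, MO, United States\n(636) 938-9310\n',
    'Clayton Village Shopping Center, 14856 Clayton Rd\nChesterfield, MO, United States\n(636) 227-2844',
    'Wildwood, MO\nUnited States\n(636) 458-7707',
]

for raw in samples:
    address, phone = split_address_phone(raw)
    print(repr(address), "|", phone)
```

The address keeps its internal line breaks, so the number of address lines does not matter.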

Python, Canadian Address RegEx validation

I'm trying to write a Python script that validates Canadian addresses using RegEx.
For example this address is valid:
" 123 4th Street, Toronto, Ontario, M1A 1A1 "
But this one is not valid:
" 56 Winding Way, Thunder Bay, Ontario, D56 4A3"
I have tried many different combinations that follow the rules for Canadian postal codes, such as the rule that the last 6 alphanumeric characters cannot contain the letters D, F, I, O, Q, U, W, Z, but all entries come out as invalid. I also tried
" ('^[ABCEGHJKLMNPRSTVXY]{1}\d{1}[A-Z]{1} *\d{1}[A-Z]{1}\d{1}$') " but it is still invalid.
This is what I have so far:
import re
postalCode = " 123 4th Street, Toronto, Ontario, M1A 1A1 "
#read First Line
line = postalCode
#Validation Statement
test=re.compile('^\d{1}[ABCEGHJKLMNPRSTVXY]{1}\d{1}[A-Z]{1} *\d{1}[A-Z]{1}\d{1}$')
if test.match(line) is not None:
    print('Match found - valid Canadian address: ', line)
else:
    print('Error - no match - invalid Canadian address:', line)
Canadian postal codes can't contain the letters D, F, I, O, Q, or U, and cannot start with W or Z:
This will work for you:
import re
postalCode = " 123 4th Street, Toronto, Ontario, M1A 1A1 "
#read First Line
line = postalCode
if re.search("[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ] ?[0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]", line):
    print('Match found - valid Canadian address: ', line)
else:
    print('Error - no match - invalid Canadian address:', line)
WRONG - 56 Winding Way, Thunder Bay, Ontario, D56 4A3
CORRECT - 123 4th Street, Toronto, Ontario, M1A 1A1
Demo
https://ideone.com/OyVB9h
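To illustrate why the anchored attempts in the question reject every line, the sketch below compares a fully anchored pattern used with match() against one that is only anchored to the end of the stripped line and used with search(). The names anchored and at_end are illustrative, and requiring the postal code to be the last thing on the line is an assumption based on the two sample addresses.

```python
import re

valid_line = " 123 4th Street, Toronto, Ontario, M1A 1A1 "
invalid_line = " 56 Winding Way, Thunder Bay, Ontario, D56 4A3"

# The question's second attempt: ^...$ means the WHOLE string must be a
# postal code, so any line that starts with a street address fails.
anchored = re.compile(r"^[ABCEGHJKLMNPRSTVXY]\d[A-Z] *\d[A-Z]\d$")

# Anchored only at the end, with the forbidden letters excluded.
at_end = re.compile(
    r"[ABCEGHJKLMNPRSTVXY]\d[ABCEGHJ-NPRSTV-Z] ?\d[ABCEGHJ-NPRSTV-Z]\d$"
)

print(anchored.match(valid_line))             # no match: line starts with the street address
print(at_end.search(valid_line.strip()))      # matches the trailing postal code
print(at_end.search(invalid_line.strip()))    # no match: D is not an allowed letter
```

Stripping the line first matters here, because the sample string has a trailing space that would otherwise sit between the postal code and the $ anchor.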
It's been like this... forever:
/[A-Z][0-9][A-Z] ?[0-9][A-Z][0-9]/
If you want to restrict the first letter to only valid first letters then that's fine, but the rest is too complex to vary from that very much.
[ABCEGHJKLMNPRSTVXY]\d[A-Z] \d[A-Z]\d
Maybe it will work :D
