RegEx for extracting specific variables and values

I am using Google Vision API to extract the text (handwritten plus computer-written) from images of application forms. The response is a long string like the following.
The string:
"A. Bank Challan
Bank Branch
ca
ABC muitce
Deposit ID VOSSÁETM-0055
Deposit Date 16 al 19
ate
B. Personal Information: Use CAPITAL letters and leave spaces between words.
Name: MUHAMMAD HANIE
Father's Name: MUHAMMAD Y AQOOB
Computerized NIC No. 44 603-5 284 355-3
D D M m rrrr
Gender: Male Age: (in years) 22 Date of Birth ( 4-08-1999
Domicile (District): Mirpuskhas Contact No. 0333-7078758
(Please do not mention converted No.)
Postal Address: Raheel Book Depo Naukot Taluka jhuddo Disstri mes.
Sindh.
Are You Government Servant: Yes
(If yes, please attach NOC)
No
✓
Religion: Muslim
✓
Non-Muslimo
C. Academic Information:
B
Intermediate/HSSC ENG Mirpuskhas Bise Match
Seience BISEmirpuskhas Match
2016
2014
Matric/SSC"
The whole response isn't useful to me; I only need to parse it to get specific fields such as Name, Father's Name, NIC No., Gender, Age, DoB, Domicile, and Contact No.
I am defining patterns for each of these fields using the regular expression library (re) in Python. For example:
import re
name ='Name: \w+\s\w+'
fatherName = 'Father\'s Name: \w+\s\w+\s\w+'
age ='Age: \D+\d+'
print(re.search(name,string).group())
print(re.search(fatherName, string).group())
print(re.search(age,string).group())
Output:
"Name: MUHAMMAD HANIE
Father's Name: MUHAMMAD Y AQOOB
Age: (in years) 22"
However these are not robust patterns, and I don't know whether this approach is good or not. I also cannot extract the fields that are on same line, like Gender and Age.
How do I solve this problem?

It may not be robust; however, it is possible to design an expression that extracts the three parameters you want. You might want an expression with several alternatives:
(?=[A-Z])((Name:[A-Z-a-z\s]+\n|\s)|(Father\x27s\sName[A-Z-a-z\s\.]+\n|\s)|(Age:\s\(in\syears\)\s[0-9]+))
It might be good to focus on the text you wish to extract.
Variances
Age: This variable seems to be simple to extract.
Name and Father's Name: You might want to check what the values of these two variables can look like, so that you can add the characters they use to a character class. I've just assumed that this class would be [A-Z-a-z\s\.]. However, you can change or simplify it as you wish.
Python Test
# -*- coding: UTF-8 -*-
import re
string = """
A. Bank Challan
Bank Branch
ca
ABC muitce
Deposit ID VOSSÁETM-0055
Deposit Date 16 al 19
ate
B. Personal Information: Use CAPITAL letters and leave spaces between words.
Name: MUHAMMAD HANIE
Father's Name: MUHAMMAD Y AQOOB
Computerized NIC No. 44 603-5 284 355-3
D D M m rrrr
Gender: Male Age: (in years) 22 Date of Birth ( 4-08-1999
Domicile (District): Mirpuskhas Contact No. 0333-7078758
(Please do not mention converted No.)
Postal Address: Raheel Book Depo Naukot Taluka jhuddo Disstri mes.
Sindh.
Are You Government Servant: Yes
(If yes, please attach NOC)
No
✓
Religion: Muslim
✓
Non-Muslimo
C. Academic Information:
B
Intermediate/HSSC ENG Mirpuskhas Bise Match
Seience BISEmirpuskhas Match
2016
2014
Matric/SSC"""
expression = r'(?=[A-Z])((Name:[A-Z-a-z\s]+\n|\s)|(Father\x27s\sName[A-Z-a-z\s\.]+\n|\s)|(Age:\s\(in\syears\)\s[0-9]+))'
match = re.search(expression, string)
if match:
    print("YAAAY! \"" + match.group(2) + "\" is a match 💚💚💚 ")
else:
    print('🙀 Sorry! No matches!')
Output
YAAAY! "Name: MUHAMMAD HANIE" is a match 💚💚💚
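
Fields that share a line, such as Gender and Age, can be handled by giving every field its own small pattern, anchored on the field label, with a capture group for the value. The following is only a minimal sketch: FIELD_PATTERNS and extract_fields are hypothetical names, and the patterns assume the labels come back from OCR spelled exactly as in the sample above.

```python
import re

# Shortened copy of the OCR text from the question.
sample = """Name: MUHAMMAD HANIE
Father's Name: MUHAMMAD Y AQOOB
Computerized NIC No. 44 603-5 284 355-3
D D M m rrrr
Gender: Male Age: (in years) 22 Date of Birth ( 4-08-1999
Domicile (District): Mirpuskhas Contact No. 0333-7078758
"""

# One pattern per field; group 1 is the value (assumed label spellings).
FIELD_PATTERNS = {
    "name": r"^Name:[ \t]*(.+)$",
    "father_name": r"^Father's Name:[ \t]*(.+)$",
    "nic": r"NIC No\.[ \t]*([\d\- ]+\d)",
    "gender": r"Gender:[ \t]*(\w+)",
    "age": r"Age:\D*(\d+)",
    "date_of_birth": r"Date of Birth\D*(\d{1,2}-\d{1,2}-\d{4})",
    "domicile": r"Domicile \(District\):[ \t]*(.+?)[ \t]+Contact No\.",
    "contact": r"Contact No\.[ \t]*([\d-]+)",
}


def extract_fields(text):
    """Return a dict of field name -> value, or None when a label is missing."""
    result = {}
    for field, pattern in FIELD_PATTERNS.items():
        # re.MULTILINE lets ^ and $ anchor on each line of the OCR text.
        match = re.search(pattern, text, re.MULTILINE)
        result[field] = match.group(1).strip() if match else None
    return result


fields = extract_fields(sample)
for key, value in fields.items():
    print(key, "=", value)
```

Returning None for a missing label keeps one badly recognised field from breaking the rest of the form.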

Related

Regex to extract digits before word while ignoring certain lines

Using Python and pdf2text, I'm trying to extract a postcode from the 4,000-odd single-page PDF files I have received to print and mail. Unfortunately I do not have access to the original files, so I can't adjust them at creation time.
My end goal here is to rename all the PDF files to Postalcode_ExistingFilename.pdf so I can sort them for the postal network. I'll also need to combine PDFs for the same customer into one file, but that's another problem.
In the PDF we have the word "Dear" and the postal code is before that (albeit a few lines up):
04 Jul 2018
Mr Sam Sample
123 Sample Street
Sample Suburb
Sample City 1234
Dear Sam
I've managed to get it work with
(\d+)\s*Dear
until the number of address lines changes which causes the conversion to text to add a block of text between the Dear and postcode.
04 Jul 2018
Mr Sam Sample
123 Sample Street
Sample City 1234
PO Box 1234
Sample City
Phone: 01234567
Fax: 01234568
Email: email#email.com
Website: email.com
Dear Sam
I tried to get this working from the top by looking for the first 4-digit number other than 2018; however, any 4-digit street numbers were being matched, which isn't what I'm after.
Any advice you can give would be awesome.

You can use regular expression:
\b\d{4}$\b(?<!2018)
\b Open word boundary.
\d{4}$ Match exactly four digits at the end of line.
\b Close word boundary.
(?<!2018) Negative lookbehind to check that the group of four digits is not 2018.
The regular expression is based on the assumption, as per the comments, that the postcode occurs at the end of the line. If you are expecting different years, you can simply adjust the negative lookbehind to deal with additional years. For example:
(?<!2018|2017) will exclude 2017 or 2018.
(?<!201[0-9]) will exclude years from 2010 to 2019.
In Python you need to specify the re.MULTILINE flag so that the end-of-line assertion $ matches at the end of every line rather than only at the end of the string.
>>> import re
>>> text = """04 Jul 2018
Mr Sam Sample
1235 Sample Street
Sample City 1234
PO Box 1237
Sample City
Phone: 01234567
Fax: 01234568
Email: email#email.com
Website: email.com
Dear Sam"""
>>> re.findall(r"\b\d{4}$\b(?<!2018)", text, re.MULTILINE)
['1234', '1237']

How about matching 4-digit numbers at the end of a line, but only on lines that don't contain a date (that is, lines that don't begin with a number)?
import re
re.findall(r'^[^\d].*?\s+(\d{4})\s*$', data, re.MULTILINE)
# ['1234']
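
To tie this back to the renaming goal, here is a rough sketch that takes only the text above the salutation, picks the first 4-digit number ending a line that does not start with a digit, and builds the new file name. The helper names find_postcode and renamed are made up for illustration, and the sketch assumes the postcode always appears above "Dear" on a line that begins with a letter.

```python
import re

# Assumption: the postcode is the first 4-digit number that ends a line
# which does not start with a digit (so dates and street numbers are skipped).
POSTCODE_RE = re.compile(r"^[^\d\n].*[ \t](\d{4})[ \t]*$", re.MULTILINE)


def find_postcode(page_text):
    """Return the postcode found above the 'Dear' salutation, or None."""
    address_block = page_text.split("Dear", 1)[0]
    match = POSTCODE_RE.search(address_block)
    return match.group(1) if match else None


def renamed(postcode, existing_filename):
    """Build the Postalcode_ExistingFilename.pdf style name."""
    return f"{postcode}_{existing_filename}"


sample = """04 Jul 2018
Mr Sam Sample
123 Sample Street
Sample City 1234
PO Box 1234
Sample City
Phone: 01234567
Fax: 01234568
Dear Sam"""

postcode = find_postcode(sample)
print(renamed(postcode, "letter.pdf"))
```

Using search rather than findall returns the customer's postcode and ignores the later PO Box number in the sender block.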

Compare multiple strings to find best match

What I want to do is compare a string with a lot of other strings to see which of those strings is the best match.
Currently I'm using re.search to find the matching string, which I then use to split the input and take the half I want:
company = re.search("Supplier Address:?|Supplier Identification:?|Supplier Name:?|Supplier:?|Company Information:?|Company's Name:?|Manufacturer's Name|Manufacturer:?|MANUFACTURER:?|Manufacturer Name:?", arg)
But this isn't really working out that well, especially because I have a couple of strings like this:
"SECTION 1 - MANUFACTURER'S INFORMATION Manufacturer Name HAYWARD LABORATORIES Emergency"
I want
HAYWARD LABORATORIES
out of this string. The way I'm doing it now, it matches MANUFACTURER first, so I currently get:
'S INFORMATION Manufacturer Name HAYWARD LABORATORIES
How do I fix this? And Is there a better way to do this?
Thanks
EDIT:
Some more strings I'm dealing with:
"Identification of the company Lutex Company Limited 20/F., "
Lutex Company Limited
"Product and Company Information Product Name: Lip Balm Base Product Code: A462-BALM Client Code: 900 Company: Ni Hau Industrial Co., Ltd. Company Address:"
Ni Hau Industrial Co., Ltd.

If all of your sections are the same in terms of the pattern Name FACTORY NAME, then you can try this:
import re
s = "SECTION 1 - MANUFACTURER'S INFORMATION Manufacturer Name HAYWARD LABORATORIES Emergency"
final_data = re.findall(r"(?<=Name\s)[A-Z]+\s[A-Z]+", s)
Output:
['HAYWARD LABORATORIES']

You could use the fuzzywuzzy module to do fuzzy matching. Basically, you calculate a similarity score between two strings (fuzz.ratio returns a value from 0 to 100); the higher the score, the closer the strings are.
For example, if you have a list of strings and are searching for the closest match, you would go as follows:
from fuzzywuzzy import fuzz
string_to_be_matched = 'string_sth'
list_of_strings = ['string_1', 'string_2',.., 'string_n']
# store the index plus the similarity score for each string in list_of_strings
result = [(i, fuzz.ratio(string_to_be_matched, x)) for i, x in enumerate(list_of_strings)]
For more information about the fuzzywuzzy module, refer to its documentation.
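
If installing a third-party package is not an option, the same idea can be sketched with the standard library's difflib instead of fuzzywuzzy. This is a swapped-in technique, not the fuzzywuzzy API: best_match is a hypothetical helper, and SequenceMatcher.ratio() returns a similarity between 0.0 and 1.0 rather than 0 to 100.

```python
from difflib import SequenceMatcher


def best_match(query, candidates):
    """Return the candidate most similar to query (case-insensitive)."""
    def score(candidate):
        # ratio() is 1.0 for identical strings and approaches 0.0 for unrelated ones.
        return SequenceMatcher(None, query.lower(), candidate.lower()).ratio()
    return max(candidates, key=score)


# Hypothetical label list, taken from the alternatives in the question.
labels = ["Supplier Address", "Manufacturer's Name", "Company Information"]
closest = best_match("Manufacturer Name", labels)
print(closest)
```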

Regex for negation of three # followed by number and three # at end

I need to build a regex that is the negation of the following pattern: three # symbols at the beginning, followed by a number of between 1 and 12 digits, and ending with three # symbols. Anything other than that should be selected.
Basically, my challenge is that I have a dataframe with a text corpus containing values in the pattern ###0-9###, and I want to remove everything except this pattern. I have been able to develop the regex [#][#][#]\d{1,12}[#][#][#]; however, I want the negation of this pattern, because I want to do a find and replace. For example,
my name is x and i work at ###12354### and i am happy with my job. what is your company name? is it ###42334###? you look happy as well!!
should return ###12354### ###42334###. It would be great to have a space delimiter between the individual elements thus fetched. Any help?
I will be using this regex in a Python pandas dataframe with the str.replace function.
I have tried regexr.com and regex101.com and have come this far.
Edit: Below is the data:
SNo details
1 account ###0000082569### / department stores uk & ie credit control operations
2 academic ###0000060910### , administrative, and ###0000039198### liaison coordinator
3 account executive, financial ###0000060910### , enterprise and partner group
4 2015-nasa summer internship- space power system ###0000129849### and testing
5 account technical ###0000185187### , technical presales, systems engineer
6 account ###0000082569### for car, van & 4x4 products in the east of england
7 account ###0000082569### for mikro segment and owners of the enterprises
8 account ###0000082569### - affinity digital display, mobile & publishing
9 account ###0000082569### ###0000060905### -energy and commodities ###0000086889### candidate
10 account ###0000082569### for companies department of external relevance

Here is what I meant in my comment:
>>> import pandas as pd
>>> df = pd.DataFrame({'col1':['at ###12354### and i am happy with my job. what is your company name? is it ###42334###? you look happy as well!!', 'at ###222### and t ###888888###?' ]})
>>> df['col1'].str.findall(r'#{3}\d+#{3}').apply(' '.join)
0 ###12354### ###42334###
1 ###222### ###888888###
The #{3}\d+#{3} will match any 1+ digits enclosed with 3 # symbols and .findall will extract all matches. .apply(' '.join) will join the values with a space.

Instead of a replace with a complicated regex, you can use join with findall and a simpler regex like this:
>>> import re
>>> text = 'my name is x and i work at ###12354### and i am happy with my job. what is your company name? is it ###42334###? you look happy as well!!'
>>> ' '.join(re.findall(r'#{3}\d{1,12}#{3}', text))
'###12354### ###42334###'
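
The same find-and-join approach can be tried on a few rows of the sample data with plain Python before moving it into pandas. This is a sketch only: keep_tokens is a hypothetical helper, and the two lookarounds are an extra assumption that a token should not be part of a longer run of # symbols.

```python
import re

# Exactly three #, 1 to 12 digits, exactly three # (no extra # on either side).
TOKEN_RE = re.compile(r"(?<!#)#{3}\d{1,12}#{3}(?!#)")


def keep_tokens(text):
    """Drop everything except ###digits### tokens, joined by single spaces."""
    return " ".join(TOKEN_RE.findall(text))


rows = [
    "account ###0000082569### / department stores uk & ie credit control operations",
    "academic ###0000060910### , administrative, and ###0000039198### liaison coordinator",
    "account ###0000082569### ###0000060905### -energy and commodities ###0000086889### candidate",
]

cleaned = [keep_tokens(row) for row in rows]
for line in cleaned:
    print(line)
```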

How can I extract address from raw text using NLTK in python?

I have this text
'''Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, New
York, NY 12345. Can you contact him now? If you need any help, call
me on 12345678'''
How can the address part be extracted from the above text using NLTK? I have tried the Stanford NER Tagger, which gives me only New York as a Location. How can I solve this?

Definitely regular expressions :)
Something like
import re
txt = ...
regexp = "[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"
address = re.findall(regexp, txt)
# address = ['44 West 22nd Street, New York, NY 12345']
Explanation:
[0-9]{1,3}: 1 to 3 digits, the address number
(space): a space between the number and the street name
.+: street name, any character for any number of occurrences
, (comma, space): a comma and a space before the city
.+: city, any character for any number of occurrences
, (comma, space): a comma and a space before the state
[A-Z]{2}: exactly 2 uppercase chars from A to Z
[0-9]{5}: 5 digits
re.findall(expr, string) will return a list with all the occurrences found.
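
One detail to watch with the text from the question: it wraps between "New" and "York", and by default . does not match a newline, so the pattern cannot span the line break. A minimal sketch, assuming it is acceptable to collapse all whitespace to single spaces before matching:

```python
import re

txt = '''Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, New
York, NY 12345. Can you contact him now? If you need any help, call
me on 12345678'''

ADDRESS_RE = re.compile(r"[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}")

# Collapse newlines and repeated spaces so the address sits on one line.
flat = " ".join(txt.split())

addresses = ADDRESS_RE.findall(flat)
print(addresses)
```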

Pyap works best not just for this particular example but also for other addresses contained in texts.
import pyap
text = ...
addresses = pyap.parse(text, country='US')

Check out libpostal, a library dedicated to address parsing and normalization.
It cannot extract an address from raw text, but it may help with related tasks.

For US address extraction from bulk text:
For US addresses in bulk text I have had pretty good luck, though not perfect, with the regex below. It won't work on many of the odder address formats, and it captures only the first 5 digits of the ZIP code.
Explanation:
([0-9]{1,5}) - a string of 1-5 digits to start off
(.{5,75}) - Any character 5-75 times. I looked at the addresses I was interested in, and the vast majority were over 5 and under 60 characters for address line 1, address line 2 and city.
(BIG LIST OF AMERICAN STATES AND ABBREVIATIONS) - This is to match on states. Assumes state names will be in Title Case.
.{1,2} - designed to accommodate many permutations of ,\s or just \s between the state and the ZIP code
([0-9]{5}) - captures the first 5 digits of the ZIP code.
import re
text = "is an individual maintaining a residence at 175 Fox Meadow, Orchard Park, NY 14127. 2. other,"
address_regex = r"([0-9]{1,5})(.{5,75})((?:Ala(?:(?:bam|sk)a)|American Samoa|Arizona|Arkansas|(?:(?<!Baja )California)|Colorado|Connecticut|Delaware|District of Columbia|Florida|Georgia|Guam|Hawaii|Idaho|Illinois|Indiana|Iowa|Kansas|Kentucky|Louisiana|Maine|Maryland|Massachusetts|Michigan|Minnesota|Miss(?:(?:issipp|our)i)|Montana|Nebraska|Nevada|New (?:Hampshire|Jersey|Mexico|York)|North (?:(?:Carolin|Dakot)a)|Ohio|Oklahoma|Oregon|Pennsylvania|Puerto Rico|Rhode Island|South (?:(?:Carolin|Dakot)a)|Tennessee|Texas|Utah|Vermont|Virgin(?:ia| Island(s?))|Washington|West Virginia|Wisconsin|Wyoming|A[KLRSZ]|C[AOT]|D[CE]|FL|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]|N[CDEHJMVY]|O[HKR]|P[AR]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])).{1,2}([0-9]{5})"
addresses = re.findall(address_regex, text)
addresses is then: [('175', ' Fox Meadow, Orchard Park, ', 'NY', '', '14127')]
You can combine these and remove spaces like so:
for address in addresses:
    out_address = " ".join(address)
    out_address = " ".join(out_address.split())
To then break this into a proper line 1, line 2, etc., I suggest using an address validation API like Google or Lob. These can take a string and break it into parts. There are also some Python solutions for this, such as usaddress.

Canadian postal code validation - python - regex

Below is the code I have written for a Canadian postal code validation script. It's supposed to read in a file:
123 4th Street, Toronto, Ontario, M1A 1A1
12456 Pine Way, Montreal, Quebec H9Z 9Z9
56 Winding Way, Thunder Bay, Ontario, D56 4A3
34 Cliff Drive, Bishop's Falls, Newfoundland B7E 4T
and output whether the postal code is valid or not. All of my postal codes are being reported as invalid, whereas postal codes 1 and 2 are valid and 3 and 4 are invalid.
import re
filename = input("Please enter the name of the file containing the input Canadian postal code: ")
fo = open(filename, "r")
for line in open(filename):
    regex = '^(?!.*[DFIOQU])[A-VXY][0-9][A-Z]●?[0-9][A-Z][0-9]$'
    m = re.match(regex, line)
    if m is not None:
        print("Valid: ", line)
    else: print("Invalid: ", line)
fo.close

I do not guarantee that I fully understand the format, but this seems to work:
\b(?!.{0,7}[DFIOQU])[A-VXY]\d[A-Z][^-\w\d]\d[A-Z]\d\b
You can also fix yours (at least for the example) with this change:
(?!.*[DFIOQU])[A-VXY][0-9][A-Z].?[0-9][A-Z][0-9]
(except that it accepts a hyphen, which is forbidden)
But in this case, an explicit pattern may be best:
\b[ABCEGHJ-NPRSTVXY]\d[ABCEGHJ-NPRSTV-Z]\s\d[ABCEGHJ-NPRSTV-Z]\d\b
Which completes in 1/4 the steps of the others.

This generic code can help you:
import re
PIN = input("Enter your Address")
PIN1 = PIN.upper()
if len(re.findall(r'[A-Z]{1}[0-9]{1}[A-Z]{1}\s*[0-9]{1}[A-Z]{1}[0-9]{1}', PIN1)) == 1:
    print("valid")
else:
    print("invalid")
Because we are taking input from the user, there is a good chance that the user will type the postal code without a space or in lower-case letters. This code handles both cases:
1) Improper spacing
2) Lower-case letters
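
For the file from the question, each line starts with the street address, and re.match only matches at the beginning of the string, which is one likely reason every line was reported as invalid. The sketch below uses re.search with the explicit letter classes from the answer above, anchored to the end of the line. The optional single space is my own assumption, and has_valid_postal_code is a hypothetical helper name.

```python
import re

# Letter classes taken from the explicit pattern in the earlier answer;
# the optional space and the end-of-line anchor are assumptions.
POSTAL_RE = re.compile(
    r"\b[ABCEGHJ-NPRSTVXY]\d[ABCEGHJ-NPRSTV-Z] ?\d[ABCEGHJ-NPRSTV-Z]\d$"
)


def has_valid_postal_code(line):
    """True when the line ends with a well-formed Canadian postal code."""
    # strip() removes the trailing newline that lines read from a file carry.
    return POSTAL_RE.search(line.strip()) is not None


lines = [
    "123 4th Street, Toronto, Ontario, M1A 1A1\n",
    "12456 Pine Way, Montreal, Quebec H9Z 9Z9\n",
    "56 Winding Way, Thunder Bay, Ontario, D56 4A3\n",
    "34 Cliff Drive, Bishop's Falls, Newfoundland B7E 4T\n",
]

results = [has_valid_postal_code(line) for line in lines]
for line, ok in zip(lines, results):
    print("Valid:  " if ok else "Invalid:", line.strip())
```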
