What I want to be able to do is compare a string against a lot of other strings to see which of those strings is the best match.
Currently I'm using re.search to get the matching label, which I then use to split the string and take the half I want:
company = re.search(
    r"Supplier Address:?|Supplier Identification:?|Supplier Name:?|Supplier:?|"
    r"Company Information:?|Company's Name:?|Manufacturer's Name|Manufacturer:?|"
    r"MANUFACTURER:?|Manufacturer Name:?",
    arg)
But this isn't really working out that well, especially because I have a couple of strings like this:
"SECTION 1 - MANUFACTURER'S INFORMATION Manufacturer Name HAYWARD
LABORATORIES Emergency"
I want
HAYWARD LABORATORIES
out of this string. The way I'm doing it now, it matches on MANUFACTURER, so I'm currently getting:
'S INFORMATION Manufacturer Name HAYWARD LABORATORIES
How do I fix this? And is there a better way to do this?
Thanks
EDIT:
Some more strings I'm dealing with:
"Identification of the company Lutex Company Limited 20/F., "
Lutex Company Limited
"Product and Company Information Product Name: Lip Balm Base Product Code: A462-BALM Client Code: 900 Company: Ni Hau Industrial Co., Ltd. Company Address:"
Ni Hau Industrial Co., Ltd.
If all of your sections follow the same pattern, Name FACTORY NAME, then you can try this:
import re
s = "SECTION 1 - MANUFACTURER'S INFORMATION Manufacturer Name HAYWARD LABORATORIES Emergency"
final_data = re.findall(r"(?<=Name\s)[A-Z]+\s[A-Z]+", s)
Output:
['HAYWARD LABORATORIES']
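If the label varies, as in the edited examples, one hedged alternative is to capture what follows a known label instead of splitting on the match. The label list and the stop words in the lookahead below are assumptions tailored to the three sample strings and would need tuning for real data:
import re

# assumed labels (from the question) and assumed stop tokens (Emergency,
# "Company Address:", a digit) that mark the end of the company name
labels = r"(?:Identification of the company|Company:|Manufacturer Name|Supplier Name:?)"
pattern = re.compile(labels + r"\s*([A-Z][\w.,'& ]+?)(?=\s+(?:Emergency|Company Address:)|\s+\d|$)")

tests = [
    "SECTION 1 - MANUFACTURER'S INFORMATION Manufacturer Name HAYWARD LABORATORIES Emergency",
    "Identification of the company Lutex Company Limited 20/F., ",
    "Company: Ni Hau Industrial Co., Ltd. Company Address:",
]
for s in tests:
    m = pattern.search(s)
    if m:
        print(m.group(1))  # HAYWARD LABORATORIES, Lutex Company Limited, Ni Hau Industrial Co., Ltd.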
You could use the fuzzywuzzy module to achieve some sort of fuzzy matching. Basically you compute a similarity score between two strings; the higher the score, the closer those strings are.
For example, let's say you have a list of strings and you are searching for the closest match. You would go as follows:
from fuzzywuzzy import fuzz

string_to_be_matched = 'string_sth'
list_of_strings = ['string_1', 'string_2', ..., 'string_n']

# store the index plus the similarity score for each string in list_of_strings
result = [(i, fuzz.ratio(string_to_be_matched, x)) for i, x in enumerate(list_of_strings)]
For more information about the fuzzywuzzy module, refer to the link.
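To pick out the single best candidate from that result list, a small sketch building on the code above:
# the highest ratio wins; best is an (index, score) pair
best_index, best_score = max(result, key=lambda pair: pair[1])
best_match = list_of_strings[best_index]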
Related
I have a corpus that looks something like this
LETTER AGREEMENT N°5 CHINA SOUTHERN AIRLINES COMPANY LIMITED Bai Yun
Airport, Guangzhou 510405, People's Republic of China Subject: Delays
CHINA SOUTHERN AIRLINES COMPANY LIMITED (the "Buyer") and AIRBUS
S.A.S. (the "Seller") have entered into a purchase agreement (the
"Agreement") dated as of even date
And a list of company names that looks like this
l = [ 'airbus', 'airbus internal', 'china southern airlines', ... ]
The elements of this list do not always have exact matches in the corpus, because of different formulations or just typos; for this reason I want to perform fuzzy matching.
What is the most efficient way of finding the best matches of l in the corpus? In theory the task is not super difficult, but I don't see a way of solving it that does not entail looping through both the corpus and the list of names, which could cause huge slowdowns.
You can concatenate your list l into a single regex expression, then use the regex module's approximate matching (https://github.com/mrabarnett/mrab-regex#approximate-fuzzy-matching-hg-issue-12-hg-issue-41-hg-issue-109) to fuzzy match the words in the corpus.
Something like
my_regex = ""
for pattern in l:
    # the fuzzy constraint goes after the group so it applies to the whole name;
    # {1<=e<=3} permits at least 1 and at most 3 errors
    my_regex += f'(?:{pattern})' + '{1<=e<=3}'
    my_regex += '|'
my_regex = my_regex[:-1]  # remove the last |
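A hedged usage sketch, assuming the third-party regex module is installed and that corpus holds the text from the question:
import regex

# BESTMATCH asks the engine for the match with the fewest errors
# rather than the first acceptable one
matches = regex.findall(my_regex, corpus, flags=regex.IGNORECASE | regex.BESTMATCH)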
I have a list of company names that I want to match against a list of sentences, and I want to get the start and end index if a keyword is present in any of the sentences.
I wrote the code for matching the keywords exactly, but realized that names in the sentences won't always be an exact match. For example, my keywords list can contain Company One Two Ltd but the sentences can be:
Company OneTwo Ltd won the auction
Company One Two Limited won the auction
The auction was won by Co. One Two Ltd and other variations
Given a company name, I want to find the start and end index even if the company name in the sentence is not an exact match but a variation. Below is the code I wrote for exact matching:
import re

def find_index(texts, target):
    idxs = []
    for i, each_sent in enumerate(texts):
        # (start, end) spans of every exact occurrence of target in this sentence
        add = [(m.start(0), m.end(0)) for m in re.finditer(target, each_sent)]
        if add:
            idxs.append([(i, start, end) for start, end in add])
    return idxs
I can think of 2-3 possibilities, all with varying pros/cons:
Create a more versatile regex:
(Company|Co\.?)\s?One\s?Two\s?(Limited|Ltd)
Building on the previous suggestion, iterate through the company list and build such a fuzzy pattern programmatically (see the sketch after this list):
Company -> (Company|Co\.?), ' ' -> \s?, Limited -> (Limited|Ltd), etc.
Use a Levenshtein distance calculator, as in the linked example, which relies on the external library fuzzywuzzy; there are alternatives such as fuzzy.
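A minimal sketch of the second option, assuming a hand-maintained substitution table (the entries below are illustrative and would need to grow with your data):
import re

# assumed substitution table: lower-cased word -> forgiving alternation
SUBSTITUTIONS = {
    "company": r"(?:Company|Co\.?)",
    "ltd": r"(?:Limited|Ltd\.?)",
    "limited": r"(?:Limited|Ltd\.?)",
}

def build_fuzzy_pattern(name):
    parts = []
    for word in name.split():
        parts.append(SUBSTITUTIONS.get(word.lower(), re.escape(word)))
    # \s? between words tolerates a missing space, e.g. "OneTwo"
    return r"\s?".join(parts)

pattern = build_fuzzy_pattern("Company One Two Ltd")
# the pattern now matches "Company One Two Limited", "Co. OneTwo Ltd", etc.,
# and can be passed as target to the find_index() function above.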
I am using the Google Vision API to extract text (handwritten plus machine-printed) from images of application forms. The response is a long string like the following.
The string:
"A. Bank Challan
Bank Branch
ca
ABC muitce
Deposit ID VOSSÁETM-0055
Deposit Date 16 al 19
ate
B. Personal Information: Use CAPITAL letters and leave spaces between words.
Name: MUHAMMAD HANIE
Father's Name: MUHAMMAD Y AQOOB
Computerized NIC No. 44 603-5 284 355-3
D D M m rrrr
Gender: Male Age: (in years) 22 Date of Birth ( 4-08-1999
Domicile (District): Mirpuskhas Contact No. 0333-7078758
(Please do not mention converted No.)
Postal Address: Raheel Book Depo Naukot Taluka jhuddo Disstri mes.
Sindh.
Are You Government Servant: Yes
(If yes, please attach NOC)
No
✓
Religion: Muslim
✓
Non-Muslimo
C. Academic Information:
B
Intermediate/HSSC ENG Mirpuskhas Bise Match
Seience BISEmirpuskhas Match
2016
2014
Matric/SSC"
The whole response isn't useful to me; I need to parse it to get specific fields like Name, Father's Name, NIC No., Gender, Age, DoB, Domicile, and Contact No.
I am defining a pattern for each of these fields using the regular expression library (re) in Python. For example:
import re

name = r"Name: \w+\s\w+"
fatherName = r"Father's Name: \w+\s\w+\s\w+"
age = r"Age: \D+\d+"

print(re.search(name, string).group())
print(re.search(fatherName, string).group())
print(re.search(age, string).group())
Output:
"Name: MUHAMMAD HANIE
Father's Name: MUHAMMAD Y AQOOB
Age: (in years) 22"
However, these are not robust patterns, and I don't know whether this approach is good or not. I also cannot extract the fields that are on the same line, like Gender and Age.
How do I solve this problem?
It may not be robust; however, it is possible to design an expression to extract the three parameters that you wish. This tool can help you do so. You might want an expression with several boundaries:
(?=[A-Z])((Name:[A-Z-a-z\s]+\n|\s)|(Father\x27s\sName[A-Z-a-z\s\.]+\n|\s)|(Age:\s\(in\syears\)\s[0-9]+))
It might be good to focus on the text you wish to extract.
Variations
Age: This variable seems to be simple to extract.
Name and Father's Name: You might want to check what the values of these two variables can look like, so that you can add the right characters to the character class. I've assumed [A-Z-a-z\s\.] here; however, you can change or simplify it as you wish.
RegEx Descriptive Graph
This link helps you visualize your expression:
Python Test
# -*- coding: UTF-8 -*-
import re
string = """
A. Bank Challan
Bank Branch
ca
ABC muitce
Deposit ID VOSSÁETM-0055
Deposit Date 16 al 19
ate
B. Personal Information: Use CAPITAL letters and leave spaces between words.
Name: MUHAMMAD HANIE
Father's Name: MUHAMMAD Y AQOOB
Computerized NIC No. 44 603-5 284 355-3
D D M m rrrr
Gender: Male Age: (in years) 22 Date of Birth ( 4-08-1999
Domicile (District): Mirpuskhas Contact No. 0333-7078758
(Please do not mention converted No.)
Postal Address: Raheel Book Depo Naukot Taluka jhuddo Disstri mes.
Sindh.
Are You Government Servant: Yes
(If yes, please attach NOC)
No
✓
Religion: Muslim
✓
Non-Muslimo
C. Academic Information:
B
Intermediate/HSSC ENG Mirpuskhas Bise Match
Seience BISEmirpuskhas Match
2016
2014
Matric/SSC"""
expression = r'(?=[A-Z])((Name:[A-Z-a-z\s]+\n|\s)|(Father\x27s\sName[A-Z-a-z\s\.]+\n|\s)|(Age:\s\(in\syears\)\s[0-9]+))'
match = re.search(expression, string)

if match:
    print("YAAAY! \"" + match.group(2) + "\" is a match 💚💚💚")
else:
    print('🙀 Sorry! No matches!')
Output
YAAAY! "Name: MUHAMMAD HANIE" is a match 💚💚💚
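For the fields that share a line, like Gender and Age, one hedged option is to capture both in a single pass with named groups; the label spellings below are taken from the sample string and are an assumption about how the OCR output looks:
# captures Gender and Age, which sit on the same line in the sample
same_line = re.search(r"Gender:\s*(?P<gender>\w+)\s+Age:\s*\(in years\)\s*(?P<age>\d+)", string)
if same_line:
    print(same_line.group("gender"))  # Male
    print(same_line.group("age"))     # 22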
I have this text
'''Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, New
York, NY 12345. Can you contact him now? If you need any help, call
me on 12345678'''
How can the address part be extracted from the above text using NLTK? I have tried the Stanford NER Tagger, which gives me only New York as a Location. How do I solve this?
Definitely regular expressions :)
Something like
import re

txt = ...
regexp = r"[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"
address = re.findall(regexp, txt)
# address = ['44 West 22nd Street, New York, NY 12345']
Explanation:
[0-9]{1,3}: 1 to 3 digits, the street number
(space): a space between the number and the street name
.+: street name, one or more of any character
, : a comma and a space before the city
.+: city, one or more of any character
, : a comma and a space before the state
[A-Z]{2}: exactly 2 uppercase letters from A to Z, the state
[0-9]{5}: 5 digits, the ZIP code
re.findall(expr, string) will return an array with all the occurrences found.
Pyap works well, not just for this particular example but also for other addresses contained in text:
import pyap

text = ...
addresses = pyap.parse(text, country='US')
print(addresses)
Check out libpostal, a library dedicated to address parsing and normalization.
It cannot extract an address from raw text, but it may help with related tasks, such as splitting an address into components once you have found it.
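A minimal sketch, assuming the Python bindings (the postal package) are installed on top of libpostal:
from postal.parser import parse_address

# labels each token of an already-extracted address string
print(parse_address('44 West 22nd Street, New York, NY 12345'))
# roughly: [('44', 'house_number'), ('west 22nd street', 'road'), ('new york', 'city'), ...]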
For US address extraction from bulk text:
For US addresses in bulk text I have had pretty good luck, though not perfect, with the regex below. It won't work on many of the odd address formats and only captures the first 5 digits of the ZIP.
Explanation:
([0-9]{1,5}) - a string of 1 to 5 digits to start off
(.{5,75}) - any character, 5 to 75 times. I looked at the addresses I was interested in and the vast majority were over 5 and under 60 characters for address line 1, address line 2 and the city.
(BIG LIST OF AMERICAN STATES AND ABBREVIATIONS) - this matches the state and assumes state names will be Title Case.
.{1,2} - designed to accommodate the permutations of ", " or just " " between the state and the ZIP
([0-9]{5}) - captures the first 5 digits of the ZIP.
import re

text = "is an individual maintaining a residence at 175 Fox Meadow, Orchard Park, NY 14127. 2. other,"
address_regex = r"([0-9]{1,5})(.{5,75})((?:Ala(?:(?:bam|sk)a)|American Samoa|Arizona|Arkansas|(?:^(?!Baja )California)|Colorado|Connecticut|Delaware|District of Columbia|Florida|Georgia|Guam|Hawaii|Idaho|Illinois|Indiana|Iowa|Kansas|Kentucky|Louisiana|Maine|Maryland|Massachusetts|Michigan|Minnesota|Miss(?:(?:issipp|our)i)|Montana|Nebraska|Nevada|New (?:Hampshire|Jersey|Mexico|York)|North (?:(?:Carolin|Dakot)a)|Ohio|Oklahoma|Oregon|Pennsylvania|Puerto Rico|Rhode Island|South (?:(?:Carolin|Dakot)a)|Tennessee|Texas|Utah|Vermont|Virgin(?:ia| Island(s?))|Washington|West Virginia|Wisconsin|Wyoming|A[KLRSZ]|C[AOT]|D[CE]|FL|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]|N[CDEHJMVY]|O[HKR]|P[AR]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])).{1,2}([0-9]{5})"
addresses = re.findall(address_regex, text)
addresses is then: [('175', ' Fox Meadow, Orchard Park, ', 'NY', '', '14127')]
You can combine the captured pieces and collapse extra spaces like so:
full_addresses = []
for address in addresses:
    out_address = " ".join(address)
    out_address = " ".join(out_address.split())  # collapse repeated whitespace
    full_addresses.append(out_address)
To then break this into a proper line 1, line 2, etc., I suggest using an address validation API like Google or Lob. These can take a string and break it into parts. There are also some Python packages for this, like usaddress.
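A small sketch with usaddress (hedged; the exact component labels depend on the library version):
import usaddress

# tag() returns an ordered mapping of components such as AddressNumber,
# StreetName, PlaceName, StateName and ZipCode, plus a detected address type
parts, address_type = usaddress.tag(out_address)
print(address_type, dict(parts))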
I have a list of descriptions and I want to extract the unit information using regular expressions.
I watched a video on regex and here's what I've got:
import re
x = ["Four 10-story towers - five 11-story residential towers around Lake Peterson - two 9-story hotel towers facing Devon Avenue & four levels of retail below the hotels",
"265 rental units",
"10 stories and contain 200 apartments",
"801 residential properties that include row homes, town homes, condos, single-family housing, apartments, and senior rental units",
"4-unit townhouse building (6,528 square feet of living space & 2,755 square feet of unheated garage)"]
unit = []
for item in x:
    extract = re.findall(r'[0-9]+.unit', item)
    unit.append(extract)

print(unit)
This works for strings that end in 'unit', but I also have strings that end with 'rental unit', 'apartment', 'bed', and others, as in this example.
I could do this with multiple regexes, but is there a way to do it within one regex?
Thanks!
As long as you're not afraid of making a hideously long regex, you could use something to the extent of:
compiled_re = re.compile(r"(\d*)-unit|(\d*)\srental unit|(\d*)\sbed|(\d*)\sapartment")

unit = []
for item in x:
    extract = re.findall(compiled_re, item)
    unit.append(extract)
You would have to extend the regex pattern with a new "|" followed by a search pattern for each possible way of referring to unit numbers. Unfortunately, if there is very low consistency in the entries, this approach becomes basically unusable.
Also, might I suggest using a regex tester like Regex101? It really helps in determining whether your regex will do what you want it to.
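If the set of unit words keeps growing, a hedged alternative (a sketch, with an assumed keyword list you would extend to your own data) is to build the alternation from a list so the pattern stays manageable:
# assumed unit keywords; longer phrases first as a general precaution with alternations
unit_words = ["rental unit", "unit", "bed", "apartment", "story", "stories"]
pattern = re.compile(r"(\d+)[-\s](?:" + "|".join(unit_words) + r")s?")

unit = [pattern.findall(item) for item in x]
print(unit)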