fuzzy matching lots of strings - python

I've got a database with property owners; I would like to count the number of properties owned by each person, but am running into standard mismatch problems:
REDEVELOPMENT AUTHORITY vs. REDEVELOPMENT AUTHORITY O vs. PHILADELPHIA REDEVELOPMEN vs. PHILA. REDEVELOPMENT AUTH
COMMONWEALTH OF PENNA vs. COMMONWEALTH OF PENNSYLVA vs. COMMONWEALTH OF PA
TRS UNIV OF PENN vs. TRUSTEES OF THE UNIVERSIT
From what I've seen, this is a pretty common problem, but my problem differs from those with solutions I've seen for two reasons:
1) I've got a large number of strings (~570,000), so computing the 570000 x 570000 matrix of edit distances (or other pairwise match metrics) seems like a daunting use of resources
2) I'm not focused on one-off comparisons (e.g., matching user input against a database on file, which is the most common case in the big-data fuzzy matching questions I've seen). I have one fixed data set that I want to condense once and for all.
Are there any well-established routines for such an exercise? I'm most familiar with Python and R, so an approach in either of those would be ideal, but since I only need to do this once, I'm open to branching out to other, less familiar languages (perhaps something in SQL?) for this particular task.

That is exactly what I am facing at my new job daily (but my line counts are a few million). My approach is to:
1) find a set of unique strings by using p = unique(a)
2) remove punctuation, split strings in p by whitespaces, make a table of words' frequencies, create a set of rules and use gsub to "recover" abbreviations, mistyped words, etc. E.g. in your case "AUTH" should be recovered back to "AUTHORITY", "UNIV" -> "UNIVERSITY" (or vice versa)
3) recover typos if I spot them by eye
4) advanced: reorder words in strings (often producing improper English) to see whether two or more strings are identical apart from word order (e.g. "10pack 10oz" and "10oz 10pack"). A rough Python sketch of this workflow follows below.
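A minimal Python sketch of that workflow, assuming the owner names sit in a plain list; the sample names and the abbreviation rules here are just illustrations of the kind of table you would build from the word-frequency counts:

import re
from collections import Counter

owners = ["PHILA. REDEVELOPMENT AUTH", "REDEVELOPMENT AUTHORITY",
          "TRS UNIV OF PENN", "COMMONWEALTH OF PENNA"]   # illustrative sample

# illustrative rules; in practice you grow these by eyeballing the frequency table
rules = {"AUTH": "AUTHORITY", "UNIV": "UNIVERSITY", "TRS": "TRUSTEES",
         "PENNA": "PENNSYLVANIA", "PENN": "PENNSYLVANIA", "PHILA": "PHILADELPHIA"}

def canonical(name):
    # step 2: strip punctuation, split on whitespace, expand abbreviations
    words = re.sub(r"[^\w\s]", " ", name).upper().split()
    words = [rules.get(w, w) for w in words]
    # step 4: sort the words so that word order no longer matters
    return " ".join(sorted(words))

# step 1, plus a word-frequency table over the raw words, which is useful for
# spotting abbreviations and typos to add to `rules`
unique_owners = set(owners)
word_freq = Counter(w for name in unique_owners
                      for w in re.sub(r"[^\w\s]", " ", name).upper().split())

groups = {}
for name in owners:
    groups.setdefault(canonical(name), []).append(name)
print(groups)

Grouping on the canonical form then gives you counts per owner without ever building the full pairwise distance matrix.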

You can also use agrep() in R for fuzzy name matching, by giving a percentage of allowed mismatches. If you pass it a fixed dataset, then you can grep for matches out of your database.
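If you would rather stay in Python, difflib.get_close_matches is a rough analogue (a similarity cutoff rather than agrep's mismatch percentage); a minimal sketch, with made-up names:

import difflib

canonical_names = ["REDEVELOPMENT AUTHORITY", "COMMONWEALTH OF PENNSYLVANIA"]
query = "PHILA. REDEVELOPMENT AUTH"

# returns up to n candidates from canonical_names with similarity ratio >= cutoff
print(difflib.get_close_matches(query, canonical_names, n=3, cutoff=0.6))

Note that this is still linear in the size of the candidate list per query, so it does not by itself solve the 570,000 x 570,000 problem; it is most useful once you already have a shorter list of canonical names to match against.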


Can I use ml/nlp to determine the pattern by which usernames are generated?

I have a dataset which has first and last names along with their respective email ids. Some of the email ids follow a certain pattern such as:
Fn1 = John , Ln1 = Jacobs, eid1= jj#xyz.com
Fn2 = Emily , Ln2 = Queens, eid2= eq#pqr.com
Fn3 = Harry , Ln3 = Smith, eid3= hsm#abc.com
The content after # has no importance for finding the pattern. I want to find out how many people follow a certain pattern and what is that pattern. Is it possible to do so using nlp and python?
EXTRA: To know what kind of pattern applies to some number of people, could we store examples of that pattern along with its count in an Excel sheet?
You certainly could - e.g., you could try to learn a relationship between your input and output data as
(Fn, Ln) --> eid
and further dissect this relationship into patterns.
However, before hitting the problem with complex tools (especially if you're new to ml/nlp), I'd do further analysis of the data first.
For example, I'd first be curious to see what portion of your data displays the clear patterns you've shown in the examples - using first character(s) from the individual's first/last name to build the corresponding eid (which could be determined easily programmatically).
Setting aside that portion of the data that satisfies this clear pattern - what does the remainder look like?
Is there another clear, but different, pattern in some of this data?
If there is - I'd then perform the same exercise again - construct a proper filter to collect and set aside data satisfying that pattern - and examine the remainder.
Doing this analysis might help you reach at least a partial answer to your inquiry rather quickly:
To know what kind of pattern applies to some number of people, could we store examples of that pattern along with its count in an Excel sheet?
Moreover it will help determine
a) whether you even need to use more complex tooling (if enough patterns can be easily sieved out this way, is it worth the investment to go heavier?) or
b) if not, which portion of the data to target with heavier tools (the remainder of this process - those not containing simple patterns).
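To make the filter-and-set-aside idea concrete, here is a minimal sketch; the record layout and the two candidate patterns are just assumptions for illustration:

records = [
    ("John", "Jacobs", "jj#xyz.com"),
    ("Emily", "Queens", "eq#pqr.com"),
    ("Harry", "Smith", "hsm#abc.com"),
]

def matches(first, last, eid, pattern):
    local = eid.split("#")[0].lower()   # everything after '#' is irrelevant
    return local == pattern(first.lower(), last.lower())

# candidate patterns, expressed as functions of (first name, last name)
patterns = {
    "first initial + last initial": lambda f, l: f[0] + l[0],
    "first initial + first two of last": lambda f, l: f[0] + l[:2],
}

remainder = list(records)
for label, pattern in patterns.items():
    hits = [r for r in remainder if matches(*r, pattern)]
    remainder = [r for r in remainder if r not in hits]
    print(label, "->", len(hits), "records")
print("unexplained:", remainder)

The per-pattern counts (plus a few example rows per pattern) are exactly what you could dump to an Excel sheet, and whatever is left in the remainder is the portion worth attacking with heavier tools.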

split text by periods except in certain cases [duplicate]

This question already has answers here:
How can I split a text into sentences?
(20 answers)
Closed 1 year ago.
I am currently trying to split a string containing an entire text document by sentences so that I can convert it to a csv. Naturally, I would use periods as the delimiter and perform str.split('.'); however, the document contains the abbreviations 'i.e.' and 'e.g.', whose periods I want to ignore in this case.
For example,
Original Sentence: During this time, it became apparent that vanilla shortest-path routing would be insufficient to handle the myriad operational, economic, and political factors involved in routing. ISPs began to modify routing configurations to support routing policies, i.e. goals held by the router’s owner that controlled which routes were chosen and which routes were propagated to neighbors.
Resulting List: ["During this time, it became apparent that vanilla shortest-path routing would be insufficient to handle the myriad operational, economic, and political factors involved in routing", "ISPs began to modify routing configurations to support routing policies, i.e. goals held by the router’s owner that controlled which routes were chosen and which routes were propagated to neighbors."]
My only workaround so far is replacing all 'i.e.' and 'e.g.' with 'ie' and 'eg', which is both inefficient and grammatically undesirable. I am fiddling with Python's regex library, which I suspect holds the answer I desire, but my knowledge of it is novice at best.
It is my first time posting a question on here so I apologize if I am using incorrect format or wording.
See How can I split a text into sentences? which suggests the natural language toolkit.
A deeper explanation of why it is done this way, by way of an example:
I go by the name of I. Brown. I bet I could make a sentence difficult to parse. No one is more suited to this task than I.
How do you break this into different sentences?
You need semantics (a formal sentence is usually made up of a subject, an object, and a verb) which a regular expression won't capture. RegEx does syntax very well, but not semantics (meaning).
To prove this: the answer someone else suggested, which involves a lot of complex regex and is fairly slow (it has 115 votes), would break on my simple sentence.
It's an NLP problem, so I linked to an answer that gave an NLP package.
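Concretely, a minimal sketch with NLTK's Punkt sentence tokenizer; its pre-trained English model usually treats common abbreviations such as 'i.e.' as non-boundaries, but verify on your own text:

import nltk
nltk.download("punkt", quiet=True)  # one-time download; newer NLTK versions may ask for "punkt_tab"

text = ("ISPs began to modify routing configurations to support routing policies, "
        "i.e. goals held by the router's owner. No one is more suited to this task than I.")

for sentence in nltk.sent_tokenize(text):
    print(sentence)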
This is a crude implementation.
inp = input()
res = []
last = 0
for x in range(len(inp)):
    if x > 1:
        # split on a period unless the character two places before or after is
        # also a period (which catches abbreviations like "i.e." and "e.g.")
        if inp[x] == "." and inp[x-2] != ".":
            if x < len(inp)-2:
                if inp[x+2] != ".":
                    res.append(inp[last:x])
                    last = x+2   # skip the period and the following space
res.append(inp[last:-1])         # last sentence, without the final period
print(res)
If I use your input, I get this output (hopefully, this is what you are looking for):
['During this time, it became apparent that vanilla shortest-path routing would be insufficient to handle the myriad operational, economic, and political factors involved in routing', 'ISPs began to modify routing configurations to support routing policies, i.e. goals held by the router’s owner that controlled which routes were chosen and which routes were propagated to neighbors']
Note: You might have to adjust this code if your text does not follow the usual conventions (e.g. missing spaces between sentences...)
This one should work!
import re

p = "During this time, it became apparent that vanilla shortest-path routing would be insufficient to handle the myriad operational, economic, and political factors involved in routing. ISPs began to modify routing configurations to support routing policies, i.e. goals held by the router’s owner that controlled which routes were chosen and which routes were propagated to neighbors."
sentences = []
while len(p) > 0:
    string = ""
    while True:
        # grab a run of capital letters plus everything up to the next capital
        match = re.search("[A-Z]+[^A-Z]+", p)
        if match is None:
            break
        p = p[len(match.group(0)):]
        string += match.group(0)
        # a chunk ending in ". " closes the sentence; the ". " after "i.e."
        # stays inside a chunk because lowercase text follows it
        if match.group(0).endswith(". "):
            break
    sentences.append(string)
print(sentences)

How to count [any name from a list of names] + [specific last name] in a block of text?

First time posting here. I'm hoping I can find a little help with something I'm trying to accomplish in terms of text analysis.
First, I'm doing this in python and would like to remain in python as this function would be part of a larger, otherwise healthy tool I'm happy with. I have NLTK and Anaconda all set up as well, so drawing on those resources is also possible.
I’ve been working on a tool that tracks and adds up references to city names in large blocks of text. For instance, the tool can count how many times “Chicago,” “New York” or “Los Angeles,” “San Francisco” etc… are detected in a text chunk and can rank them.
The current problem I am having is figuring out how to remove false positives from city names that are also last names. So, for instance, I would want to count, say Jackson Mississippi, but not count “Frank Jackson” “Jane Jackson” etc…
What I would like to do however is figure out a way to account for any false positive that might be [any name from a long list of first names] + [Select last name].
I have assembled a list of ~5000 first names from the census data that I can also bring into python as a list. I can also check true/false to find if a name is on that list, so I know I’m getting closer.
However, what I can’t figure out is how to express what I want, which is something like (I’ll use Jackson as an example again):
totalfirstnamejacksoncount = count (“[any name from census list] + Jackson”)
More or less. Is there some way I can phrase that as a wildcard from the census list? Could I set a variable that reads as "any item in this list" so I could write "anynamevariable + Jackson"? Or is there any other way to denote something like "any word in the census list + Jackson"?
Ideally, my aim is to get a total count of "[any first name] + [specified last name]" so I can subtract it from the total count for [last name that is also a city name], and maybe use that count for some other refinements.
In a worst-case scenario I could directly modify the census list, appending Jackson (or whatever last name I need) to each name and adding those lines manually, but I feel that would make a complete mess of my code at ~5,000 names for every last name I'd like to check.
Sorry for the long-winded post. I appreciate your help with all this. If you have other suggestions you think might be better ways to approach it I’m happy to hear those out as well.
I propose to use regular expressions in combination with the list of names from NLTK. Suppose your text is:
text = "I met Fred Jackson and Mary Jackson in Jackson Mississippi"
Take a list of all names and convert it into a (huge) regular expression:
import re
import nltk  # run nltk.download("names") once if the names corpus is missing
jackson_names = re.compile("|".join(w + r"\s+" + "Jackson"
                                    for w in nltk.corpus.names.words()))
In case you are not familiar with regular expressions, r'\s+' means "separated by one or more white spaces" and "|" means "or". The regular expression can be expanded to handle other last names.
Now, extract all "Jackson" matches from your text:
jackson_catch = jackson_names.findall(text)
#['Fred Jackson', 'Mary Jackson']
len(jackson_catch)
#2
Let's start by assuming that you are able to work with your data by iterating through words, e.g.
s = 'Hello. I am a string.'
s.split()
Output: ['Hello.', 'I', 'am', 'a', 'string.']
and you have managed to normalize the words by eliminating punctuation, capitalization, etc.
So you have a list of words words_list (which is your text converted into a list) and an index i at which you think there might be a city name, OR it might be someone's last name falsely identified as a city name. Let's call your list of first names FIRST_NAMES, which should be of type set (see comments).
if i >= 1:
    prev_word = words_list[i-1]
    if prev_word in FIRST_NAMES:
        pass  # put false positive code here
    else:
        pass  # put true positive code here
You may also prefer to use regular expressions, as they are more flexible and more powerful. For example, you may notice that even after implementing this, you still have false positives or false negatives for some previously unforeseen reason. RE's could allow you to quickly adapt to the new problem.
On the other hand, if performance is a primary concern, you may be better off not using something so powerful and flexible, so that you can hone your algorithm to fit your specific requirements and run as efficiently as possible.
The current problem I am having is figuring out how to remove false positives from city names that are also last names. So, for instance, I would want to count, say Jackson Mississippi, but not count “Frank Jackson” “Jane Jackson” etc…
The problem you have is called "named entity recognition", and is best solved with a classifier that takes multiple cues into account to find the named entities and classify them according to type (PERSON, ORGANIZATION, LOCATION, etc., or a similar list).
Chapter 7 in the nltk book, and especially section 3, Developing and evaluating chunkers, walks you through the process of building and training a recognizer. Alternatively, you could install the Stanford named-entity recognizer and measure its performance on your data.
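For a quick feel of the off-the-shelf route, before training your own chunker as in chapter 7, here is a rough sketch with NLTK's built-in tagger and NE chunker; accuracy on your particular text is not guaranteed, and the download package names vary slightly across NLTK versions:

import nltk
for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)   # one-time model downloads

text = "I met Frank Jackson and Jane Jackson in Jackson, Mississippi."
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))

# PERSON vs. GPE (geo-political entity) is exactly the distinction you are after
for subtree in tree.subtrees(lambda t: t.label() in ("PERSON", "GPE", "LOCATION")):
    print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))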

Remove nearly duplicate string from a list of strings using Difflib

I am using python and mysql. Here is my code
cur.execute("SELECT distinct product_type FROM cloth_table")
Product_type_list = cur.fetchall()
Now Product_type_list is a list of strings describing the product_type like this
product_type_list = ['T_shirts', 'T_shirt', 'T-shirt', 'Jeans', 'Jean', 'Formal Shirt', 'Shirt']
Here in product_type_list there are 3 duplicate entries for T-shirt and 2 each for jeans and shirt.
Now i want my Product_type_list to be like this
Product_type_list=['T_shirt' , 'Jeans', 'Shirt']
I think we can use difflib.SequenceMatcher's quick_ratio(). But how do I do that?
I don't know much about the difflib.SequenceMatcher package, but a fuzzy match like this can also be done using MySQL's full-text search (FTS).
Look into the FTS matching logic to solve this issue. There is also the Soundex concept, available both in the database and in Python.
Using FTS you get a comparison score, like a rank, and based on that rank you can filter your list. I did a similar task using SQL Server FTS.
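For the Soundex part, here is a simplified sketch in Python (it skips some of the official edge rules, e.g. around 'H' and 'W'); libraries such as jellyfish also ship tested implementations:

def soundex(word, length=4):
    # classic Soundex letter -> digit groups
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    word = "".join(c for c in word.upper() if c.isalpha())
    if not word:
        return ""
    encoded, prev = word[0], codes.get(word[0], "")
    for c in word[1:]:
        digit = codes.get(c, "")
        if digit and digit != prev:
            encoded += digit
        prev = digit
    return (encoded + "000")[:length]

print(soundex("T-shirt"), soundex("T_shirts"), soundex("T_shirt"))  # all three map to the same code here

Grouping on the Soundex code (or comparing codes) is then a cheap first pass before any more expensive similarity scoring.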
I think you could define your own algorithm to solve this, since most of the work is domain dependent and your list of product types is not that big, I presume. For instance, 'Formal' in your 'Formal Shirt' should be ignored per your requirement, and that may not be true in other domains. So first define your own stop words (words that can be ignored in a product name), remove a trailing 's', trim whitespace and non-letter characters such as '-' and '_', and convert to upper case. Given this, you can build your own matching algorithm to solve the problem. I had such a problem and solved it with my own implementation after trying several existing libraries.
And you should keep improving your algorithm, as it is based on heuristics and assumptions.
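If you do want to stick with difflib, here is a minimal sketch along those lines: normalize each name (upper-case, strip non-letters, drop a trailing 's'), then keep a name only if it is not too similar to one already kept. The 0.8 threshold is a guess you will need to tune:

import difflib
import re

def normalize(name):
    # upper-case, replace non-letters with spaces, drop a trailing 'S' from each word
    words = re.sub(r"[^A-Za-z]+", " ", name).upper().split()
    return " ".join(w[:-1] if w.endswith("S") and len(w) > 1 else w for w in words)

def dedupe(names, threshold=0.8):
    kept = []
    for name in names:
        norm = normalize(name)
        # quick_ratio() is a cheaper upper bound if ratio() turns out to be too slow
        if not any(difflib.SequenceMatcher(None, norm, normalize(k)).ratio() >= threshold
                   for k in kept):
            kept.append(name)
    return kept

product_type_list = ['T_shirts', 'T_shirt', 'T-shirt', 'Jeans', 'Jean', 'Formal Shirt', 'Shirt']
print(dedupe(product_type_list))

As written, 'Shirt' will likely fold into the T-shirt group, which illustrates the point above: the stop words, normalization rules, and threshold are domain dependent and need tuning.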

Regular expressions+python challenge! Wrangling data that's almost-regular

I'm sorry to be posting this, but I've killed a lot of time working on this unsuccessfully. So, a regular expressions+Python challenge for one and all:
I'm working with data that's mostly regularly formatted. Lists of companies are combined into a string like
`Company Inc,Company, LLC,Company`
without quotes to delineate the entries. Using the regular example above, I can do:
>>> re.split(r',\b', 'Company Inc,Company, LLC,Company')
['Company Inc', 'Company, LLC', 'Company']
Unfortunately, some strings are irregularly-formatted like:
`IBP, Inc,Tyson Foods,Inc.`
wherein ,Inc is not separated from Foods by a space. So, using r',\b', I get this:
>>> re.split(r',\b', 'IBP, Inc,Tyson Foods,Inc.')
['IBP, Inc', 'Tyson Foods', 'Inc.']
I would like to get this:
['IBP, Inc', 'Tyson Foods,Inc.']
What would you do in this situation?
If they are known, you could add the split-prevention strings to a negative lookahead:
r',\b(?!Inc\.)'
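A quick check of that pattern against the problem string (extend the lookahead with alternation, e.g. (?!Inc\.|LLC), for other such tokens):

import re

print(re.split(r',\b(?!Inc\.)', 'IBP, Inc,Tyson Foods,Inc.'))
# ['IBP, Inc', 'Tyson Foods,Inc.']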
To put Mike M's response in slightly different terms, if you can build a reliable list of non-relevant tokens like 'Inc.', 'Inc' and 'LLC', then you might have a way to parse. Even then, you're probably not going to get something automatic like split() to work for you. You'll probably have to roll your own.
I would make a first split on the comma to get lists such as:
['IBP', 'Inc', 'Tyson Foods', 'Inc.']
and then do a second pass through the data where highly improbable company names such as 'Inc', 'Inc.', 'LLC', 'GmbH', etc. get combined with the previous entry in the list:
badList = originalData.split(',')
goodList = []
rejectList = ['Inc', 'Inc.', 'LLC', 'GmbH'] # etc.
for pseudoName in badList:
    pseudoName = pseudoName.strip()
    if pseudoName in rejectList:
        goodList[-1] = goodList[-1] + ", " + pseudoName
    else:
        goodList.append(pseudoName)
This method would also let you do more sophisticated manipulations if you later find that your data has entries such as "Farmers Group, The" and put the articles in the right place.
It depends on the number of entries you have to figure out. Basically, as far as high-quality data goes, you're screwed. That means any automation you try to apply will have problems dealing with your data.
You're going to have to fix this by hand to put data quality back into it. Data quality issues are one of those things that computers have a very hard time dealing with.
What I personally would do is to write a quick-and-dirty heuristic to try to determine entries that don't fit the expected results. For instance, in your example, I would look for split entries that are "Inc" or "LLC" plus or minus a couple of characters. That would catch entries which seem to not provide much above a corporation type. You would catch the "Inc." and know that the real corporation name must be nearby.
Once you have that, you can clean your data, by hand, and reprocess. This is the best bet up to a gajillion entries or so, when you can justify writing these sorts of corrective actions as a part of your program. Unless you're Google, though, it's almost guaranteed that you'll find it quickest and easiest to put human eyes on it.
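A quick-and-dirty version of that heuristic might look like this; the suffix list and the two-character slack are assumptions to adjust as you go:

import re

SUFFIXES = ("INC", "LLC")

def looks_like_orphan_suffix(entry):
    # an entry that is basically just a corporation type, give or take punctuation
    token = re.sub(r"[^A-Za-z]", "", entry).upper()
    return any(s in token and abs(len(token) - len(s)) <= 2 for s in SUFFIXES)

entries = re.split(r',\b', 'IBP, Inc,Tyson Foods,Inc.')
print([e for e in entries if looks_like_orphan_suffix(e)])   # entries worth a manual look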
