I'm sorry to be posting this, but I've killed a lot of time working on this unsuccessfully. So, a regular expressions+Python challenge for one and all:
I'm working with data that's mostly regularly formatted. Lists of companies are combined into a string like
`Company Inc,Company, LLC,Company`
without quotes to delineate the entries. Using the regular example above, I can do:
>>> re.split(r',\b', 'Company Inc,Company, LLC,Company')
['Company Inc', 'Company, LLC', 'Company']
Unfortunately, some strings are irregularly-formatted like:
`IBP, Inc,Tyson Foods,Inc.`
wherein ,Inc is not separated from Foods by a space. So, using r',\b', I get this:
>>> re.split(r',\b', 'IBP, Inc,Tyson Foods,Inc.')
['IBP, Inc', 'Tyson Foods', 'Inc.']
I would like to get this:
['IBP, Inc', 'Tyson Foods,Inc.']
What would you do in this situation?
If known, you could add the split-prevention strings to a negative lookahead
r',\b(?!Inc\.)'
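For example, a quick check in the interpreter against the problem string from the question:
>>> re.split(r',\b(?!Inc\.)', 'IBP, Inc,Tyson Foods,Inc.')
['IBP, Inc', 'Tyson Foods,Inc.']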
To put Mike M's response in slightly different terms, if you can build a reliable list of non-relevant tokens like 'Inc.', 'Inc' and 'LLC', then you might have a way to parse. Even then, you're probably not going to get something automatic like split() to work for you. You'll probably have to roll your own.
I would make a first split on the comma to get lists such as:
['IBP', 'Inc', 'Tyson Foods', 'Inc.']
and then do a second pass through the data where highly improbable company names such as 'Inc', 'Inc.', 'LLC', 'GmbH', etc. get combined with the previous entry in the list:
badList = originalData.split(',')
goodList = []
rejectList = ['Inc', 'Inc.', 'LLC', 'GmbH'] # etc.
for pseudoName in badList:
    pseudoName = pseudoName.strip()
    if pseudoName in rejectList:
        goodList[-1] = goodList[-1] + ", " + pseudoName
    else:
        goodList.append(pseudoName)
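Run against the problem string (assuming originalData = 'IBP, Inc,Tyson Foods,Inc.'), goodList comes out as:
>>> goodList
['IBP, Inc', 'Tyson Foods, Inc.']
Note that the entry missing a space is normalized to 'Tyson Foods, Inc.' rather than kept as 'Tyson Foods,Inc.'.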
This method would also let you do more sophisticated manipulations if you later find that your data has entries such as "Farmers Group, The" and put the articles in the right place.
It depends on the number of entries you have to figure out. Basically, as far as high-quality data goes, you're screwed. That means any automation you try to apply will have problems dealing with your data.
You're going to have to fix this by hand to put data quality back into it. Data quality issues are one of those things that computers have a very hard time dealing with.
What I personally would do is to write a quick-and-dirty heuristic to try to determine entries that don't fit the expected results. For instance, in your example, I would look for split entries that are "Inc" or "LLC" plus or minus a couple of characters. That would catch entries which seem to not provide much above a corporation type. You would catch the "Inc." and know that the real corporation name must be nearby.
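A minimal sketch of such a heuristic (the entries list, the suspects list, and the length tolerance are all assumptions for illustration):

suspects = ['Inc', 'Inc.', 'LLC', 'GmbH']   # known corporation-type tokens

def looks_suspicious(entry, tolerance=3):
    # Flag split results that are little more than a corporation type,
    # i.e. "Inc"/"LLC"-style tokens plus or minus a couple of characters.
    entry = entry.strip()
    return any(abs(len(entry) - len(s)) <= tolerance and s.rstrip('.').lower() in entry.lower()
               for s in suspects)

flagged = [e for e in entries if looks_suspicious(e)]   # entries: your raw split output (assumed)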
Once you have that, you can clean your data, by hand, and reprocess. This is the best bet up to a gajillion entries or so, when you can justify writing these sorts of corrective actions as part of your program. Unless you're Google, though, it's almost guaranteed that you'll find it quickest and easiest to put human eyes on it.
Related
I have multiple data frames to compare. My problem is the product IDs. One is set up like:
000-000-000-000
Vs
000-000-000
(gross)
I have looked on here, Reddit, YouTube, and even went deep down the rabbit hole trying .join, .append, and other methods I've never seen before or don't fully understand yet. Is there a way (or, even better, some documentation I can read to learn this) to pull the Product ID from the main Excel sheet and compare it to the one(s) that should match? Then I will most likely make that the in-place ID across all sheets, so that I can use those IDs as the index and do a side-by-side compare of each ID's row data. Each ID has about 113 values to compare; that's 113 columns per row, if that makes sense.
Example: (the colored columns are from the main sheet that the non-colored column will be compared to)
Additional notes:
The highlighted yellow IDs are "unique", and I won't be changing those, but will instead write them to a list or something and use an if statement to ignore them when found.
Edit:
So I wrote this code, which does almost exactly what I need.
It takes out the "-", which I apply to all my IDs. I just need to make a list of the IDs that are unique, to skip over when trimming the digits.
dfSS["Product ID"] = dfSS["Product ID"].str.replace("-", "")
Then this keeps only the first 9 digits, except for the unique IDs:
dfSS["Product ID"] = dfSS["Product ID"].str[:9]
I will add the full code below once I get it to work 100%.
I am now trying to figure out how to say something like:
lst = [1, 2, 3, 4, 5]
if dfSS["Product ID"] not in lst:
    dfSS["Product ID"] = dfSS["Product ID"].str.replace("-", "").str[:9]
This code does not work, but every day I get closer and closer to being able to compare these similar yet different data frames. The lst is just an example of the 000-000-000 Product IDs that I do not want to filter at all, but keep in the data frame.
If the ID transformation is predictable, then one option is to use regex for homogenizing IDs. For example, if the situation is just removing the first three digits, then something like the following can be used:
df['short_id'] = df['long_id'].str.extract(r'\d\d\d-([\d-]*)')
If the ID transformation is not so predictable (e.g. due to transcription errors or some other noise in the data) then the best option is to first disambiguate the ID transformation using something like recordlinkage, see the example here.
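For the skip-list requirement in the question, a boolean mask is one way to apply the transformation only to the non-unique IDs (a minimal sketch, assuming dfSS and lst exist as in the question):

mask = ~dfSS["Product ID"].isin(lst)          # rows whose ID is not in the "unique" list
dfSS.loc[mask, "Product ID"] = (
    dfSS.loc[mask, "Product ID"].str.replace("-", "", regex=False).str[:9]
)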
OK, solved this for every Product ID with or without dashes, #, letters, etc.
(\d\d\d-)?[_#\d-]?[a-zA-Z]?
(\d\d\d-)? - This is for the first and second three-digit sets, with the trailing dash, matched zero or one time (optional).
[_#\d-]? - This is for any special characters and additional digits (optional).
[a-zA-Z]? - Not sure why, but I had to separate this from the last part because otherwise it wouldn't pick up every letter (optional).
With the above I solved everything I needed for RE.
Where I learned how to improve my RE skills:
RE Documentation
Automate the Boring Stuff- Ch 7
You can test your REs here
An additional way to write this. I put it here to show there is no one way of doing it. RE is super awesome:
(\d{3}-)?[_#\d{3}-]?[a-zA-Z]?
I'm pre-processing the Trump-Hillary debate transcript to create 3 lists, one containing each of the 3 speakers' lines.
The entire script is 1046 list entries.
Some of the text is as follows:
for i in range(len(loaded_txt)):
    print("loaded_txt[i]", loaded_txt[i])
loaded_txt[i] TRUMP: No, it's going to totally help you. And one thing we have to do: Repeal and replace the disaster known as Obamacare. It's destroying our country. It's destroying our businesses, our small business and our big businesses. We have to repeal and replace Obamacare.
loaded_txt[i]
loaded_txt[i] You take a look at the kind of numbers that that will cost us in the year '17, it is a disaster. If we don't repeal and replace -- now, it's probably going to die of its own weight. But Obamacare has to go. It's -- the premiums are going up 60 percent, 70 percent, 80 percent. Next year they're going to go up over 100 percent.
loaded_txt[i]
loaded_txt[i] And I'm really glad that the premiums have started -- at least the people see what's happening, because she wants to keep Obamacare and she wants to make it even worse, and it can't get any worse. Bad health care at the most expensive price. We have to repeal and replace Obamacare.
loaded_txt[i]
loaded_txt[i] WALLACE: And, Secretary Clinton, same question, because at this point, Social Security and Medicare are going to run out, the trust funds are going to run out of money. Will you as president entertain -- will you consider a grand bargain, a deal that includes both tax increases and benefit cuts to try to save both programs?
I tried to append entries to TRUMP_script_list = [] if they contain "TRUMP:", like this:
TRUMP_script_list = []
for i in range(len(loaded_txt)):
    if "TRUMP:" in loaded_txt[i]:
        TRUMP_script_list.append(loaded_txt[i])
But the problem is entries without a name.
Text without a name should still count as Trump's if it comes after text labeled with Trump's name, until the list reaches text labeled with someone else's name (Wallace or Clinton).
I tried a "while" loop that would terminate when the list contained other names (Wallace, Clinton), but I failed to implement it.
How can I implement this algorithm or any other good idea?
define function to get title:
def get_title(text, titles, previous_title):
    for title in titles:
        if title in text:
            return title
    return previous_title
define reference dictionary:
name_script_list = {'TRUMP:':TRUMP_script_list, 'HILLARY:':HILLARY_script_list, 'WALLACE:':WALLACE_script_list}
titles = set(name_script_list.keys())
title = ''
iterate through list in for loop:
for text in loaded_txt:
    title = get_title(text, titles, title)
    name_script_list[title].append(text)
Basically, the idea is that get_title() gets a series of titles to try, as well as what the last title was. If any of the titles appear in the text, it returns that one. Otherwise, it returns the prior title.
I initialized the initial title as ''. This will work as long as there is a title in the first line of text. If there isn't, it will throw an error (a KeyError on the empty string). The fix depends on how you want it handled. Do you just not want to consider such a case (it would indicate an error in loaded_txt, or in the list of possible titles)? Do you want to set a specific person's name as the default initial title? Do you want to skip lines until the initial title? There are a number of approaches, and I'm not sure which you would prefer.
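For instance, if you wanted to skip lines until the first title appears, one minimal variation would be (a sketch, not necessarily the behavior you want):

title = ''
for text in loaded_txt:
    title = get_title(text, titles, title)
    if title == '':
        continue  # no speaker tag seen yet; skip this line
    name_script_list[title].append(text)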
First time posting here. I'm hoping I can find a little help on something I'm trying to accomplish in terms of text analysis.
First, I'm doing this in Python and would like to remain in Python, as this function would be part of a larger, otherwise healthy tool I'm happy with. I have NLTK and Anaconda all set up as well, so drawing on those resources is also possible.
I’ve been working on a tool that tracks and adds up references to city names in large blocks of text. For instance, the tool can count how many times “Chicago,” “New York” or “Los Angeles,” “San Francisco” etc… are detected in a text chunk and can rank them.
The current problem I am having is figuring out how to remove false positives from city names that are also last names. So, for instance, I would want to count, say Jackson Mississippi, but not count “Frank Jackson” “Jane Jackson” etc…
What I would like to do however is figure out a way to account for any false positive that might be [any name from a long list of first names] + [Select last name].
I have assembled a list of ~5000 first names from the census data that I can also bring into python as a list. I can also check true/false to find if a name is on that list, so I know I’m getting closer.
However, what I can’t figure out is how to express what I want, which is something like (I’ll use Jackson as an example again):
totalfirstnamejacksoncount = count (“[any name from census list] + Jackson”)
More or less. Is there some way I can phrase it as a wildcard drawn from the census list? Can I set a variable that would read as "any item in this list", so I could write "anynamevariable + Jackson"? Or is there any other way to denote something like "any word in census list + Jackson"?
Ideally, my aim is to get a total count of "[any first name] + [specified last name]" so I can subtract it from the total count of [last name that is also a city name] and maybe use that count for some other refinements.
In a worst-case scenario, I could directly modify the census list, append Jackson (or whatever last name I need) to each name, and add up the counts manually, but I feel like that would make a complete mess of my code, given ~5000 names for each last name I'd like to handle.
Sorry for the long-winded post. I appreciate your help with all this. If you have other suggestions you think might be better ways to approach it I’m happy to hear those out as well.
I propose to use regular expressions in combination with the list of names from NLTK. Suppose your text is:
text = "I met Fred Jackson and Mary Jackson in Jackson Mississippi"
Take a list of all names and convert it into a (huge) regular expression:
import re
import nltk

jackson_names = re.compile("|".join(w + r"\s+" + "Jackson"
                                    for w in nltk.corpus.names.words()))
In case you are not familiar with regular expressions, r'\s+' means "separated by one or more white spaces" and "|" means "or". The regular expression can be expanded to handle other last names.
Now, extract all "Jackson" matches from your text:
jackson_catch = jackson_names.findall(text)
#['Fred Jackson', 'Mary Jackson']
len(jackson_catch)
#2
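One refinement you may want (an assumption about your data, not part of the answer above): wrap the names in word boundaries and re.escape() so that partial matches such as "Jacksonville" don't slip through:

import re
import nltk

jackson_names = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in nltk.corpus.names.words()) + r")\s+Jackson\b"
)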
Let's start by assuming that you are able to work with your data by iterating through words, e.g.
s = 'Hello. I am a string.'
s.split()
Output: ['Hello.', 'I', 'am', 'a', 'string.']
and you have managed to normalize the words by eliminating punctuation, capitalization, etc.
So you have a list of words words_list (which is your text converted into a list) and an index i at which you think there might be a city name, OR it might be someone's last name falsely identified as a city name. Let's call your list of first names FIRST_NAMES, which should be of type set (see comments).
if i >= 1:
    prev_word = words_list[i-1]
    if prev_word in FIRST_NAMES:
        pass  # put false positive code here
    else:
        pass  # put true positive code here
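As a usage sketch, a full counting pass might look like this (CITY_NAMES and the single-token assumption are mine, not the asker's; multi-word cities like "New York" would need extra handling):

city_counts = {}
for i, word in enumerate(words_list):
    if word not in CITY_NAMES:        # CITY_NAMES: your set of city tokens (assumed)
        continue
    if i >= 1 and words_list[i-1] in FIRST_NAMES:
        continue                      # false positive: "Frank Jackson", "Jane Jackson", ...
    city_counts[word] = city_counts.get(word, 0) + 1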
You may also prefer to use regular expressions, as they are more flexible and more powerful. For example, you may notice that even after implementing this, you still have false positives or false negatives for some previously unforeseen reason. RE's could allow you to quickly adapt to the new problem.
On the other hand, if performance is a primary concern, you may be better off not using something so powerful and flexible, so that you can hone your algorithm to fit your specific requirements and run as efficiently as possible.
The current problem I am having is figuring out how to remove false positives from city names that are also last names. So, for instance, I would want to count, say Jackson Mississippi, but not count “Frank Jackson” “Jane Jackson” etc…
The problem you have is called "named entity recognition", and is best solved with a classifier that takes multiple cues into account to find the named entities and classify them according to type (PERSON, ORGANIZATION, LOCATION, etc., or a similar list).
Chapter 7 in the nltk book, and especially section 3, Developing and evaluating chunkers, walks you through the process of building and training a recognizer. Alternatively, you could install the Stanford named-entity recognizer and measure its performance on your data.
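As a rough sketch of the NLTK route (not a tuned recognizer; it assumes the punkt, averaged_perceptron_tagger, maxent_ne_chunker, and words data have been downloaded):

import nltk

def find_entities(text):
    # Tokenize, POS-tag, then chunk named entities into a tree
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))
    for subtree in tree.subtrees():
        if subtree.label() in ('GPE', 'LOCATION', 'PERSON'):
            yield subtree.label(), ' '.join(word for word, tag in subtree.leaves())

print(list(find_entities("I met Frank Jackson in Jackson, Mississippi.")))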
I've got a database with property owners; I would like to count the number of properties owned by each person, but am running into standard mismatch problems:
REDEVELOPMENT AUTHORITY vs. REDEVELOPMENT AUTHORITY O vs. PHILADELPHIA REDEVELOPMEN vs. PHILA. REDEVELOPMENT AUTH
COMMONWEALTH OF PENNA vs. COMMONWEALTH OF PENNSYLVA vs. COMMONWEALTH OF PA
TRS UNIV OF PENN vs. TRUSTEES OF THE UNIVERSIT
From what I've seen, this is a pretty common problem, but my problem differs from those with solutions I've seen for two reasons:
1) I've got a large number of strings (~570,000), so computing the 570000 x 570000 matrix of edit distances (or other pairwise match metrics) seems like a daunting use of resources
2) I'm not focused on one-off comparisons--e.g., as is most common for what I've seen from big data fuzzy matching questions, matching user input to a database on file. I have one fixed data set that I want to condense once and for all.
Are there any well-established routines for such an exercise? I'm most familiar with Python and R, so an approach in either of those would be ideal, but since I only need to do this once, I'm open to branching out to other, less familiar languages (perhaps something in SQL?) for this particular task.
That is exactly what I am facing at my new job daily (but my line counts are in the millions). My approach is to:
1) find a set of unique strings by using p = unique(a)
2) remove punctuation, split the strings in p by whitespace, make a table of word frequencies, create a set of rules, and use gsub to "recover" abbreviations, mistyped words, etc. E.g. in your case "AUTH" should be recovered back to "AUTHORITY", "UNIV" -> "UNIVERSITY" (or vice versa); a rough Python sketch of this step follows the list
3) recover typos if I spot them by eye
4) advanced: reorder the words in the strings (which are often improper English) to see whether two or more strings are identical apart from word order (e.g. "10pack 10oz" and "10oz 10pack")
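The steps above are R-flavored; a rough Python equivalent of steps 1) and 2) might look like this (the rule table and owner_names are made-up illustrations; build the real rules from your own word-frequency counts):

import re

rules = {                      # hypothetical abbreviation -> expansion rules
    r'\bAUTH\b': 'AUTHORITY',
    r'\bUNIV\b': 'UNIVERSITY',
    r'\bPENNA\b': 'PENNSYLVANIA',
}

def normalize(name):
    name = re.sub(r'[^\w\s]', ' ', name.upper())          # remove punctuation
    for pattern, replacement in rules.items():
        name = re.sub(pattern, replacement, name)          # recover abbreviations
    return ' '.join(name.split())                          # collapse whitespace

unique_names = sorted({normalize(n) for n in owner_names})  # owner_names: your raw strings (assumed)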
You can also use agrep() in R for fuzzy name matching, by giving a percentage of allowed mismatches. If you pass it a fixed dataset, then you can grep for matches out of your database.
I am using Python and MySQL. Here is my code:
cur.execute("SELECT distinct product_type FROM cloth_table")
Product_type_list = cur.fetchall()
Now Product_type_list is a list of strings describing the product_type, like this:
product_type_list = ['T_shirts', 'T_shirt', 'T-shirt', 'Jeans', 'Jean', 'Formal Shirt', 'Shirt']
Here in product_type_list there are 3 duplicate entries for T-shirt and 2 each for jeans and shirt.
Now I want my Product_type_list to be like this:
Product_type_list = ['T_shirt', 'Jeans', 'Shirt']
I think we can use difflib.SequenceMatcher's quick_ratio(), but how do I do that?
I don't know much about the difflib.SequenceMatcher package, but a fuzzy match like this can be done using MySQL's full-text search (FTS).
Try to work out the FTS matching logic and solve the issue that way. There is also a Soundex concept available both in the database and in Python.
Using FTS you get a comparison score, like a rank, and based on that rank you can filter your list. I did a similar task using SQL Server FTS.
I think you could define your own algorithm to solve this, as most of this is domain-dependent and your set of product types is not that big, I presume. For instance, "Formal" in your "Formal Shirt" should be ignored per your requirement, and this may not be true in other domains. So first define your own stop words (words which can be ignored in a product name), remove a trailing 's', trim whitespace and non-letter characters like '-' and '_', and convert to upper case. Given this, you could build your own matching algorithm to solve the problem. I had such a problem, and solved it with my own implementation after trying several existing libraries.
And you should keep improving your algorithm, as it is based on heuristics and assumptions.
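As a minimal sketch of the difflib.SequenceMatcher idea from the question combined with the stop-word normalization above (the threshold, stop words, and trailing-'s' rule are assumptions to tune; very close pairs such as "Shirt" vs. "T_shirt" may still need extra domain rules):

import re
from difflib import SequenceMatcher

STOP_WORDS = {'FORMAL'}                       # assumed domain stop words

def normalize(name):
    # Upper-case, keep letters only, drop stop words, strip a trailing 'S'
    words = [w for w in re.split(r'[^A-Za-z]+', name.upper()) if w and w not in STOP_WORDS]
    joined = ''.join(words)
    return joined[:-1] if joined.endswith('S') else joined

def dedupe(names, threshold=0.95):
    kept = []
    for name in names:
        canon = normalize(name)
        if not any(SequenceMatcher(None, canon, normalize(k)).quick_ratio() >= threshold
                   for k in kept):
            kept.append(name)
    return kept

product_types = ['T_shirts', 'T_shirt', 'T-shirt', 'Jeans', 'Jean', 'Formal Shirt', 'Shirt']
print(dedupe(product_types))                  # keeps the first occurrence of each group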