Data anonymization using Python

I have an unstructured, free form text (taken from emails, phone conversation transcriptions), a list of first names and a list of last names.
What would be the most effective and pythonistic method to replace all the first names in the text with "--FIRSTNAME--" and last names with "--LASTNAME--" based on the lists I have?
I could iterate over the first name list and, for each name, do a
text.replace(firstname, '--FIRSTNAME--')
but that seems very inefficient, especially for a very long list of names and many long texts to process. Are there better options?
Example:
Text: "Hello, this is David, how may I help you? Hi, my name is Alex Bender and I am trying to install my new coffee machine."
First name list: ['Abe', 'Alex', 'Andy', 'David', 'Mark', 'Timothy']
Last name list: ['Baxter', 'Bender', 'King', 'McLoud']
Expected output: "Hello, this is --FIRSTNAME--, how may I help you? Hi, my name is --FIRSTNAME-- --LASTNAME-- and I am trying to install my new coffee machine."
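(For reference, a stdlib-only sketch of one such option: join each list into a single alternation regex so every text is scanned once per list rather than once per name. The helper name is mine, not from any answer.)

```python
import re

first_names = ['Abe', 'Alex', 'Andy', 'David', 'Mark', 'Timothy']
last_names = ['Baxter', 'Bender', 'King', 'McLoud']

# One compiled alternation per list; \b restricts matches to whole words.
first_re = re.compile(r'\b(?:%s)\b' % '|'.join(map(re.escape, first_names)))
last_re = re.compile(r'\b(?:%s)\b' % '|'.join(map(re.escape, last_names)))

def anonymize(text):
    return last_re.sub('--LASTNAME--', first_re.sub('--FIRSTNAME--', text))
```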

I followed the advice of @furas and checked out the flashtext module. It pretty much answers my need in full.
I did run into a problem as I am working with Hebrew (non ASCII characters) and the text replacement would not follow word boundaries.
The KeywordProcessor class has a method, add_non_word_boundary(self, character), which for some reason is undocumented. It lets you add characters that should not be considered boundary characters (in addition to the default [a-zA-Z0-9_]), so that only whole words are replaced.

Python text to sentences when uppercase word appears

I am using Google Speech-to-Text API and after I transcribe an audio file, I end up with a text which is a conversation between two people and it doesn't contain punctuation (Google's automatic punctuation or speaker diarization features are not supported for this non-English language). For example:
Hi you are speaking with customer support how can i help you Hi my name is whatever and this is my problem Can you give me your address please Yes of course
It appears as one big sentence, but I want to split the different sentences whenever an uppercase word appears, and thus have:
Hi you are speaking with customer support how can i help you
Hi my name is whatever and this is my problem
Can you give me your address please
Yes of course
I am using Python, and rather than regex I want to use a simpler method. What should I add to this code in order to split each result into multiple sentences as soon as I see an uppercase letter?
# Each result is for a consecutive portion of the audio. Iterate through
# them to get the transcripts for the entire audio file.
for i, result in enumerate(response.results):
    transcribed_text = []
    # The first alternative is the most likely one for this portion.
    alternative = result.alternatives[0]
    print("-" * 20)
    print("First alternative of result {}".format(i))
    print("Transcript: {}".format(alternative.transcript))
A simple solution would be a regex split:
import re

inp = "Hi you are speaking with customer support how can i help you Hi my name is whatever and this is my problem Can you give me your address please Yes of course"
sentences = re.split(r'\s+(?=[A-Z])', inp)
print(sentences)
This prints:
['Hi you are speaking with customer support how can i help you',
'Hi my name is whatever and this is my problem',
'Can you give me your address please',
'Yes of course']
Note that this simple approach can easily fail should there be things like proper names in the middle of sentences, or maybe acronyms, both of which also have uppercase letters but are not markers for the actual end of the sentence. A better long term approach would be to use a library like nltk, which has the ability to find sentences with much higher accuracy.

regex to find LastnameFirstname with no space between in Python

I currently have several names that look like this:
SmithJohn
smithJohn
O'BrienPeter
All of these have no spaces, but have a capital letter in between.
Is there a regex to match these types of names (but that won't match names like Smith, John, Smith John, or Smith.John)? Furthermore, how could I split the last name and first name into two different variables?
Thanks
If all you want is a string with a capital letter in the middle and lowercase letters around it, this should work okay: [a-z][A-Z] (make sure you use re.search and not re.match). It handles "O'BrienPeter" fine, but might match names like "McCutchon" when it shouldn't. It's impossible to come up with a regex, or any program really, that does what you want for all names (see Falsehoods Programmers Believe About Names).
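A sketch of that approach, splitting at the first lowercase-to-uppercase boundary (with the same caveat: a name like "McCutchon" would be split incorrectly):

```python
import re

def split_name(s):
    # Find the first lowercase letter immediately followed by an uppercase one.
    m = re.search(r"[a-z][A-Z]", s)
    if m is None:
        return None  # no embedded capital: "Smith John", "Smith, John", ...
    i = m.start() + 1
    return s[:i], s[i:]
```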
As Brian points out, there's a question you need to ask yourself here: What guarantees do you have about the strings you will be processing?
Do you know without a doubt that the only capitals will be the beginnings of the names? Or could something like "McCutchonBrian", or in my case "Mallegol-HansenPhilip" have found its way in there as well?
In the greater context of software in general, you need to consider the assumptions you are going in with. Otherwise you're going to be solving a problem that is, in fact, not the problem you have.

How to parse names from raw text

I was wondering if anyone knew of any good libraries or methods of parsing names from raw text.
For example, let's say I've got these as examples: (note sometimes they are capitalized tuples, other times not)
James Vaynerchuck and the rest of the group will be meeting at 1PM.
Sally Johnson, Jim White and brad burton.
Mark angleman Happiness, Productivity & blocks. Mark & Evan at 4pm.
My first thought is to load some sort of part-of-speech tagger (like Python's NLTK), tag all of the words, and then strip out only the nouns. Then compare the nouns against a database of known words (i.e., a literal dictionary); if they aren't in the dictionary, assume they are names.
Other thoughts would be to delve into machine learning, but that might be beyond the scope of what I need here.
Any thoughts, suggestions or libraries you could point me to would be very helpful.
Thanks!
I don't know why you think you need NLTK just to rule out dictionary words; a simple dictionary (which you might have installed somewhere like /usr/share/dict/words, or you can download one off the internet) is all you need:
with open('/usr/share/dict/words') as f:
    dictwords = {word.strip() for word in f}
with open(mypath) as f:
    names = [word for line in f for word in line.rstrip().split()
             if word.lower() not in dictwords]
Your words list may include names, but if so, it will include them capitalized, so:
dictwords = {word.strip() for word in f if word.islower()}
Or, if you want to whitelist proper names instead of blacklisting dictionary words:
with open('/usr/share/dict/propernames') as f:
    namewords = {word.strip() for word in f}
with open(mypath) as f:
    names = [word for line in f for word in line.rstrip().split()
             if word.title() in namewords]
But this really isn't going to work. Look at "Jim White" from your example. His last name is obviously going to be in any dictionary, and his first name will be in many (as a short version of "jimmy", as a common romanization of the Arabic letter "jīm", etc.). "Mark" is also a common dictionary word. And the other way around, "Will" is a very common name even though you want to treat it as a word, and "Happiness" is an uncommon name, but at least a few people have it.
So, to make this work even the slightest bit, you probably want to combine multiple heuristics.
First, instead of a word being either always a name or never a name, each word has a probability of being used as a name in some relevant corpus: White may be a name 13.7% of the time, Mark 41.3%, Jim 99.1%, Happiness 0.1%, etc.
Next, if a word is not the first word in a sentence but is capitalized, it's much more likely to be a name (how much more? I don't know; you'll need to test and tune for your particular input), and if it's lowercase, it's less likely to be a name.
You could bring in more context. For example, you have a lot of full names, so if something is a possible first name and it appears right next to something that's a common last name, it's more likely to be a first name. You could even try to parse the grammar (it's OK if you bail on some sentences; they just won't get any input from the grammar rule), so that if two adjacent words only work as part of a sentence when the second one is a verb, they're probably not a first and last name, even if that same second word could be a noun (and a name) in other contexts. And so on.
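A toy sketch of the first two heuristics; the probabilities and the capitalization multipliers below are made-up numbers, not from any corpus. You would estimate real priors from your own data:

```python
# Hypothetical per-word priors for "is used as a name", estimated offline.
NAME_PROB = {'white': 0.137, 'mark': 0.413, 'jim': 0.991, 'happiness': 0.001}

def name_score(word, sentence_initial):
    p = NAME_PROB.get(word.lower(), 0.5)  # unseen words get a neutral prior
    if word[0].isupper() and not sentence_initial:
        p = min(1.0, p * 2.0)  # capitalized mid-sentence: strong name signal
    elif word.islower():
        p *= 0.2  # lowercase: weak name signal
    return p
```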
I found this library quite useful for parsing names: Python Name Parser
It can also deal with names that are formatted Lastname, Firstname.

Is there a standard way in Python to fuzzy match a string with arbitrary list of acceptable values?

I am hoping for a function like this:
def findSimilar(string, options):
    ....
    return aString
Where aString is similar to the passed string but is present in options. I'm using this function to normalize user input in the toy application I'm working on. I read about using Levenshtein distance, but I decided to ask here, as I'm hoping there is a simple solution in the Python standard library.
Use difflib.get_close_matches.
get_close_matches(word, possibilities[, n][, cutoff])
Return a list of the best “good enough” matches. word is a sequence for which close matches are desired (typically a string), and possibilities is a list of sequences against which to match word (typically a list of strings).
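A minimal usage sketch:

```python
import difflib

options = ['apple', 'apricot', 'banana']
# Returns the closest match(es) above the similarity cutoff.
print(difflib.get_close_matches('appel', options, n=1, cutoff=0.6))
# ['apple']
```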
Calculate the Levenshtein distance:
http://en.wikipedia.org/wiki/Levenshtein_distance
There are already python implementations, although I don't know about their quality...
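If you'd rather not add a dependency, a straightforward pure-Python sketch of the distance (the classic dynamic-programming formulation, O(len(a) * len(b)) time):

```python
def levenshtein(a, b):
    # Keep only one row of the DP table at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]
```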
I think you may want to take a look at this post. You just need a fuzzy string comparator.
https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison
I would suggest using fuzzywuzzy from SeatGeek. It has a fantastic function called process.extract that does exactly what you are looking for. From their website, but adapted to your question:
from fuzzywuzzy import process

string = "new york jets"
options = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
process.extract(string, options, limit=2)
# [('New York Jets', 100), ('New York Giants', 78)]
From the description of your question, you don't need any kind of string similarity, you just need to know if the input string is in the list. For that just use a set instead, and test to see if the string is in the set, like this:
def isStringAcceptable(string, options):
    return string in options
If you want to be tolerant of users inputting the wrong string, you need to decide what kinds of errors you are going to tolerate. Using something like Levenshtein distance might be severe overkill for what you want, and it might give you funny results. If you just want to ignore casing, then call string.lower() and make sure all of the strings in your set are lower case. You probably don't need something as fancy as a string similarity metric.

How to format search autocompletion part lists?

I'm currently working on an AppEngine project, and I'd like to implement autocompletion of search terms. The items that can be searched for are reasonably unambiguous and short, so I was thinking of implementing it by giving each item a list of incomplete typings. So foobar would get a list like [f, fo, foo, foob, fooba, foobar]. The user's text in the searchbox is then compared to this list, and positive matches are suggested.
There are a couple of possible optimizations in this list that I was thinking of:
Removing spaces and punctuation from search terms: Foo. Bar becomes FooBar.
Removing capital letters
Removing leading particles like "the", "a", "an". The Guy would be guy, and indexed as [g, gu, guy].
Only adding substrings longer than 2 or 3 characters to the indexing list, so The Guy would be indexed as [gu, guy]. I thought that suggestions matching only the first letter would not be very relevant.
The users search term would also be formatted in this way, after which the DB is searched. Upon suggesting a search term, the particles, punctuation, and capital letters would be added according to the suggested object's full name. So searching for "the" would give no suggestions, but searching for "The Gu.." or "gu" would suggest "The Guy".
Is this a good idea? Mainly: would this formatting help, or only cause trouble?
I have already run into the same problem, and the solution I adopted was very similar to your idea. I split the items into words, convert them to lowercase, remove accents, and create a list of prefixes. For instance, "Báz Bar" would become ['b', 'ba', 'bar', 'baz'].
I have posted the code in this thread. The search box of this site is using it. Feel free to use it if you like.
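A sketch of that normalization pipeline (not the code from that thread; the stopword list and minimum prefix length are illustrative choices):

```python
import string
import unicodedata

def prefixes(term, min_len=1, stopwords=('the', 'a', 'an')):
    # Strip accents, then lowercase each word and drop surrounding punctuation.
    ascii_term = unicodedata.normalize('NFKD', term).encode('ascii', 'ignore').decode()
    out = set()
    for raw in ascii_term.split():
        word = raw.strip(string.punctuation).lower()
        if not word or word in stopwords:
            continue
        out.update(word[:i] for i in range(min_len, len(word) + 1))
    return sorted(out)
```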
