How to parse names from raw text - python

I was wondering if anyone knew of any good libraries or methods of parsing names from raw text.
For example, let's say I've got these as examples (note that sometimes the names are capitalized and sometimes not):
James Vaynerchuck and the rest of the group will be meeting at 1PM.
Sally Johnson, Jim White and brad burton.
Mark angleman Happiness, Productivity & blocks. Mark & Evan at 4pm.
My first thought is to load some sort of part-of-speech tagger (like Python's NLTK), tag all of the words, and strip out only the nouns. Then compare the nouns against a database of known words (i.e. a literal dictionary); if they aren't in the dictionary, assume they are a name.
Other thoughts would be to delve into machine learning, but that might be beyond the scope of what I need here.
Any thoughts, suggestions or libraries you could point me to would be very helpful.
Thanks!

I don't know why you think you need NLTK just to rule out dictionary words; a simple dictionary (which you might have installed somewhere like /usr/share/dict/words, or you can download one off the internet) is all you need:
with open('/usr/share/dict/words') as f:
    dictwords = {word.strip() for word in f}

with open(mypath) as f:
    names = [word for line in f for word in line.rstrip().split()
             if word.lower() not in dictwords]
Your words list may include names, but if so, it will include them capitalized, so:
dictwords = {word.strip() for word in f if word.islower()}
Or, if you want to whitelist proper names instead of blacklisting dictionary words:
with open('/usr/share/dict/propernames') as f:
    namewords = {word.strip() for word in f}

with open(mypath) as f:
    names = [word for line in f for word in line.rstrip().split()
             if word.title() in namewords]
But this really isn't going to work. Look at "Jim White" from your example. His last name is obviously going to be in any dictionary, and his first name will be in many (as a short version of "jimmy", as a common romanization of the Arabic letter "jīm", etc.). "Mark" is also a common dictionary word. And the other way around, "Will" is a very common name even though you want to treat it as a word, and "Happiness" is an uncommon name, but at least a few people have it.
So, to make this work even the slightest bit, you probably want to combine multiple heuristics. First, instead of a word being either always a name or never a name, each word has a probability of being used as a name in some relevant corpus: White may be a name 13.7% of the time, Mark 41.3%, Jim 99.1%, Happiness 0.1%, etc. Next, if a word is capitalized but is not the first word in a sentence, it's much more likely to be a name (how much more? I don't know; you'll need to test and tune for your particular input), and if it's lowercase, it's less likely to be a name.

You could bring in more context: for example, you have a lot of full names, so if something is a possible first name and it appears right next to something that's a common last name, it's more likely to be a first name. You could even try to parse the grammar (it's OK if you bail on some sentences; they just won't get any input from the grammar rule): if two adjacent words only work as part of a sentence when the second one is a verb, they're probably not a first and last name, even if that same second word could be a noun (and a name) in other contexts. And so on. A toy sketch of the scoring idea follows below.
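Here is a minimal sketch of that scoring idea. The probabilities and multipliers are invented for illustration; you would estimate the per-word probabilities from a corpus and tune the weights on your own data:

# Invented per-word name probabilities; in practice, estimate these from
# a corpus (e.g. census name lists versus general-text frequencies).
name_prob = {"white": 0.137, "mark": 0.413, "jim": 0.991, "happiness": 0.001}

def name_score(word, sentence_initial):
    # Unseen words get 0.5: no evidence either way.
    p = name_prob.get(word.lower(), 0.5)
    if word[0].isupper() and not sentence_initial:
        p = min(1.0, p * 2.0)  # mid-sentence capitalization boosts the odds
    elif word.islower():
        p *= 0.2               # lowercase words are rarely names
    return p

words = "Sally Johnson , Jim White and brad burton .".split()
for i, w in enumerate(words):
    if w.isalpha():
        print(w, round(name_score(w, sentence_initial=(i == 0)), 3))

You would then threshold these scores (and ideally feed them, together with the context features, into a simple classifier) rather than treating any single cue as decisive.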

I found this library quite useful for parsing names: Python Name Parser
It can also deal with names that are formatted Lastname, Firstname.
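For instance, a quick sketch of its API, assuming the linked library is the nameparser package:

from nameparser import HumanName

name = HumanName("Vaynerchuck, James")
print(name.first)  # James
print(name.last)   # Vaynerchuck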

Related

Is there a way to detect if unnecessary characters are added to strings to bypass spam detection?

I'm building a simple spam classifier, and from a cursory look at my dataset, most spam messages put spaces in between "spammy" words, which I assume is meant to bypass spam classifiers. Here are some examples:
c redi t card
mort - gage
I would like to be able to take these and encode them in my dataframe as the correct words:
credit card
mortgage
I'm using Python by the way.
This depends a lot on whether you have a list of all spam words or not.
If you do have a list of spam words and you know that there are always only ADDED spaces (e.g. give me your cred it card in formation) but never MISSING spaces (e.g. give me yourcredit cardinformation), then you could use a simple rule-based approach:
spam_words = {"credit card", "rolex"}
# Remove spaces so "credit card" can match "credit car d" once rejoined.
spam_words_no_spaces = {"".join(s.split()) for s in spam_words}

sentence = "give me your credit car d inform ation and a rol ex"
tokens = sentence.split()

# Check every consecutive run of tokens against the space-free spam words.
for start in range(len(tokens)):
    for end in range(start + 1, len(tokens) + 1):
        window = tuple(tokens[start:end])
        if "".join(window) in spam_words_no_spaces:
            print(window)
Which prints:
> ('credit', 'car', 'd')
> ('rol', 'ex')
So first create a set of all spam words, then, for an easier comparison, remove all spaces from them (although you could adjust the method to consider only correctly spaced spam words).
Then split the sentence into tokens, and finally check every consecutive run of tokens (from single words up to the whole sentence) to see whether, with the spaces removed, it is in the set of spam words.
If you don't have a list of spam words, your best chance would probably be to do general whitespace correction on the data. Check out Optical Character Recognition (OCR) Post Correction, for which you can find some pretrained models. Also check out this thread, which talks about how to add spaces to spaceless text and even mentions a Python package for that. So in theory you could remove all spaces and then try to split the text again into meaningful words, to increase the chance that the spam words are found. Generally, your problem (and the opposite one, missing whitespace) is called word boundary detection, so you might want to check some resources on that.
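As a rough illustration of that remove-then-resegment idea, here is a minimal dynamic-programming word segmenter. It assumes only that you have a plain word list; vocab and max_len are made up for the sketch:

def segment(text, vocab, max_len=20):
    # best[i] = (cost, words) for text[:i], where cost counts characters
    # not covered by any vocabulary word, so full-word splits win.
    best = [(0, [])]
    for i in range(1, len(text) + 1):
        # Fallback: leave text[i-1] as an uncovered single character.
        cost, words = best[i - 1]
        candidate = (cost + 1, words + [text[i - 1]])
        for j in range(max(0, i - max_len), i):
            chunk = text[j:i]
            if chunk in vocab and best[j][0] < candidate[0]:
                candidate = (best[j][0], best[j][1] + [chunk])
        best.append(candidate)
    return best[-1][1]

vocab = {"credit", "card", "mortgage", "give", "me", "your"}
print(segment("creditcard", vocab))  # ['credit', 'card']
print(segment("mortgage", vocab))    # ['mortgage']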
Also, you should be aware that modern pretrained models, such as common transformer models, often use sub-token-level embeddings for unknown words, so they can still relatively easily combine what they learned for the split and non-split versions of a common word.

Data anonymization using python

I have an unstructured, free form text (taken from emails, phone conversation transcriptions), a list of first names and a list of last names.
What would be the most effective and pythonic method to replace all the first names in the text with "--FIRSTNAME--" and all the last names with "--LASTNAME--", based on the lists I have?
I could iterate over each of the first name list and do a
text.replace(firstname, '--FIRSTNAME--')
but that seems very inefficient, especially for a very long list of names and many long texts to process. Are there better options?
Example:
Text: "Hello, this is David, how may I help you? Hi, my name is Alex Bender and I am trying to install my new coffee machine."
First name list: ['Abe', 'Alex', 'Andy', 'David', 'Mark', 'Timothy']
Last name list: ['Baxter', 'Bender', 'King', 'McLoud']
Expected output: "Hello, this is --FIRSTNAME--, how may I help you? Hi, my name is --FIRSTNAME-- --LASTNAME-- and I am trying to install my new coffee machine."
I followed the advice of @furas and checked out the flashtext module. This pretty much answers my need to the fullest.
I did run into a problem, as I am working with Hebrew (non-ASCII characters) and the text replacement would not follow word boundaries.
There is a method of the KeywordProcessor class, add_non_word_boundary(self, character), which for some reason is not documented, that lets you add characters that should not be considered boundary characters (in addition to the default [a-zA-Z0-9_]), allowing for whole-word replacement only.
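For anyone landing here, a minimal sketch of that flashtext approach, using the example lists from the question (the Hebrew boundary tweak is shown only as a comment, since the exact characters depend on your data):

from flashtext import KeywordProcessor

first_names = ['Abe', 'Alex', 'Andy', 'David', 'Mark', 'Timothy']
last_names = ['Baxter', 'Bender', 'King', 'McLoud']

kp = KeywordProcessor()
for name in first_names:
    kp.add_keyword(name, '--FIRSTNAME--')
for name in last_names:
    kp.add_keyword(name, '--LASTNAME--')

# For non-ASCII scripts (e.g. Hebrew), tell flashtext those letters are
# word characters too, so replacement stays whole-word only:
# kp.add_non_word_boundary('\u05d0')  # repeat per character as needed

text = ("Hello, this is David, how may I help you? Hi, my name is "
        "Alex Bender and I am trying to install my new coffee machine.")
print(kp.replace_keywords(text))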

regex to find LastnameFirstname with no space between in Python

I currently have several names that look like this:
SmithJohn
smithJohn
O'BrienPeter
Both of these have no spaces, but have a capital letter in between.
Is there a regex to match these types of names (but that won't match names like "Smith, John", "Smith John" or "Smith.John")? Furthermore, how could I split up the last name and first name into two different variables?
Thanks!
If all you want is a string with a capital letter in the middle and lowercase letters around it, this should work okay: [a-z][A-Z] (make sure you use re.search and not re.match). It handles "O'BrienPeter" fine, but might match names like "McCutchon" when it shouldn't. It's impossible to come up with a regex, or any program really, that does what you want for all names (see Falsehoods Programmers Believe About Names).
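For the splitting part, a small sketch building on that pattern (it inherits the same caveat about names like "McCutchonBrian"):

import re

def split_name(s):
    # Split at the first lowercase-to-uppercase transition.
    m = re.search(r"[a-z][A-Z]", s)
    if not m:
        return None  # no transition found; not in LastnameFirstname form
    i = m.start() + 1
    return s[:i], s[i:]

print(split_name("SmithJohn"))     # ('Smith', 'John')
print(split_name("smithJohn"))     # ('smith', 'John')
print(split_name("O'BrienPeter"))  # ("O'Brien", 'Peter')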
As Brian points out, there's a question you need to ask yourself here: What guarantees do you have about the strings you will be processing?
Do you know without a doubt that the only capitals will be the beginnings of the names? Or could something like "McCutchonBrian", or in my case "Mallegol-HansenPhilip" have found its way in there as well?
In the greater context of software in general, you need to consider the assumptions you are going in with. Otherwise you're going to be solving a problem that is, in fact, not the problem you have.

how to deal with compound words in regex

I am making regexes that return the definitions of abbreviations from a text. I have solved a number of cases, but I cannot work out a solution for the case where the abbreviation has a different number of characters than its expansion has words, perhaps because one of the words is a compound, like below.
string = 'CRC comes from the words colorectal cancer'
I would like to get 'colorectal cancer' based on its short form. Do you have any advice on what steps I should take? I thought of splitting compound words, but that will lead to other problems.
In CRC, the first word should begin with C, and the second word should begin with either R or C. If the second word begins with R, the third word should begin with C.
At the same time, you should check whether the second word starts with C (meaning the first word, a compound like 'colorectal', covered both the C and the R); if so, you don't need a third word at all. An OR condition in the regex may help here; see the sketch below. I cannot pinpoint it exactly without more data samples.
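To make that concrete, here is one hedged sketch for this specific example. The pattern is invented for CRC only; a general solution is closer to an abbreviation-definition algorithm such as Schwartz-Hearst than to a single regex:

import re

string = 'CRC comes from the words colorectal cancer'

# Two alternatives: three words starting C, R, C; or two words where the
# first (a compound like "colorectal") supplies both the C and the R.
pattern = re.compile(r"\b(?:[Cc]\w* [Rr]\w* [Cc]\w*|[Cc]\w*r\w* [Cc]\w*)\b")

match = pattern.search(string)
print(match.group() if match else None)  # colorectal cancer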

Creating a dictionary in Python and using it to translate a word

I have created a Spanish-English dictionary in Python and I have stored it using the variable translation. I want to use that variable in order to translate a text from Spanish into English. This is the code I have used so far:
from nltk.corpus import swadesh
import my_books

es2en = swadesh.entries(['es', 'en'])
translation = dict(es2en)

for sentence in my_books.sents("book_1"):
    for word in my_books.words("book_1"):
        if word in es2en:
            print(translation, end=" ")
        else:
            print("unknown_word", end=" ")
    print("")
My problem is that none of the words in book_1 is actually translated into English, so I get a text full of unknown_word. I think I'm probably using translation in the wrong way... how could I achieve my desired result?
The .entries() method, when given more than one language, returns not a dictionary but a list of tuples. See here for an example.
You need to convert your list of pairs (2-tuples) into a dictionary. You are doing that with your translation = statement.
However, you then ignore the translation variable, and check for if word in es2en:
You need to check if the word is in translation, and subsequently look up the correct translation, instead of printing the entire dictionary.
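Putting that together, a minimal corrected sketch. Here my_books is the question's own module, assumed to behave like an NLTK corpus reader whose sents() yields lists of word tokens; note the original also re-walked the whole book's words inside every sentence, so this iterates the sentence's own words instead:

from nltk.corpus import swadesh
import my_books  # the question's own module, assumed to exist

translation = dict(swadesh.entries(['es', 'en']))

for sentence in my_books.sents("book_1"):
    for word in sentence:
        # Look up this word's translation, with a placeholder fallback.
        print(translation.get(word, "unknown_word"), end=" ")
    print("")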
It can be a case-sensitivity issue.
For example: if a dict contains the key 'Bomb' and you look up 'bomb', it won't be found.
Lowercase all the keys in es2en and then look for word.lower() in es2en.
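A short sketch of that normalization:

# Build the dict with lowercased keys, then look words up lowercased too.
translation = {k.lower(): v for k, v in swadesh.entries(['es', 'en'])}
print(translation.get(word.lower(), "unknown_word"), end=" ")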
I am in the process of building a translation machine (a language dictionary), from Bahasa Indonesia to English and vice versa.
I built it from scratch. What I did was collect all the words in Bahasa, along with their meanings, and then compare them against the WordNet database (crawled).
After grouping the meanings and pairing the English ones with the Bahasa ones, collect as much data as you can and separate it into scientific content and everyday content.
Then tokenize all the data into sentences and calculate which words have a high probability of pairing with which other words (in both Bahasa and English). This is needed because every word can have several meanings, and this calculation is what you use to choose which translation to apply.
Example in Bahasa:
'bisa' can mean poison, and has a high probability of pairing with snake or bite;
'bisa' can also mean being able to do something, with a high probability of pairing with verbs or with expressions of willingness to do something.
So if the tokenized result pairs with snake or bite, you search for the similar meaning by checking snake and poison in the English database, and you will find that venom always pairs with snake (it has a similar meaning to toxin/poison).
Another grouping can be done by word type (nouns, verbs, adjectives, etc.):
bisa == poison (noun)
bisa == can (verb)
That's it. Once you have the calculation, you don't need the database any more; you only need the word-matching data.
You can do the calculation by checking online data (e.g. Wikipedia), downloading a dump, or using Bible/book files or any other database that contains lots of sentences. A toy sketch of the pairing idea follows below.
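Here is a toy sketch of that co-occurrence pairing idea. The cue words for each sense are invented for illustration; in practice you would estimate them from pairing counts over your corpus:

# Hypothetical cue words for the two senses of the Bahasa word "bisa".
senses = {
    "poison": {"ular", "gigit"},       # snake, bite
    "can":    {"melakukan", "ingin"},  # do, want
}

def disambiguate(tokens, senses):
    # Score each sense by how many of its cue words appear in the sentence.
    scores = {sense: sum(tok in cues for tok in tokens)
              for sense, cues in senses.items()}
    return max(scores, key=scores.get)

print(disambiguate("bisa ular itu berbahaya".split(), senses))  # poison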
