NLTK Most common synonym (Wordnet) for each word

Is there a way to find the most common synonym of a word with NLTK? I would like to simplify a sentence using the most common synonyms of each word on it.
If a word used in the sentence is already the most common word from its group of synonyms, it shouldn't be changed.
Let's say "Hi" is more common than "Hello"; "Dear" is more common than "Valued"; and "Friend" is already the most common word of its group of synonyms.
Input: "Hello my valued friend"
Return: "Hi my dear friend"

Synonyms are tricky, but if you are starting out with a synset from Wordnet and you simply want to choose the most common member in the set, it's pretty straightforward: Just build your own frequency list from a corpus, and look up each member of the synset to pick the maximum.
The nltk will let you build a frequency table in just a few lines of code. Here's one based on the Brown corpus:
import nltk
from nltk.corpus import brown
freqs = nltk.FreqDist(w.lower() for w in brown.words())
You can then look up the frequency of a word like this:
>>> print(freqs["valued"])
14
Of course you'll need to do a little more work: I would count words separately for each of the major parts of speech (WordNet provides n, v, a, and r, i.e. noun, verb, adjective and adverb, respectively), and use these POS-specific frequencies (after adjusting for the different tagset notations) to choose the right substitute.
>>> freq2 = nltk.ConditionalFreqDist((tag, wrd.lower()) for wrd, tag in
...                                  brown.tagged_words(tagset="universal"))
>>> print(freq2["ADJ"]["valued"])
0
>>> print(freq2["ADJ"]["dear"])
45
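To make the "look up each member and pick the maximum" step concrete, here is a minimal sketch. The function name most_common_synonym, the synonym groups and the counts are all made up for illustration, so that it runs without downloading a corpus. In practice the groups would come from WordNet synsets and the counts from a frequency table like the ones above.

```python
from collections import Counter

def most_common_synonym(word, synonyms, freqs):
    """Return the most frequent candidate among `word` and its synonyms.

    Ties are resolved in favour of the original word, so a word that is
    already the most common member of its group is left unchanged.
    """
    best = word
    for candidate in synonyms:
        if freqs[candidate.lower()] > freqs[best.lower()]:
            best = candidate
    return best

# Toy counts (assumed values, standing in for a corpus-derived frequency table)
toy_freqs = Counter({"hi": 50, "hello": 20, "dear": 45, "valued": 14,
                     "friend": 120, "my": 900})

# Toy synonym groups (assumed, standing in for WordNet synset members)
toy_synonyms = {
    "hello": ["hi"],
    "valued": ["dear"],
    "friend": ["pal"],
}

sentence = "hello my valued friend"
simplified = " ".join(
    most_common_synonym(w, toy_synonyms.get(w, []), toy_freqs)
    for w in sentence.split()
)
print(simplified)  # hi my dear friend
```

Words without an entry in the synonym table are passed through unchanged, and an unseen synonym counts as frequency 0 because Counter returns 0 for missing keys.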

Synonyms are a huge and open area of work in natural language processing.
In your example, how is the program supposed to know what the allowed synonyms are? One method might be to keep a dictionary of sets of synonyms for each word. However, this can run into problems due to overlaps in parts of speech: "dear" is an adjective, but "valued" can be an adjective or a past-tense verb.
Context is also important: the bigram "dear friend" might be more common than "valued friend", but "valued customer" would be more common than "dear customer". So, the sense of a given word needs to be accounted for too.
Another method might be to just look at everything and see what words appear in similar contexts. You need a huge corpus for this to be effective though, and you have to decide how large a window of n-grams you want to use (a bigram context? A 20-gram context?).
I recommend you take a look at applications of WordNet (https://wordnet.princeton.edu/), which was designed to help figure some of these things out. Unfortunately, I'm not sure you'll find a way to "solve" synonyms on your own, but keep looking and asking questions!
Edit: I should have included this link to an older question as well:
How to get synonyms from nltk WordNet Python
And the NLTK documentation on its interface with WordNet:
http://www.nltk.org/howto/wordnet.html
I don't think these address your question, however, since WordNet doesn't have usage statistics (which are dependent on the corpus you use). You should be able to apply its synsets in a method like above, though.

The other answer shows you how to use synonyms:
wn.synsets('small')
[Synset('small.n.01'),
Synset('small.n.02'),
Synset('small.a.01'),
Synset('minor.s.10'),
Synset('little.s.03'),
Synset('small.s.04'),
Synset('humble.s.01'),
Synset('little.s.07'),
Synset('little.s.05'),
Synset('small.s.08'),
Synset('modest.s.02'),
Synset('belittled.s.01'),
Synset('small.r.01')]
You now know how to get all the synonyms for a word. That's not the hard part. The hard part is determining what's the most common synonym. This question is highly domain dependent. Most common synonym where? In literature? In common vernacular? In technical speak?
Like you, I wanted to get an idea of how the English language was used. I downloaded 15,000 entire books from Project Gutenberg and processed the word and letter pair frequencies on all of them. After ingesting such a large corpus, I could see which words were used most commonly. Like I said above, though, it will depend on what you're trying to process. If it's something like Twitter posts, try ingesting a ton of tweets. Learn from what you have to eventually process.

Related

Is there a way to detect if unnecessary characters are added to strings to bypass spam detection?

I'm building a simple spam classifier and from a cursory look at my dataset, most spam messages put spaces inside "spammy" words, which I assume is meant to get them past spam classifiers. Here are some examples:
c redi t card
mort - gage
I would like to be able to take these and encode them in my dataframe as the correct words:
credit card
mortgage
I'm using Python by the way.

This depends a lot on whether you have a list of all spam words or not.
If you do have a list of spam words and you know that there are always only ADDED spaces (e.g. give me your cred it card in formation) but never MISSING spaces (e.g. give me yourcredit cardinformation), then you could use a simple rule-based approach:
spam_words = {"credit card", "rolex"}
spam_words_no_spaces = {"".join(s.split()) for s in spam_words}
sentence = "give me your credit car d inform ation and a rol ex"
tokens = sentence.split()
# Check every run of consecutive tokens, from single tokens up to the whole sentence
for length in range(1, len(tokens) + 1):
    for start in range(len(tokens) - length + 1):
        t = tuple(tokens[start:start + length])
        if "".join(t) in spam_words_no_spaces:
            print(t)
Which prints:
> ('rol', 'ex')
> ('credit', 'car', 'd')
So first create a set of all spam words, then for an easier comparison remove all spaces (although you could adjust the method to consider only correct spacing spam words).
Then split the sentence into tokens and finally get all possible unique consecutive subsequences in the token list (including one-word sequences and the whole sentence without whitespace), then check if they're in the list of spam words.
If you don't have a list of spam words, your best chance would probably be to do general whitespace correction on the data. Check out Optical Character Recognition (OCR) post-correction, for which you can find some pretrained models. Also check out this thread, which talks about how to add spaces to spaceless text and even mentions a Python package for that. So in theory you could remove all spaces and then try to split the text again into meaningful words to increase the chance that the spam words are found. Generally your problem (and the opposite, missing whitespace) is called word boundary detection, so you might want to check some resources on that.
Also, you should be aware that modern pretrained models such as common transformer models often use sub-token-level embeddings for unknown words, so they can relatively easily still combine what they learned for a split and a non-split version of a common word.
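The "remove all spaces, then split again" idea can be sketched with a small dynamic-programming segmenter. This is only an illustration: the function name segment and the toy vocabulary are assumptions, and a real word list with ambiguous entries would need word frequencies to choose between competing splits.

```python
def segment(text, vocabulary):
    """Split `text` (no spaces) into vocabulary words, or return None.

    Dynamic programming over prefixes: best[i] holds one segmentation
    of text[:i], or None if no segmentation was found.
    """
    best = [None] * (len(text) + 1)
    best[0] = []
    max_len = max(map(len, vocabulary))
    for end in range(1, len(text) + 1):
        # Try the longest candidate piece ending at `end` first
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if best[start] is not None and piece in vocabulary:
                best[end] = best[start] + [piece]
                break
    return best[len(text)]

# Toy vocabulary (an assumption; a real one would come from a word list or corpus)
vocab = {"give", "me", "your", "credit", "card", "information"}

spaced = "give me your cred it car d in formation"
collapsed = "".join(spaced.split())
print(segment(collapsed, vocab))
# ['give', 'me', 'your', 'credit', 'card', 'information']
```

The re-segmented tokens can then be matched against spam words directly, without enumerating token combinations.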

NLP: How do I combine stemming and tagging?

I'm trying to write code which passes in text that has been tokenized and had the stop words filtered out, and then stems and tags it. However, I'm not sure in what order I should stem and tag. This is what I have at the moment:
#### Stemming
ps = PorterStemmer() # PorterStemmer imported from nltk.stem
stemText = []
for word in swFiltText: # Tagged text w/o stop words
    stemText.append(ps.stem(word))
#### POS Tagging
def tagging():
    tagTot = []
    try:
        for i in stemText:
            words = nltk.word_tokenize(i) # I need to tokenize again (idk why?)
            tagged = nltk.pos_tag(words)
            tagTot = tagTot + tagged # Combine tagged words into list
    except Exception as e:
        print(str(e))
    return tagTot
tagText = tagging()
At first glance, this works just fine. However, because I stemmed first, pos_tag often mislabels words. For example, it marked "hous" as an adjective, when the original word was really the noun "house". But when I try to stem after tagging, it gives me an error about how pos_tag can't deal with 'tuples' - I'm guessing this has something to do with the way that the stemmer formats the word list as [('come', 'VB'), ('hous', 'JJ'), etc.
Should I be using a different stemmer/tagger? Or is the error in my code?
Thanks in advance!

You should tag the text before you apply stemming or lemmatisation to it.
Removing the endings of words takes away crucial clues about what part-of-speech tag a word can be.
The reason that you got hous as an adjective is that any tagger expects unprocessed tokens, and words ending in -ous in English are usually adjectives (nefarious, serious). If you tag it first, it can be recognised (even without context) as either a noun or a verb. The tagger can then use context (preceded by the? -> noun) to disambiguate which is the most likely one.
A good lemmatiser can take the part of speech into account, e.g. housing could be a noun (lemma: housing) or a verb (lemma: house). With part-of-speech information a lemmatiser can make the correct choice there.
Whether you use stemming or lemmatisation depends on your application. For many purposes they will be equivalent. The main differences in my experience are:
Stemming is a lot faster, as stemmers have a few rules on how to handle various endings
Lemmatisation gives you 'proper' words which you can look up in dictionaries (if you want to get glosses in other languages or definitions)
Stemmed strings sometimes don't look anything like the original word, and if you present them to a human user they might get confused
Stemmers conflate words which have similar meanings but different lemmas, so for information retrieval they might be more useful
Stemmers don't need a word list, so if you want to write your own stemmer, it's quicker than writing a lemmatiser (if you're processing languages for which no ready-made tools exist)
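As for the "tuples" error from the question: once the text is tagged, each item is a (word, tag) pair, so the stemmer has to be applied to the word part only. A minimal sketch, assuming the tags shown were produced by a tagger run on the original tokens. Both stem_tagged and toy_stem are made-up names, and toy_stem is only a crude stand-in for a real stemmer. With NLTK you would pass the output of nltk.pos_tag and PorterStemmer().stem instead.

```python
def stem_tagged(tagged_tokens, stem):
    """Apply `stem` to the word of each (word, tag) pair, keeping the tag."""
    return [(stem(word), tag) for word, tag in tagged_tokens]

def toy_stem(word):
    """Very rough stand-in for a real stemmer: strips a few common endings."""
    for suffix in ("ing", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Tags as a tagger run on the ORIGINAL tokens might produce them (assumed values)
tagged = [("the", "DT"), ("house", "NN"), ("houses", "NNS")]
print(stem_tagged(tagged, toy_stem))
# [('the', 'DT'), ('hous', 'NN'), ('hous', 'NNS')]
```

This way "hous" keeps the noun tag that was assigned to "house" before the ending was removed.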

I would suggest using lemmatization over stemming. Stemming just chops letters off the end of a word until the root/stem is reached, whereas lemmatization can also take into account the word's part of speech, which is determined from the surrounding text.

Creating a dictionary in Python and using it to translate a word

I have created a Spanish-English dictionary in Python and I have stored it using the variable translation. I want to use that variable in order to translate a text from Spanish into English. This is the code I have used so far:
from nltk.corpus import swadesh
import my_books
es2en = swadesh.entries(['es', 'en'])
translation = dict(es2en)
for sentence in my_books.sents("book_1"):
    for word in my_books.words("book_1"):
        if word in es2en:
            print(translation, end= " ")
        else:
            print("unknown_word", end= " ")
    print("")
My problem is that none of the words in book_1 is actually translated into English, so I get a text full of unknown_word. I think I'm probably using translation in the wrong way... how could I achieve my desired result?

The .entries() method, when given more than one language, returns not a dictionary but a list of tuples. See here for an example.
You need to convert your list of pairs (2-tuples) into a dictionary. You are doing that with your translation = statement.
However, you then ignore the translation variable, and check for if word in es2en:
You need to check if the word is in translation, and subsequently look up the correct translation, instead of printing the entire dictionary.
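A minimal sketch of that fix. The word pairs here are a hand-written stand-in for swadesh.entries(['es', 'en']), and translate_words is a made-up helper name.

```python
# Stand-in for swadesh.entries(['es', 'en']): a list of (Spanish, English) pairs
es2en = [("yo", "I"), ("perro", "dog"), ("comer", "eat")]
translation = dict(es2en)

def translate_words(words, table, unknown="unknown_word"):
    """Look each word up in `table`, falling back to `unknown`."""
    return [table.get(word, unknown) for word in words]

print(" ".join(translate_words(["yo", "comer", "gato"], translation)))
# I eat unknown_word
```

The membership test and the lookup both go through the dictionary, and only the matching value is printed rather than the whole dictionary.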

It can be a 'Case Sensitivity' issue.
For example:
If a dict contains the key 'Bomb' and you look for 'bomb',
it won't be found.
Lowercase all the keys of the dictionary and then check: word.lower() in translation

I am in the process of building a translation machine (a language dictionary).
It translates from Bahasa (Indonesian) to English and vice versa.
I am building it from scratch. What I'm doing is collecting all the words in Bahasa together with their meanings,
then comparing them with the WordNet database (by crawling it).
Once you have a group of meanings and have paired/grouped the English meanings with the Bahasa words, do this: collect as much data as you can and separate it into scientific content and everyday content.
Tokenize all the data into sentences and calculate which words have a high probability of appearing together with other words (in both Bahasa and English). This is needed because every word can have several meanings; the calculation is used to choose which word to use.
Example in Bahasa:
'bisa' can mean poison in Bahasa, and then has a high probability of pairing with snake or bite
'bisa' can also mean being able to do something, and then has a high probability of pairing with verbs or expressions of willingness to do something
So if the tokenized result pairs it with snake or bite, you search for the similar meaning by checking snake and poison in English. Searching the English database, you will find that venom always pairs with snake (and has a similar meaning to toxin / poison).
Another grouping can be done by word type (noun, verb, adjective, etc.).
bisa == poison (noun)
bisa == can (verb).
That's it. Once you have the calculation, you don't need the database; you only need the word-matching data.
You can do the calculation on online data (e.g. Wikipedia), on downloaded data, on a Bible/book file, or on any other database that contains lots of sentences.
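Roughly, the pairing calculation could look like the sketch below. The co-occurrence counts are invented and choose_translation is a made-up name. It also assumes the context words have already been mapped to English, which is a simplification of the process described above.

```python
from collections import Counter

# Toy co-occurrence counts (assumed): how often each English candidate for
# "bisa" appears near a given context word in some corpus
cooccurrence = {
    "venom": Counter({"snake": 40, "bite": 25}),
    "can": Counter({"swim": 30, "help": 35, "snake": 1}),
}

def choose_translation(candidates, context_words, counts):
    """Pick the candidate with the highest total co-occurrence with the context."""
    def score(candidate):
        return sum(counts[candidate][w] for w in context_words)
    return max(candidates, key=score)

print(choose_translation(["venom", "can"], ["snake", "bite"], cooccurrence))  # venom
print(choose_translation(["venom", "can"], ["help", "swim"], cooccurrence))   # can
```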

Filtering out Non English sentences in a list in Python Pandas

So there is an Excel file which I have read with pandas and stored in a dataframe 'df'. That Excel file contains 24 columns as 'questions' and 631 rows as 'responses/answers'.
I converted one such question into a list so that I can tokenize it and apply further NLP-related tasks to it.
df_lst = df['Q8 Why do you say so ?'].values.tolist()
Now, this gives me a list that contains 631 sentences, some of which are non-English. I want to filter out the non-English sentences so that in the end I am left with a list that contains only English sentences.
What I have:
df_lst = ["The excecutive should be able to understand the customer's problem",'Customers should get correct responses to their queries', 'This text is in a random non english language'...]
Output (what I want):
english_words = ["The excecutive should be able to understand the customer's problem",'Customers should get correct responses to their queries', ...]
Also, I read about a Python library named pyenchant which should be able to do this, but it's not compatible with Windows 64-bit and Python 3. Is there any other way this can be done?
Thanks!

There is another library (closely related to nltk), TextBlob.
It was initially aimed at sentiment analysis, but you can also use it for translation and language detection. See the "Translation and Language Detection" section of the docs: https://textblob.readthedocs.io/en/dev/quickstart.html
Good luck!

Have you considered taking advantage of the number of English "stopwords" in your sentences? Give a look at the nltk package. Check English stopwords using the following code:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords') # If you just installed the package
set(stopwords.words('english'))
You could add a new column indicating the number of English stopwords present in each of your sentences. Presence of stopwords could be used as a predictor of English language.
Another thing that could work, if you know for a fact that most answers are in English to begin with, is to make a frequency ranking for words (possibly for each question in your data). In your example, it looks like the word "customer" shows up quite consistently for the question under study. So you could engineer a variable that indicates the presence of very frequent words in an answer. That could also serve as a predictor. Do not forget to either make all words lowercase or uppercase and deal with plurals and 's so you don't rank "customer", "Customer", "customers", "Customers", "customer's" and "customers'" all as different words.
After engineering the variables above, you can set up a threshold above which you consider the sentence to be written in English, or you can go for something a bit more fancy in terms of unsupervised learning.
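A minimal sketch of the stopword idea with a threshold. The stopword set here is a small hand-picked subset, the 0.2 cutoff is an arbitrary assumption, and the example sentences are made up. In practice you would use the full NLTK list and tune the threshold on your own responses.

```python
# Small hand-picked stopword set (an assumption; NLTK's English list is much longer)
ENGLISH_STOPWORDS = {"the", "a", "an", "is", "to", "be", "of", "in",
                     "and", "should", "their", "this"}

def stopword_ratio(sentence, stopwords=ENGLISH_STOPWORDS):
    """Fraction of whitespace-separated tokens that are stopwords."""
    tokens = sentence.lower().split()
    if not tokens:
        return 0.0
    return sum(token in stopwords for token in tokens) / len(tokens)

def keep_english(sentences, threshold=0.2):
    """Keep sentences whose stopword ratio reaches `threshold` (a hypothetical cutoff)."""
    return [s for s in sentences if stopword_ratio(s) >= threshold]

# Made-up example responses
responses = [
    "Customers should get correct responses to their queries",
    "Los clientes deben recibir respuestas correctas",
]
print(keep_english(responses))
# ['Customers should get correct responses to their queries']
```

Very short answers will be noisy with this kind of ratio, so it is worth checking a sample of the rejected sentences by hand.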

Sort words by their usage

I have a list of english words (approx 10000) and I'd like to sort them by their usage as they occur in literature, newspaper, blogs etc. Can I sort them in Python or other language? I heard about NLTK which is the closest library I know that could help. Or is this task for other tool?
thank you

Python and NLTK are the perfect tools to sort your wordlist, as NLTK comes with some corpora of the English language, from which you can extract frequency information.
The following code will print a given wordlist in the order of word frequency in the Brown corpus:
import nltk
from nltk.corpus import brown
wordlist = ["corpus","house","the","Peter","asdf"]
# collect frequency information from brown corpus, might take a few seconds
freqs = nltk.FreqDist([w.lower() for w in brown.words()])
# sort wordlist by word frequency
wordlist_sorted = sorted(wordlist, key=lambda x: freqs[x.lower()], reverse=True)
# print the sorted list
for w in wordlist_sorted:
    print(w)
output:
>>>
the
house
Peter
corpus
asdf
If you want to use a different corpus or get more information you should have a look at chapter 2 of the nltk book.

You can use collections.Counter. The code is then as easy as:
import collections
l = get_iterable_or_list_of_words() # That is up to you
c = collections.Counter(l)
print(c.most_common())

I don't know much about natural language processing, but I think Python is an ideal language for you to use for the purpose.
A Google search for "Python natural language" found:
http://www.nltk.org/
A search of StackOverflow found this answer:
Python or Java for text processing (text mining, information retrieval, natural language processing)
Which in turn linked to Pattern:
http://www.clips.ua.ac.be/pages/pattern
You might want to take a look at Pattern, that seems promising.
Good luck and have fun!
