Sort words by their usage - python

I have a list of English words (approx. 10000) and I'd like to sort them by their usage as they occur in literature, newspapers, blogs etc. Can I sort them in Python or another language? I heard about NLTK, which is the closest library I know that could help. Or is this a task for another tool?
thank you

Python and NLTK are the perfect tools to sort your wordlist, as NLTK comes with some corpora of the English language, from which you can extract frequency information.
The following code will print a given wordlist in the order of word frequency in the brown corpus:
import nltk
from nltk.corpus import brown
wordlist = ["corpus","house","the","Peter","asdf"]
# collect frequency information from brown corpus, might take a few seconds
freqs = nltk.FreqDist([w.lower() for w in brown.words()])
# sort wordlist by word frequency
wordlist_sorted = sorted(wordlist, key=lambda x: freqs[x.lower()], reverse=True)
# print the sorted list
for w in wordlist_sorted:
    print(w)
output:
>>>
the
house
Peter
corpus
asdf
If you want to use a different corpus or get more information you should have a look at chapter 2 of the nltk book.

You can use collections.Counter. The code is then as easy as:
import collections
l = get_iterable_or_list_of_words()  # That is up to you
c = collections.Counter(l)
print(c.most_common())
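To connect the Counter idea back to the original question, here is a minimal sketch that counts words in some reference text and then orders a separate word list by those counts. The function name sort_by_usage, the regex tokenizer and the tiny sample text are assumptions made for illustration; a real run would need a large corpus.

```python
import collections
import re

def sort_by_usage(wordlist, corpus_text):
    """Return wordlist ordered from most to least frequent in corpus_text."""
    # Lowercase and pull out alphabetic tokens; a real tokenizer would do better.
    tokens = re.findall(r"[a-z']+", corpus_text.lower())
    counts = collections.Counter(tokens)
    # A Counter returns 0 for unseen words, so unknown words sort last.
    return sorted(wordlist, key=lambda w: counts[w.lower()], reverse=True)

# Tiny stand-in for a real corpus (hypothetical data).
sample_corpus = "The house on the hill. Peter saw the house."
print(sort_by_usage(["corpus", "house", "the", "Peter", "asdf"], sample_corpus))
# ['the', 'house', 'Peter', 'corpus', 'asdf']
```

Words with equal counts keep their original relative order, because sorted() is stable.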

I don't know much about natural language processing, but I think Python is an ideal language for you to use for the purpose.
A Google search for "Python natural language" found:
http://www.nltk.org/
A search of StackOverflow found this answer:
Python or Java for text processing (text mining, information retrieval, natural language processing)
Which in turn linked to Pattern:
http://www.clips.ua.ac.be/pages/pattern
You might want to take a look at Pattern, that seems promising.
Good luck and have fun!

Related

Meaningless Spacy Nouns

I am using Spacy for extracting nouns from sentences. These sentences are grammatically poor and may contain some spelling mistakes as well.
Here is the code that I am using:
Code
import spacy
import re
nlp = spacy.load("en_core_web_sm")
sentence = "HANDBRAKE - slow and fast (SFX)"
string = sentence.lower()
cleanString = re.sub(r'\W+', ' ', string)
cleanString = cleanString.replace("_", " ")
doc = nlp(cleanString)
for token in doc:
    if token.pos_ == "NOUN":
        print(token.text)
Output:
sfx
Similarly for sentence "fast foward2", I get Spacy noun as
foward2
Which shows that these nouns have some meaningless words like: sfx, foward2, ms, 64x, bit, pwm, r, brailledisplayfastmovement, etc.
I only want to keep phrases that contain sensible single-word nouns like broom, ticker, pool, highway etc.
I have tried using Wordnet to keep only the nouns that both Wordnet and Spacy recognize, but it is a bit strict and filters out some sensible nouns as well. For example, it filters out nouns like motorbike, whoosh, trolley, metal, suitcase, zip etc.
Therefore, I am looking for a solution in which I can filter out most sensible nouns from spacy nouns list that I have obtained.
It seems you can use pyenchant library:
Enchant is used to check the spelling of words and suggest corrections for words that are miss-spelled. It can use many popular spellchecking packages to perform this task, including ispell, aspell and MySpell. It is quite flexible at handling multiple dictionaries and multiple languages.
More information is available on the Enchant website:
https://abiword.github.io/enchant/
Sample Python code:
import spacy, re
import enchant #pip install pyenchant
d = enchant.Dict("en_US")
nlp = spacy.load("en_core_web_sm")
sentence = "For example, it filters nouns like motorbike, whoosh, trolley, metal, suitcase, zip etc"
cleanString = re.sub(r'[\W_]+', ' ', sentence.lower())  # Merging \W and _ into one regex
doc = nlp(cleanString)
for token in doc:
    if token.pos_ == "NOUN" and d.check(token.text):
        print(token.text)
# => [example, nouns, motorbike, whoosh, trolley, metal, suitcase, zip]
When using the pyenchant spellchecker, I have found it useful to run the check after converting the word fully to uppercase. Splitting the sentence or phrase and feeding in the words one at a time also gives better results.
Example:
enchantChecker.check("landsat".upper()) and enchantChecker.check("wwii".upper()) return True, whereas the lowercase words return False.
If you need to combine more than one spellchecker, another good option is to check the spaCy library's is_oov (out of vocabulary) flag after loading en_core_web_lg.
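The idea of combining several checkers can be sketched without committing to any particular library. In the sketch below each checker is just a callable that takes a word and returns True or False; the helper names make_vocab_checker and keep_sensible and the tiny vocabularies are assumptions for illustration only. In practice a callable such as d.check from the pyenchant example above could be passed in instead.

```python
def make_vocab_checker(vocabulary):
    """Build a checker from a plain set of known words.

    This is a stand-in for a real spellchecker, used only for illustration.
    """
    known = {w.lower() for w in vocabulary}
    return lambda word: word.lower() in known

def keep_sensible(nouns, checkers):
    """Keep the nouns that at least one checker accepts."""
    return [n for n in nouns if any(check(n) for check in checkers)]

# Two hypothetical vocabularies playing the role of two spellcheckers.
general = make_vocab_checker(["broom", "pool", "highway", "metal"])
informal = make_vocab_checker(["whoosh", "motorbike"])

print(keep_sensible(["sfx", "broom", "whoosh", "foward2", "metal"],
                    [general, informal]))
# ['broom', 'whoosh', 'metal']
```

Accepting a word when any checker knows it is the lenient choice; switching any() to all() would make the filter stricter, which is the behaviour the question found too aggressive with Wordnet alone.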

NLTK Most common synonym (Wordnet) for each word

Is there a way to find the most common synonym of a word with NLTK? I would like to simplify a sentence using the most common synonyms of each word on it.
If a word used in the sentence is already the most common word from its group of synonyms, it shouldn't be changed.
Let's say "Hi" is more common than "Hello"; "Dear" is more common than "Valued"; and "Friend" is already the most common word of its group of synonyms.
Input: "Hello my valued friend"
Return: "Hi my dear friend"
Synonyms are tricky, but if you are starting out with a synset from Wordnet and you simply want to choose the most common member in the set, it's pretty straightforward: Just build your own frequency list from a corpus, and look up each member of the synset to pick the maximum.
The nltk will let you build a frequency table in just a few lines of code. Here's one based on the Brown corpus:
import nltk
from nltk.corpus import brown
freqs = nltk.FreqDist(w.lower() for w in brown.words())
You can then look up the frequency of a word like this:
>>> print(freqs["valued"])
14
Of course you'll need to do a little more work: I would count words separately for each of the major parts of speech (wordnet provides n, v, a, and r, resp. noun, verb, adjective and adverb), and use these POS-specific frequencies (after adjusting for the different tagset notations) to choose the right substitute.
>>> freq2 = nltk.ConditionalFreqDist((tag, wrd.lower()) for wrd, tag in
...     brown.tagged_words(tagset="universal"))
>>> print(freq2["ADJ"]["valued"])
0
>>> print(freq2["ADJ"]["dear"])
45
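Once a frequency table exists, picking the most common member of each synonym group is a small step. The sketch below assumes the synonym groups are already known and uses made-up counts; the function name simplify, the groups and the numbers are all illustrative, not values taken from WordNet or from a full corpus count.

```python
from collections import Counter

def simplify(sentence, synonym_groups, freqs):
    """Replace each word with the most frequent member of its synonym group."""
    # Map every word to the group it belongs to.
    group_of = {w: group for group in synonym_groups for w in group}
    result = []
    for word in sentence.lower().split():
        # Words without a known group stay as they are.
        group = group_of.get(word, {word})
        result.append(max(group, key=lambda w: freqs[w]))
    return " ".join(result)

# Illustrative counts and groups (assumptions, not real corpus data).
freqs = Counter({"hi": 50, "hello": 20, "dear": 45, "valued": 14, "friend": 100})
groups = [{"hi", "hello"}, {"dear", "valued"}, {"friend", "pal"}]

print(simplify("Hello my valued friend", groups, freqs))
# hi my dear friend
```

This ignores part of speech and word sense, so it only works when each word belongs to exactly one group.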
Synonyms are a huge and open area of work in natural language processing.
In your example, how is the program supposed to know what the allowed synonyms are? One method might be to keep a dictionary of sets of synonyms for each word. However, this can run into problems due to overlaps in parts of speech: "dear" is an adjective, but "valued" can be an adjective or a past-tense verb.
Context is also important: the bigram "dear friend" might be more common than "valued friend", but "valued customer" would be more common than "dear customer". So, the sense of a given word needs to be accounted for too.
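The bigram point can be made concrete with a small count over adjacent word pairs. This is a rough sketch on a made-up token sequence; the helper names bigram_counts and best_modifier are assumptions, and a meaningful result would need a far larger corpus.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count each pair of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:]))

def best_modifier(candidates, noun, counts):
    """Pick the candidate that most often appears directly before noun."""
    return max(candidates, key=lambda adj: counts[(adj, noun)])

# Made-up text standing in for a corpus.
tokens = ("my dear friend wrote to a valued customer and "
          "a dear friend thanked the valued customer").split()
counts = bigram_counts(tokens)

print(best_modifier(["dear", "valued"], "friend", counts))    # dear
print(best_modifier(["dear", "valued"], "customer", counts))  # valued
```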
Another method might be to just look at everything and see what words appear in similar contexts. You need a huge corpus for this to be effective though, and you have to decide how large a window of n-grams you want to use (a bigram context? A 20-gram context?).
I recommend you take a look at applications of WordNet (https://wordnet.princeton.edu/), which was designed to help figure some of these things out. Unfortunately, I'm not sure you'll find a way to "solve" synonyms on your own, but keep looking and asking questions!
Edit: I should have included this link to an older question as well:
How to get synonyms from nltk WordNet Python
And the NLTK documentation on its interface with WordNet:
http://www.nltk.org/howto/wordnet.html
I don't think these address your question, however, since WordNet doesn't have usage statistics (which are dependent on the corpus you use). You should be able to apply its synsets in a method like above, though.
The other answer shows you how to use synonyms:
wn.synsets('small')
[Synset('small.n.01'),
Synset('small.n.02'),
Synset('small.a.01'),
Synset('minor.s.10'),
Synset('little.s.03'),
Synset('small.s.04'),
Synset('humble.s.01'),
Synset('little.s.07'),
Synset('little.s.05'),
Synset('small.s.08'),
Synset('modest.s.02'),
Synset('belittled.s.01'),
Synset('small.r.01')]
You now know how to get all the synonyms for a word. That's not the hard part. The hard part is determining what's the most common synonym. This question is highly domain dependent. Most common synonym where? In literature? In common vernacular? In technical speak?
Like you, I wanted to get an idea of how the English language was used. I downloaded 15,000 entire books from Project Gutenberg and processed the word and letter pair frequencies on all of them. After ingesting such a large corpus, I could see which words were used most commonly. As I said above, though, it will depend on what you're trying to process. If it's something like Twitter posts, try ingesting a ton of tweets. Learn from what you have to eventually process.

How can I take a few paragraphs of text, see if any sentence has a pronoun and select all those sentences to make a new paragraph?

Should I use NLTK or regular expressions to split it?
How can I do the selection for pronouns (he/she). I want to select any sentence that has a pronoun.
This is a part of a larger project and I am new to Python. Could you please point me to any helpful code?
I am working on an NLP project which has similar needs. I suggest you use NLTK, since it makes things really easy and gives a lot of flexibility. Since you need to collect all sentences that contain pronouns, you can split the text into sentences and hold them in a list. Then you can iterate over the list and look for sentences containing pronouns. Also make sure you note down the index of each matching sentence (in the list), or collect the matches in a new list.
Sample code below:
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
sentences = ['alice loves to read crime novels.', 'she also loves to play chess with him']
sentences_with_pronouns = []
for sentence in sentences:
    words = word_tokenize(sentence)
    for word in words:
        word_pos = pos_tag([word])
        if word_pos[0][1] == 'PRP':
            sentences_with_pronouns.append(sentence)
            break
print(sentences_with_pronouns)
Output:
['she also loves to play chess with him']
P.S. Also check out the pylucene and whoosh libraries, which are pretty useful Python packages for text indexing and search.
NLTK is your best bet. Given a string of sentences as input, you can obtain a list of those sentences containing pronouns by doing:
from nltk import pos_tag, sent_tokenize, word_tokenize
paragraph = "This is a sentence with no pronouns. Take it or leave it."
print([sentence for sentence in sent_tokenize(paragraph)
       if 'PRP' in {pos for _, pos in pos_tag(word_tokenize(sentence))}])
Returns:
['Take it or leave it.']
Basically we split the string into a list of sentences, those sentences into a list of words, and convert the list of words for each sentence into a set of part of speech tags (this is important since if we don't, when we have multiple pronouns in a sentence, we would get duplicate sentences).
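Since the question also asks about regular expressions, here is a rough regex-only sketch for comparison. It assumes a fixed pronoun list (the PRONOUNS set below is my own, incomplete choice) and a naive sentence split on end punctuation, so it will mishandle abbreviations and cannot tell "her" the object pronoun from "her" the possessive; a POS tagger handles such cases better.

```python
import re

# Hand-picked personal pronouns; an assumption, not an exhaustive list.
PRONOUNS = {"i", "me", "we", "us", "you", "he", "him", "she", "her",
            "it", "they", "them"}

def sentences_with_pronouns(paragraph):
    """Return the sentences that contain at least one listed pronoun."""
    # Naive split: end punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [s for s in sentences
            if PRONOUNS & set(re.findall(r"[a-z]+", s.lower()))]

text = "This is a sentence with no pronouns. Take it or leave it."
# Join the selected sentences into a new paragraph.
print(" ".join(sentences_with_pronouns(text)))
# Take it or leave it.
```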

Python script to find word frequencies of a given document

I am looking for a simple script that can find frequencies of words for a given document (probably by using portable stemmer).
Is there any library or simple script that does this process?
Use NLTK:
import nltk
YOUR_STRING = "Your words"
words = YOUR_STRING.split()
freq_dist = nltk.FreqDist(words)
# most_common() returns (word, count) pairs sorted by descending count
ranked = [word for word, count in freq_dist.most_common()]
#50 most frequent
most_frequent = ranked[:50]
#50 least frequent
least_frequent = ranked[-50:]
You should be able to count words. Use a collections.Counter or a dict, depending on what you need. That part is easy, but if it isn't you can find the answer by searching on SO itself.
I think you also want the Porter Stemmer, which has a Python version at http://tartarus.org/~martin/PorterStemmer/python.txt
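For the counting part, a minimal sketch with collections.Counter might look like the following. The stem parameter is a hypothetical hook: by default it leaves words unchanged, and any stemming function that maps a word to its stem could be passed in its place.

```python
import re
from collections import Counter

def word_frequencies(text, stem=lambda w: w):
    """Count words in text, optionally normalizing each one with stem."""
    # Simple tokenization: lowercase runs of letters.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(stem(w) for w in words)

freqs = word_frequencies("The cat saw the other cat. The end.")
print(freqs.most_common(2))
# [('the', 3), ('cat', 2)]
```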

How to check if a word is an English word with Python?

I want to check in a Python program if a word is in the English dictionary.
I believe nltk wordnet interface might be the way to go but I have no clue how to use it for such a simple task.
def is_english_word(word):
    pass  # how do I implement is_english_word?
is_english_word(token.lower())
In the future, I might want to check if the singular form of a word is in the dictionary (e.g., properties -> property -> english word). How would I achieve that?
For (much) more power and flexibility, use a dedicated spellchecking library like PyEnchant. There's a tutorial, or you could just dive straight in:
>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check("Hello")
True
>>> d.check("Helo")
False
>>> d.suggest("Helo")
['He lo', 'He-lo', 'Hello', 'Helot', 'Help', 'Halo', 'Hell', 'Held', 'Helm', 'Hero', "He'll"]
>>>
PyEnchant comes with a few dictionaries (en_GB, en_US, de_DE, fr_FR), but can use any of the OpenOffice ones if you want more languages.
There appears to be a pluralisation library called inflect, but I've no idea whether it's any good.
It won't work well with WordNet, because WordNet does not contain all English words.
Another possibility based on NLTK without enchant is NLTK's words corpus:
>>> from nltk.corpus import words
>>> "would" in words.words()
True
>>> "could" in words.words()
True
>>> "should" in words.words()
True
>>> "I" in words.words()
True
>>> "you" in words.words()
True
Using NLTK:
from nltk.corpus import wordnet
if not wordnet.synsets(word_to_test):
    print("Not an English word")
else:
    print("English word")
You should refer to this article if you have trouble installing wordnet or want to try other approaches.
Using a set to store the word list because looking them up will be faster:
with open("english_words.txt") as word_file:
    english_words = set(word.strip().lower() for word in word_file)
def is_english_word(word):
    return word.lower() in english_words
print(is_english_word("ham"))  # should be True if you have a good english_words.txt
To answer the second part of the question, the plurals would already be in a good word list, but if you wanted to specifically exclude those from the list for some reason, you could indeed write a function to handle it. But English pluralization rules are tricky enough that I'd just include the plurals in the word list to begin with.
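If the word list does not contain plurals, one rough workaround is to try a few candidate singular forms before giving up. The sketch below covers only three common suffix patterns and is an assumption-laden illustration, not a full set of English pluralization rules; the names singular_candidates and the small vocab set are hypothetical.

```python
def singular_candidates(word):
    """Return the word plus a few naive guesses at its singular form."""
    word = word.lower()
    candidates = [word]
    if word.endswith("ies"):
        candidates.append(word[:-3] + "y")   # properties -> property
    if word.endswith("es"):
        candidates.append(word[:-2])         # boxes -> box
    if word.endswith("s"):
        candidates.append(word[:-1])         # cats -> cat
    return candidates

def is_english_word(word, english_words):
    """True if the word or one of its guessed singulars is in the set."""
    return any(c in english_words for c in singular_candidates(word))

vocab = {"property", "box", "cat"}            # tiny illustrative word set
print(is_english_word("properties", vocab))  # True
print(is_english_word("boxes", vocab))       # True
print(is_english_word("zzz", vocab))         # False
```

Irregular plurals such as "mice" or "children" are not handled, which is one more reason to prefer a word list that already includes plurals.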
As to where to find English word lists, I found several just by Googling "English word list". Here is one: http://www.sil.org/linguistics/wordlists/english/wordlist/wordsEn.txt You could Google for British or American English if you want specifically one of those dialects.
For All Linux/Unix Users
If your OS uses the Linux kernel, there is a simple way to get all the words from the English/American dictionary. In the directory /usr/share/dict you have a words file. There are also more specific american-english and british-english files. These contain all of the words in that specific language. You can access these files from any programming language, which is why I thought you might want to know about this.
Now, for Python users specifically, the code below reads every word in that file into the set words:
import re
with open("/usr/share/dict/words", "r") as file:
    # a set makes each membership test a hash lookup
    words = set(re.sub(r"[^\w]", " ", file.read()).split())
def is_word(word):
    return word.lower() in words
is_word("tarts") ## Returns True
is_word("jwiefjiojrfiorj") ## Returns False
Hope this helps!
For a faster NLTK-based solution you could hash the set of words to avoid a linear search.
from nltk.corpus import words as nltk_words
# Build the lookup table once, outside the function,
# because you only need to do it once.
dictionary = dict.fromkeys(nltk_words.words(), None)
def is_english_word(word):
    # dict membership is a hash lookup, not a linear search
    return word in dictionary
I find that there are three package-based solutions to the problem: pyenchant, wordnet, and a corpus (self-defined or from nltk). Pyenchant couldn't be installed easily on win64 with py3. Wordnet doesn't work very well because its corpus isn't complete. So I chose the solution answered by @Sadik, and use 'set(words.words())' to speed it up.
First:
pip3 install nltk
python3
import nltk
nltk.download('words')
Then:
from nltk.corpus import words
setofwords = set(words.words())
print("hello" in setofwords)
>>True
With pyEnchant.checker SpellChecker:
from enchant.checker import SpellChecker
def is_in_english(quote):
    d = SpellChecker("en_US")
    d.set_text(quote)
    errors = [err.word for err in d]
    # not English if there are more than 4 misspellings or fewer than 3 words
    return not (len(errors) > 4 or len(quote.split()) < 3)
print(is_in_english('“办理美国加州州立大学圣贝纳迪诺分校高仿成绩单Q/V2166384296加州州立大学圣贝纳迪诺分校学历学位认证'))
print(is_in_english('“Two things are infinite: the universe and human stupidity; and I\'m not sure about the universe.”'))
> False
> True
For a semantic web approach, you could run a sparql query against WordNet in RDF format. Basically just use urllib module to issue GET request and return results in JSON format, parse using python 'json' module. If it's not English word you'll get no results.
As another idea, you could query Wiktionary's API.
Use nltk.corpus instead of enchant. Enchant gives ambiguous results. For example, enchant returns True for both benchmark and bench-mark. It is supposed to return False for benchmark.
Download this txt file https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt
then create a set out of it using the following Python snippet, which loads about 370k English words made up of alphabetic characters only:
>>> with open("/PATH/TO/words_alpha.txt") as f:
...     words = set(f.read().split('\n'))
>>> len(words)
370106
From here onwards, you can check for existence in constant time using
>>> word_to_check = 'baboon'
>>> word_to_check in words
True
Note that this set might not be comprehensive, but it still gets the job done. Users should run quality checks to make sure it works for their use case as well.
