Python script to find word frequencies of a given document - python

I am looking for a simple script that can find the frequencies of words in a given document (probably by using a Porter stemmer).
Is there any library or simple script that does this process?

Use NLTK:
import nltk

YOUR_STRING = "Your words"
words = YOUR_STRING.split()
freq_dist = nltk.FreqDist(words)

# 50 most frequent (word, count) pairs
most_frequent = freq_dist.most_common(50)
# 50 least frequent (word, count) pairs
least_frequent = freq_dist.most_common()[-50:]

Counting the words is the easy part: use a collections.Counter or a plain dict, depending on what you need. If that part turns out not to be easy, you can find the answer by searching on SO itself.
I think you also want the Porter Stemmer, which has a Python version at http://tartarus.org/~martin/PorterStemmer/python.txt
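A minimal sketch of putting those two pieces together, assuming NLTK's built-in PorterStemmer is an acceptable stand-in for the standalone script linked above:
import collections
from nltk.stem import PorterStemmer

text = "Your words go here"
stemmer = PorterStemmer()
stems = [stemmer.stem(w) for w in text.lower().split()]
freqs = collections.Counter(stems)
print(freqs.most_common(50))  # up to 50 most frequent stems with their counts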

Related

Meaningless Spacy Nouns

I am using Spacy for extracting nouns from sentences. These sentences are grammatically poor and may contain some spelling mistakes as well.
Here is the code that I am using:
Code
import spacy
import re

nlp = spacy.load("en_core_web_sm")
sentence = "HANDBRAKE - slow and fast (SFX)"
string = sentence.lower()
cleanString = re.sub(r'\W+', ' ', string)
cleanString = cleanString.replace("_", " ")
doc = nlp(cleanString)
for token in doc:
    if token.pos_ == "NOUN":
        print(token.text)
Output:
sfx
Similarly for sentence "fast foward2", I get Spacy noun as
foward2
This shows that the extracted nouns include meaningless tokens like sfx, foward2, ms, 64x, bit, pwm, r, brailledisplayfastmovement, etc.
I only want to keep phrases that contain sensible single-word nouns like broom, ticker, pool, highway, etc.
I have tried using WordNet to keep only the nouns common to WordNet and the spaCy output, but it is a bit strict and filters out some sensible nouns as well. For example, it filters out nouns like motorbike, whoosh, trolley, metal, suitcase, zip, etc.
Therefore, I am looking for a solution that keeps the sensible nouns in the spaCy noun list I have obtained and filters out the rest.
It seems you can use the pyenchant library:
Enchant is used to check the spelling of words and suggest corrections for words that are misspelled. It can use many popular spellchecking packages to perform this task, including ispell, aspell and MySpell. It is quite flexible at handling multiple dictionaries and multiple languages.
More information is available on the Enchant website:
https://abiword.github.io/enchant/
Sample Python code:
import spacy, re
import enchant  # pip install pyenchant

d = enchant.Dict("en_US")
nlp = spacy.load("en_core_web_sm")
sentence = "For example, it filters nouns like motorbike, whoosh, trolley, metal, suitcase, zip etc"
cleanString = re.sub(r'[\W_]+', ' ', sentence.lower())  # merging \W and _ into one regex
doc = nlp(cleanString)
for token in doc:
    if token.pos_ == "NOUN" and d.check(token.text):
        print(token.text)
# => example, nouns, motorbike, whoosh, trolley, metal, suitcase, zip
While using the pyenchant spellchecker, I have found it useful to do the check after converting the word fully to uppercase. Splitting the sentence/phrase and feeding the words one at a time also gives better results.
Example:
enchantChecker.check("landsat".upper()) and enchantChecker.check("wwii".upper()) return True, whereas using the lowercase words returns False.
If you need to mix more than one spellchecker, another good option is to check the spaCy library's is_oov (out of vocabulary) flag after loading en_core_web_lg.
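A rough sketch of combining those two checks (the helper name is mine, and it assumes en_core_web_lg has been downloaded):
import enchant  # pip install pyenchant
import spacy    # python -m spacy download en_core_web_lg

d = enchant.Dict("en_US")
nlp = spacy.load("en_core_web_lg")

def looks_like_a_word(word):
    # Try enchant on the fully uppercased word first, then fall back to
    # spaCy's vocabulary: is_oov is True for out-of-vocabulary tokens.
    if d.check(word.upper()):
        return True
    return not nlp(word)[0].is_oov

for w in ["landsat", "wwii", "foward2", "brailledisplayfastmovement"]:
    print(w, looks_like_a_word(w))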

detect English words and nltk's words corpus

Just trying to see if a word is English or not. This:
english_words = set(nltk.corpus.words.words())
print("revised" in english_words)
results in False. Am I doing something wrong? Is this to be expected? Are there better ways of doing this? Thanks.
It seems that "revised" indeed is not in the wordlist:
import nltk

english_words = set(nltk.corpus.words.words())
for w in english_words:
    if w.startswith("revise"):
        print(w)
prints the following list:
reviser
revise
revisee
revisership
Based on this source, section 4.1, this is where the word list originates from:
The Words Corpus is the /usr/share/dict/words file from Unix
So you'll have to decide for your use case if the provided word list from NLTK is enough or if you want to switch to a more complete (and bigger) one.
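One hedged way to get a more complete set without leaving NLTK (my suggestion, not part of the answer above) is to union the Words Corpus with WordNet's lemma names, which include inflected forms such as the adjective "revised":
import nltk
from nltk.corpus import words, wordnet

english_words = set(w.lower() for w in words.words())
english_words |= set(l.lower() for l in wordnet.all_lemma_names())
print("revised" in english_words)  # should now print True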
Try this:
from nltk.corpus import wordnet

if not wordnet.synsets(word_to_test):
    print("Not an English word")
else:
    print("English word")

NLP process for combining common collocations

I have a corpus that I'm using the tm package on in R (and also mirroring the same script in NLTK in Python). I'm working with unigrams, but would like a parser of some kind to combine words that commonly co-occur so they are treated as if they were one word, i.e., I'd like to stop seeing "New" and "York" separately in my data set when they occur together, and instead see this particular pair represented as "New York" as if it were a single word, alongside the other unigrams.
What is this process called, of putting meaningful, common n-grams on the same footing as unigrams? Is it not a thing? Finally, what would the tm_map look like for this?
mydata.corpus <- tm_map(mydata.corpus, fancyfunction)
And/or in python?
I recently had a similar question and played around with collocations.
This was the solution I chose to identify pairs of collocated words:
from nltk import word_tokenize
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

text = ...  # a long text read in as a string
tokenized_text = word_tokenize(text)
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokenized_text)
scored = finder.score_ngrams(bigram_measures.raw_freq)
scored = sorted(scored, key=lambda s: s[1], reverse=True)
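To actually fold the discovered pairs back into the token stream (this step is my addition, not part of the answer above), NLTK's MWETokenizer can rewrite each chosen bigram as a single token:
from nltk.tokenize import MWETokenizer

# Merge the top-scoring pairs from `scored` above, so that e.g.
# ("New", "York") shows up as the single token "New_York".
top_pairs = [pair for pair, score in scored[:50]]
mwe_tokenizer = MWETokenizer(top_pairs, separator="_")
merged_tokens = mwe_tokenizer.tokenize(tokenized_text)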

Sort words by their usage

I have a list of english words (approx 10000) and I'd like to sort them by their usage as they occur in literature, newspaper, blogs etc. Can I sort them in Python or other language? I heard about NLTK which is the closest library I know that could help. Or is this task for other tool?
thank you
Python and NLTK are the perfect tools to sort your word list, as NLTK comes with several corpora of the English language from which you can extract frequency information.
The following code will print a given wordlist in the order of word frequency in the brown corpus:
import nltk
from nltk.corpus import brown

wordlist = ["corpus", "house", "the", "Peter", "asdf"]
# collect frequency information from the brown corpus; this might take a few seconds
freqs = nltk.FreqDist([w.lower() for w in brown.words()])
# sort the wordlist by word frequency
wordlist_sorted = sorted(wordlist, key=lambda x: freqs[x.lower()], reverse=True)
# print the sorted list
for w in wordlist_sorted:
    print(w)
output:
>>>
the
house
Peter
corpus
asdf
If you want to use a different corpus or get more information you should have a look at chapter 2 of the nltk book.
You can use collections.Counter. The code is then as easy as:
import collections

l = get_iterable_or_list_of_words()  # that is up to you
c = collections.Counter(l)
print(c.most_common())
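If you already have a reference text that you trust as a usage sample, a minimal sketch of using such a Counter to sort your own word list (the file name here is hypothetical):
import collections

# Hypothetical reference corpus; substitute any large text you trust.
with open("reference_corpus.txt") as f:
    counts = collections.Counter(f.read().lower().split())

wordlist = ["corpus", "house", "the", "Peter", "asdf"]
wordlist_sorted = sorted(wordlist, key=lambda w: counts[w.lower()], reverse=True)
print(wordlist_sorted)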
I don't know much about natural language processing, but I think Python is an ideal language for you to use for the purpose.
A Google search for "Python natural language" found:
http://www.nltk.org/
A search of StackOverflow found this answer:
Python or Java for text processing (text mining, information retrieval, natural language processing)
Which in turn linked to Pattern:
http://www.clips.ua.ac.be/pages/pattern
You might want to take a look at Pattern, which seems promising.
Good luck and have fun!

How to check if a word is an English word with Python?

I want to check in a Python program if a word is in the English dictionary.
I believe nltk wordnet interface might be the way to go but I have no clue how to use it for such a simple task.
def is_english_word(word):
    pass  # how do I implement is_english_word?

is_english_word(token.lower())
In the future, I might want to check if the singular form of a word is in the dictionary (e.g., properties -> property -> English word). How would I achieve that?
For (much) more power and flexibility, use a dedicated spellchecking library like PyEnchant. There's a tutorial, or you could just dive straight in:
>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check("Hello")
True
>>> d.check("Helo")
False
>>> d.suggest("Helo")
['He lo', 'He-lo', 'Hello', 'Helot', 'Help', 'Halo', 'Hell', 'Held', 'Helm', 'Hero', "He'll"]
>>>
PyEnchant comes with a few dictionaries (en_GB, en_US, de_DE, fr_FR), but can use any of the OpenOffice ones if you want more languages.
There appears to be a pluralisation library called inflect, but I've no idea whether it's any good.
It won't work well with WordNet, because WordNet does not contain all English words.
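A small hedged sketch of the singular-form check from the question, assuming the inflect package mentioned above (pip install inflect) together with the enchant dictionary from the snippet before it:
import enchant
import inflect

d = enchant.Dict("en_US")
p = inflect.engine()

def is_english_word_or_plural(word):
    if d.check(word):
        return True
    singular = p.singular_noun(word)  # returns False if `word` is not a plural noun
    return bool(singular) and d.check(singular)

print(is_english_word_or_plural("properties"))  # True, via properties -> property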
Another possibility based on NLTK without enchant is NLTK's words corpus
>>> from nltk.corpus import words
>>> "would" in words.words()
True
>>> "could" in words.words()
True
>>> "should" in words.words()
True
>>> "I" in words.words()
True
>>> "you" in words.words()
True
Using NLTK:
from nltk.corpus import wordnet
if not wordnet.synsets(word_to_test):
    print("Not an English word")
else:
    print("English word")
You should refer to this article if you have trouble installing wordnet or want to try other approaches.
Using a set to store the word list, because looking words up in a set will be faster:
with open("english_words.txt") as word_file:
    english_words = set(word.strip().lower() for word in word_file)

def is_english_word(word):
    return word.lower() in english_words

print(is_english_word("ham"))  # should be True if you have a good english_words.txt
To answer the second part of the question, the plurals would already be in a good word list, but if you wanted to specifically exclude those from the list for some reason, you could indeed write a function to handle it. But English pluralization rules are tricky enough that I'd just include the plurals in the word list to begin with.
As to where to find English word lists, I found several just by Googling "English word list". Here is one: http://www.sil.org/linguistics/wordlists/english/wordlist/wordsEn.txt You could Google for British or American English if you want specifically one of those dialects.
For All Linux/Unix Users
If your OS uses the Linux kernel, there is a simple way to get all the words from the English/American dictionary. In the directory /usr/share/dict you have a words file. There are also more specific american-english and british-english files, which contain all of the words in that specific dialect. You can access this from every programming language, which is why I thought you might want to know about it.
Now, for Python users specifically, the code below assigns the list words the value of every single word:
import re

with open("/usr/share/dict/words", "r") as word_file:
    words = re.sub(r"[^\w]", " ", word_file.read()).split()

def is_word(word):
    return word.lower() in words

is_word("tarts")            ## Returns True
is_word("jwiefjiojrfiorj")  ## Returns False
Hope this helps!
For a faster NLTK-based solution you could hash the set of words to avoid a linear search.
from nltk.corpus import words as nltk_words

def is_english_word(word):
    # creation of this dictionary would be done outside of
    # the function because you only need to do it once
    dictionary = dict.fromkeys(nltk_words.words(), None)
    try:
        x = dictionary[word]
        return True
    except KeyError:
        return False
I find that there are 3 package-based solutions to this problem: pyenchant, wordnet, and a corpus (self-defined or from NLTK). Pyenchant couldn't be installed easily on win64 with py3, and WordNet doesn't work very well because its corpus isn't complete. So for me, I chose the solution answered by #Sadik and use set(words.words()) to speed it up.
First:
pip3 install nltk
python3
import nltk
nltk.download('words')
Then:
from nltk.corpus import words
setofwords = set(words.words())
print("hello" in setofwords)
>>True
With pyEnchant.checker SpellChecker:
from enchant.checker import SpellChecker

def is_in_english(quote):
    d = SpellChecker("en_US")
    d.set_text(quote)
    errors = [err.word for err in d]
    return False if ((len(errors) > 4) or len(quote.split()) < 3) else True
print(is_in_english('“办理美国加州州立大学圣贝纳迪诺分校高仿成绩单Q/V2166384296加州州立大学圣贝纳迪诺分校学历学位认证'))
print(is_in_english('“Two things are infinite: the universe and human stupidity; and I\'m not sure about the universe.”'))
> False
> True
For a semantic web approach, you could run a SPARQL query against WordNet in RDF format. Basically, just use the urllib module to issue a GET request and return the results in JSON format, then parse them with Python's json module. If it's not an English word, you'll get no results.
As another idea, you could query Wiktionary's API.
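A rough sketch of the Wiktionary idea (my approximation, not a definitive recipe; note that a page existing on Wiktionary does not guarantee the word is English, since Wiktionary covers many languages):
import json
import urllib.parse
import urllib.request

def in_wiktionary(word):
    # Ask the MediaWiki API whether a page with this exact title exists.
    url = ("https://en.wiktionary.org/w/api.php?action=query&format=json&titles="
           + urllib.parse.quote(word))
    with urllib.request.urlopen(url) as resp:
        pages = json.load(resp)["query"]["pages"]
    return "-1" not in pages  # MediaWiki reports missing pages under the id "-1"

print(in_wiktionary("hello"))        # expected True
print(in_wiktionary("qwxzzyblorp"))  # expected False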
Use nltk.corpus instead of Enchant. Enchant gives ambiguous results. For example:
for both benchmark and bench-mark, Enchant returns True, when it is supposed to return False for benchmark.
Download this txt file https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt
then create a set out of it using the following Python code snippet, which loads about 370k alphabetic English words:
>>> with open("/PATH/TO/words_alpha.txt") as f:
...     words = set(f.read().split('\n'))
>>> len(words)
370106
From here onwards, you can check for existence in constant time using
>>> word_to_check = 'baboon'
>>> word_to_check in words
True
Note that this set might not be comprehensive, but it still gets the job done; you should do your own quality checks to make sure it works for your use case as well.
