Sum up word frequency counts using FreqDist - python

How can I sum up the frequency counts of words using fd.items() from FreqDist?
>>> fd = FreqDist(text)
>>> most_freq_w = fd.keys()[:10]  # gives me the 10 most frequent words in the text
>>> # here I should sum up the number of times each of these 10 frequent words appears in the text
e.g. if each word in most_freq_w appears 10 times, the result should be 100
Note: I don't need the count for all words in the text, just for the 10 most frequent.

I'm not familiar with nltk, but since FreqDist derives from dict, the following should work:
v = sorted(fd.values())   # counts in ascending order
count = sum(v[-10:])      # sum of the 10 largest counts
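
Alternatively, in current NLTK versions FreqDist supports most_common (it subclasses collections.Counter), so a minimal sketch along these lines should also work:
from nltk import FreqDist

fd = FreqDist(text)                       # text: your tokens
top10 = fd.most_common(10)                # [(word, count), ...] for the 10 most frequent words
count = sum(freq for _, freq in top10)    # total occurrences of those 10 words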

To find the number of times a word appears in the corpus (your piece of text):
import nltk
from nltk import FreqDist

raw = "<your file>"
tokens = nltk.word_tokenize(raw)
fd = FreqDist(tokens)
print(fd['<your word here>'])

FreqDist also has a pretty-print feature:
fd.pprint()
will do it.

Since FreqDist is a mapping of words to their frequencies:
sum(map(fd.get, most_freq_w))

Related

word in words.words() check too slow and inaccurate in Python

I have a dataset consisting of two columns: one is a Myers-Briggs personality type and the other contains the last 50 tweets of that person. I have tokenized the text, removed the URLs and the stop words, and lemmatized the words.
I then create a collections.Counter of the most common words and check whether they are valid English words with nltk.
The problem is that checking whether a word exists in the corpus vocabulary takes too much time, and I also think that a lot of words are missing from this vocabulary. This is my code:
import nltk
import collections
from nltk.corpus import words
# nltk.download("words")

# Creating a frequency Counter of all the words
frequency_counter = collections.Counter(df.posts.explode())
sorted_common_words = sorted(frequency_counter.items(), key=lambda pair: -pair[1])

words_lst = []
for i in range(len(sorted_common_words)):
    if sorted_common_words[i][1] > 1000:
        words_lst.append(sorted_common_words[i][0])

valid_words = [word for word in words_lst if word in words.words()]
invalid_words = [word for word in words_lst if word not in words.words()]
My problem is that the invalid_words list contains some valid English words, such as:
f*ck
changed
surprised
girlfriend
avatar
anymore
And some more, of course. Even checking manually whether those words exist in words.words() returns False. I initially tried to stem my text, but that produced word roots which didn't look right, which is why I decided to lemmatize them instead.
Is there a Python library which has all the stemmed versions of English words? I guess this would speed up my script significantly.
My original dataframe is around 9,000 rows, with a bit more than 5M tokenized words and around 110,000 unique words after cleaning the dataset. words.words() contains 236,736 words, so checking whether those 110,000 words are within words.words() will take too much time. I have checked, and validating 1,000 words takes approximately a minute. This is mainly because Python runs on only one core, so I cannot parallelize the operation across all available cores.
I would suggest this solution:
# your code as it was before
words_lst = []
for i in range(len(sorted_common_words)):
    if sorted_common_words[i][1] > 1000:
        words_lst.append(sorted_common_words[i][0])

import numpy as np

words_arr = np.array(words_lst, dtype=str)
words_dictionary = np.array(words.words(), dtype=str)
mask_valid_words = np.in1d(words_arr, words_dictionary)
valid_words = words_arr[mask_valid_words]
invalid_words = words_arr[~mask_valid_words]
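
A plain-Python variant of the same idea (my own sketch, not part of the answer above) is to build the vocabulary lookup once as a set, whose membership test is O(1) on average, instead of scanning the long list returned by words.words() on every check:
from nltk.corpus import words

# build the lookup once; "w in english_vocab" is a hash lookup,
# while "w in words.words()" scans a long list (and may re-read the corpus) each time
english_vocab = set(words.words())

valid_words = [w for w in words_lst if w in english_vocab]
invalid_words = [w for w in words_lst if w not in english_vocab]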

How to determine the number of negation words per sentence

I would like to know how to count how many negation words (no, not) and contractions (n't) there are in each sentence and in the whole text.
To count the number of sentences I am applying the following:
df["sent"] = df['text'].str.count(r'[\w][\.!\?]')
However, this gives me the count of sentences in a text. I would need the number of negation words per sentence and within the whole text.
Can you please give me some tips?
The expected output for the text column is shown below:
text                                  sent  count_n_s  count_tot
I haven't tried it yet                   1      1          1
I do not like it. What do you think?     2      0.5        1
It's marvellous!!!                       1      0          0
No, I prefer the other one.              2      1          1
count_n_s is given by counting the total number of negation words per sentence, then dividing by the number of sentences.
I tried
split_w = re.split(r"\w+", df['text'])
neg_words = ['no', 'not', "n't"]
words = [w for i, w in enumerate(split_w) if i and (split_w[i-1] in neg_words)]
This would get a count of total negations in the text (not for individual sentences):
import re

NEG = r"""(?:^(?:no|not)$)|n't"""
NEG_RE = re.compile(NEG, re.VERBOSE)

def get_count(text):
    count = 0
    for word in text:
        if NEG_RE.search(word):
            count += 1
    return count

df['text_list'] = df['text'].apply(lambda x: x.split())
df['count'] = df['text_list'].apply(lambda x: get_count(x))
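As a quick check, applying it to the sample rows from the question could look like this (a small usage sketch of my own; it assumes pandas imported as pd and the get_count defined above):
import pandas as pd

df = pd.DataFrame({'text': ["I haven't tried it yet",
                            "I do not like it. What do you think?",
                            "It's marvellous!!!",
                            "No, I prefer the other one."]})
df['text_list'] = df['text'].apply(lambda x: x.split())
df['count'] = df['text_list'].apply(get_count)
print(df[['text', 'count']])
# "haven't" and "not" are matched; "No," is missed here because of the capital
# letter and the trailing comma, which the next answer addresses.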
To get the count of negations for individual lines, use the code below. For words like haven't, you can add them to neg_words, since after the punctuation stripping below a word containing n't no longer ends in n't and would otherwise not be counted.
import re

str1 = '''I haven't tried it yet
I do not like it. What do you think?
It's marvellous!!!
No, I prefer the other one.'''

neg_words = ['no', 'not', "n't"]
for text in str1.split('\n'):
    split_w = re.split(r"\s", text.lower())
    # to get rid of special characters such as the comma in 'No,' use the search below
    split_w = [re.search(r'^\w+', w).group(0) for w in split_w]
    words = [w for w in split_w if w in neg_words]
    print(len(words))
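
Putting the pieces together into the columns from the expected output, one rough combination (my own sketch, not from either answer; it reuses the question's sentence regex and a slightly relaxed negation pattern so that "No," is counted) could be:
import re

NEG_RELAXED_RE = re.compile(r"(?:^(?:no|not)\W*$)|n't")

def count_negations(text):
    # lower-case so "No," matches, and allow trailing punctuation after no/not
    return sum(1 for w in text.lower().split() if NEG_RELAXED_RE.search(w))

# sentence count from the question's regex; treat a text without terminal
# punctuation as one sentence so the division below stays defined
df['sent'] = df['text'].str.count(r'[\w][\.!\?]').clip(lower=1)
df['count_tot'] = df['text'].apply(count_negations)
df['count_n_s'] = df['count_tot'] / df['sent']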

What is the most efficient way to calculate the distance of a word with the other words in a list?

I am working on correcting Turkish words by using Levenshtein distance. First of all I detect wrongly written words and compare them with a list that contains all Turkish words. The list contains about 1,300,000 words. I use Levenshtein distance to compare each word with the words in the list. Here is the relevant part of my code:
index_to_track_document_order = 1
log_text = ''
main_directory = "C:\\words.txt"
f = codecs.open(main_directory, mode="rb", encoding="utf-8")
f = f.readlines()
similarity = 0
text_to_find = 'aktarıları'
best_fit_word = ''
for line in f:
    word = word_tokenize(line, language='turkish')[0]
    root = word_tokenize(line, language='turkish')[1]
    new_similarity = textdistance.levenshtein.normalized_similarity(text_to_find, word) * 100
    if new_similarity > similarity:
        similarity = new_similarity
        best_fit_word = word
        if similarity > 90:
            print(best_fit_word, str(similarity))
As I mentioned, words.txt contains more than a million records, so my code takes more than 5 minutes to complete. How can I optimize the code so that it completes in a shorter time? Thank you.
Index your words by length. Most similar words are of the same length, or one or two lengths apart. The word cat (length 3) is similar to the word can (length 3), but it won't be very similar to caterpillar (length 11), so there is no reason to compute the Levenshtein distance between two words with a big difference in length. In total you save a lot of comparisons, because you only compare words of nearly the same length.
# creating a dictionary of words keyed by length
word_dict = {}
for word in f:
    word_length = len(word)
    if word_length in word_dict:
        word_dict[word_length].append(word)
    else:
        word_dict[word_length] = [word]

# now let's compare words with nearly the same length as our text_to_find
target_length = len(text_to_find)
x = 2  # the maximum length difference we'd like to allow
similarity = 0
best_fit_word = ''
for i in range(target_length - x, target_length + x + 1):
    if i in word_dict:
        # loop through all the words of that given length
        for word in word_dict[i]:
            new_similarity = textdistance.levenshtein.normalized_similarity(text_to_find, word) * 100
            if new_similarity > similarity:
                similarity = new_similarity
                best_fit_word = word
                if similarity > 90:
                    print(best_fit_word, str(similarity))
Note: word_dict only needs to be built once. You can save it as a pickle if necessary.
Also, I did not test the code, but the general idea should be clear. One could even extend the length difference dynamically if no sufficiently similar word has been found yet.
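Saving and reloading the index could look roughly like this (a minimal sketch of the pickle suggestion; 'word_dict.pkl' is just a placeholder file name):
import pickle

# build word_dict once, then persist it ...
with open('word_dict.pkl', 'wb') as fh:
    pickle.dump(word_dict, fh)

# ... and on later runs load it instead of rebuilding
with open('word_dict.pkl', 'rb') as fh:
    word_dict = pickle.load(fh)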
Every time you write
similarity = new_similarity
the old new_similarity is kept; you are simply copying its value into similarity.
Using yield would return a generator, which does not store all the values in memory but generates them on the fly.

Python NLTK: Count list of word and make probability with valid English words

I have a dirty document which includes invalid English words, numbers, etc.
I just want to take all valid English words and then calculate the ratio of my list of words to the total number of valid English words.
For example, if my document has the sentence:
sentence= ['eishgkej he might be a good person. I might consider this.']
I want to count only the valid part, "he might be a good person. I might consider this", and count "might".
So I should get the answer 2/10.
I am thinking about using the code below. However, I need the count of each feature rather than the line features[word] = 1...
all_words = nltk.FreqDist(w.lower() for w in reader.words() if w.lower() not in english_sw)

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        if word in document_words:
            features[word] = 1
        else:
            features[word] = 0
    return features
According to the documentation you can use count(self, sample) to return the count of a word in a FreqDist object. So I think you want something like:
for word in word_features:
    if word in document_words:
        features[word] = all_words.count(word)
    else:
        features[word] = 0
Or you could use indexing, i.e. all_words[word] should return the same as all_words.count(word)
If you want the frequency of the word you can do all_words.freq(word)
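For the ratio described in the question (occurrences of "might" over the number of valid words), one rough way to combine these pieces is to keep only tokens that appear in an English word list and divide the target word's count by the number of valid tokens (my own sketch; it assumes nltk's words corpus as the validity check, which the question does not specify):
import nltk
from nltk.corpus import words

english_vocab = set(w.lower() for w in words.words())

sentence = "eishgkej he might be a good person. I might consider this."
tokens = nltk.word_tokenize(sentence)
valid_tokens = [t.lower() for t in tokens if t.lower() in english_vocab]

fd = nltk.FreqDist(valid_tokens)
ratio = fd['might'] / len(valid_tokens)   # count of "might" over the number of valid tokens
print(ratio)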

Some problems with NLTK

I'm pretty new to Python and NLTK but I had a question.
I was writing something to extract only words longer than 7 characters from a self-made corpus, but it turns out that it extracts every word...
Anyone know what I did wrong?
loc="C:\Users\Dell\Desktop\CORPUS"
Corpus= CategorizedPlaintextCorpusReader(loc,'(?!\.svn).*\.txt, cat_pattern=r '(Shakespeare|Milton)/.*)
def long_words(corpus)
for cat in corpus.categories():
fileids=corpus.fileids(categories=cat)
words=corpus.words(fileids)
long_tokens=[]
words2=set(words)
if len(words2) >=7:
long_tokens.append(words2)
Print long_tokens
Thanks everyone!
Replace
if len(words2) >= 7:
    long_tokens.append(words2)
with:
long_tokens += [w for w in words2 if len(w) >= 7]
Explanation: you were appending the whole set of words (tokens) produced by corpus.words(fileids) whenever the number of distinct words was at least 7 (which is presumably always the case for your corpus). What you really wanted was to filter out the words shorter than 7 characters from the token set and append the remaining long words to long_tokens.
Your function should return the result - tokens having 7 characters or more. I assume the way you create and deal with CategorizedPlaintextCorpusReader is OK:
loc="C:\Users\Dell\Desktop\CORPUS"
Corpus= CategorizedPlaintextCorpusReader(loc,'(?!\.svn).*\.txt, cat_pattern=r'(Shakespeare|Milton)/.*)
def long_words(corpus = Corpus):
long_tokens=[]
for cat in corpus.categories():
fileids = corpus.fileids(categories=cat)
words = corpus.words(fileids)
long_tokens += [w for w in set(words) if len(w) >= 7]
return set(long_tokens)
print "\n".join(long_words())
Here is an answer to the question you asked in the comments:
for loc in ['cat1', 'cat2']:
    reader = CategorizedPlaintextCorpusReader(loc, r'(?!\.svn).*\.txt', cat_pattern=r'(Shakespeare|Milton)/.*')
    print(len(long_words(corpus=reader)), 'words over 7 in', loc)
