I am currently running the code below to search for bigrams across my entire text.
The variable alltext is a very long string (over 1 million words).
I ran this code to extract the bigrams:
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.collocations import BigramCollocationFinder

tokenizer = RegexpTokenizer(r'[A-Za-z]{2,}')  # note: [A-Za-z], not [A-za-z]
tokens = tokenizer.tokenize(alltext)

# build the stopword list once, as a set, so membership tests stay fast
stopwords_list = set(stopwords.words('english'))
tokens = [word for word in tokens if word not in stopwords_list]

finder = BigramCollocationFinder.from_words(tokens, window_size=2)
bigram_measures = nltk.collocations.BigramAssocMeasures()
for k, v in finder.ngram_fd.items():
    print k, v
The code above finds the frequency of occurrence of every possible bigram.
It prints lots of bigrams along with their number of occurrences.
The output is similar to this:
(('upper', 'front'), 1)
(('pad', 'Teething'), 1)
(('shoulder', 'strap'), 1)
(('outer', 'breathable'), 1)
(('memory', 'foam'), 1)
(('shields', 'inner'), 1)
(('The', 'garment'), 2)
......
type(finder.ngram_fd.items()) is a list.
How can I sort the bigrams from the highest to the lowest number of occurrences? My desired result would be:
(('The', 'garment'), 2)
(('upper', 'front'), 1)
(('pad', 'Teething'), 1)
(('shoulder', 'strap'), 1)
(('outer', 'breathable'), 1)
(('memory', 'foam'), 1)
(('shields', 'inner'), 1)
Thank you very much. I am quite new to nltk and text processing, so my explanation may not be very clear.
It looks like finder.ngram_fd is a dictionary. In that case, in Python 3 the items() method does not return a list, so you'll have to cast it to one.
Once you have a list, you can simply use the key= parameter of the sort() method, which specifies what we're sorting against:
ngram = list(finder.ngram_fd.items())
ngram.sort(key=lambda item: item[-1], reverse=True)
You have to add reverse=True because otherwise the results would be in ascending order. Note that this will sort the list in place. This is best when you want to avoid copying. If instead you wish to obtain a new list, just use the sorted() built-in function with the same arguments.
Alternatively, you can replace the lambda with operator.itemgetter() from the operator module, which does the same thing (remember to import operator first):
ngram.sort(key=operator.itemgetter(-1), reverse=True)
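For completeness, here is a minimal sketch of the non-destructive sorted() variant mentioned above, building on the finder from the question; it should print the bigrams in the desired order:
import operator
# build a new, sorted list instead of sorting in place
sorted_ngrams = sorted(finder.ngram_fd.items(), key=operator.itemgetter(1), reverse=True)
for bigram, count in sorted_ngrams:
    print(bigram, count)
# the ('The', 'garment') pair, with its count of 2, should now come out first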
I have some doubts regarding n-grams.
Specifically, I would like to extract 2-grams, 3-grams and 4-grams from the following column:
Sentences
For each topic, we will explore the words occuring in that topic and its relative weight.
We will check where our test document would be classified.
For each document we create a dictionary reporting how many
words and how many times those words appear.
Save this to ‘bow_corpus’, then check our selected document earlier.
To do this, I used the following function:
def n_grams(lines , min_length=2, max_length=4):
    lenghts=range(min_length,max_length+1)
    ngrams={length:collections.Counter() for length in lengths)
    queue= collection.deque(maxlen=max_length)
but it does not work, since I get None as output.
Can you please tell me what is wrong in the code?
Your ngrams dictionary has empty Counter() objects because you don't pass anything to count. There are also a few other problems:
Function names can't include - in Python.
collection.deque is invalid; I think you wanted to call collections.deque().
I think there are better options to fix your code than using the collections library. Two of them are as follows:
You might fix your function using a list comprehension:
def n_grams(lines, min_length=2, max_length=4):
    tokens = lines.split()
    ngrams = dict()
    for n in range(min_length, max_length + 1):
        # slide a window of length n over the token list
        ngrams[n] = [tokens[i:i + n] for i in range(len(tokens) - n + 1)]
    return ngrams
Or you might use nltk, which supports tokenization and n-grams natively:
from nltk import ngrams
from nltk.tokenize import word_tokenize
def n_grams(lines, min_length=2, max_length=4):
    tokens = word_tokenize(lines)
    # use a different name for the result so it does not shadow nltk's ngrams()
    result = {n: list(ngrams(tokens, n)) for n in range(min_length, max_length + 1)}
    return result
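As a quick sanity check, a usage sketch for the nltk variant above, applied to one of the sentences from the question (this assumes the punkt tokenizer data has already been downloaded):
sample = "We will check where our test document would be classified."
grams = n_grams(sample)
print(grams[2][:3])
# expected something like [('We', 'will'), ('will', 'check'), ('check', 'where')]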
I know it is possible to find bigrams which contain a particular word, as in the example from the link below:
finder = BigramCollocationFinder.from_words(text.split())
word_filter = lambda w1, w2: "man" not in (w1, w2)
finder.apply_ngram_filter(word_filter)
bigram_measures = nltk.collocations.BigramAssocMeasures()
raw_freq_ranking = finder.nbest(bigram_measures.raw_freq, 10) #top-10
nltk: how to get bigrams containing a specific word
But I am not sure how this can be applied when I need bigrams whose two words are both pre-defined.
Example:
My sentence: "hello, yesterday I have seen a man walking. On the other side there was another man yelling: 'who are you, man?'"
Given the list ["yesterday", "other", "I", "side"],
how can I get a list of bigrams made up only of the given words, i.e.:
[("yesterday", "I"), ("other", "side")]?
What you want is probably a word_filter function that returns False only if all the words in a particular bigram are part of the list:
def word_filter(x, y):
    if x in lst and y in lst:
        return False
    return True
where lst = ["yesterday", "I", "other", "side"]
Note that this function accesses lst from the outer scope, which is a dangerous thing, so make sure you don't make any changes to lst within the word_filter function.
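Putting the pieces together, a minimal sketch that combines the filter above with the finder from the linked example, using the sentence and list from the question (variable names are only illustrative):
from nltk.collocations import BigramCollocationFinder
text = "hello, yesterday I have seen a man walking. On the other side there was another man yelling: who are you, man?"
lst = ["yesterday", "I", "other", "side"]
def word_filter(x, y):
    # apply_ngram_filter drops bigrams for which this returns True,
    # so returning False keeps a bigram only when both words are in lst
    if x in lst and y in lst:
        return False
    return True
finder = BigramCollocationFinder.from_words(text.split())
finder.apply_ngram_filter(word_filter)
print(list(finder.ngram_fd.items()))
# should leave roughly [(('yesterday', 'I'), 1), (('other', 'side'), 1)]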
First, you can create all possible bigrams from your vocabulary and feed them as the vocabulary of a CountVectorizer, which can transform your given text into bigram counts.
Then, you filter the generated bigrams based on the counts given by CountVectorizer.
Note: I have changed the token pattern so that it also accepts single-character tokens; by default, CountVectorizer skips them.
from sklearn.feature_extraction.text import CountVectorizer
import itertools
corpus = ["hello, yesterday I have seen a man walking. On the other side there was another man yelling: who are you, man?"]
unigrams = ["yesterday", "other", "I", "side"]
# every candidate pair of the given words, joined into a lower-cased bigram string
bi_grams = [' '.join(bi_gram).lower() for bi_gram in itertools.combinations(unigrams, 2)]
vectorizer = CountVectorizer(vocabulary=bi_grams, ngram_range=(2, 2), token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(corpus)
# keep only the vocabulary bigrams that actually occur in the corpus
print([word for count, word in zip(X.sum(0).tolist()[0], vectorizer.get_feature_names()) if count])
output:
['yesterday i', 'other side']
This approach works better when you have many documents and relatively few words in the vocabulary. If it is the other way around, you can find all the bigrams in the document first and then filter them using your vocabulary, as sketched below.
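A rough sketch of that second direction, using nltk instead of scikit-learn; the variable names are just for illustration:
import itertools
from nltk import bigrams
from nltk.tokenize import word_tokenize
corpus_text = "hello, yesterday I have seen a man walking. On the other side there was another man yelling: who are you, man?"
unigrams = ["yesterday", "other", "I", "side"]
# treat each candidate pair as unordered, so ('I', 'yesterday') would also match
wanted = {frozenset(pair) for pair in itertools.combinations(unigrams, 2)}
found = [pair for pair in bigrams(word_tokenize(corpus_text)) if frozenset(pair) in wanted]
print(found)
# expected something like [('yesterday', 'I'), ('other', 'side')]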
I'm looking for faster alternatives to NLTK to analyze big corpora and do basic things like calculating frequencies, PoS tagging, etc. spaCy seems great and easy to use in many ways, but I can't find any built-in function to count the frequency of a specific word, for example. I've looked at the spaCy documentation, but I can't find a straightforward way to do it. Am I missing something?
What I would like would be the NLTK equivalent of:
tokens.count("word") #where tokens is the tokenized text in which the word is to be counted
In NLTK, the above code would tell me that in my text, the word "word" appears X number of times.
Note that I've come across the count_by function, but it doesn't seem to do what I'm looking for.
I use spaCy for frequency counts in corpora quite often. This is what I usually do:
import spacy
nlp = spacy.load("en_core_web_sm")
list_of_words = ['run', 'jump', 'catch']
def word_count(string):
    words_counted = 0
    my_string = nlp(string)
    for token in my_string:
        # actual word
        word = token.text
        # lemma
        lemma_word = token.lemma_
        # part of speech
        word_pos = token.pos_
        if lemma_word in list_of_words:
            words_counted += 1
            print(lemma_word)
    return words_counted
sentence = "I ran, jumped, and caught the ball."
words_counted = word_count(sentence)
print(words_counted)
Python's standard library includes collections.Counter for this kind of purpose. You haven't said whether an approach outside spaCy suits your case.
from collections import Counter
text = "Lorem Ipsum is simply dummy text of the ...."
freq = Counter(text.split())
print(freq)
>>> Counter({'the': 6, 'Lorem': 4, 'of': 4, 'Ipsum': 3, 'dummy': 2 ...})
print(freq['Lorem'])
>>> 4
Alright, just to give a time reference, I used this script:
import random, timeit
from collections import Counter
def loadWords():
    with open('corpora.txt', 'w') as corpora:
        randWords = ['foo', 'bar', 'life', 'car', 'wrong',
                     'right', 'left', 'plain', 'random', 'the']
        for i in range(100000000):
            corpora.write(randWords[random.randint(0, 9)] + " ")
def countWords():
    with open('corpora.txt', 'r') as corpora:
        content = corpora.read()
        myDict = Counter(content.split())
        print("foo: ", myDict['foo'])
print(timeit.timeit(loadWords, number=1))
print(timeit.timeit(countWords, number=1))
Results:
149.01646934738716
foo: 9998872
18.093295297389773
Still, I am not sure whether this is fast enough for you.
Adding this answer since this is the page I found when searching for this specific problem. I find it an easier solution than the ones provided before, and it only uses spaCy.
As you mentioned, the spaCy Doc object has the built-in method Doc.count_by. From what I understand of your question, it does what you ask for, although that is not obvious.
It counts the occurrences of a given attribute and returns a dictionary with the attribute's hash (an integer) as the key and the count as the value.
Solution
First of all, we need to import ORTH from spacy.attrs. ORTH is the exact verbatim text of a token. We also need to load the model and provide a text.
import spacy
from spacy.attrs import ORTH
nlp = spacy.load("en_core_web_sm")
doc = nlp("apple apple orange banana")
Then we create a dictionary of word counts
count_dict = doc.count_by(ORTH)
You could count by other attributes like LEMMA, just import the attribute you wish to use.
If we look at the dictionary, we will see that it contains the hash of the lexeme and the word count.
count_dict
Results:
{8566208034543834098: 2, 2208928596161743350: 1, 2525716904149915114: 1}
We can get the text for the word if we look up the hash in the vocab.
nlp.vocab.strings[8566208034543834098]
Returns
'apple'
With this we can create a simple function that takes the search word and a count dict created with the Doc.count_by method.
def get_word_count(word, count_dict):
    return count_dict[nlp.vocab.strings[word]]
If we run the function with our search word 'apple' and the count dict we created earlier
get_word_count('apple', count_dict)
We get:
2
https://spacy.io/api/doc#count_by
How can I ignore words like 'a' and 'the' when counting the frequency of words in a text?
import collections
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
df = pd.DataFrame({'phrase': pd.Series('The large distance between cities. The small distance. The')})
f = CountVectorizer().build_tokenizer()(str(df['phrase']))
result = collections.Counter(f).most_common(1)
print result
The answer is The, but I would like to get distance as the most frequent word.
It would be best to avoid counting these entries to begin with, like so:
ignore = {'the','a','if','in','it','of','or'}
result = collections.Counter(x for x in f if x not in ignore).most_common(1)
Another option is to use the stop_words parameter of CountVectorizer.
These are words that you are not interested in and will be discarded by the analyzer.
f = CountVectorizer(stop_words={'the','a','if','in','it','of','or'}).build_analyzer()(str(df['phrase']))
result = collections.Counter(f).most_common(1)
print result
[(u'distance', 1)]
Note that the tokenizer does not perform preprocessing (lowercasing, accent-stripping) or remove stop words, so you need to use the analyzer here.
You can also use stop_words='english' to automatically remove English stop words (see sklearn.feature_extraction.text.ENGLISH_STOP_WORDS for the full list).
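For example, a sketch reusing the question's setup with the built-in list (the exact counts depend on how the Series is converted to a string):
f = CountVectorizer(stop_words='english').build_analyzer()(str(df['phrase']))
result = collections.Counter(f).most_common(1)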
I have a data set as follows:
"485","AlterNet","Statistics","Estimation","Narnia","Two and half men"
"717","I like Sheen", "Narnia", "Statistics", "Estimation"
"633","MachineLearning","AI","I like Cars, but I also like bikes"
"717","I like Sheen","MachineLearning", "regression", "AI"
"136","MachineLearning","AI","TopGear"
and so on
I want to find the most frequently occurring word pairs, e.g.
(Statistics,Estimation:2)
(Statistics,Narnia:2)
(Narnia,Statistics)
(MachineLearning,AI:3)
The two words could be in any order and at any distance from each other.
Can someone suggest a possible solution in Python? This is a very large data set.
Any suggestion is highly appreciated.
So this is what I tried after the suggestions from #275365, with the input read from a file:
def collect_pairs(file):
    pair_counter = Counter()
    for line in open(file):
        unique_tokens = sorted(set(line))
        combos = combinations(unique_tokens, 2)
        pair_counter += Counter(combos)
    print pair_counter
file = ('myfileComb.txt')
p=collect_pairs(file)
The text file has the same number of lines as the original one, but each line contains only unique tokens. When I run this, it splits the words into letters instead of producing combinations of words. I don't know where I am making a mistake.
You might start with something like this, depending on how large your corpus is:
>>> from itertools import combinations
>>> from collections import Counter
>>> def collect_pairs(lines):
        pair_counter = Counter()
        for line in lines:
            unique_tokens = sorted(set(line))  # exclude duplicates in same line and sort to ensure one word is always before other
            combos = combinations(unique_tokens, 2)
            pair_counter += Counter(combos)
        return pair_counter
The result:
>>> t2 = [['485', 'AlterNet', 'Statistics', 'Estimation', 'Narnia', 'Two and half men'], ['717', 'I like Sheen', 'Narnia', 'Statistics', 'Estimation'], ['633', 'MachineLearning', 'AI', 'I like Cars, but I also like bikes'], ['717', 'I like Sheen', 'MachineLearning', 'regression', 'AI'], ['136', 'MachineLearning', 'AI', 'TopGear']]
>>> pairs = collect_pairs(t2)
>>> pairs.most_common(3)
[(('MachineLearning', 'AI'), 3), (('717', 'I like Sheen'), 2), (('Statistics', 'Estimation'), 2)]
Do you want numbers included in these combinations or not? Since you didn't specifically mention excluding them, I have included them here.
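If you would rather exclude them, one option (a small tweak, not part of the original function) is to filter out purely numeric tokens before building the combinations:
# inside collect_pairs, replace the unique_tokens line with:
unique_tokens = sorted(tok for tok in set(line) if not tok.isdigit())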
EDIT: Working with a file object
The function that you posted as your first attempt above is very close to working. The only thing you need to do is change each line (which is a string) into a tuple or list. Assuming your data looks exactly like the data you posted above (with quotation marks around each term and commas separating the terms), I would suggest a simple fix: you can use ast.literal_eval. (Otherwise, you might need to use a regular expression of some kind.) See below for a modified version with ast.literal_eval:
from itertools import combinations
from collections import Counter
import ast
def collect_pairs(file_name):
    pair_counter = Counter()
    for line in open(file_name):  # these lines are each simply one long string; you need a list or tuple
        unique_tokens = sorted(set(ast.literal_eval(line)))  # literal_eval will convert each line into a tuple before converting the tuple to a set
        combos = combinations(unique_tokens, 2)
        pair_counter += Counter(combos)
    return pair_counter  # return the actual Counter object
Now you can test it like this:
file_name = 'myfileComb.txt'
p = collect_pairs(file_name)
print p.most_common(10) # for example
There is not that much you can do except count all pairs.
Obvious optimizations are to remove duplicate words and synonyms early, to perform stemming (anything that reduces the number of distinct tokens is good!), and to only count pairs (a, b) where a < b (in your example, count either (statistics, narnia) or (narnia, statistics), but not both!).
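As a tiny illustration of that last point, a small helper one might use to normalize a pair before counting it (not from the original answer):
def canonical(a, b):
    # ('Narnia', 'Statistics') and ('Statistics', 'Narnia') map to the same key
    return (a, b) if a < b else (b, a)
# e.g. pair_counter[canonical(w1, w2)] += 1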
If you run out of memory, perform two passes. In the first pass, use one or multiple hash functions to obtain a candidate filter. In the second pass, only count words that pass this filter (MinHash / LSH style filtering).
It's an embarrassingly parallel problem, so it is also easy to distribute to multiple threads or computers.