I'm trying to work with RegEx to make a dictionary with keys that are the bigrams from a text file, and whose values are the number of occurrences of those bigrams in the text.
I've got this code that gets me the bigrams. It's not perfect: the bigrams should look like "hello, world", "world, full", "full, of", "of, wonderful", "wonderful, things", but in my printout they come out in a different order, so I'm not sure I did it right.
I am not sure how to take the bigrams these expressions produce and put them into a dictionary whose keys are the bigrams and whose values are their counts in the original text file. Any help is greatly appreciated.
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
with open('/Users/adamstark/PycharmProjects/Computational_Methods_Course/Assignments/Final/Jules_Verne_From_the_Earth_to_Moon.txt') as file:
    txt1 = file.readlines()
# Getting bigrams
txt1 = [remove_string_special_characters(s) for s in txt1]
vectorizer = CountVectorizer(ngram_range = (2,2))
X1 = vectorizer.fit_transform(txt1)
features = (vectorizer.get_feature_names())
print("\n\nFeatures : \n", features)
print("\n\nX1 : \n", X1.toarray())
Related: I know it is possible to find bigrams which contain a particular word, from the example in the link below:
finder = BigramCollocationFinder.from_words(text.split())
word_filter = lambda w1, w2: "man" not in (w1, w2)
finder.apply_ngram_filter(word_filter)
bigram_measures = nltk.collocations.BigramAssocMeasures()
raw_freq_ranking = finder.nbest(bigram_measures.raw_freq, 10) #top-10
nltk: how to get bigrams containing a specific word
But I am not sure how this can be applied if I need bigrams where both words are pre-defined.
Example:
My Sentence: "hello, yesterday I have seen a man walking. On the other side there was another man yelling: "who are you, man?"
Given a list:["yesterday", "other", "I", "side"]
How can I get a list of bigrams containing the given words, i.e.:
[("yesterday", "I"), ("other", "side")]?
What you want is probably a word_filter function that returns False only if all the words in a particular bigram are part of the list:
def word_filter(x, y):
    if x in lst and y in lst:
        return False
    return True
where lst = ["yesterday", "I", "other", "side"]
Note that this function accesses lst from the outer scope, which is a dangerous thing, so make sure you don't make any changes to lst inside the word_filter function.
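Putting it together with the BigramCollocationFinder snippet quoted in the question, a minimal end-to-end sketch using the example sentence and list above:
import nltk
from nltk.collocations import BigramCollocationFinder

text = 'hello, yesterday I have seen a man walking. On the other side there was another man yelling: "who are you, man?"'
lst = ["yesterday", "I", "other", "side"]

def word_filter(x, y):
    # apply_ngram_filter drops a bigram when the filter returns True,
    # so return False (keep it) only when both words are in lst
    if x in lst and y in lst:
        return False
    return True

finder = BigramCollocationFinder.from_words(text.split())
finder.apply_ngram_filter(word_filter)
bigram_measures = nltk.collocations.BigramAssocMeasures()
print(finder.nbest(bigram_measures.raw_freq, 10))   # expected to include ('yesterday', 'I') and ('other', 'side')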
First you can create all possible bigrams from your vocabulary and feed them as the vocabulary of a CountVectorizer, which can then transform your given text into bigram counts.
Then you filter the generated bigrams based on the counts given by the CountVectorizer.
Note: I have changed the token pattern so that single-character tokens are also counted; by default they are skipped.
from sklearn.feature_extraction.text import CountVectorizer
import itertools
corpus = ["hello, yesterday I have seen a man walking. On the other side there was another man yelling: who are you, man?"]
unigrams=["yesterday", "other", "I", "side"]
bi_grams=[' '.join(bi_gram).lower() for bi_gram in itertools.combinations(unigrams, 2)]
vectorizer = CountVectorizer(vocabulary=bi_grams,ngram_range=(2,2),token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(corpus)
print([word for count,word in zip(X.sum(0).tolist()[0],vectorizer.get_feature_names()) if count])
output:
['yesterday i', 'other side']
This approach works better when you have many documents and only a few words in the vocabulary. If it's the other way around, you can find all the bigrams in the document first and then filter them using your vocabulary, as sketched below.
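A sketch of that other direction: let CountVectorizer find every bigram in the document first, then keep only those built from the given words. Note that itertools.permutations is used here instead of combinations, so both orderings (e.g. "yesterday i" and "i yesterday") would be accepted:
from sklearn.feature_extraction.text import CountVectorizer
import itertools

corpus = ["hello, yesterday I have seen a man walking. On the other side there was another man yelling: who are you, man?"]
unigrams = ["yesterday", "other", "I", "side"]
wanted = {' '.join(p).lower() for p in itertools.permutations(unigrams, 2)}

vectorizer = CountVectorizer(ngram_range=(2, 2), token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(corpus)
counts = X.sum(0).tolist()[0]
print([bg for bg, c in zip(vectorizer.get_feature_names(), counts) if c and bg in wanted])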
I would like to find the most relevant words over a set of documents.
I would like to call a Tf Idf algorithm over 3 documents and return a csv file containing each word and its frequency.
After that, I will take only the ones with a high number and I will use them.
I found this implementation that does what I need https://github.com/mccurdyc/tf-idf/.
I call that jar using the subprocess library, but there is a huge problem in that code: it makes a lot of mistakes when analyzing words. It mixes some words together, and it seems to have problems with ' and -. I am using it over the text of 3 books (Harry Potter) and, for example, I am getting words such as hermiones, hermionell, riddlehermione, and thinghermione instead of just hermione in the csv file.
Am I doing something wrong? Can you give me a working implementation of the tf-idf algorithm? Is there a Python library that does that?
Here is an implementation of the Tf-idf algorithm using scikit-learn.
Before applying it, you can word_tokenize() and stem your words.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer
def tokenize(text):
    tokens = word_tokenize(text)
    stems = []
    for item in tokens:
        stems.append(PorterStemmer().stem(item))
    return stems
# your corpus
text = ["This is your first text book", "This is the third text for analysis", "This is another text"]
# word tokenize and stem
text = [" ".join(tokenize(txt.lower())) for txt in text]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(text).todense()
# transform the matrix to a pandas df
matrix = pd.DataFrame(matrix, columns=vectorizer.get_feature_names())
# sum over each document (axis=0)
top_words = matrix.sum(axis=0).sort_values(ascending=False)
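Since the question asks for a csv file: top_words here is a pandas Series of summed tf-idf weights (not raw frequencies), indexed by word and sorted highest first, so it can be written out directly. The file name below is just an example:
top_words.to_csv("tfidf_scores.csv", header=["tfidf_sum"])   # writes word,score pairs, highest first
print(top_words.head(10))   # quick look at the top-scoring words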
I have a text document, and I am using regex and nltk to find the top 5 most common words in it. I have to print out the sentences these words belong to; how do I do that? Further, I want to extend this to finding common words across multiple documents and returning their respective sentences.
import nltk
import collections
from collections import Counter
import re
import string
frequency = {}
document_text = open('test.txt', 'r')
text_string = document_text.read().lower()
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string) #return all the words with the number of characters in the range [3-15]
fdist = nltk.FreqDist(match_pattern) # creates a frequency distribution from a list
most_common = fdist.max() # returns a single element
top_five = fdist.most_common(5)# returns a list
list_5=[word for (word, freq) in fdist.most_common(5)]
print(top_five)
print(list_5)
Output:
[('you', 8), ('tuples', 8), ('the', 5), ('are', 5), ('pard', 5)]
['you', 'tuples', 'the', 'are', 'pard']
The output is the most commonly occurring words. I have to print the sentences these words belong to; how do I do that?
Although it doesn't account for special characters at word boundaries like your code does, the following would be a starting point:
for sentence in text_string.split('.'):
    if set(list_5) & set(sentence.split(' ')):
        print(sentence)
We first iterate over the sentences, assuming each sentence ends with a . and the . character is nowhere else in the text. Afterwards, we print the sentence if the intersection of its set of words with the set of words in your list_5 is not empty.
You will have to install NLTK Data if you haven't already done so.
From http://www.nltk.org/data.html:
Run the Python interpreter and type the commands:
>>> import nltk
>>> nltk.download()
A new window should open, showing the NLTK Downloader. Click on the File menu and select Change Download Directory.
Then install the punkt model from the models tab.
Once you have that you can tokenize all sentences and extract the ones with your top 5 words in them as such:
sent_tokenize_list = nltk.sent_tokenize(text_string)
for sentence in sent_tokenize_list:
    for word in list_5:
        if word in sentence:
            print(sentence)
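The loop above prints a sentence once for every matching word, so a sentence containing two of the top-5 words is printed twice. If you only want each sentence once, a small variant:
for sentence in sent_tokenize_list:
    if any(word in sentence for word in list_5):
        print(sentence)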
How can I ignore some words like 'a' and 'the' when counting word frequencies in a text?
import collections
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
df = pd.DataFrame({'phrase': pd.Series('The large distance between cities. The small distance. The')})
f = CountVectorizer().build_tokenizer()(str(df['phrase']))
result = collections.Counter(f).most_common(1)
print result
The answer will be The. But I would like to get distance as the most frequent word.
It would be best to avoid counting these entries in the first place, like so:
ignore = {'the','a','if','in','it','of','or'}
result = collections.Counter(x for x in f if x not in ignore).most_common(1)
Another option is to use the stop_words parameter of CountVectorizer.
These are words that you are not interested in and will be discarded by the analyzer.
f = CountVectorizer(stop_words={'the','a','if','in','it','of','or'}).build_analyzer()(str(df['phrase']))
result = collections.Counter(f).most_common(1)
print result
[(u'distance', 1)]
Note that the tokenizer does not perform preprocessing (lowercasing, accent-stripping) or remove stop words, so you need to use the analyzer here.
You can also use stop_words='english' to automatically remove english stop words (see sklearn.feature_extraction.text.ENGLISH_STOP_WORDS for the full list).
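For example, following the same pattern as above (a sketch):
f = CountVectorizer(stop_words='english').build_analyzer()(str(df['phrase']))
result = collections.Counter(f).most_common(1)
print result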
I'm trying to write a script that will look through my corpus which contains 93,000 txt files and find the frequency distributions of the trigrams present across all of them (so not separate frequency distributions but one frequency distribution for the entire corpus). I've gotten it to do the frequency distributions for a single file in the corpus but don't have the skills at all to get any further. Here's the code:
import nltk
import re
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk import FreqDist
corpus_root = '/Users/jolijttamanaha/Python/CRspeeches'
speeches = PlaintextCorpusReader(corpus_root, '.*\.txt')
print "Finished importing corpus"
f = speeches.open('Mr. THOMPSON of Pennsylvania.2010-12-07.2014sep17_at_233337.txt')
raw = f.read()
tokens = nltk.word_tokenize(raw)
tgs = nltk.trigrams(tokens)
fdist = nltk.FreqDist(tgs)
for k, v in fdist.items():
    print k, v
Thank you in advance for your help.
Once you define your speeches corpus with PlaintextCorpusReader as you have, you can get trigrams for the entire corpus very simply:
fdist = nltk.FreqDist(nltk.trigrams(speeches.words()))
But this has an undesirable glitch: it forms trigrams that span from the end of one file to the beginning of the next. Such trigrams do not represent tokens that could follow each other in a text; they are completely accidental. What you really want is to combine the trigram counts from each individual file, which you can get like this:
fdist = nltk.FreqDist()  # empty distribution
for filename in speeches.fileids():
    fdist.update(nltk.trigrams(speeches.words(filename)))
Your fdist now contains the cumulative statistics, which you can examine in the various available ways. E.g.,
fdist.tabulate(10)
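or, assuming NLTK 3 (where FreqDist is a Counter subclass), list the most frequent trigrams with their counts:
for trigram, count in fdist.most_common(10):
    print trigram, count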
For the pre-packaged corpora API, instead of using corpus.raw(), you can also try corpus.words(), e.g.:
>>> from nltk.util import ngrams
>>> from nltk.corpus import brown
>>> brown.words()
[u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]
>>> trigrams = ngrams(brown.words(), 3)
>>> for i in trigrams:
... print i
As @alexis pointed out, the code above should also work for custom corpora loaded with PlaintextCorpusReader; see http://www.nltk.org/_modules/nltk/corpus/reader/plaintext.html
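To turn that stream of trigrams into counts for the whole corpus (same idea as the per-file loop in the other answer), a minimal sketch, which may take a little while on the full Brown corpus:
>>> from nltk import FreqDist
>>> fdist = FreqDist(ngrams(brown.words(), 3))
>>> fdist.tabulate(5)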