I have a text which has many sentences. How can I use nltk.ngrams to process it?
This is my code:
import nltk
from nltk.util import ngrams

sequence = nltk.tokenize.word_tokenize(raw)  # `raw` holds the full text
bigram = ngrams(sequence, 2)
freq_dist = nltk.FreqDist(bigram)
prob_dist = nltk.MLEProbDist(freq_dist)
number_of_bigrams = freq_dist.N()
However, the above code treats the whole text as one sequence. The sentences are separate, and the last word of one sentence is unrelated to the first word of the next. How can I build bigrams for such a text? I also need prob_dist and number_of_bigrams, which are based on freq_dist.
There are similar questions, like "What are ngram counts and how to implement using nltk?", but they are mostly about a single sequence of words.
You can use the new nltk.lm module. Here's an example, first get some data and tokenize it:
import os
import requests
import io #codecs
from nltk import word_tokenize, sent_tokenize
# Text version of https://kilgarriff.co.uk/Publications/2005-K-lineer.pdf
if os.path.isfile('language-never-random.txt'):
    with io.open('language-never-random.txt', encoding='utf8') as fin:
        text = fin.read()
else:
    url = "https://gist.githubusercontent.com/alvations/53b01e4076573fea47c6057120bb017a/raw/b01ff96a5f76848450e648f35da6497ca9454e4a/language-never-random.txt"
    text = requests.get(url).content.decode('utf8')
    with io.open('language-never-random.txt', 'w', encoding='utf8') as fout:
        fout.write(text)
# Tokenize the text.
tokenized_text = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(text)]
Then the language modelling:
# Preprocess the tokenized text for 3-grams language modelling
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE
n = 3
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)
model = MLE(n)  # Let's train a trigram maximum likelihood estimation model.
model.fit(train_data, padded_sents)
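To sanity-check the fitted model you can look at its vocabulary (a small sketch; model.vocab is the Vocabulary object that nltk.lm builds during fit, and unseen tokens are mapped to <UNK>):
print(len(model.vocab))                       # vocabulary size, including <UNK> and the padding symbols
print(model.vocab.lookup(tokenized_text[0]))  # tokens outside the vocabulary come back as '<UNK>'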
To get the counts:
model.counts['language'] # i.e. Count('language')
model.counts[['language']]['is'] # i.e. Count('is'|'language')
model.counts[['language', 'is']]['never'] # i.e. Count('never'|'language is')
To get the probabilities:
model.score('is', 'language'.split()) # P('is'|'language')
model.score('never', 'language is'.split()) # P('never'|'language is')
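If you specifically want the freq_dist, prob_dist and number_of_bigrams from the question rather than an nltk.lm model, a minimal sketch (assuming raw holds your text) is to count bigrams within each sentence separately, so sentence boundaries are never crossed:
from nltk import FreqDist, MLEProbDist, sent_tokenize, word_tokenize
from nltk.util import ngrams

freq_dist = FreqDist()
for sent in sent_tokenize(raw):
    # Bigrams are built per sentence, so the last word of one sentence
    # is never paired with the first word of the next.
    freq_dist.update(ngrams(word_tokenize(sent), 2))

prob_dist = MLEProbDist(freq_dist)
number_of_bigrams = freq_dist.N()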
There are some kinks on the Kaggle platform when loading the notebook, but this notebook should give a good overview of the nltk.lm module: https://www.kaggle.com/alvations/n-gram-language-model-with-nltk
Related
I am using the Cranfield dataset to build an indexer and query processor. For that purpose I am using TfidfVectorizer to tokenize the data. But when I check the vocabulary after running TfidfVectorizer, there are a lot of tokens formed by concatenating two words.
I am using the following code to achieve it:
import re
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
#reading the data
with open('cran.all', 'r') as f:
    content_string = ""
    content = [line.replace('\n','') for line in f]
    content = content_string.join(content)
doc = re.split('.I\s[0-9]{1,4}', content)
f.close()
#some data cleaning
doc = [line.replace('.T',' ').replace('.B',' ').replace('.A',' ').replace('.W',' ') for line in doc]
del doc[0]
doc= [ re.sub('[^A-Za-z]+', ' ', lines) for lines in doc]
vectorizer = TfidfVectorizer(analyzer ='word', ngram_range=(1,1), stop_words=text.ENGLISH_STOP_WORDS,lowercase=True)
X = vectorizer.fit_transform(doc)
print(vectorizer.vocabulary_)
I have attached below a few examples I obtain when I print vocabulary:
'freevibration': 7222, 'slendersharp': 15197, 'frequentlyapproximated': 7249, 'notapplicable': 11347, 'rateof': 13727, 'itsvalue': 9443, 'speedflow': 15516, 'movingwith': 11001, 'speedsolution': 15531, 'centerof': 3314, 'hypersoniclow': 8230, 'neice': 11145, 'rutkowski': 14444, 'chann': 3381, 'layerapproximations': 9828, 'probsteinhave': 13353, 'thishypersonic': 17752
When I use a small amount of data, this does not happen. How can I prevent it from happening?
It seems that the concatenated words come from the n-gram generation in the TfidfVectorizer: when two words commonly appear together, they can end up as one token. When you set ngram_range=(1,1) the vectorizer only considers single words; when you increase ngram_range it also considers n-grams of words.
You can use a regular expression to filter these out:
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), stop_words=text.ENGLISH_STOP_WORDS, lowercase=True)
X = vectorizer.fit_transform(doc)
# Remove n-grams that have two words concatenated
pattern = r'\b\w+\w\b'
vectorizer.vocabulary_ = {key: val for key, val in vectorizer.vocabulary_.items() if re.match(pattern, key)}
The pattern \b\w+\w\b is meant to match tokens where two words have been concatenated, such as freevibration. The resulting vocabulary_ dictionary will then no longer contain these n-grams.
My guess would be that the issue is caused by this line:
content = [line.replace('\n','') for line in f]
When replacing line breaks, the last word of line 1 is concatenated with the first word of line 2. And of course this happens for every line, so you get a lot of these. The solution is super simple: instead of replacing line breaks with nothing (i.e. just removing them), replace them with a space:
content = [line.replace('\n',' ') for line in f]
(note the space between the quotes)
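For completeness, a sketch of the corrected reading block with that one change applied (assuming the same cran.all layout as in the question):
import re

with open('cran.all', 'r') as f:
    # Joining lines with a space keeps the last word of one line
    # separate from the first word of the next.
    content = ''.join(line.replace('\n', ' ') for line in f)
doc = re.split(r'.I\s[0-9]{1,4}', content)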
I am doing text analysis of Italian text (tokenization, lemmatization) for future use of TF-IDF techniques and for constructing clusters based on that. For preprocessing I use NLTK, and for one text file everything works fine:
import nltk
import string
from nltk.stem.wordnet import WordNetLemmatizer

it_stop_words = nltk.corpus.stopwords.words('italian')
lmtzr = WordNetLemmatizer()

with open('3003.txt', 'r', encoding="latin-1") as myfile:
    data = myfile.read()

word_tokenized_list = nltk.tokenize.word_tokenize(data)
word_tokenized_no_punct = [str.lower(x) for x in word_tokenized_list if x not in string.punctuation]
word_tokenized_no_punct_no_sw = [x for x in word_tokenized_no_punct if x not in it_stop_words]
word_tokenized_no_punct_no_sw_no_apostrophe = [x.split("'") for x in word_tokenized_no_punct_no_sw]
word_tokenized_no_punct_no_sw_no_apostrophe = [y for x in word_tokenized_no_punct_no_sw_no_apostrophe for y in x]
word_tokenize_list_no_punct_lc_no_stowords_lemmatized = [lmtzr.lemmatize(x) for x in word_tokenized_no_punct_no_sw_no_apostrophe]
But the question is that I need to perform the above for a bunch of .txt files in a folder. For that I'm trying to use PlaintextCorpusReader():
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
corpusdir = 'reports/'
newcorpus = PlaintextCorpusReader(corpusdir, '.txt')
Basically I cannot just pass newcorpus into the previous functions because it's an object and not a string. So my questions are:
How should the functions look (or how should I change the existing ones for a single file) to do tokenization and lemmatization for a corpus of files (using PlaintextCorpusReader())?
How would the TF-IDF approach (the standard sklearn approach of vectorizer = TfidfVectorizer()) look with PlaintextCorpusReader()?
Many Thanks!
I think your question can be answered by reading this question, this other one and the TfidfVectorizer docs. For completeness, I summarize the answers below:
First, you want to get the file ids; per the first question, you can get them as follows:
ids = newcorpus.fileids()
Then, based on the second question, you can retrieve the documents' words, sentences or paragraphs:
doc_words = []
doc_sents = []
doc_paras = []
for id_ in ids:
    # Get words
    doc_words.append(newcorpus.words(id_))
    # Get sentences
    doc_sents.append(newcorpus.sents(id_))
    # Get paragraphs
    doc_paras.append(newcorpus.paras(id_))
Now, at the i-th position of doc_words, doc_sents and doc_paras you have all the words, sentences and paragraphs (respectively) of the i-th document in the corpus.
For tf-idf you probably just want the words. Since TfidfVectorizer.fit expects an iterable which yields str, unicode or file objects, you need to either join each document (an array of tokenized words) back into a single string, or use an approach similar to this one. The latter solution uses a dummy tokenizer to deal directly with arrays of words.
You can also pass your own tokenizer to TfidfVectorizer and use PlaintextCorpusReader simply for file reading.
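A minimal sketch of the first option, joining the corpus reader's words back into one string per document (newcorpus is the PlaintextCorpusReader from the question):
from sklearn.feature_extraction.text import TfidfVectorizer

# One string per document, rebuilt from the corpus reader's tokenized words.
docs_as_strings = [' '.join(newcorpus.words(fid)) for fid in newcorpus.fileids()]

vectorizer = TfidfVectorizer(lowercase=True)
X = vectorizer.fit_transform(docs_as_strings)  # shape: (number of documents, number of terms)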
I am trying to fit a Word2Vec model. According to the documentation for Gensim's Word2Vec, we do not need to call model.build_vocab before using it.
Yet it is asking me to do it. I have tried calling this function and it has not worked. I also fitted a Word2Vec model before without needing to call model.build_vocab.
Am I doing something wrong? Here is my code:
import pandas as pd
from gensim.models import Word2Vec

dataset = pd.read_table('genemap_copy.txt', delimiter='\t', lineterminator='\n')
def row_to_sentences(dataframe):
    columns = dataframe.columns.values
    corpus = []
    for index, row in dataframe.iterrows():
        if index == 1000:
            break
        sentence = ''
        for column in columns:
            sentence += ' ' + str(row[column])
        corpus.append([sentence])
    return corpus
corpus = row_to_sentences(dataset)
clean_corpus = [[sentence[0].lower()] for sentence in corpus ]
# model = Word2Vec()
# model.build_vocab(clean_corpus)
model = Word2Vec(clean_corpus, size=100, window=5, min_count=5, workers=4)
Help is greatly appreciated!
Also I am using macOS Sierra.
There is not much support online for using Gensim on a Mac D:
I think my problem was the parameter min_count=5: most of my words were ignored because they did not appear at least 5 times.
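A quick way to check this, assuming the same gensim version as in the question (where Word2Vec still takes size=), is to lower min_count and see whether the vocabulary fills up:
# min_count=1 keeps every token; useful only as a sanity check on small corpora.
model = Word2Vec(clean_corpus, size=100, window=5, min_count=1, workers=4)
print(len(model.wv.vocab))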
Try with LineSentence:
from gensim.models.word2vec import LineSentence
and then train your corpus with
model = Word2Vec(LineSentence(clean_corpus), size=100, window=5, min_count=5, workers=4)
Is it that you are appending a new list containing a single sentence each time (corpus.append([sentence]))? You need to feed Word2Vec a series of sentences, but not necessarily sentences grouped by document. I'm also not clear on what is in your dataframe, but have you tokenised the sentences already?
Here is a generator class I've used before for Word2Vec:
import gensim
from nltk.tokenize import sent_tokenize
from gensim.utils import simple_preprocess

class MySentences(object):
    def __init__(self, docs):
        self.corpus = docs

    def __iter__(self):
        for doc in self.corpus:
            doc_sentences = sent_tokenize(doc)
            for sent in doc_sentences:
                yield simple_preprocess(sent)  # yields a tokenized sentence, e.g. ['like', 'this', 'one']
sentences = MySentences(df['text'].tolist())
model = gensim.models.Word2Vec(sentences, min_count=5, workers=8, size=300, sg=1)
I have been following the website https://radimrehurek.com/gensim/tut2.html. I have come across the error "UnpicklingError was unhandled by user code: invalid load key, '%'". How do I clear that error? I have looked at the other questions and included the klepto package, but the error persists. I am using Anaconda 2. This is the code:
import logging
import xml.etree.cElementTree
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)
import os
import klepto
from gensim import corpora
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]
# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]
# remove words that appear only once
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 1] for text in texts]
from pprint import pprint # pretty-printer
pprint(texts)
dictionary = corpora.Dictionary(texts)
dictionary.save_as_text('/tmp/deerwester.dict') # store the dictionary, for future reference
print(dictionary)
print(dictionary.token2id)
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec) # the word "interaction" does not appear in the dictionary and is ignored
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('/tmp/deerwester.dict', corpus) # store to disk, for later use
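# Note: this writes a Matrix Market file (it begins with '%%MatrixMarket') to
# '/tmp/deerwester.dict', the same path the dictionary was saved to above, so a later
# corpora.Dictionary.load on that path would hit "invalid load key, '%'".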
for c in corpus:
    print(c)
class MyCorpus(object):
    def __iter__(self):
        for line in open('/datasets/mycorpus.txt'):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())
corpus_memory_friendly = MyCorpus() # doesn't load the corpus into memory!
print(corpus_memory_friendly)
for vector in corpus_memory_friendly:  # load one vector into memory at a time
    print(vector)
from six import iteritems
# collect statistics about all tokens
dictionary = corpora.Dictionary(line.lower().split() for line in open('/datasets/mycorpus.txt'))
# remove stop words and words that appear only once
stop_ids = [dictionary.token2id[stopword] for stopword in stoplist
            if stopword in dictionary.token2id]
once_ids = [tokenid for tokenid, docfreq in iteritems(dictionary.dfs) if docfreq == 1]
# remove stop words and words that appear only once
dictionary.filter_tokens(stop_ids + once_ids)
# remove gaps in id sequence after words that were removed
dictionary.compactify()
print(dictionary)
# create a toy corpus of 2 documents, as a plain Python list
corpus = [[(1, 0.5)], []] # make one document empty, for the heck of it
corpora.MmCorpus.serialize('/tmp/corpus.mm', corpus)
corpora.SvmLightCorpus.serialize('/tmp/corpus.svmlight', corpus)
corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)
corpora.LowCorpus.serialize('/tmp/corpus.low', corpus)
corpus = corpora.MmCorpus('/tmp/corpus.mm')
print(corpus)
# one way of printing a corpus: load it entirely into memory
print(list(corpus)) # calling list() will convert any sequence to a plain Python list
# another way of doing it: print one document at a time, making use of the streaming interface
for doc in corpus:
    print(doc)
corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)
import gensim
import numpy as np
numpy_matrix = np.random.randint(10, size=[5,2])
corpus = gensim.matutils.Dense2Corpus(numpy_matrix)
numpy_matrix_dense = gensim.matutils.corpus2dense(corpus, num_terms=10)
import scipy.sparse
scipy_sparse_matrix = scipy.sparse.random(5,2)
corpus = gensim.matutils.Sparse2Corpus(scipy_sparse_matrix)
scipy_csc_matrix = gensim.matutils.corpus2csc(corpus)
from gensim import corpora, models, similarities
if (os.path.exists("/tmp/deerwester.dict")):
    dictionary = corpora.Dictionary.load('/tmp/deerwester.dict')
    corpus = corpora.MmCorpus('/tmp/deerwester.mm')
    print("Used files generated from first tutorial")
else:
    print("Please run first tutorial to generate data set")
tfidf = models.TfidfModel(corpus) # step 1 -- initialize a model
doc_bow = [(0, 1), (1, 1)]
print(tfidf[doc_bow]) # step 2 -- use the model to transform vectors
corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    print(doc)
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2) # initialize an LSI transformation
corpus_lsi = lsi[corpus_tfidf] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi
lsi.print_topics(2)
for doc in corpus_lsi:  # both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
    print(doc)
lsi.save('/tmp/model.lsi') # same for tfidf, lda, ...
lsi = models.LsiModel.load('/tmp/model.lsi')
model = models.TfidfModel(corpus, normalize=True)
model = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=300)
model.add_documents(another_tfidf_corpus) # now LSI has been trained on tfidf_corpus + another_tfidf_corpus
lsi_vec = model[tfidf_vec] # convert some new document into the LSI space, without affecting the model
model.add_documents(more_documents) # tfidf_corpus + another_tfidf_corpus + more_documents
lsi_vec = model[tfidf_vec]
model = models.RpModel(tfidf_corpus, num_topics=500)
model = models.LdaModel(corpus, id2word=dictionary, num_topics=100)
model = models.HdpModel(corpus, id2word=dictionary)
I'm new to Python and need help!
I was practicing with Python NLTK text classification.
Here is the code example I am practicing on:
http://www.laurentluce.com/posts/twitter-sentiment-analysis-using-python-and-nltk/
I've tried this:
from nltk import bigrams
from nltk.probability import ELEProbDist, FreqDist
from nltk import NaiveBayesClassifier
from collections import defaultdict
train_samples = {}
with file('positive.txt', 'rt') as f:
    for line in f.readlines():
        train_samples[line] = 'pos'

with file('negative.txt', 'rt') as d:
    for line in d.readlines():
        train_samples[line] = 'neg'
f=open("test.txt", "r")
test_samples=f.readlines()
def bigramReturner(text):
    tweetString = text.lower()
    bigramFeatureVector = {}
    for item in bigrams(tweetString.split()):
        bigramFeatureVector.append(' '.join(item))
    return bigramFeatureVector
def get_labeled_features(samples):
    word_freqs = {}
    for text, label in train_samples.items():
        tokens = text.split()
        for token in tokens:
            if token not in word_freqs:
                word_freqs[token] = {'pos': 0, 'neg': 0}
            word_freqs[token][label] += 1
    return word_freqs
def get_label_probdist(labeled_features):
    label_fd = FreqDist()
    for item, counts in labeled_features.items():
        for label in ['neg', 'pos']:
            if counts[label] > 0:
                label_fd.inc(label)
    label_probdist = ELEProbDist(label_fd)
    return label_probdist
def get_feature_probdist(labeled_features):
    feature_freqdist = defaultdict(FreqDist)
    feature_values = defaultdict(set)
    num_samples = len(train_samples) / 2
    for token, counts in labeled_features.items():
        for label in ['neg', 'pos']:
            feature_freqdist[label, token].inc(True, count=counts[label])
            feature_freqdist[label, token].inc(None, num_samples - counts[label])
            feature_values[token].add(None)
            feature_values[token].add(True)
    for item in feature_freqdist.items():
        print item[0], item[1]
    feature_probdist = {}
    for ((label, fname), freqdist) in feature_freqdist.items():
        probdist = ELEProbDist(freqdist, bins=len(feature_values[fname]))
        feature_probdist[label, fname] = probdist
    return feature_probdist
labeled_features = get_labeled_features(train_samples)
label_probdist = get_label_probdist(labeled_features)
feature_probdist = get_feature_probdist(labeled_features)
classifier = NaiveBayesClassifier(label_probdist, feature_probdist)
for sample in test_samples:
    print "%s | %s" % (sample, classifier.classify(bigramReturner(sample)))
But I am getting this error. Why?
Traceback (most recent call last):
  File "C:\python\naive_test.py", line 76, in <module>
    print "%s | %s" % (sample, classifier.classify(bigramReturner(sample)))
  File "C:\python\naive_test.py", line 23, in bigramReturner
    bigramFeatureVector.append(' '.join(item))
AttributeError: 'dict' object has no attribute 'append'
A bigram feature vector follows exactly the same principles as a unigram feature vector. So, just like in the tutorial you mentioned, you will have to check whether a bigram feature is present in any of the documents you will use.
As for the bigram features and how to extract them, I have written the code below for it. You can simply adapt it to change the variable "tweets" in the tutorial.
import nltk

text = "Hi, I want to get the bigram list of this string"
for item in nltk.bigrams(text.split()):
    print ' '.join(item)
Instead of printing them you can simply append them to the "tweets" list and you are good to go! I hope this is helpful enough. Otherwise, let me know if you still have problems.
Please note that in applications like sentiment analysis, some researchers tend to tokenize the words and remove the punctuation, and some others don't. From experience I know that if you don't remove punctuation, Naive Bayes works almost the same, whereas an SVM would have a decreased accuracy rate. You might need to play around with this and decide what works better on your dataset.
Edit 1:
There is a book named "Natural Language Processing with Python" which I can recommend to you. It contains examples of bigrams as well as some exercises. However, I think you can solve this case even without it. The idea behind selecting bigrams as features is that we want to know the probability that word A appears in our corpus followed by word B. So, for example, in the sentence
"I drive a truck"
the unigram features would be each of those 4 words, while the bigram features would be:
["I drive", "drive a", "a truck"]
Now you want to use those 3 as your features. The function below puts all bigrams of a string in a list named bigramFeatureVector.
def bigramReturner(tweetString):
    tweetString = tweetString.lower()
    tweetString = removePunctuation(tweetString)
    bigramFeatureVector = []
    for item in nltk.bigrams(tweetString.split()):
        bigramFeatureVector.append(' '.join(item))
    return bigramFeatureVector
Note that you have to write your own removePunctuation function (a minimal sketch is given below). What you get as output of the above function is the bigram feature vector. You will treat it exactly the same way the unigram feature vectors are treated in the tutorial you mentioned.
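A minimal removePunctuation sketch (just one possible implementation; it drops every character found in string.punctuation):
import string

def removePunctuation(text):
    # Keep only characters that are not ASCII punctuation.
    return ''.join(ch for ch in text if ch not in string.punctuation)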