I am using the Cranfield Dataset to build an indexer and query processor. For that purpose I am using TfidfVectorizer to tokenize the data. But when I check the vocabulary after running TfidfVectorizer, there are a lot of tokens formed by the concatenation of two words.
I am using the following code to achieve it:
import re
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
#reading the data
with open('cran.all', 'r') as f:
    content_string=""
    content = [line.replace('\n','') for line in f]
    content = content_string.join(content)
doc=re.split('.I\s[0-9]{1,4}',content)
f.close()
#some data cleaning
doc = [line.replace('.T',' ').replace('.B',' ').replace('.A',' ').replace('.W',' ') for line in doc]
del doc[0]
doc= [ re.sub('[^A-Za-z]+', ' ', lines) for lines in doc]
vectorizer = TfidfVectorizer(analyzer ='word', ngram_range=(1,1), stop_words=text.ENGLISH_STOP_WORDS,lowercase=True)
X = vectorizer.fit_transform(doc)
print(vectorizer.vocabulary_)
I have attached below a few examples I obtain when I print vocabulary:
'freevibration': 7222, 'slendersharp': 15197, 'frequentlyapproximated': 7249, 'notapplicable': 11347, 'rateof': 13727, 'itsvalue': 9443, 'speedflow': 15516, 'movingwith': 11001, 'speedsolution': 15531, 'centerof': 3314, 'hypersoniclow': 8230, 'neice': 11145, 'rutkowski': 14444, 'chann': 3381, 'layerapproximations': 9828, 'probsteinhave': 13353, 'thishypersonic': 17752
This does not happen when I use a small amount of data. How can I prevent it from happening?
This happens because the two words are commonly used together. It seems that the concatenated words result from the n-gram generation in the TfidfVectorizer: when you set ngram_range=(1,1), the vectorizer only considers single words, but when you increase the ngram_range it considers n-grams of words.
You can use regular expression to avoid this.
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), stop_words=text.ENGLISH_STOP_WORDS, lowercase=True)
X = vectorizer.fit_transform(doc)
# Remove n-grams that have two words concatenated
pattern = r'\b\w+\w\b'
vectorizer.vocabulary_ = {key: val for key, val in vectorizer.vocabulary_.items() if re.match(pattern, key)}
The pattern \b\w+\w\b is meant to match n-grams that have two words concatenated, such as freevibration. The resulting vocabulary_ dictionary will not contain these n-grams.
My guess would be that the issue is caused by this line:
content = [line.replace('\n','') for line in f]
When replacing line breaks, the last word of line 1 is concatenated with the first word of line 2. And of course this happens for every line, so you get a lot of these. The solution is super simple: instead of replacing line breaks with nothing (i.e. just removing them), replace them with a space:
content = [line.replace('\n',' ') for line in f]
(note the space between the quotes)
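To see why this fixes it, here is a tiny illustration (a hypothetical two-line snippet, not taken from the Cranfield file):
lines = ["free\n", "vibration\n"]
print("".join(line.replace('\n', '') for line in lines))   # -> 'freevibration'
print("".join(line.replace('\n', ' ') for line in lines))  # -> 'free vibration '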
Related
I have a text which has many sentences. How can I use nltk.ngrams to process it?
This is my code:
import nltk
from nltk.util import ngrams

sequence = nltk.tokenize.word_tokenize(raw)
bigram = ngrams(sequence,2)
freq_dist = nltk.FreqDist(bigram)
prob_dist = nltk.MLEProbDist(freq_dist)
number_of_bigrams = freq_dist.N()
However, the above code assumes that all sentences form one sequence. But the sentences are separate, and I guess the last word of one sentence is unrelated to the first word of the next sentence. How can I create bigrams for such a text? I also need prob_dist and number_of_bigrams, which are based on the freq_dist.
There are similar questions, like "What are ngram counts and how to implement using nltk?", but they are mostly about a single sequence of words.
You can use the new nltk.lm module. Here's an example, first get some data and tokenize it:
import os
import requests
import io #codecs
from nltk import word_tokenize, sent_tokenize
# Text version of https://kilgarriff.co.uk/Publications/2005-K-lineer.pdf
if os.path.isfile('language-never-random.txt'):
    with io.open('language-never-random.txt', encoding='utf8') as fin:
        text = fin.read()
else:
    url = "https://gist.githubusercontent.com/alvations/53b01e4076573fea47c6057120bb017a/raw/b01ff96a5f76848450e648f35da6497ca9454e4a/language-never-random.txt"
    text = requests.get(url).content.decode('utf8')
    with io.open('language-never-random.txt', 'w', encoding='utf8') as fout:
        fout.write(text)
# Tokenize the text.
tokenized_text = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(text)]
Then the language modelling:
# Preprocess the tokenized text for 3-grams language modelling
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE
n = 3
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)
model = MLE(n) # Let's train a 3-gram maximum likelihood estimation model.
model.fit(train_data, padded_sents)
To get the counts:
model.counts['language'] # i.e. Count('language')
model.counts[['language']]['is'] # i.e. Count('is'|'language')
model.counts[['language', 'is']]['never'] # i.e. Count('never'|'language is')
To get the probabilities:
model.score('is', 'language'.split()) # P('is'|'language')
model.score('never', 'language is'.split()) # P('never'|'language is')
There are some kinks on the Kaggle platform when loading the notebook, but at some point this notebook should give a good overview of the nltk.lm module: https://www.kaggle.com/alvations/n-gram-language-model-with-nltk
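If you prefer to stay with the FreqDist and MLEProbDist objects from the question, a minimal sketch (assuming the same text variable loaded above) is to build the bigrams sentence by sentence and then pool them, so that no bigram crosses a sentence boundary:
from itertools import chain
import nltk
from nltk import sent_tokenize, word_tokenize
from nltk.util import ngrams
# Bigrams computed per sentence, then chained together
sentence_bigrams = (ngrams(word_tokenize(sent), 2) for sent in sent_tokenize(text))
freq_dist = nltk.FreqDist(chain.from_iterable(sentence_bigrams))
prob_dist = nltk.MLEProbDist(freq_dist)
number_of_bigrams = freq_dist.N()
Here number_of_bigrams counts bigrams within sentences only, which is what the question asked for.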
I am doing text analysis of Italian text (tokenization, lemmatization) for future use of TF-IDF techniques and for constructing clusters based on them. For preprocessing I use NLTK, and for one text file everything is working fine:
import nltk
import string
from nltk.stem.wordnet import WordNetLemmatizer
it_stop_words = nltk.corpus.stopwords.words('italian')
lmtzr = WordNetLemmatizer()
with open('3003.txt', 'r', encoding="latin-1") as myfile:
    data=myfile.read()
word_tokenized_list = nltk.tokenize.word_tokenize(data)
word_tokenized_no_punct = [str.lower(x) for x in word_tokenized_list if x not in string.punctuation]
word_tokenized_no_punct_no_sw = [x for x in word_tokenized_no_punct if x not in it_stop_words]
word_tokenized_no_punct_no_sw_no_apostrophe = [x.split("'") for x in word_tokenized_no_punct_no_sw]
word_tokenized_no_punct_no_sw_no_apostrophe = [y for x in word_tokenized_no_punct_no_sw_no_apostrophe for y in x]
word_tokenize_list_no_punct_lc_no_stowords_lemmatized = [lmtzr.lemmatize(x) for x in word_tokenized_no_punct_no_sw_no_apostrophe]
But I need to perform the same steps on a bunch of .txt files in a folder. For that I'm trying to use the PlaintextCorpusReader():
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
corpusdir = 'reports/'
newcorpus = PlaintextCorpusReader(corpusdir, '.txt')
Basically I cannot just feed newcorpus into the previous functions because it's an object and not a string. So my questions are:
How should the functions look (or how should I change the existing ones for a single file) to do tokenization and lemmatization for a corpus of files (using PlaintextCorpusReader())?
How would the TF-IDF approach (the standard sklearn vectorizer = TfidfVectorizer()) look with PlaintextCorpusReader()?
Many Thanks!
I think your question can be answered by reading this question, this other one and the TfidfVectorizer docs. For completeness, I wrapped up the answers below:
First, you want to get the file ids; following the first question you can get them as follows:
ids = newcorpus.fileids()
Then, based on the second question, you can retrieve the documents' words, sentences or paragraphs:
doc_words = []
doc_sents = []
doc_paras = []
for id_ in ids:
    # Get words
    doc_words.append(newcorpus.words(id_))
    # Get sentences
    doc_sents.append(newcorpus.sents(id_))
    # Get paragraphs
    doc_paras.append(newcorpus.paras(id_))
Now, on the ith position of doc_words, doc_sents and doc_paras you have all words, sentences and paragraphs (respectively) for every document in the corpus.
For tf-idf you probably just want the words. Since the TfidfVectorizer.fit method takes an iterable which yields str, unicode or file objects, you need to either transform your documents (arrays of tokenized words) into a single string each, or use a similar approach to this one. The latter solution uses a dummy tokenizer to deal directly with arrays of words.
You can also pass your own tokenizer to TfidfVectorizer and use PlaintextCorpusReader simply for file reading.
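As a rough sketch of both options (assuming the newcorpus and ids objects from above; the identity helper is just an illustrative pass-through):
from sklearn.feature_extraction.text import TfidfVectorizer
# Option 1: join each document's tokens back into one string per document
docs_as_strings = [" ".join(newcorpus.words(id_)) for id_ in ids]
vectorizer = TfidfVectorizer(lowercase=True)
X = vectorizer.fit_transform(docs_as_strings)
# Option 2: keep the documents as lists of tokens and pass dummy
# preprocessor/tokenizer functions so the vectorizer does not re-split them
identity = lambda x: x
vectorizer = TfidfVectorizer(tokenizer=identity, preprocessor=identity, lowercase=False)
X = vectorizer.fit_transform([newcorpus.words(id_) for id_ in ids])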
As suggested here, Python Tf idf algorithm, I use this code to get the frequency of words over a set of documents.
import pandas as pd
import csv
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer
import codecs
def tokenize(text):
    tokens = word_tokenize(text)
    stems = []
    for item in tokens:
        stems.append(PorterStemmer().stem(item))
    return stems
with codecs.open("book1.txt",'r','utf-8') as i1,\
codecs.open("book2.txt",'r','utf-8') as i2,\
codecs.open("book3.txt",'r','utf-8') as i3:
# your corpus
t1=i1.read().replace('\n',' ')
t2=i2.read().replace('\n',' ')
t3=i3.read().replace('\n',' ')
text = [t1,t2,t3]
# word tokenize and stem
text = [" ".join(tokenize(txt.lower())) for txt in text]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(text).todense()
# transform the matrix to a pandas df
matrix = pd.DataFrame(matrix, columns=vectorizer.get_feature_names())
# sum over each document (axis=0)
top_words = matrix.sum(axis=0).sort_values(ascending=False)
top_words.to_csv('dict.csv', index=True, float_format="%f",encoding="utf-8")
With the last line, I create a csv file listing all words and their frequencies. Is there a way to label them, to see whether a word belongs only to the third document or to all of them?
My goal is to delete from the csv file all the words that appear only in the 3rd document (book3)
You can use the isin() method to filter the top_words of the third book out of the top_words of the entire corpus.
(For the example below I downloaded three random books from http://www.gutenberg.org/)
import codecs
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
# import nltk
# nltk.download('punkt')
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer
def tokenize(text):
    tokens = word_tokenize(text)
    stems = []
    for item in tokens:
        stems.append(PorterStemmer().stem(item))
    return stems
with codecs.open("56732-0.txt",'r','utf-8') as i1,\
codecs.open("56734-0.txt",'r','utf-8') as i2,\
codecs.open("56736-0.txt",'r','utf-8') as i3:
# your corpus
t1=i1.read().replace('\n',' ')
t2=i2.read().replace('\n',' ')
t3=i3.read().replace('\n',' ')
text = [t1,t2,t3]
# word tokenize and stem
text = [" ".join(tokenize(txt.lower())) for txt in text]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(text).todense()
# transform the matrix to a pandas df
matrix = pd.DataFrame(matrix, columns=vectorizer.get_feature_names())
# sum over each document (axis=0)
top_words = matrix.sum(axis=0).sort_values(ascending=False)
# top_words for the 3rd book alone
text = [" ".join(tokenize(t3.lower()))]
matrix = vectorizer.fit_transform(text).todense()
matrix = pd.DataFrame(matrix, columns=vectorizer.get_feature_names())
top_words3 = matrix.sum(axis=0).sort_values(ascending=False)
# Build a mask that is True for words that do not appear in the 3rd book
mask = ~top_words.index.isin(top_words3.index)
# Keep only those words, i.e. drop every word that occurs in book 3
top_words = top_words[mask]
top_words.to_csv('dict.csv', index=True, float_format="%f",encoding="utf-8")
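Note that the mask above drops every word that occurs in book 3, including words it shares with books 1 and 2. If the goal is to drop only the words that appear exclusively in the 3rd book, a hedged variant of the mask step (reusing t1, t2, tokenize, the vectorizer and the unfiltered corpus-wide top_words from above) could look like this:
# Vocabulary of books 1 and 2 only
text12 = [" ".join(tokenize(t1.lower())), " ".join(tokenize(t2.lower()))]
matrix12 = pd.DataFrame(vectorizer.fit_transform(text12).todense(),
                        columns=vectorizer.get_feature_names())
top_words12 = matrix12.sum(axis=0)
# Words that occur in book 3 but never in books 1 or 2
only_in_book3 = top_words3.index[~top_words3.index.isin(top_words12.index)]
# Drop exactly those words from the corpus-wide ranking
top_words = top_words[~top_words.index.isin(only_in_book3)]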
I would like to calculate the cosine similarity for consecutive pairs of articles in a JSON file. So far I have managed to do it, but I just realized that when computing the tf-idf of each article I am not using the terms from all the articles in the file, only those from each pair. Here is the code I am using, which gives the cosine similarity coefficient of each consecutive pair of articles.
import json
import nltk
with open('SDM_2015.json') as f:
    data = [json.loads(line) for line in f]
## Loading the packages needed:
import nltk, string
from sklearn.feature_extraction.text import TfidfVectorizer
## Defining our functions to filter the data
# Short for stemming each word (common root)
stemmer = nltk.stem.porter.PorterStemmer()
# Short for removing punctuation etc
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
## First function that creates the tokens
def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]
## Second function, building on the first one: converts all words to lower case and removes punctuation (using the map specified above)
def normalize(text):
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))
## Lastly, a vectorizer is created that combines all the previous steps plus stopword removal
vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')
## Calculation, one by one, of the cosine similarity
def foo(x, y):
    tfidf = vectorizer.fit_transform([x, y])
    return ((tfidf * tfidf.T).A)[0,1]
my_funcs = {}
for i in range(len(data) - 1):
    x = data[i]['body']
    y = data[i+1]['body']
    foo.func_name = "cosine_sim%d" % i
    my_funcs["cosine_sim%d" % i] = foo
    print(foo(x,y))
Any idea how to compute the cosine similarity using the terms of all the articles available in the JSON file rather than only those of each pair?
Kind regards,
Andres
I think, based on our discussion above, you need to change the foo function and everything below it. See the code below; note that I haven't actually run it, since I don't have your data and no sample lines were provided.
## Loading the packages needed:
import nltk, string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine
import json
from sklearn.metrics.pairwise import cosine_similarity
with open('SDM_2015.json') as f:
    data = [json.loads(line) for line in f]
## Defining our functions to filter the data
# Short for stemming each word (common root)
stemmer = nltk.stem.porter.PorterStemmer()
# Short for removing punctuation etc
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
## First function that creates the tokens
def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]
## Second function, building on the first one: converts all words to lower case and removes punctuation (using the map specified above)
def normalize(text):
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))
## tfidf
vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')
tfidf_data = vectorizer.fit_transform([d['body'] for d in data])
# cosine similarities between all pairs of articles
similarity_matrix = cosine_similarity(tfidf_data)
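As a small follow-up sketch (assuming the similarity_matrix above): the consecutive-pair values that the original loop printed sit on the first superdiagonal of the matrix.
import numpy as np
# similarity_matrix[i, i+1] compares article i with article i+1
consecutive_sims = np.diag(similarity_matrix, k=1)
for i, sim in enumerate(consecutive_sims):
    print("cosine_sim%d: %f" % (i, sim))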
Actually I want to extract the contextual words of a specific word. For this purpose I can use n-grams in Python, but the drawback is that the window slides by one token, whereas I only need the contextual words of a specific word. E.g. my file is like this (one token per line):
IL-2
gene
expression
and
NF-kappa
B
activation
through
CD28
requires
reactive
oxygen
production
by
5-lipoxygenase
.
I mean each token is on its own line. Now I want to extract the surrounding words of each token, e.g. "through" and "requires" are the surrounding words of "CD28". I wrote Python code but it did not work, generating the error ValueError: list.index(x): x not in list.
My code is
import re;
import nltk;
file=open("C:/Python26/test.txt");
contents= file.read()
tokens = nltk.word_tokenize(contents)
f=open("trigram.txt",'w');
for l in tokens:
    print tokens[l],tokens[l+1]
f.close();
First of all, list.index(x) : Return the index in the list of the first item whose value is x.
>>> ["foo", "bar", "baz"].index('bar')
1
In your code, the variable 'word' is populated using a range of integers, not the actual contents, so we can't directly use 'word' with the list.index() function.
>>> print lines.index(1)
ValueError: 1 is not in list
Change your code like this:
file="C:/Python26/tokens.txt";
f=open("trigram.txt",'w');
with open(file,'r') as rf:
lines = rf.readlines();
for word in range(1,len(lines)-1):
f.write(lines[word-1].strip()+"\t"+lines[word].strip()+"\t"+lines[word+1].strip())
f.close()
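If you only need the neighbours of one specific token rather than all the triples, here is a small sketch using list.index on the actual contents (the file path is taken from the question; "CD28" is just the example token):
with open("C:/Python26/tokens.txt", 'r') as rf:
    lines = [line.strip() for line in rf]
i = lines.index("CD28")                # position of the token itself
print(lines[i-1] + " " + lines[i+1])   # -> through requires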
I don't really understand what you want to do, but I'll do my best.
If you want to process words with Python there is a library called NLTK, which stands for Natural Language Toolkit.
You may need to tokenize a sentence or a document.
import nltk
def tokenize_query(query):
    return nltk.word_tokenize(query)
f = open('C:/Python26/tokens.txt')
raw = f.read()
tokenize_query(raw)
We can also read a file one line at a time using a for loop:
f = open('C:/Python26/tokens.txt', 'rU')
for line in f:
    print(line.strip())
r means 'read' and U means 'universal', if you are wondering.
strip() is just cutting '\n' from the text.
The context may be provided by wordnet and all its functions.
I guess you should use synsets with the word's pos (part of speech).
A synset is sort of a synonyms list in a semantic way.
NLTK can also provide some other nice features, like sentiment analysis and similarity between synsets.
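Here is a minimal sketch of what that looks like with WordNet (the word 'activation' is just an illustrative token from your file, and you may need nltk.download('wordnet') first):
from nltk.corpus import wordnet as wn
# Synsets of a word, restricted to a part of speech
for syn in wn.synsets('activation', pos=wn.NOUN):
    print(syn.name(), syn.definition())
    print(syn.lemma_names())   # the "synonyms list" mentioned above
# Similarity between two synsets
print(wn.synset('dog.n.01').path_similarity(wn.synset('cat.n.01')))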
file="C:/Python26/tokens.txt";
f=open("trigram.txt",'w');
with open(file,'r') as rf:
lines = rf.readlines();
for word in range(1,len(lines)-1):
f.write(lines[word-1].strip()+"\t"+lines[word].strip()+"\t"+lines[word+1].strip())
f.write("\n")
f.close()
This code also gives the same result:
import nltk;
from nltk.util import ngrams
from nltk import word_tokenize
file = open("C:/Python26/tokens.txt");
contents=file.read();
tokens = nltk.word_tokenize(contents);
f_tri = open("trigram.txt",'w');
trigram = ngrams(tokens,3)
for t in trigram:
    f_tri.write(str(t)+"\n")
f_tri.close()
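If, as in the original question, only the context of one particular word is needed, the trigrams can be filtered by their middle token (a hedged sketch reusing tokens and ngrams from above; "CD28" is just the example):
# Keep only the trigrams whose middle token is the word of interest
for left, middle, right in ngrams(tokens, 3):
    if middle == "CD28":
        print(left + " " + right)   # -> through requires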