Python TF-IDF algorithm

I would like to find the most relevant words over a set of documents.
I would like to run a TF-IDF algorithm over 3 documents and return a CSV file containing each word and its frequency.
After that, I will take only the words with a high value and use them.
I found this implementation that does what I need: https://github.com/mccurdyc/tf-idf/.
I call that jar using the subprocess library, but there is a big problem in that code: it makes a lot of mistakes when analyzing words. It merges some words together and has problems with ' and - (I think). I am using it on the text of 3 books (Harry Potter) and, for example, I am getting words such as hermiones, hermionell, riddlehermione, thinghermione instead of just hermione in the CSV file.
Am I doing something wrong? Can you give me a working implementation of the TF-IDF algorithm? Is there a Python library that does this?

Here is an implementation of the TF-IDF algorithm using scikit-learn.
Before applying it, you can word_tokenize() and stem your words:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def tokenize(text):
    tokens = word_tokenize(text)
    return [stemmer.stem(token) for token in tokens]

# your corpus
text = ["This is your first text book", "This is the third text for analysis", "This is another text"]
# word tokenize and stem
text = [" ".join(tokenize(txt.lower())) for txt in text]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(text).todense()
# transform the matrix to a pandas df
matrix = pd.DataFrame(matrix, columns=vectorizer.get_feature_names())
# sum each word's tf-idf score over all documents (axis=0) and rank
top_words = matrix.sum(axis=0).sort_values(ascending=False)
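Since the goal is a CSV with each word and its score, a minimal follow-up using the top_words Series from the snippet above could be (the file name is just a placeholder):
# write word/score pairs to a CSV file, highest scores first
top_words.to_csv("tfidf_scores.csv", header=["score"], index_label="word")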

Related

How to speed up computation time for stopword removal and lemmatization in NLP

As part of pre-processing for a text classification model, I have added stopword removal and lemmatization steps, using the NLTK library. The code is below:
import re
import pandas as pd
import nltk; nltk.download("all")
from nltk.corpus import stopwords; stop = set(stopwords.words('english'))
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Stopwords removal
def remove_stopwords(entry):
    sentence_list = [word for word in entry.split() if word not in stopwords.words("english")]
    return " ".join(sentence_list)

df["Description_no_stopwords"] = df.loc[:, "Description"].apply(lambda x: remove_stopwords(x))

# Lemmatization
lemmatizer = WordNetLemmatizer()

def punct_strip(string):
    s = re.sub(r'[^\w\s]', ' ', string)
    return s

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

def lemmatize_rows(entry):
    sentence_list = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in punct_strip(entry).split()]
    return " ".join(sentence_list)

df["Description - lemmatized"] = df.loc[:, "Description_no_stopwords"].apply(lambda x: lemmatize_rows(x))
The problem is that, when I pre-process a dataset with 27k entries (my test set), it takes 40-45 seconds for stopwords removal and just as long for lemmatization. By contrast, model evaluation only takes 2-3 seconds.
How can I re-write the functions to optimise computation speed? I have read something about vectorization, but the example functions were much simpler than the ones that I have reported, and I wouldn't know how to do it in this case.
A similar question was asked here and suggests that you try caching the stopwords.words("english") object. In your method remove_stopwords you are creating the object every time you evaluate an entry. So, you can definitely improve that. Regarding your lemmatizer, as mentioned here, you can also cache your results to improve performance. I can imagine that your pandas operations are also quite expensive. You may consider converting your dataframe into an array or dictionary and then iterating over it. If you need a dataframe later, you can easily convert it back.
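A minimal sketch of both suggestions (the lru_cache use and the helper names are my own additions, not from the original code):
import re
from functools import lru_cache
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

# build the stopword set once instead of re-creating the list on every call
stop = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def remove_stopwords(entry):
    # membership tests against a set are O(1)
    return " ".join(word for word in entry.split() if word not in stop)

@lru_cache(maxsize=None)
def lemmatize_word(word):
    # cache results so repeated words are POS-tagged and lemmatized only once
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ, "N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV}
    return lemmatizer.lemmatize(word, tag_dict.get(tag, wordnet.NOUN))

def lemmatize_rows(entry):
    return " ".join(lemmatize_word(word) for word in re.sub(r'[^\w\s]', ' ', entry).split())
The dataframe .apply calls stay the same; the only change is that the expensive objects are created once and repeated words hit the cache.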

Setting n-grams for sentiment analysis with Python and TextBlob

I want to do sentiment analysis of some sentences with Python and the TextBlob lib.
I know how to use it, but is there any way to set n-grams for it?
Basically, I do not want to analyze word by word; I want to analyze 2 or 3 words at a time, because phrases can carry much more meaning and sentiment.
For example, this is what I have done (it works):
from textblob import TextBlob
my_string = "This product is very good, you should try it"
my_string = TextBlob(my_string)
sentiment = my_string.sentiment.polarity
subjectivity = my_string.sentiment.subjectivity
print(sentiment)
print(subjectivity)
But how can I apply, for example n-grams = 2, n-grams = 3 etc?
Is it possible to do that with TextBlob, or VaderSentiment lib?
Here is a solution that finds n-grams without using any libraries.
from textblob import TextBlob

def find_ngrams(n, input_sequence):
    # Split sentence into tokens.
    tokens = input_sequence.split()
    ngrams = []
    for i in range(len(tokens) - n + 1):
        # Take n consecutive tokens in array.
        ngram = tokens[i:i+n]
        # Concatenate array items into string.
        ngram = ' '.join(ngram)
        ngrams.append(ngram)
    return ngrams

if __name__ == '__main__':
    my_string = "This product is very good, you should try it"
    ngrams = find_ngrams(3, my_string)
    for ngram in ngrams:
        blob = TextBlob(ngram)
        print('Ngram: {}'.format(ngram))
        print('Polarity: {}'.format(blob.sentiment.polarity))
        print('Subjectivity: {}'.format(blob.sentiment.subjectivity))
To change the ngram lengths, change the n value in the function find_ngrams().
There is no parameter in TextBlob for using n-grams, as opposed to words/unigrams, as features for sentiment analysis.
TextBlob uses a polarity lexicon to calculate the overall sentiment of a text. This lexicon contains unigrams, which means it can only give you the sentiment of a single word, not of an n-gram with n > 1.
I guess you could work around that by feeding bi- or tri-grams into the sentiment classifier, just as you would feed in a sentence, and then building a dictionary of your n-grams with their accumulated sentiment values.
But I'm not sure this is a good idea. I assume you are looking at bigrams to address problems like negation ("not bad"), and the lexicon approach won't be able to use "not" to flip the sentiment value of "bad".
TextBlob also offers a NaiveBayes classifier instead of the lexicon approach. It is trained on a movie review corpus provided by NLTK, but the default features for training are words/unigrams, as far as I can make out from peeking at the source code.
You might be able to implement your own feature extractor there to extract n-grams instead of words, then re-train it accordingly and use it for your data.
Regardless of all that, I would suggest using a combination of unigrams and n>1-grams as features, because dropping unigrams entirely is likely to hurt performance. Bigrams are much more sparsely distributed, so you will run into data sparsity problems during training.
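To make that last suggestion concrete, here is a rough, untested sketch of an n-gram feature extractor plugged into TextBlob's NaiveBayesClassifier; the extractor and the tiny training set are made up for illustration:
from textblob.classifiers import NaiveBayesClassifier

N = 2  # bigram features

def ngram_features(document):
    # mark the presence of each word bigram in the document
    tokens = document.split()
    ngrams = [' '.join(tokens[i:i + N]) for i in range(len(tokens) - N + 1)]
    return {'contains({})'.format(ngram): True for ngram in ngrams}

# toy training data, purely for illustration
train = [
    ("This product is not bad at all", "pos"),
    ("I really love this product", "pos"),
    ("This is a terrible purchase", "neg"),
    ("Not good, would not buy again", "neg"),
]

classifier = NaiveBayesClassifier(train, feature_extractor=ngram_features)
print(classifier.classify("not bad for the price"))
Whether bigram-presence features alone work well is a separate question; as said above, combining them with unigrams is usually the safer choice.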

Sklearn TfIdfVectorizer remove docs containing all stopwords

I am using sklearn's TfidfVectorizer to vectorize my corpus. In my analysis, there are some documents in which all terms are filtered out because they contain only stopwords. To reduce the sparsity issue, and because it is meaningless to include them in the analysis, I would like to remove them.
Looking into the TfidfVectorizer doc, there is no parameter that can be set to do this. Therefore, I am thinking of removing these documents manually before passing the corpus into the vectorizer. However, this has a potential issue: the stopword list I have is not the same as the one used by the vectorizer, since I also use the min_df and max_df options to filter out terms.
Is there any better way to achieve what I am looking for (i.e. removing/ignoring documents containing only stopwords)?
Any help would be greatly appreciated.
You can:
specify your stopwords and then, after TfidfVectorizer,
filter out the empty rows
The following code snippet shows a simplified example that should set you in the right direction:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["aa ab", "aa ab ac"]
stop_words = ["aa", "ab"]

tfidf = TfidfVectorizer(stop_words=stop_words)
corpus_tfidf = tfidf.fit_transform(corpus)

# rows whose tf-idf vector sums to zero contained only stopwords
idx = np.array(corpus_tfidf.sum(axis=1) == 0).ravel()
corpus_filtered = corpus_tfidf[~idx]
Feel free to ask questions if you still have any!
So, you can use this:
import re
import nltk
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def tokenize(text):
    # first tokenize by sentence, then by word, to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    punctuations = "?:!.,;'’"
    # drop punctuation-only tokens and any token that contains no letters or digits
    filtered_tokens = [token for token in tokens
                       if token not in punctuations and re.search('[a-zA-Z0-9]', token)]
    # the tokenizer passed to TfidfVectorizer must return a list of tokens, not a joined string
    return filtered_tokens

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, min_df=0.01, stop_words='english',
                                   use_idf=True, tokenizer=tokenize)
tfidf_matrix = tfidf_vectorizer.fit_transform(df['text'])  # df is your dataframe with a 'text' column

# drop documents whose tf-idf vector is all zeros
ids = np.array(tfidf_matrix.sum(axis=1) == 0).ravel()
tfidf_filtered = tfidf_matrix[~ids]
This way you can remove stopwords and empty rows while still using min_df and max_df.
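If you also want the original dataframe to stay aligned with the filtered matrix, you can reuse the same boolean mask (assuming the df and ids from the snippet above):
# drop the dataframe rows whose documents ended up as all-zero tf-idf vectors
df_filtered = df.loc[~ids].reset_index(drop=True)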

Transform a tf-idf pandas dataframe into a tf-idf matrix

How can I convert the following pandas dataframe, with the tf-idf score of each word in several documents, into a matrix named "tfidf" so that I can implement, for instance,
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer
str = 'this sentence has unseen text such as computer but also king lord juliet'
response = tfidf.transform([str])
You need to fit the TfidfVectorizer using the original raw documents before being able to use it to transform a new document.
If you cannot access the original documents you can always recover the idf weights of each word by constructing a dictionary:
idfs[word] = log( (number of documents) / (number of documents in which word has a non-zero tf-idf weight) )
Later you can use that dictionary to calculate the tf-idf weights for the new sentence:
from collections import Counter
words = sentence.split()
s_tfs = Counter(words)
s_idfs = {word: idfs.get(word, 0) for word in words}
s_tfidf = {word: s_tfs.get(word, 0) * s_idfs.get(word, 0) for word in idfs.keys()}
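If you do still have the original documents, the simpler route is the one described first: fit the vectorizer on them and then transform the new sentence. A minimal sketch (the corpus here is a placeholder, and note that scikit-learn uses a smoothed variant of the idf formula above):
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the first document", "a second document about kings and lords"]  # placeholder corpus
tfidf = TfidfVectorizer()
tfidf.fit(corpus)

# transform a new, partly unseen sentence into the fitted tf-idf space
response = tfidf.transform(['this sentence has unseen text such as computer but also king lord juliet'])

# the idf dictionary described above can also be read off the fitted vectorizer
# (with scikit-learn < 1.0, use get_feature_names() instead)
idfs = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))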

Finding trigrams for entire corpus with NLTK

I'm trying to write a script that will look through my corpus which contains 93,000 txt files and find the frequency distributions of the trigrams present across all of them (so not separate frequency distributions but one frequency distribution for the entire corpus). I've gotten it to do the frequency distributions for a single file in the corpus but don't have the skills at all to get any further. Here's the code:
import nltk
import re
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk import FreqDist

corpus_root = '/Users/jolijttamanaha/Python/CRspeeches'
speeches = PlaintextCorpusReader(corpus_root, r'.*\.txt')
print("Finished importing corpus")

f = speeches.open('Mr. THOMPSON of Pennsylvania.2010-12-07.2014sep17_at_233337.txt')
raw = f.read()
tokens = nltk.word_tokenize(raw)
tgs = nltk.trigrams(tokens)
fdist = nltk.FreqDist(tgs)
for k, v in fdist.items():
    print(k, v)
Thank you in advance for your help.
Once you define your speeches corpus with PlaintextCorpusReader as you have, you can get trigrams for the entire corpus very simply:
fdist = nltk.FreqDist(nltk.trigrams(speeches.words()))
But this has an undesirable glitch: it forms trigrams that span from the end of one file to the start of the next. Such trigrams do not represent tokens that could actually follow each other in a text; they are completely accidental. What you really want is to combine the trigram counts from each individual file, which you can get like this:
fdist = nltk.FreqDist()  # empty distribution
for filename in speeches.fileids():
    fdist.update(nltk.trigrams(speeches.words(filename)))
Your fdist now contains the cumulative statistics, which you can examine in the various available ways. E.g.,
fdist.tabulate(10)
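Or, to list the most frequent trigrams with their counts:
# ten most common trigrams across the whole corpus
print(fdist.most_common(10))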
For the pre-built corpora API, instead of using corpus.raw() you can also try corpus.words(), e.g.
>>> from nltk.util import ngrams
>>> from nltk.corpus import brown
>>> brown.words()
['The', 'Fulton', 'County', 'Grand', 'Jury', ...]
>>> trigrams = ngrams(brown.words(), 3)
>>> for i in trigrams:
...     print(i)
As @alexis pointed out, the code above should also work for custom corpora loaded with PlaintextCorpusReader; see http://www.nltk.org/_modules/nltk/corpus/reader/plaintext.html
