I am using sklearn's TfidfVectorizer to vectorize my corpus. In my analysis, there are some documents in which all terms are filtered out because they contain only stopwords. To reduce sparsity, and because it is meaningless to include them in the analysis, I would like to remove them.
Looking at the TfidfVectorizer documentation, there is no parameter that can be set to do this. Therefore, I am thinking of removing these documents manually before passing the corpus to the vectorizer. However, this has a potential issue: the stopwords I have collected are not the same as the terms dropped by the vectorizer, since I also use the min_df and max_df options to filter out terms.
Is there a better way to achieve what I am looking for (i.e. removing/ignoring documents that contain only stopwords)?
Any help would be greatly appreciated.
You can:
specify your stopwords and then, after running TfidfVectorizer,
filter out the empty rows.
The following code snippet shows a simplified example that should set you in the right direction:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["aa ab", "aa ab ac"]
stop_words = ["aa", "ab"]
tfidf = TfidfVectorizer(stop_words=stop_words)
corpus_tfidf = tfidf.fit_transform(corpus)
# True for rows whose tf-idf weights sum to zero, i.e. documents with no remaining terms
idx = np.array(corpus_tfidf.sum(axis=1) == 0).ravel()
corpus_filtered = corpus_tfidf[~idx]
Feel free to ask questions if you still have any!
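Since the question also mentions min_df and max_df, here is a minimal sketch of the same idea that keeps the original documents aligned with the filtered matrix. The helper name drop_empty_documents and the toy corpus are made up for illustration. Because the mask is computed after fitting, it reflects whatever the vectorizer actually dropped (stop words as well as min_df/max_df pruning), so you do not need a separate stop word list.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def drop_empty_documents(corpus, **vectorizer_kwargs):
    """Vectorize corpus and drop documents left with no terms (hypothetical helper)."""
    vectorizer = TfidfVectorizer(**vectorizer_kwargs)
    matrix = vectorizer.fit_transform(corpus)
    # tf-idf weights are non-negative, so a zero row sum means no term survived
    keep = np.asarray(matrix.sum(axis=1)).ravel() > 0
    kept_docs = [doc for doc, flag in zip(corpus, keep) if flag]
    return matrix[keep], kept_docs, vectorizer


# toy corpus: "ad" and "ae" appear in only one document each, so min_df=2 removes them
corpus = ["aa ab", "aa ab ac", "ac ad", "ae"]
matrix, kept_docs, vectorizer = drop_empty_documents(
    corpus, stop_words=["aa", "ab"], min_df=2
)
print(kept_docs)
```

Here the first document is emptied by the stop words and the last one by min_df, so only the two middle documents should remain.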
So, you can use this:
import re
import nltk
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def tokenize(text):
    # first tokenize by sentence, then by word, so that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    # keep only tokens containing a letter or digit (drops raw punctuation);
    # return a list, because TfidfVectorizer expects the tokenizer to return tokens
    return [token for token in tokens if re.search('[a-zA-Z0-9]', token)]

# df['text'] is assumed to be a pandas column holding the raw documents
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, min_df=0.01, stop_words='english',
                                   use_idf=True, tokenizer=tokenize)
tfidf_matrix = tfidf_vectorizer.fit_transform(df['text'])
# rows that sum to zero are documents left with no terms
ids = np.array(tfidf_matrix.sum(axis=1) == 0).ravel()
tfidf_filtered = tfidf_matrix[~ids]
This way you can remove stopwords, empty rows and use min_df and max_df.
I have some doubts regarding n-grams.
Specifically, I would like to extract 2-grams, 3-grams and 4-grams from the following column:
Sentences
For each topic, we will explore the words occuring in that topic and its relative weight.
We will check where our test document would be classified.
For each document we create a dictionary reporting how many
words and how many times those words appear.
Save this to ‘bow_corpus’, then check our selected document earlier.
To do this, I used the following function:
def n_grams(lines , min_length=2, max_length=4):
    lenghts=range(min_length,max_length+1)
    ngrams={length:collections.Counter() for length in lengths)
    queue= collection.deque(maxlen=max_length)
but it does not work since I got None as output.
Can you please tell me what is wrong in the code?
Your ngrams dictionary has empty Counter() objects because you don't pass anything to count, and the function as posted has no return statement, which is why you get None. There are also a few other problems:
Function names can't include - in Python.
collection.deque is invalid; I think you wanted to call collections.deque().
I think there are better options for fixing your code than using the collections library. Two of them are as follows:
You might fix your function using list comprehension:
def n_grams(lines, min_length=2, max_length=4):
    tokens = lines.split()
    ngrams = dict()
    for n in range(min_length, max_length + 1):
        ngrams[n] = [tokens[i:i+n] for i in range(len(tokens)-n+1)]
    return ngrams
Or you might use nltk, which supports tokenization and n-grams natively.
from nltk import ngrams
from nltk.tokenize import word_tokenize

def n_grams(lines, min_length=2, max_length=4):
    tokens = word_tokenize(lines)
    # build the result without a local variable called ngrams, which would shadow the import;
    # list() materializes the generator that ngrams() returns
    return {n: list(ngrams(tokens, n)) for n in range(min_length, max_length + 1)}
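If you do want counts, as the Counter objects in the question suggest, a minimal sketch using only the standard library could look like the following. The function name count_ngrams is made up, and it assumes lines is an iterable of sentences and that lowercasing plus whitespace splitting is good enough as tokenization.

```python
import collections


def count_ngrams(lines, min_length=2, max_length=4):
    """Count n-grams of each length across an iterable of sentences."""
    counts = {n: collections.Counter() for n in range(min_length, max_length + 1)}
    for line in lines:
        tokens = line.lower().split()
        for n in counts:
            # slide a window of n tokens over the sentence
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts


sentences = [
    "We will check where our test document would be classified.",
    "We will check our selected document earlier.",
]
counts = count_ngrams(sentences)
print(counts[2].most_common(3))
```

counts[2], counts[3] and counts[4] then hold the 2-, 3- and 4-gram frequencies. Punctuation stays attached to words here, so swap in a real tokenizer if that matters.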
I want to do sentiment analysis of some sentences with Python and TextBlob lib.
I know how to use it, but is there any way to use n-grams with it?
Basically, I do not want to analyze word by word, but I want to analyze 2 words, 3 words, because phrases can carry much more meaning and sentiment.
For example, this is what I have done (it works):
from textblob import TextBlob
my_string = "This product is very good, you should try it"
my_string = TextBlob(my_string)
sentiment = my_string.sentiment.polarity
subjectivity = my_string.sentiment.subjectivity
print(sentiment)
print(subjectivity)
But how can I apply, for example n-grams = 2, n-grams = 3 etc?
Is it possible to do that with TextBlob, or VaderSentiment lib?
Here is a solution that finds n-grams without using any libraries.
from textblob import TextBlob

def find_ngrams(n, input_sequence):
    # Split sentence into tokens.
    tokens = input_sequence.split()
    ngrams = []
    for i in range(len(tokens) - n + 1):
        # Take n consecutive tokens in array.
        ngram = tokens[i:i+n]
        # Concatenate array items into string.
        ngram = ' '.join(ngram)
        ngrams.append(ngram)
    return ngrams

if __name__ == '__main__':
    my_string = "This product is very good, you should try it"
    ngrams = find_ngrams(3, my_string)
    for ngram in ngrams:
        blob = TextBlob(ngram)
        print('Ngram: {}'.format(ngram))
        print('Polarity: {}'.format(blob.sentiment.polarity))
        print('Subjectivity: {}'.format(blob.sentiment.subjectivity))
To change the n-gram length, change the n argument passed to find_ngrams().
There is no parameter within TextBlob to define n-grams, as opposed to words/unigrams, as the features used for sentiment analysis.
TextBlob uses a polarity lexicon to calculate the overall sentiment of a text. This lexicon contains unigrams, which means it can only give you the sentiment of a word but not of an n-gram with n>1.
I guess you could work around that by feeding bi- or tri-grams into the sentiment classifier, just like you would feed in a sentence, and then creating a dictionary of your n-grams with their accumulated sentiment values.
But I'm not sure that this is a good idea. I'm assuming you are looking for bigrams to address problems like negation ("not bad"), and the lexicon approach won't be able to use "not" to flip the sentiment value for "bad".
TextBlob also contains an option to use a naive Bayes classifier instead of the lexicon approach. This is trained on a movie review corpus provided by nltk, but the default features for training are words/unigrams as far as I can make out from peeking at the source code.
You might be able to implement your own feature extractor within there to extract n-grams instead of words, then re-train it accordingly and use it for your data.
Regardless of all that, I would suggest that you use a combination of unigrams and n>1-grams as features, because dropping unigrams entirely is likely to affect your performance negatively. Bigrams are much more sparsely distributed, so you'll struggle with data sparsity problems when training.
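To make the last suggestion concrete, here is a rough sketch of a feature extractor that emits both unigrams and bigrams. The function name and the "contains(...)" key format are my own choices, and it assumes simple whitespace tokenization. As far as I know, TextBlob's classifiers can take a custom feature extractor function, but treat that hookup as an assumption and check the TextBlob documentation before relying on it.

```python
def unigram_bigram_features(document):
    """Return presence features for every unigram and bigram in a text."""
    tokens = document.lower().split()
    # unigram features, e.g. "contains(bad)"
    features = {"contains({})".format(tok): True for tok in tokens}
    # bigram features, e.g. "contains(not bad)"
    for first, second in zip(tokens, tokens[1:]):
        features["contains({} {})".format(first, second)] = True
    return features


features = unigram_bigram_features("This product is not bad")
print(sorted(features))
```

With features like "contains(not bad)" next to "contains(bad)", a trained classifier at least has a chance to learn that the bigram behaves differently from the single word.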
I would like to find the most relevant words over a set of documents.
I would like to call a Tf Idf algorithm over 3 documents and return a csv file containing each word and its frequency.
After that, I will take only the ones with a high number and I will use them.
I found this implementation that does what I need https://github.com/mccurdyc/tf-idf/.
I call that jar using the subprocess library. But there is a huge problem with that code: it makes a lot of mistakes when analyzing words. It merges some words together, and it has problems with ' and - (I think). I am using it on the text of 3 books (Harry Potter) and, for example, I am getting words such as hermiones, hermionell, riddlehermione, thinghermione instead of just hermione in the csv file.
Am I doing something wrong? Can you give me a working implementation of the Tf-idf algorithm? Is there a Python library that does that?
Here is an implementation of the Tf-idf algorithm using scikit-learn.
Before applying it, you can word_tokenize() and stem your words.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def tokenize(text):
    # split into word tokens, then reduce each token to its stem
    return [stemmer.stem(item) for item in word_tokenize(text)]

# your corpus
text = ["This is your first text book", "This is the third text for analysis", "This is another text"]
# word tokenize and stem
text = [" ".join(tokenize(txt.lower())) for txt in text]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(text).toarray()
# transform the matrix to a pandas df
# (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
matrix = pd.DataFrame(matrix, columns=vectorizer.get_feature_names_out())
# sum each word's score over all documents (axis=0)
top_words = matrix.sum(axis=0).sort_values(ascending=False)
How can I convert the following pandas dataframe, which holds the tf-idf score of each word in several documents, into a matrix named "tfidf" so that I can run, for instance:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer
str = 'this sentence has unseen text such as computer but also king lord juliet'
response = tfidf.transform([str])
You need to fit the TfidfVectorizer using the original raw documents before being able to use it to transform a new document.
If you cannot access the original documents you can always recover the idf weights of each word by constructing a dictionary:
idfs[word] = log{(# documents) / (# documents where word has non-zero tf-idf weight)}
Later you can use that dictionary to calculate the tf-idf weights for the new sentence:
from collections import Counter
# sentence is the new text; idfs is the dictionary built as described above
words = sentence.split()
s_tfs = Counter(words)
s_idfs = {word: idfs.get(word, 0) for word in words}
s_tfidf = {word: s_tfs.get(word, 0) * s_idfs.get(word, 0) for word in idfs.keys()}
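Putting the two steps together, a self-contained sketch might look like this. It assumes the tf-idf table is available as one {word: weight} dict per document (the toy numbers below are invented) and uses the plain log(N / df) formula from above. Note that scikit-learn's default idf is smoothed, so these values will not match what TfidfVectorizer would produce exactly.

```python
import math
from collections import Counter


def recover_idfs(tfidf_rows):
    """tfidf_rows: list of {word: tf-idf weight} dicts, one per document."""
    n_docs = len(tfidf_rows)
    doc_freq = Counter()
    for row in tfidf_rows:
        # a word counts for a document if its tf-idf weight there is non-zero
        doc_freq.update(word for word, weight in row.items() if weight > 0)
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def tfidf_for_sentence(sentence, idfs):
    """Weight each known word by its count in the sentence; unseen words are ignored."""
    tfs = Counter(sentence.split())
    return {word: tfs[word] * idf for word, idf in idfs.items()}


# invented toy table: three documents
rows = [
    {"king": 0.5, "lord": 0.2},
    {"king": 0.1, "juliet": 0.7},
    {"computer": 0.3, "king": 0.4},
]
idfs = recover_idfs(rows)
weights = tfidf_for_sentence("king lord lord unseen", idfs)
print(weights)
```

One caveat: a word that occurs in every document gets idf log(1) = 0, so its tf-idf weight is zero everywhere and it cannot be recovered from the table at all. In the toy rows above, "king" is given non-zero weights only to keep the example short.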
How can I ignore some words like 'a', 'the', when counting how often each word occurs in a text?
import collections
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
df= pd.DataFrame({'phrase': pd.Series('The large distance between cities. The small distance. The')})
f = CountVectorizer().build_tokenizer()(str(df['phrase']))
result = collections.Counter(f).most_common(1)
print(result)
The answer will be The. But I would like to get distance as the most frequent word.
It would be best to avoid counting the entries to begin with, like so:
ignore = {'the','a','if','in','it','of','or'}
result = collections.Counter(x for x in f if x not in ignore).most_common(1)
Another option is to use the stop_words parameter of CountVectorizer.
These are words that you are not interested in and will be discarded by the analyzer.
f = CountVectorizer(stop_words={'the','a','if','in','it','of','or'}).build_analyzer()(str(df['phrase']))
result = collections.Counter(f).most_common(1)
print(result)
[(u'distance', 1)]
Note that the tokenizer does not perform preprocessing (lowercasing, accent-stripping) or remove stop words, so you need to use the analyzer here.
You can also use stop_words='english' to automatically remove English stop words (see sklearn.feature_extraction.text.ENGLISH_STOP_WORDS for the full list).
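One thing to watch out for: str(df['phrase']) converts the printed representation of the whole Series (index, a possibly shortened text and the Name/dtype footer), not just the sentence, which can distort the counts. A minimal sketch that runs the analyzer on the raw string instead, with the same illustrative stop list as above:

```python
import collections
from sklearn.feature_extraction.text import CountVectorizer

text = "The large distance between cities. The small distance. The"
# illustrative stop list; stop_words='english' would use scikit-learn's built-in list instead
stop_words = ["the", "a", "if", "in", "it", "of", "or"]
# the analyzer lowercases, tokenizes and removes the stop words
analyzer = CountVectorizer(stop_words=stop_words).build_analyzer()
tokens = analyzer(text)
result = collections.Counter(tokens).most_common(1)
print(result)
```

On the raw sentence, distance is counted twice and comes out as the most frequent word.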