I'm using scikit-learn to extract text features from a "bag of words" text (text tokenized on single words).
To do so, I'm using a TfidfVectorizer to also reduce the weight of very frequent words (i.e. "a", "the", etc.).
from sklearn.feature_extraction.text import TfidfVectorizer

text = 'Some text, with a lot of words...'
tfidf_vectorizer = TfidfVectorizer(
    min_df=1,                 # min count for relevant vocabulary
    max_features=4000,        # maximum number of features
    strip_accents='unicode',  # replace accented unicode chars with their ASCII equivalents
    analyzer='word',          # features made of words
    token_pattern=r'\w{4,}',  # tokenize only words of 4+ chars
    ngram_range=(1, 1),       # features made of a single token
    use_idf=True,             # enable inverse-document-frequency reweighting
    smooth_idf=True,          # prevents zero division for unseen words
    sublinear_tf=False)
# vectorize and re-weight
desc_vect = tfidf_vectorizer.fit_transform([text])
I would now like to be able to link each extracted feature with its corresponding tf-idf value, storing them in a dict:
{'feature1': tfidf1, 'feature2': tfidf2, ...}
I achieved it by using
d = dict(zip(tfidf_vectorizer.get_feature_names(), desc_vect.data))
I would like to know if there is a better, scikit-learn-native way to do such a thing.
Thank you very much.
For a single document, this should be fine. An alternative that works when the document set is small is this recipe of mine that uses Pandas.
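For reference, a rough Pandas-based sketch of that idea (my own illustration, not necessarily the linked recipe; in newer scikit-learn versions get_feature_names_out() replaces get_feature_names()):

import pandas as pd

# Dense view of the tf-idf matrix with one column per term;
# this is only reasonable for a small document set.
df = pd.DataFrame(desc_vect.toarray(),
                  columns=tfidf_vectorizer.get_feature_names())
# One dict per document, keeping only the non-zero weights.
tfidf_dicts = [row[row > 0].to_dict() for _, row in df.iterrows()]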
If you want to do this for multiple documents, then you can adapt the code in DictVectorizer.inverse_transform:
desc_vect = desc_vect.tocsr()

n_docs = desc_vect.shape[0]
tfidftables = [{} for _ in range(n_docs)]
terms = tfidf_vectorizer.get_feature_names()

for i, j in zip(*desc_vect.nonzero()):
    tfidftables[i][terms[j]] = desc_vect[i, j]
I have a data frame with sentences and the respective part-of-speech tag for each word (below is an extract of the data I'm working with, taken from the SNLI corpus). For each sentence in my collection I would like to extract unigrams and the corresponding POS tag of each word.
For instance, if I have the following:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_unigram = CountVectorizer(analyzer='word', ngram_range=(1, 1), stop_words='english')
doc = {'sent': ['Two women are embracing while holding to go packages .'],
       'tags': ['NUM NOUN AUX VERB SCONJ VERB PART VERB NOUN PUNCT']}
sentence = vectorizer_unigram.fit(doc['sent'])
sentence_unigrams = sentence.get_feature_names_out()
Then I would get the following unigrams output:
array(['embracing', 'holding', 'packages', 'women'], dtype=object)
But I don't know how to retain the part-of-speech tags after this. I tried a lookup based on the unigrams, but since they may differ from the raw tokens in the sentence (e.g. from sentence.split(' ')), you don't necessarily get matching tokens. Any suggestions on how I can extract unigrams and retain the corresponding part-of-speech tag?
After reviewing the source code for the sklearn CountVectorizer class, particularly the fit function, I don't believe the class has any way of tracking the original document token indexes relative to the extracted unigram features, since the extracted features do not necessarily use the same tokens. Other than the simple solution provided below, you might have to rely on some other method/library to achieve your desired result. If there is a particular case that fails, I'd suggest adding it to your question, as it might help people come up with solutions to your problem.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_unigram = CountVectorizer(analyzer='word', ngram_range=(1, 1), stop_words='english')

doc = {'sent': ['Two women are embracing while holding to go packages .'],
       'tags': ['NUM NOUN AUX VERB SCONJ VERB PART VERB NOUN PUNCT']}

sentence = vectorizer_unigram.fit(doc['sent'])
sentence_unigrams = sentence.get_feature_names_out()

sent_token_list = doc['sent'][0].split()
tags_token_list = doc['tags'][0].split()

sentence_tags = []
for unigram in sentence_unigrams:
    for i in range(len(sent_token_list)):
        if sent_token_list[i] == unigram:
            sentence_tags.append(tags_token_list[i])

print(sentence_unigrams)
# Output: ['embracing' 'holding' 'packages' 'women']
print(sentence_tags)
# Output: ['VERB', 'VERB', 'NOUN', 'NOUN']
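A variant of that lookup (just a sketch, assuming each original token maps to at most one analyzed token) is to push the raw tokens through the vectorizer's own analyzer via build_analyzer(), so the lookup keys get the same lowercasing, token pattern and stop-word filtering as the extracted unigrams:

analyzer = sentence.build_analyzer()
token_to_tag = {}
for token, tag in zip(sent_token_list, tags_token_list):
    analyzed = analyzer(token)   # applies the vectorizer's own preprocessing
    if analyzed:                 # stop words and punctuation come back empty
        token_to_tag[analyzed[0]] = tag
sentence_tags = [token_to_tag[u] for u in sentence_unigrams if u in token_to_tag]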
I am trying to practice a classification task on NLP. I am using 20newsgroup dataset and I want to implement a classification model. Before training model, I want to implement:
stopword removal
punctuation removal
converting to lower case, since this is not a sentiment analysis task, so case distinctions shouldn't matter here in my view.
I am using the following code:
from nltk.corpus import stopwords

max_len = 0
for sent in x_train:
    tokenizer_out = tokenizer(sent)
    # convert numerical tokens to alphabetical tokens
    encoded_tok = tokenizer.convert_ids_to_tokens(tokenizer_out.input_ids)
    tokens_without_sw = [word for word in encoded_tok if word not in stopwords.words()]
    new_ids = tokenizer.convert_tokens_to_ids(tokens_without_sw)
    max_len = max(max_len, len(new_ids))
I will be using a pretrained BERT from Hugging Face. Before implementing the code above, I had done the following to remove unnecessary header lines:
def clean(post: str, remove_it: tuple):
    new_lines = []
    for line in post.splitlines():
        if not line.startswith(remove_it):
            new_lines.append(line)
    return '\n'.join(new_lines)
remove_it = (
    'From:',
    'Subject:',
    'Reply-To:',
    'In-Reply-To:',
    'Nntp-Posting-Host:',
    'Organization:',
    'X-Mailer:',
    'In article <',
    'Lines:',
    'NNTP-Posting-Host:',
    'Summary:',
    'Article-I.D.:'
)
x_train = [clean(p, remove_it) for p in x_train]
x_test = [clean(p, remove_it) for p in x_test]
My next goal is to clean the text further. With my classification model I am able to achieve 90% accuracy, but I want to increase it further. So I want to remove the stopwords and punctuation, convert to lower case, and see what happens. But with the code I use, it takes forever to run, so I want a faster approach.
Can anyone help me?
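For what it's worth, a minimal sketch of one likely speed-up, assuming the main bottleneck is the stopwords.words() call inside the list comprehension (it rebuilds the full NLTK stop-word list and does a slow list membership check for every token) and assuming tokenizer is the Hugging Face BERT tokenizer from the question:

import string
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))  # build the set once, outside the loop
punct = set(string.punctuation)

max_len = 0
for sent in x_train:
    tokenizer_out = tokenizer(sent.lower())   # case-fold before tokenizing
    encoded_tok = tokenizer.convert_ids_to_tokens(tokenizer_out.input_ids)
    kept = [tok for tok in encoded_tok if tok not in stop_words and tok not in punct]
    new_ids = tokenizer.convert_tokens_to_ids(kept)
    max_len = max(max_len, len(new_ids))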
I want to do sentiment analysis of some sentences with Python and TextBlob lib.
I know how to use it, but is there any way to set n-grams for it?
Basically, I do not want to analyze word by word, but I want to analyze 2 words, 3 words, because phrases can carry much more meaning and sentiment.
For example, this is what I have done (it works):
from textblob import TextBlob
my_string = "This product is very good, you should try it"
my_string = TextBlob(my_string)
sentiment = my_string.sentiment.polarity
subjectivity = my_string.sentiment.subjectivity
print(sentiment)
print(subjectivity)
But how can I apply, for example n-grams = 2, n-grams = 3 etc?
Is it possible to do that with TextBlob, or VaderSentiment lib?
Here is a solution that finds n-grams without using any libraries.
from textblob import TextBlob

def find_ngrams(n, input_sequence):
    # Split sentence into tokens.
    tokens = input_sequence.split()
    ngrams = []
    for i in range(len(tokens) - n + 1):
        # Take n consecutive tokens in the array.
        ngram = tokens[i:i+n]
        # Concatenate array items into a string.
        ngram = ' '.join(ngram)
        ngrams.append(ngram)
    return ngrams

if __name__ == '__main__':
    my_string = "This product is very good, you should try it"
    ngrams = find_ngrams(3, my_string)
    analysis = {}
    for ngram in ngrams:
        blob = TextBlob(ngram)
        print('Ngram: {}'.format(ngram))
        print('Polarity: {}'.format(blob.sentiment.polarity))
        print('Subjectivity: {}'.format(blob.sentiment.subjectivity))
To change the ngram lengths, change the n value in the function find_ngrams().
There is no parameter within textblob to define n-grams as opposed to words/unigrams to be used as features for sentiment analysis.
Textblob uses a polarity lexicon to calculate the overall sentiment of a text. This lexicon contains unigrams, which means it can only give you the sentiment of a single word, but not of an n-gram with n > 1.
I guess you could work around that by feeding bi- or tri-grams into the sentiment classifier, just like you would feed in a sentence, and then create a dictionary of your n-grams with their accumulated sentiment values.
But I'm not sure that this is a good idea. I'm assuming you are looking at bigrams to address problems like negation ("not bad"), and the lexicon approach won't be able to use "not" to flip the sentiment value of "bad".
Textblob also contains an option to use a Naive Bayes classifier instead of the lexicon approach. This is trained on a movie review corpus provided by nltk, but the default features for training are words/unigrams as far as I can make out from peeking at the source code.
You might be able to implement your own feature extractor within there to extract n-grams instead of words, then re-train it accordingly and use it for your data.
Regardless of all that, I would suggest using a combination of unigrams and n>1-grams as features, because dropping unigrams entirely is likely to hurt your performance. Bigrams are much more sparsely distributed, so you'll run into data sparsity problems when training.
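As a rough sketch of that last idea, using TextBlob's NaiveBayesClassifier with a custom feature extractor that emits both unigram and bigram presence features (the tiny training set here is purely illustrative):

from textblob.classifiers import NaiveBayesClassifier

def unigram_bigram_extractor(document):
    # Boolean presence features for every unigram and bigram in the text.
    tokens = document.lower().split()
    feats = {'contains({})'.format(tok): True for tok in tokens}
    for i in range(len(tokens) - 1):
        feats['contains({} {})'.format(tokens[i], tokens[i + 1])] = True
    return feats

train = [
    ("this product is very good", "pos"),
    ("this product is not bad", "pos"),
    ("this product is not good", "neg"),
    ("this product is very bad", "neg"),
]
clf = NaiveBayesClassifier(train, feature_extractor=unigram_bigram_extractor)
print(clf.classify("it is not bad"))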
from gensim import corpora, models, similarities
documents = ["This is a book about cars, dinosaurs, and fences"]
# remove common words and tokenize
stoplist = set('for a of the and to in - , is'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]
# Remove commas
texts[0] = [text.replace(',','') for text in texts[0]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
doc = "I like cars and birds"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]
index = similarities.MatrixSimilarity(lsi[corpus])
sims = index[vec_lsi] # perform a similarity query against the corpus
print(sims)
In the above code I am comparing how much "This is a book about cars, dinosaurs, and fences" is similar to "I like cars and birds" using the cosine similarity technique.
The two sentences have effectively one word in common ("cars"), yet when I run the code I get that they are 100% similar. This does not make sense to me.
Can someone suggest how to improve my code so that I get a reasonable number?
These topic-modelling techniques need varied, realistic data to achieve sensible results. Toy-sized examples of just one or a few texts don't work well, and even when they do, it's often just good luck or contrived suitability.
In particular:
a model with only one example can't sensibly create multiple topics, as there's no contrast-between-documents to model
a model presented with words it hasn't seen before ignores those words, so your test doc appears to it the same as the single word 'cars' – the only word it's seen before
In this case, both your single training document, and the test document, get modeled by LSI as having 0 contribution from the 0th topic, and positive contribution (of different magnitudes) from the 1st topic. Since cosine-similarity merely compares angle, and not magnitude, both docs are along-the-same-line-from-the-origin, and so have no angle-of-difference, and thus similarity 1.0.
But if you had better training data, and more than a single-known-word test doc, you might start to get more sensible results. Even a few dozen training docs, and a test doc with several known words, might help... but hundreds or thousands or tens-of-thousands training-docs would be even better.
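To illustrate, here is a sketch with a few made-up extra training documents (the texts are invented for illustration; exact scores will vary with the data):

from gensim import corpora, models, similarities

documents = [
    "This is a book about cars, dinosaurs, and fences",
    "Dinosaurs roamed the earth long before cars and fences",
    "I read a good book about birds on a rainy day",
    "Birds often sit on fences near the road",
]
stoplist = set('for a of the and to in - , is on i'.split())
texts = [[word.replace(',', '') for word in doc.lower().split() if word not in stoplist]
         for doc in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

doc = "I like cars and birds"
vec_lsi = lsi[dictionary.doc2bow(doc.lower().split())]
index = similarities.MatrixSimilarity(lsi[corpus])
print(index[vec_lsi])  # the similarities should no longer all be 1.0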
I have a large excel file like the following:
Timestamp | Text | Work | Id
5/4/16 17:52 | rain a lot the packs maybe damage. | Delivery | XYZ
5/4/16 18:29 | wh. screen | Other | ABC
5/4/16 14:54 | 15107 Lane Pflugerville, TX customer called me and his phone number and my phone numbers were not masked. thank you customer has had a stroke and items were missing from his delivery the cleaning supplies for his wet vacuum steam cleaner. he needs a call back from customer support | Delivery | YYY
5/6/16 13:05 | How will I know if I | Signing up | ASX
5/4/16 23:07 | an quality | Delivery | DFC
I want to work only on the "Text" column and eliminate those rows that basically just have gibberish in the "Text" column (rows 2, 4 and 5 in the above example).
I'm reading only the 2nd column as follow:
import xlrd

book = xlrd.open_workbook("excel.xlsx")
sheet = book.sheet_by_index(0)
for row_index in range(1, sheet.nrows):  # skip heading row
    timestamp, text = sheet.row_values(row_index, end_colx=2)
    print(text)
How do I remove the gibberish rows? My idea is that I need to work with nltk and have a positive corpus (one without any gibberish) and a negative corpus (containing only gibberish text), and train a model with them. But how do I go about implementing it? Please help!!
You can use nltk to do the following.
>>> import nltk
>>> english_words = set(w.lower() for w in nltk.corpus.words.words())
>>> 'a' in english_words
True
>>> 'dog' in english_words
True
>>> 'asdasdase' in english_words
False
How to get the individual words of a string with nltk:
>>> individual_words_from_string = nltk.word_tokenize('This is my text from text column')
>>> individual_words_from_string
['This', 'is', 'my', 'text', 'from', 'text', 'column']
For each row's text column, test the individual words to see if they are in the English word list. If they all are, you know that row's text column is not gibberish.
If your definition of gibberish vs. non-gibberish differs from the English words found in nltk, you can use the same process as above, just with a different list of acceptable words.
How to accept numbers and street addresses?
Simple way to determine if something is a number:
>>> word = '32423432'
>>> word.isdigit()
True
>>> word = '32423432ds'
>>> word.isdigit()
False
Addresses are more difficult. You can find info on that here: Parsing Addresses, and probably in many other places. Of course, you can always use the above logic if you have access to a list of cities, states, roads, etc.
Will it fail if any single word is not recognised?
It's your code, so you decide. Perhaps you could mark something as gibberish if x% of the words in the text fail the check?
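As a sketch of that percentage idea (the threshold is an arbitrary illustrative choice, and is_gibberish is a hypothetical helper, not part of nltk):

import nltk
# one-time downloads: nltk.download('words'); nltk.download('punkt')

english_words = set(w.lower() for w in nltk.corpus.words.words())

def is_gibberish(text, threshold=0.5):
    # Flag the text when fewer than `threshold` of its alphanumeric
    # tokens are known English words or plain numbers.
    tokens = [tok for tok in nltk.word_tokenize(text) if tok.isalnum()]
    if not tokens:
        return True
    known = sum(1 for tok in tokens if tok.lower() in english_words or tok.isdigit())
    return known / len(tokens) < threshold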
How to determine if grammar is correct?
This is a bigger topic, and a more in-depth explanation can be found at the following link:
Checking Grammar. But the above answer only checks whether words are in the nltk corpus; it does not check whether the sentence is grammatically correct.
Separating good text from 'gibber' is not a trivial task, especially if you are dealing with text messages / chats (that's what it looks like to me).
A misspelled word does not make a sample unusable and even a syntactically wrong sentence should not disqualify the whole text. That's a standard you could use for newspaper texts, but not for raw, user generated content.
I would annotate a corpus in which you separate the good samples from the bad ones and train a simple classifier on it. Annotation does not have to be a big effort, since these gibberish texts are shorter than the good ones and should be easy to recognise (at least some of them). Also, you could start with a corpus size of ~100 datapoints (50 good / 50 bad) and expand it once the first model is more or less working.
This is a sample code that I always use for text classification. You need to install scikit-learn and numpy though:
import re
import random
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
# Prepare data
def prepare_data(data):
    """
    data is expected to be a list of tuples of category and text.
    Returns a tuple of a list of labels and a list of texts.
    """
    random.shuffle(data)
    return zip(*data)
# Format training data
training_data = [
    ("good", "rain a lot the packs maybe damage."),
    ("good", "15107 Lane Pflugerville, TX customer called me and his phone number and my phone numbers were not masked. thank you customer has had a stroke and items were missing from his delivery the cleaning supplies for his wet vacuum steam cleaner. he needs a call back from customer support "),
    ("gibber", "wh. screen"),
    ("gibber", "How will I know if I")
]
training_labels, training_texts = prepare_data(training_data)
# Format test set
test_data = [
    ("gibber", "an quality"),
    ("good", "<datapoint with valid text>"),
    # ...
]
test_labels, test_texts = prepare_data(test_data)
# Create feature vectors
"""
Convert a collection of text documents to a matrix of token counts.
See: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
"""
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(training_texts)
y = training_labels
# Train the classifier
clf = LogisticRegression()
clf.fit(X, y)
# Test performance
X_test = vectorizer.transform(test_texts)
y_test = test_labels
# Generates a list of labels corresponding to the samples
test_predictions = clf.predict(X_test)
# Convert back to the usual format
annotated_test_data = list(zip(test_predictions, test_texts))
# evaluate predictions
y_test = np.array(test_labels)
print(metrics.classification_report(y_test, test_predictions))
print("Accuracy: %0.4f" % metrics.accuracy_score(y_test, test_predictions))
# predict labels for unknown texts
data = ["text1", "text2",]
# Important: use the same vectorizer you used for the training.
# When saving the model (e.g. via pickle) always serialize
# classifier & vectorizer
X = vectorizer.transform(data)
# Now predict the labels for the texts in 'data'
labels = clf.predict(X)
# And put them back together
result = list(zip(labels, data))
# result = [("good", "text1"), ("gibber", "text2")]
A few words about how it works: the count vectorizer tokenizes the text and creates vectors containing the counts for all words in the corpus. Based on these vectors, the classifier tries to recognise patterns to distinguish between the two categories. A text with only a few, uncommon (because misspelled) words is more likely to end up in the 'gibber' category, while a text with a lot of words that are typical of common sentences (think of all the stop words here: 'I', 'you', 'is'...) is more likely to be a good text.
If this method works for you, you should also try other classifiers and use the first model to semi-automatically annotate a larger training corpus.
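For example, a minimal sketch of swapping in a Naive Bayes classifier (MultinomialNB is a common baseline for count features), reusing the vectorizer and the data prepared above:

from sklearn.naive_bayes import MultinomialNB

X_train = vectorizer.transform(training_texts)
nb_clf = MultinomialNB()
nb_clf.fit(X_train, training_labels)
print(metrics.classification_report(y_test, nb_clf.predict(X_test)))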