I have a large excel file like the following:
Timestamp Text Work Id
5/4/16 17:52 rain a lot the packs maybe damage. Delivery XYZ
5/4/16 18:29 wh. screen Other ABC
5/4/16 14:54 15107 Lane Pflugerville,
TX customer called me and his phone
number and my phone numbers were not
masked. thank you customer has had a
stroke and items were missing from his
delivery the cleaning supplies for his
wet vacuum steam cleaner. he needs a
call back from customer support Delivery YYY
5/6/16 13:05 How will I know if I Signing up ASX
5/4/16 23:07 an quality Delivery DFC
I want to work only on the "Text" column and then eliminate those row that have basically just have gibberish in the "Text" column (rows 2,4,5 from the above example).
I'm reading only the 2nd column as follow:
import xlrd
book = xlrd.open_workbook("excel.xlsx")
sheet = book.sheet_by_index(0)
for row_index in xrange(1, sheet.nrows): # skip heading row
timestamp, text = sheet.row_values(row_index, end_colx=2)
text)
print (text)
How do I remove the gibberish rows? I have an idea that I need to work with nltk and have a positive corpus (one that does not have any gibberish), one negative corpus (only having gibberish text), and train my model with it. But how do I go about implementing it? Please help!!
You can use nltk to do the following.
import nltk
english_words = set(w.lower() for w in nltk.corpus.words.words())
'a' in english_words
True
'dog' in english_words
True
'asdasdase' in english_words
False
How to get individual words in nltk from string:
individual_words_front_string = nltk.word_tokenize('This is my text from text column')
individual_words_front_string
['This', 'is,' 'my', 'text', 'from', 'text', 'column']
For each rows text column, test the individual words to see if they are in the english dictionary. If they all are, you know that rows text column us not gibberish.
If your definition of gibberish vs non-gibberish is different than english words found in nltk, you can use the same process above, just with a different list of acceptable words.
How to accept numbers and street addresses?
Simple way to determine if something is a number.
word = '32423432'
word.isdigit()
True
word = '32423432ds'
word.isdigit()
False
Addresses are more difficult. You can find info on that here:Parsing Addresses, and probably many other places. Of course you can always use the above logic if you have access to a list of cities, states, roads...etc.
Will it fail if any one word is False?
It's your code you decide. Perhaps you can mark something as gibberish if x% of words in the text are false?
How to determine if grammar is correct?
This is a bigger topic, and a more in-depth explanation can be found at the following link:
Checking Grammar. But the above answer will just check if words are in the nltk corpus, not whether or not the sentence is grammatically correct.
Separating good text from 'gibber' is not a trivial task, especially if you are dealing with text messages / chats (that's what it looks like to me).
A misspelled word does not make a sample unusable and even a syntactically wrong sentence should not disqualify the whole text. That's a standard you could use for newspaper texts, but not for raw, user generated content.
I would annotate a corpus in which you separate the good samples from the bad ones and train a simple classifier on in. Annotation does not have to be a big effort, since these gibberish texts are shorter than the good ones and should be easy to recognise (at least some). Also, you could try to start with a corpus size of ~100 datapoints (50 good / 50 bad) and expand it when the first model is more or less working.
This is a sample code that I always use for text classification. You need to install scikit-learn and numpy though:
import re
import random
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
# Prepare data
def prepare_data(data):
"""
data is expected to be a list of tuples of category and texts.
Returns a tuple of a list of lables and a list of texts
"""
random.shuffle(data)
return zip(*data)
# Format training data
training_data = [
("good", "rain a lot the packs maybe damage."),
("good", "15107 Lane Pflugerville, TX customer called me and his phone number and my phone numbers were not masked. thank you customer has had a stroke and items were missing from his delivery the cleaning supplies for his wet vacuum steam cleaner. he needs a call back from customer support "),
("gibber", "wh. screen"),
("gibber", "How will I know if I")
]
training_labels, training_texts = prepare_data(training_data)
# Format test set
test_data = [
("gibber", "an quality"),
("good", "<datapoint with valid text>",
# ...
]
test_labels, test_texts = prepare_data(test_data)
# Create feature vectors
"""
Convert a collection of text documents to a matrix of token counts.
See: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
"""
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(training_texts)
y = training_labels
# Train the classifier
clf = LogisticRegression()
clf.fit(X, y)
# Test performance
X_test = vectorizer.transform(test_texts)
y_test = test_labels
# Generates a list of labels corresponding to the samples
test_predictions = clf.predict(X_test)
# Convert back to the usual format
annotated_test_data = list(zip(test_predictions, test_texts))
# evaluate predictions
y_test = np.array(test_labels)
print(metrics.classification_report(y_test, test_predictions))
print("Accuracy: %0.4f" % metrics.accuracy_score(y_test, test_predictions))
# predict labels for unknown texts
data = ["text1", "text2",]
# Important: use the same vectorizer you used for the training.
# When saving the model (e.g. via pickle) always serialize
# classifier & vectorizer
X = vectorizer.transform(data)
# Now predict the labels for the texts in 'data'
labels = clf.predict(X)
# And put them back together
result = list(zip(labels, data))
# result = [("good", "text1"), ("gibber", "text2")]
A few words about how it works: The count vectorizer tokenizes the text and creates vectors containing the counts for all words in the corpus. Based upon these vectors, the classifier tries to recognise patters to distinguish between both categories. A text with only a few and uncommon (b/c misspelled) words would rather be in the 'gibber' category, while a text with a lot of words that are typical for common sentences (think of all the stop words here: 'I', 'you', 'is'... ) is more prone to be a good text.
If this method works for you, you should also try other classifiers and use the first model to semi-automatically annotate a larger training corpus.
Related
I am trying to train a text classifier using sklearn's CountVectorizer. The problem is that my training documents have many tokens that are document-specific. So for example there are regular english words that the CountVectorizer.fit_transform method works perfectly well on, but then there are some tokens that are formatted that would fit the regex: '\w\d\d\w\w\d', such as 'd84ke2'. As it is now, the fit_transform method would just take 'd84ke2' at face value and use that as a feature.
I want to be able to use those specific tokens that match that specific regex as their own feature, and leave the regular english words as their own features, since creating a feature such as 'd84ke2' would be useless as this will not come up again in any other document.
I've yet to find a way to do this, much less the "best" way. Below is an example of code I have, where you can see that the tokens 'j64ke2', 'r32kl4', 'w35kf9', and 'e93mf9' are all turned into their own features. I repeat for clarity: I want to basically condense these features into one and keep the others.
docs = ['the quick brown j64ke2 jumped over the lazy dogs r32kl4.',
'an apple a day keeps the w35kf9 away',
'you got the lions share of the e93mf9']
import numpy as np
# define target and target_names
target_names = ['zero', 'one', 'two']
target = np.array([0, 1, 2])
# Create message bunch.
from sklearn.utils import Bunch
doc_info = Bunch(data=docs, target=target, target_names=target_names)
# Vectorize training data
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
count_vect.fit(doc_info.data)
vocab = count_vect.vocabulary_
vocab_keys = list(vocab.keys())
#vocab_vals = list(vocab.values())
X_train_counts = count_vect.transform(doc_info.data)
X = X_train_counts.toarray()
import pandas as pd
df = pd.DataFrame(X, columns=vocab_keys)
yatu's comment is a good solution. I was able to clean the document before feeding it to CountVectorizer by substituting a word for each regex that matched.
I have provided an updated solution to the following text pre-processing. I was stumbled into an issue to categorize a sentence either positive, negative or uncertain. I mean, I was not able to create a tuple that Naive Bayes requires for classification and prediction. My updated solution below provides not so efficient but easier way.
I am doing a text preprocessing for Naive Bayes Classifier//Sentiment. I have pandas data frame of data I generated
risk
uncertain
somewhere
cautious
assumptions
somewhat
random
I want to compare that list to for an example text
"The market is uncertain. There is volatility. Investors need to be cautious. It is not just a random walk phenomena."
What I want to do is:
From the list above compare find if the word exists in the text below and result like
"Text", "Label"
"The market is uncertain." "uncertain"
"There is volatility.", "negative"
etc
any thoughts?
UPDATE: TEST SOLUTION TO CATEGORIZE
text = "It was the best article. In fact we have assume. Ints the fact's and worse?. presume. pending"
from nltk.tokenize import word_tokenize, sent_tokenize
word_tokenize_list = word_tokenize(text)
sent_tokenize_list=sent_tokenize(text)
for sentIndex, sentence in enumerate(sent_tokenize_list):
word_tokenize_list = word_tokenize(sentence)
for wordIndex, word in enumerate(word_tokenize_list):
for j in negativeList():
if j ==word:
negativeSentence= sentence
print "neg", negativeSentence
break
for j in postiveList():
if j ==word:
positiveSentence= sentence
print "posi", positiveSentence
break
for j in uncerList():
if j ==word:
uncertainSentence= sentence
print "uncer",uncertainSentence
break
uncerList(), postiveList() and negativeList() are just a list of the words that I classify and positive, negative and uncertain.
I did not use the text blob I used to ask the question above. But that worked for me as well!
from gensim import corpora, models, similarities
documents = ["This is a book about cars, dinosaurs, and fences"]
# remove common words and tokenize
stoplist = set('for a of the and to in - , is'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
for document in documents]
# Remove commas
texts[0] = [text.replace(',','') for text in texts[0]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
doc = "I like cars and birds"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]
index = similarities.MatrixSimilarity(lsi[corpus])
sims = index[vec_lsi] # perform a similarity query against the corpus
print(sims)
In the above code I am comparing how much "This is a book about cars, dinosaurs, and fences" is similar to "I like cars and birds" using the cosine similarity technique.
The two sentences have effectively 1 words in common, which is "cars", however when I run the code I get that they are 100% similar. This does not make sense to me.
Can someone suggest how to improve my code so that I get a reasonable number?
These topic-modelling techniques need varied, realistic data to achieve sensible results. Toy-sized examples of just one or a few text examples don't work well – and even if they do, it's often just good luck or contrived suitability.
In particular:
a model with only one example can't sensibly create multiple topics, as there's no contrast-between-documents to model
a model presented with words it hasn't seen before ignores those words, so your test doc appears to it the same as the single word 'cars' – the only word it's seen before
In this case, both your single training document, and the test document, get modeled by LSI as having 0 contribution from the 0th topic, and positive contribution (of different magnitudes) from the 1st topic. Since cosine-similarity merely compares angle, and not magnitude, both docs are along-the-same-line-from-the-origin, and so have no angle-of-difference, and thus similarity 1.0.
But if you had better training data, and more than a single-known-word test doc, you might start to get more sensible results. Even a few dozen training docs, and a test doc with several known words, might help... but hundreds or thousands or tens-of-thousands training-docs would be even better.
As training data, have reviews of restaurants in XML, with associated target expression a sentiment is being expressed toward, a category which is a discrete label this belongs to and the polarity expressed toward this:
<text>With the great variety on the menu , I eat here often and never get bored .</text>
<Opinions>
<Opinion target="menu" category="FOOD#STYLE_OPTIONS" polarity="positive" from="30" to="34"/>
</Opinions>
I have used the TextBlob NB classifier to train targets terms to associated categories.
For test data, my aim is to predict the target expression, given a sentence and the category. I have first extracted nouns and noun phrases from the sentence, assuming the expression will be a subset of these. For the sentence:
"what may be interesting to most is the worst sevice attitude come from the owner of this establishment", these are ['sevice attitude', 'owner', 'establishment'].
I would like to know which of these is most likely given the category, which in this case is SERVICE#GENERAL. How could I go about this?
TextBlob's NB classifier by default extracts the text features as a bag of words. So you can simply concatenate the words in the list of extracted nouns and then concatenate it with the category to use the result as the training text. And use the target as the training label.
Considering the bag of words treat words independently, you should tranform these noun phrases in just one word. You can put a '-' instead of space, for example ('sevice attitude' would be 'sevice-attitude').
Example:
from textblob.classifiers import NaiveBayesClassifier
train = [('sevice-attitude owner establishment SERVICE#GENERAL', 'owner'),
('menu variety FOOD#STYLE_OPTIONS', 'menu')]
cl = NaiveBayesClassifier(train)
If you want you can customize the feature extraction: https://textblob.readthedocs.io/en/dev/classifiers.html#feature-extractors
I'm using scikit-learn to extract text features from a "bag of words" text (text tokenized on single words).
To do so, I'm using a TfidfVectorizer to also reduce the weight of very frequent words (ie: "a", "the", etc).
text = 'Some text, with a lot of words...'
tfidf_vectorizer = TfidfVectorizer(
min_df=1, # min count for relevant vocabulary
max_features=4000, # maximum number of features
strip_accents='unicode', # replace all accented unicode char
# by their corresponding ASCII char
analyzer='word', # features made of words
token_pattern=r'\w{4,}', # tokenize only words of 4+ chars
ngram_range=(1, 1), # features made of a single tokens
use_idf=True, # enable inverse-document-frequency reweighting
smooth_idf=True, # prevents zero division for unseen words
sublinear_tf=False)
# vectorize and re-weight
desc_vect = tfidf_vectorizer.fit_transform([text])
I would now like to be able to link each predicted feature with its corresponding tfidf float value, storing it in a dict
{'feature1:' tfidf1, 'feature2': tfidf2, ...}
I achieved it by using
d = dict(zip(tfidf_vectorizer.get_feature_names(), desc_vect.data))
I would like to know if there was a better, scikit-learn native way to do such a thing.
Thank you very much.
For a single document, this should be fine. An alternative, that works when the document set is small, is this recipe of mine that uses Pandas.
If you want to do this for multiple documents, then you can adapt the code in DictVectorizer.inverse_transform:
desc_vect = desc_vect.tocsr()
n_docs = desc_vect.shape[0]
tfidftables = [{} for _ in xrange(n_docs)]
terms = tfidf_vectorizer.get_feature_names()
for i, j in zip(*desc_vect.nonzero()):
tfidftables[i][terms[j]] = X[i, j]