I want to classify the rows of a text column using an SVM. I can find plenty of content online that produces graphs or prints prediction accuracy, but I cannot find a way to print the predicted class for each row. The example below better explains what I am trying to do:
I have a dataframe to be used as the training dataset:
import pandas as pd
train_data = {'Serial': [1,2,3,4,5,6,7,8,9,10],
'Text': ['Dog is a faithful animal','cat are not reliable','Tortoise can live a long life',
'camel stores water in its hump','horse are used as means of transport','pen is a powerful weapon',
'stop when the signal is red','oxygen is a life gas','chocolates are bad for health','lets grab a cup of coffee'],
'classification':['Animal','Animal','Animal','Animal','Animal','Thing','Thing','Miscellenous','Thing','Thing']
}
df = pd.DataFrame(train_data, columns = ['Serial', 'Text', 'classification'])
print (df)
I want to predict whether each text row is talking about Animal, Thing, or Miscellenous. The test data I want to pass is:
test_data = {'Serial': [1,2,3,4,5],
'Text': ['Is this your dog?','Lets talk about the problem','You have a cat eye',
'Donot forget to take the camel ride when u goto dessert','Plants give us O2']
}
df = pd.DataFrame(test_data, columns = ['Serial', 'Text'])
The expected result is an additional column 'Classification' added to the test dataframe with the values ['Animal','Miscellenous','Animal','Animal','Miscellenous'].
Here is the solution to your problem:
# import tfidf-vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# import support vector classifier
from sklearn.svm import SVC
import pandas as pd
train_data = {'Serial': [1,2,3,4,5,6,7,8,9,10],
'Text': ['Dog is a faithful animal','cat are not reliable','Tortoise can live a long life',
'camel stores water in its hump','horse are used as means of transport','pen is a powerful weapon',
'stop when the signal is red','oxygen is a life gas','chocolates are bad for health','lets grab a cup of coffee'],
'classification':['Animal','Animal','Animal','Animal','Animal','Thing','Thing','Miscellenous','Thing','Thing']
}
train_df = pd.DataFrame(train_data, columns = ['Serial', 'Text', 'classification'])
display(train_df)
test_data = {'Serial': [1,2,3,4,5],
'Text': ['Is this your dog?','Lets talk about the problem','You have a cat eye',
'Donot forget to take the camel ride when u goto dessert','Plants give us O2']
}
test_df = pd.DataFrame(test_data, columns = ['Serial', 'Text'])
display(test_df)
# Load training data (text) from the dataframe and form to a list containing all the entries
training_data = train_df['Text'].tolist()
# Load training labels from the dataframe and form to a list as well
training_labels = train_df['classification'].tolist()
# Load testing data from the dataframe and form a list
testing_data = test_df['Text'].tolist()
# Get a tfidf vectorizer to process the text into vectors
vectorizer = TfidfVectorizer()
# Fit the tfidf-vectorizer to training data and transform the training text into vectors
X_train = vectorizer.fit_transform(training_data)
# Transform the testing text into vectors
X_test = vectorizer.transform(testing_data)
# Get the SVC classifier
clf = SVC()
# Train the SVC with the training data (data points and labels)
clf.fit(X_train, training_labels)
# Predict the test samples
print(clf.predict(X_test))
# Add classification results to test dataframe
test_df['Classification'] = clf.predict(X_test)
# Display test dataframe
display(test_df)
As an explanation for the approach:
You have your training data and want to use it to train an SVM and then predict labels for the test data.
That means you need to extract the training data and the label for each data point (so for each phrase, you need to know whether it is an animal, a thing, etc.), and then you need to set up and train an SVM. Here, I used the implementation from scikit-learn.
Moreover, you can't train the SVM on raw text, because it requires numerical input. This means you need to transform the text data into numbers. This is feature extraction from text, and one common approach is Term Frequency-Inverse Document Frequency (TF-IDF) weighting.
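As a quick illustration of what the vectorizer produces (a minimal sketch; the exact vocabulary depends on the default tokenizer, which drops single-character tokens like 'a'):
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Dog is a faithful animal", "pen is a powerful weapon"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# the vocabulary learned from the corpus
print(vec.get_feature_names())
# one row of TF-IDF weights per document, one column per vocabulary term
print(X.shape)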
Now you can use a vector representation of each phrase coupled with a label for it to train the SVM and then use it to classify the test data :)
In short the steps are:
Extract data points and labels from the training data
Extract data points from the testing data
Set up the TF-IDF vectorizer and fit it to the training data
Transform the training and testing data with the TF-IDF vectorizer
Set up the SVM classifier
Train the SVM classifier
Classify the test data with the trained classifier
I hope this helps!
I have several logs that are manipulated by two different TfidfVectorizer objects. The first one reads the log and splits it into n-grams:
with open("my/log/path.txt", "r") as test:
    corpus = [test.read()]

tf = TfidfVectorizer(ngram_range=ngr)  # ngr is defined below as (1, 2)
corpus_transformed = tf.fit_transform(corpus)
The resulting data is written to a Pandas dataframe with 4 columns (score [float], review [n-grams of text], isbad [0/1], kfold [int]); the initial kfold value is -1. Here I have:
my_df = pd.DataFrame(corpus_transformed.toarray(), index=['score'], columns=tf.get_feature_names()).transpose()
For cross-validation I split the dataset into train and test sets with StratifiedKFold, doing a simple:
ngr = (1, 2)

for fold_ in range(5):
    # 'reviews' column has short sentences expressing an opinion
    train_df = df[df.kfold != fold_].reset_index(drop=True)
    test_df = df[df.kfold == fold_].reset_index(drop=True)
    new_tf = TfidfVectorizer(ngram_range=ngr)
    new_tf.fit(train_df.reviews)
    xtrain = new_tf.transform(train_df.reviews)
    xtest = new_tf.transform(test_df.reviews)
And only after this double tfidf transformation do I fit my SVC model with:
model.fit(xtrain, train_df.isbad) # where 'isbad' column is 0 if negative and 1 if positive
preds = model.predict(xtest)
accuracy = metrics.accuracy_score(test_df.isbad, preds)
So at the end of the day I have a model that classifies reviews into two classes (negative = 0, positive = 1). I dump the model and both tfidf vectorizers (tf and new_tf), but when it comes to new data, even if I do:
with open("never/seen/data.txt", "r") as unseen: # load REAL SAMPLE data
corpus = [unseen.read()]
# to transform the unseen data I use one of the dumped tfidf's obj
corpus_transformed = tf_dump.transform(corpus)
unseen.close()
my_unseen_df = pd.DataFrame(corpus_transformed.toarray(), index=['score'], columns=tf_dump.get_feature_names()).transpose()
my_unseen_df = my_unseen_df.sample(frac=1).reset_index(drop=True) # randomize rows
# to transform reviews' data that are going to be classified I use the new_tf dump, like before
X = new_tf_dump.transform(my_unseen_df.reviews)
# use the previously loaded model and make predictions
res = model_dump.predict(X)
#print(res)
I got ValueError: X has 604969 features, but SVC is expecting 605424 features as input. But how is that possible if I manipulate the data with the same objects? What am I doing wrong here?
I want to use my trained model as a classifier for new, unseen data. Isn't this the right way to go?
Thank you.
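For what it's worth, here is a minimal sketch of where such a mismatch comes from: transform always produces exactly as many columns as the vocabulary of the vectorizer it was fitted with, so a model can only accept vectors produced by the one fitted vectorizer it was trained against.
from sklearn.feature_extraction.text import TfidfVectorizer

tf_a = TfidfVectorizer().fit(["one small corpus of text"])
tf_b = TfidfVectorizer().fit(["a different and slightly larger corpus of text"])

# the same input maps to vectors of different widths,
# because each fitted vectorizer has its own vocabulary
print(tf_a.transform(["any text"]).shape[1])  # == len(tf_a.vocabulary_)
print(tf_b.transform(["any text"]).shape[1])  # == len(tf_b.vocabulary_)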
I have a dataset with multiple features and I am trying to build an SVM model to classify new entries based on these features. I chose CountVectorizer to convert the text data into numerical data for training. I understand how to train a model on each feature separately, but I'm having difficulty understanding how to train on them together.
   Category                                    Lyric         Song_title
0      Rock    Master of puppets pulling the strings  Master of puppets
1      Rock             Let the bodies hit the floor             Bodies
2       Pop  dreaming about the things we could be.      Counting Stars
3       Pop        Im glad you came Im glad you came               NULL

[2000 rows x 3 columns]
To simplify certain steps, I decided to use built-in functions to generate the datasets.
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier  # used further below

data = pd.read_excel('./music_data.xlsx', 0)
train_data, test_data = train_test_split(data, test_size=0.53)
As both columns contain null values, I thought to separate the columns into two training sets and train a model on each with the associated categories.
lyric_train = train_data[~pd.isnull(train_data['Lyric'])]
lyric_test = test_data[~pd.isnull(test_data['Lyric'])]
vectorizer_lyric = CountVectorizer(analyzer='word', ngram_range=(1, 5))
vc_lyric = vectorizer_lyric.fit_transform(lyric_train['Lyric'])
song_title_train = train_data[~pd.isnull(train_data['Song_title'])]
song_title_test = test_data[~pd.isnull(test_data['Song_title'])]
vectorizer_song = CountVectorizer(analyzer='word', ngram_range=(1, 5))
vc_song = vectorizer_song.fit_transform(song_title_train['Song_title'])
Then I build the models and try to combine them using a stacking classifier.
# Train for lyric feature
model_lyric = svm.SVC()
model_lyric.fit(vc_lyric, lyric_train['Category'])
features_test_lyric = vectorizer_lyric.transform(lyric_test['Lyric'])
print(model_lyric.score(features_test_lyric, lyric_test['Category']))
# train for Song Title feature
model_song = svm.SVC()
model_song.fit(vc_song, song_title_train['Category'])
features_test_song = vectorizer_song.transform(song_title_test['Song_title'])
print(model_song.score(features_test_song, song_title_test['Category']))
# Combine SVM models
estimators = [('lyric_svm',model_lyric),
('song_svm',model_song)]
stack_model = StackingClassifier(estimators=estimators,final_estimator=LogisticRegression())
From reading up online, this is not the correct way to do it, as StackingClassifier appears to combine multiple models trained on the same dataset and features, whereas I have separated the features into two CountVectorizers.
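For what it's worth, a common alternative (a minimal sketch, assuming the NULL entries are filled with empty strings first) is to vectorize each text column separately inside one pipeline with ColumnTransformer, so a single SVC sees both feature blocks side by side:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import svm

# CountVectorizer cannot handle NaN, so fill missing text with empty strings
train_data = train_data.fillna({'Lyric': '', 'Song_title': ''})
test_data = test_data.fillna({'Lyric': '', 'Song_title': ''})

# one vectorizer per text column; passing a column name (not a list)
# hands each vectorizer a 1-D array of strings, as it expects
features = ColumnTransformer([
    ('lyric', CountVectorizer(analyzer='word', ngram_range=(1, 5)), 'Lyric'),
    ('title', CountVectorizer(analyzer='word', ngram_range=(1, 5)), 'Song_title'),
])

pipe = Pipeline([('features', features), ('svc', svm.SVC())])
pipe.fit(train_data[['Lyric', 'Song_title']], train_data['Category'])
print(pipe.score(test_data[['Lyric', 'Song_title']], test_data['Category']))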
After training a classifier, I tried passing a few sentences to check whether it classifies them correctly. During that testing, the results do not look right, so I suppose some variables are not correct.
Explanation
I have a dataframe called df that looks like this:
                                                news         type
0  From: mathew <mathew@mantis.co.uk>\n Subject: ...  alt.atheism
1  From: mathew <mathew@mantis.co.uk>\n Subject: ...    alt.space
2  From: I3150101@dbstu1.rz.tu-bs.de (Benedikt Ro...     alt.tech
...
#each row in the news column is a document
#each row in the type column is the category of that document
Preprocessing:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn import metrics
vectorizer = TfidfVectorizer(stop_words='english')
vectors = vectorizer.fit_transform(df.news)
clf = SVC(C=10,gamma=1,kernel='rbf')
clf.fit(vectors, df.type)
vectors_test = vectorizer.transform(df_test.news)
pred = clf.predict(vectors_test)
Attempt to check how some sentences are classified
texts = ["The space shuttle is made in 2018",
"stars are shining",
"galaxy"]
text_features = vectorizer.transform(texts)
predictions = clf.predict(text_features)
for text, predicted in zip(texts, predictions):
    print('"{}"'.format(text))
    print(" - Predicted as: '{}'".format(df.type[pred]))
    print("")
The problem is that it returns this:
"The space shuttle is made in 2018"
- Predicted as: 'alt.atheism NaN
alt.atheism NaN
alt.atheism NaN
alt.atheism NaN
alt.atheism NaN
What do you think?
EDIT
Example
This is roughly how it should look:
>>> docs_new = ['God is love', 'OpenGL on the GPU is fast']
>>> X_new_counts = count_vect.transform(docs_new)
>>> X_new_tfidf = tfidf_transformer.transform(X_new_counts)
>>> predicted = clf.predict(X_new_tfidf)
>>> for doc, category in zip(docs_new, predicted):
... print('%r => %s' % (doc, twenty_train.target_names[category]))
...
'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics
As you mentioned in the comments, you have around 700 samples. To test how well your classifier works, you should always split your data into training and test samples, for example 500 samples for training and 200 for testing. You should then use only the training samples for training and the test samples for testing. Test data created by hand, as you did, is not necessarily meaningful. sklearn comes with a handy function to separate data into training and test sets:
#separate training and test data, 20% of your data is selected as test data
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.2)
vectors = vectorizer.fit_transform(df_train.news)
clf = SVC(C=10,gamma=1,kernel='rbf')
#train classifier
clf.fit(vectors, df_train.type)
#test classifier on the test set
vectors_test = vectorizer.transform(df_test.news)
pred = clf.predict(vectors_test)
#print precision, recall and f1-score for your classifier
from sklearn.metrics import classification_report
print(classification_report(df_test.type, pred))
This will give you a hint of how good your classifier actually is. If you think it is not good enough, you should try another classifier, for example logistic regression. Or you could convert your data to all lower-case letters and see if this helps to improve your accuracy.
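For example, swapping in logistic regression is a small change (a minimal sketch reusing the vectors from above; max_iter is raised because the solver may need more iterations on sparse text features):
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000)
clf.fit(vectors, df_train.type)
pred = clf.predict(vectors_test)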
Edit:
You can also write your predictions back to your test dataframe:
df_test['Predicted'] = pred
df_test.head()
This will help you to see a pattern. Is everything actually predicted as alt.atheism, as your example suggests?
The data with which you train your classifier is significantly different from the phrases you test it on. As you mentioned in your comment on my first answer, you get an accuracy of more than 90%, which is pretty good. But you taught your classifier to classify mailing list items, which are long documents with e-mail addresses in them. Your phrases, such as "The space shuttle is made in 2018", are pretty short and do not contain e-mail addresses. It's possible that your classifier uses those e-mail addresses to classify the documents, which would explain the good results. You can test whether that is really the case by removing the e-mail addresses from the data before training.
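A minimal sketch of that test (the regex is an assumption; real address patterns vary):
import re

# crude pattern for anything that looks like an e-mail address
email_re = re.compile(r'\S+@\S+')

df['news'] = df['news'].str.replace(email_re, ' ', regex=True)
# ...then redo the train/test split and training as above and compare accuracy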
So, basically I have a corpus of 350 text files (350 rows) and I made an ML model to predict the gender of an author based on the SMS messages in each text file.
After preprocessing is done, these are my final lines of code ('Joined' is the preprocessed column in dataframe df):
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

y = df['Gender']
X_train, X_test, y_train, y_test = train_test_split(
    df['Joined'], y,
    test_size=0.20, random_state=53)

count_vectorizer = CountVectorizer(stop_words='english')
count_train = count_vectorizer.fit_transform(X_train.values)
count_test = count_vectorizer.transform(X_test.values)
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
nb_classifier = MultinomialNB()
nb_classifier.fit(count_train, y_train)
pred = nb_classifier.predict(count_test)
metrics.accuracy_score(y_test, pred)
Now I have a new test corpus which has 150 text files (150 rows), and I have to predict the gender of these files based on my previous model.
I have made a new dataframe called newdf and preprocessed the test corpus files into a column called new_test which has 150 rows.
Now how can I use my previous nb_classifier model on this new_test column?
Assuming you have pre-processed new_test the same way you did count_test, you would simply call nb_classifier.predict or predict_proba and pass in your new_test array.
I prefer predict_proba as it returns the probability of each class rather than a single prediction.
Update Per Comments
It would appear you have a dimensionality issue. When you train your MultinomialNB classifier, it can only process data with the same dimensions as the data it was trained on. For example:
You created training data with n samples and m features using CountVectorizer. Any data passed into your classifier must also have m features, or the classifier will not know how to handle the discrepancy.
As such, it is critical that when you use a CountVectorizer for pre-processing, you also use that same fitted instance to transform any data you wish to predict on.
In code:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

df = pd.DataFrame({
    'joined': [
        'a sentence', 'This is some great food',
        'the quick red fox jumped over the lazy brown dog'],
    'label': ['M', 'F', 'M']})

df2 = pd.DataFrame({
    'new_text': [
        'a differenct sentence',
        'something entirely different that hasnt been seen before',
        'fox and dog'],
    'label': ['M', 'M', 'F']})

count_vectorizer = CountVectorizer(stop_words='english')
count_train = count_vectorizer.fit_transform(df.joined.values)

nb_classifier = MultinomialNB()
nb_classifier.fit(count_train, df.label)

# transform the new data with the SAME fitted vectorizer, then predict
new_test = count_vectorizer.transform(df2.new_text.values)
nb_classifier.predict_proba(new_test)
array([[0.27272727, 0.72727273],
[0.33333333, 0.66666667],
[0.2195122 , 0.7804878 ]])
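The columns returned by predict_proba are ordered like nb_classifier.classes_, so you can map each probability back to its label, for example:
import numpy as np

probs = nb_classifier.predict_proba(new_test)
# pair each probability column with its class label
for row in probs:
    print(dict(zip(nb_classifier.classes_, np.round(row, 3))))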
I am trying out this code
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

train_data = ["football is the sport", "gravity is the movie", "education is imporatant"]
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')

print("Applying first train data")
X_train = vectorizer.fit_transform(train_data)
print(vectorizer.get_feature_names())

print("\n\nApplying second train data")
train_data = ["cricket", "Transformers is a film", "AIMS is a college"]
X_train = vectorizer.transform(train_data)
print(vectorizer.get_feature_names())

print("\n\nApplying fit transform onto second train data")
X_train = vectorizer.fit_transform(train_data)
print(vectorizer.get_feature_names())
The output for this one is:
Applying first train data
['education', 'football', 'gravity', 'imporatant', 'movie', 'sport']

Applying second train data
['education', 'football', 'gravity', 'imporatant', 'movie', 'sport']

Applying fit transform onto second train data
['aims', 'college', 'cricket', 'film', 'transformers']
I gave the first set of data to the vectorizer using fit_transform, so it gave me feature names like ['education', 'football', 'gravity', 'imporatant', 'movie', 'sport']. After that I applied another training set to the same vectorizer, but it gave me the same feature names since I didn't use fit or fit_transform. I want to know how to update the features of a vectorizer without overwriting the previous ones. If I use fit_transform again, the previous features get overwritten, so I want to update the feature list of the vectorizer instead. I want something like ['education', 'football', 'gravity', 'imporatant', 'movie', 'sport', 'aims', 'college', 'cricket', 'film', 'transformers']. How can I get that?
In sklearn terminology, this is called a partial fit, and you can't do it with a TfidfVectorizer. There are two ways around this:
Concatenate the two training sets and re-vectorize.
Use a HashingVectorizer, which supports partial fitting. However, it does not have a get_feature_names method, because it hashes features instead of keeping a vocabulary, so the original tokens aren't recoverable. Another advantage is that it is much more memory efficient (a minimal sketch of this approach follows the example output below).
Example of the first approach:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
train_data1 = ["football is the sport", "gravity is the movie", "education is important"]
vectorizer = TfidfVectorizer(stop_words='english')
print("Applying first train data")
X_train = vectorizer.fit_transform(train_data1)
print(vectorizer.get_feature_names())
print("\n\nApplying second train data")
train_data2 = ["cricket", "Transformers is a film", "AIMS is a college"]
X_train = vectorizer.transform(train_data2)
print(vectorizer.get_feature_names())
print("\n\nApplying fit transform onto second train data")
X_train = vectorizer.fit_transform(train_data1 + train_data2)
print(vectorizer.get_feature_names())
Output:
Applying first train data
['education', 'football', 'gravity', 'important', 'movie', 'sport']
Applying second train data
['education', 'football', 'gravity', 'important', 'movie', 'sport']
Applying fit transform onto second train data
['aims', 'college', 'cricket', 'education', 'film', 'football', 'gravity', 'important', 'movie', 'sport', 'transformers']
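And here is a minimal sketch of the second approach; HashingVectorizer is stateless, so transform works without refitting, and the output width is fixed by n_features rather than by a learned vocabulary:
from sklearn.feature_extraction.text import HashingVectorizer

# n_features fixes the output width up front; there is no vocabulary to update
hv = HashingVectorizer(n_features=2**10, stop_words='english')

X1 = hv.transform(["football is the sport", "gravity is the movie"])
X2 = hv.transform(["cricket", "Transformers is a film"])
print(X1.shape, X2.shape)  # both matrices have 1024 columns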
I found this question while googling for the same issue that the OP raised. As mbatchkarov said, Scikit-Learn's TfidfVectorizer doesn't natively support partial fitting.
HashingVectorizer is usually a great alternative, but it really depends on your use case: specifically, if you care very much about representing infrequent terms precisely, then hash collisions will hurt performance.
So I went ahead and wrote my own implementation of partial_fit for both TfidfVectorizer and CountVectorizer (see here). I hope it's useful for other people reaching this post. Note that this kind of partial fitting does change the dimension of the vectorizer's output, since the whole point is to update the vocabulary, so take this into account when using it in a pipeline.
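To illustrate the rough idea for a CountVectorizer (a naive sketch of my own, not the linked implementation; a TfidfVectorizer would additionally need its idf statistics updated):
from sklearn.feature_extraction.text import CountVectorizer

def partial_fit_vocab(cv, new_docs):
    # extend an already-fitted CountVectorizer's vocabulary in place;
    # transform will then emit wider vectors, one column per known token
    analyzer = cv.build_analyzer()
    for doc in new_docs:
        for token in analyzer(doc):
            if token not in cv.vocabulary_:
                cv.vocabulary_[token] = len(cv.vocabulary_)

cv = CountVectorizer().fit(["football is the sport"])
partial_fit_vocab(cv, ["cricket is a game"])
print(cv.transform(["cricket football"]).shape[1])  # width grew with the vocabulary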