I have a dataset with multiple features and I am trying to build an SVM model to classify new entries based on these features. To do this, I chose to use CountVectorizer to convert the text data into numerical data for training. I understand how to train a model on each feature separately, but I'm having difficulty understanding how to train on them together.
Category  Lyric                                    Song_title
Rock      Master of puppets pulling the strings    Master of puppets
Rock      Let the bodies hit the floor             Bodies
Pop       dreaming about the things we could be.   Counting Stars
Pop       Im glad you came Im glad you came        NULL

[2000 rows x 3 columns]
To simplify certain steps, I decided to use built-in functions to generate the datasets.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier

data = pd.read_excel('./music_data.xlsx', sheet_name=0)
train_data, test_data = train_test_split(data, test_size=0.53)
As both columns contain null values, I thought to separate the columns into two training sets and train a model on each with the associated categories.
lyric_train = train_data[~pd.isnull(train_data['Lyric'])]
lyric_test = test_data[~pd.isnull(test_data['Lyric'])]
vectorizer_lyric = CountVectorizer(analyzer='word', ngram_range=(1, 5))
vc_lyric = vectorizer_lyric.fit_transform(lyric_train['Lyric'])
song_title_train = train_data[~pd.isnull(train_data['Song_title'])]
song_title_test = test_data[~pd.isnull(test_data['Song_title'])]
vectorizer_song = CountVectorizer(analyzer='word', ngram_range=(1, 5))
vc_song = vectorizer_song.fit_transform(song_title_train['Song_title'])
Then I build the models and try to combine them using a stacking classifier.
# Train for lyric feature
model_lyric = svm.SVC()
model_lyric.fit(vc_lyric, lyric_train['Category'])
features_test_lyric = vectorizer_lyric.transform(lyric_test['Lyric'])
model_lyric.score(features_test_lyric, lyric_test['Category'])
# train for Song Title feature
model_song = svm.SVC()
model_song.fit(vc_song, song_title_train['Category'])
features_test_song = vectorizer_song.transform(song_title_test['Song_title'])
model_song.score(features_test_song, song_title_test['Category'])
# Combine SVM models
estimators = [('lyric_svm', model_lyric),
              ('song_svm', model_song)]
stack_model = StackingClassifier(estimators=estimators,final_estimator=LogisticRegression())
From reading up online, this is not the correct way to do it, as StackingClassifier appears to combine multiple models trained on the same dataset and features, whereas I separated the features so each CountVectorizer could be fit on its own column.
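For context, the kind of single-pipeline approach I keep seeing suggested looks roughly like this (a sketch only, not something I have verified on this data; it assumes the nulls can simply be filled with empty strings):
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import svm

# Sketch: vectorize each text column separately, but inside one pipeline,
# so a single SVC sees both sets of features at once.
filled_train = train_data.fillna({'Lyric': '', 'Song_title': ''})
filled_test = test_data.fillna({'Lyric': '', 'Song_title': ''})

preprocess = ColumnTransformer([
    ('lyric_bow', CountVectorizer(analyzer='word', ngram_range=(1, 5)), 'Lyric'),
    ('title_bow', CountVectorizer(analyzer='word', ngram_range=(1, 5)), 'Song_title'),
])
pipe = Pipeline([('features', preprocess), ('clf', svm.SVC())])
pipe.fit(filled_train, filled_train['Category'])
print(pipe.score(filled_test, filled_test['Category']))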
I want to use RandomForestClassifier for sentiment classification. The x contains string text data, so I used LabelEncoder to convert the strings. y contains numeric data. And my code is this:
import pandas as pd
import numpy as np
from sklearn.model_selection import *
from sklearn.ensemble import *
from sklearn import *
from sklearn.preprocessing import LabelEncoder
data = pd.read_csv('data.csv')
x = data['Reviews']
y = data['Ratings']
le = LabelEncoder()
x_encoded = le.fit_transform(x)
x_train, x_test, y_train, y_test = train_test_split(x_encoded,y, test_size = 0.2)
x_train = x_train.reshape(-1,1)
x_test = x_test.reshape(-1,1)
clf = RandomForestClassifier(n_estimators=100)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
Then I printed out the accuracy like below:
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
And here's the output:
Accuracy: 0.5975
I have read that Random Forests have high accuracy because of the number of decision trees participating in the process. But I think the accuracy is much lower than it should be. I have looked for similar questions on Stack Overflow, but I couldn't find a solution to my problem.
Is there any problem in my code using the Random Forest library? Or are there any exceptional cases to be aware of when using Random Forests?
It is not a problem with Random Forests or the library; it is rather a problem with how you transform your text input into a feature vector.
What LabelEncoder does is this: given some labels like ["a", "b", "c"], it transforms them into numeric values between 0 and n-1, with n being the number of distinct input labels. However, I assume Reviews contains texts and not pure labels, so to speak. This means all your reviews (if not 100% identical) are transformed into different labels, which eventually leads to your classifier doing essentially random things given that input. So you need something different to transform your textual input into a numeric input that Random Forests can work with.
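For instance, a quick illustration of what LabelEncoder produces:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
print(le.fit_transform(["a", "b", "c", "a"]))  # [0 1 2 0] - one integer per distinct label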
As a simple start, you can try something like TF-IDF or a simple count vectorizer. Both are available in sklearn: https://scikit-learn.org/stable/modules/feature_extraction.html, section 6.2.3 "Text feature extraction". There are more sophisticated ways of transforming texts into numeric vectors, but that should be a good start for understanding what has to happen conceptually.
A last important note is that you fit those vectorizers only on the training set, not on the full dataset. Otherwise, you might leak information from training to evaluation/testing. A good way of doing this is to build an sklearn pipeline that consists of a feature transformation step and the classifier.
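A minimal sketch of such a pipeline, assuming the Reviews/Ratings columns from the question (names taken from the question, not verified against the actual file):
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    data['Reviews'], data['Ratings'], test_size=0.2)

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),                    # fit only on the training texts
    ('clf', RandomForestClassifier(n_estimators=100)),
])
pipe.fit(x_train, y_train)                           # vectorizer is fit inside the pipeline
print("Accuracy:", pipe.score(x_test, y_test))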
I am trying to produce a series of product classifiers based on the text description that each product has. The data frame I have is similar to the following but is more complicated. Python and the sklearn library are used.
data = {'description': ['orange', 'apple', 'bean', 'carrot', 'pork', 'fish', 'beef'],
        'Level1': ['plant', 'plant', 'plant', 'plant', 'animal', 'animal', 'animal'],
        'Level2': ['fruit', 'fruit', 'vegetable', 'vegetable', 'livestock', 'seafood', 'livestock']}

# Create DataFrame
df = pd.DataFrame(data)
"Description" is the textual data. Now it is only a word. But the real one is a longer sentence.
"Level1" is the top category.
"Level2" is a sub-category.
I know how to train a classification model to classify the products into Level 1 categories by using the sklearn library.
Below is what I did:
import pandas as pd
import numpy as np
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix
from sklearn.metrics import roc_curve, auc, roc_auc_score
import pickle
# Train/Test split
X_train, X_test, y_train, y_test = train_test_split(
    df['description'], df[['Level1', 'Level2']], test_size=0.4, shuffle=True)
#use the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
#transforming the training data into tf-idf matrix
X_train_vectors_tfidf = tfidf_vectorizer.fit_transform(X_train)
#transforming testing data into tf-idf matrix
X_test_vectors_tfidf = tfidf_vectorizer.transform(X_test)
#Create and save model for level 1
naive_bayes_classifier = MultinomialNB()
model_level1 = naive_bayes_classifier.fit(X_train_vectors_tfidf, y_train['Level1'])
with open('model_level_1.pkl', 'wb') as f:
    pickle.dump(model_level1, f)
What I don't know how to do is build a classification model for each Level1 category that can predict the products' Level2 category. For example, based on the above dataset, there should be one classification model for 'plant' (to predict fruit or vegetable) and another model for 'animal' (to predict seafood or livestock). Do you have any ideas on how to do this and save the models using loops?
Assuming you will be able to get all the columns of the dataset, it would be a mix of features with the Level columns being the class labels. Formulating along the same lines:
cols = ["abc", "Level1", "Level2", "Level3"]
From this, now let's take only the level columns, because those are what we are interested in.
level_cols = [val for val in cols if "Lev" in val]
The above just checks whether a column name contains "Lev".
Now, with the level columns in place, I think you could do the following as a starting point:
1. Iterate over only the level columns.
2. Take only the trailing numbers 1, 2, 3, 4, ..., n.
3. If the number from step 2 is divisible by 2, do the prediction using the saved model from the previous level (ideally, all the even ones).
4. Else, train on the other levels.
for level in level_cols:
    if int(level[-1]) % 2 == 0:
        # open the saved model at int(level[-1]) - 1
        # and perform the prediction here
        pass
    else:
        level_idx = int(level[-1])
        model = naive_bayes_classifier.fit(x_train, y_train[level])
        with open("model-x-" + str(level_idx), "wb") as mf:
            pickle.dump(model, mf)
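As an alternative starting point for the original ask (one Level2 model per Level1 category, trained and saved in a loop), a rough sketch could look like this; it assumes the df, TfidfVectorizer and MultinomialNB setup from the question and is not tested:
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

level2_models = {}
for cat in df['Level1'].unique():
    subset = df[df['Level1'] == cat]              # only products of this Level1 category
    vec = TfidfVectorizer(use_idf=True)
    X_sub = vec.fit_transform(subset['description'])
    clf = MultinomialNB().fit(X_sub, subset['Level2'])
    level2_models[cat] = (vec, clf)
    with open('model_level2_' + cat + '.pkl', 'wb') as f:
        pickle.dump((vec, clf), f)                # save the vectorizer together with its model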
Below is the code I am trying for a text classification model:
from sklearn.feature_extraction.text import TfidfVectorizer
ifidf_vectorizer = TfidfVectorizer()
X_train_tfidf = ifidf_vectorizer.fit_transform(X_train)
X_train_tfidf.shape
(3, 16)
from sklearn.svm import LinearSVC
clf = LinearSVC()
clf.fit(X_train_tfidf,y_train)
So far, only the training set has been vectorized into a full vocabulary. In order to perform analysis on the test set, I need to submit it to the same procedure.
So I did
X_test_tfidf = ifidf_vectorizer.fit_transform(X_test)
X_test_tfidf.shape
(2, 12)
And finally, when trying to predict, it shows an error:
predictions = clf.predict(X_test_tfidf)
ValueError: X has 12 features per sample; expecting 16
But when I use a Pipeline (from sklearn.pipeline import Pipeline), it works fine.
Can't I code it the way I was trying?
The error is with fit_transform on the test data. You fit_transform the training data and only transform the test data:
# change this
X_test_tfidf = ifidf_vectorizer.fit_transform(X_test)
X_test_tfidf.shape
(2, 12)
# to
X_test_tfidf = ifidf_vectorizer.transform(X_test)
X_test_tfidf.shape
Reasons:
When you do fit_transform, you teach your model the vocabulary with fit; the model learns the vectors that are then used to transform the data. You use the train data to learn the vectors, then you apply them to both train and test with transform.
If you do a fit_transform on the test data, you replace the vectors learned from the training data with ones learned from the test data. Given that your test data is smaller than your train data, you will likely get two different vectorisations.
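A toy illustration of the point (made-up texts, just to show the shapes):
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["red apple pie", "green apple tart", "blue berry pie"]
test_texts = ["apple pie"]

vec = TfidfVectorizer().fit(train_texts)
print(vec.transform(test_texts).shape)                     # (1, 7): matches the training vocabulary
print(TfidfVectorizer().fit_transform(test_texts).shape)   # (1, 2): a different, smaller vocabulary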
A Better Way
The best way to do this is to use a Pipeline, which will make your flow easy to understand:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
clf = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer()),
    ('model', LinearSVC()),
])
# train
clf.fit(X_train,y_train)
# predict
clf.predict(X_test)
This is easier, as the transformations are taken care of for you. You don't have to worry about fit_transform when fitting the model, or transform when predicting or scoring.
You can access the steps independently if you wish, with:
clf.named_steps['vectorizer']  # or 'model'
Under the hood, when you do clf.fit, your data passes through your vectorizer using fit_transform and then goes to the model. When you predict or score, your data passes through your vectorizer with transform before reaching your model.
Your code fails because you are refitting the vectorizer with .fit_transform() on the test set X_test. However, you should only transform the data with the vectorizer:
X_test_tfidf = ifidf_vectorizer.transform(X_test)
Now it should work as expected. You only fit the ifidf_vectorizer on X_train and transform all data according to it. This ensures that the same vocabulary is used and that you get outputs of the same shape.
I have used scikit-learn (sklearn) in Python for prediction. When importing the package with from sklearn import datasets and storing the result in iris = datasets.load_iris(), it works fine to train the model.
iris = pandas.read_csv("E:\scikit\sampleTestingCSVInput.csv")
iris_header = ["Sepal_Length","Sepal_Width","Petal_Length","Petal_Width"]
Model algorithm:
model = SVC(gamma='scale')
model.fit(iris.data, iris.target_names[iris.target])
But while importing a CSV file to train the model, and creating a new array for target_names as well, I am facing an error like:
ValueError: Found input variables with inconsistent numbers of
samples: [150, 4]
My CSV file has 5 columns, in which 4 columns are input and 1 column is output. I need to fit the model for that output column.
How do I provide the arguments to fit the model?
Could anyone share a code sample to import a CSV file and fit an SVM model in sklearn?
Since the question was not very clear to begin with and attempts to explain it were going in vain, I decided to download the dataset and do it myself. So, just to make sure we are working with the same dataset, iris.head() should give you something similar; a few names and a few values might be changed, but the overall structure will be the same.
Now the first four columns are features and the fifth one is target/output.
Now you will need your X and Y as numpy arrays. To do that, use:
X = iris[ ['sepal length:','sepal Width:','petal length','petal width']].values
Y = iris[['Target']].values
Now, since Y is categorical data, you will need to encode it using sklearn's LabelEncoder and scale the input X. To do that, use:
label_encoder = LabelEncoder()
Y = label_encoder.fit_transform(Y)
X = StandardScaler().fit_transform(X)
To keep with the norm of separate train and test data, split the dataset using
X_train , X_test, y_train, y_test = train_test_split(X,Y)
Now just train your model using X_train and y_train:
clf = SVC(C=1.0, kernel='rbf').fit(X_train,y_train)
After this you can use the test data to evaluate the model and tune the value of C as you wish.
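For example (a small sketch using the split above):
print(clf.score(X_test, y_test))  # mean accuracy on the held-out test data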
Edit: Just in case you don't know where the functions are, here are the import statements:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
import numpy as np
from sklearn import svm
from sklearn.feature_selection import SelectKBest, f_classif
I have 3 labels (male, female, na), denoted as follows:
labels = [0,1,2]
Each label was defined by 3 features (height, weight, and age) as the training data:
Training data for males:
male_height = np.array([111,121,137,143,157])
male_weight = np.array([60,70,88,99,75])
male_age = np.array([41,32,73,54,35])
males = np.vstack([male_height,male_weight,male_age]).T
Training data for females:
female_height = np.array([91,121,135,98,90])
female_weight = np.array([32,67,98,86,56])
female_age = np.array([51,35,33,67,61])
females = np.vstack([female_height,female_weight,female_age]).T
Training data for not availables:
na_height = np.array([96,127,145,99,91])
na_weight = np.array([42,97,78,76,86])
na_age = np.array([56,35,49,64,66])
nas = np.vstack([na_height,na_weight,na_age]).T
So, the complete training data are:
trainingData = np.vstack([males,females,nas])
Complete labels are:
labels = np.repeat(labels,5)
Now, I want to select the best features, output their names, and apply only those best features for fitting the support vector machine model.
I tried the following, according to the answer from @eickenberg and the comments from @larsmans:
selector = SelectKBest(f_classif, k=keep)
clf = make_pipeline(selector, StandardScaler(), svm.SVC())
clf.fit(trainingData, labels)
selected = trainingData[selector.get_support()]
print(selected)
[[111 60 41]
[121 70 32]]
However, all the selected elements belong to the label 'male', with the features being height, weight, and age respectively. I could not figure out where I am messing up. Could someone guide me in the right direction?
You can use e.g. SelectKBest as follows
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

keep = 2
selector = SelectKBest(f_classif, k=keep)
and place it into your pipeline
pipe = make_pipeline(selector, StandardScaler(), svm.SVC())
pipe.fit(trainingData, labels)
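To see which features were kept (the names the question asks for), one option is to read the boolean mask from the fitted selector and apply it to the feature names rather than to the rows of trainingData; a small sketch, assuming the height/weight/age ordering above:
import numpy as np

feature_names = np.array(['height', 'weight', 'age'])
mask = selector.get_support()          # boolean mask over columns, not rows
print(feature_names[mask])             # names of the kept features
print(trainingData[:, mask])           # kept columns, all rows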
To be honest, I have used the Support Vector Machine Model on text classification (which is an entirely different problem altogether). But, through that experience, I can confidently say that the more features you have, the better your predictions will be.
To summarize, do not filter out features: the Support Vector Machine will make use of every feature, no matter how little importance it has.
But if this is a huge necessity, look into scikit-learn's RandomForestClassifier. It can accurately assess which features are more important, using the feature_importances_ attribute.
Here's an example of how I would use it (code not tested):
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()  # tweak the parameters yourself
clf.fit(X, Y)  # if you're passing in a sparse matrix, apply .toarray() to X
print(clf.feature_importances_)
Hope that helps.