I am loading Linear SVM model and then predicting new data using the stored trained SVM Model. I used TFIDF while training such as:
vector = TfidfVectorizer(ngram_range=(1, 3)).fit(data['text'])
**when i apply new data than I am getting error at the time of Prediction.
**
ValueError: X has 2 features, but SVC is expecting 472082 features as input.
Code for the Prediction of new data
Linear_SVC_classifier = joblib.load("/content/drive/MyDrive/dataset/Classifers/Linear_SVC_classifier.sav")
test_data = input("Enter Data for Testing: ")
newly_testing_data = vector.transform(test_data)
SVM_Prediction_NewData = Linear_SVC_classifier.predict(newly_testing_data)
I want to predict new data using stored SVM model without applying TFIDF on training data when I give data to model for prediction. When I use the new data for prediction than the prediction line gives error. Is there any way to remove this error?
The problem is due to your creation of a new TfidfVectorizer by fitting it on the test dataset. As the classifier has been trained on a matrix generated by the TfidfVectorier fitted on the training dataset, it expects the test dataset to have the exact same dimensions.
In order to do so, you need to transform your test dataset with the same vectorizer that was used during training rather than initialize a new one based on the test set.
The vectorizer fitted on the train set can be pickled and stored for later use to avoid any re-fitting at inference time.
I'm currently working on a multilabel text classification problem, in which I have 4 labels, which is represented as 4 dummy variables. I have tried out several ways to transform the data in a way that is suitable for making the MLC.
Right now I'm running with pipelines, but as far as I can see, this doesn't fit a model with all labels included, but rather makes 1 model per label - do you agree with this?
I have tried to use MultiLabelBinarizer and LabelBinarizer, but with no luck.
Do you have a tip on how I can solve this problem in a way that makes the model include all the labels in one model, taking into account the different label combinations?
A subset of the data and my code is here:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
# Import data
df = import_data("product_data")
# Define dataframe to only include relevant columns
df = df.loc[:,['text','TV','Internet','Mobil','Fastnet']]
# Define dataframe with labels
df_labels = df.loc[:,['TV','Internet','Mobil','Fastnet']]
# Sum the number of labels per text
sum_column = df["TV"] + df["Internet"] + df["Mobil"] + df["Fastnet"]
df["label_sum"] = sum_column
# Remove texts with no labels
df.drop(df[df['label_sum'] == 0].index, inplace = True)
# Split dataset
train, test = train_test_split(df, random_state=42, test_size=0.2, shuffle=True)
X_train = train.text
X_test = test.text
categories = ['TV','Internet','Mobil','Fastnet']
# Model
LogReg_pipeline = Pipeline([
('tfidf', TfidfVectorizer(analyzer = 'word', max_df=0.20)),
('clf', LogisticRegression(solver='lbfgs', multi_class = 'ovr', class_weight = 'balanced', n_jobs=-1)),
])
for category in categories:
print('... Processing {}'.format(category))
LogReg_pipeline.fit(X_train, train[category])
prediction = LogReg_pipeline.predict(X_test)
print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))
https://www.transfernow.net/dl/20210921NbWDt3eo
Code Analysis
The scikit-learn LogisticRegression classifier using OVR (one-vs-rest) can only predict a single output/label at a time. Since you are training the model in the pipeline on multiple labels one at a time, you will produce one trained model per label. The algorithm itself will be the same for all models, but you would have trained them differently.
Multi-Output Regressor
Multi-output regressors can accept multiple independent labels and generate one prediction for each target.
The output should be the same as what you have, but you only need to maintain a single model and train it once.
To use this approach, wrap your LR model in a MultiOutputRegressor.
Here is a good tutorial on multi-output regression models.
model = LogisticRegression(solver='lbfgs', multi_class='ovr', class_weight='balanced', n_jobs=-1)
pipeline = Pipeline([
('tfidf', TfidfVectorizer(analyzer = 'word', max_df=0.20)),
('clf', MultiOutputRegressor(model))])
preds = pipeline.fit(X_train, df_labels).predict(X_test)
df_preds = combine_data(X=X_test, Y=preds, y_cols=categories)
combine_data() merges all data into a single DataFrame for convenience:
def combine_data(X, Y, y_cols):
""" X is a dataframe, Y is a np array, y_cols is a list """
df_out = pd.DataFrame(Y, columns=y_cols)
df_out.index = X.index
return pd.concat([X, df_out], axis=1).sort_index()
Multinomial Logistic Regression
To use a LogisticRegression classifier on all labels at once, set multi_class=multinomial.
The softmax function is used to find the predicted probability of a sample belonging to a class.
You'll need to reverse the one-hot encoding on the label to get back the categorical variable (answer here). If you have the original label before one-hot encoding, use that.
Here is a good tutorial on multinomial logistic regression.
label_col=["text_source"]
clf = LogisticRegression(multi_class='multinomial', solver='lbfgs')
model = clf.fit(df_train[input_cols], df_train[label_col])
# Generate a table of probabilities for each class
probs = model.predict_proba(X_test)
df_probs = combine_data(X=X_test, Y=probs, y_cols=label_col)
# Predict the class for a sample, i.e. the one with the highest probability
preds = model.predict(X_test)
df_preds = combine_data(X=X_test, Y=preds, y_cols=label_col)
I am working on a flight ticket price prediction data set using pyspark ML lib, which contains both train and test data sets. I have successfully implemented my model on train data set and predicted the price i.e, the label column, but don't know how to apply the same model on test data set for predicting the price of the ticket.
The following code is for training the model on train data set(containing both features and label column).
from pyspark.ml.regression import GBTRegressor
gbt = GBTRegressor(featuresCol="features",labelCol = "Price", maxIter = 10)
gbtModel = gbt.fit(training_data)
predictions_gbt = gbtModel.transform(testing_data)
predictions_gbt.select("features", "Price", "prediction").show()
It does not appear that Pyspark Onv-vs-Rest classifier provides probabilities. Is there a way to do this?
I am appending code below. I am adding the standard multiclass classifier for comparison.
from pyspark.ml.classification import LogisticRegression, OneVsRest
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# load data file.
inputData = spark.read.format("libsvm") \
.load("/data/mllib/sample_multiclass_classification_data.txt")
(train, test) = inputData.randomSplit([0.8, 0.2])
# instantiate the base classifier.
lr = LogisticRegression(maxIter=10, tol=1E-6, fitIntercept=True)
# instantiate the One Vs Rest Classifier.
ovr = OneVsRest(classifier=lr)
# train the multiclass model.
ovrModel = ovr.fit(train)
lrm = lr.fit(train)
# score the model on test data.
predictions = ovrModel.transform(test)
predictions2 = lrm.transform(test)
predictions.show(6)
predictions2.show(6)
I don't think you can access the probabilities(confidence) vector because it takes the max of the confidence and drops the confidence vector. To test, you can make a copy of the class and modify it and remove the .drop(accColName)
http://spark.apache.org/docs/2.0.1/api/python/_modules/pyspark/ml/classification.html
# output the index of the classifier with highest confidence as prediction
labelUDF = udf(
lambda predictions: float(max(enumerate(predictions), key=operator.itemgetter(1))[0]),
DoubleType())
# output label and label metadata as prediction
return aggregatedDataset.withColumn(
self.getPredictionCol(), labelUDF(aggregatedDataset[accColName])).drop(accColName)
I am trying to build a predictive model using python. The training and test data set has over 400 variables. On using feature selection on training data set the number of variables are reduced to 180
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold = .9)
and then I am training a model using gradient boosting algorithm achieveing .84 AUC accuracy in cross validation.
from sklearn import ensemble
from sklearn.cross_validation import train_test_split
from sklearn.metrics import roc_auc_score as auc
df_fit, df_eval, y_fit, y_eval= train_test_split( df, y, test_size=0.2, random_state=1 )
boosting_model = ensemble.GradientBoostingClassifier(n_estimators=100, max_depth=3,
min_samples_leaf=100, learning_rate=0.1,
subsample=0.5, random_state=1)
boosting_model.fit(df_fit, y_fit)
But when I am trying to use this model to predict for prediction data set it is giving me error
predict_target = boosting_model.predict(df_prediction)
Error: Number of variables in prediction data set 'df_prediction' does not match the number of variables in the model
Which makes sense because total variables in testing data remains to be over 400.
My question is there anyway to bypass this problem and keep using feature selection for predictive modeling. Because if I remove it the accuracy of model drops down to .5 which is very poor.
Thanks!
You should transform your prediction matrix through your feature selection too. So somewhere in your code you do
df = sel.fit_transform(X)
and before predicting
df_prediction = sel.transform(X_prediction)