I am trying to train a classifier that takes a news headline as input and outputs tags that fit that headline. My data contains a set of news headlines as the input variables and meta-tags for those headlines as the output variables.
I one-hot encoded both the headlines and their corresponding meta-tags into two separate CSVs, then combined them into one large data frame, with the X_train values being a 5573x958 numpy array for the headline words and the y_train values being a 5573x843 numpy array for the tags.
Here is an image of the pandas data frame containing my data in one-hot-encoded form.
The goal of my classifier is to take a headline as input and output the tags most related to that headline. The problem I have is the following:
X_train = train_set.iloc[:, :958].values
X_train.shape
(out) (5573, 958)
y_train = train_set.iloc[:, 958:].values
y_train.shape
(out) (5573, 843)
from sklearn.naive_bayes import MultinomialNB
nb_clf = MultinomialNB().fit(X_train, y_train)
When I train it using a naive Bayes classifier, I get the following error message:
bad input shape (5573, 843)
From what I researched, the only way I can have multi-label target values is by one-hot encoding them. When I tried LabelEncoder() or MultiLabelBinarizer(), I had to specify the name of each column to be binarized, and with over 800 columns (words) to specify I could not figure out how to do it. So I just one-hot encoded them, which I believe gives the same result; the classifier just doesn't like it as input. Any suggestions on how I can fix this?
You can use scikit-learn's multi-target classification. Here is an example:
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB
nb_clf = MultiOutputClassifier(MultinomialNB()).fit(X_train, y_train)
You can see the documentation at this link: sklearn.multioutput.MultiOutputClassifier
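For prediction afterwards, here is a minimal sketch (assuming a hypothetical X_test encoded with the same 958 headline-word columns as X_train):
# Each row of the output has 843 columns, one per tag, matching y_train.
y_pred = nb_clf.predict(X_test)
print(y_pred.shape)  # (n_headlines, 843)
# predict_proba returns one (n_samples, 2) array per tag; the probability of
# class 1 can be used to rank the most related tags for each headline.
probas = nb_clf.predict_proba(X_test)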
I have trained multiclass classification models on my training and test sets and have achieved good results with SVC. Now I want to use the model to make predictions on my entire dataframe, but I get the following error: ValueError: X has 36976 features, but SVC is expecting 8989 features as input.
My dataframe has two columns: one with the categories (which I manually labeled for around 1/5 of the dataframe) and the text columns with all the texts (including those that have not been labeled).
import pandas as pd

data = {'categories': ['1', 'NaN', '3', 'NaN'],
        'documents': ['Paragraph 1.\nParagraph 2.\nParagraph 3.', 'Paragraph 1.\nParagraph 2.', 'Paragraph 1.\nParagraph 2.\nParagraph 3.\nParagraph 4.', 'Paragraph 1.\nParagraph 2.']}
df = pd.DataFrame(data)
First, I drop the rows with NaN values in the 'categories' column. Then I create the document-term matrix, define the y, and split into training and test sets.
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

tf = CountVectorizer(tokenizer=word_tokenize)
X = tf.fit_transform(df['documents'])
y = df['categories']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
Second, I run the SVC model and get good results:
from sklearn import metrics
from sklearn.svm import SVC
svm = SVC(C=0.1, class_weight='balanced', kernel='linear', probability=True)
model = svm.fit(X_train, y_train)
print('accuracy:', model.score(X_test, y_test))
y_pred = model.predict(X_test)
print(metrics.classification_report(y_test, y_pred))
Finally, I try to apply the SVC model to predict the categories of the entire 'documents' column of my dataframe. To do so, I create the document-term matrix of the entire column and then apply the model:
tf_entire_df = CountVectorizer(tokenizer=word_tokenize)
X_entire_df = tf_entire_df.fit_transform(df['documents'])
y_pred_entire_df = model.predict(X_entire_df)
But then I get the error that X_entire_df has more features than the SVC model expects as input. I imagine this is because I am now applying the model to the whole 'documents' column, but I don't know how to fix this.
I would appreciate your help!
These issues usually come from feeding the model unknown or unseen data (more or fewer features than the ones used for training).
I would strongly suggest using sklearn.pipeline to create a pipeline that includes both the preprocessing (CountVectorizer) and the machine learning model (SVC) in a single object.
From experience, this helps a lot in avoiding tedious, complex preprocessing and fitting issues.
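Here is a minimal sketch of that idea, reusing the names from your code and assuming the unlabeled rows hold real NaN values in 'categories' (as in your description, rather than the string 'NaN'):
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe = Pipeline([
    ('vectorizer', CountVectorizer(tokenizer=word_tokenize)),
    ('svc', SVC(C=0.1, class_weight='balanced', kernel='linear', probability=True)),
])

# Fit only on the labeled rows; the vectorizer's vocabulary is learned here.
labeled = df.dropna(subset=['categories'])
pipe.fit(labeled['documents'], labeled['categories'])

# Predict on the entire column: the pipeline reuses the training vocabulary,
# so the feature count always matches what the SVC expects.
df['predicted'] = pipe.predict(df['documents'])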
Hello, I'm experimenting with LDA, which separates 0/1 targets using 3 features. I have the same data as in the picture; some of the feature cells contain arrays.
First of all, I used StandardScaler to scale the data, but it raises an error: "setting an array element with a sequence."
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
This is the code I tried. X_train is a numpy array produced by splitting the data frame from the picture with train_test_split.
To summarize my questions: Can I use those arrays as features? And if it's okay to use them as they are, how should I scale them?
I'm having a problem with sklearn.
When I train it with .fit(), it shows me "ValueError: could not convert string to float: 'Casado'".
This is my code:
"""
from sklearn.naive_bayes import GaussianNB
import pandas as pd
# 1. Create Naive Bayes classifier:
gaunb = GaussianNB()
# 2. Create dataset:
dataset = pd.read_csv("archivos_de_datos/Datos_Historicos_Clientes.csv")
X_train = dataset.drop(["Compra"], axis=1) #Here I removed the last column "Compra"
Y_train = dataset["Compra"] #This one only consists of that column "Compra"
print("X_train: ","\n", X_train)
print("Y_train: ","\n", Y_train)
dataset2 = pd.read_csv("archivos_de_datos/Nuevos_Clientes.csv")
X_test = dataset2.drop("Compra", axis=1)
print("X_test: ","\n", X_test)
# 3. Train classifier with dataset:
gaunb = gaunb.fit(X_train, Y_train) #Here shows "ValueError: could not convert string to float: 'Casado'"
# 4. Predict using classifier:
prediction = gaunb.predict(X_test)
print("PREDICTION: ",prediction)
"""
And the dataset I'm using is a .csv file that looks like this (but with more rows):
IdCliente,EstadoCivil,Profesion,Universitario,TieneVehiculo,Compra
1,Casado,Empresario,Si,No,No
2,Casado,Empresario,Si,Si,No
3,Soltero,Empresario,Si,No,Si
I'm trying to train it to determine (with a test dataset) whether the last column would be a yes or no (Si or No).
I appreciate your help; I'm obviously new at this and don't understand what I'm doing wrong here.
I would use OneHotEncoder to, as Lavin mentioned, turn the yes/no values into numbers. A model such as this can't process categorical data directly.
OneHotEncoder is used to handle binary data such as yes/no or male/female, while LabelEncoder is used for categorical data with more than 2 values, e.g., country names.
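To illustrate the difference, a toy sketch (not from the question's data):
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

le = LabelEncoder()
print(le.fit_transform(['No', 'Si', 'No']))         # [0 1 0] -- one integer per category

# sparse_output needs scikit-learn >= 1.2; older versions use sparse=False
ohe = OneHotEncoder(sparse_output=False)
print(ohe.fit_transform([['No'], ['Si'], ['No']]))  # one 0/1 column per category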
It will look something like this. However, you'll have to do this with all categorical data, not just your y column, and use LabelEncoder for columns that are not binary (more than 2 values, for example EstadoCivil).
Also, I would suggest removing any input variables that don't contribute to your model; for instance, IdCliente sounds like it may not add any value in determining your dependent variable. This is context specific, but something to keep in mind.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [Insert column number for your df])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
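Applied to the dataset in the question, it might look like this (a sketch; the column indices assume the IdCliente,EstadoCivil,Profesion,Universitario,TieneVehiculo layout shown above):
X = dataset.drop(["Compra", "IdCliente"], axis=1)  # drop the ID column as suggested above

# Columns 0-3 are now EstadoCivil, Profesion, Universitario, TieneVehiculo
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0, 1, 2, 3])], remainder='passthrough')
X = ct.fit_transform(X)  # may be sparse; call .toarray() before GaussianNB if needed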
From the docs:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
More info:
https://contactsunny.medium.com/label-encoder-vs-one-hot-encoder-in-machine-learning-3fc273365621#:~:text=What%20one%20hot%20encoding%20does,which%20column%20has%20what%20value.&text=So%2C%20that's%20the%20difference%20between%20Label%20Encoding%20and%20One%20Hot%20Encoding.
Below is the code I am trying for a text classification model:
from sklearn.feature_extraction.text import TfidfVectorizer
ifidf_vectorizer = TfidfVectorizer()
X_train_tfidf = ifidf_vectorizer.fit_transform(X_train)
X_train_tfidf.shape
(3, 16)
from sklearn.svm import LinearSVC
clf = LinearSVC()
clf.fit(X_train_tfidf,y_train)
Till now, only the training set has been vectorized into a full vocabulary. In order to perform analysis on the test set, I need to submit it to the same procedure. So I did:
X_test_tfidf = ifidf_vectorizer.fit_transform(X_test)
X_test_tfidf.shape
(2, 12)
And finally, when trying to predict, it shows an error:
predictions = clf.predict(X_test_tfidf)
ValueError: X has 12 features per sample; expecting 16
But when I used a Pipeline (from sklearn.pipeline import Pipeline), it worked fine.
Can't I code it the way I was trying?
The error is with fit_transform on the test data. You fit_transform the training data and only transform the test data:
# change this
X_test_tfidf = ifidf_vectorizer.fit_transform(X_test)
X_test_tfidf.shape
(2, 12)
# to
X_test_tfidf = ifidf_vectorizer.transform(X_test)
X_test_tfidf.shape
Reasons:
When you call fit, the vectorizer learns a vocabulary, and that vocabulary is what transform uses to turn text into features. You learn the vocabulary from the training data, then apply it to both train and test with transform.
If you fit_transform the test data, you replace the vocabulary learned from the training data with one learned from the test data. Given that your test set is smaller than your training set, you would likely get two different vectorizations.
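You can see this directly by inspecting the learned vocabulary; a small sketch with made-up sentences:
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer()
vec.fit(["the cat sat on the mat", "dogs chase cats"])
print(len(vec.vocabulary_))  # size of the vocabulary learned from the training texts

# Refitting on different texts replaces that vocabulary entirely:
vec.fit(["a completely new sentence"])
print(len(vec.vocabulary_))  # a different size, hence different feature columns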
A Better Way
The best way to do what you are doing is to use a Pipeline, which makes your flow easy to understand:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
clf = Pipeline(steps=[
('vectorizer', TfidfVectorizer()),
('model', LinearSVC()),
])
# train
clf.fit(X_train,y_train)
# predict
clf.predict(X_test)
This is easier, as the transformations are taken care of for you. You don't have to worry about fit_transform when fitting the model, or transform when predicting or scoring.
You can access the steps independently, if you wish, with:
clf.named_steps['vectorizer']  # or 'model'
Under the hood, when you call clf.fit, your data passes through your vectorizer using fit_transform and then on to the model. When you predict or score, your data passes through your vectorizer with transform before reaching your model.
Your code fails because you are refitting the vectorizer with .fit_transform() on the test set X_test. However, you should only transform the data with the vectorizer:
X_test_tfidf = ifidf_vectorizer.transform(X_test)
Now it should work as expected. You fit the ifidf_vectorizer only on X_train and transform all data accordingly. This ensures that the same vocabulary is used and that you get outputs of the same shape.
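Transforming with the already-fitted vectorizer keeps the 16-feature training vocabulary, so the shapes should now line up:
X_test_tfidf.shape
(2, 16)
predictions = clf.predict(X_test_tfidf)  # 16 features, matching what the model saw in training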
I have used scikit-learn (sklearn) in Python for prediction. When I import the package with from sklearn import datasets and store the result in iris = datasets.load_iris(), it works fine to train the model.
import pandas

iris = pandas.read_csv(r"E:\scikit\sampleTestingCSVInput.csv")  # raw string avoids backslash-escape issues
iris_header = ["Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"]
Model algorithm:
model = SVC(gamma='scale')
model.fit(iris.data, iris.target_names[iris.target])
But when importing the CSV file to train the model, and also creating a new array for target_names, I am facing an error like:
ValueError: Found input variables with inconsistent numbers of samples: [150, 4]
My CSV file has 5 columns, of which 4 columns are inputs and 1 column is the output. I need to fit the model for that output column.
How do I provide the arguments to fit the model?
Could anyone share a code sample for importing a CSV file to fit an SVM model in sklearn?
Since the question was not very clear to begin with and attempts to clarify it were in vain, I decided to download the dataset and do it myself. Just to make sure we are working with the same dataset: iris.head() should give you something similar (a few names and values might be different, but the overall structure will be the same).
Now the first four columns are features and the fifth one is target/output.
Now you will need your X and Y as numpy arrays; to get them, use:
X = iris[ ['sepal length:','sepal Width:','petal length','petal width']].values
Y = iris['Target'].values  # a 1-D array, which is what LabelEncoder expects
Now, since Y is categorical data, you will need to encode it using sklearn's LabelEncoder and scale the input X. To do that, use:
label_encoder = LabelEncoder()
Y = label_encoder.fit_transform(Y)
X = StandardScaler().fit_transform(X)
To keep with the norm of separate train and test data, split the dataset using
X_train , X_test, y_train, y_test = train_test_split(X,Y)
Now just train it on your model using X_train and y_train
clf = SVC(C=1.0, kernel='rbf').fit(X_train,y_train)
After this you can use the test data to evaluate the model and tune the value of C as you wish.
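For instance, a quick evaluation plus a simple cross-validated search over C could look like this (a sketch; the parameter grid is just an example):
from sklearn.model_selection import GridSearchCV

print(clf.score(X_test, y_test))  # accuracy on the held-out test data

# Cross-validated search over a few candidate values of C
grid = GridSearchCV(SVC(kernel='rbf'), param_grid={'C': [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)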
Edit: In case you don't know where these functions come from, here are the import statements:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler