Sklearn Naive Bayes GaussianNB from .csv - python

I'm having a problem with sklearn.
When I train it with ".fit()" it shows me the ValueError "ValueError: could not convert string to float: 'Casado'"
This is my code:
"""
from sklearn.naive_bayes import GaussianNB
import pandas as pd
# 1. Create Naive Bayes classifier:
gaunb = GaussianNB()
# 2. Create dataset:
dataset = pd.read_csv("archivos_de_datos/Datos_Historicos_Clientes.csv")
X_train = dataset.drop(["Compra"], axis=1) #Here I removed the last column "Compra"
Y_train = dataset["Compra"] #This one only consists of that column "Compra"
print("X_train: ","\n", X_train)
print("Y_train: ","\n", Y_train)
dataset2 = pd.read_csv("archivos_de_datos/Nuevos_Clientes.csv")
X_test = dataset2.drop("Compra", axis=1)
print("X_test: ","\n", X_test)
# 3. Train classifier with dataset:
gaunb = gaunb.fit(X_train, Y_train) #Here shows "ValueError: could not convert string to float: 'Casado'"
# 4. Predict using classifier:
prediction = gaunb.predict(X_test)
print("PREDICTION: ",prediction)
"""
And the dataset I'm using is an .csv file that looks like this (but with more rows):
IdCliente,EstadoCivil,Profesion,Universitario,TieneVehiculo,Compra
1,Casado,Empresario,Si,No,No
2,Casado,Empresario,Si,Si,No
3,Soltero,Empresario,Si,No,Si
I'm trying to train it to determine (with a test dataset) whether the last column would be a Yes or No (Si or No)
I appreciate your help, I'm obviously new at this and I don't understand what am I doing wrong here

I would use onehotencoder to, like Lavin mentioned, make the yes or no a numerical value. A model such as this can't process categorical data.
Onehotencoder is used to handle binary data such as yes/no, male/female, while label encoder is used for categorical data with more than 2 values, ei, country names.
It will look something like this, however, you'll have to do this with all categorical data, not just your y column, and use label encoder for columns that are not binary ( more than 2 variables - for example, perhaps Estadio Civil)
Also I would suggest removing any dependent variables that don't contribute to your model, for instant client ID sounds like it may not add any value in determining your dependent variable. This is context specific, but something to keep in mind.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [Insert column number for your df])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
For the docs:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
More info:
https://contactsunny.medium.com/label-encoder-vs-one-hot-encoder-in-machine-learning-3fc273365621#:~:text=What%20one%20hot%20encoding%20does,which%20column%20has%20what%20value.&text=So%2C%20that's%20the%20difference%20between%20Label%20Encoding%20and%20One%20Hot%20Encoding.

Related

How can i map predicted values (after using RandomForestClassifier) back to their original values in Python?

For context, I am taking Ad listing data for Machines and using it to predict the type of Machine.
I have used the RandomForestClassifier for class prediction. In the model I have used LabelEncoder to convert all categorical variables, including the feature label (for example, 'Excavator' becomes '5'). After running the model successfully, I am left with my array of predicted values. These values are the encoded values - numerical. What I would like to do now is convert these predictions back into their original strings. E.g. I would like to map the number 5 back to it's original value of 'Excavator' - ideally mapping all of the predicted values in one DataFrame.
I have left out a lot of code below as I don't want to drown people in the full script so I have just left what I deem to be most relevant to my question but if you need to see more in order to help then please let me know!
### ENCODE TO CATEGORICAL ###
# Encoding categorical variables
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Choose columns to encode
cols = ['make', 'model_of_Ad', 'year_manufactured', 'business', "tag_name_deep"]
# Encode columns
df[cols] = df[cols].apply(LabelEncoder().fit_transform)
# Reset df index
df.reset_index(drop=True, inplace=True)
....
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
# define the model
rf = RandomForestClassifier()
# fit the model on the whole dataset
rf.fit(X_train, y_train)
#Predict on the test set in order to assess accuracy
y_pred = rf.predict(X_test)
# Model Accuracy, how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
# See predicted values
print(y_pred)
Any help is appreciated!

Combining sklearn pipeline and cross validation with binary columns

I want to run a regression model on a dataset with one textual column, five binary variables, and one numerical target variable. I included a CountVectorizer to vectorize the textual column, and tried to combine it in a sklearn Pipeline using make_column_transformer. The data doesn't have any missing values - yet, when running the below script, I am getting the following warning:
FitFailedWarning: Estimator fit failed. The score on this train-test
partition for these parameters will be set to nan.
and following error message:
TypeError: All estimators should implement fit and transform, or can be
'drop' or 'passthrough' specifiers. 'Level1' (type <class 'str'>) doesn't.
I assume the problem might be that I did not specify a second tuple in
make_column_transformer but merely the following:
sample_df[categorical_cols] but I am unsure how to include an already
processed, ready data in make_column_transformer.
Full code:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold
from sklearn.compose import make_column_transformer
from sklearn.model_selection import cross_val_score
categorical_cols = [col for col in sample_df.columns if col.startswith('Level')]
textual_col = ['Text']
pipeline = Pipeline([
('transformer', make_column_transformer((CountVectorizer(), textual_col),
sample_df[categorical_cols],
remainder='passthrough')),
('model', RandomForestRegressor())
])
X = sample_df[textual_col + categorical_cols]
y = sample_df['Value']
cv = KFold(n_splits=5, shuffle=True)
scores = cross_val_score(pipeline, X, y, cv=cv)
scores
Sample dataset:
import io
data_string = """
Level1;Level2;Level3;Level4;Level5;Text;Value
0;0;1;0;0;Are you sure that the input;109.3
0;0;0;0;0;that the input text data for;87.2
0;0;1;0;0;text data for your model is;21.5
0;0;0;0;0;your model is in English? Well,;143.5
0;0;0;0;1;in English? Well, no one can;141.1
0;0;0;0;0;no one can be sure about;93.4
0;0;0;0;0;be sure about this, as no;29.5
0;0;0;0;0;this, as no one will read;17.9
0;0;1;0;0;one will read around 20k records;37.8
0;0;1;0;0;around 20k records of text data.;153.7
0;0;0;0;0;of text data. So, how non-English;99.5
0;0;0;1;0;So, how non-English text will affect;119.1
0;0;0;0;1;text will affect your English text;97.5
0;0;0;0;0;your English text trained model? Pick;49.2
0;0;0;0;0;trained model? Pick any non-English text;79.3
0;0;0;0;0;any non-English text and pass it;107.7
0;1;0;0;1;and pass it through as input;117.3
0;0;0;0;0;through as input to your English;151.1
0;0;0;0;0;to your English text trained classification;47.3
0;0;0;0;0;text trained classification model. You will;129.3
0;0;0;0;0;model. You will come to know;135.1
0;0;0;0;0;come to know that the category;145.8
0;0;0;0;1;that the category is assigned to;131.9
1;0;0;1;0;is assigned to non-English text by;43.7
1;0;0;0;0;non-English text by the model. If;67.1
1;0;0;0;0;the model. If your model is;105.3
0;0;0;1;0;your model is dependent on one;65.2
0;1;0;0;0;dependent on one language then, other;98.3
0;0;0;0;0;language then, other languages in your;130.5
0;0;0;0;0;languages in your textual data should;107.2
0;1;1;0;0;textual data should be considered as;66.5
0;0;0;1;0;be considered as noise. But why?;43.1
0;0;0;0;1;noise. But why? The job of;56.7
0;0;0;0;0;The job of the text classification;75.1
1;0;0;0;0;the text classification model is to;88.3
1;0;0;0;0;model is to classify. And, it;91.3
0;0;0;0;0;classify. And, it will do its;106.4
1;0;0;0;0;will do its job despite its;109.5
0;0;0;0;1;job despite its input text will;143.1
0;0;0;0;0;input text will be in English;54.1
1;0;0;0;0;be in English or not. What;96.4
0;0;0;1;0;or not. What can we do;133.8
0;0;0;0;0;can we do to avoid such;146.4
0;0;1;0;0;to avoid such a situation? Your;164.3
0;0;1;0;0;a situation? Your model will not;34.6
0;0;0;0;0;model will not stop classifying the;76.8
0;0;0;1;0;stop classifying the non-English text. So,;80.5
0;0;1;0;0;non-English text. So, you have to;90.3
0;0;0;0;0;you have to detect the non-English;68.3
0;0;0;0;0;detect the non-English text and remove;44.0
0;0;1;0;0;text and remove it from trained;100.4
0;0;0;0;0;it from trained data and prediction;117.4
0;0;0;0;1;data and prediction data. This process;85.4
0;1;0;0;0;data. This process comes under the;65.7
0;0;1;0;0;comes under the data cleaning part.;54.3
0;1;0;0;0;data cleaning part. Inconsistency in your;78.9
0;0;0;0;0;Inconsistency in your data will result;96.8
1;0;0;0;1;data will result in a decrease;108.1
0;0;0;0;0;in a decrease in the accuracy;145.7
1;0;0;0;0;in the accuracy of the model.;103.6
0;0;1;0;0;of the model. Sometimes, multiple languages;56.4
0;0;0;0;1;Sometimes, multiple languages present in text;90.5
0;0;0;0;0;present in text data could be;80.4
0;0;0;0;0;data could be one of the;90.7
1;0;0;0;0;one of the reasons your model;48.8
0;0;0;0;0;reasons your model behaves strangely. So,;65.4
0;0;1;0;0;behaves strangely. So, in this article,;107.5
0;0;0;0;0;in this article, we will discuss;143.2
0;0;0;0;0;we will discuss the different python;165.0
0;0;0;0;0;the different python libraries which detect;123.3
0;0;0;0;1;libraries which detect the language(s) of;85.3
0;0;0;0;0;the language(s) of the text data.;91.4
0;0;0;0;1;the text data. Let’s start with;49.5
0;0;0;0;0;Let’s start with the spaCy library.;76.3
0;0;0;0;0;the spaCy library.;49.5
"""
sample_df = pd.read_csv(io.StringIO(data_string), sep=';')
You can use remainder='passthrough' to avoid transforming already processed columns (therefore in your case you can just consider the binary columns as residual columns that your ColumnTransformer object won't process, but on which it will pass through). Then you should be aware that CountVectorizer expects a 1D array as input and therefore you should specify the columns to be passed to make_column_transformer as a string ('Text'), rather than as an array (['Text']) (see reference from make_column_transformer() doc).
columns : str, array-like of str, int, array-like of int, slice, array-like of bool or callable
Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer. A callable is passed the input data X and can return any of the above. To select multiple columns by name or dtype, you can use make_column_selector.
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold
from sklearn.compose import make_column_transformer
from sklearn.model_selection import cross_val_score
categorical_cols = [col for col in sample_df.columns if col.startswith('Level')]
textual_col = ['Text']
pipeline = Pipeline([
('transformer', make_column_transformer((CountVectorizer(), 'Text'),
remainder='passthrough')),
('model', RandomForestRegressor())
])
X = sample_df[textual_col + categorical_cols]
y = sample_df['Value']
cv = KFold(n_splits=5, shuffle=True)
scores = cross_val_score(pipeline, X, y, cv=cv)
scores

How to Encode categorical data into labels for training and testing

The training dataset has object columns called shops and others. Now for the machine learning model I converted the columns into labels for training purposes. Using the code below
from sklearn.ensemble import RandomForestRegressor
X = df_all_4.copy()
y = df_all_4.item_price
X = X.drop(['item_price','date'], axis=1)
for c in df_all_4.columns[df_all_4.dtypes == 'object']:
X[c] = X[c].factorize()[0]
rf = RandomForestRegressor()
rf.fit(X,y)
Now the testing dataset also has those categorical columns but with the some columns missing including the target column not relevant here I think. But if I again label the training dataset (unordered) the labels would be different than the one used while training so the model would not work properly . How to solve this problem and get the same encodings while training and testing
The important thing here is you can use LabelEncoder or OneHotEncoder classes present in Sklearn package. which makes this task pretty much simple.
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
for c in df_all_4.columns[df_all_4.dtypes == 'object']:
le = LabelEncoder()
X[c] = le.fit_transform(X[c])
test[c] = le.transform(test[c])
That's it you have encoded the labels into numbers for both train and test data
You can also use OneHotEncoder which does OneHotEncoding to categorical data.

Create 3 classification models to predict the class based on the other available columns

I have three type of classes (stetosa, versicolor, virginica) and also 4 other columns as sepal_length, sepal_width, petal_length, petal_width with around 150 rows and each it's filled with it's own information (so nothing is empty there). I need to predict the type of the class based on other columns.
This is what I have tried:
import numpy as np
import pandas as pd
df = pd.read_csv("data.csv")
X=df[["sepal_length","sepal_width","petal_length","petal_width"]]
y=df["class"]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.1)
from sklearn.linear_model import LinearRegression
clf=LinearRegression()
clf.fit(y_train, X_train)
clf.predict(y_test)
The text marked reponse with this problem:
ValueError: could not convert string to float: 'virginica'
I need to do this with train and test.
You need to encode your data. in other words, transform each category in a number (int or float).
Map the following categories like this:
mapping={'setosa':0,'versicolor':1,'virginica':2}
y.map(mapping)
After you train your model, you will get 0,1 or 2 as a result. Convert it back and you'll have your predictions.
And by the way, if you are predicting a class, you must change your model. LinearRegression() is a numerical predictor it can only predict numerical values.
Try to use SVC, LogisticRegression or any other classification model instead.

How to train SVM model in sklearn python by input CSV file?

I have used sklearn scikit python for prediction. While importing following package
from sklearn import datasets and storing the result in iris = datasets.load_iris() , it works fine to train model
iris = pandas.read_csv("E:\scikit\sampleTestingCSVInput.csv")
iris_header = ["Sepal_Length","Sepal_Width","Petal_Length","Petal_Width"]
Model Algorithm :
model = SVC(gamma='scale')
model.fit(iris.data, iris.target_names[iris.target])
But while importing CSV file to train model , creating new array for target_names also , I am facing some error like
ValueError: Found input variables with inconsistent numbers of
samples: [150, 4]
My CSV file has 5 Columns in which 4 columns are input and 1 column is output. Need to fit model for that output column.
How to provide argument for fit model?
Could anyone share the code sample to import CSV file to fit SVM model in sklearn python?
Since the question was not very clear to begin with and attempts to explain it were going in vain, I decided to download the dataset and do it for myself. So just to make sure we are working with the same dataset iris.head() will give you or something similar, a few names might be changed and a few values, but overall strucure will be the same.
Now the first four columns are features and the fifth one is target/output.
Now you will need your X and Y as numpy arrays, to do that use
X = iris[ ['sepal length:','sepal Width:','petal length','petal width']].values
Y = iris[['Target']].values
Now since Y is categorical Data, You will need to one hot encode it using sklearn's LabelEncoder and scale the input X to do that use
label_encoder = LabelEncoder()
Y = label_encoder.fit_transform(Y)
X = StandardScaler().fit_transform(X)
To keep with the norm of separate train and test data, split the dataset using
X_train , X_test, y_train, y_test = train_test_split(X,Y)
Now just train it on your model using X_train and y_train
clf = SVC(C=1.0, kernel='rbf').fit(X_train,y_train)
After this you can use the test data to evaluate the model and tune the value of C as you wish.
Edit Just in case you don't know where the functions are here are the import statements
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

Categories

Resources