I am using scikit-learn in Python for prediction. When I import the package with from sklearn import datasets and load the built-in data via iris = datasets.load_iris(), training the model works fine. Now I am reading my own CSV file instead:
iris = pandas.read_csv(r"E:\scikit\sampleTestingCSVInput.csv")  # raw string so the backslashes are not treated as escapes
iris_header = ["Sepal_Length","Sepal_Width","Petal_Length","Petal_Width"]
Model algorithm:
model = SVC(gamma='scale')
model.fit(iris.data, iris.target_names[iris.target])
But when I import the CSV file to train the model, also creating a new array for target_names, I get an error like:
ValueError: Found input variables with inconsistent numbers of samples: [150, 4]
My CSV file has 5 columns: 4 input columns and 1 output column. I need to fit the model to that output column.
How do I provide the arguments for fit?
Could anyone share a code sample that imports a CSV file and fits an SVM model in scikit-learn?
Since the question was not very clear to begin with and attempts to clarify it were in vain, I decided to download the dataset and work through it myself. Just to make sure we are working with the same dataset: iris.head() should give you something similar; a few names and values might differ, but the overall structure will be the same.
Now the first four columns are features and the fifth one is target/output.
Now you will need your X and Y as NumPy arrays; to get them use
X = iris[['sepal length:','sepal Width:','petal length','petal width']].values
Y = iris['Target'].values  # single brackets so Y is 1-D, which is what LabelEncoder expects
Now since Y is categorical data, you will need to encode it as integers using sklearn's LabelEncoder, and scale the input X; to do that use
label_encoder = LabelEncoder()
Y = label_encoder.fit_transform(Y)
X = StandardScaler().fit_transform(X)
To follow the norm of keeping train and test data separate, split the dataset using
X_train , X_test, y_train, y_test = train_test_split(X,Y)
Now just train your model using X_train and y_train:
clf = SVC(C=1.0, kernel='rbf').fit(X_train,y_train)
After this you can use the test data to evaluate the model and tune the value of C as you wish.
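For example, a quick sanity check on the held-out data could look like this (a sketch, assuming the clf and split above):

from sklearn.metrics import accuracy_score

# accuracy on the held-out test set
print(accuracy_score(y_test, clf.predict(X_test)))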
Edit: Just in case you don't know where these functions come from, here are the import statements:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
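Putting it all together, here is a minimal end-to-end sketch answering the original question. The file path is taken from the question, and reading the features and target by position (first four columns, then the fifth) is an assumption based on the description above.

import pandas as pd
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# path is an assumption taken from the question
iris = pd.read_csv(r"E:\scikit\sampleTestingCSVInput.csv")

X = iris.iloc[:, :4].values                          # first four columns are the features
Y = LabelEncoder().fit_transform(iris.iloc[:, 4])    # fifth column is the target
X = StandardScaler().fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=42)
clf = SVC(gamma='scale').fit(X_train, y_train)
print(clf.score(X_test, y_test))                     # mean accuracy on the test set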
Related
I want to build a house price prediction app. The app has fields where the user can enter inputs, then a predictive model predicts the price and displays it to the user. I am using a dataset from Kaggle for the prediction. When I run the code, it shows an error message that says
X has 8 features, but RandomForestRegressor is expecting 67 features as input.
Below is the code. Xy contains the data from Kaggle and df is the user input; Xy is the train set and df is the test set. Xy has 8 variables including the target. df only retrieves 7 inputs (so it has 7 variables, because no target variable is received from the user).
# Assign to X for input features and Y for target
X = Xy.drop('Price', axis=1)
Y = Xy['Price'].values
# Build Regression Model
model = RandomForestRegressor()
model.fit(X, Y)
df = pd.get_dummies(df, columns=['Location', 'Furnishing', 'Property_Type_Supergroup', 'Size_Type'])
# Apply Model to Make Prediction
prediction = model.predict(df)
I tried to search for solutions online but nothing works for my code. I hope someone can help.
It's a little difficult to tell without seeing the data you're fitting the model on. Between the error and your code, though, it seems you're fitting the model on a data frame with 67 features. The data frame you call fit on needs to have the same features as the data frame you call predict on.
Sorry if this answer is redundant; it is difficult to tell without seeing the data and the exact error.
"X has 8 features, but RandomForestRegressor is expecting 67 features as input."
I assume this is the standard dataset you used; after unzipping and loading, it contains the following files:
sample_submission.csv
test.csv
data_description.txt
train.csv
If you check the shapes of train.csv and test.csv:
train = pd.read_csv('./house_prices/train.csv')
test = pd.read_csv('./house_prices/test.csv')
print(f'Train shape : {train.shape}')
print(f'Test shape : {test.shape}')
#Train shape : (1460, 81)
#Test shape : (1459, 80)
That shows you deleted or dropped some columns/features and reduced them from 81 to 67, so no problem up to this point. The problem arises once you convert the categorical variables into numeric variables with pd.get_dummies() in the preprocessing stage: you then need to split that same encoded df into x_train and y_train to fit() your model, and finally predict on x_test via y_pred = model.predict(x_test). Otherwise the shape of df does not match X (one has 8 columns, the other 67 in your case)!
So I suggest splitting the df first:
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
# Choosing features (x) and target (y) for predicting the target variable
x = df.drop('Price', axis=1)
y = df['Price']
# Data split on df
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2 , random_state=42)
# Apply RandomForestRegressor
model = RandomForestRegressor(n_estimators=300, max_depth=13, random_state=0)
model.fit(x_train,y_train)
# Predicting the data using the model
y_pred = model.predict(x_test)
# Evaluating the model
print(metrics.r2_score(y_test,y_pred))
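A related point for the original setup: the user-input frame df must end up with exactly the dummy columns the model was trained on. Below is a minimal sketch assuming the Xy and df frames from the question; reindex() forces the training column layout and fills any dummy column missing from the user input with 0.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

cat_cols = ['Location', 'Furnishing', 'Property_Type_Supergroup', 'Size_Type']

# encode the training features and remember the resulting column layout
X = pd.get_dummies(Xy.drop('Price', axis=1), columns=cat_cols)
model = RandomForestRegressor().fit(X, Xy['Price'].values)

# encode the user input the same way, then force the same columns and order
df_encoded = pd.get_dummies(df, columns=cat_cols).reindex(columns=X.columns, fill_value=0)
prediction = model.predict(df_encoded)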
I'm having a problem with sklearn.
When I train it with .fit(), it raises the error ValueError: could not convert string to float: 'Casado'
This is my code:
"""
from sklearn.naive_bayes import GaussianNB
import pandas as pd
# 1. Create Naive Bayes classifier:
gaunb = GaussianNB()
# 2. Create dataset:
dataset = pd.read_csv("archivos_de_datos/Datos_Historicos_Clientes.csv")
X_train = dataset.drop(["Compra"], axis=1) #Here I removed the last column "Compra"
Y_train = dataset["Compra"] #This one only consists of that column "Compra"
print("X_train: ","\n", X_train)
print("Y_train: ","\n", Y_train)
dataset2 = pd.read_csv("archivos_de_datos/Nuevos_Clientes.csv")
X_test = dataset2.drop("Compra", axis=1)
print("X_test: ","\n", X_test)
# 3. Train classifier with dataset:
gaunb = gaunb.fit(X_train, Y_train) #Here shows "ValueError: could not convert string to float: 'Casado'"
# 4. Predict using classifier:
prediction = gaunb.predict(X_test)
print("PREDICTION: ",prediction)
"""
And the dataset I'm using is a .csv file that looks like this (but with more rows):
IdCliente,EstadoCivil,Profesion,Universitario,TieneVehiculo,Compra
1,Casado,Empresario,Si,No,No
2,Casado,Empresario,Si,Si,No
3,Soltero,Empresario,Si,No,Si
I'm trying to train it to determine (with a test dataset) whether the last column would be a Yes or No (Si or No)
I appreciate your help. I'm obviously new at this and I don't understand what I am doing wrong here.
I would use OneHotEncoder to, as Lavin mentioned, turn the yes/no values into numerical ones. A model such as this can't process categorical data directly.
OneHotEncoder can be used to handle binary data such as yes/no or male/female, while LabelEncoder can be used for categorical data with more than 2 values, e.g., country names.
It will look something like the code below; however, you'll have to do this for all categorical data, not just your y column, and use LabelEncoder for columns that are not binary (more than 2 values, for example EstadoCivil).
Also, I would suggest removing any input features that don't contribute to your model; for instance, IdCliente sounds like it may not add any value in determining your dependent variable. This is context specific, but something to keep in mind.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# [1, 2, 3, 4] are the categorical columns in your df (EstadoCivil, Profesion, Universitario, TieneVehiculo); adjust as needed
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1, 2, 3, 4])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
From the docs:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
More info:
https://contactsunny.medium.com/label-encoder-vs-one-hot-encoder-in-machine-learning-3fc273365621#:~:text=What%20one%20hot%20encoding%20does,which%20column%20has%20what%20value.&text=So%2C%20that's%20the%20difference%20between%20Label%20Encoding%20and%20One%20Hot%20Encoding.
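Applied to the CSV from the question, a minimal end-to-end sketch could look like this. The column names follow the header shown in the question, and dropping IdCliente is an assumption following the advice above; sparse_output requires scikit-learn 1.2+ (use sparse=False on older versions).

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.naive_bayes import GaussianNB

dataset = pd.read_csv("archivos_de_datos/Datos_Historicos_Clientes.csv")
X_train = dataset.drop(["Compra", "IdCliente"], axis=1)  # drop the target and the uninformative ID
Y_train = dataset["Compra"]

# one-hot encode every remaining (categorical) feature column
ct = ColumnTransformer([('encoder', OneHotEncoder(sparse_output=False), list(range(X_train.shape[1])))])
gaunb = GaussianNB().fit(ct.fit_transform(X_train), Y_train)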
I was using StandardScaler from sklearn.preprocessing. I fitted the scaler on a dataset with 27 features. Is it possible to use the same StandardScaler on a testing dataset that has fewer than 27 features? Code snippet:
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
sc.fit_transform(x_train)
Up to this point everything works fine. The problem arises when I try to transform my test dataset. I know why it is happening: the test dataset has only 24 features. But is it possible to transform just those 24 features and ignore the columns which are not present in it?
sc.transform(x_test)
Thanks in advance!!
If you want to select all features except the first 3, use DataFrame.iloc:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train.iloc[:, 3:] = sc.fit_transform(x_train.iloc[:, 3:])
print (x_train)
If the feature names are in a list, use the subset:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
features = ['col1','col2',..., 'col24']
x_train[features] = sc.fit_transform(x_train[features])
print (x_train)
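If instead you really need to reuse a scaler that was already fitted on all 27 columns, one option is to index into the scaler's learned statistics for the shared features. This is a sketch, assuming x_train and x_test are DataFrames with named columns:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
sc.fit(x_train)  # fitted on all 27 training columns

common = [c for c in x_test.columns if c in x_train.columns]  # the 24 shared features
idx = [x_train.columns.get_loc(c) for c in common]            # their positions in the fitted scaler

# standardize the test set with the statistics learned on the matching training columns
x_test[common] = (x_test[common] - sc.mean_[idx]) / sc.scale_[idx]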
The training dataset has object (string) columns, called shops and others. For the machine-learning model I converted those columns into labels for training, using the code below:
from sklearn.ensemble import RandomForestRegressor
X = df_all_4.copy()
y = df_all_4.item_price
X = X.drop(['item_price','date'], axis=1)
for c in df_all_4.columns[df_all_4.dtypes == 'object']:
    X[c] = X[c].factorize()[0]
rf = RandomForestRegressor()
rf.fit(X,y)
Now the testing dataset also has those categorical columns, but with some columns missing (including the target column, which I think is not relevant here). But if I factorize the testing dataset separately, the (unordered) labels would be different from the ones used during training, so the model would not work properly. How do I solve this problem and get the same encodings for training and testing?
The important thing here is that you can use the LabelEncoder or OneHotEncoder classes from the sklearn package, which make this task pretty simple:
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
for c in df_all_4.columns[df_all_4.dtypes == 'object']:
    le = LabelEncoder()
    X[c] = le.fit_transform(X[c])
    test[c] = le.transform(test[c])
That's it; you have encoded the labels into numbers for both the train and test data.
You can also use OneHotEncoder, which applies one-hot encoding to categorical data.
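One caveat: le.transform(test[c]) will raise an error if the test set contains a category that never appeared during fitting. A sketch of one way around it, where cat_cols is a hypothetical list of your categorical column names (sparse_output requires scikit-learn 1.2+; use sparse=False on older versions):

from sklearn.preprocessing import OneHotEncoder

cat_cols = ['shops']  # hypothetical: your categorical column names

# handle_unknown='ignore' encodes unseen test categories as all-zero rows instead of raising
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
X_cat = ohe.fit_transform(X[cat_cols])
test_cat = ohe.transform(test[cat_cols])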
I have three classes (setosa, versicolor, virginica) and 4 other columns, sepal_length, sepal_width, petal_length and petal_width, with around 150 rows, each filled with its own information (so nothing is empty there). I need to predict the class based on the other columns.
This is what I have tried:
import numpy as np
import pandas as pd
df = pd.read_csv("data.csv")
X=df[["sepal_length","sepal_width","petal_length","petal_width"]]
y=df["class"]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.1)
from sklearn.linear_model import LinearRegression
clf=LinearRegression()
clf.fit(y_train, X_train)
clf.predict(y_test)
The traceback reports this problem:
ValueError: could not convert string to float: 'virginica'
I need to do this with train and test.
You need to encode your data; in other words, transform each category into a number (int or float).
Map the categories like this:
mapping = {'setosa': 0, 'versicolor': 1, 'virginica': 2}
y = y.map(mapping)  # assign the result back; map() does not modify y in place
After you train your model, you will get 0,1 or 2 as a result. Convert it back and you'll have your predictions.
And by the way, if you are predicting a class, you must change your model: LinearRegression() is a numerical predictor and can only predict numerical values.
Try SVC, LogisticRegression, or any other classification model instead.
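Also note that the code in the question calls clf.fit(y_train, X_train) and clf.predict(y_test); fit expects the features first and the labels second, and predict expects features. A minimal corrected sketch with a classifier:

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)          # features first, labels second
predictions = clf.predict(X_test)  # predict from features, not labels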