I have tried to run an SVM program, and I got the above error. The code is below. Please point out the error in it.
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split
data = pd.read_csv('risk_factors_cervical_cancer.csv')
X = np.array(data[[#some data elements]])
y = np.array(data[#some data elements])
print(X)
print(y)
print(X.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
random_state=30)
classifier = svm.SVC()
classifier.fit(X_train, y_train) #the error occurs here
y_pred = svm.predict(X_test)
acc = accuracy_score(y_test, y_pred)
As @Guimoute wrote, preprocessing your data is always necessary before training any machine learning algorithm on it. Try data.head(10) to get a first look at the data you are using. Your error occurs because there is a value "?" in your X columns. Replace it with some reasonable number, e.g. the mean of the column, to get better results.
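For example, a minimal sketch of that kind of cleaning (the column names below are placeholders, since the actual feature selection is elided in the post):
import pandas as pd
# Reading with na_values="?" turns the "?" entries into NaN up front;
# they can then be filled with each column's mean before fitting the SVM.
data = pd.read_csv('risk_factors_cervical_cancer.csv', na_values='?')
feature_cols = ['Age', 'Number of sexual partners']   # hypothetical subset of columns
data[feature_cols] = data[feature_cols].fillna(data[feature_cols].mean())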
This question may look silly, but I could not figure it out, so I need your help.
I used random forest to predict the result and wrote the following code:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
from sklearn.ensemble import RandomForestRegressor
# create regressor object
regressor = RandomForestRegressor(n_estimators = 100, random_state = 0)
# fit the regressor with x and y data
regressor.fit(X_train, y_train)
Y_pred = regressor.predict(X_test)
Y_pred is the result for a given X_test. Now, I would like to create a data frame of my Y_pred and y_test data and save it into CSV format.
Any idea how to do this?
It turned out to be quite simple; it just clicked in my mind. It can be done this way:
df_new = pd.DataFrame({'x':Y_pred, 'y':y_test})
df_new.head()
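To also save it in CSV format, as asked, pandas' to_csv can write the frame to disk; 'predictions.csv' is an arbitrary file name here:
df_new.to_csv('predictions.csv', index=False)  # index=False drops the row-index column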
I am getting 100% accuracy in multiple linear regression. I am following a tutorial from last year. The author does not get 100% accuracy with the same model, but I do now. That seems weird to me. Here's my code. Am I doing it right, or is there something wrong with my code?
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('M_Regression.csv')
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, :1].values
from sklearn.model_selection import train_test_split
X_train, x_test, Y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
#regression
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X_train,Y_train)
#Prediction
y_pred = reg.predict(x_test)
print(str(y_test) + " - " + str(y_pred))
Linear regression on simple numeric data like this commonly reaches 100% accuracy, even on a large dataset. Try it with other datasets as well. I tried your code and also got an accuracy of 1.0.
To check the accuracy of your model, you could try printing the R² score of your test sample. Something along the lines of:
from sklearn.metrics import r2_score
print(r2_score(y_test,y_pred))
If you still have issues with the score, you could try removing random_state=0 to check whether you still get 100% accuracy with several different train/test splits.
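For example, a quick sketch (assuming X and Y are the arrays from the question) that repeats the split with a few different random states and prints the R² score each time:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
# Repeat the split with different seeds and check whether the score stays at 1.0.
for seed in [1, 7, 42]:
    X_train, x_test, Y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=seed)
    score = r2_score(y_test, LinearRegression().fit(X_train, Y_train).predict(x_test))
    print(seed, score)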
I'm trying to split the data I am working with into training and testing sets but I get the error that n_samples = 0 when I use the train_test_split function.
Here's my code:
X_train, X_test, y_train, y_test = model_selection.train_test_split(summary, labels, test_size=0.35)
summary and labels are lists and after converting them to arrays this is the shape I get:
(1248,)
(1248,)
They both have 1248 values. Can someone tell me why it's not working? Thanks.
Error Message:
With n_samples=0, test_size=0.35 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters
Works for me, check if this works for you:
from sklearn.model_selection import train_test_split
import numpy as np
# dummy examples
summary, labels = np.arange(0,1248), np.arange(0,1248)
X_train, X_test, y_train, y_test = train_test_split(summary, labels, test_size=0.35)
Test with a string list:
summary, labels = ["hello"]*1248, ["test"]*1248
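To double-check, a quick sanity check (assuming the string lists defined above) that reruns the split and prints the resulting sizes:
X_train, X_test, y_train, y_test = train_test_split(summary, labels, test_size=0.35)
print(len(X_train), len(X_test))   # roughly 65% / 35% of the 1248 samples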
I am facing a challenge finding the Mean Absolute Error (MAE) using Pipeline and GridSearchCV.
Background:
I have worked on a data science project (MWE below) where an MAE value is returned as a classifier's performance metric.
#Library
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
#Data import and preparation
data = pd.read_csv("data.csv")
data_features = ['location','event_type_count','log_feature_count','total_volume','resource_type_count','severity_type']
X = data[data_features]
y = data.fault_severity
#Train Validation Split for Cross Validation
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
#RandomForest Modeling
RF_model = RandomForestClassifier(n_estimators=100, random_state=0)
RF_model.fit(X_train, y_train)
#RandomForest Prediction
y_predict = RF_model.predict(X_valid)
#MAE
print(mean_absolute_error(y_valid, y_predict))
#Output:
# 0.38727149627623564
Challenge:
Now I am trying to implement the same using Pipeline and GridSearchCV (MWE below). The expectation is that the same MAE value would be returned as above. Unfortunately, I could not get it right using any of the three approaches below.
#Library
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
#Data import and preparation
data = pd.read_csv("data.csv")
data_features = ['location','event_type_count','log_feature_count','total_volume','resource_type_count','severity_type']
X = data[data_features]
y = data.fault_severity
#Train Validation Split for Cross Validation
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
#RandomForest Modeling via Pipeline and Hyper-parameter tuning
steps = [('rf', RandomForestClassifier(random_state=0))]
pipeline = Pipeline(steps) # define the pipeline object.
parameters = {'rf__n_estimators':[100]}
grid = GridSearchCV(pipeline, param_grid=parameters, scoring='neg_mean_squared_error', cv=None, refit=True)
grid.fit(X_train, y_train)
#Approach 1:
print(grid.best_score_)
# Output:
# -0.508130081300813
#Approach 2:
y_predict=grid.predict(X_valid)
print("score = %3.2f"%(grid.score(y_predict, y_valid)))
# Output:
# ValueError: Expected 2D array, got 1D array instead:
# array=[0. 0. 0. ... 0. 1. 0.].
# Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
#Approach 3:
y_predict_df = pd.DataFrame(y_predict.reshape(len(y_predict), -1),columns=['fault_severity'])
print("score = %3.2f"%(grid.score(y_predict_df, y_valid)))
# Output:
# ValueError: Number of features of the model must match the input. Model n_features is 6 and input n_features is 1
Discussion:
Approach 1:
Since the scoring parameter in GridSearchCV() is set to neg_mean_squared_error, I tried reading grid.best_score_. But it did not give the same MAE result.
Approach 2:
I tried to get the y_predict values using grid.predict(X_valid), and then tried to get the MAE using grid.score(y_predict, y_valid), since the scoring parameter in GridSearchCV() is set to neg_mean_squared_error. It returned a ValueError complaining "Expected 2D array, got 1D array instead".
Approach 3:
I tried reshaping y_predict, and it did not work either. This time it returned "ValueError: Number of features of the model must match the input."
It would be helpful if you could point out where I made the error.
If you need, the data.csv is available at https://www.dropbox.com/s/t1h53jg1hy4x33b/data.csv
Thank you very much
You are trying to compare mean_absolute_error with neg_mean_squared_error, which is a very different metric; refer here for more details. You should use neg_mean_absolute_error when creating your GridSearchCV object, as shown below:
grid = GridSearchCV(pipeline, param_grid=parameters,scoring='neg_mean_absolute_error', cv=None, refit=True)
Also, the score method in sklearn takes (X, y) as inputs, where X is the input features of shape (n_samples, n_features) and y is the target labels. You need to change grid.score(y_predict, y_valid) to grid.score(X_valid, y_valid).
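Putting both fixes together (same variable names as in the MWE above), a sketch that should recover the original MAE, up to the sign flip introduced by the 'neg_' scorer:
from sklearn.metrics import mean_absolute_error
# Score with negated MAE instead of negated MSE, then evaluate on the
# validation features rather than on the predictions.
grid = GridSearchCV(pipeline, param_grid=parameters, scoring='neg_mean_absolute_error', cv=None, refit=True)
grid.fit(X_train, y_train)
print(grid.score(X_valid, y_valid))                           # negated MAE on the validation split
print(mean_absolute_error(y_valid, grid.predict(X_valid)))    # same value with a positive sign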
I have a dataset on which I'm trying to run linear regression using sklearn.
The dataset I'm using is ready-made, so there are not supposed to be any problems with it.
I have used train_test_split in order to split my data into train and test groups.
When I try to use matplotlib to create a scatter plot of my test and prediction groups, I get the following error:
ValueError: x and y must be the same size
This is my code:
y=data['Yearly Amount Spent']
x=data[['Avg. Session Length','Time on App','Time on Website','Length of Membership','Yearly Amount Spent']]
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=101)
#training the model
from sklearn.linear_model import LinearRegression
lm=LinearRegression()
lm.fit(x_train,y_train)
lm.coef_
predictions=lm.predict(X_test)
#here the problem starts:
plt.scatter(y_test,predictions)
Why does this error occur?
I have seen previous posts here, and the suggestion was to check x.shape and y.shape, but I'm not sure what the purpose of that is.
Thanks
It seems that you are using the EcommerceCustomers.csv dataset (link here)
In your original post, the column 'Yearly Amount Spent' is included in X as well as being the target y, but this is wrong.
The following should work fine:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
data = pd.read_csv("EcommerceCustomers.csv")
y = data['Yearly Amount Spent']
X = data[['Avg. Session Length', 'Time on App','Time on Website', 'Length of Membership']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
# ## Training the Model
lm = LinearRegression()
lm.fit(X_train,y_train)
# The coefficients
print('Coefficients: \n', lm.coef_)
# ## Predicting Test Data
predictions = lm.predict( X_test)
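With X and y separated this way, y_test and predictions have the same length, so the scatter plot from the original post should work:
import matplotlib.pyplot as plt
# Predicted vs. actual values; points close to a diagonal indicate a good fit.
plt.scatter(y_test, predictions)
plt.xlabel('Actual Yearly Amount Spent')
plt.ylabel('Predicted Yearly Amount Spent')
plt.show()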
See also this