I am a beginner in Python and machine learning, working on a project with the Titanic dataset. After splitting the dataset into training and test sets, I tried to normalize x_train using StandardScaler, but I keep getting this error:
ValueError: could not convert string to float: 'PassengerId'
This is my code:
feature = df[['PassengerId', 'PClass', 'Age', 'SibSp', 'Parch']].values
target = df[['Survived']].values
x_train, x_test, y_train, y_test = train_test_split(feature, target, test_size=0.2)
from sklearn.preprocessing import StandardScaler, OneHotEncoder
scaler = StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
How can I solve this?
# It's a good idea to exclude PassengerId, since it's probably not a predictive feature.
# Also convert all the values to float.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
feature = df[['PClass', 'Age', 'SibSp', 'Parch']].astype('float').values
target = df[['Survived']].values
x_train, x_test, y_train, y_test = train_test_split(feature, target, test_size=0.2)
# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
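One possible reading of the error, which is an assumption because the data-loading code isn't shown, is that a header row ended up inside the data, so the string 'PassengerId' sits in a column that should be numeric. In that case astype('float') would fail on the remaining columns for the same reason. Here is a minimal sketch on a made-up toy frame (the rows and values are hypothetical) that checks the dtypes and coerces non-numeric entries with pd.to_numeric before scaling:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame: the first data row repeats the column names,
# which is one way the string 'PassengerId' can end up among the values.
columns = ["PassengerId", "PClass", "Age", "SibSp", "Parch", "Survived"]
df = pd.DataFrame(
    [
        columns,
        ["1", "3", "22", "1", "0", "0"],
        ["2", "1", "38", "1", "0", "1"],
        ["3", "3", "26", "0", "0", "1"],
        ["4", "1", "35", "1", "0", "1"],
    ],
    columns=columns,
)

# 'object' columns hold strings rather than numbers.
print(df.dtypes)

feature_cols = ["PClass", "Age", "SibSp", "Parch"]
# Turn anything that is not a number into NaN, then drop those rows.
numeric = df[feature_cols].apply(pd.to_numeric, errors="coerce").dropna()

scaled = StandardScaler().fit_transform(numeric.values)
print(scaled.mean(axis=0))  # close to 0 for every column
```

If df.dtypes already shows numeric columns, the cause is something else and this step is unnecessary.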
I want to merge my predicted results of my test data to my X_test. I was able to merge it with y_test but since my X_test is a corpus I'm not sure how I can identify the indexes to merge.
My code is below:
def lr_model(df):
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    import numpy as np
    import pandas as pd
    # Create corpus as a list
    corpus = df['text'].tolist()
    cv = CountVectorizer()
    X = cv.fit_transform(corpus).toarray()
    y = df.iloc[:, -1].values
    # Splitting to testing and training
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
    # Train Logistic Regression on Training set
    classifier = LogisticRegression(random_state = 0)
    classifier.fit(X_train, y_train)
    # Predicting the Test set results
    y_pred = classifier.predict(X_test)
    # Merge true vs predicted labels
    true_vs_pred = pd.DataFrame(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))
    return true_vs_pred
This gives me the y_test and y_pred but I'm not sure how I can add the X_test as an original data frame (the ids of the X_test) to this.
Any guidance is much appreciated. Thanks
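Using a pipeline can help you link the original X_test with the predictions, because the split is done on the raw text column (a pandas Series that keeps its original index) and the vectorizer runs inside the pipeline: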
def lr_model(df):
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    import pandas as pd
    from sklearn.pipeline import Pipeline
    # Defining X and y
    cv = CountVectorizer()
    X = df['text']
    y = df.iloc[:, -1].values
    # Splitting to testing and training
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
    # Create a pipeline
    pipeline = Pipeline([
        ('CountVectorizer', cv),
        ('LogisticRegression', LogisticRegression(random_state = 0)),
    ])
    # Train pipeline on Training set
    pipeline.fit(X_train, y_train)
    # Predicting the Test set results
    y_pred = pipeline.predict(X_test)
    return X_test, y_test, y_pred
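Since X_test is a pandas Series here, it still carries the row labels of the original data frame, so the three returned pieces can be put side by side. A minimal sketch of that last step, assuming a made-up toy data frame with a 'text' column and the label in the last column (the data and the name results are hypothetical):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy data; the label is the last column, as in the question.
df = pd.DataFrame({
    "text": ["good movie", "bad movie", "great film", "awful film",
             "good film", "bad acting", "great acting", "awful movie",
             "good acting", "bad film"],
    "label": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})

X = df["text"]               # a Series keeps the original row index
y = df.iloc[:, -1].values    # plain NumPy array of labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

pipeline = Pipeline([
    ("CountVectorizer", CountVectorizer()),
    ("LogisticRegression", LogisticRegression(random_state=0)),
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# Reuse X_test's index so every row can be traced back to df.
results = pd.DataFrame(
    {"text": X_test, "true": y_test, "pred": y_pred},
    index=X_test.index,
)
print(results)
```

The index of results matches the row labels of df, so df.loc[results.index] recovers any other columns (ids, metadata) for the test rows.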
I have numpy arrays split into X and y, originally made from Pandas DataFrame as follows:
>> X
array([[ 2.86556780e-03, 1.87100798e-01],
[ 2.56781670e-04, 2.45417491e-01],
[ 2.35497137e-03, 1.76615342e-01],
...,
[ 2.30078468e-03, -4.16726811e-60],
[ 5.66213972e-03, -2.98597808e-60],
[ 4.39503905e-03, -2.13954678e-60]])
>> y
array([19.08666992, 19.09239006, 19.08938026, ..., 45.21157634,
45.19350761, 45.13230675])
I split them into training and test dataset as follows:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Before scaling the data, I reshape my labels as follows:
y_train= y_train.reshape((-1,1))
y_test= y_test.reshape((-1,1))
Using sklearn MinMaxScaler I then fit_transform my training_data as follows:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
y_train = scaler.fit_transform(y_train)
I then try to transform my test data using MinMaxScaler as follows:
X_test = scaler.transform(X_test)
y_test = scaler.transform(y_test)
But the test dataset is not transformed; instead I get the following error:
----> 1 X_test = scaler.transform(X_test)
ValueError: X has 2 features, but MinMaxScaler is expecting 1 features as input.
Can anyone tell me what I am doing wrong here?
This is because the second fit_transform call refits the same scaler on y_train, which has a single feature, whereas X_test has 2 features.
You have to define separate scaler objects for X and y:
scaler_X = MinMaxScaler()
scaler_Y = MinMaxScaler()
X_train = scaler_X.fit_transform(X_train)
y_train = scaler_Y.fit_transform(y_train)
X_test = scaler_X.transform(X_test)
y_test = scaler_Y.transform(y_test)
Another way to do the same job is to reuse a single scaler but reorder the calls, so that it is fit to X_train and used to transform X_test before it is refit to y_train and used to transform y_test:
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
y_train = scaler.fit_transform(y_train)
y_test = scaler.transform(y_test)
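One practical difference between the two variants: with a single reused scaler, the fit for X is gone once the scaler is refit on y, whereas separate scalers can both be used later, for example to map scaled targets back to the original units. A minimal sketch with synthetic data (the random arrays and the _s suffixed names are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-ins for the arrays in the question.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = rng.uniform(19, 46, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)

# One scaler per array, each fit on the training split only.
scaler_X = MinMaxScaler()
scaler_Y = MinMaxScaler()
X_train_s = scaler_X.fit_transform(X_train)
X_test_s = scaler_X.transform(X_test)
y_train_s = scaler_Y.fit_transform(y_train)
y_test_s = scaler_Y.transform(y_test)

# Each scaler remembers how many features it was fit on.
print(scaler_X.n_features_in_, scaler_Y.n_features_in_)  # 2 1

# The y scaler can undo its own transform, e.g. for predictions.
y_back = scaler_Y.inverse_transform(y_test_s)
```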
I am starting to write a machine learning model. I have a Y_train dataset containing the labels, with 5 classes. The X_train dataset contains the samples. I am trying to build my model with logistic regression.
X_train (shape (560, 20531)) and Y_train (shape (560, 5)) have the same number of rows.
I have seen a few posts about the same problem, but I have not been able to solve it.
I don't know how to correct this error. Can you help me, please?
X = pd.read_csv('/Users/lottie/desktop/data.csv', header=None, skiprows=[0])
Y = pd.read_csv('/Users/lottie/desktop/labels.csv', header=None)
Y_encoded = list()
for i in Y.loc[0:,1] :
    if i == 'BRCA' : Y_encoded.append(0)
    if i == 'KIRC' : Y_encoded.append(1)
    if i == 'COAD' : Y_encoded.append(2)
    if i == 'LUAD' : Y_encoded.append(3)
    if i == 'PRAD' : Y_encoded.append(4)
Y_bis = to_categorical(Y_encoded)
#separation of the data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y_bis, test_size=0.30, random_state=42)
regression_log = linear_model.LogisticRegression(multi_class='multinomial', solver='newton-cg')
X_train=X_train.iloc[:,1:]
#train model
train_train = regression_log.fit(X_train, Y_train)
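You get that error because to_categorical one-hot encodes your labels into a 2-D array of shape (560, 5), while LogisticRegression expects a 1-D array of class labels. Use a label encoder to encode the labels as 0, 1, 2, ... instead; see the LabelEncoder page in the scikit-learn documentation. Below is an implementation using an example dataset similar to yours: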
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
Y = pd.DataFrame({'label':np.random.choice(['BRCA','KIRC','COAD','LUAD','PRAD'],560)})
X = pd.DataFrame(np.random.normal(0,1,(560,5)))
Y_encoded = le.fit_transform(Y['label'])
X_train, X_test, Y_train, Y_test = train_test_split(X, Y_encoded, test_size=0.30, random_state=42)
regression_log = linear_model.LogisticRegression(multi_class='multinomial', solver='newton-cg')
X_train=X_train.iloc[:,1:]
train_train = regression_log.fit(X_train, Y_train)
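To make the encoding step concrete, here is a small sketch of what LabelEncoder produces for these five class names, and how labels that are already one-hot encoded could be collapsed back to integers with argmax. The short labels array is made up for illustration. Note that LabelEncoder numbers the classes in sorted order, which differs from the manual mapping in the question (there KIRC was 1 and COAD was 2).

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# A few made-up labels using the class names from the question.
labels = np.array(["BRCA", "KIRC", "COAD", "LUAD", "PRAD", "BRCA"])

le = LabelEncoder()
encoded = le.fit_transform(labels)

print(le.classes_)    # classes in sorted order
print(encoded)        # 1-D integer array, one entry per sample
print(encoded.shape)  # (6,)

# Map integer predictions back to the original class names.
decoded = le.inverse_transform(encoded)

# If the labels are already one-hot (as to_categorical produces),
# argmax along axis 1 recovers the integer class per row.
one_hot = np.eye(len(le.classes_))[encoded]
recovered = one_hot.argmax(axis=1)
```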
This is for an assignment where an SVM has to be used and the model accuracy evaluated.
There were 3 parts. I wrote the code below:
import sklearn.datasets as datasets
import sklearn.model_selection as ms
from sklearn.model_selection import train_test_split
digits = datasets.load_digits();
X = digits.data
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=30, stratify=y)
print(X_train.shape)
print(X_test.shape)
from sklearn.svm import SVC
svm_clf = SVC().fit(X_train, y_train)
print(svm_clf.score(X_test,y_test))
After this, the next part of the question is as follows:
Perform Standardization of digits.data and store the transformed data
in variable digits_standardized.
Hint : Use required utility from sklearn.preprocessing. Once again,
split digits_standardized into two sets named X_train and X_test.
Also, split digits.target into two sets Y_train and Y_test.
Hint: Use train_test_split method from sklearn.model_selection; set
random_state to 30; and perform stratified sampling. Build another SVM
classifier from X_train set and Y_train labels, with default
parameters. Name the model as svm_clf2.
Evaluate the model accuracy on testing data set and print its score.
On top of the above code, I tried writing this, but it seems to be failing. Can anyone help with how the data can be standardized?
std_scale = preprocessing.StandardScaler().fit(X_train)
X_train_std = std_scale.transform(X_train)
X_test_std = std_scale.transform(X_test)
svm_clf2 = SVC().fit(X_train, y_train)
print(svm_clf.score(X_test,y_test))
I tried the code below, and it seems to work.
import sklearn.datasets as datasets
import sklearn.model_selection as ms
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
digits = datasets.load_digits();
X = digits.data
scaler = StandardScaler()
scaler.fit(X)
digits_standardized = scaler.transform(X)
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(digits_standardized, y, random_state=30, stratify=y)
#print(X_train.shape)
#print(X_test.shape)
from sklearn.svm import SVC
svm_clf2 = SVC().fit(X_train, y_train)
print("Accuracy ",svm_clf2.score(X_test,y_test))
Try this final code, which includes all the tasks:
import sklearn.datasets as datasets
import sklearn.model_selection as ms
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
digits = datasets.load_digits()
X = digits.data
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=30, stratify=y)
print(X_train.shape)
print(X_test.shape)
svm_clf = SVC().fit(X_train, y_train)
print(svm_clf.score(X_test,y_test))
scaler = StandardScaler()
scaler.fit(X)
digits_standardized = scaler.transform(X)
X_train, X_test, y_train, y_test = train_test_split(digits_standardized, y, random_state=30, stratify=y)
svm_clf2 = SVC().fit(X_train, y_train)
print(svm_clf2.score(X_test,y_test))
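The assignment asks for the whole of digits.data to be standardized before splitting, so the code above does exactly that. Outside the assignment, a common alternative is to fit the scaler on the training split only, as in the Titanic answer earlier. A hedged sketch of that variant using make_pipeline (this is not what the assignment requires, and the names model and score are made up here):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=30, stratify=digits.target
)

# The pipeline fits StandardScaler on X_train only, then reuses the
# same scaling when scoring on X_test.
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_train, y_train)

score = model.score(X_test, y_test)
print(score)
```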
Load popular digits dataset from sklearn.datasets module and assign it to variable digits.
Split digits.data into two sets named X_train and X_test. Also, split digits.target into two sets Y_train and Y_test.
Hint: Use train_test_split() method from sklearn.model_selection; set random_state to 30; and perform stratified sampling.
Build an SVM classifier from X_train set and Y_train labels, with default parameters. Name the model as svm_clf.
Evaluate the model accuracy on the testing data set and print its score.
I used the following code:
import sklearn.datasets as datasets
import sklearn.model_selection as ms
from sklearn.model_selection import train_test_split
digits = datasets.load_digits();
X = digits.data
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=30)
print(X_train.shape)
print(X_test.shape)
from sklearn.svm import SVC
svm_clf = SVC().fit(X_train, y_train)
print(svm_clf.score(X_test,y_test))
I got the below output.
(1347,64)
(450,64)
0.4088888888888889
But I am not able to pass the test. Can someone help with what is wrong?
You are missing the stratified sampling requirement; modify your split to include it:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=30, stratify=y)
See the train_test_split documentation for the stratify parameter.
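To see what stratify=y changes, you can compare the per-class counts in the test split with and without it. A small sketch on the same digits data (the variable names y_test_plain and y_test_strat are made up for this illustration):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X, y = digits.data, digits.target

# Same random_state, with and without stratification.
_, _, _, y_test_plain = train_test_split(X, y, random_state=30)
_, _, _, y_test_strat = train_test_split(X, y, random_state=30, stratify=y)

# Number of samples per digit class (0 to 9).
print("full data: ", np.bincount(y, minlength=10))
print("plain:     ", np.bincount(y_test_plain, minlength=10))
print("stratified:", np.bincount(y_test_strat, minlength=10))

# With stratify=y, each class keeps roughly its share of the data.
expected = np.bincount(y, minlength=10) * len(y_test_strat) / len(y)
```

Because the two calls pick different rows, the resulting train and test sets differ, which is presumably why the grader rejects the unstratified split.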