Machine Learning Using Scikit-Learn & SVM - python

Load popular digits dataset from sklearn.datasets module and assign it to variable digits.
Split digits.data into two sets names X_train and X_test. Also, split digits.target into two sets Y_train and Y_test.
Hint: Use train_test_split() method from sklearn.model_selection; set random_state to 30; and perform stratified sampling.
Build an SVM classifier from X_train set and Y_train labels, with default parameters. Name the model as svm_clf.
Evaluate the model accuracy on the testing data set and print its score.
I used the following code:
import sklearn.datasets as datasets
import sklearn.model_selection as ms
from sklearn.model_selection import train_test_split
digits = datasets.load_digits();
X = digits.data
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=30)
print(X_train.shape)
print(X_test.shape)
from sklearn.svm import SVC
svm_clf = SVC().fit(X_train, y_train)
print(svm_clf.score(X_test,y_test))
I got the below output.
(1347,64)
(450,64)
0.4088888888888889
But I am not able to pass the test. Can someone help with what is wrong?

You are missing the stratified sampling requirement; modify your split to include it:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=30, stratify=y)
Check the documentation.

Related

Standardized data of SVM - Scikit-learn/ Python

This is for an assignment where the SVM methods has to be used for model accuracy.
There were 3 parts, wrote the below code
import sklearn.datasets as datasets
import sklearn.model_selection as ms
from sklearn.model_selection import train_test_split
digits = datasets.load_digits();
X = digits.data
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=30, stratify=y)
print(X_train.shape)
print(X_test.shape)
from sklearn.svm import SVC
svm_clf = SVC().fit(X_train, y_train)
print(svm_clf.score(X_test,y_test))
But after this, the question is as below
Perform Standardization of digits.data and store the transformed data
in variable digits_standardized.
Hint : Use required utility from sklearn.preprocessing. Once again,
split digits_standardized into two sets names X_train and X_test.
Also, split digits.target into two sets Y_train and Y_test.
Hint: Use train_test_split method from sklearn.model_selection; set
random_state to 30; and perform stratified sampling. Build another SVM
classifier from X_train set and Y_train labels, with default
parameters. Name the model as svm_clf2.
Evaluate the model accuracy on testing data set and print it's score.
On top of the above code, tried writing this, but seems to be failing. Can anyone help on how the data can be standardized.
std_scale = preprocessing.StandardScaler().fit(X_train)
X_train_std = std_scale.transform(X_train)
X_test_std = std_scale.transform(X_test)
svm_clf2 = SVC().fit(X_train, y_train)
print(svm_clf.score(X_test,y_test))
Tried the below. Seems to be working.
import sklearn.datasets as datasets
import sklearn.model_selection as ms
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
digits = datasets.load_digits();
X = digits.data
scaler = StandardScaler()
scaler.fit(X)
digits_standardized = scaler.transform(X)
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(digits_standardized, y, random_state=30, stratify=y)
#print(X_train.shape)
#print(X_test.shape)
from sklearn.svm import SVC
svm_clf2 = SVC().fit(X_train, y_train)
print("Accuracy ",svm_clf2.score(X_test,y_test))
Try this as final code includes all Tasks
import sklearn.datasets as datasets
import sklearn.model_selection as ms
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
digits = datasets.load_digits()
X = digits.data
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=30, stratify=y)
print(X_train.shape)
print(X_test.shape)
svm_clf = SVC().fit(X_train, y_train)
print(svm_clf.score(X_test,y_test))
scaler = StandardScaler()
scaler.fit(X)
digits_standardized = scaler.transform(X)
X_train, X_test, y_train, y_test = train_test_split(digits_standardized, y, random_state=30, stratify=y)
svm_clf2 = SVC().fit(X_train, y_train)
print(svm_clf2.score(X_test,y_test))

How do I properly fit a sci-kit learn model using a pandas dataframe?

I am trying to create a machine learning program in sci-kit learn. I am using a CSV file to store data, and have decided to use Pandas data frame to import and format this data. I cannot figure out how to fit this data frame with the model.
My CSV file has one feature, age, and one target, weight. I am using a linear regression algorithm to predict the weight using the age. I do realize this isn't the best algorithm to use with this data.
When I run this code I get the error "ValueError: Found input variables with inconsistent numbers of samples: [10, 40]"
Here is my code:
# Imports
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Load And Split Data
data = pd.read_csv("awd.csv")
feature_cols = ['Ages']
X = data.loc[:, feature_cols]
y = data.loc[:, "Weights"]
X_train, y_train, X_test, y_test = train_test_split(X, y, random_state=0, train_size=0.2)
# Train Model
lr = LinearRegression()
lr.fit(X_train, y_train)
# Scores
print(f"Test set score: {round(lr.score(X_test, y_test), 3)}")
print(f"Training set score: {round(lr.score(X_train, y_train), 3)}")
The first 5 lines of my CSV file:
Ages,Weights
1,19
1,21
2,26
2,32
You're assigning the return values incorrectly. See below:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, train_size=0.2)
You should correct the order of X_train, X_test, y_train and y_test like this:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
See the relevant documentation for details.

How to perform SMOTE with cross validation in sklearn in python

I have a highly imbalanced dataset and would like to perform SMOTE to balance the dataset and perfrom cross validation to measure the accuracy. However, most of the existing tutorials make use of only single training and testing iteration to perfrom SMOTE.
Therefore, I would like to know the correct procedure to perfrom SMOTE using cross-validation.
My current code is as follows. However, as mentioned above it only uses single iteration.
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
sm = SMOTE(random_state=2)
X_train_res, y_train_res = sm.fit_sample(X_train, y_train.ravel())
clf_rf = RandomForestClassifier(n_estimators=25, random_state=12)
clf_rf.fit(x_train_res, y_train_res)
I am happy to provide more details if needed.
You need to perform SMOTE within each fold. Accordingly, you need to avoid train_test_split in favour of KFold:
from sklearn.model_selection import KFold
from imblearn.over_sampling import SMOTE
from sklearn.metrics import f1_score
kf = KFold(n_splits=5)
for fold, (train_index, test_index) in enumerate(kf.split(X), 1):
X_train = X[train_index]
y_train = y[train_index] # Based on your code, you might need a ravel call here, but I would look into how you're generating your y
X_test = X[test_index]
y_test = y[test_index] # See comment on ravel and y_train
sm = SMOTE()
X_train_oversampled, y_train_oversampled = sm.fit_sample(X_train, y_train)
model = ... # Choose a model here
model.fit(X_train_oversampled, y_train_oversampled )
y_pred = model.predict(X_test)
print(f'For fold {fold}:')
print(f'Accuracy: {model.score(X_test, y_test)}')
print(f'f-score: {f1_score(y_test, y_pred)}')
You can also, for example, append the scores to a list defined outside.
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import SMOTE
cv = StratifiedKFold(n_splits=5)
for train_idx, test_idx, in cv.split(X, y):
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
X_train, y_train = SMOTE().fit_sample(X_train, y_train)
....
I think you can also solve this with a pipeline from the imbalanced-learn library.
I saw this solution in a blog called Machine Learning Mastery https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/
The idea is to use a pipeline from imblearn to do the cross-validation. Please, let me know if that works. The example below is with a decision tree, but the logic is the same.
#decision tree evaluated on imbalanced dataset with SMOTE oversampling
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# define pipeline
steps = [('over', SMOTE()), ('model', DecisionTreeClassifier())]
pipeline = Pipeline(steps=steps)
# evaluate pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
score = mean(scores))

Cross Validation Python Sklearn

I want to do Cross Validation on my SVM classifier before using it on the actual test set. What I want to ask is do I do the cross validation on the original dataset or on the training set, which is the result of train_test_split() function?
import pandas as pd
from sklearn.model_selection import KFold,train_test_split,cross_val_score
from sklearn.svm import SVC
df = pd.read_csv('dataset.csv', header=None)
X = df[:,0:10]
y = df[:,10]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=40)
kfold = KFold(n_splits=10, random_state=seed)
svm = SVC(kernel='poly')
results = cross_val_score(svm, X, y, cv=kfold) #Cross validation on original set
or
import pandas as pd
from sklearn.model_selection import KFold,train_test_split,cross_val_score
from sklearn.svm import SVC
df = pd.read_csv('dataset.csv', header=None)
X = df[:,0:10]
y = df[:,10]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=40)
kfold = KFold(n_splits=10, random_state=seed)
svm = SVC(kernel='poly')
results = cross_val_score(svm, X_train, y_train, cv=kfold) #Cross validation on training set
It is best to always reserve a test set that is only used once you are satisfied with your model, right before deploying it. So do your train/test split, then set the testing set aside. We will not touch that.
Perform the cross-validation only on the training set. For each of the k folds you will use a part of the training set to train, and the rest as a validations set. Once you are satisfied with your model and your selection of hyper-parameters. Then use the testing set to get your final benchmark.
Your second block of code is correct.

"Inconsistent numbers of samples" - scikit - learn

I'm learning some basics in machine learning in Python (scikit - learn) and when I tried to implement the K-nearest neighbors algorithm an error occurs: ValueError: Found input variables with inconsistent numbers of samples: [426, 143]. I have no idea how to deal with it.
This is my code:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
cancer = load_breast_cancer()
X_train, y_train, X_test, y_test = train_test_split(cancer.data,cancer.target,
stratify =
cancer.target,
random_state = 0)
clf = KNeighborsClassifier(n_neighbors = 6)
clf.fit(X_train, y_train)`
train_test_split returns a tuple in the order X_train, X_test, y_train, y_test
You've assigned the return values to the wrong variables so you are fitting with the training data and the test data instead of the training data and the training labels.
It should be
X_train, X_test, y_train, y_test = train_test_split()

Categories

Resources