MATLAB equivalent of Python's sklearn train_test_split function?

How can I get in MATLAB the equivalent of the following Python code?
x_train, x_test, y_train, y_test = sk.cross_validation.train_test_split(X,y)
The train and test datasets should be randomly sampled, because I will repeat this procedure multiple times to perform bootstrapping.

Say you have 150 samples that you want to split into 100 samples for training and 50 samples for testing. You could just do:
Python:
import numpy as np
idx = np.random.permutation(len(y))            # random ordering of all sample indices
X_train, y_train = X[idx[:100]], y[idx[:100]]  # first 100 shuffled samples for training
X_test, y_test = X[idx[100:]], y[idx[100:]]    # remaining 50 samples for testing
MATLAB/Octave:
idx = randperm(length(y));
X_train = X(idx(1:100), :);    % assuming the samples are the rows of X
y_train = y(idx(1:100));
X_test = X(idx(101:150), :);
y_test = y(idx(101:150));
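Since the split will be repeated several times for a bootstrap-style evaluation, the same idea can be wrapped in a loop that draws a fresh permutation on each iteration. A minimal Python sketch; the data, the seed, and the number of repetitions are hypothetical placeholders:
import numpy as np

# Hypothetical stand-ins for X and y: 150 samples, 4 features
X = np.random.rand(150, 4)
y = np.random.randint(0, 2, size=150)

rng = np.random.default_rng(seed=0)    # seeded generator so the splits are reproducible
n_train = 100

for rep in range(10):                  # hypothetical number of repetitions
    idx = rng.permutation(len(y))      # fresh random ordering for this repetition
    X_train, y_train = X[idx[:n_train]], y[idx[:n_train]]
    X_test, y_test = X[idx[n_train:]], y[idx[n_train:]]
    # ...fit and evaluate a model on this split...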

Related

How to input the model into the KNN classification algorithm?

I want to do image classification using KNN. I used https://pythonprogramming.net/loading-custom-data-deep-learning-python-tensorflow-keras/ to build a model. I have 20 images: 10 in the dog category and 10 in the cat category. I'm having trouble feeding the model into the KNN algorithm; there is a problem in my code. This is my code:
knn_model=KNeighborsClassifier(n_neighbors=3) #define K=3
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=0)
predict_knn=knn_model.predict(X_test)
print(predict_knn)
I get this error: Found input variables with inconsistent numbers of samples: [60, 20]
I need your opinion on how to fix this code. Thank you.
The problem could be due to an inconsistent number of samples in X and y: train_test_split requires both to have the same length along the first axis.
1. len(y) == 20
# Works
import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(20*32*32*3).reshape((20, 32, 32, 3)), list(range(20))
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=0)
2. len(y) == 60
# Does not work
X, y = np.arange(20*32*32*3).reshape((20, 32, 32, 3)), list(range(60))
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=0)
The second script raises the "inconsistent numbers of samples" ValueError, because X has 20 samples while y has 60.
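In other words, make sure y holds exactly one label per image (20 labels for 20 images) before calling train_test_split, and call fit before predict. A minimal sketch with hypothetical placeholder data (random pixel values; 0 = cat, 1 = dog), flattening each image because KNeighborsClassifier expects 2-D input:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in for the 20 loaded 32x32 RGB images and their labels
X = np.random.rand(20, 32, 32, 3)
y = np.array([0] * 10 + [1] * 10)   # 10 cats, 10 dogs -> len(y) == len(X) == 20

# Flatten each image to a 1-D feature vector for the KNN classifier
X_flat = X.reshape(len(X), -1)

X_train, X_test, y_train, y_test = train_test_split(
    X_flat, y, test_size=0.3, random_state=0)

knn_model = KNeighborsClassifier(n_neighbors=3)   # define K=3
knn_model.fit(X_train, y_train)                   # fit before predicting
predict_knn = knn_model.predict(X_test)
print(predict_knn)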

Machine Learning Using Scikit-Learn & SVM

Load popular digits dataset from sklearn.datasets module and assign it to variable digits.
Split digits.data into two sets named X_train and X_test. Also, split digits.target into two sets Y_train and Y_test.
Hint: Use train_test_split() method from sklearn.model_selection; set random_state to 30; and perform stratified sampling.
Build an SVM classifier from X_train set and Y_train labels, with default parameters. Name the model as svm_clf.
Evaluate the model accuracy on the testing data set and print its score.
I used the following code:
import sklearn.datasets as datasets
import sklearn.model_selection as ms
from sklearn.model_selection import train_test_split
digits = datasets.load_digits();
X = digits.data
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=30)
print(X_train.shape)
print(X_test.shape)
from sklearn.svm import SVC
svm_clf = SVC().fit(X_train, y_train)
print(svm_clf.score(X_test,y_test))
I got the below output.
(1347, 64)
(450, 64)
0.4088888888888889
But I am not able to pass the test. Can someone help with what is wrong?
You are missing the stratified sampling requirement; modify your split to include it:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=30, stratify=y)
Check the documentation.
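For reference, a complete sketch with the stratified split in place; the bincount lines at the end are only illustrative, showing that each digit class keeps roughly the same proportion in the training and test sets:
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = datasets.load_digits()
X, y = digits.data, digits.target

# Stratified split: class proportions are preserved in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=30, stratify=y)

svm_clf = SVC().fit(X_train, y_train)
print(svm_clf.score(X_test, y_test))

# Illustrative check of the class proportions
print(np.bincount(y_train) / len(y_train))
print(np.bincount(y_test) / len(y_test))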

How do I properly fit a scikit-learn model using a pandas dataframe?

I am trying to create a machine learning program in scikit-learn. I am using a CSV file to store data, and have decided to use a Pandas DataFrame to import and format this data. I cannot figure out how to fit this data frame with the model.
My CSV file has one feature, age, and one target, weight. I am using a linear regression algorithm to predict the weight using the age. I do realize this isn't the best algorithm to use with this data.
When I run this code I get the error "ValueError: Found input variables with inconsistent numbers of samples: [10, 40]"
Here is my code:
# Imports
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Load And Split Data
data = pd.read_csv("awd.csv")
feature_cols = ['Ages']
X = data.loc[:, feature_cols]
y = data.loc[:, "Weights"]
X_train, y_train, X_test, y_test = train_test_split(X, y, random_state=0, train_size=0.2)
# Train Model
lr = LinearRegression()
lr.fit(X_train, y_train)
# Scores
print(f"Test set score: {round(lr.score(X_test, y_test), 3)}")
print(f"Training set score: {round(lr.score(X_train, y_train), 3)}")
The first 5 lines of my CSV file:
Ages,Weights
1,19
1,21
2,26
2,32
You're assigning the return values in the wrong order. Your line is:
X_train, y_train, X_test, y_test = train_test_split(X, y, random_state=0, train_size=0.2)
train_test_split returns the splits in the order X_train, X_test, y_train, y_test, so it should read:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
(Note also that train_size=0.2 keeps only 20% of the data for training; test_size=0.2 is more likely what you intended.)
See the relevant documentation for details.
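For completeness, a runnable sketch of the corrected flow. The inline DataFrame is a hypothetical stand-in for awd.csv (only the first four rows match the question; the rest are made up), and test_size=0.2 assumes an 80/20 train/test split was intended:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for pd.read_csv("awd.csv")
data = pd.DataFrame({
    "Ages":    [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "Weights": [19, 21, 26, 32, 33, 36, 40, 42, 45, 48],
})

X = data.loc[:, ["Ages"]]
y = data.loc[:, "Weights"]

# train_test_split returns X_train, X_test, y_train, y_test in that order
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

lr = LinearRegression().fit(X_train, y_train)
print(f"Test set score: {round(lr.score(X_test, y_test), 3)}")
print(f"Training set score: {round(lr.score(X_train, y_train), 3)}")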

"Inconsistent numbers of samples" - scikit - learn

I'm learning some machine learning basics in Python (scikit-learn), and when I tried to implement the K-nearest neighbors algorithm, this error occurred: ValueError: Found input variables with inconsistent numbers of samples: [426, 143]. I have no idea how to deal with it.
This is my code:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
cancer = load_breast_cancer()
X_train, y_train, X_test, y_test = train_test_split(cancer.data, cancer.target,
                                                     stratify=cancer.target,
                                                     random_state=0)
clf = KNeighborsClassifier(n_neighbors = 6)
clf.fit(X_train, y_train)
train_test_split returns a tuple in the order X_train, X_test, y_train, y_test.
You've assigned the return values to the wrong variables, so you are fitting with the training data and the test data instead of the training data and the training labels.
It should be
X_train, X_test, y_train, y_test = train_test_split()
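Putting it together, a sketch of the corrected version (same data and parameters as in the question, only the assignment order changes; the final score line is just to confirm the fit works):
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

cancer = load_breast_cancer()

# Correct assignment order: X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target,
    stratify=cancer.target, random_state=0)

clf = KNeighborsClassifier(n_neighbors=6)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))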

How can I do K-fold cross-validation for splitting the train and test sets?

I have a set of documents and a set of labels.
Right now, I am using train_test_split to split my dataset in a 90:10 ratio. However, I wish to use Kfold cross-validation.
train = []
with open("/Users/rte/Documents/Documents.txt") as f:
    for line in f:
        train.append(line.strip().split())

labels = []
with open("/Users/rte/Documents/Labels.txt") as t:
    for line in t:
        labels.append(line.strip().split())

X_train, X_test, Y_train, Y_test = train_test_split(train, labels, test_size=0.1, random_state=42)
When I try the method provided in the scikit-learn documentation:
kf = KFold(len(train), n_folds=3)
for train_index, test_index in kf:
    X_train, X_test = train[train_index], train[test_index]
    y_train, y_test = labels[train_index], labels[test_index]
I receive an error that says:
X_train, X_test = train[train_index],train[test_index]
TypeError: only integer arrays with one element can be converted to an index
How can I perform 10-fold cross-validation on my documents and labels?
There are two ways to solve this error:
First way:
Cast your data to NumPy arrays:
import numpy as np
[...]
train = np.array(train)
labels = np.array(labels)
then it should work with your current code.
Second way:
Use list comprehensions to index the train and labels lists with the train_index and test_index lists:
for train_index, test_index in kf:
    X_train, X_test = [train[i] for i in train_index], [train[j] for j in test_index]
    y_train, y_test = [labels[i] for i in train_index], [labels[j] for j in test_index]
(For this approach, see also the related question on indexing a list with another list.)
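Note also that in current scikit-learn versions (0.18 and later) KFold lives in sklearn.model_selection, takes n_splits instead of n_folds, and yields the indices via its split method. A minimal sketch of a 10-fold loop, with small dummy arrays standing in for the documents and labels:
import numpy as np
from sklearn.model_selection import KFold

# Dummy stand-ins for the documents and labels read from the text files
train = np.array(["doc %d" % i for i in range(20)], dtype=object)
labels = np.array([i % 2 for i in range(20)])

kf = KFold(n_splits=10, shuffle=True, random_state=42)
for train_index, test_index in kf.split(train):
    X_train, X_test = train[train_index], train[test_index]
    y_train, y_test = labels[train_index], labels[test_index]
    # ...vectorise the documents and fit a model on each fold...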
