Faster data fitting (or learning) function in Python scikit-learn

I am using scikit-learn for my machine learning purposes. I followed the steps exactly as mentioned in its official documentation, but I encounter two problems. Here is the main part of the code:
1) trdata is training data created using sklearn.train_test_split.
2) ptest and ntest are the test data for positives and negatives, respectively
## Preprocessing
scaler = StandardScaler(); scaler.fit(trdata);
trdata = scaler.transform(trdata)
ptest = scaler.transform(ptest); ntest = scaler.transform(ntest)
## Building Classifier
# setting gamma and C for grid search optimization, RBF Kernel and SVM classifier
crange = 10.0**np.arange(-2,9); grange = 10.0**np.arange(-5,4)
pgrid = dict(gamma = grange, C = crange)
cv = StratifiedKFold(y = tg, n_folds = 3)
## Threshold Ranging
clf = GridSearchCV(SVC(),param_grid = pgrid, cv = cv, n_jobs = 8)
## Training Classifier: Semi Supervised Algorithm
clf.fit(trdata,tg,n_jobs=8)
Problem 1) When I use n_jobs = 8 in GridSearchCV, the code runs up to the GridSearchCV call but hangs, or rather takes an exceptionally long time without producing a result, when executing clf.fit, even for a very small dataset. When I remove it, both execute, but clf.fit takes a very long time to converge for large datasets. My data is a 600 x 12 matrix for both positives and negatives. Can you tell me what exactly n_jobs does and how it should be used? Also, is there any faster fitting technique or code modification that can be applied to make it faster?
Problem 2) Should StandardScaler be applied to the positive and negative data combined, or separately for each? I suppose it has to be combined, because only then can we apply the scaler parameters to the test sets.

SVC seems to be very sensitive to data that is not normalized; you may try normalizing the data with:
from sklearn import preprocessing
trdata = preprocessing.scale(trdata)
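
For reference, here is a minimal sketch of how the two problems are usually handled, written against the current scikit-learn API (sklearn.model_selection rather than the older StratifiedKFold(y=..., n_folds=...) form used above). The data is made up and the parameter grid is deliberately smaller than the one in the question so the example runs quickly:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Hypothetical stand-in data; in the question these come from train_test_split.
rng = np.random.RandomState(0)
trdata, tg = rng.randn(200, 12), rng.randint(0, 2, size=200)
ptest, ntest = rng.randn(50, 12), rng.randn(50, 12)

# Problem 2: fit the scaler on the training data only, then reuse the same
# fitted scaler to transform both test sets.
scaler = StandardScaler().fit(trdata)
trdata = scaler.transform(trdata)
ptest, ntest = scaler.transform(ptest), scaler.transform(ntest)

# Problem 1: n_jobs parallelises the grid search and belongs on GridSearchCV;
# it is not an argument of fit().
pgrid = dict(C=10.0 ** np.arange(-2, 3), gamma=10.0 ** np.arange(-3, 2))
cv = StratifiedKFold(n_splits=3)
clf = GridSearchCV(SVC(kernel="rbf"), param_grid=pgrid, cv=cv, n_jobs=4)
clf.fit(trdata, tg)
print(clf.best_params_)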


Python Logistic regression with 2 features data X and label Y - Training accuracy

# import numpy, sklearn and other necessary libraries
import numpy as np
from sklearn.linear_model import LogisticRegression
# Apply sklearn logistic regression on the given data X and labels Y
X_skl = np.vstack((df1,df2)) # 10000 x 2 array
Y_skl = Y # 10000 x 1 array
LogR = LogisticRegression()
LogR.fit(X_skl,Y_skl)
Y_skl_hat = LogR.predict(X_skl)
# Calculate the accuracy
# Check the number of points where Y_skl is not equal to Y_skl_hat
error_count_skl = 0 # Count the number of error points
N = len(Y_skl)  # total number of points
for i in range(N):
    if Y_skl[i] == Y_skl_hat[i]:
        error_count_skl = error_count_skl
    else:
        error_count_skl = error_count_skl + 1
# Calculate the accuracy
Accuracy = 100*(N - error_count_skl)/N
print("Accuracy(%):")
print(Accuracy)
Output:
Accuracy(%):
99.48
Hello,
I'm trying to apply a logistic regression model to an array X (of size 10000 x 2) and labels Y (10000 x 1)
using the sklearn library in Python. I'm completely lost because I've never used this library before. Can anyone help me with the coding?
Edited:
Sorry for the vague question; the goal is to find the training accuracy using the entire dataset X. Above is what I came up with; can anyone take a look and see if it makes sense?
To calculate accuracy you can simply use this sklearn function:
sklearn.metrics.accuracy_score(y_true, y_pred)
In your case
sklearn.metrics.accuracy_score(Y_skl, Y_skl_hat)
If you want more details, take a look at the sklearn documentation for accuracy_score.
You should also train your model on some data and test it on other data to check whether the model generalizes and to avoid overfitting.
To split your data into train and test datasets you could use:
sklearn.model_selection.train_test_split
If you want more details, take a look at the sklearn documentation for train_test_split.
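
Putting those two suggestions together, a minimal sketch might look like the following; the arrays are made up here, since df1, df2 and Y from the question are not shown:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical stand-ins for the question's df1, df2 and Y (two noisy clusters).
rng = np.random.RandomState(0)
df1 = rng.randn(5000, 2) + 1.0
df2 = rng.randn(5000, 2) - 1.0
X_skl = np.vstack((df1, df2))                        # 10000 x 2 array
Y_skl = np.hstack((np.ones(5000), np.zeros(5000)))   # 10000 labels

# Hold out a test set so the reported score is not only training accuracy.
X_train, X_test, y_train, y_test = train_test_split(
    X_skl, Y_skl, test_size=0.3, random_state=0)

LogR = LogisticRegression()
LogR.fit(X_train, y_train)
print("Training accuracy (%):", 100 * accuracy_score(y_train, LogR.predict(X_train)))
print("Test accuracy (%):", 100 * accuracy_score(y_test, LogR.predict(X_test)))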

How to perform k-fold CV in R?

I have Python code that works well for performing k-fold CV on a dataset. It looks like this:
import pandas
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR
from sklearn.utils import shuffle
# Load the dataset.
dataset = pandas.read_csv('values.csv')
# Preprocessing the dataset.
X = dataset.iloc[:, 0:8]
Y = dataset.iloc[:, 8] # The class value is the last column and is called Outcome.
# Scale all values to 0,1.
scaler = MinMaxScaler(feature_range=(0, 1))
X = scaler.fit_transform(X)
# 3-fold CV computation.
scores = []
svr_rbf = SVR(kernel='rbf', gamma='auto')
cv = KFold(n_splits=3, shuffle=True, random_state=42)  # shuffle so that random_state controls the folds
for train_index, test_index in cv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    Y_train, Y_test = Y[train_index], Y[test_index]
    svr_rbf.fit(X_train, Y_train)
    scores.append(svr_rbf.score(X_test, Y_test))
Now, I want to rewrite the same thing in R, and I tried to do something like this:
library(base)
library(caret)
library(tidyverse)
dataset <- read_csv("values.csv", col_names=TRUE)
results <- train(Outcome ~ .,
                 data = dataset,
                 method = "svmLinear",
                 trControl = trainControl(
                   method = "cv",
                   number = 3,
                   savePredictions = TRUE,
                   verboseIter = TRUE
                 ))
print(results)
print(results$pred)
My data is similar to this one: https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.data
Except that one has 12 attributes with the 13th column as the class, whereas in my case there are 8 attributes and the 9th is the class; value-wise it is similar.
Now, I can see the results printed, however a few things are unclear to me.
1) In my Python code I scale all values to the 0-1 range; how can I do that in R?
2) I have used SVR with an rbf kernel; how can I use SVR with that kernel in R instead of SVM?
3) Also, in the Python version I use random_state=42 (just an arbitrary number) when generating the splits for the folds, so the folds are shuffled but stay consistent across executions. How do I do this in R?
4) Lastly, in Python I do the training inside a for loop, per fold. I need something like this in R too, as after every fold I want to perform some other statistics and computations. How do I do this in R?
5) Should I stick to caret or use the mlr package? Does mlr do k-fold CV too? If yes, how?
EDIT:
library(base)
library(caret)
library(tidyverse)
dataset <- read_csv("https://gist.githubusercontent.com/dmpe/bfe07a29c7fc1e3a70d0522956d8e4a9/raw/7ea71f7432302bb78e58348fede926142ade6992/pima-indians-diabetes.csv", col_names=FALSE)
print(dataset)
X = dataset[, 1:8]
print(X)
Y = dataset$X9
set.seed(88)
nfolds <- 3
cvIndex <- createFolds(Y, nfolds, returnTrain = T)
fit.control <- trainControl(method = "cv",
                            index = cvIndex,
                            number = nfolds,
                            classProbs = TRUE,
                            savePredictions = TRUE,
                            verboseIter = TRUE,
                            summaryFunction = twoClassSummary,
                            allowParallel = FALSE)
rfCaret <- caret::train(X, Y, method = "svmLinear", trControl = fit.control)
print(rfCaret)
Check out createFolds in the caret package for fixed folds.
Here's some code that you can amend to fit your particular modelling case; this example builds a random forest model, but you can switch the model for an SVM. If you follow the package guide there's a link (copied here for ease: http://topepo.github.io/caret/train-models-by-tag.html#support-vector-machines); section 7.0.47 lists all the available SVM models and their parameters. Note that you may need to install some additional packages, like kernlab, to use specific models.
There is a package called rngtools that is supposed to allow you to create reproducible models across multiple cores (parallel processing), but if you want to be sure then single core is probably the best way in my experience.
folds <- 3
set.seed(42)
cvIndex <- createFolds(your_data, folds, returnTrain = T)
fit.control <- trainControl(method = "cv",
                            index = cvIndex,
                            number = folds,
                            classProbs = TRUE,
                            summaryFunction = twoClassSummary,
                            allowParallel = FALSE)
search.grid <- expand.grid(.mtry = c(seq.int(1:sqrt(length(your_data))) + 1))
rfCaret <- train(your_data_x, your_data_y, method = "rf",
                 metric = 'ROC', ntree = 500,
                 trControl = fit.control, tuneGrid = search.grid)
In my experience, caret is pretty good for covering pretty much all bases. If you also want to preprocess your data (e.g. centre, scale) - then you want the function preProcess - again, details in the caret package if you type ?train - but for example you would want
preProcess(yourData, method = c("center", "scale"))
Caret is clever in that it understands if it has taken a preprocessed input, and applies the same scaling to your test data sets.
edit - additional: unused parameters issue
To answer your follow-up question about unused parameters: it's probably because you're using mtry, which is a random forest parameter.
Here's a version for a simple SVM:
folds <- 3
set.seed(42)
cvIndex <- createFolds(dataset$Outcome, folds, returnTrain = T)
fit.control <- trainControl(method = "cv",
                            index = cvIndex,
                            number = folds,
                            classProbs = TRUE,
                            summaryFunction = twoClassSummary,
                            allowParallel = FALSE)
SVMCaret <- train(Outcome ~ ., data = dataset, method = "svmLinear",
                  metric = 'ROC',
                  trControl = fit.control)
You don't need a tuning grid; Caret will generate a random one. Of course if you want to test specific cost values, then create one yourself in much the same way as I did for the .mtry parameter for randomForests.
1) The caret::train function has a preProcess argument that allows you to choose a preprocessing method. See ?caret::train for more details.
2) There is svmRadial available for caret. You can look at examples and all available algorithms at caret/train-models-by-tag.
3) Fix the random seed with set.seed(123) for consistency. You may access training folds in the train object (results$trainingData here).
4) Don't loop, access your folds directly via your train object and calculate your stats if needed (see results$resample)
5) mlr has cross-validation too, it depends which flavor you like.

Decision boundary changing in sklearn each time I run code

In Udacity's Intro to Machine Learning class, I am finding that the result of my code can change each time I run it. The correct values are acc_min_samples_split_2 = .908 and acc_min_samples_split_50 = .912, but when I run my script, sometimes the value for acc_min_samples_split_2 comes out as .912 as well. This happens on both my local machine and the web interface within Udacity. Why might this be happening?
The program uses the scikit-learn library for Python.
Here is the part of the code that I wrote:
def classify(features, labels, samples):
    # Creates a new Decision Tree Classifier, and fits it based on sample data
    # and a specified min_sample_split value
    from sklearn import tree
    clf = tree.DecisionTreeClassifier(min_samples_split = samples)
    clf = clf.fit(features, labels)
    return clf

# Create a classifier with a min sample split of 2, and test its accuracy
clf2 = classify(features_train, labels_train, 2)
acc_min_samples_split_2 = clf2.score(features_test, labels_test)

# Create a classifier with a min sample split of 50, and test its accuracy
clf50 = classify(features_train, labels_train, 50)
acc_min_samples_split_50 = clf50.score(features_test, labels_test)

def submitAccuracies():
    return {"acc_min_samples_split_2": round(acc_min_samples_split_2, 3),
            "acc_min_samples_split_50": round(acc_min_samples_split_50, 3)}

print submitAccuracies()
Some classifiers within scikit-learn are stochastic in nature, using a PRNG to generate random numbers internally.
DecisionTree is one of them. Check the docs and use the argument random_state to make that random behaviour deterministic.
Just create your fit-object like:
clf = tree.DecisionTreeClassifier(min_samples_split = samples, random_state=0) # or any other constant
If you don't provide a random_state or some seed/integer like in my example above, the PRNG will be seeded by some external source (most probably based on the system time), resulting in different results across runs of the script.
Two runs sharing the same code and the same constant will behave identically (ignoring some pathological architecture/platform issues).
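As an end-to-end sketch of that advice (with made-up data, since the course's features and labels are not shown here), pinning random_state makes repeated fits agree:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn import tree

# Hypothetical data standing in for the course's features/labels.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
features_train, features_test, labels_train, labels_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

def classify(features, labels, samples):
    # Pinning random_state makes the tree construction deterministic.
    clf = tree.DecisionTreeClassifier(min_samples_split=samples, random_state=0)
    return clf.fit(features, labels)

# With random_state fixed, repeated runs produce the same accuracy.
acc_a = classify(features_train, labels_train, 2).score(features_test, labels_test)
acc_b = classify(features_train, labels_train, 2).score(features_test, labels_test)
assert acc_a == acc_b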

Impossible to use sum in a dataframe while similar code works

I am taking a course on dataquest.io and I observed something strange (but could not get an answer there). I am wondering why I can't reuse a code snippet that worked before in a situation that uses the same kind of data and should not raise an exception.
The lesson first teaches fitting a regressor on a training set and predicting on the same values, then calculating the MSE.
Then it shows that this would overfit and proposes a random splitting process to avoid that. The problem is that, apart from the random splitting, the dataframes generated are very similar, but when I try to calculate my MSE on the final results it fails, and I have to change the code to an alternative.
Here are both codes:
First code
# Import the linear regression class
from sklearn.linear_model import LinearRegression
# Initialize the linear regression class.
regressor = LinearRegression()
# We're using 'value' as a predictor, and making predictions for 'next_day'.
# The predictors need to be in a dataframe.
# We pass in a list when we select predictor columns from "sp500" to
# force pandas not to generate a series.
# (?) I could not figure out why it is not necessary for "to_predict"
predictors = sp500[["value"]]
to_predict = sp500["next_day"]
# Train the linear regression model on our dataset.
regressor.fit(predictors, to_predict)
# Generate a list of predictions with our trained linear regression model
next_day_predictions = regressor.predict(predictors)
print(next_day_predictions)
MSE_frame=(next_day_predictions-to_predict)**2
#(?) can math.pow(frame_difference, 2) be used on a dataframe?
mse=MSE_frame.sum()/len(MSE_frame.index)
______________________________________________________________________________
Second code
import numpy as np
import random
# Set a random seed to make the shuffle deterministic.
np.random.seed(1)
random.seed(1)
#(?) is there any difference between these two statements? Are they
# both necessary, or just one of the two?
# Randomly shuffle the rows in our dataframe
sp500 = sp500.loc[np.random.permutation(sp500.index)]
# Select 70% of the dataset to be training data
highest_train_row = int(sp500.shape[0] * .7)
train = sp500.loc[:highest_train_row,:]
# Select 30% of the dataset to be test data.
test = sp500.loc[highest_train_row:,:]
regressor = LinearRegression()
regressor.fit(train[["value"]], train["next_day"])
predictions = regressor.predict(test[["value"]])
mse = sum((predictions - test["next_day"]) ** 2) / len(predictions)
regressor = LinearRegression()
predictors = train[["value"]]
to_predict = train["next_day"]
# Train the linear regression model on our dataset.
regressor.fit(predictors, to_predict)
# Generate a list of predictions with our trained linear regression model
next_day_predictions = regressor.predict(test[["value"]])
print(next_day_predictions)
sqr=(next_day_predictions-test["next_day"])**2
The mistake was here: I was passing a DataFrame with test[["next_day"]], while that was not done in the first code.
mse=sum(sqr)/len(sqr.index)
#or
mse=sqr.sum()/len(sqr.index)
# This is the line which failed, even though it was identical to what was done before.
It is worth noting that the two mse expressions don't yield exactly the same results: they are identical for the first ten decimals, but comparing them with == doesn't give True.
So, the problem was here:
sqr=(next_day_predictions-test["next_day"])**2
I originally wrote
sqr=(next_day_predictions-test[["next_day"]])**2
thus selecting with a list of column names, which returns a one-column DataFrame instead of a Series; that was not done in the first code.
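Here is a short illustration of the difference, as a toy sketch with made-up numbers rather than the course data: single brackets return a Series, double brackets return a one-column DataFrame, and element-wise arithmetic with a 1-D prediction array only lines up with the former.

import numpy as np
import pandas as pd

# Toy stand-ins for the real objects, just to show the Series/DataFrame difference.
test = pd.DataFrame({"value": [1.0, 2.0, 3.0], "next_day": [1.1, 2.1, 3.1]})
next_day_predictions = np.array([1.0, 2.0, 3.0])

series = test["next_day"]    # single brackets  -> pandas Series, shape (3,)
frame = test[["next_day"]]   # double brackets  -> pandas DataFrame, shape (3, 1)
print(type(series), series.shape)
print(type(frame), frame.shape)

# Element-wise arithmetic against a 1-D array of predictions works with the Series:
sqr = (next_day_predictions - series) ** 2
mse = sqr.sum() / len(sqr.index)     # a plain float
print(mse)

# With the one-column DataFrame, pandas aligns a 1-D array against the columns
# rather than the rows, so the same expression no longer matches up element by
# element (exactly how it fails depends on the pandas version), which is what
# broke the second version of the code above.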

Train scikit SVM, customize score assessment

I plan on using scikit svm for class prediction.
I have a two-class dataset consisting of about 100 experiments. Each experiment encapsulates my data-points (vectors) + classification.
Training an SVM according to http://scikit-learn.org/stable/modules/svm.html should be straightforward.
I will have to put all vectors in an array, generate another array with the corresponding class labels, and train the SVM. However, in order to run leave-one-out error estimation, I need to leave out a specific subset of vectors: one experiment.
How do I achieve that with the available score function?
Cheers,
EL
You could manually train on everything but the one observation, using numpy indexing to drop it out. Then you can use any of sklearn's helpers to evaluate the classification. For example:
import numpy as np
from sklearn import svm
clf = svm.SVC(...)
idx = np.arange(len(observations))
preds = np.zeros(len(observations))
for i in idx:
    is_train = idx != i
    clf.fit(observations[is_train, :], labels[is_train])
    # predict expects a 2-D array, so keep the held-out row as shape (1, n_features)
    preds[i] = clf.predict(observations[i:i + 1, :])
Alternatively, scikit-learn has a helper to do leave-one-out, and another helper to get cross-validation scores:
from sklearn import svm, cross_validation
clf = svm.SVC(...)
loo = cross_validation.LeaveOneOut(len(observations))
was_right = cross_validation.cross_val_score(clf, observations, labels, cv=loo)
total_acc = np.mean(was_right)
See the user's guide for more. cross_val_score actually returns a score for each fold (which is a little strange IMO), but since we have one fold per observation, this will just be 0 if it was wrong and 1 if it was right.
Of course, leave-one-out is very slow and has terrible statistical properties to boot, so you should probably use KFold instead.
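A minimal sketch of that KFold alternative, written against the current API (the cross_validation module used above has since been replaced by sklearn.model_selection); the data here is made up and the SVC parameters are placeholders:

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical stand-in data; in the question, `observations` holds the
# experiment vectors and `labels` the corresponding class labels.
rng = np.random.RandomState(0)
observations = rng.randn(100, 8)
labels = rng.randint(0, 2, size=100)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")          # example parameters only
cv = KFold(n_splits=5, shuffle=True, random_state=0)   # k folds instead of leave-one-out
scores = cross_val_score(clf, observations, labels, cv=cv)  # one accuracy per fold
print(scores.mean(), scores.std())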
