Decision boundary changing in sklearn each time I run code - python

In Udacity's Intro to Machine Learning class, I am finding that the result of my code can change each time I run it. The correct values are acc_min_samples_split_2 = .908 and acc_min_samples_split_50 = .912, but when I run my script, sometimes the value for acc_min_samples_split_2 comes out as .912 as well. This happens both on my local machine and in the web interface within Udacity. Why might this be happening?
The program uses the scikit-learn library for Python.
Here is the part of the code that I wrote:
def classify(features, labels, samples):
    # Creates a new Decision Tree Classifier and fits it on the sample data
    # with the specified min_samples_split value
    from sklearn import tree
    clf = tree.DecisionTreeClassifier(min_samples_split=samples)
    clf = clf.fit(features, labels)
    return clf

# Create a classifier with a min_samples_split of 2 and test its accuracy
clf2 = classify(features_train, labels_train, 2)
acc_min_samples_split_2 = clf2.score(features_test, labels_test)

# Create a classifier with a min_samples_split of 50 and test its accuracy
clf50 = classify(features_train, labels_train, 50)
acc_min_samples_split_50 = clf50.score(features_test, labels_test)

def submitAccuracies():
    return {"acc_min_samples_split_2": round(acc_min_samples_split_2, 3),
            "acc_min_samples_split_50": round(acc_min_samples_split_50, 3)}

print(submitAccuracies())

Some classifiers within scikit-learn are stochastic by nature and use a PRNG internally to generate random numbers.
DecisionTreeClassifier is one of them. Check the docs and use the random_state argument to make that random behaviour deterministic.
Just create your fit object like:
clf = tree.DecisionTreeClassifier(min_samples_split=samples, random_state=0)  # or any other constant
If you don't provide a random_state (some seed/integer, as in the example above), the PRNG will be seeded by some external source (most probably based on system time), resulting in different results across runs of the script.
Two runs sharing the same code and the same constant will behave identically (ignoring some pathological architecture/platform issues).
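As an illustration, here is a minimal sketch (using a synthetic dataset rather than the course data, so the numbers themselves differ) showing that two fits with the same random_state give identical results:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn import tree

# Synthetic data purely for demonstration
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Two fits with the same random_state build identical trees and give identical scores
clf_a = tree.DecisionTreeClassifier(min_samples_split=2, random_state=0).fit(X_train, y_train)
clf_b = tree.DecisionTreeClassifier(min_samples_split=2, random_state=0).fit(X_train, y_train)
print(clf_a.score(X_test, y_test) == clf_b.score(X_test, y_test))  # True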

Related

adding more data to Support Vector Classifier training

I am using the LinearSVC() available in scikit-learn to classify texts into a maximum of seven labels, so it is a multilabel classification problem. I am training on a small amount of data and testing it. Now I want to add more data (retrieved from a pool based on a criterion) to the fitted model and evaluate on the same test set. How can this be done?
Question:
Is it necessary to merge the previous dataset with the new dataset, preprocess everything again, and then retrain to see whether performance improves with the old + new data?
My code so far is below:
from nltk.stem import PorterStemmer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
import neattext as nt            # assuming nt/nfx refer to the neattext package
import neattext.functions as nfx

def preprocess(data, x, y):
    global Xfeatures
    global y_train
    global labels
    porter = PorterStemmer()
    multilabel = MultiLabelBinarizer()
    y_train = multilabel.fit_transform(data[y])
    print("\nLabels are now binarized\n")
    data[multilabel.classes_] = y_train
    labels = multilabel.classes_
    print(labels)
    data[x].apply(lambda x: nt.TextFrame(x).noise_scan())
    print("\nEnglish stop words were extracted\n")
    data[x].apply(lambda x: nt.TextExtractor(x).extract_stopwords())
    corpus = data[x].apply(nfx.remove_stopwords)
    corpus = data[x].apply(lambda x: porter.stem(x))
    tfidf = TfidfVectorizer()
    Xfeatures = tfidf.fit_transform(corpus).toarray()
    print('\nThe text is now vectorized\n')
    return Xfeatures, y_train
Xfeatures, y_train = preprocess(df1, 'corpus', 'zero_level_name')

Xfeatures_train = Xfeatures[:300]
y_train_features = y_train[:300]
X_test = Xfeatures[300:400]
y_test = y_train[300:400]
X_pool = Xfeatures[400:]
y_pool = y_train[400:]

def model(modelo, tipo):
    svc = modelo
    clf = tipo(svc)
    clf.fit(Xfeatures_train, y_train_features)
    clf_predictions = clf.predict(X_test)
    return clf_predictions

preds_pool = model(LinearSVC(class_weight='balanced'), OneVsRestClassifier)
It depends on how representative your previous dataset was. If it was a good representation of the problem at hand, then adding more data will not increase your model's performance by a large margin, so you can just test with the new data.
However, it is also possible that your initial dataset was not representative enough, and therefore with more data your classification accuracy increases. In that case it is better to include all the data and preprocess it again, because preprocessing generally involves parameters computed on the dataset as a whole; for example, the TF-IDF vocabulary and IDF weights (or a mean used for scaling) are sensitive to the dataset at hand.
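For instance, the "include all the data and preprocess again" route could look roughly like this, reusing the preprocess function from the question and assuming a hypothetical df_new with the same columns as df1:

import numpy as np
import pandas as pd
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

# df_new is a hypothetical new batch pulled from the pool; it must have the
# same 'corpus' and 'zero_level_name' columns as df1
df_combined = pd.concat([df1, df_new], ignore_index=True)

# Refit the whole preprocessing (MultiLabelBinarizer + TF-IDF) on old + new data,
# since the TF-IDF vocabulary and IDF weights depend on the corpus as a whole
Xfeatures, y_train = preprocess(df_combined, 'corpus', 'zero_level_name')

# Keep the same held-out rows as the test set and retrain on everything else
X_test, y_test = Xfeatures[300:400], y_train[300:400]
X_train = np.vstack([Xfeatures[:300], Xfeatures[400:]])
y_tr = np.vstack([y_train[:300], y_train[400:]])

clf = OneVsRestClassifier(LinearSVC(class_weight='balanced'))
clf.fit(X_train, y_tr)
preds = clf.predict(X_test)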

How to perform k-fold CV in R?

I have Python code that works well for performing k-fold CV on a dataset. It looks like this:
import pandas
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR
from sklearn.utils import shuffle

# Load the dataset.
dataset = pandas.read_csv('values.csv')

# Preprocess the dataset.
X = dataset.iloc[:, 0:8]
Y = dataset.iloc[:, 8]  # The class value is the last column and is called Outcome.

# Scale all values to [0, 1].
scaler = MinMaxScaler(feature_range=(0, 1))
X = scaler.fit_transform(X)

# 3-fold CV computation.
scores = []
svr_rbf = SVR(kernel='rbf', gamma='auto')
cv = KFold(n_splits=3, random_state=42, shuffle=True)  # random_state only takes effect when shuffle=True
for train_index, test_index in cv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    Y_train, Y_test = Y[train_index], Y[test_index]
    svr_rbf.fit(X_train, Y_train)
    scores.append(svr_rbf.score(X_test, Y_test))
Now, I want to rewrite the same thing in R, and I tried to do something like this:
library(base)
library(caret)
library(tidyverse)

dataset <- read_csv("values.csv", col_names = TRUE)

results <- train(Outcome ~ .,
                 data = dataset,
                 method = "svmLinear",
                 trControl = trainControl(
                   method = "cv",
                   number = 3,
                   savePredictions = TRUE,
                   verboseIter = TRUE
                 ))
print(results)
print(results$pred)
My data is similar to this one: https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.data
Except that one has 12 attributes with the 13th column as the class, whereas in my case there are 8 attributes and the 9th one is the class. But value-wise it is similar.
Now, I can see the results printing, however there are a few things unclear to me.
1) In my Python code I scale the values; how can I do that in R?
2) I have used SVR with the rbf kernel; how can I use SVR with that kernel in R instead of SVM?
3) Also, in the Python version I use random_state=42 (just a random number) when generating the splits for the folds, so the folds are shuffled but consistent across executions. How do I do this in R?
4) Lastly, in Python I do the training inside a for loop, per fold. I need something like this in R too, as after every fold I want to perform some other statistics and computations. How do I do this in R?
5) Should I stick to caret or use the mlr package? Does mlr do k-fold CV too? If yes, how?
EDIT:
library(base)
library(caret)
library(tidyverse)

dataset <- read_csv("https://gist.githubusercontent.com/dmpe/bfe07a29c7fc1e3a70d0522956d8e4a9/raw/7ea71f7432302bb78e58348fede926142ade6992/pima-indians-diabetes.csv", col_names=FALSE)
print(dataset)
X = dataset[, 1:8]
print(X)
Y = dataset$X9
set.seed(88)
nfolds <- 3
cvIndex <- createFolds(Y, nfolds, returnTrain = T)

fit.control <- trainControl(method = "cv",
                            index = cvIndex,
                            number = nfolds,
                            classProbs = TRUE,
                            savePredictions = TRUE,
                            verboseIter = TRUE,
                            summaryFunction = twoClassSummary,
                            allowParallel = FALSE)

rfCaret <- caret::train(X, Y, method = "svmLinear", trControl = fit.control)
print(rfCaret)
Check out createFolds in the caret package for fixed folds.
Here's some code that you can amend to fit your particular modelling case; this example builds a random forest model, but you can switch the model for an SVM. If you follow the package guide there's a link (copied here for ease: http://topepo.github.io/caret/train-models-by-tag.html#support-vector-machines); section 7.0.47 lists all the available SVM models and their parameters. Note that you may need to install some additional packages, like kernlab, to use specific models.
There is a package called rngtools that is supposed to allow you to create reproducible models across multiple cores (parallel processing), but if you want to be sure then single core is probably the best way, in my experience.
folds <- 3
set.seed(42)
cvIndex <- createFolds(your_data, folds, returnTrain = T)

fit.control <- trainControl(method = "cv",
                            index = cvIndex,
                            number = folds,
                            classProbs = TRUE,
                            summaryFunction = twoClassSummary,
                            allowParallel = FALSE)

search.grid <- expand.grid(.mtry = c(seq.int(1:sqrt(length(your_data))) + 1))

rfCaret <- train(your_data_x, your_data_y, method = "rf",
                 metric = 'ROC', ntree = 500,
                 trControl = fit.control, tuneGrid = search.grid)
In my experience, caret is pretty good at covering pretty much all bases. If you also want to preprocess your data (e.g. centre and scale it), then you want the preProcess function; again, details are in the caret package if you type ?train, but for example you would want
preProcess(yourData, method = c("center", "scale"))
caret is clever in that it understands when it has been given a preprocessed input, and it applies the same scaling to your test data sets.
edit - additional: unused parameters issue
To answer your follow-up question about unused parameters: it's probably because you're using mtry, which is a random forest parameter.
Here's a version for a simple SVM:
folds <- 3
set.seed(42)
cvIndex <- createFolds(dataset$Outcome, folds, returnTrain = T)

fit.control <- trainControl(method = "cv",
                            index = cvIndex,
                            number = folds,
                            classProbs = TRUE,
                            summaryFunction = twoClassSummary,
                            allowParallel = FALSE)

SVMCaret <- train(Outcome ~ ., data = dataset, method = "svmLinear",
                  metric = 'ROC',
                  trControl = fit.control)
You don't need a tuning grid; caret will generate a random one. Of course, if you want to test specific cost values, create one yourself in much the same way as I did with the .mtry parameter for random forests.
1) The caret::train function has a preProcess argument that lets you choose a preprocessing method. See ?caret::train for more details.
2) svmRadial is available for caret. You can look at examples and all available algorithms at caret/train-models-by-tag.
3) Fix the random seed with set.seed(123) for consistency. You can access the training folds in the train object (results$trainingData here).
4) Don't loop; access your folds directly via your train object and calculate your stats if needed (see results$resample).
5) mlr has cross-validation too; it depends which flavour you like.

SelectKBest with GaussianNB not precise/consistent results

I want to select top K features using SelectKBest and run GaussianNB:
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

selection = SelectKBest(mutual_info_classif, k=300)
data_transformed = selection.fit_transform(data, labels)
new_data_transformed = selection.transform(new_data)

classifier = GaussianNB()
classifier.fit(data_transformed, labels)
y_predicted = classifier.predict(new_data_transformed)
acc = accuracy_score(new_data_labels, y_predicted)
However, I do not get consistent results for accuracy on the same data.
The accuracy has been:
0.61063743402354853
0.60678034916768164
0.61733658140479086
0.61652456354039786
0.64778725131952908
0.58384084449857898
For the SAME data. I don't do splits etc.; I just use two static sets, data and new_data.
Why do the results vary? How do I make sure I get the same accuracy for the same data?
This is because there is some randomness involved. It depends on the random number generator used internally by the estimators or functions; in your case it is mutual_info_classif, which you pass into SelectKBest.
Have a look at the usage of random_state here and in this answer.
As a workaround you can insert the following line at the top of your code:
np.random.seed(some_integer)
This will set numpy's seed to some_integer, and as far as I know scikit-learn estimators use numpy's random number generator. See this for more details.
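For example, a minimal sketch of both options: the global seed mentioned above, and, as an alternative worth checking against your scikit-learn version, binding mutual_info_classif's own random_state with functools.partial (data, labels and new_data are the arrays from the question):

import numpy as np
from functools import partial
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

np.random.seed(42)  # global seed, as suggested above

# Alternatively, pin the seed of the scoring function itself via its random_state parameter
selection = SelectKBest(partial(mutual_info_classif, random_state=42), k=300)
data_transformed = selection.fit_transform(data, labels)
new_data_transformed = selection.transform(new_data)

classifier = GaussianNB().fit(data_transformed, labels)
y_predicted = classifier.predict(new_data_transformed)
print(accuracy_score(new_data_labels, y_predicted))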

how to implement walk forward testing in sklearn?

In sklearn, GridSearchCV can take a pipeline as a parameter to find the best estimator through cross-validation. However, the usual cross-validation splits the data into folds without regard to order, whereas to cross-validate time series data the training and testing sets must be split chronologically.
That is to say, the testing data should always be ahead of the training data.
My thoughts are:
Write my own k-fold class and pass it to GridSearchCV, so I can enjoy the convenience of pipelines. The problem is that it seems difficult to make GridSearchCV use specified indices of training and testing data.
Write a new class GridSearchWalkForwardTest similar to GridSearchCV. I am studying the source code grid_search.py and find it a little complicated.
Any suggestion is welcome.
I think you could use TimeSeriesSplit, either instead of your own implementation or as the basis for implementing a CV method that works exactly as you describe.
After digging around a bit, it seems someone added a max_train_size parameter to TimeSeriesSplit in this PR, which seems to do what you want.
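As a rough sketch (the pipeline and grid are placeholders, and X, y are assumed to already be in time order), TimeSeriesSplit plugs straight into GridSearchCV:

from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Placeholder pipeline and parameter grid
pipe = Pipeline([('scale', StandardScaler()), ('svr', SVR())])
param_grid = {'svr__C': [0.1, 1.0, 10.0]}

# Every test fold lies strictly after its training fold; max_train_size caps the
# lookback window if a rolling (rather than expanding) scheme is wanted
cv = TimeSeriesSplit(n_splits=5, max_train_size=200)
search = GridSearchCV(pipe, param_grid, cv=cv)
search.fit(X, y)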
I did some work on all of this a few months ago.
You could check it out in this question/answer:
Rolling window REVISITED - Adding window rolling quantity as a parameter - Walk Forward Analysis
My opinion is that you should try to implement your own GridSearchWalkForwardTest. I used GridSearch once to do the training and implemented the same GridSearch myself, and I didn't get the same results, even though I should have.
What I did in the end was use my own function. You have more control over the training and test sets, and more control over the parameters you train.
I've written some code that I hope could be helpful to someone.
'sequence' is the period of the time series. I am training a model on sequences up to 40, predicting 41, then training up to 41 to predict 42, and so on, up until the max. 'quantity' is the target variable, and the average of all the errors is my evaluation metric.
import numpy as np
import sklearn.metrics
from sklearn.linear_model import LinearRegression

RMSE = []
for sequence in range(40, df.sequence.max() + 1):
    train = df[df['sequence'] < sequence]
    test = df[df['sequence'] == sequence]
    X_train, X_test = train.drop(['quantity'], axis=1), test.drop(['quantity'], axis=1)
    y_train, y_test = train['quantity'].values, test['quantity'].values
    mdl = LinearRegression()
    mdl.fit(X_train, y_train)
    y_pred = mdl.predict(X_test)
    error = sklearn.metrics.mean_squared_error(test['quantity'].values, y_pred)
    RMSE.append(error)

print('Mean RMSE = %.5f' % np.mean(RMSE))
Leveraging TimeSeriesSplit (from scikit-learn's model_selection), the helper below defines fixed rolling train and test windows. Note that the first training window may include additional excess data (I prefer to keep it rather than clip it):
import math
from sklearn.model_selection import TimeSeriesSplit

def tscv(X, train_size, test_size):
    folds = math.floor(len(X) / test_size)
    tscv = TimeSeriesSplit(n_splits=folds, test_size=test_size)
    splits = []
    for train_index, test_index in tscv.split(X):
        if len(train_index) < train_size:
            continue
        elif len(train_index) - train_size < test_size and len(train_index) - train_size > 0:
            pass
        else:
            train_index = train_index[-train_size:]
        splits.append([train_index, test_index])
    return splits
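Since GridSearchCV's cv argument also accepts an explicit list of (train, test) index pairs, the helper above can be used along these lines (the estimator, grid and window sizes are placeholders):

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# tscv returns a plain list of (train_indices, test_indices) pairs,
# which GridSearchCV accepts directly as its cv argument
splits = tscv(X, train_size=100, test_size=20)
search = GridSearchCV(Ridge(), {'alpha': [0.1, 1.0, 10.0]}, cv=splits)
search.fit(X, y)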
I use this custom class to create disjoint splits based on StratifiedKFold (could be replaced by KFold or others), in order to create the following training scheme:
|X||V|O|O|O|
|O|X||V|O|O|
|O|O|X||V|O|
|O|O|O|X||V|
X / V are the training / validation sets.
"||" indicates a gap (parameter n_gap: int>0) truncated at the beginning of the validation set, in order to prevent leakage effects.
You could easily extend it to get longer lookback windows for the training sets.
from sklearn.model_selection import StratifiedKFold

class StratifiedWalkForward(object):

    def __init__(self, n_splits, n_gap):
        self.n_splits = n_splits
        self.n_gap = n_gap
        self._cv = StratifiedKFold(n_splits=self.n_splits + 1, shuffle=False)

    def split(self, X, y, groups=None):
        splits = self._cv.split(X, y)
        _ixs = []
        for ix in splits:
            _ixs.append(ix[1])
        for i in range(1, len(_ixs)):
            yield tuple((_ixs[i - 1], _ixs[i][_ixs[i] > _ixs[i - 1][-1] + self.n_gap]))

    def get_n_splits(self, X, y, groups=None):
        return self.n_splits
Note that the datasets may not be perfectly stratified afterwards, because of the truncation with n_gap.
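As a usage sketch (with toy data and a placeholder estimator), the class can be passed directly as the cv argument of cross_val_score or GridSearchCV, since it exposes split and get_n_splits:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy time-ordered data, purely for illustration
rng = np.random.RandomState(0)
X = rng.randn(500, 5)
y = (X[:, 0] + rng.randn(500) > 0).astype(int)

cv = StratifiedWalkForward(n_splits=4, n_gap=5)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores)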

faster data fitting (or learning) function in python scikit

I am using scikit-learn for my machine learning purposes. I followed the steps exactly as mentioned in its official documentation, but I encounter two problems. Here is the main part of the code:
1) trdata is training data created using sklearn.train_test_split.
2) ptest and ntest are the test data for positives and negatives respectively.
## Preprocessing
scaler = StandardScaler()
scaler.fit(trdata)
trdata = scaler.transform(trdata)
ptest = scaler.transform(ptest)
ntest = scaler.transform(ntest)

## Building the classifier
# setting gamma and C ranges for grid search optimization, RBF kernel and SVM classifier
crange = 10.0 ** np.arange(-2, 9)
grange = 10.0 ** np.arange(-5, 4)
pgrid = dict(gamma=grange, C=crange)
cv = StratifiedKFold(y=tg, n_folds=3)

## Threshold ranging
clf = GridSearchCV(SVC(), param_grid=pgrid, cv=cv, n_jobs=8)

## Training the classifier: semi-supervised algorithm
clf.fit(trdata, tg, n_jobs=8)
Problem 1) When I use n_jobs=8 in GridSearchCV, the code runs up to the GridSearchCV call but then hangs, or takes an exceptionally long time without producing a result when executing clf.fit, even for a very small dataset. When I remove it, both execute, but clf.fit takes a very long time to converge for large datasets. My data size is a 600 x 12 matrix for both positives and negatives. Can you tell me what exactly n_jobs does and how it should be used? Also, is there any faster fitting technique or modification to the code that can make it faster?
Problem 2) Should StandardScaler be fit on the positive and negative data combined, or separately for each? I suppose it has to be fit on the combined data, because only then can we apply the same scaler parameters to the test sets.
SVC can be very sensitive to data that is not normalized; you may try to normalize the data with:
from sklearn import preprocessing
trdata = preprocessing.scale(trdata)
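For context, here is a sketch of how that normalization slots into the original grid search, written against the current scikit-learn API (the question uses the older StratifiedKFold(y=..., n_folds=...) form); trdata and tg come from the question, and note that n_jobs is a constructor argument of GridSearchCV, not something fit() accepts:

import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Scale the training data as suggested above (for applying identical statistics
# to the test sets, the StandardScaler fit/transform approach from the question works too)
trdata = preprocessing.scale(trdata)

crange = 10.0 ** np.arange(-2, 9)
grange = 10.0 ** np.arange(-5, 4)
pgrid = dict(gamma=grange, C=crange)

# Parallelism belongs on GridSearchCV (n_jobs parallelizes the grid search itself)
cv = StratifiedKFold(n_splits=3)
clf = GridSearchCV(SVC(), param_grid=pgrid, cv=cv, n_jobs=8)
clf.fit(trdata, tg)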
