I am trying to build an outlier detector to find outliers in test data. That data varies a bit (more test channels, longer testing).
First I apply a train/test split, because I want to use grid search on the training data to get the best results. This is time-series data from multiple sensors, and I removed the time column beforehand.
X shape : (25433, 17)
y shape : (25433, 1)
X_train, X_test, y_train, y_test = train_test_split(X,
y,
test_size=0.33,
random_state=(0))
I standardize afterwards and then convert the arrays to int, because GridSearch doesn't seem to like continuous data. This can surely be done better, but I want it to work before I optimize the code.
'X'
mean = StandardScaler().fit(X_train)
X_train = mean.transform(X_train)
X_test = mean.transform(X_test)
X_train = np.round(X_train,2)*100
X_train = X_train.astype(int)
X_test = np.round(X_test,2)*100
X_test = X_test.astype(int)
'y'
yeah = StandardScaler().fit(y_train)
y_train = yeah.transform(y_train)
y_test = yeah.transform(y_test)
y_train = np.round(y_train,2)*100
y_train = y_train.astype(int)
y_test = np.round(y_test,2)*100
y_test = y_test.astype(int)
I chose Isolation Forest because it's fast, gives pretty good results, and can handle huge data sets (I currently use only a chunk of the data for testing).
SVM might also be an option I want to check out.
Then I set up the GridSearchCV:
clf = IForest(random_state=47, behaviour='new',
n_jobs=-1)
param_grid = {'n_estimators': [20,40,70,100],
'max_samples': [10,20,40,60],
'contamination': [0.1, 0.01, 0.001],
'max_features': [5,15,30],
'bootstrap': [True, False]}
fbeta = make_scorer(fbeta_score,
average = 'micro',
needs_proba=True,
beta=1)
grid_estimator = model_selection.GridSearchCV(clf,
param_grid,
scoring=fbeta,
cv=5,
n_jobs=-1,
return_train_score=True,
error_score='raise',
verbose=3)
grid_estimator.fit(X_train, y_train)
The problem:
GridSearchCV needs a y argument, so I think this only works with supervised learning? If I run this, I get the following error, which I don't understand:
ValueError: Classification metrics can't handle a mix of multiclass and continuous-multioutput targets
You can use GridSearchCV for unsupervised learning, but it's often tricky to define a scoring metric that makes sense for the problem.
Here's an example in the docs that uses grid search for KernelDensity, an unsupervised estimator. It works without issue because this estimator has a score method (docs).
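As a minimal sketch of that point (synthetic data and the bandwidth grid are assumptions for illustration, not taken from the linked docs example), a grid search over KernelDensity can be fitted without any y and without a scoring argument:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

# Synthetic 1-D data, purely for illustration.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 1))

# No y and no `scoring`: the search falls back to KernelDensity.score,
# which returns the total log-likelihood of the held-out fold.
search = GridSearchCV(
    KernelDensity(kernel="gaussian"),
    param_grid={"bandwidth": [0.1, 0.3, 1.0, 3.0]},
    cv=5,
)
search.fit(X)

best_bandwidth = search.best_params_["bandwidth"]
print(best_bandwidth)
```

The search only works this way because the estimator itself defines what a "good" fit is through its score method.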
In your case, since IsolationForest doesn't have a score method, you'll need to define a custom scorer to pass as the search's scoring argument. There's an answer at this question, and also this question, but I don't think the metrics given there necessarily make sense. Unfortunately, I don't have a useful outlier detection metric in mind; that's a question better suited to the Data Science or Statistics Stack Exchange sites.
I agree with @Ben Reiniger's answer; it has good links to other SO posts on this topic.
You can try creating a custom scorer if you can make use of y_train. This is not strictly unsupervised.
Here is one example where the R2 score is used as the scoring metric.
from sklearn.metrics import r2_score
def scorer_f(estimator, X_train, Y_train):
    # GridSearchCV calls this with the held-out fold of each split
    y_pred = estimator.predict(X_train)
    return r2_score(Y_train, y_pred)
Then you can use it as normal.
clf = IForest(random_state=47, behaviour='new',
n_jobs=-1)
param_grid = {'n_estimators': [20,40,70,100],
'max_samples': [10,20,40,60],
'contamination': [0.1, 0.01, 0.001],
'max_features': [5,15,30],
'bootstrap': [True, False]}
grid_estimator = model_selection.GridSearchCV(clf,
param_grid,
scoring=scorer_f,
cv=5,
n_jobs=-1,
return_train_score=True,
error_score='raise',
verbose=3)
grid_estimator.fit(X_train, y_train)
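If you do have outlier labels, a classification metric may be a more natural fit than R2. Below is a self-contained sketch of the same idea. Everything in it is an assumption for illustration: it uses scikit-learn's IsolationForest rather than IForest, synthetic data, labels that follow the scikit-learn convention (1 = inlier, -1 = outlier), and a hypothetical scorer named outlier_f1.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic data (assumption): 190 inliers around 0, 10 outliers far away.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(190, 3)),
               rng.uniform(6, 8, size=(10, 3))])
# scikit-learn convention: 1 = inlier, -1 = outlier.
y = np.r_[np.ones(190, dtype=int), -np.ones(10, dtype=int)]

def outlier_f1(estimator, X_val, y_val):
    """Hypothetical scorer: F1 of the outlier class on the held-out fold."""
    y_pred = estimator.predict(X_val)  # values in {1, -1}
    return f1_score(y_val, y_pred, pos_label=-1, zero_division=0)

# Stratified, shuffled folds so every fold contains some outliers.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    IsolationForest(random_state=47),
    param_grid={"n_estimators": [20, 50], "contamination": [0.01, 0.05]},
    scoring=outlier_f1,
    cv=cv,
)
search.fit(X, y)  # y is ignored by IsolationForest.fit, used only for scoring
print(search.best_params_, search.best_score_)
```

Whether F1 on the outlier class is the right metric depends on how much you trust the labels; it is shown here only as one possible choice.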
Related
For a classification task I try to compare two models (with SVM) where the difference between the models is just one feature (namely, cult_distance & inst_dist). Even after HP-tuning (using GridSearchCV) both models show the exact same accuracy. When I try a third model with both cult_distance & inst_dist in the model, still the exact same accuracy. My code is below and in the screenshot you see the classification report. As you can see in the report, the minority class is not predicted. I think my model is overfitting but I don't know how to solve this. Note: also when I run the model without any other features than cult_distance & inst_dist, the accuracy is still the same.
features1 = ['inst_dist', 'cult_distance', 'deal_numb', 'acq_sic', 'acq_stat', 'acq_form',
'tar_sic', 'tar_stat', 'tar_form', 'vendor_name', 'pay_method',
'advisor', 'stake', 'deal_value', 'days', 'sic_sim']
X = data.loc[:, features1]
y = data.loc[:, ['deal_status']]
### training the model
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, train_size = .75)
# defining parameter range
rfc = svm.SVC(random_state = 0)
param_grid = {'C': [0.001],
'gamma': ['auto'],
'kernel': ['rbf'],
'degree':[1]}
grid = GridSearchCV(rfc, param_grid = param_grid, refit = True, verbose = 2,n_jobs=-1)
# fitting the model for grid search
grid.fit(X_train, y_train)
# print best parameter after tuning
grid_predictions = grid.predict(X_test)
# print classification report
print(classification_report(y_test, grid_predictions))
EDIT: After a useful comment, I set class_weight in the model. However, the accuracy is now much lower than the baseline of 0.76?
The problem seems to be connected with an unbalanced dataset. One class is present much more often in your dataset, so the model learns to predict only that class. There are numerous ways to combat that; the simplest is to adjust the class_weight parameter of your model. In your situation it should be around class_weight = {0 : 3.2, 1 : 1}, as the second class is about 3 times more frequent. You can calculate that using compute_class_weight from sklearn.
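A small sketch of that calculation, assuming hypothetical labels in which class 1 is 3.2 times as frequent as class 0 (the real counts are not given in the question):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 100 samples of class 0, 320 samples of class 1.
y = np.array([0] * 100 + [1] * 320)

classes = np.unique(y)
# "balanced" computes n_samples / (n_classes * count(class)) per class.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)

# Only the ratio between the weights matters for the relative penalty.
ratio = class_weight[0] / class_weight[1]
print(ratio)
```

The resulting dict can be passed as class_weight to SVC; passing class_weight='balanced' directly should give the same weighting without the manual step.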
Hope that was helpful!
I have used RandomizedSearchCV to tune the parameters of my Random Forest model, as in the code cell below:
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid,
n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
rf_random.fit(X_train[rf_cols], y_train)
It turns out that the rf_random model outperforms any of my manually trained models with the same parameters which were retrieved using
rf_random.best_params_
I need to reproduce the exact prediction that I have made using RandomizedSearchCV, but I am unable to do so, mainly due to two reasons:
best_params_ differ on each run
I am having trouble understanding how RandomizedSearchCV splits the data into train set and validation set, which means that it is nearly impossible for me to train a new model that behaves the same.
What can I do? What more information do I need to reproduce the results? Or is it even possible to reproduce results from RandomizedSearchCV, despite the fact that I have fixed my random_state to 42? Should I stick to GridSearchCV instead if I need to reproduce the results?
I believe you are looking for the best_estimator_ attribute of RandomizedSearchCV. It holds the estimator with the best-scoring parameters on the left-out data, refit on the full training data (with the default refit=True):
from sklearn.model_selection import KFold, RandomizedSearchCV
# random_state only has an effect with shuffle=True (recent scikit-learn raises an error otherwise)
kf = KFold(n_splits=3, shuffle=True, random_state=42)
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid,
                               n_iter = 100, cv = kf, verbose=2, random_state=42, n_jobs = -1)
rf_random.fit(X_train[rf_cols], y_train)
tuned_rf = rf_random.best_estimator_
for train_index, test_index in kf.split(X_train[rf_cols], y_train):
    # use train and test indices here
    pass
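To check reproducibility, one option is to fix every source of randomness (the estimator, the splitter, and the search itself) and run the search twice. The sketch below does that on synthetic data; the dataset, the parameter lists, and the helper name run_search are assumptions for illustration, not the setup from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
param_distributions = {"n_estimators": [10, 20, 30], "max_depth": [2, 4, None]}

def run_search():
    """Hypothetical helper: one fully seeded randomized search."""
    kf = KFold(n_splits=3, shuffle=True, random_state=42)  # fixed splits
    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=0),  # fixed forest randomness
        param_distributions=param_distributions,
        n_iter=5,
        cv=kf,
        random_state=42,  # fixed parameter sampling
    )
    return search.fit(X, y)

first, second = run_search(), run_search()
same_params = first.best_params_ == second.best_params_
same_preds = np.array_equal(first.best_estimator_.predict(X),
                            second.best_estimator_.predict(X))
print(same_params, same_preds)
```

If best_params_ still differ between runs in your own code, an unseeded estimator (rf without random_state) would be the first thing I would check.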
I was working with the Heart Disease Prediction dataset from Kaggle and came across something odd that I couldn't find an answer to.
With default Logistic Regression with 'liblinear' solver (C = 1) I get an accuracy on the train set of 87.26% and 86.81% on the test set. Pretty good. However, I tried using GridSearchCV tweaking C in case I find better values and I constantly get worse results (an accuracy about 85% on train set and 82.5% on test set).
Is GridSearchCV using some other metric for comparing these values of C? I just don't understand why it returns a worse solution.
I leave the last part of my code here.
Default Logistic Regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver = 'liblinear', random_state = 2)
lr = lr.fit(X_train, y_train)
model_score(lr, X_train, y_train, X_test, y_test)
GridSearchCV
from sklearn.model_selection import GridSearchCV
# Finding better hyperparameters for the Logistic Regression
lr_params = [ {'C': np.logspace(-1, 0.3, 30)} ]
lr = LogisticRegression(solver = 'liblinear', random_state = 2)
lr_cv = GridSearchCV(lr, lr_params, cv = 5, scoring = 'accuracy')
lr_cv.fit(X_train, y_train)
lr_best_params = lr_cv.best_params_
lr = LogisticRegression(**lr_best_params)
lr.fit(X_train, y_train)
model_score(lr, X_train, y_train, X_test, y_test)
EDIT
Full code in this link (check section 4-4.1).
In the example below,
pipe = Pipeline([
('scale', StandardScaler()),
('reduce_dims', PCA(n_components=4)),
('clf', SVC(kernel = 'linear', C = 1))])
param_grid = dict(reduce_dims__n_components=[4,6,8],
clf__C=np.logspace(-4, 1, 6),
clf__kernel=['rbf','linear'])
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2)
grid.fit(X_train, y_train)
print(grid.score(X_test, y_test))
I am using StandardScaler(), is this the correct way to apply it to test set as well?
Yes, this is the right way to do this but there is a small mistake in your code. Let me break this down for you.
When you use the StandardScaler as a step inside a Pipeline then scikit-learn will internally do the job for you.
What happens can be described as follows:
Step 0: The data are split into TRAINING data and TEST data according to the cv parameter that you specified in the GridSearchCV.
Step 1: the scaler is fitted on the TRAINING data
Step 2: the scaler transforms TRAINING data
Step 3: the models are fitted/trained using the transformed TRAINING data
Step 4: the scaler is used to transform the TEST data
Step 5: the trained models predict using the transformed TEST data
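The steps above can be sketched by hand and compared against what a pipeline does under cross-validation. The synthetic data, the 3-fold KFold, and the reduced pipeline (scaler plus linear SVC, no PCA) are assumptions to keep the example short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=150, n_features=6, random_state=0)
cv = KFold(n_splits=3)

manual_scores = []
for train_idx, test_idx in cv.split(X):                          # Step 0
    scaler = StandardScaler().fit(X[train_idx])                  # Step 1
    X_tr = scaler.transform(X[train_idx])                        # Step 2
    model = SVC(kernel="linear", C=1).fit(X_tr, y[train_idx])    # Step 3
    X_te = scaler.transform(X[test_idx])                         # Step 4
    manual_scores.append(model.score(X_te, y[test_idx]))         # Step 5

# The same procedure, delegated to a pipeline.
pipe = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1))
pipe_scores = cross_val_score(pipe, X, y, cv=cv)

print(np.round(manual_scores, 4), np.round(pipe_scores, 4))
```

Both score arrays should match, since the scaler is only ever fitted on the training part of each fold in both versions.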
Note: You should be using grid.fit(X, y) and NOT grid.fit(X_train, y_train), because GridSearchCV automatically splits the data into training and testing data (this happens internally).
Use something like this:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA
pipe = Pipeline([
('scale', StandardScaler()),
('reduce_dims', PCA(n_components=4)),
('clf', SVC(kernel = 'linear', C = 1))])
param_grid = dict(reduce_dims__n_components=[4,6,8],
clf__C=np.logspace(-4, 1, 6),
clf__kernel=['rbf','linear'])
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2, scoring= 'accuracy')
grid.fit(X, y)
print(grid.best_score_)
print(grid.cv_results_)
Once you run this code (when you call grid.fit(X, y)), you can access the outcome of the grid search on the fitted grid object, which grid.fit() also returns. The best_score_ attribute gives the best mean cross-validation score observed during the search, and best_params_ describes the combination of parameters that achieved it.
IMPORTANT EDIT 1: if you want to hold out a validation set from the original dataset, use this:
from sklearn.model_selection import train_test_split
X_for_gridsearch, X_future_validation, y_for_gridsearch, y_future_validation = train_test_split(
    X, y, test_size=0.15, random_state=1)
Then use:
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2, scoring= 'accuracy')
grid.fit(X_for_gridsearch, y_for_gridsearch)
Quick answer: Your methodology is correct.
Although the above answer is very good, I would just like to point out some subtleties:
best_score_ [1] is the best cross-validation score, not the generalization performance of the model [2]. To evaluate how well the best parameters found generalize, you should call score on the test set, as you've done. You therefore need to start by splitting the data into a training and a test set, fit the grid search only on X_train, y_train, and then score it with X_test, y_test [2].
Deep Dive:
A threefold split of the data into a training set, a validation set and a test set is one way to prevent overfitting the parameters during grid search. GridSearchCV, on the other hand, uses cross-validation on the training set instead of separate training and validation sets, but this does not replace the test set. This can be verified in [2] and [3].
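A short sketch of the difference between the two numbers, on synthetic data with a small assumed grid (neither is from the question):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=3, scoring="accuracy")
grid.fit(X_train, y_train)  # cross-validation happens inside X_train only

cv_score = grid.best_score_               # mean CV accuracy of the best C
test_score = grid.score(X_test, y_test)   # accuracy on data the search never saw
print(cv_score, test_score)
```

The first number was used to pick the parameters, so it tends to be optimistic; the second is the one to report as generalization performance.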
References:
[1] GridSearchCV
[2] Introduction to Machine Learning with Python
[3] 3.1 Cross-validation: evaluating estimator performance
I am computing ROC AUC values for a Gradient Boosting Classifier using 10-fold cross validation with python sklearn. I have done this in two ways which I thought would give identical results, but they do not: (1) Use cross_val_predict with method = 'predict_proba' to get the predicted probabilities via cross validation, and then compute the AUC for each fold using roc_auc_score, versus (2) Use cross_val_score with scoring = 'roc_auc'. The results are not wildly different, but it bothers me that they differ at all (see code and output below). Can anyone explain this difference?
gbm = GradientBoostingClassifier(loss='deviance', n_estimators=initNumTrees, learning_rate=0.001, subsample=0.5, max_depth=1, random_state=12345, warm_start=True)
foldgen = StratifiedKFold(n_splits=10, shuffle=True, random_state=12345)
cv_probs = cross_val_predict(gbm, X_train, y_train, method='predict_proba', cv=foldgen, n_jobs=n_cores)[:,1]
auc = []
for train_index, test_index in foldgen.split(X_train, y_train):
    auc.append(roc_auc_score(y_train[test_index], cv_probs[test_index]))
np.round(auc,4)
array([ 0.6713, 0.5878, 0.6315, 0.6538, 0.6709, 0.6724, 0.666 ,
0.6857, 0.6426, 0.6581])
versus:
cv_values = cross_val_score(gbm, X_train, y_train, scoring='roc_auc', cv=foldgen, n_jobs=n_cores)
np.round(cv_values,4)
array([ 0.6391, 0.6159, 0.6673, 0.6613, 0.6748, 0.6754, 0.6869,
0.7107, 0.6552, 0.6602])
I ran into the same problem.
I read the documentation and found this article. After that, I started using make_scorer instead of the literal string 'roc_auc'.
That worked for me, and now I get the same result from cross_val_score and from the StratifiedKFold loop.
I hope this is useful.
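Here is a sketch of the comparison on synthetic data. Note the assumptions: it uses LogisticRegression instead of the gradient boosting model, and a plain scorer callable (a hypothetical proba_auc) stands in for make_scorer, because the make_scorer keyword for requesting probabilities differs between scikit-learn versions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import (StratifiedKFold, cross_val_predict,
                                     cross_val_score)

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
model = LogisticRegression(max_iter=1000)
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=12345)

# Way 1: out-of-fold probabilities, then one AUC per fold.
probs = cross_val_predict(model, X, y, method="predict_proba", cv=folds)[:, 1]
auc_manual = [roc_auc_score(y[test], probs[test])
              for _, test in folds.split(X, y)]

# Way 2: cross_val_score with a scorer that explicitly uses predict_proba.
def proba_auc(estimator, X_val, y_val):
    """Hypothetical scorer: AUC from the positive-class probability."""
    return roc_auc_score(y_val, estimator.predict_proba(X_val)[:, 1])

auc_scored = cross_val_score(model, X, y, scoring=proba_auc, cv=folds)

print(np.round(auc_manual, 4), np.round(auc_scored, 4))
```

With the same folds, a deterministic model, and the same score input, the two arrays should agree, which makes it easier to isolate where a discrepancy in your own setup comes from.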