I am trying to evaluate a multivariable dataset by leave-one-out cross-validation and then remove those samples not predictive of the original dataset (Benjamini-corrected, FDR > 10%).
Using the docs on cross-validation, I've found the leave-one-out iterator. However, when trying to get the score for the nth fold, an exception is raised saying that more than one sample is needed. Why does .predict() work while .score() doesn't? How can I get the score for a single sample? Do I need to use another approach?
Unsuccessful code:
from sklearn import ensemble, cross_validation, datasets
dataset = datasets.load_linnerud()
x, y = dataset.data, dataset.target
clf = ensemble.RandomForestRegressor(n_estimators=500)
loo = cross_validation.LeaveOneOut(x.shape[0])
for train_i, test_i in loo:
    score = clf.fit(x[train_i], y[train_i]).score(x[test_i], y[test_i])
    print('Sample %d score: %f' % (test_i[0], score))
Resulting exception:
ValueError: r2_score can only be computed given more than one sample.
[EDIT, to clarify]:
I am not asking why this doesn't work, but for a different approach that does. After fitting/training my model, how do I test how well a single sample fits the trained model?
cross_validation.LeaveOneOut(x.shape[0]) creates as many folds as there are rows, so each validation run gets only one instance.
Now, intuitively, to draw a "line" you need at least two points; more precisely, R² compares your predictions against the spread of the true values in the test fold, and with a single instance there is no spread to compare against. That is what your error message is telling you: it needs more than one instance (or sample) to compute the r^2 value.
Generally, in the ML world, people report 10-fold or 5-fold cross-validation results, so I would recommend setting the number of folds to 5 or 10 accordingly; a rough sketch follows.
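As a minimal sketch (reusing x, y, and clf from your snippet, and the same old cross_validation API, so treat it as an illustration rather than the one true way), 5-fold CV gives each validation fold enough samples for an R² score:

# 5-fold CV: each validation fold now holds several samples, so .score() works
kf = cross_validation.KFold(x.shape[0], n_folds=5, shuffle=True, random_state=0)
for fold, (train_i, test_i) in enumerate(kf):
    score = clf.fit(x[train_i], y[train_i]).score(x[test_i], y[test_i])
    print('Fold %d R^2: %f' % (fold, score))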
Edit: After a quick discussion with #banana, we realized that the question was not understood correctly initially. Since it is not possible to get the R2 score for a single data point, an alternative is to calculate the distance between the actual and predicted points. This can be done using
numpy.linalg.norm(clf.predict(x[test_i])[0] - y[test_i])
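For example, here is a minimal sketch of the full leave-one-out loop using that per-sample error instead of a per-fold R² (it reuses x, y, clf, and loo from the question):

import numpy as np

errors = []
for train_i, test_i in loo:
    pred = clf.fit(x[train_i], y[train_i]).predict(x[test_i])
    # distance between the predicted and actual target vectors for this one sample
    err = np.linalg.norm(pred[0] - y[test_i][0])
    errors.append(err)
    print('Sample %d error: %f' % (test_i[0], err))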
I have been learning about classification techniques and have studied random forest, gradient boosting, etc. Based on some help from code available online, I tried to write Python 3 code for random forest and GBM. My objective is to get the probability values from the models and not just look at accuracy, as I intend to use the probability values to create a KS table later on.
I used the readily available Titanic data set to start practicing.
Here are some of the steps I took:
# load train data
train_df = pd.read_csv('***/classification/titanic/train.csv')
# load test data
test_df = pd.read_csv('***/Desktop/classification/titanic/test.csv')
# drop some variables in train data
train_df = train_df.drop(['Ticket', 'Cabin'], axis=1)
# drop some variables in test data
test_df = test_df.drop(['Ticket', 'Cabin'], axis=1)
# I calculated the Title variable (again based on multiple threads on Kaggle)
train_df = pd.get_dummies(train_df, columns=['Pclass', 'Sex', 'Title'], drop_first=True)
test_df = pd.get_dummies(test_df, columns=['Pclass', 'Sex', 'Title'], drop_first=True)
# I checked for missing and IV values next (not including that code here)
predictors = [x for x in train_df.columns if x not in ['Survived', 'PassengerID']]
predictors
# create classifier object (GBM)
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(random_state=10)
# fit the classifier with x and y data
clf.fit(train_df[predictors], train_df.Survived)
prob = pd.DataFrame({'prob': clf.predict_proba(train_df[predictors])[:, 1]})
prob['prob'].value_counts()
# create classifier object (RF)
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=10)
# fit the classifier with x and y data
clf.fit(train_df[predictors], train_df.Survived)
prob = pd.DataFrame({'prob': clf.predict_proba(train_df[predictors])[:, 1]})
prob['prob'].value_counts()
Now when I check the probability values from the two different models, I notice that for the random forest output a significant chunk has a probability score of exactly 0, whereas that is not the case for the GBM model.
I understand that the techniques are different, but how can the results be so far apart? Am I missing something?
With a large chunk of the population tagged with 0 as the probability score, my KS table goes for a toss.
Welcome to SO! Since you don't seem to have an issue with code execution specifically, or totally incorrect outputs, this looks like it is more appropriate for Cross Validated, where you can find answers to questions about statistical concerns.
In fact, I'd suggest that the answers to this question might give you some good insight into why you are seeing very different values from the predict_proba method. In short: while GradientBoostingClassifier and RandomForestClassifier both use tree methods, what they do with them is very different, so a direct comparison of the probabilities they produce is not necessarily appropriate.
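If it helps, here is a minimal sketch (not from your code; it assumes you keep the two fitted classifiers in separate variables, here called gbm_clf and rf_clf) for eyeballing the two probability distributions side by side instead of relying on value_counts():

import matplotlib.pyplot as plt

# assumed: gbm_clf and rf_clf are the fitted GradientBoostingClassifier and
# RandomForestClassifier from your code, kept in separate variables
gbm_prob = gbm_clf.predict_proba(train_df[predictors])[:, 1]
rf_prob = rf_clf.predict_proba(train_df[predictors])[:, 1]

plt.hist(gbm_prob, bins=50, alpha=0.5, label='GBM')
plt.hist(rf_prob, bins=50, alpha=0.5, label='Random Forest')
plt.xlabel('Predicted probability of Survived = 1')
plt.legend()
plt.show()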
I'm trying to detect outliers in a dataframe using the Isolation Forest algorithm from sklearn.
Here's the code I'm using to set up the algorithm:
from sklearn.ensemble import IsolationForest

iForest = IsolationForest(n_estimators=100, max_samples=256, contamination='auto', random_state=1, behaviour='new')
iForest.fit(dataset)
scores = iForest.decision_function(dataset)
Now, since I don't know what a good value for the contamination would be, I would like to check my scores and decide where to draw the line based on the distribution of the scores. Here's the code for the histogram of the scores:
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))
plt.hist(scores, bins=50);
Is it correct to assume that negative scores indicate outliers in my dataframe? I can't find a good explanation on the range of the IF scores and how these scores work (why do I get negative scores?).
Additionally, is there a way to attach these scores to the original dataset and manually check rows with negative scores to see if they make sense?
Thanks!
One way of approaching this problem is to make use of the score_samples method available in sklearn's IsolationForest. Once you have fitted the model to your data, use score_samples to obtain the abnormality score of each sample (the lower the value, the more abnormal it is). Since you don't have information about the true anomalies in your data, sort your samples by these scores and manually review the records to see whether the samples with the lowest scores really are anomalies. In the process you can come up with a threshold for classifying a data point as an anomaly, which you can later apply to any new data to check whether it is an anomaly or not. A minimal sketch follows.
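For example (a minimal sketch, assuming dataset is a pandas DataFrame and iForest is the fitted model from the question), you can attach the scores to the original data and review the most abnormal rows:

scored = dataset.copy()
scored['anomaly_score'] = iForest.score_samples(dataset)

# lowest scores first, i.e. the rows the model considers most abnormal
print(scored.sort_values('anomaly_score').head(20))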
The return value of score_samples is $-s(x,\psi)$, whose range is $[-1, 0]$; values close to $-1$ correspond to short average path lengths, i.e. anomalies.
decision_function shifts score_samples into roughly $[-0.5, 0.5]$ by subtracting offset_.
predict then converts decision_function to $-1$ or $1$ according to the predefined anomaly rate (contamination).
from sklearn.ensemble import IsolationForest
import numpy as np

iforest = IsolationForest(n_estimators=100,
                          max_features=1.0,
                          max_samples='auto',
                          contamination='auto',
                          bootstrap=False,
                          n_jobs=1,
                          random_state=1)
iforest.fit(X)

scores = iforest.score_samples(X)
predict = iforest.predict(X)
decision = iforest.decision_function(X)
offset = iforest.offset_  # default -0.5 when contamination='auto'
print(offset)
print(iforest.max_samples_)

# decision_function is just score_samples shifted by offset_
assert np.allclose(decision, scores - offset)
# predict thresholds decision_function at 0
assert np.allclose(predict, np.where(decision < 0, -1, 1))
This is my first time using Stack Exchange, but I need help with a problem (it is not a homework or assignment problem):
I have two decision trees: D1 = DecisionTreeClassifier(max_depth=4, criterion='entropy', random_state=1) and D2 = DecisionTreeClassifier(max_depth=8, criterion='entropy', random_state=1). When I performed 5-fold cross-validation on both of them for a given set of features and corresponding labels, their average validation accuracies over the 5 folds were 0.59 and 0.57 respectively. How do I determine whether the difference between their performances is statistically significant? (P.S. we are to use significance level = 0.01.)
Do say if any important information or term is missing here.
This is a very good question, and the answer proves to be not that straightforward.
Instinctively, most people would tend to recommend a Student's paired t-test; but, as explained in the excellent post Statistical Significance Tests for Comparing Machine Learning Algorithms of Machine Learning Mastery, this test is not actually suited for this case, as its assumptions are in fact violated:
In fact, this [Student's t-test] is a common way to compare
classifiers with perhaps hundreds of published papers using this
methodology.
The problem is, a key assumption of the paired Student’s t-test has
been violated.
Namely, the observations in each sample are not independent. As part
of the k-fold cross-validation procedure, a given observation will be
used in the training dataset (k-1) times. This means that the
estimated skill scores are dependent, not independent, and in turn
that the calculation of the t-statistic in the test will be
misleadingly wrong along with any interpretations of the statistic and
p-value.
The article goes on to recommend McNemar's test (see also this, now closed, SO question), which is implemented in the statsmodels Python package; a rough sketch of calling it is shown below. I will not pretend to know anything about it and I have never used it, so you might need to do some further digging by yourself here...
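For reference only, here is a rough sketch (not from the original discussion, and worth checking against the statsmodels docs) of how McNemar's test could be invoked; the iris data and a single train/test split are used purely for illustration, with the two classifiers from the question:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from statsmodels.stats.contingency_tables import mcnemar

X_cls, y_cls = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_cls, y_cls, random_state=0)

D1 = DecisionTreeClassifier(max_depth=4, criterion='entropy', random_state=1).fit(X_tr, y_tr)
D2 = DecisionTreeClassifier(max_depth=8, criterion='entropy', random_state=1).fit(X_tr, y_tr)

correct_1 = D1.predict(X_te) == y_te
correct_2 = D2.predict(X_te) == y_te

# 2x2 contingency table: (both correct, only D1 correct, only D2 correct, both wrong)
table = [[np.sum(correct_1 & correct_2),  np.sum(correct_1 & ~correct_2)],
         [np.sum(~correct_1 & correct_2), np.sum(~correct_1 & ~correct_2)]]
print(mcnemar(table, exact=True))  # prints the test statistic and p-value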
Nevertheless, as reported by the aforementioned post, a Student's t-test can be a "last resort" approach:
It’s an option, but it’s very weakly recommended.
and this is what I am going to demonstrate here; use it with caution.
To start with, you will need not only the averages, but the actual values of your performance metric in each one of the k-folds of your cross-validation. This is not exactly trivial in scikit-learn, but I have recently answered a relevant question on Cross-validation metrics in scikit-learn for each data split, and I will adapt the answer here using scikit-learn's Boston dataset and two decision tree regressors (you can certainly adapt these to your own exact case):
from sklearn.model_selection import KFold
from sklearn.datasets import load_boston
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor
X, y = load_boston(return_X_y=True)
n_splits = 5
kf = KFold(n_splits=n_splits, shuffle=True)
model_1 = DecisionTreeRegressor(max_depth = 4, criterion='mae',random_state=1)
model_2 = DecisionTreeRegressor(max_depth = 8, criterion='mae', random_state=1)
cv_mae_1 = []
cv_mae_2 = []
for train_index, val_index in kf.split(X):
    model_1.fit(X[train_index], y[train_index])
    pred_1 = model_1.predict(X[val_index])
    err_1 = mean_absolute_error(y[val_index], pred_1)
    cv_mae_1.append(err_1)

    model_2.fit(X[train_index], y[train_index])
    pred_2 = model_2.predict(X[val_index])
    err_2 = mean_absolute_error(y[val_index], pred_2)
    cv_mae_2.append(err_2)
cv_mae_1 contains the values of our metric (here mean absolute error - MAE) for each of the 5 folds of our 1st model:
cv_mae_1
# result:
[3.080392156862745,
2.8262376237623767,
3.164851485148514,
3.5514851485148515,
3.162376237623762]
and similarly cv_mae_2 for our 2nd model:
cv_mae_2
# result
[3.1460784313725494,
3.288613861386139,
3.462871287128713,
3.143069306930693,
3.2490099009900986]
Having obtained these lists, it is now straightforward to calculate the paired t-test statistic along with the corresponding p-value, using the respective method of scipy:
from scipy import stats
stats.ttest_rel(cv_mae_1,cv_mae_2)
# Ttest_relResult(statistic=-0.6875659723031529, pvalue=0.5295196273427171)
where, in our case, the huge p-value means that there is not a statistically significant difference between the means of our MAE metrics.
Hope this helps - do not hesitate to dig deeper by yourself...
How can one use cross_val_score for regression? The default scoring seems to be accuracy, which is not very meaningful for regression. Suppose I would like to use mean squared error; is it possible to specify that in cross_val_score?
I tried the following two, but neither works:
scores = cross_validation.cross_val_score(svr, diabetes.data, diabetes.target, cv=5, scoring='mean_squared_error')
and
scores = cross_validation.cross_val_score(svr, diabetes.data, diabetes.target, cv=5, scoring=metrics.mean_squared_error)
The first one generates a list of negative numbers while mean squared error should always be non-negative. The second one complains that:
mean_squared_error() takes exactly 2 arguments (3 given)
I don't have the reputation to comment, but I want to provide this link for you and/or any passers-by, where the negative output of the MSE in scikit-learn is discussed: https://github.com/scikit-learn/scikit-learn/issues/2439
In addition (to make this a real answer), your first option is correct: not only is MSE the metric you want to use to compare models, but R^2 cannot be calculated depending (I think) on the type of cross-validation you are using.
If you choose MSE as a scorer, it outputs a list of errors which you can then take the mean of, like so:
# Doing linear regression with leave one out cross val
from sklearn import cross_validation, linear_model
import numpy as np
# Including this to remind you that it is necessary to use numpy arrays rather
# than lists otherwise you will get an error
X_digits = np.array(x)
Y_digits = np.array(y)
loo = cross_validation.LeaveOneOut(len(Y_digits))
regr = linear_model.LinearRegression()
scores = cross_validation.cross_val_score(regr, X_digits, Y_digits, scoring='mean_squared_error', cv=loo,)
# This will print the mean of the list of errors that were output and
# provide your metric for evaluation
print(scores.mean())
The first one is correct. It outputs the negative of the MSE, as it always tries to maximize the score. Please help us by suggesting an improvement to the documentation.
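In other words (a minimal sketch, using the scores array returned by your first call above), simply flip the sign to recover ordinary, non-negative MSE values:

mse_scores = -scores        # negate the scores returned by cross_val_score
print(mse_scores.mean())    # mean MSE across the folds, now non-negative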
I am using sklearn.lda for a classification purpose and was a little puzzled about the score function that prints the mean classification error.
Is it determined by leave one out - jackknife?
How do I interpret the result? It's only a float value without much documentation.
Thanks in advance,
EL
The score method takes samples X and their true labels y and compares its own predictions with y. It returns the mean accuracy, which is always a single figure. For example,
from sklearn.lda import LDA

lda = LDA().fit(X, y)
print(lda.score(X, y))
will print the accuracy of the classifier on its own training set.
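As a quick sanity check (a minimal sketch reusing lda, X, and y from above), the same number can be computed by hand from the predictions:

from sklearn.metrics import accuracy_score

# identical to lda.score(X, y): the fraction of samples predicted correctly
print(accuracy_score(y, lda.predict(X)))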
Every classifier has a score method, which usually (though not necessarily) returns mean accuracy. The method is used by the GridSearchCV model selection algorithm to determine the quality of the classifier if you don't explicitly give it a scoring argument.