I am using IsolationForest as follows to detect outlier data points of my dataset.
clf = IsolationForest(max_samples='auto',
                      random_state=42,
                      behaviour="new",
                      contamination=.01)
clf.fit(X)
y_pred_train = clf.predict(X)

outliers = []
for item in np.where(y_pred_train == -1)[0]:
    outliers.append(df_nodes[item])
I want the predicted outliers as a ranked list. That is, I want to know which point is the strongest outlier candidate, which is the next, and so on (perhaps sorted by some prediction score or probability). I have been trying to find a way to do this in sklearn, but so far without success. Please let me know a suitable way of doing this.
I am happy to provide more details if needed.
Instead of using predict, use decision_function.
From the docs:
Methods
decision_function(self, X) Average anomaly score of X of the base classifiers.
Then, you can rank them based on their anomaly score. The lower this value, the more abnormal the observation is.
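For example, a minimal sketch of ranking the predicted outliers this way, assuming X and df_nodes are the objects from your snippet (I leave out the behaviour argument, which only exists on some older sklearn versions):

import numpy as np
from sklearn.ensemble import IsolationForest

clf = IsolationForest(max_samples='auto', random_state=42, contamination=.01)
clf.fit(X)

scores = clf.decision_function(X)      # lower = more anomalous
y_pred_train = clf.predict(X)          # -1 for outliers, 1 for inliers

# indices of the predicted outliers, sorted so the most anomalous comes first
outlier_idx = np.where(y_pred_train == -1)[0]
ranked_idx = outlier_idx[np.argsort(scores[outlier_idx])]
ranked_outliers = [df_nodes[i] for i in ranked_idx]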
TL;DR: I'm looking for a good way to compare the output of different scikit-learn ML models on a multi-output classification problem: labelling social media messages according to the different disaster response categories they might fall into. I'm currently just using precision_recall_fscore_support on each label and then averaging the results, but I'm not convinced that this is a good solution.
In detail: As part of an exercise I'm doing for an online data science course, I'm looking at a dataset of social media messages that occurred during natural disasters. The goal of the exercise is to train a machine learning model to classify these messages according to the various emergency departments they relate to, such as: aid_related, medical_help, weather_related, floods, etc...
So for example the following message: "UN reports Leogane 80-90 destroyed. Only Hospi..." is classed in my training data as 'medical_products', 'aid_related' and 'request'.
I've started off using scikit-learn's KNeighborsClassifier, and MultiOutputClassifier. I'm also using gridsearch to compare parameters inside the model:
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(KNeighborsClassifier()))
])

parameters = {'clf__estimator__n_neighbors': [5, 7]}

cv = GridSearchCV(pipeline, parameters)
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)  # predict with the fitted GridSearchCV, not the unfitted pipeline
When I finally get the model output (it takes forever with just two parameter values to compare), I use the following function to pull out a matrix with the average precision, recall and f-score for each column:
def classify_model_output(y_test, y_pred):
    classification_scores = []
    for i, column in enumerate(y_test.columns):
        classification_scores.append(precision_recall_fscore_support(y_test[column], y_pred[:, i]))

    df_classification = pd.DataFrame(classification_scores)
    df_classification.columns = ['precision', 'recall', 'fscore', 'support']
    df_classification.set_index(y_test.columns, inplace=True)

    # the loop below splits the precision, recall and f-score columns into two,
    # one for negatives and one for positives (0 and 1)
    for column in df_classification.columns:
        column_1 = df_classification[column].apply(lambda x: x[0]).rename(column + str(0))
        column_2 = df_classification[column].apply(lambda x: x[1]).rename(column + str(1))
        df_classification.drop([column], axis=1, inplace=True)
        df_classification = pd.concat([df_classification, column_1, column_2], axis=1)

    # finally, take the average over the labels to get a single score vector for the model
    df_classification_avg = df_classification.mean(axis=0)
    return df_classification_avg
The resulting df_classification table looks like this (top 5 rows):
And here's what I get when I compare the average classification tables (produced by the previous method) for knn with 5 neighbors (avg_knn), knn with 7 neighbors (knn_avg_2), and random forest (rf) - yellow cells represent the max for that row:
But I'm not sure how to interpret this. On the face of it, it looks like Random Forest (rf) performed best. But I'm not sure if this is the best way to compare the models, or if using the average even makes sense here.
Does anyone have any advice on the best way to accurately and transparently compare my models, in the case of a multioutput problem like this one?
Edit: updated the code block with an easier-to-read function, and added a comparison of three models.
If your training data is not biased towards any particular output label, i.e. there is a balanced amount of training data for every label, then you can go with the accuracy score.
However, if your data is imbalanced, i.e. the training data is skewed towards one or two particular output labels, then go with precision and recall.
Between precision and recall, the choice depends on your needs. If missing a positive case is costly, go for recall: at an airport there is only a tiny chance that any given bag contains a bomb, but you still screen every bag. That is recall.
When you care more about how many of your positive predictions are actually correct, go for precision.
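As a concrete illustration, sklearn can compute these averaged metrics directly on the full multi-label output, which may be simpler than averaging per-column results by hand. A minimal sketch, assuming y_test and y_pred are the multi-label arrays from the question:

from sklearn.metrics import classification_report, precision_recall_fscore_support

# macro averaging treats every label equally; use average='micro' or 'weighted'
# if some labels are much more frequent than others
precision, recall, fscore, _ = precision_recall_fscore_support(y_test, y_pred, average='macro')
print('macro precision=%.3f recall=%.3f f1=%.3f' % (precision, recall, fscore))

# per-label breakdown in one call
print(classification_report(y_test, y_pred, target_names=list(y_test.columns)))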
I have been learning about classification techniques and have studied random forest, gradient boosting, etc. Based on some code available online, I tried to write Python 3 code for random forest and GBM. My objective is to get the probability values from the model, not just look at accuracy, as I intend to use the probability values to create a KS table later on.
I used the readily available titanic data set to start practicing.
Here are some of the steps I took:
# load train data
train_df = pd.read_csv('***/classification/titanic/train.csv')

# load test data
test_df = pd.read_csv('***/Desktop/classification/titanic/test.csv')

# drop some variables in train data
train_df = train_df.drop(['Ticket', 'Cabin'], axis=1)

# drop some variables in test data
test_df = test_df.drop(['Ticket', 'Cabin'], axis=1)

# I calculated the Title variable (again based on multiple threads on Kaggle)
train_df = pd.get_dummies(train_df, columns=['Pclass', 'Sex', 'Title'], drop_first=True)
test_df = pd.get_dummies(test_df, columns=['Pclass', 'Sex', 'Title'], drop_first=True)

# I checked for missing and IV values next (not including that code here)
predictors = [x for x in train_df.columns if x not in ['Survived', 'PassengerID']]
predictors
# create classifier object (GBM)
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(random_state=10)
# fit the classifier with x and y data
clf.fit(train_df[predictors], train_df.Survived)
prob = pd.DataFrame({'prob': clf.predict_proba(train_df[predictors])[:, 1]})
prob['prob'].value_counts()

# create classifier object (RF)
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=10)
# fit the classifier with x and y data
clf.fit(train_df[predictors], train_df.Survived)
prob = pd.DataFrame({'prob': clf.predict_proba(train_df[predictors])[:, 1]})
prob['prob'].value_counts()
Now when I check the probability values from the two models, I notice that for the random forest output a significant chunk of the samples get a probability score of exactly 0, whereas that is not the case for the GBM model.
I understand that the techniques are different, but how can the results be so far off? Am I missing something?
With a large chunk of the population tagged with a probability score of 0, my KS table goes for a toss.
Welcome to SO! Since you don't seem to be having an issue with code execution specifically, or with totally incorrect outputs, this looks like it is more appropriate for Cross Validated, where you can find answers to questions about statistical concerns.
In fact, I'd suggest that answers to this question might give you some good insight into why you are seeing very different values from the predict_proba method. In short: while GradientBoostingClassifier and RandomForestClassifier both use tree-based methods, what they do is very different, so a direct comparison of their predicted probabilities is not necessarily appropriate.
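To make the difference concrete, here is a rough illustration of where the two probability estimates come from (a simplified sketch of the general mechanism, not the exact sklearn internals): a random forest averages the class frequencies of the leaves a sample falls into, so a sample that lands in a pure "did not survive" leaf in every tree gets a probability of exactly 0, while gradient boosting passes an additive raw score through a sigmoid, which essentially never produces exactly 0.

import numpy as np

# random forest: average of per-tree leaf class-1 fractions for one sample
rf_leaf_fractions = np.array([0.0, 0.0, 0.0])      # every tree's leaf is pure class 0
rf_proba = rf_leaf_fractions.mean()                # -> exactly 0.0

# gradient boosting: sigmoid of the summed tree outputs (log-odds scale)
gbm_raw_score = -4.7                               # hypothetical raw score
gbm_proba = 1.0 / (1.0 + np.exp(-gbm_raw_score))   # -> small, but never exactly 0

print(rf_proba, gbm_proba)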
I'm trying to detect outliers in a dataframe using the Isolation Forest algorithm from sklearn.
Here's the code I'm using to set up the algorithm:
iForest = IsolationForest(n_estimators=100, max_samples=256, contamination='auto', random_state=1, behaviour='new')
iForest.fit(dataset)
scores = iForest.decision_function(dataset)
Now, since I don't know what a good value for the contamination could be, I would like to check my scores and decide where to draw the line based on the distribution of the scores. Here's the code for the graph and the graph itself:
plt.figure(figsize=(12, 8))
plt.hist(scores, bins=50);
Is it correct to assume that negative scores indicate outliers in my dataframe? I can't find a good explanation on the range of the IF scores and how these scores work (why do I get negative scores?).
Additionally, is there a way to attach these scores to the original dataset and manually check rows with negative scores to see if they make sense?
Thanks!
One way of approaching this problem is to make use of the score_samples method available in sklearn's IsolationForest. Once you have fitted the model to your data, use score_samples to get the abnormality score for each sample (the lower the value, the more abnormal it is). Since you don't have ground-truth labels for the anomalies, you can sort your samples by these scores and manually review the records, starting with the lowest score, to check whether they really are anomalies. In the process you can settle on a threshold score for classifying a data point as an anomaly, which you can then apply to any new data.
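For example, a minimal sketch, assuming dataset is the dataframe from the question and iForest has already been fitted as above:

scored = dataset.copy()
scored['score'] = iForest.score_samples(dataset)          # lower = more anomalous
scored['decision'] = iForest.decision_function(dataset)   # negative = predicted outlier

# review the most suspicious rows first
most_anomalous = scored.sort_values('score').head(20)
print(most_anomalous)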
The return value of score_samples is $-s(x,\psi)$, whose range is $[-1, 0]$; values close to $-1$ correspond to short average path lengths, i.e. more abnormal points.
decision_function shifts score_samples by the offset (default $-0.5$), giving values in roughly $[-0.5, 0.5]$.
predict then converts decision_function to $-1$ or $1$ using the threshold implied by the predefined anomaly rate (contamination).
import numpy as np
from sklearn.ensemble import IsolationForest

iforest = IsolationForest(n_estimators=100,
                          max_features=1.0,
                          max_samples='auto',
                          contamination='auto',
                          bootstrap=False,
                          n_jobs=1,
                          random_state=1)
iforest.fit(X)

scores = iforest.score_samples(X)
predict = iforest.predict(X)
decision = iforest.decision_function(X)
offset = iforest.offset_          # default -0.5 when contamination='auto'

print(offset)
print(iforest.max_samples_)

# decision_function is score_samples minus the offset,
# and predict thresholds the decision function at 0
assert np.allclose(decision, scores - offset)
assert np.allclose(predict, np.where(decision < 0, -1, 1))
I'm wondering if it is possible for Sklearn's RFECV to select a fixed number of the most important features. For example, working on a dataset with 617 features, I have been trying to use RFECV to see which 5 of those features are the most significant. However, RFECV does not have the parameter 'n_features_to_select', unlike RFE (which confuses me). How should I deal with this?
According to this quora post
The RFECV object helps to tune or find this n_features parameter using cross-validation. For every step where "step" number of features are eliminated, it calculates the score on the validation data. The number of features left at the step which gives the maximum score on the validation data, is considered to be "the best n_features" of your data.
This says that RFECV determines the optimal number of features (n_features) to get the best result.
The fitted RFECV object contains a ranking_ attribute with the feature ranking and a support_ mask to select the optimal features found.
However, if you MUST select the top n features from RFECV, you can use the ranking_ attribute:
optimal_features = X[:, selector.support_]   # selector is a fitted RFECV object

n = 6                                        # to select the top 6 features
feature_ranks = selector.ranking_            # selector is a fitted RFECV object
feature_ranks_with_idx = enumerate(feature_ranks)
sorted_ranks_with_idx = sorted(feature_ranks_with_idx, key=lambda x: x[1])
top_n_idx = [idx for idx, rnk in sorted_ranks_with_idx[:n]]
top_n_features = X[:, top_n_idx]
Reference:
sklearn documentation, Quora post
I know that this is an old question, but I think it is still relevant.
I don't think shanmuga's solution is right because features within the same rank are not ordered by importance. That is, if selector.ranking_ has 3 features with rank 1, I don't think it is necessarily true that the first in the list is more important than the second or third.
A naive solution to this problem would be to run RFE while setting n_features_to_select to the desired number and "manually" cross-validate it.
In case you want n features from the optimal m features (with n<m) you can do:
# selector is a fitted RFECV object
feature_importance = selector.estimator_.feature_importances_  # or coef_
# sort by importance, most important first (for coef_ you may want to sort by absolute value)
feature_importance_sorted = sorted(enumerate(feature_importance), key=lambda x: x[1], reverse=True)
top_n_idx = [idx for idx, _ in feature_importance_sorted[:n]]
You should note that multiple features may have the same importance or coefficient, which you might leave out with this approach.
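To make the naive RFE solution mentioned above concrete, here is a minimal sketch; the estimator, dataset and n are placeholders, so adapt them to your own pipeline:

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=50, random_state=0)
n = 5  # the fixed number of features you want

pipe = Pipeline([
    ('rfe', RFE(LogisticRegression(max_iter=1000), n_features_to_select=n)),
    ('clf', LogisticRegression(max_iter=1000)),
])

# cross-validate the whole pipeline so the feature selection is refit on each fold
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())

# fit on the full data to see which n features were kept
pipe.fit(X, y)
selected_mask = pipe.named_steps['rfe'].support_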
I am trying to evaluate a multivariable dataset by leave-one-out cross-validation and then remove those samples not predictive of the original dataset (Benjamini-corrected, FDR > 10%).
Using the docs on cross-validation, I've found the leave-one-out iterator. However, when trying to get the score for the nth fold, an exception is raised saying that more than one sample is needed. Why does .predict() work while .score() doesn't? How can I get the score for a single sample? Do I need to use another approach?
Unsuccessful code:
from sklearn import ensemble, cross_validation, datasets

dataset = datasets.load_linnerud()
x, y = dataset.data, dataset.target
clf = ensemble.RandomForestRegressor(n_estimators=500)

loo = cross_validation.LeaveOneOut(x.shape[0])
for train_i, test_i in loo:
    score = clf.fit(x[train_i], y[train_i]).score(x[test_i], y[test_i])
    print('Sample %d score: %f' % (test_i[0], score))
Resulting exception:
ValueError: r2_score can only be computed given more than one sample.
[EDIT, to clarify]:
I am not asking why this doesn't work, but for a different approach that does. After fitting/training my model, how do I test how good a single sample fits the trained model?
cross_validation.LeaveOneOut(x.shape[0]) creates as many folds as there are rows, so each validation run gets only one instance.
The R^2 score compares your predictions against the variance of the true values in the fold, and with a single held-out instance there is no variance to compare against. That is what your error message says: it needs more than one sample to compute the r^2 value.
Generally, in the ML world, people report 10-fold or 5-fold cross-validation results, so I would recommend using 5 or 10 folds instead.
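For instance, a minimal sketch of 5-fold cross-validation on the same data, written against the newer model_selection API (adapt the import if you are on an older sklearn version):

from sklearn import datasets, ensemble
from sklearn.model_selection import cross_val_score

dataset = datasets.load_linnerud()
x, y = dataset.data, dataset.target
clf = ensemble.RandomForestRegressor(n_estimators=500)

# each fold now holds several samples, so the R^2 score is well defined
scores = cross_val_score(clf, x, y, cv=5)
print(scores, scores.mean())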
Edit: After a quick discussion with #banana, we realized that the question was not understood correctly initially. Since it is not possible to get the R^2 score for a single data point, an alternative is to calculate the distance between the actual and predicted points. This can be done using:
numpy.linalg.norm(clf.predict(x[test_i])[0] - y[test_i])
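Putting it together, here is a sketch that collects this per-sample distance for every leave-one-out fold, written against the newer model_selection API (an assumption; the original code uses the old cross_validation module):

import numpy as np
from sklearn import datasets, ensemble
from sklearn.model_selection import LeaveOneOut

dataset = datasets.load_linnerud()
x, y = dataset.data, dataset.target
clf = ensemble.RandomForestRegressor(n_estimators=500)

errors = np.zeros(x.shape[0])
for train_i, test_i in LeaveOneOut().split(x):
    clf.fit(x[train_i], y[train_i])
    # Euclidean distance between the predicted and actual target vectors
    errors[test_i[0]] = np.linalg.norm(clf.predict(x[test_i])[0] - y[test_i])

# samples with the largest distance are the ones the model predicts worst
worst_first = np.argsort(errors)[::-1]
print(worst_first[:5], errors[worst_first[:5]])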