How to use document vectors in isolationforest in sklearn

How to use document vectors in isolationforest in sklearn - python

In understanding what isolation forest really does, I did a sample project as follows using 8 features as follows.
from sklearn.ensemble import IsolationForest
#features
df_selected = df[["feature1", "feature2", "feature3", "feature4", "feature5", "feature6", "feature7", "feature8"]]
X = np.array(df_selected)
#isolation forest
clf = IsolationForest(max_samples='auto', random_state=42, behaviour="new", contamination=.01)
clf.fit(X)
y_pred_train = clf.predict(X)
print(np.where(y_pred_train == -1)[0])
Now, I want to identify what are the outlier documents using isolation forest. For that I trained a doc2vec model using gensim. Now for each of my document in the dataset I have a 300-dimensional vector.
My question is can I straight away use the document vectors in isolation forest as X in the above code to detect outliers? Or do I need to reduce the dimensionality of the vectors before applying them to isolation forest?
I am happy to provide more details if needed.

You can straight away use the predict() to detect outliers unless you plan on removing some variables that would not be considered in the training model.
In general, I would say to do a correlation analysis and remove the variables that are highly correlated with each other (Logic basis being that if they are highly correlated, then they are the same and should not encourage the bias of the variables by doubling the consideration).
Feel free to dispute otherwise or state your considerations as I think the above is really my opinion on how to approach the problem.

Related

How to use KMeans clustering to improve the accuracy of a logistic regression model?

I am a beginner in machine learning in python, and I am working on a binary classification problem. I have implemented a logistic regression model with an average accuracy of around 75%. I have tried numerous ways to improve the accuracy of the model, such as one-hot encoding of categorical variables, scaling of the continuous variables, and I did a grid search to find the best parameters. They all failed to improve the accuracy. So, I looked into unsupervised learning methods in order to improve it.
I tried using KMeans clustering, and I set the n_clusters into 2. I trained the logistic regression model using the X_train and y_train values. After that, I tried testing the model on the training data using cross-validation but I set the cross-validation to be against the labels predicted by the KMeans:
kmeans = KMeans(n_clusters = 2)
kmeans.fit(X_train)
logreg = LogisticRegression().fit(X_train, y_train)
cross_val_score(logreg, X_train, kmeans.labels_, cv = 5)
When using the cross_val_score, the accuracy is averaging over 95%. However, when I use the .score() method:
logreg.score(X_train, kmeans.labels_)
, the score is in the 60s. My questions are:
What does the significance (or meaning) of the score that is produced when testing the model against the labels predicted by k-means?
How can I use k-means clustering to improve the accuracy of the model? I tried adding a 'cluster' column that contains the clustering labels to the training data and fit the logistic regression, but it also didn't improve the score.
Why is there a huge discrepancy between the score when evaluated via cross_val_predict and the .score() method?

I'm having a hard time understanding the context of your problem based on the snippet you provided. Strong work for providing minimal code, but in this case I feel it may have been a bit too minimal. Regardless, I'm going to read between the lines and state some relevent ideas. I'll then attempt to answer your questions more directly.
I am working on a binary classification problem. I have implemented a logistic regression model with an average accuracy of around 75%
This only tells a small amount of the story. knowing what data your classifying and it's general form is pretty vital, and accuracy doesn't tell us a lot about how innaccuracy is distributed through the problem.
Some natural questions:
Is one class 50% accurate and another class is 100% accurate? are the classes both 75% accurate?
what is the class balance? (is there more of one class than the other)?
how much overlap do these classes have?
I recommend profiling your training and testing set, and maybe running your data through TSNE to get an idea of class overlap in your vector space.
these plots will give you an idea of how much overlap your two classes have. In essence, TSNE maps a high dimensional X to a 2d X while attempting to preserve proximity. You can then plot your flagged Y values as color and the 2d X values as points on a grid to get an idea of how tightly packed your classes are in high dimensional space. In the image above, this is a very easy classification problem as each class exists in it's own island. The more these islands mix together, the harder classification will be.
did a grid search to find the best parameters
hot take, but don't use grid search, random search is better. (source Artificial Intelligence by Jones and Barlett). Grid search repeats too much information, wasting time re-exploring similar parameters.
I tried using KMeans clustering, and I set the n_clusters into 2. I trained the logistic regression model using the X_train and y_train values. After that, I tried testing the model on the training data using cross-validation but I set the cross-validation to be against the labels predicted by the KMeans:
So, to rephrase, you trained your model to predict an output given some input, then tested how it performed predicting the same data and got 75%. This is called training accuracy (as opposed to validation or test accuracy). A low training accuracy is indicative of one of two things:
there's a lot of overlap between your classes. If this is the case, I would look into feature engineering. Find a vector space which better segregates the two classes.
there's not a lot of overlap, but the front between the two classes is complex. You need a model with more parameters to segregate your two classes.
model complexity isn't free though. See the curse of dimensionality and overfitting.
ok, answering more directly
these accuracy scores mean your model isn't complex enough to learn the problem, or there's too much overlap between the two classes to see a better accuracy.
I wouldn't use k-means clustering to try to improve this. k-means attempts to find cluster information based on location in a vector space, but you already have flagged data y_train so you already know which clusters data should belong in. Try modifying X_train in some way to get better segregation, or try a more complex model. you can use things like k-means or TSNE to check your transformed X_train for better segregation, but I wouldn't use them directly. Obligatory reminder that you need to test and validate with holdout data. see another answer I provided for more info.
I'd need more code to figure that one out.
p.s. welcome to stack overflow! Keep at it.

How to find feature importance for each class in multiclass classification

I have written code to find the importance of each feature in the entire dataset for multiclass classification. Now I want to find feature importance for each class in multiclass classification, i.e. I want to find the list of features (for each class) that are more important to classify that individual classes.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
model = DecisionTreeClassifier()
model.fit(x3, y3)
importance = model.feature_importances_
for i,v in enumerate(importance):
print('Feature[%0d]:%s, Score: %.6f' % (i,df.columns[i],v))
plt.subplots(figsize=(15,7))
plt.bar([x for x in range(len(importance))], importance)
plt.xlabel('Feature index')
plt.ylabel('Feature importance score')
plt.xticks(rotation=90)
plt.xticks(np.arange(0,len(df.columns)-2, 2.0))
plt.show()
EDIT (28-04-2022):
I read a paper titled Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization; quoting:
On the evaluate section, we fist extract the 80 traffic features from the dataset and clarify the best short feature set to detect each attack family using RandomForestRegressor algorithm. Afterwards, we examine the performance and accuracy of the selected features
with seven common machine learning algorithms.
Can anyone explain how this is done?click for picture from that paper

The decision trees are split into nodes that maximise information gain. Each split is based on the Gini index or entropy values. So the only way I think what you want to do can be achieved is by printing out the tree and examining it yourself visually, provided there are not too many nodes.
You can't say with certainty that one of your features is very important in discriminating against a certain class because suppose you have two classes, A and B. The feature that discriminates class A against class B is also discriminating class B against class A. So the importance of that feature is for both classes. In general, you can only get the overall feature importance not specific to any of your classes but the features that help get the work done.
Trees are highly unstable, and a slight change in your dataset will build an entirely new different tree from the first.
EDIT(28-04-2022):
The paper says they used Random-ForestRegressor, different from the decision tree you used. Random-ForestRegressor meant they had a regression task. The paper used the algorithm as a feature selection technique to reduce the 80 features. The few features selected (based on feature importance) were then used to train seven other different models. Using fewer features instead of the whole 80 will make the resulting models more elegant and less prone to overfitting.
It is important to know that Random forest is an ensemble method and has a lot of random happenings in the background such as bagging and bootstrapping. Feature importance is a form of model interpretation. It is difficult to interpret Ensemble algorithms the way you have described. Such a way would be too detailed. So, definitely, what they wrote in the paper is different from what you think.
Decision trees are a lot more interpretable. If you want to understand causality in your decision tree model, you can click here to see how the model can be converted into rules or as suggested earlier, observe the tree with your naked eyes.

Is it possible to fit() a scikit-learn model in a loop or with an iterator

Usually people use scikit-learn to train a model this way:
from sklearn.ensemble import GradientBoostingClassifier as gbc
clf = gbc()
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)
It works fine as long as users' memory is large enough to accommodate the entire dataset. The dilemma for me is exactly this--the dataset is too big for my memory. My current solution is to enlarge the virtual memory of my machine and I have already made the system extremely slow by having too much virtual memory--so I start to think whether or not is it possible to feed the fit() method with samples in batches like this (and the answre is no, please keep reading and stop reminding me that the answer is no):
clf = gbc()
for i in range(X_train.shape[0]):
clf.fit(X_train[i], y_train[i])
so that I can read the training set from hard drive only when needed. I read the sklearn's manual and it seems to me that it does not support this:
Calling fit() more than once will overwrite what was learned by any previous fit()
So, is this possible?

This do not work in scikit-learn as explained in the comment section as well as in the documentation. However you can use river ( which is a python package for online/streaming machine learning). This package should be well-suited for you problematic.
Below is an example of training a LinearRegression using river.
from river import datasets
from river import linear_model
from river import metrics
from river import preprocessing
dataset = datasets.TrumpApproval()
model = (
preprocessing.StandardScaler() |
linear_model.LinearRegression(intercept_lr=.1)
)
metric = metrics.MAE()
for x, y, in dataset:
y_pred = model.predict_one(x)
# Update the running metric with the prediction and ground truth value
metric.update(y, y_pred)
# Train the model with the new sample
model.learn_one(x, y)

It is not clear in your question is which steps in the machine learning are slow for you. As also noted in the manual for riverml and this post in sklearn there is an option to do a partial fit. You will be restricted in terms of the models you can use for this incremental learning.
So using your example lets say we use a stochastic gradient descent classifier:
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification
X,y = make_classification(100000)
clf = SGDClassifier(loss='log')
all_classes = list(set(y))
for ix in np.split(np.arange(0,X.shape[0]),100):
clf.partial_fit(X[ix,:],y[ix],classes = all_classes)

After reading the section 6. Strategies to scale computationally: bigger data of the official manual mentioned by #StupidWolf in this post, I am aware that this question is more to this than meets the eye.
The real difficulty is about the design of a lot of models.
Take Random Forest as an example, one of the most important techniques used to improve its performance compared with the simpler Decision Tree is the application of bagging, which means that the algorithm has to pick some random samples from the entire dataset to construct several weak learners as the basis of the Random Forest. It means that feeding the model with one sample after another won't work with this design.
Although it is still possible for scikit-learn to define an interface for end-users to implement so that scikit-learn can pick a random sample by calling this interface and end-users will decide how their implementation of the interface is about to return the needed data by scanning the dataset on the hard drive, it becomes way more complicated than I initially thought and the performance gain may not be very significant given that the IO-heavy "full table scan" (in database's term) is frequently needed.

How to improve my multiclass text-classification on German text?

I am new in NLP and it is a bit confusing me.
I am trying to do a text classification with SVC on my dataset.
I have an imbalanced dataset of 6 classes.
The text is news for classes of health, sport, culture, economy, science and web.
I am using TF-IDF for vectorization.
the preprocessing steps: lower-case all the texts and to remove the stop-words. since my text is in German I did not use lemmatization
my first try:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2, random_state=42)
X_train = train['text']
y_train = train['category']
X_test = test['text']
y_test = test['category']
# Linear SVC:
text_clf_lsvc = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC()),])
predictions = text_clf_lsvc.predict(X_test)
my metrci acuuracy score was: 93%
then I decided to reduce the dimensionality: so on my 2nd try I added TruncatedSVD
# Linear SVC:
text_clf_lsvc = Pipeline([('tfidf', TfidfVectorizer()),('svd', TruncatedSVD()),
('clf', LinearSVC()),])
predictions = text_clf_lsvc.predict(X_test)
my metrci acuuracy score dropped to 34%.
my questions:
1- How can I improve my model if I want to stick to TF-IDF and SVC for classification
2- What can I do other than that if I want to have a good classification

The best way to improve accuracy, given that you want to stick with this configuration is through hyperparameter tuning, or by introducing additional components, such as feature selection.
Hyperparameter tuning
Most machine learning algorithms and parts of a machine learning pipeline have several parameters you can change. For example, the TfidfVectorizer has different ngram ranges, different analysis levels, different tokenizers, and many more parameters to vary. Most of these will affect your performance. So, what you can do is systematically vary these parameters (and those of your SVC), while monitoring you accuracy on a development set (i.e., not the test data!). Instead of fixed development set, cross-validation is typically used in these kinds of settings.
The best way to do this in sklearn is through a RandomizedSearchCV (see here for details). This class automatically cross-validates and searches through the possible options you pre-specify by randomly sampling from the option set for a fixed number of iterations. By applying this technique on your training data, you will automatically find models that perform better for your given training data and your options. Ideally, these models would also perform better on your test data. Fair warning: cross-validated search techniques can take a while to run.
Feature Selection
In addition to grid search, another way to improve performance is through feature selection. Feature selection typically consists of a statistical test that determines which features explain variance in the typical task you are trying to solve. The feature selection methods in sklearn are detailed here.
By far the most important bit here is that the performance of anything you add to your model should be verified on an independent development set, or in cross-validation. Leave your test data alone.

What exactly does `eli5.show_weights` display for a classification model?

I used eli5 to apply the permutation procedure for feature importance. In the documentation, there is some explanation and a small example but it is not clear.
I am using a sklearn SVC model for a classification problem.
My question is: Are these weights the change (decrease/increase) of the accuracy when the specific feature is shuffled OR is it the SVC weights of these features?
In this medium article, the author states that these values show the reduction in model performance by the reshuffle of that feature. But not sure if that's indeed the case.
Small example:
from sklearn import datasets
import eli5
from eli5.sklearn import PermutationImportance
from sklearn.svm import SVC, SVR
# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target
clf = SVC(kernel='linear')
perms = PermutationImportance(clf, n_iter=1000, cv=10, scoring='accuracy').fit(X, y)
print(perms.feature_importances_)
print(perms.feature_importances_std_)
[0.38117333 0.16214 ]
[0.1349115 0.11182505]
eli5.show_weights(perms)

I did some deep research.
After going through the source code here is what I believe for the case where cv is used and is not prefit or None. I use a K-Folds scheme for my application. I also use a SVC model thus, score is the accuracy in this case.
By looking at the fit method of thePermutationImportance object, the _cv_scores_importances are computed (https://github.com/TeamHG-Memex/eli5/blob/master/eli5/sklearn/permutation_importance.py#L202). The specified cross-validation scheme is used and the base_scores, feature_importances are returned using the test data (function: _get_score_importances inside _cv_scores_importances).
By looking at get_score_importances function (https://github.com/TeamHG-Memex/eli5/blob/master/eli5/permutation_importance.py#L55), we can see that base_score is the score on the non shuffled data and feature_importances (called differently there as: scores_decreases) are defined as non shuffled score - shuffled score (see https://github.com/TeamHG-Memex/eli5/blob/master/eli5/permutation_importance.py#L93)
Finally, the errors (feature_importances_std_) are the SD of the above feature_importances (https://github.com/TeamHG-Memex/eli5/blob/master/eli5/sklearn/permutation_importance.py#L209) and the feature_importances_ is the mean of the above feature_importances (non-shuffled score minus (-) shuffled score).

A fair bit shorter answer to your original question, regardless of the setting for the cv parameter, eli5 will calculate the average decrease in the scorer you provide. Because you're using the sklearn wrapper, the scorer will come from scikit-learn: in your case accuracy. Overall as a word on the package, some of these details are particularly difficult to figure out without going into the deeper into the source code, might be worth trying to submit a pull request to make the documentation more detailed where possible.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.