I'm attempting to interpret my OneClassSVM model, but the computation time is very high. I have used cross-validation with 36 folds, so I want to combine the results of all the folds in one SHAP plot so I can fully interpret which features contribute most to the model.
So far I have found that taking a sample of the data to be explained does speed things up, but it will still take around 8 hours for one fold, and there are 36 folds.
Note that my train set is ~2400 rows and my test set is ~1400 rows, each with 88 features.
import shap
from sklearn.svm import OneClassSVM
import numpy as np
# Arrays where each element is a DataFrame of the selected train/test data for a fold
shap_train = np.load('shap_train.npy', allow_pickle=True)
shap_test = np.load('shap_test.npy', allow_pickle=True)
clf = OneClassSVM(nu=0.35)
folds = len(shap_train)
shap_values = []
shap_data_test = []
for fold in range(folds):
    # Fit once per fold and explain the fitted model's predictions;
    # passing clf.fit_predict would refit the SVM on every SHAP evaluation
    clf.fit(shap_train[fold])
    explainer = shap.Explainer(clf.predict, shap_train[fold])
    # Sampling 1/3 of the test data to reduce computation time
    data = shap_test[fold].sample(frac=1/3)
    shap_values.append(explainer(data))
    shap_data_test.append(data)

# Storing SHAP values for plots later
np.save('shap_data.npy', np.array(shap_values))
np.save('shap_data_test.npy', np.array(shap_data_test))
I've questioned my methodology of needing to produce SHAP values for all the folds, but I know that some folds perform better than others, so I want an overall view of which features contribute most.
I deployed this script on a Debian server with an Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz and 64GB RAM.
After further reading and advice, I found that there's no real way to speed up this computation: explaining a large sample will always take a long time.
Instead, as Sergey Bushmanov mentioned, a small sample works well enough. In my case I used a sample of 200 data points per fold, which still took some time but far less than before, while still showing the contribution of each feature.
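Concretely, the only change needed in the loop from the question is the sampling line (random_state is just to make the sample reproducible):
    data = shap_test[fold].sample(n=200, random_state=42)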
I am using LinearDiscriminantAnalysis from scikit-learn to perform class prediction on a dataset. The problem is that the performance of LDA is not consistent. In general, when I increase the number of features used for training, I expect performance to increase as well; the same goes for the number of samples, the more samples the better LDA should perform.
However, in my case there seems to be a regime where LDA performs poorly, depending on the number of samples and features used for training. More precisely, LDA performs poorly when the number of features equals the number of samples. I don't think this is specific to my dataset. I'm not sure exactly what the issue is, but I have extensive example code that can recreate these results.
Here is an image of the LDA performance results that I am talking about.
The dataset I use has shape 400 x 75 x 400 (trials x time x features), where the trials are the different samples. Each time, I shuffle the trial indices of the dataset, pick trials for the train set and similarly for the test set, then take the mean across time (the second axis) and feed the resulting (trials x features) matrix into the LDA to compute the score on the test set. The test set is always 50 trials.
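A minimal sketch of that procedure (the array and label names are placeholders, and random data stands in for the real dataset):
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
data = rng.standard_normal((400, 75, 400))  # trials x time x features (placeholder)
labels = rng.integers(0, 2, size=400)       # placeholder class labels

idx = rng.permutation(len(data))            # shuffle trial indices
train_idx, test_idx = idx[:-50], idx[-50:]  # test set is always 50 trials

n_features = 100                            # number of features used in this run
X_train = data[train_idx][:, :, :n_features].mean(axis=1)  # mean across time
X_test = data[test_idx][:, :, :n_features].mean(axis=1)

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, labels[train_idx])
print(lda.score(X_test, labels[test_idx]))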
A detailed Jupyter notebook with comments and the data I use can be found here: https://github.com/esigalas/LDA-test. In my environment I use sklearn 1.1.1 and numpy 1.22.4.
I'm not sure if there is an issue with LDA itself (which would be worth opening an issue on GitHub) or with how I handle the dataset, but this behavior of LDA looks wrong.
Any comment/help is welcome. Thanks in advance!
I am training a deep learning model using Mask R-CNN from the following git repository: matterport/Mask_RCNN. I rely on heavy augmentation of my dataset (original dataset: 59 images of 1988x1355x3, each with > 80 annotations), which I store locally (necessary to evaluate type/degree of augmentation against validation metrics). The augmented dataset contains 6000 images. The images vary in their x and y dimensions because of resolution reduction and affine transformations; I assume the differing x,y dimensions will not affect the final results.
However, my Python kernel crashes whenever I load more than 'X' images to train the model.
Hence, I came up with the idea of splitting the dataset into sub-datasets and iterating through them, using the 'last' trained weights as the starting point for each new round. But I am not sure whether the results will be the same (read: the same, taking the stochastic nature of 'stochastic gradient descent' into account)?
I also wonder whether the results would be the same if I don't iterate through the sub-datasets per epoch, but instead train Y epochs on each (e.g. 20 for 'heads' only, 10 for 'all layers')?
Yet I am sure this is not the most efficient way of solving this issue. Ideas for improvement are welcome.
Note that I am not using keras.preprocessing.image.ImageDataGenerator(); as I understand it, it randomly generates data and feeds it to the model, replacing the input for the epoch, whereas I would like to feed the whole dataset to the model.
I came up with the idea of splitting the dataset into sub-datasets and iterating through them, using the 'last' trained weights as the starting point for each new round. But I am not sure if the results will be the same?
You are doing the same thing ImageDataGenerator does (creating your own mini-batches), but less optimally. The results will be the same with respect to what?
If you mean with respect to a model trained with all the data in a single batch: most probably not, as a smaller batch means slower convergence. But this can be solved by training for more epochs.
Another issue is reproducibility. If you want to reproduce your model with the same results each time, just use seeds.
import random
random.seed(1)
import numpy as np
np.random.seed(1)
import tensorflow
tensorflow.random.set_seed(1)
Another concept is gradient accumulation. It will let you train with a high effective batch size without keeping too many images in memory at a time; a rough sketch follows the link below.
https://github.com/CyberZHG/keras-gradient-accumulation
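For illustration, a minimal sketch of the idea in plain TensorFlow 2 (this is not the API of the linked package; model, optimizer, loss_fn, and batches are hypothetical placeholders):
import tensorflow as tf

def accumulated_step(model, optimizer, loss_fn, batches, accum_steps=4):
    # Sum gradients over several small batches, then apply them once,
    # approximating a single update with an accum_steps-times larger batch
    accum = [tf.zeros_like(v) for v in model.trainable_variables]
    for x, y in batches:  # `batches` yields accum_steps small (x, y) batches
        with tf.GradientTape() as tape:
            loss = loss_fn(y, model(x, training=True)) / accum_steps
        grads = tape.gradient(loss, model.trainable_variables)
        accum = [a + g for a, g in zip(accum, grads)]
    optimizer.apply_gradients(zip(accum, model.trainable_variables))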
Finally, keras.preprocessing.image.ImageDataGenerator() in fact trains on the whole dataset; it just chooses a random sample at each step (you're doing the same thing with your sub-datasets).
You can seed the ImageDataGenerator so it is reproducible and not entirely random.
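For example (assuming hypothetical x and y arrays already loaded in memory):
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rotation_range=20, horizontal_flip=True)
# The seed fixes the sequence of random augmentations, making runs reproducible
batches = datagen.flow(x, y, batch_size=32, seed=1)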
Lately I've been trying to fit a regularized logistic regression on vectorized text data. I first tried sklearn and had no problem, but then I discovered that I can't do inference through sklearn, so I tried to switch to statsmodels. The problem is, when I try to fit the logit, it keeps running forever and uses about 95% of my RAM (tried on both 8GB and 16GB RAM computers).
My first guess was that it had to do with dimensionality, because I was working with a 2960 x 43k matrix. So, to reduce it, I deleted bigrams and took a sample of only 100 observations, which leaves me with a 100 x 6984 matrix that, I think, shouldn't be too problematic.
This is a little sample of my code:
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.feature_extraction.text import CountVectorizer

# sss was defined earlier; shown here with assumed settings for completeness
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2)
for train_index, test_index in sss.split(df_modelo.Cuerpo, df_modelo.Dummy_genero):
    X_train, X_test = df_modelo.Cuerpo[train_index], df_modelo.Cuerpo[test_index]
    y_train, y_test = df_modelo.Dummy_genero[train_index], df_modelo.Dummy_genero[test_index]

cvectorizer = CountVectorizer(max_df=0.97, min_df=3, ngram_range=(1, 1))
vec = cvectorizer.fit(X_train)
X_train_vectorized = vec.transform(X_train)
This gets me a train and a test set, and then vectorizes text from X_train.
Then I try:
import statsmodels.api as sm
logit=sm.Logit(y_train.values,X_train_vectorized.todense())
result=logit.fit_regularized(method='l1')
Everything works fine until the result line, which keeps running forever. Is there something I can do? Should I switch to R if I'm looking for statistical inference?
Thanks in advance!
Almost all of statsmodels, and all of its inference, is designed for the case where the number of observations is much larger than the number of features.
Logit.fit_regularized uses an interior-point algorithm with scipy optimizers, which needs to keep all features in memory. Inference for the parameters requires the covariance of the parameter estimate, which has shape n_features by n_features. The use case it was designed for is when the number of features is relatively small compared to the number of observations, and the Hessian can be held in memory.
GLM.fit_regularized estimates elastic-net penalized parameters using coordinate descent. This can possibly handle a large number of features, but it does not have any inferential results available.
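For example, a minimal sketch of that route (reusing y_train and X_train_vectorized from the question; the alpha value is illustrative):
import statsmodels.api as sm

glm = sm.GLM(y_train.values, X_train_vectorized.toarray(),
             family=sm.families.Binomial())
# L1_wt=1.0 gives a pure L1 (lasso) penalty; alpha sets its strength
result = glm.fit_regularized(method='elastic_net', alpha=0.1, L1_wt=1.0)
print(result.params)  # penalized coefficients only; no standard errors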
Inference after lasso and similar penalization that selects variables has only become available in recent research. See, for example, selective inference in Python, https://github.com/selective-inference/Python-software, for which an R package is also available.
I can only make 2-3 predictions per second with this model, which is super slow. When using a LinearRegression model I can easily achieve a 40x speedup.
I'm using the scikit-learn Python package with a very simple dataset containing 3 columns (day, hour, and result), so basically 2 features.
day and hour are categorical variables.
Naturally there are 7 day and 24 hour categories.
Training sample is relatively small (cca 5000 samples).
It takes just a few seconds to train.
But when I go on to predict something, it's very slow.
So my question is: is this a fundamental characteristic of RandomForestRegressor, or can I actually do something about it?
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100,
                              max_features='auto',
                              oob_score=True,
                              n_jobs=-1,
                              random_state=42,
                              min_samples_leaf=2)
Here are some steps to optimize a RandomForest with sklearn (a combined sketch follows the list):
Do batch predictions by passing multiple datapoints to predict(). This reduces Python overhead.
Reduce the depth of the trees. Use something like min_samples_leaf or min_samples_split to avoid lots of small decision nodes. To require 5% of the training set per leaf, pass 0.05.
Reduce the number of trees. With somewhat pruned trees, RF can often perform OK with as little as n_estimators=10.
Use an optimized RF inference implementation like emtrees. This is the last thing to try, and it depends on the prior steps performing well.
The performance of the optimized model must be validated, using cross-validation or similar. Steps 2 and 3 are related, so one can do a grid search to find the combination that best preserves model performance.
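For example, a sketch combining steps 1-3 (the parameter values are illustrative, and X_train, y_train, X_test are assumed to exist):
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=10,        # fewer trees (step 3)
                              min_samples_leaf=0.05,  # each leaf holds >= 5% of samples (step 2)
                              n_jobs=-1,
                              random_state=42)
model.fit(X_train, y_train)

# One vectorized call instead of thousands of single-row calls (step 1)
preds = model.predict(X_test)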
I am using the scikit-learn OneClassSVM classifier to detect outliers in lines of text. The text is first converted to numerical features using bag of words and TF-IDF.
When I train (fit) the classifier on my computer, the time seems to increase exponentially with the number of items in the training set:
Number of items in the training data and training time taken:
10K: 1 sec, 15K: 2 sec, 20K: 8 sec, 25K: 12 sec, 30K: 16 sec, 45K: 44 sec.
Is there anything I can do to reduce the training time and keep it from becoming prohibitively long when the training data grows to a couple of hundred thousand items?
Well, scikit's SVM is a high-level implementation, so there is only so much you can do. In terms of speed, from their website: "SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation."
You can increase the kernel cache size parameter (cache_size) based on your available RAM, but the gain is limited.
You can try changing your kernel, though then your model might be mis-specified.
Here is some advice from http://scikit-learn.org/stable/modules/svm.html#tips-on-practical-use: Scale your data.
Otherwise, don't use scikit and implement it yourself using neural nets.
Hope I'm not too late. OCSVM, like SVM generally, is resource-hungry, and the relationship between training-set size and time is quadratic (the numbers you show follow this). If you can, see whether Isolation Forest or Local Outlier Factor work for you. If you're considering applying this to a lengthier dataset, I would suggest creating a manual anomaly-detection model that closely resembles the context of these off-the-shelf solutions; that way you should be able to work either in parallel or with threads.
For anyone coming here from Google, sklearn has implemented SGDOneClassSVM, which "has a linear complexity in the number of training samples". It should be faster for large datasets.
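For example, a minimal sketch pairing it with a kernel approximation, as the sklearn docs suggest for non-linear problems (X_tfidf stands in for the vectorized text from the question; parameter values are illustrative):
from sklearn.linear_model import SGDOneClassSVM
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline

# Nystroem approximates an RBF kernel so the linear SGD solver can
# stand in for the kernelized OneClassSVM
clf = make_pipeline(Nystroem(gamma=0.1, n_components=100),
                    SGDOneClassSVM(nu=0.05))
clf.fit(X_tfidf)
labels = clf.predict(X_tfidf)  # +1 for inliers, -1 for outliers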