How many features can scikit-learn handle? - python

I have a CSV file of size [66k, 56k] (rows, columns). It's a sparse matrix. I know that numpy can handle a matrix of that size. I would like to know, based on everyone's experience, how many features scikit-learn algorithms can handle comfortably?

Depends on the estimator. At that size, linear models still perform well, while SVMs will probably take forever to train (and forget about random forests since they won't handle sparse matrices).
I've personally used LinearSVC, LogisticRegression and SGDClassifier with sparse matrices of size roughly 300k × 3.3 million without any trouble. See @amueller's scikit-learn cheat sheet for picking the right estimator for the job at hand.
Full disclosure: I'm a scikit-learn core developer.
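For reference, here is a minimal sketch of fitting a linear model on a scipy sparse matrix of roughly the shape mentioned in the question; the random sparse X and binary labels y below are placeholders for the real data.

import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = sp.random(66_000, 56_000, density=1e-4, format="csr", random_state=0)  # placeholder sparse data
y = rng.integers(0, 2, size=66_000)                                        # placeholder binary labels

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)            # linear models accept CSR sparse input directly
print(clf.score(X, y))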

A linear model (regression, SGD, Bayes) will probably be your best bet if you need to train your model frequently.
Before you go running any models, though, you could try the following:
1) Feature reduction. Are there features in your data that could easily be removed? For example, if your data is text- or ratings-based, there are many well-known options available.
2) Learning curve analysis. Maybe you only need a small subset of your data to train a model, and beyond that you are only overfitting or gaining tiny increases in accuracy (see the sketch after this list).
Both approaches could allow you to greatly reduce the training data required.
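A minimal sketch of such a learning-curve check, assuming a plain LogisticRegression and synthetic placeholder data: if the validation score plateaus well before all samples are used, a subset of the data may be enough.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(0)
X = rng.standard_normal((5_000, 20))     # placeholder features
y = rng.integers(0, 2, size=5_000)       # placeholder labels

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
)
for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n} samples -> mean CV score {score:.3f}")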

Related

K-Means applied on a dataframe with too many features and zeros

I'm trying to cluster a dataframe with 36 features and a lot (88%) of zeros. It's my first ML job; I started with K-Means, but for any K I choose, 99.5% of my data remains in cluster 0. I've tried some PCA to reduce the features, but the same problem appeared.
Any thoughts on approaches I can try?
Have you tried techniques such as sequential feature selection? These are so-called 'wrapper methods' where you add (for forward selection) or eliminate (for backward elimination) one feature at a time and assess the performance of the model accordingly. I tend to use supervised learning models in my job but I believe you can use sequential selection algorithms to assess unsupervised models as well. I have used the sklearn library for this: https://scikit-learn.org/stable/modules/feature_selection.html
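As a rough illustration of the forward-selection idea, here is a sketch using scikit-learn's SequentialFeatureSelector. The LogisticRegression wrapper estimator and the synthetic labels are assumptions made for the example, since the original problem is unsupervised.

import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((500, 36))               # 36 features, as in the question (placeholder values)
y = rng.integers(0, 2, size=500)        # hypothetical labels for the supervised wrapper

sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=10,            # keep the 10 most useful features
    direction="forward",                # add one feature at a time
    cv=5,
)
sfs.fit(X, y)
print(sfs.get_support(indices=True))    # indices of the selected features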

Unexpected behavior with LinearDiscriminantAnalysis of scikit-learn

I am using LinearDiscriminantAnalysis of scikit-learn to perform class prediction on a dataset. The problem seems to be that the performance of LDA is not consistent. In general, when I increase the number of features that I use for training, I expect the performance to increase as well. Same thing with the number of samples that I use for training, the more samples the better the performance of LDA.
However, in my case, there seems to be a sweet spot where the LDA performs poorly depending on the number of samples and features used for training. More precisely, LDA performs poorly when the number of features equals the number of samples. I don't think it has to do with my dataset. I am not sure exactly what the issue is here, but I have extensive example code that can recreate these results.
Here is an image of the LDA performance results that I am talking about.
The dataset I use has shape 400 X 75 X 400 (trials X time X features). Here the trials represent the different samples. Each time I shuffle the trial indices of the dataset. I create the train set by picking the trials for training and similarly for the test set. Finally, I calculate the mean across time (second axis) and insert the final matrix with shape (trials X features) as input in the LDA to compute the score on the test set. The test set is always of size 50 trials.
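For clarity, here is a sketch of the train/test procedure described above, with random data standing in for the real 400 × 75 × 400 array and hypothetical binary labels.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
data = rng.standard_normal((400, 75, 400))   # trials x time x features (placeholder)
labels = rng.integers(0, 2, size=400)        # hypothetical class labels

trials = rng.permutation(400)                # shuffle the trial indices
train_idx, test_idx = trials[:-50], trials[-50:]   # test set is always 50 trials

X_train = data[train_idx].mean(axis=1)       # average over the time axis
X_test = data[test_idx].mean(axis=1)

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, labels[train_idx])
print(lda.score(X_test, labels[test_idx]))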
A detailed jupyter notebook with comments and the data I use can be found here https://github.com/esigalas/LDA-test. In my environment I use
sklearn: 1.1.1,
numpy: 1.22.4.
Not sure if there is an issue with LDA itself (that would be worth opening an issue on GitHub) or something wrong with how I handle the dataset, but this behavior of LDA looks wrong.
Any comment/help is welcome. Thanks in advance!

Too Much Data for SVM?

So I'm running an SVM classifier (with a linear kernel and probability=False) from sklearn on a dataframe with about 120 features and 10,000 observations. The program takes hours to run and keeps crashing due to exceeding computational limits. Just wondering if this dataframe is perhaps too large?
In short, no, this is not too big at all. A linear SVM can scale much further; it is the libSVM-based SVC that cannot. The good news is that even in scikit-learn you have a large-scale linear SVM implementation, LinearSVC, which is based on liblinear. You can also solve it using SGD (also available in scikit-learn), which will converge for much bigger datasets as well.
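A minimal sketch, assuming a dense feature matrix of the stated size (10,000 × 120) and binary labels; both are placeholders for the actual dataframe.

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 120))   # placeholder features
y = rng.integers(0, 2, size=10_000)      # placeholder labels

# liblinear-based linear SVM; trains in seconds at this size rather than hours
clf = LinearSVC(dual=False)              # dual=False is usually faster when n_samples > n_features
clf.fit(X, y)
print(clf.score(X, y))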
You could try changing the parameters for the algorithm.
Tips on practical use from the documentation.
You could try a different algorithm, here's a cheat sheet you might find helpful:
The implementation is based on libsvm. The fit time complexity is more than quadratic with the number of samples which makes it hard to scale to datasets with more than a couple of 10000 samples.
The official scikit-learn documentation on SVMs says the threshold is around 10,000 samples, so SGD could be a better option to try.
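As a rough sketch of that SGD alternative (synthetic placeholder data, hinge loss so it behaves like a linear SVM):

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 120))   # placeholder features, well past the 10,000-sample threshold
y = rng.integers(0, 2, size=100_000)      # placeholder labels

clf = SGDClassifier(loss="hinge", max_iter=20, tol=1e-3)   # hinge loss ~ linear SVM objective
clf.fit(X, y)
print(clf.score(X, y))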

sklearn linear regression for large data

Does sklearn.LinearRegression support online/incremental learning?
I have 100 groups of data, and I am trying to fit them all together. For each group there are over 10,000 instances and ~10 features, so constructing one huge matrix (10^6 by 10) leads to a memory error with sklearn. It would be nice if I could update the regressor each time with a batch of samples from a new group.
I found this post relevant, but the accepted solution works for online learning with a single new data point (only one instance) rather than batches of samples.
Take a look at linear_model.SGDRegressor; it learns a linear model using stochastic gradient descent.
In general, sklearn has many models that support partial_fit; they are all pretty useful on medium-to-large datasets that don't fit in RAM.
Not all algorithms can learn incrementally, without seeing all of the instances at once that is. That said, all estimators implementing the partial_fit API are candidates for the mini-batch learning, also known as "online learning".
Here is an article that goes over scaling strategies for incremental learning. For your purposes, have a look at the sklearn.linear_model.SGDRegressor class. It is truly online so the memory and convergence rate are not affected by the batch size.
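A minimal sketch of that mini-batch pattern, assuming the 100 groups are processed one at a time; the synthetic arrays generated inside the loop stand in for each group's ~10,000 × 10 data.

import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
true_coef = rng.standard_normal(10)          # only used to fabricate placeholder targets

reg = SGDRegressor()
for group in range(100):
    # stand-in for loading one group: ~10,000 instances, 10 features
    X_group = rng.standard_normal((10_000, 10))
    y_group = X_group @ true_coef + 0.1 * rng.standard_normal(10_000)
    reg.partial_fit(X_group, y_group)        # updates the model with this batch only

print(reg.coef_)                             # final coefficients after seeing all groups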

Improving SVC prediction performance on single samples

I have large-ish SVC models (~50Mb cPickles) for text classification and I am trying out various ways to use them in a production environment. Classifying batches of documents works very well (about 1k documents per minute using both predict and predict_proba).
However, prediction on a single document is another story, as explained in a comment to this question:
Are you doing predictions in batches? The SVC.predict method, unfortunately, incurs a lot of overhead because it has to reconstruct a LibSVM data structure similar to the one that the training algorithm produced, shallow-copy in the support vectors, and convert the test samples to a LibSVM format that may be different from the NumPy/SciPy formats. Therefore, prediction on a single sample is bound to be slow. – larsmans
I am already serving the SVC models as Flask web-applications, so a part of the overhead is gone (unpickling) but the prediction times for single docs are still on the high side (0.25s).
I have looked at the code in the predict methods but cannot figure out if there is a way to "pre-warm" them, reconstructing the LibSVM data structure in advance at server startup... any ideas?
def predict(self, X):
    """Perform classification on samples in X.

    For an one-class model, +1 or -1 is returned.

    Parameters
    ----------
    X : {array-like, sparse matrix}, shape = [n_samples, n_features]

    Returns
    -------
    y_pred : array, shape = [n_samples]
        Class labels for samples in X.
    """
    y = super(BaseSVC, self).predict(X)
    return self.classes_.take(y.astype(np.int))
I can see three possible solutions.
Custom server
It is not a matter of "warming" anything up. Simply put, libSVM is a C library, and you need to pack/unpack data into the correct format. This process is more efficient on whole matrices than on each row separately. The only way to overcome this would be to write a more efficient wrapper between your production environment and libSVM (you could write a libsvm-based server that uses some kind of shared memory with your service). Unfortunately, this is too custom a problem to be solved by existing implementations.
Batches
A naive approach like buffering the queries is an option (if it is a "high performance" system with thousands of queries, you can simply store them in N-element batches and send them to libSVM in such packs).
Own classification
Lastly, classification with an SVM is a really simple task. You don't need libSVM to perform classification; only training is a complex problem. Once you have all the support vectors (SV_i), the kernel (K), the Lagrangian multipliers (alpha_i) and the intercept term (b), you classify using:
cl(x) = sgn( SUM_i y_i alpha_i K(SV_i, x) + b)
You can code this operation directly in your app, without the need to actually pack/unpack/send anything to libsvm. This can speed things up by an order of magnitude. Obviously, probabilities are more complex to retrieve, as they require Platt scaling, but it is still possible.
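Here is a sketch of that decision rule using the attributes a fitted SVC exposes (support_vectors_, dual_coef_, which already holds y_i * alpha_i, and intercept_). The toy RBF-kernel data is synthetic, and note that for binary problems some scikit-learn versions apply a sign flip between this raw libSVM decision value and decision_function.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = SVC(kernel="rbf", gamma=0.5).fit(X, y)

def manual_decision(clf, x):
    # K(SV_i, x) for the RBF kernel used above
    k = np.exp(-clf.gamma * np.sum((clf.support_vectors_ - x) ** 2, axis=1))
    # dual_coef_ holds y_i * alpha_i; intercept_ is b
    return (clf.dual_coef_ @ k + clf.intercept_)[0]

x_new = rng.standard_normal(5)
print(manual_decision(clf, x_new), clf.decision_function(x_new[None])[0])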
You can't construct the LibSVM data structure in advance. When a request to classify a document arrives, you get the text of the document, make a vector out of it, and only then convert it to LibSVM format so you can get a decision.
LinearSVC should be considerably faster than an SVC with a linear kernel, as it uses liblinear. You could try using a different classifier if that does not decrease performance too much.
