I'm trying to do cross-validation with scikit-learn, and I'm running into some memory issues that are hard to figure out.
Basically, I've found that when I increase the number of hyperparameters searched, or when I increase the number of cross-validation loops for a GridSearchCV object, I get a nearly linear increase in memory consumption. This makes for dangerously high memory consumption if I use large enough matrices.
Here's a little gist of what I'm talking about:
http://nbviewer.ipython.org/gist/choldgraf/6a7be7866f2a3a3d3f98
Does anyone know why this might be happening? It seems that GridSearchCV basically just loops through the cv object and the model parameter options in a list-comprehension style. It doesn't seem like that should increase memory usage...
UPDATE: After looking into this a bit more, it turns out that the problem isn't with GridSearchCV, but rather with a few of the solvers in Ridge (I have updated the gist accordingly). It may be a problem with the scipy linear algebra libraries, see issue here
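For reference, here is a minimal sketch of the kind of setup that showed the growth (this is not the notebook from the gist; the data shape and parameter grid are just illustrative):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real data
X = np.random.randn(5000, 500)
y = np.random.randn(5000)

# More parameter combinations and more CV folds mean more fits,
# and memory seemed to climb roughly linearly with that number
param_grid = {
    "alpha": np.logspace(-3, 3, 10),
    "solver": ["svd", "cholesky", "lsqr"],
}
search = GridSearchCV(Ridge(), param_grid, cv=5)
search.fit(X, y)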
Related
Fitting a Random Forest regressor takes up all the RAM, which causes the online hosted notebook environment (Google Colab or Kaggle kernel) to crash. Could you help me optimize the model?
I have already tried tuning the hyperparameters, e.g. reducing the number of estimators, but it doesn't help. df.info() shows 4,446,965 records for the training data, which takes up ~1 GB of memory.
I can't post the whole notebook code here as it would be too long, but please check this link for reference. I've provided some information below about the dataframe used for training.
from sklearn.ensemble import RandomForestRegressor

clf = RandomForestRegressor(n_estimators=100, min_samples_leaf=2, min_samples_split=3, max_features=0.5, n_jobs=-1)
clf.fit(train_X, train_y)
pred = clf.predict(val_X)
train_X.info() shows 3,557,572 records taking up almost 542 MB of memory.
I'm still getting started with ML and any help would be appreciated. Thank you!
Random Forest by nature puts a heavy load on the CPU and RAM, and that is one of its well-known drawbacks! So there is nothing unusual in your situation.
More specifically, several factors contribute to this issue, to name a few:
The Number of Attributes (features) in Dataset.
The Number of Trees (n_estimators).
The Maximum Depth of the Tree (max_depth).
The Minimum Number of Samples required to be at a Leaf Node (min_samples_leaf).
Moreover, Scikit-learn clearly states this issue in its documentation, and I am quoting here:
The default values for the parameters controlling the size of the
trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown
and unpruned trees which can potentially be very large on some data
sets. To reduce memory consumption, the complexity and size of the
trees should be controlled by setting those parameter values.
What to Do?
There is not much you can do, especially since Scikit-learn does not offer an option to manage the storage issue on the fly (as far as I am aware).
Rather, you need to change the values of the above-mentioned parameters, for example:
Try to keep the most important features only if the number of features is already high (see Feature Selection in Scikit-learn and Feature importances with forests of trees).
Try to reduce the number of estimators.
max_depth is None by default, which means the nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
min_samples_leaf is 1 by default: A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
So try to change the parameters while understanding their effects on performance; the reference you need is this.
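To make that concrete, here is a rough sketch of reining in tree size; the exact values are illustrative rather than recommendations, and train_X/train_y are assumed to be the training data from the question:

from sklearn.ensemble import RandomForestRegressor

clf = RandomForestRegressor(
    n_estimators=40,       # fewer trees than the original 100
    max_depth=15,          # stop trees from growing fully (the default None grows them out)
    min_samples_leaf=5,    # larger leaves -> shallower, smaller trees
    min_samples_split=10,
    max_features=0.5,
    n_jobs=None,           # see the side note below about n_jobs=-1
)
clf.fit(train_X, train_y)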
The final option you have is to create your own customized Random Forest from scratch and load the metadata to hard disk, etc., or do any other optimization. It is awkward, but just to mention the option, here is an example of a basic implementation!
Side-Note:
In practice, on my Core i7 laptop, I found that setting the parameter n_jobs to -1 overwhelms the machine; I always find it more efficient to keep the default setting n_jobs=None, although theoretically speaking it should be the opposite!
So I'm running an SVM classifier (with a linear kernel and probability=False) from sklearn on a dataframe with about 120 features and 10,000 observations. The program takes hours to run and keeps crashing because it exceeds the computational limits. Just wondering if this dataframe is perhaps too large?
In short, no, this is not too big at all. Linear SVM can scale much further. The libsvm-based SVC, on the other hand, cannot. The good thing is that even in scikit-learn you have a large-scale SVM implementation, LinearSVC, which is based on liblinear. You can also solve it using SGD (also available in scikit-learn), which will converge for much bigger datasets as well.
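As a rough sketch (assuming X and y hold your 10,000 x 120 features and labels; the hyperparameters are illustrative):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# liblinear-based linear SVM; scales much better than SVC(kernel='linear')
clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=10000))
clf.fit(X, y)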
You could try changing the parameters for the algorithm.
Tips on practical use from the documentation.
You could try a different algorithm; here's a cheat sheet you might find helpful:
The implementation is based on libsvm. The fit time complexity is more
than quadratic with the number of samples which makes it hard to scale
to dataset with more than a couple of 10000 samples.
The official sklearn SVM documentation says the threshold is 10,000 samples, so SGD could be a better option to try.
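A sketch of that SGD route, using SGDClassifier with hinge loss (a linear SVM trained by stochastic gradient descent); X and y are assumed to be your feature matrix and labels, and the hyperparameters are illustrative:

from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hinge loss makes this a linear SVM trained incrementally with SGD
clf = make_pipeline(
    StandardScaler(),
    SGDClassifier(loss="hinge", alpha=1e-4, max_iter=1000, tol=1e-3),
)
clf.fit(X, y)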
I have a billion feature vectors and I would like to put them into approximate clusters. Looking at the methods from http://scikit-learn.org/stable/modules/clustering.html#clustering for example it is not at all clear to me how their running time scales with the data size (except for Affinity Propagation which is clearly too slow).
What methods are suitable for clustering such a large data set? I assume any method will have to run in O(n) time.
The K-means complexity sounds reasonable for your data (only 4 components). The tricky part is the initialization and the choice of number of clusters. You can try different random initialization but this can be time consuming. An alternative is to sub-sample your data and run a more expensive clustering algorithm like Affinity Propagation. Then use the solution as init for k-means and run it with all your data.
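A rough sketch of that idea (the subsample size is arbitrary and X is assumed to be your full data matrix):

import numpy as np
from sklearn.cluster import AffinityPropagation, KMeans

# Cluster a small random subsample with the expensive algorithm
rng = np.random.default_rng(0)
subsample = X[rng.choice(X.shape[0], size=10_000, replace=False)]
ap = AffinityPropagation().fit(subsample)
centers = ap.cluster_centers_

# Use the exemplars found above to initialize k-means on the full data
km = KMeans(n_clusters=centers.shape[0], init=centers, n_init=1)
labels = km.fit_predict(X)

For data at that scale, MiniBatchKMeans (also in sklearn.cluster) is worth a look too, since it processes the data in small batches instead of all at once.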
For a billion feature vectors I'd be dubious of using K-means on its own. I'm sure you could do it, but it would take a long time and would thus be difficult to debug. I recommend using Canopy Clustering first and then applying K-means to reduce the complexity and computation. These sub-clusters could then be reduced further with a MapReduce implementation to make it even faster.
I'm trying to fit a regression model with an L1 penalty, but I'm having trouble finding an implementation in Python that fits in a reasonable amount of time. The data I've got is on the order of 100k by 500 (side note: several of the variables are pretty correlated), but running the sklearn Lasso implementation on this takes upwards of 12 hours to fit a single model (I'm not actually sure of the exact time; I've left it running overnight several times and it never finished).
I've been looking into Stochastic Gradient Descent as a way to get the job done faster. However, the SGDRegressor implementation in sklearn takes on the order of 8 hours to fit when I'm using 1e5 iterations. This seems like a relatively small amount (and the docs even suggest that the model often takes around 1e6 iters to converge).
I'm wondering if there's something I'm being stupid about which is causing the fits to take a really long time. I've been told that SGD is often used for its efficiency (something around O(n_iter * n_samp * n_feat)), though so far I haven't seen much improvement over Lasso.
To speed things up, I have tried:
Decreasing n_iter, but this often leads to a pretty bad solution because it hasn't converged yet.
Increasing the step size (and decreasing n_iter), but this often makes the loss function explode.
Changing the learning rate schedule (from inverse scaling to one based on the number of iterations), but this also didn't seem to make a huge difference.
Any suggestions for speeding this process up? It seems like partial_fit might be part of the answer, though the docs on this are somewhat sparse. I'd love to be able to fit these models without waiting for three days apiece.
Partial_fit is not the answer. It will not speed anything up. If anything, it would make it slower.
The implementation is pretty efficient, and I am surprised that you say convergence is slow. You do way too many iterations, I think. Have you looked at how the objective decreases?
Often tuning the initial learning rate can give speedups. Your dataset really shouldn't be a problem. I'm not sure if SGDRegressor does that internally, but rescaling your target to unit variance might help.
You could try Vowpal Wabbit, which is an even faster implementation, but it shouldn't be necessary.
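To make the learning-rate and rescaling suggestions concrete, here is a sketch with illustrative values; X and y are assumed to be the 100k by 500 data and target from the question:

from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Standardize the features and rescale the target to unit variance
X_scaled = StandardScaler().fit_transform(X)
y_scaled = (y - y.mean()) / y.std()

sgd = SGDRegressor(
    penalty="l1",                # L1 penalty, matching the original Lasso setup
    alpha=1e-4,
    learning_rate="invscaling",
    eta0=0.01,                   # the initial learning rate; try a few values here
    max_iter=100,
)
sgd.fit(X_scaled, y_scaled)

Watching how the training objective decreases for a few settings of eta0 is usually faster than running the full iteration budget each time.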
Does sklearn.LinearRegression support online/incremental learning?
I have 100 groups of data, and I am trying to fit a regression on all of them together. For each group, there are over 10,000 instances and ~10 features, so constructing one huge matrix (10^6 by 10) leads to a memory error with sklearn. It would be nice if I could update the regressor each time with a batch of samples from a new group.
I found this post relevant, but the accepted solution works for online learning with single new data (only one instance) rather than batch samples.
Take a look at linear_model.SGDRegressor; it learns a linear model using stochastic gradient descent.
In general, sklearn has many models that support partial_fit, and they are all pretty useful on medium to large datasets that don't fit in RAM.
Not all algorithms can learn incrementally, that is, without seeing all of the instances at once. That said, all estimators implementing the partial_fit API are candidates for mini-batch learning, also known as "online learning".
Here is an article that goes over scaling strategies for incremental learning. For your purposes, have a look at the sklearn.linear_model.SGDRegressor class. It is truly online so the memory and convergence rate are not affected by the batch size.
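A minimal sketch of such mini-batch updates, assuming the 100 groups arrive as (X_group, y_group) pairs of shape (~10,000, 10):

from sklearn.linear_model import SGDRegressor

reg = SGDRegressor()
for X_group, y_group in groups:        # `groups` is an assumed iterable of batches
    reg.partial_fit(X_group, y_group)  # updates the model without keeping old groups in memory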