SKLearn VotingRegressor - why so slow? - python

I'm trying to work with SciKit-Learn's VotingRegressor, but I find the experience quite frustrating due to the apparent overhead this class adds.
All it should be doing according to the documentation is
...fits several base regressors, each on the whole dataset. Then it averages the individual predictions to form a final prediction.
But in doing so, it somehow increases the runtime enormously. Why?
For example, if I import 6 different regressors and train them individually, it amounts to around 5 minutes of training on my computer. Based on the description, the only additional step the VotingRegressor takes is averaging each regressor's predictions. However, when I pass the same 6 regressors to a VotingRegressor and start training, the training keeps running well past the 20 minute mark.
For computing an average, I wouldn't expect a more than 5-fold increase in runtime (a training run has currently been going for over 30 minutes and still hasn't finished). What overhead is VotingRegressor adding? Keep in mind this is happening with a dataset of roughly 30 000 x 150.
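Roughly, the comparison looks like the sketch below (the three regressors and the synthetic dataset are just placeholders, not my actual setup; n_jobs is VotingRegressor's real parallelism parameter):
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=30_000, n_features=150, random_state=0)

regressors = [
    ("rf", RandomForestRegressor(random_state=0)),
    ("gbr", GradientBoostingRegressor(random_state=0)),
    ("ridge", Ridge()),
]

# Fitting each regressor individually:
for name, reg in regressors:
    reg.fit(X, y)

# Fitting the same regressors inside a VotingRegressor. By default the base
# estimators are cloned and fit one after another; n_jobs=-1 fits them in parallel.
voter = VotingRegressor(estimators=regressors, n_jobs=-1)
voter.fit(X, y)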

Related

Is there a way we can calculate estimated model training time when applying machine learning with scikit-learn or otherwise?

Many times, especially when the dataset is large or has many features, it takes ages (long hours) for scikit-learn to train a model. Since training uses up the machine's computational resources, working on other things on the same machine during that time becomes exceptionally slow, reducing overall productivity.
Is there a way to estimate the time required to train a model? It doesn't have to be known beforehand; it could be estimated once training has started.
I have tried scitime, but that's a very invasive method. I would prefer an approach that is more tightly coupled with sklearn's functionality.

TPOT taking too long to train

I've been trying to use TPOT for the first time on a dataset that has approximately 7000 rows. When training TPOT on the training set, which is 25% of the dataset as a whole, it takes too long. I've been running the code for approximately 45 minutes on Google Colab and the optimization progress is still at 4%. I've just been following the example at http://epistasislab.github.io/tpot/examples/. Is it typical for TPOT to take this long? Because so far I don't think it's worth even trying to use it.
TPOT can take quite a long time depending on the dataset you have. You have to consider what TPOT is doing: TPOT is evaluating thousands of analysis pipelines and fitting thousands of ML models on your dataset in the background, and if you have a large dataset, then all that fitting can take a long time--especially if you're running it on a less powerful computer.
If you'd like faster results, you have a few options (a combined sketch follows this list):
Use the "TPOT light" configuration, which uses simpler models and will run faster.
Set the n_jobs parameter to -1 or a number greater than 1, which will allow TPOT to evaluate pipelines in parallel. -1 will use all of the available cores and speed things up significantly if you have a multicore machine.
Subsample the data using the subsample parameter. The default is 1.0, corresponding to using 100% of your training data. You can subsample to lower percentages of the data and TPOT will run faster.
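A minimal sketch combining the three options above; the generations, population_size, train_size and subsample values here are purely illustrative:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)  # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.25, random_state=0)

tpot = TPOTClassifier(
    generations=5,
    population_size=20,
    config_dict="TPOT light",  # simpler, faster models
    n_jobs=-1,                 # evaluate pipelines on all available cores
    subsample=0.5,             # train on 50% of the training data
    verbosity=2,
)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))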

Keras model.predict() taking unreasonable amount of time

I am working on a project where we are using a compiled Keras ANN model to classify different positions based on incoming sensor data. The data are continuously fed to the model for prediction by a daemon thread that collects them in the background. We are having a problem where model.predict() takes up to 2 seconds to finish, even for small inputs. The data points are arrays containing 38 floats each. The prediction time seems unaffected by the number of rows supplied, up to a certain point: we have tried supplying only one row and up to hundreds, and the elapsed time stays around 2 seconds. Isn't this time consumption abnormally high, even for the larger inputs?
If it helps:
Our program is using multi-threading to be able to collect the data from the sensors and restructure them so that they fit the predict method of the model. Two daemon threads are running in the background collecting and restructuring data, while the main thread is actively picking data from a queue of already structured data and classifying based on these. Here is the code where we classify based on the data collected:
import time
import numpy as np

values = []
rows = 0
# Collect 20 structured data points from the queue before predicting
while rows < 20:
    val = pred_queue.shift()  # next array of 38 floats, or None if the queue is empty
    if val is not None:
        values.append(val)
        rows += 1
rows = 0

values = np.squeeze(values)

start_time = time.perf_counter()
predictions = model.predict(values)
elapsed_time = round(time.perf_counter() - start_time, 2)
print("Predict time: ", elapsed_time)

for i in range(len(predictions)):
    print(predictions[i].argmax())
    #print(f"Predicted {classification_res} in {elapsed_time}s!")
Some clarification of the code:
The shift() method returns the first entry in pred_queue. This will be either an array of 38 floats or None, depending on whether the queue is empty.
What could possibly make these predictions so slow?
Edit
The reason for the confusion around the prediction times is that we had run the same model on some data before compiling it. Those data points were read from a CSV file, put into a pandas DataFrame, and finally passed to the predict method. They were not streamed live, and the dataset was much bigger: around 9000 rows, each containing 38 floats. That prediction took 0.3 seconds when we timed it, obviously much faster than our current speeds!
You can try to use the __call__ method directly, as the documentation of the predict method states (emphasis is mine):
Computation is done in batches. This method is designed for performance in large scale inputs. For small amount of inputs that fit in one batch, directly using __call__ is recommended for faster execution, e.g., model(x), or model(x, training=False) if you have layers such as tf.keras.layers.BatchNormalization that behaves differently during inference. Also, note the fact that test loss is not affected by regularization layers like noise and dropout.
Note that the performance hit you are noticing could also be related to limited machine resources; investigate CPU usage, RAM usage, etc.
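For example, a minimal sketch reusing model and values from your code above (the imports are added here; the argmax axis assumes a 2-D output, as in your loop):
import numpy as np
import tensorflow as tf

batch = np.asarray(values, dtype=np.float32)  # the (n_rows, 38) array collected from the queue
outputs = model(batch, training=False)        # __call__ avoids predict()'s per-call batching overhead
print(tf.argmax(outputs, axis=1).numpy())     # same per-row argmax as in the loop above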

How to perform 10 random splits to ensure the consistency of the machine learning result

I just read a paper about image popularity prediction. The author split the data into two halves, one for training and the other testing. 5-fold cross-validation was used on the training set to find the optimal parameters. And the outcome of the experiment is the rank correlation between the predicted popularity and the actual popularity.
To ensure the consistency of the result, the author averaged the performance over 10 random splits. I am confused about the 10 random splits.
Does it mean that once he obtained the optimal parameters and the model, the model was applied to the testing set, and the testing set was split 10 times into 10 parts, 9 for training and 1 for testing?
Did the model train again during the process?
Reading up on What is cross-validation should help some.
I cannot speak to what the author did without looking at the paper, but the idea of cross-validation is not to do the splits on the test dataset, but to train and discard models after doing K splits on the entire dataset, using each subset once for testing.
Assuming you're okay on that part, the way you worded it makes it sound like, after choosing the optimal parameters, the author went back to the 50-50 split from step 1 and repeated it without changing the parameters, in effect getting a new train and test set each time. He did this 10 times in total.
If that is the case, it essentially means that he performed the random 50-50 split on the entire set 9 more times, trained the same model architecture on each new split, and averaged his "performance" metric across the 10 scores.
EDIT:
Paper Reference
3.2 Evaluation: For each of the settings described above, we split the data randomly into two halves, one for training and the other testing. We average the performance over 10 random splits to ensure the consistency of our results; overall, we find that our results are highly consistent with low standard deviations across splits.
Alright, so yes, the author indeed "repeated" his work 10 times, each time creating a random 50-50 split to start.
So, in essence, the 5-fold cross-validation happened during training on the training half, and the remaining 50% was used for evaluation. Think of it not as a typical test set but as a "hold-out" set, since cross-validation meant that the training half of the data was used both to train and to validate during that process.
At that point, you scrap all your work except the hyperparameters and the result on the corresponding "hold-out" 50% of the data. Then you start with your entire dataset again and make a different random 50-50 split. This time, with the same hyperparameters, you train again on your new training set and test on your new "hold-out" set for this result. And repeat.
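Here is a minimal sketch of that protocol in scikit-learn terms; the estimator, parameter grid and synthetic data are purely illustrative, and depending on how you read the paper, the 5-fold tuning might be done only once rather than inside every split:
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, ShuffleSplit

X, y = make_regression(n_samples=2_000, n_features=20, random_state=0)

scores = []
splitter = ShuffleSplit(n_splits=10, test_size=0.5, random_state=0)  # 10 random 50-50 splits
for train_idx, test_idx in splitter.split(X):
    # 5-fold cross-validation on the training half to pick hyperparameters
    search = GridSearchCV(RandomForestRegressor(random_state=0),
                          param_grid={"max_depth": [5, 10, None]}, cv=5)
    search.fit(X[train_idx], y[train_idx])
    # score the tuned model on the held-out half with a rank correlation
    rho, _ = spearmanr(search.predict(X[test_idx]), y[test_idx])
    scores.append(rho)

print("mean rank correlation over 10 splits:", np.mean(scores))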
Without reading the paper (please provide it), it seems like he partitioned the data into 2 parts randomly, trained on one part, validated on the other part and recorded the performance results. He did this 10 times and probably averaged his performance results afterwards.

SciKit One-class SVM classifier training time increases exponentially with size of training data

I am using the Python SciKit OneClass SVM classifier to detect outliers in lines of text. The text is converted to numerical features first using bag of words and TF-IDF.
When I train (fit) the classifier on my computer, the time seems to increase exponentially with the number of items in the training set:
Number of items in training data and training time taken:
10K: 1 sec, 15K: 2 sec, 20K: 8 sec, 25K: 12 sec, 30K: 16 sec, 45K: 44 sec.
Is there anything I can do to reduce the training time, and to keep it from becoming prohibitively long when the training data grows to a couple of hundred thousand items?
Well, scikit-learn's SVM is a high-level implementation, so there is only so much you can do. In terms of speed, from their website: "SVMs do not directly provide probability estimates, these are calculated using an expensive five-fold cross-validation."
You can increase the kernel cache size (the cache_size parameter) based on your available RAM, but this does not help much.
You can try changing the kernel, though the resulting model might not fit your data as well.
Here is some advice from http://scikit-learn.org/stable/modules/svm.html#tips-on-practical-use: Scale your data.
Otherwise, don't use scikit and implement it yourself using neural nets.
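As a short sketch of those knobs (assuming the "kernel size" above refers to OneClassSVM's cache_size parameter; the corpus and values are placeholders):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

texts = ["a normal log line", "another normal log line", "something rather different"]  # placeholder corpus
X_tfidf = TfidfVectorizer().fit_transform(texts)  # bag of words + TF-IDF, as in the question

clf = OneClassSVM(
    kernel="linear",   # try a cheaper kernel than the default RBF
    cache_size=1000,   # kernel cache in MB, sized to the available RAM
    nu=0.05,
)
clf.fit(X_tfidf)
print(clf.predict(X_tfidf))  # +1 = inlier, -1 = outlier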
Hope I'm not too late. OCSVM, like SVM in general, is resource hungry, and the relationship between data size and training time is quadratic (the numbers you show follow this). If you can, see whether Isolation Forest or Local Outlier Factor work for you, but if you're planning to apply this to a larger dataset, I would suggest building a manual anomaly detection model that closely mirrors what these off-the-shelf solutions do for your context. That way you should be able to work either in parallel or with threads.
For anyone coming here from Google, sklearn has implemented SGDOneClassSVM, which "has a linear complexity in the number of training samples". It should be faster for large datasets.
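A minimal sketch, assuming TF-IDF text features as in the question; the corpus and the nu value are placeholders:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDOneClassSVM
from sklearn.pipeline import make_pipeline

texts = ["a normal log line", "another normal log line", "something rather different"]  # placeholder corpus

model = make_pipeline(
    TfidfVectorizer(),                        # bag of words + TF-IDF, as in the question
    SGDOneClassSVM(nu=0.05, random_state=0),  # linear-time one-class SVM
)
model.fit(texts)
print(model.predict(texts))  # +1 = inlier, -1 = outlier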
