I am working on a project where we use a compiled Keras ANN model to classify different positions based on incoming sensor data. The data are collected in the background by a daemon thread and continuously fed to the model for prediction. The problem is that model.predict() takes up to 2 seconds to finish, even for small inputs. Each data point is an array of 38 floats. The prediction time seems unaffected by the number of rows supplied, up to a point: we have tried anywhere from a single row up to hundreds, and the elapsed time stays around 2 seconds. Isn't this abnormally slow, even for the larger inputs?
If it helps:
Our program uses multi-threading to collect the sensor data and restructure it to fit the model's predict method. Two daemon threads run in the background collecting and restructuring data, while the main thread picks already-structured data from a queue and classifies it. Here is the code where we classify based on the collected data:
values = []
rows = 0
# Collect 20 structured rows (38 floats each) from the queue filled by the daemon threads.
while rows < 20:
    val = pred_queue.shift()
    if val is not None:
        values.append(val)
        rows += 1
rows = 0

values = np.squeeze(values)

start_time = time.perf_counter()
predictions = model.predict(values)
elapsed_time = round(time.perf_counter() - start_time, 2)
print("Predict time: ", elapsed_time)

for i in range(len(predictions)):
    print(predictions[i].argmax())
    #print(f"Predicted {classification_res} in {elapsed_time}s!")
Some clarification of the code:
The shift() method returns the first entry in pred_queue. This is either an array of 38 floats or None, depending on whether the queue is empty.
What could possibly make these predictions so slow?
Edit
The reason for the confusion around the prediction times is that we had run the same model on some data before compiling it. Those data points were read from a CSV file into a pandas DataFrame and then passed to the predict method. They were not streamed live, and the dataset was much bigger: around 9000 rows, each containing 38 floats. That prediction took 0.3 seconds when we timed it, obviously much faster than our current speeds!
You can try to use the __call__ method directly, as the documentation of the predict method states (emphasis is mine):
Computation is done in batches. This method is designed for performance in large scale inputs. For small amount of inputs that fit in one batch, directly using __call__ is recommended for faster execution, e.g., model(x), or model(x, training=False) if you have layers such as tf.keras.layers.BatchNormalization that behaves differently during inference. Also, note the fact that test loss is not affected by regularization layers like noise and dropout.
Note that the performance hit you are noticing could also be related to limited machine resources. Investigate CPU usage, RAM usage, etc.
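If you want to try the __call__ route, here is a minimal sketch for the 20-row batches collected above (model and values are assumed to come from the question's code):

import numpy as np
import tensorflow as tf

# `values` has shape (20, 38) after np.squeeze in the question's loop.
batch = tf.convert_to_tensor(np.asarray(values, dtype=np.float32))

# model(...) avoids the per-call batching/callback machinery of predict(),
# which dominates the runtime for tiny inputs.
predictions = model(batch, training=False).numpy()

for row in predictions:
    print(row.argmax())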
Related
I have a number of multivariate time series that are produced by the same kind of process but:
are of significantly different lengths;
each time series is an independent instance, and the measurements are taken at different, quite random timestamps;
each time series is related at every timestamp to two targets.
In other words:
each time series has a shape of (n_timestamps, n_features)
each target series has a shape of (n_timestamps, 2).
To give an example, this could be treated as stocks of different companies, each described by a few features, where the targets at a given timestamp are the probabilities that the final price at the end of the year will be higher than x, except that we learn them directly from magically given ground-truth probabilities (instead of observed 0/1 responses).
I want to be able to predict the target at each time point and I wanted to give RNNs a try. However, I'm having issues with figuring out how I should arrange the data before passing it to Keras LSTM layers. The main things I'm wondering about are:
I want my RNN to use data starting from the beginning of the series to make a prediction at time t, not only the last k timestamps. I can't really use the whole history directly without exploding the gradient (it's too long), therefore I need a way to "remember" previously learned weights even though in reality my RNN will loop over the last k timestamps.
Each time series has a different length, so I'm unsure how to make things compatible with each other. I'm aware of padding as an option, but since the difference in length between examples can be as significant as 1000 vs 3000 timestamps, this will result in many training examples that consist only of the padding value.
Since measurements are taken at different timestamps, I believe it may affect my network in the sense that it can't really learn that, e.g., the last 10 timestamps are the most important. Or even if it can, those last 10 timestamps will span different amounts of real time for each input time series... How big a problem is this? Should I start by resampling all examples to the same time points (e.g. by interpolating)?
My current thinking is that:
I can pad each of my example sequences to the same length (max(n_timestamps))
Create batches of short sequences of length k, where k represents the length of the loop of the RNN layer. Consequently, assuming I have 200 example sequences, the longest one has 3000 timestamps, and my selected k is 50, this would result in 3000/50 = 60 batches of shape (200, 50). Or should I make 3000-1 batches where one batch differs from the next one only by one timestamp (i.e. while the first batch has timestamps from 1 to 50, the next batch has timestamps from 2 to 51, etc.)?
Since padding was used, I would need to use a Masking layer. Quite a few of the rows in the prepared batches would consist entirely of inputs that should be ignored (as they would contain only the padding value for all 50 elements).
Is this the correct way to prepare the data for my problem? Can it be done better, so as not to introduce bottlenecks such as training on examples that contain only the padding value (which should be ignored by the masking layer)? Or how should I prepare the data to address points 1, 2 and 3 described above?
each time series has a shape of (n_timestamps, n_features)
each target series has a shape of (n_timestamps, 2).
Okay, this is pretty standard so far.
I want my RNN to use data starting from the beginning of the series to make a prediction at time t, not only the last k timestamps. I can't really use the whole history directly without exploding the gradient (it's too long), therefore I need a way to "remember" previously learned weights even though in reality my RNN will loop over the last k timestamps.
Check and make sure you actually need this. An RNN (or a Transformer) could use any of/all of the history that you give it. But that's assuming that the history is useful for the predictions you're making.
I'd try training on standard-sized random clips of the data (like in this tutorial). I'd retrain it a few times with longer and longer clips and see if the model performance plateaus before I run out of memory.
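Something like this (a NumPy-only sketch, with made-up shapes) is usually enough to cut fixed-length random clips out of each series:

import numpy as np

def random_clip(series, targets, clip_len, rng=None):
    """Sample one fixed-length window from a (series, targets) pair."""
    rng = rng or np.random.default_rng()
    start = rng.integers(0, len(series) - clip_len + 1)
    return series[start:start + clip_len], targets[start:start + clip_len]

# Example: 3000 timestamps, 5 features, 2 targets, clips of 200 steps.
series = np.random.randn(3000, 5)
targets = np.random.rand(3000, 2)
x_clip, y_clip = random_clip(series, targets, clip_len=200)
print(x_clip.shape, y_clip.shape)  # (200, 5) (200, 2)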
But in Keras it is relatively simple to do exactly the thing you're asking.
Keras RNNs (LSTM, GRU) have a return_state argument. It allows you to run the model over part of a sequence, pause, execute a training step, and then continue running exactly where you left off.
(The stateful argument is another mechanism that provides a similar effect.)
The code ends up looking something like this:
class MyModel(keras.Model):
    ...

    def train_step(self, args):
        inputs, labels = args
        state = self.get_initial_state()
        # Walk the long sequence in chunks of 100 timesteps, carrying the RNN
        # state across chunks and applying one gradient update per chunk.
        while tf.shape(inputs)[1] != 0:
            in_slice, inputs = inputs[:, :100], inputs[:, 100:]
            label_slice, labels = labels[:, :100], labels[:, 100:]
            with tf.GradientTape() as tape:
                result, state = self(in_slice, state)
                loss = self.loss(label_slice, result)
            vars = self.trainable_variables
            grads = tape.gradient(loss, vars)
            self.optimizer.apply_gradients(zip(grads, vars))
It may also be possible to use ForwardAccumulator to collect the gradients. In that case you don't need to cut the sequences into chunks because the memory used by forward accumulator doesn't grow with sequence length. I've never tried before so I don't have example code.
Each time series has a different length, so I'm unsure how to make things compatible with each other. I'm aware of padding as an option, but since the difference in length between examples can be as significant as 1000 vs 3000 timestamps, this will result in many training examples that consist only of the padding value.
That might be okay, just inefficient. You can make batches of similar sequence lengths using tf.data.Dataset.bucket_by_sequence_length.
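A rough sketch of what that could look like (variable names, lengths and bucket sizes here are just placeholders; bucket_by_sequence_length is a tf.data.Dataset method in recent TF versions):

import tensorflow as tf

n_features = 5  # placeholder for your actual feature count

# Hypothetical variable-length series and targets, shaped
# (n_timestamps_i, n_features) and (n_timestamps_i, 2) respectively.
sequences = [tf.random.normal((n, n_features)) for n in (800, 1500, 2900)]
labels = [tf.random.uniform((n, 2)) for n in (800, 1500, 2900)]

ds = tf.data.Dataset.from_generator(
    lambda: zip(sequences, labels),
    output_signature=(
        tf.TensorSpec(shape=(None, n_features), dtype=tf.float32),
        tf.TensorSpec(shape=(None, 2), dtype=tf.float32),
    ),
)

# Batch together examples of similar length so that padding mostly stays
# within a bucket instead of padding everything to the global maximum.
ds = ds.bucket_by_sequence_length(
    element_length_func=lambda x, y: tf.shape(x)[0],
    bucket_boundaries=[1000, 2000],   # tune to your length distribution
    bucket_batch_sizes=[16, 8, 4],    # one more entry than boundaries
)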
Since measurements are taken at different timestamps, I believe it may affect my network in the sense that it can't really learn that, e.g., the last 10 timestamps are the most important. Or even if it can, those last 10 timestamps will span different amounts of real time for each input time series... How big a problem is this? Should I start by resampling all examples to the same time points (e.g. by interpolating)?
Interpolating to a fixed rate might be a reasonable thing to try if it doesn't make your data too much longer. Just think carefully about making predictions on interpolated values: there's some data leaking back in time from a future measurement.
Another approach would be to make the size of the time-step a feature. If each input is tagged with how long it's been since the last input the model can learn how to handle small or large steps.
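A tiny sketch of that (array names made up), appending the time since the previous measurement as an extra column:

import numpy as np

# series: (n_timestamps, n_features); timestamps: (n_timestamps,), already sorted.
series = np.random.randn(100, 5)
timestamps = np.sort(np.random.uniform(0, 365, size=100))

# Time elapsed since the previous measurement (0 for the first row).
delta_t = np.diff(timestamps, prepend=timestamps[0])

# Append it as one more feature column -> (n_timestamps, n_features + 1).
series_with_dt = np.concatenate([series, delta_t[:, None]], axis=1)
print(series_with_dt.shape)  # (100, 6)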
I can pad each of my example sequences to the same length (max(n_timestamps))
Yes. Pad, or make clips of a fixed size.
Create batches of short sequences of length k, where k represents the length of the loop of the RNN layer. Consequently, assuming I have 200 example sequences, the longest one has 3000 timestamps, and my selected k is 50, this would result in 3000/50 = 60 batches of shape (200, 50).
That would line up with the code example I gave.
Or should I make 3000-1 batches where one batch differs from the next one only by one timestamp
Either way is fine. But if you want to carry the state over from batch to batch (I'm skeptical that you actually need the carry over) then you need to do them chunk by chunk, not by single-stepping your window.
Since padding was used, I would need to use a Masking layer. Quite a few of the rows in the prepared batches would consist entirely of inputs that should be ignored (as they would contain only the padding value for all 50 elements).
Yeah, that'll be wasted computation, but it won't hurt anything.
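For completeness, a minimal sketch (shapes and values made up) of padding variable-length series and letting a Masking layer hide the padded timesteps from an LSTM:

import numpy as np
from tensorflow import keras

# Hypothetical variable-length series, each (n_timestamps_i, n_features).
sequences = [np.random.randn(n, 5).astype("float32") for n in (1000, 1800, 3000)]

# Pad all series at the end up to the longest one, using 0.0 as the sentinel.
padded = keras.preprocessing.sequence.pad_sequences(
    sequences, padding="post", dtype="float32", value=0.0
)

model = keras.Sequential([
    # Timesteps whose features all equal mask_value are skipped downstream.
    keras.layers.Masking(mask_value=0.0, input_shape=(None, 5)),
    keras.layers.LSTM(32, return_sequences=True),
    keras.layers.Dense(2),
])

print(padded.shape)          # (3, 3000, 5)
print(model(padded).shape)   # (3, 3000, 2)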
I'm working on a binary classification problem using time series data and I've been having some trouble adjusting the scale_pos_weight parameter.
As it's time series data, most of my features are things like the last-30-days mean, number of days since event X, accumulated days of event X happening, etc., so in order to avoid data leakage I'm splitting the data chronologically: the first 80% for training and the last 20% for testing.
This works fine for most cases, but there are a few where the target distribution changes a lot from the training data to the test data, meaning that the training data has a 100:1 negative-to-positive ratio while the test data is around 30:1.
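For concreteness, the split and the ratio check look roughly like this (variable names and data here are just placeholders for my real dataset):

import numpy as np
import pandas as pd

# Stand-in for the real time-ordered dataset (already sorted by date).
df = pd.DataFrame({"target": np.random.binomial(1, 0.02, size=10000)})

split = int(len(df) * 0.8)                       # first 80% train, last 20% test
train, test = df.iloc[:split], df.iloc[split:]

def neg_pos_ratio(y):
    return (y == 0).sum() / max((y == 1).sum(), 1)

print("train neg:pos ~", round(neg_pos_ratio(train["target"]), 1))
print("test  neg:pos ~", round(neg_pos_ratio(test["target"]), 1))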
I've tried adjusting the training size to different values to get similar target distributions, but I end up with odd training sizes like 50% or 95%.
I also considered using the test data distribution to adjust the weights, but that would be data leakage.
Any ideas of how could I sort this out?
I'm trying to work with scikit-learn's VotingRegressor, but I find the experience quite frustrating due to the apparent overhead this class adds.
All it should be doing according to the documentation is
...fits several base regressors, each on the whole dataset. Then it averages the individual predictions to form a final prediction.
But by doing this, I find it somehow increases the runtime by LOADS. Why?
For example, if I import 6 different regressors and train them individually, it amounts to around 5 minutes of training on my computer. Based on the description, the only additional step the VotingRegressor takes is averaging each predictor's predictions. However, when I pass the same 6 regressors to a VotingRegressor and start training, the training keeps running well past the 20-minute mark.
Just for taking an average, I wouldn't expect a more than 5-fold increase in runtime (the training I'm currently running has passed 30 minutes and still hasn't finished). What overhead is VotingRegressor adding? Keep in mind this is happening with a dataset of roughly 30 000 x 150.
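For reference, here is a minimal, self-contained sketch of the kind of comparison I mean (made-up data and a smaller set of estimators than my real 6):

import time
import numpy as np
from sklearn.ensemble import VotingRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge

X, y = np.random.randn(5000, 50), np.random.randn(5000)
estimators = [
    ("ridge", Ridge()),
    ("rf", RandomForestRegressor(n_estimators=100)),
    ("gbr", GradientBoostingRegressor()),
]

# Fit each regressor on its own and time it.
start = time.perf_counter()
for _, est in estimators:
    est.fit(X, y)
print("individual fits:", round(time.perf_counter() - start, 1), "s")

# Fit the same set of regressors inside a VotingRegressor.
start = time.perf_counter()
VotingRegressor(estimators, n_jobs=1).fit(X, y)
print("voting fit:", round(time.perf_counter() - start, 1), "s")

(As far as I understand, VotingRegressor refits clones of the estimators you pass in, so the individually fitted ones above aren't reused; its n_jobs parameter can at least fit them in parallel.)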
My model deals with videos, and I want to calculate how fast it can process frames, i.e. frames per second or the processing time for one frame.
I have made a single function to get predictions: it takes raw frames as input, does all the preprocessing, and returns the classification. One of the preprocessing steps is sampling frames from the video; basically, it reduces the number of frames that go into the deep learning model to 1/5 of the input. Without all the preprocessing, the model won't perform as expected.
So my question is, should I count the preprocessing time as well? And, most importantly, is this the processing time for all frames or just for the frames the model actually sees?
Sample code structure as below:
start = time.time()
prediction = main(data)
end = time.time()
print("Time for 1 frame =", (end - start) / n_frames)  # let's say n_frames = 50
Inside the main function:
def main(data):
    preprocessed = preprocess(data)  # resizing, sampling down from 50 to 10 frames
    prediction = model.predict(preprocessed)
    return prediction
Example: the input is 50 frames, and the total time taken to preprocess them and make predictions is 1 second. (Note that the model only sees 10 preprocessed frames.)
So, is the processing time for one frame 1/50 seconds? Or should it be 1/10 seconds, since the model only gets to process 10 frames and the others simply get dropped in preprocessing? And where should I place the start and end time measurements?
Which way is the standard way or the right way?
There is no standard way, it depends on what exactly you're going to use the results for.
If you're trying to ONLY demonstrate the time used up by the deep learning model, don't include the preprocessing steps.
If you want to time the end to end process, include the entire pipeline.
An even better solution would be to profile your code. This will give you a breakdown of how long each part of your code takes, so you don't have to pick one or the other.
In your case, since you want to express the time in terms of the number of frames, I don't care about how many frames your model sees or any of the inner workings of your pipeline. As a user, all I care about is: if I put X frames in, how long will it take? So go with the figure based on all 50 input frames.
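If it helps, a minimal sketch of that kind of breakdown, reusing preprocess, model and n_frames from the question's code:

import time

def timed_main(data, n_frames=50):
    t0 = time.perf_counter()
    preprocessed = preprocess(data)           # resize + sample 50 -> 10 frames
    t1 = time.perf_counter()
    prediction = model.predict(preprocessed)  # the model only sees the 10 frames
    t2 = time.perf_counter()

    print(f"preprocess: {t1 - t0:.3f}s  inference: {t2 - t1:.3f}s")
    print(f"end-to-end per input frame: {(t2 - t0) / n_frames:.4f}s")
    return prediction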
I'm using Keras with Theano to train a basic logistic regression model.
Say I've got a training set of 1 million entries; it's too large for my system to use the standard model.fit() without running out of memory.
I decide to use a python generator function and fit my model using model.fit_generator().
My generator function returns batch-sized chunks of the 1M training examples (they come from a DB table, so I only pull enough records at a time to satisfy each batch request, keeping memory usage in check).
It's an endlessly looping generator: once it reaches the end of the 1 million records, it loops back and continues over the set.
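Roughly like this (the fetch_records helper is a placeholder for the actual DB query):

import numpy as np

def db_batch_generator(batch_size=128, total_rows=1000000):
    """Endless generator yielding (X, y) batches pulled from the DB in order."""
    while True:                                       # never raise StopIteration
        offset = 0
        while offset < total_rows:
            rows = fetch_records(offset, batch_size)  # placeholder DB query
            X = np.array([r.features for r in rows], dtype=np.float32)
            y = np.array([r.label for r in rows], dtype=np.float32)
            yield X, y
            offset += batch_size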
There is a mandatory argument in fit_generator() to specify samples_per_epoch. The documentation indicates
samples_per_epoch: integer, number of samples to process before going to the next epoch.
I'm assuming fit_generator() doesn't reset the generator each time an epoch runs, hence the need for an infinitely running generator.
I typically set the samples_per_epoch to be the size of the training set the generator is looping over.
However, if samples_per_epoch is smaller than the size of the training set the generator is working from and nb_epoch > 1:
Will you get odd/adverse/unexpected training results, since it seems the epochs will each fit to differing sets of training examples?
If so, do you 'fast-forward' your generator somehow?
I'm dealing with something similar right now. I want to make my epochs shorter so I can record more information about the loss or adjust my learning rate more often.
Without diving into the code, I think the fact that .fit_generator works with the randomly augmented/shuffled data produced by the Keras built-in ImageDataGenerator supports your suspicion that it doesn't reset the generator per epoch. So I believe you should be fine: as long as the model is exposed to your whole training set, it shouldn't matter if some of it is trained in a separate epoch.
If you're still worried you could try writing a generator that randomly samples your training set.
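A minimal sketch of that idea (again with a placeholder DB helper): each batch is an independent random draw from the full 1M rows, so epoch boundaries stop mattering:

import numpy as np

def random_sample_generator(batch_size=128, total_rows=1000000):
    """Endless generator that draws a random batch of rows each time."""
    while True:
        ids = np.random.randint(0, total_rows, size=batch_size)
        rows = fetch_records_by_id(ids)               # placeholder DB lookup
        X = np.array([r.features for r in rows], dtype=np.float32)
        y = np.array([r.label for r in rows], dtype=np.float32)
        yield X, y

With a generator like that, samples_per_epoch mostly controls how often Keras logs metrics and runs callbacks, since every batch is sampled from the whole set anyway.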