Adanet running out of memory

I tried training an AutoEnsembleEstimator with two DNNEstimators (with hidden units of 1000, 500, and 100) on a dataset with around 1850 features (after feature engineering), and I kept running out of memory, even on larger 400G+ high-mem GCP VMs.
I'm using the above for binary classification. Initially I had trained various models and combined them by training a traditional ensemble classifier over the trained models. I was hoping that Adanet would simplify the generated model graph and make inference easier, rather than having separate graphs/pickles for various scalers/scikit models/keras models.

Three hypotheses:
You might have too many DNNs in your ensemble, which can happen if max_iteration_steps is too small and max_iterations is not set (both of those are constructor arguments to AutoEnsembleEstimator). If you want to train each DNN for N steps, and you want an ensemble with 2 DNNs, you should set max_iteration_steps=N, set max_iterations=2, and train the AutoEnsembleEstimator for 2N steps.
You might have been on adanet-0.6.0-dev, which had a memory leak. To fix this, try updating to the latest release and seeing if this problem still arises.
Your batch size might have been too large. Try lowering your batch size.
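To make the first hypothesis concrete, here is a minimal sketch of the step arithmetic. It does not call the adanet library; planned_iterations is a hypothetical helper, and it assumes that each iteration lasts exactly max_iteration_steps and that each iteration can add one more DNN to the ensemble.

```python
import math

def planned_iterations(total_train_steps, max_iteration_steps, max_iterations=None):
    """Estimate how many iterations a training run goes through.

    Assumption: every iteration lasts exactly max_iteration_steps, and
    max_iterations (when set) caps the number of iterations.
    """
    iterations = math.ceil(total_train_steps / max_iteration_steps)
    if max_iterations is not None:
        iterations = min(iterations, max_iterations)
    return iterations

n = 10_000  # hypothetical number of steps you want per DNN

# Recommended setup: max_iteration_steps=N, max_iterations=2, train for 2N steps.
print(planned_iterations(2 * n, max_iteration_steps=n, max_iterations=2))

# Problem case: small max_iteration_steps and no max_iterations cap.
print(planned_iterations(2 * n, max_iteration_steps=100))
```

If the second number is large, the ensemble keeps growing for the whole run, which would fit the memory symptoms described in the question.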

Related

Training design / sequential loading of images for Mask-RCNN

I am training a deep learning model using Mask RCNN from the following git repository: matterport/Mask_RCNN. I rely on a heavy augmentation of my dataset (original dataset: 59 images of 1988x1355x3 with each > 80 annotations), which I store locally (necessary to evaluate type/degree of augmentation vs validation metrics). The augmented dataset counts 6000 images. This dataset varies in x and y dimensions of the image, because of reducing resolution and affine transformations - I assume the different x,y-dimensions will not affect the final tests.
However, my Python kernel crashes whenever I load more than 'X' images to train the model.
Hence, I came up with the idea of splitting the dataset into sub-datasets and iterating through them, using the 'last' trained weights as the starting point for each new round. But I am not sure if the results will be the same (read: the same, taking the stochastic nature of 'stochastic gradient descent' into account)?
I also wonder whether the results would be the same if I don't iterate through the sub-datasets per epoch, but instead train Y epochs on each one (e.g. 20 for 'heads' only, 10 for 'all layers').
Still, I am sure this is not the most efficient way of solving this issue. Ideas for improvement are welcome.
Note that I am not using keras.preprocessing.image.ImageDataGenerator(): as I understand it, it randomly generates data and feeds it to the model by replacing the input for the epoch, whereas I would like to feed the whole dataset to the model.

I came up with the idea of splitting the dataset in sub-datasets and iterate through the sub-dataset, using the 'last' trained weights as starting point for the new round. But I am not sure if the results will be the same?
You are doing the same thing ImageDataGenerator does (creating your own mini-batches), but less optimally. The result will be the same with respect to what?
If you mean with respect to a model that was trained with all the data in a single batch, most probably not, because a smaller batch means slower convergence. But this can be solved by training for more epochs.
Another issue is reproducibility. If you want to reproduce your model with the same results each time, just use seeds.
import random
random.seed(1)
import numpy as np
np.random.seed(1)
import tensorflow
tensorflow.random.set_seed(1)
Another concept is gradient accumulation. It lets you train with a large effective batch size without keeping too many images in memory at a time.
https://github.com/CyberZHG/keras-gradient-accumulation
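The idea behind gradient accumulation can be sketched without any framework. This is not the API of the linked package; it is a toy 1-D linear model in plain Python (all names are made up for illustration) showing that summing micro-batch gradients, weighted by micro-batch size, reproduces the full-batch gradient.

```python
def mse_gradient(w, xs, ys):
    """Gradient of the mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_gradient(w, xs, ys, micro_batch_size):
    """Full-batch gradient computed one micro-batch at a time."""
    total = 0.0
    for start in range(0, len(xs), micro_batch_size):
        bx = xs[start:start + micro_batch_size]
        by = ys[start:start + micro_batch_size]
        # Weight each micro-batch by its size so the total matches the full-batch mean.
        total += mse_gradient(w, bx, by) * len(bx)
    return total / len(xs)

# Toy data, only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]

full = mse_gradient(0.5, xs, ys)
accumulated = accumulated_gradient(0.5, xs, ys, micro_batch_size=2)
print(full, accumulated)
```

Only one micro-batch has to be in memory at any moment; the weight update happens once, after the last micro-batch.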
Finally, keras.preprocessing.image.ImageDataGenerator() in fact trains on the whole dataset; it just chooses a random sample at each step (you're doing the same thing with your so-called sub-datasets).
You can seed the ImageDataGenerator so it is reproducible and not entirely random.
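As a rough illustration of what a seeded generator does (this is plain Python, not Keras code, and epoch_batches is a hypothetical helper), the sketch below shuffles sample ids with a fixed seed and yields mini-batches that together cover every sample exactly once per epoch. In real training each batch of ids would be loaded from disk only when it is yielded.

```python
import random

def epoch_batches(sample_ids, batch_size, seed):
    """Yield shuffled mini-batches that together cover every sample once."""
    order = list(sample_ids)
    random.Random(seed).shuffle(order)  # same seed, same order
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]

image_ids = list(range(10))  # stand-ins for image file paths

first_run = list(epoch_batches(image_ids, batch_size=4, seed=1))
second_run = list(epoch_batches(image_ids, batch_size=4, seed=1))

print(first_run == second_run)                              # reproducible
print(sorted(i for b in first_run for i in b) == image_ids)  # whole dataset seen
```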

Can we create an ensemble of deep learning models without increasing the classification time?

I want to improve my ResNet model by creating an ensemble of X copies of this model, taking the X best ones I have trained. From what I've seen, a technique like bagging will take X times longer to classify an image, which is really not an option in my case.
Is there a way to create an ensemble without increasing the required classification time? Note that I don't care about increasing the training time, because training only needs to be done once, whereas classification could be run a very large number of times.

There is no magic pill for doing what you want. Extra computation cannot come free.
So one way this can be achieved is by using multiple worker machines to run inference in parallel.
Each model could run on a different machine using tensorflow serving.
For every new inference do the following:
Have a primary machine which takes up the job of running the inference
This primary machine submits requests to different workers (all of which can run in parallel)
The primary machine collects results from each individual worker, and creates the final output by combining them based upon your ensemble logic.
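The primary/worker steps above can be sketched with the standard library. The workers here are hypothetical local functions standing in for requests to remote model servers, and the ensemble logic is assumed to be a plain average; a real setup would replace make_worker with network calls.

```python
from concurrent.futures import ThreadPoolExecutor

def make_worker(bias):
    """Return a stand-in for one remote model (hypothetical: adds a fixed bias)."""
    def predict(x):
        return x + bias
    return predict

workers = [make_worker(b) for b in (-0.1, 0.0, 0.1)]

def ensemble_predict(x):
    """Primary node: fan the request out, wait for all workers, combine."""
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        futures = [pool.submit(worker, x) for worker in workers]
        results = [f.result() for f in futures]
    # Assumed ensemble logic: average the individual predictions.
    return sum(results) / len(results)

print(ensemble_predict(1.0))
```

With real remote workers, the latency is roughly that of the slowest worker plus network overhead, rather than the sum over all models.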

Depends on the ensembling method; it's an active area of research I suggest you look into, but I'll provide some examples below:
Dropout: trains parts of the model at any given iteration, thus effectively training a multi-NN ensemble
Weights averaging: train X models on X different splits of data to learn different features, average the early-stopped weights (requires advanced treatment)
Lookahead optimizer: automates the above by performing the averaging during training
Parallel weak learners: run X models, but each model taking 1/X the time to process - which can be achieved by e.g. inserting a strides=X convolutional layer at input; best starting bet is at X=2, so you'll average two models' predictions at output, each prediction made in parallel (which can run faster than original single model)
If you have a multi-core CPU, however, multi-model ensembling shouldn't pose much of a problem, as per last bullet, you can run inference concurrently, so inference time shouldn't increase much
More on parallelism: if a single model is large enough, CPU parallelism will no longer help - you'll also need to ensure multiple models can fit in memory (RAM). The alternative then is again a form of downsampling to cut computation
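For the weights-averaging idea, a bare-bones sketch looks like the following. It assumes all models share exactly the same architecture and stores weights as plain lists keyed by made-up layer names; it does not cover the advanced treatment mentioned above, so treat it as an illustration of the mechanics only.

```python
def average_weights(models):
    """Element-wise mean of several models' weights (identical architectures assumed)."""
    names = models[0].keys()
    return {
        name: [sum(values) / len(models)
               for values in zip(*(m[name] for m in models))]
        for name in names
    }

# Hypothetical weights of two trained models with the same layers.
model_a = {"dense/kernel": [0.2, 0.4], "dense/bias": [0.1]}
model_b = {"dense/kernel": [0.4, 0.8], "dense/bias": [0.3]}

averaged = average_weights([model_a, model_b])
print(averaged)
```

The averaged model is a single network, so inference costs the same as for one model.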

How should I approach a 300 classes classification machine learning problem?

I am trying to build a multi-class classification application, but my dataset has 300 classes. Is it possible to train my model on all of these classes with a normal PC?

Sure it is. You can even train imagenet with 1000 categories or more, if you have enough time! ;)
You just have to think about which loss function you want (categorical crossentropy, sparse categorical crossentropy, or even binary crossentropy if you want to penalize each output node independently). Apart from that, there's not really much difference between 10, 100, or 1000 classes.
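To show how the first two losses relate, here is a small framework-free sketch (function names are my own, not Keras calls). With 300 classes, the one-hot version and the sparse version give the same loss; the sparse one just avoids building a 300-element target vector per sample.

```python
import math

def categorical_crossentropy(one_hot, probs):
    """Loss for a one-hot target vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

def sparse_categorical_crossentropy(label, probs):
    """Same loss, with the target given as an integer class index."""
    return -math.log(probs[label])

num_classes = 300
label = 0
# Made-up prediction: 0.4 on the true class, the rest spread evenly.
probs = [0.4] + [0.6 / (num_classes - 1)] * (num_classes - 1)
one_hot = [1.0 if i == label else 0.0 for i in range(num_classes)]

dense_loss = categorical_crossentropy(one_hot, probs)
sparse_loss = sparse_categorical_crossentropy(label, probs)
print(dense_loss, sparse_loss)
```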
Of course you have to increase your model size to account for more classes, so RAM may be an issue, but then you can always decrease batch size. If you are using images and convnets and your model is still too large, try to downsample the images, use pooling layers or larger strides.
If your computer is too old and slow, you can also try Google Colab which offers free GPU and even TPU online!

It is difficult to answer this question. The training time of your model depends on a number of factors. It might be best to train your model for a certain number of hours and evaluate the performance. You could also fit a learning curve, which gives an estimate of how many data points you need for training to reach a certain performance. After that you could link the required number of data points to computation time.
Here is an article that provides an algorithm for fitting a learning curve: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3307431/.
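A very simplified version of the learning-curve idea is sketched below. It is not the algorithm from the linked article: it assumes the error follows a plain power law, error = a * n^(-b), and fits it by least squares in log-log space. The data points are synthetic and only there to make the example runnable.

```python
import math

def fit_power_law(sizes, errors):
    """Least-squares fit of log(error) = log(a) - b * log(n)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope

def samples_needed(a, b, target_error):
    """Invert error = a * n^(-b) to estimate the required training set size."""
    return math.ceil((a / target_error) ** (1 / b))

# Synthetic measurements generated from error = 2 * n^(-0.5).
sizes = [100, 200, 400, 800]
errors = [2.0 * n ** -0.5 for n in sizes]

a, b = fit_power_law(sizes, errors)
print(a, b, samples_needed(a, b, target_error=0.05))
```

Once you have an estimate of the required number of samples, you can multiply it by your measured time per sample per epoch to get a rough computation budget.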

Keras tf backend predict speed slow for batch size of 1

I am combining a Monte-Carlo Tree Search with a convolutional neural network as the rollout policy. I've identified the Keras model.predict function as being very slow. After experimentation, I found that surprisingly model parameter size and prediction sample size don't affect the speed significantly. For reference:
0.00135549 s for 3 samples with batch_size = 3
0.00303991 s for 3 samples with batch_size = 1
0.00115528 s for 1 sample with batch_size = 1
0.00136132 s for 10 samples with batch_size = 10
As you can see, I can predict 10 samples at about the same speed as 1 sample. The change is also minimal, though noticeable, if I decrease the parameter size by 100X, but I'd rather not change the parameter size by that much anyway. In addition, the predict function is very slow the first time it runs (~0.2s), though I don't think that's the problem here since the same model is predicting multiple times.
I wonder if there is some workaround, because clearly the 10 samples can be evaluated very quickly. All I want is to be able to predict the samples at different times and not all at once, since I need to update the Tree Search before making a new prediction. Perhaps I should work with tensorflow instead?

The batch size controls parallelism when predicting, so it is expected that increasing the batch size will have better performance, as you can use more cores and use GPU more efficiently.
There is nothing really to work around: using a batch size of one is the worst case for performance. Maybe you should look into a smaller network that is faster to predict, or predict on the CPU if your experiments are done on a GPU, to minimize the overhead of data transfer.
Don't forget that model.predict does a full forward pass of the network, so its speed completely depends on the network architecture.

One way that gave me a speed up was switching from model.predict(x) to,
model.predict_on_batch(x)
making sure your x shape has 1 as the first dimension.

I don't think working with pure Tensorflow would change the performance much. Keras is a high-level API for low-level Tensorflow primitives. You could use a smaller model instead, like MobileNetV3 or EfficientNet, but this would require retraining.
If you need to remain with the existing model, you could try OpenVINO. OpenVINO is optimized for Intel hardware, but it should work with any CPU. It optimizes your model by converting to Intermediate Representation (IR), performing graph pruning and fusing some operations into others while preserving accuracy. Then it uses vectorization in runtime.
It's rather straightforward to convert the Keras model to OpenVINO. The full tutorial on how to do it can be found here. Some snippets are below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow2]
Save your model as SavedModel
OpenVINO is not able to convert the HDF5 model, so you have to save it as SavedModel first.
import tensorflow as tf
from custom_layer import CustomLayer
model = tf.keras.models.load_model('model.h5', custom_objects={'CustomLayer': CustomLayer})
tf.saved_model.save(model, 'model')
Use Model Optimizer to convert SavedModel model
The Model Optimizer is a command-line tool that comes from OpenVINO Development Package. It converts the Tensorflow model to IR, a default format for OpenVINO. You can also try the precision of FP16, which should give you better performance without a significant accuracy drop (change data_type). Run in the command line:
mo --saved_model_dir "model" --data_type FP32 --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device, e.g., CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what the best choice for you is, use AUTO. You care about latency, so I suggest adding a performance hint (as shown below) to use the device that fulfills your requirement.
from openvino.runtime import Core
# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="AUTO", config={"PERFORMANCE_HINT":"LATENCY"})
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.

Using Pybrain to detect malicious PDF files

I'm trying to make an ANN to classify a PDF file as either malicious or clean, by utilising the 26,000 PDF samples (both clean and malicious) found on contagiodump. For each PDF file, I used PDFid.py to parse the file and return a vector of 42 numbers. The 26000 vectors are then passed into pybrain; 50% for training and 50% for testing. This is my source code:
https://gist.github.com/sirpoot/6805938
After much tweaking with the dimensions and other parameters I managed to get a false positive rate of about 0.90%. This is my output:
https://gist.github.com/sirpoot/6805948
My question is, is there any explicit way for me to decrease the false positive rate further? What do I have to do to reduce the rate to perhaps 0.05%?

There are several things you can try to increase the accuracy of your neural network.
Use more of your data for training. This will permit the network to learn from a larger set of training samples. The drawback of this is that having a smaller test set will make your error measurements more noisy. As a rule of thumb, however, I find that 80%-90% of your data can be used in the training set, with the rest for test.
Augment your feature representation. I'm not familiar with PDFid.py, but it only returns ~40 values for a given PDF file. It's possible that there are many more than 40 features that might be relevant in determining whether a PDF is malicious, so you could conceivably use a different feature representation that includes more values to increase the accuracy of your model.
Note that this can potentially involve a lot of work -- feature engineering is difficult! One suggestion I have if you decide to go this route is to look at the PDF files that your model misclassifies, and try to get an intuitive idea of what went wrong with those files. If you can identify a common feature that they all share, you could try adding that feature to your input representation (giving you a vector of 43 values) and re-train your model.
Optimize the model hyperparameters. You could try training several different models using training parameters (momentum, learning rate, etc.) and architecture parameters (weight decay, number of hidden units, etc.) chosen randomly from some reasonable intervals. This is one way to do what is called "hyperparameter optimization" and, like feature engineering, it can involve a lot of work. However, unlike feature engineering, hyperparameter optimization can largely be done automatically and in parallel, provided you have access to a lot of processing cores.
Try a deeper model. Deep models have become quite "hot" in the machine learning literature recently, especially for speech processing and some types of image classification. By using stacked RBMs, a second-order learning method, or a different nonlinearity such as a rectified linear activation function, you can add multiple layers of hidden units to your model, and sometimes this will help improve your error rate.
These are the ones that come to mind right off the bat. Good luck!
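The random hyperparameter search described above could look roughly like this. The sketch is independent of pybrain: the parameter ranges are arbitrary guesses, and toy_evaluate is a placeholder for "train a network with these settings and return its false positive rate on held-out data".

```python
import math
import random

def sample_hyperparameters(rng):
    """Draw one random configuration (ranges are illustrative assumptions)."""
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform
        "momentum": rng.uniform(0.0, 0.99),
        "weight_decay": 10 ** rng.uniform(-6, -2),   # log-uniform
        "hidden_units": rng.randint(10, 200),
    }

def random_search(evaluate, trials, seed=0):
    """Keep the configuration with the lowest score returned by evaluate."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(trials):
        params = sample_hyperparameters(rng)
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_evaluate(params):
    """Placeholder objective; replace with real training plus validation."""
    return abs(math.log10(params["learning_rate"]) + 2)

best_params, best_score = random_search(toy_evaluate, trials=20)
print(best_params, best_score)
```

Each trial is independent, so the loop can be spread over as many cores or machines as you have.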

Let me first say I am in no way an expert in neural networks. But I played with pyBrain once, and I called the .train() method in a loop until the error dropped below 0.001 to get the error rate I wanted. So you can try using all of your samples for training with that loop and then test the network on other files.
