I'm using Keras for a sliding window object detection system. This naturally requires the ability to do many, many classifications quickly. Unfortunately, Keras's model.predict() function has a significant overhead and takes longer to load? preprocess the data? who knows, than it does to do the actual network processing. I know because I've tried removing layers, etc. and it makes almost no difference to the time spent in a model.predict() call.
So basically what I'm looking for is a way to use one network and run predictions on several inputs at once. Not necessarily in separate threads, but without returning to my code. Is anyone aware of such a technique?
Related
I am new to Pytorch and I am now following the tutorial on transforms. I see that the transformations are configured into the dataset object. I am wondering, however, why aren't they configured within the neural network itself. My naive point of view is that the transformations should be in any case the most external layers of the network, in the same way as the eye comes before the brain to transform light into signals for the brain, and you don't modify the world instead to adapt it to the brain.
So, is there any technical reason for putting the transformations in the dataset instead of the net? Is it a good/bad practice to put the transformations within my neural network instead? Why?
These are some of the reason that can explain why one would do this.
We would like to use the same NN code for training as well as testing / inference. Typically during inference, we don't want to do any transformation and hence one might want to keep it out of the network. However, you may argue that one can just simply use model.training flag to skip the transformation.
Most of the transformations happen on CPU. Doing transformations in dataset allows to easily use multi-processing and prefetching. The dataset code can prefetch the data, transform, and keep it ready to be fed into the NN in a separate thread. If instead, we do it inside the forward function, GPUs will idle during the transformations (as these happen on CPU), likely leading to a longer training time.
I need to backpropagate through my neural network multiple times, so I set backward(retain_graph=True).
However, this is causing
RuntimeError: CUDA out of memory
I don't understand why this is.
Are the number of variables or weights doubling? Shouldn't the amount of memory used remain the same regardless of how many times backward() is called?
The source of the issue :
You are right that no matter how many times we call the backward function, the memory should not increase theorically.
Yet your issue is not because of the backpropagation, but the retain_graph variable that you have set to true when calling the backward function.
When you run your network by passing a set of input data, you call the forward function, which will create a "computation graph".
A computation graph is containing all the operations that your network has performed.
Then when you call the backward function, the computation graph saved will "basically" be runned backward to know which weight should be adjusted in which directions (what is called the gradients).
So PyTorch is saving in memory the computation graph in order to call the backward function.
After the backward function has been called and the gradients have been calculated, we free the graph from the memory, as explained in the doc https://pytorch.org/docs/stable/autograd.html :
retain_graph (bool, optional) – If False, the graph used to compute the grad will be freed. Note that in nearly all cases setting this option to True is not needed and often can be worked around in a much more efficient way. Defaults to the value of create_graph.
Then usually during training we apply the gradients to the network in order to minimise the loss, then we re-run the network, and so we create a new computation graph. Yet we have only one graph in memory at the same time.
The issue :
If you set retain_graph to true when you call the backward function, you will keep in memory the computation graphs of ALL the previous runs of your network.
And since on every run of your network, you create a new computation graph, if you store them all in memory, you can and will eventually run out of memory.
On the first iteration and run of your network, you will have only one graph in memory. Yet on the 10th run of the network, you have 10 graphs in memory. And on the 10000th run you have 10000 in memory. It is not sustainable, and it is understandable why it is not recommended in the docs.
So even if it may seems that the issue is the backpropagation, it is actually the storing of the computation graphs, and since we usually call the the forward and backward function once per iteration or network run, making a confusion is understandable.
Solution :
What you need to do, is find a way to make your network and architecture work without using retain_graph. Using it will make it almost impossible to train your network, since each iteration increase the usage of your memory and decrease the speed of training, and in your case, even cause you to run out of memory.
You did not mention why you need to backpropagate multiple times, yet it is rarely needed, and i do not know of a case where it cannot be "worked around". For example, if you need to access variables or weights of previous runs you could save them inside variables and later access them, instead of trying doing a new backpropagation.
You likely need to backpropagate multiple times for another reason, yet believe as i have been in this situation, there is likely a way to accomplish what you are trying to do without storing the previous computation graphs.
If you want to share why you need to backpropagate multiple times, maybe others and i could help you more.
More about the backward process :
If you want to learn more about the backward process it is called the "Jacobian-vector product". It is a bit complex and is handled by PyTorch. I do not yet fully understand it, yet this ressource seems good as a starting point, as it seems less intimidating than the PyTorch documentation (in term of algebra) : https://mc.ai/how-pytorch-backward-function-works/
I have a dataset with 11k instances containing 0s,1s and -1s. I heard that deep learning can be applied to feature values.Hence applied the same for my dataset but surprisingly it resulted in less accuracy (<50%) compared to traditional machine learning algos (RF,SVM,ELM). Is it appropriate to apply deep learning algos to feature values for classification task? Any suggestion is greatly appreciated.
First of all, Deep Learning isn't a mythical hammer you can throw at every problem and expect better results. It requires careful analysis of your problem, choosing the right method, crafting your network, properly setting up your training, and only then, with a lot of luck will you see significantly better results than classical methods.
From what you describe (and without any more details about your implementation), it seems to me that there could have been several things going wrong:
Your task is simply not designed for a neural network. Some tasks are still better solved with classical methods, since they manually account for patterns in your data, or distill your advanced reasoning/knowledge into a prediction. You might not be directly aware of it, but sometimes neural networks are just overkill.
You don't describe how your 11000 instances are distributed with respect to the target classes, how big the input is, what kind of preprocessing you are performing for either method, etc, etc. Maybe your data is simply processed wrong, your training is diverging due to unfortunate parameter setups, or plenty of other things.
To expect a reasonable answer, you would have to share at least a bit of code regarding the implementation of your task, and parameters you are using for training.
I'm trying to train a very simple neural network to classify samples of data where some classes necessarily succeed others - this is why I decided to let the input data enter the network in batches. Using Tensorflow, apparently you get multiple ways of declaring batches, like tf.data.Dataset.batch (with which I currently train using the Adam Optimizer) and tf.train.batch. Where is the difference? Should the methods be used together or are they exclusive? In the latter case: which one should I prefer?
tf.train.* is an older API, more complex and prone to errors than the tf.data.* one (you need to take care yourself of queues, thread runners, coordinator, etc). For your stated purpose (batching data and feeding it to a model), the two are functionally equivalent, as in both achieve your goal. However, you should consider using tf.data as that's both simpler to use and the currently recommended way to handle input datasets.
I'm trying to make an ANN to classify a PDF file as either malicious or clean, by utilising the 26,000 PDF samples (both clean and malicious) found on contagiodump. For each PDF file, I used PDFid.py to parse the file and return a vector of 42 numbers. The 26000 vectors are then passed into pybrain; 50% for training and 50% for testing. This is my source code:
https://gist.github.com/sirpoot/6805938
After much tweaking with the dimensions and other parameters I managed to get a false positive rate of about 0.90%. This is my output:
https://gist.github.com/sirpoot/6805948
My question is, is there any explicit way for me to decrease the false positive rate further? What do I have to do to reduce the rate to perhaps 0.05%?
There are several things you can try to increase the accuracy of your neural network.
Use more of your data for training. This will permit the network to learn from a larger set of training samples. The drawback of this is that having a smaller test set will make your error measurements more noisy. As a rule of thumb, however, I find that 80%-90% of your data can be used in the training set, with the rest for test.
Augment your feature representation. I'm not familiar with PDFid.py, but it only returns ~40 values for a given PDF file. It's possible that there are many more than 40 features that might be relevant in determining whether a PDF is malicious, so you could conceivably use a different feature representation that includes more values to increase the accuracy of your model.
Note that this can potentially involve a lot of work -- feature engineering is difficult! One suggestion I have if you decide to go this route is to look at the PDF files that your model misclassifies, and try to get an intuitive idea of what went wrong with those files. If you can identify a common feature that they all share, you could try adding that feature to your input representation (giving you a vector of 43 values) and re-train your model.
Optimize the model hyperparameters. You could try training several different models using training parameters (momentum, learning rate, etc.) and architecture parameters (weight decay, number of hidden units, etc.) chosen randomly from some reasonable intervals. This is one way to do what is called "hyperparameter optimization" and, like feature engineering, it can involve a lot of work. However, unlike feature engineering, hyperparameter optimization can largely be done automatically and in parallel, provided you have access to a lot of processing cores.
Try a deeper model. Deep models have become quite "hot" in the machine learning literature recently, especially for speech processing and some types of image classification. By using stacked RBMs, a second-order learning method (PDF), or a different nonlinearity like a rectified linear activation function, then you can add multiple layers of hidden units to your model, and sometimes this will help improve your error rate.
These are the ones that come to mind right off the bat. Good luck !
Let me first say I am in no ways an expert in Neural Networks. But I played with pyBrain once and I used the .train() method in a while error < 0.001 loop to get the error rate I wanted. So you can try using all of them for training with that loop and test it with other files.