I'm trying to make an ANN to classify a PDF file as either malicious or clean, by utilising the 26,000 PDF samples (both clean and malicious) found on contagiodump. For each PDF file, I used PDFid.py to parse the file and return a vector of 42 numbers. The 26,000 vectors are then passed into pybrain; 50% for training and 50% for testing. This is my source code:
https://gist.github.com/sirpoot/6805938
After much tweaking with the dimensions and other parameters I managed to get a false positive rate of about 0.90%. This is my output:
https://gist.github.com/sirpoot/6805948
My question is, is there any explicit way for me to decrease the false positive rate further? What do I have to do to reduce the rate to perhaps 0.05%?
There are several things you can try to increase the accuracy of your neural network.
Use more of your data for training. This will permit the network to learn from a larger set of training samples. The drawback of this is that having a smaller test set will make your error measurements more noisy. As a rule of thumb, however, I find that 80%-90% of your data can be used in the training set, with the rest for test.
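For example, if your 26,000 feature vectors and labels are held in arrays before you build the pybrain dataset, an 80/20 split could be done with scikit-learn (a sketch only; X and y are placeholders for your own arrays):
from sklearn.model_selection import train_test_split

# X: (26000, 42) array of PDFid feature vectors, y: 0/1 labels (placeholders)
# stratify=y keeps the clean/malicious ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)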
Augment your feature representation. I'm not familiar with PDFid.py, but it sounds like it returns only ~40 values for a given PDF file. It's possible that there are many more than 40 features that might be relevant in determining whether a PDF is malicious, so you could conceivably use a different feature representation that includes more values to increase the accuracy of your model.
Note that this can potentially involve a lot of work -- feature engineering is difficult! One suggestion I have if you decide to go this route is to look at the PDF files that your model misclassifies, and try to get an intuitive idea of what went wrong with those files. If you can identify a common feature that they all share, you could try adding that feature to your input representation (giving you a vector of 43 values) and re-train your model.
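As a sketch of that inspection step (assuming a trained pybrain network net, plus test_vectors, test_labels and test_files holding your held-out data -- all placeholder names):
# collect the PDFs the network gets wrong so you can inspect them by hand
misclassified = []
for vec, label, fname in zip(test_vectors, test_labels, test_files):
    # assumes a single sigmoid output unit; adapt the decision rule to your network
    predicted = 1 if net.activate(vec)[0] > 0.5 else 0
    if predicted != label:
        misclassified.append(fname)
print(misclassified)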
Optimize the model hyperparameters. You could try training several different models using training parameters (momentum, learning rate, etc.) and architecture parameters (weight decay, number of hidden units, etc.) chosen randomly from some reasonable intervals. This is one way to do what is called "hyperparameter optimization" and, like feature engineering, it can involve a lot of work. However, unlike feature engineering, hyperparameter optimization can largely be done automatically and in parallel, provided you have access to a lot of processing cores.
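A random search can be as simple as the following sketch, where train_and_evaluate() is a hypothetical stand-in for your own pybrain training/testing code and the sampling ranges are only illustrative:
import random

def sample_params():
    return {
        "learning_rate": 10 ** random.uniform(-4, -1),
        "momentum": random.uniform(0.0, 0.9),
        "hidden_units": random.choice([16, 32, 64, 128]),
        "weight_decay": 10 ** random.uniform(-6, -2),
    }

best = None
for _ in range(50):
    params = sample_params()
    error = train_and_evaluate(params)  # your own training/evaluation routine
    if best is None or error < best[0]:
        best = (error, params)
print(best)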
Try a deeper model. Deep models have become quite "hot" in the machine learning literature recently, especially for speech processing and some types of image classification. By using stacked RBMs, a second-order learning method, or a different nonlinearity like a rectified linear activation function, you can add multiple layers of hidden units to your model, and sometimes this will help improve your error rate.
These are the ones that come to mind right off the bat. Good luck!
Let me first say I am in no way an expert in neural networks. But I played with pyBrain once, and I used the .train() method in a while error < 0.001 loop to get the error rate I wanted. So you could try using all of your samples for training with that loop, and then test with other files.
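For what it's worth, that loop might look roughly like this with pybrain's BackpropTrainer (net and train_ds being your existing network and training dataset; the epoch cap is just there to avoid looping forever):
from pybrain.supervised.trainers import BackpropTrainer

trainer = BackpropTrainer(net, train_ds)

# train() runs one epoch and returns the average training error
error = trainer.train()
epochs = 1
while error > 0.001 and epochs < 1000:
    error = trainer.train()
    epochs += 1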
I am using the Doc2Vec model from gensim (4.1.2) python library.
I trained the model on my corpus of documents and used infer_vector(). Then I saved the model and tried to use infer_vector() on the same text, but I get a totally different vector. What is wrong?
Here is example of code:
doc2vec_model.infer_vector(["system", "response"])
array([-1.02667394e-03, -2.73817539e-04, -2.08510624e-04, 1.01583987e-03,
-4.99124289e-04, 4.82861622e-04, -9.00296785e-04, 9.18195175e-04,
....
doc2vec_model.save('model/doc2vec')
If I load saved model
fname = "model/doc2vec"
model = Doc2Vec.load(fname)
model.infer_vector(["system", "response"])
array([-1.07945153e-03, 2.80674692e-04, 4.65555902e-04, 6.55420765e-04,
7.65898672e-04, -9.16261168e-04, 9.15124183e-05, -5.18970715e-04,
....
First, there's a natural amount of variance from one run of infer_vector() to another, that's inherent to how the algorithm works. The vector will be at least a little different every time you run it, even without the save/load between. For more details, see:
Q12: I've used Doc2Vec infer_vector() on a single text, but the resulting vector is different each time. Is there a bug or have I made a mistake? (doc2vec inference non-determinism)
Second, a 2-word text is a minimal corner-case on which Doc2Vec is less likely to work very well. It's better on texts that are at least dozens of words long. In particular, both the training & inference are processes that work in proportion to the number of words in a text. So a 100-word text, that goes through inference to find a new vector, will get 50x more 'adjustment nudges' than a mere 2-word text - and thus tend to be somewhat more stable, run-to-run, than a tiny text. (As mentioned in the FAQ item linked above, increasing the epochs may help a bit, making a small text a little more like a longer text – but I would still expect any small text to be more at the mercy of vagaries of the random initialization, and random sampling during incremental adjustment, than a longer text.)
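As a rough illustration (assuming doc2vec_model is your trained model), you can measure that run-to-run variance directly and see whether more inference epochs tightens it:
import numpy as np

words = ["system", "response"]
v1 = doc2vec_model.infer_vector(words, epochs=50)  # more epochs than the default
v2 = doc2vec_model.infer_vector(words, epochs=50)

# cosine similarity close to 1.0 means the two inferences roughly agree
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))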
Finally, other problems in the model – like insufficient training data, overfitting (especially when the model is too large for the amount of training data), or other suboptimal parameters or errors during training – can make a model that's especially inconsistent from inference to inference.
The vectors from repeated inferences will never be identical, but they should be fairly close, when parameters are good & training is sufficient. (In fact, one indirect way to test if a model is doing anything useful is to check, at the end of training, how often a re-inferred vector for a training text is the top, or one of the few top, neighbors of the same text's vector from bulk training.)
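A sketch of that self-check, assuming model is a trained gensim 4.x Doc2Vec and train_corpus is the list of TaggedDocument objects it was trained on:
import collections

ranks = []
for doc in train_corpus:
    inferred = model.infer_vector(doc.words)
    sims = model.dv.most_similar([inferred], topn=len(model.dv))
    ranks.append([tag for tag, _ in sims].index(doc.tags[0]))

# a healthy model puts most re-inferred documents at rank 0 (closest to themselves)
print(collections.Counter(ranks))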
One possible error is too few epochs – the default of 5 inherited from Word2Vec is often too few, with 10 or 20 often being better. (Or, if you're struggling with minimal amounts of data, even more epochs can help eke out some results – though really, this algorithm needs lots of training data. Published results typically use at least tens of thousands, if not millions, of separate training docs, each at least dozens, but ideally hundreds or in some cases thousands, of words long.) With less data (and possibly too many vector_size dimensions for tiny training data), models will be 'looser' or more arbitrary when modeling new data.
Another very common error is to follow some of the bad tutorials online that call .train() many times in your own training loop, (mis-)managing the training alpha manually. This is almost never a good idea. See this other answer for more details on this common error:
My Doc2Vec code, after many loops/epochs of training, isn't giving good results. What might be wrong?
I'm currently working on an image classification task involving a large dataset of grayscale images of cartoons, and my CNN needs to classify them. At the moment my model has a test accuracy of about 88%, but I know a higher accuracy is possible.
I've tried:
improving / changing the actual model / architecture
using different hyperparameters
different loss functions from the pytorch libraries
a bunch of different transforms
different optimizers from torch.optim
I've also tried a bunch of the standard models included in torchvision.models and am still getting sub 90% accuracy on the test set.
Do I just need to keep trying the above things to squeeze out better accuracy, or are there other avenues I can try? I would really appreciate any suggestions. The only other thing I can think of would be making my own custom loss function specific to the dataset, but I'm not exactly sure how much that would help.
From what you've described, it sounds like it might be worth spending some time on data preparation. Here is a good article on how to do that for images. Some ideas you could try (sketched briefly after this list) are:
Resizing all your images to a fixed size
Subtracting mean pixel values, i.e. normalizing the dataset
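A brief sketch of both ideas with torchvision transforms (the target size and the mean/std values are placeholders -- compute the statistics from your own training set):
import torchvision.transforms as T

preprocess = T.Compose([
    T.Grayscale(num_output_channels=1),  # the images are grayscale cartoons
    T.Resize((128, 128)),                # fixed size for every image
    T.ToTensor(),
    T.Normalize(mean=[0.5], std=[0.5]),  # replace with your dataset's statistics
])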
I don't really know the context of what you're doing but I would also consider adding additional features that may be relevant and seeing if that helps.
I have a dataset with 11k instances containing 0s, 1s and -1s. I heard that deep learning can be applied to feature values, hence I applied the same to my dataset, but surprisingly it resulted in lower accuracy (<50%) compared to traditional machine learning algorithms (RF, SVM, ELM). Is it appropriate to apply deep learning algorithms to feature values for a classification task? Any suggestion is greatly appreciated.
First of all, deep learning isn't a mythical hammer you can throw at every problem and expect better results. It requires careful analysis of your problem, choosing the right method, crafting your network, and properly setting up your training, and only then, with a lot of luck, will you see significantly better results than classical methods.
From what you describe (and without any more details about your implementation), it seems to me that there could have been several things going wrong:
Your task is simply not a good fit for a neural network. Some tasks are still better solved with classical methods, since those methods manually account for patterns in your data, or distill advanced reasoning/domain knowledge into a prediction. You might not be directly aware of it, but sometimes neural networks are just overkill.
You don't describe how your 11000 instances are distributed with respect to the target classes, how big the input is, what kind of preprocessing you are performing for either method, etc, etc. Maybe your data is simply processed wrong, your training is diverging due to unfortunate parameter setups, or plenty of other things.
To expect a reasonable answer, you would have to share at least a bit of code regarding the implementation of your task, and parameters you are using for training.
I'm working on training a neural network model using Python and the Keras library.
My model's test accuracy is very low (60.0%) and I have tried a lot to raise it, but I couldn't. I'm using the DEAP dataset (32 participants in total) to train the model. The splitting technique that I'm using is a fixed one, as follows: 28 participants for training, 2 for validation and 2 for testing.
The model I'm using is as follows (a rough sketch of this configuration is shown after the list).
sequential model
Optimizer = Adam
With L2_regularizer, Gaussian noise, dropout, and Batch normalization
Number of hidden layers = 3
Activation = relu
Compile loss = categorical_crossentropy
initializer = he_normal
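For reference, a minimal Keras sketch of the configuration listed above -- the input dimension, number of classes, layer widths, and noise/dropout/regularization rates are placeholders, not the actual values used:
from tensorflow import keras
from tensorflow.keras import layers, regularizers

n_features, n_classes = 100, 4  # placeholders for the real dimensions

model = keras.Sequential([
    layers.Input(shape=(n_features,)),
    layers.GaussianNoise(0.1),
    layers.Dense(128, activation="relu", kernel_initializer="he_normal",
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(64, activation="relu", kernel_initializer="he_normal",
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(32, activation="relu", kernel_initializer="he_normal",
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.3),
    layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])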
Now, I'm using a train-test technique (a fixed one also) to split the data, and I got better results. However, I figured out that some of the participants are affecting the training accuracy in a negative way. Thus, I want to know if there is a way to study the effect of each participant's data on the accuracy (performance) of the model.
From my Starting deep learning hands-on: image classification on CIFAR-10 tutorial, in which I insist on keeping track of both:
global metrics (log-loss, accuracy),
examples (correctly and incorrectly classified cases).
The latter may help us tell which kinds of patterns are problematic, and on numerous occasions it has helped me change the network (or supplement the training data, when that was the issue).
An example of how it works (here with Neptune, though you can do it manually in a Jupyter Notebook, or using the TensorBoard image channel) is to plot the global metrics during training and then look at particular misclassified examples, along with the predicted probabilities.
Full disclosure: I collaborate with deepsense.ai, the creators of Neptune - Machine Learning Lab.
This is, perhaps, more broad an answer than you may like, but I hope it'll be useful nevertheless.
Neural networks are great. I like them. But the vast majority of top-performing, hyper-tuned models are ensembles; they use a combination of stats-on-crack techniques, neural networks among them. One of the main reasons for this is that some techniques handle some situations better. In your case, you've run into a situation for which I'd recommend exploring alternative techniques.
In the case of outliers, rigorous value analyses are the first line of defense. You might also consider using principal component analysis or linear discriminant analysis. You could also try to chase them out with density estimation or nearest neighbors. There are many other techniques for handling outliers, and hopefully you'll find the tools I've pointed to easily implemented (with help from their docs); sklearn tends to readily accept data prepared for Keras.
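As one concrete example of the nearest-neighbour/density route, scikit-learn's LocalOutlierFactor can screen suspicious rows before training (X here is a placeholder for your feature matrix):
from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(n_neighbors=20)
inlier_mask = lof.fit_predict(X) == 1  # fit_predict returns -1 for suspected outliers
X_clean = X[inlier_mask]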
I'm in need of an artificial neural network library (preferably in Python) for one (simple) task. I want to train it so that it can tell whether a thing is in an image. I would train it by feeding it lots of pictures and telling it whether it contains the thing I'm looking for or not:
These images contain this thing, return True (or probability of it containing the thing)
These images do not contain this thing, return False (or probability of it containing the thing)
Does such a library already exist? I'm fairly new to ANNs and image recognition; although I understand how they both work in principle, I find it quite hard to find an adequate library for this task, and even researching this field has proven to be kind of frustrating - any advice pointing me in the right direction is greatly appreciated.
There are several good Neural Network approaches in Python, including TensorFlow, Caffe, Lasagne, and sknn (Sci-kit Neural Network). sknn provides an easy, out-of-the-box solution, although in my opinion it is more difficult to customize and can be slow on large datasets.
One thing to consider is whether you want to use a CNN (Convolutional Neural Network) or a standard ANN. With an ANN you will most likely have to "unroll" your images into a vector, whereas a CNN expects the image to be a cube (if in color; a square otherwise).
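To make the distinction concrete, here is a tiny NumPy sketch with a hypothetical 64x64 RGB image:
import numpy as np

img = np.zeros((64, 64, 3), dtype=np.float32)  # hypothetical 64x64 RGB image

ann_input = img.reshape(-1)        # shape (12288,): "unrolled" vector for a plain ANN
cnn_input = img[np.newaxis, ...]   # shape (1, 64, 64, 3): spatial layout kept for a CNN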
Here is a good resource on CNNs in Python.
However, since you aren't really doing multiclass image classification (for which CNNs are the current gold standard) and are instead doing more of a single-object recognition task, you may consider a transformed image approach, such as one using the Histogram of Oriented Gradients (HOG).
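If you go the HOG route, scikit-image has a ready-made implementation; a small sketch (the file name and the cell/block sizes are just illustrative starting values):
from skimage import color, io
from skimage.feature import hog

image = color.rgb2gray(io.imread("example.png"))  # placeholder file name
features = hog(image, orientations=9,
               pixels_per_cell=(8, 8), cells_per_block=(2, 2))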
In any case, the accuracy of a Neural Network approach, especially when using CNNs, is highly dependent on successful hyperparameter tuning. Unfortunately, there isn't yet any kind of general theory on what hyperparameter values (number and size of layers, learning rate, update rule, dropout percentage, batch size, etc.) are optimal in a given situation. So be prepared to have a nice Training, Validation, and Test set setup in order to fit a robust model.
I am unaware of any library which can do this for you. I use Caffe a lot and can give you a workaround until you find a single library which can do it for you.
I hope you know about ImageNet and that Caffe has a trained model based on ImageNet.
Here is the idea:
Define what the object is. Say object = "laptop".
Use Caffe's ImageNet-trained model, and change the code to produce the output you want (you mentioned TRUE or FALSE) when the object is among the output labels.
Here is a link to the ImageNet tutorial which I wrote.
Here is what you might try:
Take a look here. It is a stripped down version of the ImageNet program which I used in a prediction engine.
In line 80 you'll get the top-1 predicted output label. In line 86 you'll get the top-5 predicted labels. Write a line of code to check whether object is in the output labels and return TRUE or FALSE accordingly.
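That check could be a one-liner along these lines (top_labels is a hypothetical name for the list of predicted label strings from the script):
def contains_object(top_labels, obj="laptop"):
    # returns True if the target object appears in any predicted label string
    return any(obj in label for label in top_labels)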
I understand that you are looking for a specific library, I will look for it, but this is something I would try out in the beginning.