Where does machine learning algorithme store the result? - python

I think this is kind of "blasphemy" for someone who comes from the AI world, but since I come from the world where we program and get a result, and there is the concept of storing something un memory, here is my question :
Machine learning works by iterations, the more there are iterations, the best our algorithm becomes, but after those iterations, there is a result stored somewhere ? because if I think as a programmer, if I re-run the program, I must store previous results somewhere, or they will be overwritten ? or I need to use an array for example to store my results.
For example, if I train my image recognition algorithme with a bunch of cats pictures data sets, what are the variables I need to add to my algorithme, so if I use it with an image library, it will always success everytime I find a cat, but I will use what? since there is nothing saved for my next step ?
All videos and tutorials I have seen, they only draw a graph as decision making visualy, and not applying something to use it in future program ?
For example, this example, kNN is used to teach how to detect a written digit, but where is the explicit value to use ?
https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/2_BasicModels/nearest_neighbor.py
NB: people clicking on close request or downvoting at least give a reason.

the more there are iterations, the best our algorithm becomes, but after those iterations, there is a result stored somewhere
What you're alluding to here is the optimization part.
However to optimize a model, we first have to represent it.
For example, if I'm creating a very simple linear model to predict house prices using its surface in square meters I might go for this model:
price = a * surface + b
That's the representation.
Now that you have represented the model, you want to optimize it, i.e. find the params a and b that minimize the prediction error.
there is a result stored somewhere ?
In the above, we say that we have learned the params or weights a and b.
That's what you keep, the weights which come from optimization (also called training) and of course the model itself.

I think there is some confusion. Let's clear it up.
Machine Learning models usually have parameters, and these parameters are trainable. This means a training algorithm find the "right" values of these parameters in order to properly work for a given task.
This is the learning part. The actual parameter values are "inferred" from training data.
What you would call the result of the training process is a model. The model is represented by formulas with parameters, and these parameters must be stored. Typically when you use a ML/DL framework (like scikit-learn or Keras), the parameters are stored alongside some information about the type of model, so it can be reconstructed at runtime.

Related

For a binary classification model based on multiple continuous variable what model should be used?

I am working on a waste water data. The data is collected every 5 min. This is the sample data.
The threshold of the individual parameters is provided. My question is what kind of models should I go for to classify it as usable or not useable and also output the anomaly because of which it is unusable (if possible since it is a combination of the variables). The column for yes/no is yet to be and will be provided to me.
The other question I have is how do I keep it running since the data is collected every 5 minutes?
Your data and use case seem fit for a decision tree classifier. Decision trees are easy to train and interpret (which is one of your requirements, since you want to know why a given sample was classified as usable or not usable), do not require large amounts of labeled data, can be trained and used for prediction on most haedware, and are well suited for structured data with no missing values and low dimensionality. They also work well without normalizing your variables.
Scikit learn is super mature and easy to use, so you should be able to get something working without too much trouble.
As regards time, I'm not sure how you or your employee will be taking samples, so I don't know. If you will be getting and reading samples at that rate, using your model to label data should not be a problem, but I'm not sure if I understood your situation.
Note stackoverflow is aimed towards questions of the form "here's my code, how do I fix this?", and not so much towards general questions such as this. There are other stackexhange sites specially dedicated to statistics and data science. If you don't find here what you need, maybe you can try those other sites!

NN: outputting a probability density function instead of a single value

This might sound silly but I'm just wondering about the possibility of modifying a neural network to obtain a probability density function rather than a single value when you are trying to predict a scalar. I know that when you are trying to classify images or words you can get a probability for each class, so I'm thinking there might be a way to do something similar with a continuous value and plot it. (Similar to the posterior plot with bayesian optimisation)
Such details could be interesting when deploying a model for prediction and could provide more flexibility than a single value.
Does anyone knows a way to obtain such an output?
Thanks!
Ok So I found a solution to this issue, though it adds a lot of overhead.
Initially I thought the keras callback could be of use but despite the fact that it provided the flexibility that I wanted i.e.: train only on test data or only a subset and not for every test. It seems that callbacks are only given summary data from the logs.
So the first step what to create a custom metric that would do the same calculation as any metric with the 2 arrays ( the true value and the predicted value) and once those calculations are done, output them to a file for later use.
Then once we found a way to gather all the data for every sample, the next step was to implement a method that could give a good measure of error. I'm currently implementing a handful of methods but the most fitting one seem to be bayesian bootstraping ( user lmc2179 has a great python implementation). I also implemented ensemble methods and gaussian process as alternatives or to use as other metrics and some other bayesian methods.
I'll try to find if there are internals in keras that are set during the training and testing phases to see if I can set a trigger for my metric. The main issue with using all the data is that you obtain a lot of unreliable data points at the start since the network is not optimized. Some data filtering could be useful to remove a good amount of those points to improve the results of the error predictors.
I'll update if I find anything interesting.

Classification of custom objects

I need to write a program that, given an object with certain attributes, it knows how to classify it. It should know how to classify new objects by being trained with a list of known objects with known attributes.
For example, I have object A with the following attributes: a=10 and b=1. I also trained the program so that it knows that values between 5..15 for a and 0..2 for b classify the given object as label1.
As the program evolves, I need to further train it with known data so that the attribute intervals will get more accurate (hence the classification).
Now, I haven't got any experience with machine learning or any of this kind and I would like to know how should I start with this. I've seen plenty of tutorials, but only for text classification. And only for 2-ways classification (that is, positive or negative, yes or no...only two values to choose from). I would have 5-6 labels to start with and their number will soon increase. Also, the object attributes are integers.
Any tip is highly appreciated!
Machine learning is a very broad field, so the first step would be knowing exactly what you're looking for and familiarizing yourself with the subproblem you're trying to solve.
By your description, you're trying to solve a classification problem in a supervised learning approach.
I'll paraphrase a bit from here:
The classification problem consists in identifying to which class a observation belongs to.
Supervised learning is a way of "teaching" a machine. Basically, an algorithm is trained through examples (i.e.: this particular object belongs to class X). After training, the machine should be able to apply its aquired knowledge to new data.
The k-NN algorithm is one of the simplest algorithms for solving this kind of problem. I suggest you familiarize yourself with it.
You have an implementation of k-NN in scipy. Here's a link to a tutorial on using it.
Now, answering your specific questions:
only for 2-ways classification (that is, positive or negative, yes or
no...only two values to choose from)
k-NN can handle any (finite) number of classes, so you're clear
Also, the object attributes are integers
K-NN usually uses a continuous space - so you'll have to convert those to floats.
Mapping the attributes values into points in the algorithm space is not a trivial problem (see Data pre-processing, especially the articles on normalization, feature extraction and selection)

Using Pybrain to detect malicious PDF files

I'm trying to make an ANN to classify a PDF file as either malicious or clean, by utilising the 26,000 PDF samples (both clean and malicious) found on contagiodump. For each PDF file, I used PDFid.py to parse the file and return a vector of 42 numbers. The 26000 vectors are then passed into pybrain; 50% for training and 50% for testing. This is my source code:
https://gist.github.com/sirpoot/6805938
After much tweaking with the dimensions and other parameters I managed to get a false positive rate of about 0.90%. This is my output:
https://gist.github.com/sirpoot/6805948
My question is, is there any explicit way for me to decrease the false positive rate further? What do I have to do to reduce the rate to perhaps 0.05%?
There are several things you can try to increase the accuracy of your neural network.
Use more of your data for training. This will permit the network to learn from a larger set of training samples. The drawback of this is that having a smaller test set will make your error measurements more noisy. As a rule of thumb, however, I find that 80%-90% of your data can be used in the training set, with the rest for test.
Augment your feature representation. I'm not familiar with PDFid.py, but it only returns ~40 values for a given PDF file. It's possible that there are many more than 40 features that might be relevant in determining whether a PDF is malicious, so you could conceivably use a different feature representation that includes more values to increase the accuracy of your model.
Note that this can potentially involve a lot of work -- feature engineering is difficult! One suggestion I have if you decide to go this route is to look at the PDF files that your model misclassifies, and try to get an intuitive idea of what went wrong with those files. If you can identify a common feature that they all share, you could try adding that feature to your input representation (giving you a vector of 43 values) and re-train your model.
Optimize the model hyperparameters. You could try training several different models using training parameters (momentum, learning rate, etc.) and architecture parameters (weight decay, number of hidden units, etc.) chosen randomly from some reasonable intervals. This is one way to do what is called "hyperparameter optimization" and, like feature engineering, it can involve a lot of work. However, unlike feature engineering, hyperparameter optimization can largely be done automatically and in parallel, provided you have access to a lot of processing cores.
Try a deeper model. Deep models have become quite "hot" in the machine learning literature recently, especially for speech processing and some types of image classification. By using stacked RBMs, a second-order learning method (PDF), or a different nonlinearity like a rectified linear activation function, then you can add multiple layers of hidden units to your model, and sometimes this will help improve your error rate.
These are the ones that come to mind right off the bat. Good luck !
Let me first say I am in no ways an expert in Neural Networks. But I played with pyBrain once and I used the .train() method in a while error < 0.001 loop to get the error rate I wanted. So you can try using all of them for training with that loop and test it with other files.

Can a neural network recognize a screen and replicate a finite set of actions?

I learned, that neural networks can replicate any function.
Normally the neural network is fed with a set of descriptors to its input neurons and then gives out a certain score at its output neuron. I want my neural network to recognize certain behaviours from a screen. Objects on the screen are already preprocessed and clearly visible, so recognition should not be a problem.
Is it possible to use the neural network to recognize a pixelated picture of the screen and make decisions on that basis? The amount of training data would be huge of course. Is there way to teach the ANN by online supervised learning?
Edit:
Because a commenter said the programming problem would be too general:
I would like to implement this in python first, to see if it works. If anyone could point me to a resource where i could do this online-learning thing with python, i would be grateful.
I would suggest
http://www.neuroforge.co.uk/index.php/getting-started-with-python-a-opencv
http://docs.opencv.org/doc/tutorials/ml/table_of_content_ml/table_of_content_ml.html
http://blog.damiles.com/2008/11/the-basic-patter-recognition-and-classification-with-opencv/
https://github.com/bytefish/machinelearning-opencv
openCV is basically an image processing library but also has some amazing helper classes that you you can use for almost any task. Its machine learning module is pretty easy to use and you can go through the source to see explanation and background theory about each function.
You could also use a pure python machine learning library like:
http://scikit-learn.org/stable/
But, before you feed in the data from your screen (i'm assuming thats in pixels?) to your ANN or SVM or whatever ML algorithm you choose, you need to perform "Feature Extraction" on your data. (which are the objects on the screen)
Feature Extraction can be thought of like representing the same data on the screen but with fewer numbers so i have less numbers to give to my ANN. You need to experiment with different features before you find a combination that works well for your particular scenario. a sample one could look something like this:
[x1,y1,x2,y2...,col]
This is basically a list of edge points that represent the area your object is in. a sort of ROI (Region of Interest) and perform egde detection, color detection and also extract any other relevant characteristics. The important thing is that now all your objects, their shape/color information is represented by a number of these lists, one for each object detected.
This is the data that can be provided as input to the neural network. but you'll have to define some meaningfull output parameters depending on your specific problem statements before you can train/test your system of course.
Hope this helps.
This is not entirely correct.
A 3-layer feedforward MLP can theoretically replicate any CONTINUOUS function.
If there are discontinuities, then you need a 4th layer.
Since you are dealing with pixelated screens and such, you probably would need to consider a fourth layer.
Finally, if you are looking at circular shapes, etc., than a radial basis function (RBF) network may be more suitable.

Categories

Resources