DataSet normalize range of input values - python

I'm running some experiments with neural networks in TensorFlow. The release notes for the latest version say DataSet is henceforth the recommended API for supplying input data.
In general, when taking numeric values from the outside world, the range of values needs to be normalized; if you plug in raw numbers like length, mass, velocity, date or time, the resulting problem will be ill-conditioned; it's necessary to check the dynamic range of values and normalize to the range (0,1) or (-1,1).
This can of course be done in raw Python. However, DataSet provides a number of data transformation features and encourages their use, on the theory that the resulting code will not only be easier to maintain, but run faster. That suggests there should also be a built-in feature for normalization.
Looking over the documentation at https://www.tensorflow.org/programmers_guide/datasets, however, I'm not seeing any mention of such a feature. Am I missing something? What is the recommended way to do this?

My understanding of the main idea behind TensorFlow datasets is that complex pre-processing is not directly applicable, because tf.data.Dataset is specifically designed to stream very large amounts of data, more precisely tensors:
A Dataset can be used to represent an input pipeline as a collection
of elements (nested structures of tensors) and a "logical plan" of
transformations that act on those elements.
The fact that tf.data.Dataset operates with tensors means that obtaining any particular statistic over the data, such as min or max, requires a complete tf.Session and at least one run through the whole pipeline. The following sample lines:
iterator = dataset.make_one_shot_iterator()
batch_x, batch_y = iterator.get_next()
... which are designed to provide the next batch fast, regardless of the size of the dataset, would stop the world until the first batch is ready if the dataset were responsible for pre-processing. That's why the "logical plan" includes only local transformations, which ensures the data can be streamed and, in addition, allows transformations to be done in parallel.
This doesn't mean it's impossible to implement normalization with tf.data.Dataset; it just feels like it was never designed to do so and, as a result, it will look ugly (though I can't be absolutely sure of that). However, note that batch normalization fits into this picture perfectly, and it's one of the "nice" options I see. Another option is to do simple pre-processing in numpy and feed the result of that into tf.data.Dataset.from_tensor_slices. This doesn't make the pipeline much more complicated, and it doesn't stop you from using tf.data.Dataset at all.
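For example, a minimal sketch of that second option might look like the following (the array names, shapes, and the TF 1.x-style iterator are assumptions used only for illustration):

import numpy as np
import tensorflow as tf

# Hypothetical raw data: rows of numeric features plus integer labels.
features = np.random.uniform(0, 1000, size=(10000, 4)).astype(np.float32)
labels = np.random.randint(0, 2, size=(10000,)).astype(np.int64)

# Normalize to the range (0, 1) in plain numpy, before building the pipeline.
f_min = features.min(axis=0)
f_max = features.max(axis=0)
features = (features - f_min) / (f_max - f_min)

# Stream the already-normalized arrays through tf.data.
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.shuffle(buffer_size=1000).batch(32)

iterator = dataset.make_one_shot_iterator()
batch_x, batch_y = iterator.get_next()

Note that if the true min/max aren't known in advance, they have to be computed over the data first, which is exactly the kind of global statistic the streaming pipeline is not designed for.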

Related

Why transformations go into the dataset and not into the NN itself in Pytorch?

I am new to PyTorch and I am now following the tutorial on transforms. I see that the transformations are configured in the dataset object. I am wondering, however, why they aren't configured within the neural network itself. My naive point of view is that the transformations should in any case be the most external layers of the network, in the same way the eye comes before the brain to transform light into signals for the brain; you don't modify the world to adapt it to the brain.
So, is there any technical reason for putting the transformations in the dataset instead of the net? Is it a good/bad practice to put the transformations within my neural network instead? Why?
These are some of the reasons that explain why one would do this.
We would like to use the same NN code for training as well as testing / inference. Typically, during inference we don't want to apply any transformations, and hence one might want to keep them out of the network. However, you may argue that one can simply use the model.training flag to skip the transformations.
Most of the transformations happen on the CPU. Doing the transformations in the dataset makes it easy to use multiprocessing and prefetching: the dataset code can prefetch the data, transform it, and keep it ready to be fed into the NN in a separate thread. If we instead do it inside the forward function, the GPU will sit idle during the transformations (as these happen on the CPU), likely leading to a longer training time.
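A minimal sketch of the usual pattern (the dataset, the particular transforms, and the worker count here are illustrative assumptions):

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# The transformations live in the dataset, not in the network.
transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),          # CPU-side augmentation
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])

train_set = datasets.MNIST(root="data", train=True, download=True,
                           transform=transform)

# num_workers > 0 lets separate worker processes transform and prefetch
# the next batches while the GPU is busy with the current one.
train_loader = DataLoader(train_set, batch_size=64, shuffle=True,
                          num_workers=4, pin_memory=True)

for images, targets in train_loader:
    # images arrive here already transformed and batched
    pass

Moving the same transforms into forward() would serialize them with the GPU work instead of overlapping them.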

Time series forecasting model.predict()

I do not understand how model.predict(...) works on a time series forecasting problem. I usually use it with a CNN and it is pretty straightforward, but for time series I don't understand what it returns.
For example, I am currently doing an exercise where I have to forecast power consumption from historical data using an LSTM. I succeeded in training my model, but when I want to know what the power consumption will be tomorrow (so no data except past values) I don't know what input to use.
Traditional ML algorithms, which you might be more used to, generally expect the data in a 2D structure: one row per example and one column per feature.
For sequential data, such as a stream of timed events associated with each user, it's also possible to create a lagged 2D dataset, where the history of different features for each ID is aligned into a single row, with one column per feature-lag combination (feature 1 at time t, feature 1 at time t-1, feature 2 at time t, and so on).
This can be a good way to work because, once your data is in the correct shape, you can use it with models that are fast to set up and train. However, models using features engineered this way generally don't have any capacity to "learn" anything about the natural sequence of the data. To something like a tree-based ensemble model receiving this format, feature 1 at time t and at time t-1 in the example above are treated completely independently, and this can severely limit the model's predictive power.
There are types of deep learning architecture specifically designed for modelling sequence data, called recurrent neural networks (RNNs). Two of the most popular cells to use in these are long short-term memory (LSTM) and gated recurrent units (GRU). There's a good post on how LSTM cells work here, but the TL;DR is that they have a structure which allows them to learn from sequences of data.
Cells like LSTM expect a 3D tensor of input data. We arrange it so that one axis has the data features along it, a second axis has the sequence steps (like time ticks), and a third axis has each of the different examples we want to predict a single "y" value for stacked along it. Using the same type of dataset as the lagged example above, it would look something like the sketch below.
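A minimal numpy/Keras sketch of that layout (the series, window length, and model here are made-up assumptions, not the original exercise):

import numpy as np
from tensorflow.keras import layers, models

# Hypothetical univariate series, e.g. hourly power consumption readings.
series = np.sin(np.linspace(0, 100, 2000)) + np.random.normal(0, 0.1, 2000)

window = 24  # use the previous 24 steps to predict the next value
X, y = [], []
for i in range(len(series) - window):
    X.append(series[i:i + window])
    y.append(series[i + window])
X = np.array(X)[..., np.newaxis]   # shape: (examples, timesteps, features)
y = np.array(y)

model = models.Sequential([
    layers.LSTM(32, input_shape=(window, 1)),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=2, verbose=0)

# To forecast "tomorrow", feed the model the most recent window, shaped
# exactly like the training examples: (1, timesteps, features).
last_window = series[-window:].reshape(1, window, 1)
next_value = model.predict(last_window)

Here model.predict returns an array of shape (1, 1): one forecast per input sequence passed in.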
The ability to learn patterns in sequences of data like this is particularly beneficial for both time series and text data, which are naturally ordered.
To return to your original question: when you want to predict something in your test set, you'll need to pass the model sequences represented just like the ones it was trained on (this is a reasonably good rule of supervised learning in general). For example, if the model is trained on data like the last example above, you'll need to pass it a 2D example (timesteps by features) for each ID you want to make a prediction for.
You should explore the way the original training data is represented and make sure you understand it well, as you'll need to create the same shape of data to make predictions. If you have your training data in a pandas DataFrame or numpy arrays, X_train.shape is a great place to start to see what the dimensionality is, and then you can inspect entries along each axis until you get a good feel for the data it contains.

Training batches: which Tensorflow method is the right one?

I'm trying to train a very simple neural network to classify samples of data where some classes necessarily succeed others - this is why I decided to let the input data enter the network in batches. Using TensorFlow, there are apparently multiple ways of declaring batches, like tf.data.Dataset.batch (which I currently use for training with the Adam optimizer) and tf.train.batch. What is the difference? Should the methods be used together, or are they exclusive? In the latter case, which one should I prefer?
tf.train.* is an older API, more complex and more error-prone than tf.data.* (you need to take care of queues, thread runners, the coordinator, etc. yourself). For your stated purpose (batching data and feeding it to a model) the two are functionally equivalent, in that both achieve your goal. However, you should consider using tf.data, as it is both simpler to use and the currently recommended way to handle input datasets.
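For example, batching with tf.data typically looks like this (a TF 1.x-style sketch; the arrays, batch size, and shuffle buffer are assumptions):

import numpy as np
import tensorflow as tf

# Hypothetical training arrays.
x = np.random.rand(1000, 10).astype(np.float32)
y = np.random.randint(0, 3, size=(1000,)).astype(np.int64)

dataset = (tf.data.Dataset.from_tensor_slices((x, y))
           .shuffle(buffer_size=1000)
           .repeat()          # loop over the data for multiple epochs
           .batch(32))

iterator = dataset.make_one_shot_iterator()
batch_x, batch_y = iterator.get_next()
# batch_x and batch_y can now be fed to the model and the Adam optimizer.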

Where does machine learning algorithme store the result?

I think this is kind of "blasphemy" for someone who comes from the AI world, but since I come from the world where we program and get a result, and where there is the concept of storing something in memory, here is my question:
Machine learning works by iterations: the more iterations there are, the better our algorithm becomes. But after those iterations, is there a result stored somewhere? Because, thinking as a programmer, if I re-run the program I must have stored the previous results somewhere, or they will be overwritten; or do I need to use an array, for example, to store my results?
For example, if I train my image recognition algorithm on a bunch of cat picture data sets, what are the variables I need to add to my algorithm so that, when I use it on an image library, it always succeeds in finding a cat? What will I use, since there is nothing saved for my next step?
All the videos and tutorials I have seen only draw a graph as a visual aid to decision making, and never apply something that can be used in a future program.
For example, in this example, kNN is used to teach how to detect a handwritten digit, but where is the explicit value to use?
https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/2_BasicModels/nearest_neighbor.py
NB: to people voting to close or downvoting, please at least give a reason.
the more iterations there are, the better our algorithm becomes, but after those iterations, is there a result stored somewhere
What you're alluding to here is the optimization part.
However, to optimize a model, we first have to represent it.
For example, if I'm creating a very simple linear model to predict house prices using its surface in square meters I might go for this model:
price = a * surface + b
That's the representation.
Now that you have represented the model, you want to optimize it, i.e. find the params a and b that minimize the prediction error.
is there a result stored somewhere?
In the above, we say that we have learned the params or weights a and b.
That's what you keep: the weights, which come from the optimization (also called training), and of course the model itself.
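As a minimal illustration of what actually gets stored, here is a sketch that fits that linear model on made-up data using numpy's least-squares solver (not any particular ML library):

import numpy as np

# Hypothetical training data: surface in square meters -> price.
surface = np.array([30.0, 50.0, 75.0, 100.0, 120.0])
price = np.array([90000.0, 150000.0, 220000.0, 310000.0, 360000.0])

# Fit price = a * surface + b by least squares.
A = np.vstack([surface, np.ones_like(surface)]).T
a, b = np.linalg.lstsq(A, price, rcond=None)[0]

# These two numbers ARE the learned result; saving them is enough to make
# predictions later without re-training.
np.save("model_params.npy", np.array([a, b]))

a, b = np.load("model_params.npy")
predicted = a * 85.0 + b   # predicted price of an 85 m^2 house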
I think there is some confusion. Let's clear it up.
Machine learning models usually have parameters, and these parameters are trainable. This means a training algorithm finds the "right" values of these parameters so that the model works properly for a given task.
This is the learning part. The actual parameter values are "inferred" from training data.
What you would call the result of the training process is a model. The model is represented by formulas with parameters, and these parameters must be stored. Typically, when you use an ML/DL framework (like scikit-learn or Keras), the parameters are stored alongside some information about the type of model, so it can be reconstructed at runtime.
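For instance, with scikit-learn the whole trained model (its type plus the learned parameters) can be persisted to disk and reloaded later; a sketch with made-up data and a made-up file name:

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[30.0], [50.0], [75.0], [100.0]])          # surfaces
y = np.array([90000.0, 150000.0, 220000.0, 310000.0])    # prices

model = LinearRegression().fit(X, y)

# The fitted parameters live on the model object...
print(model.coef_, model.intercept_)

# ...and the whole model can be saved and reconstructed at runtime.
joblib.dump(model, "house_price_model.joblib")
restored = joblib.load("house_price_model.joblib")
print(restored.predict([[85.0]]))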

Using Pybrain to detect malicious PDF files

I'm trying to make an ANN to classify a PDF file as either malicious or clean, by utilising the 26,000 PDF samples (both clean and malicious) found on contagiodump. For each PDF file, I used PDFid.py to parse the file and return a vector of 42 numbers. The 26,000 vectors are then passed into pybrain: 50% for training and 50% for testing. This is my source code:
https://gist.github.com/sirpoot/6805938
After much tweaking of the dimensions and other parameters, I managed to get a false positive rate of about 0.90%. This is my output:
https://gist.github.com/sirpoot/6805948
My question is, is there any explicit way for me to decrease the false positive rate further? What do I have to do to reduce the rate to perhaps 0.05%?
There are several things you can try to increase the accuracy of your neural network.
Use more of your data for training. This will permit the network to learn from a larger set of training samples. The drawback is that having a smaller test set will make your error measurements noisier. As a rule of thumb, however, I find that 80%-90% of your data can be used in the training set, with the rest for testing.
Augment your feature representation. I'm not familiar with PDFid.py, but it only returns ~40 values for a given PDF file. It's possible that there are many more than 40 features that might be relevant in determining whether a PDF is malicious, so you could conceivably use a different feature representation that includes more values to increase the accuracy of your model.
Note that this can potentially involve a lot of work -- feature engineering is difficult! One suggestion I have if you decide to go this route is to look at the PDF files that your model misclassifies, and try to get an intuitive idea of what went wrong with those files. If you can identify a common feature that they all share, you could try adding that feature to your input representation (giving you a vector of 43 values) and re-train your model.
Optimize the model hyperparameters. You could try training several different models with training parameters (momentum, learning rate, etc.) and architecture parameters (weight decay, number of hidden units, etc.) chosen randomly from some reasonable intervals; a minimal random-search sketch follows this list. This is one way to do what is called "hyperparameter optimization" and, like feature engineering, it can involve a lot of work. However, unlike feature engineering, hyperparameter optimization can largely be done automatically and in parallel, provided you have access to a lot of processing cores.
Try a deeper model. Deep models have become quite "hot" in the machine learning literature recently, especially for speech processing and some types of image classification. By using stacked RBMs, a second-order learning method (PDF), or a different nonlinearity like a rectified linear activation function, you can add multiple layers of hidden units to your model, and sometimes this will help improve your error rate.
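A rough sketch of such a random search (train_and_score is a hypothetical stand-in for whatever training and evaluation routine you use; the parameter ranges are arbitrary):

import random

def train_and_score(learning_rate, momentum, hidden_units, weight_decay):
    """Hypothetical placeholder: train a network with these settings and
    return its error on a validation set."""
    # ... build, train, and evaluate the model here ...
    return random.random()  # dummy value so the sketch runs as-is

best = None
for trial in range(50):
    params = {
        "learning_rate": 10 ** random.uniform(-4, -1),
        "momentum": random.uniform(0.5, 0.99),
        "hidden_units": random.randint(10, 200),
        "weight_decay": 10 ** random.uniform(-6, -2),
    }
    score = train_and_score(**params)
    if best is None or score < best[0]:
        best = (score, params)

print("best error:", best[0], "with params:", best[1])

Each trial is independent of the others, so these runs can easily be spread across many cores or machines.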
These are the ones that come to mind right off the bat. Good luck!
Let me first say that I am in no way an expert in neural networks. But I played with PyBrain once, and I used the .train() method in a while error > 0.001 loop to get the error rate I wanted. So you can try using all of them for training with that loop, and test it with other files.
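A minimal PyBrain sketch of that loop (the network sizes, the threshold, and the random stand-in data are assumptions; 42 inputs matches the PDFid feature vector from the question):

import numpy as np
from pybrain.datasets import SupervisedDataSet
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.tools.shortcuts import buildNetwork

# 42 input features per PDF, one hidden layer, one output (malicious or clean).
net = buildNetwork(42, 20, 1)
ds = SupervisedDataSet(42, 1)

# Hypothetical stand-in for loading the real PDFid vectors and labels.
for _ in range(100):
    ds.addSample(np.random.rand(42), (np.random.randint(0, 2),))

trainer = BackpropTrainer(net, ds)

error = float("inf")
epochs = 0
while error > 0.001 and epochs < 1000:   # keep training until the error is low enough
    error = trainer.train()              # one epoch; returns the training error
    epochs += 1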
