Is it possible to use HyperDriveStep with time-series cross-validation?

I want to deploy a stacked model to Azure Machine Learning Service. The architecture of the solution consists of three models and one meta-model.
The data is time-series data.
I'd like the model to automatically re-train based on some schedule. I'd also like to re-tune hyperparameters during each re-training.
AML Service offers HyperDriveStep class that can be used in the pipeline for automatic hyperparameter optimization.
Is it possible to use HyperDriveStep with time-series CV, and if so, how?
I checked the documentation, but haven't found a satisfying answer.

AzureML HyperDrive is a black-box optimizer, meaning that it will just run your code with different parameter combinations based on the configuration you choose. It supports random and Bayesian sampling and has different policies for early stopping (see here for the relevant docs and here for an example; HyperDrive is towards the end of the notebook).
The only requirement is that your training is launched from a script that accepts --param style command-line parameters. As long as that holds, you could optimize the parameters for each of your models individually and then tune the meta-model, or you could tune them all in one run. Which is better mainly depends on the size of the parameter space and the amount of compute you want to use (or pay for).
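For example, here is a minimal sketch of the kind of training script HyperDrive can drive, where the script itself performs time-series CV with scikit-learn's TimeSeriesSplit and logs the mean fold metric. The file name, parameter names, and choice of model below are placeholders, not your actual solution's code:

# train.py - a rough sketch of a HyperDrive-compatible training script that
# does time-series cross-validation internally and logs the mean CV metric.
# Assumes the data is a CSV sorted by time with a "target" column; the file
# name, parameter names, and model are illustrative placeholders.
import argparse

import pandas as pd
from azureml.core import Run
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

parser = argparse.ArgumentParser()
parser.add_argument("--learning_rate", type=float, default=0.1)
parser.add_argument("--n_estimators", type=int, default=100)
parser.add_argument("--data_path", type=str, default="data.csv")
args = parser.parse_args()

run = Run.get_context()

df = pd.read_csv(args.data_path)
X, y = df.drop(columns="target"), df["target"]

# Expanding-window folds: each fold trains on the past and validates on the future.
scores = []
for train_idx, valid_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = GradientBoostingRegressor(
        learning_rate=args.learning_rate, n_estimators=args.n_estimators
    )
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    scores.append(mean_absolute_error(y.iloc[valid_idx], model.predict(X.iloc[valid_idx])))

# HyperDrive only sees what you log; this name must match primary_metric_name
# in the HyperDriveConfig.
run.log("mean_cv_mae", sum(scores) / len(scores))

On the pipeline side you would wrap this script in a ScriptRunConfig (or an Estimator, depending on your SDK version), build a HyperDriveConfig whose sampling space covers --learning_rate and --n_estimators and whose primary_metric_name is "mean_cv_mae", and hand that config to a HyperDriveStep. HyperDrive never needs to know that the metric came from time-series CV.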

Related

LightGBM: train() vs update() vs refit()

I'm implementing LightGBM (Python) into a continuous learning pipeline. My goal is to train an initial model and update the model (e.g. every day) with newly available data.
Most examples load an already trained model and apply train() once again:
updated_model = lightgbm.train(params=last_model_params, train_set=new_data, init_model=last_model)
However, I'm wondering if this is actually the correct way to approach continuous learning within the LightGBM library, since the number of fitted trees (num_trees()) grows by n_estimators with every application of train(). To my understanding, a model update should take an initial model definition (under a given set of model parameters) and refine it without ever growing the number of trees or the size of the model definition.
I find the documentation regarding train(), update() and refit() not particularly helpful. What would be considered the right approach to implement continuous learning with LightGBM?
In lightgbm (the Python package for LightGBM), these entrypoints you've mentioned do have different purposes.
The main lightgbm model object is a Booster. A fitted Booster is produced by training on input data. Given an initial trained Booster...
Booster.refit() does not change the structure of an already-trained model. It just updates the leaf counts and leaf values based on the new data. It will not add any trees to the model.
Booster.update() will perform exactly 1 additional round of gradient boosting on an existing Booster. It will add at most 1 tree to the model.
train() with an init_model will perform gradient boosting for num_iterations additional rounds. It also allows for lots of other functionality, like custom callbacks (e.g. to change the learning rate from iteration-to-iteration) and early stopping (to stop adding trees if performance on a validation set fails to improve). It will add up to num_iterations trees to the model.
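As a rough illustration of how those three calls differ in model size, here is a sketch on synthetic data; the shapes and parameter values are purely illustrative:

# Contrast refit(), train(init_model=...) and update() on synthetic data.
import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(42)
X_old, y_old = rng.normal(size=(1000, 10)), rng.normal(size=1000)
X_new, y_new = rng.normal(size=(200, 10)), rng.normal(size=200)

params = {"objective": "regression", "num_leaves": 7, "learning_rate": 0.05, "verbosity": -1}
train_ds = lgb.Dataset(X_old, label=y_old)

# keep_training_booster=True keeps the returned Booster usable for further update() calls.
booster = lgb.train(params, train_ds, num_boost_round=100, keep_training_booster=True)
print(booster.num_trees())  # 100

# 1) refit(): same tree structure, leaf values re-estimated on the new data -> still 100 trees.
refit_booster = booster.refit(X_new, y_new)
print(refit_booster.num_trees())  # 100

# 2) train(init_model=...): continues boosting for num_boost_round more rounds -> up to 150 trees.
continued = lgb.train(params, lgb.Dataset(X_new, label=y_new), num_boost_round=50, init_model=booster)
print(continued.num_trees())  # 150

# 3) update(): exactly one more boosting round on the newly-arrived data -> at most 101 trees.
booster.update(train_set=lgb.Dataset(X_new, label=y_new, reference=train_ds))
print(booster.num_trees())  # 101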
What would be considered the right approach to implement continuous learning with LightGBM?
There are trade-offs involved in this choice and no one of these is the globally "right" way to achieve the goal "modify an existing model based on newly-arrived data".
Booster.refit() is the only one of these approaches that meets your definition of "refine [the model] without ever growing the amount of trees/size of the model definition". But it could lead to drastic changes in the predictions produced by the model, especially if the batch of newly-arrived data is much smaller than the original training data, or if the distribution of the target is very different.
Booster.update() is the simplest interface for this, but a single iteration might not be enough to get most of the information from the newly-arrived data into the model. For example, if you're using fairly shallow trees (say, num_leaves=7) and a very small learning rate, even newly-arrived data that is very different from the original training data might not change the model's predictions by much.
train(init_model=previous_model) is the most flexible and powerful option, but it also introduces more parameters and choices. If you choose to use train(init_model=previous_model), pay attention to the parameters num_iterations and learning_rate. Lower values of these parameters will decrease the impact of newly-arrived data on the trained model, while higher values will allow a larger change to the model. Finding the right balance between them is a concern for your evaluation framework.

How to aggregate datasets for Policy Aggregation in Python?

I am trying to run policy aggregation for a particular problem and I am confused about how to aggregate the two policies in Python. In policy aggregation, you start with some initial policy trained on a dataset; as you train, you collect expert (oracle) actions, train a separate policy on those, and then combine the two policies with a particular weight distribution to get a new policy used to predict states for the next training iteration.
Currently, I am using VotingClassifier with soft voting. The issue here is that, first, it uses the predictions from the individual models and doesn't really create a new model, and second, this new model has to be fit again on some dataset. If I fit it on the initial or the expert dataset, the model doesn't learn; if I fit it on the aggregated dataset, it does improve, but then the approach resembles another method, DAgger. I am not sure how the papers that use policy aggregation actually implement it.
Is there any other approach I can try to merge these two policies, which are trained on different datasets but that have the same features and classes (same X and y)?
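For concreteness, the kind of combination I have in mind would look more like a weighted mixture of the two fitted policies' class probabilities than a re-fitted VotingClassifier; something like this rough sketch (the class and attribute names are just illustrative):

import numpy as np

class MixturePolicy:
    """Blend two already-fitted policies, with weight beta on the old one."""

    def __init__(self, pi_old, pi_new, beta=0.5):
        self.pi_old, self.pi_new, self.beta = pi_old, pi_new, beta

    def predict_proba(self, X):
        # Both policies must expose predict_proba over the same set of classes.
        return self.beta * self.pi_old.predict_proba(X) + (1.0 - self.beta) * self.pi_new.predict_proba(X)

    def predict(self, X):
        return self.pi_old.classes_[np.argmax(self.predict_proba(X), axis=1)]

Nothing here is ever re-fitted, so neither dataset has to be revisited, but I'm not sure whether this is what the papers actually do.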

What's the purpose of the different kinds of TensorFlow SignatureDefs?

It seems like the Predict SignatureDef encompasses all the functionality of the Classification and Regression SignatureDefs. When would there be an advantage to using Classification or Regression SignatureDefs rather than just using Predict for everything? We're looking to keep complexity down in our production environment, and if it's possible to use just Predict SignatureDefs in all cases, that would seem like a good idea.
From what I can see in the documentation (https://www.tensorflow.org/serving/signature_defs), it seems the "Classify" and "Regress" SigDefs try to enforce a simple and consistent interface for the simple cases (classify and regress): respectively, "inputs" -> "classes + scores" and "inputs" -> "outputs". There also seems to be the added benefit that the "Classify" and "Regress" SigDefs don't require a serving function to be constructed as part of the model export function.
Also from the docs, it seems the Predict SigDef allows a more generic interface with the benefit of being able to swap in and out models. From the docs:
Predict SignatureDefs enable portability across models. This means that you can swap in different SavedModels, possibly with different underlying Tensor names (e.g. instead of x:0 perhaps you have a new alternate model with a Tensor z:0), while your clients can stay online continuously querying the old and new versions of this model without client-side changes.
Predict SignatureDefs also allow you to add optional additional Tensors to the outputs, that you can explicitly query. Let's say that in addition to the output key below of scores, you also wanted to fetch a pooling layer for debugging or other purposes.
However, the docs don't explain, aside from the minor benefit of not having to export a serving function, why one wouldn't just use the Predict SigDef for everything, since it appears to be a superset with plenty of upside. I'd love to see a definitive answer on this, as the benefits of the specialized functions (classify, regress) seem quite minimal.
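For concreteness, here is a rough TF 1.x-style sketch (the API generation those docs describe) that exports one small graph with both a Predict and a Classify signature; the tensor names, feature spec, and export path are placeholders:

import tensorflow as tf  # assumes TensorFlow 1.x

graph = tf.Graph()
with graph.as_default():
    # Classify/Regress signatures expect serialized tf.Example protos as input.
    serialized = tf.placeholder(tf.string, shape=[None], name="tf_examples")
    parsed = tf.parse_example(serialized, {"x": tf.FixedLenFeature([4], tf.float32)})
    logits = tf.layers.dense(parsed["x"], units=3)
    scores = tf.nn.softmax(logits)
    classes = tf.as_string(tf.argmax(logits, axis=1))

    # Predict: arbitrary named inputs and outputs, no constraints on dtypes.
    predict_sig = tf.saved_model.signature_def_utils.predict_signature_def(
        inputs={"examples": serialized},
        outputs={"classes": classes, "scores": scores},
    )
    # Classify: fixed interface of string classes and float scores.
    classify_sig = tf.saved_model.signature_def_utils.classification_signature_def(
        examples=serialized, classes=classes, scores=scores
    )

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        builder = tf.saved_model.builder.SavedModelBuilder("export/1")
        builder.add_meta_graph_and_variables(
            sess,
            [tf.saved_model.tag_constants.SERVING],
            signature_def_map={
                tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_NAME: predict_sig,
                "classification": classify_sig,
            },
        )
        builder.save()

The Classify signature only works because the input is serialized tf.Example protos and the outputs are string classes with float scores; the Predict signature has no such constraints, which is the flexibility the docs are pointing at.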
The differences I've seen so far are...
1) When using tf.feature_column.indicator_column wrapping tf.feature_column.categorical_column_with_vocabulary_* in a DNNClassifier model, I've had problems with the Predict API sometimes not being able to parse/map string inputs according to the vocabulary file/list when querying the TensorFlow server. The Classify API, on the other hand, properly mapped strings to their index in the vocabulary (categorical_column), then to the one-hot/multi-hot encoding (indicator_column), and provided (what seems to be) the correct classification response to the query.
2) The response format is [[class, score], [class, score], ...] for the Classify API vs. [class[], score[]] for the Predict API. One or the other may be preferable if you need to parse the data in some way afterwards.
TL;DR: With indicator_column wrapping categorical_column_with_vocabulary_*, I've experienced issues with the vocabulary mapping when serving with the Predict API, so I'm using the Classify API.

Using Pybrain to detect malicious PDF files

I'm trying to make an ANN to classify a PDF file as either malicious or clean, by utilising the 26,000 PDF samples (both clean and malicious) found on contagiodump. For each PDF file, I used PDFid.py to parse the file and return a vector of 42 numbers. The 26000 vectors are then passed into pybrain; 50% for training and 50% for testing. This is my source code:
https://gist.github.com/sirpoot/6805938
After much tweaking with the dimensions and other parameters I managed to get a false positive rate of about 0.90%. This is my output:
https://gist.github.com/sirpoot/6805948
My question is, is there any explicit way for me to decrease the false positive rate further? What do I have to do to reduce the rate to perhaps 0.05%?
There are several things you can try to increase the accuracy of your neural network.
Use more of your data for training. This will permit the network to learn from a larger set of training samples. The drawback of this is that having a smaller test set will make your error measurements more noisy. As a rule of thumb, however, I find that 80%-90% of your data can be used in the training set, with the rest for test.
Augment your feature representation. I'm not familiar with PDFid.py, but it sounds like it only returns ~40 values for a given PDF file. It's possible that there are many more than 40 features that might be relevant in determining whether a PDF is malicious, so you could conceivably use a different feature representation that includes more values to increase the accuracy of your model.
Note that this can potentially involve a lot of work -- feature engineering is difficult! One suggestion I have if you decide to go this route is to look at the PDF files that your model misclassifies, and try to get an intuitive idea of what went wrong with those files. If you can identify a common feature that they all share, you could try adding that feature to your input representation (giving you a vector of 43 values) and re-train your model.
Optimize the model hyperparameters. You could try training several different models using training parameters (momentum, learning rate, etc.) and architecture parameters (weight decay, number of hidden units, etc.) chosen randomly from some reasonable intervals. This is one way to do what is called "hyperparameter optimization" and, like feature engineering, it can involve a lot of work. However, unlike feature engineering, hyperparameter optimization can largely be done automatically and in parallel, provided you have access to a lot of processing cores.
Try a deeper model. Deep models have become quite "hot" in the machine learning literature recently, especially for speech processing and some types of image classification. By using stacked RBMs, a second-order learning method, or a different nonlinearity like a rectified linear activation function, you can add multiple layers of hidden units to your model, and sometimes this will help improve your error rate.
These are the ones that come to mind right off the bat. Good luck!
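To make the hyperparameter-optimization suggestion more concrete, here is a rough sketch of a random search. It uses scikit-learn's MLPClassifier on synthetic data purely for illustration (your setup uses pybrain), and the parameter names and ranges are arbitrary choices, not recommendations:

import random

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Stand-in data with 42 features, mirroring the PDFid vectors.
X, y = make_classification(n_samples=2000, n_features=42, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

best_score, best_params = 0.0, None
for _ in range(20):  # 20 random configurations; raise this if you have the compute
    params = {
        "hidden_layer_sizes": (random.choice([16, 32, 64, 128]),),
        "learning_rate_init": 10 ** random.uniform(-4, -1),
        "alpha": 10 ** random.uniform(-6, -2),  # L2 penalty, similar in spirit to weight decay
        "momentum": random.uniform(0.5, 0.99),
    }
    model = MLPClassifier(solver="sgd", max_iter=300, **params)
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    if score > best_score:
        best_score, best_params = score, params

print(best_score, best_params)

Each configuration is independent of the others, so this loop parallelizes trivially across cores or machines.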
Let me first say I am in no way an expert in neural networks. But I played with pyBrain once, and I used the .train() method in a while error < 0.001 loop to get the error rate I wanted. So you could try using all of your samples for training with that loop and then test with other files.
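A minimal sketch of that loop with pybrain might look like the following; the network shape, error threshold, and random placeholder samples are illustrative, and you would add your PDFid feature vectors instead:

import random

from pybrain.datasets import SupervisedDataSet
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.tools.shortcuts import buildNetwork

ds = SupervisedDataSet(42, 1)  # 42 inputs (the PDFid vector), 1 output (clean/malicious)
for _ in range(100):  # placeholder random samples; use your real labelled vectors here
    ds.addSample([random.random() for _ in range(42)], (random.randint(0, 1),))

net = buildNetwork(42, 20, 1)  # one hidden layer of 20 units, chosen arbitrarily
trainer = BackpropTrainer(net, ds)

error, epochs = float("inf"), 0
while error > 0.001 and epochs < 1000:  # cap the epochs in case the target error is never reached
    error = trainer.train()  # one epoch over the dataset; returns the average error
    epochs += 1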

How to automatically classify app user reviews?

I have received tens of thousands of user reviews for my app. I know that many of the comments mean the same thing, but I cannot read them all. I would therefore like to use a Python program to analyze all of the comments and identify the most frequent and most important feedback. How can I do that? I can already download all of an app's comments, and I have a preliminary understanding of the Google Prediction API.
You can use the Google Prediction API to characterize your comments as important or unimportant. What you'd want to do is manually classify a subset of your comments. Then you upload the manually classified comments to Google Cloud Storage and, using the Prediction API, train your model. This step is asynchronous and can take some time. Once the trained model is ready, you can use it to programmatically classify the remaining (and any future) comments.
Note that the more comments you classify manually (i.e. the larger your training set), the more accurate your programmatic classifications will be. Also, you can extend this idea as follows: instead of a binary classification (important/unimportant), you could use grades of importance, e.g. on a 1-5 scale. Of course, that entails more manual labor in constructing your model so the best strategy will be a function of your needs and how much time you can spend building the model.
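As a rough sketch of that workflow with the v1.6 Python client library (the Prediction API has since been retired, so treat the method names and request bodies below as assumptions from the old docs; the project, bucket, and model IDs are placeholders):

from googleapiclient import discovery

service = discovery.build("prediction", "v1.6")  # assumes default credentials are configured

# 1) Upload a CSV of manually labelled comments to Cloud Storage first, e.g.
#    "important","app crashes when I open settings"
#    "unimportant","nice app"
# 2) Train a model on that file (asynchronous; poll get() until trainingStatus is DONE).
service.trainedmodels().insert(
    project="my-project",
    body={"id": "review-importance", "storageDataLocation": "my-bucket/labelled_reviews.csv"},
).execute()
status = service.trainedmodels().get(project="my-project", id="review-importance").execute()

# 3) Once trained, classify the remaining (and any future) comments programmatically.
result = service.trainedmodels().predict(
    project="my-project",
    id="review-importance",
    body={"input": {"csvInstance": ["the app keeps logging me out"]}},
).execute()
print(result["outputLabel"])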
