Best way to manage train/test/val splits on AzureML - python

I'm currently using AzureML with pretty complex workflows involving large datasets, and I'm wondering what the best way is to manage the splits that result from preprocessing steps. All my projects are built as pipelines fed by registered Datasets. I want to be able to track the splitting in order to easily retrieve, for example, the test and validation sets for integration-testing purposes.
What would be the best pattern to apply here? Registering every intermediate set as a different Dataset? Directly retrieving the intermediate sets using the Run IDs? ...
Thanks

I wish I had a more coherent answer; the upside is that you're at the bleeding edge, so should you find a pattern that works for you, you can evangelize it and make it best practice. Hopefully you find my rantings below valuable.
First off -- if you aren't already, you should definitely use PipelineData as the intermediate artifact for passing data between PipelineSteps. In this way, you can treat PipelineData as semi-ephemeral: it is materialized should you need it, but it isn't a requirement to keep hold of every single version of every PipelineData. You can always grab them using Azure Storage Explorer, or, like you said, using the SDK and walking down from a PipelineRun object.
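For reference, a minimal sketch of that pattern with the Azure ML SDK v1; the script names and compute target below are hypothetical:

```python
# Minimal sketch (Azure ML SDK v1): pass an intermediate artifact between two
# steps via PipelineData instead of registering it. Script names and the
# compute target are hypothetical.
from azureml.core import Workspace
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Intermediate artifact: materialized on the datastore, but not registered.
featurized = PipelineData("featurized", datastore=datastore)

featurize_step = PythonScriptStep(
    name="featurize",
    script_name="featurize.py",        # hypothetical script
    arguments=["--output", featurized],
    outputs=[featurized],
    compute_target="cpu-cluster",      # hypothetical compute target
    source_directory=".",
)

train_step = PythonScriptStep(
    name="train",
    script_name="train.py",            # hypothetical script
    arguments=["--input", featurized],
    inputs=[featurized],
    compute_target="cpu-cluster",
    source_directory=".",
)

pipeline = Pipeline(workspace=ws, steps=[featurize_step, train_step])
```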
Another recommendation is to split your workflow into the following pipelines:
featurization pipeline (all joining, munging, and featurizing)
training pipeline
scoring pipeline (if you have a batch score scenario).
The intra-pipeline artifacts are PipelineData, and the inter-pipeline artifacts would be registered Datasets.
To actually get to your question of associating data splits with models: our team struggled with this -- especially because for each train/test/validation split we also have an "extra cols" set, which contains either identifiers or leaking variables that the model shouldn't see.
In our current hack implementation, we register our "gold" dataset as an Azure ML Dataset at the end of the featurization pipeline. The first step of our training pipeline is a PythonScriptStep, "Split Data", which contains our train/test/validation split logic and outputs a pickled dictionary as data.pkl. Then we can unpickle it anytime we need one of the splits and can join back on the index for any reporting needs. Here's a gist.
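A minimal sketch of what such a "Split Data" step could look like; the label and identifier column names are assumptions, not taken from the gist:

```python
# Sketch of a "Split Data" step: pickle all splits plus the held-out
# identifier/leaky columns in one dict so any split can be retrieved later
# and joined back on the index. Column names are hypothetical.
import pickle
from sklearn.model_selection import train_test_split

def split_and_pickle(gold_df, output_path="data.pkl",
                     label_col="label", extra_cols=("customer_id",)):
    extra = gold_df[list(extra_cols)]                        # identifiers / leaking variables
    X = gold_df.drop(columns=list(extra_cols) + [label_col])
    y = gold_df[label_col]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    splits = {
        "X_train": X_train, "X_test": X_test,
        "y_train": y_train, "y_test": y_test,
        "extra_cols": extra,   # joinable back onto any split via the index
    }
    with open(output_path, "wb") as f:
        pickle.dump(splits, f)
```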

Registration is there to make sharing and reuse easier, so that you can retrieve the dataset by its name. If you do expect to reuse the test/validation sets in other experiments, then registering them makes sense. However, if you are just trying to keep a record of what you used for this particular experiment, you can always find that info via the Run, as you suggested.

Related

How can I ensure that all users and all items appear in the training set of my recommender system?

I am building a recommender system in Python using the MovieLens dataset (https://grouplens.org/datasets/movielens/latest/). In order for my system to work correctly, I need all the users and all the items to appear in the training set. However, I have not found a way to do that yet. I tried using sklearn.model_selection.train_test_split on the partition of the dataset relevant to each user and then concatenated the results, thus succeeding in creating training and test datasets that contain at least one rating given by each user. What I need now is to find a way to create training and test datasets that also contain at least one rating for each movie.
This requirement is quite reasonable, but is not supported by the data ingestion routines for any framework I know. Most training paradigms presume that your data set is populated sufficiently that there is a negligible chance of missing any one input or output.
Since you need to guarantee this, you need to switch to an algorithmic solution rather than a probabilistic one. I suggest that you tag each observation with its input and output, and then apply the "set cover problem" to the data set.
You can continue with as many distinct covering sets as needed to populate your training set (which I recommend). Alternatively, you can set a lower threshold of requirement -- say, get three sets of total coverage -- and then revert to random methods for the remainder.
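A simple greedy variant of that idea (not a full set-cover solution), assuming MovieLens-style userId/movieId columns and pandas >= 1.1 for GroupBy.sample:

```python
# Force at least one rating per user and per movie into the training set,
# then split the remaining ratings randomly.
import pandas as pd
from sklearn.model_selection import train_test_split

def coverage_split(ratings, test_size=0.2, seed=42):
    # One rating per user is guaranteed to land in training.
    per_user = ratings.groupby("userId").sample(n=1, random_state=seed)
    remaining = ratings.drop(per_user.index)

    # One rating for each movie not already covered by the per-user picks.
    uncovered = remaining[~remaining["movieId"].isin(per_user["movieId"])]
    per_movie = uncovered.groupby("movieId").sample(n=1, random_state=seed)
    remaining = remaining.drop(per_movie.index)

    # Split whatever is left randomly (the overall test fraction ends up
    # slightly below test_size because of the forced training rows).
    rest_train, test = train_test_split(remaining, test_size=test_size,
                                        random_state=seed)
    train = pd.concat([per_user, per_movie, rest_train])
    return train, test
```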

Store and reuse tsfresh feature engineering performed

I am currently using the tsfresh package for a project (predictive maintenance).
It is working really well and now I want to implement it live.
However, the issue is that I don't know how to store the feature engineering that has been applied to my original dataset, so that I can apply the same feature engineering to the data that I am streaming (receiving live).
Do you have any idea if there is a parameter or a function that allows me to store the feature engineering performed by tsfresh?
(I am using the extract_relevant_features function).
After searching through various posts, it turns out that the answer is that you can save the extraction parameters in a dictionary (see here).
This dictionary can later be passed to the extract_features function to extract only those features.
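A minimal sketch of that approach; the id/time column names and the pickle file name are assumptions:

```python
# Derive the extraction settings from the already-selected features, save
# them, and reuse them on live data.
import pickle
from tsfresh import extract_features, extract_relevant_features
from tsfresh.feature_extraction.settings import from_columns

# Offline: select relevant features once on historical data.
X_filtered = extract_relevant_features(history_df, y,
                                       column_id="id", column_sort="time")

# Store only the settings needed to recompute exactly those features.
kind_to_fc_parameters = from_columns(X_filtered.columns.tolist())
with open("fc_parameters.pkl", "wb") as f:
    pickle.dump(kind_to_fc_parameters, f)

# Online: recompute the same features on the streamed data.
with open("fc_parameters.pkl", "rb") as f:
    kind_to_fc_parameters = pickle.load(f)
X_live = extract_features(live_df, column_id="id", column_sort="time",
                          kind_to_fc_parameters=kind_to_fc_parameters)
```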

How to convert images to TFRecords with tf.data.Dataset in most efficient way possible

I am absolutely baffled by how many unhelpful error messages I've received while trying to use this supposedly simple API to write TFRecords in a manner that doesn't take 30 minutes every time I have a new dataset.
Task:
I'd like to feed a list of image paths and a list of labels to a tf.data.Dataset, parse them in parallel to read the images and encode as tf.train.Examples, use tf.data.Dataset.shard to distribute them into different TFRecord shards (e.g. train-001-of-010.tfrecord, train-002-of-010.tfrecord, etc.), and for each shard finally write them to the corresponding file.
Since I've been debugging this for hours, I haven't got any single particular error to fix; otherwise I would provide it. I've struggled to find any up-to-date tutorial that doesn't either (a) come from 2017 and use queue runners, (b) use a tf.Session (I'm using TensorFlow 1.15, but the official docs keep telling me to phase out sessions), (c) conveniently do the record creation in pure Python, which makes for a simple tutorial but is too slow for any actual application, or (d) use already-created TFRecords and just skip the whole process.
If necessary, I can put together an example of what I'm talking about. But since I'm getting stuck at every level of the process, at the moment it seems unhelpful.
TL;DR:
If anyone has used tf.data.Dataset to create TFRecord shards in parallel, please point me in a better direction than Google has.
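For what it's worth, here is a minimal eager-mode sketch of one way to do this (tf.enable_eager_execution() in TF 1.15, or TF 2.x). It reads each shard sequentially; the image reads could be parallelized with dataset.map(..., num_parallel_calls=...), but that is omitted to keep the example small:

```python
import tensorflow as tf

NUM_SHARDS = 10

def make_example(image_bytes, label):
    # Wrap the raw encoded image bytes and integer label in a tf.train.Example.
    features = tf.train.Features(feature={
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    })
    return tf.train.Example(features=features)

def write_shards(image_paths, labels, prefix="train"):
    ds = tf.data.Dataset.from_tensor_slices((image_paths, labels))
    for shard_id in range(NUM_SHARDS):
        shard = ds.shard(NUM_SHARDS, shard_id)
        fname = "{}-{:03d}-of-{:03d}.tfrecord".format(prefix, shard_id + 1, NUM_SHARDS)
        with tf.io.TFRecordWriter(fname) as writer:
            for path, label in shard:
                image_bytes = tf.io.read_file(path).numpy()
                example = make_example(image_bytes, int(label.numpy()))
                writer.write(example.SerializeToString())
```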

Print out summaries in console

TensorFlow's scalar/histogram/image_summary functions are very useful for logging data for viewing with TensorBoard. But I'd like that information printed to the console as well (e.g. if I'm a crazy person without a desktop environment).
Currently, I'm adding the information of interest to the fetch list before calling sess.run, but this seems redundant as I'm already fetching the merged summaries. Fetching the merged summaries returns a protobuf, so I imagine I could scrape it using some generic python protobuf library, but this seems like a common enough use case that there should be an easier way.
The main motivation here is encapsulation. Let's say I have my model and training script in different files. My model has a bunch of calls to tf.scalar_summary for the information that is useful to log. Ideally, I'd be able to specify whether or not to additionally print this information to the console by changing something in the training script without changing the model file. Currently, I either pass all of the useful information to the training script (so I can fetch it), or I pepper the model file with calls to tf.Print.
Overall, there isn't first class support for your use case in TensorFlow, so I would parse the merged summaries back into a tf.Summary() protocol buffer, and then filter / print data as you see fit.
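A TF 1.x sketch of that parsing step:

```python
import tensorflow as tf

def print_scalar_summaries(summary_str, step):
    # Parse the serialized merged summary back into a tf.Summary proto
    # and print every scalar value it contains.
    summary_proto = tf.Summary()
    summary_proto.ParseFromString(summary_str)
    for value in summary_proto.value:
        if value.HasField("simple_value"):
            print("step {}: {} = {}".format(step, value.tag, value.simple_value))

# Usage inside the training loop:
# summary_str, _ = sess.run([merged_summaries, train_op], feed_dict=feed)
# print_scalar_summaries(summary_str, step)
```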
If you come up with a nice pattern, you could then merge it back into TensorFlow itself. I could imagine making this an optional setting on the tf.train.SummaryWriter, but it is probably best to just have a separate class for console-printing out interesting summaries.
If you want to encode into the graph itself which items should be summarized and printed, and which items should only be summarized (or to setup a system of different verbosity levels) you could use the Collections argument to the summary op constructors to organize different summaries into different groups. E.g. the loss summary could be put in collections [GraphKeys.SUMMARIES, 'ALWAYS_PRINT'], but another summary could be in collection [GraphKeys.SUMMARIES, 'PRINT_IF_VERBOSE'], etc. Then you can have different merge_summary ops for the different types of printing, and control which ones are run via command line flags.
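A TF 1.x sketch of the collections idea; the placeholder tensors stand in for the model's real loss/gradient tensors:

```python
import tensorflow as tf

# Stand-ins for the model's real tensors.
loss = tf.placeholder(tf.float32, name="loss")
grad_norm = tf.placeholder(tf.float32, name="grad_norm")

# Tag each summary with an extra collection that encodes its verbosity level.
tf.summary.scalar("loss", loss,
                  collections=[tf.GraphKeys.SUMMARIES, "ALWAYS_PRINT"])
tf.summary.histogram("grad_norm", grad_norm,
                     collections=[tf.GraphKeys.SUMMARIES, "PRINT_IF_VERBOSE"])

all_summaries = tf.summary.merge_all()                                  # everything, for TensorBoard
print_summaries = tf.summary.merge(tf.get_collection("ALWAYS_PRINT"))   # printed every step
verbose_summaries = tf.summary.merge(tf.get_collection("PRINT_IF_VERBOSE"))
# Fetch print_summaries (and optionally verbose_summaries, gated by a
# command-line flag) and feed the result to the parsing helper above.
```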

How to store multi-dimensional data

I am building a couple of web applications to store data using Django. This data would be generated from lab tests and might have up to 100 parameters being logged against time. This would leave me with an NxN matrix of data.
I'm struggling to see how this would fit into a Django model as the number of parameters logged may change each time, and it seems inefficient to create a new model for each dataset.
What would be a good way of storing data like this? Would it be best to store it as a separate file and then just use a model to link a test to a datafile? If so what would be the best format for fast access and being able to quickly render and search through data, generate graphs etc in the application?
In answer to the question below:
It would be useful to search through datasets generated from the same test for trend analysis etc.
As I'm still at the beginning with this site, I'm using SQLite, but I plan to move to a full SQL database as it grows.
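A minimal sketch of the "model linking a test to a datafile" option mentioned in the question, assuming Django 3.1+ (for models.JSONField) and a columnar file format such as Parquet for the time series itself; the field names are assumptions:

```python
from django.db import models

class LabTest(models.Model):
    name = models.CharField(max_length=200)
    test_type = models.CharField(max_length=100)          # lets you query runs of the same test
    run_at = models.DateTimeField(auto_now_add=True)
    parameters = models.JSONField(default=list)           # which parameters were logged this run
    datafile = models.FileField(upload_to="lab_data/")    # e.g. a Parquet or HDF5 file

# Loading for graphs / trend analysis:
# import pandas as pd
# df = pd.read_parquet(test.datafile.path)   # columns = parameters, index = time
```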
