Print out summaries in console - python

TensorFlow's scalar/histogram/image_summary functions are very useful for logging data for viewing with TensorBoard. But I'd like that information printed to the console as well (e.g. if I'm a crazy person without a desktop environment).
Currently, I'm adding the information of interest to the fetch list before calling sess.run, but this seems redundant as I'm already fetching the merged summaries. Fetching the merged summaries returns a protobuf, so I imagine I could scrape it using some generic Python protobuf library, but this seems like a common enough use case that there should be an easier way.
The main motivation here is encapsulation. Let's say I have my model and training script in different files. My model has a bunch of calls to tf.scalar_summary for the information that is useful to log. Ideally, I'd be able to specify whether or not to additionally print this information to the console by changing something in the training script, without changing the model file. Currently, I either pass all of the useful information to the training script (so I can fetch it), or I pepper the model file with calls to tf.Print.

Overall, there isn't first-class support for your use case in TensorFlow, so I would parse the merged summaries back into a tf.Summary() protocol buffer, and then filter/print the data as you see fit.
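For reference, a minimal sketch of that parsing step might look like this (assuming summary_str is the serialized string returned by running the merged summary op with sess.run):

import tensorflow as tf

def print_scalar_summaries(summary_str):
    # Parse the serialized summary and print scalar values to the console.
    summary = tf.Summary()
    summary.ParseFromString(summary_str)
    for value in summary.value:
        # Scalar summaries populate the simple_value field.
        if value.HasField('simple_value'):
            print('{}: {}'.format(value.tag, value.simple_value))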
If you come up with a nice pattern, you could then merge it back into TensorFlow itself. I could imagine making this an optional setting on tf.train.SummaryWriter, but it is probably best to have a separate class for printing interesting summaries to the console.
If you want to encode in the graph itself which items should be summarized and printed, and which should only be summarized (or to set up a system of different verbosity levels), you could use the collections argument of the summary op constructors to organize different summaries into different groups. E.g. the loss summary could be put in the collections [GraphKeys.SUMMARIES, 'ALWAYS_PRINT'], while another summary could be in [GraphKeys.SUMMARIES, 'PRINT_IF_VERBOSE'], etc. You can then have different merge_summary ops for the different types of printing and control which ones are run via command-line flags.
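A hypothetical sketch using the old-style summary API from the question, where loss and weight_norm are just placeholder tensors:

import tensorflow as tf

loss = tf.placeholder(tf.float32, name='loss')
weight_norm = tf.placeholder(tf.float32, name='weight_norm')

# Each summary lands in the default SUMMARIES collection plus a verbosity group.
tf.scalar_summary('loss', loss,
                  collections=[tf.GraphKeys.SUMMARIES, 'ALWAYS_PRINT'])
tf.scalar_summary('weight_norm', weight_norm,
                  collections=[tf.GraphKeys.SUMMARIES, 'PRINT_IF_VERBOSE'])

# One merge op per verbosity level; a command-line flag decides which to fetch.
always_print_op = tf.merge_summary(tf.get_collection('ALWAYS_PRINT'))
verbose_print_op = tf.merge_summary(tf.get_collection('PRINT_IF_VERBOSE'))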

Related

How to add additional information in PMML other than related to model?

Just a simple question: I'm stuck in a scenario where I want to pass several pieces of information, other than the pipeline itself, inside a PMML file.
Other information like:
Average of all columns in the dataset: avg(col1), ..., avg(coln)
P values of all features.
Correlation of all features with target.
There can be more of these, but that is the general situation. I know they could easily be sent in a separate file made specifically for this purpose, but since they relate to the ML model, I want them in a single file: the PMML.
The Question is:
Can we add additional information to a PMML file that is extra in nature and might not be directly related to the model, so that it can be used on the other side?
If that is somehow possible, it would be very helpful.
The PMML standard is extensible with custom elements. However, in the current case, all your needs appear to be served by existing PMML elements.
Average of all columns in the dataset: avg(col1), ..., avg(coln)
P values of all features
You can store descriptive statistics about features using the ModelStats element.
Correlation of all features with target
You can store correlation information using the ModelExplanation element.

Making pandas code more readable/better organized

I am working on a data analysis project based on pandas. The data to be analyzed is collected from application log files. Log entries are based on sessions, which can be of different types (and can have different actions), and each session can have multiple services (also with different types, actions, etc.). I have transformed the log file entries into a pandas dataframe and, based on that, completed all required calculations. At the moment that's around a few hundred different calculations, which are printed to stdout at the end. If an anomaly is found, it is specifically flagged. So the basic functionality is there, but now that this first phase is done, I'm not happy with the readability of the code, and it seems to me that there must be a way to organize it better.
For example what I have at the moment is:
def build(log_file):
    # build dataframe from log file entries
    return df

def transform(df):
    # transform dataframe (for example based on grouped sessions, services)
    return transformed_df

def calculate(transformed_df):
    # make calculations based on transformed dataframe and print them to stdout
    print(calculation1)
    print(calculation2)
    # etc.
Since there are numerous criteria for filtering data, there are at least 30-40 different dataframe filters present. They are used in the calculate and transform functions. In the calculate functions I also have some helper functions which perform tasks that can be applied to similar session/service types, where the result is based only on the filtered dataframe for that specific type. With all these requirements, transformations, and filters, I now have more than 1000 lines of code which, as I said, I feel could be more readable.
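For illustration, the filters are small functions of roughly this shape (the column names here are made up):

def completed_api_sessions(df):
    # one of the many small dataframe filters used by transform() and calculate()
    return df[(df['session_type'] == 'api') & (df['status'] == 'completed')]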
My current idea is to have perhaps classes organized like this:
class Session:
    # main class for sessions (can be inherited by other session types), with standardized output for calculations
    pass

class Service:
    # main class for services (can be inherited by other service types), with standardized output for calculations, etc.
    pass

class Dataframe:
    # dataframe class with filters, etc.
    pass
But I'm not sure if this is a good approach. I tried searching here, on GitHub, and on various blogs, but I didn't find anything providing examples of the best way to organize code in more-than-basic pandas projects. I would appreciate any suggestion that would point me in the right direction.

Best way to manage train/test/val splits on AzureML

I'm currently using AzureML with pretty complex workflows involving large datasets, etc., and I'm wondering what the best way is to manage the splits resulting from preprocessing steps. All my projects are built as pipelines fed by registered Datasets. I want to be able to track the splitting in order to easily retrieve, for example, the test and validation sets for integration testing purposes.
What would be the best pattern to apply here? Registering every intermediate set as a different Dataset? Directly retrieving the intermediate sets using the Run IDs? ...
Thanks
I wish I had a more coherent answer; the upside is that you're at the bleeding edge, so should you find a pattern that works for you, you can evangelize it and make it best practice! Hopefully you find my rantings below valuable.
First off -- if you aren't already, you should definitely use PipelineData as the intermediate artifact for passing data between PipelineSteps. In this way, you can treat the PipelineData as semi-ephemeral: they are materialized should you need them, but it isn't a requirement to keep hold of every single version of every PipelineData. You can always grab them using Azure Storage Explorer, or, like you said, using the SDK and walking down from a PipelineRun object.
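A minimal sketch of that wiring, where the script names, compute target name, and directory layout are assumptions for illustration:

from azureml.core import Workspace
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Semi-ephemeral intermediate artifact passed between steps.
featurized = PipelineData('featurized', datastore=datastore)

featurize_step = PythonScriptStep(
    script_name='featurize.py',
    arguments=['--output', featurized],
    outputs=[featurized],
    compute_target='cpu-cluster',
    source_directory='featurization')

train_step = PythonScriptStep(
    script_name='train.py',
    arguments=['--input', featurized],
    inputs=[featurized],
    compute_target='cpu-cluster',
    source_directory='training')

pipeline = Pipeline(workspace=ws, steps=[featurize_step, train_step])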
Another recommendation is to split your workflow into the following pipelines:
featurization pipeline (all joining, munging, and featurizing)
training pipeline
scoring pipeline (if you have a batch score scenario).
The intra-pipeline artifacts are PipelineData, and the inter-pipeline artifacts would be registered Datasets.
To get to your actual question of associating data splits with a model: our team struggled with this, especially because for each train/test split we also have an "extra cols" set which contains either identifiers or leaking variables that the model shouldn't see.
In our current hack implementation, we register our "gold" dataset as an Azure ML Dataset at the end of the featurization pipeline. The first step of our training pipeline is a PythonScriptStep, "Split Data", which contains our train/test split steps and outputs a pickled dictionary as data.pkl. Then we can unpickle it any time we need one of the splits and can join back on the index for any reporting needs. Here's a gist.
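A minimal sketch of what such a "Split Data" step might do (scikit-learn's train_test_split, the target column name, and the output path are illustrative assumptions):

import pickle
from sklearn.model_selection import train_test_split

def split_and_pickle(df, target_col='label', output_path='data.pkl'):
    # Split the gold dataset and persist all splits as one pickled dictionary,
    # keeping the original index so splits can be joined back for reporting.
    X = df.drop(columns=[target_col])
    y = df[target_col]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    splits = {'X_train': X_train, 'X_test': X_test,
              'y_train': y_train, 'y_test': y_test}
    with open(output_path, 'wb') as f:
        pickle.dump(splits, f)
    return splits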
Registration is there to make sharing and reuse easier, so that you can retrieve the dataset by its name. If you do expect to reuse the test/validation sets in other experiments, then registering them makes sense. However, if you are just trying to keep a record of what you used for this particular experiment, you can always find that information via the Run, as you suggested.
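For example, registering a test split and later retrieving it by name might look like this (the dataset name and datastore path are placeholders):

from azureml.core import Workspace, Dataset

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Register the split so it can be retrieved by name in other experiments.
test_ds = Dataset.Tabular.from_delimited_files(path=(datastore, 'splits/test.csv'))
test_ds = test_ds.register(workspace=ws, name='test-set', create_new_version=True)

# Later, elsewhere:
test_ds = Dataset.get_by_name(ws, name='test-set')
test_df = test_ds.to_pandas_dataframe()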

How to convert images to TFRecords with tf.data.Dataset in most efficient way possible

I am absolutely baffled by how many unhelpful error messages I've received while trying to use this supposedly simple API to write TFRecords in a manner that doesn't take 30 minutes every time I have a new dataset.
Task:
I'd like to feed a list of image paths and a list of labels to a tf.data.Dataset, parse them in parallel to read the images and encode as tf.train.Examples, use tf.data.Dataset.shard to distribute them into different TFRecord shards (e.g. train-001-of-010.tfrecord, train-002-of-010.tfrecord, etc.), and for each shard finally write them to the corresponding file.
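To make the intent concrete, the shape I'm after is roughly the following sketch (not code I have working; it assumes TF 1.15 with eager execution enabled, and image_paths, labels, and the shard count are placeholders):

import tensorflow as tf
tf.compat.v1.enable_eager_execution()

image_paths = ['images/img_0001.jpg', 'images/img_0002.jpg']  # placeholder
labels = [0, 1]                                               # placeholder
NUM_SHARDS = 10

def serialize_example(path, label):
    # Runs as plain Python via tf.py_function so we can build an Example proto.
    image_bytes = tf.io.read_file(path).numpy()
    features = tf.train.Features(feature={
        'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(label.numpy())])),
    })
    return tf.train.Example(features=features).SerializeToString()

def tf_serialize(path, label):
    serialized = tf.py_function(serialize_example, (path, label), tf.string)
    return tf.reshape(serialized, ())

dataset = tf.data.Dataset.from_tensor_slices((image_paths, labels))
dataset = dataset.map(tf_serialize, num_parallel_calls=tf.data.experimental.AUTOTUNE)

for shard_index in range(NUM_SHARDS):
    shard = dataset.shard(NUM_SHARDS, shard_index)
    filename = 'train-%03d-of-%03d.tfrecord' % (shard_index + 1, NUM_SHARDS)
    tf.data.experimental.TFRecordWriter(filename).write(shard)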
Since I've been debugging this for hours, I don't have any single particular error to fix; otherwise I would provide it. I've struggled to find any up-to-date tutorial that doesn't either (a) come from 2017 and use queue runners, (b) use a tf.Session (I'm using TensorFlow 1.15, but the official docs keep telling me to phase out sessions), (c) conveniently do the record creation in pure Python, which makes for a simple tutorial but is too slow for any actual application, or (d) use already-created TFRecords and just skip the whole process.
If necessary, I can put together an example of what I'm talking about. But since I'm getting stuck at every level of the process, at the moment it seems unhelpful.
Tldr:
If anyone has utilized tf.data.Dataset to create TFRecord shards in parallel, please point me in a better direction than Google has.

FMU-module method get_states_list()

I found a limitation of the FMU-module method get_states_list(). This method seems to return a list of only the continuous-time states and not the discrete-time states. I usually make models that contain both continuous- and discrete-time sub-models describing the process and control system, and I am very interested in being able to get a list of ALL states in the system.
One possibility could have been get_fmu_state(), but I get the exception text "This FMU does not support get and set FMU-state".
Another possibility would perhaps be to bring out a larger list of all variables and sort out those whose declaration contains "fixed=true", but I am not sure how to bring out this attribute, although other attributes like min, max, and nominal can be brought out. The method get_model_variables() could perhaps be of help, but I only get some address associated with the variable…
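For illustration, the kind of inspection I have tried looks roughly like this (PyFMI assumed; 'MyModel.fmu' is a placeholder, and I am not certain these accessors expose the fixed/variability information I am after):

from pyfmi import load_fmu

model = load_fmu('MyModel.fmu')  # placeholder FMU

# get_model_variables() returns a dictionary keyed by variable name;
# per-variable attributes can then be queried from the model.
for name in model.get_model_variables():
    print(name,
          model.get_variable_valueref(name),
          model.get_variable_variability(name))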
What to do?
The get_states_list method maps back to the FMI specification, which only includes the continuous-time states, so this is by design.
