I am currently using the tsfresh package for a project (predictive maintenance).
It is working really well and now I want to implement it live.
However, the issue is that I don't know how to store the feature engineering that was applied to my original dataset so that I can apply the same feature engineering to the data I am streaming (receiving live).
Do you have any idea whether there is a parameter or a function that allows storing the feature engineering performed by tsfresh?
(I am using the extract_relevant_features function).
After searching through various posts, it turns out that the answer is that you can save your parameters in a dictionary (see here).
This dictionary can later be passed to the extract_features function to extract only those features.
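For illustration, here is a minimal sketch of that approach. It assumes X_filtered is the feature matrix returned by extract_relevant_features on the training data and df_live is the incoming data in the same long format; both names, and the id/time column names, are placeholders for your own data.

```
# Sketch only: reuse the feature-calculation settings that survived filtering.
# X_filtered and df_live are hypothetical; column names depend on your data.
from tsfresh import extract_features
from tsfresh.feature_extraction.settings import from_columns

# Rebuild the per-kind settings dictionary from the selected column names ...
kind_to_fc_parameters = from_columns(list(X_filtered.columns))

# ... and extract exactly those features from the live/streaming data.
X_live = extract_features(
    df_live,
    column_id="id",
    column_sort="time",
    kind_to_fc_parameters=kind_to_fc_parameters,
)
```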
Just a simple question, I'm stuck at a scenario where I want to pass multiple information other than the pipeline itself inside a PMML file.
Other information like:
Average of all columns in the dataset: avg(col1), ..., avg(coln)
P values of all features.
Correlation of all features with target.
There can be more of these, but the situation is as you can see. I know they could easily be sent in a separate file made specifically for them, but since they concern the ML model, I want them in a single file: the PMML.
The Question is:
Can we add additional information to a PMML file that is extra in nature and might not be directly related to the model, so that it can be used on the other side?
If that is somehow possible, it would be very helpful.
The PMML standard is extensible with custom elements. However, in the current case, all your needs appear to be served by existing PMML elements.
Average of all columns in the dataset: avg(col1), ..., avg(coln)
P values of all features
You can store descriptive statistics about features using the ModelStats element.
Correlation of all features with target
You can store function information using the ModelExplanation element.
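If you do want to attach truly custom information, the standard's Extension element is the intended hook. Below is a minimal sketch that post-processes an exported PMML file with Python's standard library; the file name, the namespace version, and the statistics values are assumptions, not part of any official exporter API.

```
# Sketch only: append a custom Extension element to an exported PMML file.
import xml.etree.ElementTree as ET

PMML_NS = "http://www.dmg.org/PMML-4_4"   # adjust to the version your exporter uses
ET.register_namespace("", PMML_NS)

tree = ET.parse("model.pmml")             # hypothetical exported file
root = tree.getroot()

# Extension elements are part of the PMML standard and may carry arbitrary
# vendor-specific content; consumers that do not understand them ignore them.
ext = ET.SubElement(root, f"{{{PMML_NS}}}Extension",
                    {"name": "column-averages", "extender": "my-pipeline"})
for col, avg in {"col1": 4.2, "col2": 17.8}.items():   # hypothetical values
    ET.SubElement(ext, f"{{{PMML_NS}}}Value", {"value": f"{col}={avg}"})

tree.write("model_with_extras.pmml", xml_declaration=True, encoding="UTF-8")
```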
I'm currently using AzureML with pretty complex workflows involving large datasets, and I'm wondering what the best way is to manage the splits resulting from preprocessing steps. All my projects are built as pipelines fed by registered Datasets. I want to be able to track the splitting in order to easily retrieve, for example, test and validation sets for integration testing purposes.
What would be the best pattern to apply here? Registering every intermediate set as a different Dataset? Directly retrieving the intermediate sets using the Run IDs? ...
Thanks
I wish I had a more coherent answer; the upside is that you're at the bleeding edge, so should you find a pattern that works for you, you can evangelize it and make it best practice! Hopefully you find my rantings below valuable.
First off: if you aren't already, you should definitely use PipelineData as the intermediate artifact for passing data between PipelineSteps. In this way, you can treat the PipelineData as semi-ephemeral: they are materialized should you need them, but it isn't a requirement to keep hold of every single version of every PipelineData. You can always grab them using Azure Storage Explorer, or, like you said, using the SDK and walking down from a PipelineRun object.
Another recommendation is to split your workflow into the following pipelines:
featurization pipeline (all joining, munging, and featurizing)
training pipeline
scoring pipeline (if you have a batch scoring scenario).
The intra-pipeline artifacts are PipelineData, and the inter-pipeline artifacts would be registered Datasets.
As for your actual question of associating data splits with models: our team struggled with this, especially because for each train/test split we also have an "extra cols" set which contains either identifiers or leaking variables that the model shouldn't see.
In our current hack implementation, we register our "gold" dataset as an Azure ML Dataset at the end of the featurization pipeline. The first step of our training pipeline is a PythonScriptStep, "Split Data", which contains our train/test split logic and outputs a pickled dictionary as data.pkl. Then we can unpickle it any time we need one of the splits, and we can join back via the index for any reporting needs. Here's a gist.
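A stripped-down sketch of what such a "Split Data" step does; the column names, the extra columns, and the file paths are hypothetical:

```
# Sketch only: split the data, keep the "extra" columns alongside, pickle everything.
import pickle
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_parquet("gold_dataset.parquet")    # the registered "gold" dataset
extra_cols = ["row_id", "leaky_variable"]       # identifiers / leaking variables

X = df.drop(columns=extra_cols + ["target"])
y = df["target"]
extras = df[extra_cols]

X_train, X_test, y_train, y_test, extras_train, extras_test = train_test_split(
    X, y, extras, test_size=0.2, random_state=42
)

splits = {
    "X_train": X_train, "X_test": X_test,
    "y_train": y_train, "y_test": y_test,
    "extras_train": extras_train, "extras_test": extras_test,
}

# Later steps unpickle this and can join splits back to the extras via the index.
with open("data.pkl", "wb") as f:
    pickle.dump(splits, f)
```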
Registration is meant to make sharing and reuse easier, so that you can retrieve the dataset by its name. If you expect to reuse the test/validation sets in other experiments, then registering them makes sense. However, if you are just trying to keep records of what you used for this particular experiment, you can always find that information via the Run, as you suggested.
Recently I was given a task by a potential employer to do the following:
- transfer a data set to S3
- create metadata for the data set
- create a feature for the data set in Spark
Now, this is a trainee position, and I am new to data engineering concepts, so I am having trouble understanding how, or even if, metadata is used to create a feature.
I have gone through numerous sites on feature engineering and metadata, but none of them really give me an indication of whether metadata is directly used to build a feature.
What I have gathered so far is that when you build a feature you extract certain columns from a given data set and then put this information into a feature vector for the ML algorithm to learn from. So to me, you could just build a feature directly from the data set and not be concerned with the metadata.
However, I am wondering whether it is common to use metadata to search for given information within multiple datasets to build the feature, i.e. you look in the metadata file, see certain criteria that fit the feature you're building, then load the data referenced by the metadata and build the feature from there to train the model.
So, as an example, say I have multiple files for certain car models from a manufacturer (VW Golf, VW Fox, etc.), each containing the year and the price of the car for that year, and I would like the ML algorithm to predict the depreciation of the car in the future, or the depreciation of the newest model of that car for years to come. Instead of going directly through all the datasets, you would check the metadata (tags, if that is the correct wording) for certain attributes, and then, using the tags, load the data in from the specific data sets to train the model.
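To make the idea concrete, here is a rough PySpark sketch of the workflow I am imagining; the metadata catalogue, its columns ("path", "tags"), and the file locations are all made up:

```
# Hypothetical sketch: use a metadata catalogue to pick datasets by tag,
# then build a depreciation feature from only those datasets.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("car-depreciation").getOrCreate()

# Metadata catalogue describing the available datasets (path + tags).
meta = spark.read.json("s3://my-bucket/metadata/catalogue.json")

# Select only the datasets whose tags match the feature we want to build.
paths = [row["path"]
         for row in meta.filter(F.array_contains("tags", "vw")).collect()]

# Load just those files and derive the feature (year-over-year price drop).
cars = spark.read.parquet(*paths)
w = Window.partitionBy("model").orderBy("year")
cars = cars.withColumn("depreciation", F.lag("price").over(w) - F.col("price"))
```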
I could very well be off base here, or the example I gave above may be completely wrong, but if anyone could explain how metadata can be used to build features (if it can), or even link to data engineering websites that explain it, that would be appreciated. Over the last day or two of researching, I have found that there is more material on data science than on data engineering itself, and most data engineering info comes from blogs, so I feel like there is pre-existing knowledge I am supposed to have when reading them.
P.S. Though this is not a coding question, I have used the python tag as it seems most data engineers use Python.
I'll give a synopsis of this.
Here we need to understand two conditions:
1) Do we have features that are directly relevant to building the ML model?
2) Are we in a situation of data scarcity?
Always ask what the problem statement suggests about which features to generate.
There are many ways to generate features from a given dataset. Dimensionality reduction techniques such as PCA, truncated SVD, and t-SNE create new features from the existing ones, as do feature engineering techniques such as Fourier features, trigonometric features, etc. Then we move to the metadata: the type of a feature, the size of a feature, the time when it was extracted, and so on. Metadata like this can also help us create features for building ML models, but it depends on how the feature engineering was performed on the data corpus of the respective problem.
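For example, deriving new features through dimensionality reduction might look like this (a minimal scikit-learn sketch with a made-up feature matrix):

```
# Illustrative only: create 5 new features from 20 original ones with PCA.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(500, 20)            # hypothetical original feature matrix
pca = PCA(n_components=5)
X_new = pca.fit_transform(X)           # the derived features
print(X_new.shape, pca.explained_variance_ratio_)
```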
I was wondering whether there is a way to perform interactive variable grouping (similar to that enabled by the SAS Miner software) in the PySpark/Python world. Variable grouping is an integral part of model development, so I suppose there must already be some tool/library that supports this. Does anyone have experience with this?
Currently no such library exists for Python.
Interactive variable grouping is a multi-step process (offered as a node called IGN in SAS Enterprise Miner) that is part of the SAS EM Credit Scoring solution, not base SAS. There are, however, tools in the Python world for some of the IGN steps, such as binning, WoE, Gini, and decision trees; scikit-learn is a good starting point for that.
There are a lot of Scikit-learn related projects including domain-specific ones. A project for credit scoring is a potential candidate in that list.
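For instance, one of those IGN-style steps, coarse binning followed by Weight of Evidence, can be sketched with pandas alone; the DataFrame, its "age" feature, and the binary "default" target below are made up:

```
# Illustrative WoE sketch; not a drop-in replacement for SAS EM's IGN node.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.integers(18, 70, 1000),
                   "default": rng.binomial(1, 0.2, 1000)})

# Bin the variable, then compute Weight of Evidence per bin:
# WoE = ln( (share of goods in bin) / (share of bads in bin) )
df["age_bin"] = pd.qcut(df["age"], q=5)
g = df.groupby("age_bin", observed=True)["default"].agg(bads="sum", total="count")
g["goods"] = g["total"] - g["bads"]
g["woe"] = np.log((g["goods"] / g["goods"].sum()) /
                  (g["bads"] / g["bads"].sum()))
print(g[["woe"]])
```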
Is it possible to save a predictive model in Python?
Background: I'm building regression models in SPSS and I need to share those when done. Unfortunately, no one else has SPSS.
Idea: My idea is to build the model in Python, do something XYZ, and use another library to convert XYZ into an exe that will pick up a csv file with data and spit out the model-fit results on that data. That way, I can share the model with anyone I want without the need for SPSS or other expensive software.
Challenge: I need to find out what XYZ is, i.e. how to save the model instance once it is built. For example, in the case of linear/logistic regression, it would be the set of coefficients.
PS: I'm using linear/logistic as examples; in reality, I need to share more complex models like SVMs, etc.
Using FOSS (Free & Open Source Software) is great to facilitate collaboration. Consider using R or Sage (which has a Python backbone and includes R) so that you can freely share programs and data. Or even use Sagemath Cloud so that you can work collaboratively in real-time.
Yes, this is possible. What you're looking for is scikit-learn in combination with joblib. A working example of your problem can be found in this question.
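A minimal sketch of that combination (the SVM, the toy data, and the file name are placeholders):

```
# Persist a fitted scikit-learn model with joblib and load it again later.
import joblib
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = SVC().fit(X, y)

joblib.dump(model, "model.joblib")       # saves the fitted estimator (support vectors etc.)

restored = joblib.load("model.joblib")   # on the receiving side
print(restored.predict(X[:5]))
```

Note that the receiving side needs compatible scikit-learn and joblib versions installed for the unpickling to work reliably.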