Remove/Edit feature definitions from AWS Sagemaker Feature Group - python

How do I edit or remove feature definitions (name/type) from my AWS SageMaker Feature Group? From what I can see in the Feature Store API, there are only options to delete an entire Feature Group or a record. I tried searching the documentation for methods to delete or edit a feature, without success. The only solution I currently see is to delete the Feature Group and recreate it with the correct feature definitions.

SageMaker Feature Store supports the ability to delete an entire feature record, but not a specific feature. More specifically, the current version of Feature Store supports only immutable feature groups. Once you create a feature group, its schema cannot be changed.
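A rough sketch of that delete-and-recreate workaround with boto3 follows; the feature group name, feature definitions, S3 URI, and role ARN are placeholders, and note that deletion is asynchronous, so the recreate call may need to wait until the old group is fully gone.

```python
import boto3

sm = boto3.client("sagemaker")

# 1. Delete the existing feature group. Data already written to the offline
#    store in S3 is not removed by this call.
sm.delete_feature_group(FeatureGroupName="my-feature-group")

# Wait until describe_feature_group raises ResourceNotFound before recreating,
# since deletion happens asynchronously.

# 2. Recreate the group with the corrected feature definitions.
sm.create_feature_group(
    FeatureGroupName="my-feature-group",
    RecordIdentifierFeatureName="record_id",
    EventTimeFeatureName="event_time",
    FeatureDefinitions=[
        {"FeatureName": "record_id", "FeatureType": "String"},
        {"FeatureName": "event_time", "FeatureType": "Fractional"},
        {"FeatureName": "corrected_feature", "FeatureType": "Integral"},
    ],
    OnlineStoreConfig={"EnableOnlineStore": True},
    OfflineStoreConfig={"S3StorageConfig": {"S3Uri": "s3://my-bucket/feature-store/"}},
    RoleArn="arn:aws:iam::123456789012:role/MyFeatureStoreRole",
)
```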

Related

How does Elasticsearch search documents? How to customize preprocess pipeline and scoring functions in ES?

I want to implement Elasticsearch on a customized corpus. I have installed Elasticsearch version 7.5.1 and I do all my work in Python using the official client.
Here I have a few questions:
How do I customize the preprocessing pipeline? For example, I want to use a BertTokenizer to convert strings to tokens instead of n-grams.
How do I customize the scoring function of each document w.r.t. the query? For example, I want to compare the effects of TF-IDF and BM25, or even use some neural models for scoring.
If there is a good tutorial in Python, please share it with me. Thanks in advance.
You can customize the similarity function when creating an index; see the Similarity module section of the documentation. You can find a good article comparing classical TF-IDF with BM25 on the OpenSource Connections site.
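As a minimal sketch with the official Python client (the index and field names are made up here), a custom BM25 similarity can be declared in the index settings and attached to a field in the mapping:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Create an index with a tuned BM25 similarity and attach it to a text field.
es.indices.create(
    index="my_corpus",
    body={
        "settings": {
            "index": {
                "similarity": {
                    "my_bm25": {"type": "BM25", "k1": 1.2, "b": 0.5}
                }
            }
        },
        "mappings": {
            "properties": {
                "content": {"type": "text", "similarity": "my_bm25"}
            }
        },
    },
)
```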
In general, Elasticsearch uses an inverted index to look up all documents that contain a specific word or token.
It sounds like you want to use vector fields for scoring; there is a good article on the Elastic blog that explains how you can achieve that. Be aware that, as of now, Elasticsearch does not use vector fields for retrieval, only for scoring. If you want to use vector fields for retrieval, you have to use a plugin, or the OpenSearch fork, or wait for version 8.
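A rough sketch of that scoring pattern on 7.x, assuming a dense_vector field called embedding and a query vector that you compute yourself with your neural model:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Assumes the index mapping contains something like:
# "embedding": {"type": "dense_vector", "dims": 768}
query_vector = [0.1] * 768  # placeholder; normally the output of your encoder

response = es.search(
    index="my_corpus",
    body={
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    # cosineSimilarity can be negative, so shift by 1.0 to keep
                    # the final score non-negative.
                    "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
                    "params": {"query_vector": query_vector},
                },
            }
        }
    },
)
```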
In my opinion, using ANN in real time during search is too slow and expensive, and I have yet to see improvements in relevancy over normal search requests.
I would do the preprocessing of your documents in your own python environment before indexing and not use any Elasticsearch pipelines or plugins. It is easier to debug and iterate outside of Elasticsearch.
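For instance, if you want BERT wordpiece tokens instead of Elasticsearch's built-in analyzers, one possible sketch (index and field names are illustrative) is to tokenize with the Hugging Face tokenizer in Python and index the pre-tokenized text into a field mapped with the whitespace analyzer:

```python
from elasticsearch import Elasticsearch
from transformers import BertTokenizer

es = Elasticsearch()
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

doc_text = "Elasticsearch scores documents with BM25 by default."
tokens = tokenizer.tokenize(doc_text)  # wordpiece tokens, e.g. ['elastic', '##search', ...]

# The content_wordpiece field should be mapped with "analyzer": "whitespace"
# so Elasticsearch does not re-tokenize the pre-tokenized text.
es.index(
    index="my_corpus",
    id="doc-1",
    body={"content": doc_text, "content_wordpiece": " ".join(tokens)},
)
```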
You could also take a look at the Haystack project; it might have a lot of the functionality that you are looking for already built in.

Store and reuse tsfresh feature engineering performed

I am currently using the tsfresh package for a project (predictive maintenance).
It is working really well and now I want to implement it live.
However, the issue is that I don't know how to store the feature engineering that was applied to my original dataset so that I can apply the same feature engineering to the data I am streaming (receiving live).
Do you have any idea if there is a parameter or function that allows me to store the feature engineering performed by tsfresh?
(I am using the extract_relevant_features function).
After searching through various posts, it turns out that you can save the feature extraction settings into a dictionary (see here).
This dictionary can later be passed to the extract_features function so that only those features are extracted.
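A short sketch of that pattern (the dataframes and column names are illustrative):

```python
from tsfresh import extract_relevant_features, extract_features
from tsfresh.feature_extraction.settings import from_columns

# Offline: select the relevant features once on the historical training data.
X_train = extract_relevant_features(
    timeseries_df, y, column_id="id", column_sort="time"
)

# Turn the names of the selected features into a settings dictionary.
# This plain dict can be persisted (e.g. with pickle or json) and shipped to
# the live system.
kind_to_fc_parameters = from_columns(list(X_train.columns))

# Live: compute exactly the same features on newly streamed data.
X_live = extract_features(
    new_timeseries_df,
    column_id="id",
    column_sort="time",
    kind_to_fc_parameters=kind_to_fc_parameters,
)
```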

Best way to manage train/test/val splits on AzureML

I'm currently using AzureML with pretty complex workflows involving large datasets etc., and I'm wondering what the best way is to manage the splits resulting from preprocessing steps. All my projects are built as pipelines fed by registered Datasets. I want to be able to track the splitting in order to easily retrieve, for example, the test and validation sets for integration testing purposes.
What would be the best pattern to apply here? Registering every intermediate set as a different Dataset? Directly retrieving the intermediate sets using the Run IDs? ...
Thanks
I wish I had a more coherent answer; the upside is that you're at the bleeding edge, so should you find a pattern that works for you, you can evangelize it and make it best practice! Hopefully you find my rantings below valuable.
First off: if you aren't already, you should definitely use PipelineData as the intermediate artifact for passing data between PipelineSteps. In this way, you can treat PipelineData as semi-ephemeral: they are materialized should you need them, but it isn't a requirement to keep hold of every single version of every PipelineData. You can always grab them using Azure Storage Explorer or, as you said, using the SDK and walking down from a PipelineRun object.
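A minimal sketch of wiring two steps together with PipelineData (the script names, compute target, and source directory are placeholders):

```python
from azureml.core import Workspace
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Intermediate artifact: materialized in the datastore, but not registered.
featurized_data = PipelineData("featurized_data", datastore=datastore)

featurize_step = PythonScriptStep(
    name="featurize",
    script_name="featurize.py",
    arguments=["--output", featurized_data],
    outputs=[featurized_data],
    compute_target="cpu-cluster",
    source_directory="src",
)

train_step = PythonScriptStep(
    name="train",
    script_name="train.py",
    arguments=["--input", featurized_data],
    inputs=[featurized_data],
    compute_target="cpu-cluster",
    source_directory="src",
)

pipeline = Pipeline(workspace=ws, steps=[featurize_step, train_step])
```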
Another recommendation is to split your workflow into the following pipelines:
featurization pipeline (all joining, munging, and featurizing)
training pipeline
scoring pipeline (if you have a batch score scenario).
The intra-pipeline artifacts are PipelineData, and the inter-pipeline artifacts would be registered Datasets.
To get to your actual question of associating data splits with a model: our team struggled with this, especially because for each train/test/validation split we also have an "extra cols" set which contains either identifiers or leaking variables that the model shouldn't see.
In our current hack implementation, we register our "gold" dataset as an Azure ML Dataset at the end of the featurization pipeline. The first step of our training pipeline is a PythonScriptStep, "Split Data", which contains our train/test/validation split logic and outputs a pickled dictionary as data.pkl. Then we can unpickle it any time we need one of the splits and can join back on the index for any reporting needs. Here's a gist.
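Since the gist itself isn't reproduced here, the following is a hypothetical sketch of what such a "Split Data" script could look like; the column names and paths are made up.

```python
# split_data.py: hypothetical body of the "Split Data" PythonScriptStep.
import argparse
import os
import pickle

import pandas as pd
from sklearn.model_selection import train_test_split

parser = argparse.ArgumentParser()
parser.add_argument("--input", type=str)
parser.add_argument("--output", type=str)
args = parser.parse_args()

df = pd.read_parquet(args.input)

# Keep identifiers / leaky columns separate so the model never sees them.
extra_cols = ["record_id", "leaky_col"]
X = df.drop(columns=extra_cols + ["label"])
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

splits = {
    "X_train": X_train, "X_test": X_test,
    "y_train": y_train, "y_test": y_test,
    "extra_cols": df[extra_cols],
}

os.makedirs(args.output, exist_ok=True)
with open(os.path.join(args.output, "data.pkl"), "wb") as f:
    pickle.dump(splits, f)
```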
Registration is meant to make sharing and reuse easier, so that you can retrieve the dataset by its name. If you expect to reuse the test/validation sets in other experiments, then registering them makes sense. However, if you are just trying to keep a record of what you used for this particular experiment, you can always find that information via the Run, as you suggested.
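For completeness, a sketch of registering a split as a named, versioned Dataset and retrieving it later (the datastore path and dataset name are placeholders):

```python
from azureml.core import Dataset, Workspace

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Register the files written by the split step as a versioned Dataset.
test_set = Dataset.File.from_files(path=(datastore, "splits/test/"))
test_set.register(workspace=ws, name="my-project-test-set", create_new_version=True)

# Later, in another experiment or an integration test:
test_set = Dataset.get_by_name(ws, name="my-project-test-set")
```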

How to use pre labeled training data for Python Dedupe

I am using Python Dedupe package for record linkage tasks.
This means matching company names in one dataset to those in another.
The Dedupe package allows the user to label pairs for training a logistic regression model. However, it's a manual process and one needs to input y/n for each pair shown on screen.
I want to load a training file which has 3 columns, say Company 1, Company 2, and Match,
where Match takes the value yes or no depending on whether Company 1 and Company 2 are the same or different.
I am following this source code but couldn't find a way to load a file for training.
Also, the docs show that one can change the default classifier, but I'm not sure how to do this.
Can anyone please help me with this?
Look up the trainingDataLink function in the dedupe documentation. It’s designed to handle pre-labeled data for record linkage.
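If your labels live in a CSV rather than being derivable from a shared key, another option, sketched here under the assumption of dedupe 2.x and its RecordLink API with made-up field names and toy data, is to convert the file into the match/distinct structure that mark_pairs expects instead of labeling interactively:

```python
import csv
import dedupe

# Single toy field; in practice use your real field definitions.
fields = [{"field": "name", "type": "String"}]
linker = dedupe.RecordLink(fields)

# Toy record dicts standing in for your two datasets.
data_1 = {0: {"name": "Acme Corp"}, 1: {"name": "Globex"}}
data_2 = {0: {"name": "ACME Corporation"}, 1: {"name": "Globex Inc"}}
linker.prepare_training(data_1, data_2)

# Convert the pre-labeled CSV (Company 1, Company 2, Match) into labeled pairs.
labeled = {"match": [], "distinct": []}
with open("training_pairs.csv", newline="") as f:
    for row in csv.DictReader(f):
        pair = ({"name": row["Company 1"]}, {"name": row["Company 2"]})
        if row["Match"].strip().lower() == "yes":
            labeled["match"].append(pair)
        else:
            labeled["distinct"].append(pair)

linker.mark_pairs(labeled)
linker.train()
```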

Methods to extract keywords from large documents that are relevant to a set of predefined guidelines using NLP/ Semantic Similarity

I'm in need of suggestions on how to extract keywords from a large document. The keywords should be in line with what we have defined as the intended search results.
For example,
Given a document about a company, I need the owner's name, where the office is situated, and what the operating industry is, and the defined set of words would be,
{owner, director, office, industry...}-(1)
the intended output has to be something like,
{Mr.Smith James, ,Main Street, Financial Banking}-(2)
I was looking for a method related to semantic similarity, where sentences containing words similar to the given set (1) would be extracted, and POS tagging would then be used to extract nouns from those sentences.
It would be useful if further resources supporting this approach could be provided.
What you want to do is referred to as Named Entity Recognition.
In Python there is a popular library called SpaCy that can be used for that. The standard models are able to detect 18 different entity types, which is a fairly good amount.
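For example, with one of the pretrained English pipelines:

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = (
    "Smith James is the owner of Acme Ltd., a financial banking company "
    "with its head office on Main Street."
)

doc = nlp(text)
for ent in doc.ents:
    # Prints the entity span and its label, e.g. PERSON, ORG, GPE, FAC, ...
    print(ent.text, ent.label_)
```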
Persons and company names should be extracted easily, while whole addresses and the industry might be more difficult. Maybe you would have to train your own model on these entity types. SpaCy also provides an API for training your own models.
Please note that you need quite a lot of training data to get decent results. Start with 1000 examples per entity type and see if that's sufficient for your needs. POS tags can be used as a feature.
If your data is unstructured, this is probably one of the most suitable approaches. If you have more structured data, you could maybe take advantage of that.
