How to delete Databricks feature tables through the Python API

The documentation explains how to delete feature tables through the UI.
Is it possible to do the same using the Python FeatureStoreClient? I cannot find anything in the docs: https://docs.databricks.com/_static/documents/feature-store-python-api-reference-0-3-7.pdf
Use case: we use ephemeral dev environments for development and we have automated deletion of resources when the environment is torn down. Now we are considering using the feature store, but we don't know how to automate deletion.

You can delete a feature table using the Feature Store Python API.
It is described here:
http://docs.databricks.com.s3-website-us-west-1.amazonaws.com/applications/machine-learning/feature-store/feature-tables.html#delete-a-feature-table
Use drop_table to delete a feature table. When you delete a table with drop_table, the underlying Delta table is also dropped.
fs.drop_table(
name='recommender_system.customer_features'
)

The answer above is correct, but note that drop_table() is marked as experimental in the Databricks documentation for the Feature Store client API, so it could change or be removed at any time. Also, for completeness, fs in the snippet above refers to:
from databricks.feature_store import FeatureStoreClient
fs = FeatureStoreClient()
fs.drop_table(name='recommender_system.customer_features')
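For the automated-teardown use case in the question, a minimal sketch of a cleanup step could look like the following. The list of table names and the error handling are assumptions for illustration, not part of the Feature Store API:

from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# Hypothetical list of feature tables created by the ephemeral dev environment.
DEV_FEATURE_TABLES = [
    'recommender_system.customer_features',
    'recommender_system.item_features',
]

for table_name in DEV_FEATURE_TABLES:
    try:
        # Drops the feature table and its underlying Delta table.
        fs.drop_table(name=table_name)
    except Exception as e:
        # The table may already be gone in a partially torn-down environment.
        print(f"Skipping {table_name}: {e}")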

Related

DataBricks (10.2) Undocumented Case Sensitivity Related to Feature Store Database/Table Access

I created an input table intended to feed the DataBricks Feature Store, mounting it (in Linux) and calling it as prescribed in the DataBricks documentation (from their "RawDatasets" code example):
SourceDataFrameName_df = spark \
.read \
.format('delta') \
.load("dbfs:/mnt/path/dev/version/database_name.tablename_extension")
However, this call fails with a "not found"/"doesn't exist" error when locating the "database_name.tablename_extension" resource. This is how the name displays everywhere within the DataBricks GUI, that is, as all lower-case.
I spent much time reviewing DataBricks documentation and SO while reviewing my DataBricks system setup but cannot find the solution to this error. Please assist.
This is an as-yet undocumented issue related to the nature of DataBricks Feature Store operations. Since DataBricks is largely pass-through (using registered views rather than storing the source data), the mount is a key issue here.
This issue may not be documented/highlighted adequately in their documentation because it is actually a Linux-thing, since that operating system is case-sensitive (whereas DataBricks appears to be largely case-agnostic). In this example, the original database/Linux engineer created the table/mount this way:
database_name.TableName_Extension
Since the mount references a Linux path, the path is case-sensitive, too. So, the proper way to load this source dataset from such a mount would be:
SourceDataFrameName_df = spark \
.read \
.format('delta') \
.load("dbfs:/mnt/path/dev/version/database_name.TableName_Extension")
The problem is that this case-sensitive nomenclature could potentially be unknown (and unknowable) if the DataBricks developer/engineer and the database/Linux developer/engineer are not the same person! For example, it might have been labeled "database_name.Tablename_extension" or "database_name.TableName_EXTENSION" or any other combination thereof.
Obviously, this information isn't difficult to find if the user knows to look for it, for example by listing the mount contents as in the sketch below. Beware.
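One quick way to discover the exact, case-sensitive name is to list the mount from a notebook, assuming dbutils is available there. The path below is the one from the question:

# List the mount and inspect what is actually stored there.
for f in dbutils.fs.ls("dbfs:/mnt/path/dev/version/"):
    print(f.name)   # e.g. prints "database_name.TableName_Extension/"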

persist data to DynamoDB in Alexa hosted app

Are there any really good articles breaking down how to persist data into DynamoDB from Alexa? I can't seem to find a good article that breaks down, step by step, how to persist a slot value into DynamoDB. I see the Alexa docs here about implementing the code in Python, but that seems to be only part of what I'm looking for.
There's really no comprehensive breakdown of this comparable to this tutorial, which persists data to S3. I would like to find something similar for DynamoDB. If a previous question has already answered this, let me know and I can mark it as a duplicate.
You can just use a tutorial that covers Python and AWS Lambda.
Like this one:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GettingStarted.Python.html
The Amazon article is more about the development kit, which gives you some nice features for storing persistent attributes for users.
So usually I have a persistent store for users (game scores, ..., last use of the skill, whatever) and additional data in another table.
The persistence adapter has an interface spec that abstracts away most of the details operationally. You should be able to change persistence adapters by initializing one that meets the spec, and in the initialization there may be some different configuration options. But the way you put things in and get them out should remain functionally the same.
You can find the configuration options for S3 and Dynamo here. https://developer.amazon.com/en-US/docs/alexa/alexa-skills-kit-sdk-for-python/manage-attributes.html
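For the Python SDK specifically, initializing the DynamoDB adapter looks roughly like this. This is a minimal sketch; the table name, region, and the commented slot handling are placeholders rather than anything from the question:

import boto3
from ask_sdk_core.skill_builder import CustomSkillBuilder
from ask_sdk_dynamodb.adapter import DynamoDbAdapter

# Hypothetical table name and region for this sketch.
ddb_adapter = DynamoDbAdapter(
    table_name="MySkillAttributes",
    create_table=True,
    dynamodb_resource=boto3.resource("dynamodb", region_name="us-east-1"),
)

sb = CustomSkillBuilder(persistence_adapter=ddb_adapter)

# Inside a handler, the attributes manager works the same as with S3:
#   attrs = handler_input.attributes_manager.persistent_attributes
#   attrs["favorite_color"] = slot_value  # hypothetical slot value
#   handler_input.attributes_manager.save_persistent_attributes()

Because the handlers only talk to the attributes manager, swapping this adapter for the S3 one (or a local one, as described below) should not require changing handler code.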
I have written a "local persistence adapter" in JavaScript to let me store values in flat files at localhost instead of on S3 when I'm doing local dev/debug. Swapping the two out (depending on environment) is all handled at adapter initialization. My handlers that use the attributes manager don't change.

How to copy Elasticsearch data to SQL Server

I want to send data from Kibana (Elasticsearch) to MySQL.
Is there any simple way to do so directly, or is it possible through Python?
I think the whole task can be divided into two parts:
how to fetch the data from elasticsearch (you can do it via python): https://elasticsearch-py.readthedocs.io/en/master/
how to add data to mysql (you can do it via python):
https://dev.mysql.com/doc/connector-python/en/connector-python-example-cursor-transaction.html
Btw, you can check this page to find out the sample script for getting all documents from one index in ES via python: https://discuss.elastic.co/t/get-all-documents-from-an-index/86977
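Putting the two parts together might look like the sketch below, using the elasticsearch and mysql-connector-python packages. The index name, query, table, columns, and credentials are placeholders:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan
import mysql.connector

es = Elasticsearch(["http://localhost:9200"])
cnx = mysql.connector.connect(user="user", password="pass",
                              host="localhost", database="mydb")
cursor = cnx.cursor()

# scan() streams all documents from the index without hitting size limits.
for hit in scan(es, index="my_index", query={"query": {"match_all": {}}}):
    doc = hit["_source"]
    cursor.execute(
        "INSERT INTO my_table (id, payload) VALUES (%s, %s)",
        (hit["_id"], str(doc)),
    )

cnx.commit()
cursor.close()
cnx.close()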
What you need is called an ETL. I am not giving an exact answer, since your question is quite general.
You can develop a small Python script to achieve this, but in general it is more useful to use a real ETL tool.
I recommend Apache Spark, with the official elasticsearch-hadoop plugin:
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
https://docs.databricks.com/spark/latest/data-sources/sql-databases.html#write-data-to-jdbc
Example in Scala (but you could use Python, Java or R):
val df = sqlContext.read.format("org.elasticsearch.spark.sql").load("spark/trips")
df.write.jdbc(jdbcUrl, "_table_")
The benefits:
- Spark will distribute the work across workers (it will read all Elasticsearch shards at the same time!)
- Handles failover
- Lets you modify the data
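A rough PySpark equivalent of the Scala snippet above, since the question asks about Python. The Elasticsearch nodes, index, JDBC URL, table, and credentials are placeholders, and the elasticsearch-hadoop connector plus a MySQL JDBC driver must be available on the cluster:

df = (spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "localhost:9200")
      .load("spark/trips"))

df.write.jdbc(
    url="jdbc:mysql://localhost:3306/mydb",
    table="trips",
    mode="append",
    properties={"user": "user",
                "password": "pass",
                "driver": "com.mysql.cj.jdbc.Driver"},
)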

Creating custom source for reading from cloud datastore using latest python apache_beam cloud datafow sdk

Recently the Cloud Dataflow Python SDK was made available and I decided to use it. Unfortunately, support for reading from Cloud Datastore is yet to come, so I have to fall back on writing a custom source so that I can utilize the benefits of dynamic splitting, progress estimation, etc. as promised. I studied the documentation thoroughly but am unable to put the pieces together so that I can speed up my entire process.
To be more clear my first approach was:
querying the cloud datastore
creating a ParDo function and passing the returned query results to it.
But with this it took 13 minutes to iterate over 200k entries.
So I decided to write a custom source that would read the entities efficiently. But I am unable to achieve that due to my lack of understanding of how to put the pieces together. Can anyone please help me with how to create a custom source for reading from Datastore?
Edited:
For first approach the link to my gist is:
https://gist.github.com/shriyanka/cbf30bbfbf277deed4bac0c526cf01f1
Thank you.
In the code you provided, the access to Datastore happens before the pipeline is even constructed:
query = client.query(kind='User').fetch()
This executes the whole query and reads all entities before the Beam SDK gets involved at all.
More precisely, fetch() returns a lazy iterable over the query results, and they get iterated over when you construct the pipeline, at beam.Create(query) - but, once again, this happens in your main program, before the pipeline starts. Most likely, this is what's taking 13 minutes, rather than the pipeline itself (but please feel free to provide a job ID so we can take a deeper look). You can verify this by making a small change to your code:
query = list(client.query(kind='User').fetch())
However, I think your intention was to both read and process the entities in parallel.
For Cloud Datastore in particular, the custom source API is not the best choice to do that. The reason is that the underlying Cloud Datastore API itself does not currently provide the properties necessary to implement the custom source "goodies" such as progress estimation and dynamic splitting, because its querying API is very generic (unlike, say, Cloud Bigtable, which always returns results ordered by key, so e.g. you can estimate progress by looking at the current key).
We are currently rewriting the Java Cloud Datastore connector to use a different approach, which uses a ParDo to split the query and a ParDo to read each of the sub-queries. Please see this pull request for details.
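A rough Python sketch of that split-then-read pattern is below. This is not the actual connector code from the pull request; the 'shard' property used to split the kind into sub-queries is an assumption about the data, used only to illustrate the idea of reading in parallel inside the pipeline:

import apache_beam as beam

NUM_SPLITS = 32  # hypothetical number of sub-queries

class ReadShard(beam.DoFn):
    """Reads one slice of the 'User' kind; Beam runs these DoFns in parallel."""
    def process(self, shard):
        from google.cloud import datastore  # imported per worker
        client = datastore.Client()
        query = client.query(kind='User')
        # Assumes entities carry an integer 'shard' property in [0, NUM_SPLITS).
        query.add_filter('shard', '=', shard)
        for entity in query.fetch():
            yield entity

with beam.Pipeline() as p:
    entities = (
        p
        | 'SubQueries' >> beam.Create(list(range(NUM_SPLITS)))
        | 'ReadEntities' >> beam.ParDo(ReadShard())
    )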

A good blobstore / memcache solution

Setting up a data warehousing / mining project on a Linux cloud server. The primary language is Python.
Would like to use this pattern for querying on data and storing data:
SQL Database - the SQL database is used to query the data. However, it stores only the fields that need to be searched on; it does NOT store the "blob" of data itself. Instead, it stores a key that references the full "blob" of data in a key-value Blobstore.
Blobstore - A key-value Blobstore is used to store actual "documents" or "blobs" of data.
The issue we are having is that we would like more frequently accessed blobs of data to be automatically stored in RAM. We were planning to use Redis for this. However, we would like a solution that automatically tries to get the data out of RAM first and, if it can't find it there, falls back to the blobstore.
Is there a good library or ready-made solution for this that we can use without rolling our own? Also, any comments and criticisms about the proposed architecture would also be appreciated.
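For clarity, the behaviour we are hoping to get off the shelf is roughly this read-through pattern, sketched here with redis-py and a hypothetical blobstore client:

import redis

r = redis.Redis(host="localhost", port=6379)

def get_blob(key, blobstore, ttl_seconds=3600):
    """Try RAM (Redis) first; on a miss, fall back to the blobstore and cache the result."""
    cached = r.get(key)
    if cached is not None:
        return cached
    blob = blobstore.get(key)   # hypothetical blobstore client
    if blob is not None:
        r.set(key, blob, ex=ttl_seconds)
    return blob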
Thanks so much!
Rather than using Redis or Memcached for caching, plus a "blobstore" package to store things on disk, I would suggest to have a look at Couchbase Server which does exactly what you want (i.e. serving hot blobs from memory, but still storing them to disk).
In the company I work for, we commonly use the pattern you described (i.e. indexing in a relational database, plus blob storage) for our archiving servers (terabytes of data). It works well when the I/O done to write the blobs is kept sequential. The blobs are never rewritten, but simply appended at the end of a file (which is fine for an archiving application).
The same approach has been also used by others. For instance:
Bitcask (used in Riak): http://downloads.basho.com/papers/bitcask-intro.pdf
Eblob (used in Elliptics project): http://doc.ioremap.net/eblob:eblob
Any SQL database will work for the first part. The Blobstore could also be obtained, essentially, "off the shelf" by using cbfs. This is a new project, built on top of couchbase 2.0, but it seems to be in pretty active development.
CouchBase already tries to serve results out of RAM cache before checking disk, and is fully distributed to support large data sets.
CBFS puts a filesystem on top of that, and already there is a FUSE module written for it.
Since filesystems are effectively the lowest common denominator, it should be really easy for you to access it from Python, and it would reduce the amount of custom code you need to write.
Blog post:
http://dustin.github.com/2012/09/27/cbfs.html
Project Repository:
https://github.com/couchbaselabs/cbfs
