The project I'm working on still uses google-api-python-client, which is deprecated, and the official documentation has no examples for it. I've gotten BigQuery working with it, but I can't figure out how to set configuration properties, specifically so that I can run a query with BATCH priority.
Can anyone point me in the right direction?
The answer is to use jobs().insert() rather than jobs().query(). Inserting a new job asynchronously lets the caller specify a wide range of configuration options, including the query priority, but requires a follow-up call to retrieve the results.
So assuming gs is your authenticated service object:
# insert an asynchronous query job; the priority is set in the query configuration
jobr = gs.jobs().insert(projectId='abc-123', body={
    'configuration': {'query': {'query': 'SELECT COUNT(*) FROM schema.table',
                                'priority': 'BATCH'}}}).execute()
# fetch the query results once the job has completed
gs.jobs().getQueryResults(projectId='abc-123', jobId=jobr['jobReference']['jobId']).execute()
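Because jobs().insert() returns immediately (and BATCH-priority jobs may sit queued for a while), you may want to poll the job status before calling getQueryResults(). A minimal polling sketch, assuming the same gs service object and project as above:
import time

# poll until BigQuery marks the job DONE (sketch only; add a timeout in real code)
job_id = jobr['jobReference']['jobId']
while True:
    status = gs.jobs().get(projectId='abc-123', jobId=job_id).execute()['status']
    if status['state'] == 'DONE':
        if 'errorResult' in status:
            raise RuntimeError(status['errorResult'])
        break
    time.sleep(5)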
The documentation explains how to delete feature tables through the UI.
Is it possible to do the same using the Python FeatureStoreClient? I cannot find anything in the docs: https://docs.databricks.com/_static/documents/feature-store-python-api-reference-0-3-7.pdf
Use case: we use ephemeral dev environments, and we have automated deletion of resources when an environment is torn down. Now we are considering using the Feature Store, but we don't know how to automate deletion.
You can delete a feature table using the Feature Store Python API.
It is described here:
http://docs.databricks.com.s3-website-us-west-1.amazonaws.com/applications/machine-learning/feature-store/feature-tables.html#delete-a-feature-table
Use drop_table to delete a feature table. When you delete a table with drop_table, the underlying Delta table is also dropped.
fs.drop_table(
    name='recommender_system.customer_features'
)
The answer above is correct, but note that drop_table() is marked as experimental in the Databricks documentation for the Feature Store client API, so it could change or be removed at any time. Also note that fs in the snippet above refers to:
from databricks.feature_store import FeatureStoreClient
fs = FeatureStoreClient()
fs.drop_table(name='recommender_system.customer_features')
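For the automated-teardown use case in the question, here is a minimal sketch that drops a known set of feature tables during environment cleanup. The table names are hypothetical, and drop_table is the only Feature Store call used:
from databricks.feature_store import FeatureStoreClient

# hypothetical list of feature tables created by the ephemeral dev environment
DEV_FEATURE_TABLES = [
    'dev_recommender.customer_features',
    'dev_recommender.item_features',
]

fs = FeatureStoreClient()
for table_name in DEV_FEATURE_TABLES:
    try:
        fs.drop_table(name=table_name)   # also drops the underlying Delta table
    except Exception as exc:             # table may already be gone on re-runs
        print(f'Skipping {table_name}: {exc}')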
So I have been asked to write a Python script that pulls out all the Glue databases in our AWS account and then lists all the tables and partitions in each database in a CSV file. It's acceptable for it to just run on a desktop for now. I would really love some guidance on how to go about this, as I'm a new junior and would like to explore my options before going back to my manager.
Format: layout of the CSV file
Can be easily done using Boto3 - https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client
I'll start it off for you and you can figure out the rest.
import boto3
glue_client = boto3.client('glue')
db_name_list = [db['Name'] for db in glue_client.get_databases()['DatabaseList']]
I haven't tested this code, but it should create a list of the names of all your databases. From there you can run nested loops to get your tables with get_tables(DatabaseName=...) and then your partitions with get_partitions(DatabaseName=..., TableName=...).
Make sure to read the documentation to double-check that the arguments you're providing are correct.
EDIT: You will also likely need to use a paginator if you have a large number of values to be returned. Best practice would be to use the paginator for all three calls, which just means an additional loop at each step. Documentation about paginators is here - https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Paginator.GetDatabases
And there are plenty of Stack Overflow examples of how to use them.
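Putting the pieces together, here is a rough sketch (untested, and the CSV columns are just an assumption since the desired layout wasn't shown) that paginates through databases, tables, and partitions and writes one row per partition:
import csv
import boto3

glue_client = boto3.client('glue')

with open('glue_inventory.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['database', 'table', 'partition_values'])  # assumed layout

    db_paginator = glue_client.get_paginator('get_databases')
    for db_page in db_paginator.paginate():
        for db in db_page['DatabaseList']:
            table_paginator = glue_client.get_paginator('get_tables')
            for table_page in table_paginator.paginate(DatabaseName=db['Name']):
                for table in table_page['TableList']:
                    part_paginator = glue_client.get_paginator('get_partitions')
                    for part_page in part_paginator.paginate(
                            DatabaseName=db['Name'], TableName=table['Name']):
                        for partition in part_page['Partitions']:
                            writer.writerow([db['Name'], table['Name'],
                                             '/'.join(partition['Values'])])
                    # note: unpartitioned tables write no rows; add one if needed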
I am currently porting some code from Spark to MSSQL Analytical Services with Python. Everything is nice and dandy, but I am not sure whether my solution is the right one for passing multiple inputs to the script.
Consider the following code snippet:
DROP PROCEDURE IF EXISTS SampleModel;
GO
CREATE PROCEDURE SampleModel
AS
BEGIN
exec sp_execute_external_script
@language = N'Python',
@script = N'
import sys
sys.path.append(r"C:\path\to\custom\package")
from super_package.sample_model import run_model
OutputDataSet = run_model()'
WITH RESULT SETS ((Score float));
END
GO
INSERT INTO [dbo].[SampleModelPredictions] (prediction) EXEC [dbo].[SampleModel]
GO
I have a custom package called super_package and a sample model called sample_model. Since this model uses multiple database tables as input and I would rather have everything in one place, I have a module that connects to the database and fetches the data directly:
from revoscalepy import RxSqlServerData, rx_data_step

def go_go_get_data(query, config):
    # run the query against SQL Server and return the result as a data frame
    return rx_data_step(RxSqlServerData(
        sql_query=query,
        connection_string=config.connection_string,
        user=config.user,
        password=config.password))
Inside the run_model() function I fetch all necessary data from the database with the go_go_get_data function.
If the data is too big to handle in one go, I would do some pagination.
In general I cannot join the tables, so passing a single pre-joined result set into the procedure doesn't work.
The question is: Is this the right approach to tackle the problem, or did I miss something? For now it works, but as I am still in the development / tryout phase I cannot be certain that it will scale. I would rather pass the data in as parameters of the stored procedure than fetch it inside the Python context.
As you've already figured out, sp_execute_external_script only allows one result set to be passed in. :-(
You can certainly query from inside the script to fetch data as long as your script is okay with the fact that it's not executing under the current SQL session's user's permissions.
If pagination is important and one data set is significantly larger than the others and you're using Enterprise Edition, you might consider passing the largest data set into the script in chunks using sp_execute_external_script's streaming feature.
If you'd like all of your data to be assembled in SQL Server (vs. fetched by queries in your script), you could try to serialize the result sets and then pass them in as parameters (the linked article describes how to do this in R, but something similar should be possible with Python).
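As a very rough sketch of the serialization idea on the Python side (the parameter name serialized_features and the use of pickle are my assumptions, not something the external-script API prescribes), you would declare a varbinary(max) parameter via @params on the SQL side and rebuild the data frame inside the script:
# helpers that could live in your custom package; one script serializes a
# DataFrame to bytes so SQL Server can hand it around as varbinary(max),
# the main script rebuilds it
import pickle
import pandas as pd

def to_varbinary(df: pd.DataFrame) -> bytes:
    # e.g. OutputDataSet = pd.DataFrame({'payload': [to_varbinary(some_df)]})
    return pickle.dumps(df)

def from_varbinary(raw: bytes) -> pd.DataFrame:
    # e.g. features_df = from_varbinary(serialized_features)  # @params-bound
    return pickle.loads(raw)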
I've been able to append/create a table from a Pandas dataframe using the pandas-gbq package, in particular using the to_gbq method. However, when I want to check the table using the BigQuery web UI I see the following message:
This table has records in the streaming buffer that may not be visible in the preview.
I'm not the only one to ask, and it seems that there's no solution to this yet.
So my questions are:
1. Is there a solution to the above problem (namely, the data not being visible in the web UI)?
2. If there is no solution to (1), is there another way that I can append data to an existing table using the Python BigQuery API? (Note: the documentation says that I can achieve this by running an asynchronous query and using writeDisposition=WRITE_APPEND, but the link it provides doesn't explain how to use it and I can't work it out.)
That message is just a UI notice; it should not hold you back.
To check the data, run a simple query and see if it's there.
To read only the data that is still in the streaming buffer, use this query:
#standardSQL
SELECT count(1)
FROM `dataset.table` WHERE _PARTITIONTIME is null
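Regarding question 2, appending via an asynchronous query job comes down to setting destinationTable and writeDisposition in the job configuration. A minimal sketch using the same jobs().insert() pattern as in the first answer above (the project, dataset, table, and query here are placeholders):
from googleapiclient.discovery import build

# assumes application-default credentials are available in the environment
bq = build('bigquery', 'v2')
body = {
    'configuration': {
        'query': {
            'query': 'SELECT * FROM staging_dataset.new_rows',   # placeholder query
            'destinationTable': {'projectId': 'my-project',      # placeholder names
                                 'datasetId': 'my_dataset',
                                 'tableId': 'my_table'},
            'writeDisposition': 'WRITE_APPEND',
            'useLegacySql': False,
        }
    }
}
job = bq.jobs().insert(projectId='my-project', body=body).execute()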
Recently the Cloud Dataflow Python SDK was made available, and I decided to use it. Unfortunately, support for reading from Cloud Datastore is yet to come, so I have to fall back on writing a custom source so that I can utilize the benefits of dynamic splitting, progress estimation, etc. as promised. I did study the documentation thoroughly, but I am unable to put the pieces together so that I can speed up my entire process.
To be clearer, my first approach was:
querying the Cloud Datastore
creating a ParDo function and passing the returned query results to it
But with this it took 13 minutes to iterate over 200k entries.
So I decided to write a custom source that would read the entities efficiently, but I am unable to achieve that due to my lack of understanding of how to put the pieces together. Can anyone please help me with how to create a custom source for reading from Datastore?
Edited:
For the first approach, the link to my gist is:
https://gist.github.com/shriyanka/cbf30bbfbf277deed4bac0c526cf01f1
Thank you.
In the code you provided, the access to Datastore happens before the pipeline is even constructed:
query = client.query(kind='User').fetch()
This executes the whole query and reads all entities before the Beam SDK gets involved at all.
More precisely, fetch() returns a lazy iterable over the query results, and they get iterated over when you construct the pipeline, at beam.Create(query) - but, once again, this happens in your main program, before the pipeline starts. Most likely, this is what's taking 13 minutes, rather than the pipeline itself (but please feel free to provide a job ID so we can take a deeper look). You can verify this by making a small change to your code:
query = list(client.query(kind='User').fetch())
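For example, a quick way to confirm where the time goes (a sketch, assuming the google-cloud-datastore client as in your gist) is to time the materialized fetch directly:
import time
from google.cloud import datastore

client = datastore.Client()
start = time.time()
entities = list(client.query(kind='User').fetch())  # forces the full fetch here
print('Fetched %d entities in %.1f seconds' % (len(entities), time.time() - start))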
However, I think your intention was to both read and process the entities in parallel.
For Cloud Datastore in particular, the custom source API is not the best choice to do that. The reason is that the underlying Cloud Datastore API itself does not currently provide the properties necessary to implement the custom source "goodies" such as progress estimation and dynamic splitting, because its querying API is very generic (unlike, say, Cloud Bigtable, which always returns results ordered by key, so e.g. you can estimate progress by looking at the current key).
We are currently rewriting the Java Cloud Datastore connector to use a different approach, which uses a ParDo to split the query and a ParDo to read each of the sub-queries. Please see this pull request for details.
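To illustrate the shape of that ParDo-based pattern in the Python SDK (this is only a sketch, not the connector's actual implementation; datastore_query is a placeholder and split_query / run_subquery are hypothetical helpers you would write against the Datastore client):
import apache_beam as beam

class SplitQuery(beam.DoFn):
    def process(self, query):
        # break the single Datastore query into several key-range sub-queries
        # (split_query is a hypothetical helper)
        for sub_query in split_query(query, num_splits=32):
            yield sub_query

class ReadSubQuery(beam.DoFn):
    def process(self, sub_query):
        # execute one sub-query and emit its entities
        # (run_subquery is a hypothetical helper)
        for entity in run_subquery(sub_query):
            yield entity

with beam.Pipeline() as p:
    entities = (p
                | beam.Create([datastore_query])   # the original query (placeholder)
                | 'Split' >> beam.ParDo(SplitQuery())
                | beam.Reshuffle()                 # spread sub-queries across workers
                | 'Read' >> beam.ParDo(ReadSubQuery()))
    # ... further processing of `entities` goes here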