I have written a stored procedure in BigQuery and I am trying to call it from a Dataflow pipeline. This works for SELECT queries but not for the stored procedure:
import apache_beam as beam

pipeLine = beam.Pipeline(options=options)
rawdata = (
    pipeLine
    | beam.io.ReadFromBigQuery(
        query="CALL my_dataset.create_customer()",
        use_standard_sql=True)
)
pipeLine.run().wait_until_finish()
Stored procedure:
CREATE OR REPLACE PROCEDURE my_dataset.create_customer()
BEGIN
  SELECT *
  FROM `project_name.my_dataset.my_table`
  WHERE customer_name LIKE "%John%"
  ORDER BY created_time
  LIMIT 5;
END;
I am able to create the stored procedure and call it from the BigQuery console. But in the Dataflow pipeline, calling it throws an error:
"code": 400,
"message": "configuration.query.destinationEncryptionConfiguration cannot be set for scripts",
"message": "configuration.query.destinationEncryptionConfiguration cannot be set for scripts",
"domain": "global",
"reason": "invalid"
"status": "INVALID_ARGUMENT"
Edit:
Is there any other method in Beam that I can use to call the stored procedure in BigQuery?
I have seen multiple threads on the same issue but did not find an answer, so I thought I would post this question. Thank you for any help.
The principle of a procedure is to perform a job and to return nothing. The principle of a function is to perform a job and to return something.
You can't use a stored procedure as the source of a read in Dataflow, so your error is expected. Parameterized views are in the pipeline to achieve what you want. For now, the solution is to use a UDF or to write the query directly in your code.
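For example, the procedure's SELECT can be written directly in the pipeline (a sketch based on the procedure body in the question):

# Sketch: read with the procedure's query inlined instead of CALL.
rawdata = (
    pipeLine
    | beam.io.ReadFromBigQuery(
        query="""
            SELECT *
            FROM `project_name.my_dataset.my_table`
            WHERE customer_name LIKE "%John%"
            ORDER BY created_time
            LIMIT 5
        """,
        use_standard_sql=True)
)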
EDIT 1
What do you want to do?
Do you want to get data? If so, a procedure is not what you should use.
Do you want to simply call a stored procedure? If so, perform an API call with the BigQuery client library to run a CALL query, that's all. But you should also rework your stored procedure: right now it is only a projection, which is not useful as a procedure.
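For example, with the google-cloud-bigquery client library (a minimal sketch; the project and dataset names are taken from the question):

# Sketch: run the CALL as a regular query job with the BigQuery client library.
from google.cloud import bigquery

client = bigquery.Client(project="project_name")
job = client.query("CALL my_dataset.create_customer()")
job.result()  # wait for the script to finish; a procedure returns no rows to Dataflow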
Related
I'm working on an ETL with Apache Beam and Dataflow using Python, and I'm using BigQuery as a database/data warehouse.
The ETL basically performs some processing and then updates data that is already in BigQuery. Since there is no update transform in Apache Beam, I had to use the BigQuery SDK to write my own UPDATE query and map it over each row.
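The per-row update is done in a DoFn along these lines (a simplified sketch, not my exact code; table, column, and type names are placeholders):

# Simplified sketch of the per-row UPDATE issued from a DoFn.
import apache_beam as beam
from google.cloud import bigquery

class UpdateRow(beam.DoFn):
    def setup(self):
        self.client = bigquery.Client()

    def process(self, element):
        job_config = bigquery.QueryJobConfig(query_parameters=[
            bigquery.ScalarQueryParameter("id", "STRING", element["id"]),
            bigquery.ScalarQueryParameter("value", "STRING", element["value"]),
        ])
        self.client.query(
            "UPDATE `project.dataset.my_table` SET value = @value WHERE id = @id",
            job_config=job_config,
        ).result()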
The queries work fine when done sequentially, but when I use multiple workers, I get the following error:
{'reason': 'invalidQuery', 'message': 'Could not serialize access to table my_table due to concurrent update'}
I made sure that the same row is never accessed/updated concurrently (a row is basically an id, and each id is unique). I've also tried running the same code in a simple Python script without Beam/Dataflow, and I still got the same error once I started using multiple threads instead of one.
Has anyone run into the same problem using the BigQuery SDK? Do you have any suggestions for avoiding it?
I think it's better to have your Beam/Dataflow job append the data.
BigQuery is append-oriented, and BigQueryIO in Beam is designed for append operations.
If you have an orchestrator like Cloud Composer/Airflow or Cloud Workflows, you can deduplicate the data in batch mode with the following steps (an Airflow sketch of this wiring follows the merge query example below):
Create a staging table and a final table
Your orchestrator truncates the staging table
Your orchestrator runs your Dataflow job
The Dataflow job reads your data
The Dataflow job writes the result in append mode to the staging table in BigQuery
Your orchestrator runs a task with a merge query in BigQuery between the staging and final tables. The merge query inserts new rows into the final table and updates the rows that already exist.
https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax?hl=en#merge_statement
Example of a merge query:
MERGE dataset.Inventory T
USING dataset.NewArrivals S
ON T.product = S.product
WHEN MATCHED THEN
  UPDATE SET quantity = T.quantity + S.quantity
WHEN NOT MATCHED THEN
  INSERT (product, quantity) VALUES (product, quantity)
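With Cloud Composer/Airflow, the steps above could be wired together roughly like this (a sketch only; I'm assuming the Google provider's BigQueryInsertJobOperator, the Dataflow trigger is left as a placeholder, and DAG, dataset, and table names are to be adapted):

# Hypothetical Airflow sketch of the truncate -> Dataflow -> merge flow.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator  # Airflow 2.3+; use DummyOperator on older versions
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

MERGE_QUERY = """
MERGE dataset.final T
USING dataset.staging S
ON T.product = S.product
WHEN MATCHED THEN
  UPDATE SET quantity = T.quantity + S.quantity
WHEN NOT MATCHED THEN
  INSERT (product, quantity) VALUES (product, quantity)
"""

with DAG("staging_merge", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    truncate_staging = BigQueryInsertJobOperator(
        task_id="truncate_staging",
        configuration={"query": {"query": "TRUNCATE TABLE dataset.staging",
                                 "useLegacySql": False}},
    )

    # Placeholder: replace with the operator you use to launch the Dataflow job
    # that appends into dataset.staging.
    run_dataflow_job = EmptyOperator(task_id="run_dataflow_job")

    merge_to_final = BigQueryInsertJobOperator(
        task_id="merge_to_final",
        configuration={"query": {"query": MERGE_QUERY, "useLegacySql": False}},
    )

    truncate_staging >> run_dataflow_job >> merge_to_final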
I had a use case with a BQ table containing around 150K records, and I needed to update its content monthly (which meant around 100K UPDATEs and a couple of thousand APPENDs).
When I designed my Beam/Dataflow job to update the records with the BQ Python API library, I ran into quota issues (a limited number of updates) as well as the concurrency issue.
I had to change my pipeline's approach: instead of reading the BQ table and updating records in place, it now processes the BQ table, updates what needs to be updated, appends what's new, and saves everything to a new BQ table.
Once the job finishes successfully with no errors, you can replace the old table with the newly created one.
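Replacing the old table with the new one can be done with a copy job, roughly like this (a sketch; table names are placeholders):

# Sketch: overwrite the old table with the newly created one via a copy job.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.CopyJobConfig(write_disposition="WRITE_TRUNCATE")
client.copy_table(
    "project.dataset.my_table_new",  # newly created table
    "project.dataset.my_table",      # old table being replaced
    job_config=job_config,
).result()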
GCP mentions:
Running two mutating DML statements concurrently against a table will succeed as long as the two statements don’t modify data in the same partition. Two jobs that try to mutate the same partition may sometimes experience concurrent update failures.
And then:
BigQuery now handles such failures automatically. To do this, BigQuery will restart the job.
Can this retry mechanism be a solution at all? Can anyone elaborate on this?
Source: https://cloud.google.com/blog/products/data-analytics/dml-without-limits-now-in-bigquery
I'm trying to convert my Python ETLs to Airflow.
I have an ETL written in Python that copies data from Elastic to MSSQL.
I've built a DAG with 3 tasks.
task 1- get the latest date from the table in MSSQL
task 2- generate an Elastic query based on the date retrieved from the previous task, plus some filters (must_not and should) taken from a different table in MSSQL (less relevant),
eventually generating a body like so:
{ "query": {
"bool": {
"filter": {
"range": {
"#timestamp": { "gt": latest_timestamp }
}
},
"must_not": [],
"should": [],
"minimum_should_match": 1
}
}
}
task 3- scroll the Elastic index using the body generated in the previous task and write the data to MSSQL.
My DAG fails on the 3rd task with the error:
parsing exception: Expected [START_OBJECT] but found [START_ARRAY]
I've taken the generated body and run it against Elastic in Dev Tools, and it works fine.
So I have no idea what the problem is or how to debug it.
Any ideas?
I've found the problem.
I use XCom to pass the body between task 2 and task 3.
Apparently something in the XCom is messing with the body (I can't see anything wrong in the UI, anyway).
When I put the body-building logic (task 2) in the same task and call the search with that body directly, without passing it via XCom, everything works as expected.
So beware when using XCom; it apparently has side effects.
I'm not a big ELK guy, but I would assume a different format is required there.
When you "scroll the Elastic index", you are most probably using an API that expects one query format, while the Dev Tools console expects another.
E.g. this thread:
https://discuss.elastic.co/t/unable-to-send-json-data-to-elastic-search/143506/3
Kibana Post Search - Expected [START_OBJECT] but found [VALUE_STRING]
So, check what format is expected by the API method you use to scroll through the data. Or, if it's still unclear, please share the function you use for scrolling in task 3.
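For example, if task 3 uses the official Python client, helpers.scan expects the query body as a plain dict (a sketch; host, index name, and timestamp value are placeholders, with the field name taken from the question):

# Sketch: scrolling with elasticsearch.helpers.scan, passing the body as a dict.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch("http://localhost:9200")
body = {"query": {"bool": {"filter": {"range": {"#timestamp": {"gt": "2021-01-01T00:00:00"}}}}}}

for hit in scan(es, index="my-index", query=body):
    print(hit["_source"])  # write each document to MSSQL here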
Related
GRANT EXECUTE permission to ALL STORED PROCEDURES in Snowflake.
I have created a stored procedure in the Snowflake database, but I am getting an error when trying to execute it.
create or replace procedure get_column_scale(column_index float)
returns float not null
language javascript
as
$$
    var stmt = snowflake.createStatement(
        {sqlText: "select EmployeeKey, EmployeeCode from stproc_test_employees;"}
    );
    stmt.execute();  // ignore the result set; we just want the scale.
    return stmt.getColumnScale(COLUMN_INDEX);  // Get by column index (1-based)
$$
;
I am executing it like below:
CALL get_column_scale(1);
I'm getting this error when trying to execute the stored procedure in Snowflake:
Error [100183] [P0000]: Execution error in stored procedure GET_COLUMN_SCALE:
compilation error:
'SYEMPLOYEES' does not exist or not authorized.
Statement.execute, line 5 position 9
I am thinking it's an execute permission I need to add, but I have no idea where to configure stored procedure permissions in Snowflake.
Does anyone have an idea about how to grant permissions for a stored procedure/table?
A few things that might help you.
I'd recommend fully qualifying that table name in the SELECT statement. This way, whenever the stored procedure is called, the "context" of the user's session will not matter; as long as the session's current role has access to the table and schema, you should be good.
A fully-qualified table has the form: database_name.schema_name.object_name
Example: hr_prod_db.hr_schema.employees
You can read more about object name resolution at this link: https://docs.snowflake.net/manuals/sql-reference/name-resolution.html
I'd recommend you spend a little bit of time reading about "Session State" at the following link, as it discusses "Caller's rights" vs. "Owner's rights" stored procedures. If your procedure is only going to be called from a session with the role of the stored procedure owner, this shouldn't matter, but if you are granting USAGE on the procedure to another role, it's very important to understand this and set it properly.
https://docs.snowflake.net/manuals/sql-reference/stored-procedures-usage.html#session-state
If your procedure is going to be called by a session whose current role is different from the owning role, you'll need to ensure the proper grants on the procedure (and its schema and database) to the role that will execute it. This is all outlined quite thoroughly in the following document. Pay particular attention to this, since in your example code the table or view name differs from what your error message reports, so perhaps stproc_test_employees is a view on top of SYEMPLOYEES:
https://docs.snowflake.net/manuals/sql-reference/stored-procedures-usage.html#access-control-privileges
Note: when/if you grant usage on this procedure to another role, you will need to include the data types of the arguments, for example:
GRANT USAGE ON database_name.schema_name.get_column_scale(float) TO ROLE other_role_name_here;
I hope this helps...Rich
For those reading this answer in 2022, the correct syntax for giving permission to execute a procedure is as follows:
GRANT USAGE ON PROCEDURE
get_column_scale(float)
TO ROLE other_role_name_here;
The project I'm working on still uses google-api-python-client, which is deprecated, and the official documentation has no examples for it. I've gotten BigQuery working with it, but I can't figure out how to set configuration properties, specifically so that I can run a query with BATCH priority.
Can anyone point me in the right direction?
The answer is to use jobs().insert() rather than jobs().query(). Inserting a new job asynchronously gives the caller the ability to specify a wide range of options (including the query priority) but requires another call to get the results.
So assuming gs is your authenticated service object:
# insert an asynchronous job; setting 'priority': 'BATCH' in the query configuration runs it at batch priority
body = {'configuration': {'query': {'query': 'SELECT COUNT(*) FROM schema.table',
                                    'priority': 'BATCH'}}}
jobr = gs.jobs().insert(projectId='abc-123', body=body).execute()
# get the query results of the job
gs.jobs().getQueryResults(projectId='abc-123', jobId=jobr['jobReference']['jobId']).execute()
I have a seemingly simple problem when constructing my pipeline for Dataflow. I have multiple pipelines that fetch data from external sources, transform the data and write it to several BigQuery tables. When this process is done I would like to run a query that queries the just generated tables. Ideally I would like this to happen in the same job.
Is this the way Dataflow is meant to be used, or should the loading to BigQuery and the querying of the tables be split up between jobs?
If this is possible in the same job, how would one solve it, given that BigQuerySink does not produce a PCollection? If it is not possible in the same job, is there some way to trigger one job on the completion of another (i.e. the writing job and then the querying job)?
You alluded to what would need to happen to do this in a single job -- the BigQuerySink would need to produce a PCollection. Even if it is empty, you could then use it as the input to the step that reads from BigQuery in a way that made that step wait until the first sink was done.
You would need to create your own version of the BigQuerySink to do this.
If possible, an easier option might be to have the second step read from the collection that you wrote to BigQuery rather than reading the table you just put into BigQuery. For example:
PCollection<TableRow> rows = ...;
rows.apply(BigQueryIO.Write.to(...));
rows.apply(/* rest of the pipeline */);
You could even do this earlier if you wanted to continue processing the elements written to BigQuery rather than the table rows.
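In the Beam Python SDK, the same idea looks roughly like this (a sketch; the query, table names, and write dispositions are placeholders, and the destination table is assumed to already exist):

# Sketch: keep processing the PCollection you wrote instead of re-reading the table.
import apache_beam as beam

with beam.Pipeline() as p:
    rows = p | "Read source" >> beam.io.ReadFromBigQuery(
        query="SELECT * FROM `project.dataset.source`", use_standard_sql=True)

    rows | "Write to BQ" >> beam.io.WriteToBigQuery(
        "project:dataset.generated_table",
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)

    rows | "Rest of the pipeline" >> beam.Map(lambda row: row)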