I'm working on an ETL with Apache Beam and Dataflow using Python, and I'm using BigQuery as the database/data warehouse.
The ETL basically performs some processing and then updates data that is already in BigQuery. Since there is no update transform in Apache Beam, I had to use the BigQuery SDK to write my own UPDATE query and map it to each row.
The queries work fine when done sequentially, but when I use multiple workers, I get the following error:
{'reason': 'invalidQuery', 'message': 'Could not serialize access to table my_table due to concurrent update'}
I made sure that the same row is never accessed/updated concurrently (a row is basically an id, and each id is unique). I've also tried running the same code in a simple Python script without Beam/Dataflow, and I still got the same error as soon as I started using multiple threads instead of one.
Has anyone run into the same problem using the BigQuery SDK? And do you have any suggestions to avoid it?
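For context, this is roughly what the per-element UPDATE pattern looks like (a simplified sketch; the project, dataset, table, and column names are placeholders, not my actual pipeline):
import apache_beam as beam
from google.cloud import bigquery

class UpdateRowFn(beam.DoFn):
    def setup(self):
        self.client = bigquery.Client()

    def process(self, element):
        # One DML statement per element: many concurrent mutations against
        # the same table, which is what triggers the serialization error.
        sql = (
            "UPDATE `my_project.my_dataset.my_table` "
            "SET value = @value WHERE id = @id"
        )
        job_config = bigquery.QueryJobConfig(query_parameters=[
            bigquery.ScalarQueryParameter("value", "STRING", element["value"]),
            bigquery.ScalarQueryParameter("id", "STRING", element["id"]),
        ])
        self.client.query(sql, job_config=job_config).result()
        yield element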
I think it's better to append the data from your Beam/Dataflow job.
BigQuery is append-oriented, and BigQueryIO in Beam is designed for append operations.
If you have an orchestrator like Cloud Composer/Airflow or Cloud Workflows, you can deduplicate the data in batch mode with the following steps:
Create a staging table and a final table
Your orchestrator truncates the staging table
Your orchestrator runs your Dataflow job
The Dataflow job reads your data
The Dataflow job writes the result to the staging table in BigQuery in append mode (a sketch of this write step follows the merge example below)
Your orchestrator runs a task with a merge query in BigQuery between the staging and final tables. The merge query inserts each row into the final table, or updates it if it already exists.
https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax?hl=en#merge_statement
Example of a merge query:
MERGE dataset.Inventory T
USING dataset.NewArrivals S
ON T.product = S.product
WHEN MATCHED THEN
UPDATE SET quantity = T.quantity + S.quantity
WHEN NOT MATCHED THEN
INSERT (product, quantity) VALUES(product, quantity)
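And a minimal sketch of the append write in the Beam Python SDK, assuming the staging table already exists; the bucket path and the parse step are placeholders for your own logic:
import json

import apache_beam as beam

def parse_and_transform(line):
    # Placeholder for the pipeline's real processing.
    return json.loads(line)

with beam.Pipeline() as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.json")
     | "Transform" >> beam.Map(parse_and_transform)
     | "WriteStaging" >> beam.io.WriteToBigQuery(
           "my_project:my_dataset.staging_table",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))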
I had a use case where I had a BQ table containing around 150K records, and I needed to update its content monthly (which meant around 100K UPDATEs and a couple of thousand APPENDs).
When I designed my Beam/Dataflow job to update the records with the BQ Python API library, I ran into quota issues (there is a limit on the number of UPDATE statements) as well as the concurrency issue.
I had to change the approach my pipeline was using: instead of reading the BQ table and updating each record in place, it now processes the BQ table, updates what needs to be updated, appends what's new, and saves the result to a new BQ table.
Once the job finishes successfully with no errors, you can replace the old table with the newly created one.
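The final swap can be done with a copy job, for example (a sketch only; the table names are placeholders, and WRITE_TRUNCATE makes the copy overwrite the old table):
from google.cloud import bigquery

client = bigquery.Client()

# Overwrite the table that consumers query with the freshly built table.
job_config = bigquery.CopyJobConfig()
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE

job = client.copy_table(
    "my_project.my_dataset.my_table_new",   # table written by the job
    "my_project.my_dataset.my_table",       # table consumers query
    job_config=job_config,
)
job.result()  # wait for the copy to complete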
GCP mentions:
Running two mutating DML statements concurrently against a table will succeed as long as the two statements don't modify data in the same partition. Two jobs that try to mutate the same partition may sometimes experience concurrent update failures.
And then:
BigQuery now handles such failures automatically. To do this, BigQuery will restart the job.
Can this retry mechanism be a solution at all? Can anyone elaborate on this?
Source: https://cloud.google.com/blog/products/data-analytics/dml-without-limits-now-in-bigquery
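If you wanted to retry on the client side as well, a sketch could look like this (purely illustrative, assuming the error surfaces as a BadRequest whose message contains "Could not serialize access"; the table and column names are placeholders):
import random
import time

from google.api_core.exceptions import BadRequest
from google.cloud import bigquery

client = bigquery.Client()

def run_dml_with_retry(sql, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return client.query(sql).result()
        except BadRequest as exc:
            if "Could not serialize access" not in str(exc):
                raise
            # Exponential backoff with jitter before retrying.
            time.sleep((2 ** attempt) + random.random())
    raise RuntimeError("DML still failing after retries")

run_dml_with_retry(
    "UPDATE `my_project.my_dataset.my_table` SET value = 'x' WHERE id = '1'")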
I have got a requirement to extract data from an Amazon Aurora RDS instance and load it into S3 to build a data lake for analytics purposes. There are multiple schemas/databases in one instance, and each schema has a similar set of tables. I need to pull selective columns from these tables for all schemas in parallel. This should happen in near real time, capturing the DML operations periodically.
There may arise the question of using dedicated services like Data Migration or Copy activity provided by AWS, but I can't use them since the plan is to keep the solution cloud-platform independent, as it could be hosted on Azure down the line.
I was thinking Apache Spark could be used for this, but I learned that it doesn't support JDBC as a source in Structured Streaming. I have read about multi-threading and multiprocessing techniques in Python but still have to assess whether they are suitable (the idea is to run the code as daemon threads, each thread fetching data from the tables of a single schema in the background, running continuously in defined cycles, say every 5 minutes; a rough sketch of this idea follows below). Data synchronization between the RDS tables and S3 is also a crucial aspect to consider.
To say more about the data in the source tables: they have an auto-increment ID field, but the IDs are not sequential and may skip a few numbers where rows were removed due to the inactivity of the corresponding entity, say customers. The entire record does not need to be pulled, only a few columns that would be predefined in the configuration. The solution must be reliable, sustainable, and automatable.
Now, I'm quite confused about which approach to use and how to implement the solution once decided. Hence, I seek the help of people who have dealt with or know of any solution to this problem statement. I'm happy to provide more info in case it is required to get to the right solution. Any help on this would be greatly appreciated.
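To make the daemon-thread idea concrete, here is a hedged sketch under several assumptions: the schema, table, and column names, the Aurora endpoint, the S3 bucket, and the use of the auto-increment id as an incremental watermark are all illustrative, and this only captures inserts (updates and deletes would need a modified timestamp or a CDC mechanism):
import io
import threading
import time

import boto3
import pandas as pd
import pymysql

S3_BUCKET = "my-data-lake"            # assumed bucket name
POLL_SECONDS = 300                    # "every 5 minutes"
SCHEMAS = ["schema_a", "schema_b"]    # one daemon thread per schema

s3 = boto3.client("s3")

def sync_schema(schema, last_seen_id=0):
    while True:
        conn = pymysql.connect(host="aurora-endpoint", user="etl",
                               password="secret", database=schema)
        try:
            # Incremental pull: only rows with a higher auto-increment id
            # than the last one exported (gaps in the ids are fine).
            df = pd.read_sql(
                "SELECT id, col_a, col_b FROM customers WHERE id > %s",
                conn, params=[last_seen_id])
        finally:
            conn.close()
        if not df.empty:
            last_seen_id = int(df["id"].max())
            buf = io.StringIO()
            df.to_csv(buf, index=False)
            key = f"{schema}/customers/{int(time.time())}.csv"
            s3.put_object(Bucket=S3_BUCKET, Key=key, Body=buf.getvalue())
        time.sleep(POLL_SECONDS)

for schema in SCHEMAS:
    threading.Thread(target=sync_schema, args=(schema,), daemon=True).start()

# Keep the main thread alive so the daemon threads keep running.
while True:
    time.sleep(60)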
Presently, we send entire files to the cloud (Google Cloud Storage) to be imported into BigQuery, where we do a simple drop/replace. However, as the file sizes have grown, our network team doesn't particularly like the bandwidth we are taking while other ETLs are also trying to run. As a result, we are looking into sending only the changed/deleted rows.
I'm trying to find the path/help docs on how to do this. Scope: I will start with a simple example. We have a large table with 300 million records. Rather than sending 300 million records every night, we would send over the X million that have changed or been deleted. I then need to incorporate those changed/deleted records into the BigQuery tables.
We presently use Node.js to move files from Storage to BigQuery, and Python via Composer to schedule native table updates in BigQuery.
I hope to get pointed in the right direction for how to start down this path.
Stream the full row on every update to BigQuery.
Let the table accommodate multiple rows for the same primary entity.
Write a view, e.g. table_last, that picks the most recent row.
This way all your queries run near-realtime on real data.
You can occasionally deduplicate the table by running a query that rewrites the table with only the latest row per entity.
Another approach: keep one final table and one table that you stream into, and have a MERGE statement run on a schedule every X minutes to write the updates from the streamed table into the final table.
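A sketch of the "latest row wins" view, assuming the streamed table is dataset.events, keyed by id with an updated_at timestamp (all names are illustrative):
from google.cloud import bigquery

client = bigquery.Client()

# Recreate the view that exposes only the most recent row per id.
client.query("""
CREATE OR REPLACE VIEW `my_project.dataset.table_last` AS
SELECT * EXCEPT(rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
  FROM `my_project.dataset.events`
)
WHERE rn = 1
""").result()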
I'm learning AWS Glue. With traditional ETL, a common pattern is to look up the primary key in the destination table to decide whether you need to do an update or an insert (aka the upsert design pattern). With Glue there doesn't seem to be that same control: plainly writing out the dynamic frame is just an insert process. There are two design patterns I can think of to solve this:
Load the destination as a data frame and, in Spark, left outer join to only insert new rows (how would you update rows if you needed to? delete then insert??? Since I'm new to Spark this is the most foreign to me)
Load the data into a staging table and then use SQL to perform the final merge
It's this second method that I'm exploring first. How can I, in the AWS world, execute a SQL script or stored procedure once the AWS Glue job is complete? Do you use a Python shell job, a Lambda, something directly part of Glue, or some other way?
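For reference, the first pattern roughly looks like this in PySpark (a sketch only; the connection options, table names, key column, and S3 path are assumptions, and in a Glue job the DataFrames would typically come from DynamicFrame.toDF()):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("upsert-sketch").getOrCreate()

jdbc_opts = {
    "url": "jdbc:mysql://host:3306/db",
    "user": "etl",
    "password": "secret",
}

source_df = spark.read.parquet("s3://bucket/incoming/")            # new data
dest_df = (spark.read.format("jdbc")
           .options(dbtable="target_table", **jdbc_opts).load())   # existing

# Rows whose primary key is not yet in the destination become plain inserts.
new_rows = source_df.join(dest_df.select("id"), on="id", how="left_anti")
(new_rows.write.format("jdbc")
 .options(dbtable="target_table", **jdbc_opts)
 .mode("append").save())

# Rows that already exist would still need an UPDATE, which the Spark JDBC
# writer does not perform; that is one reason the staging-table + SQL merge
# route (the second pattern) is often simpler.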
I have used the pymysql library, uploaded to AWS S3 as a zip file and configured in the AWS Glue job parameters. For UPSERTs, I have used INSERT INTO ... ON DUPLICATE KEY UPDATE.
So, based on the primary key validation, the code either updates a record if it already exists or inserts a new one. Hope this helps. Please refer to this:
import pymysql

# Connection settings (example values).
rds_host = "rds.url.aaa.us-west-2.rds.amazonaws.com"
name = "username"
password = "userpwd"
db_name = "dbname"

conn = pymysql.connect(host=rds_host, user=name, passwd=password,
                       db=db_name, connect_timeout=5)

# Upsert: insert a new row, or update territory_name/state when the
# primary key already exists. The %s placeholders are filled by execute().
insertQry = ("INSERT INTO ZIP_TERR (zip_code, territory_code, "
             "territory_name, state) "
             "VALUES (%s, %s, %s, %s) "
             "ON DUPLICATE KEY UPDATE territory_name = VALUES(territory_name), "
             "state = VALUES(state);")

with conn.cursor() as cur:
    cur.execute(insertQry, ("90210", "T01", "West", "CA"))  # example values
    conn.commit()
In the above code sample, zip_code and territory_code are the primary key columns. For looping inserts over many rows, see the batched variant below.
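Assuming the insertQry defined above, a batch of rows can be upserted in one call with executemany (the values here are illustrative):
rows = [
    ("90210", "T01", "West", "CA"),
    ("10001", "T02", "East", "NY"),
]

with conn.cursor() as cur:
    cur.executemany(insertQry, rows)
    conn.commit()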
As always, AWS's evolving feature list resolves many of these problems (arising from user demand and common work patterns).
AWS has published documentation on updating and inserting new data using staging tables (which you mentioned in your second strategy).
Generally speaking, the most rigorous approach for ETL is to truncate and reload the source data, but this depends on your source data. If your source data is a time-series dataset spanning billions of records, you may need to use a delta/incremental load pattern.
I have a table in Google BigQuery (GBQ) with almost 3 million records (rows) so far, created from data coming in from a MySQL db every day. The data is inserted into the GBQ table using a Python pandas data frame (.to_gbq()).
What is the optimal way to sync changes from MySQL to GBQ, in this direction, with Python?
Several different ways to import data from MySQL to BigQuery that might suit your needs are described in this article. For example, binlog replication:
This approach (sometimes referred to as change data capture - CDC) utilizes MySQL’s binlog. MySQL’s binlog keeps an ordered log of every DELETE, INSERT, and UPDATE operation, as well as Data Definition Language (DDL) data that was performed by the database. After an initial dump of the current state of the MySQL database, the binlog changes are continuously streamed and loaded into Google BigQuery.
Seems to be exactly what you are searching for.
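A minimal sketch of reading the binlog from Python, assuming the mysql-replication package (pymysqlreplication) and illustrative connection settings; the BigQuery loading side is only hinted at in the comments:
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    DeleteRowsEvent, UpdateRowsEvent, WriteRowsEvent)

mysql_settings = {"host": "127.0.0.1", "port": 3306,
                  "user": "repl", "passwd": "secret"}

stream = BinLogStreamReader(
    connection_settings=mysql_settings,
    server_id=100,                       # must be unique among replicas
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
    resume_stream=True,
    blocking=True)

for event in stream:
    for row in event.rows:
        # WriteRowsEvent rows carry row["values"]; UpdateRowsEvent rows
        # carry row["before_values"] / row["after_values"].
        change = {"table": event.table, "type": type(event).__name__,
                  "row": row}
        # Buffer `change` and periodically load the batch into BigQuery,
        # e.g. with google.cloud.bigquery's load_table_from_json.
        print(change)

stream.close()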
I have a seemingly simple problem when constructing my pipeline for Dataflow. I have multiple pipelines that fetch data from external sources, transform the data, and write it to several BigQuery tables. When this process is done, I would like to run a query over the just-generated tables. Ideally I would like this to happen in the same job.
Is this the way Dataflow is meant to be used, or should the loading into BigQuery and the querying of the tables be split between jobs?
If this is possible in the same job, how would one solve it, given that the BigQuerySink does not generate a PCollection? If it is not possible in the same job, is there some way to trigger a job on the completion of another job (i.e. chain the writing job and the querying job)?
You alluded to what would need to happen to do this in a single job: the BigQuerySink would need to produce a PCollection. Even if it were empty, you could then use it as an input to the step that reads from BigQuery, in a way that makes that step wait until the first sink is done.
You would need to create your own version of the BigQuerySink to do this.
If possible, an easier option might be to have the second step read from the collection that you wrote to BigQuery rather than reading the table you just put into BigQuery. For example:
PCollection<TableRow> rows = ...;
rows.apply(BigQuery.Write.to(...));
rows.apply(/* rest of the pipeline */);
You could even do this earlier if you wanted to continue processing the elements written to BigQuery rather than the table rows.