I have a seemingly simple problem when constructing my pipeline for Dataflow. I have multiple pipelines that fetch data from external sources, transform the data, and write it to several BigQuery tables. When this process is done I would like to run a query over the tables that were just generated. Ideally I would like this to happen in the same job.
Is this the way Dataflow is meant to be used, or should the loading to BigQuery and the querying of the tables be split up between jobs?
If this is possible in the same job, how would one solve this, given that the BigQuerySink does not generate a PCollection? If it is not possible in the same job, is there some way to trigger a job on the completion of another job (i.e. the writing job and the querying job)?
You alluded to what would need to happen to do this in a single job -- the BigQuerySink would need to produce a PCollection. Even if it is empty, you could then use it as the input to the step that reads from BigQuery in a way that made that step wait until the first sink was done.
You would need to create your own version of the BigQuerySink to do this.
If possible, an easier option might be to have the second step read from the collection that you wrote to BigQuery rather than reading the table you just put into BigQuery. For example:
PCollection<TableRow> rows = ...;
// Write the rows to BigQuery...
rows.apply(BigQueryIO.Write.to(...));
// ...and keep processing the same collection rather than re-reading the table.
rows.apply(/* rest of the pipeline */);
You could even do this earlier if you wanted to continue processing the elements written to BigQuery rather than the table rows.
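For reference, here is a rough sketch of the same branching pattern with the Beam Python SDK; the table spec and row contents are made-up placeholders, and the write assumes the target table already exists:

import apache_beam as beam

with beam.Pipeline() as p:
    # Hypothetical in-memory source standing in for the fetch/transform steps.
    rows = p | "CreateRows" >> beam.Create([{"id": 1, "value": "a"},
                                            {"id": 2, "value": "b"}])

    # Write the rows to BigQuery (appending to an existing table).
    rows | "WriteToBQ" >> beam.io.WriteToBigQuery(
        "my-project:my_dataset.my_table",
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
    )

    # Keep processing the same PCollection instead of re-reading the table.
    rows | "RestOfPipeline" >> beam.Map(lambda row: {**row, "processed": True})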
I'm working on an ETL with Apache Beam and Dataflow using Python and I'm using BigQuery as a database/datawarehouse.
The ETL basically performs some processing then updates data that is already in BigQuery. Since there is no update transform in Apache Beam, I had to use the BigQuery SDK and write my own UPDATE query, and map it to each row.
The queries work fine when done sequentially, but when I use multiple workers, I get the following error:
{'reason': 'invalidQuery', 'message': 'Could not serialize access to table my_table due to concurrent update'}
I made sure that the same row is never accessed/updated concurrently (a row is identified by an id, and each id is unique). I've also tried running the same code in a simple Python script without Beam/Dataflow, and I still got the same error once I started using multiple threads instead of one.
Has anyone run into the same problem using the BigQuery SDK? And do you have any suggestions for avoiding it?
I think it's better to append the data from your Beam/Dataflow job.
BigQuery is more append-oriented, and BigQueryIO in Beam is well suited to append operations.
If you have an orchestrator like Cloud Composer/Airflow or Cloud Workflows, you can deduplicate the data in batch mode with the following steps:
Create a staging table and a final table
Your orchestrator truncates your staging table
Your orchestrator runs your Dataflow job
Dataflow job reads your data
Dataflow job writes the result to the staging table in BigQuery in append mode
Your orchestrator runs a task with a BigQuery merge query between the staging and final tables. The merge query inserts each row into the final table, or updates it if the element already exists.
https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax?hl=en#merge_statement
Example of a merge query:
MERGE dataset.Inventory T
USING dataset.NewArrivals S
ON T.product = S.product
WHEN MATCHED THEN
UPDATE SET quantity = T.quantity + S.quantity
WHEN NOT MATCHED THEN
INSERT (product, quantity) VALUES(product, quantity)
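For instance, the merge task in the orchestrator could be a short Python step that submits the query through the BigQuery client library; the project, dataset, and table names below are just placeholders:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

merge_sql = """
MERGE dataset.Inventory T
USING dataset.NewArrivals S
ON T.product = S.product
WHEN MATCHED THEN
  UPDATE SET quantity = T.quantity + S.quantity
WHEN NOT MATCHED THEN
  INSERT (product, quantity) VALUES (product, quantity)
"""

# result() blocks until the merge job finishes, so the orchestrator task
# only succeeds once the final table is up to date.
client.query(merge_sql).result()

In Cloud Composer/Airflow the same query can also be submitted through the Google provider's BigQuery operators instead of a hand-rolled Python task.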
I had a use case with a BQ table containing around 150K records, and I needed to update its content monthly (which meant around 100K UPDATEs and a couple of thousand APPENDs).
When I designed my Beam/Dataflow job to update the records with the BQ Python API library, I ran into quota issues (a limited number of updates) as well as the concurrency issue.
I had to change my pipeline's approach: instead of reading the BQ table and updating records in place, I process the BQ table, update what needs to be updated, append what's new, and save everything to a new BQ table.
Once the job finishes successfully with no errors, you can replace the old table with the newly created one.
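A minimal sketch of that final swap, assuming the pipeline wrote its output to a table named my_table_new (all names are placeholders), is a copy job that truncates the old table:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

# Overwrite the old table with the freshly built one in a single copy job.
job_config = bigquery.CopyJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE
)
client.copy_table(
    "my-project.my_dataset.my_table_new",
    "my-project.my_dataset.my_table",
    job_config=job_config,
).result()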
GCP mentions:
Running two mutating DML statements concurrently against a table will succeed as long as the two statements don't modify data in the same partition. Two jobs that try to mutate the same partition may sometimes experience concurrent update failures.
And then:
BigQuery now handles such failures automatically. To do this, BigQuery will restart the job.
Can this retry mechanism be a solution at all? Can anyone elaborate on this?
Source: https://cloud.google.com/blog/products/data-analytics/dml-without-limits-now-in-bigquery
I came across a problem: when I query for a large amount of data (35M rows, 22 GB), the same query gets executed multiple times (e.g. 400 times) in the background. I understand that the data is partitioned/shuffled in some way. This greatly increases query cost.
This is how I query for data:
from google.cloud import bigquery
bqclient = bigquery.Client(project)
query_job = bqclient.query(query).result()
df_result = query_job.to_dataframe()
Where project and query are Python strings.
I am using google-cloud-bigquery==2.30.1.
I am looking for any programmatic solutions to reduce query costs. E.g., is there a different class/config/object/method/library that would handle such queries in a better way?
I suspect it's because you're calling result() twice: once when you run query_job = bqclient.query(query).result() and once when you run df_result = query_job.to_dataframe() (by calling query_job again). I'm not sure why it's running so many times, but it probably has to do with how result() works (https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html#google.cloud.bigquery.client.Client.query should have more info).
The "basic" answer to what you want is df_result = bqclient.query(query).to_dataframe(). However, if you're querying a large dataset, this will likely take time. See
Reading from BigQuery into a Pandas DataFrame and performance issues for a better way to do this.
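If the issue is mainly download time rather than billed bytes, one option (assuming the google-cloud-bigquery-storage package is installed; recent client versions pick it up automatically) is to let to_dataframe pull the result over the BigQuery Storage API:

from google.cloud import bigquery

bqclient = bigquery.Client(project)

# Downloads the result via the BigQuery Storage API, which is much faster for
# large result sets; it does not change how the query itself is billed.
df_result = bqclient.query(query).to_dataframe(create_bqstorage_client=True)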
Side note on reducing query costs: if you're working in a local Python environment, you probably don't want to be processing 22 GB worth of data there. If you're, say, building an ML model, you probably want to extract roughly 1/1000th of your data and work on that (a simple LIMIT in SQL won't reduce your query costs; you either want to partition your table on a date column and filter on that, or create a new table with a subset of rows and query that). Note the model on your subset of data won't be accurate; it's just to make sure your Python code works on that data. You'd then deploy your code to a cloud environment and run it on the full data. Good end-to-end example here: https://github.com/GoogleCloudPlatform/training-data-analyst/tree/master/courses/machine_learning/deepdive2/structured/solutions.
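As a rough illustration of the subset idea (the table and id column here are hypothetical), a hash-based filter keeps a reproducible ~0.1% sample you can develop against; note that creating the sample still scans the full table once, but every query against the sample afterwards is cheap:

from google.cloud import bigquery

bqclient = bigquery.Client(project)

# One full scan to build a small, reproducible sample table for local development.
sample_sql = """
CREATE OR REPLACE TABLE my_dataset.my_table_sample AS
SELECT *
FROM my_dataset.my_table
WHERE MOD(ABS(FARM_FINGERPRINT(CAST(id AS STRING))), 1000) = 0
"""
bqclient.query(sample_sql).result()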
I'm new to using Airflow (and newish to Python).
I need to migrate some very large MySQL tables to s3 files using Airflow. All of the relevant hooks and operators in Airflow seem geared to using Pandas dataframes to load the full SQL output into memory and then transform/export to the desired file format.
This is causing obvious problems for the large tables which cannot fully fit into memory and are failing. I see no way to have Airflow read the query results and save them off to a local file instead of tanking it all up into memory.
I see ways to bulk_dump to output results to a file on the MySQL server using the MySqlHook, but no clear way to transfer that file to s3 (or to Airflow local storage then to s3).
I'm scratching my head a bit because I've worked in Pentaho which would easily handle this problem, but cannot see any apparent solution.
I can try to slice the tables up into small enough chunks that Airflow/Pandas can handle them, but that's a lot of work, a lot of query executions, and there are a lot of tables.
What would be some strategies for moving very large tables from a MySQL server to s3?
You don't have to use the Airflow transfer operators if they don't fit your scale. You can (and probably should) create your very own CustomMySqlToS3Operator with the logic that fits your process.
A few options:
Don't transfer all the data in one task. Slice the data based on dates, number of rows, or something else. You can use several CustomMySqlToS3Operator tasks in your workflow. This is not a lot of work (contrary to what you mentioned); it's simply a matter of providing the proper WHERE conditions to the SQL queries that you generate. Depending on the process you build, you can define that every run processes the data of a single day, so your WHERE condition is simply date_column between execution_date and next_execution_date (you can read about it in https://stackoverflow.com/a/65123416/14624409). Then use catchup=True to backfill runs.
Use Spark as part of your operator.
As you pointed out, you can dump the data to local disk and then upload it to S3 using the load_file method of S3Hook. This can be done as part of the logic of your CustomMySqlToS3Operator or, if you prefer, as a Python callable from a PythonOperator; a rough sketch follows below.
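Here is a rough sketch of option 3 as a custom operator. It assumes the apache-airflow-providers-mysql and apache-airflow-providers-amazon packages, and the chunked CSV dump is just one way to avoid loading everything into a DataFrame; treat it as a starting point rather than a drop-in implementation:

import csv
import tempfile

from airflow.models import BaseOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.mysql.hooks.mysql import MySqlHook


class CustomMySqlToS3Operator(BaseOperator):
    template_fields = ("sql", "s3_key")

    def __init__(self, sql, s3_bucket, s3_key, mysql_conn_id="mysql_default",
                 aws_conn_id="aws_default", **kwargs):
        super().__init__(**kwargs)
        self.sql = sql
        self.s3_bucket = s3_bucket
        self.s3_key = s3_key
        self.mysql_conn_id = mysql_conn_id
        self.aws_conn_id = aws_conn_id

    def execute(self, context):
        mysql = MySqlHook(mysql_conn_id=self.mysql_conn_id)
        s3 = S3Hook(aws_conn_id=self.aws_conn_id)

        # Stream rows to a temp CSV file instead of materializing a DataFrame.
        # NOTE: the default client cursor may still buffer the full result set;
        # a server-side cursor (e.g. MySQLdb's SSCursor) avoids that if needed.
        conn = mysql.get_conn()
        cursor = conn.cursor()
        cursor.execute(self.sql)
        with tempfile.NamedTemporaryFile("w", newline="", suffix=".csv",
                                         delete=False) as tmp:
            writer = csv.writer(tmp)
            writer.writerow([col[0] for col in cursor.description])
            while True:
                rows = cursor.fetchmany(10000)  # arbitrary chunk size
                if not rows:
                    break
                writer.writerows(rows)
            path = tmp.name

        s3.load_file(filename=path, key=self.s3_key,
                     bucket_name=self.s3_bucket, replace=True)

A templated WHERE clause in sql (for example, filtering on the execution date) then gives you option 1 and option 3 at the same time.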
I have a requirement to extract data from an Amazon Aurora RDS instance and load it to S3 to build a data lake for analytics purposes. There are multiple schemas/databases in one instance and each schema has a similar set of tables. I need to pull selective columns from these tables for all schemas in parallel. This should happen in near real time, capturing the DML operations periodically.
The question may arise of using dedicated services like the Data Migration or Copy activity offerings provided by AWS. But I can't use them, since the plan is to make the solution cloud-platform independent, as it could be hosted on Azure down the line.
I was thinking Apache Spark could be used for this, but I learned it doesn't support JDBC as a source in Structured Streaming. I read about multi-threading and multiprocessing techniques in Python for this, but I still have to assess whether they are suitable (the idea is to run the code as daemon threads, each thread fetching data from the tables of a single schema in the background, running continuously in defined cycles, say every 5 minutes). The data synchronization between the RDS tables and S3 is also a crucial aspect to consider.
To say more about the data in the source tables: they have an auto-increment ID field, but the IDs are not sequential and might be missing a few numbers in between, because rows are removed when the corresponding entity (say, a customer) becomes inactive. The entire record does not need to be pulled, only a few columns that would be predefined in the configuration. The solution must be reliable, sustainable, and automatable.
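To make the daemon-thread polling idea above concrete, here is a very rough sketch; the connection string, column names, bucket, and watermark logic are all assumptions rather than a recommendation, and it only captures newly inserted rows (updates and deletes would need some form of change tracking):

import threading
import time

import pandas as pd
import sqlalchemy

# Hypothetical Aurora connection; writing Parquet to S3 assumes s3fs and pyarrow.
engine = sqlalchemy.create_engine("mysql+pymysql://user:password@aurora-host")


def sync_table(schema, table, interval_seconds=300):
    """Poll one table, pulling only rows with an id above the last one seen."""
    watermark = 0
    while True:
        # Only selected columns, as defined in the configuration.
        df = pd.read_sql(
            f"SELECT id, col_a, col_b FROM {schema}.{table} WHERE id > {watermark}",
            engine,
        )
        if not df.empty:
            watermark = int(df["id"].max())
            df.to_parquet(f"s3://my-bucket/{schema}/{table}/{watermark}.parquet")
        time.sleep(interval_seconds)


for schema in ["schema_a", "schema_b"]:
    threading.Thread(target=sync_table, args=(schema, "customers"), daemon=True).start()

# Keep the main thread alive so the daemon threads keep running.
while True:
    time.sleep(60)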
Now, I'm quite unsure which approach to use and how to implement the solution once decided. Hence, I'm seeking help from people who have dealt with, or know of, a solution to this problem. I'm happy to provide more info if it's required to get to the right solution. Any help would be greatly appreciated.
I have an aggregated data table in BigQuery that has millions of rows. This table is growing every day.
I need a way to get one row from this aggregate table in milliseconds, to append data to a real-time event.
What is the best way to tackle this problem?
BigQuery is not built to respond in milliseconds, so you need another solution in between. It is perfectly fine to use BigQuery to do the large aggregation calculation, but you should never serve directly from BQ where millisecond response times are required.
Also be aware that, if this is a web application for example, many reloads of a page could cost you a lot of money, as you pay per query.
There are many architectural solutions for such issues, but which one you should use is hard to tell without any project context and objectives.
For real-time data we often use Pub/Sub somewhere in between, but that might be an issue if the (near) real-time demand is an aggregate.
You could also use the materialized-view concept by exporting the aggregated data to a sub-component: for example Cloud Storage -> Pub/Sub, or a SQL instance / Memorystore, or any other kind of microservice.
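As one possible illustration of that last idea, the aggregate can be refreshed from BigQuery on a schedule and served from a low-latency store such as Memorystore (Redis), where single-key reads are typically in the millisecond range; the project, dataset, host, and key scheme below are placeholders:

import json

import redis
from google.cloud import bigquery

bq = bigquery.Client(project="my-project")       # placeholder project
cache = redis.Redis(host="10.0.0.3", port=6379)  # e.g. a Memorystore instance


def refresh_cache():
    """Run the (slow, billed) BigQuery aggregation periodically and cache each row."""
    for row in bq.query("SELECT entity_id, metric FROM my_dataset.aggregates").result():
        cache.set(f"agg:{row['entity_id']}", json.dumps(dict(row.items())))


def get_row(entity_id):
    """Millisecond-range lookup for the real-time event path; never hits BigQuery."""
    value = cache.get(f"agg:{entity_id}")
    return json.loads(value) if value else None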