How to append to a table in BigQuery using Python BigQuery API - python

I've been able to append to/create a table from a Pandas dataframe using the pandas-gbq package, in particular the to_gbq method. However, when I check the table in the BigQuery web UI, I see the following message:
This table has records in the streaming buffer that may not be visible in the preview.
I'm not the only one to ask, and it seems that there's no solution to this yet.
So my questions are:
1. Is there a solution to the above problem (namely the data not being visible in the web UI)?
2. If there is no solution to (1), is there another way that I can append data to an existing table using the Python BigQuery API? (Note the documentation says that I can achieve this by running an asynchronous query and using writeDisposition=WRITE_APPEND but the link that it provides doesn't explain how to use it and I can't work it out).

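That message is just a UI notice; it should not hold you back.
To check the data, run a simple query and see whether it is there.
To read only the data that is still in the streaming buffer, use this query (it relies on the _PARTITIONTIME pseudo-column, so it applies to ingestion-time partitioned tables):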
#standardSQL
SELECT count(1)
FROM `dataset.table` WHERE _PARTITIONTIME is null
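
For the second question, one way to append without going through the streaming buffer is a batch load job with WRITE_APPEND. The following is a minimal sketch, assuming the google-cloud-bigquery client library (plus pyarrow) is installed and that table_id is a fully qualified "project.dataset.table" string. The function name append_dataframe is made up for illustration.

```python
def append_dataframe(df, table_id, client=None):
    """Append the rows of a pandas DataFrame to an existing table.

    Uses a batch load job rather than streaming inserts, so the rows
    do not pass through the streaming buffer.
    """
    # Imported lazily so this sketch can be loaded even where the
    # client library is not installed.
    from google.cloud import bigquery

    client = client or bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        # WRITE_APPEND adds rows to the table instead of replacing it.
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
    job.result()  # wait for the load job to finish; raises on failure
    return job
```

A call would look like append_dataframe(df, "my-project.my_dataset.my_table"), where the table name is a placeholder.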

Related

Unable to run multiple UPDATE with BigQuery via Python SDK

I'm working on an ETL with Apache Beam and Dataflow using Python and I'm using BigQuery as a database/datawarehouse.
The ETL basically performs some processing then updates data that is already in BigQuery. Since there is no update transform in Apache Beam, I had to use the BigQuery SDK and write my own UPDATE query, and map it to each row.
The queries work fine when done sequentially, but when I use multiple workers, I get the following error:
{'reason': 'invalidQuery', 'message': 'Could not serialize access to table my_table due to concurrent update'}
I made sure that the same row is never accessed/updated concurrently (a row is basically an id, and each id is unique). I've also tried running the same code in a simple Python script without Beam/Dataflow, and I still got the same error as soon as I used multiple threads instead of one.
Has anyone had the same problem using the BigQuery SDK? Do you have any suggestions for avoiding it?
I think it's better to have your Beam/Dataflow job append the data.
BigQuery is more append-oriented, and BigQueryIO in Beam is designed for append operations.
If you have an orchestrator such as Cloud Composer/Airflow or Cloud Workflows, you can deduplicate the data in batch mode with the following steps:
1. Create a staging table and a final table.
2. Your orchestrator truncates the staging table.
3. Your orchestrator runs your Dataflow job.
4. The Dataflow job reads your data.
5. The Dataflow job writes the result to the staging table in BigQuery in append mode.
6. Your orchestrator runs a task that executes a MERGE query in BigQuery between the staging and final tables. The MERGE query updates a row in the final table if it already exists and inserts it otherwise.
https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax?hl=en#merge_statement
Example of a merge query:
MERGE dataset.Inventory T
USING dataset.NewArrivals S
ON T.product = S.product
WHEN MATCHED THEN
UPDATE SET quantity = T.quantity + S.quantity
WHEN NOT MATCHED THEN
INSERT (product, quantity) VALUES(product, quantity)
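
The merge step can also be issued from Python. The sketch below builds a generic upsert MERGE and runs it as a single job, so concurrent row-level UPDATEs are avoided. Unlike the example above, it overwrites the non-key columns rather than summing quantities. It assumes client behaves like a google.cloud.bigquery.Client (only its query method is used); the helper names build_merge_sql and run_merge and all table and column names are hypothetical.

```python
def build_merge_sql(final_table, staging_table, key, columns):
    """Build a MERGE statement that upserts staging rows into the final table."""
    non_key = [c for c in columns if c != key]
    if not non_key:
        raise ValueError("need at least one non-key column to update")
    updates = ", ".join(f"{c} = S.{c}" for c in non_key)
    col_list = ", ".join(columns)
    values = ", ".join(f"S.{c}" for c in columns)
    return (
        f"MERGE `{final_table}` T\n"
        f"USING `{staging_table}` S\n"
        f"ON T.{key} = S.{key}\n"
        f"WHEN MATCHED THEN\n"
        f"  UPDATE SET {updates}\n"
        f"WHEN NOT MATCHED THEN\n"
        f"  INSERT ({col_list}) VALUES ({values})"
    )


def run_merge(client, final_table, staging_table, key, columns):
    """Run the MERGE as one query job and wait for it to finish."""
    job = client.query(build_merge_sql(final_table, staging_table, key, columns))
    job.result()  # blocks until the job completes; raises on failure
    return job
```

The orchestrator task would then call something like run_merge(client, "dataset.final", "dataset.staging", "id", ["id", "name"]) once the Dataflow job has finished.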
I had a use case with a BQ table containing around 150K records whose content I needed to update monthly (around 100K UPDATEs and a couple of thousand APPENDs).
When I designed my Beam/Dataflow job to update the records with the BQ Python API library, I ran into quota issues (a limited number of updates) as well as the concurrency issue.
I had to change my pipeline's approach: instead of reading the BQ table and updating records in place, it now processes the BQ table, updates what needs to be updated, appends what's new, and saves the result to a new BQ table.
Once the job has finished successfully, you can replace the old table with the newly created one.
GCP mentions:
Running two mutating DML statements concurrently against a table will
succeed as long as the two statements don’t modify data in the same
partition. Two jobs that try to mutate the same partition may
sometimes experience concurrent update failures.
And then :
BigQuery now handles such failures automatically. To do this, BigQuery
will restart the job.
Can this retry mechanism be a solution at all? Can anyone elaborate on this?
Source: https://cloud.google.com/blog/products/data-analytics/dml-without-limits-now-in-bigquery
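
The quoted post describes retries on BigQuery's side. If the error still reaches the client, as in the question, a client-side retry with backoff is one possible complement. This is only a sketch: it assumes the exception text contains the message quoted in the question, the helper name run_with_retry is hypothetical, and it does nothing about DML quotas.

```python
import random
import time

# Text taken from the error message quoted in the question.
CONCURRENT_UPDATE_MARKER = "Could not serialize access"


def run_with_retry(run_statement, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call run_statement(), retrying only on concurrent-update failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_statement()
        except Exception as exc:
            retryable = CONCURRENT_UPDATE_MARKER in str(exc)
            if not retryable or attempt == max_attempts:
                raise
            # Exponential backoff plus jitter so workers do not retry in lockstep.
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay))
```

With a BigQuery client this could wrap a statement as run_with_retry(lambda: client.query(update_sql).result()), where update_sql is the UPDATE being issued.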

How do you handle large query results for a simple select in bigquery with the python client library?

I have a table where I wrote 1.6 million records, and each has two columns: an ID, and a JSON string column.
I want to select all of those records and write the json in each row as a file. However, the query result is too large, and I get the 403 associated with that:
"403 Response too large to return. Consider specifying a destination table in your job configuration."
I've been looking at the below documentation around this and understand that they recommend specifying a table for the results and viewing them there, BUT all I want to do is select * from the table, so that would effectively just be copying it over, and I feel like I would run into the same issue querying that result table.
https://cloud.google.com/bigquery/docs/reference/standard-sql/introduction
https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationQuery.FIELDS.allow_large_results
What is the best practice here? Pagination? Table sampling? list_rows?
I'm using the python client library as stated in the question title. My current code is just this:
query = f'SELECT * FROM `{project}.{dataset}.{table}`'
return client.query(query)
I should also mention that the IDs are not sequential, they're just alphanumerics.
The most efficient approach, and the best practice, is to export your data and then download it instead of querying the whole table (SELECT *).
From there, you can extract the data you need from the exported files (e.g. CSV, JSON) using Python code, without having to wait for a SELECT * query to finish.
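
A rough sketch of both halves of that approach follows. export_table_to_gcs starts an extract job to newline-delimited JSON, and split_export writes the JSON string of each exported row to its own file once a shard has been downloaded locally. It assumes the google-cloud-bigquery library is installed; the function names and the column names id and payload are placeholders, since the question does not give the real ones.

```python
import json
from pathlib import Path


def export_table_to_gcs(table_id, destination_uri, client=None):
    """Export a table to newline-delimited JSON files in Cloud Storage."""
    # Imported lazily so split_export can be used without the client library.
    from google.cloud import bigquery

    client = client or bigquery.Client()
    job_config = bigquery.ExtractJobConfig(
        destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
    )
    # A "*" in destination_uri lets BigQuery shard a large export across files.
    job = client.extract_table(table_id, destination_uri, job_config=job_config)
    job.result()  # wait for the export to finish; raises on failure
    return job


def split_export(export_path, out_dir, id_field="id", json_field="payload"):
    """Write the JSON string from each exported row to <out_dir>/<id>.json."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    with open(export_path, encoding="utf-8") as lines:
        for line in lines:
            if not line.strip():
                continue  # skip blank lines
            row = json.loads(line)
            target = out / f"{row[id_field]}.json"
            target.write_text(row[json_field], encoding="utf-8")
            count += 1
    return count
```

Using the ID as the file name relies on the IDs being plain alphanumerics, as stated in the question.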

Is there any way we can load BigTable data into BigQuery?

I want to load BigTable data into BigQuery directly.
Until now I have been exporting the BigTable data to a CSV file using Python and then loading the CSV file into BigQuery.
But I don't want to use a CSV file between BigTable and BigQuery. Is there a direct way?
To add to Mikhail's recommendation, I'd suggest creating a permanent table in BigQuery using the external table. You'll define the schema for the columns you want and then query the rows you're interested in. Once that data is saved into BigQuery, it won't have any impact on your Bigtable performance. If you want to get the latest data, you can create a new permanent table with the same query.
If you're looking to have the data copied over and stored in BigQuery, Querying Cloud Bigtable data using permanent external tables is not what you're looking for. It explicitly mentions that "The data is not stored in the BigQuery table". My understanding is that the permanent table is more for persistent access controls, but still queries Bigtable directly.
This may be overkill, but you could set up an Apache Beam pipeline that runs in Dataflow, with a BigTableIO source and a BigQueryIO sink. You'd have to write a little bit of transformation logic, but overall it should be a pretty simple pipeline. The only catch here is that the BigTableIO connector is only for the Beam Java SDK, so you'd have to write this pipeline in Java.
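
For the external-table route described in the first answer, the copy into BigQuery storage can be done with a CREATE TABLE ... AS SELECT over the external table, so the result is stored in BigQuery rather than read from Bigtable on every query. A minimal sketch, assuming an external table over Bigtable has already been defined and that client behaves like a google.cloud.bigquery.Client; the helper and table names are hypothetical.

```python
def build_snapshot_sql(external_table, target_table, columns=None):
    """Build a CTAS statement that materializes the external table's rows."""
    select_list = ", ".join(columns) if columns else "*"
    return (
        f"CREATE OR REPLACE TABLE `{target_table}` AS\n"
        f"SELECT {select_list}\n"
        f"FROM `{external_table}`"
    )


def snapshot_external_table(client, external_table, target_table, columns=None):
    """Run the CTAS as one query job and wait for it to finish."""
    job = client.query(build_snapshot_sql(external_table, target_table, columns))
    job.result()  # blocks until the job completes; raises on failure
    return job
```

Re-running the function refreshes the stored copy; each run reads the selected rows from Bigtable once.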

Google BigQuery Results Don't Show

I created a Python script that pushes a pandas dataframe into Google BigQuery, and it looks as though I'm able to query the table directly from GBQ. However, another user is unable to view the results when they query that same table on GBQ. This seems to be a BigQuery issue, because when they connected to GBQ and queried the table indirectly using pandas, it worked fine (pd.read_gbq("SELECT * FROM ...", project_id)). What is causing this strange behaviour?
What I'm seeing:
What they are seeing:
I've encountered this when loading tables to BigQuery via Python GBQ. If you take the following steps, the table will display properly:
1. Load the dataframe to BigQuery via Python GBQ.
2. Run SELECT * FROM uploaded_dataset.uploaded_dataset; doing so will properly show the table.
3. Within the BigQuery UI, save the table (under a new table name).
From there, you will be able to see the table properly. Unfortunately, I don't know how to resolve this without a manual step in the UI.

Python: How to update (overwrite) Google BigQuery table using pandas dataframe

I have a table in Google BigQuery (GBQ) with almost 3 million records (rows) so far, created from data coming from a MySQL db every day. This data is inserted into the GBQ table using a Python pandas dataframe (.to_gbq()).
What is the optimal way to sync changes from MySQL to GBQ, in this direction, with Python?
Several different ways to import data from MySQL to BigQuery that might suit your needs are described in this article. For example Binlog replication:
This approach (sometimes referred to as change data capture - CDC) utilizes MySQL’s binlog. MySQL’s binlog keeps an ordered log of every DELETE, INSERT, and UPDATE operation, as well as Data Definition Language (DDL) data that was performed by the database. After an initial dump of the current state of the MySQL database, the binlog changes are continuously streamed and loaded into Google BigQuery.
Seems to be exactly what you are searching for.
