Python BigQuery API: load_table_from_dataframe not uploading to a partitioned table

The following function runs, but when I check BigQuery there is nothing there. I am using the $ decorator to target a specific date, and I have waited to be sure the data has uploaded:
# from google.cloud import bigquery
def write_truncate_table(client, table_id, df, schema=None):
    job_config = bigquery.LoadJobConfig(
        schema=schema,
        # To append instead, use "WRITE_APPEND" or omit job_config entirely (appending is the default).
        write_disposition="WRITE_TRUNCATE",
    )
    job = client.load_table_from_dataframe(df, table_id, job_config=job_config)  # Make an API request.
    return job.result()
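For reference, a minimal sketch of how a load into a specific date partition is commonly set up; the project, dataset, table and date below are placeholders, and the explicit time_partitioning setting is an assumption for the case where the table does not exist yet:
# from google.cloud import bigquery
# client = bigquery.Client()
# df = <your pandas DataFrame>
job_config = bigquery.LoadJobConfig(
    write_disposition="WRITE_TRUNCATE",
    # Declares ingestion-time daily partitioning; mainly relevant if the table must be created.
    time_partitioning=bigquery.TimePartitioning(type_=bigquery.TimePartitioningType.DAY),
)
# The "$YYYYMMDD" decorator targets a single daily partition.
job = client.load_table_from_dataframe(
    df, "my-project.my_dataset.my_table$20240101", job_config=job_config
)
job.result()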

Related

Python unit testing for a Cloud Function that loads GCS files to BigQuery

It is the first time I am using a Cloud Function, and this cloud function is doing one single job: every time a file is uploaded to a GCS bucket, the cloud function runs and copies that file (.csv) to a BigQuery table without any transformations. What would be the most efficient way to test (unit, not integration) the gcs_to_bq method?
import re
from google.cloud import bigquery

def get_bq_table_name(file_name):
    if re.search('car', file_name):
        return 'car'
    return 'bike'

def gcs_to_bq(event, context):
    # Construct a BigQuery client object.
    client = bigquery.Client()
    bq_table = get_bq_table_name(event['name'])
    table_id = f'xxx.yyy.{bq_table}'
    job_config = bigquery.LoadJobConfig(
        schema=[
            bigquery.SchemaField("datetime", "STRING"),
            bigquery.SchemaField("name", "STRING"),
            bigquery.SchemaField("id", "STRING"),
        ],
        skip_leading_rows=1,
        # The source format defaults to CSV, so the line below is optional.
        source_format=bigquery.SourceFormat.CSV,
    )
    uri = "gs://" + event['bucket'] + '/' + event['name']
    load_job = client.load_table_from_uri(
        uri, table_id, job_config=job_config
    )  # Make an API request.
    load_job.result()  # Waits for the job to complete.
    destination_table = client.get_table(table_id)  # Make an API request.
    print("Loaded {} rows.".format(destination_table.num_rows))
I think you would need three things for unit testing:
1. Create a fake .csv file and upload it to a stage/dev GCS bucket.
2. Create a staging dataset in BQ.
3. Create a fake event object that represents (1).
Then your unit test is to call gcs_to_bq() with (3) and check whether the table is created correctly in (2).
As you can see, even though it's called unit testing, this approach requires setting up cloud resources.
There is a GCS emulator that could help if you want a completely local GCS stub/mock, but I have never tried it:
https://github.com/fsouza/fake-gcs-server
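Alternatively, for a test that stays completely local, one option is to patch the BigQuery client and pass gcs_to_bq() a fake event dict. A minimal sketch with unittest.mock, assuming the Cloud Function code lives in a hypothetical module named main:
from unittest import mock
import main  # hypothetical module containing gcs_to_bq and its bigquery import

def test_gcs_to_bq_builds_expected_load_job():
    fake_event = {'bucket': 'test-bucket', 'name': 'car_2021.csv'}

    with mock.patch.object(main.bigquery, 'Client') as mock_client_cls:
        mock_client = mock_client_cls.return_value
        main.gcs_to_bq(fake_event, context=None)

        # The loader should receive the URI built from the event
        # and the table chosen by get_bq_table_name().
        args, _ = mock_client.load_table_from_uri.call_args
        assert args[0] == 'gs://test-bucket/car_2021.csv'
        assert args[1] == 'xxx.yyy.car'
This only checks the wiring without touching GCS or BigQuery; it does not validate the schema or data against a real table.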

How to append query results using BigQuery Python API

I cannot find a way to append results of my query to a table in BigQuery that already exists and is partitioned by hour.
I have only found this solution: https://cloud.google.com/bigquery/docs/writing-results#writing_query_results.
job_config = bigquery.QueryJobConfig(destination=table_id)
sql = """SELECT * FROM table1 JOIN table2 ON table1.art_n=table2.artn"""
# Start the query, passing in the extra configuration.
query_job = client.query(sql, job_config=job_config) # Make an API request.
query_job.result() # Wait for the job to complete.
But providing a destination table to bigquery.QueryJobConfig overwrites it, and I could not find an option on bigquery.QueryJobConfig to specify something like if_exists. As far as I understand, I need to apply job.insert to the query results, but I do not understand how.
I also did not find any good advice elsewhere; maybe someone can point me to it?
Just in case, my real query is huge and I load it from a separate JSON file.
When you create your job_config, you need to set the write_disposition to WRITE_APPEND:
[..]
job_config = bigquery.QueryJobConfig(
    allow_large_results=True,
    destination=table_id,
    write_disposition='WRITE_APPEND'
)
[..]
See the QueryJobConfig reference for details.
You can add the line below to append data to an existing table:
job_config.write_disposition = 'WRITE_APPEND'
Complete Code:
from google.cloud import bigquery
client = bigquery.Client()
job_config = bigquery.QueryJobConfig(destination="myproject.mydataset.target_table")
job_config.write_disposition = 'WRITE_APPEND'
sql = """SELECT * FROM table1 JOIN table2 ON table1.art_n=table2.artn"""
query_job = client.query(sql, job_config=job_config)
query_job.result()
The parameter that you were looking for is called write_disposition. You want to use WRITE_APPEND to append to a table.

Appending CSV to BigQuery table with Python client

I have a new CSV file each week in the same format, which I need to append to a BigQuery table using the Python client. I successfully created the table using the first CSV, but I am unsure how to append subsequent CSVs going forward. The only way I have found is the google.cloud.bigquery.client.Client().insert_rows() method, which would require me to first read the CSV in as a list of dictionaries. Is there a better way to append data from a CSV to a BigQuery table?
See the simple example below:
# from google.cloud import bigquery
# client = bigquery.Client()
# table_ref = client.dataset('my_dataset').table('existing_table')
job_config = bigquery.LoadJobConfig()
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
job_config.skip_leading_rows = 1
# The source format defaults to CSV, so the line below is optional.
job_config.source_format = bigquery.SourceFormat.CSV
uri = "gs://your_bucket/path/your_file.csv"
load_job = client.load_table_from_uri(
    uri, table_ref, job_config=job_config
)  # API request
print("Starting job {}".format(load_job.job_id))
load_job.result() # Waits for table load to complete.
print("Job finished.")
destination_table = client.get_table(table_ref)
print("Loaded {} rows.".format(destination_table.num_rows))
See more details in the BigQuery documentation.
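If the weekly CSV lives on local disk rather than in GCS, a similar pattern should work with load_table_from_file; a minimal sketch, assuming a hypothetical local file weekly_data.csv:
# from google.cloud import bigquery
# client = bigquery.Client()
# table_ref = client.dataset('my_dataset').table('existing_table')
job_config = bigquery.LoadJobConfig()
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
job_config.skip_leading_rows = 1
job_config.source_format = bigquery.SourceFormat.CSV
with open("weekly_data.csv", "rb") as source_file:  # hypothetical local path
    load_job = client.load_table_from_file(
        source_file, table_ref, job_config=job_config
    )
load_job.result()  # Waits for the append to complete.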

BigQuery Python Client: Creating a Table from Query with a Table Description

I'm using the Python client to create tables via SQL as explained in the docs (https://cloud.google.com/bigquery/docs/tables), like so:
# from google.cloud import bigquery
# client = bigquery.Client()
# dataset_id = 'your_dataset_id'
job_config = bigquery.QueryJobConfig()
# Set the destination table
table_ref = client.dataset(dataset_id).table('your_table_id')
job_config.destination = table_ref
sql = """
SELECT corpus
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus;
"""
# Start the query, passing in the extra configuration.
query_job = client.query(
    sql,
    # Location must match that of the dataset(s) referenced in the query
    # and of the destination table.
    location='US',
    job_config=job_config,
)  # API request - starts the query
query_job.result() # Waits for the query to finish
print('Query results loaded to table {}'.format(table_ref.path))
This works well except that the client function for creating a table via SQL query uses a job_config object, and job_config receives a table_ref, not a table object.
I found this doc for creating tables with a description: https://google-cloud-python.readthedocs.io/en/stable/bigquery/usage.html, but it covers tables NOT created from queries.
Any ideas on how to create a table from a query while specifying a description for that table?
Since you want to do more than just save the SELECT result to a new table, the best approach is not to set a destination table in your job_config variable, but rather to use a CREATE TABLE statement.
So you need to do two things:
Remove the following two lines from your code:
table_ref = client.dataset(dataset_id).table('your_table_id')
job_config.destination = table_ref
Replace your SQL with this:
#standardSQL
CREATE TABLE dataset_id.your_table_id
PARTITION BY DATE(_PARTITIONTIME)
OPTIONS(
description = 'this table was created via agent #123'
) AS
SELECT corpus
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus;
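For completeness, a minimal sketch of running that DDL through the Python client; note that no destination is set on the job config, since the CREATE TABLE statement names the target itself:
# from google.cloud import bigquery
# client = bigquery.Client()
sql = """
CREATE TABLE dataset_id.your_table_id
PARTITION BY DATE(_PARTITIONTIME)
OPTIONS(
  description = 'this table was created via agent #123'
) AS
SELECT corpus
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus;
"""
query_job = client.query(sql)  # DDL runs like any other query job.
query_job.result()  # Waits for the statement to finish.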

Running a BigQuery query uncached using the Python API

Hi, I am using BigQuery and submitting queries with its Python API to get results. I am using the method bqclient.query("PASS THE QUERY") to execute the query programmatically. I am trying to do a performance test, but BigQuery returns cached results. Is there a way I can set cache = False in the Python API while calling the bqclient.query method? In the BigQuery documentation I have seen that we can set the useQueryCache property to false, but I am not sure where to set it.
Current code:
job_config = bigquery.QueryJobConfig()
job_config.use_query_cache = False
query_job = bqclient.query(select_query, job_config=job_config)
select_query is the query that I want to run.
Thank you
You need to set use_query_cache (the REST API property is useQueryCache). See the QueryJobConfig documentation for more info. Note the lower-case, underscore format:
[..]
QUERY = ('SELECT ..')
job_config = bigquery.QueryJobConfig()
job_config.use_query_cache = False
query_job = client.query(QUERY, job_config=job_config)
[..]
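As a quick sanity check (reusing the same query_job from above), the finished job reports whether the result came from cache:
query_job.result()  # Waits for the job to finish.
print(query_job.cache_hit)  # Should be False when use_query_cache=False.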
