I have a database with records stretching back to 2014 that I have to migrate to BigQuery, and I think that using the partitioned tables feature will help with query performance.
So far, I have loaded a small sample of the real data via the web UI, and while the table was already partitioned, all the data went into a single partition corresponding to the date on which I ran the load, which, to be fair, was expected.
I searched the documentation and ran into this, but I'm not sure it's what I'm looking for.
I have two questions:
1) In the above example, they use the decorator on a SELECT query, but can I use it on an INSERT query as well?
2) I'm using the Python client to connect to the BigQuery API, and while I found the table.insert_data method, I couldn't find anything that refers specifically to inserting into partitions. I'm wondering whether I missed it or whether I will have to use the query API to insert data as well.
I investigated this a bit more:
1) I don't think I've managed to run an INSERT query at all, but this is moot for me, because...
2) It turns out that it is possible to insert into partitions directly using the Python client, but it wasn't obvious to me:
I was using this snippet to insert some data into a table:
from google.cloud import bigquery

# Rows to stream into the table; the tuple order must match the table schema.
items = [
    (1, 'foo'),
    (2, 'bar'),
]

client = bigquery.Client()
dataset = client.dataset('<dataset>')
table = dataset.table('<table_name>')
table.reload()  # fetch the schema so insert_data knows the column layout
print(table.insert_data(items))  # returns any per-row insert errors
The key is appending a $ and a date (say, 20161201) to the table name in the selector, like so:
table = dataset.table('<table_name>$20161201')
And it should insert into the correct partition.
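For completeness, here is the same snippet pointed at a single partition; the dataset, table, and date are placeholders, and this assumes the same (older) google-cloud-bigquery client used above:

from google.cloud import bigquery

client = bigquery.Client()
dataset = client.dataset('<dataset>')

# The '$YYYYMMDD' decorator routes the streamed rows into that day's partition.
partition = dataset.table('<table_name>$20161201')
partition.reload()

errors = partition.insert_data([(1, 'foo'), (2, 'bar')])
print(errors)  # an empty list means every row was accepted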
Related
I have a table to which I wrote 1.6 million records, each with two columns: an ID and a JSON string column.
I want to select all of those records and write the JSON from each row out to a file. However, the query result is too large, and I get the associated 403 error:
"403 Response too large to return. Consider specifying a destination table in your job configuration."
I've been looking at the documentation below and understand that it recommends specifying a destination table for the results and viewing them there, BUT all I want to do is SELECT * from the table, so that would effectively just be copying it over, and I suspect I would run into the same issue when querying that result table.
https://cloud.google.com/bigquery/docs/reference/standard-sql/introduction
https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationQuery.FIELDS.allow_large_results
What is the best practice here? Pagination? Table sampling? list_rows?
I'm using the python client library as stated in the question title. My current code is just this:
query = f'SELECT * FROM `{project}.{dataset}.{table}`'
return client.query(query)
I should also mention that the IDs are not sequential, they're just alphanumerics.
The best practice and most efficient approach is to export your data to Cloud Storage and then download it, instead of querying the whole table with SELECT *.
From there, you can extract the data you need from the exported files (e.g. CSV or JSON) with Python code, without having to wait for a SELECT * query to finish.
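A minimal sketch of that export, assuming a Cloud Storage bucket you control (the bucket name and path are placeholders) and a reasonably recent google-cloud-bigquery client:

from google.cloud import bigquery

client = bigquery.Client()

# Export the table as newline-delimited JSON to Cloud Storage; the wildcard
# lets BigQuery shard the output across several files for a large table.
job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
)
extract_job = client.extract_table(
    f'{project}.{dataset}.{table}',
    'gs://your-bucket/export/rows-*.json',  # placeholder bucket and path
    job_config=job_config,
)
extract_job.result()  # block until the export finishes

The exported files can then be downloaded with gsutil or the google-cloud-storage client and processed line by line.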
I'm trying to query a Cosmos DB and store a table in a pandas DataFrame (or just as a list; the problem is the same), using the following code:
import pandas as pd

# client is an azure-cosmos (SQL API) client instance
table_link = 'dbs/' + database_name + '/colls/' + container_name
query = 'SELECT * FROM ' + container_name
df = pd.DataFrame(client.QueryItems(table_link, query,
                                    {'enableCrossPartitionQuery': True}))
but I have two problems with the output (see the image attached).
First, I get extra columns id, $pk, &id, ... that shouldn't be there (I could just ask for the columns I want in the query, but I have several tables and that would mean writing a different query for each of them). Second, for the actual columns of the table, I get a dict with two keys, "t" and "v", where "v" is the real value of that field.
Any help? I'm not sure if this is the expected behaviour of QueryItems, but I don't see any way to avoid it.
Please check the target API of your Cosmos DB account. More than likely it is the Table API. If the API is not the SQL (Core) API, you will need to use the SDK specific to that API of your Cosmos DB account.
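If it does turn out to be a Table API account, a minimal sketch of reading the same table into pandas with the Table SDK might look like this (assuming the azure-data-tables package; the connection string and table name are placeholders):

import pandas as pd
from azure.data.tables import TableServiceClient

# Placeholder connection string and table name.
service = TableServiceClient.from_connection_string('<connection_string>')
table_client = service.get_table_client(table_name='<table_name>')

# Entities come back as plain dict-like objects, so the {'t': ..., 'v': ...}
# wrapping seen through the SQL (Core) API client does not appear here.
entities = list(table_client.list_entities())
df = pd.DataFrame(entities)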
I'm learning AWS Glue. With traditional ETL, a common pattern is to look up the primary key in the destination table to decide whether you need to do an update or an insert (the upsert design pattern). With Glue there doesn't seem to be that same control; simply writing out the dynamic frame is just an insert process. There are two design patterns I can think of to solve this:
Load the destination as a data frame and, in Spark, left outer join so that only new rows are inserted (how would you update rows if you needed to? Delete then insert? Since I'm new to Spark, this approach is the most foreign to me)
Load the data into a stage table and then use SQL to perform the final merge
It's this second method that I'm exploring first. How, in the AWS world, can I execute a SQL script or stored procedure once the AWS Glue job is complete? Do you use a Python shell job, a Lambda, something directly within Glue, or some other way?
I have used the pymysql library as a zip file uploaded to AWS S3 and configured it in the AWS Glue job parameters. For upserts, I have used INSERT INTO ... ON DUPLICATE KEY UPDATE.
So, based on the primary key check, the code either updates a record if it already exists or inserts a new one. Hope this helps. Please refer to this:
import pymysql

rds_host = "rds.url.aaa.us-west-2.rds.amazonaws.com"
name = "username"
password = "userpwd"
db_name = "dbname"

conn = pymysql.connect(host=rds_host, user=name, passwd=password,
                       db=db_name, connect_timeout=5)

with conn.cursor() as cur:
    # %s placeholders are filled from the row tuple; on a duplicate primary
    # key the existing row is updated instead of a new one being inserted.
    insert_qry = (
        "INSERT INTO ZIP_TERR (zip_code, territory_code, territory_name, state) "
        "VALUES (%s, %s, %s, %s) "
        "ON DUPLICATE KEY UPDATE territory_name = VALUES(territory_name), "
        "state = VALUES(state);"
    )
    cur.execute(insert_qry, ("12345", "T01", "North-West", "WA"))  # example row values
conn.commit()
In the above code sample, zip_code and territory_code are the primary key columns. Please refer here as well: More on looping inserts using a for loop
As always, AWS's evolving feature set resolves many of these problems (which arise from user demand and common work patterns).
AWS has published documentation on Updating and Inserting new data using staging tables (which you mentioned in your second strategy).
Generally speaking, the most rigorous approach for ETL is to truncate and reload from the source data, but this depends on your source. If your source data is a time-series dataset spanning billions of records, you may need to use a delta/incremental load pattern.
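As a rough sketch of the staging-table merge (the second strategy), the final SQL step can be run from a Python shell job or a Lambda after the Glue job has written to the stage table; the stage table and column names below are hypothetical:

import pymysql

# Hypothetical stage/target tables; the Glue job is assumed to have already
# loaded the new rows into stage_zip_terr.
conn = pymysql.connect(host="rds.url.aaa.us-west-2.rds.amazonaws.com",
                       user="username", passwd="userpwd", db="dbname")

merge_sql = (
    "INSERT INTO ZIP_TERR (zip_code, territory_code, territory_name, state) "
    "SELECT zip_code, territory_code, territory_name, state FROM stage_zip_terr "
    "ON DUPLICATE KEY UPDATE territory_name = VALUES(territory_name), "
    "state = VALUES(state);"
)

with conn.cursor() as cur:
    cur.execute(merge_sql)                          # upsert everything staged by the Glue job
    cur.execute("TRUNCATE TABLE stage_zip_terr;")   # reset the stage table for the next run
conn.commit()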
I have a table in Google BigQuery (GBQ) with almost 3 million records (rows) so far, created from data coming in from a MySQL db every day. The data is inserted into the GBQ table using a Python pandas data frame (.to_gbq()).
What is the optimal way to sync changes from MySQL to GBQ, in this direction, with Python?
Several different ways to import data from MySQL into BigQuery that might suit your needs are described in this article. For example, binlog replication:
This approach (sometimes referred to as change data capture - CDC) utilizes MySQL’s binlog. MySQL’s binlog keeps an ordered log of every DELETE, INSERT, and UPDATE operation, as well as Data Definition Language (DDL) data that was performed by the database. After an initial dump of the current state of the MySQL database, the binlog changes are continuously streamed and loaded into Google BigQuery.
Seems to be exactly what you are searching for.
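If full CDC is more than you need, a simpler (though less rigorous) alternative is a timestamp-based incremental load, staying with the pandas/.to_gbq() approach already in use; the column, table, and connection details below are assumptions:

import pandas as pd
from sqlalchemy import create_engine, text

# Hypothetical source: rows in MySQL carry an updated_at timestamp to filter on.
engine = create_engine('mysql+pymysql://user:password@host/dbname')
last_sync = '2020-01-01 00:00:00'  # high-water mark persisted from the previous run

changed = pd.read_sql(
    text('SELECT * FROM source_table WHERE updated_at > :last_sync'),
    engine,
    params={'last_sync': last_sync},
)

# Append only the changed rows to the BigQuery table (requires pandas-gbq).
changed.to_gbq('dataset.table', project_id='your-project', if_exists='append')

Note that a plain append accumulates multiple versions of updated rows, so a periodic deduplication (for example a MERGE in BigQuery) is still needed if you want true upserts.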
I've been able to append to/create a table from a pandas DataFrame using the pandas-gbq package, in particular the to_gbq method. However, when I check the table in the BigQuery web UI I see the following message:
This table has records in the streaming buffer that may not be visible in the preview.
I'm not the only one to ask, and it seems that there's no solution to this yet.
So my questions are:
1. Is there a solution to the above problem (namely, the data not being visible in the web UI)?
2. If there is no solution to (1), is there another way to append data to an existing table using the Python BigQuery API? (The documentation says I can achieve this by running an asynchronous query with writeDisposition=WRITE_APPEND, but the link it provides doesn't explain how to use it and I can't work it out.)
That message is just a UI notice; it should not hold you back.
To check the data, run a simple query and see if it's there.
To read only the data that is still in the streaming buffer, use this query:
#standardSQL
SELECT count(1)
FROM `dataset.table` WHERE _PARTITIONTIME is null
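On question 2, appending with the Python client is also possible via a load job with a WRITE_APPEND disposition; here is a minimal sketch, assuming a recent google-cloud-bigquery client (the project, dataset, and table names are placeholders):

from google.cloud import bigquery
import pandas as pd

client = bigquery.Client()
table_id = 'your-project.your_dataset.your_table'  # placeholder

df = pd.DataFrame({'id': [1, 2], 'payload': ['foo', 'bar']})

# WRITE_APPEND adds the rows to the existing table instead of replacing it.
job_config = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
load_job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
load_job.result()  # wait for the load to finish

Because this uses a load job rather than streaming inserts, the rows also bypass the streaming buffer entirely.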