BigQuery - insert row larger than 1MB

BigQuery - insert row larger than 1MB - python

My python app is storing result data in BigQuery. In the code I am generating JSON that reflects target BQ table structure and then insert it.
Generally it works fine, but fails to save rows, which size exceeds 1 MB. This is an limitation of using streaming inserts.
I checked Google API documentation: https://googleapis.dev/python/bigquery/latest/index.html
It seems, that Client methods like insert_rows or insert_rows_json are using insertAll method underneath - which uses streaming mechanism.
Is there a way to invoke "standard" BigQuery insert from python code to insert row larger than 1MB? It would be rather rare occurrence, so I am not concerned about quotas regarding daily table insert count limit.

The Client library cannot go around the API limits.
See current quotas, a row as of this writing cannot be larger than 1MB.
The workaround we used is to save records in NJSON to GCS in 100MB batches - we use gcsfs library - and then execute a bq.load() job.
I have actually just logged a feature request here to increase the limit as this is very limiting. If interested, make sure to "star" it to gain traction.

Related

Firestore read and update is taking 1-12 sec delay to complete in Python

I am using firebase python client to write data to firestore.Any read / write operation at least takes 1 second to complete.Firestore DB is in us-central and our server is in Singapore.
Is it what causing issues?
During read, I have used a where query with limit like below.
collection_ref.where(
u"field", u"==", u"field_value").limit(1).get()
During write, I use set and update(dict)
Sometimes the lag is around 10 to 12 sec
Did anyone face similar issues?
Any pointers will be appreciated

This article on why is Cloud Firestore query slow mentioned the lists of reasons
If you are downloading a bunch of data you probably don’t need to download all of them.The solution would be to limit the amount that comes back.
Your offline cache is too big. Cloud Firestore does some amazing offline caching but this local cache does not apply the same indexes that the server does. This means when you query documents in your offline cache cloud Firestore needs to pack every documents stored locally for the collection being queried and compare it against your query.The solution is limit how much data is being stored in offline cache.
Without composite indexing Firestore would have to do a lot of searching to get the results set.So instead, create a composite index so Firestore can do a quick lookup.
Used to Realtime Database.Realtime Database generally has a lower latency,you are not really going to notice the difference.But if app needs every second of latency you are probably better off using Realtime
Database in these scenarios.
The laws of physics are keeping you down. Your customer might be too far away from your Firestore Database and the actual latency is taking too long. To fix this use real time listeners which is a technique called latency compensation.

How to get individual row from bigquery table less then a second?

I have a aggregated data table in bigquery that has millions of rows. This table is growing everyday.
I need a way to get 1 row from this aggregate table in milliseconds to append data in real time event.
What is the best way to tackle this problem?

BigQuery is not build to respond in miliseconds, so you need an other solution in between. It is perfectly fine to use BigQuery to do the large aggregration calculation. But you should never serve directly from BQ where response time is an issue of miliseconds.
Also be aware, that, if this is an web application for example, many reloads of a page, could cost you lots of money.. as you pay per Query.
There are many architectual solution to fix such issues, but what you should use is hard to tell without any project context and objectives.
For realtime data we often use PubSub to connect somewhere in between, but that might be an issue if the (near) realtime demand is an aggregrate.
You could also use materialized views concept, by exporting the aggregrated data to a sub component. For example cloud storage -> pubsub , or a SQL Instance / Memory store.. or any other kind of microservice.

BigQuery - Update Tables With Changed/Deleted Records

Presently, we send entire files to the Cloud (Google Cloud Storage) to be imported into BigQuery and do a simple drop/replace. However, as the file sizes have grown, our network team doesn't particularly like the bandwidth we are taking while other ETLs are also trying to run. As a result, we are looking into sending up changed/deleted rows only.
Trying to find the path/help docs on how to do this. Scope - I will start with a simple example. We have a large table with 300 million records. Rather than sending 300 million records every night, send over X million that have changed/deleted. I then need to incorporate the change/deleted records into the BigQuery tables.
We presently use Node JS to move from Storage to BigQuery and Python via Composer to schedule native table updates in BigQuery.
Hope to get pointed in the right direction for how to start down this path.

Stream the full row on every update to BigQuery.
Let the table accommodate multiple rows for the same primary entity.
Write a view eg table_last that picks the most recent row.
This way you have all your queries near-realtime on real data.
You can deduplicate occasionally the table by running a query that rewrites self table with latest row only.
Another approach is if you have 1 final table, and 1 table which you stream into, and have a MERGE statement that runs scheduled every X minutes to write the updates from streamed table to final table.

How to fix query problem on Azure CosmosDB that occurs only on collections with large data?

I'm trying to read from a CosmosDB collection (MachineCollection) with a large amount of data (58 GB data; index-size 9 GB). Throughput is set to 1000 RU/s. The collection is partitioned with a Serial number, Read Location (WestEurope, NorthEurope), Write Location (WestEurope). Simultaneously to my reading attempts, the MachineCollection is fed with data every 20 seconds.
The problem is that I can not query any data via Python. If I execute the query on CosmosDB Data Explorer I get results in no time. (e.g. querying for a certain serial number).
For troubleshooting purposes, I have created a new Database (TestDB) and a TestCollection. In this TestCollection, there are 10 datasets of MachineCollection. If I try to read from this MachineCollection via Python it succeeds and I am able to save the data to CSV.
This makes me wonder why I am not able to query data from MachineCollection when configuring TestDB and TestCollection with the exact same properties.
What I have already tried for the querying via Python:
options['enableCrossPartitionQuery'] = True
Querying using PartitionKey: options['partitionKey'] = 'certainSerialnumber'
Same as always. Works with TestCollection, but not with MachineCollection.
Any ideas on how to resolve this issue are highly appreciated!

Firstly, what you need to know is that Document DB imposes limits on Response page size. This link summarizes some of those limits: Azure DocumentDb Storage Limits - what exactly do they mean?
Secondly, if you want to query large data from Document DB, you have to consider the query performance issue, please refer to this article:Tuning query performance with Azure Cosmos DB.
By looking at the Document DB REST API, you can observe several important parameters which has a significant impact on query operations : x-ms-max-item-count, x-ms-continuation.
As I know,Azure portal doesn't automatically help you optimize your SQL so you need to handle this in the sdk or rest api.
You could set value of Max Item Count and paginate your data using continuation token. The Document Db sdk supports reading paginated data seamlessly. You could refer to the snippet of python code as below:
q = client.QueryDocuments(collection_link, query, {'maxItemCount':10})
results_1 = q._fetch_function({'maxItemCount':10})
#this is a string representing a JSON object
token = results_1[1]['x-ms-continuation']
results_2 = q._fetch_function({'maxItemCount':10,'continuation':token})
Another case you could refer to:How do I set continuation tokens for Cosmos DB queries sent by document_client objects in Python?

Is it possible to Bulk Insert using Google Cloud Datastore

We are migrating some data from our production database and would like to archive most of this data in the Cloud Datastore.
Eventually we would move all our data there, however initially focusing on the archived data as a test.
Our language of choice is Python, and have been able to transfer data from mysql to the datastore row by row.
We have approximately 120 million rows to transfer and at a one row at a time method will take a very long time.
Has anyone found some documentation or examples on how to bulk insert data into cloud datastore using python?
Any comments, suggestions is appreciated thank you in advanced.

There is no "bulk-loading" feature for Cloud Datastore that I know of today, so if you're expecting something like "upload a file with all your data and it'll appear in Datastore", I don't think you'll find anything.
You could always write a quick script using a local queue that parallelizes the work.
The basic gist would be:
Queuing script pulls data out of your MySQL instance and puts it on a queue.
(Many) Workers pull from this queue, and try to write the item to Datastore.
On failure, push the item back on the queue.
Datastore is massively parallelizable, so if you can write a script that will send off thousands of writes per second, it should work just fine. Further, your big bottleneck here will be network IO (after you send a request, you have to wait a bit to get a response), so lots of threads should get a pretty good overall write rate. However, it'll be up to you to make sure you split the work up appropriately among those threads.
Now, that said, you should investigate whether Cloud Datastore is the right fit for your data and durability/availability needs. If you're taking 120m rows and loading it into Cloud Datastore for key-value style querying (aka, you have a key and an unindexed value property which is just JSON data), then this might make sense, but loading your data will cost you ~$70 in this case (120m * $0.06/100k).
If you have properties (which will be indexed by default), this cost goes up substantially.
The cost of operations is $0.06 per 100k, but a single "write" may contain several "operations". For example, let's assume you have 120m rows in a table that has 5 columns (which equates to one Kind with 5 properties).
A single "new entity write" is equivalent to:
+ 2 (1 x 2 write ops fixed cost per new entity)
+ 10 (5 x 2 write ops per indexed property)
= 12 "operations" per entity.
So your actual cost to load this data is:
120m entities * 12 ops/entity * ($0.06/100k ops) = $864.00

I believe what you are looking for is the put_multi() method.
From the docs, you can use put_multi() to batch multiple put operations. This will result in a single RPC for the batch rather than one for each of the entities.
Example:
# a list of many entities
user_entities = [ UserEntity(name='user %s' % i) for i in xrange(10000)]
users_keys = ndb.put_multi(user_entities) # keys are in same order as user_entities
Also to note, from the docs is that:
Note: The ndb library automatically batches most calls to Cloud Datastore, so in most cases you don't need to use the explicit batching operations shown below.
That said, you may still, as suggested by , use a task queue (I prefer the deferred library) in order to batch-put a lot of data in the background.

As an update to the answer of #JJ Geewax, as of July 1st, 2016
the cost of read and write operations have changed as explained here: https://cloud.google.com/blog/products/gcp/google-cloud-datastore-simplifies-pricing-cuts-cost-dramatically-for-most-use-cases
So writing should have gotten cheaper for the described case, as
writing a single entity only costs 1 write regardless of indexes and will now cost $0.18 per 100,000

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.