I am developing an IoT data pipeline using Python and Bigtable, and writes are desperately slow.
I have tried both Python client libraries offered by Google. The native API implements a Row class with a commit method. By iteratively committing rows that way from my local development machine, the write performance against a production instance with 3 nodes is roughly 15 writes / 70 KB per second. Granted, the writes are hitting a single node because of the way my test data is batched, and the data is being uploaded from a local network. However, Google quotes 10,000 writes per second per node, and the upload speed from my machine is 30 MB/s, so the gap clearly lies elsewhere.
I have subsequently tried the happybase API with much hope because the interface provides a Batch class for inserting data. However, after disappointingly hitting the same performance limit, I realized that the happybase API is nothing more than a wrapper around the native API, and the Batch class simply commits rows iteratively in very much the same way as my original implementation.
What am I missing?
I know I'm late to this question, but for anyone else who comes across this, the Google Cloud libraries for Python now allow bulk writes with mutations_batcher. See the mutations_batcher documentation for details.
You can use batcher.mutate_rows and then batcher.flush to send all rows to be updated in one network call, avoiding the iterative row commits.
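For a concrete starting point, here is a minimal sketch with the google-cloud-bigtable client; the project, instance, table, and column-family names are placeholders, and the flush settings are only indicative.

```python
# Hedged sketch: buffer rows in a MutationsBatcher so they are sent as bulk
# MutateRows RPCs instead of one commit (one round trip) per row.
import datetime
from google.cloud import bigtable

client = bigtable.Client(project="my-project")            # placeholder project
table = client.instance("my-instance").table("sensor-data")  # placeholder names

rows = []
for i in range(10_000):
    row = table.direct_row(f"device#{i:08d}".encode())
    row.set_cell("measurements", "temp", str(20 + i % 5).encode(),
                 timestamp=datetime.datetime.utcnow())
    rows.append(row)

batcher = table.mutations_batcher(flush_count=1000)  # flushes every 1000 buffered rows
batcher.mutate_rows(rows)                            # buffered, grouped into bulk RPCs
batcher.flush()                                      # send anything still buffered
```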
Related
I'm developing a heavily loaded application that reads data from a DynamoDB on-demand table. Let's say it constantly performs around 500 reads per second.
From time to time I need to upload a large dataset into the database (100 million records). I use Python, Spark, and audienceproject/spark-dynamodb. I set throughput=40k and use BatchWriteItem() for the writes.
In the beginning I observe some throttled write requests and the write capacity is only 4k, but then scaling kicks in and the write capacity goes up.
Questions:
Does intensive writing affect reading in the case of on-demand tables? Does autoscaling work independently for reading and writing?
Is it fine to set a large throughput for a short period of time? As far as I can see, the cost is the same for on-demand tables. What are the potential issues?
I observe some throttled requests, but eventually all the data is successfully uploaded. How can this be explained? I suspect that the client I use has advanced rate-limiting logic, but I haven't managed to find a clear answer so far.
That's a lot of questions in one question, so you'll get a high-level answer.
DynamoDB scales by increasing the number of partitions. Each item is stored on a partition. Each partition can handle:
up to 3000 Read Capacity Units
up to 1000 Write Capacity Units
up to 10 GB of data
As soon as any of these limits is reached, the partition is split into two and the items are redistributed. This happens until there is sufficient capacity available to meet demand. You don't control how that happens, it's a managed service that does this in the background.
The number of partitions only ever grows.
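As a rough, back-of-the-envelope illustration of what that means for the upload described in the question (a sketch, not an authoritative sizing):

```python
# Uses the per-partition limits listed above and the 40k throughput target
# from the question; nothing here is an official AWS formula.
WCU_PER_PARTITION = 1_000

target_wcu = 40_000
min_partitions = -(-target_wcu // WCU_PER_PARTITION)  # ceiling division
print(min_partitions)  # 40 -> the table needs at least 40 partitions to absorb 40k writes/s

# The ~4k write capacity observed at the start of the upload is consistent with
# a table that has not yet been split that far; throttling stops once enough
# splits have happened in the background.
```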
Based on this information we can address your questions:
Does intensive writing affect reading in the case of on-demand tables? Does autoscaling work independently for reading and writing?
The scaling mechanism is the same for read and write activity, but the scaling points differ as mentioned above. In an on-demand table, AutoScaling is not involved; that's only for tables with provisioned throughput. You shouldn't notice an impact on your reads here.
Is it fine to set a large throughput for a short period of time? As far as I can see, the cost is the same for on-demand tables. What are the potential issues?
I assume the throughput you set is a budget that Spark can use for writing; it won't have much of an impact on on-demand tables. It's information the connector can use internally to decide how much parallelization is possible.
I observe some throttled requests, but eventually all the data is successfully uploaded. How can this be explained? I suspect that the client I use has advanced rate-limiting logic, but I haven't managed to find a clear answer so far.
If the client uses BatchWriteItem, each request returns the list of items that couldn't be written, and the client can enqueue them again. Exponential backoff may be involved, but that is an implementation detail. It's not magic: you just keep track of which items you've written successfully and re-enqueue the ones you haven't until the "to-write" queue is empty.
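To make that loop concrete, here is a hedged sketch with boto3 rather than the Spark connector; the table name is a placeholder, and the items are assumed to already be in DynamoDB's attribute-value format (e.g. {"pk": {"S": "id-1"}, "value": {"N": "42"}}).

```python
# Sketch of the retry loop described above: re-queue UnprocessedItems with
# exponential backoff until everything has been accepted.
import time
import boto3

dynamodb = boto3.client("dynamodb")

def write_all(table_name, items):
    pending = [{"PutRequest": {"Item": item}} for item in items]
    backoff = 0.05
    while pending:
        batch, pending = pending[:25], pending[25:]  # BatchWriteItem takes at most 25 items
        response = dynamodb.batch_write_item(RequestItems={table_name: batch})
        unprocessed = response.get("UnprocessedItems", {}).get(table_name, [])
        if unprocessed:
            pending = unprocessed + pending  # throttled items go back into the "to-write" queue
            time.sleep(backoff)
            backoff = min(backoff * 2, 5.0)  # exponential backoff, capped at 5 s
        else:
            backoff = 0.05
```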
I'm working on a GCP Document AI project. First, let me say this - the OCR works fine :-). I'm curious to know about possibilities for improvement, if any.
What happens now
I have a Python module that gets the OCR done for a TIFF file uploaded via a portal or collected by an agent in the system. The module is written to avoid local use of the original file content, as the file is readily available in a cloud bucket. But the price I have to pay is using the batch_process_documents() API instead of process_document().
An observation
This is an obvious one: a document submitted via the inline API gets its OCR back in less than 5 seconds most of the time, but the batch (with a single document :-|) takes more than 45 seconds almost every time, and sometimes more than a minute.
I'm searching for a way to reduce the OCR call time. As far as I'm aware, the inline API does not support GCS URIs, so either I need to download the content and send it back via the inline API to do the OCR, or I have to live with the reduced performance.
Has anyone handled a similar case? Are there any ways to tackle this without using the batch API or downloading the content? Any help is appreciated.
Your concern is the latency difference between the process and batchProcess method calls of the Document AI API when using a single document, with results of roughly 5 and 45 seconds respectively.
The process_document() method has limits on the number of pages and the file size that can be sent, and it only allows one document file per API call.
The batch_process_documents() method allows asynchronous processing of larger files and batch processing of multiple files.
Single requests are oriented toward smaller amounts of data that usually take a very short time to process, but they may perform poorly on large amounts of data. Batch requests, on the other hand, are oriented toward handling larger amounts of data, where they outperform single requests, but they may perform worse when processing small amounts of data.
Regarding the latency of these two method calls: according to the documentation, for single-request, synchronous ("online") operations (i.e. an immediate response), the document data is processed in memory and not persisted to disk. For asynchronous, offline batch operations, the documents are processed on disk, because the files can be significantly bigger and might not fit in memory. That's why asynchronous operations take around 10x as long as synchronous ones.
Each of these method calls has a particular use case, and the choice comes down to the trade-off that works best for you. If response time is critical and you want the result as soon as possible, you could split the files to fit the size limits and send them as synchronous requests, keeping in mind the quotas and limitations of the API.
This has been raised in the issue tracker. We cannot provide an ETA at the moment, but you can follow the progress there and 'STAR' the issue to receive automatic updates and give it traction.
Since this was originally posted, the Document AI API added a feature to specify a field_mask in the processing request, which limits the fields returned in the Document object output. This can reduce the latency in some requests since the response will be a smaller size.
https://cloud.google.com/document-ai/docs/send-request#online-processor
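For illustration, an online request with a field mask might look like the sketch below; the processor IDs, file name, and mask paths are placeholders, and the document is sent as raw bytes as in the linked sample.

```python
# Hedged sketch of an online (synchronous) Document AI request that limits the
# returned fields with field_mask to keep the response small.
from google.cloud import documentai

client = documentai.DocumentProcessorServiceClient()
name = client.processor_path("my-project", "us", "my-processor-id")  # placeholder IDs

with open("page.tiff", "rb") as f:
    raw_document = documentai.RawDocument(content=f.read(), mime_type="image/tiff")

request = documentai.ProcessRequest(
    name=name,
    raw_document=raw_document,
    # Only return the listed fields; a smaller response can shave some latency.
    field_mask="text,pages.pageNumber",
)
result = client.process_document(request=request)
print(result.document.text[:200])
```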
I have a problem with my Dash application deployed on a remote office server. Two users running the app interfere with each other because of a table import followed by table pricing (the pricing code is around 10,000 lines and produces 8 tables). Looking around online, I saw that the usual fix is to store the data in a hidden html.Div after converting the dataframes to JSON. However, that solution is not workable here because I have to store 9 tables totalling 200,000 rows and 500 columns. So I looked into caching. That option doesn't produce errors, but it increases the execution time of the program considerably: going from a table of 20,000 vehicles to 200,000 multiplies the compute time by almost 1,000, and it is painful every time I change the settings of the graphs.
I use a filesystem cache, following example 4 from this page: https://dash.plotly.com/sharing-data-between-callbacks. By timing things, I noticed that accessing the cache is not the problem (about 1 second); converting the JSON tables back to dataframes is (almost 60 seconds per callback). Since roughly 60 seconds is also what the pricing takes, calling the cache in a callback costs about the same as re-running the pricing in a callback.
1/ Do you have an approach that would cache a dataframe rather than JSON, whether through something like the hidden html.Div, a cookie system, or any other method?
2/ With Redis or Memcached, do we have to return JSON?
2/ If so, how do we set it up, following example 4 from the previous link? I get the error "redis.exceptions.ConnectionError: Error 10061 connecting to localhost:6379. No connection could be made because the target machine actively refused it."
3/ Do you also know whether shutting down the application automatically deletes the cache, regardless of the default_timeout?
I think your issue can be solved using dash_extensions, specifically its server-side callback cache; it might be worth a shot to implement.
https://community.plotly.com/t/show-and-tell-server-side-caching/42854
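A rough sketch of that pattern is below. Note that dash-extensions has changed its API across versions (newer releases use Serverside instead of ServersideOutput), so treat the names as indicative rather than authoritative; the dummy DataFrame stands in for your pricing step.

```python
# Sketch of server-side callback caching: the DataFrame is kept in a
# server-side store (pickled to disk by default), so no JSON round trip.
import pandas as pd
import plotly.express as px
from dash import dcc, html
from dash_extensions.enrich import (DashProxy, ServersideOutputTransform,
                                    ServersideOutput, Output, Input)

app = DashProxy(transforms=[ServersideOutputTransform()])
app.layout = html.Div([
    dcc.Dropdown(id="settings",
                 options=[{"label": s, "value": s} for s in ("a", "b")],
                 value="a"),
    dcc.Loading(dcc.Store(id="pricing-store")),  # holds a server-side reference, not JSON
    dcc.Graph(id="graph"),
])

@app.callback(ServersideOutput("pricing-store", "data"), Input("settings", "value"))
def run_pricing(settings):
    # Stand-in for the 10,000-line pricing step; the returned DataFrame is
    # cached server-side instead of being serialized to the browser.
    return pd.DataFrame({"x": range(200_000), "y": range(200_000)})

@app.callback(Output("graph", "figure"), Input("pricing-store", "data"))
def update_graph(df):
    # df arrives here as a real DataFrame, loaded from the server-side cache.
    return px.scatter(df.sample(2_000), x="x", y="y")

if __name__ == "__main__":
    app.run_server(debug=True)
```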
I have an API deployed on Cloud Run where each request results in a read + write to Cloud Datastore. A non-trivial share of the requests are first-timers (the read from Datastore returns null), so adding caching in front of it may not be too helpful.
Over the past month, the average wall time around calling Datastore and having the data (data = client.get(key, eventual=True)) is 48ms. The payloads are small (a list of dicts, with 10 elements on average, and each dict has two floats).
I'm not sure if I should say that latency is high, but my API has a budget of 100ms to do all that it needs to do and return. If just the data fetching takes ~50% of that time, I'm looking for ways to optimize things.
Questions:
In general, how does 50ms sound for fairly small payloads, fetched by key, from within GCP?
What should I expect (in terms of latency) from Memorystore within GCP?
Assuming Cloud Run and Datastore are in the same location, I would say that 50ms is around the expected latency for a Datastore read; the size of the payload does not matter much for reads (10 vs. 1,000 document reads do not make a big difference in processing/propagation time).
Since you have such a small window for your API to operate in, this could indeed be a problem if some unexpected delays occur.
I have never used Memorystore, so I can't say what latency you could actually expect, but it might be a better option given that every millisecond counts for your app.
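If you do evaluate it, the read path would be a standard cache-aside lookup, roughly like this untested sketch; the Redis host, TTL, and kind/key names are placeholders.

```python
# Cache-aside sketch: check Memorystore (plain Redis) first, fall back to the
# Datastore read the question already uses, and populate the cache on a miss.
import json
import redis
from google.cloud import datastore

ds = datastore.Client()
cache = redis.Redis(host="10.0.0.3", port=6379)  # Memorystore private IP (example)
CACHE_TTL_SECONDS = 300

def get_record(kind, key_name):
    cache_key = f"{kind}:{key_name}"
    cached = cache.get(cache_key)
    if cached is not None:
        return json.loads(cached)  # typically much faster than the ~50 ms Datastore path
    entity = ds.get(ds.key(kind, key_name), eventual=True)
    if entity is None:
        return None  # first-timer: nothing to cache yet
    record = dict(entity)
    cache.set(cache_key, json.dumps(record), ex=CACHE_TTL_SECONDS)
    return record
```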
I launched a simple Crate AMI EC2 instance and opened up port 4200 for Crate and port 5000 for Flask.
When I query CrateDB on the EC2 instance directly, responses are slower but still fast enough (~1-2 seconds), but when I call the same query through the Flask endpoint (on the same instance), which passes it to CrateDB, it takes close to 10 seconds.
I tested the endpoint on localhost and there was no change in execution speed as such, so I've ruled out the code being the problem.
My questions:
Why are the queries being run through the Flask-Restful endpoint on EC2 so slow?
Does it make a difference in performance to build an EC2 AMI from scratch and install CrateDB on it, rather than use the out-of-the-box Crate AMI?
That can be one of several things; mostly, however, I suspect a 'hardware' issue:
Are the hardware specs the same? More cores, more memory, SSD vs. spinning disks?
Is the environment variable CRATE_HEAP_SIZE set to half the available RAM? (/etc/sysconfig/crate)
Is the CREATE TABLE statement the same? A different number of cores results in a different number of shards if not specified explicitly, and over- or under-sharding will degrade performance noticeably.
I am assuming the table size and queries are the same ;) otherwise seemingly minor changes can make a difference in performance. Partitioned tables are only pruned efficiently if the partition column is in the WHERE clause, queries that hit the primary key(s) directly are much faster, and aggregations/comparisons on strings are slower than on numeric types, etc.
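To make the sharding and primary-key points concrete, here is an illustrative sketch with the crate Python client; the table and column names are made up, and the shard count is just an example.

```python
# Explicit sharding avoids the core-count-dependent defaults mentioned above,
# and lookups on the full primary key can be routed to a single shard.
from crate import client

conn = client.connect("http://localhost:4200")
cursor = conn.cursor()

cursor.execute("""
    CREATE TABLE IF NOT EXISTS sensor_readings (
        device_id TEXT,
        ts TIMESTAMP WITH TIME ZONE,
        temperature DOUBLE,
        PRIMARY KEY (device_id, ts)
    )
    CLUSTERED INTO 6 SHARDS
    WITH (number_of_replicas = 0)
""")

# Query that hits the full primary key directly.
cursor.execute(
    "SELECT temperature FROM sensor_readings WHERE device_id = ? AND ts = ?",
    ("sensor-42", "2023-01-01T00:00:00Z"),
)
print(cursor.fetchall())
```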
Cheers, Claus