I'm working on a GCP Document AI project. First, let me say this - the OCR works fine :-). I'm curious to know about possibilities for improvement.
What happens now
I have a Python module that runs OCR on a TIFF file which is uploaded via a portal or collected by an agent in the system. The module is written to avoid handling the original file content locally, since the file is readily available in a cloud bucket. But the price I have to pay is using the batch_process_documents() API instead of process_document().
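Roughly, the module does something like this (simplified; the bucket paths and processor name are placeholders):

    from google.cloud import documentai

    client = documentai.DocumentProcessorServiceClient()

    request = documentai.BatchProcessRequest(
        name="projects/MY_PROJECT/locations/us/processors/MY_PROCESSOR_ID",
        input_documents=documentai.BatchDocumentsInputConfig(
            gcs_documents=documentai.GcsDocuments(documents=[
                documentai.GcsDocument(gcs_uri="gs://my-bucket/in/scan.tiff",
                                       mime_type="image/tiff"),
            ])
        ),
        document_output_config=documentai.DocumentOutputConfig(
            gcs_output_config=documentai.DocumentOutputConfig.GcsOutputConfig(
                gcs_uri="gs://my-bucket/out/"
            )
        ),
    )

    operation = client.batch_process_documents(request)
    operation.result(timeout=300)  # waiting on this long-running operation is where the 45+ seconds go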
An observation
This is an obvious one: a document submitted via the inline API gets its OCR back in less than 5 seconds most of the time. But the batch call (with a single document :-|) takes more than 45 seconds almost every time, and sometimes goes beyond a minute.
I'm searching for a way to reduce the OCR call time. As far as I'm aware, the inline API does not support GCS URIs, so either I need to download the content and submit it via the inline API to do the OCR - or I need to live with the performance hit.
Has anyone handled a similar case? Are there any ways to tackle this without using the batch API or downloading the content? Any help is appreciated.
Your concern is the latency difference between the process and batchProcess method calls of the Document AI API when comparing response times for a single document: roughly 5 and 45 seconds respectively.
The process_document() method has limits on the number of pages and file size that can be sent, and it only allows one document file per API call.
The batch_process_documents() method allows asynchronous processing of larger files and batch processing of multiple files.
Single requests are oriented toward small amounts of data that usually take very little time to process, but they may perform poorly on large amounts of data. Batch requests, on the other hand, are oriented toward larger amounts of data: they outperform single requests there, but may perform worse when processing a small amount of data.
Regarding your concerns about the latency of these two method calls: looking into the documentation, I found that for single-request, synchronous ("online") operations (i.e. immediate response), the document data is processed in memory and not persisted to disk. In asynchronous, offline batch operations the documents are processed on disk, because the files could be significantly bigger and might not fit in memory. That's why asynchronous operations take around 10x as long as synchronous operations.
Each of these method calls has a particular use case, and choosing between them comes down to the trade-off that works better for you. If response time is critical and you want the response as soon as possible, you could split the files to fit the size limits and make the requests as synchronous operations, keeping in mind the quotas and limitations of the API.
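If you go that route, the download-then-online-call path is short. A minimal sketch, assuming the google-cloud-storage and google-cloud-documentai clients, with placeholder names:

    from google.cloud import documentai, storage

    def ocr_online(bucket_name, blob_name, processor_name):
        # Pull the TIFF from the bucket into memory (the file must fit the
        # online API's size limits for this to work).
        raw_bytes = (storage.Client()
                     .bucket(bucket_name)
                     .blob(blob_name)
                     .download_as_bytes())

        client = documentai.DocumentProcessorServiceClient()
        result = client.process_document(
            request=documentai.ProcessRequest(
                name=processor_name,
                raw_document=documentai.RawDocument(content=raw_bytes,
                                                    mime_type="image/tiff"),
            )
        )
        return result.document.text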
This issue has been raised in the issue tracker. We cannot provide an ETA at this moment, but you can follow the progress there and 'STAR' the issue to receive automatic updates and give it traction by referring to this Link.
Since this was originally posted, the Document AI API added a feature to specify a field_mask in the processing request, which limits the fields returned in the Document object output. This can reduce the latency in some requests since the response will be a smaller size.
https://cloud.google.com/document-ai/docs/send-request#online-processor
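For example (a sketch; the exact mask entries depend on which Document fields you actually consume):

    from google.cloud import documentai

    client = documentai.DocumentProcessorServiceClient()
    result = client.process_document(
        request=documentai.ProcessRequest(
            name="projects/MY_PROJECT/locations/us/processors/MY_PROCESSOR_ID",
            raw_document=documentai.RawDocument(content=tiff_bytes,
                                                mime_type="image/tiff"),
            field_mask="text,pages.pageNumber",  # return only these fields to shrink the response
        )
    )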
Related
I develop a heavily loaded application that reads data from a DynamoDB on-demand table. Let's say it constantly performs around 500 reads per second.
From time to time I need to upload a large dataset into the database (100 million records). I use python, spark and audienceproject/spark-dynamodb. I set throughput=40k and use BatchWriteItem() for data writing.
In the beginning I observe some throttled write requests and a write capacity of only 4k, but then upscaling takes place and write capacity goes up.
Questions:
Does intensive writing affect reading in the case of on-demand tables? Does autoscaling work independently for reading/writing?
Is it fine to set large throughput for a short period of time? As far as I see the cost is the same in the case of on-demand tables. What are the potential issues?
I observe some throttled requests, but eventually all the data is successfully uploaded. How can this be explained? I suspect the client I use has advanced rate-limiting logic, but I haven't managed to find a clear answer so far.
That's a lot of questions in one question, so you'll get a high-level answer.
DynamoDB scales by increasing the number of partitions. Each item is stored on a partition. Each partition can handle:
up to 3000 Read Capacity Units
up to 1000 Write Capacity Units
up to 10 GB of data
As soon as any of these limits is reached, the partition is split into two and the items are redistributed. This happens until there is sufficient capacity available to meet demand. You don't control how that happens, it's a managed service that does this in the background.
The number of partitions only ever grows.
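As a back-of-the-envelope check against the numbers in the question (the 40k target comes from there; the per-partition limit is from the list above):

    target_wcu = 40_000        # throughput budget set for the Spark job
    wcu_per_partition = 1_000  # max write capacity units per partition
    min_partitions = target_wcu // wcu_per_partition
    print(min_partitions)      # 40 partitions needed to sustain 40k WCU

A fresh on-demand table starts out able to serve roughly 4k writes per second, which lines up with the initial throttling you observed; capacity climbs as partitions split.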
Based on this information we can address your questions:
Does intensive writing affect reading in the case of on-demand tables? Does autoscaling work independently for reading/writing?
The scaling mechanism is the same for read and write activity, but the scaling point differs as mentioned above. In an on-demand table AutoScaling is not involved; that's only for tables with provisioned throughput. You shouldn't notice an impact on your reads here.
Is it fine to set large throughput for a short period of time? As far as I see the cost is the same in the case of on-demand tables. What are the potential issues?
I assume the throughput you set is a budget Spark can use for writing; it won't have much of an impact on on-demand tables. It's information Spark can use internally to decide how much parallelization is possible.
I observe some throttled requests, but eventually all the data is successfully uploaded. How can this be explained? I suspect the client I use has advanced rate-limiting logic, but I haven't managed to find a clear answer so far.
If the client uses BatchWriteItem, it will get a list of items that couldn't be written for each request and can enqueue them again. Exponential backoff may be involved but that is an implementation detail. It's not magic, you just have to keep track of which items you've successfully written and enqueue those that you haven't again until the "to-write" queue is empty.
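A minimal sketch of that loop with boto3 (table name, item shape, and backoff constants are illustrative):

    import time
    import boto3

    dynamodb = boto3.client("dynamodb")

    def batch_write_with_retry(table_name, items, max_attempts=8):
        # Wrap items in PutRequest entries; BatchWriteItem takes at most 25 per call.
        pending = [{"PutRequest": {"Item": item}} for item in items]
        attempt = 0
        while pending:
            batch, rest = pending[:25], pending[25:]
            resp = dynamodb.batch_write_item(RequestItems={table_name: batch})
            unprocessed = resp.get("UnprocessedItems", {}).get(table_name, [])
            if unprocessed:
                attempt += 1
                if attempt > max_attempts:
                    raise RuntimeError(f"{len(unprocessed) + len(rest)} items not written")
                time.sleep(min(0.05 * 2 ** attempt, 5))  # exponential backoff
            else:
                attempt = 0
            pending = unprocessed + rest  # re-enqueue whatever wasn't written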
I have an API deployed on Cloud Run where each request results in a read + write to Cloud Datastore. A non-trivial amount of the requests are first timers (read from Datastore will return null), so adding caching in front of it may not be too helpful.
Over the past month, the average wall time around calling Datastore and having the data (data = client.get(key, eventual=True)) is 48ms. The payloads are small (a list of dicts, with 10 elements on average, and each dict has two floats).
I'm not sure if I should say that latency is high, but my API has a budget of 100ms to do all that it needs to do and return. If just the data fetching takes ~50% of that time, I'm looking for ways to optimize things.
Questions:
In general, how does 50ms sound for fairly small payloads, fetched by key, from within GCP?
What should I expect (in terms of latency) from Memorystore within GCP?
Assuming Cloud Run and Datastore are in the same location, I would say 50ms is around the expected latency for a Datastore read; the size of the payload does not matter much for reads (10 vs. 1000 document reads do not make a big difference in processing/propagation time).
Since you have such a small window for your API to operate in, this could indeed be a problem if some unexpected delays occur.
I have never used Memorystore, so I can't say what actual latency you could expect, but it might be a better option given that every millisecond counts for your app.
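If you do experiment with it, a read-through cache in front of Datastore is only a few lines. A sketch, assuming redis-py, a placeholder Memorystore IP, and a hypothetical "Record" kind:

    import json
    import redis
    from google.cloud import datastore

    ds = datastore.Client()
    cache = redis.Redis(host="10.0.0.3", port=6379)  # Memorystore private IP (placeholder)

    def get_record(key_name, ttl_s=300):
        # Try the cache first; fall back to Datastore and populate on a miss.
        cached = cache.get(key_name)
        if cached is not None:
            return json.loads(cached)
        entity = ds.get(ds.key("Record", key_name), eventual=True)
        if entity is not None:
            cache.set(key_name, json.dumps(dict(entity)), ex=ttl_s)
        return entity

Since a non-trivial share of your requests are first-timers, the cache only helps repeat keys; measure the hit rate before committing to it.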
I'm working on a project to build a streaming client on top of libtorrent.
I'm using the Python client (Python bindings).
I searched a lot about the functions set_sequential_download() and set_piece_deadline(), and I couldn't find a good answer on how to force downloading pieces in order, meaning first piece 1, then 2, 3, 4, etc.
I saw people asking this in forums, but none of them got a good answer on what changes need to be made for it to work.
I understand that set_sequential_download() just requests the pieces in order, but in fact they are downloaded out of order. I tried changing the deadline of the pieces using set_piece_deadline(), incrementing it for each piece, but it doesn't work for me at all.
** UPDATE
The goal I'm trying to accomplish is downloading one piece at a time so I can stream through torrents.
I hope some of you can help me.
Thanks, Ben.
set_sequential_download() will request pieces in order. However:
all peers may not have all pieces. If the next piece you want to download is 3 and one of your peers doesn't have piece 3 but its next available piece is 5, libtorrent will start requesting blocks from piece 5 from that peer.
peers provide varying upload rates, which means that some peers will satisfy your request sooner than others.
This makes it possible for the pieces to complete out-of-order.
set_piece_deadline() is a more flexible way to specify piece priority. It supports arbitrary range requests (as described by Jacob Zelek). Its main feature, though, is that it uses a different approach to requesting blocks. Instead of considering a peer at a time, and asking "what should I request from this peer", it considers a piece at a time, asking "which peer should I request this block from".
This makes it deliberately attempt to complete pieces in the order of their deadlines. It is still an estimate based on historical download rates from peers, and if the bottleneck for download rates is your own download capacity, it may be very difficult to predict peers' future download rates. A few important things to keep in mind when using the `set_piece_deadline()` API are:
It's not important that the deadline is in the future. If the deadline cannot be met given the current download or upload capacity, the pieces will be prioritized in the order they were asked to be completed.
If a deadline is far out in the future, libtorrent may wait to prioritize it until it believes it needs to request it to make the deadline. If you're streaming a large file and you know the bit-rate, you can set up deadlines for every piece; if your capacity is higher than the bit-rate, you'll still request some pieces in rarest-first order, which improves swarm quality.
When streaming data, it's absolutely critical to read-ahead. If you don't set the deadline until you want the piece, you'll always fall behind. There's typically a pretty long round-trip between requesting a piece and completing it. If you don't keep the request pipe full of deadline-pieces, libtorrent will start requesting other pieces again, and you'll get non-prioritized pieces interleaved with your high-priority pieces. You should probably keep a few seconds and at least a few pieces as read-ahead. For video, I would imagine tens of megabytes is appropriate (but experimentation and measurement is the best way to tweak it).
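For example, a read-ahead schedule could be derived from the bit-rate like this (a sketch with the python bindings; `h` is a torrent_handle and the 4 Mbit/s figure is an assumption):

    info = h.torrent_file()
    piece_len = info.piece_length()      # bytes per piece
    bytes_per_s = 4_000_000 // 8         # assumed 4 Mbit/s video
    ms_per_piece = piece_len * 1000 // bytes_per_s

    for idx in range(info.num_pieces()):
        # Staggered deadlines: piece 0 now, piece 1 one piece-duration later, ...
        h.set_piece_deadline(idx, idx * ms_per_piece)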
If you are in fact looking to stream video to a player or web browser over HTTP, you may want to take a look at (or use and submit pull requests to):
https://github.com/arvidn/libtorrent-webui/blob/master/src/file_downloader.cpp
That's a file-downloader provider that fits into the simple HTTP framework in that repository.
UPDATE:
If all you want is to guarantee that piece 1 completes before piece 2 (at any cost, specifically very poor performance), you can set the priority of all pieces to 0, except for the one piece you want to download. Once it completes, you'll be notified by an alert and you can set the priority of the next piece you want to 1. And so on.
This will be incredibly slow, since you'll pause the download constantly, and be in constant end-game mode (where you may download the same block from multiple peers, if one is slow). For instance, if you have more peers than there are blocks in one piece, you will leave download bandwidth unused, by not being able to request from all peers.
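A minimal sketch of that alternating loop with the python bindings (assumes a reasonably recent libtorrent with pop_alerts(); file names are illustrative):

    import time
    import libtorrent as lt

    ses = lt.session()
    h = ses.add_torrent({"ti": lt.torrent_info("movie.torrent"), "save_path": "."})

    num_pieces = h.torrent_file().num_pieces()
    h.prioritize_pieces([0] * num_pieces)  # switch every piece off
    current = 0
    h.piece_priority(current, 7)           # enable only the first piece

    while current < num_pieces:
        for a in ses.pop_alerts():
            if isinstance(a, lt.piece_finished_alert) and a.piece_index == current:
                current += 1
                if current < num_pieces:
                    h.piece_priority(current, 7)  # move on to the next piece
        time.sleep(0.1)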
I've run into the same problem as you. Setting a torrent to sequential download means the pieces will be downloaded in a somewhat ordered fashion. This may be the intuitive solution for streaming. However, streaming video is more complicated than just downloading all the pieces in order.
Video files come in different containers (e.g. mkv, mp4, avi) and different codecs (h264, theora, etc.). Some codecs/containers store metadata/headers in different locations in a file. I can't remember off the top of my head, but a certain container/codec stores all header information at the end of the file. Such a file may not stream well if downloaded sequentially.
Unless you write the code for determining which pieces are needed to start streaming, you will have to rely on existing mechanisms. Take for example Peerflix, which spawns a browser video player, VLC, or MPlayer. These applications have a good idea of what byte ranges they need for various containers/codecs. When Peerflix launches VLC to play, let's say, an AVI file, VLC will attempt to read the first several bytes and last several bytes (headers).
The genius behind Peerflix is that it serves the video file through its own web server and therefore knows what byte ranges of the file VLC is seeking. It then determines which pieces the byte ranges fall into and prioritizes those pieces. Peerflix uses a Node.js BitTorrent library whose exact piece prioritization mechanisms are unknown to me. However, in the case of libtorrent-rasterbar, the set_piece_deadline() function allows you to signal to the library which pieces you need. In my experience, once I determined the pieces needed, I would call set_piece_deadline() with a short deadline (50ms or so) and wait for their arrival. Please note that using set_piece_deadline() is incompatible with sequential download (just set it to false).
One thing to note, libtorrent-rasterbar will not write the piece to the hard drive as soon as it gets it. This is a trap I fell into because I tried to read that byte range from the file when the piece arrived. For this you will need to run a thread to catch the alerts that libtorrent-rasterbar passes to your application. More specifically you will receive the raw binary data for that piece in a read_piece_alert.
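A rough sketch of that alert loop (python bindings; `ses` and `h` are a running session and torrent_handle, and needed_pieces/serve_bytes are hypothetical names):

    import libtorrent as lt

    # Prioritize the pieces covering the byte range the player asked for.
    for idx in needed_pieces:
        h.set_piece_deadline(idx, 50)   # the short ~50ms deadline described above

    # In the alert thread: get the data through libtorrent, not from the file.
    for a in ses.pop_alerts():
        if isinstance(a, lt.piece_finished_alert) and a.piece_index in needed_pieces:
            h.read_piece(a.piece_index)        # request the raw piece bytes
        elif isinstance(a, lt.read_piece_alert):
            serve_bytes(a.piece, a.buffer)     # buffer holds the piece's data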
To analyze a large number of websites or financial data and pull out parametric data, what are the optimal strategies?
I'm classifying the following strategies as either "on-the-fly" or "deferred". Which is best?
On-the-fly: Process data on-the-fly and store parametric data into a database
Deferred: Store all the source data as ASCII into a file system and post process later, or with a processing-data-daemon
Deferred: Store all pages as a BLOB in a database to post-process later, or with a processing-data-daemon
Number 1 is simplest, especially if you only have a single server. Can #2 or #3 be more efficient with a single server, or do you only see the power with multiple servers?
Are there any python projects that are already geared toward this kind of analysis?
Edit: by best, I mean fastest execution, to keep the user from waiting, with ease of programming as a secondary concern.
I'd use Celery, on either a single machine or multiple machines, with the "on-the-fly" strategy. You can have an aggregation task that fetches the data and a processing task that analyzes it and stores it in a DB. This is a highly scalable approach, and you can tune it according to your computing power.
The "on-the-fly" strategy is more efficient in the sense that you process your data in a single pass. The other two involve an extra step: retrieving the data from where you saved it and processing it after that.
Of course, everything depends on the nature of your data and the way you process it. If the processing phase is slower than aggregation, the "on-the-fly" strategy will hang and wait until processing completes. But again, you can configure Celery to be asynchronous and continue to aggregate while there is still unprocessed data.
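A minimal sketch of that pair of tasks (the broker URL and the parser/DB helpers are placeholders):

    import requests
    from celery import Celery

    app = Celery("analysis", broker="redis://localhost:6379/0")

    @app.task
    def aggregate(url):
        # Fetch one page and hand the raw content to the processing task.
        html = requests.get(url, timeout=30).text
        process.delay(url, html)

    @app.task
    def process(url, html):
        # Extract the parametric data and persist it.
        params = extract_parameters(html)  # hypothetical parser
        save_to_db(url, params)            # hypothetical DB writer

You would then enqueue aggregate.delay(url) for each site and let the workers drain the queue in parallel.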
First: "fastest execution to prevent user from waiting" means some kind of deferred processing. Once you decide to defer the processing -- so the user doesn't see it -- the choice between flat-file and database is essentially irrelevant with respect to end-user-wait time.
Second: databases are slow. Flat files are fast. Since you're going to use celery and avoid end-user-wait time, however, the distinction between flat file and database becomes irrelevant.
Store all the source data as ASCII into a file system and post process later, or with a processing-data-daemon
This is fastest. Use Celery to load the flat files.
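A sketch of that variant, reusing the Celery app from the previous answer (the spool directory and helper names are hypothetical):

    import pathlib
    import uuid

    SPOOL = pathlib.Path("/var/spool/pages")

    def save_raw(page_text):
        # Fast path: dump the raw page to a flat file and return immediately.
        path = SPOOL / f"{uuid.uuid4()}.txt"
        path.write_text(page_text)
        load_file.delay(str(path))  # defer the slow parsing to a worker

    @app.task
    def load_file(path):
        # Deferred path: parse the flat file and store the parametric data.
        text = pathlib.Path(path).read_text()
        save_to_db(path, extract_parameters(text))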
I have an application which obtains data in JSON format from one of our other servers. The problem I am facing is that there is a significant delay when requesting this information. Since a lot of data is passed (approx. 1000 records per request, where each record is pretty large), could compression help reduce this delay? If so, which compression scheme would you recommend?
I read on another thread that the pattern of the data also matters a lot for the type of compression to use. The pattern of my data is consistent and resembles the following:
:desc=>some_description
:url=>some_url
:content=>some_content
:score=>some_score
:more_attributes=>more_data
Can someone recommend a solution for reducing this delay? The delay is approx. 6-8 seconds. I'm using Ruby on Rails to develop this application, and the server providing the data uses Python for the most part.
I would first look at how much of this 8s delay is related to:
Server-side processing (how long it took for the data to be generated)
There are a lot of techniques to improve this time, including:
DB indexes
caching
a faster to_json library
Some excellent resources are the NewRelic podcasts on Rails scalability http://railslab.newrelic.com/2009/02/09/episode-7-fragment-caching
Transmission delay (how long it took for the data to be sent between the server and the client)
If the keys are pretty much the same, you may implement the solution from "Compression algorithm for JSON encoded packets?"; you may want to look at https://github.com/WebReflection/json.hpack/wiki/specs-details and http://www.nwhite.net/?p=242
In addition to this, you may also compress (gzip) it from your frontend server (see the sketch after these links):
http://httpd.apache.org/docs/2.0/mod/mod_deflate.html
http://wiki.nginx.org/NginxHttpGzipModule
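If the Python side builds the payload itself, compressing it takes only a couple of lines with the standard library (a sketch; repetitive keys like yours compress very well):

    import gzip
    import json

    def build_payload(records):
        # Serialize once, then gzip; remember to set Content-Encoding: gzip
        # on the response so the Rails client decodes it transparently.
        body = json.dumps(records).encode("utf-8")
        return gzip.compress(body, compresslevel=6)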
If the data structure is constant, you can also try to implement a binary service like Thrift, which is much faster and includes compression, but is also more difficult to maintain:
http://www.igvita.com/2007/11/30/ruby-web-services-with-facebooks-thrift/
If it suits your needs, maybe you can build some kind of versioning/cache system server-side and send only the records that were modified (but that is pretty heavy to implement).
gzip might significantly reduce the size of text data and optimize load speeds. It's also recommended by YSlow.