Real-time data collection and 'offline' processing - python

I have a continuous stream of data. I want to do a small amount of processing on the data in real time (mostly just compression, rolling some data off the end, whatever needs doing) and then store it. Presumably no problem. The HDF5 file format should do great! Out-of-core (OOC) data, no problem. PyTables.
Now the trouble. Occasionally, as a completely separate process so that data is still being gathered, I would like to perform a time-consuming calculation involving the data (order of minutes). This involves reading the same file I'm writing.
How do people do this?
Of course reading a file that you're currently writing should be challenging, but it seems that it must have come up enough in the past that people have considered some sort of slick solution---or at least a natural work-around.
Partial solutions:
It seems that HDF5 1.10.0 has a capability called SWMR (Single Writer, Multiple Reader). This seems like exactly what I want. I can't find a Python wrapper for this recent version, or if it exists, I can't get Python to talk to the right version of HDF5. Any tips here would be welcome. I'm using the Conda package manager. (A sketch of what this might look like through h5py appears after this list.)
I could imagine writing to a buffer, which is occasionally flushed and added to the large database. How do I ensure that I'm not missing data going by while doing this?
This also seems like it might be computationally expensive, but perhaps there's no getting around that.
Collect less data. What's the fun in that?
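For what it's worth, here is a rough sketch of what the SWMR route from point 1 might look like through h5py, assuming a recent h5py built against HDF5 1.10 or newer; the random chunks stand in for the real data stream:

```python
# Rough sketch of SWMR with h5py (assumes h5py built against HDF5 >= 1.10).
import h5py
import numpy as np

# --- writer process ---
f = h5py.File("stream.h5", "w", libver="latest")
dset = f.create_dataset("data", shape=(0,), maxshape=(None,),
                        dtype="f8", chunks=(1024,))
f.swmr_mode = True                         # from here on, readers may open the file

for _ in range(100):                       # stand-in for the real-time stream
    chunk = np.random.random(1024)
    n = dset.shape[0]
    dset.resize((n + chunk.size,))
    dset[n:] = chunk
    dset.flush()                           # make the new rows visible to readers

# --- reader process (separate script, runs while the writer is active) ---
# r = h5py.File("stream.h5", "r", libver="latest", swmr=True)
# data = r["data"]
# data.refresh()                           # pick up rows flushed since opening
# print(data.shape, data[-10:])
```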

I suggest you take a look at adding Apache Kafka to your pipeline; it can act as a data buffer and help you separate the different tasks performed on the data you collect.
pipeline example:
raw data ===> kafka topic (raw_data) ===> small processing ===> kafka topic (light_processing) ===> a process reads from the light_processing topic and writes to a db or file
At the same time, another process can read the same data from the light_processing topic (or any other topic) and do your heavy processing on it, and so on.
If the light-processing and heavy-processing consumers read the topic under different consumer group IDs, each group receives its own complete copy of the stream, so both processes see the same data.
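A minimal sketch of the heavy-processing side using the kafka-python package (assumed installed); the broker address and topic name are placeholders matching the pipeline above:

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "light_processing",
    bootstrap_servers="localhost:9092",
    group_id="heavy-processing",   # distinct group id => independent copy of the stream
    auto_offset_reset="earliest",
)

for message in consumer:
    record = message.value         # raw bytes; deserialize as appropriate
    # placeholder for the minutes-long calculation on the same data that the
    # light-processing consumer (in its own group) also receives
    print(len(record))
```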
hope it helped.

Related

Document AI - Improving batch process time for a single document?

I'm working on a GCP Document AI project. First, let me say this - the OCR works fine :-). I'm curious to know about possibilities for improvement, if any.
What happens now
I have a Python module that runs OCR on a TIFF file uploaded via a portal or collected by an agent in the system. The module is written to avoid local use of the original file content, since the file is readily available in a cloud bucket. But the price I have to pay is using the batch_process_documents() API instead of process_document().
An observation
This is an obvious one: a document submitted via the inline API gets OCR back in less than 5 seconds most of the time. But the batch call (with a single document :-|) takes more than 45 seconds almost every time, and sometimes it goes beyond a minute.
I'm searching for a solution to reduce the OCR call time. As far as I'm aware, the inline API does not support GCS URIs, so either I need to download the content and send it back via the inline API to do the OCR, or I need to live with the performance reduction.
Has anyone handled a similar case? Are there any ways to tackle this without using the batch API or downloading the content? Any help is appreciated.
As I understand it, your concern is the latency difference between the process and batchProcess method calls of the Document AI API for a single document: roughly 5 seconds versus 45 seconds respectively.
The process_document() method has limits on the number of pages and the file size that can be sent, and it only allows one document file per API call.
The batch_process_documents() method allows asynchronous processing of larger files and batch processing of multiple files.
Single requests are oriented toward smaller amounts of data that usually take very little time to process, but they may perform poorly on large amounts of data. Batch requests, on the other hand, are oriented toward larger amounts of data: they perform better than single requests there, but may perform worse when processing a small amount of data.
Regarding your concerns about the latency of these two method calls: according to the documentation, for single-request or synchronous ("online") operations (i.e. an immediate response), the document data is processed in memory and not persisted to disk. In asynchronous offline batch operations, the documents are processed on disk, because the files can be significantly bigger and may not fit in memory. That is why the asynchronous operations take around 10x as long as the synchronous ones.
Each of these method calls has a particular use case, and the choice between them comes down to the trade-off that works better for you. If response time is critical and you want the result as soon as possible, you could split the files to fit the size limits and make the requests as synchronous operations, keeping in mind the quotas and limitations of the API.
This has been raised in the issue tracker. We cannot provide an ETA at this moment, but you can follow the progress in the issue tracker, and you can 'STAR' the issue to receive automatic updates and give it traction by referring to this link.
Since this was originally posted, the Document AI API added a feature to specify a field_mask in the processing request, which limits the fields returned in the Document object output. This can reduce the latency in some requests since the response will be a smaller size.
https://cloud.google.com/document-ai/docs/send-request#online-processor
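For illustration, a minimal sketch of an online request with a field_mask using the google-cloud-documentai Python client; the processor name and file path are placeholders, and the exact client surface may differ between versions:

```python
from google.cloud import documentai

# Placeholders: substitute your own project, location, processor and file.
name = "projects/PROJECT_ID/locations/us/processors/PROCESSOR_ID"

with open("document.tiff", "rb") as fh:
    raw_document = documentai.RawDocument(content=fh.read(), mime_type="image/tiff")

client = documentai.DocumentProcessorServiceClient()
request = documentai.ProcessRequest(
    name=name,
    raw_document=raw_document,
    field_mask="text,pages.pageNumber",  # return only these fields to shrink the response
)
result = client.process_document(request=request)
print(result.document.text[:200])
```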

Python parallelism preserving data

I need to repeatedly calculate very large Python arrays based on a small input and a very large constant bulk of data stored on the drive. I can successfully parallelize it by splitting that bulk of input data and joining the responses. Here comes the problem: sending the identical bulk of data to the pool is too slow, and it doubles the required memory. Ideally I would read the data from the file inside the worker and keep it there for repeated re-use.
How do I do that? I can only think of creating multiple servers that listen for requests from the pool. Somehow that looks like an unnatural solution to quite a common problem. Am I missing a better solution?
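One common workaround, sketched below under the assumption that the constant bulk can be saved as a .npy file: load it once per worker via a Pool initializer, so it is read from disk only when each worker starts and then reused for every task.

```python
import multiprocessing as mp
import numpy as np

_BULK = None  # per-worker global holding the constant data


def _init_worker(path):
    global _BULK
    _BULK = np.load(path)              # loaded once per worker, then reused


def compute(small_input):
    # uses the already-loaded constant data plus the small per-task input
    return float((_BULK * small_input).sum())


if __name__ == "__main__":
    np.save("bulk.npy", np.random.random(1_000_000))   # stand-in for the real bulk data
    with mp.Pool(processes=4, initializer=_init_worker,
                 initargs=("bulk.npy",)) as pool:
        results = pool.map(compute, [0.1, 0.2, 0.3])
    print(results)
```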
best regards,
Vladimir

Software Paradigm for Pushing Data Through a System

tl;dr: I wanted your feedback on whether the correct software design pattern to use would be the Push/Pull Pipeline pattern.
Details:
Let's say I have several software algorithms/blocks which process data coming into a software system:
[Download Data] --> [Pre Process Data] --> [ML Classification] --> [Post Results]
The download data block simply loiters until midnight when new data is available and then downloads new data. The pre-process data simply loiters until newly available downloaded data is present, and then preprocesses the data. The Machine Learning (ML) Classification block simply loiters until new data is available to classify, etc.
The entire system seems to be event driven and I think fits the push/pull paradigm perfectly?
The [Download Data] block would be a producer? The consumers would be all the subsequent blocks with the exception of the [Plot Results] which would be a results collector?
Producer = push
Consumer = pull then push
result collector = pull
I'm working within a python framework. This implementation looked ideal:
https://learning-0mq-with-pyzmq.readthedocs.io/en/latest/pyzmq/patterns/pushpull.html
https://github.com/ashishrv/pyzmqnotes
Push/Pull Pipeline Pattern
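For reference, a minimal sketch of how those roles usually map onto pyzmq sockets; the ports are arbitrary, and each role would normally live in its own process:

```python
import zmq

ctx = zmq.Context()

# Producer ([Download Data]): PUSH work items downstream.
producer = ctx.socket(zmq.PUSH)
producer.bind("tcp://*:5557")

# Worker ([Pre Process Data] / [ML Classification]): PULL from upstream, PUSH downstream.
upstream = ctx.socket(zmq.PULL)
upstream.connect("tcp://localhost:5557")
downstream = ctx.socket(zmq.PUSH)
downstream.connect("tcp://localhost:5558")

# Results collector ([Post Results]): PULL everything at the end of the pipeline.
collector = ctx.socket(zmq.PULL)
collector.bind("tcp://*:5558")
```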
I'm totally open to using another software paradigm other than push/pull if I've missed the mark here. I'm also open to using another repo as well.
Thanks in advance for your help with the above!
I've done similar pipelines many, many times and very much like breaking them into blocks like that. Why? Mainly for automatic recovery from errors. If something gets delayed, it will auto-recover the next hour. If something needs to be fixed mid-pipeline, fix it and name it so it gets picked up the next cycle. (That, and the fact that smaller blocks are easier to design, build, and test.)
For example, your [Download Data] should run every hour to look for waiting data: if none, go back to sleep; if some, download it to a file with a name containing a timestamp and state: 2020-0103T2153.downloaded.json. [Pre Process Data] should run every hour to look for files named *.downloaded.json: if none, go back to sleep; if one or more, pre-process each in increasing timestamp order with output to <same-timestamp>.pre-processed.json. Etc, etc for each step.
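A minimal sketch of that [Pre Process Data] step, run hourly from cron; the file-name convention follows the description above, and the preprocess() body is a placeholder:

```python
import glob
import json
import os


def preprocess(record):
    return record  # placeholder for the real pre-processing


for path in sorted(glob.glob("*.downloaded.json")):        # increasing timestamp order
    out_path = path.replace(".downloaded.json", ".pre-processed.json")
    if os.path.exists(out_path):
        continue                                           # already handled on a previous run
    with open(path) as fh:
        data = json.load(fh)
    with open(out_path, "w") as fh:
        json.dump(preprocess(data), fh)
```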
Doing it this way meant many unplanned events auto-recovered and nobody would know unless they looked in the log files (you should log each step so you know what happened). Easy to sleep at night :)
In these scenarios, the event driving this is just time-of-day via crontab. When "awoken", each step in the pipeline just looks to see if it has any work waiting for it. Trying to make a file-creation event initiate things was not simple, especially if you need to re-initiate things (you would need to re-create the file).
I wouldn't use a message queue, as that's more complicated and better suited to handling incoming messages as they arrive. Your case is simple batch file processing, so keep it simple and sleep at night.

Python Strategy for Large Scale Analysis (on-the-fly or deferred)

To analyze a large number of websites or financial data and pull out parametric data, what are the optimal strategies?
I'm classifying the following strategies as either "on-the-fly" or "deferred". Which is best?
On-the-fly: Process data on-the-fly and store parametric data into a database
Deferred: Store all the source data as ASCII into a file system and post process later, or with a processing-data-daemon
Deferred: Store all pages as a BLOB in a database to post-process later, or with a processing-data-daemon
Number 1 is simplest, especially if you only have a single server. Can #2 or #3 be more efficient with a single server, or do you only see the power with multiple servers?
Are there any python projects that are already geared toward this kind of analysis?
Edit: by best, I mean fastest execution (to keep the user from waiting), with ease of programming as secondary
I'd use Celery, either on a single machine or on multiple machines, with the "on-the-fly" strategy. You can have an aggregation task that fetches the data and a process task that analyzes it and stores it in a db. This is a highly scalable approach, and you can tune it according to your computing power.
The "on-the-fly" strategy is more efficient in the sense that you process your data in a single pass. The other two involve an extra step: re-retrieving the data from wherever you saved it and processing it after that.
Of course, everything depends on the nature of your data and the way you process it. If the processing phase is slower than the aggregation, the "on-the-fly" strategy will hang and wait for the processing to complete. But again, you can configure Celery to run asynchronously and continue to aggregate while data remains unprocessed.
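A minimal sketch of that Celery setup; the broker URL is a placeholder and the task bodies are stand-ins for the real fetching and analysis:

```python
from celery import Celery, chain

app = Celery("analysis", broker="redis://localhost:6379/0")


@app.task
def aggregate(url):
    # stand-in "fetch": in practice, download the page or financial data here
    return {"url": url, "body": "raw page contents"}


@app.task
def process(raw):
    # stand-in "analysis": pull out parametric data and store it in the db
    params = {"url": raw["url"], "length": len(raw["body"])}
    print("would store:", params)
    return params


# Chain the two tasks so processing starts as soon as aggregation finishes:
# chain(aggregate.s("http://example.com"), process.s()).delay()
```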
First: "fastest execution to prevent user from waiting" means some kind of deferred processing. Once you decide to defer the processing -- so the user doesn't see it -- the choice between flat-file and database is essentially irrelevant with respect to end-user-wait time.
Second: databases are slow. Flat files are fast. Since you're going to use celery and avoid end-user-wait time, however, the distinction between flat file and database becomes irrelevant.
Store all the source data as ASCII into a file system and post process later, or with a processing-data-daemon
This is fastest. Celery to load flat files.

How to reduce latency of data sent through a REST api

I have an application which obtains data in JSON format from one of our other servers. The problem I am facing is that there is a significant delay when requesting this information. Since a lot of data is passed (approx. 1000 records per request, where each record is pretty huge), would compression help reduce the delay? If so, which compression scheme would you recommend?
I read in another thread that the pattern of the data also matters a lot for the type of compression that should be used. The pattern of the data is consistent and resembles the following:
:desc=>some_description
:url=>some_url
:content=>some_content
:score=>some_score
:more_attributes=>more_data
Can someone recommend a solution for how I could reduce this delay? The delay is approx. 6-8 seconds. I'm using Ruby on Rails to develop this application, and the server providing the data uses Python for the most part.
I would first look at how much of this 8s delay is related to:
Server-side processing (how long it took for the data to be generated)
There are a lot of techniques to improve this time, including:
DB indexes
caching
a faster to_json library
Some excellent resources are the NewRelic podcasts on Rails scalability http://railslab.newrelic.com/2009/02/09/episode-7-fragment-caching
Transmission delay (how long it took for the data to be sent from the server to the client)
If the keys are pretty much the same, you may implement the solution from "Compression algorithm for JSON encoded packets?"; you may also want to look at https://github.com/WebReflection/json.hpack/wiki/specs-details and http://www.nwhite.net/?p=242
In addition to this, you may also compress (gzip) it from your frontend server:
http://httpd.apache.org/docs/2.0/mod/mod_deflate.html
http://wiki.nginx.org/NginxHttpGzipModule
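To get a feel for what gzip buys here, a small self-contained Python sketch; the record contents are fabricated placeholders shaped like the pattern shown in the question:

```python
import gzip
import json

records = [
    {
        "desc": "some_description %d" % i,
        "url": "http://example.com/%d" % i,
        "content": "some_content " * 50,
        "score": i,
        "more_attributes": "more_data",
    }
    for i in range(1000)
]

payload = json.dumps(records).encode("utf-8")
compressed = gzip.compress(payload)
print(len(payload), "->", len(compressed))   # repeated keys and values shrink dramatically
```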
If the data structure is constant, you can also try to implement a binary service like Thrift, which is much, much faster and includes compression, but is also more difficult to maintain:
http://www.igvita.com/2007/11/30/ruby-web-services-with-facebooks-thrift/
If it suits your needs, maybe you can build some kind of versioning/cache system server-side and send only the records that were modified (but that is pretty heavy to implement).
gzip might significantly reduce the size of text data and optimize load speeds. It's also recommended by YSlow.
