How to reduce latency of data sent through a REST api - python

I have an application which obtains data in JSON format from one of our other servers. The problem I am facing is that there is a significant delay when requesting this information. Since a lot of data is passed (approximately 1000 records per request, where each record is fairly large), would compression help reduce the transfer time? If so, which compression scheme would you recommend?
I read on another thread that the pattern of the data also matters a lot for the type of compression that should be used. The pattern of my data is consistent and resembles the following:
:desc=>some_description
:url=>some_url
:content=>some_content
:score=>some_score
:more_attributes=>more_data
Can someone recommend a solution for how I could reduce this delay? The delay is approximately 6-8 seconds. I'm using Ruby on Rails to develop this application, and the server providing the data uses Python for the most part.

I would first look at how much of this 8s delay is related to:
Server-side processing (how long it takes for the data to be generated)
There are a lot of techniques to improve this time, including:
DB indexes
caching
a faster to_json library (or, on the Python side that produces the data, a faster JSON encoder; a sketch follows the link below)
Some excellent resources are the New Relic podcasts on Rails scalability: http://railslab.newrelic.com/2009/02/09/episode-7-fragment-caching
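On the serialization point: the question mentions that the server producing the data is Python, so one cheap experiment there is swapping the standard-library encoder for a faster one. A minimal sketch, assuming the records are plain dicts and that the orjson package is an acceptable dependency (both assumptions, not from the original post):

```python
import json
import time

import orjson  # pip install orjson; a drop-in, much faster JSON encoder

# Hypothetical records resembling the pattern in the question.
records = [
    {
        "desc": "some_description",
        "url": "some_url",
        "content": "some_content" * 50,
        "score": 0.87,
        "more_attributes": "more_data",
    }
    for _ in range(1000)
]

start = time.perf_counter()
stdlib_payload = json.dumps(records).encode("utf-8")
print("stdlib json:", time.perf_counter() - start, "s,", len(stdlib_payload), "bytes")

start = time.perf_counter()
orjson_payload = orjson.dumps(records)  # returns bytes directly
print("orjson:     ", time.perf_counter() - start, "s,", len(orjson_payload), "bytes")
```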
Transmission delay (how much time it takes for the data to be sent between the server and the client)
if the keys are pretty much the same in every record, you may implement the solution from "Compression algorithm for JSON encoded packets?"; you may also want to look at https://github.com/WebReflection/json.hpack/wiki/specs-details and http://www.nwhite.net/?p=242
in addition to this, you may also compress (gzip) the response from your frontend server (a sketch follows the links below):
http://httpd.apache.org/docs/2.0/mod/mod_deflate.html
http://wiki.nginx.org/NginxHttpGzipModule
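If you cannot turn gzip on at the web server, a rough sketch of what compressing the payload in the Python application itself could look like; the payload below is only a placeholder for the ~1000 records, and the compression level is an arbitrary choice:

```python
import gzip
import json

# Placeholder payload standing in for the ~1000 records per request.
records = [{"desc": "d", "url": "u", "content": "c" * 500, "score": 1} for _ in range(1000)]
raw = json.dumps(records).encode("utf-8")

# Level 6 is a common balance between CPU cost and compression ratio.
compressed = gzip.compress(raw, compresslevel=6)
print(f"raw: {len(raw)} bytes, gzipped: {len(compressed)} bytes")

# Over HTTP, the client would send `Accept-Encoding: gzip` and the server would
# reply with `Content-Encoding: gzip`; most HTTP clients decompress transparently.
assert gzip.decompress(compressed) == raw
```

Repetitive JSON like the pattern in the question usually compresses very well, since the keys repeat in every record.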
If the data structure is constant, you can also try to implement a binary service, which is much, much faster and includes compression, but is also more difficult to maintain, such as Thrift:
http://www.igvita.com/2007/11/30/ruby-web-services-with-facebooks-thrift/
If it suits your needs, you could also build some kind of versioning/cache system server-side and send only the records that were modified (but that is fairly heavy to implement).
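A rough sketch of that last idea, assuming the Python side can stamp each record with an updated_at value and the client remembers the last timestamp it saw; the Flask framework, the endpoint name, and the query parameter are all illustrative choices, not something from the original setup:

```python
from datetime import datetime, timezone

from flask import Flask, jsonify, request

app = Flask(__name__)

# In-memory stand-in for the real data store.
RECORDS = [
    {"id": i, "desc": "record %d" % i, "updated_at": datetime(2024, 1, 1, tzinfo=timezone.utc)}
    for i in range(1000)
]

@app.route("/records")
def records():
    # The client passes the last timestamp it has seen,
    # e.g. GET /records?since=2024-01-01T00:00:00+00:00
    since = request.args.get("since")
    if since:
        cutoff = datetime.fromisoformat(since)
        changed = [r for r in RECORDS if r["updated_at"] > cutoff]
    else:
        changed = RECORDS  # first call: send everything
    return jsonify([dict(r, updated_at=r["updated_at"].isoformat()) for r in changed])
```

The client keeps the newest updated_at it received and sends it back on the next poll, so only modified records cross the wire.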

gzip can significantly reduce the size of text data and improve load times. It is also recommended by YSlow.

Related

Document AI - Improving batch process time for a single document?

I'm working on a GCP Document AI project. First, let me say this - the OCR works fine :-). I'm curious to know about possibilities for improvement, if any.
What happens now
I have a Python module which gets the OCR done for a TIFF file that is uploaded via a portal or collected by an agent in the system. The module is written to avoid local use of the original file content, since the file is readily available in a cloud bucket. The price I have to pay for that is using the batch_process_documents() API instead of process_document().
An observation
This is an obvious one: a document submitted via the inline API gets its OCR back in less than 5 seconds most of the time. But the batch (with a single document :-|) takes more than 45 seconds almost every time, and sometimes it goes beyond a minute or more.
I'm searching for a way to reduce the OCR call time. As far as I'm aware, the inline API does not support GCS URIs, so either I need to download the content and send it through the inline API to do the OCR, or I need to live with the performance reduction.
Has anyone handled a similar case? Are there any ways to tackle this without using the batch API or downloading the content? Any help is appreciated.
Your concern is about the latency difference between the process and batchProcess method calls of the Document AI API for a single document, with response times of roughly 5 and 45 seconds respectively.
The process_document() method has limits on the number of pages and the file size that can be sent, and it only allows one document file per API call.
The batch_process_documents() method allows asynchronous processing of larger files and batch processing of multiple files.
Single requests are oriented toward smaller amounts of data that usually take very little time to process, but they may perform poorly with large amounts of data. Batch requests, on the other hand, are oriented toward handling larger amounts of data, where they outperform single requests, but they may perform worse when processing a small amount of data.
Regarding your concerns about the latency of these two method calls: according to the documentation, for single-request or synchronous ("online") operations (i.e. immediate response), the document data is processed in memory and not persisted to disk. For asynchronous offline batch operations, the documents are processed on disk, because the files may be significantly bigger and might not fit in memory. That's why the asynchronous operations take around 10x as long as the synchronous operations.
Each of these method calls has a particular use case, so the choice comes down to the trade-off that works better for you. If response time is critical and you want the response as soon as possible, you could split the files to fit the size limits and make the requests as synchronous operations, keeping in mind the quotas and limitations of the API.
This has been raised in the issue tracker. We cannot provide an ETA at this moment, but you can follow the progress there and 'STAR' the issue to receive automatic updates and give it traction by referring to this link.
Since this was originally posted, the Document AI API added a feature to specify a field_mask in the processing request, which limits the fields returned in the Document object output. This can reduce the latency in some requests since the response will be a smaller size.
https://cloud.google.com/document-ai/docs/send-request#online-processor
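For reference, a minimal sketch of the "download from GCS and call the online processor" route, including a field_mask to shrink the response; it assumes the google-cloud-storage and google-cloud-documentai client libraries, and the project, location, processor, bucket, and object names are placeholders:

```python
from google.api_core.client_options import ClientOptions
from google.cloud import documentai, storage
from google.protobuf import field_mask_pb2

PROJECT_ID = "my-project"            # placeholder
LOCATION = "us"                      # placeholder
PROCESSOR_ID = "my-processor-id"     # placeholder
BUCKET = "my-bucket"                 # placeholder
OBJECT_NAME = "scans/document.tiff"  # placeholder

# Pull the file out of the bucket so it can go through the online (synchronous) API.
content = storage.Client().bucket(BUCKET).blob(OBJECT_NAME).download_as_bytes()

client = documentai.DocumentProcessorServiceClient(
    client_options=ClientOptions(api_endpoint=f"{LOCATION}-documentai.googleapis.com")
)
name = client.processor_path(PROJECT_ID, LOCATION, PROCESSOR_ID)

request = documentai.ProcessRequest(
    name=name,
    raw_document=documentai.RawDocument(content=content, mime_type="image/tiff"),
    # Ask only for the fields you actually need, to keep the response small.
    field_mask=field_mask_pb2.FieldMask(paths=["text"]),
)
result = client.process_document(request=request)
print(result.document.text[:200])
```

Whether the download-then-online round trip beats the ~45 second batch call will depend on file size and network, so it is worth measuring both for your documents.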

Call Python tasks from Golang

I have been building a big data application for stock market analysis - about 5 TB of records per day. I use Golang for data transformation/calculation and for saving into Cassandra/MySQL. Python has very good libraries for data analysis (Pandas, Spark, etc.), but there is no easy way to do multicore processing there, and it takes a lot of time.
So I want to call Python data analysis tasks concurrently from Golang. One way is to execute a command-line task directly, but I think there should be a more scalable solution. Maybe there is a library for communication between Golang and Python. I also thought about creating multiple Python Flask servers and handing tasks to them. Speed is important, but I can sacrifice some of it for a concise solution. Any ideas?
Splitting your app into multiple servers, as you've suggested, carries some trade-offs.
On the plus side, splitting it up gives you more flexibility in terms of load balancing. In other words, if your Flask servers are overburdened, you can always spin up a few more and scale horizontally behind a load balancer. Of course, this assumes that whatever you're doing on those Flask servers can be done in parallel (which depends on your actual business logic).
It also offers high availability: you eliminate one potential single point of failure.
However, this 'microservice' approach does incur some overhead:
more code to write, since you're now writing two kinds of servers
some network overhead, since you're now communicating over the network instead of making function calls
more machines to spin up (although you could run everything in containers, possibly all on the same machine, if you don't need the extra processing power)
You could consider using Google Protocol Buffers (protobuf) to serialize/deserialize the messages. It's language-agnostic and saves some of the network overhead. It's not as easy as sending JSON, but if efficiency is paramount it might be worth the trouble. Plus, it's supported in both Python and Go.
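To make that concrete, here is a minimal sketch of what the Python side of such a Flask-plus-protobuf service could look like. The AnalysisRequest/AnalysisResult message names and their fields are hypothetical and would come from an analysis.proto file compiled with protoc for both Go and Python; only the Flask and protobuf calls themselves are real APIs:

```python
from flask import Flask, Response, request

# Hypothetical classes generated by `protoc --python_out=. analysis.proto`.
from analysis_pb2 import AnalysisRequest, AnalysisResult

app = Flask(__name__)

@app.route("/analyze", methods=["POST"])
def analyze():
    # The Go side POSTs the serialized AnalysisRequest bytes as the request body.
    req = AnalysisRequest()
    req.ParseFromString(request.get_data())

    # Placeholder "analysis": in reality this is where the Pandas/Spark work happens.
    result = AnalysisResult()
    result.mean = sum(req.values) / len(req.values) if req.values else 0.0

    return Response(result.SerializeToString(), mimetype="application/octet-stream")

if __name__ == "__main__":
    app.run(port=8000)
```

The Go client would marshal the same message types with the protobuf runtime and call the endpoint over plain HTTP, so several of these Flask workers can sit behind a load balancer.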

Real-time data collection and 'offline' processing

I have a continuous stream of data. I want to do a small amount of processing to the data in real-time (mostly just compression, rolling some data off the end, whatever needs doing) and then store the data. Presumably no problem. HDF5 file format should do great! OOC data, no problem. Pytables.
Now the trouble. Occasionally, as a completely separate process so that data is still being gathered, I would like to perform a time consuming calculation involving the data (order minutes). This involving reading the same file I'm writing.
How do people do this?
Of course, reading a file that you're currently writing is bound to be challenging, but it seems this must have come up often enough in the past that people have considered some sort of slick solution - or at least a natural workaround.
Partial solutions:
It seems that HDF5 1.10.0 has a SWMR (Single Writer, Multiple Reader) capability. This seems like exactly what I want. I can't find a Python wrapper for this recent version, or if it exists I can't get Python to talk to the right version of HDF5. Any tips here would be welcome. I'm using the Conda package manager. (A sketch of what SWMR looks like from Python follows this list.)
I could imagine writing to a buffer, which is occasionally flushed and added to the large database. How do I ensure that I'm not missing data going by while doing this?
This also seems like it might be computationally expensive, but perhaps there's no getting around that.
Collect less data. What's the fun in that?
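On the SWMR option: a minimal sketch of the writer and reader sides, assuming an h5py build linked against HDF5 1.10 or newer (this uses h5py directly rather than PyTables, and the file name and dataset layout are placeholders):

```python
import numpy as np
import h5py  # needs an h5py build linked against HDF5 >= 1.10 for SWMR support


def writer(path="stream.h5", n_chunks=100):
    """Continuously appends chunks; readers may attach while this runs."""
    with h5py.File(path, "w", libver="latest") as f:
        dset = f.create_dataset("samples", shape=(0,), maxshape=(None,),
                                dtype="f8", chunks=(1024,))
        f.swmr_mode = True  # from this point on, concurrent readers are allowed
        for _ in range(n_chunks):
            chunk = np.random.random(1024)
            n = dset.shape[0]
            dset.resize((n + len(chunk),))
            dset[n:] = chunk
            dset.flush()  # make the new rows visible to SWMR readers


def reader(path="stream.h5"):
    """Run this from a separate process while writer() is going."""
    with h5py.File(path, "r", libver="latest", swmr=True) as f:
        dset = f["samples"]
        dset.id.refresh()  # pick up whatever the writer has flushed so far
        print(dset.shape, dset[-5:])


if __name__ == "__main__":
    writer()
```

The writer never needs to close or reopen the file, and the reader sees everything up to the writer's most recent flush().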
I suggest you take a look at adding Apache Kafka to your pipeline, it can act as a data buffer and help you separate different tasks done on the data you collect.
pipeline example:
raw data ===> kafka topic (raw_data) ===> small processing ===> kafka topic (light_processing) ===> a process reads from the light_processing topic and writes to a db or file
At the same time you can read with another process the same data from light_processing topic or any other topic and do your heavy processing and so on.
if the light processing and the heavy processing consumers use different groupIds, each one gets its own copy of the stream; consumers that share a groupId instead split the topic's partitions between them rather than each seeing every message
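A minimal sketch of that fan-out with the kafka-python client; the topic names match the pipeline above, while the broker address and group ids are placeholders:

```python
import json

from kafka import KafkaConsumer, KafkaProducer

BROKER = "localhost:9092"  # placeholder

# Light processing: read raw data, do the cheap transformation, republish.
consumer = KafkaConsumer(
    "raw_data",
    bootstrap_servers=BROKER,
    group_id="light-processing",  # its own group, so it sees the full stream
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

for msg in consumer:
    record = msg.value
    record["compressed"] = True  # placeholder for the real light processing
    producer.send("light_processing", record)

# The heavy-processing job would be a separate consumer subscribed to
# "light_processing" under a *different* group_id, so it gets its own copy
# of the stream and can take minutes per batch without blocking the writer.
```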
hope it helped.

GAE: Aggregate work results from tasks? (for a GAE query performance issue)

Which of these options is best for fanning work out on GAE (to be completed within a request timeframe)?
Use of tasks, store the results in memcache, periodically query memcache in the request and hope the tasks complete in time
Use of urlfetch to get results of tasks, error handling and security will be a pain though.
Use of backend instances? (seems insane)
Or a Java instance (seems totally insane)
Background:
It's ridiculous to even have to do this. I need to deliver 10k datastore items as JSON. Apparently the issue is that Python takes a lot of time to process the datastore results (Java seems much faster). This is well covered in:
25796142, 11509368 and
21941954
Approach:
As there is nothing to optimize on the software side (I can't rewrite GAE), the approach would be to spread the work out over multiple instances and aggregate the results.
Querying keys only and getting query cursors for chunks of 2k items performs reasonably well, so tasks could be spun off to fetch the results in 2k chunks. The question is how best to aggregate the results.
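For context, a minimal sketch of the fan-out part, assuming the (Python 2) ndb and taskqueue App Engine APIs, a hypothetical Item model, and a hypothetical /aggregate_chunk worker URL:

```python
from google.appengine.api import taskqueue
from google.appengine.ext import ndb


class Item(ndb.Model):  # hypothetical model standing in for the real kind
    score = ndb.IntegerProperty()


def fan_out(chunk_size=2000):
    """Walk the query keys-only and enqueue one worker task per 2k chunk."""
    start = None
    more = True
    while more:
        # Keys-only pages are cheap; each worker re-fetches its own entities.
        keys, next_cursor, more = Item.query().fetch_page(
            chunk_size, start_cursor=start, keys_only=True)
        taskqueue.add(
            url="/aggregate_chunk",  # hypothetical worker endpoint
            params={"cursor": start.urlsafe() if start else "",
                    "limit": chunk_size})
        start = next_cursor


# Each worker would rebuild its cursor with ndb.Cursor(urlsafe=...), fetch its
# chunk, serialize it, and write the partial result to memcache or the datastore
# under a request id, where the aggregator picks the pieces up.
```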
It is not "ridiculous" to have to do this: it is an accepted consequence of the scalability offered by GAE. If you don't like the tradeoffs made to enable that scalability, you should choose another platform.
It's also unclear why you think using backend instances is "insane". Using Java would indeed be strange, but only because there's no reason to think it would perform any better.
However, there is a perfectly good way to do this which does not involve any of the hacks you mention, and that is to use the mapreduce framework, which is expressly made for collecting large quantities of data.

Writing a Python data analysis server for a Java interface

I want to write data analysis plugins for a Java interface. This interface potentially runs on different computers. The interface sends commands, and the Python program can return large amounts of data. The interface is distributed via a Java Web Start system. Both access the main data from a MySQL server.
What are the different ways to implement the communication, and what are their advantages? Of course I've done some research on the internet, and while there are many suggestions, I still don't know how they differ or how to decide between them (I have no prior knowledge of any of them).
I've found a suggestion to use sockets, which seems fine. Is it simple to write a server that dedicates a Python analysis process for each connection (temporary data might be kept after one communication request for that particular client)?
I was thinking of learning how to use sockets and passing YAML strings.
Maybe my main question is: What is the relation to and advantage of systems like RabbitMQ, ZeroMQ, CORBA, SOAP, XMLRPC?
There were also suggestions to use pipes or shared memory, but those wouldn't fit my requirements, would they?
Do any of the methods have particular advantages for debugging, or other peculiarities?
I hope someone can help me understand the technology and help me decide on a solution, as it is hard to judge from technical descriptions.
(I do not consider solutions like Jython, JEPP, ...)
Offering an opinion on the requirements you described: it sounds like you are dealing with potentially large data/queries that may take a lot of time to fetch and serialize, in which case you definitely want something that can handle concurrent connections without stacking up threads. In the Python domain, I can't recommend any networking library other than Twisted.
http://twistedmatrix.com/documents/current/core/examples/
Whether you decide to use vanilla HTTP or your own protocol, Twisted is pretty much the one-stop shop for concurrent networking. Sure, the name gets thrown around a lot, and the documentation is Atlantean, but if you take the time to learn it there is very little in the networking domain you cannot accomplish. You can extend the base protocols and factories to make one server that handles your data in a reactor-based event loop and responds to deferred requests when ready.
The serialization format really depends on the nature of the data. Will there be any binary in what is output as a response? Complex types? That rules out JSON if so, though that is becoming the most common serialization format. YAML sometimes seems to enjoy a position of privilege among the python community - I haven't used it extensively as most of the kind of work I've done with serials was data to be rendered in a frontend with javascript.
Message queues are really the most important tool in the toolbox when you need to defer background tasks without hanging the response. They are commonly employed in web apps where the HTTP request should not hang until some complex processing completes, so the UI can render early and count on an implicit "promise" that the processing will take place. They have two important traits: they rely on eventual consistency, in that the process can finish long after the response in the protocol is sent, and they have fail-safe and try-again directives should a task fail. They are where you turn in the "do this really hard task as soon as you can, and I trust you to get it done" problem domain.
If we are not talking about potentially huge response bodies, and the serialized output contains relatively simple data types, there is nothing wrong with rolling a simple deferred HTTP server in Twisted.
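To illustrate that last point, a minimal sketch of a deferred HTTP resource in Twisted that pushes the heavy analysis into a thread and writes the response when it is ready; the /analyze path and the placeholder analysis function are made up for the example:

```python
import json

from twisted.internet import reactor, threads
from twisted.web.resource import Resource
from twisted.web.server import Site, NOT_DONE_YET


def heavy_analysis(query):
    # Placeholder for the real MySQL-backed analysis work (order of minutes).
    return {"query": query, "result": sum(range(1000000))}


class Analyze(Resource):
    isLeaf = True

    def render_GET(self, request):
        query = request.args.get(b"q", [b""])[0].decode("utf-8")
        d = threads.deferToThread(heavy_analysis, query)  # keep the reactor free

        def done(result):
            request.setHeader(b"content-type", b"application/json")
            request.write(json.dumps(result).encode("utf-8"))
            request.finish()

        d.addCallback(done)
        return NOT_DONE_YET  # the response is written when the deferred fires


reactor.listenTCP(8080, Site(Analyze()))
reactor.run()
```

The Java interface would simply issue an HTTP request (here GET /analyze?q=...) and block or poll on its side; the serialization could be swapped from JSON to YAML or a binary format without changing the server structure.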
