I have a ternary search tree written in Python loaded with around 200K valid English words. I am using it for dictionary look-up, as I am writing a Boggle-like app which accesses the tree very frequently to judge whether a sequence of letters is a valid word.
Right now my app is just a script that you call from the CLI, but I'm re-architecting it as a client-server model. Since it takes quite some time to load all the words into the tree, I don't want to do that for every request made to the server. Ideally, the tree should persist as an in-memory object in the server, receiving requests and sending responses.
I have tried Pyro4, but the network I/O overhead grows as the frequency of access increases, so it isn't a viable option. I'd like to implement a lower-level solution with less I/O overhead. I have read about shared memory and server processes (https://docs.python.org/2/library/multiprocessing.html#sharing-state-between-processes), but I'm not sure how to adapt them to non-primitive Python objects such as my ternary search tree. This is my first time doing this sort of thing, so I'd appreciate any guidance.
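A minimal sketch of the "server process" approach from the linked multiprocessing docs, assuming a hypothetical TernarySearchTree class with load_words() and is_word() methods (adjust the names to your implementation):

```python
# Hedged sketch: exposing one pre-loaded tree to other processes via a manager.
# TernarySearchTree, load_words() and is_word() are placeholders for your code.
from multiprocessing.managers import BaseManager

from my_tst import TernarySearchTree  # hypothetical module holding your tree


class TreeManager(BaseManager):
    pass


def serve():
    tree = TernarySearchTree()
    tree.load_words('words.txt')      # load the 200K words exactly once

    # Every client that calls get_tree() receives a proxy to this one instance.
    TreeManager.register('get_tree', callable=lambda: tree)
    manager = TreeManager(address=('localhost', 50000), authkey=b'boggle')
    manager.get_server().serve_forever()


def lookup(word):
    TreeManager.register('get_tree')  # client side: register the name only
    manager = TreeManager(address=('localhost', 50000), authkey=b'boggle')
    manager.connect()
    tree = manager.get_tree()         # proxy object; method calls go over IPC
    return tree.is_word(word)


if __name__ == '__main__':
    serve()
```

Note that each is_word() call still crosses a process boundary, so if per-call overhead was the problem with Pyro4, batching lookups (e.g. validating a whole board's candidate strings in a single call) will likely help more than switching transports.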
I'm working on a GCP Document AI project. First, let me say this: the OCR works fine :-). I'm curious to know about possibilities for improvement, if any.
What happens now
I have a Python module which gets the OCR done for a TIFF file that is uploaded via a portal or collected by an agent in the system. The module is written to avoid using the original file content locally, since the file is readily available in a cloud bucket. But the price I have to pay is using the batch_process_documents() API instead of process_document().
An observation
This is an obvious one: a document submitted via the inline API gets OCR back in less than 5 seconds most of the time, but the batch call (with a single document :-|) takes more than 45 seconds almost every time, and sometimes more than a minute.
I'm searching for a way to reduce the OCR call time. As far as I'm aware, the inline API does not support GCS URIs, so either I need to download the content, upload it back via the inline API and then do the OCR, or I need to live with the performance reduction.
Has anyone handled a similar case? Are there ways to tackle this without using the batch API or downloading the content? Any help is appreciated.
As I understand it, your concern is the latency difference between the process and batchProcess method calls of the Document AI API when submitting a single document, with response times of roughly 5 and 45 seconds respectively.
The process_document() method has limits on the number of pages and file size that can be sent, and it only allows one document file per API call.
The batch_process_documents() method allows asynchronous processing of larger files and batch processing of multiple files.
Single requests are oriented toward smaller amounts of data that usually take very little time to process, but they may perform poorly with large amounts of data. Batch requests, on the other hand, are oriented toward larger amounts of data and perform better there, but they may perform worse when processing a small amount of data.
Regarding your concerns about the latency of these two method calls: looking into the documentation, I found that for single-request, synchronous ("online") operations (i.e. immediate response) the document data is processed in memory and not persisted to disk. In asynchronous, offline batch operations the documents are processed on disk, since the files could be significantly bigger and might not fit in memory. That's why the asynchronous operations take around 10x as long as the synchronous operations.
Each of these method calls has a particular use case; which one to choose depends on the trade-off that works better for you. If response time is critical and you want the response as soon as possible, you could split the files to fit the size limits and send them as synchronous operations, keeping in mind the quotas and limitations of the API.
This issue has been raised in this issue tracker. We cannot provide an ETA at this moment, but you can follow the progress in the issue tracker and 'STAR' the issue to receive automatic updates and give it traction by referring to this Link.
Since this was originally posted, the Document AI API added a feature to specify a field_mask in the processing request, which limits the fields returned in the Document object output. This can reduce the latency of some requests since the response will be smaller.
https://cloud.google.com/document-ai/docs/send-request#online-processor
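For reference, a minimal sketch of an online process_document() call with a field_mask, following the linked page; the project, location, processor ID, and file path are placeholders:

```python
# Hedged sketch of an online (synchronous) Document AI request with a field mask.
# project_id, location, processor_id and file_path are placeholders.
from google.api_core.client_options import ClientOptions
from google.cloud import documentai


def ocr_online(project_id, location, processor_id, file_path):
    opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")
    client = documentai.DocumentProcessorServiceClient(client_options=opts)
    name = client.processor_path(project_id, location, processor_id)

    with open(file_path, "rb") as f:
        raw_document = documentai.RawDocument(content=f.read(), mime_type="image/tiff")

    # Only ask for the plain text and page numbers, which shrinks the response.
    request = documentai.ProcessRequest(
        name=name,
        raw_document=raw_document,
        field_mask="text,pages.pageNumber",
    )
    result = client.process_document(request=request)
    return result.document.text
```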
In my company, we have an ingestion service written in Go whose job is to take messages from an HTTP endpoint and store them in Postgres. It receives a peak throughput of 50,000 messages/second, but our database can handle a maximum of 30,000 messages/second.
Is it possible to write a middleware in Python to optimize this? If so please explain.
It seems to be pretty unrelated to Python or any particular programming language.
These are typical questions to be asked and answers to be given:
Are there duplicates? If yes, don't save every message immediately; wait to collapse duplicates instead (this requires some kind of in-RAM cache, the simplest being a <thread-safe?> hashtable).
Batch your messages into large enough packs and then dump them into PostgreSQL all at once. You have to determine what is "large enough" based on load tests (see the sketch after this list).
Can you drop some of those messages? If your data is not of critical importance, or at least not all of it, then you may detect overload by tracking the number of pending messages and start throwing incoming messages away until the load becomes acceptable.
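A rough sketch of the batching idea in Python (the table, column, and connection string are placeholders); the HTTP handlers only enqueue, and a background writer flushes batches:

```python
# Hedged sketch: buffer incoming messages in a queue and flush them to
# PostgreSQL in batches. Table/column names and the DSN are placeholders.
import queue
import threading

import psycopg2
from psycopg2.extras import execute_values

BATCH_SIZE = 5000          # tune via load tests
FLUSH_INTERVAL = 0.5       # seconds

messages = queue.Queue()   # HTTP handlers put() raw messages here


def writer():
    conn = psycopg2.connect("dbname=ingest user=ingest")  # placeholder DSN
    while True:
        batch = []
        try:
            batch.append(messages.get(timeout=FLUSH_INTERVAL))
            while len(batch) < BATCH_SIZE:
                batch.append(messages.get_nowait())
        except queue.Empty:
            pass
        if batch:
            with conn, conn.cursor() as cur:
                # One round trip for the whole batch instead of one per message.
                execute_values(cur,
                               "INSERT INTO messages (payload) VALUES %s",
                               [(m,) for m in batch])


threading.Thread(target=writer, daemon=True).start()
```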
I have been building a big data application for stock market analysis: about 5TB of records per day. I use Golang for data transformation/calculation and for saving to Cassandra/MySQL. Python has very good libraries for data analysis (Pandas, Spark, etc.), but there is no easy way to do multicore processing there and it takes a lot of time.
So, I want to call Python data analysis tasks concurrently from Golang. One way is to execute command-line tasks directly, but I think there should be a more scalable solution. Maybe there is a library for communication between Golang and Python. I thought maybe I should create multiple Python Flask servers and hand tasks to them. Speed is important, but I can sacrifice some of it for a concise solution. Any ideas?
Splitting your app into multiple servers, as you've suggested, carries some trade-offs.
On the plus side, splitting it up gives you more flexibility in terms of load balancing. In other words, if your Flask servers are overburdened, you can always spin up a few more and scale horizontally behind a load balancer. Of course, this assumes that whatever you're doing on those Flask servers can be done in parallel (it depends on your actual business logic).
It also offers high-availability: you eliminate one potential single-point-of-failure.
However, this 'microservice' approach does incur some overheads:
more code to write, since now you're writing two kinds of servers
some network overhead, since now you're communicating over the network as opposed to making function calls
more machines to spin up (although you could run everything in containers, and they could all be on the same machine if you don't need the extra processing power)
You could consider using Google's protobuf (Protocol Buffers) to serialize/deserialize the messages. It's language-agnostic and saves some of the network overhead. It's not as easy as sending JSON, but if efficiency is paramount, it might be worth the trouble. Plus, it's supported in both Python and Go.
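As a rough sketch of what the Flask half might look like (the /analyze route, payload shape, and the pandas computation are illustrative placeholders), the Go service would POST a task and read back the result:

```python
# Hedged sketch of a Flask worker that the Go service could call over HTTP.
# The /analyze route and the payload shape are illustrative, not a fixed API.
from flask import Flask, jsonify, request
import pandas as pd

app = Flask(__name__)


@app.route("/analyze", methods=["POST"])
def analyze():
    records = request.get_json()                   # e.g. a list of tick records
    df = pd.DataFrame(records)
    result = df.groupby("symbol")["price"].mean()  # placeholder computation
    return jsonify(result.to_dict())


if __name__ == "__main__":
    # Run several of these behind a load balancer, or under gunicorn with
    # multiple workers, to get concurrency across cores.
    app.run(port=5000)
```

Swapping JSON for protobuf, as suggested above, only changes the serialization layer; the architecture stays the same.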
I have a large XML file which is opened, loaded into memory and then closed by a Python class. A simplified example would look like this:
class Dictionary():
    def __init__(self, filename):
        with open(filename) as f:
            self.contents = f.readlines()

    def getDefinitionForWord(self, word):
        # returns a definition for the given word, using an etree parser
        pass
And in my Flask application:
from dictionary import Dictionary

dictionary = Dictionary('dictionary.xml')
print 'dictionary object created'

@app.route('/')
def home():
    word = dictionary.getDefinitionForWord('help')
I understand that in an ideal world, I would use a database instead of XML and make a new connection to this database on every request.
I understood from the docs that the Application Context in Flask means each request causes dictionary = Dictionary('dictionary.xml') to be re-run, therefore opening a file on disk and re-reading the whole thing into memory. However, when I look at the debug output, I see the 'dictionary object created' line printed exactly once, despite connecting from multiple sources (different sessions?).
My first question is:
As it seems that my application only loads the XML file once, can I assume that it resides in memory globally and can be safely read by a large number of simultaneous requests, limited only by the RAM on my server? If the XML is 50MB, then it would take approximately 50MB of memory and be served up to simultaneous requests at high speed... I'm guessing it's not that easy.
And my second question is:
If this is not the case, what sort of limits am I going to hit on my ability to handle large amounts of traffic? How many requests can I handle if I have a 50MB XML being repeatedly opened, read from disk, and closed? I presume one at a time.
I realise this is vague and dependent on hardware, but I'm new to Flask, Python, and programming for the web, and am just looking for guidance.
Thanks!
It is safe to keep it that way as long as the global object is not modified. That is a WSGI feature, as explained in the Werkzeug docs (the library Flask is built on top of).
That data is going to be kept in the memory of each worker process of the WSGI app server. That doesn't mean it's loaded exactly once, but the number of processes (workers) is small and constant (it does not depend on the number of sessions or the amount of traffic).
So, it is possible to keep it that way.
That said, in your place I would use a proper database. If you have 16 workers, your data will take at least 800 MB of RAM (the number of workers is usually twice the number of processors). If the XML grows and you finally decide to use a database service, you will need to rewrite your code.
If the reason to keep it in memory is that PostgreSQL and MySQL are too slow, you could use SQLite kept in an in-memory filesystem like ramfs or tmpfs. That gives you the speed and the SQL interface, and you will probably save RAM too. Migration to PostgreSQL or MySQL would also be much easier (in terms of code).
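A rough sketch of that idea, assuming a Linux tmpfs mount at /dev/shm (the path and table layout are placeholders):

```python
# Hedged sketch: a SQLite file kept on a tmpfs mount (/dev/shm on many Linux
# systems), shared by all WSGI workers. Path and table names are placeholders.
import sqlite3

DB_PATH = "/dev/shm/dictionary.db"   # lives in RAM, visible to every worker


def build_db(definitions):
    # definitions: an iterable of (word, definition_text) pairs
    conn = sqlite3.connect(DB_PATH)
    conn.execute("CREATE TABLE IF NOT EXISTS definitions (word TEXT PRIMARY KEY, body TEXT)")
    conn.executemany("INSERT OR REPLACE INTO definitions VALUES (?, ?)", definitions)
    conn.commit()
    conn.close()


def get_definition(word):
    conn = sqlite3.connect(DB_PATH)
    row = conn.execute("SELECT body FROM definitions WHERE word = ?", (word,)).fetchone()
    conn.close()
    return row[0] if row else None
```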
I've done a couple of years of large-scale game server development in PHP. A load balancer delegates incoming requests to one server in a cluster. In the name of better performance, we began caching all static data (essentially the game world's model objects) on each of the instances in that cluster, directly in Apache shared memory, using apc_store and apc_fetch.
For a number of reasons, we're now beginning to develop a similar game framework in Python, using the Flask microframework. At first glance, this per-instance memory store is the one piece that doesn't appear to translate directly to Python/Flask. We're presently considering running Memcached locally on each instance (to avoid streaming fairly large model objects over the wire from our main Memcached cluster).
What can we use instead?
I would think that even in this case you might want to consider a centralized key/value store rather than a series of independent ones on each server. Unless your load balancer always routes the same users to the same servers, a user's requests could be routed to a different server each time, so each node would have to retrieve the game state instead of reading it from a shared cache.
Also, the memory strain that a local key/value store on each system might incur could slow down your game server's other functions, though that largely depends on the amount of data being cached.
In general, the best approach would be to run some benchmarks to see what kind of performance you'd get with a memcached cluster for the types of objects you're storing, versus local storage.
Depending on what other features you want from your key/value store, you might also want to look into alternatives like MongoDB (http://www.mongodb.org/).
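For illustration, a small sketch of reading static objects through a locally running memcached from Python (pymemcache is just one of several clients; load_from_db() is a placeholder for your shared store):

```python
# Hedged sketch: cache static model data in a memcached instance running on the
# same box. Key names and the load_from_db() helper are placeholders.
import json
from pymemcache.client.base import Client

cache = Client(("127.0.0.1", 11211))


def get_world_object(object_id):
    key = "world:%s" % object_id
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    obj = load_from_db(object_id)            # placeholder: fetch from shared store
    cache.set(key, json.dumps(obj), expire=3600)
    return obj
```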
[Five months later]
Our game framework is done.
In the end, we decided to store the static data as fully initialized SQLAlchemy model instances in each web server. When a newly booted game server is warming up, these instances are first constructed by hitting a shared MySQL db.
Since our Model factories defer to an instance pool, the Model instances need only be constructed once per deployment per server; this is important, because at our scale, MySQL would weep under any sort of ongoing load. We accomplished our goal of not streaming this data over the wire by keeping the item definitions as close to our app code as possible: in the app code itself (see the sketch below).
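To make the shape of that concrete, here is a rough sketch of the warm-up and instance pool; ItemDefinition, the MySQL URL, and the module layout are placeholders, not our actual code:

```python
# Hedged sketch of the warm-up / instance-pool idea; ItemDefinition and the
# session setup are placeholders for our real static-data models.
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from models import ItemDefinition          # hypothetical static-data model

_instance_pool = {}                        # module-level: shared by all requests
                                           # handled by this worker process


def warm_up(mysql_url):
    """Hit the shared MySQL db once at boot and cache every static row."""
    engine = create_engine(mysql_url)
    session = sessionmaker(bind=engine)()
    for item in session.query(ItemDefinition).all():
        _instance_pool[item.id] = item
    session.close()


def get_item(item_id):
    # After warm-up, lookups never touch MySQL again for this process.
    return _instance_pool[item_id]
```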
I now realize that my original question was naive: unlike in the LAMP stack, the Flask server keeps running between requests, so the server's memory itself is "shared memory"; there's no need for something like APC to make it so. In fact, anything outside of the request-processing scope itself and Flask's thread-safe local store can be considered "shared memory".