Flask: Using a global variable to load data files into memory - python

I have a large XML file which is opened, loaded into memory and then closed by a Python class. A simplified example would look like this:
class Dictionary():
    def __init__(self, filename):
        f = open(filename)
        self.contents = f.readlines()
        f.close()

    def getDefinitionForWord(self, word):
        # returns a word, using the etree parser
        ...
And in my Flask application:
from flask import Flask
from dictionary import Dictionary

app = Flask(__name__)
dictionary = Dictionary('dictionary.xml')
print('dictionary object created')

@app.route('/')
def home():
    word = dictionary.getDefinitionForWord('help')
    return word
I understand that in an ideal world, I would use a database instead of XML and make a new connection to this database on every request.
I understood from the docs that the Application Context in Flask means that each request would cause dictionary = Dictionary('dictionary.xml') to be re-run, therefore opening the file on disk and re-reading the whole thing into memory. However, when I look at the debug output, I see the 'dictionary object created' line printed exactly once, despite connecting from multiple sources (different sessions?).
My first question is:
As it seems to be the case that my application only loads the XML file once... then I can assume that it resides in memory globally and can be safely read by a large number of simultaneous requests, limited only by the RAM on my server, right? If the XML is 50 MB, it would take approx. 50 MB of memory and be served up to simultaneous requests at high speed... I'm guessing it's not that easy.
And my second question is:
If this is not the case, what sort of limits am I going to hit on my ability to handle large amounts of traffic? How many requests can I handle if I have a 50 MB XML file being repeatedly opened, read from disk, and closed? I presume one at a time.
I realise this is vague and dependent on hardware, but I'm new to Flask, Python, and programming for the web, and am just looking for guidance.
Thanks!

It is safe to keep it that way as long as the global object is not modified. That is a WSGI feature, as explained in the Werkzeug docs (the library Flask is built on top of).
That data is going to be kept in the memory of each worker process of the WSGI app server. So it is not loaded just once, but the number of processes (workers) is small and constant (it does not depend on the number of sessions or the amount of traffic).
So, it is possible to keep it that way.
That said, I would use a proper database in your place. If you have 16 workers, your data will take at least 800 MB of RAM (the number of workers is usually twice the number of processors). If the XML grows and you eventually decide to use a database service, you will need to rewrite your code.
If the reason to keep it in memory is that PostgreSQL and MySQL are too slow, you could use an SQLite database kept in an in-memory filesystem like ramfs or tmpfs. It gives you the speed and the SQL interface, and you will probably save RAM. Migration to PostgreSQL or MySQL would be much easier too (in terms of code).
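For instance, here is a minimal sketch of that idea, assuming a Linux host where /dev/shm is a tmpfs mount; the path and schema are illustrative, not part of the original question:

import sqlite3

# /dev/shm is normally a RAM-backed tmpfs mount on Linux, so this database
# file never touches the disk. Path and schema are illustrative.
DB_PATH = '/dev/shm/dictionary.sqlite'

def build_db(entries):
    # entries: iterable of (word, definition) pairs parsed from the XML once at startup
    conn = sqlite3.connect(DB_PATH)
    conn.execute('CREATE TABLE IF NOT EXISTS definitions '
                 '(word TEXT PRIMARY KEY, definition TEXT)')
    conn.executemany('INSERT OR REPLACE INTO definitions VALUES (?, ?)', entries)
    conn.commit()
    conn.close()

def get_definition(word):
    # Cheap enough to do per request; all workers share the same RAM-backed file.
    conn = sqlite3.connect(DB_PATH)
    row = conn.execute('SELECT definition FROM definitions WHERE word = ?',
                       (word,)).fetchone()
    conn.close()
    return row[0] if row else None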

Related

How can I share a cache between Gunicorn workers?

I am working on a small service using Gunicorn and Flask (Python 3.6). The pseudocode below shows roughly the behavior I want. There are a lot of serialized foo objects, and I want to hold as many of these in memory as possible and delete them on an LRU basis.
cache = Cache()

@app.route('/')
def foobar():
    name = request.args['name']
    foo = cache.get(name)
    if foo is None:
        foo = load_foo(name)
        cache.add(foo)
    return foo.bar()
The problem I am having is that I do not know how to share this cache between Gunicorn workers. I'm working with limited memory and don't want to be holding duplicate objects. Certain objects will be used very often and some probably never, so it really makes sense to hold them in memory.
This is just something that will only be taking requests from another application (both running on the same server); I just wanted to keep this code separate. Am I going in the completely wrong direction by even using Gunicorn in the first place?
I don't see anything wrong with using Gunicorn, but it's probably not necessary to think about scaling horizontally unless you are close to putting this into production. Anyway, I'd recommend using a separate service as a cache rather than holding one in Python memory. That way, each worker can open a connection to the cache as needed. Redis is a popular option, but you may have to do some data manipulation to store the data, e.g. store it as a JSON string rather than a Python object. Redis can act as an LRU cache by configuring it: https://redis.io/topics/lru-cache
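As a hedged sketch of that suggestion, assuming a local Redis server configured with maxmemory and maxmemory-policy allkeys-lru, and foo objects that can be represented as JSON-serializable dicts (load_foo below is a hypothetical stand-in for the question's loader):

import json
import redis
from flask import Flask, request

app = Flask(__name__)

# Each Gunicorn worker opens its own connection, but the cached data lives
# in the Redis server, so nothing is duplicated per worker and Redis evicts
# old entries for you (allkeys-lru).
cache = redis.Redis(host='localhost', port=6379)

def load_foo(name):
    # Hypothetical loader standing in for the question's deserialization step.
    return {'name': name, 'bar': 'bar for %s' % name}

@app.route('/')
def foobar():
    name = request.args['name']
    raw = cache.get(name)
    if raw is None:
        foo = load_foo(name)
        cache.set(name, json.dumps(foo))  # Redis stores strings/bytes, not Python objects
    else:
        foo = json.loads(raw)
    return foo['bar']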

Using a global static variable server wide in Django

I have a very long list of objects that I would like to load from the DB into memory only once (meaning not for each session). This list WILL change its values and grow over time from user input. The reason I need it in memory is that I am doing some complex searches on it and want to give a quick answer back.
My question is: how do I load the list when the server starts and keep it alive across sessions, letting them all READ/WRITE to it?
Would it be better to do a heavy SQL search instead of keeping the list alive in my server?
The answer is that this is a bad idea; you are opening a Pandora's box, especially since you need write access as well. However, all is not lost. You can quite easily use Redis for this task.
Redis is a persistent data store, but at the same time everything is held in memory. If the Redis server runs on the same device as the web server, access is almost instantaneous.
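A minimal sketch of that idea with the redis-py client, assuming a local Redis server and items that can be serialized to JSON (the key names are illustrative):

import json
import redis

r = redis.Redis(host='localhost', port=6379, decode_responses=True)

def add_item(item_id, item):
    # Any worker or session can write; Redis executes each command atomically.
    r.hset('items', item_id, json.dumps(item))

def all_items():
    # Pull the whole collection back for the complex in-Python search;
    # for very large collections a server-side index would be preferable.
    return [json.loads(value) for value in r.hgetall('items').values()]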

Preventing multiple loading of large objects in uWSGI workers?

I have one very large custom data structure (similar to a trie, though that's not important to the question) that I'm using to access and serve data from. I'm moving my application to uWSGI for production use now, and I definitely don't want this reloaded per worker. Can I share it among worker processes somehow? I load the structure once and then reload it once a minute through apscheduler. Nothing any of the workers do modifies the data structure in any way. Is there a better solution to this type of problem? Loading the same thing per worker is hugely wasteful.
Depending on the kind of data structure it is, you could try using a memory-mapped file. Python's standard mmap module wraps the relevant system calls.
The file's structure would need to reflect the data structure you are using. For example, if you need a trie, you could store all of the strings in a sorted list and do a binary search for the prefix to see which strings have that prefix.
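For instance, a minimal sketch of that sorted-list prefix lookup using Python's bisect module (shown on a small in-memory list for clarity; the same binary search applies when the sorted strings live in the mapped file):

import bisect

def strings_with_prefix(words, prefix):
    # words must be sorted; bisect finds the first possible match in O(log n).
    i = bisect.bisect_left(words, prefix)
    matches = []
    while i < len(words) and words[i].startswith(prefix):
        matches.append(words[i])
        i += 1
    return matches

words = sorted(['apple', 'apply', 'apt', 'banana'])
print(strings_with_prefix(words, 'app'))  # ['apple', 'apply']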
As you access pages in the file, they will be loaded into memory via the OS's disk read cache. Subsequent requests for the same page will be fast. Because the disk cache can be shared between processes, all of your uWSGI workers will benefit from the speed of accessing cached pages.
I tried this on Linux by forcing it to scan a big file in two separate processes. Create a large file called 'big', then run the following in two separate Python processes:
import mmap

with open('big', 'rb') as fp:
    m = mmap.mmap(fp.fileno(), 0, mmap.MAP_PRIVATE)
    # Touch every byte so that every page of the file ends up in the
    # OS page cache, which is shared between processes.
    for i in range(len(m)):
        if m[i] == ord('a'):  # Make sure 'a' doesn't occur in the file!
            break
You will notice that the resident memory of the two processes grows as they scan the file; however, so does the shared memory usage. For example, if 'big' is a 1 GB file, both processes will appear to be using about 1 GB of memory, but the overall memory load on the system will increase by only 1 GB, not 2 GB.
Obviously there are some limitations to this approach, chiefly that the data structure you are looking to share must be easy to represent in a binary format. Also, Python needs to copy any bytes from the file into memory whenever you access them. This can cause aggressive garbage collection if you frequently read through the entire file in small pieces, or undermine the shared-memory benefit of the memory map if you read large pieces.

Store Python tree in memory and access it frequently

I have a ternary search tree written in Python loaded with around 200K valid English words. I am using it for dictionary look-up, as I am writing a Boggle-like app which accesses the tree very frequently to judge whether a sequence of letters is a valid word.
Right now my app is just a script that you call from the CLI. However, I'm architecting my app as a client-server model. Since it takes quite some time to load all the words into the tree, I don't want to do that for every request made to the server. Ideally, the tree should persist as an in-memory object in the server, receiving requests and sending responses.
I have tried Pyro4, but the overhead of network I/O grows as the frequency of access increases, so it's not a viable option; the returns diminish quickly. I wish to implement a lower-level solution with less I/O overhead. I have read up about shared memory and server processes (https://docs.python.org/2/library/multiprocessing.html#sharing-state-between-processes) but I'm not sure how to adapt them to non-primitive Python objects such as my ternary search tree. This is my first time doing this sort of thing, so I'd appreciate it if you could give some guidance.
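One way to adapt the server-process idea from the multiprocessing docs is sketched below with BaseManager. TernarySearchTree and words.txt are hypothetical stand-ins for the actual tree and word list, and every lookup is still one round trip over a local socket (typically cheaper than Pyro4's network stack, but not free):

from multiprocessing.managers import BaseManager

class TernarySearchTree:
    # Hypothetical stand-in for the real ternary search tree.
    def __init__(self, word_file):
        self.words = set(open(word_file).read().split())

    def is_word(self, letters):
        return letters in self.words

class TreeManager(BaseManager):
    pass

if __name__ == '__main__':
    # Server: load the tree once, then serve lookups to any client process.
    tree = TernarySearchTree('words.txt')
    TreeManager.register('get_tree', callable=lambda: tree)
    manager = TreeManager(address=('localhost', 50000), authkey=b'boggle')
    manager.get_server().serve_forever()

# Client (run in a separate process):
#     TreeManager.register('get_tree')
#     m = TreeManager(address=('localhost', 50000), authkey=b'boggle')
#     m.connect()
#     tree = m.get_tree()      # proxy to the server-side object
#     tree.is_word('hello')    # method call forwarded over the socket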

Python: file-based thread-safe queue

I am creating an application (app A) in Python that listens on a port, receives NetFlow records, encapsulates them and securely sends them to another application (app B). App A also checks whether the record was successfully sent. If not, it has to be saved. App A then waits a few seconds and tries to send it again, etc. This is the important part: if the sending was unsuccessful, records must be stored, but meanwhile many more records can arrive and they need to be stored too. The ideal way to do that is a queue. However, I need this queue to be in a file (on disk). I found, for example, this code http://code.activestate.com/recipes/576642/ but it "On open, loads full file into memory", and that's exactly what I want to avoid. I must assume that this file of records could grow to a couple of GB.
So my question is: what would you recommend to store these records in? It needs to handle a lot of data; on the other hand, it would be great if it weren't too slow, because during normal activity only one record is saved at a time and it's read and removed immediately. So the basic state is an empty queue. And it should be thread-safe.
Should I use a database (dbm, sqlite3..) or something like pickle, shelve or something else?
I am a little confused by all this... thank you.
You can use Redis as a database for this. It is very, very fast, does queuing amazingly well, and it can save its state to disk in a few ways, depending on the fault-tolerance level you want. Since it is an external process, you might not need a very strict saving policy, because if your program crashes, everything is already saved externally.
See http://redis.io/documentation, and if you want more detailed info on how to do this in Redis, I'd be glad to elaborate.
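For example, a minimal sketch with the redis-py client, assuming a local Redis server with AOF persistence enabled (appendonly yes) so queued records survive a restart; the key name is illustrative:

import json
import redis

r = redis.Redis(host='localhost', port=6379)
QUEUE_KEY = 'unsent_records'

def enqueue(record):
    # Each Redis command is atomic, so this is safe from multiple threads.
    r.lpush(QUEUE_KEY, json.dumps(record))

def dequeue(timeout=5):
    # Blocks until a record is available or the timeout (in seconds) expires.
    item = r.brpop(QUEUE_KEY, timeout=timeout)
    return json.loads(item[1]) if item else None

LPUSH/BRPOP gives you a FIFO queue that lives outside the Python process, so app A can crash and restart without losing the backlog.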
