I have one very large custom data structure (similar to a trie, though that's not important to the question) that I use to access and serve data. I'm moving my application to uWSGI for production now, and I definitely don't want this structure reloaded per worker. Can I share it among worker processes somehow? I load the structure once and then reload it once a minute through apscheduler. Nothing the workers do modifies the data structure in any way. Is there a better solution to this type of problem? Loading the same thing per worker is hugely wasteful.
Depending on the kind of data structure it is, you could try using a memory mapped file. There is a Python library that wraps the relevant system calls.
The file's structure would need to reflect the data structure you are using. For example, if you need a trie, you could store all of the strings in a sorted list and do a binary search for the prefix to see which strings have that prefix.
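For instance, here is a minimal sketch of that sorted-list lookup using the standard bisect module (the helper name and sample data are illustrative):

import bisect

def strings_with_prefix(sorted_strings, prefix):
    # Binary-search for the first string >= prefix, then walk forward
    # while the strings still start with that prefix.
    start = bisect.bisect_left(sorted_strings, prefix)
    matches = []
    for s in sorted_strings[start:]:
        if not s.startswith(prefix):
            break
        matches.append(s)
    return matches

words = sorted(['car', 'card', 'care', 'cat', 'dog'])
print(strings_with_prefix(words, 'car'))   # ['car', 'card', 'care']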
As you access pages in the file, they will be loaded into memory via the OS's disk read cache. Subsequent requests for the same page will be fast. Because the disk cache can be shared between processes, all of your uWSGI workers will benefit from the speed of accessing cached pages.
I tried this on Linux by forcing it to scan a big file in two separate processes. Create a large file called 'big', then run the following in two separate Python processes:
import mmap

with open('big', 'rb') as fp:
    mm = mmap.mmap(fp.fileno(), 0, mmap.MAP_PRIVATE)
    for i in range(len(mm)):
        x = mm[i]               # touching a byte pulls its page into the OS cache
        if x == ord('a'):       # Make sure 'a' doesn't occur in the file!
            break
You will notice that the resident memory of the two processes grows as they scan the file; however, so does their shared memory usage. For example, if big is a 1 GB file, both processes will appear to be using about 1 GB of memory, but the overall memory load on the system will be increased by only 1 GB, not 2 GB.
Obviously there are some limitations to this approach, chiefly that the data structure you want to share must be easy to represent in a binary format. Also, Python needs to copy any bytes from the file into memory whenever you access them. This can cause aggressive garbage collection if you frequently read through the entire file in small pieces, or it can undermine the shared-memory benefit of the memory map if you read large pieces.
Does anyone have a minimal working example of how to use uWSGI to share memory across requests in say Django?
I have a large file in proprietary format (not database-compatible) that I need to load for each request.
An Instagram post got me thinking; it states:
For the application server, we use uWSGI with pre-fork mode to leverage memory sharing between master and worker processes.
How would you set something like that up?
There are multiple ways to handle this:
Share by "abusing" copy-on-write for read-only data
If your data is read-only, you can leverage the fact that uWSGI executes your Python code to build the application before forking into multiple processes. This means any data that is already loaded before the fork happens will be shared by all your processes.
This can be a great tool because you don't have to deal with multiprocessing at all to benefit from this mechanism. But be careful: as soon as any process writes to this data, it copies it first to get its own local version.
Django doesn't make it easy because all views are lazy. This means Django won't run the code related to your views when the application is created. Therefore, to benefit from pre-fork sharing, you need to load the data in code outside your views. For instance, you can load the data right before or after you build your application object (like in the gist linked by @john-strood), as in the sketch below.
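A minimal sketch of this, assuming a standard Django project layout (the project name and the load_big_structure helper are illustrative, not part of the original answer): load the data in wsgi.py right after the application object is built, so it exists in the master process before uWSGI forks its workers.

# myproject/wsgi.py
import os
from django.core.wsgi import get_wsgi_application

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproject.settings')
application = get_wsgi_application()

# Load the big read-only structure here, in the master process, before the
# fork. Workers then share these pages via copy-on-write instead of each
# loading its own copy.
from myapp.data import load_big_structure   # hypothetical helper
BIG_STRUCTURE = load_big_structure()

Views can then import BIG_STRUCTURE from this module and read it, as long as they never write to it.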
Use uWSGI cache framework
If you need to write to this data, a first solution is to use the uWSGI cache framework. It's fairly easy to use: you configure in advance how much memory you need, and then all your processes can read and write to it. You don't have to deal with locking or other multiprocessing-related issues.
The drawback is that you still incur I/O latency between your processes and the uWSGI cache. This is insignificant for tiny chunks of data, but would be prohibitive for gigabytes.
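For illustration, a minimal sketch of the cache API, assuming a cache named 'mycache' has been configured in the uWSGI config (for example cache2 = name=mycache,items=100) and that values fit in the configured block size; the key names and JSON encoding are just one possible choice:

import json
import uwsgi   # only importable when running under uWSGI

def save_shared(key, obj):
    # cache_update overwrites the value whether or not the key already exists;
    # 0 means the entry never expires.
    uwsgi.cache_update(key, json.dumps(obj).encode('utf-8'), 0, 'mycache')

def load_shared(key):
    raw = uwsgi.cache_get(key, 'mycache')
    return json.loads(raw) if raw is not None else None

Any worker can call save_shared and every other worker will see the new value on its next load_shared.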
Use shared memory manually
As a last resort, if your data is not read-only and you need to load large chunks on every request, so large that even sending them through a Unix socket would take too long, then you need to put your data directly in shared memory. Here uWSGI won't help, and you will have to deal with locking and multiprocessing issues yourself.
You can refer to multiprocessing's shared memory documentation.
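As a minimal sketch (Python 3.8+; the block name, payload, and sizes are illustrative, and real code would still need its own locking for concurrent writers):

from multiprocessing import shared_memory

# Creator: allocate a named block and fill it with serialized data.
payload = b'some serialized data'
shm = shared_memory.SharedMemory(create=True, size=len(payload), name='app_data')
shm.buf[:len(payload)] = payload

# Any other process can attach to the same block by name.
other = shared_memory.SharedMemory(name='app_data')
print(bytes(other.buf[:len(payload)]))

other.close()
shm.close()
shm.unlink()   # free the block once every process is done with it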
I have multiprocessing code wherein each process does a disk write (pickling data), and the resulting pickle files can be upwards of 50 MB (and sometimes even more than 1 GB depending on what I'm doing). Also, different processes are not writing to the same file, each process writes a separate file (or set of files).
Would it be a good idea to implement a lock around disk writes so that only one process is writing to the disk at a time? Or would it be best to just let the operating system sort it out even if that means 4 processes may be trying to write 1 GB to the disk at the same time?
As long as the processes aren't fighting over the same file, let the OS sort it out. That's its job.
Unless your processes try to dump their data in one big write, the OS is in a better position to schedule disk writes.
If you do use one big write, you might try to partition it into smaller chunks. That might give the OS a better chance of handling them; see the sketch below.
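For example, a small sketch of that idea (the chunk size and function name are illustrative): serialize once, then hand the OS a series of moderate writes instead of one multi-gigabyte write call.

import pickle

def write_pickle_in_chunks(obj, path, chunk_size=8 * 1024 * 1024):
    # Serialize the object once, then write the resulting bytes in
    # fixed-size pieces so the OS can interleave and schedule the writes.
    data = pickle.dumps(obj)
    with open(path, 'wb') as f:
        for start in range(0, len(data), chunk_size):
            f.write(data[start:start + chunk_size])

write_pickle_in_chunks({'results': list(range(1000))}, 'results.pkl')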
Of course you will hit a limit somewhere. Your program might be CPU-bound, memory-bound or disk-bound. It might hit different limits depending on the input or load.
But unless you have evidence that you're constantly disk-bound and you have a good idea how to solve that, I'd say don't bother. The days when a write system call actually meant that the data was sent directly to disk are long gone.
Most operating systems these days use unallocated RAM as a disk cache, and HDDs have built-in caches as well. Unless you disable both of these (which will give you a huge performance hit), there is precious little connection between your program completing a write and the data actually hitting the platters or flash.
You might also consider using mmap (if your OS supports it) and letting the OS's virtual memory do the work for you. See e.g. the architect's notes for the Varnish cache.
I have a large XML file which is opened, loaded into memory and then closed by a Python class. A simplified example would look like this:
class Dictionary():
    def __init__(self, filename):
        f = open(filename)
        self.contents = f.readlines()
        f.close()

    def getDefinitionForWord(self, word):
        # returns a definition for the word, using the etree parser
        pass
And in my Flask application:
from flask import Flask
from dictionary import Dictionary

app = Flask(__name__)
dictionary = Dictionary('dictionary.xml')
print('dictionary object created')

@app.route('/')
def home():
    word = dictionary.getDefinitionForWord('help')
    return word
I understand that in an ideal world, I would use a database instead of XML and make a new connection to this database on every request.
I understood from the docs that the Application Context in Flask means that each request would cause dictionary = Dictionary('dictionary.xml') to be re-run, therefore opening the file on disk and re-reading the whole thing into memory. However, when I look at the debug output, I see the 'dictionary object created' line printed exactly once, despite connecting from multiple sources (different sessions?).
My first question is:
As it seems my application only loads the XML file once, can I assume that it resides in memory globally and can be safely read by a large number of simultaneous requests, limited only by the RAM on my server? If the XML is 50 MB, it would take roughly 50 MB in memory and be served to simultaneous requests at high speed... I'm guessing it's not that easy.
And my second question is:
If this is not the case, what sort of limits am I going to hit on my ability to handle large amounts of traffic? How many requests can I handle if a 50 MB XML file is being repeatedly opened, read from disk, and closed? I presume one at a time.
I realise this is vague and dependent on hardware but I'm new to Flask, python, and programming for the web, and just looking for guidance.
Thanks!
It is safe to keep it that way as long as the global object is not modified. That is a WSGI feature, as explained in the Werkzeug docs (the library Flask is built on top of).
That data is kept in the memory of each worker process of the WSGI app server. That does not mean it is loaded only once, but the number of processes (workers) is small and constant (it does not depend on the number of sessions or the amount of traffic).
So, it is possible to keep it that way.
That said, I would use a proper database in your place. If you have 16 workers, your data will take at least 800 MB of RAM (the number of workers is usually twice the number of processors). If the XML grows and you eventually decide to use a database service, you will need to rewrite your code.
If the reason to keep it in memory is that PostgreSQL and MySQL are too slow, you could use an SQLite database kept on an in-memory filesystem like ramfs or tmpfs. That gives you the speed and the SQL interface, and you will probably save RAM. Migration to PostgreSQL or MySQL would be much easier too (in terms of code).
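A minimal sketch of that setup, assuming the database file lives on a tmpfs mount such as /dev/shm and contains a table of word definitions (the path, table, and column names are illustrative):

import sqlite3

DB_PATH = '/dev/shm/dictionary.sqlite'   # tmpfs-backed file, shared by all workers

def get_definition(word):
    # Each request opens a cheap read-only connection; the data itself sits
    # once in RAM (in the tmpfs pages), not once per worker process.
    conn = sqlite3.connect('file:{}?mode=ro'.format(DB_PATH), uri=True)
    try:
        row = conn.execute(
            'SELECT definition FROM entries WHERE word = ?', (word,)
        ).fetchone()
        return row[0] if row else None
    finally:
        conn.close()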
I have a Python script to analyze user behavior from log files.
This script reads several large files (about 50 GB each) using file.readlines(), then analyzes them line by line and saves the results in a dict of Python objects; after all lines are analyzed, the dict is written to disk.
As I have a server with 64 cores and 96 GB of memory, I start 10 processes of this script, each of which handles part of the data. In addition, to save the time spent on I/O operations, I use file.readlines(MAX_READ_LIMIT) instead of file.readline() and set MAX_READ_LIMIT = 1 GB.
After running this script on the server while using the top command to watch resource usage, I find that although each process of my script occupies only about 3.5 GB of memory (40 GB in total), there is only 380 MB left free on the server (and no other significant memory-consuming app is running on the server at the same time).
So, I was wondering: where did the memory go? There should be about 96 - 40 = 56 GB of memory left.
Please tell me if I made some mistake in the above observations.
One hypothesis is that the unused memory is NOT returned to the memory pool immediately, so I was wondering how to release unused memory explicitly and immediately.
I learned from the Python documentation that there are two complementary methods of managing memory in Python: garbage collection and reference counting, and according to the Python docs:
Since the collector supplements the reference counting already used in Python, you can disable the collector if you are sure your program does not create reference cycles.
So, which one should I use in my case, del obj or gc.collect()?
using file.readlines(), then analyze data line by line
This is a bad design. readlines reads the entire file and returns a Python list of strings. If you only need to process the data line-by-line, then iterate through the file without using readlines:
with open(filename) as f:
    for line in f:
        ...  # process one line at a time
This will massively reduce the amount of memory your program requires.
I may be in a little over my head here, but I am working on a little bioinformatics project in Python. I am trying to parallelize a program that analyzes a large dictionary of sets of strings (~2-3 GB in RAM). I find that the multiprocessing version is faster when I have smaller dictionaries, but it is of little benefit and mostly slower with the large ones. My first theory was that running out of memory just slowed everything down and the bottleneck was swapping into virtual memory. However, I ran the program on a cluster with 4 × 48 GB of RAM and the same slowdown occurred. My second theory is that access to certain data was being locked. If one thread is trying to access a reference currently being accessed by another thread, will that thread have to wait? I have tried creating copies of the dictionaries I want to manipulate, but that seems terribly inefficient. What else could be causing my problems?
My multiprocessing method is below:
import math
import multiprocessing
from copy import copy

def worker(seqDict, oQueue):
    # do stuff with the given partial dictionary
    oQueue.put(seqDict)

def analyze(sdict):
    oQueue = multiprocessing.Queue()
    chunksize = int(math.ceil(len(sdict) / 4))  # 4 cores
    inDict = {}
    i = 0
    dicts = list()
    for key in sdict.keys():
        i += 1
        if len(sdict[key]) > 0:
            inDict[key] = sdict[key]
        if i % chunksize == 0 or i == len(sdict.keys()):
            print(str(len(inDict.keys())) + ", size")
            dicts.append(copy(inDict))
            inDict.clear()

    for pdict in dicts:
        p = multiprocessing.Process(target=worker, args=(pdict, oQueue))
        p.start()

    finalDict = {}
    for i in range(4):
        finalDict.update(oQueue.get())
    return finalDict
As I said in the comments, and as Kinch said in his answer, everything passed through to a subprocess has to be pickled and unpickled to duplicate it in the local context of the spawned process. If you use a multiprocessing.Manager().dict() for sdict (thereby allowing processes to share the same data through a server process that proxies the objects created on it) and spawn the processes with slice indices into that shared sdict, that should cut down on the serialize/deserialize sequence involved in spawning the child processes. You might still hit bottlenecks in the server-communication step of working with the shared objects, though. If so, you'll have to look at simplifying your data so you can use true shared memory with multiprocessing.Array or multiprocessing.Value, or look at multiprocessing.sharedctypes to create custom data structures to share between your processes.
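A minimal sketch of the Manager approach (the sample data and worker logic are illustrative): the dict lives in the manager's server process, and each worker receives only the keys it should handle.

import multiprocessing

def worker(shared_dict, keys, out_queue):
    # Every lookup in shared_dict goes through the manager's server process.
    result = {k: len(shared_dict[k]) for k in keys}
    out_queue.put(result)

if __name__ == '__main__':
    data = {'geneA': {'ACGT', 'TTGA'}, 'geneB': {'GGCC'}, 'geneC': set()}

    manager = multiprocessing.Manager()
    shared_dict = manager.dict(data)   # held once, in the manager process

    out_queue = multiprocessing.Queue()
    keys = list(data)                  # only the keys get pickled per worker
    half = len(keys) // 2
    procs = [
        multiprocessing.Process(target=worker, args=(shared_dict, chunk, out_queue))
        for chunk in (keys[:half], keys[half:])
    ]
    for p in procs:
        p.start()

    merged = {}
    for _ in procs:
        merged.update(out_queue.get())
    for p in procs:
        p.join()
    print(merged)   # e.g. {'geneA': 2, 'geneB': 1, 'geneC': 0}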
Seems like the data from the "large dictionary of sets of strings" could be reformatted into something that could be stored in a file or string, allowing you to use the mmap module to share it among all the processes. Each process might incur some startup overhead if it needs to convert the data back into a more convenient form, but that could be minimized by passing each process something indicating which subset of the whole dataset in shared memory it should work on, so it only reconstitutes the part it needs. A rough sketch follows.
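For example (the file name, record layout, and striping scheme are illustrative): dump the dict as one line per key, mmap the file in every process, and give each worker the indices of the lines it owns.

import mmap

# One-time setup: dump the dict as 'key<TAB>member1,member2,...' lines.
with open('dataset.txt', 'wb') as f:
    f.write(b'geneA\tACGT,TTGA\n')
    f.write(b'geneB\tGGCC\n')

def count_members(path, worker_id, num_workers):
    # Every worker maps the same file; the OS keeps the pages in RAM once
    # and shares them read-only across all the processes.
    with open(path, 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        results = {}
        for i, line in enumerate(iter(mm.readline, b'')):
            if i % num_workers != worker_id:
                continue   # not this worker's share of the data
            key, members = line.rstrip(b'\n').split(b'\t')
            results[key] = len(set(members.split(b',')))
        return results

print(count_members('dataset.txt', 0, 2))   # worker 0's share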
Any data passed through the queue will be serialized and deserialized using pickle. I would guess this could be a bottleneck if you pass a lot of data around.
You could reduce the amount of data, make use of shared memory, write a multithreaded version in a C extension, or try a multithreaded version of this with a thread-safe implementation of Python (maybe Jython or PyPy; I don't know).
Oh, and by the way: you are using multiprocessing, not multithreading.