I'm developing a Python command line utility that potentially involves rather large queries against a set of files. It's a reasonably finite list of queries (think indexed DB columns) To improve performance in-process I can generated sorted/structured lists, maps and trees once, and hit those repeatedly, rather than hit the file system each time.
However, these caches are lost when the process ends, and need to be rebuilt every time the script runs, which dramatically increases the runtime of my program. I'd like to identify the best way to share this data between multiple executions of my command, which may be concurrent, one after another, or with significant delays between executions.
Requirements:
Must be fast - any sort of per-execution processing should be minimized, this includes disk IO and object construction.
Must be OS agnostic (or at least be able to hook into similar underlying behaviors on Unix/Windows, which is more likely).
Must allow reasonably complex querying / filtering - I don't think a key/value map will be good enough
Does not need to be up-to-date - (briefly) stale data is perfectly fine, this is just a cache, the actual data is being written to disk separately.
Can't use a heavyweight daemon process, like MySQL or MemCached - I want to minimize installation costs, and asking each user to install these services is too much.
Preferences:
I'd like to avoid any sort long running daemon process at all, if possible.
While I'd like to be able to update the cache quickly, rebuilding the whole cache on update isn't the end of the world, fast reads are much more important than fast writes.
In my ideal fantasy world, I'd be able to directly keep Python objects around between executions, sort of like Java threads (like Tomcat requests) sharing singleton data store objects, but I realize that may not be possible. The closer I can get to that though, the better.
Candidates:
SQLite in memory
SQLite on it's own doesn't seem fast enough for my use case, since it's backed by disk and therefore will have to read from the file on every execution. Perhaps this isn't as bad as it seems, but it seems necessary to persistently store the database in memory. SQLite allows for DBs to use memory as storage but these DBs are destroyed upon program exit, and cannot be shared between instances.
Flat file database loaded into memory with mmap
On the opposite end of the spectrum, I could write the caches to disk, then load them into memory with mmap, can share the same memory space between separate executions. It's not clear to me what happens to the mmap if all processes exit however. It's ok if the mmap is eventually flushed from memory, but I'd want it to stick around for a little bit (30 seconds? a few minutes?) so a user can run commands one after another, and the cache can be reused. This example seems to imply that there needs to be an open mmap handle, but I haven't found any exact description of when memory mapped files get dropped from memory and need to be reloaded from disk.
I think I could implement this, if mmap objects do stick around after exit, but it feels very low level, and I imagine someone's already got a more elegant solution implemented. I'd hate to start building this only to realize I've been rebuilding SQLite. On the other hand, it feels like it would be very fast, and I could make optimizations given my specific use case.
Share Python objects between processes using Processing
The Processing package indicates "Objects can be shared between processes using ... shared memory". Looking through the rest of the docs, I didn't see any further mention of this behavior, but that sounds very promising. Can anyone direct me to more information?
Store data on a RAM disk
My concern here is OS-specific capabilities, but I could create a RAM disk and then simply read/write to it as I please (SQLite?). The fs.memoryfs package seems like a promising alternative to work with multiple OSs, but the comments imply a fair number of limitations.
I know pickle is an efficient way to store Python objects, so it might have speed advantages over any sort of manual data storage. Can I hook pickle into any of the above options? Would that be better than flat files or SQLite?
I know there's a lot of questions related to this, but I did a fair bit of digging and couldn't find anything directly addressing my question with regards to multiple command line executions.
I fully admit, I may be way overthinking this. I'm just trying to get a feel for my options, and if they're worthwhile or not.
Thank you so much for your help!
I would just do the simplest thing that might possibly work. ...which in your case would likely just be to dump to a pickle file. If you find it's not fast enough, try something more involved (like memcached or SQLite). Donald Knuth says "Premature optimization is the root of all evil"!
Related
[I'm using Python 3.5.2 (x64) in Windows.]
I'm reading binary data in large blocks (on the order of megabytes) and would like to efficiently share that data into 'n' concurrent Python sub-processes (each process will deal with the data in a unique and computationally expensive way).
The data is read-only, and each sequential block will not be considered to be "processed" until all the sub-processes are done.
I've focused on shared memory (Array (locked / unlocked) and RawArray): Reading the data block from the file into a buffer was quite quick, but copying that block to the shared memory was noticeably slower.
With queues, there will be a lot of redundant data copying going on there relative to shared memory. I chose shared memory because it involved one copy versus 'n' copies of the data).
Architecturally, how would one handle this problem efficiently in Python 3.5?
Edit: I've gathered two things so far: memory mapping in Windows is cumbersome because of the pickling involved to make it happen, and multiprocessing.Queue (more specifically, JoinableQueue) is faster though not (yet) optimal.
Edit 2: One other thing I've gathered is, if you have lots of jobs to do (particularly in Windows, where spawn() is the only option and is costly too), creating long-running parallel processes is better than creating them over and over again.
Suggestions - preferably ones that use multiprocessing components - are still very welcome!
In Unix this might be tractable because fork() is used for multiprocessing, but in Windows the fact that spawn() is the only way it works really limits the options. However, this is meant to be a multi-platform solution (which I'll use mainly in Windows) so I am working within that constraint.
I could open the data source in each subprocess, but depending on the data source that can be expensive in terms of bandwidth or prohibitive if it's a stream. That's why I've gone with the read-once approach.
Shared memory via mmap and an anonymous memory allocation seemed ideal, but to pass the object to the subprocesses would require pickling it - but you can't pickle mmap objects. So much for that.
Shared memory via a cython module might be impossible or it might not but it's almost certainly prohibitive - and begs the question of using a more appropriate language to the task.
Shared memory via the shared Array and RawArray functionality was costly in terms of performance.
Queues worked the best - but the internal I/O due to what I think is pickling in the background is prodigious. However, the performance hit for a small number of parallel processes wasn't too noticeable (this may be a limiting factor on faster systems though).
I will probably re-factor this in another language for a) the experience! and b) to see if I can avoid the I/O demands the Python Queues are causing. Fast memory caching between processes (which I hoped to implement here) would avoid a lot of redundant I/O.
While Python is widely applicable, no tool is ideal for every job and this is just one of those cases. I learned a lot about Python's multiprocessing module in the course of this!
At this point it looks like I've gone as far as I can go with standard CPython, but suggestions are still welcome!
I have a python process serving as a WSGI-apache server. I have many copies of this process running on each of several machines. About 200 megabytes of my process is read-only python data. I would like to place these data in a memory-mapped segment so that the processes could share a single copy of those data. Best would be to be able to attach to those data so they could be actual python 2.7 data objects rather than parsing them out of something like pickle or DBM or SQLite.
Does anyone have sample code or pointers to a project that has done this to share?
This post by #modelnine on StackOverflow provides a really great comprehensive answer to this question. As he mentioned, using threads rather than process-forking in your webserver can significantly lesson the impact of this. I ran into a similar problem trying to share extremely-large NumPy arrays between CLI Python processes using some type of shared memory a couple of years ago, and we ended up using a combination of a sharedmem Python extension to share data between the workers (which proved to leak memory in certain cases, but, it's fixable probably). A read-only mmap() technique might work for you, but I'm not sure how to do that in pure-python (NumPy has a memmapping technique explained here). I've never found any clear and simple answers to this question, but hopefully this can point you in some new directions. Let us know what you end up doing!
It's difficult to share actual python objects because they are bound to the process address space. However, if you use mmap, you can create very usable shared objects. I'd create one process to pre-load the data, and the rest could use it. I found quite a good blog post that describes how it can be done: http://blog.schmichael.com/2011/05/15/sharing-python-data-between-processes-using-mmap/
Since it's read-only data you won't need to share any updates between processes (since there won't be any updates) I propose you just keep a local copy of it in each process.
If memory constraints is an issue you can have a look at using multiprocessing.Value or multiprocessing.Array without locks for this: https://docs.python.org/2/library/multiprocessing.html#shared-ctypes-objects
Other than that you'll have to rely on an external process and some serialising to get this done, I'd have a look at Redis or Memcached if I were you.
One possibility is to create a C- or C++-extension that provides a Pythonic interface to your shared data. You could memory map 200MB of raw data, and then have the C- or C++-extension provide it to the WSGI-service. That is, you could have regular (unshared) python objects implemented in C, which fetch data from some kind of binary format in shared memory. I know this isn't exactly what you wanted, but this way the data would at least appear pythonic to the WSGI-app.
However, if your data consists of many many very small objects, then it becomes important that even the "entrypoints" are located in the shared memory (otherwise they will waste too much memory). That is, you'd have to make sure that the PyObject* pointers that make up the interface to your data, actually themselves point to the shared memory. I.e, the python objects themselves would have to be in shared memory. As far as I can read the official docs, this isn't really supported. However, you could always try "handcrafting" python objects in shared memory, and see if it works. I'm guessing it would work, until the Python interpreter tries to free the memory. But in your case, it won't, since it's long-lived and read-only.
I'm workin with fairly large dataframes and textfiles (thousands of docs) that I am opening up in my ipython notebook. I'm noticing that after a while, my computer becomes really slow. Is there a way to take inventory of my python program to find out what's slowing down my computer?
You have a few options. First, you can use third party tools like heapy or PySizer to evaluate your memory usage at different points in your program. This (now closed) SO question discusses them a little bit. Additionally, there is a third option simply called 'memory_profiler' hosted here on GitHub, and according to this blog there are some special shortcuts in IPython for memory_profiler.
Once you have identified the data structures that are consuming the most memory, there are a few options:
Refactor to take advantage of garbage collection
Examine the flow of data through your program and see if there are any places where large data structures are kept around when they don't need to be. If you have a large data structure that you do some processing on, put that processing in a function and returned the processed result so the original memory hog can go out of scope and be destroyed.
A comment suggested using the del statement. Although the commenter is correct that it will free memory, it really should indicate to you that your program isn't structured correctly. Python has good garbage collection, and if you find yourself manually messing with memory freeing, you should probably put that section of code in a function or method instead, and let the garbage collector do its thing.
Temporary Files
If you really need access to large data structures (almost) simultaneously, consider writing one or several of them to temporary files while not needed. You can use the JSON or Pickle libraries to write stuff out in sophisticated formats, or simply pprint your data to a file and read it back in later.
I know that seems like some kind of manual hard disk thrashing, but it gives you great control over exactly when the writes to and reads from the hard disk occur. Also, in this case only your files are bouncing on and off the disk. When you use up your memory and swapping starts occurring, everything gets bounced around - data files, program instructions, memory page tables, etc... Everything grinds to a halt instead of just your program running a little more slowly.
Buy More Memory
Yes, this is an option. But like the del statement, it can usually be avoided by more careful data abstraction and should be a last resort, reserved for special cases.
iPython it's a wonderful tool, but sometimes it tends to slow things up.
If you have large print output statements, lots of graphics, or your code has grown too big, the autosave takes forever to snap your Notebooks. Try autosaving sparingly with:
%autosave 300
Or disabling it entirely:
%autosave 0
I doubt this is even possible, but here is the problem and proposed solution (the feasibility of the proposed solution is the object of this question):
I have some "global data" that needs to be available for all requests. I'm persisting this data to Riak and using Redis as a caching layer for access speed (for now...). The data is split into about 30 logical chunks, each about 8 KB.
Each request is required to read 4 of these 8KB chunks, resulting in 32KB of data read in from Redis or Riak. This is in ADDITION to any request-specific data which would also need to be read (which is quite a bit).
Assuming even 3000 requests per second (this isn't a live server so I don't have real numbers, but 3000ps is a reasonable assumption, could be more), this means 96KBps of transfer from Redis or Riak in ADDITION to the already not-insignificant other calls being made from the application logic. Also, Python is parsing the JSON of these 8KB objects 3000 times every second.
All of this - especially Python having to repeatedly deserialize the data - seems like an utter waste, and a perfectly elegant solution would be to just have the deserialized data cached in an in-memory native object in Python, which I can refresh periodically as and when all this "static" data becomes stale. Once in a few minutes (or hours), instead of 3000 times per second.
But I don't know if this is even possible. You'd realistically need an "always running" application for it to cache any data in its memory. And I know this is not the case in the nginx+uwsgi+python combination (versus something like node) - python in-memory data will NOT be persisted across all requests to my knowledge, unless I'm terribly mistaken.
Unfortunately this is a system I have "inherited" and therefore can't make too many changes in terms of the base technology, nor am I knowledgeable enough of how the nginx+uwsgi+python combination works in terms of starting up Python processes and persisting Python in-memory data - which means I COULD be terribly mistaken with my assumption above!
So, direct advice on whether this solution would work + references to material that could help me understand how the nginx+uwsgi+python would work in terms of starting new processes and memory allocation, would help greatly.
P.S:
Have gone through some of the documentation for nginx, uwsgi etc but haven't fully understood the ramifications per my use-case yet. Hope to make some progress on that going forward now
If the in-memory thing COULD work out, I would chuck Redis, since I'm caching ONLY the static data I mentioned above, in it. This makes an in-process persistent in-memory Python cache even more attractive for me, reducing one moving part in the system and at least FOUR network round-trips per request.
What you're suggesting isn't directly feasible. Since new processes can be spun up and down outside of your control, there's no way to keep native Python data in memory.
However, there are a few ways around this.
Often, one level of key-value storage is all you need. And sometimes, having fixed-size buffers for values (which you can use directly as str/bytes/bytearray objects; anything else you need to struct in there or otherwise serialize) is all you need. In that case, uWSGI's built-in caching framework will take care of everything you need.
If you need more precise control, you can look at how the cache is implemented on top of SharedArea and do something customize. However, I wouldn't recommend that. It basically gives you the same kind of API you get with a file, and the only real advantages over just using a file are that the server will manage the file's lifetime; it works in all uWSGI-supported languages, even those that don't allow files; and it makes it easier to migrate your custom cache to a distributed (multi-computer) cache if you later need to. I don't think any of those are relevant to you.
Another way to get flat key-value storage, but without the fixed-size buffers, is with Python's stdlib anydbm. The key-value lookup is as pythonic as it gets: it looks just like a dict, except that it's backed up to an on-disk BDB (or similar) database, cached as appropriate in memory, instead of being stored in an in-memory hash table.
If you need to handle a few other simple types—anything that's blazingly fast to un/pickle, like ints—you may want to consider shelve.
If your structure is rigid enough, you can use key-value database for the top level, but access the values through a ctypes.Structure, or de/serialize with struct. But usually, if you can do that, you can also eliminate the top level, at which point your whole thing is just one big Structure or Array.
At that point, you can just use a plain file for storage—either mmap it (for ctypes), or just open and read it (for struct).
Or use multiprocessing's Shared ctypes Objects to access your Structure directly out of a shared memory area.
Meanwhile, if you don't actually need all of the cache data all the time, just bits and pieces every once in a while, that's exactly what databases are for. Again, anydbm, etc. may be all you need, but if you've got complex structure, draw up an ER diagram, turn it into a set of tables, and use something like MySQL.
"python in-memory data will NOT be persisted across all requests to my knowledge, unless I'm terribly mistaken."
you are mistaken.
the whole point of using uwsgi over, say, the CGI mechanism is to persist data across threads and save the overhead of initialization for each call. you must set processes = 1 in your .ini file, or, depending on how uwsgi is configured, it might launch more than 1 worker process on your behalf. log the env and look for 'wsgi.multiprocess': False and 'wsgi.multithread': True, and all uwsgi.core threads for the single worker should show the same data.
you can also see how many worker processes, and "core" threads under each, you have by using the built-in stats-server.
that's why uwsgi provides lock and unlock functions for manipulating data stores by multiple threads.
you can easily test this by adding a /status route in your app that just dumps a json representation of your global data object, and view it every so often after actions that update the store.
You said nothing about writing this data back, is it static? In this case, the solution is every simple, and I have no clue what is up with all the "it's not feasible" responses.
Uwsgi workers are always-running applications. So data absolutely gets persisted between requests. All you need to do is store stuff in a global variable, that is it. And remember it's per-worker, and workers do restart from time to time, so you need proper loading/invalidation strategies.
If the data is updated very rarely (rarely enough to restart the server when it does), you can save even more. Just create the objects during app construction. This way, they will be created exactly once, and then all the workers will fork off the master, and reuse the same data. Of course, it's copy-on-write, so if you update it, you will lose the memory benefits (same thing will happen if python decides to compact its memory during a gc run, so it's not super predictable).
I have never actually tried it myself, but could you possibly use uWSGI's SharedArea to accomplish what you're after?
We've got a Python-based web server that unpickles a number of large data files on startup using cPickle. The data files (pickled using HIGHEST_PROTOCOL) are around 0.4 GB on disk and load into memory as about 1.2 GB of Python objects -- this takes about 20 seconds. We're using Python 2.6 on 64-bit Windows machines.
The bottleneck is certainly not disk (it takes less than 0.5s to actually read that much data), but memory allocation and object creation (there are millions of objects being created). We want to reduce the 20s to decrease startup time.
Is there any way to deserialize more than 1GB of objects into Python much faster than cPickle (like 5-10x)? Because the execution time is bound by memory allocation and object creation, I presume using another unpickling technique such as JSON wouldn't help here.
I know some interpreted languages have a way to save their entire memory image as a disk file, so they can load it back into memory all in one go, without allocation/creation for each object. Is there a way to do this, or achieve something similar, in Python?
Try the marshal module - it's internal (used by the byte-compiler) and intentionally not advertised much, but it is much faster. Note that it doesn't serialize arbitrary instances like pickle, only builtin types (don't remember the exact constraints, see docs). Also note that the format isn't stable.
If you need to initialize multiple processes and can tolerate one process always loaded, there is an elegant solution: load the objects in one process, and then do nothing in it except forking processes on demand. Forking is fast (copy on write) and shares the memory between all processes. [Disclaimers: untested; unlike Ruby, Python ref counting will trigger page copies so this is probably useless if you have huge objects and/or access a small fraction of them.]
If your objects contain lots of raw data like numpy arrays, you can memory-map them for much faster startup. pytables is also good for these scenarios.
If you'll only use a small part of the objects, then an OO database (like Zope's) can probably help you. Though if you need them all in memory, you will just waste lots of overhead for little gain. (never used one, so this might be nonsense).
Maybe other python implementations can do it? Don't know, just a thought...
Are you load()ing the pickled data directly from the file? What about to try to load the file into the memory and then do the load?
I would start with trying the cStringIO(); alternatively you may try to write your own version of StringIO that would use buffer() to slice the memory which would reduce the needed copy() operations (cStringIO still may be faster, but you'll have to try).
There are sometimes huge performance bottlenecks when doing these kinds of operations especially on Windows platform; the Windows system is somehow very unoptimized for doing lots of small reads while UNIXes cope quite well; if load() does lot of small reads or you are calling load() several times to read the data, this would help.
I haven't used cPickle (or Python) but in cases like this I think the best strategy is to
avoid unnecessary loading of the objects until they are really needed - say load after start up on a different thread, actually its usually better to avoid unnecessary loading/initialization at anytime for obvious reasons. Google 'lazy loading' or 'lazy initialization'. If you really need all the objects to do some task before server start up then maybe you can try to implement a manual custom deserialization method, in other words implement something yourself if you have intimate knowledge of the data you will deal with which can help you 'squeeze' better performance then the general tool for dealing with it.
Did you try sacrificing efficiency of pickling by not using HIGHEST_PROTOCOL? It isn't clear what performance costs are associated with using this protocol, but it might be worth a try.
Impossible to answer this without knowing more about what sort of data you are loading and how you are using it.
If it is some sort of business logic, maybe you should try turning it into a pre-compiled module;
If it is structured data, can you delegate it to a database and only pull what is needed?
Does the data have a regular structure? Is there any way to divide it up and decide what is required and only then load it?
I'll add another answer that might be helpful - if you can, can you try to define _slots_ on the class that is most commonly created? This may be a little limiting and impossible, however it seems to have cut the time needed for initialization on my test to about a half.