I wrote an application that parses a massive number of files into objects in memory. This is a very time-consuming operation. When it's done, it saves the generated objects to pickled files. The second time I run it, it unpickles everything and only re-parses the files that have changed.
The codebase, however, is not stable, and it's not safe to describe my software with a manually maintained, static version number. I want to figure out whether any of my application's code has changed since I pickled the data, and if it has, discard the cache and rebuild it.
How do I generate a signature of a list of python modules? My first thought was pickling the modules, but that doesn't work.
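One possible shape for such a signature, as a hedged sketch of my own rather than an established recipe: hash each module's source file and combine the digests. The function name module_signature and the reliance on __file__ are purely illustrative.

import hashlib
import importlib

def module_signature(module_names):
    # Combine the source hashes of the given modules into one digest (illustrative sketch).
    digest = hashlib.sha256()
    for name in sorted(module_names):
        module = importlib.import_module(name)
        with open(module.__file__, 'rb') as f:
            digest.update(f.read())
    return digest.hexdigest()

# If this signature differs from the one stored alongside the pickled cache,
# the cache would be considered stale and rebuilt.
print(module_signature(['os', 'json']))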
I have an application that dynamically generates a lot of Python modules with class factories, to eliminate redundant boilerplate that makes the code hard to debug across similar implementations. It works well, except that dynamically generating the classes across the modules (hundreds of them) takes more time to load than simply importing from a file. So I would like to find a way to save the modules to a file after generation (unless reset), then load from those files to cut down on bootstrap time for the platform.
Does anyone know how I can save/export auto-generated Python modules to a file for re-import later? I already know that pickling and exporting as a JSON object won't work, because the modules make use of thread locks and other dynamic state, and the classes must be defined before they can be pickled. I need to save the actual class definitions, not instances. The classes are defined with the type() function.
If you have ideas or knowledge on how to do this, I would really appreciate your input.
You’re basically asking how to write a compiler whose input is a module object and whose output is a .pyc file. (One plausible strategy is of course to generate a .py and then byte-compile that in the usual fashion; the following could even be adapted to do so.) It’s fairly easy to do this for simple cases: the .pyc format is very simple (but note the comments there), and the marshal module does all of the heavy lifting for it. One point of warning that might be obvious: if you’ve already evaluated, say, os.getcwd() when you generate the code, that’s not at all the same as evaluating it when loading it in a new process.
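To illustrate how little machinery is involved, here is a rough sketch of writing a .pyc by hand from a compiled code object. It assumes the CPython 3.7+ header layout (magic number, flags, mtime, source size, then the marshalled code); write_pyc is my own name, not part of any library.

import importlib.util
import marshal
import struct
import time

def write_pyc(code, path, source_size=0):
    # Header: magic, bit flags (0 = timestamp-based invalidation), mtime, source size.
    with open(path, 'wb') as f:
        f.write(importlib.util.MAGIC_NUMBER)
        f.write(struct.pack('<I', 0))
        f.write(struct.pack('<I', int(time.time()) & 0xFFFFFFFF))
        f.write(struct.pack('<I', source_size & 0xFFFFFFFF))
        marshal.dump(code, f)          # marshal does the heavy lifting

code = compile("GREETING = 'generated module loaded'", "<generated>", "exec")
write_pyc(code, "generated_module.pyc")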
The “only” other task is constructing the code objects for the module and each class: this requires concatenating a large number of boring values from the dis module, and will fail if any object encountered is non-trivial. These might be global/static variables/constants or default argument values: if you can alter your generator to produce modules directly, you can probably wrap all of these (along with anything else you want to defer) in function calls by compiling something like
my_global = (lambda: open(os.devnull, 'w'))()
so that you actually emit the function and then a call to it. If you can’t so alter it, you’ll have to have rules to recognize values that need to be constructed in this fashion so that you can replace them with such calls.
Another detail that may be important is closures: if your generator uses local functions/classes, you’ll need to create the cell objects, perhaps via “fake” closures of your own:
def cell(x): return (lambda: x).__closure__[0]
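To make the trick concrete, here is a small sketch of my own (repeating cell from above so the snippet is self-contained) that pairs such a cell with types.FunctionType to rebuild a function around an existing code object that has a free variable:

import types

def cell(x):
    # Capture x in a throwaway closure just to obtain a real cell object.
    return (lambda: x).__closure__[0]

def outer():
    y = 10
    def inner():
        return y
    return inner

inner = outer()
# Rebuild a function from inner's code object, supplying our own cell for
# the free variable `y` instead of the one created by outer().
rebuilt = types.FunctionType(inner.__code__, globals(), 'rebuilt', None, (cell(99),))
print(inner())    # 10
print(rebuilt())  # 99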
The Python documentation says this about the sync() method:
Write back all entries in the cache if the shelf was opened with
writeback set to True. Also empty the cache and synchronize the
persistent dictionary on disk, if feasible. This is called
automatically when the shelf is closed with close().
I am really having a hard time understanding this.
How does accessing data from cache differ from accessing data from disk?
And does emptying the cache affect how we can access the data stored in a shelve?
For whoever is using the data in the shelve object, it is transparent whether the data is in the cache or on disk. If it is not in the cache, the file is read, the cache is filled, and the value is returned. Otherwise, the value as it currently sits in the cache is used.
If the cache is emptied when sync is called, that only means that the next value fetched from the same shelve instance will be read from the file again. Since it is all automatic, there is no difference for the caller. The documentation is mostly describing how it is implemented.
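A small sketch of what that looks like in practice (the file name is made up):

import shelve

with shelve.open('example_shelf', writeback=True) as db:
    db['config'] = {'retries': 3}
    db['config']['retries'] = 5     # with writeback=True this mutates the cached copy
    db.sync()                       # write the cache back to disk and empty it
    print(db['config']['retries'])  # next access re-reads (and re-caches) from disk: 5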
If you are trying to open the same shelve file from two concurrent apps, or even from two shelve instances in the same program, chances are you are headed for big problems. Other than that, it just behaves as a "persistent dictionary", and that is it.
This pattern of writing to disk and re-reading from a single file makes no difference for the workload of a single user in an interactive program. For a Python program running as a server with tens to thousands of clients, or even for a single big-data processing script, where this could impact actual performance, shelve is hardly a usable thing anyway.
I have a big pickle file containing hundreds of trained R models in Python: these are statistical models built with the rpy2 library.
I have a class that loads the pickle file every time one of its methods is called (this method is called several times in a loop).
It happens that the memory required to load the pickle file's content (around 100 MB) is never freed, even when there are no references left pointing to the loaded content. I open and close the input file correctly. I have also tried reloading the pickle module (and even rpy) at every iteration. Nothing changes. It seems that just the act of loading the content permanently locks up some memory.
I can reproduce the issue, and this is now an open issue in the rpy2 issue tracker: https://bitbucket.org/rpy2/rpy2/issues/321/memory-leak-when-unpickling-r-objects
edit: The issue is resolved and the fix is included in rpy2-2.7.5 (just released).
If you follow this advice, please do so tentatively because I am not 100% sure of this solution but I wanted to try to help you if I could.
CPython's primary memory-management mechanism is reference counting: it tracks how many references point to an object and frees the object as soon as that count drops to zero. Reference counting cannot reclaim reference cycles, though, so Python also runs a separate, generational cycle collector.
That cycle collector runs periodically rather than immediately, because scanning for cycles on every deallocation would slow programs down (especially when it isn't needed).
In the case of your program, even though you no longer point to certain objects, the collector might not have come around to freeing them yet, so you can trigger it manually using:
import gc

gc.enable()   # make sure automatic collection is enabled (it is on by default)
gc.collect()  # force a full garbage collection pass right now
If you would like to read more, here is the link to Python garbage collection documentation. I hope this helps Marco!
I doubt this is even possible, but here is the problem and proposed solution (the feasibility of the proposed solution is the object of this question):
I have some "global data" that needs to be available for all requests. I'm persisting this data to Riak and using Redis as a caching layer for access speed (for now...). The data is split into about 30 logical chunks, each about 8 KB.
Each request is required to read 4 of these 8KB chunks, resulting in 32KB of data read in from Redis or Riak. This is in ADDITION to any request-specific data which would also need to be read (which is quite a bit).
Assuming even 3000 requests per second (this isn't a live server, so I don't have real numbers, but 3000 per second is a reasonable assumption, and it could be more), this means roughly 96 MB/s of transfer (3000 × 32 KB) from Redis or Riak, in ADDITION to the already not-insignificant other calls being made from the application logic. Also, Python is parsing the JSON of these 8 KB objects 3000 times every second.
All of this - especially Python having to repeatedly deserialize the data - seems like an utter waste, and a perfectly elegant solution would be to just have the deserialized data cached in an in-memory native object in Python, which I can refresh periodically as and when all this "static" data becomes stale. Once in a few minutes (or hours), instead of 3000 times per second.
But I don't know if this is even possible. You'd realistically need an "always running" application for it to cache any data in its memory. And I know this is not the case in the nginx+uwsgi+python combination (versus something like node) - python in-memory data will NOT be persisted across all requests to my knowledge, unless I'm terribly mistaken.
Unfortunately this is a system I have "inherited" and therefore can't make too many changes in terms of the base technology, nor am I knowledgeable enough of how the nginx+uwsgi+python combination works in terms of starting up Python processes and persisting Python in-memory data - which means I COULD be terribly mistaken with my assumption above!
So, direct advice on whether this solution would work + references to material that could help me understand how the nginx+uwsgi+python would work in terms of starting new processes and memory allocation, would help greatly.
P.S.:
I have gone through some of the documentation for nginx, uWSGI, etc., but haven't fully understood the ramifications for my use-case yet. Hope to make some progress on that going forward.
If the in-memory thing COULD work out, I would chuck Redis, since the only thing I'm caching in it is the static data mentioned above. This makes an in-process persistent in-memory Python cache even more attractive to me, reducing one moving part in the system and at least FOUR network round-trips per request.
What you're suggesting isn't directly feasible. Since new processes can be spun up and down outside of your control, there's no way to keep native Python data in memory.
However, there are a few ways around this.
Often, one level of key-value storage is all you need. And sometimes, fixed-size buffers for values (which you can use directly as str/bytes/bytearray objects; anything else you need to pack in with struct or otherwise serialize) are all you need. In that case, uWSGI's built-in caching framework will take care of everything you need.
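For illustration, a sketch of what using that cache from Python could look like. It assumes uWSGI was started with a cache configured (for example --cache2 name=static,items=64,blocksize=65536), and fetch_chunk_from_riak is a hypothetical loader, not a real API:

import json
import uwsgi  # only importable when running under uWSGI

def get_chunk(chunk_id):
    key = 'chunk:%d' % chunk_id
    raw = uwsgi.cache_get(key, 'static')
    if raw is None:
        raw = fetch_chunk_from_riak(chunk_id)        # hypothetical loader
        uwsgi.cache_set(key, raw, 300, 'static')     # cache for 300 seconds
    return json.loads(raw)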
If you need more precise control, you can look at how the cache is implemented on top of SharedArea and build something custom. However, I wouldn't recommend that. It basically gives you the same kind of API you get with a file, and the only real advantages over just using a file are that the server will manage the file's lifetime; it works in all uWSGI-supported languages, even those that don't allow files; and it makes it easier to migrate your custom cache to a distributed (multi-computer) cache later if you need to. I don't think any of those are relevant to you.
Another way to get flat key-value storage, but without the fixed-size buffers, is with Python's stdlib anydbm (called dbm in Python 3). The key-value lookup is as Pythonic as it gets: it looks just like a dict, except that it's backed by an on-disk BDB (or similar) database and cached in memory as appropriate, instead of being stored in an in-memory hash table.
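A tiny sketch, using the Python 3 name dbm (the file name and key are made up):

import dbm  # `anydbm` in Python 2

with dbm.open('global_cache', 'c') as db:      # 'c' = create the file if needed
    db['chunk:4'] = b'{"some": "json blob"}'   # keys and values are bytes
    print(db['chunk:4'])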
If you need to handle a few other simple types—anything that's blazingly fast to un/pickle, like ints—you may want to consider shelve.
If your structure is rigid enough, you can use a key-value database for the top level, but access the values through a ctypes.Structure, or de/serialize them with struct. But usually, if you can do that, you can also eliminate the top level, at which point your whole thing is just one big Structure or Array.
At that point, you can just use a plain file for storage—either mmap it (for ctypes), or just open and read it (for struct).
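A rough sketch of the mmap + ctypes flavor, with a made-up record layout and file name:

import ctypes
import mmap

class Header(ctypes.Structure):
    # Hypothetical fixed layout for the start of the data file.
    _fields_ = [('version', ctypes.c_uint32),
                ('n_chunks', ctypes.c_uint32)]

with open('global_data.bin', 'r+b') as f:
    mm = mmap.mmap(f.fileno(), 0)
    header = Header.from_buffer(mm)   # a view over the mapped bytes, no copying
    print(header.version, header.n_chunks)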
Or use multiprocessing's Shared ctypes Objects to access your Structure directly out of a shared memory area.
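A minimal sketch of that multiprocessing flavor (shared memory created before the workers fork is visible to all of them; sizes here are arbitrary):

from multiprocessing import Value, Array

counter = Value('I', 0)       # one shared unsigned int
chunk = Array('B', 8192)      # an 8 KB shared byte buffer
with counter.get_lock():      # Value/Array come with an optional lock
    counter.value += 1
chunk[0] = 42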
Meanwhile, if you don't actually need all of the cache data all the time, just bits and pieces every once in a while, that's exactly what databases are for. Again, anydbm, etc. may be all you need, but if you've got complex structure, draw up an ER diagram, turn it into a set of tables, and use something like MySQL.
"python in-memory data will NOT be persisted across all requests to my knowledge, unless I'm terribly mistaken."
you are mistaken.
the whole point of using uwsgi over, say, the CGI mechanism is to persist data across threads and save the overhead of initialization for each call. you must set processes = 1 in your .ini file, or, depending on how uwsgi is configured, it might launch more than 1 worker process on your behalf. log the env and look for 'wsgi.multiprocess': False and 'wsgi.multithread': True, and all uwsgi.core threads for the single worker should show the same data.
you can also see how many worker processes, and "core" threads under each, you have by using the built-in stats-server.
that's why uwsgi provides lock and unlock functions for manipulating data stores by multiple threads.
you can easily test this by adding a /status route in your app that just dumps a json representation of your global data object, and viewing it every so often after actions that update the store.
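A bare-bones sketch of such an app (the shape of the global object is made up):

import json
import time

# Per-worker global state; populated once when the worker process starts.
GLOBAL_DATA = {'loaded_at': time.time(), 'chunks': {}}

def application(environ, start_response):
    if environ['PATH_INFO'] == '/status':
        body = json.dumps(GLOBAL_DATA).encode('utf-8')
        start_response('200 OK', [('Content-Type', 'application/json')])
        return [body]
    start_response('404 Not Found', [('Content-Type', 'text/plain')])
    return [b'not found']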
You said nothing about writing this data back. Is it static? In that case, the solution is very simple, and I have no clue what is up with all the "it's not feasible" responses.
uWSGI workers are always-running applications, so data absolutely gets persisted between requests. All you need to do is store stuff in a global variable; that is it. And remember that it's per-worker, and that workers do restart from time to time, so you need proper loading/invalidation strategies.
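One possible loading/invalidation strategy, sketched here with a hypothetical load_from_riak loader and an assumed five-minute staleness window:

import time

_CACHE = {'data': None, 'loaded_at': 0.0}
TTL = 300  # seconds before this worker's copy is considered stale (assumed value)

def get_global_data():
    # Reload the deserialized blob only when it has gone stale, not on every request.
    if _CACHE['data'] is None or time.time() - _CACHE['loaded_at'] > TTL:
        _CACHE['data'] = load_from_riak()          # hypothetical loader
        _CACHE['loaded_at'] = time.time()
    return _CACHE['data']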
If the data is updated very rarely (rarely enough to restart the server when it does), you can save even more. Just create the objects during app construction. This way, they will be created exactly once, and then all the workers will fork off the master and reuse the same data. Of course, it's copy-on-write, so if you update it you will lose the memory benefits (and the same thing can happen simply because Python touches the objects' reference counts and gc bookkeeping, so it's not super predictable).
I have never actually tried it myself, but could you possibly use uWSGI's SharedArea to accomplish what you're after?
After having successfully built a static data structure (see here), I want to avoid having to build it from scratch every time a user requests an operation on it. My naïve first idea was to dump the structure (using Python's pickle) into a file and load this file for each query. Needless to say (as I found out), this turns out to be too time-consuming, as the file is rather large.
Any ideas on how I can easily speed this up? Splitting the file into multiple files? Or a program running on the server? (How difficult would that be to implement?)
Thanks for your help!
You can dump it into a memory cache (such as memcached).
This method has the advantage of cache-key invalidation: when the underlying data changes, you can invalidate your cached data.
EDIT
Here's a Python client for memcached: python-memcached. Thanks NicDumZ.
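A small sketch with that client; the server address and key name are made up, and build_structure stands in for however you construct the data:

import memcache  # from the python-memcached package

mc = memcache.Client(['127.0.0.1:11211'])

structure = mc.get('static_structure')
if structure is None:
    structure = build_structure()                     # hypothetical rebuild on a cache miss
    mc.set('static_structure', structure, time=3600)  # values are pickled transparently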
If you can rebuild your Python runtime with the patches offered in the Unladen Swallow project, you should see speedups of 40% to 150% in pickling, 36% to 56% in unpickling, according to their benchmarks; maybe that might help.
My suggestion would be not to rely on having an object structure. Instead, have a byte array (or an mmap'd file, etc.) that you can do random-access operations on, and implement the cross-referencing using pointers (offsets) inside that structure.
True, it will introduce the concept of pointers to your code, but it will mean that you don't need to unpickle it each time the handler process starts up, and it will also use a lot less memory (as there won't be the overhead of python objects).
As your database is going to be fixed during the lifetime of a handler process (I imagine), you won't need to worry about concurrent modifications or locking etc.
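A toy sketch of the idea, with records cross-referencing each other by byte offsets inside one flat buffer (the two-field record layout is made up):

import struct

buf = bytearray(1024)
# Each record is two unsigned 32-bit ints: (value, offset of the next record; 0 = end).
struct.pack_into('<II', buf, 0, 42, 16)
struct.pack_into('<II', buf, 16, 99, 0)

offset = 0
while True:
    value, next_offset = struct.unpack_from('<II', buf, offset)
    print(value)
    if next_offset == 0:
        break
    offset = next_offset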
Even if you did what you suggest, you shouldn't have to rebuild it on every user request; just keep an instance in memory in your worker process(es). Then the build cost doesn't matter much, since you only pay it when a new worker process starts.
The number one way to speed up your web application, especially when you have lots of mostly-static modules, classes and objects that need to be initialized: use a way of serving your application that supports handling multiple requests from a single interpreter, such as mod_wsgi, mod_python, SCGI, FastCGI, Google App Engine, a Python web server... basically anything except a standard CGI script that starts a new Python process for every request. With this approach, you can make your data structure a global object that only needs to be read from a serialized format for each new process—which is much less frequent.