What would be the best way to handle lightweight crash recovery for my program?
I have a Python program that runs a number of test cases and the results are stored in a dictionary which serves as a cache. If I could save (and then restore) each item that is added to the dictionary, I could simply run the program again and the caching would provide suitable crash recovery.
You may assume that the keys and values in the dictionary are easily convertible to strings, i.e. using either str or the pickle module.
I want this to be completely cross-platform, or at least as cross-platform as Python is.
I don't want to simply write out each value to a file and load it back in, because my program might crash while I am writing the file.
UPDATE: This is intended to be a lightweight module so a DBMS is out of the question.
UPDATE: Alex is correct in that I don't actually need to protect against crashes while writing out, but there are circumstances where I would like to be able to manually terminate it in a recoverable state.
UPDATE: Added a highly limited solution using standard input below.
There's no good way to guard against "your program crashing while writing a checkpoint to a file", but why should you worry so much about that?! What ELSE is your program doing at that time BESIDES "saving checkpoint to a file", that could easily cause it to crash?!
It's hard to beat pickle (or cPickle) for portability of serialization in Python, but that's just about "turning your keys and values to strings". For saving key-value pairs (once stringified), few approaches are safer than just appending to a file (don't pickle to files if your crashes are far, far more frequent than normal, as you suggest they are).
If your environment is incredibly crash-prone for whatever reason (very cheap HW?-), just make sure you close the file after each append (and fsync it first if the OS is also crash-prone;-), then reopen it for append. This way, the worst that can happen is that the very latest append will be incomplete (due to a crash in the middle of things) -- then you just catch the exception raised by unpickling that incomplete record and redo only the things that weren't saved (because they weren't completed due to a crash, OR because they were completed but not fully saved due to a crash; it comes to much the same thing in the end).
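A minimal sketch of the append-and-recover approach described above, written for Python 3. The function names and the one-pickle-per-record layout are assumptions made for illustration, not part of the original answer.

```python
import os
import pickle


def append_record(path, key, value):
    """Append one pickled (key, value) pair and push it to disk."""
    with open(path, "ab") as f:
        pickle.dump((key, value), f)
        f.flush()             # move Python's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to write it to the device


def load_records(path):
    """Rebuild the cache, ignoring a truncated final record."""
    cache = {}
    if not os.path.exists(path):
        return cache
    with open(path, "rb") as f:
        while True:
            try:
                key, value = pickle.load(f)
            except EOFError:
                break  # clean end of file
            except pickle.UnpicklingError:
                break  # partial last record left by a crash
            cache[key] = value
    return cache
```

On restart, any test case missing from the dictionary returned by load_records is simply run again.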
If you have the option of checkpointing to a database engine (instead of just doing so to files), consider it seriously! The DB engine will keep transaction logs and ensure ACID properties, making your application-side programming much easier IF you can count on that!-)
The pickle module supports serializing objects to a file (and loading from file):
http://docs.python.org/library/pickle.html
One possibility would be to create a number of smaller files ... each representing a subset of the state that you're trying to preserve and each with a checksum or tag indicating that it's complete as the last line/datum of the file (just before the file is closed).
If the checksum/tag is good then the rest of the data can be considered valid ... though program would then have to find all of these files, open and read all of them, and use meta data you've provided (in their headers or their names?) to determine which ones constitute the most recent cohesive state representation (or checkpoint) from which you can continue processing.
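As a rough illustration of the checksum-tagged files idea, here is a Python 3 sketch. The file naming scheme, the JSON payload, and the choice of SHA-256 are all assumptions; the answer does not prescribe any of them.

```python
import hashlib
import json
import os


def write_checkpoint(directory, seq, state):
    """Write state as one JSON line, then its SHA-256 digest as the last line."""
    payload = json.dumps(state, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    path = os.path.join(directory, "checkpoint-%06d.txt" % seq)
    with open(path, "w", encoding="utf-8") as f:
        f.write(payload + "\n" + digest + "\n")
    return path


def load_latest_checkpoint(directory):
    """Return the newest state whose digest matches, or None if none is valid."""
    names = sorted(n for n in os.listdir(directory) if n.startswith("checkpoint-"))
    for name in reversed(names):
        with open(os.path.join(directory, name), encoding="utf-8") as f:
            lines = f.read().splitlines()
        if len(lines) != 2:
            continue  # file was not finished
        payload, digest = lines
        if hashlib.sha256(payload.encode("utf-8")).hexdigest() == digest:
            return json.loads(payload)
    return None
```

Here the sequence number in the file name plays the role of the metadata that identifies the most recent checkpoint.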
Without knowing more about the nature of the data that you're working with it's impossible to be more specific.
You can use files, of course, or you could use a DBMS system just about as easily. Any decent DBMS (PostgreSQL, MySQL if you're using the proper storage back-ends) can give you ACID guarantees and transactional support. So the data you read back should always be consistent with the constraints that you put in your schema and/or with the transactions (BEGIN, COMMIT, ROLLBACK) that you processed.
A possible advantage of posting your serialized data to a DBMS is that you can host the DBMS on a separate system (which is unlikely to suffer the same instabilities as your test host at the same times).
Pickle/cPickle have problems.
I use the JSON module to serialize objects out. I like it because not only does it work on any OS, but it will work fine in other programming languages, too; many other languages and platforms have readily-accessible JSON deserialization support, which makes it easy to use the same objects in different programs.
Solution with severe restrictions
If I don't worry about it crashing while writing out and I only want to allow manual termination, I can use standard input to control this. Unfortunately, this can only terminate the program when a control point is reached. This could be solved by creating a new thread to read standard input. This thread could use a global lock to check if the main thread is inside a critical section (writing to a file) and terminate the program if this is not the case.
Downsides:
This is reasonably complex
It adds an extra thread
It stops me using standard input for anything else
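The thread-plus-lock idea above could look roughly like the following Python 3 sketch. The names write_lock, save_item and watch_for_quit are hypothetical, and the stream and exit action are passed in as parameters purely so the logic can be exercised without a real terminal.

```python
import threading

# Held by the main thread for the duration of each file write.
write_lock = threading.Lock()


def save_item(path, line):
    """Critical section: append one line while holding the lock."""
    with write_lock:
        with open(path, "a") as f:
            f.write(line + "\n")


def watch_for_quit(stream, on_quit, command="quit"):
    """Read lines from stream; on the quit command, wait for any
    in-progress write to finish, then call on_quit."""
    for raw in stream:
        if raw.strip() == command:
            with write_lock:  # blocks while a write is in progress
                on_quit()
            return True
    return False
```

In a real program the watcher would presumably be started with something like threading.Thread(target=watch_for_quit, args=(sys.stdin, lambda: os._exit(0)), daemon=True).start(), so that typing quit ends the process only between writes.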
Related
I'm strictly a python script writer, only ever done one-off scripts. Mainly for string manipulation etc. However, I consider myself proficient enough to be able to handle (with much searching) most of the implementation details of what I want to do (most of which is already done in various scripts).
My current project would involve a UI (let's assume in PyQt, I have not decided but probably wouldn't go with tkinter) which displays data. I haven't done the UI as my scripts so far have all been command-line.
I'd like there to be a separate process which handles the updating of said data. The data store would be a bunch of XML files (unfortunately this is a requirement of the project[1]). Due to the unbounded number of XML files potentially available, I think a separate process would prevent my UI getting locked. In my language of choice (C++ with QT) I'd just use threading, but reading about the GIL it seems I should instead use processes.
My current idea is for one process which reads the XML files and potentially encodes them in some convenient format for my UI process. This process would probably also monitor the data store for any possible file additions/deletions/modifications. Finally, in the encoding process, I probably also want to maintain an index of search terms to increase responsiveness. I expect fairly heavy computational load in this process, which is why I intend to split it off. A full scan of my current data store (not yet doing all the processing I would want) takes about half a second, and I plan to grow it.
The UI process accepts user input (for example, a search term) and displays the necessary results. There will also be a slight amount of processing, but nothing taxing. The user may also choose to save the record she's currently viewing, but I'm undecided whether the actual file change should be done by the UI process or it should be handed off to the background process.
In conclusion:-
What's the best way to share what I presume will be a large-ish python object between my processes? Is it queues, pipes, writing/reading to a separate database object, or something else?
I'm operating on the assumption that the UI process needs the ENTIRE data store. In practice, it possibly only needs a summary (think client-server architecture between UI process and data store process), but this would of course involve more overhead coding/maintenance wise. Is this considered good practice for an application which will always only run on one device?
Additional information:-
[1] - Requirement for XML files is because they are easily shared between devices via file-sync services such as dropbox etc. in a reasonably atomic manner. Since this project requires record-based synchronization, including allowing simultaneous edits (post-merging is possible) in different machines, I'd rather let the third party file-sync service handle it than write my own buggy synchronization tool. Also, and most crucially, there are already users of this data store using it in its current XML form, so it would be extremely difficult to change it.
Say I store a password in plain text in a variable called passWd as a string.
How does Python release this variable once I discard it (for instance, with del passWd or passWd = 'new random data')?
Is the string stored as a byte array, meaning it can be overwritten in the memory location where it originally existed? Or is it a fixed value in a memory area that can't be modified, so that assigning a new value creates a new memory area and the old area is discarded but not overwritten with nulls?
I'm questioning how Python implements the safety of memory areas and would like to know more about it, mainly because I'm curious :)
From what I've gathered so far, using del (or __del__) causes the interpreter to not release the memory areas of that variable automatically, which can cause issues, and I'm also not sure that del is all that thorough about deleting the values. But that's just what I've gathered and not something in black and white :)
The main reason I'm asking is that I intend to write a hand-over application that gets a string, does some I/O, and passes it along to another subsystem (a bootloader for a Raspberry Pi, for instance). The interface is written in Python (how odd that must sound to some people's ears...). I'm not worried that the data is compromised during the I/O calculations, but that a memory dump might occur between the two subsystem handovers, or that the system is frozen (say, hibernated) 20 minutes after boot and, although I removed the variable as fast as I could, it's somehow still in memory despite my doing a del passWd :)
(P.S. I asked on Superuser and they referred me here, and I'm sorry for the poor grammar!)
Unless you use custom coded input methods to get the password, it will be in many more places than just your immutable string. So don't worry too much.
The OS should take care that any data from your process is cleared before the memory is allocated to another process. This may of course fail if the page is copied to disk (swapped out or hibernated).
Secure password entry is not easy. Maybe you can find a special library or module that handles this.
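To illustrate the immutability point from this answer: a str cannot be modified in place, whereas a bytearray can be overwritten. The sketch below (the wipe helper and the sample value are made up for illustration) only clears that one buffer; it does not remove other copies that the interpreter, the input routine, or the OS may have made, so it should not be read as a secure solution.

```python
def wipe(buf):
    """Overwrite a mutable buffer in place with zero bytes."""
    for i in range(len(buf)):
        buf[i] = 0


password = bytearray(b"s3cret")  # hypothetical secret in a mutable buffer
same_object = password           # second name for the same buffer
wipe(password)                   # both names now see only zero bytes

text = "s3cret"
try:
    text[0] = "x"                # str does not support item assignment
    str_is_immutable = False
except TypeError:
    str_is_immutable = True
```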
I finally went with two solutions.
LD_PRELOAD to replace the functionality of Python's string handling at a lower level.
One other option, which is a bit easier, was to develop my own C library that has more functionality than what Python offers through the standard string handling.
Mainly the C code has a shread() function that writes over the memory area where the string "was" stored, plus some other error checks.
However, #Ber gave me a good enough answer to start developing my own solution since (as he pointed out) there is no secure method in Python, and Python stores strings in way too many places and relies on the OS (which, on its own, isn't a bad thing except when you don't trust the OS you are installing your relatively secure application on).
I doubt this is even possible, but here is the problem and proposed solution (the feasibility of the proposed solution is the object of this question):
I have some "global data" that needs to be available for all requests. I'm persisting this data to Riak and using Redis as a caching layer for access speed (for now...). The data is split into about 30 logical chunks, each about 8 KB.
Each request is required to read 4 of these 8KB chunks, resulting in 32KB of data read in from Redis or Riak. This is in ADDITION to any request-specific data which would also need to be read (which is quite a bit).
Assuming even 3000 requests per second (this isn't a live server so I don't have real numbers, but 3000 per second is a reasonable assumption, could be more), this means roughly 96 MB per second of transfer from Redis or Riak (3000 requests at 32 KB each) in ADDITION to the already not-insignificant other calls being made from the application logic. Also, Python is parsing the JSON of these 8KB objects 3000 times every second.
All of this - especially Python having to repeatedly deserialize the data - seems like an utter waste, and a perfectly elegant solution would be to just have the deserialized data cached in an in-memory native object in Python, which I can refresh periodically as and when all this "static" data becomes stale. Once in a few minutes (or hours), instead of 3000 times per second.
But I don't know if this is even possible. You'd realistically need an "always running" application for it to cache any data in its memory. And I know this is not the case in the nginx+uwsgi+python combination (versus something like node) - python in-memory data will NOT be persisted across all requests to my knowledge, unless I'm terribly mistaken.
Unfortunately this is a system I have "inherited" and therefore can't make too many changes in terms of the base technology, nor am I knowledgeable enough of how the nginx+uwsgi+python combination works in terms of starting up Python processes and persisting Python in-memory data - which means I COULD be terribly mistaken with my assumption above!
So, direct advice on whether this solution would work + references to material that could help me understand how the nginx+uwsgi+python would work in terms of starting new processes and memory allocation, would help greatly.
P.S:
Have gone through some of the documentation for nginx, uwsgi etc but haven't fully understood the ramifications per my use-case yet. Hope to make some progress on that going forward now
If the in-memory thing COULD work out, I would chuck Redis, since I'm caching ONLY the static data I mentioned above, in it. This makes an in-process persistent in-memory Python cache even more attractive for me, reducing one moving part in the system and at least FOUR network round-trips per request.
What you're suggesting isn't directly feasible. Since new processes can be spun up and down outside of your control, there's no way to keep native Python data in memory.
However, there are a few ways around this.
Often, one level of key-value storage is all you need. And sometimes, having fixed-size buffers for values (which you can use directly as str/bytes/bytearray objects; anything else you need to struct in there or otherwise serialize) is all you need. In that case, uWSGI's built-in caching framework will take care of everything you need.
If you need more precise control, you can look at how the cache is implemented on top of SharedArea and do something custom. However, I wouldn't recommend that. It basically gives you the same kind of API you get with a file, and the only real advantages over just using a file are that the server will manage the file's lifetime; it works in all uWSGI-supported languages, even those that don't allow files; and it makes it easier to migrate your custom cache to a distributed (multi-computer) cache if you later need to. I don't think any of those are relevant to you.
Another way to get flat key-value storage, but without the fixed-size buffers, is with Python's stdlib anydbm. The key-value lookup is as pythonic as it gets: it looks just like a dict, except that it's backed up to an on-disk BDB (or similar) database, cached as appropriate in memory, instead of being stored in an in-memory hash table.
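A small sketch of the dict-like, disk-backed lookup described above. Note that anydbm is the Python 2 name; in Python 3 the same module is called dbm, which is what this example uses. The helper names and the choice of JSON for the stored values are assumptions.

```python
import dbm
import json


def store_chunk(db_path, name, chunk):
    """Serialize one chunk and store it under name."""
    with dbm.open(db_path, "c") as db:  # "c": create the database if missing
        db[name.encode("utf-8")] = json.dumps(chunk).encode("utf-8")


def load_chunk(db_path, name):
    """Fetch and deserialize one chunk; raises KeyError if absent."""
    with dbm.open(db_path, "r") as db:
        return json.loads(db[name.encode("utf-8")].decode("utf-8"))
```

Keys and values come back as bytes, so anything richer than a byte string still has to be serialized, as the answer points out.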
If you need to handle a few other simple types—anything that's blazingly fast to un/pickle, like ints—you may want to consider shelve.
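For comparison, a minimal shelve sketch in Python 3. shelve pickles the values for you, so ints, floats and other simple types can be stored directly; the helper names here are hypothetical.

```python
import shelve


def save_settings(path, **values):
    """Store each keyword argument under its own key."""
    with shelve.open(path) as shelf:
        shelf.update(values)


def read_setting(path, key, default=None):
    """Return the stored value for key, or default if it is missing."""
    with shelve.open(path) as shelf:
        return shelf.get(key, default)
```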
If your structure is rigid enough, you can use key-value database for the top level, but access the values through a ctypes.Structure, or de/serialize with struct. But usually, if you can do that, you can also eliminate the top level, at which point your whole thing is just one big Structure or Array.
At that point, you can just use a plain file for storage—either mmap it (for ctypes), or just open and read it (for struct).
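One way the plain-file variant with struct and mmap might look, as a Python 3 sketch. The record layout (a 32-bit id followed by a 64-bit float) is invented for the example; a real cache would define its own fixed layout.

```python
import mmap
import struct

# Hypothetical fixed-size record: little-endian uint32 id + float64 score.
RECORD = struct.Struct("<Id")


def write_records(path, records):
    """Write (id, score) pairs back to back as fixed-size records."""
    with open(path, "wb") as f:
        for rec_id, score in records:
            f.write(RECORD.pack(rec_id, score))


def read_record(path, index):
    """Map the file and decode only the record at the given index."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return RECORD.unpack_from(m, index * RECORD.size)
```

Because every record has the same size, a lookup is an offset calculation rather than a parse of the whole file. Mapping an empty file raises an error, so the file must contain at least one record.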
Or use multiprocessing's Shared ctypes Objects to access your Structure directly out of a shared memory area.
Meanwhile, if you don't actually need all of the cache data all the time, just bits and pieces every once in a while, that's exactly what databases are for. Again, anydbm, etc. may be all you need, but if you've got complex structure, draw up an ER diagram, turn it into a set of tables, and use something like MySQL.
"python in-memory data will NOT be persisted across all requests to my knowledge, unless I'm terribly mistaken."
you are mistaken.
the whole point of using uwsgi over, say, the CGI mechanism is to persist data across threads and save the overhead of initialization for each call. you must set processes = 1 in your .ini file, or, depending on how uwsgi is configured, it might launch more than 1 worker process on your behalf. log the env and look for 'wsgi.multiprocess': False and 'wsgi.multithread': True, and all uwsgi.core threads for the single worker should show the same data.
you can also see how many worker processes, and "core" threads under each, you have by using the built-in stats-server.
that's why uwsgi provides lock and unlock functions for manipulating data stores by multiple threads.
you can easily test this by adding a /status route in your app that just dumps a json representation of your global data object, and view it every so often after actions that update the store.
You said nothing about writing this data back, is it static? In this case, the solution is very simple, and I have no clue what is up with all the "it's not feasible" responses.
Uwsgi workers are always-running applications. So data absolutely gets persisted between requests. All you need to do is store stuff in a global variable, that is it. And remember it's per-worker, and workers do restart from time to time, so you need proper loading/invalidation strategies.
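The global-variable approach with a simple invalidation strategy could be sketched like this. The names _cache and get_static_data, the 300-second default, and the injectable clock are assumptions for illustration; loader stands in for whatever reads and deserializes the chunks from Redis or Riak.

```python
import time

# Module-level state: each worker process ends up with its own copy.
_cache = {"data": None, "loaded_at": None}


def get_static_data(loader, max_age=300.0, now=time.monotonic):
    """Return the cached data, calling loader() only when the copy
    is missing or older than max_age seconds."""
    t = now()
    if _cache["loaded_at"] is None or t - _cache["loaded_at"] > max_age:
        _cache["data"] = loader()
        _cache["loaded_at"] = t
    return _cache["data"]
```

If the worker also runs multiple threads, the refresh would probably need to be wrapped in a lock so that two requests do not reload at the same time.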
If the data is updated very rarely (rarely enough to restart the server when it does), you can save even more. Just create the objects during app construction. This way, they will be created exactly once, and then all the workers will fork off the master, and reuse the same data. Of course, it's copy-on-write, so if you update it, you will lose the memory benefits (same thing will happen if python decides to compact its memory during a gc run, so it's not super predictable).
I have never actually tried it myself, but could you possibly use uWSGI's SharedArea to accomplish what you're after?
I'm developing a Python command line utility that potentially involves rather large queries against a set of files. It's a reasonably finite list of queries (think indexed DB columns). To improve performance in-process I can generate sorted/structured lists, maps and trees once, and hit those repeatedly, rather than hit the file system each time.
However, these caches are lost when the process ends, and need to be rebuilt every time the script runs, which dramatically increases the runtime of my program. I'd like to identify the best way to share this data between multiple executions of my command, which may be concurrent, one after another, or with significant delays between executions.
Requirements:
Must be fast - any sort of per-execution processing should be minimized, this includes disk IO and object construction.
Must be OS agnostic (or at least be able to hook into similar underlying behaviors on Unix/Windows, which is more likely).
Must allow reasonably complex querying / filtering - I don't think a key/value map will be good enough
Does not need to be up-to-date - (briefly) stale data is perfectly fine, this is just a cache, the actual data is being written to disk separately.
Can't use a heavyweight daemon process, like MySQL or MemCached - I want to minimize installation costs, and asking each user to install these services is too much.
Preferences:
I'd like to avoid any sort of long-running daemon process at all, if possible.
While I'd like to be able to update the cache quickly, rebuilding the whole cache on update isn't the end of the world, fast reads are much more important than fast writes.
In my ideal fantasy world, I'd be able to directly keep Python objects around between executions, sort of like Java threads (like Tomcat requests) sharing singleton data store objects, but I realize that may not be possible. The closer I can get to that though, the better.
Candidates:
SQLite in memory
SQLite on its own doesn't seem fast enough for my use case, since it's backed by disk and therefore will have to read from the file on every execution. Perhaps this isn't as bad as it seems, but it seems necessary to persistently store the database in memory. SQLite allows for DBs to use memory as storage but these DBs are destroyed upon program exit, and cannot be shared between instances.
Flat file database loaded into memory with mmap
On the opposite end of the spectrum, I could write the caches to disk, then load them into memory with mmap, which can share the same memory space between separate executions. It's not clear to me what happens to the mmap if all processes exit, however. It's ok if the mmap is eventually flushed from memory, but I'd want it to stick around for a little bit (30 seconds? a few minutes?) so a user can run commands one after another, and the cache can be reused. This example seems to imply that there needs to be an open mmap handle, but I haven't found any exact description of when memory mapped files get dropped from memory and need to be reloaded from disk.
I think I could implement this, if mmap objects do stick around after exit, but it feels very low level, and I imagine someone's already got a more elegant solution implemented. I'd hate to start building this only to realize I've been rebuilding SQLite. On the other hand, it feels like it would be very fast, and I could make optimizations given my specific use case.
Share Python objects between processes using Processing
The Processing package indicates "Objects can be shared between processes using ... shared memory". Looking through the rest of the docs, I didn't see any further mention of this behavior, but that sounds very promising. Can anyone direct me to more information?
Store data on a RAM disk
My concern here is OS-specific capabilities, but I could create a RAM disk and then simply read/write to it as I please (SQLite?). The fs.memoryfs package seems like a promising alternative to work with multiple OSs, but the comments imply a fair number of limitations.
I know pickle is an efficient way to store Python objects, so it might have speed advantages over any sort of manual data storage. Can I hook pickle into any of the above options? Would that be better than flat files or SQLite?
I know there's a lot of questions related to this, but I did a fair bit of digging and couldn't find anything directly addressing my question with regards to multiple command line executions.
I fully admit, I may be way overthinking this. I'm just trying to get a feel for my options, and if they're worthwhile or not.
Thank you so much for your help!
I would just do the simplest thing that might possibly work. ...which in your case would likely just be to dump to a pickle file. If you find it's not fast enough, try something more involved (like memcached or SQLite). Donald Knuth says "Premature optimization is the root of all evil"!
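The "dump to a pickle file" suggestion might be sketched as follows in Python 3. The load_or_build name and the temporary-file step are assumptions; build stands in for whatever function constructs the sorted lists, maps and trees.

```python
import os
import pickle


def load_or_build(cache_path, build):
    """Load the cached object if possible, otherwise build and save it."""
    try:
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    except (OSError, EOFError, pickle.UnpicklingError):
        pass  # missing or unreadable cache: fall through and rebuild
    data = build()
    tmp_path = cache_path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
    os.replace(tmp_path, cache_path)  # swap in the finished file
    return data
```

Timing this against a cold rebuild is a cheap way to find out whether anything more elaborate is needed.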
I have a Python (2.7) script that acts as a server and it will therefore run for very long periods of time. This script has a bunch of values to keep track of which can change at any time based on client input.
What I'm ideally after is something that can keep a Python data structure (with values of types dict, list, unicode, int and float – JSON, basically) in memory, letting me update it however I want (except referencing any of the reference type instances more than once) while also keeping this data up-to-date in a human-readable file, so that even if the power plug was pulled, the server could just start up and continue with the same data.
I know I'm basically talking about a database, but the data I'm keeping will be very simple and probably less than 1 kB most of the time, so I'm looking for the simplest solution possible that can provide me with the described data integrity. Are there any good Python (2.7) libraries that let me do something like this?
Well, since you know we're basically talking about a database, albeit a very simple one, you probably won't be surprised that I suggest you have a look at the sqlite3 module.
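As a rough idea of how little code the sqlite3 route needs, here is a sketch of a tiny key-value store holding JSON values. The table name, column names and helper functions are invented for the example.

```python
import json
import sqlite3


def open_store(path):
    """Open (or create) the database file and its single table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS state (key TEXT PRIMARY KEY, value TEXT)"
    )
    conn.commit()
    return conn


def set_value(conn, key, value):
    """Store one JSON-serializable value inside a transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT OR REPLACE INTO state (key, value) VALUES (?, ?)",
            (key, json.dumps(value)),
        )


def get_value(conn, key, default=None):
    """Return the decoded value for key, or default if it is missing."""
    row = conn.execute("SELECT value FROM state WHERE key = ?", (key,)).fetchone()
    return default if row is None else json.loads(row[0])
```

The data file itself is not human readable, although the stored JSON values can be inspected with the sqlite3 command line shell.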
I agree that you don't need a full-blown database, as it seems that all you want is atomic file writes. You need to solve this problem in two parts: serialisation/deserialisation, and the atomic writing.
For the first part, json or pickle are probably suitable formats for you. JSON has the advantage of being human readable. It doesn't seem as though this is the primary problem you are facing though.
Once you have serialised your object to a string, use the following procedure to write a file to disk atomically, assuming a single concurrent writer (at least on POSIX, see below):
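import json, os, platform

backup_filename = "output.back.json"
filename = "output.json"
serialised_str = json.dumps(...)

# Write the new contents to a separate file first.
with open(backup_filename, 'wb') as f:
    f.write(serialised_str)

# os.rename will not overwrite an existing file on Windows, so remove it first.
if platform.system() == 'Windows' and os.path.exists(filename):
    os.unlink(filename)
os.rename(backup_filename, filename)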
While os.rename will overwrite an existing file and is atomic on POSIX, this is sadly not the case on Windows. On Windows, there is the possibility that os.unlink will succeed but os.rename will fail, meaning that you have only backup_filename and no filename. If you are targeting Windows, you will need to consider this possibility when you are checking for the existence of filename.
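For readers on Python 3.3 or later (not the 2.7 the question targets), os.replace overwrites an existing destination on both POSIX and Windows, which removes the need for the unlink step. The sketch below is one possible shape for the write-then-rename procedure under that assumption; the function name and the use of a temporary file in the same directory are illustrative choices.

```python
import json
import os
import tempfile


def atomic_write_json(path, obj):
    """Write obj as JSON to a temporary file, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(obj, f, indent=4, sort_keys=True)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)  # do not leave a half-written temp file
        raise
```

A reader of path then sees either the old contents or the new contents, never a partially written file, at least on POSIX where the rename is atomic.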
If there is a possibility of more than one concurrent writer, you will have to consider a synchronisation construct.
Any reason for the human readable requirement?
I would suggest looking at sqlite for a simple database solution, or at pickle for a simple way to serialise objects and write them to disk. Neither is particularly human readable though.
Other options are JSON, or XML as you hinted at - use the built in json module to serialize the objects then write that to disk. When you start up, check for the presence of that file and load the data if required.
From the docs:
>>> import json
>>> print json.dumps({'4': 5, '6': 7}, sort_keys=True, indent=4)
{
    "4": 5,
    "6": 7
}
Since you mentioned your data is small, I'd go with a simple solution and use the pickle module, which lets you dump a Python object to a file in a single line of code.
Then you just set up a Thread that saves your object to a file in defined time intervals.
Not a "libraried" solution, but - if I understand your requirements - simple enough for you not to really need one.
EDIT: you mentioned you wanted to cover the case that a problem occurs during the write itself, effectively making it an atomic transaction. In this case, the traditional way to go is using "Log-based recovery". It essentially means writing a record to a log file saying "write transaction started" and then writing "write transaction committed" when you're done. If a "started" has no corresponding "committed", then you roll back.
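A toy version of log-based recovery, to make the started/committed idea concrete. The JSON-lines record format and the function names are assumptions, and a real implementation would also flush and fsync the log before treating a commit as durable.

```python
import json


def log_transaction(log_path, txn_id, changes):
    """Append start, set, and commit records for one transaction."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps({"txn": txn_id, "op": "start"}) + "\n")
        for key, value in changes.items():
            record = {"txn": txn_id, "op": "set", "key": key, "value": value}
            log.write(json.dumps(record) + "\n")
        log.write(json.dumps({"txn": txn_id, "op": "commit"}) + "\n")


def recover(log_path):
    """Replay the log, applying only transactions that reached commit."""
    state, pending = {}, {}
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            try:
                rec = json.loads(line)
            except ValueError:
                break  # torn final line left by a crash
            if rec["op"] == "start":
                pending[rec["txn"]] = {}
            elif rec["op"] == "set":
                pending[rec["txn"]][rec["key"]] = rec["value"]
            elif rec["op"] == "commit":
                state.update(pending.pop(rec["txn"]))
    return state
```

Changes belonging to a transaction with no commit record are simply never applied, which is the rollback.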
In this case, I agree that you might be better off with a simple database like SQLite. It might be a slight overkill, but on the other hand, implementing atomicity yourself might be reinventing the wheel a little (and I didn't find any obvious libraries that do it for you).
If you do decide to go the crafty way, this topic is covered in the Process Synchronization chapter of Silberschatz's Operating Systems book, under the section "atomic transactions".
A very simple (though maybe not "transactionally perfect") alternative would be just to record to a new file every time, so that if one corrupts you have a history. You can even add a checksum to each file to automatically determine if it's broken.
You are asking how to implement a database which provides ACID guarantees, but you haven't provided a good reason why you can't use one off-the-shelf. SQLite is perfect for this sort of thing and gives you those guarantees.
However, there is KirbyBase. I've never used it and I don't think it makes ACID guarantees, but it does have some of the characteristics you're looking for.