This question already has answers here:
Best method of saving data
(5 answers)
Closed 3 years ago.
I occasionally have Python programs that take a long time to run, and that I want to be able to save the state of and resume later. Does anyone have a clever way of saving the state either every x seconds, or when the program is exiting?
Put all of your "state" data in one place and use pickle.
The pickle module implements a fundamental, but powerful algorithm for serializing and de-serializing a Python object structure. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream is converted back into an object hierarchy. Pickling (and unpickling) is alternatively known as “serialization”, “marshalling”, or “flattening”; however, to avoid confusion, the terms used here are “pickling” and “unpickling”.
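For instance, a minimal sketch of that idea; the filename and the contents of the state dict here are illustrative:

import pickle

# Keep all resumable state in one object.
state = {"progress": 0, "results": []}

# Save it (e.g. every x seconds, or from an atexit/finally handler):
with open("state.pkl", "wb") as f:
    pickle.dump(state, f)

# Resume later:
with open("state.pkl", "rb") as f:
    state = pickle.load(f)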
If you want to save everything, including the entire namespace and the line of code currently executing, so that the program can be restarted at any point, there is no standard library module to do that.
As another poster said, the pickle module can save pretty much everything into a file and then load it again, but you would have to specifically design your program around the pickle module (i.e. saving your "state" -- including variables, etc -- in a class).
If you are OK with OOP, consider adding a method to each class that writes a serialized version of the instance (using pickle) to a file. Then add a second method that loads the data back into the instance; if the pickled file is present, call the load method instead of the processing one.
I use this approach for ML and it really speeds up my workflow.
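A rough sketch of that pattern, with made-up names (Model, model.pkl):

import os
import pickle

class Model:
    CACHE = "model.pkl"  # illustrative cache filename

    def compute(self):
        # Expensive work whose result we want to reuse between runs.
        self.weights = [i * 0.5 for i in range(1000)]

    def save(self):
        # Serialize the instance's attributes to the cache file.
        with open(self.CACHE, "wb") as f:
            pickle.dump(self.__dict__, f)

    def load(self):
        # Restore the attributes saved by a previous run.
        with open(self.CACHE, "rb") as f:
            self.__dict__.update(pickle.load(f))

m = Model()
if os.path.exists(Model.CACHE):
    m.load()      # pickled file exists: skip the processing
else:
    m.compute()   # first run: compute, then cache the result
    m.save()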
In the traditional programming approach, the obvious way to save the state of variables or objects at some point of execution is serialization.
So if you want to resume the program after some heavy state has already been computed, you only need to start from the deserialization step.
These steps are mostly needed in data science modelling, where we load a huge CSV or some other data, compute on it, and do not want to recompute it every time we run the program.
http://jupyter.org/ - a tool which handles this state-saving for you without any manual work.
All you need to do is execute a selected portion of your Python code, let us say lines 10-15, which depend on the previous lines 1-9. Jupyter keeps the state produced by lines 1-9 for you. Explore a tutorial and give it a try.
Related
I have an application that dynamically generates a lot of Python modules with class factories, to eliminate a lot of redundant boilerplate that makes the code hard to debug across similar implementations. This works well, except that the dynamic generation of the classes across the modules (hundreds of them) takes more time to load than simply importing from a file. So I would like to find a way to save the modules to a file after generation (unless reset), then load from those files to cut down on bootstrap time for the platform.
Does anyone know how I can save/export auto-generated Python modules to a file for re-import later. I already know that pickling and exporting as a JSON object won't work because they make use of thread locks and other dynamic state variables and the classes must be defined before they can be pickled. I need to save the actual class definitions, not instances. The classes are defined with the type() function.
If you have ideas or knowledge about how to do this, I would really appreciate your input.
You’re basically asking how to write a compiler whose input is a module object and whose output is a .pyc file. (One plausible strategy is of course to generate a .py and then byte-compile that in the usual fashion; the following could even be adapted to do so.) It’s fairly easy to do this for simple cases: the .pyc format is very simple (but note the comments there), and the marshal module does all of the heavy lifting for it. One point of warning that might be obvious: if you’ve already evaluated, say, os.getcwd() when you generate the code, that’s not at all the same as evaluating it when loading it in a new process.
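As a hedged illustration of the marshal half of that (not the full .pyc header handling), assuming the generated module is available as source text:

import marshal

source = "x = 1\ndef double(n):\n    return 2 * n\n"  # stand-in for generated code
code = compile(source, "<generated>", "exec")

with open("generated_module.bin", "wb") as f:
    marshal.dump(code, f)  # note: marshal data is CPython-version-specific

with open("generated_module.bin", "rb") as f:
    restored = marshal.load(f)

namespace = {}
exec(restored, namespace)  # re-create the module's contents without regenerating
print(namespace["double"](21))  # 42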
The “only” other task is constructing the code objects for the module and each class: this requires concatenating a large number of boring values from the dis module, and will fail if any object encountered is non-trivial. These might be global/static variables/constants or default argument values: if you can alter your generator to produce modules directly, you can probably wrap all of these (along with anything else you want to defer) in function calls by compiling something like
my_global = (lambda: open(os.devnull, 'w'))()
so that you actually emit the function and then a call to it. If you can’t so alter it, you’ll have to have rules to recognize values that need to be constructed in this fashion so that you can replace them with such calls.
Another detail that may be important is closures: if your generator uses local functions/classes, you’ll need to create the cell objects, perhaps via “fake” closures of your own:
def cell(x): return (lambda: x).__closure__[0]
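For illustration, the cell created this way is a real cell object (readable via cell_contents) that can later be passed as the closure argument of types.FunctionType when rebuilding closed-over functions; the value here is made up:

c = cell(42)            # using the helper above
print(c.cell_contents)  # 42 -- the captured value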
I cannot understand it. This is very simple, obvious functionality:
You have code in any programming language and you run it. In this code you generate variables, then you save them (the values, the names, everything) to a file with one command. Once saved, you can load such a file back into your code, also with one simple command.
It works perfectly in Matlab (save Workspace, load Workspace). In Python there is some weird "pickle" protocol, which produces errors all the time, while all I want to do is save a variable and load it again in another session.
For example, you cannot save a class with its variables (in Matlab there is no problem).
You cannot load arrays with cPickle (but you can save them?).
Why not make it easier?
Is there a way to save the current variables with their values, and then load them?
What you are describing is a feature of the Matlab environment, not of the programming language.
What you need is a way to store the serialized state of some object, which can easily be done in almost any programming language. In the Python world, pickle is the easiest way to achieve it, and if you could provide more details about the errors it produces for you, people would probably be able to give you more details on that.
In general, for object-oriented languages (including Python) it is always a good approach to encapsulate your state in a single object that can be serialized and de-serialized, and then store/load an instance of that class. Pickling and unpickling such objects works perfectly for many developers, so this must be something specific to your implementation.
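For what it's worth, a minimal Matlab-style save/load of selected variables can be sketched like this; the filename and the sample variables are illustrative:

import pickle

SESSION_FILE = "session.pkl"

a = 42
b = [1.0, 2.0, 3.0]

# Save the variables you care about as one dict:
with open(SESSION_FILE, "wb") as f:
    pickle.dump({"a": a, "b": b}, f)

# In another session, restore them by name:
with open(SESSION_FILE, "rb") as f:
    workspace = pickle.load(f)
globals().update(workspace)  # re-creates the names a and b
print(a, b)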
Since you're talking about Matlab, you probably want to try out IPython, which is a shell for Python offering much more functionality than the standard interpreter shell you get when executing Python.
Among this functionality is the ability to load/save workspace sessions, create macros out of session input etc., which is probably more like what you are used to in Matlab (I actually use both and find IPython to be much more elegant, but YMMV):
http://ipython.scipy.org
PiCloud has implemented a fancier pickle, but I can't find the code. I saw a poster session.
Generally in Python, instantiated objects don't have any single way to recreate them, and in some cases it's particularly difficult (like an open file), as it takes several steps to recreate.
I take issue with the statement that the saving of variables in Matlab is an environment feature. The save statement in Matlab is a function and part of the Matlab language, not just a command. It is a very useful function, as you don't have to worry about the trivial minutiae of file I/O, and it handles all sorts of variables, from scalars and matrices to objects and structures.
It's an old thread, but I thought I should throw it out there anyway: Spyder, the Scientific Python development environment, allows you to do just this through the Variable explorer. There's a Save data button there that packs your whole workspace up in a .spydata file that you can later reload. Works like a charm when you're switching between projects!
What would be the best way to handle lightweight crash recovery for my program?
I have a Python program that runs a number of test cases and the results are stored in a dictionary which serves as a cache. If I could save (and then restore) each item that is added to the dictionary, I could simply run the program again and the caching would provide suitable crash recovery.
You may assume that the keys and values in the dictionary are easily convertible to strings, i.e. using either str or the pickle module.
I want this to be completely cross-platform - well, at least as cross-platform as Python is.
I don't want to simply write out each value to a file and load it back in; my program might crash while I am writing the file.
UPDATE: This is intended to be a lightweight module so a DBMS is out of the question.
UPDATE: Alex is correct in that I don't actually need to protect against crashes while writing out, but there are circumstances where I would like to be able to manually terminate it in a recoverable state.
UPDATE: Added a highly limited solution using standard input below.
There's no good way to guard against "your program crashing while writing a checkpoint to a file", but why should you worry so much about that?! What ELSE is your program doing at that time BESIDES "saving checkpoint to a file", that could easily cause it to crash?!
It's hard to beat pickle (or cPickle) for portability of serialization in Python, but that's just about "turning your keys and values to strings". For saving key-value pairs (once stringified), few approaches are safer than just appending to a file (don't pickle to files if your crashes are far, far more frequent than normal, as you suggest they are).
If your environment is incredibly crash-prone for whatever reason (very cheap HW?-), just make sure you close the file (and fflush if the OS is also crash-prone;-), then reopen it for append. This way, the worst that can happen is that the very latest append will be incomplete (due to a crash in the middle of things) -- then you just catch the exception raised by unpickling that incomplete record and redo only the things that weren't saved (because they weren't completed due to a crash, OR because they were completed but not fully saved due to a crash; it comes to much the same thing in the end).
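A rough sketch of that append-and-replay scheme; the filename and record format are illustrative:

import pickle

CHECKPOINT = "checkpoint.pkl"

def append_record(key, value):
    # Append one pickled (key, value) record and flush, so a crash can
    # lose at most the final, partially written record.
    with open(CHECKPOINT, "ab") as f:
        pickle.dump((key, value), f)
        f.flush()

def load_records():
    # Replay all complete records; a truncated tail simply raises and
    # is skipped, so those items get recomputed on the next run.
    results = {}
    try:
        with open(CHECKPOINT, "rb") as f:
            while True:
                key, value = pickle.load(f)
                results[key] = value
    except (FileNotFoundError, EOFError, pickle.UnpicklingError):
        pass
    return results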
If you have the option of checkpointing to a database engine (instead of just doing so to files), consider it seriously! The DB engine will keep transaction logs and ensure ACID properties, making your application-side programming much easier IF you can count on that!-)
The pickle module supports serializing objects to a file (and loading from file):
http://docs.python.org/library/pickle.html
One possibility would be to create a number of smaller files ... each representing a subset of the state that you're trying to preserve and each with a checksum or tag indicating that it's complete as the last line/datum of the file (just before the file is closed).
If the checksum/tag is good, then the rest of the data can be considered valid ... though the program would then have to find all of these files, open and read all of them, and use the metadata you've provided (in their headers or their names?) to determine which ones constitute the most recent cohesive state representation (or checkpoint) from which to continue processing.
Without knowing more about the nature of the data that you're working with it's impossible to be more specific.
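One possible shape for those checksum-tagged files, with illustrative names and a JSON payload standing in for your data:

import hashlib
import json
import os
import time

def write_checkpoint(state, directory="checkpoints"):
    # Write the payload plus a trailing SHA-256 line; a file whose last
    # line matches the digest of the payload is a complete checkpoint.
    os.makedirs(directory, exist_ok=True)
    payload = json.dumps(state)
    digest = hashlib.sha256(payload.encode()).hexdigest()
    path = os.path.join(directory, f"state-{time.time_ns()}.ckpt")
    with open(path, "w") as f:
        f.write(payload + "\n" + digest + "\n")

def is_complete(path):
    # Recompute the digest and compare it with the tag on the last line.
    with open(path) as f:
        payload, digest = f.read().rsplit("\n", 2)[:2]
    return hashlib.sha256(payload.encode()).hexdigest() == digest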
You can use files, of course, or you could use a DBMS system just about as easily. Any decent DBMS (PostgreSQL, MySQL if you're using the proper storage back-ends) can give you ACID guarantees and transactional support. So the data you read back should always be consistent with the constraints that you put in your schema and/or with the transactions (BEGIN, COMMIT, ROLLBACK) that you processed.
A possible advantage of posting your serialized data to a DBMS is that you can host the DBMS on a separate system (which is unlikely to suffer the same instabilities as your test host at the same times).
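If the "lightweight" constraint ever relaxes, even the standard library's sqlite3 gives you those transactional guarantees; a minimal sketch with illustrative table and file names:

import sqlite3

conn = sqlite3.connect("checkpoint.db")
conn.execute("CREATE TABLE IF NOT EXISTS results (key TEXT PRIMARY KEY, value TEXT)")

def save(key, value):
    # Each insert is its own transaction: committed on success,
    # rolled back on error, so a crash never leaves a half-written row.
    with conn:
        conn.execute("INSERT OR REPLACE INTO results VALUES (?, ?)", (key, value))

def load_all():
    # Rebuild the cache dictionary from whatever was committed.
    return dict(conn.execute("SELECT key, value FROM results"))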
Pickle/cPickle have problems.
I use the JSON module to serialize objects out. I like it because not only does it work on any OS, but it will work fine in other programming languages, too; many other languages and platforms have readily-accessible JSON deserialization support, which makes it easy to use the same objects in different programs.
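For instance (the filename and cache contents are illustrative):

import json

cache = {"case_1": "pass", "case_2": "fail"}

# Write the cache as plain text any language can parse:
with open("cache.json", "w") as f:
    json.dump(cache, f)

# Read it back, here or from another program:
with open("cache.json") as f:
    cache = json.load(f)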
Solution with severe restrictions
If I don't worry about it crashing while writing out and I only want to allow manual termination, I can use standard input to control this. Unfortunately, this can only terminate the program when a control point is reached. This could be solved by creating a new thread to read standard input. This thread could use a global lock to check whether the main thread is inside a critical section (writing to a file) and terminate the program if it is not (a rough sketch follows the list below).
Downsides:
This is reasonably complex
It adds an extra thread
It stops me using standard input for anything else
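Here is a rough sketch of that idea with illustrative names, in a slightly simplified variant where the watcher thread only requests termination and the main loop honors it at a control point, outside the critical section:

import sys
import threading

stop_requested = threading.Event()
io_lock = threading.Lock()          # held while writing the cache file

def watch_stdin():
    sys.stdin.readline()            # any line on stdin requests termination
    stop_requested.set()

threading.Thread(target=watch_stdin, daemon=True).start()

for case in range(1000):
    result = case * case            # stand-in for running a real test case
    with io_lock:                   # critical section: checkpoint the result
        pass                        # ... write (case, result) to the cache ...
    if stop_requested.is_set():     # control point: safe to stop here
        break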