Keep persistent variables in memory between runs of Python script

Keep persistent variables in memory between runs of Python script - python

Is there any way of keeping a result variable in memory so I don't have to recalculate it each time I run the beginning of my script?
I am doing a long (5-10 sec) series of the exact operations on a data set (which I am reading from disk) every time I run my script.
This wouldn't be too much of a problem since I'm pretty good at using the interactive editor to debug my code in between runs; however sometimes the interactive capabilities just don't cut it.
I know I could write my results to a file on disk, but I'd like to avoid doing so if at all possible. This should be a solution which generates a variable the first time I run the script, and keeps it in memory until the shell itself is closed or until I explicitly tell it to fizzle out. Something like this:
# Check if variable already created this session
in_mem = var_in_memory() # Returns pointer to var, or False if not in memory yet
if not in_mem:
# Read data set from disk
with open('mydata', 'r') as in_handle:
mytext = in_handle.read()
# Extract relevant results from data set
mydata = parse_data(mytext)
result = initial_operations(mydata)
in_mem = store_persistent(result)
I've an inkling that the shelve module might be what I'm looking for here, but looks like in order to open a shelve variable I would have to specify a file name for the persistent object, and so I'm not sure if it's quite what I'm looking for.
Any tips on getting shelve to do what I want it to do? Any alternative ideas?

You can achieve something like this using the reload global function to re-execute your main script's code. You will need to write a wrapper script that imports your main script, asks it for the variable it wants to cache, caches a copy of that within the wrapper script's module scope, and then when you want (when you hit ENTER on stdin or whatever), it calls reload(yourscriptmodule) but this time passes it the cached object such that yourscript can bypass the expensive computation. Here's a quick example.
wrapper.py
import sys
import mainscript
part1Cache = None
if __name__ == "__main__":
while True:
if not part1Cache:
part1Cache = mainscript.part1()
mainscript.part2(part1Cache)
print "Press enter to re-run the script, CTRL-C to exit"
sys.stdin.readline()
reload(mainscript)
mainscript.py
def part1():
print "part1 expensive computation running"
return "This was expensive to compute"
def part2(value):
print "part2 running with %s" % value
While wrapper.py is running, you can edit mainscript.py, add new code to the part2 function and be able to run your new code against the pre-computed part1Cache.

To keep data in memory, the process must keep running. Memory belongs to the process running the script, NOT to the shell. The shell cannot hold memory for you.
So if you want to change your code and keep your process running, you'll have to reload the modules when they're changed. If any of the data in memory is an instance of a class that changes, you'll have to find a way to convert it to an instance of the new class. It's a bit of a mess. Not many languages were ever any good at this kind of hot patching (Common Lisp comes to mind), and there are a lot of chances for things to go wrong.

If you only want to persist one object (or object graph) for future sessions, the shelve module probably is overkill. Just pickle the object you care about. Do the work and save the pickle if you have no pickle-file, or load the pickle-file if you have one.
import os
import cPickle as pickle
pickle_filepath = "/path/to/picklefile.pickle"
if not os.path.exists(pickle_filepath):
# Read data set from disk
with open('mydata', 'r') as in_handle:
mytext = in_handle.read()
# Extract relevant results from data set
mydata = parse_data(mytext)
result = initial_operations(mydata)
with open(pickle_filepath, 'w') as pickle_handle:
pickle.dump(result, pickle_handle)
else:
with open(pickle_filepath) as pickle_handle:
result = pickle.load(pickle_handle)

Python's shelve is a persistence solution for pickled (serialized) objects and is file-based. The advantage is that it stores Python objects directly, meaning the API is pretty simple.
If you really want to avoid the disk, the technology you are looking for is a "in-memory database." Several alternatives exist, see this SO question: in-memory database in Python.

Weirdly, none of the earlier answers here mention simple text files. The OP says they don't like the idea, but as this is becoming a canonical for duplicates which might not have that constraint, this alternative deserves a mention. If all you need is for some text to survive between invocations of your script, save it in a regular text file.
def main():
# Before start, read data from previous run
try:
with open('mydata.txt', encoding='utf-8') as statefile:
data = statefile.read().rstrip('\n')
except FileNotFound:
data = "some default, or maybe nothing"
updated_data = your_real_main(data)
# When done, save new data for next run
with open('mydata.txt', 'w', encoding='utf-8') as statefile:
statefile.write(updated_data + '\n')
This easily extends to more complex data structures, though then you'll probably need to use a standard structured format like JSON or YAML (for serializing data with tree-like structures into text) or CSV (for a matrix of columns and rows containing text and/or numbers).
Ultimately, shelve and pickle are just glorified generalized versions of the same idea; but if your needs are modest, the benefits of a simple textual format which you can inspect and update in a regular text editor, and read and manipulate with ubiquitous standard tools, and easily copy and share between different Python versions and even other programming languages as well as version control systems etc, are quite compelling.
As an aside, character encoding issues are a complication which you need to plan for; but in this day and age, just use UTF-8 for all your text files.
Another caveat is that beginners are often confused about where to save the file. A common convention is to save it in the invoking user's home directory, though that obviously means multiple users cannot share this data. Another is to save it in a shared location, but this then requires an administrator to separately grant write access to this location (except I guess on Windows; but that then comes with its own tectonic plate of other problems).
The main drawback is that text is brittle if you need multiple processes to update the file in rapid succession, and slow to handle if you have lots of data and need to update parts of it frequently. For these use cases, maybe look at a database (probably start with SQLite which is robust and nimble, and included in the Python standard library; scale up to Postgres or etc if you have entrerprise-grade needs).
And, of course, if you need to store native Python structures, shelve and pickle are still there.

This is a os dependent solution...
$mkfifo inpipe
#/usr/bin/python3
#firstprocess.py
complicated_calculation()
while True:
with open('inpipe') as f:
try:
print( exec (f.read()))
except Exception as e: print(e)
$./first_process.py &
$cat second_process.py > inpipe
This will allow you to change and redefine variables in the first process without copying or recalculating anything. It should be the most efficient solution compared to multiprocessing, memcached, pickle, shelve modules or databases.
This is really nice if you want to edit and redefine second_process.py iteratively in your editor or IDE until you have it right without having to wait for the first process (e.g. initializing a large dict, etc.) to execute each time you make a change.

You can do this but you must use a Python shell. In other words, the shell that you use to start Python scripts must be a Python process. Then, any global variables or classes will live until you close the shell.
Look at the cmd module which makes it easy to write a shell program. You can even arrange so that any commmands that are not implemented in your shell get passed to the system shell for execution (without closing your shell). Then you would have to implement some kind of command, prun for instance, that runs a Python script by using the runpy module.
http://docs.python.org/library/runpy.html
You would need to use the init_globals parameter to pass your special data to the program's namespace, ideally a dict or a single class instance.

You could run a persistent script on the server through the os which loads/calcs, and even periodically reloads/recalcs the sql data into memory structures of some sort and then acess the in-memory data from your other script through a socket.

Related

Overwrite a python file while using it?

I have first file (data.py):
database = {
'school': 2,
'class': 3
}
my second python file (app.py)
import data
del data.database['school']
print(data.database)
>>>{'class': 3}
But in data.py didn't change anything? Why?
And how can I change it from my app.py?

del data.database['school'] modifies the data in memory, but does not modify the source code.
Modifying a source code to manage the persistence of your data is not a good practice IMHO.
You could use a database, a csv file, a json file ...

To elaborate on Gelineau's answer: at runtime, your source code is turned into a machine-usable representation (known as "bytecode") which is loaded into the process memory, then executed. When the del data.database['school'] statement (in it's bytecode form) is executed, it only modifies the in-memory data.database object, not (hopefully!) the source code itself. Actually, your source code is not "the program", it's a blueprint for the runtime process.
What you're looking for is known as data persistance (data that "remembers" it's last known state between executions of the program). There are many solutions to this problem, ranging from the simple "write it to a text or binary file somewhere and re-read it at startup" to full-blown multi-servers database systems. Which solution is appropriate for you depends on your program's needs and constraints, whether you need to handle concurrent access (multiple users / processes editing the data at the same time), etc etc so there's really no one-size-fits-all answers. For the simplest use cases (single user, small datasets etc), json or csv files written to disk or a simple binary key:value file format like anydbm or shelve (both in Python's stdlib) can be enough. As soon as things gets a bit more complex, SQL databases are most often your best bet (no wonder why they are still the industry standard and will remain so for long years).
In all cases, data persistance is not "automagic", you will have to write quite some code to make sure your changes are saved in timely manner.

As what you are trying to achieve is basically related to file operation.
So when you are importing data , it just loads instanc of your file in memory and create a reference from your new file, ie. app.py. So, if you modify it in app.py its just modifiying the instance which is in RAM not in harddrive where your actual file is stored in harddrive.
If you want to change source code of another file "As its not good practice" then you can use file operations.

Reproducible test framework

I am looking for a test or integration framework that supports long, costly tests for correctness. The tests should only be rerun if the code affecting the test has changed.
Ideally the test framework would
find the code of the test
produce a hash of it,
run the code and write to an output file with the hash as the name
or skip if that already exists.
provide a simple overview what tests succeeded and which failed.
It would be OK if the test has to specify the modules and files it depends on.
Python would be ideal, but this problem may be high-level enough that other languages would work too.
Perhaps there exists already a test or build integration framework I can adapt to fit this behaviour?

Basically you need to track what is the test doing so you can check whether it has changed.
Python code can be traced with sys.settrace(tracefunc). There is a module trace that can help with it.
But if it is not just Python code - if the tests execute other programs, test input files etc. and you need to watch it for changes too, then you would need tracing on operating system level, like strace, dtrace, dtruss.
I've created a small demo/prototype of simple testing framework that runs only tests that changed from last run: https://gist.github.com/messa/3825eba3ad3975840400 It uses the trace module. It works this way:
collect tests, each test is identified by name
load test fingerprints from JSON file (if present)
for each test:
if the fingerprint matches the current bytecode of functions listed in the fingerprint, the test is skipped
run test otherwise
trace it while running, record all functions being called
create test fingerprint with function names and bytecode MD5 hashes of each recorded function
save updated test fingerprints to a JSON file
But there is one problem: it's slow. Running code while tracing it with trace.Trace is about 40x slower than without tracing. So maybe you will be just better running all tests without tracing :) But if the tracer would be implemented in C like for example it is in the coverage module it should be faster. (Python trace module is not in C.)
Maybe some other tricks could help with speed. Maybe you are interested just in some top-level function whether they changed or not, so you don't need to trace all function calls.
Have you considered other ways how to speed up expensive tests? Like paralellization, ramdisk (tmpfs)... For example, if you test against a database, don't use the "system" or development one, but run a special instance of the database with lightweight configuration (no prealloc, no journal...) from tmpfs. If it is possible, of course - some tests need to be run on configuration similar to the production.
Some test frameworks (or their plugins) can run only the tests that failed last time - that's different, but kind of similar functinality.

This may not be the most efficient way to do this, but this can be done with Python's pickle module.
import pickle
At the end of your file, have it save itself as a pickle.
myfile = open('myfile.py', 'r') #Your script
savefile = open('savefile.pkl', 'w') #File the script will be saved to
#Any file extension can be used but I like .pkl for "pickle"
mytext = myfile.readlines()
pickle.dump(mytext, savefile) #Saves list from readlines() as a pickle
myfile.close()
savefile.close()
And then at the beginning of your script (after you have pickled it once already), add the code bit that checks it against the pickle.
myfile = ('myfile.py', 'r')
savefile = ('savefile.pkl', 'r')
mytext = myfile.readlines
savetext = pickle.load(savefile)
myfile.close()
savefile.close()
if mytext == savetext:
#Do whatever you want it to do
else:
#more code
That should work. It's a little long, but it's pure python and should do what you're looking for.

Sharing Data in Python

I have some Pickled data, which is stored on disk, and it is about 100 MB in size.
When my python program is executed, the picked data is loaded using the cPickle module, and all that works fine.
If I execute the python multiple times using python main.py for example, each python process will load the same data multiple times, which is the correct behaviour.
How can I make it so, all new python process share this data, so it is only loaded a single time into memory?

If you're on Unix, one possibility is to load the data into memory, and then have the script use os.fork() to create a bunch of sub-processes. As long as the sub-processes don't attempt to modify the data, they would automatically share the parent's copy of it, without using any additional memory.
Unfortunately, this won't work on Windows.
P.S. I once asked about placing Python objects into shared memory, but that didn't produce any easy solutions.

Depending on how seriously you need to solve this problem, you may want to look at memcached, if that is not overkill.

Storing user data in a Python script

What is the preferred/ usual way of storing data that is entered by the user when running a Python script, if I need the data again the next time the script runs?
For example, my script performs calculations based on what the user enters and then when the user runs the script again, it fetches the result from the last run.
For now, I write the data to a text file and read it from there. I don't think that I would need to store very large records ( less than 100, I'd say).
I am targeting Windows and Linux users both with this script, so a cross platform solution would be good. My only apprehension with using a text file is that I feel it might not be the best and the usual way of doing it.
So my question is, if you ever need to store some data for your script, how do you do it?

you could use a slite database or a CSV file. They are both very easy to work with but lend themselves to rows with the same type of information.
The best option might be shelve module
import shelve
shelf = shelve.open(filename)
shelf['key1'] = value1
shelf['key2'] = value2
shelf.close()
# next run
shelf.open(filename)
value1 = shelf['key1']
#etc

For small amounts of data, Python's pickle module is great for stashing away data you want easy access to later--just pickle the data objects from memory and write to a (hidden) file in the user's home folder (good for Linux etc.) or Application Data (on Windows).
Of, as #aaronnasterling mentioned, a sqlite3 file-based database is small, fast and easy that it's no wonder that so many popular programs like Firefox and Pidgin use it.

For 100 lines, plain text is fine with either the standard ConfigParser or csv modules.
Assuming your data structure is simple, text affords opportunities (e.g. grep, vi, notepad) that more complex formats preclude.

Since you only need the last result, just store the result in a file.
Example
write('something', wb)
It will only store the last result. Then when you re-run the script, do a open and read the previous result.

Reading and writing to/from memory in Python

Let's imagine a situation: I have two Python programs. The first one will write some data (str) to computer memory, and then exit. I will then start the second program which will read the in-memory data saved by the first program.
Is this possible?

Sort of.
python p1.py | python p2.py
If p1 writes to stdout, the data goes to memory. If p2 reads from stdin, it reads from memory.
The issue is that there's no "I will then start the second program". You must start both programs so that they share the appropriate memory (in this case, the buffer between stdout and stdin.)

What are all these nonsense answers? Of course you can share memory the way you asked, there's no technical reason you shouldn't be able to persist memory other than lack of usermode API.
In Linux you can use shared memory segments which persist even after the program that made them is gone. You can view/edit them with ipcs(1). To create them, see shmget(2) and the related syscalls.
Alternatively you can use POSIX shared memory, which is probably more portable. See shm_overview(7)
I suppose you can do it on Windows like this.

Store you data into "memory" using things like databases, eg dbm, sqlite, shelve, pickle, etc where your 2nd program can pick up later.

No.
Once the first program exits, its memory is completely gone.
You need to write to disk.

The first one will write some data
(str) to computer memory, and then
exit.
The OS will then ensure all that memory is zeroed before any other program can see it. (This is an important security measure, as the first program may have been processing your bank statement or may have had your password).
You need to write to persistent storage - probably disk. (Or you could use a ramdisk, but that's unlikely to make any difference to real-world performance).
Alternatively, why do you have 2 programs? Why not one program that does both tasks?

Yes.
Define a RAM file-system.
http://www.vanemery.com/Linux/Ramdisk/ramdisk.html
http://www.cyberciti.biz/faq/howto-create-linux-ram-disk-filesystem/

You can also set up persistent shared memory area and have one program write to it and the other read it. However, setting up such things is somewhat dependent on the underlying O/S.

Maybe the poster is talking about something like shared memory? Have a look at this: http://poshmodule.sourceforge.net/

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.