Sharing Data in Python

I have some Pickled data, which is stored on disk, and it is about 100 MB in size.
When my Python program is executed, the pickled data is loaded using the cPickle module, and all that works fine.
If I run the program multiple times, using python main.py for example, each Python process loads the same data again, which is the expected behaviour.
How can I make all new Python processes share this data, so that it is loaded into memory only once?

If you're on Unix, one possibility is to load the data into memory, and then have the script use os.fork() to create a bunch of sub-processes. As long as the sub-processes don't attempt to modify the data, they would automatically share the parent's copy of it, without using any additional memory.
Unfortunately, this won't work on Windows.
P.S. I once asked about placing Python objects into shared memory, but that didn't produce any easy solutions.
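A minimal sketch of the fork approach described above, assuming a Unix-like system. The big_data dictionary stands in for the unpickled 100 MB object, and run_workers is a hypothetical helper name; the pipe is there only so the parent can see what each child computed.

```python
import os


def run_workers(data, n_workers=3):
    """Fork n_workers children that read `data` without reloading it.

    Unix only: os.fork() is not available on Windows.
    """
    results = []
    for worker_id in range(n_workers):
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: the parent's pages are shared copy-on-write,
            # so `data` is usable here without unpickling it again.
            os.close(read_fd)
            total = sum(data.values())
            os.write(write_fd, ("%d:%d" % (worker_id, total)).encode())
            os.close(write_fd)
            os._exit(0)  # leave the child without running parent cleanup
        # Parent: collect the child's answer, then reap it.
        os.close(write_fd)
        with os.fdopen(read_fd) as pipe:
            results.append(pipe.read())
        os.waitpid(pid, 0)
    return results


if __name__ == "__main__":
    # Stand-in for the data that would normally be unpickled once.
    big_data = {i: i for i in range(100000)}
    print(run_workers(big_data))
```

One caveat: CPython updates reference counts even when objects are only read, so some shared pages may still be copied over time.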

Depending on how seriously you need to solve this problem, you may want to look at memcached, if that is not overkill.

Related

Is it possible in Python to load a large object into memory with one process, and access it in separate independent processes?

I'm writing a program that requires running algorithms on a very large (~6GB) csv file, which is loaded with pandas using read_csv().
The issue I have now, is that anytime I tweak my algorithms and need to re-simulate (which is very often), I need to wait ~30s for the dataset to load into memory, and then another 30s afterward to load the same dataset into a graphing module so I can visually see what's going on. Once it's loaded however, operations are done very quickly.
So far I've tried using mmap, and loading the dataset into a RAM disk for access, with no improvement.
I'm hoping to find a way to load up the dataset once into memory with one process, and then access it in memory with the algorithm-crunching process, which gets re-run each time I make a change.
This thread seems to be close-ish to what I need, but uses multiprocessing which needs everything to be run within the same context.
I'm not a computer engineer (I'm electrical :), so I'm not sure what I'm asking for is even possible. Any help would be appreciated however.
Thanks,
Found a solution that worked, although it was not directly related to my original ask.
Instead of loading a large file into memory and sharing it between independent processes, I found that the real bottleneck was the parsing function in the pandas library, in particular CSV parsing, as CSV is a notoriously inefficient format for data storage.
I started storing my files in the python-native pickle format, which is supported by pandas through the to_pickle() and read_pickle() functions. This cut my load times drastically from ~30s to ~2s.
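The same "parse once, then reload the binary copy" pattern can be sketched with only the standard library. This is an illustration rather than the poster's code: load_table and both paths are hypothetical names, and with pandas the two halves would be df.to_pickle(path) and pd.read_pickle(path).

```python
import csv
import os
import pickle


def load_table(csv_path, cache_path):
    """Return the rows of csv_path, parsing the CSV only when needed."""
    cache_is_fresh = (
        os.path.exists(cache_path)
        and os.path.getmtime(cache_path) >= os.path.getmtime(csv_path)
    )
    if cache_is_fresh:
        # Fast path: load the already-parsed rows.
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)

    # Slow path: parse the text file, then save the parsed result.
    with open(csv_path, newline="") as handle:
        rows = list(csv.reader(handle))
    with open(cache_path, "wb") as handle:
        pickle.dump(rows, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return rows
```

The modification-time check means the cache is rebuilt whenever the CSV file is newer than the pickle.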

various memory errors, python

I am using Python, but recently I have been running into a lot of memory errors.
One is related to saving plots in .png format. As soon as I save them in .pdf format instead, the problem goes away. How can I still use .png for multiple files?
Secondly, I am reading quite big data files, and after a while I run out of memory. I try closing them each time, but perhaps something is still left open. Is there a way to close all the open files in Python without having handles to them?
And finally, Python should release all unused variables, but I think it is not doing so. If I run just one function I have no problem, but if I run two unrelated functions in a row (after the first finishes and before the second starts, all the variables should, in my understanding, be released), I run into the memory error again during the second one. I therefore believe the variables are not released after the first run. How can I force Python to release all of them? (I don't want to use del, because there are loads of variables and I don't want to specify every single one of them.)
Thanks for your help!
Seeing your code would probably make things clearer.
You can also try doing
import gc
f() #function that eats lots of memory while executing
gc.collect()
This will call the garbage collector and you will be sure that all abandoned objects are deleted. If that doesn't solve the problem, take a look at objgraph library http://mg.pov.lt/objgraph/objgraph.html in order to detect who leaks the memory or to find the places where you've forgotten to remove reference to a memory consuming object.
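A small sketch of what gc.collect() actually reclaims. The Node class and make_cycle are made up for the demonstration; the automatic collector is switched off only so the explicit call has something left to find.

```python
import gc


class Node:
    """Object that can take part in a reference cycle."""

    def __init__(self):
        self.partner = None


def make_cycle():
    a, b = Node(), Node()
    a.partner, b.partner = b, a  # a and b now keep each other alive


gc.disable()          # only so the automatic collector does not run first
gc.collect()          # start from a clean state
make_cycle()          # the two nodes are unreachable but still allocated
freed = gc.collect()  # number of unreachable objects found
gc.enable()
print("objects collected:", freed)
```

In CPython, objects that are not part of a cycle are freed as soon as their last reference disappears, so gc.collect() only helps with cycles. It cannot free anything that is still referenced, for example from a global variable.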
Secondly, I am reading quite big data files, and after a while I run out of memory. I try closing them each time, but perhaps something is still left open. Is there a way to close all the open files in Python without having handles to them?
If you use with open(myfile1) as f1: ..., you don't need to worry about closing files or about accidentally leaving files opened.
See here for a good explanation.
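A short sketch of the with open(...) pattern, showing that the file is closed even when the body raises. The file name and the simulated error are made up for the example.

```python
import os
import tempfile

# Hypothetical data file, created here only so the example runs.
path = os.path.join(tempfile.mkdtemp(), "myfile1.txt")
with open(path, "w") as f1:
    f1.write("line 1\nline 2\n")

try:
    with open(path) as f1:
        first_line = f1.readline()
        raise ValueError("simulated failure while processing")
except ValueError:
    pass

# The file was closed when the with block exited, despite the exception.
print(f1.closed)
```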
As for the other questions, I agree with alex_jordan that it would help if you showed some of your code.

Keep persistent variables in memory between runs of Python script

Is there any way of keeping a result variable in memory so I don't have to recalculate it each time I run the beginning of my script?
I am doing a long (5-10 sec) series of the exact same operations on a data set (which I am reading from disk) every time I run my script.
This wouldn't be too much of a problem since I'm pretty good at using the interactive editor to debug my code in between runs; however sometimes the interactive capabilities just don't cut it.
I know I could write my results to a file on disk, but I'd like to avoid doing so if at all possible. This should be a solution which generates a variable the first time I run the script, and keeps it in memory until the shell itself is closed or until I explicitly tell it to fizzle out. Something like this:
# Check if variable already created this session
in_mem = var_in_memory()  # Returns pointer to var, or False if not in memory yet
if not in_mem:
    # Read data set from disk
    with open('mydata', 'r') as in_handle:
        mytext = in_handle.read()
    # Extract relevant results from data set
    mydata = parse_data(mytext)
    result = initial_operations(mydata)
    in_mem = store_persistent(result)
I've an inkling that the shelve module might be what I'm looking for here, but looks like in order to open a shelve variable I would have to specify a file name for the persistent object, and so I'm not sure if it's quite what I'm looking for.
Any tips on getting shelve to do what I want it to do? Any alternative ideas?
You can achieve something like this using the reload global function to re-execute your main script's code. You will need to write a wrapper script that imports your main script, asks it for the variable it wants to cache, caches a copy of that within the wrapper script's module scope, and then when you want (when you hit ENTER on stdin or whatever), it calls reload(yourscriptmodule) but this time passes it the cached object such that yourscript can bypass the expensive computation. Here's a quick example.
wrapper.py
# Python 2 code; on Python 3 use importlib.reload() and print().
import sys
import mainscript

part1Cache = None
if __name__ == "__main__":
    while True:
        # Run the expensive part only once per wrapper session
        if part1Cache is None:
            part1Cache = mainscript.part1()
        mainscript.part2(part1Cache)
        print "Press enter to re-run the script, CTRL-C to exit"
        sys.stdin.readline()
        # Pick up any edits made to mainscript.py since the last run
        reload(mainscript)
mainscript.py
def part1():
    print "part1 expensive computation running"
    return "This was expensive to compute"

def part2(value):
    print "part2 running with %s" % value
While wrapper.py is running, you can edit mainscript.py, add new code to the part2 function and be able to run your new code against the pre-computed part1Cache.
To keep data in memory, the process must keep running. Memory belongs to the process running the script, NOT to the shell. The shell cannot hold memory for you.
So if you want to change your code and keep your process running, you'll have to reload the modules when they're changed. If any of the data in memory is an instance of a class that changes, you'll have to find a way to convert it to an instance of the new class. It's a bit of a mess. Not many languages were ever any good at this kind of hot patching (Common Lisp comes to mind), and there are a lot of chances for things to go wrong.
If you only want to persist one object (or object graph) for future sessions, the shelve module probably is overkill. Just pickle the object you care about. Do the work and save the pickle if you have no pickle-file, or load the pickle-file if you have one.
import os
import cPickle as pickle  # Python 2; on Python 3 use "import pickle"

pickle_filepath = "/path/to/picklefile.pickle"
if not os.path.exists(pickle_filepath):
    # Read data set from disk
    with open('mydata', 'r') as in_handle:
        mytext = in_handle.read()
    # Extract relevant results from data set
    mydata = parse_data(mytext)
    result = initial_operations(mydata)
    # Pickle files should be opened in binary mode
    with open(pickle_filepath, 'wb') as pickle_handle:
        pickle.dump(result, pickle_handle)
else:
    with open(pickle_filepath, 'rb') as pickle_handle:
        result = pickle.load(pickle_handle)
Python's shelve is a persistence solution for pickled (serialized) objects and is file-based. The advantage is that it stores Python objects directly, meaning the API is pretty simple.
If you really want to avoid the disk, the technology you are looking for is an "in-memory database". Several alternatives exist; see this SO question: in-memory database in Python.
Weirdly, none of the earlier answers here mention simple text files. The OP says they don't like the idea, but as this is becoming a canonical for duplicates which might not have that constraint, this alternative deserves a mention. If all you need is for some text to survive between invocations of your script, save it in a regular text file.
def main():
    # Before start, read data from previous run
    try:
        with open('mydata.txt', encoding='utf-8') as statefile:
            data = statefile.read().rstrip('\n')
    except FileNotFoundError:
        data = "some default, or maybe nothing"
    updated_data = your_real_main(data)
    # When done, save new data for next run
    with open('mydata.txt', 'w', encoding='utf-8') as statefile:
        statefile.write(updated_data + '\n')
This easily extends to more complex data structures, though then you'll probably need to use a standard structured format like JSON or YAML (for serializing data with tree-like structures into text) or CSV (for a matrix of columns and rows containing text and/or numbers).
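As a rough sketch of the JSON variant, the same read-then-write cycle might look like this. The file name mydata.json, the default state and the helper names are all assumptions made for the example.

```python
import json

STATE_FILE = "mydata.json"  # hypothetical file name


def load_state(path=STATE_FILE):
    """Return the state saved by the previous run, or a default."""
    try:
        with open(path, encoding="utf-8") as statefile:
            return json.load(statefile)
    except FileNotFoundError:
        return {"runs": 0, "seen": []}


def save_state(state, path=STATE_FILE):
    """Write the state as human-readable JSON for the next run."""
    with open(path, "w", encoding="utf-8") as statefile:
        json.dump(state, statefile, indent=2, sort_keys=True)
        statefile.write("\n")


def main(path=STATE_FILE):
    state = load_state(path)
    state["runs"] += 1
    state["seen"].append("run %d" % state["runs"])
    save_state(state, path)
    return state
```

JSON only covers dicts, lists, strings, numbers, booleans and null, so anything else has to be converted before saving.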
Ultimately, shelve and pickle are just glorified generalized versions of the same idea; but if your needs are modest, the benefits of a simple textual format which you can inspect and update in a regular text editor, and read and manipulate with ubiquitous standard tools, and easily copy and share between different Python versions and even other programming languages as well as version control systems etc, are quite compelling.
As an aside, character encoding issues are a complication which you need to plan for; but in this day and age, just use UTF-8 for all your text files.
Another caveat is that beginners are often confused about where to save the file. A common convention is to save it in the invoking user's home directory, though that obviously means multiple users cannot share this data. Another is to save it in a shared location, but this then requires an administrator to separately grant write access to this location (except I guess on Windows; but that then comes with its own tectonic plate of other problems).
The main drawback is that text is brittle if you need multiple processes to update the file in rapid succession, and slow to handle if you have lots of data and need to update parts of it frequently. For these use cases, maybe look at a database (probably start with SQLite, which is robust and nimble and included in the Python standard library; scale up to Postgres or similar if you have enterprise-grade needs).
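For the SQLite route, a minimal key-value sketch using the standard sqlite3 module could look like the following. The table name state and the helpers open_store, put and get are invented for the illustration.

```python
import sqlite3


def open_store(path):
    """Open (or create) a tiny key-value table in an SQLite file."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS state (key TEXT PRIMARY KEY, value TEXT)"
    )
    return conn


def put(conn, key, value):
    # Used as a context manager, the connection commits on success
    # and rolls back if the block raises.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO state (key, value) VALUES (?, ?)",
            (key, value),
        )


def get(conn, key, default=None):
    row = conn.execute(
        "SELECT value FROM state WHERE key = ?", (key,)
    ).fetchone()
    return row[0] if row else default
```

Each put is its own transaction, which is what makes this safer than a text file when several processes write in quick succession.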
And, of course, if you need to store native Python structures, shelve and pickle are still there.
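This is an OS-dependent solution; it relies on a Unix named pipe (FIFO).
$ mkfifo inpipe
#!/usr/bin/python3
# first_process.py
complicated_calculation()
while True:
    # open() blocks until another process writes to the pipe
    with open('inpipe') as f:
        try:
            # exec() returns None, so there is nothing useful to print
            exec(f.read())
        except Exception as e:
            print(e)
$ ./first_process.py &
$ cat second_process.py > inpipe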
This will allow you to change and redefine variables in the first process without copying or recalculating anything. It should be the most efficient solution compared to multiprocessing, memcached, pickle, shelve modules or databases.
This is really nice if you want to edit and redefine second_process.py iteratively in your editor or IDE until you have it right without having to wait for the first process (e.g. initializing a large dict, etc.) to execute each time you make a change.
You can do this but you must use a Python shell. In other words, the shell that you use to start Python scripts must be a Python process. Then, any global variables or classes will live until you close the shell.
Look at the cmd module, which makes it easy to write a shell program. You can even arrange for any commands that are not implemented in your shell to be passed to the system shell for execution (without closing your shell). Then you would have to implement some kind of command, prun for instance, that runs a Python script by using the runpy module.
http://docs.python.org/library/runpy.html
You would need to use the init_globals parameter to pass your special data to the program's namespace, ideally a dict or a single class instance.
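A minimal sketch of such a shell, assuming Python 3. The class name PersistentShell and the shared cache dictionary are made up for the example; cmd.Cmd and runpy.run_path with init_globals are the standard-library pieces mentioned above.

```python
import cmd
import runpy


class PersistentShell(cmd.Cmd):
    """Tiny shell whose `cache` dict survives between script runs."""

    prompt = "(pyshell) "

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.cache = {}  # lives as long as this shell process

    def do_prun(self, script_path):
        """prun SCRIPT: run a Python script with the shared cache."""
        try:
            runpy.run_path(
                script_path,
                init_globals={"cache": self.cache},
                run_name="__main__",
            )
        except Exception as exc:  # keep the shell alive if the script fails
            print("script failed:", exc)

    def do_quit(self, arg):
        """quit: leave the shell and drop the cached data."""
        return True


if __name__ == "__main__":
    PersistentShell().cmdloop()
```

A script started with prun sees a global named cache, so it can test whether "result" is already in cache and skip the expensive step on later runs.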
You could run a persistent script on the server, started by the OS, which loads or calculates (and even periodically reloads or recalculates) the SQL data into in-memory structures of some sort, and then access the in-memory data from your other script through a socket.
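One way to sketch the socket idea is with multiprocessing.connection from the standard library. The serve and lookup helpers, the authkey and the sample table are all hypothetical; the server runs in a thread here only to keep the demo in one file.

```python
import threading
from multiprocessing.connection import Client, Listener


def serve(listener, data):
    """Answer key lookups against `data` until a client sends None."""
    while True:
        with listener.accept() as conn:
            key = conn.recv()
            if key is None:
                conn.send("bye")
                return
            conn.send(data.get(key))


def lookup(address, key, authkey):
    """Ask the running server for one value (call this from another script)."""
    with Client(address, authkey=authkey) as conn:
        conn.send(key)
        return conn.recv()


if __name__ == "__main__":
    AUTHKEY = b"change-me"  # hypothetical shared secret
    table = {"answer": 42}  # stand-in for the expensive in-memory data
    # Port 0 asks the OS for any free port.
    listener = Listener(("127.0.0.1", 0), authkey=AUTHKEY)
    # A real server would be its own long-running process.
    server = threading.Thread(target=serve, args=(listener, table), daemon=True)
    server.start()
    print(lookup(listener.address, "answer", AUTHKEY))
    lookup(listener.address, None, AUTHKEY)  # ask the server to stop
    server.join()
    listener.close()
```

Values travel over the connection as pickles, so only the requested piece is copied rather than the whole data set.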

How to access a data structure from a currently running Python process on Linux?

I have a long-running Python process that is generating more data than I planned for. My results are stored in a list that will be serialized (pickled) and written to disk when the program completes -- if it gets that far. But at this rate, it's more likely that the list will exhaust all 1+ GB free RAM and the process will crash, losing all my results in the process.
I plan to modify my script to write results to disk periodically, but I'd like to save the results of the currently-running process if possible. Is there some way I can grab an in-memory data structure from a running process and write it to disk?
I found code.interact(), but since I don't have this hook in my code already, it doesn't seem useful to me (Method to peek at a Python program running right now).
I'm running Python 2.5 on Fedora 8. Any thoughts?
Thanks a lot.
Shahin
There is not much you can do for a running program. The only thing I can think of is to attach the gdb debugger, stop the process and examine the memory. Alternatively, make sure that your system is set up to save core dumps, then kill the process with kill -SEGV <pid>. You should then be able to open the core dump with gdb and examine it at your leisure.
There are some gdb macros that will let you examine python data structures and execute python code from within gdb, but for these to work you need to have compiled python with debug symbols enabled and I doubt that is your case. Creating a core dump first then recompiling python with symbols will NOT work, since all the addresses will have changed from the values in the dump.
Here are some links for introspecting python from gdb:
http://wiki.python.org/moin/DebuggingWithGdb
http://chrismiles.livejournal.com/20226.html
or google for 'python gdb'
N.B. To make Linux create core dumps, use the ulimit command.
ulimit -a will show you what the current limits are set to.
ulimit -c unlimited will enable core dumps of any size.
While certainly not very pretty, you could try to access your process's data through the proc filesystem: /proc/[pid-of-your-process]. The proc filesystem stores a lot of per-process information, such as currently open file descriptors, memory maps and so on. With a bit of digging you might be able to access the data you need.
Still, I suspect you should rather look at this from within Python and do some runtime logging and debugging.
+1 Very interesting question.
I don't know how well this might work for you (especially since I don't know if you'll reuse the pickled list in the program), but I would suggest this: as you write to disk, print out the list to STDOUT. When you run your python script (I'm guessing also from command line), redirect the output to append to a file like so:
python myScript.py >> logFile.
This should store all the lists in logFile.
This way, you can always take a look at what's in logFile and you should have the most up to date data structures in there (depending on where you call print).
Hope this helps
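For the modified script, periodic saving could be sketched roughly as below. This assumes Python 3 (os.replace is not available in Python 2.5), and the function names, the squaring step and the interval are placeholders rather than anything from the original program.

```python
import os
import pickle


def checkpoint(results, path):
    """Write results to path atomically so a crash never leaves a half file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as handle:
        pickle.dump(results, handle, protocol=pickle.HIGHEST_PROTOCOL)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def run(items, path, every=1000):
    """Process items, saving a checkpoint after every `every` results."""
    results = []
    for count, item in enumerate(items, start=1):
        results.append(item * item)  # stand-in for the real computation
        if count % every == 0:
            checkpoint(results, path)
    checkpoint(results, path)  # final save
    return results
```

Writing to a temporary file first means the previous checkpoint stays intact if the process dies in the middle of a save.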
This answer has info on attaching gdb to a python process, with macros that will get you into a pdb session in that process. I haven't tried it myself but it got 20 votes. Sounds like you might end up hanging the app, but also seems to be worth the risk in your case.

Reading and writing to/from memory in Python

Let's imagine a situation: I have two Python programs. The first one will write some data (str) to computer memory, and then exit. I will then start the second program which will read the in-memory data saved by the first program.
Is this possible?
Sort of.
python p1.py | python p2.py
If p1 writes to stdout, the data goes to memory. If p2 reads from stdin, it reads from memory.
The issue is that there's no "I will then start the second program". You must start both programs so that they share the appropriate memory (in this case, the buffer between stdout and stdin.)
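The pipeline can also be set up from Python itself. In this sketch the two inline programs P1 and P2 are stand-ins for p1.py and p2.py, passed with -c only so that the example fits in one file.

```python
import subprocess
import sys

# Stand-ins for p1.py and p2.py, passed inline with -c.
P1 = "print('data from p1')"
P2 = "import sys; print(sys.stdin.read().strip().upper())"

# Equivalent to the shell command: python p1.py | python p2.py
p1 = subprocess.Popen([sys.executable, "-c", P1], stdout=subprocess.PIPE)
p2 = subprocess.Popen(
    [sys.executable, "-c", P2], stdin=p1.stdout, stdout=subprocess.PIPE
)
p1.stdout.close()  # so p2 sees end-of-file once p1 exits
output = p2.communicate()[0].decode()
p1.wait()
print(output)
```

Both processes are started together; the data passes through the kernel's pipe buffer and never touches the disk.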
What are all these nonsense answers? Of course you can share memory the way you asked, there's no technical reason you shouldn't be able to persist memory other than lack of usermode API.
In Linux you can use shared memory segments which persist even after the program that made them is gone. You can view/edit them with ipcs(1). To create them, see shmget(2) and the related syscalls.
Alternatively you can use POSIX shared memory, which is probably more portable. See shm_overview(7)
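On Python 3.8 and later, named shared memory is available through multiprocessing.shared_memory. The sketch below assumes a POSIX system and plays both roles in one process; the writer and reader variable names are illustrative only.

```python
from multiprocessing import shared_memory

payload = b"hello from the writer"

# "Writer": create a named segment, copy the bytes in, then drop the handle.
writer = shared_memory.SharedMemory(create=True, size=len(payload))
writer.buf[:len(payload)] = payload
segment_name = writer.name  # the reader needs this name to attach
writer.close()              # detaches this handle; the segment still exists

# "Reader": attach to the existing segment by name and copy the bytes out.
reader = shared_memory.SharedMemory(name=segment_name)
received = bytes(reader.buf[:len(payload)])
reader.close()
reader.unlink()             # remove the segment once nobody needs it
print(received)
```

With two separate programs, be aware that, depending on the Python version, the resource tracker may remove the segment when the creating process exits, so keeping it alive after exit can need extra care.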
I suppose you can do it on Windows like this.
Store your data into "memory" using things like databases (e.g. dbm, sqlite, shelve, pickle), where your 2nd program can pick it up later.
No.
Once the first program exits, its memory is completely gone.
You need to write to disk.
"The first one will write some data (str) to computer memory, and then exit."
The OS will then ensure all that memory is zeroed before any other program can see it. (This is an important security measure, as the first program may have been processing your bank statement or may have had your password).
You need to write to persistent storage - probably disk. (Or you could use a ramdisk, but that's unlikely to make any difference to real-world performance).
Alternatively, why do you have 2 programs? Why not one program that does both tasks?
Yes.
Define a RAM file-system.
http://www.vanemery.com/Linux/Ramdisk/ramdisk.html
http://www.cyberciti.biz/faq/howto-create-linux-ram-disk-filesystem/
You can also set up a persistent shared memory area and have one program write to it and the other read it. However, setting up such things is somewhat dependent on the underlying O/S.
Maybe the poster is talking about something like shared memory? Have a look at this: http://poshmodule.sourceforge.net/
