Reading and writing to/from memory in Python

Let's imagine a situation: I have two Python programs. The first one will write some data (str) to computer memory, and then exit. I will then start the second program which will read the in-memory data saved by the first program.
Is this possible?

Sort of.
python p1.py | python p2.py
If p1 writes to stdout, the data goes to memory. If p2 reads from stdin, it reads from memory.
The issue is that there's no "I will then start the second program". You must start both programs so that they share the appropriate memory (in this case, the buffer between stdout and stdin.)
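
For example, a minimal sketch of that pipeline (the file names p1.py and p2.py are just for illustration):

# p1.py -- writes a string to stdout
import sys
sys.stdout.write("some data produced by the first program\n")

# p2.py -- reads whatever p1 wrote from stdin
import sys
data = sys.stdin.read()
sys.stderr.write("second program received: %s" % data)

Run them as python p1.py | python p2.py. The string only ever lives in the pipe buffer, but both processes have to exist at the same time for that buffer to exist.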

What are all these nonsense answers? Of course you can share memory the way you asked; there's no technical reason you shouldn't be able to persist memory, other than the lack of a user-mode API.
In Linux you can use shared memory segments which persist even after the program that made them is gone. You can view/edit them with ipcs(1). To create them, see shmget(2) and the related syscalls.
Alternatively you can use POSIX shared memory, which is probably more portable. See shm_overview(7)
I suppose you can do it on Windows like this.
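
As a rough Python illustration of the POSIX route (rather than shmget): on most Linux systems /dev/shm is the tmpfs mount that backs POSIX shared memory, so anything written there lives in RAM and outlives the process that wrote it. The segment name my_segment is made up for this sketch:

# writer.py -- run first; exits after writing the RAM-backed segment
with open("/dev/shm/my_segment", "w") as f:
    f.write("hello from the first program")

# reader.py -- run later; the data is still there until it is deleted or the machine reboots
with open("/dev/shm/my_segment") as f:
    print(f.read())

Deleting /dev/shm/my_segment afterwards plays the same role as shm_unlink.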

Store your data in "memory" using something like dbm, sqlite, shelve, or pickle, where your second program can pick it up later.
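
For instance, a minimal shelve sketch (the file name state.db is arbitrary):

# program 1: persist the string, then exit
import shelve
db = shelve.open("state.db")
db["message"] = "data produced by the first program"
db.close()

# program 2: started later, picks the string up again
import shelve
db = shelve.open("state.db")
print(db["message"])
db.close()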

No.
Once the first program exits, its memory is completely gone.
You need to write to disk.

"The first one will write some data (str) to computer memory, and then exit."
The OS will then ensure all that memory is zeroed before any other program can see it. (This is an important security measure, as the first program may have been processing your bank statement or may have had your password).
You need to write to persistent storage - probably disk. (Or you could use a ramdisk, but that's unlikely to make any difference to real-world performance).
Alternatively, why do you have 2 programs? Why not one program that does both tasks?

Yes.
Define a RAM file-system.
http://www.vanemery.com/Linux/Ramdisk/ramdisk.html
http://www.cyberciti.biz/faq/howto-create-linux-ram-disk-filesystem/
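
Once such a RAM disk is mounted (the mount point /mnt/ramdisk below is just an example), the two programs use ordinary file I/O against it and the data never touches the physical disk:

# program 1
with open("/mnt/ramdisk/handoff.txt", "w") as f:
    f.write("data for the second program")

# program 2, started later
with open("/mnt/ramdisk/handoff.txt") as f:
    print(f.read())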

You can also set up persistent shared memory area and have one program write to it and the other read it. However, setting up such things is somewhat dependent on the underlying O/S.

Maybe the poster is talking about something like shared memory? Have a look at this: http://poshmodule.sourceforge.net/

Related

various memory errors, python

I am using Python, but recently I have been running into a lot of memory errors.
One is related to saving the plots in .png format. As soon as I try to save them in .pdf format I don't have this problem anymore. How can I still use .png for multiple files?
Secondly, I am reading quite big data files, and after a while I run out of memory. I try closing them each time, but perhaps something is still left open. Is there a way to close all the open files in Python without having handles to them?
And finally, Python should release all unused variables, but I think it is not doing so. If I run just one function I have no problem, but if I run two unrelated functions in a row (after finishing the first and before starting the second, all the variables should, to my understanding, be released), I run into the memory error again during the second one. Therefore I believe the variables are not released after the first run. How can I force Python to release all of them? (I don't want to use del, because there are loads of variables and I don't want to have to name every single one of them.)
Thanks for your help!
Looking at the code would probably bring more clarity.
You can also try doing
import gc
f() #function that eats lots of memory while executing
gc.collect()
This will call the garbage collector, so you can be sure that all abandoned objects are deleted. If that doesn't solve the problem, take a look at the objgraph library http://mg.pov.lt/objgraph/objgraph.html to detect what is leaking memory, or to find the places where you've forgotten to remove a reference to a memory-consuming object.
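As a starting point, a rough sketch of using objgraph for this (assuming it is installed, e.g. with pip install objgraph; f is the hypothetical memory-hungry function from above):

import gc
import objgraph

f()                                        # function that eats lots of memory
gc.collect()                               # drop anything truly unreferenced
objgraph.show_most_common_types(limit=10)  # which object types dominate the heap?
objgraph.show_growth()                     # which type counts have grown?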
Secondly, I am reading quite big data files, and after a while I run out of memory. I try closing them each time, but perhaps something is still left open. Is there a way to close all the open files in Python without having handles to them?
If you use with open(myfile1) as f1: ..., you don't need to worry about closing files or about accidentally leaving files opened.
See here for a good explanation.
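A minimal sketch of that pattern, which also sidesteps holding a whole big file in memory at once (the file name and process() are hypothetical):

with open("big_data_file.txt") as f:
    for line in f:       # iterate lazily, one line at a time
        process(line)    # hypothetical per-line work
# f is closed here automatically, even if an exception was raised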
As for the other questions, I agree with alex_jordan that it would help if you showed some of your code.

Sharing Data in Python

I have some Pickled data, which is stored on disk, and it is about 100 MB in size.
When my Python program is executed, the pickled data is loaded using the cPickle module, and all that works fine.
If I execute the Python script multiple times (python main.py, for example), each Python process will load the same data into memory separately, which is the expected behaviour.
How can I make it so, all new python process share this data, so it is only loaded a single time into memory?
If you're on Unix, one possibility is to load the data into memory, and then have the script use os.fork() to create a bunch of sub-processes. As long as the sub-processes don't attempt to modify the data, they would automatically share the parent's copy of it, without using any additional memory.
Unfortunately, this won't work on Windows.
P.S. I once asked about placing Python objects into shared memory, but that didn't produce any easy solutions.
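A minimal sketch of that fork approach (Unix only; data.pkl and do_work() are made-up names):

import os
import cPickle

with open("data.pkl", "rb") as f:
    data = cPickle.load(f)       # loaded once, in the parent

children = []
for _ in range(4):
    pid = os.fork()
    if pid == 0:                 # child: sees the parent's copy of `data`
        do_work(data)            # hypothetical read-only worker
        os._exit(0)
    children.append(pid)

for pid in children:
    os.waitpid(pid, 0)           # parent waits for all children

Note that CPython's reference counting writes to the objects as they are used, so some pages will still be copied; for a large, mostly read-only structure, the bulk of it is nevertheless shared.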
Depending on how seriously you need to solve this problem, you may want to look at memcached, if that is not overkill.

Python - HardDrive access when opening files

If you open a file for reading, read from it, close it, and then repeat the process (in a loop), does Python continually access the hard drive? It doesn't seem to, judging from my experimentation, and I'd like to understand why.
An (extremely) simple example:
import time

while True:
    f = open('/var/log/messages', 'r')
    stuff = f.read()
    f.close()
    time.sleep(2)
If I run this, my hard-drive access LED lights up once and then the hard-drive remains dormant. How is this possible? Is python somehow keeping the contents of the file stored in RAM? Because logic would tell me that it should be accessing the hard-drive every two seconds.
The answer depends on your operating system and the type of hard drive you have. Most of the time, when you access something off the drive, the information is cached in main memory in case you need it again soon. Depending on the replacement strategy used by your OS, the data may stay in main memory for a while or be replaced relatively soon.
In some cases, your hard drive will cache frequently accessed information in its own memory, and then while the drive might be accessed, it will retrieve the information faster and send it to the processor faster than if it had to search the drive platters.
Likely your operating system or file-system is smart enough to serve the file from the OS cache if the file has not changed in between.
Python does not cache, the operating system does. You can find out the size of these caches with top. In the third line, it will say something like:
Mem: 5923332k total, 5672832k used, 250500k free, 43392k buffers
That means about 43 MB are being used by the OS to cache data recently written to or read from the hard disk. You can drop these caches by writing 2 or 3 to /proc/sys/vm/drop_caches (this clears the cache; it does not disable caching).

Any write functions in Python that have the same safety as ACID does in databases

The title could probably have been put better, but anyway: I was wondering if there are any functions for writing to files that have the same guarantees the ACID properties give for databases. The reason is, I would like to make sure that the file writes I am doing won't mess up and corrupt the file if the power goes out.
Depending on what exactly you're doing with your files and the platform there are a couple options:
If you're serializing a blob from memory to disk repeatedly to maintain state (example: dhcp leases file),
if you're on a Posix system you can write your data to a temporary file and 'rename' the temporary file to your target. On Posix compliant systems this is guaranteed to be an atomic operation, shouldn't even matter if the filesystem is journaled or not. If you're on a Windows system, there's a native function named MoveFileTransacted that you might be able to utilize via bindings. But the key concept here is, the temporary file protects your data, if the system reboots the worst case is that your file contains the last good refresh of data. This option requires that you write the entire file out every time you want a change to be recorded. In the case of dhcp.leases file this isn't a big performance hit, larger files might prove to be more cumbersome.
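A rough sketch of that temp-file-plus-rename pattern in Python (the file name state.json is made up; os.replace needs Python 3.3+, while plain os.rename is already atomic on POSIX):

import os
import tempfile

def atomic_write(path, data):
    # create the temp file in the target directory so the rename cannot
    # cross filesystems, which would break atomicity
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())      # push the bytes out of the OS cache
        os.replace(tmp_path, path)    # atomic swap into place
    except Exception:
        os.remove(tmp_path)
        raise

atomic_write("state.json", '{"lease": "192.168.0.10"}')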
If you're reading and writing bits of data constantly, sqlite3 is the way to go -- it supports atomic commits for groups of queries and has its own internal journal. One thing to watch out for here is that atomic commits will be slower due to the overhead of locking the database, waiting for the data to flush, etc.
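For example, a minimal sqlite3 sketch where a group of statements either all reach the disk or none do (table and file names are made up):

import sqlite3

conn = sqlite3.connect("leases.db")
conn.execute("CREATE TABLE IF NOT EXISTS leases (mac TEXT PRIMARY KEY, ip TEXT)")
with conn:   # commits on success, rolls back if an exception escapes the block
    conn.execute("INSERT OR REPLACE INTO leases VALUES (?, ?)",
                 ("aa:bb:cc:dd:ee:ff", "192.168.0.10"))
conn.close()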
A couple other things to consider -- if your filesystem is mounted async, writes will appear to be complete because the write() returns, but it might not be flushed to disk yet. Rename protects you in this case, sqlite3 does as well.
If your filesystem is mounted async, it might be possible to write data and move it before the data is written. So if you're on a unix system it might be safest to mount sync. That's on the level of 'people might die if this fails' paranoia though. But if it's an embedded system and it dies 'I might lose my job if this fails' is also a good rationalization for the extra protection.
The ZODB is an ACID compliant database storage written in (mostly) python, so in a sense the answer is yes. But I can imagine this is a bit overkill :)
Either the OS has to provide this for you, or you'll need to implement your own ACID compliance. For example, by defining 'records' in the file you write and, when opening/reading, verifying which records have been fully written (which may mean you need to throw away some partially written data). ZODB, for example, implements this by ending a record with the size of the record itself; if you can read this size and it matches, you know the record has been fully written.
And, of course, you always need to append records and not rewrite the entire file.
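
A rough sketch of that record idea, assuming an append-only file where each record is a length prefix, a CRC32 checksum, and the payload; on reading back, a record whose length or checksum does not check out is treated as a torn write and the tail of the file is discarded:

import os
import struct
import zlib

def append_record(path, payload):
    header = struct.pack("<II", len(payload), zlib.crc32(payload) & 0xffffffff)
    with open(path, "ab") as f:
        f.write(header + payload)
        f.flush()
        os.fsync(f.fileno())

def read_records(path):
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break                      # truncated header: stop
            size, crc = struct.unpack("<II", header)
            payload = f.read(size)
            if len(payload) < size or zlib.crc32(payload) & 0xffffffff != crc:
                break                      # torn or corrupt record: ignore the rest
            records.append(payload)
    return records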
It looks to me that your main goal is to ensure the integrity of written files in case of power failures and system crashes. There are a couple of things to be considered when doing this:
Ensure that your data is written to disk when you close a file. Even if you close it, some of the data may be in OS cache for several seconds waiting to be written to the disk. You can force writing to disk with f.flush(), followed with os.fsync(f.fileno()).
Don't modify existing data before you are certain that the updated data is safely on the disk. This part can be quite tricky (and OS/filesystem dependent).
Use file format that helps you to verify the integrity of data (e.g. use checksums).
Another alternative is to use sqlite3.
EDIT: Regarding my second point, I highly recommend this presentation: http://www.flamingspork.com/talks/2007/06/eat_my_data.odp. This also covers issues with "atomic rename".

How to access a data structure from a currently running Python process on Linux?

I have a long-running Python process that is generating more data than I planned for. My results are stored in a list that will be serialized (pickled) and written to disk when the program completes -- if it gets that far. But at this rate, it's more likely that the list will exhaust all 1+ GB free RAM and the process will crash, losing all my results in the process.
I plan to modify my script to write results to disk periodically, but I'd like to save the results of the currently-running process if possible. Is there some way I can grab an in-memory data structure from a running process and write it to disk?
I found code.interact(), but since I don't have this hook in my code already, it doesn't seem useful to me (Method to peek at a Python program running right now).
I'm running Python 2.5 on Fedora 8. Any thoughts?
Thanks a lot.
Shahin
There is not much you can do for a running program. The only thing I can think of is to attach the gdb debugger, stop the process and examine the memory. Alternatively, make sure that your system is set up to save core dumps, then kill the process with kill -SIGSEGV <pid>. You should then be able to open the core dump with gdb and examine it at your leisure.
There are some gdb macros that will let you examine python data structures and execute python code from within gdb, but for these to work you need to have compiled python with debug symbols enabled and I doubt that is your case. Creating a core dump first then recompiling python with symbols will NOT work, since all the addresses will have changed from the values in the dump.
Here are some links for introspecting python from gdb:
http://wiki.python.org/moin/DebuggingWithGdb
http://chrismiles.livejournal.com/20226.html
or google for 'python gdb'
N.B. To get Linux to create core dumps, use the ulimit command.
ulimit -a will show you what the current limits are set to.
ulimit -c unlimited will enable core dumps of any size.
While certainly not very pretty, you could try to access the data of your process through the proc filesystem: /proc/[pid-of-your-process]. The proc filesystem stores a lot of per-process information, such as currently open file descriptors, memory maps and what not. With a bit of digging you might be able to access the data you need.
Still, I suspect you should rather look at this from within Python and do some runtime logging & debugging.
+1 Very interesting question.
I don't know how well this might work for you (especially since I don't know if you'll reuse the pickled list in the program), but I would suggest this: as you write to disk, print out the list to STDOUT. When you run your python script (I'm guessing also from command line), redirect the output to append to a file like so:
python myScript.py >> logFile
This should store all the lists in logFile.
This way, you can always take a look at what's in logFile and you should have the most up to date data structures in there (depending on where you call print).
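A minimal sketch of that idea for the modified script (compute_results() is a made-up placeholder for whatever produces the data):

import sys

for result in compute_results():   # hypothetical generator of results
    print(result)                  # lands in logFile when run with >> logFile
    sys.stdout.flush()             # don't leave results sitting in the stdout buffer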
Hope this helps
This answer has info on attaching gdb to a python process, with macros that will get you into a pdb session in that process. I haven't tried it myself but it got 20 votes. Sounds like you might end up hanging the app, but also seems to be worth the risk in your case.
