Python - Hard drive access when opening files

If you open a file for reading, read from it, close it, and then repeat the process in a loop, does Python continually access the hard drive? It doesn't seem to in my experiments, and I'd like to understand why.
An (extremely) simple example:
import time

while True:
    file = open('/var/log/messages', 'r')
    stuff = file.read()
    file.close()
    time.sleep(2)
If I run this, my hard-drive access LED lights up once and then the drive remains dormant. How is this possible? Is Python somehow keeping the contents of the file in RAM? Logic tells me it should be hitting the hard drive every two seconds.

The answer depends on your operating system and the type of hard drive you have. Most of the time, when you access something off the drive, the information is cached in main memory in case you need it again soon. Depending on the replacement strategy used by your OS, the data may stay in main memory for a while or be replaced relatively soon.
In some cases, your hard drive caches frequently accessed data in its own onboard memory; the drive may still be accessed, but it can return the data much faster than if it had to seek across the platters.

Likely your operating system or file-system is smart enough to serve the file from the OS cache if the file has not changed in between.

Python does not cache here; the operating system does. You can see the size of these caches with top, where the memory line will show something like:
Mem: 5923332k total, 5672832k used, 250500k free, 43392k buffers
That means about 43 MB are being used by the OS to buffer data recently written to or read from the hard disk. You can drop these caches by writing 2 or 3 to /proc/sys/vm/drop_caches (this clears them once; it does not disable caching).
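As a rough illustration of the effect (a sketch, Linux only, and it must run as root because of the write to /proc/sys/vm/drop_caches), the snippet below drops the page cache once and then times a cold read against a warm read of the same file; the second read is typically much faster because it is served from RAM. The file path is just an example.

import os
import time

def timed_read(path):
    start = time.monotonic()
    with open(path, 'rb') as f:
        f.read()
    return time.monotonic() - start

path = '/var/log/messages'  # any reasonably large, readable file will do

os.sync()  # flush dirty pages so they can be dropped
with open('/proc/sys/vm/drop_caches', 'w') as f:
    f.write('3\n')  # 3 = drop page cache plus dentries and inodes

print('cold read: %.4f s' % timed_read(path))
print('warm read: %.4f s' % timed_read(path))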

Related

Python: How to link a filename to a memory file

What I want to realize has the following features:
The Python program (or process, thread, ...) creates a memory file that can be read and written.
As long as the program is alive, the file data exists only in memory (NOT on disk). Once the program is no longer alive, no data is left behind.
However, there is an interface on disk with a filename. This interface is linked to the memory file, and read and write operations on the interface are possible.
Why not use IO?
The memory file will be the input of another program (not Python), so a file name is needed.
Why not use tempfile?
The major reason is security. The finalization of tempfile differs between operating systems (right?), and in some occasional cases, such as the OS being interrupted, data may remain on disk. So a program holding the data itself seems more secure (at least to an extent).
Anyway, I just want to see whether tempfile can be avoided.
You could consider using a named pipe (created with mkfifo). Another option is to create an actual file which the two programs open; once both have opened it, you can unlink it so that it is no longer accessible on disk.
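A minimal sketch of the named-pipe approach (POSIX only, with wc -c standing in for the other, non-Python program): the data gets a filename but passes through kernel pipe buffers rather than landing on disk.

import os
import subprocess
import tempfile

fifo_dir = tempfile.mkdtemp()
fifo_path = os.path.join(fifo_dir, 'memfile.fifo')
os.mkfifo(fifo_path)

try:
    # Start the consumer first; it blocks until a writer opens the FIFO.
    consumer = subprocess.Popen(['wc', '-c', fifo_path])
    with open(fifo_path, 'wb') as f:
        f.write(b'in-memory data that never touches the disk\n')
    consumer.wait()
finally:
    os.unlink(fifo_path)
    os.rmdir(fifo_dir)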

Should I delete temporary files created by my script?

It's a common question, not specific to any language or platform. Who is responsible for a file created in the system's $TEMP folder?
If it's my duty, why should I care where to put this file? I could place it anywhere with the same result.
If it's the OS's responsibility, can I forget about this file right after use?
Thanks and sorry for my basic English.
As a general rule, you should remove the temporary files that you create.
Recall that the $TEMP directory is a shared resource that other programs can use. Failure to remove the temporary files will have an impact on the other programs that use $TEMP.
What kind of impact? That will depend upon the other programs. If those other programs create a lot of temporary files, their execution will be slower, because creating a new temporary file takes longer: the directory has to be scanned on each creation to ensure the file name is unique.
Consider the following (based on real events) ...
In years past, my group at work had to use the Intel C Compiler. We found that over time, it appeared to be slowing down. That is, the time it took to run our sanity tests using it took longer and longer. This also applied to building/compiling a single C file. We tracked the problem down.
ICC was opening, stat'ing and reading every file under $TEMP. For what purpose, I know not. Although the argument can be made that the problem lay with ICC, the existence of the files under $TEMP was slowing it and our development team down. Deleting those temporary files got the sanity checks running in less than half an hour instead of more than two hours, a significant time saver.
Hope this helps.
There is no standard and no common rules. In most OSs, the files in the temporary folder will pile up. Some systems try to prevent this by deleting files in there automatically after some time but that sometimes causes grief, for example with long running processes or crash backups.
The reason for $TEMP to exist is that many programs (especially in early times when RAM was scarce) needed a place to store temporary data since "super computers" in the 1970s had only a few KB of RAM (yes, N*1024 bytes where N is << 100 - you couldn't even fit the image of your mouse cursor into that). Around 1980, 64KB was a lot.
The solution was a folder where anyone could write. Security wasn't an issue at the time, memory was.
Over time, OSs started to get better systems to create temporary files and to clean them up but backwards compatibility prevented a clean, "work for all" solution.
So even though you know where the data ends up, you are responsible for cleaning up your files yourself. To make error analysis easier, I tend to write my code so that files are only deleted when everything is fine; that way, I can look at intermediate results to figure out what went wrong. But logging is often a better and safer solution.
Related: Memory prices 1957-2014. 12 KB of RAM cost US $4,680 in 1973.
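A minimal sketch of the "delete only on success" pattern described above; process is a hypothetical stand-in for whatever your script actually does with the temporary file.

import os
import tempfile

def run(process):
    fd, path = tempfile.mkstemp(prefix='myscript-', suffix='.tmp')
    os.close(fd)
    try:
        process(path)  # write and read intermediate data in `path`
    except Exception:
        # Keep the file so the intermediate results can be inspected.
        print('run failed; intermediate data kept at', path)
        raise
    else:
        os.remove(path)  # clean up only when everything went fine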

About the speed of random file read (Python)

Please take a look at the following code (kind of pseudo code):
index = db.open()
fh = open('somefile.txt', 'rb')
for i in range(1000):
    x = random_integer(1, 5000)
    pos, length = index[x]
    fh.seek(pos)
    buffer = fh.read(length)
    doSomeThingWith(buffer)
fh.close()
db.close()
I used a database to index the positions and lengths of text segments in a .txt file for random retrieval.
Not surprisingly, if the above code is run repeatedly, the execution takes less and less time.
1) What is responsible for this speed-up? Is it because things stay in memory, because of "caching", or something else?
2) Is there any way to control it?
3) I've compared with other methods where the text segments are stored in Berkeley DB and so on. When at its fastest, the above code is faster than retrieval from Berkeley DB. How do I judge the performance of my database+file solution? I mean, is it safe to judge it as at least "fast enough"?
what is behind and responsible for this speed-up?
It could be the operating system's disk cache. http://en.wikipedia.org/wiki/Page_cache
Once you've read a chunk of a file from disk once, it will hang around in RAM for a while. RAM is orders of magnitude faster than disk, so you'll see a lot of variability in the time it takes to read random pieces of a large file.
Or, depending on what "db" is, the database implementation could be doing its own caching.
Is there any way to control it?
If it's the disk cache:
It depends on the operating system, but it's typically a pretty coarse-grained control; for example, you may be forced to disable caching for an entire volume, which would affect other processes on the system reading from that volume, and would affect every other file that lived on that volume. It would also probably require root/admin access.
See this similar question about disabling caching on Linux: Linux : Disabling File cache for a process?
Depending on what you're trying to do, you can force-flush the disk cache. This can be useful in situations where you want to run a test with a cold cache, letting you get an idea of the worst-case performance. (This also depends on your OS and may require root/admin access.)
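If you only want to evict a single file rather than the whole cache, here is a minimal sketch (Linux, Python 3.3+) using posix_fadvise; unlike writing to /proc/sys/vm/drop_caches it needs no root privileges and only affects that one file, which is handy for cold-cache benchmarking of the random-read loop above.

import os

def evict_from_page_cache(path):
    # Ask the kernel to drop this file's cached pages; only clean
    # (already written-back) pages are affected.
    fd = os.open(path, os.O_RDONLY)
    try:
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)

evict_from_page_cache('somefile.txt')  # the next read of this file is a cold read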
If it's the database:
Depends on the database. If it's a local database, you may just be seeing disk cache effects, or the database library could be doing its own caching. If you're talking to a remote database, the caching could be happening locally or remotely (or both).
There may be configuration options to disable or control caching at either of these layers.

Any write functions In python that have the same safety as ACID does in databases

The title could probably have been put better, but anyway: I was wondering whether there are any functions for writing to files that give the same guarantees the ACID properties do for databases. The reason is that I would like to make sure the file writes I am doing won't mess up and corrupt the file if the power goes out.
Depending on what exactly you're doing with your files and the platform there are a couple options:
If you're serializing a blob from memory to disk repeatedly to maintain state (example: dhcp leases file),
if you're on a POSIX system you can write your data to a temporary file and rename() the temporary file over your target. On POSIX-compliant systems this is guaranteed to be an atomic operation, whether or not the filesystem is journaled. If you're on a Windows system, there's a native function named MoveFileTransacted that you might be able to use via bindings. The key concept here is that the temporary file protects your data: if the system reboots, the worst case is that your file contains the last good refresh of data. This option requires that you write the entire file out every time you want a change to be recorded. For a dhcp.leases file this isn't a big performance hit; larger files might prove more cumbersome.
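A minimal sketch of that write-then-rename pattern (POSIX; the file name is just an example): the new contents are written to a temporary file in the same directory, flushed to disk, and then renamed over the target, so readers see either the old file or the new one, never a half-written mix. For full durability you may also want to fsync the containing directory after the rename.

import os
import tempfile

def atomic_write(path, data):
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)  # same filesystem as the target
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before the rename
        os.rename(tmp_path, path)  # atomic on POSIX filesystems
    except Exception:
        os.unlink(tmp_path)
        raise

atomic_write('state.txt', b'current state of the program\n')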
If you're reading and writing bits of data constantly, sqlite3 is the way to go -- it supports atomic commits for groups of queries and has its own internal journal. One thing to watch out for here is that atomic commits will be slower due to the overhead of locking the database, waiting for the data to flush, and so on.
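For the sqlite3 route, a minimal sketch of grouping several writes into a single atomic transaction: either every row in the block is committed or, if the process dies partway through, none of them appear after recovery. The table and file names are just examples.

import sqlite3

conn = sqlite3.connect('state.db')
conn.execute('CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)')
with conn:  # commits on success, rolls back if an exception escapes the block
    for key, value in [('a', '1'), ('b', '2'), ('c', '3')]:
        conn.execute('INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)',
                     (key, value))
conn.close()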
A couple of other things to consider: if your filesystem is mounted async, writes will appear to be complete as soon as write() returns, but the data might not be flushed to disk yet. The rename approach protects you in this case; sqlite3 does as well.
If your filesystem is mounted async, it might be possible to write data and move it before the data is written. So if you're on a unix system it might be safest to mount sync. That's on the level of 'people might die if this fails' paranoia though. But if it's an embedded system and it dies 'I might lose my job if this fails' is also a good rationalization for the extra protection.
The ZODB is an ACID compliant database storage written in (mostly) python, so in a sense the answer is yes. But I can imagine this is a bit overkill :)
Either the OS has to provide this for you, or you'll need to implement your own ACID compliance, for example by defining 'records' in the file you write and, when opening/reading, verifying which records have been fully written (which may mean you need to throw away some partially written data). ZODB, for example, implements this by ending each record with the record's size; if you can read that size and it matches, you know the record was fully written.
And, of course, you always need to append records and not rewrite the entire file.
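A minimal sketch of that record idea (not ZODB's actual on-disk format): each record is appended with a length field, and when reading back, a trailing record whose payload is shorter than its declared length is treated as partially written and discarded.

import struct

def append_record(f, payload):
    # 8-byte big-endian length field, then the payload itself.
    f.write(struct.pack('>Q', len(payload)) + payload)
    f.flush()

def read_records(f):
    records = []
    while True:
        header = f.read(8)
        if len(header) < 8:
            break  # clean end of file, or a torn header from a crash
        (length,) = struct.unpack('>Q', header)
        payload = f.read(length)
        if len(payload) < length:
            break  # record was only partially written: throw it away
        records.append(payload)
    return records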
It looks to me like your main goal is to ensure the integrity of written files in case of power failures and system crashes. There are a couple of things to consider when doing this:
Ensure that your data is written to disk when you close a file. Even if you close it, some of the data may be in OS cache for several seconds waiting to be written to the disk. You can force writing to disk with f.flush(), followed with os.fsync(f.fileno()).
Don't modify existing data before you are certain that the updated data is safely on the disk. This part can be quite tricky (and OS/filesystem dependent).
Use file format that helps you to verify the integrity of data (e.g. use checksums).
Another alternative is to use sqlite3.
EDIT: Regarding my second point, I highly recommend this presentation: http://www.flamingspork.com/talks/2007/06/eat_my_data.odp. This also covers issues with "atomic rename".
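A minimal sketch combining the flush/fsync and checksum points above: the data is forced out of the OS cache when written, and a SHA-256 checksum stored next to it lets the reader detect corruption or a partial write.

import hashlib
import os

def write_with_checksum(path, data):
    with open(path, 'wb') as f:
        f.write(hashlib.sha256(data).hexdigest().encode('ascii') + b'\n')
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # make sure the data actually reaches the disk

def read_with_checksum(path):
    with open(path, 'rb') as f:
        stored = f.readline().strip().decode('ascii')
        data = f.read()
    if hashlib.sha256(data).hexdigest() != stored:
        raise IOError('checksum mismatch: file is corrupt or partially written')
    return data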

Reading and writing to/from memory in Python

Let's imagine a situation: I have two Python programs. The first one will write some data (str) to computer memory, and then exit. I will then start the second program which will read the in-memory data saved by the first program.
Is this possible?
Sort of.
python p1.py | python p2.py
If p1 writes to stdout, the data goes to memory. If p2 reads from stdin, it reads from memory.
The issue is that there's no "I will then start the second program". You must start both programs so that they share the appropriate memory (in this case, the buffer between stdout and stdin.)
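A minimal sketch of that pipeline, assuming the two scripts are named p1.py and p2.py: the bytes only ever live in the pipe buffer between the two processes.

# p1.py
import sys
sys.stdout.write('some data produced in memory\n')

# p2.py
import sys
data = sys.stdin.read()
sys.stderr.write('received: %r\n' % data)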
What are all these nonsense answers? Of course you can share memory the way you asked; there's no technical reason you shouldn't be able to persist memory other than the lack of a usermode API.
In Linux you can use shared memory segments which persist even after the program that made them is gone. You can view/edit them with ipcs(1). To create them, see shmget(2) and the related syscalls.
Alternatively you can use POSIX shared memory, which is probably more portable. See shm_overview(7)
I suppose you can do it on Windows like this.
Store your data in "memory" using things like databases, e.g. dbm, sqlite, shelve, pickle, etc., where your second program can pick it up later.
No.
Once the first program exits, its memory is completely gone.
You need to write to disk.
The first one will write some data (str) to computer memory, and then exit.
The OS will then ensure all that memory is zeroed before any other program can see it. (This is an important security measure, as the first program may have been processing your bank statement or may have had your password).
You need to write to persistent storage - probably disk. (Or you could use a ramdisk, but that's unlikely to make any difference to real-world performance).
Alternatively, why do you have 2 programs? Why not one program that does both tasks?
Yes.
Define a RAM file-system.
http://www.vanemery.com/Linux/Ramdisk/ramdisk.html
http://www.cyberciti.biz/faq/howto-create-linux-ram-disk-filesystem/
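A minimal sketch of the RAM-disk idea using /dev/shm, which on most Linux systems is already a tmpfs (RAM-backed) mount: the file has a normal path the second program can open later, but its contents never touch the hard disk (and disappear on reboot).

# first program: write to the RAM-backed filesystem, then exit
with open('/dev/shm/handoff.txt', 'w') as f:
    f.write('data passed along via tmpfs\n')

# second program, started later: read it back
with open('/dev/shm/handoff.txt') as f:
    print(f.read())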
You can also set up persistent shared memory area and have one program write to it and the other read it. However, setting up such things is somewhat dependent on the underlying O/S.
Maybe the poster is talking about something like shared memory? Have a look at this: http://poshmodule.sourceforge.net/
