About the speed of random file read (Python)

About the speed of random file read (Python) - python

Please take a look at the following code (kind of pseudo code):
index = db.open()
fh = open('somefile.txt','rb')
for i in range(1000):
x = random_integer(1,5000)
pos,length = index[x]
fh.seek(pos)
buffer = fh.read(length)
doSomeThingWith(buffer)
fh.close()
db.close()
I used a database to index the positions and lengths of text segments in a .txt file for random retrieval.
No wonder, if the above code is run repeatedly, the execution takes less and less time.
1) What is responsible for this speed-up? Is it because of things staying in the memory or the "caching" or something?
2) Is there anyway to control it?
3) I've compared with other methods where the text segments are stored in Berkeley DB and so on. When at its fastest, the above code is faster than retrieval from Berkeley DB. How do I judge the performance of my database+file solution? I mean, is it safe to judge it as at least "fast enough"?

what is behind and responsible for this speed-up?
It could be the operating system's disk cache. http://en.wikipedia.org/wiki/Page_cache
Once you've read a chunk of a file from disk once, it will hang around in RAM for a while. RAM is orders of magnitude faster than disk, so you'll see a lot of variability in the time it takes to read random pieces of a large file.
Or, depending on what "db" is, the database implementation could be doing its own caching.
Is there anyway to control it?
If it's the disk cache:
It depends on the operating system, but it's typically a pretty coarse-grained control; for example, you may be forced to disable caching for an entire volume, which would affect other processes on the system reading from that volume, and would affect every other file that lived on that volume. It would also probably require root/admin access.
See this similar question about disabling caching on Linux: Linux : Disabling File cache for a process?
Depending on what you're trying to do, you can force-flush the disk cache. This can be useful in situations where you want to run a test with a cold cache, letting you get an idea of the worst-case performance. (This also depends on your OS and may require root/admin access.)
If it's the database:
Depends on the database. If it's a local database, you may just be seeing disk cache effects, or the database library could be doing its own caching. If you're talking to a remote database, the caching could be happening locally or remotely (or both).
There may be configuration options to disable or control caching at either of these layers.

Related

tmp file exists for some ms, will it be written to SSD?

Is a file stored to disc, when only present for a fraction of a second?
I'm running with python 3.7 on ubuntu 18.04.
I make use of a python script. This script extracts json-files from a zip-package. Resulting files will be processed. Afterwards the resulting files well be deleted.
As I'm running on an SSD. I want to spare write cycles to it.
Does linux buffer such write cycles to the RAM, or do I need to assume, that I'm forcing my poor SSD into sever thousend write cycles per second?

Linux may cache file operations under some circumstances, but you're looking for it to optimize by avoiding ever committing a whole sequence of operations to storage at all, based on there being no net effect. I do not think you can expect that.
It sounds like you might be better served by using a different filesystem in the first place. Linux has memory-backed file systems (served by the tmpfs filesystem driver, for example), so perhaps you want to set up such a filesystem for your application to use for these scratch files.1 Do note, however, that these are backed by virtual memory, so, although this approach should reduce the number of write cycles on your SSD, it might not eliminate all writes.
1 For example, see https://unix.stackexchange.com/a/66331/289373

Run a python source code on RAM

I have written a code which does some processing , I want to reduce the execution time of the program and I think it can be done if I run it on my RAM which is 1GB.
So will running my program form RAM make any difference to my execution time and if yes how it can be done.

Believe it or not, when you use a modernish computer system, most of your computation is done from RAM. (Well, technically, it's "done" from processor registers, but those are filled from RAM so let's brush that aside for the purposes of this answer)
This is thanks to the magic we call caches and buffers. A disk "cache" in RAM is filled by the operating system whenever something is read from permanent storage. Any further reads of that same data (until and unless it is "evicted" from the cache) only read memory instead of the permanent storage medium.
A "buffer" works similarly for write output, with data first being written to RAM and then eventually flushed out to the underlying medium.
So, in the course of normal operation, any runs of your program after the first (unless you've done a lot of work in between), will already be from RAM. Ditto the program's input file: if it's been read recently, it's already cached in memory! So you're unlikely to be able to speed things up by putting it in memory yourself.
Now, if you want to force things for some reason, you can create a "ramdisk", which is a filesystem backed by RAM. In Linux the easy way to do this is to mount "tmpfs" or put files in the /dev/shm directory. Files on a tmpfs filesystem go away when the computer loses power and are entirely stored in RAM, but otherwise behave like normal disk-backed files. From the way your question is phrased, I don't think this is what you want. I think your real answer is "whatever performance problems you think you have, this is not the cause, sorry".

Python - HardDrive access when opening files

If you open a file for reading, read from it, close it, and then repeat the process (in a loop) does python continually access the hard-drive? Because it doesn't seem to from my experimentation, and I'd like to understand why.
An (extremely) simple example:
while True:
file = open('/var/log/messages', 'r')
stuff = file.read()
file.close()
time.sleep(2)
If I run this, my hard-drive access LED lights up once and then the hard-drive remains dormant. How is this possible? Is python somehow keeping the contents of the file stored in RAM? Because logic would tell me that it should be accessing the hard-drive every two seconds.

The answer depends on your operating system and the type of hard drive you have. Most of the time, when you access something off the drive, the information is cached in main memory in case you need it again soon. Depending on the replacement strategy used by your OS, the data may stay in main memory for a while or be replaced relatively soon.
In some cases, your hard drive will cache frequently accessed information in its own memory, and then while the drive might be accessed, it will retrieve the information faster and send it to the processor faster than if it had to search the drive platters.

Likely your operating system or file-system is smart enough to serve the file from the OS cache if the file has not changed in between.

Python does not cache, the operating system does. You can find out the size of these caches with top. In the third line, it will say something like:
Mem: 5923332k total, 5672832k used, 250500k free, 43392k buffers
That means about 43MB are being used by the OS to cache data recently written to or read from the hard disk. You can turn this caching off by writing 2 or 3 to /proc/sys/vm/drop_caches.

Why does running SQLite (through python) cause memory to "unofficially" fill up?

I'm dealing with some big (tens of millions of records, around 10gb) database files using SQLite. I'm doint this python's standard interface.
When I try to insert millions of records into the database, or create indices on some of the columns, my computer slowly runs out of memory. If I look at the normal system monitor, it looks like the majority of the system memory is free. However, when I use top, it looks like I have almost no system memory free. If I sort the processes by their memory consuption, then non of them uses more than a couple percent of my memory (including the python process that is running sqlite).
Where is all the memory going? Why do top and Ubuntu's system monitor disagree about how much system memory I have? Why does top tell me that I have very little memory free, and at the same time not show which process(es) is (are) using all the memory?
I'm running Ubuntu 11.04, sqlite3, python 2.7.

10 to 1 says you are confused by linux's filesystem buffer/cache
see
ofstream leaking memory
https://superuser.com/questions/295900/linux-sort-all-data-in-memory/295902#295902
Test it by doing (as root)
echo 3 > /proc/sys/vm/drop_caches

The memory may be not assigned to a process, but it can be e.g. a file on tmpfs filesystem (/dev/shm, /tmp sometimes). You should show us the output of top or free (please note those tools do not show a single 'memory usage' value) to let us tell something more about the memory usage.
In case of inserting records to a database it may be a temporary image created for the current transaction, before it is committed to the real database. Splitting the insertion into many separate transactions (if applicable) may help.
I am just guessing, not enough data.
P.S. It seems I mis-read the original question (I assumed the computer slows down) and there is no problem. sehe's answer is probably better.

Any write functions In python that have the same safety as ACID does in databases

The title could have probably been put better, but anyway. I was wondering if there are any functions for writing to files that are like what the ACID properties are for databases. Reason is, I would like to make sure that the file writes I am doin won't mess up and corrupt the file if the power goes out.

Depending on what exactly you're doing with your files and the platform there are a couple options:
If you're serializing a blob from memory to disk repeatedly to maintain state (example: dhcp leases file),
if you're on a Posix system you can write your data to a temporary file and 'rename' the temporary file to your target. On Posix compliant systems this is guaranteed to be an atomic operation, shouldn't even matter if the filesystem is journaled or not. If you're on a Windows system, there's a native function named MoveFileTransacted that you might be able to utilize via bindings. But the key concept here is, the temporary file protects your data, if the system reboots the worst case is that your file contains the last good refresh of data. This option requires that you write the entire file out every time you want a change to be recorded. In the case of dhcp.leases file this isn't a big performance hit, larger files might prove to be more cumbersome.
If you're reading and writing bits of data constantly, sqlite3 is the way to go -- it supports atomic commits for groups of queries and has it's own internal journal. One thing to watch out for here is that atomic commits will be slower due to the overhead of locking the database, waiting for the data to flush, etc.
A couple other things to consider -- if your filesystem is mounted async, writes will appear to be complete because the write() returns, but it might not be flushed to disk yet. Rename protects you in this case, sqlite3 does as well.
If your filesystem is mounted async, it might be possible to write data and move it before the data is written. So if you're on a unix system it might be safest to mount sync. That's on the level of 'people might die if this fails' paranoia though. But if it's an embedded system and it dies 'I might lose my job if this fails' is also a good rationalization for the extra protection.

The ZODB is an ACID compliant database storage written in (mostly) python, so in a sense the answer is yes. But I can imagine this is a bit overkill :)
Either the OS has to provide this for you, or you'll need to implement your own ACID compliancy. For example, by defining 'records' in the file you write and, when opening/reading, verifying which records have been written (which may mean you need to throw away some non-fully written data). ZODB, for example, implements this by ending a record by writing the size of the record itself; if you can read this size and it matches, you know the record has been fully written.
And, of course, you always need to append records and not rewrite the entire file.

It looks to me that your main goal is to ensure the integrity of written files in case of power failures and system crashes. There a couple of things to be considered when doing this:
Ensure that your data is written to disk when you close a file. Even if you close it, some of the data may be in OS cache for several seconds waiting to be written to the disk. You can force writing to disk with f.flush(), followed with os.fsync(f.fileno()).
Don't modify existing data before you are certain that the updated data is safely on the disk. This part can be quite tricky (and OS/filesystem dependent).
Use file format that helps you to verify the integrity of data (e.g. use checksums).
Another alternative is to use sqlite3.
EDIT: Regarding my second point, I highly recommend this presentation: http://www.flamingspork.com/talks/2007/06/eat_my_data.odp. This also covers issues with "atomic rename".

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.