I have a rather naive question about the read performance of files in Python.
I am implementing an application that needs to read from a binary file. The data is organised in blocks ("events", all occupying the same space and encoding the same type of information), and for each event it might not be necessary to read all of the information. I have written a function that uses memory mapping to maintain a file "pointer", which is used to seek() and read() from the file:
import mmap

self.f = open("myFile", "rb")
self._mmf = mmap.mmap(self.f.fileno(), length=0, access=mmap.ACCESS_READ)
Since I know which bytes correspond to which information in the event, I was thinking of implementing a function that receives the type of information needed, seek()s to that position, read()s only the relevant bytes, and then repositions the file pointer at the beginning of the event for possible subsequent calls (which may or may not happen).
My question is: is this implementation necessarily slower than reading the entire event at once (and possibly using the information only partially), or does it depend on the event size relative to how many seek() calls the workflow makes?
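A minimal sketch of the kind of accessor described above; the event size, file name, and field offsets are made up for illustration:

```python
import mmap

EVENT_SIZE = 64  # hypothetical: every event occupies 64 bytes


class EventReader:
    """Sketch of a reader that fetches only selected bytes of an event."""

    def __init__(self, path):
        self.f = open(path, "rb")
        self._mmf = mmap.mmap(self.f.fileno(), length=0, access=mmap.ACCESS_READ)

    def read_field(self, event_index, offset, size):
        # Slice the mapping directly: no seek()/read() bookkeeping is needed,
        # and the OS only pages in the regions actually touched.
        start = event_index * EVENT_SIZE + offset
        return self._mmf[start:start + size]


# Demo: two events of 64 bytes each, holding the values 0..63.
with open("events.bin", "wb") as out:
    out.write(bytes(range(EVENT_SIZE)) * 2)

reader = EventReader("events.bin")
print(reader.read_field(1, 4, 3))  # bytes 4..6 of the second event
```

With a memory map, slicing sidesteps the "reposition the pointer afterwards" problem entirely, since no shared file position is moved.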
Thanks!
I was looking into the tempfile options in Python when I ran into SpooledTemporaryFile. The description says:
This function operates exactly as TemporaryFile() does, except that data is spooled in memory until the file size exceeds max_size, or until the file’s fileno() method is called, at which point the contents are written to disk and operation proceeds as with TemporaryFile().
I would like to understand exactly what that means. I have looked around but found no answer, so if I understand correctly:
Is the written data buffered in RAM until it reaches a certain threshold and then saved to disk? What is the advantage of this method over the standard approach? Is it faster? Because in the end it will have to be saved to disk anyway...
Anyway, if anyone can provide me with some insights I'd be grateful.
The data is buffered in memory before writing. The question becomes, will anything be written?
Not necessarily. TemporaryFile, by default, deletes the file after it is closed. If a spooled file never gets big enough to write to disk (and the user never tries to invoke its fileno method), the buffer will simply be flushed when it is closed, with no actual disk I/O taking place.
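The rollover behaviour is easy to observe. This sketch peeks at `_rolled`, a private CPython implementation detail, purely to show when the spool moves to disk:

```python
import tempfile

# max_size is the spool threshold: below it, the data lives only in RAM.
with tempfile.SpooledTemporaryFile(max_size=1024) as spool:
    spool.write(b"small payload")
    in_memory = not spool._rolled     # _rolled is a private CPython attribute
    spool.write(b"x" * 2048)          # crossing max_size triggers rollover
    rolled = spool._rolled            # now backed by a real temporary file

print(in_memory, rolled)
```

Calling `spool.fileno()` would also force the rollover, as the quoted documentation says, because a real file descriptor requires a real file.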
It might depend on the OS and even on the hardware, but is there a way in Python to ask a write operation on a file to happen "in place", i.e. at the same location as the original file, and if possible on the same sectors on disk?
Example: let's say sensitivedata.raw, a 4 KB file, has to be encrypted:
with open('sensitivedata.raw', 'r+b') as f:  # read/write, binary mode
    s = f.read()
    cipher = encryption_function(s)  # same exact length as the input s
    f.seek(0)
    f.write(cipher)  # how to ask this write operation to overwrite the original bytes?
Example 2: overwrite a file with null-byte content of the same size, to prevent an undelete tool from recovering it (of course, to do it properly we would need several passes with random data and not only null bytes, but this is just to give the idea):
import os

with open('sensitivedata.raw', 'r+b') as f:
    s = f.read()
    f.seek(0)
    f.write(len(s) * b'\x00')  # totally inefficient, but just to get the idea
os.remove('sensitivedata.raw')
PS: if it really depends a lot on the OS, I'm primarily interested in the Windows case.
Side question: if it's not possible in the case of an SSD, does this mean that if you ever once wrote sensitive data as plaintext on an SSD (for example a password, a crypto private key, or anything else), there is no way to be sure that this data is really erased? I.e., is the only solution to wipe 100% of the disk with many passes of random bytes? Is that correct?
That's an impossible requirement to impose. On most spinning disk drives this will happen automatically (there's no reason to write the new data elsewhere when it could just overwrite the existing data directly), but SSDs can't do this (when they claim to do so, they're lying to the OS).
SSDs can't rewrite blocks; they can only erase a block, or write to an empty block. The implementation of a "rewrite" is to write to a new block (reading from the original block to fill out the block if there isn't enough new data), then (eventually, because it's relatively expensive) erase the old block to make it available for a future write.
Update addressing the side question: the only truly secure solution is to run your drive through a woodchipper, then crush the remains with a millstone. :-) In practice, though, the window of vulnerability on an SSD should be relatively short. Erasing sectors is expensive, so even SSDs that don't honor TRIM typically erase freed blocks in the background, to ensure future (cheap) write operations aren't held up by (expensive) erase operations.
This isn't really so bad when you think about it: sure, the data is visible for a period of time after you logically erased it, but it was visible for a period of time before you erased it too. All this does is extend the window of vulnerability by seconds, minutes, hours, or days, depending on the drive. The real mistake was storing sensitive data to permanent storage in the first place; even with extreme (woodchipper + millstone) solutions, someone could have snuck in and copied the data before you thought to encrypt or destroy it.
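On media where in-place overwrite does happen automatically (spinning disks on a conventional, non-copy-on-write filesystem), a best-effort single zero pass might look like the sketch below. The function name is made up, and, per the answer above, none of this is guaranteed on an SSD:

```python
import os


def best_effort_overwrite(path):
    """Overwrite a file's bytes at the same logical offsets (best effort:
    no guarantee on SSDs or copy-on-write filesystems such as Btrfs/ZFS)."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:        # binary mode: no newline translation
        f.seek(0)
        f.write(b"\x00" * size)         # single zero pass, for illustration
        f.flush()                       # push Python's buffer to the OS
        os.fsync(f.fileno())            # ask the OS to send it to the device
```

The caller can then `os.remove(path)`, as in Example 2; a real secure-erase tool would make several passes with random data.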
I have a piece of code that runs in an infinite loop and is cancelled from the terminal with Ctrl+C. In this code I call json.dump(dictionary, outfile).
I have noticed that the data does not actually appear in the file until I use Ctrl+C to terminate the process. Why does the file not update until after the program has terminated?
Anthony Rossi is basically right: you need to flush the data using outfile.flush(). But why is that so?
json.dump expects "a .write()-supporting file-like object", see here. Somewhere in your code, you have used open to get your outfile. If we have a look at the documentation for open, we can read the following:
buffering is an optional integer used to set the buffering policy. Pass 0 to switch buffering off (only allowed in binary mode), 1 to select line buffering (only usable in text mode), and an integer > 1 to indicate the size in bytes of a fixed-size chunk buffer. When no buffering argument is given, the default buffering policy works as follows:
Binary files are buffered in fixed-size chunks; the size of the buffer is chosen using a heuristic trying to determine the underlying device’s “block size” and falling back on io.DEFAULT_BUFFER_SIZE. On many systems, the buffer will typically be 4096 or 8192 bytes long.
“Interactive” text files (files for which isatty() returns True) use line buffering. Other text files use the policy described above for binary files.
I guess you haven't specified the buffering parameter, and your data is smaller than 4 or 8 KiB. The write is therefore buffered and not immediately written to the file.
When you kill your program using Ctrl+C, the outfile is closed and flushes the data to your file.
To fix this, simply put outfile.flush() after your json.dump() as Anthony Rossi suggested.
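A bounded stand-in for the asker's loop, assuming the records go to a hypothetical log.json, might look like this:

```python
import json

# Flush after each dump so every record is visible in the file
# while the program is still running (the asker's loop is infinite;
# three iterations stand in for it here).
with open("log.json", "w") as outfile:
    for i in range(3):
        json.dump({"tick": i}, outfile)
        outfile.write("\n")
        outfile.flush()  # push this record out of Python's buffer now
```

Without the flush() call, all three records would typically sit in the buffer until the file is closed.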
Is there a simple way to read all the allocated clusters of a given file with python? The usual python read() seemingly only allows me to read up to the logical size of the file (which is reasonable, of course), but I want to read all the clusters including slack space.
For example, I have a file called "test.bin" that is 1234 bytes in logical size, but because my file system uses clusters of 4096 bytes, the file has a physical size of 4096 bytes on disk. I.e., there are 2862 bytes of file slack space.
I'm not sure where to even start with this problem. I know I can read the raw disk from /dev/sda, but I'm not sure how to locate the clusters of interest. Of course, that is the whole point of having a file system (matching file names to sectors on disk), but I don't know enough about how Python interacts with the file system to figure this out yet. Any help or pointers to references would be greatly appreciated.
Assuming an ext2/3/4 filesystem, as you guessed yourself, your best bet is probably to:
use a wrapper (like this one) around debugfs to get the list of blocks associated with a given file:
debugfs: blocks ./f.txt
2562
to read-back that/those block(s) from the block device / image file
>>> f = open('/tmp/test.img','rb')
>>> f.seek(2562*4*1024)
10493952
>>> data = f.read(4*1024)
>>> data
b'Some data\n\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...'
Not very fancy, but it will work. Note that you don't have to mount the filesystem to do any of these steps. This is especially important for forensic applications, where you cannot trust the content of the disk in any way and/or are not allowed by regulation to mount the disk image.
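The two steps could be glued together with subprocess, as a rough sketch. The output parsing is deliberately simplistic, and real debugfs runs need root privileges; the function names are made up:

```python
import subprocess


def parse_blocks(debugfs_output):
    """Turn debugfs 'blocks' output such as '2562 2563' into a list of ints."""
    return [int(b) for b in debugfs_output.split()]


def file_blocks(device, path):
    # 'debugfs -R "blocks <path>" <device>' runs a single debugfs request;
    # it needs root, and the filesystem does not have to be mounted.
    out = subprocess.run(["debugfs", "-R", "blocks %s" % path, device],
                         capture_output=True, text=True, check=True).stdout
    return parse_blocks(out)


def read_blocks(device, blocks, block_size=4096):
    """Read whole blocks (file data plus slack) straight from the device."""
    chunks = []
    with open(device, "rb") as dev:
        for b in blocks:
            dev.seek(b * block_size)
            chunks.append(dev.read(block_size))
    return b"".join(chunks)
```

For the example above, `read_blocks('/tmp/test.img', [2562])` would return the same 4096 bytes as the manual seek/read session.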
There is an open-source C forensic tool that implements this kind of file access successfully.
Here is an overview of the tool: Link.
You can download it here.
It basically uses the POSIX system call open(), which returns a raw file descriptor (an integer) that you can use with the POSIX system calls read() and write(), without the restriction of stopping at EOF that prevents you from accessing the file slack.
There are lots of examples online of how to make such system calls from Python, e.g. this one.
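For reference, Python exposes these same system calls through os.open / os.read / os.close, which operate on a raw integer file descriptor. A minimal sketch (the file name is made up); note that in my understanding os.read on an ordinary file still stops at its logical size, so the slack itself has to come from the underlying device, as in the previous answer:

```python
import os

# Create a small file to read back through the raw-descriptor interface.
with open("example.bin", "wb") as f:
    f.write(b"hello")

# os.open/os.read/os.close are thin wrappers over the POSIX system calls.
fd = os.open("example.bin", os.O_RDONLY)
try:
    chunk = os.read(fd, 4096)  # reads at most 4096 bytes from the current offset
finally:
    os.close(fd)

print(chunk)
```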
I am trying to teach myself Python by reading documentation. I am trying to understand what it means to flush a file buffer. According to the documentation, file.flush does the following:
Flush the internal buffer, like stdio's fflush().
This may be a no-op on some file-like objects.
I don't know what "internal buffer" and "no-op" mean, but I think it says that flush writes data from some buffer to a file.
Hence, I ran this file, toggling the comment (#) on the file.flush() line in the middle:
with open("myFile.txt", "w+") as file:
    file.write("foo")
    file.write("bar")
    # file.flush()
    file.write("baz")
    file.write("quux")
However, I seem to get the same myFile.txt with and without the call to file.flush(). What effect does file.flush() have?
Python buffers writes to files. That is, file.write returns before the data is actually written to your hard drive. The main motivation for this is that a few large writes are much faster than many tiny writes, so by saving up the output of file.write until some has accumulated, Python can maintain good writing speeds.
file.flush forces the data to be written out at that moment. This is handy when you know that it might be a while before you have more data to write out, but you want other processes to be able to view the data you've already written. Imagine a log file that grows slowly: you don't want to wait ages before enough entries have built up to cause the data to be written out in one big chunk.
In either case, file.close flushes the remaining data, so the "quux" in your code will be written out as soon as file (which is a really bad name, as it shadows the built-in file constructor in Python 2) falls out of the with context manager and gets closed.
Note: your OS does buffering of its own. file.flush only moves the data from Python's internal buffer into the OS's buffers, which is enough for other processes to see it; if you need a guarantee that the data has reached the physical drive, call os.fsync(file.fileno()) after flushing.
By the way, "no-op" means "no operation": it won't actually do anything. For example, StringIO objects manipulate strings in memory, not files on your hard drive, so StringIO.flush can simply return immediately because there is nothing for it to do.
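A small sketch of the difference (the file name app.log is made up): flush pushes the data to the OS so that other readers see it immediately, and os.fsync additionally asks the OS to commit it to the device:

```python
import os

with open("app.log", "w") as log:
    log.write("first entry\n")
    log.flush()                 # move the data from Python's buffer to the OS
    os.fsync(log.fileno())      # optionally, ask the OS to commit it to disk

    # A second reader opened at this point already sees the entry,
    # even though `log` is still open for writing.
    with open("app.log") as reader:
        snapshot = reader.read()

print(snapshot)
```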
Buffered content may be held in memory to improve performance; flush makes sure that it is written out completely, avoiding data loss if the program crashes. It is also useful when, for example, you want a line asking for user input to appear completely on screen before the next file operation takes place.