I was looking into the tempfile options in Python when I ran into SpooledTemporaryFile. The description says:
This function operates exactly as TemporaryFile() does, except that data is spooled in memory until the file size exceeds max_size, or until the file’s fileno() method is called, at which point the contents are written to disk and operation proceeds as with TemporaryFile().
I would like to understand exactly what that means. I have looked around but found no answer, so let me check whether I get it right:
Is the written data 'buffered' in RAM until it reaches a certain threshold and then saved to disk? What is the advantage of this method over the standard approach? Is it faster? Because in the end it will have to be saved to disk anyway...
Anyway, if anyone can provide me with some insights I'd be grateful.
The data is buffered in memory before writing. The question becomes, will anything be written?
Not necessarily. TemporaryFile, by default, deletes the file after it is closed. If a spooled file never gets big enough to write to disk (and the user never tries to invoke its fileno method), the buffer will simply be flushed when it is closed, with no actual disk I/O taking place.
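To make that concrete, here is a minimal sketch (the max_size value and the payloads are just illustrative; _rolled is a private CPython implementation detail, peeked at only to show when the rollover happens):

import tempfile

# Spool up to 1 KiB in memory; only roll over to a real file on disk
# once more than max_size bytes have been written (or fileno() is called).
with tempfile.SpooledTemporaryFile(max_size=1024, mode="w+b") as tmp:
    tmp.write(b"small payload")   # stays in an in-memory buffer
    print(tmp._rolled)            # False: nothing has touched the disk yet

    tmp.write(b"x" * 2048)        # exceeds max_size, so the contents are spilled
    print(tmp._rolled)            # True: now backed by a real temporary file

If the spooled file is closed while it is still small, the buffer is simply discarded and no disk I/O ever happens.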
I have a rather naive question about the speed of reading from a file in Python.
I am implementing an application which needs to read from a binary file. The data is organised in blocks (events, all occupying the same space and encoding the same type of information) where for each event it might not be necessary to read all info. I have written a function using memory mapping to maintain a file "pointer" which is used to seek() and read() from the file.
import mmap

self.f = open("myFile", "rb")
self._mmf = mmap.mmap(self.f.fileno(), length=0, access=mmap.ACCESS_READ)
Since I know which bytes correspond to which info in the event, I was thinking of implementing a function that receives which type of information is needed from the event, seek()s to that position and read()s only the relevant bytes, then repositions the file pointer at the beginning of the event for possible subsequent calls (which may or may not be needed).
My question is: is this implementation necessarily expected to be slower than reading the entire event at once (and possibly using that info only partially), or does it depend on the event size compared to how many seek() calls I end up making in the workflow?
Thanks!
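For what it's worth, a minimal sketch of the seek-and-read scheme described in the question above (the event size, field offset and field length are made-up numbers):

import mmap

EVENT_SIZE = 64         # hypothetical fixed size of one event, in bytes
FIELD_OFFSET = 16       # hypothetical offset of the wanted field within an event
FIELD_LENGTH = 4        # hypothetical length of that field

with open("myFile", "rb") as f:
    mm = mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ)
    event_start = mm.tell()              # remember where the current event begins
    mm.seek(event_start + FIELD_OFFSET)  # jump straight to the field that is needed
    field = mm.read(FIELD_LENGTH)        # read only those bytes
    mm.seek(event_start)                 # reposition for possible later calls
    mm.close()

Note that seek() on an mmap object only moves an internal offset, so the real cost is dominated by which pages of the file actually get faulted in, not by how often you reposition.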
This might not even be an issue but I've got a couple of related Python questions that will hopefully help clear up a bit of debugging I've been stuck on for the past week or two.
If you call a function that returns a large object, is there a way to ignore the return value in order to save memory?
My best example is, let's say you are piping a large text file, line-by-line, to another server and when you complete the pipe, the function returns a confirmation for every successful line. If you pipe too many lines, the returned list of confirmations could potentially overrun your available memory.
for line in lines:
    connection.put(line)
response = connection.execute()
If you remove the response variable, I believe the return value is still loaded into memory so is there a way to ignore/block the return value when you don't really care about the response?
More background: I'm using the redis-python package to pipeline a large number of set-additions. My processes occasionally die with out-of-memory issues even though the file itself is not THAT big and I'm not entirely sure why. This is just my latest hypothesis.
I think the confirmation response is not big enough to overrun your memory. In Python, the more likely culprit is how lines was produced: if it holds the whole file (for example, from readlines()), every line sits in memory at once, which is a large memory cost.
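A rough sketch of both points, assuming the redis-py pipeline API (pipeline(), sadd(), execute()) and a made-up key name; iterating the file object streams one line at a time, and executing the pipeline in batches keeps each response list small:

import redis

BATCH = 10000                      # arbitrary batch size

r = redis.Redis()                  # connection details omitted
with open("big_file.txt") as f:    # iterating the file object streams line by line
    pipe = r.pipeline()
    pending = 0
    for line in f:
        pipe.sadd("my:set", line.strip())   # hypothetical key name
        pending += 1
        if pending >= BATCH:
            pipe.execute()         # responses for this batch only, then discarded
            pipe = r.pipeline()    # start a fresh pipeline
            pending = 0
    if pending:
        pipe.execute()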
I am teaching myself Python by reading the documentation, and I am trying to understand what it means to flush a file buffer. According to the documentation, file.flush does the following.
Flush the internal buffer, like stdio‘s fflush().
This may be a no-op on some file-like objects.
I don't know what "internal buffer" and "no-op" mean, but I think it says that flush writes data from some buffer to a file.
Hence, I ran this script, toggling the pound sign on the line in the middle.
with open("myFile.txt", "w+") as file:
file.write("foo")
file.write("bar")
# file.flush()
file.write("baz")
file.write("quux")
However, I seem to get the same myFile.txt with and without the call to file.flush(). What effect does file.flush() have?
Python buffers writes to files. That is, file.write returns before the data is actually written to your hard drive. The main motivation of this is that a few large writes are much faster than many tiny writes, so by saving up the output of file.write until a bit has accumulated, Python can maintain good writing speeds.
file.flush forces the data to be written out at that moment. This is handy when you know that it might be a while before you have more data to write out, but you want other processes to be able to view the data you've already written. Imagine a log file that grows slowly: you don't want to wait ages before enough entries have built up for the data to be written out in one big chunk.
In either case, file.close causes the remaining data to be flushed, so "quux" in your code will be written out as soon as file (which is a really bad name as it shadows the builtin file constructor) falls out of scope of the with context manager and gets closed.
Note: your OS does some buffering of its own, and file.flush only pushes the data from Python's buffer into the OS; if you need a guarantee that it has actually reached the physical drive, follow flush with os.fsync(file.fileno()).
By the way, "no-op" means "no operation", as in it won't actually do anything. For example, StringIO objects manipulate strings in memory, not files on your hard drive. StringIO.flush probably just immediately returns because there's not really anything for it to do.
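As a small illustration (the file name is arbitrary), flushing makes the data visible to a second reader before the file is closed:

with open("myFile.txt", "w") as out:
    out.write("foo")
    out.write("bar")
    # Without flush(), another reader may still see an empty file here,
    # because "foobar" can be sitting in Python's internal write buffer.
    out.flush()
    with open("myFile.txt") as check:
        print(check.read())        # "foobar" is now visible to other readers
    out.write("baz")
    out.write("quux")
# close() at the end of the with block flushes the rest ("bazquux")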
Buffer contents may be held back to improve performance. Flush makes sure that the content is actually written out, avoiding data loss. It is also useful when, for example, you want a prompt asking for user input to be printed completely on screen before the next file operation takes place.
I have to copy and do some simple processing on a file. I cannot read the whole file into memory because it is too big. I came up with a piece of code which looks like this:
buffer = inFile.read(buffer_size)
while len(buffer) > 0:
    outFile.write(buffer)
    simpleCalculations(buffer)
    buffer = inFile.read(buffer_size)
The simpleCalculations procedure is irrelevant in this context, but I am worried about the repeated memory allocations for buffer. On some hardware configurations memory usage gets very high and that apparently kills the machine. I would like to reuse buffer. Is this possible in Python 2.6?
I don't think there's any easy way around this. The file.read() method just returns a new string each time you call it. On the other hand, you don't really need to worry about running out of memory -- once you assign buffer to the newly-read string, the previously-read string no longer has any references to it, so its memory gets freed automatically (see here for more details).
CPython being a reference-counted environment, your buffer will be deallocated as soon as you no longer hold any references to it.
If you're worried about physical RAM but have spare address space, you could mmap your file rather than reading it in a bit at a time.
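A rough sketch of that mmap-based approach, with placeholder file names and chunk size (written for current Python; on 2.6 the two context managers would need to be nested):

import mmap

CHUNK = 64 * 1024                  # arbitrary chunk size

def simpleCalculations(data):      # stand-in for the questioner's processing step
    pass

with open("inFile", "rb") as src, open("outFile", "wb") as dst:
    mm = mmap.mmap(src.fileno(), length=0, access=mmap.ACCESS_READ)
    for start in range(0, len(mm), CHUNK):
        chunk = mm[start:start + CHUNK]   # copies just this chunk out of the mapping
        dst.write(chunk)
        simpleCalculations(chunk)
    mm.close()

The operating system pages the mapped file in and out on demand, so physical memory stays bounded even though the whole file is addressable.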
I have a file open for writing, and a process running for days -- something is written to the file at relatively random moments. My understanding is that -- until I do file.close() -- there is a chance nothing is really saved to disk. Is that true?
What if the system crashes while the main process is not finished yet? Is there a way to do a kind of commit every... say -- 10 minutes (where I call this commit myself -- no need to run a timer)? Are file.close() and open(file, 'a') the only way, or are there better alternatives?
You should be able to use file.flush() to do this.
If you don't want to kill the currently running process just to add f.flush() (it sounds like it has been running for days already?), you should be OK. If you see the file you are writing to getting bigger, you will not lose that data...
From the Python docs:
write(str)
Write a string to the file. There is no return value. Due to buffering, the string may not actually show up in the file until the flush() or close() method is called.
It sounds like Python's buffering system will automatically flush file objects, but it is not guaranteed when that happens.
To make sure that your data is written to disk, use file.flush() followed by os.fsync(file.fileno()).
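A minimal sketch of such a "commit" helper (the file name and the 10-minute interval are placeholders):

import os
import time

def commit(f):
    # Push Python's buffer to the operating system, then ask the OS
    # to push its own cache to the physical drive.
    f.flush()
    os.fsync(f.fileno())

log = open("long_running.log", "a")        # hypothetical long-lived log file
last_commit = time.time()

def write_entry(text):
    global last_commit
    log.write(text + "\n")
    if time.time() - last_commit >= 600:   # roughly every 10 minutes
        commit(log)
        last_commit = time.time()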
As has already been stated, use the .flush() method to force the buffer to be written out, but avoid making a lot of calls to flush, as this can actually slow your writing down (if the application relies on fast writes): you'll be forcing your filesystem to write changes that are smaller than its buffer size, which can bring performance to its knees. :)