I'm trying to call a process on a file after part of it has been read. For example:
import subprocess

with open('in.txt', 'r') as a, open('out.txt', 'w') as b:
    header = a.readline()
    subprocess.call(['sort'], stdin=a, stdout=b)
This works fine if I don't read anything from a before doing the subprocess.call, but if I read anything from it, the subprocess doesn't see anything. This is using Python 2.7.3. I can't find anything in the documentation that explains this behaviour, and a (very) brief glance at the subprocess source didn't reveal a cause.
If you open the file unbuffered then it works:
import subprocess

with open('in.txt', 'rb', 0) as a, open('out.txt', 'w') as b:
    header = a.readline()
    rc = subprocess.call(['sort'], stdin=a, stdout=b)
The subprocess module works at the file descriptor level (the operating system's low-level unbuffered I/O). It may work with os.pipe(), socket.socket(), pty.openpty(), or anything else with a valid .fileno() method, if the OS supports it.
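For example, here is a hedged sketch (the data and command are made up) that feeds a child process from an os.pipe() instead of a regular file; anything that exposes a real OS-level descriptor via .fileno() can be used the same way:

import os
import subprocess

r, w = os.pipe()                               # raw OS pipe: two file descriptors
os.write(w, b"banana\napple\ncherry\n")        # data for the child to read
os.close(w)                                    # close the write end so 'sort' sees EOF
with os.fdopen(r, 'rb') as pipe_in:            # wrap the read end; it has a real .fileno()
    subprocess.call(['sort'], stdin=pipe_in)   # the child reads directly from the descriptor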
It is not recommended to mix buffered and unbuffered I/O on the same file.
On Python 2, file.flush() causes the output to appear, e.g.:
import subprocess
# 2nd
with open(__file__) as file:
    header = file.readline()
    file.seek(file.tell())  # synchronize (for io.open and Python 3)
    file.flush()            # synchronize (for C stdio-based file on Python 2)
    rc = subprocess.call(['cat'], stdin=file)
The issue can be reproduced without the subprocess module, using os.read():
#!/usr/bin/env python
# 2nd
import os
with open(__file__) as file:  # XXX fully buffered text file EATS INPUT
    file.readline()  # ignore header line
    os.write(1, os.read(file.fileno(), 1 << 20))
If the buffer size is small then the rest of the file is printed:
#!/usr/bin/env python
# 2nd
import os
bufsize = 2  # XXX MAY EAT INPUT
with open(__file__, 'rb', bufsize) as file:
    file.readline()  # ignore header line
    os.write(2, os.read(file.fileno(), 1 << 20))
It eats more input if the first line size is not evenly divisible by bufsize.
The default bufsize and bufsize=1 (line-buffered) behave similarly on my machine: the beginning of the file vanishes -- around 4KB of it.
For all buffer sizes, file.tell() reports the position of the beginning of the 2nd line. Using next(file) instead of file.readline() makes file.tell() report around 5K on my machine on Python 2, due to the read-ahead buffer bug (io.open() gives the expected 2nd-line position).
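The mismatch between the buffered position and the OS-level position can be seen directly. A small sketch (it assumes any multi-line file; here the script reads itself):

import os

with open(__file__, 'rb') as file:
    file.readline()                                 # read the header through the buffer
    print(file.tell())                              # logical position: start of the 2nd line
    print(os.lseek(file.fileno(), 0, os.SEEK_CUR))  # actual fd offset: typically much larger (the buffer size)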
Trying file.seek(file.tell()) before the subprocess call doesn't help on Python 2 with the default stdio-based file objects. It works with the open() functions from the io and _pyio modules on Python 2, and with the default open() (which is also io-based) on Python 3.
Trying the io and _pyio modules on Python 2 and Python 3, with and without file.flush(), produces various results. This confirms that mixing buffered and unbuffered I/O on the same file descriptor is not a good idea.
This happens because the subprocess module extracts the file handle from the file object.
http://hg.python.org/releasing/2.7.6/file/ba31940588b6/Lib/subprocess.py
See line 1126, reached from line 701.
The file object uses buffers and has already read a lot from the file handle by the time subprocess extracts it.
As mentioned by @jfs:
When using Popen, Python passes the file descriptor to the child process,
while at the same time Python itself reads in chunks (e.g. 4096 bytes),
so the position at the fd level is different from what you would expect.
I solved it in Python 2.7 by aligning the file descriptor position:
import codecs
import os
import subprocess

_file = open(some_path)
_file.read(len(codecs.BOM_UTF8))  # skip the UTF-8 BOM
os.lseek(_file.fileno(), _file.tell(), os.SEEK_SET)  # align the fd position with the buffered position
truncate_null_cmd = ['tr', '-d', '\\000']
subprocess.Popen(truncate_null_cmd, stdin=_file, stdout=subprocess.PIPE)
Related
I am sniffing network packets using Tshark (command-line Wireshark) and writing them to a file as I receive them. My code block is similar to the following:
import json
from Queue import PriorityQueue  # 'queue' on Python 3

documents = PriorityQueue(maxsize=0)
writing_enabled = True

with open("output.txt", 'w') as opened_file:
    while writing_enabled:
        try:
            data = documents.get(timeout=1)
        except Exception as e:
            # No document pushed by the producer thread
            continue
        opened_file.write(json.dumps(data) + "\n")
If I receive packets from the Tshark thread, I put them into the queue, and then another thread writes them to a file using the code above. However, after the file reaches 600+ MB the process slows down and then its status changes to Not Responding. After some research I think this is because of the default buffering mechanism of the open method. Is it reasonable to change
with open("output.txt", 'w') as opened_file:
into
with open("output.txt", 'w', 1000) as opened_file:
to use a 1000-byte buffer for writing? Or is there another way to overcome this?
To write the internal buffer out to the file you can use the file's flush function. However, this should generally be handled by your operating system, which has a default buffer size. You can use something like this to open your file if you want to specify your own buffer size:
f = open('file.txt', 'w', buffering=bufsize)
Please also see the following question: How often does Python flush to file
As an alternative to flushing the buffer, you could also try rolling files, i.e. open a new file once the size of the currently opened file exceeds a certain limit. This is generally good practice if you intend to write a lot of data; a sketch follows.
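A minimal sketch of the rolling-file idea (the names rolling_writer, base_path and max_bytes are made up for illustration, and None is assumed as a stop sentinel from the producer):

import json

def rolling_writer(documents, base_path, max_bytes=100 * 1024 * 1024):
    index = 0
    opened_file = open("%s.%d" % (base_path, index), 'w')
    written = 0
    while True:
        data = documents.get()              # block until the producer pushes a document
        if data is None:                    # assumed sentinel meaning "stop writing"
            break
        line = json.dumps(data) + "\n"
        opened_file.write(line)
        written += len(line)
        if written >= max_bytes:            # current file is big enough: roll over to a new one
            opened_file.close()
            index += 1
            opened_file = open("%s.%d" % (base_path, index), 'w')
            written = 0
    opened_file.close()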
How can I flush the content written to a file opened as a numeric file handle?
For illustration, one can do the following in Python:
f = open(fn, 'w')
f.write('Something')
f.flush()
By contrast, I am missing the corresponding method when doing the following:
import os

fd = os.open(fn, os.O_RDWR)  # os.open requires flags; O_RDWR is just an example
os.pwrite(fd, buffer, offset)
# How do I flush fd here?
Use os.fsync(fd). See docs for fsync.
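For the numeric-handle case above, a hedged sketch (the path, flags and data are assumptions; os.pwrite needs Python 3.3+):

import os

fd = os.open('data.bin', os.O_WRONLY | os.O_CREAT, 0o644)
os.pwrite(fd, b'Something', 0)   # write the bytes at offset 0
os.fsync(fd)                     # force the OS to push its cached data for this descriptor to disk
os.close(fd)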
Be careful if you call fsync on a file descriptor obtained from a Python file object: in that case you need to flush the Python file object first.
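For the file-object case, the two layers have to be flushed in order; a small sketch (the filename is made up):

import os

f = open('log.txt', 'w')
f.write('Something')
f.flush()                 # step 1: move Python's buffer down to the OS
os.fsync(f.fileno())      # step 2: ask the OS to write its cache to disk
f.close()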
Can someone tell me why, in Python 3.4.2, when I try
import codecs
f = codecs.open('/home/filename', 'w', 'utf-8')
print ('something', file = f)
it gives me an empty file?
Previously it was working well, but it suddenly stopped printing to the file.
File writing is buffered to avoid the performance drain of writing to disk in tiny increments. The buffer is flushed when it fills past a threshold, when you flush explicitly, or when you close the file.
You have not closed the file, have not flushed the buffer, and haven't written enough data to the file to trigger an automatic flush.
Do one of the following:
Flush the buffer:
f.flush()
This can be done with the flush argument to print() as well:
print('something', file=f, flush=True)
but the argument requires Python 3.3 or newer.
Close the file:
f.close()
or use the file as a context manager (using the with statement):
with open('/home/filename', 'w', encoding='utf-8') as f:
    print('something', file=f)
and the file will be closed automatically when the block is exited (on completion, or an exception).
Write more data to the file; how much depends on the buffering configuration.
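A related option, offered as a sketch rather than part of the list above: open the text file line-buffered (buffering=1), so each completed line is flushed on its own:

with open('/home/filename', 'w', encoding='utf-8', buffering=1) as f:
    print('something', file=f)   # the trailing newline triggers a flush in line-buffered mode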
I use the following simple Python script to compress a large text file (say, 10GB) on an EC2 m3.large instance. However, I always got a MemoryError:
import gzip

with open('test_large.csv', 'rb') as f_in:
    with gzip.open('test_out.csv.gz', 'wb') as f_out:
        f_out.writelines(f_in)
        # or the following:
        # for line in f_in:
        #     f_out.write(line)
The traceback I got is:
Traceback (most recent call last):
  File "test.py", line 8, in <module>
    f_out.writelines(f_in)
MemoryError
I have read some discussion about this issue, but I am still not quite clear on how to handle it. Can someone give me a more understandable answer about how to deal with this problem?
The problem here has nothing to do with gzip, and everything to do with reading line by line from a 10GB file with no newlines in it:
As an additional note, the file I used to test the Python gzip functionality is generated by fallocate -l 10G bigfile_file.
That gives you a 10GB sparse file made entirely of 0 bytes. Meaning there are no newline bytes. Meaning the first line is 10GB long. Meaning it will take 10GB to read the first line. (Or possibly even 20 or 40GB, if you're using pre-3.3 Python and trying to read it as Unicode.)
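A small-scale sketch of the same effect, using an in-memory buffer instead of a 10GB file:

import io

blob = io.BytesIO(b'\x00' * (10 * 1024 * 1024))   # 10 MB of zero bytes, not a single b'\n'
line = blob.readline()                            # "one line" turns out to be the whole buffer
print(len(line))                                  # 10485760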
If you want to copy binary data, don't copy line by line. Whether it's a normal file, a GzipFile that's decompressing for you on the fly, a socket.makefile(), or anything else, you will have the same problem.
The solution is to copy chunk by chunk. Or just use copyfileobj, which does that for you automatically.
import gzip
import shutil

with open('test_large.csv', 'rb') as f_in:
    with gzip.open('test_out.csv.gz', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
By default, copyfileobj uses a chunk size optimized to be often very good and never very bad. In this case, you might actually want a smaller size, or a larger one; it's hard to predict which a priori.* So, test it by using timeit with different bufsize arguments (say, powers of 4 from 1KB to 8MB) to copyfileobj. But the default 16KB will probably be good enough unless you're doing a lot of this.
* If the buffer size is too big, you may end up alternating long chunks of I/O and long chunks of processing. If it's too small, you may end up needing multiple reads to fill a single gzip block.
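A rough sketch of that timing experiment (the file names are the ones from the question; the chosen sizes and the results are machine-dependent):

import timeit

setup = "import gzip, shutil"
stmt = """
with open('test_large.csv', 'rb') as f_in:
    with gzip.open('test_out.csv.gz', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out, {bufsize})
"""
for bufsize in (1 << 10, 1 << 14, 1 << 18, 1 << 22):   # 1KB, 16KB, 256KB, 4MB
    seconds = timeit.timeit(stmt.format(bufsize=bufsize), setup=setup, number=1)
    print(bufsize, seconds)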
That's odd. I would expect this error if you tried to compress a large binary file that didn't contain many newlines, since such a file could contain a "line" that was too big for your RAM, but it shouldn't happen on a line-structured .csv file.
But anyway, it's not very efficient to compress files line by line. Even though the OS buffers disk I/O it's generally much faster to read and write larger blocks of data, eg 64 kB.
I have 2GB of RAM on this machine, and I just successfully used the program below to compress a 2.8GB tar archive.
#! /usr/bin/env python
import gzip
import sys

blocksize = 1 << 16     # 64kB

def gzipfile(iname, oname, level):
    with open(iname, 'rb') as f_in:
        f_out = gzip.open(oname, 'wb', level)
        while True:
            block = f_in.read(blocksize)
            if block == '':
                break
            f_out.write(block)
        f_out.close()
    return

def main():
    if len(sys.argv) < 3:
        print "gzip compress in_file to out_file"
        print "Usage:\n%s in_file out_file [compression_level]" % sys.argv[0]
        exit(1)
    iname = sys.argv[1]
    oname = sys.argv[2]
    level = int(sys.argv[3]) if len(sys.argv) > 3 else 6
    gzipfile(iname, oname, level)

if __name__ == '__main__':
    main()
I'm running Python 2.6.6 and gzip.open() doesn't support with.
As Andrew Bay notes in the comments, if block == '': won't work correctly in Python 3, since block contains bytes, not a string, and an empty bytes object doesn't compare as equal to an empty text string. We could check the block length, or compare to b'' (which will also work in Python 2.6+), but the simple way is if not block:.
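For reference, a sketch of the same copy loop with that portable test (the function name gzipfile3 is made up; it assumes Python 2.7+ or 3, where gzip.open supports with):

import gzip

def gzipfile3(iname, oname, level=6):
    with open(iname, 'rb') as f_in:
        with gzip.open(oname, 'wb', level) as f_out:
            while True:
                block = f_in.read(1 << 16)
                if not block:        # an empty bytes object means end of file on Python 2 and 3
                    break
                f_out.write(block)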
It is weird to get a memory error even when reading a file line by line. I suppose it is because you have very little available memory and very large lines. You should then use binary reads:
import gzip

# adapt the size value: small values will take more time, large values could cause memory errors
size = 8096

with open('test_large.csv', 'rb') as f_in:
    with gzip.open('test_out.csv.gz', 'wb') as f_out:
        while True:
            data = f_in.read(size)
            if not data:
                break
            f_out.write(data)
I'm learning Python, and have run into a bit of a problem. On my OSX install of Python 3.1, this happens in the console:
>>> filename = "test"
>>> reader = open(filename, 'r')
>>> writer = open(filename, 'w')
>>> reader.read()
''
>>> writer.write("hello world\n")
12
>>> reader.read()
''
And calling more test in BASH confirms that there is nothing in test. What's going on?
Thanks.
There are two potential reasons why you are seeing this behaviour.
When you open a file for writing (with the "w" open mode in Python), the OS truncates the existing file to zero length. So by opening the file for reading first and then opening it for writing, you empty the file out from under the reading handle, and there is nothing left for it to read at that point.
After you swap the order of opening so you open for writing and then reading, you won't necessarily be able to read the data from the file until you flush it:
>>> writer.flush()
>>> reader.read()
'hello world\n'
Flushing the file writes any data that might be in Python's file buffers to the OS, so that when you read from the file from the other handle, the OS will return the data. Note that Python itself doesn't know these two handles refer to the same file, but the OS does.
You're probably trashing your file. It's not usually a good idea to open a file for reading and writing at the same time.
Buffering. If you really want to read and write to the same file, open a single handle using "w+".
And because of the buffering, you will need to force the buffer to be emptied before reading. Closing the file is a good way to do this.
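A hedged console sketch of that approach, combining "w+" with an explicit flush and seek (the return values shown assume Python 3):

>>> f = open('test', 'w+')
>>> f.write('hello world\n')
12
>>> f.flush()                # empty Python's buffer so the data reaches the OS
>>> f.seek(0)                # move back to the start before reading
0
>>> f.read()
'hello world\n'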