The Python docs on file.read() state that "An empty string is returned when EOF is encountered immediately." The documentation further states:
Note that this method may call the underlying C function fread() more than once in an effort to acquire as close to size bytes as possible. Also note that when in non-blocking mode, less data than was requested may be returned, even if no size parameter was given.
I believe Guido has made his view on not adding f.eof() PERFECTLY CLEAR, so we need to use the Python way!
What is not clear to ME, however, is whether it is a definitive test for EOF if you receive fewer than the requested bytes from a read, but you did receive some.
i.e.:
with open(filename, 'rb') as f:
    while True:
        s = f.read(size)
        l = len(s)
        if l == 0:
            break  # it is clear that this is EOF...
        if l < size:
            break  # ? Is receiving less than the request EOF???
Is it a potential error to break if you have received fewer bytes than requested in a call to file.read(size)?
You are not thinking with your snake skin on... Python is not C.
First, a review:
st = f.read() reads to EOF, or, if opened in binary mode, to the last byte;
st = f.read(n) attempts to read n bytes, and in no case returns more than n bytes;
st = f.readline() reads a line at a time; the line ends with '\n' or EOF;
st = f.readlines() uses readline() to read all the lines in a file and returns a list of the lines.
If a file read method is at EOF, it returns ''. The same type of EOF test is used in the other 'file-like' methods, like StringIO, socket.makefile, etc. A return of fewer than n bytes from f.read(n) is most assuredly NOT a dispositive test for EOF! While that code may work 99.99% of the time, it is the times it does not work that would be very frustrating to find. Plus, it is bad Python form. The only use for n in this case is to put an upper limit on the size of the return.
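To make the convention concrete, here is a quick interactive sketch (file contents elided; 'somefile' is a stand-in name):
>>> f = open('somefile', 'rb')
>>> f.read(4)       # returns at most 4 bytes, possibly fewer
b'...'
>>> f.read()        # reads the rest of the file
b'...'
>>> f.read()        # at EOF, every read returns an empty result
b''
>>> f.read(1024)    # ...regardless of the limit you pass
b''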
What are some of the reasons the Python file-like methods might return fewer than n bytes?
EOF is certainly a common reason;
A network socket may timeout on read yet remain open;
Exactly n bytes may cause a break between logical multi-byte characters (such as \r\n in text mode and, I think, a multi-byte character in Unicode) or some underlying data structure not known to you;
The file is in non-blocking mode and another process begins to access the file;
Temporary loss of access to the file;
An underlying error condition, potentially temporary, on the file, disc, network, etc.
The program received a signal, but the signal handler ignored it.
I would rewrite your code in this manner:
with open(filename, 'rb') as f:
    while True:
        s = f.read(max_size)
        if not s:
            break
        # process the data in s...
Or, write a generator:
def blocks(infile, bufsize=1024):
    while True:
        try:
            data = infile.read(bufsize)
            if data:
                yield data
            else:
                break
        except IOError as e:
            print("I/O error({0}): {1}".format(e.errno, e.strerror))
            break

f = open('somefile', 'rb')
for block in blocks(f, 2**16):
    pass  # process a block that COULD be up to 65,536 bytes long
Here's what my C compiler's documentation says for the fread() function:
size_t fread(
    void *buffer,
    size_t size,
    size_t count,
    FILE *stream
);
fread returns the number of full items actually read, which may be less than count if an error occurs or if the end of the file is encountered before reaching count.
So it looks like getting less than size means either an error has occurred or EOF has been reached -- so breaking out of the loop would be the correct thing to do.
Related
I wrote some simple requests code that downloads a file from a server:
r = requests.get("https:.../index_en.txt.lzma")
index_en= open('C:\...\index_en.txt.lzma','wb')
index_en.write(r.content)
index_en.close
When I now extract the file manually in the directory with 7zip, everything is fine and the file decompresses as normal.
I tried two ways to do it in a Python program, but since the file ends with .lzma I guess the following one is the better approach:
import lzma
with open('C:\...\index_en.txt.lzma') as compressed:
    print(compressed.readline)
    with lzma.LZMAFile(compressed) as uncompressed:
        for line in uncompressed:
            print(line)
This one gives me the error "Compressed file ended before the end-of-stream marker was reached" at the line with the for loop.
The second way I tried was with py7zr, because by hand it worked fine:
with py7zr.SevenZipFile("C:\...\index_en.txt.lzma", 'w') as archive:
    archive.extract(path="C:\...\Json")
This one gives me the error OSError 22 Invalid Argument at the "with py7zr..." line.
I really don't understand where the problem is. Why does it work by hand but not in Python?
Thanks
You didn't close your file, so data stuck in user mode buffers isn't visible on disk until the file is cleaned up at some undetermined future point (may not happen at all, and may not happen until the program exits even if it does). Because of this, any attempt to access the file by any means other than the single handle you wrote to will not see the unflushed data, which would cause it to appear as if the file was truncated, getting the error you observe.
The minimal solution is to actually call close, changing index_en.close to index_en.close(). But practically speaking, you should use with statements for all files (and locks, and socket-like things, and all other resources that require cleanup), whenever possible, so even when an exception occurs the file is definitely closed; it's most important for files you're writing to (where data might not get flushed to disk without it), but even for files opened for reading, in pathological cases you can end up hitting the open file handle limit.
Rewriting your first block of code to be completely safe gets you:
with requests.get("https:.../index_en.txt.lzma") as r, open(r'C:\...\index_en.txt.lzma', 'wb') as index_en:
    index_en.write(r.content)
Note: requests.Response objects are also context managers, so I added it to the with to ensure the underlying connection is released back to the pool promptly. I also prefixed your local path with an r to make it a raw string; on Windows, with backslashes in the path, you always want to do this, so that a file or directory beginning with a character that Python recognizes as a string literal escape doesn't get corrupted (e.g. "C:\foo" is actually "C:<form feed>oo", containing neither a backslash nor an f).
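A quick interactive check of that escape behavior, using a short made-up path:
>>> "C:\foo"
'C:\x0coo'
>>> r"C:\foo"
'C:\\foo'
>>> len("C:\foo"), len(r"C:\foo")
(5, 6)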
You could even optimize it a bit, in case the file is large, by streaming the data into the file (requiring mostly fixed memory overhead, tied to the buffer size of the underlying connection) rather than fetching eagerly (requiring memory proportionate to file size):
# stream=True means underlying file is opened without being immediately
# read into memory
with requests.get("https:.../index_en.txt.lzma", stream=True) as r, open(r'C:\...\index_en.txt.lzma', 'wb') as index_en:
    # iter_content(None) produces an iterator of chunks of data (of whatever size
    # is available in a single system call)
    # Changing to writelines means the iterator is consumed and written
    # as the data arrives
    index_en.writelines(r.iter_content(None))
Controlling the requests.get with a with statement is more important here (as stream=True mode means the underlying socket isn't consumed and freed immediately).
Also note that print(compressed.readline) is doing nothing (because you didn't call readline). If there is some line of text in the response prior to the raw LZMA data, you failed to skip it. If there is not such a garbage line, and if you'd called readline properly (with print(compressed.readline())), it would have broken decompression because the file pointer would now have skipped the first few (or many) bytes of the file, landing at some mostly random offset.
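For reference, once the file has been written and closed correctly, a minimal sketch of decompressing it directly might look like this (path elided as in the question; lzma.open auto-detects the legacy .lzma container format):
import lzma

# Open the compressed file in text mode and iterate over decompressed lines.
with lzma.open(r'C:\...\index_en.txt.lzma', 'rt', encoding='utf-8') as uncompressed:
    for line in uncompressed:
        print(line, end='')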
Lastly,
with py7zr.SevenZipFile("C:\...\index_en.txt.lzma", 'w') as archive:
    archive.extract(path="C:\...\Json")
is wrong because you passed it a mode indicating you're opening it for write, when you're clearly attempting to read from it; either omit the 'w' or change it to 'r'.
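That is, something like the following (though note that py7zr reads 7z archives, so whether it accepts a bare .lzma stream at all is a separate question):
import py7zr

# Open for reading ('r'), not writing ('w'), before extracting.
with py7zr.SevenZipFile(r"C:\...\index_en.txt.lzma", 'r') as archive:
    archive.extract(path=r"C:\...\Json")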
I need to read a text file with the os module as such:
t = os.open('te.txt', os.O_RDONLY)
r = os.read(t, 20)
rs = r.decode('utf-8')
print(rs)
What if I don't know the byte size of the file? I could put a very large number instead of 20, since a value seems to be required, but perhaps there is a more Pythonic way.
The second argument isn't supposed to hold the size of the file in bytes; it's only supposed to hold the maximum amount of content you're prepared to read at a time (which should typically be divisible by both your operating system's block size and page size; 64 KiB is not a bad default).
The "why" of this is because memory has to be allocated in userspace before the kernel can be instructed to write content into that memory. This isn't the kind of detail that Python developers need to think about often, but you're using a low-level interface built for use from C; it accordingly has implementation details leaking out of that underlying layer.
The operating system is free to give you less than the number of bytes you indicate as a maximum (for example, if it gets interrupted, or the filesystem driver isn't written to provide that much data at a time), so no matter what, you need to be prepared to call it repeatedly; only when it returns an empty string (as opposed to throwing an exception or returning a shorter-than-requested string) are you certain to have reached the end of the file.
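As a concrete illustration, a minimal read-until-EOF loop over os.read() might look like this (read_all and the 64 KiB buffer size are my own choices, not part of the question):
import os

def read_all(path, bufsize=64 * 1024):
    # Collect chunks until os.read() returns an empty bytes object,
    # the only definitive EOF signal at this level.
    chunks = []
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            chunk = os.read(fd, bufsize)
            if not chunk:
                break
            chunks.append(chunk)
    finally:
        os.close(fd)
    return b''.join(chunks)

print(read_all('te.txt').decode('utf-8'))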
os.read() isn't a Pythonic interface, and it isn't supposed to be. It's a thin wrapper around the syscall provided by the operating system kernel. If you want a Pythonic interface, don't use os.read(), but instead use Python's native file objects.
If you wanted to load the whole file and you have to use os, you could use os.stat(filename).st_size or os.path.getsize(filename) to get the size of the file in bytes.
filename = 'te.txt'
t = os.open(filename, os.O_RDONLY)
b = os.stat(filename).st_size
r = os.read(t, b)
os.close(t)
rs = r.decode('utf-8')
print(rs)
I am in the process of porting code from Python 2 to Python 3. The code below worked fine in Python 2, but under Python 3 it occasionally overwrites a couple of bytes at the end of the previous line when writing a new packet to the file. This causes an error rate of 10% when reading the packets back from the file (the error rate was around 2% in Python 2).
logfile = open(filepath, 'w+')
# Gets the offset to write to the file (EOF)
offset = self.enddict[fname]
# The output message
outmsg = "%ld\n%d\n%s\n" % (now, msg_len, msg)
#Seeks to the given offset and writes the message out
logfile.seek(offset)
logfile.write(outmsg)
I've tried out a couple of solutions to resolve this issue, but haven't got the right one so far:
Adding extra newlines to the beginning and end of the output message. This seems to mitigate the issue (it reduces the error rate to 2%), but it doesn't seem like a viable solution, as we'd need to change the various readers consuming the file downstream.
outmsg = "\n\n\n\n\n\n\n\n\n\n%ld\n%d\n%s\n\n\n\n\n\n\n\n\n\n" % (now, msg_len, msg)
Using io.SEEK_END. This seems to write the packets correctly to the file, and the error rate drops close to 0%. But it messes up the offsets written to the DB: when reading a chunk from the file using the offsets stored in the DB, we get a corrupted chunk.
logfile.seek(0, io.SEEK_END)
I did some research into using os.lseek and found it to be slower than seek.
The solution below seems to get rid of the missing-bytes issue during the seek operation:
self.logfile.seek(offset)
off_bytes_count = len(self.logfile.read())
if off_bytes_count:
    offset += off_bytes_count
    self.logfile.seek(offset)
Here, off_bytes_count is the number of bytes remaining after the seek position.
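For what it's worth, here is a sketch of one way the io.SEEK_END approach might be reconciled with the DB offsets, assuming the DB can simply store the position the file object itself reports after the seek (my own suggestion, not from the question):
import io

# Seek to the true end, then record where the data actually lands.
logfile.seek(0, io.SEEK_END)
offset = logfile.tell()  # store this offset in the DB instead of enddict
logfile.write(outmsg)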
In my python code I wrote the following function to receive self-defined binary package from stdin.
import sys
import json

def recvPkg():
    # The first 4 bytes hold the remaining package length
    Len = int.from_bytes(sys.stdin.buffer.read(4), byteorder='big', signed=True)
    # Then read the remaining package
    data = json.loads(str(sys.stdin.buffer.read(Len), 'utf-8'))
    # do something...

while True:
    recvPkg()
Then, in another Node.js program I spawn this python program as a child process, and send bytes to it.
childProcess = require('child_process').spawn('./python_code.py');
childProcess.stdin.write(someBinaryPackage)
I expect the child process to read from its stdin buffer once a package is received and produce output. But it doesn't work, and I think the reason is that the child process won't begin to read unless its stdin buffer receives a signal, like an EOF. As proof, if I close childProcess's stdin after stdin.write, the Python code works and receives all the buffered packages at once. This is not what I want, because I need childProcess's stdin to remain open. So is there any other way for Node.js to signal the child process that it should read from its stdin buffer?
(Sorry for my poor English.)
From Wikipedia (emphasis mine):
Input from a terminal never really "ends" (unless the device is disconnected), but it is useful to enter more than one "file" into a terminal, so a key sequence is reserved to indicate end of input. In UNIX the translation of the keystroke to EOF is performed by the terminal driver, so a program does not need to distinguish terminals from other input files.
There is no way to send an EOF character the way you are expecting. EOF isn't really a character that exists. When you're in a terminal, you can press the key sequence Ctrl+Z on Windows, or Ctrl+D in UNIX-like environments. These produce control characters (code 26 on Windows, code 04 on UNIX) that are read by the terminal. The terminal, upon reading such a code, will essentially stop writing to a program's stdin and close it.
In Python, a file object will .read() forever. The EOF condition is that .read() returns ''. In some other languages, this might be -1, or some other condition.
Consider:
>>> my_file = open("file.txt", "r")
>>> my_file.read()
'This is a test file'
>>> my_file.read()
''
The last result here isn't an EOF character; there's just nothing there. Python has read to the end of the file and can't .read() any more.
Because stdin is a special type of 'file', it doesn't have an end. You have to define that end. The terminal has defined that end as the control characters, but here you are not passing data to stdin via a terminal, so you'll have to manage it yourself.
Just closing the file
Input [...] never really "ends" (unless the device is disconnected)
Closing stdin is probably the simplest solution here. stdin is an infinite file, so once you're done writing to it, just close it.
Expect your own control character
Another option is to define your own control character. You can use whatever you want here. The example below uses a NULL byte.
Python
import sys

class FileWithEOF:
    def __init__(self, file_obj):
        self.file = file_obj
        self.value = bytes()

    def __enter__(self):
        return self

    def __exit__(self, *args, **kwargs):
        pass

    def read(self):
        while True:
            val = self.file.buffer.read(1)
            if val == b"\x00":
                break
            self.value += val
        return self.value

data = FileWithEOF(sys.stdin).read()
Node
childProcess = require('child_process').spawn('./python_code.py');
childProcess.stdin.write("Some text I want to send.");
childProcess.stdin.write(Buffer.from([0x00]));
You might be reading the wrong length
I think the value you're capturing in Len is less than the length of your file.
Python
import sys

while True:
    length = int(sys.stdin.read(2))
    with open("test.txt", "a") as f:
        f.write(sys.stdin.read(length))
Node
childProcess = require('child_process').spawn('./test.py');
// Python reads the first 2 characters (`.read(2)`)
childProcess.stdin.write("10");
// Python reads 9 characters, but does nothing because it's
// expecting 10. `stdin` is still capable of producing bytes from
// Python's point of view.
childProcess.stdin.write("123456789");
// Writing the final byte hits 10 characters, and the contents
// are written to `test.txt`.
childProcess.stdin.write("A");
See edits below.
I have two programs that communicate through sockets. I'm trying to send a block of data from one to the other. This has been working with some test data, but is failing with others.
s.sendall('%16d' % len(data))
s.sendall(data)
print(len(data))
sends to
size = int(s.recv(16))
recvd = ''
while size > len(recvd):
    data = s.recv(1024)
    if not data:
        break
    recvd += data
print(size, len(recvd))
At one end:
s = socket.socket()
s.connect((server_ip, port))
and the other:
c = socket.socket()
c.bind(('', port))
c.listen(1)
s,a = c.accept()
In my latest test, I sent a 7973903 byte block and the receiver reports size as 7973930.
Why is the data block received off by 27 bytes?
Any other issues?
Python 2.7 or 2.5.4 if that matters.
EDIT: Aha - I'm probably reading past the end of the send buffer. If remaining bytes is less than 1024, I should only read the number of remaining bytes. Is there a standard technique for this sort of data transfer? I have the feeling I'm reinventing the wheel.
EDIT2: I'm screwing up by reading the next file in the series. I'm sending file1 and the last block is 997 bytes. Then I send file2, so the recv(1024) at the end of file1 reads the first 27 bytes of file2.
I'll start another question on how to do this better.
Thanks everyone. Asking and reading comments helped me focus.
First, the line
size = int(s.recv(16))
might read less than 16 bytes — it is unlikely, I will grant, but possible depending on how the network buffers align. The recv() call argument is a maximum value, a limit on how much data you are willing to receive. But you might only receive one byte. The operating system will generally give you control back once at least one byte has arrived, maybe (depending on the OS and on how busy the CPU is) after waiting another few milliseconds in case a second packet arrives with some further data, so that it only has to wake you up once instead of twice.
So you would want to say instead (to do the simplest possible loop; other variants are possible):
data = ''
while len(data) < 16:
    more = s.recv(16 - len(data))
    if not more:
        raise EOFError()
    data += more
This is indeed a wheel nearly everyone re-invents because it is so often needed. And your own code needs it a second time: your while loop needs its recv() to count down, asking for smaller and smaller limits until finally it has received exactly the number of bytes that were promised, and no more.
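Factored into a helper, the pattern might look like this (recv_exactly is a made-up name, and the sketch uses Python 3 bytes; under the question's Python 2, recv() returns str, so '' would replace b''):
def recv_exactly(sock, n):
    # Ask only for the bytes still missing, so we can never read past
    # the end of this message into the next one.
    data = b''
    while len(data) < n:
        more = sock.recv(n - len(data))
        if not more:
            raise EOFError('socket closed with %d bytes missing' % (n - len(data)))
        data += more
    return data

size = int(recv_exactly(s, 16))   # the 16-byte length header
payload = recv_exactly(s, size)   # then exactly that many bytes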