I have a plain ASCII file. When I try to open it with codecs.open(..., "utf-8"), I am unable to read single characters. ASCII is a subset of UTF-8, so why can't codecs open such a file in UTF-8 mode?
# test.py
import codecs
f = codecs.open("test.py", "r", "utf-8")
# ASCII is supposed to be a subset of UTF-8:
# http://www.fileformat.info/info/unicode/utf8.htm
assert len(f.read(1)) == 1 # OK
f.readline()
c = f.read(1)
print len(c)
print "'%s'" % c
assert len(c) == 1 # fails
# max% p test.py
# 63
# '
# import codecs
#
# f = codecs.open("test.py", "r", "utf-8")
#
# # ASC'
# Traceback (most recent call last):
# File "test.py", line 15, in <module>
# assert len(c) == 1 # fails
# AssertionError
# max%
system:
Linux max 4.4.0-89-generic #112~14.04.1-Ubuntu SMP Tue Aug 1 22:08:32 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Of course it works with regular open. It also works if I remove the "utf-8" option. Also, what does 63 mean? That's roughly the middle of the 3rd line. I don't get it.
Found your problem:
When passed an encoding, codecs.open returns a StreamReaderWriter, which is really just a wrapper around (not a subclass of; it's a "composed of" relationship, not inheritance) StreamReader and StreamWriter. Problem is:
StreamReaderWriter provides a "normal" read method (that is, it takes a size parameter and that's it)
It delegates to the internal StreamReader.read method, where the size argument is only a hint as to the number of bytes to read, but not a limit; the second argument, chars, is a strict limiter, but StreamReaderWriter never passes that argument along (it doesn't accept it)
When size is hinted, but not capped using chars, and StreamReader has buffered data that is large enough to match the size hint, StreamReader.read blindly returns the contents of the buffer rather than limiting it in any way based on the size hint (after all, only chars imposes a maximum return size)
The API of StreamReader.read and the meaning of size/chars are the only documented things here; the fact that codecs.open returns StreamReaderWriter is not contractual, nor is the fact that StreamReaderWriter wraps StreamReader. I just used IPython's ?? magic to read the source code of the codecs module to verify this behavior. But documented or not, that's what it's doing (feel free to read the source code for StreamReaderWriter, it's all Python level, so it's easy).
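To make the size/chars distinction concrete, here is a small sketch (untested, Python 2 syntax like the question) that talks to a StreamReader directly via codecs.getreader; the second variant passes the chars argument and so is strictly capped:

import codecs

# Build a StreamReader over the same file the question uses.
reader = codecs.getreader("utf-8")(open("test.py", "rb"))
reader.readline()                # leaves decoded text in the reader's internal buffer
print len(reader.read(1))        # size is only a hint; may print far more than 1

reader = codecs.getreader("utf-8")(open("test.py", "rb"))
reader.readline()
print len(reader.read(6, 1))     # chars=1 strictly caps the result; prints 1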
The best solution is to switch to io.open, which is faster and more correct in every standard case (codecs.open supports the weirdo codecs that don't convert between bytes [Py2 str] and str [Py2 unicode], but rather, handle str to str or bytes to bytes encodings, but that's an incredibly limited use case; most of the time, you're converting between bytes and str). All you need to do is import io instead of codecs, and change the codecs.open line to:
f = io.open("test.py", encoding="utf-8")
The rest of your code can remain unchanged (and will likely run faster to boot).
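For concreteness, a sketch of the question's script with only that one line changed (untested; Python 2 print syntax kept as in the original):

# test.py, now using io.open instead of codecs.open
import io

f = io.open("test.py", encoding="utf-8")
assert len(f.read(1)) == 1  # OK, as before
f.readline()
c = f.read(1)
print len(c)                # 1
print "'%s'" % c
assert len(c) == 1          # now passes: io's read(1) returns at most one character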
As an alternative, you could explicitly bypass StreamReaderWriter to get the StreamReader's read method and pass the limiting argument directly, e.g. change:
c = f.read(1)
to:
# Pass second, character limiting argument after size hint
c = f.reader.read(6, 1) # 6 is sort of arbitrary; should ensure a full char read in one go
I suspect Python Bug #8260, which covers intermingling readline and read on codecs.open-created file objects, applies here. Officially it's "fixed", but if you read the comments, the fix wasn't complete (and may not be possible to complete given the documented API); arbitrarily weird combinations of read and readline will be able to break it.
Again, just use io.open; as long as you're on Python 2.6 or higher, it's available, and it's just plain better.
Related
Question:
What is the difference between open(<name>, "w", encoding=<encoding>) and open(<name>, "wb") + str.encode(<encoding>)? They seem to (sometimes) produce different outputs.
Context:
While using PyFPDF (version 1.7.2), I subclassed the FPDF class, and, among other things, added my own output method (taking pathlib.Path objects). While looking at the source of the original FPDF.output() method, I noticed almost all of it is argument parsing - the only relevant bits are
#Finish document if necessary
if(self.state < 3):
    self.close()
[...]
f=open(name,'wb')
if(not f):
    self.error('Unable to create output file: '+name)
if PY3K:
    # manage binary data as latin1 until PEP461 or similar is implemented
    f.write(self.buffer.encode("latin1"))
else:
    f.write(self.buffer)
f.close()
Seeing that, my own implementation looked like this:
def write_file(self, file: Path) -> None:
    if self.state < 3:
        # See FPDF.output()
        self.close()
    file.write_text(self.buffer, "latin1", "strict")
This seemed to work - a .pdf file was created at the specified path, and Chrome opened it. But it was completely blank, even though I added images and text. After hours of experimenting, I finally found a version that worked (produced a non-empty pdf file):
def write_file(self, file: Path) -> None:
    if self.state < 3:
        # See FPDF.output()
        self.close()
    # using .write_text(self.buffer, "latin1", "strict") DOES NOT WORK AND I DON'T KNOW WHY
    file.write_bytes(self.buffer.encode("latin1", "strict"))
Looking at the pathlib.Path source, it uses io.open for Path.write_text(). As all of this is Python 3.8, io.open and the built-in open() are the same.
Note:
FPDF.buffer is of type str, but holds binary data (a PDF file), probably because the library was originally written for Python 2.
Both should be the same (with minor differences).
I like the open way, because it is explicit and shorter. On the other hand, if you want to handle encoding errors (e.g. give the user a much better error message), you should use decode/encode yourself (maybe after splitting on '\n' and keeping track of line numbers).
Note: if you use the first method (open with an encoding), you should use plain r or w, without b. Your question's title gets this right, but note that your example keeps b, and that is probably why it uses .encode(). Then again, the code seems old, and I think the .encode() was just done because it was the more natural approach in a Python 2 mindset.
Note: I would also replace strict with backslashreplace for debugging. And you may want to check and print the first few characters of self.buffer (maybe just their ord values) in both methods, to see if there are substantial differences before file.write.
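A minimal sketch of that check (hypothetical; drop it into write_file just before the write in both versions):

# Peek at the first few characters of the buffer for comparison between the two methods.
for ch in self.buffer[:16]:
    print(ord(ch), repr(ch))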
I would also add a file.flush() in both functions, and make sure the file gets closed. Buffering is one of the differences here: Python will eventually close the file for you, but when debugging it is important to see the content of the file as soon as possible (including after an exception), and the garbage collector does not guarantee any of this. Maybe you are reading a file which has not yet been flushed.
Aaaand found it: Path.write_bytes() will save the bytes object as is, and str.encode() doesn't touch the line endings.
Path.write_text() will encode the string just like str.encode(), BUT: because the file is opened in text mode, the line endings will be normalized after encoding - in my case converting \n to \r\n because I'm on Windows. And PDFs have to use \n, no matter what platform you're on.
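A small sketch (made-up file names, run on Windows) showing the difference described above; the buffer string is just a stand-in for FPDF.buffer:

from pathlib import Path

buffer = "%PDF-1.3\n1 0 obj\n"   # stand-in for FPDF.buffer: a str holding binary data

Path("text_mode.pdf").write_text(buffer, "latin1", "strict")   # text mode: '\n' becomes '\r\n' on Windows
Path("bytes_mode.pdf").write_bytes(buffer.encode("latin1"))    # bytes written verbatim

print(Path("text_mode.pdf").read_bytes())   # b'%PDF-1.3\r\n1 0 obj\r\n' on Windows
print(Path("bytes_mode.pdf").read_bytes())  # b'%PDF-1.3\n1 0 obj\n' everywhere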
I use the following method to read binary data from any given offset in a binary file. The binary file I have is huge (10 GB), so I usually read only a portion of it when needed, by specifying the offset to start reading from (start_read) and how many bytes to read (num_to_read). I use Python 3.6.4 :: Anaconda, Inc., platform Darwin-17.6.0-x86_64-i386-64bit, and the os module:
import os
import numpy as np

def read_from_disk(path, start_read, num_to_read, dim):
    fd = os.open(path, os.O_RDONLY)
    os.lseek(fd, start_read, 0)           # seek to start_read bytes from the beginning (whence=0)
    raw_data = os.read(fd, num_to_read)   # how many bytes to read
    C = np.frombuffer(raw_data, dtype=np.int64).reshape(-1, dim).astype(np.int8)
    os.close(fd)
    return C
This method works very well when the chunk of data to be read is less than about 2 GB. When num_to_read > 2 GB, I get this error:
raw_data = os.read(fd, num_to_read) # How many to read (num_to_read)
OSError: [Errno 22] Invalid argument
I am not sure why this issue appears and how to fix it. Any help is highly appreciated.
The os.read function is just a thin wrapper around the platform's read function.
On some platforms, this is an unsigned or signed 32-bit int,[1] which means the largest you can read in a single go on these platforms is, respectively, 4GB or 2GB.
So, if you want to read more than that, and you want to be cross-platform, you have to write code to handle this, and to buffer up multiple reads.
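As an illustration of that buffering, here is a hedged sketch; read_all and the 1 GB chunk size are my own choices, not part of the os API:

import os

def read_all(fd, total, chunk=1 << 30):
    # Issue several os.read calls so that no single call asks for more than ~1 GB.
    parts = []
    remaining = total
    while remaining > 0:
        data = os.read(fd, min(chunk, remaining))
        if not data:          # EOF reached before 'total' bytes were available
            break
        parts.append(data)
        remaining -= len(data)
    return b"".join(parts)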
This may be a bit of a pain, but you are intentionally using the lowest-level directly-mapping-to-the-OS-APIs function here. If you don't like that:
Use io module objects (Python 3.x) or file objects (2.7) that you get back from open instead.
Just let NumPy read the files—which will have the added advantage that NumPy is smart enough to not try to read the whole thing into memory at once in the first place.
Or, for files this large, you may want to go lower level and use mmap (assuming you're on a 64-bit platform).
The right thing to do here is almost certainly a combination of the first two. In Python 3, it would look like this:
with open(path, 'rb', buffering=0) as f:
    f.seek(start_read)
    count = num_to_read // 8  # how many int64s to read
    return np.fromfile(f, dtype=np.int64, count=count).reshape(-1, dim).astype(np.int8)
[1] For Windows, the POSIX-emulation library's _read function uses int for the count argument, which is signed 32-bit. For every other modern platform, see POSIX read, and then look up the definitions of size_t, ssize_t, and off_t on your platform. Notice that many POSIX platforms have separate 64-bit types, and corresponding functions, instead of changing the meaning of the existing types to 64-bit. Python will use the standard types, not the special 64-bit types.
I am converting a huge file that I wrote in Python 2.7.3, and now I want to upgrade to Python 3+ (I have 3.5).
What I have done so far:
installed the Python 3.5+ interpreter
updated the environment path to read from the python3+ folder
upgraded numpy and pandas
ran python 2to3.py -w viterbi.py to convert to version 3+
The section where I get the error:
import sys
import numpy as np
import pandas as pd

# Counting number of lines in the text file
lines = 0
buffer = bytearray(2048)
with open(inputFilePatheName) as f:
    while f.readinto(buffer) > 0:
        lines += buffer.count('\n')
My error is:
AttributeError: '_io.TextIOWrapper' object has no attribute 'readinto'
This is the first error, and I cannot proceed to see if there are any others. I don't know what the equivalent of readinto is.
In 3.x, the readinto method is only available on binary I/O streams. Thus: with open(inputFilePatheName, 'rb') as f:.
Separately, buffer.count('\n') will not work any more, because Python 3.x handles text properly, as something distinct from a raw sequence of bytes. buffer, being a bytearray, stores bytes; it still has a .count method, but it has to be given either an integer (representing the numeric value of a byte to look for) or a "bytes-like object" (representing a subsequence of bytes to look for). So we also have to update that, as buffer.count(b'\n') (using a bytes literal).
Finally, we need to be aware that processing the file this way means we don't get universal newline translation by default any more.
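Putting those three points together, a corrected sketch of the counting loop might look like this (it also counts only the bytes actually read on the final, partially filled pass, which is my own addition to the points above):

lines = 0
buffer = bytearray(2048)
with open(inputFilePatheName, 'rb') as f:    # binary mode, so readinto exists
    n = f.readinto(buffer)
    while n > 0:
        lines += buffer[:n].count(b'\n')     # bytes pattern, and only the bytes just read
        n = f.readinto(buffer)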
Open the file as binary.
As long as you can guarantee the file is UTF-8 or single-byte codepage encoded, every \n byte is necessarily a newline:
with open(inputFilePatheName, "rb") as f:
    while f.readinto(buffer) > 0:
        lines += buffer.count(b'\n')
That way you also save the time of decoding the file, and use your buffer in the most efficient way possible.
A better approach to what you're trying to achieve is using memory-mapped files.
In case of Windows:
import mmap
import os

file_handle = os.open(r"yourpath", os.O_RDONLY | os.O_BINARY | os.O_SEQUENTIAL)
try:
    with mmap.mmap(file_handle, 0, access=mmap.ACCESS_READ) as f:
        pos = -1
        total = 0
        while (pos := f.find(b"\n", pos + 1)) != -1:
            total += 1
finally:
    os.close(file_handle)
Again, make sure you are not encoding the text as UTF-16, which is the default for Windows.
I am using python 2.7.2 with pyserial 2.6.
What is the best way to use pyserial.readline() when talking to a device that has a character other than "\n" for eol? The pyserial doc points out that pyserial.readline() no longer takes an 'eol=' argument in python 2.6+, but recommends using io.TextIOWrapper as follows:
ser = serial.serial_for_url('loop://', timeout=1)
sio = io.TextIOWrapper(io.BufferedRWPair(ser, ser))
However the python io.BufferedRWPair doc specifically warns against that approach, saying "BufferedRWPair does not attempt to synchronize accesses to its underlying raw streams. You should not pass it the same object as reader and writer; use BufferedRandom instead."
Could someone point to a working example of pyserial.readline() with an eol other than '\n'?
Thanks,
Tom
read() has a user-settable maximum size for the data it reads (in bytes). If your data strings are a predictable length, you could simply set that to capture a fixed-length string. It's sort of 'Kentucky windage' in execution, but so long as your data strings are consistent in size it won't bork.
Beyond that, your real option is to capture and write the data stream to another file and split out your entries manually/programmatically.
For example, you could write your data stream to a .csv file and adjust the delimiter variable to be your EOL character.
Assume s is an open serial.serial object and the newline character is \r.
Then this will read to end of line and return everything up to '\r' as a string.
def read_value(encoded_command):
    s.write(encoded_command)
    temp = ''
    response = ''
    while '\r' not in response:
        response = s.read().decode()
        temp = temp + response
    return temp
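A hypothetical usage sketch: s is assumed to be the open serial.Serial port, and 'MEAS:VOLT?\r' is a made-up command standing in for whatever protocol your device speaks:

reading = read_value(b'MEAS:VOLT?\r')  # blocks until a '\r' comes back
print(reading)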
And, BTW, I implemented the io.TextIOWrapper recommendation you talked about above, and it 1) is much slower and 2) somehow makes the port close.
Okay so I have some data streams compressed by python's (2.6) zlib.compress() function. When I try to decompress them, some of them won't decompress (zlib error -5, which seems to be a "buffer error", no idea what to make of that). At first, I thought I was done, but I realized that all the ones I couldn't decompress started with 0x78DA (the working ones were 0x789C), and I looked around and it seems to be a different kind of zlib compression -- the magic number changes depending on the compression used. What can I use to decompress the files? Am I hosed?
According to RFC 1950, the difference between the "OK" 0x789C and the "bad" 0x78DA is in the FLEVEL bit-field:
FLEVEL (Compression level)
    These flags are available for use by specific compression
    methods. The "deflate" method (CM = 8) sets these flags as
    follows:
        0 - compressor used fastest algorithm
        1 - compressor used fast algorithm
        2 - compressor used default algorithm
        3 - compressor used maximum compression, slowest algorithm
    The information in FLEVEL is not needed for decompression; it
    is there to indicate if recompression might be worthwhile.
"OK" uses 2, "bad" uses 3. So that difference in itself is not a problem.
To get any further, you might consider supplying the following information for each of compressing and (attempted) decompressing: what platform, what version of Python, what version of the zlib library, what was the actual code used to call the zlib module. Also supply the full traceback and error message from the failing decompression attempts. Have you tried to decompress the failing files with any other zlib-reading software? With what results? Please clarify what you have to work with: Does "Am I hosed?" mean that you don't have access to the original data? How did it get from a stream to a file? What guarantee do you have that the data was not mangled in transmission?
UPDATE Some observations based on partial clarifications published in your self-answer:
You are using Windows. Windows distinguishes between binary mode and text mode when reading and writing files. When reading in text mode, Python 2.x changes '\r\n' to '\n', and changes '\n' to '\r\n' when writing. This is not a good idea when dealing with non-text data. Worse, when reading in text mode, '\x1a' aka Ctrl-Z is treated as end-of-file.
To compress a file:
# imports and other superstructure left as an exercise
str_object1 = open('my_log_file', 'rb').read()
str_object2 = zlib.compress(str_object1, 9)
f = open('compressed_file', 'wb')
f.write(str_object2)
f.close()
To decompress a file:
str_object1 = open('compressed_file', 'rb').read()
str_object2 = zlib.decompress(str_object1)
f = open('my_recovered_log_file', 'wb')
f.write(str_object2)
f.close()
Aside: Better to use the gzip module, which saves you having to think about nasties like text mode, at the cost of a few bytes for the extra header info.
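For example, a gzip-based version of the two snippets above might look like this (untested sketch, same made-up file names, written in the same explicit open/close style):

import gzip

# compress: gzip is always binary and handles the header/trailer for you
f_in = open('my_log_file', 'rb')
f_out = gzip.open('my_log_file.gz', 'wb', 9)
f_out.write(f_in.read())
f_out.close()
f_in.close()

# decompress
f_in = gzip.open('my_log_file.gz', 'rb')
str_object2 = f_in.read()
f_in.close()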
If you have been using 'rb' and 'wb' in your compression code but not in your decompression code [unlikely?], you are not hosed, you just need to flesh out the above decompression code and go for it.
Note carefully the use of "may", "should", etc in the following untested ideas.
If you have not been using 'rb' and 'wb' in your compression code, the probability that you have hosed yourself is rather high.
If there were any instances of '\x1a' in your original file, any data after the first such is lost -- but in that case it shouldn't fail on decompression (IOW this scenario doesn't match your symptoms).
If a Ctrl-Z was generated by zlib itself, this should cause an early EOF upon attempted decompression, which should of course cause an exception. In this case you may be able to gingerly reverse the process by reading the compressed file in binary mode and then substitute '\r\n' with '\n' [i.e. simulate text mode without the Ctrl-Z -> EOF gimmick]. Decompress the result. Edit Write the result out in TEXT mode. End edit
UPDATE 2 I can reproduce your symptoms -- with ANY level 1 to 9 -- with the following script:
import zlib, sys
fn = sys.argv[1]
level = int(sys.argv[2])
s1 = open(fn).read() # TEXT mode
s2 = zlib.compress(s1, level)
f = open(fn + '-ct', 'w') # TEXT mode
f.write(s2)
f.close()
# try to decompress in text mode
s1 = open(fn + '-ct').read() # TEXT mode
s2 = zlib.decompress(s1) # error -5
f = open(fn + '-dtt', 'w')
f.write(s2)
f.close()
Note: you will need to use a reasonably large text file (I used an 80kb source file) to ensure that the decompression result will contain a '\x1a'.
I can recover with this script:
import zlib, sys
fn = sys.argv[1]
# (1) reverse the text-mode write
# can't use text-mode read as it will stop at Ctrl-Z
s1 = open(fn, 'rb').read() # BINARY mode
s1 = s1.replace('\r\n', '\n')
# (2) reverse the compression
s2 = zlib.decompress(s1)
# (3) reverse the text mode read
f = open(fn + '-fixed', 'w') # TEXT mode
f.write(s2)
f.close()
NOTE: If there is a '\x1a' aka Ctrl-Z byte in the original file, and the file is read in text mode, that byte and all following bytes will NOT be included in the compressed file, and thus can NOT be recovered. For a text file (e.g. source code), this is no loss at all. For a binary file, you are most likely hosed.
Update 3 [following late revelation that there's an encryption/decryption layer involved in the problem]:
The "Error -5" message indicates that the data that you are trying to decompress has been mangled since it was compressed. If it's not caused by using text mode on the files, suspicion obviously(?) falls on your decryption and encryption wrappers. If you want help, you need to divulge the source of those wrappers. In fact what you should try to do is (like I did) put together a small script that reproduces the problem on more than one input file. Secondly (like I did) see whether you can reverse the process under what conditions. If you want help with the second stage, you need to divulge the problem-reproduction script.
I was looking for
python -c 'import sys,zlib;sys.stdout.write(zlib.decompress(sys.stdin.read()))'
wrote it myself; based on answers of zlib decompression in python
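Under Python 3, an equivalent one-liner would need the binary stdin/stdout buffers; a hedged, untested variant:
python3 -c 'import sys,zlib;sys.stdout.buffer.write(zlib.decompress(sys.stdin.buffer.read()))'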
Okay, sorry I wasn't clear enough. This is win32, Python 2.6.2. I'm afraid I can't find the zlib file, but it's whatever is included in the win32 binary release. And I don't have access to the original data -- I've been compressing my log files, and I'd like to get them back. As far as other software, I've naively tried 7zip, but of course it failed, because it's zlib, not gzip (I couldn't find any software to decompress zlib streams directly). I can't give a carbon copy of the traceback now, but it was (traced back to zlib.decompress(data)) zlib.error: Error: -3. Also, to be clear, these are static files, not streams as I made it sound earlier (so no transmission errors). And I'm afraid again I don't have the code, but I know I used zlib.compress(data, 9) (i.e. at the highest compression level -- although, interestingly, it seems that not all the zlib output is 78da as you might expect, since I put it on the highest level) and just zlib.decompress().
Ok sorry about my last post, I didn't have everything. And I can't edit my post because I didn't use OpenID. Anyways, here's some data:
1) Decompression traceback:
Traceback (most recent call last):
File "<my file>", line 5, in <module>
zlib.decompress(data)
zlib.error: Error -5 while decompressing data
2) Compression code:
#here you can assume the data is the data to be compressed/stored
data = encrypt(zlib.compress(data,9)) #a short wrapper around PyCrypto AES encryption
f = open("somefile", 'wb')
f.write(data)
f.close()
3) Decompression code:
f = open("somefile", 'rb')
data = f.read()
f.close()
zlib.decompress(decrypt(data)) # this yields the error in (1)