Python 3 itertools.islice continue despite UnicodeDecodeError

I have a python 3 program that monitors a log file. The log includes, among other things, chat messages written by users. The log is created by a third party application which I cannot change.
Today a user wrote "텋��텋��" and it caused the program to crash with the following error:
future: <Task finished coro=<updateConsoleLog() done, defined at /usr/local/src/bserver/logmonitor.py:48> exception=UnicodeDecodeError('utf-8',...
say "\xed\xa0\xbd\xed\xb1\x8c"\r\n', 7623, 7624, 'invalid continuation byte')>
Traceback (most recent call last):
File "/usr/lib/python3.4/asyncio/tasks.py", line 238, in _step
result = next(coro)
File "/usr/local/src/bserver/logmonitor.py", line 50, in updateConsoleLog
server_events = self.console.getUpdate()
File "/usr/local/src/bserver/console.py", line 79, in getUpdate
return self.read()
File "/usr/local/src/bserver/console.py", line 90, in read
for line in itertools.islice(log_file, log_no, None):
File "/usr/lib/python3.4/codecs.py", line 319, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 7623: invalid continuation byte
ERROR:asyncio:Task exception was never retrieved
Using 'file -i log.file' I determined that the log file is us-ascii. This shouldn't be an issue, as ASCII is a subset of UTF-8 (as far as I know).
Since this happens rarely and I don't mind losing whatever it is that this user typed, is it possible for me to ignore this line or the particular characters that can't be decoded and just keep on reading the rest of the file?
I considered using try: ... except UnicodeDecodeError as ..., but this would mean I can't read anything in the log file after the error.
Code
def read(self):
    log_no = self.last_log_no
    log_file = open(self.path, 'r')
    server_events = []
    starting_log_no = log_no
    for line in itertools.islice(log_file, log_no, None):  # ERROR
        server_events.append(line)
        print(line.replace('\n', '').replace('\r', ''))
        log_no += 1
    self.last_log_no = log_no
    if (starting_log_no < log_no):
        return server_events
    return False
Any help or advice would be appreciated!

The byte string \xed\xa0\xbd\xed\xb1\x8c is not valid utf-8. Neither is it us-ascii, since us-ascii bytes can only be 7 bits; i.e. \x8c is greater than 127. (Those bytes appear to be a UTF-8-encoded UTF-16 surrogate pair, the CESU-8-style form of an emoji that some chat software emits, which strict utf-8 decoders reject.)
Instead of ignoring the UnicodeDecodeError, try opening the file with an encoding that uses all 8 bits of a byte (e.g. latin-1):
log_file = open(self.path, 'r', encoding='latin-1')
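That one-line fix keeps the rest of read() unchanged, since latin-1 maps every byte to a character and decoding can never fail. If you would rather stay in utf-8 and just lose the undecodable bytes, the errors parameter of open() is an alternative (a sketch, not part of the answer above):

# Option 1: reinterpret every byte as latin-1; decoding never fails.
log_file = open(self.path, 'r', encoding='latin-1')
# Option 2: keep utf-8 but substitute U+FFFD for each undecodable byte
# instead of raising UnicodeDecodeError.
log_file = open(self.path, 'r', encoding='utf-8', errors='replace')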

Related

A strange python unicode decode error trying to do redis dump (prefer python3, but seeing in both 2 & 3)

I am trying to do a redis dump using the python package redis-dump-load.
It is in UTF-8, except for apparently one key, which I'm told is in ascii. No idea why, but I thought: OK, if there is a UnicodeDecodeError for this one key (I am receiving lots of data from the dump stream up to this point), I will get the encoding of the bytes string I have and decode with that instead.
However, I am still getting a UnicodeDecodeError for ASCII as well! It is bytes, it is ascii; I think it might just be corrupt and I plan to just skip it, but I'm curious whether anyone has other ideas.
This is my code snippet:
value = {}
for k in response:
    try:
        value[k.decode(encoding)] = response[k].decode(encoding)
    except UnicodeDecodeError:
        print("Error for", k)
        print(type(k))
        orig_encoding = chardet.detect(k)['encoding']
        print(orig_encoding)
        value[k.decode(orig_encoding)] = response[k].decode(orig_encoding)
return value
And here is the output I see:
Error for b'meta'
<class 'bytes'>
ascii
Traceback (most recent call last):
File "redisdl.py", line 243, in handle_response
value[k.decode(encoding)] = response[k].decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "redisdl.py", line 635, in <module>
main()
File "redisdl.py", line 626, in main
do_dump(options)
File "redisdl.py", line 547, in do_dump
dump(output, **kwargs)
File "redisdl.py", line 174, in dump
for key, type, ttl, value in _reader(r, pretty, encoding, keys):
File "redisdl.py", line 293, in _reader
type, ttl, value = _read_key(encoded_key, r, pretty, encoding)
File "redisdl.py", line 284, in _read_key
value = reader.handle_response(results[2], pretty, encoding)
File "redisdl.py", line 250, in handle_response
response[k].decode(orig_encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)
Am I missing something? I have looked at multiple SO answers, believe me, and until now I thought I understood str, bytes, decoding, and encoding. But maybe not. Even the py2 & py3 differences, I think. I'm starting to doubt everything...
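One detail stands out in the snippet and traceback above: chardet.detect() is run on the key k, while the decode that actually fails is on the value response[k]. A minimal sketch of the skip-or-replace plan described in the question, with the detection moved to the value (hedged; response and encoding are assumed to be as in the snippet):

import chardet

value = {}
for k in response:
    try:
        value[k.decode(encoding)] = response[k].decode(encoding)
    except UnicodeDecodeError:
        # Detect the value's encoding, not the key's; fall back to
        # replacement characters if the guess fails too.
        guess = chardet.detect(response[k])['encoding']
        try:
            decoded = response[k].decode(guess or encoding)
        except UnicodeDecodeError:
            decoded = response[k].decode(encoding, errors='replace')
        value[k.decode(encoding)] = decoded
return value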

UnicodeDecodeError in Python 3 when reading binary files from stdin

I'm trying to take what's in stdin and read it 1022 bytes at a time. This code runs fine for text files, but when given a binary file it raises a UnicodeDecodeError. data below is sys.stdin.
def sendStdIn(conn, cipher, data):
    while True:
        chunk = data.read(1022)
        if len(chunk) == 1022:
            EOFAndChunk = b'F' + chunk.encode("utf-8")
            conn.send(encryptAndPad(cipher, EOFAndChunk))
        else:
            EOFAndChunk = b'T' + chunk.encode("utf-8")
            conn.send(encryptAndPad(cipher, EOFAndChunk))
            break
    return True
The binary file was made by calling dd if=/dev/urandom bs=1K iflag=fullblock count=1K > 1MB.bin
I run the program with essentially python A3C.py < 1MB.bin and end up with the traceback below.
Traceback (most recent call last):
File "A3C.py", line 163, in <module>
main()
File "A3C.py", line 121, in main
EasyCrypto.sendStdIn(soc, cipher, sys.stdin)
File "EasyCrypto.py", line 63, in sendStdIn
chunk = data.read(1022)
File "/usr/lib64/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 0: invalid continuation byte
Any idea how I can make this read sections of the binary file? I need to send them from the client side to the server side a piece at a time. Thanks!
sys.stdin is a text wrapper that decodes binary data. Use sys.stdin.buffer instead:
EasyCrypto.sendStdIn(soc, cipher, sys.stdin.buffer)
The TextIOBase.buffer attribute points to the binary buffered I/O object underneath.
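With that change, read() returns bytes already, so the .encode("utf-8") calls go away too. A sketch of the adjusted function, assuming encryptAndPad and the b'T'/b'F' framing stay as in the question:

def sendStdIn(conn, cipher, data):
    # data is now a binary stream (sys.stdin.buffer), so read() returns bytes.
    while True:
        chunk = data.read(1022)
        if len(chunk) == 1022:
            conn.send(encryptAndPad(cipher, b'F' + chunk))
        else:
            # Short (or empty) read: this is the last chunk.
            conn.send(encryptAndPad(cipher, b'T' + chunk))
            break
    return True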

UnicodeEncodeError when reading a file

I am trying to read from the rockyou wordlist and write all words that are >= 8 chars to a new file.
Here is the code -
def main():
    with open("rockyou.txt", encoding="utf8") as in_file, open('rockout.txt', 'w') as out_file:
        for line in in_file:
            if len(line.rstrip()) < 8:
                continue
            print(line, file=out_file, end='')
    print("done")

if __name__ == '__main__':
    main()
Some words are not utf-8.
Traceback (most recent call last):
File "wpa_rock.py", line 10, in <module>
main()
File "wpa_rock.py", line 6, in main
print(line, file = out_file, end = '')
File "C:\Python\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0e45' in position 0: character maps to <undefined>
Update
def main():
    with open("rockyou.txt", encoding="utf8") as in_file, open('rockout.txt', 'w', encoding="utf8") as out_file:
        for line in in_file:
            if len(line.rstrip()) < 8:
                continue
            out_file.write(line)
    print("done")

if __name__ == '__main__':
    main()
Traceback (most recent call last):
File "wpa_rock.py", line 10, in <module>
main()
File "wpa_rock.py", line 3, in main
for line in in_file:
File "C:\Python\lib\codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 933: invalid continuation byte
Your UnicodeEncodeError: 'charmap' error occurs while writing to out_file (in print()).
By default, open() uses locale.getpreferredencoding(), which on Windows is the ANSI codepage (such as cp1252). cp1252 is a one-byte encoding that can represent at most 256 different characters, but there are over a million (1114111) Unicode code points; it can't represent them all, and it has no mapping for '\u0e45' in particular.
Pass an encoding that can represent all the desired data, e.g. encoding='utf-8' (as @robyschek suggested). If your code reads utf-8 data without any errors, it should be able to write the same data as utf-8 too.
Your UnicodeDecodeError: 'utf-8' error occurs while reading in_file (for line in in_file). Not all byte sequences are valid utf-8; e.g. os.urandom(100).decode('utf-8') may fail. What to do depends on the application.
If you expect the file to be encoded as utf-8, you could pass the errors="ignore" parameter to open() to skip occasional invalid byte sequences, or use some other error handler depending on your application (a combined sketch follows below).
If the actual character encoding used in the file is different, you should pass that encoding instead. Bytes by themselves do not carry any encoding; that metadata has to come from another source (though some encodings are more likely than others, and chardet can guess). For example, if the file content is an http body, see A good way to get the charset/encoding of an HTTP response in Python.
Sometimes broken software generates mostly-utf-8 byte sequences with a few bytes in a different encoding. bs4.BeautifulSoup can handle some special cases. You could also try the ftfy utility/library and see if it helps; e.g. ftfy may fix some utf-8 variations.
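Putting both fixes together for the rockyou case, a minimal sketch (utf-8 with errors="ignore" on input, explicit utf-8 on output):

def main():
    # errors="ignore" silently drops invalid byte sequences while reading;
    # the explicit encoding on the output file avoids the cp1252 default.
    with open("rockyou.txt", encoding="utf8", errors="ignore") as in_file, \
            open("rockout.txt", "w", encoding="utf8") as out_file:
        for line in in_file:
            if len(line.rstrip()) >= 8:
                out_file.write(line)
    print("done")

if __name__ == '__main__':
    main()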
Hey, I was having a similar issue. In the case of the rockyou.txt wordlist, I tried a number of the encodings Python has to offer and found that encoding='koi8_u' worked to read the file.

python: csv to json conversion when csv contains unicode

I'm trying to use the following code (within web2py) to read a csv file and convert it into a json object:
import csv
import json
from StringIO import StringIO  # needed for StringIO below (Python 2, per the traceback)

originalfilename, file_stream = db.tablename.file.retrieve(info.file)
file_contents = file_stream.read()
csv_reader = csv.DictReader(StringIO(file_contents))
json = json.dumps([x for x in csv_reader])
This produces the following error:
'utf8' codec can't decode byte 0xa0 in position 1: invalid start byte
Apparently, there is a problem handling the spaces in the .csv file. The problem appears to stem from the json.dumps() line. The traceback from that point on:
Traceback (most recent call last):
File ".../web2py/gluon/restricted.py", line 212, in restricted
exec ccode in environment
File ".../controllers/default.py", line 2345, in <module>
File ".../web2py/gluon/globals.py", line 194, in <lambda>
self._caller = lambda f: f()
File ".../web2py/gluon/tools.py", line 3021, in f
return action(*a, **b)
File ".../controllers/default.py", line 697, in generate_vis
request.vars.json = json.dumps(list(csv_reader))
File "/usr/local/lib/python2.7/json/__init__.py", line 243, in dumps
return _default_encoder.encode(obj)
File "/usr/local/lib/python2.7/json/encoder.py", line 207, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/usr/local/lib/python2.7/json/encoder.py", line 270, in iterencode
return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 1: invalid start byte
Any suggestions regarding how to resolve this, or another way to get a csv file (which contains a header; using StringIO) into a json object that won't produce similar complications? Thank you.
The csv module (under Python 2) is purely byte-based; all strings you get out of it are bytes. However JSON is Unicode character-based, so there is an implicit conversion when you try to write out the bytes you got from CSV into JSON. Python guessed UTF-8 for this, but your CSV file wasn't UTF-8 - it was probably Windows code page 1252 (Western European - like ISO-8859-1 only not quite).
A quick fix would be to transcode your input (file_contents = file_contents.decode('windows-1252').encode('utf-8')), but probably you don't really want to rely on json guessing a particular encoding.
Best would be to explicitly decode your strings at the point of reading them from CSV. Then JSON will be able to cope with them OK. Unfortunately csv doesn't have built-in decoding (at least in this Python version), but you can do it manually:
class UnicodeDictReader(csv.DictReader):
    def __init__(self, f, encoding, *args, **kwargs):
        csv.DictReader.__init__(self, f, *args, **kwargs)
        self.encoding = encoding

    def next(self):
        return {
            k.decode(self.encoding): v.decode(self.encoding)
            for (k, v) in csv.DictReader.next(self).items()
        }

csv_reader = UnicodeDictReader(StringIO(file_contents), 'windows-1252')
json_output = json.dumps(list(csv_reader))
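A note on the design: decoding at the CSV boundary means json.dumps() only ever sees unicode objects, so it never has to guess an encoding; with its default ensure_ascii=True it will then write any non-ASCII characters as \uXXXX escapes in the JSON output.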
it's not known in advance what sort of encoding will come up
Well, that's more of a problem, since it's impossible to accurately guess what encoding a file is in. You would either have to specify a particular encoding, or give the user a way to signal what the encoding is, if you want to support non-ASCII characters properly.
Try replacing your final line with
json = json.dumps([x.encode('utf-8') for x in csv_reader])
Running unidecode over the file contents seems to do the trick:
from isounidecode import unidecode
...
file_contents = unidecode(file_stream.read())
...
Thanks, everyone!

AllegroGraph - UTF-8 characters in N-Triples

When I use the AllegroGraph 4.6 Python API, I can use the connection.addTriple() method to try to add a triple that ends in a literal containing a unicode character (×):
conn.addTriple( ..., ..., '5 × 10**5' )
This doesn't work. I get the error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position...
Here's the full traceback:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/repository/repositoryconnection.py", line 357, in addTriple
self._convert_term_to_mini_term(obj), cxt)
File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/repository/repositoryconnection.py", line 235, in _convert_term_to_mini_term
return self._to_ntriples(term)
File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/repository/repositoryconnection.py", line 367, in _to_ntriples
else: return term.toNTriples();
File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/model/literal.py", line 182, in toNTriples
sb.append(strings.encode_ntriple_string(self.getLabel()))
File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/util/strings.py", line 52, in encode_ntriple_string
string = unicode(string)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 18: ordinal not in range(128)
Instead I can add the triple like this:
conn.addTriple( ..., ..., u'5 × 10**5' )
That way I don't get an error.
But if I load a file of ntriples that contains some UTF-8 encoded characters using connection.addFile(filename, format=RDFFormat.NTRIPLES), I get this error message if the ntriples file is saved as ANSI encoding from Notepad++:
400 MALFORMED DATA: N-Triples parser error while parsing
#<http request stream # #x10046f9ea2> at line 12764 (last character was
#\×): nil
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/repository/repositoryconnection.py", line 341, in addFile
commitEvery=self.add_commit_size)
File "/cygdrive/c/agraph-4.6-client-python/src2/franz/miniclient/repository.py", line 342, in loadFile
nullRequest(self, "POST", "/statements?" + params, body, contentType=mime)
File "/cygdrive/c/agraph-4.6-client-python/src2/franz/miniclient/request.py", line 198, in nullRequest
if (status < 200 or status > 204): raise RequestError(status, body)
franz.miniclient.request.RequestError: Server returned 400: N-Triples parser error while parsing
I get this error message if the file is saved as UTF-8 encoding:
400 MALFORMED DATA: N-Triples parser error while parsing
#<http request stream # #x100486e8b2> at line 1 (last character was
#\): Subjects must be resources (i.e., URIs or blank nodes)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/repository/repositoryconnection.py", line 341, in addFile
commitEvery=self.add_commit_size)
File "/cygdrive/c/agraph-4.6-client-python/src2/franz/miniclient/repository.py", line 342, in loadFile
nullRequest(self, "POST", "/statements?" + params, body, contentType=mime)
File "/cygdrive/c/agraph-4.6-client-python/src2/franz/miniclient/request.py", line 198, in nullRequest
if (status < 200 or status > 204): raise RequestError(status, body)
franz.miniclient.request.RequestError: Server returned 400: N-Triples parser error while parsing
However, if the file is set to ANSI encoding in Notepad++, I can go in and paste the × character, save, and then the file loads fine. Or, if I change the file encoding to UTF-8 after pasting the character, the character changes to some strange xD7 character. If the file is set to UTF-8 encoding and I paste the × in there, then changing the encoding to ANSI turns the × into ×.
When the file was given to me, it had × where the × should have been, and when I tried to load it in AllegroGraph I got the first 400 MALFORMED DATA error, failing at the line where the character actually appears in the file (12764) rather than at the first line. I assume the second 400 MALFORMED DATA error on line 1 has something to do with the header (presumably the byte-order mark) that Notepad++ writes for UTF-8 encoded files. So apparently I have to save the file as ANSI if I want AllegroGraph not to hiccup immediately, but there has to be some way to tell AllegroGraph to read things like × as UTF-8 characters.
In the file, the triple looks like:
<...some subject URI...> <...some predicate URI...> "5 × 10**5" .
\xd7 is the Latin-1 encoding of ×.
× is what you get if you take ×, encode it in UTF-8, and then mistakenly decode the bytes as cp1252 (often Windows' default codec).
When you're given files that show ×, try changing the codec that's used to display them to UTF-8.
For an overview of Unicode in Python see here. ~ Thanks to Daenyth.
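The mojibake is easy to reproduce in a couple of lines (Python 2, matching the traceback above):

# Encode the multiplication sign as utf-8, then mis-decode the bytes as cp1252.
s = u'\u00d7'
mangled = s.encode('utf-8').decode('cp1252')
print(repr(mangled))   # u'\xc3\u2014' -- two junk characters where one belonged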
As you found out from AllegroGraph support:
AllegroGraph can take unicode characters in nTriples using \uXXXX
notation. Alternatively one can use RDFXML, which allows you to leave the
unicode characters as they are.
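To see what that \uXXXX notation means in practice, here is a hypothetical helper (ntriples_escape is not part of the AllegroGraph API) sketching how a literal could be escaped before it goes into an N-Triples file:

def ntriples_escape(text):
    # Classic N-Triples is 7-bit ASCII: characters outside ASCII are written
    # as \uXXXX (BMP) or \UXXXXXXXX (beyond the BMP) escapes.
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append(u'\\u%04X' % cp)
        else:
            out.append(u'\\U%08X' % cp)
    return u''.join(out)

print(ntriples_escape(u'5 \u00d7 10**5'))   # prints: 5 \u00D7 10**5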
Use the codecs module:
import codecs
f = codecs.open('file.txt', 'r', 'utf8')
This opens the file, forcing the utf-8 encoding.
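For example, iterating over a file opened this way yields unicode strings, already decoded from utf-8:

import codecs

# Each line comes back as a unicode string rather than raw bytes.
with codecs.open('file.txt', 'r', 'utf8') as f:
    for line in f:
        print(line.rstrip())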
