UnicodeDecodeError in Python 3 when reading binary files from stdin - python

I'm trying to read stdin 1022 bytes at a time. This code runs fine for text files, but when the input is a binary file it raises UnicodeDecodeError. In the code below, data is sys.stdin.
def sendStdIn(conn, cipher, data):
    while True:
        chunk = data.read(1022)
        if len(chunk)==1022:
            EOFAndChunk = b'F' + chunk.encode("utf-8")
            conn.send(encryptAndPad(cipher,EOFAndChunk))
        else:
            EOFAndChunk = b'T' + chunk.encode("utf-8")
            conn.send(encryptAndPad(cipher,EOFAndChunk))
            break
    return True
The binary file was made by calling dd if=/dev/urandom bs=1K iflag=fullblock count=1K > 1MB.bin
I run the file with essentially python A3C.py < 1MB.bin
Then I end up with below.
Traceback (most recent call last):
File "A3C.py", line 163, in <module>
main()
File "A3C.py", line 121, in main
EasyCrypto.sendStdIn(soc, cipher, sys.stdin)
File "EasyCrypto.py", line 63, in sendStdIn
chunk = data.read(1022)
File "/usr/lib64/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 0: invalid continuation byte
Any idea how I can read the binary file in sections? I need to send them from the client to the server one piece at a time. Thanks!

sys.stdin is a text wrapper that decodes the underlying bytes (as UTF-8 here), so arbitrary binary input fails to decode. Use sys.stdin.buffer instead:
EasyCrypto.sendStdIn(soc, cipher, sys.stdin.buffer)
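The TextIOBase.buffer attribute points to the binary buffered I/O object underneath. Because reads from it return bytes rather than str, the chunk.encode("utf-8") calls in sendStdIn are no longer needed and must be removed (bytes objects have no encode method).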
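As a minimal sketch of the resulting read loop (not the asker's actual module), the generator below reads fixed-size chunks from any binary stream and prefixes each one with the b'F'/b'T' flag used in the question. The name frame_chunks is hypothetical, io.BytesIO stands in for sys.stdin.buffer so the example runs on its own, and the encryption and socket calls are left out.

```python
import io

CHUNK_SIZE = 1022  # chunk size used in the question


def frame_chunks(stream, size=CHUNK_SIZE):
    """Yield b'F' + chunk for each full chunk, then b'T' + the final short chunk.

    `stream` must be a binary stream (e.g. sys.stdin.buffer), so read()
    returns bytes and no decoding or encoding step is involved.
    """
    while True:
        chunk = stream.read(size)
        if len(chunk) == size:
            yield b"F" + chunk
        else:
            yield b"T" + chunk  # last (possibly empty) chunk
            break


# io.BytesIO stands in for sys.stdin.buffer; 0xe8 is the byte from the traceback.
demo_stream = io.BytesIO(bytes([0xE8, 0xFF, 0x00]) * 700)  # 2100 bytes
frames = list(frame_chunks(demo_stream))
print([(frame[:1], len(frame) - 1) for frame in frames])
```

In the question's code this would be driven by frame_chunks(sys.stdin.buffer), with each frame passed to encryptAndPad and then conn.send.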

Related

How to decode Windows-1254 file to a readable format in Python

I have a file that contains Windows-1254 encoded data, and I want to decode it to a readable format with Python, but I've had no success so far.
I tried this code, but I am getting an error:
file = open('file_name', 'rb')
content_of_file = file.read()
file.close()
decoded_data = content_of_file.decode('cp1254') # here I am trying to decode that file
print(decoded_data)
This is the error that I get:
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python3.10/encodings/cp1254.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 658: character maps to <undefined>
I know that this file is encoded in Windows-1254 because that is what chardet.detect(data)['encoding'] reports, so keep that in mind.
Does anyone know how I can decode that data in a way I can understand?
This is the file content, in case someone needs it.
b'c\x00\x00\x00\x00\x04\x00\x00\x00\x05\x00\x00\x00C\x00\x00\x00s3\x03\x00\x00d\x01\x00\x01\x03\x03\x03\t\x02\x02\t\x03\x03\x03\t\x02\x02\td\x02\x00d\x00\x00l\x00\x00}\x01\x00d\x02\x00d\x00\x00l\x01\x00}\x02\x00t\x02\x00\x83\x00\x00}\x03\x00d\x03\x00}\x00\x00t\x03\x00|\x03\x00\x83\x01\x00d\x04\x00k\x03\x00rZ\x00d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\xd0\x02|\x03\x00d\x06\x00\x19d\x07\x00k\x03\x00ry\x00d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00t\x05\x00|\x03\x00d\x08\x00\x19\x83\x01\x00d\t\x00Ad\n\x00k\x03\x00r\xa2\x00d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00d\x0b\x00d\x0c\x00!d\x00\x00d\x00\x00d\x02\x00\x85\x03\x00\x19j\x06\x00d\r\x00\x83\x01\x00s\xd4\x00d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00t\x07\x00d\x0e\x00\x83\x01\x00\x19j\x08\x00\x83\x00\x00\x0co\x07\x01|\x03\x00t\x07\x00d\x0e\x00\x83\x01\x00\x19d\x0f\x00j\t\x00d\x10\x00\x83\x01\x00k\x02\x00s\x19\x01d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00|\x01\x00j\n\x00d\x11\x00d\x04\x00\x83\x02\x00d\x06\x00\x14d\x11\x00\x17\x19j\x0b\x00d\x12\x00\x83\x01\x00d\x13\x00k\x03\x00rU\x01d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00t\x05\x00|\x03\x00d\x14\x00\x19\x83\x01\x00t\x05\x00|\x03\x00d\x15\x00\x19\x83\x01\x00Ad\x16\x00k\x03\x00s\x89\x01|\x03\x00d\x14\x00\x19d\x17\x00k\x03\x00r\x98\x01d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00d\x18\x00\x19d\x19\x00j\t\x00d\x1a\x00\x83\x01\x00k\x03\x00r\xc0\x01d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00d\x1b\x00\x19d\x1c\x00j\x0c\x00d\x1c\x00d\x1d\x00\x83\x02\x00k\x03\x00r\xeb\x01d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00t\r\x00d\x1e\x00d\x0b\x00\x83\x02\x00\x19t\x0e\x00d\x11\x00\x83\x01\x00d\x1f\x00\x17j\t\x00d\x10\x00\x83\x01\x00k\x03\x00r&\x02d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00d 
\x00d!\x00!j\x0c\x00d"\x00d\x00\x00d\x00\x00d\x02\x00\x85\x03\x00\x19d\x1c\x00\x83\x02\x00d\x1c\x00k\x03\x00ra\x02d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00d!\x00\x19t\x0f\x00i\x00\x00\x83\x01\x00j\x10\x00d\x0b\x00\x19k\x03\x00r\x8d\x02d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00d#\x00\x19t\x0e\x00d\x06\x00\x83\x01\x00k\x03\x00r\xb2\x02d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00d$\x00\x19d%\x00k\x07\x00r\xd1\x02d&\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x02\x00j\x01\x00|\x03\x00d'\x00\x19\x83\x01\x00j\x11\x00\x83\x00\x00d(\x00k\x03\x00r\xff\x02d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00t\x05\x00|\x03\x00d)\x00\x19\x83\x01\x00t\x07\x00d*\x00\x83\x01\x00k\x03\x00r*\x03d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00d+\x00GHd\x00\x00S(,\x00\x00\x00Ns\x1a\x00\x00\x00#Import random,md5 modulesi\xff\xff\xff\xffs\x14\x00\x00\x00#Start checking flagi\x12\x00\x00\x00s\x05\x00\x00\x00NO!!!i\x00\x00\x00\x00t\x01\x00\x00\x00Mi\x01\x00\x00\x00i\xff\x00\x00\x00i\xaa\x00\x00\x00i\x02\x00\x00\x00i\x04\x00\x00\x00t\x02\x00\x00\x00HCs\x10\x00\x00\x00([] > {}) << -~1t\x02\x00\x00\x007bt\x03\x00\x00\x00hexi\x05\x00\x00\x00t\x06\x00\x00\x00base64s\x05\x00\x00\x00ZA==\ni\x06\x00\x00\x00i\x07\x00\x00\x00i\x1a\x00\x00\x00t\x01\x00\x00\x00ii\x08\x00\x00\x00t\x01\x00\x00\x00nt\x05\x00\x00\x00rot13i\t\x00\x00\x00t\x00\x00\x00\x00t\x01\x00\x00\x00st\x04\x00\x00\x001010t\x01\x00\x00\x00fi\x0b\x00\x00\x00i\r\x00\x00\x00t\x02\x00\x00\x00ypi\x0e\x00\x00\x00i\x0f\x00\x00\x00t\x08\x00\x00\x00dddddddds\x06\x00\x00\x00NO!!!!i\x10\x00\x00\x00t 
\x00\x00\x00eccbc87e4b5ce2fe28308fd9f2a7baf3i\x11\x00\x00\x00sM\x00\x00\x00~(~(~((((~({}<[])<<({}<[]))<<({}<[]))<<({}<[]))<<({}<[]))<<({}<[]))<<({}<[]))s\n\x00\x00\x00Bazinga!!!(\x12\x00\x00\x00t\x06\x00\x00\x00randomt\x03\x00\x00\x00md5t\t\x00\x00\x00raw_inputt\x03\x00\x00\x00lent\x04\x00\x00\x00quitt\x03\x00\x00\x00ordt\n\x00\x00\x00startswitht\x04\x00\x00\x00evalt\x07\x00\x00\x00isalphat\x06\x00\x00\x00decodet\x07\x00\x00\x00randintt\x06\x00\x00\x00encodet\x07\x00\x00\x00replacet\x03\x00\x00\x00intt\x03\x00\x00\x00strt\x04\x00\x00\x00typet\x08\x00\x00\x00__name__t\t\x00\x00\x00hexdigest(\x04\x00\x00\x00t\x07\x00\x00\x00commentR\x0f\x00\x00\x00R\x10\x00\x00\x00t\x03\x00\x00\x00ans(\x00\x00\x00\x00(\x00\x00\x00\x00sA\x00\x00\x00XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXt\x0c\x00\x00\x00get_the_flag\x0b\x00\x00\x00sl\x00\x00\x00\x00\x02\x12\x01\x0c\x01\x0c\x01\t\x02\x06\x01\x12\x01\x05\x01\n\x02\x10\x01\x05\x01\n\x01\x1a\x01\x05\x01\n\x01#\x01\x05\x01\n\x016\x01\x05\x01\n\x01-\x01\x05\x01\n\x014\x01\x05\x01\n\x01\x19\x01\x05\x01\n\x01\x1c\x01\x05\x01\n\x01,\x01\x05\x01\n\x01,\x01\x05\x01\n\x01\x1d\x01\x05\x01\n\x01\x16\x01\x05\x01\n\x01\x10\x01\x05\x01\n\x01\x1f\x01\x05\x01\n\x01\x1c\x01\x05\x01\n\x01'
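The traceback above can be reproduced in isolation, which may help when judging whether cp1254 is a plausible guess for this data. This is a small sketch using only the standard library: byte 0x8d has no character assigned in Python's cp1254 codec, so strict decoding fails, while errors="replace" substitutes U+FFFD for it. The sample bytes are made up for illustration.

```python
sample = b"abc\x8ddef"  # made-up bytes containing 0x8d, the byte from the traceback

try:
    sample.decode("cp1254")
    strict_ok = True
except UnicodeDecodeError as exc:
    strict_ok = False
    print(exc.reason, "at position", exc.start)

# errors="replace" never raises; undecodable bytes become U+FFFD.
lossy = sample.decode("cp1254", errors="replace")
print(lossy)
```

A lossy decode like this only makes the text printable; it does not show that the detected encoding is the right one.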

UnicodeDecodeError when using specific python library (gender-detector)

I need to do gender guessing for some analysis, and after some research I found this Python library on GitHub: malev/gender-detector
After following the instructions and making some tweaks (the readme says to use import gender_detector as gd, but I needed the following instead):
from gender_detector import gender_detector as gd
I ran into a problem: the library has four datasets, 'us', 'uk', 'ar' and 'uy', but it only works with 'us' or 'uk'.
See the example below:
from gender_detector import gender_detector as gd
detector = gd.GenderDetector('us')
detector2 = gd.GenderDetector('ar')
detector.guess('Marcos')
Out[25]: 'male'
detector2.guess('Marcos')
Traceback (most recent call last):
File "", line 1, in
detector2.guess('Marcos')
File "/home/cpneto/anaconda3/lib/python3.6/site-packages/gender_detector/gender_detector.py", line 25, in guess
initial_position = self.index(name[0])
File "/home/cpneto/anaconda3/lib/python3.6/site-packages/gender_detector/index.py", line 19, in __call__
self._generate_index()
File "/home/cpneto/anaconda3/lib/python3.6/site-packages/gender_detector/index.py", line 25, in _generate_index
total = file.readline() # Omit headers line
File "/home/cpneto/anaconda3/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 1078: invalid continuation byte
I believe this happens because of py2 vs py3 compatibility, but I'm not sure of that and don't have any clue on how to solve this.
Any suggestions?

The library assumes the 'ar' dataset file is UTF-8 encoded, but it isn't (hence the error at byte 0xf1 in position 1078). You need to either convert the file to UTF-8 or find a way to pass the actual encoding to the library.
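Converting the file could look like the sketch below. It assumes the dataset is actually Latin-1 (where 0xf1 is ñ); that is a guess, not something the traceback confirms, so check the real encoding first. The function name, file names and sample row are hypothetical.

```python
import os
import tempfile


def reencode_to_utf8(src_path, dst_path, source_encoding):
    """Read src_path as source_encoding and write the same text as UTF-8."""
    # newline="" keeps the original line endings untouched in both directions.
    with open(src_path, "r", encoding=source_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# Demo with a throwaway file; "latin-1" is an assumed source encoding.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "names_latin1.csv")
dst_path = os.path.join(workdir, "names_utf8.csv")
with open(src_path, "wb") as f:
    f.write(b"name,gender\nI\xf1aki,male\n")

reencode_to_utf8(src_path, dst_path, "latin-1")
with open(dst_path, encoding="utf-8") as f:
    converted = f.read()
print(converted)
```

Writing to a separate destination path keeps the original file intact in case the assumed encoding turns out to be wrong.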

A strange Python UnicodeDecodeError when trying to do a redis dump (prefer Python 3, but seeing it in both 2 & 3)

I am trying to do a redis dump using the python package redis-dump-load.
The data is in UTF-8, except apparently for one key, which I'm told is ASCII. I have no idea why, but I thought: if there is a UnicodeDecodeError for this one key (I receive lots of data from the dump stream up to this point), I will detect the encoding of the byte string I have and decode with that instead.
However, I still get a UnicodeDecodeError with ASCII as well! It is bytes, it is reported as ASCII, and I think it might just be corrupt, so I plan to skip it, but I'm curious whether anyone has other ideas.
This is my code snippet:
value = {}
for k in response:
    try:
        value[k.decode(encoding)] = response[k].decode(encoding)
    except UnicodeDecodeError:
        print("Error for", k)
        print(type(k))
        orig_encoding = chardet.detect(k)['encoding']
        print(orig_encoding)
        value[k.decode(orig_encoding)] = response[k].decode(orig_encoding)
return value
And here is the output I see:
Error for b'meta'
<class 'bytes'>
ascii
Traceback (most recent call last):
File "redisdl.py", line 243, in handle_response
value[k.decode(encoding)] = response[k].decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "redisdl.py", line 635, in <module>
main()
File "redisdl.py", line 626, in main
do_dump(options)
File "redisdl.py", line 547, in do_dump
dump(output, **kwargs)
File "redisdl.py", line 174, in dump
for key, type, ttl, value in _reader(r, pretty, encoding, keys):
File "redisdl.py", line 293, in _reader
type, ttl, value = _read_key(encoded_key, r, pretty, encoding)
File "redisdl.py", line 284, in _read_key
value = reader.handle_response(results[2], pretty, encoding)
File "redisdl.py", line 250, in handle_response
response[k].decode(orig_encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)
Am I missing something? I have looked at multiple SO answers, believe me, and I thought up until now I understood str, bytes, decoding, and encoding. But maybe not. Even py2 & py3 differences I get. I think. Starting to doubt everything...
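One detail visible in the traceback: chardet.detect(k) inspects the key (b'meta', which is plain ASCII), but the second failure comes from response[k], the value, which starts with byte 0x80. A hedged sketch of decoding key and value independently, keeping undecodable bytes visible as escapes instead of raising, is shown below. The helper name and the sample data are made up, and errors="backslashreplace" for decoding requires Python 3.5 or later.

```python
def decode_lenient(raw, encoding="utf-8"):
    """Decode bytes, turning undecodable bytes into \\xNN escapes instead of raising."""
    return raw.decode(encoding, errors="backslashreplace")


# Made-up stand-in for the redis response: an ASCII key with a binary value.
response = {b"meta": b"\x80\x02binary", b"name": "caf\u00e9".encode("utf-8")}

value = {}
for k, v in response.items():
    value[decode_lenient(k)] = decode_lenient(v)

print(value)
```

This is meant for inspecting what the value contains; whether an escaped string is acceptable in a dump that has to be loaded back is a separate decision.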

Python 3 itertools.islice continue despite UnicodeDecodeError

I have a python 3 program that monitors a log file. The log includes, among other things, chat messages written by users. The log is created by a third party application which I cannot change.
Today a user wrote "텋��텋��" and it caused the program to crash with the following error:
future: <Task finished coro=<updateConsoleLog() done, defined at /usr/local/src/bserver/logmonitor.py:48> exception=UnicodeDecodeError('utf-8',...
say "\xed\xa0\xbd\xed\xb1\x8c"\r\n', 7623, 7624, 'invalid continuation byte')>
Traceback (most recent call last):
File "/usr/lib/python3.4/asyncio/tasks.py", line 238, in _step
result = next(coro)
File "/usr/local/src/bserver/logmonitor.py", line 50, in updateConsoleLog
server_events = self.console.getUpdate()
File "/usr/local/src/bserver/console.py", line 79, in getUpdate
return self.read()
File "/usr/local/src/bserver/console.py", line 90, in read
for line in itertools.islice(log_file, log_no, None):
File "/usr/lib/python3.4/codecs.py", line 319, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 7623: invalid continuation byte
ERROR:asyncio:Task exception was never retrieved
Using 'file -i log.file' I determined that the log file is us-ascii. This shouldn't be an issue, as ASCII is a subset of UTF-8 (as far as I know).
Since this happens rarely and I don't mind losing whatever it is that this user typed, is it possible for me to ignore this line or the particular characters that can't be decoded and just keep on reading the rest of the file?
I considered using try: ... except UnicodeDecodeError as ..., but this would mean I can't read anything in the log file after the error.
Code
def read(self):
    log_no = self.last_log_no
    log_file = open(self.path, 'r')
    server_events = []
    starting_log_no = log_no
    for line in itertools.islice(log_file, log_no, None):  # ERROR
        server_events.append(line)
        print(line.replace('\n', '').replace('\r', ''))
        log_no += 1
    self.last_log_no = log_no
    if (starting_log_no < log_no):
        return server_events
    return False
Any help or advice would be appreciated!

The byte string \xed\xa0\xbd\xed\xb1\x8c is not valid UTF-8. Nor is it US-ASCII, since US-ASCII only defines 7-bit values (0 to 127) and \x8c, for example, is greater than 127.
Instead of ignoring the UnicodeDecodeError, try opening the file with an encoding that maps all 256 byte values to characters (e.g. latin-1):
log_file = open(self.path, 'r', encoding='latin-1')
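If the goal is to keep reading the file as UTF-8 and only sacrifice the undecodable bytes, as the question asks, open() also accepts an errors argument. The sketch below writes a throwaway log containing the byte sequence from the traceback and reads it with errors="replace"; the file name and the other log lines are made up for illustration.

```python
import itertools
import os
import tempfile

# Throwaway log: the middle line holds the byte sequence from the traceback.
log_path = os.path.join(tempfile.mkdtemp(), "console.log")
with open(log_path, "wb") as f:
    f.write(b"first line\r\n")
    f.write(b'say "\xed\xa0\xbd\xed\xb1\x8c"\r\n')
    f.write(b"last line\r\n")

# errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
with open(log_path, "r", encoding="utf-8", errors="replace") as log_file:
    lines = [line.rstrip("\r\n") for line in itertools.islice(log_file, 0, None)]

print(len(lines), "lines read")
```

With latin-1, by contrast, every byte decodes to some character, so nothing is dropped, but the chat text will not display as the user typed it.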

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 434852: invalid continuation byte

I am using hfcca to calculate cyclomatic complexity for C++ code. hfcca is a simple Python script (https://code.google.com/p/headerfile-free-cyclomatic-complexity-analyzer/). When I try to run the script to generate the output as an XML file, I get the following error:
Traceback (most recent call last):
File "./hfcca.py", line 802, in <module>
main(sys.argv[1:])
File "./hfcca.py", line 798, in main
print(xml_output([f for f in r], options))
File "./hfcca.py", line 798, in <listcomp>
print(xml_output([f for f in r], options))
File "/x/home06/smanchukonda/PREFIX/lib/python3.3/multiprocessing/pool.py", line 652, in next
raise value
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 434852: invalid continuation byte
Please help me with this.

It looks like the file contains characters encoded as latin1, whose bytes are not valid UTF-8. The file utility can be useful for figuring out what encoding a file should be treated as, e.g.:
monk@monk-VirtualBox:~$ file foo.txt
foo.txt: UTF-8 Unicode text
Here's what the bytes mean in latin1:
>>> b'\xe2'.decode('latin1')
'â'
The easiest fix is probably to convert the files to UTF-8.
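Before converting, it can help to look at the bytes around the reported position. A minimal sketch using the start attribute of UnicodeDecodeError follows; the helper name and the sample bytes are hypothetical.

```python
def show_decode_error(data, encoding="utf-8", context=10):
    """Return (position, surrounding bytes) of the first decode error, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        begin = max(exc.start - context, 0)
        return exc.start, data[begin:exc.start + context]
    return None


# Made-up sample: a latin1-encoded "â" (0xe2) inside otherwise ASCII source text.
sample = b"int main() { /* ch\xe2teau */ }"
result = show_decode_error(sample)
print(result)
```

For a real file, read it in binary mode (open(path, "rb")) and pass the bytes in; the surrounding text usually makes it clear which character was intended.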

I also had the same problem when rendering Markup("""yyyyyy"""), but I solved it using an online tool that removed the 'bad' characters: https://pteo.paranoiaworks.mobi/diacriticsremover/
It is a nice tool and works even offline.
