I have the following type of bytes:
b = b'2787\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x05\x03\x01\x00\x00\x00\x00\x00\x96\x08\n\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x0047\x00>2!\tMV\xa7\x00\x00\x00\x00'
I must convert it to a string and obtain the 2787. How should I strip the \x00 values?
I tried decode("utf-8"), but it throws the following error message:
'utf-8' codec can't decode byte 0x96 in position 33: invalid start byte
Also rstrip('\x00') didn't work.
Which type of decode should I use?
I obtain a list of strings from here:
data, addr = socket_udp.recvfrom(struct.calcsize("B13s9s61s"))
info = struct.unpack("B13s9s61s", data)
And b is the last field, the 61-byte string.
The class that processes the packet contents:
class Udp_packet:
    type = 0x00
    id = ""
    random_num = ""
    data = ""

    def __init__(self, values_list, convert=False):
        self.type = values_list[0]
        self.id = values_list[1]
        self.random_num = values_list[2]
        self.data = values_list[3].split("\0")[0]
The code works properly under Python 2.7; I just moved to 3.7.5.
What you want from b is apparently the portion before the first NUL byte, b'\x00' (or simply b'\0'), so you can slice b at the index of the first NUL byte:
b = b[:b.find(b'\0')]
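In Python 3 the slice is still a bytes object, so it needs an explicit decode to become a str. A minimal sketch (the split variant is my own suggestion for the edge case where no NUL is present, since find() would then return -1 and silently drop the last byte):

b = b'2787\x00\x00\x00\x00'                # shortened version of the bytes above
s = b[:b.find(b'\0')].decode('ascii')      # '2787'

# Safer variant: split() behaves correctly even if b contains no NUL at all.
s = b.split(b'\0', 1)[0].decode('ascii')   # '2787'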
I need to convert the hex data '767f440128e1a00a' to a packed EBCDIC string. I want all the resulting pieces in one string, but Python is giving a Unicode error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 0: unexpected end of data
import codecs

s = '767f440128e1a00a'
output = []
DDF = [1]
distance = 0
for y in range(1, len(s[2:])):
    for x in DDF:
        if s[2:][distance:x*2+distance] != '':
            output.append(s[2:][distance:x*2+distance])
        else:
            continue
        distance += x*2
print(output)

final = []
result = ''
bytearrya = []
for x in output:
    result = str(bytearray.fromhex(x).decode())
    x = codecs.decode(x, "hex")
    final.append(x)
Here is the code, based on "Python byte representation of a hex string that is EBCDIC", which mentioned: "According to this, you need to use 'cp500' for decoding." From the codec table:
Codec   Aliases                              Languages
cp500   EBCDIC-CP-BE, EBCDIC-CP-CH, IBM500   Western Europe
my_string_in_hex = '767f440128e1a00a'
my_bytes = bytearray.fromhex(my_string_in_hex)
print(my_bytes)
my_string = my_bytes.decode('cp500')
print(my_string)
Output:
bytearray(b'v\x7fD\x01(\xe1\xa0\n')
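As a sanity check on the code page choice, here is a small round trip of my own (the 'HELLO' example is not from the original answer); uppercase letters land in the classic EBCDIC ranges:

# Text -> EBCDIC (cp500) bytes -> hex string, and back again.
text = 'HELLO'
ebcdic = text.encode('cp500')
print(ebcdic.hex())                                  # c8c5d3d3d6
print(bytes.fromhex('c8c5d3d3d6').decode('cp500'))   # HELLO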
I am building a parser, and I am kind of new to this.
I have a problem with decoding specific bytes; they always return the same int (and they shouldn't), so I must be doing it wrong.
byte = ser.read(1)
byte += ser.read(ser.inWaiting())
a = 0
for i in byte:
    if i == 0x04:
        value = struct.unpack("<h", bytes([i, a]))[0]
        print(value)
I receive bytes like this:
b'\xaa\x04\x80\x02\xff\xfb\x83\xaa\xaa\x04\x80\
And I need to decode packet 0x04. I am using Python 3.6
Try something like:
value = int.from_bytes(byte, byteorder='little')
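For example, applied to the two data bytes that follow the 0x04 marker in the stream above (an illustrative slice; I am assuming the layout is a marker byte followed by a little-endian 16-bit value):

packet = b'\xaa\x04\x80\x02\xff\xfb\x83\xaa'    # one frame from the stream above
idx = packet.find(b'\x04')                      # locate the 0x04 marker
payload = packet[idx + 1:idx + 3]               # the two bytes after it: b'\x80\x02'
value = int.from_bytes(payload, byteorder='little')
print(value)                                    # 0x0280 == 640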
I tried to find a solution, but I am still stuck with it.
I want to use PyVisa to control a function generator.
I have a waveform which is a list of values between 0 and 16382.
I then have to prepare it so that each waveform point occupies 2 bytes.
A value is represented in big-endian, MSB-first format, as straight binary. So I do binwaveform = pack('>' + 'h'*len(waveform), *waveform)
And then, when I try to write it to the instrument with AFG.write('trace ememory, ' + header + binwaveform), I get an error:
File ".\afg3000.py", line 97, in <module>
AFG.write('trace ememory, '+ header + binwaveform)
TypeError: Can't convert 'bytes' object to str implicitly
I tried to solve it with AFG.write('trace ememory, ' + header + binwaveform.decode()), but it looks like by default it tries to interpret the bytes as text, which is not valid for some values: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 52787: invalid start byte
Could you please help with it?
binwaveform is a packed byte array of integers. E.g.:
struct.pack('<h', 4545)
b'\xc1\x11'
You can't print it, as it makes no sense to your terminal. In the above example,
0xC1 is invalid in both ASCII and UTF-8.
When you append a byte string to a regular str ('trace ememory, ' + header + binwaveform), Python wants to convert it to readable text but doesn't know how.
Decoding it implies that it's text; it's not.
The best thing to do is print the hex representation of it:
import codecs

binwaveform_hex = codecs.encode(binwaveform, 'hex')    # bytes, e.g. b'c111...'
binwaveform_hex_str = binwaveform_hex.decode('ascii')  # plain str, e.g. 'c111...'
AFG.write('trace ememory, ' + header + binwaveform_hex_str)

Note that str(binwaveform_hex) would give you the repr "b'c111...'", including the literal b'' wrapper, so decode the ASCII hex digits instead.
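If the instrument actually expects raw binary block data rather than hex text, a sketch of an alternative (assuming AFG is a PyVISA message-based resource and that header is a str; check the instrument's programming manual for the exact command format):

# write_raw() sends bytes as-is, so no str conversion is needed at all.
AFG.write_raw(b'trace ememory, ' + header.encode('ascii') + binwaveform)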
I have a file which is mostly UTF-8, but some Windows-1252 characters have also found their way in.
I created a table to map from the Windows-1252 (cp1252) characters to their Unicode counterparts, and would like to use it to fix the mis-encoded characters, e.g.
cp1252_to_unicode = {
    "\x85": u'\u2026', # …
    "\x91": u'\u2018', # ‘
    "\x92": u'\u2019', # ’
    "\x93": u'\u201c', # “
    "\x94": u'\u201d', # ”
    "\x97": u'\u2014'  # —
}

for l in open('file.txt'):
    for c, u in cp1252_to_unicode.items():
        l = l.replace(c, u)
But attempting to do the replace this way results in a UnicodeDecodeError being raised, e.g.:
"\x85".replace("\x85", u'\u2026')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x85 in position 0: ordinal not in range(128)
Any ideas for how to deal with this?
If you try to decode this string as UTF-8, as you already know, you will get a UnicodeDecodeError, because these spurious cp1252 characters are invalid UTF-8.
However, Python codecs allow you to register a callback to handle encoding/decoding errors, with the codecs.register_error function. It gets the UnicodeDecodeError as a parameter, so you can write a handler that attempts to decode the offending bytes as cp1252 and continues decoding the rest of the string as UTF-8.
In my utf-8 terminal, I can build a mixed incorrect string like this:
>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> print a
maçã ma��
>>> a.decode("utf-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 9-11: invalid data
I wrote the said callback function here, and found a catch: even if you increment the position from which to decode the string by 1, so that it would start on the next character, if the next character is also not UTF-8 and out of range(128), the error is raised again at the first out-of-range(128) character; that is, the decoding "walks back" if consecutive non-ASCII, non-UTF-8 characters are found.
The workaround for this is to keep a state variable in the error handler which detects this "walking back" and resumes decoding from the last call to it. In this short example, I implemented it as a global variable (it has to be manually reset to -1 before each call to the decoder):
import codecs

last_position = -1

def mixed_decoder(unicode_error):
    global last_position
    string = unicode_error[1]
    position = unicode_error.start
    if position <= last_position:
        position = last_position + 1
    last_position = position
    new_char = string[position].decode("cp1252")
    # new_char = u"_"
    return new_char, position + 1

codecs.register_error("mixed", mixed_decoder)
And on the console:
>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> last_position = -1
>>> print a.decode("utf-8", "mixed")
maçã maçã
With thanks to jsbueno, a whack of other Google searches, and other pounding, I solved it this way.
# The following works very well, but it does not allow for any attempt to FIX the data.
xmlText = unicode(xmlText, errors='replace').replace(u"\uFFFD", "?")
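For anyone on Python 3, where the unicode() builtin no longer exists, an equivalent sketch (raw_bytes stands in for the undecoded file content):

# errors='replace' substitutes U+FFFD for each undecodable byte,
# which is then turned into a plain question mark.
text = raw_bytes.decode('utf-8', errors='replace').replace('\uFFFD', '?')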
This version allows for a limited opportunity to repair invalid characters. Unknown characters are replaced with a safe value.
import codecs

replacement = {
    '85' : '...', # u'\u2026' horizontal ellipsis
    '96' : '-',   # u'\u2013' en-dash
    '97' : '-',   # u'\u2014' em-dash
    '91' : "'",   # u'\u2018' left single quote
    '92' : "'",   # u'\u2019' right single quote
    '93' : '"',   # u'\u201C' left double quote
    '94' : '"',   # u'\u201D' right double quote
    '95' : "*"    # u'\u2022' bullet
}
# This is more complex, but allows for the data to be fixed.
def mixed_decoder(unicodeError):
    errStr = unicodeError[1]
    errLen = unicodeError.end - unicodeError.start
    nextPosition = unicodeError.start + errLen
    errHex = errStr[unicodeError.start:unicodeError.end].encode('hex')
    if errHex in replacement:
        return u'%s' % replacement[errHex], nextPosition
    return u'%s' % errHex, nextPosition  # Comment this line out to get a question mark
    return u'?', nextPosition

codecs.register_error("mixed", mixed_decoder)

xmlText = xmlText.decode("utf-8", "mixed")
Basically I attempt to turn it into UTF-8. For any characters that fail, I just convert them to hex so I can display them or look them up in a table of my own.
This is not pretty, but it does allow me to make sense of messed-up data.
jsbueno's solution is good, but there is no need for the global variable last_position; see:
import codecs

def mixed_decoder(error: UnicodeError) -> (str, int):
    bs: bytes = error.object[error.start: error.end]
    return bs.decode("cp1252"), error.start + 1

codecs.register_error("mixed", mixed_decoder)

a = "maçã".encode("utf-8") + "maçã".encode("cp1252")
# a = b"ma\xc3\xa7\xc3\xa3ma\xe7\xe3"

s = a.decode("utf-8", "mixed")
# s = "maçãmaçã"
This is usually called Mojibake.
There's a nice Python library that might solve these issues for you called ftfy.
Example:
>>> from ftfy import fix_text
>>> fix_text("Ð¨ÐµÐ¿Ð¾Ñ (напоминалки)")
'Шепот (напоминалки)'
I just came across this today, so here is my problem and my own solution:
# Note: the input must contain literal backslash escapes, hence the raw r'' literal.
original_string = r'Notifica\xe7\xe3o de Emiss\xe3o de Nota Fiscal Eletr\xf4nica.'

def mixed_decoding(s):
    output = ''
    ii = 0
    for c in s:
        if ii <= len(s) - 1:
            if s[ii] == '\\' and s[ii+1] == 'x':
                # Decode the 4-character escape sequence, e.g. '\xe7' -> 'ç'.
                b = s[ii:ii+4].encode('ascii').decode('unicode-escape')
                output = output + b
                ii += 3
            else:
                output = output + s[ii]
            ii += 1
    print(output)
    return output

decoded_string = mixed_decoding(original_string)
Now it prints:
>>> Notificação de Emissão de Nota Fiscal Eletrônica.
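The same idea can be written in one line by letting the unicode-escape codec handle every \xNN sequence at once (a sketch, assuming the input is plain ASCII containing literal backslash escapes, as above):

s = r'Notifica\xe7\xe3o de Emiss\xe3o de Nota Fiscal Eletr\xf4nica.'
print(s.encode('ascii').decode('unicode-escape'))
# Notificação de Emissão de Nota Fiscal Eletrônica.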