How do I decode this binary string in python? - python

So, I have this string 01010011101100000110010101101100011011000110111101110100011010000110010101110010011001010110100001101111011101110111100101101111011101010110010001101111011010010110111001100111011010010110110101100110011010010110111001100101011000010111001001100101011110010110111101110101011001100110100101101110011001010101000000000000
and I want to decode it using Python, but I'm getting this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 280: invalid start byte
According to this website: https://www.binaryhexconverter.com/binary-to-ascii-text-converter
The output should be S�ellotherehowyoudoingimfineareyoufineP
Here's my code:
def decodeAscii(bin_string):
    binary_int = int(bin_string, 2)
    byte_number = binary_int.bit_length() + 7 // 8
    binary_array = binary_int.to_bytes(byte_number, "big")
    ascii_text = binary_array.decode()
    print(ascii_text)
How do I fix it?

Your bytes simply cannot be decoded as utf-8, just as the error message tells you.
utf-8 is the default value of decode's encoding parameter. The best way to pass the correct encoding is to know it; otherwise you'll have to guess.
Guessing is probably what the website does, too, by trying the most common encodings until one does not throw an exception:
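def decodeAscii(bin_string):
    binary_int = int(bin_string, 2)
    # Parentheses matter: without them, 7 // 8 is evaluated first and yields 0.
    byte_number = (binary_int.bit_length() + 7) // 8
    binary_array = binary_int.to_bytes(byte_number, "big")
    ascii_text = "Bin string cannot be decoded"
    # Try each candidate in turn; 'ansi' is only available on Windows.
    for enc in ['utf-8', 'ascii', 'ansi']:
        try:
            ascii_text = binary_array.decode(encoding=enc)
            break
        except (UnicodeDecodeError, LookupError):
            pass
    print(ascii_text)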
s = "01010011101100000110010101101100011011000110111101110100011010000110010101110010011001010110100001101111011101110111100101101111011101010110010001101111011010010110111001100111011010010110110101100110011010010110111001100101011000010111001001100101011110010110111101110101011001100110100101101110011001010101000000000000"
decodeAscii(s)
Output:
S°ellotherehowyoudoingimfineareyoufineP
But there's no guarantee that you find the "correct" encoding by guessing.

Your binary string is just not a valid ASCII or UTF-8 string. You can tell decode to ignore invalid sequences by passing errors='ignore':
ascii_text = binary_array.decode(errors='ignore')
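As a rough illustration of what the errors argument does, here is a minimal sketch on the first few bytes of the string from the question (the variable names are made up for this example):

```python
# First six bytes of the question's data; 0xb0 is not valid UTF-8 here.
data = b"S\xb0ello"

ignored = data.decode("utf-8", errors="ignore")    # drops the invalid byte
replaced = data.decode("utf-8", errors="replace")  # substitutes U+FFFD

print(ignored)
print(replaced)
```

With 'ignore' the bad byte silently disappears, so 'replace' may be preferable when you want to see where data was lost.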

It can be solved in one line. Try this:
def bin_to_text(bin_str):
    bin_to_str = "".join([chr(int(bin_str[i:i+8],2)) for i in range(0,len(bin_str),8)])
    return bin_to_str
bin_str = '01010011101100000110010101101100011011000110111101110100011010000110010101110010011001010110100001101111011101110111100101101111011101010110010001101111011010010110111001100111011010010110110101100110011010010110111001100101011000010111001001100101011110010110111101110101011001100110100101101110011001010101000000000000'
bin_to_str = bin_to_text(bin_str)
print(bin_to_str)
Output:
S°ellotherehowyoudoingimfineareyoufineP
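Since chr() maps each byte value 0-255 to the Unicode code point with the same number, the join above should give the same result as decoding the bytes as latin-1. A minimal sketch, assuming the bit string's length is a multiple of 8 (the name bin_to_text_latin1 is made up for this example):

```python
def bin_to_text_latin1(bin_str):
    # One byte per 8 bits; passing the length explicitly keeps leading zero bytes.
    n_bytes = len(bin_str) // 8
    raw = int(bin_str, 2).to_bytes(n_bytes, "big")
    # latin-1 maps every byte 0x00-0xFF to the same code point, so it never fails.
    return raw.decode("latin-1")

sample = "0101001110110000" + "01100101"  # bytes 0x53, 0xb0, 0x65
print(bin_to_text_latin1(sample))
```

This only says the bytes can be shown as latin-1 text; it does not tell you that latin-1 is the encoding the data was written in.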

Related

How I should parse this type of bytes?

I have the following type of bytes:
b = b'2787\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x05\x03\x01\x00\x00\x00\x00\x00\x96\x08\n\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x0047\x00>2!\tMV\xa7\x00\x00\x00\x00'
I must convert it to a string and obtain the 2787. How should I strip the \x00 values?
I just tried decode("utf-8"), but it throws the following error message:
'utf-8' codec can't decode byte 0x96 in position 33: invalid start byte
rstrip('\x00') didn't work either.
Which decoding should I use?
I obtain a list of strings from here:
data, addr = socket_udp.recvfrom(struct.calcsize("B13s9s61s"))
info = struct.unpack("B13s9s61s", data)
And b is the last field, the 61-byte string.
The content of the string:
class Udp_packet:
    type = 0x00
    id = ""
    random_num = ""
    data = ""
    def __init__(self, values_list, convert=False):
        self.type = values_list[0]
        self.id = values_list[1]
        self.random_num = values_list[2]
        self.data = values_list[3].split("\0")[0]
The code works properly using Python 2.7; I just moved to 3.7.5.
What you want from b is apparently the portion before the first NUL byte, b'\x00', or simply b'\0', so you can slice b up to the index of the first NUL byte:
b = b[:b.find(b'\0')]
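One caveat with find: it returns -1 when there is no NUL byte, and the slice would then drop the last byte. A hedged alternative is bytes.partition, sketched below on a shortened version of the data (the helper name first_field is made up, and ASCII is an assumption about the leading field):

```python
def first_field(raw: bytes) -> str:
    # partition returns everything before the first NUL, or the whole
    # input when no NUL is present.
    head, _sep, _rest = raw.partition(b"\0")
    return head.decode("ascii")

print(first_field(b"2787\x00\x00\x05\x03\x96\x08"))
print(first_field(b"2787"))
```

Only the part before the first NUL is decoded, so the undecodable 0x96 byte later in the buffer never reaches the codec.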

HexString to packed EBCDIC string

I need to convert the hex data '767f440128e1a00a' to a packed EBCDIC string. I want all the results in one string, but Python raises a Unicode error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 0: unexpected end of data
s='767f440128e1a00a'
output = []
DDF = [1]
distance = 0
for y in range (1,len(s[2:])):
    for x in DDF:
        if s[2:][distance:x*2+distance]!='':
            output.append(s[2:][distance:x*2+distance])
        else:
            continue
        distance += x*2
print(output)
final=[]
result=''
bytearrya=[]
for x in output:
    result=(str(bytearray.fromhex(x).decode()))
    x = codecs.decode(x, "hex")
    final.append(x)
Here is code based on the question "Python byte representation of a hex string that is EBCDIC", where an answer notes: "According to this, you need to use 'cp500' for decoding".
Codec / Aliases / Languages
cp500 / EBCDIC-CP-BE, EBCDIC-CP-CH, IBM500 / Western Europe
my_string_in_hex = '767f440128e1a00a'
my_bytes = bytearray.fromhex(my_string_in_hex)
print(my_bytes)
my_string = my_bytes.decode('cp500')
print(my_string)
Output:

Python - decoding bytes in struct

I am building a parser, and I'm kind of new to this.
I have a problem decoding specific bytes: they always return the same int (and they shouldn't), so I must be doing it wrong.
byte = ser.read(1)
byte += ser.read(ser.inWaiting())
a = 0
for i in byte:
    if i == 0x04:
        value = struct.unpack("<h", bytes([i, a]))[0]
        print (value)
I receive bytes like this:
b'\xaa\x04\x80\x02\xff\xfb\x83\xaa\xaa\x04\x80\
And I need to decode packet 0x04. I am using Python 3.6
Try something like:
value = int.from_bytes(byte, byteorder='little')
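For a two-byte value, int.from_bytes with signed=True and struct.unpack with "<h" should agree. A small sketch, assuming purely for illustration that the two bytes following 0x04 form a little-endian signed 16-bit payload (the actual packet layout is not given in the question):

```python
import struct

payload = b"\x80\x02"  # hypothetical payload: the two bytes after 0x04

via_int = int.from_bytes(payload, byteorder="little", signed=True)
via_struct = struct.unpack("<h", payload)[0]

print(via_int, via_struct)
```

Either way, the bytes passed in have to be the ones that actually carry the value; feeding the marker byte itself together with a constant gives the same number every time.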

Convert 'bytes' object to string

I tried to find a solution, but I'm still stuck.
I want to use PyVISA to control a function generator.
I have a waveform which is a list of values between 0 and 16382
Then I have to prepare it in a way that each waveform point occupies 2 bytes.
A value is represented in big-endian, MSB-first format, and is a straight binary. So I do binwaveform = pack('>'+'h'*len(waveform), *waveform)
And then when I try to write it to the instrument with AFG.write('trace ememory, '+ header + binwaveform) I get an error:
File ".\afg3000.py", line 97, in <module>
AFG.write('trace ememory, '+ header + binwaveform)
TypeError: Can't convert 'bytes' object to str implicitly
I tried to solve it with AFG.write('trace ememory, '+ header + binwaveform.decode()), but the default codec cannot decode some of the values: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 52787: invalid start byte
Could you please help with it?
binwaveform is a packed byte array of integers. E.g.:
struct.pack('<h', 4545)
b'\xc1\x11'
You can't print it as text because it makes no sense to your terminal. In the above example, 0xC1 is invalid ASCII and UTF-8.
When you append a byte string to a regular str ('trace ememory, '+ header + binwaveform), Python wants to convert it to readable text but doesn't know how.
Decoding it implies that it's text, and it's not.
The best thing to do is print the hex representation of it:
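import codecs
binwaveform_hex = codecs.encode(binwaveform, 'hex')
# Hex-encoding returns bytes; decode them as ASCII, because str() on bytes
# would produce "b'...'" in Python 3.
binwaveform_hex_str = binwaveform_hex.decode('ascii')
AFG.write('trace ememory, '+ header + binwaveform_hex_str)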
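If the instrument expects the raw two-byte values rather than hex text, another option is to keep everything as bytes and encode the text parts instead. The sketch below only builds the message; the three-point waveform and the "#16" header are placeholders made up for this example, and sending the result (for instance with PyVISA's write_raw, if your setup supports it) is left out because it depends on the instrument:

```python
from struct import pack

waveform = [0, 8191, 16382]  # made-up sample points
header = "#16"               # placeholder header, not taken from the question

# Big-endian 16-bit values, as in the question.
binwaveform = pack(">" + "h" * len(waveform), *waveform)

# Encode the str parts to bytes, then concatenate bytes with bytes.
message = "trace ememory, ".encode("ascii") + header.encode("ascii") + binwaveform
print(message)
```

The point is only that bytes concatenate with bytes: nothing here tries to interpret the waveform data as text.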

Python issue with different versions on local machine/server for json-csv conversion [duplicate]

I have a file which is mostly UTF-8, but some Windows-1252 characters have also found their way in.
I created a table to map from the Windows-1252 (cp1252) characters to their Unicode counterparts, and would like to use it to fix the mis-encoded characters, e.g.
cp1252_to_unicode = {
    "\x85": u'\u2026', # …
    "\x91": u'\u2018', # ‘
    "\x92": u'\u2019', # ’
    "\x93": u'\u201c', # “
    "\x94": u'\u201d', # ”
    "\x97": u'\u2014' # —
}
for l in open('file.txt'):
    for c, u in cp1252_to_unicode.items():
        l = l.replace(c, u)
But attempting to do the replace this way results in a UnicodeDecodeError being raised, e.g.:
"\x85".replace("\x85", u'\u2026')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x85 in position 0: ordinal not in range(128)
Any ideas for how to deal with this?
If you try to decode this string as utf-8, as you already know, you will get a UnicodeDecodeError, as these spurious cp1252 characters are invalid utf-8.
However, Python codecs allow you to register a callback to handle encoding/decoding errors with the codecs.register_error function. It gets the UnicodeDecodeError as a parameter, so you can write a handler that attempts to decode the data as "cp1252" and continues the decoding in utf-8 for the rest of the string.
In my utf-8 terminal, I can build a mixed incorrect string like this:
>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> print a
maçã ma��
>>> a.decode("utf-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 9-11: invalid data
I wrote the callback function and found a catch: even if you increment the position from which to decode the string by 1, so that decoding would resume at the next character, if that next character is also not utf-8 and outside range(128), the error is raised at the first out-of-range(128) character. In other words, the decoding "walks back" when consecutive non-ascii, non-utf-8 characters are found.
The workaround is to keep a state variable in the error handler that detects this "walking back" and resumes decoding from the position of the last call to it. In this short example I implemented it as a global variable (it has to be reset manually to -1 before each call to the decoder):
import codecs
last_position = -1
def mixed_decoder(unicode_error):
    global last_position
    string = unicode_error[1]
    position = unicode_error.start
    if position <= last_position:
        position = last_position + 1
    last_position = position
    new_char = string[position].decode("cp1252")
    #new_char = u"_"
    return new_char, position + 1
codecs.register_error("mixed", mixed_decoder)
And on the console:
>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> last_position = -1
>>> print a.decode("utf-8", "mixed")
maçã maçã
With thanks to jsbueno and a whack of other Google searches and other pounding I solved it this way.
#The following works very well but it does not allow for any attempts to FIX the data.
xmlText = unicode(xmlText, errors='replace').replace(u"\uFFFD", "?")
This version allows for a limited opportunity to repair invalid characters. Unknown characters are replaced with a safe value.
import codecs
replacement = {
    '85' : '...', # u'\u2026' ... character.
    '96' : '-', # u'\u2013' en-dash
    '97' : '-', # u'\u2014' em-dash
    '91' : "'", # u'\u2018' left single quote
    '92' : "'", # u'\u2019' right single quote
    '93' : '"', # u'\u201C' left double quote
    '94' : '"', # u'\u201D' right double quote
    '95' : "*" # u'\u2022' bullet
}
#This is more complex but allows for the data to be fixed.
def mixed_decoder(unicodeError):
    errStr = unicodeError[1]
    errLen = unicodeError.end - unicodeError.start
    nextPosition = unicodeError.start + errLen
    errHex = errStr[unicodeError.start:unicodeError.end].encode('hex')
    if errHex in replacement:
        return u'%s' % replacement[errHex], nextPosition
    return u'%s' % errHex, nextPosition # Comment this line out to get a question mark
    return u'?', nextPosition
codecs.register_error("mixed", mixed_decoder)
xmlText = xmlText.decode("utf-8", "mixed")
Basically I attempt to turn it into utf8. For any characters that fail I just convert it to HEX so I can display or look it up in a table of my own.
This is not pretty but it does allow me to make sense of messed up data
Good solution from @jsbueno, but there is no need for the global variable last_position. See:
import codecs
def mixed_decoder(error: UnicodeError) -> (str, int):
    bs: bytes = error.object[error.start: error.end]
    # Resume after the whole invalid span so that no byte is decoded twice.
    return bs.decode("cp1252"), error.end
codecs.register_error("mixed", mixed_decoder)
a = "maçã".encode("utf-8") + "maçã".encode("cp1252")
# a = b"ma\xc3\xa7\xc3\xa3ma\xe7\xe3"
s = a.decode("utf-8", "mixed")
# s = "maçãmaçã"
This is usually called Mojibake.
There's a nice Python library that might solve these issues for you called ftfy.
Example:
>>> from ftfy import fix_text
>>> fix_text("Ð¨ÐµÐ¿Ð¾Ñ (напоминалки)")
'Шепот (напоминалки)'
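To see how such text comes about, here is a stdlib-only sketch that produces mojibake by decoding UTF-8 bytes with the wrong codec and then undoes it. It assumes the wrong codec was latin-1; real-world cases are often messier, which is where a library like ftfy helps:

```python
original = "maçã"

# Decode UTF-8 bytes with the wrong codec: each non-ASCII character
# turns into two unrelated characters.
garbled = original.encode("utf-8").decode("latin-1")

# Reverse the mistake: back to the original bytes, then decode correctly.
repaired = garbled.encode("latin-1").decode("utf-8")

print(garbled)
print(repaired)
```

The reversal only works when you know which wrong codec was applied and no bytes were lost along the way.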
Just ran into this today, so here is my problem and my own solution:
# Raw string: the data contains literal backslash escapes such as \xe7.
original_string = r'Notifica\xe7\xe3o de Emiss\xe3o de Nota Fiscal Eletr\xf4nica.'
def mixed_decoding(s):
    output = ''
    ii = 0
    for c in s:
        if ii <= len(s)-1:
            if s[ii] == '\\' and s[ii+1] == 'x':
                # Turn the 4-character escape (e.g. \xe7) into the character it names.
                b = s[ii:ii+4].encode('ascii').decode('unicode-escape')
                output = output+b
                ii += 3
            else:
                output = output+s[ii]
            ii += 1
    print(output)
    return output
decoded_string = mixed_decoding(original_string)
Now it prints:
>>> Notificação de Emissão de Nota Fiscal Eletrônica.
