HexString to packed EBCDIC string - python

I need to convert the hex data '767f440128e1a00a' to a packed EBCDIC string. I want all the results in one string, but Python gives a Unicode error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 0: unexpected end of data
import codecs

s = '767f440128e1a00a'
output = []
DDF = [1]
distance = 0
for y in range(1, len(s[2:])):
    for x in DDF:
        if s[2:][distance:x*2+distance] != '':
            output.append(s[2:][distance:x*2+distance])
        else:
            continue
        distance += x*2
print(output)

final = []
result = ''
bytearrya = []
for x in output:
    result = str(bytearray.fromhex(x).decode())  # .decode() defaults to utf-8, which raises the error
    x = codecs.decode(x, "hex")
    final.append(x)

Here is code based on the answer to "Python byte representation of a hex string that is EBCDIC", which says: "According to this, you need to use 'cp500' for decoding":
Codec / Aliases / Languages
cp500 / EBCDIC-CP-BE, EBCDIC-CP-CH, IBM500 / Western Europe
my_string_in_hex = '767f440128e1a00a'
my_bytes = bytearray.fromhex(my_string_in_hex)
print(my_bytes)
my_string = my_bytes.decode('cp500')
print(my_string)
Output:

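The exact printable result depends on which EBCDIC code page the data actually came from. A quick way to compare candidates (cp037 and cp1140 here are assumptions, added only for illustration; cp500 is the one from the quoted answer):

data = bytearray.fromhex('767f440128e1a00a')
# Print the decoded string under a few EBCDIC code pages.
for codec in ('cp500', 'cp037', 'cp1140'):
    print(codec, repr(data.decode(codec)))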
Related

How do I decode this binary string in python?

So, I have this string:
01010011101100000110010101101100011011000110111101110100011010000110010101110010011001010110100001101111011101110111100101101111011101010110010001101111011010010110111001100111011010010110110101100110011010010110111001100101011000010111001001100101011110010110111101110101011001100110100101101110011001010101000000000000
and I want to decode it using Python, but I'm getting this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 280: invalid start byte
According to this website: https://www.binaryhexconverter.com/binary-to-ascii-text-converter
the output should be S�ellotherehowyoudoingimfineareyoufineP
Here's my code:
def decodeAscii(bin_string):
    binary_int = int(bin_string, 2)
    byte_number = binary_int.bit_length() + 7 // 8
    binary_array = binary_int.to_bytes(byte_number, "big")
    ascii_text = binary_array.decode()
    print(ascii_text)
How do I fix it?
Your bytes simply cannot be decoded as utf-8, just as the error message tells you.
utf-8 is the default encoding parameter of decode - and the best way to put in the correct encoding value is to know the encoding - otherwise you'll have to guess.
And guessing is probably what the website does, too, by trying the most common encodings, until one does not throw an exception:
def decodeAscii(bin_string):
    binary_int = int(bin_string, 2)
    # Parentheses are needed here so the byte count rounds up correctly.
    byte_number = (binary_int.bit_length() + 7) // 8
    binary_array = binary_int.to_bytes(byte_number, "big")
    ascii_text = "Bin string cannot be decoded"
    # 'latin-1' maps every byte value, so it always succeeds as a last resort.
    for enc in ['utf-8', 'ascii', 'latin-1']:
        try:
            ascii_text = binary_array.decode(encoding=enc)
            break
        except UnicodeDecodeError:
            pass
    print(ascii_text)

s = "01010011101100000110010101101100011011000110111101110100011010000110010101110010011001010110100001101111011101110111100101101111011101010110010001101111011010010110111001100111011010010110110101100110011010010110111001100101011000010111001001100101011110010110111101110101011001100110100101101110011001010101000000000000"
decodeAscii(s)
Output:
S°ellotherehowyoudoingimfineareyoufineP
But there's no guarantee that you find the "correct" encoding by guessing.
Your binary string is just not a valid ascii or utf-8 string. You can tell decode to ignore invalid sequences by saying
ascii_text = binary_array.decode(errors='ignore')
So it could be solved in one line.
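For example, on a short byte string containing the same kind of invalid 0xb0 byte (a made-up sample, not the full string above):

raw = b'S\xb0ello'                   # 0xb0 is not a valid UTF-8 start byte
print(raw.decode(errors='ignore'))   # Sello  - the invalid byte is dropped
print(raw.decode(errors='replace'))  # S�ello - the invalid byte becomes U+FFFD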
Try this:
def bin_to_text(bin_str):
    bin_to_str = "".join([chr(int(bin_str[i:i+8], 2)) for i in range(0, len(bin_str), 8)])
    return bin_to_str
bin_str = '01010011101100000110010101101100011011000110111101110100011010000110010101110010011001010110100001101111011101110111100101101111011101010110010001101111011010010110111001100111011010010110110101100110011010010110111001100101011000010111001001100101011110010110111101110101011001100110100101101110011001010101000000000000'
bin_to_str = bin_to_text(bin_str)
print(bin_to_str)
Output:
S°ellotherehowyoudoingimfineareyoufineP
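A side note on why this works: chr() maps the values 0-255 straight to the code points U+0000-U+00FF, so this byte-by-byte approach gives the same result as decoding the bytes as latin-1. A minimal sketch of the equivalence, using just the first 16 bits of the string:

bits = '0101001110110000'  # 'S' followed by the 0xb0 byte
data = int(bits, 2).to_bytes(len(bits) // 8, 'big')
print(data.decode('latin-1'))  # S°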

How should I parse this type of bytes?

I have the following type of bytes:
b = b'2787\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x05\x03\x01\x00\x00\x00\x00\x00\x96\x08\n\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x0047\x00>2!\tMV\xa7\x00\x00\x00\x00'
I must convert it to a string and obtain the 2787. How should I strip the \x00 values?
I tried decode("utf-8"), but it throws the following error message:
'utf-8' codec can't decode byte 0x96 in position 33: invalid start byte
Also rstrip('\x00') didn't work.
Which type of decode should I use?
I obtain a list of strings from here:
data, addr = socket_udp.recvfrom(struct.calcsize("B13s9s61s"))
info = struct.unpack("B13s9s61s", data)
And b is the last field, the 61-byte string.
The class that consumes the unpacked values:
class Udp_packet:
    type = 0x00
    id = ""
    random_num = ""
    data = ""

    def __init__(self, values_list, convert=False):
        self.type = values_list[0]
        self.id = values_list[1]
        self.random_num = values_list[2]
        self.data = values_list[3].split("\0")[0]
The code works properly using Python 2.7; I just moved to 3.7.5.
What you want from b is apparently the portion before the first NUL byte, b'\x00' (or simply b'\0'), so you can slice b at the index of the first NUL byte:
b = b[:b.find(b'\0')]
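Putting it together with a decode, here is a minimal sketch; the payload below is an abbreviated stand-in for the 61-byte field above, not the real data:

b = b'2787' + b'\x00' * 40 + b'\x96\x08\n' + b'\x00' * 14   # abbreviated stand-in
value = b[:b.find(b'\0')].decode('ascii')  # slice to the first NUL, then decode
print(value)  # 2787

If the field might contain no NUL at all, b.split(b'\0', 1)[0] is safer, since find() would return -1 and the slice would chop off the last byte. Note also that in Python 3 the fields unpacked by struct are bytes, so the split("\0") in Udp_packet.__init__ needs a bytes literal too: split(b"\0").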

Python issue with different versions on local machine/server for json-csv conversion [duplicate]

I have a file which is mostly UTF-8, but some Windows-1252 characters have also found their way in.
I created a table to map from the Windows-1252 (cp1252) characters to their Unicode counterparts, and would like to use it to fix the mis-encoded characters, e.g.
cp1252_to_unicode = {
    "\x85": u'\u2026',  # …
    "\x91": u'\u2018',  # ‘
    "\x92": u'\u2019',  # ’
    "\x93": u'\u201c',  # “
    "\x94": u'\u201d',  # ”
    "\x97": u'\u2014',  # —
}

for l in open('file.txt'):
    for c, u in cp1252_to_unicode.items():
        l = l.replace(c, u)
But attempting to do the replace this way results in a UnicodeDecodeError being raised, e.g.:
"\x85".replace("\x85", u'\u2026')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x85 in position 0: ordinal not in range(128)
Any ideas for how to deal with this?
If you try to decode this string as utf-8, as you already know, you will get a UnicodeDecodeError, as these spurious cp1252 characters are invalid utf-8.
However, Python codecs allow you to register a callback to handle encoding/decoding errors, with the codecs.register_error function. It gets the UnicodeDecodeError as a parameter, so you can write a handler that attempts to decode the data as "cp1252" and continues the decoding in utf-8 for the rest of the string.
In my utf-8 terminal, I can build a mixed incorrect string like this:
>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> print a
maçã ma��
>>> a.decode("utf-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 9-11: invalid data
I wrote the said callback function here, and found a catch: even if you increment the position from which to decode the string by 1, so that it would start on the next character, if the next character is also not utf-8 and out of range(128), the error is raised again at the first out-of-range character. That means the decoding "walks back" if consecutive non-ascii, non-utf-8 chars are found.
The workaround for this is to have a state variable in the error handler which detects this "walking back" and resumes decoding from the last call to it. In this short example, I implemented it as a global variable (it will have to be manually reset to -1 before each call to the decoder):
import codecs

last_position = -1

def mixed_decoder(unicode_error):
    global last_position
    string = unicode_error[1]
    position = unicode_error.start
    if position <= last_position:
        position = last_position + 1
    last_position = position
    new_char = string[position].decode("cp1252")
    # new_char = u"_"
    return new_char, position + 1

codecs.register_error("mixed", mixed_decoder)
And on the console:
>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> last_position = -1
>>> print a.decode("utf-8", "mixed")
maçã maçã
With thanks to jsbueno, a whack of other Google searches, and other pounding, I solved it this way.

# The following works very well, but it does not allow for any attempts to FIX the data.
xmlText = unicode(xmlText, errors='replace').replace(u"\uFFFD", "?")
This version allows for a limited opportunity to repair invalid characters. Unknown characters are replaced with a safe value.
import codecs

replacement = {
    '85': '...',  # u'\u2026' ellipsis
    '96': '-',    # u'\u2013' en-dash
    '97': '-',    # u'\u2014' em-dash
    '91': "'",    # u'\u2018' left single quote
    '92': "'",    # u'\u2019' right single quote
    '93': '"',    # u'\u201C' left double quote
    '94': '"',    # u'\u201D' right double quote
    '95': "*",    # u'\u2022' bullet
}

# This is more complex but allows for the data to be fixed.
def mixed_decoder(unicodeError):
    errStr = unicodeError[1]
    errLen = unicodeError.end - unicodeError.start
    nextPosition = unicodeError.start + errLen
    errHex = errStr[unicodeError.start:unicodeError.end].encode('hex')
    if errHex in replacement:
        return u'%s' % replacement[errHex], nextPosition
    return u'%s' % errHex, nextPosition  # Comment this line out to get a question mark
    return u'?', nextPosition

codecs.register_error("mixed", mixed_decoder)

xmlText = xmlText.decode("utf-8", "mixed")
Basically I attempt to turn it into utf-8. For any characters that fail, I just convert them to hex so I can display them or look them up in a table of my own.
This is not pretty, but it does allow me to make sense of messed-up data.
Good solution, that of @jsbueno, but there is no need for the global variable last_position; see:
import codecs

def mixed_decoder(error: UnicodeError) -> (str, int):
    bs: bytes = error.object[error.start: error.end]
    return bs.decode("cp1252"), error.start + 1

codecs.register_error("mixed", mixed_decoder)

a = "maçã".encode("utf-8") + "maçã".encode("cp1252")
# a = b"ma\xc3\xa7\xc3\xa3ma\xe7\xe3"

s = a.decode("utf-8", "mixed")
# s = "maçãmaçã"
This is usually called Mojibake.
There's a nice Python library that might solve these issues for you called ftfy.
Example:
>>> from ftfy import fix_text
>>> fix_text("Ð¨ÐµÐ¿Ð¾Ñ‚ (напоминалки)")
'Шепот (напоминалки)'
Just came into this today, so here is my problem and my own solution:
original_string = 'Notifica\xe7\xe3o de Emiss\xe3o de Nota Fiscal Eletr\xf4nica.'

def mixed_decoding(s):
    output = ''
    ii = 0
    for c in s:
        if ii <= len(s) - 1:
            if s[ii] == '\\' and s[ii+1] == 'x':
                b = s[ii:ii+4].encode('ascii').decode('unicode-escape')
                output = output + b
                ii += 3
            else:
                output = output + s[ii]
                ii += 1
    print(output)
    return output

decoded_string = mixed_decoding(original_string)
Now it prints:
>>> Notificação de Emissão de Nota Fiscal Eletrônica.
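If the escape sequences really are literal backslash characters in the text (which is what this loop assumes), the standard unicode-escape codec can unescape the whole string in one call. A minimal sketch, assuming pure-ASCII input with literal \xNN escapes:

raw = r'Notifica\xe7\xe3o de Emiss\xe3o de Nota Fiscal Eletr\xf4nica.'
print(raw.encode('ascii').decode('unicode-escape'))
# Notificação de Emissão de Nota Fiscal Eletrônica.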

Getting error in Python with characters from a Word document

I have this text, which is entered in a text box:
‘f’fdsfs’`124539763~!##$%^’’;’””::’
I am converting it to JSON, and then it comes out as:
"\\u2018f\\u2019fdsfs\\u2019`124539763~!##$%^\\u2019\\u2019;\\u2019\\u201d\\u201d::\\u2019e"
Now when I am writing the CSV file I get this error:
'ascii' codec can't encode character u'\u2018' in position 0: ordinal not in range(128)
csv.writer(data)
I tried data.encode('utf-8') and data.decode('unicode-escape'), but they didn't work.
The Python 2 csv module does not support Unicode; use https://github.com/jdunck/python-unicodecsv instead (although I'm not sure \u2018 is part of the utf-8 charset).

import json

x = "\\u2018f\\u2019fdsfs..."
j = json.loads('"' + x + '"')
print j.encode('cp1252')

Output:
‘f’fdsfs...
Note that it is being encoded as cp1252.
>>> import unicodecsv as csv #https://github.com/jdunck/python-unicodecsv
>>> x = "\\u2018f\\u2019fdsfs..."; j = json.loads('"' + x + '"');
>>> with open("some_file.csv","wb") as f:
... w = csv.writer(f,encoding="cp1252")
... w.writerow([j,"normal"])
...
>>>
Here is the csv file: https://www.dropbox.com/s/m4gta1o9vg8tfap/some_file.csv
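For what it's worth, on Python 3 the standard csv module writes str natively, so unicodecsv is only needed on Python 2. A minimal Python 3 sketch of the same write:

import csv
import json

x = "\\u2018f\\u2019fdsfs\\u2019`124539763~!##$%^\\u2019\\u2019;\\u2019\\u201d\\u201d::\\u2019e"
j = json.loads('"' + x + '"')

# newline='' is the documented setting for csv files; cp1252 matches the answer above.
with open("some_file.csv", "w", newline="", encoding="cp1252") as f:
    csv.writer(f).writerow([j, "normal"])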

Python incorrectly formatting Cyrillic

def inp(text):
    tmp = str()
    arr = ['.' for x in range(1, 40 - len(text))]
    tmp += text + ''.join(arr)
    print tmp

s = ['tester', 'om', 'sup', 'jope']
sr = ['тестер', 'ом', 'суп', 'жопа']

for i in s:
    inp(i)
for i in sr:
    inp(i)
Output:
tester.................................
om.....................................
sup....................................
jope...................................
тестер...........................
ом...................................
суп.................................
жопа...............................
Why does Python not handle Cyrillic properly? The ends of the lines are not aligned and look ragged, and using string formatting gives the same result. How can this be corrected? Thanks.
Read this:
http://docs.python.org/2/howto/unicode.html
Basically, what you have in the text parameter of the inp function is a string. In Python 2.7, strings are bytes by default. Cyrillic characters are not mapped 1-1 to bytes when encoded in, e.g., utf-8, but require more than one byte (usually 2 in utf-8), so when you do len(text) you don't get the number of characters but the number of bytes.
In order to get the number of characters, you need to know your encoding. Assuming it's utf-8, you can decode text to that encoding and it will print right:
#!/usr/bin/python
# coding=utf-8

def inp(text):
    tmp = str()
    utext = text.decode('utf-8')
    l = len(utext)
    arr = ['.' for x in range(1, 40 - l)]
    tmp += text + ''.join(arr)
    print tmp

s = ['tester', 'om', 'sup', 'jope']
sr = ['тестер', 'ом', 'суп', 'жопа']

for i in s:
    inp(i)
for i in sr:
    inp(i)
The important lines are these two:
utext = text.decode('utf-8')
l = len(utext)
where you first decode the text, which results in a unicode string. After that, you can use the built-in len to get the length in characters, which is what you want.
Hope this helps.
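As a design note, once the text is decoded to unicode, the same padding can be written more compactly with ljust, which pads to a total width in characters. A minimal Python 2 sketch, assuming utf-8 input, a utf-8 terminal, and the same 39-character width as above:

# -*- coding: utf-8 -*-
def inp(text):
    # Decode first so that ljust counts characters, not bytes.
    print text.decode('utf-8').ljust(39, u'.')

inp('tester')
inp('тестер')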
