Converting input hex 0A to \n escaped character - python

I am using a webpage to get data in hex to write serial Modbus using Python.
The issue is that 0A in the string gets converted to \n.
How do I stop this from happening?
rList = r'0106000600000AE9CC'
arr=str(rList)
arr = bytes.fromhex(rList)
print(arr)
Output:
b'\x01\x06\x00\x06\x00\x00\n\xe9\xcc'

The repr() representation of a bytes object uses ASCII code points wherever possible.
What this means is \x0A will be displayed as \n, because that's the ASCII code point for a newline.
More examples:
\x55 will be displayed as U, \x5A will be displayed as Z, \x0D will be displayed as \r, you get the idea.
However, the data under the hood is still the same.
Don't worry about how the output string is displayed by the Python console—it's presumably more important to process its content.
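For example (Python 3.5+, where bytes.hex() is available), you can check in the interpreter that the underlying bytes are untouched:

```python
# The repr shows \n, but the byte at that position is still 0x0A.
data = bytes.fromhex("0106000600000AE9CC")

assert data[6] == 0x0A  # the seventh byte is 10 (newline)
assert data == b"\x01\x06\x00\x06\x00\x00\x0a\xe9\xcc"

# bytes.hex() gives an unambiguous all-hex view for debugging.
print(data.hex())  # 0106000600000ae9cc
```

If you need the hex string back (for logging, say), data.hex() round-trips with bytes.fromhex().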

Try this:
def print_me(thing: bytes):
    print(''.join(
        f'\\x{byte:02x}'
        for byte in thing
    ))
print_me(bytes.fromhex("0106000600000AE9CC"))
Output:
\x01\x06\x00\x06\x00\x00\x0a\xe9\xcc

The b prefix before a string means you have a bytes object. When you print it, Python tries to show bytes as ASCII symbols wherever possible, for readability.
The byte b'\x0A' corresponds to decimal value 10, which is the newline character in ASCII. That is why you see \n printed.
Under the hood, your byte b'\x0A' has not been changed.
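In other words, a quick check:

```python
b = b"\x0A"
assert b == b"\n"   # same single byte, two spellings
assert b[0] == 10   # indexing a bytes object yields the int value
print(repr(b))      # b'\n'
```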

Related

Why does bytes.fromhex() produce the output shown?

I was wondering why this hex:
bytes.fromhex("34 FF FA A3 A5")
gives an output: b'4\xff\xfa\xa3\xa5'. Why did \x disappear? Shouldn't it be \x34?
That's how bytes reprs work; when a byte has an ordinal value corresponding to a printable ASCII character, it's represented as the ASCII character, rather than the \x escape code. You can create the bytes with either form (b'4' == b'\x34' is True; they produce the exact same bytes value), but it chooses the ASCII display to make byte strings that happen to be ASCII more readable (and make many reprs shorter).
Python tries to print a readable equivalent.
In your case, hex(ord("4")) gives '0x34': the Unicode code point of the character 4, expressed in hex, is 0x34.
Try this one in your console: print("\x09"). It prints a tab, because \x09 is the hex escape for \t.
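A quick interpreter check of the equivalence described above:

```python
# b'4' and b'\x34' are the same bytes value; only the repr spelling differs.
assert b"4" == b"\x34"
assert ord("4") == 0x34

data = bytes.fromhex("34 FF FA A3 A5")
assert data[0] == 0x34  # the first byte really is 0x34
print(repr(data))       # b'4\xff\xfa\xa3\xa5'
print(data.hex())       # 34fffaa3a5
```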

Encoding unicode with 'utf-8' shows byte-strings only for non-ascii

I'm running Python 2.7.10, trying to wrap my head around why the following behavior is seen. (I'm sure there is a reasonable explanation.)
So I define two unicode characters, with only the first one in the ascii-set, and the second one outside of it.
>>> a=u'\u0041'
>>> b=u'\u1234'
>>> print a
A
>>> print b
ሴ
Now I encode it to see what the corresponding bytes would be. But only the latter gives me the results I am expecting to see (bytes)
>>> a.encode('utf-8')
'A'
>>> b.encode('utf-8')
'\xe1\x88\xb4'
Perhaps the issue is in my expectation, and if so, one of you can explain where the flaw lies.
- My a,b are unicodes (hex values of the ordinals inside)
- When I print these, the interpreter prints the actual character corresponding to each unicode byte.
- When I encode, I assumed that it would be converted into a byte-string using the encoding scheme I provide (in this case utf-8). I expected to see a bytestring for a.encode, just like I did for b.encode.
What am I missing?
There is no flaw. You encoded to UTF-8, which uses the same bytes as the ASCII standard for the first 128 code points of the Unicode standard, and uses multiple bytes (between 2 and 4) for everything else.
You then echoed that value in your terminal, which uses the repr() function to build a debugging representation. That representation produces a valid Python expression for strings, one that is ASCII safe. Any byte in that value that is not printable as an ASCII character is shown as an escape sequence. Thus non-ASCII UTF-8 bytes are shown as \xhh hex escapes.
Most importantly, because A is a printable ASCII character, it is shown as is; any code editor or terminal will accept ASCII, and for most English text showing the actual text is much more useful.
Note that you used print for the unicode values stored in a and b, which means Python encodes those values to your terminal codec, producing the right output for your terminal. You did not echo the values in the interpreter. Had you done so, you'd also have seen debug output:
>>> a = u'\u0041'
>>> b = u'\u1234'
>>> a
u'A'
>>> b
u'\u1234'
In Python 3, the functionality of the repr() function (or rather, the object.__repr__ hook) has been updated to produce a unicode string with most printable codepoints left un-escaped. Use the new ascii() function to get the above behaviour.
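A short Python 3 sketch of that repr()/ascii() difference:

```python
s = "A\u1234"

# Python 3 repr() leaves printable code points unescaped...
print(repr(s))   # 'Aሴ'

# ...while ascii() reproduces the old ASCII-safe behaviour.
print(ascii(s))  # 'A\u1234'
```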

convert dpkt byte strings containing random characters

I am using the dpkt python module to parse a pcap file. I'm looking deep enough into the packets that some of the data is represented as byte streams. I can convert from regular byte strings easily enough, however some of the byte strings appear as:
\t\x01\x1c\x88
The first value should be 09; however, for some reason it's using an escaped tab character (the hex code of a tab is 09).
It's doing this for other characters in other streams as well.
Some more sample outputs:
\x10\x00#\x00\
\x05q\x00\x00\
\x069\x9c\n\x00
So my question is: can I convert this byte stream to one without these extra characters?
Alternatively, how would I go about converting something like '\t' to hex so that it returns '09'?
Update:
Turns out that I was creating the strings to be converted using a function that would return
\t011c88 in place of the first stream.
Leaving it alone and using stream.encode("hex") worked
The repr function by default escapes all non-printable characters, as you've seen.
To get a hex-only representation (in Python 2), use
string.encode("hex")
NOTE: The original byte stream is correct; only convert to hex for viewing purposes, not for integrity purposes. The escaping merely displays the data in an unfamiliar way.
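Note that str.encode("hex") is Python 2 only; in Python 3 the equivalents are bytes.hex() and binascii.hexlify():

```python
import binascii

stream = b"\t\x01\x1c\x88"      # repr shows byte 0x09 as \t

print(stream.hex())              # 09011c88
print(binascii.hexlify(stream))  # b'09011c88' in Python 3
```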

Getting a char from a string and its hex value

I currently get a string as a parameter in my method;
I would like to extract the char at index i and get its hex value.
Currently I'm doing:
temp = string[i]
binascii.hexlify(temp)
but I get an error:
TypeError: 'str' does not support the buffer interface
Any ideas, please?
You need to encode the string to bytes:
binascii.hexlify(temp.encode('ascii'))
You'll need to pick a suitable encoding, one that can represent your text properly; I am presuming here that your unicode characters fall in the 0-127 range.
If you encode to a different encoding, the result will be a hex representation of that encoding. UTF-8 will use between 1 and 4 bytes per character, for example.
Alternatively, you could use the ord() function and format the result to hex:
format(ord(temp), 'x')
and it'll work with any unicode character. It uses the Unicode code point for the hex representation, so the result is between 1 and 6 hex digits (the latter for \Uhhhhhhhh wide characters). Depending on your maximum character width, you may want to zero-pad the result to avoid ambiguity; say you need to encode up to code point \uffff, then you'll want 4 hex digits for every code point:
format(ord(temp), '04x')
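To illustrate the two approaches side by side (the variable names here are just for the example):

```python
import binascii

ch = "é"  # U+00E9, outside ASCII

# Code-point hex via ord(): encoding-independent, 1-6 hex digits.
print(format(ord(ch), "x"))    # e9
print(format(ord(ch), "04x"))  # 00e9 (zero-padded)

# Byte-level hex via hexlify(): depends on the chosen encoding.
print(binascii.hexlify(ch.encode("utf-8")))    # b'c3a9'
print(binascii.hexlify(ch.encode("latin-1")))  # b'e9'
```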

Unicode character that doesn't work alone, Python

OK, so I have another Python Unicode problem. In IDLE on Windows 7, the following code:
uni = u"\u4E0D\u65E0"
binary = uni.encode("utf-8")
print binary
prints two Chinese characters, 不无, the correct ones. However, if I replace the first line with
uni = u"\u65E0"
i.e. only the second character, it prints æ— instead. Although if I replace it with only the first character
u"\u4E0D"
it gives the correct output 不
Is this a bug, or what am I doing wrong?
COMPLETE CODE:
uni = u"\u4E0D\u65E0"
binary = uni.encode("utf-8")
print binary
uni = u"\u65E0"
binary = uni.encode("utf-8")
print binary
uni = u"\u4E0D"
binary = uni.encode("utf-8")
print binary
OUTPUT:
不无
æ— 
不
The unicode string u"\u4E0D\u65E0" consists of the two text characters 不 and 无.
When a unicode string is encoded, it is converted into a sequence of bytes (not binary). Depending on what encoding is used, there may not be a one-to-one mapping of text characters to bytes. The "utf8" encoding, for instance, can use from one to four bytes to represent a single character:
>>> u'\u65E0'.encode('utf8')
'\xe6\x97\xa0'
Now, before a sequence of bytes can be printed, python (or IDLE) has to try to decode it. But since it has no way to know what encoding was used, it is forced to guess. For some reason, it appears that IDLE may have wrongly guessed "cp1252" for one of the examples:
>>> text = u'\u65E0'.encode('utf8').decode('cp1252')
>>> text
u'\xe6\u2014\xa0'
>>> print text
æ— 
Note that there are three characters in text - the last one is a non-breaking space.
EDIT
Strictly speaking, IDLE wrongly guesses "cp1252" for all three examples. The second one only "succeeds" because each byte coincidentally maps to a valid text character ("cp1252" is an 8-bit, single-byte encoding). The other two examples contain the byte \x8d, which is not defined in "cp1252". For these cases, IDLE (eventually) falls back to "utf8", which gives the correct output.
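You can reproduce the mis-decode yourself (Python 3 syntax shown, where the u'' prefix is unnecessary):

```python
raw = "\u65E0".encode("utf8")     # b'\xe6\x97\xa0'

# Decoding with the wrong codec succeeds, but produces mojibake:
wrong = raw.decode("cp1252")
assert wrong == "\xe6\u2014\xa0"  # æ, em dash, no-break space
print(wrong)                      # æ—

# Decoding with the codec actually used restores the character:
assert raw.decode("utf8") == "\u65E0"
```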

Categories

Resources