I want to convert a unicode string to its hexadecimal representation.
For example, u'\u041a\u0418\u0421\u0410' should be converted to "\xD0\x9A\xD0\x98\xD0\xA1\xD0\x90". I tried the code below (Python 2.7):
unicode_username.encode("utf-8").encode("hex")
However, I get a string:
'd09ad098d0a1d090'
Any suggestions how I can get \xD0\x9A\xD0\x98\xD0\xA1\xD0\x90?
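string.encode('utf-8') does not change the string to hex notation; it returns a byte string holding the UTF-8 bytes. The \x.. escapes only appear when that byte string is displayed through repr().
If you print it on a UTF-8 terminal, you will see the original text.
If you want the escaped notation, you can get it with the repr() function: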
>>> print u'\u041a\u0418\u0421\u0410'.encode('utf-8')
КИСА
>>> print repr(u'\u041a\u0418\u0421\u0410'.encode('utf-8'))
'\xd0\x9a\xd0\x98\xd0\xa1\xd0\x90'
You can also try:
print "hex_signature : ",'\\X'.join(x.encode("hex") for x in signature)
Each byte of the variable signature is converted to hex, and join concatenates the results with the separator '\X' between them before everything is printed. Because join only places the separator between items, the first byte is printed without a leading \X.
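In Python 3, str.encode("hex") no longer exists, so the same idea needs a different spelling. The following is a minimal sketch under that assumption; the helper name to_hex_escapes is made up for illustration. It prefixes every byte, including the first, with \x.

```python
def to_hex_escapes(text, encoding="utf-8"):
    # Encode to bytes, then write each byte as a literal backslash-x
    # followed by two uppercase hex digits.
    return "".join("\\x{:02X}".format(byte) for byte in text.encode(encoding))


result = to_hex_escapes("\u041a\u0418\u0421\u0410")
print(result)  # \xD0\x9A\xD0\x98\xD0\xA1\xD0\x90
```

The result is ordinary text containing backslashes, not a byte string.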
How do I convert a string which contains the literal representation of a byte string, to a byte string?
This might seem strange, but a library I'm using raises a certain type of exception, and I need one of that exception's attributes. The attribute gives me the value I need, but as a byte string literal inside a string.
It is "value=b'\\xbbOFa\\x14\\xdb{\\xf5\\x1b~H\\xba\\x96\\xdaec'". I can get the value by splitting on the equals sign and then using eval, such as:
>>> eval("value=b'\\xbbOFa\\x14\\xdb{\\xf5\\x1b~H\\xba\\x96\\xdaec'".split("=")[1])
b'\xbbOFa\x14\xdb{\xf5\x1b~H\xba\x96\xdaec'
This works, but as we all know eval can be very, very bad. So, is there an alternative to using eval?
There is a unicode-escape codec that will convert bytes containing literal sequences like \x.. or \u.... into their equivalent characters in the string. The remainder of the input is decoded as latin1, which maps every byte value 0 to 255 to the code point with the same number.
So you convert the string to raw bytes using latin1, then convert back to a string using unicode-escape, and finally back to bytes using latin1 again:
>>> s = '\\xbbOFa\\x14\\xdb{\\xf5\\x1b~H\\xba\\x96\\xdaec'
>>> s.encode('latin1').decode('unicode-escape').encode('latin1')
b'\xbbOFa\x14\xdb{\xf5\x1b~H\xba\x96\xdaec'
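The round trip above can be wrapped in a small function. This is only a sketch; the name unescape_to_bytes is hypothetical, and it assumes the input contains no characters above U+00FF (otherwise the first encode raises UnicodeEncodeError).

```python
def unescape_to_bytes(escaped):
    # 1. latin1 turns each character (code point 0-255) into the same byte.
    # 2. unicode_escape interprets literal \xNN sequences.
    # 3. latin1 turns the resulting code points back into single bytes.
    return escaped.encode("latin1").decode("unicode_escape").encode("latin1")


s = "\\xbbOFa\\x14\\xdb{\\xf5\\x1b~H\\xba\\x96\\xdaec"
result = unescape_to_bytes(s)
print(result)  # b'\xbbOFa\x14\xdb{\xf5\x1b~H\xba\x96\xdaec'
```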
Getting rid of the clutter around the string is pretty easy using regex or the more manual parsing you showed. For example:
>>> import re
>>> x = "value=b'\\xbbOFa\\x14\\xdb{\\xf5\\x1b~H\\xba\\x96\\xdaec'"
>>> s = re.fullmatch('[^\'"]+b([\'"])(.*)\\1[^\'"]*', x).group(2)
>>> s
'\\xbbOFa\\x14\\xdb{\\xf5\\x1b~H\\xba\\x96\\xdaec'
OR
>>> s = x.split('=')[1].lstrip('b').strip("'")
>>> s
'\\xbbOFa\\x14\\xdb{\\xf5\\x1b~H\\xba\\x96\\xdaec'
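A different technique from the codec round trip, mentioned here as an alternative rather than as part of the answers above: the standard library's ast.literal_eval parses Python literals without executing arbitrary code, so it can evaluate the b'...' part directly. A minimal sketch, assuming the text after the first equals sign is always a valid bytes literal:

```python
import ast

x = "value=b'\\xbbOFa\\x14\\xdb{\\xf5\\x1b~H\\xba\\x96\\xdaec'"

# Split only on the first '=' so an '=' inside the literal would survive.
literal_text = x.split("=", 1)[1]

# literal_eval accepts only literals (bytes, str, numbers, tuples, ...),
# and raises ValueError or SyntaxError for anything else.
value = ast.literal_eval(literal_text)
print(value)  # b'\xbbOFa\x14\xdb{\xf5\x1b~H\xba\x96\xdaec'
```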
Say I have someString = "00". Basically, I want to convert someString to \x00.
I tried multiple ways to achieve my goal, but couldn't find a successful one.
I tried:
HexString = '\x'+someString
This method throws this error:
ValueError: invalid \x escape
The error goes away if I do HexString = r'\x'+someString, but then the value of HexString is \\x00, which is not what I want.
I also tried the hex() function, which had a few issues. It expects an int, and the big problem is that it returns 0x0.
Can anyone help me with converting a string ("11") to \x11?
If I am understanding you correctly, the actual goal is to take a string that contains a bunch of pairs of hex digits, and translate each pair of hex digits into the corresponding byte and have a result of type bytes.
In 3.x, this is built directly into the bytes type itself:
>>> bytes.fromhex('11abcdef')
b'\x11\xab\xcd\xef'
You can also instead use the standard library:
>>> import binascii
>>> binascii.unhexlify('11abcdef')
b'\x11\xab\xcd\xef'
You will not necessarily see a \x escape sequence for every byte value. This is normal and expected; it has to do with how the bytes object is represented as text for display purposes.
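A short Python 3 illustration of that display behaviour; the sample hex digits are made up. Bytes that happen to be printable ASCII are shown as characters, the rest as \x escapes, but the underlying values are unchanged.

```python
data = bytes.fromhex("41426300ff")

print(data)        # b'ABc\x00\xff'
print(list(data))  # [65, 66, 99, 0, 255]
print(data.hex())  # 41426300ff
```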
'\x'+someString
No approach of this general form can work, because it fundamentally misunderstands the problem. The output that you want is not a string, and a string literal like '\x00' does not have a backslash in it, nor a lowercase x - again, what you are seeing is how the string is represented as text, because not every character is printable.
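To make that point concrete, here is a small Python 3 check: the literal '\x00' is a single character, and the backslash and x exist only in its repr.

```python
s = "\x00"

print(len(s))     # 1
print("\\" in s)  # False
print("x" in s)   # False
print(repr(s))    # '\x00'

# By contrast, the raw-string attempt really does contain four characters.
attempt = r"\x" + "00"
print(len(attempt))  # 4
```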
int lets you set the base. For base 16
>>> someString = "00"
>>> int(someString, 16)
0
Of course, 0 is kinda boring because it works for all bases.
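If you wanted a byte in a bytes object, you could use struct with the unsigned format "B" (the signed "b" raises struct.error for values above 0x7f):
>>> import struct
>>> struct.pack("B", int(someString, 16))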
b'\x00'
If you want a string (and I'm switching to 0x41 here) you could
>>> chr(int("41", 16))
'A'
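You can get the code point by using int, then convert it to a character with chr. Then you can encode it to a bytes object without any import. Note that encode() defaults to UTF-8, so this yields a single byte only for values below 0x80; for 0x80 through 0xff, use .encode('latin-1') instead.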
>>> chr(int("11", 16)) # a character
'\x11'
>>> chr(int("11", 16)).encode() # bytes object
b'\x11'
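For a single byte, another option in Python 3 is to build the bytes object from a one-element list of integers, which avoids any text encoding. A minimal sketch; the helper name hex_pair_to_byte is made up for illustration.

```python
def hex_pair_to_byte(pair):
    # bytes() accepts an iterable of integers in range(256).
    return bytes([int(pair, 16)])


low = hex_pair_to_byte("11")
high = hex_pair_to_byte("ff")
print(low)   # b'\x11'
print(high)  # b'\xff'

# For comparison: going through chr() and the default UTF-8 encoding
# produces two bytes once the value is 0x80 or above.
via_utf8 = chr(int("ff", 16)).encode()
print(via_utf8)  # b'\xc3\xbf'
```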
I want to convert an ASCII string (well, just text to be precise) to base64.
I know how to do that; I just use the following code:
import base64
string = base64.b64encode(bytes("string", 'utf-8'))
print (string)
Which gives me
b'c3RyaW5n'
However the problem is, I'd like it to just print
c3RyaW5n
Is it possible to print the string without the "b" and the '' quotation marks?
Thanks!
The b prefix denotes that it is a binary string (a bytes object). A binary string is not a string: it is a sequence of bytes (values in the 0 to 255 range). It is simply displayed like a string to make it more compact.
In the case of base64, however, all output bytes are valid ASCII characters, so you can simply decode it:
print(string.decode('ascii'))
Here we decode each byte to its ASCII equivalent. Since base64 guarantees that every byte it produces is an ASCII character (the letters A-Z and a-z, the digits 0-9, '+', '/', and '=' for padding), we will always produce a valid string. Mind, however, that this is not guaranteed with an arbitrary binary string.
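A short sketch of both halves of that statement: decoding base64 output as ASCII works, while decoding arbitrary bytes may not. The two-byte sample raw is made up for illustration.

```python
import base64

encoded = base64.b64encode("string".encode("utf-8"))
text = encoded.decode("ascii")
print(text)  # c3RyaW5n

# Arbitrary bytes are not guaranteed to be valid ASCII.
raw = b"\xff\xfe"
try:
    raw.decode("ascii")
    decodable = True
except UnicodeDecodeError:
    decodable = False
print(decodable)  # False
```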
A simple .decode("utf-8") would do
import base64
string = base64.b64encode(bytes("string", 'utf-8'))
print (string.decode("utf-8"))
I received the following string. How can it be converted to hex?
value='(\xd2M\x00\x18\x00\x18\x80\x00\x80\x00\x00\x00\x00\x00\x00\xe0\xd2\xe0\xd2.\xd2\x00\x00\x00\x00\x00\x00\n\x00\x18\x00&\x00\x00\x00\x00\x00\x00\x00\x0f0\xfe/\x010\xff/\x000\xff/\x000\xff/\xff/\xff/\xff/\xff/\x000\xff/\xff/\xff/\x000\x000\xff/\x000\x000\x000\xff/\xff/\x000\x000\xff/\x000\xad\xff\x0c\x00\xdd\xff\xc2\xff\xd3\xff\xde\xff\xe9\xff\xca\xff\xd8\xff\xe6\xff\xb5\xff\xb2\xff\xe6\xff\x92\xff\xd0\xff\xa0\xff\xbd\xff\xb4\xff\x82\xff\x90\xfff\xff\xe1\xff\x9f\xff\x94\xff\xd4\xff\xa4\xff\xbb\xff\xe8\xff\x00\x00\x02\x00\xff\x7f\xff\x7f\x97\xff\xd0\xff\xb7\xff~\xffG\xff\xa1\xff\xa1\xff\xcd\xab\x00\x00A\n\x00\x00'
That's not a hex string. You are confusing the Python repr() output for a bytestring, which aims to make debugging easier, with the contents.
Each \xhh is a standard Python string literal escape sequence, and displaying the string like this makes it trivial to copy and paste into another Python session to reproduce the exact same value.
You don't need to hex decode this at all.
An actual hex string consists only of the digits 0 through to 9, and the letters a through to f (upper or lowercase). Your value, converted to hex, looks like this:
>>> value='(\xd2M\x00\x18\x00\x18\x80\x00\x80\x00\x00\x00\x00\x00\x00\xe0\xd2\xe0\xd2.\xd2\x00\x00\x00\x00\x00\x00\n\x00\x18\x00&\x00\x00\x00\x00\x00\x00\x00\x0f0\xfe/\x010\xff/\x000\xff/\x000\xff/\xff/\xff/\xff/\xff/\x000\xff/\xff/\xff/\x000\x000\xff/\x000\x000\x000\xff/\xff/\x000\x000\xff/\x000\xad\xff\x0c\x00\xdd\xff\xc2\xff\xd3\xff\xde\xff\xe9\xff\xca\xff\xd8\xff\xe6\xff\xb5\xff\xb2\xff\xe6\xff\x92\xff\xd0\xff\xa0\xff\xbd\xff\xb4\xff\x82\xff\x90\xfff\xff\xe1\xff\x9f\xff\x94\xff\xd4\xff\xa4\xff\xbb\xff\xe8\xff\x00\x00\x02\x00\xff\x7f\xff\x7f\x97\xff\xd0\xff\xb7\xff~\xffG\xff\xa1\xff\xa1\xff\xcd\xab\x00\x00A\n\x00\x00'
>>> import binascii
>>> binascii.hexlify(value)
'28d24d00180018800080000000000000e0d2e0d22ed20000000000000a00180026000000000000000f30fe2f0130ff2f0030ff2f0030ff2fff2fff2fff2fff2f0030ff2fff2fff2f00300030ff2f003000300030ff2fff2f00300030ff2f0030adff0c00ddffc2ffd3ffdeffe9ffcaffd8ffe6ffb5ffb2ffe6ff92ffd0ffa0ffbdffb4ff82ff90ff66ffe1ff9fff94ffd4ffa4ffbbffe8ff00000200ff7fff7f97ffd0ffb7ff7eff47ffa1ffa1ffcdab0000410a0000'
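The session above is Python 2. In Python 3 the value would be a bytes literal, and hexlify returns bytes rather than text. A minimal sketch using only the first five bytes of the value for brevity:

```python
import binascii

value = b"(\xd2M\x00\x18"  # first five bytes of the value in the question

hex_text = binascii.hexlify(value).decode("ascii")
print(hex_text)     # 28d24d0018
print(value.hex())  # 28d24d0018, using the bytes method directly

# Converting back recovers the original bytes.
round_trip = bytes.fromhex(hex_text)
print(round_trip == value)  # True
```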
When I try to get the content of a tag using "unicode(head.contents[3])", I get output similar to this: "Christensen Sk\xf6ld". I want the escape sequence to be returned as a string. How can I do this in Python?
Assuming Python sees the name as a normal string, you'll first have to decode it to unicode:
>>> name
'Christensen Sk\xf6ld'
>>> unicode(name, 'latin-1')
u'Christensen Sk\xf6ld'
Another way of achieving this:
>>> name.decode('latin-1')
u'Christensen Sk\xf6ld'
Note the "u" in front of the string, signalling it is unicode. If you print this, the accented letter is shown properly:
>>> print name.decode('latin-1')
Christensen Sköld
BTW: when necessary, you can use the "encode" method to turn the unicode into e.g. a UTF-8 string:
>>> name.decode('latin-1').encode('utf-8')
'Christensen Sk\xc3\xb6ld'
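The examples above are Python 2. A rough Python 3 equivalent, assuming the name arrives as latin-1 encoded bytes, looks like this:

```python
raw = b"Christensen Sk\xf6ld"

# bytes -> str: interpret the bytes as latin-1.
name = raw.decode("latin-1")
print(name)  # Christensen Sköld

# str -> bytes: re-encode as UTF-8, where the accented letter takes two bytes.
utf8_bytes = name.encode("utf-8")
print(utf8_bytes)  # b'Christensen Sk\xc3\xb6ld'
```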
I suspect that it's actually working correctly. The interactive interpreter echoes values using repr(), which escapes non-ASCII characters so the output is safe on terminals that do not support unicode. If you actually print the string, though, it should work. See the following example:
>>> u'\xcfa'
u'\xcfa'
>>> print u'\xcfa'
Ïa
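In Python 3, repr() no longer escapes printable non-ASCII characters, but the built-in ascii() gives the same escaped view that the Python 2 session shows. A small sketch of the distinction between the string's contents and its escaped display:

```python
s = "\xcfa"

print(len(s))    # 2: the string holds two characters, not six
print(ascii(s))  # '\xcfa'
print(s == "\u00cfa")  # True: \xcf and \u00cf name the same character
```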
Given a byte string with Unicode escapes b"\N{SNOWMAN}", b"\N{SNOWMAN}".decode('unicode-escape') will produce the expected Unicode string u'\u2603'.
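A Python 3 sketch of the same decode. It uses a raw bytes literal so the backslash is kept as-is, and the unicodedata lookup is added only to confirm the character's name.

```python
import unicodedata

raw = rb"\N{SNOWMAN}"  # eleven literal bytes, starting with a backslash

decoded = raw.decode("unicode_escape")
print(len(raw))      # 11
print(len(decoded))  # 1
print(unicodedata.name(decoded))  # SNOWMAN
```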