I'm trying to convert a string with special characters from ASCII to Hex using python, but it doesn't seem that I'm getting the correct value, noting that it works just fine whenever I try to convert a string that has no special characters. So basically here is what I'm doing:
import binascii
s = "D`Cزف³›"
s_bytes = str.encode(s)
hex_value = str(binascii.hexlify(s_bytes),'ascii')
print (hex_value)
Output
446043d8b2d981c2b316e280ba
Where the output should be (using online converter https://www.rapidtables.com/convert/number/ascii-to-hex.html):
446043632641b3203a
str.encode(s) defaults to utf8 encoding, which doesn't give you the byte values needed to get the desired output. The values you want are simply Unicode ordinals as hexadecimal values, so get the ordinal, convert to hex and join them all together:
s = 'D`Cزف³›'
h = ''.join([f'{ord(c):x}' for c in s])
print(h)
446043632641b3203a
Just realize that Unicode ordinals can be 1-6 hexadecimal digits long, so there is no easy way to reverse the process since you have no spacing of the numbers.
Related
I have an info_address that I want to convert to delimited hex
info_address_original = b'002dd748'
What i want is
info_address_coded = b'\x00\x2d\xd7\x48'
I tried this solution
info_address_original = b'002dd748'
info_address_intermediary = info_address_original.decode("utf-8") # '002dd748'
info_address_coded = bytes.fromhex( info_address_intermediary ) # b'\x00-\xd7H'
and i get
info_address_coded = b'\x00-\xd7H'
What my debugger shows
How would one go about correctly turning a bytes string like that to delimited hex? It worked implicitly in Python 2 but it doesn't work the way i would want in Python 3.
This is only a representation of the bytes. '-' is the same as '\x2d'.
>>> b'\x00\x2d\xd7\x48' == b'\x00-\xd7H'
True
The default representation of a byte string is to display the character value for all ascii printable characters and the encoded \xhh representation where hh is the hexadecimal value of the byte.
That means that b'\x00\x2d\xd7\x48' and `b'\x00-\xd7H' are the exact same string containing 4 bytes.
I would like to convert Unicode characters encoded by 'utf-8' to hex string of which character starts with '%' because the API server only recognize this form.
For example, if I need to input unicode character '※' to the server, I should convert this character to string '%E2%80%BB'. (It does not matter whether a character is upper or lower.)
I found the way to convert unicode character to bytes and convert bytes to hex string in https://stackoverflow.com/a/35599781.
>>> print('※'.encode('utf-8'))
b'\xe2\x80\xbb'
>>> print('※'.encode('utf-8').hex())
e280bb
But I need the form of starting with '%' like '%E2%80%BB' or %e2%80%bb'
Are there any concise way to implement this? Or do I need to make the function to add '%' to each hex character?
There is two ways to do this:
The preferred solution. Use urllib.parse.urlencode and specify multiple parameters and encode all at once:
urllib.parse.urlencode({'parameter': '※', 'test': 'True'})
# parameter=%E2%80%BB&test=True
Or, you can manually convert this into chunks of two symbols, then join with the % symbol:
def encode_symbol(symbol):
symbol_hex = symbol.encode('utf-8').hex()
symbol_groups = [symbol_hex[i:i + 2].upper() for i in range(0, len(symbol_hex), 2)]
symbol_groups.insert(0, '')
return '%'.join(symbol_groups)
encode_symbol('※')
# %E2%80%BB
This question already has answers here:
What's the correct way to convert bytes to a hex string in Python 3?
(9 answers)
Closed 7 years ago.
I have an ANSI string Ď–ór˙rXüď\ő‡íQl7 and I need to convert it to hexadecimal like this:
06cf96f30a7258fcef5cf587ed51156c37 (converted with XVI32).
The problem is that Python cannot encode all characters correctly (some of them are incorrectly displayed even here, on Stack Overflow) so I have to deal with them with a byte string.
So the above string is in bytes this: b'\x06\xcf\x96\xf3\nr\x83\xffrX\xfc\xef\\\xf5\x87\xedQ\x15l7'
And that's what I need to convert to hexadecimal.
So far I tried binascii with no success, I've tried this:
h = ""
for i in b'\x06\xcf\x96\xf3\nr\x83\xffrX\xfc\xef\\\xf5\x87\xedQ\x15l7':
h += hex(i)
print(h)
It prints:
0x60xcf0x960xf30xa0x720x830xff0x720x580xfc0xef0x5c0xf50x870xed0x510x150x6c0x37
Okay. It looks like I'm getting somewhere... but what's up with the 0x thing?
When I remove 0x from the string like this:
h.replace("0x", "")
I get 6cf96f3a7283ff7258fcef5cf587ed51156c37 which looks like it's correct.
But sometimes the byte string has a 0 next to a x and it gets removed from the string resulting in a incorrect hexadecimal string. (the string above is missing the 0 at the beginning).
Any ideas?
If you're running python 3.5+, bytes type has an new bytes.hex() method that returns string representation.
>>> h = b'\x06\xcf\x96\xf3\nr\x83\xffrX\xfc\xef\\\xf5\x87\xedQ\x15l7'
b'\x06\xcf\x96\xf3\nr\x83\xffrX\xfc\xef\\\xf5\x87\xedQ\x15l7'
>>> h.hex()
'06cf96f30a7283ff7258fcef5cf587ed51156c37'
Otherwise you can use binascii.hexlify() to do the same thing
>>> import binascii
>>> binascii.hexlify(h).decode('utf8')
'06cf96f30a7283ff7258fcef5cf587ed51156c37'
As per the documentation, hex() converts “an integer number to a lowercase hexadecimal string prefixed with ‘0x’.” So when using hex() you always get a 0x prefix. You will always have to remove that if you want to concatenate multiple hex representations.
But sometimes the byte string has a 0 next to a x and it gets removed from the string resulting in a incorrect hexadecimal string. (the string above is missing the 0 at the beginning).
That does not make any sense. x is not a valid hexadecimal character, so in your solution it can only be generated by the hex() call. And that, as said above, will always create a 0x. So the sequence 0x can never appear in a different way in your resulting string, so replacing 0x by nothing should work just fine.
The actual problem in your solution is that hex() does not enforce a two-digit result, as simply shown by this example:
>>> hex(10)
'0xa'
>>> hex(2)
'0x2'
So in your case, since the string starts with b\x06 which represents the number 6, hex(6) only returns 0x6, so you only get a single digit here which is the real cause of your problem.
What you can do is use format strings to perform the conversion to hexadecimal. That way you can both leave out the prefix and enforce a length of two digits. You can then use str.join to combine it all into a single hexadecimal string:
>>> value = b'\x06\xcf\x96\xf3\nr\x83\xffrX\xfc\xef\\\xf5\x87\xedQ\x15l7'
>>> ''.join(['{:02x}'.format(x) for x in value])
'06cf96f30a7283ff7258fcef5cf587ed51156c37'
This solution does not only work with a bytes string but with really anything that can be formatted as a hexadecimal string (e.g. an integer list):
>>> value = [1, 2, 3, 4]
>>> ''.join(['{:02x}'.format(x) for x in value])
'01020304'
I am currently trying to find a way to convert any sort of text to a number, so that it can later be converted back to text.
So something like this:
text = "some string"
number = somefunction(text)
text = someotherfunction(number)
print(text) #output "some string"
If you're using Python 3, it's pretty easy. First, convert the str to bytes in a chosen encoding (utf-8 is usually appropriate), then use int.from_bytes to convert to an int:
number = int.from_bytes(mystring.encode('utf-8'), 'little')
Converting back is slightly trickier (and will lose trailing NUL bytes unless you've stored how long the resulting string should be somewhere else; if you switch to 'big' endianness, you lose leading NUL bytes instead of trailing):
recoveredstring = number.to_bytes((number.bit_length() + 7) // 8, 'little').decode('utf-8')
You can do something similar in Python 2, but it's less efficient/direct:
import binascii
number = int(binascii.hexlify(mystring.encode('utf-8')), 16)
hx = '%x' % number
hx = hx.zfill(len(hx) + (len(hx) & 1)) # Make even length hex nibbles
recoveredstring = binascii.unhexlify(hx).decode('utf-8')
That's equivalent to the 'big' endian approach in Python 3; reversing the intermediate bytes as you go in each direction would get the 'little' effect.
You can use the ASCII values to do this:
ASCII to int:
ord('a') # = 97
Back to a string:
str(unichr(97)) # = 'a'
From there you could iterate over the string one character at a time and store these in another string. Assuming you are using standard ASCII characters, you would need to zero pad the numbers (because some are two digits and some three) like so:
s = 'My string'
number_string = ''
for c in s:
number_string += str(ord(c)).zfill(3)
To decode this, you will read the new string three characters at a time and decode them into a new string.
This assumes a few things:
all characters can be represented by ASCII (you could use Unicode code points if not)
you are storing the numeric value as a string, not as an actual int type (not a big deal in Python—saves you from having to deal with maximum values for int on different systems)
you absolutely must have a numeric value, i.e. some kind of hexadecimal representation (which could be converted into an int) and cryptographic algorithms won't work
we're not talking about GB+ of text that needs to be converted in this manner
Although Python 3.x solved the problem that uppercase and lowercase for some locales (for example tr_TR.utf8) Python 2.x branch lacks this. Several workaround for this issuse like https://github.com/emre/unicode_tr/ but did not like this kind of a solution.
So I am implementing a new upper/lower/capitalize/title methods for monkey-patching unicode class with
string.maketrans method.
The problem with maketrans is the lenghts of two strings must have same lenght.
The nearest solution came to my mind is "How can I convert 1 Byte char to 2 bytes?"
Note: translate method does work only ascii encoding, when I pass u'İ' (1 byte length \u0130) as arguments to translate gives ascii encoding error.
from string import maketrans
import unicodedata
c1 = unicodedata.normalize('NFKD',u'i').encode('utf-8')
c2 = unicodedata.normalize('NFKD',u'İ').encode('utf-8')
c1,len(c1)
('\xc4\xb1', 2)
# c2,len(c2)
# ('I', 1)
'istanbul'.translate( maketrans(c1,c2))
ValueError: maketrans arguments must have same length
Unicode objects allow multicharacter translation via a dictionary instead of two byte strings mapped through maketrans.
#!python2
#coding:utf8
D = {ord(u'i'):u'İ'}
print u'istanbul'.translate(D)
Output:
İstanbul
If you start with an ASCII byte string and want the result in UTF-8, simply decode/encode around the translation:
#!python2
#coding:utf8
D = {ord(u'i'):u'İ'}
s = 'istanbul'.decode('ascii')
t = s.translate(D)
s = t.encode('utf8')
print repr(s)
Output:
'\xc4\xb0stanbul'
The following technique can do the job of maketrans. Note that the dictionary keys must be Unicode ordinals, but the value can be Unicode ordinals, Unicode strings or None. If None, the character is deleted when translated.
#!python2
#coding:utf8
def maketrans(a,b):
return dict(zip(map(ord,a),b))
D = maketrans(u'àáâãäå',u'ÀÁÂÃÄÅ')
print u'àbácâdãeäfåg'.translate(D)
Output:
ÀbÁcÂdÃeÄfÅg
Reference: str.translate