I currently get a string as a parameter in my method, and I would like to extract the character at index i and get its hex value.
Currently I'm doing:
temp = string[i]
binascii.hexlify(temp)
but I get an error:
TypeError: 'str' does not support the buffer interface
Any ideas, please?
You need to encode the string to bytes:
binascii.hexlify(temp.encode('ascii'))
You'll need to pick a suitable encoding, one that can represent your text properly; I am presuming that your unicode characters fall in the 0-127 range here.
If you encode to a different encoding, the result will be a hex representation of that encoding. UTF-8, for example, uses between 1 and 4 bytes per character.
Alternatively, you could use the ord() function and format the result to hex:
format(ord(temp), 'x')
and it'll work with any Unicode character. It uses the Unicode code point for the hex representation, so you'll get between 1 and 6 hex digits (the longest for characters outside the Basic Multilingual Plane, written with \UXXXXXXXX escapes). Depending on your maximum character width, you may want to pad the result to keep the code points unambiguous; say you need to handle code points up to \uffff, then you'll need 2 bytes, or 4 hex characters, for every code point:
format(ord(temp), '04x')
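Putting both options together, here's a quick sketch; the string and index are made-up examples, and I use UTF-8 rather than ASCII here since the example character falls outside the 0-127 range:
import binascii

string = 'héllo'   # example input (assumption)
i = 1              # index of the character to inspect
temp = string[i]

print(binascii.hexlify(temp.encode('utf-8')))  # b'c3a9', the UTF-8 bytes of 'é'
print(format(ord(temp), '04x'))                # 00e9, the zero-padded Unicode code point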
Related
I was wondering why this hex:
bytes.fromhex("34 FF FA A3 A5")
gives the output b'4\xff\xfa\xa3\xa5'. Why did \x disappear? Shouldn't it be \x34?
That's how bytes reprs work; when a byte has an ordinal value corresponding to a printable ASCII character, it's represented as the ASCII character, rather than the \x escape code. You can create the bytes with either form (b'4' == b'\x34' is True; they produce the exact same bytes value), but it chooses the ASCII display to make byte strings that happen to be ASCII more readable (and make many reprs shorter).
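You can check this in the interpreter:
>>> bytes.fromhex("34 FF FA A3 A5")
b'4\xff\xfa\xa3\xa5'
>>> b'4' == b'\x34'
True
>>> list(bytes.fromhex("34 FF FA A3 A5"))   # the underlying integer values are the same either way
[52, 255, 250, 163, 165]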
Python tries to print a good-looking equivalent.
In your case, hex(ord("4")) is '0x34', which means the Unicode code point of the character "4" in hex is 0x34, so b'4' and b'\x34' are the same byte.
Try this one in your console: print("\x09"). You'll see a tab, because \x09 is the hex escape for \t.
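A couple of quick checks in the interpreter illustrate both points:
>>> hex(ord("4"))
'0x34'
>>> "\x09" == "\t"
True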
I'm trying to convert a binary I have in Python (a gzipped protocol buffer object) to a hexadecimal string in a string-escape fashion (e.g. \xFA\x1C ...).
I have tried both
repr(<mygzipfileobj>.getvalue())
as well as
<mygzipfileobj>.getvalue().encode('string-escape')
In both cases I end up with a string which is not made of HEX chars only.
\x86\xe3$T]\x0fPE\x1c\xaa\x1c8d\xb7\x9e\x127\xcd\x1a.\x88v ...
How can I achieve a consistent hexadecimal conversion where every single byte is translated to the \xHH format? (where H represents a valid hex char, 0-9A-F)
The \xhh format you often see is a debugging aid, the output of repr() applied to a string with non-ASCII codepoints. Any ASCII codepoints are left in place to keep whatever readable information is there.
If you must have a string with all characters replaced by \xhh escapes, you need to do so manually:
''.join(r'\x{0:02x}'.format(ord(c)) for c in value)
If you need quotes around that, you'd need to add those manually too:
"'{0}'".format(''.join(r'\x{:02x}'.format(ord(c)) for c in value))
Could you explain in detail what the difference is between a byte string and a Unicode string in Python? I have read this:
Byte code is simply the converted source code into arrays of bytes
Does it mean that Python has its own coding/encoding format? Or does it use the operating system settings?
I don't understand. Could you please explain?
Thank you!
No, Python does not use its own encoding - it will use any encoding that it has access to and that you specify.
A character in a str represents one Unicode character. However, since there are far more than 256 Unicode characters, encodings may need more than one byte per character to represent many of them.
bytes objects give you access to the underlying bytes. str objects have the encode method, which takes a string naming an encoding and returns the bytes object that represents the string in that encoding. bytes objects have the decode method, which takes a string naming an encoding and returns the str that results from interpreting the bytes as text in the given encoding.
For example:
>>> a = "αά".encode('utf-8')
>>> a
b'\xce\xb1\xce\xac'
>>> a.decode('utf-8')
'αά'
We can see that UTF-8 is using four bytes, \xce, \xb1, \xce, and \xac, to represent two characters.
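You can see the two-bytes-per-character split by encoding each character on its own:
>>> 'α'.encode('utf-8')
b'\xce\xb1'
>>> 'ά'.encode('utf-8')
b'\xce\xac'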
Related reading:
Python Unicode Howto (from the official documentation)
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
Here's an attempt at a simple explanation that only applies to Python 3. I hope that, coming from a lay person, it helps to clear some confusion for the completely uninitiated. If there are any technical inaccuracies, please forgive me and feel free to point them out.
Suppose you create a string using Python 3 in the usual way:
stringobject = 'ant'
stringobject would be a unicode string.
A unicode string is made up of unicode characters. In stringobject above, the unicode characters are the individual letters, e.g. a, n, t
Each unicode character is assigned a code point, which can be expressed as a sequence of hex digits (a hex digit can take on 16 values, ranging from 0-9 and A-F). For instance, the letter 'a' is equivalent to '\u0061', and 'ant' is equivalent to '\u0061\u006E\u0074'.
So you will find that if you type in,
stringobject = '\u0061\u006E\u0074'
stringobject
You will also get the output 'ant'.
Now, unicode is converted to bytes, in a process known as encoding. The reverse process of converting bytes to unicode is known as decoding.
How is this done? Since each hex digit can take on 16 different values, it can be reflected in a 4-bit binary sequence (e.g. the hex digit 0 can be expressed in binary as 0000, the hex digit 1 can be expressed as 0001 and so forth). If a unicode character has a code point consisting of four hex digits, it would need a 16-bit binary sequence to encode it.
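For instance, the four-hex-digit code point of 'a' (0061) spelled out as a 16-bit binary sequence:
>>> format(ord('a'), '016b')
'0000000001100001'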
Different encoding systems specify different rules for converting unicode to bits. Most importantly, encodings differ in the number of bits they use to express each unicode character.
For instance, the ASCII encoding system uses only 7 bits per character, so it can only encode the 128 characters whose code points fall in the range 0x00-0x7F. The UTF-8 encoding system uses 8 to 32 bits (1 to 4 bytes) per character, so it can encode every Unicode character, up to code point \U0010FFFF.
Running the following code:
byteobject = stringobject.encode('utf-8')
byteobject, type(byteobject)
converts a unicode string into a byte string using the utf-8 encoding system, and gives you (b'ant', <class 'bytes'>).
Note that if you used 'ascii' as the encoding system, you wouldn't run into any problems, since all code points in 'ant' fall in the ASCII range. But if you had a unicode string containing characters outside that range, you would get a UnicodeEncodeError.
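For example (the accented character here is just an illustration of a code point outside the ASCII range):
>>> 'antó'.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xf3' in position 3: ordinal not in range(128)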
Similarly,
stringobject = byteobject.decode('utf-8')
stringobject, type(stringobject)
gives you ('ant', <class 'str'>).
I am wondering how binary encoding for a string works in Python.
For example,
>>> b'\x25'
b'%'
or
>>>b'\xe2\x82\xac'.decode()
'€'
but
>>> b'\xy9'
File "<stdin>", line 1
SyntaxError: (value error) invalid \x escape at position 0
Please, could you explain what \xe2 stands for and how this binary encoding works?
\x is used to introduce a hexadecimal value, and must be followed by exactly two hexadecimal digits. For example, \xe2 represents the byte (in decimal) 226 (= 14 * 16 + 2).
In the first case, the two strings b'\x25' and b'%' are identical; Python displays values using ASCII equivalents where possible.
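You can verify both points interactively:
>>> b'\x25' == b'%'
True
>>> b'\xe2'[0]          # a single byte with decimal value 226
226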
I assume that you use a Python 3 version. In Python 3 the default encoding is UTF-8, so b'\xe2\x82\xac'.decode() is in fact b'\xe2\x82\xac'.decode('utf-8').
It gives the character '€', which is U+20AC in Unicode, and the UTF-8 encoding of U+20AC is indeed b'\xe2\x82\xac', 3 bytes long.
So all ASCII characters (code points below 128) are encoded into one single byte with the same value as the Unicode code point. Non-ASCII characters whose code points fit in a single 16-bit Unicode value (the Basic Multilingual Plane) are UTF-8 encoded into 2 or 3 bytes.
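A few length checks show the 1-, 2- and 3-byte cases (the characters chosen here are just examples):
>>> len('A'.encode('utf-8'))
1
>>> len('é'.encode('utf-8'))
2
>>> len('€'.encode('utf-8'))
3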
I have everything working as I want it in my code, but I'm still curious. I have a string: "stación." When I convert that string to unicode, I get:
unicode('stación', 'utf-8')
>>> u'staci\xf3n'
That "\xf3" in there looks like a byte character. This is different from:
unicode('Поиск', 'utf-8')
>>> u'\u041f\u043e\u0438\u0441\u043a'
In the latter example, as with everything I've converted to unicode before, I get unicode characters starting with "\u." Normally, when I see a byte starting with "\x," I think there's a problem. What gives here? Is this because "ó" is extended ASCII?
No, it's because "ó" is a non-ASCII character within the first 256 code points. Since it's representable using a single byte, we save 2 characters in the representation. The other two representations are valid, but not required.
>>> u'\u00f3'
u'\xf3'
>>> u'\U000000f3'
u'\xf3'
u'\xf3' is not a byte; it is a Unicode string with a single Unicode codepoint (U+00f3 LATIN SMALL LETTER O WITH ACUTE).
What you see (u'\xf3') is how Python 2 chooses to represent Unicode characters with ordinals (integers) in the range 0..255 that are not printable ASCII characters (Python 3 would show 'ó' here; only non-printable characters use the '\xhh' form there by default). As @Ignacio Vazquez-Abrams said: the u'\u00f3' and u'\U000000f3' literals create exactly the same Unicode string.
For comparison, you can see how the Unicode character u'\xf3' looks as bytes in different character encodings:
>>> print(u'\xf3')
ó
>>> u'\xf3'.encode('utf-8')
b'\xc3\xb3'
>>> u'\xf3'.encode('utf-16be')
b'\x00\xf3'
>>> u'\xf3'.encode('utf-32le')
b'\xf3\x00\x00\x00'
>>> u'\xf3'.encode('cp1252')
b'\xf3'
Note: b'\xf3' and u'\xf3' are different things. The former is a bytestring that contains a single byte (the integer 243); the latter is a Unicode string that contains a single Unicode codepoint (Unicode ordinal 243). The number is the same, 243, but the units are different -- 100 calories is not the same thing as 100 grams.
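A quick round trip makes the distinction visible (Python 2 syntax, matching the rest of this answer):
>>> u'\xf3'.encode('latin-1')   # codepoint 243 encoded as one byte in latin-1
'\xf3'
>>> b'\xf3'.decode('latin-1')   # and back to a one-codepoint Unicode string
u'\xf3'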