I am trying to use str.encode() but I get
>>> "hello".encode(hex)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: must be string, not builtin_function_or_method
I have tried a bunch of variations and they seem to all work in Python 2.5.2, so what do I need to do to get them to work in Python 3.1?
The hex codec has been chucked in 3.x. Use binascii instead:
>>> import binascii
>>> binascii.hexlify(b'hello')
b'68656c6c6f'
In Python 3.5+, encode the string to bytes and use the hex() method, returning a string.
s = "hello".encode("utf-8").hex()
s
# '68656c6c6f'
Optionally convert the string back to bytes:
b = bytes(s, "utf-8")
b
# b'68656c6c6f'
You've already got some good answers, but I thought you might be interested in a bit of the background too.
Firstly you're missing the quotes. It should be:
"hello".encode("hex")
Secondly, this codec hasn't been ported to Python 3.1. See here. It seems they haven't yet decided whether these codecs should be included in Python 3 or implemented in a different way.
If you look at the diff file attached to that bug you can see the proposed method of implementing it:
import binascii
output = binascii.b2a_hex(input)
The easiest way to do it in Python 3.5 and higher is:
>>> 'halo'.encode().hex()
'68616c6f'
If you are typing the literal directly into a Python interpreter and it contains only ASCII characters, you can do it even faster by putting b before the string:
>>> b'halo'.hex()
'68616c6f'
Equivalent in Python 2.x:
>>> 'halo'.encode('hex')
'68616c6f'
The binascii methods are easier, by the way:
>>> import binascii
>>> x=b'test'
>>> x=binascii.hexlify(x)
>>> x
b'74657374'
>>> y=str(x,'ascii')
>>> y
'74657374'
>>> x=binascii.unhexlify(x)
>>> x
b'test'
>>> y=str(x,'ascii')
>>> y
'test'
Hope it helps. :)
In Python 3, all strings are Unicode. Usually, when you encode a string, you use .encode('TEXT_ENCODING') to get bytes; since hex is not a text encoding, you should use codecs.encode() to handle arbitrary codecs. For example:
>>> "hello".encode('hex')
LookupError: 'hex' is not a text encoding; use codecs.encode() to handle arbitrary codecs
>>> import codecs
>>> codecs.encode(b"hello", 'hex')
b'68656c6c6f'
Again, since "hello" is a Unicode string, you need to express it as a byte string before encoding to hexadecimal. This may be more in line with your original approach of using the encode method.
The differences between binascii.hexlify and codecs.encode are as follow:
binascii.hexlify
Hexadecimal representation of binary data.
The return value is a bytes object.
Type: builtin_function_or_method
codecs.encode
encode(obj, [encoding[,errors]]) -> object
Encodes obj using the codec registered for encoding. encoding defaults
to the default encoding. errors may be given to set a different error
handling scheme. Default is 'strict' meaning that encoding errors raise
a ValueError. Other possible values are 'ignore', 'replace' and
'xmlcharrefreplace' as well as any other name registered with
codecs.register_error that can handle ValueErrors.
Type: builtin_function_or_method
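In practice, both return the same bytes object for this input (a quick sketch):
>>> import binascii, codecs
>>> binascii.hexlify(b'hello')
b'68656c6c6f'
>>> codecs.encode(b'hello', 'hex')
b'68656c6c6f'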
base64.b16encode and base64.b16decode convert bytes to and from hex and work across all Python versions. The codecs approach also works, but is less straightforward in Python 3.
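For example (a quick sketch; note that b16encode returns uppercase hex, and b16decode expects uppercase unless you pass casefold=True):
>>> import base64
>>> base64.b16encode(b'hello')
b'68656C6C6F'
>>> base64.b16decode(b'68656C6C6F')
b'hello'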
Use hexlify - http://epydoc.sourceforge.net/stdlib/binascii-module.html
Yet another method:
s = 'hello'
h = ''.join([hex(ord(i)) for i in s])
# outputs: '0x680x650x6c0x6c0x6f'
This basically splits the string into chars, does the conversion through hex(ord(char)), and joins the chars back together. In case you want the result without the 0x prefix, do:
h = ''.join([hex(ord(i))[2:4] for i in s])
# outputs: '68656c6c6f'
Tested with Python 3.5.3.
Related
As far as I know there is a difference between strings and unicode strings in Python. But is it possible to instruct Python to use unicode strings instead of regular ones whenever a string object is created?
So when I get a text input, I don't need to use unicode()?
I might sound lazy but I am just interested if this is possible...
p.s. I don't know a lot about character encoding so please correct me if I got anything wrong
For example (in the Python interactive interpreter; results may differ in a GUI shell):
>>> s = '你好'
>>> s
'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> us = u'你好'
>>> us
u'\u4f60\u597d'
>>> print type(s)
<type 'str'>
>>> print type(us)
<type 'unicode'>
>>> len(s)
6
>>> len(us)
2
In short:
First, a str object is a sequence of bytes, while a Unicode string is a sequence of code points (Unicode code units), which are numbers from 0 to 0x10FFFF.
Then, len(string) returns a number of bytes, while len(unicode) returns a number of characters. A Unicode sequence needs to be represented as a set of bytes (meaning, values from 0-255) in memory; the rules for translating a Unicode string into a sequence of bytes are called an encoding.
I think you should use raw_input instead of input if you want to get a bytestring.
But is it possible to instruct Python to use unicode strings instead of regular ones whenever a string object is created?
There are two type of strings in Python (on both Python 2 and 3): a bytestring (a sequence of bytes) and a Unicode string (a sequence of Unicode codepoints).
bytestring = b'abc'
unicode_text = u'abc'
The type of string created using 'abc' string literal depends on Python version and the presence of from __future__ import unicode_literals import. Without the import on Python 2, 'abc' literal creates a bytestring otherwise it creates a Unicode string.
Add the encoding declaration at the top of your Python source file if you use non-ascii characters in string literals e.g.: # -*- coding: utf-8 -*-.
So when I get a text input, I don't need to use unicode()?
If by "text input" you mean that your program receives bytes somehow (from a file, network, from the command-line) then no: you shouldn't rely on Python to convert bytes to Unicode implicitly -- you should do it explicitly as soon as you receive the bytes using unicode_text = bytestring.decode(character_encoding).
And in reverse, keep the text as Unicode inside your program. Convert Unicode strings to bytes as late as possible when it is necessary (e.g., to send the text via the network).
Use bytestrings to work with a binary data: an image, a compressed content, etc. Use Unicode strings to work with text in Python.
To read Unicode from a file, use io.open() (you have to know the correct character encoding if it is not locale.getpreferredencoding(False)).
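A minimal sketch, assuming a UTF-8 encoded file (the name example.txt is just a placeholder):
import io

with io.open('example.txt', encoding='utf-8') as f:
    text = f.read()  # text is a Unicode string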
What character encoding to use when you receive your Unicode text via network may depend on the corresponding protocol e.g., the charset can be specified in Content-Type http header:
text = data.decode(response.headers.getparam('charset'))
You could use universal_newlines=True or io.TextIOWrapper() to get Unicode text from an external process started using subprocess module. It can be non-trivial to figure out what character encoding should be used on Windows (if you read Russian, see the gory details here: Byte при печати вывода внешней команды).
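On Python 3, for example (a rough sketch, assuming a POSIX echo command is available; the output is decoded with the locale's preferred encoding):
>>> import subprocess
>>> out = subprocess.check_output(['echo', 'hello'], universal_newlines=True)
>>> type(out)
<class 'str'>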
In Python 2.6+ you can use from __future__ import unicode_literals, but that only makes string literals Unicode. All functions that returned byte strings still return byte strings.
Example:
>>> s = 'abc'
>>> type(s)
<type 'str'>
>>> from __future__ import unicode_literals
>>> s = 'abc'
>>> type(s)
<type 'unicode'>
For the behavior you want, use Python 3.
How can I convert this string below from Python 3 to JSON?
This is my code:
import ast
mystr = b'[{\'1459161763632\': \'this_is_a_test\'}, {\'1459505002853\': "{\'hello\': 12345}"}, {\'1459505708472\': "{\'world\': 98765}"}]'
chunk = str(mystr)
chunk = ast.literal_eval(chunk)
print(chunk)
Running from Python2 I get:
[{'1459161763632': 'this_is_a_test'}, {'1459505002853': "{'hello': 12345}"}, {'1459505708472': "{'world': 98765}"}]
Running from Python3 I get:
b'[{\'1459161763632\': \'this_is_a_test\'}, {\'1459505002853\': "{\'hello\': 12345}"}, {\'1459505708472\': "{\'world\': 98765}"}]'
How can I run from Python3 and get the same result as Python2?
What you have in mystr is a bytes object; just decode it to ASCII and then evaluate it:
>>> ast.literal_eval(mystr.decode('ascii'))
[{'1459161763632': 'this_is_a_test'}, {'1459505002853': "{'hello': 12345}"}, {'1459505708472': "{'world': 98765}"}]
Or, in a more general case, to avoid issues with Unicode characters:
>>> ast.literal_eval(mystr.decode('utf-8'))
[{'1459161763632': 'this_is_a_test'}, {'1459505002853': "{'hello': 12345}"}, {'1459505708472': "{'world': 98765}"}]
And since the default decoding scheme is utf-8, which you can see from:
>>> help(mystr.decode)
Help on built-in function decode:
decode(...) method of builtins.bytes instance
B.decode(encoding='utf-8', errors='strict') -> str
...
Then, you don't have to specify the encoding scheme:
>>> ast.literal_eval(mystr.decode())
[{'1459161763632': 'this_is_a_test'}, {'1459505002853': "{'hello': 12345}"}, {'1459505708472': "{'world': 98765}"}]
Iron Fist beat me to the fix. To extend his answer, the b prefix on the literal indicates (to Python 3 but not Python 2) that it should be interpreted as a byte sequence, not a string.
The result is that the .decode method is needed to convert the bytes back into a string. Python 2 doesn't make this distinction between bytes and strings, hence the difference.
See What does the 'b' character do in front of a string literal? for more information on this.
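A quick illustration of the distinction on Python 3:
>>> type(b'abc')
<class 'bytes'>
>>> type('abc')
<class 'str'>
>>> b'abc'.decode()
'abc'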
I've been working on a TCP/IP connection program in Python and came across the need to use Struct. And so I imported the module, and after some time came to a very particular issue. I get the error specified in the title when I run the code below, which should be working according to some other answers and documentation I checked.
import struct
string = "blab"
s = struct.Struct(b'4s')
packed_data = s.pack(string)
print(packed_data)
As far as I found, the issue should be fixed by prefixing the string used in the s variable with b, or by using the bytes() function and passing 'utf-8' as the encoding argument. Tried both, same error.
I have no idea what might be wrong so am I missing something? I could not find relevant information online regarding this issue, so this is why I'm posting here now.
Any help is appreciated and thanks in advance!
One problem is that you put the "b" in the wrong place. You placed it in the format string, when it is the data to be packed that needs to be a byte string.
>>> string = "blab"
>>> s = struct.Struct('4s')
>>> packed_data = s.pack(string.encode('utf-8'))
>>> print(packed_data)
b'blab'
But even that is problematic. Suppose your string is not in the ASCII character set... let's say it's Greek. Then the UTF-8 encoded string is more than 4 bytes, and you write a truncated value:
>>> string = "ΑΒΓΔ"
>>> s = struct.Struct('4s')
>>>
>>> packed_data = s.pack(string.encode('utf-8'))
>>> print('utf8len', len(string.encode('utf-8')), 'packedlen', len(packed_data))
utf8len 8 packedlen 4
>>> print(packed_data)
b'\xce\x91\xce\x92'
>>> print(struct.unpack('4s', packed_data)[0].decode('utf-8'))
ΑΒ
>>>
If you really need to restrict to 4 bytes, then convert the original string using ascii instead of utf-8 so that any unencodable unicode character will raise an exception right away.
>>> string = "ΑΒΓΔ"
>>> s = struct.Struct('4s')
>>>
>>> packed_data = s.pack(string.encode('ascii'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
packed_data = s.pack(string.encode('utf-8'))
Should work in both Py2 and 3
From the Unicode HowTo for Python 2.7.11:
Another important method is .encode([encoding], [errors='strict']),
which returns an 8-bit string version of the Unicode string, encoded
in the requested encoding. The errors parameter is the same as the
parameter of the unicode() constructor, with one additional
possibility; as well as ‘strict’, ‘ignore’, and ‘replace’, you can
also pass ‘xmlcharrefreplace’ which uses XML’s character references.
The following example shows the different results:
https://docs.python.org/2/howto/unicode.html
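A small sketch of what those error handlers do (Python 2 syntax, with an illustrative non-ASCII string rather than the HOWTO's own example):
>>> u = u'caf\xe9'
>>> u.encode('ascii', 'ignore')
'caf'
>>> u.encode('ascii', 'replace')
'caf?'
>>> u.encode('ascii', 'xmlcharrefreplace')
'caf&#233;'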
What's the correct way to convert bytes to a hex string in Python 3?
I see claims of a bytes.hex method, bytes.decode codecs, and have tried other possible functions of least astonishment without avail. I just want my bytes as hex!
Since Python 3.5 this is finally no longer awkward:
>>> b'\xde\xad\xbe\xef'.hex()
'deadbeef'
and reverse:
>>> bytes.fromhex('deadbeef')
b'\xde\xad\xbe\xef'
It also works with the mutable bytearray type.
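For instance:
>>> bytearray(b'\xde\xad\xbe\xef').hex()
'deadbeef'
>>> bytearray.fromhex('deadbeef')
bytearray(b'\xde\xad\xbe\xef')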
Reference: https://docs.python.org/3/library/stdtypes.html#bytes.hex
Use the binascii module:
>>> import binascii
>>> binascii.hexlify('foo'.encode('utf8'))
b'666f6f'
>>> binascii.unhexlify(_).decode('utf8')
'foo'
See this answer:
Python 3.1.1 string to hex
Python has bytes-to-bytes standard codecs that perform convenient transformations like quoted-printable (fits into 7-bit ASCII), base64 (fits into alphanumerics), hex escaping, and zlib and bz2 compression. In Python 2, you could do:
b'foo'.encode('hex')
In Python 3, str.encode / bytes.decode are strictly for bytes<->str conversions. Instead, you can do this, which works across Python 2 and Python 3 (s/encode/decode/g for the inverse):
import codecs
codecs.getencoder('hex')(b'foo')[0]
Starting with Python 3.4, there is a less awkward option:
codecs.encode(b'foo', 'hex')
These misc codecs are also accessible inside their own modules (base64, zlib, bz2, uu, quopri, binascii); the API is less consistent, but for compression codecs it offers more control.
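As a quick illustration of the inconsistency (the base64 codec appends a trailing newline, while base64.b64encode does not):
>>> import codecs, base64
>>> codecs.encode(b'foo', 'base64')
b'Zm9v\n'
>>> base64.b64encode(b'foo')
b'Zm9v'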
New in Python 3.8, you can pass a separator argument to bytes.hex(), as in this example:
>>> value = b'\xf0\xf1\xf2'
>>> value.hex('-')
'f0-f1-f2'
>>> value.hex('_', 2)
'f0_f1f2'
>>> b'UUDDLRLRAB'.hex(' ', -4)
'55554444 4c524c52 4142'
https://docs.python.org/3/library/stdtypes.html#bytes.hex
The method binascii.hexlify() will convert bytes to a bytes object representing the ASCII hex string. That means that each byte in the input gets converted to two ASCII characters. If you want a true str out, you can .decode("ascii") the result.
I included a snippet that illustrates it.
import binascii
with open("addressbook.bin", "rb") as f: # or any binary file like '/bin/ls'
in_bytes = f.read()
print(in_bytes) # b'\n\x16\n\x04'
hex_bytes = binascii.hexlify(in_bytes)
print(hex_bytes) # b'0a160a04' which is twice as long as in_bytes
hex_str = hex_bytes.decode("ascii")
print(hex_str) # 0a160a04
From the hex string "0a160a04" you can come back to the bytes with binascii.unhexlify("0a160a04"), which gives back b'\n\x16\n\x04'.
import codecs
codecs.getencoder('hex_codec')(b'foo')[0]
works in Python 3.3 (so "hex_codec" instead of "hex").
You can use the format specifier %02x to format and output a hex value. For example:
>>> foo = b"tC\xfc}\x05i\x8d\x86\x05\xa5\xb4\xd3]Vd\x9cZ\x92~'6"
>>> res = ""
>>> for b in foo:
... res += "%02x" % b
...
>>> print(res)
7443fc7d05698d8605a5b4d35d56649c5a927e2736
OK, the following answer is slightly beyond-scope if you only care about Python 3, but this question is the first Google hit even if you don't specify the Python version, so here's a way that works on both Python 2 and Python 3.
I'm also interpreting the question to be about converting bytes to the str type: that is, bytes-y on Python 2, and Unicode-y on Python 3.
Given that, the best approach I know is:
import six
bytes_to_hex_str = lambda b: ' '.join('%02x' % i for i in six.iterbytes(b))
The following assertion will be true for either Python 2 or Python 3, assuming you haven't activated the unicode_literals future in Python 2:
assert bytes_to_hex_str(b'jkl') == '6a 6b 6c'
(Or you can use ''.join() to omit the space between the bytes, etc.)
If you want to convert b'\x61' to 97 or '0x61', you can try this:
[python3.5]
>>> from struct import unpack
>>> temp = unpack('B', b'\x61')[0]  # convert bytes to an unsigned int
>>> temp
97
>>> hex(temp)  # convert the int to a string with its hexadecimal representation
'0x61'
Reference: https://docs.python.org/3.5/library/struct.html
def _oauth_escape(val):
if isinstance(val, unicode):# useful ?
val = val.encode("utf-8")#useful ?
return urllib.quote(val, safe="~")
I think it is not useful, yes??
Updated:
I think unicode is 'utf-8', yes?
utf-8 is an encoding, a recipe for concretely representing unicode data as a series of bytes. This is one of many such encodings. Python str objects are bytestrings, which can represent arbitrary binary data, such as text in a specific encoding.
Python's unicode type is an abstract, not-encoded way to represent text. unicode strings can be encoded in any of many encodings.
As others have said already, unicode and utf-8 are not the same. Utf-8 is one of many encodings for unicode.
Think of unicode objects as "unencoded" unicode strings, while string objects are encoded in a particular encoding (unfortunately, string objects don't have an attribute that tells you how they are encoded).
val.encode("utf-8") converts this unicode object into a utf-8 encoded string object.
In Python 2.6, this is necessary, as urllib can't handle unicode properly.
>>> import urllib
>>> urllib.quote(u"")
''
>>> urllib.quote(u"ä")
/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib.py:1216: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
res = map(safe_map.__getitem__, s)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib.py", line 1216, in quote
res = map(safe_map.__getitem__, s)
KeyError: u'\xe4'
>>> urllib.quote(u"ä".encode("utf-8"))
'%C3%A4'
In Python 3.x, however, where all strings are Unicode (the Python 3 equivalent of an encoded string is a bytes object), this is not necessary anymore.
>>> import urllib.parse
>>> urllib.parse.quote("ä")
'%C3%A4'
In Python 3.0 all strings support Unicode, but with previous versions one has to explicitly encode strings to Unicode strings. Could that be it?
(utf-8 is not the only encoding for Unicode, but it is the most common one. Read this.)