How can I convert the string below to JSON in Python 3?
This is my code:
import ast
mystr = b'[{\'1459161763632\': \'this_is_a_test\'}, {\'1459505002853\': "{\'hello\': 12345}"}, {\'1459505708472\': "{\'world\': 98765}"}]'
chunk = str(mystr)
chunk = ast.literal_eval(chunk)
print(chunk)
Running from Python2 I get:
[{'1459161763632': 'this_is_a_test'}, {'1459505002853': "{'hello': 12345}"}, {'1459505708472': "{'world': 98765}"}]
Running from Python3 I get:
b'[{\'1459161763632\': \'this_is_a_test\'}, {\'1459505002853\': "{\'hello\': 12345}"}, {\'1459505708472\': "{\'world\': 98765}"}]'
How can I run from Python3 and get the same result as Python2?
What you have in mystr is a bytes object. Decode it as ASCII to get a string, then evaluate it:
>>> ast.literal_eval(mystr.decode('ascii'))
[{'1459161763632': 'this_is_a_test'}, {'1459505002853': "{'hello': 12345}"}, {'1459505708472': "{'world': 98765}"}]
Or, in the more general case, to avoid issues with Unicode characters:
>>> ast.literal_eval(mystr.decode('utf-8'))
[{'1459161763632': 'this_is_a_test'}, {'1459505002853': "{'hello': 12345}"}, {'1459505708472': "{'world': 98765}"}]
And since the default decoding scheme is UTF-8, as you can see from:
>>> help(mystr.decode)
Help on built-in function decode:
decode(...) method of builtins.bytes instance
B.decode(encoding='utf-8', errors='strict') -> str
...
Then, you don't have to specify the encoding scheme:
>>> ast.literal_eval(mystr.decode())
[{'1459161763632': 'this_is_a_test'}, {'1459505002853': "{'hello': 12345}"}, {'1459505708472': "{'world': 98765}"}]
Iron Fist beat me to the fix. To extend his answer, the 'b' prefix on the string indicates (to python3 but not python2) that the literal should be interpreted as a byte sequence, not a string.
The result is that the .decode method is needed to convert the bytes back into a string. Python2 doesn't make this distinction between the bytes and strings, hence the difference.
See What does the 'b' character do in front of a string literal? for more information on this.
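A minimal sketch of the point above, assuming Python 3: str() on a bytes object returns its repr (including the b prefix), while decode() returns the text itself, which can then be evaluated and serialized with json.dumps. The variable names are illustrative and the input is shortened from the question.

```python
import ast
import json

# Shortened version of the bytes value from the question.
raw = b'[{\'1459161763632\': \'this_is_a_test\'}, {\'1459505002853\': "{\'hello\': 12345}"}]'

as_repr = str(raw)             # repr of the bytes object, starts with "b'"
as_text = raw.decode('utf-8')  # the text that the bytes encode

records = ast.literal_eval(as_text)  # list of dicts
as_json = json.dumps(records)        # JSON text

print(as_json)
```

Since the question asks for JSON, json.dumps is the step that actually produces it; ast.literal_eval only turns the Python literal into Python objects.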
Related
Apparently, the following is the valid syntax:
b'The string'
I would like to know:
What does this b character in front of the string mean?
What are the effects of using it?
What are appropriate situations to use it?
I found a related question right here on SO, but that question is about PHP. It states that the b indicates the string is binary, as opposed to Unicode, which was needed to keep code compatible with PHP versions before 6 when migrating to PHP 6. I don't think this applies to Python.
I did find this documentation on the Python site about using a u character in the same syntax to specify a string as Unicode. Unfortunately, it doesn't mention the b character anywhere in that document.
Also, just out of curiosity, are there more symbols than the b and u that do other things?
Python 3.x makes a clear distinction between the types:
str = '...' literals = a sequence of Unicode characters (Latin-1, UCS-2 or UCS-4, depending on the widest character in the string)
bytes = b'...' literals = a sequence of octets (integers between 0 and 255)
If you're familiar with:
Java or C#, think of str as String and bytes as byte[];
SQL, think of str as NVARCHAR and bytes as BINARY or BLOB;
Windows registry, think of str as REG_SZ and bytes as REG_BINARY.
If you're familiar with C(++), then forget everything you've learned about char and strings, because a character is not a byte. That idea is long obsolete.
You use str when you want to represent text.
print('שלום עולם')
You use bytes when you want to represent low-level binary data like structs.
NaN = struct.unpack('>d', b'\xff\xf8\x00\x00\x00\x00\x00\x00')[0]
You can encode a str to a bytes object.
>>> '\uFEFF'.encode('UTF-8')
b'\xef\xbb\xbf'
And you can decode a bytes into a str.
>>> b'\xE2\x82\xAC'.decode('UTF-8')
'€'
But you can't freely mix the two types.
>>> b'\xEF\xBB\xBF' + 'Text with a UTF-8 BOM'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can't concat bytes to str
The b'...' notation is somewhat confusing in that it allows the bytes 0x01-0x7F to be specified with ASCII characters instead of hex numbers.
>>> b'A' == b'\x41'
True
But I must emphasize, a character is not a byte.
>>> 'A' == b'A'
False
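To make the distinction concrete, here is a small sketch (assuming Python 3; the sample text is arbitrary) showing that a str is indexed by character while a bytes object is indexed by integer byte value.

```python
text = 'A€'
data = text.encode('utf-8')

print(len(text))   # number of characters
print(len(data))   # number of bytes
print(list(data))  # bytes are integers in range(256)
print(data[0])     # indexing bytes yields an int
print(text[0])     # indexing str yields a one-character str
```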
In Python 2.x
Pre-3.0 versions of Python lacked this kind of distinction between text and binary data. Instead, there was:
unicode = u'...' literals = sequence of Unicode characters = 3.x str
str = '...' literals = sequences of confounded bytes/characters
Usually text, encoded in some unspecified encoding.
But also used to represent binary data like struct.pack output.
In order to ease the 2.x-to-3.x transition, the b'...' literal syntax was backported to Python 2.6, in order to allow distinguishing binary strings (which should be bytes in 3.x) from text strings (which should be str in 3.x). The b prefix does nothing in 2.x, but tells the 2to3 script not to convert it to a Unicode string in 3.x.
So yes, b'...' literals in Python have the same purpose that they do in PHP.
Also, just out of curiosity, are there
more symbols than the b and u that do
other things?
The r prefix creates a raw string (e.g., r'\t' is a backslash + t instead of a tab), and triple quotes '''...''' or """...""" allow multi-line string literals.
To quote the Python 2.x documentation:
A prefix of 'b' or 'B' is ignored in
Python 2; it indicates that the
literal should become a bytes literal
in Python 3 (e.g. when code is
automatically converted with 2to3). A
'u' or 'b' prefix may be followed by
an 'r' prefix.
The Python 3 documentation states:
Bytes literals are always prefixed with 'b' or 'B'; they produce an instance of the bytes type instead of the str type. They may only contain ASCII characters; bytes with a numeric value of 128 or greater must be expressed with escapes.
The b denotes a byte string.
Bytes are the actual data. Strings are an abstraction.
If you had a multi-character string object and you took a single character, it would be a string, and it might be more than 1 byte in size depending on the encoding.
If you took 1 byte from a byte string, you'd get a single 8-bit value from 0-255, and it might not represent a complete character if the encoding uses more than 1 byte for that character.
TBH I'd use strings unless I had some specific low level reason to use bytes.
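As a rough illustration of the "incomplete character" case, assuming Python 3 and UTF-8: slicing one byte out of a multi-byte character leaves something that cannot be decoded on its own. The names here are illustrative only.

```python
encoded = '€uro'.encode('utf-8')  # the euro sign takes three bytes in UTF-8

first_byte = encoded[:1]
try:
    first_byte.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    # One byte of a three-byte sequence is not a complete character.
    decoded_ok = False

first_char = encoded[:3].decode('utf-8')  # all three bytes decode fine

print(decoded_ok, first_char)
```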
On the server side, any response we send goes out as bytes, so it appears on the client as b'Response from server'.
To get rid of the b'....', simply use the code below:
Server file:
stri="Response from server"
c.send(stri.encode())
Client file:
print(s.recv(1024).decode())
then it will print Response from server
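The same round trip can be tried without writing a separate server and client. This is only a sketch: it assumes socket.socketpair() is available on your platform and uses it as a stand-in for a real connection, and the names server_end and client_end are made up for the example.

```python
import socket

# socketpair() returns two already-connected sockets.
server_end, client_end = socket.socketpair()
try:
    stri = "Response from server"
    server_end.sendall(stri.encode())  # str -> bytes before sending
    received = client_end.recv(1024)   # bytes arrive on the other side
    message = received.decode()        # bytes -> str for display
finally:
    server_end.close()
    client_end.close()

print(received)
print(message)
```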
To answer the question: the b prefix does the equivalent of
data.encode()
and to decode it (removing the b, because sometimes you don't need it),
use:
data.decode()
Here's an example where the absence of b would throw a TypeError exception in Python 3.x
>>> f=open("new", "wb")
>>> f.write("Hello Python!")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'str' does not support the buffer interface
Adding a b prefix would fix the problem.
It turns it into a bytes literal (or str in 2.x), and is valid for 2.6+.
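A small sketch of the binary-mode case, assuming Python 3 and using a temporary directory so it does not touch any real file: writing a str to a file opened with "wb" raises TypeError, while a bytes literal or an encoded str is accepted.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'new')

with open(path, 'wb') as f:
    try:
        f.write('Hello Python!')        # str is rejected in binary mode
        wrote_str = True
    except TypeError:
        wrote_str = False
    f.write(b'Hello Python!')           # bytes literal works
    f.write(' Again.'.encode('utf-8'))  # so does an encoded str

with open(path, 'rb') as f:
    content = f.read()

print(wrote_str, content)
```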
The r prefix causes backslashes to be "uninterpreted" (not ignored, and the difference does matter).
In addition to what others have said, note that a single Unicode character can take up multiple bytes once it is encoded.
The way UTF-8 (the most common Unicode encoding) works is that it took the old ASCII format (a 7-bit code that looks like 0xxx xxxx) and added multi-byte sequences in which every byte starts with 1 (1xxx xxxx) to represent characters beyond ASCII, so that it would be backwards-compatible with ASCII.
>>> len('Öl') # German word for 'oil' with 2 characters
2
>>> 'Öl'.encode('UTF-8') # convert str to bytes
b'\xc3\x96l'
>>> len('Öl'.encode('UTF-8')) # 3 bytes encode 2 characters !
3
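To see those bit patterns directly, here is a short sketch (assuming Python 3; utf8_bits is a hypothetical helper name, not a standard function) that prints each encoded byte in binary.

```python
def utf8_bits(char):
    """Return the UTF-8 bytes of one character as 8-digit binary strings."""
    return [format(byte, '08b') for byte in char.encode('utf-8')]

for char in 'Öl':
    print(repr(char), utf8_bits(char))
```

The ASCII letter produces one byte starting with 0, while Ö produces two bytes that both start with 1.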
You can use JSON to convert it to a dictionary (json.loads accepts bytes in Python 3.6+):
import json
data = b'{"key":"value"}'
print(json.loads(data))
{'key': 'value'}
FLASK:
This is an example from Flask. Run this from a Python prompt in a terminal:
import requests
requests.post(url='http://localhost(example)/',json={'key':'value'})
In flask/routes.py
@app.route('/', methods=['POST'])
def api_script_add():
    print(request.data)  # --> b'{"key": "value"}'
    print(json.loads(request.data))
    return json.loads(request.data)
{'key':'value'}
b"hello" is not a string (even though it looks like one), but a byte sequence. It is a sequence of 5 numbers, which, if you mapped them to a character table, would look like h e l l o. However the value itself is not a string, Python just has a convenient syntax for defining byte sequences using text characters rather than the numbers itself. This saves you some typing, and also often byte sequences are meant to be interpreted as characters. However, this is not always the case - for example, reading a JPG file will produce a sequence of nonsense letters inside b"..." because JPGs have a non-text structure.
.encode() and .decode() convert between strings and bytes.
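A quick sketch of the "sequence of numbers" view, assuming Python 3. The three-byte value below is the well-known start of a JPEG file and is used only as an example of bytes that are not text.

```python
data = b"hello"
numbers = list(data)     # the five integer byte values
rebuilt = bytes(numbers) # the same bytes, built from the numbers

jpeg_magic = bytes([0xFF, 0xD8, 0xFF])  # first bytes of a JPEG file
try:
    jpeg_magic.decode('utf-8')
    is_text = True
except UnicodeDecodeError:
    is_text = False      # binary data is not valid UTF-8 text

print(numbers, rebuilt, is_text)
```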
bytes(somestring.encode()) is the solution that worked for me in python 3.
def compare_types():
    output = b'sometext'
    print(output)
    print(type(output))
    somestring = 'sometext'
    encoded_string = somestring.encode()
    output = bytes(encoded_string)
    print(output)
    print(type(output))
compare_types()
As far as I know there is a difference between strings and unicode strings in Python. But is it possible to instruct Python to use unicode strings instead of regular ones whenever a string object is created?
So when I get a text input, I don't need to use unicode()?
I might sound lazy but I am just interested if this is possible...
p.s. I don't know a lot about character encoding so please correct me if I got anything wrong
For example (in the Python interactive interpreter; a GUI shell may behave differently):
>>> s = '你好'
>>> s
'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> us = u'你好'
>>> us
u'\u4f60\u597d'
>>> print type(s)
<type 'str'>
>>> print type(us)
<type 'unicode'>
>>> len(s)
6
>>> len(us)
2
In short:
First, in Python 2 a str object is a sequence of bytes, while a unicode string is a sequence of code points, which are numbers from 0 to 0x10ffff.
Then, len(str) returns the number of bytes, while len(unicode) returns the number of characters. A sequence of code points needs to be represented as a set of bytes (meaning, values from 0-255) in memory. The rules for translating a Unicode string into a sequence of bytes are called an encoding.
I think you should use raw_input instead of input if you want to get a bytestring.
But is it possible to instruct Python to use unicode strings instead of regular ones whenever a string object is created?
There are two types of strings in Python (on both Python 2 and 3): a bytestring (a sequence of bytes) and a Unicode string (a sequence of Unicode codepoints).
bytestring = b'abc'
unicode_text = u'abc'
The type of string created by the 'abc' string literal depends on the Python version and on the presence of the from __future__ import unicode_literals import. Without the import on Python 2, the 'abc' literal creates a bytestring; otherwise it creates a Unicode string.
Add the encoding declaration at the top of your Python source file if you use non-ascii characters in string literals e.g.: # -*- coding: utf-8 -*-.
So when I get a text input, I don't need to use unicode()?
If by "text input" you mean that your program receives bytes somehow (from a file, network, from the command-line) then no: you shouldn't rely on Python to convert bytes to Unicode implicitly -- you should do it explicitly as soon as you receive the bytes using unicode_text = bytestring.decode(character_encoding).
And in reverse, keep the text as Unicode inside your program. Convert Unicode strings to bytes as late as possible when it is necessary (e.g., to send the text via the network).
Use bytestrings to work with a binary data: an image, a compressed content, etc. Use Unicode strings to work with text in Python.
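The "decode early, encode late" advice can be sketched as follows, assuming Python 3. The function name shout and the choice of UTF-8 as the default are illustrative assumptions, not part of the original answer.

```python
def shout(raw, encoding='utf-8'):
    """Decode at the boundary, work on text, encode again on the way out."""
    text = raw.decode(encoding)     # bytes -> str as soon as data arrives
    result = text.upper()           # all processing happens on Unicode text
    return result.encode(encoding)  # str -> bytes only when leaving the program

incoming = 'héllo'.encode('utf-8')
outgoing = shout(incoming)
print(outgoing.decode('utf-8'))
```

Uppercasing the raw bytes directly would leave the accented letter untouched, which is one reason to do text operations on str rather than on bytes.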
To read Unicode from a file, use io.open() (you have to know the correct character encoding if it is not locale.getpreferredencoding(False)).
What character encoding to use when you receive your Unicode text via network may depend on the corresponding protocol e.g., the charset can be specified in Content-Type http header:
text = data.decode(response.headers.getparam('charset'))
You could use universal_newlines=True or io.TextIOWrapper() to get Unicode text from an external process started using subprocess module. It can be non-trivial to figure out what character encoding should be used on Windows (if you read Russian, see the gory details here: Byte при печати вывода внешней команды).
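A minimal sketch of the subprocess case, assuming Python 3. It runs the current interpreter as the child process so that it does not depend on any external command; with universal_newlines=True the output arrives as text decoded with the locale's preferred encoding, otherwise as bytes.

```python
import subprocess
import sys

command = [sys.executable, '-c', 'print("hello from child")']

# Text mode: stdout is decoded for you and returned as str.
as_text = subprocess.run(command, stdout=subprocess.PIPE,
                         universal_newlines=True)

# Default mode: stdout is returned as raw bytes.
as_bytes = subprocess.run(command, stdout=subprocess.PIPE)

print(repr(as_text.stdout), repr(as_bytes.stdout))
```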
In Python 2.6+ you can use from __future__ import unicode_literals, but that only makes string literals Unicode. All functions that returned byte strings still return byte strings.
Example:
>>> s = 'abc'
>>> type(s)
<type 'str'>
>>> from __future__ import unicode_literals
>>> s = 'abc'
>>> type(s)
<type 'unicode'>
For the behavior you want, use Python 3.
I have a browser which sends utf-8 characters to my Python server, but when I retrieve it from the query string, the encoding that Python returns is ASCII. How can I convert the plain string to utf-8?
NOTE: The string passed from the web is already UTF-8 encoded, I just want to make Python to treat it as UTF-8 not ASCII.
In Python 2
>>> plain_string = "Hi!"
>>> unicode_string = u"Hi!"
>>> type(plain_string), type(unicode_string)
(<type 'str'>, <type 'unicode'>)
^ This is the difference between a byte string (plain_string) and a unicode string.
>>> s = "Hello!"
>>> u = unicode(s, "utf-8")
^ Converting to unicode and specifying the encoding.
In Python 3
All strings are Unicode. The unicode function does not exist anymore. See the answer from @Noumenon.
If the methods above don't work, you can also tell Python to ignore portions of a string that it can't convert to utf-8:
stringnamehere.decode('utf-8', 'ignore')
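For comparison, here is a short sketch (written for Python 3, with a made-up input value) of what the different error handlers do with a byte that is not valid UTF-8.

```python
raw = b'caf\xe9 ok'  # \xe9 is Latin-1 for é, which is not valid UTF-8 here

ignored = raw.decode('utf-8', 'ignore')    # drops the undecodable byte
replaced = raw.decode('utf-8', 'replace')  # inserts U+FFFD instead
as_latin1 = raw.decode('latin-1')          # right answer if the data really is Latin-1

print(ignored, replaced, as_latin1)
```

Note that 'ignore' silently loses data, so it is usually worth checking first whether the bytes are simply in a different encoding.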
Might be a bit overkill, but when I work with ascii and unicode in same files, repeating decode can be a pain, this is what I use:
def make_unicode(inp):
    if type(inp) != unicode:
        inp = inp.decode('utf-8')
    return inp
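The helper above is Python 2 code. A possible Python 3 counterpart, offered only as a sketch (to_text is a hypothetical name, and UTF-8 is an assumed default), could look like this:

```python
def to_text(value, encoding='utf-8'):
    """Return value as str, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

print(to_text('ボールト'.encode('utf-8')))
print(to_text('already text'))
```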
Adding the following line to the top of your .py file:
# -*- coding: utf-8 -*-
allows you to encode strings directly in your script, like this:
utfstr = "ボールト"
If I understand you correctly, you have a utf-8 encoded byte-string in your code.
Converting a byte-string to a unicode string is known as decoding (unicode -> byte-string is encoding).
You do that by using the unicode function or the decode method. Either:
unicodestr = unicode(bytestr, encoding)
unicodestr = unicode(bytestr, "utf-8")
Or:
unicodestr = bytestr.decode(encoding)
unicodestr = bytestr.decode("utf-8")
city = 'Ribeir\xc3\xa3o Preto'
print city.decode('cp1252').encode('utf-8')
Python 3 (including 3.6) does not have a built-in unicode() function.
Strings are already stored as unicode by default and no conversion is required. Example:
my_str = "\u221a25"
print(my_str)
>>> √25
Translate with ord() and unichr().
Every Unicode character has an associated number, something like an index, so Python has a few functions to translate between a character and its number. Below is an example with ñ. Hope it helps.
>>> C = 'ñ'
>>> U = C.decode('utf8')
>>> U
u'\xf1'
>>> ord(U)
241
>>> unichr(241)
u'\xf1'
>>> print unichr(241).encode('utf8')
ñ
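The session above is Python 2. In Python 3 the same idea can be sketched with chr() in place of unichr(), since str is already Unicode:

```python
char = 'ñ'
code_point = ord(char)             # the number associated with the character
same_char = chr(code_point)        # chr() replaces Python 2's unichr()
utf8_bytes = char.encode('utf-8')  # the bytes used to store it in UTF-8

print(code_point, same_char, utf8_bytes)
```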
First, str in Python is represented in Unicode.
Second, UTF-8 is an encoding standard to encode Unicode string to bytes. There are many encoding standards out there (e.g. UTF-16, ASCII, SHIFT-JIS, etc.).
When the client sends data to your server and they are using UTF-8, they are sending a bunch of bytes not str.
You received a str because the "library" or "framework" that you are using, has implicitly converted some random bytes to str.
Under the hood, there is just a bunch of bytes. You just need to ask the "library" to give you the request content as bytes and handle the decoding yourself (if the library can't give you that, it is trying to do black magic and you shouldn't use it).
Decode UTF-8 encoded bytes to str: bs.decode('utf-8')
Encode str to UTF-8 bytes: s.encode('utf-8')
The url is translated to ASCII and to the Python server it is just a Unicode string, eg.:
"T%C3%A9st%C3%A3o"
Python understands "é" and "ã" as actual %C3%A9 and %C3%A3.
You can decode a URL just like this:
import urllib.parse
url = "T%C3%A9st%C3%A3o"
print(urllib.parse.unquote(url))
>> Téstão
See https://www.adamsmith.haus/python/answers/how-to-decode-a-utf-8-url-in-python for details.
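As a sketch of the full round trip in Python 3, using only functions from urllib.parse: unquote returns text, unquote_to_bytes returns the underlying UTF-8 bytes, and quote goes back to the percent-encoded form.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

encoded = "T%C3%A9st%C3%A3o"

decoded = unquote(encoded)       # percent-escapes decoded as UTF-8 text
raw = unquote_to_bytes(encoded)  # the bytes before text decoding
reencoded = quote(decoded)       # back to the percent-encoded form

print(decoded, raw, reencoded)
```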
you can also do this:
from unidecode import unidecode
unidecode(yourStringtoDecode)
You can use python's standard library codecs module.
import codecs
codecs.decode(b'Decode me', 'utf-8')
Yes, you can add
# -*- coding: utf-8 -*-
in your source code's first line.
You can read more details here https://www.python.org/dev/peps/pep-0263/
I am trying to use str.encode() but I get
>>> "hello".encode(hex)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: must be string, not builtin_function_or_method
I have tried a bunch of variations and they seem to all work in Python 2.5.2, so what do I need to do to get them to work in Python 3.1?
The hex codec has been chucked in 3.x. Use binascii instead:
>>> import binascii
>>> binascii.hexlify(b'hello')
b'68656c6c6f'
In Python 3.5+, encode the string to bytes and use the hex() method, returning a string.
s = "hello".encode("utf-8").hex()
s
# '68656c6c6f'
Optionally convert the string back to bytes:
b = bytes(s, "utf-8")
b
# b'68656c6c6f'
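Note that bytes(s, "utf-8") gives the hex digits themselves as bytes, not the original data. To get back to the original string, one option in Python 3 is bytes.fromhex, sketched here:

```python
hex_text = "hello".encode("utf-8").hex()

hex_as_bytes = bytes(hex_text, "utf-8")             # the hex digits as bytes
original = bytes.fromhex(hex_text).decode("utf-8")  # the data the digits describe

print(hex_text, hex_as_bytes, original)
```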
You've already got some good answers, but I thought you might be interested in a bit of the background too.
Firstly you're missing the quotes. It should be:
"hello".encode("hex")
Secondly this codec hasn't been ported to Python 3.1. See here. It seems that they haven't yet decided whether or not these codecs should be included in Python 3 or implemented in a different way.
If you look at the diff file attached to that bug you can see the proposed method of implementing it:
import binascii
output = binascii.b2a_hex(input)
The easiest way to do it in Python 3.5 and higher is:
>>> 'halo'.encode().hex()
'68616c6f'
If you type the string into the Python interpreter yourself and it contains only ASCII characters, you can do it even faster by putting b before the string:
>>> b'halo'.hex()
'68616c6f'
Equivalent in Python 2.x:
>>> 'halo'.encode('hex')
'68616c6f'
The binascii methods are easier, by the way:
>>> import binascii
>>> x=b'test'
>>> x=binascii.hexlify(x)
>>> x
b'74657374'
>>> y=str(x,'ascii')
>>> y
'74657374'
>>> x=binascii.unhexlify(x)
>>> x
b'test'
>>> y=str(x,'ascii')
>>> y
'test'
Hope it helps. :)
In Python 3, all strings are Unicode. Usually, to encode a Unicode string to bytes you use .encode('TEXT_ENCODING'). Since hex is not a text encoding, you should use codecs.encode() to handle arbitrary codecs. For example:
>>> "hello".encode('hex')
LookupError: 'hex' is not a text encoding; use codecs.encode() to handle arbitrary codecs
>>> import codecs
>>> codecs.encode(b"hello", 'hex')
b'68656c6c6f'
Again, since "hello" is Unicode, you need to mark it as a byte string before encoding to hexadecimal. This may be more in line with your original approach of using the encode method.
The differences between binascii.hexlify and codecs.encode are as follows:
binascii.hexlify
Hexadecimal representation of binary data.
The return value is a bytes object.
Type: builtin_function_or_method
codecs.encode
encode(obj, [encoding[,errors]]) -> object
Encodes obj using the codec registered for encoding. encoding defaults
to the default encoding. errors may be given to set a different error
handling scheme. Default is 'strict' meaning that encoding errors raise
a ValueError. Other possible values are 'ignore', 'replace' and
'xmlcharrefreplace' as well as any other name registered with
codecs.register_error that can handle ValueErrors.
Type: builtin_function_or_method
base64.b16encode and base64.b16decode convert bytes to and from hex and work across all Python versions. The codecs approach also works, but is less straightforward in Python 3.
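A side-by-side sketch of the three standard-library approaches mentioned in this thread, assuming Python 3. One difference worth noticing is that base64.b16encode produces uppercase hex digits.

```python
import base64
import binascii
import codecs

data = b'hello'

via_binascii = binascii.hexlify(data)
via_codecs = codecs.encode(data, 'hex')
via_base64 = base64.b16encode(data)  # uppercase digits

print(via_binascii, via_codecs, via_base64)

# Each one has a matching function to get the original bytes back.
print(binascii.unhexlify(via_binascii))
print(codecs.decode(via_codecs, 'hex'))
print(base64.b16decode(via_base64))
```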
Use hexlify - http://epydoc.sourceforge.net/stdlib/binascii-module.html
Yet another method:
s = 'hello'
h = ''.join([hex(ord(i)) for i in s])
# outputs: '0x680x650x6c0x6c0x6f'
This basically splits the string into chars, converts each one with hex(ord(char)), and joins the results back together. If you want the result without the 0x prefix, format each code point as two hex digits instead (slicing the hex() output would drop the leading zero for values below 0x10):
h = ''.join(format(ord(i), '02x') for i in s)
# outputs: '68656c6c6f'
Tested with Python 3.5.3.
When I try to get the content of a tag using "unicode(head.contents[3])", I get output similar to this: "Christensen Sk\xf6ld". I want the escape sequence to be returned as a string. How can I do this in Python?
Assuming Python sees the name as a normal string, you'll first have to decode it to unicode:
>>> name
'Christensen Sk\xf6ld'
>>> unicode(name, 'latin-1')
u'Christensen Sk\xf6ld'
Another way of achieving this:
>>> name.decode('latin-1')
u'Christensen Sk\xf6ld'
Note the "u" in front of the string, signalling that it is Unicode. If you print this, the accented letter is shown properly:
>>> print name.decode('latin-1')
Christensen Sköld
BTW: when necessary, you can use the "encode" method to turn the Unicode string into e.g. a UTF-8 string:
>>> name.decode('latin-1').encode('utf-8')
'Christensen Sk\xc3\xb6ld'
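The session above is Python 2. A sketch of the same steps in Python 3, where the raw name has to be a bytes object to begin with:

```python
name = b'Christensen Sk\xf6ld'  # Latin-1 encoded bytes

text = name.decode('latin-1')   # bytes -> str
utf8 = text.encode('utf-8')     # str -> UTF-8 bytes

print(text)
print(utf8)
```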
I suspect that it's actually working correctly. When you just evaluate a string at the prompt, Python shows its repr with non-ASCII characters escaped, since not all terminals support Unicode. If you actually print the string, though, it should work. See the following example:
>>> u'\xcfa'
u'\xcfa'
>>> print u'\xcfa'
Ïa
Given a byte string with Unicode escapes b"\N{SNOWMAN}", b"\N{SNOWMAN}".decode('unicode-escape') will produce the expected Unicode string u'\u2603'.