How do I decode a string with utf-8? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 2 years ago.
I have a string that is already encoded with utf-8 (e.g. "No\xf0\x9f\x92\x80"). I would like to decode it so it becomes No💀. However, when I use .decode('utf-8') it says decode is not a function of a str.
The string is from a txt file that I am reading with pandas.

If the length of the string is 6, something is off: reading the file with encoding='utf8' should have decoded the UTF-8 bytes correctly. If that is really what you have, though, this fixes it:
>>> s='No\xf0\x9f\x92\x80'
>>> len(s)
6
>>> s.encode('latin1').decode('utf8')
'No💀'
If instead the string contains literal backslashes and hex digits, this works:
>>> s=r'No\xf0\x9f\x92\x80'
>>> s
'No\\xf0\\x9f\\x92\\x80'
>>> len(s)
18
>>> s.encode('latin1').decode('unicode-escape').encode('latin1').decode('utf8')
'No💀'
unicode-escape translates escape codes to Unicode code points, but it can only be used to decode a bytes object. .encode('latin1') maps Unicode code points one-to-one to their byte equivalents (which only works for U+0000 to U+00FF, of course).
The code above translates a str to bytes, decodes the escapes, converts to bytes again, and decodes correctly as UTF-8.
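The two cases above can be wrapped in small helpers. This is only a sketch: the function names fix_mojibake and fix_literal_escapes are made up for illustration, and the results are printed with ascii() so the output does not depend on the console's encoding.

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that ended up in a str as one character per byte."""
    # Latin-1 maps U+0000..U+00FF one-to-one onto bytes 0x00..0xFF,
    # so encoding with it recovers the original byte sequence.
    return text.encode("latin1").decode("utf8")


def fix_literal_escapes(text: str) -> str:
    """Handle a str holding literal backslash-x escapes instead of the characters."""
    # unicode-escape is only available when decoding bytes, hence the first encode.
    unescaped = text.encode("latin1").decode("unicode-escape")
    return fix_mojibake(unescaped)


decoded_wrong = "No\xf0\x9f\x92\x80"   # 6 characters
escaped_text = r"No\xf0\x9f\x92\x80"   # 18 characters, literal backslashes

print(len(decoded_wrong), ascii(fix_mojibake(decoded_wrong)))
print(len(escaped_text), ascii(fix_literal_escapes(escaped_text)))
```

Both helpers raise an error if the input contains characters above U+00FF, since Latin-1 cannot encode them.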

Related

Use u'string' on string stored as variable in Python [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
As a French user of Python 2.7, I'm trying to properly print strings containing accents such as "é", "è", "à", etc. in the Python console.
I already know the trick of using u before the explicit value of a string, such as:
print(u'Université')
which properly prints the last character.
Now, my question is: how can I do the same for a string that is stored as a variable?
Indeed, I know that I could do the following:
mystring = u'Université'
print(mystring)
but the problem is that the value of mystring is bound to be passed into a SQL query (using psycopg2), and therefore I can't afford to store the u inside the value of mystring.
So how could I do something like "print the unicode value of mystring"?

The u sigil is not part of the value, it's just a type indicator. To convert a string into a Unicode string, you need to know the encoding.
unicodestring = mystring.decode('utf-8') # or 'latin-1' or ... whatever
and to print it you typically (in Python 2) need to convert back to whatever the system accepts on the output filehandle:
print(unicodestring.encode('utf-8')) # or 'latin-1' or ... whatever
Python 3 clarifies (though not directly simplifies) the situation by keeping Unicode strings and (what is now called) bytes objects separate.
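For comparison, here is roughly how the same round trip looks in Python 3, where bytes and str are separate types. It is a sketch that assumes the data is UTF-8; the variable names are illustrative only.

```python
# "\u00e9" is the letter e with an acute accent, written as an escape
# so the example does not depend on the source file's encoding.
text = "Universit\u00e9"

raw_bytes = text.encode("utf-8")      # str -> bytes, e.g. before writing to a socket
decoded = raw_bytes.decode("utf-8")   # bytes -> str, the encoding must be known

print(type(raw_bytes).__name__, len(raw_bytes))  # bytes 11
print(type(decoded).__name__, len(decoded))      # str 10

# Decoding with the wrong encoding does not fail here, it silently garbles the text.
garbled = raw_bytes.decode("latin-1")
print(len(garbled))                              # 11
```

The string has 10 characters but 11 bytes because the accented letter takes two bytes in UTF-8.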

How do I escape '\x' in Python? [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 5 years ago.
I'm using pymysql to query a database that has an entry like 'name':'Te\xtCorp'; it's a name that I need to preserve. I'm sending it somewhere else with json.dumps(), and when it reaches that call it fails to escape the \x.
What's the proper way to escape the \x without double escaping everything else?

Two options here:
You escape the backslash, like:
'Te\\xtCorp'
You can use a raw string:
r'Te\xtCorp'
Both generate:
>>> 'Te\\xtCorp'
'Te\\xtCorp'
>>> r'Te\xtCorp'
'Te\\xtCorp'
Or printed:
>>> print(r'Te\xtCorp')
Te\xtCorp
Note that in order to inspect the content of the string, you should use a print(..) call, otherwise you get the repr(..)esentation of that string. For example:
>>> import json
>>> print(json.dumps(r'te\xt'))
"te\\xt"
>>> print(json.loads(json.dumps(r'te\xt')))
te\xt
As one can read in the documentation on String literals:
\xhh...: ASCII character with hex value hh...
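So \x introduces a character that is specified by its hexadecimal code. Because t is not a hexadecimal digit, \xt is not a valid escape in a normal string literal, which is why the backslash has to be escaped or the literal made raw.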
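A short sketch of the whole round trip, assuming the value coming back from the database already contains a single literal backslash (in which case no extra escaping is needed before json.dumps). The dictionary layout is illustrative, not taken from the question.

```python
import json

name = r"Te\xtCorp"         # raw literal: the backslash is kept as-is
same_name = "Te\\xtCorp"    # escaped backslash: the identical 9-character string

payload = json.dumps({"name": name})
print(payload)                        # {"name": "Te\\xtCorp"}
print(json.loads(payload)["name"])    # Te\xtCorp
```

The doubled backslash only exists in the JSON text; json.loads turns it back into a single one.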

How to unpack and decode '#\x01\x01\x00'? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 6 years ago.
I am encountering the following output and I cannot really understand it.
Could you please advise what it is exactly? How to unpack it?
'#\x01\x01\x00'
It does not look to be purely binary or hexadecimal.
I would like to see the ASCII representation of it.

You have a string of bytes; if you print it, you are seeing its ASCII representation:
In [5]: s = b'#\x01\x01\x00'
In [8]: print(list(bytearray(s)))
[35, 1, 1, 0]
If you call chr on each of the ints you will see exactly the same output: 35 in ASCII is #, 1 is SOH and 0 is NUL. Without more information, such as where the data came from, there is not much else that can be suggested.

This seems to be a sequence of four bytes with the values 35, 1, 1, 0.
To interpret it, you need to know how it was encoded or what it is supposed to represent.
Generally, you can unpack binary data in Python with the unpack function in the struct module:
import struct
intval = struct.unpack('i', b'#\x01\x01\x00')
shortvals = struct.unpack('hh', b'#\x01\x01\x00')
The first unpack call gives you the data interpreted as one 4-byte integer, which is the number 65827 on a little-endian machine. The second one interprets the data as two 2-byte integers (291 and 1). Both calls return tuples.
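Since the result depends on byte order, it can help to spell the order out in the format string. The sketch below uses the same four bytes with explicit little-endian ('<') and big-endian ('>') prefixes; which interpretation is right depends on where the data came from, which the question does not say.

```python
import struct

data = b"#\x01\x01\x00"

print(list(data))                               # [35, 1, 1, 0]

(int_le,) = struct.unpack("<i", data)           # one 4-byte int, little-endian
(int_be,) = struct.unpack(">i", data)           # one 4-byte int, big-endian
short_a, short_b = struct.unpack("<hh", data)   # two 2-byte ints, little-endian

print(int_le, int_be)      # 65827 587268352
print(short_a, short_b)    # 291 1
```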

How to read Unicode file as Unicode string in Python [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 6 years ago.
I have a file that is encoded in Unicode or UTF-8 (I don't know which). When I read the file in Python 3.4, the resulting string is interpreted as an ASCII string. How do I convert it to a Unicode string like u"text"?

The term "Unicode" refers to the standard, not to a particular encoding.
Since files in computers are binary, there exist different ways of encoding Unicode data in binary files. One of them is "UTF-8".
You can consult https://docs.python.org/3/howto/unicode.html
An example taken from this document (in the section "Reading and Writing Unicode Data"):
with open('unicode.txt', encoding='utf-8') as f:
    for line in f:
        print(repr(line))
In Python 3, unlike Python 2, every str is already a Unicode string, so string constants do not need a "u" prefix.
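To make the difference between the file's bytes and the decoded text visible, here is a self-contained sketch. It writes a small sample file to a temporary directory first; the file name and contents are made up, and it assumes the file really is UTF-8.

```python
import os
import tempfile

# Hypothetical sample file, created only so the example can run on its own.
path = os.path.join(tempfile.mkdtemp(), "unicode.txt")
with open(path, "w", encoding="utf-8", newline="\n") as f:
    f.write("Universit\u00e9\n")

# Text mode with an explicit encoding: each line is a str (Unicode).
with open(path, encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]

# Binary mode: the same file as undecoded bytes.
with open(path, "rb") as f:
    raw = f.read()

print(ascii(lines))   # ['Universit\xe9']
print(raw)            # b'Universit\xc3\xa9\n'
```

If the encoding is unknown, reading in binary mode and inspecting the bytes is one way to find out what the file actually contains before choosing a codec.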

What's the best way to convert an integer (0-255) to an escaped byte (\x00 - \xff)? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
Basically what I'm asking is, what's the most direct way to convert any integer between 0 and 255 into its hexadecimal, escaped equivalent? By that I mean one that will function correctly if wrapped in a write() call (which means '\x56' writes 'V' and not literally '\x56').

That's what the chr function is for.
f.write(chr(0x56))
The notion of a "hexadecimal escaped equivalent" isn't really relevant in this context: every character has a hexadecimal code, but when a string is displayed, characters that have a simple printable form are shown as that character rather than as an escape.
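In Python 3 the answer depends on whether the target is a text or a binary stream: chr gives a one-character str, while bytes([n]) gives a single byte. A minimal sketch, using io.BytesIO as a stand-in for a file opened in 'wb' mode:

```python
import io

value = 0x56

text_char = chr(value)       # 'V' as a one-character str, for text streams
one_byte = bytes([value])    # b'V' as a one-byte bytes object, for binary streams

buffer = io.BytesIO()        # stands in for open(some_path, 'wb')
buffer.write(one_byte)
buffer.write(bytes([0, 255]))   # the two ends of the 0-255 range

print(repr(text_char))       # 'V'
print(buffer.getvalue())     # b'V\x00\xff'
```

The escapes in the last line are only how Python displays non-printable bytes; the buffer holds exactly three bytes.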
