What does 'u' before a string in python mean? [duplicate] - python

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
What does the ‘u’ symbol mean in front of string values?
I have this json text file:
{
"EASW_KEY":" aaaaaaaaaaaaaaaaaaaaaaa",
"EASF_SESS":" aaaaaaaaaaaaaaaaaaaaaaa",
"PHISHKEY":" aaaaaaaaaaaaaaaaaaaaaaa",
"XSID":" aaaaaaaaaaaaaaaaaaaaaaa"
}
If I decode it like this:
import json
data = open('data.txt').read()
data = json.loads(data)
print data['EASW_KEY']
I get this:
u' aaaaaaaaaaaaaaaaaaaaaaa'
instead of simply:
aaaaaaaaaaaaaaaaaaaaaaa
What does this u mean?

What does 'u' before a string in python mean?
The character prefixing a Python string literal is called a string prefix. You can read all about them here: https://docs.python.org/3.3/reference/lexical_analysis.html#string-and-bytes-literals
The u stands for unicode. Other prefixes that can appear in that slot include r'foobar' for a raw string and b'foobar' for a byte string.
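As a rough sketch of how these three prefixes behave, here is a Python 3 version (3.3 or later, where the u prefix is accepted and means the same as no prefix). The variable names are only illustrative:

```python
# Python 3 sketch: the u, r and b prefixes on string literals.
text = u'abc'   # identical to 'abc' in Python 3
raw = r'\n'     # raw string: the backslash is kept literally
data = b'abc'   # bytes literal

print(type(text).__name__)  # str
print(len(raw))             # 2 (a backslash and the letter n)
print(type(data).__name__)  # bytes
```

In Python 2 the first line would report unicode instead of str, which is where the u in the output comes from.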
Some help on what Python considers unicode:
http://docs.python.org/howto/unicode.html
Then to explore the nature of this, run this command:
type(u'abc')
returns:
<type 'unicode'>
If you use Unicode when you should be using ASCII, or ASCII when you should be using Unicode, you are very likely to run into encoding bugs that only bubble up at runtime, once users start "exploring the unicode space".
For example, if you pass a unicode string to facebook's api.fql(...) function, which says it accepts unicode, it will happily process your request, and then later on return undefined results as if the function succeeded as it pukes all over the carpet.
As defined in this post:
FQL multiquery from python fails with unicode query
Some unicode characters are going to cause your code in production to have a seizure then puke all over the floor. So make sure you're okay and customers can tolerate that.

The u prefix in the repr of a unicode object indicates that it's a unicode object.
You should not be seeing this if you're really just printing the unicode object itself and not, say, a list containing it. (In fact, running your code doesn't actually print it like this.)
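The difference between printing a string and printing its repr can be sketched as follows. This is a Python 3 version with the JSON inlined instead of read from data.txt, so no u appears; under Python 2 the two repr lines would carry the u prefix:

```python
import json

raw = '{"EASW_KEY": " aaaaaaaaaaaaaaaaaaaaaaa"}'
data = json.loads(raw)
value = data["EASW_KEY"]

print(value)        # the text itself, no quotes or prefix
print(repr(value))  # the repr: quoted (and u-prefixed in Python 2)
print([value])      # containers show the repr of each element
```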

Related

Mojibake when reading JSON containing escaped unicode - wrongly decoded as Latin-1? [duplicate]

This question already has answers here:
Facebook JSON badly encoded
(9 answers)
Closed 3 years ago.
I have a JSON file that contains \u escaped unicode characters, however when I read this in Python, the escaped characters are seemingly incorrectly decoded as Latin-1 rather than UTF-8. Calling .encode('latin-1').decode('utf-8') on the affected strings seems to fix this, but why is it happening, and is there a way to specify to json.load that escape sequences should be read as unicode rather than Latin-1?
JSON file message.json, which should contain a message composed of a "Grinning Face With Sweat" emoji:
{
"message": "\u00f0\u009f\u0098\u0085"
}
Python:
>>> import json
>>> with open('message.json') as infile:
...     msg_json = json.load(infile)
...
>>> msg_json
{'message': 'ð\x9f\x98\x85'}
>>> msg_json['message']
'ð\x9f\x98\x85'
>>> msg_json['message'].encode('latin-1').decode('utf-8')
'😅'
Setting the encoding parameter in open or json.load doesn't seem to change anything, as the JSON file is plain ASCII, and the unicode is escaped within it.
What you have there is not the correct notation for the 😅 emoji; it really means "ð" followed by three C1 control characters (U+009F, U+0098, U+0085), so the translation you get is correct! (The \u... notation is independent of encoding.)
The proper notation for 😅, Unicode U+1F605, in JSON (as in JavaScript) is the UTF-16 surrogate pair \ud83d\ude05. Use that in the JSON.
{
"message": "\ud83d\ude05"
}
If, on the other hand, your question is how you can get the correct results from the wrong data, then yes, as the comments say, you may have to jump through some hoops to do that.
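Both cases can be checked with a short Python 3 sketch. The JSON text is inlined here instead of being read from message.json, and the variable names are only illustrative:

```python
import json

# Correct JSON: U+1F605 written as a UTF-16 surrogate pair.
good = json.loads(r'{"message": "\ud83d\ude05"}')["message"]

# The JSON from the question: each UTF-8 byte escaped as its own code point.
bad = json.loads(r'{"message": "\u00f0\u009f\u0098\u0085"}')["message"]

# Workaround for the wrong data: Latin-1 maps code points 0-255 back to
# the same byte values, and those bytes are then decoded as UTF-8.
repaired = bad.encode("latin-1").decode("utf-8")

print(len(good), len(bad))  # 1 4
print(repaired == good)     # True
```

The workaround only applies when every character in the string is below U+0100; a string that mixes correctly escaped characters with the byte-wise ones would fail at the encode step.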

Python - Unicode & double backslashes [duplicate]

This question already has answers here:
Process escape sequences in a string in Python
(8 answers)
Closed 7 months ago.
I scraped a webpage with BeautifulSoup.
I got great output except parts of the list look like this after getting the text:
list = [u'that\\u2019s', u'it\\u2019ll', u'It\\u2019s', u'don\\u2019t', u'That\\u2019s', u'we\\u2019re', u'\\u2013']
My question now is how to get rid of or replace these double backslashes with the special characters they are.
If I print the first element of the example list, the output looks like
print list[0]
that\u2019s
I have already read a lot of other questions / threads about this topic, but I ended up being even more confused, as I am a beginner concerning unicode / encoding / decoding.
I hope that someone could help me with this issue.
Thanks!
MG
Since you are using Python 2 there, it is simply a matter of re-applying the "decode" method - using the special codec "unicode_escape". It "sees" the "physical" backslashes and decodes those sequences into proper unicode characters:
data = [u'that\\u2019s', u'it\\u2019ll', u'It\\u2019s', u'don\\u2019t', u'That\\u2019s', u'we\\u2019re', u'\\u2013']
result = [part.decode('unicode_escape') for part in data]
To anyone getting here using Python 3: in that version one cannot apply the "decode" method to the str objects delivered by BeautifulSoup - one has to first re-encode those to byte-string objects, and then decode with the unicode_escape codec. For these purposes it is useful to make use of the latin1 codec as a transparent encoding: every character up to U+00FF in the str object is preserved as the same byte value in the new bytes object (characters above U+00FF cannot be encoded this way and raise UnicodeEncodeError):
result = [part.encode('latin1').decode('unicode_escape') for part in data]
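A self-contained Python 3 sketch of that round trip, using a shortened copy of the list from the question. The unescape helper is hypothetical and not part of any library; it swaps latin1 for the ascii codec with the backslashreplace error handler, on the assumption that the input may also contain real characters above U+00FF:

```python
data = ['that\\u2019s', 'it\\u2019ll', 'It\\u2019s', '\\u2013']

# Approach from the answer: Latin-1 maps code points 0-255 to the same
# byte values, so the literal backslash sequences survive the round trip.
result = [part.encode('latin1').decode('unicode_escape') for part in data]
print(result)


def unescape(text):
    """Hypothetical helper: also copes with characters above U+00FF."""
    # backslashreplace turns every non-ASCII character into an escape
    # sequence, which unicode_escape then decodes along with the rest.
    return text.encode('ascii', 'backslashreplace').decode('unicode_escape')


print(unescape('\u20ac5 that\\u2019s'))
```

Either way, any genuine backslash in the text is also interpreted as the start of an escape sequence, so this is only suitable when the backslashes are known to be escape debris.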
The problem here is that the site ended up double-escaping those unicode characters. Just do the following:
ls = [u'that\\u2019s', u'it\\u2019ll', u'It\\u2019s', u'don\\u2019t', u'That\\u2019s', u'we\\u2019re', u'\\u2013']
ls = map(lambda x: x.decode('unicode-escape'), ls)
Now you have a list of properly decoded unicode strings:
for a in ls:
    print a

How does Python's "print" function work?

I'm interested in how Python's print function determines the string encoding, and how it handles it.
For example I've got the string:
str1 = u'\u041e\u0431\u044a\u0435\u043c'
print(str1)  # Will be converted to Объем
What is going on under the hood of python?
Update
I'm interested in CPython 2.7 implementation of python
It uses the encoding in sys.stdout.encoding, which comes from the environment it's running in.
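As a rough illustration of that step, here is a Python 3 sketch (the question is about CPython 2.7, where printing a unicode object similarly encodes it with sys.stdout.encoding). The fallback to UTF-8 and the variable names are assumptions made for this example:

```python
import sys

str1 = '\u041e\u0431\u044a\u0435\u043c'

# The encoding print() uses for this process; it can be missing or None
# when stdout is not a regular text stream, so fall back to UTF-8 here.
encoding = getattr(sys.stdout, 'encoding', None) or 'utf-8'

# Roughly what reaches the terminal: the text encoded to bytes.
payload = str1.encode(encoding, errors='replace')

print(encoding)
print(payload)
print(str1.encode('utf-8'))
```

Whether the terminal then shows Объем depends on the terminal interpreting those bytes with the same encoding.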
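The u in front of the string makes a difference. The 'u' in front of a string value means the string is a unicode object rather than a byte string. It is a way to represent more characters than plain ASCII can manage.
In Python 3 the default encoding for source code is UTF-8, so you can simply include a Unicode character in a string literal. In Python 2.7 the default source encoding is ASCII, so a file containing non-ASCII literals needs an encoding declaration such as # -*- coding: utf-8 -*- at the top.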
More info here

Python get ASCII characters [duplicate]

This question already has answers here:
What is the best way to remove accents (normalize) in a Python unicode string?
(13 answers)
Closed 8 years ago.
I'm retrieving data from the internet and I want to convert it to ASCII. But I can't get it fixed. (Python 2.7)
When I use decode('utf-8') on the strings I get for example Yalçınkaya. I want this however converted to Yalcinkaya. Raw data was Yalçınkaya.
Anyone who can help me?
Thanks.
Edit: I've tried the suggestion that was made by the user who marked this question as duplicate (What is the best way to remove accents in a Python unicode string?) but that did not solve my problem.
That post mainly talks about removing the special characters, and that did not solve my problem of replacing the Turkish characters (Yalçınkaya) with their ASCII equivalents (Yalcinkaya).
import unicodedata
import HTMLParser  # Python 2 module

# Printing the raw string in Python results in "Yalçınkaya".
# Decoding the string from UTF-8 to unicode changes it to 'Yalçınkaya'.
# HTMLParser is used to revert HTML-escaped special characters such as commas.
# NFKD normalization is used, which converts the string to 'Yalçınkaya'.
# Applying ASCII encoding results in 'Yalcnkaya', missing the original Turkish 'i', which is not what I wanted.
name = unicodedata.normalize('NFKD', unicode(name, 'utf8'))
name = HTMLParser.HTMLParser().unescape(name)
name = unicodedata.normalize('NFKD', u'%s' % name).encode('ascii', 'ignore')
Let's check - first, one really needs to understand what character encodings and Unicode are. That is dead serious. I'd suggest you read http://www.joelonsoftware.com/articles/Unicode.html before continuing any further in your project. (By the way, "converting to ASCII" is not a generally useful solution - it's more like breakage. Think about trying to parse numbers, but since you don't understand the digit "9", you decide just to skip it).
That said - you can tell Python to "decode" a string and replace any bytes that are not valid in the chosen encoding with the proper "unknown" character (u"\ufffd") - you can then replace that character yourself before re-encoding to your preferred output: raw_data.decode("ASCII", errors="replace"). If you choose to break your parsing even further, you can use "ignore" instead of "replace": the unknown characters will just be suppressed. Remember that you get a "Unicode" object after decoding - you have to apply the "encode" method to it before outputting that data anywhere (printing, recording to a file, etc.) - please read the article linked above.
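A minimal Python 3 sketch of those two error handlers, with raw_data built locally as a stand-in for bytes received from the network:

```python
# Stand-in for UTF-8 encoded bytes received from the network.
raw_data = 'Yalçınkaya'.encode('utf-8')

# Each byte outside the ASCII range becomes U+FFFD ...
replaced = raw_data.decode('ascii', errors='replace')
# ... or is dropped entirely.
ignored = raw_data.decode('ascii', errors='ignore')

print(replaced.replace('\ufffd', '?'))  # Yal????nkaya
print(ignored)                          # Yalnkaya
```

Both results lose information: the two accented letters occupy four bytes in UTF-8, so four replacement characters appear and neither letter can be recovered afterwards.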
Now - checking your specific data - the particular Yalçınkaya is exactly raw UTF-8 text that looks as though it were encoded in latin-1. Just decode it from utf-8 as usual, and then use the recipe above to strip the accents - but be advised that this only works for Latin letters with diacritics, and "World text" from the Internet may contain all kinds of characters - you should not be relying on stuff being convertible to ASCII. I have to say again: read that article, and rethink your practices.
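The missing "i" in the asker's result has a specific cause: the Turkish dotless ı (U+0131) is a letter of its own with no Unicode decomposition, so NFKD leaves it untouched and the ASCII step drops it. A Python 3 sketch, in which strip_accents and the EXTRA table are hypothetical names and the mapping of ı to i is a choice made only for this example:

```python
import unicodedata

# Stand-in for the raw UTF-8 bytes, decoded as the answer suggests.
name = 'Yalçınkaya'.encode('utf-8').decode('utf-8')


def strip_accents(text):
    # Split letters into base letter + combining mark, then drop
    # everything that is not ASCII (including the combining marks).
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


# Hypothetical mapping for letters that have no decomposition.
EXTRA = str.maketrans({'ı': 'i'})

print(strip_accents(name))                   # Yalcnkaya
print(strip_accents(name.translate(EXTRA)))  # Yalcinkaya
```

Any such table has to be extended by hand for each script it should cover, which is the answer's point about not relying on text being convertible to ASCII.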

Is there any possible way to display accented characters in Python interpreter?

I am trying to make a random wiki page generator which asks the user whether or not they want to access a random wiki page. However, some of these pages have accented characters and I would like to display them in git bash when I run the code. I am using the cmd module to allow for user input. Right now, the way I display titles is using
r_site = requests.get("http://en.wikipedia.org/w/api.php?action=query&list=random&rnnamespace=0&rnlimit=10&format=json")
print(json.loads(r_site.text)["query"]["random"][0]["title"].encode("utf-8"))
At times it works, but whenever an accented character appears it shows up like 25\xe2\x80\x9399.
Any workarounds or alternatives? Thanks.
import sys
Change your encode call to .encode(sys.stdout.encoding, errors="some string")
where "some string" can be one of the following:
'strict' (the default) - raises a UnicodeError when an unprintable character is encountered
'ignore' - don't print the unencodable characters
'replace' - replace the unencodable characters with a ?
'xmlcharrefreplace' - replace unencodable characters with xml escape sequence
'backslashreplace' - replace unencodable characters with escaped unicode code point value
So no, there is no way to get the character to show up if the locale of your terminal doesn't support it. But these options let you choose what to do instead.
Check here for more reference.
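The handlers listed above can be compared with a short Python 3 sketch. It uses ASCII as a stand-in for a terminal encoding that cannot represent the character, and a title containing an en dash like the one in the question:

```python
title = '25\u201399'  # contains an en dash (U+2013)

for handler in ('ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace'):
    print(handler, title.encode('ascii', errors=handler))

# 'strict' is the default and raises instead of substituting anything.
try:
    title.encode('ascii', errors='strict')
except UnicodeEncodeError as exc:
    print('strict raised', type(exc).__name__)
```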
I assume this is Python 3.x, given that you're writing 3.x-style print function calls.
In Python 3.x, printing any object calls str on that object, then encodes it to sys.stdout.encoding for printing.
So, if you pass it a Unicode string, it just works (assuming your terminal can handle Unicode, and Python has correctly guessed sys.stdout.encoding):
>>> print('abcé')
abcé
But if you pass it a bytes object, like the one you got back from calling .encode('utf-8'), the str function formats it like this:
>>> print('abcé'.encode('utf-8'))
b'abc\xc3\xa9'
Why? Because a bytes object isn't a string, and that's how bytes objects get printed: the b prefix, the quotes, and the backslash escapes for every non-printable-ASCII byte.
The solution is just to not call encode('utf-8').
Most likely your confusion is that you read some code for Python 2.x, where bytes and str are the same type, and the type that print actually wants, and tried to use it in Python 3.x.
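A small Python 3 sketch of the difference, using a title with an en dash like the one in the question (the variable names are illustrative):

```python
title = '25\u201399'             # str, as returned by json.loads
encoded = title.encode('utf-8')  # bytes, what the question passes to print

print(type(title).__name__, type(encoded).__name__)  # str bytes
print(repr(encoded))                     # the form print(encoded) shows
print(encoded.decode('utf-8') == title)  # True
```

Passing title itself to print leaves the encoding step to Python, which is what the answer recommends.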
