print python dictionary with utf8 values - python

I have a dictionary with UTF-8 string values. I need to print it without any \xd1, \u0441, or u'string' escapes.
# -*- coding: utf-8 -*-
a = u'lang=русский'
# prints: lang=русский
print(a)
mydict = {}
mydict['string'] = a
mydict2 = repr(mydict).decode("unicode-escape")
# prints: {'string': u'lang=русский'}
print mydict2
Expected output:
{'string': 'lang=русский'}
Is it possible without parsing the dictionary?
This question is related to Python print unicode strings in arrays as characters, not code points, but I also need to get rid of that annoying u prefix.

I can't see a reasonable use case for this, but if you want a custom representation of a dictionary (or better said, a custom representation of a unicode object within a dictionary), you can roll it yourself:
def repr_dict(d):
    return '{%s}' % ',\n'.join("'%s': '%s'" % pair for pair in d.iteritems())
and then
print repr_dict({u'string': u'lang=русский'})
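For comparison, on Python 3 this whole problem disappears: there is only one string type, and repr() keeps printable non-ASCII characters as-is, so printing the dict directly gives the desired output (a quick sketch, assuming a UTF-8-capable terminal):

```python
# Python 3: repr() of a str leaves printable non-ASCII characters intact,
# so the dict prints without escapes or a u prefix.
mydict = {'string': 'lang=русский'}
print(mydict)  # {'string': 'lang=русский'}
```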

Related

How to output a utf-8 string list as it is in python?

Well, character encoding and decoding sometimes frustrate me a lot.
So we know u'\u4f60\u597d' is the utf-8 encoding of 你好,
>>> print hellolist
[u'\u4f60\u597d']
>>> print hellolist[0]
你好
Now what I really want to get from the output or write to a file is [u'你好'], but it's [u'\u4f60\u597d'] all the time, so how do you do it?
When you print a list (or write it to a file), str() is called on the list, but the list in turn calls repr() on each of its elements. repr() returns the escaped unicode representation you are seeing.
Example of repr -
>>> h = u'\u4f60\u597d'
>>> print h
你好
>>> print repr(h)
u'\u4f60\u597d'
You would need to take the elements of the list manually and print them individually for them to display correctly.
Example -
>>> h1 = [h,u'\u4f77\u587f']
>>> print u'[' + u','.join([u"'" + unicode(i) + u"'" for i in h1]) + u']'
For lists containing sublists that may have unicode characters, you would need a recursive function. Example -
>>> h1 = [h, (u'\u4f77\u587f',)]
>>> def listprinter(l):
...     if isinstance(l, list):
...         return u'[' + u','.join([listprinter(i) for i in l]) + u']'
...     elif isinstance(l, tuple):
...         return u'(' + u','.join([listprinter(i) for i in l]) + u')'
...     elif isinstance(l, (str, unicode)):
...         return u"'" + unicode(l) + u"'"
...
>>> print listprinter(h1)
To save them to a file, use the same list comprehension or recursive function, encoding the result before writing. Example -
with open('<filename>', 'w') as f:
    f.write(listprinter(l).encode('utf-8'))
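The str/repr split described above has a direct Python 3 analogue worth knowing: repr() keeps printable non-ASCII, while ascii() produces the escaped form that Python 2's repr gave. A small illustration:

```python
# Python 3 equivalents of the str()/repr() distinction above.
h = '你好'
print(str(h))    # 你好
print(repr(h))   # '你好'  (printable non-ASCII survives in Python 3 reprs)
print(ascii(h))  # '\u4f60\u597d'  (the escaped form Python 2's repr produced)
```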
You are misunderstanding.
u'' in python is not utf-8, it is simply Unicode (except on Windows in Python <= 3.2, where it is utf-16 instead).
utf-8 is an encoding of Unicode, which is necessarily a sequence of bytes.
Additionally, u'你' and u'\u4f60' are exactly the same thing. It's simply that in Python2 the repr of high characters uses escapes instead of raw values.
Since Python2 is heading for EOL very soon now, you should start to think seriously about switching to Python3. It is a lot easier to keep track of all this in Python3 since there's only one string type and it's much more clear when you .encode and .decode.
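To make that concrete: on Python 3 the original problem simply does not occur, because container reprs show printable non-ASCII characters directly instead of escape codes:

```python
# Python 3: the same code points, but the list repr shows the characters.
hellolist = ['\u4f60\u597d']   # identical to u'你好' in Python 2 terms
print(hellolist)               # ['你好']
print(hellolist[0])            # 你好
```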
with open("some_file.txt", "wb") as f:
    f.write(hellolist[0].encode("utf8"))
I think this will resolve your issue.
most text editors use utf8 encoding :)
While the other answers are correct, none of them actually resolves your issue:
>>> u'\u4f60\u597d'.encode("utf8")
'\xe4\xbd\xa0\xe5\xa5\xbd'
If you want the brackets:
>>> u'[u\u4f60\u597d,]'.encode("utf8")
One thing is the unicode character itself:
hellolist = u'\u4f60'
and another is how you can represent it.
You can represent it in many many ways depending on where you are going to display.
Web: UTF-8
Database: maybe UTF-16 or UTF-8
Web in Japan: EUC-JP or Shift JIS
For example 本
http://unicode.org/cgi-bin/GetUnihanData.pl?codepoint=672c
http://www.fileformat.info/info/unicode/char/672c/index.htm
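A sketch of that point in Python 3: the same character 本 (U+672C) becomes a different byte sequence under each encoding, and only bytes ever leave your program:

```python
# One code point, several byte representations (Python 3).
ch = '\u672c'                    # 本
print(ch.encode('utf-8'))        # b'\xe6\x9c\xac'
print(ch.encode('utf-16-le'))    # b',g'  (the two bytes 0x2C 0x67)

# Legacy Japanese encodings mentioned above; each round-trips cleanly.
for enc in ('euc-jp', 'shift_jis'):
    data = ch.encode(enc)
    assert data.decode(enc) == ch
```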

Concatenating all strings in a module

I have prepared the following file (unicode_strings.py) with some unicode strings that I want to use for testing:
# -*- coding: utf-8 -*-
# Refer to http://ergoemacs.org/emacs/unicode.txt
GREEK = u'ΑΒΓΔ ΕΖΗΘ ΙΚΛΜ ΝΞΟΠ ΡΣΤΥ ΦΧΨΩ αβγδ εζηθ ικλμ νξοπ ρςτυ φχψω'
ACCENTS = u'àáâãäåæç èéêë ìíîï ðñòóôõö øùúûüýþÿ ÀÁÂÃÄÅ Ç ÈÉÊË ÌÍÎÏ ÐÑ ÒÓÔÕÖ ØÙÚÛÜÝÞß'
CURRENCY = u'¤ $ ¢ € ₠ £ ¥'
...
So in my test file I can do:
from unicode_strings import GREEK

def test1():
    print GREEK
Now I want to implement a test_all:
def test_all():
    print ALL_UNICODE
How can I define ALL_UNICODE so that it is a concatenation of all strings (all variables) defined in unicode_strings.py. I do not want to define this manually, obviously.
If all your variables are uppercase names, and you didn't import any other such strings from elsewhere, you could use:
_uppercase = [k for k in dir() if k.isupper()]
ALL_UNICODE = ' '.join(map(globals().get, _uppercase))
This will concatenate all unicode strings bound to uppercase names in the current module global namespace.
I switched to using dir() here as that's a little less verbose than looping over list(globals()); you cannot loop with a list comprehension over globals() itself as list comprehension variables are injected into the global namespace during the loop, changing the size of the globals() dictionary during iteration.
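The same idea can be sketched on Python 3 with a plain dict standing in for the module's globals() (the names here are illustrative, not from the original module):

```python
# Collect every value bound to an uppercase name, then join the values.
namespace = {
    'GREEK': 'αβγδ',
    'CURRENCY': '€ £ ¥',
    '_helper': 'skipped',   # not uppercase, so excluded
}
_uppercase = sorted(k for k in namespace if k.isupper())
ALL_UNICODE = ' '.join(namespace[k] for k in _uppercase)
print(ALL_UNICODE)  # € £ ¥ αβγδ
```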
This should work:
ALL_UNICODE = ' '.join(getattr(unicode_strings, item) for item in dir(unicode_strings) if not item.startswith("__"))
Note the getattr() call: joining the results of dir() directly would concatenate the variable names rather than the strings they refer to.

Python: re.split() display cyrillic result

I'm trying to write a function that simply splits a string on any symbol that is not a letter or a number. But I need it to work with Cyrillic, and when I do that I get an output list with elements like '\xd0' instead of the non-Latin words.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

class Syntax():
    def __init__(self, string):
        self.string = string.encode('utf-8')
        self.list = None

    def split(self):
        self.list = re.split(ur"\W+", self.string, flags=re.U)

if __name__ == '__main__':
    string = ur"Привет, мой друг test words."
    a = Syntax(string)
    a.split()
    print a.string, a.list
Console output:
Привет, мой друг test words.
['\xd0', '\xd1', '\xd0', '\xd0\xb2\xd0\xb5\xd1', '\xd0\xbc\xd0\xbe\xd0\xb9', '\xd0', '\xd1', '\xd1', '\xd0\xb3', 'test', 'words', '']
Thanks for your help.
There are two problems here:
You're coercing unicode to string in your Syntax constructor. In general you should leave text values as unicode. (self.string = string, no encoding).
When you print a Python list it's calling repr on the elements, causing the unicode to be coerced to those values. If you do
for x in a.list:
    print x
after making the first change, it'll print Cyrillic.
Edit: printing a list calls repr on its elements, not str. Printing a string directly, however, does not repr it: print x and print repr(x) yield different output. For strings, the repr is always something you can evaluate in Python to recover the value.
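For what it's worth, on Python 3 this whole class of problem disappears: re.split on a str is Unicode-aware by default, so \W+ handles Cyrillic with no encoding step or flags:

```python
import re

# Python 3: \W matches Unicode non-word characters on str by default,
# so Cyrillic words split cleanly with no encode() or re.U needed.
s = "Привет, мой друг test words."
print(re.split(r"\W+", s))
# ['Привет', 'мой', 'друг', 'test', 'words', '']  (trailing '' from the final '.')
```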

How can I compare a unicode type to a string in python?

I am trying to use a list comprehension that compares string objects, but one of the strings is utf-8, the byproduct of json.loads. Scenario:
us = u'MyString' # is the utf-8 string
Part one of my question, is why does this return False? :
us.encode('utf-8') == "MyString" ## False
Part two - how can I compare within a list comprehension?
myComp = [utfString for utfString in jsonLoadsObj
          if utfString.encode('utf-8') == "MyString"]  # wrapped to read on S.O.
EDIT: I'm using Google App Engine, which uses Python 2.7
Here's a more complete example of the problem:
#json coming from remote server:
#response object looks like: {"number1":"first", "number2":"second"}
data = json.loads(response)
k = data.keys()
I need something like:
myList = [item for item in k if item=="number1"]
#### I thought this would work:
myList = [item for item in k if item.encode('utf-8')=="number1"]
You must be looping over the wrong data set; just loop directly over the JSON-loaded dictionary. There is no need to call .keys() first:
data = json.loads(response)
myList = [item for item in data if item == "number1"]
You may want to use u"number1" to avoid implicit conversions between Unicode and byte strings:
data = json.loads(response)
myList = [item for item in data if item == u"number1"]
Both versions work fine:
>>> import json
>>> data = json.loads('{"number1":"first", "number2":"second"}')
>>> [item for item in data if item == "number1"]
[u'number1']
>>> [item for item in data if item == u"number1"]
[u'number1']
Note that in your first example, us is not a UTF-8 string; it is unicode data, the json library has already decoded it for you. A UTF-8 string, on the other hand, is a sequence of encoded bytes. You may want to read up on Unicode and Python to understand the difference:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
On Python 2, your expectation that your test returns True would be correct, you are doing something else wrong:
>>> us = u'MyString'
>>> us
u'MyString'
>>> type(us)
<type 'unicode'>
>>> us.encode('utf8') == 'MyString'
True
>>> type(us.encode('utf8'))
<type 'str'>
There is no need to encode the strings to UTF-8 to make comparisons; use unicode literals instead:
myComp = [elem for elem in json_data if elem == u"MyString"]
You are trying to compare a string of bytes ('MyString') with a string of Unicode code points (u'MyString'). This is an "apples and oranges" comparison. Unfortunately, Python 2 pretends in some cases that this comparison is valid, instead of always returning False:
>>> u'MyString' == 'MyString' # in my opinion should be False
True
It's up to you as the designer/developer to decide what the correct comparison should be. Here is one possible way:
a = u'MyString'
b = 'MyString'
a.encode('UTF-8') == b # True
I recommend the above instead of a == b.decode('UTF-8') because all u'' style strings can be encoded into bytes with UTF-8, except possibly in some bizarre cases, but not all byte-strings can be decoded to Unicode that way.
But if you choose to do a UTF-8 encode of the Unicode strings before comparing, that will fail for something like this on a Windows system: u'Em dashes\u2014are cool'.encode('UTF-8') == 'Em dashes\x97are cool'. But if you .encode('Windows-1252') instead it would succeed. That's why it's an apples and oranges comparison.
I'm assuming you're using Python 3. us.encode('utf-8') == "MyString" returns False because the str.encode() function is returning a bytes object:
In [2]: us.encode('utf-8')
Out[2]: b'MyString'
In Python 3, strings are already Unicode, so the u'MyString' is superfluous.
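A quick Python 3 sketch of that str/bytes boundary, showing why the comparison fails and how to make it succeed:

```python
# Python 3: str and bytes never compare equal; decode before comparing.
us = 'MyString'                    # already Unicode in Python 3
encoded = us.encode('utf-8')       # b'MyString', a bytes object
print(encoded == 'MyString')       # False: bytes vs str
print(encoded.decode('utf-8') == 'MyString')  # True
```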

How to print container object with unicode-containing values?

The following code
# -*- coding: utf-8 -*-
x = (u'abc/αβγ',)
print x
print x[0]
print unicode(x).encode('utf-8')
print x[0].encode('utf-8')
...produces:
(u'abc/\u03b1\u03b2\u03b3',)
abc/αβγ
(u'abc/\u03b1\u03b2\u03b3',)
abc/αβγ
Is there any way to get Python to print
('abc/αβγ',)
that does not require me to build the string representation of the tuple myself? (By this I mean stringing together the "(", "'", encoded value, "'", ",", and ")".)
BTW, I'm using Python 2.7.1.
Thanks!
You could decode the str representation of your tuple with 'raw_unicode_escape'.
In [25]: print str(x).decode('raw_unicode_escape')
(u'abc/αβγ',)
I don't think so - the tuple's __repr__() is built-in, and AFAIK will just call the __repr__ for each tuple item. In the case of unicode chars, you'll get the escape sequences.
(Unless Gandaro's solution works for you - I couldn't get it to work in a plain python shell, but that could be either my locale settings, or that it's something special in ipython.)
The following should be a good start:
>>> x = (u'abc/αβγ',)
>>> S = type('S', (unicode,), {'__repr__': lambda s: s.encode('utf-8')})
>>> tuple(map(S, x))
(abc/αβγ,)
The idea is to make a subclass of unicode which has a __repr__() more to your liking.
I'm still trying to figure out how best to surround the result in quotes; this works for your example:
>>> S = type('S', (unicode,), {'__repr__': lambda s: "'%s'" % s.encode('utf-8')})
>>> tuple(map(S, x))
('abc/αβγ',)
... but it will look odd if there is a single quote in the string:
>>> S("test'data")
'test'data'
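For completeness: on Python 3 the tuple prints exactly as the asker wanted out of the box, because container reprs no longer escape printable non-ASCII characters and no custom subclass is needed:

```python
# Python 3: the default tuple repr already shows the characters themselves.
x = ('abc/αβγ',)
print(x)  # ('abc/αβγ',)
```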
