Setting Python encoding for printing Chinese characters [duplicate]

This question already has answers here:
Why should we NOT use sys.setdefaultencoding("utf-8") in a py script?
(4 answers)
Closed last year.
My code is below. I don't know why it can't print Chinese. Please help.
When I print more than one variable at a time, the output looks like ASCII escapes or a raw representation.
How do I fix it?
# -*- coding: utf-8 -*-
import pygoldilocks
import sys
reload(sys)
sys.setdefaultencoding('utf8')
rows = ( '已','经激活的区域语言' )
print( rows[0] )
print( rows[1] )
print( rows[0], rows[1] )
print( rows[0].encode('utf8'), rows[1].decode('utf8') )
print( rows[0], 1 )
$ python test.py
已
经激活的区域语言
('\xe5\xb7\xb2', '\xe7\xbb\x8f\xe6\xbf\x80\xe6\xb4\xbb\xe7\x9a\x84\xe5\x8c\xba\xe5\x9f\x9f\xe8\xaf\xad\xe8\xa8\x80')
('\xe5\xb7\xb2', u'\u7ecf\u6fc0\u6d3b\u7684\u533a\u57df\u8bed\u8a00')
('\xe5\xb7\xb2', 1)

All your outputs are normal. By the way, this:
reload(sys)
sys.setdefaultencoding('utf8')
is really a poor man's trick to set the Python default encoding. It is seldom really useful (IMHO it is not useful in the code shown) and should only be used when no cleaner way is possible. I have used Python 2 for decades with a non-ASCII charset (Latin1) and only used that trick in my very first scripts.
And the # -*- coding: utf-8 -*- line does not change how your strings are handled here, though Python 2 does need it to accept non-ASCII bytes in the source file at all (and it may be useful for your text editor): it only affects the content of strings when you have unicode literal strings in your script, which you do not.
Now what really happens:
You define rows as a 2-tuple of (byte) strings containing Chinese characters encoded in UTF-8. Fine.
When you print a string, the bytes are passed directly to the output system (here a terminal or screen). As it correctly processes UTF-8, it converts the UTF-8 byte representation into the proper characters. So print (rows[0]) (which is executed as print rows[0] in Python 2: (rows[0]) is not a tuple, (rows[0],) is a 1-tuple) correctly displays Chinese characters.
But when you print a tuple, Python actually prints the representation of the elements of the tuple (it would be the same for a list, set or dict). And in Python 2, the representation of a byte or unicode string escapes all non-ASCII characters in \x.. or \u.... form.
In a Python interactive session, you should see:
>>> print rows[0]
已
>>> print repr(rows[0])
'\xe5\xb7\xb2'
TL;DR: when you print containers, you actually print the representation of the elements. If you want to display the string values, use an explicit loop or a join:
print '(' + ', '.join(rows) + ')'
displays as expected:
(已, 经激活的区域语言)
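To make the difference between the string itself and its representation concrete, here is a small sketch. It is written for Python 3, where the byte strings of the question have to be produced explicitly with encode(); the helper name show_items is made up for illustration and is not part of the question's code.

```python
# -*- coding: utf-8 -*-
# Sketch only: show_items is a hypothetical helper, not code from the question.
rows = ('已', '经激活的区域语言')


def show_items(items):
    """Join the string values themselves instead of relying on the container's repr()."""
    return '(' + ', '.join(items) + ')'


# The UTF-8 bytes of the first element, which is what Python 2 stored in a plain str.
raw = rows[0].encode('utf8')

print(repr(raw))         # escaped form, like the elements of a printed Python 2 tuple
print(show_items(rows))  # the readable characters
```

The escaped bytes and the readable characters are the same data; only the way they are displayed differs.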

Your problem is that you are using Python 2, I guess. Your code
print( rows[0], rows[1] )
is evaluated as
tmp = ( rows[0], rows[1] ) # a tuple!
print tmp # Python 2 print statement!
Since the default formatting for tuples is done via repr(), you see the ASCII-escaped representation.
Solution: Upgrade to Python 3.

There are two less drastic solutions than upgrading to Python 3.
The first is not to use Python 3 print() syntax:
rows = ( '已','经激活的区域语言' )
print rows[0]
print rows[1]
print rows[0], rows[1]
print rows[0].decode('utf8'), rows[1].decode('utf8')
print rows[0], 1
已
经激活的区域语言
已 经激活的区域语言
已 经激活的区域语言
已 1
The second is to import Python 3 print() syntax into Python 2:
from __future__ import print_function
rows = ( '已','经激活的区域语言' )
print (rows[0])
print (rows[1])
print (rows[0], rows[1])
print (rows[0].decode('utf8'), rows[1].decode('utf8'))
print (rows[0], 1)
Output is the same.
And drop that sys.setdefaultencoding() call. It's not intended to be used like that (only in the site module) and does more harm than good.

Related

Python Cyrillic encoding output

I just started learning python, and I am doing exercises for dictionaries, but since the sample code I have to use is in Cyrillic, I'm having some encoding issues. I have the following code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
temperatures = {
    'София': -14,
    'Новосибирск': -31
}
print("-" * 20)
print(temperatures)
print("-" * 20)
key = 'Бургас'
if key in temperatures:
    print(temperatures[key])
else:
    print("No data for {}".format(key))
Before adding the # -*- coding: utf-8 -*- line I was getting SyntaxError: Non-ASCII character '\xd0'. Now however, the error is gone but the output of the words in Cyrillic is not right. This is the output:
--------------------
{'\xd0\x9d\xd0\xbe\xd0\xb2\xd0\xbe\xd1\x81\xd0\xb8\xd0\xb1\xd0\xb8\xd1\x80\xd1\x81\xd0\xba': -31, '\xd0\xa1\xd0\xbe\xd1\x84\xd0\xb8\xd1\x8f': -14}
--------------------
No data for Бургас
So, the words in Cyrillic which are printed out from the python dict are messed up, but the line Бургас appears right. I tried using print(format(temperatures)) but the output is the same. If it prints out one Cyrillic word, shouldn't it print all of them?

Sorry, can't comment yet: Your code works for me (emacs on arch linux).
edit: My default Python interpreter is Python 3.6; I get the same output as you with Python 2.7. Is Python 2.7 a requirement?

Assuming that you use Python 2, the output here is as expected:
when you print a string, it is printed directly
when you print a dictionary, the keys (and values) are printed using repr and not str. That explains why all the non-ASCII bytes of the keys appear in their escaped form.
If you want to display a dictionary in Python 2, you must do it by hand:
h = temperatures
print('{' + ', '.join((str(k)+ ': ' + str(h[k]) for k in h.keys())) + '}')
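The same idea can be wrapped in a small function. This is only a sketch: format_dict is a hypothetical name, and sorting the items is my own addition so that the output order is predictable.

```python
# -*- coding: utf-8 -*-
# Sketch only: format_dict is a hypothetical helper name.
temperatures = {'София': -14, 'Новосибирск': -31}


def format_dict(mapping):
    """Build a dict-like display from the plain keys and values, not their repr()."""
    pairs = ('{}: {}'.format(key, value) for key, value in sorted(mapping.items()))
    return '{' + ', '.join(pairs) + '}'


print(format_dict(temperatures))
```

Because the keys are formatted directly, no quotes or escape sequences appear around them.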

Delete last printed character python

I am writing a program in Python and want to replace the last character printed in the terminal with another character.
Pseudo code is:
print "Ofen",
print "\b", # NOT NECESSARILY \b, but whichever print statement erases the last character printed
print "r"
I'm using Windows8 OS, Python 2.7, and the regular interpreter.
All of the options I saw so far didn't work for me. (such as: \010, '\033[#D' (# is 1), '\r').
These options were suggested in other Stack Overflow questions or other resources and don't seem to work for me.
EDIT: using sys.stdout.write doesn't change the effect either. It just doesn't erase the last printed character. Instead, when using sys.stdout.write, my output is:
Ofenr # with a square before 'r'
My questions:
Why don't these options work?
How do I achieve the desired output?
Is this related to Windows OS or Python 2.7?
Once I know how to do it, can I use the same technique to erase the '\n' that Python's print statement adds?

When using print in python a line feed (aka '\n') is added. You should use sys.stdout.write() instead.
import sys
sys.stdout.write("Ofen")
sys.stdout.write("\b")
sys.stdout.write("r")
sys.stdout.flush()
Output: Ofer

You can also import the print function from Python 3. The optional end argument can be any string that will be added. In your case it is just an empty string.
from __future__ import print_function # Only needed in Python 2.X
print("Ofen",end="")
print("\b",end="") # NOT NECESSARILY \b, but whichever print statement erases the last character printed
print("r")
Output
Ofer
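Note that "\b" does not delete anything by itself: it only moves the cursor one column back, and the next character written overwrites what was there. A minimal sketch of that idea follows; replace_last_char is a made-up name, and whether the backspace is honored depends on the console, so treat it as an illustration rather than a guaranteed fix for the Windows console in the question.

```python
import sys


def replace_last_char(text, replacement, stream=sys.stdout):
    """Write text, step back one column with a backspace, then overwrite."""
    stream.write(text)
    stream.write('\b')         # moves the cursor back; erases nothing on its own
    stream.write(replacement)  # overwrites the character under the cursor
    stream.write('\n')
    stream.flush()


replace_last_char('Ofen', 'r')
```

The characters sent to the stream are always "Ofen", backspace, "r"; a terminal that honors the backspace displays them as Ofer.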

I think string slicing would help you. Save the text in a variable and print everything up to, but not including, its last character, followed by the replacement.
For instance:
x = "Ofen"
print (x[:-1] + "r")
would give you the result
Ofer
Hope this helps. :)

python strip() seems doesn't working as expected

I apologize for the basic question, but I'm having a weird problem with a simple script that seems correct but doesn't work as expected:
#!/usr/bin/python
import json,sys
obj=json.load(sys.stdin)
oem=obj["_id"]
models = obj.get("modelli", 0)
if models != 0:
    for marca in obj["modelli"]:
        brand=obj["modelli"][marca]
        for serie in brand:
            ser=brand[serie]
            for modello in ser:
                model=modello
                marca = marca.strip()
                modello = modello.strip()
                serie = serie.strip()
                print oem,";",marca,";",serie,";",modello
It should just cycle through an array from a JSON variable and print the output in CSV format, but I still get one whitespace character before and after each variable (oem, marca, serie, modello), like this:
KD-CH884 ; Dell ; ; 966
This is my very first Python script and I've just followed some simple instructions, so am I missing something?
Any guess?

The print statement is putting in that whitespace.
From the docs here:
A space is written before each object is (converted and) written,
unless the output system believes it is positioned at the beginning of
a line.
Use ';'.join(...) instead.

Python is actually stripping the whitespace. It's just the print statement:
print oem,";",marca,";",serie,";",modello
that is reintroducing the spaces. Try concatenating the variables and displaying the result.

';'.join(filter(None, [oem, marca, serie, modello]))
It will only place the semicolon between two non-empty strings. If a variable holds the empty string '' after being stripped, filter(None, ...) takes it out of the list.

try this:
print "%s;%s;%s;%s" % (oem,marca,serie,modello)
or
print ";".join([oem,marca,serie,modello])
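As a sketch of the join approach that runs on both Python 2 and Python 3: csv_line is a hypothetical helper name, the sample values are taken from the output in the question, and the str() call is my own addition in case a field such as oem is not a string.

```python
def csv_line(fields, sep=';'):
    """Strip each field and join with the separator, so no extra spaces are added."""
    return sep.join(str(field).strip() for field in fields)


print(csv_line(['KD-CH884', ' Dell ', '', ' 966 ']))
```

Unlike the filter(None, ...) variant, this keeps empty fields, so every row has the same number of columns.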

ElementTree will not parse special characters with Python 2.7

I had to rewrite my Python script from Python 3 to Python 2, and after that I got a problem parsing special characters with ElementTree.
This is a piece of my xml:
<account number="89890000" type="Kostnad" taxCode="597" vatCode="">Avsättning egenavgifter</account>
This is the output when I parse this row:
('account:', '89890000', 'AccountType:', 'Kostnad', 'Name:', 'Avs\xc3\xa4ttning egenavgifter')
So it seems to be a problem with the character "ä".
This is how I do it in the code:
sys.setdefaultencoding( "UTF-8" )
xmltree = ET()
xmltree.parse("xxxx.xml")
printAccountPlan(xmltree)
def printAccountPlan(xmltree):
    print("account:",str(i.attrib['number']), "AccountType:",str(i.attrib['type']),"Name:",str(i.text))
Does anyone have an idea how to get ElementTree to parse the character "ä", so that the result looks like this:
('account:', '89890000', 'AccountType:', 'Kostnad', 'Name:', 'Avsättning egenavgifter')

You're running into two separate differences between Python 2 and Python 3 at the same time, which is why you're getting unexpected results.
The first difference is one you're probably already aware of: Python's print statement in version 2 became a print function in version 3. That change is creating a special circumstance in your case, which I'll get to a little later. But briefly, this is the difference in how 'print' works:
In Python 3:
>>> # Two arguments 'Hi' and 'there' get passed to the function 'print'.
>>> # They are joined with a space separator and printed.
>>> print('Hi', 'there')
Hi there
In Python 2:
>>> # 'print' is a statement which doesn't need parentheses.
>>> # The parentheses instead create a tuple containing two elements
>>> # 'Hi' and 'there'. This tuple is then printed.
>>> print('Hi', 'there')
('Hi', 'there')
The second problem in your case is that tuples print themselves by calling repr() on each of their elements. In Python 3, repr() displays unicode as you want. But in Python 2, repr() uses escape characters for any byte values which fall outside the printable ASCII range (e.g., larger than 127). This is why you're seeing them.
You may decide to resolve this issue, or not, depending on what your goal is with your code. The representation of a tuple in Python 2 uses escape characters because it's not designed to be displayed to an end user. It's more for your internal convenience as a developer, for troubleshooting and similar tasks. If you're simply printing it for yourself, then you may not need to change a thing, because Python is showing you that the encoded bytes for that non-ASCII character are correctly there in your string. If you do want to display something to the end user that looks like a tuple, then one way to do it (which retains correct printing of unicode) is to create the formatting manually, like this:
def printAccountPlan(xmltree):
    data = (i.attrib['number'], i.attrib['type'], i.text)
    print "('account:', '%s', 'AccountType:', '%s', 'Name:', '%s')" % data
    # Produces this:
    # ('account:', '89890000', 'AccountType:', 'Kostnad', 'Name:', 'Avsättning egenavgifter')
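For comparison, here is a self-contained sketch of the same formatting idea written for Python 3. The wrapping accounts element, the inline XML string and the format_account helper are assumptions for illustration; the question only shows a single account row and loads it from a file.

```python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET

# Assumed document: the question's <account> row inside a made-up <accounts> root.
xml_text = (
    '<accounts>'
    '<account number="89890000" type="Kostnad" taxCode="597" vatCode="">'
    'Avsättning egenavgifter</account>'
    '</accounts>'
)
root = ET.fromstring(xml_text)


def format_account(account):
    """Format the values themselves, so no repr() escaping is involved."""
    return 'account: {} AccountType: {} Name: {}'.format(
        account.attrib['number'], account.attrib['type'], account.text)


for account in root.iter('account'):
    print(format_account(account))
```

ElementTree parses the "ä" correctly in both versions; only the tuple's repr() in Python 2 makes it look broken.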

Python: 2.6 and 3.1 string matching inconsistencies

I wrote my module in Python 3.1.2, but now I have to validate it for 2.6.4.
I'm not going to post all my code since it may cause confusion.
Brief explanation:
I'm writing a XML parser (my first interaction with XML) that creates objects from the XML file. There are a lot of objects, so I have a 'unit test' that manually scans the XML and tries to find a matching object. It will print out anything that doesn't have a match.
I open the XML file and use a simple 'for' loop to read line-by-line through the file. If I match a regular expression for an 'application' (XML has different 'application' nodes), then I add it to my dictionary, d, as the key. I perform a lxml.etree.xpath() query on the title and store it as the value.
After I go through the whole thing, I iterate through my dictionary, d, and try to match the key to my value (I have to use the get() method from my 'application' class). Any time a mismatch is found, I print the key and title.
Python 3.1.2 has all matching items in the dictionary, so nothing is printed. In 2.6.4, every single value is printed (~600) in all. I can't figure out why my string comparisons aren't working.
Without further ado, here's the relevant code:
for i in d:
    if i[1:-2] != d[i].get('id'):
        print('X%sX Y%sY' % (i[1:-3], d[i].get('id')))
I slice the strings because the strings are different. Where the key would be "9626-2008olympics_Prod-SH"\n the value would be 9626-2008olympics_Prod-SH, so I have to cut the quotes and newline. I also added the Xs and Ys to the print statements to make sure that there wasn't any kind of whitespace issues.
Here is an example line of output:
X9626-2008olympics_Prod-SHX Y9626-2008olympics_Prod-SHY
Remember to ignore the Xs and Ys. Those strings are identical. I don't understand why Python2 can't match them.
Edit:
So the problem seems to be the way that I am slicing.
In Python3,
if i[1:-2] != d[i].get('id'):
this comparison works fine.
In Python2,
if i[1:-3] != d[i].get('id'):
I have to change the offset by one.
Why would strings need different offsets? The only possible thing that I can think of is that Python2 treats a newline as two characters (i.e. '\' + 'n').
Edit 2:
Updated with requested repr() information.
I added a small amount of code to produce the repr() info for the "2008olympics" example above. I have not done any slicing. It actually looks like it might not be a unicode issue. There is now a "\r" character.
Python2:
'"9626-2008olympics_Prod-SH"\r\n'
'9626-2008olympics_Prod-SH'
Python3:
'"9626-2008olympics_Prod-SH"\n'
'9626-2008olympics_Prod-SH'
Looks like this file was created/modified on Windows. Is there a way in Python2 to automatically suppress '\r'?

You are printing i[1:-3] but comparing i[1:-2] in the loop.
Very Important Question
Why are you writing code to parse XML when lxml will do all that for you? The point of unit tests is to test your code, not to ensure that the libraries you are using work!

Russell Borogrove is right.
Python 3 defaults to unicode, and the newline character is correctly interpreted as one character. That's why my offset of [1:-2] worked in 3 because I needed to eliminate three characters: ", ", and \n.
In Python 2, the newline is being interpreted as two characters, meaning I have to eliminate four characters and use [1:-3].
I just added a manual check for the Python major version.
Here is the fixed code:
for i in d:
    # The keys in D contain quotes and a newline which need
    # to be removed. In v3, newline = 1 char and in v2,
    # newline = 2 char.
    if sys.version_info[0] < 3:
        if i[1:-3] != d[i].get('id'):
            print('%s %s' % (i[1:-3], d[i].get('id')))
    else:
        if i[1:-2] != d[i].get('id'):
            print('%s %s' % (i[1:-2], d[i].get('id')))
Thanks for the responses everyone! I appreciate your help.

repr() and %r format are your friends ... they show you (for basic types like str/unicode/bytes) exactly what you've got, including type.
Instead of
print('X%sX Y%sY' % (i[1:-3], d[i].get('id')))
do
print('%r %r' % (i, d[i].get('id')))
Note leaving off the [1:-3] so that you can see what is in i before you slice it.
Update after comment "You are perfectly right about comparing the wrong slice. However, once I change it, python2.6 works, but python3 has the problem now (i.e. it doesn't match any objects)":
How are you opening the file (two answers please, for Python 2 and 3)? Are you running on Windows? Have you tried getting the repr() as I suggested?
Update after actual input finally provided by OP:
If, as it appears, your input file was created on Windows (lines are separated by "\r\n"), you can read Windows and *x text files portably by using the "universal newlines" option: open('datafile.txt', 'rU') on Python 2. Universal newlines mode is the default in Python 3. Note that the Python 3 docs of the time say that you can also use 'rU' in Python 3, which would save you having to test which Python version you are using; the 'U' flag was later deprecated and then removed in Python 3.11, so this only applies to older Python 3 releases.
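A version-independent alternative is io.open, which exists in Python 2.6+ and is the built-in open in Python 3; with newline=None it translates "\r\n" to "\n" while reading. The sketch below assumes the file is UTF-8 text and uses a temporary file in place of the real input; read_lines is a hypothetical helper name.

```python
import io
import os
import tempfile


def read_lines(path):
    """Read text lines with universal newlines, so Windows line endings arrive as LF."""
    with io.open(path, 'r', encoding='utf-8', newline=None) as handle:
        return handle.readlines()


# Demo input with a Windows line ending, written in binary mode so it is kept as-is.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as handle:
    handle.write(b'"9626-2008olympics_Prod-SH"\r\n')

lines = read_lines(path)
os.remove(path)

print(repr(lines[0]))    # the carriage return is gone
print(lines[0][1:-2])    # the same slice now works on either Python version
```

With the "\r" removed at read time, the [1:-2] slice from the question no longer needs a version check.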

I don't understand what you're doing exactly, but would you try using strip() instead of slicing and see whether it helps?
for i in d:
    stripped = i.strip()
    if stripped != d[i].get('id'):
        print('X%sX Y%sY' % (stripped, d[i].get('id')))
