Python Cyrillic encoding output

I just started learning python, and I am doing exercises for dictionaries, but since the sample code I have to use is in Cyrillic, I'm having some encoding issues. I have the following code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
temperatures = {
    'София': -14,
    'Новосибирск': -31
}
print("-" * 20)
print(temperatures)
print("-" * 20)
key = 'Бургас'
if key in temperatures:
    print(temperatures[key])
else:
    print("No data for {}".format(key))
Before adding the # -*- coding: utf-8 -*- line I was getting SyntaxError: Non-ASCII character '\xd0'. Now however, the error is gone but the output of the words in Cyrillic is not right. This is the output:
--------------------
{'\xd0\x9d\xd0\xbe\xd0\xb2\xd0\xbe\xd1\x81\xd0\xb8\xd0\xb1\xd0\xb8\xd1\x80\xd1\x81\xd0\xba': -31, '\xd0\xa1\xd0\xbe\xd1\x84\xd0\xb8\xd1\x8f': -14}
--------------------
No data for Бургас
So, the words in Cyrillic which are printed out from the python dict are messed up, but the line Бургас appears right. I tried using print(format(temperatures)) but the output is the same. If it prints out one Cyrillic word, shouldn't it print all of them?

Sorry, can't comment yet: your code works for me (Emacs on Arch Linux).
Edit: my default Python interpreter is Python 3.6; I get the same output as you with Python 2.7. Is Python 2.7 a requirement?

Assuming that you are using Python 2, the output here is as expected:
when you print a string, it is printed directly
when you print a dictionary, the keys (and values) are printed using repr and not str. In Python 2, the repr of a byte string escapes every non-ASCII byte, which is why the Cyrillic keys appear as \xd0... sequences.
If you want to display a dictionary in Python 2, you must do it by hand:
h = temperatures
print('{' + ', '.join(str(k) + ': ' + str(h[k]) for k in h) + '}')
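As a rough illustration under Python 3 (an assumption, since the answer above targets Python 2), the built-in ascii() escapes non-ASCII characters much as Python 2's repr did, while formatting the items by hand keeps them readable. The helper name format_dict is made up for this sketch.

```python
temperatures = {'София': -14, 'Новосибирск': -31}


def format_dict(d):
    """Build a dict-like display from str() of each key and value (hypothetical helper)."""
    return '{' + ', '.join('{}: {}'.format(k, v) for k, v in d.items()) + '}'


# ascii() escapes non-ASCII characters; Python 3 uses \u escapes here,
# not the \x byte escapes that Python 2 shows for byte strings.
print(ascii(temperatures))

# Formatting by hand uses str() on each key, so the Cyrillic text stays readable.
print(format_dict(temperatures))
```

The same idea applies to lists and tuples: the container's own display uses repr of each element, so readable output has to be assembled from the elements themselves.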

Related

Setting Python encoding for printing Chinese characters [duplicate]

My code is below. I don't know why it can't print Chinese. Please help.
When I try to print more than one variable at a time, the words come out looking like ASCII escapes or a raw representation.
How can I fix it?
# -*- coding: utf-8 -*-
import pygoldilocks
import sys
reload(sys)
sys.setdefaultencoding('utf8')
rows = ( '已','经激活的区域语言' )
print( rows[0] )
print( rows[1] )
print( rows[0], rows[1] )
print( rows[0].encode('utf8'), rows[1].decode('utf8') )
print( rows[0], 1 )
$ python test.py
已
经激活的区域语言
('\xe5\xb7\xb2', '\xe7\xbb\x8f\xe6\xbf\x80\xe6\xb4\xbb\xe7\x9a\x84\xe5\x8c\xba\xe5\x9f\x9f\xe8\xaf\xad\xe8\xa8\x80')
('\xe5\xb7\xb2', u'\u7ecf\u6fc0\u6d3b\u7684\u533a\u57df\u8bed\u8a00')
('\xe5\xb7\xb2', 1)

All your outputs are normal. By the way, this:
reload(sys)
sys.setdefaultencoding('utf8')
is really a poor man's trick to set the Python default encoding. It is seldom really useful (IMHO it is not needed in the code shown) and should only be used when no cleaner way is possible. I have been using Python 2 for a long time with a non-ASCII charset (Latin1) and only used that trick in my very first scripts.
And the # -*- coding: utf-8 -*- line does not change how your byte strings are handled here. Python 2 needs it in order to accept non-ASCII bytes in the source file at all, but it only affects how unicode literals are decoded, and you have none: byte string literals keep exactly the bytes that are in the file.
Now what really happens:
You define rows as a 2-tuple of (byte) strings containing Chinese characters encoded in UTF-8. Fine.
When you print a string, the bytes are passed directly to the output system (here a terminal or screen). As it correctly processes UTF-8, it converts the UTF-8 byte representation into the proper characters. So print (rows[0]) (which is executed as print rows[0] in Python 2, because (rows[0]) is not a tuple while (rows[0],) is a 1-tuple) correctly displays the Chinese characters.
But when you print a tuple, Python actually prints the representation of the elements of the tuple (it would be the same for a list, set or dict). And in Python 2, the representation of a byte or unicode string escapes all non-ASCII characters in \x.. or \u.... form.
In a Python interactive session, you should see:
>>> print rows[0]
已
>>> print repr(rows[0])
'\xe5\xb7\xb2'
TL;DR: when you print containers, you actually print the representation of the elements. If you want to display the string values, use an explicit loop or a join:
print '(' + ', '.join(rows) + ')'
displays as expected:
(已, 经激活的区域语言)
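For readers on Python 3, where text and bytes are separate types, the Python 2 behaviour can be approximated by encoding the strings explicitly. This is only a sketch under that assumption; the names encoded and joined are illustrative.

```python
rows = ('已', '经激活的区域语言')

# Python 2 'str' literals held UTF-8 bytes; encode explicitly to get the same bytes in Python 3.
encoded = tuple(s.encode('utf-8') for s in rows)

# The representation of a bytes object escapes every non-ASCII byte.
print(repr(encoded[0]))

# Decoding each element and joining by hand gives readable output again.
joined = '(' + ', '.join(b.decode('utf-8') for b in encoded) + ')'
print(joined)
```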

Your problem is that you are using Python 2, I guess. Your code
print( rows[0], rows[1] )
is evaluated as
tmp = ( rows[0], rows[1] ) # a tuple!
print tmp # Python 2 print statement!
Since the default formatting for tuples is done via repr(), you see the ASCII-escaped representation.
Solution: Upgrade to Python 3.

There are two less drastic solutions than upgrading to Python 3.
The first is not to use Python 3 print() syntax:
rows = ( '已','经激活的区域语言' )
print rows[0]
print rows[1]
print rows[0], rows[1]
print rows[0].decode('utf8'), rows[1].decode('utf8')
print rows[0], 1
已
经激活的区域语言
已 经激活的区域语言
已 经激活的区域语言
已 1
The second is to import Python 3 print() syntax into Python 2:
from __future__ import print_function
rows = ( '已','经激活的区域语言' )
print (rows[0])
print (rows[1])
print (rows[0], rows[1])
print (rows[0].decode('utf8'), rows[1].decode('utf8'))
print (rows[0], 1)
Output is the same.
And drop that sys.setdefaultencoding() call. It's not intended to be used like that (only in the site module) and does more harm than good.

Using unicode / umlauts in Python: Dictionary v manual input

I am using a dictionary to store some character pairs in Python (I am replacing umlaut characters). Here is what it looks like:
umlautdict={
'ae': 'ä',
'ue': 'ü',
'oe': 'ö'
}
Then I run my inputwords through it like so:
for item in umlautdict.keys():
    outputword=inputword.replace(item,umlautdict[item])
But this does not do anything (no replacement happens). When I printed out my umlautdict, I saw that it looks like this:
{'ue': '\xfc', 'oe': '\xf6', 'ae': '\xc3\xa4'}
Of course that is not what I want; however, trying things like unicode() (--> Error) or pre-fixing u did not improve things.
If I type the 'ä' or 'ö' into the replace() command by hand, everything works just fine. I also changed the settings in my script (working in TextWrangler) to # -*- coding: utf-8 -*- as it would not even let me execute the script containing umlauts without it.
So I don't get...
Why does this happen? Why and when do the umlauts change from "good to evil" when I store them in the dictionary?
How do I fix it?
Also, if anyone knows: what is a good resource to learn about encoding in Python? I have issues all the time and so many things don't make sense to me / I can't wrap my head around them.
I'm working on a Mac in Python 2.7.10. Thanks for your help!

Converting to Unicode is done by decoding your string (assuming you're getting bytes):
data = "haer ueber loess"
word = data.decode('utf-8') # actual encoding depends on your data
Define your dict with unicode strings as well:
umlautdict={
u'ae': u'ä',
u'ue': u'ü',
u'oe': u'ö'
}
Finally, print umlautdict will print out some representation of that dict, usually involving escapes. That's normal; you don't have to worry about it.
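The dictionary printed in the question mixes '\xfc' with '\xc3\xa4', which is consistent with (though not proof of) the file containing both Latin-1 and UTF-8 bytes. A small Python 3 sketch, assuming those two encodings, shows how the same letters turn into different bytes and why decoding first makes the replacement predictable:

```python
# The same letter produces different bytes under different encodings.
latin1_u = 'ü'.encode('latin-1')   # one byte
utf8_u = 'ü'.encode('utf-8')       # two bytes
utf8_a = 'ä'.encode('utf-8')       # two bytes
print(latin1_u, utf8_u, utf8_a)

# Decode incoming bytes with the matching codec, then work on text only.
word = b'haer ueber loess'.decode('utf-8')
replaced = word.replace('ue', 'ü')
print(replaced)
```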

Declare your coding.
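Use raw format for the special characters (the r prefix is harmless here, but it only affects backslash escapes and is not what makes the replacement work).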
Iterate properly on your string: keep the changes from each loop iteration as you head to the next.
Here's code to get the job done:
# -*- coding: utf-8 -*-
umlautdict = {
    'ae': r'ä',
    'ue': r'ü',
    'oe': r'ö'
}
print umlautdict
inputword = "haer ueber loess"
for item in umlautdict.keys():
    inputword = inputword.replace(item, umlautdict[item])
print inputword
Output:
{'ue': '\xc3\xbc', 'oe': '\xc3\xb6', 'ae': '\xc3\xa4'}
här über löss
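The key point in the loop above is that each iteration works on the result of the previous one, whereas the loop in the question always started again from the unchanged inputword. A Python 3 sketch of the same idea follows; the function name to_umlauts is hypothetical.

```python
umlautdict = {'ae': 'ä', 'ue': 'ü', 'oe': 'ö'}


def to_umlauts(word, mapping=umlautdict):
    """Apply every replacement in turn, carrying the result forward."""
    for ascii_pair, umlaut in mapping.items():
        # Reassign word so the next iteration sees the earlier replacements.
        word = word.replace(ascii_pair, umlaut)
    return word


print(to_umlauts('haer ueber loess'))
```

Note that a blind replacement like this also rewrites words in which 'ue' or 'oe' is not an umlaut transcription, so real text may need a more careful rule.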

Python :Non-UTF-8 code starting with '\xe8' in file [duplicate]

I am trying to write a binary search program for a class, and I am pretty sure that my logic is right, but I keep getting a non-UTF-8 error. I have never seen this error and any help/clarification would be great! Thanks a bunch.
Here's the code.
def main():
    str names = [‘Ava Fischer’, ‘Bob White’, ‘Chris Rich’, ‘Danielle Porter’, ‘Gordon Pike’, ‘Hannah Beauregard’, ‘Matt Hoyle’, ‘Ross Harrison’, ‘Sasha Ricci’, ‘Xavier Adams’]
    binarySearch(names, input(str("Please Enter a Name.")))
    print("That name is at position "+position)
def binarySearch(array, searchedValue):
    begin = 0
    end = len(array) - 1
    position = -1
    found = False
    while !=found & begin<=end:
        middle=(begin+end)/2
        if array[middle]== searchedValue:
            found=True
            position = middle
        elif array[middle] >value:
            end=middle-1
        else:
            first =middle+1
    return position

Add this line at the top of your code. It may work.
# coding=utf8

Your editor replaced ' (ASCII 39) with U+2018 LEFT SINGLE QUOTATION MARK characters, usually a sign you used Word or a similar wordprocessor instead of a plain text editor; a word processor tries to make your text 'prettier' and auto-replaces things like simple quotes with fancy ones. This was then saved in the Windows 1252 codepage encoding, where the fancy quotes were saved as hex 91 characters.
Python is having none of it. It wants source code saved in UTF-8 and using ' or " for quotation marks. Use notepad, or better still, IDLE to edit your Python code instead.
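To make the byte value concrete, here is a small Python 3 sketch: it shows the bytes a left single quotation mark becomes in Windows-1252 and in UTF-8, plus a hypothetical helper, straighten_quotes, that swaps typographic quotes for plain ones. Retyping the quotes in a plain text editor achieves the same thing.

```python
# Typographic quotes mapped to the plain ASCII quotes Python expects.
SMART_QUOTES = {
    '\u2018': "'",  # LEFT SINGLE QUOTATION MARK
    '\u2019': "'",  # RIGHT SINGLE QUOTATION MARK
    '\u201c': '"',  # LEFT DOUBLE QUOTATION MARK
    '\u201d': '"',  # RIGHT DOUBLE QUOTATION MARK
}


def straighten_quotes(source):
    """Return source text with typographic quotes replaced by ASCII quotes."""
    return source.translate(str.maketrans(SMART_QUOTES))


# The same character is one byte in Windows-1252 but three bytes in UTF-8.
print('\u2018'.encode('cp1252'), '\u2018'.encode('utf-8'))
print(straighten_quotes('names = [\u2018Ava Fischer\u2019]'))
```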
You have numerous other errors in your code; you cannot use spaces in your variable names, for example, and Python uses and, not & as the boolean AND operator. != is an operator requiring 2 operands (it means 'not equal', the opposite of ==), the boolean NOT operator is called not.
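For reference, one possible corrected version of the search is sketched below, assuming Python 3 and a sorted list. The snake_case names are my own choice, and the interactive input() call from the question is left out so the snippet runs unattended.

```python
def binary_search(array, searched_value):
    """Return the index of searched_value in the sorted list array, or -1 if absent."""
    begin = 0
    end = len(array) - 1
    while begin <= end:
        middle = (begin + end) // 2  # integer division keeps the index an int
        if array[middle] == searched_value:
            return middle
        elif array[middle] > searched_value:
            end = middle - 1
        else:
            begin = middle + 1
    return -1


names = ['Ava Fischer', 'Bob White', 'Chris Rich', 'Danielle Porter',
         'Gordon Pike', 'Hannah Beauregard', 'Matt Hoyle', 'Ross Harrison',
         'Sasha Ricci', 'Xavier Adams']

position = binary_search(names, 'Matt Hoyle')
print("That name is at position " + str(position))
```

Besides the quote characters, this changes the points listed above and a few more: the loop condition uses a plain comparison, the undefined names value and first become searched_value and begin, and the returned position is converted with str() before concatenation.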

If you're using Notepad++, click Encoding at the top and choose Encode in UTF-8.

The character you are beginning your constant strings with is not the right string delimiter. You are using
‘Ava Fischer’ # ‘ and ’ as string delimiters
when it should have been either
'Ava Fischer' # Ascii 39 as string delimiter
or maybe
"Ava Fischer" # Ascii 34 as string delimiter

Add this line to the top of your code, it might help
# -*- coding:utf-8 -*-

Curses library in python

I have C code which draws a vertical & a horizontal line in the center of screen as below:
#include<stdio.h>
#define HLINE for(i=0;i<79;i++)\
    printf("%c",196);
#define VLINE(X,Y) {\
    gotoxy(X,Y);\
    printf("%c",179);\
}
int main()
{
    int i,y;
    clrscr();
    gotoxy(1,12);
    HLINE
    for(y=1;y<25;y++)
        VLINE(39,y)
    return 0;
}
I am trying to convert it literally in python version 2.7.6:
import curses
def HLINE():
    for i in range(0,79):
        print "%c" % 45
def VLINE(X,Y):
    curses.setsyx(Y,X)
    print "%c" % 124
curses.setsyx(12,1)
HLINE()
for y in range(1,25):
    VLINE(39,y)
My questions:
1. Do we have to swap the positions of x and y in the setsyx function, i.e. is gotoxy(1,12) equivalent to setsyx(12,1)?
2. Is the curses module only available for Unix and not for Windows? If so, what can be used on Windows (Python 2.7.6)?
3. Why do character values 179 and 196 show up as � in Python, while in C they are | and - respectively?
4. Is the Python code above a correct literal translation, or does it need improvement?

1. Yes, you will have to swap the arguments: setsyx(y, x) versus gotoxy(x, y).
2. There are Windows libraries available. I find the most useful binaries here: link
3. This most likely has to do with Unicode encoding. What you could try is adding the following line to the top of your Python file (after the #!/usr/bin/python line), which tells Python that the source file is encoded in UTF-8:
# -*- coding: utf-8 -*-
4. Your Python code looks acceptable enough to me; I wouldn't worry about it.

1. Yes.
2. Duplicate of Curses alternative for windows
3. Presumably you are using Python 2.x, thus your characters are bytes and therefore encoding-dependent. The meaning of a particular numeric value is determined by the encoding used. Most likely you are using UTF-8 on Linux and something other than UTF-8 in your Windows program, so you cannot compare the values. In curses you should use curses.ACS_HLINE and curses.ACS_VLINE.
4. You cannot mix print and curses functions; it will mess up the display. Use the window's addch method or its variants instead.
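A minimal sketch of that advice, assuming Python 3 on a Unix-like system with a real terminal: the function name draw_cross and its optional hch/vch parameters are hypothetical, and the lines are drawn with the window methods hline and vline rather than print.

```python
import curses


def draw_cross(win, hch=None, vch=None):
    """Draw a horizontal and a vertical line through the centre of win."""
    # The ACS_* constants only exist after curses has been initialised,
    # so they are looked up here rather than used as default arguments.
    if hch is None:
        hch = curses.ACS_HLINE
    if vch is None:
        vch = curses.ACS_VLINE
    height, width = win.getmaxyx()
    win.hline(height // 2, 0, hch, width)   # arguments are y, x, character, length
    win.vline(0, width // 2, vch, height)
    win.refresh()
```

To try it in a terminal, call it through curses.wrapper so the screen is set up and restored properly, for example curses.wrapper(lambda stdscr: (draw_cross(stdscr), stdscr.getch())). Note that every coordinate pair is given as y first, then x.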

How to search and replace utf-8 special characters in Python?

I'm a Python beginner, and I have a utf-8 problem.
I have a utf-8 string and I would like to replace all german umlauts with ASCII replacements (in German, u-umlaut 'ü' may be rewritten as 'ue').
u-umlaut has unicode code point 252, so I tried this:
>>> str = unichr(252) + 'ber'
>>> print repr(str)
u'\xfcber'
>>> print repr(str).replace(unichr(252), 'ue')
u'\xfcber'
I expected the last string to be u'ueber'.
What I ultimately want to do is replace all u-umlauts in a file with 'ue':
import sys
import codecs
f = codecs.open(sys.argv[1],encoding='utf-8')
for line in f:
    print repr(line).replace(unichr(252), 'ue')
Thanks for your help! (I'm using Python 2.3.)

I would define a dictionary of the special characters that I want to map, then use the translate method.
line = 'Ich möchte die Qualität des Produkts überprüfen, bevor ich es kaufe.'
special_char_map = {ord('ä'):'ae', ord('ü'):'ue', ord('ö'):'oe', ord('ß'):'ss'}
print(line.translate(special_char_map))
You will get the following result:
Ich moechte die Qualitaet des Produkts ueberpruefen, bevor ich es kaufe.

I think it's easier and clearer to do it in a more straightforward way, using the unicode literal for 'ü' directly rather than unichr(252).
>>> s = u'über'
>>> s.replace(u'ü', 'ue')
u'ueber'
There's no need to use repr, as that gives the 'Python representation' of the string; you just need to present the readable string.
You will also need to include the following line at the beginning of the .py file, in case it's not already present, to declare the encoding of the file:
#-*- coding: UTF-8 -*-
Added: Of course, the declared coding must match the actual encoding of the file. Please check that, as there can be problems (I had problems with Eclipse on Windows, for example, as it writes files as cp1252 by default). It should also match the encoding of the system, which could be utf-8, latin-1 or others.
Also, don't use str as a variable name, as it shadows the built-in str type. You could have problems later.
(I am trying this on Python 2.6; I think the result is the same in Python 2.3.)

repr(str) returns a quoted version of str, that when printed out, will be something you could type back in as Python to get the string back. So, it's a string that literally contains \xfcber, instead of a string that contains über.
You can just use str.replace(unichr(252), 'ue') to replace the ü with ue.
If you need to get a quoted version of the result of that, though I don't believe you should need it, you can wrap the entire expression in repr:
repr(str.replace(unichr(252), 'ue'))
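The difference is easy to reproduce. The sketch below assumes Python 3, where chr replaces unichr and ascii() produces the escaped form that Python 2's repr showed; the variable names are illustrative.

```python
text = chr(252) + 'ber'   # 'über'; chr() is the Python 3 counterpart of unichr()

# ascii() yields the escaped representation, as repr did in Python 2.
escaped = ascii(text)
print(escaped)

# The escaped string contains a backslash, 'x', 'f' and 'c', not the character ü,
# so replacing ü in it changes nothing.
print(escaped.replace(chr(252), 'ue'))

# Replacing in the original text works as expected.
print(text.replace(chr(252), 'ue'))
```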

You can avoid all that source-file encoding stuff and its problems. Use the Unicode names; then it's screamingly obvious what you are doing, and the code can be read and modified anywhere.
I don't know of any language where the only accented Latin letter is lower-case-u-with-umlaut-aka-diaeresis, so I've added code to loop over a table of translations under the assumption that you'll need it.
# coding: ascii
translations = (
    (u'\N{LATIN SMALL LETTER U WITH DIAERESIS}', u'ue'),
    (u'\N{LATIN SMALL LETTER O WITH DIAERESIS}', u'oe'),
    # et cetera
)
test = u'M\N{LATIN SMALL LETTER O WITH DIAERESIS}ller von M\N{LATIN SMALL LETTER U WITH DIAERESIS}nchen'
out = test
for from_str, to_str in translations:
    out = out.replace(from_str, to_str)
print out
output:
Moeller von Muenchen
