How can I get Chinese characters using raw_input - python

My dev environment is Eclipse + PyDev.
If I use raw_input() to get characters and input "你好世界", I get "浣犲ソ涓栫晫". How can I get "你好世界" back and print it correctly?
I have tried raw_input().decode(sys.stdin.encoding), but the result is the same.

Decode using the terminal's/console's code page.
import sys
t = raw_input().decode(sys.stdin.encoding)
print t

Check the encoding you are using. Based on @imom0's comment, I went and tried gbk encoding. Specifically, this is my Python 2.7.3 interpreter with UTF-8 encoding via ibus for input:
>>> print raw_input().decode('gbk')
你好世界
浣犲ソ涓栫晫
>>> print raw_input().decode('utf-8')
你好世界
你好世界
This is the result of trying to decode a UTF-8 encoded string as gbk. Since your input seems to be some form of UTF, why not enforce utf-8 decoding, or use the input's encoding to decode it, as in @Ignacio Vazquez-Abrams' answer?
import sys
print myString.decode(sys.stdin.encoding)
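For illustration, the same mojibake can be reproduced by taking the UTF-8 bytes of the input and decoding them as gbk (a minimal Python 2 demonstration, assuming a UTF-8 terminal):
>>> print u'你好世界'.encode('utf-8').decode('gbk')
浣犲ソ涓栫晫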

Related

Python read and write 'ß' from file

I have a file.txt with the input
Straße
Straße 1
Straße 2
I want to read this text from the file and print it. I tried this, but it won't work.
import random
lmao1 = open('file.txt').read().splitlines()
lmao = random.choice(lmao1)
print str(lmao).decode('utf8')
But I get the error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xdf in position 5: invalid continuation byte
Got it: utf-8 is not the correct encoding. Try latin-1; if that doesn't work, try other common encodings until you find the right one.
print str(lmao).decode('latin-1')
If on Windows, the file is likely encoded in cp1252.
Whatever the encoding, use io.open and specify the encoding. This code will work in both Python 2 and 3.
io.open will return Unicode strings. It is good practice to immediately convert to/from Unicode at the I/O boundaries of your program. In this case that means reading the file as Unicode in the first place and leaving print to determine the appropriate encoding for the terminal.
Also recommended is to switch to Python 3 where Unicode handling is greatly improved.
from __future__ import print_function
import io
import random
with io.open('file.txt', encoding='cp1252') as f:
    lines = f.read().splitlines()
line = random.choice(lines)
print(line)
You're on the right track regarding decode; the problem is that there is no way to guess the encoding of a file with 100% certainty. Try a different encoding (e.g. latin-1).
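As a rough sketch of that trial-and-error approach (guess_decode is a hypothetical helper, not a definitive detector; it works on both Python 2 and 3):
import random

def guess_decode(raw_bytes, candidates=('utf-8', 'cp1252')):
    # Return the first decoding that succeeds; fall back to latin-1,
    # which accepts any byte sequence.
    for enc in candidates:
        try:
            return raw_bytes.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw_bytes.decode('latin-1')

with open('file.txt', 'rb') as f:
    lines = guess_decode(f.read()).splitlines()
print(random.choice(lines))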
It's working fine at the Python prompt and when run from a Python script as well.
>>> import random
>>> lmao =random.choice(lmao1)
>>> lmao =random.choice(lmao1)
>>> print str(lmao).decode('utf8')
Straße 2
The above worked on Python 2.7. May I know your Python version?

How to decode ascii in python

I send Cyrillic letters from Postman to Django as a parameter in the URL and get something like %D0%B7%D0%B2 in the variable search_text.
Actually, if I print search_text I get something like текст printed.
In the console I tried the following and didn't get an error:
>>> a = "текст"
>>> a
'\xd1\x82\xd0\xb5\xd0\xba\xd1\x81\xd1\x82'
>>> print a
текст
>>> b = a.decode("utf-8")
>>> b
u'\u0442\u0435\u043a\u0441\u0442'
>>> print b
текст
>>>
But outside the console I do get an error:
"""WHERE title LIKE '%%{}%%' limit '{}';""".format(search_text, limit))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
How to prevent it?
To decode a URL-encoded string (with '%' signs), use urllib:
import urllib
byte_string=urllib.unquote('%D0%B7%D0%B2')
and then you'll need to decode the byte_string from its original encoding, i.e.:
import urllib
import codecs
byte_string=urllib.unquote('%D0%B7%D0%B2')
unicode_string=codecs.decode(byte_string, 'utf-8')
and print(unicode_string) will print зв.
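For comparison, on Python 3 the same two steps collapse into one call (a small sketch; urllib.parse.unquote decodes the percent-escapes and assumes UTF-8 by default):
from urllib.parse import unquote
print(unquote('%D0%B7%D0%B2'))  # prints зв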
The problem is with the unknown encoding. You have to know what encoding is used for the data you get. To specify the default encoding used in your script .py file, place the following line at the top:
# -*- coding: utf-8 -*-
Cyrillic might be 'cp866', 'cp1251', 'koi8_r' or 'utf-8'; these are the most common. So when using decode, try those.
Python 2 doesn't use unicode by default, so it's best to enable it or switch to Python 3. To enable unicode literals in a .py file, put the following line at the top, above all imports:
from __future__ import unicode_literals
So, for example, in Python 2.7.9 the following works fine:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
a="текст"
c="""WHERE title LIKE '%%{}%%' limit '{}';""".format(a, '10')
print(c)
Also see:
https://docs.python.org/2/library/codecs.html
https://docs.python.org/2/howto/unicode.html
It depends on what encoding the Django program is expecting and what encoding the strings search_text and limit are in. Usually it's sufficient to do this:
"""WHERE title LIKE '%%{}%%' limit '{}';""".decode("utf-8").format(search_text.decode("utf-8"), limit)
EDIT: after reading your edits, it seems you are having problems converting your URL-parsed text back into strings. Here's an example of how to do this:
import urlparse
print urlparse.urlunparse(urlparse.urlparse("ресторан"))
You can use '{}'.format(search_text.encode('utf-8')) to interpret the string as utf-8, but it will probably show your Cyrillic letters as escape sequences like \xd0.
And read The Absolute Minimum Every Software Developer Must Know About Unicode and Character Sets.

Python how to handle unicode text

I am using Python 2.6.6
item = {u'snippet': {u'title': u'How to Pronounce Canap\xe9'}}
title = item['snippet']['title']
print title
Result:
How to Pronounce CanapÃ©
Desired result:
How to Pronounce Canapé
This looks like a Unicode issue. I tried encoding and decoding to utf8, but the result is still the same. Any ideas?
Your terminal expects UTF-8:
$ locale charmap
UTF-8
Python prints using UTF-8:
>>> sys.stdout.encoding
UTF-8
Change SecureCRT setting to accept UTF-8.
This is quite possibly due to a mismatch between the default encoding Python is using for output and the console's encoding. It looks like Python is emitting UTF-8 but the console is interpreting it as latin-1.
Instead of \xe9, use \u00e9 if possible. Then pick an appropriate encoding when outputting the unicode string:
print title.encode('latin1')
What encoding is sensible depends on where you are outputting to. Generally, you have to infer it from the environment variables, or maybe let your users make a choice in a configuration file.
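One way to do that inference for terminal output (a minimal Python 2 sketch; it uses whatever encoding the interpreter detected for stdout and falls back to UTF-8 when none is reported, e.g. when output is piped):
import sys

encoding = sys.stdout.encoding or 'utf-8'
print title.encode(encoding)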
PS: If you deal with Unicode strings a lot, I'd recommend switching to Python 3 (e.g. 3.3), if at all possible. Unicode handling is a lot more clear/explicit/sane, there.
I am getting your expected output in my terminal (using Python 2.7.7).
The output you get depends on the encoding set in the terminal. For me, it is set to 'cp437':
>>> import sys
>>> sys.stdin.encoding
'cp437'
>>> sys.stdout.encoding
'cp437'
You can verify that you are getting the correct output with:
print title.encode('cp437')
Set your default encoding to iso-8859-1 in the sitecustomize.py file in ${pythondir}/lib/site-packages/:
import sys
sys.setdefaultencoding('iso-8859-1')
For me it worked with \xe9.

python, and unicode stderr

I used an anonymous pipe to capture all stdout and stderr, then print it into a rich edit control. It works fine when I use wsprintf, but Python uses multibyte characters, which really annoys me. How can I convert all this output to Unicode?
UPDATE 2010-01-03:
Thank you for the reply, but it seems str.encode() only works with print statements. If there is an error during py_runxxx(), my redirected stderr will capture the error message as a multibyte string, so is there a way to make Python output its messages as Unicode? There also seems to be an available solution in this post.
I'll try it later.
First, please remember that on Windows the console may not fully support Unicode.
The example below makes Python write to stderr and stdout using UTF-8. If you want, you can change it to another encoding.
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import codecs, sys
reload(sys)
sys.setdefaultencoding('utf-8')
print sys.getdefaultencoding()
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
sys.stderr = codecs.getwriter('utf8')(sys.stderr)
print "This is an Е乂αmp١ȅ testing Unicode support using Arabic, Latin, Cyrillic, Greek, Hebrew and CJK code points."
You can work with Unicode in Python either by marking strings as Unicode (e.g. u'Hello World') or by using the encode() method that all strings have.
E.g. assuming you have a Unicode string, aStringVariable:
aStringVariable.encode('utf-8')
will convert it to UTF-8. 'utf-16' will give you UTF-16, and 'ascii' will convert it to a plain old ASCII string (raising UnicodeEncodeError if the string contains non-ASCII characters).
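A quick interactive illustration (u'caf\xe9' is just a hypothetical example value):
>>> aStringVariable = u'caf\xe9'
>>> aStringVariable.encode('utf-8')
'caf\xc3\xa9'
>>> aStringVariable.encode('ascii')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)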
For more information, see:
Tutorial - Unicode Strings
Python String Methods
wsprintf?
This seems to be a "C/C++" question rather than a Python question.
The Python interpreter always writes bytestrings to stdout/stderr, rather than unicode (or "wide") strings. It means Python first encodes all unicode data using the current encoding (likely sys.getdefaultencoding()).
If you want to get at stdout/stderr as unicode data, you must decode it by yourself using the right encoding.
Your favourite C/C++ library certainly has what it takes to do that.

Unicode problems in PyObjC

I am trying to figure out PyObjC on Mac OS X, and I have written a simple program to print out the names in my Address Book. However, I am having some trouble with the encoding of the output.
#! /usr/bin/env python
# -*- coding: UTF-8 -*-
from AddressBook import *
ab = ABAddressBook.sharedAddressBook()
people = ab.people()
for person in people:
    name = person.valueForProperty_("First") + ' ' + person.valueForProperty_("Last")
    name
When I run this program, the output looks something like this:
...snip...
u'Jacob \xc5berg'
u'Fernando Gonzales'
...snip...
Could someone please explain why the strings are in unicode, but the content looks like that?
I have also noticed that when I try to print the name I get the error
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc5' in position 6: ordinal not in range(128)
# -*- coding: UTF-8 -*-
only affects the way Python decodes comments and string literals in your source, not the way standard output is configured, etc, etc. If you set your Mac's Terminal to UTF-8 (Terminal, Preferences, Settings, Advanced, International dropdown) and emit Unicode text to it after encoding it in UTF-8 (print name.encode("utf-8")), you should be fine.
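Applying that to the loop from the question (a small sketch, assuming the Terminal profile is set to UTF-8):
for person in people:
    name = person.valueForProperty_("First") + ' ' + person.valueForProperty_("Last")
    print name.encode("utf-8")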
If you run the code in your question in the interactive console the interpreter will print the repr of "name" because of the last statement of the loop.
If you change the last line of the loop from just "name" to "print name" the output should be fine. I've tested this with Terminal.app on a 10.5.7 system.
In the interactive interpreter, just writing the variable name sends repr(name) to standard output, and repr() escapes the non-ASCII characters in unicode values.
print tries to convert u'Jacob \xc5berg' to ASCII, which doesn't work. Try writing it to a file.
See Print Fails on the python wiki.
That means you're using legacy, limited or misconfigured console. If you're just trying to play with unicode at interactive prompt move to a modern unicode-aware console. Most modern Python distributions come with IDLE where you'll be able to print all unicode characters.
Convert it to a unicode string through:
print unicode(name)
