I have a filename that contains %ed%a1%85%ed%b7%97.svg and want to decode that to its proper string representation in Python 3. I know the result will be 𡗗.svg but the following code does not work:
import urllib.parse
import codecs
input = '%ed%a1%85%ed%b7%97.svg'
unescaped = urllib.parse.unquote(input)
raw_bytes = bytes(unescaped, "utf-8")
decoded = codecs.escape_decode(raw_bytes)[0].decode("utf-8")
print(decoded)
will print ������.svg. It does work, however, when input is a string like %e8%b7%af.svg for which it will correctly decode to 路.svg.
I've tried to decode this with online tools such as https://mothereff.in/utf-8 by replacing % with \x leading to \xed\xa1\x85\xed\xb7\x97.svg. The tool correctly decoded this input to 𡗗.svg.
What happens here?
You need the correct encoding to get a command-line console/terminal (one that supports, and is configured for, UTF-8) to display the correct characters.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
PEP 263 -- Defining Python Source Code Encodings: https://www.python.org/dev/peps/pep-0263/
https://stackoverflow.com/questions/3883573/encoding-error-in-python-with-chinese-characters#3888653
"""
from urllib.parse import unquote
urlencoded = '%ed%a1%85%ed%b7%97'
char = unquote(urlencoded, encoding='gbk')
char1 = unquote(urlencoded, encoding='big5_hkscs')
char2 = unquote(urlencoded, encoding='gb18030')
print(char)
print(char1)
print(char2)
# 怼呿窏
# 瞴�窾�
# 怼呿窏
This is quite an exotic Unicode character, and I was wrong about the encoding: it is not a simplified Chinese character but a traditional one, and quite far into the mapping as well: U+215D7, in the CJK Unified Ideographs Extension B block.
But the code point listed and the other values made me suspect this was poorly encoded, so it took me a while.
Someone helped me figure out how the encoding got into that form: the percent-decoded bytes are UTF-8-encoded UTF-16 surrogates (a CESU-8-style encoding), so a couple of encoding transforms are needed to get back to the original value.
from urllib.parse import unquote_to_bytes
cjk = unquote_to_bytes(urlencoded).decode('utf-8', 'surrogatepass').encode('utf-16', 'surrogatepass').decode('utf-16')
print(cjk)
# 𡗗
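In other words, the filename's code point was serialised as its UTF-16 surrogate pair, and each surrogate was then UTF-8-encoded separately (the CESU-8 scheme). A small sketch that makes the intermediate surrogates visible:

```python
from urllib.parse import unquote_to_bytes

raw = unquote_to_bytes('%ed%a1%85%ed%b7%97')
# Each 3-byte group is the UTF-8 encoding of one UTF-16 surrogate
# (CESU-8 style), so a plain .decode('utf-8') rejects it:
pair = raw.decode('utf-8', 'surrogatepass')
print([hex(ord(c)) for c in pair])  # ['0xd845', '0xddd7']
# Recombining the pair through UTF-16 yields the real code point:
print(pair.encode('utf-16', 'surrogatepass').decode('utf-16'))  # 𡗗
```

The surrogate values 0xD845/0xDDD7 together encode U+215D7, which is why the plain `unquote` + `utf-8` round trip in the question produced replacement characters.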
I want to run my code on terminal but it shows me this error :
SyntaxError: Non-ASCII character '\xd8' in file streaming.py on line
72, but no encoding declared; see http://python.org/dev/peps/pep-0263/
for detail
I tried to encode the Arabic string using this :
# -*- coding: utf-8 -*-
st = 'المملكة العربية السعودية'.encode('utf-8')
It's very important for me to run it on the terminal so I can't use IDLE.
The problem is that since you are pasting your characters directly into a Python file, the interpreter (Python 2) attempts to read them as ASCII (it has to parse the literal before your .encode call ever runs), which is illegal. What you want is a unicode literal when pasting non-ASCII characters:
x=u'المملكة العربية السعودية' #Or whatever the corresponding bytes are
print x.encode('utf-8')
You can also try to set the entire source file to be read as utf-8:
#!/usr/bin/python
# -*- coding: utf-8 -*-
and don't forget to make it executable. Lastly, you can import the future from Python 3:
from __future__ import unicode_literals
at the top of the file, so string literals are unicode by default. Note that \xd8 appears as phi in my terminal, so make sure the encoding is correct.
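For completeness: in Python 3 this whole class of error disappears, since source files default to UTF-8 and str is already unicode, so the same literal needs no prefix and no coding declaration. A sketch (the variable name st is taken from the question):

```python
# Python 3: source defaults to UTF-8 and str is unicode already,
# so no u'' prefix and no coding cookie are required.
st = 'المملكة العربية السعودية'
print(st)                  # prints the Arabic text directly
print(st.encode('utf-8'))  # explicit UTF-8 bytes, if bytes are needed
```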
My dev environment is eclipse+pydev.
If I use raw_input() to get characters and I type "你好世界", I get "浣犲ソ涓栫晫" instead. How can I get "你好世界" back and print it correctly?
I have tried raw_input().decode(sys.stdin.encoding), but the result is the same.
Decode using the terminal's/console's code page.
import sys
t = raw_input().decode(sys.stdin.encoding)
print t
Check the encoding you are using. Based on @imom0's comment, I went and tried gbk encoding. Specifically, this is my Python 2.7.3 interpreter with UTF-8 input via ibus:
>>> print raw_input().decode('gbk')
你好世界
浣犲ソ涓栫晫
>>> print raw_input().decode('utf-8')
你好世界
你好世界
This is the result of trying to decode a UTF-8 encoded string as gbk. Since your input seems to be some form of UTF, why not enforce utf-8 decoding, or use the input's encoding to decode it, as in @ignacio-vazquez-abrams' answer?
import sys
print myString.decode(sys.stdin.encoding)
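The mangling in the question can be reproduced, and reversed, in a few lines: the terminal handed Python UTF-8 bytes, which were then decoded as GBK. A Python 3 sketch of that round trip:

```python
s = '你好世界'
# Mis-decode the UTF-8 bytes as GBK: this yields exactly the
# garbled string from the question.
garbled = s.encode('utf-8').decode('gbk')
print(garbled)    # 浣犲ソ涓栫晫
# Reversing the mistake recovers the original text.
print(garbled.encode('gbk').decode('utf-8'))  # 你好世界
```

This kind of reversal only works while the mojibake has not been further transcoded or had characters dropped.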
New to Python and lxml, so please bear with me. I'm now stuck with what appears to be a Unicode issue. I've tried .encode and Beautiful Soup's UnicodeDammit with no luck. I've searched the forum and the web, but my lack of Python skill kept me from applying the suggested solutions to my particular code. I'd appreciate any help, thanks.
Code:
import requests
import lxml.html
sourceUrl = "http://www.hkex.com.hk/eng/market/sec_tradinfo/stockcode/eisdeqty.htm"
sourceHtml = requests.get(sourceUrl)
htmlTree = lxml.html.fromstring(sourceHtml.text)
for stockCodes in htmlTree.xpath('''/html/body/printfriendly/table/tr/td/table/tr/td/table/tr/table/tr/td'''):
    string = stockCodes.text
    print string
Error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)
When I run your code as python lx.py, I don't get the error; but when I redirect the output to a file, python lx.py > output.txt, it occurs. So try this:
# -*- coding: utf-8 -*-
import requests
import lxml.html
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
This switches the default encoding from ASCII to UTF-8, which the Python runtime will use whenever it has to decode a string buffer to unicode.
The text attribute returns decoded unicode, while the content attribute returns the raw bytes. You could also try sourceHtml.text.encode('utf-8') or sourceHtml.text.encode('ascii'), but I'm fairly certain the latter will raise that same exception.
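Another option that avoids fiddling with the default encoding is to deal with the offending character directly: the traceback names u'\xa0', a non-breaking space, which HTML tables are full of. It can be normalised away or encoded explicitly before printing. A Python 3 sketch with a made-up cell value ('HSBC Holdings' is hypothetical, not taken from the page):

```python
# U+00A0 is a non-breaking space (&nbsp; in HTML).
text = '\xa0HSBC Holdings'            # hypothetical scraped cell value
cleaned = text.replace('\xa0', ' ').strip()
print(cleaned)                        # HSBC Holdings
print(text.encode('utf-8'))           # b'\xc2\xa0HSBC Holdings'
```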
To summarize: how do I print Unicode system-independently, to produce the playing-card symbols?
What am I doing wrong? I consider myself quite fluent in Python, except that I seem unable to print correctly!
# coding: utf-8
from __future__ import print_function
from __future__ import unicode_literals
import sys
symbols = ('♥','♦','♠','♣')
# red suits to stderr for IDLE
print(' '.join(symbols[:2]), file=sys.stderr)
print(' '.join(symbols[2:]))
sys.stdout.write(' '.join(symbols) + '\n') # also correct in IDLE
print(' '.join(symbols))
Printing to the console, which is the main concern for a console application, fails miserably though:
J:\test>chcp
Aktiivinen koodisivu: 850
J:\test>symbol2
Traceback (most recent call last):
File "J:\test\symbol2.py", line 9, in <module>
print(''.join(symbols))
File "J:\Python26\lib\encodings\cp850.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-3: character maps to <undefined>
J:\test>chcp 437
Aktiivinen koodisivu: 437
J:\test>d:\Python27\python.exe symbol2.py
Traceback (most recent call last):
File "symbol2.py", line 6, in <module>
print(' '.join(symbols))
File "d:\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2660' in position 0: character maps to <undefined>
J:\test>
So, summa summarum, I have a console application that works, as long as you are not using the console but IDLE.
I can of course generate the symbols myself, producing them with chr:
# correct symbols for cp850
print(''.join(chr(n) for n in range(3,3+4)))
But this looks like a very stupid way to do it, and I do not want to make programs that only run on Windows or that have many special cases (like conditional compilation). I want readable code.
I do not mind which glyphs it outputs, as long as it looks correct, whether on a Nokia phone, Windows, or Linux. Unicode should do it, but it does not print correctly to the console.
Whenever I need to output UTF-8 characters, I use the following approach:
import sys
import codecs
out = codecs.getwriter('utf-8')(sys.stdout)
text = u'♠'
out.write(u"%s\n" % text)
This saves me an encode('utf-8') every time something needs to be sent to stdout/stderr.
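A Python 3 equivalent of the same idea, assuming you want an explicit UTF-8 text layer rather than relying on the default: wrap the underlying byte stream with io.TextIOWrapper. A BytesIO stands in for sys.stdout.buffer here so the encoded output is visible:

```python
import io

# Wrap any byte stream in an explicit UTF-8 text layer;
# sys.stdout.buffer works the same way as this BytesIO.
buf = io.BytesIO()
out = io.TextIOWrapper(buf, encoding='utf-8')
out.write('♠ ♥ ♦ ♣\n')
out.flush()
print(buf.getvalue())  # b'\xe2\x99\xa0 \xe2\x99\xa5 \xe2\x99\xa6 \xe2\x99\xa3\n'
```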
Use Unicode strings and the codecs module:
Either:
# coding: utf-8
from __future__ import print_function
import sys
import codecs
symbols = (u'♠',u'♥',u'♦',u'♣')
print(u' '.join(symbols))
print(*symbols)
with codecs.open('test.txt','w','utf-8') as testfile:
print(*symbols, file=testfile)
or:
# coding: utf-8
from __future__ import print_function
from __future__ import unicode_literals
import sys
import codecs
symbols = ('♠','♥','♦','♣')
print(' '.join(symbols))
print(*symbols)
with codecs.open('test.txt','w','utf-8') as testfile:
print(*symbols, file=testfile)
No need to re-implement print.
In response to the updated question
Since all you want to do is print UTF-8 characters in CMD, you're largely out of luck: CMD's Unicode support is poor.
Is there a Windows command shell that will display Unicode characters?
Old Answer
It's not totally clear what you're trying to do here; my best bet is that you want to write encoded UTF-8 to a file.
Your problems are:
symbols = ('♠','♥', '♦','♣'): while your file encoding may be UTF-8, unless you're using Python 3 your string literals won't be unicode by default, so you need to prefix them with a small u:
symbols = (u'♠', u'♥', u'♦', u'♣')
Your str(arg) converts the unicode string back into a byte string; just leave it out, or use unicode(arg) to convert to a unicode string.
The naming of .decode() may be confusing: it decodes bytes into unicode, but what you need here is to encode unicode into UTF-8 bytes, so use .encode().
You're not writing to the file in binary mode: instead of open('test.txt', 'w') you need open('test.txt', 'wb') (notice the wb). This opens the file in binary mode, which is important on Windows.
If we put all of this together we get:
# -*- coding: utf-8 -*-
from __future__ import print_function
import sys
symbols = (u'♠',u'♥', u'♦',u'♣')
print(' '.join(symbols))
print('Failure!')
def print(*args, **kwargs):
    end = kwargs['end'] if 'end' in kwargs else '\n'
    sep = kwargs['sep'] if 'sep' in kwargs else ' '
    stdout = sys.stdout if 'file' not in kwargs else kwargs['file']
    stdout.write(sep.join(unicode(arg).encode('utf-8') for arg in args))
    stdout.write(end)
print(*symbols)
print('Success!')
with open('test.txt', 'wb') as testfile:
print(*symbols, file=testfile)
That happily writes the UTF-8-encoded bytes to the file (at least on my Ubuntu box here).
UTF-8 in the Windows console is a long and painful story.
You can read issue 1602 and issue 6058 and have something that works, more or less, but it's fragile.
Let me summarise:
add 'cp65001' as an alias for 'utf8' in Lib/encodings/aliases.py
select Lucida Console or Consolas as your console font
run chcp 65001
run python
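If editing Lib/encodings/aliases.py is not an option, the alias can also be registered at runtime via a codec search function. This is a sketch; on modern Python 3 cp65001 already ships as an alias of utf-8, so the registration is only needed on the old versions those issues describe:

```python
import codecs

# Register cp65001 (the Windows UTF-8 code page) as an alias for
# utf-8, for interpreters that do not know it natively.
def _cp65001(name):
    if name == 'cp65001':
        return codecs.lookup('utf-8')
    return None

codecs.register(_cp65001)
print('♠'.encode('cp65001'))  # b'\xe2\x99\xa0'
```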