I have tried several times to export Chinese text from list variables to a CSV or TXT file, and I ran into problems.
Specifically, I already set the encoding to UTF-8 or UTF-16 when reading the data and writing it to the file. However, I cannot get this to work when my Windows 7 base language is English, even if I change the language setting to Chinese. When I run the same Python programs under Windows 7 with Chinese as the base language, the Chinese exports and displays perfectly.
I am wondering why this happens, and whether there is a solution that lets the exported file show Chinese characters when the Python programs run under English-based Windows.
I just found that you need to do two things to achieve this:
Change the Windows display language to Chinese.
Use the UTF-16 encoding when writing the file.
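The second step can be sketched as follows. This is a minimal illustration, not the asker's actual program; the filename and the sample strings are made up:

```python
# Minimal sketch: write Chinese text to a file with an explicit
# Unicode encoding, then read it back with the same encoding.
rows = ["姓名,城市", "张三,北京"]  # hypothetical CSV rows

with open("out.csv", "w", encoding="utf-16") as f:
    f.write("\n".join(rows))

# Reading back with the same encoding recovers the text exactly.
with open("out.csv", "r", encoding="utf-16") as f:
    print(f.read())
```

Note that the file itself is fine with either UTF-8 or UTF-16; UTF-16 simply carries a BOM that older Windows tools such as Notepad and Excel auto-detect reliably. The display problem lies in the viewer, not in the exported data.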
Here's US Windows 10 running a Python IDE called PythonWin. There are no problems with Chinese.
Here's the same program running in a Windows console. Note that the US default code page for the console is cp437; cp65001 is UTF-8. Switching to an encoding that supports Chinese text is the key. The text below was cut and pasted directly from the console. While the characters display correctly when pasted into Stack Overflow, the console font didn't support Chinese and didn't actually display them correctly.
C:\>chcp
Active code page: 437
C:\>x
Traceback (most recent call last):
File "C:\\x.py", line 5, in <module>
print(f.read())
File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-5: character maps to <undefined>
C:\>chcp 65001
Active code page: 65001
C:\>type test.txt
我是美国人。
C:\>x
我是美国人。
Notepad displays the output file correctly as well.
Either use an IDE that supports UTF-8, or write your output to a file and read it with a tool like Notepad.
Two ways to get the Windows console itself to output Chinese are the win-unicode-console package, and changing the system locale to Chinese (Language and Region settings, Administrative tab, "Change system locale"). With the latter, Windows itself remains English, but the console uses a Chinese code page instead of an English one.
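On newer Python (3.7+) there is also a stream-level fix that sidesteps the console code page entirely. A sketch, demonstrated on an in-memory buffer so it runs anywhere; in a real script you would call `sys.stdout.reconfigure(...)` instead:

```python
import io

# Sketch (assumes Python 3.7+): re-point a text stream at UTF-8
# instead of the console's legacy code page. With errors="replace",
# unencodable characters degrade gracefully rather than raising
# UnicodeEncodeError.
buf = io.BytesIO()
out = io.TextIOWrapper(buf, encoding="cp437")   # cp437 can't hold Chinese
out.reconfigure(encoding="utf-8", errors="replace")
out.write("我是美国人。\n")
out.flush()
print(buf.getvalue().decode("utf-8"))
```

Characters the console font cannot draw may still render as boxes, but the program itself no longer crashes.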
Related
I'm trying to create a little program that reads the contents of two stories, Alice in Wonderland & Moby Dick, and then counts how many times the word 'the' is found in each story.
However, I'm having issues getting the Geany text editor to open the files. I've been creating and using my own small text files with no issues so far.
with open('alice_test.txt') as a_file:
    contents = a_file.readlines()
print(contents)
I get the following error:
Traceback (most recent call last):
File "add_cats_dogs.py", line 50, in <module>
print(contents)
File "C:\Users\USER\AppData\Local\Programs\Python\Python35-32\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2018' in position 279: character maps to <undefined>
As I said, no issues experienced with any small homemade text files.
Strangely enough, when I execute the above code in Python IDLE, I have no problems, even if I change the text file's encoding between UTF-8 and ANSI.
I tried encoding the text file as both UTF-8 and ANSI. I also checked that Geany's default encoding is UTF-8 (and tried without using the default encoding), and tried opening non-Unicode files both with and without a fixed encoding.
I get the same error every time. The text file was from gutenberg.org; I tried another file from there and hit the same issue.
I know it must be some sort of conflict between Geany and the text file, but I can't figure out what.
EDIT: I found a sort of fix.
Here is the text that was giving me problems: https://www.gutenberg.org/files/11/11-0.txt
Here is the text that I can use without problems: http://www.textfiles.com/etext/FICTION/alice13a.txt
The top one is encoded in UTF-8, the bottom one in windows-1252. I would have expected the reverse to be true, but for whatever reason the UTF-8 encoding seems to be causing the problem.
What OS do you use? There are similar problems on Windows. If so, you can try running chcp 65001 in the console before your command. You can also add # encoding: utf-8 at the top of your .py file. I hope this helps, because I can't reproduce the same encoding problem with the .txt file from gutenberg.org on my machine.
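For what it's worth, the traceback points at print, not open: cp437 simply has no left curly quote (U+2018), which Gutenberg's UTF-8 texts use. A sketch of a workaround that avoids printing the raw text (the file name is the asker's, used hypothetically):

```python
# The character from the traceback cannot be encoded in cp437,
# which is why printing the file's contents crashes:
try:
    "\u2018".encode("cp437")
except UnicodeEncodeError:
    print("cp437 cannot encode the curly quote U+2018")

def count_the(path):
    # Read with an explicit encoding and never print the raw text,
    # so the console code page no longer matters.
    with open(path, encoding="utf-8") as a_file:
        return a_file.read().lower().split().count("the")
```

Calling something like `count_the('alice_test.txt')` then prints only an integer, which every code page can encode.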
I have a working script at hand; it runs fine in the Spyder IDE and the Python shell, but when I run it by double-clicking, it closes right away. To understand the problem, I ran it through the cmd prompt and encountered the following:
Traceback (most recent call last):
File "C:\Users\Cheese\Desktop\demografik-proje\demo-form-v-0-1-3.py", line 314, in <module>
mainMenu(q_list, xtr_q_list)
File "C:\Users\Cheese\Desktop\demografik-proje\demo-form-v-0-1-3.py", line 152, in mainMenu
patient_admin = input("Testi uygulayan ki\u015fi: ") #the person who administrated the test
File "C:\Program Files\Python35\lib\encodings\cp850.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character '\u015f' in position 18: character maps to <undefined>
This question has been asked many times before, but my reason for asking is that this script works fine on some computers just by double-clicking, yet doesn't work on mine. From what I gathered, could it be related to the fact that my computer is in English, while the computers that were able to run it were in Turkish?
Also, since the program has many Turkish strings, I'd rather not fiddle with every individual string; I'd rather put something at the top instead. I'm even up for setting up a batch file to run the script in UTF-8, or freezing it in such a way that it recognizes UTF-8 (this would be preferred). Also, I just checked, and the program works fine if all the Turkish characters are removed, as expected.
If it's any help, Spyder still runs Python 3.5.1; I have 3.5.2 installed, and when I just type "python" at the command prompt, Python 3.5.2 runs just fine.
Following is the code, if it's any assistance:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Block o' Code
"""
patient_admin = input("Testi uygulayan kişi: ") #the person who administrated the test
#gets input for all the data program needs
print("=======================================")
"""
More block o' code
"""
Okay, I found the solution; I wonder how I missed this one (considering I tried to figure it out for 2-3 hours).
To run the program with non-ASCII strings, you can change the code page for the command prompt. To do that:
Open Windows Control Panel
Select Region and Language
Click on the Administrative tab
Under Language for non-Unicode programs, click on Change System Locale
Choose the locale (Turkish for my case)
Click OK
It's going to restart your computer; after that, your program should work fine.
Source: http://knowledgebase.progress.com/articles/Article/4677
input with a string argument does a print first, and printing non-ASCII text to the Windows console notoriously fails (the "PrintFails" problem).
The Windows command prompt is hopelessly broken at Unicode. You can change the code page it uses to one that includes the characters you want to print, for example with the command chcp 1254 for the legacy Turkish code page 1254, before running the Python script. By setting the "Language for non-Unicode programs" you are setting that code page as the default for all command prompts.
However, this will still fail if you need characters that don't exist in cp1254. In theory, as @PM2Ring suggests, you could use code page 65001 (which is nearly UTF-8), but in practice the long-standing bugs in Windows's implementation of this code page usually make it unusable.
You could also try installing the win-unicode-console module, which attempts to work around the problems of the Windows command prompt.
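To see concretely why the script works on Turkish machines but not English ones, here is a small sketch: the character from the traceback, U+015F (ş), has no mapping in the US/Western console code page cp850, but it does exist in the Turkish code page cp1254 and in UTF-8.

```python
# The character from the traceback above:
ch = "\u015f"  # ş

# cp850 (US/Western European console default) has no mapping for it,
# which is exactly the UnicodeEncodeError the asker sees:
try:
    ch.encode("cp850")
except UnicodeEncodeError:
    print("cp850: no mapping for ş")

print(ch.encode("cp1254"))  # one byte in the Turkish code page
print(ch.encode("utf-8"))   # two bytes in UTF-8
```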
I would really appreciate it if anyone could help me either fix the problems I describe below, or (worst case) suggest an alternative environment that would work (although I'm loath to upgrade to Windows 10).
I am scraping mostly-english webpages from a Japanese website. A few required fields have kanji in them.
I'm using scrapy, postgres 9.5, and python 2.7 on a Windows 7 installation.
Scrapy has to run in a cmd.exe shell, and I'm examining the database results in a psql.exe instance, also running in a cmd.exe shell. I've been using the Console2 application for cmd.exe.
It's a horrible experience to debug in this setup:
scrapy shell
I'm unable to print any diagnostic messages, because the kanji causes an exception:
print st['kanji_name']
File "C:\Users\mds\Anaconda2\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-8: character maps to <undefined>
I've seen solutions about changing the active code page with chcp 65001, but scrapy apparently doesn't understand cp65001:
C:\Users\_python\j_school>chcp 65001
Active code page: 65001
Running the spider then throws the error:
C:\Users\_python\j_school>scrapy crawl j_school
Traceback (most recent call last):
File "C:\Users\s\Anaconda2\lib\site-packages\twisted\internet\defer.py", line 588, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "C:\Users\_python\j_school\j_school\spiders\j_school_spider.py", line 141, in parse
print(st['english_name'])
LookupError: unknown encoding: cp65001
PSQL
PSQL already warns me on startup
C:\Program Files\PostgreSQL\9.5\bin>psql m_experiment postgres
psql (9.5rc1)
WARNING: Console code page (437) differs from Windows code page (1252)
8-bit characters might not work correctly. See psql reference
page "Notes for Windows users" for details.
Regardless of whether I try chcp 65001, psql still will not print these characters.
m_experiment=# select * from schools limit 1;
ERROR: character with byte sequence 0xe6 0x9d 0xb1 in encoding "UTF8" has no equivalent in encoding "WIN1252"
I've also tried to set the client_encoding, but this then blows up something and postgres insists I'm out of memory!
m_experiment=# SET client_encoding = 'UTF8';
SET
m_experiment=# show client_encoding;
Not enough memory.
m_experiment=#
I discovered multiple bug reports about this issue circa 2011, but it was apparently never fixed. Anyway, I found a manual workaround: the \pset pager off incantation solves the issue.
Now psql can at least spit out a response, although it doesn't render the kanji correctly.
m_experiment=# select english_name, kanji_name from schools limit 1;
english_name | kanji_name
-------------------------------------+--------------------
TOKYO INTERNATIONAL JAPANESE SCHOOL | æ±京国際日本語å¦院
(1 row)
One hack-solution was to change my locale to Japanese. Now the console shows my kanji properly, but it screws up the display thereafter (the prompt shows up strangely and the cursor graphic doesn't align with where the cursor actually is!).
From your error message, cp437 is the US Windows console default encoding. You could try temporarily switching your system locale to "Japanese (Japan)" so you can print kanji to the console. Go to Control Panel, Region and Language, Administrative tab, and click "Change system locale...". After rebooting, the Windows console's default encoding should be one suitable for Japanese.
I've done this before to print Chinese to the console. The setting only affects non-Unicode programs, and most programs are fully Unicode nowadays.
I have a problem with Python reading and printing a UTF-8 text file.
I have a test.txt in UTF-8 encoding without a BOM. The file contains two characters:
大声
The first character, "大", is Chinese, and the second, "声", is Japanese. Now, when I use Ulipad (a Python editor) to run the following code, which reads the txt file and prints the two characters,
import codecs
infile = "C:\\test.txt"
f = codecs.open(infile, "r", "utf-8")
s = f.read()
print(s)
I get this error:
UnicodeEncodeError: 'cp950' codec can't encode character '\u58f0' in position 1: illegal multibyte sequence
I found that it is caused by the second character, "声".
But when I run the same code in Python's default GUI, IDLE, it prints the two characters with no error. So how can I fix this problem?
My environment is Python 3.1 on Windows XP Traditional Chinese.
You get the error when printing because:
(1) Ulipad is printing to sys.stdout, which is the stdout of the legacy MS-DOS Command Prompt window.
(2) Your Traditional Chinese Windows XP uses the cp950 encoding, which is big5 plus Microsoftian fiddling.
(3) You say your second character is Japanese, by which you probably mean that it's not also Chinese and thus unlikely to be a valid character in big5+.
On the other hand, IDLE writes to its own window and is not bound on the MS-DOS wheel :-) ... so there's a much greater repertoire of characters that it can print.
声 may be Japanese, but it is also the Simplified Chinese for "sound" (traditional 聲). cp950 is Traditional Chinese and doesn't support that simplified character.
Since you are using a Chinese version of Windows, you may be able to change your default code page to cp936 (Unified Chinese) and see the output.
I'm unfamiliar with Ulipad, but try running in a Windows console:
chcp 936
and then running your script. If that doesn't work, you can change the default language for non-Unicode programs through Control Panel, Regional and Language Options, Advanced tab. This is how I was able to print Chinese in a console on my US English-based Windows.
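The code-page difference described above can be checked directly in Python. A sketch: U+58F0 (声) exists in the Simplified/Unified Chinese code page cp936 but not in Traditional Chinese cp950, which matches the traceback.

```python
# The character from the error message:
ch = "\u58f0"  # 声

print(ch.encode("cp936"))  # encodes fine under Simplified/Unified Chinese

# cp950 (Traditional Chinese / Big5) has no mapping for it:
try:
    ch.encode("cp950")
except UnicodeEncodeError:
    print("cp950: no mapping for 声")
```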
Update
Reading about Ulipad, it says:
Multilanguage support: currently supports 4 languages: English, Spanish, Simplified Chinese and Traditional Chinese, which can be auto-detected.
Perhaps you can override the auto-detected Traditional Chinese to Simplified Chinese, which may select a code page and/or font that supports that particular character. Since it doesn't support Japanese, there will probably still be characters you can't display properly.
I have an XMPP bot written in Python. One of its plugins can execute OS commands and send the output to the user. As far as I know, the output should be unicode to send it over the XMPP protocol, so I tried to handle it this way:
output = os.popen(cmd).read()
if not isinstance(output, unicode):
    output = unicode(output, 'utf-8', 'ignore')
bot.send(xmpp.Message(mess.getFrom(), output))
But when Russian symbols appear in the output, they aren't converted well.
sys.getdefaultencoding()
says that default command prompt encoding is 'ascii', but when I try to do
output.decode('ascii')
in python console I get
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 1:
ordinal not in range(128)
OS: Win XP, Python 2.5.4
PS: Sorry for my English :(
sys.getdefaultencoding() returns Python's default encoding, which is ASCII unless you have changed it. ASCII doesn't support Russian characters.
You need to work out what encoding the actual text is, either manually, or using the locale module.
Typically something like:
import locale
encoding = locale.getpreferredencoding(do_setlocale=True)
ASCII has no defined character values above 127 (0x7F). Perhaps you mean the Cyrillic code page? It's 866.
See http://en.wikipedia.org/wiki/Code_page
Edit: since this answer was marked correct, presumably 866 worked, but as other answers have pointed out, 866 is not the only Russian-language code page. If you use a code page different from the one that was used when the Russian symbols were encoded, you will get the wrong result.
You say """sys.getdefaultencoding() says that default command prompt encoding is 'ascii'"""
sys.getdefaultencoding says NOTHING about the "command prompt" encoding.
On Windows, sys.stdout.encoding should do the job. On my machine, it contains cp850 when Python is run in a Command Prompt window, and cp1252 in IDLE. Yours should contain cp866 and cp1251 respectively.
Update: You say that you still need cp866 in IDLE. Note this:
IDLE 2.6.4
>>> import os
>>> os.popen('chcp').read()
'Active code page: 850\n'
>>>
So when your app starts up, check if you are on Windows and if so, parse the result of os.popen('chcp').read(). The text before the : is probably locale-dependent. codepage = result.split()[-1] may be good enough "parsing". On Unix, which doesn't have a Windows/MS-DOS split personality, sys.stdout.encoding should be OK.
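The suggested parsing can be sketched like this. The helper names are mine, not a standard API, and the exact chcp output format is assumed from the transcript above:

```python
import os
import sys

def parse_chcp(output):
    # The label before the colon is locale-dependent, so just take
    # the last whitespace-separated token, dropping a trailing period
    # that some localized versions of chcp may print.
    return "cp" + output.split()[-1].rstrip(".")

def console_codepage():
    # Sketch of the approach above: parse 'chcp' on Windows; on Unix,
    # which has no console/GUI code-page split, sys.stdout.encoding
    # should be fine.
    if os.name == "nt":
        return parse_chcp(os.popen("chcp").read())
    return sys.stdout.encoding

print(parse_chcp("Active code page: 850\n"))  # cp850
```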
In Python, 'cp855', 'cp866', 'cp1251', 'iso8859_5' and 'koi8_r' are different Russian code pages. You'll need to use the right one to decode the output of popen. In the Windows console, the chcp command lists the code page used by console commands, which won't necessarily be the same code page used by Windows applications. On US Windows, cp437 is used for the console and cp1252 for applications like Notepad.
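A quick sketch of why picking the right code page matters: the byte 0x92 from the traceback decodes to a different character under each Russian-capable encoding, so guessing wrong silently corrupts the text instead of raising an error.

```python
# The byte from the UnicodeDecodeError above:
b = b"\x92"

# Each candidate code page maps it to something different:
for cp in ("cp866", "cp1251", "koi8_r", "cp855"):
    print(cp, repr(b.decode(cp)))
```

Under cp866 it is the Cyrillic capital letter Т; under cp1251 it is a typographic apostrophe. Only the code page that actually produced the bytes gives the right result.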