python read utf8 text file problem

I have a problem with Python reading and printing a UTF-8 text file.
I have a test.txt in UTF-8 encoding without a BOM. This file has two characters in it:
大声
The first character "大" is Chinese and the second "声" is Japanese. Now, when I use Ulipad (a Python editor) to run the following code to read the txt file and print these two characters:
import codecs
infile = "C:\\test.txt"
f = codecs.open(infile, "r", "utf-8")  # decode the file contents as UTF-8
s = f.read()
print(s)  # print() re-encodes for the console; this is where it fails in Ulipad
I got this error:
"UnicodeEncodeError: 'cp950' codec can't encode character '\u58f0' in position 1:
illegal multibyte sequence"
I found it is caused by the second character "声".
But when I use the same code in Python's default GUI, IDLE, it prints the two characters with no error. So how can I fix this problem?
My environment is Python 3.1 on Traditional Chinese Windows XP.

You get the error when you are printing because:
(1) Ulipad is printing to sys.stdout, which is the stdout of the legacy MS-DOS Command Prompt window.
(2) Your Traditional Chinese Windows XP uses the cp950 encoding, which is big5 plus Microsoftian fiddling.
(3) You say your 2nd character is Japanese, by which you probably mean that it's not also Chinese and thus unlikely to be a valid character in big5+.
On the other hand, IDLE is writing to its own window and is not bound on the MS-DOS wheel :-) ... so there's a much greater repertoire of characters that it can print.
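As a quick way to see the difference yourself, here is a minimal sketch (not part of the original answer) that prints which encoding sys.stdout is using and sidesteps the crash by replacing unencodable characters:
import sys
print(sys.stdout.encoding)  # 'cp950' in the Command Prompt window; something richer in IDLE
s = "大声"
# Replace anything the console encoding can't represent instead of raising
# UnicodeEncodeError; 声 becomes '?' under cp950.
print(s.encode(sys.stdout.encoding, errors="replace").decode(sys.stdout.encoding))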

声 may be Japanese, but it is also the Simplified Chinese for "sound" (traditional 聲). cp950 is Traditional Chinese and doesn't support that simplified character.
Since you are using a Chinese version of Windows, you may be able to change your default code page to cp936 (Unified Chinese) and see the output.
I'm unfamiliar with Ulipad, but try running in a Windows console:
chcp 936
and then running your script. If that doesn't work, you can change the default language for non-Unicode programs through Control Panel, Regional and Language Options, Advanced tab. This is how I was able to print Chinese in a console on my US English-based Windows.
Update
Reading about Ulipad, it says:
Multilanguage support: Currently supports 4 languages: English, Spanish, Simplified Chinese and Traditional Chinese, which can be auto-detected.
Perhaps you can override the auto-detected Traditional Chinese to Simplified Chinese, which may select a code page and/or font that supports that particular character. Since it doesn't support Japanese, there will probably still be characters you can't display properly.

Related

I got this error after hashing sha256 file in Python, how can I edit the command line? [duplicate]

I'm trying to scrape a website, but it gives me an error.
I'm using the following code:
import urllib.request
from bs4 import BeautifulSoup
get = urllib.request.urlopen("https://www.website.com/")
html = get.read()
soup = BeautifulSoup(html)
print(soup)
And I'm getting the following error:
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 70924-70950: character maps to <undefined>
What can I do to fix this?
I was getting the same UnicodeEncodeError when saving scraped web content to a file. To fix it I replaced this code:
with open(fname, "w") as f:
    f.write(html)
with this:
with open(fname, "w", encoding="utf-8") as f:
    f.write(html)
If you need to support Python 2, then use this:
import io
with io.open(fname, "w", encoding="utf-8") as f:
    f.write(html)
If you want to use a different encoding than UTF-8, specify whatever your actual encoding is for encoding.
I fixed it by adding .encode("utf-8") to soup.
That means that print(soup) becomes print(soup.encode("utf-8")).
On Python 3.7 running on Windows 10, this worked (I am not sure whether it will work on other platforms and/or other versions of Python).
Replacing this line:
with open('filename', 'w') as f:
With this:
with open('filename', 'w', encoding='utf-8') as f:
The reason it works is that the encoding is changed to UTF-8 when opening the file, so characters that exist in UTF-8 can be written out as text, instead of raising an error when a character is encountered that is not supported by the current default encoding.
Another fix is to set these environment variables before running Python:
set PYTHONIOENCODING=utf-8
set PYTHONLEGACYWINDOWSSTDIO=utf-8
You may or may not need to set that second environment variable PYTHONLEGACYWINDOWSSTDIO.
Alternatively, this can be done in code (although it seems that doing it through env vars is recommended):
sys.stdin.reconfigure(encoding='utf-8')
sys.stdout.reconfigure(encoding='utf-8')
Additionally: Reproducing this error was a bit of a pain, so leaving this here too in case you need to reproduce it on your machine:
set PYTHONIOENCODING=windows-1252
set PYTHONLEGACYWINDOWSSTDIO=windows-1252
I got the same error on Python 3.7 on Windows 10 while saving the response of a GET request. The response received from the URL was encoded as UTF-8, so it is always recommended to check the encoding and pass the same one when writing the file; it avoids a trivial issue that really kills a lot of time in production:
import requests
resp = requests.get('https://en.wikipedia.org/wiki/NIFTY_50')
print(resp.encoding)
with open('NiftyList.txt', 'w') as f:
    f.write(resp.text)
When I added encoding="utf-8" to the open command, it saved the file with the correct response:
with open('NiftyList.txt', 'w', encoding="utf-8") as f:
    f.write(resp.text)
I faced the same issue with the encoding; it occurs when you try to print it, read/write it, or open it. As others mentioned above, adding .encode("utf-8") will help if you are trying to print it:
soup.encode("utf-8")
If you are trying to open scraped data and maybe write it into a file, then open the file with encoding="utf-8":
with open(filename_csv, 'w', newline='', encoding="utf-8") as csv_file:
For those still getting this error, adding encode("utf-8") to soup will also fix this.
soup = BeautifulSoup(html_doc, 'html.parser').encode("utf-8")
print(soup)
There are multiple aspects to this problem. The fundamental question is which character set you want to output into. You may also have to figure out the input character set.
Printing (with either print or write) into a file with an explicit encoding="..." will translate Python's internal Unicode representation into that encoding. If the output contains characters which are not supported by that encoding, you will get an UnicodeEncodeError. For example, you can't write Russian or Chinese or Indic or Hebrew or Arabic or emoji or ... anything except a restricted set of some 200+ Western characters to a file whose encoding is "cp1252" because this limited 8-bit character set has no way to represent these characters.
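To make that concrete, here is a minimal sketch (the file name is made up) that triggers the error with cp1252 and shows the errors= escape hatch:
# Writing a character that cp1252 cannot represent.
with open("demo.txt", "w", encoding="cp1252") as f:
    f.write("Hello")  # fine: plain ASCII is a subset of cp1252
    try:
        f.write("声")  # raises UnicodeEncodeError: cp1252 has no byte for 声
    except UnicodeEncodeError as exc:
        print(exc)
    f.write("声".encode("cp1252", errors="replace").decode("cp1252"))  # writes "?"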
Basically the same problem will occur with any 8-bit character set, including nearly all the legacy Windows code pages (437, 850, 1250, 1251, etc etc), though some of them support some additional script in addition to or instead of English (1251 supports Cyrillic, for example, so you can write Russian, Ukrainian, Serbian, Bulgarian, etc). An 8-bit encoding has only a maximum of 256 character codes and no way to represent a character which isn't among them.
Perhaps now would be a good time to read Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
On platforms where the terminal is not capable of printing Unicode (only Windows these days really, though if you're into retrocomputing, this problem was also prevalent on other platforms in the previous millennium) attempting to print Unicode strings can also produce this error, or output mojibake. If you see something like HÃ©llÃ¶ instead of Héllö, this is your issue.
In short, then, you need to know:
What is the character set of the page you scraped, or the data you received? Was it correctly scraped? Did the originator correctly identify its encoding, or are you able to otherwise obtain this information (or guess it)? Some web sites incorrectly declare a different character set than the page actually contains, some sites have incorrectly configured the connection between the web server and a back-end database. See e.g. scrape with correct character encoding (python requests + beautifulsoup) for a more detailed example with some solutions.
What is the character set you want to write? If printing to the screen, is your terminal correctly configured, and is your Python interpreter configured identically?
Perhaps see also How to display utf-8 in windows console
If you are here, probably the answer to one of these questions is not "UTF-8". This is increasingly becoming the prevalent encoding for web pages, too, though the former standard was ISO-8859-1 (aka Latin-1) and more recently Windows code page 1252.
Going forward, you basically want all your textual data to be Unicode, outside of a few fringe use cases. Generally, that means UTF-8, though on Windows (or if you need Java compatibility), UTF-16 is also vaguely viable, albeit somewhat cumbersome. (There are several other Unicode serialization formats, which may be useful in specialized circumstances. UTF-32 is technically trivial, but takes up a lot more memory; UTF-7 is used in a few network protocols where 7-bit ASCII is required for transport.)
Perhaps see also https://utf8everywhere.org/
Naturally, if you are printing to a file, you also need to examine that file using a tool which can correctly display it. A common pilot error is to open the file using a tool which only displays the currently selected system encoding, or one which tries to guess the encoding, but guesses wrong. Again, a common symptom when viewing UTF-8 text using Windows code page 1252 would be, for example, Héllö displaying as HÃ©llÃ¶.
If the encoding of character data is unknown, there is no simple way to automatically establish it. If you know what the text is supposed to represent, you can perhaps infer it, but this is typically a manual process with some guesswork involved. (Automatic tools like chardet and ftfy can help, but they get it wrong some of the time, too.)
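For example, a short sketch with the third-party chardet library (the file name is hypothetical, and the guess is only a guess):
import chardet  # pip install chardet
with open("mystery.txt", "rb") as f:
    raw = f.read()
guess = chardet.detect(raw)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
print(guess)
if guess["encoding"]:  # chardet returns None when it has no idea
    text = raw.decode(guess["encoding"])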
To establish which encoding you are looking at, it can be helpful if you can identify the individual bytes in a character which isn't displayed correctly. For example, if you are looking at H\x8ell\x9a but expect it to represent Héllö, you can look up the bytes in a translation table. I have published one such table at https://tripleee.github.io/8bit where you can see that in this example, it's probably one of the legacy Mac 8-bit character sets; with more data points, perhaps you can narrow it down to just one of them (and if not, any one of them will do in practice, since all the code points you care about map to the same Unicode characters).
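A small sketch of that kind of detective work, brute-forcing a few candidate 8-bit encodings over the example bytes above:
mystery = b"H\x8ell\x9a"
for enc in ("mac_roman", "cp1252", "cp850", "latin-1"):
    print(enc, "->", mystery.decode(enc, errors="replace"))
# mac_roman -> Héllö, the expected text, so a legacy Mac encoding is a likely match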
Python 3 on most platforms defaults to UTF-8 for all input and output, but on Windows, this is commonly not the case. It will then instead default to the system's default encoding (still misleadingly called "ANSI code page" in some Microsoft documentation), which depends on a number of factors. On Western systems, the default encoding out of the box is commonly Windows code page 1252.
(Earlier Python versions had somewhat different expectations, and in Python 2, the internal string representation was not Unicode.)
If you are on Windows and write UTF-8 to a text file, maybe specify encoding="utf-8-sig" which adds a BOM sequence at the beginning of the file. This is strictly speaking not necessary or correct, but some Windows tools need it to correctly identify the encoding.
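A one-line sketch of that variant (the file name is made up):
# utf-8-sig writes the BOM bytes EF BB BF before the text, which helps
# e.g. Notepad and Excel identify the encoding.
with open("out.txt", "w", encoding="utf-8-sig") as f:
    f.write("Héllö 大声")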
Several of the earlier answers here suggest blindly applying some encoding, but hopefully this should help you understand how that's not generally the correct approach, and how to figure out - rather than guess - which encoding to use.
From Python 3.7 onwards, set the environment variable PYTHONUTF8 to 1.
The following script also sets other useful System Environment Variables:
setx /m PYTHONUTF8 1
setx PATHEXT "%PATHEXT%;.PY" ; In CMD, a Python file can then be executed without typing its extension.
setx /m PY_PYTHON 3.10 ; To set the default Python version for py
I got the same error, so I used encoding="utf-8" and it solved the error.
This generally happens when the text data contains some unidentified symbol or pattern that the encoder does not understand.
with open("text.txt", "w", encoding='utf-8') as f:
    f.write(data)
This should solve your problem.
If you are using Windows, try passing encoding='latin1', encoding='iso-8859-1' or encoding='cp1252'.
Example:
csv_data = pd.read_csv(csvpath, encoding='iso-8859-1')
print(soup.encode('iso-8859-1'))

UnicodeDecodeError "charmap" when trying to import a .json file into python [duplicate]


Python 3.5: Exporting Chinese Characters

I have tried several times to export Chinese from list variables to a csv or txt file and ran into problems.
Specifically, I have already set the encoding to utf-8 or utf-16 when reading the data and writing it to the file. However, I noticed that I cannot do this when my Windows 7 base language is English, even if I change the language setting to Chinese. When I run the Python programs under Windows 7 with Chinese as the base language, I can successfully export and show Chinese perfectly.
I am wondering why that happens, and is there any solution that helps me show Chinese characters in the exported file when running the Python programs under English-based Windows?
I just found that you need to do 2 things to achieve this:
Change the Windows display language to Chinese.
Use encoding UTF-16 in the writing process, as sketched below.
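A minimal sketch of step 2, with a made-up list variable:
# Write Chinese strings from a list with an explicit UTF-16 encoding.
words = ["我是美国人", "大声"]  # hypothetical data
with open("out.txt", "w", encoding="utf-16") as f:
    for w in words:
        f.write(w + "\n")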
Here's US Windows 10, running a Python IDE called PythonWin. There are no problems with Chinese.
Here's the same program running in a Windows console. Note that the US code page default for the console is cp437; cp65001 is UTF-8. Switching to an encoding that supports Chinese text is the key. The text below was cut-and-pasted directly from the console. While the characters display correctly when pasted to Stack Overflow, the console font didn't support Chinese and didn't actually render them.
C:\>chcp
Active code page: 437
C:\>x
Traceback (most recent call last):
File "C:\\x.py", line 5, in <module>
print(f.read())
File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-5: character maps to <undefined>
C:\>chcp 65001
Active code page: 65001
C:\>type test.txt
我是美国人。
C:\>x
我是美国人。
Notepad displays the output file correctly as well.
Either use an IDE that supports UTF-8, or write your output to a file and read it with a tool like Notepad.
Ways to get the Windows console to actually output Chinese are the win-unicode-console package, and changing the Language and Region settings, Administrative tab, system locale to Chinese. For the latter, Windows will remain English, but the Windows console will use a Chinese code page instead of an English one.
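For the former, usage is a single call at startup; a sketch using the win-unicode-console package (installed separately with pip):
import win_unicode_console
win_unicode_console.enable()  # swaps in Unicode-aware console streams
print("我是美国人。")  # now prints correctly in cmd.exe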

hi-ascii characters python string

I am always perplexed by the whole hi-ascii handling in Python 2.x. I am currently facing an issue in which I have a string with hi-ascii characters in it. I have a few questions related to it.
How can a string store hi-ascii characters in it (not a unicode string, but a normal str in Python 2.x), which I thought could handle only ASCII chars? Does Python internally convert the hi-ascii to something else?
I have a CLI which I spawn as a subprocess from my Python code. When I pass this string to the CLI, it works fine, while if I encode this string to utf-8, the CLI fails (this string is a password, so it fails saying the password is invalid).
For the second point, I actually did a bit of research and found the following:
1) In Windows (sucks), the command line args are encoded in mbcs (sys.getfilesystemencoding). The thing I still don't get is: if I read the same string using raw_input, it is encoded in the Windows console encoding (on EN Windows, it was cp437).
I have a different question that I am confused about now regarding Windows encoding. Is the Windows sys.stdin.encoding different from the Windows console encoding?
If yes, is there a pythonic way to figure out what my Windows console encoding is? I need this because when I read input using raw_input, it is encoded in the Windows console encoding, and I want to convert it to, say, utf-8. Note: I have already set my sys.stdin.encoding to utf-8, but it doesn't seem to have any effect on the read input.
To answer your first question, Python 2.x byte strings contain the source-encoded bytes of the characters, meaning the exact bytes used to store the string on disk in the source file. For example, here is a Python 2.x program where the source is saved in Windows-1252 encoding (Notepad's default on US Windows):
#!python2
#coding:windows-1252
s = 'æüÿ€éêè'
u = u'æüÿ€éêè'
print repr(s)
print repr(u)
Output:
'\xe6\xfc\xff\x80\xe9\xea\xe8'
u'\xe6\xfc\xff\u20ac\xe9\xea\xe8'
The byte string contains the bytes that represent the characters in Windows-1252.
Python decodes that same sequence of bytes using the declared source encoding (#coding:windows-1252) into Unicode codepoints for the Unicode string. Since Windows-1252 is very close to iso-8859-1, and iso-8859-1 is a 1:1 mapping to the first 0-255 Unicode codepoints, the code points are almost the same, except for the Euro character.
But save the source in a different encoding, and you'll get those bytes instead for the byte string:
#!python2
#coding:utf8
s = 'æüÿ€éêè'
u = u'æüÿ€éêè'
print repr(s)
print repr(u)
Output:
'\xc3\xa6\xc3\xbc\xc3\xbf\xe2\x82\xac\xc3\xa9\xc3\xaa\xc3\xa8'
u'\xe6\xfc\xff\u20ac\xe9\xea\xe8'
So, Python 2.X just gives you the source code bytes directly, without decoding them to Unicode codepoints, like a Unicode string would do.
Python 3.X notes that this is confusing, and just forbids non-ASCII characters in byte strings:
#!python3
#coding:utf8
s = b'æüÿ€éêè'
u = 'æüÿ€éêè'
print(repr(s))
print(repr(u))
Output:
File "C:\test.py", line 3
s = b'æüÿ\u20acéêè'
^
SyntaxError: bytes can only contain ASCII literal characters.
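In Python 3 you instead encode the text explicitly when you need bytes; a short sketch in the same style as the examples above:
#!python3
#coding:utf8
u = 'æüÿ€éêè'
s = u.encode('utf8')  # explicit text -> bytes conversion
print(repr(s))  # b'\xc3\xa6\xc3\xbc\xc3\xbf\xe2\x82\xac\xc3\xa9\xc3\xaa\xc3\xa8'
print(repr(s.decode('utf8')))  # and back: bytes -> text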
To answer your second question, please edit your question to show an example that demonstrates the problem.
Is the windows sys.stdin.encoding different from Windows console encoding?
Yes. There are two locale-specific codepages:
the ANSI code page, aka mbcs, used for strings in Win32 ...A APIs and hence for C runtime operations like reading the command line;
the IO code page, used for stdin/stdout/stderr streams.
These do not have to be the same encoding, and typically they aren't. In my locale (UK), the ANSI code page is 1252 and the IO code page defaults to 850. You can change the console code page using the chcp command, so you can make the two encodings match using eg chcp 1252 before running the Python command.
(You also have to be using a TrueType font in the console for chcp to make any difference.)
is there a pythonic way to figure out what my windows console encoding is.
Python reads it at startup using the Win32 API GetConsoleOutputCP and—unless overridden by PYTHONIOENCODING—writes that to the property sys.stdout.encoding. (Similarly GetConsoleCP for stdin though they will generally be the same code page.)
If you need to read this directly, regardless of whether PYTHONIOENCODING is set, you might have to use ctypes to call GetConsoleOutputCP directly.
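A sketch of that ctypes call (Windows only; GetConsoleCP and GetConsoleOutputCP are real Win32 APIs, but treat the snippet as illustrative):
import ctypes
kernel32 = ctypes.windll.kernel32
print(kernel32.GetConsoleCP())  # console input code page, e.g. 850
print(kernel32.GetConsoleOutputCP())  # console output code page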
Note: I have already set my sys.stdin.encoding to utf-8, but it doesnt seem to make any effect in the read input.
(How have you done that? It's a read-only property.)
Although you can certainly treat input and output as UTF-8 at your end, the Windows console won't supply or display content in that encoding. Most other tools you call via the command line will also be treating their input as encoded in the IO code page, so would misinterpret any UTF-8 sent to them.
You can affect what code page the console side uses by calling the Win32 SetConsoleCP/SetConsoleOutputCP APIs with ctypes (equivalent of the chcp command and also requires TTF console font). In principle you should be able to set code page 65001 and get something that is nearly UTF-8. Unfortunately long-standing console bugs usually make this approach infeasible.
windows(sucks)
yes.

Wrong encoding in filenames created on Windows XP by Python script

My Python script creates an xml file under Windows XP, but that file doesn't get the right encoding for Spanish characters such as 'ñ' or some accented letters.
First of all, the filename is read from an Excel cell with the following code; I use the xlrd library to read the Excel file:
filename = excelsheet.cell_value(rowx=first_row, colx=5)
Then I tried some encodings, without success, to generate the file with the right encoding:
filename = filename[:-1].encode("utf-8")
filename = filename[:-1].encode("latin1")
filename = filename[:-1].encode("windows-1252")
Using "windows-1252" I get a bad encoding with the letters 'ñ', 'í' and 'é'. For example, instead of BAJO ARAGÓN_Alcañiz.xml I got the filename with those letters garbled.
Thanks in advance for your help
You should use unicode strings for your filenames. In general operating systems support filenames that contain arbitrary Unicode characters. So if you do:
fn = u'ma\u00d1o' # maÑo
f = open(fn, "w")
f.close()
f = open(fn, "r")
f.close()
it should work just fine. A different thing is what you see in your terminal when you list the contents of the directory where that file lives. If the encoding of the terminal is UTF-8 you will see the filename maÑo, but if the encoding is for instance iso-8859-1 you will see maÃo. But even if you see these strange characters you should be able to open the file from Python as described above.
In summary, do not encode the output of
filename = excelsheet.cell_value(rowx=first_row, colx=5)
instead make sure it is a unicode string.
Reading the Unicode filenames section of the Python Unicode HOWTO can be helpful for you.
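Putting that together, a sketch under the question's setup (excelsheet, first_row and the [:-1] slice are taken from the question):
# -*- coding: utf-8 -*-
# Python 2.x: xlrd returns text cells as unicode, so pass the unicode
# filename straight to open() without encoding it yourself.
filename = excelsheet.cell_value(rowx=first_row, colx=5)
assert isinstance(filename, unicode)
with open(filename[:-1], 'w') as f:
    f.write('test')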
Trying your answers, I found a fast solution: port my script from Python 2.7 to Python 3.3, the reason being that Python 3 works in Unicode by default.
I had to make some little changes to my code, starting with the import of the xlrd libraries (previously I had to install xlrd3):
import xlrd3 as xlrd
Also, I had to convert the content from 'bytes' to 'string' using str instead of encode():
filename = str(filename[:-1])
Now my script works perfectly and generates the files on Windows XP without strange characters.
First, if you have not, please read http://www.joelonsoftware.com/articles/Unicode.html -
Now, "latin-1" should work for Spanish encoding under Windows - there are two hypotheses there: the strings you are trying to "encode" to either encoding are not Unicode strings, but are already in some encoding. That, however, would more likely give you a UnicodeDecodeError than strange characters, but it might work in some corner case.
The more likely case is that you are checking your files using the Windows prompt, AKA "CMD" -
Well, for some reason, Microsoft Windows does use two different encodings for the system - one from inside "native" Windows programs, which should be compatible with latin1, and another one for legacy DOS programs, in which category it puts the command prompt. For Portuguese, this second encoding is "cp852" (looking around, cp852 does not define "ñ" - but cp850 does).
So, this happens:
>>> print u"Aña".encode("latin1").decode("cp850")
A±a
>>>
So, if you want your filenames to appear correctly from the DOS prompt, you should encode them using "cp850" - if you want them to look right from Windows programs, do encode them using "cp1252" (or "latin1" or "iso-8859-15" - they are almost the same, give or take the "€" symbol).
Of course, instead of trying to guess and picking one that looks good, which will fail if someone runs your program in Norway, Russia, or on a POSIX system, you should just do
import sys
encoding = sys.getfilesystemencoding()
(This should return one of the above for you - again, the filename will look right if seen from a Windows program, not from a DOS shell.)
In Windows, the file system uses UTF-16, so no explicit encoding is required. Just use a Unicode string for the filename, and make sure to declare the encoding of the source file.
# coding: utf8
with open(u'BAJO ARAGÓN_Alcañiz.xml','w') as f:
    f.write('test')
Also, even though, for example, Ó isn't supported by the cp437 encoding of my US Windows system, my console font supports the character and it still displays correctly on my console. The console supports Unicode, but non-Unicode programs can only read/write code page characters.
