I'm trying to create a little program that reads the contents of two stories, Alice in Wonderland & Moby Dick, and then counts how many times the word 'the' is found in each story.
However, I'm having issues getting the Geany text editor to open the files. I've been creating and using my own small text files with no issues so far.
with open('alice_test.txt') as a_file:
    contents = a_file.readlines()
    print(contents)
I get the following error:
Traceback (most recent call last):
File "add_cats_dogs.py", line 50, in <module>
print(contents)
File "C:\Users\USER\AppData\Local\Programs\Python\Python35-32\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2018' in position 279: character maps to <undefined>
As I said, no issues experienced with any small homemade text files.
Strangely enough, when I execute the above code in Python IDLE, I have no problems, even if I change the text file's encoding between UTF-8 and ANSI.
I tried encoding the text file as both UTF-8 and ANSI. I also checked to make sure the default encoding of Geany is UTF-8 (and tried without using the default encoding), as well as using and not using a fixed encoding when opening non-Unicode files.
I get the same error every time. The text file was from gutenberg.org, I tried using another file from there and got the same issue.
I know it must be some sort of issue between Geany and the text file, but I can't figure out what.
EDIT: I found a sort of fix.
Here is the text that was giving me problems: https://www.gutenberg.org/files/11/11-0.txt
Here is the text that I can use without problems: http://www.textfiles.com/etext/FICTION/alice13a.txt
The top one is encoded in UTF-8, the bottom one in windows-1252. I would've imagined the reverse to be true, but for whatever reason the UTF-8 encoding seems to be causing the problem.
What OS do you use? There are similar problems on Windows. If so, you can try running chcp 65001 before your command in the console. You can also add # encoding: utf-8 at the top of your .py file. Hope this helps, because I can't reproduce the same encoding problem with the .txt file from gutenberg.org on my machine.
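For what it's worth, here is a minimal sketch of the kind of script described above, assuming the Gutenberg file is UTF-8 and that the word count only needs to be approximate; the errors='replace' trick at the end is one way to keep print() from raising on a cp437 console:

import sys

# Minimal sketch, assuming the Gutenberg text is saved as UTF-8.
with open('alice_test.txt', encoding='utf-8') as a_file:
    contents = a_file.read()

# Rough count of the word 'the' (splits on whitespace, so 'the,' is missed).
print(contents.lower().split().count('the'))

# If you need to print the raw text on a console that can't encode it
# (e.g. cp437), replace unencodable characters instead of raising.
encoding = sys.stdout.encoding or 'utf-8'
print(contents[:200].encode(encoding, errors='replace').decode(encoding))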
Related
I have a file that is called: Abrázame.txt
I want to decode this so that Python understands what this 'á' character is and will print Abrázame.txt for me.
This is the code I have in a scratch file:
import os
s = os.path.join(r'C:\Test\AutoTest', os.listdir(r'C:\\Test\\AutoTest')[0])
print(unicode(s.decode(encoding='utf-16', errors='strict')))
The error I get from the above is:
Traceback (most recent call last):
File "C:/Users/naythan_onfri/.PyCharmCE2017.2/config/scratches/scratch_3.py", line 12, in <module>
print(unicode(s.decode(encoding='utf-16', errors='strict')))
File "C:\Python27\lib\encodings\utf_16.py", line 16, in decode
return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x74 in position 28: truncated data
I have looked up the UTF-16 character set and it does indeed have the 'á' character in it. So why is it that this string cannot be decoded with UTF-16?
Also, I know that 'latin-1' will work and produce the string I'm looking for; however, since this is for an automation project, I want to ensure that any filename with any registered character can be decoded and used for other things within the project, for example:
"Opening up file explorer at the directory of the file with the file already selected."
Is looping through each of the codecs (mind you, I believe there are 93 codecs) to find whichever one can decode the string the best way of getting the result I'm looking for? I figure there is something far better than that solution.
You want to decode at the edge when you first read a string so that you don't have surprises later in your code. At the edge, you have some reasonable chance of guessing what that encoding is. For this code, the edge is
os.listdir(r'C:\\Test\\AutoTest')[0]
and you can get the current file system directory encoding. So,
import os
import sys

fs_encoding = sys.getfilesystemencoding()
s = os.path.join(r'C:\Test\AutoTest',
                 os.listdir(r'C:\Test\AutoTest')[0].decode(encoding=fs_encoding, errors='strict'))
print(s)
Note that once you decode you have a unicode string and you don't need to build a new unicode() object from it.
latin-1 works if that's your current code page. It's an interesting curiosity that even though Windows has supported "wide" characters with "W" versions of its API for many years, Python 2 is single-byte-character based and doesn't use them.
Long live python 3.
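For comparison, here is a minimal Python 3 sketch of the same thing (using the same directory as above): os.listdir() on a str path already returns decoded str objects, so the explicit .decode() step disappears.

import os

# In Python 3, os.listdir() on a str path returns str objects already
# decoded with the file system encoding, so no .decode() is needed.
path = r'C:\Test\AutoTest'
s = os.path.join(path, os.listdir(path)[0])
print(s)  # e.g. C:\Test\AutoTest\Abrázame.txt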
I have been trying several times to export Chinese text from list variables to a csv or txt file and have run into problems.
Specifically, I have already set the encoding to utf-8 or utf-16 when reading the data and writing it to the file. However, I noticed that I cannot do that when my Windows 7's base language is English, even if I change the language setting to Chinese. When I run the Python programs under Windows 7 with Chinese as the base language, I can successfully export and show Chinese perfectly.
I am wondering why that happens, and whether there is any solution that would let me show Chinese characters in the exported file when running the Python programs under English-based Windows?
I just found that you need to do 2 things to achieve this:
Change the Windows display language to Chinese.
Use encoding UTF-16 in the writing process.
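As a sketch of the second point (the list variable and the output filename here are made up for illustration), writing with an explicit encoding looks something like this; io.open behaves the same way in Python 2 and 3:

# -*- coding: utf-8 -*-
import io

# 'rows' and 'output.txt' are made-up names for illustration.
rows = [u'我是美国人。', u'你好']

# Write the strings with an explicit UTF-16 encoding (point 2 above).
with io.open('output.txt', 'w', encoding='utf-16') as f:
    for row in rows:
        f.write(row + u'\n')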
Here's US Windows 10, running a Python IDE called PythonWin. There are no problems with Chinese.
Here's the same program running in a Windows console. Note the US code page default for the console is cp437; cp65001 is UTF-8. Switching to an encoding that supports Chinese text is the key. The text below was cut-and-pasted directly from the console. While the characters display correctly pasted into Stack Overflow, the console font didn't support Chinese, so they didn't actually render properly in the console itself.
C:\>chcp
Active code page: 437
C:\>x
Traceback (most recent call last):
File "C:\\x.py", line 5, in <module>
print(f.read())
File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-5: character maps to <undefined>
C:\>chcp 65001
Active code page: 65001
C:\>type test.txt
我是美国人。
C:\>x
我是美国人。
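For reference, here is a minimal sketch of what the x.py used above might look like, reconstructed from the traceback (the filename and the encoding argument are assumptions):

# x.py -- sketch reconstructed from the traceback above; the UTF-8
# encoding of test.txt is an assumption.
with open('test.txt', encoding='utf8') as f:
    print(f.read())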
Notepad displays the output file correctly as well.
Either use an IDE that supports UTF-8, or write your output to a file and read it with a tool like Notepad.
Ways to get the Windows console to actually output Chinese are the win-unicode-console package, or changing the system locale to Chinese (Language and Region settings, Administrative tab). For the latter, Windows will remain English, but the Windows console will use a Chinese code page instead of an English one.
I'm using BeautifulSoup to Parse some html, with Spyder as my editor (both brilliant tools by the way!). The code runs fine in Spyder, but when I try to execute the .py file from terminal, I get an error:
file = open('index.html','r')
soup = BeautifulSoup(file)
html = soup.prettify()
file1 = open('index.html', 'wb')
file1.write(html)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 5632: ordinal not in range(128)
I'm running OPENSUSE on a linux server, with Spyder installed using zypper.
Does anyone have any suggestions what the problem might be?
Many thanks.
That is because before outputting the result (i.e. writing it to the file) you must encode it first:
file1.write(html.encode('utf-8'))
See every file has an attribute file.encoding. To quote the docs:
file.encoding
The encoding that this file uses. When Unicode strings
are written to a file, they will be converted to byte strings using
this encoding. In addition, when the file is connected to a terminal,
the attribute gives the encoding that the terminal is likely to use
(that information might be incorrect if the user has misconfigured the
terminal). The attribute is read-only and may not be present on all
file-like objects. It may also be None, in which case the file uses
the system default encoding for converting Unicode strings.
See the last sentence? soup.prettify returns a Unicode object and given this error, I'm pretty sure you're using Python 2.7 because its sys.getdefaultencoding() is ascii.
Hope this helps!
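Putting it together, here is a minimal sketch of the corrected snippet, assuming BeautifulSoup 4 on Python 2.7 and that index.html is UTF-8 and should be prettified in place:

import io
from bs4 import BeautifulSoup

# Read the HTML with an explicit encoding (UTF-8 is an assumption here).
with io.open('index.html', 'r', encoding='utf-8') as f:
    soup = BeautifulSoup(f.read())

html = soup.prettify()   # a unicode object in Python 2

# Encode explicitly before writing bytes, so the default ascii codec
# is never involved.
with open('index.html', 'wb') as out:
    out.write(html.encode('utf-8'))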
Here's my code:
import sys, os
print("█████") #<-- Those are solid blocks.
f = open('file.txt')
for line in f:
    print(line)
In file.txt is this:
hay hay, guys
████████████
But the output is this:
██████
hay hay, guys <----- ***Looks like it outputted this correctly!***
Traceback (most recent call last):
File "echofile.py", line 6, in <module>
print(line)
File "C:\python33\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 1-2: character maps to <undefined> <------ ***But not from the file!***
Anybody have any suggestions as to why it is doing this? I wrote the code in IDLE, tried editing the file.txt in both Programmer's Notepad and IDLE. The file is ASCII / ANSI.
I'm using Python 3, by the way. 3.3 alpha win-64 if it matters.
This is clearly an issue with character encodings.
In Python 3.x, all strings are Unicode. But when reading or writing a file, it will be necessary to translate the Unicode to some specific encoding.
By default, a Python source file is handled as UTF-8. I don't know exactly what characters you pasted into your source file for the blocks, but whatever it is, Python reads it as UTF-8 and it seems to work. Maybe your text editor converted to valid UTF-8 when you inserted those?
The traceback suggests that Python is using "Code Page 437" (the original IBM PC 8-bit character set) when it prints to your console. Is that correct?
This link shows how to set a specific decoder to handle a particular file encoding on input:
http://lucumr.pocoo.org/2010/2/11/porting-to-python-3-a-guide/
EDIT: I found a better resource:
http://docs.python.org/release/3.0.1/howto/unicode.html
And based on that, here's some sample code:
with open('mytextfile.txt', encoding='utf-8') as f:
    for line in f:
        print(line, end='')
Originally I had the above set to "cp437" but in a comment you said "utf-8" was correct, so I made that change to this example. I'm specifying end='' here because the input lines from the file already have a newline on the end, so we don't need print() to supply another newline.
EDIT: I found a short discussion of default encodings here:
http://docs.python.org/release/3.0.1/whatsnew/3.0.html
The important bit: "There is a platform-dependent default encoding, which on Unixy platforms can be set with the LANG environment variable (and sometimes also with some other platform-specific locale-related environment variables). In many cases, but not all, the system default is UTF-8; you should never count on this default."
So, I had thought that Python defaulted to UTF-8, but not always, it seems. Actually, from your stack backtrace, I think on your system with your LANG environment setting you are getting "cp437" as your default.
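If you want to see which defaults are actually in effect on your system, a quick check like this (Python 3) prints them:

import sys, locale

print(sys.getdefaultencoding())       # default str/bytes encoding, 'utf-8' on 3.x
print(sys.stdout.encoding)            # what print() encodes to, e.g. 'cp437' in a US Windows console
print(locale.getpreferredencoding())  # the locale's preferred encoding, e.g. 'cp1252' on US Windows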
So, I learned something too by answering your question!
P.S. I changed the code example above to specify utf-8 since that is what you needed.
Try making that string unicode:
print(u"█████")
^ Add this
I get the error: "UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 7338: ordinal not in range(128)" once I try to run the program after I freeze my script with cx_freeze. If I run the Python 3 script normally it runs fine, but only after I freeze it and try to run the executable does it give me this error. I would post my code, but I don't know exactly what parts to post so if there are any certain parts that will help just let me know and I will post them, otherwise it seems like I have had this problem once before and solved it, but it has been a while and I can't remember what exactly the problem was or how I fixed it so any help or pointers to get me going in the right direction will help greatly. Thanks in advance.
Tell us exactly which version of Python on what platform.
Show the full traceback that you get when the error happens. Look at it yourself. What is the last line of your code that appears? What do you think is the bytes string that is being decoded? Why is the ascii codec being used??
Note that automatic conversion of bytes to str with a default codec (e.g. ascii) is NOT done by Python 3.x. So either you are doing it explicitly or cx_freeze is.
Update after further info in comments.
Excel does not save csv files in ASCII. It saves them in what MS calls "the ANSI codepage", which varies by locale. If you don't know what yours is, it is probably cp1252. To check, do this:
>>> import locale; print(locale.getpreferredencoding())
cp1252
If Excel did save files in ASCII, your offending '\xa0' byte would have been replaced by '?' and you would not be getting a UnicodeDecodeError.
Saving your files in UTF-8 would need you to open your files with encoding='utf8' and would have the same problem (except that you'd get a grumble about 0xc2 instead of 0xa0).
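Whatever encoding the files are actually in, pass it explicitly when you open them; a small sketch (the filename is a placeholder, and cp1252 is the assumption from above):

import csv

# Open the Excel-exported file with the ANSI code page it was actually
# saved in (cp1252 assumed; check locale.getpreferredencoding()).
with open('data.csv', encoding='cp1252', newline='') as f:
    for row in csv.reader(f):
        print(row)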
You don't need to post all four of your csv files on the web. Just run this little script (untested):
import sys
for filename in sys.argv[1:]:
    for lino, line in enumerate(open(filename), 1):
        if '\xa0' in line:
            print(ascii(filename), lino, ascii(line))
The '\xa0' is a NO-BREAK SPACE aka ... you may want to edit your files to change these to ordinary spaces.
Probably you will need to ask on the cx_freeze mailing list to get an answer to why this error is happening. They will want to know the full traceback. Get some practice -- show it here.
By the way, "offset 7338" is rather large -- do you expect lines that long in your csv file? Perhaps something is reading all of your file ...
That error itself indicates that you have a byte in a Python byte string that isn't a normal ASCII character:
>>> b'abc\xa0'.decode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 3: ordinal not in range(128)
I certainly don't know why this would only happen when a script is frozen. You could wrap the whole script in a try/except and manually print out all or part of the string in question.
EDIT: here's how that might look
try:
    # ... your script here
    pass
except UnicodeDecodeError as e:
    print("Exception happened in string '...%s...'" % (e.object[max(0, e.start - 50):e.start + 51],))
    raise
Fix by setting the default encoding (Python 2 only; reload(sys) is needed because site.py removes setdefaultencoding at startup):
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
Use the str.decode() function for those lines. You can also specify the encoding, e.g. myString.decode('cp1252').
See also: http://docs.python.org/release/3.0.1/howto/unicode.html#unicode-howto