Python: Ascii characters from file display wrong - python

Here's my code:
import sys, os
print("█████")  # <-- Those are solid blocks.
f = open('file.txt')
for line in f:
    print(line)
In file.txt is this:
hay hay, guys
████████████
But the output is this:
██████
hay hay, guys <----- ***Looks like it outputted this correctly!***
Traceback (most recent call last):
File "echofile.py", line 6, in <module>
print(line)
File "C:\python33\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 1-2: character maps to <undefined> <------ ***But not from the file!***
Anybody have any suggestions as to why it is doing this? I wrote the code in IDLE, tried editing the file.txt in both Programmer's Notepad and IDLE. The file is ASCII / ANSI.
I'm using Python 3, by the way. 3.3 alpha win-64 if it matters.

This is clearly an issue with character encodings.
In Python 3.x, all strings are Unicode. But when reading or writing a file, the text must be translated to or from some specific byte encoding.
By default, a Python source file is handled as UTF-8. I don't know exactly what characters you pasted into your source file for the blocks, but whatever it is, Python reads it as UTF-8 and it seems to work. Maybe your text editor converted to valid UTF-8 when you inserted those?
The backtrace suggests that Python is treating the input file as "Code Page 437" or the original IBM PC 8-bit character set. Is that correct?
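One way to confirm that guess is to ask Python directly which encodings it will use. A small diagnostic sketch (not from the original answer):

```python
import locale
import sys

# Encoding used when printing to the console (e.g. 'cp437' on a US Windows console)
print(sys.stdout.encoding)

# Default encoding for open() when no encoding= argument is given
print(locale.getpreferredencoding())
```

If the first line prints cp437, that explains exactly why the traceback goes through cp437.py.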
This link shows how to set a specific decoder to handle a particular file encoding on input:
http://lucumr.pocoo.org/2010/2/11/porting-to-python-3-a-guide/
EDIT: I found a better resource:
http://docs.python.org/release/3.0.1/howto/unicode.html
And based on that, here's some sample code:
with open('mytextfile.txt', encoding='utf-8') as f:
    for line in f:
        print(line, end='')
Originally I had the above set to "cp437" but in a comment you said "utf-8" was correct, so I made that change to this example. I'm specifying end='' here because the input lines from the file already have a newline on the end, so we don't need print() to supply another newline.
EDIT: I found a short discussion of default encodings here:
http://docs.python.org/release/3.0.1/whatsnew/3.0.html
The important bit: "There is a platform-dependent default encoding, which on Unixy platforms can be set with the LANG environment variable (and sometimes also with some other platform-specific locale-related environment variables). In many cases, but not all, the system default is UTF-8; you should never count on this default."
So, I had thought that Python defaulted to UTF-8, but not always, it seems. Actually, from your stack backtrace, I think on your system with your LANG environment setting you are getting "cp437" as your default.
So, I learned something too by answering your question!
P.S. I changed the code example above to specify utf-8 since that is what you needed.
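If the console encoding itself turns out to be the obstacle, on Python 3.7+ the already-open stream can be switched in place. A sketch, assuming sys.stdout is a regular io.TextIOWrapper:

```python
import sys

# Sketch (Python 3.7+): reconfigure stdout to UTF-8 so characters like the
# solid blocks survive printing even when the console default is cp437.
# errors='replace' keeps printing from crashing if the terminal still can't
# render a character.
sys.stdout.reconfigure(encoding='utf-8', errors='replace')
print("█████")
```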

Try making that string unicode:
print(u"█████")
(Note that in Python 3 all string literals are already Unicode, so the u prefix is a no-op on 3.3+ and a syntax error on 3.0-3.2.)

How to filter Unicode characters [duplicate]

I'm trying to get a Python 3 program to do some manipulations with a text file filled with information. However, when trying to read the file I get the following error:
Traceback (most recent call last):
File "SCRIPT LOCATION", line NUMBER, in <module>
text = file.read()
File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2907500: character maps to <undefined>
The file in question is not using the CP1252 encoding. It's using another encoding. Which one, you have to figure out yourself. Common ones are Latin-1 and UTF-8. Since 0x90 is only an unprintable control character in Latin-1, UTF-8 (where 0x90 is a valid continuation byte) is more likely.
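The difference can be demonstrated directly. A small sketch showing how the same byte behaves under each codec:

```python
raw = bytes([0x90])

# cp1252 leaves 0x90 undefined, so decoding raises
try:
    raw.decode('cp1252')
except UnicodeDecodeError as e:
    print(e)

# latin-1 maps every byte, so it "succeeds" (0x90 becomes a C1 control character)
print(repr(raw.decode('latin-1')))

# In UTF-8, 0x90 on its own is invalid, but it is a legal continuation byte,
# e.g. inside the two-byte encoding of U+0410 (Cyrillic 'А'):
print(b'\xd0\x90'.decode('utf-8'))
```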
You specify the encoding when you open the file:
file = open(filename, encoding="utf8")
If file = open(filename, encoding="utf-8") doesn't work, try
file = open(filename, errors="ignore") if you want to strip out the unneeded characters. (docs)
Alternatively, if you don't need to decode the file, such as uploading the file to a website, use:
open(filename, 'rb')
where r = reading, b = binary
As an extension to Lennart Regebro's answer:
If you can't tell what encoding your file uses, the solution above does not work (it's not utf-8), and you find yourself merely guessing, there are online tools you can use to identify the encoding. They aren't perfect, but usually work just fine. After you figure out the encoding, you should be able to use the solution above.
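Lacking an online tool, a brute-force sketch like this (a hypothetical helper, not from the answer) can narrow the guess by trying candidate encodings in order:

```python
def guess_encoding(raw, candidates=('utf-8', 'cp1252', 'latin-1')):
    """Return the first candidate encoding that decodes raw without error."""
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

print(guess_encoding('naïve'.encode('utf-8')))  # → utf-8
```

Note that this only proves decodability, not correctness: cp1252 and latin-1 accept almost any byte stream, so list the candidates from strictest to most permissive.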
EDIT: (Copied from comment)
A quite popular text editor, Sublime Text, has a command to display the encoding if it has been set...
Go to View -> Show Console (or Ctrl+`)
Type view.encoding() into the field at the bottom and hope for the best (I was unable to get anything but Undefined, but maybe you will have better luck...)
TLDR: Try: file = open(filename, encoding='cp437')
Why? When one uses:
file = open(filename)
text = file.read()
Python assumes the file uses the same codepage as the current environment (cp1252 in the case of the opening post) and tries to decode it to str (Unicode). If the file contains bytes with values not defined in this codepage (like 0x90), we get UnicodeDecodeError. Sometimes we don't know the encoding of the file, sometimes the file's encoding may be unhandled by Python (like e.g. cp790), and sometimes the file contains mixed encodings.
If such characters are unneeded, one may decide to replace them with the Unicode replacement character (U+FFFD), with:
file = open(filename, errors='replace')
Another workaround is to use:
file = open(filename, errors='ignore')
The offending characters are then silently dropped, but other errors will be masked too.
A very good solution is to specify the encoding, yet not any encoding (like cp1252), but the one which has ALL characters defined (like cp437):
file = open(filename, encoding='cp437')
Codepage 437 is the original DOS encoding. All codes are defined, so there are no errors while reading the file, no errors are masked out, the characters are preserved (not quite left intact but still distinguishable).
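The claim that cp437 defines all 256 byte values can be checked directly. A quick sketch:

```python
# Every possible byte value decodes under cp437, so reading with this codec
# can never raise UnicodeDecodeError (though it may still mis-render the text).
every_byte = bytes(range(256))
text = every_byte.decode('cp437')
assert len(text) == 256

# The mapping is one-to-one, so it round-trips back to the original bytes:
assert text.encode('cp437') == every_byte
```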
Stop wasting your time: just add encoding="cp437" and errors='ignore' to your code for both reading and writing:
open('filename.csv', encoding="cp437", errors='ignore')
open(file_name, 'w', newline='', encoding="cp437", errors='ignore')
Godspeed
For me, opening with the utf16 encoding worked:
file = open('filename.csv', encoding="utf16")
For those working in Anaconda on Windows, I had the same problem. Notepad++ helped me to solve it.
Open the file in Notepad++. In the bottom right it will tell you the current file encoding.
In the top menu, next to "View", locate "Encoding". Under "Encoding" go to "Character sets" and patiently look for the encoding that you need. In my case, the encoding "Windows-1252" was found under "Western European".
Before you apply the suggested solution, you can check which character appeared in your file (and in the error log), in this case 0x90: https://unicodelookup.com/#0x90/1 (or directly at the Unicode Consortium site http://www.unicode.org/charts/ by searching for 0x0090), and then consider removing it from the file.
def read_files(file_path):
    with open(file_path, encoding='utf8') as f:
        text = f.read()
    return text
OR (AND)
def write_files(text, file_path):
    with open(file_path, 'wb') as f:
        f.write(text.encode('utf8', 'ignore'))
In newer versions of Python (starting with 3.7), you can add the interpreter option -Xutf8, which should fix your problem. If you use PyCharm, just go to Run > Edit Configurations (in the Configuration tab, change the value of the Interpreter options field to -Xutf8).
Or, equivalently, you can set the environment variable PYTHONUTF8 to 1.
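Whether UTF-8 mode is actually active can be checked from inside the program. A tiny sketch:

```python
import sys

# 1 when UTF-8 mode is on (python -Xutf8 or PYTHONUTF8=1), 0 otherwise (Python 3.7+)
print(sys.flags.utf8_mode)
```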
For me, changing the MySQL character encoding to match my code helped to sort out the solution: photo = open('pic3.png', encoding='latin1')

Unicode Error when opening text file - Geany

I'm trying to create a little program that reads the contents of two stories, Alice in Wonderland & Moby Dick, and then counts how many times the word 'the' is found in each story.
However I'm having issues with getting Geany text editor to open the files. I've been creating and using my own small text files with no issues so far.
with open('alice_test.txt') as a_file:
    contents = a_file.readlines()
print(contents)
I get the following error:
Traceback (most recent call last):
File "add_cats_dogs.py", line 50, in <module>
print(contents)
File "C:\Users\USER\AppData\Local\Programs\Python\Python35-32\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2018' in position 279: character maps to <undefined>
As I said, no issues experienced with any small homemade text files.
Strangely enough, when I execute the above code in Python IDLE, I have no problems, even if I change the text file's encoding between UTF-8 and ANSI.
I tried encoding the text file as UTF-8 and ANSI, and I also checked to make sure the default encoding of Geany is UTF-8 (also tried without using the default encoding), as well as using and not using fixed encoding when opening non-Unicode files.
I get the same error every time. The text file was from gutenberg.org, I tried using another file from there and got the same issue.
I know it must be some sort of issue between Geany and the text file, but I can't figure out what.
EDIT: I found a sort of fix.
Here is the text that was giving me problems: https://www.gutenberg.org/files/11/11-0.txt
Here is the text that I can use without problems: http://www.textfiles.com/etext/FICTION/alice13a.txt
Top one is encoded in UTF-8, bottom one is encoded in windows-1252. I would've imagined the reverse to be true, but for whatever reason the UTF-8 encoding seems to be causing the problem.
What OS do you use? There are similar problems on Windows. If that's your case, you can try running chcp 65001 before your command in the console. You can also add # encoding: utf-8 at the top of your .py file. I hope this helps, because I can't reproduce the same encoding problem with the .txt file from gutenberg.org on my machine.
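The traceback above points at '\u2018', a curly left quote common in Gutenberg UTF-8 texts, which cp437 cannot represent. A sketch (with a hypothetical sample string) reproducing and sidestepping the failure:

```python
s = 'the \u2018Rabbit-Hole\u2019'  # hypothetical line with curly quotes

# Encoding to cp437 (the console codepage in the traceback) fails:
try:
    s.encode('cp437')
except UnicodeEncodeError as e:
    print(e)

# errors='replace' substitutes '?' on encoding, so printing can proceed:
print(s.encode('cp437', errors='replace').decode('cp437'))  # → the ?Rabbit-Hole?
```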

UnicodeError Reading Accentuation Portuguese Characters from File

Preface:
It's a cold, rainy day in mid-2016, and a developer is still having encoding issues with Python for not using Python 3.0. Will the great S.O. community help him? I don't know, we will have to wait and see.
Scope:
I have a UTF-8 encoded file that contains words with accentuation, such as CURRÍCULO and NÓS. For some reason I can't grasp, I can't manage to read them properly using Python 2.7.
Code Snippet:
import codecs
keywords = []
f_reader = codecs.open('PATH_TO_FILE/Data/Input/kw.txt', 'r', encoding='utf-8')
for line in f_reader:
    keywords.append(line.strip().upper())
    print line
The output I get is:
TRABALHE CONOSCO
ENVIE SEU CURRICULO
ENVIE SEU CURRÍCULO
UnicodeEncodeError, 'ascii' codec can't encode character u'\xcd' in position 14: ordinal not in range(128)
Encoding, Encoding, Encoding:
I have used Notepad++ to convert the file to both regular UTF-8 and UTF-8 without the byte order mark, and it shows me the characters just fine, without any issue. I'm using Windows, by the way, which will create files as ANSI by default.
Question:
What should I do to be able to read this file properly, including the í and ó and other accented characters?
Just to make it clearer, I want to keep the accentuation on the strings I use in memory.
Update:
Here's the List of Keywords, in memory, read from the file using the code you can see.
The problem seems not to be in the reading, but in the printing. You said:
I'm using Windows, by the way, which will create files as ANSI by default.
I think that includes printing to stdout. Try changing the sys.stdout codec:
sys.stdout = codecs.getwriter("utf-8")(sys.stdout)

Python 3: Why am I getting a UnicodeDecodeError, or is this a memory issue?

I'm writing a program to iterate over my Robocopy log (>25 MB). It's far from ready, because I'm stuck on a problem.
The problem is that after iterating over ~1700 lines of my log, I get a "UnicodeError":
Traceback (most recent call last):
File "C:/Users/xxxxxx.xxxxxx/SkyDrive/#Python/del_robo2.py", line 6, in <module>
for line in data:
File "C:\Python33\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7869: character maps to <undefined>
The program looks as follows:
x = "Error"
y = 1
arry = []
data = open("Ausstellungen.txt", mode="r")
for line in data:
    arry = line.split("\t")
    print(y)
    y = y + 1
    if x in arry:
        print("found")
        print(line)
data.close()
If I reduce the txt file to 1000 lines then the program works.
If I delete lines 1500 to 3000 and run it again, I again get the Unicode error around line 1700.
So have I made an error, or is this some memory-limiting problem of Python?
Given your data and snippet, I would be surprised if this is a memory issue. It's more likely the encoding: Python is using your system's default encoding to read the file, which is "cp1252" (the default MS Windows encoding), but the file contains bytes which cannot be decoded in that encoding. A candidate for the file's actual encoding might be "latin-1", which you can make Python 3 use by saying
open("Ausstellungen.txt",mode="r", encoding="latin-1")
A possibly similar issue is Python 3 chokes on CP-1252/ANSI reading. A nice talk about the whole thing is here: http://nedbatchelder.com/text/unipain.html
Python decodes all file data to Unicode values. You didn't specify an encoding to use, so Python uses the default for your system, the cp1252 Windows Latin codepage.
However, that is the wrong encoding for your file data. You need to specify an explicit codec to use:
data = open("Ausstellungen.txt",mode="r", encoding='UTF8')
What encoding to use exactly, is unfortunately something you need to figure out yourself. I used UTF-8 as an example codec.
Be aware that some versions of RoboCopy have problems producing valid output.
If you don't yet know what Unicode is, or want to know about encodings, see:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
The reason you see the error crop up for different parts of your file is that your data contains more than one byte that the cp1252 encoding cannot decode.
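Since cp1252 is a single-byte encoding, those offending positions can be located byte by byte. A hypothetical helper sketch, not from the answers above:

```python
def undecodable_offsets(raw, encoding='cp1252'):
    """Return the offsets of bytes that a single-byte encoding cannot decode."""
    bad = []
    for i in range(len(raw)):
        try:
            raw[i:i + 1].decode(encoding)
        except UnicodeDecodeError:
            bad.append(i)
    return bad

# 0x81 and 0x90 are both undefined in cp1252:
print(undecodable_offsets(b'abc\x81def\x90'))  # → [3, 7]
```

Running this over the raw bytes of the log (open the file with 'rb') shows every spot that would trigger the error, not just the first one.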

UnicodeDecodeError only with cx_freeze

I get the error "UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 7338: ordinal not in range(128)" once I try to run the program after I freeze my script with cx_freeze. If I run the Python 3 script normally it runs fine; only after I freeze it and try to run the executable does it give me this error. I would post my code, but I don't know exactly which parts to post, so if there are any parts that would help, just let me know and I will post them. It seems like I have had this problem once before and solved it, but it has been a while and I can't remember exactly what the problem was or how I fixed it, so any help or pointers to get me going in the right direction would be greatly appreciated. Thanks in advance.
Tell us exactly which version of Python on what platform.
Show the full traceback that you get when the error happens. Look at it yourself. What is the last line of your code that appears? What do you think is the bytes string that is being decoded? Why is the ascii codec being used??
Note that automatic conversion of bytes to str with a default codec (e.g. ascii) is NOT done by Python 3.x. So either you are doing it explicitly or cx_freeze is.
Update after further info in comments.
Excel does not save csv files in ASCII. It saves them in what MS calls "the ANSI codepage", which varies by locale. If you don't know what yours is, it is probably cp1252. To check, do this:
>>> import locale; print(locale.getpreferredencoding())
cp1252
If Excel did save files in ASCII, your offending '\xa0' byte would have been replaced by '?' and you would not be getting a UnicodeDecodeError.
Saving your files in UTF-8 would need you to open your files with encoding='utf8' and would have the same problem (except that you'd get a grumble about 0xc2 instead of 0xa0).
You don't need to post all four of your csv files on the web. Just run this little script (untested):
import sys
for filename in sys.argv[1:]:
    for lino, line in enumerate(open(filename), 1):
        if '\xa0' in line:
            print(ascii(filename), lino, ascii(line))
The '\xa0' is a NO-BREAK SPACE aka ... you may want to edit your files to change these to ordinary spaces.
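If the files are to be cleaned up programmatically rather than by hand, a minimal sketch of that substitution (with a hypothetical sample line):

```python
# Hypothetical cleanup: normalize NO-BREAK SPACE (U+00A0) to a plain space
dirty = 'Total:\xa0100'
clean = dirty.replace('\xa0', ' ')
print(clean)  # → Total: 100
```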
Probably you will need to ask on the cx_freeze mailing list to get an answer to why this error is happening. They will want to know the full traceback. Get some practice -- show it here.
By the way, "offset 7338" is rather large -- do you expect lines that long in your csv file? Perhaps something is reading all of your file ...
That error itself indicates that you have a byte string containing a byte that isn't a normal ASCII character:
>>> b'abc\xa0'.decode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 3: ordinal not in range(128)
I certainly don't know why this would only happen when a script is frozen. You could wrap the whole script in a try/except and manually print out all or part of the string in question.
EDIT: here's how that might look
try:
    # ... your script here
except UnicodeDecodeError as e:
    print("Exception happened in string '...%s...'" % (e.object[e.start-50:e.start+51],))
    raise
Fix by setting the default encoding (note: this works on Python 2 only; sys.setdefaultencoding does not exist in Python 3):
reload(sys)
sys.setdefaultencoding("utf-8")
Use the decode() method for those lines, and specify the encoding explicitly, like myString.decode('cp1252').
Look also: http://docs.python.org/release/3.0.1/howto/unicode.html#unicode-howto
