I'm working on a series of parsers where I get a bunch of tracebacks from my unit tests like:
File "c:\Python31\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 112: character maps to <undefined>
The files are opened with open() with no extra arguments. Can I pass extra arguments to open() or use something in the codecs module to open these differently?
This came up with code that was written in Python 2 and converted to 3 with the 2to3 tool.
UPDATE: it turns out this is a result of feeding a zipfile into the parser. The unit test actually expects this to happen. The parser should recognize it as something that can't be parsed. So, I need to change my exception handling. In the process of doing that now.
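For reference, a minimal sketch of that kind of exception handling (parse_file is a hypothetical stand-in for the real parser, and the encoding is pinned to utf-8 to keep the example deterministic):

```python
def parse_file(path):
    # Hypothetical parser entry point: opening in text mode raises
    # UnicodeDecodeError when fed binary data such as a zipfile
    with open(path, encoding='utf-8') as f:
        return f.read().splitlines()

def try_parse(path):
    try:
        return parse_file(path)
    except UnicodeDecodeError:
        return None  # binary input is expected to be unparseable
```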
Byte 0x81 is unassigned in Windows-1252 (aka cp1252). It is assigned to the U+0081 HIGH OCTET PRESET (HOP) control character in Latin-1 (aka ISO 8859-1). I can reproduce your error in Python 3.1 like this:
>>> b'\x81'.decode('cp1252')
Traceback (most recent call last):
...
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined>
or with an actual file:
>>> open('test.txt', 'wb').write(b'\x81\n')
2
>>> open('test.txt').read()
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 0: unexpected code byte
Now to treat this file as Latin-1 you pass the encoding argument, like codeape suggested:
>>> open('test.txt', encoding='latin-1').read()
'\x81\n'
Beware that there are differences between the Windows-1252 and Latin-1 encodings; e.g. Latin-1 doesn't have “smart quotes”. If the file you're processing is a text file, ask yourself what that \x81 is doing in it.
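To make that difference concrete (a sketch): byte 0x93 is one of those smart quotes in Windows-1252, but only an unnamed C1 control character in Latin-1:

```python
# The same byte decodes to visibly different characters:
print(b'\x93'.decode('cp1252'))   # U+201C LEFT DOUBLE QUOTATION MARK
print(b'\x93'.decode('latin-1'))  # U+0093, an invisible C1 control character
```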
You can relax the error handling.
For instance:
f = open(filename, encoding="...", errors="replace")
Or:
f = open(filename, encoding="...", errors="ignore")
See the docs.
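A sketch of what the two error handlers do to an undecodable byte such as 0x81:

```python
data = b'abc\x81def'
# "replace" substitutes U+FFFD REPLACEMENT CHARACTER for the bad byte
print(data.decode('cp1252', errors='replace'))
# "ignore" silently drops it
print(data.decode('cp1252', errors='ignore'))
```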
EDIT:
But are you certain that the problem is in reading the file? Could it be that the exception happens when something is written to the console? Check http://wiki.python.org/moin/PrintFails
All files are "not Unicode". Unicode is an internal representation which must be encoded. You need to determine for each file what encoding has been used, and specify that where necessary when the file is opened.
As the traceback and error message indicate, the file in question is NOT encoded in cp1252.
If it is encoded in latin1, the "\x81" that it is complaining about is a C1 control character that doesn't even have a name in Unicode. Consider latin1 extremely unlikely to be valid.
You say "some of the files are parsed with xml.dom.minidom" -- parsed successfully or unsuccessfully?
A valid XML file should declare its encoding (default is UTF-8) in the first line, and you should not need to specify an encoding in your code. Show us the code that you are using to do the xml.dom.minidom parsing.
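For reference, when minidom is handed bytes, the underlying expat parser honours the declared encoding on its own (a sketch, not the asker's code):

```python
from xml.dom import minidom

# expat reads the encoding from the XML declaration;
# no open(..., encoding=...) is needed for well-formed XML
raw = b"<?xml version='1.0' encoding='iso-8859-1'?><r>\xf6</r>"
doc = minidom.parseString(raw)
print(doc.documentElement.firstChild.data)  # the latin-1 byte 0xF6 comes out as 'ö'
```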
"others read directly as iterables" -- sample code please.
Suggestion: try opening some of each type of file in your browser. Then click View and click Character Encoding (Firefox) or Encoding (Internet Explorer). What encoding has the browser guessed [usually reliably]?
Other possible encoding clues: What languages are used in the text in the files? Where did you get the files from?
Note: please edit your question with clarifying information; don't answer in the comments.
When running the following code (which just prints out file names):
print filename
It throws the following error:
File "myscript.py", line 78, in __listfilenames
print filename
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 13: ordinal not in range(128)
So to fix this, I tried changing print filename to print filename.encode('utf-8') which didn't fix the problem.
The script only fails when trying to read a filename such as Coé.jpg.
Any ideas how I can modify filename so the script continues to work when it comes across a special character?
NB. I'm a python noob
filename is already encoded. It is already a byte string and doesn't need encoding again.
But since you asked it to be encoded, Python first has to decode it for you, and it can only do that with the default ASCII encoding. That implicit decoding fails:
>>> 'Coé.jpg'
'Co\xc3\xa9.jpg'
>>> 'Coé.jpg'.decode('utf8')
u'Co\xe9.jpg'
>>> 'Coé.jpg'.decode('utf8').encode('utf8')
'Co\xc3\xa9.jpg'
>>> 'Coé.jpg'.encode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)
If you wanted encoded bytestrings, you don't have to do any encoding at all. Remove the .encode('utf8').
You probably need to read up on Python and Unicode. I recommend:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
The Python Unicode HOWTO
The rule of thumb is: decode as early as you can, encode as late as you can. That means: when you receive data, decode it to Unicode objects; only when you need to pass that information on to something else do you encode. Many APIs do the decoding and encoding as part of their job; print, for example, will encode to the codec used by the terminal.
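In Python 3 notation, that rule looks like this (a sketch; the filename is made up):

```python
raw = b'Co\xc3\xa9.jpg'                 # bytes as they arrive from the outside world
name = raw.decode('utf-8')              # decode as early as you can
stem = name.rsplit('.', 1)[0]           # do all the work on text
out = (stem + '.png').encode('utf-8')   # encode as late as you can
print(out)
```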
I have read a lot now on the topic of UTF-8 encoding in Python 3 but it still doesn't work, and I can't find my mistake.
My code looks like this
def main():
    with open("test.txt", "rU", encoding='utf-8') as test_file:
        text = test_file.read()
        print(str(len(text)))

if __name__ == "__main__":
    main()
My test.txt file looks like this
ö
And I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 0: invalid start byte
Your file is not UTF-8 encoded. Byte 0xF6 is not valid UTF-8; it is, however, the encoding for ö in both Latin-1 and CP-1252:
>>> b'\xf6'.decode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 0: invalid start byte
>>> b'\xf6'.decode('latin1')
'ö'
You'll need to save that file as UTF-8 instead, with whatever tool you used to create that file.
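If you'd rather re-encode it with Python itself, a sketch (assuming the file really is Latin-1; here we create such a file first so the example is self-contained):

```python
# Suppose test.txt was saved as Latin-1 (create one for the demonstration):
with open('test.txt', 'wb') as f:
    f.write(b'\xf6\n')

# Read with the encoding the file actually uses...
with open('test.txt', encoding='latin-1') as f:
    text = f.read()

# ...then write it back out as UTF-8
with open('test.txt', 'w', encoding='utf-8') as f:
    f.write(text)
```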
If open('test.txt').read() works, then you were able to decode the file using the default system encoding. See the open() function documentation:
encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any encoding supported by Python can be used.
That is not to say that you were reading the file using the correct encoding; that just means that the default encoding didn't break (encountered bytes for which it doesn't have a character mapping). It could still be mapping those bytes to the wrong characters.
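A sketch of that failure mode: the same byte decodes without error in several single-byte encodings, just to different characters:

```python
raw = b'\xf6'
print(raw.decode('latin-1'))  # 'ö'
print(raw.decode('cp437'))    # '÷' -- no error either, but a different character
```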
I urge you to read up on Unicode and Python:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
I'm writing a program to iterate over my Robocopy log (>25 MB). It's far from ready, because I'm stuck on a problem.
The problem is that after iterating over ~1700 lines of my log, I get a UnicodeDecodeError:
Traceback (most recent call last):
File "C:/Users/xxxxxx.xxxxxx/SkyDrive/#Python/del_robo2.py", line 6, in <module>
for line in data:
File "C:\Python33\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7869: character maps to <undefined>
The program looks as follows:
x = "Error"
y = 1
arry = []
data = open("Ausstellungen.txt", mode="r")
for line in data:
    arry = line.split("\t")
    print(y)
    y = y + 1
    if x in arry:
        print("found")
        print(line)
data.close()
If I reduce the txt file to 1000 lines then the program works.
If I delete line 1500 to 3000 and run again, I get again the unicode error around line 1700.
So have I made an error or is this some memory limiting problem of Python?
Given your data & snippet, I would be surprised if this is a memory issue. It's more likely the encoding: Python is using your system's default encoding to read the file, which is "cp1252" (the default MS Windows encoding), but the file contains byte sequences/bytes which cannot be decoded in that encoding. A candidate for the file's actual encoding might be "latin-1", which you can make Python 3 use by saying
open("Ausstellungen.txt",mode="r", encoding="latin-1")
A possibly similar issue is Python 3 chokes on CP-1252/ANSI reading. A nice talk about the whole thing is here: http://nedbatchelder.com/text/unipain.html
Python decodes all file data to Unicode values. You didn't specify an encoding to use, so Python uses the default for your system, the cp1252 Windows Latin codepage.
However, that is the wrong encoding for your file data. You need to specify an explicit codec to use:
data = open("Ausstellungen.txt",mode="r", encoding='UTF8')
What encoding to use exactly, is unfortunately something you need to figure out yourself. I used UTF-8 as an example codec.
Be aware that some versions of RoboCopy have problems producing valid output.
If you don't yet know what Unicode is, or want to know about encodings, see:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
The reason you see the error crop up for different parts of your file is that your data contains more than one codepoint that the cp1252 encoding cannot handle.
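If you want to pinpoint those offending bytes before settling on a codec, a sketch (reading in binary mode, which never raises decode errors; the byte-at-a-time check is only valid for single-byte codecs like cp1252):

```python
def find_undecodable(path, encoding='cp1252'):
    # Report the offsets of bytes the given single-byte codec cannot decode
    with open(path, 'rb') as f:
        raw = f.read()
    bad = []
    for i, b in enumerate(raw):
        try:
            bytes([b]).decode(encoding)
        except UnicodeDecodeError:
            bad.append((i, hex(b)))
    return bad
```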
I have some code that converts a file containing Unicode escape sequences of Hebrew text into Hebrew for display,
for example:
f = open(sys.argv[1])
for line in f:
    print eval('u"' + line +'"')
This works fine when I run it in PyDev (Eclipse), but when I run it from the command line, I get
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 9-10: ordinal not in range(256)
An example line from the input file is:
\u05d9\u05d5\u05dd
What is the problem? How can I solve this?
Do not use eval(); instead use the unicode_escape codec to interpret that data:
for line in f:
    line = line.decode('unicode_escape')
The unicode_escape encoding interprets \uabcd character sequences the same way Python would when parsing a unicode literal in the source code:
>>> '\u05d9\u05d5\u05dd'.decode('unicode_escape')
u'\u05d9\u05d5\u05dd'
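In Python 3, where lines read from a text-mode file are already str, the same codec is reached by way of bytes (a sketch):

```python
# In Python 3 the file yields str, so round-trip through bytes first:
line = r'\u05d9\u05d5\u05dd'
decoded = line.encode('ascii').decode('unicode_escape')
print(decoded)  # the Hebrew letters yud, vav, final mem
```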
The exception you see is not caused by the eval() statement though; I suspect it is being caused by an attempt to print the result instead. Python will try to encode unicode values automatically and will detect what encoding the current terminal uses.
Your Eclipse output window uses a different encoding from your terminal; if the latter is configured to support Latin-1 then you'll see that exact exception, as Python tries to encode Hebrew codepoints to an encoding that doesn't support those:
>>> u'\u05d9\u05d5\u05dd'.encode('latin1')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-2: ordinal not in range(256)
The solution is to reconfigure your terminal (UTF-8 would be a good choice), or to not print unicode values with codepoints that cannot be encoded to Latin-1.
If you are redirecting output from Python to a file, then Python cannot determine the output encoding automatically. In that case you can use the PYTHONIOENCODING environment variable to tell Python what encoding to use for standard I/O:
PYTHONIOENCODING=utf-8 python yourscript.py > outputfile.txt
Thank you, this solved my problem.
line.decode('unicode_escape')
did the trick.
Followup - This now works, but if I try to send the output to a file:
python myScript.py > textfile.txt
The file itself has the error:
'ascii' codec can't encode characters in position 42-44: ordinal not in range(128)
How should I write "mąka" in Python without an exception?
I've tried var= u"mąka" and var= unicode("mąka") etc... nothing helps
I have coding definition in first line in my document, and still I've got that exception:
'utf8' codec can't decode byte 0xb1 in position 0: unexpected code byte
Save the following 2 lines into write_mako.py:
# -*- encoding: utf-8 -*-
open(u"mąka.txt", 'w').write("mąka\n")
Run:
$ python write_mako.py
mąka.txt file that contains the word mąka should be created in the current directory.
If it doesn't work then you can use chardet to detect actual encoding of the file (see chardet example usage):
import chardet
print chardet.detect(open('write_mako.py', 'rb').read())
In my case it prints:
{'confidence': 0.75249999999999995, 'encoding': 'utf-8'}
The # -*- coding: -*- line must specify the encoding the source file is saved in. This error message:
'utf8' codec can't decode byte 0xb1 in position 0: unexpected code byte
indicates you aren't saving the source file in UTF-8. You can save your source file in any encoding that supports the characters you are using in the source code, just make sure you know what it is and have an appropriate coding line.
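As a hint, the 0xB1 byte from the error message is exactly what ISO-8859-2 (latin2, a common encoding for Polish) uses for ą, so the file was probably saved in that encoding while the coding line claimed UTF-8. A sketch of the mismatch:

```python
print('m\u0105ka'.encode('utf-8'))   # b'm\xc4\x85ka' -- what the utf8 codec expects
print('m\u0105ka'.encode('latin2'))  # b'm\xb1ka'     -- what was likely in the file
```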
What exception are you getting?
You might try saving your source code file as UTF-8, and putting this at the top of the file:
# coding=utf-8
That tells Python that the file’s saved as UTF-8.
This code works for me, saving the file as UTF-8:
v = u"mąka"
print repr(v)
The output I get is:
u'm\u0105ka'
Please copy and paste the exact error you are getting. If you are getting this error:
UnicodeEncodeError: 'charmap' codec can't encode character ... in position ...: character maps to <undefined>
Then you are trying to output the character somewhere that does not support UTF-8 (e.g. your shell's character encoding is set to something other than UTF-8).