I am trying to get the GenomeDiagram functionality of Biopython to work, but it currently fails.
This is the output; I'm not sure what the error means. Any suggestions?
======================================================================
ERROR: test_partial_diagram (test_GenomeDiagram.DiagramTest)
construct and draw SVG and PDF for just part of a SeqRecord.
----------------------------------------------------------------------
Traceback (most recent call last):
File "./test_GenomeDiagram.py", line 662, in test_partial_diagram
assert open(output_filename).read().replace("\r\n", "\n") \
File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/codecs.py", line 300, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 11: invalid start byte
Your data file is composed of bytes encoded in some encoding other than UTF-8. You need to specify the right encoding:
open(output_filename, encoding=...)
There is no entirely reliable way for us to tell you what encoding it should be. But since
>>> b'\x93'.decode('cp1252')
'“'
(and since the quotation mark is a pretty common character) you might want to try using
open(output_filename, encoding='cp1252')
on line 662 of test_GenomeDiagram.py.
UTF-8 is a variable-length encoding. When a character requires multiple bytes to encode, the second and subsequent bytes have the form 10xxxxxx, and no initial byte (or single-byte character) has that form. As such, 0x93 can never be the first byte of a UTF-8 character. The error message is telling you that your buffer contains an invalid UTF-8 byte sequence.
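The continuation-byte rule can be checked directly, a byte is a UTF-8 continuation byte exactly when its top two bits are 10 (i.e. `byte & 0xC0 == 0x80`):

```python
# UTF-8 continuation bytes have the form 10xxxxxx, i.e. (byte & 0xC0) == 0x80,
# and no lead byte has that form -- so 0x93 can never start a character.
def is_continuation(byte):
    return byte & 0xC0 == 0x80

print(is_continuation(0x93))           # True: 0x93 = 0b10010011
print(bytes([0x93]).decode("cp1252"))  # in cp1252, 0x93 is the left curly quote
```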
Related
I'm working on a series of parsers where I get a bunch of tracebacks from my unit tests like:
File "c:\Python31\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 112: character maps to <undefined>
The files are opened with open() with no extra arguments. Can I pass extra arguments to open() or use something in the codecs module to open these differently?
This came up with code that was written in Python 2 and converted to 3 with the 2to3 tool.
UPDATE: it turns out this is a result of feeding a zipfile into the parser. The unit test actually expects this to happen. The parser should recognize it as something that can't be parsed. So, I need to change my exception handling. In the process of doing that now.
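A sketch of the exception handling I'm moving to (`parse_lines` here is a hypothetical stand-in for the actual parser):

```python
def parse_lines(path):
    # Hypothetical stand-in for the real parser: reads the file as UTF-8 text.
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

def safe_parse(path):
    try:
        return parse_lines(path)
    except UnicodeDecodeError:
        # Binary input (e.g. a zipfile) is not decodable text; turn the codec
        # error into the "can't be parsed" signal the unit test expects.
        raise ValueError("%s does not look like a text file" % path)
```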
Position 0x81 is unassigned in Windows-1252 (aka cp1252). It is assigned to U+0081 HIGH OCTET PRESET (HOP) control character in Latin-1 (aka ISO 8859-1). I can reproduce your error in Python 3.1 like this:
>>> b'\x81'.decode('cp1252')
Traceback (most recent call last):
...
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined>
or with an actual file:
>>> open('test.txt', 'wb').write(b'\x81\n')
2
>>> open('test.txt').read()
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 0: unexpected code byte
Now to treat this file as Latin-1 you pass the encoding argument, like codeape suggested:
>>> open('test.txt', encoding='latin-1').read()
'\x81\n'
Beware that there are differences between Windows-1252 and Latin-1 encodings, e.g. Latin-1 doesn't have “smart quotes”. If the file you're processing is a text file, ask yourself what that \x81 is doing in it.
You can relax the error handling.
For instance:
f = open(filename, encoding="...", errors="replace")
Or:
f = open(filename, encoding="...", errors="ignore")
See the docs.
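The difference between the two handlers, sketched on a single bad byte:

```python
data = b"caf\xe9"  # 'café' in Latin-1; the \xe9 byte is invalid as UTF-8 here

print(data.decode("utf-8", errors="replace"))  # bad byte becomes U+FFFD: 'caf\ufffd'
print(data.decode("utf-8", errors="ignore"))   # bad byte silently dropped: 'caf'
```

With "replace" you can at least see where the undecodable bytes were; with "ignore" they vanish without a trace.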
EDIT:
But are you certain that the problem is in reading the file? Could it be that the exception happens when something is written to the console? Check http://wiki.python.org/moin/PrintFails
Files are not Unicode; they contain bytes. Unicode is an internal representation which must be encoded to bytes for storage. You need to determine for each file what encoding has been used, and specify that where necessary when the file is opened.
As the traceback and error message indicate, the file in question is NOT encoded in cp1252.
If it is encoded in latin1, the "\x81" that it is complaining about is a C1 control character that doesn't even have a name (in Unicode). Consider latin1 extremely unlikely to be valid.
You say "some of the files are parsed with xml.dom.minidom" -- parsed successfully or unsuccessfully?
A valid XML file should declare its encoding (default is UTF-8) in the first line, and you should not need to specify an encoding in your code. Show us the code that you are using to do the xml.dom.minidom parsing.
"others read directly as iterables" -- sample code please.
Suggestion: try opening some of each type of file in your browser. Then click View and click Character Encoding (Firefox) or Encoding (Internet Explorer). What encoding has the browser guessed [usually reliably]?
Other possible encoding clues: What languages are used in the text in the files? Where did you get the files from?
Note: please edit your question with clarifying information; don't answer in the comments.
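For reference, when xml.dom.minidom is given raw bytes, the XML declaration governs the decoding, so no encoding argument is needed in your code (a minimal sketch):

```python
from xml.dom import minidom

# Raw bytes in: the XML declaration tells the parser to decode as latin-1,
# so the 0xE9 byte comes out as the character 'é'.
raw = b"<?xml version='1.0' encoding='iso-8859-1'?><root>caf\xe9</root>"
doc = minidom.parseString(raw)
print(doc.documentElement.firstChild.data)
```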
I am reading a large file in Python line by line with readline(). After reaching close to line 672,280 I get an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfd in position 228:
invalid start byte.
However, I have searched the file using grep for the byte 0xfd and it found nothing. I also wrote C++ code to go through the file looking for a 0xfd byte and still got nothing. So I have no idea what is going on here. Is it an error because the file is too big?
I just don't see how a decoding error can happen for a byte that isn't in the file.
Thanks
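One way to double-check is to search for the byte in binary mode, with no text decoding involved at all (a diagnostic sketch; the filename is a placeholder):

```python
def find_byte(path, target=0xFD):
    # Read raw bytes -- no codec is involved, so nothing can be mis-read.
    with open(path, "rb") as f:
        data = f.read()
    return data.find(bytes([target]))  # offset of the first match, or -1
```

If this returns -1 but readline() still fails, the failing byte (or the failing file) is probably not the one being searched.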
You can try opening the file with the ISO-8859-1 encoding:
open('myfile.txt', encoding='ISO-8859-1')
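Note that ISO-8859-1 assigns a character to every possible byte value, so opening with it can never raise a UnicodeDecodeError, even when it is the wrong encoding for the file:

```python
# ISO-8859-1 maps every byte 0x00-0xFF to a character, so decoding with it
# always succeeds -- whether or not it is the encoding the file was written in.
all_bytes = bytes(range(256))
text = all_bytes.decode("ISO-8859-1")
print(len(text))  # 256: one character per byte, no error possible
```

This makes it a convenient "just get me the data" fallback, but it can silently produce the wrong characters.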
I have to make my python script read a DNA query strings file and do a search with it.
Well, the file contains characters that my console displays as a diamond replacement character, and Python's default encoding cannot read such a line with the readline() function for files. The following error is raised:
[...]
File "/usr/lib/python3.4/codecs.py", line 319, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 860: invalid start byte
I have tried with utf_16 and ascii too but with no positive results. How can I read this?
You need to first figure out the actual encoding of the text file you have to read, then use open() with the correct encoding argument. The diamond (�) is simply a placeholder character in your console: your default system encoding is incompatible with the file you have displayed (and vice versa).
Alternatively, if you do not care about the "junk" characters, you can simply pass 'ignore' or 'replace' for the errors argument. Again, please consult the documentation first for the options available.
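To narrow down the actual encoding, one option is to try a few candidates and see which ones decode the whole file cleanly (a sketch; the candidate list is an assumption):

```python
def guess_encodings(path, candidates=("utf-8", "utf-16", "cp1252", "latin-1")):
    # Try each candidate and keep the ones that decode the whole file cleanly.
    # Decoding without error does not prove an encoding is right, but an
    # error does prove it is wrong -- this only narrows the field.
    with open(path, "rb") as f:
        data = f.read()
    ok = []
    for enc in candidates:
        try:
            data.decode(enc)
            ok.append(enc)
        except UnicodeDecodeError:
            pass
    return ok
```

Latin-1 will always survive this filter (it accepts any byte), so treat it as the fallback rather than a positive identification.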
I have read a lot now on the topic of UTF-8 encoding in Python 3 but it still doesn't work, and I can't find my mistake.
My code looks like this
def main():
    with open("test.txt", "rU", encoding='utf-8') as test_file:
        text = test_file.read()
        print(str(len(text)))

if __name__ == "__main__":
    main()
My test.txt file looks like this
ö
And I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 0: invalid start byte
Your file is not UTF-8 encoded. The byte F6 is the encoding for ö in both Latin-1 and CP-1252:
>>> b'\xf6'.decode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 0: invalid start byte
>>> b'\xf6'.decode('latin1')
'ö'
You'll need to save that file as UTF-8 instead, with whatever tool you used to create that file.
If open('text').read() works, then you were able to decode the file using the default system encoding. See the open() function documentation:
encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any encoding supported by Python can be used.
That is not to say that you were reading the file using the correct encoding; that just means that the default encoding didn't break (encountered bytes for which it doesn't have a character mapping). It could still be mapping those bytes to the wrong characters.
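You can see which default your platform would use, which explains why the same script can work on one machine and fail on another (a sketch):

```python
import locale

# open() with no encoding argument uses this platform-dependent default.
# Passing encoding= explicitly makes a script behave the same everywhere.
default = locale.getpreferredencoding(False)
print(default)
```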
I urge you to read up on Unicode and Python:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
I got a UnicodeDecodeError:
'utf8' codec can't decode byte 0xe5 in position 1923: invalid continuation byte
I have used the Danish letter "å" in my template. How can I solve the problem, so that I can use non-English letters in my Django project and database?
I can get a similar error (mentioning the same byte value) doing this:
>>> 'å'.encode('latin-1')
b'\xe5'
>>> _.decode('utf-8')
Traceback (most recent call last):
File "<pyshell#18>", line 1, in <module>
_.decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe5 in position 0: unexpected end of data
This implies that your data is encoded in latin-1 rather than utf-8. In general, there are two solutions: if you have control over your input data, re-save it as UTF-8; otherwise, when you read the data in Python, set the encoding to latin-1. For a Django template, you should be able to use the first: the editor you use should have an 'encoding' option somewhere; change it to utf-8, re-save, and everything should work.
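If re-saving in an editor is not an option, the re-encoding can also be done in Python (a sketch; the paths are placeholders):

```python
def reencode(src_path, dst_path):
    # Read under the encoding the file was actually written in (latin-1 here),
    # then write the same text back out as UTF-8.
    with open(src_path, encoding="latin-1") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)
```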
This helped me: https://stackoverflow.com/a/23278373/2571607
Basically, open C:\Python27\Lib\mimetypes.py and replace
default_encoding = sys.getdefaultencoding()
with
if sys.getdefaultencoding() != 'gbk':
    reload(sys)
    sys.setdefaultencoding('gbk')
default_encoding = sys.getdefaultencoding()
(Note that reload(sys) and sys.setdefaultencoding() only exist on Python 2; this trick does not apply to Python 3.)