'utf-8' decode error in tensorflow tutorial - python

I'm running into a bizarre problem. When I run
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('/home/fqiao/development/MNIST_data/', one_hot=True)
I get:
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.5/dist-packages/tensorflow/examples/tutorials/mnist/input_data.py", line 199, in read_data_sets
train_images = extract_images(local_file)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/examples/tutorials/mnist/input_data.py", line 58, in extract_images
magic = _read32(bytestream)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/examples/tutorials/mnist/input_data.py", line 51, in _read32
return numpy.frombuffer(bytestream.read(4), dtype=dt)[0]
File "/usr/lib/python3.5/gzip.py", line 274, in read
return self._buffer.read(size)
File "/usr/lib/python3.5/_compression.py", line 68, in readinto
data = self.read(len(byte_view))
File "/usr/lib/python3.5/gzip.py", line 461, in read
if not self._read_gzip_header():
File "/usr/lib/python3.5/gzip.py", line 404, in _read_gzip_header
magic = self._fp.read(2)
File "/usr/lib/python3.5/gzip.py", line 91, in read
self.file.read(size-self._length+read)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/default/_gfile.py", line 45, in sync
return fn(self, *args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/default/_gfile.py", line 199, in read
return self._fp.read(n)
File "/usr/lib/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
However, if I just run the code in input_data.py directly, everything appears to be fine:
>>> dt = numpy.dtype(numpy.uint32).newbyteorder('>')
>>> f = tf.gfile.Open('/home/fqiao/development/MNIST_data/train-images-idx3-ubyte.gz', 'rb')
>>> bytestream = gzip.GzipFile(fileobj=f)
>>> testbytes = numpy.frombuffer(bytestream.read(4), dtype=dt)[0]
>>> testbytes
2051
Does anyone have any idea what's going on?
My system: Ubuntu 15.10 x64, Python 3.5.0.

This bug was addressed by a recent change, 555e73d: the MNIST files need to be opened in binary mode ('rb') instead of text mode ('r').
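If you can't pick up that change yet, here is a minimal sketch of the same fix (the path is just the one from the question); it opens the gzip archive in binary mode before the reader ever touches it:
import gzip
import numpy

# MNIST headers are big-endian 32-bit integers.
dt = numpy.dtype(numpy.uint32).newbyteorder('>')
# 'rb' keeps the raw gzip bytes away from any text codec.
with open('/home/fqiao/development/MNIST_data/train-images-idx3-ubyte.gz', 'rb') as f:
    with gzip.GzipFile(fileobj=f) as bytestream:
        magic = numpy.frombuffer(bytestream.read(4), dtype=dt)[0]
print(magic)  # should print 2051 for the images file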

In my case the problem was the encoding of the data file. Open the file in vim and run:
:set fileencoding=utf-8
That solved it for me.

Related

How to filter non-UTF-8 HTML to get UTF-8 HTML?

http://www.jcpjournal.org/journal/view.html?doi=10.15430/JCP.2018.23.2.70
If I use the following Python code to parse the above HTML page, I get a UnicodeDecodeError.
from lxml import html
doc = html.parse(sys.stdin, parser = html.HTMLParser(encoding='utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 5365: invalid start byte
If I filter the input with iconv -f utf-8 -t utf-8 -c first and then run the same Python code, I still get a UnicodeDecodeError. What is a robust filter (without knowing the encoding of the input HTML) so that the filtered result always works with the Python code? Thanks.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 5418: invalid continuation byte
EDIT: Here are the commands used.
$ wget 'http://www.jcpjournal.org/journal/view.html?doi=10.15430/JCP.2018.23.2.70'
$ ./main.py < 'view.html?doi=10.15430%2FJCP.2018.23.2.70'
Traceback (most recent call last):
File "./main.py", line 6, in <module>
doc = html.parse(sys.stdin, parser = html.HTMLParser(encoding='utf-8'))
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/lxml/html/__init__.py", line 939, in parse
return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
File "src/lxml/etree.pyx", line 3519, in lxml.etree.parse
File "src/lxml/parser.pxi", line 1860, in lxml.etree._parseDocument
File "src/lxml/parser.pxi", line 1880, in lxml.etree._parseFilelikeDocument
File "src/lxml/parser.pxi", line 1775, in lxml.etree._parseDocFromFilelike
File "src/lxml/parser.pxi", line 1187, in lxml.etree._BaseParser._parseDocFromFilelike
File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 707, in lxml.etree._handleParseResult
File "src/lxml/etree.pyx", line 318, in lxml.etree._ExceptionContext._raise_if_stored
File "src/lxml/parser.pxi", line 370, in lxml.etree._FileReaderContext.copyToBuffer
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 5365: invalid start byte
$ iconv -f utf-8 -t utf-8 -c < 'view.html?doi=10.15430%2FJCP.2018.23.2.70' | ./main.py
Traceback (most recent call last):
File "./main.py", line 6, in <module>
doc = html.parse(sys.stdin, parser = html.HTMLParser(encoding='utf-8'))
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/lxml/html/__init__.py", line 939, in parse
return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
File "src/lxml/etree.pyx", line 3519, in lxml.etree.parse
File "src/lxml/parser.pxi", line 1860, in lxml.etree._parseDocument
File "src/lxml/parser.pxi", line 1880, in lxml.etree._parseFilelikeDocument
File "src/lxml/parser.pxi", line 1775, in lxml.etree._parseDocFromFilelike
File "src/lxml/parser.pxi", line 1187, in lxml.etree._BaseParser._parseDocFromFilelike
File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 707, in lxml.etree._handleParseResult
File "src/lxml/etree.pyx", line 318, in lxml.etree._ExceptionContext._raise_if_stored
File "src/lxml/parser.pxi", line 370, in lxml.etree._FileReaderContext.copyToBuffer
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 5418: invalid continuation byte
After digging I found that this file is not in utf-8 but in latin1, and the problem is that sys.stdin assumes utf-8. You can't change the encoding of sys.stdin directly; you have to wrap sys.stdin in a new stream with the desired encoding.
main-latin1.py
import sys
import io
from lxml import html
#input_stream = sys.stdin # gives error
input_stream = io.TextIOWrapper(sys.stdin.buffer, encoding='latin1')
doc = html.parse(input_stream)
print(html.tostring(doc))
And now you can run
cat 'view.html?doi=10.15430%2FJCP.2018.23.2.70' | python main-latin1.py
EDIT: You can also convert it in the console with iconv -f latin1 -t utf-8:
cat 'view.html?doi=10.15430%2FJCP.2018.23.2.70' | iconv -f latin1 -t utf-8 | python main-utf8.py
main-utf8.py
import sys
from lxml import html
doc = html.parse(sys.stdin)
print(html.tostring(doc))
BTW: requests has no problem reading it directly from the page:
import requests
from lxml import html
r = requests.get('http://www.jcpjournal.org/journal/view.html?doi=10.15430/JCP.2018.23.2.70')
doc = html.fromstring(r.text)
print(html.tostring(doc))
EDIT: You can also read the data as bytes and use a for loop with try/except to attempt decoding with different encodings.
Run it without <:
myscript filename.html
import sys
from lxml import html

# --- function ---

def decode(data, encoding):
    try:
        return data.decode(encoding)
    except:
        pass

# --- main ---

# only for test
#sys.argv.append('view.html?doi=10.15430%2FJCP.2018.23.2.70')

if len(sys.argv) == 1:
    print('need file name')
    exit(1)

data = open(sys.argv[1], 'rb').read()

for encoding in ('utf-8', 'latin1', 'cp1250'):
    result = decode(data, encoding)
    if result:
        print('encoding:', encoding)
        doc = html.fromstring(result)
        #print(html.tostring(doc))
        break
EDIT: I tried the chardet module (character detection), which requests itself uses, but it detects windows-1252 (cp1252) instead of latin1. For some reason requests still has no problem getting it right.
import sys
from lxml import html
import chardet

# only for test
#sys.argv.append('view.html?doi=10.15430%2FJCP.2018.23.2.70')

if len(sys.argv) == 1:
    print('need file name')
    exit(1)

data = open(sys.argv[1], 'rb').read()

encoding = chardet.detect(data)['encoding']
print('encoding:', encoding)

doc = html.fromstring(data.decode(encoding))
You could filter the input by decoding with errors='ignore' (in Python 3, data.decode('utf-8', errors='ignore'); the str = unicode(str, errors='ignore') trick from UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c is the Python 2 form). This is not always desirable, since unreadable characters are simply dropped, but it may be fine for your use case.
Or it seems lxml can use encoding='unicode' in certain cases. Have you tried that?
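For illustration, a minimal sketch of that errors='ignore' filter applied to the stdin pipeline from the question (the dropped-character caveat above still applies):
import sys
from lxml import html

# Read raw bytes from stdin and decode permissively: bytes that are not
# valid UTF-8 are silently dropped before lxml ever sees the text.
data = sys.stdin.buffer.read()
text = data.decode('utf-8', errors='ignore')
doc = html.fromstring(text)
print(html.tostring(doc))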

Encoding error when opening an Excel file with python xlrd module

I have some Excel files with the .xls extension. When I use xlrd to open these files, it fails, and I do not know how to solve it.
oldbook=xlrd.open_workbook('file.xls')
oldsheet=oldbook.sheets()[0]
PS C:\Users\我是猫\Desktop\python> python -u "c:\Users\我是猫\Desktop\python\a.py"
Traceback (most recent call last):
File "c:\Users\我是猫\Desktop\python\a.py", line 64, in <module>
oldbook=xlrd.open_workbook(result)
File "E:\python\lib\site-packages\xlrd\__init__.py", line 157, in open_workbook
ragged_rows=ragged_rows,
File "E:\python\lib\site-packages\xlrd\book.py", line 117, in open_workbook_xls
bk.parse_globals()
File "E:\python\lib\site-packages\xlrd\book.py", line 1209, in parse_globals
self.handle_format(data)
File "E:\python\lib\site-packages\xlrd\formatting.py", line 538, in handle_format
unistrg = unpack_unicode(data, 2)
File "E:\python\lib\site-packages\xlrd\biffh.py", line 284, in unpack_unicode
strg = unicode(rawstrg, 'utf_16_le')
File "E:\python\lib\site-packages\xlrd\timemachine.py", line 31, in <lambda>
unicode = lambda b, enc: b.decode(enc)
File "E:\python\lib\encodings\utf_16_le.py", line 16, in decode
return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 10-11: illegal encoding
PS C:\Users\我是猫\Desktop\python>
Try overriding the encoding used:
oldbook = xlrd.open_workbook('file.xls', encoding_override="cp1252")
You can also try encoding_override="utf-8"; experiment with different encodings until you find the right one.
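If you would rather automate the guessing, here is a small sketch (the candidate list is only an example) that tries a few encodings in turn:
import xlrd

# Try a few likely encodings until one opens without a decode error.
for enc in ('utf-8', 'cp1252', 'gb18030', 'latin1'):
    try:
        oldbook = xlrd.open_workbook('file.xls', encoding_override=enc)
        print('opened with', enc)
        break
    except UnicodeDecodeError:
        continue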

Encoding problem with /usr/lib64/python3.4/http/client.py

I do not understand the error below. If I run:
python3.4 ./bug.py "salé.txt"
It is fine.
If I run: python3.4 ./bug.py "Capture d’écran du 2019-03-21 15-17-10.png"
I get this error:
Traceback (most recent call last):
File "./bug.py", line 45, in <module>
status=testB_CreateSimpleDocumentWithFile(session)
File "./bug.py", line 32, in testB_CreateSimpleDocumentWithFile
status, result = session.create_document_with_properties(path,mydoc,simple_document,properties=props,files=kk)
File "/home/karim/testatrium/nuxeolib/session.py", line 345, in create_document_with_properties
_document_properties, _ = self.encode_properties(properties, files)
File "/home/karim/testatrium/nuxeolib/session.py", line 251, in encode_properties
_names, _sizes = self.upload_files(files, batch_id=_batch_id)
File "/home/karim/testatrium/nuxeolib/session.py", line 136, in upload_files
_status, _result = self.execute_api(param=_param, headers=_headers, file_name=_name)
File "/home/karim/testatrium/nuxeolib/session.py", line 1325, in execute_api
_connection.request(method, url, headers=h2, body=data)
File "/usr/lib64/python3.4/http/client.py", line 1139, in request
self._send_request(method, url, body, headers)
File "/usr/lib64/python3.4/http/client.py", line 1179, in _send_request
self.putheader(hdr, value)
File "/usr/lib64/python3.4/http/client.py", line 1110, in putheader
values[i] = one_value.encode('latin-1')
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2019' in position 9: ordinal not in range(256)
The problem comes from the right single quotation mark (U+2019). I have not managed to fix it.
Thanks for any advice.
Karim
Since the file name is not in your control, I would sanitize it.
Any of the methods in this question would solve the problem.
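For example, a minimal sketch (the helper name is hypothetical) that maps the file name down to latin-1-safe characters before it ends up in an HTTP header:
# Hypothetical helper: make a file name safe for latin-1 HTTP headers.
def sanitize_header_filename(name):
    # Replace the curly apostrophe (U+2019) with a plain one, then drop
    # anything else that still cannot be encoded as latin-1.
    name = name.replace('\u2019', "'")
    return name.encode('latin-1', errors='ignore').decode('latin-1')

print(sanitize_header_filename('Capture d\u2019écran du 2019-03-21 15-17-10.png'))
# -> Capture d'écran du 2019-03-21 15-17-10.png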

Pandas can't read excel encoding

I'm trying to import an Excel file into pandas. I'm using df=pd.read_excel(file_path), but it keeps giving me this error:
*** No CODEPAGE record, no encoding_override: will use 'ascii'
*** No CODEPAGE record, no encoding_override: will use 'ascii'
Traceback (most recent call last):
File "/Users/santanna_santanna/PycharmProjects/KlooksExplore/FindCos/FindCos_Functions.py", line 5468, in <module>
adjust_sheet(y1,y2,y3)
File "/Users/santanna_santanna/PycharmProjects/KlooksExplore/FindCos/FindCos_Functions.py", line 5130, in adjust_sheet
y1=pd.read_excel(y1)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/pandas/util/_decorators.py", line 118, in wrapper
return func(*args, **kwargs)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/pandas/io/excel.py", line 230, in read_excel
io = ExcelFile(io, engine=engine)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/pandas/io/excel.py", line 294, in __init__
self.book = xlrd.open_workbook(self._io)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/xlrd/__init__.py", line 162, in open_workbook
ragged_rows=ragged_rows,
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/xlrd/book.py", line 119, in open_workbook_xls
bk.get_sheets()
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/xlrd/book.py", line 719, in get_sheets
self.get_sheet(sheetno)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/xlrd/book.py", line 710, in get_sheet
sh.read(self)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/xlrd/sheet.py", line 815, in read
strg = unpack_string(data, 6, bk.encoding or bk.derive_encoding(), lenlen=2)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/xlrd/biffh.py", line 249, in unpack_string
return unicode(data[pos:pos+nchars], encoding)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/xlrd/timemachine.py", line 30, in <lambda>
unicode = lambda b, enc: b.decode(enc)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128)
The file I'm trying to import is this one.
Is that an encoding problem or some character in the file is causing this? What would be the way to solve it?
pd.read_excel('data.csv', encoding='utf-8')
@astrobiologist gave a good hint.
Since I didn't want the hassle of going into patches, the way I found to solve it was to open the file in OpenOffice and save it as an Excel 97 file. That finally worked.
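If the OpenOffice round trip is not practical, here is a sketch of the encoding_override hint (the cp1252 guess and the header-row assumption are mine) that goes through xlrd directly and then builds the DataFrame by hand:
import xlrd
import pandas as pd

# Force an encoding, since the file has no CODEPAGE record; cp1252 is only a guess.
book = xlrd.open_workbook('data.xls', encoding_override='cp1252')
sheet = book.sheet_by_index(0)
rows = [sheet.row_values(i) for i in range(sheet.nrows)]
df = pd.DataFrame(rows[1:], columns=rows[0])  # assumes the first row holds the column names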

Python pandas to excel UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 11

After web scraping an e-commerce site I saved all the data into a pandas dataframe. When I try to save the dataframe to an Excel file, I get the following error:
Traceback (most recent call last):
File "<ipython-input-7-3dafdf6b87bd>", line 2, in <module>
sheet_name='Dolci', encoding='iso-8859-1')
File "C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\frame.py", line
1466, in to_excel
excel_writer.save()
File "C:\ProgramData\Anaconda2\lib\site-packages\pandas\io\excel.py", line
1502, in save
return self.book.close()
File "C:\ProgramData\Anaconda2\lib\site-packages\xlsxwriter\workbook.py",
line 299, in close
self._store_workbook()
File "C:\ProgramData\Anaconda2\lib\site-packages\xlsxwriter\workbook.py",
line 607, in _store_workbook
xml_files = packager._create_package()
File "C:\ProgramData\Anaconda2\lib\site-packages\xlsxwriter\packager.py",
line 139, in _create_package
self._write_shared_strings_file()
File "C:\ProgramData\Anaconda2\lib\site-packages\xlsxwriter\packager.py",
line 286, in _write_shared_strings_file
sst._assemble_xml_file()
File "C:\ProgramData\Anaconda2\lib\site-
packages\xlsxwriter\sharedstrings.py", line 53, in _assemble_xml_file
self._write_sst_strings()
File "C:\ProgramData\Anaconda2\lib\site-
packages\xlsxwriter\sharedstrings.py", line 83, in _write_sst_strings
self._write_si(string)
File "C:\ProgramData\Anaconda2\lib\site-
packages\xlsxwriter\sharedstrings.py", line 110, in _write_si
self._xml_si_element(string, attributes)
File "C:\ProgramData\Anaconda2\lib\site-packages\xlsxwriter\xmlwriter.py",
line 122, in _xml_si_element
self.fh.write("""<si><t%s>%s</t></si>""" % (attr, string))
File "C:\ProgramData\Anaconda2\lib\codecs.py", line 706, in write
return self.writer.write(data)
File "C:\ProgramData\Anaconda2\lib\codecs.py", line 369, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 11:
ordinal not in range(128)
The code I use is this:
df.to_excel('my_file.xlsx',sheet_name='Dolci', encoding='iso-8859-1')
but it doesn't work. I have even tried:
df.to_excel('my_file.xlsx',sheet_name='Dolci', encoding='utf-8')
but it still gives me an error.
Can somebody help me with this issue?
It seems like you are using the xlsxwriter engine in ExcelWriter. Try using openpyxl instead.
writer = pd.ExcelWriter('file_name.xlsx', engine='openpyxl')
df.to_excel(writer)
writer.save()
There is an essential engine parameter on the to_excel method; try
df.to_excel('filename.xlsx', engine='openpyxl')
and it works for me.
Adding to @Vadym's response, you may have to close your writer to get the file to be created.
writer = pd.ExcelWriter(xlPath, engine='openpyxl')
df.to_excel(writer)
writer.close()
"depends on the behaviour of the used engine"
See:
https://github.com/pandas-dev/pandas/issues/9145
This should be a comment but I don't have the rep...
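As a usage note, a reasonably recent pandas also lets a with block handle the closing for you; a minimal sketch with made-up sample data:
import pandas as pd

df = pd.DataFrame({'name': ['crème brûlée'], 'price': [4.5]})  # non-ASCII on purpose

# The context manager saves and closes the workbook automatically on exit.
with pd.ExcelWriter('my_file.xlsx', engine='openpyxl') as writer:
    df.to_excel(writer, sheet_name='Dolci')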
