numpy loadtxt, unicode, and python 2 or 3 - python

I have a (windows) text file reported by linux as being a:
ISO-8859 text, with very long lines, with CRLF line terminators
I want to read this into numpy, skipping the first line, which contains labels (with special characters, usually just the Greek mu).
With Python 2.7.6 and Numpy 1.8.0, this works perfectly:
data = np.loadtxt('input_file.txt', skiprows=1)
With Python 3.4.0 and Numpy 1.8.0, I get an error:
>>> np.loadtxt('input_file.txt', skiprows=1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.4/site-packages/numpy/lib/npyio.py", line 796, in loadtxt
next(fh)
File "/usr/lib/python3.4/codecs.py", line 313, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 4158: invalid start byte
To me this is "buggy" behaviour for the following reasons:
I want to skip the first line so it should be ignored, regardless of its encoding
If I delete the first line from the file, loadtxt works fine in both versions of python
Shouldn't numpy.loadtxt behave the same in python2 and python3?
Questions:
How do I get around this problem (using Python 3, of course)?
Should I file a bug report or is this expected behaviour?

It seems like a bug in loadtxt(); try using genfromtxt() instead.
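A minimal sketch of that workaround, using a made-up file that reproduces the problem (a latin-1 µ in the header); the explicit encoding argument requires numpy >= 1.14:

```python
import numpy as np

# Recreate the situation: a latin-1 header containing a mu (byte 0xb5),
# followed by plain numeric data with CRLF line endings.
with open('input_file.txt', 'wb') as f:
    f.write('time\t\xb5m\r\n'.encode('latin-1'))
    f.write(b'1.0\t2.0\r\n3.0\t4.0\r\n')

# skip_header drops the first line; encoding tells numpy how to decode
# the file instead of assuming UTF-8 (available since numpy 1.14).
data = np.genfromtxt('input_file.txt', skip_header=1, encoding='latin-1')
print(data.shape)
```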

Yes, it seems to be a bug in Numpy: it tries to decode even the skipped rows and fails. Better report it.
In the meantime, the documentation says that loadtxt accepts a file object or a string generator as its first argument. Try this:
f = open('input_file.txt', encoding='latin-1')
f.readline()  # consume and discard the header line ourselves
data = np.loadtxt(f)
P.S. The error 'utf-8' codec can't decode byte 0xb5 in position 4158 doesn't seem to point at the beginning of the file. Are you sure your file doesn't contain some weird symbol further down that is invisible, or that looks like a space but actually isn't?

Related

UnicodeDecodeError on python3 [duplicate]

This question already has answers here:
Switching to Python 3 causing UnicodeDecodeError
I'm currently trying to use some simple regexes on a very big .txt file (a couple of million lines of text). The simplest code that causes the problem:
file = open("exampleFileName", "r")
for line in file:
    pass
The error message:
Traceback (most recent call last):
File "example.py", line 34, in <module>
example()
File "example.py", line 16, in example
for line in file:
File "/usr/lib/python3.4/codecs.py", line 319, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 7332: invalid continuation byte
How can I fix this? Is UTF-8 the wrong encoding? And if it is, how do I know which one is right?
Thanks and best regards!
It looks like your file is not valid UTF-8; try reading it with the latin-1 encoding:
file = open('exampleFileName', 'r', encoding='latin-1')
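The reason latin-1 is a safe fallback: it maps every one of the 256 possible byte values straight to the code point with the same number, so decoding can never raise (though the characters may come out wrong if the real encoding is something else). A quick check:

```python
# Every byte value decodes under latin-1, one character per byte,
# and the round trip back to bytes is lossless.
blob = bytes(range(256))
text = blob.decode('latin-1')
print(len(text))
assert text.encode('latin-1') == blob
```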
It is not possible to reliably identify the encoding on the fly. So either use the method I wrote as a comment, or try candidate encodings in turn (as proposed by another answer), but this is a wild shot. Note that the UnicodeDecodeError is raised when the file is read, not when it is opened, so the read has to be inside the try:
try:
    with open("exampleFileName", "r") as file:
        data = file.read()
except UnicodeDecodeError:
    try:
        with open("exampleFileName", "r", encoding="latin2") as file:
            data = file.read()
    except UnicodeDecodeError:
        pass  # ...try the next candidate
And so on, until you have tested all the encodings from the standard Python encodings list.
So I think there's no need to bother with this nested hell: just run file -bi [filename] once, copy the encoding, and forget about it.
UPD. Actually, I've found another Stack Overflow answer which you can use if you're on Windows.

python Unicode decode error when accessing records of OrderedDict

Using Python 3.5.2 on Windows (32-bit), I'm reading a DBF file, which gives me an OrderedDict per record.
from dbfread import DBF
Table = DBF('FME.DBF')
for record in Table:
    print(record)
Accessing the first records is fine, until I reach one that contains diacritics:
Traceback (most recent call last):
File "getdbe.py", line 3, in <module>
for record in Table:
File "...\AppData\Local\Programs\Python\Python35-32\lib\site-packages\dbfread\dbf.py", line 311, in _iter_records
for field in self.fields]
File "...\AppData\Local\Programs\Python\Python35-32\lib\site-packages\dbfread\dbf.py", line 311, in <listcomp>
for field in self.fields]
File "...\AppData\Local\Programs\Python\Python35-32\lib\site-packages\dbfread\field_parser.py", line 75, in parse
return func(field, data)
File "...\AppData\Local\Programs\Python\Python35-32\lib\site-packages\dbfread\field_parser.py", line 83, in parseC
return decode_text(data.rstrip(b'\0 '), self.encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x82 in position 11: ordinal not in range(128)
Even if I don't print the record, I still get the error.
Any ideas?
dbfread failed to detect the correct encoding from your DBF file. From the Character Encodings section of the documentation:
dbfread will try to detect the character encoding (code page) used in the file by looking at the language_driver byte. If this fails it reverts to ASCII. You can override this by passing encoding='my-encoding'.
Emphasis mine.
You'll have to pass in an explicit encoding; this will almost certainly be a Windows codepage. Take a look at the supported codecs in Python; you'll want one whose name starts with cp here. If you don't know which codepage to use, you'll have some trial-and-error work to do. Note that some codepages overlap in their character ranges, so even if one appears to produce legible results, you may want to keep searching and try different records in your data file to see which fits best.
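One way to narrow the search, sketched below: take the undecodable byte from the traceback (0x82 here) and decode it under a few candidate codepages to see which produces a plausible character. The candidates are guesses; cp437 and cp850 are common DOS-era DBF codepages.

```python
# Decode the offending byte 0x82 under a few candidate codepages.
for codec in ('cp437', 'cp850', 'cp1252'):
    print(codec, b'\x82'.decode(codec))

# Once one looks right for your data, pass it explicitly, e.g.:
# Table = DBF('FME.DBF', encoding='cp850')
```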

UnicodeDecodeError even when importing simple txt file in Python

I would like to import even a simple text file into Python. For example, here's the contents of example.txt:
hello
my
friend
Very simple. However, when I try to import the file and read it:
f = open('example.txt')
f.read()
I get the following error:
Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
f.read()
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
What's the source of this problem? Clearly there aren't any non-ASCII characters in the file.
I've tried this in IDLE, terminal (Mac OSX) and Rodeo and get similar issues in all.
I'm very new to Python and am concerned I may have screwed up something in installation. I've downloaded various versions over the years, straight from Python, Anaconda, macports, etc. and I'm wondering if the various sources are not playing nicely...
Python 3.5.1 on OSX 10.11.4.
Your file was probably saved with a byte order mark (BOM). The 0xff at position 0 actually matches the start of a UTF-16 BOM (a UTF-8 BOM would begin with 0xef); either way, the BOM bytes are not valid ASCII, which is what triggers the UnicodeDecodeError. Try saving the file explicitly as UTF-8, without a BOM.
Hope this helps!
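If re-saving is not an option, you can sniff the BOM yourself and open the file with a matching codec. A rough sketch (not exhaustive; UTF-32 BOMs, for instance, are not handled):

```python
import codecs

def open_skipping_bom(path):
    # Read a few raw bytes, look for a known BOM, and pick a codec
    # that strips it; fall back to plain UTF-8.
    with open(path, 'rb') as f:
        head = f.read(4)
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        enc = 'utf-16'      # this codec consumes the BOM itself
    elif head.startswith(codecs.BOM_UTF8):
        enc = 'utf-8-sig'   # strips the UTF-8 BOM
    else:
        enc = 'utf-8'
    return open(path, encoding=enc)
```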

Can't handle strings in windows

I have written a Python 2.7 program on Linux and it worked fine.
It uses
os.listdir(os.getcwd())
to read folder names as variables and uses them later in some parts.
On Linux I used a simple conversion trick to manually convert the non-ASCII characters into ASCII ones:
str(str(tfile)[0:-4]).replace('\xc4\xb0', 'I').replace("\xc4\x9e", 'G').replace("\xc3\x9c", 'U').replace("\xc3\x87", 'C').replace("\xc3\x96", 'O').replace("\xc5\x9e", 'S') ,str(line.split(";")[0]).replace(" ", "").rjust(13, "0"),a))
This approach failed in windows. I tried
udata = str(str(str(tfile)[0:-4])).decode("UTF-8")
asci = udata.encode("ascii","ignore")
This also failed, with the following error (which occurred at the string DEM¦-RTEPE):
Exception in Tkinter callback
Traceback (most recent call last):
File "C:\Python27\lib\lib-tk\Tkinter.py", line 1532, in __call__
return self.func(*args)
File "C:\Users\benhur.satir\workspace\Soykan\tkinter.py", line 178, in SparisDerle
udata = str(str(str(tfile)[0:-4])).decode("utf=8")
File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa6 in position 3: invalid start byte
How can I handle such characters on Windows?
NOTE: Leaving them as UTF-8 causes the xlswriter module to fail, so I need to convert them to ASCII. Missing characters are undesirable but acceptable.
Windows does not like UTF-8. You probably get the folder names in the default system encoding, generally win1252 (a variant of ISO-8859-1).
That's why you could not find the UTF-8 byte sequences in the file names. By the way, the exception says it hit a character with code 0xa6, which in win1252 would be ¦ (the broken bar you see in DEM¦-RTEPE).
I can't say exactly which encoding your Windows system uses, as it depends on the localization, but this does prove the data is not UTF-8 encoded.
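As an aside, a more robust alternative to the chained .replace() calls is to normalize to NFKD and drop the combining marks. A sketch in Python 3 syntax; the cp1254 fallback is an assumption based on the Turkish characters involved:

```python
import unicodedata

def to_ascii(text):
    # Decode bytes coming from either a UTF-8 (Linux) or cp1254
    # (Windows Turkish codepage -- an assumption) source.
    if isinstance(text, bytes):
        for enc in ('utf-8', 'cp1254'):
            try:
                text = text.decode(enc)
                break
            except UnicodeDecodeError:
                pass
    # Split accented letters into base letter + combining mark,
    # then drop everything that is not ASCII.
    folded = unicodedata.normalize('NFKD', text)
    return folded.encode('ascii', 'ignore').decode('ascii')

# The letters from the question: I-dot, G-breve, U-umlaut, C-cedilla, ...
print(to_ascii('\u0130\u011e\u00dc\u00c7\u00d6\u015e'))
```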
How about this?
For the optional .replace() step, the string module provides ready-made sets of safe characters you can keep:
>>> import string
>>> string.digits+string.punctuation
'0123456789!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>>

UnicodeDecodeError when using Python 2.x unicodecsv

I'm trying to write out a csv file with Unicode characters, so I'm using the unicodecsv package. Unfortunately, I'm still getting UnicodeDecodeErrors:
# -*- coding: utf-8 -*-
import codecs
import unicodecsv
raw_contents = 'He observes an “Oversized Gorilla” near Ashford'
encoded_contents = unicode(raw_contents, errors='replace')
with codecs.open('test.csv', 'w', 'UTF-8') as f:
    w = unicodecsv.writer(f, encoding='UTF-8')
    w.writerow(["1", encoded_contents])
This is the traceback:
Traceback (most recent call last):
File "unicode_test.py", line 11, in <module>
w.writerow(["1", encoded_contents])
File "/Library/Python/2.7/site-packages/unicodecsv/__init__.py", line 83, in writerow
self.writer.writerow(_stringify_list(row, self.encoding, self.encoding_errors))
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 691, in write
return self.writer.write(data)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 351, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 17: ordinal not in range(128)
I thought converting it to Unicode would be good enough, but that doesn't seem to be the case. I'd really like to understand what is happening so that I'm better prepared to handle these errors in other projects in the future.
From the traceback, it looks like I can reproduce the error like this:
>>> raw_contents = 'He observes an “Oversized Gorilla” near Ashford'
>>> raw_contents.encode('UTF-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 15: ordinal not in range(128)
>>>
Up until now, I thought I had a decent working knowledge of working with Unicode text in Python 2.x, but this has humbled me.
You should not use codecs.open() for your file. unicodecsv wraps the csv module, which always writes a byte string to the open file object. In order to write that byte string to a Unicode-aware file object such as returned by codecs.open(), it is implicitly decoded; this is where your UnicodeDecodeError exception stems from.
Use a file in binary mode instead:
with open('test.csv', 'wb') as f:
    w = unicodecsv.writer(f, encoding='UTF-8')
    w.writerow(["1", encoded_contents])
The binary mode is not strictly necessary unless your data contains embedded newlines, but the csv module wants to control how newlines are written to ensure that such values are handled correctly. However, not using codecs.open() is an absolute requirement.
The same thing happens when you call .encode() on a byte string: you already have encoded data, so Python 2 first implicitly decodes it (with the ASCII codec) to get a Unicode value that it can then encode.
