Python Unicode Decode Error for Byte not in file

I am reading a large file in Python line by line with readline(). After about 672,280 lines I get an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfd in position 228:
invalid start byte.
However, I have searched the file using grep for a byte 0xfd and it returned none. I also wrote c++ code to go through the file and look for a byte 0xfd and still got nothing. So I have no idea what is going on here. Is it an error because the file is too big?
I just don't see how a decoding error can happen for a byte not in a file.
Thanks

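You can try opening the file with the ISO-8859-1 (Latin-1) encoding. It maps every byte to a character, so it never raises a decode error, although the text will be wrong wherever the file is not really Latin-1.
open('myfile.txt', encoding="ISO-8859-1")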
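To see which byte Python is actually objecting to, one option is to read the file in binary mode and decode each line yourself. The following is a minimal sketch, not a definitive tool: the helper name find_bad_bytes and its context parameter are made up for illustration. It also hints at why a search for the byte can disagree with the message, since the position in the error may be relative to the chunk being decoded rather than to the start of the file.

```python
def find_bad_bytes(path, encoding="utf-8", context=10):
    """Report the first undecodable byte of every line that fails to decode.

    Hypothetical helper for illustration. Returns a list of tuples:
    (line_number, offset_in_line, byte_value, surrounding_bytes).
    """
    problems = []
    with open(path, "rb") as f:  # binary mode: nothing is decoded on read
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as exc:
                start = max(exc.start - context, 0)
                problems.append(
                    (lineno, exc.start, raw[exc.start],
                     raw[start:exc.start + context])
                )
    return problems
```

Calling find_bad_bytes('myfile.txt') and printing the tuples shows the raw bytes around each failure, which can then be compared against what grep or a hex dump reports.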

Related

Unable to Read a tsv file in pandas. Gives UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa5 in position 113: invalid start byte

I am trying to read a tsv file (one of many) but it is giving me the following error.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa5 in position 113: invalid start byte
Before this I have read similar other files using the same code.
df = pd.read_csv(fname,sep='\t',error_bad_lines=False)
But for this particular file, I am getting an error.
How can I read this file? Hopefully without missing any data.

A suggestion would be to check what encoding you actually have. Do it this way:
with open('filename.tsv') as f:  # or whatever your extension is
    print(f)
From that you'll see the encoding Python used to open the file. Note that this is the platform default, which is not necessarily the encoding the file was written in. Then,
df = pd.read_csv('filename.tsv', encoding="the encoding that was returned")
It would be nice if you could post the part of your dataset around position 113, where the error first occurred.
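If the reported encoding still fails, another approach is to try a short list of candidate encodings against the raw bytes. This is only a sketch: guess_encoding is a hypothetical helper and the candidate list is an assumption. A successful decode does not prove the guess is right, and latin-1 accepts every byte, so it belongs last.

```python
def guess_encoding(path, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate that decodes the whole file, or None.

    Hypothetical helper; the candidate list is an assumption, not a rule.
    """
    with open(path, "rb") as f:
        raw = f.read()
    for enc in candidates:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            continue  # this candidate cannot be right, try the next one
        return enc
    return None
```

With pandas, the result could then be passed along as pd.read_csv(fname, sep='\t', encoding=guess_encoding(fname)).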

ERROR: UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 715: character maps to <undefined>

I am using a Python notebook to open this text file in Windows 10. However, UTF-8 encoding isn't working. How should I solve this error?

Python is trying to open the file using your system's default encoding, but the bytes in the file cannot be decoded with that encoding.
You need to pass the correct encoding to open. We don't know what the correct encoding is, but the most likely candidates are UTF-8 or, less common these days, latin-1. So you would do something like
with open('myfile.txt', 'r', encoding='UTF-8') as f:
    for line in f:
        pass  # do something with the line
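On Windows the default is often cp1252, and Python's cp1252 codec leaves byte 0x8d undefined, which is consistent with the 'charmap' message in the question. When the real encoding is unknown and losing a few characters is acceptable, one fallback is the errors argument of open. This is a sketch under that assumption; read_text_lenient is a hypothetical name.

```python
import locale


def read_text_lenient(path, encoding="utf-8"):
    """Read a text file, replacing undecodable bytes with U+FFFD."""
    with open(path, "r", encoding=encoding, errors="replace") as f:
        return f.read()


# The encoding open() falls back to when none is given (platform dependent).
default_encoding = locale.getpreferredencoding(False)
```

Replacement characters in the result mark exactly where the chosen encoding did not fit, which helps when deciding which encoding to try next.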

How to read ® character from Windows-1252 file and write to UTF-8 file

I have an input file in Windows-1252 encoding that contains the '®' character. I need to write this character to a UTF-8 file. Also assume I must use Python 2.7. Seems easy enough, but I keep getting UnicodeDecodeErrors.
I originally had just opened the original file using codecs.open() with UTF-8 encoding, which worked fine for all of the ASCII characters until it encountered the ® symbol, whereupon it choked with the error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xae in position 2867043:
invalid start byte
I knew that I would have to properly decode it as cp1252 to fix this problem, so I opened it in the proper encoding and then encoded the data as UTF-8 prior to writing. But that produced a new error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 22:
ordinal not in range(128)
Here is a minimum working example:
with codecs.open('in.txt', mode='rb', encoding='cp1252') as inf:
    with codecs.open('out.txt', mode='wb', encoding='utf-8') as of:
        for line in inf:
            of.write(line.encode('utf-8'))
Here is the contents of in.txt:
Sample file
Here is my sample file® yay.
I thought perhaps I could just open it in 'rb' mode with no encoding specified and specifically handle the decoding and encoding of each line like so:
of.write(line.decode('cp1252').encode('utf-8'))
But that also didn't work, giving the same error as when I just opened it as UTF-8.
How do I read data from a Windows-1252 file, properly decode it then encode it as UTF-8 and write it to a UTF-8 file? The above method has always worked for me in the past until I encountered the ® character.

Your file is not in Windows-1252 if 0xC2 should represent the ® character; in Windows-1252, 0xC2 is Â.
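However, you should just use
of.write(line)
since encoding properly is the whole reason you're using codecs in the first place. The writer returned by codecs.open already encodes to UTF-8. When you hand it an already-encoded byte string, Python 2 first decodes it implicitly as ASCII, and that is where the 'ascii' codec error comes from: 0xc2 is the first byte of the UTF-8 encoding of ® (0xC2 0xAE).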
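As a minimal sketch of the whole round trip, io.open accepts an encoding argument on both Python 2.7 and Python 3, so the file objects can do the decoding and encoding themselves. The function name convert_cp1252_to_utf8 is hypothetical, and newline="" is used here on the assumption that the original line endings should be preserved.

```python
import io


def convert_cp1252_to_utf8(src, dst):
    """Copy src to dst, re-encoding from Windows-1252 to UTF-8."""
    with io.open(src, "r", encoding="cp1252", newline="") as inf:
        with io.open(dst, "w", encoding="utf-8", newline="") as outf:
            for line in inf:
                # line is already text; outf encodes it on write
                outf.write(line)
```

For the files in the question this would be called as convert_cp1252_to_utf8('in.txt', 'out.txt').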

problems of reading a collection of files including non-ascii characters

I am trying to build word vectors using the following code segment
DIR = "C:\Users\Desktop\data\\rec.sport.hockey"
posts = [open(os.path.join(DIR,f)).read() for f in os.listdir(DIR)]
x_train = vectorizer.fit_transform(posts)
However, the code returns the following error message
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 240: invalid start byte
I think it is related to some non-ascii characters. How to solve this issue?

Is the file automatically generated? If not, one simple solution is to open the file with Notepad++ and convert the encoding to utf-8.
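If converting every file by hand is impractical, an alternative is to decode explicitly while reading. The sketch below assumes latin-1 purely because it maps every byte to a character and therefore cannot fail. Whether it is the right encoding for this data is an assumption, and read_posts is a hypothetical helper.

```python
import io
import os


def read_posts(directory, encoding="latin-1"):
    """Return the decoded text of every regular file in directory, sorted by name."""
    posts = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        with io.open(path, encoding=encoding) as f:
            posts.append(f.read())
    return posts
```

The returned list holds text rather than raw bytes, so it can be handed to vectorizer.fit_transform in place of the original list comprehension.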

genome diagram fail: Unicode Decode Error

I am trying to get the genome diagram function of biopython to work but it currently fails.
This is the output; I'm not sure what the error means. Any suggestions?
======================================================================
ERROR: test_partial_diagram (test_GenomeDiagram.DiagramTest)
construct and draw SVG and PDF for just part of a SeqRecord.
----------------------------------------------------------------------
Traceback (most recent call last):
File "./test_GenomeDiagram.py", line 662, in test_partial_diagram
assert open(output_filename).read().replace("\r\n", "\n") \
File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/codecs.py", line 300, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 11: invalid start byte

Your data file is composed of bytes which are encoded in some encoding other than utf-8. You need to specify the right encoding.
open(output_filename, encoding=...)
There is no entirely reliable way for us to tell you what encoding it should be. But since
In [156]: print(b'\x93'.decode('cp1252'))
“
(and since the quotation mark is a pretty common character) you might want to try using
open(output_filename, encoding='cp1252')
on line 662 of test_GenomeDiagram.py.

UTF-8 is a variable-length encoding. When a character requires multiple bytes, the second and subsequent bytes are of the form 10xxxxxx, and no initial bytes (or single-byte characters) have this form. As such, 0x93 cannot ever be the first byte of a UTF-8 character. The error message is telling you that your buffer contains an invalid UTF-8 byte sequence.
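The bit patterns described above can be checked with a few comparisons. This is a small illustrative sketch, with utf8_byte_role as a hypothetical name, that classifies a byte value by the role it can play in well-formed UTF-8.

```python
def utf8_byte_role(b):
    """Classify a single byte value (0-255) by its role in UTF-8."""
    if b < 0x80:
        return "ascii"         # 0xxxxxxx: a complete one-byte character
    if b < 0xC0:
        return "continuation"  # 10xxxxxx: never a first byte
    if b in (0xC0, 0xC1) or b > 0xF4:
        return "invalid"       # never appears in well-formed UTF-8
    if b < 0xE0:
        return "lead-2"        # 110xxxxx: starts a two-byte sequence
    if b < 0xF0:
        return "lead-3"        # 1110xxxx: starts a three-byte sequence
    return "lead-4"            # 11110xxx: starts a four-byte sequence
```

By this classification 0x93, 0xa5, 0xae and 0x80 are continuation bytes, which is why each is reported as an "invalid start byte" when it appears where a character should begin, and 0xfd is not valid anywhere in UTF-8.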
