problems of reading a collection of files including non-ascii characters - python

I am trying to build word vectors using the following code segment:
DIR = "C:\Users\Desktop\data\\rec.sport.hockey"
posts = [open(os.path.join(DIR,f)).read() for f in os.listdir(DIR)]
x_train = vectorizer.fit_transform(posts)
However, the code raises the following error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 240: invalid start byte
I think it is related to some non-ASCII characters. How can I solve this issue?

Are the files automatically generated? If not, one simple solution is to open each file with Notepad++ and convert its encoding to UTF-8.
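If converting the files by hand is not practical, the decoding can also be made explicit in Python. The following is a minimal sketch for Python 3, not a definitive fix: `read_posts` is a hypothetical helper, and `latin-1` is only an assumed encoding. It maps every byte to some character, so decoding cannot fail, but characters may come out wrong if the files really use another encoding.

```python
import os


def read_posts(directory, encoding="latin-1"):
    """Read every regular file in `directory` as text (hypothetical helper)."""
    posts = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip sub-directories
        # Assumption: latin-1 is a guess at the real encoding.
        # errors="replace" only matters if a stricter codec is passed in.
        with open(path, encoding=encoding, errors="replace") as f:
            posts.append(f.read())
    return posts
```

With this, `posts = read_posts(DIR)` would replace the list comprehension in the question. If `vectorizer` is a scikit-learn text vectorizer, its `encoding` and `decode_error` parameters are another place to handle undecodable bytes.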

Related

Unable to Read a tsv file in pandas. Gives UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa5 in position 113: invalid start byte

I am trying to read a tsv file (one of many), but it is giving me the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa5 in position 113: invalid start byte
Before this, I read other similar files using the same code:
df = pd.read_csv(fname,sep='\t',error_bad_lines=False)
But for this particular file, I am getting an error.
How can I read this file? Hopefully without missing any data.
A suggestion would be to check what encoding you actually have. Do it this way:
with open('filename.tsv') as f:  # or whatever your extension is
    print(f)
From that you'll see the encoding Python used to open the file (this is the platform default, which may not match the file's real encoding). Then,
df = pd.read_csv('filename.tsv', sep='\t', encoding="the encoding that was returned")
It would be nice if you could post the data around position 113 of your dataset, where the error first occurred. Note that 113 is a byte offset, not a line number.
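To see what actually sits at the failing position, the raw bytes can be inspected directly. This is a rough sketch under assumptions: `find_bad_bytes` is a hypothetical helper name, and it only reports the first byte sequence that fails to decode with the given codec.

```python
def find_bad_bytes(path, encoding="utf-8", context=20):
    """Return (offset, nearby bytes) for the first undecodable byte, or None."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start and exc.end are byte offsets into `data`
        start = max(exc.start - context, 0)
        return exc.start, data[start:exc.end + context]
    return None  # the whole file decodes cleanly with this codec
```

Seeing the neighbouring bytes often makes it easier to guess the real encoding, which can then be passed to `pd.read_csv` through its `encoding` argument.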

Odd corruption of byte string in PDF annotations, won't decode in utf-8 (pdfminer)

I have a strange issue popping up when attempting to scrape the links from a pdf file. The link appears as 'http://www.mbc.ca.gov/Licensees/License_Renewal/Physician_Survey.aspx' in the pdf file. However, it comes out as:
b'http://www.mbc.ca.gov/Licensees/License_Renewal/Physici\xe9C#|\xf2\xefw\x0e\xd3\x8d>X\x0f\xe7\xc6'
when performing a resolve() method on the PDFObjRef. Why the sudden corruption in the link there? Almost looks like a newline or something that got interpreted as a byte. Also, why is this even a byte string if it's clearly human readable? Is this normal behavior for pdfminer?
It gives this error when attempting to decode that byte string with utf-8:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 55: invalid continuation byte
I think this is a goner. The script works with all pdfs I've encountered but this one. So unless someone can come up with a reason that pdfminer would get a weird/corrupt encoding about 40-60 characters into a byte string, this is FUBAR.

UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 11597: ordinal not in range(128)

I am using the well-known 20 Newsgroups data set for text categorisation in Jupyter. When I try to open and read a file on my Mac, it fails at the decoding step. I tried to read the file in byte format, which works, but I need to work with it as a string afterwards. I tried to decode it, but it fails with the error below.
Code
with open(file_path, 'rb') as f:
    file_read = f.read()
file_read.decode("us-ascii")
Error
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 11597: ordinal not in range(128)
us-ascii is the file encoding reported when I type this in the terminal: file -I file_name. I tried other encodings, but none of them works.
I also want to remove punctuation and count the words in the file. Is there a way to overcome this issue?
It is tricky without looking at the file. However, this works most of the time:
from codecs import open
file_path = "file_name"
with open(file_path, 'rb') as f:
    file_read = f.read()
Setting errors to ignore resolved the problem, thanks N M. The code looks like:
ref_file = open(ref_file_path, 'r', encoding='ascii', errors='ignore')
file_read = ref_file.read()
The rest of the code then treats it as one big string. Note that although the error was about decoding 0xff, the file was not UTF-16 encoded.
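For the follow-up goal in the question (removing punctuation and counting words), a possible Python 3 sketch is shown here. `count_words` is a hypothetical helper, and only ASCII punctuation from `string.punctuation` is removed. Be aware that `errors='ignore'` silently drops every byte that cannot be decoded; `errors='replace'` would mark those spots instead.

```python
import string
from collections import Counter


def count_words(path, encoding="ascii", errors="ignore"):
    """Count lower-cased words in a text file, skipping undecodable bytes."""
    with open(path, "r", encoding=encoding, errors=errors) as f:
        text = f.read()
    # Delete ASCII punctuation, then split on whitespace.
    table = str.maketrans("", "", string.punctuation)
    words = text.translate(table).lower().split()
    return Counter(words)
```

Calling `count_words(ref_file_path).most_common(10)` would then list the ten most frequent words.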

Python encoding issue while reading a file

I am trying to read a file that contains the character "ë". I cannot figure out how to read it, no matter what I try with the encoding. When I look at the file manually in TextEdit, it is listed as an unknown 8-bit file. If I try changing it to utf-8, utf-16 or anything else, it either does not work or messes up the entire file. I tried reading the file with standard Python commands as well as with codecs, and cannot come up with anything that reads it correctly. I include a code sample of the read below. Does anyone have any clue what I am doing wrong? This is Python 2.7.10, by the way.
readFile = codecs.open("FileName",encoding='utf-8')
The line I am trying to read is this with nothing else in it.
Aeëtes
Here are some of the errors I get:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x91 in position 0: invalid start byte
UnicodeError: UTF-16 stream does not start with BOM -- I know this one is that it is not a utf-16 file.
UnicodeDecodeError: 'ascii' codec can't decode byte 0x91 in position 0: ordinal not in range(128)
If I don't use a Codec the word comes in as Ae?tes which then crashes later in the program. Just to be clear, none of the suggested questions or any other anywhere on the net have pointed to an answer. One other detail that might help is that I am using OS X, not Windows.
Credit for this answer goes to RadLexus for figuring out the proper encoding and also to Mad Physicist who pointed me in the right track even if I did not consider all possible encodings.
The issue is that the Mac apparently saved the .txt file in the mac_roman encoding. If you use that encoding, it works perfectly.
This is the line of code that I used to read it:
readFile = codecs.open("FileName",encoding='mac_roman')
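The effect is easy to reproduce on a byte string. The sketch below assumes the word is stored as the bytes `Ae`, 0x91, `tes` (the exact file contents are not shown in the question) and decodes the same bytes with several codecs.

```python
# Assumed file contents: "Ae", then byte 0x91, then "tes".
raw = b"Ae\x91tes"

as_mac_roman = raw.decode("mac_roman")  # 0x91 is "ë" in Mac OS Roman
as_cp1252 = raw.decode("cp1252")        # the same byte is a curly quote here

print(as_mac_roman)
print(as_cp1252)

# utf-8 and ascii both reject 0x91, as in the errors quoted above.
for codec in ("utf-8", "ascii"):
    try:
        raw.decode(codec)
    except UnicodeDecodeError as exc:
        print(codec, "->", exc.reason)
```

A single-byte codec such as cp1252 decodes the bytes without any error but yields the wrong character, so a successful decode alone does not prove the encoding is right.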

Python Encoding\Decoding for writing to a text file

I've honestly spent a lot of time on this, and it's slowly killing me. I've stripped content from a PDF and stored it in an array. Now I'm trying to pull it back out of the array and write it into a txt file. However, I do not seem to be able to make it happen because of encoding issues.
allTheNTMs.append(contentRaw[s1:].encode("utf-8"))
for a in range(len(allTheNTMs)):
    kmlDescription = allTheNTMs[a]
    print kmlDescription  # this prints out fine
    outputFile.write(kmlDescription)
The error I'm getting is "UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 213: ordinal not in range(128)".
I'm just messing around now, but I've tried all kinds of ways to get this stuff to write out.
outputFile.write(kmlDescription).decode('utf-8')
Please forgive me if this is basic, I'm still learning Python (2.7).
Cheers!
EDIT1: Sample data looks something like the following:
Chart 3686 (plan, Morehead City) [ previous update 4997/11 ] NAD83 DATUM
Insert the accompanying block, showing amendments to coastline,
depths and dolphins, centred on: 34° 41´·19N., 76° 40´·43W.
Delete R 34° 43´·16N., 76° 41´·64W.
When I add the print type(raw), I get
Edit 2: When I just try to write the data, I receive the original error message (ascii codec can't decode byte...)
I will check out the suggested thread and video. Thanks folks!
Edit 3: I'm using Python 2.7
Edit 4: agf hit the nail on the head in the comments below when (s)he noticed that I was double encoding. I tried intentionally double encoding a string that had previously been working and produced the same error message that was originally thrown. Something like:
text = "Here's a string, but imagine it has some weird symbols and whatnot in it - apparently latin-1"
textEncoded = text.encode('utf-8')
textEncodedX2 = textEncoded.encode('utf-8')
outputfile.write(textEncoded) #Works!
outputfile.write(textEncodedX2) #failed
Once I figured out I was trying to double encode, the solution was the following:
allTheNTMs.append(contentRaw[s1:].encode("utf-8"))
for a in range(len(allTheNTMs)):
    kmlDescription = allTheNTMs[a]
    kmlDescriptionDecode = kmlDescription.decode("latin-1")
    outputFile.write(kmlDescriptionDecode)
It's working now, and I sure appreciate all of your help!!
My guess is that the output file was opened with a codec (latin1 or even utf-8), so you cannot write UTF-8 encoded byte strings to it: the file object tries to encode the data again, and Python 2 first decodes the byte string implicitly as ASCII. To a file opened normally (without a codec) you can write any byte string. Here is an example recreating a similar error:
# -*- coding: utf-8 -*-
import codecs
u = u'सच्चिदानन्द हीरानन्द वात्स्यायन '
s = u.encode('utf-8')  # s is now a UTF-8 byte string
f = codecs.open('del.text', 'wb', encoding='latin1')
f.write(s)  # fails: the codec writer expects unicode, so s is decoded as ASCII first
Output:
Traceback (most recent call last):
File "/usr/lib/wingide4.1/src/debug/tserver/_sandbox.py", line 1, in <module>
# Used internally for debug sandbox under external interpreter
File "/usr/lib/python2.7/codecs.py", line 691, in write
return self.writer.write(data)
File "/usr/lib/python2.7/codecs.py", line 351, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)
Solution:
This will work if you don't set any codec:
f = open('del.txt', 'wb')
f.write(s)
The other option is to write the unicode string directly, without encoding it yourself, if outputFile has been opened with the correct codec, e.g.:
f = codecs.open('del.text', 'wb', encoding='utf-8')
f.write(u)
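The second option can be sketched so that it also runs on Python 3, using `io.open`, which exists in both Python 2.7 and 3. The sample text and the temporary file path are made up for illustration; the point is that the text is encoded exactly once, by the file object.

```python
import io
import os
import tempfile

# Made-up sample in the style of the question's data: degree sign,
# acute accent and middle dot.
text = u"34\u00b0 41\u00b4\u00b719N"

# Encoding by hand yields bytes; each of these symbols starts with 0xc2.
encoded = text.encode("utf-8")

path = os.path.join(tempfile.mkdtemp(), "out.txt")

# Let the file object do the encoding: pass it text, not bytes.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Read the raw bytes back to confirm they were encoded only once.
with io.open(path, "rb") as f:
    on_disk = f.read()
```

Passing `encoded` to `f.write` in the text-mode block instead would be the double-encoding mistake described in the question.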
Your error message doesn't appear to relate to your Python syntax, but to the fact that you're trying to decode a byte value that the ascii codec cannot handle.
Hex 0xc2 appears to represent a Latin character: in Latin-1 it is an uppercase A with a circumflex (Â). Therefore, instead of using allTheNTMs.append(contentRaw[s1:].encode("utf-8")), try:
allTheNTMs.append(contentRaw[s1:].encode("latin-1"))
I'm not an expert in Python so this may not work, but it would appear you're trying to encode a Latin character. Judging by the error message, the ascii codec only covers the first 128 values (0x00 to 0x7f), and byte 0xc2 is outside that range.
