Unicode error with codecs when reading a PDF file in Python

I am trying to read a PDF file with the following content:
%PDF-1.4\n%âãÏÓ
If I read it with open, it works, but if I try codecs.open(filename, encoding="utf8", mode="rb") to get a unicode string, I get the following exception:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
Do you know a way to get a unicode string from the content of this file?
PS: I am using python 2.7

PDFs are made of binary data, not text. They cannot be meaningfully represented as Unicode strings.
For what it's worth, you can get a Unicode string containing those particular characters by treating the PDF as ISO8859-1 text:
f = codecs.open(filename, encoding="ISO8859-1", mode="rb")
But at that point, you're better off just using normal open and reading bytes. Unicode is for text, not data.

The issue of trying to interpret arbitrary binary data as text aside, 0xe2 is â in Latin-1, not UTF-8. You're using the wrong codec.
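A minimal sketch of both approaches for Python 2.7 (the filename is illustrative); Latin-1 works only because it maps every byte 0x00-0xFF to a code point, not because the result is meaningful text:
# Read the PDF as raw bytes with the built-in open -- usually all you need.
with open('document.pdf', 'rb') as f:
    header = f.read(16)   # a byte string such as '%PDF-1.4\n%\xe2\xe3\xcf\xd3'

# If a unicode object is truly unavoidable, Latin-1 never raises UnicodeDecodeError,
# but the result is still binary data dressed up as text.
header_as_unicode = header.decode('ISO8859-1')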

Related

Python unable to decode byte string

I am having a problem decoding byte strings that I have to send from one computer to another. The file is in PDF format. I get an error that goes:
fileStrings[i] = fileStrings[i].decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 648: invalid continuation byte
Any ideas on how to remove the b'' marking? I need to reassemble the file, but I also need to know its size in bytes before sending it, and I figured I would get it by decoding each byte string (this works for txt files but not for pdf ones).
The code is:
with open(inputne, "rb") as file:
    while 1:
        readBytes = file.read(dataMaxSize)
        fileStrings.append(readBytes)
        if not readBytes:
            break
        readBytes = ''
filesize = 0
for i in range(0, len(fileStrings)):
    fileStrings[i] = fileStrings[i].decode()
    filesize += len(fileStrings[i])
Edit: For anyone having the same issue, len() will give you the size without the b''.
In Python, bytestrings are for raw binary data, and strings are for textual data. Calling decode() with no arguments tries to decode the data as UTF-8, which works for txt files but not for pdf files, since they can contain arbitrary bytes. You should not try to turn the data into a string; bytestrings exist for exactly this purpose. You can get the length of a bytestring as usual with len(data), and many string operations also work on bytestrings, such as concatenation and slicing (data1 + data2 and data[1:3]).
As a side note, the b'' when you print it is just because the __str__ method for bytestrings is equivalent to repr. It's not in the data itself.
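A minimal sketch of the size bookkeeping without any decode() calls, reusing the names from the question (inputne and dataMaxSize are assumed to be defined as before):
fileStrings = []
filesize = 0
with open(inputne, "rb") as file:
    while True:
        readBytes = file.read(dataMaxSize)
        if not readBytes:
            break
        fileStrings.append(readBytes)
        filesize += len(readBytes)   # len() of a bytestring is its size in bytes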

Odd corruption of byte string in PDF annotations, won't decode in utf-8 (pdfminer)

I have a strange issue popping up when attempting to scrape the links from a pdf file. The link appears as 'http://www.mbc.ca.gov/Licensees/License_Renewal/Physician_Survey.aspx' in the pdf file. However, it comes out as:
b'http://www.mbc.ca.gov/Licensees/License_Renewal/Physici\xe9C#|\xf2\xefw\x0e\xd3\x8d>X\x0f\xe7\xc6'
when performing a resolve() method on the PDFObjRef. Why the sudden corruption in the link there? Almost looks like a newline or something that got interpreted as a byte. Also, why is this even a byte string if it's clearly human readable? Is this normal behavior for pdfminer?
It gives this error when attempting to decode that byte string with utf-8:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 55: invalid continuation byte
I think this is a goner. The script works with every PDF I've encountered except this one. So unless someone can come up with a reason why pdfminer would produce a weird/corrupt encoding 40-60 characters into a byte string, this is FUBAR.
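There is no way to repair the corrupt bytes themselves, but if the goal is just to salvage the readable prefix, the byte string can be decoded tolerantly; a minimal sketch using the value from the question:
raw = b'http://www.mbc.ca.gov/Licensees/License_Renewal/Physici\xe9C#|\xf2\xefw\x0e\xd3\x8d>X\x0f\xe7\xc6'

# Replace undecodable bytes with U+FFFD instead of raising UnicodeDecodeError...
text = raw.decode('utf-8', errors='replace')

# ...or drop them entirely and keep only what decodes cleanly.
clean = raw.decode('utf-8', errors='ignore')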

How to convert unicode characters into their respective symbols in python?

I have a text file which contains unicode characters in the following format:
\u0935\u094d\u0926\u094d\u0928\u094d\u0935\u094d\u0926\
I want to convert it into Devanagari characters in the following format:
वर्जनरूपमिति दर्शित्म् । स पूरुषः अमृतत्वाय कल्पते व्द्न्व्द
and then write it to a file.
Presently my code
encoded = x.encode('utf-8')
print (encoded.decode('unicode-escape'))
can print the Devanagari characters in the terminal. However, when I try to write it to a file using
text = 'target:'+encoded.decode('unicode-escape')+'\n'
fileid.write(text)
I am getting the following error.
'ascii' codec can't encode characters in position 7-18: ordinal not in range(128)
Can anybody please help me?
If you are using Python 2, it's because after .decode('unicode-escape') you have a unicode object, and fileid.write() only accepts byte strings. Python therefore tries to convert the object to a byte string using the ASCII codec, which doesn't cover Devanagari characters, and that implicit conversion causes the exception.
You need to manually convert the unicode string back into a byte string before writing it to the file:
fileid.write(text.encode('utf-8'))
Here I assumed you want UTF-8 encoding. If you want to save the characters in another encoding replace 'utf-8' with the name of that encoding.
In Python 3 you can set the used encoding when opening the file:
fileid = open('compare.txt', 'a', encoding='utf-8')
Then the extra .encode('utf-8') isn't necessary.
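Putting the pieces together for Python 3 (the input filename below is illustrative; 'compare.txt' follows the example above):
with open('input.txt', encoding='ascii') as src:        # file of \uXXXX escape sequences
    escaped = src.read()

decoded = escaped.encode('ascii').decode('unicode-escape')   # actual Devanagari text

with open('compare.txt', 'a', encoding='utf-8') as fileid:
    fileid.write('target:' + decoded + '\n')             # no extra .encode() needed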

How to open a binary file stored in Google App Engine?

I have generated a binary file with word2vec, stored the resulting .bin file in my GCS bucket, and ran the following code in my App Engine app handler:
gcs_file = gcs.open(filename, 'r')
content = gcs_file.read().encode("utf-8")
""" call word2vec with content so it doesn't need to read a file itself, as we don't have a filesystem in GAE """
Fails with this error:
content = gcs_file.read().encode("utf-8")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 15: ordinal not in range(128)
A similar decode error happens if I try gcs_file.read(), or gcs_file.read().decode("utf-8").encode("utf-8").
Any ideas on how to read a binary file from GCS?
Thanks
If the file is binary, a character encoding does not apply to it, and UTF-8 is a character encoding: it is just one possible byte encoding of Unicode text (string data). You need to go back and read up on what UTF-8 and ASCII represent and how they are used.
If the data is not text that was encoded with a specific encoding, it is not going to magically decode, which is why you are getting that error: 0xf6 is not a valid ASCII byte.
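Since the .bin file is binary, the fix is simply to read the raw bytes and skip the encode/decode step entirely; a minimal sketch, assuming the same cloudstorage client the question's gcs refers to:
import cloudstorage as gcs

gcs_file = gcs.open(filename, 'r')   # 'r' returns the raw bytes as a Python 2 str
content = gcs_file.read()            # no .encode()/.decode() -- keep the bytes as-is
gcs_file.close()
# hand the raw bytes to word2vec (or to whatever in-memory buffer it can consume)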

How to read ® character from Windows-1252 file and write to UTF-8 file

I have an input file in Windows-1252 encoding that contains the '®' character. I need to write this character to a UTF-8 file. Also assume I must use Python 2.7. Seems easy enough, but I keep getting UnicodeDecodeErrors.
I originally had just opened the original file using codecs.open() with UTF-8 encoding, which worked fine for all of the ASCII characters until it encountered the ® symbol, whereupon it choked with the error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xae in position 2867043:
invalid start byte
I knew that I would have to properly decode it as cp1252 to fix this problem, so I opened it in the proper encoding and then encoded the data as UTF-8 prior to writing. But that produced a new error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 22:
ordinal not in range(128)
Here is a minimum working example:
with codecs.open('in.txt', mode='rb', encoding='cp1252') as inf:
    with codecs.open('out.txt', mode='wb', encoding='utf-8') as of:
        for line in inf:
            of.write(line.encode('utf-8'))
Here is the contents of in.txt:
Sample file
Here is my sample file® yay.
I thought perhaps I could just open it in 'rb' mode with no encoding specified and specifically handle the decoding and encoding of each line like so:
of.write(line.decode('cp1252').encode('utf-8'))
But that also didn't work, giving the same error as when I just opened it as UTF-8.
How do I read data from a Windows-1252 file, properly decode it then encode it as UTF-8 and write it to a UTF-8 file? The above method has always worked for me in the past until I encountered the ® character.
The 0xC2 in the second error doesn't come from your file at all: line is already unicode, and line.encode('utf-8') turns ® into the two bytes 0xC2 0xAE. The codecs writer then tries to encode that byte string again, implicitly decoding it as ASCII first, which is what fails.
You should just use
of.write(line)
since encoding properly is the whole reason you're using codecs in the first place.
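Putting it together, a minimal Python 2.7 sketch of the whole conversion (filenames as in the question):
import codecs

with codecs.open('in.txt', mode='rb', encoding='cp1252') as inf:
    with codecs.open('out.txt', mode='wb', encoding='utf-8') as of:
        for line in inf:
            of.write(line)   # line is already unicode; the writer encodes it as UTF-8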
