Trouble decoding utf-16 string - python

I'm using Python 3.3. I've been trying to decode a byte string that looks like this:
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xed:\xf9w\xdaH\xd2?\xcf\xbc....
and it keeps going on. However, whenever I try to decode this string using .decode('utf-16'), I get an error saying:
'utf16' codec can't decode bytes in position 54-55: illegal UTF-16 surrogate
I'm not exactly sure how to decode this string.

Gzipped data begins with \x1f\x8b\x08, so my guess is that your data is gzipped. Try gunzipping the data before decoding.
import io
import gzip
# This raises an error (IOError or EOFError, depending on the Python version)
# because `buf` is incomplete. It may work if you supply the complete buf.
buf = b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xed:\xf9w\xdaH\xd2?\xcf\xbc'
with gzip.GzipFile(fileobj=io.BytesIO(buf)) as f:
    content = f.read()
print(content.decode('utf-16'))
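Because the buffer in the question is truncated, here is a minimal self-contained sketch of the same idea. It assumes the payload really is gzip-compressed UTF-16 text; the sample text and the helper name `decode_maybe_gzipped` are made up for illustration.

```python
import gzip

# Hypothetical payload: UTF-16 text that was gzip-compressed before transmission.
original_text = "h\u00e9llo w\u00f6rld"
payload = gzip.compress(original_text.encode("utf-16"))

def decode_maybe_gzipped(data, encoding="utf-16"):
    """Gunzip `data` if it starts with the gzip magic bytes, then decode it."""
    if data[:2] == b"\x1f\x8b":
        data = gzip.decompress(data)
    return data.decode(encoding)

decoded = decode_maybe_gzipped(payload)
print(decoded)
```

Checking the first two bytes before decompressing lets the same helper handle data that was never compressed in the first place.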

Related

Problem reading pdf to xml into memory using PDFMiner.Six

Consider the following snippet:
import io
from pdfminer.high_level import extract_text_to_fp
result = io.StringIO()
with open("file.pdf") as fp:
    extract_text_to_fp(fp, result, output_type='xml')
data = result.getvalue()
This results in the following error
ValueError: Codec is required for a binary I/O output
If I leave out output_type, I get this error instead:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 3804: character maps to <undefined>
I don't understand why this happens, and would like help with a workaround.
I figured out how to fix the problem:
First you need to open "file.pdf" in binary mode. Then, if you want to read to memory, use BytesIO instead of StringIO and decode that.
For example:
import io
from pdfminer.high_level import extract_text_to_fp
result = io.BytesIO()
with open("file.pdf", 'rb') as fp:
    extract_text_to_fp(fp, result, output_type='xml')
data = result.getvalue().decode("utf-8")

Python unable to decode byte string

I am having a problem decoding a byte string that I have to send from one computer to another. The file is a PDF. I get this error:
fileStrings[i] = fileStrings[i].decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 648: invalid continuation byte
Any ideas on how to remove the b' ' marking? I need to reassemble the file on the other side, but I also need to know its size in bytes before sending it, and I figured I would get that by decoding each byte string (this works for txt files but not for PDF ones).
Code is:
with open(inputne, "rb") as file:
    while 1:
        readBytes = file.read(dataMaxSize)
        fileStrings.append(readBytes)
        if not readBytes:
            break
        readBytes = ''
filesize = 0
for i in range(0, len(fileStrings)):
    fileStrings[i] = fileStrings[i].decode()
    filesize += len(fileStrings[i])
Edit: For anyone having the same issue, calling len() on the byte string gives you its size; the b'' is not counted.
In Python, bytestrings are for raw binary data, and strings are for textual data. decode() tries to decode the data as UTF-8 by default, which works for txt files but not for PDF files, since they can contain arbitrary bytes. You should not try to get a string here, since bytestrings are designed for exactly this purpose. You can get the length of a bytestring as usual, with len(data). Many string operations also apply to bytestrings, such as concatenation and slicing (data1 + data2 and data[1:3]).
As a side note, the b'' when you print it is just because the __str__ method for bytestrings is equivalent to repr. It's not in the data itself.
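As a rough sketch of that advice: read the file in binary chunks, measure it with len(), and never decode. The helper name `read_chunks`, the chunk size, and the sample bytes standing in for a PDF are all made up for illustration.

```python
import os
import tempfile

def read_chunks(path, chunk_size):
    """Read a file in binary mode and return its contents as a list of bytes chunks."""
    chunks = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            chunks.append(chunk)
    return chunks

# Hypothetical stand-in for a PDF: bytes that are not valid UTF-8.
sample = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n" + bytes(range(256))
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)

chunks = read_chunks(tmp.name, chunk_size=64)
file_size = sum(len(chunk) for chunk in chunks)  # size in bytes, no decoding needed
reassembled = b"".join(chunks)                   # the receiver can rebuild the file this way
os.remove(tmp.name)
print(file_size)
```

The chunks can be sent as they are; the receiving side only needs to join them and write the result in binary mode.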

Python decode and encode with utf-8

I am trying to encode and decode with utf-8. What is weird is that I get an error traceback saying that I am using gbk.
oneword.decode("utf-8")]
Below is the error traceback.
UnicodeEncodeError: 'gbk' codec can't encode character '\u2769' in position 1: illegal multibyte sequence
Can anyone tell me what to do? It seems that the decode parameter has no effect.
I got it solved.
Actually, I intended to write the output to a file instead of the console. In that situation, I have to explicitly specify the encoding of the output file. Instead of using open, I used codecs.open.
import codecs
f = codecs.open(filename, mode='w', encoding='utf-8')
Thanks to @Bakuriu from the comments:
If you are using Python 3 you no longer need to import the codecs module. Just pass the encoding parameter to the built-in open function.
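A minimal sketch of that suggestion for Python 3 follows. The output path is a throwaway temporary file, and the sample text is made up apart from the '\u2769' character taken from the traceback.

```python
import os
import tempfile

text = "a\u2769b"  # contains U+2769, the character from the traceback

path = os.path.join(tempfile.mkdtemp(), "out.txt")  # hypothetical output file

# Pass the encoding explicitly instead of relying on the platform default
# (which was gbk on the machine in the question).
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, "r", encoding="utf-8") as f:
    round_tripped = f.read()

# For comparison: the same text cannot be encoded with gbk.
try:
    text.encode("gbk")
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False
```

Without the encoding argument, open() falls back to a locale-dependent default, which is why the error mentioned gbk even though the code only named utf-8.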

unicode error with codecs when reading a pdf file in python

I am trying to read a PDF file with the following content:
%PDF-1.4\n%âãÏÓ
If I read it with open, it works, but if I try codecs.open(filename, encoding="utf8", mode="rb") to get a unicode string, I get the following exception:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
Do you know a way to get a unicode string from the content of this file?
PS: I am using python 2.7
PDFs are made of binary data, not text. They cannot be meaningfully represented as Unicode strings.
For what it's worth, you can get a Unicode string containing those particular characters by treating the PDF as ISO8859-1 text:
f = codecs.open(filename, encoding="ISO8859-1", mode="rb")
But at that point, you're better off just using normal open and reading bytes. Unicode is for text, not data.
The issue of trying to interpret arbitrary binary data as text aside, 0xe2 is â in Latin-1, not UTF-8. You're using the wrong codec.
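The difference between the two codecs can be checked directly on the header bytes from the question. This is a small sketch in Python 3 syntax (the question uses 2.7), and the variable names are made up.

```python
# The first bytes of the PDF from the question, as raw bytes.
header = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3"

# UTF-8 fails: 0xe2 starts a multi-byte sequence, but 0xe3 is not a continuation byte.
try:
    header.decode("utf-8")
    error_position = None
except UnicodeDecodeError as exc:
    error_position = exc.start

# Latin-1 (ISO 8859-1) maps every byte to exactly one character, so it never fails
# and the original bytes can always be recovered.
as_latin1 = header.decode("latin-1")
round_trip = as_latin1.encode("latin-1")
```

That Latin-1 always succeeds does not make the result meaningful text; it only guarantees a lossless byte-to-character mapping.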

Python zipfile sending via suds error: "'ascii' codec can't decode byte 0x8c in position 10: ordinal not in range(128)"

I get the following error when I try to send zip file content via a suds method:
'ascii' codec can't decode byte 0x8c in position 10: ordinal not in range(128)
Here is my code:
try:
    project_archive = open(os.path.join(settings.MEDIA_ROOT, 'zip/project.zip'), "rb")
    data = project_archive.read()
    client = Client(settings.UPLOAD_PROJECT_WS_URL)
    client.service.uploadProject(data)
except Exception as e:
    return HttpResponse(e)
else:
    return HttpResponse("Project was exported")
suds doesn't support SOAP file attachments (at least not the last time I checked, and it has been a while).
There is a workaround here:
https://fedorahosted.org/suds/attachment/ticket/350/soap_attachments.2.py
Alternatively, use a different library.
Assuming that in the WSDL the argument type is xsd:base64Binary, you need to:
client.service.uploadProject(base64.b64encode(data))
In my case the server was written in JAX-WS and the function argument type was Byte[], and Base64 worked for me.
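As a small illustration of why Base64 helps here: the encoded form contains only ASCII characters, so an ascii codec has nothing to choke on. The sample bytes below are made up, and whether the service expects bytes or str is an assumption to check against the WSDL.

```python
import base64

# Hypothetical zip-like bytes, including 0x8c, the byte from the error message.
data = bytes([0x50, 0x4B, 0x03, 0x04, 0x8C, 0x00, 0xFF])

encoded = base64.b64encode(data)        # bytes containing only ASCII characters
encoded_text = encoded.decode("ascii")  # str form, if the client expects text
decoded = base64.b64decode(encoded_text)
```

The receiving side decodes the Base64 text to get the original zip bytes back unchanged.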
The problem seems to be simply that you are trying to read a Unicode-formatted file using the ascii codec. Refer to http://docs.python.org/2/howto/unicode.html for the official documentation on Unicode. You can also look at "Unicode (UTF-8) reading and writing to files in Python" for a similar discussion.
In short, for your problem the following code should work:
import codecs
project_archive = codecs.open(os.path.join(settings.MEDIA_ROOT, 'zip/project.zip'), "rb", "utf-8")
data = project_archive.read()
In the above solution it is assumed that the Unicode encoding used is utf-8. If some other codec (for example ISO-8859-1) is being used, substitute that for utf-8.
