Consider the following snippet:
import io
from pdfminer.high_level import extract_text_to_fp

result = io.StringIO()
with open("file.pdf") as fp:
    extract_text_to_fp(fp, result, output_type='xml')
data = result.getvalue()
This results in the following error:
ValueError: Codec is required for a binary I/O output
If I leave out output_type, I get
`UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 3804: character maps to <undefined>` instead.
I don't understand why this happens, and would like help with a workaround.
I figured out how to fix the problem:
First, you need to open "file.pdf" in binary mode. Then, if you want to read into memory, use BytesIO instead of StringIO and decode the result.
For example:
import io
from pdfminer.high_level import extract_text_to_fp

result = io.BytesIO()
with open("file.pdf", 'rb') as fp:
    extract_text_to_fp(fp, result, output_type='xml')
data = result.getvalue().decode("utf-8")
Related
I am trying to read a joblib file from Azure blob (see code below). However, I get the following error:
with open(filename, 'rb') as f:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9d in position 2: invalid start byte
Code:
from azure.storage.blob import BlobClient
import sklearn.externals
import joblib
blob_client = BlobClient.from_connection_string('connection_string', 'myContainer', 'myBlob.joblib')
downloaded_blob = blob_client.download_blob()
model = joblib.load(downloaded_blob.readall())
pickle has loads() which works fine. How can I achieve the same with joblib?
I used readinto instead of readall (to write the raw stream and avoid encoding problems), then saved it to a temporary file and loaded the model from that path:
import tempfile

def read_blob_model(azure_storage_connectionstring, container_name, path):
    blob_client = BlobClient.from_connection_string(azure_storage_connectionstring, container_name, path)
    downloaded_blob = blob_client.download_blob()
    temp_file = tempfile.mkstemp()  # returns (fd, path)
    with open(temp_file[1], 'wb') as f:
        downloaded_blob.readinto(f)
    return temp_file[1]
temp_file = read_blob_model(azure_storage_connectionstring, container_name, roomtype_model)
model = joblib.load(temp_file)
Joblib can also read from BytesIO instead of a file:
from io import BytesIO
import joblib

model_binary = BytesIO()
blob_client.download_blob().readinto(model_binary)
model_binary.seek(0)  # rewind so joblib reads from the start of the buffer
model = joblib.load(model_binary)
The issue is due to reading a file that contains special characters.
Python tries to convert a byte array (a bytes object which it assumes to be a utf-8-encoded string) to a unicode string (str). This process is, of course, decoding according to utf-8 rules. When it tries this, it encounters a byte sequence which is not allowed in utf-8-encoded strings (namely this 0xff at position 0).
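For instance, a minimal illustration with a hypothetical byte string:
>>> b'\xff\x00abc'.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte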
Use this solution and it will strip out (ignore) the offending characters, returning the string without them. Only use this if your need is to strip them, not convert them:
with open(path, encoding="utf8", errors='ignore') as f:
    content = f.read()
Using errors='ignore' you'll just lose some characters. If you don't care about them, for example because they are extra characters originating from badly formatted or badly programmed clients connecting to your socket server, then it's an easy, direct solution.
Reference: https://docs.python.org/3/howto/unicode.html#the-unicode-type
Another way to fix the problem is to change the encoding. You can find the other available encodings in Python's list of standard encodings (https://docs.python.org/3/library/codecs.html#standard-encodings).
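For example, a minimal sketch (the filename and the choice of cp1252 here are assumptions; pick whichever encoding actually matches your data):
# hypothetical: the file is Windows-1252 encoded rather than UTF-8
with open("file.txt", encoding="cp1252") as f:
    content = f.read()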
I have an excel-like data structure composed of bytes that I was not able to decode.
It is a list that looks like:
my_object = [b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1..., ........, b'\x00\x00\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff']
(Note that the last element of my_object is a real one and is written out here in full.)
If I try decoding elements independently, I get:
my_object[-1].decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 3: invalid start byte
(Note that I tried several different codecs, including: 'utf8', 'ascii', 'ISO-8859-2', 'gbk', 'latin_1', ...)
However, if I try to save my_object to a file first, using:
with open('test.xls', 'wb') as f:
    for chunk in my_object:
        f.write(chunk)
and then open it using pandas like:
import pandas as pd
pd.read_excel('test.xls')
I get the expected result:
Time (s) Acceleration x (m/s^2) Acceleration y (m/s^2) \
0 0.000000 0.863679 0.196953
1 0.002500 0.892268 0.206483
2 0.005001 0.844621 0.196953
......
This is a nice workaround; however, I would really like to avoid writing to and reading from disk to perform such an operation.
Can anyone help?
Thank you in advance.
If you just want pandas to read an Excel file when you already have the raw bytes in memory, you can use the io package to turn a string or bytes into a readable file in memory:
import io
import pandas as pd

file_bytes = b''.join(my_object)
df = pd.read_excel(io.BytesIO(file_bytes))
I'm using python3.3. I've been trying to decode a certain string that looks like this:
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xed:\xf9w\xdaH\xd2?\xcf\xbc....
It keeps going on like this. However, whenever I try to decode this string using str.decode('utf-16'), I get an error saying:
'utf16' codec can't decode bytes in position 54-55: illegal UTF-16 surrogate
I'm not exactly sure how to decode this string.
Gzipped data begins with \x1f\x8b\x08, so my guess is that your data is gzipped. Try gunzipping the data before decoding:
import io
import gzip

# this raises IOError because `buf` is incomplete. It may work if you supply the complete buf
buf = b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xed:\xf9w\xdaH\xd2?\xcf\xbc'
with gzip.GzipFile(fileobj=io.BytesIO(buf)) as f:
    content = f.read()
print(content.decode('utf-16'))
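On Python 3 there is also a shorter one-shot equivalent, gzip.decompress. It likewise needs the complete buffer, so it will fail on the truncated buf above:
import gzip

# decompress the whole in-memory buffer in one call, then decode
content = gzip.decompress(buf)
print(content.decode('utf-16'))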
I'm trying to work with CIFAR-10 dataset which contains a special version for python.
It is a set of binary files, each representing a dictionary of 10k numpy matrices. The files were obviously created by python2 cPickle.
I tried to load it from python2 as follows:
import cPickle
with open("data/data_batch_1", "rb") as f:
    data = cPickle.load(f)
This works really well. However, if I try to load the data from python3 (which has pickle instead of cPickle), it fails:
import pickle
with open("data/data_batch_1", "rb") as f:
    data = pickle.load(f)
It fails with the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 6: ordinal not in range(128)
Can I somehow transform the original dataset into a new one that will be readable from python3? Or can I somehow read it from python3 directly?
I've tried loading it with cPickle, dumping it to json and reading it back with pickle, but numpy matrices obviously can't be written to a json file.
You'll need to tell pickle what codec to use for those bytestrings, or tell it to load the data as bytes instead. From the pickle.load() documentation:
The encoding and errors tell pickle how to decode 8-bit string instances pickled by Python 2; these default to ‘ASCII’ and ‘strict’, respectively. The encoding can be ‘bytes’ to read these 8-bit string instances as bytes objects.
To load the strings as bytes objects, that'd be:
import pickle
with open("data/data_batch_1", "rb") as f:
    data = pickle.load(f, encoding='bytes')
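Note that with encoding='bytes' the dictionary keys also come back as bytes objects, so (assuming the usual CIFAR-10 batch layout) you would index with byte keys:
# keys are bytes, not str, when loaded with encoding='bytes'
images = data[b'data']
labels = data[b'labels']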
When running a Python program that reads from stdin, I get the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 320: ordinal not in range(128)
How can I fix it?
Note: The error occurs internally in antlr, and the offending line looks like this:
self.strdata = unicode(data)
Since I don't want to modify the source code,
I'd like to pass in something that is acceptable.
The input code looks like this:
#!/usr/bin/python
import sys
import codecs
import antlr3
import antlr3.tree
from LatexLexer import LatexLexer
from LatexParser import LatexParser
char_stream = antlr3.ANTLRInputStream(codecs.getreader("utf8")(sys.stdin))
lexer = LatexLexer(char_stream)
tokens = antlr3.CommonTokenStream(lexer)
parser = LatexParser(tokens)
r = parser.document()
The problem is that, when reading from stdin, Python decodes it using the system default encoding:
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
The input is very likely UTF-8 or Windows-CP-1252, so the program chokes on non-ASCII characters.
To convert sys.stdin to a stream with the proper decoder, I used:
import codecs
char_stream = codecs.getreader("utf-8")(sys.stdin)
That fixed the problem.
BTW, this is the method ANTLR's FileStream uses to open a file with a given filename (instead of a given stream):
fp = codecs.open(fileName, 'rb', encoding)
try:
    data = fp.read()
finally:
    fp.close()
BTW #2: For strings I found
a_string.encode(encoding)
useful.
You're not getting this error on input; you're getting it when trying to output the data you read. You should be decoding the data as you read it and passing unicode objects around, instead of dealing with bytestrings the whole time.
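A minimal sketch of that pattern on Python 2 (assuming the input is UTF-8):
import sys

raw = sys.stdin.read()                   # bytes (str) in Python 2
text = raw.decode('utf-8')               # decode once at the input boundary
# ... work with `text` (a unicode object) throughout ...
sys.stdout.write(text.encode('utf-8'))   # encode once on the way out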
Here is an excellent write-up about how Python handles encodings:
How to use UTF-8 with Python