Reading a joblib file from Azure Blob Storage in Python

I am trying to read a joblib file from Azure blob (see code below). However, I get the following error:
with open(filename, 'rb') as f:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9d in position 2: invalid start byte
Code:
from azure.storage.blob import BlobClient
import sklearn.externals
import joblib
blob_client = BlobClient.from_connection_string('connection_string', 'myContainer', 'myBlob.joblib')
downloaded_blob = blob_client.download_blob()
model = joblib.load(downloaded_blob.readall())
pickle has loads() which works fine. How can I achieve the same with joblib?

I used readinto instead of readall (to get a stream and avoid encoding problems) and wrote the download into a temporary file, returning its path so it can be passed to joblib.load:
import tempfile

import joblib
from azure.storage.blob import BlobClient

def read_blob_model(azure_storage_connectionstring, container_name, path):
    blob_client = BlobClient.from_connection_string(azure_storage_connectionstring, container_name, path)
    downloaded_blob = blob_client.download_blob()
    temp_file = tempfile.mkstemp()
    with open(temp_file[1], 'wb') as f:
        downloaded_blob.readinto(f)
    return temp_file[1]

temp_file = read_blob_model(azure_storage_connectionstring, container_name, roomtype_model)
model = joblib.load(temp_file)

Joblib can also read from BytesIO instead of a file:
from io import BytesIO
import joblib

model_binary = BytesIO()
blob_client.download_blob().readinto(model_binary)
model_binary.seek(0)  # rewind to the start so joblib reads from the beginning
model = joblib.load(model_binary)

The issue is caused by reading a file that contains special characters, i.e. bytes that are not valid UTF-8.
Python tries to convert a byte array (a bytes object that it assumes is a UTF-8 encoded string) to a unicode string (str). That process is, of course, decoding according to the UTF-8 rules. While doing so it encounters a byte sequence that is not allowed in UTF-8 encoded strings (here the 0x9d at position 2).
The following will strip out (ignore) the offending characters and return the string without them. Only use this if you need to drop them rather than convert them:
with open(path, encoding="utf8", errors='ignore') as value:
With errors='ignore' you simply lose some characters, but if you do not care about them (for example, they are stray bytes produced by badly behaved clients), it is an easy, direct solution.
Reference: https://docs.python.org/3/howto/unicode.html#the-unicode-type
Another way to fix the problem is to change the encoding you decode with; other codecs are listed under standard-encodings in the Python documentation.
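For illustration, a minimal sketch of what the two error handlers and an alternative codec do (the byte string here is just a made-up example):
raw = b'abc\x9ddef'                           # bytes that are not valid UTF-8
print(raw.decode('utf-8', errors='ignore'))   # 'abcdef'       - bad byte dropped
print(raw.decode('utf-8', errors='replace'))  # 'abc\ufffddef' - bad byte replaced with U+FFFD
print(raw.decode('latin-1'))                  # always succeeds: every byte value maps to a character
Note that a .joblib file is binary data, so for the original question the better fix is not to decode it at all but to keep it as bytes, as in the answers above.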

Related

Problem reading pdf to xml into memory using PDFMiner.Six

Consider the following snippet:
import io

result = io.StringIO()
with open("file.pdf") as fp:
    extract_text_to_fp(fp, result, output_type='xml')
data = result.getvalue()
This results in the following error:
ValueError: Codec is required for a binary I/O output
If I leave out output_type, I instead get
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 3804: character maps to <undefined>
I don't understand why this happens, and would like help with a workaround.
I figured out how to fix the problem:
First you need to open "file.pdf" in binary mode. Then, if you want to read to memory, use BytesIO instead of StringIO and decode that.
For example:
import io
from pdfminer.high_level import extract_text_to_fp

result = io.BytesIO()
with open("file.pdf", 'rb') as fp:
    extract_text_to_fp(fp, result, output_type='xml')
data = result.getvalue().decode("utf-8")

Is it possible to specify the encoding of a file with Paramiko?

I'm trying to read a CSV over SFTP using pysftp/Paramiko. My code looks like this:
input_conn = pysftp.Connection(hostname, username, password)
file = input_conn.open("Data.csv")
file_contents = list(csv.reader(file))
But when I do this, I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 23: invalid start byte
I know that this means the file is expected to be in UTF-8 encoding but isn't. The strange thing is, if I download the file and then use my code to open the file, I can specify the encoding as "macroman" and get no error:
with open("Data.csv", "r", encoding="macroman") as csvfile:
file_contents = list(csv.reader(csvfile))
The Paramiko docs say that the encoding of a file is meaningless over SFTP because it treats all files as bytes – but then, how can I get Python's CSV module to recognize the encoding if I use Paramiko to open the file?
If the file is not huge, so it is not a problem to have it loaded into memory twice, you can download and convert the contents in memory:
with io.BytesIO() as bio:
    input_conn.getfo("Data.csv", bio)
    bio.seek(0)

    with io.TextIOWrapper(bio, encoding='macroman') as f:
        file_contents = list(csv.reader(f))
Partially based on Convert io.BytesIO to io.StringIO to parse HTML page.
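For reference, a self-contained sketch of that approach (hostname, username and password are the question's own placeholders, macroman is the encoding the asker found to work, and pysftp's getfo downloads into any file-like object):
import csv
import io

import pysftp

input_conn = pysftp.Connection(hostname, username=username, password=password)

with io.BytesIO() as bio:
    input_conn.getfo("Data.csv", bio)        # download the raw bytes
    bio.seek(0)                              # rewind before wrapping
    with io.TextIOWrapper(bio, encoding='macroman') as f:
        file_contents = list(csv.reader(f))  # csv.reader now sees decoded text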

Why can't I create a file object from a network datastream

I'm downloading a tarfile from a REST API, writing it to a local file, then extracting the contents locally. Here's my code:
with open('output.tar.gz', 'wb') as f:
    f.write(o._retrieve_data_stream(p).read())

with open('output.tar.gz', 'rb') as f:
    t = tarfile.open(fileobj=f)
    t.extractall()
o._retrieve_data_stream(p) retrieves the datastream for the file.
This code works fine, but it seems unnecessarily complicated to me. I think I should be able to read the byte stream directly into the file object that tarfile reads. Something like this:
with open(o._retrieve_data_stream(p).read(), 'rb') as f:
    t = tarfile.open(fileobj=f)
    t.extractall()
I realize that my syntax may be a little shaky there, but I think it communicates what I'm trying to do.
But when I do this, I get an encoding error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
What's going on?
Posting because I solved it while I was writing this. Turns out I needed to use a BytesIO object.
This code works as expected:
from io import BytesIO
t = tarfile.open(fileobj=BytesIO(o._retrieve_data_stream(p).read()))
t.extractall()
Canadian_Marine's answer was very close to what I needed but not quite enough for my particular situation. Seeing the BytesIO object inside the open command in their answer helped me solve my problem.
I found it necessary to split the request out from the tarfile.open call and then wrap the response content in a BytesIO object within that call. Here is my code:
from io import BytesIO
import requests
import tarfile

remote_file = requests.get('https://download.site.com/files/file.tar.gz')

# Extract tarball contents to memory
tar = tarfile.open(fileobj=BytesIO(remote_file.content))

# Optionally print all folders / files within the tarball
print(tar.getnames())

tar.extractall('/home/users/Documents/target_directory/')
This eliminated the "ValueError: embedded null byte" and "expected str, bytes or os.PathLike object, not _io.BytesIO" errors that I was experiencing with other methods.
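If you would rather not hold the whole archive in memory, tarfile can also read a compressed stream directly. A hedged sketch using the same hypothetical URL, assuming the server sends the plain .tar.gz bytes without an extra Content-Encoding:
import requests
import tarfile

# Stream the response instead of buffering the whole archive
with requests.get('https://download.site.com/files/file.tar.gz', stream=True) as remote_file:
    remote_file.raise_for_status()
    # mode='r|gz' lets tarfile read from a non-seekable stream
    with tarfile.open(fileobj=remote_file.raw, mode='r|gz') as tar:
        tar.extractall('/home/users/Documents/target_directory/')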

Re-encode Unicode stream as Ascii ignoring errors

I'm trying to take a Unicode file stream, which contains odd characters, and wrap it with a stream reader that will convert it to Ascii, ignoring or replacing all characters that can't be encoded.
My stream looks like:
"EventId","Rate","Attribute1","Attribute2","(。・ω・。)ノ"
...
My attempt to alter the stream on the fly looks like this:
import chardet, io, codecs

with open(self.csv_path, 'rb') as rawdata:
    detected = chardet.detect(rawdata.read(1000))
    detectedEncoding = detected['encoding']

with io.open(self.csv_path, 'r', encoding=detectedEncoding) as csv_file:
    csv_ascii_stream = codecs.getreader('ascii')(csv_file, errors='ignore')
    log(csv_ascii_stream.read())
The result on the log line is: UnicodeEncodeError: 'ascii' codec can't encode characters in position 36-40: ordinal not in range(128) even though I explicitly constructed the StreamReader with errors='ignore'
I would like the resulting stream (when read) to come out like this:
"EventId","Rate","Attribute1","Attribute2","(?????)?"
...
or alternatively, "EventId","Rate","Attribute1","Attribute2","()" (using 'ignore' instead of 'replace')
Why is the Exception happening anyway?
I've seen plenty of problems/solutions for decoding strings, but my challenge is to change the stream as it's being read (using .next()), because the file is potentially too large to be loaded into memory all at once using .read()
You're mixing up the encode and decode sides.
For decoding, you're doing fine. You open it as binary data, chardet the first 1K, then reopen in text mode using the detected encoding.
But then you're trying to further decode that already-decoded data as ASCII, by using codecs.getreader. That function returns a StreamReader, which decodes data from a stream. That isn't going to work. You need to encode that data to ASCII.
But it's not clear why you're using a codecs stream decoder or encoder in the first place, when all you want to do is encode a single chunk of text in one go so you can log it. Why not just call the encode method?
log(csv_file.read().encode('ascii', 'ignore'))
If you want something that you can use as a lazy iterable of lines, you could build something fully general, but it's a lot simpler to just do something like the UTF8Recoder example in the csv docs:
class AsciiRecoder:
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)

    def __iter__(self):
        return self

    def next(self):
        return self.reader.next().encode("ascii", "ignore")
Or, even more simply:
with io.open(self.csv_path, 'r', encoding=detectedEncoding) as csv_file:
    csv_ascii_stream = (line.encode('ascii', 'ignore') for line in csv_file)
I'm a little late to the party with this, but here's an alternate solution, using codecs.StreamRecoder:
from codecs import getencoder, getdecoder, getreader, getwriter, StreamRecoder

with io.open(self.csv_path, 'rb') as f:
    csv_ascii_stream = StreamRecoder(f,
                                     getencoder('ascii'),
                                     getdecoder(detectedEncoding),
                                     getreader(detectedEncoding),
                                     getwriter('ascii'),
                                     errors='ignore')
    print(csv_ascii_stream.read())
I guess you may want to use this if you need the flexibility to be able to call read()/readlines()/seek()/tell() etc. on the stream that gets returned. If you just need to iterate over the stream, the generator expression abarnert provided is a bit more concise.

Convert UTF-8 with BOM to UTF-8 with no BOM in Python

Two questions here. I have a set of files which are usually UTF-8 with BOM. I'd like to convert them (ideally in place) to UTF-8 with no BOM. It seems like codecs.StreamRecoder(stream, encode, decode, Reader, Writer, errors) would handle this. But I don't really see any good examples on usage. Would this be the best way to handle this?
source files:
Tue Jan 17$ file brh-m-157.json
brh-m-157.json: UTF-8 Unicode (with BOM) text
Also, it would be ideal if we could handle different input encodings without explicitly knowing them (I have seen ASCII and UTF-16). It seems like this should all be feasible. Is there a solution that can take any known Python encoding and output as UTF-8 without a BOM?
Edit 1: proposed solution from below (thanks!):
fp = open('brh-m-157.json','rw')
s = fp.read()
u = s.decode('utf-8-sig')
s = u.encode('utf-8')
print fp.encoding
fp.write(s)
This gives me the following error:
IOError: [Errno 9] Bad file descriptor
Newsflash: I'm being told in the comments that my mistake was opening the file with mode 'rw' instead of 'r+'/'r+b', so I should eventually re-edit my question and remove the solved part.
Simply use the "utf-8-sig" codec:
fp = open("file.txt")
s = fp.read()
u = s.decode("utf-8-sig")
That gives you a unicode string without the BOM. You can then use
s = u.encode("utf-8")
to get a normal UTF-8 encoded string back in s. If your files are big, then you should avoid reading them all into memory. The BOM is simply three bytes at the beginning of the file, so you can use this code to strip them out of the file:
import os, sys, codecs

BUFSIZE = 4096
BOMLEN = len(codecs.BOM_UTF8)

path = sys.argv[1]
with open(path, "r+b") as fp:
    chunk = fp.read(BUFSIZE)
    if chunk.startswith(codecs.BOM_UTF8):
        i = 0
        chunk = chunk[BOMLEN:]
        while chunk:
            fp.seek(i)
            fp.write(chunk)
            i += len(chunk)
            fp.seek(BOMLEN, os.SEEK_CUR)
            chunk = fp.read(BUFSIZE)
        fp.seek(-BOMLEN, os.SEEK_CUR)
        fp.truncate()
It opens the file, reads a chunk, and writes it out to the file 3 bytes earlier than where it read it. The file is rewritten in place. An easier solution is to write the shorter content to a new file, as in newtover's answer. That would be simpler, but it uses twice the disk space for a short period.
As for guessing the encoding, you can just loop through the encodings from most to least specific:
def decode(s):
    for encoding in "utf-8-sig", "utf-16":
        try:
            return s.decode(encoding)
        except UnicodeDecodeError:
            continue
    return s.decode("latin-1")  # will always work
A UTF-16 encoded file won't decode as UTF-8, so we try UTF-8 first. If that fails, we try UTF-16. Finally, we use Latin-1, which will always work since all 256 byte values are legal in Latin-1. You may want to return None instead in this case, since it is really a fallback and your code might want to handle it more carefully (if it can).
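For completeness, a hedged usage sketch of that helper applied to the question's brh-m-157.json (assuming Python 3, where reading in binary mode returns bytes):
with open("brh-m-157.json", "rb") as fp:
    text = decode(fp.read())      # unicode text; a UTF-8 BOM is consumed by 'utf-8-sig'
with open("brh-m-157.json", "w", encoding="utf-8") as fp:
    fp.write(text)                # rewritten as UTF-8 without a BOM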
In Python 3 it's quite easy: read the file and rewrite it with utf-8 encoding:
s = open(bom_file, mode='r', encoding='utf-8-sig').read()
open(bom_file, mode='w', encoding='utf-8').write(s)
Another option is a small filter that strips a leading UTF-8 BOM while copying standard input to standard output (Python 2 style; on Python 3 read and write through sys.stdin.buffer and sys.stdout.buffer instead):
import codecs
import shutil
import sys

s = sys.stdin.read(3)
if s != codecs.BOM_UTF8:
    sys.stdout.write(s)
shutil.copyfileobj(sys.stdin, sys.stdout)
I found this question because I was having trouble with configparser.ConfigParser().read(fp) when opening files with a UTF-8 BOM header.
For those looking for a solution so that ConfigParser can open the config file instead of reporting the error "File contains no section headers", open the file like the following:
configparser.ConfigParser().read(config_file_path, encoding="utf-8-sig")
This can save you a lot of effort by making removal of the BOM header from the file unnecessary.
(I know this sounds unrelated, but hopefully it helps people struggling like me.)
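A minimal self-contained sketch of that call (the config.ini path is just an assumption for the example):
import configparser

parser = configparser.ConfigParser()
# encoding="utf-8-sig" transparently skips a UTF-8 BOM if one is present
parser.read("config.ini", encoding="utf-8-sig")
print(parser.sections())  # e.g. ['main'] once the BOM no longer hides the first header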
This is my implementation for converting any kind of encoding to UTF-8 without BOM and replacing Windows line endings with the universal format:
import codecs
import os

import chardet

def utf8_converter(file_path, universal_endline=True):
    '''
    Convert any type of file to UTF-8 without BOM
    and using universal endline by default.

    Parameters
    ----------
    file_path : string, file path.
    universal_endline : boolean (True),
        by default convert endlines to universal format.
    '''
    # Fix file path
    file_path = os.path.realpath(os.path.expanduser(file_path))
    # Read raw bytes from file
    file_open = open(file_path, 'rb')
    raw = file_open.read()
    file_open.close()
    # Decode using the detected encoding
    raw = raw.decode(chardet.detect(raw)['encoding'])
    # Remove Windows end of line
    if universal_endline:
        raw = raw.replace('\r\n', '\n')
    # Encode to UTF-8
    raw = raw.encode('utf8')
    # Remove BOM
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw.replace(codecs.BOM_UTF8, b'', 1)
    # Write bytes back to file
    file_open = open(file_path, 'wb')
    file_open.write(raw)
    file_open.close()
    return 0
You can use codecs.
import codecs

with open("test.txt", 'rb') as filehandle:
    content = filehandle.read()
    if content[:3] == codecs.BOM_UTF8:
        content = content[3:]
    print(content.decode("utf-8"))
In Python 3 you should add encoding='utf-8-sig':
with open(file_name, mode='a', encoding='utf-8-sig') as csvfile:
    csvfile.writelines(rows)
That's it.
