I have a task which trains a model, e.g.:

class ModelTrain(luigi.Task):
    def output(self):
        client = S3Client(os.getenv("CONFIG_AWS_ACCESS_KEY"),
                          os.getenv("CONFIG_AWS_SECRET_KEY"))
        model_output = os.path.join(
            "s3://", _BUCKET, exp.version + '_model.joblib')
        return S3Target(model_output, client)

    def run(self):
        joblib.dump(model, '/tmp/model.joblib')
        with open(self.output().path, 'wb') as out_file:
            out_file.write(joblib.load('/tmp/model.joblib'))

Running the task fails with:

FileNotFoundError: [Errno 2] No such file or directory: 's3://bucket/version_model.joblib'
Any pointers in this regard would be helpful
A few suggestions:
First, use the target's own self.output().open() method instead of wrapping open(self.output().path). Going through the raw path loses the 'atomicity' of Luigi targets, and targets are meant to be swappable: if you switched back to a LocalTarget, your code should work the same way. Let the specific target class handle what it means to open the file. The error you're getting looks like Python trying to resolve the S3 URI as a local filesystem path, which obviously doesn't work.
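As a rough illustration (a minimal sketch, not your actual task; the task name and path are made up), writing through the target keeps run() identical whether output() returns an S3Target or a LocalTarget:

import luigi

class WriteReport(luigi.Task):
    """Hypothetical task showing the swappable-target pattern."""

    def output(self):
        # Could just as easily return an S3Target; run() does not care.
        return luigi.LocalTarget('/tmp/report.txt')

    def run(self):
        # open() comes from the target itself, so atomic writes and the
        # storage backend are handled by the target class.
        with self.output().open('w') as out_file:
            out_file.write('done\n')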
Second, I just ran into the same issue, so here's my solution plugged into this code:
from luigi import format

class ModelTrain(luigi.Task):
    def output(self):
        client = S3Client(os.getenv("CONFIG_AWS_ACCESS_KEY"),
                          os.getenv("CONFIG_AWS_SECRET_KEY"))
        model_output = os.path.join(
            "s3://", _BUCKET, exp.version + '_model.joblib')
        # Use luigi.format.Nop for binary files
        return S3Target(model_output, client, format=format.Nop)

    def run(self):
        # where does `model` come from?
        with self.output().open('w') as s3_f:
            joblib.dump(model, s3_f)
My task is using pickle, so I had to follow something similar to this post to load the result back in a downstream task:
class MyNextTask(Task):
    ...
    def run(self):
        with my_pickled_task.output().open() as f:
            # The S3Target implements a read method and then I can use
            # the `.loads()` method to import from a binary string
            results = pickle.loads(f.read())
            ... do more stuff with results ...
I recognize this post is stale, but I'm putting the solution I found out there for the next poor soul trying to do the same thing.
Could you try removing .path from your open statement?
def run(self):
    joblib.dump(model, '/tmp/model.joblib')
    with open(self.output(), 'wb') as out_file:
        out_file.write(joblib.load('/tmp/model.joblib'))
I'm uploading files to Azure like so:

with open(tempfile, "rb") as data:
    blob_client.upload_blob(data, blob_type='BlockBlob', length=None, metadata=None)

How can I get a progress indication?
When I try uploading as a stream, it only uploads one chunk.
I'm sure I'm doing something wrong, but I can't find any info on this.
Thanks!
It looks like the Azure library doesn't include a callback function to monitor progress.
Fortunately, you can add a wrapper around Python's file object that calls a callback every time there's a read.
Try this:
import os
from io import BufferedReader, FileIO

class ProgressFile(BufferedReader):
    # For binary opening only
    def __init__(self, filename, read_callback):
        f = FileIO(file=filename, mode='r')
        self._read_callback = read_callback
        super().__init__(raw=f)
        # I prefer Pathlib but this should still support 2.x
        self.length = os.stat(filename).st_size

    def read(self, size=None):
        calc_sz = size
        if not calc_sz:
            calc_sz = self.length - self.tell()
        self._read_callback(position=self.tell(), read_size=calc_sz, total=self.length)
        return super(ProgressFile, self).read(size)

def my_callback(position, read_size, total):
    # Write your own callback. You could convert the absolute values to percentages
    # Using .format rather than f'' for compatibility
    print("position: {position}, read_size: {read_size}, total: {total}".format(position=position,
                                                                                read_size=read_size,
                                                                                total=total))

myfile = ProgressFile(filename='mybigfile.txt', read_callback=my_callback)
Then you would do
blob_client.upload_blob(myfile, blob_type='BlockBlob', length=None, metadata=None)
myfile.close()
Edit:
It looks like TQDM (progress monitor) has a neat wrapper: https://github.com/tqdm/tqdm#hooks-and-callbacks.
The bonus there is that you get easy access to a pretty progress bar.
This is how I ended up using the tqdm wrapper that Alastair mentioned above:
size = os.stat(fname).st_size
with tqdm.wrapattr(open(fname, 'rb'), "read", total=size) as data:
    blob_client.upload_blob(data)
Works perfectly: it shows a time estimate, a progress bar, human-readable file sizes, and transfer speeds.
I am trying to build a chatbot using the popular "chatterbot" library in Python. I made a class called Trainer which trains my chatbot, so in this class I initialize an instance of the chatbot and then train it. To avoid retraining it again and again, I thought I would pickle the instance of my Trainer class: if a pickled instance already exists, I don't need to retrain. I am using the dill library to pickle the class instance, but when it tries to pickle my model it shows me the following error:

_pickle.PicklingError: Can't pickle <class 'sqlalchemy.orm.session.Session'>: it's not the same object as sqlalchemy.orm.session.Session

Now, I don't see anywhere in my code that I have created any kind of session, but I believe the chatterbot library used in my Trainer class must be creating one internally. In fact, I checked the source code and it uses a logger, so it's possible that's what is causing the pain. I have no clue how to proceed with this problem. I tried changing the source code of the chatterbot library to remove every occurrence of the logger, but that did nothing except break the code. Can anyone help me fix this issue? I am attaching the required code here.
utils:

import logging
from pathlib import Path
import pickle
import dill
import os
from .trainer import Trainer

# Method returns the directories in which model objects are stored/saved.
def model_base_dir():
    directory = 'MLModel/pickled_models'
    parent_directory = os.pardir
    return os.path.abspath(os.path.join(parent_directory, directory))

def picked_model(base_dir=None):
    if base_dir == None:
        logging.exception("Base Directory does not exist !!!")
        raise AssertionError("Base Directory does not exist !!!")
    model_path = base_dir + '/version1.pickle'
    if Path(model_path).is_file():
        with open(model_path, 'rb') as handle:
            model = dill.load(handle)
            return model
    else:
        return None

def save_model(model_obj):
    dir = model_base_dir()  # Get the directory where model needs to be saved
    with open(dir + '/version1.pickle', 'wb') as f:
        dill.dump(model_obj, f)
        f.close()

def train_model():
    mod_obj = Trainer()
    save_model(mod_obj)
    return mod_obj
Trainer:

from chatterbot import ChatBot
from chatterbot.trainers import ChatterBotCorpusTrainer

class Trainer():
    def __init__(self):
        self.chatbot = ChatBot('Dexter')
        self.create_chatbot_trainer(language="chatterbot.corpus.english")

    def train_chatbot(self, trainer, language):
        return trainer.train(language)

    def create_chatbot_trainer(self, language):
        self.trainer = ChatterBotCorpusTrainer(self.chatbot)
        self.trainer = self.train_chatbot(self.trainer, language)
        return self.trainer

    def __getstate__(self):
        d = self.__dict__.copy()
        d.pop('_parents', None)
        return d

    def response(self, text=""):
        if text is None:
            return "Sorry, your query is empty"
        else:
            return self.chatbot.get_response(text)
train_model() gets triggered from my Django view.
Any help appreciated.
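As an aside, the __getstate__ hook in the Trainer class above follows the standard pattern for dropping attributes that can't be pickled. A minimal, chatterbot-free sketch of that pattern (the class and attribute names here are made up for illustration):

import pickle

class Worker:
    """Hypothetical class holding one unpicklable attribute."""

    def __init__(self):
        self.name = "worker"
        self.log_file = open("/dev/null", "wb")  # stand-in for an unpicklable object

    def __getstate__(self):
        # Copy the instance dict and drop whatever can't be pickled.
        state = self.__dict__.copy()
        state.pop("log_file", None)
        return state

    def __setstate__(self, state):
        # Restore the picklable attributes and rebuild the rest.
        self.__dict__.update(state)
        self.log_file = open("/dev/null", "wb")

restored = pickle.loads(pickle.dumps(Worker()))
print(restored.name)  # "worker"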
I have written two Python scripts that work with data in the same directory. One script will run every 5 minutes and save data to the directory, and once a day the other script will zip all of that data, archive it, and delete the original files. To avoid the archiving script deleting files that the worker script may still be writing, I want to create a system-wide mutex so that the worker script knows not to run while the archiver is doing its thing.
I've done some searching and seen that on Unix-based systems the generally accepted approach is to try to lock a file: if you get the lock, great, go ahead and run; if you can't get it, you know the other process is already running. I've written the following code:
import fcntl
import traceback

class ProcessLock:
    def __init__(self, path_to_file, block):
        self.file_path = path_to_file
        try:
            options = fcntl.LOCK_EX
            if not block:
                options = options | fcntl.LOCK_NB
            self.file = open(path_to_file, 'w+')
            self.lock = fcntl.flock(file, options)
        except:
            print 'caught something: {}'.format(traceback.format_exc())
            self.file = None
            self.lock = None

    def is_locked(self):
        return self.lock is not None

    def unlock(self):
        self.lock = None
        self.file = None

def aquire_lock(lock_name):
    path = '/tmp/{}.lock'.format(lock_name)
    return ProcessLock(path, False)

def aquire_lock_blocking(lock_name):
    path = '/tmp/{}.lock'.format(lock_name)
    return ProcessLock(path, True)
However, for the life of me I can't get it to actually work. All of the samples and other questions I've found here seem to use the same code I have. I've also tried both flock and lockf, but neither works. The call to open succeeds, but I get the following logged to the console:
self.lock = fcntl.flock(file, options)
TypeError: descriptor 'fileno' of 'file' object needs an argument
I don't know enough about Python to know what this error means. Hopefully someone can see what I'm doing wrong. I'm running this in PyCharm on macOS.
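For reference, a minimal self-contained sketch of the lock-a-file pattern described above (independent of the class in the question; the lock path is arbitrary). fcntl.flock expects the opened file object or its file descriptor, and a non-blocking attempt raises IOError/OSError when another process already holds the lock:

import fcntl
import sys

LOCK_PATH = '/tmp/archiver.lock'  # arbitrary path for this sketch

lock_file = open(LOCK_PATH, 'w')
try:
    # Pass the open file object (or lock_file.fileno()) to flock;
    # LOCK_NB makes the call fail immediately instead of waiting.
    fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
except (IOError, OSError):
    print('another instance is already running')
    sys.exit(1)

# ... do the work while holding the lock ...

fcntl.flock(lock_file, fcntl.LOCK_UN)
lock_file.close()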
I readily admit to going a bit overboard with unit testing.
While I have passing tests, I find my solution to be inelegant, and I'm curious if anyone has a cleaner solution.
The class being tested:
class Config():
    def __init__(self):
        config_parser = ConfigParser()
        try:
            self._read_config_file(config_parser)
        except FileNotFoundError as e:
            pass
        self.token = config_parser.get('Tokens', 'Token', )

    @staticmethod
    def _read_config_file(config):
        if not config.read(os.path.abspath(os.path.join(BASE_DIR, ROOT_DIR, CONFIG_FILE))):
            raise FileNotFoundError(f'File {CONFIG_FILE} not found at path {BASE_DIR}{ROOT_DIR}')
The ugly test:
class TestConfiguration(unittest.TestCase):
    @mock.patch('config.os.path.abspath')
    def test_config_init_sets_token(self, mockFilePath: mock.MagicMock):
        with open('mock_file.ini', 'w') as file:  # here's where it gets ugly
            file.write('[Tokens]\nToken: token')
        mockFilePath.return_value = 'mock_file.ini'
        config = Config()
        self.assertEqual(config.token, 'token')
        os.remove('mock_file.ini')  # quite ugly
EDIT: What I mean is I'm creating a file instead of mocking one.
Does anyone know how to mock a file object while setting its data, so that it reads back as ASCII text? The class is deeply buried.
Other than that, the way ConfigParser sets data with .read() is throwing me off. Granted, the test "works", but it doesn't do it nicely.
For those asking about other testing behaviors, here's an example of another test in this class:
@mock.patch('config.os.path.abspath')
def test_warning_when_file_not_found(self, mockFilePath: mock.MagicMock):
    mockFilePath.return_value = 'mock_no_file.ini'
    with self.assertRaises(FileNotFoundError):
        config.Config._read_config_file(ConfigParser())
Thank you for your time.
I've found it!
I had to start off with a few imports:

from io import TextIOWrapper, BytesIO

This allows a file object to be created:

TextIOWrapper(BytesIO(b'<StringContentHere>'))
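For example (a quick sketch outside the test; the bytes are just the INI snippet used below), the wrapped buffer reads back like a normal text file:

from io import TextIOWrapper, BytesIO

fake_file = TextIOWrapper(BytesIO(b'[Tokens]\nToken: token'))
print(fake_file.read())  # prints the two-line INI snippet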
The next part involved digging into the configparser module to see that it calls open(), so that I could mock.patch that behavior, and here we have it: an isolated unit test!
@mock.patch('configparser.open')
def test_bot_init_sets_token(self, mockFileOpen: mock.MagicMock):
    mockFileOpen.return_value = TextIOWrapper(BytesIO(b'[Tokens]\nToken: token'))
    config = Config()
    self.assertEqual(config.token, 'token')
I am having an issue writing to a NamedTemporaryFile in Python and then reading it back. The function downloads a file via tftpy to the temp file, reads it, hashes the contents, and then compares the hash digest to the original file. The function in question is below:
def verify_upload(self, image, destination):
    # create a tftp client
    client = TftpClient(ip, 69, localip=self.binding_ip)
    # generate a temp file to hold the download info
    if not os.path.exists("temp"):
        os.makedirs("temp")
    with NamedTemporaryFile(dir="temp") as tempfile, open(image, 'r') as original:
        try:
            # attempt to download the target image
            client.download(destination, tempfile, timeout=self.download_timeout)
        except TftpTimeout:
            raise RuntimeError("Could not download {0} from {1} for verification".format(destination, self.target_ip))
        # hash the original file and the downloaded version
        original_digest = hashlib.sha256(original.read()).hexdigest()
        uploaded_digest = hashlib.sha256(tempfile.read()).hexdigest()
        if self.verbose:
            print "Original SHA-256: {0}\nUploaded SHA-256: {1}".format(original_digest, uploaded_digest)
        # return the hash comparison
        return original_digest == uploaded_digest
The problem is that every time I try to execute the line uploaded_digest = hashlib.sha256(tempfile.read()).hexdigest(), the application errors out with ValueError: I/O operation on closed file. Since the with block has not finished, I am struggling to understand why the temp file would be closed. The only possibility I can think of is that tftpy is closing the file after doing the download, but I cannot find any point in the tftpy source where this would happen. Note that I have also tried inserting the line tempfile.seek(0) to put the file back in a proper state for reading, but this also gives me the ValueError.
Is tftpy closing the file? I read that there may be a bug in NamedTemporaryFile causing this problem. Why is the file closed before the reference defined by the with block goes out of scope?
TFTPy is closing the file. When you were looking at the source, you missed the following code path:
class TftpClient(TftpSession):
    ...
    def download(self, filename, output, packethook=None, timeout=SOCK_TIMEOUT):
        ...
        self.context = TftpContextClientDownload(self.host,
                                                 self.iport,
                                                 filename,
                                                 output,
                                                 self.options,
                                                 packethook,
                                                 timeout,
                                                 localip = self.localip)
        self.context.start()
        # Download happens here
        self.context.end()  # <--
TftpClient.download calls TftpContextClientDownload.end:
class TftpContextClientDownload(TftpContext):
    ...
    def end(self):
        """Finish up the context."""
        TftpContext.end(self)  # <--
        self.metrics.end_time = time.time()
        log.debug("Set metrics.end_time to %s", self.metrics.end_time)
        self.metrics.compute()
TftpContextClientDownload.end calls TftpContext.end:
class TftpContext(object):
    ...
    def end(self):
        """Perform session cleanup, since the end method should always be
        called explicitely by the calling code, this works better than the
        destructor."""
        log.debug("in TftpContext.end")
        self.sock.close()
        if self.fileobj is not None and not self.fileobj.closed:
            log.debug("self.fileobj is open - closing")
            self.fileobj.close()  # <--
and TftpContext.end closes the file.
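If you need a workaround (not part of the original answer; a sketch under the assumption that reopening the file by name is acceptable), you can create the temp file with delete=False so that the close performed by tftpy does not remove it, then reopen it for hashing and clean it up yourself:

import hashlib
import os
from tempfile import NamedTemporaryFile


def verify_download_workaround(client, image, destination, timeout=30):
    """Sketch: let tftpy close the temp file, then reopen it by name to hash it."""
    tmp = NamedTemporaryFile(dir="temp", delete=False)  # survives the close tftpy performs
    try:
        client.download(destination, tmp, timeout=timeout)
        # tftpy has closed (but not deleted) tmp at this point; reopen it by name.
        with open(tmp.name, 'rb') as downloaded, open(image, 'rb') as original:
            uploaded_digest = hashlib.sha256(downloaded.read()).hexdigest()
            original_digest = hashlib.sha256(original.read()).hexdigest()
        return uploaded_digest == original_digest
    finally:
        os.remove(tmp.name)  # clean up the temp file ourselves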