How can I leverage luigi for Openstack tasks - python

I want to use Luigi to manage workflows in OpenStack. I am new to Luigi. For starters, I just want to authenticate myself to OpenStack and then fetch the image list, flavor list, etc. using Luigi. Any help will be appreciated.
I am not good with Python, but I tried the code below. I am also not able to list images. Error: glanceclient.exc.HTTPNotFound: The resource could not be found. (HTTP 404)
import luigi
import os_client_config
import glanceclient.v2.client as glclient
from luigi.mock import MockFile
import sys
import os

def get_credentials():
    d = {}
    d['username'] = 'X'
    d['password'] = 'X'
    d['auth_url'] = 'X'
    d['tenant_name'] = 'X'
    d['endpoint'] = 'X'
    return d

class LookupOpenstack(luigi.Task):
    d = []

    def requires(self):
        pass

    def output(self):
        gc = glclient.Client(**get_credentials())
        images = gc.images.list()
        print("images", images)
        for i in images:
            print(i)
        return MockFile("images", mirror_on_stderr=True)

    def run(self):
        pass

if __name__ == '__main__':
    luigi.run(["--local-scheduler"], LookupOpenstack())

The general approach to this is to just write Python code that performs the tasks you want using the OpenStack API: https://docs.openstack.org/user-guide/sdk.html. It looks like the error you are getting is addressed on the OpenStack site: https://ask.openstack.org/en/question/90071/glanceclientexchttpnotfound-the-resource-could-not-be-found-http-404/
You would then just wrap this code in Luigi tasks as appropriate; there's nothing special about doing this with OpenStack, except that you must define the output() of your Luigi tasks to point at something that indicates the task is done. Right now the work is being done in the output() method, but it belongs in the run() method; output() should only return the target Luigi checks to decide whether run() has already completed, so the task is not re-run when another task requires it.
It's really impossible to say more without understanding more details of your workflow.
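For example, here is a minimal sketch of that split, reusing the question's get_credentials() helper (the target name and the way the images are written out are placeholder assumptions, not tested against a real cloud):

import luigi
import glanceclient.v2.client as glclient

class ListImages(luigi.Task):
    """Fetch the image list once and record it in a local file."""

    def output(self):
        # Only describes where the result will live; no API calls here.
        return luigi.LocalTarget("images.txt")

    def run(self):
        # All of the real work happens in run().
        gc = glclient.Client(**get_credentials())  # get_credentials() as defined in the question
        with self.output().open("w") as f:
            for image in gc.images.list():
                f.write(str(image) + "\n")  # assumes images are simply printable, as in the question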

Related

ERROR:root:can't pickle fasttext_pybind.fasttext objects

I am using gunicorn with multiple workers for my machine learning project. The problem is that when I send a train request, only the worker that handles the training request gets updated with the latest model after training is done. It is worth mentioning that, to make inference faster, I load the model once after each training. This is why only the worker used for the current training operation loads the latest model, while the other workers still keep the previously loaded model. Right now the model file (binary format) is loaded once after each training into a global dictionary where the key is the model name and the value is the model object. Obviously, this problem would not occur if I loaded the model from disk for every prediction, but I cannot do that, as it would make prediction slower.
I read up on global variables, and further investigation shows that in a multi-processing environment all the workers (processes) create their own copies of global variables. Apart from the binary model file, I also have some other global variables (dictionaries) that need to be synced across all processes. So, how do I handle this situation?
TL;DR: I need some approach that lets me store a variable which is shared across all the processes (workers). Any way to do this? With multiprocessing.Manager, dill, etc.?
Update 1: I have multiple machine learning algorithms in my project, each with its own model file; they are loaded into memory in a dictionary where the key is the model name and the value is the corresponding model object. I need to share all of them (in other words, I need to share the dictionary). But some of the models, like FastText, are not pickle-serializable. So, when I try to use a proxy variable (in my case a dictionary to hold models) with multiprocessing.Manager, I get an error for those non-pickle-serializable objects while assigning the loaded model to the dictionary: can't pickle fasttext_pybind.fasttext objects. More information on multiprocessing.Manager can be found here: Proxy Objects
Following is the summary what I have done:
import multiprocessing
import fasttext
mgr = multiprocessing.Manager()
model_dict = mgr.dict()
model_file = fasttext.load_model("path/to/model/file/which/is/in/.bin/format")
model_dict["fasttext"] = model_file # This line throws this error
Error:
can't pickle fasttext_pybind.fasttext objects
I printed the model_file which I am trying to assign; it is:
<fasttext.FastText._FastText object at 0x7f86e2b682e8>
Update 2:
According to this answer I modified my code a little bit:
import fasttext
from multiprocessing.managers import SyncManager

def Manager():
    m = SyncManager()
    m.start()
    return m

# As the model file has a type of "<fasttext.FastText._FastText object at 0x7f86e2b682e8>",
# use "fasttext.FastText._FastText" as its class
SyncManager.register("fast", fasttext.FastText._FastText)

# Now this is the Manager as a replacement of the old one.
mgr = Manager()

ft = mgr.fast()  # This line gives error.
This gives me EOFError.
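(For reference, the usual SyncManager.register pattern registers a loader callable before the manager starts, so the object is built inside the manager process and only a proxy crosses the process boundary. A minimal sketch with a placeholder model path; whether a fasttext model behaves well behind a proxy is not something I have verified:)

from multiprocessing.managers import SyncManager
import fasttext

# Register a loader callable (not the raw class) before starting the manager;
# the model is then constructed inside the manager process and only a proxy
# object is returned, so the model itself never has to be pickled.
SyncManager.register("fasttext_model", callable=fasttext.load_model)

mgr = SyncManager()
mgr.start()

ft_proxy = mgr.fasttext_model("path/to/model/file.bin")  # placeholder path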
Update 3: I tried using dill with both multiprocessing and multiprocess. The summary of the changes is as follows:
import multiprocessing
import multiprocess
import dill
# Any one of the following two lines
mgr = multiprocessing.Manager() # Or,
mgr = multiprocess.Manager()
model_dict = mgr.dict()
... ... ...
... ... ...
model_file = dill.dumps(model_file) # This line throws the error
model_dict["fasttext"] = model_file
... ... ...
... ... ...
# During loading
model_file = dill.loads(model_dict["fasttext"])
But still getting the error: can't pickle fasttext_pybind.fasttext objects.
Update 4:
This time I am using another library called jsonpickle. Serialization and de-serialization seem to work properly (no issue is reported while running). But surprisingly enough, whenever I make a prediction after de-serialization, it crashes with a segmentation fault. More details and the steps to reproduce it can be found here: Segmentation fault (core dumped)
Update 5: Tried cloudpickle and srsly, but couldn't make the program work.
For the sake of completeness, I am providing the solution that worked for me. All the approaches I tried for serializing FastText were in vain. Finally, as @MedetTleukabiluly mentioned in the comments, I used redis pub/sub to tell the other workers to load the model from disk. Obviously, this does not actually share the model in the same memory space; it just sends a message to the other workers informing them that they should reload the model from disk (because a new training just happened). Following is the general solution:
# redis_pubsub.py
import logging
import os
import fasttext
import socket
import threading
import time

"""The whole purpose of GLOBAL_NAMESPACE is to keep this pubsub mechanism separate,
in case another service is also publishing on the same channel.
"""
GLOBAL_NAMESPACE = "SERVICE_0"

def get_ip():
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # doesn't even have to be reachable
        s.connect(('10.255.255.255', 1))
        IP = s.getsockname()[0]
    except Exception:
        IP = '127.0.0.1'
    finally:
        s.close()
    return IP

class RedisPubSub:
    def __init__(self):
        self.redis_client = get_redis_client()  # TODO: A SAMPLE METHOD WHICH CAN RETURN YOUR REDIS CLIENT (you have to implement)
        # The unique ID identifies which worker on which server is the publisher,
        # so a worker can ignore messages that it published itself.
        self.unique_id = "IP_" + get_ip() + "__" + str(GLOBAL_NAMESPACE) + "__" + "PID_" + str(os.getpid())

    def listen_to_channel_and_update_models(self, channel):
        try:
            pubsub = self.redis_client.pubsub()
            pubsub.subscribe(channel)
        except Exception as exception:
            logging.error(f"REDIS_ERROR: Model Update Listening: {exception}")

        while True:
            try:
                message = pubsub.get_message()
                # Successful operations give 1 and unsuccessful ones give 0;
                # we are not interested in these flags.
                if message and message["data"] != 1 and message["data"] != 0:
                    message = str(message["data"].decode("utf-8"))
                    splitted_msg = message.split("__SEPERATOR__")
                    # Make sure the message comes from another worker AND that the
                    # sender and receiver (i.e. both workers) are under the same namespace.
                    if (splitted_msg[0] != self.unique_id) and (splitted_msg[0].split('__')[1] == GLOBAL_NAMESPACE):
                        algo_name = splitted_msg[1]
                        model_path = splitted_msg[2]
                        # Fasttext
                        if "fasttext" in algo_name:
                            try:
                                # TODO: YOU WILL GET THE LOADED NEW FILE IN model_file. USE IT TO UPDATE THE OLD ONE.
                                model_file = fasttext.load_model(model_path + '.bin')
                            except Exception as exception:
                                logging.error(exception)
                            else:
                                logging.info(f"{algo_name} model is updated for process with unique_id: {self.unique_id} by process with unique_id: {splitted_msg[0]}")
                time.sleep(1)  # sleep for 1 second to avoid hammering the CPU too much
            except Exception as exception:
                time.sleep(1)
                logging.error(f"PUBSUB_ERROR: Model or component update: {exception}")

    def publish_to_channel(self, channel, algo_name, model_path):
        def _publish_to_channel():
            try:
                message = self.unique_id + '__SEPERATOR__' + str(algo_name) + '__SEPERATOR__' + str(model_path)
                time.sleep(3)
                self.redis_client.publish(channel, message)
            except Exception as exception:
                logging.error(f"PUBSUB_ERROR: Model or component publishing: {exception}")

        # The delay before publishing would block the caller's independent work,
        # so do the publishing in another thread.
        thread = threading.Thread(target=_publish_to_channel)
        thread.start()
Also, you have to start the listener:
import threading

from redis_pubsub import RedisPubSub

pubsub = RedisPubSub()

# start the listener:
thread = threading.Thread(target=pubsub.listen_to_channel_and_update_models, args=("sync-ml-models", ))
thread.start()
From the fasttext training module, when the training finishes, publish the message to the other workers so that they get a chance to re-load the model from disk:
# fasttext_api.py
from redis_pubsub import RedisPubSub

pubsub = RedisPubSub()
pubsub.publish_to_channel(channel="sync-ml-models",  # a sample name for the channel
                          algo_name="fasttext",
                          model_path="path/to/fasttext/model")

Error while calling the send_file function from another file in python flask | RuntimeError: Working outside of request context

Greetings stack overflow community, I am currently working on a flask app and I am trying to retrieve a file from a helper function with the send_file method in flask.
I have a route that goes like so:
@app.route("/process", methods=['GET', 'POST'])
def do_something():
    process = threading.Thread(target=function_name, args=[arg1, arg2])
    process.start()
    return render_template("template.html")
The function_name function (which is in a different file) is supposed to return a file like so:
def function_name():
    filename = 'ohhey.pdf'
    return send_file(filename, as_attachment=True, cache_timeout=0)
When I run my app like this I get the following error
RuntimeError: Working outside of application context.
This typically means that you attempted to use functionality that needed
to interface with the current application object in some way. To solve
this, set up an application context with app.app_context(). See the
documentation for more information.
So I try changing the function to the following:
def function_name():
    filename = 'ohhey.pdf'
    with app.app_context():
        return send_file(filename, as_attachment=True, cache_timeout=0)
and get this new error
RuntimeError: Working outside of request context.
This typically means that you attempted to use functionality that needed
an active HTTP request. Consult the documentation on testing for
information about how to avoid this problem.
So I try the following:
def function_name():
    filename = 'ohhey.pdf'
    with app.test_request_context():
        return send_file(filename, as_attachment=True, cache_timeout=0)
After making this final change, my app returns neither a file nor an error. I appreciate your help.
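For context (not from the original thread): send_file builds a Flask response, so it only has an effect as the return value of a request handler; whatever a plain background thread returns is simply discarded, and no application or request context exists there. A minimal sketch of one common pattern, where the thread only produces the file and a separate route serves it (the /download route and the generate_pdf helper are hypothetical):

import threading
from flask import Flask, render_template, send_file

app = Flask(__name__)

def generate_pdf(arg1, arg2):
    # Hypothetical helper: do the slow work and write ohhey.pdf to disk.
    # Its return value is discarded because it runs in a plain thread.
    ...

@app.route("/process", methods=['GET', 'POST'])
def do_something():
    # Kick off the slow work in the background and respond immediately.
    threading.Thread(target=generate_pdf, args=["arg1", "arg2"]).start()
    return render_template("template.html")

@app.route("/download")
def download():
    # Called inside a request handler, so send_file has the context it needs.
    return send_file('ohhey.pdf', as_attachment=True)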

How to get the source code of a file where an object is created?

In my main.py I have the following Python code:
from tasks import SimpleTask
from reflectors import Reflector
task = SimpleTask()
source_code = Reflector().get_source_code(task)
I want to get the source code of main.py from inside the Reflector.get_source_code() using only the parameter task passed.
If I do the following in Reflector.get_source_code() using Python's inspect module:
def get_source_code(self, task):
    return str(inspect.getsource(inspect.getmodule(task)))
I get the source code of the tasks module where the SimpleTask class is defined.
However, I want to get the source code of main.py (the file where the task object is created).
I have managed to do the following:
source_code = Reflector().get_source_code(lambda: task)

def run(self, task_callback):
    source_code = str(inspect.getsource(inspect.getmodule(task_callback)))
However, I don't like the lambda: task syntax.
Is there any way to achieve this without the lambda hack?
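One alternative to the lambda hack (a sketch of the standard inspect approach, assuming get_source_code() is always called directly from the module whose source you want) is to look at the caller's frame instead of the task object:

import inspect

class Reflector:
    def get_source_code(self, task):
        # Walk one frame up from this call to reach the caller (e.g. main.py)
        # and return that module's source, ignoring where `task` was defined.
        caller_frame = inspect.stack()[1].frame
        caller_module = inspect.getmodule(caller_frame)
        return str(inspect.getsource(caller_module))

Note that this only works when the call site sits directly in main.py; any extra wrapper layer shifts the frame index.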

Use .replace method with Celery sub-tasks

I'm trying to solve a problem in celery:
I have one task that queries an API for ids, and then starts a sub-task for each of these.
I do not know, ahead of time, what the ids are, or how many there are.
For each id, I go through a big calculation that then dumps some data into a database.
After all the sub-tasks are complete, I want to run a summary function (export DB results to an Excel format).
Ideally, I do not want to block my main worker querying the status of the sub-tasks (Celery gets angry if you try this.)
This question looks very similar (if not identical?): Celery: Callback after task hierarchy
So using the "solution" (which is a link to this discussion, I tried the following test script:
# test.py
from celery import Celery, chord
from celery.utils.log import get_task_logger

app = Celery('test', backend='redis://localhost:45000/10?new_join=1', broker='redis://localhost:45000/11')
app.conf.CELERY_ALWAYS_EAGER = False
logger = get_task_logger(__name__)

@app.task(bind=True)
def get_one(self):
    print('hello world')
    self.replace(get_two.s())
    return 1

@app.task
def get_two():
    print('Returning two')
    return 2

@app.task
def sum_all(data):
    print('Logging data')
    logger.error(data)
    return sum(data)

if __name__ == '__main__':
    print('Running test')
    x = chord(get_one.s() for i in range(3))
    body = sum_all.s()
    result = x(body)
    print(result.get())
    print('Finished w/ test')
It doesn't work for me. I get an error:
AttributeError: 'get_one' object has no attribute 'replace'
Note that I do have new_join=1 in my backend URL, though not the broker. If I put it there, I get an error:
TypeError: _init_params() got an unexpected keyword argument 'new_join'
What am I doing wrong? I'm using Python 3.4.3 and the following packages:
amqp==1.4.6
anyjson==0.3.3
billiard==3.3.0.20
celery==3.1.18
kombu==3.0.26
pytz==2015.4
redis==2.10.3
The Task.replace method will only be added in Celery 3.2: http://celery.readthedocs.org/en/master/whatsnew-3.2.html#task-replace (that changelog entry is misleading, because it suggests that Task.replace existed before and has merely been changed).
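Until you can upgrade to a release that ships Task.replace, one workaround (a sketch, not part of the original answer) is to skip the replacement and build the chord header directly from the task you actually want to run, reusing get_two and sum_all from the test script above:

# workaround sketch for Celery 3.1, where Task.replace is not available
from celery import chord

if __name__ == '__main__':
    header = chord(get_two.s() for i in range(3))
    result = header(sum_all.s())
    print(result.get())  # -> 6 once all three get_two tasks have finished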

Mocking ftplib.FTP for unit testing Python code

I don't know why I'm just not getting this, but I want to use mock in Python to test that my functions are calling functions in ftplib.FTP correctly. I've simplified everything down and still am not wrapping my head around how it works. Here is a simple example:
import unittest
import ftplib
from unittest.mock import patch

def download_file(hostname, file_path, file_name):
    ftp = ftplib.FTP(hostname)
    ftp.login()
    ftp.cwd(file_path)

class TestDownloader(unittest.TestCase):
    @patch('ftplib.FTP')
    def test_download_file(self, mock_ftp):
        download_file('ftp.server.local', 'pub/files', 'wanted_file.txt')
        mock_ftp.cwd.assert_called_with('pub/files')
When I run this, I get:
AssertionError: Expected call: cwd('pub/files')
Not called
I know it must be using the mock object since that is a fake server name, and when run without patching, it throws a "socket.gaierror" exception.
How do I get the actual object the function is running? The long-term goal is not having the "download_file" function in the same file, but calling it from a separate module file.
When you do patch('ftplib.FTP') you are patching the FTP constructor. download_file() uses it to build the ftp object, so the ftp object on which you call login() and cwd() will be mock_ftp.return_value instead of mock_ftp.
Your test code should be as follows:
class TestDownloader(unittest.TestCase):
    @patch('ftplib.FTP', autospec=True)
    def test_download_file(self, mock_ftp_constructor):
        mock_ftp = mock_ftp_constructor.return_value
        download_file('ftp.server.local', 'pub/files', 'wanted_file.txt')
        mock_ftp_constructor.assert_called_with('ftp.server.local')
        self.assertTrue(mock_ftp.login.called)
        mock_ftp.cwd.assert_called_with('pub/files')
I added all the checks and autospec=True just because it is good practice.
Like Ibrohim's answer, I prefer pytest with mocker.
I have gone a bit further and have actually written a library which helps me to mock easily. Here is how to use it for your case.
You start with your code and a basic pytest function, plus my helper library to generate mocks for modules and the matching assert statements:
import ftplib
from mock_autogen.pytest_mocker import PytestMocker

def download_file(hostname, file_path, file_name):
    ftp = ftplib.FTP(hostname)
    ftp.login()
    ftp.cwd(file_path)

def test_download_file(mocker):
    import sys
    print(PytestMocker(mocked=sys.modules[__name__],
                       name=__name__).mock_modules().prepare_asserts_calls().generate())
    download_file('ftp.server.local', 'pub/files', 'wanted_file.txt')
When you run the test for the first time, it would fail due to unknown DNS, but the print statement which wraps my library would give us this valuable input:
...
mock_ftplib = mocker.MagicMock(name='ftplib')
mocker.patch('test_29817963.ftplib', new=mock_ftplib)
...
import mock_autogen
...
print(mock_autogen.generator.generate_asserts(mock_ftplib, name='mock_ftplib'))
I place this in the test and run it again:
def test_download_file(mocker):
    mock_ftplib = mocker.MagicMock(name='ftplib')
    mocker.patch('test_29817963.ftplib', new=mock_ftplib)
    download_file('ftp.server.local', 'pub/files', 'wanted_file.txt')
    import mock_autogen
    print(mock_autogen.generator.generate_asserts(mock_ftplib, name='mock_ftplib'))
This time the test succeeds and I only need to collect the result of the second print to get the proper asserts:
def test_download_file(mocker):
    mock_ftplib = mocker.MagicMock(name='ftplib')
    mocker.patch(__name__ + '.ftplib', new=mock_ftplib)
    download_file('ftp.server.local', 'pub/files', 'wanted_file.txt')
    mock_ftplib.FTP.assert_called_once_with('ftp.server.local')
    mock_ftplib.FTP.return_value.login.assert_called_once_with()
    mock_ftplib.FTP.return_value.cwd.assert_called_once_with('pub/files')
If you would like to keep using unittest while using my library, I'm accepting pull requests.
I suggest using pytest and pytest-mock.
from pytest_mock import mocker

def test_download_file(mocker):
    ftp_constructor_mock = mocker.patch('ftplib.FTP')
    ftp_mock = ftp_constructor_mock.return_value
    download_file('ftp.server.local', 'pub/files', 'wanted_file.txt')
    ftp_constructor_mock.assert_called_with('ftp.server.local')
    assert ftp_mock.login.called
    ftp_mock.cwd.assert_called_with('pub/files')
