I have created a module that does some heavy computations and returns some data to be stored in a NoSQL database. The computation process is started via a POST request in my Flask application. The Flask function executes the computation code, and then the returned results are stored in the database. I was thinking of Celery, but I am wondering (and haven't found any clear information on this) whether it would be possible to use Python threading, e.g.:
from mysci_module import heavy_compute
import thread

@app.route('/initiate_task/', methods=['POST'])
def run_computation():
    thread.start_new_thread(heavy_compute, (post_data,))
    return response
It's very abstract, I know. The only problem I see with this method is that my function will have to know about, and be responsible for, storing data in the database, so it is not very independent of the database used. Correct? Why is Celery better (is it really?) than the method above?
Since the GIL prevents CPython threads from running Python code truly in parallel, all computations will in fact happen serially. Instead you could use the Python multiprocessing module and create a pool of processes to complete your heavy computation task.
There are a few micro-frameworks, such as Twisted's Klein, apart from Celery that can also help you achieve the concurrency and independence you're looking for. They aren't necessarily better, but they are there for those who don't want to get their hands messy with the various issues that tend to come up when synchronizing Flask with the actual business logic, especially when the response depends on that activity.
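To make the multiprocessing suggestion concrete, here is a minimal sketch (under assumptions, not a definitive implementation) of handing the work to a process pool from a Flask route; heavy_compute is the question's function, while the app object and the store_in_db placeholder are made up for illustration:

from multiprocessing import Pool

from flask import Flask, jsonify, request
from mysci_module import heavy_compute

app = Flask(__name__)
pool = Pool(processes=4)  # roughly one worker per core; create once per server process


def store_in_db(result):
    pass  # placeholder for the NoSQL write described in the question


def compute_and_store(post_data):
    # runs in a worker process, so the GIL of the web process is not a bottleneck
    result = heavy_compute(post_data)
    store_in_db(result)


@app.route('/initiate_task/', methods=['POST'])
def run_computation():
    post_data = request.get_json()
    pool.apply_async(compute_and_store, (post_data,))
    return jsonify(status='accepted'), 202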
I would suggest the following method: start a thread for the long-running procedure first, then let Flask communicate with that procedure from time to time, as your requirements dictate:
from mysci_module import heavy_compute
import thread

thread.start_new_thread(heavy_compute, (post_data,))

@app.route('/initiate_task/', methods=['POST'])
def check_computation():
    response = heavy_compute.status
    return response
The best part of this method is that you keep a callable thread in the background the whole time, while still being able to get the necessary result and even pass some parameters to the task.
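A slightly more concrete sketch of that idea (the tasks dict, task ids and the second endpoint are illustrative additions, not part of the original answer): start the thread from the POST endpoint, record its status in a shared dict, and poll it from another endpoint:

import threading
import uuid

from flask import Flask, jsonify, request
from mysci_module import heavy_compute

app = Flask(__name__)
tasks = {}  # task_id -> status / result


def run_task(task_id, post_data):
    tasks[task_id] = {'status': 'running'}
    result = heavy_compute(post_data)
    tasks[task_id] = {'status': 'done', 'result': result}


@app.route('/initiate_task/', methods=['POST'])
def initiate_task():
    task_id = str(uuid.uuid4())
    post_data = request.get_json()
    threading.Thread(target=run_task, args=(task_id, post_data), daemon=True).start()
    return jsonify(task_id=task_id), 202


@app.route('/check_task/<task_id>')
def check_computation(task_id):
    return jsonify(tasks.get(task_id, {'status': 'unknown'}))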
Related
I'm working with a Django application hosted on Heroku with the Redis To Go add-on (nano plan). I'm using RQ to execute tasks in the background; the tasks are initiated by online users. I have a constraint on increasing the number of connections, limited resources I'm afraid.
I currently have a single worker running over 'n' queues. Each queue uses an instance of a connection from the connection pool to handle 'n' different types of task. For instance, let's say 4 users initiate the same type of task; I would like my main worker to create child processes dynamically to handle them. Is there a way to achieve the required multiprocessing and concurrency?
I tried the multiprocessing module, initially without introducing Lock(); but that exposes the data passed by the user to the initiating function and overwrites it with the previous request's data. After applying locks, it prevents the second user from initiating a request, returning a 500 server error.
GitHub link #1: it looks like the team is working on the PR; it hasn't been released yet, though!
GitHub link #2: this post helps explain creating more workers at runtime.
However, that solution also overrides the data: the new request is again processed with the previous request's data.
Let me know if you need to see some code. I'll try to post a minimal reproducible snippet.
Any thoughts/suggestions/guidelines?
Did you get a chance to try AutoWorker?
Spawn RQ Workers automatically.
from autoworker import AutoWorker
aw = AutoWorker(queue='high', max_procs=6)
aw.work()
It makes use of multiprocessing with StrictRedis from the redis module, along with the following imports from rq:
from rq.contrib.legacy import cleanup_ghosts
from rq.queue import Queue
from rq.worker import Worker, WorkerStatus
After looking under the hood, I realised that the Worker class already implements multiprocessing.
The work function internally calls execute_job(job, queue) which, as quoted in the module,
Spawns a work horse to perform the actual work and passes it a job.
The worker will wait for the work horse and make sure it executes within the given timeout bounds,
or will end the work horse with SIGALRM.
The execute_job() function implicitly calls fork_work_horse(job, queue), which spawns a work horse to perform the actual work and passes it a job, as per the following logic:
def fork_work_horse(self, job, queue):
    child_pid = os.fork()
    os.environ['RQ_WORKER_ID'] = self.name
    os.environ['RQ_JOB_ID'] = job.id
    if child_pid == 0:
        self.main_work_horse(job, queue)
    else:
        self._horse_pid = child_pid
        self.procline('Forked {0} at {1}'.format(child_pid, time.time()))
main_work_horse makes an internal call to perform_job(job, queue), which makes a few other calls to actually perform the job.
All the steps of the worker lifecycle mentioned on rq's official documentation page are taken care of within these calls.
It's not the multiprocessing I was expecting, but I guess they have their own way of doing things. However, my original post is still not answered by this, and I'm still not sure about the concurrency.
The documentation there still needs to be worked on, since it hardly covers the true essence of this library!
Edit to clarify my question:
I want to attach a Python service to uWSGI using this feature (I can't make sense of the examples) and I also want to be able to communicate results between the two. Below I present some context as well as my first thought on the communication matter, hoping for some advice or another approach to take.
I have an already developed python application that uses multiprocessing.Pool to run on demand tasks. The main reason for using the pool of workers is that I need to share several objects between them.
On top of that, I want to have a flask application that triggers tasks from its endpoints.
I've read several questions here on SO looking for possible drawbacks of using flask with python's multiprocessing module. I'm still a bit confused but this answer summarizes well both the downsides of starting a multiprocessing.Pool directly from flask and what my options are.
This answer shows an uWSGI feature to manage daemon/services. I want to follow this approach so I can use my already developed python application as a service of the flask app.
One of my main problems is that I look at the examples and do not know what I need to do next. In other words, how would I start the python app from there?
Another problem is about the communication between the flask app and the daemon process/service. My first thought is to use flask-socketIO to communicate, but then, if my server stops I need to deal with the connection... Is this a good way to communicate between server and service? What are other possible solutions?
Note:
I'm well aware of Celery, and I intend to use it in the near future. In fact, I have an already developed Node.js app on which users perform actions that should trigger specific tasks from the (also already developed) Python application. The thing is, I need a production-ready version as soon as possible, and instead of modifying the Python application, which uses multiprocessing, I thought it would be faster to create a simple Flask server to communicate with Node.js through HTTP. This way I would only need to implement a Flask app that instantiates the Python app.
Edit:
Why do I need to share objects?
Simply because the creation of the objects in question takes too long. Actually, the creation takes an acceptable amount of time if done once, but since I'm expecting (maybe) hundreds to thousands of simultaneous requests, having to load every object again is something I want to avoid.
One of the objects is a scikit-learn classifier model, persisted in a pickle file, which takes 3 seconds to load. Each user can create several "job spots", each of which will take over 2k documents to be classified; each document is uploaded at an unknown point in time, so I need to have this model loaded in memory (loading it again for every task is not acceptable).
This is one example of a single task.
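For reference, a common way to avoid reloading the pickled model on every task with multiprocessing.Pool is to load it once per worker through an initializer (a copy per worker rather than a truly shared object); this is only a sketch, and the file name, classify function and sample inputs are illustrative:

import pickle
from multiprocessing import Pool

_model = None  # set in each worker by the initializer


def _load_model(model_path):
    # pays the ~3 s load once per worker process, not once per task
    global _model
    with open(model_path, 'rb') as f:
        _model = pickle.load(f)


def classify(features):
    return _model.predict([features])


if __name__ == '__main__':
    pool = Pool(processes=4, initializer=_load_model, initargs=('classifier.pkl',))
    results = pool.map(classify, [[0.1, 0.2], [0.3, 0.4]])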
Edit 2:
I've asked some questions related to this project before:
Bidirectional python-node communication
Python multiprocessing within node.js - Prints on sub process not working
Adding a shared object to a manager.Namespace
As stated, but to clarify: I think the best solution would be to use Celery, but in order to quickly have a production-ready solution, I'm trying to use this uWSGI attach-daemon approach.
I can see the temptation to hang on to multiprocessing.Pool. I'm using it in production as part of a pipeline. But Celery (which I'm also using in production) is much better suited to what you're trying to do, which is to distribute work across cores to a resource that's expensive to set up. Have N cores? Start N Celery workers, each of which can load (or maybe lazy-load) the expensive model as a global. When a request comes in to the app, launch a task (e.g. task = predict.delay(args)), wait for it to complete (e.g. result = task.get()) and return a response. You're trading a little bit of time learning Celery for not having to write a bunch of coordination code.
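A hedged sketch of that pattern (the broker/backend URLs, the pickle file name and the predict task are assumptions, not the answerer's code):

# tasks.py
import pickle

from celery import Celery

celery_app = Celery('tasks',
                    broker='redis://localhost:6379/0',
                    backend='redis://localhost:6379/0')

_model = None  # lazy-loaded once per worker process


def _get_model():
    global _model
    if _model is None:
        with open('classifier.pkl', 'rb') as f:
            _model = pickle.load(f)
    return _model


@celery_app.task
def predict(features):
    return _get_model().predict([features]).tolist()

The Flask view then calls task = predict.delay(features) and result = task.get(timeout=30), exactly as described above.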
I have a Django application that uses large in-memory data structures (due to performance constraints). This wouldn't be a problem, but I'm using Heroku, where if the Python web process takes more than 30 s to start, it is stopped because it's considered a timeout error. Because of the aforementioned problem, I've used a daemon process (a worker on Heroku) to handle the construction of the data structures and Redis to handle the message passing between processes.
When the worker finishes (approx. 1 minute), it stores the data structures (50 MB or so) in Redis.
And now comes the crux of the matter... Django follows the request/response paradigm and is synchronous. This implies a Django view should exist to handle the callback from the worker announcing it's done. Even if I use something fancier like Redis pub/sub, I'm still forced to evaluate the queue populated by a publisher inside a view.
How can I circumvent the necessity of using a Django view? Isn't there an async way of doing this?
Below is the solution where I use a pub/sub inside a view. This seems bad, but I can't think of another way.
views.py
...
# data_handler can enqueue tasks on the default queue
data_handler = DataHandler()
strict_redis = redis.from_url(settings.DEFAULT_QUEUE)
pub_sub = strict_redis.pubsub()

# this puts the job of constructing the large data structures
# on the default queue so a worker can pick it up. Being async,
# it returns with an empty set of data structures.
data_structures = data_handler.start()

pub_sub.subscribe(settings.FINISHED_DATA_STRUCTURES_CHANNEL)

@require_http_methods(['POST'])
def store_and_fetch(request):
    user_data = json.loads(request.body.decode('utf8'))
    message = pub_sub.get_message()
    if message:
        command = message['data'] if 'data' in message else ''
        if command == settings.FINISHED_DATA_STRUCTURES_INIT.encode('utf-8'):
            # this takes the data from redis and updates data_structures
            data_handler.update(data_structures)
    return HttpResponse(compute_response(user_data, data_structures))
Update: after working with this for multiple months, I can now say it's definitely better (and wiser) NOT to fiddle with Django's request/response cycle. There are things like Django RQ Scheduler, or Celery, that can do async tasks just fine. If you want to update the main web process after some repeatable job completes, it's simpler to use something like the Python requests package, sending a POST to the web process from the worker that did the scheduled job. In this way we don't circumvent Django's mechanisms, and more importantly, it's simpler to do overall.
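As a rough illustration of that notification step (the endpoint URL and payload are made up, not from the original project):

import requests


def notify_web_process(job_id):
    # called at the end of the scheduled job, from the worker
    requests.post('https://myapp.example.com/internal/job-finished/',
                  json={'job_id': job_id},
                  timeout=5)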
Regarding the Heroku constraints I mentioned at the beginning of the post: at the moment I wrote this question I was quite a newbie with Heroku and didn't know much about the release phase. In the release phase we can set up all the complex logic we need for the main process. Thus, at the end of the release phase, we simply need to notify the web process, in the manner I've described above, and use some distributed memory buffer (even Redis will work just fine).
I'm building a website which provides some information to visitors. This information is aggregated in the background by polling a couple of external APIs every 5 seconds. The way I have it working now is with APScheduler jobs. I initially preferred APScheduler because it makes the whole system easier to port (since I don't need to set up cron jobs on the new machine). I start the polling functions as follows:
from apscheduler.scheduler import Scheduler

@app.before_first_request
def initialize():
    apsched = Scheduler()
    apsched.start()
    apsched.add_interval_job(checkFirstAPI, seconds=5)
    apsched.add_interval_job(checkSecondAPI, seconds=5)
    apsched.add_interval_job(checkThirdAPI, seconds=5)
This kinda works, but there's some trouble with it:
For starters, this means that the interval jobs run outside the Flask application context. So far this hasn't been much of a problem, but when calling an endpoint fails I want the system to send me an email (saying "hey, calling API X failed"). Because it doesn't run within the Flask context, however, it complains that Flask-Mail cannot be executed (RuntimeError('working outside of application context')).
Secondly, I wonder how this is going to behave when I no longer use the built-in Flask debug server, but a production server with, let's say, 4 workers. Will it start every job four times then?
All in all I feel that there should be a better way of running these recurring tasks, but I'm unsure how. Does anybody out there have an interesting solution to this problem? All tips are welcome!
[EDIT]
I've just been reading about Celery and its schedules. Although I don't really see how Celery differs from APScheduler, and whether it could thus solve my two points, I wonder if anyone reading this thinks I should investigate Celery further?
[CONCLUSION]
About two years later I'm reading this, and I thought I could let you know what I ended up with. I figured that @BluePeppers was right in saying that I shouldn't be tied so closely to the Flask ecosystem. So I opted for regular cron jobs running every minute, which are set up using Ansible. Although this makes things a bit more complex (I had to learn Ansible and convert some code so that running it every minute is enough), I think this is more robust.
I'm currently using the awesome python-rq for queueing async jobs (checking APIs and sending emails). I just found out about rq-scheduler. I haven't tested it yet, but it seems to do precisely what I needed in the first place. So maybe this is a tip for future readers of this question.
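For those future readers, scheduling a recurring job with rq-scheduler looks roughly like this (an untested sketch; the polling function and interval are placeholders standing in for checkFirstAPI from the question):

from datetime import datetime

from redis import Redis
from rq_scheduler import Scheduler


def check_first_api():
    pass  # poll the first external API, send mail on failure


scheduler = Scheduler(connection=Redis())
scheduler.schedule(
    scheduled_time=datetime.utcnow(),  # time of the first run
    func=check_first_api,
    interval=60,   # seconds between runs
    repeat=None,   # repeat indefinitely
)

A separate rqscheduler process then enqueues the job on schedule, and ordinary rq workers execute it.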
For the rest, I just wish all of you a beautiful day!
(1)
You can use the app.app_context() context manager to set the application context. I imagine usage would go something like this:
from apscheduler.scheduler import Scheduler

def checkSecondApi():
    with app.app_context():
        # Do whatever you were doing to check the second API
        pass

@app.before_first_request
def initialize():
    apsched = Scheduler()
    apsched.start()
    apsched.add_interval_job(checkFirstAPI, seconds=5)
    apsched.add_interval_job(checkSecondAPI, seconds=5)
    apsched.add_interval_job(checkThirdAPI, seconds=5)
Alternatively, you could use a decorator
import functools

def with_application_context(app):
    def inner(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with app.app_context():
                return func(*args, **kwargs)
        return wrapper
    return inner

@with_application_context(app)
def checkFirstAPI():
    # Check the first API as before
    pass
(2)
Yes, it will still work. The sole (significant) difference is that your application will not be communicating directly with the world; it will be going through a reverse proxy or something via FastCGI/uWSGI/whatever. The only concern is that if you have multiple instances of the app starting, then multiple schedulers will be created. To manage this, I would suggest you move your backend tasks out of the Flask application and use a tool designed for running tasks regularly (i.e. Celery). The downside to this is that you won't be able to use things like Flask-Mail, but IMO it's not too good to be so closely tied to the Flask ecosystem anyway; what are you gaining by using Flask-Mail over a standard, non-Flask mail library?
Also, breaking up your application makes it much easier to scale individual components as capacity is required, compared to having one monolithic web application.
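If you do go the Celery route, the 5-second polling jobs could become periodic tasks run by a worker plus a beat process, roughly like this (a sketch only; the broker URL and task body are placeholders):

from celery import Celery

celery_app = Celery('polling', broker='redis://localhost:6379/0')


@celery_app.task(name='polling.check_first_api')
def check_first_api():
    pass  # poll the first external API and send mail on failure


celery_app.conf.beat_schedule = {
    'check-first-api-every-5s': {
        'task': 'polling.check_first_api',
        'schedule': 5.0,  # seconds
    },
}

Because the schedule lives in a single celery beat process outside the web app, running four web workers no longer means four schedulers.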
So I have the handler below:
class PublishHandler(BaseHandler):
    def post(self):
        message = self.get_argument("message")
        some_function(message)
        self.write("success")
The problem I'm facing is that some_function() takes some time to execute, and I would like the POST request to return straight away when called, with some_function() executed in another thread/process if possible.
I'm using Berkeley DB as the database, and what I'm trying to do is relatively simple.
I have a database of users each with a filter. If the filter matches the message, the server will send the message to the user. Currently I'm testing with thousands of users and hence upon each publication of a message via a post request it's iterating through thousands of users to find a match. This is my naive implementation of doing things and hence my question. How do I do this better?
You might be able to accomplish this by using your IOLoop's add_callback method like so:
loop.add_callback(lambda: some_function(message))
Tornado will execute the callback in the next IOLoop pass, which may (I'd have to dig into Tornado's guts to know for sure, or alternatively test it) allow the request to complete before that code gets executed.
The drawback is that the long-running code you've written will still take time to execute, and this may end up blocking another request. That's not ideal if you have a lot of these requests coming in at once.
The more foolproof solution is to run it in a separate thread or process. The best way in Python is to use a process, due to the GIL (I'd highly recommend reading up on that if you're not familiar with it). However, on a single-processor machine the threaded implementation will work just as well, and may be simpler to implement.
If you're going the threaded route, you can build a nice "async executor" module with a mutex, a thread, and a queue. Check out the multiprocessing module if you want to go the route of using a separate process.
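A bare-bones sketch of that "async executor" idea (names are illustrative, not a drop-in solution): a single worker thread drains a queue of jobs so the handler can return immediately:

import queue
import threading


class AsyncExecutor(object):
    def __init__(self):
        self._jobs = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, func, *args, **kwargs):
        self._jobs.put((func, args, kwargs))

    def _run(self):
        while True:
            func, args, kwargs = self._jobs.get()
            try:
                func(*args, **kwargs)
            finally:
                self._jobs.task_done()


executor = AsyncExecutor()

In the handler, executor.submit(some_function, message) replaces the direct call, and self.write("success") returns right away.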
I've tried this, and I believe the request does not complete before the callbacks are called.
I think a dirty hack would be to call two levels of add_callback, e.g.:
def get(self):
    ...
    def _defered():
        ioloop.add_callback(<whatever you want>)
    ioloop.add_callback(_defered)
    ...
But these are hacks at best. I'm looking for a better solution right now; I'll probably end up with some message queue or a simple thread-based solution.