Using Django as GUI for long running python process - python

This is a question about architecture. Say I have a long-running process on a server, such as a machine-learning model in the middle of training. As this runs on an external machine, I would like a tool to quickly check the results from time to time. So I thought the best approach would be a website which quickly connects to the process, for example using RPC, to display the results, as this allows me to check in whenever I want. Now the question is how the Django view should gather the information from the server process:
1) Using RPC calls such as rpyc directly in the views (sketched below)?
2) Using some kind of message queue such as Celery?
3) Or in a completely different way I am not seeing?
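A minimal sketch of option 1, assuming the training process exposes an rpyc service; the host, port and get_metrics method are illustrative, not part of any real API:

import rpyc
from django.http import JsonResponse

def training_status(request):
    # connect to the rpyc service exposed by the training process
    conn = rpyc.connect("training-host.example.com", 18861)
    try:
        metrics = conn.root.get_metrics()  # hypothetical exposed method on the service
    finally:
        conn.close()
    return JsonResponse({"metrics": metrics})

The view stays simple, but every page load pays the cost (and risk) of a live RPC call to the training machine.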

There are at least two possible ways to do this.
Implement your data-refreshing function as a view and call it via AJAX with a JavaScript timer. While the page containing that JavaScript is open, it will fetch your data silently and update the page. However, this solution does not work well when you need to record all the data at a given frequency; the AJAX call and view only execute while the web page is open.
Use a message queue, as selcuk suggests. Alongside Celery, APScheduler is also a good choice because it is easier to install and use. You can implement a task queue (as a model) with a status field (queued/done/stopped/whatever), check the tasks at the frequency you want, save the data you retrieve and do all the other stuff; see the sketch below.
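A rough sketch of that second approach, assuming a plain Task model and APScheduler running inside the Django process; the model fields, app name and fetch_result helper are assumptions for illustration:

# models.py
from django.db import models

class Task(models.Model):
    STATUS_CHOICES = [("queued", "Queued"), ("done", "Done"), ("stopped", "Stopped")]
    status = models.CharField(max_length=16, choices=STATUS_CHOICES, default="queued")
    result = models.TextField(blank=True)
    created_at = models.DateTimeField(auto_now_add=True)

# scheduler.py
from apscheduler.schedulers.background import BackgroundScheduler
from myapp.models import Task  # hypothetical app name

def poll_training_process():
    # fetch the latest metrics (e.g. over RPC) and store them per queued task
    for task in Task.objects.filter(status="queued"):
        task.result = fetch_result(task)  # hypothetical helper that talks to the process
        task.status = "done"
        task.save()

scheduler = BackgroundScheduler()
scheduler.add_job(poll_training_process, "interval", minutes=5)
scheduler.start()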

Related

Best Way to Handle user triggered task (like import data) in Django

I need your opinion on a challenge that I'm facing. I'm building a website that uses Django as the backend, PostgreSQL as my DB, GraphQL as my API layer and React as my frontend framework. The website is hosted on Heroku. I wrote a Python script that logs me in to my Gmail account and parses a few emails, based on pre-defined conditions, and stores the parsed data in a Google Sheet. Now I want the script to be part of my website, in which the user will specify what exactly needs to be parsed (i.e. filters), and then display the parsed data in a table to review the accuracy of the parsing.
The part that I need some help with is how to architect such a workflow. Below are a few ideas that I managed to come up with after some googling:
Generate a GraphQL mutation that stores a 'task' into a task model. Once a new task entry is stored, a Django signal will trigger the script. I'm not sure yet whether a signal can run custom Python functions, but from what I have read so far it seems doable.
Use Celery to run this task asynchronously. But I'm not sure if asynchronous tasks are what I'm after here, as I need this task to run immediately after the user triggers the feature from the frontend. I might be wrong here, though. I'm also not sure whether I need Redis to store the task details or whether I can do that in PostgreSQL.
What is the best practice for implementing this feature? The task can be anything, not necessarily parsing emails; it could also be importing data from Excel. Any task that is user-triggered rather than scheduled or repeated.
I'm sorry in advance if this question seems trivial to some of you. I'm not a professional developer and the above project is a way for me to sharpen my technical skills and learn new techniques.
Looking forward to learning from your experiences.
You can dissect your problem into the following steps:
User specifies task parameters
System executes task
System displays result to the User
You can either do all of these:
Sequentially and synchronously in one swoop; or
Step by step asynchronously.
Synchronously
You can run your script when generating a response, but it will come with the following downsides:
The server process handling your request will block until the script is finished. This may or may not affect the processing of other requests by that same server (depending on the number of simultaneous requests, the workload of the script, etc.).
The client (e.g. your browser) and even the server might time out if the script takes too long. You can fix this to some extent by configuring your server appropriately.
The beauty of this approach, however, is its simplicity. You just pass the parameters through the request, the server parses them and runs the script, then returns the result, as in the sketch below.
No message queue, task scheduler, or anything else needs to be set up.
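A minimal sketch of the synchronous approach in a Django view; parse_emails is a hypothetical stand-in for your script:

from django.http import JsonResponse

def run_parser(request):
    # parameters come straight from the request
    label = request.GET.get("label", "INBOX")
    # this blocks the worker until the script finishes
    result = parse_emails(label=label)  # hypothetical long-running function
    return JsonResponse({"result": result})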
Asynchronously
Ideally though, for long-running tasks, it is best to have this executed outside of the usual request-response loop for the following advantages:
The server responding to the requests can actually serve other requests.
Some scripts can take a while, and with some you don't even know whether they will ever finish.
The script is no longer dependent on the reliability of the network (imagine running an expensive task and then your internet connection skips or is just plain intermittent; you wouldn't be able to do anything).
The downside of this is now you have to set more things up, which increases the project's complexity and points of failure.
Producer-Consumer
Whatever you choose, it's usually best to follow the producer-consumer pattern:
Producer creates tasks and puts them in a queue
Consumer takes a task from the queue and executes it
The producer is basically you, the user. You specify the task and the parameters involved in that task.
This queue could be any datastore: an in-memory datastore like Redis; a message queue like RabbitMQ; or a relational database management system like PostgreSQL.
The consumer is your script executing these tasks. There are multiple ways of running the consumer/script: via Celery, as you mentioned, which runs multiple workers to execute tasks passed through the queue; via a simple time-based job scheduler like crontab; or even by triggering the script manually.
The question is actually not trivial, as the solution depends on what task you are actually trying to do. It is best to evaluate the constraints, parameters, and actual tasks to decide which approach you will choose.
But just to give you a more relevant guideline:
Keep it simple; unless you have a compelling reason to do otherwise (e.g. the server is being bogged down, or the internet connection is not reliable in practice), there's really no reason to be fancy.
The more blocking the task is, the longer it takes, or the more dependent it is on third-party APIs over the network, the more it makes sense to push it to a background process for reliability and resiliency.
For your email import script, I would most likely push it to the background:
Have a page where you can add a task to the database
On the task details page, display the task details and, below them, the result if it exists or "Processing..." otherwise
Have a script that executes the tasks (import emails from Gmail given the task parameters) and saves the results to the database; see the sketch after this list
Schedule this script to run every few minutes via crontab
Yes, the above has side effects, like crontab running the script multiple times concurrently, but I won't go into detail without knowing more about the specifics of the task.
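A rough sketch of that consumer as a Django management command, which crontab could invoke every few minutes; the Task model, its fields and the import_emails helper are assumptions for illustration:

# management/commands/process_tasks.py
from django.core.management.base import BaseCommand
from myapp.models import Task  # hypothetical Task model with status/parameters/result fields

class Command(BaseCommand):
    help = "Process queued email-import tasks"

    def handle(self, *args, **options):
        for task in Task.objects.filter(status="queued"):
            task.status = "processing"
            task.save()
            task.result = import_emails(task.parameters)  # hypothetical parsing function
            task.status = "done"
            task.save()

A crontab entry along the lines of */5 * * * * cd /path/to/project && python manage.py process_tasks would then run it every five minutes.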

Using celery to send tasks from component A to component B

The technology I would like to use in this example is Celery for queueing and Python for the component implementation.
Imagine a simple project that consists of two components. One is a web app that connects to an API and gathers data. Component two is a processor that can then process the data. When the web app has fetched a piece of data from the API, it is supposed to send a task into a task queue, including the just-crawled data, which is then consumed by the processor to process the data.
Whether or not this is a sensible way to go about a task like this is debatable and not the point of my question.
My question is: the tasks that process things are defined within the processor, since they state which processing function shall be executed, and the definition of that function is obviously within the processor. Given that the web app doesn't have access to the task definition, how does it communicate the task to the processor?
Do you have to hold a copy of the source code of the processor within the web app?
Do you make the processor a dependency of the web app?
What is the best practice approach to handle such a scenario?
What you are describing is probably one of the most common use cases for Celery. Just look at how many people are asking Django/Flask + Celery questions here on Stack Overflow... If you are a Django user, there is an entire section in the Celery documentation describing how to do exactly what you want. Things should be similar with other frameworks.
Do you have to hold a copy of the source code of the processor within the web app?
As far as I know you do not have to (I do not use any web framework), but it could be that you need to because of some deeper integration with Celery. If your web application knows the Celery task name and its parameters, it can schedule it to run without actually having access to the Python code. This is accomplished using send_task(task_name, ...).
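For example, a sketch of the web app side using send_task; the broker URL, registered task name and payload are assumptions for illustration:

from celery import Celery

# the web app only needs the broker URL and the task's registered name,
# not the processor's source code
app = Celery(broker="redis://localhost:6379/0")

crawled_data = {"url": "https://example.com", "payload": "..."}
result = app.send_task("processor.tasks.process_data", args=[crawled_data])
print(result.id)  # AsyncResult; can be polled later if a result backend is configured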
Do you make the processor a dependency of the web app?
As I wrote above, there are several ways to use it. If you want tighter integration, then yes. If you just want to run tasks and get results using send_task(), then your web application only needs to depend on Celery.
What is the best practice approach to handle such a scenario?
Follow the Django guide. I advise you to run Celery independently and run some tasks first, just so you learn the basic principles of how it distributes the work, etc.

Daemon background tasks on flask (uwsgi) application

Edit to clarify my question:
I want to attach a Python service to uWSGI using this feature (I can't understand the examples), and I also want to be able to communicate results between them. Below I present some context and my first thought on the communication matter, expecting maybe some advice or another approach to take.
I have an already developed python application that uses multiprocessing.Pool to run on demand tasks. The main reason for using the pool of workers is that I need to share several objects between them.
On top of that, I want to have a flask application that triggers tasks from its endpoints.
I've read several questions here on SO looking for possible drawbacks of using flask with python's multiprocessing module. I'm still a bit confused but this answer summarizes well both the downsides of starting a multiprocessing.Pool directly from flask and what my options are.
This answer shows a uWSGI feature to manage daemons/services. I want to follow this approach so I can use my already developed Python application as a service of the Flask app.
One of my main problems is that I look at the examples and do not know what I need to do next. In other words, how would I start the Python app from there?
Another problem is the communication between the Flask app and the daemon process/service. My first thought is to use flask-socketIO to communicate, but then, if my server stops, I need to deal with the connection... Is this a good way to communicate between server and service? What are other possible solutions?
Note:
I'm well aware of Celery, and I intend to use it in the near future. In fact, I have an already developed Node.js app in which users perform actions that should trigger specific tasks in the (also) already developed Python application. The thing is, I need a production-ready version as soon as possible, and instead of modifying the Python application, which uses multiprocessing, I thought it would be faster to create a simple Flask server to communicate with Node.js through HTTP. This way I would only need to implement a Flask app that instantiates the Python app.
Edit:
Why do I need to share objects?
Simply because the creation of the objects in question takes too long. Actually, the creation takes an acceptable amount of time if done once, but since I'm expecting (maybe) hundreds to thousands of simultaneous requests, having to load every object again is something I want to avoid.
One of the objects is a scikit-learn classifier model, persisted in a pickle file, which takes 3 seconds to load. Each user can create several "job spots", each of which will take over 2k documents to be classified; each document will be uploaded at an unknown point in time, so I need to keep this model loaded in memory (loading it again for every task is not acceptable).
This is one example of a single task.
Edit 2:
I've asked some questions related to this project before:
Bidirectional python-node communication
Python multiprocessing within node.js - Prints on sub process not working
Adding a shared object to a manager.Namespace
As stated, but to clarify: I think the best solution would be to use Celery, but in order to have a production-ready solution quickly, I'm trying to use this uWSGI attach-daemon solution.
I can see the temptation to hang on to multiprocessing.Pool. I'm using it in production as part of a pipeline. But Celery (which I'm also using in production) is much better suited to what you're trying to do, which is distributing work across cores to a resource that's expensive to set up. Have N cores? Start N Celery workers, each of which can load (or maybe lazy-load) the expensive model as a global. When a request comes in to the app, launch a task (e.g. task = predict.delay(args)), wait for it to complete (e.g. result = task.get()) and return a response. You're trading a little bit of time spent learning Celery for not having to write a bunch of coordination code.
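A rough sketch of that layout; the broker/backend URLs, pickle path and predict logic are assumptions for illustration:

# tasks.py -- imported by every Celery worker
import pickle
from celery import Celery

app = Celery("predictor",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/1")

# loaded once per worker process, not once per request
with open("classifier.pkl", "rb") as f:  # hypothetical path to the pickled scikit model
    MODEL = pickle.load(f)

@app.task
def predict(features):
    # features is assumed to be something the model can consume directly
    return MODEL.predict([features]).tolist()

# in the Flask view:
#   task = predict.delay(features)
#   result = task.get(timeout=30)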

Can I have Python code to continue executing after I call Flask app.run?

I have just started with Python, although I have been programming in other languages over the past 30 years. I wanted to keep my first application simple, so I started out with a little home automation project hosted on a Raspberry Pi.
I got my code to work fine (controlling a valve, reading a flow sensor and showing some data on a display), but when I wanted to add some web interactivity it came to a sudden halt.
Most articles I have found on the subject suggest using the Flask framework to compose dynamic web pages. I have tried, and understood, the basics of Flask, but I just can't get around the issue that Flask blocks once I call the app.run function. The rest of my Python code waits for Flask to return, which never happens, i.e. no more water-flow measurement, valve-motor steering or display updating.
So, my basic question would be: What tool should I use in order to serve a simple dynamic web page (with very low load, like 1 request / week), in parallel to my applications main tasks (GPIO/Pulse counting)? All this in the resource constrained environment of a Raspberry Pi (3).
If you still suggest Flask (because it seems very close to target), how should I arrange my code to keep handling the real-world events, such as mentioned above?
(This last part might be tough answering without seeing the actual code, but maybe it's possible answering it in a "generic" way? Or pointing to existing examples that I might have missed while searching.)
You're on the right track with multithreading. If your monitoring code runs in a loop, you could define a function like
def monitoring_loop():
    while True:
        # do the monitoring
Then, before you call app.run(), start a thread that runs that function:
import threading
from wherever import monitoring_loop
monitoring_thread = threading.Thread(target=monitoring_loop)
monitoring_thread.start()
# app.run() and whatever else you want to do
Don't join the thread - you want it to keep running in parallel to your Flask app. If you joined it, it would block the main execution thread until it finished, which would be never, since it's running a while True loop.
To communicate between the monitoring thread and the rest of the program, you could use a queue to pass messages in a thread-safe way between them.
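For example, a minimal sketch using the standard library's queue module; read_flow_sensor is a hypothetical GPIO helper:

import queue
import threading
import time

from flask import Flask, jsonify

app = Flask(__name__)
readings = queue.Queue()

def monitoring_loop():
    while True:
        readings.put(read_flow_sensor())  # hypothetical GPIO helper
        time.sleep(1)

@app.route("/flow")
def latest_flow():
    # drain whatever the monitoring thread has produced so far
    values = []
    while not readings.empty():
        values.append(readings.get())
    return jsonify(values)

if __name__ == "__main__":
    threading.Thread(target=monitoring_loop, daemon=True).start()
    app.run(host="0.0.0.0", port=5000)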
The way I would probably handle this is to split your program into two distinct separately running programs.
One program handles the GPIO monitoring and communication, and the other program is your small Flask server. Since they run as separate processes, they won't block each other.
You can have the two processes communicate through a small database. The GPIO interface can periodically record flow measurements or other relevant data to a table in the database. It can also monitor another table in the database that might serve as a queue for requests.
Your Flask instance can query that same database to get the current statistics to return to the user, and can submit entries to the requests queue based on user input. (If the GPIO process updates that requests queue with the current status, the Flask process can report that back out.)
And as far as what kind of database to use on a little Raspberry Pi, consider sqlite3, which is a very small, lightweight, file-based database included in Python's standard library. (It doesn't require running a separate database-server process.)
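A minimal sketch of that shared sqlite3 file; the path and table layout are assumptions for illustration:

import sqlite3
import time

DB_PATH = "/home/pi/home_automation.db"  # hypothetical location of the shared file

def record_measurement(value):
    # called from the GPIO process to append a reading
    conn = sqlite3.connect(DB_PATH)
    conn.execute("CREATE TABLE IF NOT EXISTS flow (ts REAL, value REAL)")
    conn.execute("INSERT INTO flow VALUES (?, ?)", (time.time(), value))
    conn.commit()
    conn.close()

def latest_measurements(limit=10):
    # called from the Flask process to read the same file
    conn = sqlite3.connect(DB_PATH)
    rows = conn.execute(
        "SELECT ts, value FROM flow ORDER BY ts DESC LIMIT ?", (limit,)
    ).fetchall()
    conn.close()
    return rows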
Good luck with your project, it sounds like fun!
Hi, I was trying the connection with dronekit_sitl and I got the same issue: after 30 seconds the connection was closed. To get rid of that, there are two solutions:
Use the before_request decorator: here you define a method that will handle the connection before each request
Use the before_first_request decorator: in this case the connection will be made once, when the first request arrives, and you can then use the object in the other routes through a global variable (see the sketch below)
For more information https://pythonise.com/series/learning-flask/python-before-after-request
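For example, a sketch of the second option; create_connection is a hypothetical stand-in for whatever expensive setup (dronekit_sitl or otherwise) you need to do once:

from flask import Flask

app = Flask(__name__)
connection = None  # shared object, set up once

@app.before_first_request
def open_connection():
    # runs only once, when the first request arrives
    global connection
    connection = create_connection()  # hypothetical expensive setup

@app.route("/status")
def status():
    return "connected" if connection is not None else "not connected"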

Django, Signals and another process

I've got my Django project running well, and a separate background process which will collect data from various sources and store that data in an index.
I've got a model in a Django app called Sources which contains, essentially, a list of sources that data can come from! I've successfully managed to create a signal that is activated/called when a new entry is put in the Sources model.
My question is, is there a simple way that anybody knows of whereby I can send some form of signal/message to my background process indicating that the Sources model has been changed? Or should I just resort to polling for changes every x seconds, because it's so much simpler?
Many thanks for any help received.
It's unclear how you are running the background process you're talking about.
Anyway, I'd suggest that in your background task you use the Sources model directly. There are convenient ways to run the task without leaving the realm of Django (so as to have access to your models). You can use Celery [1], for example, or RQ [2].
Then you won't need to pass any messages; any changes to the Sources model will take effect the next time your task is run.
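For instance, a sketch of a periodic Celery task that reads the model directly; the app name and index_source helper are assumptions for illustration:

# tasks.py in your Django project
from celery import shared_task
from myapp.models import Sources  # hypothetical app name

@shared_task
def collect_data():
    # every run sees the current rows, so no extra signalling is needed
    for source in Sources.objects.all():
        index_source(source)  # hypothetical function that collects and indexes the data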
[1] Celery is an open source asynchronous task queue/job queue, it isn't hard to set up and integrates with Django well.
Celery: general introduction
Django with celery introduction
[2] RQ means "Redis Queue", it is ‘a simple Python library for queueing jobs and processing them in the background with workers’.
Introductory post
GitHub repository
Polling is probably the easiest if you don't need split-second latency.
If you do, however, then you'll probably want to look into either, say,
sending a UNIX signal (or using another method of IPC, depending on platform) to the process
having the background process listen on a simple socket that you just send, say, a byte to (which is, admittedly, also a form of IPC) to trigger the action you want (sketched below)
or some sort of task/message queue. Celery or ZeroMQ come to mind.
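As an illustration of the socket option, a minimal sketch; the port number and reload_sources helper are arbitrary assumptions:

import socket

# in the background process: block until a byte arrives, then act
def wait_for_nudge(port=8765):
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", port))
    server.listen(1)
    while True:
        conn, _ = server.accept()
        conn.recv(1)
        conn.close()
        reload_sources()  # hypothetical: re-read the Sources model

# in the Django signal handler: nudge the background process
def notify_background(port=8765):
    with socket.create_connection(("127.0.0.1", port), timeout=1) as s:
        s.sendall(b"1")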
