Asynchronous task queue processing of in-memory data structure in Django - python

I have a singleton in-memory data structure inside my Django project (a kind of kd-tree that needs to be accessed all across the project).
For those who don't know Django, I believe the same issue would appear in regular Python code.
I know singletons are evil and I'm looking for better ways to implement this, but my question here is about another topic:
I instantiate the singleton in my code by calling Singleton.instance(); it gives me the object correctly, and the object then lives somewhere in the memory of my ./manage.py runserver process.
The problem is that I am doing some asynchronous processing with Celery on this same singleton data structure (such as reconstructing the kd-tree).
But when I launch a Celery worker, it runs the code in a different process, and therefore in a different memory space, which means that it works on a totally different instance of the singleton.
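For illustration, roughly the shape of the pattern being described - a minimal sketch, assuming a classmethod-based singleton (the real Singleton class and the kd-tree construction aren't shown in the question):

```python
class Singleton:
    _instance = None  # lives only in the memory of the current process

    @classmethod
    def instance(cls):
        # The runserver process and each Celery worker process execute
        # this lazily, so every process ends up with its own separate
        # copy of the singleton (and of the kd-tree it holds).
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def __init__(self):
        self.kdtree = None  # placeholder for the actual kd-tree

    def rebuild(self, points):
        # the ~30-second reconstruction would happen here
        ...
```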
What would be the best design pattern for this issue? I have thought of doing all the processing related to my data structure inside the Django project (without using Celery), but what I liked very much about Celery is that the processing required on the data structure can take a long time (around 30 seconds) and it needs to handle concurrency nicely (there could be several simultaneous requests to reconstruct the kd-tree).
I would be very glad to have some insights on this, since I haven't made any progress in the last 3 days.
Thanks a lot.

Related

Django Background tasks vs Celery

I am trying to do some tasks in Django that consume a lot of time. For that, I will be running background tasks.
After some R&D, I have found two solutions:
Celery with RabbitMQ.
Django Background tasks.
Both options seem to fulfill the criteria, but setting up Celery will require some work. As far as the second option is concerned, setup is fairly simple and, in a fairly short amount of time, I can go on to writing background tasks. My questions, if I adopt the second option, are these:
How well do Django Background Tasks perform? (Scalability-wise, in a production environment.)
Can I poll the tasks (after some time) in the DB to check a task's status?
What is the architecture of Django-Background-Tasks? I couldn't find any clear explanation of its architecture (or have I missed some resource?)
Coming back to the first point, how well do Django Background Tasks perform in production? (I'm asking about prior experience of using this in prod.)
Setting up Celery takes work (although less when using Redis as the broker). It's also a serious tool with almost a decade of investment and widespread industry adoption.
As for performance, the scaling behaviors of task systems backed by queues vs. those backed by RDBMSs are well understood – but they may not be relevant to you, as "scalability" is a very subjective term. This thread provides some good framing on the subject and questions.
Comparing stars on GitHub (bg tasks' 3XX vs Celery's 13XXX), you should realize Django-Background-tasks has a smaller user base, and you're probably going to need to get into the internals to understand the architecture and precise mechanics. That shouldn't stop you – just be prepared to DIY when answers aren't forthcoming.
How well do Django Background Tasks perform? - This will depend upon how and what you implement. One thing to note is that Django-Background-Tasks is database-backed, whereas Celery can use Redis/RabbitMQ as its backend, so you will most probably see a considerable performance difference here.
Can I poll the tasks (after some time) in the DB to check a task's status? - It's possible in Celery, and you may be able to find a solution by inspecting django-background-tasks' internal code. One more thing: you can abort a Celery task, which may not be possible in Django-Background-Tasks.
Architecture of Django-Background-Tasks? Couldn't find any clear explanation about its architecture (or have I missed some resource?) - It's a simple Django-based project. You can have a look at the code; it seems to be pretty straightforward.
Coming back to the first point, how well do Django Background Tasks perform in production? - I haven't used it in production. But since Django-Background-Tasks is database-based and Celery can be configured to use Redis/RabbitMQ, I think Celery has a plus point here.
To me this comparison seems like comparing a pistol with a high-end automatic machine gun. Both do the same job, but one is simple and straightforward, while the other is a little more complicated, with lots of options and scope.
Choose based on your use case.
I have decided to use Django-Background-Tasks. Let me clarify my motivations.
The tasks that will be processed by Django-Background-Tasks don't need to be processed quickly. As the name states, they are background tasks; I accept delays.
The architecture of Django-Background-Tasks is very simple. When you call a method to be processed in the background, a task record is inserted into the Django-Background-Tasks tables in your database, and the method you called is not actually executed; it is proxied. You then trigger another process to execute the jobs, and your method is executed in that process.
The process that executes the jobs can be started by a cron entry on your server.
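As a minimal sketch of that flow (assuming django-background-tasks is installed and notify_user is just a made-up example task):

```python
# tasks.py in a Django app (hypothetical example)
from background_task import background

@background(schedule=60)  # run at least 60 seconds after being scheduled
def notify_user(user_id):
    # This body runs in the worker process, not in the web process.
    # Calling notify_user(some_id) from a view only inserts a task row.
    ...
```

Calling notify_user(user.id) from a view only inserts the task record; a separate python manage.py process_tasks process (started by cron or a supervisor) picks it up and runs the real function.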
Since this setup is so easy and works for me, I decided to use Django-Background-Tasks. But if I needed something more responsive and fast, I would use Celery, since it uses an in-memory broker and there is an always-running process that executes the jobs, which isn't the case with Django-Background-Tasks.

Daemon background tasks on flask (uwsgi) application

Edit to clarify my question:
I want to attach a Python service to uWSGI using this feature (I can't understand the examples), and I also want to be able to communicate results between them. Below I present some context, as well as my first thoughts on the communication matter, hoping for some advice or another approach to take.
I have an already-developed Python application that uses multiprocessing.Pool to run on-demand tasks. The main reason for using the pool of workers is that I need to share several objects between them.
On top of that, I want to have a Flask application that triggers tasks from its endpoints.
I've read several questions here on SO looking for possible drawbacks of using Flask with Python's multiprocessing module. I'm still a bit confused, but this answer summarizes well both the downsides of starting a multiprocessing.Pool directly from Flask and what my options are.
This answer shows a uWSGI feature for managing daemons/services. I want to follow this approach so I can use my already-developed Python application as a service of the Flask app.
One of my main problems is that when I look at the examples, I do not know what I need to do next. In other words, how would I start the Python app from there?
Another problem is the communication between the Flask app and the daemon process/service. My first thought is to use Flask-SocketIO to communicate, but then, if my server stops, I need to deal with the connection... Is this a good way to communicate between server and service? What other solutions are possible?
Note:
I'm well aware of Celery, and I intend to use it in the near future. In fact, I have an already-developed Node.js app in which users perform actions that should trigger specific tasks in the (also) already-developed Python application. The thing is, I need a production-ready version as soon as possible, and instead of modifying the Python application, which uses multiprocessing, I thought it would be faster to create a simple Flask server to communicate with Node.js through HTTP. This way I would only need to implement a Flask app that instantiates the Python app.
Edit:
Why do I need to share objects?
Simply because creating the objects in question takes too long. Actually, the creation takes an acceptable amount of time if done once, but since I'm expecting (maybe) hundreds to thousands of simultaneous requests, having to load every object again is something I want to avoid.
One of the objects is a scikit-learn classifier model, persisted in a pickle file, which takes 3 seconds to load. Each user can create several "job spots", each of which will take over 2k documents to be classified; each document will be uploaded at an unknown point in time, so I need to have this model loaded in memory (loading it again for every task is not acceptable).
This is one example of a single task.
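To make the cost concrete, a tiny sketch of loading the pickled model once at module import time rather than per request (model.pkl is a hypothetical path):

```python
# classifier.py - hypothetical module; imported once per worker process
import pickle

MODEL_PATH = "model.pkl"  # assumed location of the persisted scikit-learn model

with open(MODEL_PATH, "rb") as fh:
    model = pickle.load(fh)  # the ~3-second load, paid once, not per request

def classify(documents):
    # Reuse the already-loaded model for every batch of documents.
    return model.predict(documents)
```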
Edit 2:
I've asked some questions related to this project before:
Bidirectional python-node communication
Python multiprocessing within node.js - Prints on sub process not working
Adding a shared object to a manager.Namespace
As stated, but to clarify: I think the best solution would be to use Celery, but in order to have a production-ready solution quickly, I am trying to use this uWSGI attach-daemon solution.
I can see the temptation to hang on to multiprocessing.Pool. I'm using it in production as part of a pipeline. But Celery (which I'm also using in production) is much better suited to what you're trying to do, which is to distribute work across cores to a resource that's expensive to set up. Have N cores? Start N Celery workers, each of which can load (or maybe lazy-load) the expensive model as a global. When a request comes in to the app, launch a task (e.g., task = predict.delay(args)), wait for it to complete (e.g., result = task.get()) and return a response. You're trading a little time learning Celery to save having to write a bunch of coordination code.
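A minimal sketch of that shape, assuming a Redis broker/result backend; the predict task name, the model path, and the request format are placeholders, not the asker's actual code:

```python
# tasks.py - Celery worker side (hypothetical names and paths)
import pickle

from celery import Celery

app = Celery("tasks",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/0")

_model = None  # one lazily-loaded copy per worker process

def _get_model():
    global _model
    if _model is None:
        with open("model.pkl", "rb") as fh:  # expensive load happens once
            _model = pickle.load(fh)
    return _model

@app.task
def predict(features):
    return _get_model().predict([features]).tolist()
```

```python
# app.py - Flask side
from flask import Flask, jsonify, request

from tasks import predict

app = Flask(__name__)

@app.route("/classify", methods=["POST"])
def classify():
    task = predict.delay(request.json["features"])   # hand the work to a worker
    return jsonify(result=task.get(timeout=30))      # block until it finishes
```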

Static variable across processes in django

Is there any way to maintain a variable that is accessible and mutable across processes?
Example
User A makes a request to a view called make_foo, and the operation within that view takes time. We want a flag variable, say making_foo = True, that is visible to User B when they make a request, and to any other user or service within that Django app, and we want to be able to set it back to False when done.
Don't take the example too seriously; I know about task queues, but what I am trying to understand is the idea of having a shared mutable variable across processes without the need to use a database.
Is there any best practice to achieve that?
One thing you need to be aware of is that when your Django server is running in production, there is not just one Django process; there will be several worker processes (and possibly threads) running at the same time.
If you want to share data between processes, even internally, you will need some kind of database to do so, whether that's with SQLite3 or Redis (which I recommend for stuff like this).
I won't go into the details because it's already been said by other people, but Redis is an in-memory database that uses key-value storage (unlike Django's model layer, Redis is essentially a giant dictionary). Redis is fast, and most operations are atomic, which means you are unlikely to encounter race conditions.
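A rough sketch of the making_foo flag from the question, assuming a local Redis instance and the redis-py client:

```python
# shared_flag.py - hypothetical helper importable from any worker process
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def start_making_foo():
    r.set("making_foo", 1)       # SET is atomic; all processes see it immediately

def finish_making_foo():
    r.delete("making_foo")

def is_making_foo():
    return r.exists("making_foo") == 1
```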

Python multithreading - Global Interpreter Lock

The Python threading module documentation says something like this:
In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
Can someone explain whether I can use the threading module in my situation or not?
I'm going to detect the frameworks used by websites.
So here is how my app works:
My MySQL database contains around 10 million domains (id, domain, frameworks).
Fetch 1000 rows from the database
Scrape each domain using the requests module
Detect the frameworks
Update the database row with the results.
Since I have 10 million domains, it's going to take a very long time, so I would like to speed up the process by using threads.
But I'm not sure whether my app is I/O-bound or not. Can someone explain?
Thank you.
I'd guess the most time-expensive activity will be fetching all the URLs.
So the answer to your question is: yes, your app is very likely to be I/O-bound.
You plan to scrape domains one by one; this would lead to a really long processing time. You should definitely do it concurrently. One solution is described in my answer to a similar question about scraping web sites.
Anyway, the number of URLs seems really large, so you might need to take advantage of splitting the work across multiple workers - for this purpose you might use, e.g., the Celery framework. However, as your task is really I/O-bound, you would only gain extra speed if your workers run on multiple computers, ideally with independent connectivity. I did a similar task on DigitalOcean machines and it worked very well.
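For the single-machine, thread-based variant, a minimal sketch using a thread pool with requests; the worker count is arbitrary, and detect_frameworks / save_result stand in for your own detection and MySQL update code:

```python
# Hypothetical sketch: I/O-bound scraping with a thread pool
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

def fetch(domain):
    # Threads spend most of their time waiting on the network,
    # so the GIL is not a bottleneck for this workload.
    try:
        resp = requests.get(f"http://{domain}", timeout=10)
        return domain, resp.text
    except requests.RequestException:
        return domain, None

def process_batch(domains):
    with ThreadPoolExecutor(max_workers=50) as pool:
        futures = [pool.submit(fetch, d) for d in domains]
        for future in as_completed(futures):
            domain, html = future.result()
            if html is not None:
                frameworks = detect_frameworks(html)  # your detection logic
                save_result(domain, frameworks)       # e.g. UPDATE the MySQL row
```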

Clarification of use-cases for Hadoop versus RabbitMQ+Celery

I know that there are similar questions to this, such as:
https://stackoverflow.com/questions/8232194/pros-and-cons-of-celery-vs-disco-vs-hadoop-vs-other-distributed-computing-packag
Differentiate celery, kombu, PyAMQP and RabbitMQ/ironMQ
but I'm asking this because I'm looking for a more particular distinction backed by a couple of use-case examples, please.
So, I'm a Python user who wants to make programs that either are too large for, or take too long on, a single machine, and to process them on multiple machines instead. I am familiar with the (single-machine) multiprocessing package in Python, and I write map-reduce-style code right now. I know that my function, for example, is easily parallelizable.
In asking my usual smart CS advice-givers, I have phrased my question as:
"I want to take a task, split it into a bunch of subtasks that are executed simultaneously on a bunch of machines, then those results to be aggregated and dealt with according to some other function, which may be a reduce, or may be instructions to serially add to a database, for example."
According to this break-down of my use-case, I think I could equally well use Hadoop or a set of Celery workers + RabbitMQ broker. However, when I ask the sage advice-givers, they respond to me as if I'm totally crazy to look at Hadoop and Celery as comparable solutions. I've read quite a bit about Hadoop, and also about Celery---I think I have a pretty good grasp on what both do---what I do not seem to understand is:
Why are they considered so separate, so different?
Given that they seem to be received as totally different technologies---in what ways? What are the use cases that distinguish one from the other or are better for one than another?
What problems could be solved with both, and what areas would it be particularly foolish to use one or the other for?
Are there possibly better, simpler ways to achieve multiprocessing-like Pool.map()-functionality to multiple machines? Let's imagine my problem is not constrained by storage, but by CPU and RAM required for calculation, so there isn't an issue in having too little space to hold the results returned from the workers. (ie, I'm doing something like simulation where I need to generate a lot of things on the smaller machines seeded by a value from a database, but these are reduced before they return to the source machine/database.)
I understand Hadoop is the big-data standard, but Celery also looks well supported; I appreciate that it isn't Java (the streaming API Python has to use for Hadoop looked uncomfortable to me), so I'd be inclined to use the Celery option.
They are the same in that both can solve the problem you describe (map-reduce). They are different in that Hadoop is built entirely to solve only that use case, while Celery/RabbitMQ is built to facilitate task execution on different nodes using message passing. Celery also supports other use cases.
Hadoop solves the map-reduce problem by having a large, special filesystem from which the mapper takes its data, sends it to a bunch of map nodes, and reduces the results back to that filesystem. The advantage is that it is really fast at doing this. The downsides are that it only operates on text-based data input, that Python is not really supported, and that you can't easily handle (slightly) different use cases.
Celery is a message-based task executor. In it you define tasks and group them together into a workflow (which can be a map-reduce workflow). Its advantages are that it is Python-based and that you can stitch tasks together into a custom workflow. Its disadvantages are its reliance on a single broker/result backend and its setup time.
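As a rough illustration of what a map-reduce-style workflow looks like in Celery, a sketch assuming a Redis broker and made-up count_words / merge_counts tasks:

```python
# Hypothetical map-reduce workflow built from a Celery chord
from collections import Counter

from celery import Celery, chord

app = Celery("mapreduce",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/0")

@app.task
def count_words(chunk):
    # "map": count word occurrences in one chunk of text
    return Counter(chunk.split())

@app.task
def merge_counts(partial_counts):
    # "reduce": combine the per-chunk counts into one result
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return dict(total)

# Run all map tasks in parallel, then feed their results to the reducer:
#   result = chord(count_words.s(c) for c in chunks)(merge_counts.s())
```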
So if you have a couple of GBs' worth of logfiles, don't care to write in Java, and have some servers to spare that are used exclusively to run Hadoop, use that. If you want flexibility in running workflowed tasks, use Celery. Or.....
Yes! There is a new project from one of the companies that helped create the messaging protocol AMQP that is used by RabbitMQ (and others). It is called ZeroMQ, and it takes distributed messaging/execution to the next level by, strangely, going down a level in abstraction compared to Celery. It defines sockets that you can link together in various ways to create messaging links between nodes. Anything you want to do with these messages is up to you to write. Although this might sound like "what good is a thin wrapper around a socket?", it is actually at the right level of abstraction. Right now at our company we are factoring out all our Celery messaging and rebuilding it with ZeroMQ. We found that Celery is just too opinionated about how tasks should be executed, and that the setup/config in general is a pain. Also, the broker in the middle that has to handle all traffic was becoming too much of a bottleneck.
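For a feel of the level ZeroMQ works at, a minimal request/reply sketch with pyzmq; the port and the do_work placeholder are made up:

```python
# worker.py - hypothetical ZeroMQ worker replying to each request
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.REP)
sock.bind("tcp://*:5555")

while True:
    msg = sock.recv_json()     # the message format is entirely up to you
    result = do_work(msg)      # placeholder for your own task logic
    sock.send_json(result)
```

```python
# client.py - hypothetical ZeroMQ client sending work and waiting for the reply
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.REQ)
sock.connect("tcp://localhost:5555")

sock.send_json({"task": "count", "payload": "the quick brown fox"})
print(sock.recv_json())
```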
In summary:
Count the occurrences of "the" in a book with as little programming as possible and lots of setup/config time: Hadoop
Create atomic tasks and have them work together with not too much programming and a lot of setup/config time: Celery
Have complete control over what to do with your messages and how to program them with almost no setup/config time: ZeroMQ
Have pain with no setup/config time: Sockets
