I have a Django application that displays a webpage with data from a model, selected by the primary key passed in the URL. That part works fine, and the Django component is working perfectly for the most part.
My question is how I can arrange for code to run when the Django server boots up that creates a separate thread, which would then monitor an external source and log valid data from that source into the database as a model instance. I have already tried multiple methods, such as using an AppConfig.
I have the threading code written, along with the section that creates the model instance and saves it to the database. My issue is that if I try to use an AppConfig to create the thread, I get a django.core.exceptions.AppRegistryNotReady: Apps aren't loaded yet. error and the server does not boot up.
Where would be appropriate to place the code? Is my approach incorrect to the matter?
Trying to use threading to get around blocking processes like web servers is an exercise in pain. I've done it before and it's fragile and often yields unpredictable results.
A much easier idea is to create a separate worker that runs in a totally different process that you start separately. It would have the same database access and could even use your Django models. This is how hosts like Heroku approach this problem. It comes with the added benefit of being able to be tested separately and doesn't need to run at all while you're working on your main Django application.
These days, with a multitude of virtualization options like Vagrant and containerization options like Docker, running parallel processes and workers is trivial. In the wild they may literally be running on separate servers with your database on yet another server. As was mentioned in the comments, starting the worker could easily be delegated to a separate Django management command. That command, in turn, can be run as its own long-lived worker process by whatever supervises your deployment (a Procfile worker on Heroku, supervisord, systemd, and so on), alongside the gunicorn processes serving your web traffic.
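As a rough sketch of what such a worker could look like as a management command (the app, model, and polling helper names here are made up, not taken from your project):

    # myapp/management/commands/monitor_source.py -- sketch only
    import time

    from django.core.management.base import BaseCommand

    from myapp.models import Reading          # hypothetical model
    from myapp.monitoring import poll_source  # hypothetical helper returning a dict or None


    class Command(BaseCommand):
        help = "Continuously poll the external source and store valid data."

        def handle(self, *args, **options):
            while True:
                data = poll_source()
                if data is not None:
                    Reading.objects.create(**data)
                time.sleep(5)  # polling interval

You would then start it with python manage.py monitor_source as its own process (for example a worker entry in a Procfile or a supervisord program). Because manage.py fully initializes Django before handle() runs, the app registry is ready and the AppRegistryNotReady error never comes up.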
Related
I have a webapp that monitors sites that users add for any changes. To do this, I need to have some sort of separate background thread/process that is constantly iterating through the list of sites, pinging them one at a time, and emailing any users that are monitoring a site that changes. I am currently using a thread that I initialize at the end of my urls.py file. This works fine with Django's development server, but it begins to break down once I deploy it to Heroku with Gunicorn. As soon as there are multiple connections, multiple copies of the worker thread get started, because Gunicorn starts more workers to handle the concurrent connections (at least, I think that's the reason behind the extra threads). This causes duplicate emails to be sent out, one from each thread.
I am now trying to find another means of spawning this worker thread/process. I saw a similar inquiry here, but when I tried the posted solution, I was unable to reference the models from my Django app and received this error message when I tried to do so:
django.core.exceptions.AppRegistryNotReady: Apps aren't loaded yet.
I have also tried using django-background-tasks, which is frequently recommended as a simple solution for issues like this. However, it doesn't seem suited for looping, continuous processes. The same goes for Celery and other solutions like it. I am just looking for a way to start a separate worker Dyno that continuously runs in the background, without a queue or anything like that, and is able to use the models from my Django app to create QuerySets that can be iterated through. What would be the best way to do something like this? Please let me know if any more information would help.
You could try editing the code so that the parts that handle the email specifically aren't tied so intrinsically to the Django model, such that both the Django model and this secondary application interact with a plain Python class/module/object instead of you trying to graft the part of Django you need somewhere else.
Alternatively, you can try using something like threading.Lock, if your app really is using threads inside one interpreter, to prevent multiple messages from being sent. There is also a multiprocessing.Lock that may work if the threading one does not.
Another option would be to make it so each requested change has a unique value attached to it, preferably something based on the contents of the change itself. For example, if you have something like:
    def check_send_email(email_addr, website_url, text_that_changed):
        # The insert only succeeds if this exact change hasn't been recorded yet.
        database.query('INSERT INTO website_updates VALUES (%s, %s)',
                       (website_url, text_that_changed))
        if database.check_result():  # update was not already present in database
            send_email(email_addr)

    check_send_email('email@example.com', 'website.com', '<div id="watched-div">')
Obviously you'd need to interact with some more concrete tools, but the general idea above is that even if several requests come in, you don't send multiple emails needlessly. Of course, finding a value that you can always generate exactly the same way for a given change, but that is also unique across different changes, may prove difficult.
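One sketch of such a value is a content hash of the change, stored in a column with a UNIQUE constraint (the function and column here are illustrative, not tied to any particular library):

    import hashlib

    def change_key(website_url, text_that_changed):
        # The same change always hashes to the same key, while different
        # changes almost certainly produce different keys.
        return hashlib.sha256(
            (website_url + '\n' + text_that_changed).encode('utf-8')
        ).hexdigest()

With that key in a UNIQUE column, the second worker that tries to record the same change hits an integrity error instead of sending a second email.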
Edit to clarify my question:
I want to attach a Python service to uWSGI using this feature (I can't understand the examples), and I also want to be able to communicate results between them. Below I present some context and also my first thought on the communication matter, expecting maybe some advice or another approach to take.
I have an already developed Python application that uses multiprocessing.Pool to run on-demand tasks. The main reason for using the pool of workers is that I need to share several objects between them.
On top of that, I want to have a flask application that triggers tasks from its endpoints.
I've read several questions here on SO looking for possible drawbacks of using flask with python's multiprocessing module. I'm still a bit confused but this answer summarizes well both the downsides of starting a multiprocessing.Pool directly from flask and what my options are.
This answer shows an uWSGI feature to manage daemon/services. I want to follow this approach so I can use my already developed python application as a service of the flask app.
One of my main problems is that I look at the examples and do not know what I need to do next. In other words, how would I start the python app from there?
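For concreteness, this is roughly the kind of configuration I imagine for attaching the service, although I'm not sure it's what the feature expects (the module and script names are placeholders):

    [uwsgi]
    # serve the Flask app
    module = flask_app:app
    master = true
    processes = 4

    # start the long-running Python service next to the workers
    attach-daemon = python worker_service.py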
Another problem is about the communication between the flask app and the daemon process/service. My first thought is to use flask-socketIO to communicate, but then, if my server stops I need to deal with the connection... Is this a good way to communicate between server and service? What are other possible solutions?
Note:
I'm well aware of Celery, and I intend to use it in the near future. In fact, I have an already developed node.js app, in which users perform actions that should trigger specific tasks from the (also) already developed Python application. The thing is, I need a production-ready version as soon as possible, and instead of modifying the Python application, which uses multiprocessing, I thought it would be faster to create a simple Flask server to communicate with node.js through HTTP. This way I would only need to implement a Flask app that instantiates the Python app.
Edit:
Why do I need to share objects?
Simply because the creation of the objects in question takes too long. Actually, the creation takes an acceptable amount of time if done once, but, since I'm expecting (maybe) hundreds to thousands of simultaneous requests, having to load every object again is something I want to avoid.
One of the objects is a scikit-learn classifier model, persisted in a pickle file, which takes 3 seconds to load. Each user can create several "job spots", each of which will take over 2k documents to be classified, and each document will be uploaded at an unknown point in time, so I need to keep this model loaded in memory (loading it again for every task is not acceptable).
This is one example of a single task.
Edit 2:
I've asked some questions related to this project before:
Bidirectional python-node communication
Python multiprocessing within node.js - Prints on sub process not working
Adding a shared object to a manager.Namespace
As stated, but to clarify: I think the best solution would be to use Celery, but in order to quickly have a production-ready solution, I'm trying to use this uWSGI attach-daemon approach.
I can see the temptation to hang on to multiprocessing.Pool. I'm using it in production as part of a pipeline. But Celery (which I'm also using in production) is much better suited to what you're trying to do, which is to distribute work across cores to a resource that's expensive to set up. Have N cores? Start N Celery workers, each of which can load (or maybe lazy-load) the expensive model as a global. When a request comes in to the app, launch a task (e.g., task = predict.delay(args)), wait for it to complete (e.g., result = task.get()) and return a response. You're trading a little bit of time spent learning Celery for not having to write a bunch of coordination code.
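A minimal sketch of that shape, assuming a Redis broker and a pickled scikit-learn model (the broker URL, file name, and module layout are assumptions, not your actual code):

    # tasks.py -- sketch only
    import pickle

    from celery import Celery

    app = Celery('tasks', broker='redis://localhost:6379/0')  # assumed broker

    _model = None  # lazily loaded once per worker process, not once per task


    def get_model():
        global _model
        if _model is None:
            with open('classifier.pkl', 'rb') as f:  # hypothetical pickle file
                _model = pickle.load(f)
        return _model


    @app.task
    def predict(documents):
        # Each worker pays the ~3 second load cost only on its first task.
        return [int(label) for label in get_model().predict(documents)]

In the Flask endpoint you'd call task = predict.delay(docs) and then block on task.get() (or poll for the result), so the web process never has to hold the model in memory at all.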
I'm working on a distributed system where one process controls a piece of hardware, and I want that process to run as a service. My app is Django + Twisted based, so Twisted maintains the main loop and I access the database (SQLite) through Django, the entry point being a Django management command.
On the other hand, for the user interface, I am writing a web application in the same Django project, on the same database (also using Crossbar as the websockets and WAMP server). This is a second Django process accessing the same database.
I'm looking for some validation here. Is there anything fundamentally wrong with this approach? I'm particularly worried about issues with the database (two different processes accessing it via the Django ORM).
Consider that a Django site, like any WSGI-based deployment, almost always has multiple processes accessing the database. Because a single WSGI process can handle only one connection at a time, it's normal for servers to run multiple processes in parallel when they get any significant amount of traffic.
That doesn't mean there's no cause for concern. You have to treat the database as if data might change between any two calls to it. Familiarize yourself with how Django uses transactions (the default is autocommit mode, not atomic requests), and …
and oh, you said SQLite. Yeah, SQLite is probably not the best database to use when you need to write to it from multiple processes. I can imagine that it might work for a single-user interface to a piece of hardware, but if you run into any problems when adding the webapp, you'll want to trade up to a database server like PostgreSQL.
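To make the earlier point about transactions concrete, here's a minimal sketch of grouping related writes so the other process never sees a half-written record (the models are hypothetical):

    from django.db import transaction

    from myapp.models import Run, Reading  # hypothetical models


    def record_run(device_id, values):
        # Either every row below becomes visible to the web process, or none do.
        with transaction.atomic():
            run = Run.objects.create(device_id=device_id)
            Reading.objects.bulk_create(
                [Reading(run=run, value=v) for v in values]
            )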
No, there is nothing inherently wrong with that approach. We currently use a similar approach for a lot of our work.
I'm writing a Django web app that makes use of Scrapy, and locally all works great, but I wonder how to set up a production environment where my spiders are launched periodically and automatically (I mean that once a spider completes its job it gets relaunched after a certain time... for example after 24h).
Currently I launch my spiders using a custom Django command, which has the main goal of allowing the use of Django's ORM to store scraped items, so I run:
python manage.py scrapy crawl myspider
and results are stored in my Postgres database.
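For context, the command is essentially a thin wrapper that lets Django set itself up before handing control to Scrapy; a simplified sketch (not my exact code) looks something like this:

    # a Django management command along these lines -- simplified sketch
    from django.core.management.base import BaseCommand
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings


    class Command(BaseCommand):
        help = "Run a Scrapy spider with Django initialized, so pipelines can use the ORM."

        def add_arguments(self, parser):
            parser.add_argument('spider_name')

        def handle(self, *args, **options):
            process = CrawlerProcess(get_project_settings())
            process.crawl(options['spider_name'])
            process.start()  # blocks until the crawl finishes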
I installed scrapyd, since it seems to be the preferred way to run Scrapy in production, but unfortunately I can't use it without writing a monkey patch (which I would like to avoid), since it uses JSON for its web-service API and I get a "modelX is not json serializable" exception.
I looked at django-dynamic-scraper, but it doesn't seem to be designed to be as flexible and customizable as Scrapy itself; in fact the docs say:
Since it simplifies things DDS is not usable for all kinds of scrapers, but it is well suited for the relatively common case of regularly scraping a website with a list of updated items
I also thought of using crontab to schedule my spiders... but at what interval should I run them? And if my EC2 instance (I'm going to use Amazon Web Services to host my code) needs a reboot, I have to re-run all my spiders manually... mmmh... things get complicated...
So... what could be an effective setup for a production environment? How do you handle it? What's your advice?
I had the same question which led me to yours here. Here is what I think and what I did with my project.
Currently I launch my spiders using a custom Django command, which has the main goal of allowing the use of Django's ORM to store scraped items
This sounds very interesting. I also wanted to use Django's ORM inside Scrapy spiders, so I imported django and set it up before the scraping took place. I guess that's unnecessary if you call Scrapy from an already instantiated Django context.
I installed scrapyd, since it seems to be the preferred way to run Scrapy in production, but unfortunately I can't use it without writing a monkey patch (which I would like to avoid)
I had the idea of using subprocess.Popen with stdout and stderr redirected to PIPE, then taking both results and processing them. I didn't need to gather items from the output, since the spiders already write results to the database via pipelines. It gets a bit recursive if you call the Scrapy process from Django this way, because the Scrapy process sets up the Django context itself so it can use the ORM.
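A minimal sketch of that approach (the spider name and project path are placeholders):

    import subprocess

    # Launch the spider as a child process and capture its output.
    proc = subprocess.Popen(
        ['scrapy', 'crawl', 'myspider'],
        cwd='/path/to/scrapy/project',
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, err = proc.communicate()  # blocks until the crawl finishes

    if proc.returncode != 0:
        # the crawl failed; log stderr or notify someone
        print(err.decode('utf-8', errors='replace'))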
Then I tried scrapyd, and yes, you have to fire HTTP requests at scrapyd to enqueue a job, but it doesn't signal you when the job is finished or whether it is still pending. That part you have to check yourself, and I guess that is a place for a monkey patch.
I also thought of using crontab to schedule my spiders... but at what interval should I run them? And if my EC2 instance (I'm going to use Amazon Web Services to host my code) needs a reboot, I have to re-run all my spiders manually... mmmh... things get complicated...
So... what could be an effective setup for a production environment? How do you handle it? What's your advice?
I'm currently using cron for scheduling the scraping. It's not something that users can change, even if they want to, but that has its pros too: this way I'm sure users won't shrink the period and make multiple scrapers run at the same time.
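For reference, the kind of crontab entry I mean (paths and the management command name are placeholders):

    # run the scraping command once a day at 03:00
    0 3 * * * cd /srv/myproject && /srv/venv/bin/python manage.py run_spiders >> /var/log/scraping.log 2>&1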
I have concerns about introducing unnecessary links in the chain. Scrapyd would be the middle link, and it seems like it's doing its job for now, but it could also become the weak link if it can't hold the production load.
Bearing in mind that you posted this a while ago, I'd be grateful to hear what your solution ended up being for the whole Django-Scrapy-Scrapyd integration.
Cheers
I have a multi-stage process that needs to be run at some intervals.
I also have a Controller program which starts the process at the right times, chains together the stages of the process, and checks that each stage has executed correctly.
The Controller accesses a database which stores information about past runs of the process, parameters for future executions of the process, etc.
Now, I want to use Pyramid to build a web interface to the Controller, so that I can view information about the process and affect the operation of the Controller.
This will mean that actions in the web interface must effect changes in the controller database.
Naturally, the web interface will use the exact same data models as the Controller.
What's the best way for the Controller and Web Server to interact?
I've considered two possibilities:
Combine the controller and web server by calling sched in Pyramid's initialisation routine
Have the web server make RPCs to the controller, e.g. using Pyro.
How should I proceed here? And how can I avoid code duplication (of the data models) when using the second option?
I would avoid running your Controller in the same process as the web application. It's common practice, for example, to run web applications with lowered permissions, or in a multi-threaded/multi-process environment which may spawn multiple workers and then kill/recycle them whenever it feels like doing so. So having your Controller running in a separate process with some kind of RPC mechanism seems like a much better idea.
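A minimal sketch of the RPC side using Pyro4, just to show the shape of it (the class, methods, and port are made up):

    # controller_service.py -- runs inside the Controller process
    import Pyro4


    @Pyro4.expose
    class ControllerRPC(object):
        def last_run_status(self):
            # read from the controller database / in-memory state
            return {'state': 'idle'}

        def schedule_run(self, when):
            # hand the request over to the Controller's scheduler
            return True


    daemon = Pyro4.Daemon(host='127.0.0.1', port=9090)
    uri = daemon.register(ControllerRPC(), objectId='controller')
    daemon.requestLoop()

The Pyramid views would then talk to it through Pyro4.Proxy('PYRO:controller@127.0.0.1:9090'), so the web process never runs the multi-stage process itself.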
Regarding code duplication - there are 2 options:
you can extract the common code (models) into a separate module/egg which is used by both applications
if you're finding that you need to share a lot of code - nothing forces you to have separate projects for those applications at all. You can have a single code base with two or more "entry points" - one of which would start a Pyramid WSGI application and another would start your Controller process.
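As a sketch of that second option, a single package can declare both programs in its setup.py (the project and module names are made up):

    # setup.py -- one code base, two entry points
    from setuptools import setup, find_packages

    setup(
        name='mysystem',
        packages=find_packages(),
        install_requires=['pyramid'],
        entry_points={
            # WSGI app factory used by 'pserve production.ini'
            'paste.app_factory': [
                'main = mysystem.web:main',
            ],
            # standalone Controller process
            'console_scripts': [
                'run-controller = mysystem.controller:main',
            ],
        },
    )

Both entry points import the same mysystem.models module, so the data models live in exactly one place.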