I have two sites running essentially the same codebase, with only slight differences in settings. Each site is built in Django, with a WordPress blog integrated.
Each site needs to import blog posts from WordPress and store them in the Django database. When a user publishes a post, WordPress hits a webhook URL on the Django side, which kicks off a Celery task that grabs the JSON version of the post and imports it.
My initial thought was that each site could run its own instance of manage.py celeryd, each is in its own virtualenv, and the two sites would stay out of each other's way. Each is daemonized with a separate upstart script.
But it looks like they're colliding somehow. I can run one at a time successfully, but if both are running, one instance won't receive tasks, or tasks will run with the wrong settings (in this case, each has a WORDPRESS_BLOG_URL setting).
I'm using a Redis queue, if that makes a difference. What am I doing wrong here?
Have you specified the name of the default queue that celery should use? If you haven't set CELERY_DEFAULT_QUEUE the both sites will be using the same queue and getting each other's messages. You need to set this setting to a different value for each site to keep the message separate.
Edit
You're right, CELERY_DEFAULT_QUEUE is only for backends like RabbitMQ. I think you need to set a different database number for each site, using a different number at the end of your broker url.
If you are using django-celery then make sure you don't have an instance of celery running outside of your virtualenvs. Then start the celery instance within your virtualenvs using manage.py celeryd like you have done. I recommend setting up supervisord to keep track of your instances.
Related
I have two servers, A primary server that provide REST API to accept data from user and maintain a product details list. This server is also responsible to share product list (a subset of product data) with secondary server as soon as product is updated/created.
also note that secondary url depends on product details, not a fix server.
Primary server written in Django. I have used django model db signal as product update, create and delete event.
Now problem is that I don’t want to bock my primary server REST call until it populates detail to secondary server. I need some scheduler stuff to do that, i.e. create a task to populate data in background without blocking my current thread.
I found python asyncio module comes with a function 'run_in_executor', and its working till now, But I don’t have a knowledge of the side effect over django run in wsgi server, can anyone explain ? or any other alternate ?
I found django channel, but it need extra stuff like run worker thread separately, redis cache.
You should use Django Celery for running Tasks asynchronously or in the background.
Celery is a task queue with batteries included. It’s easy to use so that you can get started without learning the full complexities of the problem it solves.
You can get more information on celery from http://docs.celeryproject.org/en/latest/getting-started/first-steps-with-celery.html#first-steps
I've got a django project with simple form to take users details in. I want to use python bot running in the background and constantly checking django database for any changes. Is it Celery the right tool for this job? Any other solution? Thank you
I don't think Celery is really what you want here - Celery is primarily for moving tasks that don't need to be dealt with in the same process to a separate worker, such as sending registration emails.
For this situation I'd be inclined to use Django's signals to trigger the required functionality whenever the appropriate changes are made to the database. For instance, if it needed to be triggered when a particular type of object was created, such as a new user, then you might use the post_save signal of the user model.
The bot would be in a separate process, but it's not too hard to communicate between processes using Redis. Just have the signal publish a message to Redis, and have the bot listen for that message and carry out the required action on that event.
I don't have the details of your needs but, there are a few ways to achieve such things:
The Constantly checking approach:
A crontab which launch your python script every minute.
Like you said, you could use Celery beat, to achieve what a crontab would do, in your python environment
"On change" approach:
Probably the best, if you have control of the Django project, you could have your script run on the form validation/save! For this, You can add a celery task, run the python script, use Django signals...
I have a Django application written to handle displaying a webpage with data from a model based on the primary key passed in the URL, this all works fine and the Django component is working perfectly for the most part.
My question though is, and I have tried multiple methods such as using an AppConfig, is how I can make it so when the Django server boots up, code is called that would then create a separate thread which would then monitor an external source, logging valid data from that source as a model into the database.
I have the threading code written along with the section that creates the model and saves it in the database, my issue though is that if I try to use an AppConfig to create the thread which would then handle the code, I get an django.core.exceptions.AppRegistryNotReady: Apps aren't loaded yet. error and the server does not boot up.
Where would be appropriate to place the code? Is my approach incorrect to the matter?
Trying to use threading to get around blocking processes like web servers is an exercise in pain. I've done it before and it's fragile and often yields unpredictable results.
A much easier idea is to create a separate worker that runs in a totally different process that you start separately. It would have the same database access and could even use your Django models. This is how hosts like Heroku approach this problem. It comes with the added benefit of being able to be tested separately and doesn't need to run at all while you're working on your main Django application.
These days, with a multitude of virtualization options like Vagrant and containerization options like Docker, running parallel processes and workers is trivial. In the wild they may literally be running on separate servers with your database on yet another server. As was mentioned in the comments, starting a worker process could easily be delegated to a separate Django management command. This, in turn, can be fairly easily turned into separate worker processes by gunicorn on your web server.
I am trying to figure out the best way to structure a Django app that uses Celery to handle async and scheduled tasks in an autoscaling AWS ElasticBeanstalk environment.
So far I have used only a single instance Elastic Beanstalk environment with Celery + Celerybeat and this worked perfectly fine. However, I want to have multiple instances running in my environment, because every now and then an instance crashes and it takes a lot of time until the instance is back up, but I can't scale my current architecture to more than one instance because Celerybeat is supposed to be running only once across all instances as otherwise every task scheduled by Celerybeat will be submitted multiple times (once for every EC2 instance in the environment).
I have read about multiple solutions, but all of them seem to have issues that don't make it work for me:
Using django cache + locking: This approach is more like a quick fix than a real solution. This can't be the solution if you have a lot of scheduled tasks and you need to add code to check the cache for every task. Also tasks are still submitted multiple times, this approach only makes sure that execution of the duplicates stops.
Using leader_only option with ebextensions: Works fine initially, but if an EC2 instance in the enviroment crashes or is replaced, this would lead to a situation where no Celerybeat is running at all, because the leader is only defined once at the creation of the environment.
Creating a new Django app just for async tasks in the Elastic Beanstalk worker tier: Nice, because web servers and workers can be scaled independently and the web server performance is not affected by huge async work loads performed by the workers. However, this approach does not work with Celery because the worker tier SQS daemon removes messages and posts the message bodies to a predefined urls. Additionally, I don't like the idea of having a complete additional Django app that needs to import the models from the main app and needs to be separately updated and deployed if the tasks are modified in the main app.
How to I use Celery with scheduled tasks in a distributed Elastic Beanstalk environment without task duplication? E.g. how can I make sure that exactly one instance is running across all instances all the time in the Elastic Beanstalk environment (even if the current instance with Celerybeat crashes)?
Are there any other ways to achieve this? What's the best way to use Elastic Beanstalk's Worker Tier Environment with Django?
I guess you could single out celery beat to different group.
Your auto scaling group runs multiple django instances, but celery is not included in the ec2 config of the scaling group.
You should have different set (or just one) of instance for celery beat
In case someone experience similar issues: I ended up switching to a different Queue / Task framework for django. It is called django-q and was set up and working in less than an hour. It has all the features that I needed and also better Django integration than Celery (since djcelery is no longer active).
Django-q is super easy to use and also lighter than the huge Celery framework. I can only recommend it!
I'm writing a Django web app that makes use of Scrapy and locally all works great, but I wonder how to set up a production environment where my spiders are launched periodically and automatically (I mean that once a spiders complete its job it gets relaunched after a certain time... for example after 24h).
Currently I launch my spiders using a custom Django command, which has the main goal of allowing the use of Django's ORM to store scraped items, so I run:
python manage.py scrapy crawl myspider
and results are stored in my Postgres database.
I installed scrapyd, since it seems that is the preferred way to run scrapy in production
but unfortunately I can't use it without writing a monkey patch (which I would like to avoid), since it use JSON for its web-service API and I get "modelX is not json serializable" exception.
I looked at django-dynamic-scraper, but it seems not be designed to be flexible and customizable as Scrapy is and in fact in docs they say:
Since it simplifies things DDS is not usable for all kinds of
scrapers, but it is well suited for the relatively common case of
regularly scraping a website with a list of updated items
I also thought to use crontab to schedule my spiders... but at what interval should I run my spiders? and if my EC2 instance (I'm gonna use amazon webservices to host my code) needs a reboot I have to re-run all my spiders manually... mmmh... things get complicated...
So... what could be an effective setup for a production environment? How do you handle it? What's your advice?
I had the same question which led me to yours here. Here is what I think and what I did with my project.
Currently I launch my spiders using a custom Django command, which has
the main goal of allowing the use of Django's ORM to store scraped
items
This sounds very interesting. I also wanted to use Django's ORM inside Scrapy spiders, so I did import django and set it up before scraping took place. I guess that is unnecessary if you call scrapy from already instantiated Django context.
I installed scrapyd, since it seems that is the preferred way to run
scrapy in production but unfortunately I can't use it without writing
a monkey patch (which I would like to avoid)
I had idea of using subprocess.Popen, with stdout and stderr redirected to PIPE. Then take both stdout and stderr results and process them. I didn't have need to gather items from output, since spiders are already writing results to database via pipelines. It gets recursive if you call scrapy process from Django this way, and scrapy process sets up Django context so it can use ORM.
Then I tried scrapyd and yes, you have to fire up HTTP requests to the scrapyd to enqueue job, but it doesn't signal you when job is finished or if it is pending. That part you have to check and I guess that is a place for monkey patch.
I also thought to use crontab to schedule my spiders... but at what
interval should I run my spiders? and if my EC2 instance (I'm gonna
use amazon webservices to host my code) needs a reboot I have to
re-run all my spiders manually... mmmh... things get complicated...
So... what could be an effective setup for a production environment?
How do you handle it? What's your advice?
I'm currently using cron for scheduling scraping. It's not something that users can change, even though they want but it has it's pros too. That way I'm sure users wont shrink the period and make multiple scrapers work at the same time.
I have concerns with introducing unnecessary links in chain. Scrapyd would be the middle link and it seems like its doing it's job for now, but it also can be weak link if it can't hold the production load.
Having in mind that you posted this while ago, I'd be grateful to hear what was your solution regarding the whole Django-Scrapy-Scrapyd integration.
Cheers