I have a basic Django web application running on Heroku. I would like to add a spider to crawl some websites (e.g., with Scrapy) on a scheduled task (e.g., via APScheduler) and load some tables of the Django database with the collected data.
Does anybody know of documentation or examples covering the basics of this kind of integration? I'm finding it very hard to figure out.
I have not used Scrapy at all, but I'm actually working with APScheduler and it's very simple to use. So my first guess would be to use a BackgroundScheduler (inside your Django app) and add a job to it that would execute a callable "spider" periodically.
The tricky part is how to embed a Scrapy project inside your Django app so you can access one of its spiders and effectively use it as a callable in your scheduled job.
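For illustration only, here is a rough sketch of that idea, assuming the spider is launched through the scrapy CLI in a subprocess (the spider name and project path are placeholders), so the Twisted reactor never runs inside the Django process:

import subprocess

from apscheduler.schedulers.background import BackgroundScheduler


def run_spider():
    # Launch the spider via the scrapy CLI; "my_spider" and the
    # project path are hypothetical and must match your own setup.
    subprocess.run(
        ["scrapy", "crawl", "my_spider"],
        cwd="/path/to/scrapy_project",
        check=True,
    )


scheduler = BackgroundScheduler()
scheduler.add_job(run_spider, "interval", hours=1)  # run once per hour
scheduler.start()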
I may not be helping much, but I'm just trying to give you a kickstart. I'm pretty sure that if you read Scrapy's documentation carefully, you'll find your way.
Best.
Related
Long story short, I need to call a Python script from a Celery worker using subprocess. This script interacts with a REST API. I would like to avoid hard-coding the URLs, and Django's reverse seems like a nice way to do that.
Is there a way to use reverse outside of Django while avoiding the following error?
django.core.exceptions.AppRegistryNotReady: Apps aren't loaded yet.
I would prefer something with low overhead.
As documented:
If you’re using components of Django “standalone” – for example, writing a Python script which loads some Django templates and renders them, or uses the ORM to fetch some data – there’s one more step you’ll need in addition to configuring settings.
After you’ve either set DJANGO_SETTINGS_MODULE or called configure(), you’ll need to call django.setup() to load your settings and populate Django’s application registry.
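A minimal sketch of that standalone setup, assuming a project called myproject and a URL pattern named some-view (both hypothetical):

import os

import django

# Point at your settings module, then populate the app registry.
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")
django.setup()

# reverse() now works outside of a request/response cycle.
from django.urls import reverse

print(reverse("some-view"))  # hypothetical URL name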
I am using a custom management command to boot my Django app from external scripts. It is like driving a screw with a hammer, but the setup is fast and it takes care of pretty much everything.
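For reference, a minimal sketch of such a command (all names hypothetical); saved as myapp/management/commands/bootstrapped.py, it runs via python manage.py bootstrapped with the full Django environment initialized:

from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = "Run external-script logic with the full Django app loaded."

    def handle(self, *args, **options):
        # By the time handle() runs, settings are configured and the app
        # registry is populated, so the ORM, reverse(), etc. all work.
        self.stdout.write("App registry ready.")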
I'm writing a web application with the Django framework that exports data from a MySQL database to a .csv file. I can get the data, but it takes a long time to complete. I don't want to wait for the export to finish; I want to move on to the next activity while the data is still being fetched from the database in the background.
My setup: Python 3.6.5, Django 2.1
Thank you for helping me.
Celery is a Python package that helps you perform asynchronous tasks in Django. You can refer to First Steps with Celery to get started. I have also covered a few setup issues and their solutions in this post.
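As an illustration only, a sketch of how such an export could be offloaded to a Celery task (the Record model and its fields are made up); the view calls export_to_csv.delay(path) and returns immediately while a worker writes the file:

import csv

from celery import shared_task


@shared_task
def export_to_csv(path):
    from myapp.models import Record  # hypothetical model

    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "name"])
        # iterator() streams rows instead of loading the whole table at once
        for record in Record.objects.iterator():
            writer.writerow([record.id, record.name])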
I would like to know the fastest way to turn a simple Python script into a basic web app.
For example, say I would like to create a web app that takes a keyword from the user and displays the most retweeted tweet on Twitter. If I write a Python script capable of performing that task using Twitter's API, how would I go about turning it into a web app for people to access?
I have looked at frameworks such as Django, but it would take me weeks or months to learn how to use it. I just need something quick and simple. Any such alternatives?
Make a CGI script out of it. You basically get the request information from the webserver via environment variables and print the desired HTML to stdout. Helper libraries such as Werkzeug can abstract away the handling of the environment variables by wrapping them in a Request object.
This technique is quite outdated and isn't normally used nowadays, as the script has to be started on every request and thus incurs the startup cost each time.
Nevertheless, this may actually be a good solution for you because it is quick and every webserver supports it.
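A minimal sketch of such a CGI script, using only the standard library:

#!/usr/bin/env python3
import html
import os

# The webserver passes request data through environment variables;
# the response is whatever the script prints to stdout.
query = os.environ.get("QUERY_STRING", "")

print("Content-Type: text/html")
print()  # a blank line separates the headers from the body
print("<html><body><p>You asked for: %s</p></body></html>" % html.escape(query))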
My experience with Scrapy is limited, and each time I use it, it's always through the terminal's commands. How can I get my form data (a URL to be scraped) from my Django template to communicate with Scrapy and start scraping? So far, all I've thought of is getting the form's returned data in Django's views and then trying to reach into spider.py in Scrapy's directory to add the form data's URL to the spider's start_urls. From there, I don't really know how to trigger the actual crawling, since I'm used to doing it strictly through my terminal with commands like "scrapy crawl dmoz". Thanks.
tiny edit: Just discovered scrapyd... I think I may be headed in the right direction with this.
You've actually answered it with your edit. The best option would be to set up a scrapyd service and make an API call to schedule.json to trigger a scraping job.
To make that HTTP API call, you can either use urllib2/requests directly, or use python-scrapyd-api, a wrapper around the scrapyd API:
from scrapyd_api import ScrapydAPI

scrapyd = ScrapydAPI('http://localhost:6800')  # scrapyd listens on port 6800 by default
scrapyd.schedule('project_name', 'spider_name')  # schedules a crawl and returns the job id
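For comparison, the same call without the wrapper, POSTing directly to scrapyd's schedule.json endpoint with requests:

import requests

# project_name/spider_name are the same placeholders as above
requests.post(
    "http://localhost:6800/schedule.json",
    data={"project": "project_name", "spider": "spider_name"},
)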
If we put scrapyd aside and try to run the spider from the view, it will block the request until the Twisted reactor stops, so it is not really an option.
You can, though, use Celery (in tandem with django_celery): define a task that runs your Scrapy spider and call that task from your Django view. This way, you put the task on a queue and the user is not left waiting for the crawl to finish.
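A rough sketch of that pattern, assuming the spider is started through the scrapy CLI in a subprocess (the spider name, its start_url argument, and the project path are all placeholders):

import subprocess

from celery import shared_task


@shared_task
def crawl(spider_name, url):
    # -a passes the URL as a spider argument; your spider's __init__
    # must accept a start_url keyword for this to work.
    subprocess.run(
        ["scrapy", "crawl", spider_name, "-a", "start_url=%s" % url],
        cwd="/path/to/scrapy_project",  # hypothetical project path
        check=True,
    )

# In the view: crawl.delay("dmoz", form.cleaned_data["url"])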
Also, take a look at the django-dynamic-scraper package:
Django Dynamic Scraper (DDS) is an app for Django built on top of the scraping framework Scrapy. While preserving many of the features of Scrapy, it lets you dynamically create and manage spiders via the Django admin interface.
I have many tasks to be done via Scrapy on a production server.
My manager wants to be able to add or remove the URLs to scrape, and he wants a web interface.
I am thinking of making a web app for that.
I have found this link
https://github.com/holgerd77/django-dynamic-scraper/
I just want to know: can I use it in production, or can I call Scrapy manually in my Django app and skip that package?
I have tried it and it looks alright to me. They also have good documentation.
It is good if you are new and want to get things going, but I think once you get to know Scrapy and Django in detail, you almost don't need it.
Python modules, including Scrapy, are plug-and-play most of the time, so yes, you should be able to do it with Scrapy itself. And you will probably benefit from its clearer documentation.
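For example, a minimal sketch of running a spider from plain Python instead of the terminal (the spider name is hypothetical); note that CrawlerProcess starts the Twisted reactor, so this belongs in a standalone script or management command rather than inside a web request:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Reuse the Scrapy project's own settings.py
process = CrawlerProcess(get_project_settings())
process.crawl("my_spider")  # the spider's `name` attribute
process.start()  # blocks until the crawl finishes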