My experience with Scrapy is limited, and each time I use it, it's always through the terminal's commands. How can I get my form data (a URL to be scraped) from my Django template to communicate with Scrapy and start scraping? So far, the only approach I've thought of is to get the form's returned data in Django's views and then try to reach into the spider.py in Scrapy's directory to add the form data's URL to the spider's start_urls. From there, I don't really know how to trigger the actual crawling, since I'm used to doing it strictly through my terminal with commands like "scrapy crawl dmoz". Thanks.
tiny edit: Just discovered scrapyd... I think I may be headed in the right direction with this.
You've actually answered it with your edit. The best option would be to set up the scrapyd service and make an API call to its schedule.json endpoint to trigger a scraping job.
To make that HTTP API call, you can either use urllib2/requests, or use python-scrapyd-api, a wrapper around the scrapyd API:
from scrapyd_api import ScrapydAPI

scrapyd = ScrapydAPI('http://localhost:6800')  # the default scrapyd address
scrapyd.schedule('project_name', 'spider_name')  # schedules a run and returns the job id
If we put aside scrapyd and try to run the spider from the view, it will block the request until the Twisted reactor stops - therefore, it is not really an option.
You can, though, use celery (in tandem with django_celery): define a task that runs your Scrapy spider and call that task from your Django view. This way, you put the task on the queue and don't have a user waiting for the crawl to finish.
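A rough sketch of that idea (the project path, spider name, and start_url argument are placeholders, and the spider is assumed to accept a start_url argument):

# tasks.py
import subprocess
from celery import shared_task

@shared_task
def run_spider(url):
    # Run the crawl in its own process so the Twisted reactor never
    # blocks or conflicts with the Django process.
    subprocess.check_call(
        ['scrapy', 'crawl', 'spider_name', '-a', 'start_url=' + url],
        cwd='/path/to/scrapy/project',  # the directory containing scrapy.cfg
    )

# views.py
from django.http import HttpResponse
from .tasks import run_spider

def submit(request):
    run_spider.delay(request.POST['url'])  # queue the crawl and return right away
    return HttpResponse('Crawl scheduled')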
Also, take a look at the django-dynamic-scraper package:
Django Dynamic Scraper (DDS) is an app for Django built on top of the
scraping framework Scrapy. While preserving many of the features of
Scrapy it lets you dynamically create and manage spiders via the
Django admin interface.
Related
I'm developing a web crawler in Python using the Django framework. I want it to work like a web app, meaning that if I open it in two different browser tabs, they should work independently, each having its own data (crawled + queued links). Both of them should start crawling from separate URLs and continue their work.
Currently I have designed a very simple version of it. It works in one tab, but does not work in another browser tab. I have even tried opening a new Chrome window, but got the same results.
I'm not sure what feature or library I should use for that purpose. Can somebody help me?
You can pass some key in the URL:
URL pattern: <your_domain>/crawled/<key>
You can then open each URL in a different tab:
TAB1: <your_domain>/crawled/abcd
TAB2: <your_domain>/crawled/xyz
Or you can send some key in request.GET.
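For example, a minimal sketch of such a URL pattern in Django (the path, view, and key names are just placeholders); each tab then gets its own key and therefore its own crawl state:

# urls.py
from django.urls import path
from . import views

urlpatterns = [
    path('crawled/<str:key>/', views.crawl, name='crawl'),
]

# views.py
def crawl(request, key):
    # Use `key` to look up (or create) this tab's own crawl data,
    # so each tab works on its own queue of links.
    ...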
I would create a default page for your app which is a form to accept one or more URLs to crawl.
When the 'submit' button is pressed, the list of URLs is stored in the database and a background process, using something such as celery, works through the queue of URLs.
You don't say anything about how the results of the crawl are to be stored/presented, so I'm assuming you just want to kickstart the crawl and the pages are stored in some way by the code crawling the sites - with no response sent to the web page.
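As a minimal sketch of that queue (all names here are made up):

# models.py
from django.db import models

class CrawlRequest(models.Model):
    url = models.URLField()
    processed = models.BooleanField(default=False)
    created = models.DateTimeField(auto_now_add=True)

# views.py -- store the submitted URLs; the background worker later
# picks up the unprocessed rows and crawls them
from django.http import HttpResponse
from .models import CrawlRequest

def submit(request):
    for url in request.POST.getlist('urls'):
        CrawlRequest.objects.create(url=url)
    return HttpResponse('URLs queued for crawling')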
I need to write a unit test for a Scrapy spider. The problem is, the only way I know of to call a Scrapy spider programmatically is through scrapy.crawler.CrawlerProcess, which spins up the whole Twisted reactor, the crawling engine, and so on. For a simple unit test that is massive overkill.
What I want to do is simply create a request, load project settings somehow, send it and process the response.
Is there a way to do it properly?
EDIT.
I checked Scrapy Unit Testing, but the whole point of the test is to check how some XPaths in the database map onto the current state of the website. I'm OK with online testing; I actually need it.
(Then it's becoming more like an integration test, but whatever.)
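One possible sketch of that: fetch the live page yourself, wrap it in an HtmlResponse, and feed it straight to the spider's callback (the spider module and class are placeholders here):

import requests
from scrapy.http import HtmlResponse, Request
from scrapy.utils.project import get_project_settings
from myproject.spiders.dmoz import DmozSpider  # placeholder spider

def test_parse_live_page():
    settings = get_project_settings()  # picks up the project's settings module, if needed
    spider = DmozSpider()
    url = 'http://www.dmoz.org/'
    html = requests.get(url).text
    response = HtmlResponse(url=url, body=html, encoding='utf-8',
                            request=Request(url=url))
    results = list(spider.parse(response))  # items/requests yielded by the callback
    assert results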
I have a basic Django web application running on Heroku. I would like to add a spider to crawl some sites (e.g. with Scrapy) based on a scheduled task (e.g. via APScheduler) to get some tables of the Django database loaded with the collected data.
Does anybody know of documentation or examples covering the basics of this kind of integration? I'm finding it very hard to figure out.
I have not used Scrapy at all, but I'm actually working with APScheduler and it's very simple to use. So my first guess would be to use a BackgroundScheduler (inside your Django app) and add a job to it that executes a callable "spider" periodically.
The thing here is how you could embed a Scrapy project inside your Django app so that you can access one of its "spiders" and effectively use it as a callable in your scheduled job.
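As a rough sketch under those assumptions (the spider name, schedule, and project path are placeholders; the spider runs in its own process so the Twisted reactor stays out of the Django process):

import subprocess
from apscheduler.schedulers.background import BackgroundScheduler

def run_spider():
    # Shell out to the Scrapy CLI from the directory that holds scrapy.cfg,
    # so the crawl runs as a separate process.
    subprocess.check_call(['scrapy', 'crawl', 'myspider'],
                          cwd='/path/to/scrapy/project')

scheduler = BackgroundScheduler()
scheduler.add_job(run_spider, 'interval', hours=6)  # e.g. crawl every six hours
scheduler.start()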
I'm maybe not helping much, but I'm just trying to give you some orientation to kickstart things. I'm pretty sure that if you carefully read Scrapy's documentation you'll make your way.
Best.
I'm VERY new to Python and I'm attempting to integrate Scrapy with Django.
Here is what I'm trying to make happen:
User submits URL to be scraped
URL is scraped
Scraped data is returned in screen to user
User assigns attributes (if necessary), then saves the data to the database.
What is the best way to accomplish this? I've played with Django Dynamic Scraper, but I think I'm better off maintaining control over Scrapy for this.
Holding a Django request open while scraping another website may not be the best idea; this flow is better done asynchronously. In other words, release the Django request and have another process handle the scraping. I guess it's not an easy thing to achieve for newcomers, but try to bear with me.
The flow should look like this:
The user submits a request to scrape some data from another website.
The spider crawl starts in a different process than Django; the user's request is released.
The spider pipelines items to some data store (database).
The user keeps asking for that data; Django updates the user based on the data inserted into the data store.
Kicking off a Scrapy spider can be done by launching it straight from Python code with a tool like celery (also see django and celery), by launching it in a new process using Python's subprocess, or, even better, by using scrapyd to manage those spiders.
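A rough sketch of that submit-then-poll flow, using scrapyd to schedule the job (the project, spider, model, and field names are all placeholders, and the spider's pipeline is assumed to save items tagged with the job id):

# views.py
from django.http import JsonResponse
from scrapyd_api import ScrapydAPI
from .models import ScrapedItem  # hypothetical model filled by the spider's pipeline

scrapyd = ScrapydAPI('http://localhost:6800')

def start_crawl(request):
    url = request.POST['url']
    # Extra keyword arguments are passed to the spider as spider arguments.
    job_id = scrapyd.schedule('project_name', 'spider_name', url=url)
    return JsonResponse({'job_id': job_id})

def crawl_results(request, job_id):
    # The client polls this endpoint until the pipeline has inserted rows.
    items = list(ScrapedItem.objects.filter(job_id=job_id).values())
    return JsonResponse({'items': items})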
I have many tasks to be done via Scrapy on a production server.
My manager wants to add or remove the URLs to scrape, and he wants a web interface for that.
I am thinking of making a web app for that.
I have found this link
https://github.com/holgerd77/django-dynamic-scraper/
I just want to know whether I can use that in production, or whether I should call Scrapy manually in my Django app and not use that package at all.
I have tried it and it looks alright to me. They also have good documentation written there.
It is good if you are new and want to get things going, but I think once you get to know Scrapy and Django in detail, you almost don't need it.
Python modules, including Scrapy, are most of the time plug-and-play, so yes, you should be able to do it with Scrapy itself. And you will probably benefit from clearer documentation.