I'm VERY new to Python and I'm attempting to integrate Scrapy with Django.
Here is what I'm trying to make happen:
User submits URL to be scraped
URL is scraped
Scraped data is returned on screen to the user
User assigns attributes (if necessary), then saves it to the database.
What is the best way to accomplish this? I've played with Django Dynamic Scraper, but I think I'm better off maintaining control over Scrapy for this.
Holding the Django request open while scraping another website may not be the best idea; this flow is better done asynchronously. That means releasing the Django request and having another process handle the scraping. I realise that's not an easy thing to achieve for newcomers, but try to bear with me.
The flow should look like this:
user submits a request to scrape some data from another website
the spider crawl starts in a process separate from Django, and the user's request is released
the spider pipelines items to some data store (a database)
the user polls for that data, and Django updates the user based on what has been inserted into the data store
Firing a Scrapy spider can be done by launching it straight from Python code, by using a tool like Celery (see also django and celery), by launching it in a new process using Python's subprocess, or, even better, by using scrapyd to manage those spiders. A minimal sketch of the "straight from Python code" option is shown below.
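To make that first option concrete, here is a minimal sketch using Scrapy's CrawlerProcess. The project and spider names are placeholders, and note that process.start() blocks until the crawl finishes, which is exactly why it should run somewhere other than the Django request:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from myproject.spiders.myspider import MySpider  # placeholder project/spider names

# build a crawler process using the project's own settings.py
process = CrawlerProcess(get_project_settings())
process.crawl(MySpider, start_url="http://example.com")  # kwargs are passed to the spider
process.start()  # blocks until the crawl is finished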
Related
I'm developing a web crawler in Python using the Django framework. I want it to work like a web app: if I open it in two different browser tabs, each tab should work independently, with its own data (crawled + queued links). Both should start crawling from separate URLs and continue their work separately.
Currently I have designed a very simple version of it. It works in one tab, but does not work in another browser tab. I have even tried opening a new Chrome window, with the same result.
I'm not sure what feature or library I should use for this purpose. Can somebody help me?
You can pass some key in the URL:
URL pattern: <your_domain>/crawled/<key>
You can then open each URL in a different tab:
TAB1: <your_domain>/crawled/abcd
TAB2: <your_domain>/crawled/xyz
Or you can send some key in request.GET.
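A rough sketch of what that routing might look like; the view and parameter names are placeholders:

# urls.py
from django.urls import path
from . import views

urlpatterns = [
    # each browser tab uses its own key, e.g. /crawled/abcd and /crawled/xyz
    path("crawled/<str:key>/", views.crawled, name="crawled"),
]

# views.py
from django.http import HttpResponse

def crawled(request, key):
    # use `key` to keep this tab's crawl state (queued + crawled links) separate;
    # alternatively, read a key from the query string: request.GET.get("key")
    return HttpResponse(f"crawl state for session {key}")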
I would create a default page for your app which is a form that accepts one or more URLs to crawl.
When the 'submit' button is pressed, the list of URLs is stored in the database and a background process, using something such as Celery, works through the queue of URLs.
You don't say anything about how the results of the crawl are to be stored or presented, so I'm assuming you just want to kick off the crawl, with the pages stored in some way by the crawling code and no response sent back to the web page. A rough sketch of that flow follows.
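This is only a sketch under those assumptions; the CrawlRequest model and the crawl_urls Celery task are made-up names you would define yourself:

# views.py
from django.shortcuts import redirect, render
from .models import CrawlRequest   # a simple model holding a URL and a status
from .tasks import crawl_urls      # a Celery task that works through the queue

def submit_urls(request):
    if request.method == "POST":
        for url in request.POST.get("urls", "").splitlines():
            if url.strip():
                CrawlRequest.objects.create(url=url.strip(), status="queued")
        crawl_urls.delay()          # hand the queue to a background worker
        return redirect("submit_urls")
    return render(request, "submit_urls.html")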
I was trying to use Scrapy to scrape a website with about 70k items, but every time, after it has scraped about 200 items, this error pops up for the rest:
[scrapy] DEBUG: Ignoring response <404 http://www.somewebsite.com/1234>: HTTP status code is not handled or not allowed
I believe it is because my spider got blocked by the website. I tried using the random user agent suggested here, but it doesn't solve the problem at all. Are there any good suggestions?
If you're being blocked, your spider is probably hitting the site too often or too fast.
In addition to a random user agent, you can try setting the CONCURRENT_REQUESTS and DOWNLOAD_DELAY options in settings.py. The defaults are fairly aggressive and will hammer a site.
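For illustration, something along these lines in settings.py; the values are only a starting point and should be tuned for the target site:

# settings.py
CONCURRENT_REQUESTS = 2          # default is 16, which can hammer a small site
DOWNLOAD_DELAY = 3               # seconds to wait between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True  # vary the delay so the pattern looks less robotic

# AutoThrottle adjusts the delay based on how quickly the server responds
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60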
The other options you have are using proxies, or using AWS nano instances, which get a new IP on each reboot.
Remember that scraping is at best a gray area, and you absolutely need to respect the site owners. The best way is obviously to seek permission from the owner, but failing that you need to make sure your scraping efforts don't stand out from usual browsing patterns, or you'll get blocked in no time.
Some sites use fairly sophisticated techniques to identify scrapers, including cookies and JavaScript as well as request patterns, time on site, etc. There are also a number of cloud-based anti-scraping solutions such as Distil or ShieldSquare; if you're up against one of those, you'll need to put in a lot of effort to make your spider look human!
Can you force someone to answer your questions or give you information? Neither can you force a web server. At best you can try to impersonate a client that the web server will answer to. To do that you need to figure out the criteria the server uses to decide whether or not to answer the request, then you can (try to) form a request that will meet the criteria.
My experience with Scrapy is limited, and each time I use it, it's always through the terminal's commands. How can I get my form data (a URL to be scraped) from my Django template to communicate with Scrapy and start scraping? So far, all I've thought of is to get the form's returned data in Django's views and then try to reach into spider.py in Scrapy's directory to add the form data's URL to the spider's start_urls. From there, I don't really know how to trigger the actual crawling, since I'm used to doing it strictly through my terminal with commands like "scrapy crawl dmoz". Thanks.
tiny edit: Just discovered scrapyd... I think I may be headed in the right direction with this.
You've actually answered it with your edit. The best option would be to set up a scrapyd service and make an API call to schedule.json to trigger a scraping job.
To make that HTTP API call, you can either use urllib2/requests, or use a wrapper around the scrapyd API, python-scrapyd-api:
from scrapyd_api import ScrapydAPI

# point the wrapper at the scrapyd daemon (it listens on port 6800 by default)
scrapyd = ScrapydAPI('http://localhost:6800')

# schedule a crawl of 'spider_name' in the deployed 'project_name' project
scrapyd.schedule('project_name', 'spider_name')
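For comparison, here is roughly the same call made with plain requests against scrapyd's schedule.json endpoint; 'project_name' and 'spider_name' are placeholders for your own names:

import requests

resp = requests.post(
    "http://localhost:6800/schedule.json",
    data={"project": "project_name", "spider": "spider_name"},
)
print(resp.json())  # e.g. {"status": "ok", "jobid": "..."}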
If we put scrapyd aside and try to run the spider from the view, it will block the request until the Twisted reactor stops; therefore, it is not really an option.
You can, though, start using Celery (in tandem with django_celery): define a task that runs your Scrapy spider and call the task from your Django view. That way you put the work on a queue and the user isn't left waiting for the crawl to finish. A hedged sketch of such a task is below.
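One way to write that task is to run the spider as a subprocess, so the Twisted reactor never lives inside the Django/Celery process; "myspider", the project path and the url argument are placeholders:

# tasks.py
import subprocess
from celery import shared_task

@shared_task
def run_spider(url):
    # equivalent to running "scrapy crawl myspider -a url=..." in the terminal
    subprocess.check_call(
        ["scrapy", "crawl", "myspider", "-a", f"url={url}"],
        cwd="/path/to/scrapy/project",  # directory containing scrapy.cfg
    )

# in the Django view: run_spider.delay(form.cleaned_data["url"])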
Also, take a look at the django-dynamic-scraper package:
Django Dynamic Scraper (DDS) is an app for Django built on top of the
scraping framework Scrapy. While preserving many of the features of
Scrapy it lets you dynamically create and manage spiders via the
Django admin interface.
I have a basic Django web application running on Heroku. I would like to add a spider to crawl some sites (e.g. with Scrapy) on a scheduled task (e.g. via APScheduler), so that some tables in the Django database get loaded with the collected data.
Does anybody know of documentation or examples covering the basics of this kind of integration? I find it very hard to figure out.
I have not used Scrapy at all, but I'm actually working with APScheduler and it's very simple to use. So my first guess would be to use a BackgroundScheduler (inside your Django app) and add a job to it that executes a callable "spider" periodically.
The tricky part is how you could embed a Scrapy project inside your Django app so you can access one of its "spiders" and effectively use it as a callable in your scheduled job. One way around that, sketched below, is to have the scheduled job simply shell out to the Scrapy command line instead of embedding anything.
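A hedged sketch of that approach; "myspider" and the project path are placeholders, and the scheduler would need to be started exactly once when the Django app boots (for example from an AppConfig.ready() method):

import subprocess
from apscheduler.schedulers.background import BackgroundScheduler

def run_spider():
    # run the spider exactly as you would from the terminal
    subprocess.check_call(
        ["scrapy", "crawl", "myspider"],
        cwd="/path/to/scrapy/project",  # directory containing scrapy.cfg
    )

scheduler = BackgroundScheduler()
scheduler.add_job(run_spider, "interval", hours=6)  # crawl every six hours
scheduler.start()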
I'm maybe not helping much, but I'm just trying to give you some orientation to get started. I'm pretty sure that if you carefully read Scrapy's documentation you'll find your way.
Best.
Actually, I am confused by the terminology. I am studying Scrapy, and I think it's for crawling a website and extracting some data.
But I want to write Python programs that do the kinds of things an actual user does. I mean automating tasks.
E.g. go to www.myblah.com, find the cheapest product in some category, and if it is less than my preset amount, send me an email.
Now I don't know whether this type of thing comes under crawling or something else.
Can I do that in Scrapy, or are there other libraries for those kinds of tasks?
Scrapy is a framework that can be used to create a bot or a crawler (aka spider). A crawler is a specific kind of bot, but a bot isn't necessarily a crawler. Crawlers are defined by being designed to explore the graph of pages (nodes) and their embedded URLs (edges), although they may be restricted from following particular URLs.
Automating tasks is the work of a bot. Whether Scrapy will work for that depends on what information is needed and what actions have to be taken. Many sites are heavy on JavaScript these days, so if the bot can't execute JavaScript and correctly handle cookies, it may not be able to get the information it needs for its task. Some web automation tasks may even require a browser plug-in or GUI automation tools. A rough sketch of your price-check example in Scrapy is below.
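This is only a heavily hedged sketch of the idea: the URL, the CSS selectors and the e-mail details are all made up, a real site will differ, and if the site builds its prices with JavaScript this spider won't see them at all:

import smtplib
from email.message import EmailMessage

import scrapy

THRESHOLD = 20.00  # your preset amount

class CheapestProductSpider(scrapy.Spider):
    name = "cheapest_product"
    start_urls = ["http://www.myblah.com/some-category"]  # hypothetical category page

    def parse(self, response):
        # ".product .price" is an assumed selector; inspect the real page to find yours
        prices = [
            float(p.replace("$", ""))
            for p in response.css(".product .price::text").getall()
        ]
        if prices and min(prices) < THRESHOLD:
            self.send_mail(min(prices))

    def send_mail(self, price):
        msg = EmailMessage()
        msg["Subject"] = f"Cheapest product is now ${price:.2f}"
        msg["From"] = "bot@example.com"
        msg["To"] = "me@example.com"
        msg.set_content(f"Found a product under your ${THRESHOLD:.2f} limit.")
        with smtplib.SMTP("localhost") as server:  # assumes a local mail server
            server.send_message(msg)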