After years of reluctantly coding scrapers as a mish-mash of regexes, BeautifulSoup, and the like, I found Scrapy, which I pretty much count as this year's Christmas present to myself! It is natural to use, and it seems to have been built to make practically everything elegant and reusable.
But I am in a situation I am not sure how to tackle: my spider crawls and scrapes a listing page A, from which I generate a set of items. But for each item, I need to fetch a distinct complementary link (constructed from some of the scraped information, but not explicitly a link on the page which Scrapy could follow) to obtain additional information.
My question is in two parts: what is the protocol to fetch a URL outside of the crawling process? And how do I build items from several sources in an elegant way?
This has partially been asked (and answered) in a previous question on StackOverflow. But I am more interested in what the philosophy of Scrapy is supposed to be in this use case, which is surely not an unforeseen possibility. I wonder if this is one of the things Pipelines are meant for (adding information from a secondary source deduced from the primary information is an instance of "post-processing"), but what is the best way to do it without completely messing up Scrapy's efficient asynchronous organization?
what is the protocol to fetch a URL outside of the crawling process?
When you create a Request, it doesn't matter where its URL came from: you can extract it from the page, or build it in some other way.
how do I build items from several sources in an elegant way?
Use Request.meta
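For example, here is a minimal sketch of that pattern (the listing URL, selectors, and field names are made up for illustration): the first callback scrapes the listing page, builds a partial item, constructs the complementary URL from the scraped data, and passes the item along in meta; the second callback completes and yields it.

import scrapy

class ListingSpider(scrapy.Spider):
    name = "listing"
    start_urls = ["http://example.com/listing"]  # hypothetical listing page A

    def parse(self, response):
        for row in response.css("div.item"):  # hypothetical selector
            item = {
                "title": row.css("a::text").get(),
                "ref": row.css("::attr(data-ref)").get(),
            }
            # The complementary URL is built from scraped data, not taken from the page
            detail_url = "http://example.com/details/%s" % item["ref"]
            # Pass the partially built item along with the new request
            yield scrapy.Request(detail_url, callback=self.parse_detail,
                                 meta={"item": item})

    def parse_detail(self, response):
        # Retrieve the partial item and complete it from the secondary source
        item = response.meta["item"]
        item["extra"] = response.css("span.extra::text").get()
        yield item

Recent Scrapy versions also provide Request.cb_kwargs for passing user data to a callback, which keeps meta free for middleware use; the idea is the same.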
Related
How would you go about crawling a website so that you can index every page when there is only really a search bar for navigation, as on the following sites?
https://plejehjemsoversigten.dk/
https://findadentist.ada.org/
Do people just brute force the search queries, or is there a method that's usually implemented to index these kinds of websites?
There could be several ways to approach your issue (although if the owner of a resource does not want it to be crawled, that might be really challenging):
Check the robots.txt of the resource. It might give you a clue about the site structure.
Check the sitemap.xml of the resource. It might give the URLs of the pages the resource owner wishes to be public.
Use alternative indexers (like Google) with advanced syntax narrowing the scope of the search to a particular site (like site:your.domain).
Exploit gaps in the site design. For example, the first site on your list does not enforce a minimum search-string length, so you can search for, say, a and get 800 results containing a, then repeat with the remaining letters (see the sketch after this list).
Once you have search results, also crawl all the links on the result item pages, since related pages are often listed there.
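As a rough sketch of the letter-by-letter idea in Scrapy (the search URL, query parameter, and result selector are hypothetical; inspect the request your browser actually sends when you search):

import string
import scrapy

class SearchBruteForceSpider(scrapy.Spider):
    name = "search_bruteforce"
    # Hypothetical search endpoint; replace with the site's real one
    search_url = "https://example.com/search?q={}"

    def start_requests(self):
        # Query one letter at a time to sweep most of the index
        for letter in string.ascii_lowercase:
            yield scrapy.Request(self.search_url.format(letter))

    def parse(self, response):
        # Scrapy's built-in dupefilter drops result URLs repeated across queries
        for href in response.css("a.result::attr(href)").getall():
            yield response.follow(href, callback=self.parse_result)

    def parse_result(self, response):
        yield {"url": response.url, "title": response.css("h1::text").get()}

Whether this is acceptable depends on the site's terms and on how heavily you hit it, so throttle accordingly (for example with Scrapy's DOWNLOAD_DELAY or AutoThrottle).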
I have a list of approximately 52 websites, which lead to approximately 150 webpages that I need to scrape. Out of ignorance and lack of research, I started building one crawler per webpage, which is becoming too difficult to complete and maintain.
Based on my analysis so far, I already know what information I want to scrape from each webpage, and it is clear that each website has its own structure. On the plus side, I noticed that each website's webpages share some commonalities in structure.
My million-dollar question: is there a single technique or a single web crawler that I can use to scrape these sites? I already know the information that I want, these sites are rarely updated in terms of their structure, and most of these sites have documents that need to be downloaded.
Alternatively, is there a better solution that will reduce the number of web crawlers I need to build? Additionally, these crawlers will only be used to download new information from the websites I aim them at.
[…] I started building one crawler per webpage, which is becoming too difficult to complete and maintain […] it is clear that each website has its own structure. […] these sites are rarely updated in terms of their structure […]
If websites have different structures, having separate spiders makes sense, and should make maintenance easier in the long term.
You say that completing new spiders (I assume you mean developing them, not crawling or something else) is becoming difficult. However, if a new spider is similar to an existing one, you can simply copy the most similar existing spider and make only the necessary changes.
Maintenance should be easiest with separate spiders for different websites. If a single website changes, you can fix the spider for that website. If you have a spider for multiple websites, and only one of them changes, you need to make sure that your changes for the modified website do not break the rest of the websites, which can be a nightmare.
Also, since you say website structures do not change often, maintenance should not be that hard in general.
If you notice you are repeating a lot of code, you might be able to extract some shared code into a spider middleware, a downloader middleware, an extension, an item loader, or even a base spider class shared by two or more spiders. But I would not try to use a single Spider subclass to scrape multiple different websites that are likely to evolve separately.
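For instance, a rough sketch of the shared-base-class idea (the site URLs and selectors are invented; each subclass only supplies what differs):

import scrapy

class DocumentSiteSpider(scrapy.Spider):
    """Shared logic: walk a listing page and yield one item per document link."""
    # Subclasses define start_urls and these two selectors
    listing_selector = None
    document_selector = None

    def parse(self, response):
        for entry in response.css(self.listing_selector):
            yield {
                "source": self.name,
                "page": response.url,
                "document_url": response.urljoin(
                    entry.css(self.document_selector).get()),
            }

class SiteASpider(DocumentSiteSpider):
    name = "site_a"
    start_urls = ["https://site-a.example/documents"]  # hypothetical
    listing_selector = "div.doc-row"
    document_selector = "a.download::attr(href)"

class SiteBSpider(DocumentSiteSpider):
    name = "site_b"
    start_urls = ["https://site-b.example/files"]  # hypothetical
    listing_selector = "li.file"
    document_selector = "a::attr(href)"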
I suggest you extract specific tags such as body, h1 through h6, and p for each link. You can gather all the p tags of a page and store them against that link, and do the same for any other tags you want to extract. You can also store the related links found in those tags in your database.
I'm working on a web crawler (using scrapy) that uses 2 different spiders:
Very generic spider that can crawl (almost) any website using a bunch of heuristics to extract data.
Specialized spider capable of crawling a particular website A that can't be crawled with the generic spider because of the website's peculiar structure (and that website has to be crawled).
Everything works nicely so far, but website A contains links to other, "ordinary" websites that should be scraped too (using spider 1). Is there a Scrapy way to pass the request to spider 1?
Solutions I thought about:
Moving all functionality to spider 1. But that might get really messy; spider 1's code is already very long and complicated, and I'd like to keep this functionality separate if possible.
Saving the links to a database, as was suggested in Pass scraped URL's from one spider to another.
Is there a better way?
I have met such a case, with one spider retrieving the URL addresses from a first page and a second spider being called from there to operate on them.
I don't know what your control flow is, but depending on it, I would either call the first spider just in time when scraping a new URL, or after scraping all possible URLs.
Do you have the case where spider 2 can retrieve URLs for the very same website? In that case, I would store all URLs, sort them into lists in a dict keyed by spider, and loop until there are no new elements left in the lists to explore. That is better, in my opinion, because it is more flexible.
Calling just in time might be OK, but depending on your flow, it could hurt performance, as repeated calls to the same functions will probably waste a lot of time on initialization.
You might also want to make the analytical functions independent of the spiders, so that they are available to both as you see fit. If your code is very long and complicated, this might help make it lighter and clearer. I know it is not always avoidable, but it might be worth a try, and you might end up with more efficient code.
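For example, a sketch of pulling the shared analysis into a plain module that both spiders import (module and function names are illustrative):

# extractors.py -- plain functions, independent of any spider
def extract_title(response):
    # Expects a Scrapy Response; returns the page's main heading text
    return (response.css("h1::text").get() or "").strip()

def extract_links(response):
    # Absolute URLs for every anchor on the page
    return [response.urljoin(href)
            for href in response.css("a::attr(href)").getall()]

# In either spider's callback:
#     from extractors import extract_title, extract_links
#     def parse(self, response):
#         yield {"title": extract_title(response)}
#         for url in extract_links(response):
#             yield scrapy.Request(url, callback=self.parse)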
I'm new to scraping. I've written a scraper which scrapes the Maplin store. I used Python as the language and BeautifulSoup to scrape the store.
If I need to scrape some other eCommerce store (say Amazon or Flipkart), do I need to customize my code, since they have different HTML schemas (id and class names are different, among other things)? As it stands, the scraper I wrote will not work for another eCommerce store.
I want to know how price-comparison sites scrape data from all the online stores: do they have different code for each online store, or is there a generic one? Do they study the HTML schema of every online store?
do I need to customize my code
Yes, sure. It is not only because the websites have different HTML schemas. It is also about the mechanisms involved in loading and rendering the page: some sites use AJAX to load partial content, others let JavaScript fill out placeholders on the page, which makes it harder to scrape; there can be lots and lots of differences. Others use anti-scraping techniques: they check your headers and behavior, and ban you after hitting the site too often, etc.
I've also seen cases where prices were kept as images, or obfuscated with "noise": different tags nested inside one another and hidden using various techniques, such as CSS rules, classes, JS code, "display: none", etc. For an end user in a browser the data looked normal, but for a web-scraping "robot" it was a mess.
I want to know how price-comparison sites scrape data from all the online stores?
Usually, they use APIs whenever possible. But, if not, web-scraping and HTML parsing is always an option.
The general high-level idea is to split the scraping code into two main parts. The static part is a generic web-scraping spider (the logic) that reads the parameters or configuration passed to it. The dynamic part, an annotator or website-specific configuration, is usually a set of field-specific XPath expressions or CSS selectors.
See, as an example, Autoscraping tool provided by Scrapinghub:
Autoscraping is a tool to scrape web sites without any programming knowledge. You just annotate web pages visually (with a point and click tool) to indicate where each field is on the page and Autoscraping will scrape any similar page from the site.
And, FYI, study what Scrapinghub offers and documents - there is a lot of useful information and a set of different unique web-scraping tools.
I've personally been involved in a project where we were building a generic Scrapy spider. As far as I remember, we had a "target" database table where records were inserted by a browser extension (an annotator); field annotations were kept in JSON:
{
    "price": "//div[@class='price']/text()",
    "description": "//div[@class='title']/span[2]/text()"
}
The generic spider received a target id as a parameter, read the configuration, and crawled the website.
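A rough reconstruction of what such a generic spider can look like (here the configuration is read from a local JSON file keyed by target id rather than a database table, and the names are placeholders):

import json
import scrapy

class GenericSpider(scrapy.Spider):
    name = "generic"

    def __init__(self, target_id=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # The real project stored this in a database; a JSON file stands in here
        with open("targets.json") as f:
            target = json.load(f)[target_id]
        self.start_urls = target["start_urls"]
        self.fields = target["fields"]  # field name -> XPath expression

    def parse(self, response):
        item = {}
        for field, xpath in self.fields.items():
            item[field] = response.xpath(xpath).get()
        yield item

It would then be launched with something like scrapy crawl generic -a target_id=some_id, since spider arguments passed with -a end up as constructor keyword arguments.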
We had a lot of problems staying on the generic side. Once a website involved JavaScript and AJAX, we started to write site-specific logic to get to the desired data.
See also:
Creating a generic scrapy spider
Using one Scrapy spider for several websites
What is the best practice for writing maintainable web scrapers?
Many price-comparison scrapers do the product search on the vendor site when a user indicates they wish to track the price of something. Once the user selects what they are interested in, it is added to a global cache of products that can then be scraped periodically, rather than having to trawl the whole site on a frequent basis.
I am using the urllib library to fetch pages. Typically I have the top-level domain name & I wish to extract some information from EVERY page within that domain. Thus, if I have xyz.com, I'd like my code to fetch the data from xyz.com/about etc. Here's what I am using:
import urllib,re
htmlFile = urllib.urlopen("http://www.xyz.com/"+r"(.*)")
html = htmlFile.read()
...............
This does not do the trick for me, though. Any ideas are appreciated.
Thanks.
-T
I don't know why you would expect domain.com/(.*) to work. You need to have a list of all the pages (dynamic or static) within that domain. Your python program cannot automatically know that. This knowledge you must obtain from elsewhere, either by following links or looking at the sitemap of the website.
As a footnote, scraping is a slightly shady business. Always make sure, no matter what method you employ, that you are not violating any terms and conditions.
You are trying to use a regular expression on the web server. It turns out web servers don't actually support that kind of pattern, so the request fails.
To do what you're trying to do, you need to implement a spider: a program that downloads a page, finds all the links within it, and decides which of them to follow; it then downloads each of those pages and repeats.
Some things to watch out for: loops, multiple links that end up pointing at the same page, links going outside of the domain, and getting banned from the web server for spamming it with thousands of requests.
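A minimal sketch of such a spider using only the standard library (Python 3 here; the original question uses the Python 2 urllib, but the idea is the same):

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect every href found in anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=100):
    domain = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}                       # avoids loops and duplicate pages
    while queue and len(seen) <= max_pages:  # rough cap so the crawl stops eventually
        url = queue.popleft()
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except Exception:
            continue
        yield url, html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            # Stay inside the original domain and skip pages already queued
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)

# for url, html in crawl("http://www.xyz.com/"):
#     ...extract the data you need from html...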
In addition to @zigdon's answer, I recommend you take a look at the Scrapy framework.
CrawlSpider will help you implement crawling quite easily.
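For example, a minimal CrawlSpider sketch that walks an entire domain (the domain and item fields are placeholders):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SiteSpider(CrawlSpider):
    name = "site"
    allowed_domains = ["xyz.com"]           # keeps the crawl inside the domain
    start_urls = ["http://www.xyz.com/"]

    # Follow every internal link; call parse_page on each fetched page
    rules = (
        Rule(LinkExtractor(), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }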
Scrapy has this functionality built in: there is no need to recursively collect links yourself, and it handles all the heavy lifting for you asynchronously. Just specify your domain, what you are looking for, and how deep you want it to search, i.e. the whole site.
http://doc.scrapy.org/en/latest/index.html