I am writing web spiders to scrape some products from websites using the Scrapy framework in Python.
I was wondering what the best practices are for calculating the coverage and missing items of the written spiders.
What I'm using right now is logging cases that I was unable to parse, or that raised exceptions.
As an example: when I expect a specific format for the price of a product or the address of a place and my regular expressions don't match the scraped strings, or when my XPath selectors for specific data return nothing.
Sometimes, when products are listed on one page or across multiple pages, I also use curl and grep to roughly estimate the number of products, but I was wondering if there are better practices to handle this.
The common approach is, yes, to use logging to log the error and exit the callback by returning nothing.
Example (product price is required):
loader = ProductLoader(ProductItem(), response=response)  # ProductLoader / ProductItem come from your project's items module
loader.add_xpath('price', '//span[@class="price"]/text()')
if not loader.get_output_value('price'):
    # Log the problem and bail out of the callback; nothing is yielded for this page.
    self.logger.error("Error fetching product price: %s", response.url)
    return
You can also use signals to catch and log all kinds of exceptions that happen while crawling, see:
how to process all kinds of exception in a scrapy project, in errback and callback?
This basically follows the "easier to ask for forgiveness than permission" principle: you let the spider fail, then catch and process the error in one particular place - a signal handler.
Other thoughts:
you can even place the response URLs and error tracebacks into a database for later review - this is still "logging", but in a structured manner that can be more convenient to go through later
a good idea might be to create custom exceptions to represent different crawling errors, for instance MissingRequiredFieldError or InvalidFieldFormatError, which you can raise when crawled fields haven't passed validation (see the sketch below)
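A minimal sketch of how these two ideas could be combined, assuming a hypothetical spider and field names; the signals.spider_error hookup is the standard Scrapy API, while the exception class and handler logic are only illustrative:

import scrapy
from scrapy import signals


class MissingRequiredFieldError(Exception):
    """Raised when a required field could not be extracted."""


class ProductSpider(scrapy.Spider):  # hypothetical spider name
    name = "products"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # spider_error fires for every exception raised in a callback.
        crawler.signals.connect(spider.handle_error, signal=signals.spider_error)
        return spider

    def parse(self, response):
        price = response.xpath('//span[@class="price"]/text()').get()
        if not price:
            raise MissingRequiredFieldError(f"price missing on {response.url}")
        yield {"price": price, "url": response.url}

    def handle_error(self, failure, response, spider):
        # Single place to log (or persist) the URL and traceback for later review.
        self.logger.error("Callback failed on %s: %s", response.url, failure.getTraceback())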
Related
I need to scrape data from a list of domains given in an Excel file.
The problem is that I need to scrape data from the original website (let's take for example https://www.lepetitballon.com) and data from similartech (https://www.similartech.com/websites/lepetitballon.com).
I want to scrape them at the same time so I can receive and format them together at the end; after that I'll just move on to the next domain.
Theoretically, I should just use 2 spiders in an asynchronous way with Scrapy?
Ideally you would want to keep spiders which scrape differently structured sites separate, that way your code will be a lot easier to maintain in the long run.
Theoretically, if for some reason you MUST parse them in the same spider, you could just collect the URLs you want to scrape and, based on the base path, invoke different parser callback methods. That being said, I personally cannot think of a reason why you would have to do that; even if the sites had the same structure, you could just reuse your scrapy.Item classes.
Scrapy uses the Twisted networking library for its internal networking tasks, and it provides settings to control concurrent requests.
Explained here: https://docs.scrapy.org/en/latest/topics/settings.html#concurrent-requests
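For example, a few of the documented concurrency settings in a project's settings.py (the values here are arbitrary):

# settings.py - illustrative values only
CONCURRENT_REQUESTS = 32            # global cap on concurrent requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # per-domain cap
DOWNLOAD_DELAY = 0.25               # optional politeness delay, in seconds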
Or you could use multiple spiders that are independent of each other, which is already explained in the Scrapy docs; this might be what you are looking for.
By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process using the internal API.
https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process
In terms of efficiency you could choose either approach, depending on your resources and requirements: a single spider with tuned concurrency settings can be good for limited resources while still giving decent speed, whereas running multiple independent spiders can be faster at the cost of higher resource consumption.
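If you do want both crawls in one process, a minimal sketch using Scrapy's documented CrawlerProcess API might look like this; the two spider classes are hypothetical placeholders:

import scrapy
from scrapy.crawler import CrawlerProcess


class OriginalSiteSpider(scrapy.Spider):   # hypothetical spider for the domain itself
    name = "original_site"

    def parse(self, response):
        yield {"source": "original", "url": response.url}


class SimilarTechSpider(scrapy.Spider):    # hypothetical spider for similartech.com
    name = "similartech"

    def parse(self, response):
        yield {"source": "similartech", "url": response.url}


process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
# Both crawls are scheduled on the same Twisted reactor and run concurrently.
process.crawl(OriginalSiteSpider, start_urls=["https://www.lepetitballon.com"])
process.crawl(SimilarTechSpider, start_urls=["https://www.similartech.com/websites/lepetitballon.com"])
process.start()  # blocks until both spiders finish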
I have a working crawler in Scrapy+Splash. It launches a spider on many pages. Each page contains a list of links. For each page, the spider downloads the page and then some of the pages linked from it (not recursively). All the pages are saved on the file system. The system works flawlessly. At the moment I'm refactoring it to add some DB interaction.
I'm not using items, nor Item Pipelines.
What are the benefits of using them?
Adding some info:
The purpose of my crawler is to download entire pages (as HTML, PNG, or converted to txt using a library). As soon as the spider has a response to save, it passes it to a library that encapsulates all the I/O ops (file system and DB). So in this way, it is simpler than using items (with boilerplate for conversion) and pipelines.
So where is my doubt?
I don't know the way Scrapy works internally well enough. The way the crawler is implemented, the I/O ops are executed in the spider's thread, so each spider takes longer to execute. By contrast, if I move the I/O ops into a pipeline, maybe(?) Scrapy can schedule its jobs better, executing them separately from the crawling job. Will there be any real performance difference?
In my opinion, using pipelines is just following the separation of concerns principle. Your spider can do many things, but its core function is to extract information from web pages. The rest can (and possibly should) be refactored into a pipeline or an extension.
It might not be such an issue if you have one spider for one web site. But imagine you have a Scrapy project with hundreds of spiders for semantically similar web sites and you want to apply the same logic to each item -- take a page snapshot, check for duplicates, store it in a database, etc. And now imagine the maintenance hell if you had all that logic in each of the spiders and had to change it.
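As a rough illustration (not your actual code), a pipeline that persists each item could look like the sketch below; the sqlite schema and the item fields are made up for the example. It would be enabled via ITEM_PIPELINES in settings.py, so the same storage logic serves every spider in the project.

import sqlite3


class PageStoragePipeline:
    """Stores one row per scraped item; the spiders only yield the data."""

    def open_spider(self, spider):
        self.conn = sqlite3.connect("pages.db")  # hypothetical database file
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html_path TEXT)"
        )

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # 'url' and 'html_path' are illustrative item fields.
        self.conn.execute(
            "INSERT OR REPLACE INTO pages (url, html_path) VALUES (?, ?)",
            (item["url"], item["html_path"]),
        )
        return item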
Recently, I've been trying to get to grips with Scrapy. I feel that if I had a better understanding of the architecture, I'd move a lot faster. The current, concrete problem I have is this: I want to store all of the links that Scrapy extracts in a database - not the responses, the links. This is for sanity checking.
My initial thought was to use the process_links parameter on a Rule and generate items in the function that it points to. However, whereas the callback parameter points to a function that is an item generator, the process_links parameter works more like a filter. In the callback function you yield items and they are automatically collected and put in the pipeline. In the process_links function you return a list of links. You don't generate items.
I could just make a database connection in the process_links function and write directly to the database, but that doesn't feel like the right way to go when Scrapy has built-in asynchronous database transaction processing via Twisted.
I could try to pass items from the process_links function to the callback function, but I'm not sure about the relationship between the two functions. One is used to generate items, and one receives a list and has to return a list.
In trying to think this through, I keep coming up against the fact that I don't understand the control loop within Scrapy. What is the process that reads the items yielded by the callback function? What is the process that supplies the links to, and receives the links from, the process_links function? The one that takes requests and returns responses?
From my point of view, I write code in a spider which generates items. The items are automatically read and moved through a pipeline. I can create code in the pipeline and the items will be automatically passed into and taken out of that code. What's missing is my understanding of exactly how these items get moved through the pipeline.
Looking through the code I can see that the base code for a spider is hiding away in a corner, as all good spiders should, going under the name of __init__.py. It contains the start_requests() and make_requests_from_url() functions which, according to the docs, are the starting points. But it's not a controlling loop. It's being called by something else.
Going from the opposite direction, I can see that when I execute the command scrapy crawl... I'm calling crawl.py, which in turn calls self.crawler_process.start() in crawler.py. That starts a Twisted reactor. There is also core/engine.py, which is another collection of functions that look as though they are designed to control the operation of the spiders.
Despite looking through the code, I don't have a clear mental image of the entire process. I realise that the idea of a framework is that it hides much of the complexity, but I feel that with a better understanding of what is going on, I could make better use of the framework.
Sorry for the long post. If anyone can give me an answer to my specific problem regarding saving links to the database, that would be great. If you were able to give a brief overview of the architecture, that would be extremely helpful.
This is how Scrapy works in short:
You have Spiders which are responsible for crawling sites. You can use separate spiders for separate sites/tasks.
You provide one or more start urls to the spider. You can provide them as a list or use the start_requests method
When we run a spider using Scrapy, it takes these URLs and fetches the HTML response. The response is passed to the callback on the spider class. You can explicitly define a callback when using the start_requests method. If you don't, Scrapy will use the parse method as the callback.
You can extract whatever data you need from the HTML. The response object you get in the parse callback allows you to extract the data using CSS selectors or XPath.
If you find the data from the response, you can construct the Items and yield them. If you need to go to another page, you can yield scrapy.Request.
If you yield a dictionary or Item object, Scrapy will send it through the registered pipelines. If you yield a scrapy.Request, the request will be scheduled and its response will be fed back to a callback. Again, you can define a separate callback or use the default one.
In the pipelines, your data (dictionary or Item) goes through the pipeline processors. In the pipelines you can store it in a database or do whatever else you want.
So in short:
In the parse method, or in any method inside the spider, we extract and yield our data so it is sent through the pipelines.
In the pipelines, you do the actual processing.
Here's a simple spider and pipeline example: https://gist.github.com/masnun/e85b38a00a74737bb3eb
I started using Scrapy not so long ago and I had some of your doubts myself (also considering I started with Python overall), but now it works for me, so don’t get discouraged – it’s a nice framework.
First, I would not get too worried at this stage about the details behind the framework, but rather start writing some basic spiders yourself.
Some of really key concepts are:
start_urls – these define the initial URL (or URLs) where you will look either for text or for further links to crawl. Let's say you want to start from e.g. http://x.com
parse(self, response) method – this will be the first method processed, and it gives you the response of http://x.com (basically its HTML markup)
You can use XPath or CSS selectors to extract information from this markup, e.g. a = response.xpath('//div[@class="foo"]/@href') will extract the link to a page (e.g. http://y.com)
If you want to extract the text of the link, so literally "http://y.com", you just yield (return) an item within the parse(self, response) method, so your final statement in this method will be yield item. If you want to go deeper and follow http://y.com, your final statement will be yield scrapy.Request(a, callback=self.parse_final) - parse_final being an example of a callback to the parse_final(self, response) method.
Then you can extract the HTML elements of http://y.com as the final step in the parse_final(self, response) method, or keep repeating the process to dig for further links in the page structure
Pipelines are for processing items. So when items get yielded, they are by default just printed on the screen. So in pipelines you can redirect them either to csv file, database etc.
The entire process gets more complex when you start collecting more links in each of the methods and calling various callbacks based on various conditions, etc. I think you should get this concept down first, before moving on to pipelines. The examples from Scrapy are somewhat difficult to get, but once you get the idea it is really nice and not that complicated in the end.
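To make that flow concrete, here is a minimal sketch reusing the x.com / y.com placeholders from above; the selectors are purely illustrative. parse follows the extracted links and parse_final yields the items:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://x.com"]  # placeholder domain from the explanation above

    def parse(self, response):
        # Extract links from the starting page (selector is made up).
        for href in response.xpath('//div[@class="foo"]/a/@href').getall():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_final)

    def parse_final(self, response):
        # Extract data from the linked page (e.g. http://y.com) and yield an item.
        yield {
            "url": response.url,
            "title": response.xpath("//title/text()").get(),
        }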
I need to crawl up to 10,000 websites.
Since every website is unique, with its own HTML structure, and requires its own XPath logic and its own way of creating and delegating Request objects, I'm tempted to create a unique spider for each website.
But is this the best way forward? Should I perhaps have a single spider, add all 10,000 websites to start_urls and allowed_domains, write scraping libraries and go for it?
Which is the best practice in this regard?
I faced a similar problem, and I took a middle road.
Much of the data you will encounter will (likely) be handled the same way when you finally process it. That means much of the logic you need can be reused. The specifics include where to look for data and how to transform it into a common format. I suggest the following:
Create your MainSpider class, containing most of the logic and tasks that you need.
For each site, subclass MainSpider and define logic modules as required.
main_spider.py
class MainSpider(object):
    # Shared logic and helpers used by every site-specific spider.
    def get_links(self, url):
        links = []  # collect candidate links for the given url
        return links
spider_mysite.py
from main_spider import MainSpider
class SpiderMysite(MainSpider):
    # Site-specific logic only; everything common lives in MainSpider.
    def get_data(self, links):
        for link in links:
            pass  # do the site-specific extraction here
Hope it helps.
How can I generate a random yet valid website link, regardless of language? Actually, the more diverse the languages of the websites it generates, the better.
I've been doing it by using other people's scripts on their web pages; how can I stop relying on these random-site-forwarding scripts and make my own? I've been doing it as such:
import webbrowser
from random import choice

random_page_generator = ['http://www.randomwebsite.com/cgi-bin/random.pl',
                         'http://www.uroulette.com/visit']
webbrowser.open(choice(random_page_generator), new=2)
I've been doing it by using other people's scripts on their web pages; how can I stop relying on these random-site-forwarding scripts and make my own?
There are two ways to do this:
Create your own spider that amasses a huge collection of websites, and pick from that collection.
Access some pre-existing collection of websites, and pick from that collection. For example, DMOZ/ODP lets you download their entire database;* Google used to have a customized random site URL;** etc.
There is no other way around it (short of randomly generating and testing valid strings of arbitrary characters, which would be a ridiculously bad idea).
Building a web spider for yourself can be a fun project. Link-driven scraping libraries like Scrapy can do a lot of the grunt work for you, leaving you to write the part you care about.
* Note that ODP is a pretty small database compared to something like Google's or Yahoo's, because it's primarily a human-edited collection of significant websites rather than an auto-generated collection of everything anyone has put on the web.
** Google's random site feature was driven by both popularity and your own search history. However, by feeding it an empty search history, you could remove that part of the equation. Anyway, I don't think it exists anymore.
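For the second approach, once you have any downloaded collection of URLs, picking a random one is trivial. A minimal sketch, where "sites.txt" is a hypothetical file with one URL per line (e.g. exported from a directory dump or built by your own spider):

import webbrowser
from random import choice

# "sites.txt" is a made-up placeholder: one URL per line.
with open("sites.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

webbrowser.open(choice(urls), new=2)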
A conceptual explanation, not a code one.
Their scripts are likely very large and comprehensive. If it's a random website selector, they have a huge, huge list of websites, line by line, and the script just picks one. If it's a random URL generator, it probably generates a string of letters (e.g. "asljasldjkns"), plugs it between http:// and .com, checks whether it is a valid URL, and if it is, sends you to that URL.
The easiest way to design your own might be to ask to have a look at theirs, though I'm not certain of the success you'd have there.
The best way as a programmer is simply to decipher the nature of URL language. Practice building strings and testing them, or compile a huge database of them yourself.
As a hybridization, you might try building two things: one script that, while you're away, searches for/tests URLs and adds them to a database, and another script that randomly selects a line out of this database to send you on your way. The longer you run the first, the better the second becomes.
EDIT: Do Abarnert's thing about spiders, that's much better than my answer.
The other answers suggest building large databases of URLs; there is another method which I've used in the past and documented here:
http://41j.com/blog/2011/10/find-a-random-webserver-using-libcurl/
The idea is to create a random IP address and then try to grab a site from port 80 of that address. This method is not perfect with modern virtual-hosted sites, and of course it only fetches the top page, but it can be an easy and effective way of getting random sites. The code linked above is C, but it should be easily callable from Python, or the method could easily be adapted to Python.
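A rough Python adaptation of that idea (the attempt count and timeout are arbitrary, and most random addresses simply won't answer):

import random
import urllib.request


def random_ip():
    # Naive random IPv4; makes no attempt to skip reserved or private ranges.
    return ".".join(str(random.randint(1, 254)) for _ in range(4))


def random_site(attempts=200, timeout=2):
    for _ in range(attempts):
        url = f"http://{random_ip()}/"
        try:
            # Only checks whether something answers on port 80 with an HTTP response.
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url
        except Exception:
            continue  # no web server there, try another address
    return None


print(random_site())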