Recently, I've been trying to get to grips with scrapy. I feel that if I had a better understanding of the architecture, I'd move a lot faster. The current, concrete problem I have is this: I want to store all of the links that scrapy extracts in a database. Not the responses, the links. This is for sanity checking.
My initial thought was to use the process_links parameter on a rule and generate items in the function that it points to. However, whereas the callback parameter points to a function that is an item generator, the process_links parameter works more like a filter. In the callback function you yield items and they are automatically collected and put in the pipeline. In the process_links function you return a list of links. You don't generate items.
I could just make a database connection in the process_links function and write directly to the database, but that doesn't feel like the right way to go when scrapy has built-in asynchronous database transaction processing via Twisted.
I could try to pass items from the process_links function to the callback function, but I'm not sure about the relationship between the two functions. One is used to generate items, and one receives a list and has to return a list.
In trying to think this through, I keep coming up against the fact that I don't understand the control loop within scrapy. What is the process that reads the items yielded by the callback function? What is the process that supplies the links to, and receives the links from, the process_links function? The one that takes requests and returns responses?
From my point of view, I write code in a spider which generates items. The items are automatically read and moved through a pipeline. I can create code in the pipeline and the items will be automatically passed into and taken out of that code. What's missing is my understanding of exactly how these items get moved through the pipeline.
Looking through the code, I can see that the base code for a spider is hiding away in a corner, as all good spiders should, and going under the name of __init__.py. It contains the start_requests() and make_requests_from_url() functions, which according to the docs are the starting points. But it's not a controlling loop. It's being called by something else.
Going from the opposite direction, I can see that when I execute the command scrapy crawl... I'm calling crawl.py which in turn calls self.crawler_process.start() in crawler.py. That starts a Twisted reactor. There is also core/engine.py which is another collection of functions which look as though they are designed to control the operation of the spiders.
Despite looking through the code, I don't have a clear mental image of the entire process. I realise that the idea of a framework is that it hides much of the complexity, but I feel that with a better understanding of what is going on, I could make better use of the framework.
Sorry for the long post. If anyone can give me an answer to my specific problem regarding saving links to the database, that would be great. If you were able to give a brief overview of the architecture, that would be extremely helpful.
This is how Scrapy works in short:
You have Spiders which are responsible for crawling sites. You can use separate spiders for separate sites/tasks.
You provide one or more start URLs to the spider. You can provide them as a list or use the start_requests method.
When we run a spider using Scrapy, it takes these URLs and fetches the HTML response. The response is passed to the callback on the spider class. You can explicitly define a callback when using the start_requests method. If you don't, Scrapy will use the parse method as the callback.
You can extract whatever data you need from the HTML. The response object you get in the parse callback allows you to extract the data using CSS selectors or XPath.
If you find the data from the response, you can construct the Items and yield them. If you need to go to another page, you can yield scrapy.Request.
If you yield a dictionary or Item object, Scrapy will send it through the registered pipelines. If you yield a scrapy.Request, the request will be scheduled and downloaded, and the response will be fed back to a callback. Again, you can define a separate callback or use the default one.
In the pipelines, your data (dictionary or Item) goes through the pipeline processors. There you can store it in a database or do whatever else you want with it.
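The flow described in these steps can be pictured with a toy loop. This is only a plain-Python sketch of the idea, not Scrapy's actual engine (which is event-driven and built on Twisted). The names PAGES, Request, fake_fetch, run_toy_engine and store_item are all made up for illustration, and the "response" is just a dict. The callback yields one item per link it finds, which is one way to get every extracted link into a pipeline.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> list of links found on that page.
PAGES = {
    "http://x.com": ["http://y.com", "http://z.com"],
    "http://y.com": [],
    "http://z.com": [],
}


class Request:
    """Stand-in for scrapy.Request: a URL plus the callback for its response."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


def fake_fetch(url):
    """Stand-in for the downloader: returns a dict acting as the response."""
    return {"url": url, "links": PAGES.get(url, [])}


def parse(response):
    """Spider callback: yield one item per link, then a request to follow it."""
    for link in response["links"]:
        yield {"source": response["url"], "link": link}  # item -> pipelines
        yield Request(link, callback=parse)  # request -> back to the queue


def run_toy_engine(start_urls, pipelines):
    """Pull requests off a queue, call the callback, route whatever it yields."""
    queue = deque(Request(url, parse) for url in start_urls)
    seen = set(start_urls)
    while queue:
        request = queue.popleft()
        response = fake_fetch(request.url)
        for result in request.callback(response):
            if isinstance(result, Request):
                if result.url not in seen:  # skip URLs already scheduled
                    seen.add(result.url)
                    queue.append(result)
            else:
                for process_item in pipelines:
                    result = process_item(result)


stored = []


def store_item(item):
    """Pipeline stage: appends to a list here instead of writing to a database."""
    stored.append(item)
    return item


run_toy_engine(["http://x.com"], [store_item])
print(stored)
```

The point of the sketch is only the routing: whatever a callback yields is either a request (goes back to be fetched) or an item (goes through the pipeline stages in order).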
So in short:
In the parse method, or in any other callback inside the spider, we extract and yield our data so that it is sent through the pipelines.
In the pipelines, you do the actual processing.
Here's a simple spider and pipeline example: https://gist.github.com/masnun/e85b38a00a74737bb3eb
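For the specific case of storing links, a pipeline could look roughly like the sketch below. It assumes the spider yields dicts with "source" and "link" keys (those field names, the class name LinkStoragePipeline and the table layout are my assumptions, not something from the question). The method names open_spider, process_item and close_spider follow the item pipeline interface described in the Scrapy docs. For simplicity it uses blocking sqlite3 calls rather than anything asynchronous.

```python
import sqlite3


class LinkStoragePipeline:
    """Item pipeline sketch: writes each yielded link item to SQLite."""

    def __init__(self, db_path=":memory:"):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS links (source TEXT, link TEXT)"
        )

    def process_item(self, item, spider):
        # Called for every item a spider callback yields.
        self.conn.execute(
            "INSERT INTO links (source, link) VALUES (?, ?)",
            (item["source"], item["link"]),
        )
        self.conn.commit()
        return item  # hand the item on to the next pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.close()


# Exercise the pipeline by hand, without running a crawl.
pipeline = LinkStoragePipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({"source": "http://x.com", "link": "http://y.com"}, spider=None)
rows = pipeline.conn.execute("SELECT source, link FROM links").fetchall()
print(rows)
pipeline.close_spider(spider=None)
```

In a real project the class would be enabled through the ITEM_PIPELINES setting, e.g. {"myproject.pipelines.LinkStoragePipeline": 300}, where myproject is a placeholder for your own package name.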
I started using Scrapy not so long ago and I had some of your doubts myself (also considering I started with Python overall), but now it works for me, so don’t get discouraged – it’s a nice framework.
First, I would not get too worried at this stage about the details behind the framework, but rather start writing some basic spiders yourself.
Some of the really key concepts are:
start_urls – these define an initial URL (or URLs), where you will look either for text or for further links to crawl. Let's say you want to start from e.g. http://x.com
parse(self, response) method – this is the first method to be called, and it receives the Response for http://x.com (basically its HTML markup).
You can use XPath or CSS selectors to extract information from this markup, e.g. a = response.xpath('//div[@class="foo"]/@href').get() will extract the link to a page (e.g. http://y.com)
If you want to extract the text of the link, so literally "http://y.com", you just yield (return) an item within the parse(self, response) method. So your final statement in this method will be yield item. If you want to go deeper and delve into http://y.com, your final statement will be yield scrapy.Request(a, callback=self.parse_final), parse_final here being an example callback, i.e. the parse_final(self, response) method.
Then you can extract the elements of the HTML of http://y.com in the parse_final(self, response) method, or keep repeating the process to dig for further links in the page structure.
Pipelines are for processing items. When items get yielded, by default they are just printed on the screen. In pipelines you can redirect them to a CSV file, a database, etc.
The entire process gets more complex, when you start getting more links in each of the methods, based on various conditions you call various callbacks etc. I think you should start with getting this concept first, before going to pipelines. The examples from Scrapy are somewhat difficult to get, but once you get the idea it is really nice and not that complicated in the end.
Related
How do I handle multiple redirects to one site with different parameters in the HTTP body? In my code, I have a loop with an HTTP redirect. It is supposed to loop and redirect once for each of the different parameters I have. But it redirects just once and goes to the web site, so I end up with only one redirect instead of multiple ones. I am really only interested in simple sequential redirects, nothing difficult like parallel multi-threading. The code in my view looks like this:
for code in codes:
    print(code)
    base_url = 'https://base_url/'
    code_part = 'code={}'.format(code)
    url = '{}?{}'.format(base_url, code_part)
    return redirect(url)
I thought about wrapping this in a parent-child function that calls itself once for each element of the list, but I think I will end up with the same result as with a normal for loop. I also saw the redirects application, but I am not sure it helps with this exact task. However I implement it, as soon as I call redirect, the program leaves for the external web site and the function stops.
update
I was asked to provide more of the code to help answer my question. The thing is, the only relevant code is in my view function, which I included in the question. I'm still thinking about how to approach the problem, so I don't have any other code at this time. Unfortunately :( Any push in the right direction would be very helpful. Thank you!
update
Unfortunately, the redirects app for Django isn't suited to making many queries to one site with different parameters. It is meant to handle 404 errors, and it creates "moved permanently" links in its table...
The only problem I see in your code is the indentation.
I just rewrote your code so it is a little simpler.
Maybe the problem is in the method called redirect() that you are returning at the end of your loop.
for code in codes:
    print(code)
    # Keep the code= parameter name from the original URL.
    return redirect(f'https://base_url/?code={code}')
Also, when you return something, you actually quit the whole function. This means that whatever function your for loop is part of stops, and the for loop never gets any further than the first iteration.
So if you want the results for each code, you should write this instead:
results = []
for code in codes:
    print(code)
    # Collect one redirect per code instead of returning on the first pass.
    results.append(redirect(f'https://base_url/?code={code}'))
return results
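The difference between returning inside the loop and returning after it can be checked without Django at all. Below is a small sketch in which plain URL strings stand in for redirect() responses. The function names first_url_only and all_urls are made up, and BASE_URL is the placeholder host from the question. It only illustrates the control flow. It says nothing about how a browser would follow several redirects.

```python
from urllib.parse import urlencode

BASE_URL = "https://base_url/"  # placeholder host taken from the question


def first_url_only(codes):
    """Mirrors the original view: `return` inside the loop ends the function."""
    for code in codes:
        query = urlencode({"code": code})
        return f"{BASE_URL}?{query}"
    return None  # only reached when `codes` is empty


def all_urls(codes):
    """Collects one URL per code and returns after the loop has finished."""
    urls = []
    for code in codes:
        query = urlencode({"code": code})
        urls.append(f"{BASE_URL}?{query}")
    return urls


print(first_url_only(["a1", "b2", "c3"]))
print(all_urls(["a1", "b2", "c3"]))
```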
I need to scrape data for a list of domains given in an Excel file.
The problem is that I need to scrape data from the original website (for example https://www.lepetitballon.com) and data from similartech (https://www.similartech.com/websites/lepetitballon.com).
I want both to be scraped at the same time so that I can receive and format the results once at the end. After that I'll just go on to the next domain.
Theoretically, should I just use 2 spiders running asynchronously with scrapy?
Ideally you would want to keep spiders which scrape differently structured sites separate, that way your code will be a lot easier to maintain in the long run.
Theoretically, if, for some reason you MUST parse them in the same spider, you could just collect the URLs you want to scrape and based on the base path you could invoke different parser callback methods. That being said, I personally cannot think of a reason why you would have to do that. Even if you would have the same structure, you can just reuse your scrapy.Item classes.
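If you did go the single-spider route, choosing the callback from the URL could be sketched like this. The mapping CALLBACKS_BY_HOST, the function pick_callback and the callback names parse_similartech and parse_site are all hypothetical. Only the two hosts come from the question.

```python
from urllib.parse import urlparse

# Hypothetical mapping from host suffix to the name of a spider callback.
CALLBACKS_BY_HOST = {
    "similartech.com": "parse_similartech",
    "lepetitballon.com": "parse_site",
}


def pick_callback(url, default="parse"):
    """Return the callback name whose host suffix matches the URL's host."""
    host = urlparse(url).hostname or ""
    for suffix, callback_name in CALLBACKS_BY_HOST.items():
        if host == suffix or host.endswith("." + suffix):
            return callback_name
    return default


print(pick_callback("https://www.similartech.com/websites/lepetitballon.com"))
print(pick_callback("https://www.lepetitballon.com"))
```

Inside a spider, the name could then be turned into a bound method, for instance scrapy.Request(url, callback=getattr(self, pick_callback(url))).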
Scrapy uses the Twisted networking library for its internal networking tasks, and it provides settings to control how many requests are handled concurrently.
Explained here: https://docs.scrapy.org/en/latest/topics/settings.html#concurrent-requests
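As a rough illustration, the relevant lines in a project's settings.py might look like the fragment below. The numbers are arbitrary examples, not recommendations. Both setting names are documented on the settings page linked above.

```python
# settings.py (values are illustrative, not recommendations)

# Upper bound on requests the downloader performs concurrently, overall.
CONCURRENT_REQUESTS = 32

# Upper bound on concurrent requests to any single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 8
```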
Or you could use multiple spiders that are independent of each other, which is already explained in the scrapy docs. This might be what you are looking for.
By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process using the internal API.
https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process
As for efficiency, you could choose either option: tuning concurrent requests in a single spider (option A) or running multiple independent spiders (option B). Which one is better really depends on your resources and requirements. Option A can work well with fewer resources at a decent speed, while option B can give better speed at the cost of higher resource consumption.
I have a working crawler in Scrapy+Splash. It launches a spider on many pages. Each page contains a list of links. For each page, the spider downloads the page and then some of the pages linked from it (not recursively). All the pages are saved on the file system. The system works flawlessly. At the moment I'm refactoring it to add some DB interaction.
I'm not using items, nor Item Pipelines.
What are the benefits of using them?
Adding some info:
The purpose of my crawler is to download entire pages (in HTML, PNG, or converted to TXT using a library). As soon as the spider has a response to save, it passes it to a library that encapsulates all the I/O operations (file system and DB). This way it is simpler than using items (with boilerplate for conversion) and pipelines.
So where is my doubt?
I don't know well enough how scrapy works internally. The way the crawler is implemented, the I/O operations are executed in the spider's thread, so each spider takes longer to execute. By contrast, if I move the I/O operations into a pipeline, maybe(?) scrapy can schedule its jobs better, executing them separately from the crawling job. Will there be any real performance difference?
In my opinion, using pipelines is just following the separation of concerns principle. Your spider can do many things, but its core function is to extract information from web pages. The rest can (and possibly should) be refactored into a pipeline or an extension.
It might not be such an issue if you have one spider for one web site. But imagine you have a Scrapy project with hundreds of spiders for semantically similar web sites and you want to apply the same logic for each item -- take a page snapshot, check for duplicates, store in database etc. And now imagine the maintenance hell if you had all the logic in each of the spiders and had to change that logic.
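To make the "same logic for each item" point concrete, here is a small sketch of a duplicate check written once as a pipeline and shared by every spider in the project. The class name DuplicatesPipeline and the "url" field are assumptions for illustration. DropItem is the exception Scrapy provides for discarding items, and the try/except fallback is only there so the sketch also runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:  # fallback so the sketch also runs without Scrapy installed
    class DropItem(Exception):
        """Stand-in for scrapy.exceptions.DropItem."""


class DuplicatesPipeline:
    """Shared pipeline: drops any item whose `url` was already seen."""

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        url = item["url"]  # `url` is an assumed field name
        if url in self.seen_urls:
            raise DropItem(f"Duplicate item: {url}")
        self.seen_urls.add(url)
        return item


# Exercise the pipeline by hand: the second identical item is dropped.
pipeline = DuplicatesPipeline()
first = pipeline.process_item({"url": "http://x.com/page"}, spider=None)
try:
    pipeline.process_item({"url": "http://x.com/page"}, spider=None)
    dropped = False
except DropItem:
    dropped = True
print(first, dropped)
```

If the rule for what counts as a duplicate changes, only this one class needs editing, not each of the spiders.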
I need to crawl up to 10,000 websites.
Every website is unique, with its own HTML structure, and requires its own XPath logic and its own way of creating and delegating Request objects, so I'm tempted to create a unique spider for each website.
But is this the best way forward? Should I perhaps have a single spider, add all 10,000 websites to start_urls and allowed_domains, write scraping libraries and go for it?
What is the best practice in this regard?
I faced a similar problem, and I took a middle road.
Much of the data you will encounter will (likely) be handled the same way when you finally process it. That means much of the logic you need can be reused. The specifics include where to look for data and how to transform it into a common format. I suggest the following:
Create your MainSpider class, containing most of the logic and tasks that you need.
For each site, subclass MainSpider and define logic modules as required.
main_spider.py
class MainSpider(object):
    # Shared logic and tasks go here.
    def get_links(self, url):
        links = []  # placeholder: collect the links found at `url`
        return links
spider_mysite.py
from main_spider import MainSpider


class SpiderMysite(MainSpider):
    # Site-specific logic goes here.
    def get_data(self, links):
        for link in links:
            # Do more stuff.
            pass
Hope it helps.
I am writing web spiders to scrape products from websites using the scrapy framework in Python.
I was wondering what the best practices are for calculating the coverage and the missing items of the spiders I write.
What I'm doing right now is logging the cases that could not be parsed or that raised exceptions.
For example: when I expect a specific format for the price of a product or the address of a place and I find that my regular expressions don't match the scraped strings, or when my XPath selectors for specific data return nothing.
Sometimes, when products are listed on one page or across several, I also use curl and grep to roughly count the number of products. But I was wondering if there are better practices to handle this.
The common approach is, yes, to use logging to log the error and exit the callback by returning nothing.
Example (product price is required):
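# Runs inside a spider callback. `log` is Scrapy's legacy logging module;
# newer versions log through self.logger instead.
loader = ProductLoader(ProductItem(), response=response)
loader.add_xpath('price', '//span[@class="price"]/text()')
if not loader.get_output_value('price'):
    log.msg("Error fetching product price", level=log.ERROR)
    return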
You can also use signals to catch and log all kinds of exceptions that happen while crawling, see:
how to process all kinds of exception in a scrapy project, in errback and callback?
This basically follows the "easier to ask for forgiveness than permission" principle: you let the spider fail, then catch and process the error in one particular place, a signal handler.
Other thoughts:
you can even place the response urls and error tracebacks into a database for a following review - this is still "logging", but in a structured manner which can be more convenient to go through later
a good idea might be to create custom exceptions to represent different crawling errors, for instance: MissingRequiredFieldError, InvalidFieldFormatError which you can raise in case crawled fields haven't passed validation.
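The custom-exception idea could be sketched as follows. The two exception names come from the list above. The base class CrawlValidationError, the validate_product function and the price pattern are assumptions added for illustration.

```python
import re


class CrawlValidationError(Exception):
    """Base class for the validation errors below (hypothetical name)."""


class MissingRequiredFieldError(CrawlValidationError):
    """A required field was not extracted at all."""


class InvalidFieldFormatError(CrawlValidationError):
    """A field was extracted but does not match the expected format."""


PRICE_RE = re.compile(r"^\d+(\.\d{2})?$")  # assumed price format, e.g. "12.50"


def validate_product(item):
    """Raise a specific error if `price` is missing or malformed."""
    price = item.get("price")
    if not price:
        raise MissingRequiredFieldError("price")
    if not PRICE_RE.match(price):
        raise InvalidFieldFormatError(f"price={price!r}")
    return item


# Count failures by error type, which gives a rough coverage picture.
errors = []
for candidate in ({"price": "12.50"}, {}, {"price": "12,5 EUR"}):
    try:
        validate_product(candidate)
    except CrawlValidationError as exc:
        errors.append(type(exc).__name__)
print(errors)
```

Because each failure mode has its own class, a single handler can tally them per spider and show which fields are missing most often.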