Scrapy: why use pipelines? - python

I have a working crawler in Scrapy+Splash. It launches a spider on many pages. Each page contains a list of links. For each page, the spider downloads the page and then some of the pages linked from it (not recursively). All the pages are saved on the file system. The system works flawlessly. At the moment I'm refactoring it to add some DB interaction.
I'm not using items, nor Item Pipelines.
What are the benefits of using them?
Adding some info:
The purpose of my crawler is to download entire pages (in html, png, or converted to txt using a library). As soon as the spider has the response to save, it passes it to a library that encapsulates all the I/O operations (file system and DB). This way it is simpler than using items (with boilerplate for conversion) and pipelines.
So where is my doubt?
I don't know well enough how Scrapy works internally. The way the crawler is implemented, the I/O operations are executed in the spider's thread, so each spider takes longer to execute. By contrast, if I move the I/O operations into a pipeline, maybe(?) Scrapy can schedule its jobs better and execute them separately from the crawling job. Will there be any real performance difference?

In my opinion, using pipelines is just following the separation of concerns principle. Your spider can do many things, but its core function is to extract information from web pages. The rest can (and possibly should) be refactored into a pipeline or an extension.
It might not be such an issue if you have one spider for one web site. But imagine you have a Scrapy project with hundreds of spiders for semantically similar web sites and you want to apply the same logic for each item -- take a page snapshot, check for duplicates, store in database etc. And now imagine the maintenance hell if you had all the logic in each of the spiders and had to change that logic.
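To make the separation concrete, here is a minimal sketch of what such a pipeline could look like. The method names (open_spider, process_item, close_spider) follow Scrapy's documented item pipeline interface; the class name, the SQLite table and the "url"/"body" item keys are assumptions made up for this illustration, and the pipeline would still have to be enabled through the ITEM_PIPELINES setting.

```python
import sqlite3


class SqlitePagePipeline:
    """Sketch of an item pipeline that stores each page item in SQLite.

    Assumes every item is a dict-like object with "url" and "body" keys
    (hypothetical field names, not taken from the question).
    """

    def __init__(self, db_path=":memory:"):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the connection.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)"
        )

    def process_item(self, item, spider):
        # Called for every item the spider yields.
        self.conn.execute(
            "INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)",
            (item["url"], item["body"]),
        )
        self.conn.commit()
        return item  # hand the item on to the next pipeline, if any

    def close_spider(self, spider):
        # Called once when the spider finishes: release the connection.
        self.conn.close()
```

As far as I understand, pipelines run in the same Twisted event loop as the rest of the crawl, so a blocking write like the one above does not become asynchronous just because it lives in a pipeline. The benefit is mainly organisational unless the pipeline itself uses non-blocking I/O.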

Related

How to scrape data faster with selenium and django

I am working on a web scraping project. In this project, I have written the necessary code to scrape the required information from a website using python and selenium. All of this code resides in multiple methods in a class. This code is saved as scraper.py.
When I execute this code, the program takes somewhere between 6 and 10 seconds to extract all the necessary information from the website.
I wanted to create a UI for this project. I used django to create the UI. In the webapp, there is a form which when submitted opens a new browser window and starts the scraping process.
I access the scraper.py file in django views, where depending on the form inputs, the scraping occurs. While this works fine, the execution is very slow and takes almost 2 minutes to finish running.
How do I make the code run faster when it is called from Django? Can you point me to a tutorial on how to convert the scraper.py code into an API that Django can access? Will this help in making the code faster?
Thanks in advance
A few small tips:
How is your scraper.py working in the first place? Does it simply print the site links/details, or store them in a text file, or return them? What exactly happens in it?
If you wish to use scraper.py as an "API", put its code in a function that returns the details of the scraped site as a dictionary. Django's views.py can easily pass such a dictionary as context to your HTML template, where it fills in the template placeholders (Django's own template language by default, or Jinja2 if you have configured it).
Further speed can be achieved (in case your scraper does larger jobs) by using multi-threading and/or multi-processing. Do explore both :)
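A rough sketch of the last two tips combined: a scraping function that returns a dictionary, and a thread pool that runs several such jobs at once. scrape_page is a hypothetical stand-in for the real Selenium code (it does no browsing here), and the "title" field is invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_page(url):
    """Hypothetical stand-in for one scraping job.

    The real function would drive Selenium; the point is only that it
    returns a plain dict that a Django view can use as template context.
    """
    return {"url": url, "title": "title of " + url}


def scrape_many(urls, max_workers=4):
    """Run scrape_page for several URLs in parallel threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns the results in the same order as the input URLs.
        return list(pool.map(scrape_page, urls))
```

In a view, the returned dictionaries could then be passed as the context argument of render(). Whether threads actually help depends on how much of the 2 minutes is spent waiting on the browser.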

Python Scrapy - How to scrape from 2 different website at the same time?

I need to scrape data from a list of domains given in an Excel file.
The problem is that I need to scrape data from the original website (let's take for example: https://www.lepetitballon.com) and data from similartech (https://www.similartech.com/websites/lepetitballon.com).
I want to scrape both at the same time so that I can receive the results and format them once at the end; after that I'll just go to the next domain.
In theory, should I just use 2 spiders asynchronously with Scrapy?
Ideally you would want to keep spiders which scrape differently structured sites separate, that way your code will be a lot easier to maintain in the long run.
Theoretically, if, for some reason you MUST parse them in the same spider, you could just collect the URLs you want to scrape and based on the base path you could invoke different parser callback methods. That being said, I personally cannot think of a reason why you would have to do that. Even if you would have the same structure, you can just reuse your scrapy.Item classes.
Scrapy uses the Twisted networking library for its internal networking, and it provides settings to control how many requests are handled concurrently.
Explained here: https://docs.scrapy.org/en/latest/topics/settings.html#concurrent-requests
Or you could use multiple spiders that are independent of each other, which is already explained in the Scrapy docs; this might be what you are looking for.
By default, Scrapy runs a single spider per process when you run
scrapy crawl. However, Scrapy supports running multiple spiders per
process using the internal API.
https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process
As for efficiency, you could choose either option; it really depends on your resources and requirements. Option A (one spider with concurrent requests) can be good when resources are limited and still gives decent speed, while option B (multiple spiders in the same process) can give better speed at the cost of higher resource consumption.
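If both sites do end up in one spider, the "different callback per base path" idea mentioned above could be sketched like this. The two parser functions are hypothetical placeholders; in a real spider they would be callback methods passed to scrapy.Request.

```python
from urllib.parse import urlparse


def parse_similartech(html):
    """Placeholder parser for similartech pages (hypothetical)."""
    return {"source": "similartech", "length": len(html)}


def parse_original(html):
    """Placeholder parser for the original website (hypothetical)."""
    return {"source": "original", "length": len(html)}


def pick_parser(url):
    """Choose the parser callback from the host name of the URL."""
    host = urlparse(url).hostname or ""
    if host == "similartech.com" or host.endswith(".similartech.com"):
        return parse_similartech
    return parse_original
```

For example, pick_parser("https://www.similartech.com/websites/lepetitballon.com") returns parse_similartech, and any other host falls through to parse_original.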

Understand the scrapy framework architecture

Recently, I've been trying to get to grips with Scrapy. I feel that if I had a better understanding of the architecture, I'd move a lot faster. The current, concrete problem I have is this: I want to store all of the links that Scrapy extracts in a database; not the responses, just the links. This is for sanity checking.
My initial thought was to use the process_links parameter on a rule and generate items in the function that it points to. However, whereas the callback parameter points to a function that is an item generator, the process_links parameter works more like a filter. In the callback function you yield items and they are automatically collected and put in the pipeline. In the process_links function you return a list of links. You don't generate items.
I could just make a database connection in the process_links function and write directly to the database, but that doesn't feel like the right way to go when Scrapy has built-in asynchronous database transaction processing via Twisted.
I could try to pass items from the process_links function to the callback function, but I'm not sure about the relationship between the two functions. One is used to generate items, and one receives a list and has to return a list.
In trying to think this through, I keep coming up against the fact that I don't understand the control loop within Scrapy. What is the process that is reading the items yielded by the callback function? What's the process that supplies the links to, and receives the links from, the process_links function? The one that takes requests and returns responses?
From my point of view, I write code in a spider which generates items. The items are automatically read and moved through a pipeline. I can create code in the pipeline and the items will be automatically passed into and taken out of that code. What's missing is my understanding of exactly how these items get moved through the pipeline.
Looking through the code I can see that the base code for a spider is hiding away in a corner, as all good spiders should, and going under the name of __init__.py. It contains the start_requests() and make_requests_from_url() functions, which according to the docs are the starting points. But it's not a controlling loop. It's being called by something else.
Going from the opposite direction, I can see that when I execute the command scrapy crawl... I'm calling crawl.py which in turn calls self.crawler_process.start() in crawler.py. That starts a Twisted reactor. There is also core/engine.py which is another collection of functions which look as though they are designed to control the operation of the spiders.
Despite looking through the code, I don't have a clear mental image of the entire process. I realise that the idea of a framework is that it hides much of the complexity, but I feel that with a better understanding of what is going on, I could make better use of the framework.
Sorry for the long post. If anyone can give me an answer to my specific problem regarding saving links to the database, that would be great. If you were able to give a brief overview of the architecture, that would be extremely helpful.
This is how Scrapy works in short:
You have Spiders which are responsible for crawling sites. You can use separate spiders for separate sites/tasks.
You provide one or more start URLs to the spider. You can provide them as a list or use the start_requests method.
When we run a spider using Scrapy, it takes these URLs and fetches the HTML response. The response is passed to the callback on the spider class. You can explicitly define a callback when using the start_requests method. If you don't, Scrapy will use the parse method as the callback.
You can extract whatever data you need from the HTML. The response object you get in the parse callback allows you to extract the data using CSS selectors or XPath.
If you find the data in the response, you can construct the Items and yield them. If you need to go to another page, you can yield scrapy.Request.
If you yield a dictionary or Item object, Scrapy will send it through the registered pipelines. If you yield scrapy.Request, Scrapy will download that request and feed the response back to a callback. Again you can define a separate callback or use the default one.
In the pipelines, your data (dictionary or Item) goes through the pipeline processors. In the pipelines you can store it in a database or do whatever else you want with it.
So in short:
In parse method or in any method inside the spider, we would extract and yield our data so they are sent through the pipelines.
In the pipelines, you do the actual processing.
Here's a simple spider and pipeline example: https://gist.github.com/masnun/e85b38a00a74737bb3eb
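To picture the control loop the question asks about, here is a toy model in plain Python. It is not Scrapy's actual engine code (the real one is asynchronous and built on Twisted); the Request class, FAKE_WEB and run function are all invented for this sketch. It only shows the idea that something outside the spider pulls results from the callback, queues the requests and pushes the items through the pipelines.

```python
from collections import deque


class Request:
    """Toy request: a URL plus the callback that handles its response."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


# Fake "web" for the sketch: each URL maps to the links found on that page.
FAKE_WEB = {"http://x.com": ["http://y.com"], "http://y.com": []}


def parse(url, links):
    """Toy spider callback: yields one item, then a request per link."""
    yield {"url": url}
    for link in links:
        yield Request(link, callback=parse)


def run(start_urls, pipelines):
    """Toy engine loop: fetch, call the callback, route what it yields."""
    queue = deque(Request(url, parse) for url in start_urls)
    seen = set(start_urls)
    stored = []
    while queue:
        request = queue.popleft()
        links = FAKE_WEB.get(request.url, [])  # stands in for the download
        for result in request.callback(request.url, links):
            if isinstance(result, Request):
                # Requests go back to the queue (with duplicates filtered).
                if result.url not in seen:
                    seen.add(result.url)
                    queue.append(result)
            else:
                # Anything else is an item: pass it through each pipeline.
                for process in pipelines:
                    result = process(result)
                stored.append(result)
    return stored
```

Calling run(["http://x.com"], [some_pipeline_function]) walks both fake pages and returns the processed items in crawl order.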
I started using Scrapy not so long ago and I had some of your doubts myself (also considering I started with Python overall), but now it works for me, so don’t get discouraged – it’s a nice framework.
First, I would not get too worried at this stage about the details behind the framework, but rather start writing some basic spiders yourself.
Some of the really key concepts are:
start_urls: they define the initial URL (or URLs) where you will look either for text or for further links to crawl. Let's say you want to start from e.g. http://x.com
parse(self, response) method: this is the first method that will be called, and it receives the Response for http://x.com (basically its HTML markup).
You can use XPath or CSS selectors to extract information from this markup, e.g. a = response.xpath('//div[@class="foo"]/@href').get() will extract the link to a page (e.g. http://y.com)
If you want to extract the text of the link, so literally "http://y.com", you just yield an item within the parse(self, response) method, so the final statement in this method will be yield item. If you want to go deeper and follow http://y.com, your final statement will be yield scrapy.Request(a, callback=self.parse_final), where parse_final is just an example name for the callback method parse_final(self, response).
Then you can extract the elements of the HTML of http://y.com in the parse_final(self, response) method, or keep repeating the process to dig for further links in the page structure.
Pipelines are for processing items. When items get yielded, by default they are just printed on the screen. In pipelines you can redirect them to a CSV file, a database, etc.
The entire process gets more complex when you start getting more links in each of the methods and, based on various conditions, call various callbacks. I think you should start with getting this concept first, before going to pipelines. The examples from Scrapy are somewhat difficult to get, but once you get the idea it is really nice and not that complicated in the end.

Scrapy crawler - creating 10,000 spiders or one spider crawling 10,000 domains?

I need to crawl up to 10,000 websites. Since every website is unique, with its own HTML structure, and requires its own XPath logic and its own way of creating and delegating Request objects, I'm tempted to create a unique spider for each website.
But is this the best way forward? Should I perhaps have a single spider, add all 10,000 websites to start_urls and allowed_domains, write scraping libraries and go for it?
Which is the best practice in this regard?
I faced a similar problem, and I took a middle road.
Much of the data you will encounter will (likely) be handled the same way when you finally process it. That means much of the logic you need can be reused. The specifics include where to look for data and how to transform it into a common format. I suggest the following:
Create your MainSpider class, containing most of the logic and tasks that you need.
For each site, subclass MainSpider and define logic modules as required.
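main_spider.py
    class MainSpider(object):
        # Shared logic for all sites goes here
        def get_links(self, url):
            links = []  # placeholder: collect the links found at url
            return links

spider_mysite.py
    from main_spider import MainSpider

    class SpiderMysite(MainSpider):
        # Site-specific logic goes here
        def get_data(self, links):
            for link in links:
                # Do more stuff.
                pass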
Hope it helps.

How to call scrapy in a python utility

I want to get data from the web in real-time and have used scrapy to extract the information to build a python utility. The problem is that the data is static while the information will change in time.
I wanted to know if it is viable to call my scrapy spider when the utility is invoked so that when the utility is called for the first time, the data at that time is stored as JSON with the user which will change when the user calls it the next time.
Please let me know if there is an alternative to it.
Thanks in advance.
Edit-1: To make it clear, the data that I have extracted will change over time. Here is a link to my previous question about building the spider: How to scrape contents from multiple tables in a webpage. The problem is that as the league progresses, the fixtures' status will change (completed or not yet completed). I want the users to get real-time scraped data.
Edit-2: What I previously did was call my spider separately and use the generated JSON in my utility. For the users to have real-time data when they use it in the terminal, should I push the Scrapy code into the main repository that will be uploaded to PyPI and call the spider in the main function of the .py file? Is this possible? What are the alternatives, if any?
You could start Scrapy from code whenever you (or your user) need it:
from scrapy import cmdline
SCRAPY_SPIDER_NAME = 'spyder_name'  # name of the spider to start
# execute() expects an argv-style list of arguments, hence the split()
cmdline.execute("scrapy crawl {}".format(SCRAPY_SPIDER_NAME).split())
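One thing to be aware of is that cmdline.execute ends by exiting the interpreter, so code placed after it will not run. A possible alternative, sketched below under some assumptions, is to launch the crawl in a child process and read back the JSON it exports. The helper names, the default file name fixtures.json and the project_dir parameter (assumed to be the folder containing scrapy.cfg) are all made up for this example; only the scrapy crawl command and its -o option come from Scrapy itself.

```python
import json
import subprocess
from pathlib import Path


def build_crawl_command(spider_name, output_path):
    """Return the argv list for `scrapy crawl <spider> -o <file>`."""
    return ["scrapy", "crawl", spider_name, "-o", output_path]


def run_spider(spider_name, output_path="fixtures.json", project_dir="."):
    """Run the spider in a child process and return the exported data.

    Assumes project_dir is the Scrapy project folder and that the
    scrapy command is available on PATH.
    """
    out_file = Path(project_dir) / output_path
    # -o appends to an existing file, so remove stale output first.
    if out_file.exists():
        out_file.unlink()
    command = build_crawl_command(spider_name, output_path)
    # check=True raises CalledProcessError if the crawl fails.
    subprocess.run(command, cwd=project_dir, check=True)
    with out_file.open(encoding="utf-8") as handle:
        return json.load(handle)
```

The utility's main function could then call run_spider each time it is invoked, so the user always gets freshly scraped data.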
