I'm new to Scrapy and web scraping in general, so this might be a stupid question, but it wouldn't be the first time, so here goes.
I have a simple Scrapy spider, based on the tutorial example, that processes various URLs (in start_urls). I would like to categorise the URLs, e.g. URLs A, B, and C are Category 1 while URLs D and E are Category 2, and then be able to store the category on the resulting Items when the parser processes the response for each URL.
I guess I could have a separate spider for each category and just hold the category as an attribute on the class so the parser can pick it up from there. But I was kind of hoping to have just one spider for all the URLs and tell the parser which category to use for a given URL.
Right now I'm setting up the URLs in start_urls via my spider's __init__() method. How do I pass the category for a given URL from my __init__() method to the parser so that I can record the category on the Items generated from the responses for that URL?
As paul t. suggested:
from scrapy import Request
from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):

    def start_requests(self):
        ...
        # Attach the category to each request through its meta dict.
        yield Request(url1, meta={'category': 'cat1'}, callback=self.parse)
        yield Request(url2, meta={'category': 'cat2'}, callback=self.parse)
        ...

    def parse(self, response):
        # The category travels with the request and comes back on the response.
        category = response.meta['category']
        ...
You use start_requests to have control over the first URLs you're visiting, attaching metadata to each URL, and you can access that metadata through response.meta afterwards.
Same thing if you need to pass data from a parse function to a parse_item, for instance.
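For example, here is a minimal sketch of that parse-to-parse_item handoff. The spider name, URL, and field names are hypothetical, and it assumes a Scrapy version recent enough to have response.follow and .getall():

import scrapy

class CategorisedSpider(scrapy.Spider):
    name = 'categorised'

    def start_requests(self):
        yield scrapy.Request('http://example.com/list1',
                             meta={'category': 'cat1'}, callback=self.parse)

    def parse(self, response):
        category = response.meta['category']
        for href in response.xpath('//a/@href').getall():
            # Forward the category to the item callback.
            yield response.follow(href, callback=self.parse_item,
                                  meta={'category': category})

    def parse_item(self, response):
        yield {'url': response.url, 'category': response.meta['category']}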
This is the sitemap of the website I'm crawling. The 3rd and 4th <sitemap> nodes contain the URLs that go to the item details. Is there any way to apply the crawling logic only to those nodes (e.g. by selecting them by their indices)?
class MySpider(SitemapSpider):
    name = 'myspider'
    sitemap_urls = [
        'https://www.dfimoveis.com.br/sitemap_index.xml',
    ]
    sitemap_rules = [
        ('/somehow targeting the 3rd and 4th node', 'parse_item'),
    ]

    def parse_item(self, response):
        # scraping the item
You don't need to use SitemapSpider; just use a regex and a standard spider.
import re
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        sitemap = 'https://www.dfimoveis.com.br/sitemap_index.xml'
        yield scrapy.Request(url=sitemap, callback=self.parse_sitemap)

    def parse_sitemap(self, response):
        # Pull every <loc> URL out of the sitemap index with a regex.
        sitemap_links = re.findall(r"<loc>(.*?)</loc>", response.text, re.DOTALL)
        sitemap_links = sitemap_links[2:4]  # Only the 3rd and 4th nodes.
        for sitemap_link in sitemap_links:
            yield scrapy.Request(url=sitemap_link, callback=self.parse)
Scrapy's Spider subclasses, including SitemapSpider, are meant to make very common scenarios very easy.
You want to do something rather uncommon, so you should read the source code of SitemapSpider, try to understand what it does, and either subclass SitemapSpider, overriding the behaviour you want to change, or write your own spider from scratch based on the code of SitemapSpider.
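For example, here is a minimal sketch of the subclassing route, assuming a recent Scrapy version that exposes SitemapSpider.sitemap_filter. The spider name is made up, and the .xml check used to tell index entries apart from item URLs is only a guess about this particular site; note that sitemap_filter runs on the entries of every sitemap document, not just the index:

from scrapy.spiders import SitemapSpider

class ThirdAndFourthSpider(SitemapSpider):
    name = 'third_and_fourth'
    sitemap_urls = ['https://www.dfimoveis.com.br/sitemap_index.xml']
    sitemap_rules = [('', 'parse_item')]  # send every surviving URL to parse_item

    def sitemap_filter(self, entries):
        # Called for the entries of every sitemap document. For entries that
        # point at other .xml sitemaps (the index), keep only the 3rd and 4th
        # (0-based indices 2 and 3); pass item URLs through untouched.
        for index, entry in enumerate(entries):
            if entry['loc'].endswith('.xml'):
                if index in (2, 3):
                    yield entry
            else:
                yield entry

    def parse_item(self, response):
        # scraping the item
        pass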
For example, I want to crawl three similar URLs:
https://example.com/book1
https://example.com/book2
https://example.com/book3
What I want is that, in pipeline.py, I create 3 files named book1, book2, and book3, and write the three books' data to them correctly and separately.
In spider.py I know the three books' names, which I want to use as the file names, but I don't have them in pipeline.py.
The pages have the same structure, so I decided to code it like below:
import scrapy

class Book_Spider(scrapy.Spider):

    def start_requests(self):
        for url in urls:  # urls holds the three book URLs listed above
            yield scrapy.Request(url, self.parse)

    def parse(self, response):
        # item handling
        yield item
Now, how can I do this?
Smith, if you want to know the book name in pipeline.py, there are two options: either add an item field for book_file_name and populate it as you see fit, or extract it from the url field, which is also an item field and can be accessed in pipeline.py.
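A minimal sketch of the first option (the item fields, file naming, and pipeline class below are hypothetical, not the asker's actual code):

import scrapy

class BookItem(scrapy.Item):
    book_file_name = scrapy.Field()
    content = scrapy.Field()

# In the spider's parse():
#     item = BookItem()
#     item['book_file_name'] = response.url.rstrip('/').split('/')[-1]  # e.g. 'book1'
#     item['content'] = ...
#     yield item

# pipelines.py
class BookWriterPipeline:
    def process_item(self, item, spider):
        # Append each book's data to its own file, named after the book.
        with open(f"{item['book_file_name']}.txt", 'a', encoding='utf-8') as f:
            f.write(f"{item['content']}\n")
        return item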
I have a set of start URLs, like below:
start_urls = ['http://www.example.com', 'http://www.example.com/ca', 'http://www.example.com/ap']
Now I have written code to extract all the URLs occurring inside each of the start_urls, like below:
rules = (
    Rule(
        LinkExtractor(
            allow_domains=('example.com',),
            attrs=('href',),
            tags=('a',),
            deny=(),
            deny_extensions=(),
            unique=True,
        ),
        callback='parseHtml',
        follow=True,
    ),
)
In the parseHtml function, I am parsing the content of the links.
Now, the above sites have common links occurring across them. For those common links I need some sort of identification based on the start_urls they were reached from.
How can I achieve this using Scrapy?
You could avoid using CrawlSpider and pass the start_url information yourself, from start_requests through all your callbacks, as in the sketch below.
Alternatively, you could create a spider middleware that handles start_requests to do the same without doing it directly in the spider; you can find a similar behaviour here.
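A minimal sketch of the first suggestion, using a plain Spider instead of CrawlSpider (the spider class and name are made up; the start URLs are the ones from the question):

import scrapy
from scrapy.linkextractors import LinkExtractor

class MySpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_url_list = ['http://www.example.com',
                      'http://www.example.com/ca',
                      'http://www.example.com/ap']

    def start_requests(self):
        for url in self.start_url_list:
            # Remember which start URL this crawl branch came from.
            yield scrapy.Request(url, callback=self.parseHtml,
                                 meta={'start_url': url})

    def parseHtml(self, response):
        start_url = response.meta['start_url']  # identifies the originating start URL
        # ... parse the page content here, tagging items with start_url ...
        for link in LinkExtractor(allow_domains=('example.com',)).extract_links(response):
            yield scrapy.Request(link.url, callback=self.parseHtml,
                                 meta={'start_url': start_url})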
I have nearly 2500 unique links, and I want to run BeautifulSoup over each of the 2500 pages and gather some text captured in paragraphs. I could create a variable for each link, but having 2500 of them is obviously not the most efficient course of action. The links are contained in a list like the following:
linkslist = ["http://www.website.com/category/item1","http://www.website.com/category/item2","http://www.website.com/category/item3", ...]
Should I just write a for loop like the following?
import urllib2
from bs4 import BeautifulSoup

for link in linkslist:
    opened_url = urllib2.urlopen(link).read()
    soup = BeautifulSoup(opened_url)
    ...
I'm looking for any constructive criticism. Thanks!
This is a good use case for Scrapy - a popular web-scraping framework based on Twisted:
Scrapy is written with Twisted, a popular event-driven networking framework for Python. Thus, it's implemented using a non-blocking (aka asynchronous) code for concurrency.
Set the start_urls property of your spider and parse the page inside the parse() callback:
import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ["http://www.website.com/category/item1",
                  "http://www.website.com/category/item2",
                  "http://www.website.com/category/item3", ...]
    allowed_domains = ["website.com"]

    def parse(self, response):
        print(response.xpath("//title/text()").extract())
How about writing a function that would treat each URL separately?
def processURL(url):
    # Your code here
    pass

map(processURL, linkslist)
This will run your function on each url in your list. If you want to speed things up, this is easy to run in parallel:
from multiprocessing import Pool
list(Pool(processes = 10).map(processURL, linkslist))
I am new to Scrapy, and I am sorry if this question is trivial. I have read the Scrapy documentation on the official webpage, and while looking through it I came across this example:
import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)
        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
I know that the parse method must return (or yield) an Item and/or a Request, but where are these return values returned to?
One is an Item and the other is a Request. I think these two types are handled differently, and in the case of CrawlSpider there are Rules with callbacks. What about such a callback's return value? Where does it go? Is it handled the same way as parse()'s?
I am very confused about the Scrapy flow of control, even after reading the documentation.
According to the documentation:
The parse() method is in charge of processing the response and returning scraped data (as Item objects) and more URLs to follow (as Request objects).
In other words, returned/yielded items and requests are handled differently: items are handed to the item pipelines and item exporters, while requests are put into the scheduler, which pipes them to the downloader to make the request and return a response. The engine then receives the response and gives it to the spider for processing (i.e. to the callback method).
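To make the split concrete, here is a minimal sketch (the spider, pipeline, and URL are made up, and it assumes a reasonably recent Scrapy version with response.follow): everything yielded as an item ends up in process_item() of the enabled pipelines, while everything yielded as a Request goes back to the scheduler and eventually triggers its callback with the downloaded response.

import scrapy

class FlowDemoSpider(scrapy.Spider):
    name = 'flow_demo'
    start_urls = ['http://www.example.com/1.html']

    def parse(self, response):
        # Dict/Item objects go to the item pipelines and exporters.
        yield {'title': response.xpath('//title/text()').get()}
        # Requests go to the scheduler, then the downloader, then back to parse().
        for href in response.xpath('//a/@href').getall():
            yield response.follow(href, callback=self.parse)

# pipelines.py
class PrintTitlePipeline:
    def process_item(self, item, spider):
        # Only scraped items arrive here, never the requests.
        print(item['title'])
        return item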
The whole data-flow process is described in the Architecture Overview page in a very detailed manner.
Hope that helps.