How to change depth limit while crawling with scrapy?

How to change depth limit while crawling with scrapy? - python

I want to either disable the depth checking and iteration for a method in my spider or change the depth limit while crawling. Here's some of my code:
def start_requests(self):
if isinstance(self.vuln, context.GenericVulnerability):
yield Request(
self.vuln.base_url,
callback=self.determine_aliases,
meta=self._normal_meta,
)
else:
for url in self.vuln.entrypoint_urls:
yield Request(
url, callback=self.parse, meta=self._patch_find_meta
)
#inline_requests
def determine_aliases(self, response):
vulns = [self.vuln]
processed_vulns = set()
while vulns:
vuln = vulns.pop()
if vuln.vuln_id is not self.vuln.vuln_id:
response = yield Request(vuln.base_url)
processed_vulns.add(vuln.vuln_id)
aliases = context.create_vulns(*list(self.parse(response)))
for alias in aliases:
if alias.vuln_id in processed_vulns:
continue
if isinstance(alias, context.GenericVulnerability):
vulns.append(alias)
else:
logger.info("Alias discovered: %s", alias.vuln_id)
self.cves.add(alias)
yield from self._generate_requests_for_vulns()
def _generate_requests_for_vulns(self):
for vuln in self.cves:
for url in vuln.entrypoint_urls:
yield Request(
url, callback=self.parse, meta=self._patch_find_meta
)
My program is such that the user can give the depth limit they need/want as an input. Under some conditions, my default parse method allows recursively crawling links.
determine_aliases is kind of a preprocessing method, and the requests generated from _generate_requests_for_vulns are for the actual solution.
As you can see, I scrape the data I need from the response and store that in a set attribute 'cves' in my spider class from determine_aliases. Once that's done, I yield Requests w/r/t that data from _generate_requests_for_vulns.
The problem here is that either yielding requests from determine_aliases or calling determine_aliases as a callback iterates the depth. So when I yield Requests from _generate_requests_for_vulns for further crawling, my depth limit is reached sooner than expected.
Note that the actual crawling solution starts from the requests generated by _generate_requests_for_vulns, so the given depth limit should be applied only from those requests.

I ended up solving this by creating a middleware to reset the depth to 0. I pass a meta argument in the request with "reset_depth" as True, upon which the middleware alters the request's depth parameter.
class DepthResetMiddleware(object):
def process_spider_output(self, response, result, spider):
for r in result:
if not isinstance(r, Request):
yield r
continue
if (
"depth" in r.meta
and "reset_depth" in r.meta
and r.meta["reset_depth"]
):
r.meta["depth"] = 0
yield r
The Request should be yielded from the spider somehow like this:
yield Request(url, meta={"reset_depth": True})
Then add the middleware to your settings. The order matters, as this middleware should be executed before the DepthMiddleware is. Since the default DepthMiddleware order is 900, I set DepthResetMiddleware's order to 850 in my CrawlerProcess like so:
"SPIDER_MIDDLEWARES": {
"patchfinder.middlewares.DepthResetMiddleware": 850
}
Don't know if this is the best solution but it works. Another option is to perhaps extend DepthMiddleware and add this functionality there.

Related

Scrapy passing requests along

I previously used some code like this to visit a page and change the url around a bit to generate a second request which gets passed to a second parse method:
from scrapy.http import Request
def parse_final_page(self, response):
# do scraping here:
def get_next_page(self, response, new_url):
req = Request(
url=new_url,
callback=self.parse_final_page,
)
yield req
def parse(self, response):
if 'substring' in response.url:
new_url = 'some_new_url'
yield from self.get_next_page(response, new_url)
else:
pass
# continue..
# scraping items
# yield
This snippet is pretty old (2 years or so) and i'm currently using Scrapy 2.2, although i'm not sure if that's relevant. Note that get_next_page gets called, but parse_final_page never runs, which I don't get...
Why is parse_final_page not being called? Or more to the point.. is there an easier way for me to just generate a new request on the fly? I would prefer to not use a middleware or change start_urls in this context.

1 - "Why is parse_final_page not being called?"
Your script works fine for me on Scrapy v2.2.1, so its probably an issue with the specific request you're trying to make.
2 - "...is there an easier way for me to just generate a new request on the fly?"
You could try this variation where you return the request from the get_next_page callback, instead of yielding it (note I removed the from keyword and did not send the response object to the callback):
def parse(self, response):
if 'substring' in response.url:
new_url = ''
yield self.get_next_page(new_url)
else:
# continue..
# scraping items
# yield
def get_next_page(self, new_url):
req = Request(
url=new_url,
callback=self.parse_final_page,
)
return req
def parse_final_page(self, response):
# do scraping here:

Scrapy get pre-redirect url

I've a crawler running without troubles but i need to get the start_url and not the redirected one.
The problem is i'm using rules to pass parameters to the URL ( like field-keywords=xxxxx ) and finally get the correct url.
The parse function starts getting the item attributes without any troubles but when i want the start URL ( the true one ) it stores the redirected one ...
I've tryed:
response.url
response.request.meta.get('redirect_urls')
Both returns the final url ( the redirected one ) and not the start_url.
Some one know why, or has any clue ?
Thanks in advance.

use a Spider Middleware to keep track of the start url from every response:
from scrapy import Request
class StartRequestsMiddleware(object):
start_urls = {}
def process_start_requests(self, start_requests, spider):
for i, request in enumerate(start_requests):
request.meta.update(start_url=request.url)
yield request
def process_spider_output(self, response, result, spider):
for output in result:
if isinstance(output, Request):
output.meta.update(
start_url=response.meta['start_url'],
)
yield output
keep track of the start_url every response comes from with:
response.meta['start_url']

Have you tried response.request.url? I personally would override the start_requests method adding the original url in the meta, something like:
yield Request(url, meta={'original_request': url})
And then extract it using response.meta['original_request']

Urls expire due to timestamp authentication in scrapy

I was trying to crawl amazon grocery uk, and to get the grocery categories, I was using the Associate Product Advertising api. My requests get enqueued however as the requests have an expiry of 15 mins, some requests are crawled after 15 mins of being enqueued which means they get expired by the time they are crawled and yield a 400 error. I was thinking of a solution of enqueueing requests in a batch, but even that will fail if the implementation controls processing them in batches as the problem is preparing the request in batches as opposed to processing them in batches. Unfortunately, Scrapy has little documentation for this use case, so how can requests be prepared in batches?
from scrapy.spiders import XMLFeedSpider
from scrapy.utils.misc import arg_to_iter
from scrapy.loader.processors import TakeFirst
from crawlers.http import AmazonApiRequest
from crawlers.items import (AmazonCategoryItemLoader)
from crawlers.spiders import MySpider
class AmazonCategorySpider(XMLFeedSpider, MySpider):
name = 'amazon_categories'
allowed_domains = ['amazon.co.uk', 'ecs.amazonaws.co.uk']
marketplace_domain_name = 'amazon.co.uk'
download_delay = 1
rotate_user_agent = 1
grocery_node_id = 344155031
# XMLSpider attributes
iterator = 'xml'
itertag = 'BrowseNodes/BrowseNode/Children/BrowseNode'
def start_requests(self):
return arg_to_iter(
AmazonApiRequest(
qargs=dict(Operation='BrowseNodeLookup',
BrowseNodeId=self.grocery_node_id),
meta=dict(ancestor_node_id=self.grocery_node_id)
))
def parse(self, response):
response.selector.remove_namespaces()
has_children = bool(response.xpath('//BrowseNodes/BrowseNode/Children'))
if not has_children:
return response.meta['category']
# here the request should be configurable to allow batching
return super(AmazonCategorySpider, self).parse(response)
def parse_node(self, response, node):
category = response.meta.get('category')
l = AmazonCategoryItemLoader(selector=node)
l.add_xpath('name', 'Name/text()')
l.add_value('parent', category)
node_id = l.get_xpath('BrowseNodeId/text()', TakeFirst(), lambda x: int(x))
l.add_value('node_id', node_id)
category_item = l.load_item()
return AmazonApiRequest(
qargs=dict(Operation='BrowseNodeLookup',
BrowseNodeId=node_id),
meta=dict(ancestor_node_id=node_id,
category=category_item)
)

One way of doing this:
Since there are two places where you yield requests you can leverage priority attribute to prioritise requests coming from parse method:
class MySpider(Spider):
name = 'myspider'
def start_requests(self):
for url in very_long_list:
yield Request(url)
def parse(self, response):
for url in short_list:
yield Reuest(url, self.parse_item, priority=1000)
def parse_item(self, response):
# parse item
In this example scrapy will prioritize requests coming out from parse which will allow you to avoid the time limit.
See more on Request.priority:
priority (int) – the priority of this request (defaults to 0). The priority is used by the scheduler to define the order used to process requests. Requests with a higher priority value will execute earlier. Negative values are allowed in order to indicate relatively low-priority.
on scrapy docs

Multiple pages per item in Scrapy

Disclaimer: I'm fairly new to Scrapy.
To put my question plainly: How can I retrieve an Item property from a link on a page and get the results back into the same Item?
Given the following sample Spider:
class SiteSpider(Spider):
site_loader = SiteLoader
...
def parse(self, response):
item = Place()
sel = Selector(response)
bl = self.site_loader(item=item, selector=sel)
bl.add_value('domain', self.parent_domain)
bl.add_value('origin', response.url)
for place_property in item.fields:
parse_xpath = self.template.get(place_property)
# parse_xpath will look like either:
# '//path/to/property/text()'
# or
# {'url': '//a[#id="Location"]/#href',
# 'xpath': '//div[#class="directions"]/span[#class="address"]/text()'}
if isinstance(parse_xpath, dict): # place_property is at a URL
url = sel.xpath(parse_xpath['url_elem']).extract()
yield Request(url, callback=self.get_url_property,
meta={'loader': bl, 'parse_xpath': parse_xpath,
'place_property': place_property})
else: # parse_xpath is just an xpath; process normally
bl.add_xpath(place_property, parse_xpath)
yield bl.load_item()
def get_url_property(self, response):
loader = response.meta['loader']
parse_xpath = response.meta['parse_xpath']
place_property = response.meta['place_property']
sel = Selector(response)
loader.add_value(place_property, sel.xpath(parse_xpath['xpath'])
return loader
I'm running these spiders against multiple sites, and most of them have the data I need on one page and it works just fine. However, some sites have certain properties on a sub-page (ex., the "address" data existing at the "Get Directions" link).
The "yield Request" line is really where I have the problem. I see the items move through the pipeline, but they're missing those properties that are found at other URLs (IOW, those properties that get to "yield Request"). The get_url_property callback is basically just looking for an xpath within the new response variable, and adding that to the item loader instance.
Is there a way to do what I'm looking for, or is there a better way? I would like to avoid making a synchronous call to get the data I need (if that's even possible here), but if that's the best way, then maybe that's the right approach. Thanks.

If I understand you correctly, you have (at least) two different cases:
The crawled page links to another page containing the data (1+ further request necessary)
The crawled page contains the data (No further request necessary)
In your current code, you call yield bl.load_item() for both cases, but in the parse callback. Note that the request you yield is executed some later point in time, thus the item is incomplete and that's why you're missing the place_property key from the item for the first case.
Possible Solution
A possible solution (If I understood you correctly) Is to exploit the asynchronous behavior of Scrapy. Only minor changes to your code are involved.
For the first case, you pass the item loader to another request, which will then yield it. This is what you do in the isinstance if clause. You'll need to change the return value of the get_url_property callback to actually yield the loaded item.
For the second case, you can return the item directly,
thus simply yield the item in the else clause.
The following code contains the changes to your example.
Does this solve your problem?
def parse(self, response):
# ...
if isinstance(parse_xpath, dict): # place_property is at a URL
url = sel.xpath(parse_xpath['url_elem']).extract()
yield Request(url, callback=self.get_url_property,
meta={'loader': bl, 'parse_xpath': parse_xpath,
'place_property': place_property})
else: # parse_xpath is just an xpath; process normally
bl.add_xpath(place_property, parse_xpath)
yield bl.load_item()
def get_url_property(self, response):
loader = response.meta['loader']
# ...
loader.add_value(place_property, sel.xpath(parse_xpath['xpath'])
yield loader.load_item()
Related to that problem is the question of chaining requests, for which I have noted a similar solution.

Crawling multiple starting urls with different depth

I'm trying to get Scrapy 0.12 to change it's "maximum depth" setting for different url in the start_urls variable in the spider.
If I understand correctly the documentation there's no way because the DEPTH_LIMIT setting is global for the entire framework and there's no notion of "requests originated from the initial one".
Is there a way to circumvent this? Is it possible to have multiple instances of the same spider initialized with each starting url and different depth limits?

Sorry, looks like i didn't understand you question correctly from the beginning. Correcting my answer:
Responses have depth key in meta. You can check it and take appropriate action.
class MySpider(BaseSpider):
def make_requests_from_url(self, url):
return Request(url, dont_filter=True, meta={'start_url': url})
def parse(self, response):
if response.meta['start_url'] == '???' and response.meta['depth'] > 10:
# do something here for exceeding limit for this start url
else:
# find links and yield requests for them with passing the start url
yield Request(other_url, meta={'start_url': response.meta['start_url']})
http://doc.scrapy.org/en/0.12/topics/spiders.html#scrapy.spider.BaseSpider.make_requests_from_url

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to change depth limit while crawling with scrapy? - python

Related

Scrapy passing requests along

Scrapy get pre-redirect url

Urls expire due to timestamp authentication in scrapy

Multiple pages per item in Scrapy

Crawling multiple starting urls with different depth

Categories

Resources