I'm trying to get Scrapy 0.12 to change its "maximum depth" setting for different URLs in the start_urls variable in the spider.
If I understand the documentation correctly, there's no way to do this, because the DEPTH_LIMIT setting is global for the entire framework and there's no notion of "requests originating from the initial one".
Is there a way to circumvent this? Is it possible to have multiple instances of the same spider, each initialized with its own starting URL and a different depth limit?
Sorry, it looks like I didn't understand your question correctly at first. Correcting my answer:
Responses have a depth key in their meta. You can check it and take the appropriate action.
from scrapy.http import Request
from scrapy.spider import BaseSpider


class MySpider(BaseSpider):

    def make_requests_from_url(self, url):
        # tag every start request with the url it originated from
        return Request(url, dont_filter=True, meta={'start_url': url})

    def parse(self, response):
        if response.meta['start_url'] == '???' and response.meta['depth'] > 10:
            # do something here for exceeding the limit for this start url
            pass
        else:
            # find links and yield requests for them, passing the start url along
            yield Request(other_url, meta={'start_url': response.meta['start_url']})
http://doc.scrapy.org/en/0.12/topics/spiders.html#scrapy.spider.BaseSpider.make_requests_from_url
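To make the idea concrete, here is a fuller sketch of the same approach, with hypothetical per-start-url limits kept in a dict in place of the '???' placeholder above. It is written against the 0.12-era API (BaseSpider, HtmlXPathSelector) as best I recall it, so treat it as an illustration rather than tested code:

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class PerStartUrlDepthSpider(BaseSpider):
    name = 'per_start_url_depth'
    start_urls = ['http://site-a.example/', 'http://site-b.example/']
    # hypothetical per-start-url depth limits
    depth_limits = {'http://site-a.example/': 3, 'http://site-b.example/': 10}

    def make_requests_from_url(self, url):
        # remember which start url this request chain belongs to
        return Request(url, dont_filter=True, meta={'start_url': url})

    def parse(self, response):
        start_url = response.meta['start_url']
        if response.meta.get('depth', 0) >= self.depth_limits[start_url]:
            return  # this start url's own limit is reached; stop following links
        hxs = HtmlXPathSelector(response)
        for href in hxs.select('//a/@href').extract():
            # assumes absolute links; propagate the start url to child requests
            yield Request(href, meta={'start_url': start_url})

The global DEPTH_LIMIT can then be left unset (or set to the largest of the per-url limits), since the spider enforces its own limit per start url.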
I want to either disable the depth checking and iteration for a method in my spider or change the depth limit while crawling. Here's some of my code:
def start_requests(self):
    if isinstance(self.vuln, context.GenericVulnerability):
        yield Request(
            self.vuln.base_url,
            callback=self.determine_aliases,
            meta=self._normal_meta,
        )
    else:
        for url in self.vuln.entrypoint_urls:
            yield Request(
                url, callback=self.parse, meta=self._patch_find_meta
            )

@inline_requests
def determine_aliases(self, response):
    vulns = [self.vuln]
    processed_vulns = set()
    while vulns:
        vuln = vulns.pop()
        if vuln.vuln_id is not self.vuln.vuln_id:
            response = yield Request(vuln.base_url)
        processed_vulns.add(vuln.vuln_id)
        aliases = context.create_vulns(*list(self.parse(response)))
        for alias in aliases:
            if alias.vuln_id in processed_vulns:
                continue
            if isinstance(alias, context.GenericVulnerability):
                vulns.append(alias)
            else:
                logger.info("Alias discovered: %s", alias.vuln_id)
                self.cves.add(alias)
    yield from self._generate_requests_for_vulns()

def _generate_requests_for_vulns(self):
    for vuln in self.cves:
        for url in vuln.entrypoint_urls:
            yield Request(
                url, callback=self.parse, meta=self._patch_find_meta
            )
My program lets the user give the depth limit they need as an input. Under some conditions, my default parse method allows recursively crawling links.
determine_aliases is a kind of preprocessing method, and the requests generated from _generate_requests_for_vulns are for the actual solution.
As you can see, in determine_aliases I scrape the data I need from the response and store it in a set attribute 'cves' on my spider class. Once that's done, I yield Requests based on that data from _generate_requests_for_vulns.
The problem is that either yielding requests from determine_aliases or calling determine_aliases as a callback increments the depth. So when I yield Requests from _generate_requests_for_vulns for further crawling, my depth limit is reached sooner than expected.
Note that the actual crawling solution starts from the requests generated by _generate_requests_for_vulns, so the given depth limit should be applied only from those requests.
I ended up solving this by creating a middleware to reset the depth to 0. I pass a meta argument in the request with "reset_depth" as True, upon which the middleware alters the request's depth parameter.
from scrapy import Request


class DepthResetMiddleware(object):

    def process_spider_output(self, response, result, spider):
        for r in result:
            if not isinstance(r, Request):
                yield r
                continue
            if (
                "depth" in r.meta
                and "reset_depth" in r.meta
                and r.meta["reset_depth"]
            ):
                r.meta["depth"] = 0
            yield r
The Request should then be yielded from the spider something like this:
yield Request(url, meta={"reset_depth": True})
Then add the middleware to your settings. The order matters, as this middleware should be executed before the DepthMiddleware is. Since the default DepthMiddleware order is 900, I set DepthResetMiddleware's order to 850 in my CrawlerProcess like so:
"SPIDER_MIDDLEWARES": {
"patchfinder.middlewares.DepthResetMiddleware": 850
}
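For reference, a minimal sketch of how that could be wired up when running from a script; the middleware path is the one from this answer, while the spider class and the depth limit value are just illustrative placeholders:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    "DEPTH_LIMIT": 3,  # whatever limit the user supplied
    "SPIDER_MIDDLEWARES": {
        # 850 is chosen relative to DepthMiddleware's default order of 900
        "patchfinder.middlewares.DepthResetMiddleware": 850,
    },
})
process.crawl(PatchFinderSpider)  # placeholder for your spider class
process.start()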
Don't know if this is the best solution but it works. Another option is to perhaps extend DepthMiddleware and add this functionality there.
I'd like to crawl a certain URL which returns a random response each time it's called. The code below returns what I want, but I'd like to run it for a long time so that I can use the data for an NLP application. The code only runs once with scrapy crawl the, although I expect it to keep running because of the last if statement.
Is Unix's start command what I'm looking for? I tried it, but it felt a bit slow. If I had to use the start command, would opening many terminal tabs and running the same command with the start prefix be good practice, or would it just throttle the speed?
import scrapy


class TheSpider(scrapy.Spider):
    name = 'the'
    allowed_domains = ['https://websiteiwannacrawl.com']
    start_urls = ['https://websiteiwannacrawl.com']

    def parse(self, response):
        info = {}
        info['text'] = response.css('.pd-text').extract()
        yield info

        next_page = 'https://websiteiwannacrawl.com'
        if next_page is not None:
            yield scrapy.Request(next_page, callback=self.parse)
dont_filter
indicates that this request should not be filtered by the scheduler.
This is used when you want to perform an identical request multiple
times, to ignore the duplicates filter. Use it with care, or you will
get into crawling loops. Defaults to False.
You should add this to your Request:
yield scrapy.Request(next_page, dont_filter=True)
This isn't related to your question, but since you use callback=self.parse, please also read about the parse method in the docs.
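Putting both notes together, a sketch of how the spider from the question might look (selectors unchanged; the allowed_domains entry is trimmed to a bare domain, since that setting expects domain names rather than full URLs):

import scrapy


class TheSpider(scrapy.Spider):
    name = 'the'
    allowed_domains = ['websiteiwannacrawl.com']
    start_urls = ['https://websiteiwannacrawl.com']

    def parse(self, response):
        yield {'text': response.css('.pd-text').extract()}
        # request the same page again; dont_filter bypasses the duplicate filter,
        # so this keeps looping until you stop the crawl (for example with the
        # CLOSESPIDER_ITEMCOUNT setting)
        yield scrapy.Request('https://websiteiwannacrawl.com',
                             callback=self.parse, dont_filter=True)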
here is my code:
def parse(self, response):
    selector = Selector(response)
    sites = selector.xpath("//h3[@class='r']/a/@href")
    for index, site in enumerate(sites):
        url = result.group(1)
        print url
        yield Request(url=site.extract(), callback=self.parsedetail)

def parsedetail(self, response):
    print response.url
    ...
    obj = Store.objects.filter(id=store_obj.id, add__isnull=True)
    if obj:
        obj.update(add=add)
In def parse, Scrapy gets the URLs from Google.
The URL output looks like:
www.test.com
www.hahaha.com
www.apple.com
www.rest.com
But when they are yielded to def parsedetail,
the URLs are not in that order; they may become:
www.rest.com
www.test.com
www.hahaha.com
www.apple.com
Is there any way I can yield the URLs in order to def parsedetail?
I need to crawl www.test.com first (the data provided by the top URL in the Google search is more accurate).
If there is no data in it,
I will go to the next URL (www.hahaha.com, www.apple.com, www.rest.com) until the empty field is updated.
Please guide me, thank you!
By default, the order in which Scrapy requests are scheduled and sent is not defined. But you can control it using the priority keyword argument:
priority (int) – the priority of this request (defaults to 0). The
priority is used by the scheduler to define the order used to process
requests. Requests with a higher priority value will execute earlier.
Negative values are allowed in order to indicate relatively
low-priority.
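For example, a sketch (reusing the selectors from the question; the spider name and start URL are placeholders) that gives earlier Google results a higher priority so they are scheduled first:

from scrapy import Request, Spider


class GoogleOrderSpider(Spider):  # hypothetical spider, for illustration only
    name = 'google_order'
    start_urls = ['https://www.google.com/search?q=test']  # placeholder query

    def parse(self, response):
        sites = response.xpath("//h3[@class='r']/a/@href").extract()
        for index, site in enumerate(sites):
            # the first result gets the highest priority, so it is dequeued first
            yield Request(site, callback=self.parsedetail,
                          priority=len(sites) - index)

    def parsedetail(self, response):
        self.logger.info("Crawled %s", response.url)

Keep in mind that priority only affects the order in which requests leave the scheduler; with concurrent downloads the responses can still arrive slightly out of order.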
You can also make the crawling synchronous by passing around the callstack inside the meta dictionary, as an example see this answer.
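A rough sketch of that synchronous idea (not the linked answer itself, just an illustration): keep the remaining result URLs in meta and only move on to the next one from inside the callback, so www.test.com is always fully processed before www.hahaha.com is even requested. The "no data" check and the spider details are placeholders:

from scrapy import Request, Spider


class SequentialSpider(Spider):  # hypothetical spider, for illustration only
    name = 'sequential'
    start_urls = ['https://www.google.com/search?q=test']  # placeholder query

    def parse(self, response):
        sites = response.xpath("//h3[@class='r']/a/@href").extract()
        if sites:
            # only request the top result; carry the rest along in meta
            yield Request(sites[0], callback=self.parsedetail,
                          meta={'remaining': sites[1:]})

    def parsedetail(self, response):
        data = response.xpath('//some/data/text()').extract()  # placeholder xpath
        if data:
            return  # the highest-ranked result had what we need; stop here
        remaining = response.meta['remaining']
        if remaining:
            # nothing useful on this page, fall through to the next result in order
            yield Request(remaining[0], callback=self.parsedetail,
                          meta={'remaining': remaining[1:]})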
I am new to Scrapy, so I am sorry if this question is trivial. I have read the documentation on the official webpage, and while looking through it I came across this example:
import scrapy
from myproject.items import MyItem


class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
I know the parse method must return an item and/or requests, but where are these return values sent to?
One is an item and the other is a request, and I think these two types are handled differently. In the case of CrawlSpider, it has Rules with callbacks; what about such a callback's return value? Where does it go? The same place as parse()'s?
I am very confused about the Scrapy workflow, even after reading the documentation.
According to the documentation:
The parse() method is in charge of processing the response and
returning scraped data (as Item objects) and more URLs to follow (as
Request objects).
In other words, returned/yielded items and requests are handled differently: items are handed to the item pipelines and item exporters, while requests are put into the Scheduler, which pipes them to the Downloader for making the request and returning a response. The engine then receives the response and gives it back to the spider for processing (in the callback method).
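To make the item side of that flow concrete, here is a minimal, hypothetical item pipeline that would receive every MyItem the spider above yields, once it is enabled via the ITEM_PIPELINES setting:

class TitleLoggingPipeline:
    """Hypothetical pipeline: Scrapy calls process_item for every yielded item."""

    def process_item(self, item, spider):
        spider.logger.info("Got title: %s", item['title'])
        return item  # hand the item on to the next pipeline / the exporters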
The whole data-flow process is described in the Architecture Overview page in a very detailed manner.
Hope that helps.
Disclaimer: I'm fairly new to Scrapy.
To put my question plainly: How can I retrieve an Item property from a link on a page and get the results back into the same Item?
Given the following sample Spider:
class SiteSpider(Spider):
    site_loader = SiteLoader
    ...

    def parse(self, response):
        item = Place()
        sel = Selector(response)
        bl = self.site_loader(item=item, selector=sel)
        bl.add_value('domain', self.parent_domain)
        bl.add_value('origin', response.url)
        for place_property in item.fields:
            parse_xpath = self.template.get(place_property)
            # parse_xpath will look like either:
            #   '//path/to/property/text()'
            # or
            #   {'url': '//a[@id="Location"]/@href',
            #    'xpath': '//div[@class="directions"]/span[@class="address"]/text()'}
            if isinstance(parse_xpath, dict):  # place_property is at a URL
                url = sel.xpath(parse_xpath['url_elem']).extract()
                yield Request(url, callback=self.get_url_property,
                              meta={'loader': bl, 'parse_xpath': parse_xpath,
                                    'place_property': place_property})
            else:  # parse_xpath is just an xpath; process normally
                bl.add_xpath(place_property, parse_xpath)
        yield bl.load_item()

    def get_url_property(self, response):
        loader = response.meta['loader']
        parse_xpath = response.meta['parse_xpath']
        place_property = response.meta['place_property']
        sel = Selector(response)
        loader.add_value(place_property, sel.xpath(parse_xpath['xpath']))
        return loader
I'm running these spiders against multiple sites, and most of them have the data I need on one page and it works just fine. However, some sites have certain properties on a sub-page (ex., the "address" data existing at the "Get Directions" link).
The "yield Request" line is really where I have the problem. I see the items move through the pipeline, but they're missing those properties that are found at other URLs (IOW, those properties that get to "yield Request"). The get_url_property callback is basically just looking for an xpath within the new response variable, and adding that to the item loader instance.
Is there a way to do what I'm looking for, or is there a better way? I would like to avoid making a synchronous call to get the data I need (if that's even possible here), but if that's the best way, then maybe that's the right approach. Thanks.
If I understand you correctly, you have (at least) two different cases:
The crawled page links to another page containing the data (1+ further request necessary)
The crawled page contains the data (No further request necessary)
In your current code, you call yield bl.load_item() for both cases, and you do so in the parse callback. Note that the request you yield is only executed at some later point in time, so the item is still incomplete at that moment; that is why you're missing the place_property key from the item in the first case.
Possible Solution
A possible solution (if I understood you correctly) is to exploit the asynchronous behavior of Scrapy. Only minor changes to your code are needed.
For the first case, you pass the item loader to another request, which will then yield it. This is what you do in the isinstance if clause. You'll need to change the return value of the get_url_property callback to actually yield the loaded item.
For the second case, you can return the item directly, so simply yield the item in the else clause.
The following code contains the changes to your example.
Does this solve your problem?
def parse(self, response):
    # ...
    if isinstance(parse_xpath, dict):  # place_property is at a URL
        url = sel.xpath(parse_xpath['url_elem']).extract()
        yield Request(url, callback=self.get_url_property,
                      meta={'loader': bl, 'parse_xpath': parse_xpath,
                            'place_property': place_property})
    else:  # parse_xpath is just an xpath; process normally
        bl.add_xpath(place_property, parse_xpath)
        yield bl.load_item()

def get_url_property(self, response):
    loader = response.meta['loader']
    # ...
    loader.add_value(place_property, sel.xpath(parse_xpath['xpath']))
    yield loader.load_item()
Related to that problem is the question of chaining requests, for which I have noted a similar solution.
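To illustrate that chaining idea, here is a rough, self-contained sketch (written against a more recent Scrapy API; URLs, field names and the item class are all made up) in which the partially filled loader travels through several callbacks before the finished item is yielded at the very end of the chain:

import scrapy
from scrapy import Request, Spider
from scrapy.loader import ItemLoader


class PlaceItem(scrapy.Item):  # stand-in for the question's Place item
    origin = scrapy.Field()
    address = scrapy.Field()
    hours = scrapy.Field()


class ChainedSpider(Spider):  # hypothetical spider, for illustration only
    name = 'chained'
    start_urls = ['http://example.com/place']  # placeholder

    def parse(self, response):
        loader = ItemLoader(item=PlaceItem(), response=response)
        loader.add_value('origin', response.url)
        # hand the half-filled loader to the next page's callback
        yield Request('http://example.com/place/directions',  # placeholder URL
                      callback=self.parse_address,
                      meta={'loader': loader})

    def parse_address(self, response):
        loader = response.meta['loader']
        loader.add_value('address',
                         response.xpath('//span[@class="address"]/text()').extract())
        # keep chaining: pass the same loader on to yet another sub-page
        yield Request('http://example.com/place/hours',  # placeholder URL
                      callback=self.parse_hours,
                      meta={'loader': loader})

    def parse_hours(self, response):
        loader = response.meta['loader']
        loader.add_value('hours',
                         response.xpath('//div[@class="hours"]/text()').extract())
        # only the last callback in the chain yields the finished item
        yield loader.load_item()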