Scrapy: Can't restart start_requests() properly - python

I have a scraper that requests two pages: one is the main page, and the other is a .js file containing the long/lat coordinates I need to extract, because I need them later in the parsing process. I want to process the .js file first, extract the coordinates, and then parse the main page and start crawling its links/parsing its items.
For this purpose I am using the priority parameter of the Request and saying that I want my .js page to be processed first. This works, but only around 70% of the time (it must be due to Scrapy's asynchronous requests). The other 30% of the time I end up in my parse method trying to parse the .js long/lat coordinates, but having been handed the main website page instead, so it's impossible to parse them.
For this reason, I tried to fix it this way:
in the parse() method, check which URL this is (by its index); if it is the first one and it is not the .js one, restart the spider. However, when I restart the spider it now correctly handles the .js page first, but after processing it the spider finishes its work and exits the script without an error, as if it had completed.
Why is that happening, what is the difference in how the pages are processed when I restart the spider compared to when I just start it, and how can I fix this problem?
This is the code, with sample outputs in both scenarios from when I was trying to debug what gets executed and why it stops after being restarted.
import re

from scrapy import Request, Spider


class QuotesSpider(Spider):
    name = "bot"
    url_id = 0
    home_url = 'https://website.com'
    longitude = None
    latitude = None

    def __init__(self, cat=None):
        self.cat = cat.replace("-", " ")

    def start_requests(self):
        print("Starting spider")
        self.start_urls = [
            self.home_url,
            self.home_url + 'js-file-with-long-lat.js'
        ]
        for priority, url in enumerate(self.start_urls):
            print("Processing", url)
            yield Request(url=url, priority=priority, callback=self.parse)

    def parse(self, response):
        print("Inside parse")
        if self.url_id == 0 and response.url == self.home_url:
            self.alert("Loaded main page before long/lat page, restarting", False)
            for _ in self.start_requests():
                yield _
        else:
            print("Everything is good, url id is", str(self.url_id))
            self.url_id += 1
            if self.longitude is None:
                for _ in self.parse_long_lat(response):
                    yield _
            else:
                print("Calling parse cats")
                for cat in self.parse_cats(response):
                    yield cat

    def parse_long_lat(self, response):
        print("called long lat")
        try:
            self.latitude = re.search(r'latitude:(\-?[0-9]{1,2}\.?[0-9]*)',
                                      response.text).group(1)
            self.longitude = re.search(r'longitude:(\-?[0-9]{1,3}\.?[0-9]*)',
                                       response.text).group(1)
            print("Extracted coords")
            yield None
        except AttributeError:
            self.alert("\nCan't extract lat/long coordinates, store availability will not be parsed. ", False)
            yield None

    def parse_cats(self, response):
        """ Parsing links code goes here """
        pass
Output when the spider starts correctly, fetches the .js page first and then starts parsing the cats:
Starting spider
https://website.com
https://website.com/js-file-with-long-lat.js
Inside parse
Everything is good, url id is 0
called long lat
Extracted coords
Inside parse
Everything is good, url id is 1
Calling parse cats
And the script goes on and parses everything fine.
Output when the spider starts incorrectly, fetches the main page first and restarts start_requests():
Starting spider
https://website.com
https://website.com/js-file-with-long-lat.js
Inside parse
Loaded main page before long/lat page, restarting
Starting spider
https://website.com
https://website.com/js-file-with-long-lat.js
Inside parse
Everything is good, url id is 0
called long lat
Extracted coords
And the script stops execution without an error, as if it had completed.
P.S. If this matters, I did notice that the URLs in start_requests() are processed in reverse order, but I find this natural due to the loop sequence, and I expect the priority param to do its job (as it does most of the time, and as it should per Scrapy's docs).

As to why your Spider doesn't continue in the "restarting" case: you probably run afoul of duplicate requests being filtered/dropped. Since the pages have already been visited, Scrapy thinks it's done.
So you would have to re-send these requests with a dont_filter=True argument:
    for priority, url in enumerate(self.start_urls):
        print("Processing", url)
        yield Request(url=url, dont_filter=True, priority=priority, callback=self.parse)
        #                      ^^^^^^^^^^^^^^^^ notice us forcing the Dupefilter to
        #                                       ignore duplicate requests to these pages
As for a better solution than this hacky approach, consider using InitSpider (as one example; other approaches exist). This guarantees your "initial" work got done and can be depended on.
(For some reason the class was never documented in the Scrapy docs, but it's a relatively simple Spider subclass: do some initial work before starting the actual run.)
And here is a code-example for that:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders.init import InitSpider


class QuotesSpider(InitSpider):
    name = 'quotes'
    allowed_domains = ['website.com']
    start_urls = ['https://website.com']

    # Without this method override, InitSpider behaves like Spider.
    # This is used _instead of_ start_requests. (Do not override start_requests.)
    def init_request(self):
        # The last request that finishes the initialization needs
        # to have the `self.initialized()` method as callback.
        url = self.start_urls[0] + '/js-file-with-long-lat.js'
        yield scrapy.Request(url, callback=self.parse_long_lat, dont_filter=True)

    def parse_long_lat(self, response):
        """ The callback for our init request. """
        print("called long lat")
        # do some work and maybe return stuff
        self.latitude = None
        self.longitude = None
        #yield stuff_here

        # Finally, start our run.
        return self.initialized()
        # Now we are "initialized", will process `start_urls`
        # and continue from there.

    def parse(self, response):
        print("Inside parse")
        print("Everything is good, do parse_cats stuff here")
which would result in output like this:
2019-01-10 20:36:20 [scrapy.core.engine] INFO: Spider opened
2019-01-10 20:36:20 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-10 20:36:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1/js-file-with-long-lat.js> (referer: None)
called long lat
2019-01-10 20:36:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1> (referer: http://127.0.0.1/js-file-with-long-lat.js/)
Inside parse
Everything is good, do parse_cats stuff here
2019-01-10 20:36:21 [scrapy.core.engine] INFO: Closing spider (finished)

So I finally handled it with a workaround:
I check which response.url was received in parse() and, based on that, send the further parsing to the corresponding method:
def start_requests(self):
    self.start_urls = [
        self.home_url,
        self.home_url + 'js-file-with-long-lat.js'
    ]
    for priority, url in enumerate(self.start_urls):
        yield Request(url=url, priority=priority, callback=self.parse)

def parse(self, response):
    if response.url != self.home_url:
        for _ in self.parse_long_lat(response):
            yield _
    else:
        for cat in self.parse_cats(response):
            yield cat

Related

How to change depth limit while crawling with scrapy?

I want to either disable the depth checking and iteration for a method in my spider or change the depth limit while crawling. Here's some of my code:
def start_requests(self):
    if isinstance(self.vuln, context.GenericVulnerability):
        yield Request(
            self.vuln.base_url,
            callback=self.determine_aliases,
            meta=self._normal_meta,
        )
    else:
        for url in self.vuln.entrypoint_urls:
            yield Request(
                url, callback=self.parse, meta=self._patch_find_meta
            )

@inline_requests
def determine_aliases(self, response):
    vulns = [self.vuln]
    processed_vulns = set()
    while vulns:
        vuln = vulns.pop()
        if vuln.vuln_id is not self.vuln.vuln_id:
            response = yield Request(vuln.base_url)
        processed_vulns.add(vuln.vuln_id)
        aliases = context.create_vulns(*list(self.parse(response)))
        for alias in aliases:
            if alias.vuln_id in processed_vulns:
                continue
            if isinstance(alias, context.GenericVulnerability):
                vulns.append(alias)
            else:
                logger.info("Alias discovered: %s", alias.vuln_id)
                self.cves.add(alias)
    yield from self._generate_requests_for_vulns()

def _generate_requests_for_vulns(self):
    for vuln in self.cves:
        for url in vuln.entrypoint_urls:
            yield Request(
                url, callback=self.parse, meta=self._patch_find_meta
            )
My program lets the user give the depth limit they need/want as an input. Under some conditions, my default parse method allows recursively crawling links.
determine_aliases is kind of a preprocessing method, and the requests generated from _generate_requests_for_vulns are for the actual solution.
As you can see, in determine_aliases I scrape the data I need from the response and store it in a set attribute 'cves' on my spider class. Once that's done, I yield Requests based on that data from _generate_requests_for_vulns.
The problem here is that either yielding requests from determine_aliases or calling determine_aliases as a callback increments the depth. So when I yield Requests from _generate_requests_for_vulns for further crawling, my depth limit is reached sooner than expected.
Note that the actual crawling solution starts from the requests generated by _generate_requests_for_vulns, so the given depth limit should be applied only from those requests.
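For context, the crawl depth limit itself is just Scrapy's built-in DEPTH_LIMIT setting (enforced by DepthMiddleware, with 0 meaning "no limit"), so a user-supplied limit is normally just forwarded into that setting. A minimal sketch of that wiring, with the spider class and its import path as assumptions:

# Forward a user-supplied depth limit into Scrapy's DEPTH_LIMIT setting.
# On the command line the equivalent would be:
#     scrapy crawl patchfinder -s DEPTH_LIMIT=3
from scrapy.crawler import CrawlerProcess

from patchfinder.spiders import PatchFinderSpider  # hypothetical import path


def run(user_depth_limit):
    process = CrawlerProcess(settings={"DEPTH_LIMIT": user_depth_limit})
    process.crawl(PatchFinderSpider)
    process.start()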
I ended up solving this by creating a middleware to reset the depth to 0. I pass a meta argument in the request with "reset_depth" as True, upon which the middleware alters the request's depth parameter.
class DepthResetMiddleware(object):
    def process_spider_output(self, response, result, spider):
        for r in result:
            if not isinstance(r, Request):
                yield r
                continue
            if (
                "depth" in r.meta
                and "reset_depth" in r.meta
                and r.meta["reset_depth"]
            ):
                r.meta["depth"] = 0
            yield r
The Request should be yielded from the spider somehow like this:
yield Request(url, meta={"reset_depth": True})
Then add the middleware to your settings. The order matters, as this middleware should be executed before the DepthMiddleware is. Since the default DepthMiddleware order is 900, I set DepthResetMiddleware's order to 850 in my CrawlerProcess like so:

"SPIDER_MIDDLEWARES": {
    "patchfinder.middlewares.DepthResetMiddleware": 850,
}
I don't know if this is the best solution, but it works. Another option is to extend DepthMiddleware and add this functionality there.
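That DepthMiddleware-subclass variant could look roughly like this; a minimal sketch, assuming the same reset_depth meta key and a hypothetical myproject.middlewares module:

from scrapy import Request
from scrapy.spidermiddlewares.depth import DepthMiddleware


class ResettingDepthMiddleware(DepthMiddleware):
    def process_spider_output(self, response, result, spider):
        # Let the stock depth bookkeeping run first, then zero out the depth
        # on any request that explicitly asks for it via its meta dict.
        for r in super().process_spider_output(response, result, spider):
            if isinstance(r, Request) and r.meta.get("reset_depth"):
                r.meta["depth"] = 0
            yield r

Since this subclass replaces the stock behaviour rather than supplementing it, it would be registered instead of the built-in entry:

SPIDER_MIDDLEWARES = {
    "scrapy.spidermiddlewares.depth.DepthMiddleware": None,
    "myproject.middlewares.ResettingDepthMiddleware": 900,
}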

How does Scrapy save crawl state?

I am able to save my crawl state, and Scrapy successfully continues from where I cut it off. I have kept start_urls constant each time I restart the spider, i.e. the order and the list of start_urls fed to the spider is the same on every restart. But now I need to randomly shuffle my start_urls: the list contains URLs from different domains as well as from the same domain, and because the same-domain URLs sit next to each other, the crawl delay significantly slows down my crawl speed. My list has tens of millions of URLs and I have already crawled a million of them, so I wouldn't want to jeopardize anything or restart the crawl.
I have seen that requests.seen holds what looks like hashed values of the URLs that have been visited. And from Scrapy code I am certain that it's used to filter duplicates. But I am not sure what either spider.state or requests.queue does to help with saving state or restarting the crawl.
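For reference, those files are the pieces Scrapy itself persists under JOBDIR: requests.queue is the serialized scheduler queue, requests.seen holds the dupefilter's request fingerprints, and spider.state is a pickled copy of the spider's state dict, which you can use for your own resumable bookkeeping. A minimal sketch of the state dict (the counter name is just an example):

# Resuming with the same JOBDIR restores the queue, the dupefilter and self.state:
#     scrapy crawl myspider -s JOBDIR=crawls/myspider-1
import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"

    def parse(self, response):
        # self.state is saved to JOBDIR/spider.state between runs
        self.state["pages_seen"] = self.state.get("pages_seen", 0) + 1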
You can write these requests to a text file, separating them according to whether the callback or the errback was called.
def start_requests(self):
    for u in self.start_urls:
        yield scrapy.Request(u, callback=self.parse,
                             errback=self.err,
                             dont_filter=True)

def parse(self, response):
    with open('successful_requests.txt', 'a') as out:
        out.write(response.url + '\n')

def err(self, failure):
    # the errback receives a Failure, so take the URL from the original request
    with open('failed_requests.txt', 'a') as out:
        out.write(failure.request.url + ' ' + str(failure) + '\n')
To recover the state of the requests, just read these text files.

Scrapy yield request from one spider to another

I have the following code:
# FirstSpider.py
class FirstSpider(scrapy.Spider):
    name = 'first'
    start_urls = ['https://www.basesite.com']
    next_urls = []

    def parse(self, response):
        for url in response.css('bunch > of > css > here'):
            self.next_urls.append(url.css('more > css > here'))
            l = Loader(item=Item(), selector=url.css('more > css'))
            l.add_css('add', 'more > css')
            ...
            ...
            yield l.load_item()

        for url in self.next_urls:
            new_urls = self.start_urls[0] + url
            yield scrapy.Request(new_urls, callback=SecondSpider.parse_url)


# SecondSpider.py
class SecondSpider(scrapy.Spider):
    name = 'second'
    start_urls = ['https://www.basesite.com']

    def parse_url(self):
        """Parse team data."""
        return self
        # self is a HtmlResponse not a 'response' object

    def parse(self, response):
        """Parse all."""
        summary = self.parse_url(response)
        return summary


# ThirdSpider.py
class ThirdSpider(scrapy.Spider):
    # take links from second spider, continue:
I want to be able to pass the URL scraped in Spider 1 to Spider 2 (in a different script). I'm curious why, when I do, the 'response' is an HtmlResponse and not a response object (when doing something similar with a method in the same class as Spider 1, I don't have this issue).
What am I missing here? How do I just pass the original response(s) to the second spider (and from the second on to the third, etc.)?
You could use Redis as a shared resource between all spiders: https://github.com/rmax/scrapy-redis
Run all N spiders (and don't close them on the idle state), so each of them is connected to the same Redis and waits for tasks (url, request headers) from there.
As a side effect, push task data to Redis from X_spider under a specific key (Y_spider's name).
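A rough sketch of that wiring with scrapy-redis; the spider name and the redis_key are assumptions:

# settings.py -- route scheduling and dupe-filtering through a shared Redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True                 # keep the queue when a spider idles/closes
REDIS_URL = "redis://localhost:6379"

# second_spider.py -- waits for URLs that the first spider pushes to its Redis key
from scrapy_redis.spiders import RedisSpider


class SecondSpider(RedisSpider):
    name = 'second'
    redis_key = 'second:start_urls'      # the first spider lpushes scraped URLs here

    def parse(self, response):
        """Parse team data for each URL pulled from Redis."""
        ...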
What about using inheritance? The "parse" function names should be different.
If your first spider inherits from the second, it will be able to set the callback to self.parse_function_spider2.
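A minimal sketch of that idea, reusing parse_url as the distinct callback name from the question (the selectors are the question's own placeholders):

import scrapy


class SecondSpider(scrapy.Spider):
    name = 'second'
    start_urls = ['https://www.basesite.com']

    def parse_url(self, response):
        """Parse team data from a single page."""
        ...


class FirstSpider(SecondSpider):
    name = 'first'

    def parse(self, response):
        # follow the scraped links and hand each response to the inherited callback
        for url in response.css('bunch > of > css > here::attr(href)').getall():
            yield response.follow(url, callback=self.parse_url)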

How to handle large number of requests in scrapy?

I'm crawling around 20 million URLs. But before the requests are actually made, the process gets killed due to excessive memory usage (4 GB RAM). How can I handle this in Scrapy so that the process doesn't get killed?
class MySpider(Spider):
    name = "mydomain"
    allowed_domains = ["mydomain.com"]

    urls = []
    for d in range(0, 20000000):
        link = "http://example.com/" + str(d)
        urls.append(link)

    start_urls = urls

    def parse(self, response):
        yield response
I think I found the workaround.
Add this method to your spider.
def start_requests(self):
    for d in range(1, 26999999):
        yield scrapy.Request("http://example.com/" + str(d), self.parse)
You don't have to specify start_urls at all.
It will start generating URLs and sending asynchronous requests, and the callback will be called whenever Scrapy gets a response. At the start the memory usage will be higher, but later on it takes roughly constant memory.
Along with this you can use
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
which lets you pause the spider and resume it at any time by running the same command again.
To save CPU (and log storage), set
LOG_LEVEL = 'INFO'
in the settings.py of the Scrapy project.
I believe creating a big list of urls to use as start_urls may be causing the problem.
How about doing this instead?
class MySpider(Spider):
    name = "mydomain"
    allowed_domains = ["mydomain.com"]
    start_urls = ["http://example.com/0"]

    def parse(self, response):
        for d in xrange(1, 20000000):
            link = "http://example.com/" + str(d)
            yield Request(url=link, callback=self.parse_link)

    def parse_link(self, response):
        yield response

How to stop scrapy spider but process all wanted items?

I have in my pipeline a method to check if the post date of the item is older than the one found in MySQL; let lastseen be the newest datetime retrieved from the database:
def process_item(self, item, spider):
    if item['post_date'] < lastseen:
        # set flag to close_spider
        # raise DropItem("old item")
This code basically works, except: I check the site on an hourly basis just to get the new posts. If I don't stop the spider, it will keep crawling thousands of pages; if I stop the spider on a flag, chances are a few requests will not be processed, since they may come back in the queue after the spider closed, even though they might be newer in post date. Having said that, is there a workaround for more precise scraping?
Thanks.
Not sure if this fits your setup, but you can fetch lastseen from MySQL when initializing your spider and stop generating Requests in your callbacks when the response contains an item with post_date < lastseen, basically moving the stop-crawling logic directly inside the Spider instead of the pipeline.
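A rough sketch of that first idea, fetching lastseen once when the spider is initialized (the table and column names are assumptions, and pymysql stands in for whatever MySQL client you use):

import pymysql
import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        conn = pymysql.connect(host='localhost', user='user',
                               password='secret', database='mydb')
        with conn.cursor() as cur:
            cur.execute("SELECT MAX(post_date) FROM posts")  # hypothetical table/column
            self.lastseen = cur.fetchone()[0]
        conn.close()

    def parse(self, response):
        # stop yielding follow-up Requests once post dates drop below self.lastseen
        ...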
It can sometimes be simpler to pass an argument to your spider
scrapy crawl myspider -a lastseen=20130715
and set a property on your Spider to test in your callback (http://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments):
class MySpider(BaseSpider):
    name = 'myspider'

    def __init__(self, lastseen=None):
        self.lastseen = lastseen
        # ...

    def parse_new_items(self, response):
        follow_next_page = True

        # item fetch logic
        for element in <some_selector>:

            # get post_date
            post_date = <extract post_date from element>

            # check post_date
            if post_date < self.lastseen:
                follow_next_page = False
                continue

            item = MyItem()
            # populate item...
            yield item

        # find next page to crawl
        if follow_next_page:
            next_page_url = ...
            yield Request(url=next_page_url, callback=self.parse_new_items)
