How does Scrapy save crawl state? - python

I am able to save my crawl state and Scrapy successfully continues from where I cut it off. I have kept start_urls constant each time I restart the spider, i.e. the order and contents of the start_urls list are the same on every restart. But I now need to randomly shuffle my start_urls: they come from different domains as well as from the same domain, and because same-domain URLs are grouped together, the crawl delay is significantly slowing down my crawl speed. My list holds tens of millions of URLs and I have already crawled a million of them, so I don't want to jeopardize anything or restart the crawl.
I have seen that requests.seen holds what look like hashed values of the visited URLs, and from the Scrapy code I am certain it is used to filter duplicates. But I am not sure what either spider.state or requests.queue does to help with saving state or restarting the crawl.

You can write these requests to a text file, separating them by whether the request ends up in the callback or the errback:
def start_requests(self):
    for u in self.start_urls:
        yield scrapy.Request(u, callback=self.parse,
                             errback=self.err,
                             dont_filter=True)

def parse(self, response):
    with open('successful_requests.txt', 'a') as out:
        out.write(response.url + '\n')

def err(self, failure):
    with open('failed_requests.txt', 'a') as out:
        out.write(failure.request.url + ' ' + str(failure) + '\n')
To recover the state of the requests, just read these text files.
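For the spider.state part of the question: when you run with a JOBDIR, Scrapy keeps the dupefilter fingerprints in requests.seen and the pending request queue in requests.queue, and it also pickles the spider.state dict between runs, so you can keep your own bookkeeping there. A minimal sketch, assuming a spider named stateful (the name and the pages_seen key are just for illustration):

import scrapy

class StatefulSpider(scrapy.Spider):
    name = "stateful"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # self.state is saved to the JOBDIR when the spider closes and
        # reloaded on the next run that uses the same JOBDIR.
        self.state["pages_seen"] = self.state.get("pages_seen", 0) + 1
        yield {"url": response.url, "pages_seen": self.state["pages_seen"]}

Run it with scrapy crawl stateful -s JOBDIR=crawls/stateful-1, stop it, and run the same command again to resume with the counter intact.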

Related

How does Scrapy proceed with the urls given in the urls variable under start_requests?

Just wondering why, when I have url = ['site1', 'site2'] and I run Scrapy from a script using .crawl() twice in a row, like
def run_spiders():
    process.crawl(Spider)
    process.crawl(Spider)
the output is:
site1info
site1info
site2info
site2info
as opposed to
site1info
site2info
site1info
site2info
Because as soon as you call process.start(), requests are handled asynchronously. The order is not guaranteed.
In fact, even if you only call process.crawl() once, you may sometimes get:
site2info
site1info
To run spiders sequentially from Python, see this other answer.
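For completeness, the sequential pattern from that answer (it is also in the Scrapy docs) chains the crawls with CrawlerRunner and Twisted deferreds; a sketch, assuming Spider is the spider class from the question:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(Spider)  # waits until the first crawl finishes
    yield runner.crawl(Spider)  # only then starts the second one
    reactor.stop()

crawl()
reactor.run()  # blocks until the last crawl has finished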
start_requests uses yield: the yielded requests get queued by the scheduler. To understand this fully, read this StackOverflow answer.
Here is a code example of how it works with start_urls in the start_requests method.
start_urls = [
    "url1.com",
    "url2.com",
]

def start_requests(self):
    for u in self.start_urls:
        yield scrapy.Request(u, callback=self.parse)
For custom request ordering, the priority argument can be used:
def start_requests(self):
    yield scrapy.Request(self.start_urls[0], callback=self.parse)
    yield scrapy.Request(self.start_urls[1], callback=self.parse, priority=1)
The request with the higher priority value is yielded from the queue first. By default, priority is 0.

Scrapy: Can't restart start_requests() properly

I have a scraper that requests two pages: one is the main page, and the other is a .js file containing longitude and latitude coordinates that I need to extract, because I need them later in the parsing process. I want to process the .js file first, extract the coordinates, and then parse the main page and start crawling its links / parsing its items.
For this purpose I am using the priority parameter of Request and saying that I want my .js page to be processed first. This works, but only around 70% of the time (presumably because of Scrapy's asynchronous requests). The other 30% of the time I end up in my parse method trying to parse the .js long/lat coordinates, but the main website page has already come through, so it's impossible to parse them.
For this reason, I tried to fix it this way:
in the parse() method, I check which URL in order this is; if it is the first one and it is not the .js one, I restart the spider. However, when I restart the spider it correctly processes the .js file first, but after processing it the spider finishes its work and exits the script without an error, as if it were completed.
Why is that happening? What is different about how the pages are processed when I restart the spider compared to when I just start it, and how can I fix this problem?
This is the code, with sample output for both scenarios from when I was trying to debug what is being executed and why it stops after being restarted.
class QuotesSpider(Spider):
    name = "bot"
    url_id = 0
    home_url = 'https://website.com'
    longitude = None
    latitude = None

    def __init__(self, cat=None):
        self.cat = cat.replace("-", " ")

    def start_requests(self):
        print("Starting spider")
        self.start_urls = [
            self.home_url,
            self.home_url + 'js-file-with-long-lat.js'
        ]
        for priority, url in enumerate(self.start_urls):
            print("Processing", url)
            yield Request(url=url, priority=priority, callback=self.parse)

    def parse(self, response):
        print("Inside parse")
        if self.url_id == 0 and response.url == self.home_url:
            self.alert("Loaded main page before long/lat page, restarting", False)
            for _ in self.start_requests():
                yield _
        else:
            print("Everything is good, url id is", str(self.url_id))
            self.url_id += 1
            if self.longitude is None:
                for _ in self.parse_long_lat(response):
                    yield _
            else:
                print("Calling parse cats")
                for cat in self.parse_cats(response):
                    yield cat

    def parse_long_lat(self, response):
        print("called long lat")
        try:
            self.latitude = re.search('latitude:(\-?[0-9]{1,2}\.?[0-9]*)',
                                      response.text).group(1)
            self.longitude = re.search('longitude:(\-?[0-9]{1,3}\.?[0-9]*)',
                                       response.text).group(1)
            print("Extracted coords")
            yield None
        except AttributeError as e:
            self.alert("\nCan't extract lat/long coordinates, store availability will not be parsed. ", False)
            yield None

    def parse_cats(self, response):
        """ Parsing links code goes here """
        pass
Output when the spider starts correctly, fetches the .js page first, and then starts parsing the cats:
Starting spider
https://website.com
https://website.com/js-file-with-long-lat.js
Inside parse
Everything is good, url id is 0
called long lat
Extracted coords
Inside parse
Everything is good, url id is 1
Calling parse cats
And the script goes on and parses everything fine.
Output when the spider starts incorrectly, fetches the main page first, and restarts start_requests():
Starting spider
https://website.com
https://website.com/js-file-with-long-lat.js
Inside parse
Loaded main page before long/lat page, restarting
Starting spider
https://website.com
https://website.com/js-file-with-long-lat.js
Inside parse
Everything is good, url id is 0
called long lat
Extracted coords
And the script stops its execution without an error, as if it were completed.
P.S. If it matters: yes, the URLs in start_requests() are processed in reverse order, but I find that natural given the loop sequence, and I expect the priority parameter to do its job (as it does most of the time, and as it should per Scrapy's docs).
As to why your spider doesn't continue in the "restarting" case: you are probably running afoul of duplicate requests being filtered and dropped. Since the page has already been visited, Scrapy thinks it's done.
So you would have to re-send these requests with a dont_filter=True argument:
for priority, url in enumerate(self.start_urls):
    print("Processing", url)
    yield Request(url=url, dont_filter=True, priority=priority, callback=self.parse)
    #                  ^^^^^^^^^^^^^^^^ notice us forcing the dupefilter to
    #                  ignore duplicate requests to these pages
As for a better solution than this hacky approach, consider using InitSpider (one example; other methods exist). It guarantees that your "initial" work has been done before the real crawl starts and can be relied on.
(For some reason the class was never documented in the Scrapy docs, but it's a relatively simple Spider subclass: do some initial work before starting the actual run.)
And here is a code-example for that:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders.init import InitSpider

class QuotesSpider(InitSpider):
    name = 'quotes'
    allowed_domains = ['website.com']
    start_urls = ['https://website.com']

    # Without this method override, InitSpider behaves like Spider.
    # This is used _instead of_ start_requests. (Do not override start_requests.)
    def init_request(self):
        # The last request that finishes the initialization needs
        # to have the `self.initialized()` method as callback.
        url = self.start_urls[0] + '/js-file-with-long-lat.js'
        yield scrapy.Request(url, callback=self.parse_long_lat, dont_filter=True)

    def parse_long_lat(self, response):
        """ The callback for our init request. """
        print("called long lat")
        # do some work and maybe return stuff
        self.latitude = None
        self.longitude = None
        #yield stuff_here

        # Finally, start our run.
        return self.initialized()
        # Now we are "initialized", will process `start_urls`
        # and continue from there.

    def parse(self, response):
        print("Inside parse")
        print("Everything is good, do parse_cats stuff here")
which would result in output like this:
2019-01-10 20:36:20 [scrapy.core.engine] INFO: Spider opened
2019-01-10 20:36:20 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-10 20:36:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1/js-file-with-long-lat.js> (referer: None)
called long lat
2019-01-10 20:36:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1> (referer: http://127.0.0.1/js-file-with-long-lat.js/)
Inside parse
Everything is good, do parse_cats stuff here
2019-01-10 20:36:21 [scrapy.core.engine] INFO: Closing spider (finished)
So I finally handled it with a workaround:
I check which response.url was received in parse() and, based on that, dispatch further parsing to the corresponding method:
def start_requests(self):
    self.start_urls = [
        self.home_url,
        self.home_url + 'js-file-with-long-lat.js'
    ]
    for priority, url in enumerate(self.start_urls):
        yield Request(url=url, priority=priority, callback=self.parse)

def parse(self, response):
    if response.url != self.home_url:
        for _ in self.parse_long_lat(response):
            yield _
    else:
        for cat in self.parse_cats(response):
            yield cat

Running Scrapy multiple times on the same URL

I'd like to crawl a certain URL that returns a random response each time it's called. The code below returns what I want, but I'd like to run it for a long time so that I can use the data for an NLP application. The code only runs once with scrapy crawl the, though I expect it to run more because of the last if statement.
Is Unix's start command what I'm looking for? I tried it, but it felt a bit slow. If I had to use the start command, would opening many terminal tabs and running the same command with the start prefix be good practice, or would it just throttle the speed?
class TheSpider(scrapy.Spider):
    name = 'the'
    allowed_domains = ['https://websiteiwannacrawl.com']
    start_urls = ['https://websiteiwannacrawl.com']

    def parse(self, response):
        info = {}
        info['text'] = response.css('.pd-text').extract()
        yield info

        next_page = 'https://websiteiwannacrawl.com'
        if next_page is not None:
            yield scrapy.Request(next_page, callback=self.parse)
dont_filter
indicates that this request should not be filtered by the scheduler.
This is used when you want to perform an identical request multiple
times, to ignore the duplicates filter. Use it with care, or you will
get into crawling loops. Default to False
You should add this to your Request:
yield scrapy.Request(next_page, dont_filter=True)
This is not about your question, but regarding callback=self.parse, please read about the parse method.
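Put together, the spider from the question with that change applied would look roughly like this (a sketch; the domain and the .pd-text selector are the asker's placeholders, allowed_domains is written as a bare domain as Scrapy expects, and callback=self.parse is spelled out even though it is the default):

import scrapy

class TheSpider(scrapy.Spider):
    name = 'the'
    allowed_domains = ['websiteiwannacrawl.com']
    start_urls = ['https://websiteiwannacrawl.com']

    def parse(self, response):
        yield {'text': response.css('.pd-text').extract()}
        # dont_filter=True makes the scheduler accept the same URL again
        # instead of dropping it as a duplicate, so the loop keeps going.
        yield scrapy.Request(response.url, callback=self.parse, dont_filter=True)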

How to handle large number of requests in scrapy?

I'm crawling around 20 million URLs, but before the requests are actually made the process gets killed due to excessive memory usage (4 GB RAM). How can I handle this in Scrapy so that the process doesn't get killed?
class MySpider(Spider):
    name = "mydomain"
    allowed_domains = ["mydomain.com"]

    urls = []
    for d in range(0, 20000000):
        link = "http://example.com/" + str(d)
        urls.append(link)
    start_urls = urls

    def parse(self, response):
        yield response
I think I found the workaround.
Add this method to your spider.
def start_requests(self):
    for d in range(1, 26999999):
        yield scrapy.Request("http://example.com/" + str(d), self.parse)
You don't have to specify start_urls at all.
Scrapy will generate the URLs and send asynchronous requests, and the callback will be called as each response comes in. Memory usage is higher at the start, but afterwards it stays roughly constant.
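The same lazy-generation idea works when the URLs come from a file instead of a numeric range; reading line by line keeps memory flat (the file name urls.txt is hypothetical):

def start_requests(self):
    # The file is read one line at a time as Scrapy consumes the
    # generator, so the full URL list is never held in memory.
    with open("urls.txt") as f:
        for line in f:
            url = line.strip()
            if url:
                yield scrapy.Request(url, callback=self.parse)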
Along with this, you can use
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
which lets you pause the spider and resume it at any time by re-running the same command.
To save CPU (and log storage), set
LOG_LEVEL = 'INFO'
in the settings.py of your Scrapy project.
I believe creating a big list of urls to use as start_urls may be causing the problem.
How about doing this instead?
class MySpider(Spider):
    name = "mydomain"
    allowed_domains = ["mydomain.com"]
    start_urls = ["http://example.com/0"]

    def parse(self, response):
        for d in xrange(1, 20000000):
            link = "http://example.com/" + str(d)
            yield Request(url=link, callback=self.parse_link)

    def parse_link(self, response):
        yield response

Is it possible to crawl multiple start_urls lists simultaneously

I have 3 URL files, all with the same structure, so the same spider can be used for all of the lists.
A special requirement is that all three need to be crawled simultaneously.
Is it possible to crawl them simultaneously without creating multiple spiders?
I believe the answer
start_urls = ["http://example.com/category/top/page-%d/" % i for i in xrange(4)] + \
             ["http://example.com/superurl/top/page-%d/" % i for i in xrange(55)]
in "Scrap multiple urls with scrapy" only joins two lists; it does not run them at the same time.
Thanks very much
Use start_requests instead of start_urls; this will work for you.
class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        for page in range(1, 20):
            yield self.make_requests_from_url('https://www.example.com/page-%s' % page)
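If the three lists live in separate files, one way to feed all of them at once is a round-robin over the files in start_requests (a sketch; the file names and the round-robin choice are mine, not the answerer's):

import itertools
import scrapy

class MultiListSpider(scrapy.Spider):
    name = 'multilist'

    def start_requests(self):
        files = [open(p) for p in ('list1.txt', 'list2.txt', 'list3.txt')]
        # Take one line from each file in turn so requests from all lists
        # are interleaved instead of being crawled strictly list by list.
        for line in itertools.chain.from_iterable(itertools.zip_longest(*files)):
            if line and line.strip():
                yield scrapy.Request(line.strip(), callback=self.parse)
        for f in files:
            f.close()

    def parse(self, response):
        yield {'url': response.url}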
