I'm crawling around 20 million URLs. But before the requests are actually made, the process gets killed due to excessive memory usage (4 GB RAM). How can I handle this in Scrapy so that the process doesn't get killed?
class MySpider(Spider):
    name = "mydomain"
    allowed_domains = ["mydomain.com"]

    urls = []
    for d in range(0, 20000000):
        link = "http://example.com/" + str(d)
        urls.append(link)

    start_urls = urls

    def parse(self, response):
        yield response
I think I found the workaround.
Add this method to your spider.
def start_requests(self):
    for d in range(1, 26999999):
        yield scrapy.Request("http://example.com/" + str(d), self.parse)
With this, you don't have to specify start_urls at all.
Scrapy will start generating URLs and sending asynchronous requests, and the callback will be called as each response arrives. Memory usage will be higher at the start, but it settles to a roughly constant level afterwards.
Along with this you can use
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
With JOBDIR set, you can pause the spider and resume it at any time by running the same command again,
and, in order to save CPU (and log storage), set
LOG_LEVEL = 'INFO'
in the settings.py of the Scrapy project.
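For reference, a minimal settings.py sketch combining both suggestions (the JOBDIR value here is just an example path; it can equally be passed with -s on the command line):

LOG_LEVEL = 'INFO'              # log less than the default DEBUG level, saving CPU and disk
JOBDIR = 'crawls/somespider-1'  # persist scheduler state so the crawl can be paused and resumed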
I believe creating a big list of URLs to use as start_urls may be what's causing the problem.
How about doing this instead?
class MySpider(Spider):
    name = "mydomain"
    allowed_domains = ["mydomain.com"]
    start_urls = ["http://example.com/0"]

    def parse(self, response):
        # xrange is Python 2; on Python 3 use range instead
        for d in xrange(1, 20000000):
            link = "http://example.com/" + str(d)
            yield Request(url=link, callback=self.parse_link)

    def parse_link(self, response):
        yield response
My goal is to print something from the parse method on each iteration of the for loop in the get_membership_no method.
I am using Python 3.8.5 and Scrapy 1.7.3. When I run the code below, I get "Filtered offsite request" in the console output.
Here is my code.
import scrapy
import json

class BasisMembersSpider(scrapy.Spider):
    name = 'basis'
    allowed_domains = ['www.basis.org.bd']

    def start_requests(self):
        yield scrapy.Request(url="https://basis.org.bd/get-member-list?page=1&team=", callback=self.get_membership_no)

    def get_membership_no(self, response):
        data_array = json.loads(response.body)['data']
        for data in data_array:
            yield scrapy.Request(url='https://basis.org.bd/get-company-profile/{0}'.format(data['membership_no']), callback=self.parse)

    def parse(self, response):
        print("I want to get this line on console. thank you.")
The reason for this behavior is that you set allowed_domains = ['www.basis.org.bd'], which causes the offsite middleware to filter requests to basis.org.bd (without the www prefix).
You can either leave allowed_domains out completely or extend your list of allowed domains like this:
allowed_domains = ['www.basis.org.bd', 'basis.org.bd']
See the documentation for allowed_domains here for more information.
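A minimal sketch of the fix applied to the spider above (only the allowed_domains line changes):

class BasisMembersSpider(scrapy.Spider):
    name = 'basis'
    # list both host variants so the offsite middleware doesn't filter
    # requests to basis.org.bd (without the www prefix)
    allowed_domains = ['www.basis.org.bd', 'basis.org.bd']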
Removing "www." from allowed_domains worked for me.
I found this article really helpful.
I'm new to Scrapy and I'm trying to practice with an example. I want to run Scrapy spiders sequentially, but when I use the code from the documentation
(https://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script) with the crawler process, it doesn't work. The spider opens and closes instantly without scraping any data from the website. But when I run the spider alone using "scrapy crawl" it works. I don't understand why the spider scrapes data when I call it alone and doesn't when I try to run it sequentially. If someone could help me with that it would be great.
Here's the code that I'm using:
class APASpider(scrapy.Spider):
    name = 'APA_test'
    allowed_domains = ['some_domain.com']
    start_urls = ['startin_url']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                endpoint='execute',
                cache_args=['lua_source'],
                args={'lua_source': script, 'timeout': 3600},
                headers={'X-My-Header': 'value'},
            )

    def parse(self, response):
        for href in response.xpath('//a[@class="product-link"]/@href').extract():
            yield SplashRequest(response.urljoin(href), self.parse_produits,
                endpoint='execute',
                cache_args=['lua_source'],
                args={'lua_source': script, 'timeout': 3600},
                headers={'X-My-Header': 'value'},
            )

        for pages in response.xpath('//*[@id="loadmore"]/@href'):
            yield SplashRequest(response.urljoin(pages.extract()), self.parse,
                endpoint='execute',
                cache_args=['lua_source'],
                args={'lua_source': script, 'timeout': 3600},
                headers={'X-My-Header': 'value'},
            )

    def parse_produits(self, response):
        Nom = response.xpath("//h1/text()").extract()
        Poids = response.xpath('//p[@class="description"]/text()').extract()

        item_APA = APAitem()
        item_APA["Titre"] = Nom
        item_APA["Poids"] = Poids
        yield item_APA
# imports required for the runner script below
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(APASpider)
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished
Thank you
It's hard to tell exactly what the issue is, considering no log messages are provided in the question.
That said, I'll still try to answer, as I ran into the same issue a while ago.
There is a known issue with scrapy_splash concerning local last_response = entries[#entries].response in Splash Lua scripts. I'm assuming you have that line in your script, as I did.
The workaround I used was to check that the history is not empty before taking the last entry (as suggested by GitHub user kmike).
I have the following code:
#FirstSpider.py
class FirstSpider(scrapy.Spider):
    name = 'first'
    start_urls = ['https://www.basesite.com']
    next_urls = []

    def parse(self, response):
        for url in response.css('bunch > of > css > here'):
            self.next_urls.append(url.css('more > css > here'))
            l = Loader(item=Item(), selector=url.css('more > css'))
            l.add_css('add', 'more > css')
            ...
            ...
            yield l.load_item()

        for url in self.next_urls:
            new_urls = self.start_urls[0] + url
            yield scrapy.Request(new_urls, callback=SecondSpider.parse_url)
#SecondSpider.py
class SecondSpider(scrapy.Spider):
    name = 'second'
    start_urls = ['https://www.basesite.com']

    def parse_url(self):
        """Parse team data."""
        return self
        # self is a HtmlResponse not a 'response' object

    def parse(self, response):
        """Parse all."""
        summary = self.parse_url(response)
        return summary

#ThirdSpider.py
class ThirdSpider(scrapy.Spider):
    # take links from second spider, continue:
I want to be able to pass the URLs scraped in Spider 1 to Spider 2 (in a different script). I'm curious why, when I do, the 'response' is an HtmlResponse and not a 'response' object (when I do something similar with a method in the same class as Spider 1, I don't have this issue).
What am I missing here? How do I just pass the original response(s) to the second spider (and from the second on to the third, etc.)?
You could use Redis as a shared resource between all spiders: https://github.com/rmax/scrapy-redis
Run all N spiders (and don't let them close on idle state), so each of them stays connected to the same Redis instance and waits for tasks (URL, request headers) from there.
As a side effect, push the task data to Redis from X_spider under a specific key (the Y_spider name); see the sketch below.
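A minimal sketch of that setup, assuming scrapy-redis is installed and Redis is running locally; the spider name and Redis key are illustrative:

import scrapy
from scrapy_redis.spiders import RedisSpider

class SecondSpider(RedisSpider):
    name = 'second'
    # a producer (e.g. the first spider) pushes URLs into this Redis list
    redis_key = 'second:start_urls'

    custom_settings = {
        'SCHEDULER': 'scrapy_redis.scheduler.Scheduler',
        'DUPEFILTER_CLASS': 'scrapy_redis.dupefilter.RFPDupeFilter',
        'REDIS_URL': 'redis://localhost:6379',
    }

    def parse(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}

URLs can then be queued from the shell with redis-cli lpush second:start_urls <url>, or programmatically from the producing spider.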
What about using inheritance? The "parse" function names just need to be different.
If your first spider inherits from the second, it will be able to set the callback to self.parse_function_spider2, as in the sketch below.
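A rough sketch of the inheritance idea, with illustrative class and method names (parse_team here stands in for the differently named parse function):

import scrapy

class SecondSpider(scrapy.Spider):
    name = 'second'

    def parse_team(self, response):
        # parsing logic that both spiders can share
        yield {'url': response.url}

class FirstSpider(SecondSpider):
    name = 'first'
    start_urls = ['https://www.basesite.com']

    def parse(self, response):
        # hand each link found here straight to the inherited callback
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse_team)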
I'm interested in using Scrapy-Redis to store scraped items in Redis. In particular, the Redis-based request duplicate filter seems like a useful feature.
To start off, I adapted the spider at https://doc.scrapy.org/en/latest/intro/tutorial.html#extracting-data-in-our-spider as follows:
import scrapy

from tutorial.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]
    custom_settings = {'SCHEDULER': 'scrapy_redis.scheduler.Scheduler',
                       'DUPEFILTER_CLASS': 'scrapy_redis.dupefilter.RFPDupeFilter',
                       'ITEM_PIPELINES': {'scrapy_redis.pipelines.RedisPipeline': 300}}

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = QuoteItem()
            item['text'] = quote.css('span.text::text').extract_first()
            item['author'] = quote.css('small.author::text').extract_first()
            item['tags'] = quote.css('div.tags a.tag::text').extract()
            yield item
where I generated the project using scrapy startproject tutorial at the command line and defined QuoteItem in items.py as
import scrapy

class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
Basically, I've applied the settings from the "Usage" section of the scrapy-redis README as per-spider settings and made the spider yield an Item object instead of a regular Python dictionary. (I figured this would be necessary to trigger the item pipeline.)
Now, if I crawl the spider using scrapy crawl quotes from the command line and then do redis-cli, I see a quotes:items key:
127.0.0.1:6379> keys *
1) "quotes:items"
which is a list of length 20:
127.0.0.1:6379> llen quotes:items
(integer) 20
If I run scrapy crawl quotes again, the length of the list doubles to 40:
127.0.0.1:6379> llen quotes:items
(integer) 40
However, I would expect the length of quotes:items to still be 20, since I have simply re-scraped the same pages. Am I doing something wrong here?
Scrapy-Redis doesn't filter duplicate items automatically.
The (request) dupefilter is about the requests in a crawl. What you want seems to be something closer to the deltafetch middleware: https://github.com/scrapy-plugins/scrapy-deltafetch
You would need to adapt deltafetch to work with distributed storage; perhaps Redis' bitmap feature would fit this case. See the sketch below for an item-level variant.
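As an illustration of the item-level idea (this is not the deltafetch middleware itself), here is a minimal sketch of a pipeline that drops items whose fingerprint is already present in a shared Redis set; the key name and hashing scheme are assumptions:

import hashlib
import json

import redis
from scrapy.exceptions import DropItem

class RedisDuplicateItemPipeline:
    """Drop items whose fingerprint is already stored in Redis (sketch)."""

    def open_spider(self, spider):
        # assumes a local Redis instance; adjust the URL for your setup
        self.server = redis.from_url('redis://localhost:6379')
        self.key = '%s:item_fingerprints' % spider.name

    def process_item(self, item, spider):
        fingerprint = hashlib.sha1(
            json.dumps(dict(item), sort_keys=True).encode('utf-8')
        ).hexdigest()
        # SADD returns 0 when the member is already in the set
        if self.server.sadd(self.key, fingerprint) == 0:
            raise DropItem('Duplicate item: %s' % fingerprint)
        return item

It would be enabled like any other pipeline, e.g. 'ITEM_PIPELINES': {'tutorial.pipelines.RedisDuplicateItemPipeline': 400} (the module path is hypothetical).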
Here is how I fixed the problem in the end. First of all, as pointed out to me in a separate question, How to implement a custom dupefilter in Scrapy?, using the start_urls class variable results in an implementation of start_requests in which the yielded Request objects have dont_filter=True. To disable this and use the default dont_filter=False instead, I implemented start_requests directly:
import scrapy

from tutorial.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    custom_settings = {
        'SCHEDULER': 'scrapy_redis.scheduler.Scheduler',
        'DUPEFILTER_CLASS': 'tutorial.dupefilter.RedisDupeFilter',
        'ITEM_PIPELINES': {'scrapy_redis.pipelines.RedisPipeline': 300}
    }

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = QuoteItem()
            item['text'] = quote.css('span.text::text').extract_first()
            item['author'] = quote.css('small.author::text').extract_first()
            item['tags'] = quote.css('div.tags a.tag::text').extract()
            yield item
Secondly, as pointed out by Rolando, the fingerprints aren't by default persisted across different crawls. To implement this, I subclassed Scrapy-Redis' RFPDupeFilter class:
import scrapy_redis.dupefilter
from scrapy_redis.connection import get_redis_from_settings

class RedisDupeFilter(scrapy_redis.dupefilter.RFPDupeFilter):
    @classmethod
    def from_settings(cls, settings):
        server = get_redis_from_settings(settings)
        key = "URLs_seen"  # Use a fixed key instead of one containing a timestamp
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(server=server, key=key, debug=debug)

    def request_seen(self, request):
        added = self.server.sadd(self.key, request.url)
        return added == 0

    def clear(self):
        pass  # Don't delete the key from Redis
The main differences are (1) the key is set to a fixed value (not one containing a time stamp) and (2) the clear method, which in Scrapy-Redis' implementation deletes the key from Redis, is effectively disabled.
Now, when I run scrapy crawl quotes the second time, I see the expected log output
2017-05-05 15:13:46 [scrapy_redis.dupefilter] DEBUG: Filtered duplicate request <GET http://quotes.toscrape.com/page/1/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
and no items are scraped.
I am new to Scrapy and am attempting to teach myself the basics. I have written code that goes to the Louisiana Department of Natural Resources website to retrieve the serial number for certain oil wells.
I have each well's link listed in start_urls, but Scrapy only downloads data from the first URL. What am I doing wrong?
import scrapy
from scrapy import Spider
from scrapy.selector import Selector

from mike.items import MikeItem

class SonrisSpider(Spider):
    name = "sspider"

    start_urls = [
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=207899",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=971683",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=214206",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=159420",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=243671",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=248942",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=156613",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=972498",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=215443",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=248463",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=195136",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=179181",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=199930",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=203419",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=220454",
    ]

    def parse(self, response):
        item = MikeItem()
        item['serial'] = response.xpath('/html/body/table[1]/tr[2]/td[1]/text()').extract()[0]
        yield item
Thank you for any help you might be able to provide. If I have not explained my problem thoroughly, please let me know and I will attempt to clarify.
I think this code might help.
By default Scrapy filters duplicate requests. Since only the query parameters differ in your start URLs, Scrapy considers the rest of them duplicates of the first one; that's why your spider stops after fetching the first URL. To crawl the rest of the URLs, we enable the dont_filter flag on the Scrapy request (see start_requests() below).
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from mike.items import MikeItem

class SonrisSpider(scrapy.Spider):
    name = "sspider"
    allowed_domains = ["sonlite.dnr.state.la.us"]
    start_urls = [
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=207899",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=971683",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=214206",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=159420",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=243671",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=248942",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=156613",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=972498",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=215443",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=248463",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=195136",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=179181",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=199930",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=203419",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=220454",
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse_data, dont_filter=True)

    def parse_data(self, response):
        item = MikeItem()
        serial = response.xpath(
            '/html/body/table[1]/tr[2]/td[1]/text()').extract()
        serial = serial[0] if serial else 'n/a'
        item['serial'] = serial
        yield item
Sample output returned by this spider is as follows:
{'serial': u'207899'}
{'serial': u'971683'}
{'serial': u'214206'}
{'serial': u'159420'}
{'serial': u'248942'}
{'serial': u'243671'}
Your code looks fine; try adding this function:
class SonrisSpider(Spider):
    def start_requests(self):
        for url in self.start_urls:
            print(url)
            yield self.make_requests_from_url(url)
    # the rest of your code goes here
The URLs should be printed now. Test it, and if it doesn't work, please say so.
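Note that make_requests_from_url has since been deprecated in newer Scrapy releases; a roughly equivalent sketch yields Request objects directly (keeping dont_filter, as in the answer above):

import scrapy
from scrapy import Spider

class SonrisSpider(Spider):
    name = "sspider"

    def start_requests(self):
        for url in self.start_urls:
            print(url)  # confirm each start URL is actually scheduled
            yield scrapy.Request(url, callback=self.parse, dont_filter=True)

    # the rest of the original spider (parse, item handling) stays unchanged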