My Scrapy spider has a bunch of independent target links to crawl.
def start_requests(self):
    search_targets = get_search_targets()
    for search in search_targets:
        request = get_request(search.contract_type, search.postal_code, 1)
        yield request
Each link has multiple pages that will be followed, i.e.
def parse(self, response, **kwargs):
    # Some logic depending on the response
    # ...
    if cur_page < num_pages:  # Following the link to the next page
        next_page = cur_page + 1
        request = get_request(contract_type, postal_code, next_page)
        yield request
    for estate_dict in estates:  # Parsing the items of the response
        item = EstateItem()
        fill_item(item, estate_dict)
        yield item
Now, after a few pages, each link (target) will start encountering duplicate items already seen in previous crawls. Whether an item is a duplicate is decided in the pipeline, with a query to the database.
def save_estate_item(self, item: EstateItem, session: Session):
    query = session.query(EstateModel)
    previous_item = query.filter_by(code=item['code']).first()
    if previous_item is not None:
        logging.info("Duplicate Estate")
        return
    # Save the item in the DB
    # ...
Now, when I find a duplicate estate, I want Scrapy to stop following pages for that specific link target. How could I do that?
I figured I would raise exceptions.DropItem('Duplicate post') in the pipeline with the info about the finished search target, and catch that exception in my spider. But how could I tell Scrapy to stop following links for that specific search target?
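One way this could be wired up (a sketch, not from the original post): because process_item receives the spider instance, the pipeline can record the finished search target on the spider, and parse can consult that set before yielding the next-page request. The finished_targets attribute, the is_duplicate helper, and the contract_type/postal_code item fields are assumptions here; the other names come from the code above.

import scrapy
from scrapy.exceptions import DropItem

class EstateSpider(scrapy.Spider):
    name = 'estates'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.finished_targets = set()  # assumed attribute, filled by the pipeline below

    def parse(self, response, **kwargs):
        # ... existing parsing logic ...
        key = (contract_type, postal_code)
        if key not in self.finished_targets and cur_page < num_pages:
            yield get_request(contract_type, postal_code, cur_page + 1)
        for estate_dict in estates:
            item = EstateItem()
            fill_item(item, estate_dict)
            yield item

class DuplicatesPipeline:
    def process_item(self, item, spider):
        if self.is_duplicate(item):  # hypothetical helper wrapping the DB query above
            # mark the target as finished so the spider stops paginating it
            spider.finished_targets.add((item['contract_type'], item['postal_code']))
            raise DropItem('Duplicate post')
        return item

Because Scrapy schedules requests asynchronously, a page that was already queued before the duplicate arrived may still be fetched; the flag only prevents further pagination.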
I have written a program to crawl a single website and scrape certain data. I would like to speed up its execution by using ProcessPoolExecutor. However, I am having trouble understanding how to convert it from single-threaded to concurrent.
Specifically, when creating a job (via ProcessPoolExecutor.submit()), can I pass a class/object and args instead of a function and args?
And, if so, how do I return data from those jobs to the queue for tracking visited pages AND to a structure for holding scraped content?
I have been using this as a jumping off point, as well as reviewing the Queue and concurrent.futures docs (with, frankly, the latter going a bit over my head). I've also Googled/Youtubed/SO'ed around quite a bit to no avail.
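One note on the "can I pass a class/object and args" part: submit() takes any callable, and a class is a callable, so passing the class itself works; future.result() then returns the constructed instance (with ProcessPoolExecutor, the class, its arguments, and the returned instance must all be picklable). A tiny sketch to illustrate, using a ThreadPoolExecutor and a stub Scraper:

from concurrent.futures import ThreadPoolExecutor

class Scraper:
    def __init__(self, url):
        self.url = url  # pretend real scraping happens here

with ThreadPoolExecutor(max_workers=2) as pool:
    # submitting the class: the constructor runs in the worker,
    # and the resulting instance comes back via future.result()
    future = pool.submit(Scraper, 'https://example.com')
    scraper = future.result()
    print(scraper.url)  # https://example.com

With that noted, here is the code I have so far: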
from queue import Queue, Empty
from concurrent.futures import ProcessPoolExecutor


class Scraper:
    """
    Scrapes a single url
    """

    def __init__(self, url):
        self.url = url  # url of page to scrape
        self.internal_urls = None
        self.content = None
        self.scrape()

    def scrape(self):
        """
        Method(s) to request a page, scrape links from that page
        to other pages, and finally scrape actual content from the current page
        """
        # assume that code in this method would yield urls linked in current page
        self.internal_urls = set(scraped_urls)
        # and that code in this method would scrape a bit of actual content
        self.content = {'content1': content1, 'content2': content2, 'etc': etc}


class CrawlManager:
    """
    Manages a multiprocess crawl and scrape of a single site
    """

    def __init__(self, seed_url):
        self.seed_url = seed_url
        self.pool = ProcessPoolExecutor(max_workers=10)
        self.processed_urls = set([])
        self.queued_urls = Queue()
        self.queued_urls.put(self.seed_url)
        self.data = {}

    def crawl(self):
        while True:
            try:
                # get a url from the queue
                target_url = self.queued_urls.get(timeout=60)
                # check that the url hasn't already been processed
                if target_url not in self.processed_urls:
                    # add url to the processed list
                    self.processed_urls.add(target_url)
                    print(f'Processing url {target_url}')
                    # passing an object to the
                    # ProcessPoolExecutor... can this be done?
                    job = self.pool.submit(Scraper, target_url)
                    """
                    How do I 1) return the data from each
                    Scraper instance into self.data?
                    and 2) put scraped links to self.queued_urls?
                    """
            except Empty:
                print("All done.")
            except Exception as e:
                print(e)


if __name__ == '__main__':
    crawler = CrawlManager('www.mywebsite.com')
    crawler.crawl()
For anyone who comes across this page, I was able to figure this out for myself.
Per @brad-solomon's advice, I switched from ProcessPoolExecutor to ThreadPoolExecutor to manage the concurrent aspects of this script (see his comment for further details).
W.r.t. the original question, the key was to utilize the add_done_callback method of the Future returned by ThreadPoolExecutor.submit, in conjunction with a modification to Scraper.scrape and a new method, CrawlManager.proc_scraper_results, as in the following:
from queue import Queue, Empty
from concurrent.futures import ThreadPoolExecutor


class Scraper:
    """
    Scrapes a single url
    """

    def __init__(self, url):
        self.url = url  # url of page to scrape
        self.internal_urls = None
        self.content = None
        self.scrape()

    def scrape(self):
        """
        Method(s) to request a page, scrape links from that page
        to other pages, and finally scrape actual content from the current page
        """
        # assume that code in this method would yield urls linked in current page
        self.internal_urls = set(scraped_urls)
        # and that code in this method would scrape a bit of actual content
        self.content = {'content1': content1, 'content2': content2, 'etc': etc}
        # these three items will be passed to the callback
        # function within a future object
        return self.internal_urls, self.url, self.content


class CrawlManager:
    """
    Manages a multithreaded crawl and scrape of a single website
    """

    def __init__(self, seed_url):
        self.seed_url = seed_url
        self.pool = ThreadPoolExecutor(max_workers=10)
        self.processed_urls = set([])
        self.queued_urls = Queue()
        self.queued_urls.put(self.seed_url)
        self.data = {}

    def proc_scraper_results(self, future):
        # get the items of interest from the future object
        # (future.result() returns whatever Scraper.scrape returned)
        internal_urls, url, content = future.result()
        # assign scraped data/content
        self.data[url] = content
        # also add scraped links to queue if they
        # aren't already queued or already processed
        for link_url in internal_urls:
            if link_url not in self.queued_urls.queue and link_url not in self.processed_urls:
                self.queued_urls.put(link_url)

    def crawl(self):
        while True:
            try:
                # get a url from the queue
                target_url = self.queued_urls.get(timeout=60)
                # check that the url hasn't already been processed
                if target_url not in self.processed_urls:
                    # add url to the processed list
                    self.processed_urls.add(target_url)
                    print(f'Processing url {target_url}')
                    # add a job to the ThreadPoolExecutor (note, unlike the
                    # original question, we pass a bound method, not an object)
                    job = self.pool.submit(Scraper(target_url).scrape)
                    # to add_done_callback we pass another function, this one from
                    # CrawlManager; when it is called, it is passed a `future` object
                    job.add_done_callback(self.proc_scraper_results)
            except Empty:
                # the queue stayed empty for 60 seconds: the crawl is finished
                print("All done.")
                return
            except Exception as e:
                print(e)


if __name__ == '__main__':
    crawler = CrawlManager('www.mywebsite.com')
    crawler.crawl()
The result is a very significant reduction in this program's run time.
Using Scrapy, I am trying to scrape a link network from Wikipedia across all languages. Each Wikipedia page should contain a link to a Wikidata item that uniquely identifies the topic of the page across all languages. The process I am trying to implement looks like this:
First, extract the Wikidata link from each page (the "source" link).
Iterate through the remaining links on the page.
For each link, send a request to the corresponding page (the "target" link), with a new callback function.
Extract the Wikidata link from the corresponding target page.
Iterate through all the links on the target page and call back to the original parse function.
Basically, I want to skip over the intermediate link on a given source page and instead grab its corresponding Wikidata link.
Here is the (semi-working) code that I have so far:
import re
from urllib.parse import urljoin, urlparse

from scrapy import Request, Spider

from wiki_network.items import WikiNetworkItem

WD = \
    "//a/@href[contains(., 'wikidata.org/wiki/Special:EntityPage') \
    and not(contains(., '#'))][1]"

TARGETS = \
    "//a/@href[contains(., '/wiki/') \
    and not(contains(., 'wikidata')) \
    and not(contains(., 'wikimedia'))]"


class WikiNetworkSpider(Spider):
    name = "wiki_network"
    allowed_domains = ["wikipedia.org"]
    start_urls = ["https://gl.wikipedia.org/wiki/Jacques_Derrida"]
    filter = re.compile(r"^.*(?!.*:[^_]).*wiki.*")

    def parse(self, response):
        # Extract the Wikidata link from the "source" page
        source = response.xpath(WD).extract_first()
        # Extract the set of links from the "source" page
        targets = response.xpath(TARGETS).extract()
        if source:
            source_title = response.xpath("//h1/text()").extract_first()
            for target in targets:
                if self.filter.match(str(target)) is not None:
                    item = WikiNetworkItem()
                    item["source"] = source
                    item["source_domain"] = urlparse(response.url).netloc
                    item["refer"] = response.url
                    item["source_title"] = source_title
                    # Yield a request to the target page
                    yield Request(url=urljoin(response.url, str(target)),
                                  callback=self.parse_wikidata,
                                  meta={"item": item})

    def parse_wikidata(self, response):
        item = WikiNetworkItem(response.meta["item"])
        wikidata_target = response.xpath(WD).extract_first()
        if wikidata_target:
            # Return current item
            yield self.item_helper(item, wikidata_target, response)
            # Harvest next set of links
            for s in response.xpath(TARGETS).extract():
                if self.filter.match(str(s)) is not None:
                    yield Request(url=urljoin(response.url, str(s)),
                                  callback=self.parse, meta={"item": item})

    def item_helper(self, item, wikidata, response):
        print()
        print("Target: ", wikidata)
        print()
        if item["source"] != wikidata:
            target_title = response.xpath("//h1/text()").extract_first()
            item["target"] = wikidata
            item["target_title"] = target_title
            item["target_domain"] = urlparse(response.url).netloc
            item["target_wiki"] = response.url
            print()
            print("Target: ", target_title)
            print()
            return item
The spider runs and scrapes links for a while (the scraped item count typically reaches 620 or so), but eventually it builds up a massive queue, stops scraping altogether, and just continues to crawl. Should I expect it to begin scraping again at some point?
It seems as though there should be an easy way to do this kind of second-level scraping in Scrapy, but the other questions I've read so far are mostly about how to handle paging in Scrapy, not about how to "fold" a link in this way.
As long as your spider has no other issues, what you really want is that when you run
yield Request(url=urljoin(response.url, str(target)),
              callback=self.parse_wikidata,
              meta={"item": item})
it should be scheduled sooner than the queued requests of the type below:
yield Request(url=urljoin(response.url, str(s)),
              callback=self.parse, meta={"item": item})
If you look at the documentation
https://doc.scrapy.org/en/latest/topics/request-response.html
priority (int) – the priority of this request (defaults to 0). The priority is used by the scheduler to define the order used to process requests. Requests with a higher priority value will execute earlier. Negative values are allowed in order to indicate relatively low-priority.
So you will use
yield Request(url=urljoin(response.url, str(target)),
              callback=self.parse_wikidata,
              meta={"item": item}, priority=1)
and
yield Request(url=urljoin(response.url, str(s)),
              callback=self.parse, meta={"item": item}, priority=-1)
This will make sure that the scraper gives priority to the links that will result in data being scraped first.
I have this code available from my previous experiment.
import scrapy


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://example.com/']

    def parse(self, response):
        for title in response.css('h2'):
            yield {'Agent-name': title.css('a ::text').extract_first()}
        next_page = response.css('li.col-md-3 ln-t > div.cs-team team-grid > figure > a ::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
I don't understand how to modify this code so that it takes a list of URLs from a text file as input (maybe 200+ domains), checks the HTTP status of each domain, and stores the result in a file. I am trying this to check whether the domains are live or not.
What I am expecting as output is:
example.com,200
example1.com,300
example2.com,503
I want to give a file as input to the Scrapy script, and it should give me the above output. I have tried to look at these questions: How to detect HTTP response status code and set a proxy accordingly in scrapy? and Scrapy and response status code: how to check against it?
But I found no luck there. Hence, I am thinking of modifying my code to get it done. How can I do that? Please help me.
For each response object you can get the URL and status code through the response object's properties. So for each link you send a request to, you can get the status code using response.status.
Does it work as you want like this?
def parse(self, response):
    # file chosen for output (append mode):
    file.write(u"%s : %s\n" % (response.url, response.status))
    # if response.status in [400, ...]: do smthg
    for title in response.css('h2'):
        yield {'Agent-name': title.css('a ::text').extract_first()}
    next_page = response.css('li.col-md-3 ln-t > div.cs-team team-grid > figure > a ::attr(href)').extract_first()
    if next_page:
        yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
Disclaimer: I'm fairly new to Scrapy.
To put my question plainly: How can I retrieve an Item property from a link on a page and get the results back into the same Item?
Given the following sample Spider:
class SiteSpider(Spider):
    site_loader = SiteLoader
    ...

    def parse(self, response):
        item = Place()
        sel = Selector(response)
        bl = self.site_loader(item=item, selector=sel)
        bl.add_value('domain', self.parent_domain)
        bl.add_value('origin', response.url)
        for place_property in item.fields:
            parse_xpath = self.template.get(place_property)
            # parse_xpath will look like either:
            # '//path/to/property/text()'
            # or
            # {'url': '//a[@id="Location"]/@href',
            #  'xpath': '//div[@class="directions"]/span[@class="address"]/text()'}
            if isinstance(parse_xpath, dict):  # place_property is at a URL
                url = sel.xpath(parse_xpath['url_elem']).extract()
                yield Request(url, callback=self.get_url_property,
                              meta={'loader': bl, 'parse_xpath': parse_xpath,
                                    'place_property': place_property})
            else:  # parse_xpath is just an xpath; process normally
                bl.add_xpath(place_property, parse_xpath)
        yield bl.load_item()

    def get_url_property(self, response):
        loader = response.meta['loader']
        parse_xpath = response.meta['parse_xpath']
        place_property = response.meta['place_property']
        sel = Selector(response)
        loader.add_value(place_property, sel.xpath(parse_xpath['xpath']))
        return loader
I'm running these spiders against multiple sites, and most of them have the data I need on one page and it works just fine. However, some sites have certain properties on a sub-page (ex., the "address" data existing at the "Get Directions" link).
The "yield Request" line is really where I have the problem. I see the items move through the pipeline, but they're missing those properties that are found at other URLs (IOW, those properties that get to "yield Request"). The get_url_property callback is basically just looking for an xpath within the new response variable, and adding that to the item loader instance.
Is there a way to do what I'm looking for, or is there a better way? I would like to avoid making a synchronous call to get the data I need (if that's even possible here), but if that's the best way, then maybe that's the right approach. Thanks.
If I understand you correctly, you have (at least) two different cases:
The crawled page links to another page containing the data (1+ further request necessary)
The crawled page contains the data (No further request necessary)
In your current code, you call yield bl.load_item() for both cases, but in the parse callback. Note that the request you yield is executed at some later point in time, so the item is incomplete; that's why you're missing the place_property key from the item in the first case.
Possible Solution
A possible solution (if I understood you correctly) is to exploit the asynchronous behavior of Scrapy. Only minor changes to your code are involved.
For the first case, you pass the item loader to another request, which will then yield it. This is what you do in the isinstance if clause. You'll need to change the return value of the get_url_property callback to actually yield the loaded item.
For the second case, you can return the item directly, so simply yield the item in the else clause.
The following code contains the changes to your example.
Does this solve your problem?
def parse(self, response):
    # ...
    if isinstance(parse_xpath, dict):  # place_property is at a URL
        url = sel.xpath(parse_xpath['url_elem']).extract()
        yield Request(url, callback=self.get_url_property,
                      meta={'loader': bl, 'parse_xpath': parse_xpath,
                            'place_property': place_property})
    else:  # parse_xpath is just an xpath; process normally
        bl.add_xpath(place_property, parse_xpath)
        yield bl.load_item()

def get_url_property(self, response):
    loader = response.meta['loader']
    # ...
    loader.add_value(place_property, sel.xpath(parse_xpath['xpath']))
    yield loader.load_item()
Related to that problem is the question of chaining requests, for which I have noted a similar solution.
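For reference, a generic sketch of that chaining idea (all URLs, selectors, and fields below are hypothetical): each detail page adds its piece of data to the item carried in meta, and the item is only yielded once the list of pending URLs is exhausted, so a single complete item comes out no matter how many sub-pages contribute to it.

import scrapy

class ChainSpider(scrapy.Spider):
    name = 'chain'
    start_urls = ['https://example.com/place/1']

    def parse(self, response):
        item = {'origin': response.url}
        # hypothetical follow-up pages whose data belongs to this same item
        pending = [response.urljoin(href)
                   for href in response.xpath('//a[@class="detail"]/@href').extract()]
        if pending:
            yield scrapy.Request(pending[0], callback=self.parse_detail,
                                 meta={'item': item, 'pending': pending[1:]})
        else:
            yield item

    def parse_detail(self, response):
        item = response.meta['item']
        pending = response.meta['pending']
        item[response.url] = response.xpath('//h1/text()').extract_first()
        if pending:
            # chain to the next detail page, carrying the same item along
            yield scrapy.Request(pending[0], callback=self.parse_detail,
                                 meta={'item': item, 'pending': pending[1:]})
        else:
            # all detail pages visited; the item is now complete
            yield item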
I have in my pipeline a method to check if the post date of the item is older than that found in MySQL, so let lastseen be the newest datetime retrieved from the database:
def process_item(self, item, spider):
    if item['post_date'] < lastseen:
        # set flag to close_spider
        # raise DropItem("old item")
This code basically works, except: I check the site on an hourly basis just to get the new posts. If I don't stop the spider, it will keep crawling thousands of pages; if I stop the spider on the flag, chances are a few requests will not be processed, since they may come back in the queue after the spider closed, even though they might be newer in post date. Having said that, is there a workaround for more precise scraping?
Thanks,
Not sure if this fits your setup, but you can fetch lastseen from MySQL when initializing your spider and stop generating Requests in your callbacks when the response contains an item with post_date < lastseen, hence basically moving the stop-crawling logic directly into the Spider instead of the pipeline.
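A minimal sketch of that first suggestion, assuming a SQLAlchemy model with a post_date column and a reachable MySQL database (the model, connection string, and module path are all hypothetical):

from scrapy import Spider
from sqlalchemy import create_engine, func
from sqlalchemy.orm import Session

from myproject.models import PostModel  # hypothetical model with a post_date column

class MySpider(Spider):
    name = 'myspider'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        engine = create_engine('mysql://user:password@localhost/mydb')  # hypothetical DSN
        with Session(engine) as session:
            # newest post_date already stored, or None on the very first run
            self.lastseen = session.query(func.max(PostModel.post_date)).scalar()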
It can sometimes be simpler to pass an argument to your spider
scrapy crawl myspider -a lastseen=20130715
and set a property on your Spider to test in your callback (http://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments):
class MySpider(BaseSpider):
    name = 'myspider'

    def __init__(self, lastseen=None):
        self.lastseen = lastseen
        # note: arguments passed with -a arrive as strings,
        # so convert lastseen to a date/datetime before comparing
        # ...

    def parse_new_items(self, response):
        follow_next_page = True
        # item fetch logic
        for element in <some_selector>:
            # get post_date
            post_date = <extract post_date from element>
            # check post_date
            if post_date < self.lastseen:
                follow_next_page = False
                continue
            item = MyItem()
            # populate item...
            yield item
        # find next page to crawl
        if follow_next_page:
            next_page_url = ...
            yield Request(url=next_page_url, callback=self.parse_new_items)