Using Scrapy, I am trying to scrape a link network from Wikipedia across all languages. Each Wikipedia page should contain a link to a Wikidata item that uniquely identifies the topic of the page across all languages. The process I am trying to implement looks like this:
1. Extract the Wikidata link from each page (the "source" link).
2. Iterate through the remaining links on the page.
3. For each link, send a request to the corresponding page (the "target" link), with a new callback function.
4. Extract the Wikidata link from the corresponding target page.
5. Iterate through all the links on the target page and call back to the original parse function.
Basically, I want to skip over the intermediate link on a given source page and instead grab its corresponding Wikidata link.
Here is the (semi-working) code that I have so far:
import re
from urllib.parse import urljoin, urlparse

from scrapy import Request, Spider
from wiki_network.items import WikiNetworkItem
WD = \
    "//a/@href[contains(., 'wikidata.org/wiki/Special:EntityPage') \
    and not(contains(., '#'))][1]"
TARGETS = \
    "//a/@href[contains(., '/wiki/') \
    and not(contains(., 'wikidata')) \
    and not(contains(., 'wikimedia'))]"
class WikiNetworkSpider(Spider):
name = "wiki_network"
allowed_domains = ["wikipedia.org"]
start_urls = ["https://gl.wikipedia.org/wiki/Jacques_Derrida"]
filter = re.compile(r"^.*(?!.*:[^_]).*wiki.*")
def parse(self, response):
# Extract the Wikidata link from the "source" page
source = response.xpath(WD).extract_first()
# Extract the set of links from the "source" page
targets = response.xpath(TARGETS).extract()
if source:
source_title = response.xpath("//h1/text()").extract_first()
for target in targets:
if self.filter.match(str(target)) is not None:
item = WikiNetworkItem()
item["source"] = source
item["source_domain"] = urlparse(response.url).netloc
item["refer"] = response.url
item["source_title"] = source_title
# Yield a request to the target page
yield Request(url=urljoin(response.url, str(target)), \
callback=self.parse_wikidata, \
meta={"item": item})
def parse_wikidata(self, response):
item = WikiNetworkItem(response.meta["item"])
wikidata_target = response.xpath(WD).extract_first()
if wikidata_target:
# Return current item
yield self.item_helper(item, wikidata_target, response)
# Harvest next set of links
for s in response.xpath(TARGETS).extract():
if self.filter.match(str(s)) is not None:
yield Request(url=urljoin(response.url, str(s)), \
callback=self.parse, meta={"item": item})
def item_helper(self, item, wikidata, response):
print()
print("Target: ", wikidata)
print()
if item["source"] != wikidata:
target_title = response.xpath("//h1/text()").extract_first()
item["target"] = wikidata
item["target_title"] = target_title
item["target_domain"] = urlparse(response.url).netloc
item["target_wiki"] = response.url
print()
print("Target: ", target_title)
print()
return item
The spider runs and scrapes links for a while (the scraped item count typically reaches 620 or so), but eventually it builds up a massive queue, stops scraping altogether, and just continues to crawl. Should I expect it to begin scraping again at some point?
It seems as though there should be an easy way to do this kind of second-level scraping in Scrapy. The other questions I've read so far mostly cover how to handle paging, not how to "fold" a link in this way.
Assuming your spider has no other issues, what you really want is that when you run
yield Request(url=urljoin(response.url, str(target)), \
callback=self.parse_wikidata, \
meta={"item": item})
it should be scheduled sooner than the queued requests of the type below:
yield Request(url=urljoin(response.url, str(s)), \
callback=self.parse, meta={"item": item})
If you look at the documentation
https://doc.scrapy.org/en/latest/topics/request-response.html
priority (int) – the priority of this request (defaults to 0). The priority is used by the scheduler to define the order used to process requests. Requests with a higher priority value will execute earlier. Negative values are allowed in order to indicate relatively low-priority.
So you will use
yield Request(url=urljoin(response.url, str(target)), \
callback=self.parse_wikidata, \
meta={"item": item}, priority=1)
and
yield Request(url=urljoin(response.url, str(s)), \
callback=self.parse, meta={"item": item}, priority=-1)
This will make sure that the scraper gives priority to the links that will result in data being scraped first.
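Not part of the original answer, but possibly relevant to the queue that keeps growing: Scrapy also has built-in settings that cap a crawl outright. The setting names below are standard Scrapy settings; the values are purely illustrative and only make sense if you do not need the full link network.

class WikiNetworkSpider(Spider):
    # Illustrative caps only -- tune or remove as needed.
    custom_settings = {
        "DEPTH_LIMIT": 2,                # do not expand more than 2 hops from the start page
        "CLOSESPIDER_ITEMCOUNT": 10000,  # close the spider after this many scraped items
    }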
Related
My Scrapy spider has a bunch of independent target links to crawl.
def start_requests(self):
search_targets = get_search_targets()
for search in search_targets:
request = get_request(search.contract_type, search.postal_code, 1)
yield request
Each link has multiple pages that will be followed, i.e.:
def parse(self, response, **kwargs):
# Some Logic depending on the response
# ...
if cur_page < num_pages: # Following the link to the next page
next_page = cur_page + 1
request = get_request(contract_type, postal_code, next_page)
yield request
for estate_dict in estates: # Parsing the items of response
item = EstateItem()
fill_item(item, estate_dict)
yield item
Now, after a few pages, each link (target) will start encountering duplicate items already seen in previous crawls. Whether an item is a duplicate is decided in the pipeline, with a query to the database:
def save_estate_item(self, item: EstateItem, session: Session):
query = session.query(EstateModel)
previous_item = query.filter_by(code=item['code']).first()
if previous_item is not None:
logging.info("Duplicate Estate")
return
# Save the item in the DB
# ...
When I find a duplicate estate, I want Scrapy to stop following further pages for that specific search target. How could I do that?
I figured I would raise exceptions.DropItem('Duplicate post') in the pipeline with the info about the finished search target, and catch that exception in my spider. But how could I tell Scrapy to stop following links for that specific search target?
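One possible direction, sketched here purely as an illustration rather than as an answer from the thread: item pipelines receive the spider instance in process_item, so instead of trying to catch the DropItem inside the spider, the pipeline could record the exhausted search target on the spider, and parse could check that record before yielding the next-page request. The finished_targets attribute, the is_duplicate helper, and the (contract_type, postal_code) key are hypothetical names; the item is assumed to carry those two fields.

from scrapy import Spider
from scrapy.exceptions import DropItem

class SaveEstatePipeline:
    def process_item(self, item, spider):
        if self.is_duplicate(item):  # hypothetical helper wrapping the DB query shown above
            # Record the exhausted search target on the spider.
            spider.finished_targets.add((item['contract_type'], item['postal_code']))
            raise DropItem('Duplicate Estate')
        return item

class EstateSpider(Spider):
    name = 'estates'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.finished_targets = set()  # filled by the pipeline

    def parse(self, response, **kwargs):
        # ... same parsing of contract_type, postal_code, cur_page, num_pages, estates as in the question ...
        if (contract_type, postal_code) not in self.finished_targets and cur_page < num_pages:
            yield get_request(contract_type, postal_code, cur_page + 1)
        for estate_dict in estates:
            item = EstateItem()
            fill_item(item, estate_dict)
            yield item

Note that requests scheduled before the first duplicate shows up will still be downloaded, so a few extra pages per target may come through; this only stops new pages from being queued.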
This is related to the previous question I wrote here. I am trying to pull the same data from multiple pages on the same domain. A small explanation: I'm trying to pull data like offensive yards, turnovers, etc. from a bunch of different box scores linked from a main page. Pulling the data from individual pages works properly, as does generating the URLs, but when I try to have the spider cycle through all of the pages, nothing is returned. I've looked through many other questions people have asked, as well as the documentation, and I can't figure out what is not working. Code is below. Thanks in advance to anyone who's able to help.
import scrapy
from scrapy import Selector
from nflscraper.items import NflscraperItem
class NFLScraperSpider(scrapy.Spider):
name = "pfr"
allowed_domains = ['www.pro-football-reference.com/']
start_urls = [
"http://www.pro-football-reference.com/years/2015/games.htm"
#"http://www.pro-football-reference.com/boxscores/201510110tam.htm"
]
def parse(self,response):
        for href in response.xpath('//a[contains(text(),"boxscore")]/@href'):
item = NflscraperItem()
url = response.urljoin(href.extract())
request = scrapy.Request(url, callback=self.parse_dir_contents)
request.meta['item'] = item
yield request
def parse_dir_contents(self,response):
item = response.meta['item']
# Code to pull out JS comment - https://stackoverflow.com/questions/38781357/pro-football-reference-team-stats-xpath/38781659#38781659
        extracted_text = response.xpath('//div[@id="all_team_stats"]//comment()').extract()[0]
new_selector = Selector(text=extracted_text[4:-3].strip())
# Item population
        item['home_score'] = response.xpath('//*[@id="content"]/table/tbody/tr[2]/td[last()]/text()').extract()[0].strip()
        item['away_score'] = response.xpath('//*[@id="content"]/table/tbody/tr[1]/td[last()]/text()').extract()[0].strip()
        item['home_oyds'] = new_selector.xpath('//*[@id="team_stats"]/tbody/tr[6]/td[2]/text()').extract()[0].strip()
        item['away_oyds'] = new_selector.xpath('//*[@id="team_stats"]/tbody/tr[6]/td[1]/text()').extract()[0].strip()
        item['home_dyds'] = item['away_oyds']
        item['away_dyds'] = item['home_oyds']
        item['home_turn'] = new_selector.xpath('//*[@id="team_stats"]/tbody/tr[8]/td[2]/text()').extract()[0].strip()
        item['away_turn'] = new_selector.xpath('//*[@id="team_stats"]/tbody/tr[8]/td[1]/text()').extract()[0].strip()
yield item
The subsequent requests you make are filtered as offsite; fix your allowed_domains setting:
allowed_domains = ['pro-football-reference.com']
Worked for me.
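As a side note (not from the original answer): allowed_domains entries are expected to be bare domain names, with no scheme, path, or trailing slash; subdomains such as www. then match automatically. An entry that is not a bare domain will not match the request host, which is why the boxscore requests were being filtered as offsite. A quick before/after for this spider:

# Before: URL-like entry; the request host never matches it, so every
# boxscore request is dropped by the offsite middleware.
allowed_domains = ['www.pro-football-reference.com/']

# After: bare domain; www.pro-football-reference.com matches it, so the
# requests go through.
allowed_domains = ['pro-football-reference.com']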
I've been trying to implement a parse function.
Essentially I figured out through the scrapy shell that
response.xpath('//*[@id="PagerAfter"]/a[last()]/@href').extract()[0]
gives me the URL directing me to the next page. So I tried following the instructions for next_page. I took a look around Stack Overflow and it seems that everyone uses Rule(LinkExtractor(...)), which I don't believe I need to use. I'm pretty sure I'm doing it completely wrong, though. I originally had a for loop that added every link I wanted to visit to start_urls, because I knew they were all of the form *p1.html, *p2.html, etc., but I want to make this smarter.
def parse(self, response):
items = []
    for sel in response.xpath('//div[@class="Message"]'):
itemx = mydata()
itemx['information'] = sel.extract()
items.append(itemx)
with open('log.txt', 'a') as f:
f.write('\ninformation: ' + itemx.get('information')
    # URL of next page: response.xpath('//*[@id="PagerAfter"]/a[last()]/@href').extract()[0]
    next_page = (response.xpath('//*[@id="PagerAfter"]/a[last()]/@href'))
    if (response.url != response.xpath('//*[@id="PagerAfter"]/a[last()]/@href')):
        if next_page:
            yield Request(response.xpath('//*[@id="PagerAfter"]/a[last()]/@href')[0], self.parse)
return items
but it does not work; I get a
next_page = (response.xpath('//*[@id="PagerAfter"]/a[last()]/@href'))
            ^
SyntaxError: invalid syntax
error. Additionally, I know that the yield Request part is wrong. I want to recursively call parse and add each page's scraped results to the items list.
Thank you!
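For reference, a hedged sketch of how this pagination pattern usually looks inside the spider (not an answer from the thread). It reuses the mydata item and the XPaths from the question and assumes a Scrapy version that has response.urljoin. Two things worth noting: the SyntaxError quoted above is actually triggered by the unclosed f.write(... call on the line before next_page, and Request expects a URL string rather than a selector list.

from scrapy import Request

def parse(self, response):
    for sel in response.xpath('//div[@class="Message"]'):
        itemx = mydata()
        itemx['information'] = sel.extract()
        yield itemx  # yield each item instead of collecting them in a list

    # Follow the "next page" link, if any.
    hrefs = response.xpath('//*[@id="PagerAfter"]/a[last()]/@href').extract()
    if hrefs:
        next_page = response.urljoin(hrefs[0])
        if next_page != response.url:  # avoid re-requesting the page we are on
            yield Request(next_page, callback=self.parse)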
Here is my code:
def parse(self, response):
selector = Selector(response)
    sites = selector.xpath("//h3[@class='r']/a/@href")
for index, site in enumerate(sites):
url = result.group(1)
print url
yield Request(url = site.extract(),callback = self.parsedetail)
def parsedetail(self,response):
print response.url
...
obj = Store.objects.filter(id=store_obj.id,add__isnull=True)
if obj:
obj.update(add=add)
In def parse, Scrapy will get the URLs from Google. The URL output looks like:
www.test.com
www.hahaha.com
www.apple.com
www.rest.com
But when they are yielded to def parsedetail, the URLs are not in order; they may become:
www.rest.com
www.test.com
www.hahaha.com
www.apple.com
Is there any way I can yield the URLs to def parsedetail in order? I need to crawl www.test.com first, because the data the top URL in the Google search provides is more accurate. If there is no data in it, I will go on to the next URL until the empty field is updated (www.hahaha.com, www.apple.com, www.rest.com). Please guide me, thank you!
By default, the order in which Scrapy requests are scheduled and sent is not defined. But you can control it using the priority keyword argument:
priority (int) – the priority of this request (defaults to 0). The
priority is used by the scheduler to define the order used to process
requests. Requests with a higher priority value will execute earlier.
Negative values are allowed in order to indicate relatively
low-priority.
You can also make the crawling synchronous by passing the callstack around inside the meta dictionary; as an example, see this answer.
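A rough sketch of what that synchronous approach could look like for this spider (not from the original answer): carry the not-yet-visited result URLs along in meta and only move on to the next one when the current page had nothing usable. The data_found check is a placeholder for whatever test decides that, and Request is assumed to be imported from scrapy.

def parse(self, response):
    sites = [s.extract() for s in response.xpath("//h3[@class='r']/a/@href")]
    if sites:
        # Start with the first (top-ranked) result and carry the rest along.
        yield Request(url=sites[0], callback=self.parsedetail,
                      meta={"pending": sites[1:]})

def parsedetail(self, response):
    # ... try to extract and save the data from this page ...
    data_found = False  # placeholder: replace with whatever check decides this page had usable data
    pending = response.meta.get("pending", [])
    if not data_found and pending:
        # Fall through to the next result only if this one was empty.
        yield Request(url=pending[0], callback=self.parsedetail,
                      meta={"pending": pending[1:]})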
Disclaimer: I'm fairly new to Scrapy.
To put my question plainly: How can I retrieve an Item property from a link on a page and get the results back into the same Item?
Given the following sample Spider:
class SiteSpider(Spider):
site_loader = SiteLoader
...
def parse(self, response):
item = Place()
sel = Selector(response)
bl = self.site_loader(item=item, selector=sel)
bl.add_value('domain', self.parent_domain)
bl.add_value('origin', response.url)
for place_property in item.fields:
parse_xpath = self.template.get(place_property)
# parse_xpath will look like either:
# '//path/to/property/text()'
# or
            # {'url': '//a[@id="Location"]/@href',
            #  'xpath': '//div[@class="directions"]/span[@class="address"]/text()'}
if isinstance(parse_xpath, dict): # place_property is at a URL
url = sel.xpath(parse_xpath['url_elem']).extract()
yield Request(url, callback=self.get_url_property,
meta={'loader': bl, 'parse_xpath': parse_xpath,
'place_property': place_property})
else: # parse_xpath is just an xpath; process normally
bl.add_xpath(place_property, parse_xpath)
yield bl.load_item()
def get_url_property(self, response):
loader = response.meta['loader']
parse_xpath = response.meta['parse_xpath']
place_property = response.meta['place_property']
sel = Selector(response)
        loader.add_value(place_property, sel.xpath(parse_xpath['xpath']))
return loader
I'm running these spiders against multiple sites, and most of them have the data I need on one page, so it works just fine. However, some sites have certain properties on a sub-page (e.g., the "address" data lives behind the "Get Directions" link).
The "yield Request" line is really where I have the problem. I see the items move through the pipeline, but they're missing those properties that are found at other URLs (IOW, those properties that get to "yield Request"). The get_url_property callback is basically just looking for an xpath within the new response variable, and adding that to the item loader instance.
Is there a way to do what I'm looking for, or is there a better way? I would like to avoid making a synchronous call to get the data I need (if that's even possible here), but if that's the best way, then maybe that's the right approach. Thanks.
If I understand you correctly, you have (at least) two different cases:
1. The crawled page links to another page containing the data (one or more further requests necessary).
2. The crawled page contains the data itself (no further request necessary).
In your current code, you call yield bl.load_item() for both cases, but always in the parse callback. Note that the request you yield is executed at some later point in time; thus, the item is incomplete at the moment it is loaded, which is why you're missing the place_property key from the item in the first case.
Possible Solution
A possible solution (if I understood you correctly) is to exploit the asynchronous behavior of Scrapy. Only minor changes to your code are involved.
For the first case, you pass the item loader to another request, which will then yield it. This is what you do in the isinstance if clause. You'll need to change the return value of the get_url_property callback to actually yield the loaded item.
For the second case, you can return the item directly, so simply yield the item in the else clause.
The following code contains the changes to your example.
Does this solve your problem?
def parse(self, response):
# ...
if isinstance(parse_xpath, dict): # place_property is at a URL
url = sel.xpath(parse_xpath['url_elem']).extract()
yield Request(url, callback=self.get_url_property,
meta={'loader': bl, 'parse_xpath': parse_xpath,
'place_property': place_property})
else: # parse_xpath is just an xpath; process normally
bl.add_xpath(place_property, parse_xpath)
yield bl.load_item()
def get_url_property(self, response):
loader = response.meta['loader']
# ...
    loader.add_value(place_property, sel.xpath(parse_xpath['xpath']))
yield loader.load_item()
Related to that problem is the question of chaining requests, for which I have noted a similar solution.
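To illustrate the chaining idea mentioned in that last sentence, here is a hedged sketch, not taken from the linked answer, that reuses the names from the question (self.template, the sel selector, the bl loader, item.fields, and Request): collect every property that lives on a sub-page first, then chain one request per property through meta, and only load and yield the item from the final callback in the chain.

def parse(self, response):
    # ... build the loader bl and the Place item exactly as in the question ...
    pending = []  # (url, xpath, property) triples that still need a sub-page request
    for place_property in item.fields:
        parse_xpath = self.template.get(place_property)
        if isinstance(parse_xpath, dict):  # property lives on another page
            url = sel.xpath(parse_xpath['url_elem']).extract()[0]
            pending.append((url, parse_xpath['xpath'], place_property))
        else:  # plain xpath; fill it in directly
            bl.add_xpath(place_property, parse_xpath)

    if pending:
        url, xpath, prop = pending.pop()
        yield Request(url, callback=self.get_url_property,
                      meta={'loader': bl, 'xpath': xpath,
                            'property': prop, 'pending': pending})
    else:
        yield bl.load_item()

def get_url_property(self, response):
    loader = response.meta['loader']
    loader.add_value(response.meta['property'],
                     response.xpath(response.meta['xpath']).extract())
    pending = response.meta['pending']
    if pending:
        # More sub-page properties left: chain the next request.
        url, xpath, prop = pending.pop()
        yield Request(url, callback=self.get_url_property,
                      meta={'loader': loader, 'xpath': xpath,
                            'property': prop, 'pending': pending})
    else:
        # Last link in the chain: the item is now complete.
        yield loader.load_item()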