scrapy: how to have a response parsed by multiple parser functions? - python

I'd like to do something special to each one of the landing URLs in start_urls, and then have the spider follow all the next pages and crawl deeper. My code is roughly like this:
def init_parse(self, response):
    item = MyItem()
    # extract info from the landing url and populate item fields here...
    yield self.newly_parse(response)
    yield item
    return

parse_start_url = init_parse

def newly_parse(self, response):
    item = MyItem2()
    newly_soup = BeautifulSoup(response.body)
    # parse, return or yield items
    return item
The code won't work because a spider callback may only return an item, a request or None, but I yield the result of self.newly_parse. How can I achieve this in Scrapy?
My not so elegant solution:
Put the init_parse logic inside newly_parse and add an is_start_url check at the beginning: if response.url is in start_urls, go through the init_parse procedure.
Another ugly solution
Separate out the code where # parse, return or yield items happens and make it a class method or generator, and call this method or generator both inside init_parse and newly_parse.

If you're going to yield multiple items from newly_parse, the line in init_parse should be:
for item in self.newly_parse(response):
    yield item
This is because self.newly_parse will then return a generator, which you need to iterate through yourself, as Scrapy won't recognize a generator object yielded in place of an item.
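A minimal sketch of why the loop is needed, using plain Python generators rather than a real spider. The function names mirror the question, while the dict items and the example URL are made up for illustration. On Python 3.3+ the loop can also be written as `yield from`:

```python
def newly_parse(response):
    # Stand-in for a callback that yields several items.
    yield {"kind": "deep", "url": response}
    yield {"kind": "deep-extra", "url": response}


def init_parse_wrong(response):
    # Yields the generator object itself, not the items inside it.
    yield newly_parse(response)
    yield {"kind": "landing", "url": response}


def init_parse(response):
    # Re-yield every item produced by the inner generator.
    yield from newly_parse(response)
    yield {"kind": "landing", "url": response}


wrong = list(init_parse_wrong("http://example.com"))
right = list(init_parse("http://example.com"))
print(type(wrong[0]).__name__)            # generator
print([item["kind"] for item in right])   # ['deep', 'deep-extra', 'landing']
```

The first version hands its caller one opaque generator plus one item; the second hands it three items.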

Related

Scrapy/Python yield and continue processing possible?

I am trying this sample code
from scrapy.spiders import Spider, Request
import scrapy

class MySpider(Spider):
    name = 'toscrapecom'
    start_urls = ['http://books.toscrape.com/catalogue/page-1.html']
    urls = (
        'http://books.toscrape.com/catalogue/page-{}.html'.format(i + 1) for i in range(50)
    )

    def parse(self, response):
        for url in self.urls:
            return Request(url)
It crawls all the pages fine. However if I yield an item before the for loop then it crawls only the first page. (as shown below)
from scrapy.spiders import Spider, Request
import scrapy

class MySpider(Spider):
    name = 'toscrapecom'
    start_urls = ['http://books.toscrape.com/catalogue/page-1.html']
    urls = (
        'http://books.toscrape.com/catalogue/page-{}.html'.format(i + 1) for i in range(50)
    )

    def parse(self, response):
        yield scrapy.item.Item()
        for url in self.urls:
            return Request(url)
But I can use yield Request(url) instead of return... and it scrapes the pages backwards from last page to first.
I would like to understand why return does not work anymore once an item is yielded? Can somebody explain this in a simple way?
You ask why the second code does not work, but I don’t think you fully understand why the first code works :)
The for loop of your first code only loops once.
What is happening is:
1. self.parse() is called for the URL in self.start_urls.
2. self.parse() gets the first (and only the first!) URL from self.urls, and returns it, exiting self.parse().
3. When a response for that first URL arrives, self.parse() gets called again, and this time it returns a request (only 1 request!) for the second URL from self.urls, because the previous call to self.parse() already consumed the first URL from it (self.urls is an iterator).
The last step repeats in a loop, but it is not the for loop that does it.
You can change your original code to this and it will work the same way:
def parse(self, response):
    try:
        # next() hands back a URL string, so wrap it in a Request
        return Request(next(self.urls))
    except StopIteration:
        pass
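To see the one-result-per-call behaviour in isolation, here is a small sketch without Scrapy. `FakeSpider` and the shortened URLs are placeholders, and `parse` returns the bare URL string instead of a Request:

```python
class FakeSpider:
    # A generator expression stored on the class is a one-shot iterator
    # shared by every call to parse().
    urls = ("page-{}.html".format(i + 1) for i in range(3))

    def parse(self, response):
        for url in self.urls:
            return url  # leaves the loop (and the method) on the first pass
        return None     # reached only once the iterator is exhausted


spider = FakeSpider()
results = [spider.parse("response") for _ in range(5)]
print(results)
# ['page-1.html', 'page-2.html', 'page-3.html', None, None]
```

Each call advances the shared iterator by exactly one element, which is what drives the crawl in the first version of the spider.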
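The second version stops after the first page because a callback that yields items or requests is a generator function, and everything it produces has to be yielded.
You cannot use yield and return in the same function with the same "meaning". In Python 2, a return with a value inside a generator raises SyntaxError: 'return' with argument inside generator. In Python 3 it is allowed, but the returned value is attached to the StopIteration exception instead of being produced, so the returned Request never reaches Scrapy.
The return is (almost) equivalent to raising StopIteration. The topic "Return and yield in the same function" has a very detailed explanation, with links to the specification.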
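A short Python 3 sketch of that behaviour, with strings standing in for the item and the request (both are placeholders, not Scrapy objects). A consumer that simply iterates over the generator only ever sees the yielded value:

```python
def parse():
    yield "item"
    for url in ["page-1", "page-2"]:
        return "request for " + url  # ends the generator; nothing is yielded


gen = parse()
collected = []
returned = None
while True:
    try:
        collected.append(next(gen))
    except StopIteration as stop:
        # The returned value travels on the exception, not in the stream.
        returned = stop.value
        break

print(collected)  # ['item']
print(returned)   # request for page-1
```

Replacing `return` with `yield` inside the loop puts both requests into the stream alongside the item.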

crawling nested pages using a single spider class and ItemLoader

I'm trying to crawl through a webpage and extract a group of links, then crawl through the webpages of those links and get the data by returning it through an item loader, but I'm having trouble. This is the code I have a problem with in my spider class:
def parse(self, response):
    #--initialize selector--
    for s in response.css(SECTION_SELECTOR):
        #--populate object attributes--
        yield scrapy.Request(self.link, callback=self.parse_single)

def parse_single(self, response):
    #--initialize selector--
    for p in response.css(SINGLE_SELECTOR):
        #--populate item_loader (l)--
        yield l.load_item()
The problem with this approach is that only the items in the last s item are being returned, and the same items are iterated over and over when I save them as a csv file. The loop items in the second parse method are output as hoped with no duplicates.
If I try to switch the yield with return on the first parse method, only the first s item is executed, and again the second parse prints items to file as expected, but only the first s row is outputted with no duplicates.
Please someone explain to me how I can get the code to iterate over all the items in the first loop so they print in the second.

Scrapy not filling object on request

Here is my code
spider.py
def parse(self, response):
    item = someItem()
    cuv = Vitae()
    item['cuv'] = cuv
    request = scrapy.Request(url, callback=self.cvsearch)
    request.meta['item'] = item
    yield request

def cvsearch(self, response):
    item = response.meta['item']
    cv = item['cuv']
    cv['link'] = response.url
    return item
items.py
class someItem(Item):
    cuv = Field()

class Vitae(Item):
    link = Field()
No errors are displayed!
It adds the object "cuv" to "item" but attributes to "cuv" are never added, what am I missing here?
Why do you use a scrapy.Item inside another one?
Try using a plain Python dict for item['cuv'], and pass meta as an argument to the scrapy.Request constructor instead of setting request.meta afterwards.
You should also use yield instead of return:
def parse(self, response):
    item = someItem()
    request = scrapy.Request(url, meta={'item': item}, callback=self.cvsearch)
    yield request

def cvsearch(self, response):
    item = response.meta['item']
    item['cuv'] = {'link': response.url}
    yield item
I am not a very good explainer, but I'll try to explain what's wrong as best I can.
Scrapy is asynchronous, meaning there is no guaranteed order in which requests are executed. Let's take a look at this piece of code:
def parse(self, response):
    item = someItem()
    cuv = {}
    item['cuv'] = cuv
    request = scrapy.Request(url, callback=self.cvsearch)
    request.meta['item'] = item
    yield request
    logging.error(item['cuv'])  # this logs an empty dict, the link is not set yet [1]

def cvsearch(self, response):
    item = response.meta['item']
    cv = item['cuv']
    cv['link'] = response.url
    return item
[1] - This is because this line executes before cvsearch has run, which you can't control. To solve this, you have to cascade the requests:
def parse(self, response):
    item = someItem()
    request = scrapy.Request(url, callback=self.cvsearch)
    request.meta['item'] = item
    yield request

def cvsearch(self, response):
    item = response.meta['item']
    # pass the item along, otherwise response.meta['item'] is missing in another()
    request = scrapy.Request(url, callback=self.another, meta={'item': item})
    yield request

def another(self, response):
    item = response.meta['item']
    yield item
To fully grasp this concept, I advise reading up on asynchronous, event-driven programming (Scrapy is built on Twisted and does not run callbacks in separate threads). Please add anything that I missed!
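The ordering issue can be reproduced with a toy scheduler in plain Python. This is only a sketch under loud assumptions: `FakeRequest`, `run` and the dict-based responses are hypothetical stand-ins, not Scrapy's engine or API. It shows that code placed after `yield request` runs before the callback has filled in the item:

```python
from collections import deque

log = []


class FakeRequest:
    """Hypothetical stand-in for scrapy.Request: a URL, a callback and meta."""

    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}


def parse(response):
    item = {"cuv": {}}
    yield FakeRequest("http://example.com/cv", cvsearch, meta={"item": item})
    # Runs as soon as the engine asks parse() for its next result,
    # which is before cvsearch has been called.
    log.append(("parse sees", dict(item["cuv"])))


def cvsearch(response):
    item = response["meta"]["item"]
    item["cuv"]["link"] = response["url"]
    log.append(("cvsearch sees", dict(item["cuv"])))
    yield item


def run(start_callback):
    """Toy engine: drain a callback fully, queue its requests, then 'download' them."""
    pending = deque()
    items = []
    results = [start_callback({"url": "http://example.com", "meta": {}})]
    while results or pending:
        if results:
            for result in results.pop():
                if isinstance(result, FakeRequest):
                    pending.append(result)
                else:
                    items.append(result)
        else:
            request = pending.popleft()
            response = {"url": request.url, "meta": request.meta}
            results.append(request.callback(response))
    return items


items = run(parse)
print(log)
print(items)
```

The log shows `parse` seeing an empty dict first and `cvsearch` seeing the link afterwards, which is why the finished item has to be yielded from the last callback in the chain.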

Scrapy Not Returning After Yielding a Request

Similar to the person here: Scrapy Not Returning Additonal Info from Scraped Link in Item via Request Callback, I am having difficulty accessing the list of items I build in my callback function. I have tried building the list both in the parse function (but doesn't work because the callback hasn't returned), and the callback, but neither have worked for me.
I am trying to return all the items that I build from these requests. Where do I call "return items" such that the item has been fully processed? I am trying to replicate the tutorial (http://doc.scrapy.org/en/latest/intro/tutorial.html#using-our-item)
Thanks!!
The relevant code is below:
class ASpider(Spider):
    items = []
    ...

    def parse(self, response):
        input_file = csv.DictReader(open("AFile.txt"))
        x = 0
        for row in input_file:
            yield Request("ARequest",
                          cookies={"arg1": "1", "arg2": row["arg2"], "style": "default", "arg3": row["arg3"]},
                          callback=self.aftersubmit, dont_filter=True)

    def aftersubmit(self, response):
        hxs = Selector(response)
        # Create new object..
        item = AnItem()
        item['Name'] = "jsc"
        return item
You need to return (or yield) an item from the aftersubmit callback method. Quote from the docs:
In the callback function, you parse the response (web page) and return
either Item objects, Request objects, or an iterable of both.
def aftersubmit(self, response):
    hxs = Selector(response)
    item = AnItem()
    item['Name'] = "jsc"
    return item
Note that this particular Item instance doesn't make much sense, since you haven't really put anything from the response into its fields.

Multiple pages per item in Scrapy

Disclaimer: I'm fairly new to Scrapy.
To put my question plainly: How can I retrieve an Item property from a link on a page and get the results back into the same Item?
Given the following sample Spider:
class SiteSpider(Spider):
    site_loader = SiteLoader
    ...

    def parse(self, response):
        item = Place()
        sel = Selector(response)
        bl = self.site_loader(item=item, selector=sel)
        bl.add_value('domain', self.parent_domain)
        bl.add_value('origin', response.url)
        for place_property in item.fields:
            parse_xpath = self.template.get(place_property)
            # parse_xpath will look like either:
            #   '//path/to/property/text()'
            # or
            #   {'url': '//a[@id="Location"]/@href',
            #    'xpath': '//div[@class="directions"]/span[@class="address"]/text()'}
            if isinstance(parse_xpath, dict):  # place_property is at a URL
                url = sel.xpath(parse_xpath['url_elem']).extract()
                yield Request(url, callback=self.get_url_property,
                              meta={'loader': bl, 'parse_xpath': parse_xpath,
                                    'place_property': place_property})
            else:  # parse_xpath is just an xpath; process normally
                bl.add_xpath(place_property, parse_xpath)
        yield bl.load_item()

    def get_url_property(self, response):
        loader = response.meta['loader']
        parse_xpath = response.meta['parse_xpath']
        place_property = response.meta['place_property']
        sel = Selector(response)
        loader.add_value(place_property, sel.xpath(parse_xpath['xpath']))
        return loader
I'm running these spiders against multiple sites, and most of them have the data I need on one page and it works just fine. However, some sites have certain properties on a sub-page (ex., the "address" data existing at the "Get Directions" link).
The "yield Request" line is really where I have the problem. I see the items move through the pipeline, but they're missing those properties that are found at other URLs (IOW, those properties that get to "yield Request"). The get_url_property callback is basically just looking for an xpath within the new response variable, and adding that to the item loader instance.
Is there a way to do what I'm looking for, or is there a better way? I would like to avoid making a synchronous call to get the data I need (if that's even possible here), but if that's the best way, then maybe that's the right approach. Thanks.
If I understand you correctly, you have (at least) two different cases:
1. The crawled page links to another page containing the data (1+ further request necessary)
2. The crawled page contains the data (no further request necessary)
In your current code, you call yield bl.load_item() for both cases, but in the parse callback. Note that the request you yield is executed at some later point in time, so the item is still incomplete when it is loaded; that's why the place_property key is missing from the item in the first case.
Possible Solution
A possible solution (if I understood you correctly) is to exploit the asynchronous behavior of Scrapy. Only minor changes to your code are involved.
For the first case, you pass the item loader to another request, whose callback will then yield the item. This is what you do in the isinstance if clause. You'll need to change the return value of the get_url_property callback to actually yield the loaded item.
For the second case, you can return the item directly, so simply yield the item in the else clause.
The following code contains the changes to your example.
Does this solve your problem?
def parse(self, response):
    # ...
    if isinstance(parse_xpath, dict):  # place_property is at a URL
        url = sel.xpath(parse_xpath['url_elem']).extract()
        yield Request(url, callback=self.get_url_property,
                      meta={'loader': bl, 'parse_xpath': parse_xpath,
                            'place_property': place_property})
    else:  # parse_xpath is just an xpath; process normally
        bl.add_xpath(place_property, parse_xpath)
        yield bl.load_item()

def get_url_property(self, response):
    loader = response.meta['loader']
    # ...
    loader.add_value(place_property, sel.xpath(parse_xpath['xpath']))
    yield loader.load_item()
Related to that problem is the question of chaining requests, for which I have noted a similar solution.
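As a rough illustration of chaining requests when several properties live on sub-pages, here is a sketch in plain Python. Everything in it is an assumption made for the example: `FakeRequest`, `crawl`, `next_step`, the `PAGES` table and the property names are hypothetical and not part of Scrapy. Each callback fills in one property, then either issues the next request or yields the finished item, so the item is emitted exactly once:

```python
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request."""

    def __init__(self, url, callback, meta):
        self.url, self.callback, self.meta = url, callback, meta


# Made-up sub-page contents keyed by URL, replacing real downloads.
PAGES = {"/directions": "1 Main St", "/contact": "555-0100"}


def next_step(item, pending):
    """Yield a request for the next missing property, or the finished item."""
    if pending:
        prop, url = pending[0]
        meta = {"item": item, "prop": prop, "pending": pending[1:]}
        yield FakeRequest(url, fill_property, meta)
    else:
        yield item


def parse(response):
    item = {"origin": response["url"]}
    pending = [("address", "/directions"), ("phone", "/contact")]
    yield from next_step(item, pending)


def fill_property(response):
    meta = response["meta"]
    meta["item"][meta["prop"]] = response["body"]
    yield from next_step(meta["item"], meta["pending"])


def crawl(callback, response):
    """Toy driver: follow each request immediately and collect items."""
    items = []
    for result in callback(response):
        if isinstance(result, FakeRequest):
            sub = {"url": result.url, "body": PAGES[result.url], "meta": result.meta}
            items.extend(crawl(result.callback, sub))
        else:
            items.append(result)
    return items


items = crawl(parse, {"url": "/place/1", "body": "", "meta": {}})
print(items)
```

In a real spider the same shape would use Scrapy requests and the meta dict shown in the answer above; the point of the sketch is only that the item travels along the chain and is yielded by the final callback.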
