Multiple pages per item in Scrapy - python

Disclaimer: I'm fairly new to Scrapy.
To put my question plainly: How can I retrieve an Item property from a link on a page and get the results back into the same Item?
Given the following sample Spider:
class SiteSpider(Spider):
    site_loader = SiteLoader
    ...

    def parse(self, response):
        item = Place()
        sel = Selector(response)
        bl = self.site_loader(item=item, selector=sel)
        bl.add_value('domain', self.parent_domain)
        bl.add_value('origin', response.url)
        for place_property in item.fields:
            parse_xpath = self.template.get(place_property)
            # parse_xpath will look like either:
            #   '//path/to/property/text()'
            # or
            #   {'url': '//a[@id="Location"]/@href',
            #    'xpath': '//div[@class="directions"]/span[@class="address"]/text()'}
            if isinstance(parse_xpath, dict):  # place_property is at a URL
                url = sel.xpath(parse_xpath['url']).extract()
                yield Request(url, callback=self.get_url_property,
                              meta={'loader': bl, 'parse_xpath': parse_xpath,
                                    'place_property': place_property})
            else:  # parse_xpath is just an xpath; process normally
                bl.add_xpath(place_property, parse_xpath)
        yield bl.load_item()

    def get_url_property(self, response):
        loader = response.meta['loader']
        parse_xpath = response.meta['parse_xpath']
        place_property = response.meta['place_property']
        sel = Selector(response)
        loader.add_value(place_property, sel.xpath(parse_xpath['xpath']).extract())
        return loader
I'm running these spiders against multiple sites, and most of them have the data I need on one page, so it works just fine. However, some sites have certain properties on a sub-page (e.g., the "address" data exists at the "Get Directions" link).
The "yield Request" line is really where I have the problem. I see the items move through the pipeline, but they're missing the properties that are found at other URLs (in other words, the properties that go through "yield Request"). The get_url_property callback basically just looks for an xpath within the new response and adds the result to the item loader instance.
Is there a way to do what I'm looking for, or is there a better way? I would like to avoid making a synchronous call to get the data I need (if that's even possible here), but if that's the best way, then maybe that's the right approach. Thanks.

If I understand you correctly, you have (at least) two different cases:
1. The crawled page links to another page containing the data (one or more further requests necessary)
2. The crawled page contains the data (no further request necessary)
In your current code, you call yield bl.load_item() for both cases, but in the parse callback. Note that the request you yield is executed at some later point in time, so the item is still incomplete when it is loaded; that's why the place_property key is missing from the item in the first case.
Possible Solution
A possible solution (if I understood you correctly) is to exploit the asynchronous behavior of Scrapy. Only minor changes to your code are involved.
For the first case, you pass the item loader to another request, which will then yield it. This is what you do in the isinstance if clause. You'll need to change the return value of the get_url_property callback to actually yield the loaded item.
For the second case, you can return the item directly, thus simply yield the item in the else clause.
The following code contains the changes to your example.
Does this solve your problem?
def parse(self, response):
    # ...
    if isinstance(parse_xpath, dict):  # place_property is at a URL
        url = sel.xpath(parse_xpath['url']).extract()
        yield Request(url, callback=self.get_url_property,
                      meta={'loader': bl, 'parse_xpath': parse_xpath,
                            'place_property': place_property})
    else:  # parse_xpath is just an xpath; process normally
        bl.add_xpath(place_property, parse_xpath)
        yield bl.load_item()

def get_url_property(self, response):
    loader = response.meta['loader']
    # ...
    loader.add_value(place_property, sel.xpath(parse_xpath['xpath']).extract())
    yield loader.load_item()
Related to that problem is the question of chaining requests, for which I have noted a similar solution.
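For completeness, here is a minimal sketch of that chaining pattern: when a single item needs data from several sub-pages, each callback adds its piece to the loader carried in meta and either yields the next request or, on the last hop, yields the finished item. The URLs, field names, and selectors below are made up for illustration; only the meta-passing technique is the point.

def parse(self, response):
    loader = SiteLoader(item=Place(), selector=Selector(response))
    loader.add_value('origin', response.url)
    # hypothetical sub-page; in practice extract the link from the response
    yield Request(response.urljoin('/directions'), callback=self.parse_address,
                  meta={'loader': loader})

def parse_address(self, response):
    loader = response.meta['loader']
    loader.add_value('address',
                     Selector(response).xpath('//span[@class="address"]/text()').extract())
    # chain one more hypothetical sub-page before finishing the item
    yield Request(response.urljoin('/hours'), callback=self.parse_hours,
                  meta={'loader': loader})

def parse_hours(self, response):
    loader = response.meta['loader']
    loader.add_value('hours',
                     Selector(response).xpath('//div[@id="hours"]/text()').extract())
    yield loader.load_item()  # only the last callback yields the item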

Related

Scrapy item enriching from multiple websites

I implemented the following scenario with the Python Scrapy framework:
class MyCustomSpider(scrapy.Spider):
    def __init__(self, name=None, **kwargs):
        super().__init__(name, **kwargs)
        self.days = getattr(self, 'days', None)

    def start_requests(self):
        start_url = f'https://some.url?days={self.days}&format=json'
        yield scrapy.Request(url=start_url, callback=self.parse)

    def parse(self, response):
        json_data = response.json() if response and response.status == 200 else None
        if json_data:
            for entry in json_data['entries']:
                yield self.parse_json_entry(entry)
            if 'next' in json_data and json_data['next'] != "":
                yield response.follow(f"https://some.url?days={self.days}&time={self.time}&format=json", self.parse)

    def parse_json_entry(self, entry):
        ...
        item = loader.load_item()
        return item
I upsert parsed items into a database in one of the pipelines. I would like to add the following functionality:
before upserting the item I would like to read its current shape from the database
if the item does not exist in the database, or it exists but has some fields empty, I need to make a call to another website (the exact web address is determined from the item's contents), scrape its contents, enrich my item based on this additional reading, and only then save the item to the database. I would like this call to also go through the Scrapy framework, so that I get the cache and other conveniences
if the item does exist in the database and has the appropriate fields filled in, then just update the item's status based on the currently read data
How can I implement point 2 in a Scrapy-like way? Right now I perform the call to the other website in one of the pipelines, after scraping the item, but that way I do not employ Scrapy for it. Is there a smart way of doing that (maybe with pipelines), or should I rather put all the code into one spider, with all the database reads/checks and callbacks there?
Best regards!
I guess the best idea will be to upsert the partial data in one spider/pipeline with a flag stating that it still needs adjustment. Then, in another spider, load the data with the flag set and perform the additional readings.
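A minimal sketch of what that second, enrichment-only spider could look like. Everything here is assumed rather than taken from the original post: load_flagged_items() is a hypothetical database helper returning the partially filled rows, details_url stands in for the URL derived from the item's contents, and the selector is a placeholder.

import scrapy

class EnrichmentSpider(scrapy.Spider):
    name = 'enrich'

    def start_requests(self):
        # load_flagged_items() is an assumed helper that returns items
        # previously upserted with the "needs adjustment" flag set
        for partial in load_flagged_items():
            url = partial['details_url']  # assumed: derived from the item's contents
            yield scrapy.Request(url, callback=self.parse_details,
                                 cb_kwargs={'partial': partial})

    def parse_details(self, response, partial):
        # fill in whatever was missing, then clear the flag
        partial['missing_field'] = response.css('.missing-field::text').get()
        partial['needs_adjustment'] = False
        yield partial  # the existing upsert pipeline stores the now-complete item

Because the requests go through Scrapy, the HTTP cache, throttling, and retry middleware all apply to the enrichment calls as well.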

Proper way of collecting data from multiple sources for a single item

This is a thing I've been encountering very often lately. I am supposed to scrape data from multiple requests for a single item.
I've been using request meta to accumulate data between requests, like this:
def parse_data(self, response):
    data = 'something'
    yield scrapy.Request(
        url='url for another page for scraping images',
        method='GET',
        callback=self.parse_images,
        meta={'data': data}
    )

def parse_images(self, response):
    images = ['some images']
    data = response.meta['data']
    yield scrapy.Request(
        url='url for another page for scraping more data',
        method='GET',
        callback=self.parse_more,
        meta={'images': images, 'data': data}
    )

def parse_more(self, response):
    more_data = 'more data'
    images = response.meta['images']
    data = response.meta['data']
    # build the item from the accumulated pieces
    item = {'data': data, 'images': images, 'more_data': more_data}
    yield item
In the last parse method, I scrape the final needed data and yield the item. However, this approach looks awkward to me. Is there any better way to scrape webpages like those or am I doing this correctly?
It's quite a regular and correct approach, keeping in mind that Scrapy is an async framework.
If you wish to have a plainer code structure, you can use scrapy-inline-requests.
But it will require more hassle than using meta, from my perspective.
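For reference, a rough sketch of what the scrapy-inline-requests style looks like: the @inline_requests decorator lets you write the request chain as sequential code, with each yielded request resolved inline and its response handed back. The URLs and selectors here are placeholders, not taken from the question.

import scrapy
from inline_requests import inline_requests

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com/page']  # placeholder

    @inline_requests
    def parse(self, response):
        data = 'something'
        # yielded requests are resolved inline; the response comes back here
        images_response = yield scrapy.Request('https://example.com/images')  # placeholder
        images = images_response.css('img::attr(src)').getall()
        more_response = yield scrapy.Request('https://example.com/more')  # placeholder
        more_data = more_response.css('.more::text').get()
        # a yielded non-Request object is emitted as an item
        yield {'data': data, 'images': images, 'more_data': more_data}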
This is the proper way of tracking your item throughout requests. What I would do differently though is actually just set the item values like so:
item['foo'] = bar
item['bar'] = foo
yield scrapy.Request(url, callback=self.parse, meta={'item':item})
With this approach you only have to send one thing, the item itself, through each time. There will be some instances where this isn't desirable.
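Put together, the earlier three-callback chain might look like this with a single item dict carried in meta. The URLs and CSS selectors are placeholders for illustration only.

def parse_data(self, response):
    item = {'data': 'something'}
    yield scrapy.Request('https://example.com/images',  # placeholder URL
                         callback=self.parse_images, meta={'item': item})

def parse_images(self, response):
    item = response.meta['item']
    item['images'] = response.css('img::attr(src)').getall()
    yield scrapy.Request('https://example.com/more',  # placeholder URL
                         callback=self.parse_more, meta={'item': item})

def parse_more(self, response):
    item = response.meta['item']
    item['more_data'] = response.css('.more::text').get()
    yield item  # only the final callback yields the completed item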

Can a scrapy callback function point to the same function in which the request is spawned

I am using Scrapy to crawl a site.
I have code similar to this:
class mySpider(scrapy.Spider):

    def start_requests(self):
        yield SplashRequest(url=example_url,
                            callback=self.parse,
                            cookies={'store_language': 'en'},
                            endpoint='render.html',
                            args={'wait': 5},
                            )

    def parse(self, response):
        try:
            self.extract_data_from_page(response)
            if self.next_link_still_on_page(response):
                next_url = self.grok_next_url(response)
                yield SplashRequest(url=next_url,
                                    callback=self.parse,
                                    cookies={'store_language': 'en'},
                                    endpoint='render.html',
                                    args={'wait': 5},
                                    )
        except Exception:
            pass

    def extract_data_from_page(self, response):
        pass

    def next_link_still_on_page(self, response):
        pass

    def grok_next_url(self, response):
        pass
In the parse() method, the callback function is parse() itself. Is this to be frowned upon (e.g., could a logic bug cause a potential stack overflow)?
You can use the same callback. From a technical perspective it isn't an issue. Especially if the yielded request is of the same nature as the current one, then it should ideally reuse the same logic.
However, from a person-who-has-to-read-the-source-code perspective, it is better to have separate parsers for different tasks or page types (the whole single responsibility principle).
Let me illustrate with an example. Let's say you have a listing website (jobs, products, whatever) and you have two main classes of URLs:
Search result pages: .../search?q=something&page=2
Item pages: .../item/abc
The search result page contains pagination links and items. Such a page would spawn two kinds of requests to:
Parse the next page
Parse the item
The Item page will not spawn another request.
So now you can stuff all of that into the same parser and use it for every request:
def parse(self, response):
    if 'search' in response.url:
        for item in response.css('.item'):
            # ...
            yield Request(item_url, callback=self.parse)
        # ...
        yield Request(next_url, callback=self.parse)
    if 'item' in response.url:
        yield {
            'title': response.css('...'),
            # ...
        }
That is obviously a very condensed example, but as it grows it will become harder to follow along.
Instead, break up the different page parsers:
def parse_search(self, response):
    for item in response.css('.items'):
        yield Request(item_url, callback=self.parse_item)
    next_url = response.css('...').get()
    yield Request(next_url, callback=self.parse_search)

def parse_item(self, response):
    yield {
        'title': response.css('...'),
        # ...
    }
So basically, if it's a matter of "another of the same kind of page" then it's normal to use the same callback in order to reuse the same logic. If the next request requires a different kind of parsing, rather make a separate parser.

scrapy: how to have a response parsed by multiple parser functions?

I'd like to do something special to each one of the landing URLs in start_urls, and then have the spider follow all the next pages and crawl deeper. So my code is roughly like this:
def init_parse(self, response):
    item = MyItem()
    # extract info from the landing url and populate item fields here...
    yield self.newly_parse(response)
    yield item
    return

parse_start_url = init_parse

def newly_parse(self, response):
    item = MyItem2()
    newly_soup = BeautifulSoup(response.body)
    # parse, return or yield items
    return item
The code won't work because the spider only allows returning an Item, Request, or None, but I yield self.newly_parse. How can I achieve this in Scrapy?
My not so elegant solution:
put the init_parse function inside newly_parse and implement an is_start_url check at the beginning; if response.url is in start_urls, we go through the init_parse procedure.
Another ugly solution
Separate out the code where # parse, return or yield items happens, make it a class method or generator, and call this method or generator inside both init_parse and newly_parse.
If you're going to yield multiple items under newly_parse, your line under init_parse should be:
for item in self.newly_parse(response):
    yield item
since self.newly_parse will return a generator, which you need to iterate through first, as Scrapy won't recognize it.
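On Python 3 the same delegation can be written more concisely with yield from; this is a small addition for context rather than part of the original answer:

def init_parse(self, response):
    item = MyItem()
    # extract info from the landing URL and populate item fields here...
    yield from self.newly_parse(response)  # delegate to the inner generator
    yield item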

Scrapy Not Returning After Yielding a Request

Similar to the person here: Scrapy Not Returning Additonal Info from Scraped Link in Item via Request Callback, I am having difficulty accessing the list of items I build in my callback function. I have tried building the list both in the parse function (but that doesn't work because the callback hasn't returned yet) and in the callback, but neither has worked for me.
I am trying to return all the items that I build from these requests. Where do I call "return items" such that the item has been fully processed? I am trying to replicate the tutorial (http://doc.scrapy.org/en/latest/intro/tutorial.html#using-our-item)
Thanks!!
The relevant code is below:
class ASpider(Spider):
    items = []
    ...

    def parse(self, response):
        input_file = csv.DictReader(open("AFile.txt"))
        x = 0
        for row in input_file:
            yield Request("ARequest",
                          cookies={"arg1": "1", "arg2": row["arg2"], "style": "default", "arg3": row["arg3"]},
                          callback=self.aftersubmit, dont_filter=True)

    def aftersubmit(self, response):
        hxs = Selector(response)
        # Create new object..
        item = AnItem()
        item['Name'] = "jsc"
        return item
You need to return (or yield) an item from the aftersubmit callback method. Quote from the docs:
In the callback function, you parse the response (web page) and return
either Item objects, Request objects, or an iterable of both.
def aftersubmit(self, response):
    hxs = Selector(response)
    item = AnItem()
    item['Name'] = "jsc"
    return item
Note that this particular Item instance doesn't make sense, since you haven't really put anything from the response into its fields.
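As an aside, there is usually no need for a class-level items list at all: each yielded item flows through the item pipelines (or a feed export such as -o items.json) on its own. A hedged sketch of a callback that actually pulls a value from the response and yields it; the XPath and field name are made up:

def aftersubmit(self, response):
    item = AnItem()
    # pull something real out of the response instead of a hard-coded value
    item['Name'] = response.xpath('//h1/text()').get()  # placeholder XPath
    yield item  # Scrapy collects it; e.g. run `scrapy crawl aspider -o items.json`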
