How to parse multiple pages with Scrapy - Python

I keep getting an error: invalid syntax for
1.add_xpath('tagLine', '//p[@class="tagline"]/text()')
and I cannot seem to figure out why it is giving me that error, since as far as I can tell it is the same syntax as all of the other add_xpath() calls. My other question is how to request other pages. Basically I am going through one big page and following each link on it; once that page is done I want it to go to the next (button) for the next large page, but I don't know how to do that.
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    for url in hxs.select('//a[@class="title"]/@href').extract():
        yield Request(url, callback=self.description_page)
    for url_2 in hxs.select('//a[@class="POINTER"]/@href').extract():
        yield Request(url_2, callback=self.description_page)

def description_page(self, response):
    l = XPathItemLoader(item=TvspiderItem(), response=response)
    l.add_xpath('title', '//div[@class="m show_head"]/h1/text()')
    1.add_xpath('tagLine', '//p[@class="tagline"]/text()')
    1.add_xpath('description', '//div[@class="description"]/span')
    1.add_xpath('rating', '//div[@class="score"]/text()')
    1.add_xpath('imageSrc', '//div[@class="image_bg"]/img/@src')
    return l.load_item()
Any help on this would be greatly appreciated. I am still a bit of a noob when it comes to Python and Scrapy.

def description_page(self, response):
    l = XPathItemLoader(item=TvspiderItem(), response=response)
    l.add_xpath('title', '//div[@class="m show_head"]/h1/text()')
    1.add_xpath('tagLine', '//p[@class="tagline"]/text()')
    1.add_xpath('description', '//div[@class="description"]/span')
    1.add_xpath('rating', '//div[@class="score"]/text()')
    1.add_xpath('imageSrc', '//div[@class="image_bg"]/img/@src')
    return l.load_item()
You have the digit 1 instead of the variable name l.
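For reference, a minimal sketch of the corrected loader calls (only the leading character changes; the XPaths are unchanged from the question), followed by one common way to handle the second part of the question, i.e. following a "next" button back into parse(). The next-button XPath and the project items import path are assumptions, not something the answer above specifies; the other imports match the older Scrapy API the question uses.

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.loader import XPathItemLoader  # old-style loader, as in the question
from tvspider.items import TvspiderItem  # hypothetical items module path

def description_page(self, response):
    l = XPathItemLoader(item=TvspiderItem(), response=response)
    l.add_xpath('title', '//div[@class="m show_head"]/h1/text()')
    l.add_xpath('tagLine', '//p[@class="tagline"]/text()')
    l.add_xpath('description', '//div[@class="description"]/span')
    l.add_xpath('rating', '//div[@class="score"]/text()')
    l.add_xpath('imageSrc', '//div[@class="image_bg"]/img/@src')
    return l.load_item()

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    for url in hxs.select('//a[@class="title"]/@href').extract():
        yield Request(url, callback=self.description_page)
    # follow the "next" button back into parse() so the following
    # listing page is handled the same way (XPath is a placeholder)
    next_url = hxs.select('//a[@class="next"]/@href').extract()
    if next_url:
        yield Request(next_url[0], callback=self.parse)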

Related

Scrapy/Python yield and continue processing possible?

I am trying this sample code
from scrapy.spiders import Spider, Request
import scrapy

class MySpider(Spider):
    name = 'toscrapecom'
    start_urls = ['http://books.toscrape.com/catalogue/page-1.html']

    urls = (
        'http://books.toscrape.com/catalogue/page-{}.html'.format(i + 1) for i in range(50)
    )

    def parse(self, response):
        for url in self.urls:
            return Request(url)
It crawls all the pages fine. However if I yield an item before the for loop then it crawls only the first page. (as shown below)
from scrapy.spiders import Spider, Request
import scrapy

class MySpider(Spider):
    name = 'toscrapecom'
    start_urls = ['http://books.toscrape.com/catalogue/page-1.html']

    urls = (
        'http://books.toscrape.com/catalogue/page-{}.html'.format(i + 1) for i in range(50)
    )

    def parse(self, response):
        yield scrapy.item.Item()
        for url in self.urls:
            return Request(url)
But I can use yield Request(url) instead of return... and it scrapes the pages backwards from last page to first.
I would like to understand why return no longer works once an item is yielded. Can somebody explain this in a simple way?
You ask why the second code does not work, but I don’t think you fully understand why the first code works :)
The for loop of your first code only loops once.
What is happening is:
self.parse() is called for the URL in self.start_urls.
self.parse() gets the first (and only the first!) URL from self.urls, returns a request for it, and exits self.parse().
When a response for that first URL arrives, self.parse() gets called again, and this time it returns a request (only 1 request!) for the second URL from self.urls, because the previous call to self.parse() already consumed the first URL from it (self.urls is an iterator).
The last step repeats in a loop, but it is not the for loop that does it.
You can change your original code to this and it will work the same way:
def parse(self, response):
    try:
        return Request(next(self.urls))
    except StopIteration:
        pass
Because to yield items/requests, the callback should be a generator function.
You cannot even use yield and return with a value in the same function with the same "meaning": in Python 2 it raises SyntaxError: 'return' with argument inside generator.
The return is (almost) equivalent to raising StopIteration. In the topic Return and yield in the same function you can find a very detailed explanation, with links to the specification.
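For completeness, a sketch of the generator-based version the asker refers to ("I can use yield Request(url) instead of return"), which schedules a request for every page in a single pass; the backwards crawl order the asker noticed comes from Scrapy's default LIFO request queue (depth-first order):

def parse(self, response):
    # yielding (instead of returning) keeps the generator going,
    # so every URL from self.urls gets scheduled from the first call
    yield scrapy.item.Item()
    for url in self.urls:
        yield Request(url)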

Can a scrapy callback function point to the same function in which the request is spawned

I am using Scrapy to crawl a site.
I have code similar to this:
import scrapy
from scrapy_splash import SplashRequest

class mySpider(scrapy.Spider):

    def start_requests(self):
        yield SplashRequest(url=example_url,
                            callback=self.parse,
                            cookies={'store_language': 'en'},
                            endpoint='render.html',
                            args={'wait': 5},
                            )

    def parse(self, response):
        try:
            self.extract_data_from_page(response)
            if self.next_link_still_on_page(response):
                next_url = self.grok_next_url(response)
                yield SplashRequest(url=next_url,
                                    callback=self.parse,
                                    cookies={'store_language': 'en'},
                                    endpoint='render.html',
                                    args={'wait': 5},
                                    )
        except Exception:
            pass

    def extract_data_from_page(self, response):
        pass

    def next_link_still_on_page(self, response):
        pass

    def grok_next_url(self, response):
        pass
In the parse() method, the callback function is parse() itself. Is this to be frowned upon (e.g. could a logic bug cause a potential stack overflow)?
You can use the same callback. From a technical perspective it isn't an issue. Especially if the yielded request is of the same nature as the current one, then it should ideally reuse the same logic.
However, from a person-who-has-to-read-the-source-code perspective, it is better to have separate parsers for different tasks or page types (the whole single responsibility principle).
Let me illustrate with an example. Let's say you have a listing website (jobs, products, whatever) and you have two main classes of URLs:
Search result pages: .../search?q=something&page=2
Item pages: .../item/abc
The search result page contains pagination links and items. Such a page would spawn two kinds of requests to:
Parse the next page
Parse the item
The Item page will not spawn another request.
So now you can stuff all of that into the same parser and use it for every request:
def parse(self, response):
    if 'search' in response.url:
        for item in response.css('.item'):
            # ...
            yield Request(item_url, callback=self.parse)
        # ...
        yield Request(next_url, callback=self.parse)
    if 'item' in response.url:
        yield {
            'title': response.css('...'),
            # ...
        }
That is obviously a very condensed example, but as it grows it will become harder to follow along.
Instead, break up the different page parsers:
def parse_search(self, response):
    for item in response.css('.items'):
        yield Request(item_url, callback=self.parse_item)
    next_url = response.css('...').get()
    yield Request(next_url, callback=self.parse_search)

def parse_item(self, response):
    yield {
        'title': response.css('...'),
        # ...
    }
So basically, if it's a matter of "another of the same kind of page" then it's normal to use the same callback in order to reuse the same logic. If the next request requires a different kind of parsing, rather make a separate parser.

What changes need to be done to get HTTP Status code of domain using Scrapy?

I have this code available from my previous experiment.
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://example.com/']

    def parse(self, response):
        for title in response.css('h2'):
            yield {'Agent-name': title.css('a ::text').extract_first()}
        next_page = response.css('li.col-md-3 ln-t > div.cs-team team-grid > figure > a ::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
I do not understand how to modify this code to take a list of URLs from a text file as input (maybe 200+ domains), check the HTTP status of each domain, and store it in a file. I am trying this to check whether the domains are live or not.
What I am expecting to have output is:
example.com,200
example1.com,300
example2.com,503
I want to give a file as input to the Scrapy script, and it should give me the above output. I have looked at the questions How to detect HTTP response status code and set a proxy accordingly in scrapy? and Scrapy and response status code: how to check against it?
but had no luck. Hence, I am thinking of modifying my code to get it done. How can I do that? Please help me.
For each response object you can get the URL and the status code via the response object's properties. So for each link you send a request to, you can get the status code using response.status.
Does it work as you want like that?
def parse(self, response):
    # file chosen to get output (appending mode):
    file.write(u"%s : %s\n" % (response.url, response.status))
    # if response.status in [400, ...]: do smthg
    for title in response.css('h2'):
        yield {'Agent-name': title.css('a ::text').extract_first()}
    next_page = response.css('li.col-md-3 ln-t > div.cs-team team-grid > figure > a ::attr(href)').extract_first()
    if next_page:
        yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
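Building on that, a minimal sketch of the file-driven check the asker describes. The file names domains.txt and status.csv are hypothetical placeholders; handle_httpstatus_all lets non-2xx responses reach the callback, and the errback records domains that never respond:

import scrapy

class StatusSpider(scrapy.Spider):
    name = 'statusspider'

    def start_requests(self):
        # one URL per line in the (hypothetical) input file
        with open('domains.txt') as f:
            for line in f:
                url = line.strip()
                if url:
                    yield scrapy.Request(
                        url,
                        callback=self.parse,
                        errback=self.on_error,
                        meta={'handle_httpstatus_all': True},
                    )

    def parse(self, response):
        # append "url,status" for every response that arrives
        with open('status.csv', 'a') as out:
            out.write('%s,%s\n' % (response.url, response.status))

    def on_error(self, failure):
        # DNS errors, timeouts, refused connections, etc.
        with open('status.csv', 'a') as out:
            out.write('%s,error\n' % failure.request.url)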

Scrapy follow link as well as encoding error

I've been trying to implement a parse function.
Essentially I figured out through the scrapy shell that
response.xpath('//*[@id="PagerAfter"]/a[last()]/@href').extract()[0]
gives me the URL directing me to the next page. So I tried following the instructions with next_page. I took a look around Stack Overflow and it seems that everyone uses Rule(LinkExtractor(...)), which I don't believe I need to use. I'm pretty sure I'm doing it completely wrong, though. I originally had a for loop that added every link I wanted to visit to start_urls, because I knew they were all of the form *p1.html, *p2.html, etc., but I want to make this smarter.
def parse(self, response):
    items = []
    for sel in response.xpath('//div[@class="Message"]'):
        itemx = mydata()
        itemx['information'] = sel.extract()
        items.append(itemx)
        with open('log.txt', 'a') as f:
            f.write('\ninformation: ' + itemx.get('information')

    # URL of next page: response.xpath('//*[@id="PagerAfter"]/a[last()]/@href').extract()[0]
    next_page = (response.xpath('//*[@id="PagerAfter"]/a[last()]/@href'))
    if (response.url != response.xpath('//*[@id="PagerAfter"]/a[last()]/@href')):
        if next_page:
            yield Request(response.xpath('//*[@id="PagerAfter"]/a[last()]/@href')[0], self.parse)
    return items
but it does not work; I get a
next_page = (response.xpath('//*[@id="PagerAfter"]/a[last()]/@href'))
^SyntaxError: invalid syntax
error. Additionally, I know that the yield Request part is wrong. I want to call parse recursively and add each page's scraped items to the list items.
Thank you!
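For what it's worth, a hedged sketch of how this parse() usually ends up: the reported invalid syntax most likely comes from the unclosed f.write(...) call on the line above the one flagged, items are yielded one by one rather than collected into a returned list (a callback cannot both yield requests and return a value), and the next-page request is given a URL string instead of a selector list. The mydata item class and the Request import are assumed from the question:

def parse(self, response):
    for sel in response.xpath('//div[@class="Message"]'):
        itemx = mydata()
        itemx['information'] = sel.extract()
        with open('log.txt', 'a') as f:
            f.write('\ninformation: ' + itemx.get('information'))
        yield itemx

    # .extract() returns a list of strings, so take the first href
    next_page = response.xpath('//*[@id="PagerAfter"]/a[last()]/@href').extract()
    if next_page and next_page[0] != response.url:
        yield Request(next_page[0], callback=self.parse)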

Multiple pages per item in Scrapy

Disclaimer: I'm fairly new to Scrapy.
To put my question plainly: How can I retrieve an Item property from a link on a page and get the results back into the same Item?
Given the following sample Spider:
class SiteSpider(Spider):
    site_loader = SiteLoader
    ...

    def parse(self, response):
        item = Place()
        sel = Selector(response)
        bl = self.site_loader(item=item, selector=sel)
        bl.add_value('domain', self.parent_domain)
        bl.add_value('origin', response.url)
        for place_property in item.fields:
            parse_xpath = self.template.get(place_property)
            # parse_xpath will look like either:
            # '//path/to/property/text()'
            # or
            # {'url': '//a[@id="Location"]/@href',
            #  'xpath': '//div[@class="directions"]/span[@class="address"]/text()'}
            if isinstance(parse_xpath, dict):  # place_property is at a URL
                url = sel.xpath(parse_xpath['url_elem']).extract()
                yield Request(url, callback=self.get_url_property,
                              meta={'loader': bl, 'parse_xpath': parse_xpath,
                                    'place_property': place_property})
            else:  # parse_xpath is just an xpath; process normally
                bl.add_xpath(place_property, parse_xpath)
        yield bl.load_item()

    def get_url_property(self, response):
        loader = response.meta['loader']
        parse_xpath = response.meta['parse_xpath']
        place_property = response.meta['place_property']
        sel = Selector(response)
        loader.add_value(place_property, sel.xpath(parse_xpath['xpath']))
        return loader
I'm running these spiders against multiple sites, and most of them have the data I need on one page, where it works just fine. However, some sites have certain properties on a sub-page (e.g., the "address" data sitting behind the "Get Directions" link).
The "yield Request" line is really where I have the problem. I see the items move through the pipeline, but they're missing the properties that are found at other URLs (in other words, the properties that go through "yield Request"). The get_url_property callback basically just looks for an xpath within the new response variable and adds that to the item loader instance.
Is there a way to do what I'm looking for, or is there a better way? I would like to avoid making a synchronous call to get the data I need (if that's even possible here), but if that's the best way, then maybe that's the right approach. Thanks.
If I understand you correctly, you have (at least) two different cases:
The crawled page links to another page containing the data (1+ further request necessary)
The crawled page contains the data (No further request necessary)
In your current code, you call yield bl.load_item() for both cases, but always in the parse callback. Note that the request you yield is executed at some later point in time, so at that moment the item is still incomplete; that's why you're missing the place_property key from the item in the first case.
Possible Solution
A possible solution (if I understood you correctly) is to exploit the asynchronous behavior of Scrapy. Only minor changes to your code are involved.
For the first case, you pass the item loader to another request, which will then yield it. This is what you do in the isinstance if clause. You'll need to change the return value of the get_url_property callback to actually yield the loaded item.
For the second case, you can return the item directly, so simply yield the item in the else clause.
The following code contains the changes to your example.
Does this solve your problem?
def parse(self, response):
    # ...
    if isinstance(parse_xpath, dict):  # place_property is at a URL
        url = sel.xpath(parse_xpath['url_elem']).extract()
        yield Request(url, callback=self.get_url_property,
                      meta={'loader': bl, 'parse_xpath': parse_xpath,
                            'place_property': place_property})
    else:  # parse_xpath is just an xpath; process normally
        bl.add_xpath(place_property, parse_xpath)
        yield bl.load_item()

def get_url_property(self, response):
    loader = response.meta['loader']
    # ...
    loader.add_value(place_property, sel.xpath(parse_xpath['xpath']))
    yield loader.load_item()
Related to that problem is the question of chaining requests, for which I have noted a similar solution.
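As a rough sketch of that chaining idea (the address_url and phone_url variables and the address/phone field names are hypothetical placeholders; SiteLoader, Place, Request and Selector are taken from the question): the loader travels through request.meta from callback to callback, and only the last callback in the chain loads and yields the item:

def parse(self, response):
    loader = SiteLoader(item=Place(), selector=Selector(response))
    loader.add_value('origin', response.url)
    # first hop: fetch the page that holds the address
    yield Request(address_url, callback=self.parse_address,
                  meta={'loader': loader})

def parse_address(self, response):
    loader = response.meta['loader']
    loader.add_value('address', Selector(response).xpath('//span[@class="address"]/text()').extract())
    # second hop: fetch the page that holds the phone number
    yield Request(phone_url, callback=self.parse_phone,
                  meta={'loader': loader})

def parse_phone(self, response):
    loader = response.meta['loader']
    loader.add_value('phone', Selector(response).xpath('//span[@class="phone"]/text()').extract())
    # end of the chain: only now is the item complete
    yield loader.load_item()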
