I have a URL in the start_urls list, as below:
start_urls = [
    'https://www.ebay.com/sch/tp_peacesports/m.html?_nkw=&_armrs=1&_ipg=&_from='
]

def parse(self, response):
    shop_title = self.getShopTitle(response)
    sell_count = self.getSellCount(response)
    self.shopParser(response, shop_title, sell_count)

def shopParser(self, response, shop_title, sell_count):
    items = EbayItem()
    items['shop_title'] = shop_title
    items['sell_count'] = sell_count
    if sell_count > 0:
        item_links = response.xpath('//ul[@id="ListViewInner"]/li/h3/a/@href').extract()
        for link in item_links:
            items['item_price'] = response.xpath('//span[@itemprop="price"]/text()').extract_first()
            yield items
Now, inside the for loop in shopParser() I have a different link for each item, and I need a response for that link rather than the original response from start_urls. How can I achieve that?
You need to yield requests to the new pages, otherwise you will not get any new HTML. Try something like:
def parse(self, response):
    shop_title = response.meta.get('shop_title', self.getShopTitle(response))
    sell_count = response.meta.get('sell_count', self.getSellCount(response))

    # your item-parsing logic goes here

    if sell_count > 0:
        item_links = response.xpath('//ul[@id="ListViewInner"]/li/h3/a/@href').extract()
        # yield requests to the next pages
        for link in item_links:
            yield scrapy.Request(response.urljoin(link), meta={'shop_title': shop_title, 'sell_count': sell_count})
These new requests will also be parsed by the parse method, or you can set another callback if needed.
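For completeness, a minimal sketch of the same idea with a dedicated callback for the item pages (parse_item is a hypothetical name; EbayItem and the getShopTitle/getSellCount helpers are assumed from the question):

import scrapy

class ShopSpider(scrapy.Spider):  # hypothetical spider name
    name = 'shop'
    start_urls = ['https://www.ebay.com/sch/tp_peacesports/m.html?_nkw=&_armrs=1&_ipg=&_from=']

    def parse(self, response):
        shop_title = self.getShopTitle(response)   # helper from the question
        sell_count = self.getSellCount(response)   # helper from the question
        if sell_count > 0:
            item_links = response.xpath('//ul[@id="ListViewInner"]/li/h3/a/@href').extract()
            for link in item_links:
                # request each item page and carry the shop-level data along in meta
                yield scrapy.Request(
                    response.urljoin(link),
                    callback=self.parse_item,
                    meta={'shop_title': shop_title, 'sell_count': sell_count},
                )

    def parse_item(self, response):
        # this response now belongs to the individual item page
        items = EbayItem()  # item class from the question
        items['shop_title'] = response.meta['shop_title']
        items['sell_count'] = response.meta['sell_count']
        items['item_price'] = response.xpath('//span[@itemprop="price"]/text()').extract_first()
        yield items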
I separated the spider from the crawler. I need to extract some data from a website with Scrapy, using different conditions to get the results. These are the functions in the first file:
def parse(self, response):
    xpath = '//div[@class="proinfor"]//div[@class="prolist_casinforimg"]/a/@href'
    urls = response.xpath(xpath).extract()
    for url in urls:
        url = url.replace("//", "", 1)
        yield scrapy.Request(response.urljoin(url),
                             callback=self.parse_requem)
        yield scrapy.Request(response.urljoin(url),
                             callback=self.parse_obj)

def parse_requem(self, response):
    ...
    yield scrapy.Request(callback=self.parse_item)

def parse_item(self, response):
    parser = BaseParser(response)
    return parser.construct_item()

def parse_obj(self, response):
    parser = BaseParser(response)
    return parser.construct()
And the code in the BaseParser class:
def parse_price(self):
    Price = response.body
    return Price

def parse_ex(self):
    exists = self.xpath('//text()').extract_first()
    return exists

def construct(self):
    item = dict()
    item['ex'] = self.parse_ex()
    return item

def construct_item(self):
    item = dict()
    item['price'] = self.parse_price()
    return item
As you can see, I'm trying to separate the data-retrieval logic, but instead I only get the result of a single callback.
How to separate the parsing logic for a spider?
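One likely cause, not stated explicitly above, is that the second request to the same URL is dropped by Scrapy's duplicate filter, so only one callback ever runs. A minimal sketch of one way around it, assuming both results come from the same detail page: fetch each page once and yield both results from a single hypothetical parse_detail callback (passing dont_filter=True on the second request, as suggested further below in this thread, would be the other option):

import scrapy

class ProductSpider(scrapy.Spider):  # hypothetical spider name
    name = 'products'
    start_urls = ['https://example.com/products']  # placeholder URL

    def parse(self, response):
        xpath = '//div[@class="proinfor"]//div[@class="prolist_casinforimg"]/a/@href'
        for url in response.xpath(xpath).extract():
            url = url.replace("//", "", 1)
            # one request per detail page; both parsers run on the same response
            yield scrapy.Request(response.urljoin(url), callback=self.parse_detail)

    def parse_detail(self, response):
        parser = BaseParser(response)   # BaseParser from the question
        yield parser.construct_item()   # {'price': ...}
        yield parser.construct()        # {'ex': ...}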
I'm analyzing an HTML page which has a two-level menu.
When the top-level menu changes, an AJAX request is sent to fetch the second-level menu items. When both the top and second level are selected, the content is refreshed.
What I need is to send another request and get the submenu response inside scrapy's parse function, so I can iterate over the submenu and build a scrapy.Request per submenu item.
The pseudocode looks like this:
def parse(self, response):
    top_level_menu = response.xpath('//TOP_LEVEL_MENU_XPATH')
    second_level_menu_items = ## HERE I NEED TO SEND A REQUEST AND GET THE RESULT, PARSED INTO A LIST OF ITEM VALUES
    for second_menu_item in second_level_menu_items:
        yield scrapy.Request(response.urljoin(content_request_url + '?top_level=' + top_level_menu + '&second_level_menu=' + second_menu_item), callback=self.parse_content)
How can I do this?
By using the requests library directly, or with some other feature provided by Scrapy?
The recommended approach here is to create another callback (parse_second_level_menus?) to handle the response for the second level menu items and in there, create the requests to the content pages.
Also, you can use the request.meta attribute to pass data between callback methods (more info here).
It could be something along these lines:
def parse(self, response):
    top_level_menu = response.xpath('//TOP_LEVEL_MENU_XPATH').get()

    yield scrapy.Request(
        some_url,
        callback=self.parse_second_level_menus,
        # pass the top_level_menu value to the other callback
        meta={'top_menu': top_level_menu},
    )

def parse_second_level_menus(self, response):
    # read the data passed in the meta by the first callback
    top_level_menu = response.meta.get('top_menu')

    second_level_menu_items = response.xpath('...').getall()
    for second_menu_item in second_level_menu_items:
        url = response.urljoin(content_request_url + '?top_level=' + top_level_menu + '&second_level_menu=' + second_menu_item)
        yield scrapy.Request(
            url,
            callback=self.parse_content
        )

def parse_content(self, response):
    ...
Yet another approach (less recommended in this case) would be using this library: https://github.com/rmax/scrapy-inline-requests
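For reference, a minimal sketch of how that library is typically used, based on its README: the decorator lets the callback yield a Request without a callback and receive its response inline. The spider name, start URL, submenu URL, and XPath placeholders below are assumptions standing in for the pseudocode above:

import scrapy
from inline_requests import inline_requests

class MenuSpider(scrapy.Spider):  # hypothetical spider name
    name = 'menu'
    start_urls = ['https://example.com/menu']              # placeholder URL
    content_request_url = 'https://example.com/content'    # placeholder URL

    @inline_requests
    def parse(self, response):
        top_level_menu = response.xpath('//TOP_LEVEL_MENU_XPATH').get()
        # the yielded request is fetched inline and its response comes back here
        submenu_response = yield scrapy.Request(
            response.urljoin('/ajax/submenu?top=' + top_level_menu))  # hypothetical AJAX endpoint
        for second_menu_item in submenu_response.xpath('//SECOND_LEVEL_MENU_XPATH/text()').getall():
            # fetch each content page inline as well, then yield the finished item
            content_response = yield scrapy.Request(
                self.content_request_url + '?top_level=' + top_level_menu
                + '&second_level_menu=' + second_menu_item)
            yield {
                'menu': second_menu_item,
                'content': content_response.xpath('//CONTENT_XPATH/text()').get(),
            }

Note that requests yielded inside the decorated generator are handled inline rather than dispatched to separate callbacks, which is why the content pages are parsed in the same method here.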
Simply use dont_filter=True for your Request
example:
def start_requests(self):
    return [Request(url=self.base_url, callback=self.parse_city)]

def parse_city(self, response):
    for next_page in response.css('a.category'):
        url = self.base_url + next_page.attrib['href']
        self.log(url)
        yield Request(url=url, callback=self.parse_something_else, dont_filter=True)

def parse_something_else(self, response):
    for next_page in response.css('#contentwrapper > div > div > div.component > table > tbody > tr:nth-child(2) > td > form > table > tbody > tr'):
        url = self.base_url + next_page.attrib['href']
        self.log(url)
        yield Request(url=url, callback=self.parse, dont_filter=True)

def parse(self, response):
    pass
I'm making a web spider with Scrapy and I've run into a problem: I extract a group of HTML data, and it contains the id I need to send an AJAX request. However, when I try to collect the AJAX data together with the other data I got from the HTML, it goes wrong. How can I solve this? Here's my code:
class DoubanSpider(scrapy.Spider):
    name = "douban"
    allowed_domains = ["movie.douban.com"]
    start_urls = ["https://movie.douban.com/review/best"]

    def parse(self, response):
        for review in response.css(".review-item"):
            rev = Review()
            rev['reviewer'] = review.css("a[property='v:reviewer']::text").extract_first()
            rev['rating'] = review.css("span[property='v:rating']::attr(class)").extract_first()
            rev['title'] = review.css(".main-bd>h2>a::text").extract_first()
            number = review.css("::attr(id)").extract_first()
            f = scrapy.Request(url='https://movie.douban.com/j/review/%s/full' % number,
                               callback=self.parse_full_passage)
            rev['comment'] = f
            yield rev

    def parse_full_passage(self, response):
        r = json.loads(response.body_as_unicode())
        html = r['html']
        yield html
You need to fully parse your HTML item first and then pass it in meta to the JSON callback:

yield scrapy.Request(url='https://movie.douban.com/j/review/%s/full' % number,
                     callback=self.parse_full_passage,
                     meta={'rev': rev})
And then in your JSON callback:

def parse_full_passage(self, response):
    rev = response.meta["rev"]
    r = json.loads(response.body_as_unicode())
    .....
    yield rev
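Putting both halves together, a minimal sketch of the reworked spider (assuming the Review item and imports from the question); the key point is that parse no longer yields rev itself, only the request that carries it in meta:

import json
import scrapy

class DoubanSpider(scrapy.Spider):
    name = "douban"
    allowed_domains = ["movie.douban.com"]
    start_urls = ["https://movie.douban.com/review/best"]

    def parse(self, response):
        for review in response.css(".review-item"):
            rev = Review()  # item class from the question
            rev['reviewer'] = review.css("a[property='v:reviewer']::text").extract_first()
            rev['rating'] = review.css("span[property='v:rating']::attr(class)").extract_first()
            rev['title'] = review.css(".main-bd>h2>a::text").extract_first()
            number = review.css("::attr(id)").extract_first()
            # do not yield rev here; the AJAX callback completes it and yields it
            yield scrapy.Request(url='https://movie.douban.com/j/review/%s/full' % number,
                                 callback=self.parse_full_passage,
                                 meta={'rev': rev})

    def parse_full_passage(self, response):
        rev = response.meta['rev']
        rev['comment'] = json.loads(response.body_as_unicode())['html']
        yield rev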
I would try this:
response = scrapy.Request(url='https://movie.douban.com/j/review/%s/full' % number)
jsonresponse = json.loads(response.body_as_unicode())
rev['comment'] = jsonresponse['html']
You might want to extract the relevant parts from the html field, if that is what you need. Alternatively, work with this URL.
I need to apply FormRequest [From here][1]:
#Request = FormRequest.from_response(
#    response,
#    formname='frmSearch',
#    formdata={'classtype': 'of'},
#    #callback=self.parse_links,
#    dont_filter=True,
#
#    )
I need it for the link in start_urls and for all pages that I get from the rules in my CrawlSpider.
class QuokaSpider(CrawlSpider):
    name = 'quoka'
    allowed_domains = ['www.quoka.de']
    start_urls = ['http://www.quoka.de/immobilien/bueros-gewerbeflaechen/']
    curr_page = 0

    rules = (Rule(LinkExtractor(allow=(r'.+'), restrict_xpaths=[u'//li[@class="arr-rgt active"]', ]),
                  follow=True, callback='parse_links'),
             )

    def _url(self, url):
        return 'http://www.quoka.de' + url

    def parse_links(self, response):
        hxs = Selector(response)
        lnks = hxs.xpath('//a[contains(@class, "img-lmtr") and contains(@class, "multi") or contains(@class, "single")]/@href').extract()
        filters = hxs.xpath(u'//div[@class="modal-title"]/text()').extract()
        for fil in filters:
            print "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!" + fil + "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
        for url in lnks:
            request = Request(self._url(url), callback=self.parse_object)
            yield request

    def parse_object(self, response):
        item = AnbieterItem()
        hxs = Selector(response)
        item['Beschreibung'] = hxs.xpath(u'//div[@class="text"]/text()').extract()
        # item['Kleinanzeigen_App'] = '1'
        # item['Preis'] = '1'
        return item
If I try to use start_requests to apply the filter, the spider does not follow the pages from the rules.
How can I solve this problem and apply the filter to the start URL and to the URLs from the rules?
I don't know how to combine CrawlSpider rules with FormRequest, but I'd like to suggest that you replace the CrawlSpider with a generic Spider and create the requests manually.
The Rule in your code only takes care of following the pagination (as far as I can see). To replace it, you could use something like the following code sample:
import scrapy

class TestSpider(scrapy.Spider):

    name = 'quoka'
    start_urls = ['http://www.quoka.de/immobilien/bueros-gewerbeflaechen']

    def parse(self, response):
        request = scrapy.FormRequest.from_response(
            response,
            formname='frmSearch',
            formdata={'classtype': 'of'},
            callback=self.parse_filtered
        )
        print request.body
        yield request

    def parse_filtered(self, response):
        resultList = response.xpath('//div[@id="ResultListData"]/ul/li')
        for resultRow in resultList:
            xpath_Result_Details = './/div[@class="q-col n2"]/a'
            # Check if row has details
            if resultRow.xpath(xpath_Result_Details):
                result_Details = resultRow.xpath(xpath_Result_Details)
                # If YES extract details
                title = result_Details.xpath('./@title').extract()
                href = result_Details.xpath('./@href').extract()[0]
                # Code to request detail pages goes here ...
                print title, href

        # Use this instead of CrawlSpider to follow the pagination links
        xpath_NextPage = '//div[@class="rslt-pagination"]//li[@class="arr-rgt active"]/a'
        if response.xpath(xpath_NextPage):
            nextPage_href = response.xpath(xpath_NextPage + '/@href').extract()[0]
            nextPage_url = 'http://www.quoka.de/immobilien/bueros-gewerbeflaechen' + nextPage_href
            nextPage_num = response.xpath(xpath_NextPage + '/@data-qng-page').extract()[0]

            # request = scrapy.Request(nextPage_url, callback=self.parse_filtered)
            # Create request with formdata ...
            request = scrapy.FormRequest.from_response(
                response,
                formname='frmNaviSearch',
                formdata={'pageno': nextPage_num},
                callback=self.parse_filtered
            )
            yield request
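The answer leaves the "Code to request detail pages goes here" step open. A minimal sketch of how that gap might be filled, reusing parse_object and AnbieterItem from the question (this is an assumption, not part of the original answer); parse_filtered is trimmed here to show only the detail handling:

import scrapy

class TestSpider(scrapy.Spider):
    name = 'quoka'
    start_urls = ['http://www.quoka.de/immobilien/bueros-gewerbeflaechen']

    # ... parse() with the FormRequest stays exactly as in the answer above ...

    def parse_filtered(self, response):
        for resultRow in response.xpath('//div[@id="ResultListData"]/ul/li'):
            details = resultRow.xpath('.//div[@class="q-col n2"]/a')
            if details:
                href = details.xpath('./@href').extract()[0]
                # follow the detail page and build the item in a separate callback
                yield scrapy.Request(response.urljoin(href), callback=self.parse_object)

    def parse_object(self, response):
        # same extraction as in the question's CrawlSpider
        item = AnbieterItem()
        item['Beschreibung'] = response.xpath(u'//div[@class="text"]/text()').extract()
        return item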
My spider callback is on one page, and I need to follow a link and get some data from that page to add to my item, but I need to visit several pages from the parent page without creating more items. How would I go about doing that? From what I can read in the documentation, I can only go in a linear fashion:
parent page > next page > next page
But I need to:
parent page > next page
            > next page
            > next page
You should return Request instances and pass the item around in meta. You would still have to make it linear, building a chain of requests and callbacks; to achieve that, you can pass around a list of the requests needed to complete an item and return the item from the last callback:
def parse_main_page(self, response):
    item = MyItem()
    item['main_url'] = response.url

    url1 = response.xpath('//a[@class="link1"]/@href').extract()[0]
    request1 = scrapy.Request(url1, callback=self.parse_page1)

    url2 = response.xpath('//a[@class="link2"]/@href').extract()[0]
    request2 = scrapy.Request(url2, callback=self.parse_page2)

    url3 = response.xpath('//a[@class="link3"]/@href').extract()[0]
    request3 = scrapy.Request(url3, callback=self.parse_page3)

    request1.meta['item'] = item
    request1.meta['requests'] = [request2, request3]
    return request1

def parse_page1(self, response):
    item = response.meta['item']
    item['data1'] = response.xpath('//div[@class="data1"]/text()').extract()[0]

    # hand the item and the remaining requests over to the next request in the chain
    next_request = response.meta['requests'].pop(0)
    next_request.meta['item'] = item
    next_request.meta['requests'] = response.meta['requests']
    return next_request

def parse_page2(self, response):
    item = response.meta['item']
    item['data2'] = response.xpath('//div[@class="data2"]/text()').extract()[0]

    next_request = response.meta['requests'].pop(0)
    next_request.meta['item'] = item
    next_request.meta['requests'] = response.meta['requests']
    return next_request

def parse_page3(self, response):
    item = response.meta['item']
    item['data3'] = response.xpath('//div[@class="data3"]/text()').extract()[0]
    return item
Also see:
How can i use multiple requests and pass items in between them in scrapy python
Almost Asynchronous Requests for Single Item Processing in Scrapy
Using Scrapy requests, you can perform the extra operations on the next URL inside the scrapy.Request's callback.
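In newer Scrapy versions (1.7+), cb_kwargs can be used instead of meta for the same hand-off; a minimal sketch with hypothetical spider name, URLs, and selectors:

import scrapy

class DetailSpider(scrapy.Spider):  # hypothetical spider name
    name = 'detail'
    start_urls = ['https://example.com/parent']  # placeholder URL

    def parse(self, response):
        item = {'main_url': response.url}
        detail_url = response.xpath('//a[@class="detail"]/@href').get()  # hypothetical selector
        # hand the partially built item to the next callback
        yield scrapy.Request(response.urljoin(detail_url),
                             callback=self.parse_detail,
                             cb_kwargs={'item': item})

    def parse_detail(self, response, item):
        # perform the extra operations on the linked page, then finish the item
        item['extra'] = response.xpath('//div[@class="extra"]/text()').get()  # hypothetical selector
        yield item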