How to send another request and get result in scrapy parse function? - python

I'm analyzing an HTML page which has a two-level menu.
When the top-level menu changes, an AJAX request is sent to fetch the second-level menu items. When both the top-level and second-level menus are selected, the content is refreshed.
What I need is to send another request and get the submenu response inside Scrapy's parse function, so that I can iterate over the submenu and build a scrapy.Request per submenu item.
The pseudo code looks like this:
def parse(self, response):
    top_level_menu = response.xpath('//TOP_LEVEL_MENU_XPATH')
    second_level_menu_items = ## HERE I NEED TO SEND A REQUEST AND GET THE RESULT, PARSED INTO A LIST OF ITEM VALUES
    for second_menu_item in second_level_menu_items:
        yield scrapy.Request(
            response.urljoin(content_request_url + '?top_level=' + top_level_menu + '&second_level_menu=' + second_menu_item),
            callback=self.parse_content)
How can I do this?
Should I use the requests library directly, or is there some other feature provided by Scrapy?

The recommended approach here is to create another callback (parse_second_level_menus?) to handle the response for the second-level menu items and, in there, create the requests to the content pages.
Also, you can use the request.meta attribute to pass data between callback methods (see the Scrapy docs on passing additional data to callback functions).
It could be something along these lines:
def parse(self, response):
    top_level_menu = response.xpath('//TOP_LEVEL_MENU_XPATH').get()

    yield scrapy.Request(
        some_url,
        callback=self.parse_second_level_menus,
        # pass the top_level_menu value to the other callback
        meta={'top_menu': top_level_menu},
    )

def parse_second_level_menus(self, response):
    # read the data passed in the meta by the first callback
    top_level_menu = response.meta.get('top_menu')

    second_level_menu_items = response.xpath('...').getall()
    for second_menu_item in second_level_menu_items:
        url = response.urljoin(content_request_url + '?top_level=' + top_level_menu + '&second_level_menu=' + second_menu_item)
        yield scrapy.Request(
            url,
            callback=self.parse_content
        )

def parse_content(self, response):
    ...
Yet another approach (less recommended in this case) would be to use the scrapy-inline-requests library: https://github.com/rmax/scrapy-inline-requests
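For completeness, here is a minimal sketch of how that inline-requests style could look for this case, assuming the library's @inline_requests decorator; the spider name, the submenu/content URLs and the XPaths are placeholders, since the real AJAX endpoint is not shown in the question:

import scrapy
from inline_requests import inline_requests

class MenuSpider(scrapy.Spider):
    name = 'menu'  # hypothetical spider name

    @inline_requests
    def parse(self, response):
        top_level_menu = response.xpath('//TOP_LEVEL_MENU_XPATH').get()
        # yield the AJAX request and receive its response right here,
        # instead of defining a second callback
        submenu_response = yield scrapy.Request(
            response.urljoin('/ajax/submenu?top_level=' + top_level_menu))
        for second_menu_item in submenu_response.xpath('//SECOND_LEVEL_MENU_XPATH').getall():
            content_response = yield scrapy.Request(
                response.urljoin('/content?top_level=' + top_level_menu
                                 + '&second_level_menu=' + second_menu_item))
            yield {
                'top_level': top_level_menu,
                'second_level': second_menu_item,
                'title': content_response.xpath('//title/text()').get(),
            }

Note that the inline style fetches the content pages one after another inside a single callback, which is simpler to read but gives up some of the concurrency you get from the chained-callback approach above.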

Simply use dont_filter=True for your Request. By default, Scrapy's duplicate filter drops requests to URLs it has already seen; dont_filter=True bypasses that filter.
For example:
def start_requests(self):
    return [Request(url=self.base_url, callback=self.parse_city)]

def parse_city(self, response):
    for next_page in response.css('a.category'):
        url = self.base_url + next_page.attrib['href']
        self.log(url)
        yield Request(url=url, callback=self.parse_something_else, dont_filter=True)

def parse_something_else(self, response):
    for next_page in response.css('#contentwrapper > div > div > div.component > table > tbody > tr:nth-child(2) > td > form > table > tbody > tr'):
        url = self.base_url + next_page.attrib['href']
        self.log(url)
        yield Request(url=url, callback=self.parse, dont_filter=True)

def parse(self, response):
    pass

Related

scrapy accessing inner URLs

I have a URL in the start_urls list, as below:
start_urls = [
    'https://www.ebay.com/sch/tp_peacesports/m.html?_nkw=&_armrs=1&_ipg=&_from='
]

def parse(self, response):
    shop_title = self.getShopTitle(response)
    sell_count = self.getSellCount(response)
    self.shopParser(response, shop_title, sell_count)

def shopParser(self, response, shop_title, sell_count):
    items = EbayItem()
    items['shop_title'] = shop_title
    items['sell_count'] = sell_count
    if sell_count > 0:
        item_links = response.xpath('//ul[@id="ListViewInner"]/li/h3/a/@href').extract()
        for link in item_links:
            items['item_price'] = response.xpath('//span[@itemprop="price"]/text()').extract_first()
            yield items
Now, inside the for loop in shopParser(), I have a different link for each item and I need a different response than the original response from start_urls. How can I achieve that?
You need to issue requests to the new pages, otherwise you will not get any new HTML. Try something like:
def parse(self, response):
    shop_title = response.meta.get('shop_title', self.getShopTitle(response))
    sell_count = response.meta.get('sell_count', self.getSellCount(response))
    # here goes your item parsing logic
    if sell_count > 0:
        item_links = response.xpath('//ul[@id="ListViewInner"]/li/h3/a/@href').extract()
        # yield requests to the next pages
        for link in item_links:
            yield scrapy.Request(response.urljoin(link), meta={'shop_title': shop_title, 'sell_count': sell_count})
These new requests will also be parsed by the parse function, or you can set another callback if needed, as in the sketch below.
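For example, a possible sketch of that second variant with a separate callback; the parse_item name is made up, and it assumes the same price XPath works on the item detail pages as it did in the question's code:

def parse(self, response):
    shop_title = self.getShopTitle(response)
    sell_count = self.getSellCount(response)
    if sell_count > 0:
        item_links = response.xpath('//ul[@id="ListViewInner"]/li/h3/a/@href').extract()
        for link in item_links:
            # request each item page and carry the shop-level data along in meta
            yield scrapy.Request(response.urljoin(link),
                                 callback=self.parse_item,
                                 meta={'shop_title': shop_title, 'sell_count': sell_count})

def parse_item(self, response):
    # one EbayItem per item page, combining the shop-level fields with the item price
    items = EbayItem()
    items['shop_title'] = response.meta['shop_title']
    items['sell_count'] = response.meta['sell_count']
    items['item_price'] = response.xpath('//span[@itemprop="price"]/text()').extract_first()
    yield items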

scrapy crawl a set of links that might contain next pages

I want to:
1. Extract the links from a certain page.
2. For each link, get some contents from that link, and the contents of the 'next pages' of that link.
3. Export it as a JSON file (not important, I think, as far as my problem is concerned).
Currently my spider is like this:
class mySpider(scrapy.Spider):
    ...
    def parse(self, response):
        for url in someurls:
            yield scrapy.Request(url=url, callback=self.parse_next)

    def parse_next(self, response):
        for selector in someselectors:
            yield {'contents': ...,
                   ...}
        nextPage = obtainNextPage()
        if nextPage:
            yield scrapy.Request(url=nextPage, callback=self.parse_next)
The problem is that for a set of links the spider processed, it could only reach the 'next page' of the last link in that set; I observed this through Selenium + ChromeDriver. For example, if I have 10 links (No. 1 to No. 10), my spider only gets the next pages for the No. 10 link. I don't know whether this is caused by some structural problem in my spider. Below is the full code:
import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['baidu.com']
    start_urls = ['http://tieba.baidu.com']
    main_url = 'http://tieba.baidu.com/f?kw=%E5%B4%94%E6%B0%B8%E5%85%83&ie=utf-8'
    username = ""
    password = ""

    def __init__(self, username=username, password=password):
        #options = webdriver.ChromeOptions()
        #options.add_argument('headless')
        #options.add_argument('window-size=1200x600')
        self.driver = webdriver.Chrome()  #chrome_options=options)
        self.username = username
        self.password = password

    # checked
    def logIn(self):
        elem = self.driver.find_element_by_css_selector('#com_userbar > ul > li.u_login > div > a')
        elem.click()
        wait = WebDriverWait(self.driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '#TANGRAM__PSP_10__footerULoginBtn')))
        elem = self.driver.find_element_by_css_selector('#TANGRAM__PSP_10__footerULoginBtn')
        elem.click()
        elem = self.driver.find_element_by_css_selector('#TANGRAM__PSP_10__userName')
        elem.send_keys(self.username)
        elem = self.driver.find_element_by_css_selector('#TANGRAM__PSP_10__password')
        elem.send_keys(self.password)
        self.driver.find_element_by_css_selector('#TANGRAM__PSP_10__submit').click()

    # basic checked
    def parse(self, response):
        self.driver.get(response.url)
        self.logIn()
        # wait for hand input verify code
        time.sleep(15)
        self.driver.get('http://tieba.baidu.com/f?kw=%E5%B4%94%E6%B0%B8%E5%85%83&ie=utf-8')
        for url in self.driver.find_elements_by_css_selector('a.j_th_tit')[:2]:
            #new_url = response.urljoin(url)
            new_url = url.get_attribute("href")
            yield scrapy.Request(url=new_url, callback=self.parse_next)

    # checked
    def pageScroll(self, url):
        self.driver.get(url)
        SCROLL_PAUSE_TIME = 0.5
        SCROLL_LENGTH = 1200
        page_height = int(self.driver.execute_script("return document.body.scrollHeight"))
        scrollPosition = 0
        while scrollPosition < page_height:
            scrollPosition = scrollPosition + SCROLL_LENGTH
            self.driver.execute_script("window.scrollTo(0, " + str(scrollPosition) + ");")
            time.sleep(SCROLL_PAUSE_TIME)
        time.sleep(1.2)

    def parse_next(self, response):
        self.log('I visited ' + response.url)
        self.pageScroll(response.url)
        for sel in self.driver.find_elements_by_css_selector('div.l_post.j_l_post.l_post_bright'):
            name = sel.find_element_by_css_selector('.d_name').text
            try:
                content = sel.find_element_by_css_selector('.j_d_post_content').text
            except:
                content = ''
            try:
                reply = sel.find_element_by_css_selector('ul.j_lzl_m_w').text
            except:
                reply = ''
            yield {'name': name, 'content': content, 'reply': reply}

        # follow to next page
        next_sel = self.driver.find_element_by_link_text("下一页")
        next_url_name = next_sel.text
        if next_sel and next_url_name == '下一页':
            next_url = next_sel.get_attribute('href')
            yield scrapy.Request(url=next_url, callback=self.parse_next)
Thanks for your help; any suggestions regarding my code above are welcome.
Regarding scraping content from one page, storing it, and allowing the spider to continue the crawl to scrape and store items on subsequent pages: you should configure your items.py file with the item names and pass the item through each scrapy.Request using meta.
You should check out https://github.com/scrapy/scrapy/issues/1138
To illustrate how this works, it goes something like this...
1. First, we set up the items.py file with all the items to be scraped across the pages.
# items.py
import scrapy

class ScrapyProjectItem(scrapy.Item):
    page_one_item = scrapy.Field()
    page_two_item = scrapy.Field()
    page_three_item = scrapy.Field()
2. Then import the items.py item class into your Scrapy spider.
from scrapyproject.items import ScrapyProjectItem
3. Then in your scraper, on each page iteration that has content you want, initialize the items.py class and pass the item using 'meta' to the next request.
# spider.py
def parse(self, response):
    # Initializing the item class
    item = ScrapyProjectItem()
    # Itemizing the... item
    item['page_one_item'] = response.css("etcetc::").extract()  # set desired attribute
    # Here we pass the item to the next concurrent request
    for url in someurls:  # There's a million ways to skin a cat; I don't know your exact use case.
        yield scrapy.Request(response.urljoin(url),
                             callback=self.parse_next, meta={'item': item})

def parse_next(self, response):
    # We load the item passed in the meta by the previous request
    item = response.meta['item']
    # We itemize
    item['page_two_item'] = response.css("etcetc::").extract()
    # We pass the meta again to the next request
    for url in someurls:
        yield scrapy.Request(response.urljoin(url),
                             callback=self.parse_again, meta={'item': item})

def parse_again(self, response):
    # We load the item passed in the meta by the previous request
    item = response.meta['item']
    # We itemize
    item['page_three_item'] = response.css("etcetc::").extract()
    # We pass the meta again to the next request
    for url in someurls:
        yield scrapy.Request(response.urljoin(url),
                             callback=self.parse_again, meta={'item': item})
    # At the end of each iteration of the crawl loop we can yield the result
    yield item
As for the problem of the crawler only reaching the last link, I would like to have more info instead of guessing what the problem could be. In your parse_next, you should add a print(response.url) to see whether the pages are being reached at all. I'm sorry if I didn't understand your problem and wasted everyone's time.
EDIT
I think I understand your issue better now ... You have a list of URLs, and each URL has its own set of URLs, yes?
In your code, obtainNextPage() might be the issue. In the past, when encountering this type of case, I have had to use some XPath and/or regex magic to properly obtain the next pages. I'm not sure what obtainNextPage is doing, but have you thought of parsing the content and using a selector to find the next page? For example:
class mySpider(scrapy.Spider):
    ...
    def parse(self, response):
        for url in someurls:
            yield scrapy.Request(url=url, callback=self.parse_next)

    def parse_next(self, response):
        for selector in someselectors:
            yield {'contents': ...,
                   ...}
        #nextPage = obtainNextPage()
        next_page = response.xpath('//path/to/nextbutton/orPage').extract_first()
        if next_page is not None:
            yield scrapy.Request(response.urljoin(next_page),
                                 callback=self.parse_next)
You should still add that print(response.url) to see whether the URL being requested is built correctly; it might be a urljoin issue.
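As a quick, self-contained illustration of what urljoin does (the base URL here is hypothetical; Scrapy's Response.urljoin behaves like urllib's urljoin applied to response.url):

from urllib.parse import urljoin

base = 'http://example.com/forum/thread?page=1'   # stands in for response.url
print(urljoin(base, 'thread?page=2'))      # http://example.com/forum/thread?page=2
print(urljoin(base, '/other/path'))        # http://example.com/other/path
print(urljoin(base, 'http://other.com/x')) # http://other.com/x  (absolute URLs pass through)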

Scrapy returns repeated out of order results when using a for loop, but not when going link by link

I am attempting to use Scrapy to crawl a site. Here is my code:
import scrapy

class ArticleSpider(scrapy.Spider):
    name = "article"
    start_urls = [
        'http://www.irna.ir/en/services/161',
    ]

    def parse(self, response):
        for linknum in range(1, 15):
            next_article = response.xpath('//*[@id="NewsImageVerticalItems"]/div[%d]/div[2]/h3/a/@href' % linknum).extract_first()
            next_article = response.urljoin(next_article)
            yield scrapy.Request(next_article)

        for text in response.xpath('//*[@id="ctl00_ctl00_ContentPlaceHolder_ContentPlaceHolder_NewsContent4_BodyLabel"]'):
            yield {
                'article': text.xpath('./text()').extract()
            }

        for tag in response.xpath('//*[@id="ctl00_ctl00_ContentPlaceHolder_ContentPlaceHolder_NewsContent4_bodytext"]'):
            yield {
                'tag1': tag.xpath('./div[3]/p[1]/a/text()').extract(),
                'tag2': tag.xpath('./div[3]/p[2]/a/text()').extract(),
                'tag3': tag.xpath('./div[3]/p[3]/a/text()').extract(),
                'tag4': tag.xpath('./div[3]/p[4]/a/text()').extract()
            }

        yield response.follow('http://www.irna.ir/en/services/161', callback=self.parse)
But this returns a weird mixture of repeated items in the JSON, out of order and often skipping links: https://pastebin.com/LVkjHrRt
However, when I set linknum to a single number, the code works fine.
Why is iterating changing my results?
As @TarunLalwani already stated, your current approach is not right. Basically you should:
1. In the parse method, extract links to all articles on the page and yield requests to scrape them, with a callback named e.g. parse_article.
2. Still in the parse method, check that the button for loading more articles is present and, if so, yield a request for a URL of the pattern http://www.irna.ir/en/services/161/pageN. (This can be found in the browser's developer tools under XHR requests on the Network tab.)
3. Define a parse_article method where you extract the article text and tags from the details page and finally yield them as an item.
Below is the final spider:
import scrapy

class IrnaSpider(scrapy.Spider):
    name = 'irna'
    base_url = 'http://www.irna.ir/en/services/161'

    def start_requests(self):
        yield scrapy.Request(self.base_url, meta={'page_number': 1})

    def parse(self, response):
        for article_url in response.css('.DataListContainer h3 a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(article_url), callback=self.parse_article)

        page_number = response.meta['page_number'] + 1
        if response.css('#MoreButton'):
            yield scrapy.Request('{}/page{}'.format(self.base_url, page_number),
                                 callback=self.parse, meta={'page_number': page_number})

    def parse_article(self, response):
        yield {
            'text': ' '.join(response.xpath('//p[@id="ctl00_ctl00_ContentPlaceHolder_ContentPlaceHolder_NewsContent4_BodyLabel"]/text()').extract()),
            'tags': [tag.strip() for tag in response.xpath('//div[@class="Tags"]/p/a/text()').extract() if tag.strip()]
        }

Python Scrapy, return from child page to carry on scraping

My spider function starts on a page, and I need to follow a link and get some data from that page to add to my item, but I need to visit various pages from the parent page without creating more items. How would I go about doing that? From what I can read in the documentation, I can only go in a linear fashion:
parent page > next page > next page
But I need to:
parent page > next page
> next page
> next page
You should return Request instances and pass the item around in meta. You would have to do it in a linear fashion, building a chain of requests and callbacks. To achieve this, you can pass around a list of the requests needed to complete the item and return the item from the last callback:
def parse_main_page(self, response):
    item = MyItem()
    item['main_url'] = response.url

    url1 = response.xpath('//a[@class="link1"]/@href').extract()[0]
    request1 = scrapy.Request(url1, callback=self.parse_page1)

    url2 = response.xpath('//a[@class="link2"]/@href').extract()[0]
    request2 = scrapy.Request(url2, callback=self.parse_page2)

    url3 = response.xpath('//a[@class="link3"]/@href').extract()[0]
    request3 = scrapy.Request(url3, callback=self.parse_page3)

    request1.meta['item'] = item
    request1.meta['requests'] = [request2, request3]
    return request1

def parse_page1(self, response):
    item = response.meta['item']
    item['data1'] = response.xpath('//div[@class="data1"]/text()').extract()[0]
    # hand the item and the remaining requests over to the next request in the chain
    next_request = response.meta['requests'].pop(0)
    next_request.meta['item'] = item
    next_request.meta['requests'] = response.meta['requests']
    return next_request

def parse_page2(self, response):
    item = response.meta['item']
    item['data2'] = response.xpath('//div[@class="data2"]/text()').extract()[0]
    next_request = response.meta['requests'].pop(0)
    next_request.meta['item'] = item
    next_request.meta['requests'] = response.meta['requests']
    return next_request

def parse_page3(self, response):
    item = response.meta['item']
    item['data3'] = response.xpath('//div[@class="data3"]/text()').extract()[0]
    return item
Also see:
How can i use multiple requests and pass items in between them in scrapy python
Almost Asynchronous Requests for Single Item Processing in Scrapy
Using Scrapy Requests, you can perform extra operations on the next URL inside the scrapy.Request's callback, as in the sketch below.
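To sketch that idea concretely, here is a minimal example using cb_kwargs, which newer Scrapy versions offer as an alternative to meta for passing data to callbacks; the URL XPaths, field names, and parse_child name are placeholders, not taken from the question:

def parse(self, response):
    item = MyItem()
    item['main_url'] = response.url
    child_url = response.xpath('//a[@class="details"]/@href').get()
    # carry the partially built item into the child page's callback
    yield scrapy.Request(response.urljoin(child_url),
                         callback=self.parse_child,
                         cb_kwargs={'item': item})

def parse_child(self, response, item):
    # perform the extra operation on the child page, then return the finished item
    item['detail'] = response.xpath('//div[@class="details"]/text()').get()
    yield item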

How do I merge results from target page to current page in scrapy?

I need an example in Scrapy of how to get a link from one page, follow the link, get more info from the linked page, and merge it back with some data from the first page.
Partially fill your item on the first page, and then put it in your request's meta. When the callback for the next page is called, it can take the partially filled item, put more data into it, and then return it.
An example from the Scrapy documentation:
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item
More information on passing the meta data and request objects is specifically described in this part of the documentation:
http://readthedocs.org/docs/scrapy/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions
This question is also related to: Scrapy: Follow link to get additional Item data?
A brief illustration of the Scrapy documentation code:
def start_requests(self):
    yield scrapy.Request("http://www.example.com/main_page.html", callback=self.parse_page1)

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url  ## extracts http://www.example.com/main_page.html
    request = scrapy.Request("http://www.example.com/some_page.html", callback=self.parse_page2)
    request.meta['my_meta_item'] = item  ## passing the item in the meta dictionary
    ## alternatively you can do it as below
    ## request = scrapy.Request("http://www.example.com/some_page.html", meta={'my_meta_item': item}, callback=self.parse_page2)
    return request

def parse_page2(self, response):
    item = response.meta['my_meta_item']
    item['other_url'] = response.url  ## extracts http://www.example.com/some_page.html
    return item
