Can I yield both requests and items from a single parse() method for one page?
I want to extract some data from page A, store it in a database, and also extract links to follow (the latter can be done with a rule in CrawlSpider).
Let me call the pages linked from the A pages the B pages. I can write another parse_item() to extract data from the B pages, but I also want to extract links from the B pages. Is a rule the only way to extract links? And how do I deal with duplicate URLs in Scrapy?
Yes, you can yield both requests and items. From what I've seen:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    base_url = response.url
    links = hxs.select(self.toc_xpath)
    for index, link in enumerate(links):
        href, text = link.select('@href').extract(), link.select('text()').extract()
        yield Request(urljoin(base_url, href[0]), callback=self.parse2)
    for item in self.parse2(response):
        yield item
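On the duplicate-URL part of the question: Scrapy's scheduler filters out requests to URLs it has already seen (via the default RFPDupeFilter), so yielding the same link from several pages is normally harmless. If you actually need to re-request a URL, pass dont_filter=True. A minimal sketch (parse2 and the XPath here are placeholders):
def parse(self, response):
    for href in response.xpath('//a/@href').extract():
        # duplicate URLs are silently dropped by the scheduler's dupe filter
        yield Request(response.urljoin(href), callback=self.parse2)
    # force a re-crawl of a URL that was already seen
    yield Request(response.url, callback=self.parse2, dont_filter=True)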
I'm not 100% sure I understand your question, but the code below requests pages from a start URL using BaseSpider, scans the starting page for hrefs, then loops over each link calling parse_url. Everything matched in parse_url is sent to your item pipeline.
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    urls = hxs.select('//a[contains(@href, "content")]/@href').extract()  ## only grab URLs with "content" in them
    for i in urls:
        yield Request(urlparse.urljoin(response.url, i[1:]), callback=self.parse_url)

def parse_url(self, response):
    hxs = HtmlXPathSelector(response)
    item = ZipgrabberItem()
    item['zip'] = hxs.select("//div[contains(@class,'odd')]/text()").extract()  ## this grabs the values
    return item
From Steven Almeroth on Google Groups:
You are right, you can yield Requests and return a list of Items, but that is not what you are attempting. You are attempting to yield a list of Items instead of returning them. And since you are already using parse() as a generator function, you cannot have both yield and return together. But you can have many yields.
Try this:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    base_url = response.url
    links = hxs.select(self.toc_xpath)
    for index, link in enumerate(links):
        href, text = link.select('@href').extract(), link.select('text()').extract()
        yield Request(urljoin(base_url, href[0]), callback=self.parse2)
    for item in self.parse2(response):
        yield item
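For the CrawlSpider part of the question, a rule and a callback can be combined: the rule follows links from the A pages, while the callback both yields items and schedules extra requests found on the B pages. This is only a rough sketch for a recent Scrapy version; the spider name, URLs and selectors are made up:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class AtoBSpider(CrawlSpider):
    name = 'a_to_b'  # hypothetical
    start_urls = ['http://example.com/a-pages/']  # hypothetical

    rules = (
        # follow links from A pages to B pages and hand them to parse_item
        Rule(LinkExtractor(allow=r'/b-pages/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # yield scraped data from the B page...
        yield {'url': response.url, 'title': response.xpath('//title/text()').extract_first()}
        # ...and also yield extra requests for links the rule does not cover
        for href in response.css('a.more::attr(href)').extract():  # hypothetical selector
            yield scrapy.Request(response.urljoin(href), callback=self.parse_item)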
Related
I'm creating a web scraper with Scrapy and Python. The page I'm scraping has each item structured as a card. I can scrape some info from these cards (name, location), but I also want info that is reached by clicking on a card > new page > clicking a button on the new page that opens a form > scraping a value from the form. How should I structure the parse function? Do I need nested loops or separate functions?
class StackSpider(Spider):
    name = "stack"
    allowed_domains = ["example.com"]
    start_urls = ["example.com/page"]

    def parse(self, response):
        for page_url in response.css('a[class ~= search-card]::attr(href)').extract():
            page_url = response.urljoin(page_url)
            yield scrapy.Request(url=page_url, callback=self.parse)

        for vc in response.css('div#vc-profile.container').extract():
            item = StackItem()
            item['name'] = vc.xpath('//*[@id="vc-profile"]/div/div[2]/div[1]/div[1]/h1/text()').extract()
            item['firm'] = vc.expath('//*[@id="vc-profile"]/div/div[2]/div[1]/div[2]/h2/text()[1]').extract()
            item['pos'] = vc.expath('//*[@id="vc-profile"]/div/div[2]/div[1]/div[2]/h2/text()[2]').extract()
            em = vc.xpath('/*[@id="vc-profile"]/div/div[1]/div[2]/div[2]/div/div[1]/button').extract()
            item['email'] = em.xpath('//*[@id="email"]/value').extract()
            yield item
The scraper is crawling, but it outputs nothing.
The best approach is to create an item object on the first page, scrape the needed data, and save it to the item. Then make a request to the new URL (card > new page > click the button to open the form) and pass the same item along in the request's meta. Yielding the item from that callback will fix the issue.
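A rough sketch of that idea; the selectors, field names and URLs below are placeholders, not taken from the actual site:
import scrapy

class CardSpider(scrapy.Spider):
    name = 'cards'  # hypothetical
    start_urls = ['http://example.com/cards']  # hypothetical

    def parse(self, response):
        for card in response.css('div.card'):  # hypothetical card selector
            item = {
                'name': card.css('h2::text').extract_first(),
                'location': card.css('.location::text').extract_first(),
            }
            detail_url = response.urljoin(card.css('a::attr(href)').extract_first())
            # carry the partially filled item to the detail page via meta
            yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'item': item})

    def parse_detail(self, response):
        item = response.meta['item']
        # scrape the value that only appears on the detail page / form
        item['form_value'] = response.css('#email::attr(value)').extract_first()  # hypothetical
        yield item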
You should probably split the scraper into one parse method and one parse_item method.
Your parse method goes through the page and yields the URLs of the items for which you want details. The parse_item method receives the response for each of those URLs and extracts the details of the specific item.
It's difficult to say exactly what it will look like without knowing the website, but it will probably be more or less like this:
class StackSpider(Spider):
    name = "stack"
    allowed_domains = ["example.com"]
    start_urls = ["example.com/page"]

    def parse(self, response):
        for page_url in response.css('a[class ~= search-card]::attr(href)').extract():
            page_url = response.urljoin(page_url)
            yield scrapy.Request(url=page_url, callback=self.parse_item)

    def parse_item(self, response):
        item = StackItem()
        item['name'] = response.xpath('//*[@id="vc-profile"]/div/div[2]/div[1]/div[1]/h1/text()').extract()
        item['firm'] = response.xpath('//*[@id="vc-profile"]/div/div[2]/div[1]/div[2]/h2/text()[1]').extract()
        item['pos'] = response.xpath('//*[@id="vc-profile"]/div/div[2]/div[1]/div[2]/h2/text()[2]').extract()
        em = response.xpath('//*[@id="vc-profile"]/div/div[1]/div[2]/div[2]/div/div[1]/button')
        item['email'] = em.xpath('.//*[@id="email"]/@value').extract()
        yield item
I am attempting to use Scrapy to crawl a site. Here is my code:
import scrapy

class ArticleSpider(scrapy.Spider):
    name = "article"
    start_urls = [
        'http://www.irna.ir/en/services/161',
    ]

    def parse(self, response):
        for linknum in range(1, 15):
            next_article = response.xpath('//*[@id="NewsImageVerticalItems"]/div[%d]/div[2]/h3/a/@href' % linknum).extract_first()
            next_article = response.urljoin(next_article)
            yield scrapy.Request(next_article)

        for text in response.xpath('//*[@id="ctl00_ctl00_ContentPlaceHolder_ContentPlaceHolder_NewsContent4_BodyLabel"]'):
            yield {
                'article': text.xpath('./text()').extract()
            }

        for tag in response.xpath('//*[@id="ctl00_ctl00_ContentPlaceHolder_ContentPlaceHolder_NewsContent4_bodytext"]'):
            yield {
                'tag1': tag.xpath('./div[3]/p[1]/a/text()').extract(),
                'tag2': tag.xpath('./div[3]/p[2]/a/text()').extract(),
                'tag3': tag.xpath('./div[3]/p[3]/a/text()').extract(),
                'tag4': tag.xpath('./div[3]/p[4]/a/text()').extract()
            }

        yield response.follow('http://www.irna.ir/en/services/161', callback=self.parse)
But this returns a weird mixture of repeated items in the JSON, out of order and often with links skipped: https://pastebin.com/LVkjHrRt
However, when I set linknum to a single number, the code works fine.
Why does iterating change my results?
As @TarunLalwani already stated, your current approach is not right. Basically you should:
1. In the parse method, extract the links to all articles on a page and yield requests for scraping them with a callback named e.g. parse_article.
2. Still in the parse method, check whether the button for loading more articles is present and, if so, yield a request for a URL of the pattern http://www.irna.ir/en/services/161/pageN. (This can be found in the browser's developer tools under XHR requests on the Network tab.)
3. Define a parse_article method where you extract the article text and tags from the details page and finally yield them as an item.
Below is the final spider:
import scrapy

class IrnaSpider(scrapy.Spider):
    name = 'irna'
    base_url = 'http://www.irna.ir/en/services/161'

    def start_requests(self):
        yield scrapy.Request(self.base_url, meta={'page_number': 1})

    def parse(self, response):
        for article_url in response.css('.DataListContainer h3 a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(article_url), callback=self.parse_article)

        page_number = response.meta['page_number'] + 1
        if response.css('#MoreButton'):
            yield scrapy.Request('{}/page{}'.format(self.base_url, page_number),
                                 callback=self.parse, meta={'page_number': page_number})

    def parse_article(self, response):
        yield {
            'text': ' '.join(response.xpath('//p[@id="ctl00_ctl00_ContentPlaceHolder_ContentPlaceHolder_NewsContent4_BodyLabel"]/text()').extract()),
            'tags': [tag.strip() for tag in response.xpath('//div[@class="Tags"]/p/a/text()').extract() if tag.strip()]
        }
I am using the below code to crawl through multiple links on a page and grab a list of data from each corresponding link:
import scrapy

class testSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://www.website.com']

    def parse(self, response):
        urls = response.css('div.subject_wrapper > a::attr(href)').extract()
        for url in urls:
            url = response.urljoin(url)
            yield scrapy.Request(url=url, callback=self.getData)

    def getData(self, response):
        data = {'data': response.css('strong.data::text').extract()}
        yield data
It works fine, but as it's returning a list of data for each link, when I output to CSV it looks like the following:
"dalegribel,Chad,Ninoovcov,dalegribel,Gotenks,sillydog22"
"kaylachic,jmargerum,kaylachic"
"Kempodancer,doctordbrew,Gotenks,dalegribel"
"Gotenks,dalegribel,jmargerum"
...
Is there any simple/efficient way of outputting the data as a single list of rows without any duplicates (the same data can appear on multiple pages), similar to the following?
dalegribel
Chad
Ninoovcov
Gotenks
...
I have tried using an array and then looping over each element to get an output, but I get an error saying yield only supports 'Request, BaseItem, dict or None'. Also, as I would be running this over approximately 10k entries, I'm not sure if storing the data in an array would slow the scrape down too much. Thanks.
I'm not sure whether it can be done with Scrapy's built-in methods, but the Python way would be to keep a set of unique elements, check each scraped value against it, and yield only the unseen ones:
class testSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://www.website.com']
    unique_data = set()

    def parse(self, response):
        urls = response.css('div.subject_wrapper > a::attr(href)').extract()
        for url in urls:
            url = response.urljoin(url)
            yield scrapy.Request(url=url, callback=self.getData)

    def getData(self, response):
        data_list = response.css('strong.data::text').extract()
        for elem in data_list:
            if elem and (elem not in self.unique_data):
                self.unique_data.add(elem)
                yield {'data': elem}
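An alternative, if you prefer to keep the spider itself simple, is to de-duplicate in an item pipeline (it still has to be enabled via ITEM_PIPELINES in the settings). A rough sketch, assuming each item is a {'data': value} dict as above:
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):
    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        if item['data'] in self.seen:
            # dropped items never reach the feed exporter, so the CSV stays unique
            raise DropItem('duplicate value: %s' % item['data'])
        self.seen.add(item['data'])
        return item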
I want to extract all the product URLs from the page "http://www.shopclues.com/diwali-mega-mall/hot-electronics-sale-fs/audio-systems-fs.html" using Scrapy in Python. Below is the function I'm using to do this:
def parse(self, response):
    print("hello")
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//div[@id="pagination_contents"]')
    items = []
    i = 3
    for site in sites:
        item = DmozItem()
        item['link'] = site.select('div[2]/div[' + str(i) + ']/a/@href').extract()
        i = int(i) + 1
        print i
        items.append(item)
    return items
The XPath of each product div is: //div[@id="pagination_contents"]/div[2]/div['+str(i)+']/a/@href
But I'm getting only one link, not all of the products' URLs.
I think your problem is that hxs.select('//div[@id="pagination_contents"]') only returns one result, and then you only do one iteration in the loop.
You can select all following <div> elements that contain an <a>, and loop over those:
sites = hxs.select('//div[@id="pagination_contents"]/div[2]/div[a]')
for site in sites:
    ## This loop will run 33 times in my test.
    ## Access each link:
    item['link'] = site.select('./a[2]/@href').extract()
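Putting that together, the whole parse() would look roughly like this (keeping the original DmozItem and 'link' field; the XPaths are the ones from the snippet above, not re-verified against the live page):
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # one selector per product <div> that contains an <a>
    sites = hxs.select('//div[@id="pagination_contents"]/div[2]/div[a]')
    items = []
    for site in sites:
        item = DmozItem()
        item['link'] = site.select('./a[2]/@href').extract()
        items.append(item)
    return items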
I need an example in Scrapy of how to get a link from one page, follow that link, get more info from the linked page, and merge it back with some data from the first page.
Partially fill your item on the first page, and then put it in your request's meta. When the callback for the next page is called, it can take the partially filled item, put more data into it, and then return it.
An example from the Scrapy documentation:
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item
More information on passing meta data and request objects is described in this part of the documentation:
http://readthedocs.org/docs/scrapy/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions
This question is also related to: Scrapy: Follow link to get additional Item data?
A brief illustration of the Scrapy documentation code:
def start_requests(self):
    yield scrapy.Request("http://www.example.com/main_page.html", callback=self.parse_page1)

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url  ## extracts http://www.example.com/main_page.html
    request = scrapy.Request("http://www.example.com/some_page.html", callback=self.parse_page2)
    request.meta['my_meta_item'] = item  ## pass the item in the meta dictionary
    ## alternatively you can do it like this:
    ## request = scrapy.Request("http://www.example.com/some_page.html", meta={'my_meta_item': item}, callback=self.parse_page2)
    return request

def parse_page2(self, response):
    item = response.meta['my_meta_item']
    item['other_url'] = response.url  ## extracts http://www.example.com/some_page.html
    return item
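As a side note, if you are on Scrapy 1.7 or newer, cb_kwargs can be used instead of meta to pass the item; each entry becomes a keyword argument of the callback. A minimal sketch of the same flow under that assumption:
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    yield scrapy.Request("http://www.example.com/some_page.html",
                         callback=self.parse_page2,
                         cb_kwargs={'item': item})  # passed as a keyword argument below

def parse_page2(self, response, item):
    item['other_url'] = response.url
    yield item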