Scrapy recursive parse: what am I doing wrong here? - python

I am trying to scrape the list view of an ASPX website. Every page has the same structure, so I am using recursive spider calls (parse requests the next page with itself as the callback).
Error: ERROR: Spider must return Request, BaseItem or None, got 'list'
I am not sure what this error means.
I am probably doing something very basic wrong, but I can't identify it. Please point me in the right direction. Thanks.
My code:
name = "XYZscraper"
allowed_domains = ["xyz.com"]
def __init__(self):
    self.start_urls = [
        "xyz.com with aspx list view",
    ]
def parse(self, response):
    sel = Selector(response)
    if sel.xpath('//table/tr/td/form/table/tr'):
        print "xpath is present"
        elements = sel.xpath('//table/tr/td/form/table/tr')
    else:
        print "xpath not present "
        print " going in with fallback xpath"
        elements = sel.xpath('//table/tr')
    counter = 1
    nextPageAvailable = False  # flag: is a next-page link available?
    base_url = "xyz.com/"
    try:
        items = response.meta['item']
    except Exception as e:
        items = []
    no_of_row = len(elements)
    for each_row in elements:
        # the first two rows and the last two rows do not hold data
        # the first and last rows link to the previous and next page; the first row is used for navigation
        if counter == 1:
            if each_row.xpath('td/a[1]/text()').extract()[0] == "Previous":
                if each_row.xpath('td/a[2]/text()'):
                    if each_row.xpath('td/a[2]/text()').extract()[0] == "Next":
                        nextPageAvailable = True
            elif each_row.xpath('td/a[1]/text()').extract()[0] == "Next":
                nextPageAvailable = True
        if counter > 2:
            if counter < (no_of_row - 1):
                item = myItem()
                item['title'] = each_row.xpath('td/div/a/span/text()').extract()[0].encode('ascii', 'ignore')  # Title
                items.append(item)
        counter += 1
    if nextPageAvailable:
        yield FormRequest.from_response(
            response,
            meta={'item': items},
            formnumber=1,
            formdata={
                '__EVENTTARGET': 'ctl00$ctl10$EventsDG$ctl01$ctl01',  # request that navigates to the next page of the table
            },
            callback=self.parse  # recursive call: the page layout stays the same, only the data is refreshed
        )
    else:
        # end of the list reached; calling the next function to pop the items. This does not work!
        self.popItems(response)
        # does not work
        # Error: python < 3.3 does not allow return with argument inside the generator
        # return item
def popItems(self, response):
    print "i am here"
    items = ()
    baseitem = response.meta['item']
    items = baseitem
    return items

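parse is a generator, so the list returned by self.popItems(response) is thrown away unless its items are yielded one by one. Maybe you mean something like this:
else:
    for item in self.popItems(response):
        yield item
Or the shorter version (Python 3.3 and later):
else:
    yield from self.popItems(response)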
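The difference between calling a helper and yielding from it can be seen without Scrapy. The following is a minimal sketch in plain Python 3; pop_items and the three parse_* functions are hypothetical stand-ins for the methods in the question, and the string standing in for the next-page request is a placeholder.

```python
def pop_items(collected):
    """Hypothetical stand-in for popItems: returns the accumulated list."""
    return list(collected)


def parse_discarding(collected, has_next_page=False):
    # Mirrors the question: the helper is called, but nothing is yielded,
    # so its return value is thrown away.
    if has_next_page:
        yield "request for the next page"
    else:
        pop_items(collected)


def parse_with_loop(collected, has_next_page=False):
    # Works on any Python version: yield the items one by one.
    if has_next_page:
        yield "request for the next page"
    else:
        for item in pop_items(collected):
            yield item


def parse_with_yield_from(collected, has_next_page=False):
    # Python 3.3+ shorthand for the loop above.
    if has_next_page:
        yield "request for the next page"
    else:
        yield from pop_items(collected)


items = [{"title": "a"}, {"title": "b"}]
print(list(parse_discarding(items)))       # nothing reaches the caller
print(list(parse_with_loop(items)))        # both items are yielded
print(list(parse_with_yield_from(items)))  # same result as the loop
```

Each yielded element is a single item, which is the shape Scrapy expects from a callback, rather than one list holding all of them.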

Related

Using Scrapy to add up numbers across several pages

I am using Scrapy to go from page to page and collect numbers that are on a page. The pages are all similar in the way that I can use the same function to parse them. Simple enough, but I don't need each individual number on the pages, or even each number total from each page. I just need the total sum of all the numbers across all the pages I am visiting. The Scrapy documentation talks about using cb_kwargs to pass arguments, and this is what I have so far.
def parse(self, response):
    self.logger.info('A response from %s just arrived!', response.url)
    numbers_page = response.css('.numbers + a')
    yield from response.follow(numbers_page, callback=self.parse_numbers, cb_kwargs=dict(total_count=0))
def parse_numbers(self, response, total_count):
    yield {
        total_count = total_count,
    }
    def extract_with_css(query):
        return response.css(query).get(default='').strip()
    for number in response.css('div.numbers'):
        yield {
            'number': extract_with_css('span::text'),
            total_count = total_count + int(number.replace(',',''))
        }
    next_page = response.css('li.next a::attr("href")').get()
    if next_page is not None:
        request = scrapy.Request(next_page,
                                 callback=self.parse_numbers,
                                 cb_kwargs=dict(total_count=total_count))
        yield request
I cut out things irrelevant to the question to make my code more clear. I feel like using a for loop to add up the numbers is okay, but how do I get that total value to the next page (if there is one) and then export it with the rest of the data at the end?
I don't see the need for passing data from one request to another.
The most obvious way I can think of to go about it would be as follows:
1. You collect the count of the page and yield the result as an item.
2. You create an item pipeline that keeps track of the total count.
3. When the scraping is finished, you have the total count in your item pipeline and you write it to a file, a database, and so on.
Your spider would look something like this:
def parse(self, response):
    self.logger.info('A response from %s just arrived!', response.url)
    numbers_page = response.css('.numbers + a')
    # response.follow returns a single Request, so it cannot be used with yield from;
    # follow_all (Scrapy 2.0+) accepts a SelectorList and returns an iterable of requests
    yield from response.follow_all(numbers_page, callback=self.parse_numbers)
def parse_numbers(self, response):
    numbers = response.css('div.numbers')
    list_numbers = numbers.css('span::text').getall()
    # strip thousands separators such as "1,200" before converting
    page_sum = sum(int(number.replace(',', '')) for number in list_numbers if number.strip())
    yield {'page_sum': page_sum}
    next_page = response.css('li.next a::attr("href")').get()
    if next_page:
        # response.follow also resolves relative links
        yield response.follow(next_page, callback=self.parse_numbers)
For the item pipeline you can use logic like this:
import json
class TotalCountPipeline(object):
    def __init__(self):
        # initialize the variable that keeps track of the total count
        self.total_count = 0
    def process_item(self, item, spider):
        # every page_sum yielded by your spider is added to the current total count
        page_sum = item['page_sum']
        self.total_count += page_sum
        return item
    def close_spider(self, spider):
        # write the final count to a file
        output = json.dumps(self.total_count)
        with open('test_count_file.jl', 'w') as output_file:
            output_file.write(output + '\n')
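To see how the per-page sums accumulate, the pipeline logic can be exercised by hand, outside of a crawl. This is only a sketch under assumptions: the output_path constructor argument, the page_sum helper, and the sample page data are hypothetical and not part of the answer above, and spider=None stands in for the spider object Scrapy would normally pass.

```python
import json
import os
import tempfile


class TotalCountPipeline:
    def __init__(self, output_path):
        self.total_count = 0
        self.output_path = output_path  # hypothetical parameter, for easy testing

    def process_item(self, item, spider):
        # Add this page's sum to the running total and pass the item on.
        self.total_count += item['page_sum']
        return item

    def close_spider(self, spider):
        # Write the final total once, when the crawl ends.
        with open(self.output_path, 'w') as output_file:
            output_file.write(json.dumps(self.total_count) + '\n')


def page_sum(texts):
    """Sum number strings, ignoring blanks and thousands separators."""
    return sum(int(text.replace(',', '')) for text in texts if text.strip())


# Hypothetical text extracted from two pages.
pages = [["1,200", "35"], ["7", " ", "1,000"]]

output_path = os.path.join(tempfile.mkdtemp(), "total.jl")
pipeline = TotalCountPipeline(output_path)
for texts in pages:
    pipeline.process_item({'page_sum': page_sum(texts)}, spider=None)
pipeline.close_spider(spider=None)

with open(output_path) as output_file:
    written_total = json.load(output_file)
print(written_total)
```

Because the total lives on the pipeline instance, no state has to travel from one request to the next.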

Scrapy: how to populate hierarchical items with multiple requests

This is an extension of "Multiple nested request with scrapy". I am asking because the solution presented there has flaws:
1. It eliminates asynchrony, which heavily reduces scraping efficiency.
2. If an exception occurs while processing the "stack" of links, no item is yielded.
3. What if there is a huge number of child items?
To deal with (1) I considered this:
import threading
class CatLoader(ItemLoader):
    def __init__(self, item=None, selector=None, response=None, parent=None, **context):
        super(CatLoader, self).__init__(item, selector, response, parent, **context)
        self.lock = threading.Lock()
        self.counter = 0
    def dec_counter(self):
        # decrement the number of child requests still pending
        with self.lock:
            self.counter -= 1
Then in the parser:
if len(urls) == 0:
    self.logger.warning('Cat without items, url: ' + response.url)
    item = cl.load_item()
    yield item
cl.counter = len(urls)
for url in urls:
    rq = Request(url, self.parse_item)
    rq.meta['loader'] = cl
    yield rq
And in parse_item() I can do:
def parse_item(self, response):
    l = response.meta['loader']
    l.dec_counter()
    if l.counter == 0:
        yield l.load_item()
But to deal with (2), I need to do this in every callback:
def parse_item(self, response):
    try:
        l = response.meta['loader']
    finally:
        l.dec_counter()
        if l.counter == 0:
            yield l.load_item()
I don't consider this an elegant solution. Could anyone suggest a better one? Also, I intend to insert the items into a DB rather than write JSON output, so maybe it would be better to create the item with a promise and write a pipeline that checks the children to see whether the promise is fulfilled (when the item has been inserted into the DB), or something like that?
UPD: Hierarchical items: category -> article -> images. All are to be saved in different tables with proper relations. So:
1) Articles must be inserted into their table AFTER the category.
2) An article must know the ID of its category to form the relation.
The same applies to the image records.

Scrapy Spider only generates one item per loop

Since I added another request at the end of the for loop to test a link, the spider only generates items for the first index of the loop.
def parse_product_page(self, response):
    products = response.xpath('//div[@class="content"]//div[@class="tov-rows"]//div[@class="t-row"]')
    for x, product in enumerate(products):  # ERROR: only gives an item for the first product
        product_loader = VerbraucherweltProdukt()
        product_loader['name'] = product.xpath(
            '//div[@class="t-center"]//div[@class="t-name"]/text()').extract_first()
        request = scrapy.Request(non_ref_link, callback=self.test_link, errback=self.test_link)
        request.meta['item'] = product_loader
        yield request
It all worked before, when I just yielded the product item, but now that the item is returned in the callback I don't know where my problem lies.
The callback is just:
def test_link(self, response):
    item = response.meta['item']
    item['link_fehlerhaft'] = response.status
    yield item
Here is the full code, in case the problem is somewhere else:
http://pastebin.com/tgL38zpD
Here's your culprit:
link = product.xpath('//div[@class="t-right"]//a/@href').extract_first()
You're not anchoring your XPath to the product node you have: an expression that starts with // searches the whole document, so every product gets the same first match. To fix it, simply prepend . to your XPath to make the current node the root:
link = product.xpath('.//div[@class="t-right"]//a/@href').extract_first()
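The effect of the leading dot can be illustrated with the standard library alone. The sketch below uses xml.etree.ElementTree as a stand-in for Scrapy's selectors, which is an assumption made so that it runs without Scrapy; the markup is invented for illustration. One difference to keep in mind: ElementTree refuses absolute paths on an element, whereas Scrapy silently searches the whole document, so the document-wide search is simulated here by querying the root.

```python
import xml.etree.ElementTree as ET

# Invented markup that loosely follows the class names in the question.
MARKUP = """
<div class="content">
  <div class="t-row">
    <div class="t-name">First</div>
    <div class="t-right"><a href="/first">link</a></div>
  </div>
  <div class="t-row">
    <div class="t-name">Second</div>
    <div class="t-right"><a href="/second">link</a></div>
  </div>
</div>
"""

root = ET.fromstring(MARKUP)
rows = root.findall(".//div[@class='t-row']")

# Searching from the document root always returns the first match,
# no matter which row is being processed (the behaviour seen in the question).
document_wide = [
    root.find(".//div[@class='t-right']//a").get("href") for _ in rows
]

# Searching relative to each row returns that row's own link.
per_row = [
    row.find(".//div[@class='t-right']//a").get("href") for row in rows
]

print(document_wide)
print(per_row)
```

The same reasoning applies to the name field in the question, whose XPath also starts with // and would need the leading dot as well.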

Scrapy: How do I prevent a yield request based on a conditional item value?

I'm parsing a list of URLs, and I want to skip saving the item produced by a URL when one of its values meets some condition. My code is something like this:
start_urls = ["www.rootpage.com"]
def parse(self,response):
    item = CreatedItem()
    url_list = response.xpath('somepath').extract()
    for url in url_list:
        request = scrapy.Request(item['url'],callback=self.parse_article)
        request.meta['item'] = item
        yield request
def parse_article(self,response):
    item = response.meta['item']
    item['parameterA'] = response.xpath('somepath').extract()
    yield item
Now, when item['parameterA'] meets a condition, I don't want the item to be yielded (so that nothing is saved for that URL). I tried adding a conditional like:
if item['parameterA'] == 0:
    continue
else:
    yield item
but, as expected, it does not work, because Scrapy continues the loop before the request has been performed.
From what I understand, you should make the decision inside the parse_article method:
def parse_article(self,response):
    item = response.meta['item']
    item['parameterA'] = response.xpath('somepath').extract_first()
    if item['parameterA'] != "0":
        yield item
Note the use of extract_first() and the quotes around 0.
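The reason for the quotes is that extracted values are strings, and in Python the string "0" never compares equal to the integer 0. A small sketch in plain Python, where parse_article is a hypothetical stand-in that takes the already extracted value instead of a Scrapy response:

```python
def parse_article(item, extracted_value):
    """Hypothetical stand-in for the callback: yield the item only if it passes."""
    item = dict(item, parameterA=extracted_value)
    # Extracted text is a string, so compare against "0", not 0.
    if item["parameterA"] != "0":
        yield item


kept = list(parse_article({"url": "a"}, "5"))
dropped = list(parse_article({"url": "b"}, "0"))

print(kept)       # the item is yielded
print(dropped)    # nothing is yielded, so nothing would be saved
print("0" == 0)   # a string never equals an int in Python
```

A callback that yields nothing is valid; the request is still made, but no item reaches the pipelines or the feed export.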

Passing a argument to a callback function [duplicate]

def parse(self, response):
    for sel in response.xpath('//tbody/tr'):
        item = HeroItem()
        item['hclass'] = response.request.url.split("/")[8].split('-')[-1]
        item['server'] = response.request.url.split('/')[2].split('.')[0]
        item['hardcore'] = len(response.request.url.split("/")[8].split('-')) == 3
        item['seasonal'] = response.request.url.split("/")[6] == 'season'
        item['rank'] = sel.xpath('td[@class="cell-Rank"]/text()').extract()[0].strip()
        item['battle_tag'] = sel.xpath('td[@class="cell-BattleTag"]//a/text()').extract()[1].strip()
        item['grift'] = sel.xpath('td[@class="cell-RiftLevel"]/text()').extract()[0].strip()
        item['time'] = sel.xpath('td[@class="cell-RiftTime"]/text()').extract()[0].strip()
        item['date'] = sel.xpath('td[@class="cell-RiftTime"]/text()').extract()[0].strip()
        url = 'https://' + item['server'] + '.battle.net/' + sel.xpath('td[@class="cell-BattleTag"]//a/@href').extract()[0].strip()
        yield Request(url, callback=self.parse_profile)
def parse_profile(self, response):
    sel = Selector(response)
    item = HeroItem()
    item['weapon'] = sel.xpath('//li[@class="slot-mainHand"]/a[@class="slot-link"]/@href').extract()[0].split('/')[4]
    return item
I'm scraping a whole table in the main parse method and have taken several fields from that table. One of these fields is a URL, and I want to follow it to get a whole new set of fields. How can I pass my already created item object to the callback function so that the final item keeps all the fields?
As shown in the code above, I can save either the fields found behind the URL (the code as it stands) or only the ones in the table (by simply writing yield item), but I can't yield a single object with all the fields together.
I have tried this, but obviously it doesn't work:
yield Request(url, callback=self.parse_profile(item))
def parse_profile(self, response, item):
    sel = Selector(response)
    item['weapon'] = sel.xpath('//li[@class="slot-mainHand"]/a[@class="slot-link"]/@href').extract()[0].split('/')[4]
    return item
This is what you'd use the meta keyword for.
def parse(self, response):
    for sel in response.xpath('//tbody/tr'):
        item = HeroItem()
        # Item assignment here
        url = 'https://' + item['server'] + '.battle.net/' + sel.xpath('td[@class="cell-BattleTag"]//a/@href').extract()[0].strip()
        yield Request(url, callback=self.parse_profile, meta={'hero_item': item})
def parse_profile(self, response):
    item = response.meta.get('hero_item')
    item['weapon'] = response.xpath('//li[@class="slot-mainHand"]/a[@class="slot-link"]/@href').extract()[0].split('/')[4]
    yield item
Also note, doing sel = Selector(response) is a waste of resources and differs from what you did earlier, so I changed it. It's automatically mapped in the response as response.selector, which also has the convenience shortcut of response.xpath.
Here's a better way to pass args to callback function:
def parse(self, response):
    request = scrapy.Request('http://www.example.com/index.html',
                             callback=self.parse_page2,
                             cb_kwargs=dict(main_url=response.url))
    request.cb_kwargs['foo'] = 'bar'  # add more arguments for the callback
    yield request
def parse_page2(self, response, main_url, foo):
    yield dict(
        main_url=main_url,
        other_url=response.url,
        foo=foo,
    )
source: https://docs.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-callback-arguments
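The mechanics behind cb_kwargs can be sketched without Scrapy: the request only stores a callable plus a dict, and the callable is invoked later with the response as its first argument and the dict expanded into keyword arguments. FakeRequest and dispatch below are hypothetical stand-ins written for illustration, not Scrapy classes, and the dict used as a response is a placeholder.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class FakeRequest:
    """Hypothetical stand-in for a request; it only stores what it is given."""
    url: str
    callback: Callable[..., Any]
    cb_kwargs: Dict[str, Any] = field(default_factory=dict)


def dispatch(request, response):
    """Invoke the stored callback later: response first, then the keyword arguments."""
    return request.callback(response, **request.cb_kwargs)


def parse_profile(response, item):
    # Merge the field found on the second page into the item from the first page.
    return dict(item, weapon=response["weapon"])


item = {"rank": "1", "battle_tag": "Hero"}

# Correct: pass the function itself and let the framework call it later.
request = FakeRequest("https://example.com/profile",
                      callback=parse_profile,
                      cb_kwargs={"item": item})
result = dispatch(request, {"weapon": "sword"})
print(result)

# Mistake from the question: parse_profile(item) calls the function right away,
# before any response exists, so Python raises a TypeError.
call_error = None
try:
    FakeRequest("https://example.com/profile", callback=parse_profile(item))
except TypeError as exc:
    call_error = exc
print(type(call_error).__name__)
```

The meta dictionary shown earlier serves the same purpose; cb_kwargs simply delivers the values as named parameters of the callback.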
I had a similar issue with passing extra arguments in Tkinter and found this solution to work (from http://infohost.nmt.edu/tcc/help/pubs/tkinter/web/extra-args.html), adapted to your problem:
def parse(self, response):
    item = HeroItem()
    [...]
    # Scrapy calls the callback with the new response as its first argument,
    # so only the extra values are bound as default argument values
    def handler(response, item=item):
        """ passing as default argument values """
        return self.parse_profile(response, item)
    yield Request(url, callback=handler)
@peduDev
I tried your approach, but it failed because of an unexpected keyword.
scrapy_req = scrapy.Request(url=url,
                            callback=self.parseDetailPage,
                            cb_kwargs=dict(participant_id=nParticipantId))
def parseDetailPage(self, response, participant_id ):
    .. Some code here..
    yield MyParseResult (
        .. some code here ..
        participant_id = participant_id
    )
Error reported:
, cb_kwargs=dict(participant_id=nParticipantId)
TypeError: __init__() got an unexpected keyword argument 'cb_kwargs'
Any idea what caused the unexpected keyword argument, other than perhaps a too old Scrapy version?
Yep. I verified my own suggestion, and after an upgrade it all worked, as suspected (cb_kwargs was added in Scrapy 1.7).
sudo pip install --upgrade scrapy
