I'm having a similar problem that the individual below had. I'm trying to pass an item using the meta attribute. I'm seeing the correct number of items outputted, but they are duplicates of a single item. Could someone help? I'm guessing mby the response to the previous individual's post that this should be an obvious fix.
https://github.com/scrapy/scrapy/issues/1257
def parse(self, response):
# some treatment
# a loop
request = scrapy.Request(url=<calculated_url>, callback=parseChapter)
request.meta['item'] = # a dictionary containing some data of the just parsed page
yield request
def parseChapter(self, response):
# some treatment
# a loop
request = scrapy.Request(url=<calculated_url>, callback=parseCategory)
request.meta['item'] = # a dictionary containing some data of the just parsed page
# print request.meta['item'] is good and different in every iteration
yield request
def parseCategory(self, response):
# print response.meta['item'] is not good because it displays the same value many times
# for every new call of parseChapter, meta['item'] received is always the same
# some treatment
Most likely, you modifying the item at each iteration of the for loop instead of creating a new one.
As a consequence all request are being sent with the same value. i.e. the last value of the item variable.
def parseChapter(self, response):
# some treatment
# a loop
request = scrapy.Request(url=<calculated_url>, callback=parseCategory)
request.meta['item'] = my_item_dict.copy()
# print request.meta['item'] is good and different in every iteration
yield request
Related
I am using Scrapy to go from page to page and collect numbers that are on a page. The pages are all similar in the way that I can use the same function to parse them. Simple enough, but I don't need each individual number on the pages, or even each number total from each page. I just need the total sum of all the numbers across all the pages I am visiting. The Scrapy documentation talks about using cb_kwargs to pass arguments, and this is what I have so far.
def parse(self, response):
self.logger.info('A response from %s just arrived!', response.url)
numbers_page = response.css('.numbers + a')
yield from response.follow(numbers_page, callback=self.parse_numbers, cb_kwargs=dict(total_count=0))
def parse_numbers(self, response, total_count):
yield {
total_count = total_count,
}
def extract_with_css(query):
return response.css(query).get(default='').strip()
for number in response.css('div.numbers'):
yield {
'number': extract_with_css('span::text'),
total_count = total_count + int(number.replace(',',''))
}
next_page = response.css('li.next a::attr("href")').get()
if next_page is not None:
request = scrapy.Request(next_page,
callback=self.parse_numbers,
cb_kwargs=dict(total_count=total_count))
yield request
I cut out things irrelevant to the question to make my code more clear. I feel like using a for loop to add up the numbers is okay, but how do I get that total value to the next page (if there is one) and then export it with the rest of the data at the end?
I don't see the need for passing data from one request to another.
The most obvious way I can think of to go about it would be as follows:
You collect the count of the page and yield the result as an item
You create an item pipeline that keeps track of the total count
When the scraping is finished, you have the total count in your item pipeline and you write it to a file, database, ...
Your spider would look something like this:
def parse(self, response):
self.logger.info('A response from %s just arrived!', response.url)
numbers_page = response.css('.numbers + a')
yield from response.follow(numbers_page, callback=self.parse_numbers)
def parse_numbers(self, response):
numbers = response.css('div.numbers')
list_numbers = numbers.css('span::text').getall()
page_sum = sum(int(number) for number in list_numbers if number.strip())
yield {'page_sum': page_sum}
next_page = response.css('li.next a::attr("href")').get()
if next_page:
request = scrapy.Request(next_page,
callback=self.parse_numbers)
yield request
For the item pipeline you can use logic like this:
class TotalCountPipeline(object):
def __init__(self):
# initialize the variable that keeps track of the total count
self.total_count = 0
def process_item(self, item, spider):
# every number yielded from your spider in page_sum will be added to the current total count
page_sum = item['page_sum']
self.total_count += page_sum
return item
def close_spider(self, spider):
# write the final count to a file
output = json.dumps(self.total_count)
with open('test_count_file.jl', 'w') as output_file:
output_file.write(output + '\n')
I'm trying to make a spider that goes through a certain amount of start urls and if the resulting page is the right one I yield another request. The problem is that if I try anyway of not yielding a second request the spider will stop directly. There are no problems if I yield the second request.
Here is the relevant code:
def start_requests(self):
urls = ['https://www.hltv.org' + player for player in self.hashPlayers]
print(len(urls))
for url in urls:
return [scrapy.Request(url=url, callback=self.parse)]
def parse(self, response):
result = response.xpath("//div[#class = 'playerTeam']//a/#href").get()
if result is None:
result = response.xpath("//span[contains(concat(' ',normalize-space(#class),' '),' profile-player-stat-value bold ')]//a/#href").get()
if result is not None:
yield scrapy.Request(
url = "https://www.hltv.org" + result,
callback = self.parseTeam
)
So I want a way to make the spider to continue after I call the parse function and don't yield a request.
def start_requests(self):
urls = ['https://www.hltv.org' + player for player in self.hashPlayers]
print(len(urls))
for url in urls:
return [scrapy.Request(url=url, callback=self.parse)]
If you use return, the function is terminated, the loop won't iterate to the next value and a single request will be sent to the Scrapy Engine. Replace it with yield so it returns a generator.
Since I added another request at the end of the for loop, to test a link, the Spyder only generates Items for the first index of the loop.
def parse_product_page(self, response):
products = response.xpath('//div[#class="content"]//div[#class="tov-rows"]//div[#class="t-row"]')
for x, product in enumerate(products): #ERROR: Just gives an item for the first product
product_loader = VerbraucherweltProdukt()
product_loader['name'] = product.xpath(
'//div[#class="t-center"]//div[#class="t-name"]/text()').extract_first()
request = scrapy.Request(non_ref_link,callback=self.test_link, errback=self.test_link)
request.meta['item'] = product_loader
yield request
It all worked before when i just yielded the product item, but since the item is returned in the callback, i dont know where my problem lays.
The callback is just:
def test_link(self, response):
item = response.meta['item']
item['link_fehlerhaft'] = response.status
yield item
Also the full code, maybe the problem is anywhere else:
http://pastebin.com/tgL38zpD
Here's your culprit:
link = product.xpath('//div[#class="t-right"]//a/#href').extract_first()
You're not grounding your recursive xpath to the product node you have. To fix it simply pre append . to your xpath to indicate current node as root:
link = product.xpath('.//div[#class="t-right"]//a/#href').extract_first()
I'm parsing a list of urls, and I want to avoid saving some url resulted item on the condition of some its value. My code is something like this:
start_urls = [www.rootpage.com]
def parse(self,response):
item = CreatedItem()
url_list = response.xpath('somepath').extract()
for url in url_list:
request = scrapy.Request(item['url'],callback=self.parse_article)
request.meta['item'] = item
yield request
def parse_article(self,response):
item = response.meta['item']
item['parameterA'] = response.xpath('somepath').extract()
yield item
Now I want that in case item['parameterA'] follows a condition, there is no need to "yield request" (so that no saving for this url occurs). I tried add a conditional like:
if item['parameterA'] == 0:
continue
else:
yield item
but as expected it does not work, because scrapy continues the loop even before the request is performed.
From what I understand, you should make the decision inside the parse_article method:
def parse_article(self,response):
item = response.meta['item']
item['parameterA'] = response.xpath('somepath').extract_first()
if item['parameterA'] != "0":
yield item
Note the use of the extract_first() and the quotes around 0.
I have the following url to begin with: http://somedomain.mytestsite.com/?offset=0. I'd like to loop through this url by incrementing offset parameter, let's say by 100 each time. Each time I recieve response I need to check some condition to decide whether I should run next iteration. For example:
class SomeSpider(BaseSpider):
name = 'somespider'
offset = 0
items = list()
def start_requests(self):
return [scrapy.Request("http://somedomain.mytestsite.com/?offset="+str(self.offset), callback=self.request_iterator)]
def request_iterator(self, response):
body = response.body
#let's say we get json as response data
data = json.loads(body)
#check if page still have data to process
if data["matches"]:
self.items.extend(data["matches"])
self.offset += 100
return self.start_requests()
else:
#process collected data in items list
return self.do_something_with_items()
This works, but I can't help feeling something wrong with this code. Maybe I should use some scrapy's rules?
Following things could be improved:
1) dont keep items as spider attribute, you will consume extremely high amount of memory with bigger inputs, use python generators instead. When you use generators you can yield items and requests from one spider callback without any trouble.
2) start_requests are used at spider startup, there seems to be little need to overwrite them in your code, if you rename your method to parse (default method name executed as callback to start_requests) code will be more readable
# we should process at least one item otherwise data["matches"] will be empty.
start_urls = ["http://somedomain.mytestsite.com/?offset="+1]
def parse(self, response):
body = response.body
#let's say we get json as response data
data = json.loads(body)
#check if page still have data to process
if data["matches"]:
for x in data["matches"]:
yield self.process_your_item(x)
self.offset += 100
yield self.next_request()
else:
#process collected data in items list
for x self.do_something_with_items():
yield x
def next_request(self):
return scrapy.Request("http://somedomain.mytestsite.com/?offset="+str(self.offset))
probably even better version of your callback would be:
def parse(self, response):
body = response.body
#let's say we get json as response data
data = json.loads(body)
#check if page still have data to process
if not data["matches"]:
self.logger.info("processing done")
return
for x in data["matches"]:
yield self.process_your_item(x)
self.offset += 100
yield self.next_request()