I'm trying to make a spider that goes through a certain number of start URLs, and if the resulting page is the right one I yield another request. The problem is that whenever the parse callback does not yield a second request, the spider stops right away. There are no problems if I do yield the second request.
Here is the relevant code:
def start_requests(self):
    urls = ['https://www.hltv.org' + player for player in self.hashPlayers]
    print(len(urls))
    for url in urls:
        return [scrapy.Request(url=url, callback=self.parse)]

def parse(self, response):
    result = response.xpath("//div[@class = 'playerTeam']//a/@href").get()
    if result is None:
        result = response.xpath("//span[contains(concat(' ', normalize-space(@class), ' '), ' profile-player-stat-value bold ')]//a/@href").get()
    if result is not None:
        yield scrapy.Request(
            url = "https://www.hltv.org" + result,
            callback = self.parseTeam
        )
So I want a way to make the spider continue with the remaining start URLs even when the parse callback doesn't yield a request.
def start_requests(self):
    urls = ['https://www.hltv.org' + player for player in self.hashPlayers]
    print(len(urls))
    for url in urls:
        return [scrapy.Request(url=url, callback=self.parse)]
If you use return, the function terminates, the loop never iterates to the next value, and only a single request is sent to the Scrapy engine. Replace it with yield so the method returns a generator.
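A minimal sketch of the fix (keeping the hashPlayers attribute and the parse callback from the question):

def start_requests(self):
    # build one URL per player hash (hashPlayers comes from the original spider)
    urls = ['https://www.hltv.org' + player for player in self.hashPlayers]
    for url in urls:
        # yield instead of return: every URL gets scheduled, and the spider
        # keeps going even if a later callback yields nothing
        yield scrapy.Request(url=url, callback=self.parse)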
I am using Scrapy to go from page to page and collect numbers that are on each page. The pages are all similar enough that I can use the same function to parse them. Simple enough, but I don't need each individual number from the pages, or even the total from each page; I just need the sum of all the numbers across all the pages I visit. The Scrapy documentation talks about using cb_kwargs to pass arguments, and this is what I have so far:
def parse(self, response):
    self.logger.info('A response from %s just arrived!', response.url)
    numbers_page = response.css('.numbers + a')
    yield from response.follow_all(numbers_page, callback=self.parse_numbers, cb_kwargs=dict(total_count=0))

def parse_numbers(self, response, total_count):
    yield {
        'total_count': total_count,
    }

    def extract_with_css(query):
        return response.css(query).get(default='').strip()

    for number in response.css('div.numbers'):
        yield {
            'number': extract_with_css('span::text'),
            'total_count': total_count + int(extract_with_css('span::text').replace(',', '')),
        }

    next_page = response.css('li.next a::attr("href")').get()
    if next_page is not None:
        request = scrapy.Request(next_page,
                                 callback=self.parse_numbers,
                                 cb_kwargs=dict(total_count=total_count))
        yield request
I cut out things irrelevant to the question to make my code clearer. I feel like using a for loop to add up the numbers is fine, but how do I carry that running total to the next page (if there is one) and then export it with the rest of the data at the end?
I don't see the need for passing data from one request to another.
The most obvious way I can think of to go about it would be as follows:
1) You collect the count of the page and yield the result as an item.
2) You create an item pipeline that keeps track of the total count.
3) When the scraping is finished, you have the total count in your item pipeline and you write it to a file, database, ...
Your spider would look something like this:
def parse(self, response):
    self.logger.info('A response from %s just arrived!', response.url)
    numbers_page = response.css('.numbers + a')
    yield from response.follow_all(numbers_page, callback=self.parse_numbers)

def parse_numbers(self, response):
    numbers = response.css('div.numbers')
    list_numbers = numbers.css('span::text').getall()
    page_sum = sum(int(number) for number in list_numbers if number.strip())
    yield {'page_sum': page_sum}

    next_page = response.css('li.next a::attr("href")').get()
    if next_page:
        request = scrapy.Request(next_page,
                                 callback=self.parse_numbers)
        yield request
For the item pipeline you can use logic like this:
import json

class TotalCountPipeline(object):
    def __init__(self):
        # initialize the variable that keeps track of the total count
        self.total_count = 0

    def process_item(self, item, spider):
        # every page_sum yielded from your spider is added to the running total
        page_sum = item['page_sum']
        self.total_count += page_sum
        return item

    def close_spider(self, spider):
        # write the final count to a file
        output = json.dumps(self.total_count)
        with open('test_count_file.jl', 'w') as output_file:
            output_file.write(output + '\n')
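For the pipeline to run, it also has to be enabled in the project's settings.py. A minimal sketch, assuming the class lives in a module called myproject.pipelines (the module path is an assumption; adjust it to your project):

# settings.py -- 'myproject.pipelines' is hypothetical, use your project's module path
ITEM_PIPELINES = {
    'myproject.pipelines.TotalCountPipeline': 300,  # lower numbers run earlier
}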
How to get the results of a Scrapy request into a usable variable.
def parse_node(self, response, node):
    yield Request('LINK', callback=self.parse_listing)

def parse_listing(self, response):
    for agent in (response.xpath('//node[@id="Agent"]/text()').extract_first() or "").split('^'):
        HERE = Request('LINK', callback=self.parse_agent)
        print(HERE)

def parse_agent(self, response):
    yield response.xpath('//node[@id="Email"]/text()').extract_first()
I am trying to get the results from HERE = Request('LINK', callback=self.parse_agent) and print them. parse_agent picks up an email, but I would like to get that value back and use it inside parse_listing.
Based on your comments under the first answer, I think what you really need is the scrapy-inline-requests package (see the example in its documentation). Your code would look something like this:
from inline_requests import inline_requests

def parse_node(self, response, node):
    yield Request('LINK', callback=self.parse_listing)

@inline_requests
def parse_listing(self, response):
    for agent in (response.xpath('//node[@id="Agent"]/text()').extract_first() or "").split('^'):
        agent_response = yield Request('LINK')
        email = agent_response.xpath('//node[@id="Email"]/text()').extract_first()
def parse_listing(self, response):
    for agent in (response.xpath('//node[@id="Agent"]/text()').extract_first() or "").split('^'):
        HERE = scrapy.Request('LINK', callback=self.parse_agent)
        # yield the request so the engine schedules it; parse_agent will
        # print or log the result when the response comes back
        yield HERE

def parse_agent(self, response):
    print(response)  # response is the downloaded page from HERE
    email = response.xpath('//node[@id="Email"]/text()').extract_first()
    print(email)  # logging is better
    # import logging
    # logging.log(logging.INFO, "info from page")
    yield email  # yield to whatever consumes the output
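If you don't want an extra dependency, another common pattern (my suggestion, not part of the original answers, and it needs Scrapy 1.7+ for cb_kwargs) is to pass the data you already have along with the request and yield the finished record from parse_agent. A minimal sketch, keeping 'LINK' and the node names as placeholders from the question:

def parse_listing(self, response):
    for agent in (response.xpath('//node[@id="Agent"]/text()').extract_first() or "").split('^'):
        # cb_kwargs carries the agent value to the next callback
        yield scrapy.Request('LINK', callback=self.parse_agent, cb_kwargs={'agent': agent})

def parse_agent(self, response, agent):
    email = response.xpath('//node[@id="Email"]/text()').extract_first()
    # the listing data and the email end up in a single item
    yield {'agent': agent, 'email': email}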
I'm parsing a list of URLs, and I want to avoid saving the item for a URL when one of its values meets a certain condition. My code is something like this:
start_urls = ['http://www.rootpage.com']

def parse(self, response):
    url_list = response.xpath('somepath').extract()
    for url in url_list:
        item = CreatedItem()
        item['url'] = url
        request = scrapy.Request(item['url'], callback=self.parse_article)
        request.meta['item'] = item
        yield request

def parse_article(self, response):
    item = response.meta['item']
    item['parameterA'] = response.xpath('somepath').extract()
    yield item
Now I want that, in case item['parameterA'] meets a certain condition, the item is not yielded (so that nothing gets saved for this URL). I tried adding a conditional like:
if item['parameterA'] == 0:
    continue
else:
    yield item
but as expected it does not work, because Scrapy continues the loop even before the request is performed.
From what I understand, you should make the decision inside the parse_article method:
def parse_article(self, response):
    item = response.meta['item']
    item['parameterA'] = response.xpath('somepath').extract_first()
    if item['parameterA'] != "0":
        yield item
Note the use of the extract_first() and the quotes around 0.
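As a side note (my addition, not part of the original answer): in Scrapy 1.7+ the partially built item can also be passed with cb_kwargs instead of request.meta. A minimal sketch of the same idea, reusing CreatedItem and the 'somepath' placeholders from the question:

def parse(self, response):
    for url in response.xpath('somepath').extract():
        item = CreatedItem()
        item['url'] = url
        # cb_kwargs delivers the item as a keyword argument to the callback
        yield scrapy.Request(url, callback=self.parse_article, cb_kwargs={'item': item})

def parse_article(self, response, item):
    item['parameterA'] = response.xpath('somepath').extract_first()
    if item['parameterA'] != "0":  # drop items whose value is "0"
        yield item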
I have the following URL to begin with: http://somedomain.mytestsite.com/?offset=0. I'd like to loop through this URL by incrementing the offset parameter, let's say by 100 each time. Each time I receive a response I need to check some condition to decide whether I should run the next iteration. For example:
class SomeSpider(BaseSpider):
    name = 'somespider'
    offset = 0
    items = list()

    def start_requests(self):
        return [scrapy.Request("http://somedomain.mytestsite.com/?offset=" + str(self.offset), callback=self.request_iterator)]

    def request_iterator(self, response):
        body = response.body
        # let's say we get json as response data
        data = json.loads(body)
        # check if page still have data to process
        if data["matches"]:
            self.items.extend(data["matches"])
            self.offset += 100
            return self.start_requests()
        else:
            # process collected data in items list
            return self.do_something_with_items()
This works, but I can't help feeling something is wrong with this code. Maybe I should use some of Scrapy's rules?
The following things could be improved:
1) Don't keep items as a spider attribute: with bigger inputs you will consume an extremely high amount of memory. Use Python generators instead; with generators you can yield items and requests from one spider callback without any trouble.
2) start_requests is only used at spider startup, so there seems to be little need to override it in your code. If you rename your method to parse (the default method name executed as the callback for start_requests), the code will be more readable:
# we should process at least one item otherwise data["matches"] will be empty
start_urls = ["http://somedomain.mytestsite.com/?offset=" + str(1)]

def parse(self, response):
    body = response.body
    # let's say we get json as response data
    data = json.loads(body)
    # check if page still have data to process
    if data["matches"]:
        for x in data["matches"]:
            yield self.process_your_item(x)
        self.offset += 100
        yield self.next_request()
    else:
        # process collected data in items list
        for x in self.do_something_with_items():
            yield x

def next_request(self):
    return scrapy.Request("http://somedomain.mytestsite.com/?offset=" + str(self.offset))
Probably an even better version of your callback would be:
def parse(self, response):
    body = response.body
    # let's say we get json as response data
    data = json.loads(body)
    # check if page still have data to process
    if not data["matches"]:
        self.logger.info("processing done")
        return
    for x in data["matches"]:
        yield self.process_your_item(x)
    self.offset += 100
    yield self.next_request()
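A small side note (my assumption, not from the original answer): if you are on Scrapy 2.2 or newer, response.json() can replace the manual json.loads(response.body) step:

def parse(self, response):
    # Scrapy 2.2+ exposes response.json(), which parses the JSON body directly
    data = response.json()
    # ... the rest of the callback stays the same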
If the spider gets redirected, it should make the request again, but with different parameters.
The callback in the second Request is not performed.
If I use different URLs in the start and checker methods, it works fine. I think the requests use lazy loading and that is why my code isn't working, but I'm not sure.
from scrapy.http import Request
from scrapy.spider import BaseSpider

class TestSpider(BaseSpider):

    def start(self, response):
        return Request(url='http://localhost/', callback=self.checker, meta={'dont_redirect': True})

    def checker(self, response):
        if response.status == 301:
            return Request(url="http://localhost/", callback=self.results, meta={'dont_merge_cookies': True})
        else:
            return self.results(response)

    def results(self, response):
        # here I work with response
        pass
Not sure if you still need this but I have put together an example. If you have a specific website in mind, we can all definitely take a look at it.
from scrapy.http import Request
from scrapy.spider import BaseSpider

class TestSpider(BaseSpider):
    name = "TEST"
    allowed_domains = ["example.com", "example.iana.org"]

    def __init__(self, **kwargs):
        super(TestSpider, self).__init__(**kwargs)
        self.url = "http://www.example.com"
        self.max_loop = 3
        self.loop = 0  # We want it to loop 3 times so keep a class var

    def start_requests(self):
        # I'll write it out more explicitly here
        print "OPEN"
        checkRequest = Request(
            url = self.url,
            meta = {"test": "first"},
            callback = self.checker
        )
        return [checkRequest]

    def checker(self, response):
        # I wasn't sure about a specific website that gives 302
        # so I just used 200. We need the loop counter or it will keep going
        if (self.loop < self.max_loop and response.status == 200):
            print "RELOOPING", response.status, self.loop, response.meta['test']
            self.loop += 1
            checkRequest = Request(
                url = self.url,
                callback = self.checker
            ).replace(meta = {"test": "not first"})
            return [checkRequest]
        else:
            print "END LOOPING"
            self.results(response)  # No need to return, just call method

    def results(self, response):
        print "DONE"  # Do stuff here
In settings.py, set this option
DUPEFILTER_CLASS = 'scrapy.dupefilter.BaseDupeFilter'
This is actually what turns off the filter for duplicate site requests. It's confusing because the BaseDupeFilter is not actually the default since it doesn't really filter anything. This means we will submit 3 different requests that will loop through the checker method. Also, I am using scrapy 0.16:
>scrapy crawl TEST
>OPEN
>RELOOPING 200 0 first
>RELOOPING 200 1 not first
>RELOOPING 200 2 not first
>END LOOPING
>DONE
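As an alternative to swapping out the dupefilter globally (my suggestion, not part of the original answer), a single request can opt out of duplicate filtering with dont_filter=True, which leaves deduplication in place for the rest of the spider:

checkRequest = Request(
    url = self.url,
    callback = self.checker,
    dont_filter = True  # this request alone bypasses the duplicate filter
)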