How can I make callbacks in two sequential Requests with Scrapy - Python

If the spider gets redirected, it should make the request again, but with different parameters.
The callback of the second Request is never executed.
If I use different URLs in the start and checker methods, it works fine. I think the requests are using lazy loading and that's why my code isn't working, but I'm not sure.
from scrapy.http import Request
from scrapy.spider import BaseSpider

class TestSpider(BaseSpider):

    def start(self, response):
        return Request(url='http://localhost/', callback=self.checker,
                       meta={'dont_redirect': True})

    def checker(self, response):
        if response.status == 301:
            return Request(url="http://localhost/", callback=self.results,
                           meta={'dont_merge_cookies': True})
        else:
            return self.results(response)

    def results(self, response):
        # here I work with the response
        pass

Not sure if you still need this, but I have put together an example. If you have a specific website in mind, we can definitely take a look at it.
from scrapy.http import Request
from scrapy.spider import BaseSpider

class TestSpider(BaseSpider):
    name = "TEST"
    allowed_domains = ["example.com", "example.iana.org"]

    def __init__(self, **kwargs):
        super(TestSpider, self).__init__(**kwargs)
        self.url = "http://www.example.com"
        self.max_loop = 3
        self.loop = 0  # We want it to loop 3 times, so keep a class var

    def start_requests(self):
        # I'll write it out more explicitly here
        print "OPEN"
        checkRequest = Request(
            url=self.url,
            meta={"test": "first"},
            callback=self.checker
        )
        return [checkRequest]

    def checker(self, response):
        # I wasn't sure about a specific website that gives a 302,
        # so I just used 200. We need the loop counter or it will keep going.
        if self.loop < self.max_loop and response.status == 200:
            print "RELOOPING", response.status, self.loop, response.meta['test']
            self.loop += 1
            checkRequest = Request(
                url=self.url,
                callback=self.checker
            ).replace(meta={"test": "not first"})
            return [checkRequest]
        else:
            print "END LOOPING"
            self.results(response)  # No need to return, just call the method

    def results(self, response):
        print "DONE"  # Do stuff here
In settings.py, set this option
DUPEFILTER_CLASS = 'scrapy.dupefilter.BaseDupeFilter'
This is actually what turns off the filter for duplicate site requests. It's confusing because the BaseDupeFilter is not actually the default since it doesn't really filter anything. This means we will submit 3 different requests that will loop through the checker method. Also, I am using scrapy 0.16:
>scrapy crawl TEST
>OPEN
>RELOOPING 200 0 first
>RELOOPING 200 1 not first
>RELOOPING 200 2 not first
>END LOOPING
>DONE
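As a hedged side note (not part of the original answer): Scrapy's Request also accepts a dont_filter=True argument, which exempts just that one request from duplicate filtering instead of swapping the dupefilter class globally. A minimal sketch of the looping callback using it:

def checker(self, response):
    if self.loop < self.max_loop and response.status == 200:
        self.loop += 1
        return [Request(
            url=self.url,
            meta={"test": "not first"},
            callback=self.checker,
            dont_filter=True,  # bypass the duplicate filter for this request only
        )]
    else:
        self.results(response)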

Related

Scrapy stops directly after I don't yield a request in parse

I'm trying to make a spider that goes through a certain number of start URLs, and if the resulting page is the right one, I yield another request. The problem is that if I don't yield a second request, the spider stops immediately. There are no problems if I yield the second request.
Here is the relevant code:
def start_requests(self):
    urls = ['https://www.hltv.org' + player for player in self.hashPlayers]
    print(len(urls))
    for url in urls:
        return [scrapy.Request(url=url, callback=self.parse)]

def parse(self, response):
    result = response.xpath("//div[@class='playerTeam']//a/@href").get()
    if result is None:
        result = response.xpath("//span[contains(concat(' ', normalize-space(@class), ' '), ' profile-player-stat-value bold ')]//a/@href").get()
    if result is not None:
        yield scrapy.Request(
            url="https://www.hltv.org" + result,
            callback=self.parseTeam
        )
So I want a way to make the spider continue after I call the parse function and don't yield a request.
def start_requests(self):
    urls = ['https://www.hltv.org' + player for player in self.hashPlayers]
    print(len(urls))
    for url in urls:
        return [scrapy.Request(url=url, callback=self.parse)]
If you use return, the function is terminated, the loop won't iterate to the next value, and only a single request will be sent to the Scrapy engine. Replace it with yield so the method returns a generator.
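For reference, a minimal sketch of the corrected start_requests (same URLs and callback as in the question), using yield so every URL in the list produces a request:

def start_requests(self):
    urls = ['https://www.hltv.org' + player for player in self.hashPlayers]
    print(len(urls))
    for url in urls:
        # yield keeps the loop running, so one request is scheduled per URL
        yield scrapy.Request(url=url, callback=self.parse)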

Get results of Scrapy Request

How can I get the results of a Scrapy request into a usable variable?
def parse_node(self, response, node):
    yield Request('LINK', callback=self.parse_listing)

def parse_listing(self, response):
    for agent in string.split(response.xpath('//node[@id="Agent"]/text()').extract_first() or "", '^'):
        HERE = Request('LINK', callback=self.parse_agent)
        print HERE

def parse_agent(self, response):
    yield response.xpath('//node[@id="Email"]/text()').extract_first()
I am trying to get results from HERE = Request('LINK', callback=self.parse_agent) and print them. parse_agent should pick up an email, but I would like to get that value and use it inside parse_listing.
Based on your comments under the first answer, I think what you really need is using scrapy-inline-requests for the purpose (see the example there). Your code would look something like this:
def parse_node(self, response, node):
    yield Request('LINK', callback=self.parse_listing)

@inline_requests
def parse_listing(self, response):
    for agent in string.split(response.xpath('//node[@id="Agent"]/text()').extract_first() or "", '^'):
        agent_response = yield Request('LINK')
        email = agent_response.xpath('//node[@id="Email"]/text()').extract_first()
def parse_listing(self, response):
    for agent in string.split(response.xpath('//node[@id="Agent"]/text()').extract_first() or "", '^'):
        HERE = scrapy.Request('LINK', callback=self.parse_agent)
        # this request (or anything that calls parse_agent) handles the link
        yield HERE  # this yields to the callback, which will print or log

def parse_agent(self, response):
    print response  # response is the parsed page from HERE
    email = response.xpath('//node[@id="Email"]/text()').extract_first()
    print email  # logging is better
    # import logging
    # logging.log(logging.INFO, "info from page")
    yield email  # yield to whatever function
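If scrapy-inline-requests is not an option, another common pattern (a sketch, not part of the original answers; the 'LINK' placeholder is kept from the question) is to carry the listing-level value along in meta and let parse_agent yield the combined result:

def parse_listing(self, response):
    for agent in (response.xpath('//node[@id="Agent"]/text()').extract_first() or "").split('^'):
        # pass the agent value along so parse_agent can combine it with the email
        yield scrapy.Request('LINK', callback=self.parse_agent, meta={'agent': agent})

def parse_agent(self, response):
    email = response.xpath('//node[@id="Email"]/text()').extract_first()
    yield {'agent': response.meta['agent'], 'email': email}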

Scrapy: Wait for some urls to be parsed, then do something

I have a spider that needs to find product prices. Those products are grouped together in batches (coming from a database) and it would be nice to have a batch status (RUNNING, DONE) along with start_time and finished_time attributes.
So I have something like:
class PriceSpider(scrapy.Spider):
    name = 'prices'

    def start_requests(self):
        for batch in Batches.objects.all():
            batch.started_on = datetime.now()
            batch.status = 'RUNNING'
            batch.save()
            for prod in batch.get_products():
                yield scrapy.Request(prod.get_scrape_url(), meta={'prod': prod})
            batch.status = 'DONE'
            batch.finished_on = datetime.now()
            batch.save()  # <-- NOT COOL: this is going to execute
                          #     before the last product URL is scraped, right?

    def parse(self, response):
        # ...
The problem here is that, due to the async nature of Scrapy, the second status update on the batch object is going to run too soon... right?
Is there a way to group these requests together somehow and have the batch object updated when the last one is parsed?
Here is a trick:
With each request, send batch_id, total_products_in_this_batch and processed_this_batch,
and then anywhere in any function check:
for batch in Batches.objects.all():
    processed_this_batch = 0
    # TODO: get some batch_id here
    # TODO: find a way to check the total number of products in this batch
    #       and assign it to `total_products_in_this_batch`
    for prod in batch.get_products():
        processed_this_batch = processed_this_batch + 1
        yield scrapy.Request(prod.get_scrape_url(), meta={
            'prod': prod,
            'batch_id': batch_id,
            'total_products_in_this_batch': total_products_in_this_batch,
            'processed_this_batch': processed_this_batch,
        })
Then, anywhere in the code, for any particular batch, check whether processed_this_batch == total_products_in_this_batch and, if so, save the batch.
For this kind of task you can use the spider_closed signal, to which you can bind a function that runs when the spider is done crawling.
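A minimal sketch of that signal approach, assuming the Batches model from the question (the from_crawler hook is the standard place to connect signals; the filter(status='RUNNING') query is an assumption about the model):

from scrapy import signals

class PriceSpider(scrapy.Spider):
    name = 'prices'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(PriceSpider, cls).from_crawler(crawler, *args, **kwargs)
        # run spider_closed() once the whole crawl has finished
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        # every request has been processed by now, so it is safe to close the batches
        for batch in Batches.objects.filter(status='RUNNING'):  # assumed query
            batch.status = 'DONE'
            batch.finished_on = datetime.now()
            batch.save()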
I made some adaptations to @Umair's suggestion and came up with a solution that works great for my case:
class PriceSpider(scrapy.Spider):
    name = 'prices'

    def start_requests(self):
        for batch in Batches.objects.all():
            batch.started_on = datetime.now()
            batch.status = 'RUNNING'
            batch.save()
            products = batch.get_products()
            # the counter dictionary for this batch
            counter = {'curr': 0, 'total': len(products)}
            for prod in products:
                # trick: pass the shared counter along in the meta dict
                yield scrapy.Request(prod.get_scrape_url(),
                                     meta={'prod': prod,
                                           'batch': batch,
                                           'counter': counter})

    def parse(self, response):
        # process the response as desired
        batch = response.meta['batch']
        counter = response.meta['counter']
        # increment the counter only after the work is done
        self.increment_counter(batch, counter)

    def increment_counter(self, batch, counter):
        counter['curr'] += 1
        if counter['curr'] == counter['total']:
            batch.status = 'DONE'
            batch.finished_on = datetime.now()
            batch.save()  # GOOD!
            # Well, almost...
This works fine as long as all the Requests yielded by start_requests have different URLs.
If there are any duplicates, Scrapy will filter them out and not call your parse method,
so you end up with counter['curr'] < counter['total'] and the batch status is left RUNNING forever.
As it turns out, you can override Scrapy's behaviour for duplicates.
First, we need to change settings.py to specify an alternative "duplicates filter" class:
DUPEFILTER_CLASS = 'myspiders.shopping.MyDupeFilter'
Then we create the MyDupeFilter class, that lets the spider know when there is a duplicate:
from scrapy.dupefilters import RFPDupeFilter

class MyDupeFilter(RFPDupeFilter):
    def log(self, request, spider):
        super(MyDupeFilter, self).log(request, spider)
        spider.look_a_dupe(request)
Then we modify our spider to make it increment our counter when a duplicate is found:
class PriceSpider(scrapy.Spider):
    name = 'prices'

    # ...

    def look_a_dupe(self, request):
        batch = request.meta['batch']
        counter = request.meta['counter']
        self.increment_counter(batch, counter)
And we are good to go
This is my code. Two parser functions call the same AfterParserFinished(), which counts the number of invocations to determine when all parsers have finished:
countAccomplishedParsers: int = 0

def AfterParserFinished(self):
    self.countAccomplishedParsers = self.countAccomplishedParsers + 1
    print(self.countAccomplishedParsers)  # how many parsers have finished
    if self.countAccomplishedParsers == 2:
        print("Accomplished: 2. Do something.")

def parse1(self, response):
    self.AfterParserFinished()
    pass

def parse2(self, response):
    self.AfterParserFinished()
    pass

iterate through url params template in Scrapy

I have the following URL to begin with: http://somedomain.mytestsite.com/?offset=0. I'd like to loop through this URL by incrementing the offset parameter, let's say by 100 each time. Each time I receive a response, I need to check some condition to decide whether I should run the next iteration. For example:
class SomeSpider(BaseSpider):
    name = 'somespider'
    offset = 0
    items = list()

    def start_requests(self):
        return [scrapy.Request("http://somedomain.mytestsite.com/?offset=" + str(self.offset),
                               callback=self.request_iterator)]

    def request_iterator(self, response):
        body = response.body
        # let's say we get json as response data
        data = json.loads(body)
        # check if the page still has data to process
        if data["matches"]:
            self.items.extend(data["matches"])
            self.offset += 100
            return self.start_requests()
        else:
            # process collected data in items list
            return self.do_something_with_items()
This works, but I can't help feeling something is wrong with this code. Maybe I should use some of Scrapy's rules?
The following things could be improved:
1) Don't keep items as a spider attribute; you will consume an extremely high amount of memory with bigger inputs. Use Python generators instead. With generators you can yield items and requests from one spider callback without any trouble.
2) start_requests is used at spider startup, and there seems to be little need to override it in your code. If you rename your method to parse (the default method name executed as the callback for start_requests), the code will be more readable.
# we should process at least one item, otherwise data["matches"] will be empty
start_urls = ["http://somedomain.mytestsite.com/?offset=0"]

def parse(self, response):
    body = response.body
    # let's say we get json as response data
    data = json.loads(body)
    # check if the page still has data to process
    if data["matches"]:
        for x in data["matches"]:
            yield self.process_your_item(x)
        self.offset += 100
        yield self.next_request()
    else:
        # process collected data in items list
        for x in self.do_something_with_items():
            yield x

def next_request(self):
    return scrapy.Request("http://somedomain.mytestsite.com/?offset=" + str(self.offset))
Probably an even better version of your callback would be:
def parse(self, response):
    body = response.body
    # let's say we get json as response data
    data = json.loads(body)
    # check if the page still has data to process
    if not data["matches"]:
        self.logger.info("processing done")
        return
    for x in data["matches"]:
        yield self.process_your_item(x)
    self.offset += 100
    yield self.next_request()

Passing an argument to a callback function [duplicate]

def parse(self, response):
    for sel in response.xpath('//tbody/tr'):
        item = HeroItem()
        item['hclass'] = response.request.url.split("/")[8].split('-')[-1]
        item['server'] = response.request.url.split('/')[2].split('.')[0]
        item['hardcore'] = len(response.request.url.split("/")[8].split('-')) == 3
        item['seasonal'] = response.request.url.split("/")[6] == 'season'
        item['rank'] = sel.xpath('td[@class="cell-Rank"]/text()').extract()[0].strip()
        item['battle_tag'] = sel.xpath('td[@class="cell-BattleTag"]//a/text()').extract()[1].strip()
        item['grift'] = sel.xpath('td[@class="cell-RiftLevel"]/text()').extract()[0].strip()
        item['time'] = sel.xpath('td[@class="cell-RiftTime"]/text()').extract()[0].strip()
        item['date'] = sel.xpath('td[@class="cell-RiftTime"]/text()').extract()[0].strip()

        url = 'https://' + item['server'] + '.battle.net/' + sel.xpath('td[@class="cell-BattleTag"]//a/@href').extract()[0].strip()
        yield Request(url, callback=self.parse_profile)

def parse_profile(self, response):
    sel = Selector(response)
    item = HeroItem()
    item['weapon'] = sel.xpath('//li[@class="slot-mainHand"]/a[@class="slot-link"]/@href').extract()[0].split('/')[4]
    return item
Well, I'm scraping a whole table in the main parse method, and I take several fields from that table. One of these fields is a URL, which I want to explore to get a whole new bunch of fields. How can I pass my already created item object to the callback function so that the final item keeps all the fields?
As shown in the code above, I'm able to save either the fields inside the URL (code at the moment) or only the ones in the table (by simply writing yield item),
but I can't yield a single object with all the fields together.
I have tried this, but obviously it doesn't work:
yield Request(url, callback=self.parse_profile(item))

def parse_profile(self, response, item):
    sel = Selector(response)
    item['weapon'] = sel.xpath('//li[@class="slot-mainHand"]/a[@class="slot-link"]/@href').extract()[0].split('/')[4]
    return item
This is what you'd use the meta keyword argument for:
def parse(self, response):
    for sel in response.xpath('//tbody/tr'):
        item = HeroItem()
        # item assignment here
        url = 'https://' + item['server'] + '.battle.net/' + sel.xpath('td[@class="cell-BattleTag"]//a/@href').extract()[0].strip()
        yield Request(url, callback=self.parse_profile, meta={'hero_item': item})

def parse_profile(self, response):
    item = response.meta.get('hero_item')
    item['weapon'] = response.xpath('//li[@class="slot-mainHand"]/a[@class="slot-link"]/@href').extract()[0].split('/')[4]
    yield item
Also note that doing sel = Selector(response) is a waste of resources and differs from what you did earlier, so I changed it. The selector is automatically mapped in the response as response.selector, which also has the convenience shortcut response.xpath.
Here's a better way to pass arguments to the callback function:
def parse(self, response):
    request = scrapy.Request('http://www.example.com/index.html',
                             callback=self.parse_page2,
                             cb_kwargs=dict(main_url=response.url))
    request.cb_kwargs['foo'] = 'bar'  # add more arguments for the callback
    yield request

def parse_page2(self, response, main_url, foo):
    yield dict(
        main_url=main_url,
        other_url=response.url,
        foo=foo,
    )
source: https://docs.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-callback-arguments
I had a similar issue with Tkinter's extra argument passing, and found this solution to work (here: http://infohost.nmt.edu/tcc/help/pubs/tkinter/web/extra-args.html), converted to your problem:
def parse(self, response):
    item = HeroItem()
    [...]

    def handler(self=self, response=response, item=item):
        """ passing as default argument values """
        return self.parse_profile(response, item)

    yield Request(url, callback=handler)
@peduDev
I tried your approach, but something failed due to an unexpected keyword.
scrapy_req = scrapy.Request(url=url,
                            callback=self.parseDetailPage,
                            cb_kwargs=dict(participant_id=nParticipantId))

def parseDetailPage(self, response, participant_id):
    # .. some code here ..
    yield MyParseResult(
        # .. some code here ..
        participant_id=participant_id
    )
Error reported:
, cb_kwargs=dict(participant_id=nParticipantId)
TypeError: __init__() got an unexpected keyword argument 'cb_kwargs'
Any idea what caused the unexpected keyword argument, other than perhaps a too old Scrapy version?
Yep. I verified my own suggestion, and after an upgrade it all worked as expected.
sudo pip install --upgrade scrapy
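For context (a hedged note, not stated in the thread): cb_kwargs was only added in Scrapy 1.7, so older installs raise exactly this TypeError. You can confirm the installed version before and after the upgrade:

import scrapy
print(scrapy.__version__)  # cb_kwargs needs Scrapy 1.7 or newer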
