I have the following URL to begin with: http://somedomain.mytestsite.com/?offset=0. I'd like to loop through this URL by incrementing the offset parameter, say by 100 each time. Each time I receive a response, I need to check some condition to decide whether to run the next iteration. For example:
class SomeSpider(BaseSpider):
    name = 'somespider'
    offset = 0
    items = list()

    def start_requests(self):
        return [scrapy.Request("http://somedomain.mytestsite.com/?offset="+str(self.offset), callback=self.request_iterator)]

    def request_iterator(self, response):
        body = response.body
        #let's say we get json as response data
        data = json.loads(body)
        #check if page still have data to process
        if data["matches"]:
            self.items.extend(data["matches"])
            self.offset += 100
            return self.start_requests()
        else:
            #process collected data in items list
            return self.do_something_with_items()
This works, but I can't help feeling there is something wrong with this code. Maybe I should use some of Scrapy's rules?
The following things could be improved:
1) Don't keep items as a spider attribute: with larger inputs it will consume a very large amount of memory. Use Python generators instead. With generators you can yield both items and requests from a single spider callback without any trouble.
2) start_requests is used at spider startup, and there seems to be little need to override it in your code. If you define start_urls and rename your method to parse (the default callback for the initial requests), the code will be more readable:
# we should process at least one item otherwise data["matches"] will be empty.
start_urls = ["http://somedomain.mytestsite.com/?offset=1"]

def parse(self, response):
    body = response.body
    # let's say we get json as response data
    data = json.loads(body)
    # check if the page still has data to process
    if data["matches"]:
        for x in data["matches"]:
            yield self.process_your_item(x)
        self.offset += 100
        yield self.next_request()
    else:
        # process collected data in items list
        for x in self.do_something_with_items():
            yield x

def next_request(self):
    return scrapy.Request("http://somedomain.mytestsite.com/?offset="+str(self.offset))
A probably even better version of your callback would be:
def parse(self, response):
    body = response.body
    #let's say we get json as response data
    data = json.loads(body)
    #check if page still have data to process
    if not data["matches"]:
        self.logger.info("processing done")
        return
    for x in data["matches"]:
        yield self.process_your_item(x)
    self.offset += 100
    yield self.next_request()
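The control flow of this callback (yield every match, advance the offset, stop at the first empty page) can be tried out without Scrapy. The following is only a sketch under assumptions: fake_fetch is a hypothetical stand-in for the HTTP request, and the total of 250 records is made up for illustration.

```python
import json


def fake_fetch(offset, total=250, step=100):
    """Hypothetical stand-in for the HTTP call: returns a JSON body for one page."""
    matches = list(range(offset, min(offset + step, total)))
    return json.dumps({"matches": matches})


def crawl(fetch, step=100):
    """Yield every match page by page and stop at the first empty page."""
    offset = 0
    while True:
        data = json.loads(fetch(offset))
        if not data["matches"]:
            return  # nothing left to process: end the generator
        for match in data["matches"]:
            yield match  # items are streamed, never stored on the object
        offset += step


items = list(crawl(fake_fetch))
print(len(items))
```

Because crawl is a generator, nothing is accumulated in an attribute; the caller decides whether to collect the results or consume them one at a time.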
I am using Scrapy to go from page to page and collect numbers that are on a page. The pages are all similar in the way that I can use the same function to parse them. Simple enough, but I don't need each individual number on the pages, or even each number total from each page. I just need the total sum of all the numbers across all the pages I am visiting. The Scrapy documentation talks about using cb_kwargs to pass arguments, and this is what I have so far.
def parse(self, response):
    self.logger.info('A response from %s just arrived!', response.url)
    numbers_page = response.css('.numbers + a')
    yield from response.follow(numbers_page, callback=self.parse_numbers, cb_kwargs=dict(total_count=0))

def parse_numbers(self, response, total_count):
    yield {
        total_count = total_count,
    }
    def extract_with_css(query):
        return response.css(query).get(default='').strip()
    for number in response.css('div.numbers'):
        yield {
            'number': extract_with_css('span::text'),
            total_count = total_count + int(number.replace(',',''))
        }
    next_page = response.css('li.next a::attr("href")').get()
    if next_page is not None:
        request = scrapy.Request(next_page,
                                 callback=self.parse_numbers,
                                 cb_kwargs=dict(total_count=total_count))
        yield request
I cut out things irrelevant to the question to make my code clearer. I feel like using a for loop to add up the numbers is okay, but how do I get that total value to the next page (if there is one) and then export it with the rest of the data at the end?
I don't see the need for passing data from one request to another.
The most obvious way I can think of to go about it would be as follows:
1) You collect the count of the page and yield the result as an item.
2) You create an item pipeline that keeps track of the total count.
3) When the scraping is finished, you have the total count in your item pipeline and you write it to a file, database, ...
Your spider would look something like this:
def parse(self, response):
    self.logger.info('A response from %s just arrived!', response.url)
    numbers_page = response.css('.numbers + a')
    # follow_all (Scrapy 2.0+) accepts a SelectorList and returns an iterable of requests
    yield from response.follow_all(numbers_page, callback=self.parse_numbers)

def parse_numbers(self, response):
    numbers = response.css('div.numbers')
    list_numbers = numbers.css('span::text').getall()
    # strip thousands separators before converting, as in the question
    page_sum = sum(int(number.replace(',', '')) for number in list_numbers if number.strip())
    yield {'page_sum': page_sum}

    next_page = response.css('li.next a::attr("href")').get()
    if next_page:
        # response.follow also handles relative URLs
        yield response.follow(next_page, callback=self.parse_numbers)
For the item pipeline you can use logic like this:
import json

class TotalCountPipeline(object):

    def __init__(self):
        # initialize the variable that keeps track of the total count
        self.total_count = 0

    def process_item(self, item, spider):
        # every number yielded from your spider in page_sum will be added to the current total count
        page_sum = item['page_sum']
        self.total_count += page_sum
        return item

    def close_spider(self, spider):
        # write the final count to a file
        output = json.dumps(self.total_count)
        with open('test_count_file.jl', 'w') as output_file:
            output_file.write(output + '\n')
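A pipeline only runs if it is enabled in the project settings. A minimal sketch, assuming a hypothetical project package called myproject with the class defined in its pipelines.py (adjust the dotted path to your own layout):

```python
# settings.py
# NOTE: "myproject.pipelines" is an assumed module path, not taken from the question.
ITEM_PIPELINES = {
    # The integer (conventionally 0-1000) sets the order in which
    # enabled pipelines run; lower values run first.
    "myproject.pipelines.TotalCountPipeline": 300,
}
```

With only one pipeline enabled, the exact number does not matter; it becomes relevant once several pipelines have to run in a fixed order.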
I'm trying to make a spider that goes through a certain number of start URLs and, if the resulting page is the right one, yields another request. The problem is that if I try any way of not yielding a second request, the spider stops immediately. There are no problems if I yield the second request.
Here is the relevant code:
def start_requests(self):
    urls = ['https://www.hltv.org' + player for player in self.hashPlayers]
    print(len(urls))
    for url in urls:
        return [scrapy.Request(url=url, callback=self.parse)]

def parse(self, response):
    result = response.xpath("//div[@class = 'playerTeam']//a/@href").get()
    if result is None:
        result = response.xpath("//span[contains(concat(' ',normalize-space(@class),' '),' profile-player-stat-value bold ')]//a/@href").get()
    if result is not None:
        yield scrapy.Request(
            url = "https://www.hltv.org" + result,
            callback = self.parseTeam
        )
So I want a way to make the spider continue after the parse function is called and does not yield a request.
def start_requests(self):
    urls = ['https://www.hltv.org' + player for player in self.hashPlayers]
    print(len(urls))
    for url in urls:
        return [scrapy.Request(url=url, callback=self.parse)]
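If you use return, the function terminates during the first iteration: the loop never reaches the next value, and only a single request is sent to the Scrapy engine. Replace return with yield (yielding the request itself, not a list) so that the function becomes a generator and produces one request per URL.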
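The difference is easy to reproduce in plain Python. In this sketch, strings stand in for scrapy.Request objects, and the player paths are made-up examples:

```python
BASE = "https://www.hltv.org"


def with_return(paths):
    for path in paths:
        # return leaves the function during the first iteration
        return [BASE + path]


def with_yield(paths):
    for path in paths:
        # yield hands back one value and resumes the loop afterwards
        yield BASE + path


paths = ["/player/1", "/player/2", "/player/3"]  # hypothetical paths
first_only = with_return(paths)
all_of_them = list(with_yield(paths))

print(len(first_only))   # 1
print(len(all_of_them))  # 3
```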
This is an extension of "Multiple nested request with scrapy". I'm asking because the presented solution has flaws:
1. It eliminates asynchrony, which heavily reduces scraping efficiency.
2. Should an exception occur while processing the "stack" of links, no item will be yielded.
3. What if there is a huge number of child items?
To deal with (1) I considered this:
class CatLoader(ItemLoader):
    def __init__(self, item=None, selector=None, response=None, parent=None, **context):
        super(CatLoader, self).__init__(item, selector, response, parent, **context)
        self.lock = threading.Lock()
        self.counter = 0

    def dec_counter(self):
        # one child request has finished: decrement the number still outstanding
        with self.lock:
            self.counter -= 1
Then in parser:
if len(urls) == 0:
    self.logger.warning('Cat without items, url: ' + response.url)
    item = cl.load_item()
    yield item
cl.counter = len(urls)
for url in urls:
    rq = Request(url, self.parse_item)
    rq.meta['loader'] = cl
    yield rq
And in parse_item() I can do:
def parse_item(self, response):
    l = response.meta['loader']
    l.dec_counter()
    if l.counter == 0:
        yield l.load_item()
BUT! To deal with (2), I need to do this in each function:
def parse_item(self, response):
    try:
        l = response.meta['loader']
    finally:
        l.dec_counter()
        if l.counter == 0:
            yield l.load_item()
I don't consider this an elegant solution. Could anyone help with a better one? Also, I intend to insert items into a DB rather than produce JSON output, so maybe it would be better to create an item with a promise and make a pipeline that parses the children to check whether the promise is fulfilled (when the item is inserted into the DB), or something like that?
UPD: Hierarchical items: category -> article -> images. All are to be saved in different tables with proper relations. So:
1) Articles must be inserted into their table AFTER the category.
2) An article must know the ID of its category to form the relation.
The same goes for image records.
I have a spider that needs to find product prices. Those products are grouped together in batches (coming from a database) and it would be nice to have a batch status (RUNNING, DONE) along with start_time and finished_time attributes.
So I have something like:
class PriceSpider(scrapy.Spider):
    name = 'prices'

    def start_requests(self):
        for batch in Batches.objects.all():
            batch.started_on = datetime.now()
            batch.status = 'RUNNING'
            batch.save()
            for prod in batch.get_products():
                yield scrapy.Request(prod.get_scrape_url(), meta={'prod': prod})
            batch.status = 'DONE'
            batch.finished_on = datetime.now()
            batch.save()  # <-- NOT COOL: This is going to
                          # execute before the last product
                          # url is scraped, right?

    def parse(self, response):
        #...
The problem here is due to the async nature of scrapy, the second status update on the batch object is going to run too soon... right?
Is there a way to group these requests together somehow and have the batch object be updated when the last one is parsed?
Here is a trick:
With each request, send batch_id, total_products_in_this_batch and processed_this_batch, so that any callback can check them:
for batch in Batches.objects.all():
    processed_this_batch = 0
    # TODO: Get some batch_id here
    # TODO: Find a way to check total number of products in this batch and assign to `total_products_in_this_batch`
    for prod in batch.get_products():
        processed_this_batch = processed_this_batch + 1
        yield scrapy.Request(prod.get_scrape_url(), meta={'prod': prod, 'batch_id': batch_id, 'total_products_in_this_batch': total_products_in_this_batch, 'processed_this_batch': processed_this_batch})
Then, anywhere in the code, for any particular batch, check whether processed_this_batch == total_products_in_this_batch, and if so, save the batch.
For this kind of thing you can use the spider_closed signal: bind a function to it and it will run when the spider is done crawling.
I made some adaptations to @Umair's suggestion and came up with a solution that works great for my case:
class PriceSpider(scrapy.Spider):
    name = 'prices'

    def start_requests(self):
        for batch in Batches.objects.all():
            batch.started_on = datetime.now()
            batch.status = 'RUNNING'
            batch.save()
            products = batch.get_products()
            counter = {'curr': 0, 'total': len(products)}  # the counter dictionary
                                                           # for this batch
            for prod in products:
                yield scrapy.Request(prod.get_scrape_url(),
                                     meta={'prod': prod,
                                           'batch': batch,
                                           'counter': counter})
                # trick = add the counter in the meta dict

    def parse(self, response):
        # process the response as desired
        batch = response.meta['batch']
        counter = response.meta['counter']
        self.increment_counter(batch, counter)  # increment counter only after
                                                # the work is done

    def increment_counter(self, batch, counter):
        counter['curr'] += 1
        if counter['curr'] == counter['total']:
            batch.status = 'DONE'
            batch.finished_on = datetime.now()
            batch.save()  # GOOD!
                          # Well, almost...
This works fine as long as all the Requests yielded by start_requests have different URLs.
If there are any duplicates, Scrapy will filter them out and not call your parse method,
so you end up with counter['curr'] < counter['total'] and the batch status is left RUNNING forever.
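The shortfall can be illustrated without Scrapy. In this sketch a plain set stands in for the duplicates filter, and the URLs are invented for the example:

```python
def run_batch(urls):
    """Simulate one batch; a set plays the role of the duplicates filter."""
    counter = {'curr': 0, 'total': len(urls)}
    seen = set()
    for url in urls:
        if url in seen:
            continue  # filtered duplicate: the callback never runs
        seen.add(url)
        counter['curr'] += 1  # what parse() does through increment_counter
    return counter


unique = run_batch(['/p/1', '/p/2', '/p/3'])
with_dupe = run_batch(['/p/1', '/p/2', '/p/1'])

print(unique)     # curr reaches total, so the batch would be marked DONE
print(with_dupe)  # curr stays below total, so the batch would stay RUNNING
```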
As it turns out you can override scrapy's behaviour for duplicates.
First, we need to change settings.py to specify an alternative "duplicates filter" class:
DUPEFILTER_CLASS = 'myspiders.shopping.MyDupeFilter'
Then we create the MyDupeFilter class, that lets the spider know when there is a duplicate:
from scrapy.dupefilters import RFPDupeFilter

class MyDupeFilter(RFPDupeFilter):
    def log(self, request, spider):
        super(MyDupeFilter, self).log(request, spider)
        spider.look_a_dupe(request)
Then we modify our spider to make it increment our counter when a duplicate is found:
class PriceSpider(scrapy.Spider):
    name = 'prices'

    #...

    def look_a_dupe(self, request):
        batch = request.meta['batch']
        counter = request.meta['counter']
        self.increment_counter(batch, counter)
And we are good to go
This is my code. Two parser functions call the same AfterParserFinished(), which counts the number of invocations to determine when all parsers have finished.
countAccomplishedParsers: int = 0

def AfterParserFinished(self):
    self.countAccomplishedParsers = self.countAccomplishedParsers + 1
    print(self.countAccomplishedParsers)  # How many parsers have been accomplished
    if self.countAccomplishedParsers == 2:
        print("Accomplished: 2. Do something.")

def parse1(self, response):
    self.AfterParserFinished()
    pass

def parse2(self, response):
    self.AfterParserFinished()
    pass
I'm having a problem similar to the one the individual below had. I'm trying to pass an item using the meta attribute. I'm seeing the correct number of items in the output, but they are duplicates of a single item. Could someone help? I'm guessing by the response to the previous individual's post that this should be an obvious fix.
https://github.com/scrapy/scrapy/issues/1257
def parse(self, response):
    # some treatment
    # a loop
        request = scrapy.Request(url=<calculated_url>, callback=parseChapter)
        request.meta['item'] = # a dictionary containing some data of the just parsed page
        yield request

def parseChapter(self, response):
    # some treatment
    # a loop
        request = scrapy.Request(url=<calculated_url>, callback=parseCategory)
        request.meta['item'] = # a dictionary containing some data of the just parsed page
        # print request.meta['item'] is good and different in every iteration
        yield request

def parseCategory(self, response):
    # print response.meta['item'] is not good because it displays the same value many times
    # for every new call of parseChapter, meta['item'] received is always the same
    # some treatment
Most likely, you are modifying the same item at each iteration of the for loop instead of creating a new one.
As a consequence, all requests carry a reference to the same object, so every callback sees the same value, i.e. the last value of the item variable.
def parseChapter(self, response):
    # some treatment
    # a loop
        request = scrapy.Request(url=<calculated_url>, callback=parseCategory)
        request.meta['item'] = my_item_dict.copy()
        # print request.meta['item'] is good and different in every iteration
        yield request
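The aliasing effect can be reproduced in a few lines of plain Python. This is only an illustration: the lists stand in for the meta dicts of successive requests, and the chapter names are made up.

```python
item = {}
shared = []  # what each request would carry without .copy()
copied = []  # what each request would carry with .copy()

for chapter in ["ch1", "ch2", "ch3"]:
    item["chapter"] = chapter   # mutates the one and only dict
    shared.append(item)         # every entry references the same object
    copied.append(item.copy())  # a shallow copy freezes the current values

print([m["chapter"] for m in shared])  # ['ch3', 'ch3', 'ch3']
print([m["chapter"] for m in copied])  # ['ch1', 'ch2', 'ch3']
```

Note that dict.copy() is shallow: if the item contains nested lists or dicts that are also mutated between iterations, copy.deepcopy from the standard library is the safer choice.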