I have this general save function in my Scrapy spider:
def save_results(self, menu, url):
    inspect_response(response, self)
    res, method = self.crawl_result(url)
    self.item['crawl_result'] = res
    self.item['raw_menu_urls'] = url
    self.item['conversion_method'] = method
    self.item['menu_text'] = menu
    print self.item
    yield self.item
And I call it like this from another function:
def yelp_menu(self, response):
    id = response.meta['id']
    menu = response.xpath('//div[@class="container biz-menu"]//text()').extract()
    menu = self.clean_text(menu)
    self.save_results(response.url, menu)
But it never gets called.
Where am I wrong?
P.S. I know this is not how Scrapy is supposed to work with items, pipelines and the rest.
The problem is that self.save_results() returns a generator; calling it does not execute the function body until the generator is iterated. What you need is the following:
for item in self.save_results(response.url, menu):
    yield item
Or, if you're using Python 3.3+, you can use yield from magic:
yield from self.save_results(response.url, menu)
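For reference, a minimal sketch of the calling callback with that change applied could look like the following (note that save_results is declared as (menu, url), so the argument order in the call may also need swapping):
def yelp_menu(self, response):
    menu = response.xpath('//div[@class="container biz-menu"]//text()').extract()
    menu = self.clean_text(menu)
    # re-yield every item the save_results generator produces (Python 3.3+)
    yield from self.save_results(menu, response.url)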
I've written a scraper using Python to scrape movie names from YIFY torrents. The results are spread across around 12 pages. If I run my crawler using a print statement, it gives me all the results from all the pages. However, when I run the same code using return, it gives me the content from the first page only and does not go on to the next page to process the rest. As I'm having a hard time understanding the behavior of the return statement, if somebody points out where I'm going wrong and gives me a workaround I would be very happy. Thanks in advance.
This is what I'm trying with (the full code):
import requests
from urllib.request import urljoin
from lxml.html import fromstring

main_link = "https://www.yify-torrent.org/search/western/"
# film_storage = []  # I tried like this as well (keeping the list storage outside the function)

def get_links(link):
    root = fromstring(requests.get(link).text)
    film_storage = []
    for item in root.cssselect(".mv"):
        name = item.cssselect("h3 a")[0].text
        film_storage.append(name)
    return film_storage
    next_page = root.cssselect(".pager a:contains('Next')")[0].attrib['href'] if root.cssselect(".pager a:contains('Next')") else ""
    if next_page:
        full_link = urljoin(link, next_page)
        get_links(full_link)

if __name__ == '__main__':
    items = get_links(main_link)
    for item in items:
        print(item)
But when I do it like below, I get all the results (pasted the relevant portion only):
def get_links(link):
    root = fromstring(requests.get(link).text)
    for item in root.cssselect(".mv"):
        name = item.cssselect("h3 a")[0].text
        print(name)  # using print I get all the results from all the pages
    next_page = root.cssselect(".pager a:contains('Next')")[0].attrib['href'] if root.cssselect(".pager a:contains('Next')") else ""
    if next_page:
        full_link = urljoin(link, next_page)
        get_links(full_link)
Your return statement prematurely terminates your get_links() function, meaning this part
next_page = root.cssselect(".pager a:contains('Next')")[0].attrib['href'] if root.cssselect(".pager a:contains('Next')") else ""
if next_page:
    full_link = urljoin(link, next_page)
    get_links(full_link)
is never executed.
A quick fix would be to put the return statement at the end of your function, but then you have to make film_storage global (defined outside the get_links() function).
Edit:
Just realized that since you will be making film_storage global, there is no need for the return statement at all.
Your code in main would just look like this:
get_links(main_link)
for item in film_storage:
    print(item)
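Put together, a rough sketch of that quick fix (keeping the rest of your parsing code as it is) would be:
film_storage = []  # module-level, so every recursive call appends to the same list

def get_links(link):
    root = fromstring(requests.get(link).text)
    for item in root.cssselect(".mv"):
        name = item.cssselect("h3 a")[0].text
        film_storage.append(name)
    next_page = root.cssselect(".pager a:contains('Next')")[0].attrib['href'] if root.cssselect(".pager a:contains('Next')") else ""
    if next_page:
        full_link = urljoin(link, next_page)
        get_links(full_link)

if __name__ == '__main__':
    get_links(main_link)
    for item in film_storage:
        print(item)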
Your film_storage results list is local to the get_links() function, which is called recursively for the next page. After the recursive calls (for all the next pages) return, the initial (entry) call still returns results from the first page only.
You'll have to either (1) unwrap the tail recursion into a loop, (2) make the results list global, (3) use a callback (like you do with print), or, the best option, (4) turn the get_links() function into a generator that yields results from all pages.
Generator version:
def get_links(link):
    root = fromstring(requests.get(link).text)
    for item in root.cssselect(".mv"):
        name = item.cssselect("h3 a")[0].text
        yield name
    next_page = root.cssselect(".pager a:contains('Next')")[0].attrib['href'] if root.cssselect(".pager a:contains('Next')") else ""
    if next_page:
        full_link = urljoin(link, next_page)
        for name in get_links(full_link):
            yield name
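Consuming it from main is then just a loop over the generator (and on Python 3.3+ the inner re-yield loop above could also be written as yield from get_links(full_link)):
if __name__ == '__main__':
    for name in get_links(main_link):
        print(name)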
This one is an extension of Multiple nested request with scrapy. Asking because the presented solution has flaws:
1. It eliminates asynchrony, thus heavily reducing scraping efficiency.
2. Should an exception appear while processing the links "stack", no item will be yielded.
3. What if there is a huge amount of child items?
To deal with (1) I considered this:
import threading

from scrapy.loader import ItemLoader


class CatLoader(ItemLoader):
    def __init__(self, item=None, selector=None, response=None, parent=None, **context):
        super(self.__class__, self).__init__(item, selector, response, parent, **context)
        self.lock = threading.Lock()
        self.counter = 0

    def dec_counter(self):
        self.lock.acquire()
        self.counter -= 1
        self.lock.release()
Then in the parser:
if len(urls) == 0:
    self.logger.warning('Cat without items, url: ' + response.url)
    item = cl.load_item()
    yield item
cl.counter = len(urls)
for url in urls:
    rq = Request(url, self.parse_item)
    rq.meta['loader'] = cl
    yield rq
And in parse_item() I can do:
def parse_item(self, response):
    l = response.meta['loader']
    l.dec_counter()
    if l.counter == 0:
        yield l.load_item()
BUT! To deal with (2), I need to do this in each function:
def parse_item(self, response):
    try:
        l = response.meta['loader']
    finally:
        l.dec_counter()
        if l.counter == 0:
            yield l.load_item()
That I consider an inelegant solution. Could anyone help with a better one? Also, I intend to insert items into a DB rather than produce JSON output, so maybe it would be better to create an item with a promise and make a pipeline that parses children to check whether the promise is fulfilled (when the item is inserted into the DB), or something like that?
UPD: Hierarchic items: category -> article -> images. All to be saved in different tables with proper relations. So:
1) Articles must be inserted into their table AFTER the category.
2) An article must know the ID of its category to form the relation.
Same thing for the images records.
I have a spider that needs to find product prices. Those products are grouped together in batches (coming from a database) and it would be nice to have a batch status (RUNNING, DONE) along with start_time and finished_time attributes.
So I have something like:
class PriceSpider(scrapy.Spider):
    name = 'prices'

    def start_requests(self):
        for batch in Batches.objects.all():
            batch.started_on = datetime.now()
            batch.status = 'RUNNING'
            batch.save()
            for prod in batch.get_products():
                yield scrapy.Request(prod.get_scrape_url(), meta={'prod': prod})
            batch.status = 'DONE'
            batch.finished_on = datetime.now()
            batch.save()  # <-- NOT COOL: This is going to
                          #     execute before the last product
                          #     url is scraped, right?

    def parse(self, response):
        # ...
The problem here is that, due to the async nature of Scrapy, the second status update on the batch object is going to run too soon... right?
Is there a way to group these requests together somehow and have the batch object be updated when the last one is parsed?
Here is the trick:
With each request, send batch_id, total_products_in_this_batch and processed_this_batch,
and then anywhere in any function you can check them:
for batch in Batches.objects.all():
    processed_this_batch = 0
    # TODO: Get some batch_id here
    # TODO: Find a way to check total number of products in this batch and assign to `total_products_in_this_batch`
    for prod in batch.get_products():
        processed_this_batch = processed_this_batch + 1
        yield scrapy.Request(prod.get_scrape_url(), meta={'prod': prod, 'batch_id': batch_id, 'total_products_in_this_batch': total_products_in_this_batch, 'processed_this_batch': processed_this_batch})
And anywhere in the code, for any particular batch, check if processed_this_batch == total_products_in_this_batch and, if so, save the batch.
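As a rough sketch, one way to read that suggestion in parse() is shown below (the lookup of the batch by batch_id is an assumption about your model API; also note that because responses can arrive out of order, the request carrying the highest counter is not necessarily the last one parsed, which is part of what the adapted solution further down works around):
def parse(self, response):
    # ... extract and yield the product price here ...
    if response.meta['processed_this_batch'] == response.meta['total_products_in_this_batch']:
        batch = Batches.objects.get(pk=response.meta['batch_id'])  # hypothetical lookup by id
        batch.status = 'DONE'
        batch.finished_on = datetime.now()
        batch.save()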
For this kind of deal you can use the spider_closed signal, to which you can bind a function to run when the spider is done crawling.
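If marking everything DONE once the whole crawl finishes is acceptable (rather than per batch), a minimal sketch of hooking that signal could be the following (the handler name and the bulk update inside it are assumptions):
from scrapy import signals

class PriceSpider(scrapy.Spider):
    name = 'prices'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # run on_spider_closed once the spider is done crawling
        crawler.signals.connect(spider.on_spider_closed, signal=signals.spider_closed)
        return spider

    def on_spider_closed(self, spider):
        Batches.objects.filter(status='RUNNING').update(status='DONE', finished_on=datetime.now())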
I made some adaptations to @Umair's suggestion and came up with a solution that works great for my case:
class PriceSpider(scrapy.Spider):
    name = 'prices'

    def start_requests(self):
        for batch in Batches.objects.all():
            batch.started_on = datetime.now()
            batch.status = 'RUNNING'
            batch.save()
            products = batch.get_products()
            counter = {'curr': 0, 'total': len(products)}  # the counter dictionary
                                                           # for this batch
            for prod in products:
                yield scrapy.Request(prod.get_scrape_url(),
                                     meta={'prod': prod,
                                           'batch': batch,
                                           'counter': counter})
                # trick = add the counter in the meta dict

    def parse(self, response):
        # process the response as desired
        batch = response.meta['batch']
        counter = response.meta['counter']
        self.increment_counter(batch, counter)  # increment counter only after
                                                # the work is done

    def increment_counter(self, batch, counter):
        counter['curr'] += 1
        if counter['curr'] == counter['total']:
            batch.status = 'DONE'
            batch.finished_on = datetime.now()
            batch.save()  # GOOD!
                          # Well, almost...
This works fine as long as all the Requests yielded by start_requests have different URLs.
If there are any duplicates, Scrapy will filter them out and not call your parse method,
so you end up with counter['curr'] < counter['total'] and the batch status is left RUNNING forever.
As it turns out you can override scrapy's behaviour for duplicates.
First, we need to change settings.py to specify an alternative "duplicates filter" class:
DUPEFILTER_CLASS = 'myspiders.shopping.MyDupeFilter'
Then we create the MyDupeFilter class, which lets the spider know when there is a duplicate:
from scrapy.dupefilters import RFPDupeFilter

class MyDupeFilter(RFPDupeFilter):
    def log(self, request, spider):
        super(MyDupeFilter, self).log(request, spider)
        spider.look_a_dupe(request)
Then we modify our spider to make it increment our counter when a duplicate is found:
class PriceSpider(scrapy.Spider):
    name = 'prices'

    # ...

    def look_a_dupe(self, request):
        batch = request.meta['batch']
        counter = request.meta['counter']
        self.increment_counter(batch, counter)
And we are good to go.
This is my code. Two parser functions call the same AfterParserFinished(), which counts the number of invocations to determine when all parsers have finished:
countAccomplishedParsers: int = 0

def AfterParserFinished(self):
    self.countAccomplishedParsers = self.countAccomplishedParsers + 1
    print(self.countAccomplishedParsers)  # how many parsers have been accomplished
    if self.countAccomplishedParsers == 2:
        print("Accomplished: 2. Do something.")

def parse1(self, response):
    self.AfterParserFinished()
    pass

def parse2(self, response):
    self.AfterParserFinished()
    pass
Since I added another request at the end of the for loop to test a link, the spider only generates items for the first index of the loop.
def parse_product_page(self, response):
    products = response.xpath('//div[@class="content"]//div[@class="tov-rows"]//div[@class="t-row"]')
    for x, product in enumerate(products):  # ERROR: just gives an item for the first product
        product_loader = VerbraucherweltProdukt()
        product_loader['name'] = product.xpath(
            '//div[@class="t-center"]//div[@class="t-name"]/text()').extract_first()

        request = scrapy.Request(non_ref_link, callback=self.test_link, errback=self.test_link)
        request.meta['item'] = product_loader
        yield request
It all worked before when I just yielded the product item, but since the item is returned in the callback, I don't know where my problem lies.
The callback is just:
def test_link(self, response):
    item = response.meta['item']
    item['link_fehlerhaft'] = response.status
    yield item
Here is the full code as well, in case the problem lies somewhere else:
http://pastebin.com/tgL38zpD
Here's your culprit:
link = product.xpath('//div[@class="t-right"]//a/@href').extract_first()
You're not anchoring your relative XPath to the product node you have. To fix it, simply prepend . to your XPath to indicate the current node as root:
link = product.xpath('.//div[@class="t-right"]//a/@href').extract_first()
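As a small illustration (the div classes here are taken from the snippet above): inside a loop over product selectors, an XPath starting with // still searches the whole document, while .// searches only within the current product node.
for product in response.xpath('//div[@class="t-row"]'):
    # searches the entire page: every iteration returns the first matching link on the page
    wrong = product.xpath('//div[@class="t-right"]//a/@href').extract_first()
    # searches only inside this product's node, which is what you want here
    right = product.xpath('.//div[@class="t-right"]//a/@href').extract_first()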
def parse(self, response):
    for sel in response.xpath('//tbody/tr'):
        item = HeroItem()
        item['hclass'] = response.request.url.split("/")[8].split('-')[-1]
        item['server'] = response.request.url.split('/')[2].split('.')[0]
        item['hardcore'] = len(response.request.url.split("/")[8].split('-')) == 3
        item['seasonal'] = response.request.url.split("/")[6] == 'season'
        item['rank'] = sel.xpath('td[@class="cell-Rank"]/text()').extract()[0].strip()
        item['battle_tag'] = sel.xpath('td[@class="cell-BattleTag"]//a/text()').extract()[1].strip()
        item['grift'] = sel.xpath('td[@class="cell-RiftLevel"]/text()').extract()[0].strip()
        item['time'] = sel.xpath('td[@class="cell-RiftTime"]/text()').extract()[0].strip()
        item['date'] = sel.xpath('td[@class="cell-RiftTime"]/text()').extract()[0].strip()

        url = 'https://' + item['server'] + '.battle.net/' + sel.xpath('td[@class="cell-BattleTag"]//a/@href').extract()[0].strip()

        yield Request(url, callback=self.parse_profile)
def parse_profile(self, response):
    sel = Selector(response)
    item = HeroItem()
    item['weapon'] = sel.xpath('//li[@class="slot-mainHand"]/a[@class="slot-link"]/@href').extract()[0].split('/')[4]
    return item
Well, I'm scraping a whole table in the main parse method and I have taken several fields from that table. One of these fields is a URL and I want to explore it to get a whole new bunch of fields. How can I pass my already created item object to the callback function so the final item keeps all the fields?
As shown in the code above, I'm able to save the fields inside the URL (code at the moment) or only the ones in the table (simply write yield item),
but I can't yield just one object with all the fields together.
I have tried this, but obviously, it doesn't work.
yield Request(url, callback=self.parse_profile(item))

def parse_profile(self, response, item):
    sel = Selector(response)
    item['weapon'] = sel.xpath('//li[@class="slot-mainHand"]/a[@class="slot-link"]/@href').extract()[0].split('/')[4]
    return item
This is what you'd use the meta keyword for:
def parse(self, response):
    for sel in response.xpath('//tbody/tr'):
        item = HeroItem()
        # Item assignment here
        url = 'https://' + item['server'] + '.battle.net/' + sel.xpath('td[@class="cell-BattleTag"]//a/@href').extract()[0].strip()
        yield Request(url, callback=self.parse_profile, meta={'hero_item': item})

def parse_profile(self, response):
    item = response.meta.get('hero_item')
    item['weapon'] = response.xpath('//li[@class="slot-mainHand"]/a[@class="slot-link"]/@href').extract()[0].split('/')[4]
    yield item
Also note, doing sel = Selector(response) is a waste of resources and differs from what you did earlier, so I changed it. It's automatically mapped in the response as response.selector, which also has the convenience shortcut of response.xpath.
Here's a better way to pass args to the callback function:
def parse(self, response):
    request = scrapy.Request('http://www.example.com/index.html',
                             callback=self.parse_page2,
                             cb_kwargs=dict(main_url=response.url))
    request.cb_kwargs['foo'] = 'bar'  # add more arguments for the callback
    yield request

def parse_page2(self, response, main_url, foo):
    yield dict(
        main_url=main_url,
        other_url=response.url,
        foo=foo,
    )
source: https://docs.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-callback-arguments
I had a similar issue with Tkinter's extra argument passing, and found this solution to work (here: http://infohost.nmt.edu/tcc/help/pubs/tkinter/web/extra-args.html), converted to your problem:
def parse(self, response):
    item = HeroItem()
    [...]

    def handler(self=self, response=response, item=item):
        """ passing as default argument values """
        return self.parse_profile(response, item)

    yield Request(url, callback=handler)
@peduDev
Tried your approach, but something failed due to an unexpected keyword argument.
scrapy_req = scrapy.Request(url=url,
                            callback=self.parseDetailPage,
                            cb_kwargs=dict(participant_id=nParticipantId))

def parseDetailPage(self, response, participant_id):
    # .. some code here ..
    yield MyParseResult(
        # .. some code here ..
        participant_id=participant_id
    )
Error reported:
, cb_kwargs=dict(participant_id=nParticipantId)
TypeError: __init__() got an unexpected keyword argument 'cb_kwargs'
Any idea what caused the unexpected keyword argument, other than perhaps a too old Scrapy version?
Yep. I verified my own suggestion and after an upgrade it all worked as suspected.
sudo pip install --upgrade scrapy