Here is my code:
spider.py
def parse(self, response):
    item = someItem()
    cuv = Vitae()
    item['cuv'] = cuv
    request = scrapy.Request(url, callback=self.cvsearch)
    request.meta['item'] = item
    yield request

def cvsearch(self, response):
    item = response.meta['item']
    cv = item['cuv']
    cv['link'] = response.url
    return item
items.py
class someItem(Item):
    cuv = Field()

class Vitae(Item):
    link = Field()
No errors are displayed!
It adds the object "cuv" to "item", but the attributes of "cuv" are never set. What am I missing here?
Why do you use a scrapy.Item inside another one?
Try using a simple Python dict for item['cuv'], and try moving request.meta into the scrapy.Request constructor as the meta argument.
You should also use yield instead of return:
def parse(self, response):
    item = someItem()
    request = scrapy.Request(url, meta={'item': item}, callback=self.cvsearch)
    yield request

def cvsearch(self, response):
    item = response.meta['item']
    item['cuv'] = {'link': response.url}
    yield item
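If you keep using Item classes, items.py then only needs the outer item, since a Field can hold any Python object, including a plain dict. A minimal sketch of that items.py (assuming you keep the someItem name):

from scrapy import Item, Field

class someItem(Item):
    # cuv holds a plain dict such as {'link': ...}; no nested Vitae item needed
    cuv = Field()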
I am not a very good explainer, but I'll try to explain what's wrong as best I can.
Scrapy is asynchronous, meaning there is no guaranteed order in which requests are executed. Let's take a look at this piece of code:
def parse(self, response):
    item = someItem()
    cuv = {}
    item['cuv'] = cuv
    request = scrapy.Request(url, callback=self.cvsearch)
    request.meta['item'] = item
    yield request
    logging.error(item['cuv'])  # this will log an empty dict [1]

def cvsearch(self, response):
    item = response.meta['item']
    cv = item['cuv']
    cv['link'] = response.url
    return item
[1] This is because that line executes before cvsearch has run, and you can't control when that happens. To build an item across multiple requests you have to chain (cascade) the requests, passing the item along each time:
def parse(self, response):
    item = someItem()
    request = scrapy.Request(url, callback=self.cvsearch)
    request.meta['item'] = item
    yield request

def cvsearch(self, response):
    item = response.meta['item']
    request = scrapy.Request(url, callback=self.another)
    request.meta['item'] = item
    yield request

def another(self, response):
    item = response.meta['item']
    yield item
To fully grasp this concept I advise taking a look at asynchronous programming. Please add anything that I missed!
Related
I am using Scrapy to crawl a site.
I have code similar to this:
class mySpider(scrapy.Spider):

    def start_requests(self):
        yield SplashRequest(url=example_url,
                            callback=self.parse,
                            cookies={'store_language': 'en'},
                            endpoint='render.html',
                            args={'wait': 5},
                            )

    def parse(self, response):
        try:
            self.extract_data_from_page(response)
            if self.next_link_still_on_page(response):
                next_url = self.grok_next_url(response)
                yield SplashRequest(url=next_url,
                                    callback=self.parse,
                                    cookies={'store_language': 'en'},
                                    endpoint='render.html',
                                    args={'wait': 5},
                                    )
        except Exception:
            pass

    def extract_data_from_page(self, response):
        pass

    def next_link_still_on_page(self, response):
        pass

    def grok_next_url(self, response):
        pass
In the parse() method, the callback is parse() itself. Is this to be frowned upon (e.g. could a logic bug cause a potential stack overflow)?
You can use the same callback. From a technical perspective it isn't an issue: the yielded request is only scheduled, and the callback is not invoked recursively, so there is no stack growth. Especially if the yielded request is of the same nature as the current one, it should ideally reuse the same logic.
However, from a person-who-has-to-read-the-source-code perspective, it is better to have separate parsers for different tasks or page types (the whole single responsibility principle).
Let me illustrate with an example. Let's say you have a listing website (jobs, products, whatever) and you have two main classes of URLs:
Search result pages: .../search?q=something&page=2
Item pages: .../item/abc
The search result page contains pagination links and items. Such a page would spawn two kinds of requests to:
Parse the next page
Parse the item
The Item page will not spawn another request.
So now you can stuff all of that into the same parser and use it for every request:
def parse(self, response):
    if 'search' in response.url:
        for item in response.css('.item'):
            # ...
            yield Request(item_url, callback=self.parse)
        # ...
        yield Request(next_url, callback=self.parse)
    if 'item' in response.url:
        yield {
            'title': response.css('...'),
            # ...
        }
That is obviously a very condensed example, but as it grows it will become harder to follow along.
Instead, break up the different page parsers:
def parse_search(self, response):
    for item in response.css('.items'):
        yield Request(item_url, callback=self.parse_item)
    next_url = response.css('...').get()
    yield Request(next_url, callback=self.parse_search)

def parse_item(self, response):
    yield {
        'title': response.css('...'),
        # ...
    }
So basically, if it's a matter of "another of the same kind of page" then it's normal to use the same callback in order to reuse the same logic. If the next request requires a different kind of parsing, rather make a separate parser.
I'm struggling with Scrapy and I don't understand how exactly passing items between callbacks works. Maybe somebody could help me.
I'm looking into http://doc.scrapy.org/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item
I'm trying to understand the flow of actions there, step by step:
[parse_page1]
1. item = MyItem() <- the item object is created
2. item['main_url'] = response.url <- we assign a value to main_url of the item object
3. request = scrapy.Request("http://www.example.com/some_page.html", callback=self.parse_page2) <- we request a new page and launch parse_page2 to scrape it

[parse_page2]
4. item = response.meta['item'] <- I don't understand this part. Are we creating a new item object here, or is this the item object created in [parse_page1]? And what does response.meta['item'] mean? In 3 we passed the request only information like the link and the callback; we didn't add any additional arguments we could refer to...
5. item['other_url'] = response.url <- we assign a value to other_url of the item object
6. return item <- we return the item object as the result of the request

[parse_page1]
7. request.meta['item'] = item <- We assign the item object to the request? But the request is finished, the callback already returned the item in 6????
8. return request <- we get the result of the request, so the item from 6, am I right?
I went through all the documentation concerning scrapy and request/response/meta, but I still don't understand what is happening here in points 4 and 7.
line 4: request = scrapy.Request("http://www.example.com/some_page.html",
                                 callback=self.parse_page2)
line 5: request.meta['item'] = item
line 6: return request
You are confused about the previous code; let me explain it (I numbered the lines above so I can refer to them here):
In line 4 you are instantiating a scrapy.Request object. This doesn't work like other requests libraries: here you are not calling the url, and not going to the callback function just yet.
In line 5 you are adding arguments to the scrapy.Request object, so for example you could also have declared the scrapy.Request object like:
request = scrapy.Request("http://www.example.com/some_page.html",
                         callback=self.parse_page2, meta={'item': item})
and you could have avoided line 5.
It is in line 6, when you return the scrapy.Request object, that scrapy makes it work: it calls the specified url, goes to the given callback, and passes meta along with it. You could also have avoided lines 5 and 6 by returning the request directly, like this:
return scrapy.Request("http://www.example.com/some_page.html",
                      callback=self.parse_page2, meta={'item': item})
So the idea here is that your callback methods should return (preferably yield) a Request or an Item; scrapy will output the Item and continue crawling the Request.
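As a rough illustration of that idea (the spider name, urls, and dict item below are placeholders, not from the question), a spider whose callbacks yield both kinds of objects could look like this:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://www.example.com/"]

    def parse(self, response):
        # yielding an item: scrapy outputs it (pipelines / feed export)
        yield {"page_url": response.url}

        # yielding a request: scrapy schedules it and calls parse_page2 later,
        # carrying the item along in meta
        item = {"main_url": response.url}
        yield scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2,
                             meta={"item": item})

    def parse_page2(self, response):
        item = response.meta["item"]
        item["other_url"] = response.url
        yield item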
@eLRuLL's answer is wonderful. I want to add the part about how the item travels. First, we should be clear that a callback function only runs once the response of its request has been downloaded.
In the code the scrapy docs give, the url and request of page1 are not declared. Let's set the url of page1 to "http://www.example.com.html".
[parse_page1] is the callback of
scrapy.Request("http://www.example.com.html", callback=self.parse_page1)
[parse_page2] is the callback of
scrapy.Request("http://www.example.com/some_page.html", callback=self.parse_page2)
When the response of page1 is downloaded, parse_page1 is called to generate the request for page2:
item['main_url'] = response.url  # store "http://www.example.com.html" in the item
request = scrapy.Request("http://www.example.com/some_page.html",
                         callback=self.parse_page2)
request.meta['item'] = item  # store the item in request.meta
After the response of page2 is downloaded, parse_page2 is called to return an item:
item = response.meta['item']  # response.meta is equal to request.meta, so here item['main_url'] == "http://www.example.com.html"
item['other_url'] = response.url  # response.url == "http://www.example.com/some_page.html"
return item  # finally, we get the item recording the urls of page1 and page2
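Putting the whole flow together, here is a minimal sketch of a complete spider under the same assumptions (the page1 url is the one invented above, and MyItem is assumed to define main_url and other_url fields):

import scrapy


class MyItem(scrapy.Item):
    main_url = scrapy.Field()
    other_url = scrapy.Field()


class PageSpider(scrapy.Spider):
    name = "pages"

    def start_requests(self):
        # page1 url as assumed above
        yield scrapy.Request("http://www.example.com.html",
                             callback=self.parse_page1)

    def parse_page1(self, response):
        # runs once the response of page1 has been downloaded
        item = MyItem()
        item['main_url'] = response.url
        request = scrapy.Request("http://www.example.com/some_page.html",
                                 callback=self.parse_page2)
        request.meta['item'] = item  # the item travels with the request
        return request

    def parse_page2(self, response):
        # runs once the response of page2 has been downloaded;
        # the same item object comes back in response.meta
        item = response.meta['item']
        item['other_url'] = response.url
        return item  # now records the urls of both page1 and page2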
I'd like to do something special for each of the landing urls in start_urls, and then have the spider follow all the next pages and crawl deeper. So my code is roughly like this:
def init_parse(self, response):
    item = MyItem()
    # extract info from the landing url and populate item fields here...
    yield self.newly_parse(response)
    yield item
    return

parse_start_url = init_parse

def newly_parse(self, response):
    item = MyItem2()
    newly_soup = BeautifulSoup(response.body)
    # parse, return or yield items
    return item
The code won't work because a spider callback is only allowed to return an item, a Request, or None, but I yield self.newly_parse(response). How can I achieve this in scrapy?
My not so elegant solution:
put the init_parse logic inside newly_parse and implement an is_start_url check at the beginning; if response.url is inside start_urls, go through the init_parse procedure.
Another ugly solution:
Separate out the code where # parse, return or yield items happens and make it a class method or generator, and call this method or generator both inside init_parse and newly_parse.
If you're going to yield multiple items from newly_parse, your line under init_parse should be:
for item in self.newly_parse(response):
    yield item
since self.newly_parse(response) returns a generator, which you need to iterate through yourself because scrapy won't recognize it as an item or request.
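A rough sketch of how init_parse could look with that change (this assumes the same MyItem/MyItem2 items and BeautifulSoup import as in the question, and that newly_parse is itself turned into a generator that yields items):

def init_parse(self, response):
    item = MyItem()
    # extract info from the landing url and populate item fields here...
    for new_item in self.newly_parse(response):
        yield new_item  # re-yield every item produced by newly_parse
    yield item

parse_start_url = init_parse

def newly_parse(self, response):
    newly_soup = BeautifulSoup(response.body)
    # parse the soup and yield one or more items
    yield MyItem2()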
Similar to the person here: Scrapy Not Returning Additonal Info from Scraped Link in Item via Request Callback, I am having difficulty accessing the list of items I build in my callback function. I have tried building the list both in the parse function (which doesn't work because the callback hasn't returned yet) and in the callback, but neither has worked for me.
I am trying to return all the items that I build from these requests. Where do I call "return items" such that the items have been fully processed? I am trying to replicate the tutorial (http://doc.scrapy.org/en/latest/intro/tutorial.html#using-our-item)
Thanks!!
The relevant code is below:
class ASpider(Spider):
    items = []
    ...

    def parse(self, response):
        input_file = csv.DictReader(open("AFile.txt"))
        x = 0
        for row in input_file:
            yield Request("ARequest",
                          cookies={"arg1": "1", "arg2": row["arg2"], "style": "default", "arg3": row["arg3"]},
                          callback=self.aftersubmit, dont_filter=True)

    def aftersubmit(self, response):
        hxs = Selector(response)
        # Create new object..
        item = AnItem()
        item['Name'] = "jsc"
        return item
You need to return (or yield) an item from the aftersubmit callback method. Quote from the docs:
In the callback function, you parse the response (web page) and return
either Item objects, Request objects, or an iterable of both.
def aftersubmit(self, response):
    hxs = Selector(response)

    item = AnItem()
    item['Name'] = "jsc"
    return item
Note that this particular Item instance doesn't make sense since you haven't really put anything from the response into its fields.
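For illustration only, here is a hypothetical version that actually pulls something out of the response (the .name::text selector is invented; replace it with whatever the real page contains):

def aftersubmit(self, response):
    item = AnItem()
    # hypothetical selector -- adjust to the actual page structure
    item['Name'] = response.css('.name::text').get()
    yield item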
I have this rule:
Rule(SgmlLinkExtractor(allow=('http://.*/category/.*/.*/.*',))),
Rule(SgmlLinkExtractor(allow=('http://.*/product/.*',)), cb_kwargs={'crumbs': response.url}, callback='parse_item'),
I want to pass the first response to the function (parse_item), but the problem is that this line of code gives me the error "response is not defined".
How do I access the response of the last rule?
You can access the Response object only in the callback; try this:
Rule(SgmlLinkExtractor(allow=r'http://.*/category/.*/.*/.*'), callback='parse_cat', follow=True),
Rule(SgmlLinkExtractor(allow=r'http://.*/product/.*'), callback='parse_prod'),

def parse_cat(self, response):
    crumbs = response.url
    return self.parse_item(response, crumbs)

def parse_prod(self, response):
    crumbs = response.url
    return self.parse_item(response, crumbs)

def parse_item(self, response, crumbs):
    ...
If you want to access the category url (referer url) through which you came to the product, inside parse_item, you can access it by:
response.request.headers.get('Referer')
via: nyov on #scrapy irc
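A minimal sketch of the referer-based alternative inside parse_item (here parse_item takes no crumbs argument, since the referer comes from the request headers instead; note the header value is bytes, so the decode step is an assumption about how you want to use it):

def parse_item(self, response):
    # the category page that linked to this product, from the Referer header
    referer = response.request.headers.get('Referer')
    crumbs = referer.decode('utf-8') if referer else None
    yield {
        'product_url': response.url,
        'crumbs': crumbs,
    }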