Scrapy getting response url inside CrawlSpider rule - python

I have this rule:
Rule(SgmlLinkExtractor(allow=('http://.*/category/.*/.*/.*',))),
Rule(SgmlLinkExtractor(allow=('http://.*/product/.*', )),cb_kwargs={'crumbs':response.url},callback='parse_item'),
I want to pass the first response to the callback (parse_item), but the problem is that this line of code gives me the error "response is not defined".
How do I access the response of the last rule?

You can access the Response object only in the callback; try this:

rules = (
    Rule(SgmlLinkExtractor(allow=r'http://.*/category/.*/.*/.*'), callback='parse_cat', follow=True),
    Rule(SgmlLinkExtractor(allow=r'http://.*/product/.*'), callback='parse_prod'),
)

def parse_cat(self, response):
    crumbs = response.url
    return self.parse_item(response, crumbs)

def parse_prod(self, response):
    crumbs = response.url
    return self.parse_item(response, crumbs)

def parse_item(self, response, crumbs):
    ...

If you want to access, inside parse_item, the category URL (the referer URL) through which you came to the product, you can get it with:
response.request.headers.get('Referer')
via: nyov on #scrapy irc
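For illustration, a minimal parse_item that records both the product URL and the category (referer) URL it came from; the item field names here are just placeholders:

def parse_item(self, response, crumbs):
    # the Referer header is set by Scrapy's RefererMiddleware and comes back as bytes
    referer = response.request.headers.get('Referer', b'').decode()
    yield {'product_url': response.url, 'category_url': referer}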

Related

Getting this error when scraping JSON with scrapy: Spider must return request, item, or None, got 'str'

I am trying to get a json field with key "longName" with scrapy but I am receiving the error: "Spider must return request, item, or None, got 'str'".
The JSON I'm trying to scrape looks something like this:
{
    "id": 5355,
    "code": 9594
}
This is my code:
import scrapy
import json

class NotesSpider(scrapy.Spider):
    name = 'notes'
    allowed_domains = ['blahblahblah.com']
    start_urls = ['https://blahblahblah.com/api/123']

    def parse(self, response):
        data = json.loads(response.body)
        yield from data['longName']
I get the above error when I run "scrapy crawl notes" at the prompt. Can anyone point me in the right direction?
If you only want longName, modifying your parse method like this should do the trick:

def parse(self, response):
    data = json.loads(response.body)
    yield {"longName": data["longName"]}

Scrapy passing requests along

I previously used some code like this to visit a page and change the url around a bit to generate a second request which gets passed to a second parse method:
from scrapy.http import Request

def parse_final_page(self, response):
    # do scraping here:
    pass

def get_next_page(self, response, new_url):
    req = Request(
        url=new_url,
        callback=self.parse_final_page,
    )
    yield req

def parse(self, response):
    if 'substring' in response.url:
        new_url = 'some_new_url'
        yield from self.get_next_page(response, new_url)
    else:
        pass
        # continue..
        # scraping items
        # yield
This snippet is pretty old (2 years or so) and I'm currently using Scrapy 2.2, although I'm not sure if that's relevant. Note that get_next_page gets called, but parse_final_page never runs, which I don't get...
Why is parse_final_page not being called? Or more to the point: is there an easier way for me to just generate a new request on the fly? I would prefer not to use a middleware or change start_urls in this context.
1 - "Why is parse_final_page not being called?"
Your script works fine for me on Scrapy v2.2.1, so it's probably an issue with the specific request you're trying to make.
2 - "...is there an easier way for me to just generate a new request on the fly?"
You could try this variation, where you return the request from the get_next_page method instead of yielding it (note I removed the from keyword and did not send the response object to the method):
def parse(self, response):
    if 'substring' in response.url:
        new_url = ''
        yield self.get_next_page(new_url)
    else:
        pass
        # continue..
        # scraping items
        # yield

def get_next_page(self, new_url):
    req = Request(
        url=new_url,
        callback=self.parse_final_page,
    )
    return req

def parse_final_page(self, response):
    # do scraping here:
    pass
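Another lightweight way to generate a request on the fly (available since Scrapy 1.4, so covered by the 2.2 version mentioned above) is response.follow, which also resolves relative URLs against the current page. A minimal sketch, keeping the placeholder URL from the original snippet:

def parse(self, response):
    if 'substring' in response.url:
        # 'some_new_url' is the same placeholder as in the original snippet
        yield response.follow('some_new_url', callback=self.parse_final_page)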

Having problems with a scrapy-splash script. I only get one result and my scraper does not parse other pages

I am trying to parse a list from a javascript website. When I run it, it only gives me back one entry on each column and then the spider shuts down. I have already set up my middleware settings. I am not sure what is going wrong. Thanks in advance!
import scrapy
from scrapy_splash import SplashRequest

class MalrusSpider(scrapy.Spider):
    name = 'malrus'
    allowed_domains = ['backgroundscreeninginrussia.com']
    start_urls = ['http://www.backgroundscreeninginrussia.com/publications/new-citizens-of-malta-since-january-2015-till-december-2017/']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url,
                                callback=self.parse,
                                endpoint='render.html')

    def parse(self, response):
        russians = response.xpath('//table[@id="tablepress-8"]')
        for russian in russians:
            yield {'name': russian.xpath('//*[@class="column-1"]/text()').extract_first(),
                   'source': russian.xpath('//*[@class="column-2"]/text()').extract_first()}

        script = """function main(splash)
            assert(splash:go(splash.args.url))
            splash:wait(0.3)
            button = splash:select("a[class=paginate_button next] a")
            splash:set_viewport_full()
            splash:wait(0.1)
            button:mouse_click()
            splash:wait(1)
            return {url = splash:url(),
                    html = splash:html()}
        end"""

        yield SplashRequest(url=response.url,
                            callback=self.parse,
                            endpoint='execute',
                            args={'lua_source': script})
The .extract_first() (now .get()) you used will always return the first result. It's not an iterator, so there is no point in calling it several times. You should try the .getall() method. That will be something like:

names = response.xpath('//table[@id="tablepress-8"]').xpath('//*[@class="column-1"]/text()').getall()
sources = response.xpath('//table[@id="tablepress-8"]').xpath('//*[@class="column-2"]/text()').getall()
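A minimal sketch of how the parse callback could pair the two column lists into items (assuming the two columns stay aligned row by row):

def parse(self, response):
    table = response.xpath('//table[@id="tablepress-8"]')
    names = table.xpath('.//*[@class="column-1"]/text()').getall()
    sources = table.xpath('.//*[@class="column-2"]/text()').getall()
    for name, source in zip(names, sources):
        yield {'name': name, 'source': source}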

Scrapy get pre-redirect url

I've a crawler running without trouble, but I need to get the start_url and not the redirected one.
The problem is that I'm using rules to pass parameters to the URL (like field-keywords=xxxxx) to finally get the correct url.
The parse function starts getting the item attributes without any trouble, but when I want the start URL (the true one) it stores the redirected one...
I've tried:
response.url
response.request.meta.get('redirect_urls')
Both return the final URL (the redirected one) and not the start_url.
Does anyone know why, or have any clue?
Thanks in advance.
Use a Spider Middleware to keep track of the start URL from every response:
from scrapy import Request

class StartRequestsMiddleware(object):
    start_urls = {}

    def process_start_requests(self, start_requests, spider):
        for i, request in enumerate(start_requests):
            request.meta.update(start_url=request.url)
            yield request

    def process_spider_output(self, response, result, spider):
        for output in result:
            if isinstance(output, Request):
                output.meta.update(
                    start_url=response.meta['start_url'],
                )
            yield output
Then you can read the start_url every response comes from with:
response.meta['start_url']
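For the middleware to run it also has to be enabled in settings.py; the module path below is just a placeholder for wherever you put the class, and the priority number is arbitrary:

SPIDER_MIDDLEWARES = {
    'myproject.middlewares.StartRequestsMiddleware': 543,
}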
Have you tried response.request.url? I personally would override the start_requests method, adding the original URL to the meta, something like:
yield Request(url, meta={'original_request': url})
And then extract it using response.meta['original_request']
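A minimal sketch of that override, using the 'original_request' meta key suggested above:

from scrapy import Request

def start_requests(self):
    for url in self.start_urls:
        yield Request(url, meta={'original_request': url})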

Scrapy not filling object on request

Here is my code
spider.py
def parse(self, response):
    item = someItem()
    cuv = Vitae()
    item['cuv'] = cuv
    request = scrapy.Request(url, callback=self.cvsearch)
    request.meta['item'] = item
    yield request

def cvsearch(self, response):
    item = response.meta['item']
    cv = item['cuv']
    cv['link'] = response.url
    return item
items.py
from scrapy import Item, Field

class someItem(Item):
    cuv = Field()

class Vitae(Item):
    link = Field()
No errors are displayed!
It adds the object "cuv" to "item", but attributes of "cuv" are never added. What am I missing here?
Why do you use a scrapy.Item inside another one?
Try using a simple Python dict inside your item['cuv'], and try to move request.meta into the scrapy.Request constructor argument.
And you should use yield instead of return:
def parse(self, response):
    item = someItem()
    request = scrapy.Request(url, meta={'item': item}, callback=self.cvsearch)
    yield request

def cvsearch(self, response):
    item = response.meta['item']
    item['cuv'] = {'link': response.url}
    yield item
I am not a very good explainer, but I'll try to explain what's wrong as best I can.
Scrapy is asynchronous, meaning there is no guaranteed order in which requests are executed. Let's take a look at this piece of code:
def parse(self, response):
    item = someItem()
    cuv = {}
    item['cuv'] = cuv
    request = scrapy.Request(url, callback=self.cvsearch)
    request.meta['item'] = item
    yield request
    logging.error(item['cuv'])  # this will still be empty [1]

def cvsearch(self, response):
    item = response.meta['item']
    cv = item['cuv']
    cv['link'] = response.url
    return item
[1] This is because this line will execute before cvsearch is done, which you can't control. To solve this when you need multiple requests, you have to cascade the callbacks:
def parse(self, response):
    item = someItem()
    request = scrapy.Request(url, callback=self.cvsearch)
    request.meta['item'] = item
    yield request

def cvsearch(self, response):
    item = response.meta['item']
    request = scrapy.Request(url, callback=self.another)
    request.meta['item'] = item
    yield request

def another(self, response):
    item = response.meta['item']
    yield item
To fully grasp this concept I advise taking a look at multithreading. Please add anything that I missed!
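As a further option, Scrapy 1.7+ supports cb_kwargs, which passes values to the callback as keyword arguments instead of going through meta. A minimal sketch along the lines of the code above:

def parse(self, response):
    item = someItem()
    yield scrapy.Request(url, callback=self.cvsearch, cb_kwargs={'item': item})

def cvsearch(self, response, item):
    item['cuv'] = {'link': response.url}
    yield item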
