Scrapy get pre-redirect url - python

I've a crawler running without troubles but i need to get the start_url and not the redirected one.
The problem is i'm using rules to pass parameters to the URL ( like field-keywords=xxxxx ) and finally get the correct url.
The parse function starts getting the item attributes without any troubles but when i want the start URL ( the true one ) it stores the redirected one ...
I've tryed:
response.url
response.request.meta.get('redirect_urls')
Both returns the final url ( the redirected one ) and not the start_url.
Some one know why, or has any clue ?
Thanks in advance.

use a Spider Middleware to keep track of the start url from every response:
from scrapy import Request
class StartRequestsMiddleware(object):
start_urls = {}
def process_start_requests(self, start_requests, spider):
for i, request in enumerate(start_requests):
request.meta.update(start_url=request.url)
yield request
def process_spider_output(self, response, result, spider):
for output in result:
if isinstance(output, Request):
output.meta.update(
start_url=response.meta['start_url'],
)
yield output
keep track of the start_url every response comes from with:
response.meta['start_url']

Have you tried response.request.url? I personally would override the start_requests method adding the original url in the meta, something like:
yield Request(url, meta={'original_request': url})
And then extract it using response.meta['original_request']

Related

Scrapy passing requests along

I previously used some code like this to visit a page and change the url around a bit to generate a second request which gets passed to a second parse method:
from scrapy.http import Request
def parse_final_page(self, response):
# do scraping here:
def get_next_page(self, response, new_url):
req = Request(
url=new_url,
callback=self.parse_final_page,
)
yield req
def parse(self, response):
if 'substring' in response.url:
new_url = 'some_new_url'
yield from self.get_next_page(response, new_url)
else:
pass
# continue..
# scraping items
# yield
This snippet is pretty old (2 years or so) and i'm currently using Scrapy 2.2, although i'm not sure if that's relevant. Note that get_next_page gets called, but parse_final_page never runs, which I don't get...
Why is parse_final_page not being called? Or more to the point.. is there an easier way for me to just generate a new request on the fly? I would prefer to not use a middleware or change start_urls in this context.
1 - "Why is parse_final_page not being called?"
Your script works fine for me on Scrapy v2.2.1, so its probably an issue with the specific request you're trying to make.
2 - "...is there an easier way for me to just generate a new request on the fly?"
You could try this variation where you return the request from the get_next_page callback, instead of yielding it (note I removed the from keyword and did not send the response object to the callback):
def parse(self, response):
if 'substring' in response.url:
new_url = ''
yield self.get_next_page(new_url)
else:
# continue..
# scraping items
# yield
def get_next_page(self, new_url):
req = Request(
url=new_url,
callback=self.parse_final_page,
)
return req
def parse_final_page(self, response):
# do scraping here:

Scrapy/Python yield and continue processing possible?

I am trying this sample code
from scrapy.spiders import Spider, Request
import scrapy
class MySpider(Spider):
name = 'toscrapecom'
start_urls = ['http://books.toscrape.com/catalogue/page-1.html']
urls = (
'http://books.toscrape.com/catalogue/page-{}.html'.format(i + 1) for i in range(50)
)
def parse(self, response):
for url in self.urls:
return Request(url)
It crawls all the pages fine. However if I yield an item before the for loop then it crawls only the first page. (as shown below)
from scrapy.spiders import Spider, Request
import scrapy
class MySpider(Spider):
name = 'toscrapecom'
start_urls = ['http://books.toscrape.com/catalogue/page-1.html']
urls = (
'http://books.toscrape.com/catalogue/page-{}.html'.format(i + 1) for i in range(50)
)
def parse(self, response):
yield scrapy.item.Item()
for url in self.urls:
return Request(url)
But I can use yield Request(url) instead of return... and it scrapes the pages backwards from last page to first.
I would like to understand why return does not work anymore once an item is yielded? Can somebody explain this in a simple way?
You ask why the second code does not work, but I don’t think you fully understand why the first code works :)
The for loop of your first code only loops once.
What is happening is:
self.parse() is called for the URL in self.start_urls.
self.parse() gets the first (and only the first!) URL from self.urls, and returns it, exiting self.parse().
When a response for that first URL arrives, self.parse() gets called again, and this time it returns a request (only 1 request!) for the second URL from self.urls, because the previous call to self.parse() already consumed the first URL from it (self.urls is an iterator).
The last step repeats in a loop, but it is not the for loop that does it.
You can change your original code to this and it will work the same way:
def parse(self, response):
try:
return next(self.urls)
except StopIteration:
pass
Because to call items/requests it should be generator function.
You even cannot use yield and return in the same function with the same "meaning", it will raise SyntaxError: 'return' with argument inside generator.
The return is (almost) equivalent to raising StopIteration. In this topic Return and yield in the same function you can find very detailed explanation, with links specification.

Scrapy Start_request parse

I am writing a scrapy script to search and scrape result from a website. I need to search items from website and parse each url from the search results. I started with Scrapy's start_requests where i'd pass the search query and redirect to another function parse which will retrieve the urls from the search result. Finally i called another function parse_item to parse the results. I'm able to extract the all the search results url but i'm not being able to parse the results ( parse_item is not working). Here is the code:
# -*- coding: utf-8 -*-
from scrapy.http.request import Request
from scrapy.spider import BaseSpider
class xyzspider(BaseSpider):
name = 'dspider'
allowed_domains = ["www.example.com"]
mylist = ['Search item 1','Search item 2']
url = 'https://example.com/search?q='
def start_requests(self):
for i in self.mylist:
i = i.replace(' ','+')
starturl = self.url+ i
yield Request(starturl,self.parse)
def parse(self,response):
itemurl = response.xpath(".//section[contains(#class, 'search-results')]/a/#href").extract()
for j in itemurl:
print j
yield Request(j,self.parse_item)
def parse_item(self,response):
print "hello"
'''rating = response.xpath(".//ul(#class = 'ratings')/li[1]/span[1]/text()").extract()
print rating'''
Could anyone please help me. Thank you.
I was getting a Filtered offsite request error. I changed the allowed domain from allowed_domains = www.xyz.com to
xyz.com and it worked perfectly.
Your code looks good. So you might need to use the Request attribute dont_filter set to True:
yield Request(j,self.parse_item, dont_filter=True)
From the docs:
dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Default to False.
Anyway I recommend you to have a look at the item Pipelines.
Those are used to process scraped items using the command:
yield my_object
Item pipelines are used to post-process everything yielded by the spider.

How to yield url with orders to let scrapy crawl

here is my code:
def parse(self, response):
selector = Selector(response)
sites = selector.xpath("//h3[#class='r']/a/#href")
for index, site in enumerate(sites):
url = result.group(1)
print url
yield Request(url = site.extract(),callback = self.parsedetail)
def parsedetail(self,response):
print response.url
...
obj = Store.objects.filter(id=store_obj.id,add__isnull=True)
if obj:
obj.update(add=add)
in def parse
scarpy will get urls from google
the url output like:
www.test.com
www.hahaha.com
www.apple.com
www.rest.com
but when it yield to def parsedetail
the url is not with order maybe it will become :
www.rest.com
www.test.com
www.hahaha.com
www.apple.com
is there any way I can let the yield url with order to send to def parsedetail ?
Because I need to crawl www.test.com first.(the data the top url provide in google search is more correctly)
if there is no data in it.
I will go next url until update the empty field .(www.hahaha.com ,www.apple.com,www.rest.com)
Please guide me thank you!
By default, the order which the Scrapy requests are scheduled and sent is not defined. But, you can control it using priority keyword argument:
priority (int) – the priority of this request (defaults to 0). The
priority is used by the scheduler to define the order used to process
requests. Requests with a higher priority value will execute earlier.
Negative values are allowed in order to indicate relatively
low-priority.
You can also make the crawling synchronous by passing around the callstack inside the meta dictionary, as an example see this answer.

python scrapy parse() function, where is the return value returned to?

I am new on Scrapy, and I am sorry if this question is trivial. I have read the document on Scrapy from official webpage. And while I look through the document, I met this example:
import scrapy
from myproject.items import MyItem
class MySpider(scrapy.Spider):
name = ’example.com’
allowed_domains = [’example.com’]
start_urls = [
’http://www.example.com/1.html’,
’http://www.example.com/2.html’,
’http://www.example.com/3.html’,
]
def parse(self, response):
for h3 in response.xpath(’//h3’).extract():
yield MyItem(title=h3)
for url in response.xpath(’//a/#href’).extract():
yield scrapy.Request(url, callback=self.parse)
I know, the parse method must return an item or/and request, but where are these return values returned to?
One is an item and the other is request, I think these two type would be handled differently and in the case of CrawlSpider, it has Rule with callback. What about this callback's return value? where to ? same as parse()?
I am very confused on Scrapy procedure, even i read the document....
According to the documentation:
The parse() method is in charge of processing the response and
returning scraped data (as Item objects) and more URLs to follow (as
Request objects).
In other words, returned/yielded items and requests are handled differently, items are being handed to the item pipelines and item exporters, but requests are being put into the Scheduler which pipes the requests to the Downloader for making a request and returning a response. Then, the engine receives the response and gives it to the spider for processing (to the callback method).
The whole data-flow process is described in the Architecture Overview page in a very detailed manner.
Hope that helps.

Categories

Resources