scrapy how to repeat a duplicate request - python

When I send a request to scrape the API, sometimes it doesn't load properly and returns -1 instead of the price.
So I put a while loop around it to repeat the request as long as I get -1, but the spider stops after the first request because the duplicate request gets filtered out.
So my question is: how can I make it process duplicate requests?
example code:
is_checked = False
while not is_checked:
    response = yield scrapy.Request("https://api.bookscouter.com/v3/prices/sell/" + isbn + ".json")
    jsonresponse = loads(response.body)
    sellPrice = jsonresponse['data']['Prices'][0]['Price']
    if sellPrice != -1:
        is_checked = True
        yield {'SellPrice': sellPrice}
Bear in mind that I use the inline-requests library, but it is not relevant to the solution.

To force scheduling of a duplicate request, set dont_filter=True in the Request constructor. In your example above, change
response = yield scrapy.Request("https://api.bookscouter.com/v3/prices/sell/"+isbn+".json")
to
response = yield scrapy.Request("https://api.bookscouter.com/v3/prices/sell/"+isbn+".json", dont_filter=True)
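For completeness, a minimal sketch of the corrected loop. It assumes the inline-requests decorator the asker mentions, and the retry cap is an addition (not part of the question) to avoid looping forever if the API never returns a price:

from json import loads

import scrapy
from inline_requests import inline_requests


class BookPriceSpider(scrapy.Spider):
    name = 'bookprice'  # placeholder spider name

    @inline_requests
    def parse(self, response):
        isbn = response.meta['isbn']  # assumed to come from an earlier callback
        sell_price, attempts = -1, 0
        while sell_price == -1 and attempts < 5:  # the cap of 5 retries is an assumption
            api_response = yield scrapy.Request(
                "https://api.bookscouter.com/v3/prices/sell/" + isbn + ".json",
                dont_filter=True)  # lets the duplicate filter pass the repeated URL
            sell_price = loads(api_response.body)['data']['Prices'][0]['Price']
            attempts += 1
        if sell_price != -1:
            yield {'SellPrice': sell_price}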

Related

Scrapy - Filtered duplicate request

I'm working with Scrapy. I want to loop through a db table and grab the starting page for each scrape (random_form_page), then yield a request for each start page. Please note that I am hitting an API to get a proxy with the initial request. I want to set up each request to have its own proxy, so using the callback model I have:
def start_requests(self):
    for x in xrange(8):
        random_form_page = session.query(....

        PR = Request(
            'htp://my-api',
            headers=self.headers,
            meta={'newrequest': Request(random_form_page, headers=self.headers)},
            callback=self.parse_PR
        )
        yield PR
I notice:
[scrapy] DEBUG: Filtered duplicate request: <GET 'htp://my-api'> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
In my code I can see that although it loops through 8 times, it only yields a request for the first page. The others, I assume, are being filtered out. I've looked at http://doc.scrapy.org/en/latest/topics/settings.html#dupefilter-class but am still unsure how to turn off this filtering action. How can I turn off the filtering?
Use dont_filter=True in the Request object:
def start_requests(self):
    for x in xrange(8):
        random_form_page = session.query(....

        PR = Request(
            'htp://my-api',
            headers=self.headers,
            meta={'newrequest': Request(random_form_page, headers=self.headers)},
            callback=self.parse_PR,
            dont_filter=True
        )
        yield PR
As you are accessing an API, you most probably want to disable the duplicate filter altogether:
# settings.py
DUPEFILTER_CLASS = 'scrapy.dupefilters.BaseDupeFilter'
This way you don't have to clutter all your Request creation code with dont_filter=True.
One word of caution, though (thanks to Brick Yang's comment): if your spider crawls a website by discovering, extracting and following links, you should not do this, as the spider will likely pick up the same links multiple times and recrawl them over and over again, resulting in an endless crawling loop.
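If only your API spider should skip filtering while link-following spiders keep the default behaviour, a per-spider override via custom_settings is an option; a minimal sketch (the spider name is a placeholder):

import scrapy


class ApiSpider(scrapy.Spider):
    name = 'api_spider'  # placeholder
    custom_settings = {
        # Disable duplicate filtering for this spider only
        'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
    }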

How to detect HTTP response status code and set a proxy accordingly in scrapy?

Is there a way to set a new proxy ip (e.g.: from a pool) according to the HTTP response status code?
For example, start with an IP from an IP list until it gets a 503 response (or another HTTP error code), then use the next one until that gets blocked, and so on, something like:
if http_status_code in [403, 503, ..., n]:
    proxy_ip = 'new ip'
    # then keep using it until it gets another error code
Any ideas?
Scrapy has a downloader middleware that is enabled by default to handle proxies. It's called HttpProxyMiddleware, and it lets you supply a proxy meta key with your Request so that this proxy is used for that request.
There are a few ways of doing this.
The first, straightforward one is to just use it in your spider code:
def parse(self, response):
    if response.status in range(400, 600):
        # you need to ignore filtering because you already did one request to this url
        return Request(response.url,
                       meta={'proxy': 'http://myproxy:8010'},
                       dont_filter=True)
Another, more elegant way would be to use a custom downloader middleware, which handles this for multiple callbacks and keeps your spider code cleaner:
import logging

from scrapy import Request

from project.settings import PROXY_URL

logger = logging.getLogger(__name__)


class MyDM(object):
    def process_response(self, request, response, spider):
        if response.status in range(400, 600):
            logger.debug('retrying [{}]{} with proxy: {}'.format(response.status, response.url, PROXY_URL))
            return Request(response.url,
                           meta={'proxy': PROXY_URL},
                           dont_filter=True)
        return response
Note that by default Scrapy doesn't let responses with status codes other than 200 through to your callbacks: 3xx redirects are followed automatically by the redirect middleware, and 4xx/5xx responses are filtered out by the HttpError middleware. To handle responses other than 200 you need to either:
Specify it in the Request meta:
Request(url, meta={'handle_httpstatus_list': [404,505]})
# or for all
Request(url, meta={'handle_httpstatus_all': True})
Or set project/spider-wide settings:
HTTPERROR_ALLOW_ALL = True # for all
HTTPERROR_ALLOWED_CODES = [404, 505] # for specific
as per http://doc.scrapy.org/en/latest/topics/spider-middleware.html#httperror-allowed-codes
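Putting these pieces together for the original question, here is a rough, untested sketch of rotating through a proxy pool at the spider level (the proxy URLs, status codes and class names are illustrative, not part of any answer above):

import scrapy

PROXIES = ['http://proxy1:8010', 'http://proxy2:8010']  # illustrative pool


class ProxyRotateSpider(scrapy.Spider):
    name = 'proxy_rotate'  # placeholder
    handle_httpstatus_list = [403, 503]  # let these statuses reach the callback
    proxy_index = 0

    def parse(self, response):
        if response.status in self.handle_httpstatus_list:
            # The current proxy seems blocked: move to the next one and retry this URL
            self.proxy_index = (self.proxy_index + 1) % len(PROXIES)
            yield scrapy.Request(response.url,
                                 meta={'proxy': PROXIES[self.proxy_index]},
                                 dont_filter=True)
            return
        # ... normal parsing with a working proxy ...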

How to fetch the Response object of a Request synchronously on Scrapy?

I believe using the "callback" method is asynchronous; please correct me if I'm wrong. I'm still new to Python, so please bear with me.
Anyway, I'm trying to make a method to check if a file exists and here is my code:
def file_exists(self, url):
    res = False
    response = Request(url, method='HEAD', dont_filter=True)
    if response.status == 200:
        res = True
    return res
I thought Request() would return a Response object, but it still returns a Request object; to capture the Response, I have to create a separate method as the callback.
Is there a way to get the Response object within the same code block where you call Request()?
If anyone is still interested in a possible solution, I managed it by doing a request with "requests" sort of "inside" a scrapy function, like this:
import requests
import scrapy

request_object = requests.get(the_url_you_like_to_get)
response_object = scrapy.Selector(text=request_object.text)
item['attribute'] = response_object.xpath('//path/you/like/to/get/text()').extract_first()
and then proceed.
Request objects don't generate anything.
Scrapy uses an asynchronous downloader engine which takes these Request objects and generates Response objects.
If any method in your spider returns a Request object, it is automatically scheduled in the downloader and a Response object is returned to the specified callback (i.e. Request(url, callback=self.my_callback)).
Check out more in Scrapy's architecture overview.
Now, depending on when and where you are doing it, you can schedule requests by telling the downloader to schedule some requests:
self.crawler.engine.schedule(Request(url, callback=self.my_callback), spider)
If you run this from a spider, spider here can most likely be self, and self.crawler is an attribute inherited from scrapy.Spider.
Alternatively, you can always block the asynchronous stack by using something like the requests library:
import requests

def parse(self, response):
    image_url = response.xpath('//img/@href').extract_first()
    if image_url:
        image_head = requests.head(image_url)
        if 'image' in image_head.headers['Content-Type']:
            item['image'] = image_url
It will slow your spider down but it's significantly easier to implement and manage.
Scrapy uses Request and Response objects for crawling web sites.
Typically, Request objects are generated in the spiders and pass across the system until they reach the Downloader, which executes the request and returns a Response object which travels back to the spider that issued the request.
Unless you are manually using a Downloader, it seems like the way you're using the framework is incorrect. I'd read a bit more about how you can create proper spiders here.
As for checking whether a file exists, your spider can store the relevant information in a database or another data structure when parsing the scraped data in its parse*() method, and you can later query it from your own code.
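If a brief blocking call is acceptable, the requests-based workaround from the first answer can also be shaped into the helper the asker wanted; a small sketch, not part of any answer above:

import requests


def file_exists(self, url):
    # Synchronous HEAD request; note that it blocks Scrapy's event loop while it runs
    try:
        head = requests.head(url, allow_redirects=True, timeout=10)
    except requests.RequestException:
        return False
    return head.status_code == 200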

Asynchronous JSON Requests in Python

I'm using an API for doing HTTP requests that return JSON. Calling the API, however, requires a start and an end page to be indicated, such as this:
def API_request(URL):
    while True:
        try:
            Response = requests.get(URL)
            Data = Response.json()
            return Data['data']
        except Exception as APIError:
            print(APIError)
            continue
        break


def build_orglist(start_page, end_page):
    APILink = ("http://sc-api.com/?api_source=live&system=organizations&action="
               "all_organizations&source=rsi&start_page={0}&end_page={1}&items_"
               "per_page=500&sort_method=&sort_direction=ascending&expedite=1&f"
               "ormat=json".format(start_page, end_page))
    return API_request(APILink)
The only way to know you're no longer at an existing page is when the returned JSON is null.
If I wanted to run multiple build_orglist calls going over every single page asynchronously until I reach the end (null JSON), how could I do so?
I went with a mix of #LukasGraf's answer, using sessions to unify all of my HTTP connections into a single session, and made use of grequests for issuing the group of HTTP requests in parallel.
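For reference, a rough sketch of that combination (the page range and the concurrency limit are made up for illustration; in practice you would stop once a page comes back with null data):

import grequests
import requests

session = requests.Session()  # one shared connection pool for every request

BASE = ("http://sc-api.com/?api_source=live&system=organizations&action=all_organizations"
        "&source=rsi&start_page={0}&end_page={0}&items_per_page=500"
        "&sort_method=&sort_direction=ascending&expedite=1&format=json")

urls = [BASE.format(page) for page in range(1, 21)]  # illustrative page range

# Build the requests lazily, then fire them in parallel through the shared session
pending = (grequests.get(url, session=session) for url in urls)
responses = grequests.map(pending, size=5)  # at most 5 concurrent requests

org_pages = [r.json()['data'] for r in responses if r is not None and r.json()['data']]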

Scrapy CrawlSpider retry scrape

For a page that I'm trying to scrape, I sometimes get a "placeholder" page back in my response that contains some JavaScript that auto-reloads until it gets the real page. I can detect when this happens, and I want to retry downloading and scraping the page. The logic I use in my CrawlSpider is something like this:
def parse_page(self, response):
    url = response.url

    # Check to make sure the page is loaded
    if 'var PageIsLoaded = false;' in response.body:
        self.logger.warning('parse_page encountered an incomplete rendering of {}'.format(url))
        yield Request(url, self.parse, dont_filter=True)
        return
    ...
    # Normal parsing logic
However, it seems that when the retry logic gets called and a new Request is issued, the pages and the links they contain don't get crawled or scraped. My thought was that by using self.parse, which the CrawlSpider uses to apply the crawl rules, together with dont_filter=True, I could avoid the duplicate filter. However, with DUPEFILTER_DEBUG = True, I can see that the retry requests do get filtered away.
Am I missing something, or is there a better way to handle this? I'd like to avoid the complication of dynamic JS rendering with something like Splash if possible, since this only happens intermittently.
I would think about having a custom retry middleware instead, similar to the built-in one.
Sample implementation (not tested):
import logging

logger = logging.getLogger(__name__)


class RetryMiddleware(object):
    def process_response(self, request, response, spider):
        if 'var PageIsLoaded = false;' in response.body:
            logger.warning('parse_page encountered an incomplete rendering of {}'.format(response.url))
            return self._retry(request) or response
        return response

    def _retry(self, request):
        logger.debug("Retrying %(request)s", {'request': request})
        retryreq = request.copy()
        retryreq.dont_filter = True
        return retryreq
And don't forget to activate it.
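Activating it means registering it in the project settings; a minimal sketch, where the module path and priority value are placeholders for your own project layout:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RetryMiddleware': 550,  # path and priority are placeholders
}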
