Scrapy - Timing to get a Response - python

Just like in Scrapy request+response+download time, I would like to know how many time it takes to get a Response. The solution proposed doesn't meet my needs because of the following issue:
When a Downloader Middleware process_request method returns a Request object, the Request is rescheduled and isn't passed immediately to the remaining process_request methods. As a consequence, the solution proposed will include the time needed for the scheduler to return the Request to the Engine again.
What I want is only the time the Downloader takes to download a page (the time elapsed between the end of Downloader Middleware processings of the Request and first Downloader Middleware processing of the Response).
My idea is that one could either:
Disable the rescheduling of returned Request. But is it desirable and how can we do this?
Or Try to use the 'timer' used to trigger a TimeoutError. But I don't know how to access it.
Thanks in advance!

Isn't this exactly what download_latency request meta key contains? Or your requirement is different?

Related

waiting for completion of get request

I have to get several pages of a json API with about 130'000 entries.
The request is fairly simple with:
response = requests.request("GET", url, headers=headers, params=querystring)
Where the querystring is an access token and the headers fairly simple.
I created a while loop where basically every request url is in the form of
https://urlprovider.com/endpointname?pageSize=10000&rowStart=0
and the rowStart increments by pageSize until there is no more further pages.
The problem I encounter is the following response after about 5-8 successful requests:
{'errorCode': 'ERROR_XXX', 'code': 503, 'message': 'Maximum limit for unprocessed API requests have been reached. Please try again later.', 'success': False}
From the error message I get that I initiate the next request before the last has finished. Does anyone know how I can make sure the get request has finished before the next one starts (except something crude like a sleep()) or if the error could lie elsewhere?
I found the answer to my question.
Requests is synchronous, meaning that it will ALWAYS wait until the the call has finished before continuing
The response from the API provider is misleading, as the request has thus already been processed before the next one comes.
The root cause is difficult to assess, but it may be to do with a limit imposed by the API provider
What has worked:
A crude sleep(10), which makes the program wait 10 seconds before processing the next request
Better solution: Create a Session. According to the documentation:
The Session object [...] will use urllib3’s connection pooling. So if you’re making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase (see HTTP persistent connection).
Not only does this resolve the problem but also increases the performance compared to my initial code.

scrapy: spider quits without error messages before all requests yielded

If there are many requests in scheduler, would scheduler reject more requests to be added?
I met a very tricky question. I am trying to scrape a forum with all posts and comments. The problem is scrapy seems never finish it jobs and quits without error messages. I am wondering if I yielded too many requests so that scrapy stopped yielding new requests and just quit.
But I could not find documentation says that scrapy will quit if too many requests in schedular. Here is my code:
https://github.com/spacegoing/sentiment_mqd/blob/a46b59866e8f0a888b43aba6df0481a03136cf21/guba_spiders/guba_spiders/spiders/guba_spider.py#L217
The strange thing is that scrapy seems can only scrape 22 pages. If I start from page 1, it will stop at page 21. If I start from page 21, then it will stop at page 41.... There is no exception raised and scraped results are desired outputs.
1.
The code on GitHub you shared at a46b598 is probably not the exact version you have locally for the sample jobs. E.g. I haven't observed any line for the log lines like <timestamp> [guba] INFO: <url>.
But, well, I assumed there's no too significant difference.
2.
It's suggested to have the log level configured to DEBUG when you encounter any issue.
3.
If you've got the log level configured to DEBUG, you'd probably see something like this:
2018-10-26 15:25:09 [scrapy.downloadermiddlewares.redirect] DEBUG: Discarding <GET http://guba.eastmoney.com/topic,600000_22.html>: max redirections reached
Some more lines: https://gist.github.com/starrify/b2483f0ed822a02d238cdf9d32dfa60e
That happens because you're passing the full response.meta dict to the following requests (related code), and Scrapy's RedirectMiddleware relies on some meta values (e.g. "redirect_times" and "redirect_ttl") to perform the check.
And the solution is simple: pass only the values you need into next_request.meta.
4.
It's also observed that you're trying to rotate the user agent strings, possibly for avoiding web crawl bans. But there's no other action taken. That would make your requests fishy still, because:
Scrapy's cookie management is enabled by default, which would use a same cookie jar for all your requests.
All your requests come from a same source IP address.
Thus I'm unsure whether it's good enough for you to scrape the whole site properly, especially when you're not throttling the requests.

Twisted Proxy with range splitting

I want to write a twisted proxy that splits up very large GET request into smaller fixed size ranges and sends it on to another proxy (using the Range: bytes). The other proxy doesn't allow large responses and when the response is to large it returns a 502.
How can I implement a proxy in twisted that on a 502 error it tries splitting the request into smaller allowed chunks. The documentation is hard to follow. I know I need to extend ProxyRequest, but from there I'm a bit stuck.
It doesn't have to be a twisted proxy, but it seems to be easily modified and I managed to at least get it to forward the request unmodified to the proxy by just setting the connectTCP to my proxy (in ProxyRequest.parsed).
Extending ProxyRequest is probably not the easiest way to do this, actually; ProxyRequest pretty strongly assumes that one request = one response, whereas here you want to split up a single request into multiple requests.
Easier would be to simply write a Resource implementation that does what you want, which briefly would be:
in render_GET, construct a URL to make several outgoing requests using Agent
return NOT_DONE_YET
as each response comes in, call request.write on your original incoming requests, and then issue a new request with a Range header
finally when the last response comes in, call request.finish on your original request
You can simply construct a Site object with your Resource, and set isLeaf on your Resource to true so your Resource doesn't have to implement any traversal logic and can just build the URL using request.prePathURL and request.postpath. (request.postpath is sadly undocumented; it's a list of the not-yet-traversed path segments in the request).

scrapy: switch out failing proxies

I'm using this script to randomize proxies in scrapy. The problem is that once it's allocated a proxy to a request, it won't allocate another one because of this code:
def process_request(self, request, spider):
# Don't overwrite with a random one (server-side state for IP)
if 'proxy' in request.meta:
return
That means that if there is a bad proxy which is not connecting to anything, then the request will fail. I'm intending to modify it like this:
if request.meta.get('retry_times',0) < 5:
return
thereby letting it allocate a new proxy if the current one fails 5 times. I'm assuming that if I set RETRY_TIMES to, say 20, in settings.py, then the request won't fail until 4 different proxies have each made 5 attempts.
I'd like to know if that will cause any problems. As I understand it, the reason that the check is there in the first place is for stateful transactions, such as those relying on log-ins, or perhaps cookies. Is that correct?
I bumped with the same problem.
I improved the aivarsk/scrapy-proxies. My middleware inherited by basic RetryMiddleware and trying to use one proxy RETRY_TIMES. If proxy is unavailable, the middleware change it.
Yes, I think the idea of that script was to check if the user is already defining a proxy on the meta parameter, so it can control it from the spider.
Setting it to change proxy every 5 times is ok, but I think you'll have to re login to the page, as most pages know when you changed from where you are making the request (proxy).
The idea if rotating proxies is not as easy as just selecting one randomly, because you could still end up using the same proxy, and also defining the rules for when a site "banned" you is not as simple as only checking statuses. This are the services I know for that thing you want: Crawlera and Proxymesh.
If you want direct functionality on scrapy for rotating proxies, I recommend to use Crawlera as it is already fully integrated.

Dynamically change scrapy Request scheduler priority

I'm using scrapy to perform test on an internal web app.
Once all my tests are done, I use CrawlSpider to check everywhere, and I run for each response a HTML validator and I look for 404 media files.
It work very well except for this: the crawl at the end, GET things in a random order...
So, URL that perform DELETE operation are being executed before other operations.
I would like to schedule all delete at the end. I tried many way, with such kind of scheduler:
from scrapy import log
class DeleteDelayer(object):
def enqueue_request(self, spider, request):
if request.url.find('delete') != -1:
log.msg("delay %s" % request.url, log.DEBUG)
request.priority = 50
But it does not work... I see delete being "delay" in the log but they are executed during the execution.
I thought of using a middleware that can pile up in memory all the delete URL and when the spider_idle signal is called to put them back in, but I'm not sure on how to do this.
What is the best way to acheive this?
default priority for request is 0, so you set priority to 50 will not work
you can use a middleware to collect (insert the requests into your own queue, e.g, redis set) and ignore (return IngnoreRequest Exception) those 'delete' request
start a 2nd crawl with requests load from your queue in step 2

Categories

Resources