Python web scraping: difference between sleep and requests.get(url, timeout=x) - python

When scraping multiple websites in a loop, I notice there is a rather large difference in speed between,
sleep(10)
response = requests.get(url)
and,
response = requests.get(url, timeout=10)
That is, timeout is much faster.
Moreover, for both set-ups I expected a scraping duration of at least 10 seconds per page before requesting the next page, but this is not the case.
Why is there such a difference in speed?
Why is the scraping duration per page less than 10 seconds?
I now use multiprocessing, but I seem to remember that the above holds for the non-multiprocessing case as well.

time.sleep stops your script from running for a certain number of seconds, while timeout is the maximum time to wait for retrieving the URL. If the data is retrieved before the timeout is up, the remaining time is skipped, so a request with timeout can take less than 10 seconds.
time.sleep is different: it pauses your script completely until the sleep is over, and only then runs your request, which takes another few seconds. So the time.sleep version will take more than 10 seconds every time.
They have very different uses, but for your case you should measure the elapsed time and, if the request finished in under 10 seconds, make the program wait for the remainder.
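A minimal sketch of that approach (the 10-second budget comes from the question; the urls list is assumed to exist elsewhere):
import time
import requests

MIN_INTERVAL = 10  # desired minimum seconds spent per page

for url in urls:  # urls is assumed to be defined elsewhere
    start = time.monotonic()
    response = requests.get(url, timeout=10)
    # ... process the response ...
    elapsed = time.monotonic() - start
    if elapsed < MIN_INTERVAL:
        time.sleep(MIN_INTERVAL - elapsed)  # wait out the rest of the 10 seconds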

response = requests.get(url, timeout=10)
# timeout specifies the maximum time the program will wait for the request to complete before raising an exception. The program will not necessarily pause for 10 seconds: if the response is returned earlier, it won't wait any longer.
Read more about requests timeout here.
time.sleep causes your main thread to sleep, so your program will always wait 10 seconds before making a request to the URL.
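Note that when the timeout is exceeded, requests raises an exception, which you may want to catch (a small sketch; skipping the URL is just one way to handle it):
import requests

try:
    response = requests.get(url, timeout=10)
except requests.exceptions.Timeout:
    # the server did not respond within 10 seconds
    print("Request timed out, skipping", url)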

Related

Log nginx "queue time"

I don't know if "queue time" is the right term for what i'm trying to log, maybe TTFB (time to first byte) is more correct.
I'm trying to explain better with a test I did:
I wrote a little python app (flask framework), with one function (one endpoint) that need about 5 seconds to complete the process (but same result with a sleep of 5 seconds).
I used uWSGI as application server, configured with 1 process and 1 thread, and nginx as reverse proxy.
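For reference, the endpoint is essentially this (a minimal sketch; the route name is illustrative):
import time
from flask import Flask

app = Flask(__name__)

@app.route("/slow")
def slow():
    time.sleep(5)  # stands in for the ~5 seconds of real processing
    return "done"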
With this configuration, if I make two concurrent requests from the browser, what I see is that the first finishes in about 5 seconds and the second finishes in about 10 seconds.
That's all expected: with only one uWSGI process, the second request must wait until the first is completed. What I want to log is the time the second request stays in the "queue" waiting to be processed by uWSGI.
I tried all the nginx log variables I could find that seemed relevant to my need:
$request_time: request processing time in seconds with millisecond resolution; time elapsed between the first bytes being read from the client and the log write after the last bytes were sent to the client.
$upstream_response_time: keeps the time spent receiving the response from the upstream server; the time is kept in seconds with millisecond resolution.
$upstream_header_time: keeps the time spent receiving the response header from the upstream server (1.7.10); the time is kept in seconds with millisecond resolution.
but all of them report the same time, about 5 seconds, for both requests.
I also tried adding the variable $msec to the log:
time in seconds with a milliseconds resolution at the time of the log write
and a custom variable $my_start_time, initialized at the start of the server section with set $my_start_time "${msec}"; in this context $msec is:
the current time in seconds with milliseconds resolution
but also in this case the difference between the two times is about 5 seconds for both requests.
I suppose nginx should know the time I'm trying to log, or at least the total time of the request, from which I could subtract the "request time" and get the waiting time.
If I analyze the requests with the Chrome browser and check the waterfall, I see for the first request a total time of about 5 seconds, almost all of it in the "Waiting (TTFB)" row, while for the second request I see a total time of about 10 seconds, with about 5 in the "Waiting (TTFB)" row and about 5 in the "Stalled" row.
The time I want to log from the server side is the "Stalled" time reported by Chrome; from this question:
Understanding Chrome network log "Stalled" state
I understand that this time is related to proxy negotiation, so I suppose it is connected with nginx acting as a reverse proxy.
The test configuration uses a long-running process in order to measure these times more easily, but the time will be present, albeit shorter, whenever there are more concurrent requests than uWSGI processes.
Did I miss something in my reasoning?
What is the correct name of this "queue time"?
How can I log it?
Thanks in advance for any suggestions.

Want requests.get to wait for redirection during some time

I have run into a problem involving a fixed period of time before one page redirects to another. I have already searched on StackOverflow for some way to make requests.get wait for, let's say, 7 seconds and collect the whole history of redirects during that time, but found nothing.
The only thing I found that comes close (even if not very) to what I want is the timeout option of requests.
requests.get('http://github.com', timeout=0.001)
According to requests docs:
timeout is not a time limit on the entire response download
I think what I need is to give it such a limit, to make it wait for up to 7 seconds, but I couldn't even find where in the requests package this could be changed.
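Something like the following is what I'm after: poll the page until a redirect shows up in response.history or 7 seconds have passed (a rough sketch of the desired behaviour, not working code; the polling interval is arbitrary):
import time
import requests

deadline = time.monotonic() + 7   # wait at most 7 seconds for the redirect
while True:
    response = requests.get('http://github.com', allow_redirects=True)
    if response.history:          # non-empty once a redirect has happened
        break
    if time.monotonic() >= deadline:
        break
    time.sleep(1)                 # poll once per second

print([r.url for r in response.history], response.url)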

Python requests time out on a linux machine but not on windows

So I wrote a Python script that iterates over a list of URLs and records the time it takes to get a response. Some of these URLs can take upwards of a minute to respond, which is expected (expensive API calls) the first time they are called, but they are practically instantaneous the second time (Redis cache).
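The core of the script is roughly this (a minimal sketch; the URL list and output format are illustrative):
import time
import requests

for url in urls:  # urls is assumed to be defined elsewhere
    start = time.perf_counter()
    response = requests.get(url, timeout=600)
    elapsed = time.perf_counter() - start
    print(f"{url}: {response.status_code} in {elapsed:.1f}s")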
When I run the script on my Windows machine, it works as expected for all URLs.
On my Linux server it runs as expected until it hits a URL that takes upwards of about 30 seconds to respond. At that point the call to requests.get(url, timeout=600) does not return until the 10-minute timeout is reached and then comes back with a "Read Timeout". Calling the same URL again afterwards results in a fast, successful request, because the response has now been cached in Redis. (So the request must have finished on the server providing the API.)
I would be thankful for any ideas as to what might be causing this weird behavior.

How to wait for a POST request (requests.post) completion in Python?

I'm using the requests library in Python to do a POST call.
My POST call takes about 5 minutes to complete. It creates a file in an S3 bucket.
After that, I want to download this file. However, I need some extra logic to wait for my POST to finish before executing the next line of my code, which downloads the file.
Any suggestions?
Is it possible to use the subprocess library for this? If so, what would the syntax be?
Code:
import requests
r = requests.post(url)
# wait for the post call to finish
download_file(file_name)
It should already wait until the request is finished.
Python, unlike Node.js, executes requests synchronously by default; you'd have to explicitly run it in another thread if you wanted it to be asynchronous. If your POST request takes 5 minutes to complete, then the download line won't run until those 5 minutes are up and the POST request has returned.
The question says the POST request takes 5 minutes to return, but maybe that's not quite right? Maybe the POST request returns promptly, but the server keeps working for 5 minutes creating the file in the S3 bucket? In that case the need for a delay makes sense. The fact that a separate download is needed at all tends to support this interpretation (the requested data doesn't come back from the request itself).
If a failed download throws an exception, try this:
import time
r = requests.post(url)
while True:
    time.sleep(60)  # sixty second delay
    try:
        download_file(file_name)
        break
    except Exception:  # download_file is assumed to raise on failure
        print("File not ready, trying again in one minute")
Or if download_file simply returns False on failure:
import time
r = requests.post(url)
while True:
    time.sleep(60)  # sixty second delay
    if download_file(file_name):
        break
    print("File not ready, trying again in one minute")
Since my interpretation of the question is speculative, I'll delete this answer if it's not to the point.
Michael's answer is correct. However, if you're running Selenium to crawl the web page, the frontend JS takes some time to render and display the result of the request. In such scenarios I tend to use:
import time
time.sleep(5)
That said, in such cases you also have explicit and implicit waits as options. Take a look at the Selenium Waits documentation.
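For example, an explicit wait blocks only until a given element shows up, instead of sleeping for a fixed amount of time (a sketch; the URL and element id are placeholders, not from the original post):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL
# wait at most 10 seconds for the element holding the result to appear
result = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "result"))  # placeholder element id
)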
In case you're directly sending requests to the API, Python waits for the response until it's complete.

Limit speed of urlfetch per domain

Is there a way to limit the number of requests that urlfetch makes to any single server, per time unit?
I accidentally DoS'd a site I was crawling, since the async urlfetch API made the crawl branch out until it died (each request spawns more than one new request on average). The logs contain ~200 DeadlineExceeded errors with a millisecond between each.
You could use the time.sleep() method, which suspends execution of the current thread for the given number of seconds.
import time
[...]
for u in urls:
    urllib2.urlopen(u, timeout=4)
    time.sleep(1)
https://docs.python.org/2/library/time.html#time.sleep
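If the crawl hits several different domains, sleeping after every request slows all of them down equally. A per-domain throttle could look roughly like this (a sketch in Python 3; the 1-second interval and the helper name are assumptions, not part of the original answer):
import time
from urllib.parse import urlparse

MIN_INTERVAL = 1.0   # assumed minimum seconds between requests to the same domain
last_request = {}    # domain -> time of the most recent request to it

def throttle(url):
    # sleep just long enough to keep MIN_INTERVAL between hits to the same domain
    domain = urlparse(url).netloc
    last = last_request.get(domain)
    if last is not None:
        wait = MIN_INTERVAL - (time.monotonic() - last)
        if wait > 0:
            time.sleep(wait)
    last_request[domain] = time.monotonic()

# usage: call throttle(u) right before each fetch of u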
