Limit speed of urlfetch per domain - python

Is there a way to limit the number of requests that urlfetch makes to any single server, per time unit?
I accidentally DoS'd a site I was crawling, since the async urlfetch API let the crawl branch out until it died (each request spawns more than one new request on average). The logs contain ~200 DeadlineExceeded errors with a millisecond between each.

You could use the time.sleep() method, which suspends execution of the current thread for the given number of seconds.
import time
import urllib2
[...]
for u in urls:
    urllib2.urlopen(u, timeout=4)
    time.sleep(1)
https://docs.python.org/2/library/time.html#time.sleep
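If you need a per-domain limit rather than a single global pause, one option is to remember when each host was last hit and sleep only when you would revisit it too soon. A minimal sketch (urls, the 1-second interval, and the synchronous loop are assumptions):

import time
import urllib2
from urlparse import urlparse

MIN_INTERVAL = 1.0  # assumed minimum gap, in seconds, between hits to one host
last_hit = {}       # host -> time of the previous request to it

for u in urls:
    host = urlparse(u).netloc
    wait = MIN_INTERVAL - (time.time() - last_hit.get(host, 0.0))
    if wait > 0:
        time.sleep(wait)  # throttle only when revisiting the same host too soon
    last_hit[host] = time.time()
    urllib2.urlopen(u, timeout=4)

Note this only works for a synchronous loop; with the async urlfetch API you would need to queue requests per host instead of firing them all at once.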

Related

BigQuery Python client - meaning of timeout parameter, and how to set query result timeout

This question is about the timeout parameter in the result method of QueryJob objects in the BigQuery Python client.
It looks like the meaning of timeout changed between versions 1.24.0 and 1.27.2.
For example, the documentation for QueryJob's result in version 1.24.0 states that timeout is:
The number of seconds to wait for the underlying HTTP transport before using retry. If multiple requests are made under the hood, timeout is interpreted as the approximate total time of all requests.
As I understand it, this could be used as a way to limit the total time that the result method call will wait for the results.
For example, consider the following script:
import logging
from google.cloud import bigquery
# Set logging level to DEBUG in order to see the HTTP requests
# being made by urllib3
logging.basicConfig(level=logging.DEBUG)
PROJECT_ID = "project_id" # replace by actual project ID
client = bigquery.Client(project=PROJECT_ID)
QUERY = ('SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013` '
         'WHERE state = "TX" '
         'LIMIT 100')
TIMEOUT = 30 # in seconds
query_job = client.query(QUERY) # API request - starts the query
assert query_job.state == 'RUNNING'
# Waits for the query to finish
iterator = query_job.result(timeout=TIMEOUT)
rows = list(iterator)
assert query_job.state == 'DONE'
As I understand it, if all the API calls involved in fetching the results added up to more than 30 seconds, the call to result would give up. So, timeout here serves to limit the total execution time of the result method call.
However, later versions introduced a change. For example, the documentation for result in 1.27.2 states that timeout is:
The number of seconds to wait for the underlying HTTP transport before using retry. If multiple requests are made under the hood, timeout applies to each individual request.
If I'm understanding this correctly, the example above changes meaning completely, and the call to result could potentially take more than 30 seconds.
My questions are:
What exactly changes in the behavior of the script above if I run it with the new version of result versus the old one?
What are the currently recommended use cases for passing a timeout value to result?
What is the currently recommended way to time out after a given total time while waiting for query results?
Thank you.
As you can see in this fix:
A transport layer timeout is made independent of the query timeout,
i.e. the maximum time to wait for the query to complete.
The query timeout is used by the blocking poll so that the backend
does not block for too long when polling for job completion, but the
transport can have different timeout requirements, and we do not want
it to be raising sometimes unnecessary timeout errors.
Apply timeout to each of the underlying requests
As job methods do not split the timeout anymore between all requests a
method might make, the Client methods are adjusted in the same way.
So the basic difference is that in the previous version, if many requests were made in the layer below, they shared a single 30-second timeout. In other words, if the first request took 20 seconds, the second would time out after 10 seconds.
In the new version, every single request gets its own 30 seconds.
As for the use case, it basically depends on your application. If you cannot afford to wait a long time for a request that might be lost, you can decrease your timeout.
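If you need to bound the total wall-clock wait regardless of the per-request timeout semantics, one generic option (a sketch, not an official recommendation; TOTAL_TIMEOUT is an assumed budget) is to run result() on a worker thread and give up after a fixed time:

import concurrent.futures

TOTAL_TIMEOUT = 30  # assumed overall budget in seconds

executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
future = executor.submit(lambda: list(query_job.result()))
try:
    rows = future.result(timeout=TOTAL_TIMEOUT)  # bounds the total wait
except concurrent.futures.TimeoutError:
    query_job.cancel()  # optionally stop the job server-side too
    raise
finally:
    executor.shutdown(wait=False)  # don't block waiting for the worker

Note the query itself keeps running server-side until it finishes or is cancelled; the timeout only stops your program from waiting.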

Log nginx "queue time"

I don't know if "queue time" is the right term for what I'm trying to log; maybe TTFB (time to first byte) is more correct.
Let me explain better with a test I did:
I wrote a little Python app (Flask framework) with one function (one endpoint) that needs about 5 seconds to complete (I get the same result with a sleep of 5 seconds).
I used uWSGI as the application server, configured with 1 process and 1 thread, and nginx as a reverse proxy.
With this configuration, if I make two concurrent requests from the browser, the first finishes in about 5 seconds and the second in about 10 seconds.
That's all correct: with only one uWSGI process, the second request must wait until the first is completed. What I want to log is the time the second request stays in the "queue" waiting to be processed by uWSGI.
I tried all the nginx log variables I could find that seemed relevant to my need:
$request_time
request processing time in seconds with a milliseconds resolution; time elapsed between the first bytes were read from the client and the log write after the last bytes were sent to the client
$upstream_response_time
keeps time spent on receiving the response from the upstream server; the time is kept in seconds with millisecond resolution.
$upstream_header_time
keeps time spent on receiving the response header from the upstream server (1.7.10); the time is kept in seconds with millisecond resolution.
but all of them report the same time, about 5 seconds, for both requests.
I also tried to add to log the variable $msec
time in seconds with a milliseconds resolution at the time of the log write
and a custom variable $my_start_time, initialized at the start of the server section with set $my_start_time "${msec}"; in this context msec is:
current time in seconds with the milliseconds resolution
but in this case too, the difference between the two times is about 5 seconds for both requests.
I suppose nginx should know the time I'm trying to log, or at least the total time of the request, from which I could subtract the "request time" and get the waiting time.
If I analyze the requests with the Chrome browser and check the waterfall, I see, for the first request, a total time of about 5 seconds, almost all of it in the "Waiting (TTFB)" row, while for the second request I see a total time of about 10 seconds, with about 5 in the "Waiting (TTFB)" row and about 5 in the "Stalled" row.
The time I want to log from the server side is the "Stalled" time reported by Chrome; from this question:
Understanding Chrome network log "Stalled" state
I understand that this time is related to proxy negotiation, so I suppose it is related to nginx acting as a reverse proxy.
The test uses a long-running process in order to measure these times more easily, but the waiting time will be present, albeit shorter, whenever there are more concurrent requests than uWSGI processes.
Did I miss something in my reasoning?
What is the correct name for this "queue time"?
How can I log it?
Thanks in advance for any suggestion.
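One server-side approach that should capture this (a sketch; I haven't verified it against this exact nginx/uWSGI setup) is to have nginx stamp each request with its arrival time and let the app compare that with the time it actually starts handling the request. With proxy_pass this would be proxy_set_header X-Request-Start "${msec}"; with uwsgi_pass it would be uwsgi_param HTTP_X_REQUEST_START "${msec}"; instead. X-Request-Start is an assumed header name. On the Flask side:

import time
from flask import Flask, request

app = Flask(__name__)

@app.before_request
def log_queue_time():
    # header set by nginx from $msec (seconds with millisecond resolution)
    start = request.headers.get('X-Request-Start')
    if start is not None:
        queue_time = time.time() - float(start)  # seconds spent queued before the app picked it up
        app.logger.info('queue time: %.3fs', queue_time)

For the second of the two concurrent requests this should log roughly the 5 seconds Chrome reports as "Stalled", assuming nginx and the app share the same machine's clock.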

Python web scraping: difference between sleep and request(page, timeout=x)

When scraping multiple websites in a loop, I notice there is a rather large difference in speed between,
sleep(10)
response = requests.get(url)
and,
response = requests.get(url, timeout=10)
That is, timeout is much faster.
Moreover, for both set-ups I expected a scraping duration of at least 10 seconds per page before requesting the next page, but this is not the case.
Why is there such a difference in speed?
Why is the scraping duration per page less than 10 seconds?
I now use multiprocessing, but I seem to remember that the above holds for non-multiprocessing as well.
time.sleep stops your script from running for a certain number of seconds, while timeout is the maximum time to wait for retrieving the URL. If the data is retrieved before the timeout is up, the remaining time is skipped. So a page can take less than 10 seconds with timeout.
time.sleep is different: it pauses your script completely until it is done sleeping, and only then runs your request, which takes another few seconds. So time.sleep will take more than 10 seconds every time.
They have very different uses, but for your case you should use a timer: if the request finishes before 10 seconds have passed, make the program wait for the remaining time.
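A minimal sketch of that timer approach (urls and the 10-second pace are assumptions): measure how long the request took and sleep only for whatever remains of the interval.

import time
import requests

MIN_INTERVAL = 10  # assumed target pace, in seconds per page

for url in urls:
    start = time.monotonic()
    response = requests.get(url, timeout=10)  # cap how long one request may take
    # ... process the response ...
    elapsed = time.monotonic() - start
    time.sleep(max(0, MIN_INTERVAL - elapsed))  # wait out the rest of the interval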
response = requests.get(url, timeout=10)
# timeout specifies the maximum time the program will wait for the request to
# complete before raising an exception. The program will not necessarily pause
# for 10 seconds: if the response is returned earlier, it won't wait any longer.
Read more about requests timeout here.
time.sleep causes your main thread to sleep, so your program will always wait 10 seconds before making a request to the URL.

Why do long HTTP round trip-times stall my Tornado AsyncHttpClient?

I'm using Tornado to send requests in rapid, periodic succession (every 0.1s or even 0.01s) to a server. For this, I'm using AsyncHttpClient.fetch with a callback to handle the response.
Here's a very simple code to show what I mean:
from functools import partial
from tornado import gen, locks, httpclient
from datetime import timedelta, datetime

# usually many of these running on the same thread, maybe requesting the same server
@gen.coroutine
def send_request(url, interval):
    wakeup_condition = locks.Condition()
    # using this to allow requests to send immediately
    http_client = httpclient.AsyncHTTPClient(max_clients=1000)
    for i in range(300):
        req_time = datetime.now()
        current_callback = partial(handle_response, req_time)
        http_client.fetch(url, current_callback, method='GET')
        yield wakeup_condition.wait(timeout=timedelta(seconds=interval))

def handle_response(req_time, response):
    resp_time = datetime.now()
    write_to_log(req_time, resp_time, resp_time - req_time)  # opens the log and writes to it
When I tested it against a local server it worked fine: the requests were sent on time, and the round-trip time was obviously minimal.
However, when I test it against a remote server with larger round-trip times (especially under higher request loads), the request timing gets thrown off by multiple seconds: the wait between requests becomes much larger than the desired period.
How come? I thought the async code wouldn't be affected by the round-trip time, since it isn't blocking while waiting for the response. Is there any known solution to this?
After some tinkering and tcpdumping, I've concluded that two things were really slowing down my coroutines. With these two corrected, stalling has gone down drastically and the timeout in yield wakeup_condition.wait(timeout=timedelta(seconds=interval)) is much better respected:
The computer I'm running on doesn't seem to be caching DNS, and for AsyncHTTPClient DNS resolution seems to be a blocking network call. As such, every coroutine sending requests pays the extra cost of waiting for DNS to resolve. The Tornado docs say:
tornado.httpclient in the default configuration blocks on DNS
resolution but not on other network access (to mitigate this use
ThreadedResolver or a tornado.curl_httpclient with a
properly-configured build of libcurl).
...and in the AsyncHTTPClient docs:
To select curl_httpclient, call AsyncHTTPClient.configure at startup:
AsyncHTTPClient.configure("tornado.curl_httpclient.CurlAsyncHTTPClient")
In the end, however, I implemented my own thread that resolves and caches DNS, and issuing requests directly to the IP address resolved the issue.
The URL I was using was HTTPS; changing to an HTTP URL improved performance. For my use case that's not always possible, but it's good to be able to localize that part of the issue.
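For reference, the built-in mitigation the docs mention can be enabled without writing your own resolver thread; a minimal sketch of what the Tornado docs suggest:

from tornado.netutil import Resolver

# Resolve DNS on a thread pool so lookups don't block the IOLoop
Resolver.configure('tornado.netutil.ThreadedResolver')

AsyncHTTPClient instances created after this call will use the threaded resolver.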

app engine python urlfetch timing out

I have two instances of App Engine applications running that I want to communicate via a RESTful interface. Once the data of one is updated, it calls a web hook on the second, which then retrieves a fresh copy of the data for its own system.
Inside 'site1' I have:
from google.appengine.api import urlfetch

url = 'http://www.site2.com/data_updated'
result = urlfetch.fetch(url)
Inside the handler for data_updated on 'site2' I have:
url = 'http://www.site1.com/get_new_data'
result = urlfetch.fetch(url)
There is very little data being passed between the two sites, but I receive the following error. I've tried increasing the deadline to 10 seconds, but this still doesn't work.
DeadlineExceededError: ApplicationError: 5
Can anyone provide any insight into what might be happening?
Thanks - Richard
App Engine's urlfetch doesn't always behave as expected; you have about 10 seconds to fetch the URL. Assuming the URL you're trying to fetch is up and running, you should be able to catch the DeadlineExceededError with from google.appengine.runtime import apiproxy_errors and then wrapping the urlfetch call in a try/except block using except apiproxy_errors.DeadlineExceededError:.
Relevant answer here.
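A minimal sketch of that pattern (what you do in the except branch — retry, log, fall back — is up to you):

from google.appengine.api import urlfetch
from google.appengine.runtime import apiproxy_errors

try:
    result = urlfetch.fetch(url, deadline=10)
except apiproxy_errors.DeadlineExceededError:
    # the remote host didn't answer within the deadline
    result = None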
Changing the method
from
result = urlfetch.fetch(url)
to
result = urlfetch.fetch(url, deadline=2, method=urlfetch.POST)
has fixed the Deadline errors.
From the urlfetch documentation:
deadline
The maximum amount of time to wait for a response from the
remote host, as a number of seconds. If the remote host does not
respond in this amount of time, a DownloadError is raised.
Time spent waiting for a request does not count toward the CPU quota
for the request. It does count toward the request timer. If the app
request timer expires before the URL Fetch call returns, the call is
canceled.
The deadline can be up to a maximum of 60 seconds for request handlers
and 10 minutes for task queue and cron job handlers. If deadline is
None, the deadline is set to 5 seconds.
Have you tried manually querying the URLs (www.site2.com/data_updated and www.site1.com/get_new_data) with curl or otherwise to make sure that they're responding within the time limit? Even if the amount of data that needs to be transferred is small, maybe there's a problem with the handler that's causing a delay in returning the results.
The amount of data being transferred is not the problem here, the latency is.
If the app you are talking to often takes > 10 secs to respond, you will have to use a "proxy callback" server on another cloud platform (EC2, etc.). If you can hold off for a while, the new backend instances are supposed to relax the urlfetch time limits somewhat.
If the average response time is < 10 secs, and only relatively few calls are failing, just retry a few times. I hope for your sake the calls are idempotent (i.e. a retry doesn't have adverse effects). If not, you might be able to roll your own layer on top; it's a bit painful but it works OK, and it's what we do.
J
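A minimal sketch of that retry approach (the attempt count and deadline are assumed values, and it is only safe if the call is idempotent):

from google.appengine.api import urlfetch
from google.appengine.runtime import apiproxy_errors

def fetch_with_retries(url, attempts=3, deadline=10):
    for attempt in range(attempts):
        try:
            return urlfetch.fetch(url, deadline=deadline)
        except apiproxy_errors.DeadlineExceededError:
            if attempt == attempts - 1:
                raise  # out of retries; let the caller handle it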
The GAE doc now states the deadline can be 60 sec:
result = urlfetch.fetch(url, deadline=60, method=urlfetch.POST)
