I'm performing queries against a server running Open Source Routing Machine (OSRM). I send a set of coordinates and obtain an n x n matrix of network distances over a street network.
To speed up the computation, I want to use ThreadPoolExecutor to parallelize the queries.
So far I've set up the connection in two ways, both of which eventually give me the same error:
import requests

def osrm_query(url_input):
    'Send request'
    response = requests.get(url_input)
    r = response.json()
    return r
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def osrm_query_2(url_input):
    'Send request'
    s = requests.Session()
    retries = Retry(total=3,
                    backoff_factor=0.1,
                    status_forcelist=[500, 502, 503, 504])
    s.mount('https://', HTTPAdapter(max_retries=retries))
    response = s.get(url_input)
    r = response.json()
    return r
I generate the set of URLs (in the _urls list) that I want to send as requests and parallelize this way:
from concurrent.futures import ThreadPoolExecutor

r = []
with ThreadPoolExecutor(max_workers=5) as executor:
    for each in executor.map(osrm_query_2, _urls):
        r.append(each)
So far, all works fine, but when processing more than 40,000 URLs I get this error back:
OSError: [WinError 10048] Only one usage of each socket address (protocol/network address/port) is normally permitted
As far as I understand, the problem is that I send too many requests from my machine, exhausting the number of ports available for outgoing connections (it looks like this has nothing to do with the machine I'm sending the requests to).
How can I fix this?
Is there a way to tell the ThreadPoolExecutor to re-use the connections?
I was guided in the right direction by someone outside Stack Overflow.
The trick was to point the workers of the pool at a single requests Session. The function that sends the queries was re-worked as follows:
def osrm_query(url_input, session):
    'Send request'
    response = session.get(url_input)
    r = response.json()
    return r
And the parallelization becomes:

from itertools import repeat

r = []
with ThreadPoolExecutor(max_workers=50) as executor:
    with requests.Session() as s:
        for each in executor.map(osrm_query, _urls, repeat(s)):
            r.append(each)
This way I reduced the execution time from 100 minutes (not parallelized) to 7 minutes with max_workers=50, for 200,000 URLs.
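One possible refinement, not part of the original answer: a Session's default HTTPAdapter keeps at most 10 connections per host in its pool, so with 50 workers some connections may still be dropped and reopened. A sketch of mounting a larger adapter, with illustrative pool sizes:

from concurrent.futures import ThreadPoolExecutor
from itertools import repeat

import requests
from requests.adapters import HTTPAdapter

r = []
with requests.Session() as s:
    # pool_connections = number of distinct hosts to keep pools for,
    # pool_maxsize = connections kept per host; match it to max_workers.
    adapter = HTTPAdapter(pool_connections=10, pool_maxsize=50)
    s.mount('https://', adapter)
    s.mount('http://', adapter)
    with ThreadPoolExecutor(max_workers=50) as executor:
        for each in executor.map(osrm_query, _urls, repeat(s)):
            r.append(each)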
Related
What's the correct way to use HTTPAdapter with async programming when calling out to a method? All of these requests are being made to the same domain.
I'm doing some async programming in Celery using eventlet and testing the load on one of my sites. I have a method that I call out to, which makes the request to the URL.
from requests.adapters import HTTPAdapter
from requests_html import HTMLSession

def get_session(url):
    # gets session, returns source
    headers, proxies = header_proxy()
    # set all of our necessary variables to None so that in the event of an error
    # we can make sure we don't break
    response = None
    status_code = None
    out_data = None
    content = None
    try:
        # we are going to use requests-html to be able to parse the
        # data upon the initial request
        with HTMLSession() as session:
            # you can swap out the original requests session here
            # session = requests.session()
            # passing the parameters to the session
            session.mount('https://', HTTPAdapter(max_retries=0, pool_connections=250, pool_maxsize=500))
            response = session.get(url, headers=headers, proxies=proxies)
            status_code = response.status_code
            try:
                # we are checking to see if we are getting a 403 error on all requests. If so,
                # we update the status code
                code = response.html.xpath('''//*[@id="accessDenied"]/p[1]/b/text()''')
                if code:
                    status_code = str(code[0][:-1])
                else:
                    pass
            except Exception as error:
                pass
                # print(error)
            # assign the content to content
            content = response.content
    except Exception as error:
        print(error)
        pass
If I leave out the pool_connections and pool_maxsize parameters and run the code, I get an error indicating that I do not have enough open connections. However, I don't want to unnecessarily open up a large number of connections if I don't need to.
Based on this: https://laike9m.com/blog/requests-secret-pool_connections-and-pool_maxsize,89/ I'm going to guess that this applies to the host and not so much the async tasks. Therefore, I set the max number to the maximum number of connections that can be reused per host. If I hit a domain several times, the connection is reused.
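For reference, my reading of the two parameters from the linked post: pool_connections is how many per-host pools the adapter keeps, and pool_maxsize is how many connections each of those pools may keep open for reuse. A minimal sketch for a single busy host (the numbers are illustrative assumptions, not recommendations):

import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
# One target domain, so a single pool is enough; let it keep up to 500
# connections alive for reuse across tasks.
adapter = HTTPAdapter(pool_connections=1, pool_maxsize=500)
session.mount('https://', adapter)
session.mount('http://', adapter)

response = session.get('https://example.com')  # hypothetical URL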
I have a Django App that accepts messages from a remote device as a POST message.
This fits well with Django's framework! I used the generic View class (from django.views import View) and defined my own POST function.
But the remote device requires a special reply that I cannot generate in Django (yet). So, I use the Requests library to re-create the POST message and send it up to the manufacturer's cloud server.
That server processes the data and responds with the special message in the body. Ideally, the entire HTTP response should go back to the remote device. If it does not get a valid reply, it will re-send the message, which would be annoying!
I've been googling, but am having trouble getting a clear picture of how to either:
(a): Reply back from Django with the requests response object without any edits.
(b): Build a Django response and send it back.
Actually, I think I do know how to do (b), but it's work. I would rather do (a) if it's possible.
Thanks in Advance!
Rich.
Thanks for the comments and questions!
The perils of late-night programming: you might over-think something, or miss the obvious. I was so focused on finding a way to return the requests response without any changes/edits that I did not even sketch out what option (b) would be.
Well, it turns out it's pretty simple:
# Imports needed at the top of the module (the snippet itself lives inside
# the view's post() method):
import time
import logging

import requests
from requests import Request, Session
from django.http import HttpResponse

logger = logging.getLogger(__name__)

s = Session()
# Populate POST to cloud with data from remote device request:
req = Request('POST', url, data=data, headers=headers)
prepped = req.prepare()
timeout = 10
retries = 3
while retries > 0:
    try:
        logger.debug("POST data to remote host")
        resp = s.send(prepped, timeout=timeout)
        break
    except requests.exceptions.RequestException:
        logger.debug("remote host connection failed, retry")
        retries -= 1
        logger.debug("retries left: %d", retries)
        time.sleep(.3)
if retries == 0:
    pass  # There isn't anything I can do if this fails repeatedly...

# Build reply to remote device:
r = HttpResponse(resp.content,
                 content_type=resp.headers['Content-Type'],
                 status=resp.status_code,
                 reason=resp.reason,
                 )
r['Server'] = resp.headers['Server']
r['Connection'] = resp.headers['Connection']
logger.debug("Returning Server response to remote device")
return r
The Session "s" allows one to use "prepped" and "send", which allows one to monkey with the request object before its sent, and to re-try the send. I think at least some of it can be removed in a refactor; making this process even simpler.
There are 3 HTTP object at play here:
"req" is the POST I send up to the cloud server to get back a special (encrypted) reply.
"resp" is the reply back from the cloud server. The body (.content) contains the special reply.
"r" is the Django HTTP response I need to send back to the remote device that started this ball rolling by POSTing data to my view.
It's pretty simple to populate the response with the data and to set the headers to the values returned by the cloud server.
I know this works because the remote device does not POST the same data twice! If there were a mistake anywhere in this process, it would re-send the same data over and over. I copied the while/try loop from a socket repeater module; I don't know if that is really applicable to HTTP. I have been testing this on live hardware for over 48 hours and so far it has never failed. Timeouts are a question mark too, in that I know the remote device and the cloud server have strict limits, so if there is an error in my "repeater", re-trying may not work if the process takes too long. It might be better to just discard/give up on the current POST and wait for the remote device to re-try. Sorry, refactoring out loud...
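For what it's worth, one possible shape of that refactor, using session.post directly instead of prepare/send. This is only a sketch: forward_to_cloud is a hypothetical name, and url, data, headers and the status-502 fallback are assumptions rather than part of the original view.

import time

import requests
from django.http import HttpResponse


def forward_to_cloud(url, data, headers, retries=3, timeout=10):
    """Relay the device's POST to the cloud and wrap the reply for Django."""
    with requests.Session() as session:
        for attempt in range(retries):
            try:
                resp = session.post(url, data=data, headers=headers, timeout=timeout)
                break
            except requests.exceptions.RequestException:
                time.sleep(0.3)
        else:
            return HttpResponse(status=502)  # all retries failed

    return HttpResponse(resp.content,
                        content_type=resp.headers.get('Content-Type'),
                        status=resp.status_code,
                        reason=resp.reason)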
I have several Flask servers which handle POST requests and return some values. I need to run an infinite process which sends requests to all the servers, waits until the first response arrives, updates the internal state based on that response, and sends a new request to that server. Here is pseudocode for this:
obj = SomeObject()
requests = [obj.make_request() for _ in range(10)]
responses = grequests.imap(requests, size=10)
for response in responses:
    obj.update_state(response)
    requests.append(obj.make_request())
What is the proper implementation of such logic in python?
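For reference, one way to express that loop without grequests is with concurrent.futures and FIRST_COMPLETED waits. This is only a sketch: SomeObject, make_request and update_state are the placeholders from the pseudocode above, and make_request is assumed to return a URL.

from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

import requests

def fetch(url):
    # illustrative helper: issue one blocking GET per task
    return requests.get(url, timeout=10)

obj = SomeObject()  # placeholder object from the pseudocode above
with ThreadPoolExecutor(max_workers=10) as executor:
    pending = {executor.submit(fetch, obj.make_request()) for _ in range(10)}
    while pending:
        # Block until at least one request finishes, then handle only those.
        done, pending = wait(pending, return_when=FIRST_COMPLETED)
        for future in done:
            obj.update_state(future.result())
            # Immediately re-submit a new request for that slot.
            pending.add(executor.submit(fetch, obj.make_request()))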
I'm currently taking a web scraping class with other students, and we are supposed to make 'get' requests to a dummy site, parse it, and visit another site.
The problem is, the content of the dummy site is only up for several minutes before it disappears, and it comes back up at a certain interval. During the time the content is available, everyone tries to make the 'get' requests, so mine just hangs until everyone clears out, and by then the content has disappeared. So I end up not being able to successfully make the 'get' request:
import requests
from splinter import Browser
browser = Browser('chrome')
# Hangs here
requests.get('http://dummysite.ca').text
# Even if get is successful hangs here as well
browser.visit(parsed_url)
So my question is, what's the fastest/best way to make endless concurrent 'get' requests until I get a response?
Decide to use either requests or splinter:
- Read about Requests: HTTP for Humans
- Read about Splinter
Related:
- Read about keep-alive
- Read about blocking-or-non-blocking
- Read about timeouts
- Read about errors-and-exceptions
If you can get requests that do not hang, you can retry them in a loop, for instance:

import time
import requests

while True:
    response = requests.get('http://dummysite.ca')
    if response.status_code == 200:  # request was successful
        break
    time.sleep(1)
Gevent provides a framework for running asynchronous network requests.
It can patch Python's standard library so that existing libraries like requests and splinter work out of the box.
Here is a short example of how to make 10 concurrent requests, based on the code above, and get their responses.
from gevent import monkey
monkey.patch_all()

import gevent.pool
import requests

pool = gevent.pool.Pool(size=10)
greenlets = [pool.spawn(requests.get, 'http://dummysite.ca')
             for _ in range(10)]
# Wait for all requests to complete
pool.join()
for greenlet in greenlets:
    # This will raise any exceptions raised by the request.
    # Need to catch errors, or check if an exception was
    # thrown by checking `greenlet.exception`.
    response = greenlet.get()
    text_response = response.text
You could also use map and a response-handling function instead of get; a rough sketch follows.
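A minimal sketch of that variant, assuming the same pool and URL as above (handle_response is an illustrative name):

def handle_response(response):
    # do whatever parsing you need here
    return response.text

# pool.map sends the requests concurrently and returns responses in order.
responses = pool.map(requests.get, ['http://dummysite.ca'] * 10)
texts = [handle_response(resp) for resp in responses]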
See gevent documentation for more information.
In this situation, concurrency will not help much since the server seems to be the limiting factor. One solution is to send a request with a timeout; if the timeout is exceeded, retry the request after a few seconds, gradually increasing the time between retries until you get the data that you want. For instance, your code might look like this:
import time
import requests

def get_content(url, timeout):
    # raises a Timeout exception if more than `timeout` seconds have passed
    resp = requests.get(url, timeout=timeout)
    # raise a generic exception if the request is unsuccessful
    if resp.status_code != 200:
        raise LookupError('status is not 200')
    return resp.content

timeout = 5  # seconds
retry_interval = 0
max_retry_interval = 120

while True:
    try:
        response = get_content('https://example.com', timeout=timeout)
        retry_interval = 0  # reset retry interval after success
        break
    except (LookupError, requests.exceptions.Timeout):
        retry_interval += 10
        if retry_interval > max_retry_interval:
            retry_interval = max_retry_interval
        time.sleep(retry_interval)

# process response
If concurrency is required, consider the Scrapy project. It uses the Twisted framework. In Scrapy you can replace time.sleep with reactor.callLater(delay, fn, *args, **kw) or use one of hundreds of middleware plugins.
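For illustration, a bare Twisted sketch of deferring a retry instead of blocking with time.sleep (retry_fetch and the delay are illustrative, not Scrapy-specific):

from twisted.internet import reactor

def retry_fetch(url):
    # placeholder for whatever re-issues the request
    print('retrying %s' % url)
    reactor.stop()  # just so this sketch exits after the single retry

# Schedule retry_fetch 10 seconds from now without blocking the event loop.
reactor.callLater(10, retry_fetch, 'https://example.com')
reactor.run()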
From the documentation for requests:
If the remote server is very slow, you can tell Requests to wait forever for a response, by passing None as a timeout value and then retrieving a cup of coffee.
import requests

# Wait potentially forever
r = requests.get('http://dummysite.ca', timeout=None)
# Check the status code to see how the server is handling the request
print(r.status_code)
Status codes beginning with 2 mean the request was received, understood, and accepted. 200 means the request was a success and the information was returned. 503 means the server is overloaded or undergoing maintenance.
Requests used to include a module called async which could send concurrent requests. It is now an independent module named grequests, which you can use to make concurrent requests endlessly until you get a 200 response:
import grequests

urls = [
    'http://python-requests.org',  # Just include one url if you want
    'http://httpbin.org',
    'http://python-guide.org',
    'http://kennethreitz.com'
]

def keep_going():
    rs = (grequests.get(u) for u in urls)  # Make a set of unsent Requests
    out = grequests.map(rs)                # Send them all at the same time
    for url, resp in zip(list(urls), out):
        # grequests.map returns None for requests that failed entirely
        if resp is not None and resp.status_code == 200:
            print(resp.text)
            urls.remove(url)  # If we have the content, delete the URL
    return

while urls:
    keep_going()
I am trying to use Python to log in to a website and gather information from several webpages, and I get the following error:
Traceback (most recent call last):
File "extract_test.py", line 43, in <module>
response=br.open(v)
File "/usr/local/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 203, in open
return self._mech_open(url, data, timeout=timeout)
File "/usr/local/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 255, in _mech_open
raise response
mechanize._response.httperror_seek_wrapper: HTTP Error 429: Unknown Response Code
I used time.sleep() and it works, but it seems unintelligent and unreliable. Is there any other way to dodge this error?
Here's my code:
import mechanize
import cookielib
import re

first = ("example.com/page1")
second = ("example.com/page2")
third = ("example.com/page3")
fourth = ("example.com/page4")
## I have seven URLs I want to open
urls_list = [first, second, third, fourth]

br = mechanize.Browser()

# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

# Browser options
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)

# Log in credentials
br.open("example.com")
br.select_form(nr=0)
br["username"] = "username"
br["password"] = "password"
br.submit()

for url in urls_list:
    response = br.open(url)
    print re.findall("Some String", response.read())
Receiving a status 429 is not an error; it is the other server "kindly" asking you to please stop spamming requests. Obviously, your rate of requests has been too high and the server is not willing to accept this.
You should not seek to "dodge" this, or even try to circumvent server security settings by trying to spoof your IP, you should simply respect the server's answer by not sending too many requests.
If everything is set up properly, you will also have received a Retry-After header along with the 429 response. This header specifies the number of seconds you should wait before making another call. The proper way to deal with this "problem" is to read this header and sleep your process for that many seconds.
You can find more information on status 429 here: https://www.rfc-editor.org/rfc/rfc6585#page-3
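A minimal sketch of that approach (get_with_backoff is an illustrative name, and it assumes Retry-After is given in seconds rather than as an HTTP date):

import time
import requests

def get_with_backoff(url, max_attempts=5):
    """Retry a GET, honouring the Retry-After header on 429 responses."""
    for attempt in range(max_attempts):
        response = requests.get(url, timeout=10)
        if response.status_code != 429:
            return response
        # Fall back to 1 second if the server did not send Retry-After.
        delay = int(response.headers.get('Retry-After', 1))
        time.sleep(delay)
    return response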
Writing this piece of code when making requests fixed my problem:
requests.get(link, headers={'User-agent': 'your bot 0.1'})
This works because sites sometimes return a 429 (Too Many Requests) error when no user agent is provided. For example, Reddit's API only works when a user agent is supplied.
As MRA said, you shouldn't try to dodge a 429 Too Many Requests but instead handle it accordingly. You have several options depending on your use case:
1) Sleep your process. The server usually includes a Retry-After header in the response with the number of seconds you are supposed to wait before retrying. Keep in mind that sleeping a process might cause problems, e.g. in a task queue, where you should instead retry the task at a later time to free up the worker for other things.
2) Exponential backoff. If the server does not tell you how long to wait, you can retry your request using increasing pauses in between. The popular task queue Celery has this feature built right in.
3) Token bucket. This technique is useful if you know in advance how many requests you are able to make in a given time. Each time you access the API you first fetch a token from the bucket. The bucket is refilled at a constant rate. If the bucket is empty, you know you'll have to wait before hitting the API again. Token buckets are usually implemented on the other end (the API), but you can also use them as a proxy to avoid ever getting a 429 Too Many Requests. Celery's rate_limit feature uses a token bucket algorithm (a rough sketch of this appears after the Celery example below).
Here is an example of a Python/Celery app using exponential backoff and rate-limiting/token bucket:
import time

import requests
from celery import shared_task as task  # Celery's task decorator
from requests.exceptions import ConnectTimeout


class TooManyRequests(Exception):
    """Too many requests"""


@task(
    rate_limit='10/s',
    autoretry_for=(ConnectTimeout, TooManyRequests),
    retry_backoff=True)
def api(*args, **kwargs):
    r = requests.get('placeholder-external-api')
    if r.status_code == 429:
        raise TooManyRequests()


# Or simply sleep for as long as the server asks you to:
if response.status_code == 429:
    time.sleep(int(response.headers["Retry-After"]))
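And a rough client-side token bucket, as mentioned in option 3 above (a sketch only; the rate and capacity numbers are arbitrary):

import time
import threading


class TokenBucket:
    """Client-side token bucket: allow at most `rate` requests per second."""

    def __init__(self, rate, capacity):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        """Block until a token is available, then consume it."""
        with self.lock:
            now = time.monotonic()
            # Refill proportionally to the time elapsed since the last update.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens < 1:
                # Not enough tokens: wait for the deficit to be refilled.
                # Sleeping under the lock is deliberate; it serialises callers
                # and so enforces the rate globally.
                wait = (1 - self.tokens) / self.rate
                time.sleep(wait)
                self.updated = time.monotonic()
                self.tokens = 0
            else:
                self.tokens -= 1


bucket = TokenBucket(rate=10, capacity=10)  # roughly 10 requests per second
# bucket.acquire()  # call this before each requests.get(...)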
Another workaround would be to spoof your IP using some sort of public VPN or the Tor network. This assumes the server applies its rate-limiting at the IP level.
There is a brief blog post demonstrating a way to use tor along with urllib2:
http://blog.flip-edesign.com/?p=119
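If you use requests rather than urllib2, the equivalent (assuming a local Tor SOCKS proxy on 127.0.0.1:9050 and the requests[socks] extra installed) looks roughly like this:

import requests

# Route both schemes through the local Tor SOCKS proxy.
tor_proxies = {
    'http': 'socks5h://127.0.0.1:9050',
    'https': 'socks5h://127.0.0.1:9050',
}
response = requests.get('http://httpbin.org/ip', proxies=tor_proxies, timeout=30)
print(response.text)  # should show the Tor exit node's IP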
I've found a nice workaround to IP blocking when scraping sites. It lets you run a scraper indefinitely by running it from Google App Engine and redeploying it automatically when you get a 429.
Check out this article
In many cases, continuing to scrape data from a website when the server is asking you not to is unethical. However, in the cases where it isn't, you can use a list of public proxies to scrape a website from many different IP addresses.
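A minimal sketch of rotating through such a list with requests (the proxy addresses are placeholders, and get_via_proxy is an illustrative name):

import itertools
import requests

# Hypothetical list of public proxies; cycle through them, one per request.
proxies = itertools.cycle([
    'http://203.0.113.1:8080',
    'http://203.0.113.2:8080',
])

def get_via_proxy(url):
    proxy = next(proxies)
    return requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=10)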