How to fire and forget an HTTP request? - python

Is it possible to fire a request and not wait for the response at all?
For Python, most internet searches point to
asynchronous-requests-with-python-requests
grequests
requests-futures
However, all of the above solutions spawn a new thread and wait for the response on that thread. Is it possible to not wait for any response at all, anywhere?

You can run your thread as a daemon; see the code below. If I comment out the line t.daemon = True, the code will wait for the thread to finish before exiting. With daemon set to True, it will simply exit. You can try it with the example below.
import requests
import threading
import time

def get_thread():
    g = requests.get("http://www.google.com")
    time.sleep(2)
    print(g.text[0:100])

if __name__ == '__main__':
    t = threading.Thread(target=get_thread)
    t.daemon = True  # Try commenting this out, running it, and see the difference
    t.start()
    print("Done")

I don't really know what you are trying to achieve by just firing an http request. So I will list some use cases I can think of.
Ignoring the result
If the only thing you want is for your program to feel like it never stops to make a request, you can use a library like aiohttp to make concurrent requests without actually calling await on the responses.
import aiohttp
import asyncio

async def main():
    async with aiohttp.ClientSession() as session:
        session.get('http://python.org')

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
But how can you know that the request was made successfully if you don't check anything?
Ignoring the body
Maybe you want to be very performant and are worried about losing time reading the body. In that case you can just fire the request, check the status code and then close the connection.
import http.client

def make_request(url="yahoo.com", timeout=50):
    conn = http.client.HTTPConnection(url, timeout=timeout)
    conn.request("GET", "/")
    res = conn.getresponse()
    print(res.status)
    conn.close()
If you close the connection as I did above, you won't be able to reuse the connection.
The right way
I would recommend awaiting the asynchronous calls with aiohttp so you can add the necessary logic without having to block.
But if you are looking for raw performance, a custom solution with the http.client library may be necessary. You could also consider very small requests/responses, small timeouts, and compression in your client and server.
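For completeness, a minimal sketch of that recommendation (assuming Python 3.7+ for asyncio.create_task and asyncio.run): the request is scheduled without blocking the caller, and only awaited before the session closes so it does not get cancelled.
import asyncio
import aiohttp

async def fire(session, url):
    # Await the response here so errors can at least be logged,
    # but the caller does not block on this coroutine.
    try:
        async with session.get(url) as response:
            response.raise_for_status()
    except aiohttp.ClientError as exc:
        print(f"request to {url} failed: {exc}")

async def main():
    async with aiohttp.ClientSession() as session:
        # Schedule the request without waiting for it at this point.
        task = asyncio.create_task(fire(session, "http://python.org"))
        # ... do other work here ...
        # Awaiting before the session closes keeps the request from being
        # cancelled when the "async with" block exits.
        await task

asyncio.run(main())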

Related

How to tell if requests.Session() is good?

I have the following session-dependent code which must be run continuously.
Code
import requests

http = requests.Session()

while True:
    # if http is not good, then run http = requests.Session() again
    response = http.get(....)
    # process response
    # wait for 5 seconds
Note: I moved the line http = requests.Session() out of the loop.
Issue
How do I check if the session is still working?
An example of a session that no longer works may be after the web server is restarted, or after a load balancer redirects to a different web server.
The requests.Session object is just a persistence and connection-pooling object to allow shared state between different HTTP requests on the client side.
If the server unexpectedly closes a session, so that it becomes invalid, the server would probably respond with an error-indicating HTTP status code.
Thus requests would raise an error. See Errors and Exceptions:
All exceptions that Requests explicitly raises inherit from requests.exceptions.RequestException.
See the extended classes of RequestException.
Approach 1: implement open/close using try/except
Your code can catch such exceptions within a try/except-block.
How the server signals an invalidated/closed session depends on its API specification. That signal response should be evaluated in the except block.
Here we use a session_was_closed(exception) function to evaluate the exception/response, and Session.close() to close the session correctly before opening a new one.
import requests

# initially open a session object
s = requests.Session()

# execute requests continuously
while True:
    try:
        response = s.get(....)
        # process response
    except requests.exceptions.RequestException as e:
        if session_was_closed(e):
            s.close()                # close the session
            s = requests.Session()   # open a new session
        else:
            pass                     # process non-session-related errors
    # wait for 5 seconds
Depending on the server response of your case, implement the method session_was_closed(exception).
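For example, a hypothetical session_was_closed could look like the sketch below; the status codes are assumptions, so adapt them to whatever the server actually sends.
import requests

def session_was_closed(exc):
    """Hypothetical check: decide from the exception whether the server-side
    session is gone and a fresh requests.Session should be opened."""
    # Network-level failures (e.g. server restart, load balancer switch)
    if isinstance(exc, requests.exceptions.ConnectionError):
        return True
    # If the server answered, inspect the status code it uses to signal
    # an invalidated session (the codes below are only an assumption).
    response = getattr(exc, "response", None)
    return response is not None and response.status_code in (401, 419, 440)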
Approach 2: automatically open/close using with
From Advanced Usage, Session Objects:
Sessions can also be used as context managers:
with requests.Session() as s:
    s.get('https://httpbin.org/cookies/set/sessioncookie/123456789')
This will make sure the session is closed as soon as the with block is exited, even if unhandled exceptions occurred.
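Combining this with the continuous loop from the question could look roughly like the sketch below; the URL and the error handling are placeholders.
import time
import requests

while True:
    # One session per outer iteration; the with block guarantees it is
    # closed even if something inside raises.
    with requests.Session() as s:
        try:
            while True:
                response = s.get("https://example.com/api")  # placeholder URL
                # process response
                time.sleep(5)  # wait for 5 seconds
        except requests.exceptions.RequestException:
            # the with block closes the session; loop around and open a new one
            pass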
I would flip the logic and add a try-except.
import requests

http = requests.Session()

while True:
    try:
        response = http.get(....)
    except requests.exceptions.ConnectionError:
        http = requests.Session()
        continue
    # process response
    # wait for 5 seconds
See this answer for more info. I didn't test if the raised exception is that one, so please test it.

Fetch multiple URLs with asyncio/aiohttp and retry for failures

I'm attempting to write some asynchronous GET requests with the aiohttp package, and have most of the pieces figured out, but am wondering what the standard approach is when handling the failures (returned as exceptions).
A general idea of my code so far (after some trial and error, I am following the approach here):
import asyncio
import aiofiles
import aiohttp
from pathlib import Path

with open('urls.txt', 'r') as f:
    urls = [s.rstrip() for s in f.readlines()]

async def fetch(session, url):
    async with session.get(url) as response:
        if response.status != 200:
            response.raise_for_status()
        data = await response.text()
        # (Omitted: some more URL processing goes on here)
        out_path = Path('out/')
        if not out_path.is_dir():
            out_path.mkdir()
        fname = url.split("/")[-1]
        async with aiofiles.open(out_path / f'{fname}.html', 'w+') as f:
            await f.write(data)

async def fetch_all(urls, loop):
    async with aiohttp.ClientSession(loop=loop) as session:
        results = await asyncio.gather(*[fetch(session, url) for url in urls],
                                       return_exceptions=True)
    return results

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    results = loop.run_until_complete(fetch_all(urls, loop))
Now this runs fine:
As expected the results variable is populated with None entries where the corresponding URL [i.e. at the same index in the urls array variable, i.e. at the same line number in the input file urls.txt] was successfully requested, and the corresponding file is written to disk.
This means I can use the results variable to determine which URLs were not successful (those entries in results not equal to None)
I have looked at a few different guides to using the various asynchronous Python packages (aiohttp, aiofiles, and asyncio) but I haven't seen the standard way to handle this final step.
Should the retrying to send a GET request be done after the await statement has 'finished'/'completed'?
...or should the retrying to send a GET request be initiated by some sort of callback upon failure
The errors look like this: ClientConnectorError(111, "Connect call failed ('000.XXX.XXX.XXX', 443)"), i.e. the request to IP address 000.XXX.XXX.XXX on port 443 failed, probably because there is some limit on the server which I should respect by waiting with a timeout before retrying.
Is there some sort of limit I might consider putting on, to batch the number of requests rather than trying them all?
I am getting about 40-60 successful requests when attempting a few hundred (over 500) URLs in my list.
Naively, I was expecting run_until_complete to handle this in such a way that it would finish upon succeeding at requesting all URLs, but this isn't the case.
I haven't worked with asynchronous Python and sessions/loops before, so would appreciate any help to find how to get the results. Please let me know if I can give any more information to improve this question, thank you!
Should the retrying to send a GET request be done after the await statement has 'finished'/'completed'? ...or should the retrying to send a GET request be initiated by some sort of callback upon failure
You can do the former. You don't need any special callback, since you are executing inside a coroutine, so a simple while loop will suffice, and won't interfere with execution of other coroutines. For example:
async def fetch(session, url):
    data = None
    while data is None:
        try:
            async with session.get(url) as response:
                response.raise_for_status()
                data = await response.text()
        except aiohttp.ClientError:
            # sleep a little and try again
            await asyncio.sleep(1)
    # (Omitted: some more URL processing goes on here)
    out_path = Path('out/')
    if not out_path.is_dir():
        out_path.mkdir()
    fname = url.split("/")[-1]
    async with aiofiles.open(out_path / f'{fname}.html', 'w+') as f:
        await f.write(data)
Naively, I was expecting run_until_complete to handle this in such a way that it would finish upon succeeding at requesting all URLs
The term "complete" is meant in the technical sense of a coroutine completing (running its course), which is achieved either by the coroutine returning or raising an exception.

The tasks from asyncio.gather do not work concurrently

I want to scrape data from a website concurrently, but I found that the following program is NOT executed concurrently.
import asyncio
import requests
from bs4 import BeautifulSoup

async def return_soup(url):
    r = requests.get(url)
    r.encoding = "utf-8"
    soup = BeautifulSoup(r.text, "html.parser")

    future = asyncio.Future()
    future.set_result(soup)
    return future

async def parseURL_async(url):
    print("Started to download {0}".format(url))
    soup = await return_soup(url)
    print("Finished downloading {0}".format(url))
    return soup

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)

t = [parseURL_async(url_1), parseURL_async(url_2)]
loop.run_until_complete(asyncio.gather(*t))
However, this program starts to download the second content only after the first one finishes. If my understanding is correct, the await keyword on await return_soup(url) waits for the function to complete, and while waiting for the completion it returns control to the event loop, which enables the loop to start the second download.
And once the function finally finishes the execution, the future instance within it gets the result value.
But why does this not work concurrently? What am I missing here?
Using asyncio is different from using threads in that you cannot add it to an existing code base to make it concurrent. Specifically, code that runs in the asyncio event loop must not block - all blocking calls must be replaced with non-blocking versions that yield control to the event loop. In your case, requests.get blocks and defeats the parallelism implemented by asyncio.
To avoid this problem, you need to use an http library that is written with asyncio in mind, such as aiohttp.
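For illustration, a rough sketch of the question's code rewritten with aiohttp (not part of the original answer; the BeautifulSoup parsing itself still runs on the event loop thread):
import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def return_soup(session, url):
    # The await here yields to the event loop while the download runs,
    # so the second URL can start before the first one finishes.
    async with session.get(url) as response:
        text = await response.text(encoding="utf-8")
    return BeautifulSoup(text, "html.parser")

async def parseURL_async(session, url):
    print("Started to download {0}".format(url))
    soup = await return_soup(session, url)
    print("Finished downloading {0}".format(url))
    return soup

async def main(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(parseURL_async(session, u) for u in urls))

# asyncio.run(main([url_1, url_2]))  # url_1 / url_2 as in the question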
I'll add a little more to user4815162342's response. The asyncio framework uses coroutines that must cede control of the thread while they do the long operation. See the diagram at the end of this section for a nice graphical representation. As user4815162342 mentioned, the requests library doesn't support asyncio. I know of two ways to make this work concurrently. First, is to do what user4815162342 suggested and switch to a library with native support for asynchronous requests. The second is to run this synchronous code in separate threads or processes. The latter is easy because of the run_in_executor function.
loop = asyncio.get_event_loop()

async def return_soup(url):
    r = await loop.run_in_executor(None, requests.get, url)
    r.encoding = "utf-8"
    return BeautifulSoup(r.text, "html.parser")

async def parseURL_async(url):
    print("Started to download {0}".format(url))
    soup = await return_soup(url)
    print("Finished downloading {0}".format(url))
    return soup

t = [parseURL_async(url_1), parseURL_async(url_2)]
loop.run_until_complete(asyncio.gather(*t))
This solution removes some of the benefit of using asyncio, as the long operation will still probably be executed from a fixed size thread pool, but it's also much easier to start with.
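As a side note (not part of the original answer), if the default pool is too small you can pass your own concurrent.futures.ThreadPoolExecutor to run_in_executor; the worker count below is an arbitrary example:
import asyncio
import concurrent.futures
import requests

# A bigger pool than the event loop's default executor; 20 is arbitrary.
executor = concurrent.futures.ThreadPoolExecutor(max_workers=20)

async def fetch(url):
    loop = asyncio.get_running_loop()
    # Run the blocking requests.get in the explicit executor instead of
    # the event loop's default one.
    return await loop.run_in_executor(executor, requests.get, url)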
The reason, as mentioned in the other answers, is the lack of library support for coroutines.
As of Python 3.9, though, you can use asyncio.to_thread as an alternative for I/O concurrency.
Obviously this is not exactly equivalent because, as the name suggests, it runs your functions in separate threads as opposed to a single thread in the event loop, but it can be a way to achieve I/O concurrency without relying on proper async support from the library.
In your example the code would be:
import asyncio
import requests
from bs4 import BeautifulSoup

def return_soup(url):
    r = requests.get(url)
    r.encoding = "utf-8"
    return BeautifulSoup(r.text, "html.parser")

def parseURL_async(url):
    print("Started to download {0}".format(url))
    soup = return_soup(url)
    print("Finished downloading {0}".format(url))
    return soup

async def main():
    result_url_1, result_url_2 = await asyncio.gather(
        asyncio.to_thread(parseURL_async, url_1),
        asyncio.to_thread(parseURL_async, url_2),
    )

asyncio.run(main())

Python + requests + splinter: What's the fastest/best way to make multiple concurrent 'get' requests?

Currently taking a web scraping class with other students, and we are supposed to make ‘get’ requests to a dummy site, parse it, and visit another site.
The problem is, the content of the dummy site is only up for several minutes and disappears, and the content comes back up at a certain interval. During the time the content is available, everyone tries to make the ‘get’ requests, so mine just hangs until everyone clears up, and the content eventually disappears. So I end up not being able to successfully make the ‘get’ request:
import requests
from splinter import Browser
browser = Browser('chrome')
# Hangs here
requests.get('http://dummysite.ca').text
# Even if get is successful hangs here as well
browser.visit(parsed_url)
So my question is, what's the fastest/best way to make endless concurrent 'get' requests until I get a response?
Decide to use either requests or splinter
Read about Requests: HTTP for Humans
Read about Splinter
Related
Read about keep-alive
Read about blocking-or-non-blocking
Read about timeouts
Read about errors-and-exceptions
If you are able to get requests that do not hang, you can think of repeating the request, for instance:
import time
import requests

while True:
    response = requests.get('http://dummysite.ca')
    if response.status_code == 200:
        break
    time.sleep(1)
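As a variation on the loop above (not in the original answer), requests can also retry with backoff by itself through urllib3's Retry and an HTTPAdapter; a sketch, with arbitrary retry counts and timeouts:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry up to 5 times with exponential backoff on connection problems
# and on the listed status codes.
retries = Retry(total=5, backoff_factor=1,
                status_forcelist=[429, 500, 502, 503, 504])
session.mount('http://', HTTPAdapter(max_retries=retries))
session.mount('https://', HTTPAdapter(max_retries=retries))

response = session.get('http://dummysite.ca', timeout=10)
print(response.status_code)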
Gevent provides a framework for running asynchronous network requests.
It can patch Python's standard library so that existing libraries like requests and splinter work out of the box.
Here is a short example of how to make 10 concurrent requests, based on the above code, and get their response.
from gevent import monkey
monkey.patch_all()

import gevent.pool
import requests

pool = gevent.pool.Pool(size=10)
greenlets = [pool.spawn(requests.get, 'http://dummysite.ca')
             for _ in range(10)]
# Wait for all requests to complete
pool.join()
for greenlet in greenlets:
    # This will raise any exceptions raised by the request
    # Need to catch errors, or check if an exception was
    # thrown by checking `greenlet.exception`
    response = greenlet.get()
    text_response = response.text
Could also use map and a response handling function instead of get.
See gevent documentation for more information.
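For example, the map variant mentioned above might look roughly like this (a sketch; the timeout and pool size are arbitrary):
from gevent import monkey
monkey.patch_all()

import gevent.pool
import requests

def handle(url):
    # Fetch one URL; return the body text, or None if the request failed.
    try:
        return requests.get(url, timeout=10).text
    except requests.RequestException:
        return None

pool = gevent.pool.Pool(size=10)
# map blocks until every greenlet is done and returns results in input order.
texts = pool.map(handle, ['http://dummysite.ca'] * 10)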
In this situation, concurrency will not help much since the server seems to be the limiting factor. One solution is to send a request with a timeout; if the timeout is exceeded, retry the request after a few seconds, gradually increasing the time between retries until you get the data that you want. For instance, your code might look like this:
import time
import requests

def get_content(url, timeout):
    # raise Timeout exception if more than `timeout` seconds have passed
    resp = requests.get(url, timeout=timeout)
    # raise generic exception if request is unsuccessful
    if resp.status_code != 200:
        raise LookupError('status is not 200')
    return resp.content

timeout = 5  # seconds
retry_interval = 0
max_retry_interval = 120

while True:
    try:
        response = get_content('https://example.com', timeout=timeout)
        retry_interval = 0  # reset retry interval after success
        break
    except (LookupError, requests.exceptions.Timeout):
        retry_interval += 10
        if retry_interval > max_retry_interval:
            retry_interval = max_retry_interval
        time.sleep(retry_interval)

# process response
If concurrency is required, consider the Scrapy project. It uses the Twisted framework. In Scrapy you can replace time.sleep with reactor.callLater(delay, fn, *args, **kw) or use one of hundreds of middleware plugins.
From the documentation for requests:
If the remote server is very slow, you can tell Requests to wait forever for a response, by passing None as a timeout value and then retrieving a cup of coffee.
import requests

# Wait potentially forever
r = requests.get('http://dummysite.ca', timeout=None)
# Check the status code to see how the server is handling the request
print(r.status_code)
Status codes beginning with 2 mean the request was received, understood, and accepted. 200 means the request was a success and the information was returned. 503, by contrast, means the server is overloaded or undergoing maintenance.
Requests used to include a module called async which could send concurrent requests. It is now an independent module named grequests
which you can use to make concurrent requests endlessly until a 200 response:
import grequests

urls = [
    'http://python-requests.org',  # Just include one url if you want
    'http://httpbin.org',
    'http://python-guide.org',
    'http://kennethreitz.com'
]

def keep_going():
    rs = (grequests.get(u) for u in urls)  # Make a set of unsent Requests
    out = grequests.map(rs)                # Send them all at the same time
    for url, resp in zip(list(urls), out):
        # grequests.map returns None for requests that failed entirely
        if resp is not None and resp.status_code == 200:
            print(resp.text)
            urls.remove(url)  # If we have the content, delete the URL
    return

while urls:
    keep_going()

Filling the list with responses and processing them at the same time

I'm trying to download some products from a web page. This web page (according to robots.txt) allows me to send 2000 requests/minute. The problem is that sending the requests sequentially and then processing them is too time-consuming.
I've realised that the method which sends the request can be moved into a pool, which is much better in terms of time. That is probably because the processor doesn't need to wait for the response and can instead send another request in the meantime.
So I have a pool, and the responses are appended to the list RESPONSES.
Simple code:
from multiprocessing.pool import ThreadPool as Pool
import requests

RESPONSES = []

with open('products.txt') as f:
    LINES = f.readlines()[:100]

def post_request(url):
    html = requests.get(url).content
    RESPONSES.append(html)

def parse_html_return_object(resp):
    # some code here
    pass

def insert_object_into_database():
    pass

pool = Pool(100)
for line in LINES:
    pool.apply_async(post_request, args=(line[:-1],))

pool.close()
pool.join()
What I want is to process those RESPONSES (HTML strings) as they come in, i.e. pop responses from the RESPONSES list and parse them while the remaining requests are still running.
So it could be like this (Time -->):
post_request(line1)->post_request(line2)->Response_line1->parse_html_return_object(response)->post_request...
Is there some simple way to do that?
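The thread does not include an accepted answer, but one possible sketch (an assumption, not from the original page) is to let the pool hand results back as they finish via ThreadPool.imap_unordered, so parsing overlaps with the downloads that are still in flight:
from multiprocessing.pool import ThreadPool as Pool
import requests

def post_request(url):
    # Fetch one URL and return its raw HTML.
    return requests.get(url, timeout=10).content

def parse_html_return_object(html):
    # placeholder for the real parsing logic
    return len(html)

with open('products.txt') as f:
    LINES = [line.strip() for line in f.readlines()[:100]]

pool = Pool(100)
# imap_unordered yields each response as soon as its worker finishes,
# so parsing starts while other requests are still running.
for html in pool.imap_unordered(post_request, LINES):
    obj = parse_html_return_object(html)
    # insert_object_into_database(obj) would go here
pool.close()
pool.join()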
