I want to scrape data from a website concurrently, but I found that the following program is NOT executed concurrently.
import asyncio
import requests
from bs4 import BeautifulSoup

async def return_soup(url):
    r = requests.get(url)
    r.encoding = "utf-8"
    soup = BeautifulSoup(r.text, "html.parser")

    future = asyncio.Future()
    future.set_result(soup)
    return future

async def parseURL_async(url):
    print("Started to download {0}".format(url))
    soup = await return_soup(url)
    print("Finished downloading {0}".format(url))
    return soup

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)

t = [parseURL_async(url_1), parseURL_async(url_2)]
loop.run_until_complete(asyncio.gather(*t))
However, this program starts to download the second content only after the first one finishes. If my understanding is correct, the await in await return_soup(url) waits for the function to complete, and while waiting for that completion it returns control to the event loop, which should enable the loop to start the second download.
And once the function finally finishes the execution, the future instance within it gets the result value.
But why does this not work concurrently? What am I missing here?
Using asyncio is different from using threads in that you cannot add it to an existing code base to make it concurrent. Specifically, code that runs in the asyncio event loop must not block - all blocking calls must be replaced with non-blocking versions that yield control to the event loop. In your case, requests.get blocks and defeats the parallelism implemented by asyncio.
To avoid this problem, you need to use an http library that is written with asyncio in mind, such as aiohttp.
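As a minimal sketch of what that looks like, here is the question's two-download flow rewritten with aiohttp (url_1 and url_2 are the question's placeholders; error handling and the BeautifulSoup parsing step are omitted for brevity):

```python
import asyncio
import aiohttp

async def parseURL_async(session, url):
    print("Started to download {0}".format(url))
    # session.get suspends this coroutine while waiting on the network,
    # so the event loop can start the other download in the meantime.
    async with session.get(url) as response:
        text = await response.text()
    print("Finished downloading {0}".format(url))
    return text

async def main(url_1, url_2):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(parseURL_async(session, url_1),
                                    parseURL_async(session, url_2))
```

Running asyncio.run(main(url_1, url_2)) should print both "Started" lines before either "Finished" line, which is the interleaving the question expected.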
I'll add a little more to user4815162342's response. The asyncio framework uses coroutines that must cede control of the thread while they perform the long operation. As user4815162342 mentioned, the requests library doesn't support asyncio. I know of two ways to make this work concurrently. The first is to do what user4815162342 suggested and switch to a library with native support for asynchronous requests. The second is to run the synchronous code in separate threads or processes. The latter is easy thanks to the run_in_executor function.
import asyncio
import requests
from bs4 import BeautifulSoup

loop = asyncio.get_event_loop()

async def return_soup(url):
    # Run the blocking requests.get in a thread pool so the loop stays free.
    r = await loop.run_in_executor(None, requests.get, url)
    r.encoding = "utf-8"
    return BeautifulSoup(r.text, "html.parser")

async def parseURL_async(url):
    print("Started to download {0}".format(url))
    soup = await return_soup(url)
    print("Finished downloading {0}".format(url))
    return soup

t = [parseURL_async(url_1), parseURL_async(url_2)]
loop.run_until_complete(asyncio.gather(*t))
This solution removes some of the benefit of using asyncio, as the long operation will still probably be executed from a fixed-size thread pool, but it's also much easier to start with.
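To make the thread-pool behaviour concrete, the sketch below passes a dedicated pool to run_in_executor instead of None; time.sleep stands in for the blocking HTTP call so the snippet runs anywhere (the function names are made up for the example):

```python
import asyncio
import concurrent.futures
import time

def blocking_fetch(url):
    # Stand-in for requests.get: any blocking callable works here.
    time.sleep(0.1)
    return "body of {0}".format(url)

async def main(urls):
    loop = asyncio.get_running_loop()
    # Passing a pool instead of None controls how many blocking calls may
    # run at once; None would use the loop's default executor.
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        tasks = [loop.run_in_executor(pool, blocking_fetch, u) for u in urls]
        return await asyncio.gather(*tasks)

results = asyncio.run(main(["url_1", "url_2", "url_3"]))
print(results)
```

All three 0.1-second waits overlap, so the whole run takes roughly 0.1 seconds rather than 0.3.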
The reason, as mentioned in other answers, is the lack of library support for coroutines.

As of Python 3.9, though, you can use the function asyncio.to_thread as an alternative for I/O concurrency.

Obviously this is not exactly equivalent, because, as the name suggests, it runs your functions in separate threads as opposed to the event loop's single thread, but it can be a way to achieve I/O concurrency without relying on proper async support from the library.
In your example the code would be:
import asyncio
import requests
from bs4 import BeautifulSoup

def return_soup(url):
    r = requests.get(url)
    r.encoding = "utf-8"
    return BeautifulSoup(r.text, "html.parser")

def parseURL_async(url):
    print("Started to download {0}".format(url))
    soup = return_soup(url)
    print("Finished downloading {0}".format(url))
    return soup

async def main():
    result_url_1, result_url_2 = await asyncio.gather(
        asyncio.to_thread(parseURL_async, url_1),
        asyncio.to_thread(parseURL_async, url_2),
    )

asyncio.run(main())
Related
I currently have a Python multi-threading scraper that looks like the following. How can I turn this into a websocket server so that it can push a message once the scraper finds some new data? Most websocket implementations that I found are asynchronous, using asyncio, and I don't want to rewrite all of my code for that.

Are there any websocket implementations that I can use without rewriting my multi-threading code into an async version? Thanks!
import time
import requests
from threading import Thread

class Scraper:
    def _send_request(self):
        response = requests.get(url)
        if response.status_code == 200 and 'good' in response.json():
            send_ws_message()

    def scrape(self):
        while True:
            Thread(target=self._send_request).start()
            time.sleep(1)

scraper = Scraper()
scraper.scrape()
This code (snippet_1) is adapted from the ThreadPoolExecutor example in the docs:
import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/']

# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        data = future.result()
        print('%r page is %d bytes' % (url, len(data)))

print('after')
which works well, and gets
'http://www.foxnews.com/' page is 990869 bytes
'http://www.cnn.com/' page is 990869 bytes
'http://www.bbc.co.uk/' page is 990869 bytes
'http://europe.wsj.com/' page is 990869 bytes
after
This code (snippet_2) is my own attempt to do the same job with direct function calls.
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/']

for url in URLS:
    with urllib.request.urlopen(url, timeout=60) as conn:
        data = conn.read()
        print('%r page is %d bytes' % (url, len(data)))

print('after')
snippet_1 seems to be more common, but why?
When you are reading things from a network, your application will probably spend most of its time waiting on a reply.
Normally, the Global Interpreter Lock inside CPython (the Python implementation you are probably using) ensures that only one thread at a time is executing Python bytecode.
But when waiting for I/O (including network I/O) the GIL is released giving other threads opportunity to run. That means that multiple reads are effectively running in parallel instead of one after another, shortening overall execution time.
For a handful of URLs that won't make much of a difference. But the more URLs you use, the more noticeable it gets.
So the ThreadPoolExecutor is mainly useful for running I/O operations in parallel. The ProcessPoolExecutor on the other hand is useful for running CPU intensive tasks in parallel. Since it uses multiple processes, the restriction of the GIL doesn't apply.
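To illustrate the distinction, a CPU-bound function like the one below gains nothing from threads because of the GIL, whereas a ProcessPoolExecutor sidesteps it (the function and its workload are made up for the example):

```python
import concurrent.futures

def cpu_task(n):
    # Pure-Python arithmetic never releases the GIL, so running this in a
    # ThreadPoolExecutor would serialise; separate processes each have
    # their own GIL and can run truly in parallel.
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    with concurrent.futures.ProcessPoolExecutor() as pool:
        print(list(pool.map(cpu_task, [10 ** 6] * 4)))
```

Swapping ProcessPoolExecutor for ThreadPoolExecutor in this snippet would produce the same output but take roughly as long as running the four calls sequentially.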
Is it possible to fire a request and not wait for response at all?
For Python, most internet searches turn up:
asynchronous-requests-with-python-requests
grequests
requests-futures
However, all of the above solutions spawn a new thread and wait for the response on that thread. Is it possible to not wait for any response at all, anywhere?
You can run your thread as a daemon; see the code below. If I comment out the line (t.daemon = True), the code will wait for the threads to finish before exiting. With daemon set to True, it will simply exit. You can try it with the example below.
import requests
import threading
import time

def get_thread():
    g = requests.get("http://www.google.com")
    time.sleep(2)
    print(g.text[0:100])

if __name__ == '__main__':
    t = threading.Thread(target=get_thread)
    t.daemon = True  # Try commenting this out, running it, and see the difference
    t.start()
    print("Done")
I don't really know what you are trying to achieve by just firing an http request. So I will list some use cases I can think of.
Ignoring the result
If the only thing you want is for your program to never feel like it stops to make a request, you can use a library like aiohttp to schedule the request without awaiting the response.
import aiohttp
import asyncio

async def main():
    async with aiohttp.ClientSession() as session:
        # Schedule the request without awaiting the response; note that a
        # bare, un-awaited session.get(...) would never actually be sent.
        task = asyncio.ensure_future(session.get('http://python.org'))
        await asyncio.sleep(0)  # let the request start before the session closes

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
but how can you know that the request was made successfully if you don't check anything?
Ignoring the body
Maybe you want to be very performant, and you are worried about losing time reading the body. In that case you can just fire the request, check the status code, and then close the connection.
import http.client

def make_request(url="yahoo.com", timeout=50):
    conn = http.client.HTTPConnection(url, timeout=timeout)
    conn.request("GET", "/")
    res = conn.getresponse()
    print(res.status)
    conn.close()
If you close the connection as I did above, you won't be able to reuse connections.

The right way

I would recommend awaiting the asynchronous calls using aiohttp so you can add the necessary logic without having to block.

But if you are looking for raw performance, a custom solution with the http.client library is necessary. You might also consider very small requests/responses, small timeouts, and compression in your client and server.
I'm attempting to write some asynchronous GET requests with the aiohttp package, and have most of the pieces figured out, but am wondering what the standard approach is when handling the failures (returned as exceptions).
A general idea of my code so far (after some trial and error, I am following the approach here):
import asyncio
import aiofiles
import aiohttp
from pathlib import Path

with open('urls.txt', 'r') as f:
    urls = [s.rstrip() for s in f.readlines()]

async def fetch(session, url):
    async with session.get(url) as response:
        if response.status != 200:
            response.raise_for_status()
        data = await response.text()
    # (Omitted: some more URL processing goes on here)
    out_path = Path('out/')
    if not out_path.is_dir():
        out_path.mkdir()
    fname = url.split("/")[-1]
    async with aiofiles.open(out_path / f'{fname}.html', 'w+') as f:
        await f.write(data)

async def fetch_all(urls, loop):
    async with aiohttp.ClientSession(loop=loop) as session:
        results = await asyncio.gather(*[fetch(session, url) for url in urls],
                                       return_exceptions=True)
        return results

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    results = loop.run_until_complete(fetch_all(urls, loop))
Now this runs fine:
As expected the results variable is populated with None entries where the corresponding URL [i.e. at the same index in the urls array variable, i.e. at the same line number in the input file urls.txt] was successfully requested, and the corresponding file is written to disk.
This means I can use the results variable to determine which URLs were not successful (those entries in results not equal to None)
I have looked at a few different guides to using the various asynchronous Python packages (aiohttp, aiofiles, and asyncio) but I haven't seen the standard way to handle this final step.
Should the retrying to send a GET request be done after the await statement has 'finished'/'completed'?
...or should the retrying to send a GET request be initiated by some sort of callback upon failure
The errors look like this: ClientConnectorError(111, "Connect call failed ('000.XXX.XXX.XXX', 443)"), i.e. the request to IP address 000.XXX.XXX.XXX at port 443 failed, probably because there's some limit from the server which I should respect by waiting with a timeout before retrying.
Is there some sort of limit I might consider putting on, to batch the number of requests rather than trying them all?
I am getting about 40-60 successful requests when attempting a few hundred (over 500) URLs in my list.
Naively, I was expecting run_until_complete to handle this in such a way that it would finish upon succeeding at requesting all URLs, but this isn't the case.
I haven't worked with asynchronous Python and sessions/loops before, so would appreciate any help to find how to get the results. Please let me know if I can give any more information to improve this question, thank you!
Should the retrying to send a GET request be done after the await statement has 'finished'/'completed'? ...or should the retrying to send a GET request be initiated by some sort of callback upon failure
You can do the former. You don't need any special callback, since you are executing inside a coroutine, so a simple while loop will suffice, and won't interfere with execution of other coroutines. For example:
async def fetch(session, url):
    data = None
    while data is None:
        try:
            async with session.get(url) as response:
                response.raise_for_status()
                data = await response.text()
        except aiohttp.ClientError:
            # sleep a little and try again
            await asyncio.sleep(1)
    # (Omitted: some more URL processing goes on here)
    out_path = Path('out/')
    if not out_path.is_dir():
        out_path.mkdir()
    fname = url.split("/")[-1]
    async with aiofiles.open(out_path / f'{fname}.html', 'w+') as f:
        await f.write(data)
Naively, I was expecting run_until_complete to handle this in such a way that it would finish upon succeeding at requesting all URLs
The term "complete" is meant in the technical sense of a coroutine completing (running its course), which is achieved either by the coroutine returning or raising an exception.
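A self-contained illustration of that: with return_exceptions=True, a failing coroutine "completes" by producing its exception as a result, so the loop finishes even though some URLs never succeeded. The failure below is simulated with a made-up fetch so the snippet runs without a network:

```python
import asyncio

async def fetch(ok):
    await asyncio.sleep(0)
    if not ok:
        # Simulated ClientConnectorError-style failure.
        raise ValueError("connect call failed")
    # The success path falls off the end, hence the None entries in results.

async def main():
    return await asyncio.gather(fetch(True), fetch(False),
                                return_exceptions=True)

results = asyncio.run(main())
print(results)  # [None, ValueError('connect call failed')]
```

This is exactly why the question's results variable mixes None entries (successes) with exception instances (failures) at the matching indices.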
Is it possible to open 50 different URLs at the same time using Selenium with Python?
Is it possible using threading?
If so, how would I go about doing this?
If not, then what would be a good method to do so?
You can try the snippet below to open 50 URLs one by one in new tabs:

urls = ["http://first.com", "http://second.com", ...]

for url in urls:
    driver.execute_script('window.open("%s")' % url)
You can use Celery (a distributed task queue) to open all these URLs.

Or you can use async and await with aiohttp on Python >= 3.5, which runs on a single thread in a single process but concurrently (it uses the wait time on one URL to fetch the others).

Here is a code sample for the same. The loop takes care of scheduling these concurrent tasks.
#!/usr/local/bin/python3.5
import asyncio
from aiohttp import ClientSession

async def hello(url):
    async with ClientSession() as session:
        async with session.get(url) as response:
            response = await response.read()
            print(response)

loop = asyncio.get_event_loop()
loop.run_until_complete(hello("http://httpbin.org/headers"))
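The sample above fetches a single URL; to open all 50 concurrently you would gather one coroutine per URL. A sketch with the network call simulated by asyncio.sleep, so the overlap is measurable without a network connection:

```python
import asyncio
import time

async def fetch(url):
    # Stand-in for session.get(url): the await is where other fetches run.
    await asyncio.sleep(0.1)
    return url

async def main(urls):
    return await asyncio.gather(*(fetch(u) for u in urls))

urls = ["http://example.com/{0}".format(i) for i in range(50)]
start = time.perf_counter()
results = asyncio.run(main(urls))
elapsed = time.perf_counter() - start
print("{0} urls in {1:.2f}s".format(len(results), elapsed))
```

All 50 waits overlap, so the total time stays close to a single 0.1-second wait instead of 50 of them.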
Well, opening 50 URLs at the same time seems unreasonable and will require a LOT of processing, but it is possible. However, I would recommend a form of iteration, opening one URL at a time, 50 times.
urls = ['list of urls here', '2nd url', ...]
driver = webdriver.Firefox()

for url in urls:
    driver.get(url)
    ...  # rest of your code

driver.quit()
But... you can make one driver.get('url') for each URL you want, using different drivers, or tabs. It will require a lot of processing, though.