This code (snippet_1) is adapted from the ThreadPoolExecutor example in the docs:
import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/']

# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        data = future.result()
        print('%r page is %d bytes' % (url, len(data)))

print('after')
which works well, and gets
'http://www.foxnews.com/' page is 990869 bytes
'http://www.cnn.com/' page is 990869 bytes
'http://www.bbc.co.uk/' page is 990869 bytes
'http://europe.wsj.com/' page is 990869 bytes
after
This code (snippet_2) is my own attempt to do the same job with direct function calls:
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/']

for url in URLS:
    with urllib.request.urlopen(url, timeout=60) as conn:
        data = conn.read()
        print('%r page is %d bytes' % (url, len(data)))

print('after')
snippet_1 seems to be more common, but why?
When you are reading things from a network, your application will probably spend most of its time waiting on a reply.
Normally, the Global Interpreter Lock inside CPython (the Python implementation you are probably using) ensures that only one thread at a time is executing Python bytecode.
But when waiting for I/O (including network I/O) the GIL is released, giving other threads an opportunity to run. That means that multiple reads are effectively running in parallel instead of one after another, shortening the overall execution time.
For a handful of URIs that won't make much of a difference, but the more URIs you use, the more noticeable it gets.
So the ThreadPoolExecutor is mainly useful for running I/O operations in parallel. The ProcessPoolExecutor, on the other hand, is useful for running CPU-intensive tasks in parallel; since it uses multiple processes, the restriction of the GIL doesn't apply.
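To illustrate the last point, here is a minimal sketch of the CPU-bound case (the cpu_heavy function and the input sizes are made up purely for illustration):

import concurrent.futures

def cpu_heavy(n):
    # deliberately CPU-bound work: no I/O, so threads would be serialized by the GIL
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = list(executor.map(cpu_heavy, [10**6, 2 * 10**6, 3 * 10**6]))
    print(results)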
I want to fetch some websites' source for a project. When I try to get a response, the program just gets stuck waiting for it; no matter how long I wait, there is no timeout and no response. Here is my code:
import urllib.request

link = "https://eu.mouser.com/"
linkResponse = urllib.request.urlopen(link)
readedResponse = linkResponse.readlines()
writer = open("html.txt", "w")
for line in readedResponse:
    writer.write(str(line))
    writer.write("\n")
writer.close()
When I try other websites, urlopen returns their response, but for "eu.mouser.com" and "uk.farnell.com" it never returns anything, not even a timeout. What is the problem there? Is there another way to get a website's source? (Sorry for my bad English.)
The urllib.request.urlopen docs claim that:
The optional timeout parameter specifies a timeout in seconds for
blocking operations like the connection attempt (if not specified, the
global default timeout setting will be used). This actually only works
for HTTP, HTTPS and FTP connections.
without explaining how to find said default. I managed to provoke a timeout by directly providing 5 seconds as the timeout:
import urllib.request
url = "https://uk.farnell.com"
urllib.request.urlopen(url, timeout=5)
gives
socket.timeout: The read operation timed out
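As a side note, the global default timeout that the docs refer to can be inspected and changed via the socket module; a small sketch (the 5-second value is arbitrary):

import socket
import urllib.request

print(socket.getdefaulttimeout())  # None by default, i.e. block indefinitely
socket.setdefaulttimeout(5)        # urlopen() calls without timeout= now use 5 seconds
urllib.request.urlopen("https://uk.farnell.com")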
There are some sites that protect themselves from automated crawlers by implementing mechanisms that detect such bots. These can be very diverse and also change over time. If you really want to do everything you can to get the page crawled automatically, this usually means that you have to implement steps yourself to circumvent these protective barriers.
One example of this is the header information that is sent with every request. It can be changed before making the request, e.g. via requests' header customization. But there are probably more things to do here and there.
If you're interested in starting developing such a thing (leaving aside the question of whether this is allowed at all), you can take this as a starting point:
from collections import namedtuple
from contextlib import suppress

import requests
from requests import ReadTimeout

Link = namedtuple("Link", ["url", "filename"])

links = {
    Link("https://eu.mouser.com/", "mouser.com"),
    Link("https://example.com/", "example1.com"),
    Link("https://example.com/", "example2.com"),
}

for link in links:
    with suppress(ReadTimeout):
        response = requests.get(link.url, timeout=3)
        with open(f"html-{link.filename}.txt", "w", encoding="utf-8") as file:
            file.write(response.text)
Here, protected sites that lead to ReadTimeout errors are simply ignored, with the possibility to go further, e.g. by enhancing requests.get(link.url, timeout=3) with a suitable headers parameter. But as I already mentioned, this is probably not the only customization that would have to be made, and the legal aspects should also be clarified.
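A minimal sketch of that headers customization (the User-Agent string is only an example and may well not be enough on its own):

import requests

headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"}
response = requests.get("https://eu.mouser.com/", headers=headers, timeout=3)
print(response.status_code)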
I want to scrape data from a website concurrently, but I found that the following program is NOT executed concurrently.
import asyncio

import requests
from bs4 import BeautifulSoup

async def return_soup(url):
    r = requests.get(url)
    r.encoding = "utf-8"
    soup = BeautifulSoup(r.text, "html.parser")

    future = asyncio.Future()
    future.set_result(soup)
    return future

async def parseURL_async(url):
    print("Started to download {0}".format(url))
    soup = await return_soup(url)
    print("Finished downloading {0}".format(url))
    return soup

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)

# url_1 and url_2 are defined elsewhere
t = [parseURL_async(url_1), parseURL_async(url_2)]
loop.run_until_complete(asyncio.gather(*t))
However, this program starts to download the second page only after the first one finishes. If my understanding is correct, the await keyword on await return_soup(url) waits for the function to complete, and while waiting for the completion it returns control to the event loop, which enables the loop to start the second download.
And once the function finally finishes the execution, the future instance within it gets the result value.
But why does this not work concurrently? What am I missing here?
Using asyncio is different from using threads in that you cannot add it to an existing code base to make it concurrent. Specifically, code that runs in the asyncio event loop must not block - all blocking calls must be replaced with non-blocking versions that yield control to the event loop. In your case, requests.get blocks and defeats the parallelism implemented by asyncio.
To avoid this problem, you need to use an http library that is written with asyncio in mind, such as aiohttp.
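For example, a minimal sketch of the same idea based on aiohttp (assuming aiohttp and beautifulsoup4 are installed; the example.com URLs stand in for url_1 and url_2):

import asyncio

import aiohttp
from bs4 import BeautifulSoup

async def return_soup(session, url):
    # the HTTP request itself is now non-blocking
    async with session.get(url) as resp:
        text = await resp.text()
    return BeautifulSoup(text, "html.parser")

async def main(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(return_soup(session, u) for u in urls))

soups = asyncio.run(main(["https://example.com", "https://example.org"]))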
I'll add a little more to user4815162342's response. The asyncio framework uses coroutines that must cede control of the thread while they do the long operation. See the diagram at the end of this section for a nice graphical representation. As user4815162342 mentioned, the requests library doesn't support asyncio. I know of two ways to make this work concurrently. First, is to do what user4815162342 suggested and switch to a library with native support for asynchronous requests. The second is to run this synchronous code in separate threads or processes. The latter is easy because of the run_in_executor function.
import asyncio

import requests
from bs4 import BeautifulSoup

loop = asyncio.get_event_loop()

async def return_soup(url):
    r = await loop.run_in_executor(None, requests.get, url)
    r.encoding = "utf-8"
    return BeautifulSoup(r.text, "html.parser")

async def parseURL_async(url):
    print("Started to download {0}".format(url))
    soup = await return_soup(url)
    print("Finished downloading {0}".format(url))
    return soup

# url_1 and url_2 as in the question
t = [parseURL_async(url_1), parseURL_async(url_2)]
loop.run_until_complete(asyncio.gather(*t))
This solution removes some of the benefit of using asyncio, as the long operation will still probably be executed from a fixed size thread pool, but it's also much easier to start with.
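If the default pool size ever becomes the bottleneck, run_in_executor also accepts an explicit executor as its first argument; a sketch (the worker count of 20 and the example.com URLs are arbitrary):

import asyncio
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

executor = ThreadPoolExecutor(max_workers=20)

async def return_soup(url):
    loop = asyncio.get_running_loop()
    r = await loop.run_in_executor(executor, requests.get, url)
    r.encoding = "utf-8"
    return BeautifulSoup(r.text, "html.parser")

async def main():
    return await asyncio.gather(return_soup("https://example.com"),
                                return_soup("https://example.org"))

soups = asyncio.run(main())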
The reason as mentioned in other answers is the lack of library support for coroutines.
As of Python 3.9, though, you can use the function asyncio.to_thread as an alternative for I/O concurrency.
Obviously this is not exactly equivalent because, as the name suggests, it runs your functions in separate threads as opposed to the single thread of the event loop, but it can be a way to achieve I/O concurrency without relying on proper async support from the library.
In your example the code would be:
import asyncio

import requests
from bs4 import BeautifulSoup

def return_soup(url):
    r = requests.get(url)
    r.encoding = "utf-8"
    return BeautifulSoup(r.text, "html.parser")

def parseURL_async(url):
    print("Started to download {0}".format(url))
    soup = return_soup(url)
    print("Finished downloading {0}".format(url))
    return soup

async def main():
    result_url_1, result_url_2 = await asyncio.gather(
        asyncio.to_thread(parseURL_async, url_1),
        asyncio.to_thread(parseURL_async, url_2),
    )

asyncio.run(main())
Currently taking a web scraping class with other students, and we are supposed to make ‘get’ requests to a dummy site, parse it, and visit another site.
The problem is, the content of the dummy site is only up for several minutes and disappears, and the content comes back up at a certain interval. During the time the content is available, everyone tries to make the ‘get’ requests, so mine just hangs until everyone clears up, and the content eventually disappears. So I end up not being able to successfully make the ‘get’ request:
import requests
from splinter import Browser
browser = Browser('chrome')
# Hangs here
requests.get('http://dummysite.ca').text
# Even if get is successful hangs here as well
browser.visit(parsed_url)
So my question is, what's the fastest/best way to make endless concurrent 'get' requests until I get a response?
Decide to use either requests or splinter
Read about Requests: HTTP for Humans
Read about Splinter
Related
Read about keep-alive
Read about blocking-or-non-blocking
Read about timeouts
Read about errors-and-exceptions
If you are able to get requests that don't hang, you can simply retry in a loop, for instance:

import time
import requests

while True:
    response = requests.get('http://dummysite.ca')
    if response.status_code == 200:  # request was successful
        break
    time.sleep(1)
Gevent provides a framework for running asynchronous network requests.
It can patch Python's standard library so that existing libraries like requests and splinter work out of the box.
Here is a short example of how to make 10 concurrent requests, based on the above code, and get their response.
from gevent import monkey
monkey.patch_all()

import gevent.pool
import requests

pool = gevent.pool.Pool(size=10)
greenlets = [pool.spawn(requests.get, 'http://dummysite.ca')
             for _ in range(10)]

# Wait for all requests to complete
pool.join()

for greenlet in greenlets:
    # This will raise any exceptions raised by the request.
    # Need to catch errors, or check if an exception was
    # thrown by checking `greenlet.exception`
    response = greenlet.get()
    text_response = response.text
You could also use map and a response-handling function instead of get.
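For instance, a small sketch of that map variant (same URL and pool size as in the example above):

from gevent import monkey
monkey.patch_all()

import gevent.pool
import requests

pool = gevent.pool.Pool(size=10)

# map blocks until every request has finished and returns the responses in order
responses = pool.map(requests.get, ['http://dummysite.ca'] * 10)
texts = [response.text for response in responses]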
See gevent documentation for more information.
In this situation, concurrency will not help much since the server seems to be the limiting factor. One solution is to send a request with a timeout interval, if the interval has exceeded, then try the request again after a few seconds. Then gradually increase the time between retries until you get the data that you want. For instance, your code might look like this:
import time
import requests

def get_content(url, timeout):
    # raise a Timeout exception if more than `timeout` seconds have passed
    resp = requests.get(url, timeout=timeout)
    # raise a generic exception if the request is unsuccessful
    if resp.status_code != 200:
        raise LookupError('status is not 200')
    return resp.content

timeout = 5  # seconds
retry_interval = 0
max_retry_interval = 120

while True:
    try:
        response = get_content('https://example.com', timeout=timeout)
        retry_interval = 0  # reset retry interval after success
        break
    except (LookupError, requests.exceptions.Timeout):
        retry_interval += 10
        if retry_interval > max_retry_interval:
            retry_interval = max_retry_interval
        time.sleep(retry_interval)

# process response
If concurrency is required, consider the Scrapy project. It uses the Twisted framework. In Scrapy you can replace time.sleep with reactor.callLater(delay, fn, *args, **kw), or use one of its many middleware plugins.
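For reference, a minimal sketch of the non-blocking scheduling that callLater provides (retry_request and the 10-second delay are made up for illustration; Twisted must be installed):

from twisted.internet import reactor

def retry_request():
    # the real retry logic would go here
    print("retrying now")
    reactor.stop()

# schedule the retry 10 seconds from now without blocking the reactor
reactor.callLater(10, retry_request)
reactor.run()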
From the documentation for requests:
If the remote server is very slow, you can tell Requests to wait
forever for a response, by passing None as a timeout value and then
retrieving a cup of coffee.
import requests

# Wait potentially forever
r = requests.get('http://dummysite.ca', timeout=None)

# Check the status code to see how the server is handling the request
print(r.status_code)
Status codes beginning with 2 mean the request was received, understood, and accepted. 200 means the request was a success and the information returned. But 503 means the server is overloaded or undergoing maintenance.
Requests used to include a module called async which could send concurrent requests. It is now an independent module named grequests
which you can use to make concurrent requests endlessly until a 200 response:
import grequests

urls = [
    'http://python-requests.org',  # Just include one url if you want
    'http://httpbin.org',
    'http://python-guide.org',
    'http://kennethreitz.com'
]

def keep_going():
    rs = (grequests.get(u) for u in urls)  # Make a set of unsent Requests
    out = grequests.map(rs)                # Send them all at the same time
    for i in out:
        if i.status_code == 200:
            print(i.text)
            del urls[out.index(i)]  # If we have the content, delete the URL
            return

while urls:
    keep_going()
I'm trying to download some products from a web page. This web page (according to its robots.txt) allows me to send 2000 requests per minute. The problem is that sending the requests sequentially and then processing them is too time-consuming.
I've realised that the method which sends the request can be moved into a pool, which is much better in terms of time, probably because the processor doesn't have to wait for a response and can send the next request in the meantime.
So I have a pool, and the responses are appended to the list RESPONSES.
Simple code:
from multiprocessing.pool import ThreadPool as Pool

import requests

RESPONSES = []

with open('products.txt') as f:
    LINES = f.readlines()[:100]

def post_request(url):
    html = requests.get(url).content
    RESPONSES.append(html)

def parse_html_return_object(resp):
    # some code here
    pass

def insert_object_into_database():
    pass

pool = Pool(100)

for line in LINES:
    pool.apply_async(post_request, args=(line[:-1],))

pool.close()
pool.join()
What I want is to process those RESPONSES (HTML documents) while the requests are still running, i.e. pop responses from the RESPONSES list and parse them as they come in.
So the timeline could look like this (time -->):
post_request(line1) -> post_request(line2) -> response_line1 -> parse_html_return_object(response) -> post_request ...
Is there some simple way to do that?
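One possibility (a sketch, not a definitive answer): let the pool hand each response back as soon as it arrives via imap_unordered, so parsing overlaps with the downloads that are still in flight. The example.com URLs and the simplified parser below stand in for LINES and the real parse_html_return_object:

from multiprocessing.pool import ThreadPool as Pool

import requests

def post_request(url):
    # download one page and return its HTML
    return requests.get(url).content

def parse_html_return_object(html):
    # stand-in for the real parser from the question
    return len(html)

urls = ['https://example.com', 'https://example.org']  # placeholders for LINES

with Pool(100) as pool:
    # imap_unordered yields each response as soon as its request finishes
    for html in pool.imap_unordered(post_request, urls):
        print(parse_html_return_object(html))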
Here I want to make some modifications to my setting.
I want responses from multiple API calls within a single request made to my server, and I want to combine the results of all these API calls and return them as one response. Up to here pretty much everything follows the examples in the gevent documentation and elsewhere. The catch is that I want to pass the response back incrementally: if the first API call has returned its result, I return that result to the frontend within the one long-waiting request, then wait for the other API calls and pass their results back in the same request.
I have tried to do this in code, but I don't know how to proceed with this setting. gevent's .joinall() and .join() block until all the greenlets have finished getting their responses.
Is there any way I can proceed with gevent in this setting?
The code I am using is from https://bitbucket.org/denis/gevent/src/tip/examples/concurrent_download.py . The .joinall() in the last statement waits until all URLs have finished responding; I want it to be non-blocking so that I can process the responses in the callback function print_head() and return them incrementally.
#!/usr/bin/python
# Copyright (c) 2009 Denis Bilenko. See LICENSE for details.
"""Spawn multiple workers and wait for them to complete"""

urls = ['http://www.google.com', 'http://www.yandex.ru', 'http://www.python.org']

import gevent
from gevent import monkey

# patches stdlib (including socket and ssl modules) to cooperate with other greenlets
monkey.patch_all()

import urllib2

def print_head(url):
    print ('Starting %s' % url)
    data = urllib2.urlopen(url).read()
    print ('%s: %s bytes: %r' % (url, len(data), data[:50]))

jobs = [gevent.spawn(print_head, url) for url in urls]

gevent.joinall(jobs)
If you want to collect the results from multiple greenlets, modify print_head() to return the result and then use the .get() method to collect them all.
Put this after joinall():
total_result = [x.get() for x in jobs]
Actually, joinall() is not even necessary in this case.
If print_head() looks like this:
def print_head(url):
    print ('Starting %s' % url)
    return urllib2.urlopen(url).read()
Then total_result will be a list of size 3 containing the responses from all the requests.
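If the goal is really to hand the results back incrementally, as the question asks, one option worth exploring is gevent.iwait, which yields each greenlet as soon as it finishes instead of blocking until all of them are done. A sketch along the lines of the example above:

import gevent
from gevent import monkey
monkey.patch_all()

import urllib2  # the thread uses Python 2; on Python 3 use urllib.request instead

urls = ['http://www.google.com', 'http://www.yandex.ru', 'http://www.python.org']

def fetch(url):
    return url, urllib2.urlopen(url).read()

jobs = [gevent.spawn(fetch, url) for url in urls]

# each finished greenlet is handled immediately, without waiting for the slower ones
for job in gevent.iwait(jobs):
    url, data = job.get()
    print('%s: %s bytes' % (url, len(data)))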