I want to make a lot of url requets to a REST webserivce. Typically between 75-90k. However, I need to throttle the number of concurrent connections to the webservice.
I started playing around with grequests in the following manner, but quickly started chewing up opened sockets.
concurrent_limit = 30
urllist = buildUrls()
hdrs = {'Host' : 'hostserver'}
g_requests = (grequests.get(url, headers=hdrs) for url in urls)
g_responses = grequests.map(g_requests, size=concurrent_limit)
As this runs for a minute or so, I get hit with 'maximum number of sockets reached' errors.
As far as I can tell, each one of the requests.get calls in grequests uses it's own session which means a new socket is opened for each request.
I found a note on github referring how to make grequests use a single session. But this seems to effectively bottleneck all requests into a single shared pool. That seems to defeat the purpose of asynchronous http requests.
s = requests.session()
rs = [grequests.get(url, session=s) for url in urls]
grequests.map(rs)
Is is possible to use grequests or gevent.Pool in a way that creates a number of sessions?
Put another way: How can I make many concurrent http requests using either through queuing or connection pooling?
I ended up not using grequests to solve my problem. I'm still hopeful it might be possible.
I used threading:
class MyAwesomeThread(Thread):
"""
Threading wrapper to handle counting and processing of tasks
"""
def __init__(self, session, q):
self.q = q
self.count = 0
self.session = session
self.response = None
Thread.__init__(self)
def run(self):
"""TASK RUN BY THREADING"""
while True:
url, host = self.q.get()
httpHeaders = {'Host' : host}
self.response = session.get(url, headers=httpHeaders)
# handle response here
self.count+= 1
self.q.task_done()
return
q=Queue()
threads = []
for i in range(CONCURRENT):
session = requests.session()
t=MyAwesomeThread(session,q)
t.daemon=True # allows us to send an interrupt
threads.append(t)
## build urls and add them to the Queue
for url in buildurls():
q.put_nowait((url,host))
## start the threads
for t in threads:
t.start()
rs is a AsyncRequest list。each AsyncRequest have it's own session.
rs = [grequests.get(url) for url in urls]
grequests.map(rs)
for ar in rs:
print(ar.session.cookies)
Something like this:
NUM_SESSIONS = 50
sessions = [requests.Session() for i in range(NUM_SESSIONS)]
reqs = []
i = 0
for url in urls:
reqs.append(grequests.get(url, session=sessions[i % NUM_SESSIONS]
i+=1
responses = grequests.map(reqs, size=NUM_SESSIONS*5)
That should spread the requests over 50 different sessions.
Related
I'm working on a project that parses data from a lot of websites. Most of my code is done, so i'm looking forward to use asyncio in order to eliminate that I/O waiting, but still i wanted to test how threading would work, better or worse. To do that, i wrote some simple code to make requests to 100 websites. Btw i'm using requests_html library for that, fortunately it supports asynchronous requests as well.
asyncio code looks like:
import requests
import time
from requests_html import AsyncHTMLSession
aio_session = AsyncHTMLSession()
urls = [...] # 100 urls
async def fetch(url):
try:
response = await aio_session.get(url, timeout=5)
status = 200
except requests.exceptions.ConnectionError:
status = 404
except requests.exceptions.ReadTimeout:
status = 408
if status == 200:
return {
'url': url,
'status': status,
'html': response.html
}
return {
'url': url,
'status': status
}
def extract_html(urls):
tasks = []
for url in urls:
tasks.append(lambda url=url: fetch(url))
websites = aio_session.run(*tasks)
return websites
if __name__ == "__main__":
start_time = time.time()
websites = extract_html(urls)
print(time.time() - start_time)
Execution time (multiple tests):
13.466366291046143
14.279950618743896
12.980706453323364
BUT
If i run an example with threading:
from queue import Queue
import requests
from requests_html import HTMLSession
from threading import Thread
import time
num_fetch_threads = 50
enclosure_queue = Queue()
html_session = HTMLSession()
urls = [...] # 100 urls
def fetch(i, q):
while True:
url = q.get()
try:
response = html_session.get(url, timeout=5)
status = 200
except requests.exceptions.ConnectionError:
status = 404
except requests.exceptions.ReadTimeout:
status = 408
q.task_done()
if __name__ == "__main__":
for i in range(num_fetch_threads):
worker = Thread(target=fetch, args=(i, enclosure_queue,))
worker.setDaemon(True)
worker.start()
start_time = time.time()
for url in urls:
enclosure_queue.put(url)
enclosure_queue.join()
print(time.time() - start_time)
Execution time (multiple tests):
7.476433515548706
6.786043643951416
6.717151403427124
The thing that i don't understand .. both libraries are used against I/O problems, but why are threads faster ? The more i increase the number of threads, the more resources it uses but it's a lot faster.. Can someone please explain to me why are threads faster than asyncio in my example ?
Thanks in advance.
It turns out requests-html uses a pool of threads for running the requests. The default number of threads is the number of core on the machine multiplied by 5. This probably explains the difference in performance you noticed.
You might want to try the experiment again using aiohttp instead. In the case of aiohttp, the underlying socket for the HTTP connection is actually registered in the asyncio event loop, so no threads should be involved here.
This question already has answers here:
What is the fastest way to send 100,000 HTTP requests in Python?
(21 answers)
Closed 6 years ago.
For my bachelor thesis I need to grab some data out of about 40000 websites. Therefore I am using python requests, but at the moment it is really slow with getting a response from the server.
Is there anyway to speed it up and keep my current header setting? All tutorials I found where without a header.
Here is my code snipped:
def parse(url):
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) '
'Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
for line in r.iter_lines():
...
Well you can use threads since this is a I/O Bound problem. Using the built in threading library is your best choice. I used the Semaphore object to limit how many threads can run at the same time.
import time
import threading
# Number of parallel threads
lock = threading.Semaphore(2)
def parse(url):
"""
Change to your logic, I just use sleep to mock http request.
"""
print 'getting info', url
sleep(2)
# After we done, subtract 1 from the lock
lock.release()
def parse_pool():
# List of all your urls
list_of_urls = ['website1', 'website2', 'website3', 'website4']
# List of threads objects I so we can handle them later
thread_pool = []
for url in list_of_urls:
# Create new thread that calls to your function with a url
thread = threading.Thread(target=parse, args=(url,))
thread_pool.append(thread)
thread.start()
# Add one to our lock, so we will wait if needed.
lock.acquire()
for thread in thread_pool:
thread.join()
print 'done'
You can use asyncio to run tasks concurrently. you can list the url responses (the ones which are completed as well as pending) using the returned value of asyncio.wait() and call coroutines asynchronously. The results will be in an unexpected order, but it is a faster approach.
import asyncio
import functools
async def parse(url):
print('in parse for url {}'.format(url))
info = await #write the logic for fetching the info, it waits for the responses from the urls
print('done with url {}'.format(url))
return 'parse {} result from {}'.format(info, url)
async def main(sites):
print('starting main')
parses = [
parse(url)
for url in sites
]
print('waiting for phases to complete')
completed, pending = await asyncio.wait(parses)
results = [t.result() for t in completed]
print('results: {!r}'.format(results))
event_loop = asyncio.get_event_loop()
try:
websites = ['site1', 'site2', 'site3']
event_loop.run_until_complete(main(websites))
finally:
event_loop.close()
i think it's a good idea to use mutil-thread like threading or multiprocess, or you can use grequests(async requests) due to gevent
I have 5,00,000 urls. and want to get response of each asynchronously.
import aiohttp
import asyncio
#asyncio.coroutine
def worker(url):
response = yield from aiohttp.request('GET', url, connector=aiohttp.TCPConnector(share_cookies=True, verify_ssl=False))
body = yield from response.read_and_close()
print(url)
def main():
url_list = [] # lacs of urls, extracting from a file
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait([worker(u) for u in url_list]))
main()
I want 200 connections at a time(concurrent 200), not more than this because
when I run this program for 50 urls it works fine, i.e url_list[:50]
but if I pass whole list, i get this error
aiohttp.errors.ClientOSError: Cannot connect to host www.example.com:443 ssl:True Future/Task exception was never retrieved future: Task()
may be frequency is too much and server is denying to respond after a limit?
Yes, one can expect a server to stop responding after causing too much traffic (whatever the definition of "too much traffic") to it.
One way to limit number of concurrent requests (throttle them) in such cases is to use asyncio.Semaphore, similar in use to these used in multithreading: just like there, you create a semaphore and make sure the operation you want to throttle is aquiring that semaphore prior to doing actual work and releasing it afterwards.
For your convenience, asyncio.Semaphore implements context manager to make it even easier.
Most basic approach:
CONCURRENT_REQUESTS = 200
#asyncio.coroutine
def worker(url, semaphore):
# Aquiring/releasing semaphore using context manager.
with (yield from semaphore):
response = yield from aiohttp.request(
'GET',
url,
connector=aiohttp.TCPConnector(share_cookies=True,
verify_ssl=False))
body = yield from response.read_and_close()
print(url)
def main():
url_list = [] # lacs of urls, extracting from a file
semaphore = asyncio.Semaphore(CONCURRENT_REQUESTS)
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait([worker(u, semaphore) for u in url_list]))
I am using python 2.7 on Windows machine. I have an array of urls accompanied by data and headers, so POST method is required.
In simple execution it works well:
rescodeinvalid =[]
success = []
for i in range(0,len(HostArray)):
data = urllib.urlencode(post_data)
req = urllib2.Request(HostArray[i], data)
response = urllib2.urlopen(req)
rescode=response.getcode()
if responsecode == 400:
rescodeinvalid.append(HostArray[i])
if responsecode == 200:
success.append(HostArray[i])
My question is if HostArray length is very large, then it is taking much time in loop.
So, how to check each url of HostArray in a multithread. If response code of each url is 200, then I am doing different operation. I have arrays to store 200 and 400 responses.
So, how to do this in multithread in python
If you want to do each one in a separate thread you could do something like:
rescodeinvalid =[]
success = []
def post_and_handle(url,post_data)
data = urllib.urlencode(post_data)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
rescode=response.getcode()
if responsecode == 400:
rescodeinvalid.append(url) # Append is thread safe
elif responsecode == 200:
success.append(url) # Append is thread safe
workers = []
for i in range(0,len(HostArray)):
t = threading.Thread(target=post_and_handle,args=(HostArray[i],post_data))
t.start()
workers.append(t)
# Wait for all of the requests to complete
for t in workers:
t.join()
I'd also suggest using requests: http://docs.python-requests.org/en/latest/
as well as a thread pool:
Threading pool similar to the multiprocessing Pool?
Thread pool usage:
from multiprocessing.pool import ThreadPool
# Done here because this must be done in the main thread
pool = ThreadPool(processes=50) # use a max of 50 threads
# do this instead of Thread(target=func,args=args,kwargs=kwargs))
pool.apply_async(func,args,kwargs)
pool.close() # I think
pool.join()
scrapy uses twisted library to call multiple urls in parallel without the overhead of opening a new thread per request, it also manage internal queue to accumulate and even prioritize them as a bonus you can also restrict number of parallel requests by settings maximum concurrent requests, you can either launch a scrapy spider as an external process or from your code, just set spider start_urls = HostArray
Your case (basically processing a list into another list) looks like an ideal candidate for concurrent.futures (see for example this answer) or you may go all the way to Executor.map. And of course use ThreadPoolExecutor to limit the number of concurrently running threads to something reasonable.
I am using gevent to download some html pages.
Some websites are way too slow, some stop serving requests after period of time. That is why I had to limit total time for a group of requests I make. For that I use gevent "Timeout".
timeout = Timeout(10)
timeout.start()
def downloadSite():
# code to download site's url one by one
url1 = downloadUrl()
url2 = downloadUrl()
url3 = downloadUrl()
try:
gevent.spawn(downloadSite).join()
except Timeout:
print 'Lost state here'
But the problem with it is that i loose all the state when exception fires up.
Imagine I crawl site 'www.test.com'. I have managed to download 10 urls right before site admins decided to switch webserver for maintenance. In such case i will lose information about crawled pages when exception fires up.
The question is - how do I save state and process the data even if Timeout happens ?
Why not try something like:
timeout = Timeout(10)
def downloadSite(url):
with Timeout(10):
downloadUrl(url)
urls = ["url1", "url2", "url3"]
workers = []
limit = 5
counter = 0
for i in urls:
# limit to 5 URL requests at a time
if counter < limit:
workers.append(gevent.spawn(downloadSite, i))
counter += 1
else:
gevent.joinall(workers)
workers = [i,]
counter = 0
gevent.joinall(workers)
You could also save a status in a dict or something for every URL, or append the ones that fail in a different array, to retry later.
A self-contained example:
import gevent
from gevent import monkey
from gevent import Timeout
gevent.monkey.patch_all()
import urllib2
def get_source(url):
req = urllib2.Request(url)
data = None
with Timeout(2):
response = urllib2.urlopen(req)
data = response.read()
return data
N = 10
urls = ['http://google.com' for _ in xrange(N)]
getlets = [gevent.spawn(get_source, url) for url in urls]
gevent.joinall(getlets)
contents = [g.get() for g in getlets]
print contents[5]
It implements one timeout for each request. In this example, contents contains 10 times the HTML source of google.com, each retrieved in an independent request. If one of the requests had timed out, the corresponding element in contents would be None. If you have questions about this code, don't hesitate to ask in the comments.
I saw your last comment. Defining one timeout per request definitely is not wrong from the programming point of view. If you need to throttle traffic to the website, then just don't spawn 100 greenlets simultaneously. Spawn 5, wait until they returned. Then, you can possibly wait for a given amount of time, and spawn the next 5 (already shown in the other answer by Gabriel Samfira, as I see now). For my code above, this would mean, that you would have to repeatedly call
N = 10
urls = ['http://google.com' for _ in xrange(N)]
getlets = [gevent.spawn(get_source, url) for url in urls]
gevent.joinall(getlets)
contents = [g.get() for g in getlets]
whereas N should not be too high.