I'm working on a project that parses data from a lot of websites. Most of my code is done, so I was looking forward to using asyncio to eliminate the I/O waiting, but I still wanted to test whether threading would work better or worse. To do that, I wrote some simple code to make requests to 100 websites. By the way, I'm using the requests_html library for this; fortunately, it supports asynchronous requests as well.
The asyncio code looks like this:
import requests
import time
from requests_html import AsyncHTMLSession
aio_session = AsyncHTMLSession()
urls = [...] # 100 urls
async def fetch(url):
    try:
        response = await aio_session.get(url, timeout=5)
        status = 200
    except requests.exceptions.ConnectionError:
        status = 404
    except requests.exceptions.ReadTimeout:
        status = 408
    if status == 200:
        return {
            'url': url,
            'status': status,
            'html': response.html
        }
    return {
        'url': url,
        'status': status
    }
def extract_html(urls):
    tasks = []
    for url in urls:
        tasks.append(lambda url=url: fetch(url))
    websites = aio_session.run(*tasks)
    return websites
if __name__ == "__main__":
    start_time = time.time()
    websites = extract_html(urls)
    print(time.time() - start_time)
Execution time (multiple tests):
13.466366291046143
14.279950618743896
12.980706453323364
BUT
if I run an example with threading:
from queue import Queue
import requests
from requests_html import HTMLSession
from threading import Thread
import time
num_fetch_threads = 50
enclosure_queue = Queue()
html_session = HTMLSession()
urls = [...] # 100 urls
def fetch(i, q):
    while True:
        url = q.get()
        try:
            response = html_session.get(url, timeout=5)
            status = 200
        except requests.exceptions.ConnectionError:
            status = 404
        except requests.exceptions.ReadTimeout:
            status = 408
        q.task_done()
if __name__ == "__main__":
    for i in range(num_fetch_threads):
        worker = Thread(target=fetch, args=(i, enclosure_queue,))
        worker.setDaemon(True)
        worker.start()
    start_time = time.time()
    for url in urls:
        enclosure_queue.put(url)
    enclosure_queue.join()
    print(time.time() - start_time)
Execution time (multiple tests):
7.476433515548706
6.786043643951416
6.717151403427124
The thing I don't understand: both libraries are used to tackle I/O waiting, so why are the threads faster? The more I increase the number of threads, the more resources it uses, but it's also a lot faster. Can someone please explain why threads are faster than asyncio in my example?
Thanks in advance.
It turns out requests-html uses a pool of threads for running the requests. The default number of threads is the number of cores on the machine multiplied by 5. This probably explains the difference in performance you noticed.
You might want to try the experiment again using aiohttp instead. In the case of aiohttp, the underlying socket for the HTTP connection is actually registered in the asyncio event loop, so no threads should be involved here.
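For reference, here is a minimal sketch of the same 100-URL experiment using aiohttp. The 5-second timeout and the result dictionaries mirror the code in your question; everything else (names, error mapping) is my own assumption, not tested against your URL list:
import asyncio
import time

import aiohttp

urls = [...]  # 100 urls, as in the original example

async def fetch(session, url):
    # The socket is registered directly in the event loop; no worker threads.
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=5)) as response:
            html = await response.text()
            return {'url': url, 'status': response.status, 'html': html}
    except (aiohttp.ClientConnectionError, asyncio.TimeoutError):
        return {'url': url, 'status': None}

async def main():
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))

if __name__ == "__main__":
    start_time = time.time()
    websites = asyncio.run(main())
    print(time.time() - start_time)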
Related
I am trying to open multiple web sessions and save the data into a CSV. I have written my code using a for loop and requests.get, but it is taking very long to access the 90 web locations. Can anyone tell me how to run the whole process in parallel over loc_var?
The code works fine; the only issue is that it runs one by one over loc_var, which takes a long time.
I want to access all the for-loop loc_var URLs in parallel and do the CSV write operation.
Below is the code:
import pandas as pd
import numpy as np
import os
import requests
import datetime
import zipfile
t=datetime.date.today()-datetime.timedelta(2)
server = [("A","web1",":5000","username=usr&password=p7Tdfr")]
'''List of all web_ips'''
web_1 = ["Web1","Web2","Web3","Web4","Web5","Web6","Web7","Web8","Web9","Web10","Web11","Web12","Web13","Web14","Web15"]
'''List of All location'''
loc_var =["post1","post2","post3","post4","post5","post6","post7","post8","post9","post10","post11","post12","post13","post14","post15","post16","post17","post18"]
for s, web, port, usr in server:
    login_url = 'http://'+web+port+'/api/v1/system/login/?'+usr
    print(login_url)
    s = requests.session()
    login_response = s.post(login_url)
    print("login Response", login_response)
    # Start accessing the Web for loc_var
    for mkt in loc_var:
        # output is a CSV file
        com_actions_url = 'http://'+web+port+'/api/v1/3E+date(%5C%22'+str(t)+'%5C%22)and+location+%3D%3D+%27'+mkt+'%27%22&page_size=-1&format=%22csv%22'
        print("com_action_url", com_actions_url)
        r = s.get(com_actions_url)
        print("action", r)
        if r.ok == True:
            with open(os.path.join("/home/Reports_DC/", "relation_%s.csv" % mkt), 'wb') as f:
                f.write(r.content)
        # If loc is not accessible, try with another web from the web_1 list
        if r.ok == False:
            while r.ok == False:
                for web_2 in web_1:
                    login_url = 'http://'+web_2+port+'/api/v1/system/login/?'+usr
                    com_actions_url = 'http://'+web_2+port+'/api/v1/3E+date(%5C%22'+str(t)+'%5C%22)and+location+%3D%3D+%27'+mkt+'%27%22&page_size=-1&format=%22csv%22'
                    login_response = s.post(login_url)
                    print("login Response", login_response)
                    print("com_action_url", com_actions_url)
                    r = s.get(com_actions_url)
                    if r.ok == True:
                        with open(os.path.join("/home/Reports_DC/", "relation_%s.csv" % mkt), 'wb') as f:
                            f.write(r.content)
                        break
There are multiple approaches you can take to make concurrent HTTP requests. Two that I've used are (1) multiple threads with concurrent.futures.ThreadPoolExecutor or (2) sending the requests asynchronously with asyncio/aiohttp.
To use a thread pool to send your requests in parallel, you would first generate a list of the URLs you want to fetch (in your case, the login_urls and com_action_urls), and then request all of them concurrently as follows:
from concurrent.futures import ThreadPoolExecutor

import requests

def fetch(url):
    page = requests.get(url)
    return page.text
    # Catch HTTP errors/exceptions here

pool = ThreadPoolExecutor(max_workers=5)

urls = ['http://www.google.com', 'http://www.yahoo.com', 'http://www.bing.com']  # Create a list of urls

for page in pool.map(fetch, urls):
    # Do whatever you want with the results ...
    print(page[0:100])
Using asyncio/aiohttp is generally faster than the threaded approach above, but the learning curve is steeper. Here is a simple example (Python 3.7+):
import asyncio

import aiohttp

urls = ['http://www.google.com', 'http://www.yahoo.com', 'http://www.bing.com']

async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.text()
        # Catch HTTP errors/exceptions here

async def fetch_concurrent(urls):
    async with aiohttp.ClientSession() as session:
        tasks = []
        for u in urls:
            tasks.append(asyncio.create_task(fetch(session, u)))
        for result in asyncio.as_completed(tasks):
            page = await result
            # Do whatever you want with the results
            print(page[0:100])

asyncio.run(fetch_concurrent(urls))
But unless you are going to be making a huge number of requests, the threaded approach will likely be sufficient (and way easier to implement).
I have 500,000 URLs and want to get the response for each asynchronously.
import aiohttp
import asyncio

@asyncio.coroutine
def worker(url):
    response = yield from aiohttp.request('GET', url, connector=aiohttp.TCPConnector(share_cookies=True, verify_ssl=False))
    body = yield from response.read_and_close()
    print(url)

def main():
    url_list = []  # hundreds of thousands of urls, extracted from a file
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait([worker(u) for u in url_list]))

main()
I want 200 connections at a time (200 concurrent), not more than that, because
when I run this program for 50 urls it works fine, i.e. with url_list[:50],
but if I pass the whole list, I get this error:
aiohttp.errors.ClientOSError: Cannot connect to host www.example.com:443 ssl:True Future/Task exception was never retrieved future: Task()
Maybe the frequency is too high and the server refuses to respond after a limit?
Yes, one can expect a server to stop responding when you cause too much traffic (whatever the definition of "too much traffic" is).
One way to limit the number of concurrent requests (to throttle them) in such cases is to use asyncio.Semaphore, similar in use to those used in multithreading: just like there, you create a semaphore and make sure the operation you want to throttle acquires the semaphore before doing the actual work and releases it afterwards.
For your convenience, asyncio.Semaphore implements the context manager protocol to make this even easier.
Most basic approach:
CONCURRENT_REQUESTS = 200

@asyncio.coroutine
def worker(url, semaphore):
    # Acquiring/releasing the semaphore using a context manager.
    with (yield from semaphore):
        response = yield from aiohttp.request(
            'GET',
            url,
            connector=aiohttp.TCPConnector(share_cookies=True,
                                           verify_ssl=False))
        body = yield from response.read_and_close()
        print(url)

def main():
    url_list = []  # hundreds of thousands of urls, extracted from a file
    semaphore = asyncio.Semaphore(CONCURRENT_REQUESTS)
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait([worker(u, semaphore) for u in url_list]))
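For what it's worth, on current Python the same throttling pattern would look like the sketch below, using async/await syntax and a modern aiohttp ClientSession. This is not the answer's original code; only the 200-connection limit is carried over from the question:
import asyncio

import aiohttp

CONCURRENT_REQUESTS = 200

async def worker(session, url, semaphore):
    # The semaphore caps how many coroutines are inside this block at once.
    async with semaphore:
        async with session.get(url) as response:
            body = await response.read()
            print(url)

async def main():
    url_list = []  # hundreds of thousands of urls, extracted from a file
    semaphore = asyncio.Semaphore(CONCURRENT_REQUESTS)
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(worker(session, u, semaphore) for u in url_list))

asyncio.run(main())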
Is there a way to speed up my code using the multiprocessing interface? The problem is that this interface uses a map function, which works only with one function, but my code has three functions. I tried to combine my functions into one, but didn't succeed. My script reads a site's URL from a file and performs three functions on it. The for loop makes it very slow, because I have a lot of URLs.
import requests

def Login(url):  # Log in
    payload = {
        'UserName_Text' : 'user',
        'UserPW_Password' : 'pass',
        'submit_ButtonOK' : 'return buttonClick;'
    }
    try:
        p = session.post(url+'/login.jsp', data = payload, timeout=10)
    except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):
        print "site is DOWN! :", url[8:]
        session.cookies.clear()
        session.close()
    else:
        print 'OK: ', p.url

def Timer(url):  # Measure request time
    try:
        timer = requests.get(url+'/login.jsp').elapsed.total_seconds()
    except (requests.exceptions.ConnectionError):
        print 'Request time: None'
        print '-----------------------------------------------------------------'
    else:
        print 'Request time:', round(timer, 2), 'sec'

def Logout(url):  # Log out
    try:
        logout = requests.get(url+'/logout.jsp', params={'submit_ButtonOK' : 'true'}, cookies = session.cookies)
    except (requests.exceptions.ConnectionError):
        pass
    else:
        print 'Logout '  # , logout.url
        print '-----------------------------------------------------------------'
        session.cookies.clear()
        session.close()

for line in open('text.txt').read().splitlines():
    session = requests.session()
    Login(line)
    Timer(line)
    Logout(line)
Yes, you can use multiprocessing.
from multiprocessing import Pool

def f(line):
    session = requests.session()
    Login(session, line)
    Timer(session, line)
    Logout(session, line)

if __name__ == '__main__':
    urls = open('text.txt').read().splitlines()
    p = Pool(5)
    print(p.map(f, urls))
The requests session cannot be global and shared between workers; every worker should use its own session, so the three functions need to take the session as a parameter instead of relying on the global.
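A minimal sketch of what that change looks like for Login (my adaptation of the question's code, not tested; Timer and Logout would be adapted the same way):
import requests

def Login(session, url):  # Log in, using the session passed in by the worker
    payload = {
        'UserName_Text': 'user',
        'UserPW_Password': 'pass',
        'submit_ButtonOK': 'return buttonClick;'
    }
    try:
        p = session.post(url + '/login.jsp', data=payload, timeout=10)
    except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):
        print('site is DOWN! : ' + url[8:])
        session.cookies.clear()
        session.close()
    else:
        print('OK: ' + p.url)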
You write that you already "tried to combine my functions into one, but didn't succeed". What exactly didn't work?
There are many ways to accomplish your task, but multiprocessing is not needed at that level; it will just add complexity, imho.
Take a look at gevent, greenlets and monkey patching instead!
Once your code is ready, you can wrap a main function into a gevent loop, and if you applied the monkey patches, the gevent framework will run N jobs concurrently (you can create a pool of jobs, set limits on concurrency, etc.; there is a pool sketch at the end of this answer).
This example should help:
#!/usr/bin/python
# Copyright (c) 2009 Denis Bilenko. See LICENSE for details.
"""Spawn multiple workers and wait for them to complete"""
from __future__ import print_function
import sys

urls = ['http://www.google.com', 'http://www.yandex.ru', 'http://www.python.org']

import gevent
from gevent import monkey

# patches stdlib (including socket and ssl modules) to cooperate with other greenlets
monkey.patch_all()

if sys.version_info[0] == 3:
    from urllib.request import urlopen
else:
    from urllib2 import urlopen

def print_head(url):
    print('Starting %s' % url)
    data = urlopen(url).read()
    print('%s: %s bytes: %r' % (url, len(data), data[:50]))

jobs = [gevent.spawn(print_head, url) for url in urls]

gevent.wait(jobs)
You can find more in the gevent documentation and in the GitHub repository from which this example comes.
P.S.
Greenlets work with requests as well; you don't need to change your code.
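To make the pool/concurrency-limit point concrete, here is a minimal sketch combining gevent.pool.Pool with requests, assuming the question's three functions have been changed to accept the session as a parameter (the pool size of 10 is arbitrary):
from gevent import monkey

# Patch the stdlib first, so the sockets used by requests become cooperative.
monkey.patch_all()

import requests
from gevent.pool import Pool

def process(url):
    # One session per greenlet, then the three steps from the question.
    session = requests.session()
    Login(session, url)
    Timer(session, url)
    Logout(session, url)

urls = open('text.txt').read().splitlines()
pool = Pool(10)  # at most 10 greenlets run concurrently
pool.map(process, urls)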
I want to make a lot of URL requests to a REST webservice, typically between 75k and 90k. However, I need to throttle the number of concurrent connections to the webservice.
I started playing around with grequests in the following manner, but quickly started chewing up open sockets.
concurrent_limit = 30
urls = buildUrls()
hdrs = {'Host' : 'hostserver'}
g_requests = (grequests.get(url, headers=hdrs) for url in urls)
g_responses = grequests.map(g_requests, size=concurrent_limit)
After this runs for a minute or so, I get hit with 'maximum number of sockets reached' errors.
As far as I can tell, each one of the requests.get calls in grequests uses its own session, which means a new socket is opened for each request.
I found a note on GitHub describing how to make grequests use a single session. But this seems to effectively bottleneck all requests into a single shared pool, which defeats the purpose of asynchronous HTTP requests.
s = requests.session()
rs = [grequests.get(url, session=s) for url in urls]
grequests.map(rs)
Is it possible to use grequests or gevent.Pool in a way that creates a number of sessions?
Put another way: how can I make many concurrent HTTP requests, using either queuing or connection pooling?
I ended up not using grequests to solve my problem. I'm still hopeful it might be possible.
I used threading:
from queue import Queue
from threading import Thread

import requests

class MyAwesomeThread(Thread):
    """
    Threading wrapper to handle counting and processing of tasks
    """
    def __init__(self, session, q):
        self.q = q
        self.count = 0
        self.session = session
        self.response = None
        Thread.__init__(self)

    def run(self):
        """TASK RUN BY THREADING"""
        while True:
            url, host = self.q.get()
            httpHeaders = {'Host' : host}
            self.response = self.session.get(url, headers=httpHeaders)
            # handle response here
            self.count += 1
            self.q.task_done()
        return

q = Queue()
threads = []
for i in range(CONCURRENT):
    session = requests.session()
    t = MyAwesomeThread(session, q)
    t.daemon = True  # allows us to send an interrupt
    threads.append(t)

## build urls and add them to the Queue
for url in buildurls():
    q.put_nowait((url, host))

## start the threads
for t in threads:
    t.start()
rs is a list of AsyncRequest objects; each AsyncRequest has its own session.
rs = [grequests.get(url) for url in urls]
grequests.map(rs)
for ar in rs:
    print(ar.session.cookies)
Something like this:
NUM_SESSIONS = 50

sessions = [requests.Session() for i in range(NUM_SESSIONS)]

reqs = []
i = 0
for url in urls:
    reqs.append(grequests.get(url, session=sessions[i % NUM_SESSIONS]))
    i += 1

responses = grequests.map(reqs, size=NUM_SESSIONS*5)
That should spread the requests over 50 different sessions.
What is the best way to make an asynchronous call appear synchronous? E.g., something like the following, but how do I coordinate the calling thread and the async reply thread? In Java I might use a CountDownLatch() with a timeout, but I can't find a definite solution for Python.
def getDataFromAsyncSource():
    asyncService.subscribe(callback=functionToCallbackTo)
    # wait for data
    return dataReturned

def functionToCallbackTo(data):
    dataReturned = data
There is a module you can use:
import concurrent.futures
Check this post for sample code and a module download link: Concurrent Tasks Execution in Python
You can put executor results in futures, then retrieve them; here is the sample code from http://pypi.python.org:
import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

def load_url(url, timeout):
    return urllib.request.urlopen(url, timeout=timeout).read()

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    future_to_url = dict((executor.submit(load_url, url, 60), url)
                         for url in URLS)
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        if future.exception() is not None:
            print('%r generated an exception: %s' % (url, future.exception()))
        else:
            print('%r page is %d bytes' % (url, len(future.result())))
A common solution would be to use a synchronized Queue and pass it to the callback function. See http://docs.python.org/library/queue.html.
So for your example, this could look like the following (I'm just guessing the API for passing additional arguments to the callback function):
from Queue import Queue

def async_call():
    q = Queue()
    asyncService.subscribe(callback=callback, args=(q,))
    data = q.get()
    return data

def callback(data, q):
    q.put(data)
This solution uses the threading module internally, so it might not work depending on your async library.
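If you also want the timeout behaviour of Java's CountDownLatch, q.get accepts one. A sketch of the same Queue variant with a timeout (asyncService and its callback API are assumed, as above; the 5-second limit is arbitrary):
from Queue import Queue, Empty

TIMEOUT_SECONDS = 5

def async_call():
    q = Queue()
    asyncService.subscribe(callback=callback, args=(q,))
    try:
        # Block until the callback delivers data, or give up after the timeout.
        data = q.get(timeout=TIMEOUT_SECONDS)
    except Empty:
        data = None  # timed out, roughly like CountDownLatch.await(timeout, unit)
    return data

def callback(data, q):
    q.put(data)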