I am downloading some information from webpages in the form
http://example.com?p=10
http://example.com?p=20
...
The point is that I don't know how many they are. At some point I will receive an error from the server, or maybe at some point I want to stop the processing since I have enough. I want to run them in parallel.
def generator_query(step=10):
i = 0
yield "http://example.com?p=%d" % i
i += step
def task(url):
t = request.get(url).text
if not t: # after the last one
return None
return t
I can implement it with consumer/producer pattern with queues, but I am wondering it is possible to have an higher level implementation, for example with the concurrent module.
Non-concurrent example:
results = []
for url in generator_query():
results.append(task(url))
You could use concurrent's ThreadPoolExecutor. An example of how to use it is provided here.
You'll need to break out of the example's for-loop, when you're getting invalid answers from the server (the except section) or whenever you feel like you got enough data (you could count valid responses in the else section for example).
You could use aiohttp for this purpose:
async def fetch(session, url):
async with session.get(url) as response:
return await response.text()
async def coro(step):
url = 'https://example.com?p={}'.format(step)
async with aiohttp.ClientSession() as session:
html = await fetch(session, url)
print(html)
if __name__ == '__main__':
loop = asyncio.get_event_loop()
tasks = [coro(i*10) for i in range(10)]
loop.run_until_complete(asyncio.wait(tasks))
as for the page error, you might have to figure it yourself since I don't know what website you're dealing with. Maybe try...except?
Notice: if your python version is higher than 3.5, it might cause an ssl certificate verification error.
Related
I have a scraper that requires the use of a websocket server (can't go into too much detail on why because of company policy) that I'm trying to turn into a template/module for easier use on other websites.
I have one main function that runs the loop of the server (e.g. ping-pongs to keep the connection alive and send work and stop commands when necessary) that I'm trying to turn into a generator that yields the HTML of scraped pages (asynchronously, of course). However, I can't figure out a way to turn the server into a generator.
This is essentially the code I would want (simplified to just show the main idea, of course):
import asyncio, websockets
needsToStart = False # Setting this to true gets handled somewhere else in the script
async def run(ws):
global needsToStart
while True:
data = await ws.recv()
if data == "ping":
await ws.send("pong")
elif "<html" in data:
yield data # Yielding the page data
if needsToStart:
await ws.send("work") # Starts the next scraping session
needsToStart = False
generator = websockets.serve(run, 'localhost', 9999)
while True:
html = await anext(generator)
# Do whatever with html
This, of course, doesn't work, giving the error "TypeError: 'Serve' object is not callable". But is there any way to set up something along these lines? An alternative I could try is creating an 'intermittent' object that holds the data which the end loop awaits, but that seems messier to me than figuring out a way to get this idea to work.
Thanks in advance.
I found a solution that essentially works backwards, for those in need of the same functionality: instead of yielding the data, I pass along the function that processes said data. Here's the updated example case:
import asyncio, websockets
from functools import partial
needsToStart = False # Setting this to true gets handled somewhere else in the script
def process(html):
pass
async def run(ws, htmlFunc):
global needsToStart
while True:
data = await ws.recv()
if data == "ping":
await ws.send("pong")
elif "<html" in data:
htmlFunc(data) # Processing the page data
if needsToStart:
await ws.send("work") # Starts the next scraping session
needsToStart = False
func = partial(run, htmlFunc=process)
websockets.serve(func, 'localhost', 9999)
So I've been experimenting with web scraping with aiohttp, and I ran into this issue where whenever I use a proxy, the code within the session.get doesn't run. I've looked all over the internet and couldn't find a solution.
import asyncio
import time
import aiohttp
from aiohttp.client import ClientSession
import random
failed = 0
success = 0
proxypool = []
with open("proxies.txt", "r") as jsonFile:
lines = jsonFile.readlines()
for i in lines:
x = i.split(":")
proxypool.append("http://"+x[2]+":"+x[3].rstrip()+"#"+x[0]+":"+x[1])
async def download_link(url:str,session:ClientSession):
global failed
global success
proxy = proxypool[random.randint(0, len(proxypool))]
print(proxy)
async with session.get(url, proxy=proxy) as response:
if response.status != 200:
failed +=1
else:
success +=1
result = await response.text()
print(result)
async def download_all(urls:list):
my_conn = aiohttp.TCPConnector(limit=1000)
async with aiohttp.ClientSession(connector=my_conn,trust_env=True) as session:
tasks = []
for url in urls:
task = asyncio.ensure_future(download_link(url=url,session=session))
tasks.append(task)
await asyncio.gather(*tasks,return_exceptions=True) # the await must be nest inside of the session
url_list = ["https://www.google.com"]*100
start = time.time()
asyncio.run(download_all(url_list))
end = time.time()
print(f'download {len(url_list)-failed} links in {end - start} seconds')
print(failed, success)
Here is the problem though, the code works fine on my mac. However, when I try to run the exact same code on windows, it doesn't run. It also works fine without proxies, but as soon as I add them, it doesn't work.
At the end, you can see that I print failed and succeeded. On my mac it will output 0, 100, whereas on my windows computer, it will print 0,0 - This proves that that code isn't running (Also, nothing is printed)
The proxies I am using are paid proxies, and they work normally if I use requests.get(). Their format is "http://user:pass#ip:port"
I have also tried just using "http://ip:port" then using BasicAuth to carry the user and password, but this does not work either.
I've seen that many other people have had this problem, however the issue never seems to get solved.
Any help would be appreciated :)
So after some more testing and researching I found the issue, I needed to add ssl = False
So the correct way to make the request would be:
async with session.get(url, proxy=proxy, ssl = False) as response:
That worked for me.
I'm trying to use Python in an async manner in order to speed up my requests to a server. The server has a slow response time (often several seconds, but also sometimes faster than a second), but works well in parallel. I have no access to this server and can't change anything about it. So, I have a big list of URLs (in the code below, pages) which I know beforehand, and want to speed up their loading by making NO_TASKS=5 requests at a time. On the other hand, I don't want to overload the server, so I want a minimum pause between every request of 1 second (i. e. a limit of 1 request per second).
So far I have successfully implemented the semaphore part (five requests at a time) using a Trio queue.
import asks
import time
import trio
NO_TASKS = 5
asks.init('trio')
asks_session = asks.Session()
queue = trio.Queue(NO_TASKS)
next_request_at = 0
results = []
pages = [
'https://www.yahoo.com/',
'http://www.cnn.com',
'http://www.python.org',
'http://www.jython.org',
'http://www.pypy.org',
'http://www.perl.org',
'http://www.cisco.com',
'http://www.facebook.com',
'http://www.twitter.com',
'http://www.macrumors.com/',
'http://arstechnica.com/',
'http://www.reuters.com/',
'http://abcnews.go.com/',
'http://www.cnbc.com/',
]
async def async_load_page(url):
global next_request_at
sleep = next_request_at
next_request_at = max(trio.current_time() + 1, next_request_at)
await trio.sleep_until(sleep)
next_request_at = max(trio.current_time() + 1, next_request_at)
print('start loading page {} at {} seconds'.format(url, trio.current_time()))
req = await asks_session.get(url)
results.append(req.text)
async def producer(url):
await queue.put(url)
async def consumer():
while True:
if queue.empty():
print('queue empty')
return
url = await queue.get()
await async_load_page(url)
async def main():
async with trio.open_nursery() as nursery:
for page in pages:
nursery.start_soon(producer, page)
await trio.sleep(0.2)
for _ in range(NO_TASKS):
nursery.start_soon(consumer)
start = time.time()
trio.run(main)
However, I'm missing the implementation of the limiting part, i. e. the implementation of max. 1 request per second. You can see above my attempt to do so (first five lines of async_load_page), but as you can see when you execute the code, this is not working:
start loading page http://www.reuters.com/ at 58097.12261669573 seconds
start loading page http://www.python.org at 58098.12367392373 seconds
start loading page http://www.pypy.org at 58098.12380622773 seconds
start loading page http://www.macrumors.com/ at 58098.12389389973 seconds
start loading page http://www.cisco.com at 58098.12397854373 seconds
start loading page http://arstechnica.com/ at 58098.12405119873 seconds
start loading page http://www.facebook.com at 58099.12458010273 seconds
start loading page http://www.twitter.com at 58099.37738939873 seconds
start loading page http://www.perl.org at 58100.37830828273 seconds
start loading page http://www.cnbc.com/ at 58100.91712723473 seconds
start loading page http://abcnews.go.com/ at 58101.91770178373 seconds
start loading page http://www.jython.org at 58102.91875295573 seconds
start loading page https://www.yahoo.com/ at 58103.91993155273 seconds
start loading page http://www.cnn.com at 58104.48031027673 seconds
queue empty
queue empty
queue empty
queue empty
queue empty
I've spent some time searching for answers but couldn't find any.
One of the ways to achieve your goal would be using a mutex acquired by a worker before sending a request and released in a separate task after some interval:
async def fetch_urls(urls: Iterator, responses, n_workers, throttle):
# Using binary `trio.Semaphore` to be able
# to release it from a separate task.
mutex = trio.Semaphore(1)
async def tick():
await trio.sleep(throttle)
mutex.release()
async def worker():
for url in urls:
await mutex.acquire()
nursery.start_soon(tick)
response = await asks.get(url)
responses.append(response)
async with trio.open_nursery() as nursery:
for _ in range(n_workers):
nursery.start_soon(worker)
If a worker gets response sooner than after throttle seconds, it will block on await mutex.acquire(). Otherwise the mutex will be released by the tick and another worker will be able to acquire it.
This is similar to how leaky bucket algorithm works:
Workers waiting for the mutex are like water in a bucket.
Each tick is like a bucket leaking at a constant rate.
If you add a bit of logging just before sending a request you should get an output similar to this:
0.00169 started
0.001821 n_workers: 5
0.001833 throttle: 1
0.002152 fetching https://httpbin.org/delay/4
1.012 fetching https://httpbin.org/delay/2
2.014 fetching https://httpbin.org/delay/2
3.017 fetching https://httpbin.org/delay/3
4.02 fetching https://httpbin.org/delay/0
5.022 fetching https://httpbin.org/delay/2
6.024 fetching https://httpbin.org/delay/2
7.026 fetching https://httpbin.org/delay/3
8.029 fetching https://httpbin.org/delay/0
9.031 fetching https://httpbin.org/delay/0
10.61 finished
Using trio.current_time() for this is much too complicated IMHO.
The easiest way to do rate limiting is a rate limiter, i.e. a separate task that basically does this:
async def ratelimit(queue,tick, task_status=trio.TASK_STATUS_IGNORED):
with trio.open_cancel_scope() as scope:
task_status.started(scope)
while True:
await queue.put()
await trio.sleep(tick)
Example use:
async with trio.open_nursery() as nursery:
q = trio.Queue(0) # can use >0 for burst modes
limiter = await nursery.start(ratelimit, q, 1)
while whatever:
await q.get(None) # will return at most once per second
do_whatever()
limiter.cancel()
in other words, you start that task with
q = trio.Queue(0)
limiter = await nursery.start(ratelimit, q, 1)
and then you can be sure that at most one call of
await q.put(None)
per second will return, as the zero-length queue acts as a rendezvous point. When you're done, call
limiter.cancel()
to stop the rate limiting task, otherwise your nursery won't exit.
If your use case includes starting sub-tasks which you need to finish before the limiter gets cancelled, the easiest way to do that is to rin them in another nursery, i.e. instead of
while whatever:
await q.put(None) # will return at most once per second
do_whatever()
limiter.cancel()
you'd use something like
async with trio.open_nursery() as inner_nursery:
await start_tasks(inner_nursery, q)
limiter.cancel()
which would wait for the tasks to finish before touching the limiter.
NB: You can easily adapt this for "burst" mode, i.e. allow a certain number of requests before the rate limiting kicks in, by simply increasing the queue's length.
Motivation and origin of this solution
Some months have passed since I asked this question. Python has improved since then, so has trio (and my knowledge of them). So I thought it was time for a little update using Python 3.6 with type annotations and trio-0.10 memory channels.
I developed my own improvement of the original version, but after reading #Roman Novatorov's great solution, adapted it again and this is the result. Kudos to him for the main structure of the function (and the idea to use httpbin.org for illustration purposes). I chose to use memory channels instead of a mutex to be able to take out any token re-release logic out of the worker.
Explanation of solution
I can rephrase the original problem like this:
I want to have a number of workers that start the request independently of each other (thus, they will be realized as asynchronous functions).
There is zero or one token released at any point; any worker starting a request to the server consumes a token, and the next token will not be issued until a minimum time has passed. In my solution, I use trio's memory channels to coordinate between the token issuer and the token consumers (workers)
In case your not familiar with memory channels and their syntax, you can read about them in the trio doc. I think the logic of async with memory_channel and memory_channel.clone() can be confusing in the first moment.
from typing import List, Iterator
import asks
import trio
asks.init('trio')
links: List[str] = [
'https://httpbin.org/delay/7',
'https://httpbin.org/delay/6',
'https://httpbin.org/delay/4'
] * 3
async def fetch_urls(urls: List[str], number_workers: int, throttle_rate: float):
async def token_issuer(token_sender: trio.abc.SendChannel, number_tokens: int):
async with token_sender:
for _ in range(number_tokens):
await token_sender.send(None)
await trio.sleep(1 / throttle_rate)
async def worker(url_iterator: Iterator, token_receiver: trio.abc.ReceiveChannel):
async with token_receiver:
for url in url_iterator:
await token_receiver.receive()
print(f'[{round(trio.current_time(), 2)}] Start loading link: {url}')
response = await asks.get(url)
# print(f'[{round(trio.current_time(), 2)}] Loaded link: {url}')
responses.append(response)
responses = []
url_iterator = iter(urls)
token_send_channel, token_receive_channel = trio.open_memory_channel(0)
async with trio.open_nursery() as nursery:
async with token_receive_channel:
nursery.start_soon(token_issuer, token_send_channel.clone(), len(urls))
for _ in range(number_workers):
nursery.start_soon(worker, url_iterator, token_receive_channel.clone())
return responses
responses = trio.run(fetch_urls, links, 5, 1.)
Example of logging output:
As you see, the minimum time between all page requests is one second:
[177878.99] Start loading link: https://httpbin.org/delay/7
[177879.99] Start loading link: https://httpbin.org/delay/6
[177880.99] Start loading link: https://httpbin.org/delay/4
[177881.99] Start loading link: https://httpbin.org/delay/7
[177882.99] Start loading link: https://httpbin.org/delay/6
[177886.20] Start loading link: https://httpbin.org/delay/4
[177887.20] Start loading link: https://httpbin.org/delay/7
[177888.20] Start loading link: https://httpbin.org/delay/6
[177889.44] Start loading link: https://httpbin.org/delay/4
Comments on the solution
As not untypical for asynchronous code, this solution does not maintain the original order of the requested urls. One way to solve this is to associate an id to the original url, e. g. with a tuple structure, put the responses into a response dictionary and later grab the responses one after the other to put them into a response list (saves sorting and has linear complexity).
You need to increment next_request_at by 1 every time you come into async_load_page. Try using next_request_at = max(trio.current_time() + 1, next_request_at + 1). Also I think you only need to set it once. You may get into trouble if you're setting it around awaits, where you're giving the opportunity for other tasks to change it before examining it again.
Situation:
I am trying to send a HTTP request to all listed domains in a specific file I already downloaded and get the destination URL, I was forwarded to.
Problem: Well I have followed a tutorial and I get many less responses than expected. It's around 100 responses per second, but in the tutorial there are 100,000 responses per minute listed.
The script gets also slower and slower after a couple of seconds, so that I just get 1 response every 5 seconds.
Already tried: Firstly I thought that this problem is because I ran that on a Windows server. Well after I tried the script on my computer, I recognized that it was just a little bit faster, but not much more. On an other Linux server it was the same like on my computer (Unix, macOS).
Code: https://pastebin.com/WjLegw7K
work_dir = os.path.dirname(__file__)
async def fetch(url, session):
try:
async with session.get(url, ssl=False) as response:
if response.status == 200:
delay = response.headers.get("DELAY")
date = response.headers.get("DATE")
print("{}:{} with delay {}".format(date, response.url, delay))
return await response.read()
except Exception:
pass
async def bound_fetch(sem, url, session):
# Getter function with semaphore.
async with sem:
await fetch(url, session)
async def run():
os.chdir(work_dir)
for file in glob.glob("cdx-*"):
print("Opening: " + file)
opened_file = file
tasks = []
# create instance of Semaphore
sem = asyncio.Semaphore(40000)
with open(work_dir + '/' + file) as infile:
seen = set()
async with ClientSession() as session:
for line in infile:
regex = re.compile(r'://(.*?)/')
domain = regex.search(line).group(1)
domain = domain.lower()
if domain not in seen:
seen.add(domain)
task = asyncio.ensure_future(bound_fetch(sem, 'http://' + domain, session))
tasks.append(task)
del line
responses = asyncio.gather(*tasks)
await responses
infile.close()
del seen
del file
loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run())
loop.run_until_complete(future)
I really don't know how to fix that issue. Especially because I'm very new to Python... but I have to get it to work somehow :(
It's hard to tell what is going wrong without actually debugging the code, but one potential problem is that file processing is serialized. In other words, the code never processes the next file until all the requests from the current file have finished. If there are many files and one of them is slow, this could be a problem.
To change this, define run along these lines:
async def run():
os.chdir(work_dir)
async with ClientSession() as session:
sem = asyncio.Semaphore(40000)
seen = set()
pending_tasks = set()
for f in glob.glob("cdx-*"):
print("Opening: " + f)
with open(f) as infile:
lines = list(infile)
for line in lines:
domain = re.search(r'://(.*?)/', line).group(1)
domain = domain.lower()
if domain in seen:
continue
seen.add(domain)
task = asyncio.ensure_future(bound_fetch(sem, 'http://' + domain, session))
pending_tasks.add(task)
# ensure that each task removes itself from the pending set
# when done, so that the set doesn't grow without bounds
task.add_done_callback(pending_tasks.remove)
# await the remaining tasks
await asyncio.wait(pending_tasks)
Another important thing: silencing all exceptions in fetch() is bad practice because there is no indication that something has started going wrong (due to either a bug or a simple typo). This might well be the reason your script becomes "slow" after a while - fetch is raising exceptions and you're never seeing them. Instead of pass, use something like print(f'failed to get {url}: {e}') where e is the object you get from except Exception as e.
Several additional remarks:
There is almost never a need to del local variables in Python; the garbage collector does that automatically.
You needn't close() a file opened using a with statement. with is designed specifically to do such closing automatically for you.
The code added domains to a seen set, but also processed an already seen domain. This version skips the domain for which it had already spawned a task.
You can create a single ClientSession and use it for the entire run.
I have 5,00,000 urls. and want to get response of each asynchronously.
import aiohttp
import asyncio
#asyncio.coroutine
def worker(url):
response = yield from aiohttp.request('GET', url, connector=aiohttp.TCPConnector(share_cookies=True, verify_ssl=False))
body = yield from response.read_and_close()
print(url)
def main():
url_list = [] # lacs of urls, extracting from a file
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait([worker(u) for u in url_list]))
main()
I want 200 connections at a time(concurrent 200), not more than this because
when I run this program for 50 urls it works fine, i.e url_list[:50]
but if I pass whole list, i get this error
aiohttp.errors.ClientOSError: Cannot connect to host www.example.com:443 ssl:True Future/Task exception was never retrieved future: Task()
may be frequency is too much and server is denying to respond after a limit?
Yes, one can expect a server to stop responding after causing too much traffic (whatever the definition of "too much traffic") to it.
One way to limit number of concurrent requests (throttle them) in such cases is to use asyncio.Semaphore, similar in use to these used in multithreading: just like there, you create a semaphore and make sure the operation you want to throttle is aquiring that semaphore prior to doing actual work and releasing it afterwards.
For your convenience, asyncio.Semaphore implements context manager to make it even easier.
Most basic approach:
CONCURRENT_REQUESTS = 200
#asyncio.coroutine
def worker(url, semaphore):
# Aquiring/releasing semaphore using context manager.
with (yield from semaphore):
response = yield from aiohttp.request(
'GET',
url,
connector=aiohttp.TCPConnector(share_cookies=True,
verify_ssl=False))
body = yield from response.read_and_close()
print(url)
def main():
url_list = [] # lacs of urls, extracting from a file
semaphore = asyncio.Semaphore(CONCURRENT_REQUESTS)
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait([worker(u, semaphore) for u in url_list]))