How to speed up async requests in Python

I want to download/scrape 50 million log records from a site. Instead of downloading all 50 million in one go, I was trying to download them in parts, e.g. 10 million at a time, using the following code, but it only handles about 20,000 at a time (more than that throws an error), so downloading that much data becomes very time-consuming. Currently it takes 3-4 minutes to download 20,000 records, at a rate of 100%|██████████| 20000/20000 [03:48<00:00, 87.41it/s]. How can I speed it up?
import asyncio
import aiohttp
import time
import tqdm
import nest_asyncio

nest_asyncio.apply()


async def make_numbers(numbers, _numbers):
    for i in range(numbers, _numbers):
        yield i

n = 0
q = 10000000


async def fetch():
    # example
    url = "https://httpbin.org/anything/log?id="
    async with aiohttp.ClientSession() as session:
        post_tasks = []
        # prepare the coroutines that post
        async for x in make_numbers(n, q):
            post_tasks.append(do_get(session, url, x))
        # now execute them all at once
        responses = [await f for f in tqdm.tqdm(asyncio.as_completed(post_tasks), total=len(post_tasks))]


async def do_get(session, url, x):
    headers = {
        'Content-Type': "application/x-www-form-urlencoded",
        'Access-Control-Allow-Origin': "*",
        'Accept-Encoding': "gzip, deflate",
        'Accept-Language': "en-US"
    }
    async with session.get(url + str(x), headers=headers) as response:
        data = await response.text()
        print(data)


s = time.perf_counter()
try:
    loop = asyncio.get_event_loop()
    loop.run_until_complete(fetch())
except:
    print("error")
elapsed = time.perf_counter() - s
# print(f"{__file__} executed in {elapsed:0.2f} seconds.")
Traceback (most recent call last):
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\site-packages\aiohttp\connector.py", line 986, in _wrap_create_connection
return await self._loop.create_connection(*args, **kwargs) # type: ignore[return-value] # noqa
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\asyncio\base_events.py", line 1056, in create_connection
raise exceptions[0]
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\asyncio\base_events.py", line 1041, in create_connection
sock = await self._connect_sock(
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\asyncio\base_events.py", line 955, in _connect_sock
await self.sock_connect(sock, address)
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\asyncio\proactor_events.py", line 702, in sock_connect
return await self._proactor.connect(sock, address)
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\asyncio\tasks.py", line 328, in __wakeup
future.result()
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\asyncio\windows_events.py", line 812, in _poll
value = callback(transferred, key, ov)
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\asyncio\windows_events.py", line 599, in finish_connect
ov.getresult()
OSError: [WinError 121] The semaphore timeout period has expired
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\SGM\Desktop\xnet\x3stackoverflow.py", line 136, in <module>
loop.run_until_complete(fetch())
File "C:\Users\SGM\AppData\Roaming\Python\Python39\site-packages\nest_asyncio.py", line 81, in run_until_complete
return f.result()
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\asyncio\futures.py", line 201, in result
raise self._exception
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\asyncio\tasks.py", line 256, in __step
result = coro.send(None)
File "C:\Users\SGM\Desktop\xnet\x3stackoverflow.py", line 88, in fetch
response = await f
File "C:\Users\SGM\Desktop\xnet\x3stackoverflow.py", line 37, in _wait_for_one
return f.result()
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\asyncio\futures.py", line 201, in result
raise self._exception
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\asyncio\tasks.py", line 258, in __step
result = coro.throw(exc)
File "C:\Users\SGM\Desktop\xnet\x3stackoverflow.py", line 125, in do_get
async with session.get(url + str(x), headers=headers) as response:
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\site-packages\aiohttp\client.py", line 1138, in __aenter__
self._resp = await self._coro
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\site-packages\aiohttp\client.py", line 535, in _request
conn = await self._connector.connect(
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\site-packages\aiohttp\connector.py", line 542, in connect
proto = await self._create_connection(req, traces, timeout)
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\site-packages\aiohttp\connector.py", line 907, in _create_connection
_, proto = await self._create_direct_connection(req, traces, timeout)
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\site-packages\aiohttp\connector.py", line 1206, in _create_direct_connection
raise last_exc
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\site-packages\aiohttp\connector.py", line 1175, in _create_direct_connection
transp, proto = await self._wrap_create_connection(
File "C:\Users\SGM\AppData\Local\Programs\Python\Python39\lib\site-packages\aiohttp\connector.py", line 992, in _wrap_create_connection
raise client_error(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host example.com:80 ssl:default [The semaphore timeout period has expired]

Bottleneck: number of simultaneous connections
First, the bottleneck is the total number of simultaneous connections in the TCP connector.
The default for aiohttp.TCPConnector is limit=100. On most systems (tested on macOS), you should be able to double that by passing a connector with limit=200:
# async with aiohttp.ClientSession() as session:
async with aiohttp.ClientSession(connector=aiohttp.TCPConnector(limit=200)) as session:
The time taken should decrease significantly. (On macOS: q = 20_000 decreased 43% from 58 seconds to 33 seconds, and q = 10_000 decreased 42% from 31 to 18 seconds.)
The limit you can configure depends on the number of file descriptors that your machine can open. (On macOS: You can run ulimit -n to check, and ulimit -n 1024 to increase to 1024 for the current terminal session, and then change to limit=1000. Compared to limit=100, q = 20_000 decreased 76% to 14 seconds, and q = 10_000 decreased 71% to 9 seconds.)
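For reference, here is a minimal sketch (an assumption: it uses the standard resource module, which is Unix-only, while the question itself runs on Windows) of checking and raising the soft file-descriptor limit from inside the script before sizing the connector:
import resource  # Unix-only (Linux/macOS); not available on Windows
import aiohttp

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("file descriptor limits:", soft, hard)

# Raise the soft limit; an unprivileged process may not exceed the hard limit.
new_soft = 4096 if hard == resource.RLIM_INFINITY else hard
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))

# Keep the connector comfortably below the new soft limit.
connector = aiohttp.TCPConnector(limit=1000)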
Supporting 50 million requests: async generators
Next, the reason why 50 million requests appears to hang is simply their sheer number.
Just creating 10 million coroutines in post_tasks takes 68-98 seconds (varies greatly on my machine), and then the event loop is further burdened with that many tasks, 99.99% of which are blocked by the TCP connection pool.
We can defer the creation of coroutines using an async generator:
async def make_async_gen(f, n, q):
    async for x in make_numbers(n, q):
        yield f(x)
We need a counterpart to asyncio.as_completed() to handle async_gen and concurrency:
from asyncio import ensure_future, events
from asyncio.queues import Queue


def as_completed_for_async_gen(fs_async_gen, concurrency):
    done = Queue()
    loop = events.get_event_loop()
    # todo = {ensure_future(f, loop=loop) for f in set(fs)}  # -
    todo = set()                                             # +

    def _on_completion(f):
        todo.remove(f)
        done.put_nowait(f)
        loop.create_task(_add_next())  # +

    async def _wait_for_one():
        f = await done.get()
        return f.result()

    async def _add_next():  # +
        try:
            f = await fs_async_gen.__anext__()
        except StopAsyncIteration:
            return
        f = ensure_future(f, loop=loop)
        f.add_done_callback(_on_completion)
        todo.add(f)

    # for f in todo:                           # -
    #     f.add_done_callback(_on_completion)  # -
    # for _ in range(len(todo)):               # -
    #     yield _wait_for_one()                # -
    for _ in range(concurrency):               # +
        loop.run_until_complete(_add_next())   # +
    while todo:                                # +
        yield _wait_for_one()                  # +
Then, we update fetch():
from functools import partial

CONCURRENCY = 200  # +

n = 0
q = 50_000_000


async def fetch():
    # example
    url = "https://httpbin.org/anything/log?id="
    async with aiohttp.ClientSession(connector=aiohttp.TCPConnector(limit=CONCURRENCY)) as session:
        # post_tasks = []                                 # -
        # # prepare the coroutines that post              # -
        # async for x in make_numbers(n, q):              # -
        #     post_tasks.append(do_get(session, url, x))  # -
        # Prepare the coroutines generator                # +
        async_gen = make_async_gen(partial(do_get, session, url), n, q)  # +

        # now execute them all at once  # -
        # responses = [await f for f in tqdm.asyncio.tqdm.as_completed(post_tasks, total=len(post_tasks))]  # -
        # Now execute them with a specified concurrency  # +
        responses = [await f for f in tqdm.tqdm(as_completed_for_async_gen(async_gen, CONCURRENCY), total=q)]  # +
Other limitations
With the above, the program can start processing 50 million requests but:
1. it will still take 8 hours or so with CONCURRENCY = 1000, based on the estimate from tqdm.
2. your program may run out of memory for responses and crash.
For point 2, you should probably do:
# responses = [await f for f in tqdm.tqdm(as_completed_for_async_gen(async_gen, CONCURRENCY), total=q)]
for f in tqdm.tqdm(as_completed_for_async_gen(async_gen, CONCURRENCY), total=q):
    response = await f
    # Do something with response, such as writing to a local file
    # ...
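For instance, here is a minimal sketch (inside fetch(), with a placeholder output file name) of appending each response to a local file instead of keeping 50 million strings in memory:
with open("responses.txt", "a", encoding="utf-8") as out:  # placeholder path
    for f in tqdm.tqdm(as_completed_for_async_gen(async_gen, CONCURRENCY), total=q):
        response = await f
        out.write(response + "\n")  # write each record as soon as it arrives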
An error in the code
do_get() should return data:
async def do_get(session, url, x):
    headers = {
        'Content-Type': "application/x-www-form-urlencoded",
        'Access-Control-Allow-Origin': "*",
        'Accept-Encoding': "gzip, deflate",
        'Accept-Language': "en-US"
    }
    async with session.get(url + str(x), headers=headers) as response:
        data = await response.text()
        # print(data)  # -
        return data    # +

If it's not bandwidth that limits you (I cannot check this), there is a solution less complicated than Celery and RabbitMQ, although it is not as scalable: it is limited by the number of CPU cores on your machine.
Instead of splitting the calls across Celery workers, you split them across multiple processes.
I modified the fetch function like this:
async def fetch(start, end):
    # example
    url = "https://httpbin.org/anything/log?id="
    async with aiohttp.ClientSession() as session:
        post_tasks = []
        # prepare the coroutines that post
        # use start and end arguments here!
        async for x in make_numbers(start, end):
            post_tasks.append(do_get(session, url, x))
        # now execute them all at once
        responses = [await f for f in
                     tqdm.tqdm(asyncio.as_completed(post_tasks), total=len(post_tasks))]
and I modified the main process:
import concurrent.futures
from itertools import count


def one_executor(start, end):
    loop = asyncio.new_event_loop()
    try:
        loop.run_until_complete(fetch(start, end))
    except:
        print("error")


if __name__ == '__main__':
    s = time.perf_counter()

    # Change the value to the number of cores you want to use.
    max_worker = 4
    length_by_executor = q // max_worker
    with concurrent.futures.ProcessPoolExecutor(max_workers=max_worker) as executor:
        for index_min in count(0, length_by_executor):
            # duplicated indexes do not matter because make_numbers uses range.
            index_max = min(index_min + length_by_executor, q)
            executor.submit(one_executor, index_min, index_max)
            if index_max == q:
                break

    elapsed = time.perf_counter() - s
    print(f"executed in {elapsed:0.2f} seconds.")
Here are the results I get (with q set to 10_000):
1 worker: executed in 13.90 seconds.
2 workers: executed in 7.24 seconds.
3 workers: executed in 6.82 seconds.
I did not work on the tqdm progress bar; with the current solution, two bars will be displayed (but I think tqdm works well with multiple processes).
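If you do want one clean bar per worker, here is a hedged sketch using tqdm's position argument; it assumes fetch() is adapted to accept the bar and call bar.update(1) after each completed request (that extra parameter is not part of the code above):
import asyncio
import tqdm

def one_executor(worker_index, start, end):
    # position=worker_index keeps each worker's bar on its own terminal line
    bar = tqdm.tqdm(total=end - start, position=worker_index,
                    desc="worker %d" % worker_index, leave=True)
    loop = asyncio.new_event_loop()
    try:
        loop.run_until_complete(fetch(start, end, bar))  # hypothetical fetch(start, end, bar)
    except Exception as exc:
        print("worker %d error: %s" % (worker_index, exc))
    finally:
        bar.close()
        loop.close()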

Related

gcp pubsub-lite subscription python "AuthMetadataPluginCallback "<google.auth.transport.grpc.AuthMetadataPlugin object at " raised exception!" error

I am using the Python pubsublite client (async version) for subscribing to Pub/Sub Lite.
I get the below error intermittently:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/grpc/_plugin_wrapping.py", line 89, in __call__
self._metadata_plugin(
File "/usr/local/lib/python3.10/site-packages/google/auth/transport/grpc.py", line 101, in __call__
callback(self._get_authorization_headers(context), None)
File "/usr/local/lib/python3.10/site-packages/google/auth/transport/grpc.py", line 87, in _get_authorization_headers
self._credentials.before_request(
File "/usr/local/lib/python3.10/site-packages/google/auth/credentials.py", line 134, in before_request
self.apply(headers)
File "/usr/local/lib/python3.10/site-packages/google/auth/credentials.py", line 110, in apply
_helpers.from_bytes(token or self.token)
File "/usr/local/lib/python3.10/site-packages/google/auth/_helpers.py", line 130, in from_bytes
raise ValueError("{0!r} could not be converted to unicode".format(value))
ValueError: None could not be converted to unicode
I don't use the GOOGLE_APPLICATION_CREDENTIALS env variable to specify credentials; instead I do as below (I don't want to write credentials to a file on the AWS host):
import asyncio

from google.cloud.pubsublite.cloudpubsub import AsyncSubscriberClient
from google.cloud.pubsublite.types import (
    CloudRegion,
    CloudZone,
    FlowControlSettings,
    SubscriptionPath,
)
from google.oauth2 import service_account


class AsyncTimedIterable:
    def __init__(self, iterable, poll_timeout=90):
        class AsyncTimedIterator:
            def __init__(self):
                self._iterator = iterable.__aiter__()

            async def __anext__(self):
                try:
                    result = await asyncio.wait_for(
                        self._iterator.__anext__(), int(poll_timeout)
                    )
                    if not result:
                        raise StopAsyncIteration
                    return result
                except asyncio.TimeoutError as e:
                    raise e

        self._factory = AsyncTimedIterator

    def __aiter__(self):
        return self._factory()


# TODO add project info below
location = CloudZone(CloudRegion("region"), "zone")
subscription_path = SubscriptionPath("project_number", location, "subscription_id")

# TODO add service account details
gcp_creds = {}


async def async_receive_from_subscription(per_partition_count=100):
    # Configure when to pause the message stream for more incoming messages based on the
    # maximum size or number of messages that a single-partition subscriber has received,
    # whichever condition is met first.
    per_partition_flow_control_settings = FlowControlSettings(
        # 1,000 outstanding messages. Must be >0.
        messages_outstanding=per_partition_count,
        # 10 MiB. Must be greater than the allowed size of the largest message (1 MiB).
        bytes_outstanding=10 * 1024 * 1024,
    )

    async with AsyncSubscriberClient(
        credentials=service_account.Credentials.from_service_account_info(gcp_creds)
    ) as async_subscriber_client:
        message_iterator = await async_subscriber_client.subscribe(
            subscription_path,
            per_partition_flow_control_settings=per_partition_flow_control_settings,
        )
        timed_iter = AsyncTimedIterable(message_iterator, 90)
        async for message in timed_iter:
            yield message


async def main():
    async for message in async_receive_from_subscription(per_partition_count=100_000):
        print(message.data)


if __name__ == "__main__":
    asyncio.run(main())
When I went through the files in the stack trace, I saw a code comment as below in file ``:
# The plugin may be invoked on a thread created by Core, which will not
# have the context propagated. This context is stored and installed in
# the thread invoking the plugin.
Is it because the credentials I set are not being sent to another thread when it is created?

Python Asyncio errors: "OSError: [WinError 6] The handle is invalid" and "RuntimeError: Event loop is closed" [duplicate]

This question already has answers here:
Exception event loop is closed with aiohttp and asyncio in python 3.8
(2 answers)
Closed 2 years ago.
I am having some difficulties with my code, as I get the following error after my code finishes executing while debugging in VS Code:
Exception ignored in: <function _ProactorBasePipeTransport.__del__ at 0x00000188AB3259D0>
Traceback (most recent call last):
File "c:\users\gam3p\appdata\local\programs\python\python38\lib\asyncio\proactor_events.py", line 116, in __del__
self.close()
File "c:\users\gam3p\appdata\local\programs\python\python38\lib\asyncio\proactor_events.py", line 108, in close
self._loop.call_soon(self._call_connection_lost, None)
File "c:\users\gam3p\appdata\local\programs\python\python38\lib\asyncio\base_events.py", line 719, in call_soon
self._check_closed()
File "c:\users\gam3p\appdata\local\programs\python\python38\lib\asyncio\base_events.py", line 508, in _check_closed
raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
I also get the following error if I run the code using the command line:
Cancelling an overlapped future failed
future: <_OverlappedFuture pending overlapped=<pending, 0x25822633550> cb=[_ProactorReadPipeTransport._loop_reading()]>
Traceback (most recent call last):
File "c:\users\gam3p\appdata\local\programs\python\python38\lib\asyncio\windows_events.py", line 66, in _cancel_overlapped
self._ov.cancel()
OSError: [WinError 6] The handle is invalid
The code below is an asynchronous API wrapper for Reddit. Since Reddit asks bots to be limited to a rate of 60 requests per minute, I have decided to implement a throttled loop that processes the requests in a queue, in a separate thread.
To run it, you would need to create a Reddit app and use your login credentials as well as the bot ID and secret.
If it helps, I am using Python 3.8.3 64-bit on Windows 10.
import requests
import aiohttp
import asyncio
import threading
from types import SimpleNamespace
from time import time

oauth_url = 'https://oauth.reddit.com/'
base_url = 'https://www.reddit.com/'
agent = 'windows:reddit-async:v0.1 (by /u/UrHyper)'


class Reddit:
    def __init__(self, username: str, password: str, app_id: str, app_secret: str):
        data = {'grant_type': 'password',
                'username': username, 'password': password}
        auth = requests.auth.HTTPBasicAuth(app_id, app_secret)
        response = requests.post(base_url + 'api/v1/access_token',
                                 data=data,
                                 headers={'user-agent': agent},
                                 auth=auth)
        self.auth = response.json()
        if 'error' in self.auth:
            msg = f'Failed to authenticate: {self.auth["error"]}'
            if 'message' in self.auth:
                msg += ' - ' + self.auth['message']
            raise ValueError(msg)
        token = 'bearer ' + self.auth['access_token']
        self.headers = {'Authorization': token, 'User-Agent': agent}

        self._rate = 1
        self._last_loop_time = 0.0
        self._loop = asyncio.new_event_loop()
        self._queue = asyncio.Queue(0)
        self._end_loop = False
        self._loop_thread = threading.Thread(target=self._start_loop_thread)
        self._loop_thread.start()

    def stop(self):
        self._end_loop = True

    def __del__(self):
        self.stop()

    def _start_loop_thread(self):
        asyncio.set_event_loop(self._loop)
        self._loop.run_until_complete(self._process_queue())

    async def _process_queue(self):
        while True:
            if self._end_loop and self._queue.empty():
                await self._queue.join()
                break
            start_time = time()
            if self._last_loop_time < self._rate:
                await asyncio.sleep(self._rate - self._last_loop_time)
            try:
                queue_item = self._queue.get_nowait()
                url = queue_item['url']
                callback = queue_item['callback']
                data = await self._get_data(url)
                self._queue.task_done()
                callback(data)
            except asyncio.QueueEmpty:
                pass
            finally:
                self._last_loop_time = time() - start_time

    async def _get_data(self, url):
        async with aiohttp.ClientSession() as session:
            async with session.get(url, headers=self.headers) as response:
                assert response.status == 200
                data = await response.json()
                data = SimpleNamespace(**data)
                return data

    async def get_bot(self, callback: callable):
        url = oauth_url + 'api/v1/me'
        await self._queue.put({'url': url, 'callback': callback})

    async def get_user(self, user: str, callback: callable):
        url = oauth_url + 'user/' + user + '/about'
        await self._queue.put({'url': url, 'callback': callback})


def callback(data): print(data['name'])


async def main():
    reddit = Reddit('', '', '', '')
    await reddit.get_bot(lambda bot: print(bot.name))
    reddit.stop()


if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
I ran into a similar problem with asyncio.
Since Python 3.8, the default event loop on Windows is ProactorEventLoop instead of SelectorEventLoop, and there are some issues with it.
So adding
asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
above
loop = asyncio.get_event_loop()
will get the old event loop back without issues.
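A slightly more complete, hedged sketch of where that line can go, guarding on the platform so the policy is only applied on Windows (main() is the coroutine from the question):
import asyncio
import sys

# Only Windows needs the selector policy; the attribute does not exist on other platforms.
if sys.platform.startswith("win"):
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())

loop = asyncio.get_event_loop()
loop.run_until_complete(main())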

django celery and asyncio - loop argument must agree with Future approx every 3 mins

I'm using django celery and celery beat to run periodic tasks. I run a task every minute to get some data via SNMP.
My function uses asyncio, as per the below. I have put a check in the code to see if the loop is closed and to create a new one.
What seems to be happening is that every few tasks I get a failure, and in the django-tasks-results DB I have the traceback below. There seems to be a failure around every 3 minutes, while the runs in between succeed every minute.
Error:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/celery/app/trace.py", line 374, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/celery/app/trace.py", line 629, in __protected_call__
return self.run(*args, **kwargs)
File "/itapp/itapp/monitoring/tasks.py", line 32, in link_data
return get_link_data()
File "/itapp/itapp/monitoring/jobs/link_monitoring.py", line 209, in get_link_data
done, pending = loop.run_until_complete(asyncio.wait(tasks))
File "/usr/local/lib/python3.6/asyncio/base_events.py", line 468, in run_until_complete
return future.result()
File "/usr/local/lib/python3.6/asyncio/tasks.py", line 311, in wait
fs = {ensure_future(f, loop=loop) for f in set(fs)}
File "/usr/local/lib/python3.6/asyncio/tasks.py", line 311, in <setcomp>
fs = {ensure_future(f, loop=loop) for f in set(fs)}
File "/usr/local/lib/python3.6/asyncio/tasks.py", line 514, in ensure_future
raise ValueError('loop argument must agree with Future')
ValueError: loop argument must agree with Future
Function:
async def retrieve_data(link):
    poll_interval = 60
    results = []

    # credentials:
    link_mgmt_ip = link.mgmt_ip
    link_index = link.interface_index
    snmp_user = link.device_circuit_subnet.device.snmp_data.name
    snmp_auth = link.device_circuit_subnet.device.snmp_data.auth
    snmp_priv = link.device_circuit_subnet.device.snmp_data.priv
    hostname = link.device_circuit_subnet.device.hostname
    print('polling data for {} on {}'.format(hostname, link_mgmt_ip))

    # first poll for speeds
    download_speed_data_poll1 = snmp_get(link_mgmt_ip, down_speed_oid % link_index, snmp_user, snmp_auth, snmp_priv)

    # check we were able to poll
    if 'timeout' in str(get_snmp_value(download_speed_data_poll1)).lower():
        return 'timeout trying to poll {} - {}'.format(hostname, link_mgmt_ip)

    upload_speed_data_poll1 = snmp_get(link_mgmt_ip, up_speed_oid % link_index, snmp_user, snmp_auth, snmp_priv)

    # wait for poll interval
    await asyncio.sleep(poll_interval)

    # second poll for speeds
    download_speed_data_poll2 = snmp_get(link_mgmt_ip, down_speed_oid % link_index, snmp_user, snmp_auth, snmp_priv)
    upload_speed_data_poll2 = snmp_get(link_mgmt_ip, up_speed_oid % link_index, snmp_user, snmp_auth, snmp_priv)

    # create deltas for speed
    down_delta = int(get_snmp_value(download_speed_data_poll2)) - int(get_snmp_value(download_speed_data_poll1))
    up_delta = int(get_snmp_value(upload_speed_data_poll2)) - int(get_snmp_value(upload_speed_data_poll1))

    # set speed results
    download_speed = round((down_delta * 8 / poll_interval) / 1048576)
    upload_speed = round((up_delta * 8 / poll_interval) / 1048576)

    # get description and interface state
    int_desc = snmp_get(link_mgmt_ip, int_desc_oid % link_index, snmp_user, snmp_auth, snmp_priv)
    int_state = snmp_get(link_mgmt_ip, int_state_oid % link_index, snmp_user, snmp_auth, snmp_priv)
    ...
    return results


def get_link_data():
    mgmt_ip = Subquery(
        DeviceCircuitSubnets.objects.filter(
            device_id=OuterRef('device_circuit_subnet__device_id'),
            subnet__subnet_type__poll=True
        ).values('subnet__subnet')[:1])

    link_data = LinkTargets.objects.all() \
        .select_related('device_circuit_subnet') \
        .select_related('device_circuit_subnet__device') \
        .select_related('device_circuit_subnet__device__snmp_data') \
        .select_related('device_circuit_subnet__subnet') \
        .select_related('device_circuit_subnet__circuit') \
        .annotate(mgmt_ip=mgmt_ip)

    tasks = []
    loop = asyncio.get_event_loop()
    if asyncio.get_event_loop().is_closed():
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(asyncio.new_event_loop())

    for link in link_data:
        tasks.append(asyncio.ensure_future(retrieve_data(link)))

    if tasks:
        start = time.time()
        done, pending = loop.run_until_complete(asyncio.wait(tasks))
        loop.close()
        results = []
        for completed_task in done:
            results.append(completed_task.result()[0])
        end = time.time()
        print("Poll time: {}".format(end - start))
        return 'Link data updated for {}'.format(' \n '.join(results))
    else:
        return 'no tasks defined'
From these URLs suggested by user4815162342:
https://medium.freecodecamp.org/a-guide-to-asynchronous-programming-in-python-with-asyncio-232e2afa44f6
When to use and when not to use Python 3.5 `await` ?
When running async functions, any input/output operation needs to be async compatible, except functions that run purely in memory (in my example a regex query).
I.e., any function that needs to gather data from another source (a Django query in my example) and is not async compatible must be run in an executor.
I think I have now fixed my issues by running all Django DB calls in executors; I have not had an issue running the script ad hoc since.
However, I still have a compatibility issue between Celery and asyncio (Celery is not yet compatible with asyncio, which throws up some errors, but not the errors I was previously seeing).
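For reference, here is a minimal hedged sketch of what "running the Django query in an executor" can look like, using the question's LinkTargets model (the real queryset with its select_related/annotate calls would go inside the lambda):
import asyncio

async def get_link_data_async():
    loop = asyncio.get_event_loop()
    # list() forces queryset evaluation inside the default thread-pool executor,
    # so the blocking DB access does not run on the event loop itself.
    link_data = await loop.run_in_executor(
        None, lambda: list(LinkTargets.objects.all())
    )
    return link_data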

why url works in browser but not using requests get method

While testing, I just discovered that this URL:
url = ' http://wi312.rockdizfile.com/d/uclf2kr7fp4r2ge47pcuihdpky2chcsjur5nrds2hx53f26qgxnrktew/Kimbra%20-%20Love%20in%20High%20Places.mp3'
works in a browser and the file download begins, but if I try to fetch this file using
requests.get(url)
it gives a massive error...
Any clue why this is happening? Do I need to decode this to make it work?
Update
this is the error I keep getting:
Exception in thread Thread-5:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "python/file_download.py", line 98, in _downloadChunk
stream=True)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.1.0-py2.7.egg/requests/api.py", line 55, in get
return request('get', url, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.1.0-py2.7.egg/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.1.0-py2.7.egg/requests/sessions.py", line 382, in request
resp = self.send(prep, **send_kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.1.0-py2.7.egg/requests/sessions.py", line 485, in send
r = adapter.send(request, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.1.0-py2.7.egg/requests/adapters.py", line 381, in send
raise Timeout(e)
Timeout: (<requests.packages.urllib3.connectionpool.HTTPConnectionPool object at 0x10258de90>, 'Connection to wi312.rockdizfile.com timed out. (connect timeout=0.001)')
There was no space when I posted; it was just on a new line because I posted it as an inline code embed.
Here is the code that makes the requests (also try this new URL: http://archive.org/download/LucyIsabelleMarsh/LucyIsabelleMarsh-ItalianStreetSong.mp3):
import requests
import signal
import sys
import time
import threading
import utils as _fdUtils
from socket import error as SocketError, timeout as SocketTimeout


def _downloadChunk(url, idx, irange, fileName, sizeInBytes):
    _log.debug("Downloading %s for first chunk %s " % (irange, idx+1))
    pulledSize = irange[-1]
    try:
        resp = requests.get(url, allow_redirects=False, timeout=0.001,
                            headers={'Range': 'bytes=%s-%s' % (str(irange[0]), str(irange[-1]))},
                            stream=True)
    except (SocketTimeout, requests.exceptions), e:
        _log.error(e)
        return

    chunk_size = str(irange[-1])
    for chunk in resp.iter_content(chunk_size):
        status = r"%10d [%3.2f%%]" % (pulledSize, pulledSize * 100. / int(chunk_size))
        status = status + chr(8)*(len(status)+1)
        sys.stdout.write('%s\r' % status)
        sys.stdout.flush()
        pulledSize += len(chunk)
        dataDict[idx] = chunk
        time.sleep(.03)
        if pulledSize == sizeInBytes:
            _log.info("%s downloaded %3.0f%%", fileName, pulledSize * 100. / sizeInBytes)


class ThreadedFetch(threading.Thread):
    """ docstring for ThreadedFetch
    """
    def __init__(self, saveTo, queue):
        super(ThreadedFetch, self).__init__()
        self.queue = queue
        self.__saveTo = saveTo

    def run(self):
        threadLimiter.acquire()
        try:
            items = self.queue.get()
            url = items[0]
            split = items[-1]
            fileName = _fdUtils.getFileName(url)

            # grab split chunks in separate thread.
            if split > 1:
                maxSplits.acquire()
                try:
                    sizeInBytes = _fdUtils.getUrlSizeInBytes(url)
                    if sizeInBytes:
                        byteRanges = _fdUtils.getRangeSegements(sizeInBytes, split)
                    else:
                        byteRanges = ['0-']

                    filePath = os.path.join(self.__saveTo, fileName)
                    downloaders = [
                        threading.Thread(
                            target=_downloadChunk,
                            args=(url, idx, irange, fileName, sizeInBytes),
                        )
                        for idx, irange in enumerate(byteRanges)
                    ]

                    # start threads, let run in parallel, wait for all to finish
                    for th in downloaders:
                        th.start()

                    # this makes the wait for all thread to finish
                    # which confirms the dataDict is up-to-date
                    for th in downloaders:
                        th.join()

                    downloadedSize = 0
                    with open(filePath, 'wb') as fh:
                        for _idx, chunk in sorted(dataDict.iteritems()):
                            downloadedSize += len(chunk)
                            status = r"%10d [%3.2f%%]" % (downloadedSize, downloadedSize * 100. / sizeInBytes)
                            status = status + chr(8)*(len(status)+1)
                            fh.write(chunk)
                            sys.stdout.write('%s\r' % status)
                            time.sleep(.04)
                            sys.stdout.flush()
                            if downloadedSize == sizeInBytes:
                                _log.info("%s, saved to %s", fileName, self.__saveTo)
                    self.queue.task_done()
                finally:
                    maxSplits.release()
The traceback is showing a Timeout exception, and in your code you do indeed have a very short timeout set; either remove this limit or increase it:
requests.get(url, allow_redirects=False, timeout=0.001, # <-- this is very short
Even if you were accessing localhost (your own computer), such a timeout will result in a Timeout exception. From the documentation:
Note
timeout is not a time limit on the entire response download; rather,
an exception is raised if the server has not issued a response for
timeout seconds (more precisely, if no bytes have been received on the
underlying socket for timeout seconds).
So it's not doing what you might expect.
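For example, here is a hedged sketch of the same call with a more forgiving timeout (a plain number of seconds works on any requests version; newer releases also accept a (connect, read) tuple):
resp = requests.get(url, allow_redirects=False, stream=True,
                    headers={'Range': 'bytes=%s-%s' % (str(irange[0]), str(irange[-1]))},
                    timeout=10)  # seconds without any received bytes before giving up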
You have a space before the start of the URL, which causes a requests.exceptions.InvalidSchema error:
url = ' http://wi312.rockdizfile.com/d/uclf2kr7fp4r2ge47pcuihdpky2chcsjur5nrds2hx53f26qgxnrktew/Kimbra%20-%20Love%20in%20High%20Places.mp3'
Change to:
url = 'http://wi312.rockdizfile.com/d/uclf2kr7fp4r2ge47pcuihdpky2chcsjur5nrds2hx53f26qgxnrktew/Kimbra%20-%20Love%20in%20High%20Places.mp3'
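If the URLs come from user input or a file, it may also be worth stripping whitespace defensively before making the request (a small hedged addition):
url = url.strip()  # drop any stray leading/trailing whitespace
requests.get(url)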

url fetch gets stuck when multiple urls are passed

In the code below I am trying to first check the URL status code and then start the relevant thread, and do the same when adding it to the queue;
however, if there are too many URLs I get a Timeout error.
All the code is added below.
I also just discovered another bug: if I pass an mp3 file along with some jpeg images, the downloaded mp3 file (of the correct size) opens as one of the images in the URLs passed.
_fdUtils
def getParser():
    parser = argparse.ArgumentParser(prog='FileDownloader',
                                     description='Utility to download files from internet')
    parser.add_argument('-v', '--verbose', default=logging.DEBUG,
                        help='by default its on, pass None or False to not spit in shell')
    parser.add_argument('-st', '--saveTo', default=None, action=FullPaths,
                        help='location where you want files to download to')
    parser.add_argument('-urls', nargs='*',
                        help='urls of files you want to download.')
    parser.add_argument('-se', nargs='*', default=[1], help='Split each url passed to urls by the'
                        " respective split order, if a url doesn't have a split default is taken 1 ")
    return parser.parse_args()


def getResponse(url):
    return requests.head(url, allow_redirects=True, timeout=10, headers={'Accept-Encoding': 'identity'})


def isWorkingURL(url):
    response = getResponse(url)
    return response.status_code in [302, 200, 100, 204, 300]


def getUrl(url):
    """ gets the actual url to download file from.
    """
    response = getResponse(url)
    return response.headers.get('location', url)
Error stack trace:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "python/file_download.py", line 181, in run
_grabAndWriteToDisk(self, split, url, self.__saveTo, 0, self.queue)
File "python/file_download.py", line 70, in _grabAndWriteToDisk
resp = requests.get(url, headers={'Range': 'bytes=%s' % irange}, stream=True)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.1.0-py2.7.egg/requests/api.py", line 55, in get
return request('get', url, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.1.0-py2.7.egg/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.1.0-py2.7.egg/requests/sessions.py", line 382, in request
resp = self.send(prep, **send_kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.1.0-py2.7.egg/requests/sessions.py", line 505, in send
history = [resp for resp in gen] if allow_redirects else []
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.1.0-py2.7.egg/requests/sessions.py", line 167, in resolve_redirects
allow_redirects=False,
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.1.0-py2.7.egg/requests/sessions.py", line 485, in send
r = adapter.send(request, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests-2.1.0-py2.7.egg/requests/adapters.py", line 381, in send
raise Timeout(e)
Timeout: HTTPConnectionPool(host='ia600506.us.archive.org', port=80): Read timed out. (read timeout=<object object at 0x1002b40b0>)
there we go again:
import argparse
import logging
import Queue
import os
import requests
import signal
import socket
import sys
import time
import threading
import utils as _fdUtils

from collections import OrderedDict
from itertools import izip_longest
from socket import error as SocketError, timeout as SocketTimeout

# timeout in seconds
TIMEOUT = 10
socket.setdefaulttimeout(TIMEOUT)

DESKTOP_PATH = os.path.expanduser("~/Desktop")

appName = 'FileDownloader'

logFile = os.path.join(DESKTOP_PATH, '%s.log' % appName)

_log = _fdUtils.fdLogger(appName, logFile, logging.DEBUG, logging.DEBUG, console_level=logging.DEBUG)

queue = Queue.Queue()
STOP_REQUEST = threading.Event()
maxSplits = threading.BoundedSemaphore(3)
threadLimiter = threading.BoundedSemaphore(5)
lock = threading.Lock()

pulledSize = 0
dataDict = {}


def _grabAndWriteToDisk(threadName, url, saveTo, first=None, queue=None, mode='wb', irange=None):
    """ Function to download file..

        Args:
            url(str): url of file to download
            saveTo(str): path where to save file
            first(int): starting byte of the range
            queue(Queue.Queue): queue object to set status for file download
            mode(str): mode of file to be downloaded
            irange(str): range of byte to download
    """
    fileName = _fdUtils.getFileName(url)
    filePath = os.path.join(saveTo, fileName)
    fileSize = _fdUtils.getUrlSizeInBytes(url)
    downloadedFileSize = 0 if not first else first
    block_sz = 8192
    resp = requests.get(url, headers={'Range': 'bytes=%s' % irange}, stream=True)

    for fileBuffer in resp.iter_content(block_sz):
        if not fileBuffer:
            break

        with open(filePath, mode) as fd:
            downloadedFileSize += len(fileBuffer)
            fd.write(fileBuffer)
            mode = 'a'

            status = r"%10d [%3.2f%%]" % (downloadedFileSize, downloadedFileSize * 100. / fileSize)
            status = status + chr(8)*(len(status)+1)
            sys.stdout.write('%s\r' % status)
            time.sleep(.01)
            sys.stdout.flush()
            if downloadedFileSize == fileSize:
                STOP_REQUEST.set()
                queue.task_done()
                _log.debug("Downloaded %s %s%% using %s and saved to %s", fileName,
                           downloadedFileSize * 100. / fileSize, threadName.getName(), saveTo)


def _downloadChunk(url, idx, irange, fileName, sizeInBytes):
    _log.debug("Downloading %s for first chunk %s of %s " % (irange, idx+1, fileName))
    pulledSize = irange[-1]
    try:
        resp = requests.get(url, allow_redirects=False, timeout=TIMEOUT,
                            headers={'Range': 'bytes=%s-%s' % (str(irange[0]), str(irange[-1]))},
                            stream=True)
    except (SocketTimeout, requests.exceptions), e:
        _log.error(e)
        return

    chunk_size = str(irange[-1])
    for chunk in resp.iter_content(chunk_size):
        status = r"%10d [%3.2f%%]" % (pulledSize, pulledSize * 100. / int(chunk_size))
        status = status + chr(8)*(len(status)+1)
        sys.stdout.write('%s\r' % status)
        sys.stdout.flush()
        pulledSize += len(chunk)
        dataDict[idx] = chunk
        time.sleep(.03)
        if pulledSize == sizeInBytes:
            _log.info("%s downloaded %3.0f%%", fileName, pulledSize * 100. / sizeInBytes)


class ThreadedFetch(threading.Thread):
    """ docstring for ThreadedFetch
    """
    def __init__(self, saveTo, queue):
        super(ThreadedFetch, self).__init__()
        self.queue = queue
        self.__saveTo = saveTo

    def run(self):
        threadLimiter.acquire()
        try:
            items = self.queue.get()
            url = items[0]
            split = items[-1]
            fileName = _fdUtils.getFileName(url)

            # grab split chunks in separate thread.
            if split > 1:
                maxSplits.acquire()
                try:
                    sizeInBytes = _fdUtils.getUrlSizeInBytes(url)
                    byteRanges = _fdUtils.getRangeSegements(sizeInBytes, split)
                    filePath = os.path.join(self.__saveTo, fileName)
                    downloaders = [
                        threading.Thread(
                            target=_downloadChunk,
                            args=(url, idx, irange, fileName, sizeInBytes),
                        )
                        for idx, irange in enumerate(byteRanges)
                    ]

                    # start threads, let run in parallel, wait for all to finish
                    for th in downloaders:
                        th.start()

                    # this makes the wait for all thread to finish
                    # which confirms the dataDict is up-to-date
                    for th in downloaders:
                        th.join()

                    downloadedSize = 0
                    with open(filePath, 'wb') as fh:
                        for _idx, chunk in sorted(dataDict.iteritems()):
                            downloadedSize += len(chunk)
                            status = r"%10d [%3.2f%%]" % (downloadedSize, downloadedSize * 100. / sizeInBytes)
                            status = status + chr(8)*(len(status)+1)
                            fh.write(chunk)
                            sys.stdout.write('%s\r' % status)
                            time.sleep(.04)
                            sys.stdout.flush()
                            if downloadedSize == sizeInBytes:
                                _log.info("%s, saved to %s", fileName, self.__saveTo)
                    self.queue.task_done()
                finally:
                    maxSplits.release()
            else:
                while not STOP_REQUEST.isSet():
                    self.setName("primary_%s_thread" % fileName.split(".")[0])
                    # if downloading the whole file in a single chunk there is no need
                    # to start a new thread, so directly download here.
                    _grabAndWriteToDisk(self, url, self.__saveTo, 0, self.queue)
        finally:
            threadLimiter.release()


def main(appName):
    args = _fdUtils.getParser()
    saveTo = args.saveTo if args.saveTo else DESKTOP_PATH

    # spawn a pool of threads, and pass them queue instance
    # each url will be downloaded concurrently
    unOrdUrls = dict(izip_longest(args.urls, args.se, fillvalue=1))
    ordUrls = OrderedDict([(k, unOrdUrls[k]) for k in sorted(unOrdUrls, key=unOrdUrls.get, reverse=False)
                           if _fdUtils.isWorkingURL(k, _log) and _fdUtils.notOnDisk(k, saveTo)])
    print "length: %s " % len(ordUrls)
    for i in xrange(len(ordUrls)):
        t = ThreadedFetch(saveTo, queue)
        t.daemon = True
        t.start()

    try:
        # populate queue with data
        for url, split in ordUrls.iteritems():
            url = _fdUtils.getUrl(url)
            print url
            queue.put((url, int(split)))

        # wait on the queue until everything has been processed
        queue.join()
        _log.info('All tasks completed.')
    except (KeyboardInterrupt, SystemExit):
        _log.critical('! Received keyboard interrupt, quitting threads.')


if __name__ == "__main__":
    # change the name of MainThread.
    threading.currentThread().setName("FileDownloader")
    myapp = threading.currentThread().getName()
    main(myapp)
I see two problems in your code. Since it's incomplete, I'm not sure how it's supposed to work, so I can't promise either one is the particular one you're running into first, but I'm pretty sure you need to fix both.
First:
queue.put((_fdUtils.getUrl(url), int(split)))
That's going to call _fdUtils.getUrl(url) in the main thread, and put the result on the queue. Your comments clearly imply that you intended the downloading to happen on the background threads.
If you wanted to pass a function to be called, just pass the function and its argument as separate members of the tuple, or wrap it up in a closure or a partial:
queue.put((lambda: _fdUtils.getUrl(url), int(split)))
Second:
t = ThreadedFetch(saveTo, queue)
t.daemon = True
t.start()
This starts a thread for every URL. That's almost never a good idea. Generally, downloaders don't use more than 4-16 threads at a time, and no more than 2-4 to the same site. You could easily be timing out because you're spamming some site too fast and its server or router is making you back off for a while. Or, with a huge number of requests, you could be flooding your own network and blocking ACKs or even rebooting the router (especially if you have either a cheap home WiFi router or ADSL with a crappy provider).
Also, a much simpler way to do this would be to use a smart pool, like a multiprocessing.dummy.Pool (multiprocessing.dummy means it acts like the multiprocessing module but uses threads) or, even better, a concurrent.futures.ThreadPoolExecutor. In fact, if you look at the docs, a parallel downloader is the first example for ThreadPoolExecutor.
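For illustration, here is a hedged sketch of that pool-based approach (Python 3 syntax; download_one and the worker count are illustrative placeholders, and the URL is the one mentioned in the question):
import concurrent.futures
import os
import requests

def download_one(url, save_to):
    # Stream one file to disk and return its name.
    local_name = url.rsplit("/", 1)[-1]
    resp = requests.get(url, stream=True, timeout=10)
    resp.raise_for_status()
    with open(os.path.join(save_to, local_name), "wb") as fh:
        for chunk in resp.iter_content(8192):
            fh.write(chunk)
    return local_name

# URL taken from the question; replace with the real list.
urls = ["http://archive.org/download/LucyIsabelleMarsh/LucyIsabelleMarsh-ItalianStreetSong.mp3"]
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(download_one, u, ".") for u in urls]
    for fut in concurrent.futures.as_completed(futures):
        print("finished:", fut.result())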
