I have a program that receives data (trades) from the Binance API.
This data is processed and visualized in a web app built with Dash and Plotly.
To get the best performance and the smallest possible delay, my program has 3 threads:
Thread 1 - Binance API - get requests - Trades
if __name__ == "__main__":
try:
loop = asyncio.get_event_loop()
        binance_thread = threading.Thread(target=start_thread_1)
...
def start_thread_1():
loop.run_until_complete(main(api_key,secret_key))
async def main(api_key,secret_key):
client = await AsyncClient.create(api_key,secret_key)
await trades_listener(client)
async def trades_listener(client):
bm = BinanceSocketManager(client)
symbol = 'BTCUSDT'
async with bm.trade_socket(symbol=symbol) as stream:
while True:
msg = await stream.recv()
event_type = msg['e']
...
trade = Trade(event_type,...)
            # <-- save the trade SOMEWHERE so the other thread can process it? save to: process_trades_list
Thread 2 - Web App - Displays Trades and Processed Trades Data
web-thread = threading.Thread(target=webserver.run_server)
...
not worth mentioning here
Thread 3 - Process Data - Process Trades (calculate RSI, filter big trades, etc)
if __name__ == "__main__":
try:
loop = asyncio.get_event_loop()
        binance_thread = threading.Thread(target=start_thread_1)
        web_thread = threading.Thread(target=webserver.run_server)
        process_thread = threading.Thread(target=start_thread_3)
...
.start()
.sleep()
etc.
.join()
def start_thread_3():
process_trades()
def process_trades():
global process_trades_list
while True:
while len(process_trades_list) > 0:
trade = process_trades_list[0]
process_trades_list.pop(0)
# ...do calculation etc.
HOW can I save / hand over the data from thread_1 (the async thread) to thread_3?
I tried to put the trades into a list called process_trades_list and then loop over it with while len(process_trades_list) > 0.
In the loop I pop() processed trades from the list - but this somehow seems to break the program without throwing errors.
What's the best way to get this done?
It is possible that the async stream gets spammed by new incoming trades, and I want to minimize the load.
Here you want a queue.Queue instead of a list. Your last code snippet would look something like this:
import queue
if __name__ == "__main__":
try:
q = queue.Queue()
binance_thread = threading.Thread(target=start_thread_1,
args=(q,))
web_thread = threading.Thread(target=webserver.run_server)
        process_thread = threading.Thread(target=process_trades,
args=(q,), daemon=True)
...
.start()
.sleep()
etc.
.join()
def process_trades(q):
while True:
trade = q.get()
# ...do calculation etc.
I eliminated the call to get_event_loop since you didn't use the returned object. I eliminated the start_thread_3 function, which is not necessary.
I made thread-3 a daemon, so it will not keep your application open if everything else is finished.
The queue should be created once, in the main thread, and passed explicitly to each thread that needs to access it. That eliminates the need for a global variable.
The process_trades function becomes much simpler. The q.get() call blocks until an object is available, and it also pops the object off the queue.
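If you also want the processing thread to be able to shut down cleanly instead of blocking forever, q.get accepts a timeout; here is a rough sketch of that variant (the stop_event is a hypothetical threading.Event, not part of the original code):

import queue
import threading

def process_trades(q, stop_event: threading.Event):
    # Sketch only: wake up at least once per second so the stop flag is noticed.
    while not stop_event.is_set():
        try:
            trade = q.get(timeout=1.0)
        except queue.Empty:
            continue  # nothing arrived in time; re-check the stop flag
        # ...do calculation etc.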
Next you must also modify thread-1 to put objects onto the queue, like this:
def start_thread_1(q):
asyncio.run(main(api_key,secret_key, q))
async def main(api_key,secret_key, q):
client = await AsyncClient.create(api_key,secret_key)
await trades_listener(client, q)
async def trades_listener(client, q):
bm = BinanceSocketManager(client)
symbol = 'BTCUSDT'
async with bm.trade_socket(symbol=symbol) as stream:
while True:
msg = await stream.recv()
event_type = msg['e']
...
trade = Trade(event_type,...)
q.put(trade)
The q.put function is how you safely put a trade object into the queue, which will then result in activity in thread-3.
I modified the start_thread_1 function: it is a good place to start the event loop machinery for this thread.
You ask about limiting the load when the stream gets spammed with new trades. Queues can be given a maximum size, and you can throw away trades when the queue becomes full.
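For example, a bounded queue combined with put_nowait lets thread-1 drop trades instead of blocking when the consumer falls behind; a minimal sketch (the maxsize value and the drop policy are just placeholders):

import queue

q = queue.Queue(maxsize=1000)  # cap how many unprocessed trades can pile up

def enqueue_trade(q, trade):
    try:
        q.put_nowait(trade)  # never blocks the async receive loop
    except queue.Full:
        pass  # queue is full: drop the trade (or log it, or discard the oldest and retry)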
I don't understand what you are trying to do with the if __name__ == '__main__' logic in thread-1. The program can have only one entry point, and only one module named '__main__'. It looks to me like that has to be thread-3.
My use case is the following:
I'm using Python 3.8.
I have an async function analyse_doc that is a wrapper for an HTTP request to a web service.
I have approx. 1000 docs to analyse as fast as possible. The service allows 15 transactions per second (and not 15 concurrent requests at any given second). So in the first second I can send 15, then in the 2nd second I can send 15 again, and so on. If I try to hit the service more than 15 times per second I get a 429 error, or sometimes a 503/504 error (server is busy…).
My question is: is it possible to implement something in Python that effectively sends 15 requests per second asynchronously, then waits 1 second, then does it again until the queue is empty? Also, some tasks might fail and might need a rerun at some point.
So far my code is the following (unbounded parallelism… not even a semaphore), but it handles retry:
tasks = {asyncio.create_task(analyse_async(doc)): doc for doc in documents}
pending = set(tasks)
# Handle retry
while pending:
# backoff in case of 429
time.sleep(1)
# concurrent call return_when all completed
finished, pending = await asyncio.wait(
pending, return_when=asyncio.ALL_COMPLETED
)
# check if task has exception and register for new run.
for task in finished:
        doc = tasks[task]
if task.exception():
            new_task = asyncio.create_task(analyse_async(doc))
tasks[new_task] = doc
pending.add(new_task)
You could try adding another sleep task into the mix to drive the request generation. Something like this:
import asyncio
import random
ONE_SECOND = 1
CONCURRENT_TASK_LIMIT = 2
TASKS_TO_CREATE = 10
loop = asyncio.new_event_loop()
work_todo = []
work_in_progress = []
# just creates arbitrary work to do
def create_tasks():
for i in range(TASKS_TO_CREATE):
work_todo.append(worker_task(i))
# muddle this up to see how drain works
random.shuffle(work_todo)
# represents the actual work
async def worker_task(index):
print(f"i am worker {index} and i am starting")
await asyncio.sleep(index)
print(f"i am worker {index} and i am done")
# gets the next 'concurrent' workload segment (if there is one)
def get_next_tasks():
todo = []
i = 0
while i < CONCURRENT_TASK_LIMIT and len(work_todo) > 0:
todo.append(work_todo.pop())
i += 1
return todo
# drains down any outstanding tasks and closes the loop
async def are_we_done_yet():
print('draining')
await asyncio.gather(*work_in_progress)
loop.stop()
# closes out the program
print('done')
# puts work on the queue every tick (1 second)
async def work():
next_tasks = get_next_tasks()
if len(next_tasks) > 0:
print(f'found {len(next_tasks)} tasks to do')
for task in next_tasks:
# schedules the work, puts it in the in-progress pile
work_in_progress.append(loop.create_task(task))
# this is the 'tick' or speed work gets scheduled on
await asyncio.sleep(ONE_SECOND)
        # every 'tick' we add this task onto the loop again, unless there isn't any more to do...
loop.create_task(work())
else:
# ... if there isn't any to do we just enter drain mode
await are_we_done_yet()
# bootstrap the process
create_tasks()
loop.create_task(work())
loop.run_forever()
Updated version with a simulated exception
import asyncio
import random
ONE_SECOND = 1
CONCURRENT_TASK_LIMIT = 2
TASKS_TO_CREATE = 10
loop = asyncio.new_event_loop()
work_todo = []
work_in_progress = []
# just creates arbitrary work to do
def create_tasks():
for i in range(TASKS_TO_CREATE):
work_todo.append(worker_task(i))
# muddle this up to see how drain works
random.shuffle(work_todo)
# represents the actual work
async def worker_task(index):
try:
print(f"i am worker {index} and i am starting")
await asyncio.sleep(index)
if index % 9 == 0:
print('simulating error')
raise NotImplementedError("some error happened")
print(f"i am worker {index} and i am done")
except:
# put this work back on the pile (fudge the index so it doesn't throw this time)
work_todo.append(worker_task(index + 1))
# gets the next 'concurrent' workload segment (if there is one)
def get_next_tasks():
todo = []
i = 0
while i < CONCURRENT_TASK_LIMIT and len(work_todo) > 0:
todo.append(work_todo.pop())
i += 1
return todo
# drains down any outstanding tasks and closes the loop
async def are_we_done_yet():
print('draining')
await asyncio.gather(*work_in_progress)
if (len(work_todo)) > 0:
loop.create_task(work())
print('found some retries')
else:
loop.stop()
# closes out the program
print('done')
# puts work on the queue every tick (1 second)
async def work():
next_tasks = get_next_tasks()
if len(next_tasks) > 0:
print(f'found {len(next_tasks)} tasks to do')
for task in next_tasks:
# schedules the work, puts it in the in-progress pile
work_in_progress.append(loop.create_task(task))
# this is the 'tick' or speed work gets scheduled on
await asyncio.sleep(ONE_SECOND)
        # every 'tick' we add this task onto the loop again, unless there isn't any more to do...
loop.create_task(work())
else:
# ... if there isn't any to do we just enter drain mode
await are_we_done_yet()
# bootstrap the process
create_tasks()
loop.create_task(work())
loop.run_forever()
This just simulates something going wrong and re-queues the failed task. If the error happens after the main work method has finished, it won't get re-queued, so the are_we_done_yet method would need to check for and rerun any failed tasks. This isn't particularly optimal, as it waits for the drain before checking everything else, but it gives you an idea of an implementation.
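For example, the drain step could be extended roughly like this to detect tasks that failed without re-queueing themselves (this reuses the globals from the snippet above; how to map a failed task back to its original index, e.g. via a task-to-index dict, is left open):

async def are_we_done_yet():
    print('draining')
    # return_exceptions=True lets the drain finish even if some task raised
    await asyncio.gather(*work_in_progress, return_exceptions=True)
    failed = [t for t in work_in_progress if t.exception() is not None]
    if failed or len(work_todo) > 0:
        print(f'{len(failed)} failed tasks, {len(work_todo)} retries queued')
        loop.create_task(work())  # schedule another pass over work_todo
    else:
        loop.stop()
        print('done')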
I am planning to have an asyncio Queue based producer-consumer implementation for processing real-time data, where sending out the data in the correct time order is vital. Here is a code snippet of it:
async def produce(Q, n_jobs):
for i in range(n_jobs):
print(f"Producing :{i}")
await Q.put(i)
async def consume(Q):
while True:
n = await Q.get()
print(f"Consumed :{n}")
x = do_sometask_and_return_the_result(n)
print(f"Finished :{n} and Result: {x}")
async def main(loop):
Q = asyncio.Queue(loop=loop, maxsize=3)
await asyncio.wait([produce(Q, 10), consume(Q), consume(Q), consume(Q)])
print("Done")
Here the producer produces data and puts it into the asyncio Queue. I have multiple consumers to consume and process the data. Looking at the output, the order is maintained when printing "Consumed :{n}" (1, 2, 3, 4, ... and so on), which is completely fine. But since the function do_sometask_and_return_the_result(n) takes a variable amount of time to return, the order is not maintained in the next print, "Finished :{n}" (e.g. 2, 1, 4, 3, 5, ...).
Is there any way to synchronize this so that the order of the printed results is maintained? I want to see sequential prints 1, 2, 3, 4, ... for n even after do_sometask_and_return_the_result(n).
You could use a priority queue (via the Python heapq library) to reorder your jobs after they are complete. Something like this:
from heapq import heappush, heappop

# add these variables at class/global scope
priority_queue = []
current_job_id = 1
job_id_dict = {}
async def produce(Q, n_jobs):
# same as above
async def consume(Q):
while True:
n = await Q.get()
print(f"Consumed :{n}")
x = do_sometask_and_return_the_result(n)
await process_result(n, x)
async def process_result(n, x):
    global current_job_id
    heappush(priority_queue, n)
    job_id_dict[n] = x
    # release results only while the next expected job id is at the top of the heap
    while priority_queue and current_job_id == priority_queue[0]:
        job_id = heappop(priority_queue)
        print(f"Finished :{job_id} and Result: {job_id_dict[job_id]}")
        current_job_id += 1
async def main(loop):
Q = asyncio.Queue(loop=loop, maxsize=3)
await asyncio.wait([produce(Q, 10), consume(Q), consume(Q), consume(Q)])
print("Done")
For more information on the heapq module: https://docs.python.org/3/library/heapq.html
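As a tiny standalone illustration of the reordering idea (the sample data is made up for the demo):

from heapq import heappush, heappop

# results finish out of order, but are only released when the next expected
# job id sits at the top of the heap
finished_out_of_order = [(2, 'b'), (1, 'a'), (4, 'd'), (3, 'c')]
heap, next_id = [], 1
for job_id, result in finished_out_of_order:
    heappush(heap, (job_id, result))
    while heap and heap[0][0] == next_id:
        jid, res = heappop(heap)
        print(f"Finished :{jid} and Result: {res}")  # prints ids 1, 2, 3, 4 in order
        next_id += 1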
I have the following code, which reads and writes for each id sequentially.
async def main():
while id < 1000:
data = await read_async(id)
await data.write_async(f'{id}.csv')
id += 1
read_async() takes several minutes and write_async() takes less than one minute to run. Now I want to:
Run read_async(id) in parallel. However, at most 3 calls can run in parallel because of memory limitations.
Run write_async sequentially, i.e., write_async(n+1) cannot run before write_async(n).
You could use a queue and a fixed number of tasks for reading, and write from the main task. The main task can use an event to find out that new data is available from the readers, and a shared dict to get it from them. For example (untested):
async def reader(q, id_to_data, data_ready):
while True:
id = await q.get()
data = await read_async(id)
id_to_data[id] = data
data_ready.set()
async def main():
q = asyncio.Queue()
for id in range(1000):
await q.put(id)
id_to_data = {}
data_ready = asyncio.Event()
readers = [asyncio.create_task(reader(q, id_to_data, data_ready))
           for _ in range(3)]
for id in range(1000):
while True:
# wait for the current ID to appear before writing
if id in id_to_data:
data = id_to_data.pop(id)
await data.write_async(f'{id}.csv')
break
# move on to the next ID
else:
# wait for new data and try again
await data_ready.wait()
data_ready.clear()
for r in readers:
r.cancel()
Using a separate queue for results instead of the event wouldn't work because a queue is unordered. A priority queue would fix that, but it would still immediately return the lowest id currently available, whereas the writer needs the next id in order to process all ids in order.
I am back with a question about asyncio. I find it very useful (especially given the GIL with threads) and I am trying to boost the performance of some pieces of code.
My application is doing the following:
One background daemon thread "A" receives events from connected clients and reacts by populating a SetQueue (simply an event queue that removes duplicate ids) and by doing some insertions into a DB. I get this daemon from another module (basically I control a callback from when an event is received). In the sample code below I substituted this with a thread I generate, which simply populates the queue with 20 items and mimics DB inserts before exiting.
One background daemon thread "B" is launched (loop_start) and it just loops over running until completion a coroutine that:
Fetches all the items in the queue (if not empty; otherwise it releases control for x seconds and then the coroutine is re-launched)
For each id in the queue it launches a chained coroutine that:
Creates and waits for a task that just fetches all relevant information for that id from the DB. I am using MotorClient that supports asyncio to do await in the task itself.
Uses a process pool executor to launch a process per id that uses the DB data to do some CPU-intensive processing.
The main thread just initializes the db_client and takes loop_start and stop commands.
That is basically it.
Now I am trying to boost performance as much as possible.
My current issue is in using motor.motor_asyncio.AsyncioMotorClient() in this way:
It gets initialized in the main thread and there I want to create indexes
Thread "A" needs to perform DB insertions
Thread "B" needs to perform DB finds/reads
How can I do this? Motor states that it is meant for a single-threaded application where you obviously use a single event loop.
Here I found myself forced to have two event loops, one in thread "A" and one in thread "B". This is not optimal, but I didn't manage to use a single event loop with call_soon_threadsafe while keeping the same behavior... and I think that performance-wise I am still gaining a lot with two event loops that release control over the GIL-bound CPU core.
Should I use three different AsyncioMotorClient instances (one per thread) and use them as stated above? I failed with various errors while trying.
Here is my sample code, which omits only the MotorClient initialization in Asynchro's __init__:
import threading
import asyncio
import concurrent.futures
import functools
import os
import time
import logging
from random import randint
from queue import Queue
# create logger
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
# create file handler which logs even debug messages
fh = logging.FileHandler('{}.log'.format(__name__))
fh.setLevel(logging.DEBUG)
# create console handler with a higher log level
ch = logging.StreamHandler()
ch.setLevel(logging.DEBUG)
# create formatter and add it to the handlers
formatter = logging.Formatter('%(asctime)s - %(name)s - %(processName)s - %(threadName)s - %(levelname)s - %(message)s')
fh.setFormatter(formatter)
ch.setFormatter(formatter)
# add the handlers to the logger
logger.addHandler(fh)
logger.addHandler(ch)
class SetQueue(Queue):
"""Queue that avoids duplicate entries while keeping an order."""
def _init(self, maxsize):
self.maxsize = maxsize
self.queue = set()
def _put(self, item):
if type(item) is not int:
raise TypeError
self.queue.add(item)
def _get(self):
# Get always all items in a thread-safe manner
ret = self.queue.copy()
self.queue.clear()
return ret
class Asynchro:
def __init__(self, event_queue):
self.__daemon = None
self.__daemon_terminate = False
self.__queue = event_queue
def fake_populate(self, size):
t = threading.Thread(target=self.worker, args=(size,))
t.daemon = True
t.start()
def worker(self, size):
run = True
populate_event_loop = asyncio.new_event_loop()
asyncio.set_event_loop(populate_event_loop)
cors = [self.worker_cor(i, populate_event_loop) for i in range(size)]
done, pending = populate_event_loop.run_until_complete(asyncio.wait(cors))
logger.debug('Finished to populate event queue with result done={}, pending={}.'.format(done, pending))
while run:
# Keep it alive to simulate something still alive (minor traffic)
time.sleep(5)
rand = randint(100, 200)
populate_event_loop.run_until_complete(self.worker_cor(rand, populate_event_loop))
if self.__daemon_terminate:
logger.debug('Closed the populate_event_loop.')
populate_event_loop.close()
run = False
async def worker_cor(self, i, loop):
time.sleep(0.5)
self.__queue.put(i)
logger.debug('Wrote {} in the event queue that has now size {}.'.format(i, self.__queue.qsize()))
# Launch fake DB Insertions
#db_task = loop.create_task(self.fake_db_insert(i))
db_data = await self.fake_db_insert(i)
logger.info('Finished to populate with id {}'.format(i))
return db_data
    @staticmethod
async def fake_db_insert(item):
# Fake some DB insert
logger.debug('Starting fake db insertion with id {}'.format(item))
st = randint(1, 101) / 100
await asyncio.sleep(st)
logger.debug('Finished db insertion with id {}, sleep {}'.format(item, st))
return item
def loop_start(self):
logger.info('Starting the loop.')
if self.__daemon is not None:
raise Exception
self.__daemon_terminate = False
self.__daemon = threading.Thread(target=self.__daemon_main)
self.__daemon.daemon = True
self.__daemon.start()
def loop_stop(self):
logger.info('Stopping the loop.')
if self.__daemon is None:
raise Exception
self.__daemon_terminate = True
if threading.current_thread() != self.__daemon:
self.__daemon.join()
self.__daemon = None
logger.debug('Stopped the loop and closed the event_loop.')
def __daemon_main(self):
logger.info('Background daemon started (inside __daemon_main).')
event_loop = asyncio.new_event_loop()
asyncio.set_event_loop(event_loop)
run, rc = True, 0
while run:
logger.info('Inside \"while run\".')
event_loop.run_until_complete(self.__cor_main())
if self.__daemon_terminate:
event_loop.close()
run = False
rc = 1
return rc
async def __cor_main(self):
# If nothing in the queue release control for a bit
if self.__queue.qsize() == 0:
logger.info('Event queue is empty, going to sleep (inside __cor_main).')
await asyncio.sleep(10)
return
# Extract all items from event queue
items = self.__queue.get()
# Run asynchronously DB extraction and processing on the ids (using pool of processes)
with concurrent.futures.ProcessPoolExecutor(max_workers=8) as executor:
cors = [self.__cor_process(item, executor) for item in items]
logger.debug('Launching {} coroutines to elaborate queue items (inside __cor_main).'.format(len(items)))
done, pending = await asyncio.wait(cors)
logger.debug('Finished to execute __cor_main with result {}, pending {}'
.format([t.result() for t in done], pending))
async def __cor_process(self, item, executor):
# Extract corresponding DB data
event_loop = asyncio.get_event_loop()
db_task = event_loop.create_task(self.fake_db_access(item))
db_data = await db_task
# Heavy processing of data done in different processes
logger.debug('Launching processes to elaborate db_data.')
res = await event_loop.run_in_executor(executor, functools.partial(self.fake_processing, db_data, None))
return res
    @staticmethod
async def fake_db_access(item):
# Fake some db access
logger.debug('Starting fake db access with id {}'.format(item))
st = randint(1, 301) / 100
await asyncio.sleep(st)
logger.debug('Finished db access with id {}, sleep {}'.format(item, st))
return item
    @staticmethod
def fake_processing(db_data, _):
# fake some CPU processing
logger.debug('Starting fake processing with data {}'.format(db_data))
st = randint(1, 101) / 10
time.sleep(st)
logger.debug('Finished fake processing with data {}, sleep {}, process id {}'.format(db_data, st, os.getpid()))
return db_data
def main():
# Event queue
queue = SetQueue()
return Asynchro(event_queue=queue)
if __name__ == '__main__':
a = main()
a.fake_populate(20)
time.sleep(5)
a.loop_start()
time.sleep(20)
a.loop_stop()
What's the reason for running multiple event loops?
I suggest just using a single loop in the main thread; that's the native mode for asyncio.
asyncio can run a loop in a non-main thread in very rare scenarios, but it doesn't look like your case.
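If you do consolidate on a single loop in the main thread, the other threads can hand their coroutines to that loop with asyncio.run_coroutine_threadsafe instead of creating loops of their own. A minimal sketch (insert_one is a stand-in for a Motor call, not Motor's actual API):

import asyncio
import threading

async def insert_one(doc):
    await asyncio.sleep(0.1)  # stands in for an await on a Motor operation
    return doc

def thread_a_callback(loop, doc):
    # runs in a plain thread; schedules the coroutine on the main-thread loop
    future = asyncio.run_coroutine_threadsafe(insert_one(doc), loop)
    return future.result()  # optionally block this thread until it finishes

async def main():
    loop = asyncio.get_running_loop()
    t = threading.Thread(target=thread_a_callback, args=(loop, {"id": 1}))
    t.start()
    # join the thread via the default executor so the loop keeps running
    await loop.run_in_executor(None, t.join)

asyncio.run(main())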
I have a program with one producer and two slow consumers, and I'd like to rewrite it with coroutines in such a way that each consumer handles only the last value produced for it (i.e. skips new values generated while it is processing the old ones). I used threads and threading.Queue(), but that blocks on put(), because the queue is full most of the time.
After reading the answer to this question I decided to use asyncio.Event and asyncio.Queue. I wrote this prototype program:
import asyncio
async def l(event, q):
h = 1
while True:
# ready
event.set()
# get value to process
a = await q.get()
# process it
print(a * h)
h *= 2
async def m(event, q):
i = 1
while True:
# pass element to consumer, when it's ready
if event.is_set():
await q.put(i)
event.clear()
# produce value
i += 1
el = asyncio.get_event_loop()
ev = asyncio.Event()
qu = asyncio.Queue(2)
tasks = [
asyncio.ensure_future(l(ev, qu)),
asyncio.ensure_future(m(ev, qu))
]
el.run_until_complete(asyncio.gather(*tasks))
el.close()
and I have noticed that the l coroutine blocks on the q.get() line and doesn't print anything.
It works as I expect after adding asyncio.sleep() in both (I get 1, 11, 21, ...):
import asyncio
import time
async def l(event, q):
h = 1
a = 1
event.set()
while True:
# await asyncio.sleep(1)
a = await q.get()
# process it
await asyncio.sleep(1)
print(a * h)
event.set()
async def m(event, q):
i = 1
while True:
# pass element to consumer, when it's ready
if event.is_set():
await q.put(i)
event.clear()
await asyncio.sleep(0.1)
# produce value
i += 1
el = asyncio.get_event_loop()
ev = asyncio.Event()
qu = asyncio.Queue(2)
tasks = [
asyncio.ensure_future(l(ev, qu)),
asyncio.ensure_future(m(ev, qu))
]
el.run_until_complete(asyncio.gather(*tasks))
el.close()
...but I'm looking for a solution without it.
Why is it so? How can I fix it? I think I cannot call await l() from m, as both of them have state (in the original program the first draws a solution with PyGame and the second plots results).
The code is not working as expected because the task running the m function is never suspended. That task keeps incrementing i whenever event.is_set() == False, and since it never hits an await that actually suspends it, the task running function l never gets a chance to run. You therefore need a way to suspend the task running function m; one way of suspending is awaiting another coroutine, which is why adding asyncio.sleep makes it work as expected.
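Just to make that suspension point concrete (this is only an illustration; the actual fix follows below): the producer has to hit an await even when it has nothing to hand over, and the cheapest such await is asyncio.sleep(0), which suspends m for one loop iteration so l can run:

async def m(event, q):
    i = 1
    while True:
        # pass element to consumer, when it's ready
        if event.is_set():
            await q.put(i)
            event.clear()
        else:
            await asyncio.sleep(0)  # yield to the event loop so l gets scheduled
        # produce value
        i += 1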
I think the following code will work as you expect. The LeakyQueue ensures that only the last value from the producer will be processed by the consumer. Since producer and consumer here do a symmetric amount of work, the consumer ends up consuming every value the producer produces; if you increase the delay argument, you can simulate the consumer only processing the last value created by the producer.
import asyncio
class LeakyQueue(asyncio.Queue):
async def put(self, item):
if self.full():
await self.get()
await super().put(item)
async def consumer(queue, delay=0):
h = 1
while True:
a = await queue.get()
if delay:
await asyncio.sleep(delay)
print ('consumer', a)
h += 2
async def producer(queue):
i = 1
while True:
await asyncio.ensure_future(queue.put(i))
print ('producer', i)
i += 1
loop = asyncio.get_event_loop()
queue = LeakyQueue(maxsize=1)
tasks = [
asyncio.ensure_future(consumer(queue, 0)),
asyncio.ensure_future(producer(queue))
]
loop.run_until_complete(asyncio.gather(*tasks))
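Side note: this example uses the older get_event_loop() / run_until_complete() style; on newer Python versions the same pair of coroutines would usually be driven with asyncio.run, roughly like this:

async def main():
    queue = LeakyQueue(maxsize=1)
    await asyncio.gather(consumer(queue, 0), producer(queue))

# asyncio.run(main())  # runs forever, since both coroutines loop endlessly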