python websockets message queue getting too long, resulting in stale messages - python

Having the following design problem: I open a websocket connection and send multiple messages every second to get open risk.
What’s happening is the message queue is building up wit a lot of “stale” messages (from frequently sending messages), and when I send a trade order out and it fills, the risk is delayed as it has to wait for the while loop to iterate through all the old messages, before it updates with the correct risk
Ie, the deque in websockets["messages"] is getting really long, and it takes the while loop some time to catch up to the "recent" messages
Sample code:
import websockets
url = 'wss://www.deribit.com/ws/api/v2'
risk1 = None
risk2 = None
async with websockets.connect(url) as ws:
# send many messages every second, which builds up a lot of messages in the queue
await send_message("private/get_positions", "btc-perpetual")
await send_message("private/get_positions", "eth-perpetual")
....
await send_message("private/get_positions", "sol-perpetual")
# messages from above build up in a deque, which gets iterated one-at-a-time in while loop
response = await ws.recv()
if response["id"] == 100:
risk1 = response["result"]
elif response["id"] == 200:
risk2 = response["result"]
else:
pass
# message queue gets long, and messages go stale (response from ws.recv() ), resulting in out-of-date risk
risk_usd = calculate_risk(risk1, risk2)
if risk_usd > 0:
await post_order()
some ideas ive had, but not sure if good practice:
ignore message if > x seconds
unpack the websockets["messages"] and choose last item
note: there are multiple variables (risk1, risk2) getting updated with each iteration, and ALL of them need to be up-to-date

Related

Using a Python websocket server as an async generator

I have a scraper that requires the use of a websocket server (can't go into too much detail on why because of company policy) that I'm trying to turn into a template/module for easier use on other websites.
I have one main function that runs the loop of the server (e.g. ping-pongs to keep the connection alive and send work and stop commands when necessary) that I'm trying to turn into a generator that yields the HTML of scraped pages (asynchronously, of course). However, I can't figure out a way to turn the server into a generator.
This is essentially the code I would want (simplified to just show the main idea, of course):
import asyncio, websockets
needsToStart = False # Setting this to true gets handled somewhere else in the script
async def run(ws):
global needsToStart
while True:
data = await ws.recv()
if data == "ping":
await ws.send("pong")
elif "<html" in data:
yield data # Yielding the page data
if needsToStart:
await ws.send("work") # Starts the next scraping session
needsToStart = False
generator = websockets.serve(run, 'localhost', 9999)
while True:
html = await anext(generator)
# Do whatever with html
This, of course, doesn't work, giving the error "TypeError: 'Serve' object is not callable". But is there any way to set up something along these lines? An alternative I could try is creating an 'intermittent' object that holds the data which the end loop awaits, but that seems messier to me than figuring out a way to get this idea to work.
Thanks in advance.
I found a solution that essentially works backwards, for those in need of the same functionality: instead of yielding the data, I pass along the function that processes said data. Here's the updated example case:
import asyncio, websockets
from functools import partial
needsToStart = False # Setting this to true gets handled somewhere else in the script
def process(html):
pass
async def run(ws, htmlFunc):
global needsToStart
while True:
data = await ws.recv()
if data == "ping":
await ws.send("pong")
elif "<html" in data:
htmlFunc(data) # Processing the page data
if needsToStart:
await ws.send("work") # Starts the next scraping session
needsToStart = False
func = partial(run, htmlFunc=process)
websockets.serve(func, 'localhost', 9999)

How to get the next telegram messages from specific users

I'm implementing a telegram bot that will serve users. Initially, it used to get any new message sequentially, even in the middle of an ongoing session with another user. Because of that, anytime 2 or more users tried to use the bot, it used to get all jumbled up. To solve this I implemented a queue system that put users on hold until the ongoing conversation was finished. But this queue system is turning out to be a big hassle. I think my problems would be solved with just a method to get the new messages from a specific chat_id or user. This is the code that I'm using to get any new messages:
def get_next_message_result(self, update_id: int, chat_id: str):
"""
get the next message the of a given chat.
In case of the next message being from another user, put it on the queue, and wait again for
expected one.
"""
update_id += 1
link_requisicao = f'{self.url_base}getUpdates?timeout={message_timeout}&offset={update_id}'
result = json.loads(requests.get(link_requisicao).content)["result"]
if len(result) == 0:
return result, update_id # timeout
if "text" not in result[0]["message"]:
self.responder(speeches.no_text_speech, message_chat_id)
return [], update_id # message without text
message_chat_id = result[0]["message"]["chat"]["id"]
while message_chat_id != chat_id:
self.responder(speeches.wait_speech, message_chat_id)
if message_chat_id not in self.current_user_queue:
self.current_user_queue.append(message_chat_id)
print("Queuing user with the following chat_id:", message_chat_id)
update_id += 1
link_requisicao = f'{self.url_base}getUpdates?timeout={message_timeout}&offset={update_id}'
result = json.loads(requests.get(link_requisicao).content)["result"]
if len(result) == 0:
return result, update_id # timeout
if "text" not in result[0]["message"]:
self.responder(speeches.no_text_speech, message_chat_id)
return [], update_id # message without text
message_chat_id = result[0]["message"]["chat"]["id"]
return result, update_id
On another note: I use the queue so that the moment the current conversation ends, it calls the next user in line. Should I just drop the queue feature and tell the concurrent users to wait a few minutes? While ignoring any messages not from the current chat_id?

SQS Poll deletes messages upon reading

I'm using SQS service in my application which pushes jobs to the queue based on client request.
I have the following polling mechanism setup in a separate job server.
def get_message(self):
response = self.queue.receive_message(
QueueUrl=self.sqs_endpoint_url,
AttributeNames=['All'],
MessageAttributeNames=['All'],
MaxNumberOfMessages=self.max_messages_per_read,
WaitTimeSeconds=self.wait_time,
)
if response and ('Messages' in response) and (len(response["Messages"]) > 0):
return response['Messages'][0]
return None
def poll()
while True:
try:
message = self.get_message()
if message is None:
time.sleep(self.sleep_time_if_no_messages_found)
else:
config.logger.info(f'Processing message {message}')
action(message)
message_receipt_handle = message["ReceiptHandle"]
self.delete_message(message_receipt_handle)
except Exception:
config.logger.error(f'Error in processing message from queue: {str(traceback.format_exc())}')
// From job server
poll()
From the logs I observed that the messages are pushed to the queue, but it receives just one message and rest of the messages are just deleted without ever calling the action() method.
I'm pretty new to SQS and not sure what I'm doing wrong here. The polling takes place in a single thread.
SQS Attributes set,
ReceiveMessageWaitTimeSeconds: 20
VisibilityTimeout: 300
MaxReceiveCount: 5
I changed get_message() to return 10 messages (maximum) instead of 1 and modified poll() to iterate over the messages and now it seems to be working alright.

Combining semaphore and time limiting in python-trio with asks http request

I'm trying to use Python in an async manner in order to speed up my requests to a server. The server has a slow response time (often several seconds, but also sometimes faster than a second), but works well in parallel. I have no access to this server and can't change anything about it. So, I have a big list of URLs (in the code below, pages) which I know beforehand, and want to speed up their loading by making NO_TASKS=5 requests at a time. On the other hand, I don't want to overload the server, so I want a minimum pause between every request of 1 second (i. e. a limit of 1 request per second).
So far I have successfully implemented the semaphore part (five requests at a time) using a Trio queue.
import asks
import time
import trio
NO_TASKS = 5
asks.init('trio')
asks_session = asks.Session()
queue = trio.Queue(NO_TASKS)
next_request_at = 0
results = []
pages = [
'https://www.yahoo.com/',
'http://www.cnn.com',
'http://www.python.org',
'http://www.jython.org',
'http://www.pypy.org',
'http://www.perl.org',
'http://www.cisco.com',
'http://www.facebook.com',
'http://www.twitter.com',
'http://www.macrumors.com/',
'http://arstechnica.com/',
'http://www.reuters.com/',
'http://abcnews.go.com/',
'http://www.cnbc.com/',
]
async def async_load_page(url):
global next_request_at
sleep = next_request_at
next_request_at = max(trio.current_time() + 1, next_request_at)
await trio.sleep_until(sleep)
next_request_at = max(trio.current_time() + 1, next_request_at)
print('start loading page {} at {} seconds'.format(url, trio.current_time()))
req = await asks_session.get(url)
results.append(req.text)
async def producer(url):
await queue.put(url)
async def consumer():
while True:
if queue.empty():
print('queue empty')
return
url = await queue.get()
await async_load_page(url)
async def main():
async with trio.open_nursery() as nursery:
for page in pages:
nursery.start_soon(producer, page)
await trio.sleep(0.2)
for _ in range(NO_TASKS):
nursery.start_soon(consumer)
start = time.time()
trio.run(main)
However, I'm missing the implementation of the limiting part, i. e. the implementation of max. 1 request per second. You can see above my attempt to do so (first five lines of async_load_page), but as you can see when you execute the code, this is not working:
start loading page http://www.reuters.com/ at 58097.12261669573 seconds
start loading page http://www.python.org at 58098.12367392373 seconds
start loading page http://www.pypy.org at 58098.12380622773 seconds
start loading page http://www.macrumors.com/ at 58098.12389389973 seconds
start loading page http://www.cisco.com at 58098.12397854373 seconds
start loading page http://arstechnica.com/ at 58098.12405119873 seconds
start loading page http://www.facebook.com at 58099.12458010273 seconds
start loading page http://www.twitter.com at 58099.37738939873 seconds
start loading page http://www.perl.org at 58100.37830828273 seconds
start loading page http://www.cnbc.com/ at 58100.91712723473 seconds
start loading page http://abcnews.go.com/ at 58101.91770178373 seconds
start loading page http://www.jython.org at 58102.91875295573 seconds
start loading page https://www.yahoo.com/ at 58103.91993155273 seconds
start loading page http://www.cnn.com at 58104.48031027673 seconds
queue empty
queue empty
queue empty
queue empty
queue empty
I've spent some time searching for answers but couldn't find any.
One of the ways to achieve your goal would be using a mutex acquired by a worker before sending a request and released in a separate task after some interval:
async def fetch_urls(urls: Iterator, responses, n_workers, throttle):
# Using binary `trio.Semaphore` to be able
# to release it from a separate task.
mutex = trio.Semaphore(1)
async def tick():
await trio.sleep(throttle)
mutex.release()
async def worker():
for url in urls:
await mutex.acquire()
nursery.start_soon(tick)
response = await asks.get(url)
responses.append(response)
async with trio.open_nursery() as nursery:
for _ in range(n_workers):
nursery.start_soon(worker)
If a worker gets response sooner than after throttle seconds, it will block on await mutex.acquire(). Otherwise the mutex will be released by the tick and another worker will be able to acquire it.
This is similar to how leaky bucket algorithm works:
Workers waiting for the mutex are like water in a bucket.
Each tick is like a bucket leaking at a constant rate.
If you add a bit of logging just before sending a request you should get an output similar to this:
0.00169 started
0.001821 n_workers: 5
0.001833 throttle: 1
0.002152 fetching https://httpbin.org/delay/4
1.012 fetching https://httpbin.org/delay/2
2.014 fetching https://httpbin.org/delay/2
3.017 fetching https://httpbin.org/delay/3
4.02 fetching https://httpbin.org/delay/0
5.022 fetching https://httpbin.org/delay/2
6.024 fetching https://httpbin.org/delay/2
7.026 fetching https://httpbin.org/delay/3
8.029 fetching https://httpbin.org/delay/0
9.031 fetching https://httpbin.org/delay/0
10.61 finished
Using trio.current_time() for this is much too complicated IMHO.
The easiest way to do rate limiting is a rate limiter, i.e. a separate task that basically does this:
async def ratelimit(queue,tick, task_status=trio.TASK_STATUS_IGNORED):
with trio.open_cancel_scope() as scope:
task_status.started(scope)
while True:
await queue.put()
await trio.sleep(tick)
Example use:
async with trio.open_nursery() as nursery:
q = trio.Queue(0) # can use >0 for burst modes
limiter = await nursery.start(ratelimit, q, 1)
while whatever:
await q.get(None) # will return at most once per second
do_whatever()
limiter.cancel()
in other words, you start that task with
q = trio.Queue(0)
limiter = await nursery.start(ratelimit, q, 1)
and then you can be sure that at most one call of
await q.put(None)
per second will return, as the zero-length queue acts as a rendezvous point. When you're done, call
limiter.cancel()
to stop the rate limiting task, otherwise your nursery won't exit.
If your use case includes starting sub-tasks which you need to finish before the limiter gets cancelled, the easiest way to do that is to rin them in another nursery, i.e. instead of
while whatever:
await q.put(None) # will return at most once per second
do_whatever()
limiter.cancel()
you'd use something like
async with trio.open_nursery() as inner_nursery:
await start_tasks(inner_nursery, q)
limiter.cancel()
which would wait for the tasks to finish before touching the limiter.
NB: You can easily adapt this for "burst" mode, i.e. allow a certain number of requests before the rate limiting kicks in, by simply increasing the queue's length.
Motivation and origin of this solution
Some months have passed since I asked this question. Python has improved since then, so has trio (and my knowledge of them). So I thought it was time for a little update using Python 3.6 with type annotations and trio-0.10 memory channels.
I developed my own improvement of the original version, but after reading #Roman Novatorov's great solution, adapted it again and this is the result. Kudos to him for the main structure of the function (and the idea to use httpbin.org for illustration purposes). I chose to use memory channels instead of a mutex to be able to take out any token re-release logic out of the worker.
Explanation of solution
I can rephrase the original problem like this:
I want to have a number of workers that start the request independently of each other (thus, they will be realized as asynchronous functions).
There is zero or one token released at any point; any worker starting a request to the server consumes a token, and the next token will not be issued until a minimum time has passed. In my solution, I use trio's memory channels to coordinate between the token issuer and the token consumers (workers)
In case your not familiar with memory channels and their syntax, you can read about them in the trio doc. I think the logic of async with memory_channel and memory_channel.clone() can be confusing in the first moment.
from typing import List, Iterator
import asks
import trio
asks.init('trio')
links: List[str] = [
'https://httpbin.org/delay/7',
'https://httpbin.org/delay/6',
'https://httpbin.org/delay/4'
] * 3
async def fetch_urls(urls: List[str], number_workers: int, throttle_rate: float):
async def token_issuer(token_sender: trio.abc.SendChannel, number_tokens: int):
async with token_sender:
for _ in range(number_tokens):
await token_sender.send(None)
await trio.sleep(1 / throttle_rate)
async def worker(url_iterator: Iterator, token_receiver: trio.abc.ReceiveChannel):
async with token_receiver:
for url in url_iterator:
await token_receiver.receive()
print(f'[{round(trio.current_time(), 2)}] Start loading link: {url}')
response = await asks.get(url)
# print(f'[{round(trio.current_time(), 2)}] Loaded link: {url}')
responses.append(response)
responses = []
url_iterator = iter(urls)
token_send_channel, token_receive_channel = trio.open_memory_channel(0)
async with trio.open_nursery() as nursery:
async with token_receive_channel:
nursery.start_soon(token_issuer, token_send_channel.clone(), len(urls))
for _ in range(number_workers):
nursery.start_soon(worker, url_iterator, token_receive_channel.clone())
return responses
responses = trio.run(fetch_urls, links, 5, 1.)
Example of logging output:
As you see, the minimum time between all page requests is one second:
[177878.99] Start loading link: https://httpbin.org/delay/7
[177879.99] Start loading link: https://httpbin.org/delay/6
[177880.99] Start loading link: https://httpbin.org/delay/4
[177881.99] Start loading link: https://httpbin.org/delay/7
[177882.99] Start loading link: https://httpbin.org/delay/6
[177886.20] Start loading link: https://httpbin.org/delay/4
[177887.20] Start loading link: https://httpbin.org/delay/7
[177888.20] Start loading link: https://httpbin.org/delay/6
[177889.44] Start loading link: https://httpbin.org/delay/4
Comments on the solution
As not untypical for asynchronous code, this solution does not maintain the original order of the requested urls. One way to solve this is to associate an id to the original url, e. g. with a tuple structure, put the responses into a response dictionary and later grab the responses one after the other to put them into a response list (saves sorting and has linear complexity).
You need to increment next_request_at by 1 every time you come into async_load_page. Try using next_request_at = max(trio.current_time() + 1, next_request_at + 1). Also I think you only need to set it once. You may get into trouble if you're setting it around awaits, where you're giving the opportunity for other tasks to change it before examining it again.

Python aiohttp (with asyncio) sends requests very slowly

Situation:
I am trying to send a HTTP request to all listed domains in a specific file I already downloaded and get the destination URL, I was forwarded to.
Problem: Well I have followed a tutorial and I get many less responses than expected. It's around 100 responses per second, but in the tutorial there are 100,000 responses per minute listed.
The script gets also slower and slower after a couple of seconds, so that I just get 1 response every 5 seconds.
Already tried: Firstly I thought that this problem is because I ran that on a Windows server. Well after I tried the script on my computer, I recognized that it was just a little bit faster, but not much more. On an other Linux server it was the same like on my computer (Unix, macOS).
Code: https://pastebin.com/WjLegw7K
work_dir = os.path.dirname(__file__)
async def fetch(url, session):
try:
async with session.get(url, ssl=False) as response:
if response.status == 200:
delay = response.headers.get("DELAY")
date = response.headers.get("DATE")
print("{}:{} with delay {}".format(date, response.url, delay))
return await response.read()
except Exception:
pass
async def bound_fetch(sem, url, session):
# Getter function with semaphore.
async with sem:
await fetch(url, session)
async def run():
os.chdir(work_dir)
for file in glob.glob("cdx-*"):
print("Opening: " + file)
opened_file = file
tasks = []
# create instance of Semaphore
sem = asyncio.Semaphore(40000)
with open(work_dir + '/' + file) as infile:
seen = set()
async with ClientSession() as session:
for line in infile:
regex = re.compile(r'://(.*?)/')
domain = regex.search(line).group(1)
domain = domain.lower()
if domain not in seen:
seen.add(domain)
task = asyncio.ensure_future(bound_fetch(sem, 'http://' + domain, session))
tasks.append(task)
del line
responses = asyncio.gather(*tasks)
await responses
infile.close()
del seen
del file
loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run())
loop.run_until_complete(future)
I really don't know how to fix that issue. Especially because I'm very new to Python... but I have to get it to work somehow :(
It's hard to tell what is going wrong without actually debugging the code, but one potential problem is that file processing is serialized. In other words, the code never processes the next file until all the requests from the current file have finished. If there are many files and one of them is slow, this could be a problem.
To change this, define run along these lines:
async def run():
os.chdir(work_dir)
async with ClientSession() as session:
sem = asyncio.Semaphore(40000)
seen = set()
pending_tasks = set()
for f in glob.glob("cdx-*"):
print("Opening: " + f)
with open(f) as infile:
lines = list(infile)
for line in lines:
domain = re.search(r'://(.*?)/', line).group(1)
domain = domain.lower()
if domain in seen:
continue
seen.add(domain)
task = asyncio.ensure_future(bound_fetch(sem, 'http://' + domain, session))
pending_tasks.add(task)
# ensure that each task removes itself from the pending set
# when done, so that the set doesn't grow without bounds
task.add_done_callback(pending_tasks.remove)
# await the remaining tasks
await asyncio.wait(pending_tasks)
Another important thing: silencing all exceptions in fetch() is bad practice because there is no indication that something has started going wrong (due to either a bug or a simple typo). This might well be the reason your script becomes "slow" after a while - fetch is raising exceptions and you're never seeing them. Instead of pass, use something like print(f'failed to get {url}: {e}') where e is the object you get from except Exception as e.
Several additional remarks:
There is almost never a need to del local variables in Python; the garbage collector does that automatically.
You needn't close() a file opened using a with statement. with is designed specifically to do such closing automatically for you.
The code added domains to a seen set, but also processed an already seen domain. This version skips the domain for which it had already spawned a task.
You can create a single ClientSession and use it for the entire run.

Categories

Resources