Python aiohttp websockets closed browser tab handling

I am trying to create a simple active-user counter using aiohttp WebSockets and aioredis for storage. When I open a new tab in Google Chrome, the counter increments correctly in all already-open tabs. However, when I close a tab, nothing changes in the other tabs.
I think I must be missing something in the whole async/await machinery, but I cannot find what is wrong.
Here is my app:
import asyncio

import aiohttp
from aiohttp import web
import aioredis


class CounterView(web.View):
    async def get(self):
        request = self.request
        app = request.app
        ws = web.WebSocketResponse()
        app['websockets'].append(ws)
        await ws.prepare(request)

        count = int(await app['db'].incr('counter'))
        for ws in app['websockets']:
            await ws.send_json({'msg': {'count': count}})

        async for msg in ws:
            if msg.type == aiohttp.WSMsgType.TEXT:
                await ws.send_str(msg.data)
            elif msg.type == aiohttp.WSMsgType.ERROR:
                print('ws connection closed with exception %s' %
                      ws.exception())

        app['websockets'].remove(ws)
        # Execution stops here (on await app['db'] ...) and never returns
        count = int(await app['db'].decr('counter'))
        for ws in app['websockets']:
            await ws.send_json({'msg': {'count': count}})
        return ws
async def init_app(loop):
    app = web.Application(loop=loop)
    db = await aioredis.create_redis('redis://localhost', loop=loop)
    app['db'] = db
    app['websockets'] = []
    app.add_routes([
        web.get('', CounterView),
    ])
    return app


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    web.run_app(init_app(loop))
And the index.html template:
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
How many people seeing this page now: <span id="counter"></span>
</body>
<script>
    window.onload = function () {
        const ws = new WebSocket('ws://localhost:8080');
        ws.onmessage = function (event) {
            const data = JSON.parse(event.data);
            let span = document.getElementById('counter');
            console.log(data.msg.count);
            span.innerHTML = data.msg.count;
        }
    };
</script>
</html>
I have also tried Firefox, and something really weird happens there.
I opened two tabs and got counter = 2 in both. Then I reloaded the first one: it showed 1 while the second still showed 2. Reloading the first tab again gave 2, and after that every reload gave 2.
This held until I reloaded the second tab, where the same process (reload gives 1, reload again gives 2) happened and then repeated in the first tab.
I also tried to apply this answer, https://stackoverflow.com/a/48695448/6627564, but nothing changed.
Debugging shows that the code executes up to count = int(await app['db'].decr('counter')) and then jumps somewhere, never to return.
Any help is greatly appreciated. As far as I understand, the event loop SHOULD resume execution after this line. Maybe the coroutine is somehow destroyed, but I haven't found any code in the library doing this.
My problem is different from what is described in Python Asyncio Websocket not detecting a disconnect on wifi but does on localhost.
First of all, my connections are all over localhost.
Secondly, the code after the async for msg in ws loop actually starts executing, and debugging shows that the ws.close() method is actually called. BUT there is a context switch on the next await, and execution doesn't go any further.
I have also tried ws = web.WebSocketResponse(heartbeat=1.0) to activate ping-pong, but I can't see any messages in Dev Tools. I also added a single await ws.ping() after await ws.prepare(request), and unfortunately no messages appeared in Dev Tools either. Something is definitely going wrong here...

For anyone interested in this problem, here is the solution.
There are three issues in this code. Two of them are actually unrelated to asyncio.
First of all, app['websockets'] is a list, and for some reason remove(ws) fails to find the correct WebSocketResponse instance and removes a different WebSocketResponse from the list.
The solution is to use a set() instead of a list for storing active websockets. This is because set.discard() relies on the __hash__ magic method, while list.remove() relies on __eq__. Unfortunately, I could not find the implementation of __eq__ for WebSocketResponse, but __hash__ uses the builtin id() function, which guarantees correct behaviour.
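To see the mechanism in isolation, here is a generic sketch (it deliberately violates the hash/eq contract purely to demonstrate the failure mode; it is not aiohttp's actual implementation):

class Conn:
    def __eq__(self, other):
        return isinstance(other, Conn)  # every Conn compares equal
    def __hash__(self):
        return id(self)                 # hashing stays identity-based

a, b = Conn(), Conn()

conns = [a, b]
conns.remove(b)       # list.remove() uses ==, so it removes a, not b
print(conns[0] is b)  # True: the wrong object was removed

conns = {a, b}
conns.discard(b)      # the hash lookup narrows the search to b itself
print(b in conns)     # False: the right object was removed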
Secondly, look at these lines:

ws = web.WebSocketResponse()
....
......
for ws in app['websockets']:
    await ws.send_json({'msg': {'count': count}})

The local variable ws is overwritten in the for loop.
The solution is to simply use another variable name for iterating, such as other_ws.
The third one is described in aiohttp's documentation under Web Handler Cancellation.
It states that on every await call the handler can be terminated if the client has dropped the connection. This is exactly the case: on the first await after the dropped connection, my handler died. The solution is provided in the documentation as well; I decided to use asyncio.shield.
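Putting the three fixes together, here is a minimal sketch of the corrected handler (the _disconnect helper name is mine, and app['websockets'] is assumed to be initialized as set() in init_app):

import asyncio
import aiohttp
from aiohttp import web


class CounterView(web.View):
    async def get(self):
        app = self.request.app
        ws = web.WebSocketResponse()
        await ws.prepare(self.request)
        app['websockets'].add(ws)  # a set, so removal is identity-safe

        count = int(await app['db'].incr('counter'))
        for other_ws in app['websockets']:  # distinct loop variable
            await other_ws.send_json({'msg': {'count': count}})

        try:
            async for msg in ws:
                if msg.type == aiohttp.WSMsgType.TEXT:
                    await ws.send_str(msg.data)
        finally:
            # Shield the cleanup so handler cancellation on client
            # disconnect cannot abort it at the next await.
            await asyncio.shield(self._disconnect(app, ws))
        return ws

    async def _disconnect(self, app, ws):
        app['websockets'].discard(ws)
        count = int(await app['db'].decr('counter'))
        for other_ws in app['websockets']:
            await other_ws.send_json({'msg': {'count': count}})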

Related

Using a Python websocket server as an async generator

I have a scraper that requires the use of a websocket server (I can't go into too much detail on why because of company policy) that I'm trying to turn into a template/module for easier use on other websites.
I have one main function that runs the server loop (e.g. ping-pongs to keep the connection alive and sends work and stop commands when necessary) that I'm trying to turn into a generator which yields the HTML of scraped pages (asynchronously, of course). However, I can't figure out a way to turn the server into a generator.
This is essentially the code I would want (simplified to show just the main idea, of course):
import asyncio, websockets

needsToStart = False  # Setting this to true gets handled somewhere else in the script

async def run(ws):
    global needsToStart
    while True:
        data = await ws.recv()
        if data == "ping":
            await ws.send("pong")
        elif "<html" in data:
            yield data  # Yielding the page data
        if needsToStart:
            await ws.send("work")  # Starts the next scraping session
            needsToStart = False

generator = websockets.serve(run, 'localhost', 9999)
while True:
    html = await anext(generator)
    # Do whatever with html
This, of course, doesn't work, giving the error "TypeError: 'Serve' object is not callable". But is there any way to set up something along these lines? An alternative I could try is creating an 'intermediate' object that holds the data which the end loop awaits, but that seems messier to me than figuring out a way to get this idea to work.
Thanks in advance.
I found a solution that essentially works backwards, for those in need of the same functionality: instead of yielding the data, I pass along the function that processes said data. Here's the updated example case:
import asyncio, websockets
from functools import partial

needsToStart = False  # Setting this to true gets handled somewhere else in the script

def process(html):
    pass

async def run(ws, htmlFunc):
    global needsToStart
    while True:
        data = await ws.recv()
        if data == "ping":
            await ws.send("pong")
        elif "<html" in data:
            htmlFunc(data)  # Processing the page data
        if needsToStart:
            await ws.send("work")  # Starts the next scraping session
            needsToStart = False

func = partial(run, htmlFunc=process)
websockets.serve(func, 'localhost', 9999)
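Note that as posted, the last line only creates the server object; nothing drives the event loop, so the script would exit immediately. A minimal sketch of the startup, assuming a reasonably recent version of the websockets library (where serve() works as an async context manager and the handler receives just the connection):

import asyncio

async def main():
    # func is the partial defined above
    async with websockets.serve(func, 'localhost', 9999):
        await asyncio.Future()  # run until cancelled

asyncio.run(main())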

How to generate server-sent events for status change notifications in a Python web app?

I have a web app written in CherryPy: a user uploads a file, then some lengthy operation begins, passing through several stages. I want notifications for these stages to be pushed to all the connected clients. But I don't know how to communicate between processes. I guess I would have to launch the lengthy operation in a separate process, but then I don't know how to pass the "advanced to stage N" messages to the "server-sending function".
Conceptually, it would be something like this:
SSEtest.py:
from pathlib import Path
from time import sleep

import cherrypy


def lengthy_operation(name, stream):
    for stage in range(10):
        print(f'stage {stage}... ', end='')
        sleep(2)
        print('done')
    print('finished')


class SSETest():
    @cherrypy.expose
    def index(self):
        return Path('SSEtest.html').read_text()

    @cherrypy.expose
    def upload(self, file):
        name = file.filename.encode('iso-8859-1').decode('utf-8')
        lengthy_operation(name, file.file)
        return 'OK'

    @cherrypy.expose
    def stage(self):
        cherrypy.response.headers['Content-Type'] = 'text/event-stream;charset=utf-8'

        def lengthy_operation():
            for stage in range(5):
                yield f'data: stage {stage}... \n\n'
                sleep(2)
                yield 'data: done\n\n'
            yield 'data: finished\n\n'

        return lengthy_operation()

    stage._cp_config = {'response.stream': True, 'tools.encode.encoding': 'utf-8'}


cherrypy.quickstart(SSETest())
SSEtest.html:
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="utf-8">
    <title>SSE Test</title>
</head>
<body>
<h1>SSE Test</h1>
<div>
    <form id="load_file_form" action="" enctype="multipart/form-data">
        <label for="load_file">Load a file: </label>
        <input type="file" id="load_file" name="load_file">
        <progress max="100" value="0" id="progress_bar"></progress>
    </form>
</div>
<div id="status_messages">
    <h3>Stages:</h3>
</div>
<script>
    const load_file = document.getElementById('load_file');
    const progress_bar = document.getElementById('progress_bar');

    function update_progress_bar(event) {
        if (event.lengthComputable) {
            progress_bar.value = Math.round((event.loaded / event.total) * 100);
        }
    }

    load_file.onchange = function (event) {
        let the_file = load_file.files[0];
        let formData = new FormData();
        let connection = new XMLHttpRequest();
        formData.append('file', the_file, the_file.name);
        connection.open('POST', 'upload', true);
        connection.upload.onprogress = update_progress_bar;
        connection.onload = function (event) {
            if (connection.status != 200) {
                alert('Error! ' + event);
            }
        };
        connection.send(formData);
    };

    const status_messages = document.getElementById("status_messages");
    const sse = new EventSource("stage");

    sse.onopen = function (event) {
        let new_message = document.createElement("p");
        new_message.innerHTML = "Connection established: " + event.type;
        status_messages.appendChild(new_message);
    };

    sse.onmessage = function (event) {
        let new_message = document.createElement("p");
        new_message.innerHTML = event.data;
        status_messages.appendChild(new_message);
    };

    sse.onerror = function (event) {
        let new_message = document.createElement("p");
        if (event.readyState == EventSource.CLOSED) {
            new_message.innerHTML = "Connections closed";
        } else {
            new_message.innerHTML = "Error: " + event.type;
        }
        status_messages.appendChild(new_message);
    };
</script>
</body>
</html>
I need lengthy_operation() to be called only once, when the file is uploaded, and the messages it generates to be sent to all the clients. Right now it works with the local function, which is not what I want. How can I use the outer function and pass its messages into the stage() method?
I want notifications for these stages to be pushed to all the connected clients.
I suspect in the end you will want more control than that, but I will answer your question as it was asked. Later, you may want to build on the example below and filter the broadcasted notifications based on the user's session, or based on a certain starting timestamp, or some other relevant concept.
Each "connected client" is effectively hanging on a long-running request to /stage which the server will use to stream events to the client. In your example, each client will begin that request immediately and leave it open until the server terminates the stream. You can also close the stream from the client using close() on the EventSource.
Basic Solution
You asked how to have the /stage handler broadcast or mirror its events to all of the currently-connected clients. There are many ways you could accomplish this, but in a nutshell you want the lengthy_operation function to either post events to all /stage handler readers or to a persistent shared location from which all /stage handlers read. I will show a way to encapsulate the first idea described above.
Consider a generic stream event class that serializes to data: <some message>:
class StreamEvent:
    def __init__(self, message: str):
        self.message = message

    def serialize(self) -> bytes:
        return f'data: {self.message}\n\n'.encode('utf-8')
and a more specific derived case for file-related stream events:
class FileStreamEvent(StreamEvent):
    def __init__(self, message: str, name: str):
        super().__init__(message)
        self.name = name

    def serialize(self) -> bytes:
        return f'data: file: {self.name}: {self.message}\n\n'.encode('utf-8')
You can create an extremely primitive publish/subscribe type of container where /stage can then subscribe listeners and lengthy_operation() can publish StreamEvent instances to all listeners:
class StreamSource:
    def __init__(self):
        self.listeners: List[Queue] = []

    def put(self, event: StreamEvent):
        for listener in self.listeners:
            listener.put_nowait(event)

    def get(self):
        listener = Queue()
        self.listeners.append(listener)
        try:
            while True:
                event = listener.get()
                yield event.serialize()
        finally:
            self.listeners.remove(listener)
In StreamSource.get(), you likely want to create an end case (e.g. check for a "close" or "finish" event) to exit from the generic while True and you likely want to set a timeout on the blocking Queue.get() call. But for the sake of this example, I kept everything basic.
Now, lengthy_operation() just needs a reference to a StreamSource:
def lengthy_operation(events: StreamSource, name: str, stream: BinaryIO):
    for stage in range(10):
        events.put(FileStreamEvent(f'stage {stage}: begin', name))
        sleep(2)
        events.put(FileStreamEvent(f'stage {stage}: end', name))
    events.put(FileStreamEvent('finished', name))
SSETest can then provide a shared instance of StreamSource to each lengthy_operation() call and SSETest.stage() can use StreamSource.get() to register a listener on this shared instance:
class SSETest:
    _stream_source: StreamSource = StreamSource()

    @cherrypy.expose
    def index(self):
        return Path('SSETest.html').read_text()

    @cherrypy.expose
    def upload(self, file):
        name = file.filename.encode('iso-8859-1').decode('utf-8')
        lengthy_operation(self._stream_source, name, file.file)
        return 'OK'

    @cherrypy.expose
    def stage(self):
        cherrypy.response.headers['Cache-Control'] = 'no-cache'
        cherrypy.response.headers['Content-Type'] = 'text/event-stream'

        def stream():
            yield from self._stream_source.get()

        return stream()

    stage._cp_config = {'response.stream': True}
This is a complete[1] example of how to resolve your immediate question but you will most likely want to adapt this as you work closer to the final user experience you probably have in mind.
[1]: I left out the imports for readability, so here they are:
from dataclasses import dataclass
from pathlib import Path
from queue import Queue
from time import sleep
from typing import BinaryIO, List
import cherrypy
Follow-on Exit Conditions
Since you are using cherrypy.quickstart(), in the minimal viable solution above you will have to forcefully exit the SSETest service as I did not assume any graceful "stop" behaviors for you. The first solution explicitly points this out but offers no solution for the sake of readability.
Let's look at a couple ways to provide some initial graceful "stop" conditions:
Add a stop condition to StreamSource
First, at least add a reasonable stop condition to StreamSource. For instance, add a running attribute that allows the StreamSource.get() while loop to exit gracefully. Next, set a reasonable Queue.get() timeout so the loop can periodically test this running attribute between processing messages. Next, ensure at least some relevant CherryPy bus messages trigger this stop behavior. Below, I have rolled all of this behavior into the StreamSource class but you could also register a separate application level CherryPy plugin to handle calling into StreamSource.stop() rather than making StreamSource a plugin. I will demonstrate what that looks like when I add a separate signal handler.
# In addition to the earlier imports, this version needs:
#   from queue import Empty, Queue
#   from cherrypy.process import plugins, wspbus

class StreamSource(plugins.SimplePlugin):
    def __init__(self, bus: wspbus.Bus):
        super().__init__(bus)
        self.subscribe()
        self.running = True
        self.listeners: List[Queue] = []

    def graceful(self):
        self.stop()

    def exit(self):
        self.stop()

    def stop(self):
        self.running = False

    def put(self, event: StreamEvent):
        for listener in self.listeners:
            listener.put_nowait(event)

    def get(self):
        listener = Queue()
        self.listeners.append(listener)
        try:
            while self.running:
                try:
                    event = listener.get(timeout=1.0)
                    yield event.serialize()
                except Empty:
                    pass
        finally:
            self.listeners.remove(listener)
Now, SSETest will need to initialize StreamSource with a bus value since the class is now a SimplePlugin:
_stream_source: StreamSource = StreamSource(cherrypy.engine)
You will find that this solution gets you much closer to the user experience you likely want. Issue a keyboard interrupt and CherryPy will begin stopping the system; however, the first graceful keyboard interrupt will not publish a stop message, and for that you need to send a second keyboard interrupt.
Add a SIGINT handler to capture keyboard interrupts
Due to the way cherrypy.quickstart works with signal handlers, you may then want to register a SIGINT handler as a CherryPy-compatible SignalHandler plugin to gracefully stop the StreamSource at the first keyboard interrupt.
Here is an example:
class SignalHandler(plugins.SignalHandler):
    def __init__(self, bus: wspbus.Bus, sse):
        super().__init__(bus)
        self.handlers = {
            'SIGINT': self.handle_SIGINT,
        }
        self.sse = sse

    def handle_SIGINT(self):
        self.sse.stop()
        raise KeyboardInterrupt()
Note that in this case I am demonstrating a generic application-level handler, which you can then configure and initialize by altering your startup cherrypy.quickstart() logic as follows:
sse = SSETest()
SignalHandler(cherrypy.engine, sse).subscribe()
cherrypy.quickstart(sse)
For this example, I expose a generic application SSETest.stop method to encapsulate the desired behavior:
class SSETest:
    _stream_source: StreamSource = StreamSource(cherrypy.engine)

    def stop(self):
        self._stream_source.stop()
Wrap-up analysis
I am not a CherryPy user and I only started looking at it for the first time yesterday just to answer your question, so I will leave "CherryPy best practices" up to your discretion.
In reality, your problem is a very generic combination of the following Python questions:
how can I implement a simple publish/subscribe pattern? (answered with Queue);
how can I create an exit condition for the subscriber loop? (answered with Queue.get()'s timeout parameter and a running attribute)
how can I influence the exit condition with keyboard interrupts? (answered with a CherryPy-specific signal handler, but this merely sits on top of concepts you will find in Python's built-in signal module)
You can solve all of these questions in many ways and some lean more toward generic "Pythonic" solutions (my preference where it makes sense) while others leverage CherryPy-centric concepts (and that makes sense in cases where you want to augment CherryPy behavior rather than rewrite or break it).
As an example, you could use CherryPy bus messages to convey stream messages, but to me that entangles your application logic a bit too much in CherryPy-specific features, so I would probably find a middle ground where you handle your application features generically (so as not to tie yourself to CherryPy) as seen in how my StreamSource example uses a standard Python Queue pattern. You could choose to make StreamSource a plugin so that it can respond to certain CherryPy bus messages directly (as I show above), or you could have a separate plugin that knows to call into the relevant application-specific domains such as StreamSource.stop() (similar to what I show with SignalHandler).
Last, all of your questions are great, but they have all likely been answered before on SO as generic Python questions, so while I am tying the answers here to your CherryPy problem space I also want to help you (and future readers) realize how to think about these particular problems more abstractly beyond CherryPy.

Faust example of publishing to a kafka topic

I'm curious about how you are supposed to express that you want a message delivered to a Kafka topic in Faust. The example in their readme doesn't seem to write to a topic:
import faust


class Greeting(faust.Record):
    from_name: str
    to_name: str


app = faust.App('hello-app', broker='kafka://localhost')
topic = app.topic('hello-topic', value_type=Greeting)


@app.agent(topic)
async def hello(greetings):
    async for greeting in greetings:
        print(f'Hello from {greeting.from_name} to {greeting.to_name}')


@app.timer(interval=1.0)
async def example_sender(app):
    await hello.send(
        value=Greeting(from_name='Faust', to_name='you'),
    )


if __name__ == '__main__':
    app.main()
I would expect hello.send in the above code to publish a message to the topic, but it doesn't appear to.
There are many examples of reading from topics, and many examples of using the cli to push an ad-hoc message. After combing through the docs, I don't see any clear examples of publishing to topics in code. Am I just being crazy and the above code should work?
You can use sink to tell Faust where to deliver the results of an agent function. You can also use multiple topics as sinks at once if you want.
@app.agent(topic_to_read_from, sink=[destination_topic])
async def fetch(records):
    async for record in records:
        result = do_something(record)
        yield result
The send() function is the correct one to call to write to topics. You can even specify a particular partition, just like the equivalent Java API call.
Here is the reference for the send() method:
https://faust.readthedocs.io/en/latest/reference/faust.topics.html#faust.topics.Topic.send
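As a brief sketch of that (the app and topic names here are illustrative, not from the question): key and partition are both optional keyword arguments of Topic.send(), and an explicit partition overrides key hashing.

import faust

app = faust.App('producer-app', broker='kafka://localhost')
destination = app.topic('hello-topic')  # illustrative topic name


@app.timer(interval=5.0)
async def send_with_partition(app):
    # An explicit partition pins the record instead of hashing the key.
    await destination.send(key=b'user-1', value=b'hi', partition=0)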
If you want a Faust producer only (not combined with a consumer/sink), the original question actually has the right bit of code, here's a fully functional script that publishes messages to a 'faust_test' Kafka topic that is consumable by any Kafka/Faust consumer.
Run the code below like this: python faust_producer.py worker
"""Simple Faust Producer"""
import faust
if __name__ == '__main__':
"""Simple Faust Producer"""
# Create the Faust App
app = faust.App('faust_test_app', broker='localhost:9092')
topic = app.topic('faust_test')
# Send messages
#app.timer(interval=1.0)
async def send_message(message):
await topic.send(value='my message')
# Start the Faust App
app.main()
So we just ran into the need to send a message to a topic other than the sink topics.
The easiest way we found was: foo = await my_topic.send_soon(value="wtfm8").
You can also use send directly, driving it with the asyncio event loop:

loop = asyncio.get_event_loop()
loop.run_until_complete(ttopic.send(value="wtfm8??"))
Don't know how relevant this is anymore, but I came across this issue when trying to learn Faust. From what I read, here is what is happening:
topic = app.topic('hello-topic', value_type=Greeting)
The misconception here is that the topic you have created is the topic you are trying to consume/read from. The topic you created currently does not do anything.
await hello.send(
    value=Greeting(from_name='Faust', to_name='you'),
)
This essentially creates an intermediate kstream which sends the values to your hello(greetings) function. def hello(...) will be called when there is a new message on the stream and will process the message that is being sent.
@app.agent(topic)
async def hello(greetings):
    async for greeting in greetings:
        print(f'Hello from {greeting.from_name} to {greeting.to_name}')
This receives the Kafka stream from hello.send(...) and simply prints it to the console (no output to the 'topic' created). This is where you can send a message to a new topic, so instead of printing you can do:

await output_topic.send(value="my message!")
Alternatively, here is what you are doing:

example_sender() sends a message to hello(...) (through an intermediate kstream)
hello(...) picks up the message and prints it
NOTICE: no sending of messages to the correct topic

Here is what you can do:

example_sender() sends a message to hello(...) (through an intermediate kstream)
hello(...) picks up the message and prints it
hello(...) ALSO sends a new message to the topic created (assuming you are trying to transform the original data)
app = faust.App('hello-app', broker='kafka://localhost')
topic = app.topic('hello-topic', value_type=Greeting)
output_topic = app.topic('test_output_faust', value_type=str)


@app.agent(topic)
async def hello(greetings):
    async for greeting in greetings:
        new_message = f'Hello from {greeting.from_name} to {greeting.to_name}'
        print(new_message)
        await output_topic.send(value=new_message)
I found a solution for sending data to Kafka topics using Faust, but I don't really understand how it works.
There are several methods for this in Faust: send(), cast(), ask_nowait(), ask(). In the documentation they are called RPC operations.
After creating the sending task, you need to run the Faust application in Client-Only Mode (start_client(), maybe_start_client()).
The following code (the produce() function) demonstrates their application (pay attention to the comments):
import asyncio

import faust


class Greeting(faust.Record):
    from_name: str
    to_name: str


app = faust.App('hello-app', broker='kafka://localhost')
topic = app.topic('hello-topic', value_type=Greeting)
result_topic = app.topic('result-topic', value_type=str)


@app.agent(topic)
async def hello(greetings):
    async for greeting in greetings:
        s = f'Hello from {greeting.from_name} to {greeting.to_name}'
        print(s)
        yield s


async def produce(to_name):
    # send - universal method for sending data to a topic
    await hello.send(value=Greeting(from_name='SEND', to_name=to_name), force=True)
    await app.maybe_start_client()
    print('SEND')

    # cast - allows you to send data without waiting for a response from the agent
    await hello.cast(value=Greeting(from_name='CAST', to_name=to_name))
    await app.maybe_start_client()
    print('CAST')

    # ask_nowait - it seems to be similar to cast
    p = await hello.ask_nowait(
        value=Greeting(from_name='ASK_NOWAIT', to_name=to_name),
        force=True,
        reply_to=result_topic
    )
    # without this line, ask_nowait will not work; taken from the ask implementation
    await app._reply_consumer.add(p.correlation_id, p)
    await app.maybe_start_client()
    print(f'ASK_NOWAIT: {p.correlation_id}')

    # ask - blocks the execution flow
    # p = await hello.ask(value=Greeting(from_name='ASK', to_name=to_name), reply_to=result_topic)
    # print(f'ASK: {p.correlation_id}')


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(produce('Faust'))
Start the Faust worker with the command faust -A <example> worker.
Then we can launch the client part of the application and check that everything is working: python <example.py>
<example.py> output:
SEND
CAST
ASK_NOWAIT: bbbe6795-5a99-40e5-a7ad-a9af544efd55
It is worth noting that you will also see a traceback of some error that occurred after delivery; it does not seem to interfere with the program.
Faust worker output:
[2022-07-19 12:06:27,959] [1140] [WARNING] Hello from SEND to Faust
[2022-07-19 12:06:27,960] [1140] [WARNING] Hello from CAST to Faust
[2022-07-19 12:06:27,962] [1140] [WARNING] Hello from ASK_NOWAIT to Faust
I don't understand why it works this way, why it's so complicated, and why so little is written about it in the documentation 😓.

Python aiohttp (with asyncio) sends requests very slowly

Situation:
I am trying to send an HTTP request to all domains listed in a specific file I already downloaded, and get the destination URL I was forwarded to.
Problem: Well, I followed a tutorial and I get far fewer responses than expected. I get around 100 responses per second, but the tutorial claims 100,000 responses per minute.
The script also gets slower and slower after a couple of seconds, to the point where I get just one response every 5 seconds.
Already tried: At first I thought the problem was that I was running it on a Windows server. When I tried the script on my own computer, it was only a little faster, but not by much. On another Linux server it behaved the same as on my computer (Unix, macOS).
Code: https://pastebin.com/WjLegw7K
import asyncio
import glob
import os
import re

from aiohttp import ClientSession

work_dir = os.path.dirname(__file__)


async def fetch(url, session):
    try:
        async with session.get(url, ssl=False) as response:
            if response.status == 200:
                delay = response.headers.get("DELAY")
                date = response.headers.get("DATE")
                print("{}:{} with delay {}".format(date, response.url, delay))
                return await response.read()
    except Exception:
        pass


async def bound_fetch(sem, url, session):
    # Getter function with semaphore.
    async with sem:
        await fetch(url, session)


async def run():
    os.chdir(work_dir)
    for file in glob.glob("cdx-*"):
        print("Opening: " + file)
        opened_file = file
        tasks = []
        # create instance of Semaphore
        sem = asyncio.Semaphore(40000)
        with open(work_dir + '/' + file) as infile:
            seen = set()
            async with ClientSession() as session:
                for line in infile:
                    regex = re.compile(r'://(.*?)/')
                    domain = regex.search(line).group(1)
                    domain = domain.lower()
                    if domain not in seen:
                        seen.add(domain)
                        task = asyncio.ensure_future(bound_fetch(sem, 'http://' + domain, session))
                        tasks.append(task)
                    del line
                responses = asyncio.gather(*tasks)
                await responses
            infile.close()
        del seen
        del file


loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run())
loop.run_until_complete(future)
I really don't know how to fix that issue. Especially because I'm very new to Python... but I have to get it to work somehow :(
It's hard to tell what is going wrong without actually debugging the code, but one potential problem is that file processing is serialized. In other words, the code never processes the next file until all the requests from the current file have finished. If there are many files and one of them is slow, this could be a problem.
To change this, define run along these lines:
async def run():
    os.chdir(work_dir)
    async with ClientSession() as session:
        sem = asyncio.Semaphore(40000)
        seen = set()
        pending_tasks = set()
        for f in glob.glob("cdx-*"):
            print("Opening: " + f)
            with open(f) as infile:
                lines = list(infile)
            for line in lines:
                domain = re.search(r'://(.*?)/', line).group(1)
                domain = domain.lower()
                if domain in seen:
                    continue
                seen.add(domain)
                task = asyncio.ensure_future(bound_fetch(sem, 'http://' + domain, session))
                pending_tasks.add(task)
                # ensure that each task removes itself from the pending set
                # when done, so that the set doesn't grow without bounds
                task.add_done_callback(pending_tasks.remove)
        # await the remaining tasks
        await asyncio.wait(pending_tasks)
Another important thing: silencing all exceptions in fetch() is bad practice because there is no indication that something has started going wrong (due to either a bug or a simple typo). This might well be the reason your script becomes "slow" after a while - fetch is raising exceptions and you're never seeing them. Instead of pass, use something like print(f'failed to get {url}: {e}') where e is the object you get from except Exception as e.
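For example, a lightly modified fetch() that surfaces failures (same structure as the original, with the exception reported instead of swallowed):

async def fetch(url, session):
    try:
        async with session.get(url, ssl=False) as response:
            if response.status == 200:
                return await response.read()
    except Exception as e:
        # Make failures visible instead of silently discarding them.
        print(f'failed to get {url}: {e!r}')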
Several additional remarks:
There is almost never a need to del local variables in Python; the garbage collector does that automatically.
You needn't close() a file opened using a with statement. with is designed specifically to do such closing automatically for you.
The code added domains to a seen set, but also processed an already seen domain. This version skips the domain for which it had already spawned a task.
You can create a single ClientSession and use it for the entire run.

Python asyncio with Slack bot

I'm trying to make a simple Slack bot using asyncio, largely using the example here for the asyncio part and here for the Slack bot part.
Both examples work on their own, but when I put them together it seems my loop doesn't loop: it goes through once and then dies. If info is a list of length 1, which happens when a message is typed in a chat room the bot is in, the coroutine is supposed to be triggered, but it never is. (All the coroutine is trying to do right now is print the message, and, if the message contains "/time", get the bot to print the time in the chat room where it was asked.) Keyboard interrupt also doesn't work; I have to close the command prompt every time.
Here is my code:
import asyncio
from slackclient import SlackClient
import time, datetime as dt

token = "MY TOKEN"
sc = SlackClient(token)


@asyncio.coroutine
def read_text(info):
    if 'text' in info[0]:
        print(info[0]['text'])
        if r'/time' in info[0]['text']:
            print(info)
            resp = 'The time is ' + dt.datetime.strftime(dt.datetime.now(), '%H:%M:%S')
            print(resp)
            chan = info[0]['channel']
            sc.rtm_send_message(chan, resp)


loop = asyncio.get_event_loop()
try:
    sc.rtm_connect()
    info = sc.rtm_read()
    if len(info) == 1:
        asyncio.async(read_text(info))
    loop.run_forever()
except KeyboardInterrupt:
    pass
finally:
    print('step: loop.close()')
    loop.close()
I think it's the loop part that's broken, since it never seems to get to the coroutine. So maybe a shorter way of asking this question is: what is it about my try: statement that prevents it from looping like the asyncio example I followed? Is there something about sc.rtm_connect() that it doesn't like?
I'm new to asyncio, so I'm probably doing something stupid. Is this even the best way to try and go about this? Ultimately I want the bot to do some things that take quite a while to compute, and I'd like it to remain responsive in that time, so I think I need to use asyncio or threads in some variety, but I'm open to better suggestions.
Thanks a lot,
Alex
I changed it to the following and it worked:
import asyncio
from slackclient import SlackClient
import time, datetime as dt
token = "MY TOKEN"
sc = SlackClient(token)
#asyncio.coroutine
def listen():
yield from asyncio.sleep(1)
x = sc.rtm_connect()
info = sc.rtm_read()
if len(info) == 1:
if 'text' in info[0]:
print(info[0]['text'])
if r'/time' in info[0]['text']:
print(info)
resp = 'The time is ' + dt.datetime.strftime(dt.datetime.now(),'%H:%M:%S')
print(resp)
chan = info[0]['channel']
sc.rtm_send_message(chan, resp)
asyncio.async(listen())
loop = asyncio.get_event_loop()
try:
asyncio.async(listen())
loop.run_forever()
except KeyboardInterrupt:
pass
finally:
print('step: loop.close()')
loop.close()
Not entirely sure why that fixes it, but the key things I changed were putting the sc.rtm_connect() call in the coroutine and making it x = sc.rtm_connect(). I also call the listen() function from itself at the end, which appears to be what makes it loop forever, since the bot doesn't respond if I take it out. I don't know if this is how this sort of thing is supposed to be set up, but it does appear to keep accepting commands while processing earlier ones; my slack chat looks like this:
me [12:21 AM]
/time
[12:21]
/time
[12:21]
/time
[12:21]
/time
testbotBOT [12:21 AM]
The time is 00:21:11
[12:21]
The time is 00:21:14
[12:21]
The time is 00:21:16
[12:21]
The time is 00:21:19
Note that it doesn't miss any of my /time requests, which it would if it weren't doing this stuff asynchronously. Also, if anyone is trying to replicate this you'll notice that slack brings up the built in command menu if you type "/". I got around this by typing a space in front.
Thanks for the help; please let me know if you know of a better way of doing this. It doesn't seem to be a very elegant solution, and the bot can't be restarted after I use a Ctrl-C keyboard interrupt to end it - it says
Task exception was never retrieved
future: <Task finished coro=<listen() done, defined at asynctest3.py:8> exception=AttributeError("'NoneType' object has no attribute 'recv'",)>
Traceback (most recent call last):
File "C:\Users\Dell-F5\AppData\Local\Programs\Python\Python35-32\Lib\asyncio\tasks.py", line 239, in _step
result = coro.send(None)
File "asynctest3.py", line 13, in listen
info = sc.rtm_read()
File "C:\Users\Dell-F5\Envs\sbot\lib\site-packages\slackclient\_client.py", line 39, in rtm_read
json_data = self.server.websocket_safe_read()
File "C:\Users\Dell-F5\Envs\sbot\lib\site-packages\slackclient\_server.py", line 110, in websocket_safe_read
data += "{0}\n".format(self.websocket.recv())
AttributeError: 'NoneType' object has no attribute 'recv'
Which I guess means it's not closing the websockets nicely. Anyway, that's just an annoyance, at least the main problem is fixed.
Alex
Making blocking IO calls inside a coroutine defeats the very purpose of using asyncio (e.g. info = sc.rtm_read()). If you don't have a choice, use loop.run_in_executor to run the blocking call in a different thread. Careful though, some extra locking might be needed.
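For illustration, a rough sketch of that approach in the question's Python 3.5 style (the names mirror the question's code; the polling interval is an assumption, not a slackclient requirement):

@asyncio.coroutine
def listen(loop, sc):
    while True:
        # Run the blocking read in the default thread pool executor so
        # the event loop stays free to schedule other coroutines.
        info = yield from loop.run_in_executor(None, sc.rtm_read)
        if info and 'text' in info[0]:
            print(info[0]['text'])
        yield from asyncio.sleep(0.5)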
However, it seems there's a few asyncio-based slack client libraries you could use instead:
slacker-asyncio - fork of slacker, based on aiohttp
butterfield - based on slacker and websockets
EDIT: Butterfield uses the Slack real-time messaging API. It even provides an echo bot example that looks very much like what you're trying to achieve:
import asyncio
from butterfield import Bot


@asyncio.coroutine
def echo(bot, message):
    yield from bot.post(
        message['channel'],
        message['text']
    )


bot = Bot('slack-bot-key')
bot.listen(echo)
butterfield.run(bot)
