I have a scraper that requires a websocket server (I can't go into too much detail on why because of company policy) that I'm trying to turn into a template/module for easier reuse on other websites.
I have one main function that runs the server loop (ping-ponging to keep the connection alive and sending work and stop commands when necessary), and I want to turn it into a generator that asynchronously yields the HTML of scraped pages. However, I can't figure out a way to turn the server into a generator.
This is essentially the code I would want (simplified to just show the main idea, of course):
import asyncio, websockets

needsToStart = False # Setting this to true gets handled somewhere else in the script

async def run(ws):
    global needsToStart
    while True:
        data = await ws.recv()
        if data == "ping":
            await ws.send("pong")
        elif "<html" in data:
            yield data # Yielding the page data
        if needsToStart:
            await ws.send("work") # Starts the next scraping session
            needsToStart = False

generator = websockets.serve(run, 'localhost', 9999)

while True:
    html = await anext(generator)
    # Do whatever with html
This, of course, doesn't work, giving the error "TypeError: 'Serve' object is not callable". But is there any way to set up something along these lines? An alternative I could try is creating an 'intermediate' object that holds the data and that the end loop awaits, but that seems messier to me than getting this idea to work.
Thanks in advance.
I found a solution that essentially works backwards, for those in need of the same functionality: instead of yielding the data, I pass along the function that processes said data. Here's the updated example case:
import asyncio, websockets
from functools import partial

needsToStart = False # Setting this to true gets handled somewhere else in the script

def process(html):
    pass

async def run(ws, htmlFunc):
    global needsToStart
    while True:
        data = await ws.recv()
        if data == "ping":
            await ws.send("pong")
        elif "<html" in data:
            htmlFunc(data) # Processing the page data
        if needsToStart:
            await ws.send("work") # Starts the next scraping session
            needsToStart = False

func = partial(run, htmlFunc=process)
websockets.serve(func, 'localhost', 9999)
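For completeness, the 'intermediate object' alternative mentioned in the question can be built on an asyncio.Queue; here is a minimal sketch under the same simplified setup (the queue and the consume function are illustrative, not part of the original code):

import asyncio, websockets

queue = asyncio.Queue() # Holds scraped pages until the consumer is ready

async def run(ws):
    while True:
        data = await ws.recv()
        if data == "ping":
            await ws.send("pong")
        elif "<html" in data:
            await queue.put(data) # Hand the page over instead of yielding it

async def consume():
    while True:
        html = await queue.get() # Suspends until a page arrives
        # Do whatever with html

The server is still started with websockets.serve(run, 'localhost', 9999), and consume() runs as a task on the same event loop.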
I'm setting up a Python websocket client that should send and receive requests as described:
Connect to the websocket.
Send the request to get the current timestamp.
Receive back the current timestamp.
Compare times; if the times are synced, continue, if not, reply ("not_synced!").
Send the machine name (in this case it is defined in the config file).
The server responds with a timestamp in the future, when it is expecting a ping; that time is saved in the config file.
Close the connection and wait for the current time to match the time in the future!
So far, I have working functions for reading/saving strings in the config file and for comparing the received time with the current time.
The only issue I can't figure out is the communication with the server: I want to define one function that does all the communication.
When I defined the function without asyncio, I couldn't return the received message.
While using asyncio, I couldn't pass the argument into the function (the message string, that is!)
import asyncio
import websockets

async def connect(msg):
    # the opencfg function reads a file, in this case line 4 of the config file, where the url is stored
    async with websockets.connect("ws://connect.websocket.in/xnode?room_id=19210") as socket:
        await socket.send(msg)
        result = await socket.recv()
        return result

asyncio.get_event_loop().run_until_complete(connect())
def connect2(msg):
    soc = websockets.connect("ws://connect.websocket.in/xnode?room_id=19210")
    soc.send(msg)
    result = soc.recv()
    return result

print(connect2("gettime"))
If you send "gettime", you will receive back the current timestamp, and after sending "|online" you should receive back a value equal to the current timestamp + 10.
You have the websocket URL, so try it for yourself.
I changed your code to use asyncio.gather to get the return value and passed "gettime" to the function:
import asyncio
import websockets

address = "ws://connect.websocket.in/xnode?room_id=19210"

async def connect(msg):
    async with websockets.connect(address) as socket:
        await socket.send(msg)
        result = await socket.recv()
        return result

result = asyncio.get_event_loop().run_until_complete(asyncio.gather(connect("gettime")))
print(result)
Output
['1564626191']
You can reuse the code by putting it into a function definition:
def get_command(command):
    loop = asyncio.get_event_loop()
    result = loop.run_until_complete(asyncio.gather(connect(command)))
    return result

result = get_command("gettime")
print(result)
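Because connect is a coroutine, asyncio.gather can also run several commands concurrently. A hedged sketch reusing connect from above (each command opens its own connection; "|online" is just the second command the question describes):

def get_commands(*commands):
    loop = asyncio.get_event_loop()
    # gather schedules the coroutines concurrently and returns results in order
    return loop.run_until_complete(asyncio.gather(*(connect(c) for c in commands)))

results = get_commands("gettime", "|online")
print(results)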
I am downloading some information from webpages in the form
http://example.com?p=10
http://example.com?p=20
...
The point is that I don't know how many there are. At some point I will receive an error from the server, or maybe at some point I will want to stop the processing because I have enough. I want to run the requests in parallel.
import requests

def generator_query(step=10):
    i = 0
    while True: # keep producing urls; the consumer decides when to stop
        yield "http://example.com?p=%d" % i
        i += step

def task(url):
    t = requests.get(url).text
    if not t: # after the last one
        return None
    return t
I could implement it with the consumer/producer pattern and queues, but I am wondering whether it is possible to have a higher-level implementation, for example with the concurrent module.
Non-concurrent example:
results = []
for url in generator_query():
    results.append(task(url))
You could use concurrent.futures' ThreadPoolExecutor. An example of how to use it is provided here.
You'll need to break out of the example's for-loop when you're getting invalid answers from the server (the except section) or whenever you feel you have enough data (you could count valid responses in the else section, for example). A sketch of that pattern follows.
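A minimal sketch, assuming the task and generator_query functions from the question (the worker count and batch size are illustrative):

import concurrent.futures

results = []
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    # generator_query is endless, so only submit a bounded batch at a time
    futures = [executor.submit(task, url)
               for url, _ in zip(generator_query(), range(100))]
    for future in concurrent.futures.as_completed(futures):
        try:
            text = future.result()
        except Exception:
            break # the server returned an error; stop collecting
        else:
            if text is None:
                break # ran past the last page
            results.append(text)

Note that breaking out doesn't cancel futures that are already running; leaving the with block waits for them to finish.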
You could use aiohttp for this purpose:
import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def coro(step):
    url = 'https://example.com?p={}'.format(step)
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, url)
        print(html)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    tasks = [coro(i * 10) for i in range(10)]
    loop.run_until_complete(asyncio.wait(tasks))
As for the page error, you might have to figure that out yourself, since I don't know what website you're dealing with. Maybe try...except?
Note: if your Python version is higher than 3.5, it might cause an SSL certificate verification error.
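For instance, a hedged variant of fetch that turns HTTP errors into a stop signal (the None convention is illustrative):

async def fetch(session, url):
    try:
        async with session.get(url) as response:
            response.raise_for_status() # raises on 4xx/5xx statuses
            return await response.text()
    except aiohttp.ClientError as exc:
        print('stopping at {}: {}'.format(url, exc))
        return None # the caller can treat None as "no more pages"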
I am trying to create a simple active-users counter using aiohttp WebSockets and aioredis for storage. When I open a new tab in Google Chrome, my counter increments perfectly in all already-open tabs. However, when I close a tab, nothing changes in the other tabs.
I think I must be missing something in the whole async/await machinery, but I cannot find what's wrong.
Here is my app
import asyncio
import aiohttp
from aiohttp import web
import aioredis

class CounterView(web.View):
    async def get(self):
        request = self.request
        app = request.app
        ws = web.WebSocketResponse()
        app['websockets'].append(ws)
        await ws.prepare(request)

        count = int(await app['db'].incr('counter'))
        for ws in app['websockets']:
            await ws.send_json({'msg': {'count': count}})

        async for msg in ws:
            if msg.type == aiohttp.WSMsgType.TEXT:
                await ws.send_str(msg.data)
            elif msg.type == aiohttp.WSMsgType.ERROR:
                print('ws connection closed with exception %s' %
                      ws.exception())

        app['websockets'].remove(ws)
        # Execution stops here (on await app['db'] ...) and never returns
        count = int(await app['db'].decr('counter'))
        for ws in app['websockets']:
            await ws.send_json({'msg': {'count': count}})
        return ws

async def init_app(loop):
    app = web.Application(loop=loop)
    db = await aioredis.create_redis('redis://localhost', loop=loop)
    app['db'] = db
    app['websockets'] = []
    app.add_routes([
        web.get('', CounterView),
    ])
    return app

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    web.run_app(init_app(loop))
And index.html template
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
    How many people seeing this page now: <span id="counter"></span>
</body>
<script>
    window.onload = function () {
        const ws = new WebSocket('ws://localhost:8080');
        ws.onmessage = function (event) {
            const data = JSON.parse(event.data);
            let span = document.getElementById('counter');
            console.log(data.msg.count);
            span.innerHTML = data.msg.count;
        }
    };
</script>
</html>
I have also tried it in Firefox, and some really weird things happen there.
I opened two tabs and got counter = 2 in both. Then I reloaded the first one - it showed 1 while the second still showed 2. Reloading the first tab again gave 2, and after that every reload gives 2 - until I reload the second tab, where the same reload - 1 - reload - 2 cycle happens and then repeats in the first tab.
I also tried to apply this answer https://stackoverflow.com/a/48695448/6627564, but nothing changed.
Debugging shows that the code executes up to count = int(await app['db'].decr('counter')) and then jumps somewhere, never to return.
Any help is greatly appreciated. As far as I understand, the event loop SHOULD return to this coroutine after that line. Maybe the coroutine is somehow destroyed, but I haven't found any code in the library that does this.
My problem is different from what is described in Python Asyncio Websocket not detecting a disconnect on wifi but does on localhost.
First of all, my connections are all over localhost.
Secondly, the code after the async for msg in ws loop actually starts executing, and debugging shows that the ws.close() method is actually called, BUT there is a context switch on the next await and execution doesn't go any further.
I have also tried ws = web.WebSocketResponse(heartbeat=1.0) to activate ping-pong, but I can't see any messages in Dev Tools. I also added a single await ws.ping() after await ws.prepare(request), and unfortunately no messages appeared in Dev Tools either. Something is definitely going wrong here...
For anyone interested in this problem - the solution.
There are three issues in this code. Two of them are actually unrelated to asyncio.
First of all, app['websockets'] is a list, and for some reason remove(ws) fails to find the correct WebSocketResponse instance and removes another WebSocketResponse from the list.
The solution is to use a set() instead of a list for storing active websockets, because set.discard() relies on the __hash__ magic method while list.remove() uses __eq__. Unfortunately, I cannot find the implementation details for __eq__ in WebSocketResponse, but __hash__ uses the builtin id function, which guarantees correct behaviour.
Secondly, look at these lines:

ws = web.WebSocketResponse()
...

for ws in app['websockets']:
    await ws.send_json({'msg': {'count': count}})

The local variable ws is overwritten in the for loop.
The solution is to simply use another variable name for iterating, such as other_ws.
The third one is described in aiohttp's documentation under Web Handler Cancellation.
It states that on every await call the handler can be terminated if the client has dropped the connection. This is exactly the case - on the first await after a dropped connection my handler died. The solution is provided in the documentation as well; I decided to use asyncio.shield.
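A minimal sketch of that fix, assuming the handler above (the _cleanup helper is my naming, not from the original code):

async def _cleanup(app, ws):
    # Remove the socket and broadcast the decremented count
    app['websockets'].discard(ws)
    count = int(await app['db'].decr('counter'))
    for other_ws in app['websockets']:
        await other_ws.send_json({'msg': {'count': count}})

# At the end of CounterView.get, replace the unshielded clean-up with:
# await asyncio.shield(_cleanup(request.app, ws))

asyncio.shield wraps the coroutine in its own task, so even if the handler is cancelled mid-await, the clean-up keeps running on the event loop to completion.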
Situation:
I am trying to send an HTTP request to all the domains listed in a specific file I have already downloaded, and to get the destination URL I was forwarded to.
Problem: Well, I followed a tutorial, and I get far fewer responses than expected. It's around 100 responses per second, but the tutorial claims 100,000 responses per minute.
The script also gets slower and slower after a couple of seconds, until I get just one response every 5 seconds.
Already tried: At first I thought the problem was that I ran it on a Windows server. When I then tried the script on my own computer, it was only slightly faster. On another Linux server it behaved the same as on my computer (Unix, macOS).
Code: https://pastebin.com/WjLegw7K
import asyncio
import glob
import os
import re

from aiohttp import ClientSession

work_dir = os.path.dirname(__file__)

async def fetch(url, session):
    try:
        async with session.get(url, ssl=False) as response:
            if response.status == 200:
                delay = response.headers.get("DELAY")
                date = response.headers.get("DATE")
                print("{}:{} with delay {}".format(date, response.url, delay))
                return await response.read()
    except Exception:
        pass

async def bound_fetch(sem, url, session):
    # Getter function with semaphore.
    async with sem:
        await fetch(url, session)

async def run():
    os.chdir(work_dir)
    for file in glob.glob("cdx-*"):
        print("Opening: " + file)
        opened_file = file
        tasks = []
        # create instance of Semaphore
        sem = asyncio.Semaphore(40000)
        with open(work_dir + '/' + file) as infile:
            seen = set()
            async with ClientSession() as session:
                for line in infile:
                    regex = re.compile(r'://(.*?)/')
                    domain = regex.search(line).group(1)
                    domain = domain.lower()
                    if domain not in seen:
                        seen.add(domain)
                    task = asyncio.ensure_future(bound_fetch(sem, 'http://' + domain, session))
                    tasks.append(task)
                    del line
                responses = asyncio.gather(*tasks)
                await responses
            infile.close()
        del seen
        del file

loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run())
loop.run_until_complete(future)
I really don't know how to fix this issue, especially because I'm very new to Python... but I have to get it to work somehow :(
It's hard to tell what is going wrong without actually debugging the code, but one potential problem is that file processing is serialized. In other words, the code never processes the next file until all the requests from the current file have finished. If there are many files and one of them is slow, this could be a problem.
To change this, define run along these lines:
async def run():
    os.chdir(work_dir)
    async with ClientSession() as session:
        sem = asyncio.Semaphore(40000)
        seen = set()
        pending_tasks = set()
        for f in glob.glob("cdx-*"):
            print("Opening: " + f)
            with open(f) as infile:
                lines = list(infile)
            for line in lines:
                domain = re.search(r'://(.*?)/', line).group(1)
                domain = domain.lower()
                if domain in seen:
                    continue
                seen.add(domain)
                task = asyncio.ensure_future(bound_fetch(sem, 'http://' + domain, session))
                pending_tasks.add(task)
                # ensure that each task removes itself from the pending set
                # when done, so that the set doesn't grow without bounds
                task.add_done_callback(pending_tasks.remove)
        # await the remaining tasks
        await asyncio.wait(pending_tasks)
Another important thing: silencing all exceptions in fetch() is bad practice because there is no indication that something has started going wrong (due to either a bug or a simple typo). This might well be the reason your script becomes "slow" after a while - fetch is raising exceptions and you're never seeing them. Instead of pass, use something like print(f'failed to get {url}: {e}') where e is the object you get from except Exception as e.
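For instance, a hedged rewrite of fetch along those lines:

async def fetch(url, session):
    try:
        async with session.get(url, ssl=False) as response:
            if response.status == 200:
                return await response.read()
    except Exception as e:
        # surface the failure instead of silently swallowing it
        print(f'failed to get {url}: {e}')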
Several additional remarks:
There is almost never a need to del local variables in Python; the garbage collector does that automatically.
You needn't close() a file opened using a with statement. with is designed specifically to do such closing automatically for you.
The original code added domains to a seen set, but then processed the domain even if it had been seen before. This version skips any domain for which it has already spawned a task.
You can create a single ClientSession and use it for the entire run.
I'm trying to make a simple Slack bot using asyncio, largely using the example here for the asyncio part and here for the Slack bot part.
Both examples work on their own, but when I put them together it seems my loop doesn't loop: it goes through once and then dies. If info is a list of length 1, which happens when a message is typed in a chat room the bot is in, the coroutine is supposed to be triggered. (All the coroutine does right now is print the message, and if the message contains "/time", make the bot post the time in the channel it was asked in.) Keyboard interrupt also doesn't work; I have to close the command prompt every time.
Here is my code:
import asyncio
from slackclient import SlackClient
import time, datetime as dt

token = "MY TOKEN"
sc = SlackClient(token)

@asyncio.coroutine
def read_text(info):
    if 'text' in info[0]:
        print(info[0]['text'])
        if r'/time' in info[0]['text']:
            print(info)
            resp = 'The time is ' + dt.datetime.strftime(dt.datetime.now(), '%H:%M:%S')
            print(resp)
            chan = info[0]['channel']
            sc.rtm_send_message(chan, resp)

loop = asyncio.get_event_loop()
try:
    sc.rtm_connect()
    info = sc.rtm_read()
    if len(info) == 1:
        asyncio.async(read_text(info))
    loop.run_forever()
except KeyboardInterrupt:
    pass
finally:
    print('step: loop.close()')
    loop.close()
I think it's the loop part that's broken, since it never seems to get to the coroutine. So maybe a shorter way of asking this question is: what is it about my try: statement that prevents it from looping like the asyncio example I followed? Is there something about sc.rtm_connect() that it doesn't like?
I'm new to asyncio, so I'm probably doing something stupid. Is this even the best way to try and go about this? Ultimately I want the bot to do some things that take quite a while to compute, and I'd like it to remain responsive in that time, so I think I need to use asyncio or threads in some variety, but I'm open to better suggestions.
Thanks a lot,
Alex
I changed it to the following and it worked:
import asyncio
from slackclient import SlackClient
import time, datetime as dt

token = "MY TOKEN"
sc = SlackClient(token)

@asyncio.coroutine
def listen():
    yield from asyncio.sleep(1)
    x = sc.rtm_connect()
    info = sc.rtm_read()
    if len(info) == 1:
        if 'text' in info[0]:
            print(info[0]['text'])
            if r'/time' in info[0]['text']:
                print(info)
                resp = 'The time is ' + dt.datetime.strftime(dt.datetime.now(), '%H:%M:%S')
                print(resp)
                chan = info[0]['channel']
                sc.rtm_send_message(chan, resp)
    asyncio.async(listen())

loop = asyncio.get_event_loop()
try:
    asyncio.async(listen())
    loop.run_forever()
except KeyboardInterrupt:
    pass
finally:
    print('step: loop.close()')
    loop.close()
Not entirely sure why that fixes it, but the key changes were putting the sc.rtm_connect() call inside the coroutine and making it x = sc.rtm_connect(). I also have listen() schedule itself at the end, which appears to be what makes it loop forever, since the bot doesn't respond if I take that out. I don't know if this is how this sort of thing is supposed to be set up, but it does appear to keep accepting commands while it's processing earlier ones; my slack chat looks like this:
me [12:21 AM]
/time
[12:21]
/time
[12:21]
/time
[12:21]
/time
testbotBOT [12:21 AM]
The time is 00:21:11
[12:21]
The time is 00:21:14
[12:21]
The time is 00:21:16
[12:21]
The time is 00:21:19
Note that it doesn't miss any of my /time requests, which it would if it weren't doing this stuff asynchronously. Also, if anyone is trying to replicate this, you'll notice that Slack brings up the built-in command menu if you type "/". I got around this by typing a space in front.
Thanks for the help; please let me know if you know of a better way of doing this. It doesn't seem to be a very elegant solution, and the bot can't be restarted after I use a Ctrl-C keyboard interrupt to end it - it says
Task exception was never retrieved
future: <Task finished coro=<listen() done, defined at asynctest3.py:8> exception=AttributeError("'NoneType' object has no attribute 'recv'",)>
Traceback (most recent call last):
File "C:\Users\Dell-F5\AppData\Local\Programs\Python\Python35-32\Lib\asyncio\tasks.py", line 239, in _step
result = coro.send(None)
File "asynctest3.py", line 13, in listen
info = sc.rtm_read()
File "C:\Users\Dell-F5\Envs\sbot\lib\site-packages\slackclient\_client.py", line 39, in rtm_read
json_data = self.server.websocket_safe_read()
File "C:\Users\Dell-F5\Envs\sbot\lib\site-packages\slackclient\_server.py", line 110, in websocket_safe_read
data += "{0}\n".format(self.websocket.recv())
AttributeError: 'NoneType' object has no attribute 'recv'
Which I guess means it's not closing the websockets nicely. Anyway, that's just an annoyance, at least the main problem is fixed.
Alex
Making blocking IO calls inside a coroutine defeats the very purpose of using asyncio (e.g. info = sc.rtm_read()). If you don't have a choice, use loop.run_in_executor to run the blocking call in a different thread. Careful though: some extra locking might be needed.
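A minimal sketch of that approach, in the same pre-3.5 coroutine style as the question (sc.rtm_read is the blocking call from the question's code):

@asyncio.coroutine
def listen():
    loop = asyncio.get_event_loop()
    # None selects the default ThreadPoolExecutor; the blocking read
    # runs in a worker thread while the event loop stays responsive
    info = yield from loop.run_in_executor(None, sc.rtm_read)
    # ... process info as before ...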
However, it seems there are a few asyncio-based Slack client libraries you could use instead:
slacker-asyncio - fork of slacker, based on aiohttp
butterfield - based on slacker and websockets
EDIT: Butterfield uses the Slack real-time messaging API. It even provides an echo bot example that looks very much like what you're trying to achieve:
import asyncio
import butterfield
from butterfield import Bot

@asyncio.coroutine
def echo(bot, message):
    yield from bot.post(
        message['channel'],
        message['text']
    )

bot = Bot('slack-bot-key')
bot.listen(echo)
butterfield.run(bot)