Pyppeteer: Detect navigation on Page.click method - python

I have the following piece of code which loads a page and follows a link within it, using asyncio.gather as recommended in the documentation for the click method:
import asyncio
import pyppeteer
async def main(selector):
browser = await pyppeteer.launch()
page = await browser.newPage()
await page.goto("https://example.org")
result = await asyncio.gather(
page.waitForNavigation(),
page.click(selector),
)
result = next(filter(None, result))
if result and isinstance(result, pyppeteer.network_manager.Response):
print(result.status, result.url)
if __name__ == "__main__":
loop = asyncio.get_event_loop()
loop.run_until_complete(main(selector="a"))
This works wonders when the click method does trigger a navigation event:
200 https://www.iana.org/domains/reserved
But I'm trying to write generic code that could handle any selector, even those which would not cause a navigation. If I change the selector to "h1" I get the following timeout, because there is no navigation to wait for:
Traceback (most recent call last):
File "stackoverflow.py", line 20, in <module>
loop.run_until_complete(main(selector="h1"))
File "/usr/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
return future.result()
File "stackoverflow.py", line 12, in main
page.click(selector),
File "/.../lib/python3.6/site-packages/pyppeteer/page.py", line 938, in waitForNavigation
raise error
pyppeteer.errors.TimeoutError: Navigation Timeout Exceeded: 30000 ms exceeded.
I could not find a way to detect whether or not the click event would produce an additional request, in order to avoid the Page.waitForNavigation call. Is it possible? Thanks in advance for your time!

Related

Opposite of asyncio.to_thread

How do I run asynchronous function in blocking style? I am not allowed to modify signature of my mock function f1() and can't easy switch to async def and so can't use await expression.
async def cc2():
await asyncio.sleep(1)
await asyncio.to_thread(print, "qwerty")
def f1():
t = asyncio.create_task(cc2())
# wait until the task finished before returning
async def main():
f1() # typical call from many places of a program
asyncio.run(main())
I tried asyncio.get_running_loop().run_until_complete(t), but the hack does not work and I get the next error.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/usr/lib/python3.9/asyncio/base_events.py", line 642, in run_until_complete
return future.result()
File "<stdin>", line 2, in main
File "<stdin>", line 3, in f1
File "/usr/lib/python3.9/asyncio/base_events.py", line 618, in run_until_complete
self._check_running()
File "/usr/lib/python3.9/asyncio/base_events.py", line 578, in _check_running
raise RuntimeError('This event loop is already running')
RuntimeError: This event loop is already running
How do I run asynchronous function in blocking style?
If you are in sync code, you can call it with asyncio.run(async_function()). If you are in async code, you can await async_function() or asyncio.create_task(async_function()), with the latter scheduling it to run in the background. You are not allowed to use asyncio.run() (or run_until_complete, even with a newly created event loop object) inside async code because it blocks and could halt the outer event loop.
But if you need it for testing purposes, you can always do something like:
# XXX dangerous - could block the current event loop
with concurrent.futures.ThreadPoolExecutor() as pool:
pool.submit(asyncio.run, async_function()).result()

Avoid traceback after CTRL+C in async

I'm currently exploring async/await in python.
I'd like to start Pyrogram, then run the cronjob function. I created this code:
import asyncio
from pyrogram import Client, Filters, MessageHandler
import time
app = Client("pihome")
async def cronjob():
print("Cronjob started")
while True:
#print(f"time is {time.strftime('%X')}")
print(int(time.time()))
if int(time.strftime('%S')) == 10:
print("SECONDO 10") # Change this with checks in future
await asyncio.sleep(1)
async def main():
print(f"Starting {app.session_name}...\n")
# Pyrogram
await app.start()
print("Pyrogram started")
# Cronjob
loop = asyncio.new_event_loop()
loop.run_until_complete(loop.create_task(await cronjob()))
print("\nDone")
await Client.idle()
await app.stop()
print(f"\nStopping {app.session_name}...")
if __name__ == "__main__":
asyncio.get_event_loop().run_until_complete(main())
But when i want to stop it with Ctrl+C it gives me this traceback:
^CTraceback (most recent call last):
File "pihome.py", line 39, in <module>
asyncio.get_event_loop().run_until_complete(main())
File "/usr/local/lib/python3.8/asyncio/base_events.py", line 603, in run_until_complete
self.run_forever()
File "/usr/local/lib/python3.8/asyncio/base_events.py", line 570, in run_forever
self._run_once()
File "/usr/local/lib/python3.8/asyncio/base_events.py", line 1823, in _run_once
event_list = self._selector.select(timeout)
File "/usr/local/lib/python3.8/selectors.py", line 468, in select
fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt
How can I solve it? Also, try/except blocks seem not working
To run the application, it is better to use asyncio.run(main()). In my opinion, it’s more clear.
Replace loop = asyncio.new_event_loop() by loop = asyncio.get_event_loop(). Because using asyncio.new_event_loop() after asyncio.get_event_loop().run_until_complete(main()) you create a second event loop in main thread which is prohibited. Only one event loop per thread is permitted!
Remove await from this line:
loop.run_until_complete(loop.create_task(await cronjob()))
Because create_task need to Coroutine as the first argument but you pass it None

Awaiting an asyncio.Future raises concurrent.futures._base.CancelledError instead of waiting for a value/exception to be set

When I run the following python code:
import asyncio
import logging
logging.basicConfig(level=logging.DEBUG)
async def read_future(fut):
print(await fut)
async def write_future(fut):
fut.set_result('My Value')
async def main():
loop = asyncio.get_running_loop()
fut = loop.create_future()
asyncio.gather(read_future(fut), write_future(fut))
asyncio.run(main(), debug=True)
Instead of read_future waiting for the result of fut to be set, the program crashes with the following error:
DEBUG:asyncio:Using selector: KqueueSelector
ERROR:asyncio:_GatheringFuture exception was never retrieved
future: <_GatheringFuture finished exception=CancelledError() created at /Users/<user>/.pyenv/versions/3.7.4/lib/python3.7/asyncio/tasks.py:642>
source_traceback: Object created at (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/<user>/.pyenv/versions/3.7.4/lib/python3.7/asyncio/runners.py", line 43, in run
return loop.run_until_complete(main)
File "/Users/<user>/.pyenv/versions/3.7.4/lib/python3.7/asyncio/base_events.py", line 566, in run_until_complete
self.run_forever()
File "/Users/<user>/.pyenv/versions/3.7.4/lib/python3.7/asyncio/base_events.py", line 534, in run_forever
self._run_once()
File "/Users/<user>/.pyenv/versions/3.7.4/lib/python3.7/asyncio/base_events.py", line 1763, in _run_once
handle._run()
File "/Users/<user>/.pyenv/versions/3.7.4/lib/python3.7/asyncio/events.py", line 88, in _run
self._context.run(self._callback, *self._args)
File "<stdin>", line 4, in main
File "/Users/<user>/.pyenv/versions/3.7.4/lib/python3.7/asyncio/tasks.py", line 766, in gather
outer = _GatheringFuture(children, loop=loop)
File "/Users/<user>/.pyenv/versions/3.7.4/lib/python3.7/asyncio/tasks.py", line 642, in __init__
super().__init__(loop=loop)
concurrent.futures._base.CancelledError
DEBUG:asyncio:Close <_UnixSelectorEventLoop running=False closed=False debug=True>
What am I doing wrong in this code? I want to be able to await the Future fut and continue after the Future has a value/exception set.
Your problem is that asyncio.gather is itself async (returns an awaitable); by not awaiting it, you never handed control back to the event loop, nor did you store the awaitable, so it was immediately cleaned up, implicitly cancelling it, and by extension, all of the awaitables it controlled.
To fix, just make sure you await the results of the gather:
await asyncio.gather(read_future(fut), write_future(fut))
Try it online!
From https://docs.python.org/3/library/asyncio-future.html#asyncio.Future.result:
result()
Return the result of the Future.
If the Future is done and has a result set by the set_result() method, the result value is returned.
If the Future is done and has an exception set by the set_exception() method, this method raises the exception.
If the Future has been cancelled, this method raises a CancelledError exception.
If the Future’s result isn’t yet available, this method raises a InvalidStateError exception.
(Bold added.)
I'm not sure why the Future is getting cancelled, but that seems to be the cause of the issue.

How to convert lots of HTML to JSON with asyncio in Python

This is my first attempt at using asyncio in python. Objective is to convert 40000+ htmls to jsons. Using a synchronous for loop takes about 3.5 minutes. I am interested to see the performance boost using asyncio. I am using the following code:
import glob
import json
from parsel import Selector
import asyncio
import aiofiles
async def read_html(path):
async with aiofiles.open(path, 'r') as f:
html = await f.read()
return html
async def parse_job(path):
html = await read_html(path)
sel_obj = Selector(html)
jobs = dict()
jobs['some_var'] = sel_obj.xpath('some-xpath').get()
return jobs
async def write_json(path):
job = await parse_job(path)
async with aiofiles.open(file_name.replace("html","json"), "w") as f:
await f.write(job)
async def bulk_read_and_write(files):
# this function is from realpython tutorial.
# I have little understanding of whats going on with gather()
tasks = list()
for file in files:
tasks.append(write_json(file))
await asyncio.gather(*tasks)
if __name__ == "__main__":
files = glob.glob("some_folder_path/*.html")
asyncio.run(bulk_read_and_write(files))
After a few seconds of running, I am getting the following error.
Traceback (most recent call last):
File "06_extract_jobs_async.py", line 84, in <module>
asyncio.run(bulk_read_and_write(files))
File "/anaconda3/envs/py37/lib/python3.7/asyncio/runners.py", line 43, in run
return loop.run_until_complete(main)
File "/anaconda3/envs/py37/lib/python3.7/asyncio/base_events.py", line 579, in run_until_complete
return future.result()
File "06_extract_jobs_async.py", line 78, in bulk_read_and_write
await asyncio.gather(*tasks)
File "06_extract_jobs_async.py", line 68, in write_json
job = await parse_job(path)
File "06_extract_jobs_async.py", line 35, in parse_job
html = await read_html(path)
File "06_extract_jobs_async.py", line 29, in read_html
async with aiofiles.open(path, 'r') as f:
File "/anaconda3/envs/py37/lib/python3.7/site-packages/aiofiles/base.py", line 78, in __aenter__
self._obj = yield from self._coro
File "/anaconda3/envs/py37/lib/python3.7/site-packages/aiofiles/threadpool/__init__.py", line 35, in _open
f = yield from loop.run_in_executor(executor, cb)
File "/anaconda3/envs/py37/lib/python3.7/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
OSError: [Errno 24] Too many open files: '../html_output/jobs/6706538_478752_job.html'
What is going on here? Thanks in advance
You're making async calls as fast as you can, but the process of writing a file to disk is still an effectively synchronous task. Your OS can try to perform multiple writes at once, but there is a limit. By spawning async tasks as quickly as possible, you're getting a lot of results at once, meaning a huge amount of files open for writing at the same time. As your error suggests, there is a limit.
There are lots of good threads on here about limiting concurrency with asyncio, but the easiest solution is probably asyncio-pool with a reasonable size.
Try adding a limit to the number of parallel tasks:
# ...rest of code unchanged
async def write_json(path, limiter):
with limiter:
job = await parse_job(path)
async with aiofiles.open(file_name.replace("html","json"), "w") as f:
await f.write(job)
async def bulk_read_and_write(files):
limiter = asyncio.Semaphore(1000)
tasks = []
for file in files:
tasks.append(write_json(file, limiter))
await asyncio.gather(*tasks)

Why am I getting "RuntimeError: This event loop is already running"

Im working on a slack bot using the new slack 2.0 python library. I am new to python decorators and I suspect that is part of my problem.
Here is my code...
#!/opt/rh/rh-python36/root/usr/bin/python
import os
import slack
# instantiate Slack client
slack_token = os.environ['SLACK_BOT_TOKEN']
rtmclient = slack.RTMClient(token=slack_token)
webclient = slack.WebClient(token=slack_token)
# get the id of my user
bot_id = webclient.auth_test()['user_id']
print('Bot ID: {0}'.format(bot_id))
def get_user_info(user_id):
user_info = webclient.users_info(user=user_id)['ok']
return user_info
#slack.RTMClient.run_on(event='message')
def parse_message(**payload):
data = payload['data']
user_id = data['user']
print(get_user_info(user_id))
rtmclient.start()
It outputs the Bot ID(using the webclient) when started but then crashes with RuntimeError: This event loop is already running when I make another call to webclient.
[root#slackbot-01 bin]# scl enable rh-python36 /root/slackbot/bin/slackbot.py
Bot ID: UBT547D31
Traceback (most recent call last):
File "/root/slackbot/bin/slackbot.py", line 24, in <module>
rtmclient.start()
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/slack/rtm/client.py", line 197, in start
return self._event_loop.run_until_complete(future)
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/asyncio/base_events.py", line 467, in run_until_complete
return future.result()
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/slack/rtm/client.py", line 339, in _connect_and_read
await self._read_messages()
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/slack/rtm/client.py", line 390, in _read_messages
await self._dispatch_event(event, data=payload)
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/slack/rtm/client.py", line 440, in _dispatch_event
self._execute_in_thread(callback, data)
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/slack/rtm/client.py", line 465, in _execute_in_thread
future.result()
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/concurrent/futures/_base.py", line 425, in result
return self.__get_result()
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/concurrent/futures/thread.py", line 56, in run
result = self.fn(*self.args, **self.kwargs)
File "/root/slackbot/bin/slackbot.py", line 22, in parse_message
print(get_user_info(user_id))
File "/root/slackbot/bin/slackbot.py", line 15, in get_user_info
user_info = webclient.users_info(user=user_id)
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/slack/web/client.py", line 1368, in users_info
return self.api_call("users.info", http_verb="GET", params=kwargs)
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/slack/web/base_client.py", line 154, in api_call
return self._event_loop.run_until_complete(future)
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/asyncio/base_events.py", line 454, in run_until_complete
self.run_forever()
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/asyncio/base_events.py", line 408, in run_forever
raise RuntimeError('This event loop is already running')
RuntimeError: This event loop is already running
The really confusing part to me is that if I comment out the line that makes the first call to webclient.auth_test(), I have no issues at all. My call to webclient.users_info() works every time rtmclient sends me data.
#!/opt/rh/rh-python36/root/usr/bin/python
import os
import slack
# instantiate Slack client
slack_token = os.environ['SLACK_BOT_TOKEN']
rtmclient = slack.RTMClient(token=slack_token)
webclient = slack.WebClient(token=slack_token)
# get the id of my user
#bot_id = webclient.auth_test()['user_id']
#print('Bot ID: {0}'.format(bot_id))
def get_user_info(user_id):
user_info = webclient.users_info(user=user_id)['ok']
return user_info
#slack.RTMClient.run_on(event='message')
def parse_message(**payload):
data = payload['data']
user_id = data['user']
print(get_user_info(user_id))
rtmclient.start()
[root#slackbot-01 bin]# scl enable rh-python36 /root/slackbot/bin/slackbot.py
True
True
^C[root#slackbot-01 bin]#
I need to get the bot id so that I can make sure it doesnt answer it's own messages. I don't why my code doesnt work after I get the bot id outside of the parse message function with a decorator.
What am I doing wrong here?
The python event loop is a tricky thing to program libraries around and there are some issues with the way the event queue is managed in the 2.0 version of SlackClient. It looks like some improvements were made with 2.1 but it appears to be a work in progress, and I still encounter this. I'd expect there will be future updates to make it more robust.
In the meantime, the following code at the top of your file (use pip to install) usually resolves it for me:
import nest_asyncio
nest_asyncio.apply()
Keep in mind this will alter the way the rest of your application is handling the event queue, if that's a factor.
If you're using RTM, the RTMClient creates a WebClient for you. The handle for it should be getting passed to you in the payload when you handle an event. You can check your ID by looking for the 'open' event which is always dispatched after RTM successfully connects and doing the lookup inside your 'open' event handler.

Categories

Resources