This is my first attempt at using asyncio in Python. The objective is to convert 40,000+ HTML files to JSON. Using a synchronous for loop takes about 3.5 minutes, and I am interested to see the performance boost from asyncio. I am using the following code:
import glob
import json
from parsel import Selector
import asyncio
import aiofiles

async def read_html(path):
    async with aiofiles.open(path, 'r') as f:
        html = await f.read()
    return html

async def parse_job(path):
    html = await read_html(path)
    sel_obj = Selector(html)
    jobs = dict()
    jobs['some_var'] = sel_obj.xpath('some-xpath').get()
    return jobs

async def write_json(path):
    job = await parse_job(path)
    async with aiofiles.open(path.replace("html", "json"), "w") as f:
        await f.write(json.dumps(job))

async def bulk_read_and_write(files):
    # this function is from a Real Python tutorial.
    # I have little understanding of what's going on with gather()
    tasks = list()
    for file in files:
        tasks.append(write_json(file))
    await asyncio.gather(*tasks)

if __name__ == "__main__":
    files = glob.glob("some_folder_path/*.html")
    asyncio.run(bulk_read_and_write(files))
After a few seconds of running, I get the following error:
Traceback (most recent call last):
  File "06_extract_jobs_async.py", line 84, in <module>
    asyncio.run(bulk_read_and_write(files))
  File "/anaconda3/envs/py37/lib/python3.7/asyncio/runners.py", line 43, in run
    return loop.run_until_complete(main)
  File "/anaconda3/envs/py37/lib/python3.7/asyncio/base_events.py", line 579, in run_until_complete
    return future.result()
  File "06_extract_jobs_async.py", line 78, in bulk_read_and_write
    await asyncio.gather(*tasks)
  File "06_extract_jobs_async.py", line 68, in write_json
    job = await parse_job(path)
  File "06_extract_jobs_async.py", line 35, in parse_job
    html = await read_html(path)
  File "06_extract_jobs_async.py", line 29, in read_html
    async with aiofiles.open(path, 'r') as f:
  File "/anaconda3/envs/py37/lib/python3.7/site-packages/aiofiles/base.py", line 78, in __aenter__
    self._obj = yield from self._coro
  File "/anaconda3/envs/py37/lib/python3.7/site-packages/aiofiles/threadpool/__init__.py", line 35, in _open
    f = yield from loop.run_in_executor(executor, cb)
  File "/anaconda3/envs/py37/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
OSError: [Errno 24] Too many open files: '../html_output/jobs/6706538_478752_job.html'
What is going on here? Thanks in advance.
You're making async calls as fast as you can, but the process of writing a file to disk is still an effectively synchronous task. Your OS can try to perform multiple writes at once, but there is a limit. By spawning async tasks as quickly as possible, you're getting a lot of results at once, meaning a huge number of files open for writing at the same time. As your error suggests, there is a limit.
There are lots of good threads on here about limiting concurrency with asyncio, but the easiest solution is probably asyncio-pool with a reasonable size.
Try adding a limit to the number of parallel tasks:
# ...rest of code unchanged

async def write_json(path, limiter):
    async with limiter:
        job = await parse_job(path)
        async with aiofiles.open(path.replace("html", "json"), "w") as f:
            await f.write(json.dumps(job))

async def bulk_read_and_write(files):
    limiter = asyncio.Semaphore(1000)
    tasks = []
    for file in files:
        tasks.append(write_json(file, limiter))
    await asyncio.gather(*tasks)
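If you prefer the asyncio-pool route mentioned above, here is a minimal sketch; I'm assuming the asyncio_pool package's AioPool.map API, reusing the original one-argument write_json, and the pool size is illustrative:

from asyncio_pool import AioPool

async def bulk_read_and_write(files):
    # the pool never runs more than 1000 write_json coroutines at once
    pool = AioPool(size=1000)
    await pool.map(write_json, files)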
Related
async def load_abit_class(message: types.Message, state: FSMContext):
    async with state.proxy() as abit_data:
        abit_data['class'] = int(message.text)
    async with open('''D:/Code/data.json''', 'w') as file:
        json.dump(abit_data, file)
    await state.finish()
That's my code. I am making a Telegram bot, and I need to write the abit_data dictionary to data.json. But the bot is going to be used by many people, so I have to write that data in a way that avoids saving conflicts caused by two people using the bot at the same moment. I tried to use async, but I got this traceback:
Task exception was never retrieved
future: <Task finished name='Task-31' coro=<Dispatcher._process_polling_updates() done, defined at C:\Users\vvvpe\Desktop\Connect\venv_2\lib\site-packages\aiogram\dispatcher\dispatcher.py:407> exception=AttributeError('__aenter__')>
Traceback (most recent call last):
  File "C:\Users\vvvpe\Desktop\Connect\venv_2\lib\site-packages\aiogram\dispatcher\dispatcher.py", line 415, in _process_polling_updates
    for responses in itertools.chain.from_iterable(await self.process_updates(updates, fast)):
  File "C:\Users\vvvpe\Desktop\Connect\venv_2\lib\site-packages\aiogram\dispatcher\dispatcher.py", line 235, in process_updates
    return await asyncio.gather(*tasks)
  File "C:\Users\vvvpe\Desktop\Connect\venv_2\lib\site-packages\aiogram\dispatcher\handler.py", line 116, in notify
    response = await handler_obj.handler(*args, **partial_data)
  File "C:\Users\vvvpe\Desktop\Connect\venv_2\lib\site-packages\aiogram\dispatcher\dispatcher.py", line 256, in process_update
    return await self.message_handlers.notify(update.message)
  File "C:\Users\vvvpe\Desktop\Connect\venv_2\lib\site-packages\aiogram\dispatcher\handler.py", line 116, in notify
    response = await handler_obj.handler(*args, **partial_data)
  File "C:\Users\vvvpe\Desktop\Connect\handlers\General.py", line 77, in load_abit_class
    async with open('''D:/Code/data.json''', 'w') as file:
AttributeError: __aenter__
Help, please!
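For context, the AttributeError: __aenter__ happens because the builtin open() is not an async context manager, so it cannot be used with async with. Below is a minimal sketch of one way to address both that and the save-conflict concern, assuming the aiofiles package is acceptable; save_abit_data is a hypothetical helper name, and an asyncio.Lock serializes the writers:

import asyncio
import json
import aiofiles

file_lock = asyncio.Lock()  # only one coroutine may write at a time

async def save_abit_data(abit_data: dict) -> None:
    async with file_lock:
        async with aiofiles.open('D:/Code/data.json', 'w') as file:
            # aiofiles exposes no json.dump shortcut, so serialize first
            await file.write(json.dumps(abit_data))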
How can I read and write in parallel from an asyncio reader/writer pair provided by asyncio.open_connection?
I tried asyncio.open_connection on two different loops, like this:
async def read():
    reader, writer = await asyncio.open_connection('127.0.0.1', 5454)
    while True:
        readval = await reader.readline()
        print(f"read {readval}")

async def write():
    reader, writer = await asyncio.open_connection('127.0.0.1', 5454)
    while True:
        self.wsem.acquire()
        msg = self.wq.popleft()
        print("Writing " + msg)
        writer.write(msg.encode())
        await writer.drain()

threading.Thread(target=lambda: asyncio.run(write())).start()
threading.Thread(target=lambda: asyncio.run(read())).start()
but it seems like the write thread sometimes drains the read content away from the read thread, and it doesn't work well.
Then I tried sharing the reader and writer between the two loops, but it throws an exception:
Exception in thread Thread-8:
Traceback (most recent call last):
  File "C:\Users\Lenovo\AppData\Local\Programs\Python\Python39\lib\threading.py", line 954, in _bootstrap_inner
    self.run()
  File "C:\Users\Lenovo\AppData\Local\Programs\Python\Python39\lib\threading.py", line 892, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\Lenovo\PycharmProjects\testape-adb-adapter\adapter\device_socket.py", line 89, in <lambda>
    threading.Thread(target=lambda: asyncio.run(read())).start()
  File "C:\Users\Lenovo\AppData\Local\Programs\Python\Python39\lib\asyncio\runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "C:\Users\Lenovo\AppData\Local\Programs\Python\Python39\lib\asyncio\base_events.py", line 642, in run_until_complete
    return future.result()
  File "C:\Users\Lenovo\PycharmProjects\testape-adb-adapter\adapter\device_socket.py", line 72, in read
    readval = await self.reader.readline()
  File "C:\Users\Lenovo\AppData\Local\Programs\Python\Python39\lib\asyncio\streams.py", line 540, in readline
    line = await self.readuntil(sep)
  File "C:\Users\Lenovo\AppData\Local\Programs\Python\Python39\lib\asyncio\streams.py", line 632, in readuntil
    await self._wait_for_data('readuntil')
  File "C:\Users\Lenovo\AppData\Local\Programs\Python\Python39\lib\asyncio\streams.py", line 517, in _wait_for_data
    await self._waiter
RuntimeError: Task <Task pending name='Task-2' coro=<DeviceSocket.connect.<locals>.read() running at C:\Users\Lenovo\PycharmProjects\testape-adb-adapter\adapter\device_socket.py:72> cb=[_run_until_complete_cb() at C:\Users\Lenovo\AppData\Local\Programs\Python\Python39\lib\asyncio\base_events.py:184]> got Future <Future pending> attached to a different loop
async def read():
    connection_opened.wait()
    while True:
        print(f"reading")
        if self.reader.at_eof():
            continue
        readval = await self.reader.readline()
        print(f"read {readval}")
        self.rq.append(readval.decode())
        self.rsem.release(1)

async def write():
    self.reader, self.writer = await asyncio.open_connection('127.0.0.1', 5454)
    connection_opened.set()
    while True:
        self.wsem.acquire()
        msg = self.wq.popleft()
        print("Writing " + msg)
        self.writer.write(msg.encode())
        await self.writer.drain()

connection_opened = threading.Event()
threading.Thread(target=lambda: asyncio.run(write())).start()
threading.Thread(target=lambda: asyncio.run(read())).start()
I thought this would be a simple and rather common use case. What is the proper way to do it?
I suggest you change the thread functions to run as tasks on a single event loop, something like this:

t = loop.create_task(write())

and end with:

loop.run_until_complete(t)

Because the self.wq and self.wsem objects are missing and I'm not sure what they mean, I couldn't reproduce the error message. Hope this sorts out the question for you.
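To make that concrete, here is a minimal sketch of the single-loop version, with the deque/semaphore pair replaced by an asyncio.Queue (the queue and the "hello" message are illustrative; the host and port are taken from the question):

import asyncio

async def read_loop(reader):
    while True:
        readval = await reader.readline()
        if not readval:  # empty bytes means EOF: the peer closed the connection
            break
        print(f"read {readval}")

async def write_loop(writer, queue):
    while True:
        msg = await queue.get()
        writer.write(msg.encode())
        await writer.drain()

async def main():
    # one connection on one loop; both coroutines can share the pair safely
    reader, writer = await asyncio.open_connection('127.0.0.1', 5454)
    queue = asyncio.Queue()
    await queue.put("hello\n")
    await asyncio.gather(read_loop(reader), write_loop(writer, queue))

asyncio.run(main())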
I was trying to use Telethon, but it turns out to be really slow.
So I tried using this gist as suggested in this post.
I get the following error. Can anyone please help me?
Here is my code:
from telethon.sync import TelegramClient
from FastTelethon import download_file
import os
import asyncio

async def getAllMediaFromchannel():
    os.chdir("/home/gtxtreme/Documents/McHumour")
    api_hash = "<hidden>"
    api_id = <hidden>
    client = TelegramClient('MCHumour', api_id, api_hash)
    client.start()
    ch_entity = await client.get_entity("telegram.me/joinchat/AAAAAEXnb4jK7xyU1SfAsw")
    messages = client.iter_messages(ch_entity, limit=50)

    def progress_cb(current, total):
        print('Uploaded', current, 'out of', total,
              'bytes: {:.5%}'.format(current / total))

    async for msg in messages:
        result = await download_file(client, msg.document, "/home/gtxtreme/Documents/McHumour",
                                     progress_callback=progress_cb)
        print("*************************\nFile named {0} saved to {1} successfully\n********************".format(
            msg.message, result))

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(getAllMediaFromchannel())
Here is my error:
[gtxtreme@archlinux ~]$ python PycharmProjects/python_gtxtreme/tgBotrev1.py
PycharmProjects/python_gtxtreme/tgBotrev1.py:13: RuntimeWarning: coroutine 'AuthMethods._start' was never awaited
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
Traceback (most recent call last):
  File "PycharmProjects/python_gtxtreme/tgBotrev1.py", line 31, in <module>
    loop.run_until_complete(getAllMediaFromchannel())
  File "/usr/lib/python3.8/asyncio/base_events.py", line 612, in run_until_complete
    return future.result()
  File "PycharmProjects/python_gtxtreme/tgBotrev1.py", line 14, in getAllMediaFromchannel
    ch_entity = await client.get_entity("telegram.me/joinchat/AAAAAEXnb4jK7xyU1SfAsw")
  File "/usr/lib/python3.8/site-packages/telethon/client/users.py", line 310, in get_entity
    result.append(await self._get_entity_from_string(x))
  File "/usr/lib/python3.8/site-packages/telethon/client/users.py", line 512, in _get_entity_from_string
    invite = await self(
  File "/usr/lib/python3.8/site-packages/telethon/client/users.py", line 30, in __call__
    return await self._call(self._sender, request, ordered=ordered)
  File "/usr/lib/python3.8/site-packages/telethon/client/users.py", line 56, in _call
    future = sender.send(request, ordered=ordered)
  File "/usr/lib/python3.8/site-packages/telethon/network/mtprotosender.py", line 170, in send
    raise ConnectionError('Cannot send requests while disconnected')
ConnectionError: Cannot send requests while disconnected
[gtxtreme@archlinux ~]$
Any other suitable way of doing this would also be welcome.
client.start is an async method, so you should await it.
It only needs the await when it is called inside a coroutine; if you call it outside of a function, Telethon adds the await implicitly for convenience.
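Applied to the code above, that is a one-line change inside the coroutine, so the client is actually connected before get_entity runs:

# inside getAllMediaFromchannel(), instead of client.start():
await client.start()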
It's a weird error, since when I try/except it, nothing is printed.
I'm using a Sanic server to asyncio.gather a bunch of image downloads concurrently, more than 3,000 images.
I didn't get this error when dealing with a smaller sample size.
Simplified example:
from sanic import Sanic
from sanic import response
from aiohttp import ClientSession
from asyncio import gather

app = Sanic()

@app.listener('before_server_start')
async def init(app, loop):
    app.session = ClientSession(loop=loop)

@app.route('/test')
async def test(request):
    data_tasks = []
    # The error only happened when a large amount of images were used
    for imageURL in request.json['images']:
        data_tasks.append(getRaw(imageURL))
    await gather(*data_tasks)
    return response.text('done')

async def getRaw(url):
    async with app.session.get(url) as resp:
        return await resp.read()
What could this error be? If it is some kind of limitation of my host/internet, how can I avoid it?
I'm using a basic droplet from DigitalOcean with 1 vCPU and 1 GB RAM, if that helps.
Full stack trace:
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/sanic/app.py", line 750, in handle_request
    response = await response
  File "server-sanic.py", line 53, in xlsx
    await gather(*data_tasks)
  File "/usr/lib/python3.5/asyncio/futures.py", line 361, in __iter__
    yield self  # This tells Task to wait for completion.
  File "/usr/lib/python3.5/asyncio/tasks.py", line 296, in _wakeup
    future.result()
  File "/usr/lib/python3.5/asyncio/futures.py", line 274, in result
    raise self._exception
  File "/usr/lib/python3.5/asyncio/tasks.py", line 241, in _step
    result = coro.throw(exc)
  File "server-sanic.py", line 102, in add_data_to_sheet
    await add_img_to_sheet(sheet, rowIndex, colIndex, val)
  File "server-sanic.py", line 114, in add_img_to_sheet
    image_data = BytesIO(await getRaw(imgUrl))
  File "server-sanic.py", line 138, in getRaw
    async with app.session.get(url) as resp:
  File "/usr/local/lib/python3.5/dist-packages/aiohttp/client.py", line 690, in __aenter__
    self._resp = yield from self._coro
  File "/usr/local/lib/python3.5/dist-packages/aiohttp/client.py", line 277, in _request
    yield from resp.start(conn, read_until_eof)
  File "/usr/local/lib/python3.5/dist-packages/aiohttp/client_reqrep.py", line 637, in start
    self._continue = None
  File "/usr/local/lib/python3.5/dist-packages/aiohttp/helpers.py", line 732, in __exit__
    raise asyncio.TimeoutError from None
concurrent.futures._base.TimeoutError
There is no benefit to launching a million requests at once. Limit it to 10 or whatever works and wait for those before continuing the loop.
for imageURL in request.json['images']:
    data_tasks.append(getRaw(imageURL))
    if len(data_tasks) > 10:
        await gather(*data_tasks)
        data_tasks = []
await gather(*data_tasks)
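One drawback of batching like this is that every batch waits for its slowest download. As an alternative sketch, an asyncio.Semaphore keeps up to 10 downloads in flight at all times; this reuses app.session and request from the handler above, and getRawLimited and the limit of 10 are illustrative:

from asyncio import Semaphore, gather

sem = Semaphore(10)  # at most 10 downloads run concurrently

async def getRawLimited(url):
    async with sem:  # blocks here until one of the 10 slots frees up
        async with app.session.get(url) as resp:
            return await resp.read()

# inside the /test handler:
await gather(*[getRawLimited(u) for u in request.json['images']])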
I have a trivial aiohttp.web application that executes SQL queries via the aiopg SQLAlchemy integration. It's as simple as:
import asyncio

import aiohttp.web
from aiopg.sa import create_engine

app = aiohttp.web.Application()

async def rows(request):
    async with request.app["db"].acquire() as db:
        return aiohttp.web.json_response(list(await db.execute("SELECT * FROM table")))

app.router.add_route("GET", "/rows", rows)

async def init(app):
    app["db"] = await create_engine(host="postgres", user="visio", password="visio", database="visio")

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    handler = app.make_handler()
    loop.run_until_complete(init(app))
    loop.run_until_complete(loop.create_server(handler, "0.0.0.0", 80))
    loop.run_forever()
When server load reaches 100 rps, this error starts appearing randomly:
RuntimeError: cursor.execute() called while another coroutine is already waiting for incoming data
File "aiohttp/server.py", line 261, in start
yield from self.handle_request(message, payload)
File "aiohttp/web.py", line 88, in handle_request
resp = yield from handler(request)
File "visio_longer/views/communicate/__init__.py", line 72, in legacy_communicate
device = await query_device(db, access_token)
File "visio_longer/views/communicate/__init__.py", line 31, in query_device
(Device.access_token == access_token)
File "aiopg/utils.py", line 72, in __await__
resp = yield from self._coro
File "aiopg/sa/connection.py", line 103, in _execute
yield from cursor.execute(str(compiled), post_processed_params[0])
File "aiopg/cursor.py", line 103, in execute
waiter = self._conn._create_waiter('cursor.execute')
File "aiopg/connection.py", line 186, in _create_waiter
'already waiting for incoming data' % func_name)
It happens with random queries at random times, once every few days; sometimes the errors come in bunches of 2 or 4. Is there something wrong with my code? The aiohttp and aiopg versions are the latest from pip.
I think you forgot to acquire a cursor:

async with conn.cursor() as cur:
    await cur.execute("SELECT 1")
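For reference, here is the fuller plain-aiopg shape of that pattern, as a sketch: it assumes a pool created with aiopg.create_pool (the aiopg.sa engine in the question exposes an analogous engine.acquire() block), and fetch_one is a hypothetical helper name. As the error message itself suggests, the key point is that each coroutine acquires its own connection rather than sharing one connection across concurrently running coroutines.

import aiopg

async def fetch_one(pool):
    # each coroutine takes its own connection from the pool, so no two
    # coroutines ever wait for incoming data on the same connection
    async with pool.acquire() as conn:
        async with conn.cursor() as cur:
            await cur.execute("SELECT 1")
            return await cur.fetchone()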