I need to read from a StreamReader, when a new connection is made to my Server instance created with asyncio.start_server, without consuming the internal buffer.
The first thing I tried was to create a deep copy of the StreamReader object passed to the connection-handling callback, but that throws an error.
The second thing I tried was to create a StreamReader from scratch by copying the buffer and the transport of the passed StreamReader:
async def _handle_req(self, reader: StreamReader, writer: StreamWriter):
    reader_copy = StreamReader()
    reader_copy._buffer = deepcopy(reader._buffer)
    reader_copy.set_transport(reader._transport)
However, when readline is invoked on reader_copy, it throws an IncompleteReadError exception. After some debugging I found that the exception is raised by the internal readuntil method of the StreamReader class in streams.py (line 626), because for some reason the internal self._eof flag is set to True. This happens at line 515 in the _wait_for_data method
...
self._waiter = self._loop.create_future()
try:
    await self._waiter
finally:
    self._waiter = None
because await self._waiter, for some reason, sets it to True.
Is there an alternative solution or a way to create a copy of the StreamReader?
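For clarity, this is the kind of non-consuming peek I am after; a minimal sketch relying on the private _buffer attribute (a bytearray in CPython), which only exposes bytes that have already been received and may break between Python versions:

from asyncio import StreamReader

def peek_buffered(reader: StreamReader) -> bytes:
    # Snapshot whatever is currently buffered without consuming it;
    # subsequent read()/readline() calls on the reader are unaffected.
    return bytes(reader._buffer)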
Context
I am trying to write a data pipeline using Dask distributed and some legacy code from a previous project. get_data simply takes url: str and session: ClientSession as arguments and returns a pandas DataFrame.
from dask.distributed import Client
from aiohttp import ClientSession

client = Client()
session: ClientSession = connector.session_factory()

futures = client.map(
    get_data,  # function to get data (takes url and http session)
    urls,
    [session for _ in range(len(urls))],  # PROBLEM IS HERE
    retries=5,
)
r = client.map(loader.job, futures)
_ = client.gather(r)
Problem
I get the following error:
File "/home/zar3bski/.cache/pypoetry/virtualenvs/poc-dask-iG-N0GH5-py3.10/lib/python3.10/site-packages/distributed/worker.py", line 2952, in warn_dumps
b = dumps(obj)
File "/home/zar3bski/.cache/pypoetry/virtualenvs/poc-dask-iG-N0GH5-py3.10/lib/python3.10/site-packages/distributed/protocol/pickle.py", line 58, in dumps
result = cloudpickle.dumps(x, **dump_kwargs)
File "/home/zar3bski/.cache/pypoetry/virtualenvs/poc-dask-iG-N0GH5-py3.10/lib/python3.10/site-packages/cloudpickle/cloudpickle_fast.py", line 73, in dumps
cp.dump(obj)
File "/home/zar3bski/.cache/pypoetry/virtualenvs/poc-dask-iG-N0GH5-py3.10/lib/python3.10/site-packages/cloudpickle/cloudpickle_fast.py", line 632, in dump
return Pickler.dump(self, obj)
TypeError: cannot pickle 'TaskStepMethWrapper' object
Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7f3042b2fa00>
My temptation was then to register a serializer and a deserializer for this exotic object, following this doc:
from distributed.protocol import dask_serialize, dask_deserialize

@dask_serialize.register(TaskStepMethWrapper)
def serialize(ctx: TaskStepMethWrapper) -> Tuple[Dict, List[bytes]]:
    header = {}  # ?
    frames = []  # ?
    return header, frames

@dask_deserialize.register(TaskStepMethWrapper)
def deserialize(header: Dict, frames: List[bytes]) -> TaskStepMethWrapper:
    return TaskStepMethWrapper(frames)  # ?
The problem is that I don't know where to import TaskStepMethWrapper from. I know that TaskStepMethWrapper is asyncio related:
grep -rnw './' -e '.*TaskStepMethWrapper.*'
grep: ./lib-dynload/_asyncio.cpython-310-x86_64-linux-gnu.so: binary file matches
But I couldn't find its definition anywhere in site-packages/aiohttp. I also tried to use a Client(asynchronous=True), which only resulted in a TypeError: cannot pickle '_contextvars.Context' object.
How do you handle serialization of exotic objects in Dask? Should I extend the Dask serializer or use an additional serialization family?
client = Client('tcp://scheduler-address:8786',
                serializers=['dask', 'pickle'],     # BUT WHICH ONE
                deserializers=['dask', 'msgpack'])  # BUT WHICH ONE
There is a far easier way to get around this: create your sessions within the mapped function. You would have been recreating the sessions in each worker anyway; they cannot survive a transfer.
from dask.distributed import Client
from aiohttp import ClientSession

client = Client()

def func(u):
    session: ClientSession = connector.session_factory()
    return get_data(u, session)

futures = client.map(
    func,
    urls,
    retries=5,
)
(I don't know what loader.job is, so I have omitted that).
Note that TaskStepMethWrapper (and anything to do with aiohttp) sounds like it should be called only in async code. Maybe func needs to be async and you need appropriate awaits.
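For illustration, a rough sketch of what an async variant might look like, assuming get_data is (or is wrapped as) a coroutine; a plain ClientSession() stands in here for the question's connector.session_factory():

import asyncio
from aiohttp import ClientSession

async def _fetch(u):
    # Create the session inside the worker; sessions cannot be pickled
    # and shipped from the client.
    async with ClientSession() as session:
        return await get_data(u, session)

def func(u):
    # Each mapped call runs its own short-lived event loop on the worker.
    return asyncio.run(_fetch(u))

futures = client.map(func, urls, retries=5)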
What am I doing wrong? Is there a fix? I'm new to async programming; it's very confusing.
# myFile.py
import httpx
from time import sleep  # blocking sleeps added only so the prints are easy to follow

async def ping_api():
    async with httpx.AsyncClient() as client:
        sleep(1)
        print('right after with')
        sleep(1)
        print('before await')
        sleep(1)
        response = await client.get(url, params=params)  # url and params defined elsewhere
        sleep(1)
        print('after await')
        sleep(1)
        data = response.json()  # what's wrong here?
        sleep(1)
        print('after json')
        sleep(1)
        return data

# myFastAPI.py
from myFile import ping_api

@app...
async def main():
    data = await ping_api()
Resulting error:
before await
after await
C:\Users\foo\grok\site-packages\httpx\_client.py:1772: UserWarning: Unclosed <authlib.integrations.httpx_client.oauth2_client.AsyncOAuth2Client object at 0x0000021F318EC5E0>. See https://www.python-httpx.org/async/#opening-and-closing-clients for details.
warnings.warn(
after json
Shouldn't the context manager automatically close the connection?
Is this a bug in the library or am I missing something?
Is response.json() the cause, or is the problem elsewhere and it just happens to 'print' at this point?
https://github.com/encode/httpx/issues/1332
Found cause:
The issue was actually my usage of another library (tda-api). Found the answer here: "Do not attempt to use more than one Client object per token file, as this will likely cause issues with the underlying OAuth2 session management". My mistake was causing the error to be 'printed' in function calls that had no obvious relationship to it (e.g. datetime.combine(), response.json()). Instead of invoking the creation of the client object within the function, I now create it outside and pass the client object as an argument to my various async def functions.
The error does not occur in sync def functions because a sync function does not yield the thread to the event loop at any point before returning. This means no concurrent Client objects are ever alive at the same time, so the 1:1 Client-object-to-token-file ratio is not violated in the synchronous case, and creating the Client object inside the function is not an issue.
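A minimal sketch of that pattern, with a plain httpx.AsyncClient standing in for the tda-api client (the function name, URL and params below are illustrative, not from my real code):

import httpx

async def ping_api(client: httpx.AsyncClient, url: str, params: dict):
    # The single client is created by the caller and reused, so concurrent
    # coroutines never spawn competing clients for the same token/session.
    response = await client.get(url, params=params)
    return response.json()

async def main():
    async with httpx.AsyncClient() as client:
        data = await ping_api(client, "https://example.com/api", {"q": "1"})
        return data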
I need to asynchronously read from stdin in order to get messages (JSON terminated by \r\n) and, after processing, asynchronously write the updated message to stdout.
At the moment I am doing it synchronously, like this:
import sys

class SyncIOStdInOut():
    def write(self, payload: str):
        sys.stdout.write(payload)
        sys.stdout.write('\r\n')
        sys.stdout.flush()

    def read(self) -> str:
        payload = sys.stdin.readline()
        return payload
How to do the same but asynchronously?
Here's an example of echoing stdin to stdout using asyncio streams (for Unix).
import asyncio
import sys


async def connect_stdin_stdout():
    loop = asyncio.get_event_loop()
    reader = asyncio.StreamReader()
    protocol = asyncio.StreamReaderProtocol(reader)
    await loop.connect_read_pipe(lambda: protocol, sys.stdin)
    w_transport, w_protocol = await loop.connect_write_pipe(asyncio.streams.FlowControlMixin, sys.stdout)
    writer = asyncio.StreamWriter(w_transport, w_protocol, reader, loop)
    return reader, writer


async def main():
    reader, writer = await connect_stdin_stdout()
    while True:
        res = await reader.read(100)
        if not res:
            break
        writer.write(res)


if __name__ == "__main__":
    asyncio.run(main())
As a ready-to-use solution, you could use the aioconsole library. It implements a similar approach, but also provides additional useful asynchronous equivalents of input, print, exec and code.interact:
from aioconsole import get_standard_streams


async def main():
    reader, writer = await get_standard_streams()
Update:
Let's try to figure out how the function connect_stdin_stdout works.
Get the current event loop:
loop = asyncio.get_event_loop()
Create a StreamReader instance.
reader = asyncio.StreamReader()
Generally, StreamReader/StreamWriter classes are not intended to be directly instantiated and should only be used as a result of functions such as open_connection() and start_server().
StreamReader provides a buffered asynchronous interface to some data stream. Some source (library code) calls its methods such as feed_data and feed_eof; the data is buffered and can be read using the documented coroutines read(), readline(), etc.
Create a StreamReaderProtocol instance.
protocol = asyncio.StreamReaderProtocol(reader)
This class is derived from asyncio.Protocol and FlowControlMixin and helps to adapt between Protocol and StreamReader. It overrides Protocol methods such as data_received and eof_received and calls the corresponding StreamReader methods such as feed_data.
Register standard input stream stdin in the event loop.
await loop.connect_read_pipe(lambda: protocol, sys.stdin)
The connect_read_pipe function takes a file-like object as its pipe parameter, and stdin is a file-like object. From now on, all data read from stdin will go into the StreamReaderProtocol and then be passed into the StreamReader.
Register standard output stream stdout in the event loop.
w_transport, w_protocol = await loop.connect_write_pipe(FlowControlMixin, sys.stdout)
In connect_write_pipe you need to pass a protocol factory that creates protocol instances implementing the flow control logic for StreamWriter.drain(). This logic is implemented in the FlowControlMixin class; StreamReaderProtocol also inherits from it.
Create StreamWriter instance.
writer = asyncio.StreamWriter(w_transport, w_protocol, reader, loop)
This class forwards the data passed to it via write(), writelines(), etc. to the underlying transport.
protocol is used to support the drain() coroutine, which waits for the moment when the underlying transport has flushed its internal buffer and is available for writing again.
reader is an optional parameter and can be None; it is also used to support drain(). At the start of that function, it is checked whether an exception was set on the reader, for example due to a lost connection (relevant for sockets and bidirectional connections); in that case drain() will also raise an exception.
You can read more about StreamWriter and drain() function in this great answer.
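To make the flow control concrete, here is a small sketch of the usual write-then-drain pattern with the reader and writer returned by connect_stdin_stdout above:

async def echo(reader: asyncio.StreamReader, writer: asyncio.StreamWriter):
    while True:
        chunk = await reader.read(100)
        if not chunk:
            break
        writer.write(chunk)
        # Wait until the transport's buffer is drained before reading more,
        # so output cannot pile up faster than it is written.
        await writer.drain()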
Update 2:
To read lines with a \r\n separator, readuntil() can be used.
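For example, a sketch of reading \r\n-terminated messages with that reader (the handling of a trailing partial line is illustrative):

async def read_messages(reader: asyncio.StreamReader):
    while True:
        try:
            line = await reader.readuntil(b'\r\n')
        except asyncio.IncompleteReadError as e:
            # EOF arrived before a separator; e.partial holds any leftover bytes.
            if e.partial:
                yield e.partial
            break
        yield line.rstrip(b'\r\n')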
This is another way you can async read from stdin (reads a single line at a time).
async def async_read_stdin() -> str:
    loop = asyncio.get_event_loop()
    return await loop.run_in_executor(None, sys.stdin.readline)
I have a case where I need to find out, from the coroutine, which task failed.
I'm using asyncio.wait, and when a task fails with an exception, I cannot tell which arguments caused the task to fail.
I tried to read the coroutine's cr_frame, but it seems that after the coroutine has run, cr_frame is None.
I tried other things too, like editing the coroutine class and setting an attribute coro.__mydata = data, but it seems that I cannot add attributes dynamically to a coroutine (maybe it's a limitation of Python, I don't know).
Here's some code:
async def main():
    """
    Basically this function sends messages to an API.
    The class takes things like channelID, userID, etc.
    Sometimes channelID or userID are wrong and the task raises an exception.
    """
    resp = await asyncio.wait(self._queue_send_task)
    for r in resp[0]:
        try:
            sent = r.result()
        except:
            # Exception because of wrong args; I need the args to act upon them
            coro = r.get_coro()
            coro.cr_frame  # Returns None; normally this would return a frame if called before the coro starts
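For reference, the workaround I am considering: keep an explicit mapping from each Task to the arguments it was created with, instead of trying to recover them from the coroutine afterwards (channel_id and user_id are placeholder names, not from the real code):

import asyncio

task_args = {}  # Task -> the arguments it was started with

def schedule(send_coro_factory, channel_id, user_id):
    task = asyncio.ensure_future(send_coro_factory(channel_id, user_id))
    task_args[task] = (channel_id, user_id)
    return task

async def run(tasks):
    done, _ = await asyncio.wait(tasks)
    for t in done:
        try:
            t.result()
        except Exception:
            # Recover the arguments that belonged to the failing task.
            channel_id, user_id = task_args[t]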
I'm experimenting with Content-Disposition in Tornado. My code for reading and writing the file looks like this:
with open(file_name, 'rb') as f:
    while True:
        data = f.read(4096)
        if not data:
            break
        self.write(data)
self.finish()
I expected the memory usage to stay roughly constant since it is not reading everything at once. But the resource monitor shows:
In use     Available
12.7 GB    2.5 GB
Sometimes it will even BSOD my computer...
How do I download a large file (say 12GB in size)?
Tornado 6.0 provides an API that you can use to download a large file, like below:
import aiofiles

# Inside a tornado.web.RequestHandler subclass
async def get(self):
    self.set_header('Content-Type', 'application/octet-stream')
    # aiofiles uses a thread pool, so it is not truly asynchronous
    async with aiofiles.open(r"F:\test.xyz", "rb") as f:
        while True:
            data = await f.read(1024)
            if not data:
                break
            self.write(data)
            # The flush call is important: it keeps memory usage low,
            # because the buffered data is sent out in a timely manner.
            self.flush()
Just using aiofiles without calling self.flush() may not solve the trouble.
Just look at the write() method:
def write(self, chunk: Union[str, bytes, dict]) -> None:
    """Writes the given chunk to the output buffer.

    To write the output to the network, use the `flush()` method below.

    If the given chunk is a dictionary, we write it as JSON and set
    the Content-Type of the response to be ``application/json``.
    (if you want to send JSON as a different ``Content-Type``, call
    ``set_header`` *after* calling ``write()``).

    Note that lists are not converted to JSON because of a potential
    cross-site security vulnerability. All JSON output should be
    wrapped in a dictionary. More details at
    http://haacked.com/archive/2009/06/25/json-hijacking.aspx/ and
    https://github.com/facebook/tornado/issues/1009
    """
    if self._finished:
        raise RuntimeError("Cannot write() after finish()")
    if not isinstance(chunk, (bytes, unicode_type, dict)):
        message = "write() only accepts bytes, unicode, and dict objects"
        if isinstance(chunk, list):
            message += (
                ". Lists not accepted for security reasons; see "
                + "http://www.tornadoweb.org/en/stable/web.html#tornado.web.RequestHandler.write"  # noqa: E501
            )
        raise TypeError(message)
    if isinstance(chunk, dict):
        chunk = escape.json_encode(chunk)
        self.set_header("Content-Type", "application/json; charset=UTF-8")
    chunk = utf8(chunk)
    self._write_buffer.append(chunk)
At the end of the code, it just appends the data you want to send to _write_buffer.
The data is actually sent when the get or post method has finished and the finish method is called.
The documentation for Tornado's handler flush() is:
http://www.tornadoweb.org/en/stable/web.html?highlight=flush#tornado.web.RequestHandler.flush
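One related note, as a hedged aside: in Tornado 6, flush() returns a Future, so awaiting it inside the loop adds flow control and keeps the buffer from growing faster than a slow client can read. A variant of the handler above:

import aiofiles

# Inside a tornado.web.RequestHandler subclass
async def get(self):
    self.set_header('Content-Type', 'application/octet-stream')
    async with aiofiles.open(r"F:\test.xyz", "rb") as f:
        while True:
            data = await f.read(1024)
            if not data:
                break
            self.write(data)
            await self.flush()  # wait until this chunk has been handed to the network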