I'm experimenting with Content-Disposition on Tornado. My code for reading and writing the file looks like this:
with open(file_name, 'rb') as f:
    while True:
        data = f.read(4096)
        if not data:
            break
        self.write(data)
self.finish()
I expected the memory usage to be consistent since it is not reading everything at once. But the resource monitor shows:
In use     Available
12.7 GB    2.5 GB
Sometimes it will even BSOD my computer...
How do I download a large file (say 12GB in size)?
Tornado 6.0 provides an API for downloading large files; it can be used like below:
import aiofiles

async def get(self):
    self.set_header('Content-Type', 'application/octet-stream')
    # aiofiles uses a thread pool under the hood, not true asynchronous file I/O
    async with aiofiles.open(r"F:\test.xyz", "rb") as f:
        while True:
            data = await f.read(1024)
            if not data:
                break
            self.write(data)
            # The flush() call is important: it keeps memory usage low,
            # because it sends the buffered data out immediately
            self.flush()
Just using aiofiles without calling self.flush() may not solve the problem.
Just look at the method self.write():
def write(self, chunk: Union[str, bytes, dict]) -> None:
    """Writes the given chunk to the output buffer.

    To write the output to the network, use the `flush()` method below.

    If the given chunk is a dictionary, we write it as JSON and set
    the Content-Type of the response to be ``application/json``.
    (if you want to send JSON as a different ``Content-Type``, call
    ``set_header`` *after* calling ``write()``).

    Note that lists are not converted to JSON because of a potential
    cross-site security vulnerability. All JSON output should be
    wrapped in a dictionary. More details at
    http://haacked.com/archive/2009/06/25/json-hijacking.aspx/ and
    https://github.com/facebook/tornado/issues/1009
    """
    if self._finished:
        raise RuntimeError("Cannot write() after finish()")
    if not isinstance(chunk, (bytes, unicode_type, dict)):
        message = "write() only accepts bytes, unicode, and dict objects"
        if isinstance(chunk, list):
            message += (
                ". Lists not accepted for security reasons; see "
                + "http://www.tornadoweb.org/en/stable/web.html#tornado.web.RequestHandler.write"  # noqa: E501
            )
        raise TypeError(message)
    if isinstance(chunk, dict):
        chunk = escape.json_encode(chunk)
        self.set_header("Content-Type", "application/json; charset=UTF-8")
    chunk = utf8(chunk)
    self._write_buffer.append(chunk)
At the end of the method, it just appends the data you want to send to the _write_buffer.
The data is only sent when the get or post method has finished and the finish method is called.
The documentation for Tornado's RequestHandler.flush is here:
http://www.tornadoweb.org/en/stable/web.html?highlight=flush#tornado.web.RequestHandler.flush
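Putting the pieces together, here is a minimal sketch of a Tornado 6 download handler that sets Content-Disposition and awaits flush() after each chunk (flush() returns a Future in Tornado 6, so awaiting it provides back-pressure; the file path and chunk size are placeholders):
import aiofiles
import tornado.web

class DownloadHandler(tornado.web.RequestHandler):
    async def get(self):
        file_name = "large_file.bin"  # placeholder path
        self.set_header("Content-Type", "application/octet-stream")
        self.set_header("Content-Disposition", f'attachment; filename="{file_name}"')
        async with aiofiles.open(file_name, "rb") as f:
            while True:
                data = await f.read(64 * 1024)  # placeholder chunk size
                if not data:
                    break
                self.write(data)
                # flush() sends the buffered chunk now; awaiting the returned
                # Future waits until the data has been handed to the socket
                await self.flush()
        self.finish()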
I am trying to download a large file (.tar.gz) from a FastAPI backend. On the server side, I simply validate the file path, and then use Starlette's FileResponse to return the whole file, just like what I've seen in many related questions on StackOverflow.
Server side:
return FileResponse(path=file_name, media_type='application/octet-stream', filename=file_name)
After that, I get the following error:
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 149, in serialize_response
    return jsonable_encoder(response_content)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/encoders.py", line 130, in jsonable_encoder
    return ENCODERS_BY_TYPE[type(obj)](obj)
  File "pydantic/json.py", line 52, in pydantic.json.<lambda>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
I also tried using StreamingResponse, but got the same error. Any other ways to do it?
The StreamingResponse in my code:
@x.post("/download")
async def download(file_name=Body(), token: str | None = Header(default=None)):
    file_name = file_name["file_name"]
    # should be something like xx.tar

    def iterfile():
        with open(file_name, "rb") as f:
            yield from f

    return StreamingResponse(iterfile(), media_type='application/octet-stream')
Ok, here is an update to this problem.
I found that the error did not occur in this API, but in the API that forwards the request to it.
@app.post("/")
def f():
    req = requests.post(url="/download")
    return req.content
And here, if I returned a StreamingResponse with a .tar file, it led to (what seem to be) encoding problems.
When using requests, remember to set the same media type. Here it is media_type='application/octet-stream'. And it works!
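For reference, here is a minimal sketch of such a forwarding endpoint that streams the upstream response instead of buffering req.content in memory (the upstream URL, route, and function names are assumptions, not part of the original code):
import requests
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

UPSTREAM_URL = 'http://127.0.0.1:8000/download'  # hypothetical upstream endpoint

@app.post('/')
def forward():
    # stream=True avoids loading the whole upstream body into memory
    upstream = requests.post(url=UPSTREAM_URL, stream=True)
    return StreamingResponse(
        upstream.iter_content(chunk_size=1024 * 1024),
        media_type='application/octet-stream',
    )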
If you find yield from f being rather slow when using StreamingResponse with file-like objects, you could instead create a generator where you read the file in chunks using a specified chunk size; hence, speeding up the process. Examples can be found below.
Note that StreamingResponse can take either an async generator or a normal generator/iterator to stream the response body. In case you used the standard open() method that doesn't support async/await, you would have to declare the generator function with normal def. Regardless, FastAPI/Starlette will still work asynchronously, as it will check whether the generator you passed is asynchronous (as shown in the source code), and if it is not, it will then run the generator in a separate thread, using iterate_in_threadpool, which is then awaited.
You can set the Content-Disposition header in the response (as described in this answer, as well as here and here) to indicate if the content is expected to be displayed inline in the browser (if you are streaming, for example, a .mp4 video, .mp3 audio file, etc), or as an attachment that is downloaded and saved locally (using the specified filename).
As for the media_type (also known as MIME type), there are two primary MIME types (see Common MIME types):
text/plain is the default value for textual files. A textual file should be human-readable and must not contain binary data.
application/octet-stream is the default value for all other cases. An unknown file type should use this type.
For a file with .tar extension, as shown in your question, you can also use a different subtype from octet-stream, that is, x-tar. Otherwise, if the file is of unknown type, stick to application/octet-stream. See the linked documentation above for a list of common MIME types.
Option 1 - Using normal generator
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

CHUNK_SIZE = 1024 * 1024  # = 1MB - adjust the chunk size as desired
some_file_path = 'large_file.tar'
app = FastAPI()

@app.get('/')
def main():
    def iterfile():
        with open(some_file_path, 'rb') as f:
            while chunk := f.read(CHUNK_SIZE):
                yield chunk

    headers = {'Content-Disposition': 'attachment; filename="large_file.tar"'}
    return StreamingResponse(iterfile(), headers=headers, media_type='application/x-tar')
Option 2 - Using async generator with aiofiles
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import aiofiles

CHUNK_SIZE = 1024 * 1024  # = 1MB - adjust the chunk size as desired
some_file_path = 'large_file.tar'
app = FastAPI()

@app.get('/')
async def main():
    async def iterfile():
        async with aiofiles.open(some_file_path, 'rb') as f:
            while chunk := await f.read(CHUNK_SIZE):
                yield chunk

    headers = {'Content-Disposition': 'attachment; filename="large_file.tar"'}
    return StreamingResponse(iterfile(), headers=headers, media_type='application/x-tar')
I am trying to upload an mp4 video file using UploadFile in FastAPI.
However, the uploaded format is not readable by OpenCV (cv2).
This is my endpoint:
from fastapi import FastAPI, File, UploadFile
from fastapi.responses import PlainTextResponse

app = FastAPI()

@app.post("/video/test", response_class=PlainTextResponse)
async def detect_faces_in_video(video_file: UploadFile):
    contents = await video_file.read()
    print(type(video_file))  # <class 'starlette.datastructures.UploadFile'>
    print(type(contents))    # <class 'bytes'>
    return ""
Neither of these two formats (i.e., bytes and UploadFile) is readable by OpenCV.
You are trying to pass either the file contents (bytes) or the UploadFile object; however, VideoCapture() accepts either a video filename, a capturing device, or an IP video stream.
UploadFile is basically a SpooledTemporaryFile (a file-like object) that operates similarly to a TemporaryFile. However, it does not have a visible name in the file system. As you mentioned that you wouldn't be keeping the files on the server after processing them, you could copy the file contents to a NamedTemporaryFile that "has a visible name in the file system, which can be used to open the file" (using the name attribute), as described here and here. As per the documentation:
Whether the name can be used to open the file a second time, while the
named temporary file is still open, varies across platforms (it can be
so used on Unix; it cannot on Windows). If delete is true (the
default), the file is deleted as soon as it is closed.
Hence, on Windows you need to set the delete argument to False when instantiating a NamedTemporaryFile, and once you are done with it, you can manually delete it, using the os.remove() or os.unlink() method.
Below are given two options on how to do that. Option 1 implements a solution using a def endpoint, while Option 2 uses an async def endpoint (utilising the aiofiles library). For the difference between def and async def, please have a look at this answer. If you are expecting users to upload rather large files in size that wouldn't fit into memory, have a look at this and this answer on how to read the uploaded video file in chunks instead.
Option 1 - Using def endpoint
from fastapi import FastAPI, File, UploadFile
from tempfile import NamedTemporaryFile
import os

app = FastAPI()

@app.post("/video/detect-faces")
def detect_faces(file: UploadFile = File(...)):
    temp = NamedTemporaryFile(delete=False)
    try:
        try:
            contents = file.file.read()
            with temp as f:
                f.write(contents)
        except Exception:
            return {"message": "There was an error uploading the file"}
        finally:
            file.file.close()

        res = process_video(temp.name)  # Pass temp.name to VideoCapture()
    except Exception:
        return {"message": "There was an error processing the file"}
    finally:
        # temp.close()  # the `with` statement above takes care of closing the file
        os.remove(temp.name)

    return res
Option 2 - Using async def endpoint
from fastapi import FastAPI, File, UploadFile
from tempfile import NamedTemporaryFile
from fastapi.concurrency import run_in_threadpool
import aiofiles
import asyncio
import os

app = FastAPI()

@app.post("/video/detect-faces")
async def detect_faces(file: UploadFile = File(...)):
    try:
        async with aiofiles.tempfile.NamedTemporaryFile("wb", delete=False) as temp:
            try:
                contents = await file.read()
                await temp.write(contents)
            except Exception:
                return {"message": "There was an error uploading the file"}
            finally:
                await file.close()

        res = await run_in_threadpool(process_video, temp.name)  # Pass temp.name to VideoCapture()
    except Exception:
        return {"message": "There was an error processing the file"}
    finally:
        os.remove(temp.name)

    return res
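As mentioned above, if users might upload files too large to fit into memory, the upload could be copied to the temporary file in chunks rather than read all at once. Below is a minimal sketch of that idea for the async def option (the route name and the 1 MB chunk size are arbitrary choices, and process_video is the same user-defined function assumed in the options above):
from fastapi import FastAPI, File, UploadFile
from fastapi.concurrency import run_in_threadpool
import aiofiles
import os

app = FastAPI()

@app.post("/video/detect-faces-chunked")  # hypothetical route name
async def detect_faces_chunked(file: UploadFile = File(...)):
    try:
        async with aiofiles.tempfile.NamedTemporaryFile("wb", delete=False) as temp:
            try:
                # Copy the upload to the temporary file 1 MB at a time,
                # so the whole video never has to fit into memory at once
                while chunk := await file.read(1024 * 1024):
                    await temp.write(chunk)
            except Exception:
                return {"message": "There was an error uploading the file"}
            finally:
                await file.close()

        res = await run_in_threadpool(process_video, temp.name)  # process_video as assumed above
    except Exception:
        return {"message": "There was an error processing the file"}
    finally:
        os.remove(temp.name)

    return res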
I need to asynchronously read from stdin in order to get messages (JSON terminated by \r\n) and, after processing, asynchronously write the updated message to stdout.
At the moment I am doing it synchronously, like this:
import sys

class SyncIOStdInOut():
    def write(self, payload: str):
        sys.stdout.write(payload)
        sys.stdout.write('\r\n')
        sys.stdout.flush()

    def read(self) -> str:
        payload = sys.stdin.readline()
        return payload
How to do the same but asynchronously?
Here's an example of echo stdin to stdout using asyncio streams (for Unix).
import asyncio
import sys

async def connect_stdin_stdout():
    loop = asyncio.get_event_loop()
    reader = asyncio.StreamReader()
    protocol = asyncio.StreamReaderProtocol(reader)
    await loop.connect_read_pipe(lambda: protocol, sys.stdin)
    w_transport, w_protocol = await loop.connect_write_pipe(asyncio.streams.FlowControlMixin, sys.stdout)
    writer = asyncio.StreamWriter(w_transport, w_protocol, reader, loop)
    return reader, writer

async def main():
    reader, writer = await connect_stdin_stdout()
    while True:
        res = await reader.read(100)
        if not res:
            break
        writer.write(res)

if __name__ == "__main__":
    asyncio.run(main())
As a ready-to-use solution, you could use the aioconsole library. It implements a similar approach, but also provides additional useful asynchronous equivalents to input, print, exec and code.interact:
from aioconsole import get_standard_streams

async def main():
    reader, writer = await get_standard_streams()
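For example, the echo loop from the snippet above could then be written against those streams (a sketch; error handling is omitted):
import asyncio
from aioconsole import get_standard_streams

async def main():
    reader, writer = await get_standard_streams()
    while True:
        line = await reader.readline()  # one \n / \r\n terminated line
        if not line:
            break
        writer.write(line)
        await writer.drain()            # wait until the data has been flushed out

asyncio.run(main())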
Update:
Let's try to figure out how the function connect_stdin_stdout works.
Get the current event loop:
loop = asyncio.get_event_loop()
Create StreamReader instance.
reader = asyncio.StreamReader()
Generally, StreamReader/StreamWriter classes are not intended to be directly instantiated and should only be used as a result of functions such as open_connection() and start_server().
StreamReader provides a buffered asynchronous interface to some data stream. Some source (library code) calls its methods such as feed_data and feed_eof; the data is buffered and can then be read using the documented coroutine interface: read(), readline(), etc.
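As a standalone illustration of that buffering (not part of the stdin setup itself), data fed into a StreamReader by hand can later be consumed through its coroutine interface:
import asyncio

async def demo():
    reader = asyncio.StreamReader()
    # Library code would normally call these; here we feed the data by hand
    reader.feed_data(b"hello\r\n")
    reader.feed_data(b"world\r\n")
    reader.feed_eof()
    print(await reader.readline())  # b'hello\r\n'
    print(await reader.read())      # b'world\r\n'

asyncio.run(demo())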
Create StreamReaderProtocol instance.
protocol = asyncio.StreamReaderProtocol(reader)
This class is derived from asyncio.Protocol and FlowControlMixin and helps to adapt between Protocol and StreamReader. It overrides Protocol methods such as data_received and eof_received and calls the StreamReader method feed_data.
Register standard input stream stdin in the event loop.
await loop.connect_read_pipe(lambda: protocol, sys.stdin)
The connect_read_pipe function takes a file-like object as its pipe parameter, and stdin is a file-like object. From now on, all data read from stdin will go through the StreamReaderProtocol and then be passed into the StreamReader.
Register standard output stream stdout in the event loop.
w_transport, w_protocol = await loop.connect_write_pipe(FlowControlMixin, sys.stdout)
In connect_write_pipe you need to pass a protocol factory that creates protocol instances implementing the flow control logic for StreamWriter.drain(). This logic is implemented in the class FlowControlMixin; StreamReaderProtocol also inherits from it.
Create StreamWriter instance.
writer = asyncio.StreamWriter(w_transport, w_protocol, reader, loop)
This class forwards the data passed to it via write(), writelines(), etc. to the underlying transport.
protocol is used to support the drain() function, which waits for the moment the underlying transport has flushed its internal buffer and is available for writing again.
reader is an optional parameter and can be None; it is also used to support the drain() function. At the start of drain(), it is checked whether an exception has been set on the reader (for example, due to a lost connection, which is relevant for sockets and bidirectional connections), in which case drain() will also throw an exception.
You can read more about StreamWriter and drain() function in this great answer.
Update 2:
To read lines with a \r\n separator, readuntil can be used, as in the sketch below.
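A minimal sketch of reading \r\n-terminated messages with readuntil, assuming the connect_stdin_stdout helper defined earlier in this answer is in the same module (IncompleteReadError covers an unterminated final message):
import asyncio

async def main():
    reader, writer = await connect_stdin_stdout()  # helper defined above
    while True:
        try:
            # readuntil() returns the data up to and including the separator
            message = await reader.readuntil(separator=b'\r\n')
        except asyncio.IncompleteReadError as e:
            # EOF arrived before a full separator; e.partial holds any leftover bytes
            if e.partial:
                writer.write(e.partial)
            break
        writer.write(message)  # echo the complete message back

if __name__ == "__main__":
    asyncio.run(main())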
This is another way you can async read from stdin (reads a single line at a time):
async def async_read_stdin() -> str:
    loop = asyncio.get_event_loop()
    return await loop.run_in_executor(None, sys.stdin.readline)
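A small usage sketch of that helper (echoes lines back until EOF; imports and the flush are included so it runs standalone):
import asyncio
import sys

async def async_read_stdin() -> str:
    loop = asyncio.get_event_loop()
    # sys.stdin.readline blocks, so run it in the default thread pool executor
    return await loop.run_in_executor(None, sys.stdin.readline)

async def main():
    while True:
        line = await async_read_stdin()
        if not line:  # EOF
            break
        sys.stdout.write(line)
        sys.stdout.flush()

asyncio.run(main())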
I am using FastAPI to upload a file according to the official documentation, as shown below:
@app.post("/create_file")
async def create_file(file: UploadFile = File(...)):
    file2store = await file.read()
    # some code to store the BytesIO(file2store) to the other database
When I send a request using Python requests library, as shown below:
f = open(".../file.txt", 'rb')
files = {"file": (f.name, f, "multipart/form-data")}
requests.post(url="SERVER_URL/create_file", files=files)
the file2store variable is always empty. Sometimes (though rarely) it can get the file bytes, but almost all the time it is empty, so I can't restore the file on the other database.
I also tried the bytes rather than UploadFile, but I get the same results. Is there something wrong with my code, or is the way I use FastAPI to upload a file wrong?
First, as per FastAPI documentation, you need to install python-multipart—if you haven't already—as uploaded files are sent as "form data". For instance:
pip install python-multipart
The below examples use the .file attribute of the UploadFile object to get the actual Python file (i.e., SpooledTemporaryFile), which allows you to call SpooledTemporaryFile's methods, such as .read() and .close(), without having to await them. It is important, however, to define your endpoint with def in this case—otherwise, such operations would block the server until they are completed, if the endpoint was defined with async def. In FastAPI, a normal def endpoint is run in an external threadpool that is then awaited, instead of being called directly (as it would block the server).
The SpooledTemporaryFile used by FastAPI/Starlette has the max_size attribute set to 1 MB, meaning that the data are spooled in memory until the file size exceeds 1 MB, at which point the data are written to a temporary file on disk, under the OS's temp directory. Hence, if you uploaded a file larger than 1 MB, it wouldn't be stored in memory, and calling file.file.read() would actually read the data from disk into memory. Thus, if the file is too large to fit into your server's RAM, you should rather read the file in chunks and process one chunk at a time, as described in "Read the File in chunks" section below. You may also have a look at this answer, which demonstrates another approach to upload a large file in chunks, using Starlette's .stream() method and streaming-form-data package that allows parsing streaming multipart/form-data chunks, which results in considerably minimising the time required to upload files.
If you have to define your endpoint with async def—as you might need to await for some other coroutines inside your route—then you should rather use asynchronous reading and writing of the contents, as demonstrated in this answer. Moreover, if you need to send additional data (such as JSON data) together with uploading the file(s), please have a look at this answer. I would also suggest you have a look at this answer, which explains the difference between def and async def endpoints.
Upload Single File
app.py
from fastapi import FastAPI, File, UploadFile

app = FastAPI()

@app.post("/upload")
def upload(file: UploadFile = File(...)):
    try:
        contents = file.file.read()
        with open(file.filename, 'wb') as f:
            f.write(contents)
    except Exception:
        return {"message": "There was an error uploading the file"}
    finally:
        file.file.close()

    return {"message": f"Successfully uploaded {file.filename}"}
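If the endpoint has to be declared with async def instead (as discussed above), a minimal sketch of the same upload using aiofiles for asynchronous writing could look like this (the route name is a hypothetical choice, and aiofiles needs to be installed; it is not part of the original options):
from fastapi import FastAPI, File, UploadFile
import aiofiles

app = FastAPI()

@app.post("/upload-async")  # hypothetical route name
async def upload_async(file: UploadFile = File(...)):
    try:
        async with aiofiles.open(file.filename, 'wb') as f:
            # Read and write in 1 MB chunks (arbitrary choice) without blocking the event loop
            while chunk := await file.read(1024 * 1024):
                await f.write(chunk)
    except Exception:
        return {"message": "There was an error uploading the file"}
    finally:
        await file.close()

    return {"message": f"Successfully uploaded {file.filename}"}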
Read the File in chunks
As described earlier and in this answer, if the file is too big to fit into memory—for instance, if you have 8GB of RAM, you can't load a 50GB file (not to mention that the available RAM will always be less than the total amount installed on your machine, as other applications will be using some of the RAM)—you should rather load the file into memory in chunks and process the data one chunk at a time. This method, however, may take longer to complete, depending on the chunk size you choose—in the example below, the chunk size is 1024 * 1024 bytes (i.e., 1MB). You can adjust the chunk size as desired.
from fastapi import File, UploadFile

@app.post("/upload")
def upload(file: UploadFile = File(...)):
    try:
        with open(file.filename, 'wb') as f:
            while contents := file.file.read(1024 * 1024):
                f.write(contents)
    except Exception:
        return {"message": "There was an error uploading the file"}
    finally:
        file.file.close()

    return {"message": f"Successfully uploaded {file.filename}"}
Another option would be to use shutil.copyfileobj(), which is used to copy the contents of a file-like object to another file-like object (have a look at this answer too). By default, the data is read in chunks, with the default buffer (chunk) size being 1MB (i.e., 1024 * 1024 bytes) for Windows and 64KB for other platforms, as shown in the source code here. You can specify the buffer size by passing the optional length parameter. Note: If a negative length value is passed, the entire contents of the file will be read instead; see also f.read(), which .copyfileobj() uses under the hood (as can be seen in the source code here).
from fastapi import File, UploadFile
import shutil

@app.post("/upload")
def upload(file: UploadFile = File(...)):
    try:
        with open(file.filename, 'wb') as f:
            shutil.copyfileobj(file.file, f)
    except Exception:
        return {"message": "There was an error uploading the file"}
    finally:
        file.file.close()

    return {"message": f"Successfully uploaded {file.filename}"}
test.py
import requests
url = 'http://127.0.0.1:8000/upload'
file = {'file': open('images/1.png', 'rb')}
resp = requests.post(url=url, files=file)
print(resp.json())
Upload Multiple (List of) Files
app.py
from fastapi import FastAPI, File, UploadFile
from typing import List

app = FastAPI()

@app.post("/upload")
def upload(files: List[UploadFile] = File(...)):
    for file in files:
        try:
            contents = file.file.read()
            with open(file.filename, 'wb') as f:
                f.write(contents)
        except Exception:
            return {"message": "There was an error uploading the file(s)"}
        finally:
            file.file.close()

    return {"message": f"Successfully uploaded {[file.filename for file in files]}"}
Read the Files in chunks
As described earlier in this answer, if you expect some rather large file(s) and don't have enough RAM to accommodate all the data from beginning to end, you should rather load the file into memory in chunks, thus processing the data one chunk at a time (Note: adjust the chunk size as desired; below, it is 1024 * 1024 bytes).
from fastapi import File, UploadFile
from typing import List

@app.post("/upload")
def upload(files: List[UploadFile] = File(...)):
    for file in files:
        try:
            with open(file.filename, 'wb') as f:
                while contents := file.file.read(1024 * 1024):
                    f.write(contents)
        except Exception:
            return {"message": "There was an error uploading the file(s)"}
        finally:
            file.file.close()

    return {"message": f"Successfully uploaded {[file.filename for file in files]}"}
or, using shutil.copyfileobj():
from fastapi import File, UploadFile
from typing import List
import shutil

@app.post("/upload")
def upload(files: List[UploadFile] = File(...)):
    for file in files:
        try:
            with open(file.filename, 'wb') as f:
                shutil.copyfileobj(file.file, f)
        except Exception:
            return {"message": "There was an error uploading the file(s)"}
        finally:
            file.file.close()

    return {"message": f"Successfully uploaded {[file.filename for file in files]}"}
test.py
import requests
url = 'http://127.0.0.1:8000/upload'
files = [('files', open('images/1.png', 'rb')), ('files', open('images/2.png', 'rb'))]
resp = requests.post(url=url, files=files)
print(resp.json())
@app.post("/create_file/")
async def image(image: UploadFile = File(...)):
    print(image.file)
    # print('../'+os.path.isdir(os.getcwd()+"images"),"*************")
    try:
        os.mkdir("images")
        print(os.getcwd())
    except Exception as e:
        print(e)
    file_name = os.getcwd() + "/images/" + image.filename.replace(" ", "-")
    with open(file_name, 'wb+') as f:
        f.write(image.file.read())
        f.close()
    file = jsonable_encoder({"imagePath": file_name})
    new_image = await add_image(file)
    return {"filename": new_image}
Using http.client in Python 3.3+ (or any other builtin python HTTP client library), how can I read a chunked HTTP response exactly one HTTP chunk at a time?
I'm extending an existing test fixture (written in python using http.client) for a server which writes its response using HTTP's chunked transfer encoding. For the sake of simplicity, let's say that I'd like to be able to print a message whenever an HTTP chunk is received by the client.
My code follows a fairly standard pattern for reading a large response:
conn = http.client.HTTPConnection(...)
conn.request(...)
response = conn.getresponse()

resbody = []
while True:
    chunk = response.read(1024)
    if len(chunk):
        resbody.append(chunk)
    else:
        break

conn.close()
But this reads 1024 byte chunks regardless of whether or not the server is sending 10 byte chunks or 10MiB chunks.
What I'm looking for would be something like the following:
while True:
    chunk = response.readchunk()
    if len(chunk):
        resbody.append(chunk)
    else:
        break
If this is not possible with http.client, is it possible with another built-in HTTP client library? If it's not possible with a built-in client library, is it possible with a pip-installable module?
I found it easier to use the requests library, like so:
r = requests.post(url, data=foo, headers=bar, stream=True)
for chunk in r.raw.read_chunked():
    print(chunk)
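Alternatively, iter_content with chunk_size=None is documented to yield data as it arrives, in whatever sizes the chunks are received, which behaves similarly for a chunked response (a sketch with a placeholder URL):
import requests

url = "http://localhost:8000/"  # placeholder endpoint
r = requests.post(url, stream=True)
for chunk in r.iter_content(chunk_size=None):
    # With stream=True and chunk_size=None, data is yielded as it arrives
    # from the server rather than in fixed-size pieces
    print(chunk)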
Update:
The benefit of chunked transfer encoding is to allow the transmission of dynamically generated content. Whether an HTTP library lets you read individual chunks or not is a separate issue (see RFC 2616 - Section 3.6.1).
I can see how what you are trying to do would be useful, but the standard python http client libraries don't do what you want without some hackery (see http.client and httplib).
What you are trying to do may be fine for use in your test fixture, but in the wild there are no guarantees. It is possible for the chunking of the data read by your client to be different from the chunking of the data sent by your server. E.g. the data could have been "re-chunked" by a proxy server before it arrived (see RFC 2616 - Section 3.2 - Framing Techniques).
The trick is to tell the response object that it isn't chunked (resp.chunked = False) so that it returns the raw bytes. This allows you to parse the size and data of each chunk as it is returned.
import http.client

conn = http.client.HTTPConnection("localhost")
conn.request('GET', "/")
resp = conn.getresponse()
resp.chunked = False

def get_chunk_size():
    size_str = resp.read(2)
    while size_str[-2:] != b"\r\n":
        size_str += resp.read(1)
    return int(size_str[:-2], 16)

def get_chunk_data(chunk_size):
    data = resp.read(chunk_size)
    resp.read(2)  # read and discard the trailing \r\n
    return data

respbody = ""
while True:
    chunk_size = get_chunk_size()
    if chunk_size == 0:
        break
    else:
        chunk_data = get_chunk_data(chunk_size)
        print("Chunk Received: " + chunk_data.decode())
        respbody += chunk_data.decode()

conn.close()
print(respbody)