I have the following Werkzeug application for returning a file to the client:
from werkzeug.wrappers import Request, Response

@Request.application
def application(request):
    # open() replaces the Python 2-only file() builtin
    fileObj = open(r'C:\test.pdf', 'rb')
    response = Response( response=fileObj.read() )
    response.headers['content-type'] = 'application/pdf'
    return response
The part I want to focus on is this one:
response = Response( response=fileObj.read() )
In this case the response takes about 500 ms (C:\test.pdf is a 4 MB file, and the web server is on my local machine).
But if I rewrite that line to this:
response = Response()
response.response = fileObj
Now the response takes about 1500 ms. (3 times slower)
And if I write it like this:
response = Response()
response.response = fileObj.read()
Now the response takes about 80 seconds (that's right, 80 SECONDS).
Why is there that much difference between the 3 methods?
And why is the third method sooooo slow?
The answer to that is pretty simple:
x.read() <- reads the whole file into memory, inefficient
setting response to a file: very inefficient, as the protocol for that object is an iterator. So you will send the file line by line. If it's binary, a "line" ends wherever a newline byte happens to occur, so the chunk sizes are effectively random.
setting response to a string: bad idea. It's an iterator as mentioned before, so you are now sending each character in the string as a separate packet.
The correct solution is to wrap the file in the file wrapper provided by the WSGI server:
from werkzeug.wsgi import wrap_file
return Response(wrap_file(environ, yourfile), direct_passthrough=True)
The direct_passthrough flag is required so that the response object does not attempt to iterate over the file wrapper but leaves it untouched for the WSGI server.
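For comparison, here is a rough plain-WSGI sketch of the same idea that uses only the standard library. It is not Werkzeug's code; serve_file and the in-memory stand-in file are made up for illustration. The optional 'wsgi.file_wrapper' key comes from the WSGI spec, and wsgiref's FileWrapper is used as the fallback when the server does not offer one.

```python
import io
from wsgiref.util import FileWrapper


def serve_file(environ, start_response, fileobj, block_size=8192):
    """Hypothetical helper: hand the file to the server's wrapper if there is one."""
    # 'wsgi.file_wrapper' is optional; fall back to the stdlib FileWrapper,
    # which reads the file in fixed-size blocks.
    wrapper = environ.get('wsgi.file_wrapper', FileWrapper)
    start_response('200 OK', [('Content-Type', 'application/pdf')])
    return wrapper(fileobj, block_size)


# Stand-in for a real file on disk: 20000 bytes held in memory.
fake_file = io.BytesIO(b'\0' * 20000)
statuses = []
body = serve_file({}, lambda status, headers: statuses.append(status), fake_file)
sizes = [len(block) for block in body]
print(statuses, sizes)
```

The body is sent in a few large blocks instead of line by line or character by character.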
After some testing I think I've figured out the mystery.
@Armin already explained why this...
response = Response()
response.response = fileObj.read()
...is so slow. But that doesn't explain why this...
response = Response( response=fileObj.read() )
...is so fast. They appear to be the same thing, but obviously they are not. Otherwise there wouldn't be such a tremendous difference in speed.
The key here is in this part of the docs: http://werkzeug.pocoo.org/docs/wrappers/
Response can be any kind of iterable or string. If it’s a string it’s considered being an iterable with one item which is the string passed.
I.e., when you give a string to the constructor, it's converted to an iterable with the string being its only element. But when you do this: response.response = fileObj.read(), the string is used as is, so it gets iterated character by character.
So to make it behave like the constructor, you have to do this:
response.response = [ fileObj.read() ]
and now the file is sent as fast as possible.
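The difference between the three variants can be reproduced without Werkzeug by counting the pieces an iterable body produces. This is only a sketch: chunks_sent is a made-up helper, and a text string is used because on Python 3 iterating bytes yields integers rather than the one-character strings that Python 2 byte strings yield.

```python
import io


def chunks_sent(body):
    """Return the size of each piece a server would write for an iterable body."""
    return [len(piece) for piece in body]


payload = "x" * 10 + "\n" + "y" * 5 + "\n" + "z" * 3   # 20 characters in total

# A bare string is iterated one character at a time.
as_string = chunks_sent(payload)

# A file object is iterated line by line, so sizes depend on where newlines fall.
as_file = chunks_sent(io.StringIO(payload))

# A one-element list is written in a single piece.
as_list = chunks_sent([payload])

print(len(as_string), as_file, as_list)   # 20 [11, 6, 3] [20]
```

For a 4 MB file the first case means roughly four million tiny writes, which fits the 80 seconds observed.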
I can't give you a precise answer as to why this occurs; however, http://werkzeug.pocoo.org/docs/wsgi/#werkzeug.wsgi.wrap_file may help address your underlying problem.
I am trying to download a large file (.tar.gz) from FastAPI backend. On server side, I simply validate the filepath, and I then use Starlette.FileResponse to return the whole file—just like what I've seen in many related questions on StackOverflow.
Server side:
return FileResponse(path=file_name, media_type='application/octet-stream', filename=file_name)
After that, I get the following error:
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 149, in serialize_response
    return jsonable_encoder(response_content)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/encoders.py", line 130, in jsonable_encoder
    return ENCODERS_BY_TYPE[type(obj)](obj)
  File "pydantic/json.py", line 52, in pydantic.json.lambda
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
I also tried using StreamingResponse, but got the same error. Any other ways to do it?
The StreamingResponse in my code:
@x.post("/download")
async def download(file_name=Body(), token: str | None = Header(default=None)):
    file_name = file_name["file_name"]
    # should be something like xx.tar
    def iterfile():
        with open(file_name, "rb") as f:
            yield from f
    return StreamingResponse(iterfile(), media_type='application/octet-stream')
Ok, here is an update to this problem.
I found that the error did not occur in this API, but in the API that forwards requests to it.
#("/")
def f():
    req = requests.post(url ="/download")
    return req.content
Here, returning a StreamingResponse with a .tar file led to what were (maybe) encoding problems.
When using requests, remember to set the same media type, in this case media_type='application/octet-stream'. With that, it works!
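The traceback is consistent with raw gzip bytes being decoded as text: a gzip stream starts with the bytes 0x1f 0x8b, and 0x8b cannot start a UTF-8 sequence. A small standard-library sketch (the payload is invented for illustration) reproduces the same message:

```python
import gzip

# Any gzip-compressed payload will do; this one is made up.
payload = gzip.compress(b'example archive contents')
print(payload[:2])   # the gzip magic number

error = None
try:
    payload.decode('utf-8')
except UnicodeDecodeError as exc:
    error = exc
    print(exc.start, hex(payload[exc.start]), exc.reason)
```

So the bytes in req.content were being pushed through the JSON encoder. In FastAPI, returning a Response object such as Response(content=req.content, media_type='application/octet-stream') instead of the bare bytes is one way to avoid that encoding step.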
If you find yield from f being rather slow when using StreamingResponse with file-like objects, you could instead create a generator where you read the file in chunks using a specified chunk size; hence, speeding up the process. Examples can be found below.
Note that StreamingResponse can take either an async generator or a normal generator/iterator to stream the response body. In case you used the standard open() method that doesn't support async/await, you would have to declare the generator function with normal def. Regardless, FastAPI/Starlette will still work asynchronously, as it will check whether the generator you passed is asynchronous (as shown in the source code), and if it is not, it will then run the generator in a separate thread, using iterate_in_threadpool, that is then awaited.
You can set the Content-Disposition header in the response to indicate whether the content is expected to be displayed inline in the browser (if you are streaming, for example, a .mp4 video, .mp3 audio file, etc.), or as an attachment that is downloaded and saved locally (using the specified filename).
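As a minimal sketch of building that header value, the helper below (content_disposition is a hypothetical name, not part of FastAPI or Starlette) switches between inline and attachment and percent-encodes filenames that are not plain ASCII tokens, using the filename* form:

```python
from urllib.parse import quote


def content_disposition(filename, inline=False):
    """Hypothetical helper that builds a Content-Disposition header value."""
    disposition = 'inline' if inline else 'attachment'
    encoded = quote(filename, safe='')
    if encoded == filename:
        # Nothing needed escaping, so the simple quoted form is enough.
        return f'{disposition}; filename="{filename}"'
    # Otherwise use the percent-encoded filename* form.
    return f"{disposition}; filename*=UTF-8''{encoded}"


print(content_disposition('large_file.tar'))
print(content_disposition('my file.tar', inline=True))
```

The result can be passed as headers={'Content-Disposition': ...} to the response.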
As for the media_type (also known as MIME type), there are two primary MIME types (see Common MIME types):
text/plain is the default value for textual files. A textual file should be human-readable and must not contain binary data.
application/octet-stream is the default value for all other cases. An unknown file type should use this type.
For a file with .tar extension, as shown in your question, you can also use a different subtype from octet-stream, that is, x-tar. Otherwise, if the file is of unknown type, stick to application/octet-stream. See the linked documentation above for a list of common MIME types.
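If you would rather not hard-code the media type, the standard library's mimetypes module can guess it from the file name. A minimal sketch, where pick_media_type is a made-up helper that falls back to application/octet-stream for unknown types:

```python
import mimetypes


def pick_media_type(filename):
    """Guess a media type from the file name, defaulting to octet-stream."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or 'application/octet-stream'


print(pick_media_type('large_file.tar'))
print(pick_media_type('blob.unknownext'))
```

Note that for a compressed archive such as a .tar.gz, guess_type reports the gzip layer separately as the encoding, so you may want to special-case those.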
Option 1 - Using normal generator
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

CHUNK_SIZE = 1024 * 1024  # = 1MB - adjust the chunk size as desired
some_file_path = 'large_file.tar'
app = FastAPI()

@app.get('/')
def main():
    def iterfile():
        with open(some_file_path, 'rb') as f:
            while chunk := f.read(CHUNK_SIZE):
                yield chunk

    headers = {'Content-Disposition': 'attachment; filename="large_file.tar"'}
    return StreamingResponse(iterfile(), headers=headers, media_type='application/x-tar')
Option 2 - Using async generator with aiofiles
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import aiofiles

CHUNK_SIZE = 1024 * 1024  # = 1MB - adjust the chunk size as desired
some_file_path = 'large_file.tar'
app = FastAPI()

@app.get('/')
async def main():
    async def iterfile():
        async with aiofiles.open(some_file_path, 'rb') as f:
            while chunk := await f.read(CHUNK_SIZE):
                yield chunk

    headers = {'Content-Disposition': 'attachment; filename="large_file.tar"'}
    return StreamingResponse(iterfile(), headers=headers, media_type='application/x-tar')
I have a problem getting my test to run using Robot Framework and robotframework-requests. I need to send a POST request with binary data in the body. I looked at this question already, but it's not really answered. Here's what my test case looks like:
Upload ${filename} file
    Create Session    mysession    http://${ADDRESS}
    ${data} =    Get Binary File    ${filename}
    &{headers} =    Create Dictionary    Content-Type=application/octet-stream    Accept=application/octet-stream
    ${resp} =    Post Request    mysession    ${CGIPath}    data=${data}    headers=&{headers}
    [Return]    ${resp.status_code}    ${resp.text}
The problem is that my binary data is about 250MB. When the data is read with Get Binary File I see that memory consumption goes up to 2.x GB. A few seconds later, when the Post Request is triggered, my test is killed by the OOM killer. I already looked at the files parameter, but it seems to use a multipart-encoded upload, which is not what I need.
My other thought was to pass an open file handle directly to the underlying requests library, but I guess that would require modifying robotframework-requests. Another idea is to fall back to curl for this test only.
Am I missing something in my test? What is the better way to address this?
I proceeded with the idea of modifying robotframework-requests and added this method:
def post_request_binary(
        self,
        alias,
        uri,
        path=None,
        params=None,
        headers=None,
        allow_redirects=None,
        timeout=None):
    session = self._cache.switch(alias)
    redir = True if allow_redirects is None else allow_redirects
    self._capture_output()
    method_name = "post"
    method = getattr(session, method_name)
    # Pass the open file handle so requests streams the body from disk
    with open(path, 'rb') as f:
        resp = method(self._get_url(session, uri),
                      data=f,
                      params=self._utf8_urlencode(params),
                      headers=headers,
                      allow_redirects=redir,
                      timeout=self._get_timeout(timeout),
                      cookies=self.cookies,
                      verify=self.verify)
    self._print_debug()
    # Store the last session object
    session.last_resp = resp
    self.builtin.log(method_name + ' response: ' + resp.text, 'DEBUG')
    return resp
I guess I can improve it a bit and create a pull request.
I was trying to modify the code example here but it seems that jsonify is making it hard... I did the following without jsonify:
@app.errorhandler(InvalidUsage)
def handle_invalid_usage(error):
    response = error.to_dict()
    response.status_code = error.status_code
    return response
Originally, the third line was like:
response = jsonify(error.to_dict())
How can I make this work? I don't want to use JSON. Only text/html
Well, I presume that error.to_dict() returns a dict, which wouldn't have a status code attribute (it's just a regular old dict). You might try this instead:
@app.errorhandler(InvalidUsage)
def handle_invalid_usage(error):
    response = error.to_dict()
    response["status_code"] = error.status_code
    return response
That said, it seems odd that the dictionary wouldn't already have the error code in it. Maybe a bit more detail on what you are trying to accomplish would help? If you don't call the second-to-last line, and just return response, what do you see?
I have looked, and perhaps I missed it. I currently have a file such as the one below:
PUT /URL/TO/SEND/REQUEST
Host: 127.0.0.1
Connection: keep-alive
...
bunch of data here
This file contains the headers and the data I want to send over SSL. I know that on Windows I can use Fiddler etc. to send this raw data, but I was hoping to use Python. I tried looking (maybe not hard enough) at urllib2, urllib and httplib to see if I could just send this file as the entire request, since I don't want to deal with parsing the file etc. Is this possible?
I did notice that in httplib I can use request where "body can be a file object," but from the description it seems as though it still sends the headers separately and the file is only for the data being sent.
Thanks
It isn't documented, but it looks like you should be able to use httplib.HTTPConnection.send() for this:
In [13]: httplib.HTTPConnection.send??
Type: instancemethod
String Form:<unbound method HTTPConnection.send>
File: /usr/local/lib/python2.7/httplib.py
Definition: httplib.HTTPConnection.send(self, data)
Source:
    def send(self, data):
        """Send `data' to the server."""
        if self.sock is None:
            if self.auto_open:
                self.connect()
            else:
                raise NotConnected()

        if self.debuglevel > 0:
            print "send:", repr(data)
        blocksize = 8192
        if hasattr(data,'read') and not isinstance(data, array):
            if self.debuglevel > 0: print "sendIng a read()able"
            datablock = data.read(blocksize)
            while datablock:
                self.sock.sendall(datablock)
                datablock = data.read(blocksize)
        else:
            self.sock.sendall(data)
The request() method combines the header and body and passes it to this function, which looks like it should handle strings or file objects.
Of course you will still need to know the host so that you can create the HTTPConnection object, so your code might look something like this (untested):
import httplib
conn = httplib.HTTPConnection('127.0.0.1')
conn.send(open(filename))
response = conn.getresponse()
edit: It turns out there is some internal state stuff that keeps this from working as is, here is a workaround (full example with google main page), but it is a bit of a hack. Tested using Python 2.6 and 2.7, does not appear to work on 3.x by just replacing httplib with http.client:
import httplib
conn = httplib.HTTPConnection('www.google.com')
conn.send('GET / HTTP/1.1\r\nHost: www.google.com\r\n\r\n')
conn._HTTPConnection__state = httplib._CS_REQ_SENT
response = conn.getresponse()
The key part here is setting conn.__state (name-mangled to conn._HTTPConnection__state) to httplib._CS_REQ_SENT after calling send().
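Since that hack does not carry over to 3.x, one alternative on Python 3 is to skip the HTTP library entirely and write the file's bytes to a socket. This is only a sketch under stated assumptions: send_raw_request is a made-up name, the file must contain a complete, well-formed request, and it should include a Connection: close header so that the server closes the connection and the read loop ends.

```python
import socket
import ssl


def send_raw_request(host, port, raw_request, use_tls=False):
    """Hypothetical helper: send pre-built request bytes and return the raw reply."""
    sock = socket.create_connection((host, port), timeout=10)
    if use_tls:
        # Wrap the TCP socket in TLS using the default certificate checks.
        context = ssl.create_default_context()
        sock = context.wrap_socket(sock, server_hostname=host)
    with sock:
        sock.sendall(raw_request)
        chunks = []
        # Read until the server closes the connection.
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b''.join(chunks)
```

Usage would look something like send_raw_request('127.0.0.1', 443, open(filename, 'rb').read(), use_tls=True). The reply is the raw status line, headers and body, so you have to parse it yourself.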
I am building a simple web service that requires all requests to be signed. The signature hash is generated using request data including the request body. My desire is to have a middleware component that validates the request signature, responding with an error if the signature is invalid. The problem is the middleware needs to read the request body using env['wsgi.input'].read(). This advances the pointer for the request body string to the end, which makes the data inaccessible to other components further down in the chain of execution.
Is there any way to make it so env['wsgi.input'] can be read twice?
Ex:
from myapp.lib.helpers import sign_request
from urlparse import parse_qs
import json

class ValidateSignedRequestMiddleware(object):
    def __init__(self, app, secret):
        self._app = app
        self._secret = secret

    def __call__(self, environ, start_response):
        auth_params = environ['HTTP_AUTHORIZATION'].split(',', 1)
        timestamp = auth_params[0].split('=', 1)[1]
        signature = auth_params[1].split('=', 1)[1]
        expected_signature = sign_request(
            environ['REQUEST_METHOD'],
            environ['HTTP_HOST'],
            environ['PATH_INFO'],
            parse_qs(environ['QUERY_STRING']),
            environ['wsgi.input'].read(),
            timestamp,
            self._secret
        )
        if signature != expected_signature:
            start_response('400 Bad Request', [('Content-Type', 'application/json')])
            return [json.dumps({'error': ('Invalid request signature',)})]
        return self._app(environ, start_response)
You can try seeking back to the beginning, but you may find that you'll have to replace it with a StringIO containing what you just read out.
The following specification deals with that exact problem, providing explanation of the problem as well as the solution including source code and special cases to take into account:
http://wsgi.readthedocs.org/en/latest/specifications/handling_post_forms.html
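The "replace it with a StringIO" idea can be sketched as follows. On Python 3 the body is bytes, so io.BytesIO is used; read_body_and_rewind is a made-up helper name, and the sketch assumes CONTENT_LENGTH is set whenever there is a body.

```python
import io


def read_body_and_rewind(environ):
    """Read the request body once, then put a fresh readable copy back."""
    try:
        length = int(environ.get('CONTENT_LENGTH') or 0)
    except ValueError:
        length = 0
    body = environ['wsgi.input'].read(length)
    # Components further down the chain read from this in-memory copy.
    environ['wsgi.input'] = io.BytesIO(body)
    return body


environ = {'CONTENT_LENGTH': '11', 'wsgi.input': io.BytesIO(b'hello world')}
first = read_body_and_rewind(environ)
second = environ['wsgi.input'].read()
print(first == second)   # True
```

In the middleware above, the call to environ['wsgi.input'].read() would be replaced by a call to this helper. Keep in mind that this buffers the whole body in memory, which may not be acceptable for large uploads.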