FastAPI Limiting Upload File Size Problem [duplicate] - python

This question already has an answer here:
How to Upload a large File (≥3GB) to FastAPI backend?
I would like to set a file upload limit with FastAPI, but I have a problem: I can't check the size of the file before the person uploads the whole file. For example, if there is a 15 MB upload limit and the person sends more than 15 MB, I want to reject the upload before it reaches the server. I don't want to rely on Content-Length, because it won't prevent an attack. I have found different solutions, but none of them can check the file before it has been uploaded to the system. As a result, if I can't prevent this and someone tries to upload a 100 GB file while I don't have that much space on my machine, what will happen? Thank you in advance.
https://github.com/tiangolo/fastapi/issues/362
I've read and tried what's written on this subject, and I also tried with ChatGPT, but I couldn't find a solution.

You describe a problem faced by all web servers. As per dmontagu's response to the FastAPI issue, the general solution is to let a mature web server or load balancer enforce the limit for you, for example with Apache's LimitRequestBody directive. These products are one of several lines of defense on a hostile Internet, so their implementations should be more resilient than anything you can write yourself.
The client is completely untrustworthy because of the peer-to-peer way the Internet is built. There is no inherent identification/trust provision in the HTTP protocol (or the Internet's structure), so this behaviour must be built into our applications. To protect your web API from maliciously sized uploads, you would need to provide an authorised client program that checks the source data before transmission, plus an authorisation process so that only your special app can connect to the API, to prevent bypassing the authorised client. Such client-side code is vulnerable to reverse engineering, and many users would object to installing your software for the sake of an upload!
It is more pragmatic to build our web services with inherent distrust of clients and block malicious requests on sight. The linked Apache directive above will prevent the full 100 GB being received, and similar options exist for nginx and other web servers. Other techniques include IP bans for excessive users, authentication to allow you to audit users individually, or some other profiling of requests.
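For instance, a minimal sketch of that Apache directive (the 15 MB value mirrors the question and is an assumption):
# Reject any request whose body exceeds 15 MB (the value is in bytes)
LimitRequestBody 15728640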
If you must DIY in Python, then Tiangolo's own solution is the right approach. Either you spool to a file to limit the memory impact, as he proposes, or you run an in-memory accumulator on the request body and abort when you hit the threshold. The Starlette documentation describes how to stream a request body. Something like the following Starlette-themed suggestion:
body = b''
async for chunk in request.stream():
    body += chunk
    if len(body) > 10_000_000:  # abort once the 10 MB threshold is crossed
        return Response(status_code=status.HTTP_413_REQUEST_ENTITY_TOO_LARGE)
...
In going down this road, you've drained the request body, so you will need to send it directly to disk or repackage it for FastAPI. FastAPI itself sits "above" the untrusted-user problem and offers no solution of its own.
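If you take this approach and want to send the data directly to disk rather than accumulate it in memory, here is a minimal sketch; the 15 MB cap and the /upload route are assumptions:
import tempfile
from fastapi import FastAPI, Request, Response, status

app = FastAPI()
MAX_BODY = 15 * 1024 * 1024  # hypothetical 15 MB cap

@app.post("/upload")
async def upload(request: Request):
    received = 0
    # SpooledTemporaryFile keeps small bodies in memory and larger ones on disk
    with tempfile.SpooledTemporaryFile(max_size=1024 * 1024) as spool:
        async for chunk in request.stream():
            received += len(chunk)
            if received > MAX_BODY:
                return Response(status_code=status.HTTP_413_REQUEST_ENTITY_TOO_LARGE)
            spool.write(chunk)
        spool.seek(0)
        # hand the spooled file to your parsing/storage code here
    return {"received_bytes": received}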

Your request doesn't reach the ASGI app directly. It goes through a reverse proxy (Nginx, Apache) and an ASGI server (Uvicorn, Hypercorn, Gunicorn) before it is handled by the ASGI app.
Reverse Proxy
For Nginx, the body size is controlled by client_max_body_size, which defaults to 1MB.
For Apache, the body size can be controlled by LimitRequestBody, which defaults to 0 (unlimited).
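For example, a minimal Nginx sketch of that directive (the 15 MB value is an assumption, mirroring the question):
server {
    listen 80;
    # Requests with bodies larger than 15 MB are rejected with a 413
    client_max_body_size 15M;
}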
ASGI Server
The ASGI servers themselves don't impose a limit on the body size; at least that's the case for Gunicorn, Uvicorn, and Hypercorn.
Large request body attack
This attack aims to exhaust the server's memory by inviting it to receive a large request body (and hence write the body to memory). A poorly configured server would have no limit on the request body size and could potentially allow a single request to exhaust the server.
FastAPI solution
You could require the Content-Length header, check it, and make sure that it's a valid value, e.g.:
from fastapi import FastAPI, File, Header, Depends, UploadFile

async def valid_content_length(content_length: int = Header(..., lt=50_000_000)):
    return content_length

app = FastAPI()

@app.post('/upload', dependencies=[Depends(valid_content_length)])
async def upload_file(file: UploadFile = File(...)):
    # do something with file
    return {"ok": True}
Note: ⚠️ this probably won't prevent an attacker from sending a valid Content-Length header and a body bigger than what your app can take ⚠️
Another option would be, on top of the header check, to read the data in chunks and throw an error once it grows bigger than a certain size:
from typing import IO
from tempfile import NamedTemporaryFile
import shutil
from fastapi import FastAPI, File, Header, Depends, UploadFile, HTTPException
from starlette import status

async def valid_content_length(content_length: int = Header(..., lt=80_000)):
    return content_length

app = FastAPI()

@app.post("/upload")
def upload_file(
    file: UploadFile = File(...), file_size: int = Depends(valid_content_length)
):
    real_file_size = 0
    temp: IO = NamedTemporaryFile(delete=False)
    for chunk in file.file:
        real_file_size += len(chunk)
        if real_file_size > file_size:
            raise HTTPException(
                status_code=status.HTTP_413_REQUEST_ENTITY_TOO_LARGE, detail="Too large"
            )
        temp.write(chunk)
    temp.close()
    shutil.move(temp.name, "/tmp/some_final_destiny_file")
    return {"ok": True}
Reference:
https://pgjones.gitlab.io/quart/discussion/dos_mitigations.html#large-request-body
https://github.com/tiangolo/fastapi/issues/362#issuecomment-584104025

Related

How to make a large file accessible to external APIs?

I'm new to webdev, and I have this use case where a user sends a large file (e.g., a video file) to the API, and then this file needs to be accessible to other APIs (which could possibly be on different servers) for further processing.
I'm using FastAPI for the backend, defining a file parameter with a type of UploadFile to receive and store the files. But what would be the best way to make this file accessible to other APIs? Is there a way I can get a publicly accessible URL out of the saved file, which other APIs can use to download the file?
Returning a File Response
First, to return a file that is saved on disk from a FastAPI backend, you could use FileResponse (in case the file was already fully loaded into memory, see here). For example:
from fastapi import FastAPI
from fastapi.responses import FileResponse

some_file_path = "large-video-file.mp4"
app = FastAPI()

@app.get("/")
def main():
    return FileResponse(some_file_path)
In case the file is too large to fit into memory—as you may not have enough memory to handle the file data (e.g., with 16 GB of RAM you can't load a 100 GB file)—you could use StreamingResponse. That way, you don't have to read the whole file into memory first; instead, you load it in chunks and process the data one chunk at a time. An example is given below. If you find yield from f rather slow when using StreamingResponse, you could instead create a custom generator, as described in this answer.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

some_file_path = "large-video-file.mp4"
app = FastAPI()

@app.get("/")
def main():
    def iterfile():
        with open(some_file_path, mode="rb") as f:
            yield from f

    return StreamingResponse(iterfile(), media_type="video/mp4")
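If you do need the custom generator mentioned above, here is a minimal sketch that reads in fixed-size chunks (the 1 MB chunk size is an assumption):
def iter_chunks(path: str, chunk_size: int = 1024 * 1024):
    # Yield fixed-size binary chunks instead of iterating line by line,
    # which can be slow for binary files
    with open(path, mode="rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk

# usage: StreamingResponse(iter_chunks(some_file_path), media_type="video/mp4")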
Exposing the API to the public
As for exposing your API to the public—i.e., external APIs, users, developers, etc.—you can use ngrok (or expose, as suggested in this answer).
Ngrok is a cross-platform application that enables developers to expose a local development server to the Internet with minimal effort. To embed the ngrok agent into your FastAPI application, you could use pyngrok—as suggested here (see here for a FastAPI integration example). If you would like to run and expose your FastAPI app through Google Colab (using ngrok), instead of your local machine, please have a look at this answer (plenty of tutorials/examples can also be found on the web).
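A minimal pyngrok sketch (the port is an assumption; point it at wherever Uvicorn serves your app):
from pyngrok import ngrok

# Open an HTTP tunnel to the locally running FastAPI app (assumed on port 8000)
tunnel = ngrok.connect(8000)
print("Public URL:", tunnel.public_url)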
If you are looking for a more permanent solution, you may want to have a look at cloud platforms—more specifically, a Platform as a Service (PaaS)—such as Heroku. I would strongly recommend you thoroughly read FastAPI's Deployment documentation. Have a closer look at About HTTPS and Deployments Concepts.
Important to note
By exposing your API to the outside world, you are also exposing it to various forms of attack. Before exposing your API to the public—even if it’s for free—you need to make sure you are offering secure access (use HTTPS), as well as authentication (verify the identity of a user) and authorisation (verify their access rights; in other words, verify what specific routes, files and data a user has access to)—take a look at 1. OAuth2 and JWT tokens, 2. OAuth2 scopes, 3. Role-Based Access Control (RBAC), 4. Get Current User and How to Implement Role based Access Control With FastAPI.
Additionally, if you are exposing your API to be used publicly, you may want to limit the usage of the API because of expensive computation, limited resources, DDoS attacks, brute-force attacks, web scraping, or simply the monthly cost for a fixed amount of requests. You can do that at the application level using, for instance, slowapi (related post here), or at the platform level by setting the rate limit through your hosting service (if permitted).
Furthermore, you would need to make sure that the files uploaded by users have a permitted file extension, e.g., .mp4, and are not files with, for instance, a .exe extension that are potentially harmful to your system.
Finally, you would also need to ensure that the uploaded files do not exceed a predefined MAX_FILE_SIZE limit (based on your needs and your system's resources), so that authenticated users, or an attacker, would be prevented from uploading extremely large files that consume server resources to the point where the application may crash. You shouldn't rely, though, on the Content-Length header being present in the request to do that, as this might be easily altered, or even removed, by the client. You should rather use an approach similar to this answer (have a look at the "Update" section) that uses request.stream() to process the incoming data in chunks as it arrives, instead of loading the entire file into memory first. By using a simple counter, e.g., total_len += len(chunk), you can check whether the file size has exceeded MAX_FILE_SIZE, and if so, raise an HTTPException with the HTTP_413_REQUEST_ENTITY_TOO_LARGE status code (see this answer as well, for more details and code examples).
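A minimal sketch of that counter approach (the MAX_FILE_SIZE value and the route are assumptions):
from fastapi import FastAPI, Request, HTTPException, status

app = FastAPI()
MAX_FILE_SIZE = 15 * 1024 * 1024  # hypothetical 15 MB limit

@app.post("/upload")
async def upload(request: Request):
    total_len = 0
    chunks = []
    # Count the bytes as they arrive instead of trusting Content-Length
    async for chunk in request.stream():
        total_len += len(chunk)
        if total_len > MAX_FILE_SIZE:
            raise HTTPException(status_code=status.HTTP_413_REQUEST_ENTITY_TOO_LARGE)
        chunks.append(chunk)
    body = b"".join(chunks)  # full body is now in memory; save or parse as needed
    return {"received_bytes": len(body)}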
Read more on FastAPI's Security documentation and API Security on Cloudflare.

Faking http-host headers with pytest (and potential security issues)?

I am running a Django app with a REST API, and I am protecting my API endpoint with a custom permission. It looks like this:
from rest_framework import permissions

class MyPermission(permissions.BasePermission):
    def has_permission(self, request, view):
        host = request.META.get('HTTP_HOST', None)
        return host == "myhost.de"
The idea is that my API is only accessible via "myhost.de".
Right now I am testing this with pytest. I can set my headers with:
import pytest

@pytest.fixture()
def request_unauth(client):
    result = client.get(
        "myurl",
        headers={'HTTP_HOST': 'myhost.de'},
        content_type="application/json",
    )
    return result

def test_host(request_unauth):
    assert request_unauth.status_code == 200
Since I can easily fake my headers, I assume this can also easily be done with other tools? How should MyPermission be evaluated from a security perspective?
Thanks so much for any help and hints. Very much appreciated.
Checking the Host header like that does not make sense, and it will not protect against third-party clients as you described in the comments. An attacker can create an arbitrary client and send requests to your API, and those requests can (and will) include the correct Host header, just like any other legitimate request.
Also, based on your comments, you want to authenticate the client application, which is not technically possible, as has been discussed many times. With some work (the amount of which you can influence somewhat), anybody can create a different client for your API, and there is no secure way to prevent that, because anything you include in your client will be known to its users (attackers), allowing them to copy it. You can and probably should authenticate your users though, check access patterns, implement rate limiting, revoke user access in case of suspicious activity, and so on - but this is all based on user authentication.
You can also prevent access from a client running in a standard browser on a different domain by sending the correct CORS headers (or not sending CORS headers at all) in your API.
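For instance, a minimal sketch using the django-cors-headers package (the package choice and the allowed origin are assumptions):
# settings.py (fragment): only this origin may make cross-site browser requests
INSTALLED_APPS = [
    # ... your other apps ...
    "corsheaders",
]
MIDDLEWARE = [
    "corsheaders.middleware.CorsMiddleware",
    # ... your other middleware ...
]
CORS_ALLOWED_ORIGINS = ["https://myhost.de"]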

'CORS request did not succeed' when uploading an image and Flask raises error (Firefox only)

I am trying to debug a CORS issue with my app. Specifically, it fails only in Firefox and, it seems, only with somewhat bigger files.
I am using Flask on the backend, and I am trying to upload a "faulty" image to my service. When I say faulty, I mean that the backend should reject the image with a 400 (it only accepts PNG, not JPG). Uploading a PNG of any size works OK. However, when I reject a JPG file, the browser request fails with a network error, and I cannot capture the 400 error to display a user-friendly message. From the backend's side everything is the same: the same headers are always returned, whether the request is accepted or rejected, POST or OPTIONS.
However, I have noticed that it only fails with somewhat bigger files. If I send a JPG of a few KBs, it works. If I send a JPG of a few MBs, it fails.
I have looked at everything
curl-ing the backend gives all the right headers
no OPTIONS requests are logged by the browser, but in case there were any, I've also checked those with curl for the right headers
I'm only using HTTP (not HTTPS), so no problems with certificates
I have disabled all extensions, so no possible blocking from the browser
maybe other things that I cannot remember
What can possibly be the cause? Note that everything works as expected in other browsers; only Firefox shows:
Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at http://localhost:8083/api/image. (Reason: CORS request did not succeed).
Well, after a couple of hours of trials, it appears this has nothing to do with CORS. This is probably the most confusing error message. To cite from Firefox's documentation on this (emphasis mine):
The HTTP request which makes use of CORS failed because the HTTP connection failed at either the network or protocol level. The error is not directly related to CORS, but is a fundamental network error of some kind. In many cases, it is caused by a browser plugin (e.g. an ad blocker or privacy protector) blocking the request.
So, this should actually indicate that the problem is on the backend, although it is very subtle.
Since in my code I am rejecting the request based on the transmitted filename, I never read the content of the request if the name ends with .jpg. Instead, I reject it immediately. This is a problem with Flask's development server, which does not empty the input stream in such cases (issue here).
So, if you want to deal with this while keeping the development server, you should consume the input. In my case, I added a custom error handler, like so:
from flask import request, jsonify

class BadRequestError(ValueError):
    """Raised when a request does not conform to the protocol"""
    pass

@app.errorhandler(BadRequestError)
def bad_request_handler(error):
    # throw away the request data to avoid closing the connection
    # before receiving all of it
    # http://flask.pocoo.org/snippets/47/
    _ = request.data
    _ = request.form
    response = jsonify(str(error))
    response.status_code = 400
    return response
Then, in the code, I always raise BadRequestError('...') instead of just returning a 400 response.
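For instance, a hypothetical upload view raising that error (the route mirrors the URL in the error message above; the field name and the PNG check are assumptions):
@app.route("/api/image", methods=["POST"])
def upload_image():
    file = request.files.get("image")
    # Raising BadRequestError lets the handler above drain the request body
    # before responding with a 400, avoiding the premature connection close
    if file is None or not file.filename.lower().endswith(".png"):
        raise BadRequestError("only PNG images are accepted")
    return jsonify("ok")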

Can I persist an http connection (or other data) across Flask requests?

I'm working on a Flask app which retrieves the user's XML from the myanimelist.net API (sample), processes it, and returns some data. The data returned can be different depending on the Flask page being viewed by the user, but the initial process (retrieve the XML, create a User object, etc.) done before each request is always the same.
Currently, retrieving the XML from myanimelist.net is the bottleneck for my app's performance and adds on a good 500-1000ms to each request. Since all of the app's requests are to the myanimelist server, I'd like to know if there's a way to persist the http connection so that once the first request is made, subsequent requests will not take as long to load. I don't want to cache the entire XML because the data is subject to frequent change.
Here's the general overview of my app:
from flask import Flask
from functools import wraps
import requests

app = Flask(__name__)

def get_xml(f):
    @wraps(f)
    def wrap():
        # Get the XML before each app function
        r = requests.get('page_from_MAL')  # Current bottleneck
        user = User(data_from_r)  # User object
        response = f(user)
        return response
    return wrap

@app.route('/one')
@get_xml
def page_one(user_object):
    return 'some data from user_object'

@app.route('/two')
@get_xml
def page_two(user_object):
    return 'some other data from user_object'

if __name__ == '__main__':
    app.run()
So is there a way to persist the connection like I mentioned? Please let me know if I'm approaching this from the right direction.
I think you aren't approaching this from the right direction, because you're placing your app too much as a proxy of myanimelist.net.
What happens when you have 2000 users? Your app ends up making tons of requests to myanimelist.net, and a mean user could definitely DoS your app (or use it to DoS myanimelist.net).
This is a much cleaner way, IMHO:
Server side (a sketch follows below):
Create a websocket server (e.g., https://github.com/aaugustin/websockets/blob/master/example/server.py)
When a user connects to the websocket server, add the client to a list; remove it from the list on disconnect.
For every connected user, frequently check myanimelist.net to get the associated XML (maybe lowering the frequency as more users are online).
For every XML document, make a diff with your server's local version, and send that diff to the client over the websocket channel (assuming there is a diff).
Client side:
On receiving a diff: update the local XML with the differences.
Disconnect from the websocket after n seconds of inactivity, and when disconnected, add a button on the interface to reconnect.
I doubt you can do anything much better assuming myanimelist.net doesn't provide a "push" API.
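A minimal sketch of the server side described above, assuming the websockets library (version 10 or later) and a hypothetical fetch_xml_diff() helper that compares myanimelist.net's XML against the server's local copy:
import asyncio
import websockets

CONNECTED = set()

async def handler(websocket):
    # Track connected clients so the polling loop knows whom to update
    CONNECTED.add(websocket)
    try:
        await websocket.wait_closed()
    finally:
        CONNECTED.discard(websocket)

async def poll_mal():
    while True:
        for ws in list(CONNECTED):
            diff = fetch_xml_diff(ws)  # hypothetical diff helper
            if diff:
                await ws.send(diff)
        await asyncio.sleep(30)  # polling interval; raise it as user count grows

async def main():
    async with websockets.serve(handler, "localhost", 8765):
        await poll_mal()

asyncio.run(main())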

App Engine TimeOut Error: Serving a third-party API with an image stored on App Engine

I'm building an application in Python on App Engine. My application receives images as email attachments. When an email comes in, I grab the image and need to send it to a third party API.
The first thing I did was:
1) make a POST request to the third party API with the image data
I stopped this method because I had some pretty bad encoding problems with urllib2 and a MultipartPostHandler.
The second thing, which I'm doing right now, is:
1) Put the image from the incoming email in the Datastore
2) Put it in the memcache
3) Send the API a URL that serves the image (from the memcache or, if not found there, the Datastore)
The problem I see in my logs is: DeadlineExceededError: ApplicationError: 5
More precisely, I see two requests in my logs:
- first, the incoming email
- then, the third-party API's HTTP call for my image at the URL I gave it
The incoming email request ends with the DeadlineExceededError.
The third-party API's call to my application completes fine, correctly serving the image.
My interpretation:
It looks like App Engine waits for a response from the third party API, then closes because of a timeout, and then serves the request made by the third party API for the image. Unfortunately, as the connection is closed, I cannot get the useful information provided by the third party API once it has received my image.
My questions:
1) Can App Engine serve an incoming request from the same host it is itself waiting on for a response?
2) If not, how can I work around this problem?
If you directly use the App Engine URLfetch API, you can adjust the timeout for your request. The default is 5 seconds, and it can be increased to 10 seconds for normal handlers, or to 10 minutes for fetches within task queue tasks or cron jobs.
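A minimal sketch with the legacy urlfetch API (the endpoint URL is hypothetical):
from google.appengine.api import urlfetch

def send_image(image_data):
    # Raise the deadline above the 5-second default before making the call
    result = urlfetch.fetch(
        url="https://thirdparty.example.com/process",  # hypothetical endpoint
        payload=image_data,
        method=urlfetch.POST,
        deadline=10,  # seconds; larger deadlines are allowed inside tasks
    )
    return result.status_code, result.content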
If the external API is going to take more than 10 seconds to respond, probably your best bet would be to have your email handler fire off a task that calls the API with a very high timeout set (although almost certainly it would be better to fix your "pretty bad encoding problems"; how bad can encoding binary data to POST be?)
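A sketch of that hand-off with the legacy taskqueue API (the task URL is an assumption):
from google.appengine.api import taskqueue

def receive_mail(image_data):
    # Hand the slow third-party call off to a task queue task, where much
    # longer urlfetch deadlines are permitted
    taskqueue.add(url="/tasks/send-image", payload=image_data)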
To answer your first question: if you're using dev_appserver, no, you can't handle any requests at all while you've got an external request pending; dev_appserver is single-threaded and handles 1 request at a time. The production environment should be able to scale to do this; however, if you have handlers that are waiting 10 seconds for a urlfetch, the scheduler might not scale your application well since the latency of incoming requests is one of the factors in auto-scaling.
