I'm new to webdev, and I have this use case where a user sends a large file (e.g., a video file) to the API, and then this file needs to be accessible to other APIs (which could possibly be on different servers) for further processing.
I'm using FastAPI for the backend, defining a file parameter with a type of UploadFile to receive and store the files. But what would be the best way to make this file accessible to other APIs? Is there a way I can get a publicly accessible URL out of the saved file, which other APIs can use to download the file?
Returning a File Response
First, to return a file that is saved on disk from a FastAPI backend, you could use FileResponse (if the file has already been fully loaded into memory, see here instead). For example:
from fastapi import FastAPI
from fastapi.responses import FileResponse

some_file_path = "large-video-file.mp4"
app = FastAPI()


@app.get("/")
def main():
    return FileResponse(some_file_path)
In case the file is too large to fit into memory (you may not have enough memory to handle the file data; e.g., if you have 16 GB of RAM, you can't load a 100 GB file), you could use StreamingResponse instead. That way, you don't have to read the whole file into memory first; it is loaded in chunks, and the data is processed one chunk at a time. An example is given below. If you find yield from f rather slow when using StreamingResponse, you could instead create a custom generator, as described in this answer.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

some_file_path = "large-video-file.mp4"
app = FastAPI()


@app.get("/")
def main():
    def iterfile():
        with open(some_file_path, mode="rb") as f:
            yield from f

    return StreamingResponse(iterfile(), media_type="video/mp4")
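For reference, here is a minimal sketch of such a custom generator, reading the file in fixed-size chunks instead of line by line (the 1 MB chunk size is an arbitrary choice, not something prescribed by FastAPI):

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

some_file_path = "large-video-file.mp4"
CHUNK_SIZE = 1024 * 1024  # 1 MB per chunk; tune to your needs
app = FastAPI()


@app.get("/")
def main():
    def iterfile():
        # Read and yield fixed-size chunks instead of iterating line by line
        with open(some_file_path, mode="rb") as f:
            while chunk := f.read(CHUNK_SIZE):
                yield chunk

    return StreamingResponse(iterfile(), media_type="video/mp4")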
Exposing the API to the public
As for exposing your API to the public—i.e., external APIs, users, developers, etc.—you can use ngrok (or expose, as suggested in this answer).
Ngrok is a cross-platform application that enables developers to expose a local development server to the Internet with minimal effort. To embed the ngrok agent into your FastAPI application, you could use pyngrok—as suggested here (see here for a FastAPI integration example). If you would like to run and expose your FastAPI app through Google Colab (using ngrok), instead of your local machine, please have a look at this answer (plenty of tutorials/examples can also be found on the web).
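As a rough illustration, here is a minimal sketch of the pyngrok approach, assuming pyngrok is installed and Uvicorn serves the app on port 8000 (see the linked answers for a fuller integration):

from fastapi import FastAPI
from pyngrok import ngrok

app = FastAPI()


@app.on_event("startup")
def open_ngrok_tunnel():
    # Open an ngrok tunnel to the local port Uvicorn is listening on
    tunnel = ngrok.connect(8000)
    print(f"Public URL: {tunnel.public_url}")


@app.get("/")
def main():
    return {"message": "reachable from the Internet through ngrok"}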
If you are looking for a more permanent solution, you may want to have a look at cloud platforms—more specifically, a Platform as a Service (PaaS)—such as Heroku. I would strongly recommend you thoroughly read FastAPI's Deployment documentation. Have a closer look at About HTTPS and Deployments Concepts.
Important to note
By exposing your API to the outside world, you are also exposing it to various forms of attack. Before exposing your API to the public—even if it’s for free—you need to make sure you are offering secure access (use HTTPS), as well as authentication (verify the identity of a user) and authorisation (verify their access rights; in other words, verify what specific routes, files and data a user has access to)—take a look at 1. OAuth2 and JWT tokens, 2. OAuth2 scopes, 3. Role-Based Access Control (RBAC), 4. Get Current User and How to Implement Role based Access Control With FastAPI.
Additionally, if you are exposing your API to be used publicly, you may want to limit the usage of the API because of expensive computation, limited resources, DDoS attacks, brute-force attacks, web scraping, or simply the monthly cost for a fixed amount of requests. You can do that at the application level using, for instance, slowapi (related post here), or at the platform level by setting the rate limit through your hosting service (if permitted).

Furthermore, you would need to make sure that the files uploaded by users have a permitted file extension, e.g., .mp4, and are not files with, for instance, a .exe extension that are potentially harmful to your system.

Finally, you would also need to ensure that the uploaded files do not exceed a predefined MAX_FILE_SIZE limit (based on your needs and system resources), so that authenticated users, or an attacker, cannot upload extremely large files that consume so many server resources that the application ends up crashing. You shouldn't rely, though, on the Content-Length header being present in the request to do that, as this might be easily altered, or even removed, by the client. You should rather use an approach similar to this answer (have a look at the "Update" section) that uses request.stream() to process the incoming data in chunks as they arrive, instead of loading the entire file into memory first. By using a simple counter, e.g., total_len += len(chunk), you can check whether the file size has exceeded MAX_FILE_SIZE, and if so, raise an HTTPException with the HTTP_413_REQUEST_ENTITY_TOO_LARGE status code (see this answer as well, for more details and code examples).
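A minimal sketch of that size check, assuming an arbitrary MAX_FILE_SIZE and writing the validated data to a temporary file (the linked answers show more complete versions):

from tempfile import NamedTemporaryFile

from fastapi import FastAPI, HTTPException, Request
from starlette.status import HTTP_413_REQUEST_ENTITY_TOO_LARGE

app = FastAPI()
MAX_FILE_SIZE = 100 * 1024 * 1024  # 100 MB; set this based on your needs and resources


@app.post("/upload")
async def upload(request: Request):
    total_len = 0
    with NamedTemporaryFile(delete=False) as temp:
        # Process the incoming data in chunks as they arrive,
        # instead of loading the entire file into memory first
        async for chunk in request.stream():
            total_len += len(chunk)
            if total_len > MAX_FILE_SIZE:
                raise HTTPException(
                    status_code=HTTP_413_REQUEST_ENTITY_TOO_LARGE,
                    detail="File too large",
                )
            temp.write(chunk)
    return {"received_bytes": total_len}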
Read more on FastAPI's Security documentation and API Security on Cloudflare.
Related
This question already has an answer here:
How to Upload a large File (≥3GB) to FastAPI backend?
I would like to set a file upload limit with FastAPI, but I have a problem: I can't check the size of the file before the person uploads the whole file. For example, if there is a 15 MB upload limit and the person uploads more than 15 MB, I want to prevent them from uploading it to the server. I don't want to use Content-Length to prevent this, because it won't stop an attack. I have found different solutions, but I haven't found a way to check the file before it is uploaded to the system. As a result, if I can't prevent this and someone tries to upload a 100 GB file to the system and I don't have that much space on my machine, what will happen? Thank you in advance
https://github.com/tiangolo/fastapi/issues/362
I've read and tried what's written on this subject, and I also tried with ChatGPT, but I couldn't find anything.
You describe a problem faced by all web servers. As per dmontagu's response to the FastAPI issue, the general solution is to let a mature web server or load balancer enforce the limit for you. Example: Apache LimitRequestBody. These products form one of several lines of defense on a hostile Internet, so their implementations will hopefully be more resilient than anything you can write.
The client is completely untrustworthy, given the peer-to-peer way the Internet is built. There is no inherent identification/trust provision in the HTTP protocol (or the Internet's structure), so this behaviour must be built into our applications. In order to protect your web API from maliciously sized uploads, you would need to provide an authorised client program that can check the source data before transmission, plus an authorisation process for that special app to connect to the API so the authorised client cannot be bypassed. Such client-side code is vulnerable to reverse-engineering, and many users would object to installing your software for the sake of an upload!
It is more pragmatic to build our web services with inherent distrust of clients and block malicious requests on sight. The linked Apache directive above will prevent the full 100 GB being received, and similar options exist for nginx and other web servers. Other techniques include IP bans for excessive users, authentication to allow you to audit users individually, or some other profiling of requests.
If you must DIY in Python, then Tiangolo's own solution is the right approach. Either you spool to a file to limit the memory impact, as he proposes, or you run an in-memory accumulator on the request body and abort when you hit the threshold. The Starlette documentation explains how to stream a request body. Something like the following Starlette-themed suggestion:
body = b''
async for chunk in request.stream():
    body += chunk
    if len(body) > 10000000:
        return Response(status_code=status.HTTP_413_REQUEST_ENTITY_TOO_LARGE)
...
In going down this road, you've drained the request body and will need to send it directly to disk, or repackage it for fastapi. fastapi itself exists "above" the untrusted user problem and offers no solutions.
Your request doesn't reach the ASGI app directly. It goes through a reverse proxy (Nginx, Apache) and an ASGI server (Uvicorn, Hypercorn, Gunicorn) before it is handled by the ASGI app.
Reverse Proxy
For Nginx, the body size is controlled by client_max_body_size, which defaults to 1MB.
For Apache, the body size could be controlled by LimitRequestBody, which defaults to 0.
ASGI Server
The ASGI servers don't impose a limit on the body size; at least that's the case for Gunicorn, Uvicorn, and Hypercorn.
Large request body attack
This attack aims to exhaust the server's memory by inviting it to receive a large request body (and hence write the body to memory). A poorly configured server would have no limit on the request body size and could potentially allow a single request to exhaust the server.
FastAPI solution
You could require the Content-Length header, check it, and make sure that it's a valid value. E.g.
from fastapi import FastAPI, File, Header, Depends, UploadFile


async def valid_content_length(content_length: int = Header(..., lt=50_000_000)):
    return content_length


app = FastAPI()


@app.post('/upload', dependencies=[Depends(valid_content_length)])
async def upload_file(file: UploadFile = File(...)):
    # do something with file
    return {"ok": True}
Note: ⚠️ this probably won't prevent an attacker from sending a valid Content-Length header and a body bigger than what your app can take ⚠️
Another option would be to read the data in chunks on top of checking the header, and raise an error once the received size exceeds the limit.
from typing import IO
from tempfile import NamedTemporaryFile
import shutil

from fastapi import FastAPI, File, Header, Depends, UploadFile, HTTPException
from starlette import status


async def valid_content_length(content_length: int = Header(..., lt=80_000)):
    return content_length


app = FastAPI()


@app.post("/upload")
def upload_file(
    file: UploadFile = File(...), file_size: int = Depends(valid_content_length)
):
    real_file_size = 0
    temp: IO = NamedTemporaryFile(delete=False)
    for chunk in file.file:
        real_file_size += len(chunk)
        if real_file_size > file_size:
            raise HTTPException(
                status_code=status.HTTP_413_REQUEST_ENTITY_TOO_LARGE, detail="Too large"
            )
        temp.write(chunk)
    temp.close()
    shutil.move(temp.name, "/tmp/some_final_destiny_file")
    return {"ok": True}
Reference:
https://pgjones.gitlab.io/quart/discussion/dos_mitigations.html#large-request-body
https://github.com/tiangolo/fastapi/issues/362#issuecomment-584104025
tldr: Is there a file size limit to send a file from Cloud Storage to my user's web browser as a download? Am I using the Storage Python API wrong, or do I need to increase the resources set by my App Engine YAML file?
This is just an issue with downloads. Uploads work great up to any file size, using chunking.
The symptoms
I created a file transfer application with App Engine Python 3.7 Standard environment. Users are able to upload files of any size, and that is working well. But users are running into what appears to be a size limit on downloading the resulting file from Cloud Storage.
The largest file I have successfully sent and received through my entire upload/download process was 29 megabytes. Then I sent myself a 55 megabyte file, but when I attempted to receive it as a download, Flask gave me this error:
Error: Server Error
The server encountered an error and could not complete your request.
Please try again in 30 seconds.
Application Structure
To create my file transfer application, I used Flask to set up two Services, internal and external, each with its own Flask routing file, its own webpage/domain, and its own YAML files.
In order to test the application, I visit the internal webpage which I created. I use it to upload a file in chunks to my application, which successfully composes the chunks in Cloud Storage. I then log into Google Cloud Platform Console as an admin, and when I look at Cloud Storage, it will show me the 55 megabyte file which I uploaded. It will allow me to download it directly through the Cloud Platform Console, and the file is good.
(Up to that point of the process, this even worked for a 1.5 gigabyte file.)
Then I go to my external webpage as a non-admin user. I use the form to attempt to receive that same file as a download. I get the above error. However, this whole process encounters no errors for my 29 megabyte test file, or smaller.
Stackdriver logs for this service reveal:
logMessage: "The process handling this request unexpectedly died. This is likely to cause a new process to be used for the next request to your application. (Error code 203)"
Possible solutions
I added the following lines to my external service YAML file:
resources:
  memory_gb: 100
  disk_size_gb: 100
The error remained the same. Apparently this is not a limit of system resources?
Perhaps I'm misusing the Python API for Cloud Storage. I'm importing storage from google.cloud. Here is where my application responds to the user's POST request by sending the user the file they requested:
@app.route('/download', methods=['POST'])
def provide_file():
    return external_download()
This part is in external_download:
storage_client = storage.Client()
bucket = storage_client.get_bucket(current_app.cloud_storage_bucket)
bucket_filename = request.form['filename']
blob = bucket.blob(bucket_filename)
return send_file(io.BytesIO(blob.download_as_string()),
                 mimetype="application/octet-stream",
                 as_attachment=True,
                 attachment_filename=bucket_filename)
Do I need to implement chunking for downloads, not just uploads?
I wouldn't recommend using Flask's send_file() method for managing large file transfers; Flask's file handling methods were intended for use by developers or APIs, mostly to exchange system messages like logs, cookies and other light objects.
Besides, the download_as_string() method might indeed conceal a buffer limitation. I reproduced your scenario and got the same error message with files larger than 30 MB, but I couldn't find more information about such a constraint. It might be intentional, induced by the method's purpose (download content as a string, not suited for large objects).
Proven ways to handle file transfer efficiently with Cloud Storage and Python:
Use methods of the Cloud Storage API directly, to download and upload objects without using Flask. As @FridayPush mentioned, it will offload your application, and you can control access with signed URLs (see the sketch after this list).
Use the Blobstore API, a straightforward, lightweight, and simple solution to handle file transfer, fully integrated with GCS buckets and intended for this type of situation.
Use the Python Requests module, which requires creating your own handlers to communicate with GCS.
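For the first option, here is a minimal sketch of generating a short-lived signed download URL with the Cloud Storage Python client (bucket and object names are placeholders; v4 signing requires service-account credentials with a private key):

from datetime import timedelta
from google.cloud import storage


def make_download_url(bucket_name: str, blob_name: str) -> str:
    # The browser downloads straight from Cloud Storage, so the App Engine
    # instance never has to buffer the file in memory.
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    return blob.generate_signed_url(
        version="v4",
        expiration=timedelta(minutes=15),
        method="GET",
    )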
I'm using django to develop a website. On the server side, I need to transfer some data that must be processed on the second server (on a different machine). I then need a way to retrieve the processed data. I figured that the simplest would be to send back to the Django server a POST request, that would then be handled on a view dedicated for that job.
But I would like to add some minimum security to this process: When I transfer the data to the other machine, I want to join a randomly generated token to it. When I get the processed data back, I expect to also get back the same token, otherwise the request is ignored.
My problem is the following: How do I store the generated token on the Django server?
I could use a global variable, but I had the impression, browsing here and there on the web, that global variables should not be used for safety reasons (not that I really understand why).
I could store the token on disk/database, but it seems to be an unjustified waste of performance (even if in practice it would probably not change much).
Is there a third solution, or a canonical way to do such a thing using Django?
You can store your token in the Django cache; it will be faster than database or disk storage in most cases.
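A minimal sketch of the cache approach, using Django's low-level cache API (the key name and timeout are arbitrary choices):

import secrets

from django.core.cache import cache


def issue_token() -> str:
    # Generate a random token and keep it in the cache for 10 minutes
    token = secrets.token_urlsafe(32)
    cache.set("processing_token", token, timeout=600)
    return token


def is_valid_token(received: str) -> bool:
    expected = cache.get("processing_token")
    return expected is not None and secrets.compare_digest(expected, received)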
Another approach is to use redis.
You can also calculate your token:
save some random token in the settings of both servers
calculate the token based on the current timestamp rounded to 10 seconds, for example using:
import hashlib

token = hashlib.sha1(secret_token.encode())
token.update(str(rounded_timestamp).encode())
token = token.hexdigest()
If the token generated on the remote server when POSTing the request matches the token generated on the local server when receiving the response, the request is valid and can be processed.
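Putting it together, a minimal sketch with the rounding step made explicit (SECRET_TOKEN stands in for the shared value kept in both servers' settings):

import hashlib
import time

SECRET_TOKEN = "some-random-value-shared-by-both-servers"


def current_token(window_seconds: int = 10) -> str:
    # Round the timestamp down to the nearest window so that both servers
    # compute the same value within that window
    rounded_timestamp = int(time.time()) // window_seconds * window_seconds
    digest = hashlib.sha1(SECRET_TOKEN.encode())
    digest.update(str(rounded_timestamp).encode())
    return digest.hexdigest()

In practice you would probably also accept the previous window's token, since a response can arrive just after the 10-second boundary.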
The simple obvious solution would be to store the token in your database. Other possible solutions are Redis or something similar. Finally, you can have a look at distributed async task queues like Celery...
I am new to Flask.
I have a public api, call it api.example.com.
@app.route('/api')
def api():
    name = request.args.get('name')
    ...
    return jsonify({'address':'100 Main'})
I am building an app on top of my public api (call it www.coolapp.com), so in another app I have:
@app.route('/make_request')
def index():
    params = {'name':'Fred'}
    r = requests.get('http://api.example.com', params=params)
    return render_template('really_cool.jinja2', address=r.text)
Both api.example.com and www.coolapp.com are hosted on the same server. It seems inefficient the way I have it (hitting the http server when I could access the api directly). Is there a more efficient way for coolapp to access the api and still be able to pass in the params that api needs?
Ultimately, with an API-powered system, it's best to hit the API because:
It's user-level testing of the API (even though you're the user, it's what others will still access);
You can then scale easily - put a pool of API boxes behind a load balancer if you get big.
However, if you're developing on the same box you could make a virtual server that listens on localhost on a random port (1982) and then forwards all traffic to your api code.
To make this easier I'd abstract the API_URL into a setting in your settings.py (or whatever you are loading into Flask) and use:
r = requests.get(app.config['API_URL'], params=params)
This will allow you to make a single change if you find using this localhost method isn't for you or you have to move off one box.
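A minimal sketch of that abstraction (the config module name and port are placeholders):

# config.py (hypothetical settings module)
#   API_URL = "http://localhost:1982/api"

from flask import Flask, render_template
import requests

app = Flask(__name__)
app.config.from_object("config")  # loads API_URL into app.config


@app.route('/make_request')
def index():
    params = {'name': 'Fred'}
    r = requests.get(app.config['API_URL'], params=params)
    return render_template('really_cool.jinja2', address=r.text)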
Edit
Looking at your comments, you are hoping to hit the Python function directly. I don't recommend doing this (for the reasons above: using the API itself is better). I can also see issues if you did want to do this.
First of all we have to make sure the api package is in your PYTHONPATH. Easy to do, especially if you're using virtualenvs.
We would from api import views and replace our code with r = views.api() so that it calls our api() function.
Our api() function will fail for a couple of reasons:
It uses flask.request to extract the GET arg 'name'. Because we haven't made a request through the Flask WSGI stack, we will not have a request to use.
Even if we did manage to pass the request from the front end through to the API, the second problem is the jsonify({'address':'100 Main'}) call. This returns a Response object with an application/json content type (not just the JSON itself).
You would have to completely rewrite your function to take into account the Response object and handle it correctly. A real pain if you do decide to go back to an API system again...
Depending on how you structure your code, your database access, and your functions, you can simply turn the other app into a package, import the relevant modules, and call the functions directly.
You can find more information on modules and packages here.
Please note that, as Ewan mentioned, there are some advantages to using the API. I would advise you to use requests until you actually need faster requests (this is probably premature optimization).
Another idea that might be worth considering, depending on your particular code, is creating a library that is used by both applications.
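As a rough illustration of that shared-library idea (module and function names are hypothetical), the lookup logic lives in one place and both applications import it:

# addresses.py: a shared module imported by both applications
def lookup_address(name: str) -> dict:
    # The real lookup logic (database query, etc.) lives here, in one place
    return {'address': '100 Main'}

The API view then becomes a thin wrapper, return jsonify(lookup_address(request.args.get('name'))), while coolapp's view calls lookup_address('Fred') directly and skips HTTP entirely.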
I have an example which I'm trying to create which, preferably using Django (or some other comparable framework), will immediately compress uploaded contents chunk-by-chunk into a strange compression format (be it LZMA, 7zip, etc.) which is then written out to another upload request to S3.
Essentially, this is what will happen:
1. A user initiates a multipart upload to my endpoint at ^/upload/?$.
2. As chunks are received on the server (could be 1024 bytes or some other number), they are then passed through a compression algorithm in chunks.
3. The compressed output is written out over the wire to an S3 bucket.
Step 3 is optional; I could store the file locally and have a message queue do the uploads in a deferred way.
Is step 2 possible using a framework like Django? Is there a low-level way of accessing the incoming data in a file-like object?
The Django Request object provides a file-like interface, so you can stream data from it. But since Django always reads the whole request into memory (or into a temporary file if the upload is too large), you can only use this API after the whole request has been received. If your temporary storage directory is big enough and you do not mind buffering the data on your server, you do not need to do anything special: just upload the data to S3 inside the view. Be careful with timeouts, though; if the upload to S3 takes too long, the browser will receive a timeout. Therefore I would recommend moving the temporary files to a more permanent directory and initiating the upload via a worker queue like Celery.
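To illustrate the buffered approach, here is a minimal sketch of step 2 using the standard-library lzma module inside a Django view (the form field name and output path are placeholders):

import lzma

from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt


@csrf_exempt
def upload(request):
    # request.FILES is only usable once Django has buffered the whole request
    # (in memory, or in a temporary file for large uploads), as noted above.
    uploaded = request.FILES['file']
    compressor = lzma.LZMACompressor()
    with open('/tmp/upload.xz', 'wb') as out:
        for chunk in uploaded.chunks():   # defaults to 64 KB chunks
            out.write(compressor.compress(chunk))
        out.write(compressor.flush())     # emit any remaining buffered data
    return JsonResponse({'ok': True})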
If you want to stream directly from the client into Amazon S3 via your server I recommend using gevent. Using gevent you could write a simple greenlet that reads from a queue and writes to S3. This queue is filled by the original greenlet which reads from the request.
You could use a special upload URL like http://upload.example.com/ where you deploy that special server. The Django functions can be used from outside the Django framework if you set the DJANGO_SETTINGS_MODULE environment variable and take care of some things that the middlewares normally do for you (db connect/disconnect, transaction begin/commit/rollback, session handling, etc.).
It is even possible to run your custom WSGI app and Django together in the same WSGI container. Just wrap the Django WSGI app and intercept requests to /upload/. In this case I would recommend using gunicorn with the gevent worker-class as server.
I am not too familiar with the Amazon S3 API, but as far as I know you can also generate a temporary token for file uploads directly from your users. That way you would not need to tunnel the data through your server at all.
Edit: You can indeed allow anonymous uploads to your buckets. See this question which talks about this topic: S3 - Anonymous Upload - Key prefix
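For that direct-upload route, a minimal sketch with boto3's presigned POST, which hands the browser a short-lived URL and form fields so the data never passes through your server (bucket and key names are placeholders):

import boto3


def make_upload_form(bucket: str, key: str) -> dict:
    # Returns {'url': ..., 'fields': ...} that the browser can POST the file to directly
    s3 = boto3.client('s3')
    return s3.generate_presigned_post(
        Bucket=bucket,
        Key=key,
        ExpiresIn=3600,  # the presigned form stays valid for one hour
    )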