Best way to upload large csv files using python flask

Requirement: upload files using the Flask framework. Once a file is uploaded to the server, the user should be able to see it in the UI.
Current code: To meet the above requirement I wrote code to upload reasonably large files, and it works fine with a ~30 MB file (though admittedly not that fast). But when I try to upload a ~100 MB file, it takes too long and the process never completes.
This is what currently i am doing:
UPLOAD_FOLDER = '/tmp'

file = request.files['filename']
description = request.form['desc']
filename = secure_filename(file.filename)
filepath = os.path.join(UPLOAD_FOLDER, filename)
try:
    file.save(filepath)
except Exception as e:
    return e

data = None
try:
    # use a name other than `file` so the upload object is not shadowed
    with open(filepath) as f:
        data = f.read()
except Exception as e:
    log.exception(e)
So what I am doing is first saving the file to a temporary location on the server, then reading the data from there and putting it into our database. I think this is where I am struggling; I am not sure what the best approach is.
Should I take the input from the user, return a success message (obviously the user won't be able to access the file immediately then), and make putting the data into the database a background process using some kind of queue system? Or what else should be done to optimize the code?
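A minimal sketch of that queue idea, assuming a single in-process worker thread (`insert_into_db` is a hypothetical stand-in for the real CSV-to-database import; a production setup would more likely use a task queue such as Celery or RQ):

```python
import queue
import threading

imported = []  # stands in for the database, just to make the effect visible

def insert_into_db(filepath):
    # Hypothetical import step: parse the CSV and write rows to the DB.
    imported.append(filepath)

jobs = queue.Queue()

def worker():
    # Background thread: pulls saved file paths off the queue and
    # imports them one at a time, independent of the request cycle.
    while True:
        filepath = jobs.get()
        try:
            insert_into_db(filepath)
        finally:
            jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

def handle_upload(filepath):
    # Called from the Flask view after file.save(): enqueue the import
    # and return immediately so the request does not block on the DB.
    jobs.put(filepath)
    return "File received; import will run in the background."
```

The trade-off is exactly the one raised above: the user gets a fast success response, but the file is not queryable until the worker finishes.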

On the flask side make sure you have the MAX_CONTENT_LENGTH config value set high enough:
app.config['MAX_CONTENT_LENGTH'] = 100 * 1024 * 1024 # 100MB limit
Also you may want to look into the Flask-Uploads extension.
There is another SO post similar to this one: Large file upload in Flask.
Other than that, your problem may be a timeout somewhere along the line. What does the rest of your stack look like? Apache? Nginx and Gunicorn? Are you getting a Connection reset error or a Connection timed out error, or does it just hang?
If you are using Nginx, try setting proxy_read_timeout to a value high enough for the upload to finish. Apache may also have a default setting causing you trouble, if that is what you are using. It's hard to tell without knowing more about your stack, what error you are getting, and what the logs show.
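For instance, the relevant Nginx directives look roughly like this (the directive names are real Nginx directives; the values are only illustrative and should match your expected upload size and duration):

```nginx
# allow request bodies up to ~100 MB and give slow uploads time to finish
client_max_body_size 100m;
proxy_read_timeout   300s;
proxy_send_timeout   300s;
```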

Related

Uploading mp4 to Heroku -- file is not saved

I'm trying to upload a .mp4 video to be processed by a Python server on Heroku. In my server.py I have the following code:
# Initialize the Flask application
MYDIR = os.path.dirname(__file__)
app = Flask(__name__)
app.config['UPLOAD_PATH'] = 'uploads'
cors = CORS(app, resources={r"/api/*": {"origins": "*"}})

# route http posts to this method
@app.route('/api/uploadFile', methods=['POST'])
def uploadFile():
    r = request  # this is the incoming request
    uploaded_file = r.files['uploadedfile']
    # save the incoming video file locally
    if uploaded_file.filename != '':
        content = r.files['uploadedfile'].read()
        uploaded_file.save(os.path.join(MYDIR, uploaded_file.filename))
However, when I run heroku run bash and do ls, I don't see the file at all. I've looked at previous posts on this and am doing what many have suggested, i.e., using an absolute path os.path.join(MYDIR, uploaded_file.filename). But I still don't see the file. I know that Heroku has an ephemeral filesystem and doesn't persist files, but all I want to do is process the video file, and after it's done I'll delete it myself. By the way, the video is approximately 25 MB in size. I tried a low-quality version of around 4 MB and the problem persists, so my guess is that it is not the size of the video that's the problem.
Also, when I read the content via content = r.files['uploadedfile'].read(), I do see that content is approximately of length 25M suggesting that I'm indeed receiving the video correctly.
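One thing worth checking here: in Werkzeug (which Flask uses for uploads), `read()` consumes the underlying stream, so a subsequent `save()` on the same FileStorage writes from the current stream position, i.e. nothing. The behaviour can be seen with a plain `BytesIO` standing in for the upload stream:

```python
import io

stream = io.BytesIO(b"fake video bytes")  # stands in for uploaded_file.stream

content = stream.read()   # consumes the stream, as r.files[...].read() does
second = stream.read()    # a second read now returns nothing

stream.seek(0)            # rewind so a later save() sees the data again
```

So if `content = r.files['uploadedfile'].read()` runs before `uploaded_file.save(...)`, calling `uploaded_file.stream.seek(0)` in between (or dropping the `read()` entirely) is worth trying, independent of the Heroku filesystem question.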
Ideally, I would love to convert the byte stream into something like a list of frames and then process it with OpenCV. But I thought that, as a first step, I could just save it locally, then read and process it. I don't care if it is deleted five minutes after my processing is done.
Any help would be greatly appreciated. Banged my head around for a day looking for a solution. Thanks!!!
Just want to add that the commands above give me no error, so it seems like everything worked; but then when I do heroku run bash, I see no file anywhere. It would've been great if the system had given an error or a warning :-(.
Also, I tried running things locally on a localhost server, and it does store the file. So it works on my machine; I just can't seem to save it on Heroku.

Why is Docker able to save files to the source-directory?

I don't really know if this even classifies as a problem, but it somehow surprised me, since I thought that Docker containers run in sandboxed environments.
I have a running Docker container that includes a django-rest server.
When making the appropriate requests from the frontend (which is also served through a separate NGINX container in the same Docker network), the request successfully triggers the response of the Django server, which includes writing the contents of the request to a text file.
Now, the path to the text file points to a subdirectory of the folder where the script itself is located.
But now the problem: the file is saved in the directory on the host machine itself and (to my knowledge) not in the container.
Is this something Docker is capable of doing, or am I just not aware that Django itself (which is running in the WSGI development server, mind you) writes the file to the host machine?
now = datetime.now()
timestamp = datetime.timestamp(now)
polygon = json.loads(data['polygon'])
string = ''
for key in polygon:
    string += f'{key}={polygon[key]}\r\n'
with open(f"./path/to/images/image_{timestamp}.txt", "w+") as f:
    f.write(string)
return 200
Is there something I am missing or something I do not understand about docker or django yet?
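The usual cause is a bind mount rather than a sandbox escape: if the compose file (or a `docker run -v` flag) maps a host directory onto the directory the script writes into, the write lands in both places because they are the same directory. A typical compose fragment (service and path names here are only illustrative):

```yaml
services:
  backend:
    build: .
    volumes:
      # the host's ./backend is mounted over /app in the container, so a
      # file written under /app by the django server appears on the host
      - ./backend:/app
```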

Preferred way of using Flask and S3 for large files

I know this is a little open-ended, but I am confused as to what strategy/method to apply for a large-file upload service developed using Flask and boto3. For smaller files it is all fine, but it would be really nice to hear what you think when the size exceeds 100 MB.
What I have in mind are following -
a) Stream the file to the Flask app using some kind of AJAX uploader. (What I am trying to build is just a REST interface using Flask-Restful. Any examples of using these components, e.g. Flask-Restful, boto3, and streaming large files, are welcome.) The upload app is going to be (I believe) part of a microservices platform that we are building. I do not know whether there will be an Nginx proxy in front of the Flask app or whether it will be served directly from a Kubernetes pod/service. In case it is served directly, is there something I have to change for large file uploads at the Kubernetes and/or Flask layer?
b) Use a direct JS uploader (like http://www.plupload.com/) that streams the file into the S3 bucket directly and, when finished, gets the URL and passes it to the Flask API app, which stores it in the DB. The problem with this is that the credentials need to be somewhere in the JS, which is a security threat. (Not sure if there are any other concerns.)
Which of these (or something different I did not think of at all) do you think is the best way, and where can I find some code examples for it?
Thanks in advance.
[EDIT]
I have found this - http://blog.pelicandd.com/article/80/streaming-input-and-output-in-flask - where the author is dealing with a situation similar to mine, and he proposes a solution. But he is opening a file already present on disk. What if I want to directly upload the file, as it comes in, as one single object in an S3 bucket? I feel this can be the base of a solution, but not the solution itself.
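One way to do exactly that, sketched under the assumption that boto3 is the client library: `upload_fileobj` accepts any object with a `read()` method (such as Flask's `request.stream`) and performs a multipart upload internally, so the whole body is never held in memory at once. The helper and all names below are illustrative:

```python
def stream_to_s3(s3_client, fileobj, bucket, key):
    # In a Flask view this would be called as, e.g.:
    #   stream_to_s3(boto3.client('s3'), request.stream, 'my-bucket', filename)
    # upload_fileobj reads fileobj in chunks and multiparts large uploads.
    s3_client.upload_fileobj(fileobj, bucket, key)
```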
Alternatively you can use the Minio-py client library; it's open source and compatible with the S3 API. It handles multipart upload for you natively.
A simple put_object.py example:
import os
from minio import Minio
from minio.error import ResponseError

client = Minio('s3.amazonaws.com',
               access_key='YOUR-ACCESSKEYID',
               secret_key='YOUR-SECRETACCESSKEY')

# Put a file with default content-type.
try:
    file_stat = os.stat('my-testfile')
    file_data = open('my-testfile', 'rb')
    client.put_object('my-bucketname', 'my-objectname', file_data, file_stat.st_size)
except ResponseError as err:
    print(err)

# Put a file with 'application/csv' content-type.
try:
    file_stat = os.stat('my-testfile.csv')
    file_data = open('my-testfile.csv', 'rb')
    client.put_object('my-bucketname', 'my-objectname', file_data,
                      file_stat.st_size, content_type='application/csv')
except ResponseError as err:
    print(err)
You can find a complete list of API operations with examples here.
Installing the Minio-py library:
$ pip install minio
Hope it helps.
Disclaimer: I work for Minio
As far as I know, Flask can only use memory to hold the whole HTTP request body; there is no built-in disk buffering.
The Nginx upload module is a really good way to do large file uploads; the documentation is here.
You can also use HTML5 or Flash to send chunked file data and process the data in Flask, but it's complicated.
Try to look up whether S3 offers a one-time token (a presigned URL) for direct uploads.
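S3 does offer this: a presigned URL carries a short-lived signature, so the browser can PUT the object directly without any long-lived credentials in the JS. A boto3 sketch (`generate_presigned_url` is the real boto3 call; bucket and key names are placeholders):

```python
def make_upload_url(s3_client, bucket, key, expires=3600):
    # The returned URL embeds an expiring signature; the frontend simply
    # PUTs the file body to it, and no secret key reaches the browser.
    return s3_client.generate_presigned_url(
        'put_object',
        Params={'Bucket': bucket, 'Key': key},
        ExpiresIn=expires,
    )
```

The Flask API can hand this URL to the JS uploader and later record the object's location in the DB once the client reports success.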
Using the link I posted above, I finally ended up doing the following. Please tell me if you think it is a good solution.
import boto3
from flask import Flask, request

...

@app.route('/upload', methods=['POST'])
def upload():
    s3 = boto3.resource('s3', aws_access_key_id="key", aws_secret_access_key='secret', region_name='us-east-1')
    s3.Object('bucket-name', 'filename').put(Body=request.stream.read(CHUNK_SIZE))

...

django python cumulus - How to deal with uploading a large number of files to cloud file storage

I have a number of files processed and saved in a temp folder on my server, and I now want to move them into my default_storage location (default_storage is set to Rackspace Cloud Files using django-cumulus).
The process begins uploading the files correctly but only manages less than half of them before stopping. My guess is it's a memory issue, but I am not sure how to go about solving it. Here is the relevant code:
listing = os.listdir(path + '/images')
listing.sort()
for infile in listing:
    # open in binary mode, since these are image files
    image = open(path + '/images/' + infile, 'rb')
    image_loc = default_storage.save(infile, ContentFile(image.read()))
    image.close()
Just in case it makes a difference, my server setup is Rackspace Cloud with nginx and gunicorn on Ubuntu.
You could give django-storages a try. It's a custom storage backend that is easy to integrate and also supports Rackspace.
In the end the answer came in several parts. First, I had to add a TIMEOUT setting to cumulus (which is not mentioned in the django-cumulus documentation). Second, I increased the timeout for gunicorn. Finally, I increased the timeout parameter of nginx.
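For reference, the settings side of that fix can be sketched like this (the CUMULUS key names besides TIMEOUT are assumptions about django-cumulus's settings dict; all values are illustrative):

```python
# settings.py - django-cumulus settings dict (key names assumed)
CUMULUS = {
    'USERNAME': 'my-rackspace-user',
    'API_KEY': 'my-api-key',
    'CONTAINER': 'my-container',
    'TIMEOUT': 300,  # seconds; the setting missing from the docs
}
```

The gunicorn side is `gunicorn --timeout 300 ...` on the command line, and the nginx side is `proxy_read_timeout 300s;` in the relevant server block.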

Image upload: iPhone Client - Django - S3

I have a general question regarding uploads from a client (in this case an iPhone app) to S3. I'm using Django to write my web service on an EC2 instance. The following method is the bare minimum to upload a file to S3, and it works very well with smaller files (JPGs or PNGs < 1 MB):
def store_in_s3(filename, content):
    conn = S3Connection(settings.ACCESS_KEY, settings.PASS_KEY)  # gets access key and pass key from settings.py
    bucket = conn.create_bucket('somebucket')
    k = Key(bucket)  # create a key in this bucket
    k.key = filename
    mime = mimetypes.guess_type(filename)[0]
    k.set_metadata('Content-Type', mime)
    k.set_contents_from_string(content)
    k.set_acl('public-read')

def uploadimage(request, name):
    if request.method == 'PUT':
        store_in_s3(name, request.raw_post_data)
        return HttpResponse("Uploading raw data to S3 ...")
    else:
        return HttpResponse("Upload not successful!")
I'm quite new to all of this, so I still don't understand what happens here. Is it the case that:
Django receives the file and saves it in the memory of my EC2 instance?
Should I avoid using raw_post_data and rather do chunks to avoid memory issues?
Once Django has received the file, will it pass the file to the function store_in_s3?
How do I know if the upload to S3 was successful?
So in general, I wonder whether one line of my code is executed after another. E.g., when will return HttpResponse("Uploading raw data to S3 ...") fire? While the file is still uploading, or after it has been successfully uploaded?
Thanks for your help. Also, I'd be grateful for any docs that treat this subject. I had a look at the chapters in O'Reilly's Python & AWS Cookbook, but as it's only code samples, it doesn't really answer my questions.
Django stores small uploaded files in memory. Over a certain size, it will store them in a temporary file on disk.
Yes, chunking is helpful for memory savings as well:
for file_chunk in uploaded_file.chunks():
    saved_file.write(file_chunk)
All of these operations are synchronous, so Django will wait until the file is fully uploaded before it attempts to store it in S3. S3 will have to complete its upload before it returns as well, so you are guaranteed that the file has been uploaded through Django and to S3 before you receive your HttpResponse().
Django's File Uploads documentation has a lot of great info on how to handle various upload situations.
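The chunked pattern from those docs can be sketched as a small helper (generic apart from the `chunks()` method, which Django's UploadedFile provides, yielding 64 KB pieces by default):

```python
def handle_uploaded_file(uploaded_file, dest_path):
    # Iterate over the upload in chunks instead of calling read(),
    # so a large file never sits in memory as one big string.
    with open(dest_path, 'wb') as dest:
        for chunk in uploaded_file.chunks():
            dest.write(chunk)
```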
You might want to look at django-storages. It facilitates storing files on S3 and a bunch of other services/platforms.

Categories

Resources