I am handling uploads on a site, using Flask as the back-end. The files are all sent in one POST request from the client to the server, and I'm handling them individually by using the getlist() method of request.files and iterating through with a for loop:
if request.method == 'POST':
    files = request.files.getlist('f[]')
The problem is I want to limit the size of EACH file uploaded to 50 MB, but I'm assuming MAX_CONTENT_LENGTH limits the size of the entire request. Is there a way I can evaluate the size of each individual file in the request object and reject that file if it is too large? The user can upload a set number of files, but each one of them needs to be under 50 MB.
There are two pieces of information you can use here:
Sometimes the Content-Length header is set; parsing ensures that this is accurate for the actual data uploaded. If so, you can get this value from the FileStorage.content_length attribute.
The files uploaded are file objects (either temporary files on disk or in-memory file-like objects); just use file.seek() and file.tell() on these to determine their size without having to read the whole object. It may be that an in-memory file object doesn't support seeking, at which point you should be able to read the whole file into memory as it'll be small enough not to need a temporary on-disk file.
Combined, the best way to test for individual file sizes then is:
def get_size(fobj):
    if fobj.content_length:
        return fobj.content_length

    try:
        pos = fobj.tell()
        fobj.seek(0, 2)  # seek to end
        size = fobj.tell()
        fobj.seek(pos)  # back to original position
        return size
    except (AttributeError, IOError):
        pass

    # in-memory file object that doesn't support seeking or tell
    return 0  # assume it's small enough
Then, in your loop:

for fobj in request.files.getlist('f[]'):
    if get_size(fobj) > 50 * (1024 ** 2):
        abort(413)  # request entity too large
This avoids having to read data into memory altogether.
Set MAX_CONTENT_LENGTH to something reasonable for the total size of all files and then just check the file size for each file before processing.
if request.method == 'POST':
    files = request.files.getlist('f[]')
    for f in files:
        if len(f.read()) < (50 * 1024 * 1024):
            f.seek(0)  # rewind after read(), otherwise a later save() writes an empty file
            # do something
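For the first half of that advice, a minimal sketch of the total-size cap (the ten-file count is only an assumed example):

# Assumed example: at most 10 files of 50 MB each in a single request
app.config['MAX_CONTENT_LENGTH'] = 10 * 50 * 1024 * 1024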
The official snippet code for downloading a blob from Microsoft Docs is:
# Download the blob to a local file
# Add 'DOWNLOAD' before the .txt extension so you can see both files in the data directory
download_file_path = os.path.join(local_path, str.replace(local_file_name ,'.txt', 'DOWNLOAD.txt'))
blob_client = blob_service_client.get_container_client(container= container_name)
print("\nDownloading blob to \n\t" + download_file_path)
with open(download_file_path, "wb") as download_file:
download_file.write(blob_client.download_blob(blob.name).readall())
The problem is that readall reads the blob content into memory. Giant blobs (hundreds of gigabytes) cannot be held in memory.
I didn't find a way to download a blob directly to a file (one that can use a buffer internally, but does not hold the file's entire content). Is there any way to do so?
For large blobs, you would want to use the download_blob method of BlobClient. This method allows you to read a range of data as a stream.
The way this works is that you call the method multiple times, and each time you call it, you set the offset and length parameters to different values.
For example, say your blob is 10 MB and you want to download it in chunks of 1 MB. The first time you call the method, you set the offset to 0 and the length to 1048576 (1 MB), and you get a stream of 1 MB of data. The next time you call it, you set the offset to 1048576, and so on.
This way you will be able to progressively download a blob.
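A minimal sketch of that loop, assuming the v12 azure-storage-blob SDK; conn_str, container_name, blob_name and download_file_path are placeholder names:

from azure.storage.blob import BlobClient

# Hypothetical client setup for illustration
blob_client = BlobClient.from_connection_string(conn_str, container_name, blob_name)

chunk_size = 4 * 1024 * 1024  # 4 MB per call
blob_size = blob_client.get_blob_properties().size

with open(download_file_path, "wb") as download_file:
    offset = 0
    while offset < blob_size:
        length = min(chunk_size, blob_size - offset)
        # download_blob(offset=..., length=...) returns a stream of just that range
        stream = blob_client.download_blob(offset=offset, length=length)
        download_file.write(stream.readall())  # only `length` bytes held in memory at a time
        offset += length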
I'm writing an AWS Lambda in Python 3.6
I have a large number of large space-separated text files, and I need to loop through these files and pull out the first N (in this case 1000) lines of text. Once I have those lines I need to put them into a new file and upload that to S3.
I'm also not a python developer, so the language and environment is new to me.
Right now I'm collecting the S3 object summaries, and for each of those, I'm running a check on them and then getting the object's data, opening that as a file-like object and also opening the output variable as a file-like object, and then doing my processing.
I've given my Lambda 3 GB of RAM, but it is running out of memory before it can process any files (each file is about 800 MB and there are about 210 of them).
for item in object_summary:
    # Check if the object exists, and skip it if so
    try:
        head_object_response = s3Client.head_object(Bucket=target_bucket_name, Key=item)
        logger.info('%s: Key already exists.' % item)
    except:
        # if the key does not exist, we need to swallow the 404 that comes from boto3
        pass

    # and then do our logic to headify the files
    logger.info('Key does not exist in target, headifying: %s' % item)

    # If the file doesn't exist, get the full object
    s3_object = s3Client.get_object(Bucket=inputBucketName, Key=item)
    long_file = s3_object['Body']._raw_stream.data
    file_name = item

    logger.info('%s: Processing 1000 lines of input.' % file_name)

    '''
    Looks like the Lambda hits a memory limit on the line below.
    It crashes with 2500MB of memory used; the file it's trying
    to open at that stage is 800MB, which puts it over the
    max allocation of 3GB.
    '''
    try:
        with open(long_file, 'r') as input_file, open(file_name, 'w') as output_file:
            for i in range(1000):
                output_file.write(input_file.readline())
    except OSError as exception:
        if exception.errno == 36:
            logger.error('File name: %s' % exception.filename)
            logger.error(exception.__traceback__)
I put the whole function above for completeness, but I think the specific area I can improve is the try/with block that handles the file processing.
Have I got that right? Is there anywhere else I can improve it?
Think more simply.
I suggest just handling a single file per Lambda call; then you should be within your 3 GB easily. In any case, as the number of files to process grows, your Lambda function will eventually hit the maximum 15-minute execution limit, so it's better to think of Lambda processing in roughly consistently sized chunks.
If necessary you can introduce an intermediate chunker lambda function to chunk out the processing.
If your files are really only 800 MB, I would think your processing should be OK in terms of memory. The input file may still be streaming in; you may want to try deleting it (del s3_object['Body']?).
from io import StringIO
from botocore.exceptions import ClientError

def handle_file(key_name):
    # Check if the object already exists in the target bucket, and skip it if so
    try:
        s3Client.head_object(
            Bucket=target_bucket_name,
            Key=key_name
        )
        logger.info(f'{key_name} - Key already exists.')
        return None, 0
    except ClientError as e:
        logger.exception(e)
        logger.info(f'{key_name} - Does not exist.')

    # If the file doesn't exist, stream the object line by line
    s3_object = s3Client.get_object(Bucket=inputBucketName, Key=key_name)
    max_lines = 1000
    lines = []
    for line in s3_object['Body'].iter_lines():
        lines.append(line.decode('utf-8') + '\n')
        if len(lines) == max_lines:
            break

    output = StringIO()
    output.writelines(lines)
    output.seek(0)
    s3Client.put_object(Body=output.read().encode('utf-8'),
                        Bucket=outputBucketName, Key=key_name)
    return key_name, len(lines)
As a side note, I really recommend zappa if you're using Lambda; it makes Lambda development fun (and it would make chunking out code sections easy in the same codebase, using asynchronous task execution).
Try to check your logs or traceback for the exact line of the error - the line you point to in the code really will read one line at a time (with the OS doing some caching behind the scenes, but that would be a couple of hundred KB at most).
It is more likely that methods such as s3Client.get_object(Bucket=inputBucketName, Key=item), or attribute accesses like long_file = s3_object['Body']._raw_stream.data, are eagerly bringing the file's actual contents into memory.
You have to check the docs for those, and for how to stream data from S3 and dump it to disk instead of holding it all in memory; a sketch of that follows below. The fact that the attribute is named ._raw_stream, beginning with an _, indicates it is a private attribute, and it is not advisable to use it directly.
Also, you are using pass, which does nothing - the rest of the loop will run the same way, so you probably want continue there. And a bare except clause that swallows the error without logging it is among the worst mistakes possible in Python code - if there is an error there, you have to log it, not just pretend it did not happen.
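A minimal sketch of that streaming idea, reusing the question's s3Client, inputBucketName and item names and the response body's public iter_lines() instead of the private ._raw_stream:

import os

# Stream the object line by line and dump only the first 1000 lines to disk,
# so the 800 MB body is never held in memory all at once.
s3_object = s3Client.get_object(Bucket=inputBucketName, Key=item)
output_path = os.path.join('/tmp', os.path.basename(item))

with open(output_path, 'wb') as output_file:
    for line_number, line in enumerate(s3_object['Body'].iter_lines()):
        if line_number >= 1000:
            break
        output_file.write(line + b'\n')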
I am working on a personal project which involves reading in large files of JSON objects, consisting of potentially millions of entries, compressed with GZip. The problem I am having is determining how to efficiently parse these objects line by line and store them in memory such that they do not use up all of the RAM on my system. I must be able to access or construct these objects at a later time for analysis. What I have attempted thus far is as follows:
import gzip
import json
from io import BytesIO

def parse_data(file):
    accounts = []
    with gzip.open(file, mode='rb') as accounts_data:
        for line in accounts_data:
            # if line is not empty
            if len(line.strip()) != 0:
                account = BytesIO(line)
                accounts.append(account)
    return accounts

def getaccounts(accounts, idx):
    account = json.load(accounts[idx])
    # creates account object using fields in account dict
    return account_from_dict(account)
A major problem with this implementation is that I am unable to access the same object in accounts twice without a JSONDecodeError being raised. I am also not sure whether this is the most compact way I could be doing this.
Any assistance would be much appreciated.
Edit: The format of the data stored in these files are as follows:
{JSON Object 1}
{JSON Object 2}
...
{JSON Object n}
Edit: It is my intention to use the information stored in these JSON account entries to form a graph of similarities or patterns in account information.
Here's how to randomly access JSON objects in the gzipped file by first uncompressing it into a temporary file and then using tell() and seek() to retrieve them by index — thus requiring only enough memory to hold the offsets of each one.
I'm posting this primarily because you asked for an example in the comments... I wouldn't have otherwise, because it's not quite the same thing as streaming the data. The major difference is that, unlike streaming, it gives access to all the data, including the ability to randomly access any of the objects at will.
Uncompressing the entire file first does introduce some additional overhead, so unless you need to access the JSON objects more than once, it probably wouldn't be worth it. The implementation shown could probably be sped up by caching previously loaded objects, but without knowing precisely what the access patterns will be, it's hard to say for sure.
import collections.abc
import gzip
import json
import random
import tempfile


class GZ_JSON_Array(collections.abc.Sequence):
    """ Allows objects in a gzipped file of JSON objects, one per line, to be
        treated as an immutable sequence of JSON objects.
    """
    def __init__(self, gzip_filename):
        self.tmpfile = tempfile.TemporaryFile('w+b')
        # Decompress the gzip file into a temp file and save the offsets of the
        # start of each line in it.
        self.offsets = []
        with gzip.open(gzip_filename, mode='rb') as gzip_file:
            for line in gzip_file:
                line = line.rstrip().decode('utf-8')
                if line:
                    self.offsets.append(self.tmpfile.tell())
                    self.tmpfile.write(bytes(line + '\n', encoding='utf-8'))

    def __len__(self):
        return len(self.offsets)

    def __iter__(self):
        for index in range(len(self)):
            yield self[index]

    def __getitem__(self, index):
        """ Return a JSON object at offsets[index] in the given open file. """
        if index not in range(len(self.offsets)):
            raise IndexError
        self.tmpfile.seek(self.offsets[index])
        try:
            size = self.offsets[index+1] - self.offsets[index]  # Difference with next.
        except IndexError:
            size = -1  # Last one - read all remaining data.
        return json.loads(self.tmpfile.read(size).decode())

    def __del__(self):
        try:
            self.tmpfile.close()  # Allow it to auto-delete.
        except Exception:
            pass


if __name__ == '__main__':
    gzip_filename = 'json_objects.dat.gz'

    json_array = GZ_JSON_Array(gzip_filename)

    # Randomly access some objects in the JSON array.
    for index in random.sample(range(len(json_array)), 3):
        obj = json_array[index]
        print('object[{}]: {!r}'.format(index, obj))
Perhaps use an incremental JSON reader such as ijson. It does not require loading the entire structure into memory at once.
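To make that concrete, here's a hedged sketch: ijson incrementally parses a single JSON document, so it would apply here if the accounts were wrapped in one top-level array rather than stored one object per line. The file name and process() handler are hypothetical.

import gzip

import ijson

# Assumption: the file holds one top-level JSON array of account objects,
# e.g. [ {...}, {...}, ... ]; ijson then yields them one at a time
# without loading the whole array into memory.
with gzip.open('accounts.json.gz', 'rb') as f:
    for account in ijson.items(f, 'item'):
        process(account)  # hypothetical per-account handler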
Based on your answers in the comments, it seems like you just need to scan through the objects:
def evaluate_accounts(file):
    results = {}

    with gzip.open(file) as records:
        for json_rec in records:
            if json_rec.strip():
                account = json.loads(json_rec)
                results[account['id']] = evaluate_account(account)

    return results
I have an app which manages a set of files, but those files are actually stored in Rackspace's CloudFiles, because most of the files will be ~100 GB. I'm using CloudFiles' TempURL feature to allow individual files to be downloaded, but sometimes the user will want to download a set of files. Downloading all those files and generating a local zip file is impossible, though, since the server only has 40 GB of disk space.
From the user's point of view, I want to implement it the way Gmail does when you get an email with several pictures: it gives you a link to download a zip file with all the images in it, and the download starts immediately.
How can I accomplish this with Python/Django? I have found ZipStream, which looks promising because of its iterator output, but it still only accepts file paths as arguments, and the writestr method would need to fetch all the file data at once (~100 GB).
Since Python 3.5 it is possible to stream a zip of huge files/folders in chunks by writing to an unseekable stream, so there is no need to use ZipStream now.
See my answer here.
And live example here: https://repl.it/#IvanErgunov/zipfilegenerator
If you don't have a file path, but have chunks of bytes, you can exclude open(path, 'rb') as entry from the example and replace iter(lambda: entry.read(16384), b'') with your iterable of bytes, and prepare the ZipInfo manually:
import time
import zipfile
from zipfile import ZipInfo

zinfo = ZipInfo(filename='any-name-of-your-non-existent-file',
                date_time=time.localtime(time.time())[:6])
zinfo.compress_type = zipfile.ZIP_STORED

# permissions:
if zinfo.filename[-1] == '/':
    # directory
    zinfo.external_attr = 0o40775 << 16  # drwxrwxr-x
    zinfo.external_attr |= 0x10  # MS-DOS directory flag
else:
    # file
    zinfo.external_attr = 0o600 << 16  # ?rw-------
You should also remember that the zipfile module writes chunks of its own size. So, if you send in a piece of 512 bytes, the stream will only receive data when, and in the size that, the zipfile module decides to emit it. That depends on the compression algorithm, but I think it is not a problem, because the zipfile module writes small chunks of <= 16384 bytes.
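For reference, a minimal sketch of that idea under stated assumptions: the UnseekableStream buffer and the chunks_by_name iterable are invented names, entries are stored uncompressed, and ZipFile.open(..., mode='w') needs Python 3.6+.

import time
import zipfile
from zipfile import ZipFile, ZipInfo


class UnseekableStream:
    """Write-only buffer: zipfile writes into it, and we drain it into the response."""
    def __init__(self):
        self._buffer = b''

    def writable(self):
        return True

    def write(self, data):
        self._buffer += data
        return len(data)

    def drain(self):
        chunk, self._buffer = self._buffer, b''
        return chunk


def zip_generator(chunks_by_name):
    """chunks_by_name: iterable of (filename, iterable_of_byte_chunks) pairs."""
    stream = UnseekableStream()
    with ZipFile(stream, mode='w') as archive:
        for name, chunks in chunks_by_name:
            zinfo = ZipInfo(filename=name,
                            date_time=time.localtime(time.time())[:6])
            zinfo.compress_type = zipfile.ZIP_STORED
            # force_zip64 because individual entries may exceed 2 GB
            with archive.open(zinfo, mode='w', force_zip64=True) as entry:
                for chunk in chunks:
                    entry.write(chunk)
                    yield stream.drain()
    # closing the archive writes the central directory
    yield stream.drain()

In Django you could then return StreamingHttpResponse(zip_generator(...), content_type='application/zip'), so the archive is produced while it downloads.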
You can use https://pypi.python.org/pypi/tubing. Here's an example using S3; you could pretty easily create a Rackspace CloudFiles Source. Create a custom Writer (instead of sinks.Objects) to stream the data somewhere else, and custom Transformers to transform the stream.
from tubing.ext import s3
from tubing import pipes, sinks

output = s3.S3Source(bucket, key) \
    | pipes.Gunzip() \
    | pipes.Split(on=b'\n') \
    | sinks.Objects()

print(len(output))
Check this out - it's part of the Python Standard Library:
http://docs.python.org/3/library/zipfile.html#zipfile-objects
You can give it an open file or file-like object.
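A tiny sketch of that, using an in-memory buffer purely to illustrate the file-like-object part (for the ~100 GB case you would hand it a streaming writable object instead, as in the answers above):

import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, mode='w') as archive:
    archive.writestr('hello.txt', b'hello world')

buf.seek(0)  # buf now holds a complete zip archive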
I want to get the size of an uploaded image to check whether it is greater than the max file upload limit. I tried this:
@app.route("/new/photo", methods=["POST"])
def newPhoto():
    form_photo = request.files['post-photo']
    print form_photo.content_length
It printed 0. What am I doing wrong? Should I find the size of this image from its temp path? Is there anything like PHP's $_FILES['foo']['size'] in Python?
There are a few things to be aware of here - the content_length property will be the content length of the file upload as reported by the browser, but unfortunately many browsers don't send this, as noted in the docs and source.
The next thing to be aware of is that file uploads under 500 KB are stored in memory as a StringIO object rather than spooled to disk (see those docs again), so a stat call on them will fail.
MAX_CONTENT_LENGTH is the correct way to reject file uploads larger than you want, and if you need it, the only reliable way to determine the length of the data is to figure it out after you've handled the upload - either stat the file after you've .save()d it:
request.files['file'].save('/tmp/foo')
size = os.stat('/tmp/foo').st_size
Or if you're not using the disk (for example storing it in a database), count the bytes you've read:
blob = request.files['file'].read()
size = len(blob)
Though obviously, be careful that you're not reading too much data into memory if your MAX_CONTENT_LENGTH is very large.
If you don't want to save the file to disk first, use the following code; this works on an in-memory stream:
import os

file = request.files['file']

# os.SEEK_END == 2
# seek() returns the new absolute position
file_length = file.seek(0, os.SEEK_END)
# you can also use tell() to get the current position
# file_length = file.tell()

# seek back to the start position of the stream,
# otherwise save() will write a 0-byte file
# os.SEEK_SET == 0
file.seek(0, os.SEEK_SET)
Otherwise, this is better:
request.files['file'].save('/tmp/file')
file_length = os.stat('/tmp/file').st_size
The proper way to set a max file upload limit is via the MAX_CONTENT_LENGTH app configuration. For example, if you wanted to set an upload limit of 16 megabytes, you would do the following to your app configuration:
app.config['MAX_CONTENT_LENGTH'] = 16 * 1024 * 1024
If the uploaded file is too large, Flask will automatically return status code 413 Request Entity Too Large - this should be handled on the client side.
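If you also want the server to respond with something friendlier than the default error page, a minimal sketch of a 413 handler (the JSON payload is just an example):

from flask import jsonify

@app.errorhandler(413)
def request_entity_too_large(error):
    # Hypothetical response body; return whatever suits your client
    return jsonify(error="File too large"), 413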
The following section of code should meet your purpose:
form_photo.seek(0, 2)
size = form_photo.tell()
form_photo.seek(0)  # rewind so a later read() or save() sees the whole file
As someone else already suggested, you should use

app.config['MAX_CONTENT_LENGTH']

to restrict file sizes. But since you specifically want to find out the image size, you can do:
import os

# os.stat() needs a path, so use fstat() on the underlying file descriptor
# (this works once the upload has been spooled to a real temporary file)
photo_size = os.fstat(request.files['post-photo'].fileno()).st_size
print photo_size
You can use popen from the os module. Save it first:
photo = request.files['post-photo']
photo.save('tmp')
Now, just get the size:
os.popen('ls -l tmp | cut -d " " -f5').read()
This is in bytes. For megabytes or gigabytes, use --block-size=M or --block-size=G with ls.