I have to process very large images (> 2 GB each) stored in AWS S3.
Before processing, I actually want to display some of them.
The download time makes that infeasible; is it possible to display them without downloading, using only Python?
You could give a URL to the user to open in a web browser. This does involve downloading the image, but it would be done outside of Python.
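If the objects are private, one way to produce such a URL from Python is a pre-signed link generated with boto3. Here is a minimal sketch; the bucket name, key, and expiry below are placeholders.
import boto3

s3_client = boto3.client('s3')

# Placeholder bucket/key; substitute your own.
url = s3_client.generate_presigned_url(
    'get_object',
    Params={'Bucket': 'my-image-bucket', 'Key': 'images/huge_scan.tif'},
    ExpiresIn=3600,  # link stays valid for one hour
)
print(url)  # hand this to the user to open in a browser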
If you want to present them with a "thumbnail", then you would need a method of converting the image. This could be done with an AWS Lambda function that:
Loads the image into memory (it's too big for the default disk space)
Resizes the image to a smaller size
Stores it in Amazon S3
Provides a URL to the smaller image
This is similar to Tutorial: Using AWS Lambda with Amazon S3, but it would need a tweak to manipulate the image in memory instead of downloading it to the Lambda function's disk storage (which is limited to 512 MB).
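Below is a rough sketch of what such a Lambda handler could look like, assuming Pillow is bundled with the function and the function is triggered by an S3 upload event; the thumbnail bucket, key prefix, and size are placeholders. For a multi-gigabyte source image, the function would also need a generous memory allocation, and Pillow's decompression-bomb safeguard (Image.MAX_IMAGE_PIXELS) may need to be raised.
import io
import boto3
from PIL import Image

s3 = boto3.client('s3')
THUMBNAIL_BUCKET = 'my-thumbnail-bucket'  # placeholder

def lambda_handler(event, context):
    # The source bucket/key come from the S3 event that triggered the function.
    record = event['Records'][0]['s3']
    bucket = record['bucket']['name']
    key = record['object']['key']

    # Read the original image straight into memory (no use of /tmp).
    body = s3.get_object(Bucket=bucket, Key=key)['Body'].read()

    # Resize to a thumbnail entirely in memory.
    image = Image.open(io.BytesIO(body))
    image.thumbnail((512, 512))
    out = io.BytesIO()
    image.save(out, format='PNG')
    out.seek(0)

    # Store the thumbnail and return a URL to it.
    thumb_key = 'thumbnails/{}.png'.format(key)
    s3.put_object(Bucket=THUMBNAIL_BUCKET, Key=thumb_key,
                  Body=out, ContentType='image/png')
    return {'thumbnail_url': 'https://{}.s3.amazonaws.com/{}'.format(THUMBNAIL_BUCKET, thumb_key)}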
I have an API that saves an image to an S3 bucket and returns the S3 URL, but the PIL image-saving part is slow. Here is a snippet of the code:
from PIL import Image
import io
import json  # needed for json.dumps below
import boto3

BUCKET = ''

s3 = boto3.resource('s3')

def convert_fn(args):
    pil_image = Image.open(args['path']).convert('RGBA')
    # ... (image processing elided) ...
    in_mem_file = io.BytesIO()
    pil_image.save(in_mem_file, format='PNG')  # <--- This takes too long
    in_mem_file.seek(0)
    s3.meta.client.upload_fileobj(
        in_mem_file,
        BUCKET,
        'outputs/{}.png'.format(args['save_name']),
        ExtraArgs={
            'ACL': 'public-read',
            'ContentType': 'image/png'
        }
    )
    return json.dumps({"Image saved in": "https://{}.s3.amazonaws.com/outputs/{}.png".format(BUCKET, args['save_name'])})
How can I speed up the upload? Would it be easier to return the bytes?
The Image.save method is the most time-consuming part of my script. I want to increase the performance of my app, and I'm thinking that returning a stream of bytes may be the fastest way to return the image.
Compressing image data to PNG takes time, specifically CPU time. There might be a more performant library for this than PIL, but you would have to interface it with Python, and it would still take some time.
"Returning bytes" makes no sense here: either you want the image files saved on S3 or you don't. The bytes only represent an image once they are properly encoded into an image file, unless you have code to reassemble an image from raw bytes.
To speed this up, you could either create an AWS Lambda function that takes the unencoded array, generates the PNG file, and saves it to S3 asynchronously, or, more easily, try saving the image in an uncompressed format, which spares you the CPU time spent compressing the PNG: save it as a .tga or .bmp file instead of a .png, but expect the final files to be 10 to 30 times larger than the equivalent PNGs.
Also, it is not clear from the code whether this is a web API view where you want to speed up the API response, and whether it would be acceptable for the image to be generated and uploaded in the background after the API returns.
In that case, there are ways to improve the responsiveness of your app, but we would need to see the "web code": i.e. which framework you are using, the view function itself, and the call to the function shown here.
When saving a PNG with PIL.Image.save, there is an argument called compress_level; with compress_level=0 we get faster saves at the cost of no compression. Docs
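For illustration, here is roughly how that looks applied to a save like the one in the question; the image below is just a stand-in.
import io
from PIL import Image

# Hypothetical stand-in for the image produced in convert_fn.
pil_image = Image.new('RGBA', (2048, 2048))

in_mem_file = io.BytesIO()
# compress_level runs from 0 (no compression, fastest) to 9 (best compression);
# Pillow's default is 6, so lowering it trades CPU time for a larger file.
pil_image.save(in_mem_file, format='PNG', compress_level=1)

# Alternatively, skip PNG compression entirely (much larger output, as noted above):
# pil_image.save(in_mem_file, format='BMP')
in_mem_file.seek(0)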
I have mounted a Blob Storage account into Databricks and can access it fine, so I know that it works.
What I want to do, though, is list out the names of all of the files at a given path. Currently I'm doing this with:
list = dbutils.fs.ls('dbfs:/mnt/myName/Path/To/Files/2019/03/01')
df = spark.createDataFrame(list).select('name')
The issue I have, though, is that it's exceptionally slow, due to there being around 160,000 blobs at that location (Storage Explorer shows this as ~1016106592 bytes, which is about 1 GB!).
Surely this can't need to pull down all of that data; all I need/want is the filename.
Is Blob Storage my bottleneck, or can I (somehow) get Databricks to execute the command in parallel?
Thanks.
In my experience, and based on my understanding of Azure Blob Storage, all operations on Blob Storage, whether through an SDK or otherwise, are translated into REST API calls. So your dbutils.fs.ls call is actually invoking the List Blobs REST API on a blob container.
Therefore, I'm sure the performance bottleneck of your code is transferring the sizeable XML response body of the blob listing from Blob Storage in order to extract the blob names into the list variable, given that there are around 160,000 blobs.
Meanwhile, the blob names are spread across many pages of XML responses, there is a MaxResults limit per page, and fetching the next page depends on the NextMarker value of the previous one. That is why listing blobs is slow and why it cannot be parallelized.
My suggestion for improving the efficiency of loading the blob list is to cache the listing in advance, for example by writing the blob names line by line to a dedicated blob. To keep it updated in real time, you can use an Azure Function with a Blob Trigger that appends the blob name to an Append Blob whenever a blob-creation event occurs.
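A rough sketch of that Azure Function, assuming the v12 azure-storage-blob SDK and the Python blob-trigger programming model; the container name, index blob name, and connection-string setting are all placeholders.
import os
import azure.functions as func
from azure.storage.blob import BlobServiceClient

LIST_CONTAINER = 'metadata'          # placeholder container for the cached listing
LIST_BLOB = 'blob-name-index.txt'    # placeholder append blob holding one name per line

def main(myblob: func.InputStream):
    # Blob-trigger entry point: append the newly created blob's name
    # to an Append Blob that acts as a pre-built listing.
    service = BlobServiceClient.from_connection_string(os.environ['AzureWebJobsStorage'])
    index_blob = service.get_blob_client(LIST_CONTAINER, LIST_BLOB)

    if not index_blob.exists():
        index_blob.create_append_blob()

    index_blob.append_block((myblob.name + '\n').encode('utf-8'))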
I have a store of images in Google Cloud Storage and I am looking to read them into OpenCV in Datalab. I can find information on how to read text files but can't find anything on how I can read in an image. How would I go about doing this?
I am not really familiar with OpenCV, so let me cover the Datalab ⟷ GCS part and I hope that is enough for you to go on with the OpenCV part.
In Datalab, you can use two different approaches to access Google Cloud Storage resources. They are both documented (with working examples) in these Jupyter notebooks: access GCS using Storage commands ( %%gcs ) or access GCS using Storage APIs ( google.datalab.storage ).
I'll provide an example using Storage commands, but feel free to adapt it to the Datalab GCS Python library if you prefer.
# Imports
from google.datalab import Context
from IPython.display import Image
# Define the bucket and an example image to read
bucket_path = "gs://BUCKET_NAME"
bucket_object = bucket_path + "/google.png"
# List all the objects in your bucket, and read the example image file
%gcs list --objects $bucket_path
%gcs read --object $bucket_object -v img
# Print the image content (see it is in PNG format) and show it
print(type(img))
img
Image(img)
Using the piece of code I shared, you are able to perform a simple object-listing for all the objects in your bucket and also read an example PNG image. Having its content stored in a Python variable, I hope you are able to consume it in OpenCV.
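If it helps, one common way to hand those bytes to OpenCV is to decode them with cv2.imdecode; this assumes the PNG content is in the img variable from the snippet above.
import cv2
import numpy as np

# 'img' holds the raw PNG bytes read from GCS in the snippet above.
array = np.frombuffer(img, dtype=np.uint8)
decoded = cv2.imdecode(array, cv2.IMREAD_COLOR)  # BGR image as a NumPy array
print(decoded.shape)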
I am running a website on Google App Engine written in Python with jinja2. I have gotten memcache to work for most of my content from the database, and I am fuzzy on how I can increase the efficiency of images served from the blobstore. I don't think it will be much different on GAE than on any other framework, but I wanted to mention it just in case.
Anyway, are there any recommended methods for caching images or preventing them from eating up my read and write quotas?
Blobstore is fine.
Just make sure you set the HTTP cache headers in your URL handler. This allows your files to be cached either by the browser (in which case you pay nothing) or by App Engine's edge cache, where you'll pay for bandwidth but not blobstore accesses.
Be very careful with edge caching though. If you set an overly long expiry, users will never see an updated version. Often the solution to this is to change the url when you change the version.
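As a rough sketch of what setting those headers can look like in a classic Python 2.7 GAE app (the route, handler name, and one-day max-age are just examples):
import webapp2
from google.appengine.ext import blobstore
from google.appengine.ext.webapp import blobstore_handlers

class ServeImageHandler(blobstore_handlers.BlobstoreDownloadHandler):
    def get(self, resource):
        # Let browsers and the edge cache keep the image for a day;
        # change the URL when the image changes to avoid stale copies.
        self.response.headers['Cache-Control'] = 'public, max-age=86400'
        self.send_blob(blobstore.BlobKey(resource))

app = webapp2.WSGIApplication([('/img/([^/]+)', ServeImageHandler)])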
You can use the Google Images API:
https://developers.google.com/appengine/docs/python/images/functions
What I usually do is, on upload, store the URL created by images.get_serving_url(blob_key). I'm not sure if it's cheaper, but on my dev server each call to get_serving_url creates a datastore write.
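For example, a sketch of doing that with a made-up ndb model (the model and property names are just for illustration):
from google.appengine.api import images
from google.appengine.ext import ndb

class Photo(ndb.Model):
    # Made-up model purely for illustration.
    blob_key = ndb.BlobKeyProperty()
    serving_url = ndb.StringProperty()

def store_photo(blob_key):
    # Call get_serving_url once, at upload time, and persist the result
    # so later requests reuse the stored URL instead of calling the API again.
    url = images.get_serving_url(blob_key)
    Photo(blob_key=blob_key, serving_url=url).put()
    return url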
My advice would be to use Google Cloud Storage for storing your images. It's better suited and recommended for serving static files. The good thing is that now you can use the same Images api for that:
Note: You can also serve images stored in Google Cloud Storage. To do this, you need to generate a Blob Key using the Blobstore API create_gs_key() function. You also need to set a default object ACL on the bucket being used that gives your app FULL_CONTROL permissions, so that the image service can add its own ACL to the objects. For information on ACLs and permissions, see the documentation for Google Cloud Storage.
PS: Another great feature I like here is that you don't have to store different resolutions of your image if you need to serve them in different sizes. You can just add the parameters to the URL returned by get_serving_url and that will do it. Also, you only need to call get_serving_url once, store this URL somewhere, and use it whenever you need to serve the image. Plus, you can reuse the same URL for serving the same image at all of the different sizes; see the short sketch after the quoted documentation below.
URL Modifications:
=sXX: To resize an image, append =sXX to the end of the image URL, where XX is an integer from 0–1600 representing the new image size in pixels. The maximum size is defined in IMG_SERVING_SIZES_LIMIT. The API resizes the image to the supplied value, applying the specified size to the image's longest dimension and preserving the original aspect ratio. For example, if you use =s32 to resize a 1200x1600 image, the resulting image is 24x32. If that image were 1600x1200, the resized image would be 32x24 pixels.
=sXX-c: To crop and resize an image, append =sXX-c to the end of the image URL, where XX is an integer from 0–1600 representing the new image size in pixels. The maximum size is defined in IMG_SERVING_SIZES_LIMIT. The API resizes the image to the supplied value, applying the specified size to the image's longest dimension and preserving the original aspect ratio. If the image is portrait, the API slices evenly from the top and bottom to make a square. If the image is landscape, the API slices evenly from the left and right to make a square. After cropping, the API resizes the image to the specified size.
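A small sketch of reusing one stored serving URL at several sizes, following the suffixes in the quoted documentation (the base URL is a placeholder):
# 'serving_url' is the value returned by images.get_serving_url() and stored at upload time.
serving_url = 'https://lh3.googleusercontent.com/abc123'  # placeholder

thumbnail = serving_url + '=s128'    # longest side resized to 128 px
square = serving_url + '=s128-c'     # cropped to a 128 x 128 square
large = serving_url + '=s1600'       # up to the documented maximum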
In Google App Engine, I need to be able to take an uploaded PDF and convert it to an image (or maybe one day a number of tiled images) for storing and serving back out. Is there a library that will read PDF files that is also 100% python (so it can be uploaded with my app)?
From what I've gathered so far...
PIL does not read PDF files, only writes them.
GhostScript is the standard FOSS PDF reader, but I don't believe I'll be able to upload it with my app to GAE since I don't believe it's 100% python.
Is there anything else I might be able to use? Or maybe even a web service that I can call?
You may want to look into using the GAE Conversion API (not yet fully released). There's a tester signup form here, with a link to further details.
From the doc:
Conversions can be performed in any direction between PDF, HTML, TXT, and image formats, and OCR will be employed if necessary. Note that while PNG, GIF, JPEG, and BMP image formats are supported as input formats, only PNG is available for output.