I have a system that creates an uploadable (signed) link for Google Cloud Storage bucket uploads. The user then uploads the file directly there from the frontend.
Is there a way to verify the image file there without downloading it to a backend app and verifying it there (e.g. with PIL for Python)?
Verification for:
is it an image at all;
is it fully uploaded;
is it not broken;
etc.
P.S. is there anything similar for PDF?
Cloud Storage doesn't offer any built-in support for particular formats, be it JPEG, PDF, or anything else. To fully validate what's in a file, you need to download it and check.
You can, however, get part of the way there.
First, you can have your client validate the file and capture its size and/or a checksum (either MD5 or CRC32c), then specify those values as part of the upload so that Cloud Storage rejects anything that doesn't match. If your server knows the intended file size or checksum, it can also ask Cloud Storage for just the object's metadata, without downloading it, and verify that they match.
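For example, a server-side metadata check with the Python google-cloud-storage client might look roughly like this (bucket/object names and the expected values are placeholders):

from google.cloud import storage

def verify_upload(bucket_name, object_name, expected_size, expected_md5_b64):
    """Check an object's size and MD5 against values captured on the client.

    expected_md5_b64 is the base64-encoded MD5 digest, which is how Cloud
    Storage reports it in object metadata.
    """
    client = storage.Client()
    blob = client.bucket(bucket_name).get_blob(object_name)  # metadata only, no download
    if blob is None:
        return False  # object doesn't exist yet (upload may not have finished)
    return blob.size == expected_size and blob.md5_hash == expected_md5_b64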
Second, many files, including JPEG, have particular headers or footers that describe their contents. Instead of downloading what is potentially a very large image, you could download only the first few bytes from Cloud Storage. If the first two bytes aren't 0xFF and 0xD8, then it's not a JPEG file. Similar magic numbers exist for many other formats.
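As a sketch, a magic-number check using a ranged read with the google-cloud-storage Python client could look like this (partial-download signatures may vary by client version; names are placeholders):

from google.cloud import storage

JPEG_MAGIC = b"\xff\xd8"
PDF_MAGIC = b"%PDF"

def sniff_type(bucket_name, object_name):
    """Download only the first few bytes of an object and guess its type."""
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(object_name)
    head = blob.download_as_bytes(start=0, end=3)  # ranged read of bytes 0..3
    if head.startswith(JPEG_MAGIC):
        return "jpeg"
    if head.startswith(PDF_MAGIC):
        return "pdf"
    return "unknown"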
Related
Overview
I have a GCP storage bucket, which has a .json file and 5 JPEG files. In the .json file the image names match the JPEG file names. I want to know a way in which I can access each of the objects within the bucket based upon the image name.
Method 1 (Current Method):
Currently, a Python script is being used to get the images from the storage bucket. This is done by looping through the .json file of image names, getting each individual image name, building a URL from the bucket/image name, and then retrieving the image and displaying it on a Flask App Engine site.
This current method requires the bucket objects to be public, which poses a security issue since the internet is granted access to this bucket. Secondly, it is computationally expensive, with each image having to be pulled down from the bucket separately. The bucket will eventually contain 10,000 images, which will result in the images being slow to load and display on the web page.
Requirement (New Method):
Is there a method by which I can pull down images from the bucket, not all the images at once, and display them on a web page? I want to be able to access individual images from the bucket and display their corresponding image data, retrieved from the .json file.
Lastly, I want to ensure that neither the bucket nor the objects are public and that they can only be accessed via the App Engine app.
Thanks
It would be helpful to see the Python code that's doing the work right now. You shouldn't need the storage objects to be public. They should be able to be retrieved using the Google Cloud Storage (GCS) API and a service account token that has view-only permissions on storage (although, depending on whether or not you know the object names and need to get the bucket name, it might require more permissions on the service account).
As for the performance, you could either do things on the front end to be smart about how many you're showing and fetch only what you want to display as the user scrolls, or you could paginate your results from the GCS bucket.
Links to the service account and API pieces here:
https://cloud.google.com/iam/docs/service-accounts
https://cloud.google.com/storage/docs/reference/libraries#client-libraries-install-python
Information about pagination for retrieving GCS objects here:
How does paging work in the list_blobs function in Google Cloud Storage Python Client Library
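As a rough sketch of the suggested approach (not the poster's actual code): list one page of objects from the private bucket and hand the browser short-lived signed URLs, assuming a service account that can sign URLs; the bucket name and page size are placeholders.

from datetime import timedelta
from google.cloud import storage

def list_image_urls(bucket_name, page_size=20, page_token=None):
    """Return short-lived signed URLs for one page of objects in a private bucket.

    The client authenticates as a service account (e.g. via
    GOOGLE_APPLICATION_CREDENTIALS), so neither the bucket nor the objects
    need to be public.
    """
    client = storage.Client()
    iterator = client.list_blobs(bucket_name, max_results=page_size, page_token=page_token)
    page = next(iterator.pages)
    urls = [blob.generate_signed_url(version="v4", expiration=timedelta(minutes=15))
            for blob in page]
    return urls, iterator.next_page_token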
I'm making a small app to export data from BigQuery to Google Cloud Storage and then copy it into AWS S3, but I'm having trouble finding out how to do it in Python.
I have already written the code in Kotlin (because it was easiest for me, but for reasons outside the scope of my question we want it to run in Python), and in Kotlin the Google SDK allows me to get an InputStream from the Blob object, which I can then inject into the Amazon S3 SDK's AmazonS3.putObject(String bucketName, String key, InputStream input, ObjectMetadata metadata).
With the Python SDK it seems I only have the options of downloading the blob to a local file or as a string.
I would like (as I do in Kotlin) to pass some object returned from the Blob object into the AmazonS3.putObject() method, without having to save the content as a file first.
I am in no way a Python pro, so I might have missed an obvious way of doing this.
I ended up with the following solution, since download_to_file apparently writes the data into a file-like object that the boto3 S3 client can handle.
This works just fine for smaller files, but since it buffers everything in memory, it could be problematic for larger files.
from io import BytesIO

import boto3
from google.cloud import storage

def copy_data_from_gcs_to_s3(gcs_bucket, gcs_filename, s3_bucket, s3_filename):
    # Download the GCS object into an in-memory buffer.
    gcs_client = storage.Client(project="my-project")
    bucket = gcs_client.get_bucket(gcs_bucket)
    blob = bucket.blob(gcs_filename)
    data = BytesIO()
    blob.download_to_file(data)
    data.seek(0)  # rewind the buffer before handing it to boto3
    # Upload the buffer to S3 as a file-like object.
    s3 = boto3.client("s3")
    s3.upload_fileobj(data, s3_bucket, s3_filename)
If anyone has information/knowledge about something other than BytesIO to handle the data (e.g. so I can stream the data directly into S3 without having to buffer it in memory on the host machine), it would be very much appreciated.
google-resumable-media can be used to download the file in chunks from GCS, and smart_open can be used to upload them to S3. This way you don't need to download the whole file into memory. There is also a similar question that addresses this issue: Can you upload to S3 using a stream rather than a local file?
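A minimal sketch of that idea, assuming smart_open handles both the gs:// and s3:// schemes and that default GCS/S3 credentials are configured (the URLs in the example are placeholders):

import smart_open

def stream_gcs_to_s3(gcs_url, s3_url, chunk_size=8 * 1024 * 1024):
    """Copy an object from GCS to S3 in fixed-size chunks.

    Only one chunk is held in memory at a time, so large files don't need
    to fit in RAM on the host machine.
    """
    with smart_open.open(gcs_url, "rb") as src, smart_open.open(s3_url, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)

# e.g. stream_gcs_to_s3("gs://my-gcs-bucket/export.csv", "s3://my-s3-bucket/export.csv")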
I want to rotate and save an image which is already stored in the blobstore. For this I tried using the images.Image.rotate method:
from google.appengine.api import images

img = images.Image(blob_key=image.blob)
img.rotate(180)  # queue the rotation transform
final_image = img.execute_transforms(output_encoding=images.PNG)  # bytes of the rotated image
I don't know how to save the rotated image again to the blobstore.
A transformed image is just a collection of bytes that you can write back to Cloud Storage, either as a new object or overwriting an existing one (e.g. cloudstorage.open with mode set to "w" using the Python GCS client).
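For instance, writing the transformed bytes back with the App Engine GCS client library might look roughly like this sketch (the bucket path is a placeholder):

import cloudstorage

def save_rotated_image(image_bytes, gcs_path="/my-bucket/rotated.png"):
    """Write the transformed image bytes to a (new or existing) GCS object."""
    with cloudstorage.open(gcs_path, "w", content_type="image/png") as gcs_file:
        gcs_file.write(image_bytes)

# e.g. save_rotated_image(final_image)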
Writing to the blobstore used to be possible using the Files API, which is now deprecated.
You can use GCS instead for writing the image (GCS is recommended over Blobstore anyways).
You can still keep the blobstore API with GCS if you want. I think it should be possible to even mix blobstore and GCS transparently for your users so that you don't have to migrate all your existing images from the blobstore to GCS.
I'm working on AWS S3 multipart upload, and I am facing the following issue.
Basically, I am uploading a file chunk by chunk to S3, and if any write happens to the file locally during that time, I would like to reflect that change in the S3 object that is currently being uploaded.
Here is the procedure that I am following:
Initiate the multipart upload operation.
Upload the parts one by one [5 MB chunk size] [do not complete the operation yet].
If, during that time, a write goes to the file [assuming I have the details for the write: offset, no_bytes_written]:
I will calculate the part number for the write that happened locally, and read that chunk from the uploaded S3 object.
Read the same chunk from the local file and write it over the part read from S3.
Upload the same part to the S3 object.
This will be an asynchronous operation. I will complete the multipart operation at the end (a rough boto3 sketch of this flow is below).
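A rough boto3 sketch of steps 1-2 and the final completion, under the assumption that the part list is tracked client-side (bucket/key names are placeholders; error handling omitted):

import boto3

CHUNK_SIZE = 5 * 1024 * 1024  # 5 MB parts

def start_multipart_upload(local_path, bucket, key):
    """Initiate a multipart upload and send the file part by part.

    The upload is NOT completed here; the upload id and part list are
    returned so the caller can replace parts later and complete at the end.
    """
    s3 = boto3.client("s3")
    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]
    parts = []
    with open(local_path, "rb") as f:
        part_number = 1
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            resp = s3.upload_part(Bucket=bucket, Key=key, PartNumber=part_number,
                                  UploadId=upload_id, Body=chunk)
            parts.append({"ETag": resp["ETag"], "PartNumber": part_number})
            part_number += 1
    return upload_id, parts

def finish_multipart_upload(bucket, key, upload_id, parts):
    """Complete the multipart upload once all (re-)uploaded parts are final."""
    boto3.client("s3").complete_multipart_upload(
        Bucket=bucket, Key=key, UploadId=upload_id,
        MultipartUpload={"Parts": parts})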
I am facing an issue with reading an uploaded part while the multipart upload is still in progress. Is there any API available for that?
Any help would be greatly appreciated.
There is no API in S3 to retrieve a part of a multipart upload. You can list the parts, but I don't believe there is any way to retrieve an individual part once it has been uploaded.
You can re-upload a part. S3 will just throw away the previous part and use the new one in its place. So, if you had the old and new versions of the file locally and were keeping track of the parts yourself, I suppose you could, in theory, replace individual parts that had been modified after the multipart upload was initiated. However, it seems to me that this would be a very complicated and error-prone process. What if the change made to a file was to add several MBs of data to it? Wouldn't that change your part boundaries? Would that potentially affect other parts as well?
I'm not saying it can't be done, but I am saying it seems complicated and would require you to do a lot of bookkeeping on the client side.
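For illustration, re-sending a part under the same part number with boto3 might look like this sketch (the helper and its arguments are hypothetical):

import boto3

def replace_part(bucket, key, upload_id, part_number, new_chunk, parts):
    """Re-upload one part of an in-progress multipart upload.

    S3 keeps only the latest upload for a given part number, so the new
    ETag has to replace the old one in the part list that will later be
    passed to complete_multipart_upload.
    """
    s3 = boto3.client("s3")
    resp = s3.upload_part(Bucket=bucket, Key=key, PartNumber=part_number,
                          UploadId=upload_id, Body=new_chunk)
    parts[part_number - 1] = {"ETag": resp["ETag"], "PartNumber": part_number}
    return parts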
I am writing zip files into Google Cloud Storage using the GCS client library. Then I retrieve the blob key using the create_gs_key() function. Immediately after creating the file, I try to download it with a second HTTP request, writing the blob key obtained in the previous call to the X-AppEngine-BlobKey response header.
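The flow is roughly the following sketch (not the exact code; the bucket path and zip_bytes are placeholders):

import cloudstorage
from google.appengine.ext import blobstore

filename = "/my-bucket/archive.zip"  # placeholder path

# Write the zip to GCS with the GCS client library.
with cloudstorage.open(filename, "w", content_type="application/zip") as gcs_file:
    gcs_file.write(zip_bytes)  # zip_bytes holds the archive contents (placeholder)

# Derive a blob key for the GCS object so it can be served via the blobstore machinery.
blob_key = blobstore.create_gs_key("/gs" + filename)

# In the download handler, set the response header so App Engine serves the file:
# self.response.headers["X-AppEngine-BlobKey"] = blob_key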
When the file is relatively big, usually about 30 MB or more, the first try sometimes results in an incomplete file, a few MB smaller than the target size. If you wait a little, the next try is usually fine.
I had the same problem when I tried to write files into the blobstore using the API that is now deprecated.
Is it guaranteed that when you close the file it should already be available for serving?