BlobInfo object from a BlobKey created using blobstore.create_gs_key - python

I am converting code away from the deprecated Files API.
I have the following code, which works fine on the SDK server but fails in production. Is what I am doing even correct? If so, what could be wrong, and how can I troubleshoot it?
# Code earlier writes the file bs_file_name. This works fine because I can see
# the file in the Cloud Console.
from google.appengine.ext import blobstore

bk = blobstore.create_gs_key("/gs" + bs_file_name)
assert bk
if not isinstance(bk, blobstore.BlobKey):
    bk = blobstore.BlobKey(bk)
assert isinstance(bk, blobstore.BlobKey)
# The next line fails in production only:
assert blobstore.get(bk)  # <----------- blobstore.get(bk) returns None

Unfortunately, as per the documentation, you can't get a BlobInfo object for GCS files.
https://developers.google.com/appengine/docs/python/blobstore/#Python_Using_the_Blobstore_API_with_Google_Cloud_Storage
Note: Once you obtain a blobKey for the GCS object, you can pass it around, serialize it, and otherwise use it interchangeably anywhere you can use a blobKey for objects stored in Blobstore. This allows for usage where an app stores some data in blobstore and some in GCS, but treats the data otherwise identically by the rest of the app. (However, BlobInfo objects are currently not available for GCS objects.)

I encountered this exact issue today, and it feels very much like a bug in the Blobstore API when used with Google Cloud Storage.
Rather than leveraging the Blobstore API, I made use of the Google Cloud Storage client library. The library can be downloaded here: https://developers.google.com/appengine/docs/python/googlecloudstorageclient/download
To access a file on GCS:
import cloudstorage as gcs

with gcs.open(GCSFileName) as f:
    blob_content = f.read()
print(blob_content)

It is unfortunate that GAE behaves differently with BlobInfo in the local development server and in production; it took me a while to figure that out. But there is an easy solution: you can use a BlobReader to access the data when you have the blob_key.
import logging
from google.appengine.ext import blobstore

def getBlob(blob_key):
    logging.info('getting blob(%s)' % blob_key)
    with blobstore.BlobReader(blob_key) as f:
        data_list = []
        chunk = f.read(1000)
        while chunk != "":
            data_list.append(chunk)
            chunk = f.read(1000)
    data = "".join(data_list)
    return data
https://developers.google.com/appengine/docs/python/blobstore/blobreaderclass
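The chunked-read loop in the answer above is a general pattern for any file-like object, not just BlobReader. A minimal standalone sketch, using io.BytesIO as a stand-in for the BlobReader (an assumption for illustration only):

```python
import io

def read_in_chunks(f, chunk_size=1000):
    """Read a file-like object to the end in fixed-size chunks."""
    data_list = []
    chunk = f.read(chunk_size)
    while chunk != b"":       # read() returns an empty value at EOF
        data_list.append(chunk)
        chunk = f.read(chunk_size)
    return b"".join(data_list)

# io.BytesIO stands in for blobstore.BlobReader here.
payload = b"x" * 2500
assert read_in_chunks(payload_stream := io.BytesIO(payload)) == payload
```

Joining the chunks at the end with "".join avoids the quadratic cost of repeated string concatenation inside the loop.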

Related

How to convert FileStorage object to b2sdk.v2.AbstractUploadSource in Python

I am using Backblaze B2 and b2sdk.v2 in Flask to upload files.
This is code I tried, using the upload method:
# I am not showing authorization code...
# I am not showing authorization code...
def upload_file(file):
    bucket = b2_api.get_bucket_by_name(bucket_name)
    file = request.files['file']
    bucket.upload(
        upload_source=file,
        file_name=file.filename,
    )
This shows an error like this
AttributeError: 'SpooledTemporaryFile' object has no attribute 'get_content_length'
I think it's because I am using a FileStorage instance for the upload_source parameter.
I want to know whether I am using the API correctly and, if not, how I should use it.
Thanks
You're correct - you can't use a Flask FileStorage instance as a B2 SDK upload source. What you need to do is use the upload_bytes method with the file's content:
def upload_file(file):
    bucket = b2_api.get_bucket_by_name(bucket_name)
    file = request.files['file']
    bucket.upload_bytes(
        data_bytes=file.read(),
        file_name=file.filename,
        # ...other parameters...
    )
Note that this reads the entire file into memory. The upload_bytes method may need to restart the upload if something goes wrong (with the network, usually), so the file can't really be streamed straight through into B2.
If you anticipate that your files will not fit into memory, you should look at using create_file_stream to upload the file in chunks.
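The traceback makes sense once you look at what Flask actually hands you: Werkzeug spools uploads into a standard-library SpooledTemporaryFile, which is a plain file-like object without the get_content_length() method that b2sdk's upload-source protocol expects. A minimal sketch illustrating this (the payload bytes are made up):

```python
import tempfile

# Werkzeug backs FileStorage with a SpooledTemporaryFile, which has no
# get_content_length() method -- hence the AttributeError from b2sdk.
f = tempfile.SpooledTemporaryFile()
f.write(b"example upload payload")
assert not hasattr(f, "get_content_length")

# Reading it back yields the bytes that upload_bytes() would send.
f.seek(0)
data = f.read()
assert data == b"example upload payload"
```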

How to get file size of objects from google cloud python library?

Problem
Hello everyone. I am attempting to obtain the file size of an object using the google-cloud python library. This is my current code.
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket("example-bucket-name")
object = bucket.blob("example-object-name.jpg")
print(object.exists())
>>> True
print(object.chunk_size)
>>> None
It appears to me that the google-cloud library is choosing not to load data into the attributes such as chunk_size, content_type, etc.
Question
How can I make the library explicitly load actual data into the metadata attributes of the blob, instead of defaulting everything to None?
Call get_blob instead of blob.
Review the source code for the function blob_metadata at this link. It shows how to get a variety of metadata attributes of a blob, including its size.
If the above link dies, try looking around in this directory: Storage Client Samples
Call size on Blob.
from google.cloud import storage
# create client
client: storage.client.Client = storage.Client('projectname')
# get bucket
bucket: storage.bucket.Bucket = client.get_bucket('bucketname')
size_in_bytes = bucket.get_blob('filename').size

send data from blobstore as email attachment in GAE

Why isn't the code below working? The email is received, and the file comes through with the correct filename (it's a .png file). But when I try to open the file, it doesn't open correctly (Windows Gallery reports that it can't open this photo or video and that the file may be unsupported, damaged or corrupted).
When I download the file using a subclass of blobstore_handlers.BlobstoreDownloadHandler (basically the exact handler from the GAE docs), and the same blob key, everything works fine and Windows reads the image.
One more bit of info - the binary files from the download and the email appear very similar, but have a slightly different length.
Anyone got any ideas on how I can get email attachments sending from GAE blobstore? There are similar questions on S/O, suggesting other people have had this issue, but there don't appear to be any conclusions.
from google.appengine.api import mail
from google.appengine.ext import blobstore

def send_forum_post_notification():
    blob_reader = blobstore.BlobReader('my_blobstore_key')
    blob_info = blobstore.BlobInfo.get('my_blobstore_key')
    value = blob_reader.read()
    mail.send_mail(
        sender='my.email@address.com',
        to='my.email@address.com',
        subject='this is the subject',
        body='hi',
        reply_to='my.email@address.com',
        attachments=[(blob_info.filename, value)]
    )

send_forum_post_notification()
I do not understand why you use a tuple for the attachment. I use:
message = mail.EmailMessage(sender=......)
message.attachments = [blob_info.filename, blob_reader.read()]
I found that this code doesn't work on dev_appserver but does work when pushed to production.
I ran into a similar problem using the blobstore on a Python Google App Engine application. My application handles PDF files instead of images, but I was also seeing a "the file may be unsupported, damaged or corrupted" error using code similar to your code shown above.
Try approaching the problem this way: Call open() on the BlobInfo object before reading the binary stream. Replace this line:
value = blob_reader.read()
... with these two lines:
bstream = blob_info.open()
value = bstream.read()
Then you can remove this line, too:
blob_reader = blobstore.BlobReader('my_blobstore_key')
... since bstream above will be of type BlobReader.
Relevant documentation from Google is located here:
https://cloud.google.com/appengine/docs/python/blobstore/blobinfoclass#BlobInfo_filename

How can I copy files bigger than 5 GB in Amazon S3?

The Amazon S3 REST API documentation says there's a size limit of 5 GB for upload in a PUT operation. Files bigger than that have to be uploaded using multipart upload. Fine.
However, what I need in essence is to rename files that might be bigger than that. As far as I know, there's no rename or move operation, so I have to copy the file to the new location and delete the old one. How exactly is that done with files bigger than 5 GB? Do I have to do a multipart upload from the bucket to itself? In that case, how does splitting the file into parts work?
From reading boto's source, it doesn't seem like it does anything like this automatically for files bigger than 5 GB. Is there any built-in support that I missed?
As far as I know there's no rename or move operation, therefore I have
to copy the file to the new location and delete the old one.
That's correct, it's pretty easy to do for objects/files smaller than 5 GB by means of a PUT Object - Copy operation, followed by a DELETE Object operation (both of which are supported in boto of course, see copy_key() and delete_key()):
This implementation of the PUT operation creates a copy of an object
that is already stored in Amazon S3. A PUT copy operation is the same
as performing a GET and then a PUT. Adding the request header,
x-amz-copy-source, makes the PUT operation copy the source object into
the destination bucket.
However, that's indeed not possible for objects/files greater than 5 GB:
Note
[...] You create a copy of your object up to 5 GB in size in a single atomic
operation using this API. However, for copying an object greater than
5 GB, you must use the multipart upload API. For conceptual
information [...], go to Uploading Objects Using Multipart Upload [...] [emphasis mine]
Boto meanwhile supports this as well by means of the copy_part_from_key() method; unfortunately the required approach isn't documented outside of the respective pull request #425 (allow for multi-part copy commands) (I haven't tried this myself yet though):
import boto
s3 = boto.connect_s3('access', 'secret')
b = s3.get_bucket('destination_bucket')
mp = b.initiate_multipart_upload('tmp/large-copy-test.mp4')
mp.copy_part_from_key('source_bucket', 'path/to/source/key', 1, 0, 999999999)
mp.copy_part_from_key('source_bucket', 'path/to/source/key', 2, 1000000000, 1999999999)
mp.copy_part_from_key('source_bucket', 'path/to/source/key', 3, 2000000000, 2999999999)
mp.copy_part_from_key('source_bucket', 'path/to/source/key', 4, 3000000000, 3999999999)
mp.copy_part_from_key('source_bucket', 'path/to/source/key', 5, 4000000000, 4999999999)
mp.copy_part_from_key('source_bucket', 'path/to/source/key', 6, 5000000000, 5500345712)
mp.complete_upload()
You might want to study the respective samples on how to achieve this in Java or .NET eventually, which might provide more insight into the general approach, see Copying Objects Using the Multipart Upload API.
Good luck!
Appendix
Please be aware of the following peculiarity regarding copying in general, which is easily overlooked:
When copying an object, you can preserve most of the metadata
(default) or specify new metadata. However, the ACL is not preserved
and is set to private for the user making the request. To override the
default ACL setting, use the x-amz-acl header to specify a new ACL
when generating a copy request. For more information, see Amazon S3
ACLs. [emphasis mine]
The above was very close to working; unfortunately, it should have ended with mp.complete_upload() instead of the typo upload_complete()!
I've added a working boto S3 multipart copy script here, based on the AWS Java example and tested with files over 5 GiB:
https://gist.github.com/joshuadfranklin/5130355
I found this method for uploading files bigger than 5 GB and modified it to work with a boto copy procedure.
Here's the original: http://boto.cloudhackers.com/en/latest/s3_tut.html
import math
from boto.s3.connection import S3Connection

conn = S3Connection(host=[your_host], aws_access_key_id=[your_access_key],
                    aws_secret_access_key=[your_secret_access_key])

from_bucket = conn.get_bucket('your_from_bucket_name')
key = from_bucket.lookup('my_key_name')
dest_bucket = conn.get_bucket('your_to_bucket_name')

total_bytes = key.size
bytes_per_chunk = 500000000
chunks_count = int(math.ceil(total_bytes / float(bytes_per_chunk)))
file_upload = dest_bucket.initiate_multipart_upload(key.name)
for i in range(chunks_count):
    offset = i * bytes_per_chunk
    remaining_bytes = total_bytes - offset
    print(str(remaining_bytes))
    next_byte_chunk = min(bytes_per_chunk, remaining_bytes)
    part_number = i + 1
    # The first two arguments name the *source* bucket and key.
    file_upload.copy_part_from_key(from_bucket.name, key.name, part_number,
                                   offset, offset + next_byte_chunk - 1)
file_upload.complete_upload()
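The offset arithmetic in that loop is easy to get wrong by one, because the end offset passed to copy_part_from_key is inclusive. A standalone sketch that extracts the range computation and checks that the generated ranges tile the object exactly, with no gaps or overlaps (the sizes are made up for illustration):

```python
import math

def part_ranges(total_bytes, bytes_per_chunk):
    """Yield (part_number, start, end) tuples with inclusive end offsets."""
    chunks_count = int(math.ceil(total_bytes / float(bytes_per_chunk)))
    for i in range(chunks_count):
        offset = i * bytes_per_chunk
        next_byte_chunk = min(bytes_per_chunk, total_bytes - offset)
        yield i + 1, offset, offset + next_byte_chunk - 1

# Verify the ranges cover every byte exactly once.
total = 5500345712                  # an arbitrary size over 5 GB
ranges = list(part_ranges(total, 500000000))
assert ranges[0][1] == 0            # first range starts at byte 0
assert ranges[-1][2] == total - 1   # last range ends at the final byte
for (_, _, end1), (_, start2, _) in zip(ranges, ranges[1:]):
    assert start2 == end1 + 1       # consecutive ranges are adjacent
```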
The now-standard .copy method will perform multipart uploads for files larger than 5 GB. Official Docs
import boto3

s3 = boto3.resource('s3')
copy_source = {
    'Bucket': 'mybucket',
    'Key': 'mykey'
}
s3.meta.client.copy(copy_source, 'otherbucket', 'otherkey')

App Engine - Save response from an API in the data store as file (blob)

I'm banging my head against the wall with this one:
What I want to do is store a file that is returned from an API in the data store as a blob.
Here is the code that I use on my local machine (which of course works due to an existing file system):
client.convertHtml(html, open('html.pdf', 'wb'))
Since I cannot write to a file on App Engine I tried several ways to store the response, without success.
Any hints on how to do this? I was trying to do it with StringIO and managed to store the response, but then wasn't able to store it as a blob in the data store.
Thanks,
Chris
Found the error. Here is what it looks like now (simplified).
output = StringIO.StringIO()
try:
    client.convertURI("example.com", output)
    Report.pdf = db.Blob(output.getvalue())
    Report.put()
except pdfcrowd.Error, why:
    logging.error('PDF creation failed %s' % why)
I was trying to save the output without calling "getvalue()", that was the problem. Perhaps this is of use to someone in the future :)
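The distinction matters because the StringIO object itself is not the data; getvalue() returns the buffer's accumulated contents regardless of the current file position. A minimal sketch of the same idea, using Python 3's io.BytesIO for illustration (the payload bytes are made up):

```python
import io

output = io.BytesIO()
output.write(b"%PDF-1.4 fake pdf bytes")

# Storing `output` itself would store a file object, not the data.
# getvalue() returns the full buffer even without a seek(0) first.
data = output.getvalue()
assert isinstance(data, bytes)
assert data == b"%PDF-1.4 fake pdf bytes"
```

Note that getvalue() ignores the file position, whereas output.read() after the write would return b"" until you seek back to the start.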
