Large file upload to Video Indexer through python - python

I'm trying to upload a large video (around 1.5 GB) through the Video Indexer API. However, my machine uses a lot of RAM to do so, and the deployment system has only a small amount of RAM. I want to use the API so that the video is uploaded in multiple parts without using too much memory (around 100 MB would suffice).
I've tried using ffmpeg to split the video into chunks and upload it piece by piece, but Video Indexer recognizes them as different videos and gives separate insights for each. It would be better if the video were aggregated online.
How can I do chunked video upload to MS Video Indexer?

Let me guess. Previously, you followed the official tutorial "Tutorial: Use the Video Indexer API" and the Upload Video API reference (using the Python sample code at the end of the API reference page) to upload your large video.
It uses a lot of memory because the code below sends the data block {body} from memory, and its value comes from open("<your local file name>").read(), which loads the entire file at once.
conn.request("POST", "/{location}/Accounts/{accountId}/Videos?name={name}&accessToken={accessToken}&%s" % params, "{body}", headers)
However, if you carefully read the videoUrl subsection of the document "Upload and index your videos", the C# code that follows it, and the explanation of videoUrl in the API reference, you will see that passing the video file as multipart/form body content is not the only way.
videoUrl
A URL of the video/audio file to be indexed. The URL must point at a media file (HTML pages are not supported). The file can be protected by an access token provided as part of the URI and the endpoint serving the file must be secured with TLS 1.2 or higher. The URL needs to be encoded.
If the videoUrl is not specified, the Video Indexer expects you to pass the file as a multipart/form body content.
The screenshot for the C# code with videoUrl
The screenshot for the videoUrl parameter in API reference
You can first upload the large video file to Azure Blob Storage, or to another online service that satisfies the videoUrl requirements, via Python streaming upload code or tools like azcopy or Azure Storage Explorer. Then, using Azure Blob Storage as an example, generate a blob URL with a SAS token (Python code below) and pass it as videoUrl in the upload API request.
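As a rough illustration of what a streaming upload does under the hood, the sketch below reads the file one block at a time and prepares the pieces for the Blob Storage Put Block and Put Block List REST operations. It makes no network calls. The helper names iter_blocks and block_list_xml and the 4 MB default block size are assumptions chosen for illustration; in practice the Azure SDK, azcopy, or Storage Explorer do this work for you.

```python
import base64


def iter_blocks(path, block_size=4 * 1024 * 1024):
    """Yield (block_id, data) pairs, holding only one block in memory at a time."""
    with open(path, "rb") as f:
        index = 0
        while True:
            data = f.read(block_size)
            if not data:
                break
            # Block IDs must be Base64 strings of equal length within one blob.
            block_id = base64.b64encode(f"{index:08d}".encode("ascii")).decode("ascii")
            yield block_id, data
            index += 1


def block_list_xml(block_ids):
    """Build the request body for the final Put Block List call."""
    items = "".join(f"<Latest>{block_id}</Latest>" for block_id in block_ids)
    return f'<?xml version="1.0" encoding="utf-8"?><BlockList>{items}</BlockList>'
```

Each block would be sent with a PUT to the blob URL plus comp=block and blockid=<id>, followed by a single PUT with comp=blocklist carrying the XML body, so memory use stays at roughly one block.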
Python code for generating a blob URL with a SAS token
# Uses the legacy (pre-v12) azure-storage-blob SDK.
from datetime import datetime, timedelta
from azure.storage.blob.baseblobservice import BaseBlobService
from azure.storage.blob import BlobPermissions
account_name = '<your account name>'
account_key = '<your account key>'
container_name = '<your container name>'
blob_name = '<your blob name>'
service = BaseBlobService(account_name=account_name, account_key=account_key)
# Read-only SAS token that expires one hour from now.
token = service.generate_blob_shared_access_signature(
    container_name,
    blob_name,
    permission=BlobPermissions.READ,
    expiry=datetime.utcnow() + timedelta(hours=1),
)
blobUrlWithSas = f"https://{account_name}.blob.core.windows.net/{container_name}/{blob_name}?{token}"
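With the SAS URL in hand, the upload request no longer needs a file body. A minimal sketch of building the request URL follows. The api.videoindexer.ai host is an assumption, the path shape is taken from the sample request earlier in this answer, and build_upload_url is a hypothetical helper. urlencode takes care of the requirement that videoUrl be encoded.

```python
from urllib.parse import urlencode


def build_upload_url(location, account_id, name, access_token, video_url):
    """Return the Upload Video request URL with videoUrl percent-encoded."""
    query = urlencode({
        "name": name,
        "accessToken": access_token,
        "videoUrl": video_url,  # e.g. the blob URL with its SAS token
    })
    # Host name is an assumption; the path follows the sample request above.
    return f"https://api.videoindexer.ai/{location}/Accounts/{account_id}/Videos?{query}"
```

The resulting URL would be sent as a POST without a file body, so the local machine never has to load the video into memory.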
Hope it helps.

Related

Download Blob From Blob Storage Using Python

I am trying to download an Excel file from a blob. However, it keeps generating the error "The specified blob does not exist". The error happens at blob_client.download_blob(), although I can get the blob_client. Any idea why, or other ways I can connect using managed identity?
from io import BytesIO
import pandas as pd
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient
default_credential = DefaultAzureCredential()
blob_url = BlobServiceClient('url', credential=default_credential)
container_client = blob_url.get_container_client('xx-xx-data')
blob_client = container_client.get_blob_client('TEST.xlsx')
downloaded_blob = blob_client.download_blob()
df = pd.read_excel(BytesIO(downloaded_blob.readall()), sheet_name='Test', skiprows=2)
Turns out that I have to also provide 'Reader' access on top of 'Storage Blob Data Contributor' to be able to identify the blob. There was no need for SAS URL.
You're getting this error because every request to Azure Blob Storage must be authenticated. The only exception is reading (downloading) a blob from a public blob container. In all likelihood, the container holding this blob has a private ACL, and since you're sending an unauthenticated request, you get this error.
I would recommend using a Shared Access Signature (SAS) URL for the blob with Read permission instead of a plain blob URL. Since a SAS URL has the authorization information embedded in the URL itself (the sig portion), you should be able to download the blob as long as the SAS is valid and has not expired.
Please see this for more information on Shared Access Signature: https://learn.microsoft.com/en-us/rest/api/storageservices/delegate-access-with-shared-access-signature.
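Since an expired SAS is a common cause of failed downloads, it can help to inspect the URL before using it. The sketch below reads the se (signed expiry) query parameter from a SAS URL. The helper name sas_expiry is hypothetical, and the code assumes the second-precision UTC timestamp form; other accepted timestamp forms are not handled.

```python
from datetime import datetime, timezone
from urllib.parse import urlsplit, parse_qs


def sas_expiry(url):
    """Return the SAS expiry ('se' parameter) as an aware UTC datetime, or None."""
    values = parse_qs(urlsplit(url).query).get("se")
    if not values:
        return None
    # Assumes the second-precision form, e.g. 2030-01-01T00:00:00Z.
    parsed = datetime.strptime(values[0], "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)
```

Comparing the result against datetime.now(timezone.utc) shows whether the token is still usable.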

Is there a temporary directory or direct way to upload a file in azure storage?

I'm trying to build a Python API where the user uploads a PDF file and the API sends it directly to Azure Storage. What I found is that I must have a local path, i.e.
container_client = ContainerClient.from_connection_string(conn_str=conn_str, container_name='mycontainer')
with open('mylocalpath/myfile.pdf', "rb") as data:
    container_client.upload_blob(name='myblockblob.pdf', data=data)
Another solution is to store the file on the VM and then point the local path to it, but I don't want to fill up my VM.
If you want to upload directly from the client side to Azure Blob Storage instead of receiving the file in your API, you can use a shared access signature (SAS) on your storage account. Add a function to your API that generates a pre-signed URL using that SAS and returns it to the client; the client can then upload the file to your blob via that URL.
To generate the URL, you can follow the code below:
from datetime import datetime, timedelta
from azure.storage.blob import generate_blob_sas, BlobSasPermissions
accountname = "<accountname>"  # storage account name
accountkey = "<accountkey>"  # get this from the access key section in Azure Storage
containername = "<containername>"
def getpushurl(filename):
    # SAS token that allows writing this one blob for the next 100 seconds
    token = generate_blob_sas(
        account_name=accountname,
        container_name=containername,
        account_key=accountkey,
        permission=BlobSasPermissions(write=True),
        expiry=datetime.utcnow() + timedelta(seconds=100),
        blob_name=filename,
    )
    url = f"https://{accountname}.blob.core.windows.net/{containername}/{filename}?{token}"
    return url
pdfpushurl = getpushurl("demo.text")
print(pdfpushurl)
After generating this URL, give it to the client so that the client can send the file directly to that URL with a PUT request; the file is then uploaded straight to Azure Storage.
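On the client side, the PUT request could look roughly like the sketch below, which uses only the standard library. The helper name build_put_request and the default content type are assumptions; the x-ms-blob-type header is the one the Put Blob operation expects. The function only prepares the request and does not send it.

```python
import urllib.request


def build_put_request(sas_url, data, content_type="application/pdf"):
    """Prepare (but do not send) a Put Blob request for a SAS URL."""
    headers = {
        "x-ms-blob-type": "BlockBlob",  # required by the Put Blob operation
        "Content-Type": content_type,
    }
    return urllib.request.Request(sas_url, data=data, headers=headers, method="PUT")
```

Passing the returned object to urllib.request.urlopen would perform the upload.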
You can generate a SAS token with write permission for your users so that they can upload .pdf files directly from their side without storing them on the server. For details, please see my previous post here.
Try the code below to generate a SAS token with container write permission:
from azure.storage.blob import BlobServiceClient,ContainerSasPermissions,generate_container_sas
from datetime import datetime, timedelta
storage_connection_string=''
container_name = ''
block_blob_service = BlobServiceClient.from_connection_string(storage_connection_string)
container_client = block_blob_service.get_container_client(container_name)
sasToken = generate_container_sas(
    account_name=container_client.account_name,
    container_name=container_client.container_name,
    account_key=container_client.credential.account_key,
    # grant write permission only
    permission=ContainerSasPermissions(write=True),
    start=datetime.utcnow() - timedelta(minutes=1),
    # valid for 1 hour
    expiry=datetime.utcnow() + timedelta(hours=1),
)
print(sasToken)
After you have returned this SAS token to your user, see this official guide on uploading files from an HTML page. I think it would be helpful if you are developing a web app.

Get content_type from Google Cloud file

I have two API endpoints: one takes a file from an HTTP request and uploads it to a Google Cloud bucket using the Python API, and another downloads it again. In the first view, I get the file content type from the HTTP request and upload the file to the bucket, setting that metadata:
from os import path
from google.cloud import storage
file_obj = request.FILES['file']
client = storage.Client.from_service_account_json(path.join(
    path.realpath(path.dirname(__file__)),
    '..',
    'settings',
    'api-key.json'
))
bucket = client.get_bucket('storage-bucket')
blob = bucket.blob(filename)
blob.upload_from_string(
    file_text,
    content_type=file_obj.content_type
)
Then in another view, I download the file:
...
bucket = client.get_bucket('storage-bucket')
blob = bucket.blob(filename)
blob.download_to_filename(path)
How can I access the file metadata I set earlier (content_type) ? It's not available on the blob object anymore since a new one was instantiated, but it still holds the file.
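You should use get_blob, which fetches the object's metadata from the server; bucket.blob only creates a local reference without loading any metadata: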
blob = bucket.get_blob(blob_name)
blob.content_type
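A small wrapper makes the difference explicit and also covers the case where the object does not exist, since get_blob returns None then. The function name stored_content_type is hypothetical, and bucket is assumed to be a google.cloud.storage Bucket like the one in the question.

```python
def stored_content_type(bucket, blob_name):
    """Return the content type saved with the object, or None if it is missing.

    `bucket` is expected to be a google.cloud.storage Bucket.
    """
    # get_blob() fetches the object's metadata; bucket.blob() only builds a
    # local reference, which is why content_type was empty in the question.
    blob = bucket.get_blob(blob_name)
    return None if blob is None else blob.content_type
```

In the download view this could be called as stored_content_type(bucket, filename) before or after download_to_filename.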

How to access user google drive from backend?

I am connecting the user on the frontend, which is in ReactJS; the backend is in Python.
When I connect the user, I get the following data:
{
"aud": "some token",
"scope": "https://www.googleapis.com/auth/drive",
"exp": "1544798733",
"expires_in": "3598",
"access_type": "online"
}
Now, when I connect through Python to upload a file to Google Drive, I need many more fields as the user credentials to successfully upload the file. How can I connect and upload a file to Drive? Is there any other solution?
I am referring to this doc for Drive access using Python.
In the doc provided by Google Drive, step 1 includes several steps related to the user credentials. Did you follow them as described in the doc?
Here is the documentation that will guide you through uploading a file to Drive.
You can send upload requests in any of the following ways:
Simple upload: uploadType=media. For quick transfer of a small file (5 MB or less). To perform a simple upload, refer to Performing a Simple Upload.
Multipart upload: uploadType=multipart. For quick transfer of a small file (5 MB or less) and metadata describing the file, all in a single request. To perform a multipart upload, refer to Performing a Multipart Upload.
Resumable upload: uploadType=resumable. For more reliable transfer, especially important with large files. Resumable uploads are a good choice for most applications, since they also work for small files at the cost of one additional HTTP request per upload. To perform a resumable upload, refer to Performing a Resumable Upload.
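The choice between the three upload types can be sketched as a small helper. The name choose_upload_type is hypothetical, and reading "5 MB" as 5 * 1024 * 1024 bytes is an assumption; the sketch follows the size guidance in the list above, although resumable uploads also work for small files.

```python
SIMPLE_LIMIT = 5 * 1024 * 1024  # the "5 MB or less" threshold from the list above


def choose_upload_type(size_bytes, has_metadata=False):
    """Pick an uploadType value based on file size and whether metadata is sent."""
    if size_bytes > SIMPLE_LIMIT:
        return "resumable"
    return "multipart" if has_metadata else "media"
```

For example, a 20 MB video would map to uploadType=resumable, while a small image with a name in its metadata would map to uploadType=multipart.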
If you are using Python, here is sample code for a basic upload.
from googleapiclient.http import MediaFileUpload
# drive_service is assumed to be an authorized Drive API service object
file_metadata = {'name': 'photo.jpg'}
media = MediaFileUpload('files/photo.jpg', mimetype='image/jpeg')
file = drive_service.files().create(body=file_metadata,
                                    media_body=media,
                                    fields='id').execute()
print('File ID: %s' % file.get('id'))
For the rest of details, you can refer to the docs.

How do you get Google App Engine to gunzip during download?

I am trying to get Google App Engine to gunzip my .gz blob file (single file compressed) automatically by setting the response headers as follows:
class download(blobstore_handlers.BlobstoreDownloadHandler):
    def get(self, resource):
        resource = str(urllib.unquote(resource))
        blob_info = blobstore.BlobInfo.get(resource)
        self.response.headers['Content-Encoding'] = str('gzip')
        # self.response.headers['Content-type'] = str('application/x-gzip')
        self.response.headers['Content-type'] = str(blob_info.content_type)
        self.response.headers['Content-Length'] = str(blob_info.size)
        cd = 'attachment; filename=%s' % (blob_info.filename)
        self.response.headers['Content-Disposition'] = str(cd)
        self.response.headers['Cache-Control'] = str('must-revalidate, post-check=0, pre-check=0')
        self.response.headers['Pragma'] = str(' public')
        self.send_blob(blob_info)
When this runs, the file is downloaded without the .gz extension. However, the downloaded file is still gzipped: the size of the downloaded data matches the .gz file size on the server, and I can confirm this by manually gunzipping the downloaded file. I am trying to avoid the manual gunzip step.
I am trying to get the blob file to automatically gunzip during the download. What am I doing wrong?
By the way, the gzip file contains only a single file. On my self-hosted (non Google) server, I could accomplish the automatic gunzip by setting same response headers; albeit, my code there is written in PHP.
UPDATE:
I rewrote the handler to serve data from the bucket. However, this generates an HTTP 500 error. The file is partially downloaded before the failure. The rewrite is as follows:
class download(blobstore_handlers.BlobstoreDownloadHandler):
    def get(self, resource):
        resource = str(urllib.unquote(resource))
        blob_info = blobstore.BlobInfo.get(resource)
        file = '/gs/mydatabucket/%s' % blob_info.filename
        print file
        self.response.headers['Content-Encoding'] = str('gzip')
        self.response.headers['Content-Type'] = str('application/x-gzip')
        # self.response.headers['Content-Length'] = str(blob_info.size)
        cd = 'filename=%s' % (file)
        self.response.headers['Content-Disposition'] = str(cd)
        self.response.headers['Cache-Control'] = str('must-revalidate, post-check=0, pre-check=0')
        self.response.headers['Pragma'] = str(' public')
        self.send_blob(file)
This downloads 540,672 bytes of the 6,094,848-byte file to the client before the server terminates and issues a 500 error. When I run 'file' on the partially downloaded file from the command line, Mac OS seems to correctly identify the file format as a 'SQLite 3.x database' file. Any idea why the server returns the 500 error? How can I fix the problem?
You should first check to see if your requesting client supports gzipped content. If it does support gzip content encoding, then you may pass the gzipped blob as is with the proper content-encoding and content-type headers, otherwise you need to decompress the blob for the client. You should also verify that your blob's content_type isn't gzip (this depends on how you created your blob to begin with!)
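The check described above can be sketched in a few lines. This is a simplified illustration in Python 3, not App Engine handler code: body_for_client is a hypothetical helper, the Accept-Encoding parsing looks only for a gzip token and ignores q-values, and the whole blob is decompressed in memory.

```python
import gzip


def body_for_client(gz_bytes, accept_encoding):
    """Return (body, extra_headers) suited to the client's Accept-Encoding."""
    # Simplified parsing: looks only for a "gzip" token and ignores q-values.
    tokens = [t.split(";")[0].strip().lower() for t in accept_encoding.split(",")]
    if "gzip" in tokens:
        return gz_bytes, {"Content-Encoding": "gzip"}
    # Client cannot decode gzip, so decompress on the server.
    return gzip.decompress(gz_bytes), {}
```

The Content-Type header would still need to describe the uncompressed data in both branches, which is why the blob's stored content_type should not be a gzip type.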
You may also want to look at Google Cloud Storage as this automatically handles gzip transportation so long as you properly compress the data before storing it with the proper content-encoding and content-type metadata.
See this SO question: Google cloud storage console Content-Encoding to gzip
Or the GCS Docs: https://cloud.google.com/storage/docs/gsutil/addlhelp/WorkingWithObjectMetadata#content-encoding
You can use GCS as easily as (if not more easily than) the Blobstore in App Engine, and it seems to be the preferred storage layer going forward. I say this because the File API, which made Blobstore interaction easier, has been deprecated, and considerable effort has gone into the GCS libraries to make their API similar to Python's built-in file API.
UPDATE:
Since the objects are stored in GCS, you can use 302 redirects to point users to files rather than relying on the Blobstore API. This eliminates any unknown behavior of the Blobstore API and GAE delivering your stored objects with the content-type and content-encoding you intended to use. For objects with a public-read ACL, you may simply direct them to either storage.googleapis.com/<bucket>/<object> or <bucket>.storage.googleapis.com/<object>. Alternatively, if you'd like to have application logic dictate access, you should keep the ACL to the objects private and can use GCS Signed URLs to create short lived URLs to use when doing a 302 redirect.
It's worth noting that if you want users to be able to upload objects via GAE, you'd still use the Blobstore API to handle storing the file in GCS, but you'd have to modify the object after it is uploaded to ensure that proper gzip compression and content-encoding metadata are used.
class legacy_download(blobstore_handlers.BlobstoreDownloadHandler):
    def get(self, resource):
        filename = str(urllib.unquote(resource))
        url = 'https://storage.googleapis.com/mybucket/' + filename
        self.redirect(url)
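Because the handler above unquotes the resource before concatenating it into the URL, object names containing spaces or other special characters may need to be percent-encoded again. A minimal Python 3 sketch, with public_gcs_url as a hypothetical helper name:

```python
from urllib.parse import quote


def public_gcs_url(bucket, object_name):
    """Build the public URL for an object, percent-encoding unsafe characters."""
    # quote() leaves "/" untouched by default, so folder-like names stay readable.
    return f"https://storage.googleapis.com/{bucket}/{quote(object_name)}"
```

This applies only to objects with a public-read ACL; private objects would need a signed URL instead.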
GAE already serves everything using gzip if the client supports it.
So I think what's happening after your update is that the browser expects there to be more of the file, but GAE thinks it's already at the end of the file since it's already gzipped. That's why you get the 500.
(if that makes sense)
Anyway, since GAE already handles compression for you, the easiest way is probably to put non compressed files in GCS and let the Google infrastructure handle the compression automatically for you when you serve them.
