I have been experimenting with AWS MediaConvert through boto3, the Python library, and I find the docs incredibly confusing.
There are so many settings.
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/mediaconvert.html
Amazon does an absolutely terrible job of labeling what is necessary for a basic job.
What would be the correct JSON for a simple job:
taking a video-with-audio file and turning it into a CMAF file,
and taking an audio-only file and turning it into a CMAF file?
I am trying to establish the baseline use of this technology. There is so much extra that I don't know what I absolutely need and what is an extra setting for specific use cases.
The solution is to build the job in the MediaConvert UI in the AWS console, use the copy JSON button, save the result, and then use the JSON it created.
Never mind trying to create it yourself, unless you like pain.
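As a rough sketch of that workflow: assuming the JSON copied from the console was saved to a file, and assuming its top-level keys (such as Role and Settings) map directly onto the keyword arguments of boto3's create_job, something like the following could submit it. The helper names job_kwargs_from_json and submit_job are made up for illustration.

```python
import json


def job_kwargs_from_json(text):
    """Parse console-exported job JSON into keyword arguments for create_job."""
    job = json.loads(text)
    # create_job requires at least Role and Settings; fail early if the export lacks them.
    missing = [key for key in ("Role", "Settings") if key not in job]
    if missing:
        raise ValueError(f"job JSON is missing required keys: {missing}")
    return job


def submit_job(job_kwargs, region_name):
    """Submit the job. Needs boto3 and AWS credentials, so it is kept separate."""
    import boto3  # imported here so the parsing helper works without boto3 installed

    client = boto3.client("mediaconvert", region_name=region_name)
    # Assumption: every top-level key in the exported JSON is a valid create_job argument.
    return client.create_job(**job_kwargs)
```

Usage would look like submit_job(job_kwargs_from_json(open("job.json").read()), "us-east-1"), where job.json and the region are placeholders. Depending on the account and SDK version, the client may also need an account-specific endpoint_url, which describe_endpoints can return.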
I want to get a stream object from the Azure paging iterator ItemPaged, that is, turn ItemPaged[TableEntity] into a stream (Python). Is it possible?
https://learn.microsoft.com/en-us/python/api/azure-core/azure.core.paging.itempaged?view=azure-python
#Updated 11.08.2021
I have an implementation that backs up Azure Tables to Azure Blob (see "Current process to backup Azure Tables"). But I want to improve this process and am considering different options. I am trying to get a stream from Azure Tables to use with create_blob_from_stream.
I assume you want to stream bytes from the HTTP response, and not use the iterator of objects you receive.
Each API in the SDK supports a keyword argument called raw_response_hook that gives you access to the HTTP response object, and then lets you use a stream download API if you want to. Note that since the payload is considered to represent objects, it will be pre-loaded in memory no matter what, but you can still use a stream syntax nonetheless.
The callback simply takes one parameter:
def response_callback(response):
    # Do something with the response
    requests_response = response.internal_response
    # Use "requests" API now
    for chunk in requests_response.iter_content():
        work_with_chunk(chunk)
Note that this is pretty advanced; you may encounter difficulties, and this might not fit what you want precisely. We are working on a new pattern in the SDK to simplify complex scenarios like that, but it's not shipped yet. You would be able to send and receive raw requests using a send_request method, which gives you absolute control over all aspects of the query, such as specifying that you just want to stream (no pre-load in memory) or disabling the default deserialization.
Feel free to open an issue on the Azure SDK for Python repo if you have additional questions or clarification: https://github.com/Azure/azure-sdk-for-python/issues
Edit with new suggestions: TableEntity is a dict-like class, so you can use json.dumps to get a string, or json.dump to write to a stream, while iterating over the ItemPaged<TableEntity>. If json.dumps raises an exception, you can try our JSON encoder in azure.core.serialization.AzureJSONEncoder: https://github.com/Azure/azure-sdk-for-python/blob/1ffb583d57347257159638ae5f71fa85d14c2366/sdk/core/azure-core/tests/test_serialization.py#L83
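A minimal sketch of that suggestion, under a few assumptions: plain dicts stand in for TableEntity so the example runs without the Azure SDK, the function name entities_to_stream is made up, and default=str is only a crude fallback for values json cannot encode (the AzureJSONEncoder mentioned above could presumably be passed as cls= instead). It writes one JSON document per line into an in-memory byte stream.

```python
import io
import json


def entities_to_stream(entities):
    """Serialize dict-like entities as JSON lines into a seekable byte stream."""
    buffer = io.BytesIO()
    for entity in entities:
        # dict(entity) works for any mapping, which is what TableEntity is described as.
        line = json.dumps(dict(entity), default=str)
        buffer.write(line.encode("utf-8") + b"\n")
    buffer.seek(0)  # rewind so a consumer reads from the start
    return buffer
```

The returned object could then be handed to whatever upload call expects a stream. Note that this holds the whole backup in memory, so a very large table would need a temporary file or chunked uploads instead.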
(I work at MS in the Azure SDK for Python team.)
Ref:
https://docs.python-requests.org/en/master/api/#requests.Response.iter_content
https://azuresdkdocs.blob.core.windows.net/$web/python/azure-core/1.17.0/azure.core.pipeline.policies.html#azure.core.pipeline.policies.CustomHookPolicy
I'm new to Beam, so the whole triggering stuff really confuses me.
I have files that are uploaded regularly to gcs to a path that looks something like this: node-<num>/<table_name>/<timestamp>/files_parts
and I need to write something that would trigger when all 8 parts of a file exist.
Their names are something like this: file_1_part_1, file_1_part_2, file_2_part_1, file_2_part_2
(there could be parts of multiple files in the same dir, but if it's a problem I could ask for that to change).
Is there any way to create this trigger? And if not, what do you suggest I do instead?
Thanks!
If you are using the Java SDK, you can use the Watch transform to achieve this. I don't see a counterpart in the Python SDK though.
I think it's better to write a program that polls the files in the GCS directory. When all 8 parts of a file are available, publish a message containing the file name to Pub/Sub or a similar product.
Then in your Beam pipeline, use the Pub/Sub topic as the streaming source to do your ETL.
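The completeness check inside such a polling program could be sketched as below. This assumes the naming scheme from the question (a suffix of _part_N) and that parts are numbered 1 through 8; listing the bucket and publishing to Pub/Sub are left out, and the function name complete_files is made up for illustration.

```python
import re
from collections import defaultdict

# Everything before "_part_<N>" identifies the file, including its directory.
PART_RE = re.compile(r"^(?P<file>.+)_part_(?P<part>\d+)$")


def complete_files(object_names, expected_parts=8):
    """Return the files for which parts 1..expected_parts are all present."""
    seen = defaultdict(set)
    for name in object_names:
        match = PART_RE.match(name)
        if match:
            seen[match.group("file")].add(int(match.group("part")))
    wanted = set(range(1, expected_parts + 1))
    return sorted(name for name, parts in seen.items() if parts >= wanted)
```

A poller would call this on each listing of the directory and publish only the names it has not already announced.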
I am pushing video to workers for a Cloud Dataflow pipeline. I have been advised to use Beam directly to manage my objects. I can't work out the best practices for downloading objects. I can see the class in Apache Beam IO GCP, so one could use it like so:
def read_file(element, local_path):
    with beam.io.gcp.gcsio.GcsIO().open(element, 'r') as f:
Where element is the gcs path read from a previous beam step.
Checking out the available methods, downloader looks like this:
f.downloader
Download with 57507840/57507840 bytes transferred from url https://www.googleapis.com/storage/v1/b/api-project-773889352370-testing/o/Clips%2F00011.MTS?generation=1493431837327161&alt=media
This message makes it seem like it has been downloaded, and it has the right file size (57 MB). But where does it go? I would like to pass a variable (local_path), so that a subsequent process can handle the object. The class doesn't seem to accept a destination path, and the file is not in the current working directory, /tmp/ or the downloads folder. I'm testing locally on OSX before I deploy.
Am I using this tool correctly? I know that streaming video bytes may be preferable for large videos, we'll get to that once I understand basic functions. I'll open a separate question for streaming into memory (named pipe?) to be read by opencv.
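For illustration only, here is a sketch of what writing the object to local_path might look like, on the assumption that the object returned by GcsIO().open() behaves like an ordinary readable binary file. The helper name save_to_local is made up, and an in-memory stream stands in for the GCS file so the example runs without Beam.

```python
import shutil


def save_to_local(fileobj, local_path, chunk_size=1024 * 1024):
    """Copy a readable binary file-like object to local_path in chunks."""
    with open(local_path, "wb") as out:
        shutil.copyfileobj(fileobj, out, chunk_size)
    return local_path


# Assumed usage inside read_file (not run here, needs apache_beam and GCS access):
#     with beam.io.gcp.gcsio.GcsIO().open(element, 'rb') as f:
#         save_to_local(f, local_path)
```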
I have a log file being stored in Amazon S3 every 10 minutes. I am trying to access weeks and months worth of these log files and read them into Python.
I have used boto to open and read every key and append all the logs together but it's way too slow. I am looking for an alternate solution to this. Do you have any suggestion?
There is no functionality on Amazon S3 to combine or manipulate files.
I would recommend using the AWS Command-Line Interface (CLI) to synchronize files to a local directory using the aws s3 sync command. This can copy files in parallel and supports multi-part transfer for large files.
Running that command regularly can bring down a copy of the files, then your app can combine the files rather quickly.
If you do this from an Amazon EC2 instance, there is no charge for data transfer. If you download to a computer via the Internet, then Data Transfer charges apply.
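Once aws s3 sync has copied the files locally, combining them can be as simple as the sketch below. It assumes the file names sort in chronological order and that the output file lives outside the synced directory; the function name combine_logs is made up for illustration.

```python
from pathlib import Path


def combine_logs(directory, output_path):
    """Concatenate every file under directory, in sorted path order, into output_path."""
    files = sorted(p for p in Path(directory).rglob("*") if p.is_file())
    with open(output_path, "wb") as out:
        for path in files:
            out.write(path.read_bytes())
    return len(files)  # number of files combined
```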
Your first problem is that your naive solution is probably only using a single connection and isn't making full use of your network bandwidth. You can try to roll your own multi-threading support, but it's probably better to experiment with existing clients that already do this (s4cmd, aws-cli, s3gof3r).
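If you do want to roll your own, a thread pool is one simple way to keep several connections busy. The sketch below is only that: the function names and the worker count are arbitrary, and the fetch function is passed in so the pooling logic can be tried without AWS access.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(keys, fetch_one, max_workers=16):
    """Fetch every key concurrently and join the results in the original key order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, even if downloads finish out of order.
        return b"".join(pool.map(fetch_one, keys))


def make_s3_fetcher(bucket):
    """Build a fetch_one for S3. Needs boto3 and credentials, so it is not run here."""
    import boto3

    s3 = boto3.client("s3")

    def fetch_one(key):
        return s3.get_object(Bucket=bucket, Key=key)["Body"].read()

    return fetch_one
```

With real data this would be called as fetch_all(keys, make_s3_fetcher("my-log-bucket")), where the bucket name is a placeholder.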
Once you're making full use of your bandwidth, there are then some further tricks you can use to boost your transfer speed to S3.
Tip 1 of this SumoLogic article has some good info on these first two areas of optimization.
Also, note that you'll need to modify your key layout if you hope to consistently get above 100 requests per second.
Given that a year's worth of this log file is only ~50k objects, a multi-connection client on a fast EC2 instance should be workable. However, if that's not cutting it, the next step up is to use EMR. For instance, you can use S3DistCp to concatenate your log chunks into larger objects that should be faster to pull down. (Or see this AWS Big Data blog post for some crazy overengineering.) Alternatively, you can do your log processing in EMR with something like mrjob.
Finally, there's also Amazon's new Athena product that allows you to query data stored in S3 and may be appropriate for your needs.
I need to upload new device tokens to AWS SNS, and would rather do it in batches instead of one token at a time.
According to the AWS documentation this is supported by their API, and an example is given for the Java SDK using a "bulkupload package".
The problem is that I wrote everything in Python and I can't find any reference to this feature in the Boto3 documentation.
Do you know of a way to do this in Python (not necessarily using Boto)? Or am I doomed to either uploading tokens one by one or rewriting everything in Java?
Thanks!