I am pushing video to workers in a Cloud Dataflow pipeline. I have been advised to use Beam directly to manage my objects. I can't work out the best practice for downloading objects. I can see the class
apache_beam.io.gcp.gcsio.GcsIO, so one could use it like so:
def read_file(element, local_path):
    with beam.io.gcp.gcsio.GcsIO().open(element, 'r') as f:
        ...
Where element is the GCS path read from a previous Beam step.
Checking out the available methods, downloader looks promising:
f.downloader
Download with 57507840/57507840 bytes transferred from url https://www.googleapis.com/storage/v1/b/api-project-773889352370-testing/o/Clips%2F00011.MTS?generation=1493431837327161&alt=media
This message makes it seem like the file has been downloaded, and it shows the right file size (57 MB). But where does it go? I would like to pass a variable (local_path) so that a subsequent process can handle the object. The class doesn't seem to accept a destination path; the file isn't in the current working directory, /tmp/, or the Downloads folder. I'm testing locally on OS X before I deploy.
Am I using this tool correctly? I know that streaming video bytes may be preferable for large videos; we'll get to that once I understand the basic functions. I'll open a separate question for streaming into memory (named pipe?) to be read by OpenCV.
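For what it's worth, here is a minimal sketch of how I imagine this working, assuming the handle returned by GcsIO().open() behaves like an ordinary read-only file object and that local_path is just a writable path on the worker (both are my assumptions, not something I've confirmed in the docs):

from apache_beam.io.gcp.gcsio import GcsIO

def read_file(element, local_path):
    # element is a gs:// path emitted by a previous step;
    # local_path is where this worker should keep its copy.
    with GcsIO().open(element, 'r') as gcs_file, open(local_path, 'wb') as local_file:
        # Fine for a ~57 MB clip; for larger videos reading in chunks
        # (or streaming) would presumably be preferable.
        local_file.write(gcs_file.read())
    return local_path

If that's roughly right, the downloader message would just be internal progress logging, and nothing lands on disk unless I write it somewhere myself.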
I have been messing with AWS MediaConvert via boto3, the Python library, and I find the docs incredibly confusing.
There are so many settings.
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/mediaconvert.html
and Amazon does an absolutely terrible job of labeling what is necessary for a basic job.
What would be the correct JSON for a simple job:
taking a video-with-audio file and turning it into a CMAF file,
and taking an audio-only file and turning it into a CMAF file?
I am trying to establish the baseline use of this technology, and there is so much extra that I don't know what I absolutely need and what is just there for specific use cases.
The solution is to use the MediaConvert UI in the AWS console, then use the copy JSON button, save the output, and use the JSON it created.
Never mind trying to create it yourself, unless you like pain.
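If it helps, here is a rough sketch of how the copied JSON might then be submitted with boto3. describe_endpoints and create_job are real MediaConvert operations; the region, role ARN, and job.json file name are placeholders you would swap for your own:

import json
import boto3

# MediaConvert requires an account-specific endpoint.
mc = boto3.client('mediaconvert', region_name='us-east-1')
endpoint = mc.describe_endpoints()['Endpoints'][0]['Url']
mc = boto3.client('mediaconvert', region_name='us-east-1', endpoint_url=endpoint)

# job.json is the JSON you copied out of the MediaConvert console.
with open('job.json') as f:
    job = json.load(f)

response = mc.create_job(
    Role='arn:aws:iam::123456789012:role/MediaConvertRole',  # placeholder role ARN
    Settings=job['Settings'],  # reuse the console-built settings verbatim
)
print(response['Job']['Id'])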
I'm new to Beam, so the whole triggering business really confuses me.
I have files that are uploaded regularly to GCS under a path that looks something like this: node-<num>/<table_name>/<timestamp>/files_parts,
and I need to write something that would trigger when all 8 parts of a file exist.
Their names are something like: file_1_part_1, file_1_part_2, file_2_part_1, file_2_part_2
(there could be multiple files' parts in the same dir, but if that's a problem I could ask for it to change).
Is there any way to create this trigger? And if not, what do you suggest I do instead?
Thanks!
If you are using the Java SDK, you can use the Watch transform to achieve this. I don't see a counterpart in the Python SDK, though.
I think it's better to write a program that polls the files in the GCS directory. When all 8 parts of a file are available, publish a message containing the file name to Pub/Sub or a similar product.
Then in your Beam pipeline, use the Pub/Sub topic as the streaming source to do your ETL.
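Something like the following (untested) sketch is what I have in mind for the poller, using the google-cloud-storage and google-cloud-pubsub client libraries. The bucket, topic, and prefix are placeholders, and you would also want to remember which files you have already announced so they aren't published twice:

import re
import time
from collections import defaultdict

from google.cloud import pubsub_v1, storage

BUCKET = 'my-bucket'                             # placeholder
TOPIC = 'projects/my-project/topics/file-ready'  # placeholder
PARTS_EXPECTED = 8

storage_client = storage.Client()
publisher = pubsub_v1.PublisherClient()
announced = set()

def poll_once(prefix):
    # Group object names by file and publish once all parts are present.
    parts = defaultdict(set)
    for blob in storage_client.list_blobs(BUCKET, prefix=prefix):
        m = re.search(r'(file_\d+)_part_(\d+)$', blob.name)
        if m:
            parts[m.group(1)].add(int(m.group(2)))
    for file_name, seen in parts.items():
        if len(seen) == PARTS_EXPECTED and file_name not in announced:
            publisher.publish(TOPIC, file_name.encode('utf-8'))
            announced.add(file_name)

while True:
    poll_once('node-1/my_table/')  # placeholder prefix
    time.sleep(60)

Your Beam pipeline then just reads from that topic with beam.io.ReadFromPubSub and does the ETL per file.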
I have a log file being stored in Amazon S3 every 10 minutes. I am trying to access weeks' and months' worth of these log files and read them into Python.
I have used boto to open and read every key and append all the logs together, but it's way too slow. I am looking for an alternative solution. Do you have any suggestions?
There is no functionality on Amazon S3 to combine or manipulate files.
I would recommend using the AWS Command-Line Interface (CLI) to synchronize files to a local directory using the aws s3 sync command. This can copy files in parallel and supports multi-part transfer for large files.
Running that command regularly can bring down a copy of the files, then your app can combine the files rather quickly.
If you do this from an Amazon EC2 instance, there is no charge for data transfer. If you download to a computer via the Internet, then Data Transfer charges apply.
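If you want to drive it from Python rather than cron, a simple sketch of this approach (bucket, prefix, and local paths are placeholders) is to shell out to the CLI and then concatenate the local copies:

import glob
import subprocess

# Mirror the bucket prefix to a local directory; sync only copies new/changed files.
subprocess.run(
    ['aws', 's3', 'sync', 's3://my-log-bucket/logs/', '/data/logs/'],
    check=True,
)

# Combining the files is fast once they are on local disk.
with open('/data/combined.log', 'wb') as combined:
    for path in sorted(glob.glob('/data/logs/**/*.log', recursive=True)):  # placeholder pattern
        with open(path, 'rb') as part:
            combined.write(part.read())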
Your first problem is that your naive solution is probably only using a single connection and isn't making full use of your network bandwidth. You can try to roll your own multi-threading support, but it's probably better to experiment with existing clients that already do this (s4cmd, aws-cli, s3gof3r).
Once you're making full use of your bandwidth, there are then some further tricks you can use to boost your transfer speed to S3.
Tip 1 of this SumoLogic article has some good info on these first two areas of optimization.
Also, note that you'll need to modify your key layout if you hope to consistently get above 100 requests per second.
Given a year's worth of this log file is only ~50k objects, a multi-connection client on a fast ec2 instance should be workable. However, if that's not cutting it, the next step up is to use EMR. For instance, you can use S3DistCP to concatenate your log chunks into larger objects that should be faster to pull down. (Or see this AWS Big Data blog post for some crazy overengineering) Alternatively, you can do your log processing in EMR with something like mrjob.
Finally, there's also Amazon's new Athena product that allows you to query data stored in S3 and may be appropriate for your needs.
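Before reaching for EMR, a rough sketch of the "multi-connection client" idea in plain boto3 looks like this (the bucket and prefix are placeholders; boto3 low-level clients are safe to share across threads):

from concurrent.futures import ThreadPoolExecutor

import boto3

BUCKET = 'my-log-bucket'  # placeholder
PREFIX = 'logs/2017/'     # placeholder

s3 = boto3.client('s3')

def list_keys(bucket, prefix):
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            yield obj['Key']

def fetch(key):
    # One GET per object; the thread pool keeps many connections busy.
    return key, s3.get_object(Bucket=BUCKET, Key=key)['Body'].read()

with ThreadPoolExecutor(max_workers=32) as pool:
    for key, body in pool.map(fetch, list_keys(BUCKET, PREFIX)):
        pass  # append body to your combined log, parse it, etc.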
I've been through the newest docs for the GCS client library and worked through the example. The sample code shows how to create a file/stream on the fly on GCS.
How do I upload existing files and directories from a local directory to a GCS bucket resumably (i.e., so that an upload can be resumed after an error), using the new client library? That is, this (can't post more than 2 links so h77ps://cloud.google.com/storage/docs/gspythonlibrary#uploading-objects) is deprecated.
Thanks all
P.S.
I do not need GAE functionality - this is going to sit on-premises and upload to GCS.
The Python API client can perform resumable uploads. See the documentation for examples. The important bit is:
media = MediaFileUpload('pig.png', mimetype='image/png', resumable=True)
Unfortunately, the library doesn't expose the upload ID itself, so while the upload call will resume uploads if there is an error, there's no way for your application to explicitly resume an upload. If, for instance, your application was terminated and you needed to resume the upload on restart, the library won't help you. If you need that level of retry, you'll have to use another tool or just directly invoke httplib.
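To make that concrete, a minimal sketch of a resumable upload with the API client looks roughly like this (the bucket and object names are placeholders, and default application credentials are assumed). next_chunk() retries transient errors within the call, but as noted, the upload ID is never surfaced:

from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

service = build('storage', 'v1')  # assumes default credentials are configured

media = MediaFileUpload('pig.png', mimetype='image/png',
                        resumable=True, chunksize=1024 * 1024)
request = service.objects().insert(
    bucket='my-bucket',  # placeholder bucket
    name='pig.png',
    media_body=media,
)

response = None
while response is None:
    # Uploads one chunk at a time; transient HTTP errors are retried inside
    # the library, but there is no handle to resume after a process restart.
    status, response = request.next_chunk()
    if status:
        print('Uploaded %d%%' % int(status.progress() * 100))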
The Boto library accomplishes this a little differently and DOES support keeping a persistable tracking token, in case your app crashes and needs to resume. Here's a quick example, stolen from Chromium's system tests:
from boto.gs.resumable_upload_handler import ResumableUploadHandler

# Set up other stuff normally
res_upload_handler = ResumableUploadHandler(
    tracker_file_name=tracker_file_name, num_retries=3)
dst_key.set_contents_from_file(src_file, res_upload_handler=res_upload_handler)
Since you're interested in the new hotness, the latest, greatest Python library for accessing Google Cloud Storage is probably APITools, which also provides for recoverable, resumable uploads and also has examples.
I want to download a file from SkyDrive programmatically using Python on Linux.
I can't use the API as it's a OneNote file and the API can't be used to download these.
My understanding is that SkyDrive supports WebDAV, and there are plenty of examples where people have mounted an SD folder using davfs2, but I just want to be able to grab a specific file without mounting.
I can use the API to get the document owner's cid, so I don't need to jump through any Windows-based hoops, but my (probably lame, I have not really researched WebDAV) efforts to download the file always result in an error.
For example using easywebdav:
import easywebdav
webdav = easywebdav.connect("d.docs.live.net/mycid")
webdav.download('me/skydrive/Documents/Getting\ Started', '/tmp/foo')
# this gives the 302 error mentioned in the comments at the end of the 'jumping through windows hoops' link I posted above.
Is there any workaround for the redirection problem I've seen mentioned?
Do I have this wrong, and when accessing files on a WebDAV share does it make sense, and is it indeed essential, to mount it as a file system?
If you are downloading a specific file, and already know the exact path/URL to that file (as per your example), I'm not sure that you really need to worry about the DAV extensions. Have you tried downloading the file using a simple HTTP GET, through something like urllib2?
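For example, a quick sketch with requests, which follows the 302 redirect for you (the URL is a placeholder built from the cid and path in your question, and the share may still require authentication that isn't shown here):

import requests

# Placeholder URL assembled from the cid and document path in the question.
url = 'https://d.docs.live.net/mycid/me/skydrive/Documents/Getting Started'

resp = requests.get(url, allow_redirects=True)
resp.raise_for_status()

with open('/tmp/foo', 'wb') as f:
    f.write(resp.content)

urllib2 follows redirects by default as well (via its HTTPRedirectHandler), so the same idea works there if you'd rather avoid an extra dependency.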