I'm new to Beam, so the whole triggering concept really confuses me.
I have files that are uploaded regularly to GCS, to a path that looks something like this: node-<num>/<table_name>/<timestamp>/files_parts
I need to write something that triggers when all 8 parts of a file exist.
Their names look something like this: file_1_part_1, file_1_part_2, file_2_part_1, file_2_part_2
(there could be multiple files' parts in the same directory, but if that's a problem I could ask for it to be changed).
Is there any way to create this trigger? And if not, what do you suggest I do instead?
Thanks!
If you are using the Java SDK, you can use the Watch transform to achieve this. I don't see a counterpart in the Python SDK, though.
I think it's better to write a program that polls the files in the GCS directory. When all 8 parts of a file are available, publish a message containing the file name to Pub/Sub or a similar product; a rough sketch of this is shown below.
Then, in your Beam pipeline, use the Pub/Sub topic as the streaming source for your ETL.
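A rough sketch of that polling approach, assuming the google-cloud-storage and google-cloud-pubsub client libraries; the bucket, prefix, topic name, and 60-second poll interval are all placeholders:

import time
from collections import defaultdict

from google.cloud import pubsub_v1, storage

BUCKET = "my-bucket"                                   # placeholder bucket
PREFIX = "node-1/my_table/"                            # placeholder path prefix to poll
TOPIC = "projects/my-project/topics/complete-files"    # placeholder Pub/Sub topic
PARTS_EXPECTED = 8

storage_client = storage.Client()
publisher = pubsub_v1.PublisherClient()
already_published = set()

while True:
    # Count how many parts of each file are currently in GCS under the prefix.
    parts_per_file = defaultdict(int)
    for blob in storage_client.list_blobs(BUCKET, prefix=PREFIX):
        # Blob names end in ..._part_<m>, so everything before "_part_" identifies the file.
        file_id = blob.name.rsplit("_part_", 1)[0]
        parts_per_file[file_id] += 1

    # Publish a message for every file whose 8 parts have all arrived.
    for file_id, count in parts_per_file.items():
        if count >= PARTS_EXPECTED and file_id not in already_published:
            publisher.publish(TOPIC, file_id.encode("utf-8"))
            already_published.add(file_id)

    time.sleep(60)

The Beam pipeline can then consume the topic with beam.io.ReadFromPubSub as its streaming source.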
I have been messing with AWS MediaConvert via boto3 (the Python library), and I find the docs incredibly confusing.
There are so many settings.
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/mediaconvert.html
and Amazon does an absolutely terrible job of labeling what is necessary for a basic job.
What would be the correct JSON for a simple job:
taking a video-with-audio file and turning it into a CMAF file,
and taking an audio-only file and turning it into a CMAF file?
I am trying to establish the baseline use of this technology, and there is so much extra that I don't know what I absolutely need and what are extra settings for specific use cases.
The solution for this is to use the MediaConvert UI in AWS, use the "Copy JSON" button to save the job settings, and then reuse the JSON it created.
Never mind trying to create it yourself, unless you like pain.
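Once you have that JSON, submitting it from Python is the easy part. Here is a minimal sketch; the region, role ARN, and job.json file name are placeholders, and it assumes the exported JSON has a top-level "Settings" key as the console's export normally does. MediaConvert also requires resolving your account-specific endpoint first:

import json

import boto3

REGION = "us-east-1"                                            # placeholder region
ROLE_ARN = "arn:aws:iam::123456789012:role/MediaConvertRole"    # placeholder IAM role

# MediaConvert uses an account-specific endpoint, so look it up before creating jobs.
bootstrap = boto3.client("mediaconvert", region_name=REGION)
endpoint = bootstrap.describe_endpoints()["Endpoints"][0]["Url"]
mediaconvert = boto3.client("mediaconvert", region_name=REGION, endpoint_url=endpoint)

# Job settings captured with the console's "Copy JSON" button.
with open("job.json") as f:
    settings = json.load(f)["Settings"]

job = mediaconvert.create_job(Role=ROLE_ARN, Settings=settings)
print(job["Job"]["Id"], job["Job"]["Status"])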
I am trying to get the HTTP status code and store it in an RDS table for later analysis, from a PySpark job that saves a file to S3 in Avro format using s3a. Once the file is saved, I know there will be a return status code from S3, but I am not sure how to record that in code. Please find the snippet of the code below.
import datetime


def s3_load(df, row):
    # Write the DataFrame as Avro to an s3a path built from the partner, table name, and date.
    df.write.\
        format("com.databricks.spark.avro").\
        save("s3a://Test-" + row["PARTNER"].lower() + "/" + row["TABLE_NAME"] + "/" +
             datetime.datetime.today().strftime('%Y%m%d'))
In the above code, I would like to get the returned status code.
Note: I am able to save the file to S3 in Avro format.
Thanks
This is a similar concept to the one discussed in this question about getting a status code from a library or function that wraps an S3 API: Amazon S3 POST, event when done?
Ultimately, if Databricks is the library handling the upload, the response code from the df.write.save(...) call would have to be found somewhere in the result of that Databricks function call.
Databricks supports s3 and s3a as target destinations for saving files (as shown in their docs here), but it doesn't appear that Databricks surfaces the response code from the underlying operations (maybe it does; I couldn't find it in any of the docs).
A few options for moving forward:
Assuming Databricks will throw "some" sort of error for a failed upload, a simple try/except will allow you to catch it properly (although any non-Databricks-level errors would still pass); see the sketch after this list.
On AWS, S3 bucket uploads are an event source that can be used as a trigger for other operations, like invoking an AWS Lambda function, which you can use to call an arbitrary cloud-hosted function. Lots of info is available on what this architecture would look like in this tutorial.
Depending on the need for parallel uploading, you could rewrite your small upload function using boto3, the official AWS Python library. How to handle those error/response codes is discussed here.
Databricks also seems to have audit-logging capabilities somewhere in its enterprise offering.
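For the first option, here is a minimal try/except sketch; there is no HTTP code to capture, so it records a success/failure string instead, and record_status_in_rds is a placeholder for your own RDS insert. Exactly what gets raised depends on the underlying Hadoop/S3A connector, so it catches broadly:

import datetime

from py4j.protocol import Py4JJavaError


def s3_load(df, row):
    path = ("s3a://Test-" + row["PARTNER"].lower() + "/" + row["TABLE_NAME"] + "/" +
            datetime.datetime.today().strftime('%Y%m%d'))
    try:
        df.write.format("com.databricks.spark.avro").save(path)
        status = "SUCCESS"
    except Py4JJavaError as e:
        # Errors bubbling up from the JVM / S3A layer during the write.
        status = "FAILED: " + str(e.java_exception)
    except Exception as e:
        # Anything else (driver-side errors, bad paths, ...).
        status = "FAILED: " + str(e)

    record_status_in_rds(path, status)  # placeholder: your own RDS insert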
So I'm still a rookie when it comes to coding in Python, but I was wondering if someone could be so kind as to help me with a problem.
A client I work for uses the eDiscovery system Venio. They have a web, app, database, and Linux server running on EC2 instances in AWS.
Right now, when customers upload docs to their server, they end up re-downloading the content to another drive, which causes extra work for them. There is also an issue of speed when it comes to serving up files on their system.
After setting up automated snapshots with a script in Lambda, I started thinking that storing their massive files in S3, behind CloudFront, might be a better way to go.
Does anyone know if there is a way to make a Python script that looks for keywords in a file (e.g. "Use", "Discard") and separates the files into different buckets automatically?
Any advice would be immensely appreciated!
UPDATE:
So here is a script I started:
import boto3

# Create an S3 client
s3 = boto3.client('s3')

filename = 'file.txt'
bucket_name = 'responsive-bucket'

keyword_bucket = {
    'use': 'responsive-bucket',
    'discard': 'non-responsive-bucket',
}
Essentially, what I want is this: when a client uploads a file through the web API, a Python script is triggered that looks for the keywords Responsive or Non-Responsive. Once it recognizes those keywords, it PUTs the files into the correspondingly named buckets. The responsive files will stay in a standard S3 bucket and the non-useful ones will go to an S3-IA bucket. After a set time, they are then lifecycled to Glacier.
Any help would be amazing!!!
If you can build a mapping of keywords => bucket names, you could use a dictionary. For example:
keyword_bucket = {
    'use': 'bucket_abc',
    'discard': 'bucket_xyz',
    'etc': 'bucket_whatever'
}
Then you open the file and search for your keywords. When a keyword matches, you use the dictionary above to find the corresponding bucket where the file should go.
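Building on the script in the question, a rough sketch of that lookup plus the upload; it reuses the bucket names from the dictionary above and assumes the file is small enough to read into memory:

import boto3

s3 = boto3.client('s3')

keyword_bucket = {
    'use': 'responsive-bucket',
    'discard': 'non-responsive-bucket',
}


def route_file(filename):
    # Read the file and look for the first keyword that appears in it.
    with open(filename) as f:
        contents = f.read().lower()

    for keyword, bucket in keyword_bucket.items():
        if keyword in contents:
            # Upload the file to the bucket mapped to that keyword.
            s3.upload_file(Filename=filename, Bucket=bucket, Key=filename)
            return bucket
    return None  # no keyword found; leave the file where it is


route_file('file.txt')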
I am pushing video to workers in a Cloud Dataflow pipeline. I have been advised to use Beam directly to manage my objects, but I can't figure out the best practice for downloading objects. I can see the class
Apache Beam IO GCP, so one could use it like so:
def read_file(element, local_path):
    with beam.io.gcp.gcsio.GcsIO().open(element, 'r') as f:
Where element is the GCS path read from a previous Beam step.
Checking out the available methods, the downloader looks like this:
f.downloader
Download with 57507840/57507840 bytes transferred from url https://www.googleapis.com/storage/v1/b/api-project-773889352370-testing/o/Clips%2F00011.MTS?generation=1493431837327161&alt=media
This message makes it seem like the file has been downloaded, and it has the right file size (57 MB). But where does it go? I would like to pass a variable (local_path) so that a subsequent process can handle the object. The class doesn't seem to accept a destination path; the file isn't in the current working directory, /tmp/, or the Downloads folder. I'm testing locally on OS X before I deploy.
Am I using this tool correctly? I know that streaming video bytes may be preferable for large videos; we'll get to that once I understand the basic functions. I'll open a separate question about streaming into memory (a named pipe?) to be read by OpenCV.
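For context, what I had imagined is something like the rough sketch below, assuming GcsIO().open returns an ordinary readable file-like object that I can copy to local_path myself:

import shutil

import apache_beam as beam


def read_file(element, local_path):
    # element is the gs://... path emitted by the previous step.
    with beam.io.gcp.gcsio.GcsIO().open(element, 'r') as src, open(local_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)  # stream the object to disk in chunks
    return local_path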
I have a log file being stored in Amazon S3 every 10 minutes. I am trying to access weeks' and months' worth of these log files and read them into Python.
I have used boto to open and read every key and append all the logs together, but it's way too slow. I am looking for an alternative solution. Do you have any suggestions?
There is no functionality on Amazon S3 to combine or manipulate files.
I would recommend using the AWS Command-Line Interface (CLI) to synchronize files to a local directory using the aws s3 sync command. This can copy files in parallel and supports multi-part transfer for large files.
Running that command regularly will bring down a copy of the files; your app can then combine them rather quickly.
If you do this from an Amazon EC2 instance, there is no charge for data transfer. If you download to a computer via the Internet, then Data Transfer charges apply.
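A small sketch of that pattern, shelling out to the CLI and then concatenating whatever landed locally; the bucket, prefix, local directory, and .log extension are placeholders:

import pathlib
import subprocess

BUCKET_PREFIX = "s3://my-log-bucket/logs/"   # placeholder source
LOCAL_DIR = pathlib.Path("logs")             # placeholder local mirror

# aws s3 sync only copies new or changed objects and transfers them in parallel.
subprocess.run(["aws", "s3", "sync", BUCKET_PREFIX, str(LOCAL_DIR)], check=True)

# Combine the synced log files into one for analysis.
with open("combined.log", "w") as combined:
    for log_file in sorted(LOCAL_DIR.rglob("*.log")):
        combined.write(log_file.read_text())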
Your first problem is that your naive solution is probably only using a single connection and isn't making full use of your network bandwidth. You can try to roll your own multi-threading support, but it's probably better to experiment with existing clients that already do this (s4cmd, aws-cli, s3gof3r).
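If you do want to roll your own, here is a minimal sketch with boto3 and a thread pool; the bucket, prefix, local directory, and worker count are placeholders, and the existing clients listed above already do this and more:

import concurrent.futures
import os

import boto3

BUCKET = "my-log-bucket"    # placeholder bucket
PREFIX = "logs/"            # placeholder key prefix
WORKERS = 32                # tune to your bandwidth

s3 = boto3.client("s3")
os.makedirs("downloads", exist_ok=True)


def download(key):
    local_path = os.path.join("downloads", key.replace("/", "_"))
    s3.download_file(BUCKET, key, local_path)
    return local_path


# List every key under the prefix, then download them in parallel.
keys = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    keys.extend(obj["Key"] for obj in page.get("Contents", []))

with concurrent.futures.ThreadPoolExecutor(max_workers=WORKERS) as pool:
    local_files = list(pool.map(download, keys))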
Once you're making full use of your bandwidth, there are then some further tricks you can use to boost your transfer speed to S3.
Tip 1 of this SumoLogic article has some good info on these first two areas of optimization.
Also, note that you'll need to modify your key layout if you hope to consistently get above 100 requests per second.
Given that a year's worth of this log file is only ~50k objects, a multi-connection client on a fast EC2 instance should be workable. However, if that's not cutting it, the next step up is to use EMR. For instance, you can use S3DistCp to concatenate your log chunks into larger objects that should be faster to pull down. (Or see this AWS Big Data blog post for some crazy over-engineering.) Alternatively, you can do your log processing in EMR with something like mrjob.
Finally, there's also Amazon's new Athena product that allows you to query data stored in S3 and may be appropriate for your needs.