Python script that moves specific files between S3 buckets

So I'm still a rookie when it comes to coding in Python, but I was wondering if someone could be so kind as to help me with a problem.
A client I work for uses the eDiscovery system Venio. They have web, app, database, and Linux servers running on EC2 instances in AWS.
Right now, when customers upload docs to their server, they end up re-downloading the content to another drive, which creates extra work for them. There is also a speed issue when it comes to serving up files on their system.
After setting up automated snapshots with a script in Lambda, I started thinking that storing their massive files in S3, behind CloudFront, might be a better way to go.
Does anyone know if there is a way to make a Python script that looks for keywords in a file (e.g. "Use", "Discard") and sorts the files into different buckets automatically?
Any advice would be immensely appreciated!
UPDATE:
So here is a script I started:
import boto3
# Creates S3 client
s3 = boto3.client('s3')
filename = 'file.txt'
bucket_name = 'responsive-bucket'
keyword_bucket = {
    'use': 'responsive-bucket',
    'discard': 'non-responsive-bucket',
}
Essentially, what I want is this: when a client uploads a file through the web API, a Python script triggers that looks for the keywords Responsive or Non-Responsive. Once it recognizes those keywords, it PUTs the files into the correspondingly named buckets. The responsive files will stay in a standard S3 bucket and the non-useful ones will go to an S3-IA (Infrequent Access) bucket. After a set time, they are then lifecycled to Glacier.
Any help would be amazing!!!
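Here's a rough sketch of the flow I have in mind, assuming an S3 upload event triggers a Lambda function; the bucket names are placeholders and the keyword matching is deliberately simplistic:
import boto3

s3 = boto3.client('s3')

# 'non-responsive' is checked first because 'responsive' is a substring of it.
keyword_bucket = {
    'non-responsive': 'non-responsive-bucket',
    'responsive': 'responsive-bucket',
}

def lambda_handler(event, context):
    # Triggered by an S3 ObjectCreated event on the upload bucket.
    for record in event['Records']:
        source_bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']

        # Read the uploaded object and look for a routing keyword.
        body = s3.get_object(Bucket=source_bucket, Key=key)['Body'].read().decode('utf-8').lower()
        for keyword, destination in keyword_bucket.items():
            if keyword in body:
                # Copy into the matching bucket; lifecycle rules on that bucket handle S3-IA/Glacier.
                s3.copy_object(CopySource=f"{source_bucket}/{key}", Bucket=destination, Key=key)
                break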

If you can build a mapping of keywords => bucket names, you could use a dictionary. For example:
keyword_bucket = {
    'use': 'bucket_abc',
    'discard': 'bucket_xyz',
    'etc': 'bucket_whatever'
}
Then you open the file and search for your keywords. When a keyword matches, you use the dictionary above to find the corresponding bucket where the file should go.
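For example, a minimal (untested) sketch along those lines, assuming the keywords appear somewhere in the file's text; the file name and bucket names are placeholders:
import boto3

s3 = boto3.client('s3')

keyword_bucket = {
    'use': 'bucket_abc',
    'discard': 'bucket_xyz',
}

filename = 'file.txt'

# Open the file and find the first keyword that appears in it.
with open(filename) as f:
    contents = f.read().lower()

for keyword, bucket_name in keyword_bucket.items():
    if keyword in contents:
        # Upload the file to the bucket mapped to that keyword.
        s3.upload_file(filename, bucket_name, filename)
        break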

Related

Reading File from Vertex AI and Google Cloud Storage

I am trying to set up a pipeline in GCP/Vertex AI and am having a lot of trouble. The pipeline is written using Kubeflow Pipelines and has many different components; one thing in particular is giving me trouble, however. Eventually I want to launch this from a Cloud Function with the help of Cloud Scheduler.
The part that is giving me issues is fairly simple, and I believe I just need some kind of introduction to how I should be thinking about this setup. I simply want to read and write files (they might be .csv, .txt or similar). I imagine that the analog in GCP to the filesystem on my local machine is Cloud Storage, so this is where I have been trying to read from for the time being (please correct me if I'm wrong). The component I've built is a blatant rip-off of this post and looks like this.
@component(
    packages_to_install=["google-cloud"],
    base_image="python:3.9"
)
def main():
    import csv
    from io import StringIO
    from google.cloud import storage

    BUCKET_NAME = "gs://my_bucket"
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(BUCKET_NAME)
    blob = bucket.blob('test/test.txt')
    blob = blob.download_as_string()
    blob = blob.decode('utf-8')
    blob = StringIO(blob)  # transform bytes to string here
    names = csv.reader(blob)  # then use csv library to read the content
    for name in names:
        print(f"First Name: {name[0]}")
The error I'm getting looks like the following:
google.api_core.exceptions.NotFound: 404 GET https://storage.googleapis.com/storage/v1/b/gs://pipeline_dev?projection=noAcl&prettyPrint=false: Not Found
What's going wrong in my brain? I get the feeling that it shouldn't be this difficult to read and write files. I must be missing something fundamental? Any help is highly appreciated.
Try specifying the bucket name without the gs:// prefix. This should fix the issue. One more Stack Overflow post that says the same thing: Cloud Storage python client fails to retrieve bucket
Any storage bucket you try to access in GCP has a unique address, and that address always starts with gs://, which marks it as a Cloud Storage URL. The GCS APIs, however, are designed to work with just the bucket name, so that is all you pass them. If you were accessing the bucket via a browser, you would need the complete address, and hence the gs:// prefix as well.
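In other words, a minimal sketch of the corrected call (bucket and object names are placeholders):
from google.cloud import storage

storage_client = storage.Client()

# Pass the bare bucket name; the gs:// prefix belongs only in URLs/URIs.
bucket = storage_client.get_bucket("my_bucket")  # not "gs://my_bucket"
blob = bucket.blob("test/test.txt")
content = blob.download_as_string().decode("utf-8")
print(content)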

apache beam trigger when all necessary files in gcs bucket is uploaded

I'm new to Beam, so the whole triggering business really confuses me.
I have files that are uploaded regularly to GCS, to a path that looks something like this: node-<num>/<table_name>/<timestamp>/files_parts
and I need to write something that triggers when all 8 parts of a file exist.
Their names are something like this: file_1_part_1, file_1_part_2, file_2_part_1, file_2_part_2
(there could be parts of multiple files in the same dir, but if that's a problem I could ask for it to change).
Is there any way to create this trigger? And if not, what do you suggest I do instead?
Thanks!
If you are using the Java SDK, you can use the Watch transform to achieve this. I don't see a counterpart in the Python SDK though.
I think it's better to write a program that polls the files in the GCS directory. When all 8 parts of a file are available, publish a message containing the file name to Pub/Sub or a similar product.
Then, in your Beam pipeline, use the Pub/Sub topic as the streaming source to do your ETL.
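For what it's worth, a rough (untested) sketch of that polling idea, using the google-cloud-storage and google-cloud-pubsub clients; the bucket, topic and prefix below are placeholders:
import time

from google.cloud import pubsub_v1, storage

BUCKET = "my-upload-bucket"                      # placeholder
TOPIC = "projects/my-project/topics/file-ready"  # placeholder
PARTS_EXPECTED = 8

storage_client = storage.Client()
publisher = pubsub_v1.PublisherClient()

def publish_when_complete(prefix: str) -> bool:
    # Count the objects currently under this file's prefix, e.g. ".../file_1_part_".
    parts = list(storage_client.list_blobs(BUCKET, prefix=prefix))
    if len(parts) < PARTS_EXPECTED:
        return False
    # All parts are there: tell the Beam pipeline (via Pub/Sub) that this file is ready.
    publisher.publish(TOPIC, data=prefix.encode("utf-8"))
    return True

# Poll until, say, file_1 has all eight parts uploaded.
while not publish_when_complete("node-1/my_table/2021-01-01/file_1_part_"):
    time.sleep(60)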

How to send/copy/upload file from AWS S3 to Google GCS using Python

I'm looking for a Pythonic way to copy a file from AWS S3 to GCS.
I do not want to open/read the file and then use the blob.upload_from_string() method. I want to transfer it 'as-is'.
I cannot use gsutil. The scope of the libraries I'm working with is gcloud and boto3 (I also experimented with s3fs).
Here is a simple example (that seems to work) using the blob.upload_from_string() method, which I'm trying to avoid because I don't want to open/read the file. I fail to make it work using the blob.upload_from_file() method because the GCS API requires an accessible, readable, file-like object, which I fail to properly provide.
What am I missing? Suggestions?
import boto3
from gcloud import storage
from oauth2client.service_account import ServiceAccountCredentials
GSC_Token_File = 'path/to/GSC_token'
s3 = boto3.client('s3', region_name='MyRegion') # im running from AWS Lambda, no authentication required
gcs_credentials = ServiceAccountCredentials.from_json_keyfile_dict(GSC_Token_File)
gcs_storage_client = storage.Client(credentials=gcs_credentials, project='MyGCP_project')
gcs_bucket = gcs_storage_client.get_bucket('MyGCS_bucket')
s3_file_to_load = str(s3.get_object(Bucket='MyS3_bucket', Key='path/to/file_to_copy.txt')['Body'].read().decode('utf-8'))
blob = gcs_bucket.blob('file_to_copy.txt')
blob.upload_from_string(s3_file_to_load)
So I poked around a bit more and came across this article, which eventually led me to this solution. Apparently the GCS API can be called using the AWS boto3 SDK.
Please mind the HMAC key prerequisite, which can easily be set up using these instructions.
import boto3

# I'm using a GCP Service Account, so my HMAC key was created accordingly.
# An HMAC key for a User Account can be created just as well.
service_Access_key = 'YourAccessKey'
service_Secret = 'YourSecretKey'

# Reminder: I am copying from S3 to GCS
s3_client = boto3.client('s3', region_name='MyRegion')
gcs_client = boto3.client(
    "s3",  # yes, really the "s3" client, just pointed at the GCS endpoint
    region_name="auto",
    endpoint_url="https://storage.googleapis.com",
    aws_access_key_id=service_Access_key,
    aws_secret_access_key=service_Secret,
)

file_to_transfer = s3_client.get_object(Bucket='MyS3_bucket', Key='path/to/file_to_copy.txt')
gcs_client.upload_fileobj(file_to_transfer['Body'], 'MyGCS_bucket', 'file_to_copy.txt')
I understand you're trying to move files from S3 to GCS using Python in an AWS Lambda function. There is one thing I'd like to clarify about the statement "I don't want to open/read the file": when the file is downloaded from S3 you are indeed reading it and writing it somewhere, be it into an in-memory string or a temporary file. In that sense it actually doesn't matter which of blob.upload_from_file() or blob.upload_from_string() is used, as they're equivalent; the first reads from a file and the second doesn't because the data has already been read into memory. Therefore my suggestion would be to keep the code as it is; I don't see a benefit in changing it.
Anyway, the file-based approach should be possible with something along the lines below (untested, I have no S3 to check):
# From the S3 boto3 docs: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-example-download-file.html
s3.download_file('BUCKET_NAME', 'OBJECT_NAME', 'FILE_NAME')
# upload_from_filename() takes a path; upload_from_file() expects an open file object.
blob.upload_from_filename('FILE_NAME')
Finally, it is worth mentioning the Storage Transfer Service, which is intended for moving huge amounts of data from S3 to GCS. If that sounds like your use case, you may want to take a look at the code samples for Python.

Using Boto3 to upload a file to Amazon WorkDocs

According to the Amazon WorkDocs SDK page, you can use Boto3 to migrate your content to Amazon WorkDocs. I found the entry for the WorkDocs client in the Boto3 documentation, but every call seems to require an "AuthenticationToken" parameter. The only information I can find on AuthenticationToken is that it is supposed to be an "Amazon WorkDocs authentication token".
Does anyone know what this token is? How do I get one? Are there any code examples of using the WorkDocs client in Boto3?
I am trying to create a simple Python script that will upload a single document into WorkDocs, but there seems to be little to no information on how to do this. I was easily able to write a script that can upload/download files from S3, but this seems like something else entirely.
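In case it helps, here is a minimal (untested) sketch of the upload flow the WorkDocs API documents: initiate an upload, PUT the file's bytes to the pre-signed URL it returns, then mark the version ACTIVE. The folder ID and file name are placeholders, and as far as I can tell the AuthenticationToken parameter can be left out when you call the API with IAM (administrator) credentials rather than on behalf of a WorkDocs user:
import boto3
import requests

workdocs = boto3.client('workdocs')

PARENT_FOLDER_ID = 'your-workdocs-folder-id'  # placeholder
FILE_PATH = 'document.pdf'                    # placeholder

# Step 1: ask WorkDocs for a pre-signed upload URL for a new document version.
resp = workdocs.initiate_document_version_upload(
    ParentFolderId=PARENT_FOLDER_ID,
    Name='document.pdf',
    ContentType='application/pdf',
)
document_id = resp['Metadata']['Id']
version_id = resp['Metadata']['LatestVersionMetadata']['Id']
upload = resp['UploadMetadata']

# Step 2: PUT the file contents to that URL using the signed headers WorkDocs returned.
with open(FILE_PATH, 'rb') as f:
    requests.put(upload['UploadUrl'], data=f, headers=upload['SignedHeaders'])

# Step 3: mark the uploaded version as ACTIVE so it becomes visible in WorkDocs.
workdocs.update_document_version(
    DocumentId=document_id,
    VersionId=version_id,
    VersionStatus='ACTIVE',
)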

Local access to Amazon S3 Bucket from EC2 instance

I have an EC2 instance and an S3 bucket in the same region. The bucket contains reasonably large (5-20 MB) files that are used regularly by my EC2 instance.
I want to programmatically open the file on my EC2 instance (using Python). Like so:
file_from_s3 = open('http://s3.amazonaws.com/my-bucket-name/my-file-name')
But using an "http" URL to access the file remotely seems grossly inefficient; surely this would mean downloading the file to the server every time I want to use it.
What I want to know is, is there a way I can access S3 files locally from my EC2 instance, for example:
file_from_s3 = open('s3://my-bucket-name/my-file-name')
I can't find a solution myself, any help would be appreciated, thank you.
Whatever you do, the object will be downloaded behind the scenes from S3 into your EC2 instance. That cannot be avoided.
If you want to treat files in the bucket as local files, you need to install one of several S3 filesystem plugins for FUSE (for example, s3fs-fuse). Alternatively, you can use boto for easy access to S3 objects from Python code.
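For the boto route, a minimal sketch using boto3 (the bucket and key names are the placeholders from the question, and the instance's IAM role is assumed to grant access to the bucket):
import boto3

s3 = boto3.client('s3')  # picks up the EC2 instance role credentials automatically

# Option 1: download once to local disk, then open it like any other local file.
s3.download_file('my-bucket-name', 'my-file-name', '/tmp/my-file-name')
with open('/tmp/my-file-name', 'rb') as f:
    data = f.read()

# Option 2: stream the object body straight into memory, with no copy on disk.
body = s3.get_object(Bucket='my-bucket-name', Key='my-file-name')['Body'].read()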
