Transfer files from one S3 bucket to another keeping folder structure - python boto

I have found many questions related to this with solutions using boto3; however, I am in a position where I have to use boto, running Python 2.38.
I can successfully transfer my files in their folders (not real folders, I know, as S3 doesn't have this concept), but I want them to be saved into a particular folder in my destination bucket:
from boto.s3.connection import S3Connection

def transfer_files():
    conn = S3Connection()
    srcBucket = conn.get_bucket("source_bucket")
    dstBucket = conn.get_bucket(bucket_name="destination_bucket")
    objectlist = srcBucket.list()
    for obj in objectlist:
        # copies each key to the same path in the destination bucket
        dstBucket.copy_key(obj.key, srcBucket.name, obj.key)
A key in my srcBucket will look like folder/subFolder/anotherSubFolder/file.txt, which when transferred lands in the dstBucket as destination_bucket/folder/subFolder/anotherSubFolder/file.txt.
I would like it to end up under destination_bucket/targetFolder, so the final structure would look like
destination_bucket/targetFolder/folder/subFolder/anotherSubFolder/file.txt
Hopefully I have explained this well enough and it makes sense.

The first parameter of copy_key() is the name of the destination key.
Therefore, just use:
dstBucket.copy_key('targetFolder/' + obj.key, srcBucket.name, obj.key)
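Applied to the loop from the question, the whole transfer might look like this (a sketch, reusing the bucket names above and the targetFolder/ prefix):

from boto.s3.connection import S3Connection

def transfer_files():
    conn = S3Connection()
    srcBucket = conn.get_bucket("source_bucket")
    dstBucket = conn.get_bucket("destination_bucket")
    for obj in srcBucket.list():
        # Prepend the target prefix so every copy lands under targetFolder/
        dstBucket.copy_key('targetFolder/' + obj.key, srcBucket.name, obj.key)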

Related

Create a new bucket inside a bucket in Google Cloud Storage using Python

I need to create a folder inside a folder in Google Cloud Storage using Python.
I know how to create one folder:
from google.cloud import storage

storage_client = storage.Client()

bucket_name = 'data_bucket_10_07'
# create a new bucket
bucket = storage_client.bucket(bucket_name)
bucket.storage_class = 'COLDLINE'  # Archive | Nearline | Standard
bucket.location = 'US'  # Taiwan
bucket = storage_client.create_bucket(bucket)  # returns Bucket object
my_bucket = storage_client.get_bucket(bucket_name)
When I try to change bucket_name = 'data_bucket_10_07' to bucket_name = 'data_bucket_10_07/data_bucket_10_07_1', I get an error:
google.api_core.exceptions.BadRequest: 400 POST https://storage.googleapis.com/storage/v1/b?project=effective-forge-317205&prettyPrint=false: Invalid bucket name: 'data_bucket_10_07/data_bucket_10_07_1'
How should I solve my problem?
As John mentioned in the comment, it may not be ontologically possible to have a bucket inside a bucket.
See Bucket naming guidelines for documentation details.
In a nutshell:
There is only one level of buckets in a global namespace (thus a bucket name has to be globally unique). Everything beyond the bucket name belongs to the object name.
For example, you can create a bucket (assuming the name is not already in use) like data_bucket_10_07. In that case, it looks like gs://data_bucket_10_07.
Then, you probably would like to store some objects (files) in such a way that it looks like a directory hierarchy; say there are a /01/data.csv object and a /02/data.csv object, where 01 and 02 presumably reflect some date.
Those /01/ and /02/ elements are essentially the beginning parts of the object names (in other words, prefixes for the objects).
So far the bucket name is gs://data_bucket_10_07
The object names are /01/data.csv and /02/data.csv
I would suggest checking the Object naming guidelines documentation, where those ideas are described much better than I can do in one sentence.
Other comments do a great job of detailing that nested buckets are not possible, but they only touch briefly on the following: GCS does not rely on folders, and only presents contents with a hierarchical structure in the web UI for ease of use.
From the documentation:
Cloud Storage operates with a flat namespace, which means that folders don't actually exist within Cloud Storage. If you create an object named folder1/file.txt in the bucket your-bucket, the path to the object is your-bucket/folder1/file.txt. There is no folder1 folder, just a single object with folder1 as part of its name.
So if you'd like to create a "folder" for organization and immediately place an object in it, name your object with the "folders" ahead of the name, and GCS will take care of 'creating' them if they don't already exist.
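For example, a minimal sketch of placing a file under a "folder" (assuming the data_bucket_10_07 bucket from the question already exists, and using a hypothetical local file data.csv):

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('data_bucket_10_07')

# The prefix is just part of the object name; the console renders it as a folder.
blob = bucket.blob('data_bucket_10_07_1/data.csv')
blob.upload_from_filename('data.csv')  # hypothetical local file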

How to save files from s3 into current jupyter directory

I am working with Python and Jupyter Notebook, and would like to open files from an S3 bucket into my current Jupyter directory.
I have tried:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket')
for obj in bucket.objects.all():
    key = obj.key
    body = obj.get()['Body'].read()
but I believe this is just reading them, and I would like to save them into this directory. Thank you!
You can use the AWS Command Line Interface (CLI), specifically the aws s3 cp command, to copy files to your local directory.
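If you'd rather stay in Python, a rough boto3 equivalent (a sketch, assuming the bucket name 'bucket' from the question and downloading into the notebook's working directory):

import os
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket')

for obj in bucket.objects.all():
    if obj.key.endswith('/'):
        continue  # skip folder placeholder objects
    # Recreate the key's "folder" structure locally before downloading
    local_dir = os.path.dirname(obj.key)
    if local_dir:
        os.makedirs(local_dir, exist_ok=True)
    bucket.download_file(obj.key, obj.key)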
Late response, but I was struggling with this earlier today and thought I'd throw in my solution. I needed to work with a bunch of PDFs stored on S3 using Jupyter Notebooks on SageMaker.
I used a workaround by downloading the files to my repo, which works a lot faster than uploading them and makes my code reproducible for anyone with access to S3.
Step 1
Create a list of all the objects to be downloaded, then split each element by '/' so that the file name can be extracted for iteration in Step 2.
import awswrangler as wr
objects = wr.s3.list_objects({"s3 URI"})
objects_list = [obj.split('/') for obj in objects]
Step 2
Make a local folder called data, then iterate through the object list to download each file into that folder.
import boto3
import os

os.makedirs("./data")
s3_client = boto3.client('s3')
for obj in objects_list:
    s3_client.download_file({'bucket'},                 # can also use obj[2]
                            {"object_path"} + obj[-1],  # object_path is everything after the bucket in your S3 URI
                            './data/' + obj[-1])        # save into the local data folder
That's it! First time answering anything on here, so I hope it's useful to someone.

How to get top-level folders in an S3 bucket using boto3?

I have an S3 bucket with a few top-level folders and hundreds of files in each of these folders. How do I get the names of these top-level folders?
I have tried the following:
import boto3

s3 = boto3.resource('s3', region_name='us-west-2', endpoint_url='https://s3.us-west-2.amazonaws.com')
bucket = s3.Bucket('XXX')
for obj in bucket.objects.filter(Prefix='', Delimiter='/'):
    print(obj.key)
But this doesn't seem to work. I have thought about using regex to filter all the folder names, but this doesn't seem time efficient.
Thanks in advance!
Try this.
import boto3
client = boto3.client('s3')
paginator = client.get_paginator('list_objects')
result = paginator.paginate(Bucket='my-bucket', Delimiter='/')
for prefix in result.search('CommonPrefixes'):
    print(prefix.get('Prefix'))
The Amazon S3 data model is a flat structure: you create a bucket, and the bucket stores objects. There is no hierarchy of subbuckets or subfolders; however, you can infer logical hierarchy using key name prefixes and delimiters as the Amazon S3 console does (source)
In other words, there's no way around iterating all of the keys in the bucket and extracting whatever structure you want to see (depending on your needs, a dict-of-dicts may be a good approach for you).
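A minimal sketch of that dict-of-dicts idea (assuming a bucket named 'my-bucket'):

import boto3

s3 = boto3.client('s3')
tree = {}

paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my-bucket'):
    for obj in page.get('Contents', []):
        # Walk the key's "folders" and build a nested dict; leaves hold object sizes
        node = tree
        parts = obj['Key'].split('/')
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = obj['Size']

print(list(tree.keys()))  # top-level names (folders, plus any top-level files)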
You could also use Amazon Athena in order to analyse/query S3 buckets.
https://aws.amazon.com/athena/

How to get objects from a folder in an S3 bucket

I am trying to traverse all objects inside a specific folder in my S3 bucket. The code I already have is as follows:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket-name')
for obj in bucket.objects.filter(Prefix='folder/'):
    do_stuff(obj)
I need to use boto3.resource and not client. This code is not getting any objects at all although I have a bunch of text files in the folder. Can someone advise?
Try adding the Delimiter attribute, Delimiter='/', since you are filtering objects. The rest of the code looks fine.
I had to make sure to skip the first entry: the folder name itself shows up as the first key (folders created in the S3 console are stored as zero-byte placeholder objects), and that may not be what you want.
for video_item in source_bucket.objects.filter(Prefix="my-folder-name/", Delimiter='/'):
    if video_item.key == 'my-folder-name/':
        continue
    do_something(video_item.key)

Inserting items in sub-buckets on S3 using boto

Say I have the following bucket set up on S3
MyBucket/MySubBucket
And, I am planning on serving static media for a website out of MySubBucket (like images users have uploaded, etc.). The way S3 appears to be currently set up, there is no way to make the top-level bucket "MyBucket" public, so to mimic this you need to make every individual item in that bucket public after it has been inserted.
You can make a sub-bucket like "MySubBucket" public, but the problem I am currently having is figuring out how to call boto to insert a test image into that sub-bucket. Can anyone provide an example?
Thanks
You actually just need to put the sub-bucket name and a / before the file name. The following code snippet will put a file called 'test.jpg' into MySubBucket. The k.make_public() call makes that file public, but not the bucket itself.
from boto.s3.connection import S3Connection
from boto.s3.key import Key
connection = S3Connection(AWSKEY, AWS_SECRET_KEY)
bucket = connection.get_bucket('MyBucket')
file = 'test.jpg'
k = Key(bucket)
k.key = 'MySubBucket/' + file
k.set_contents_from_filename(file)
k.make_public()
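If you are uploading many media files, the same pattern can be wrapped in a loop (a sketch, keeping the AWSKEY/AWS_SECRET_KEY placeholders from the snippet above and using hypothetical file names):

from boto.s3.connection import S3Connection
from boto.s3.key import Key

connection = S3Connection(AWSKEY, AWS_SECRET_KEY)
bucket = connection.get_bucket('MyBucket')

for filename in ['logo.png', 'banner.jpg']:  # hypothetical local files
    k = Key(bucket)
    k.key = 'MySubBucket/' + filename
    k.set_contents_from_filename(filename)
    k.make_public()  # each object must be made public individually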
