Trying to get a count of objects in an S3 folder.
Current code:
import boto3

bucket = 'some-bucket'
File = 'someLocation/File/'
objs = boto3.client('s3').list_objects_v2(Bucket=bucket, Prefix=File)
fileCount = objs['KeyCount']
This gives me a count that is one more than the actual number of objects in S3.
Maybe it is counting "File" as a key too?
Assuming you want to count the keys in a bucket and don't want to hit the 1000-key limit of list_objects_v2: the code below worked for me, but I'm wondering if there is a better, faster way to do it. I looked for a packaged function in the boto3 S3 connector, but there isn't one!
# connect to S3 - assuming your creds are all set up and you have boto3 installed
import boto3

s3 = boto3.resource('s3')

# identify the bucket - you can use a prefix if you know what your bucket name starts with
for bucket in s3.buckets.all():
    print(bucket.name)

# get the bucket
bucket = s3.Bucket('my-s3-bucket')

# use a loop and a counter
count_obj = 0
for i in bucket.objects.all():
    count_obj = count_obj + 1
print(count_obj)
If there are more than 1000 entries, you need to use paginators, like this:
import boto3

count = 0
client = boto3.client('s3')
paginator = client.get_paginator('list_objects')
for result in paginator.paginate(Bucket='your-bucket', Prefix='your-folder/', Delimiter='/'):
    count += len(result.get('CommonPrefixes', []))
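Note that the snippet above counts the immediate subfolders (CommonPrefixes). If the goal is the total number of objects under the prefix, a minimal sketch along the same lines (bucket and prefix names are placeholders) sums KeyCount across pages:
import boto3

# Count every object under a prefix, past the 1000-key page limit.
# KeyCount also counts any zero-byte "folder" placeholder objects.
client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')

count = 0
for page in paginator.paginate(Bucket='your-bucket', Prefix='your-folder/'):
    count += page.get('KeyCount', 0)
print(count)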
"Folders" do not actually exist in Amazon S3. Instead, all objects have their full path as their filename ('Key'). I think you already know this.
However, it is possible to 'create' a folder by creating a zero-length object that has the same name as the folder. This causes the folder to appear in listings and is what happens if folders are created via the management console.
Thus, you could exclude zero-length objects from your count.
For an example, see: Determine if folder or file key - Boto
The following code worked perfectly:
import boto3

def getNumberOfObjectsInBucket(bucketName, prefix):
    count = 0
    response = boto3.client('s3').list_objects_v2(Bucket=bucketName, Prefix=prefix)
    for object in response['Contents']:
        if object['Size'] != 0:
            # print(object['Key'])
            count += 1
    return count
Checking object['Size'] == 0 gives you the folder placeholder keys, if you want to inspect them; object['Size'] != 0 gives you all non-folder keys.
Sample function call below:
getNumberOfObjectsInBucket('foo-test-bucket','foo-output/')
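Note that list_objects_v2 returns at most 1000 keys per call, so the function above only covers the first page. A sketch combining the size filter with a paginator (the function name here is my own) could look like this:
import boto3

def count_non_folder_objects(bucket_name, prefix):
    # Count keys under a prefix, skipping zero-byte folder placeholders,
    # across all result pages.
    paginator = boto3.client('s3').get_paginator('list_objects_v2')
    count = 0
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        for obj in page.get('Contents', []):
            if obj['Size'] != 0:
                count += 1
    return count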
If you have credentials to access that bucket, you can use this simple code. The code below gives you a list; a list comprehension is used for readability.
A filter is used because, inside a bucket, folder names are just part of the keys that identify the files, as John Rotenstein explained concisely.
import boto3
bucket = "Sample_Bucket"
folder = "Sample_Folder"
s3 = boto3.resource("s3")
s3_bucket = s3.Bucket(bucket)
files_in_s3 = [f.key.split(folder + "/")[1] for f in s3_bucket.objects.filter(Prefix=folder).all()]
I'm trying to copy files from GCS to a different location, and I need to do it in real time with a Cloud Function.
I have created a function and it's working, but the issue is that the file gets copied multiple times, with nested folders.
E.g.:
Source file path: gs://logbucket/mylog/2020/07/22/log.csv
Expected target: gs://logbucket/hivelog/2020/07/22/log.csv
My code:
from google.cloud import storage

def hello_gcs_generic(data, context):
    sourcebucket = format(data['bucket'])
    source_file = format(data['name'])
    year = source_file.split("/")[1]
    month = source_file.split("/")[2]
    day = source_file.split("/")[3]
    filename = source_file.split("/")[4]
    print(year)
    print(month)
    print(day)
    print(filename)
    print(sourcebucket)
    print(source_file)

    storage_client = storage.Client()
    source_bucket = storage_client.bucket(sourcebucket)
    source_blob = source_bucket.blob(source_file)
    destination_bucket = storage_client.bucket(sourcebucket)
    destination_blob_name = 'hivelog/year=' + year + '/month=' + month + '/day=' + day + '/' + filename

    blob_copy = source_bucket.copy_blob(
        source_blob, destination_bucket, destination_blob_name
    )
    source_blob.delete()
    print(
        "Blob {} in bucket {} copied to blob {} in bucket {}.".format(
            source_blob.name,
            source_bucket.name,
            blob_copy.name,
            destination_bucket.name,
        )
    )
Output:
You can see the duplicated prefix, year=year=2020. How does this happen? Inside it I have folders like year=year=2020/month=month=07/.
I'm not able to fix this.
You're writing to the same bucket that you're trying to copy from:
destination_bucket = storage_client.bucket(sourcebucket)
Every time you add a new file to the bucket, it's triggering the Cloud Function again.
You either need to use two different buckets, or add a conditional based on the first part of the path:
top_level_directory = source_file.split("/")[0]

if top_level_directory == "mylog":
    # Do the copying
    pass
elif top_level_directory == "hivelog":
    # This is a file created by the function, do nothing
    pass
else:
    # We weren't expecting this top level directory
    pass
As an educated guess, the source paths you're copying from are really of the format
/foo/year=2020/month=42/...
so when you split by slashes, you get
foo
year=2020
month=42
and when recomposing those components, you again add another year=/month=/... prefix in
destination_blob_name = 'hivelog/year='+year+'/month='+month+ ...
and there you have it; year=year=year= after 3 iterations...
Are you sure you're not also iterating over the files you've already copied? That would cause this too.
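If the incoming object names really do contain key=value segments, as guessed above, a minimal sketch of one way to avoid re-adding the prefixes (purely illustrative, not the poster's code) is to keep only the value after the = when splitting:
# e.g. data['name'] == 'mylog/year=2020/month=07/day=22/log.csv'
parts = data['name'].split('/')
# keep only the value after '=' when a segment looks like 'year=2020'
year, month, day = (p.split('=')[-1] for p in parts[1:4])
filename = parts[4]
destination_blob_name = f'hivelog/year={year}/month={month}/day={day}/{filename}'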
import os
import gcsfs

def hello_gcs_generic(data, context):
    fs = gcsfs.GCSFileSystem(project="Project_Name", token=os.getenv("GOOGLE_APPLICATION_CREDENTIALS"))
    source_filepath = f"{data['bucket']}/{data['name']}"
    destination_filepath = source_filepath.replace("mylog", "hivelog")
    fs.cp(source_filepath, destination_filepath)
    print(f"Blob {data['name']} in bucket {data['bucket']} copied to hivelog")
This should give you a head start on what you're trying to accomplish. Replace Project_Name with the name of the GCP project where the storage buckets are located.
This also assumes you have service account credentials in a JSON file pointed to by the environment variable GOOGLE_APPLICATION_CREDENTIALS, which I'm assuming is the case based on your use of google.cloud.storage.
Now granted, you could accept "mylog" or "hivelog" as arguments and make it useful in other scenarios. Also, for splitting your filename, should you ever need to go that route again, one line will do the trick:
_, year, month, day, filename = data['name'].split('/')
In that scenario, the underscore just serves to tell yourself and others that you don't plan to use that portion of the split.
You could use extended unpacking to ignore multiple values such as
*_,month,day,filename = data['name'].split('/')
Or you could combine the two
*_,month,day,_ = data['name'].split('/')
edit: link to gcsfs documentation
I have AWS S3 access and the bucket has nearly 300 files inside it. I need to download a single file from this bucket by pattern matching or search, because I do not know the exact filename (say, files ending with the .csv extension).
Here is my sample code, which shows all files inside the bucket:
import os
import xml.etree.ElementTree as ET

from boto.s3.connection import S3Connection
from boto.exception import S3ResponseError


def s3connection(credentialsdict):
    """
    :param access_key: Access key for AWS to establish S3 connection
    :param secret_key: Secret key for AWS to establish S3 connection
    :param file_name: file name of the billing file (csv file)
    :param bucket_name: Name of the bucket which contains the billing files
    :return: status, billing_bucket, billing_key
    """
    os.environ['S3_USE_SIGV4'] = 'True'
    conn = S3Connection(credentialsdict["access_key"], credentialsdict["secret_key"], host='s3.amazonaws.com')
    billing_bucket = conn.get_bucket(credentialsdict["bucket_name"], validate=False)
    try:
        billing_bucket.get_location()
    except S3ResponseError as e:
        if e.status == 400 and e.error_code == 'AuthorizationHeaderMalformed':
            conn.auth_region_name = ET.fromstring(e.body).find('./Region').text
            billing_bucket = conn.get_bucket(credentialsdict["bucket_name"])
    print billing_bucket

    if not billing_bucket:
        raise Exception("Please enter a valid bucket name. Bucket %s does not exist"
                        % credentialsdict.get("bucket_name"))

    for key in billing_bucket.list():
        print key.name

    del os.environ['S3_USE_SIGV4']
Can I pass search string to retrieve the exact matched filenames?
You can use JMESPath expressions to search and filter S3 keys. To do that, you need to get an S3 paginator over list_objects_v2.
import boto3
client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')
page_iterator = paginator.paginate(Bucket="your_bucket_name")
Now that you have the iterator, you can use a JMESPath search.
The most useful expression is contains, for a %like% query:
objects = page_iterator.search("Contents[?contains(Key, `partial-file-name`)][]")
But in your case (to find all files ending in .csv), it's better to use ends_with, for a *.csv query:
objects = page_iterator.search("Contents[?ends_with(Key, `.csv`)][]")
Then you can get object keys with
for item in objects:
    print(item['Key'])
This answer is based on https://blog.jeffbryner.com/2020/04/21/jupyter-pandas-analysis.html and https://stackoverflow.com/a/27274997/4587704
There is no way to do this because there is no native support for regex in S3. You have to get the entire list and apply the search/regex at the client side. The only filtering option available in list_objects is by prefix.
list_objects
Prefix (string) -- Limits the response to keys that begin with the
specified prefix.
One option is to use the Python module re and apply it to the list of objects.
import re

pattern = re.compile(<file_pattern_you_are_looking_for>)
for key in billing_bucket.list():
    if pattern.match(key.name):
        print key.name
You can also use a simple if condition, like this:
import boto3

buck = boto3.resource('s3').Bucket('your_bucket_name')
prefix_objs = buck.objects.filter(Prefix="your_bucket_path")
for obj in prefix_objs:
    key = obj.key
    if key.endswith(".csv"):
        body = obj.get()['Body'].read()
        print(obj.key)
I Can't Seem To Create A Variable That Also Creates A New Line
def S3ListAllObjects(event, context):
    import boto3
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('amazons3bucket')
    returnoutput = ""
    for obj in bucket.objects.all():
        returnoutput += obj.key
    return returnoutput
For whatever reason, I cannot seem to make the returnoutput variable include a newline. I have tried many different variations of the code but have had no luck. What am I missing?
Thanks!
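For what it's worth, a minimal sketch of the usual fix (not from the original thread) is to append a newline per key, or join the keys with one:
# inside the loop
returnoutput += obj.key + "\n"

# or, more idiomatically, build the whole string at once
returnoutput = "\n".join(obj.key for obj in bucket.objects.all())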
Are you sure you want key? Perhaps you want name?
>>> import boto3
>>> s3 = boto3.resource('s3')
>>> for bucket in s3.buckets.all():
...     print(bucket.name)
How can I see what's inside a bucket in S3 with boto3? (i.e. do an "ls")?
Doing the following:
import boto3
s3 = boto3.resource('s3')
my_bucket = s3.Bucket('some/path/')
returns:
s3.Bucket(name='some/path/')
How do I see its contents?
One way to see the contents would be:
for my_bucket_object in my_bucket.objects.all():
    print(my_bucket_object)
This is similar to an 'ls', but it does not take the prefix/folder convention into account; it simply lists every object in the bucket. It's left up to the reader to filter out prefixes that are part of the key name.
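A sketch of that filtering step (assuming the "folders" are the usual zero-byte keys ending in a slash) might look like this:
for my_bucket_object in my_bucket.objects.all():
    # skip zero-byte "folder" placeholder keys such as 'some/path/'
    if my_bucket_object.key.endswith('/'):
        continue
    print(my_bucket_object.key)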
In Python 2:
from boto.s3.connection import S3Connection
conn = S3Connection() # assumes boto.cfg setup
bucket = conn.get_bucket('bucket_name')
for obj in bucket.get_all_keys():
    print(obj.key)
In Python 3:
from boto3 import client
conn = client('s3') # again assumes boto.cfg setup, assume AWS S3
for key in conn.list_objects(Bucket='bucket_name')['Contents']:
    print(key['Key'])
I'm assuming you have configured authentication separately.
import boto3
s3 = boto3.resource('s3')
my_bucket = s3.Bucket('bucket_name')
for file in my_bucket.objects.all():
    print(file.key)
My s3 keys utility function is essentially an optimized version of #Hephaestus's answer:
import boto3
s3_paginator = boto3.client('s3').get_paginator('list_objects_v2')
def keys(bucket_name, prefix='/', delimiter='/', start_after=''):
    prefix = prefix.lstrip(delimiter)
    start_after = (start_after or prefix) if prefix.endswith(delimiter) else start_after
    for page in s3_paginator.paginate(Bucket=bucket_name, Prefix=prefix, StartAfter=start_after):
        for content in page.get('Contents', ()):
            yield content['Key']
In my tests (boto3 1.9.84), it's significantly faster than the equivalent (but simpler) code:
import boto3
def keys(bucket_name, prefix='/', delimiter='/'):
    prefix = prefix.lstrip(delimiter)
    bucket = boto3.resource('s3').Bucket(bucket_name)
    return (_.key for _ in bucket.objects.filter(Prefix=prefix))
As S3 guarantees UTF-8 binary sorted results, a start_after optimization has been added to the first function.
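A usage example (bucket and prefix names are placeholders):
# iterate over every key under the prefix, pagination handled internally
for key in keys('my-bucket', 'some/prefix/'):
    print(key)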
In order to handle large key listings (i.e. when the directory list is greater than 1000 items), I used the following code to accumulate key values (i.e. filenames) across multiple listings (thanks to Amelio above for the first lines). The code is for Python 3:
from boto3 import client

def get_file_list(bucket_name, prefix):
    # wrapped in a function so the early returns are valid
    s3_conn = client('s3')  # type: BaseClient ## again assumes boto.cfg setup, assume AWS S3
    s3_result = s3_conn.list_objects_v2(Bucket=bucket_name, Prefix=prefix, Delimiter="/")

    if 'Contents' not in s3_result:
        # print(s3_result)
        return []

    file_list = []
    for key in s3_result['Contents']:
        file_list.append(key['Key'])
    print(f"List count = {len(file_list)}")

    while s3_result['IsTruncated']:
        continuation_key = s3_result['NextContinuationToken']
        s3_result = s3_conn.list_objects_v2(Bucket=bucket_name, Prefix=prefix, Delimiter="/", ContinuationToken=continuation_key)
        for key in s3_result['Contents']:
            file_list.append(key['Key'])
        print(f"List count = {len(file_list)}")
    return file_list

file_list = get_file_list("my_bucket", "my_key/sub_key/lots_o_files")
If you want to pass the ACCESS and SECRET keys (which you should not do, because it is not secure):
from boto3.session import Session
ACCESS_KEY='your_access_key'
SECRET_KEY='your_secret_key'
session = Session(aws_access_key_id=ACCESS_KEY,
aws_secret_access_key=SECRET_KEY)
s3 = session.resource('s3')
your_bucket = s3.Bucket('your_bucket')
for s3_file in your_bucket.objects.all():
    print(s3_file.key)
# To print all filenames in a bucket
import boto3

s3 = boto3.client('s3')

def get_s3_keys(bucket):
    """Get a list of keys in an S3 bucket."""
    keys = []
    resp = s3.list_objects_v2(Bucket=bucket)
    for obj in resp['Contents']:
        keys.append(obj['Key'])
    return keys

filenames = get_s3_keys('your_bucket_name')
print(filenames)

# To print all filenames in a certain directory in a bucket
import boto3

s3 = boto3.client('s3')

def get_s3_keys(bucket, prefix):
    """Get a list of keys in an S3 bucket under a prefix."""
    keys = []
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    for obj in resp['Contents']:
        keys.append(obj['Key'])
        print(obj['Key'])
    return keys

filenames = get_s3_keys('your_bucket_name', 'folder_name/sub_folder_name/')
print(filenames)
Update:
The easiest way is to use awswrangler:
import awswrangler as wr
wr.s3.list_objects('s3://bucket_name')
import boto3

s3 = boto3.resource('s3')

## Bucket to use
my_bucket = s3.Bucket('city-bucket')

## List objects within a given prefix
for obj in my_bucket.objects.filter(Delimiter='/', Prefix='city/'):
    print(obj.key)
Output:
city/pune.csv
city/goa.csv
A more parsimonious way: rather than iterating through with a for loop, you could also just print the original object containing all files inside your S3 bucket:
from boto3.session import Session

session = Session(aws_access_key_id=aws_access_key_id, aws_secret_access_key=aws_secret_access_key)
s3 = session.resource('s3')
bucket = s3.Bucket('bucket_name')
files_in_s3 = bucket.objects.all()
# you can print this iterable with print(list(files_in_s3))
So you're asking for the equivalent of aws s3 ls in boto3, that is, listing all the top-level folders and files. This is the closest I could get; it only lists the top-level folders. It's surprising how difficult such a simple operation is.
import boto3

def s3_ls():
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('example-bucket')
    result = bucket.meta.client.list_objects(Bucket=bucket.name,
                                             Delimiter='/')
    for o in result.get('CommonPrefixes'):
        print(o.get('Prefix'))
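If you also want the top-level files, not just the folders, a sketch extending the same call inside that function could read both CommonPrefixes and Contents:
result = bucket.meta.client.list_objects(Bucket=bucket.name, Delimiter='/')
for prefix in result.get('CommonPrefixes', []):
    print(prefix.get('Prefix'))  # top-level "folders"
for obj in result.get('Contents', []):
    print(obj.get('Key'))        # top-level files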
ObjectSummary:
There are two identifiers that are attached to the ObjectSummary:
bucket_name
key
boto3 S3: ObjectSummary
More on Object Keys from AWS S3 Documentation:
Object Keys:
When you create an object, you specify the key name, which uniquely identifies the object in the bucket. For example, in the Amazon S3 console (see AWS Management Console), when you highlight a bucket, a list of objects in your bucket appears. These names are the object keys. The name for a key is a sequence of Unicode characters whose UTF-8 encoding is at most 1024 bytes long.
The Amazon S3 data model is a flat structure: you create a bucket, and the bucket stores objects. There is no hierarchy of subbuckets or subfolders; however, you can infer logical hierarchy using key name prefixes and delimiters as the Amazon S3 console does. The Amazon S3 console supports a concept of folders. Suppose that your bucket (admin-created) has four objects with the following object keys:
Development/Projects1.xls
Finance/statement1.pdf
Private/taxdocument.pdf
s3-dg.pdf
Reference:
AWS S3: Object Keys
Here is some example code that demonstrates how to get the bucket name and the object key.
Example:
import boto3
from pprint import pprint

def main():

    def enumerate_s3():
        s3 = boto3.resource('s3')
        for bucket in s3.buckets.all():
            print("Name: {}".format(bucket.name))
            print("Creation Date: {}".format(bucket.creation_date))
            for object in bucket.objects.all():
                print("Object: {}".format(object))
                print("Object bucket_name: {}".format(object.bucket_name))
                print("Object key: {}".format(object.key))

    enumerate_s3()

if __name__ == '__main__':
    main()
Here is a simple function that returns the filenames of all files, or of files with a certain type such as 'json' or 'jpg'.
def get_file_list_s3(bucket, prefix="", file_extension=None):
    """Return the list of all file paths (prefix + file name) with a certain type, or all of them.

    Parameters
    ----------
    bucket: str
        The name of the bucket. For example, if your bucket is "s3://my_bucket" then it should be "my_bucket".
    prefix: str
        The full path to the 'folder' of the files (objects). For example, if your files are in
        s3://my_bucket/recipes/deserts then it should be "recipes/deserts". Default: "".
    file_extension: str
        The type of the files. If you want all of them, just leave it None. If you only want "json" files then it
        should be "json". Default: None.

    Return
    ------
    file_names: list
        The list of file names including the prefix
    """
    import boto3
    s3 = boto3.resource('s3')
    my_bucket = s3.Bucket(bucket)
    file_objs = my_bucket.objects.filter(Prefix=prefix).all()
    # keep every key when file_extension is None, otherwise only keys with that extension
    file_names = [file_obj.key for file_obj in file_objs
                  if file_extension is None or file_obj.key.split(".")[-1] == file_extension]
    return file_names
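A usage example (bucket and prefix taken from the docstring's illustration):
# all JSON files under recipes/deserts in my_bucket
json_files = get_file_list_s3("my_bucket", prefix="recipes/deserts", file_extension="json")

# every object in the bucket
all_files = get_file_list_s3("my_bucket")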
One way that I used to do this:
import boto3
s3 = boto3.resource('s3')
bucket=s3.Bucket("bucket_name")
contents = [_.key for _ in bucket.objects.all() if "subfolders/ifany/" in _.key]
Here is the solution:
import boto3

s3 = boto3.resource('s3')
BUCKET_NAME = 'Your S3 Bucket Name'
allFiles = s3.Bucket(BUCKET_NAME).objects.all()
for file in allFiles:
    print(file.key)
I just did it like this, including the authentication method:
import boto3

def key_exists(bucket_name, key):
    # wrapped in a function so the return statements are valid
    s3_client = boto3.client(
        's3',
        aws_access_key_id='access_key',
        aws_secret_access_key='access_key_secret',
        config=boto3.session.Config(signature_version='s3v4'),
        region_name='region'
    )
    response = s3_client.list_objects(Bucket=bucket_name, Prefix=key)
    if 'Contents' in response:
        # Object / key exists!
        return True
    else:
        # Object / key DOES NOT exist!
        return False
With a little modification to #Hephaestus's code in one of the above answers, I wrote the method below to list the folders and objects (files) in a given path. It works similar to the s3 ls command.
import boto3

def s3_ls(profile=None, bucket_name=None, folder_path=None):
    folders = []
    files = []
    result = dict()
    prefix = folder_path or ""
    session = boto3.Session(profile_name=profile)
    s3_conn = session.client('s3')
    s3_result = s3_conn.list_objects_v2(Bucket=bucket_name, Delimiter="/", Prefix=prefix)
    if 'Contents' not in s3_result and 'CommonPrefixes' not in s3_result:
        return result
    if s3_result.get('CommonPrefixes'):
        for folder in s3_result['CommonPrefixes']:
            folders.append(folder.get('Prefix'))
    if s3_result.get('Contents'):
        for key in s3_result['Contents']:
            files.append(key['Key'])
    while s3_result['IsTruncated']:
        continuation_key = s3_result['NextContinuationToken']
        s3_result = s3_conn.list_objects_v2(Bucket=bucket_name, Delimiter="/", ContinuationToken=continuation_key, Prefix=prefix)
        if s3_result.get('CommonPrefixes'):
            for folder in s3_result['CommonPrefixes']:
                folders.append(folder.get('Prefix'))
        if s3_result.get('Contents'):
            for key in s3_result['Contents']:
                files.append(key['Key'])
    if folders:
        result['folders'] = sorted(folders)
    if files:
        result['files'] = sorted(files)
    return result
This lists all objects/folders in a given path. folder_path can be left as None, and the method will then list the immediate contents of the root of the bucket.
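A usage example (profile, bucket, and path are placeholders):
listing = s3_ls(profile='default', bucket_name='my-bucket', folder_path='some/path/')
print(listing.get('folders', []))
print(listing.get('files', []))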
It can also be done as follows:
csv_files = s3.list_objects_v2(Bucket=s3_bucket_path)
for obj in csv_files['Contents']:
    key = obj['Key']
Using cloudpathlib
cloudpathlib provides a convenience wrapper so that you can use the simple pathlib API to interact with AWS S3 (and Azure blob storage, GCS, etc.). You can install with pip install "cloudpathlib[s3]".
Like with pathlib you can use glob or iterdir to list the contents of a directory.
Here's an example with a public AWS S3 bucket that you can copy and paste to run.
from cloudpathlib import CloudPath
s3_path = CloudPath("s3://ladi/Images/FEMA_CAP/2020/70349")
# list items with glob
list(
    s3_path.glob("*")
)[:3]
#> [ S3Path('s3://ladi/Images/FEMA_CAP/2020/70349/DSC_0001_5a63d42e-27c6-448a-84f1-bfc632125b8e.jpg'),
#> S3Path('s3://ladi/Images/FEMA_CAP/2020/70349/DSC_0002_a89f1b79-786f-4dac-9dcc-609fb1a977b1.jpg'),
#> S3Path('s3://ladi/Images/FEMA_CAP/2020/70349/DSC_0003_02c30af6-911e-4e01-8c24-7644da2b8672.jpg')]
# list items with iterdir
list(
    s3_path.iterdir()
)[:3]
#> [ S3Path('s3://ladi/Images/FEMA_CAP/2020/70349/DSC_0001_5a63d42e-27c6-448a-84f1-bfc632125b8e.jpg'),
#> S3Path('s3://ladi/Images/FEMA_CAP/2020/70349/DSC_0002_a89f1b79-786f-4dac-9dcc-609fb1a977b1.jpg'),
#> S3Path('s3://ladi/Images/FEMA_CAP/2020/70349/DSC_0003_02c30af6-911e-4e01-8c24-7644da2b8672.jpg')]
Created at 2021-05-21 20:38:47 PDT by reprexlite v0.4.2
A good option may also be to run the AWS CLI command from Lambda functions:
import subprocess
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def run_command(command):
    command_list = command.split(' ')
    try:
        logger.info("Running shell command: \"{}\"".format(command))
        result = subprocess.run(command_list, stdout=subprocess.PIPE)
        logger.info("Command output:\n---\n{}\n---".format(result.stdout.decode('UTF-8')))
    except Exception as e:
        logger.error("Exception: {}".format(e))
        return False
    return True

def lambda_handler(event, context):
    run_command('/opt/aws s3 ls s3://bucket-name')
I was stuck on this for an entire night because I just wanted to get the number of files under a subfolder, but the listing also returned one extra entry: the subfolder itself.
After researching this, I found that this is how S3 works. However, I had a scenario where I unloaded data from Redshift into the following directory:
s3://bucket_name/subfolder/<10 number of files>
and when I used
paginator.paginate(Bucket=price_signal_bucket_name,Prefix=new_files_folder_path+"/")
it would only return the 10 files; but when I created the folder on the S3 bucket itself, the listing would also return the subfolder.
Conclusion
If the whole folder is uploaded to S3, then listing only returns the files under the prefix.
But if the folder was created on the S3 bucket itself, then listing it using the boto3 client will also return the subfolder and the files.
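A sketch of one way to handle that case (variable names follow the snippet above) is to skip the placeholder key that matches the prefix itself:
count = 0
for page in paginator.paginate(Bucket=price_signal_bucket_name, Prefix=new_files_folder_path + "/"):
    for obj in page.get('Contents', []):
        # the folder placeholder shows up as a key equal to the prefix itself (ending in '/')
        if obj['Key'] == new_files_folder_path + "/":
            continue
        count += 1
print(count)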
First, create an s3 client object:
s3_client = boto3.client('s3')
Next, create a variable to hold the bucket name and folder. Pay attention to the slash "/" ending the folder name:
bucket_name = 'my-bucket'
folder = 'some-folder/'
Next, call s3_client.list_objects_v2 to get the folder's content object's metadata:
response = s3_client.list_objects_v2(
    Bucket=bucket_name,
    Prefix=folder
)
Finally, with the object's metadata, you can obtain the S3 object by calling the s3_client.get_object function:
for object_metadata in response['Contents']:
    object_key = object_metadata['Key']
    response = s3_client.get_object(
        Bucket=bucket_name,
        Key=object_key
    )
    object_body = response['Body'].read()
    print(object_body)
As you can see, the object content is available by calling response['Body'].read(); note that it is returned as bytes, so decode it if you need a string.
I am using boto and python and amazon s3.
If I use
[key.name for key in list(self.bucket.list())]
then I get all the keys of all the files.
mybucket/files/pdf/abc.pdf
mybucket/files/pdf/abc2.pdf
mybucket/files/pdf/abc3.pdf
mybucket/files/pdf/abc4.pdf
mybucket/files/pdf/new/
mybucket/files/pdf/new/abc.pdf
mybucket/files/pdf/2011/
What is the best way to:
1. either get all folders from S3,
2. or, from that list, just remove the filename at the end and get the unique folder keys?
I am thinking of doing it like this:
set([re.sub("/[^/]*$", "/", path) for path in mylist])
Building on sethwm's answer:
To get the top-level directories:
list(bucket.list("", "/"))
To get the subdirectories of files:
list(bucket.list("files/", "/"))
and so on.
This is going to be an incomplete answer since I don't know python or boto, but I want to comment on the underlying concept in the question.
One of the other posters was right: there is no concept of a directory in S3. There are only flat key/value pairs. Many applications pretend certain delimiters indicate directory entries. For example "/" or "\". Some apps go as far as putting a dummy file in place so that if the "directory" empties out, you can still see it in list results.
You don't always have to pull your entire bucket down and do the filtering locally. S3 has a concept of a delimited list, where you specify what you would deem your path delimiter ("/", "\", "|", "foobar", etc.) and S3 will return virtual results to you, similar to what you want.
http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html (look at the delimiter header)
This API will get you one level of directories. So if you had in your example:
mybucket/files/pdf/abc.pdf
mybucket/files/pdf/abc2.pdf
mybucket/files/pdf/abc3.pdf
mybucket/files/pdf/abc4.pdf
mybucket/files/pdf/new/
mybucket/files/pdf/new/abc.pdf
mybucket/files/pdf/2011/
And you passed in a LIST with prefix "" and delimiter "/", you'd get results:
mybucket/files/
If you passed in a LIST with prefix "mybucket/files/" and delimiter "/", you'd get results:
mybucket/files/pdf/
And if you passed in a LIST with prefix "mybucket/files/pdf/" and delimiter "/", you'd get results:
mybucket/files/pdf/abc.pdf
mybucket/files/pdf/abc2.pdf
mybucket/files/pdf/abc3.pdf
mybucket/files/pdf/abc4.pdf
mybucket/files/pdf/new/
mybucket/files/pdf/2011/
You'd be on your own at that point if you wanted to eliminate the pdf files themselves from the result set.
Now how you do this in python/boto I have no idea. Hopefully there's a way to pass through.
As pointed out in one of the comments, the approach suggested by j1m returns a Prefix object. If you are after a name/path, you can use the name attribute. For example:
import boto
import boto.s3

conn = boto.s3.connect_to_region('us-west-2')
bucket = conn.get_bucket(your_bucket)
folders = bucket.list("", "/")
for folder in folders:
    print folder.name
I found the following to work using boto3:
import boto3

def list_folders(s3_client, bucket_name):
    response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix='', Delimiter='/')
    for content in response.get('CommonPrefixes', []):
        yield content.get('Prefix')

s3_client = boto3.client('s3')
folder_list = list_folders(s3_client, bucket_name)
for folder in folder_list:
    print('Folder found: %s' % folder)
Refs.:
https://stackoverflow.com/a/51550944/1259478
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.list_objects_v2
Basically there is no such thing as a folder in S3. Internally everything is stored as a key, and if the key name has a slash character in it, the clients may decide to show it as a folder.
With that in mind, you should first get all the keys and then use a regex to derive the paths that end in a slash. The solution you have right now is already a good start.
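For completeness, a minimal sketch of that idea (mylist being the key list from the question):
import re

# collapse each key to its "directory" part and de-duplicate
folders = set(re.sub(r"/[^/]*$", "/", path) for path in mylist)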
I see you have successfully made the boto connection. If you only have one directory that you are interested in (like you provided in the example), I think what you can do is use the prefix and delimiter that are already provided by AWS (link).
Boto uses this feature in its bucket object, and you can retrieve hierarchical directory information using prefix and delimiter. bucket.list() will return a boto.s3.bucketlistresultset.BucketListResultSet object.
I tried this a couple of ways, and if you do choose to use a delimiter= argument in bucket.list(), the returned object is an iterator for boto.s3.prefix.Prefix rather than boto.s3.key.Key. In other words, if you try to retrieve the subdirectories, you should put delimiter='/', and as a result you will get an iterator for the Prefix objects.
Both returned objects (either Prefix or Key objects) have a .name attribute, so if you want the directory/file information as a string, you can do so by printing it like below:
from boto.s3.connection import S3Connection

key_id = '...'
secret_key = '...'

# Create connection
conn = S3Connection(key_id, secret_key)

# Get list of all buckets
allbuckets = conn.get_all_buckets()
for bucket_name in allbuckets:
    print(bucket_name)

# Connect to a specific bucket
bucket = conn.get_bucket('bucket_name')

# Get subdirectory info
for key in bucket.list(prefix='sub_directory/', delimiter='/'):
    print(key.name)
The issue here, as has been said by others, is that a folder doesn't necessarily have a key, so you have to search through the strings for the / character and figure out your folders through that. Here's one way to generate a recursive dictionary imitating a folder structure.
If you want all the files and their URLs in the folders:
assets = {}
for key in self.bucket.list(str(self.org) + '/'):
    path = key.name.split('/')

    identifier = assets
    for uri in path[1:-1]:
        try:
            identifier[uri]
        except:
            identifier[uri] = {}
        identifier = identifier[uri]

    if not key.name.endswith('/'):
        identifier[path[-1]] = key.generate_url(expires_in=0, query_auth=False)

return assets
If you just want the empty folders
folders = {}
for key in self.bucket.list(str(self.org) + '/'):
    path = key.name.split('/')

    identifier = folders
    for uri in path[1:-1]:
        try:
            identifier[uri]
        except:
            identifier[uri] = {}
        identifier = identifier[uri]

    if key.name.endswith('/'):
        identifier[path[-1]] = {}

return folders
This can then be recursively read out later.
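For instance, a small sketch (the helper name is my own) that walks such a nested dictionary and prints it as an indented tree:
def print_tree(node, indent=0):
    # node is a nested dict as built above; leaves may be URLs or empty dicts
    for name, child in node.items():
        print("  " * indent + name)
        if isinstance(child, dict):
            print_tree(child, indent + 1)

print_tree(assets)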
The boto interface allows you to list the contents of a bucket and give a prefix for the entries. That way you can get the entries for what would be a directory in a normal filesystem:
import boto

AWS_ACCESS_KEY_ID = '...'
AWS_SECRET_ACCESS_KEY = '...'

conn = boto.connect_s3(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
bucket = conn.get_bucket()
bucket_entries = bucket.list(prefix='/path/to/your/directory')

for entry in bucket_entries:
    print entry
Complete example with boto3 using the S3 client
import boto3

def list_bucket_keys(bucket_name):
    s3_client = boto3.client("s3")
    """ :type : pyboto3.s3 """
    result = s3_client.list_objects(Bucket=bucket_name, Prefix="Trails/", Delimiter="/")
    return result['CommonPrefixes']

if __name__ == '__main__':
    print(list_bucket_keys("my-s3-bucket-name"))