Counting keys in an S3 bucket - python

Using the boto3 library and the Python code below, I can iterate through S3 buckets and prefixes, printing out the prefix name and key name as follows:
import boto3
client = boto3.client('s3')
pfx_paginator = client.get_paginator('list_objects_v2')
pfx_iterator = pfx_paginator.paginate(Bucket='app_folders', Delimiter='/')
for prefix in pfx_iterator.search('CommonPrefixes'):
    print(prefix['Prefix'])
    key_paginator = client.get_paginator('list_objects_v2')
    key_iterator = key_paginator.paginate(Bucket='app_folders', Prefix=prefix['Prefix'])
    for key in key_iterator.search('Contents'):
        print(key['Key'])
Inside the key loop, I can put in a counter to count the number of keys (files), but this is an expensive operation. Is there a way to make one call given a bucket name and a prefix and return the count of keys contained in that prefix (even if it is more than 1000)?
UPDATE: I found a post here that shows a way to do this with the AWS CLI as follows:
aws s3api list-objects --bucket BUCKETNAME --prefix "folder/subfolder/" --output json --query "[length(Contents[])]"
Is there a way to do something similar with the boto3 API?

You can do it using the MaxKeys=1000 parameter.
For your case:
pfx_iterator = pfx_paginator.paginate(Bucket='app_folders', Delimiter='/', MaxKeys=1000)
In general:
response = client.list_objects_v2(
    Bucket='string',
    Delimiter='string',
    EncodingType='url',
    MaxKeys=123,
    Prefix='string',
    ContinuationToken='string',
    FetchOwner=True|False,
    StartAfter='string',
    RequestPayer='requester'
)
It will be up to 1,000 times cheaper for you :)
Documentation here
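If you need the total count for a prefix even when it exceeds 1,000 keys, a minimal sketch (reusing the bucket and prefix names from the question as placeholders) is to let the paginator follow the continuation tokens and sum each page's KeyCount field:
import boto3

client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')

# Each page reports how many keys it contains in KeyCount; the paginator
# follows continuation tokens automatically, so this counts past 1000 keys.
total_keys = sum(
    page['KeyCount']
    for page in paginator.paginate(Bucket='app_folders', Prefix='folder/subfolder/')
)
print(total_keys)
This still pages through the listing behind the scenes (one request per 1,000 keys), but it avoids looping over every individual key in Python.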

Using the AWS CLI it is easy to count:
aws s3 ls <folder url> --recursive --summarize | grep <comment>
e.g.,
aws s3 ls s3://abc/ --recursive --summarize | grep "Number of Objects"

Check file permissions for each file on a S3 Bucket, recursive

I need a script in Python to get the ACL for each file in an S3 bucket, to see whether there are public or private files in that bucket. All files are images, and the Marketing dept wants to know which files are private.
Something like this:
get_acl(object, bucket, ...)
But recursive, for all 10,000 files in that bucket.
With the AWS CLI I can't get this to work, any idea where I can find some examples?
Thanks
As you state, you need to list all of the objects in the bucket, and either check their ACL, or test to see if you can access the object without authentication.
If you want to check the ACLs, you can run through each object in turn and check:
import boto3

BUCKET = "example-bucket"

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

# List all of the objects
for page in paginator.paginate(Bucket=BUCKET):
    for cur in page.get("Contents", []):
        # Get the ACL for each object in turn
        # Note: This example does not take into
        # account any bucket-level permissions
        acl = s3.get_object_acl(Bucket=BUCKET, Key=cur['Key'])
        public_read = False
        public_write = False
        # Check each grant in the ACL
        for grant in acl["Grants"]:
            # See if the All Users group has been given a right; keep track of
            # all possibilities in case there are multiple rules for some reason
            if grant["Grantee"].get("URI", "") == "http://acs.amazonaws.com/groups/global/AllUsers":
                if grant["Permission"] in {"READ", "FULL_CONTROL"}:
                    public_read = True
                if grant["Permission"] in {"WRITE", "FULL_CONTROL"}:
                    public_write = True
        # Write out the status for this object
        if public_read and public_write:
            status = "public_read_write"
        elif public_read:
            status = "public_read"
        elif public_write:
            status = "public_write"
        else:
            status = "private"
        print(f"{cur['Key']},{status}")
When the objects in the bucket are public you should get a 200 status code, but if they are private the code will be 403.
So what you could try first is to get the list of all the objects in your bucket:
aws2 s3api list-objects --bucket bucketnamehere
Then in Python you could iterate over the objects, making a request to each one, for example:
https://bucketname.s3.us-east-1.amazonaws.com/objectname
You can do the same test from the Unix command line with curl:
curl -I https://bucketname.s3.us-east-1.amazonaws.com/objectname
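A minimal sketch of that idea in Python, assuming the requests library and a plain public-style URL per object (the bucket name and region are placeholders, and keys with special characters would need URL-encoding):
import boto3
import requests

BUCKET = "bucketnamehere"  # placeholder bucket name
REGION = "us-east-1"       # placeholder region

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        url = f"https://{BUCKET}.s3.{REGION}.amazonaws.com/{obj['Key']}"
        # Unauthenticated HEAD request: 200 means anonymously readable,
        # 403 means the object is private to anonymous users.
        resp = requests.head(url)
        status = "public" if resp.status_code == 200 else "private"
        print(f"{obj['Key']},{status}")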

Boto to Boto3 function implementation

1) How can I implement this boto code in boto3:
conn = boto.connect_ec2()  # boto way
sgs = conn.get_all_security_groups()  # boto way
for sg in sgs:
    if len(sg.instances()) == 0:
        print(sg.name, sg.id, len(sg.instances()))
The above code basically prints all Security Groups with no instances attached.
2) And this individual command, which uses the duct.sh() module:
command = 'aws ec2 describe-instances --filters "Name=instance.group-id,Values=' + sg.id + '\" --query \'Reservations[*].Instances[*].[InstanceId,Tags[?Key==`Name`] | [0].Value]\' --output json'
boto: get_all_security_groups()
boto3: security_group_iterator = ec2.security_groups.all()
However, boto has the .instances() method on boto.ec2.securitygroup.SecurityGroup, whereas boto3 does not have an equivalent method on ec2.SecurityGroup.
Therefore, it looks like you would have to call describe_instances(), passing the security group as a Filter:
response = client.describe_instances(
    Filters=[{'Name': 'instance.group-id', 'Values': ['sg-abcd1234']}])
This will return a list of instances that use the given security group.
You could then count len(response['Reservations']) to find unused security groups. (Note: This is an easy way to find zero-length responses, but to count the actual instances would require adding up all Reservations.Instances.)
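A minimal sketch putting both pieces together, listing every security group and counting the actual instances attached to it (this assumes a configured boto3 client and does not handle pagination of describe_instances):
import boto3

client = boto3.client('ec2')

for sg in client.describe_security_groups()['SecurityGroups']:
    response = client.describe_instances(
        Filters=[{'Name': 'instance.group-id', 'Values': [sg['GroupId']]}])
    # Add up the instances across all reservations, not just len(Reservations)
    count = sum(len(r['Instances']) for r in response['Reservations'])
    if count == 0:
        print(sg['GroupName'], sg['GroupId'], count)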

Accessing nested buckets in S3 with Boto3

I know the path of the bucket I want to access (/bucket1/bucket2/etc/), but I can't figure out how to access it via boto3.
I can enumerate all buckets starting from the source, but can't get to the bucket I want.
For example I can do:
prod_bucket = s3.Bucket('prod')
But I cannot do:
prod_bucket = s3.Bucket('prod/prod2/')
TIA
There are no nested buckets. You have buckets and objects.
s3 = boto3.client('s3')
object = s3.get_object(Bucket='prod', Key='prod2/..')
Or:
s3 = boto3.resource('s3')
bucket = s3.Bucket('prod')
object = bucket.Object('prod2/..')
See: get_object
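If the goal is to work with everything under that "path", a minimal sketch listing the objects that share the prefix (bucket and prefix names taken from the question):
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('prod')

# 'prod2/' is not a nested bucket, just a key prefix shared by the objects
for obj in bucket.objects.filter(Prefix='prod2/'):
    print(obj.key)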

copy file from gcs to s3 in boto3

I am looking to copy files from gcs to my s3 bucket. In boto2, easy as a button.
conn = connect_gs(user_id, password)
gs_bucket = conn.get_bucket(gs_bucket_name)
for obj in gs_bucket:
    s3_key = key.Key(s3_bucket)
    s3_key.key = obj
    s3_key.set_contents_from_filename(obj)
However in boto3, I am lost trying to find equivalent code. Any takers?
If all you're doing is a copy:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket-name')
for obj in gcs:
    s3_obj = bucket.Object(obj.key)
    s3_obj.put(Body=obj.data)
Docs: s3.Bucket, s3.Bucket.Object, s3.Bucket.Object.put
Alternatively, if you don't want to use the resource model:
import boto3

s3_client = boto3.client('s3')
for obj in gcs:
    s3_client.put_object(Bucket='bucket-name', Key=obj.key, Body=obj.body)
Docs: s3_client.put_object
Caveat: The gcs bits are pseudocode, I am not familiar with their API.
EDIT:
So it seems gcs supports an old version of the S3 API and with that an old version of the signer. We still have support for that old signer, but you have to opt into it. Note that some regions don't support old signing versions (you can see a list of which S3 regions support which versions here), so if you're trying to copy over to one of those you will need to use a different client.
import boto3
from botocore.client import Config

# Create a client with the old s3 (SigV2) signer
resource = boto3.resource('s3', config=Config(signature_version='s3'))
gcs_bucket = resource.Bucket('phjordon-test-bucket')
s3_bucket = resource.Bucket('phjordon-test-bucket-tokyo')
for obj in gcs_bucket.objects.all():
    s3_bucket.Object(obj.key).copy_from(
        CopySource=obj.bucket_name + "/" + obj.key)
Docs: s3.Object.copy_from
This, of course, will only work assuming gcs is still S3 compliant.
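If you would rather skip the S3-compatibility layer and use Google's own client, a minimal sketch (assuming the google-cloud-storage package, default credentials, and placeholder bucket names) that downloads each blob and re-uploads it with boto3:
import boto3
from google.cloud import storage

gcs_client = storage.Client()
s3_client = boto3.client('s3')

# Placeholder bucket names
for blob in gcs_client.list_blobs('source-gcs-bucket'):
    # Pull the object into memory and push it to S3. Fine for small
    # objects; large ones would need streaming or multipart uploads.
    data = blob.download_as_bytes()
    s3_client.put_object(Bucket='destination-s3-bucket', Key=blob.name, Body=data)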

Google Cloud Storage + Python : Any way to list obj in certain folder in GCS?

I'm going to write a Python program to check whether a file is in a certain folder of my Google Cloud Storage. The basic idea is to get the list of all objects in a folder, i.e. a file name list, then check if the file abc.txt is in that list.
Now the problem is that it looks like Google only provides one way to get the obj list, which is uri.get_bucket(); see the code below, which is from https://developers.google.com/storage/docs/gspythonlibrary#listing-objects
uri = boto.storage_uri(DOGS_BUCKET, GOOGLE_STORAGE)
for obj in uri.get_bucket():
    print '%s://%s/%s' % (uri.scheme, uri.bucket_name, obj.name)
    print ' "%s"' % obj.get_contents_as_string()
The defect of uri.get_bucket() is that it appears to fetch all of the objects first, which is not what I want. I just need to get the object name list for a particular folder (e.g. gs://mybucket/abc/myfolder), which should be much quicker.
Could someone help answer? Appreciate every answer!
Update: the below is true for the older "Google API Client Libraries" for Python, but if you're not using that client, prefer the newer "Google Cloud Client Library" for Python ( https://googleapis.dev/python/storage/latest/index.html ). For the newer library, the equivalent to the below code is:
from google.cloud import storage

client = storage.Client()
for blob in client.list_blobs('bucketname', prefix='abc/myfolder'):
    print(str(blob))
Answer for older client follows.
You may find it easier to work with the JSON API, which has a full-featured Python client. It has a function for listing objects that takes a prefix parameter, which you could use to check for a certain directory and its children in this manner:
import json

from apiclient import discovery

# Auth goes here if necessary. Create authorized http object...
client = discovery.build('storage', 'v1')  # add http=whatever param if auth
request = client.objects().list(
    bucket="mybucket",
    prefix="abc/myfolder")
while request is not None:
    response = request.execute()
    print(json.dumps(response, indent=2))
    request = client.objects().list_next(request, response)
Fuller documentation of the list call is here: https://developers.google.com/storage/docs/json_api/v1/objects/list
And the Google Python API client is documented here:
https://code.google.com/p/google-api-python-client/
This worked for me:
from google.cloud import storage

client = storage.Client()
BUCKET_NAME = 'DEMO_BUCKET'
bucket = client.get_bucket(BUCKET_NAME)
blobs = bucket.list_blobs()
for blob in blobs:
    print(blob.name)
The list_blobs() method will return an iterator used to find blobs in the bucket.
Now you can iterate over blobs and access every object in the bucket. In this example I just print out the name of the object.
This documentation helped me a lot:
https://googleapis.github.io/google-cloud-python/latest/storage/blobs.html
https://googleapis.github.io/google-cloud-python/latest/_modules/google/cloud/storage/client.html#Client.bucket
I hope I could help!
You might also want to look at gcloud-python and documentation.
from gcloud import storage
connection = storage.get_connection(project_name, email, private_key_path)
bucket = connection.get_bucket('my-bucket')
for key in bucket:
    if key.name == 'abc.txt':
        print('Found it!')
        break
However, you might be better off just checking if the file exists:
if 'abc.txt' in bucket:
    print('Found it!')
Install the Python package google-cloud-storage with pip (or through PyCharm) and use the code below:
from google.cloud import storage

client = storage.Client()
for blob in client.list_blobs(BUCKET_NAME, prefix=FOLDER_NAME):
    print(str(blob))
I know this is an old question, but I stumbled over this because I was looking for the exact same answer. Answers from Brandon Yarbrough and Abhijit worked for me, but I wanted to get into more detail.
When you run this:
from google.cloud import storage
storage_client = storage.Client()
blobs = list(storage_client.list_blobs(bucket_name, prefix=PREFIX, fields="items(name)"))
You will get Blob objects, with just the name field of all files in the given bucket, like this:
[<Blob: BUCKET_NAME, PREFIX, None>,
<Blob: xml-BUCKET_NAME, [PREFIX]claim_757325.json, None>,
<Blob: xml-BUCKET_NAME, [PREFIX]claim_757390.json, None>,
...]
If you are like me and you want to 1) filter out the first item in the list because it does NOT represent a file (it's just the prefix), 2) get just the name string value, and 3) remove the PREFIX from the file name, you can do something like this:
blob_names = [blob_name.name[len(PREFIX):] for blob_name in blobs if blob_name.name != PREFIX]
Complete code to get just the string file names from a storage bucket:
from google.cloud import storage

storage_client = storage.Client()
blobs = list(storage_client.list_blobs(bucket_name, prefix=PREFIX, fields="items(name)"))
blob_names = [blob_name.name[len(PREFIX):] for blob_name in blobs if blob_name.name != PREFIX]
print(f"blob_names = {blob_names}")
