I need to create a folder inside a folder in Google Cloud Storage using Python.
I know how to create one folder:
bucket_name = 'data_bucket_10_07'
# create a new bucket
bucket = storage_client.bucket(bucket_name)
bucket.storage_class = 'COLDLINE' # Archive | Nearline | Standard
bucket.location = 'US' # Taiwan
bucket = storage_client.create_bucket(bucket) # returns Bucket object
my_bucket = storage_client.get_bucket(bucket_name)
When I try to change bucket_name = 'data_bucket_10_07' to bucket_name = 'data_bucket_10_07/data_bucket_10_07_1', I get an error:
google.api_core.exceptions.BadRequest: 400 POST https://storage.googleapis.com/storage/v1/b?project=effective-forge-317205&prettyPrint=false: Invalid bucket name: 'data_bucket_10_07/data_bucket_10_07_1'
How should I solve my problem?
As John mentioned in the comments, it is not possible to have a bucket inside a bucket.
See the Bucket naming guidelines documentation for details.
In a nutshell:
There is only one level of buckets in a global namespace (thus bucket names must be globally unique). Everything beyond the bucket name belongs to the object name.
For example, you can create a bucket (assuming the name is not already in use) called data_bucket_10_07. Its path then looks like gs://data_bucket_10_07
Then you would probably like to store some objects (files) in such a way that it looks like a directory hierarchy, so let's say there is a 01/data.csv object and a 02/data.csv object, where 01 and 02 presumably reflect some date.
Those 01/ and 02/ elements are essentially the beginning parts of the object names (in other words, prefixes for the objects).
So far the bucket name is gs://data_bucket_10_07
The object names are 01/data.csv and 02/data.csv
I would suggest checking the Object naming guidelines documentation, where those ideas are described much better than I can in one sentence.
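To make this concrete, here is a minimal sketch using the example names from above (it assumes the data_bucket_10_07 bucket already exists and that you have write access to it):
from google.cloud import storage

client = storage.Client()
bucket = client.bucket('data_bucket_10_07')

# There is no mkdir step: the "folders" are simply part of the object names.
bucket.blob('01/data.csv').upload_from_string('a,b,c\n1,2,3\n')
bucket.blob('02/data.csv').upload_from_string('a,b,c\n4,5,6\n')

# List everything that appears to live under the 01/ "folder".
for blob in client.list_blobs('data_bucket_10_07', prefix='01/'):
    print(blob.name)  # -> 01/data.csv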
Other comments do a great job of explaining that nested buckets are not possible, but they only touch briefly on the following: GCS does not rely on folders at all, and only presents contents in a hierarchical structure in the web UI for ease of use.
From the documentation:
Cloud Storage operates with a flat namespace, which means that folders don't actually exist within Cloud Storage. If you create an object named folder1/file.txt in the bucket your-bucket, the path to the object is your-bucket/folder1/file.txt. There is no folder1 folder, just a single object with folder1 as part of its name.
So if you'd like to create a "folder" for organization and immediately place an object in it, name your object with the "folders" ahead of the name, and GCS will take care of 'creating' them if they don't already exist.
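A minimal sketch of that idea, reusing the your-bucket and folder1/file.txt names from the documentation quote above (assuming the bucket already exists):
from google.cloud import storage

client = storage.Client()
bucket = client.bucket('your-bucket')

# No "create folder" call is needed: naming the object folder1/file.txt
# is what makes folder1 show up as a folder in the console.
blob = bucket.blob('folder1/file.txt')
blob.upload_from_string('hello world')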
I'm trying to read a file's contents (not download it) from an S3 bucket. The problem is that the file is located under a multi-level folder. For instance, the full path could be s3://s3-bucket/folder-1/folder-2/my_file.json. How can I get that specific file instead of using my iterative approach that lists all objects?
Here is the code that I want to change:
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('s3-bucket')
for obj in bucket.objects.all():
    key = obj.key
    if key == 'folder-1/folder-2/my_file.json':
        return obj.get()['Body'].read()
Can it be done in a simpler, more direct way?
Yes - there is no need to enumerate the bucket.
Read the file directly using s3.Object, providing the bucket name as the 1st parameter & the object key as the 2nd parameter.
"Folders" don't really exist in S3 - Amazon S3 doesn't use hierarchy to organize its objects and files. For the sake of organizational simplicity, the Amazon S3 console shows "folders" as a means of grouping objects but they are ultimately baked into your object key.
This should work:
import boto3
s3 = boto3.resource('s3')
obj = s3.Object("s3-bucket", "folder-1/folder-2/my_file.json")
body = obj.get()['Body'].read()
I start with
client = storage.Client()
bucket = client.get_bucket(BUCKET_NAME)
<what's next? Need something like client.list_folders(path)>
I know how to:
list all the blobs (including blobs in sub-sub-sub-folders, of any depth) with bucket.list_blobs()
or how to list all the blobs recursively in a given folder with bucket.list_blobs(prefix=<path to subfolder>)
But what if my file-system-like structure has 100 top-level folders, each containing thousands of files? Is there an efficient way to get only those 100 top-level folder names without listing all the blobs inside?
All the responses here have a piece of the answer, but you will need to combine them: the prefix, the delimiter, and the prefixes of a loaded list_blobs(...) iterator. Let me throw down the code to get the 100 top-level folders and then we'll walk through it.
import google.cloud.storage as gcs
client = gcs.Client()
blobs = client.list_blobs(
    bucket_or_name=BUCKET_NAME,
    prefix="",
    delimiter="/",
    max_results=1
)
next(blobs, ...) # Force list_blobs to make the api call (lazy loading)
# prefixes is now a set, convert to list
print(list(blobs.prefixes)[:100])
In the first eight lines we build the GCS client and make the client.list_blobs(...) call. In your question you mention the bucket.list_blobs(..) method - as of version 1.43 this still works, but the page on Buckets in the docs says it is now deprecated. The only difference is the keyword arg bucket_or_name, on line 4.
We want folders at the top level, so we don't actually need to specify prefix at all. However, it is useful for other readers to know that if you wanted to list the folders inside a top-level directory stuff, then you should specify a trailing slash; the kwarg would then become prefix="stuff/".
Someone already mentioned the delimiter kwarg, but to reiterate: you should specify it so GCS knows how to interpret the blob names as directories. Simple enough.
The max_results=1 is for efficiency. Remember that we don't want blobs here, we want only folder names, so if we tell GCS to stop looking once it finds a single blob, it might be faster. In practice, I have not found this to be the case, but it could easily be if you have vast numbers of blobs, or if the storage is Coldline or whatever. YMMV. Consider it optional.
The blobs object returned is a lazy-loading iterator, which means it won't load - not even populating its members - until the first API call is made. To trigger that first call, we ask for the next element in the iterator. In your case, you know you have at least one file, so simply calling next(blobs) will work: it fetches the blob that is next in line (at the front of the line) and then throws it away.
However, if you cannot guarantee at least one blob, then next(blobs), which needs to return something from the iterator, will raise a StopIteration exception. To get around this, we pass the ellipsis ... as the default value.
Now that the member of blobs we want, prefixes, is loaded, we print out the first 100. The output will be something like:
{'dir0/','dir1/','dir2/', ...}
I do not think you can get the 100 top-level folders without listing all the blobs inside. Google Cloud Storage does not have folders or subdirectories; the library just creates the illusion of a hierarchical file tree.
I used this simple code:
from google.cloud import storage
storage_client = storage.Client()
blobs = storage_client.list_blobs('my-project')
res = []
for blob in blobs:
    if blob.name.split('/')[0] not in res:
        res.append(blob.name.split('/')[0])
print(res)
You can get the top-level prefixes by using a delimited listing. See the list_blobs documentation:
delimiter (str) – (Optional) Delimiter, used with prefix to emulate hierarchy.
Something like this:
from google.cloud import storage
storage_client = storage.Client()
blobs = storage_client.list_blobs(BUCKET_NAME, delimiter='/')
next(blobs, None)  # force the first page to load so that blobs.prefixes is populated
print(blobs.prefixes)  # a set of top-level prefixes, e.g. {'dir0/', 'dir1/'}
I have an S3 bucket with a few top level folders, and hundreds of files in each of these folders. How do I get the names of these top level folders?
I have tried the following:
import boto3
s3 = boto3.resource('s3', region_name='us-west-2', endpoint_url='https://s3.us-west-2.amazonaws.com')
bucket = s3.Bucket('XXX')
for obj in bucket.objects.filter(Prefix='', Delimiter='/'):
    print(obj.key)
But this doesn't seem to work. I have thought about using regex to filter all the folder names, but this doesn't seem time efficient.
Thanks in advance!
Try this.
import boto3
client = boto3.client('s3')
paginator = client.get_paginator('list_objects')
result = paginator.paginate(Bucket='my-bucket', Delimiter='/')
for prefix in result.search('CommonPrefixes'):
    print(prefix.get('Prefix'))
The Amazon S3 data model is a flat structure: you create a bucket, and the bucket stores objects. There is no hierarchy of subbuckets or subfolders; however, you can infer logical hierarchy using key name prefixes and delimiters as the Amazon S3 console does (source)
In other words, there's no way around iterating all of the keys in the bucket and extracting whatever structure you want to see (depending on your needs, a dict-of-dicts may be a good approach for you).
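As a rough illustration of the dict-of-dicts idea (the bucket name below is just a placeholder), you could build a nested structure from the keys yourself:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')  # placeholder bucket name

tree = {}
for obj in bucket.objects.all():
    node = tree
    parts = obj.key.split('/')
    # Walk/create one nested dict per "folder" component of the key.
    for part in parts[:-1]:
        node = node.setdefault(part, {})
    node[parts[-1]] = obj.size  # leaf: store something about the object

# The top-level "folders" are then simply the entries whose value is a dict.
print([name for name, value in tree.items() if isinstance(value, dict)])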
You could also use Amazon Athena in order to analyse/query S3 buckets.
https://aws.amazon.com/athena/
I have found many questions related to this with solutions using boto3; however, I am in a position where I have to use boto, running Python 2.38.
I can now successfully transfer my files with their folders (not real folders, I know, as S3 doesn't have this concept), but I want them to be saved into a particular folder in my destination bucket.
from boto.s3.connection import S3Connection
def transfer_files():
    conn = S3Connection()
    srcBucket = conn.get_bucket("source_bucket")
    dstBucket = conn.get_bucket(bucket_name="destination_bucket")
    objectlist = srcBucket.list()
    for obj in objectlist:
        dstBucket.copy_key(obj.key, srcBucket.name, obj.key)
My srcBucket contents look like folder/subFolder/anotherSubFolder/file.txt, which when transferred will land in the dstBucket like so: destination_bucket/folder/subFolder/anotherSubFolder/file.txt.
I would like them to end up under destination_bucket/targetFolder, so the final directory structure would look like
destination_bucket/targetFolder/folder/subFolder/anotherSubFolder/file.txt
Hopefully I have explained this well enough and it makes sense.
The first parameter of copy_key() is the name of the destination key.
Therefore, just use:
dstBucket.copy_key('targetFolder/' + obj.key, srcBucket.name, obj.key)
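Put back into the loop from your question, the copy would look something like this (targetFolder/ is just the prefix from your desired layout):
from boto.s3.connection import S3Connection

def transfer_files():
    conn = S3Connection()
    srcBucket = conn.get_bucket("source_bucket")
    dstBucket = conn.get_bucket("destination_bucket")
    for obj in srcBucket.list():
        # Prepend the target "folder" to the destination key name.
        dstBucket.copy_key('targetFolder/' + obj.key, srcBucket.name, obj.key)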
I am trying to access a bucket subfolder using python's boto3.
The problem is that I cannot find anywhere how to input the subfolder information inside the boto code.
All I can find is how to specify the bucket name, but I do not have access to the whole bucket, just to a specific subfolder. Can anyone shed some light on this?
What I did so far:
BUCKET = "folder/subfolder"
conn = S3Connection(AWS_KEY, AWS_SECRET)
bucket = conn.get_bucket(BUCKET)
for key in bucket.list():
print key.name.encode('utf-8')
The error messages:
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the ListBuckets operation: Access Denied
I do not need to use boto for the operation, I just need to list/get the files inside this subfolder.
P.S.: I can access the files using Cyberduck by putting in the path folder/subfolder, which means I have access to the data.
Sincerely,
Israel
I fixed the problem using something similar to what vtl suggested:
I had to pass the prefix for my bucket along with a delimiter. The final code was something like this:
objects = s3.list_objects(Bucket=bucketName, Prefix=bucketPath+'/', Delimiter='/')
As he said, there's no folder structure, so you have to specify a delimiter and also append it to the Prefix like I did.
Thanks for the reply.
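For completeness, here is a self-contained sketch of that approach, assuming a boto3 client and placeholder names for the bucket and subfolder:
import boto3

s3 = boto3.client('s3')
bucketName = 'my-bucket'         # placeholder bucket name
bucketPath = 'folder/subfolder'  # placeholder "subfolder" path

# List only the objects directly under folder/subfolder/.
objects = s3.list_objects(Bucket=bucketName, Prefix=bucketPath + '/', Delimiter='/')
for obj in objects.get('Contents', []):
    print(obj['Key'])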
Try:
for obj in bucket.objects.filter(Prefix="your_subfolder"):
    do_something()
S3 doesn't actually have a directory structure - it just fakes one by putting "/"s in names. The Prefix option restricts the search to all objects whose name starts with the given prefix, which should be your "subfolder".
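A fuller sketch of that, using the boto3 resource API with placeholder bucket and prefix names, which lists the matching keys and reads each object's contents:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')  # placeholder bucket name

# Iterate only over objects whose keys start with the "subfolder" prefix.
for obj in bucket.objects.filter(Prefix='folder/subfolder/'):
    print(obj.key)
    contents = obj.get()['Body'].read()  # the object's contents as bytes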