I am working in Python in a Jupyter notebook, and I am trying to read Parquet files from an AWS S3 bucket and convert them into a single pandas DataFrame.
The bucket and folders are arranged like:
The bucket name: mybucket
    First folder: 123
        Second folder: Parquets.parquet
            file1.snappy.parquet
            file2.snappy.parquet
            ...
I am getting the full path with:
bucket = s3.Bucket(name='mybucket')
keys = []
for key in bucket.objects.all():
    keys.append("s3://mybucket/" + key.key)
And then reading them with:
count = 0
keys = keys[2:]
df_list = []
for obj in bucket.objects.all():
    subsrc = obj.Object()
    key = obj.key
    path = keys[count]
    obj_df = pd.read_parquet(path)
    df_list.append(obj_df)
    count += 1
df = pd.concat(df_list)
But that is giving me:
PermissionError: Forbidden
pointing to the line 'obj_df = pd.read_parquet(path)'
I know I have full S3 access, so that should not be the issue. Thank you so much!
This is probably because the path to the data is incorrect.
(In the code above you're doing pd.read_parquet(path) where path = keys[count], but I'm fairly sure those are only the object keys, which do not include the bucket name.)
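For reference, here is a minimal sketch of one way to do this end to end. It assumes the 123/Parquets.parquet/ layout from the question and that s3fs is installed (pandas relies on it to read s3:// paths):

import boto3
import pandas as pd

s3 = boto3.resource('s3')
bucket = s3.Bucket('mybucket')

# Build full s3:// paths, keeping only the actual parquet part-files
paths = [
    "s3://mybucket/" + obj.key
    for obj in bucket.objects.filter(Prefix='123/Parquets.parquet/')
    if obj.key.endswith('.parquet')
]

df_list = [pd.read_parquet(path) for path in paths]
df = pd.concat(df_list, ignore_index=True)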
I have an S3 folder with CSV files stored in it, and I'm trying to download the last modified file. I'm using this script to get it:
s3_client = boto3.client('s3', aws_access_key_id=s3_extra_data['aws_access_key_id'],
                         aws_secret_access_key=s3_extra_data['aws_secret_access_key'])

response = s3_client.list_objects_v2(Bucket='test', Prefix='file_r/')
all = response['Contents']
latest = max(all, key=lambda x: x['LastModified'])

print("LATEST -> " + str(latest["Key"])[:-52].lower())
print("PATH -> " + str(latest["Key"]))

s3_client.download_file("test", latest["Key"], str(latest["Key"]))
This code finds my last modified object; the file name is part-00000-40f267f2-38dc-4bab-811c-4c3052fdb1ba-c000.csv and it is inside the file_r folder.
However, when I use s3_client.download_file I get the following error:
'file_r/part-00000-40f267f2-38dc-4bab-811c-4c3052fdb1ba-c000.csv.8cEebaeb'
When I print my path and my file I get the correct values:
LATEST -> file_r/part
PATH -> file_r/part-00000-40f267f2-38dc-4bab-811c-4c3052fdb1ba-c000.csv
Why is the value .8cEebaeb appended after the .csv extension when the PATH is correct?
Any thoughts on that?
I had this issue when the local folder was not created.
folder1/folder2/filename.foo
If folder1 or folder2 does not exist locally, boto3 returns an error:
FileNotFoundError: [Errno 2] No such file or directory: 'folder1/folder2/filename.foo.F89bdcAc'
To solve the problem I changed the code to:
s3_client = boto3.client('s3', aws_access_key_id=s3_extra_data['aws_access_key_id'],
                         aws_secret_access_key=s3_extra_data['aws_secret_access_key'])

response = s3_client.list_objects_v2(Bucket='test', Prefix='file_r/')
all = response['Contents']
latest = max(all, key=lambda x: x['LastModified'])

s3_client.download_file("test", latest["Key"], "FILE_NAME")
Please note that this issue can sometimes happen on Windows machines due to the number of characters in the folder path, including the file name. I faced a similar issue, and saving to the same folder path kept giving me an error until I reduced the path to 228 characters.
I have the below hierarchy in S3 and would like to retrieve only the subfolder-type information, excluding files that end in .txt (basically, exclude file names and retrieve only prefixes/folders).
--folder1/subfolder1/item1.txt
--folder1/subfolder1/item11.txt
--folder1/subfolder2/item2.txt
--folder1/subfolder2/item21.txt
--folder1/subfolder3/item3.txt
--folder1/subfolder3/subfolder31/item311.txt
Desired Output:
--folder1/subfolder1
--folder1/subfolder2
--folder1/subfolder3/subfolder31
I understand that there are no folders/subfolders in S3 and that these are all just keys.
I tried the code below, but it displays everything, including file names like item1.txt:
s3 = boto3.resource('s3')
client = boto3.client('s3')
bucket = s3.Bucket('s3-bucketname')
paginator = client.get_paginator('list_objects')

objs = list(bucket.objects.filter(Prefix='folder1/'))
for i in range(0, len(objs)):
    print(objs[i].key)
Any recommendation on how to get the below output?
--folder1/subfolder1
--folder1/subfolder2
--folder1/subfolder3/subfolder31
As you say, S3 doesn't really have a concept of folders, so to get what you want you need, in a sense, to recreate them.
One option is to list all of the objects under the prefix, construct the folder (prefix) of each object, and print each new one as you run across it:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('s3-bucketname')

shown = set()
for obj in bucket.objects.filter(Prefix='folder1/'):
    prefix = "/".join(obj.key.split("/")[:-1])
    if len(prefix) and prefix not in shown:
        shown.add(prefix)
        print(prefix + "/")
I used to have environment variables BUCKET_NAME and FILENAME.
This is my current code to get the file:
obj = self.s3_client.get_object(Bucket=self.bucket_name, Key=filename)
(where self.bucket_name came from BUCKET_NAME and filename came from FILENAME environment variables)
Earlier today, the "higher powers" changed the environment, so now instead of the bucket name I get BUCKET_FILE, with the value s3://bucket_name/filename.
This breaks my code, and I need to fix it.
Can I somehow use this string to get to the object? Or do I have to parse the bucket_name and filename out of it and keep the above code?
I searched the S3 documentation, but I can't find anything other than get_object, which takes Bucket (a string containing the bucket name) as a required parameter.
Yep, you need to parse this string to get the bucket name and the key. Here is the function that the AWS CLI uses to achieve this:
def find_bucket_key(s3_path):
    """
    This is a helper function that given an s3 path such that the path is of
    the form: bucket/key
    It will return the bucket and the key represented by the s3 path
    """
    block_unsupported_resources(s3_path)
    match = _S3_ACCESSPOINT_TO_BUCKET_KEY_REGEX.match(s3_path)
    if match:
        return match.group('bucket'), match.group('key')
    match = _S3_OUTPOST_TO_BUCKET_KEY_REGEX.match(s3_path)
    if match:
        return match.group('bucket'), match.group('key')
    s3_components = s3_path.split('/', 1)
    bucket = s3_components[0]
    s3_key = ''
    if len(s3_components) > 1:
        s3_key = s3_components[1]
    return bucket, s3_key
Reference: https://github.com/aws/aws-cli/blob/develop/awscli/customizations/s3/utils.py#L217-L235
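Since find_bucket_key expects the path in bucket/key form (the AWS CLI strips the s3:// scheme separately), a small self-contained sketch may be simpler if you don't want to depend on AWS CLI internals; split_s3_uri below is a hypothetical helper, not part of boto3:

import os
from urllib.parse import urlparse

import boto3

def split_s3_uri(uri):
    # "s3://bucket_name/path/to/file" -> ("bucket_name", "path/to/file")
    parsed = urlparse(uri)
    return parsed.netloc, parsed.path.lstrip('/')

bucket_name, key = split_s3_uri(os.environ['BUCKET_FILE'])
obj = boto3.client('s3').get_object(Bucket=bucket_name, Key=key)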
Is it possible to copy all the files in one source bucket to another target bucket using boto3? The source bucket doesn't have a regular folder structure.
Source bucket: SRC
Source Path: A/B/C/D/E/F..
where the D folder has some files and the E folder has some files.
Target bucket: TGT
Target path: L/M/N/
I need to copy all the files and folders from folder C in the SRC bucket to under the N folder in the TGT bucket using boto3.
Is anyone aware of an API for this, or do we need to write a new Python script to complete this task?
S3 stores objects, it doesn't store folders; even '/' or '\' is just part of the object key name. You just need to manipulate the key name and copy the data over.
import boto3

old_bucket_name = 'SRC'
old_prefix = 'A/B/C/'
new_bucket_name = 'TGT'
new_prefix = 'L/M/N/'

s3 = boto3.resource('s3')
old_bucket = s3.Bucket(old_bucket_name)
new_bucket = s3.Bucket(new_bucket_name)

for obj in old_bucket.objects.filter(Prefix=old_prefix):
    old_source = {'Bucket': old_bucket_name,
                  'Key': obj.key}
    # replace the prefix
    new_key = obj.key.replace(old_prefix, new_prefix, 1)
    new_obj = new_bucket.Object(new_key)
    new_obj.copy(old_source)
An optimized way of defining new_key, suggested by zvikico, is to slice off the old prefix rather than calling replace():
new_key = new_prefix + obj.key[len(old_prefix):]
I'm trying to use boto in Python to loop through and upload files to my AWS bucket. I can successfully upload to the root of the bucket, but I have been unable to upload to a specific prefix. Here is the snippet I have:
conn = S3Connection(aws_access_key_id=key, aws_secret_access_key=secret)
bucket = conn.get_bucket('mybucket')
k = boto.s3.key.Key(bucket)
k.key = u
k.set_contents_from_filename(u)
It must be something simple; I have looked through other posts and have been unable to figure this out.
Thanks
You need to build the key's full name, including the prefix, and then you can set its contents:
import os
from boto.s3.connection import S3Connection

# Connect to AWS
conn = S3Connection(aws_access_key_id=key, aws_secret_access_key=secret)
bucket = conn.get_bucket('mybucket')

# Build the full key name (prefix + file name)
path = 'prefix'
key_name = 'this_is_any.file'
full_key_name = os.path.join(path, key_name)

# Create the key and upload the file to S3
k = bucket.new_key(full_key_name)
k.set_contents_from_filename(...)
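For what it's worth, if you are on boto3 rather than the legacy boto package, the same idea applies: the "folder" is just the leading part of the key. A minimal sketch, reusing the bucket and file names from this answer:

import boto3

s3_client = boto3.client('s3')

# The prefix is simply part of the key; S3 has no real folders.
s3_client.upload_file('this_is_any.file',         # local file to upload
                      'mybucket',                 # bucket name
                      'prefix/this_is_any.file')  # key, including the prefix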