I am trying to write a file to S3 from a JSON structure in a Python 2.7 script. The code is as follows:
S3_bucket = s3.Bucket(__S3_BUCKET__)
result = S3_bucket.put_object(Key=__S3_BUCKET_PATH__ + 'file_prefix_' + str(int(time.time()))+'.json', Body = str(json.dumps(dict_list)).encode("utf-8"))
I end up with the S3 bucket handle, which is
s3.Bucket(name='bucket_name')
S3 file path is /file_prefix_1545039898.json
{'statusCode': s3.Object(bucket_name='bucket_name', key='/file_prefix_1545039898.json')}
But I see nothing on S3: no files were created. I suspect that I may need a commit of some kind, but all the manuals I came across say otherwise. Has anyone had a problem like this?
Apparently, a leading slash in the key does not work as a standard root path designator: it creates a directory with an empty name, which I had not noticed. Removing the leading slash puts things where they belong.
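As a minimal sketch of that fix, the key can be normalized before it is passed to put_object. The helper name build_key and its parameters are hypothetical, introduced only for illustration; the only assumption about S3 is the behavior described above, namely that a leading "/" ends up as an empty-named folder.

```python
def build_key(path_prefix, file_name):
    """Join a prefix and a file name into an S3 key without a leading slash.

    Hypothetical helper: path_prefix plays the role of __S3_BUCKET_PATH__
    in the question and may be '', '/', or 'some/folder/'.
    """
    # Join with exactly one slash, then drop any leading slash so the key
    # does not start with an empty path segment.
    joined = path_prefix.rstrip('/') + '/' + file_name
    return joined.lstrip('/')
```

With this, a prefix of "/" and a file name of "file_prefix_1545039898.json" yield a key without the leading slash, so the object lands at the top level of the bucket.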
Using the PyGithub API, I am attempting to retrieve all contents of a specific folder on a specific branch of a repository hosted on GitHub. I can't share the actual repository or specifics regarding the data, but the code I am using is this:
import github
import json
import requests
import base64
from collections import namedtuple
Package = namedtuple('Package', 'name version')
# Parameters
gh_token = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
header = {"Authorization": f"token {gh_token}"}
gh_hostname = 'devtopia.xxx.com'
gh = github.Github(base_url=f'https://{gh_hostname}/api/v3', login_or_token = gh_token)
repo_name = "xxxxxxxxx/SupportFiles"
conda_meta = "xxxxxxx/bin/Python/envs/xxxxxx-xx/conda-meta"
repo = gh.get_repo(repo_name)
def parse_conda_meta(branch):
    package_list = []
    meta_contents = repo.get_contents(conda_meta, ref=branch)  #<< Returns less files than expected for
                                                               # a specified branch "xxx/release/3.2.0",
                                                               # returns expected number of files for
                                                               # "master" branch.
    for i, pkg in enumerate(meta_contents):
        if ".json" in pkg.name:  # filter for JSON files
            print(i, pkg.name)
            # Need to use GitHub Data API (REST) blobs instead of easier
            # `github` with `pkg.decoded_content` here because that method
            # only works with files <= 1MB whereas Data API allows for
            # reading files <= 100MB.
            resp = requests.get(f"https://devtopia.xxxx.com/api/v3/repos/xxxxxxxxx/SupportFiles/git/blobs/{pkg.sha}?ref={branch}", headers=header)
            pkg_cont = json.loads(base64.b64decode(json.loads(resp.content)["content"]))
            package_list.append(Package(pkg_cont['name'], pkg_cont['version']))
        else:
            print('>>', i, pkg.name)
    return package_list

if __name__ == "__main__":
    pkgs = parse_conda_meta("xxx/release/3.2.0")
    print(pkgs)
    print(len(pkgs))
For some reason that I can't get to the bottom of, I am not getting the correct number of files returned by repo.get_contents(conda_meta, ref=branch). For the branch that I am specifying, when that branch is checked out I am seeing 186 files in the conda-meta folder. However, repo.get_contents(conda_meta, ref=branch) returns only 182, I am missing four JSON files.
Is there some limitation to repo.get_contents that I'm not aware of? I've been reading the docs but can't find anything that hints at the problem I am having. There is one bit about it only handling files up to 1 MB, but I am seeing files larger than this returned (e.g., python is 1.204 MB and is returned in the list of files). I believe this limit applies only to reading the content of files over 1 MB, which I deal with by using the GitHub Data API (REST) further downstream. Is there something I'm doing wrong here?
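For reference, the blob-decoding step used downstream can be isolated as below. This is a sketch that assumes the blob response is a JSON object with "content" and "encoding" fields, as in the code above; the function name decode_blob_json is hypothetical.

```python
import base64
import json


def decode_blob_json(blob_payload):
    """Decode a Git blob payload (already parsed from JSON) into a Python object.

    Assumes the payload carries base64 text in "content" and the string
    "base64" in "encoding"; the decoded bytes are expected to be UTF-8 JSON.
    """
    if blob_payload.get("encoding") != "base64":
        raise ValueError("unexpected blob encoding: %r" % blob_payload.get("encoding"))
    # b64decode ignores the newlines that are embedded in the base64 text.
    raw = base64.b64decode(blob_payload["content"])
    return json.loads(raw.decode("utf-8"))
```

In the script above this would be called as decode_blob_json(resp.json()) and the 'name' and 'version' keys read from the result.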
Thanks for reading, any help with this is much appreciated!
Update with solution!
The Problem:
After some more digging, I have found the cause of the problem. It has nothing to do with the code above or with repo.get_contents(conda_meta, ref=branch) specifically. It is actually a Unix/Windows clash that was mistakenly introduced into our repository for this specific branch, "xxx/release/3.2.0", but is not present in others.
So what was the problem? NTFS (and Windows more broadly) is case-insensitive by default, but Git comes from the Unix world and is case-sensitive by default.
We inadvertently created two folders for Python in the bin directory of the conda_meta path (xxxxxx/bin/), one folder called "Python" and one called "python" (note the lower case). When pulling the repository locally, only the "Python" folder shows up, containing all 186 files. On GitHub, however, the path with "Python" contains 182 files while the path with "python" contains the remaining 4 files.
The Solution:
The solution is to add a conda_meta_folders parameter to parse_conda_meta that takes a list of paths, and to search each directory in turn. There might be a slicker solution, though: I'm looking into whether it is possible to do something like git config core.ignorecase true with the PyGithub API. Does anyone know whether PyGithub can honor this setting or be configured for it?
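A rough sketch of that multi-folder approach follows. The function name collect_contents and its parameters are hypothetical; get_contents is assumed to be a callable that behaves like repo.get_contents(path, ref=branch) and returns a list of entries for a directory.

```python
def collect_contents(get_contents, conda_meta_folders, branch):
    """Gather the entries of several folders on one branch into a single list.

    get_contents: callable taking (path, ref=...) and returning a list,
                  for example repo.get_contents from the script above.
    conda_meta_folders: list of folder paths, e.g. the "Python" and
                        "python" variants of the same path.
    """
    entries = []
    for folder in conda_meta_folders:
        # Each case variant is a distinct path on GitHub, so query them all.
        entries.extend(get_contents(folder, ref=branch))
    return entries
```

parse_conda_meta could then iterate over collect_contents(repo.get_contents, conda_meta_folders, branch) instead of a single get_contents call.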
I am working with Python and Jupyter Notebook, and would like to copy files from an S3 bucket into my current Jupyter directory.
I have tried:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket')
for obj in bucket.objects.all():
    key = obj.key
    body = obj.get()['Body'].read()
but I believe this is just reading them, and I would like to save them into this directory. Thank you!
You can use the AWS Command Line Interface (CLI), specifically the aws s3 cp command, to copy files to your local directory.
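If you would rather stay inside the notebook, a sketch along the following lines should also work. It assumes the bucket argument is a boto3 Bucket resource (as created in the question) and uses its download_file(Key, Filename) method; the function name download_all and the target_dir parameter are hypothetical.

```python
import os


def download_all(bucket, target_dir="."):
    """Save every object in the bucket into target_dir and return the local paths.

    Assumption: bucket is a boto3 Bucket resource, for example
    boto3.resource('s3').Bucket('bucket').
    Note: only the last path segment is kept, so two keys with the same
    file name in different "folders" would overwrite each other.
    """
    saved = []
    for obj in bucket.objects.all():
        if obj.key.endswith("/"):
            # Skip zero-byte "folder" placeholder objects.
            continue
        local_path = os.path.join(target_dir, os.path.basename(obj.key))
        bucket.download_file(obj.key, local_path)
        saved.append(local_path)
    return saved
```

Calling download_all(bucket) with the default target_dir writes into the notebook's current working directory.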
Late response, but I was struggling with this earlier today and thought I'd throw in my solution. I needed to work with a bunch of PDFs stored on S3 using Jupyter notebooks on SageMaker.
As a workaround, I download the files into my repo, which is a lot faster than uploading them and makes my code reproducible for anyone with access to the S3 bucket.
Step 1
Create a list of all the objects to be downloaded, then split each element by '/' so that the file name can be extracted for iteration in Step 2.
import awswrangler as wr
s3_uri = "s3://bucket/object_path/"  # placeholder: replace with your S3 URI
objects = wr.s3.list_objects(s3_uri)
objects_list = [obj.split('/') for obj in objects]
Step 2
Make a local folder called data, then iterate through the list of objects and download each one into that folder.
import boto3
import os
os.makedirs("./data", exist_ok=True)
s3_client = boto3.client('s3')
for obj in objects_list:
    s3_client.download_file('bucket',  # placeholder bucket name; can also use obj[2]
                            'object_path/' + obj[-1],  # object_path is everything that comes after the / after the bucket in your S3 URI
                            './data/' + obj[-1])  # same folder that os.makedirs created above
That's it! First time answering anything on this site, so I hope it's useful to someone.
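The split-by-'/' step in Step 1 can also be expressed with the standard library, which avoids relying on list positions such as obj[2] and obj[-1]. This is only a sketch; the function name split_s3_uri is hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/some/key.pdf' into ('bucket', 'some/key.pdf')."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: %r" % s3_uri)
    # The path starts with '/', which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")
```

The bucket and key returned here are the first two arguments that download_file expects.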
I have found many questions related to this with solutions using boto3; however, I am in a position where I have to use boto (version 2.38).
I can successfully transfer my files in their folders (not real folders, I know, as S3 doesn't have this concept), but I want them to be saved into a particular folder in my destination bucket.
from boto.s3.connection import S3Connection
def transfer_files():
    conn = S3Connection()
    srcBucket = conn.get_bucket("source_bucket")
    dstBucket = conn.get_bucket(bucket_name="destination_bucket")
    objectlist = srcBucket.list()
    for obj in objectlist:
        dstBucket.copy_key(obj.key, srcBucket.name, obj.key)
My srcBucket will look like folder/subFolder/anotherSubFolder/file.txt which when transferred will land in the dstBucket like so destination_bucket/folder/subFolder/anotherSubFolder/file.txt
I would like it to end up in destination_bucket/targetFolder so the final directory structure would look like
destination_bucket/targetFolder/folder/subFolder/anotherSubFolder/file.txt
Hopefully I have explained this well enough and it makes sense.
The first parameter is the name of the destination key.
Therefore, just use:
dstBucket.copy_key('targetFolder/' + obj.key, srcBucket.name, obj.key)
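Put together, the loop from the question could look like the sketch below. It assumes src_bucket and dst_bucket are boto Bucket objects obtained as in the question; the function name copy_under_prefix and the target_folder parameter are hypothetical.

```python
def copy_under_prefix(src_bucket, dst_bucket, target_folder):
    """Copy every key from src_bucket into dst_bucket under target_folder.

    Assumption: both buckets are boto (version 2) Bucket objects, so
    list(), name and copy_key(new_key, src_bucket_name, src_key) exist.
    Returns the list of destination keys that were requested.
    """
    # Normalize so that 'targetFolder', 'targetFolder/' and '/targetFolder'
    # all produce the same prefix.
    prefix = target_folder.strip('/') + '/'
    copied = []
    for obj in src_bucket.list():
        new_key = prefix + obj.key
        dst_bucket.copy_key(new_key, src_bucket.name, obj.key)
        copied.append(new_key)
    return copied
```

For the example in the question, copy_under_prefix(srcBucket, dstBucket, 'targetFolder') would request targetFolder/folder/subFolder/anotherSubFolder/file.txt in the destination bucket.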
I am trying to access a bucket subfolder using Python's boto3.
The problem is that I cannot find anywhere how to pass the subfolder information in the boto code.
All I can find is how to specify the bucket name, but I do not have access to the whole bucket, just to a specific subfolder. Can anyone shed some light on this?
What I did so far:
BUCKET = "folder/subfolder"
conn = S3Connection(AWS_KEY, AWS_SECRET)
bucket = conn.get_bucket(BUCKET)
for key in bucket.list():
    print key.name.encode('utf-8')
The error messages:
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the ListBuckets operation: Access Denied
I do not need to use boto for the operation, I just need to list/get the files inside this subfolder.
P.S.: I can access the files using Cyberduck by entering the path folder/subfolder, which means I do have access to the data.
Sincerely,
Israel
I fixed the problem using something similar to what vtl suggested:
I had to pass a prefix and a delimiter when listing the bucket. The final code was something like this:
objects = s3.list_objects(Bucket=bucketName, Prefix=bucketPath+'/', Delimiter='/')
As he said, there is no folder structure, so you have to state a delimiter and also append it to the Prefix, like I did.
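To show what comes back from such a call, here is a small sketch that separates the files directly under the prefix from the nested "subfolders". It assumes s3_client is a boto3 S3 client and relies on the Contents and CommonPrefixes fields of the list_objects response; the function name list_subfolder is hypothetical.

```python
def list_subfolder(s3_client, bucket_name, bucket_path):
    """Return (file_keys, subfolder_prefixes) directly under bucket_path.

    Assumption: s3_client is a boto3 client created with boto3.client('s3').
    Only the first page of results is read; a single list_objects call
    returns at most 1000 keys.
    """
    prefix = bucket_path.rstrip('/') + '/'
    response = s3_client.list_objects(Bucket=bucket_name, Prefix=prefix, Delimiter='/')
    # 'Contents' holds objects at this level; it is absent when there are none.
    files = [item['Key'] for item in response.get('Contents', [])]
    # 'CommonPrefixes' holds one entry per nested "folder" at this level.
    subfolders = [item['Prefix'] for item in response.get('CommonPrefixes', [])]
    return files, subfolders
```

Because of the delimiter, keys deeper than one level below the prefix are rolled up into the subfolder list rather than returned individually.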
Thanks for the reply.
Try:
for obj in bucket.objects.filter(Prefix="your_subfolder"):
    do_something()
S3 doesn't actually have a directory structure; it just fakes one by putting "/"s in key names. The Prefix option restricts the listing to objects whose key starts with the given prefix, which should be your "subfolder".
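The prefix match is a plain string comparison on the key, which the following local simulation illustrates. It does not call S3 at all; the function name and the sample keys are made up for the example.

```python
def filter_by_prefix(keys, prefix):
    """Mimic prefix filtering locally: keep keys that start with prefix."""
    return [key for key in keys if key.startswith(prefix)]


# Made-up keys for illustration only.
sample_keys = [
    "your_subfolder/a.txt",
    "your_subfolder/nested/b.txt",
    "your_subfolder_old/c.txt",
    "other/d.txt",
]
```

Filtering sample_keys with "your_subfolder" also picks up "your_subfolder_old/c.txt", while "your_subfolder/" does not, so including the trailing slash in the prefix is usually what you want.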
I am trying to traverse all objects inside a specific folder in my S3 bucket. The code I already have is as follows:
s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket-name')
for obj in bucket.objects.filter(Prefix='folder/'):
    do_stuff(obj)
I need to use boto3.resource and not client. This code is not getting any objects at all although I have a bunch of text files in the folder. Can someone advise?
Try adding the Delimiter argument, Delimiter='/', when you filter the objects. The rest of the code looks fine.
I had to make sure to skip the first file. For some reason it thinks the folder name is the first file and that may not be what you want.
for video_item in source_bucket.objects.filter(Prefix="my-folder-name/", Delimiter='/'):
    if video_item.key == 'my-folder-name/':
        continue
    do_something(video_item.key)
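The folder name shows up when a zero-byte placeholder object with a key ending in "/" exists, which is typically what creating a folder in the S3 console produces. A slightly more general sketch skips every such key rather than one hard-coded name; the function name iter_files is hypothetical, and the items are assumed to expose a key attribute as boto3 object summaries do.

```python
def iter_files(object_summaries):
    """Yield only items whose key does not end with '/'.

    Assumption: each item has a .key attribute, as the object summaries
    returned by bucket.objects.filter(...) do.
    """
    for item in object_summaries:
        if item.key.endswith('/'):
            # Folder placeholder object, not a real file.
            continue
        yield item
```

It can wrap the filter call directly, for example: for video_item in iter_files(source_bucket.objects.filter(Prefix="my-folder-name/")).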