Python - List files and folders in Bucket

I am playing around with the boto library to access an Amazon S3 bucket. I am trying to list all the files and folders in a given folder in the bucket. I use this to get all the files and folders:
for key in bucket.list():
    print(key.name)
This gives me all the files and folders within the root, along with the files inside the sub-folders, like this:
root/
file1
file2
folder1/file3
folder1/file4
folder1/folder2/file5
folder1/folder2/file6
How can I list only the contents of, say, folder1, so that it lists something like:
files:
file3
file4
folders:
folder2
I can navigate to a folder using
for key in bucket.list(prefix='path/to/folder/')
but in that case it lists the files in folder2 as files of folder1, because I am trying to use string manipulations on the bucket path. I have tried every scenario and it still breaks in cases where there are longer paths and where folders have multiple files and folders (and those folders have more files). Is there a recursive way to deal with this issue?

All of the information in the other answers is correct, but because so many people store objects with path-like keys in S3, the API does provide some tools to help you deal with them.
For example, in your case, if you wanted to list only the "subdirectories" of root without listing all of the objects below them, you would do this:
for key in bucket.list(prefix='root/', delimiter='/'):
    print(key.name)
which should produce the output:
file1
file2
folder1/
You could then do:
for key in bucket.list(prefix='root/folder1/', delimiter='/'):
    print(key.name)
and get:
file3
file4
folder2/
And so forth. You can probably accomplish what you want with this approach.
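If you also want to separate the files from the "folders" at a given level, as in the desired output above, boto should hand back Prefix objects for the common prefixes and Key objects for the objects themselves when a delimiter is used. A minimal sketch, assuming boto 2 and a hypothetical bucket name:
import boto
from boto.s3.prefix import Prefix

conn = boto.connect_s3()
bucket = conn.get_bucket('mybucket')  # hypothetical bucket name

files, folders = [], []
for item in bucket.list(prefix='root/folder1/', delimiter='/'):
    if isinstance(item, Prefix):
        folders.append(item.name)   # e.g. 'root/folder1/folder2/'
    else:
        files.append(item.name)     # e.g. 'root/folder1/file3'

print('files:', files)
print('folders:', folders)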

What I found most difficult to fully grasp about S3 is that it is simply a key/value store and not a disk or other type of file-based store that most people are familiar with. The fact that people refer to keys as folders and values as files only adds to the initial confusion of working with it.
Being a key/value store, the keys are simply identifiers and not actual paths into a directory structure. This means that you don't need to create folders before referencing them, so you can simply put an object in a bucket at a location like /path/to/my/object without first having to create the "directory" /path/to/my.
Because S3 is a key/value store, the API for interacting with it is more object & hash based than file based. This means that, whether using Amazon's native API or using boto, functions like s3.bucket.Bucket.list will list all the objects in a bucket and optionally filter on a prefix. If you specify a prefix /foo/bar then everything with that prefix will be listed, including /foo/bar/file, /foo/bar/blargh/file, /foo/bar/1/2/3/file, etc.
So the short answer is that you will need to filter out the results you don't want from your call to s3.bucket.Bucket.list, because functions like s3.bucket.Bucket.list, s3.bucket.Bucket.get_all_keys, etc. are all designed to return every key under the prefix that you specify as a filter.

S3 has no concept of "folders" in the way you may think of them. It's a flat namespace where objects are stored by key.
If you need to do a single-level listing inside a folder, you'll have to constrain the listing in your code, with something like if key.count('/') == 1 for the top level; see the sketch below for an arbitrary prefix.
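To generalize that check to an arbitrary "folder", a minimal sketch (untested, boto 2, hypothetical bucket name) that post-filters a prefix listing into direct files and immediate subfolders:
import boto

conn = boto.connect_s3()
bucket = conn.get_bucket('mybucket')  # hypothetical bucket name
prefix = 'root/folder1/'

files, folders = [], set()
for key in bucket.list(prefix=prefix):
    rest = key.name[len(prefix):]   # path relative to the "folder"
    if not rest:
        continue                    # skip the zero-byte folder marker, if any
    if '/' not in rest:
        files.append(rest)          # direct child object
    else:
        folders.add(rest.split('/', 1)[0])   # first segment = immediate subfolder

print('files:', files)
print('folders:', sorted(folders))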

Related

Iterate over files in databricks Repos

I would like to iterate over some files in a folder that lives in Databricks Repos.
How would one do this? I don't seem to be able to access the files in Repos.
I have added a picture that shows what folders I would like to access (the dbrks & sql folders).
Thanks :)
Image of the repo folder hierarchy
You can read files from repo folders. The path is /mnt/repos/; this is the top folder you see when opening the repo window. You can then iterate over these files yourself.
Whenever you find the file you want, you can read it with (for example) Spark. For example, to read a CSV file:
spark.read.format("csv").load(
    path, header=True, inferSchema=True, delimiter=";"
)
If you just want to list files in the repositories, then you can use the list command of the Workspace REST API. Using it you can implement recursive listing of files. The actual implementation will differ based on your requirements, for example whether you need a list of full paths or a list with subdirectories, etc. It could be something like this (not tested):
import requests

my_pat = "generated personal access token"
workspace_url = "https://name-of-workspace"

def list_files(base_path: str):
    # The REST API expects an "Authorization: Bearer <token>" header
    lst = requests.request(method='get',
                           url=f"{workspace_url}/api/2.0/workspace/list",
                           headers={"Authorization": f"Bearer {my_pat}"},
                           json={"path": base_path}).json()["objects"]
    results = []
    for i in lst:
        if i["object_type"] == "DIRECTORY" or i["object_type"] == "REPO":
            results.extend(list_files(i["path"]))
        else:
            results.append(i["path"])
    return results

all_files = list_files("/Repos/<my-initial-folder>")
But if you want to read the content of the files in the repository, then you need to use the so-called Arbitrary Files support that is available since DBR 8.4.
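For example, once arbitrary-file support is available, a repo file should be readable with plain Python I/O; the path below is only an assumption and has to match your own workspace and repo layout:
# Hedged sketch: read a file from a repo folder with standard Python I/O.
# The path is an assumption -- adjust it to your workspace/repo.
with open("/Workspace/Repos/<user>/<repo>/sql/query.sql") as f:
    content = f.read()
print(content)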

Put contents of a folder into a zip using python

I need to put the contents of a folder into a ZIP archive (In my case a JAR but they're the same thing).
Though I can't just add every file individually the normal way, because for one... they change each time depending on what the user put in, and it's something like 1,000 files anyway!
I'm using ZipFile; it's the only really viable option.
Basically my code makes a temporary folder, and puts all the files in there, but I can't find a way to add the contents of the folder and not just the folder itself.
Python already has built-in support for ZIP archives. Personally, I learned it from here: https://docs.python.org/3/library/zipfile.html
And to get all the files from that folder, try something like:
import os

for root, dirs, files in os.walk(path):
    for file in files:
        pass  # process each file here -- see the full sketch below
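Building on that, here is a minimal sketch (assuming temp_dir is your temporary folder and out.jar is the archive to create; both names are placeholders) that writes the folder's contents into the archive without the enclosing folder itself, by using paths relative to temp_dir as the archive names:
import os
import zipfile

temp_dir = "path/to/temp_folder"   # placeholder: your temporary folder
archive_path = "out.jar"           # placeholder: the JAR/ZIP to create

with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
    for root, dirs, files in os.walk(temp_dir):
        for name in files:
            full_path = os.path.join(root, name)
            # arcname is the path inside the archive, relative to temp_dir,
            # so the contents are added without the folder itself
            zf.write(full_path, arcname=os.path.relpath(full_path, temp_dir))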

Read files from a dir in python and sort them

I am trying to read all the files that exist in the path using Python. My files are jpg files that correspond to videos, and the filenames look like:
img_1089_IEO_HAP_MD_5_.jpg
img_1089_IEO_HAP_MD_1_.jpg
...
img_1068_IWL_SAD_XX_4_.jpg
All the terms except the last one indicate a specific video (img_1089_IEO_HAP_MD_...jpg). When I just use os.listdir(path) the order of files is kind of random. I want to read all the jpg files in sorted order so that I can store them in a dictionary that will contain each video name and all the corresponding frames. Any help on how I can do so?
You can sort the directory lexicographically by simply wrapping the os.listdir call in sorted:
for file in sorted(os.listdir(directory)):
    print(file)
However, if your ultimate goal is to use the filenames as keys in a dictionary, the insertion order only matters on Python 3.7+, where dictionaries preserve it; on older versions dictionaries are unordered.
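If the end goal is the dictionary of video name to frames, a small sketch (assuming the frame index is always the last _N_ chunk before .jpg, as in the examples above) that groups the sorted filenames by everything before that chunk:
import os
from collections import defaultdict

directory = "path/to/frames"   # placeholder path

frames_by_video = defaultdict(list)
for filename in sorted(os.listdir(directory)):
    if not filename.endswith(".jpg"):
        continue
    stem = filename[:-len(".jpg")].rstrip("_")   # e.g. 'img_1089_IEO_HAP_MD_5'
    video, frame = stem.rsplit("_", 1)           # -> 'img_1089_IEO_HAP_MD', '5'
    frames_by_video[video].append(filename)

# Note: this keeps lexicographic order, so frame 10 sorts before frame 2;
# sort each list numerically by the frame index if frame order matters.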

How to append multiple files into one in Amazon's s3 using Python and boto3?

I have a bucket in Amazon's S3 called test-bucket. Within this bucket, json files look like this:
test-bucket
| continent
| country
| <filename>.json
Essentially, filenames are continent/country/name/. Within each country, there are about 100k files, each containing a single dictionary, like this:
{"data":"more data", "even more data":"more data", "other data":"other other data"}
Different files have different lengths. What I need to do is compile all these files together into a single file, then re-upload that file into s3. The easy solution would be to download all the files with boto3, read them into Python, then append them using this script:
import json

def append_to_file(data, filename):
    with open(filename, "a") as f:
        json.dump(data, f)
        f.write("\n")
However, I do not know all the filenames (the names are a timestamp). How can I read all the files in a folder, e.g. Asia/China/*, then append them to a file, with the filename being the country?
Optimally, I don't want to have to download all the files into local storage. If I could load these files into memory that would be great.
EDIT: to make things clearer: files on S3 aren't stored in folders; the file path is just set up to look like a folder. All files are stored under test-bucket.
The answer to this is fairly simple. You can list all files in the bucket using a filter to restrict it to a "subdirectory" via the prefix. If you have a list of the continents and countries in advance, then you can reduce the list returned. The returned list will have the prefix, so you can filter the list of object names to the ones you want.
import re
import boto3

s3 = boto3.resource('s3')
bucket_obj = s3.Bucket(bucketname)
all_s3keys = list(obj.key for obj in bucket_obj.objects.filter(Prefix=job_prefix))

if file_pat:
    filtered_s3keys = [key for key in all_s3keys if bool(re.search(file_pat, key))]
else:
    filtered_s3keys = all_s3keys
The code above will return all the files, with their complete prefix in the bucket, restricted to the prefix provided. So if you provide prefix='Asia/China/', then it will provide a list of the files only with that prefix. In some cases, I take a second step and filter the file names in that 'subdirectory' before I use the full prefix to access the files.
The second step is to download all the files:
import concurrent.futures

with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_THREADS) as executor:
    executor.map(lambda s3key: bucket_obj.download_file(s3key, local_filepath, Config=CUSTOM_CONFIG),
                 filtered_s3keys)
For simplicity, I skipped showing that the code generates a distinct local_filepath for each file downloaded, so each file ends up being the one you actually want, where you want it.
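If you would rather skip local storage entirely, as the question mentions, here is a hedged boto3 sketch that reads each object into memory, joins the JSON lines, and uploads one combined file; the bucket name, prefix, and output key are placeholders:
import boto3

s3 = boto3.resource('s3')
bucket_obj = s3.Bucket('test-bucket')       # placeholder bucket name
prefix = 'Asia/China/'                      # placeholder "folder"

# Read each JSON object into memory and collect it as one line.
lines = []
for obj in bucket_obj.objects.filter(Prefix=prefix):
    body = obj.get()['Body'].read().decode('utf-8').strip()
    if body:                                # skips any zero-byte folder markers
        lines.append(body)

# Upload the combined file; the output key is an assumption.
bucket_obj.put_object(Key='combined/China.json',
                      Body=('\n'.join(lines) + '\n').encode('utf-8'))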

With Python's 'tarfile', how can I get the top-most directory in a tar archive?

I want to upload a theme archive to a Django web module and pull the name of the top-most directory in the archive to use as the theme's name. The archive will always be in tar-gzip format and will always have only one folder at the top level (though other files may exist parallel to it), with the various sub-directories containing templates, css, images etc. in whatever order suits the theme best.
Currently, based on the very useful code from MegaMark16, my tool uses the following method:
f = tarfile.open(fileobj=self.theme_file, mode='r:gz')
self.name = f.getnames()[0]
Where self.theme_file is a full path to the uploaded file. This works fine as long as the order of the entries in the tarball happens to be correct, but in many cases it is not. I can certainly loop through the entire archive and manually check for the proper 'name' characteristics, but I suspect that there is a more elegant and rapid approach. Any suggestions?
You'll want to use the os.path.commonprefix function.
Sample code would be something to the effect of:
import os
import tarfile

archive = tarfile.open(filepath, mode='r')
print(os.path.commonprefix(archive.getnames()))
Where the printed value would be the 'topmost directory in the archive'--or, your theme name.
Edit: upon further reading of your specs, though, this approach may not yield your desired result if you have files that are siblings to the 'topmost directory', as the common prefix would then just be .; this would only work if ALL files, indeed, had that common prefix of your theme name.
All subdirectories have a '/' in their name, so you can do something like this:
self.name = [name for name in f.getnames() if '/' not in name][0]
and optimize with other tricks.
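A hedged refinement of that idea, since top-level files would match the same filter: keep only the top-level entries that are actually directories (this assumes the archive contains exactly one such directory):
import tarfile

archive = tarfile.open(filepath, mode='r:gz')   # filepath as in the snippets above
# Top-level directory members have no '/' in their name and isdir() is True.
top_dirs = [m.name for m in archive.getmembers() if m.isdir() and '/' not in m.name]
theme_name = top_dirs[0]   # assumes exactly one top-level directory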
