How to get a List of Files in IBM COS Bucket using Watson Studio - python

I have a working Python script for consolidating multiple xlsx files that I want to move to a Watson Studio project. My current code uses a path variable which is passed to glob...
path = '/Users/Me/My_Path/*.xlsx'
files = glob.glob(path)
Since credentials in Watson Studio are specific to individual files, how do I get a list of all files in my IBM COS storage bucket? I'm also wondering how to create folders to separate the files in my storage bucket?

Watson Studio cloud provides a helper library, named project-lib for working with objects in your Cloud Object Storage instance. Take a look at this documentation for using the package in Python: https://dataplatform.cloud.ibm.com/docs/content/analyze-data/project-lib-python.html
For your specific question, get_files() should do what you need. It returns a list of all the files in your bucket; you can then do pattern matching to keep only what you need. Based on this filtered list, you can iterate and call get_file(file_name) for each file_name in your list.
To create a "folder" in your bucket, you need to follow a naming convention for files to create a "pseudo folder". For example, if you want to create a "data" folder of assets, you should prefix file names for objects belonging to this folder with data/.

The credentials in IBM Cloud Object Storage (COS) are at the COS instance level, not at the individual file level. Each COS instance can contain any number of buckets, and each bucket contains files.
You can get the credentials for the COS instance from the IBM Cloud (Bluemix) console:
https://console.bluemix.net/docs/services/cloud-object-storage/iam/service-credentials.html#service-credentials
You can use the boto3 Python package to access the files:
https://boto3.amazonaws.com/v1/documentation/api/latest/index.html
import boto3

# Create an S3-compatible client pointing at your COS endpoint with HMAC credentials
s3c = boto3.client('s3', endpoint_url='XXXXXXXXX',
                   aws_access_key_id='XXXXXXXXXXX',
                   aws_secret_access_key='XXXXXXXXXX')

# List objects in a bucket (optionally under a prefix), then download or upload files
s3c.list_objects(Bucket=bucket_name, Prefix=file_path)
s3c.download_file(Filename=filename, Bucket=bucket, Key=objectname)
s3c.upload_file(Filename=filename, Bucket=bucket, Key=objectname)
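If it helps, a short sketch (under the same assumptions as the snippet above) of pulling the object names out of the list_objects response and downloading each one; note that list_objects returns at most 1000 keys per call, so larger buckets need pagination:
# Extract object names from the list_objects response and download each one
response = s3c.list_objects(Bucket=bucket_name, Prefix=file_path)
for obj in response.get('Contents', []):  # 'Contents' is absent when nothing matches
    key = obj['Key']
    s3c.download_file(Filename=key.split('/')[-1], Bucket=bucket_name, Key=key)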

There's probably a more pythonic way to write this, but here is the code I wrote using project-lib, per the answer provided by @Greg Filla:
files = []  # List to hold data file names

# Get list of all file names in the storage bucket
all_files = project.get_files()  # returns a list of dictionaries

# Create list of file names to load based on prefix
for f in all_files:
    if f['name'][:3] == DataFile_Prefix:
        files.append(f['name'])

print("There are " + str(len(files)) + " data files in the storage bucket.")

Related

Moving files from one bucket to another in GCS

I have written a code to move files from one bucket to another in GCS using python. The bucket has multiple subfolders and I am trying to move the Day folder only to a different bucket. Source Path: /Bucketname/projectname/XXX/Day Target Path: /Bucketname/Archive/Day
Is there a way to directly move/copy the Day folder without moving each file inside it one by one? I'm trying to optimize my code, which takes a long time when there are multiple Day folders. Sample code below.
from google.cloud import storage
from google.cloud import bigquery
import glob
import pandas as pd

def Archive_JSON(bucket_name, new_bucket_name, source_prefix_arch, staging_prefix_arch, **kwargs):
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    today_execution_date = kwargs['ds_nodash']
    source_prefix_new = source_prefix_arch + today_execution_date + '/'
    blobs = bucket.list_blobs(prefix=source_prefix_new)
    destination_bucket = storage_client.get_bucket(new_bucket_name)
    for blob in blobs:
        destination_bucket.rename_blob(blob, new_name=blob.name.replace(source_prefix_arch, staging_prefix_arch))
You can't move all the files of a folder to another bucket in one operation, because folders don't exist in Cloud Storage. All objects live at the bucket level, and an object's name is its full path.
By convention, and for (poor) human readability, the slash / is treated as a folder separator, but it's a fake: there is no real hierarchy.
So you have no option other than moving all the files that share the same prefix (the "folder path") and iterating over all of them.
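As an illustration, a minimal sketch of that iteration using an explicit copy + delete per blob; the bucket name and prefixes below simply mirror the placeholder paths from the question:
from google.cloud import storage

client = storage.Client()
src_bucket = client.get_bucket('Bucketname')
dst_bucket = client.get_bucket('Bucketname')  # can be a different bucket

# "Move" everything under the Day prefix by copying each blob, then deleting the source
for blob in src_bucket.list_blobs(prefix='projectname/XXX/Day/'):
    new_name = blob.name.replace('projectname/XXX/', 'Archive/', 1)
    src_bucket.copy_blob(blob, dst_bucket, new_name)
    blob.delete()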

Iterate over files in databricks Repos

I would like to iterate over some files in a folder that has its path in databricks Repos.
How would one do this? I don't seem to be able to access the files in Repos.
I have added a picture that shows which folders I would like to access (the dbrks & sql folders).
Thanks :)
Image of the repo folder hierarchy
You can read files from repo folders. The path is /mnt/repos/; this is the top folder when opening the repo window. You can then iterate over these files yourself.
Once you find the file you want, you can read it with (for example) Spark. For example, to read a CSV file:
spark.read.format("csv").load(
path, header=True, inferSchema=True, delimiter=";"
)
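If you just need the iteration part, a small sketch using plain Python on the driver; the folder path is a placeholder for wherever your repo shows up in your workspace (e.g. under the top-level path mentioned above):
import os

repo_folder = "<path to the dbrks folder in your repo>"  # placeholder
for name in sorted(os.listdir(repo_folder)):
    full_path = os.path.join(repo_folder, name)
    print(full_path)  # pass this path to spark.read as shown above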
If you just want to list files in the repositories, then you can use the list command of the Workspace REST API. Using it you can implement recursive listing of files. The actual implementation would differ based on your requirements, for example, whether you need a list of full paths or a list with subdirectories, etc. It could be something like this (not tested):
import requests

my_pat = "generated personal access token"
workspace_url = "https://name-of-workspace"

def list_files(base_path: str):
    lst = requests.request(method='get',
                           url=f"{workspace_url}/api/2.0/workspace/list",
                           headers={"Authorization": f"Bearer {my_pat}"},
                           json={"path": base_path}).json()["objects"]
    results = []
    for i in lst:
        if i["object_type"] == "DIRECTORY" or i["object_type"] == "REPO":
            results.extend(list_files(i["path"]))
        else:
            results.append(i["path"])
    return results

all_files = list_files("/Repos/<my-initial-folder>")
But if you want to read the content of the files in the repository, then you need to use the so-called Arbitrary Files support that is available since DBR 8.4.

Google Cloud Storage List Blob objects with specific file name

With the help of google.cloud.storage and list_blobs I can get the list of files from a specific bucket. But I want to filter the files by a pattern like name*.ext, and I was not able to find an exact solution.
For example: bucket_name=data, prefix_folder_name=sales; within the prefix folder I have a list of invoices with metadata. I want to get the specific invoices and their metadata (name*.csv & name*.meta). Also, if I loop over all_blobs of the entire folder just to pick out the selected files, that is a huge volume of data and it may affect performance.
It would be good if someone could help me with this solution.
bucket = gcs_client.get_bucket(bucket_name)
all_blobs = bucket.list_blobs(prefix=prefix_folder_name)
for blob in all_blobs:
    print(blob.name)
According to the google-cloud-storage documentation, Blobs are objects that have a name attribute, so you can filter them by this attribute.
from google.cloud import storage
# storage_client = gcs client
storage_client = storage.Client()
# bucket_name = "your-bucket-name"
# Note: Client.list_blobs requires at least package version 1.17.0.
blobs = storage_client.list_blobs(bucket_name)
# filter_dir = "filter-string"
[blob.name for blob in blobs if filter_dir in blob.name ]
list_blobs doesn't let you filter by an arbitrary pattern, but you can use the fields parameter to return just the names of the objects, limiting the amount of data returned and making it easy to filter client-side.
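A hedged sketch of what that can look like with the Python client, reusing the names from the snippet above and treating the prefix and extension as placeholders:
# Ask the API to return only object names (partial response), then filter client-side
blobs = storage_client.list_blobs(
    bucket_name, prefix="sales/", fields="items(name),nextPageToken"
)
csv_names = [blob.name for blob in blobs if blob.name.endswith(".csv")]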
You can filter for a prefix, but to filter more specifically (e.g., for objects ending with a given name extension) you have to implement client-side filtering logic. That's what gsutil does when you run a command like:
gsutil ls gs://your-bucket/abc*.txt
You can use the following, taking name and .ext as the filters for the files:
all_blobs = bucket.list_blobs()
fileList = [file.name for file in all_blobs if '.ext' in file.name and 'name' in file.name]
for file in fileList:
    print(file)
Here name will be the fileName filter and .ext will be your extension filter.
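For the name*.ext style pattern from the question, a sketch that combines a server-side prefix with client-side glob matching via fnmatch; the bucket, prefix, and pattern values are placeholders:
import fnmatch
from google.cloud import storage

client = storage.Client()
blobs = client.list_blobs("data", prefix="sales/")
invoices = [b.name for b in blobs
            if fnmatch.fnmatch(b.name, "sales/name*.csv")
            or fnmatch.fnmatch(b.name, "sales/name*.meta")]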

How to delete files available in s3 bucket

I have written code to delete older files and keep the latest one. My code works locally, but I want to apply the same logic to an AWS S3 bucket folder to perform a similar operation.
The code works fine when given a local path.
import os
import glob

path = r'C:\Desktop\MyFolder'
allfiles = [os.path.basename(file) for file in glob.glob(path + '\*.*')]
diff_pattern = set()
deletefile = []
for file in allfiles:
    diff_pattern.add('_'.join(file.split('_', 2)[:2]))
print('Pattern Found - ', diff_pattern)
for pattern in diff_pattern:
    patternfiles = [os.path.basename(file) for file in glob.glob(path + '\\' + pattern + '_*.*')]
    patternfiles.sort()
    if len(patternfiles) > 1:
        deletefile = deletefile + patternfiles[:len(patternfiles) - 1]
print('Files Need to Delete - ', deletefile)
for file in deletefile:
    os.remove(path + '\\' + file)
    print('File Deleted')
I expect the same code to work for AWS S3 buckets. Below is the file format and an example with their status (keep/delete) that I'm working with.
file format: file_name_yyyyMMdd.txt
v_xyz_20190501.txt Delete
v_xyz_20190502.txt keep
v_xyz_20190430.txt Delete
v_abc_20190505.txt Keep
v_abc_20190504.txt Delete
I don't think you can access S3 files like a local path.
You may need to use the boto3 library in Python to access S3 folders.
Here is a sample for you to see how it works:
https://dluo.me/s3databoto3
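For what it's worth, a hedged sketch of the same keep-the-latest-per-pattern logic against S3 with boto3; the bucket and prefix names are placeholders, and list_objects_v2 returns at most 1000 keys per call, so large folders need pagination:
import boto3

s3 = boto3.client('s3')
bucket, prefix = 'my-bucket', 'MyFolder/'

# Collect the object keys under the prefix
resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
keys = [obj['Key'] for obj in resp.get('Contents', [])]

# Group keys by the v_name pattern (everything before the date part)
groups = {}
for key in keys:
    filename = key.rsplit('/', 1)[-1]               # e.g. v_xyz_20190501.txt
    pattern = '_'.join(filename.split('_', 2)[:2])  # e.g. v_xyz
    groups.setdefault(pattern, []).append(key)

# Keep the lexicographically latest key per pattern (yyyyMMdd sorts correctly), delete the rest
for pattern, members in groups.items():
    members.sort()
    for old_key in members[:-1]:
        s3.delete_object(Bucket=bucket, Key=old_key)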

How to append multiple files into one in Amazon's s3 using Python and boto3?

I have a bucket in Amazon's S3 called test-bucket. Within this bucket, json files look like this:
test-bucket
| continent
  | country
    | <filename>.json
Essentially, filenames are continent/country/name/. Within each country, there are about 100k files, each containing a single dictionary, like this:
{"data":"more data", "even more data":"more data", "other data":"other other data"}
Different files have different lengths. What I need to do is compile all these files together into a single file, then re-upload that file into s3. The easy solution would be to download all the files with boto3, read them into Python, then append them using this script:
import json

def append_to_file(data, filename):
    with open(filename, "a") as f:
        json.dump(data, f)
        f.write("\n")
However, I do not know all the filenames (the names are a timestamp). How can I read all the files in a folder, e.g. Asia/China/*, then append them to a file, with the filename being the country?
Optimally, I don't want to have to download all the files into local storage. If I could load these files into memory that would be great.
EDIT: to make things clearer: files on S3 aren't stored in folders; the file path is just set up to look like a folder. All files are stored under test-bucket.
The answer to this is fairly simple. You can list all files in the bucket using a filter to narrow it down to a "subdirectory" via the prefix. If you have a list of the continents and countries in advance, you can reduce the list returned. The returned keys include the prefix, so you can filter the list of object names down to the ones you want.
import re
import boto3

s3 = boto3.resource('s3')
bucket_obj = s3.Bucket(bucketname)
all_s3keys = list(obj.key for obj in bucket_obj.objects.filter(Prefix=job_prefix))

if file_pat:
    filtered_s3keys = [key for key in all_s3keys if bool(re.search(file_pat, key))]
else:
    filtered_s3keys = all_s3keys
The code above returns all the files whose keys start with the prefix provided, with their complete prefix in the bucket. So if you provide prefix='Asia/China/', it will return a list of only the files with that prefix. In some cases, I take a second step and filter the file names in that 'subdirectory' before I use the full key to access the files.
The second step is to download all the files:
import concurrent.futures

with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_THREADS) as executor:
    executor.map(lambda s3key: bucket_obj.download_file(s3key, local_filepath, Config=CUSTOM_CONFIG),
                 filtered_s3keys)
For simplicity, I skipped showing that the code generates a local_filepath for each downloaded file, so each file is the one you actually want and ends up where you want it.
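To finish the original task without downloading to disk, a hedged sketch that streams each object's body into memory, joins the records line by line, and writes one combined object back to the bucket; the bucket, prefix, and output key are placeholders:
import boto3

s3 = boto3.resource('s3')
bucket_obj = s3.Bucket('test-bucket')

# Read each JSON object's body and collect one record per line
lines = []
for obj in bucket_obj.objects.filter(Prefix='Asia/China/'):
    lines.append(obj.get()['Body'].read().decode('utf-8').strip())

# Upload the combined newline-delimited file back to the bucket
combined = '\n'.join(lines) + '\n'
s3.Object('test-bucket', 'combined/China.json').put(Body=combined.encode('utf-8'))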
