I have written code to delete older files and keep only the latest one. It works on a local folder, and I want to apply the same logic to a folder in an AWS S3 bucket.
The code works fine when given a local path:
import os
import glob

path = r'C:\Desktop\MyFolder'
allfiles = [os.path.basename(file) for file in glob.glob(os.path.join(path, '*.*'))]
diff_pattern = set()
deletefile = []
# Collect the distinct name patterns (the first two '_'-separated parts, e.g. "v_xyz")
for file in allfiles:
    diff_pattern.add('_'.join(file.split('_', 2)[:2]))
print('Pattern Found - ', diff_pattern)
for pattern in diff_pattern:
    patternfiles = [os.path.basename(file) for file in glob.glob(os.path.join(path, pattern + '_*.*'))]
    # The yyyyMMdd suffix sorts chronologically, so the last entry is the newest
    patternfiles.sort()
    if len(patternfiles) > 1:
        deletefile = deletefile + patternfiles[:-1]
print('Files Need to Delete - ', deletefile)
for file in deletefile:
    os.remove(os.path.join(path, file))
    print('File Deleted')
I expect the same code to work for AWS S3 buckets. Below are the file name format and some example files, each with its intended status (Keep/Delete).
file format: file_name_yyyyMMdd.txt
v_xyz_20190501.txt Delete
v_xyz_20190502.txt Keep
v_xyz_20190430.txt Delete
v_abc_20190505.txt Keep
v_abc_20190504.txt Delete
I don't think you can access S3 files like a local path.
You may need to use the boto3 library in Python to access S3 folders.
Here is a sample that shows how it works:
https://dluo.me/s3databoto3
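As a rough sketch of how the local script could translate to S3, the code below lists the keys under a prefix, groups them by the same two-part name pattern, and deletes everything except the newest key in each group. The function names keys_to_delete and delete_old_objects are made up for illustration, and s3_client is assumed to be a boto3 S3 client (for example boto3.client('s3')) that you create yourself. It also assumes the yyyyMMdd suffix from the question, so that sorting by name is the same as sorting by date.

```python
from collections import defaultdict


def keys_to_delete(keys):
    """Return every key except the newest one in each name-pattern group.

    The pattern is the first two '_'-separated parts of the file name
    (e.g. 'v_xyz'), as in the local script. Keys are grouped per "folder".
    """
    groups = defaultdict(list)
    for key in keys:
        directory, _, filename = key.rpartition('/')
        pattern = '_'.join(filename.split('_', 2)[:2])
        groups[(directory, pattern)].append(key)

    stale = []
    for group in groups.values():
        # Names end in yyyyMMdd, so a plain sort puts the newest key last.
        group.sort()
        stale.extend(group[:-1])
    return sorted(stale)


def delete_old_objects(s3_client, bucket, prefix):
    """Delete all but the newest object per pattern under the given prefix.

    s3_client is assumed to be a boto3 S3 client created by the caller.
    """
    paginator = s3_client.get_paginator('list_objects_v2')
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj['Key'] for obj in page.get('Contents', []))
    # Skip zero-byte "folder" placeholder keys that end with '/'.
    keys = [key for key in keys if not key.endswith('/')]

    stale = keys_to_delete(keys)
    # delete_objects accepts at most 1000 keys per request.
    for start in range(0, len(stale), 1000):
        batch = stale[start:start + 1000]
        s3_client.delete_objects(
            Bucket=bucket,
            Delete={'Objects': [{'Key': key} for key in batch]},
        )
    return stale
```

It may be worth calling keys_to_delete on its own first and printing the result before letting delete_old_objects remove anything.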
How can I delete files with a particular extension from an S3 bucket using the boto3 library (I am working in PyCharm)?
For example, I have an S3 bucket containing files with different extensions, such as '.txt' and '.csv'.
I want to write a Python script that deletes only the files with the '.csv' extension from the bucket.
Please help.
You can add an event trigger to your S3 bucket with the suffix filter ".csv" to invoke a Lambda function, which reads the bucket and key from the event and uses boto3's delete() method to delete the CSV file.
`import boto3
s3 = boto3.resource('s3')
s3.Object('your-bucket', 'your-key').delete()`
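A trigger only fires for newly uploaded objects, so for CSV files that are already in the bucket a one-off script may be closer to what the question asks for. The sketch below is one possible way to do it; delete_by_suffix is a made-up helper name, and s3_client is assumed to be a boto3 S3 client such as boto3.client('s3').

```python
def delete_by_suffix(s3_client, bucket, suffix='.csv', prefix=''):
    """Delete every object under prefix whose key ends with suffix.

    s3_client is assumed to be a boto3 S3 client created by the caller.
    Returns the list of deleted keys.
    """
    paginator = s3_client.get_paginator('list_objects_v2')
    deleted = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Compare case-insensitively so '.CSV' is matched as well.
        keys = [
            obj['Key']
            for obj in page.get('Contents', [])
            if obj['Key'].lower().endswith(suffix.lower())
        ]
        if keys:
            # A page holds at most 1000 keys, which is also the
            # per-request limit of delete_objects.
            s3_client.delete_objects(
                Bucket=bucket,
                Delete={'Objects': [{'Key': key} for key in keys]},
            )
            deleted.extend(keys)
    return deleted
```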
I have written code to move files from one bucket to another in GCS using Python. The bucket has multiple subfolders, and I am trying to move only the Day folder to a different bucket. Source path: /Bucketname/projectname/XXX/Day. Target path: /Bucketname/Archive/Day.
Is there a way to move or copy the Day folder directly, without moving each file inside it one by one? I'm trying to optimize my code, which takes a long time when there are multiple Day folders. Sample code below.
from google.cloud import storage
from google.cloud import bigquery
import glob
import pandas as pd

def Archive_JSON(bucket_name, new_bucket_name, source_prefix_arch, staging_prefix_arch, **kwargs):
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    today_execution_date = kwargs['ds_nodash']
    source_prefix_new = source_prefix_arch + today_execution_date + '/'
    blobs = bucket.list_blobs(prefix=source_prefix_new)
    destination_bucket = storage_client.get_bucket(new_bucket_name)
    for blob in blobs:
        destination_bucket.rename_blob(blob, new_name=blob.name.replace(source_prefix_arch, staging_prefix_arch))
You can't move all the files of a folder to another bucket in one operation, because folders don't exist in Cloud Storage. All objects are stored at the bucket level, and an object's name is its full path.
By convention, and for (poor) human readability, the slash / acts as a folder separator, but it's a fake!
So you have no option other than iterating over all the objects that share the same prefix (the "folder path") and moving each one.
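The per-object loop could look roughly like the sketch below, which copies each blob to the destination bucket under a rewritten name and then deletes the source. move_prefix is a made-up helper name; the two bucket arguments are assumed to be google-cloud-storage Bucket objects, and the only calls used on them are list_blobs, copy_blob and Blob.delete. The prefixes in the usage comment are guesses based on the paths in the question.

```python
def move_prefix(source_bucket, destination_bucket, old_prefix, new_prefix):
    """Move every object whose name starts with old_prefix.

    Both buckets are assumed to be google.cloud.storage Bucket objects.
    Returns the list of new object names.
    """
    moved = []
    # Materialise the listing first so that deleting blobs does not
    # interfere with the iteration.
    for blob in list(source_bucket.list_blobs(prefix=old_prefix)):
        # Swap only the leading prefix; str.replace would also rewrite
        # later occurrences of the same text inside the name.
        new_name = new_prefix + blob.name[len(old_prefix):]
        source_bucket.copy_blob(blob, destination_bucket, new_name)
        blob.delete()
        moved.append(new_name)
    return moved


# Possible usage (needs google-cloud-storage and credentials):
#   from google.cloud import storage
#   client = storage.Client()
#   move_prefix(client.bucket('source-bucket'), client.bucket('target-bucket'),
#               'projectname/XXX/Day/', 'Archive/Day/')
```

Each object is still one copy request plus one delete request, so with many Day folders the remaining speed-up would have to come from running the moves concurrently rather than from a folder-level operation.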
I am working with Python in a Jupyter notebook and would like to download files from an S3 bucket into my current Jupyter directory.
I have tried:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket')
for obj in bucket.objects.all():
    key = obj.key
    body = obj.get()['Body'].read()
but I believe this is just reading them, and I would like to save them into this directory. Thank you!
You can use the AWS Command Line Interface (CLI), specifically the aws s3 cp command, to copy files to your local directory.
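If you would rather stay inside the notebook than call the CLI, something along these lines should give a similar result with boto3. This is only a sketch: download_prefix is a made-up helper name, s3_client is assumed to be a boto3 S3 client such as boto3.client('s3'), and the key's "folder" structure is recreated under local_dir.

```python
import os


def download_prefix(s3_client, bucket, prefix='', local_dir='.'):
    """Download every object under prefix into local_dir.

    s3_client is assumed to be a boto3 S3 client created by the caller.
    Returns the list of local file paths that were written.
    """
    paginator = s3_client.get_paginator('list_objects_v2')
    saved = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            key = obj['Key']
            if key.endswith('/'):
                continue  # zero-byte "folder" placeholder, nothing to save
            # Mirror the key's path segments as local sub-directories.
            target = os.path.join(local_dir, *key.split('/'))
            os.makedirs(os.path.dirname(target) or '.', exist_ok=True)
            s3_client.download_file(bucket, key, target)
            saved.append(target)
    return saved
```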
Late response, but I was struggling with this earlier today and thought I'd throw in my solution. I needed to work with a bunch of PDFs stored on S3 using Jupyter notebooks on SageMaker.
My workaround was to download the files to my repo, which is a lot faster than uploading them and makes my code reproducible for anyone with access to the S3 bucket.
Step 1
Create a list of all the objects to be downloaded, then split each element by '/' so that the file name can be extracted for iteration in step 2.
import awswrangler as wr

objects = wr.s3.list_objects("<s3 URI>")  # placeholder: replace with your S3 URI
objects_list = [obj.split('/') for obj in objects]
Step 2
Create a local folder called data, then iterate through the list of objects and download each one into that folder.
import boto3
import os

os.makedirs("./data", exist_ok=True)
s3_client = boto3.client('s3')
for obj in objects_list:
    s3_client.download_file(
        "<bucket>",  # placeholder for the bucket name; can also use obj[2]
        "<object_path>" + obj[-1],  # object_path is everything that comes after the / after the bucket in your S3 URI
        './data/' + obj[-1],
    )
That's it! This is my first time answering anything here, so I hope it's useful to someone.
I can upload a single file to Azure Blob Storage with Python. But for a folder containing multiple subfolders with data, is there a way to upload the whole folder to Azure while keeping the same directory structure?
Say I have
FOLDERA
------SUBFOLDERa
----------filea.txt
----------fileb.txt
------SUBFOLDERb
------SUBFOLDERc
I want to upload FOLDERA to Azure with the structure shown above.
Any hints?
@Krumelur is almost right, but here I want to give a working code example, as well as explain why some folders cannot be uploaded to Azure Blob Storage.
1. Code example:
from azure.storage.blob import BlockBlobService, PublicAccess
import os

def run_sample():
    account_name = "your_account_name"
    account_key = "your_account_key"
    block_blob_service = BlockBlobService(account_name, account_key)
    container_name = 'test1'
    path_remove = "F:\\"
    local_path = "F:\\folderA"

    for r, d, f in os.walk(local_path):
        if f:
            for file in f:
                file_path_on_azure = os.path.join(r, file).replace(path_remove, "")
                file_path_on_local = os.path.join(r, file)
                block_blob_service.create_blob_from_path(container_name, file_path_on_azure, file_path_on_local)

# Main method.
if __name__ == '__main__':
    run_sample()
2. Remember that an empty folder cannot be created in or uploaded to Azure Blob Storage, since there is no real "folder" there. A folder or directory is just part of the blob name. So without a real blob such as test.txt inside a folder, there is no way to create or upload it. In your folder structure, the empty folders SUBFOLDERb and SUBFOLDERc therefore cannot be uploaded to Azure Blob Storage.
In my test, all the non-empty folders were uploaded to Azure Blob Storage.
There is nothing built in, but you can easily write that functionality in your code (see os.walk).
Another option is to use the subprocess module to call into the azcopy command line tool.
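For the os.walk route, a minimal sketch might look like the one below. Note the assumptions: upload_folder is a made-up helper name, and container_client is assumed to be a ContainerClient from the newer azure-storage-blob (v12) package, whose upload_blob(name, data, overwrite=...) method is the only SDK call used here. The BlockBlobService class in the example above comes from the older SDK and, as far as I know, is not available in v12. Blob names are built with forward slashes so the folder structure shows up the same way regardless of the local operating system.

```python
import os


def upload_folder(container_client, local_root):
    """Upload every file under local_root, keeping the relative paths.

    container_client is assumed to be an azure.storage.blob.ContainerClient
    (v12 SDK). Empty folders are skipped because they contain no blobs.
    Returns the sorted list of blob names that were uploaded.
    """
    # Blob names start at the folder itself, e.g. 'FOLDERA/SUBFOLDERa/filea.txt'.
    parent = os.path.dirname(os.path.abspath(local_root))
    uploaded = []
    for dirpath, _dirnames, filenames in os.walk(local_root):
        for filename in filenames:
            local_path = os.path.abspath(os.path.join(dirpath, filename))
            relative = os.path.relpath(local_path, parent)
            blob_name = relative.replace(os.sep, '/')
            with open(local_path, 'rb') as data:
                container_client.upload_blob(name=blob_name, data=data, overwrite=True)
            uploaded.append(blob_name)
    return sorted(uploaded)
```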
I have a working Python script for consolidating multiple xlsx files that I want to move to a Watson Studio project. My current code uses a path variable which is passed to glob...
path = '/Users/Me/My_Path/*.xlsx'
files = glob.glob(path)
Since credentials in Watson Studio are specific to individual files, how do I get a list of all files in my IBM COS storage bucket? I'm also wondering how to create folders to separate the files in my storage bucket?
Watson Studio Cloud provides a helper library named project-lib for working with objects in your Cloud Object Storage instance. Take a look at this documentation for using the package in Python: https://dataplatform.cloud.ibm.com/docs/content/analyze-data/project-lib-python.html
For your specific question, get_files() should do what you need. This will return a list of all the files in your bucket, then you can do pattern matching to only keep what you need. Based on this filtered list you can then iterate and use get_file(file_name) for each file_name in your list.
To create a "folder" in your bucket, you need to follow a naming convention for files to create a "pseudo folder". For example, if you want to create a "data" folder of assets, you should prefix file names for objects belonging to this folder with data/.
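The pattern-matching step can be done with the standard library in a way that resembles the glob call from the question. The sketch below assumes that get_files() returns a list of dictionaries that each carry a 'name' key; match_files is a made-up helper name, and it works on any such list, so no project-lib call appears in the code itself.

```python
from fnmatch import fnmatchcase


def match_files(all_files, pattern):
    """Return the names in all_files that match a glob-style pattern.

    all_files is assumed to be a list of dicts with a 'name' key.
    Unlike glob on a file system, '*' here also matches '/' characters.
    """
    return [f['name'] for f in all_files if fnmatchcase(f['name'], pattern)]


# Possible usage in a Watson Studio notebook:
#   xlsx_files = match_files(project.get_files(), '*.xlsx')
#   data_files = match_files(project.get_files(), 'data/*')
```

Because a "folder" is only a name prefix, a pattern such as data/* is enough to select everything in the data pseudo folder.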
The credentials in IBM Cloud Object Storage (COS) are at the COS instance level, not at the individual file level. Each COS instance can have any number of buckets, with each bucket containing files.
You can get the credentials for the COS instance from the Bluemix console.
https://console.bluemix.net/docs/services/cloud-object-storage/iam/service-credentials.html#service-credentials
You can use the boto3 Python package to access the files.
https://boto3.amazonaws.com/v1/documentation/api/latest/index.html
import boto3
s3c = boto3.client('s3', endpoint_url='XXXXXXXXX', aws_access_key_id='XXXXXXXXXXX', aws_secret_access_key='XXXXXXXXXX')
s3c.list_objects(Bucket=bucket_name, Prefix=file_path)
s3c.download_file(Filename=filename, Bucket=bucket, Key=objectname)
s3c.upload_file(Filename=filename, Bucket=bucket, Key=objectname)
There's probably a more Pythonic way to write this, but here is the code I wrote using project-lib, per the answer provided by @Greg Filla:
files = []  # List to hold data file names
# Get list of all file names in storage bucket
all_files = project.get_files()  # returns list of dictionaries
# Create list of file names to load based on prefix
for f in all_files:
    if f['name'][:3] == DataFile_Prefix:
        files.append(f['name'])
print("There are " + str(len(files)) + " data files in the storage bucket.")