I can upload a single file to Azure Blob Storage with Python. But for a folder containing multiple subfolders with data, is there a way to upload the whole folder to Azure while keeping the same directory structure?
Say I have
FOLDERA
------SUBFOLDERa
----------filea.txt
----------fileb.txt
------SUBFOLDERb
------SUBFOLDERc
I want to upload FOLDERA to Azure with the structure shown above.
Any hints?
@Krumelur is almost right, but I want to give a working code example and also explain why some folders cannot be uploaded to Azure Blob Storage.
1. Code example:
from azure.storage.blob import BlockBlobService, PublicAccess
import os
def run_sample():
    account_name = "your_account_name"
    account_key = "your_account_key"
    block_blob_service = BlockBlobService(account_name, account_key)
    container_name = 'test1'
    path_remove = "F:\\"
    local_path = "F:\\folderA"
    for r, d, f in os.walk(local_path):
        if f:
            for file in f:
                file_path_on_azure = os.path.join(r, file).replace(path_remove, "")
                file_path_on_local = os.path.join(r, file)
                block_blob_service.create_blob_from_path(container_name, file_path_on_azure, file_path_on_local)
# Main method.
if __name__ == '__main__':
    run_sample()
2. Remember that an empty folder cannot be created in or uploaded to Azure Blob Storage, because there are no real folders there. A folder or directory is just part of the blob name. Without an actual blob such as test.txt inside it, there is no way to create or upload a folder. So in your folder structure, the empty folders SUBFOLDERb and SUBFOLDERc cannot be uploaded to Azure Blob Storage.
In my test, all the non-empty folders were uploaded to Azure Blob Storage.
There is nothing built in, but you can easily write that functionality in your code (see os.walk).
Another option is to use the subprocess module to call into the azcopy command line tool.
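The os.walk approach can be sketched as a small helper that only computes which blob name each local file would get. This is a minimal sketch, not a definitive implementation: the function name plan_uploads is hypothetical, and it assumes you want the top-level folder name (FOLDERA) kept as the first segment of every blob name, with forward slashes as separators.

```python
import os


def plan_uploads(local_root):
    """Yield (local_path, blob_name) pairs for every file under local_root.

    plan_uploads is a hypothetical helper name. Empty directories yield
    nothing, because a blob name only exists where there is a file.
    """
    parent = os.path.dirname(os.path.abspath(local_root))
    for dirpath, _dirnames, filenames in os.walk(local_root):
        for filename in filenames:
            local_path = os.path.join(dirpath, filename)
            # Path relative to the parent of local_root, so the root
            # folder name stays in the blob name.
            relative = os.path.relpath(os.path.abspath(local_path), parent)
            # Use forward slashes regardless of the local OS separator.
            blob_name = relative.replace(os.sep, "/")
            yield local_path, blob_name
```

Each pair could then be handed to an upload call such as create_blob_from_path(container_name, blob_name, local_path) from the answer above. Empty subfolders never appear in the output, which matches the behaviour described for SUBFOLDERb and SUBFOLDERc.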
Related
I downloaded an .xpt file from a URL to a blob container, using a Python notebook in Databricks.
In the code below, 'example.xpt' is a local file. How can I read the .xpt file from the blob container instead?
import xport.v56
with open('example.xpt', 'rb') as f:
    library = xport.v56.load(f)
Appreciate any inputs. Thanks!
Assuming you have already installed the xport library on your cluster and mounted your ADLS blob container, you can follow the steps given below:
Use the same code, except the path will be the .xpt file present in your blob container.
import xport.v56
# '/dbfs/mnt/repro/' refers to the mount point, i.e. the ADLS blob container.
with open('/dbfs/mnt/repro/ALQY_F.XPT', 'rb') as f:
    library = xport.v56.load(f)
The library object is of type xport.v56.Library. It has a values() method that returns an iterable of its members.
Use the following code to write the required data in CSV format to the specified destination:
for data in library.values():
    print(type(data))  # <class 'xport.v56.Member'>
    print(dir(data))  # lists all the attributes available on this object
    data.to_csv("/dbfs/mnt/repro/op.csv")  # writes the CSV to your blob container
Without Mounting:
With your client_id, tenant_id and client_secret, set up the configuration for your ADLS storage using the following commands.
spark.conf.set("fs.azure.account.auth.type.<adls_name>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<adls_name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<adls_name>.dfs.core.windows.net", client_id)
spark.conf.set("fs.azure.account.oauth2.client.secret.<adls_name>.dfs.core.windows.net", client_secret)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<adls_name>.dfs.core.windows.net", "https://login.microsoftonline.com/<tenant_id>/oauth2/token")
Now you can access your data lake storage container using abfss. However, passing abfss://data@blb0708.dfs.core.windows.net/example.xpt directly to open() does not work.
Hence, use dbutils.fs.cp() to copy the file from abfss to DBFS.
# copy from the data lake to DBFS
dbutils.fs.cp("abfss://data@<adls_name>.dfs.core.windows.net/<file_name>","dbfs:/<required_path>")
Now you can follow the procedure described above to create a CSV from the .xpt data (save the CSV to DBFS and move it to ADLS if required).
I have written code to move files from one bucket to another in GCS using Python. The bucket has multiple subfolders and I am trying to move only the Day folder to a different bucket. Source Path: /Bucketname/projectname/XXX/Day Target Path: /Bucketname/Archive/Day
Is there a way to move or copy the Day folder directly, without moving each file inside it one by one? I'm trying to optimize my code, which takes a long time when there are multiple Day folders. Sample code below.
from google.cloud import storage
from google.cloud import bigquery
import glob
import pandas as pd
def Archive_JSON(bucket_name, new_bucket_name, source_prefix_arch, staging_prefix_arch, **kwargs):
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    today_execution_date = kwargs['ds_nodash']
    source_prefix_new = source_prefix_arch + today_execution_date + '/'
    blobs = bucket.list_blobs(prefix=source_prefix_new)
    destination_bucket = storage_client.get_bucket(new_bucket_name)
    for blob in blobs:
        destination_bucket.rename_blob(blob, new_name=blob.name.replace(source_prefix_arch, staging_prefix_arch))
You can't move all the files of a folder to another bucket in one operation, because folders don't exist in Cloud Storage. All the objects are stored at the bucket level, and the object name is the full path of the object.
By convention, and for (poor) human readability, the slash / acts as a folder separator, but it's a fake!
So you have no other option than to iterate over all the files that share the same prefix (the "folder path") and move each of them.
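To make the "a folder is only a prefix" point concrete, here is a minimal sketch that works on plain object names and involves no Cloud Storage calls. The function name plan_prefix_moves and the sample object names are hypothetical. Unlike the str.replace call in the question, it rewrites only the leading prefix, so a name that happens to contain the same text further along is left alone.

```python
def plan_prefix_moves(object_names, source_prefix, destination_prefix):
    """Map each object name under source_prefix to its new name.

    plan_prefix_moves is a hypothetical helper; it only computes names
    and does not call any storage API.
    """
    moves = {}
    for name in object_names:
        if name.startswith(source_prefix):
            # Swap only the leading prefix; keep the rest of the name.
            moves[name] = destination_prefix + name[len(source_prefix):]
    return moves


if __name__ == "__main__":
    names = [
        "projectname/XXX/Day/20190501/a.json",
        "projectname/XXX/Other/c.json",
    ]
    for old, new in plan_prefix_moves(names, "projectname/XXX/Day/", "Archive/Day/").items():
        print(old, "->", new)
```

Each old/new pair would still have to be moved individually, for example with the rename_blob call from the question. The per-object iteration cannot be avoided, only organised.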
I have written code to delete older files and keep only the latest one. It works locally, but I want to apply the same logic to a folder in an AWS S3 bucket.
The code works fine when given a local path.
import os
import glob
path = r'C:\Desktop\MyFolder'
allfiles = [os.path.basename(file) for file in glob.glob(path + '\\*.*')]
diff_pattern = set()
deletefile = []
# Collect the distinct name patterns, e.g. 'v_xyz' from 'v_xyz_20190501.txt'
for file in allfiles:
    diff_pattern.add('_'.join(file.split('_', 2)[:2]))
print('Pattern Found - ', diff_pattern)
for pattern in diff_pattern:
    patternfiles = [os.path.basename(file) for file in glob.glob(path + '\\' + pattern + '_*.*')]
    patternfiles.sort()
    # Everything except the last (newest) file is marked for deletion
    if len(patternfiles) > 1:
        deletefile = deletefile + patternfiles[:len(patternfiles) - 1]
print('Files Need to Delete - ', deletefile)
for file in deletefile:
    os.remove(path + '\\' + file)
    print('File Deleted')
I expect the same code to work for AWS S3 buckets. Below is the file format, with example files and their status (keep/delete).
file format: file_name_yyyyMMdd.txt
v_xyz_20190501.txt Delete
v_xyz_20190502.txt keep
v_xyz_20190430.txt Delete
v_abc_20190505.txt Keep
v_abc_20190504.txt Delete
I don't think you can access S3 files through a local-style path.
You may need to use the boto3 library in Python to access S3 folders.
Here is a sample showing how it works:
https://dluo.me/s3databoto3
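As a rough illustration of how the keep-the-latest logic could carry over to S3, the sketch below separates the selection step (pure Python) from the S3 calls. The function names are hypothetical, and it assumes the file_name_yyyyMMdd.txt format from the question, so that sorting file names alphabetically also sorts them by date. The second function expects a boto3 S3 client created by the caller, e.g. boto3.client('s3').

```python
from collections import defaultdict


def select_stale_keys(keys):
    """Return every key except the newest one for each 'v_xyz'-style pattern."""
    groups = defaultdict(list)
    for key in keys:
        filename = key.rsplit("/", 1)[-1]
        # 'v_xyz_20190501.txt' -> 'v_xyz' (same rule as the local script)
        pattern = "_".join(filename.split("_", 2)[:2])
        groups[pattern].append(key)
    stale = []
    for group in groups.values():
        # yyyyMMdd dates sort chronologically as plain strings
        group.sort(key=lambda k: k.rsplit("/", 1)[-1])
        stale.extend(group[:-1])
    return sorted(stale)


def delete_stale_objects(s3_client, bucket, prefix):
    """List keys under prefix, delete all but the newest per pattern."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = [
        obj["Key"]
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix)
        for obj in page.get("Contents", [])
    ]
    stale = select_stale_keys(keys)
    # delete_objects accepts at most 1000 keys per request
    for start in range(0, len(stale), 1000):
        batch = [{"Key": key} for key in stale[start:start + 1000]]
        s3_client.delete_objects(Bucket=bucket, Delete={"Objects": batch})
    return stale
```

Keeping the selection separate makes it possible to check which keys would be removed before any object is actually deleted.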
I have a working Python script for consolidating multiple xlsx files that I want to move to a Watson Studio project. My current code uses a path variable which is passed to glob...
path = '/Users/Me/My_Path/*.xlsx'
files = glob.glob(path)
Since credentials in Watson Studio are specific to individual files, how do I get a list of all files in my IBM COS storage bucket? I'm also wondering how to create folders to separate the files in my storage bucket?
Watson Studio cloud provides a helper library, named project-lib for working with objects in your Cloud Object Storage instance. Take a look at this documentation for using the package in Python: https://dataplatform.cloud.ibm.com/docs/content/analyze-data/project-lib-python.html
For your specific question, get_files() should do what you need. This will return a list of all the files in your bucket, then you can do pattern matching to only keep what you need. Based on this filtered list you can then iterate and use get_file(file_name) for each file_name in your list.
To create a "folder" in your bucket, you need to follow a naming convention for files to create a "pseudo folder". For example, if you want to create a "data" folder of assets, you should prefix file names for objects belonging to this folder with data/.
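The filtering step could look roughly like the sketch below. It assumes, as the project-lib documentation describes, that get_files() returns a list of dictionaries with a 'name' key; the function name select_file_names and the sample names are hypothetical. A glob-style pattern stands in for the original glob.glob(path) call, and a "data/" prefix in the pattern selects one pseudo folder.

```python
from fnmatch import fnmatchcase


def select_file_names(all_files, pattern):
    """Return the names from a get_files()-style listing that match pattern.

    all_files is assumed to be a list of dicts with a 'name' key.
    fnmatchcase is used so matching is case-sensitive on every platform.
    """
    return [f["name"] for f in all_files if fnmatchcase(f["name"], pattern)]


if __name__ == "__main__":
    listing = [
        {"name": "data/jan.xlsx"},
        {"name": "data/feb.xlsx"},
        {"name": "notes.txt"},
    ]
    print(select_file_names(listing, "data/*.xlsx"))
```

Each selected name could then be passed to get_file(file_name) as described above.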
The credentials in IBM Cloud Object Storage (COS) are at the COS instance level, not at the individual file level. Each COS instance can have any number of buckets, with each bucket containing files.
You can get the credentials for the COS instance from the Bluemix console.
https://console.bluemix.net/docs/services/cloud-object-storage/iam/service-credentials.html#service-credentials
You can use boto3 python package to access the files.
https://boto3.amazonaws.com/v1/documentation/api/latest/index.html
import boto3
s3c = boto3.client('s3', endpoint_url='XXXXXXXXX',aws_access_key_id='XXXXXXXXXXX',aws_secret_access_key='XXXXXXXXXX')
s3c.list_objects(Bucket=bucket_name, Prefix=file_path)
s3c.download_file(Filename=filename, Bucket=bucket, Key=objectname)
s3c.upload_file(Filename=filename, Bucket=bucket, Key=objectname)
There's probably a more Pythonic way to write this, but here is the code I wrote using project-lib, per the answer provided by @Greg Filla.
files = []  # List to hold data file names
# Get list of all file names in storage bucket
all_files = project.get_files()  # returns list of dictionaries
# Create list of file names to load based on prefix
for f in all_files:
    if f['name'][:3] == DataFile_Prefix:
        files.append(f['name'])
print("There are " + str(len(files)) + " data files in the storage bucket.")
delete_blob() seems to delete only files, whether they sit directly in the container or inside its folders and subfolders. When I try to delete a folder from the container, I get the error below in Python.
Client-Request-ID=7950669c-2c4a-11e8-88e7-00155dbf7128 Retry policy did not allow for a retry: Server-Timestamp=Tue, 20 Mar 2018 14:25:00 GMT, Server-Request-ID=54d1a5d6-b01e-007b-5e57-c08528000000, HTTP status code=404, Exception=The specified blob does not exist.ErrorCode: BlobNotFoundBlobNotFoundThe specified blob does not exist.RequestId:54d1a5d6-b01e-007b-5e57-c08528000000Time:2018-03-20T14:25:01.2130063Z.
azure.common.AzureMissingResourceHttpError: The specified blob does not exist.ErrorCode: BlobNotFound
BlobNotFoundThe specified blob does not exist.
RequestId:54d1a5d6-b01e-007b-5e57-c08528000000
Time:2018-03-20T14:25:01.2130063Z
Could anyone please help here?
In Azure Blob Storage, folders as such don't exist. A folder is just a prefix in a blob's name. For example, if you see a folder named images and it contains a blob called myfile.png, then the blob's name is really images/myfile.png. Because folders don't really exist (they are virtual), you can't delete a folder directly.
What you need to do is delete all the blobs in that folder individually (in other words, delete the blobs whose names begin with that virtual folder name/path). Once you have deleted all the blobs, the folder automatically goes away.
To accomplish this, first fetch all the blobs whose names start with the virtual folder path. For that, use the list_blobs method and specify the virtual folder path in the prefix parameter. This gives you a list of blobs starting with that prefix. Once you have that list, delete the blobs one by one.
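One detail worth illustrating is how the prefix is matched. The minimal sketch below works on plain blob names rather than a storage client. The function name blobs_in_virtual_folder and the sample names are hypothetical. It appends a trailing slash to the folder name so that a prefix of images does not also select blobs under a sibling such as images2.

```python
def blobs_in_virtual_folder(blob_names, folder):
    """Return the blob names that live under the virtual folder.

    blobs_in_virtual_folder is a hypothetical helper working on plain
    strings; no storage API is called here.
    """
    # Normalise to exactly one trailing slash so 'images' cannot match
    # 'images2/...'.
    prefix = folder.rstrip("/") + "/"
    return [name for name in blob_names if name.startswith(prefix)]


if __name__ == "__main__":
    names = ["images/myfile.png", "images2/other.png", "readme.txt"]
    print(blobs_in_virtual_folder(names, "images"))
```

The same normalised prefix can be passed as the prefix parameter of list_blobs, so that only the blobs of the intended virtual folder are returned and deleted.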
There are two things to understand here. You can delete specific blobs (files, images and so on, including those inside virtual folders) using delete_blob. But if you want to delete a container, you have to use delete_container, which deletes all the blobs within it. Here's a sample I created that deletes a blob inside a path/virtual folder:
from azure.storage.blob import BlockBlobService
block_blob_service = BlockBlobService(account_name='yraccountname', account_key='accountkey')
print("Retrieving blobs in specified container...")
blob_list = []
container = "containername"
def list_blobs(container):
    try:
        global blob_list
        content = block_blob_service.list_blobs(container)
        print("******Blobs currently in the container:**********")
        for blob in content:
            blob_list.append(blob.name)
            print(blob.name)
    except Exception:
        print("The specified container does not exist, Please check the container name or if it exists.")
list_blobs(container)
print("The list() is:")
print(blob_list)
print("Delete this blob: ", blob_list[1])
# DELETE A SPECIFIC BLOB FROM THE CONTAINER
block_blob_service.delete_blob(container, blob_list[1], snapshot=None)
list_blobs(container)
Please refer to the code in my repo with explanation in Readme section, as well as new storage scripts:https://github.com/adamsmith0016/Azure-storage
For others searching for the solution in python. This worked for me.
First, make a variable that stores all the files in the folder that you want to remove.
Then, for every file in that folder, remove it by passing the container name and the blob's name (folder.name) to delete_blob.
Once all the files in a folder are removed, the folder itself is gone in Azure.
def delete_folder(self, containername, foldername):
    # Each entry is a blob whose name starts with the virtual folder path
    folders = [blob for blob in blob_service.block_blob_service.list_blobs(containername) if blob.name.startswith(foldername)]
    if len(folders) > 0:
        for folder in folders:
            blob_service.block_blob_service.delete_blob(containername, folder.name)
            print("deleted folder", folder.name)
Use list_blobs(name_starts_with=folder_name) and delete_blob()
Complete code:
from azure.storage.blob import BlobServiceClient
blob_service_client = BlobServiceClient.from_connection_string(conn_str=CONN_STR)
blob_client = blob_service_client.get_container_client(AZURE_BLOBSTORE_CONTAINER)
for blob in blob_client.list_blobs(name_starts_with=FOLDER_NAME):
    blob_client.delete_blob(blob.name)
You cannot delete a non-empty folder in Azure Blob Storage, but you can achieve the same result if you delete the files inside the subfolders first. The workaround below deletes the most deeply nested files first and works its way up to the parent folder.
from azure.storage.blob import BlockBlobService
blob_client = BlockBlobService(account_name='', account_key='')
containername = 'XXX'
foldername = 'XXX'
def delete_folder(containername, foldername):
    folders = [blob.name for blob in blob_client.list_blobs(containername, prefix=foldername)]
    folders.sort(reverse=True, key=len)
    if len(folders) > 0:
        for folder in folders:
            blob_client.delete_blob(containername, folder)
            print("deleted folder", folder)