How to move files from one folder to another on Databricks - Python

I am trying to move a file from one folder to another folder using a Databricks Python notebook.
My source is Azure Data Lake Gen1.
Suppose my file is present at adl://testdatalakegen12021.azuredatalakestore.net/source/test.csv
and I am trying to move the file from adl://testdatalakegen12021.azuredatalakestore.net/demo/test.csv to adl://testdatalakegen12021.azuredatalakestore.net/destination/movedtest.csv
I tried various approaches, but none of my code works:
# Move a file by renaming its path
import os
import shutil
os.rename('adl://testdatalakegen12021.azuredatalakestore.net/demo/test.csv', 'adl://testdatalakegen12021.azuredatalakestore.net/demo/renamedtest.csv')
# Move a file from the directory d1 to d2
shutil.move('adl://testdatalakegen12021.azuredatalakestore.net/demo/test.csv', 'adl://testdatalakegen12021.azuredatalakestore.net/destination/renamedtest.csv')
Please let me know if I am using the correct logic, as I am executing this on Databricks, not on my local machine.

To move a file in a Databricks notebook, you can use dbutils as follows:
dbutils.fs.mv('adl://testdatalakegen12021.azuredatalakestore.net/demo/test.csv', 'adl://testdatalakegen12021.azuredatalakestore.net/destination/renamedtest.csv')
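
If you want to keep the original file rather than move it, dbutils.fs.cp takes the same arguments; a short sketch using the destination name from the question:
dbutils.fs.cp('adl://testdatalakegen12021.azuredatalakestore.net/demo/test.csv', 'adl://testdatalakegen12021.azuredatalakestore.net/destination/movedtest.csv')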

Here are the steps to move files from one folder to another on Databricks:
Mount the Azure Data Lake Storage Gen1 account to the Databricks workspace:
configs = {"<prefix>.oauth2.access.token.provider.type": "ClientCredential",
           "<prefix>.oauth2.client.id": "<application-id>",
           "<prefix>.oauth2.credential": dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"),
           "<prefix>.oauth2.refresh.url": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
    source = "adl://<storage-resource>.azuredatalakestore.net/<directory-name>",
    mount_point = "/mnt/<mount-name>",
    extra_configs = configs)
Reference: Mount Azure Data Lake Storage Gen1 resource using a service principal and OAuth 2.0
Moving a file using the %fs magic command:
%fs mv dbfs:/mnt/adlsgen1/test/mapreduce.txt dbfs:/mnt/adlsgen1/test1/mapreduce.txt
Moving a file using the dbutils command:
dbutils.fs.mv('dbfs:/mnt/adlsgen1/test/data.csv', 'dbfs:/mnt/adlsgen1/test1/dataone.csv')
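
Applied to the paths from the question, and assuming the Gen1 account root is mounted at /mnt/adlsgen1 (the mount name is an assumption), the move would look like:
dbutils.fs.mv('dbfs:/mnt/adlsgen1/demo/test.csv', 'dbfs:/mnt/adlsgen1/destination/movedtest.csv')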

You may also want to move or copy the files without their subdirectories:
import os

source_dir = "/mnt/yourplateform/source"
dest_dir = "/mnt/yourplateform/destination/"
list_of_files = []

# Use the Hadoop FileSystem API to list all files under the source directory (recursively)
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
path_exists = fs.exists(spark._jvm.org.apache.hadoop.fs.Path(source_dir))
if path_exists:
    file_list = fs.listFiles(spark._jvm.org.apache.hadoop.fs.Path(source_dir), True)
    while file_list.hasNext():
        file = file_list.next()
        list_of_files.append(str(file.getPath()))

# Move every file into the destination folder, flattening the directory structure
for file in list_of_files:
    dbutils.fs.mv(file, os.path.join(dest_dir, os.path.basename(file)), recurse=True)
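
If the source folder is flat (no nested subdirectories to walk), a sketch that stays entirely within dbutils is also possible; entries returned by dbutils.fs.ls for directories have names ending in '/':
# Sketch: flatten-move using only dbutils (assumes a flat or shallow source folder)
for f in dbutils.fs.ls(source_dir):
    if not f.name.endswith('/'):  # skip subdirectories
        dbutils.fs.mv(f.path, dest_dir + f.name)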

Related

Untar Files present on google cloud storage using cloud composer

I have a scenario where I have to untar files landed on Google Cloud Storage and place them in another folder in the bucket for processing, and after untarring I have to remove some XML headers from the extracted XML files. We have to use Cloud Composer to do all these tasks; in the current on-premise setup this is all handled with bash scripts.
I have tried to use a Python function, invoked via PythonOperator, to untar the files and place them in the output location, and I am thinking of using BashOperator to run sed and grep commands to remove the headers in the extracted files.
The code below does not seem to produce any output:
import io
import os
import tarfile
from google.cloud import storage

client = storage.Client("project_id")
input_bucket = client.get_bucket('bucket_name')
output_bucket = client.get_bucket('bucket_name')

def untar(context):
    # Get the contents of the uploaded file
    file_name = "testzip.tar.gz"
    print("Starting_untar")
    input_blob = input_bucket.get_blob(f'/LnS_landing/{file_name}').download_as_string()
    print("file_name:", input_blob.name)
    tar = tarfile.open(fileobj=io.BytesIO(input_blob))
    for member in tar.getnames():
        file_object = tar.extractfile(member)
        # print(file_object)
        if file_object:
            output_blob = output_bucket.blob(os.path.join(f'/LnS_landing/{file_name}', member))
            output_blob.upload_from_string(file_object.read())
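
For what it's worth, here is a sketch of how the untar step could look, assuming the bucket names and the LnS_landing prefix from the question; the LnS_extracted output prefix is an assumption. Two details stand out in the code above: GCS object names do not start with a leading '/', and download_as_string() returns bytes, so the .name attribute lives on the blob object rather than on the downloaded data.
import io
import tarfile
from google.cloud import storage

client = storage.Client("project_id")
input_bucket = client.get_bucket('bucket_name')
output_bucket = client.get_bucket('bucket_name')

def untar(context):
    file_name = "testzip.tar.gz"
    # Object names in GCS have no leading '/'
    input_blob = input_bucket.get_blob(f'LnS_landing/{file_name}')
    data = input_blob.download_as_string()
    with tarfile.open(fileobj=io.BytesIO(data)) as tar:
        for member in tar.getmembers():
            file_object = tar.extractfile(member)
            if file_object:
                # 'LnS_extracted' output prefix is an assumption
                output_blob = output_bucket.blob(f'LnS_extracted/{member.name}')
                output_blob.upload_from_string(file_object.read())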

Automating running Python code using Azure services

Hi everyone on Stack Overflow,
I wrote two Python scripts. One script is for picking up local files and sending them to GCS (Google Cloud Storage). The other one does the opposite: it takes files that were uploaded to GCS and saves them locally.
I want to automate this process using Azure.
What would you recommend using? Azure Function App, Azure Logic App, or other services?
I'm now trying to use a Logic App. I made an .exe file using PyInstaller and am looking for a connector in Logic App that will run my program (the .exe file). I have a trigger in Logic App - "When a file is added or modified" - but now I am stuck when selecting the next step (connector).
Kind regards,
Anna
Adding code as requested:
from google.cloud import storage
import os
import glob
import json

# Finding path to config file that is called "gcs_config.json" in directory C:/
def find_config(name, path):
    for root, dirs, files in os.walk(path):
        if name in files:
            return os.path.join(root, name)

def upload_files(config_file):
    # Reading 3 parameters for upload from JSON file
    with open(config_file, "r") as file:
        contents = json.loads(file.read())
    print(contents)
    # Setting up login credentials
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = contents['login_credentials']
    # The ID of GCS bucket
    bucket_name = contents['bucket_name']
    # Setting path to files
    LOCAL_PATH = contents['folder_from']
    for source_file_name in glob.glob(LOCAL_PATH + '/**'):
        # For multiple files upload
        # Setting destination folder according to file name
        if os.path.isfile(source_file_name):
            partitioned_file_name = os.path.split(source_file_name)[-1].partition("-")
            file_type_name = partitioned_file_name[0]
            # Setting folder where files will be uploaded
            destination_blob_name = file_type_name + "/" + os.path.split(source_file_name)[-1]
            # Setting up required variables for GCS
            storage_client = storage.Client()
            bucket = storage_client.bucket(bucket_name)
            blob = bucket.blob(destination_blob_name)
            # Running upload and printing confirmation message
            blob.upload_from_filename(source_file_name)
            print("File from {} uploaded to {} in bucket {}.".format(
                source_file_name, destination_blob_name, bucket_name))

config_file = find_config("gcs_config.json", "C:/")
upload_files(config_file)
config.json:
{
    "login_credentials": "C:/Users/AS/Downloads/bright-velocity-___-53840b2f9bb4.json",
    "bucket_name": "staging.bright-velocity-___.appspot.com",
    "folder_from": "C:/Users/AS/Documents/Test2/",
    "folder_for_downloaded_files": "C:/Users/AnnaShepilova/Documents/DownloadedFromGCS2/",
    "given_date": "",
    "given_prefix": ["Customer", "Account"]
}
Currently, there is no built-in connector in Logic Apps for interacting with Google Cloud services; however, Google Cloud Storage does provide a REST API that you can call from your Logic App or Function App.
My suggestion is to use an Azure Function to do these things, because an Azure Function is more flexible for writing your own flow for the task.
Refer to running your .exe file in the Azure Function, whether you are using a Local EXE or a Cloud Environment EXE.
Refer here for more information.
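
Not from the original answer, but as an illustration of the Azure Function route: a minimal sketch using the Azure Functions Python v2 programming model, where the "uploads" container, the connection setting name, and the GCS bucket are hypothetical placeholders.
import azure.functions as func
from google.cloud import storage

app = func.FunctionApp()

# Hypothetical sketch: fires when a blob lands in the "uploads" container and
# copies it to a GCS bucket; container, connection and bucket names are placeholders.
@app.blob_trigger(arg_name="inputblob", path="uploads/{name}", connection="AzureWebJobsStorage")
def copy_to_gcs(inputblob: func.InputStream):
    gcs_client = storage.Client()
    bucket = gcs_client.bucket("<your-gcs-bucket>")
    blob_name = inputblob.name.split("/", 1)[-1]  # drop the container prefix
    bucket.blob(blob_name).upload_from_string(inputblob.read())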

How do I iterate through multiple text(.txt) files in a folder on Google Drive to upload on Google Colab?

I have a folder on Google Drive that consists of multiple text files. I want to upload them to Google Colab by iterating through each file in the folder. It would be great if someone could help me out with this.
In order to read .txt files from your Google Drive (not a .zip or .rar folder):
First you have to mount your Drive (like most Colab code that works with Google Drive):
from google.colab import drive
drive.mount('/content/drive')
Then the following code will read every text file (any file ending with .txt) in the given folder and save its contents to new_list:
import os

new_list = []
for root, dirs, files in os.walk("/content/.../folder_of_txt_files"):
    for file in files:
        if file.endswith('.txt'):
            with open(os.path.join(root, file), 'r') as f:
                text = f.read()
                new_list.append(text)
Obviously, you can save the texts into a dictionary, a DataFrame, or any data structure you prefer, as sketched below.
Note: depending on the file contents, you may sometimes need to change 'r' to 'rb' (reading in binary mode).
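
For example, a small sketch that also keeps the file names and collects everything into a pandas DataFrame (the column names are arbitrary):
import os
import pandas as pd

rows = []
for root, dirs, files in os.walk("/content/.../folder_of_txt_files"):
    for file in files:
        if file.endswith('.txt'):
            with open(os.path.join(root, file), 'r') as f:
                rows.append({"filename": file, "text": f.read()})

df = pd.DataFrame(rows)  # one row per text file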
You need a listOfFileNames.txt file located in the same folder; for example, I have a listOfDates.txt file that stores the file names, titled by date.
import numpy as np
import pandas as pd

# listOfFileNames = ['8_26_2021', '8_27_2021', '8_29_2021', '8_30_2021']
savedListOfFileNames = pd.read_csv('listOfFilesNames.txt', header=None).copy()

emptyVectorToStoreAllOfTheData = []
listOfFileNames = []

for iteratingThroughFileNames in range(len(savedListOfFileNames)):
    listOfFileNames.append(savedListOfFileNames[0][iteratingThroughFileNames])

for iteratingThroughFileNames in range(len(listOfFileNames)):
    currentFile = pd.read_csv(listOfFileNames[iteratingThroughFileNames] + '.txt', header=None).copy()
    for iteratingThroughCurrentFile in range(len(currentFile)):
        emptyVectorToStoreAllOfTheData.append(currentFile[0][iteratingThroughCurrentFile])
If you don't know how to access your folders and files, then you need to (1) mount your drive and (2) define a createWorkingDirectoryFunction:
import os
from google.colab import drive

myGoogleDrive = drive.mount('/content/drive', force_remount = True)

def createWorkingDirectoryFunction(projectFolder, rootDirectory):
    if not os.path.isdir(rootDirectory + projectFolder):
        os.mkdir(rootDirectory + projectFolder)
    os.chdir(rootDirectory + projectFolder)

projectFolder = '/folderContainingMyFiles/'  # Folder you want to access and/or create
rootDirectory = '/content/drive/My Drive/Colab Notebooks'

createWorkingDirectoryFunction(projectFolder, rootDirectory)
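
Once the working directory is set, the os.walk loop from the first answer can simply point at the current directory; for example:
import os

# List the .txt files in the newly created/selected working directory
txt_files = [f for f in os.listdir(os.getcwd()) if f.endswith('.txt')]
print(txt_files)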

Set working directory for Google Colab notebook to Drive location of the notebook

I'm trying to set the working directory for a Google Colab notebook to the location where the notebook resides in Google Drive without manually copying-pasting the folderpath. The motivation is to allow copies of the notebook to function in place and dynamically set the working directory to the location of the notebook without having to manually copy and paste the location to the code.
I have code to mount the notebook to Google Drive and know how to set the working directory but would like to have a section of code that identifies the location of the notebook and stores it as a variable/object.
## Mount notebook to Google Drive
import os
from google.colab import drive
drive.mount("/content/drive", force_remount=True)
## Here is where i'd like to save the folderpath of the notebook
## for example, I would like root_path to ultimately be a folder named "Research" located in a Shared Drive named "Projects"
## root_path should equal '/content/drive/Shared drives/Projects/Research'
## the notebook resides in this "Research" folder
## then change the working directory to root_path
os.chdir(root_path)
This is quite complicated. You need to get the current notebook's file_id. Then look up all its parents and get their names.
# all imports, login, connect drive
import os
from pathlib import Path
import requests
from google.colab import auth
auth.authenticate_user()
from googleapiclient.discovery import build
drive = build('drive', 'v3').files()

# recursively get names
def get_path(file_id):
    f = drive.get(fileId=file_id, fields='name, parents').execute()
    name = f.get('name')
    if f.get('parents'):
        parent_id = f.get('parents')[0]  # assume 1 parent
        return get_path(parent_id) / name
    else:
        return Path(name)

# change directory
def chdir_notebook():
    d = requests.get('http://172.28.0.2:9000/api/sessions').json()[0]
    file_id = d['path'].split('=')[1]
    path = get_path(file_id)
    nb_dir = '/content/drive' / path.parent
    os.chdir(nb_dir)
    return nb_dir
Now you just call chdir_notebook(), and it will change to the original directory of that notebook.
And don't forget to connect to your Google Drive first.
Here's a workable notebook
I have simplified all of this and added it to my library:
!pip install kora -q
from kora import drive
drive.chdir_notebook()

Downloading a file from google cloud storage inside a folder

I've got a Python script that gets a list of files that have been uploaded to a Google Cloud Storage bucket and attempts to retrieve the data as a string.
The code is simply:
file = open(base_dir + "/" + path, 'wb')
data = Blob(path, bucket).download_as_string()
file.write(data)
My issue is that the data I've uploaded is stored inside folders in the bucket, so the path would be something like:
folder/innerfolder/file.jpg
When the google library attempts to download the file, it gets it in the form of a GET request, which turns the above path into:
https://www.googleapis.com/storage/v1/b/bucket/o/folder%2Finnerfolder%2Ffile.jpg
Is there any way to stop this from happening, or to download the file anyway? Cheers.
Yes - you can do this with the Python storage client library.
Just install it with pip install --upgrade google-cloud-storage and then use the following code:
from google.cloud import storage
# Initialise a client
storage_client = storage.Client("[Your project name here]")
# Create a bucket object for our bucket
bucket = storage_client.get_bucket(bucket_name)
# Create a blob object from the filepath
blob = bucket.blob("folder_one/foldertwo/filename.extension")
# Download the file to a destination
blob.download_to_filename(destination_file_name)
You can also use .download_as_string(), but as you're writing it to a file anyway, downloading straight to the file may be easier.
The only slightly awkward thing to be aware of is that the filepath is the path after the bucket name, so it doesn't line up exactly with the path shown in the web interface.
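
If you want the local copy to mirror the nested object name, create the local folders before downloading; a short sketch, where the project name is a placeholder and base_dir is the variable from the question:
import os
from google.cloud import storage

storage_client = storage.Client("[Your project name here]")
bucket = storage_client.get_bucket("bucket")

path = "folder/innerfolder/file.jpg"                       # object name, i.e. the path after the bucket
destination = os.path.join(base_dir, path)                 # base_dir as defined in the question
os.makedirs(os.path.dirname(destination), exist_ok=True)   # create the local folders first
bucket.blob(path).download_to_filename(destination)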
