Untar files present on Google Cloud Storage using Cloud Composer - Python

I have a scenario where I have to untar files landed on Google Cloud Storage, place them in another folder on the bucket for processing, and after untarring remove some XML headers from the extracted XML files. We have to use Cloud Composer for all of these tasks; in the current on-premise setup this is all handled with bash scripts.
I have tried a Python function, invoked via a PythonOperator, to untar the files and place them in the output location, and I am thinking of using a BashOperator to run sed and grep commands to remove the headers from the extracted files (a rough sketch of that step follows the code below).
The code below does not seem to produce any output:
import io
import os
import tarfile
from google.cloud import storage
client = storage.Client("project_id")
input_bucket = client.get_bucket('bucket_name')
output_bucket = client.get_bucket('bucket_name')
def untar(context):
    # Get the contents of the uploaded file.
    file_name = "testzip.tar.gz"
    print("Starting_untar")
    # Blob names must not start with a leading slash, otherwise get_blob returns None.
    input_blob = input_bucket.get_blob(f'LnS_landing/{file_name}')
    print("file_name:", input_blob.name)
    data = input_blob.download_as_string()
    tar = tarfile.open(fileobj=io.BytesIO(data), mode='r:gz')
    for member in tar.getnames():
        file_object = tar.extractfile(member)
        if file_object:
            # Write each extracted member back to the bucket under a folder named after the archive.
            output_blob = output_bucket.blob(os.path.join(f'LnS_landing/{file_name}', member))
            output_blob.upload_from_string(file_object.read())
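For the header-removal step, this is roughly what I am planning with the BashOperator (just a sketch, not tested: the task id, the Composer staging path under /home/airflow/gcs/data, and the sed pattern are placeholders):
from airflow.operators.bash import BashOperator

# Placeholder sketch: copy the extracted file to the Composer data folder,
# strip the XML declaration line with sed, and copy it back to the bucket.
remove_headers = BashOperator(
    task_id="remove_xml_headers",
    bash_command=(
        "gsutil cp gs://bucket_name/LnS_landing/extracted.xml /home/airflow/gcs/data/extracted.xml && "
        "sed -i '/<?xml/d' /home/airflow/gcs/data/extracted.xml && "
        "gsutil cp /home/airflow/gcs/data/extracted.xml gs://bucket_name/LnS_landing/extracted.xml"
    ),
)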

Related

python delete files from google cloud storage that starts with

Using Python I am able to delete files from a bucket using prefixes, but in the Python client a prefix effectively means a directory path.
I want to delete the files from a GCP bucket whose names start with example.
For example:
example-2022-12-07
example-2022-12-08
I followed this (Delete Files from Google Cloud Storage) but did not get the answer.
I am trying this, but it is not working:
blobs = bucket.list_blobs()
fileList = [file.name for file in blobs if 'example' in file.name]
print(fileList)
for file in fileList:
    blob = blobs.blob(file)
    blob.delete()
    print(f"Blob {blob_name} deleted.")
You can try the following code to delete files from Google Cloud Storage by using the blob.delete method, as suggested in the documentation.
Below is an example of what you are looking for:
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket(bucket_name)
# List all objects whose names start with the given prefix
# by passing prefix as a parameter to bucket.list_blobs
blobs = bucket.list_blobs(prefix="example")
for blob in blobs:
    blob.delete()
    print(f"Blob {blob.name} deleted.")
You can also check this thread1 and thread2.
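If there are many matching objects, a batched variant along these lines should also work (a sketch only, reusing the same client and bucket and the library's Bucket.delete_blobs helper):
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket(bucket_name)
# Collect every blob whose name starts with "example" and delete them in one call.
matching = list(bucket.list_blobs(prefix="example"))
bucket.delete_blobs(matching)
print(f"Deleted {len(matching)} blobs.")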

Automating running Python code using Azure services

Hi everyone on Stack Overflow,
I wrote two Python scripts. One script picks up local files and sends them to GCS (Google Cloud Storage). The other one does the opposite: it takes the files that were uploaded to GCS and saves them locally.
I want to automate this process using Azure.
What would you recommend using? Azure Function App, Azure Logic App or another service?
I'm now trying to use Logic App. I made an .exe file using pyinstaller and I am looking for a connector in Logic App that will run my program (the .exe file). I have a trigger in Logic App - "When a file is added or modified" - but now I'm stuck when selecting the next step (connector).
Kind regards,
Anna
Adding code as requested:
from google.cloud import storage
import os
import glob
import json

# Finding path to config file that is called "gcs_config.json" in directory C:/
def find_config(name, path):
    for root, dirs, files in os.walk(path):
        if name in files:
            return os.path.join(root, name)

def upload_files(config_file):
    # Reading 3 parameters for upload from JSON file
    with open(config_file, "r") as file:
        contents = json.loads(file.read())
        print(contents)

    # Setting up login credentials
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = contents['login_credentials']
    # The ID of GCS bucket
    bucket_name = contents['bucket_name']
    # Setting path to files
    LOCAL_PATH = contents['folder_from']

    for source_file_name in glob.glob(LOCAL_PATH + '/**'):
        # For multiple files upload
        # Setting destination folder according to file name
        if os.path.isfile(source_file_name):
            partitioned_file_name = os.path.split(source_file_name)[-1].partition("-")
            file_type_name = partitioned_file_name[0]
            # Setting folder where files will be uploaded
            destination_blob_name = file_type_name + "/" + os.path.split(source_file_name)[-1]
            # Setting up required variables for GCS
            storage_client = storage.Client()
            bucket = storage_client.bucket(bucket_name)
            blob = bucket.blob(destination_blob_name)
            # Running upload and printing confirmation message
            blob.upload_from_filename(source_file_name)
            print("File from {} uploaded to {} in bucket {}.".format(
                source_file_name, destination_blob_name, bucket_name
            ))

config_file = find_config("gcs_config.json", "C:/")
upload_files(config_file)
config.json:
{
    "login_credentials": "C:/Users/AS/Downloads/bright-velocity-___-53840b2f9bb4.json",
    "bucket_name": "staging.bright-velocity-___.appspot.com",
    "folder_from": "C:/Users/AS/Documents/Test2/",
    "folder_for_downloaded_files": "C:/Users/AnnaShepilova/Documents/DownloadedFromGCS2/",
    "given_date": "",
    "given_prefix": ["Customer", "Account"]
}
Currently, there is no built-in connector in Logic Apps for interacting with Google Cloud services. However, Google Cloud Storage does provide a REST API that you can call from your Logic App or Function App.
My suggestion, though, is to use an Azure Function for these tasks, because an Azure Function gives you more flexibility to write your own flow.
Refer to this guidance on running your .exe file in an Azure Function, whether you are using a local EXE or a cloud-hosted EXE.
Refer here for more information.
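As an illustration only (not something from the linked references), a timer-triggered Python Azure Function could import and call the existing upload logic directly instead of wrapping it in an .exe; the module name gcs_upload and the config path are hypothetical:
import logging
import azure.functions as func

# Hypothetical module: the find_config/upload_files helpers from the script above,
# packaged alongside the function app.
from gcs_upload import find_config, upload_files

def main(mytimer: func.TimerRequest) -> None:
    # Runs on the schedule defined in function.json and pushes local files to GCS.
    logging.info("Starting scheduled GCS upload.")
    config_file = find_config("gcs_config.json", "/home/site/wwwroot")
    upload_files(config_file)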

How to move files from one folder to another on databricks

I am trying to move a file from one folder to another folder using a Databricks Python notebook.
My source is Azure Data Lake Gen 1.
Suppose my file is present at adl://testdatalakegen12021.azuredatalakestore.net/source/test.csv
and I am trying to move the file from adl://testdatalakegen12021.azuredatalakestore.net/demo/test.csv to adl://testdatalakegen12021.azuredatalakestore.net/destination/movedtest.csv
I tried various approaches but none of my code is working.
# Move a file by renaming its path
import os
import shutil
os.rename('adl://testdatalakegen12021.azuredatalakestore.net/demo/test.csv', 'adl://testdatalakegen12021.azuredatalakestore.net/demo/renamedtest.csv')
# Move a file from the directory d1 to d2
shutil.move('adl://testdatalakegen12021.azuredatalakestore.net/demo/test.csv', 'adl://testdatalakegen12021.azuredatalakestore.net/destination/renamedtest.csv')
Please let me know if I am using the correct logic, as I am executing this on Databricks, not locally.
To move a file in a Databricks notebook, you can use dbutils as follows:
dbutils.fs.mv('adl://testdatalakegen12021.azuredatalakestore.net/demo/test.csv', 'adl://testdatalakegen12021.azuredatalakestore.net/destination/renamedtest.csv')
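If you need to move everything under a folder rather than a single file, a small loop over dbutils.fs.ls should also work (a sketch, reusing the paths from the question):
# Move every file under /demo to /destination, keeping the original file names.
source_dir = 'adl://testdatalakegen12021.azuredatalakestore.net/demo/'
dest_dir = 'adl://testdatalakegen12021.azuredatalakestore.net/destination/'

for f in dbutils.fs.ls(source_dir):
    dbutils.fs.mv(f.path, dest_dir + f.name)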
Here are the steps to move files from one folder to another on Databricks:
Mount the Azure Data Lake Storage Gen1 to the Databricks workspace:
configs = {"<prefix>.oauth2.access.token.provider.type": "ClientCredential",
           "<prefix>.oauth2.client.id": "<application-id>",
           "<prefix>.oauth2.credential": dbutils.secrets.get(scope="<scope-name>", key="<key-name-for-service-credential>"),
           "<prefix>.oauth2.refresh.url": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
    source="adl://<storage-resource>.azuredatalakestore.net/<directory-name>",
    mount_point="/mnt/<mount-name>",
    extra_configs=configs)
Reference: Mount Azure Data Lake Storage Gen1 resource using a service principal and OAuth 2.0
Moving file using %fs command
%fs mv dbfs:/mnt/adlsgen1/test/mapreduce.txt dbfs:/mnt/adlsgen1/test1/mapreduce.txt
Moving file using dbutils command:
dbutils.fs.mv('dbfs:/mnt/adlsgen1/test/data.csv', 'dbfs:/mnt/adlsgen1/test1/dataone.csv')
You may also want to move or copy the files without their subdirectories:
import os

source_dir = "/mnt/yourplateform/source"
dest_dir = "/mnt/yourplateform/destination/"
list_of_files = []

fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
path_exists = fs.exists(spark._jvm.org.apache.hadoop.fs.Path(source_dir))
if path_exists == True:
    file_list = fs.listFiles(spark._jvm.org.apache.hadoop.fs.Path(source_dir), True)
    while file_list.hasNext():
        file = file_list.next()
        list_of_files.append(str(file.getPath()))

for file in list_of_files:
    dbutils.fs.mv(file, os.path.join(dest_dir, os.path.basename(file)), recurse=True)
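If you want to copy instead of move (leaving the originals in the source folder), dbutils.fs.cp takes the same arguments; this sketch reuses the list_of_files and dest_dir variables from the snippet above:
# Copy instead of move, keeping the source files in place.
for file in list_of_files:
    dbutils.fs.cp(file, os.path.join(dest_dir, os.path.basename(file)), recurse=True)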

Error in azure functions for python whenever trying to create new directory

I am trying to create a new directory using Azure Functions for Python, but I am not able to create a new directory or file. I get the error below.
Whenever I execute the Azure Function for Python locally it works fine, but not on Azure.
Error:
Error in folder creation: [Errno 30] Read-only file system: './HttpTrigger/logs'
I am trying to create a new logs folder in the HttpTrigger function, but I get the above error.
Please check the code below:
import logging
import struct
import sys
import azure.functions as func
import os

def main(req: func.HttpRequest) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')
    try:
        if not os.path.exists('./HttpTrigger/logs'):
            logging.info('Inside Folder Creation')
            os.makedirs('./HttpTrigger/logs')
        f = open("test.txt", "w+")
        for i in range(10):
            logging.info('Inside For')
            f.write("This is line %d\r\n" % (i+1))
        logging.info('Outside For')
        f.close()
        return func.HttpResponse("Created", status_code=200)
    except Exception as e:
        return func.HttpResponse(f"Error in folder creation: {e}", status_code=400)
Is there any way to create a new directory in Azure Functions for Python? Please let me know if there is.
If you need to do some temporary file processing, Azure Functions provides a temporary directory.
temporary directory in azure functions
Here is a code snippet.
import tempfile
from os import listdir
tempFilePath = tempfile.gettempdir()
fp = tempfile.NamedTemporaryFile()
fp.write(b'Hello world!')
filesDirListInTemp = listdir(tempFilePath)
For reference.
The point of Azure Functions (and more generally serverless functions) is to be triggered by a specific event, execute some logic and then exit. It's not like a regular server where you have access to the file system and where you can read/write files. Actually, you can't be sure it will always be executed by the same physical machine; Azure abstracts all these concepts for you (hence the name "serverless").
Now, if you really need to write files, you should have a look at Blob storage. It's a cloud-native service where you can actually download and upload files. From your Azure function, you'll have to use the Blob storage API to manipulate the files.
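For illustration only (not part of the original answer), uploading a file from a Python Azure Function with the azure-storage-blob package looks roughly like this; the container name and the use of the AzureWebJobsStorage connection string are assumptions:
import os
from azure.storage.blob import BlobServiceClient

# Assumed app setting holding the storage account connection string.
service = BlobServiceClient.from_connection_string(os.environ["AzureWebJobsStorage"])
blob_client = service.get_blob_client(container="logs", blob="test.txt")

# Upload content written to the temp folder instead of the read-only app directory.
with open("/tmp/test.txt", "rb") as data:
    blob_client.upload_blob(data, overwrite=True)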
Your actual app folder will be read from a zip, so it won't allow you to create folders or files. However, you should be able to create temporary directories, for example under the /tmp directory. That said, you shouldn't rely on them being there; they are (as the name implies) intended to be temporary. So I would stick with #frankie567's advice on using something like Azure Storage to store artifacts you expect to pull later.
You could create a file or directory in the temp folder or in the function execution folder. The content of the temp folder won't be preserved all the time, so you could instead create the directory in the execution directory; you can get that directory through the Context binding, using function_directory. For more information you could refer to this doc: Python developer guide.
Below is my test code. I create the folder and file, then send the file names as the response:
import logging
import os
import time
import datetime
import json
import azure.functions as func

def main(req: func.HttpRequest, context: func.Context) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')
    t = datetime.datetime.now().strftime("%H-%M-%S-%f")
    foldername = context.function_directory + "/newfolder" + t
    os.makedirs(foldername)
    suffix = ".txt"
    newfile = t + suffix
    os.getcwd()
    os.chdir(foldername)
    if not os.path.exists(newfile):
        f = open(newfile, 'w')
        f.write("test")
        f.close()
    data = []
    for filename in os.listdir(context.function_directory):
        print(filename)
        d1 = {"filename": filename}
        data.append(d1)
    jsondata = json.dumps(data)
    return func.HttpResponse(jsondata)
Here is the result picture; you can see that I could create the folder and the file.

Downloading a file from google cloud storage inside a folder

I've got a python script that gets a list of files that have been uploaded to a google cloud storage bucket, and attempts to retrieve the data as a string.
The code is simply:
file = open(base_dir + "/" + path, 'wb')
data = Blob(path, bucket).download_as_string()
file.write(data)
My issue is that the data I've uploaded is stored inside folders in the bucket, so the path would be something like:
folder/innerfolder/file.jpg
When the google library attempts to download the file, it gets it in the form of a GET request, which turns the above path into:
https://www.googleapis.com/storage/v1/b/bucket/o/folder%2Finnerfolder%2Ffile.jpg
Is there any way to stop this from happening / download the file this way? Cheers.
Yes - you can do this with the python storage client library.
Just install it with pip install --upgrade google-cloud-storage and then use the following code:
from google.cloud import storage
# Initialise a client
storage_client = storage.Client("[Your project name here]")
# Create a bucket object for our bucket
bucket = storage_client.get_bucket(bucket_name)
# Create a blob object from the filepath
blob = bucket.blob("folder_one/foldertwo/filename.extension")
# Download the file to a destination
blob.download_to_filename(destination_file_name)
You can also use .download_as_string(), but as you're writing it to a file anyway, downloading straight to the file may be easier.
The only slightly awkward thing to be aware of is that the filepath is the path after the bucket name, so it doesn't line up exactly with the path in the web interface.
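If you want to keep writing to a local path that mirrors the folder structure in the bucket, something along these lines should work (a sketch using the same client; base_dir and the object path are taken from the question):
import os
from google.cloud import storage

storage_client = storage.Client("[Your project name here]")
bucket = storage_client.get_bucket(bucket_name)

# Object name inside the bucket, including its "folders".
path = "folder/innerfolder/file.jpg"
destination = os.path.join(base_dir, path)

# Create the local folders before downloading into them.
os.makedirs(os.path.dirname(destination), exist_ok=True)
bucket.blob(path).download_to_filename(destination)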
