I've got a Python script that gets a list of files that have been uploaded to a Google Cloud Storage bucket, and attempts to retrieve the data as a string.
The code is simply:
file = open(base_dir + "/" + path, 'wb')
data = Blob(path, bucket).download_as_string()
file.write(data)
My issue is that the data I've uploaded is stored inside folders in the bucket, so the path would be something like:
folder/innerfolder/file.jpg
When the google library attempts to download the file, it gets it in the form of a GET request, which turns the above path into:
https://www.googleapis.com/storage/v1/b/bucket/o/folder%2Finnerfolder%2Ffile.jpg
Is there any way to stop this happening, or to download the file through this path? Cheers.
Yes - you can do this with the python storage client library.
Just install it with pip install --upgrade google-cloud-storage and then use the following code:
from google.cloud import storage
# Initialise a client
storage_client = storage.Client("[Your project name here]")
# Create a bucket object for our bucket
bucket = storage_client.get_bucket(bucket_name)
# Create a blob object from the filepath
blob = bucket.blob("folder_one/foldertwo/filename.extension")
# Download the file to a destination
blob.download_to_filename(destination_file_name)
You can also use .download_as_string() (deprecated in newer releases of the library in favour of .download_as_bytes()), but as you're writing it to a file anyway, downloading straight to the file may be easier.
The only slightly awkward thing to be aware of is that the file path is the object name, i.e. everything after the bucket name, so it doesn't line up exactly with the path shown in the web interface.
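On the %2F in the request URL: as far as I can tell this is just the object name being percent-encoded so that it fits into a single URL path segment, and the server decodes it again, so it should not need a workaround. A small standard-library sketch of that round trip (the helper name object_url is made up for illustration, and the URL template simply mirrors the one in the question):

```python
from urllib.parse import quote, unquote


def object_url(bucket_name: str, object_name: str) -> str:
    """Build an object URL shaped like the one in the question (illustrative helper)."""
    # safe="" makes quote() encode "/" as %2F instead of leaving it untouched
    encoded = quote(object_name, safe="")
    return f"https://www.googleapis.com/storage/v1/b/{bucket_name}/o/{encoded}"


url = object_url("bucket", "folder/innerfolder/file.jpg")
print(url)
# Decoding the last path segment gives back the original object name
print(unquote(url.rsplit("/", 1)[-1]))
```

So "folder/innerfolder/file.jpg" is still the name to pass to bucket.blob(); the encoding only happens on the wire.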
Related
I have a scenario where I have to untar files that land in Google Cloud Storage and place them in another folder in the bucket for processing. After untarring, I have to remove some XML headers from the extracted XML file. We have to use Cloud Composer for all of these tasks; in the current on-premise setup this is all handled by bash scripts.
I have tried using a Python function, invoked through a PythonOperator, to untar the files and place them in the output location, and I am thinking of using a BashOperator to run sed and grep commands to remove the headers from the extracted file.
The code below does not seem to produce any output:
import io
import os
import tarfile
from google.cloud import storage
client = storage.Client("project_id")
input_bucket = client.get_bucket('bucket_name')
output_bucket = client.get_bucket('bucket_name')
def untar(context):
    # Get the contents of the uploaded file
    file_name="testzip.tar.gz"
    print("Starting_untar")
    input_blob = input_bucket.get_blob(f'/LnS_landing/{file_name}').download_as_string()
    print("file_name:",input_blob.name)
    tar = tarfile.open(fileobj=io.BytesIO(input_blob))
    for member in tar.getnames():
        file_object = tar.extractfile(member)
        # print(file_object)
        if file_object:
            output_blob = output_bucket.blob(os.path.join(f'/LnS_landing/{file_name}', member))
            output_blob.upload_from_string(file_object.read())
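Two things in the snippet above look likely to cause trouble: download_as_string() returns bytes, so input_blob.name cannot work, and get_blob() returns None when no object has exactly that name (the leading slash makes '/LnS_landing/...' a different name from 'LnS_landing/...'). Below is a minimal sketch of the untar step under those assumptions. The function names, source_name and output_prefix are hypothetical, and the bucket arguments are assumed to be google.cloud.storage Bucket objects:

```python
import io
import posixpath
import tarfile


def extract_tar_bytes(tar_bytes: bytes) -> dict:
    """Return {member name: file contents} for the regular files in a tar archive."""
    extracted = {}
    # mode "r:*" lets tarfile detect gzip/bz2/xz compression on its own
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # skip directories, links and other non-file members
            extracted[member.name] = tar.extractfile(member).read()
    return extracted


def untar_blob(input_bucket, output_bucket, source_name: str, output_prefix: str) -> list:
    """Untar one object and upload each member under output_prefix.

    Assumption: input_bucket and output_bucket are google.cloud.storage Bucket
    objects; source_name and output_prefix are placeholders supplied by the caller.
    """
    blob = input_bucket.get_blob(source_name)  # None if the object does not exist
    if blob is None:
        raise FileNotFoundError(source_name)
    uploaded = []
    for member_name, data in extract_tar_bytes(blob.download_as_bytes()).items():
        # Object names always use "/", so join with posixpath rather than os.path
        target_name = posixpath.join(output_prefix, member_name)
        output_bucket.blob(target_name).upload_from_string(data)
        uploaded.append(target_name)
    return uploaded
```

In Composer, a function like untar_blob could be wrapped in the callable given to the PythonOperator; the whole archive is held in memory here, so this only suits archives that fit comfortably in the worker's RAM.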
Using Python I am able to delete files from a bucket using prefixes, but in the Python code the prefix means a directory.
I want to delete the files in a GCP bucket whose names start with example.
For example:
example-2022-12-07
example-2022-12-08
I followed this (Delete Files from Google Cloud Storage) but did not get the answer.
I am trying this, but it is not working:
blobs = bucket.list_blobs()
fileList = [file.name for file in blobs if 'example' in file.name ]
print(fileList)
for file in fileList:
    blob = blobs.blob(file)
    blob.delete()
    print(f"Blob {blob_name} deleted.")
You can try the following code to delete files from Google Cloud Storage using the blob.delete method, as suggested in the documentation.
Below is an example of what you are looking for:
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket(bucket_name)
# List the objects whose names start with the given prefix;
# "example" is the prefix from the question
blobs = bucket.list_blobs(prefix="example")
for blob in blobs:
    blob.delete()
    print(f"Blob {blob.name} deleted.")
You can check with this thread1 and thread2.
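One point worth spelling out: the prefix passed to list_blobs is matched against the start of the object name as a plain string, so it is not limited to "directories" and should also catch names like example-2022-12-07. A minimal sketch, assuming bucket is a google.cloud.storage Bucket (the function name is hypothetical):

```python
def delete_blobs_with_prefix(bucket, prefix: str) -> list:
    """Delete every object whose name starts with prefix; return the deleted names."""
    deleted = []
    # list_blobs(prefix=...) filters by the start of the object name
    for blob in bucket.list_blobs(prefix=prefix):
        blob.delete()
        deleted.append(blob.name)
    return deleted


# Example usage (needs credentials; "my-bucket" is a placeholder name):
# from google.cloud import storage
# bucket = storage.Client().bucket("my-bucket")
# print(delete_blobs_with_prefix(bucket, "example"))
```

Returning the deleted names makes it easy to log or double-check what was removed, since deletion is not reversible unless object versioning is enabled on the bucket.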
Below is the function to download the files from an S3 bucket.
The problem is that I can't find how to direct those files to a network path. They are downloaded into the project folder, and I have no control over where they end up.
import boto3
import config
import os
import win32api
def download_all_objects_in_folder():
    #= boto3.resource('s3')
    s3_resource = boto3.resource('s3', aws_access_key_id=config.AWS_BUCKET_KEY, aws_secret_access_key=config.AWS_BUCKET_SECRET_KEY)
    my_bucket = s3_resource.Bucket(config.BUCKET)
    # Create the folder logic here
    objects = my_bucket.objects.filter(Prefix='Export_20181104/')
    for obj in objects:
        path, filename = os.path.split(obj.key)
        my_bucket.download_file(obj.key, filename,"C:\Other")
        #win32api.MessageBox(0, obj.key, 'title')
    print("imports completed")
Update:
This is the error I am getting when I pass the custom path.
ValueError: Invalid extra_args key 'C', must be one of: ChecksumMode,
VersionId, SSECustomerAlgorithm, SSECustomerKey, SSECustomerKeyMD5,
RequestPayer, ExpectedBucketOwner
It makes no difference to Python or the Boto3 SDK whether you download the S3 object to a local or a network location.
That concern is up to the operating system.
Download to the network location just as you would to a local one.
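The ValueError in the question appears to come from the third positional parameter of download_file, which is ExtraArgs rather than a target directory; the destination folder belongs inside the Filename argument instead. A minimal sketch, assuming bucket is a boto3 Bucket resource (the function name and the UNC path in the usage comment are hypothetical):

```python
import os


def download_prefix_to_dir(bucket, prefix: str, target_dir: str) -> list:
    """Download every object under prefix into target_dir; return the local paths."""
    os.makedirs(target_dir, exist_ok=True)
    saved = []
    for obj in bucket.objects.filter(Prefix=prefix):
        filename = os.path.basename(obj.key)
        if not filename:
            continue  # keys ending in "/" are folder placeholders, nothing to save
        destination = os.path.join(target_dir, filename)
        # The second argument is the full local (or network) file path
        bucket.download_file(obj.key, destination)
        saved.append(destination)
    return saved


# Example usage (the share path is a placeholder):
# download_prefix_to_dir(my_bucket, "Export_20181104/", r"\\server\share\exports")
```

Note that this flattens the key structure: objects with the same file name under different sub-prefixes would overwrite each other in target_dir.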
Hi everyone on Stackoverflow,
I wrote two Python scripts. One picks up local files and sends them to GCS (Google Cloud Storage). The other does the opposite: it takes files that were uploaded to GCS and saves them locally.
I want to automate the process using Azure.
What would you recommend to use? Azure Function App, Azure Logic App or other services?
I'm now trying to use a Logic App. I made an .exe file using pyinstaller and am looking for a connector in Logic Apps that will run my program (the .exe file). I have a trigger in the Logic App, "When a file is added or modified", but now I am stuck selecting the next step (connector).
Kind regards,
Anna
Adding code as requested:
from google.cloud import storage
import os
import glob
import json
# Finding path to config file that is called "gcs_config.json" in directory C:/
def find_config(name, path):
    for root, dirs, files in os.walk(path):
        if name in files:
            return os.path.join(root, name)
def upload_files(config_file):
    # Reading 3 Parameters for upload from JSON file
    with open(config_file, "r") as file:
        contents = json.loads(file.read())
        print(contents)
    # Setting up login credentials
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = contents['login_credentials']
    # The ID of GCS bucket
    bucket_name = contents['bucket_name']
    # Setting path to files
    LOCAL_PATH = contents['folder_from']
    for source_file_name in glob.glob(LOCAL_PATH + '/**'):
        # For multiple files upload
        # Setting destination folder according to file name
        if os.path.isfile(source_file_name):
            partitioned_file_name = os.path.split(source_file_name)[-1].partition("-")
            file_type_name = partitioned_file_name[0]
            # Setting folder where files will be uploaded
            destination_blob_name = file_type_name + "/" + os.path.split(source_file_name)[-1]
            # Setting up required variables for GCS
            storage_client = storage.Client()
            bucket = storage_client.bucket(bucket_name)
            blob = bucket.blob(destination_blob_name)
            # Running upload and printing confirmation message
            blob.upload_from_filename(source_file_name)
            print("File from {} uploaded to {} in bucket {}.".format(
                source_file_name, destination_blob_name, bucket_name
            ))
config_file = find_config("gcs_config.json", "C:/")
upload_files(config_file)
config.json:
{
"login_credentials": "C:/Users/AS/Downloads/bright-velocity-___-53840b2f9bb4.json",
"bucket_name": "staging.bright-velocity-___.appspot.com",
"folder_from": "C:/Users/AS/Documents/Test2/",
"folder_for_downloaded_files": "C:/Users/AnnaShepilova/Documents/DownloadedFromGCS2/",
"given_date": "",
"given_prefix": ["Customer", "Account"] }
Currently, there is no built-in connector in Logic Apps for interacting with Google Cloud services. However, Google Cloud Storage does provide a REST API, which you can call from your Logic App or Function App.
My suggestion, though, is to use an Azure Function for this, because it gives you more flexibility to write your own flow for the task.
Refer to the guidance on running your .exe file in an Azure Function, whether you are using a local .exe or a cloud-environment .exe.
Refer here for more information
Is there any API function that allows us to move files in Google Cloud Storage from one bucket to another bucket?
The scenario is that we want Python to move files that have been read from bucket A to bucket B. I know that gsutil can do that, but I am not sure whether Python supports it.
Thanks.
Here's a function I use when moving blobs between directories within the same bucket or to a different bucket.
from google.cloud import storage
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="path_to_your_creds.json"
def mv_blob(bucket_name, blob_name, new_bucket_name, new_blob_name):
    """
    Function for moving files between directories or buckets. it will use GCP's copy
    function then delete the blob from the old location.
    inputs
    -----
    bucket_name: name of bucket
    blob_name: str, name of file
        ex. 'data/some_location/file_name'
    new_bucket_name: name of bucket (can be same as original if we're just moving around directories)
    new_blob_name: str, name of file in new directory in target bucket
        ex. 'data/destination/file_name'
    """
    storage_client = storage.Client()
    source_bucket = storage_client.get_bucket(bucket_name)
    source_blob = source_bucket.blob(blob_name)
    destination_bucket = storage_client.get_bucket(new_bucket_name)
    # copy to new destination
    new_blob = source_bucket.copy_blob(
        source_blob, destination_bucket, new_blob_name)
    # delete in old destination
    source_blob.delete()
    print(f'File moved from {source_blob} to {new_blob_name}')
Using the google-api-python-client, there is an example on the storage.objects.copy page. After you copy, you can delete the source with storage.objects.delete.
import json
destination_object_resource = {}
req = client.objects().copy(
    sourceBucket=bucket1,
    sourceObject=old_object,
    destinationBucket=bucket2,
    destinationObject=new_object,
    body=destination_object_resource)
resp = req.execute()
print(json.dumps(resp, indent=2))
client.objects().delete(
    bucket=bucket1,
    object=old_object).execute()
You can use the GCS Client Library functions documented at [1] to read from one bucket, write to the other, and then delete the source file.
You can even use the GCS REST API documented at [2].
Link:
[1] - https://developers.google.com/appengine/docs/python/googlecloudstorageclient/functions
[2] - https://developers.google.com/storage/docs/concepts-techniques#overview
from google.cloud import storage
storage_client = storage.Client()
def GCP_BUCKET_A_TO_B():
    source_bucket = storage_client.get_bucket("Bucket_A_Name")
    destination_bucket = storage_client.get_bucket("Bucket_B_Name")
    # Copy every blob in bucket A to bucket B under the same name.
    # This only copies; delete the source blob afterwards to complete a move.
    for source_blob in source_bucket.list_blobs(prefix=""):
        new_blob = source_bucket.copy_blob(
            source_blob, destination_bucket, source_blob.name)
I just wanted to point out that there's another possible approach and that is using gsutil through the use of the subprocess module.
The advantages of using gsutil like that:
You don't have to deal with individual blobs
gsutil's implementation of the move and especially rsync will probably be much better and more resilient than what we do ourselves.
The disadvantages:
You can't deal with individual blobs easily
It's hacky and generally a library is preferable to executing shell commands
Example:
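import subprocess
def move(source_uri: str,
         destination_uri: str) -> None:
    """
    Move file from source_uri to destination_uri.
    :param source_uri: gs:// - like uri of the source file/directory
    :param destination_uri: gs:// - like uri of the destination file/directory
    :return: None
    """
    # Pass the arguments as a list: a single command string only works with shell=True
    cmd = ["gsutil", "-m", "mv", source_uri, destination_uri]
    # check=True raises CalledProcessError if gsutil exits with an error
    subprocess.run(cmd, check=True)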