GCP Cloud Function (Python) - GCS copy files - duplicate files

I'm trying to copy files from GCS to a different location, but I need to do it in real time with a Cloud Function.
I have created a function and it's working. The issue is that the file gets copied multiple times, into multiple nested folders.
E.g.:
source file path: gs://logbucket/mylog/2020/07/22/log.csv
Expected Target: gs://logbucket/hivelog/2020/07/22/log.csv
My code:
from google.cloud import storage

def hello_gcs_generic(data, context):
    sourcebucket = format(data['bucket'])
    source_file = format(data['name'])
    year = source_file.split("/")[1]
    month = source_file.split("/")[2]
    day = source_file.split("/")[3]
    filename = source_file.split("/")[4]
    print(year)
    print(month)
    print(day)
    print(filename)
    print(sourcebucket)
    print(source_file)

    storage_client = storage.Client()
    source_bucket = storage_client.bucket(sourcebucket)
    source_blob = source_bucket.blob(source_file)
    destination_bucket = storage_client.bucket(sourcebucket)
    destination_blob_name = 'hivelog/year='+year+'/month='+month+'/day='+day+'/'+filename

    blob_copy = source_bucket.copy_blob(
        source_blob, destination_bucket, destination_blob_name
    )
    source_blob.delete()

    print(
        "Blob {} in bucket {} copied to blob {} in bucket {}.".format(
            source_blob.name,
            source_bucket.name,
            blob_copy.name,
            destination_bucket.name,
        )
    )
Output:
You can see year=year=2020 here. How does this happen? Inside it I also have folders like year=year=2020/month=month=07/.
I'm not able to fix this.

You're writing to the same bucket that you're trying to copy from:
destination_bucket = storage_client.bucket(sourcebucket)
Every time you add a new file to the bucket, it's triggering the Cloud Function again.
You either need to use two different buckets, or add a conditional based on the first part of the path:
top_level_directory = source_file.split("/")[0]

if top_level_directory == "mylog":
    # Do the copying
    pass
elif top_level_directory == "hivelog":
    # This is a file created by the function, do nothing
    pass
else:
    # We weren't expecting this top level directory
    pass
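Putting the guard together with the copy logic from the question, a minimal sketch might look like this (it assumes the same single-bucket, mylog/hivelog layout from the question; adjust to your own paths):

from google.cloud import storage

def hello_gcs_generic(data, context):
    source_bucket_name = data['bucket']
    source_file = data['name']
    parts = source_file.split("/")

    # Only process files under mylog/; anything under hivelog/ was written by
    # this function itself and must be skipped to avoid an endless trigger loop.
    if parts[0] != "mylog" or len(parts) != 5:
        print(f"Skipping {source_file}")
        return

    _, year, month, day, filename = parts

    storage_client = storage.Client()
    source_bucket = storage_client.bucket(source_bucket_name)
    source_blob = source_bucket.blob(source_file)
    destination_blob_name = f"hivelog/year={year}/month={month}/day={day}/{filename}"

    blob_copy = source_bucket.copy_blob(
        source_blob, source_bucket, destination_blob_name
    )
    source_blob.delete()
    print(f"Copied {source_blob.name} to {blob_copy.name}")

With that guard in place the function can keep writing into the same bucket without re-processing its own output.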

As an educated guess, the source paths you're copying from are really of the format
/foo/year=2020/month=42/...
so when you split by slashes, you get
foo
year=2020
month=42
and when recomposing those components, you again add another year=/month=/... prefix in
destination_blob_name = 'hivelog/year='+year+'/month='+month+ ...
and there you have it; year=year=year= after 3 iterations...
Are you also sure you're not also iterating over the files you've already copied? That would also cause this.
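If the incoming paths really do already carry key=value components, one way to make the recomposition safe is to strip any existing prefix before rebuilding the destination path. A small sketch (strip_prefix is a hypothetical helper, not part of the original code):

def strip_prefix(component, prefix):
    # Turn "year=2020" into "2020"; leave a bare "2020" untouched.
    return component[len(prefix):] if component.startswith(prefix) else component

_, year, month, day, filename = source_file.split("/")
year = strip_prefix(year, "year=")
month = strip_prefix(month, "month=")
day = strip_prefix(day, "day=")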

import os
import gcsfs

def hello_gcs_generic(data, context):
    fs = gcsfs.GCSFileSystem(project="Project_Name", token=os.getenv("GOOGLE_APPLICATION_CREDENTIALS"))
    source_filepath = f"{data['bucket']}/{data['name']}"
    destination_filepath = source_filepath.replace("mylog", "hivelog")
    fs.cp(source_filepath, destination_filepath)
    print(f"Blob {data['name']} in bucket {data['bucket']} copied to hivelog")
This should give you a head start on what you're trying to accomplish. Replace Project_Name with the name of your GCP Project where the storage buckets are located.
This also assumes you have service account credentials in a JSON file, referenced by the environment variable GOOGLE_APPLICATION_CREDENTIALS, which I'm assuming is the case based on your use of google.cloud.storage.
Now granted, you could accept "mylog" or "hivelog" as arguments and make it useful in other scenarios. Also, for splitting your filename, should you ever need to go that route again, one line will do the trick:
_, year, month, day, filename = data['name'].split('/')
In that scenario, the underscore just serves to tell yourself and others that you don't plan to use that portion of the split.
You could use extended unpacking to ignore multiple values such as
*_,month,day,filename = data['name'].split('/')
Or you could combine the two
*_,month,day,_ = data['name'].split('/')
Edit: link to the gcsfs documentation: https://gcsfs.readthedocs.io/

Related

How to download a list of files from Azure Blob Storage given SAS URI and container name using Python?

I have the container name and its folder structure. I need to download all files in a single folder in the container using Python code. I also have a SAS URL for this particular folder.
The method I found online uses the BlockBlobService class, which is part of the old SDK. I need a way to do it with the current SDK.
Can you please help me with this?
Edit 1:
this is my SAS URL: https://xxxx.blob.core.windows.net/<CONTAINER>/<FOLDER>?sp=r&st=2022-05-31T17:49:47Z&se=2022-06-05T21:59:59Z&sv=2020-08-04&sr=c&sig=9M8ql9nYOhEYdmAOKUyetWbCU8hoWS72UFczkShdbeY%3D
Edit 2:
added link to the method found.
Edit 3:
I also have the full path of the files that I want to download.
Please try this code (untested though).
The code below basically parses the SAS URL and creates an instance of ContainerClient. Then it lists the blobs in that container whose names start with the folder name. Once you have that list, you can download the individual blobs.
I noticed that your SAS URL only has read permission (sp=r). Please note that you would need both read and list permissions (sp=rl). You will need to ask for new SAS URL with these two permissions.
from urllib.parse import urlparse
from azure.storage.blob import ContainerClient

sasUrl = "https://xxxx.blob.core.windows.net/<CONTAINER>/<FOLDER>?sp=r&st=2022-05-31T17:49:47Z&se=2022-06-05T21:59:59Z&sv=2020-08-04&sr=c&sig=9M8ql9nYOhEYdmAOKUyetWbCU8hoWS72UFczkShdbeY%3D"

sasUrlParts = urlparse(sasUrl)
accountEndpoint = sasUrlParts.scheme + '://' + sasUrlParts.netloc
sasToken = sasUrlParts.query

pathParts = sasUrlParts.path.split('/')
containerName = pathParts[1]
folderName = pathParts[2]

containerClient = ContainerClient(accountEndpoint, containerName, credential=sasToken)

blobs = containerClient.list_blobs(name_starts_with=folderName)
for blob in blobs:
    blobClient = containerClient.get_blob_client(blob)
    # download the blob here, e.g. blobClient.download_blob()
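To actually save each blob to disk, the loop body might look something like this (a sketch; it assumes the blob names are safe to reuse as local paths):

import os

for blob in containerClient.list_blobs(name_starts_with=folderName):
    blobClient = containerClient.get_blob_client(blob)
    localPath = blob.name.replace('/', os.sep)
    if os.path.dirname(localPath):
        os.makedirs(os.path.dirname(localPath), exist_ok=True)
    with open(localPath, "wb") as f:
        f.write(blobClient.download_blob().readall())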
UPDATE
I have the SAS URL mentioned above, and the full paths of the files I
want to download. For example: PATH 1 : Container/folder/file1.csv,
PATH 2 : Container/folder/file2.txt, and so on
Please see the code below:
from urllib.parse import urlparse
from azure.storage.blob import BlobClient

sasUrl = "https://xxxx.blob.core.windows.net/<CONTAINER>/<FOLDER>?sp=r&st=2022-05-31T17:49:47Z&se=2022-06-05T21:59:59Z&sv=2020-08-04&sr=c&sig=9M8ql9nYOhEYdmAOKUyetWbCU8hoWS72UFczkShdbeY%3D"
blobNameWithContainer = "Container/folder/file1.csv"

sasUrlParts = urlparse(sasUrl)
accountEndpoint = sasUrlParts.scheme + '://' + sasUrlParts.netloc
sasToken = sasUrlParts.query

blobSasUrl = accountEndpoint + '/' + blobNameWithContainer + '?' + sasToken
blobClient = BlobClient.from_blob_url(blobSasUrl)
# ... now do any operation on that blob ...
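For example, a download to a local file might look like this (a sketch; file1.csv is just the example path from the question):

with open("file1.csv", "wb") as f:
    f.write(blobClient.download_blob().readall())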

Python how to generate a signed url for a Google cloud storage file after composing and renaming the file?

I am building files for download by exporting them from BigQuery into Google Cloud storage. Some of these files are relatively large, and so they have to be composed together after they are sharded by the BigQuery export.
The user inputs a file name for the file that will be generated on the front-end.
On the backend, I am generating a random temporary name so that I can then compose the files together.
# Job Configuration for the job to move the file to bigQuery
job_config = bigquery.QueryJobConfig()
job_config.destination = table_ref
query_job = client.query(sql, job_config=job_config)
query_job.result()
# Generate a temporary name for the file
fname = "".join(random.choices(string.ascii_letters, k=10))
# Generate Destination URI
destinationURI = info["bucket_uri"].format(filename=f"{fname}*.csv")
extract_job = client.extract_table(table_ref, destination_uris=destinationURI, location="US")
extract_job.result()
# Delete the temporary table, if it doesn't exist ignore it
client.delete_table(f"{info['did']}.{info['tmpid']}", not_found_ok=True)
After the data export has completed, I unshard the files by composing the blobs together.
client = storage.Client(project=info["pid"])
bucket = client.bucket(info['bucket_name'])
all_blobs = list(bucket.list_blobs(prefix=fname))
blob_initial = all_blobs.pop(0)

prev_ind = 0
for i in range(31, len(all_blobs), 31):
    # Compose files in chunks of 32 blobs (GCS limit)
    blob_initial.compose([blob_initial, *all_blobs[prev_ind:i]])
    # Sleep to avoid the GCS rate limit for files exceeding ~100 GB
    time.sleep(1.0)
    prev_ind = i
else:
    # Compose all remaining files when fewer than 32 files are left
    blob_initial.compose([blob_initial, *all_blobs[prev_ind:]])

for b in all_blobs:
    # Delete the sharded files
    b.delete()
After all the files have been composed into one file, I rename the blob to the user provided filename. Then I generate a signed URL which gets posted to firebase for the front-end to provide the file for download.
# Rename the file to the user provided filename
bucket.rename_blob(blob_initial, data["filename"])
# Generate signed url to post to firebase
download_url = blob_initial.generate_signed_url(datetime.now() + timedelta(days=10000))
The issue I am encountering occurs because of the use of the random filename used when the files are sharded. The reason I chose to use a random filename instead of the user-provided filename is because there may be instances when multiple users submit a request using the same (default value) filenames, and so those edge-cases would cause issues with the file sharding.
When I try to download the file, I get the following return:
<Error>
<Code>NoSuchKey</Code>
<Message>The specified key does not exist.</Message>
<Details>No such object: project-id.appspot.com/YMHprgIqMe000000000000.csv</Details>
</Error>
Although I renamed the file, it seems that the download URL is still using the old file name.
Is there a way to inform GCS that the filename has changed when I generate the signed URL?
It appears as though all that was needed was to reload the blob!
bucket.rename_blob(blob_initial, data["filename"])
blob = bucket.get_blob(data["filename"])

# Generate signed url to post to firebase
download_url = blob.generate_signed_url(datetime.now() + timedelta(days=10000))
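As a side note, Bucket.rename_blob returns the renamed Blob (at least in recent versions of google-cloud-storage), so the extra lookup can probably be skipped:

new_blob = bucket.rename_blob(blob_initial, data["filename"])
download_url = new_blob.generate_signed_url(datetime.now() + timedelta(days=10000))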

Get count of objects in a specific S3 folder using Boto3

Trying to get count of objects in S3 folder
Current code
bucket='some-bucket'
File='someLocation/File/'
objs = boto3.client('s3').list_objects_v2(Bucket=bucket,Prefix=File)
fileCount = objs['KeyCount']
This gives me a count of 1 + the actual number of objects in S3.
Maybe it is counting "File" as a key too?
Assuming you want to count the keys in a bucket and don't want to hit the 1000-key limit of list_objects_v2: the code below worked for me, but I'm wondering if there is a better, faster way to do it. I looked for a packaged function in the boto3 S3 client, but there isn't one.
# connect to s3 - assuming your creds are all set up and you have boto3 installed
import boto3

s3 = boto3.resource('s3')

# identify the bucket - you can use a prefix if you know what your bucket name starts with
for bucket in s3.buckets.all():
    print(bucket.name)

# get the bucket
bucket = s3.Bucket('my-s3-bucket')

# use a loop and increment the count
count_obj = 0
for i in bucket.objects.all():
    count_obj = count_obj + 1
print(count_obj)
If there are more than 1000 entries, you need to use paginators, like this:
count = 0
client = boto3.client('s3')
paginator = client.get_paginator('list_objects')
for result in paginator.paginate(Bucket='your-bucket', Prefix='your-folder/', Delimiter='/'):
    count += len(result.get('CommonPrefixes', []))
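If what you want is the number of objects under the prefix rather than the number of sub-folders, a similar paginator loop over Contents does it (a sketch, reusing the same placeholder bucket and folder names):

import boto3

count = 0
client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='your-bucket', Prefix='your-folder/'):
    count += len(page.get('Contents', []))
print(count)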
"Folders" do not actually exist in Amazon S3. Instead, all objects have their full path as their filename ('Key'). I think you already know this.
However, it is possible to 'create' a folder by creating a zero-length object that has the same name as the folder. This causes the folder to appear in listings and is what happens if folders are created via the management console.
Thus, you could exclude zero-length objects from your count.
For an example, see: Determine if folder or file key - Boto
The following code worked perfectly
import boto3

def getNumberOfObjectsInBucket(bucketName, prefix):
    count = 0
    response = boto3.client('s3').list_objects_v2(Bucket=bucketName, Prefix=prefix)
    for object in response['Contents']:
        if object['Size'] != 0:
            # print(object['Key'])
            count += 1
    return count
object['Size'] == 0 picks out the folder-placeholder keys, if you want to check them; object['Size'] != 0 gives you all the non-folder keys.
Sample function call below:
getNumberOfObjectsInBucket('foo-test-bucket','foo-output/')
If you have credentials to access that bucket, then you can use this simple code. The code below gives you a list; a list comprehension is used for readability.
objects.filter(Prefix=...) is used to narrow down the objects because, in a bucket, folder names are just part of the keys that identify the files, as John Rotenstein explained concisely.
import boto3
bucket = "Sample_Bucket"
folder = "Sample_Folder"
s3 = boto3.resource("s3")
s3_bucket = s3.Bucket(bucket)
files_in_s3 = [f.key.split(folder + "/")[1] for f in s3_bucket.objects.filter(Prefix=folder).all()]
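Since the question asks for a count, len() of that list then gives it directly:

file_count = len(files_in_s3)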

How to get data from s3 and do some work on it? python and boto

I have a project task that requires using, in an EMR job, some output data I have already produced on S3. I previously ran an EMR job that produced output in one of my S3 buckets, in the form of multiple files named part-xxxx. Now I need to access those files from within my new EMR job, read their contents, and use that data to produce another output.
This is the local code that does the job:
def reducer_init(self):
    self.idfs = {}
    for fname in os.listdir(DIRECTORY):  # look through file names in the directory
        file = open(os.path.join(DIRECTORY, fname))  # open a file
        for line in file:  # read each line in json file
            term_idf = JSONValueProtocol().read(line)[1]  # parse the line as a JSON object
            self.idfs[term_idf['term']] = term_idf['idf']

def reducer(self, term_poster, howmany):
    tfidf = sum(howmany) * self.idfs[term_poster['term']]
    yield None, {'term_poster': term_poster, 'tfidf': tfidf}
This runs just fine locally, but the problem is that the data I need is now on S3, and I need to access it somehow in the reducer_init function.
This is what I have so far, but it fails while executing on EC2:
def reducer_init(self):
    self.idfs = {}
    b = conn.get_bucket(bucketname)
    idfparts = b.list(destination)
    for key in idfparts:
        file = open(os.path.join(idfparts, key))
        for line in file:
            term_idf = JSONValueProtocol().read(line)[1]  # parse the line as a JSON object
            self.idfs[term_idf['term']] = term_idf['idf']

def reducer(self, term_poster, howmany):
    tfidf = sum(howmany) * self.idfs[term_poster['term']]
    yield None, {'term_poster': term_poster, 'tfidf': tfidf}
AWS access info is defined as follows:
awskey = '*********'
awssecret = '***********'
conn = S3Connection(awskey, awssecret)
bucketname = 'mybucket'
destination = '/path/to/previous/output'
There are two ways of doing this :
Download the file into your local system and parse it. ( Kinda simple, quick and easy )
Get data stored on S3 into memory and parse it ( a bit more complex in case of huge files ).
Step 1:
On S3, filenames are stored as keys; if you have a file named "Demo" stored in a folder named "DemoFolder", then the key for that particular file would be "DemoFolder/Demo".
Use the below code to download the file into a temp folder.
from boto.s3 import connect_to_region
from boto.s3.connection import Location

AWS_KEY = 'xxxxxxxxxxxxxxxxxx'
AWS_SECRET_KEY = 'xxxxxxxxxxxxxxxxxxxxxxxxxx'
BUCKET_NAME = 'DemoBucket'
fileName = 'Demo'
tempPath = '/tmp/' + fileName  # assumed local path to download to

conn = connect_to_region(Location.USWest2,
                         aws_access_key_id=AWS_KEY,
                         aws_secret_access_key=AWS_SECRET_KEY,
                         is_secure=False,
                         host='s3-us-west-2.amazonaws.com')

source_bucket = conn.lookup(BUCKET_NAME)

''' Download the file '''
for name in source_bucket.list():
    if fileName in name.name:
        print("DOWNLOADING", fileName)
        name.get_contents_to_filename(tempPath)
You can then work on the file in that temp path.
Step 2:
You can also fetch data as string using data = name.get_contents_as_string(). In case of huge files (> 1gb) you may come across memory errors, to avoid such errors you will have to write a lazy function which reads the data in chunks.
For example you can use range to fetch a part of file using data = name.get_contents_as_string(headers={'Range': 'bytes=%s-%s' % (0,100000000)}).
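A minimal sketch of such a lazy reader, building on the Range-header trick above (the chunk size and the process() call are just illustrative placeholders; key is any boto Key object, e.g. one of the entries from source_bucket.list()):

def iter_key_chunks(key, chunk_size=64 * 1024 * 1024):
    # Yield the object's contents in chunk_size pieces using HTTP Range requests.
    start = 0
    while start < key.size:
        end = min(start + chunk_size, key.size) - 1
        yield key.get_contents_as_string(headers={'Range': 'bytes=%s-%s' % (start, end)})
        start = end + 1

for key in source_bucket.list():
    for chunk in iter_key_chunks(key):
        process(chunk)  # replace with your own handling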
I am not sure if I answered your question properly; I can write custom code for your requirement once I get some time. Meanwhile, please feel free to post any query you have.

How can I get the list of only folders in amazon S3 using python boto?

I am using boto and python and amazon s3.
If I use
[key.name for key in list(self.bucket.list())]
then I get all the keys of all the files.
mybucket/files/pdf/abc.pdf
mybucket/files/pdf/abc2.pdf
mybucket/files/pdf/abc3.pdf
mybucket/files/pdf/abc4.pdf
mybucket/files/pdf/new/
mybucket/files/pdf/new/abc.pdf
mybucket/files/pdf/2011/
what is the best way to
1. either get all folders from s3
2. or from that list just strip the filename from the end and get the unique folder keys
I am thinking of doing it like this
set([re.sub("/[^/]*$", "/", path) for path in mylist])
building on sethwm's answer:
To get the top level directories:
list(bucket.list("", "/"))
To get the subdirectories of files/:
list(bucket.list("files/", "/"))
and so on.
This is going to be an incomplete answer since I don't know python or boto, but I want to comment on the underlying concept in the question.
One of the other posters was right: there is no concept of a directory in S3. There are only flat key/value pairs. Many applications pretend certain delimiters indicate directory entries. For example "/" or "\". Some apps go as far as putting a dummy file in place so that if the "directory" empties out, you can still see it in list results.
You don't always have to pull your entire bucket down and do the filtering locally. S3 has the concept of a delimited list, where you specify what you would deem your path delimiter ("/", "\", "|", "foobar", etc.) and S3 will return virtual results to you, similar to what you want.
http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html (look at the delimiter parameter)
This API will get you one level of directories. So if you had in your example:
mybucket/files/pdf/abc.pdf
mybucket/files/pdf/abc2.pdf
mybucket/files/pdf/abc3.pdf
mybucket/files/pdf/abc4.pdf
mybucket/files/pdf/new/
mybucket/files/pdf/new/abc.pdf
mybucket/files/pdf/2011/
And you passed in a LIST with prefix "" and delimiter "/", you'd get results:
mybucket/files/
If you passed in a LIST with prefix "mybucket/files/" and delimiter "/", you'd get results:
mybucket/files/pdf/
And if you passed in a LIST with prefix "mybucket/files/pdf/" and delimiter "/", you'd get results:
mybucket/files/pdf/abc.pdf
mybucket/files/pdf/abc2.pdf
mybucket/files/pdf/abc3.pdf
mybucket/files/pdf/abc4.pdf
mybucket/files/pdf/new/
mybucket/files/pdf/2011/
You'd be on your own at that point if you wanted to eliminate the pdf files themselves from the result set.
Now how you do this in python/boto I have no idea. Hopefully there's a way to pass through.
As pointed out in one of the comments, the approach suggested by j1m returns a Prefix object. If you are after a name/path, you can use its name attribute. For example:
import boto
import boto.s3

conn = boto.s3.connect_to_region('us-west-2')
bucket = conn.get_bucket(your_bucket)
folders = bucket.list("", "/")
for folder in folders:
    print folder.name
I found the following to work using boto3:
import boto3

def list_folders(s3_client, bucket_name):
    response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix='', Delimiter='/')
    for content in response.get('CommonPrefixes', []):
        yield content.get('Prefix')

s3_client = boto3.client('s3')
folder_list = list_folders(s3_client, bucket_name)
for folder in folder_list:
    print('Folder found: %s' % folder)
Refs.:
https://stackoverflow.com/a/51550944/1259478
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.list_objects_v2
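list_objects_v2 returns at most 1000 entries per response, so if the bucket has more top-level prefixes than that, a paginator variant along the same lines (a sketch) keeps following the continuation tokens:

import boto3

def list_folders_paginated(s3_client, bucket_name):
    paginator = s3_client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket_name, Prefix='', Delimiter='/'):
        for content in page.get('CommonPrefixes', []):
            yield content.get('Prefix')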
Basically there is no such thing as a folder in S3. Internally everything is stored as a key, and if the key name has a slash character in it, the clients may decide to show it as a folder.
With that in mind, you should first get all keys and then use a regex to filter out the paths that include a slash in it. The solution you have right now is already a good start.
I see you have successfully made the boto connection. If you only have one directory that you are interested in (like you provided in the example), I think what you can do is use prefix and delimiter that's already provided via AWS (Link).
Boto uses this feature in its bucket object, and you can retrieve a hierarchical directory information using prefix and delimiter. The bucket.list() will return a boto.s3.bucketlistresultset.BucketListResultSet object.
I tried this a couple of ways, and if you do choose to use a delimiter= argument in bucket.list(), the returned object is an iterator for boto.s3.prefix.Prefix, rather than boto.s3.key.Key. In other words, if you try to retrieve the subdirectories you should put delimiter='/' and, as a result, you will get an iterator for the Prefix objects.
Both returned objects (either prefix or key object) have a .name attribute, so if you want the directory/file information as a string, you can do so by printing like below:
from boto.s3.connection import S3Connection

key_id = '...'
secret_key = '...'

# Create connection
conn = S3Connection(key_id, secret_key)

# Get list of all buckets
allbuckets = conn.get_all_buckets()
for bucket_name in allbuckets:
    print(bucket_name)

# Connect to a specific bucket
bucket = conn.get_bucket('bucket_name')

# Get subdirectory info
for key in bucket.list(prefix='sub_directory/', delimiter='/'):
    print(key.name)
The issue here, as has been said by others, is that a folder doesn't necessarily have a key, so you have to search through the strings for the / character and figure out your folders through that. Here's one way to generate a recursive dictionary imitating a folder structure.
If you want all the files and their url's in the folders
assets = {}
for key in self.bucket.list(str(self.org) + '/'):
    path = key.name.split('/')
    identifier = assets
    for uri in path[1:-1]:
        try:
            identifier[uri]
        except KeyError:
            identifier[uri] = {}
        identifier = identifier[uri]
    if not key.name.endswith('/'):
        identifier[path[-1]] = key.generate_url(expires_in=0, query_auth=False)
return assets
If you just want the empty folders
folders = {}
for key in self.bucket.list(str(self.org) + '/'):
    path = key.name.split('/')
    identifier = folders
    for uri in path[1:-1]:
        try:
            identifier[uri]
        except KeyError:
            identifier[uri] = {}
        identifier = identifier[uri]
    if key.name.endswith('/'):
        identifier[path[-1]] = {}
return folders
This can then be recursively read out later.
The boto interface allows you to list the contents of a bucket and give a prefix for the entries.
That way you can get the entries for what would be a directory in a normal filesystem:
import boto

AWS_ACCESS_KEY_ID = '...'
AWS_SECRET_ACCESS_KEY = '...'

conn = boto.connect_s3(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
bucket = conn.get_bucket('your-bucket-name')  # placeholder; get_bucket requires a bucket name
bucket_entries = bucket.list(prefix='/path/to/your/directory')
for entry in bucket_entries:
    print entry
Complete example with boto3 using the S3 client
import boto3

def list_bucket_keys(bucket_name):
    s3_client = boto3.client("s3")
    """ :type : pyboto3.s3 """
    result = s3_client.list_objects(Bucket=bucket_name, Prefix="Trails/", Delimiter="/")
    return result['CommonPrefixes']

if __name__ == '__main__':
    print(list_bucket_keys("my-s3-bucket-name"))
