I need to back up various file types to Google Drive (not just those convertible to Google Docs formats) from a Linux server.
What would be the simplest, most elegant way to do that with a python script? Would any of the solutions pertaining to GDocs be applicable?
You can use the Documents List API to write a script that writes to Drive:
https://developers.google.com/google-apps/documents-list/
Both the Documents List API and the Drive API interact with the same resources (i.e. same documents and files).
This sample in the Python client library shows how to upload an unconverted file to Drive:
http://code.google.com/p/gdata-python-client/source/browse/samples/docs/docs_v3_example.py#180
The current documentation for saving a file to google drive using python can be found here:
https://developers.google.com/drive/v3/web/manage-uploads
However, the way the Google Drive API handles file storage and retrieval does not follow the same architecture as a POSIX file system. As a result, if you wish to preserve the hierarchy of nested files on your Linux file system, you will need to write a fair amount of custom code so that the parent directories are recreated on Google Drive.
On top of that, Google makes it difficult to gain write access to a normal Drive account. Your permission scope must include https://www.googleapis.com/auth/drive, and to obtain a token for a user's normal account, that user must first join a Google group that grants access to non-reviewed apps. Any OAuth token that is issued also has a limited shelf life.
However, once you obtain an access token, the following script should allow you to save any file on your local machine to the same (relative) path on Google Drive.
def migrate(file_path, access_token, drive_space='drive'):

    '''
    a method to save a posix file architecture to google drive

    NOTE: to write to a google drive account using a non-approved app,
          the oauth2 grantee account must also join this google group
          https://groups.google.com/forum/#!forum/risky-access-by-unreviewed-apps

    :param file_path: string with path to local file
    :param access_token: string with oauth2 access token grant to write to google drive
    :param drive_space: string with name of space to write to (drive, appDataFolder, photos)
    :return: string with id of file on google drive
    '''

    # construct drive client
    import httplib2
    from googleapiclient import discovery
    from oauth2client.client import AccessTokenCredentials
    google_credentials = AccessTokenCredentials(access_token, 'my-user-agent/1.0')
    google_http = httplib2.Http()
    google_http = google_credentials.authorize(google_http)
    google_drive = discovery.build('drive', 'v3', http=google_http)
    drive_client = google_drive.files()

    # prepare file body
    from googleapiclient.http import MediaFileUpload
    media_body = MediaFileUpload(filename=file_path, resumable=True)

    # determine file modified time
    import os
    from datetime import datetime
    modified_epoch = os.path.getmtime(file_path)
    modified_time = datetime.utcfromtimestamp(modified_epoch).isoformat()

    # determine path segments
    path_segments = file_path.split(os.sep)

    # construct upload kwargs
    create_kwargs = {
        'body': {
            'name': path_segments.pop(),
            'modifiedTime': modified_time
        },
        'media_body': media_body,
        'fields': 'id'
    }

    # walk through parent directories
    parent_id = ''
    if path_segments:

        # construct query and creation arguments
        walk_folders = True
        folder_kwargs = {
            'body': {
                'name': '',
                'mimeType': 'application/vnd.google-apps.folder'
            },
            'fields': 'id'
        }
        query_kwargs = {
            'spaces': drive_space,
            'fields': 'files(id, parents)'
        }
        while path_segments:
            folder_name = path_segments.pop(0)
            folder_kwargs['body']['name'] = folder_name

            # search for folder id in existing hierarchy
            if walk_folders:
                walk_query = "name = '%s'" % folder_name
                if parent_id:
                    walk_query += " and '%s' in parents" % parent_id
                query_kwargs['q'] = walk_query
                response = drive_client.list(**query_kwargs).execute()
                file_list = response.get('files', [])
            else:
                file_list = []
            if file_list:
                parent_id = file_list[0].get('id')

            # or create folder
            # https://developers.google.com/drive/v3/web/folder
            else:
                if not parent_id:
                    if drive_space == 'appDataFolder':
                        folder_kwargs['body']['parents'] = [drive_space]
                    else:
                        folder_kwargs['body'].pop('parents', None)
                else:
                    folder_kwargs['body']['parents'] = [parent_id]
                response = drive_client.create(**folder_kwargs).execute()
                parent_id = response.get('id')
                walk_folders = False

    # add parent id to file creation kwargs
    if parent_id:
        create_kwargs['body']['parents'] = [parent_id]
    elif drive_space == 'appDataFolder':
        create_kwargs['body']['parents'] = [drive_space]

    # send create request
    file = drive_client.create(**create_kwargs).execute()
    file_id = file.get('id')

    return file_id
PS. I have adapted this script from the labpack Python module. There is a class called driveClient in that module, written by rcj1492, which handles saving, loading, searching and deleting files on Google Drive in a way that preserves the POSIX file system hierarchy.
from labpack.storage.google.drive import driveClient
I found that PyDrive handles the Drive API elegantly, and it also has great documentation (especially walking the user through the authentication part).
EDIT: Combine that with the material on Automating pydrive verification process and Pydrive google drive automate authentication, and that makes for some great documentation to get things going. Hope it helps those who are confused about where to start.
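For anyone who wants a quick start, here is a minimal PyDrive sketch (it assumes a client_secrets.json has already been set up for the OAuth flow; the file name is just an example):

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive

# authenticate (opens a browser window on first run; see the links above
# for automating this step with a saved credentials file)
gauth = GoogleAuth()
gauth.LocalWebserverAuth()
drive = GoogleDrive(gauth)

# upload an arbitrary (non-converted) file
backup = drive.CreateFile({'title': 'backup.tar.gz'})
backup.SetContentFile('/path/to/backup.tar.gz')
backup.Upload()
print(backup['id'])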
I feel like this should be possible, but I looked through the wandb SDK code and I can't find an easy or logical way to do it. It might be possible to hack it by modifying the manifest entries at some later point (but presumably before the artifact is logged to wandb, since the manifest and its entries may be locked after that). I saw things like this in the SDK code:
version = manifest_entry.extra.get("versionID")
etag = manifest_entry.extra.get("etag")
So, I figure we can probably edit those?
UPDATE
So, I tried to hack it together with something like this and it works but it feels wrong:
import os
import wandb
import boto3
from wandb.util import md5_file
ENTITY = os.environ.get("WANDB_ENTITY")
PROJECT = os.environ.get("WANDB_PROJECT")
API_KEY = os.environ.get("WANDB_API_KEY")
api = wandb.Api(overrides={"entity": ENTITY, "project": PROJECT})
run = wandb.init(entity=ENTITY, project=PROJECT, job_type="test upload")
file = "admin2Codes.txt" # "admin1CodesASCII.txt" # (both already on s3 with a couple versions)
artifact = wandb.Artifact("test_data", type="dataset")
# modify one of the local files so it has a new md5hash etc.
with open(file, "a") as f:
    f.write("new_line_1\n")
# upload local file to s3
local_file_path = file
s3_url = f"s3://bucket/prefix/{file}"
s3_url_arr = s3_url.replace("s3://", "").split("/")
s3_bucket = s3_url_arr[0]
key = "/".join(s3_url_arr[1:])
s3_client = boto3.client("s3")
file_digest = md5_file(local_file_path)
s3_client.upload_file(
    local_file_path,
    s3_bucket,
    key,
    # save the md5_digest in metadata,
    # can be used later to only upload new files to s3,
    # as AWS doesn't digest the file consistently in the E-tag
    ExtraArgs={"Metadata": {"md5_digest": file_digest}},
)
head_response = s3_client.head_object(Bucket=s3_bucket, Key=key)
version_id: str = head_response["VersionId"]
print(version_id)
# upload a link/ref to this s3 object in wandb:
artifact.add_reference(s3_url)
# at this point we might be able to modify the artifact._manifest.entries and each entry.extra.get("etag") etc.?
print([(name, entry.extra) for name, entry in artifact._manifest.entries.items()])
# set these to an older version on s3 that we know we want (rather than latest) - do this via wandb public API:
dataset_v2 = api.artifact(f"{ENTITY}/{PROJECT}/test_data:v2", type="dataset")
# artifact._manifest.add_entry(dataset_v2.manifest.entries["admin1CodesASCII.txt"])
artifact._manifest.entries["admin1CodesASCII.txt"] = dataset_v2.manifest.entries[
    "admin1CodesASCII.txt"
]
# verify that it did change:
print([(name, entry.extra) for name, entry in artifact._manifest.entries.items()])
run.log_artifact(artifact) # at this point the manifest is locked I believe?
artifact.wait() # wait for upload to finish (blocking - but should be very quick given it is just an s3 link)
print(artifact.name)
run_id = run.id
run.finish()
curr_run = api.run(f"{ENTITY}/{PROJECT}/{run_id}")
used_artifacts = curr_run.used_artifacts()
logged_artifacts = curr_run.logged_artifacts()
Am I on the right track here? I guess the other workaround is to make a copy on S3 (so that the older version becomes the latest again), but I wanted to avoid this: the one file I want to pin to an old version is a large NLP model, while the only files I actually want to change are small config.json files etc., so re-uploading everything seems very wasteful.
I was also wondering whether copying an old version of an object back onto the same key in the bucket creates a real copy or just a pointer to the same underlying object. Neither the boto3 docs nor the AWS documentation makes that clear, although it seems like it is a proper copy.
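For reference, the copy-based workaround I mention would look roughly like this with boto3 (bucket, key and version id below are placeholders):

import boto3

s3_client = boto3.client("s3")

# placeholder values
bucket = "bucket"
key = "prefix/model.bin"
old_version_id = "OLD_VERSION_ID"

# copying a specific old version back onto the same key makes that content
# the latest version again (at the cost of another stored copy/version)
s3_client.copy_object(
    Bucket=bucket,
    Key=key,
    CopySource={"Bucket": bucket, "Key": key, "VersionId": old_version_id},
)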
I think I found the correct way to do it now:
import os
import re
import wandb
import boto3
from wandb.util import md5_file

ENTITY = os.environ.get("WANDB_ENTITY")
PROJECT = os.environ.get("WANDB_PROJECT")


def wandb_update_only_some_files_in_artifact(
    existing_artifact_name: str,
    new_s3_file_urls: list[str],
    entity: str = ENTITY,
    project: str = PROJECT,
) -> wandb.Artifact:
    """If you want to just update a config.json file for example,
    but the rest of the artifact can remain the same, then you can
    use this function like so:

    wandb_update_only_some_files_in_artifact(
        "old_artifact:v3",
        ["s3://bucket/prefix/config.json"],
    )

    and then all the other files like model.bin will be the same as in v3,
    even if there was a v4 or v5 in between (as the v3 VersionIds are used).

    Args:
        existing_artifact_name (str): name with version like "old_artifact:v3"
        new_s3_file_urls (list[str]): files that should be updated
        entity (str, optional): wandb entity. Defaults to ENTITY.
        project (str, optional): wandb project. Defaults to PROJECT.

    Returns:
        Artifact: the new artifact object
    """
    api = wandb.Api(overrides={"entity": entity, "project": project})
    old_artifact = api.artifact(existing_artifact_name)
    old_artifact_name = re.sub(r":v\d+$", "", old_artifact.name)
    with wandb.init(entity=entity, project=project) as run:
        new_artifact = wandb.Artifact(old_artifact_name, type=old_artifact.type)
        s3_file_names = [s3_url.split("/")[-1] for s3_url in new_s3_file_urls]
        # add the new ones:
        for s3_url, filename in zip(new_s3_file_urls, s3_file_names):
            new_artifact.add_reference(s3_url, filename)
        # add the old ones:
        for filename, entry in old_artifact.manifest.entries.items():
            if filename in s3_file_names:
                continue
            new_artifact.add_reference(entry, filename)
            # this also works but feels hackier:
            # new_artifact._manifest.entries[filename] = entry
        run.log_artifact(new_artifact)
        new_artifact.wait()  # wait for upload to finish (blocking - but should be very quick given it is just an s3 link)
        print(new_artifact.name)
        print(run.id)
    return new_artifact
# usage:
local_file_path = "config.json" # modified file
s3_url = "s3://bucket/prefix/config.json"
s3_url_arr = s3_url.replace("s3://", "").split("/")
s3_bucket = s3_url_arr[0]
key = "/".join(s3_url_arr[1:])
s3_client = boto3.client("s3")
file_digest = md5_file(local_file_path)
s3_client.upload_file(
    local_file_path,
    s3_bucket,
    key,
    # save the md5_digest in metadata,
    # can be used later to only upload new files to s3,
    # as AWS doesn't digest the file consistently in the E-tag
    ExtraArgs={"Metadata": {"md5_digest": file_digest}},
)

wandb_update_only_some_files_in_artifact(
    "old_artifact:v3",
    ["s3://bucket/prefix/config.json"],
)
I am trying to create a public folder in Google Drive and get back a link for sharing it. This is what I have:
def createfolder(foldername, service):
    new_role = 'reader'
    types = 'anyone'
    # create folder
    file_metadata = {
        'name': '{}'.format(foldername),
        'mimeType': 'application/vnd.google-apps.folder',
        'role': 'reader',
        'type': 'anyone',
    }
    file = service.files().create(body=file_metadata,
                                  fields='id,webViewLink').execute()
    print('Folder ID: %s' % file.get('webViewLink'))
    return file.get('id')
I got this far: it creates the folder and prints the link.
I tried adding the role and type fields to the body and setting them to reader/anyone, but this is not working; the role and type fields seem to be ignored.
Is there a way to do this on create, or do I have to change the permissions after I create it?
You have to call Permissions.create:
File permissions are handled via Permissions, not through Files methods.
If you check the Files resource representation, you'll notice that some fields, like name or mimeType, have the word writable under Notes. This means you can modify these fields directly using this resource (Files) methods.
If you check the property permissions, though, you'll notice there's no writable there. This means permissions cannot be updated directly, using Files methods. You have to use Permissions methods instead.
More specifically, you have to call Permissions.create after creating the folder and retrieving its ID.
Code snippet:
def shareWithEveryone(folderId, service):
    payload = {
        "role": "reader",
        "type": "anyone"
    }
    service.permissions().create(fileId=folderId, body=payload).execute()
Reference:
permissions().create(fileId=*, body=None)
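Putting it together, a sketch of the whole flow might look like this (the folder name is just an example):

def create_public_folder(folder_name, service):
    # create the folder first
    folder = service.files().create(
        body={
            'name': folder_name,
            'mimeType': 'application/vnd.google-apps.folder',
        },
        fields='id',
    ).execute()
    folder_id = folder.get('id')

    # then open it up to anyone with the link
    shareWithEveryone(folder_id, service)

    # finally fetch the sharing link
    link = service.files().get(
        fileId=folder_id, fields='webViewLink'
    ).execute().get('webViewLink')
    return folder_id, link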
Creating a folder and changing the permissions are two different calls.
Create your directory first:
file = service.files().create(body=file_metadata).execute()
Then do a permissions.update to set the permissions on the file to be public:
permissions = service.permissions().update(body=permissions).execute()
I am not a Python dev, so the code is a guess.
I have a Cloud Function calling SCC's list_assets and converting the paginated output to a list (to fetch all the results). However, since I have quite a lot of assets in the organization tree, the fetch takes a long time and the Cloud Function times out (540 seconds is the max timeout).
asset_iterator = security_client.list_assets(org_name)
asset_fetch_all=list(asset_iterator)
I tried to export via WebUI and it works fine (took about 5 minutes). Is there a way to export the assets from SCC directly to a Cloud Storage bucket using the API?
I developed the same thing in Python for exporting to BigQuery. Searching in BigQuery is easier than searching in a file. The code is very similar for Cloud Storage. Here is my working code with BigQuery:
import os
from google.cloud import asset_v1
from google.cloud.asset_v1.proto import asset_service_pb2
from google.cloud.asset_v1 import enums
def GCF_ASSET_TO_BQ(request):
    client = asset_v1.AssetServiceClient()
    parent = 'organizations/{}'.format(os.getenv('ORGANIZATION_ID'))
    output_config = asset_service_pb2.OutputConfig()
    output_config.bigquery_destination.dataset = 'projects/{}/datasets/{}'.format(os.getenv('PROJECT_ID'), os.getenv('DATASET'))
    content_type = enums.ContentType.RESOURCE
    output_config.bigquery_destination.table = 'asset_export'
    output_config.bigquery_destination.force = True
    response = client.export_assets(parent, output_config, content_type=content_type)
    # For waiting the finish
    # response.result()
    # Do stuff after export
    return "done", 200

if __name__ == "__main__":
    GCF_ASSET_TO_BQ('')
As you can see, some values come from environment variables (ORGANIZATION_ID, PROJECT_ID and DATASET). For exporting to Cloud Storage, you have to change the definition of the output_config like this:
output_config = asset_service_pb2.OutputConfig()
output_config.gcs_destination.uri = 'gs://path/to/file'
There are examples in other languages here.
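For example, here is a sketch of the same function adapted for a Cloud Storage destination (the GCS_URI environment variable is my own assumption):

import os
from google.cloud import asset_v1
from google.cloud.asset_v1.proto import asset_service_pb2
from google.cloud.asset_v1 import enums

def GCF_ASSET_TO_GCS(request):
    client = asset_v1.AssetServiceClient()
    parent = 'organizations/{}'.format(os.getenv('ORGANIZATION_ID'))
    output_config = asset_service_pb2.OutputConfig()
    # e.g. GCS_URI = gs://my-bucket/exports/assets.json
    output_config.gcs_destination.uri = os.getenv('GCS_URI')
    content_type = enums.ContentType.RESOURCE
    response = client.export_assets(parent, output_config, content_type=content_type)
    # response.result()  # uncomment to block until the export finishes
    return "done", 200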
Try something like this:
We use it to upload findings into a bucket. Make sure the service account the function runs as has the right permissions on the bucket.
def test_list_medium_findings(source_name):
    # [START list_findings_at_a_time]
    from google.cloud import securitycenter
    from google.cloud import storage

    # Create a new client.
    client = securitycenter.SecurityCenterClient()

    # Set query parameters
    organization_id = "11112222333344444"
    org_name = "organizations/{org_id}".format(org_id=organization_id)
    all_sources = "{org_name}/sources/-".format(org_name=org_name)

    # Query Security Command Center
    finding_result_iterator = client.list_findings(all_sources, filter_=YourFilter)

    # Set output file settings
    bucket = "YourBucketName"
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket)
    output_file_name = "YourFileName"
    my_file = bucket.blob(output_file_name)

    with open('/tmp/data.txt', 'w') as file:
        for i, finding_result in enumerate(finding_result_iterator):
            file.write(
                "{}: name: {} resource: {}".format(
                    i, finding_result.finding.name, finding_result.finding.resource_name
                )
            )

    # Upload to bucket
    my_file.upload_from_filename("/tmp/data.txt")
Is there a python equivalent to the getPublicUrl PHP method?
$public_url = CloudStorageTools::getPublicUrl("gs://my_bucket/some_file.txt", true);
I am storing some files using the Google Cloud Client Library for Python, and I'm trying to figure out a way of programmatically getting the public URL of the files I am storing.
Please refer to https://cloud.google.com/storage/docs/reference-uris on how to build URLs.
For public URLs, there are two formats:
http(s)://storage.googleapis.com/[bucket]/[object]
or
http(s)://[bucket].storage.googleapis.com/[object]
Example:
bucket = 'my_bucket'
file = 'some_file.txt'
gcs_url = 'https://%(bucket)s.storage.googleapis.com/%(file)s' % {'bucket':bucket, 'file':file}
print(gcs_url)
Will output this:
https://my_bucket.storage.googleapis.com/some_file.txt
You need to use get_serving_url from the Images API. As that page explains, you need to call create_gs_key() first to get the key to pass to the Images API.
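A rough sketch of that approach (first-generation App Engine runtime; note that get_serving_url only works for image files):

from google.appengine.api import images
from google.appengine.ext import blobstore

def get_public_image_url(gcs_filename):
    # gcs_filename is e.g. '/my_bucket/some_image.png'
    blob_key = blobstore.create_gs_key('/gs' + gcs_filename)
    return images.get_serving_url(blob_key)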
Daniel, Isaac - Thank you both.
It looks to me like Google deliberately steers you away from serving directly from GCS (bandwidth reasons? I don't know). So the two alternatives, according to the docs, are using either Blobstore or the Images service (for images).
What I ended up doing is serving the files through Blobstore, backed by GCS.
To get the blobstore key from a GCS path, I used:
blobKey = blobstore.create_gs_key('/gs' + gcs_filename)
Then, I exposed this URL on the server -
Main.py:
app = webapp2.WSGIApplication([
    ...
    ('/blobstore/serve', scripts.FileServer.GCSServingHandler),
    ...
FileServer.py:
class GCSServingHandler(blobstore_handlers.BlobstoreDownloadHandler):
    def get(self):
        blob_key = self.request.get('id')
        if (len(blob_key) > 0):
            self.send_blob(blob_key)
        else:
            self.response.write('no id given')
It's not available, but I've filed a bug. In the meantime, try this:
import urlparse
def GetGsPublicUrl(gsUrl, secure=True):
    u = urlparse.urlsplit(gsUrl)
    if u.scheme == 'gs':
        return urlparse.urlunsplit((
            'https' if secure else 'http',
            '%s.storage.googleapis.com' % u.netloc,
            u.path, '', ''))
For example:
>>> GetGsPublicUrl('gs://foo/bar.tgz')
'https://foo.storage.googleapis.com/bar.tgz'
We have a job that checks if a file on the cloud storage has been modified. If so, then it reads the data from the file and processes it further.
I want to know if there is an API to check when a file on the cloud storage was last modified.
You can now do this using the official Python lib for Google Storage.
from google.cloud import storage
def blob_metadata(bucket_name, blob_name):
    """Prints out a blob's metadata."""
    # bucket_name = 'your-bucket-name'
    # blob_name = 'your-object-name'

    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.get_blob(blob_name)

    print("Blob: {}".format(blob.name))
    print("Bucket: {}".format(blob.bucket.name))
    print("Storage class: {}".format(blob.storage_class))
    print("ID: {}".format(blob.id))
    print("Size: {} bytes".format(blob.size))
    print("Updated: {}".format(blob.updated))
    print("Generation: {}".format(blob.generation))
    print("Metageneration: {}".format(blob.metageneration))
    print("Etag: {}".format(blob.etag))
    print("Owner: {}".format(blob.owner))
    print("Component count: {}".format(blob.component_count))
    print("Crc32c: {}".format(blob.crc32c))
    print("md5_hash: {}".format(blob.md5_hash))
    print("Cache-control: {}".format(blob.cache_control))
    print("Content-type: {}".format(blob.content_type))
    print("Content-disposition: {}".format(blob.content_disposition))
    print("Content-encoding: {}".format(blob.content_encoding))
    print("Content-language: {}".format(blob.content_language))
    print("Metadata: {}".format(blob.metadata))
    print("Temporary hold: ", "enabled" if blob.temporary_hold else "disabled")
    print(
        "Event based hold: ",
        "enabled" if blob.event_based_hold else "disabled",
    )
    if blob.retention_expiration_time:
        print(
            "retentionExpirationTime: {}".format(
                blob.retention_expiration_time
            )
        )
In your case you will have to look at the blob.updated property.
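For the original job (checking whether a file changed since the last run), a minimal sketch along these lines should work (bucket and object names are placeholders, and last_seen is whatever timestamp you stored from the previous check):

from google.cloud import storage

def has_changed_since(bucket_name, blob_name, last_seen):
    """Return (changed, updated), where last_seen is a tz-aware datetime."""
    client = storage.Client()
    blob = client.bucket(bucket_name).get_blob(blob_name)
    # blob.updated is a timezone-aware datetime in UTC
    return blob.updated > last_seen, blob.updated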
You can do this with boto:
>>> import boto
>>> conn = boto.connect_gs()
>>> bucket = conn.get_bucket('yourbucket')
>>> k = bucket.get_key('yourkey')
>>> k.last_modified
'Tue, 04 Dec 2012 17:44:57 GMT'
There is also an App Engine Python interface to cloud storage, but I don't think it exposes the metadata you want.
The App engine Cloud Storage client library will expose this information to you. This library also has dev appserver support. Getting started has an example.
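If I remember that library correctly, a sketch would look something like this (the GCSFileStat it returns only exposes a creation timestamp, st_ctime, which for immutable GCS objects effectively doubles as the last-modified time):

import cloudstorage

def get_gcs_mtime(bucket_name, object_name):
    # cloudstorage paths take the form '/bucket/object'
    stat = cloudstorage.stat('/{}/{}'.format(bucket_name, object_name))
    return stat.st_ctime  # posix timestamp of the current object generation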
Cloud Storage has a JSON API, which you can use to get the creation time of an object; see https://developers.google.com/storage/docs/json_api/v1/objects
I am using the solution mentioned by @orby above, using blob.updated to get the latest file. But there are more than 450 files in the bucket, and this script takes around 6-7 minutes to go through all the files and report the latest one. I suppose the per-blob blob.updated lookup takes some time. Is there any faster way to do this?
files = bucket.list_blobs()
fileList = [file.name for file in files if '.dat' in file.name]
latestFile = fileList[0]
latestTimeStamp = bucket.get_blob(fileList[0]).updated
for i in range(len(fileList)):
    timeStamp = bucket.get_blob(fileList[i]).updated
    if timeStamp > latestTimeStamp:
        latestFile = fileList[i]
        latestTimeStamp = timeStamp
print(latestFile)
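The per-file get_blob() calls are probably what makes this slow: the Blob objects yielded by list_blobs() should already carry the updated timestamp from the list response, so a single pass over the listing ought to be much faster. A sketch, keeping the same .dat filter:

# reuse the blobs from the listing instead of re-fetching each one
blobs = [b for b in bucket.list_blobs() if '.dat' in b.name]
latest_blob = max(blobs, key=lambda b: b.updated)
print(latest_blob.name)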