How can I push AWS CodeCommit to S3 using Lambda? - python

Python is my preferred language but any supported by Lambda will do.
-- All AWS Architecture --
I have Prod, Beta, and Gamma branches and corresponding folders in S3. I am looking for a way to have Lambda respond to a CodeCommit trigger and, based on the branch that triggered it, clone the repo and place the files in the appropriate S3 folder.
S3://Example-Folder/Application/Prod
S3://Example-Folder/Application/Beta
S3://Example-Folder/Application/Gamma
I tried to use GitPython, but it does not work because Git is not installed on the base Lambda image and GitPython depends on it.
I also looked through the Boto3 docs, but the CodeCommit client seems to expose only repository-management operations; it does not appear to be able to return the project files.
Thank you for the help!

The latest version of the boto3 CodeCommit client includes the methods get_differences and get_blob.
You can get all the content of a CodeCommit repository using these two methods (at least, if you are not interested in retaining the .git history).
The script below takes all the content of the master branch and adds it to a tar file; afterwards you can upload it to S3 however you like.
You can run this as a Lambda function, invoked whenever you push to CodeCommit.
It works with the current Lambda Python 3.6 environment and the following library versions:
botocore==1.5.89
boto3==1.4.4
import boto3
import io
import pathlib
import sys
import tarfile

codecommit = boto3.client("codecommit")


def get_differences(repository_name, branch="master"):
    """List every file in the branch by diffing the branch tip against the empty tree."""
    differences = []
    response = codecommit.get_differences(
        repositoryName=repository_name,
        afterCommitSpecifier=branch,
    )
    differences += response.get("differences", [])
    # Page through the results until no nextToken is returned.
    while "nextToken" in response:
        response = codecommit.get_differences(
            repositoryName=repository_name,
            afterCommitSpecifier=branch,
            nextToken=response["nextToken"],
        )
        differences += response.get("differences", [])
    return differences


if __name__ == "__main__":
    repository_name = sys.argv[1]
    repository_path = pathlib.Path(repository_name)

    buf = io.BytesIO()
    with tarfile.open(None, mode="w:gz", fileobj=buf) as tar:
        for difference in get_differences(repository_name):
            blobid = difference["afterBlob"]["blobId"]
            path = difference["afterBlob"]["path"]
            mode = difference["afterBlob"]["mode"]  # noqa
            blob = codecommit.get_blob(
                repositoryName=repository_name, blobId=blobid)
            tarinfo = tarfile.TarInfo(str(repository_path / path))
            tarinfo.size = len(blob["content"])
            tar.addfile(tarinfo, io.BytesIO(blob["content"]))

    tarobject = buf.getvalue()
    # save to s3
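To tie this back to the original question (branch-specific folders under S3://Example-Folder/Application/), a handler along the following lines could live in the same module as the script above and reuse its get_differences() function and codecommit client. This is only a sketch: the event parsing assumes the usual shape of a CodeCommit trigger event (Records[0].codecommit.references[0].ref and Records[0].eventSourceARN), and the bucket/key layout simply mirrors the folders named in the question, so verify both against your own setup.

import io
import tarfile

import boto3

s3 = boto3.client("s3")


def lambda_handler(event, context):
    # A CodeCommit trigger event normally carries the pushed ref and the repo ARN;
    # check a sample event from your own trigger before relying on these keys.
    record = event["Records"][0]
    ref = record["codecommit"]["references"][0]["ref"]   # e.g. "refs/heads/Prod"
    branch = ref.rsplit("/", 1)[-1]
    repository_name = record["eventSourceARN"].split(":")[5]

    # Build the tarball for the pushed branch, reusing get_differences() and the
    # codecommit client defined in the script above.
    buf = io.BytesIO()
    with tarfile.open(None, mode="w:gz", fileobj=buf) as tar:
        for difference in get_differences(repository_name, branch):
            blob = codecommit.get_blob(
                repositoryName=repository_name,
                blobId=difference["afterBlob"]["blobId"])
            tarinfo = tarfile.TarInfo(difference["afterBlob"]["path"])
            tarinfo.size = len(blob["content"])
            tar.addfile(tarinfo, io.BytesIO(blob["content"]))

    # Upload under the folder that matches the branch, e.g. Application/Prod/.
    key = "Application/{}/{}.tar.gz".format(branch, repository_name)
    s3.put_object(Bucket="Example-Folder", Key=key, Body=buf.getvalue())
    return {"uploaded": "s3://Example-Folder/" + key}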

Looks like LambCI does exactly what you want.

Unfortunately, CodeCommit does not currently have an API to upload the repository to an S3 bucket. However, if you are open to trying out CodePipeline, you can configure AWS CodePipeline to use a branch in an AWS CodeCommit repository as the source stage for your code. That way, whenever you push changes to the tracked branch, an archive of the repository at the tip of that branch is delivered to your CodePipeline bucket. For more information about CodePipeline, please refer to the following link:
http://docs.aws.amazon.com/codepipeline/latest/userguide/tutorials-simple-codecommit.html
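For a rough idea of what that looks like in code, here is a hedged sketch of creating such a pipeline with boto3. The pipeline name, role ARN, repository name, and the S3 deploy stage are illustrative assumptions (CodePipeline requires at least two stages), so treat this as a starting point rather than a drop-in definition.

import boto3

codepipeline = boto3.client("codepipeline")

codepipeline.create_pipeline(pipeline={
    "name": "codecommit-to-s3-prod",  # hypothetical pipeline name
    "roleArn": "arn:aws:iam::123456789012:role/CodePipelineServiceRole",  # hypothetical role
    "artifactStore": {"type": "S3", "location": "Example-Folder"},
    "stages": [
        {
            "name": "Source",
            "actions": [{
                "name": "CodeCommitSource",
                "actionTypeId": {"category": "Source", "owner": "AWS",
                                 "provider": "CodeCommit", "version": "1"},
                # Hypothetical repository name; BranchName selects the tracked branch.
                "configuration": {"RepositoryName": "Application",
                                  "BranchName": "Prod"},
                "outputArtifacts": [{"name": "SourceOutput"}],
            }],
        },
        {
            "name": "Deploy",
            "actions": [{
                "name": "CopyToS3",
                "actionTypeId": {"category": "Deploy", "owner": "AWS",
                                 "provider": "S3", "version": "1"},
                # Extract unpacks the source archive under the given key prefix.
                "configuration": {"BucketName": "Example-Folder",
                                  "Extract": "true",
                                  "ObjectKey": "Application/Prod"},
                "inputArtifacts": [{"name": "SourceOutput"}],
            }],
        },
    ],
})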

Related

Create file system/container if not found

I'm trying to export a CSV to Azure Data Lake Storage, but the code breaks when the file system/container does not exist. I have also read through the documentation, but I cannot seem to find anything helpful for this situation.
How do I go about creating a container in Azure Data Lake Storage if the container specified by the user does not exist?
Current Code:
try:
    file_system_client = service_client.get_file_system_client(file_system="testfilesystem")
except Exception:
    file_system_client = service_client.create_file_system(file_system="testfilesystem")
Traceback:
(FilesystemNotFound) The specified filesystem does not exist.
RequestId:XXXX
Time:2021-03-31T13:39:21.8860233Z
The try/except pattern should not be used here, since the Azure Data Lake Gen2 library has a built-in exists() method on file_system_client.
First, make sure you've installed the latest version of the library: azure-storage-file-datalake 12.3.0. If you're not sure which version you're using, run pip show azure-storage-file-datalake to check the current version.
Then you can use the code below:
from azure.storage.filedatalake import DataLakeServiceClient

service_client = DataLakeServiceClient(account_url="{}://{}.dfs.core.windows.net".format(
    "https", "xxx"), credential="xxx")

# the get_file_system_client method will not throw an error if the file system
# does not exist, as long as you're using the latest library (12.3.0)
file_system_client = service_client.get_file_system_client("filesystem333")
print("the file system exists: " + str(file_system_client.exists()))

# create the file system if it does not exist
if not file_system_client.exists():
    file_system_client.create_file_system()
    print("the file system is created.")

# other code
I've tested it locally and it works successfully.

Import Existing Python App into AWS Lambda

I need to create an AWS Lambda version of an existing Python 2.7 program written by someone else who has left the company.
As an example, here is one of the functions I need to convert:
#!/usr/bin/env python
from aws_common import get_profiles,get_regions
from aws_ips import get_all_public_ips
import sys

def main(cloud_type):
    # csv header
    output_header = "profile,region,public ip"
    profiles = get_profiles(cloud_type)
    regions = get_regions(cloud_type)
    print output_header
    for profile in profiles:
        for region in regions:
            # public_ips = get_public_ips(profile,region)
            public_ips = get_all_public_ips(profile,region)
            for aws_ip in public_ips:
                print "%s,%s,%s" % (profile,region,aws_ip)

if __name__ == "__main__":
    cloud_type = 'commercial'
    if sys.argv[1]:
        if sys.argv[1] == 'govcloud':
            cloud_type = 'govcloud'
    main(cloud_type)
I need to know how to create this as an AWS handler with event and context arguments from the code above.
If I could get some pointers on how to do this it would be appreciated.
You can simply start writing your Python function inside the AWS Lambda handler.
In the handler, define your functions and variables, and upload a zip file to Lambda if your code has any dependencies.
You can set the Python runtime in Lambda to 2.7 if that is what you are using.
I would also suggest the Serverless Framework for uploading your code to Lambda; it makes managing dependencies and code from your local machine much easier.
Since you are importing aws_common and aws_ips here, you have to check whether they are part of the AWS SDK or need to be packaged with your code.
You can import the aws-sdk and use it (shown here in Node.js):
var aws = require('aws-sdk');
exports.handler = function (event, context)
{
}
Inside the handler you can then write your loops and the rest of your logic.
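For the Python program above specifically, a minimal sketch of a handler might look like the following. It assumes aws_common and aws_ips are packaged into the deployment zip alongside the handler, and it reads cloud_type from the invocation event (that key name is an assumption, replacing the old sys.argv check):

from aws_common import get_profiles, get_regions
from aws_ips import get_all_public_ips


def lambda_handler(event, context):
    # "cloud_type" in the event replaces the old command-line argument;
    # this key name is an assumption, not part of the original program.
    cloud_type = event.get("cloud_type", "commercial")

    rows = ["profile,region,public ip"]  # csv header
    for profile in get_profiles(cloud_type):
        for region in get_regions(cloud_type):
            for aws_ip in get_all_public_ips(profile, region):
                rows.append("%s,%s,%s" % (profile, region, aws_ip))

    # Return the CSV instead of printing it, since Lambda has no stdout consumer.
    return "\n".join(rows)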

Get the path of saved_model.pb after training on ML engine

I have been using the Python client API of ML Engine to create training jobs for some canned estimators. What I'm not able to do is get the path of saved_model.pb on GCS, because the directory it is stored in is named with a timestamp. Is there any way I can get this path, using a regular expression or something in the Python client, so that I can deploy the model with the correct path?
The path seems to be in this format right now -
gs://bucket_name/outputs/export/serv/timestamp/saved_model.pb
UPDATE
Thanks shahin for the answer.
So I wrote this, which gives me the exact path that I can pass to the deploy_uri for ml engine.
from google.cloud import storage

def getGCSPath(prefix):
    # bucket_name is assumed to be defined elsewhere at module level
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    mlist = bucket.list_blobs(prefix=prefix)
    for line in mlist:
        if 'saved_model.pb' in line.name:
            # strip the trailing 'saved_model.pb' (14 characters) to keep the directory
            return line.name[:-14]

# print getGCSPath('output/export/serv/')
Use gsutil and tail:
MODEL_LOCATION=$(gsutil ls gs://${BUCKET}/outputs/export/serv | tail -1)
gcloud ml-engine models create ${MODEL_NAME} --regions $REGION
gcloud ml-engine versions create ${MODEL_VERSION} --model ${MODEL_NAME} --origin ${MODEL_LOCATION} --runtime-version $TFVERSION
import os
import cloudstorage as gcs
bucket = os.environ.get('BUCKET')
page_size = 1
stats = gcs.listbucket(bucket + '/outputs/export/serv', max_keys=page_size)
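If you prefer to stay in the Python client, here is a hedged sketch of the same "take the latest export" idea as the gsutil ls | tail -1 approach, using google.cloud.storage. Picking the lexicographically largest saved_model.pb path as the newest export is an assumption that relies on the timestamped directory names, and latest_export_dir is a hypothetical helper name:

from google.cloud import storage


def latest_export_dir(bucket_name, prefix='outputs/export/serv/'):
    """Return the gs:// directory of the most recent timestamped export."""
    client = storage.Client()
    blobs = client.get_bucket(bucket_name).list_blobs(prefix=prefix)
    # Export directories are named by timestamp, so the lexicographically
    # largest saved_model.pb path belongs to the newest export.
    model_paths = [b.name for b in blobs if b.name.endswith('saved_model.pb')]
    if not model_paths:
        return None
    newest = max(model_paths)
    return 'gs://{}/{}'.format(bucket_name, newest.rsplit('/', 1)[0])

# MODEL_LOCATION = latest_export_dir('my-bucket')  # hypothetical bucket name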

Use bare repo with git-python

When I try to add files to a bare repo:
import git
r = git.Repo("./bare-repo")
r.working_dir("/tmp/f")
print(r.bare) # True
r.index.add(["/tmp/f/foo"]) # Exception, can't use bare repo <...>
As far as I understand, files can only be added via Repo.index.add.
Is using a bare repo with the git-python module even possible, or do I need to use subprocess.call with git --work-tree=... --git-dir=... add?
You cannot add files to bare repositories. They are for sharing, not for working. You should clone the bare repository to work with it. There is a nice post about it: www.saintsjd.com/2011/01/what-is-a-bare-git-repository/
UPDATE (16.06.2016)
Code sample as requested:
import git
import os, shutil
test_folder = "temp_folder"
# This is your bare repository
bare_repo_folder = os.path.join(test_folder, "bare-repo")
repo = git.Repo.init(bare_repo_folder, bare=True)
assert repo.bare
del repo
# This is a non-bare repository where you can make your commits
non_bare_repo_folder = os.path.join(test_folder, "non-bare-repo")
# Clone bare repo into non-bare
cloned_repo = git.Repo.clone_from(bare_repo_folder, non_bare_repo_folder)
assert not cloned_repo.bare
# Make changes (e.g. create .gitignore file)
tmp_file = os.path.join(non_bare_repo_folder, ".gitignore")
with open(tmp_file, 'w') as f:
    f.write("*.pyc")
# Run regular git operations (I use the git command wrapper here, but you could use other helpers from the git module)
cmd = cloned_repo.git
cmd.add(all=True)
cmd.commit(m=".gitignore was added")
# Push changes to bare repo
cmd.push("origin", "master", u=True)
del cloned_repo # Close Repo object and cmd associated with it
# Remove non-bare cloned repo
shutil.rmtree(non_bare_repo_folder)

How to fetch using dulwich in python

I'm trying to do the equivalent of git fetch -a using the dulwich library within python.
Using the docs at https://www.dulwich.io/docs/tutorial/remote.html I created the following script:
from dulwich.client import LocalGitClient
from dulwich.repo import Repo
import os
home = os.path.expanduser('~')
local_folder = os.path.join(home, 'temp/local')
local = Repo(local_folder)
remote = os.path.join(home, 'temp/remote')
remote_refs = LocalGitClient().fetch(remote, local)
local_refs = LocalGitClient().get_refs(local_folder)
print(remote_refs)
print(local_refs)
With an existing git repository at ~/temp/remote and a newly initialised repo at ~/temp/local, remote_refs shows everything I would expect, but local_refs is an empty dictionary and git branch -a on the local repo returns nothing.
Am I missing something obvious?
This is on dulwich 0.12.0 and Python 3.5
EDIT #1
Following a discussion on the python-uk irc channel, I updated my script to include the use of determine_wants_all:
from dulwich.client import LocalGitClient
from dulwich.repo import Repo
import os

home = os.path.expanduser('~')
local_folder = os.path.join(home, 'temp/local')
local = Repo(local_folder)
remote = os.path.join(home, 'temp/remote')
wants = local.object_store.determine_wants_all
remote_refs = LocalGitClient().fetch(remote, local, wants)
local_refs = LocalGitClient().get_refs(local_folder)
print(remote_refs)
print(local_refs)
but this had no effect :-(
EDIT #2
Again, following discussion on the python-uk irc channel, I tried running dulwich fetch from within the local repo. It gave the same result as my script i.e. the remote refs were printed to the console correctly, but git branch -a showed nothing.
EDIT - Solved
A simple loop to update the local refs did the trick:
from dulwich.client import LocalGitClient
from dulwich.repo import Repo
import os
home = os.path.expanduser('~')
local_folder = os.path.join(home, 'temp/local')
local = Repo(local_folder)
remote = os.path.join(home, 'temp/remote')
remote_refs = LocalGitClient().fetch(remote, local)
for key, value in remote_refs.items():
    local.refs[key] = value
local_refs = LocalGitClient().get_refs(local_folder)
print(remote_refs)
print(local_refs)
LocalGitClient.fetch() does not update refs; it just fetches objects and then returns the remote refs, so you can use those to update the target repository's refs yourself.
