I am using the GitPython library and was wondering how to get all commits on a branch in the range between two commit SHA-1s. I have the start SHA and the end SHA. Is there any way to get a list of them?
I have instantiated the Repo object and was wondering if there is a way to query it to obtain a list of commits between two SHAs.
I am looking to do something similar to this command, but return the commits as a list:
git log e0d8a4c3fec7ef2c352342c2ffada21fa07c1dc..63af686e626e0a5cbb0508367983765154e188ce --pretty=format:%h,%an,%s > commits.csv
It seems like there is a Repo.iter_commits() method, but I can't see how to specify a range.
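In other words, I would expect something like this to work (a minimal sketch, assuming the repo is already cloned locally; the path and SHAs below are placeholders):
from git import Repo

repo = Repo("/path/to/repo")  # hypothetical local clone path
start_sha = "aaaaaaa"         # placeholder start SHA
end_sha = "bbbbbbb"           # placeholder end SHA

# iter_commits accepts the same rev-range syntax as `git log`
commits = list(repo.iter_commits(start_sha + ".." + end_sha))
for commit in commits:
    print(commit.hexsha, commit.author.name, commit.summary)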
@shlomi33, here is my solution:
This is the method I used to build a list of GitPython Commit objects. Within the class I have already instantiated the Repo object that this method relies on.
Edit: I have also added the __init__ method that is used to set up the Repo instance.
def __init__(self, repo_url, project_name, branch_name) -> None:
    """
    Clone the required repo in the repo folder and instantiate the Repo instance
    :param repo_url: str of the git repository url
    :param project_name: str of the project name
    :param branch_name: str of the branch name
    """
    self._repo_url = repo_url
    self._project = project_name
    self._branch = branch_name
    self._project_dir = Path(self._repo_dir, self._project)
    if self._project_dir.exists():
        self._logger.info('Deleting existing repo: ' + str(self._project_dir))
        shutil.rmtree(self._project_dir)
    self._logger.info('Creating directory for repo: ' + str(self._project_dir))
    os.makedirs(self._project_dir)
    try:
        self.repo = GitRepo.clone_from(self._repo_url, self._project_dir, branch=self._branch,
                                       progress=self.__progress)
    except GitCommandError:
        self._logger.error('Failed to clone repo. Make sure you have git ssh access set up.')
        raise
    self._logger.info('Repository downloaded to: ' + str(self._repo_dir))
def set_commits(self, start_rev: str, end_rev: str):
    """
    Sets a list of Commits for the range of the start rev and end rev
    :param start_rev: Start commit revision SHA1
    :param end_rev: End commit revision
    """
    self._start_rev = start_rev
    self._end_rev = end_rev
    commits = self.repo.iter_commits(start_rev + '..' + end_rev)
    self._commits = list(map(Commit, list(commits)))
I feel like this should be possible, but I looked through the wandb SDK code and I can't find an easy/logical way to do it. It might be possible to hack it by modifying the manifest entries at some later point (but presumably before the artifact is logged to wandb, since the manifest and its entries might be locked after that)? I saw things like this in the SDK code:
version = manifest_entry.extra.get("versionID")
etag = manifest_entry.extra.get("etag")
So, I figure we can probably edit those?
UPDATE
So, I tried to hack it together with something like this, and it works, but it feels wrong:
import os
import wandb
import boto3
from wandb.util import md5_file
ENTITY = os.environ.get("WANDB_ENTITY")
PROJECT = os.environ.get("WANDB_PROJECT")
API_KEY = os.environ.get("WANDB_API_KEY")
api = wandb.Api(overrides={"entity": ENTITY, "project": PROJECT})
run = wandb.init(entity=ENTITY, project=PROJECT, job_type="test upload")
file = "admin2Codes.txt" # "admin1CodesASCII.txt" # (both already on s3 with a couple versions)
artifact = wandb.Artifact("test_data", type="dataset")
# modify one of the local files so it has a new md5hash etc.
with open(file, "a") as f:
    f.write("new_line_1\n")
# upload local file to s3
local_file_path = file
s3_url = f"s3://bucket/prefix/{file}"
s3_url_arr = s3_url.replace("s3://", "").split("/")
s3_bucket = s3_url_arr[0]
key = "/".join(s3_url_arr[1:])
s3_client = boto3.client("s3")
file_digest = md5_file(local_file_path)
s3_client.upload_file(
    local_file_path,
    s3_bucket,
    key,
    # save the md5_digest in metadata,
    # can be used later to only upload new files to s3,
    # as AWS doesn't digest the file consistently in the E-tag
    ExtraArgs={"Metadata": {"md5_digest": file_digest}},
)
head_response = s3_client.head_object(Bucket=s3_bucket, Key=key)
version_id: str = head_response["VersionId"]
print(version_id)
# upload a link/ref to this s3 object in wandb
# (assuming s3_dir is the s3 prefix that contains the files, e.g. s3://bucket/prefix/)
s3_dir = s3_url.rsplit("/", 1)[0]
artifact.add_reference(s3_dir)
# at this point we might be able to modify the artifact._manifest.entries and each entry.extra.get("etag") etc.?
print([(name, entry.extra) for name, entry in artifact._manifest.entries.items()])
# set these to an older version on s3 that we know we want (rather than latest) - do this via wandb public API:
dataset_v2 = api.artifact(f"{ENTITY}/{PROJECT}/test_data:v2", type="dataset")
# artifact._manifest.add_entry(dataset_v2.manifest.entries["admin1CodesASCII.txt"])
artifact._manifest.entries["admin1CodesASCII.txt"] = dataset_v2.manifest.entries[
    "admin1CodesASCII.txt"
]
# verify that it did change:
print([(name, entry.extra) for name, entry in artifact._manifest.entries.items()])
run.log_artifact(artifact) # at this point the manifest is locked I believe?
artifact.wait() # wait for upload to finish (blocking - but should be very quick given it is just an s3 link)
print(artifact.name)
run_id = run.id
run.finish()
curr_run = api.run(f"{ENTITY}/{PROJECT}/{run_id}")
used_artifacts = curr_run.used_artifacts()
logged_artifacts = curr_run.logged_artifacts()
Am I on the right track here? I guess the other workaround is to make a copy on s3 (so that the older version is the latest again), but I wanted to avoid this, as the one file I want to use an old version of is a large NLP model, and the only files I want to change are small config.json files etc. (so it seems very wasteful to upload all the files again).
I was also wondering: when I copy an old version of an object back into the same key in the bucket, does that create a real copy or just a pointer to the same underlying object? Neither the boto3 nor the AWS documentation makes that clear, although it seems like it is a proper copy.
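(For reference, if I did go the copy route, something like the following server-side copy should make an old version the latest again without re-uploading from my machine; the bucket, key and version ID below are just placeholders:)
import boto3

s3_client = boto3.client("s3")

# copy a specific old version of the object back onto the same key,
# which makes it the newest version of that key again (server-side copy)
s3_client.copy_object(
    Bucket="bucket",                    # placeholder bucket
    Key="prefix/model.bin",             # placeholder key
    CopySource={
        "Bucket": "bucket",
        "Key": "prefix/model.bin",
        "VersionId": "OLD_VERSION_ID",  # placeholder version id
    },
)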
I think I found the correct way to do it now:
import os
import re

import boto3
import wandb
from wandb import Artifact  # used for the return type annotation below
from wandb.util import md5_file

ENTITY = os.environ.get("WANDB_ENTITY")
PROJECT = os.environ.get("WANDB_PROJECT")
def wandb_update_only_some_files_in_artifact(
    existing_artifact_name: str,
    new_s3_file_urls: list[str],
    entity: str = ENTITY,
    project: str = PROJECT,
) -> Artifact:
    """If you want to just update a config.json file for example,
    but the rest of the artifact can remain the same, then you can
    use this function like so:

    wandb_update_only_some_files_in_artifact(
        "old_artifact:v3",
        ["s3://bucket/prefix/config.json"],
    )

    and then all the other files like model.bin will be the same as in v3,
    even if there was a v4 or v5 in between (as the v3 VersionIds are used).

    Args:
        existing_artifact_name (str): name with version like "old_artifact:v3"
        new_s3_file_urls (list[str]): files that should be updated
        entity (str, optional): wandb entity. Defaults to ENTITY.
        project (str, optional): wandb project. Defaults to PROJECT.

    Returns:
        Artifact: the new artifact object
    """
    api = wandb.Api(overrides={"entity": entity, "project": project})
    old_artifact = api.artifact(existing_artifact_name)
    old_artifact_name = re.sub(r":v\d+$", "", old_artifact.name)
    with wandb.init(entity=entity, project=project) as run:
        new_artifact = wandb.Artifact(old_artifact_name, type=old_artifact.type)
        s3_file_names = [s3_url.split("/")[-1] for s3_url in new_s3_file_urls]
        # add the new ones:
        for s3_url, filename in zip(new_s3_file_urls, s3_file_names):
            new_artifact.add_reference(s3_url, filename)
        # add the old ones:
        for filename, entry in old_artifact.manifest.entries.items():
            if filename in s3_file_names:
                continue
            new_artifact.add_reference(entry, filename)
            # this also works but feels hackier:
            # new_artifact._manifest.entries[filename] = entry
        run.log_artifact(new_artifact)
        new_artifact.wait()  # wait for upload to finish (blocking - but should be very quick given it is just an s3 link)
        print(new_artifact.name)
        print(run.id)
        return new_artifact
# usage:
local_file_path = "config.json"  # modified file
s3_url = "s3://bucket/prefix/config.json"
s3_url_arr = s3_url.replace("s3://", "").split("/")
s3_bucket = s3_url_arr[0]
key = "/".join(s3_url_arr[1:])
s3_client = boto3.client("s3")
file_digest = md5_file(local_file_path)
s3_client.upload_file(
    local_file_path,
    s3_bucket,
    key,
    # save the md5_digest in metadata,
    # can be used later to only upload new files to s3,
    # as AWS doesn't digest the file consistently in the E-tag
    ExtraArgs={"Metadata": {"md5_digest": file_digest}},
)
wandb_update_only_some_files_in_artifact(
    "old_artifact:v3",
    ["s3://bucket/prefix/config.json"],
)
I've seen the topic of committing using PyGithub in many other questions here, but none of them helped me; I didn't understand the solutions. I guess I'm too much of a newbie.
I simply want to commit a file from my computer to a test GitHub repository that I created. So far I'm testing with a Google Colab notebook.
This is my code, questions and problems are in the comments:
from github import Github
user = '***'
password = '***'
g = Github(user, password)
user = g.get_user()
# created a test repository
repo = user.create_repo('test')
# problem here, it asks for an argument 'sha', what is this?
tree = repo.get_git_tree(???)
file = 'content/echo.py'
# since I didn't get the tree, this also goes wrong
repo.create_git_commit('test', tree, file)
The sha is a 40-character checksum hash that serves as the unique identifier of the commit you want to fetch (SHAs are used to identify the other Git object types as well).
From the docs:
Each object is uniquely identified by a binary SHA1 hash, being 20 bytes in size, or 40 bytes in hexadecimal notation.
Git only knows 4 distinct object types being Blobs, Trees, Commits and Tags.
The head commit sha is accessible via:
headcommit = repo.head.commit
headcommit_sha = headcommit.hexsha
Or the master branch commit is accessible via:
branch = repo.get_branch("master")
master_commit = branch.commit
You can see all your existing branches via:
for branch in repo.get_branches():
    print(f'{branch.name}')
You can also look up the sha of whichever branch you'd like in the repository you want to fetch from.
get_git_tree takes the given sha identifier and returns a github.GitTree.GitTree; from the docs:
Git tree object creates the hierarchy between files in a Git repository
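Putting this together for your snippet, a minimal sketch (assuming repo is your PyGithub Repository object and it already has at least one commit on master) would be:
branch = repo.get_branch("master")
tree = repo.get_git_tree(sha=branch.commit.sha)
print([element.path for element in tree.tree])  # list the paths in the tree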
You'll find a lot more interesting information in the docs tutorial.
Code to create a repository and commit a new file to it on Google Colab:
!pip install pygithub
from github import Github
user = '****'
password = '****'
g = Github(user, password)
user = g.get_user()
repo_name = 'test'
# Check if the repo doesn't exist yet
if repo_name not in [r.name for r in user.get_repos()]:
    # Create repository
    user.create_repo(repo_name)
# Get repository
repo = user.get_repo(repo_name)
# File details
file_name = 'echo.py'
file_content = 'print("echo")'
# Create file
repo.create_file(file_name, 'commit', file_content)
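Note that create_file only works for a path that doesn't already exist in the repository. If you later want to commit a change to the same file, a hypothetical follow-up sketch would use get_contents plus update_file, which needs the existing file's sha:
# update an existing file instead of creating it
contents = repo.get_contents(file_name)
repo.update_file(contents.path, 'update echo.py', 'print("echo v2")', contents.sha)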
I want to use the GitHub API for Python to get every repository and check the last change to each repository.
import git
from git import Repo
from github import Github
repos = []
g = Github('Dextron12', 'password')
for repo in g.get_user().get_repos():
    repos.append(str(repo))
    # check for last commit to repository HERE
This gets all repositories on my account, but I also want to get the last change to each of them, with a result like this:
13:46:45
I don't mind if it is 12 hour time either.
According to the documentation, the most you can get is the SHA of the commit and the commit date:
https://pygithub.readthedocs.io/en/latest/examples/Commit.html#
with your example:
g = Github("usar", "pass")
for repo in g.get_user().get_repos():
master = repo.get_branch("master")
sha_com = master.commit
commit = repo.get_commit(sha=sha_com)
print(commit.commit.author.date)
from github import Github
from datetime import datetime

repos = {}
g = Github('username', 'password')

for repo in g.get_user().get_repos():
    master = repo.get_branch('master')
    sha_com = master.commit
    sha_com = str(sha_com).split('Commit(sha="')
    sha_com = sha_com[1].split('")')
    sha_com = sha_com[0]
    commit = repo.get_commit(sha_com)
    # get repository name
    repo = str(repo).split('Repository(full_name="Dextron12/')
    repo = repo[1].split('")')
    # CONVERT DATETIME OBJECT TO STRING
    timeObj = commit.commit.author.date
    timeStamp = timeObj.strftime("%d-%b-%Y (%H:%M:%S)")
    # ADD REPOSITORY NAME AND TIMESTAMP TO repos DICTIONARY
    repos[repo[0]] = timeStamp

print(repos)
I got the timeStamp by using the method Damian Lattenero suggested. Upon testing his code I got an AssertionError; this was because sha_com was returning Commit(sha="...") and not the bare sha. So I stripped the brackets and the Commit wrapper from sha_com to be left with the sha all by itself, after which I didn't receive that error and it worked. I then used datetime's strftime to convert the timestamp to a string and saved it to a dictionary.
@Dextron, just add .sha because it's a property; no need to do the string splitting to build the dictionary:
g = Github("user", "pass")
for repo in g.get_user().get_repos():
master = repo.get_branch("master")
sha_com = master.commit
commit = repo.get_commit(sha=sha_com.sha)
print(commit.commit.author.date)
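If you want the output formatted as just the time (e.g. 13:46:45), commit.commit.author.date is a regular datetime object, so continuing inside the loop above you could do something like:
    # format the commit datetime as HH:MM:SS (24-hour clock)
    print(commit.commit.author.date.strftime("%H:%M:%S"))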
I am using Python with python-kubernetes and a minikube running locally, i.e. there are no cloud issues.
I am trying to create a job and provide it with data to run on. I would like to provide it with a mount of a directory containing data from my local machine.
I am using this example and trying to add a volume mount.
This is my code after adding the volume_mounts keyword (I tried multiple places and multiple keywords, and nothing works):
from os import path

import yaml
from kubernetes import client, config

JOB_NAME = "pi"

def create_job_object():
    # Configure Pod template container
    container = client.V1Container(
        name="pi",
        image="perl",
        volume_mounts=["/home/user/data"],
        command=["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"])
    # Create and configure a spec section
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={
            "app": "pi"}),
        spec=client.V1PodSpec(restart_policy="Never",
                              containers=[container]))
    # Create the specification of deployment
    spec = client.V1JobSpec(
        template=template,
        backoff_limit=0)
    # Instantiate the job object
    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name=JOB_NAME),
        spec=spec)
    return job

def create_job(api_instance, job):
    # Create job
    api_response = api_instance.create_namespaced_job(
        body=job,
        namespace="default")
    print("Job created. status='%s'" % str(api_response.status))

def update_job(api_instance, job):
    # Update container image
    job.spec.template.spec.containers[0].image = "perl"
    # Update the job
    api_response = api_instance.patch_namespaced_job(
        name=JOB_NAME,
        namespace="default",
        body=job)
    print("Job updated. status='%s'" % str(api_response.status))

def delete_job(api_instance):
    # Delete job
    api_response = api_instance.delete_namespaced_job(
        name=JOB_NAME,
        namespace="default",
        body=client.V1DeleteOptions(
            propagation_policy='Foreground',
            grace_period_seconds=5))
    print("Job deleted. status='%s'" % str(api_response.status))

def main():
    # Configs can be set in Configuration class directly or using helper
    # utility. If no argument provided, the config will be loaded from
    # default location.
    config.load_kube_config()
    batch_v1 = client.BatchV1Api()
    # Create a job object with client-python API. The job we
    # created is same as the `pi-job.yaml` in the /examples folder.
    job = create_job_object()
    create_job(batch_v1, job)
    update_job(batch_v1, job)
    delete_job(batch_v1)

if __name__ == '__main__':
    main()
I get this error:
HTTP response body:
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Job
in version \"v1\" cannot be handled as a Job: v1.Job.Spec:
v1.JobSpec.Template: v1.PodTemplateSpec.Spec: v1.PodSpec.Containers:
[]v1.Container: v1.Container.VolumeMounts: []v1.VolumeMount:
readObjectStart: expect { or n, but found \", error found in #10 byte
of ...|ounts\": [\"/home/user|..., bigger context ...| \"image\":
\"perl\", \"name\": \"pi\", \"volumeMounts\": [\"/home/user/data\"]}],
\"restartPolicy\": \"Never\"}}}}|...","reason":"BadRequest","code":400
What am I missing here?
Is there another way to expose data to the job?
Edit: trying to use client.V1VolumeMount
I am trying to add this code, and pass the mount object to different init functions, e.g.:
mount = client.V1VolumeMount(mount_path="/data", name="shai")
client.V1Container
client.V1PodTemplateSpec
client.V1JobSpec
client.V1Job
under multiple keywords; it all results in errors. Is this the correct object to use? How should I use it, if at all?
Edit: trying to pass volume_mounts as a list with the following code, as suggested in the answers:
def create_job_object():
    # Configure Pod template container
    container = client.V1Container(
        name="pi",
        image="perl",
        volume_mounts=["/home/user/data"],
        command=["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"])
    # Create and configure a spec section
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={
            "app": "pi"}),
        spec=client.V1PodSpec(restart_policy="Never",
                              containers=[container]))
    # Create the specification of deployment
    spec = client.V1JobSpec(
        template=template,
        backoff_limit=0)
    # Instantiate the job object
    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name=JOB_NAME),
        spec=spec)
    return job
And I still get a similar error:
kubernetes.client.rest.ApiException: (422) Reason: Unprocessable
Entity HTTP response headers: HTTPHeaderDict({'Content-Type':
'application/json', 'Date': 'Tue, 06 Aug 2019 06:19:13 GMT',
'Content-Length': '401'}) HTTP response body:
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Job.batch
\"pi\" is invalid:
spec.template.spec.containers[0].volumeMounts[0].name: Not found:
\"d\"","reason":"Invalid","details":{"name":"pi","group":"batch","kind":"Job","causes":[{"reason":"FieldValueNotFound","message":"Not
found:
\"d\"","field":"spec.template.spec.containers[0].volumeMounts[0].name"}]},"code":422}
The V1Container call expects a list of V1VolumeMount objects for the volume_mounts parameter, but you passed in a list of strings:
Code:
def create_job_object():
    volume_mount = client.V1VolumeMount(
        mount_path="/home/user/data"
        # other optional arguments, see the volume mount doc link below
    )
    # Configure Pod template container
    container = client.V1Container(
        name="pi",
        image="perl",
        volume_mounts=[volume_mount],
        command=["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"])
    # Create and configure a spec section
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={
            "app": "pi"}),
        spec=client.V1PodSpec(restart_policy="Never",
                              containers=[container]))
    ....
references:
https://github.com/kubernetes-client/python/blob/master/kubernetes/docs/V1Container.md
https://github.com/kubernetes-client/python/blob/master/kubernetes/docs/V1VolumeMount.md
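As the second error in the question shows, the volume mount's name also has to refer to a volume declared on the pod spec. Below is a minimal sketch of wiring the two together, reusing the imports and JOB_NAME from the question's code; the name data-volume and the hostPath /home/user/data are placeholders, and hostPath only makes sense on a single-node cluster like minikube:
def create_job_object():
    # the volume declared on the pod, here backed by a host path (placeholder path)
    volume = client.V1Volume(
        name="data-volume",
        host_path=client.V1HostPathVolumeSource(path="/home/user/data"))
    # the mount inside the container; its name must match the volume's name
    volume_mount = client.V1VolumeMount(
        name="data-volume",
        mount_path="/data")
    container = client.V1Container(
        name="pi",
        image="perl",
        volume_mounts=[volume_mount],
        command=["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"])
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={"app": "pi"}),
        spec=client.V1PodSpec(restart_policy="Never",
                              containers=[container],
                              volumes=[volume]))
    spec = client.V1JobSpec(template=template, backoff_limit=0)
    return client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name=JOB_NAME),
        spec=spec)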
I want to save a CSV file ("test.csv") in S3 using boto3.
My bucket is "outputS3Bucket" and the key is "folder/newFolder".
I want to check whether "newFolder" exists and, if not, create it.
import boto3
client = boto3.client('s3')
s3 = boto3.resource('s3')
bucket = s3.Bucket("outputS3Bucket")
result = client.list_objects(Bucket='outputS3Bucket',Prefix="folder/newFolder")
if len(result) == 0:
    key = bucket.new_key("folder/newFolder")
    newKey = key + "/" + "test.csv"
client.put_object(Bucket="outputS3Bucket", Key=newKey, Body=content)
# put_object path: 's3://outputS3Bucket/folder/newFolder/test.csv'
I have a few problems:
1. If I don't write the full key name (such as "folder/ne") and there is a "neaFo" folder instead, it still says it exists.
2. This line fails:
key = bucket.new_key("folder/newFolder")
AttributeError: 's3.Bucket' object has no attribute 'new_key'
Firstly, according to the boto3 documentation, it's preferred to use the newer list_objects_v2() method to list a bucket's objects.
I suggest using a simple boolean function to check whether a folder exists (it makes your code cleaner and more readable).
For question 1, you can check whether the prefix ends with the '/' character and append it if not; this makes sure you are looking for an exact match and not a starts-with match.
Sample Function:
def bucket_folder_exists(client, bucket, path_prefix):
    # make path_prefix an exact match and not path/to/folder*
    if path_prefix[-1] != '/':
        path_prefix += '/'
    # check if the 'Contents' key exists in the response dict - if it exists, the
    # folder exists; otherwise .get() returns None
    response = client.list_objects_v2(Bucket=bucket, Prefix=path_prefix).get('Contents')
    if response:
        return True
    return False
Sample Implementation:
if bucket_folder_exists(client, 'outputS3Bucket', 'folder/newFolder'):
    pass  # Do something if the folder already exists
else:
    pass  # Do something if the folder does not exist
Regarding your second question: your code calls key = bucket.new_key("folder/newFolder"), but bucket is an s3.Bucket resource object and, as the error you are getting says, it doesn't have a new_key attribute (that method belongs to the old boto 2 API, not boto3).
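Since S3 has no real folders (keys are just prefixes), a minimal sketch of the original goal, reusing bucket_folder_exists from above and the bucket/prefix names from the question, could simply upload the file under the desired prefix, optionally creating a zero-byte "folder" marker first:
import boto3

client = boto3.client('s3')
bucket_name = "outputS3Bucket"
folder_prefix = "folder/newFolder/"

if not bucket_folder_exists(client, bucket_name, folder_prefix):
    # optional: create a zero-byte object so the "folder" shows up in the S3 console
    client.put_object(Bucket=bucket_name, Key=folder_prefix)

# upload the csv under the prefix; this works whether or not the marker exists
client.upload_file("test.csv", bucket_name, folder_prefix + "test.csv")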