Printing changed file paths in latest commit in gitpython - python

I'm trying to get the paths of the files changed between the latest commit and the one before it using GitPython.
The problem is that even if the latest commit changed only one file, it's displaying a lot more.
Below is my code:
import git

repo = git.Repo(path)
commits_list = list(repo.iter_commits())
a_commit = commits_list[0]
b_commit = commits_list[-1]
itemDiff = a_commit.diff(b_commit)
for item in itemDiff:
    print(item.a_path)
I'm trying this against a local cloned repo. What am I doing wrong?

If you only need to read from the repo, consider PyDriller, a library built on top of GitPython.
from pydriller import RepositoryMining  # PyDriller 1.x; in 2.x the class is named Repository

for commit in RepositoryMining("repo").traverse_commits():
    for modified_file in commit.modifications:  # renamed to modified_files in 2.x
        modified_file.new_path  # here you have the path of each file changed in the commit
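For the original GitPython code, note that repo.iter_commits() yields commits newest first, so commits_list[-1] is the oldest commit in the history rather than the one just before the latest, which would explain the long list. A minimal sketch that diffs a commit against its first parent instead; the helper name changed_paths is hypothetical, and it relies only on the parents, diff(), stats.files, a_path and b_path attributes of GitPython's commit and diff objects:

```python
def changed_paths(commit):
    """Return the sorted paths changed by `commit` relative to its first parent."""
    if not commit.parents:
        # Root commit: nothing to diff against, so list every file it introduced.
        return sorted(commit.stats.files)
    parent = commit.parents[0]
    # b_path is the path after the change; fall back to a_path when it is missing.
    return sorted({d.b_path or d.a_path for d in parent.diff(commit)})
```

With GitPython installed, this could be called as changed_paths(git.Repo(path).head.commit).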

How to access DVC-controlled files from Oracle?

I have been storing my large files in CLOBs within Oracle, but I am thinking of storing my large files in a shared drive, then having a column in Oracle contain pointers to the files. This would use DVC.
When I do this,
(a) are the paths in Oracle paths that point to the files in my shared drive, as in, the actual files themselves?
(b) or do the paths in Oracle point somehow to the DVC metafile?
Any insight would help me out!
Thanks :)
Justin
EDIT to provide more clarity:
I checked here (https://dvc.org/doc/api-reference/open), and it helped, but I'm not fully there yet ...
I want to pull a file from a remote dvc repository using python (which I have connected to the Oracle database). So, if we can make that work, I think I will be good. But, I am confused. If I specify 'remote' below, then how do I name the file (e.g., 'activity.log') when the remote files are all encoded?
with dvc.api.open(
    'activity.log',
    repo='location/of/dvc/project',
    remote='my-s3-bucket'
) as fd:
    for line in fd:
        match = re.search(r'user=(\w+)', line)
        # ... Process users activity log
(NOTE: For testing purposes, my "remote" DVC directory is just another folder on my MacBook.)
I feel like I'm missing a key concept about getting remote files ...
I hope that adds more clarity. Any help figuring out remote file access is appreciated! :)
Justin
EDIT to get insights on 'rev' parameter:
Before my question, some background/my setup:
(a) I have a repo on my MacBook called 'basics'.
(b) I copied into 'basics' a directory of 501 files (called 'surface_files') that I subsequently pushed to a remote storage folder called 'gss'. After the push, 'gss' contains 220 hash directories.
The steps I used to get here are as follows:
> cd ~/Desktop/Work/basics
> git init
> dvc init
> dvc add ~/Desktop/Work/basics/surface_files
> git add .gitignore surface_files.dvc
> git commit -m "Add raw data"
> dvc remote add -d remote_storage ~/Desktop/Work/gss
> git commit .dvc/config -m "Configure remote storage"
> dvc push
> rm -rf ./.dvc/cache
> rm -rf ./surface_files
Next, I ran the following Python code to take one of my surface files, named surface_100141.dat, and used dvc.api.get_url() to get the corresponding remote storage file name. I then copied this remote storage file into my desktop under the file's original name, i.e., surface_100141.dat.
The code that does all this is as follows, but FIRST, MY QUESTION --- when I run the code as it is shown below, no problems; but when I uncomment the 'rev=' line, it fails. I am not sure why this is happening. I used git log and cat .git/refs/heads/master to make sure that I was getting the right hash. WHY IS THIS FAILING? That is my question.
(In full disclosure, my git knowledge is not too strong yet. I'm getting there, but it's still a work in progress! :))
import dvc.api
import os.path
from os import path
import shutil
filename = 'surface_100141.dat' # This file name would be stored in my Oracle database
home_dir = os.path.expanduser('~')+'/' # This simply expands '~' into '/Users/ricej/'
resource_url = dvc.api.get_url(
    path=f'surface_files/{filename}', # Works when 'surface_files.dvc' exists, even when 'surface_files' directory and .dvc/cache do not
    repo=f'{home_dir}Desktop/Work/basics',
    # rev='5c92710e68c045d75865fa24f1b56a0a486a8a45', # Commit hash, found using 'git log' or 'cat .git/refs/heads/master'
    remote='remote_storage')
resource_url = home_dir+resource_url
print(f'Remote file: {resource_url}')
new_dir = f'{home_dir}Desktop/' # Will copy fetched file to desktop, for demonstration
new_file = new_dir+filename
print(f'Remote file copy: {new_file}')
if path.exists(new_file):
    os.remove(new_file)
dest = shutil.copy(resource_url, new_file) # Check your desktop after this to see remote file copy
I'm not 100% sure that I understand the question (it would be great if you could expand a bit on the actual use case you are trying to solve with this database), but I can share a few thoughts.
When we talk about DVC, I think you need to specify one of the following to identify a file or directory:
Git commit + path (an actual path like data/data.xml). The commit (or, to be precise, any Git revision) is needed to identify the version of the data file.
Or the path in the DVC storage (/mnt/shared/storage/00/198493ef2343ao ...) + the actual name of the file. This way you would be saving the same information that .dvc files hold.
I would say that the second way is not recommended, since to some extent it is an implementation detail of how DVC stores files internally. The public interface to DVC-organized data storage is the repository URL + commit + file name.
Edit (example):
with dvc.api.open(
    'activity.log',
    repo='location/of/dvc/project',
    remote='my-s3-bucket'
) as fd:
    for line in fd:
        match = re.search(r'user=(\w+)', line)
        # ... Process users activity log
The location/of/dvc/project path must point to an actual Git repo. This repo should have a .dvc or dvc.lock file that contains the activity.log name plus its hash in the remote storage:
outs:
- md5: a304afb96060aad90176268345e10355
  path: activity.log
By reading this Git repo and analyzing, say, activity.log.dvc, DVC will be able to build the right path: s3://my-bucket/storage/a3/04afb96060aad90176268345e10355
The remote='my-s3-bucket' argument is optional. By default DVC uses the remote that is defined in the repo itself.
Let's take another real example:
with dvc.api.open(
    'get-started/data.xml',
    repo='https://github.com/iterative/dataset-registry'
) as fd:
    for line in fd:
        match = re.search(r'user=(\w+)', line)
        # ... Process users activity log
In https://github.com/iterative/dataset-registry you can find the .dvc file, which is enough for DVC to build a path to the file once it has also analyzed the repo's config:
https://remote.dvc.org/dataset-registry/a3/04afb96060aad90176268345e10355
You could run wget on this URL to download the file.
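The mapping from the md5 in the .dvc file to the storage path shown above can be sketched in a few lines. This only illustrates the layout used in this answer (the first two hex characters as a directory, the rest as the file name). The helper name storage_url is hypothetical, and because the layout is an internal detail it may differ between DVC versions, so dvc.api.get_url() remains the supported way to obtain the URL.

```python
def storage_url(remote_base, md5):
    """Join a remote base URL with the two-level path derived from an md5 hash."""
    if len(md5) < 3:
        raise ValueError("md5 value looks too short")
    # First two characters become the directory, the remainder the file name.
    return f"{remote_base.rstrip('/')}/{md5[:2]}/{md5[2:]}"


print(storage_url("https://remote.dvc.org/dataset-registry",
                  "a304afb96060aad90176268345e10355"))
```

For the hash in the example, this prints the same remote.dvc.org URL as the one given above.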

Python API for Github, getting contents in specific directory for specific branch not returning all content

Using the PyGithub API, I am attempting to retrieve all contents from a specific folder from a specific branch of a repository hosted with Github. I can't share the actual repository or specifics regarding the data, but the code I am using is this:
import github
import json
import requests
import base64
from collections import namedtuple
Package = namedtuple('Package', 'name version')
# Parameters
gh_token = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
header = {"Authorization": f"token {gh_token}"}
gh_hostname = 'devtopia.xxx.com'
gh = github.Github(base_url=f'https://{gh_hostname}/api/v3', login_or_token = gh_token)
repo_name = "xxxxxxxxx/SupportFiles"
conda_meta = "xxxxxxx/bin/Python/envs/xxxxxx-xx/conda-meta"
repo = gh.get_repo(repo_name)

def parse_conda_meta(branch):
    package_list = []
    # << Returns fewer files than expected for a specified branch "xxx/release/3.2.0",
    #    returns the expected number of files for the "master" branch.
    meta_contents = repo.get_contents(conda_meta, ref=branch)
    for i, pkg in enumerate(meta_contents):
        if ".json" in pkg.name:  # filter for JSON files
            print(i, pkg.name)
            # Need to use GitHub Data API (REST) blobs instead of easier
            # `github` with `pkg.decoded_content` here because that method
            # only works with files <= 1MB whereas Data API allows for
            # reading files <= 100MB.
            resp = requests.get(f"https://devtopia.xxxx.com/api/v3/repos/xxxxxxxxx/SupportFiles/git/blobs/{pkg.sha}?ref={branch}", headers=header)
            pkg_cont = json.loads(base64.b64decode(json.loads(resp.content)["content"]))
            package_list.append(Package(pkg_cont['name'], pkg_cont['version']))
        else:
            print('>>', i, pkg.name)
    return package_list

if __name__ == "__main__":
    pkgs = parse_conda_meta("xxx/release/3.2.0")
    print(pkgs)
    print(len(pkgs))
For some reason that I can't get to the bottom of, repo.get_contents(conda_meta, ref=branch) is not returning the correct number of files. When the branch I am specifying is checked out, I see 186 files in the conda-meta folder. However, repo.get_contents(conda_meta, ref=branch) returns only 182; I am missing four JSON files.
Is there some limitation to repo.get_contents that I'm not aware of? I've been reading the docs but can't find anything that hints at the problem I am having. There is one note about it only handling files up to 1 MB, but I am seeing larger files returned (e.g., python is 1.204 MB and appears in the list of files). I believe the limit only applies to reading file content over 1 MB, which I deal with by using the GitHub Data API (REST) further downstream. Is there something I'm doing wrong here?
Thanks for reading, any help with this is much appreciated!
Update with solution!
The Problem:
After some more digging, I have found the cause of the problem. It has nothing to do with the code above or with repo.get_contents(conda_meta, ref=branch) specifically. It is actually a Unix/Windows clash that was mistakenly introduced into our repository for this specific branch "xxx/release/3.2.0" but is not present in others.
So what was the problem? NTFS (and Windows more broadly) is case-insensitive by default, but Git comes from the Unix world and is case-sensitive by default.
We inadvertently created two folders for Python in the bin directory of the conda_meta path (xxxxxx/bin/), one folder called "Python" and one called "python" (note the lower case). When pulling the repository locally, only the "Python" folder shows up, containing all 186 files. On GitHub, however, the path with "Python" contains 182 files while the path with "python" contains the remaining 4 files.
The Solution:
The solution is to give parse_conda_meta a conda_meta_folders parameter that takes a list of paths, and to search each directory. There might be a slicker solution, though: I'm looking into whether it is possible to do something like git config core.ignorecase true with the PyGithub API. Does anyone know whether PyGithub can honor this or be configured for it?
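The workaround described above, searching several folders instead of one, might look roughly like this. It is a sketch, not the code actually used: list_json_files and its parameters are hypothetical names, and get_contents stands for any callable shaped like PyGithub's repo.get_contents(path, ref=...), returning items that have a name attribute.

```python
def list_json_files(get_contents, folders, ref):
    """Collect the JSON entries found in every folder of `folders` on branch `ref`."""
    found = []
    for folder in folders:
        # Each folder is queried separately, so case variants are all covered.
        for item in get_contents(folder, ref=ref):
            if item.name.endswith(".json"):
                found.append(item)
    return found
```

It would be called with repo.get_contents as the first argument and a list holding both the "Python" and the "python" variant of the conda-meta path.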

How to checkout multiple commits and copy the version of code on another directory?

This is my CSV file, which contains the CommitId column:
CommitId
d38f7b334856ed4007fb3ec0f8a5f7499ee2f2b8
d38f7b334856ed4007fb3ec0f8a5f7499ee2f2b8
d38f7b334856ed4007fb3ec0f8a5f7499ee2f2b8
4bb968a47ce00279d6051df95bd782650700179e
c3d7ec38417ecff03d1cd3be0163e6ce07578eb3
00568c9886e739d6b5dd61b4a4326d598552fb6f
00568c9886e739d6b5dd61b4a4326d598552fb6f
00568c9886e739d6b5dd61b4a4326d598552fb6f
00568c9886e739d6b5dd61b4a4326d598552fb6f
6e062098453febbfb0169cd0af56f70f2e3fc77f
63f658918c2f4b851b0d0fffbffab4df0cfe13ca
I need to check out each commit and copy that version of the code to another directory, so for this example I need 11 versions of the code in a directory.
I tried this code for one commit:
import os
from distutils.dir_util import copy_tree
path='C:/Users/AQ42770/Desktop/RefactoringMiner/bin/BTC-e-client-for-Android'
os.chdir(path)
commande1='git checkout d38f7b334856ed4007fb3ec0f8a5f7499ee2f2b8'
os.system(commande1)
copy_tree("C:/Users/AQ42770/Desktop/RefactoringMiner/bin/BTC-e-client-for-Android", "C:/Users/AQ42770/Desktop/test")
The first problem: copy_tree() copies the files into the destination folder, not the directory itself.
The second: I did not find a way to do this for all the commits in my CSV.
Thanks for your help!
Instead of a checkout, you could use git worktree.
More precisely: git worktree add C:/Users/AQ42770/Desktop/test1 <commit1>.
And repeat for <commit2> to C:/Users/AQ42770/Desktop/test2, and so on.
That way, you have only one clone, but 11 working trees, all with the right content.
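To repeat that for every row of the CSV, one option is to generate the git worktree add commands in Python. A rough sketch, assuming the CSV has a CommitId header as shown in the question; worktree_commands and the test<N> directory naming are illustrative choices, and --detach is passed so that each working tree gets a detached HEAD at its commit:

```python
import csv
import io


def worktree_commands(csv_text, dest_root):
    """Build one `git worktree add` command per CommitId row of the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    commands = []
    for index, row in enumerate(reader, start=1):
        target = f"{dest_root}/test{index}"  # test1, test2, ... as in the answer
        commit_id = row["CommitId"].strip()
        commands.append(["git", "worktree", "add", "--detach", target, commit_id])
    return commands
```

Each command could then be run from inside the clone, for example with subprocess.run(cmd, cwd=path, check=True).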
You need git cherry-pick for that.
Run git cherry-pick A..B, where A and B are your two commits (A being the older one and B being the newer one).

Number of lines added and deleted in a git repository using python

The following code prints the files which have been modified in current tree vs previous tree (if changed):
for modified in commit.diff('HEAD~1').iter_change_type('M'):
    print(modified.a_blob.path) # prints all files modified
How can I get the number of lines added and deleted too?
(Just like we do using git log --numstat).
You can use git directly to do this from within gitpython:
the_git = repo.git
log = the_git.log('--numstat')
More here if you like: http://gitpython.readthedocs.io/en/stable/tutorial.html#using-git-directly
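The text returned by git log --numstat still has to be parsed. A minimal sketch, assuming the usual numstat layout of added<TAB>deleted<TAB>path per file, with - in place of the counts for binary files; parse_numstat is a hypothetical helper that sums the counts per path:

```python
def parse_numstat(output):
    """Sum added and deleted line counts per path from `git log --numstat` text."""
    totals = {}
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # commit headers, messages and blank lines
        added, deleted, path = parts
        if not (added.isdigit() and deleted.isdigit()):
            continue  # binary files are reported as "-" and skipped here
        prev_added, prev_deleted = totals.get(path, (0, 0))
        totals[path] = (prev_added + int(added), prev_deleted + int(deleted))
    return totals
```

With the snippet above, it could be used as parse_numstat(log).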

GitPython pull/fetch retrieve progress

How do I retrieve the list of files when doing a pull/fetch? Assume that the upstream is set and that I am pulling from it.
repo = git.Repo('/repo_location/')
result = repo.git.pull()
The API reference says the return value is an iterable list, but I'm unable to use it that way.
If I do print(result), it prints correctly to stdout, but not when I iterate over it.
I found no GitPython method to do this, so I ran git diff-tree:
from git import Repo

repo = Repo(repo_dir)
# First fetch the remote information and list all changed files, then pull the repository.
# It has to be done in this order, because after pulling, the local branch no longer differs from the remote.
repo.remote().fetch()
# The "-t" option also lists all changed directories.
diff_tree = repo.git.diff_tree('--name-status', '-t', 'master', 'origin/master')
repo.remote().pull()
for line in diff_tree.splitlines():
    change_status, file_path = line.split('\t')
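Note that unpacking line.split('\t') into two names only works for plain status lines; when rename or copy detection is enabled, git emits three fields (status, old path, new path). A slightly more tolerant parser, as a sketch; parse_name_status is a hypothetical helper:

```python
def parse_name_status(output):
    """Parse `--name-status` text into a list of (status, path) pairs."""
    changes = []
    for line in output.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        status, paths = fields[0], fields[1:]
        if not paths:
            continue  # not a status line
        # For renames and copies (e.g. R100) the last field is the new path.
        changes.append((status, paths[-1]))
    return changes
```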
