Finding first and last commit for every subfolder in directory

I need to loop over every folder in a directory and find the user responsible for the first and last commit. Is there a smart way to do this in Git Bash? I tried looking into this with the subprocess module in Python, using it to loop through the folders, but I'm not sure that is a good approach.
What I have tried is
git log -- path/to/folder: This solution just lists all commits to that subfolder. But I wish to filter only the first and last commit. I also wish to loop through all folders in the directory
The replies in a related Stack Overflow thread: they didn't seem to work for me (either printing nothing or giving an error).

Assuming you are interested in the current branch only, you can get the first commit via Git Bash with
git rev-list HEAD -- path/to/folder | tail -1
and the last commit with
git rev-list HEAD -- path/to/folder | head -1
git rev-list is similar to git log, but it is a "plumbing" command. "Plumbing" commands are a bit less user-friendly than "porcelain" commands like git log, but they are guaranteed to behave consistently regardless of your personal settings whereas "porcelain" commands may have different output depending on your config. Because of this, it's usually a good idea to use "plumbing" commands when writing scripts/programs.
git rev-list returns only the commit hash by default, but you can use --pretty/--format options similar to git log.
head and tail take a longer input—in this case, the entire list of commits for a path—and return only the first/last n lines, where n is whatever number you give as the parameter. git log and git rev-list show the most recent commit first, so you need tail to get the first commit and head to get the last.
You could also get the last commit using
git rev-list HEAD -1 -- path/to/folder
without piping to head. However, you cannot get the first commit using Git's built-in commit-limiting options, because e.g.
git rev-list HEAD --reverse -1 -- path/to/folder
applies the -1 limiter first, returning only the last commit, before applying --reverse.
Finally, it's worth noting that Git doesn't truly track directories, only files. If you create a folder with no files in it, it's not possible to commit that folder, and if you delete all the files within a folder, then as far as Git is concerned that folder doesn't exist anymore. The upshot is: these commands will get you the first and last commits that touch any file within the directory (and its subdirectories) as opposed to the directory itself. This distinction may or may not be important for your scenario.
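To tie this back to the original question, here is a minimal Python sketch that runs git rev-list once per subfolder and keeps only the oldest and newest entries. It relies on git rev-list printing a "commit <hash>" line before each formatted line when --format is given. The function names (parse_rev_list, first_and_last, report) are made up for this example, and repo_dir is assumed to be the top level of a Git working tree.

```python
import os
import subprocess


def parse_rev_list(output):
    """Parse `git rev-list --format=%aN` output into (hash, author) pairs, newest first."""
    lines = output.splitlines()
    # With --format, each commit yields a "commit <hash>" line followed by the author line.
    return [(lines[i].split()[1], lines[i + 1]) for i in range(0, len(lines) - 1, 2)]


def first_and_last(repo_dir, folder):
    """Return ((hash, author) of first commit, (hash, author) of last commit), or None."""
    out = subprocess.run(
        ["git", "rev-list", "--format=%aN", "HEAD", "--", folder],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    commits = parse_rev_list(out)
    if not commits:
        return None  # no commit on HEAD touches a file under this folder
    return commits[-1], commits[0]  # output is newest first


def report(repo_dir):
    """Print the first and last commit for every subfolder of repo_dir."""
    for entry in sorted(os.listdir(repo_dir)):
        if entry == ".git" or not os.path.isdir(os.path.join(repo_dir, entry)):
            continue
        result = first_and_last(repo_dir, entry)
        if result:
            (first_hash, first_author), (last_hash, last_author) = result
            print(f"{entry}: first {first_hash[:8]} by {first_author}, "
                  f"last {last_hash[:8]} by {last_author}")
```

Calling report('C:/folder_path') would print one line per subfolder. Slicing in Python replaces the head and tail pipes, so the sketch does not depend on shell utilities.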

I solved my issue with subprocess in the end
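import subprocess
import os

dir_path = os.path.normpath('C:/folder_path')
for f in os.listdir(dir_path):
    subpath = os.path.join(dir_path, f)
    # %aN / %aE: author name and email, %as: author date (YYYY-MM-DD)
    # '--' separates the options from the path being inspected
    subprocess_args = ['git', 'log', "--pretty=format:{'author': '%aN', 'date': '%as', 'email': '%aE'}", '--', subpath]
    # Run git inside the directory so the script works from any working directory
    commits = subprocess.check_output(subprocess_args, cwd=dir_path).decode().split('\n')
    # git log prints the newest commit first
    print(f'{f} -- first: {commits[-1]}, last: {commits[0]}')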

Related

How to transfer contents from __init__.py in git (and maintain history) to another file whilst still keeping empty __init__.py

I created an import scheme that imported from __init__.py, rather than __init__.py importing from its modules.
To fix this I ran:
$ git mv package/__init__.py package/utils.py
This looked correct:
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        renamed:    package/__init__.py -> package/utils.py
However if I run the following:
$ touch package/__init__.py
This is what I see:
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        modified:   package/__init__.py
        new file:   package/utils.py
How can I get git to do the following?
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        modified:   package/utils.py
        new file:   package/__init__.py

TL;DR
You can make two commits if you like. There's not a lot of value to that, but there is a little. Some of its value is positive and some of it is negative. It is your choice.
Long
Git has no file history. Git has commits; the commits are the history.
Commits themselves are relatively simple: each one has a full snapshot of every file, plus some metadata containing things like the name and email address of the author of the commit. The metadata of any one commit includes the raw hash ID(s) of any earlier commit(s). Most commits, called ordinary commits, have one earlier commit, and that one earlier commit also has a snapshot and metadata, which points to one more still-earlier commit, and so on. This is how the snapshots-and-metadata are the history.
With that in mind, note that git log -p or git show shows an ordinary commit by:
1. displaying (the interesting part(s) of) its metadata, with formatting; then
2. showing what changed in that commit.
In order to achieve item 2, Git actually extracts both the commit and its parent to a temporary area (in memory), and then compares the two sets of snapshot files.[1] This comparison takes the form of a diff (git diff), i.e., the difference between two snapshots.
The git status command also runs git diff. In fact, it runs git diff twice, once to compare the current (aka HEAD) commit to Git's index—your proposed next commit, resulting from any git add updates—and again to compare Git's index to your working tree, in case there are things you forgot to git add. (This form of diff uses at least one snapshot that's not saved in a commit, and one of the two forms uses real files, which takes more work than using Git's shortcut hash ID tricks. But the end result is the same.)
When Git runs this kind of diff, it can—and now, by default, will—look for renamed files. Its method of finding these renames is imperfect, though. What it does is this:
1. List out all the files on the left ("before" or "old version").
2. List out all the files on the right ("after" or "new version").
3. If there is a pair of files on left and right with the same name, pair those up: they must be the same file.
4. Take all the left-over, unpaired names. Some of these might be renames. Check all the left-side files against all the right-side files.[2] If a left-side file is "sufficiently similar" to a right-side file, pair up the best matches. (100%-identical matches go faster here in most cases, and reduce the remaining pile of unpaired names, so Git always does this first.)
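The pairing steps above can be sketched in Python. This is only an illustration of the idea, not Git's implementation: difflib.SequenceMatcher stands in for Git's own similarity measure, snapshots are modelled as plain dicts mapping file name to content, and the function name detect_renames is made up for this example. The 0.5 threshold mirrors Git's default 50% similarity cutoff.

```python
from difflib import SequenceMatcher


def detect_renames(old, new, threshold=0.5):
    """old/new map file name -> content.

    Returns (same_name, renames, deleted, added), where renames is a list of
    (old_name, new_name, similarity) tuples.
    """
    # Step 3: files with the same name on both sides are paired unconditionally.
    same_name = sorted(set(old) & set(new))
    # Step 4: only the leftover names take part in rename detection.
    left = {name: old[name] for name in old if name not in new}
    right = {name: new[name] for name in new if name not in old}

    renames = []
    # Exact (100%) matches are handled first and shrink the leftover piles.
    for src in sorted(left):
        for dst in sorted(right):
            if left[src] == right[dst]:
                renames.append((src, dst, 1.0))
                del right[dst]
                break
    for src, _, _ in renames:
        left.pop(src)

    # Inexact matches: score every remaining pair, then pair the best ones greedily.
    candidates = sorted(
        ((SequenceMatcher(None, left[s], right[d]).ratio(), s, d)
         for s in left for d in right),
        reverse=True,
    )
    for score, src, dst in candidates:
        if score >= threshold and src in left and dst in right:
            renames.append((src, dst, score))
            del left[src], right[dst]

    return same_name, renames, sorted(left), sorted(right)
```

Fed the two situations from the question, the sketch reports a rename while only utils.py exists on the right, and an unpaired "added" utils.py as soon as an __init__.py reappears on the right.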
When you ran:
git mv package/__init__.py package/utils.py
the setup was perfect for Git: every other file matched 100% left and right, and the remaining list was that the left side had __init__.py and the right side had utils.py and the contents matched 100%. So that must be a rename! (In a way, these files are named package/__init__.py etc.: Git considers the whole thing, including the slashes, to be a file name. But it's shorter for me to leave out the package/, and you probably think of these as files-in-a-folder or files-in-a-directory yourself.)
As soon as you created a new file named __init__.py, however, Git now had both left and right side files named __init__.py, plus this one leftover right-side file named utils.py. So Git paired up the files with the same name and had one left over right-side-only file that cannot be paired.
If you make a new commit now, with this situation, git diff will continue to find things set up this way, at least until some mythical future Git is smart enough to notice that, even though the two files have the same name, a diff that says "rename and then create anew" is somehow superior.[3]
If, however, you make a commit that contains only the renaming step, and then create a new __init__.py file so that the package works right and commit that as a second commit, git log -p and git show will resume detecting the rename. The upside of doing this is that git log --follow, which goes step-by-step and works by changing the name it's looking for when it detects a rename, will work. The downside of doing this is that you will have one commit that is deliberately broken. You should probably note this in its commit message. If you have to do this sort of thing often, and the commit messages consistently mark such commits, you can automatically skip such commits during git bisect by writing your bisect script tester to check for these marks.
[1] Technically, Git gets to compare just the hash IDs of trees and blobs, which makes this go very fast in most cases.
[2] This checking is very expensive, computationally, so Git has a number of shortcuts, and also a cutoff limit where it just gives up. You can tweak some of these settings.
[3] If some future git diff is this smart, the future Git author will have to consider whether this might break some scripts. Fortunately git diff is a porcelain command, not a plumbing command, but git diff-tree and the other plumbing commands will need new options.
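For the git bisect idea mentioned above, a tester script might look roughly like the following sketch. git bisect run treats exit code 125 as "this commit cannot be tested, skip it", 0 as good, and other codes from 1 to 127 as bad. The marker text and the unittest command are assumptions for illustration; substitute whatever your commit messages and project actually use.

```python
import subprocess
import sys

# Hypothetical marker; use whatever text your commit messages actually carry.
SKIP_MARKER = "[deliberately-broken]"


def bisect_exit_code(commit_message, tests_passed):
    """Map a commit to the exit codes that `git bisect run` understands."""
    if SKIP_MARKER in commit_message:
        return 125  # 125 tells git bisect to skip this commit
    return 0 if tests_passed else 1  # 0 = good, 1 = bad


def main():
    # Full message of the commit that bisect has currently checked out.
    message = subprocess.run(
        ["git", "log", "-1", "--format=%B"],
        capture_output=True, text=True, check=True,
    ).stdout
    if SKIP_MARKER in message:
        sys.exit(125)  # skip before spending time on tests that cannot pass
    # Assumed test command; replace it with your project's own.
    tests = subprocess.run([sys.executable, "-m", "unittest", "discover"])
    sys.exit(bisect_exit_code(message, tests.returncode == 0))
```

Saved as a script that ends with a call to main(), this could be passed to git bisect run (for example as "git bisect run python bisect_test.py", where the file name is again hypothetical).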

Using python to stash git changes, switch branches, commit files, switch back and undo stash

I am about to automate adding a large number of files to a specific branch of my git repository. I want to be certain I'm not about to cause major problems for myself.
The issue is I have a code base with which I run several hundreds of experiments. I want the results to be automatically stored to their own branch, while leaving the master branch unaffected (i.e. The master branch will NOT track experimental results). I am not as familiar with the stash command as I'd like to be, and want to be certain I'm using it correctly.
import subprocess
from git import Repo
# Stash changes and switch to result_branch
subprocess.run(["git", "stash"])
subprocess.run(["git", "checkout", "result_branch"], check=True)
#add_results_to_repo() -- Calls method that finds result files and uses Repo.git.add to add them to repo
#git_commit() -- Calls method that uses Repo.git.commit to commit branches
# Return to master branch and undo stash
subprocess.run(["git", "checkout", "master"], check=True)
subprocess.run(["git", "stash", "pop"], check=True)
I use subprocess to switch branches, because I had trouble using Repo. I use subprocess to stash, because I'm lazy and I'm familiar with subprocess.run. Perhaps:
repo.git.stash() # To create stash, and
repo.git.stash('pop') # To restore stash??
Is the approach I'm taking a valid one, or do I risk causing all sorts of repository problems for myself?

How do you perform a git command on the root commit in git using system calls in code?

I would like to be able to get the SHA of the root commit in a git repository.
The catch is that I am using a script to automate a certain git task that I need to be performed many times on various repositories.
I am using the function system(), C's standard library function for making system calls, and most languages have an equivalent.
The following process does not work with system():
1. get SHAs of all commits with system("<git command for listing SHAs here>") <– this outputs text to the command line rather than returning a list of values to the code
2. find SHA of root commit <– this cannot happen if the code cannot get a list of all commits
3. run system("<git command here> <SHA of root commit>")
It is possible the command I am looking for looks like this:
system("git checkout root");
If this is the case, what is the command? If this is not the case, what is the appropriate solution? Is there a better alternative to this that doesn't use system() (the function for executing commands in C)?

First, note that there is not necessarily a single root commit: given N ≥ 1 commits there is at least one root, but there could be more than one.
That said, each commit has a backwards link to its parent(s), unless it is a root commit, which by definition has no parent. So given any commit hash, you can find its root(s) by walking the graph backwards. If you start at all reachable commits and walk all paths, you will find all root commits.
There is a Git command that does precisely that: git rev-list. You give it some set of starting point commit specifiers, and it walks the graph. By default, it emits every commit hash ID as it comes across it, but it takes many options, including those that limit its output. For instance, it has the --min-parents and --max-parents options that tell it to emit only commits that have at least min, and at most max, parents. Hence:
git rev-list --all --max-parents=0
emits all root commits, as found from all references (--all).
[git rev-list] outputs text to the command line rather than returning a list data structure to code
It outputs text to standard output. Any sensible programming language and operating system offers a way to capture that output:
proc = subprocess.Popen(['git', 'rev-list', '--all', '--max-parents=0'],
                        stdout=subprocess.PIPE)
output = proc.stdout.read()
result = proc.wait()
for instance. (If using Python 3, note that output is made up of bytes rather than str.) You can then parse the output into a series of lines, to find the root commits. If there is more than one root, it's up to you to decide what to do about this.
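As a possible Python 3 variant of the snippet above, subprocess.run with text=True returns str directly, and the parsing step reduces to splitting on lines. The function names parse_hashes and root_commits are made up for this sketch, and repo_dir is assumed to point inside a Git repository.

```python
import subprocess


def parse_hashes(output):
    """Split rev-list output into a list of commit hashes, ignoring blank lines."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def root_commits(repo_dir="."):
    """Return the hashes of all root commits reachable from any reference."""
    result = subprocess.run(
        ["git", "rev-list", "--all", "--max-parents=0"],
        cwd=repo_dir,
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError if git fails
    )
    return parse_hashes(result.stdout)
```

The caller still has to decide what to do when the returned list has more than one entry.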
Since git rev-list is a plumbing command, its output is generally designed to be machine readable.
system("git rebase <SHA of root commit>")
It's rarely sensible to rebase a complex history, but if you have a simple history, this could be fine. Having a simple history may also guarantee you a single root commit: it could be wise to verify (using the output of git rev-list --parents, for instance) that you do in fact have a simple history.
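One way to run that verification is to parse the output of git rev-list --parents, where each line holds a commit hash followed by the hashes of its parents. The sketch below takes "simple history" to mean exactly one root and no merge commits; that definition, and the function name is_simple_history, are assumptions made for this example.

```python
def is_simple_history(parents_output):
    """Check `git rev-list --parents` output for a single root and no merges."""
    roots = 0
    merges = 0
    for line in parents_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if len(fields) == 1:
            roots += 1   # a commit listed with no parents is a root
        elif len(fields) > 2:
            merges += 1  # a commit listed with two or more parents is a merge
    return roots == 1 and merges == 0
```

The input would come from running git rev-list --parents HEAD (or --all) and capturing its standard output as text.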

Find all deleted files in a git repository along with who deleted them

I have a project under version control with Git. In this project there is a "grid" of files which are organized like
/parts
  /a
    01.src
    02.src
    ...
    90.src
  /b
    01.src
    02.src
    ...
    90.src
  /...
(It doesn't matter for the question, but maybe it helps to know that these numbered files are small excisions from a musical score.)
These numbered files are generated by a script, and one part of our work is deleting those files that are not used in the musical score.
Now I would like to retrieve information on who deleted each file (as part of our project documentation and workflow). Information retrieval is done from a Python script.
I have a working approach, but it is extremely inefficient because it calls Git as a subprocess for each file in question, which may happen well over 1,000 times.
What I can do is call the following for each file that is missing from the directory tree:
git log --pretty=format:"%an" --diff-filter=D -- FILENAME
This gives me the author name of the last and deleting commit affecting the file. This works correctly, but as said I have to spawn a new subprocess for each deleted file.
I can do the same with a for loop on the shell:
for delfile in $(git log --all --pretty=format: --name-only --diff-filter=D | sort -u); do echo $delfile: $( git log --pretty=format:"%an" --diff-filter=D -- $delfile); done
But this is really slow, which is understandable because it spawns a new git call for every single file (just as if I'd do it from Python).
So the bottom line is: is there an efficient way to ask Git about
- all files that have been deleted from the repository (possibly restricted to a subdirectory)
- along with the author name of the last commit touching each file (or actually: the author who deleted the file)?

It seems my last comment brought me on the right track myself:
git log --diff-filter=DR --pretty=format:'%an' --name-only parts
gives me the right thing:
--diff-filter filters the right commits
--pretty=format:'%an' returns only the author
--name-only returns a list of deleted files
So as a result I get something like
Author-1
deleted-file-1
deleted-file-2
Author-2
deleted-file-3
deleted-file-4
Author-1
deleted-file-5
This doesn't give me any more information on the commits, but I don't need that for my use-case. This result can easily be processed from within Python.
(For anybody else landing on this page: if you need a similar thing but also want more information on the commits, you can modify the --pretty=format:'..' option. See http://git-scm.com/book/en/Git-Basics-Viewing-the-Commit-History for a list of items that can be displayed.)
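Processing that output from Python could look like the sketch below. Author lines and file lines are hard to tell apart in the plain output, so this sketch assumes the format string is changed to '@@%an' so that every author line carries a marker. The marker, the function names, and restricting the filter to deletions (D) are all choices made for this example, not part of the command above.

```python
import subprocess

AUTHOR_PREFIX = "@@"  # assumed marker, produced by --pretty=format:@@%an


def parse_deletions(log_output):
    """Map each deleted file to the author of the most recent commit deleting it."""
    deleted_by = {}
    author = None
    for line in log_output.splitlines():
        if line.startswith(AUTHOR_PREFIX):
            author = line[len(AUTHOR_PREFIX):]
        elif line and author is not None:
            # git log lists the newest commits first, so keep the first author seen.
            deleted_by.setdefault(line, author)
    return deleted_by


def deleted_files(repo_dir, subdir="parts"):
    """Run git log once and return {deleted file: author}."""
    out = subprocess.run(
        ["git", "log", "--diff-filter=D",
         "--pretty=format:" + AUTHOR_PREFIX + "%an",
         "--name-only", "--", subdir],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    return parse_deletions(out)
```

This spawns a single git process for the whole subdirectory, instead of one per deleted file.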

How do I modify gitstats to only utilize a specified file extension for its statistics?

The website of the statistics generator in question is:
http://gitstats.sourceforge.net/
Its git repository can be cloned from:
git clone git://repo.or.cz/gitstats.git
What I want to do is something like:
./gitstats --ext=".py" /input/foo /output/bar
Failing being able to easily pass the above option without heavy modification, I'd just hard-code the file extension I want to be included.
However, I'm unsure of the relevant section of code to modify and even if I did know, I'm unsure of how to start such modifications.
It seems like it'd be rather simple but alas...

I found this question today while looking for the same thing. After reading sinelaw's answer I looked into the code and ended up forking the project.
https://github.com/ShawnMilo/GitStats
I added an "exclude_extensions" config option. It doesn't affect all parts of the output, but it's getting there.
I may end up doing a pretty extensive rewrite once I fully understand everything it's doing with the git output. The original project was started almost exactly four years ago today and there's a lot of clean-up that can be done due to many updates to the standard library and the Python language.

EDIT: apparently even the previous solution below only affects the "Files" stat page, which is not interesting. I'm trying to find something better. The line we need to fix is 254, this:
lines = getpipeoutput(['git rev-list --pretty=format:"%%at %%ai %%aN <%%aE>" %s' % getcommitrange('HEAD'), 'grep -v ^commit']).split('\n')
Previous attempt was:
Unfortunately, seems like git does not provide options for easily filtering by files in a commit (in the git log and git rev-list). This solution doesn't really filter all the statistics for certain file types (such as the statistics on tags), but does so for the part that calculates activity by number of lines changed.
So the best I could come up with is at line 499 of gitstats (the main script):
res = int(getpipeoutput(['git ls-tree -r --name-only "%s"' % rev, 'wc -l']).split('\n')[0])
You can change that by either adding a pipe into grep in the command, like this:
res = int(getpipeoutput(['git ls-tree -r --name-only "%s"' % rev, 'grep \\.py$', 'wc -l']).split('\n')[0])
OR, you could split out the 'wc -l' part, get the output of git ls-tree into a list of strings, and filter the resulting file names by using the fnmatch module (and then count the lines in each file, possibly by using 'wc -l') but that sounds like overkill for the specific problem you're trying to solve.
Still doesn't solve the problem (the rest of the stats will ignore this filter), but hopefully helpful.
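The fnmatch alternative mentioned above might look roughly like this. It is a standalone sketch rather than a patch to gitstats: it calls git directly through subprocess instead of going through gitstats' getpipeoutput helper, and the function names are made up for illustration. Like the original wc -l pipeline, it counts the file names listed by git ls-tree.

```python
import fnmatch
import subprocess


def count_matching(names, pattern="*.py"):
    """Count the file names that match a shell-style pattern."""
    return len(fnmatch.filter(names, pattern))


def count_files_at(rev, pattern="*.py", repo_dir="."):
    """Count files in the tree of `rev` whose path matches `pattern`."""
    out = subprocess.run(
        ["git", "ls-tree", "-r", "--name-only", rev],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    # One path per line; fnmatch's '*' also matches '/' so nested files count.
    return count_matching(out.splitlines(), pattern)
```

Compared with the grep pipe, the pattern becomes an ordinary function argument, which would make an --ext style option easier to wire in later.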
