Does creating a Git branch copy ALL of the source code? - python

So e.g. you are working at Google on the YouTube team and you want to modify how the search bar looks, or just want to change the font size, or work on a major project like the recommender system, etc. Does making a Git branch copy over ALL of the backend code for YouTube onto your machine? So if there are 100 engineers working from their laptops on the YouTube team, are there 100 copies of the YouTube code in circulation on their tiny laptops? Because as I understand Git, when you branch off, you create a copy of the source code, which you merge back into the production branch, which merges into the master branch.
Please correct me if I am wrong as I have only worked on MUCH smaller projects which use Git (~100 files, ~15k lines of code).
Your support will be much appreciated.
Thanks.

Creating a branch in Git copies nothing.
OK, this is a bit of an overstatement. It copies one hash ID. That is, suppose you have an existing repository with N branches. When you create a new branch, Git writes one new file holding a short hash ID (currently 40 hexadecimal characters for SHA-1, eventually 64 for SHA-256). So if your previous disk usage was 50 megabytes, your new disk usage is ... 50 megabytes.
On the other hand, cloning a repository copies everything. If the repository over on Server S is 50 megabytes, and you clone it to Laptop L, the repository on Laptop L is also 50 megabytes.1 There are ways to reduce the size of the clone (by omitting some commits), but they should be used with care. In any case, these days 50 megabytes is pretty small anyway. :-)
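For illustration, one such size-reducing option is a shallow clone, which asks the server for only the most recent commit(s) instead of the full history. A minimal sketch, driving the git CLI from Python (the repository URL is just a placeholder):
import subprocess
# Shallow clone: fetch only the most recent commit instead of the whole
# history. The URL below is a placeholder, not a real repository.
subprocess.run(
    ["git", "clone", "--depth", "1", "https://example.com/big-repo.git"],
    check=True,
)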
There's a plan in the works for Git to perform a sort of mostly-delayed cloning, where an initial clone copies some of the commits and replaces all the rest with a sort of IOU. This is not ready for production yet, though.
The way to understand all of this is that Git does not care about files, nor about branches. Git cares about commits. Commits contain files, so you get files when you get commits, and commits are identified by incomprehensible hash IDs, so we have branch names with which to find the hash IDs. But it's the commits that matter. Creating a new branch name just stores one existing commit hash ID into the new branch name. The cost of this is tiny.
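To see this concretely, here is a small sketch (using Python's subprocess module; the branch name feature-x is made up) showing that a brand-new branch name is nothing but another name for a commit you already have:
import subprocess
def git(*args):
    # Run a git command in the current repository and return its output.
    return subprocess.run(["git", *args], capture_output=True,
                          text=True, check=True).stdout.strip()
git("branch", "feature-x")                       # create the new branch name
head_commit = git("rev-parse", "HEAD")
branch_commit = git("rev-parse", "refs/heads/feature-x")
print(head_commit == branch_commit)              # True: same commit, nothing copied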
1This isn't quite guaranteed, due to the way objects stored in Git repositories get "packed". Git will run git gc, the Garbage Collector, now and then to collect and throw out rubbish and shrink the repository size, and depending on how much rubbish there is in any given repository, you might see different sizes.
There have been various bugs in which Git didn't run git gc --auto often enough (in particular, up through 2.17 git commit neglected to start an auto-gc afterward) or in which the auto-gc would never finish cleaning up (due to left-over failure log from an earlier gc, fixed in 2.12.2 and 2.13.0). In these cases a clone might wind up much smaller than the original repository.

Related

How to transfer contents from __init__.py in git (and maintain history) to another file whilst still keeping empty __init__.py

I created an import scheme that imported from __init__.py, rather than __init__.py importing from its modules.
To fix this I ran:
$ git mv package/__init__.py package/utils.py
This looked correct:
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
renamed: package/__init__.py -> package/utils.py
However if I run the following:
$ touch package/__init__.py
This is what I see:
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
modified: package/__init__.py
new file: package/utils.py
How can I get git to do the following?
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
modified: package/utils.py
new file: package/__init__.py
TL;DR
You can make two commits if you like. There's not a lot of value to that, but there is a little. Some of its value is positive and some of it is negative. It is your choice.
Long
Git has no file history. Git has commits; the commits are the history.
Commits themselves are relatively simple: each one has a full snapshot of every file, plus some metadata containing things like the name and email address of the author of the commit. The metadata of any one commit includes the raw hash ID(s) of any earlier commit(s). Most commits, called ordinary commits, have one earlier commit, and that one earlier commit also has a snapshot and metadata, which points to one more still-earlier commit, and so on. This is how the snapshots-and-metadata are the history.
With that in mind, note that git log -p or git show shows an ordinary commit by:
displaying (the interesting part(s) of) its metadata, with formatting; then
showing what changed in that commit.
In order to achieve item 2, Git actually extracts both the commit and its parent to a temporary area (in memory), and then compares the two sets of snapshot files.1 This comparison takes the form of a diff (git diff), i.e., the difference between two snapshots.
The git status command also runs git diff. In fact, it runs git diff twice, once to compare the current (aka HEAD) commit to Git's index—your proposed next commit, resulting from any git add updates—and again to compare Git's index to your working tree, in case there are things you forgot to git add. (This form of diff uses at least one snapshot that's not saved in a commit, and one of the two forms uses real files, which takes more work than using Git's shortcut hash ID tricks. But the end result is the same.)
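You can reproduce roughly what git status reports by running those two diffs yourself. Here is a sketch, again just driving the CLI from Python; it is not what git status literally executes internally:
import subprocess
def git_diff(*args):
    return subprocess.run(["git", "diff", "--name-status", *args],
                          capture_output=True, text=True, check=True).stdout
# HEAD vs index: what would be committed right now.
print("Changes to be committed:")
print(git_diff("--cached"))
# Index vs working tree: what you have not (yet) run `git add` on.
print("Changes not staged for commit:")
print(git_diff())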
When Git runs this kind of diff, it can—and now, by default, will—look for renamed files. Its method of finding these renames is imperfect, though. What it does is this:
List out all the files on the left ("before" or "old version").
List out all the files on the right ("after" or "new version").
If there is a pair of files on left and right with the same name, pair those up: they must be the same file.
Take all the left-over, unpaired names. Some of these might be renames. Check all the left-side files against all the right-side files.2 If a left-side file is "sufficiently similar" to a right-side file, pair up the best matches. (100%-identical matches go faster here in most cases, and reduce the remaining pile of unpaired names, so Git always does this first.) A toy sketch of this pairing follows the list.
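Here is that toy sketch, in Python. It is emphatically not Git's real code: Git computes similarity on its internal blob representation with many optimizations, while this sketch just uses difflib as a stand-in similarity measure, with Git's default 50% threshold:
import difflib
def pair_files(old, new, threshold=0.5):
    """Toy rename pairing. old/new map file name -> file content (str)."""
    same_name = sorted(old.keys() & new.keys())        # same name on both sides
    left_over_old = {n: c for n, c in old.items() if n not in new}
    left_over_new = {n: c for n, c in new.items() if n not in old}
    renames = []
    for old_name, old_content in sorted(left_over_old.items()):
        best_name, best_score = None, threshold
        for new_name, new_content in left_over_new.items():
            score = difflib.SequenceMatcher(None, old_content, new_content).ratio()
            if score >= best_score:                    # "sufficiently similar"
                best_name, best_score = new_name, score
        if best_name is not None:
            renames.append((old_name, best_name))
            del left_over_new[best_name]               # pair it up
    # Anything still unpaired is a pure deletion (left) or a new file (right).
    return same_name, renames, sorted(left_over_new)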
When you ran:
git mv package/__init__.py package/utils.py
the setup was perfect for Git: every other file matched 100% left and right, and the remaining list was that the left side had __init__.py and the right side had utils.py and the contents matched 100%. So that must be a rename! (In a way, these files are named package/__init__.py etc.: Git considers the whole thing, including the slashes, to be a file name. But it's shorter for me to leave out the package/, and you probably think of these as files-in-a-folder or files-in-a-directory yourself.)
As soon as you created a new file named __init__.py, however, Git now had both left and right side files named __init__.py, plus this one leftover right-side file named utils.py. So Git paired up the files with the same name and had one left over right-side-only file that cannot be paired.
If you make a new commit now, with this situation, git diff will continue to find things set up this way, at least until some mythical future Git is smart enough to notice that, even though the two files have the same name, a diff that says "rename and then create anew" is somehow superior.3
If, however, you make a commit that contains only the renaming step, and then create a new __init__.py file so that the package works right and commit that as a second commit, git log -p and git show will resume detecting the rename. The upside of doing this is that git log --follow, which goes step-by-step and works by changing the name it's looking for when it detects a rename, will work. The downside of doing this is that you will have one commit that is deliberately broken. You should probably note this in its commit message. If you have to do this sort of thing often, and the commit messages consistently mark such commits, you can automatically skip such commits during git bisect by writing your bisect script tester to check for these marks.
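In script form, the two-commit version looks roughly like this. It is a sketch that assumes you are starting from the original, un-renamed state, and it drives the git CLI from Python for consistency with the rest of this page:
import pathlib
import subprocess
def git(*args):
    subprocess.run(["git", *args], check=True)
# Commit 1: only the rename, so rename detection (and git log --follow) works.
git("mv", "package/__init__.py", "package/utils.py")
git("commit", "-m",
    "Rename package/__init__.py to package/utils.py\n\n"
    "NOTE: the package is deliberately broken in this commit; "
    "the empty __init__.py is recreated in the next one.")
# Commit 2: recreate the empty __init__.py so the package imports again.
pathlib.Path("package/__init__.py").touch()
git("add", "package/__init__.py")
git("commit", "-m", "Recreate empty package/__init__.py")
# The rename is now visible to --follow:
git("log", "--follow", "--oneline", "--", "package/utils.py")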
1Technically, Git gets to compare just the hash IDs of trees and blobs, which makes this go very fast in most cases.
2This checking is very expensive, computationally, so Git has a number of shortcuts, and also a cutoff limit where it just gives up. You can tweak some of these settings.
3If some future git diff is this smart, the future Git author will have to consider whether this might break some scripts. Fortunately git diff is a porcelain command, not a plumbing command, but git diff-tree and the other plumbing commands will need new options.

How to distinguish several small changes from daily changes within git?

I work on a major private Python project in an exploratory way for several months using Pycharm.
I use git to track the changes in that project.
I am the only person that contributes to that project.
Normally I commit changes to that project roughly once a day.
Now I want to track changes of my code every time I execute my code (the background is that I sometimes get lost which intermediate result was achieved using which version of my code).
Therefore I want to perform a git commit of all changed files at the end of the script execution.
As now every such commit just gets a 'technical' commit message, I would like to distinguish these 'technical' commits from the other commits I make roughly once a day (see above). The background is that I still want to be able to see and compare the daily differences. The technical commits might add up to several dozen per day and would hinder me from seeing the major changes over the course of time.
Which techniques does git offer to distinguish the technical commits from the daily commits I do?
Is maybe branching a valid approach for this? If yes, would I delete these branches later on? (I am a git novice)
You could use a branch for that, yes. Just use a working branch when doing your scripted autocommits, and then when you want to make a commit for the history, switch to your main branch.
To re-add the final changes as a single commit, one way would be to soft reset the history when you are done with the changes. So you would run:
git reset prev-real-commit
This jumps the history back to before your new batch of wip auto commits, but does not touch the files, so you don't lose work. Then you can make a new commit normally for the changes.
That technique also works without a branch. Using a branch might still be nice though so you can easily check what the version was before your new wip commits.
Git also has rebasing, which would allow squashing multiple commits into one and rewriting the messages. But for the workflow you describe, I think simply resetting the autocommits away and redoing a normal commit is better.
Also the suggestion to add some tag to the message of the autocommits is good.
That said, I usually just commit the checkpoints that I need in normal dev flow. It can be nicer to have a commit e.g. every hour instead of only once a day. Small atomic commits are good. You can use feature branches and on GitHub pull requests if you want to record and manage larger wholes of work.
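A minimal sketch of the autocommit-plus-reset idea described above, assuming the script runs inside the repository; the "[auto]" tag, the function names, and the way you name your last real commit are all assumptions, not a prescribed convention:
import subprocess
from datetime import datetime
def git(*args):
    subprocess.run(["git", *args], check=True)
def autocommit():
    """Call this at the end of each script run: a throwaway checkpoint."""
    git("add", "-A")
    # --allow-empty records a marker even if nothing changed; drop it if unwanted.
    git("commit", "--allow-empty", "-m",
        f"[auto] checkpoint {datetime.now():%Y-%m-%d %H:%M}")
def real_commit(message, prev_real_commit):
    """Collapse the accumulated [auto] commits into one normal commit.
    prev_real_commit is whatever names your last real commit (a hash, a tag,
    or a branch that still points at it). The mixed reset moves the branch
    pointer back without touching the working tree, so no work is lost;
    everything is then committed again as a single, properly described commit.
    """
    git("reset", prev_real_commit)
    git("add", "-A")
    git("commit", "-m", message)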
I think that even if you work on this project alone, it might still be a good idea to adopt the typical GitHub flow approach and start using branches.
The idea is that you distinguish your "technical" commits (many issued throughout the day) from your daily commits (rarely more than one) in terms of Git entities used:
your main code stays in master branch
your daily commits remain 'normal' commits going into a specific long-running branch (develop is a common name)
your 'once-a-day' commit becomes a merge commit, pushing all the changes in develop into master branch
This allows you to save the history, yet still see a clear distinction between those two types. You can opt for a 'no fast forward' approach, so that each merge commit stays clearly distinct from the 'regular' ones.
And if you actually don't want all of that history to be there (as @antont said, there might be a LOT of commits), you might consider 'squashing' those commits when either merging or rebasing, as described here.
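A sketch of that flow, using the branch names develop and master as above and driving git via subprocess like the other examples on this page:
import subprocess
def git(*args):
    subprocess.run(["git", *args], check=True)
# Throughout the day: 'technical' commits land on the long-running develop branch.
git("checkout", "develop")
git("add", "-A")
git("commit", "-m", "[auto] intermediate result")
# Once a day: bring the day's work into master as one clearly visible merge.
# --no-ff forces a real merge commit even when master could fast-forward;
# `git merge --squash develop` would instead stage the combined changes so
# they can be committed as one plain commit, without the detailed history.
git("checkout", "master")
git("merge", "--no-ff", "-m", "Daily merge of develop", "develop")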

How to merge the display of logs from several Mercurial repositories

Is there a way to merge the change logs from several different Mercurial repositories? By "merge" here I just mean integrate into a single display; this is nothing to do with merging in the source control sense.
In other words, I want to run hg log on several different repositories at once. The entries should be sorted by date regardless of which repository they're from, but be limited to the last n days (configurable), and should include entries from all branches of all the repositories. It would also be nice to filter by author and do this in a graphical client like TortoiseHg. Does anyone know of an existing tool or script that would do this? Or, failing that, a good way to access the log entries programmatically? (Mercurial is written in Python, which would be ideal, but I can't find any information on a simple API for this.)
Background: We are gradually beginning to transition from SVN to Mercurial. The old repository was not just monolithic in the sense of one server, but also in the sense that there was one huge repository for all projects (albeit with a sensible directory structure). Our new Mercurial repositories are more focused! In general, this works much better, but we miss one useful feature from SVN: being able to use svn log at the root of the repository to see everything we have been working on recently. It's very useful for filling in timesheets, giving yourself a sense of purpose, etc.
I figured out a way of doing this myself. In short, I merge all the revisions into one mega-repo, and I can then look at this in TortoiseHG. Of course, it's a total mess, but it's good enough to get a summary of what happened recently.
I do this in three steps:
(Optional) Run hg convert on each source repository using the branchmap feature to rename each branch from original to reponame/original. This makes it easier later to identify which revision came from which source repository. (More faithful to SVN would be to use the filemap feature instead.)
On a new repository, run hg pull -f to force-pull from the individual repositories into one big one. This gets all the revisions in one place, but they show up in the wrong order.
Use the method described in this answer to create yet another repository that contains all the changes from the one created in step 2 but sorted into the right order. (Actually I use a slight variant: I get the hashes and compare against the hashes in the destination, check that the destination has a prefix of the source's, and only copy the new ones across.)
This is all done from a Python script; although Mercurial is written in Python, I just use the command-line interface via the subprocess module. Running through the three steps only copies the new revisions, without rebuilding everything from scratch, unless you add a new repo.
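A rough sketch of steps 1 and 2 as such a script. The repository paths and names are placeholders, hg convert requires the convert extension to be enabled, and step 3 (the date-sorting pass) is omitted here since it follows the linked answer:
import subprocess
REPOS = {"projA": "/path/to/projA", "projB": "/path/to/projB"}  # placeholders
MERGED = "/path/to/merged"                                      # placeholder
def hg(*args):
    subprocess.run(["hg", *args], check=True)
# Step 1 (optional): rename each branch to "<repo>/<branch>" while converting,
# so a changeset's origin stays visible in the merged repository.
for name, path in REPOS.items():
    branchmap = f"{path}-branchmap.txt"
    with open(branchmap, "w") as f:
        f.write(f"default {name}/default\n")    # add one line per branch
    hg("convert", "--branchmap", branchmap, path, f"{path}-renamed")
# Step 2: force-pull everything into one big repository (revisions arrive in
# the "wrong" order; step 3 from the text re-sorts them by date).
hg("init", MERGED)
for name, path in REPOS.items():
    hg("pull", "-f", "-R", MERGED, f"{path}-renamed")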

Find all deleted files in a git repository along with who deleted them

I have a project under version control with Git. In this project there is a "grid" of files which are organized like
/parts
/a
01.src
02.src
...
90.src
/b
01.src
02.src
...
90.src
/...
(It doesn't matter for the question, but maybe it helps to know that these numbered files are small excisions from a musical score.)
These numbered files are generated by a script, and one part of our work is deleting those files that are not used in the musical score.
Now I would like to retrieve information on who deleted each file (as part of our project documentation and workflow). Information retrieval is done from a Python script.
I have a working approach, but it is extremely inefficient because it calls Git as a subprocess for each file in question, which may be well over 1,000 times.
What I can do is call the following for each file that is missing from the directory tree:
git log --pretty=format:"%an" --diff-filter=D -- FILENAME
This gives me the author name of the last and deleting commit affecting the file. This works correctly, but as said I have to spawn a new subprocess for each deleted file.
I can do the same with a for loop on the shell:
for delfile in $(git log --all --pretty=format: --name-only --diff-filter=D | sort -u); do echo $delfile: $( git log --pretty=format:"%an" --diff-filter=D -- $delfile); done
But this is really slow, which is understandable because it spawns a new git call for every single file (just as if I'd do it from Python).
So the bottom line is: Is there an efficient way to ask Git about
all files that have been deleted from the repository
(possibly restricted to a subdirectory)
along with the author name of the last commit touching each file
(or actually: The author who deleted the file)
?
It seems my last comment brought me on the right track myself:
git log --diff-filter=DR --pretty=format:'%an' --name-only parts
gives me the right thing:
--diff-filter filters the right commits
--pretty=format:'%an' returns only the author
--name-only returns a list of deleted files
So as a result I get something like
Author-1
deleted-file-1
deleted-file-2
Author-2
deleted-file-3
deleted-file-4
Author-1
deleted-file-5
This doesn't give me any more information on the commits, but I don't need that for my use-case. This result can easily be processed from within Python.
(For anybody else landing on this page: if you need a similar thing but also want more information on the commits, you can modify the --pretty=format:'..' option. See http://git-scm.com/book/en/Git-Basics-Viewing-the-Commit-History for a list of items that can be displayed.)
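For the "processed from within Python" part, here is a minimal parsing sketch. It uses a slight variation of the command above: prefixing the format with %x00 puts a NUL byte in front of each author name, so the output can be split into per-commit chunks without guessing where blank lines fall:
import subprocess
cmd = [
    "git", "log",
    "--diff-filter=DR",
    "--pretty=format:%x00%an",   # NUL marker, then the author name, per commit
    "--name-only",
    "--", "parts",               # restrict to the subdirectory, as above
]
out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
deleted_by = {}                  # file path -> author of the deleting commit
for chunk in out.split("\x00"):
    lines = [line for line in chunk.splitlines() if line.strip()]
    if not lines:
        continue
    author, files = lines[0], lines[1:]
    for path in files:
        # git log lists newest commits first, so keep the first hit per file.
        deleted_by.setdefault(path, author)
for path, author in sorted(deleted_by.items()):
    print(f"{path}: {author}")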

What is the best way to archive a data CD/DVD in python?

I have to archive a large amount of data off of CDs and DVDs, and I thought it was an interesting problem that people might have useful input on. Here's the setup:
The script will be running on multiple boxes on multiple platforms, so I thought python would be the best language to use. If the logic creates a bottleneck, any other language works.
We need to archive ~1000 CDs and ~500 DVDs, so speed is a critical issue
The data is very valuable, so verification would be useful
The discs are pretty old, so a lot of them will be hard or impossible to read
Right now, I was planning on using shutil.copytree to dump the files into a directory, and compare file trees and sizes. Maybe throw in a quick hash, although that will probably slow things down too much.
So my specific questions are:
What is the fastest way to copy files off a slow medium like CD/DVDs? (or does the method even matter)
Any suggestions of how to deal with potentially failing discs? How do you detect discs that have issues?
When you read file by file, you're seeking randomly around the disc, which is a lot slower than a bulk transfer of contiguous data. And, since the fastest CD drives are several dozen times slower than the slowest hard drives (and that's not even counting the speed hit for doing multiple reads on each bad sector for error correction), you want to get the data off the CD as soon as possible.
Also, of course, having an archive as a .iso file or similar means that, if you improve your software later, you can re-scan the filesystem without needing to dig out the CD again (which may have further degraded in storage).
Meanwhile, trying to recover damaged CDs, and damaged filesystems, is a lot more complicated than you'd expect.
So, here's what I'd do:
Block-copy the discs directly to .iso files (whether in Python, or with dd), and log all the ones that fail.
Hash the .iso files, not the filesystems. If you really need to hash the filesystems, keep in mind that the common optimization of compressing the data before hashing (that is, tar czf - | shasum instead of just tar cf - | shasum) usually slows things down, even for easily-compressible data, but you might as well test it both ways on a couple of discs. If you need your verification to be legally useful you may have to use a timestamped signature provided by an online service instead, in which case compressing probably will be worthwhile.
For each successful .iso file, mount it and use basic file copy operations (whether in Python, or with standard Unix tools), and again log all the ones that fail.
Get a free or commercial CD recovery tool like IsoBuster (not an endorsement, just the first one that came up in a search, although I have used it successfully before) and use it to manually recover all of the damaged discs.
You can do a lot of this work in parallel—when each block copy finishes, kick off the filesystem dump in the background while you're block-copying the next drive.
Finally, if you've got 1500 discs to recover, you might want to invest in a DVD jukebox or auto-loader. I'm guessing new ones are still pretty expensive, but there must be people out there selling older ones for a lot cheaper. (From a quick search online, the first thing that came up was $2500 new and $240 used…)
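A minimal sketch of the block-copy-and-hash step in Python. The device path and output name are assumptions, and reading a raw optical device usually needs the right permissions:
import hashlib
DEVICE = "/dev/sr0"        # assumed optical drive device; adjust as needed
OUTPUT = "disc0001.iso"    # assumed output file name
BLOCK = 2048 * 1024        # large reads; CD-ROM sectors are 2048 bytes
sha = hashlib.sha256()
try:
    with open(DEVICE, "rb") as src, open(OUTPUT, "wb") as dst:
        while True:
            chunk = src.read(BLOCK)
            if not chunk:
                break
            dst.write(chunk)
            sha.update(chunk)
except OSError as exc:
    # A read error usually means a damaged disc; log it for the recovery pass.
    print(f"FAILED {DEVICE}: {exc}")
else:
    print(f"{OUTPUT}  sha256={sha.hexdigest()}")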
Writing your own backup system is not fun. Have you considered looking at ready-to-use backup solutions? There are plenty, many free ones...
If you are still bound to write your own... Answering your specific questions:
With CD/DVD you typically first have to master the image (using a tool like mkisofs), then write the image to the medium. There are tools that wrap both operations for you (genisofs, I believe) but this is typically the process.
To verify the backup quality, you'll have to read back all written files (by mounting a newly written CD) and compare their checksums against those of the original files. In order to do incremental backups, you'll have to keep archives of checksums for each file you save (with backup date etc).
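For the read-back verification, a sketch along these lines could work. The paths are assumptions; it walks the original tree and compares each file's checksum against the copy on the newly written, re-mounted disc:
import hashlib
import os
def checksum(path, block=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(block):
            h.update(chunk)
    return h.hexdigest()
def verify(original_root, mounted_root):
    """Return the relative paths whose copies are missing or differ."""
    mismatches = []
    for dirpath, _dirs, files in os.walk(original_root):
        for name in files:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, original_root)
            dst = os.path.join(mounted_root, rel)
            if not os.path.isfile(dst) or checksum(src) != checksum(dst):
                mismatches.append(rel)
    return mismatches
# Example (paths are placeholders):
# bad = verify("/data/to-burn/batch-01", "/mnt/cdrom")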
