Cleanup Huge Git Repository - python

My company has a single git repository that is over 15 years old and really massive; about 60% of it could probably be archived. I want to identify the frequently used scripts (Python, Perl, Ruby, Java, etc.) and create a new git repository containing only those. The scripts also have cross-dependencies.
One solution I thought of was to set up inotify to watch the files in the repo, collect the names of recently accessed scripts over a few months, and then create the new repo based on that data. I'm not sure how efficient that would be, though.
Another idea was to use the last git commit date for each file and remove files that haven't been touched in over 5 years (sketched below).
Could anyone suggest an efficient way to clean up this mess? Or any tool similar to NewRelic that would monitor the filesystem?
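As a sketch of that second idea (assuming GNU date; the five-year cutoff is just the one mentioned above), this lists every tracked file whose most recent commit is older than the cutoff:

    set -euo pipefail

    cutoff_epoch=$(date -d '5 years ago' +%s)   # GNU date; adjust for BSD/macOS

    git ls-files -z | while IFS= read -r -d '' file; do
        # %ct = committer date (Unix timestamp) of the last commit touching this path
        last_epoch=$(git log -1 --format=%ct -- "$file")
        if [ -n "$last_epoch" ] && [ "$last_epoch" -lt "$cutoff_epoch" ]; then
            printf '%s\t%s\n' "$(date -d "@$last_epoch" +%F)" "$file"
        fi
    done | sort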

First, it's not clear what problem you are trying to solve. Is the 15-year git history slowing things down when cloning? If so, maybe just do a shallow git clone instead (e.g. git clone --depth 1), since a shallow clone doesn't download the full history.
As Thilo pointed out, cutting the repo in half isn't going to make things that much faster.
But if the scripts are really that disorganized, it's highly likely that some of them need to be rewritten, documented, etc. If you just move the scripts forward, you are probably moving lots of inefficiencies forward too. I'd pick them off one at a time, give them a little love, test them, etc.
One idea: You can use strace -ff -o strace.out ./myscript to figure out what other files a script opens.
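For instance, a rough way to turn those traces into a list of files a script actually touches (./myscript and the strace.out prefix are just the placeholders from above):

    # Trace the script and all of its child processes; one output file per PID.
    strace -ff -e trace=open,openat -o strace.out ./myscript

    # Pull the successfully opened paths out of the trace files
    # (failed opens return -1 and are filtered out).
    grep -h -E 'open(at)?\(' strace.out.* \
        | grep -v ' = -1 ' \
        | sed -E 's/.*"([^"]+)".*/\1/' \
        | sort -u > files-used.txt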

Related

Is there a way with Git or Python to see all the dependencies between the selected files?

The goal is to PR a lot of code into a beta branch of our application on github. However, I'm having a hard time figuring out the best way to do this.
Say I have a function or process called X, and in order for X to be functional it needs all of these files:
hello.py
x.py
test.py
run.py
example.py
whatever.py
The above are just random example files I made up, but imagine a process or function requiring many files to work.
So obviously, my intuition is just to remove some files while keeping the ones that seem to belong together, then run a test script and see if I get any import errors.
I'm strictly adding and removing files; I'm not editing any code.
Currently I'm using git checkout, git diff, and git status to do this work, choosing files based on what I think makes sense, but this feels inefficient.
I'll then merge the branches into beta and continue forward from there.
Can anyone recommend Python or git methods, other techniques, or a mindset that would help me untangle this web of code into good bite-sized pieces?
Is there a way with Git or Python to see all the dependencies between the selected files shown above?
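A rough first pass I can think of, assuming the selected files are flat modules sitting in one directory, is to list each file's imports and see which of them refer to the other selected files (Python's standard-library modulefinder could do a more thorough job by actually resolving the imports):

    # List the modules each selected file imports (first token after "import"/"from").
    files="hello.py x.py test.py run.py example.py whatever.py"

    for f in $files; do
        echo "== $f imports:"
        grep -E '^(import|from)[[:space:]]' "$f" | awk '{print $2}' | sort -u
    done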

Rename folder in git without changing the contributors

I have a problem: we are using a package that has not been maintained for a while now, so we forked it in order to maintain it ourselves. The package already exists; let's say it is named package_a. Most of the code and the __init__ are in the package_a/ folder.
Now we want to make our own package that will include our maintained code, and we want to name it package_b. So far so good, but the problem is that package_b wants to have the code and the __init__ in a package_b/ folder, and GitHub changes the contributions for all files when a folder is renamed. I would like the credit for contributions to stay where it is due; the 10k+ lines of code didn't just appear in my local repo out of thin air. Any suggestions for how we can have a package named package_b but keep the code in the original package_a/ folder?
I am thinking along the lines of some clever way of importing package_a inside package_b, or something similar, but I am hoping for a definitive answer.
Instead of copying the code or trying to import A into B, extract the common code into a 3rd package which both A and B import. Or perhaps a subclass. This doesn't solve your contribution problem, but it does avoid making a big maintenance hassle by copying and pasting 10,000 lines of code.
Git doesn't record copies and renames, but it can recognize when they happen. To give Git the best chance of recognizing a copy, do only the copy in its own commit. Make no changes to the content. Then in a second commit make any necessary changes to the copied code.
In normal Git you can nudge git log and git blame to honor copies and renames with -C. Git doesn't do this by default because it's more expensive.
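For example (the file name here is only a placeholder):

    # Follow a single file's history across the copy/rename:
    git log --follow --oneline -- package_b/core.py

    # Ask blame to detect lines copied or moved from other files;
    # -C can be repeated (up to three times) to search harder, at extra cost:
    git blame -C -C package_b/core.py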
Github will do what Github will do.
Regardless of who Github says wrote which line, their contributions will still be in the project history. That's how it goes: you make your contribution and then others put their own work on top of it. This is normal. Their contributions remain in the history.
"History shear" is also normal; that's when a change touches many lines but is otherwise insignificant. For example, restyling the code would cause a history shear, and git blame would then report the restyling commit as the last one to touch the code. git blame -w mitigates this somewhat, and Github has an "ignore whitespace" option. History shear is normal, and so is learning to skip over it.
The tools work for us. Don't bend yourself for the benefit of the tools.
If you want to give a special shout-out to your contributors, add a contributors section to your README.md.

How to organize and categorize small projects in GitHub?

I'm new to development and trying to upload small projects that I've worked on to my GitHub profile. These projects are not dependent on each other.
My issue is that some of them are small single-file projects. Sort of like mini challenges that I've solved. So I'm thinking of grouping them together under one repo called "Python programming", for example.
Is this a good practice?
If yes, how should I go about it in Git, and
how can I still have a README file showing up for each mini project?
If no, what would you recommend doing?
GitHub will render a README file for every folder you visit, so when using a single repository, one solution would be to create one subfolder per "subproject", each of which can have its own README file.
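For example, a layout along these lines (all names made up) gives every mini project its own README that GitHub renders when you browse into its folder:

    python-programming/
        README.md            <- overview of the whole collection
        fizzbuzz/
            README.md        <- describes this specific challenge
            fizzbuzz.py
        prime-sieve/
            README.md
            sieve.py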
But before going that route, you should think about whether those small projects actually belong together. That's ultimately what should decide whether you put them all in the same repository or split them up into several repositories.
Some things to consider for that decision:
If the projects do not depend on one another, do they still relate to one another? For example, are those projects part of a bigger programming challenge like Project Euler, and you're just collecting all your solutions? Then a single repository might make more sense.
What is the chance for individual projects to grow into bigger things? Many things start very small but can eventually grow into real things that justify their own repository. At that point, you might even get others to contribute.
Does it make sense for those individual files to share a history? Are the files even going to be edited once they are “done”? I.e. is this just a collection of finished things, or are they actually ongoing experiments?
Ultimately, it comes down to your personal choice. But GitHub, as the repository hosting service, should not be driving your decision. You should create Git repositories locally as it makes sense to you. If that means you have just a single one, that's fine. If that means you create lots of them, that's also fine.
Unfortunately, the GitHub UI is not really made for small one-off projects; the repository list is just too unorganized for that. If you decide to create many small repositories, I advise adding some prefix for categorization within your GitHub profile, so you know what each one is about.
A good alternative for one-off projects, especially when it's just a single file (or a few files), is a Gist. Gists started as a way to share code snippets, but under the hood every Gist is actually a full Git repository. Of course, Gists do not offer the tools normal repositories on GitHub have (e.g. issues, pull requests, wikis), but for what you describe you probably don't need those. Gists are a fine way to share simple things without adding full repositories to your profile. And you can still clone them (the remote URL is git@gist.github.com:<gist-id>.git) and have full history and support for multiple files if you need them.
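For example (keep <gist-id> as the hexadecimal ID shown in the Gist's URL):

    # Clone a Gist over SSH, just like a normal repository:
    git clone git@gist.github.com:<gist-id>.git my-snippets

    # Or over HTTPS:
    git clone https://gist.github.com/<gist-id>.git my-snippets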
Commonly, you'll see that the top level of the repo contains the README file, maybe a setup.py, some other ancillary files, and perhaps a tests folder. Then there will be a folder that shares a name with the repo; inside that folder is the code that's intended to be the core content of the module/package/script.
It's also not unusual to see different organization, particularly with very small projects of single-file scripts.
For the specific case you mention, do whatever you like. What you propose sounds totally reasonable to me. I would not want to have a separate repo for all the challenges I solve!
I usually use a gist for trivial items I don't necessarily want to make a repo for, including coding challenges. So I would offer that as an alternative. Do whatever suits you best.

Script to install and compile Python, Django, Virtualenv, Mercurial, Git, LessCSS, etc... on Dreamhost

The Story
After cleaning up my Dreamhost shared server's home folder from all the cruft accumulated over time, I decided to start afresh and compile/reinstall Python.
All tutorials and snippets I found seemed overly simplistic, assuming (or ignoring) a bunch of dependencies needed by Python to compile all modules correctly. So, starting from http://andrew.io/weblog/2010/02/installing-python-2-6-virtualenv-and-virtualenvwrapper-on-dreamhost/ (so far the best guide I found), I decided to write a set-and-forget Bash script to automate this painful process, including along the way a bunch of other things I am planning to use.
The Script
I am hosting the script on http://bitbucket.org/tmslnz/python-dreamhost-batch/src/
The TODOs
So far it runs fine, and does all it needs to do in about 900 seconds, giving me at the end of the process a fully functional Python / Mercurial / etc... setup without even needing to log out and back in.
I thought this might be of use for others too, but there are a few things I think it's missing, and I am not quite sure how to go about them, what the best way to do them is, or whether they make sense at all.
Check for errors and break
Check for minor version bumps of the packages and give warnings
Check for known dependencies
Use arguments to install only some of the packages instead of commenting out lines
Organise the code in a manner that's easy to update
Optionally make the installers and compiling silent, with error logging to file
Fail-proof .bashrc modification, to prevent breaking ssh logins and having to log back in via FTP to fix it (a rough starting point for this and for the error checking is sketched below)
EDIT: The implied question is: can anyone, more bashful than me, offer general advice on the worthiness of the above points or highlight any problems they see with this approach? (see my answer to Ry4an's comment below)
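As a rough starting point for the error-checking and .bashrc items above (the marker comments and the $HOME/opt/bin path are made up for illustration):

    # Stop at the first failing command, treat unset variables as errors,
    # and make failures inside pipelines fatal too.
    set -euo pipefail
    trap 'echo "install script failed at line $LINENO" >&2' ERR

    # Only modify .bashrc if our block is not already there, and keep a backup
    # so a bad edit cannot lock you out of ssh logins.
    if ! grep -q '# >>> dreamhost-batch >>>' "$HOME/.bashrc"; then
        cp "$HOME/.bashrc" "$HOME/.bashrc.bak"
        {
            echo '# >>> dreamhost-batch >>>'
            echo 'export PATH="$HOME/opt/bin:$PATH"'
            echo '# <<< dreamhost-batch <<<'
        } >> "$HOME/.bashrc"
    fi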
The Gist
I am no UNIX or Bash or compiler expert, and this has been built iteratively, by trial and error. It is, in a very small way, heading in the direction of apt-get (well, 1% of it...), but since Dreamhost and others obviously cannot give root access on shared servers, this looks to me like a potentially very useful workaround, particularly with some community work involved.
One way to streamline this would be to make it work with one of: capistrano/fabric, puppet/chef, jhbuild, or buildout+minitage (and a lot of cmmi — configure/make/make install — tasks). There are some opportunities for factoring out common code, especially with something more high-level than bash. You will run into bootstrapping issues, however, so maybe leave well enough alone.
If you want to look into userland package managers, there are autopackage (bootstraps well), nix (see its quickstart), and stow (simple, but helps with isolation).
Honestly, I would just build packages with a name prefix for all of the pieces and have them install under /opt so that they're out of the way. That way it only takes the download time and a bit of install time to do.
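For example, the classic configure/make/make install dance with a private prefix (the version number is only illustrative; on a shared host without root you would point --prefix somewhere under $HOME instead of /opt):

    # Build Python from source into its own prefix, out of the way of the system copy:
    wget https://www.python.org/ftp/python/2.7.18/Python-2.7.18.tgz
    tar xzf Python-2.7.18.tgz
    cd Python-2.7.18
    ./configure --prefix=/opt/python-2.7
    make
    make install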

Organizing Python projects with shared packages

What is the best way to organize and develop a project composed of many small scripts sharing one (or more) larger Python libraries?
We have a bunch of programs in our repository that all use the same libraries stored in the same repository. So in other words, a layout like
trunk/
    libs/
        python/
        utilities/
    projects/
        projA/
        projB/
When the official runs of our programs are done, we want to record what version of the code was used. For our C++ executables, things are simple because as long as the working copy is clean at compile time, everything is fine. (And since we get the version number programmatically, it must be a working copy, not an export.) For Python scripts, things are more complicated.
The problem is that often one project (e.g. projA) will be running while projB needs to be updated. This could cause the working-copy revision to appear mixed to projA at runtime. (The code takes hours to run, and its outputs can be used as inputs for processes that take days to run, hence the strong traceability goal.)
My current workaround is, if necessary, to check out another copy of the trunk to a different location and run from there. But then I need to remember to change my PYTHONPATH to point to the second copy of lib/python, not the one in the first tree.
There's not likely to be a perfect answer. But there must be a better way.
Should we be using subversion keywords to store the revision number, which would allow the data user to export files? Should we be using virtualenv? Should we be going more towards a packaging and installation mechanism? Setuptools is the standard, but I've read mixed things about it, and it seems designed for non-developer end users (of which we have none).
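Since this is a Subversion working copy, one low-tech step toward the traceability goal is to have each official run record the revision it ran from; svnversion also flags mixed or modified working copies. The paths and log file name below are placeholders:

    # svnversion prints e.g. "4168" for a clean checkout, "4123:4168" for a
    # mixed-revision working copy, and appends "M" if there are local edits.
    rev=$(svnversion /path/to/trunk)

    case "$rev" in
        *:*|*M*|*S*) echo "WARNING: mixed, modified, or switched working copy: $rev" >&2 ;;
    esac

    echo "code revision: $rev" >> run_metadata.log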
A much better solution is to not store all your projects and their shared dependencies in the same repository.
Use one repository for each project, and externals for the shared libraries.
Make use of tags in the shared library repositories, so consumer projects may use exactly the version they need in their external.
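For example (the repository URL and tag name are invented), a consumer project can pin the shared library to a tag via svn:externals:

    # Run from the root of projA's working copy:
    svn propset svn:externals 'libs/python https://svn.example.com/libs/tags/1.4.0/python' .
    svn commit -m "Pin libs/python to the 1.4.0 tag"
    svn update    # pulls the tagged library into libs/python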
Edit: (just copying this from my comment) use virtualenv if you need to provide isolated runtime environments for the different apps on the same server. Then each environment can contain a unique version of the library it needs.
If I'm understanding your question properly, then you definitely want virtualenv. Add in some virtualenvwrapper goodness to make it that much better.
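A minimal sketch of that setup (the environment location, the packaged library path, and main.py are placeholders; with virtualenvwrapper the first line becomes mkvirtualenv projA):

    # One isolated environment per project:
    virtualenv ~/envs/projA

    # Install the shared library into it (assumes the library has a setup.py):
    ~/envs/projA/bin/pip install /path/to/libs/python

    # Run projA against exactly the library version installed in its environment:
    ~/envs/projA/bin/python projects/projA/main.py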
