Rename folder in git without changing the contributors - python

I have a problem: we are using a package that has not been maintained for a while now, so we forked it in order to maintain it ourselves. The package already exists; let's say it is named package_a. Most of the code and the __init__.py are in the package_a/ folder.
Now we want to make our own package that will include our maintained code, and we want to name it package_b. So far so good, but the problem is that package_b wants to have the code and the __init__.py in a package_b/ folder, and GitHub changes the contributions for all files when a folder is renamed. I would like the credit for contributions to stay where it is due; the 10k+ lines of code didn't just appear in my local repo out of thin air. Any suggestions on how we can have a package named package_b but keep the code in the original package_a/ folder?
I am thinking along the lines of some clever way of importing package_a into package_b, but I am hoping for a definitive answer.

Instead of copying the code or trying to import A into B, extract the common code into a 3rd package which both A and B import. Or perhaps a subclass. This doesn't solve your contribution problem, but it does avoid making a big maintenance hassle by copying and pasting 10,000 lines of code.
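As a rough sketch of that extraction (all names here are illustrative, not from the original packages): the shared implementation moves into a third package, and both A and B simply re-export it.

# package_common/core.py holds the shared, maintained implementation

# package_a/__init__.py
from package_common.core import *

# package_b/__init__.py
from package_common.core import *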
Git doesn't record copies and renames, but it can recognize when they happen. To give Git the best chance of recognizing a copy, do only the copy in its own commit. Make no changes to the content. Then in a second commit make any necessary changes to the copied code.
In normal Git you can nudge git log and git blame to honor copies and renames with -C. Git doesn't do this by default because it's more expensive.
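For example (file paths illustrative):

# Commit 1: the pure copy, with no content changes
cp -r package_a package_b
git add package_b
git commit -m "Copy package_a to package_b verbatim"

# Commit 2: whatever adaptations the copy needs
git commit -am "Adapt package_b after the copy"

# Nudge blame to detect copies; repeating -C increases the search effort
git blame -C -C package_b/somefile.py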
GitHub will do what GitHub will do.
Regardless of who GitHub says wrote what line, their contributions will still be in the project history. That's how it goes: you make your contribution, and then others put their own work on top of it. This is normal. Their contributions remain in the history.
"History sheer" is also normal, that's when a change touches many lines but is otherwise insignificant. For example, if you were to restyle the code that would cause a history sheer. git blame will say that was the last commit to touch the code. git blame -w mitigates this somewhat, and Github has an "ignore whitespace" option. History sheer is normal and so is learning to skip over it.
The tools work for us. Don't bend yourself for the benefit of the tools.
If you want to give a special shout-out to your contributors, add a contributors section to your README.md.

Related

Removing pycache in git

How can I remove existing and future __pycache__ files from a Git repository on Windows? The commands I found online are not working; for example, when I run the command "git rm -r --cached __pycache__" I get the error "pathspec '__pycache__' did not match any files".
The __pycache__ folders that you are seeing are not in your current Git commits, and won't be in future ones. Because of the way Git works internally (which Git forces you to know, at least if you're going to understand it), this is a bit tricky, even once we get past the "directory / folder confusion" we saw in your comments.
The right place to start, I believe, is at the top. Git isn't about files (or even files-and-folders / files-and-directories). Those new to Git see it as storing files, so they think it's about files, but that's just not true. Or, they note the importance of the ideas behind branches, and think that Git is about branches, and that too is not really true, because people confuse one kind of "branch" (that does matter) with branch names (which don't matter). The first thing to know, then, is that Git is really all about commits.
This means that you really need to know:
what a commit is, and
what a commit does for you
(these two overlap but are not identical). We won't really cover what a commit is here, for space reasons, but let's look at the main thing that one does for you: Each commit stores a full snapshot of every file.
We now need a small digression into files and folders and how Git and your OS differ in terms of how they organize files. Your computer insists that a file has a name like file.ext and lives in a folder or directory—the two terms are interchangeable—such as to, which in turn lives in another folder such as path. This produces path/to/file.ext or, on Windows, path\to\file.ext.
Git, by contrast, has only files, and their names always use forward slashes and include the slashes. The file named path/to/file.ext is literally just the file, with that name. But Git does understand that your computer demands the file-in-folder format, and will convert back and forth as needed. If Git needs to extract a file whose name is some/long/file/name.ext, Git will create folders some, some/long, and so on when it must, all automatically.
The strange side effect of this is that because Git stores only the files, not the folders, Git is unable to store an empty folder. This distinction actually occurs in Git's index aka staging area, which we won't get into in any detail, but it explains the problem whose answers are given in How do I add an empty directory to a Git repository?
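(A common workaround, not a Git feature per se, shown here with an illustrative folder name: commit a placeholder file so the folder is never actually empty.)

mkdir logs
touch logs/.gitkeep   # the name .gitkeep is only a convention; any tracked file works
git add logs/.gitkeep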
In any case, commits in Git store files, using these path names. Each commit has a full copy of every file—but the files' contents are stored in a special, Git-ized, read-only, Git-only format in which the contents are de-duplicated. So if a million commits store one particular version of one particular file, there's really only one copy, shared between all million commits. Git can do this kind of sharing because, unlike regular files on your computer, files stored in a commit, in Git, literally can't be changed.
Going back to the commits now: each commit contains a full snapshot of every file (that it had when you, or whoever, made the commit). But these files are read-only—they literally can't have their contents replaced, which is what enables that sharing—and only Git itself can even read them. This makes them useless for actually getting any work done. They're fine as archives, but no good for real work.
The solution to this problem is simple (and the same as in almost all other version control systems): when you select some commit to work on / with, Git will extract the files from that commit. This creates ordinary files, in ordinary folders, in an ordinary area in which you can do your work (whether that's ordinary or substandard or exemplary work—that's all up to you, not to Git 😀). What this means is that you do your work in a working tree or work-tree (Git uses these two terms interchangeably). More importantly, it means this: The files you see and work on / with are not in Git. They may have just been extracted by Git, from some commit. But now they're ordinary files and you use them without Git being aware of what you're doing.
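For example:

git switch main          # extract the files of main's tip commit into the working tree
git checkout 1f2e3d4     # or detach at a specific commit (hash illustrative)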
Since Git has extracted these files into ordinary folders, you can create new files and/or new folders if you like. When you run Python programs, Python itself will, at various times, create __pycache__ folders and stuff *.pyc and/or *.pyo files into them. Python does this without Git's knowledge or understanding.
Because these files are generated by Python, based on your source, and just used to speed up Python, it's a good idea to avoid putting them into the commits. There's no need to save a permanent snapshot of these files, especially since the format and contents may depend on the specific Python version (e.g., Python 3.7 generates *.cpython-37.pyc files, Python 3.9 generates *.cpython-39.pyc files, and so on). So we tell Git two things:
Don't complain about the existence of these particular untracked files in the working tree.
When I use an en-masse "add everything" operation like git add ., don't add these files to the index / staging-area, so that they won't go into the next commit either.
We generally do this with the (poorly named) .gitignore file. Listing a file name in a .gitignore does not make Git ignore it; instead, it has the effect of doing the two things I listed here.
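For Python's bytecode caches, typical .gitignore entries (as in GitHub's stock Python template) look like this:

__pycache__/
*.py[cod]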
This uses the Git-specific term untracked file, which has a simple definition that has a complex back-story. An untracked file is simply any file in your working tree that is not currently in Git's index (staging area). Since we're not going to get into a discussion of Git's index here, we have to stop there for now, but the general idea is that we don't allow the __pycache__ files to get into the index, which keeps them untracked, which keeps Git from committing them, which keeps them from getting into Git's index. It's all a bit circular here, and if you accidentally do get these files into Git's index, that's when you need the git rm -r --cached __pycache__ command.
Since that command is failing, it means you don't have the problem this command is meant to solve. That's good!
Well, you don't need __pycache__ files in your Git repositories, and it's better to ignore them by adding __pycache__/ to your .gitignore file.

Is there a way with Git or Python to see all the dependencies between the selected files?

The goal is to PR a lot of code into a beta branch of our application on github. However, I'm having a hard time figuring out the best way to do this.
Say I have a function or process called X, and in order for X to be functional it needs all of these files:
hello.py
x.py
test.py
run.py
example.py
whatever.py
The above are just random example files I made up, but imagine a process or function requiring many files to work.
So obviously, my intuition is just to remove some files while keeping together the ones that make the most sense as a group. I then run a test script and see if I get any import errors.
I'm strictly just adding and removing files, I'm not editing any code.
Currently I'm using git checkout, git diff, and git status to do this work and choosing files based on what I think makes sense, but I feel as if this is inefficient.
I'll then merge the branches with beta and continue forward.
Can anyone recommend Python or Git methods, other techniques, or a mindset that would help me untangle this web of code into good bite-size pieces?
Is there a way with Git or Python to see all the dependencies between the selected files shown above?
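The manual check described above ("run a test script and see if I get any import errors") can be automated with the standard library's modulefinder. A minimal sketch, using the example file names from the question:

from modulefinder import ModuleFinder
from pathlib import Path

here = Path.cwd()

for script in ["hello.py", "x.py", "run.py"]:
    finder = ModuleFinder()
    finder.run_script(script)
    # Keep only modules whose source file lives inside this project directory
    deps = sorted(
        name
        for name, mod in finder.modules.items()
        if mod.__file__ and here in Path(mod.__file__).resolve().parents
    )
    print(f"{script} -> {deps}")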

How to organize and categorize small projects in GitHub?

I'm new to development and trying to upload small projects that I've worked on to my GitHub profile. These projects are not dependent on each other.
My issue is that some of them are small single-file projects. Sort of like mini challenges that I've solved. So I'm thinking of grouping them together under one repo called "Python programming", for example.
Is this a good practice?
If yes, how should I go about it in Git, and
how can I still have a README file showing up for each mini project?
If no, what would you recommend doing?
GitHub will render a README file for every folder you visit, so when using just a single repository, one solution would be to create one subfolder per subproject, each of which can then have its own README file.
But before going that route, you should think about if those small projects actually belong together. That’s ultimately what should decide whether you want to put them all in the same repository or whether you want to split it up into several repositories.
Some things to consider for that decision:
If the projects do not depend on one another, do they still relate to one another? For example, are those projects part of a bigger programming challenge like Project Euler, and you're just collecting all your solutions? Then a single repository might make more sense.
What is the chance for individual projects to grow into bigger things? Many things start very small but can eventually grow into real things that justify their own repository. At that point, you might even get others to contribute.
Does it make sense for those individual files to share a history? Are the files even going to be edited once they are “done”? I.e. is this just a collection of finished things, or are they actually ongoing experiments?
Ultimately, it comes down to your personal choice. But GitHub, as the repository host, should not be driving your decision. You should create Git repositories locally as makes sense to you. If that means you just have a single one, that's fine. If that means you create lots of them, that's also fine.
Unfortunately, the GitHub UI is not really made for small one-off projects; the repository list is just too unorganized for that. If you decide to keep many small repositories anyway, I advise you to add a prefix for categorization within your GitHub profile, so you know what each one is about.
A good alternative for one-off projects, especially when it's just a single file (or a few), is a Gist. Gists were born as a way to share code snippets, but under the hood every Gist is actually a full Git repository. Of course, Gists do not offer the tools normal repositories on GitHub have (e.g. issues, pull requests, wikis), but for what you describe, you probably need none of those. Gists are a fine way to share simple things without adding full repositories to your profile. And you can still clone them (the remote URL is git@gist.github.com:<gist-id>.git) and have a full history and support for multiple files if you need those.
Commonly, you'll see that the top level of the repo contains the README file, maybe a setup.py and some other extraneous information, and perhaps a tests folder. Then there will be a folder that shares a name with the repo. Inside of that folder is the code that's intended to be core content of the module/package/script.
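For example, a common layout (all names illustrative):

myproject/              # repository root, named after the project
    README.md
    setup.py            # packaging metadata
    tests/
    myproject/          # folder sharing the repo's name: the core package code
        __init__.py
        core.py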
It's also not unusual to see different organization, particularly with very small projects of single-file scripts.
For the specific case you mention, do whatever you like. What you propose sounds totally reasonable to me. I would not want to have a separate repo for all the challenges I solve!
I usually use a gist for trivial items I don't necessarily want to make a repo for, including coding challenges. So I would offer that as an alternative. Do whatever suits you best.

Cleanup Huge Git Repository

My company has a single Git repository which is over 15 years old and really massive; about 60% of it could be archived. I want to find the frequently used scripts (Python, Perl, Ruby, Java, etc.) and create a new Git repository with only those. The scripts also have cross dependencies.
One solution I thought of was to set up inotify to watch the files in the Git repo, collect the names of recently accessed scripts over a few months, and then create the new repo based on that data. I'm not sure how efficient that would be, though.
Another solution I thought of was to use the last Git commit date of each file and remove files that haven't been touched in over 5 years.
Could anyone suggest an efficient way to clean up this mess? Or any tool, similar to NewRelic, that would monitor the filesystem?
First, it's not clear what problem you are trying to solve. Is the 15-year Git history slowing things down when cloning? If so, maybe just do a shallow clone instead? (A shallow clone doesn't download the history.)
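For example:

git clone --depth 1 https://example.com/big-repo.git   # URL illustrative; fetches only the latest snapshot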
As Thilo pointed out, cutting the repo in half isn't going to make things that much faster.
But if the scripts are really that disorganized, it's highly likely that some of them need to be rewritten, documented, etc. If you just move the scripts forward, you are likely moving lots of inefficiencies forward too. I'd pick them off one at a time, give each a little love, test it, etc.
One idea: You can use strace -ff -o strace.out ./myscript to figure out what other files a script opens.
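A sketch of that (script name from the example above), narrowing the trace to file-opening syscalls and collecting the unique paths:

strace -ff -e trace=open,openat -o strace.out ./myscript
grep -ho '"[^"]*"' strace.out.* | sort -u   # unique file paths the script touched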

How to deal with third party Python imports when sharing a script with other people on Git?

I made a Python module (https://github.com/Yannbane/Tick.py) and a Python program (https://github.com/Yannbane/Flatland.py). The program imports the module, and without it, it cannot work. I intended for people to download both of these files before they can run the program, but I am a bit concerned about this.
In the program, I've added these lines:
import sys
sys.path.append("/home/bane/Tick.py")
import tick
"/home/bane/Tick.py" is the path to my local repo of the module that needs to be included, but this will obviously be different to other people! How can I solve this situation better?
What @Lattyware suggested is a viable option. However, it's not uncommon to have core dependencies bundled with the main program (Django and PyDev do this, for example). This works fine, especially if the main code is tweaked against a specific version of the library.
In order to avoid the troubles mentioned by Lattyware when it comes to code maintenance, you should look into git submodules, which allow precisely this kind of layout, keeping code versioning sane.
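A minimal sketch of that layout (the vendor/tick path is just one choice):

git submodule add https://github.com/Yannbane/Tick.py vendor/tick
git commit -m "Vendor tick as a submodule"

# collaborators then clone with the submodule included:
git clone --recurse-submodules <main-repo-url>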
From the structure of your directory it seems that both files live in the same directory. This might be a tell-tale sign that they are two modules of the same package. In that case you should simply add an empty file called __init__.py to the directory, and then your import could work via:
import bane.tick
or
from bane import tick
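That is, assuming the containing directory is named bane (matching the imports above; file names illustrative):

bane/
    __init__.py   # empty; marks the directory as a package
    tick.py
    flatland.py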
Oh, and yes... you should use lower case for module names (it's worth taking an in-depth look at PEP 8 if you are going to code in Python! :)
HTH!
You might want to try submitting your module to the Python Package Index; that way people can easily install it (pip install tick) into their path, and you can just import it without having to add it to the Python path.
Otherwise, I would suggest simply telling people to download the module as well, and place it in a subdirectory of the program. If you really feel that is too much effort, you could place a copy of the module into the repository for the program (of course, that means ensuring you keep both versions up-to-date, which is a bit of a pain, although I imagine it may be possible just to use a symlink).
It's also worth noting that your repo name is a bit misleading. Capitalisation is often important, so you might want to call the repo tick.py to match the module and Python naming conventions.
