How can I remove existing and future __pycache__ files from a Git repository on Windows? The commands I found online are not working; for example, when I run "git rm -r --cached __pycache__" I get the error "pathspec '__pycache__' did not match any files".
The __pycache__ folders that you are seeing are not in your current and future Git commits. Because of the way Git works internally—which Git forces you to know, at least if you're going to understand it—understanding this is a bit tricky, even once we get past the "directory / folder confusion" we saw in your comments.
The right place to start, I believe, is at the top. Git isn't about files (or even files-and-folders / files-and-directories). Those new to Git see it as storing files, so they think it's about files, but that's just not true. Or, they note the importance of the ideas behind branches, and think that Git is about branches, and that too is not really true, because people confuse one kind of "branch" (that does matter) with branch names (which don't matter). The first thing to know, then, is that Git is really all about commits.
This means that you really need to know:
what a commit is, and
what a commit does for you
(these two overlap but are not identical). We won't really cover what a commit is here, for space reasons, but let's look at the main thing that one does for you: Each commit stores a full snapshot of every file.
We now need a small digression into files and folders and how Git and your OS differ in terms of how they organize files. Your computer insists that a file has a name like file.ext and lives in a folder or directory—the two terms are interchangeable—such as to, which in turn lives in another folder such as path. This produces path/to/file.ext or, on Windows, path\to\file.ext.
Git, by contrast, has only files, and their names always use forward slashes and include the slashes. The file named path/to/file.ext is literally just the file, with that name. But Git does understand that your computer demands the file-in-folder format, and will convert back and forth as needed. If Git needs to extract a file whose name is some/long/file/name.ext, Git will create folders some, some/long, and so on when it must, all automatically.
The strange side effect of this is that because Git stores only the files, not the folders, Git is unable to store an empty folder. This distinction actually occurs in Git's index aka staging area, which we won't get into in any detail, but it explains the problem whose answers are given in How do I add an empty directory to a Git repository?
In any case, commits in Git store files, using these path names. Each commit has a full copy of every file—but the files' contents are stored in a special, Git-ized, read-only, Git-only format in which the contents are de-duplicated. So if a million commits store one particular version of one particular file, there's really only one copy, shared between all million commits. Git can do this kind of sharing because, unlike regular files on your computer, files stored in a commit, in Git, literally can't be changed.
Going back to the commits now: each commit contains a full snapshot of every file (that it had when you, or whoever, made the commit). But these files are read-only—they literally can't have their contents replaced, which is what enables that sharing—and only Git itself can even read them. This makes them useless for actually getting any work done. They're fine as archives, but no good for real work.
The solution to this problem is simple (and the same as in almost all other version control systems): when you select some commit to work on / with, Git will extract the files from that commit. This creates ordinary files, in ordinary folders, in an ordinary area in which you can do your work (whether that's ordinary or substandard or exemplary work—that's all up to you, not to Git 😀). What this means is that you do your work in a working tree or work-tree (Git uses these two terms interchangeably). More importantly, it means this: The files you see and work on / with are not in Git. They may have just been extracted by Git, from some commit. But now they're ordinary files and you use them without Git being aware of what you're doing.
Since Git has extracted these files into ordinary folders, you can create new files and/or new folders if you like. When you run Python programs, Python itself will, at various times, create __pycache__ folders and stuff *.pyc and/or *.pyo files into them. Python does this without Git's knowledge or understanding.
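You can watch this happen with a small stdlib sketch (the module name and temp directory here are invented for illustration): importing a module makes CPython write a __pycache__ folder next to the source, unless bytecode writing is disabled (e.g. with python -B or PYTHONDONTWRITEBYTECODE).

```python
import importlib
import pathlib
import sys
import tempfile

def pycache_created() -> bool:
    """Import a throwaway module and report whether CPython wrote __pycache__."""
    d = pathlib.Path(tempfile.mkdtemp())
    (d / "throwaway_mod.py").write_text("X = 1\n")
    sys.path.insert(0, str(d))
    try:
        importlib.import_module("throwaway_mod")
    finally:
        sys.path.remove(str(d))
    # CPython drops the compiled *.pyc beside the source, inside __pycache__
    return (d / "__pycache__").is_dir()
```

Git plays no part in any of this; the folder simply appears in the working tree as untracked files.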
Because these files are generated by Python, based on your source, and just used to speed up Python, it's a good idea to avoid putting them into the commits. There's no need to save a permanent snapshot of these files, especially since the format and contents may depend on the specific Python version (e.g., Python 3.7 generates *.cpython-37.pyc files, Python 3.9 generates *.cpython-39.pyc files, and so on). So we tell Git two things:
Don't complain about the existence of these particular untracked files in the working tree.
When I use an en-masse "add everything" operation like git add ., don't add these files to the index / staging-area, so that they won't go into the next commit either.
We generally do this with the (poorly named) .gitignore file. Listing a file name in a .gitignore does not make Git ignore it; instead, it has the effect of doing the two things I listed here.
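For Python projects, that usually means lines like these in the top-level .gitignore (the *.py[cod] pattern is a common companion, not strictly required):

```
__pycache__/
*.py[cod]
```

The trailing slash makes the first pattern match directories named __pycache__ at any depth, which covers every file inside them.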
This uses the Git-specific term untracked file, which has a simple definition that has a complex back-story. An untracked file is simply any file in your working tree that is not currently in Git's index (staging area). Since we're not going to get into a discussion of Git's index here, we have to stop there for now, but the general idea is that we don't allow the __pycache__ files to get into the index, which keeps them untracked, which keeps Git from committing them, which keeps them from getting into Git's index. It's all a bit circular here, and if you accidentally do get these files into Git's index, that's when you need the git rm -r --cached __pycache__ command.
Since that command is failing, it means you don't have the problem this command is meant to solve. That's good!
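If you want to check for yourself whether any __pycache__ paths made it into the index, git ls-files prints exactly what is in the index. A small stdlib sketch (the function name is mine, not a Git term):

```python
import subprocess

def tracked_pycache_paths(repo_dir):
    """List index entries under any __pycache__ folder; an empty list means
    there is nothing for `git rm -r --cached` to remove."""
    out = subprocess.run(
        ["git", "-C", repo_dir, "ls-files"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [p for p in out.splitlines() if "__pycache__/" in p]
```

An empty result is precisely the situation that produces the "did not match any files" error in the question.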
Well, you don't need __pycache__ files in your Git repositories, and you're better off ignoring them entirely by adding __pycache__/ to your .gitignore file.
Related
I've recently discovered GitPython, and, given that I'm currently trying to create a Python script which pushes to and pulls from Git repositories automatically, I was really excited to try it out.
When committing to a repository using command line Git, I call git add -A, pretty much to the exclusion of all other arguments. I know that you can call git add . instead, or add/remove files by name; I've just never felt the need to use that functionality. (Is that bad practice on my part?) However, I've been trying to put together a GitPython script today, and, despite combing through the API reference, I can't find any straightforward way of emulating the git add -A command.
This is a snippet from my efforts so far:
from git import Repo

repo = Repo(absolute_path)
repo.index.add("-A")
repo.index.commit("Commit message.")
repo.remotes.origin.push()
This throws the following error: FileNotFoundError: [Errno 2] No such file or directory: '-A'. If, instead, I try to call repo.index.add(), I get: TypeError: add() missing 1 required positional argument: 'items'. I understand that .add() wants me to specify the files I want to add by name, but the whole point of GitPython, from my point of view, is that it's automated! Having to name the files manually defeats the purpose of the module!
Is it possible to emulate git add -A in GitPython? If so, how?
The API you linked goes to a version of GitPython that supports invoking the Git binaries themselves directly, so you could just have it run git add -A for you.
That aside, git add -A means:
Update the index not only where the working tree has a file matching <pathspec> but also where the index already has an entry. This adds, modifies, and removes index entries to match the working tree.
If no <pathspec> is given when -A option is used, all files in the entire working tree are updated (old versions of Git used to limit the update to the current directory and its subdirectories).
So git add -A is just the same as git add . from the top level of the working tree. If you want the old (pre-2.0) git add -A behavior, run git add . from a lower level of the working tree; to get the 2.0-or-later git add -A behavior, run git add . from the top level of the working tree. But see also --no-all:
This option is primarily to help users who are used to older versions of Git, whose "git add <pathspec>…" was a synonym for "git add --no-all <pathspec>…", i.e. ignored removed files.
So, if you want the pre-2.0 behavior, you will also need --no-all.
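That equivalence (including the staging of removals) can be checked mechanically. A stdlib sketch, assuming a 2.0-or-later git on PATH; the helper name is mine:

```python
import subprocess

def staged_names(repo_dir, add_args):
    """From the top level, stage with `git add <add_args>`, return the staged
    paths, then `git reset` so the next call starts from the same dirty state."""
    def run(*args):
        return subprocess.run(
            ["git", "-C", repo_dir, *args],
            capture_output=True, text=True, check=True,
        ).stdout
    run("add", *add_args)
    names = sorted(run("diff", "--cached", "--name-only").splitlines())
    run("reset")  # --mixed: un-stages everything, leaves the working tree alone
    return names
```

On a tree with a new file and a deleted file, staged_names(repo, ["-A"]) and staged_names(repo, ["."]) come back identical under modern Git; pre-2.0 behavior (or --no-all) would drop the deletion from the second call.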
If you intend to do all of these within GitPython without using the git.cmd.Git class, I'll also add that in my experience, the various Python implementations of bits of Git vary in their fidelity to fiddly matters like --no-all (and/or their mapping to pre-2.0 Git, post-2.0 Git, post-2.23 Git, etc.), so if you intend to depend on these behaviors, you should test them.
I have a problem: we are using a package that has not been maintained for a while now, so we forked it in order to maintain it ourselves. The package already exists; let's say it is named package_a. Most of the code and the __init__ are in the package_a/ folder.
Now we want to make our own package that will include our maintained code, and we want to name it package_b. So far so good, but the problem is that package_b wants to have the code and the __init__ in a package_b/ folder, and GitHub changes the contributions for all files when a folder is renamed. I would like the credit for contributions to stay where it is due; the 10k+ lines of code didn't just appear in my local repo out of thin air. Any suggestions for how we can have a package named package_b but keep the code in the original package_a/ folder?
I am thinking along the lines of some clever way of importing package_a into package_b, or something along those lines, but I hope for a definite answer.
Instead of copying the code or trying to import A into B, extract the common code into a 3rd package which both A and B import. Or perhaps a subclass. This doesn't solve your contribution problem, but it does avoid making a big maintenance hassle by copying and pasting 10,000 lines of code.
Git doesn't record copies and renames, but it can recognize when they happen. To give Git the best chance of recognizing a copy, do only the copy in its own commit. Make no changes to the content. Then in a second commit make any necessary changes to the copied code.
In normal Git you can nudge git log and git blame to honor copies and renames with -C. Git doesn't do this by default because it's more expensive.
GitHub will do what GitHub will do.
Regardless of who GitHub says wrote which line, their contributions will still be in the project history. That's how it goes: you make your contribution, and then others put their own work on top of it. This is normal. Their contributions remain in the history.
"History shear" is also normal: that's when a change touches many lines but is otherwise insignificant. For example, if you were to restyle the code, that would cause a history shear; git blame will then say that was the last commit to touch the code. git blame -w mitigates this somewhat, and GitHub has an "ignore whitespace" option. History shear is normal, and so is learning to skip over it.
The tools work for us. Don't bend yourself for the benefit of the tools.
If you want to give a special shout-out to your contributors, add a contributors section to your README.md.
I'm building a little Python script that is supposed to update itself every time it starts. Currently I'm thinking about putting MD5 hashes on a "website" and having the script download the files into a temp folder. Then, if the MD5 hashes line up, the temp files will be moved over the old ones.
But now I'm wondering if git will just do something like this anyway.
What if the internet connection breaks away or power goes down when doing a git pull? Will I still have the "old" version or some intermediate mess?
Since my approach works with an atomic rename from the OS, I can at least be sure that every file is either old or new, but not messed up. Is that true for git as well?
A pull is a complex command which will do a few different things depending on the configuration. It is not something you should use in a script, as it will try to merge (or rebase if so configured) which means that files with conflict markers may be left on the filesystem, which will make anything that tries to compile/interpret those files fail to do so.
If you want to switch to a particular version of files, you should use something like checkout -f <remote>/<branch> after fetching from <remote>. Keep in mind that git cannot know what particular needs you have, so if you're writing a script, it should be able to perform some sanity checks (e.g. make sure there are no extra files lying around).
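For the script in the question, that advice might look like the following stdlib sketch; the remote and branch names are placeholders, and git is assumed to be on PATH. Because the fetch finishes before anything touches the working tree, a dropped connection mid-fetch leaves the old version fully intact:

```python
import subprocess

def update_to(repo_dir, remote="origin", branch="main"):
    """Fetch, then force the working tree to match <remote>/<branch>."""
    def git(*args):
        subprocess.run(["git", "-C", repo_dir, *args],
                       check=True, capture_output=True)
    git("fetch", remote)                         # network step: tree untouched
    git("checkout", "-f", f"{remote}/{branch}")  # local step: detached HEAD at the tip
```

Note that checkout -f leaves untracked files alone; if your sanity checks include "no extra files lying around", a git clean -fd after the checkout handles that (destructively).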
I use Python in one of my products.
I compiled the source code using:
./configure --prefix=/home/myname/python_install
make
make install
I looked inside the python_install directory and noticed that many files (config, pyc, pyo) disclose information about my environment (i.e. strings showing where I compiled it: directory, date, name, etc.).
I found them using grep -i -r "myname" *
How do I remove this metadata from all those files? I do not want to ship my product with this information.
This is probably not something you have to worry about. Is it a secret where you stored your files? If so, choose a different directory name to begin with. Otherwise, I doubt you're going to be able to remove all trace of its location.
BTW, shipping a Python project means an interested party could basically read your Python source, so why worry about the locations of the files?
I decided to rename some directories in my home/hobby Python package (doc to docs, test to tests, util to utils) because, now that I've thought more about it, I think the new names are more appropriate. My general thinking now is that if containers are named after their contents their names should be plural nouns.
Now that I'm ready for my next hg commit I'm wondering how to tell Mercurial about these directory name changes. I am new to RCS software in general and have only been using Mercurial for a couple of months. When I run hg status it shows all the files in these directories being removed and added, so I'm afraid that if I just do an hg addremove I will lose all the change history for the files in these directories, or at the very least the change history will become fragmented and untraceable. I've come across the hg rename command, but the docs only discuss its use for individual files, not directories.
After further reading in Bryan O'Sullivan's 'Definitive Guide' it appears that maybe rename can refer to directories.
So here's what I've decided to try:
hg rename --after doc docs
hg rename --after test tests
hg rename --after util utils
hg status
hg addremove
Can anyone tell me if this is the accepted and preferred method for renaming directories in Mercurial, and if not, how should I do it?
Since you've already renamed the directories, this is perfectly OK. (It would have saved you a manual step if you'd let Mercurial rename them for you: hg rename doc docs, etc. instead of doing it yourself then letting Mercurial know about it).
If you don't have any other files to check in, the hg addremove is superfluous. Look in the output of hg stat and you should only see lines beginning with 'R' (for doc/*, test/* and util/*) and 'A' (for docs/*, etc.)
Finally, don't forget to commit the changes.
EDIT: Forgot to say... use hg log --follow to track changes across the rename.
Mercurial has no concept of directories; it tracks only files. Also, I never rename files or directories manually; I just use
hg rename old-name new-name
I suggest you do that too.
Mercurial offers a rename-tracking feature, which means that Mercurial can trace the complete history of a file that has been renamed. If you rename a file manually, that's not possible.
However, as you have already renamed them manually, you need to use the --follow argument along with hg log to track the file changes through the history.
Personally I'd just go with hg rename and it should be the preferred method.
One reason to use --after instead of renaming with hg is if you are using a refactoring tool that does more than just rename, e.g. one that also fixes up references.