How can I remove existing and future __pycache__ files from a Git repository on Windows? The commands I found online are not working; for example, when I run the command "git rm -r --cached __pycache__" I get the error "pathspec '__pycache__' did not match any files".
The __pycache__ folders that you are seeing are not in your current and future Git commits. Because of the way Git works internally—which Git forces you to know, at least if you're going to understand it—understanding this is a bit tricky, even once we get past the "directory / folder confusion" we saw in your comments.
The right place to start, I believe, is at the top. Git isn't about files (or even files-and-folders / files-and-directories). Those new to Git see it as storing files, so they think it's about files, but that's just not true. Or, they note the importance of the ideas behind branches, and think that Git is about branches, and that too is not really true, because people confuse one kind of "branch" (that does matter) with branch names (which don't matter). The first thing to know, then, is that Git is really all about commits.
This means that you really need to know:
what a commit is, and
what a commit does for you
(these two overlap but are not identical). We won't really cover what a commit is here, for space reasons, but let's look at the main thing that one does for you: Each commit stores a full snapshot of every file.
We now need a small digression into files and folders and how Git and your OS differ in terms of how they organize files. Your computer insists that a file has a name like file.ext and lives in a folder or directory—the two terms are interchangeable—such as to, which in turn lives in another folder such as path. This produces path/to/file.ext or, on Windows, path\to\file.ext.
Git, by contrast, has only files, and their names always use forward slashes and include the slashes. The file named path/to/file.ext is literally just the file, with that name. But Git does understand that your computer demands the file-in-folder format, and will convert back and forth as needed. If Git needs to extract a file whose name is some/long/file/name.ext, Git will create folders some, some/long, and so on when it must, all automatically.
The strange side effect of this is that because Git stores only the files, not the folders, Git is unable to store an empty folder. This distinction actually occurs in Git's index aka staging area, which we won't get into in any detail, but it explains the problem whose answers are given in How do I add an empty directory to a Git repository?
In any case, commits in Git store files, using these path names. Each commit has a full copy of every file—but the files' contents are stored in a special, Git-ized, read-only, Git-only format in which the contents are de-duplicated. So if a million commits store one particular version of one particular file, there's really only one copy, shared between all million commits. Git can do this kind of sharing because, unlike regular files on your computer, files stored in a commit, in Git, literally can't be changed.
Going back to the commits now: each commit contains a full snapshot of every file (that it had when you, or whoever, made the commit). But these files are read-only—they literally can't have their contents replaced, which is what enables that sharing—and only Git itself can even read them. This makes them useless for actually getting any work done. They're fine as archives, but no good for real work.
The solution to this problem is simple (and the same as in almost all other version control systems): when you select some commit to work on / with, Git will extract the files from that commit. This creates ordinary files, in ordinary folders, in an ordinary area in which you can do your work (whether that's ordinary or substandard or exemplary work—that's all up to you, not to Git 😀). What this means is that you do your work in a working tree or work-tree (Git uses these two terms interchangeably). More importantly, it means this: The files you see and work on / with are not in Git. They may have just been extracted by Git, from some commit. But now they're ordinary files and you use them without Git being aware of what you're doing.
Since Git has extracted these files into ordinary folders, you can create new files and/or new folders if you like. When you run Python programs, Python itself will, at various times, create __pycache__ folders and stuff *.pyc and/or *.pyo files into them. Python does this without Git's knowledge or understanding.
Because these files are generated by Python, based on your source, and just used to speed up Python, it's a good idea to avoid putting them into the commits. There's no need to save a permanent snapshot of these files, especially since the format and contents may depend on the specific Python version (e.g., Python 3.7 generates *.cpython-37.pyc files, Python 3.9 generates *.cpython-39.pyc files, and so on). So we tell Git two things:
Don't complain about the existence of these particular untracked files in the working tree.
When I use an en-masse "add everything" operation like git add ., don't add these files to the index / staging-area, so that they won't go into the next commit either.
We generally do this with the (poorly named) .gitignore file. Listing a file name in a .gitignore does not make Git ignore it; instead, it has the effect of doing the two things I listed here.
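For Python projects, the .gitignore entries that accomplish this for the byte-code caches typically look something like the following (the exact patterns are a matter of taste):

    __pycache__/
    *.pyc
    *.pyo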
The first of those two items uses the Git-specific term untracked file, which has a simple definition that has a complex back-story. An untracked file is simply any file in your working tree that is not currently in Git's index (staging area). Since we're not going to get into a discussion of Git's index here, we have to stop there for now, but the general idea is that we don't allow the __pycache__ files to get into the index, which keeps them untracked, which keeps Git from committing them, which keeps them from getting into Git's index. It's all a bit circular here, and if you accidentally do get these files into Git's index, that's when you need the git rm -r --cached __pycache__ command.
Since that command is failing, it means you don't have the problem this command is meant to solve. That's good!
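If you ever do land in that situation, with __pycache__ files showing up as tracked in git status, the usual fix is roughly this (path/to/__pycache__ is a stand-in for wherever the tracked copies actually live in your repository):

    git rm -r --cached path/to/__pycache__
    git commit -m "Stop tracking __pycache__ files"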
Well, you don't need __pycache__ files in your Git repositories, and it's best to ignore everything related to them by adding __pycache__/ to your .gitignore file.
I have the following problem: I'm developing a package in Python that needs a somewhat large static data file for its operation (currently around 70 MB, and it may grow over time).
This isn't excessively large, but it's likely beyond what PyPI will accept, so simply shipping the file as a resource file inside the package is not really an option. It also doesn't compress very well.
So I'm expecting to do something like the following: I'll store the file somewhere it can be downloaded via HTTPS and add a command to the tool that downloads that extra data (i.e. something like a command-line tool with a --fetch-operational-data parameter that one might call once after installation, and perhaps again for updates every now and then, though updates of that file are expected to be rare).
However, this leads to the question of where to store that data, and I feel there's no really good option.
"Usually" package resource files can be managed with importlib_resources, which can access files stored within module directories. But functions like open_binary are all read-only, and while one could probably get the path and write there, that goes against the intention of how it is supposed to be used (e.g. a major selling point of the importlib functionality is that it works for zipped packages, and writing would obviously break that).
Alternatively, one could use a dot directory (~/.mytool/). However, this means there's no good way to install the data globally.
On the other hand, there could be a system-wide directory (/var/lib/mytool?), but then a regular user couldn't use the package. One could try to autodetect whether the data is in /var/lib, fall back to ~/.mytool, and write to whichever location is writable on the update command.
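Roughly, I imagine that lookup would be something like this sketch (mytool, data.bin, and the paths are just placeholders):

    import os
    from pathlib import Path
    from typing import Optional

    # Placeholder locations; "mytool" is just the example name from above.
    SYSTEM_DIR = Path("/var/lib/mytool")
    USER_DIR = Path.home() / ".mytool"

    def find_data_file(name: str = "data.bin") -> Optional[Path]:
        """Return the first existing copy, preferring the system-wide one."""
        for base in (SYSTEM_DIR, USER_DIR):
            candidate = base / name
            if candidate.exists():
                return candidate
        return None

    def writable_data_dir() -> Path:
        """For the update command: use /var/lib/mytool if writable, else ~/.mytool."""
        if SYSTEM_DIR.is_dir() and os.access(SYSTEM_DIR, os.W_OK):
            return SYSTEM_DIR
        USER_DIR.mkdir(parents=True, exist_ok=True)
        return USER_DIR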
Furthermore, the tool is currently runnable straight from its Git repo, which adds another complexity (would it download the file into an extra directory in the repo if it's executed from there, or also use /var/lib/mytool / ~/.mytool?).
Whatever I would do, it feels like an ugly hack. Any good ideas?
I wrote 2-3 plugins for pyLoad.
Sometimes they change, and I let users know on the forum that there's a new version.
To avoid that, I'd like to give my scripts an auto self-update function.
https://github.com/Gutz-Pilz/pyLoad-stuff/blob/master/FileBot.py
Is something like that easy to set up?
Or can someone point me in a direction?
Thanks in advance!
It is possible, with some caveats. But it can easily become very complicated. Before you know it, your auto-update "feature" will be bigger than the original code!
First you need a URL that always serves the latest version. Since you are using GitHub, raw.githubusercontent.com might do very well.
Have your code download the latest version from that URL (e.g. using requests), and compare the version with that in the current code. For this purpose I would recommend a simple integer version number, so you don't need any complicated parsing logic.
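A minimal sketch of that check, assuming the plugin keeps a plain integer __version__ and that the placeholder URL below is replaced with the raw location of your own file:

    import re
    import requests

    __version__ = 12  # simple integer version, as suggested above

    # Placeholder URL: point this at the raw file in your own repository.
    LATEST_URL = "https://raw.githubusercontent.com/you/yourrepo/master/FileBot.py"

    def newer_version_available() -> bool:
        """Fetch the latest source and compare its __version__ with ours."""
        text = requests.get(LATEST_URL, timeout=10).text
        match = re.search(r"^__version__\s*=\s*(\d+)", text, re.MULTILINE)
        return bool(match) and int(match.group(1)) > __version__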
However, you might want to consider only running that check once per day, or once per week. If you do it every time your file is run, the server might get hammered! So now you have to save a file with the date when the check was last done, and read that to see if it is time to run the check again. This file will need to be saved in a location that you can access on every platform your code is liable to run on. That in itself can be a challenge.
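One way to do that throttling, as a rough sketch (the stamp-file location is just an example, and it is exactly the kind of cross-platform detail that needs care):

    import datetime
    from pathlib import Path

    STAMP = Path.home() / ".filebot_last_update_check"  # example location

    def should_check(every_days: int = 1) -> bool:
        """Return True (and record today) if the last check was long enough ago."""
        today = datetime.date.today()
        try:
            last = datetime.date.fromisoformat(STAMP.read_text().strip())
        except (OSError, ValueError):
            last = None  # no stamp yet, or it is unreadable: check now
        if last is not None and (today - last).days < every_days:
            return False
        STAMP.write_text(today.isoformat())
        return True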
If it is just a single Python file, installed as the user that is running it, updating is relatively easy. But if the original was installed as root in the global Python directory and your script is running as a non-privileged user, it will be difficult. Especially if it is running as a plugin and cannot ask the user for (temporary) root credentials to install the file.
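For the easy single-file case, the replacement itself can be a small sketch like this (it assumes new_source already holds the downloaded text and that the running file is writable by the current user):

    import os
    import tempfile

    def replace_self(new_source: str) -> bool:
        """Overwrite the running script atomically; give up if we lack permission."""
        target = os.path.abspath(__file__)
        if not os.access(target, os.W_OK):
            return False  # e.g. installed as root, running as a normal user
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(target))
        with os.fdopen(fd, "w") as f:
            f.write(new_source)
        os.replace(tmp, target)  # atomic replacement on the same filesystem
        return True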
And what are you going to do if a newer version has more dependencies outside the standard library?
Last but not least, as a sysadmin I don't really like auto-updating software. Especially for critical system infrastructure, I like to be able to estimate the consequences before an update.
I use Python in one of my products.
I compiled the source code using:
./configure --prefix=/home/myname/python_install
make
make install
I looked inside the python_install directory and noticed that many files (config, .pyc, .pyo) disclose information about my environment (i.e. strings showing where I compiled it: directory, date, name, etc.).
I found them with grep -i -r "myname" *
How do I remove this metadata from all those files? I do not want to ship my product with this information.
This is probably not something you have to worry about. Is it a secret where you stored your files? If so, choose a different directory name to begin with. Otherwise, I doubt you're going to be able to remove all trace of its location.
BTW, shipping a Python project means an interested party could basically read your Python source, so why worry about the locations of the files?
Whenever I change my Python source files in my Django project, the .pyc files become out of date. Of course that's because I need to recompile them in order to test them through my local Apache web server. I would like to get around this manual process by employing some automatic means of compiling them on save, or on build through Eclipse, or something like that. What's the best and proper way to do this?
You shouldn't ever need to 'compile' your .pyc files manually. This is always done automatically at runtime by the Python interpreter.
In rare instances, such as when you delete an entire .py module, you may need to manually delete the corresponding .pyc. But there's no need to do any other manual compiling.
What makes you think you need to do this?
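That said, if you really do want to compile everything up front (say, right after a deploy), the standard-library compileall module does it in one call, and a small sweep like the one below removes .pyc files whose .py source is gone. This assumes old-style .pyc files sitting next to their .py sources, as your question suggests; the project path is just an example:

    import compileall
    from pathlib import Path

    project = Path("/path/to/your/django/project")  # example path

    # Pre-compile everything (the interpreter would do this lazily anyway).
    compileall.compile_dir(str(project), quiet=1)

    # Remove orphaned .pyc files left behind after a .py module was deleted.
    for pyc in project.rglob("*.pyc"):
        if not pyc.with_suffix(".py").exists():
            pyc.unlink()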