I've just seen that my web application's Docker image is enormous. The packages I install account for roughly 600 MB of it, and the biggest single offender is botocore at 77.7 MB.
Apparently this is known behavior: https://github.com/boto/botocore/issues/1629
Is it possible to reduce that size?
Analysis
The tar.gz distribution is just 10.8 MB: https://pypi.org/project/botocore/#files
75 MB of that are in the data directory.
For every single AWS service there seem to be multiple folders (some kind of versioning?) and a service-2.json file.
The service-2.json files probably account for most of the space. They are not minified, and they contain a lot of information that does not seem necessary for running a production system (e.g. descriptions).
Is there a way to either avoid botocore completely, or otherwise reduce its size for the Docker image? (I'm only using S3.)
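One idea based on the analysis above (just a sketch, not an official botocore feature): prune botocore/data after installation and keep only the services I actually use. This assumes one subdirectory per service under botocore/data, leaves top-level files such as endpoints.json in place since botocore needs them, and may need to keep more than just s3 (e.g. sts for some credential setups):

    # prune_botocore.py -- hedged sketch, run once after "pip install botocore"
    import os
    import shutil
    import botocore

    data_dir = os.path.join(os.path.dirname(botocore.__file__), "data")
    keep = {"s3"}  # service directories to keep; everything else is deleted

    for entry in os.listdir(data_dir):
        path = os.path.join(data_dir, entry)
        if os.path.isdir(path) and entry not in keep:
            shutil.rmtree(path)

In a Dockerfile this would have to run in the same layer as the pip install, so the deleted files never end up in any image layer.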
How can I remove existing and future __pycache__ files from a Git repository on Windows? The commands I found online are not working; for example, when I run the command "git rm -r --cached __pycache__" I get the error "pathspec '__pycache__' did not match any files".
The __pycache__ folders that you are seeing are not in your current Git commits, and they won't be in future ones either. Because of the way Git works internally (which Git forces you to know, at least if you're going to understand it), this is a bit tricky to explain, even once we get past the "directory / folder confusion" we saw in your comments.
The right place to start, I believe, is at the top. Git isn't about files (or even files-and-folders / files-and-directories). Those new to Git see it as storing files, so they think it's about files, but that's just not true. Or, they note the importance of the ideas behind branches, and think that Git is about branches, and that too is not really true, because people confuse one kind of "branch" (that does matter) with branch names (which don't matter). The first thing to know, then, is that Git is really all about commits.
This means that you really need to know:
what a commit is, and
what a commit does for you
(these two overlap but are not identical). We won't really cover what a commit is here, for space reasons, but let's look at the main thing that one does for you: Each commit stores a full snapshot of every file.
We now need a small digression into files and folders and how Git and your OS differ in terms of how they organize files. Your computer insists that a file has a name like file.ext and lives in a folder or directory—the two terms are interchangeable—such as to, which in turn lives in another folder such as path. This produces path/to/file.ext or, on Windows, path\to\file.ext.
Git, by contrast, has only files, and their names always use forward slashes and include the slashes. The file named path/to/file.ext is literally just the file, with that name. But Git does understand that your computer demands the file-in-folder format, and will convert back and forth as needed. If Git needs to extract a file whose name is some/long/file/name.ext, Git will create folders some, some/long, and so on when it must, all automatically.
The strange side effect of this is that because Git stores only the files, not the folders, Git is unable to store an empty folder. This distinction actually occurs in Git's index aka staging area, which we won't get into in any detail, but it explains the problem whose answers are given in How do I add an empty directory to a Git repository?
In any case, commits in Git store files, using these path names. Each commit has a full copy of every file—but the files' contents are stored in a special, Git-ized, read-only, Git-only format in which the contents are de-duplicated. So if a million commits store one particular version of one particular file, there's really only one copy, shared between all million commits. Git can do this kind of sharing because, unlike regular files on your computer, files stored in a commit, in Git, literally can't be changed.
Going back to the commits now: each commit contains a full snapshot of every file (that it had when you, or whoever, made the commit). But these files are read-only—they literally can't have their contents replaced, which is what enables that sharing—and only Git itself can even read them. This makes them useless for actually getting any work done. They're fine as archives, but no good for real work.
The solution to this problem is simple (and the same as in almost all other version control systems): when you select some commit to work on / with, Git will extract the files from that commit. This creates ordinary files, in ordinary folders, in an ordinary area in which you can do your work (whether that's ordinary or substandard or exemplary work—that's all up to you, not to Git 😀). What this means is that you do your work in a working tree or work-tree (Git uses these two terms interchangeably). More importantly, it means this: The files you see and work on / with are not in Git. They may have just been extracted by Git, from some commit. But now they're ordinary files and you use them without Git being aware of what you're doing.
Since Git has extracted these files into ordinary folders, you can create new files and/or new folders if you like. When you run Python programs, Python itself will, at various times, create __pycache__ folders and stuff *.pyc and/or *.pyo files into them. Python does this without Git's knowledge or understanding.
Because these files are generated by Python, based on your source, and just used to speed up Python, it's a good idea to avoid putting them into the commits. There's no need to save a permanent snapshot of these files, especially since the format and contents may depend on the specific Python version (e.g., Python 3.7 generates *.cpython-37.pyc files, Python 3.9 generates *.cpython-39.pyc files, and so on). So we tell Git two things:
Don't complain about the existence of these particular untracked files in the working tree.
When I use an en-masse "add everything" operation like git add ., don't add these files to the index / staging-area, so that they won't go into the next commit either.
We generally do this with the (poorly named) .gitignore file. Listing a file name in a .gitignore does not make Git ignore it; instead, it has the effect of doing the two things I listed here.
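For Python byte-code caches, the usual .gitignore entries look like this (a common convention, not something Git itself mandates):

    # .gitignore -- ignore Python byte-code caches anywhere in the tree
    __pycache__/
    *.py[cod]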
This uses the Git-specific term untracked file, which has a simple definition that has a complex back-story. An untracked file is simply any file in your working tree that is not currently in Git's index (staging area). Since we're not going to get into a discussion of Git's index here, we have to stop there for now, but the general idea is that we don't allow the __pycache__ files to get into the index, which keeps them untracked, which keeps Git from committing them, which keeps them from getting into Git's index. It's all a bit circular here, and if you accidentally do get these files into Git's index, that's when you need the git rm -r --cached __pycache__ command.
Since that command is failing, it means you don't have the problem this command is meant to solve. That's good!
Well, you don't need __pycache__ files in your Git repositories, so it's best to ignore everything related to them by adding __pycache__/ to your .gitignore file.
I am using Python and Git for scientific applications. I often develop software tools that consist of a relatively small codebase and some large binary datasets.
Ideally, I want the software to work "out of the box", without the user having to manually download additional binary files. If I commit the binary data to Git, users can install everything using a single pip command (pip install git+https://github.com/myuser/myrepo). However, committing binary files is considered bad practice for a number of good reasons, such as slow download times if users want to upgrade the codebase. Also, a small backwards-compatible change to the binary files may lead to a large increase in the repo size.
Alternatively, I can put the binary data on an external server and let the software download the data if not present on the system (or if a newer version is available). Is there a preferred way to implement this procedure? In particular, is there a standard "download location" where the datasets should be placed?
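For illustration, the download step I have in mind would look roughly like this; the URL, file name, and project name are placeholders, and the per-user cache directory (XDG_CACHE_HOME, falling back to ~/.cache) is a common Linux convention rather than an established standard:

    # hedged sketch: fetch the dataset on first use into a per-user cache
    import os
    import urllib.request

    DATA_URL = "https://example.com/myproject/dataset-v1.bin"  # placeholder

    def get_dataset_path():
        """Return the local path to the dataset, downloading it if missing."""
        cache_dir = os.path.join(
            os.environ.get("XDG_CACHE_HOME", os.path.expanduser("~/.cache")),
            "myproject",  # placeholder project name
        )
        os.makedirs(cache_dir, exist_ok=True)
        local_path = os.path.join(cache_dir, "dataset-v1.bin")
        if not os.path.exists(local_path):
            urllib.request.urlretrieve(DATA_URL, local_path)
        return local_path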
For big files, you can enable GitHub's LFS service.
Git Large File Storage (LFS) replaces large files such as audio samples, videos, datasets, and graphics with text pointers inside Git, while storing the file contents on a remote server like GitHub.com or GitHub Enterprise.
I think that if your data will not change very often, you can just attach it to a GitHub release.
I have a GitHub project that's available to others. One of the scripts, update.py, checks GitHub every day (via cron) to see if there is a newer version available.
Locally, the script is located at directory /home/user/.Project/update.py
If the version on github is newer, then update.py moves /home/user/.Project/ to /home/user/.OldProject/, clones the github repo and moves/renames the downloaded repo to /home/user/.Project/
It has worked perfectly for me about five times, but I just realized that the script is moving itself while it is still running. Are there any unforeseen consequences to this approach, and is there a better way?
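For reference, the update flow boils down to roughly this (paths as above; the repository URL is a placeholder):

    # hedged sketch of the move-aside-and-clone flow described above
    import shutil
    import subprocess

    PROJECT = "/home/user/.Project"
    OLD = "/home/user/.OldProject"
    REPO = "https://github.com/user/project.git"  # placeholder URL

    def self_update():
        shutil.rmtree(OLD, ignore_errors=True)  # discard the previous backup
        shutil.move(PROJECT, OLD)               # move the running copy aside
        subprocess.run(["git", "clone", REPO, PROJECT], check=True)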
As long as all of the code used by the script has been compiled and loaded into the Python VM, there will be no issue with the source moving, since the code will remain resident in memory until the process ends or is replaced (or is swapped out; but since it is considered dirty data, it will be swapped back in exactly as it was). The operating system, though, may attempt to block the move operation if any of the files are still open at the time.
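A toy demonstration of this (not part of your project): save the following as a file and run it; the source moves out from under the interpreter, and execution continues because the compiled code is already in memory:

    # demo.py -- move our own source file while we are still running
    import os
    import time

    os.rename(__file__, __file__ + ".old")  # the source file is now gone
    time.sleep(1)
    print("still running after the move")   # the loaded byte-code keeps working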
I'm building a little Python script that is supposed to update itself every time it starts. Currently I'm thinking about putting MD5 hashes on a "website" and having the script itself download the files into a temp folder. Then, if the MD5 hashes match, the temp files will be moved over the old ones.
But now I'm wondering if git will just do something like this anyway.
What if the internet connection drops or the power goes out during a git pull? Will I still have the "old" version, or some intermediate mess?
Since my approach works with an atomic rename from the OS, I can at least be sure that every file is either old or new, but not messed up. Is that true for git as well?
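For comparison, my MD5-and-rename approach is roughly the following (the URL and expected hash would come from the website; os.replace is the atomic rename):

    # hedged sketch of the download-verify-swap scheme described above
    import hashlib
    import os
    import urllib.request

    def replace_if_valid(url, expected_md5, dest):
        tmp = dest + ".tmp"
        urllib.request.urlretrieve(url, tmp)  # download next to the target
        with open(tmp, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
        if digest == expected_md5:
            os.replace(tmp, dest)   # atomic rename on POSIX and Windows
        else:
            os.remove(tmp)          # bad download: the old file stays intact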
A pull is a complex command that will do a few different things depending on the configuration. It is not something you should use in a script, as it will try to merge (or rebase, if so configured), which means that files with conflict markers may be left on the filesystem, and anything that then tries to compile or interpret those files will fail.
If you want to switch to a particular version of the files, you should use something like checkout -f <remote>/<branch> after fetching from <remote>. Keep in mind that Git cannot know your particular needs, so if you're writing a script, it should also be able to perform some sanity checks (e.g. make sure there are no extra files lying around).
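A rough sketch of that sequence driven from Python, assuming the script runs inside a clone whose remote is named origin and whose branch is master:

    # hedged sketch: force the working tree to match the remote branch tip
    import subprocess

    def update_to_remote(remote="origin", branch="master"):
        # download new objects; this does not touch the working tree
        subprocess.run(["git", "fetch", remote], check=True)
        # overwrite tracked files with the fetched branch tip; untracked
        # files are left alone, hence the extra sanity checks mentioned above
        subprocess.run(["git", "checkout", "-f", f"{remote}/{branch}"], check=True)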