Using git to manage virtualenv state: will this cause problems?

I currently have git and virtualenv set up in a way which exactly
suits my needs and, so far, hasn't caused any problems. However I'm aware that
my setup is non-standard and I'm wondering if anyone more familiar with virtualenv's
internals can point out if, and where, it's likely to go wrong.
My setup
My virtualenv is inside my git repository, but git is set to ignore the bin and include directories and everything in lib except for the site-packages directory.
More precisely, my .gitignore file looks like this:
*.pyc
# Ignore all the virtualenv stuff except the actual packages
# themselves
/bin
/include
/lib/python*/*
!/lib/python*/site-packages
# Ignore easyinstall and setuptools
/lib/python*/site-packages/easy-install.pth
/lib/python*/site-packages/setuptools.pth
/lib/python*/site-packages/setuptools-*
/lib/python*/site-packages/pip-*
With this arrangement I -- and anyone else working on a checkout of the project -- can use virtualenv and pip as normal but with the following advantages:
If anyone updates or installs a package and pushes their changes, anyone else who pulls those changes automatically gets the update: they don't need to notice that a requirements.txt file has changed or do any post-receive hook magic.
There are no network dependencies: all the code to make the application work lives in the git repository.
I'm aware that this only works with pure-Python packages, but that's all I'm concerned with at the moment.
Does anyone know of any other problems with this approach that I should be aware of?

This is an interesting question. I think the other two answers (thus far) raise good specific points. Clearly you've thought this through and have arrived at a solution you like, but I'll note that there does seem to be a philosophical split here among virtualenv users.
One camp, to which I'd guess you belong, feels that the local VE is part of the project (i.e. it should be under version control). The other feels that the VE should essentially be treated as a development artifact -- that requirements.txt should be part of the project repo, but that you should be able to blow away and recreate the VE as needed.
I just mention this because when I first saw this distinction made, it helped shape my thinking about virtualenv. (I'm in the second camp, FWIW, because it seems simpler and cleaner to me, but that's not to say that being in the first camp is wrong for your particular project.)
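For what it's worth, the second camp's workflow is only a couple of commands; a rough sketch (the venv directory name is arbitrary):
$ virtualenv venv
$ . venv/bin/activate
$ pip install -r requirements.txt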

If you have any -e items in your requirements.txt (in other words, any editable dependencies, as described in the requirements file format, that come from a version control system such as git), you will most likely run into issues when the src directory they are checked out into gets committed. If you just add /src to your .gitignore you can avoid committing it, but the installation often just adds a pointer in site-packages to that location, so you still need the directory!
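For illustration, here is roughly what that situation looks like; the package name and repository URL below are made up:
$ cat requirements.txt
-e git+https://github.com/example/myproject.git#egg=myproject
$ ls src/                      # pip checks editable requirements out here
myproject
$ echo '/src' >> .gitignore    # keep the checkout out of the repo without deleting it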

In /lib/python2.7/site-packages in my virtualenv I have files containing absolute paths; for example, Django.egg-link contains /Users/henry/.virtualenv/mowapp/src/django (this is a Mac).
Check whether you have any absolute paths checked into git; I think that is a major problem with this approach.
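A quick way to check, assuming the virtualenv sits at the repository root as in the question (adjust the pathspec to your layout):
$ git grep -nI -E '/(Users|home)/' -- 'lib/python*/site-packages'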

Related

Moving Anaconda installation from one user account to another

I apologize if this is not the correct site for this. If it is not, please let me know.
Here's some background on what I am attempting. We are working on a series of chat bots that will go into production. Each of them will run in an Anaconda environment. However, our setup uses tensorflow, which requires gcc to compile, and compliance has banned compilers from production. In addition, compliance rules also frown on us using pip install or conda install in production.
As a way to get around this, I'm trying to tar the Anaconda 3 folder and move it into prod, with all dependencies already compiled and installed. However, the accounts in the two environments have different names, so this requires me to go into the bin folder (at the very least; I'm sure I will need to change things in the lib and pkgs folders as well) and use sed -i to rewrite the hard-coded paths from /home/<dev account>/anaconda to /home/<prod account>/anaconda, and while this seems to work, it's also a good way to mangle my installation.
My questions are as follows:
Is there any good way to transfer anaconda from one user to another, without having to use sed -i on these paths? I've already read that Anaconda itself does not support this, but I would like your input.
Is there any way for me to install Anaconda in dev so that the scripts in it are either hard-coded to use the production account name in their paths, or use ~?
If I must continue to use sed, is there anything critical I should be aware of? For example, when I use grep <dev account> *, I see some files listed as binary file matches. Do I need to do anything special to change these?
And once again, I am well aware that I should just create a new Anaconda installation on the production machine, but that is simply not an option.
Edit:
So far, I've changed the conda.sh and conda.csh files in /etc, as well as the conda, activate, and deactivate files in the root bin. As such, I'm able to activate and deactivate my environment on the new user account. Also, I've changed the files in the bin folder under the bot environment. Right now, I'm trying to train the bot to test whether this works, but it keeps failing and stating that a custom action does not exist in the list. I don't think that is related to this, though.
Edit2:
I've confirmed that the error I was getting was not related to this. In order to get the bot to work properly with a ported version of Anaconda, all I had to change was the conda.sh and conda.csh files in /etc so their paths to python use ~, do the same for the activate and deactivate files in /bin, and change the shebang line in the conda file in /bin to use the actual account name. This leaves every other file in /bin and lib still using the old account name in their shebang lines and other path variables, and yet the bots work as expected. By all rights, I don't think this should work, but it does.
Anaconda is touchy about path names. They're obviously inserted into scripts, but they may be inserted into binaries as well. Some approaches that come to mind are:
Use Docker images in production. When building the image:
Install compilers as needed.
Build your stuff.
Uninstall the compilers and other stuff not needed at runtime.
Squash the image into a single layer.
This makes sure that the uninstalled stuff is actually gone.
Install Anaconda into the directory /home/<prod account>/anaconda on the development or build systems as well. Even though the accounts are different, there should be a way to create a user-writable directory in the same location.
Even better: install Anaconda into a directory /opt/anaconda in all environments, or some other directory that does not contain a username.
If you cannot get a directory outside of the user home, negotiate for a symlink or junction (mklink.exe /d or /j) at a fixed path /opt/anaconda that points into the user home.
If necessary, play it from the QA angle: Differing directory paths in production, as compared to all other environments, introduce a risk for bugs that can only be detected and reproduced in production. The QA or operations team should mandate that all applications use fixed paths everywhere, rather than make an exception for yours ;-)
Build inside a Docker container using the directory /home/<prod account>/anaconda, then export an archive and run it on the production system without Docker (a rough sketch of the export step follows at the end of this answer).
It's generally a good idea to build inside a reproducible Docker environment, even if you can get a fixed path without an account name in it.
Bundle your whole application as a pre-compiled Anaconda package, so that it can be installed without compilers.
That doesn't really address your problem though, because even conda install is frowned upon in production. But it could simplify building Docker images without squashing.
I've been building Anaconda environments inside Docker and running them on bare metal in production, too. But we've always made sure that the paths are identical across environments. I found mangling the paths too scary to even try. Life has become much simpler when we switched to Docker images everywhere. But if you have to keep using sed... Good Luck :-)
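To illustrate the build-in-Docker-then-export approach above, here is a rough sketch; the image and container names are made up and the target directory on the production host is a placeholder:
$ docker build -t bot-build .                        # image contains the ready-made Anaconda env
$ docker create --name bot-tmp bot-build
$ docker export bot-tmp | gzip > bot-runtime.tar.gz
$ docker rm bot-tmp
$ # copy bot-runtime.tar.gz to the production host, then unpack it into place there:
$ tar -xzf bot-runtime.tar.gz -C /some/target/dir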
This is probably what you need: pip2pi.
This only works for pip-compatible packages.
As I understand it, you need to move your whole setup, already compiled, as a .tar.gz file. Here are a few things you could try:
Create a requirements.txt. These packages can help:
a. pipreqs
$ pipreqs /home/project/location
Successfully saved requirements file in /home/project/location/requirements.txt
b. snakefood.
Then, install pip2pi
$ pip install pip2pi
$ pip2tgz packages/ foo==1.2
...
$ ls packages/
foo-1.2.tar.gz
bar-0.8.tar.gz
pip2tgz passes package arguments directly to pip, so packages can be specified in any format that pip recognises:
$ cat requirements.txt
foo==1.2
http://example.com/baz-0.3.tar.gz
$ pip2tgz packages/ -r requirements.txt bam-2.3/
...
$ ls packages/
foo-1.2.tar.gz
bar-0.8.tar.gz
baz-0.3.tar.gz
bam-2.3.tar.gz
After getting all the .tar.gz files, they can be turned into a PyPI-compatible "simple" package index using the dir2pi command:
$ ls packages/
bar-0.8.tar.gz
baz-0.3.tar.gz
foo-1.2.tar.gz
$ dir2pi packages/
$ find packages/
packages/
packages/bar-0.8.tar.gz
packages/baz-0.3.tar.gz
packages/foo-1.2.tar.gz
packages/simple
packages/simple/bar
packages/simple/bar/bar-0.8.tar.gz
packages/simple/baz
packages/simple/baz/baz-0.3.tar.gz
packages/simple/foo
packages/simple/foo/foo-1.2.tar.gz
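Once the index exists, the production machine can install from the copied directory without any network access; a rough sketch (the exact paths are placeholders):
$ # install offline from the downloaded archives
$ pip install --no-index --find-links=packages/ -r requirements.txt
$ # or point pip at the generated simple index
$ pip install --index-url file:///path/to/packages/simple/ -r requirements.txt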
Regarding "they may be inserted into binaries as well": I can confirm that some packages hard-code the absolute path (including the username) into the compiled binary. But if you restrict the usernames to have the same length, you can apply sed to both binary and text files and almost everything will work perfectly.
On the other hand, if you copy the entire folder and use sed to replace usernames only in text files, you can run most of the installed packages. However, operations involving run-time compilation might fail; one example is installing a new package that requires compilation during installation.
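A rough sketch of that same-length replacement; the account names alice and bobby below are made up and deliberately have the same number of characters:
$ # list every file (text or binary) that still embeds the old path, then rewrite it in place
$ grep -rlZ "/home/alice/anaconda3" /home/bobby/anaconda3 \
    | xargs -0 sed -i 's|/home/alice/anaconda3|/home/bobby/anaconda3|g'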

Is it safe to manually delete all files in the pkgs folder in Anaconda Python?

I ran this command to free up disk space used by Anaconda:
$ conda clean --all
However, some big files still remain in the pkgs folder inside the Anaconda installation.
Is it safe to manually delete all the files in the pkgs folder? Is there any risk of corrupting my Anaconda environment? What are the side effects, if any?
I am using Anaconda 2018 on Windows 10.
Actually, under certain conditions it is an option to have the pkgs subdirs removed. As stated here by Anaconda Community Support, "the pkgs directory is only a cache. You can remove it completely if you want to.
However, when creating new environments, it is more efficient to leave whatever packages are in the cache around."
According to the documentation you can use conda clean --packages to remove unused packages in pkgs (which moves them to pkgs/.trash, from which you can then safely delete them). While this does not check for packages installed using symlinks back to the package cache, this is not an issue if you don't use such environments or if you work under Windows. I guess that's why conda clean --packages is included in conda clean --all.
To more aggressively save space you can use conda clean --force-pkgs-dirs to remove all writable package caches (with the same caveat that there could be environments linked to these dirs). If you don't use environments or use Anaconda under Windows, you're probably safe. Personally, I use this option without issues.
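As an illustration, a cautious sequence might look like this on Linux (the installation path is an assumption, and --dry-run support may vary by conda version):
$ conda clean --packages --dry-run   # preview what would be removed
$ conda clean --packages
$ conda clean --force-pkgs-dirs      # only if you accept the caveats above
$ du -sh ~/anaconda3/pkgs            # check how much space the cache still uses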
Edit Commentary
After reviewing the documentation pointed out in @Robert's answer, I must admit my initial response was overly alarmist and, in parts, blatantly incorrect. My apologies for the misleading response.
Nevertheless, I do believe some of what I raised still has some merit for this thread, and so I am deciding to retain the answer with amendments. In particular, I think it worth emphasizing that deleting the pkgs directory may not actually achieve what OP was hoping for (to save space) and that removing the package cache undermines Conda's redundancy minimization strategy going forward by making it impossible to share already installed packages.
Instead, my final recommendation concurs with what @Robert suggested, namely, use conda clean -p to delete unused packages, but keep the cache (pkgs dir) so that future environments can still leverage hardlinks. One last point to note is that some tools, such as conda-pack, rely on the integrity of the package cache in order to work, so deleting pkgs will prevent their use.
Amended Original Response
No, it is definitely not safe, and in fact the only way you would actually free disk space is if you broke your base env. The issue is that all envs use hardlinks to the pkgs directory, so even if you delete the link located in the pkgs directory, the ones in the envs will still be there and you won't delete any physical files on the disk. The only real deletion you might do is of something that is only referenced by base, i.e., where the only copy is in pkgs, hence the potential for breaking base.
Correction: The base env still links packages to other locations, so deleting pkgs will not impact base as I originally concluded.
I'd highly recommend looking at this other post on estimating the real disk usage of Conda. You may be overestimating how much space is really being used. For most files in pkgs, there is only one physical copy, so there isn't any additional manual optimization to be done.
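If you want to see the hardlinking for yourself on Linux, something like this works (the installation path is an assumption):
$ # files with a link count greater than 1 are shared between pkgs and one or more envs
$ find ~/anaconda3/pkgs -type f -links +1 | head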

When working with a venv virtual environment, which files should I be committing to my git repository?

Using GitHub's .gitignore, I was able to filter out some files and directories. However, there's a few things that left me a little bit confused:
GitHub's .gitignore did not include /bin and /share created by venv. I assumed they should be ignored by git, however, as the user is meant to build the virtual environment themselves.
Pip generated a pip-selfcheck.json file, which seemed mostly like clutter. I assume it usually does this, and I just haven't seen the file before because it's been placed with my global pip.
pyvenv.cfg is what I really can't make any sense of, though. On one hand, it specifies the Python version, which ought to be needed by others who want to use the project. On the other hand, it also specifies home = /usr/bin, which, while probably correct on a lot of Linux distributions, won't necessarily apply to all systems.
Are there any other files/directories I missed? Are there any stricter guidelines for how to structure a project and what to include?
Although venv is a very useful tool, you should not assume (unless you have good reason to do so) that everyone who looks at your repository uses it. Avoid committing any files used only by venv; these are not strictly necessary to be able to run your code and they are confusing to people who don't use venv.
The only configuration file you need to include in your repository is the requirements.txt file generated by pip freeze > requirements.txt which lists package dependencies. You can then add a note in your readme instructing users to install these dependencies with the command pip install -r requirements.txt. It would also be a good idea to specify the required version of Python in your readme.
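In other words, each user recreates the environment locally from requirements.txt; roughly (the .venv directory name is just a convention):
$ python -m venv .venv
$ source .venv/bin/activate       # or .venv\Scripts\activate on Windows
$ pip install -r requirements.txt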

Is it OK to use files inside pip packages without installing them?

I am working on a project and I have cloned a repository from GitHub.
After the first compile I realized that the project I cloned has some dependencies, listed in a requirements.txt file.
I know I have to install these packages, but I don't want to, because I am on a Windows development environment and after finishing my project I am going to publish it to my Ubuntu production environment, and I don't want the hassle of installing everything twice.
I have two options:
Using a virtualenv and installing those packages inside it
Downloading the packages and using them directly with import foldername
I want to avoid the first option because I have less control over my project, and the problem gets bigger if, for example, I am inside another project's virtualenv and want to run my project's main.py from its own virtualenv, and so on. Also, moving the virtualenv from Windows (bat files) to Linux (bash/sh files) seems ugly to me and points me toward approaches I would rather avoid.
The second option is my choice. For example, I need to use the future package. The scenario would be: download the package using pip download future, extract the tar.gz file, and inside the src folder I can see the future package folder. Then I use it with import future_package.src.future without even touching anything else.
Aside from os.path problems (assume I take care of those):
Is this good practice?
I am not running setup.py, so nothing gets installed. Can that cause problems?
Is there a better approach that involves less work (like the second one), or is my first approach actually the better one?
UPDATE 1: I extracted the future and certifi packages, which were part of my project's requirements, used them the direct way, and it is working in this particular case.
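For reference, the "direct way" described above boils down to something like this (the vendor/ directory name is made up):
$ pip download future -d vendor/ --no-binary :all:   # fetch the source archive without installing it
$ tar -xzf vendor/future-*.tar.gz -C vendor/          # unpack it next to the project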

PYTHONPATH vs. sys.path

Another developer and I disagree about whether PYTHONPATH or sys.path should be used to allow Python to find a Python package in a user (e.g., development) directory.
We have a Python project with a typical directory structure:
Project
    setup.py
    package
        __init__.py
        lib.py
        script.py
In script.py, we need to do import package.lib. When the package is installed in site-packages, script.py can find package.lib.
When working from a user directory, however, something else needs to be done. My solution is to set my PYTHONPATH to include "~/Project". Another developer wants to put this line of code in the beginning of script.py:
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
So that Python can find the local copy of package.lib.
I think this is a bad idea, as this line is only useful for developers or people running from a local copy, but I can't give a good reason why it is a bad idea.
Should we use PYTHONPATH, sys.path, or is either fine?
If the only reason to modify the path is for developers working from their working tree, then you should use an installation tool to set up your environment for you. virtualenv is very popular, and if you are using setuptools, you can simply run setup.py develop to semi-install the working tree in your current Python installation.
I hate PYTHONPATH. I find it brittle and annoying to set on a per-user basis (especially for daemon users) and keep track of as project folders move around. I would much rather set sys.path in the invoke scripts for standalone projects.
However, sys.path.append isn't the way to do it. You can easily get duplicates, and it doesn't sort out .pth files. Better (and more readable): site.addsitedir.
And script.py wouldn't normally be the appropriate place to do it, as it's inside the package you want to make available on the path. Library modules should certainly not be touching sys.path themselves. Instead, you'd normally have a hashbanged script outside the package that you use to instantiate and run the app, and it's in this trivial wrapper script that you'd put deployment details like sys.path-frobbing.
In general, I would consider setting an environment variable (like PYTHONPATH) to be a bad practice. While this might be fine for one-off debugging, using it as a regular practice is not a good idea.
Using an environment variable leads to "it works for me" situations when someone else reports problems in the code base. The same practice can also carry over to the test environment, leading to situations where the tests run fine for a particular developer but fail when someone else launches them.
Along with the many other reasons already mentioned, you could also point out that hard-coding
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
is brittle because it presumes the location of script.py: it will only work if script.py is located in Project/package. It will break if a user decides to move/copy/symlink script.py (almost) anywhere else.
Neither hacking PYTHONPATH nor sys.path is a good idea, for the reasons mentioned above. And for linking the current project into the site-packages folder there is actually a better way than python setup.py develop, as explained here:
pip install --editable path/to/project
If you don't already have a setup.py in your project's root folder, this one is good enough to start with:
from setuptools import setup, find_packages

setup(name='project', packages=find_packages())
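With that in place, a quick sanity check could look like this, assuming you run it from the Project directory:
$ pip install --editable .
$ python -c "import package.lib"   # resolves without any sys.path or PYTHONPATH tweaks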
I think that in this case using PYTHONPATH is the better option, mostly because it doesn't introduce (questionable) unnecessary code.
After all, if you think about it, your user doesn't need that sys.path line, because your package will get installed into site-packages, since you will be using a packaging system.
If the user chooses to run from a "local copy", as you call it, then I've observed that the usual practice is to state that the package needs to be added to PYTHONPATH manually if used outside site-packages.
