I apologize if this is not the correct site for this. If it is not, please let me know.
Here's some background on what I am attempting. We are working on a series of chat bots that will go into production. Each of them will run in an Anaconda environment. However, our setup uses TensorFlow, which needs gcc to compile, and compliance has banned compilers from production. In addition, compliance rules also frown on us using pip install or conda install in production.
As a way to get around this, I'm trying to tar the Anaconda 3 folder and move it into prod, with all dependencies already compiled and installed. However, the accounts in the two environments have different names, so this requires me to go into the bin folder (at the very least; I'm sure I will need to change them in the lib and pkgs folders as well) and use sed -i to rewrite the hard-coded paths from /home/<dev account>/anaconda to /home/<prod account>/anaconda, and while this seems to work, it's also a good way to mangle my installation.
My questions are as follows:
Is there any good way to transfer anaconda from one user to another, without having to use sed -i on these paths? I've already read that Anaconda itself does not support this, but I would like your input.
Is there any way for me to install Anaconda in dev so the scripts in it are either hard coded to use the production account name in their paths, or to use ~?
If I must continue to use sed, is there anything critical I should be aware of? For example, when I use grep <dev account> *, I see some files listed as binary file matches. Do I need to do anything special to change these?
And once again, I am well aware that I should just create a new Anaconda installation on the production machine, but that is simply not an option.
Edit:
So far, I've changed the conda.sh and conda.csh files in /etc, as well as the conda, activate, and deactivate files in the root bin. As such, I'm able to activate and deactivate my environment on the new user account. Also, I've changed the files in the bin folder under the bot environment. Right now, I'm trying to train the bot to test if this works, but it keeps failing and stating that a custom action does not exist in the list. I don't think that is related to this, though.
Edit2:
I've confirmed that the error I was getting was not related to this. In order to get the bot to work properly with a ported version of Anaconda, all I had to change were the conda.sh and conda.csh files in /etc so their paths to python use ~, do the same for the activate and deactivate files in /bin, and change the shebang line in the conda file in /bin to use the actual account name. This leaves every other file in /bin and lib still using the old account name in their shebang lines and other variables that use the path, and yet the bots work as expected. By all rights, I don't think this should work, but it does.
Anaconda is touchy about path names. They're obviously inserted into scripts, but they may be inserted into binaries as well. Some approaches that come to mind are:
Use Docker images in production. When building the image:
Install compilers as needed.
Build your stuff.
Uninstall the compilers and other stuff not needed at runtime.
Squash the image into a single layer.
This makes sure that the uninstalled stuff is actually gone.
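One way to get that single layer, for instance, is to export a built container and re-import its filesystem; the image and container names below are just placeholders:
$ docker build -t bot-build .                        # image built with compilers, then compilers uninstalled
$ docker create --name bot-tmp bot-build
$ docker export bot-tmp | docker import - bot-prod   # filesystem only, flattened into a single layer
Note that export/import keeps only the filesystem, so metadata such as CMD and ENV has to be re-declared on the imported image.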
Install Anaconda into the directory /home/<prod account>/anaconda on the development or build systems as well. Even though accounts are different, there should be a way to create a user-writeable directory in the same location.
Even better: Install Anaconda into a directory /opt/anaconda in all environments. Or some other directory that does not contain a username.
If you cannot get a directory outside of the user home, negotiate for a symlink or junction (mklink.exe /d or /j) at a fixed path /opt/anaconda that points into the user home.
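A minimal sketch of that symlink on Linux, run once per environment by someone with the rights to write to /opt:
$ sudo ln -s /home/<dev account>/anaconda /opt/anaconda    # on the dev box
$ sudo ln -s /home/<prod account>/anaconda /opt/anaconda   # on the prod box
With that in place, Anaconda can be installed under and referenced as /opt/anaconda everywhere.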
If necessary, play it from the QA angle: Differing directory paths in production, as compared to all other environments, introduce a risk for bugs that can only be detected and reproduced in production. The QA or operations team should mandate that all applications use fixed paths everywhere, rather than make an exception for yours ;-)
Build inside a Docker container using directory /home/<prod account>/anaconda, then export an archive and run on the production system without Docker.
It's generally a good idea to build inside a reproducible Docker environment, even if you can get a fixed path without an account name in it.
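As a rough sketch of that export step, assuming the environment was built under /home/<prod account>/anaconda inside an image called anaconda-build (the names are made up):
$ docker create --name anaconda-tmp anaconda-build
$ docker export anaconda-tmp | gzip > anaconda-env.tar.gz
# on the production host, no Docker required; extract only the anaconda tree:
$ tar -xzf anaconda-env.tar.gz -C / 'home/<prod account>/anaconda'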
Bundle your whole application as a pre-compiled Anaconda package, so that it can be installed without compilers.
That doesn't really address your problem though, because even conda install is frowned upon in production. But it could simplify building Docker images without squashing.
I've been building Anaconda environments inside Docker and running them on bare metal in production, too. But we've always made sure that the paths are identical across environments. I found mangling the paths too scary to even try. Life became much simpler when we switched to Docker images everywhere. But if you have to keep using sed... Good luck :-)
This is probably what you need: pip2pi.
This only works for pip-compatible packages.
As I understand it, you need to move your whole setup, already compiled, as a .tar.gz file, so here are a few things you could try:
Create a requirements.txt. These packages can help:
a. pipreqs
$ pipreqs /home/project/location
Successfully saved requirements file in /home/project/location/requirements.txt
b. snakefood.
Then, install pip2pi
$ pip install pip2pi
$ pip2tgz packages/ foo==1.2
...
$ ls packages/
foo-1.2.tar.gz
bar-0.8.tar.gz
pip2tgz passes package arguments directly to pip, so packages can be specified in any format that pip recognises:
$ cat requirements.txt
foo==1.2
http://example.com/baz-0.3.tar.gz
$ pip2tgz packages/ -r requirements.txt bam-2.3/
...
$ ls packages/
foo-1.2.tar.gz
bar-0.8.tar.gz
baz-0.3.tar.gz
bam-2.3.tar.gz
After getting all the .tar.gz files, they can be turned into a PyPI-compatible "simple" package index using the dir2pi command:
$ ls packages/
bar-0.8.tar.gz
baz-0.3.tar.gz
foo-1.2.tar.gz
$ dir2pi packages/
$ find packages/
packages/
packages/bar-0.8.tar.gz
packages/baz-0.3.tar.gz
packages/foo-1.2.tar.gz
packages/simple
packages/simple/bar
packages/simple/bar/bar-0.8.tar.gz
packages/simple/baz
packages/simple/baz/baz-0.3.tar.gz
packages/simple/foo
packages/simple/foo/foo-1.2.tar.gz
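If running pip itself is acceptable on the target machine, packages can then be installed offline from that index by pointing pip at a file:// URL (the path below is a placeholder):
$ pip install --index-url=file:///path/to/packages/simple/ foo==1.2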
but they may be inserted into binaries as well
I can confirm that some packages have hard-coded the absolute path (including the username) into the compiled binary. But if you restrict the usernames to have the same length, you can apply sed to both binary and text files and make almost everything work perfectly.
On the other hand, if you copy the entire folder and use sed to replace usernames in only the text files, you can run most of the installed packages. However, operations involving run-time compilation might fail; one example is installing a new package that requires compilation during installation.
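As an illustration only, with made-up account names of identical length (devacct and prdacct), that replacement over both text and binary files could look something like:
$ grep -rl '/home/devacct/anaconda3' /home/prdacct/anaconda3 \
      | xargs sed -i 's|/home/devacct/anaconda3|/home/prdacct/anaconda3|g'
Because the old and new strings are the same length, byte offsets inside compiled binaries are preserved.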
Related
CentOS 7 system, Python 2.7. The OS's installed python has directory
/usr/lib/python2.7/site-packages
and that is where a
python setup.py install
command would install a package. On a computing cluster I would like to install some packages so that they are referenced from that directory but actually reside in an NFS-served directory here:
/usr/common/lib/python2.7/site-packages
That is, I do not want to have to run setup.py on each of the cluster nodes to do a local install on each, duplicating the package on every machine. The packages already installed locally must not be affected; some of those are used by the OS's commands. Also, the local ones must work even if the network is down for some reason. I am not trying to set up a virtual environment; I am only trying to place a common set of packages in a different directory in such a way that the OS-supplied python sees them.
It isn't clear to me what is the best way to do this. It seems like such a common problem that there must be a standard or preferred way of doing this, and if possible, that is the method I would like to use.
This command
/usr/bin/python setup.py install --prefix=/usr/common
would probably install into the target directory. However the "python" command on the cluster nodes will not know this package is present, and there is no "network" python program that corresponds to the shared site-packages.
After the network install one could make symlinks from the local to the shared directory for each of the files created. That would be acceptable, assuming that is sufficient.
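For example, something along these lines, where the package name is just an illustration:
$ cd /usr/common/lib/python2.7/site-packages
$ for f in somepackage*; do ln -s "$PWD/$f" /usr/lib/python2.7/site-packages/; done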
It looks like the PYTHONPATH environment variable might also work here, although I'm unclear about what it expects for "path" (the full path to site-packages, or just the /usr/common part).
EDIT: This does seem to work as needed, at least for the test case. The software package in question was installed using --prefix, as above. PYTHONPATH was not previously defined.
export PYTHONPATH=/usr/common/lib/python2.7/site-packages
python $PATH_TO_START_SCRIPT/start.py
ran correctly.
Thanks.
Using GitHub's .gitignore, I was able to filter out some files and directories. However, there are a few things that left me a little bit confused:
GitHub's .gitignore did not include /bin and /share created by venv. I assumed they should be ignored by git, however, as the user is meant to build the virtual environment themselves.
Pip generated a pip-selfcheck.json file, which seemed mostly like clutter. I assume it usually does this, and I just haven't seen the file before because it's been placed with my global pip.
pyvenv.cfg is what I really can't make any sense of, though. On one hand, it specifies the python version, which ought to be needed for others who want to use the project. On the other hand, it also specifies home = /usr/bin, which, while probably correct on a lot of Linux distributions, won't necessarily apply to all systems.
Are there any other files/directories I missed? Are there any stricter guidelines for how to structure a project and what to include?
Although venv is a very useful tool, you should not assume (unless you have good reason to do so) that everyone who looks at your repository uses it. Avoid committing any files used only by venv; these are not strictly necessary to be able to run your code and they are confusing to people who don't use venv.
The only configuration file you need to include in your repository is the requirements.txt file generated by pip freeze > requirements.txt which lists package dependencies. You can then add a note in your readme instructing users to install these dependencies with the command pip install -r requirements.txt. It would also be a good idea to specify the required version of Python in your readme.
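For example, the round trip might look like this (the venv directory name is arbitrary):
$ pip freeze > requirements.txt          # run inside your activated environment
$ python3 -m venv venv                   # what a user of the repository would do
$ source venv/bin/activate
$ pip install -r requirements.txt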
I really don't get it. When I run npm install in the main folder, why does it have to download all the dependencies into node_modules, and why does this need to be done for every single project? In Sinatra (a Ruby microframework), I never had to do this, and it is easy to use the gems that are installed globally without having to download and save each one of them into the project folder again.
I read somewhere that it is done to avoid version mismatch issues, but if it works to install globally and simply 'require' it in many other languages like Python (which uses virtualenv to tackle version issues), Ruby, etc., why can't it be the same for Node.js?
Whatever happened to DRY?
You can do npm install --global (or -g for short) to install globally. But that creates a problem when two projects reference different versions of the same dependency. You will have conflicts that are hard to trace. Also, installing locally makes the project more portable. You can refer to this document.
This problem does not exist only in the node.js world. Different languages just approach it differently. I don't know Ruby well, but:
In Python, people use virtualenv to separate dependencies, which needs more effort.
Java's Maven caches artifacts in the $HOME/.m2 folder, but when compiling a project, it copies the bytecode from the .m2 folder into a local target folder.
There are times you do want npm install --global, though, for tools such as grunt.js/gulp.js.
I just think it has nothing to do with DRY, because you didn't write code twice. It's just downloaded twice.
That being said, you still can install everything globally.
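For example (the package names are only examples):
$ npm install express        # local: lands in this project's node_modules
$ npm install -g gulp-cli    # global: command-line tools shared across projects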
I have looked into other Python module distribution questions. My need is a bit different (I think! I am a Python newbie+).
I have a bunch of python scripts that I need to execute on remote machines. Here is what the target environment looks like;
The machines will have a base Python runtime installed
I will have an SSH account; I can log in or execute commands remotely using ssh
I can copy files (scp) into my home dir
I am NOT allowed to install anything on the machine; the machines may not even have access to the Internet
my scripts may use some 'exotic' python modules -- most likely they won't be present on the target machine
after the audit, my home directory will be nuked from the machine (leave no trace)
So what I like to do is:
copy a directory structure of python scripts + modules to the remote machine (say, into /home/audituser/scripts, with modules copied into /home/audituser/scripts/python_lib)
then execute a script (say /home/audituser/scripts/myscript.py). This script will need to resolve all modules used from the 'python_lib' subdirectory.
Is this possible? Or is there a better way of doing this? I guess what I am looking for is to 'relocate' the 3rd party modules into the scripts dir.
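For example, something along these lines is what I have in mind (the host name is a placeholder):
$ scp -r scripts/ audituser@remotehost:/home/audituser/
$ ssh audituser@remotehost \
      'PYTHONPATH=/home/audituser/scripts/python_lib python /home/audituser/scripts/myscript.py'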
thanks in advance!
Are the remote machines the same as each other? And, if so, can you set up a development machine that's effectively the same as the remote machines?
If so, virtualenv makes this almost trivial. Create a virtualenv on your dev machine, use the virtualenv copy of pip to install any third-party modules into it, build your script within it, then just copy that entire environment to each remote machine.
There are three things that make it potentially non-trivial:
If the remote machines don't (and can't) have virtualenv installed, you need to do one of the following:
In many cases, copying a --relocatable environment over just works. See the documentation section on "Making Environments Relocatable".
You can always bundle virtualenv itself, and pip install --user virtualenv (and, if they don't even have pip, a few steps before that) on each machine. This will leave the user account in a permanently-changed state. (But fortunately, your user account is going to be nuked, so who cares?)
You can write your own manual bootstrapping. See the section on "Creating Your Own Bootstrap Scripts".
By default, you get a lot more than you need—the Python executable, the standard library, etc.
If the machines aren't identical, this may not work, or at least might not be as efficient.
Even if they are, you're still often making your bundle orders of magnitude bigger.
See the documentation sections on Using Virtualenv without bin/python, --system-site-packages, and possibly bootstrapping.
If any of the Python modules you're installing also need C libraries (e.g., libxml2 for lxml), virtualenv doesn't help with that. In fact, you will need the C libraries to be almost exactly the same (same path, compatible version).
Three other alternatives:
If your needs are simple enough (or the least-simple parts involve things that virtualenv doesn't help with, like installing libxml2), it may be easier to just bundle the .egg/.tgz/whatever files for third-party modules, and write a script that does a pip install --user and so on for each one, and then you're done.
Just because you don't need a full app-distribution system doesn't mean you can't use one. py2app, py2exe, cx_freeze, etc. aren't all that complicated, especially in simple cases, and having a click-and-go executable to copy around is even easier than having an explicit environment.
zc.buildout is an amazingly flexible and manageable tool that can do the equivalent of any of the three alternatives. The main downside is that there's a much, much steeper learning curve.
You can use virtualenv to create a self-contained environment for your project. This can house your own script, as well as any dependency libraries. Then you can make the env relocatable (--relocatable), and sync it over to the target machine, activate it, and run your scripts.
If these machines do have network access (not internet, but just local network), you can also place the virtualenv on a shared location and activate from there.
It looks something like this:
virtualenv --no-site-packages portable_proj
cd portable_proj/
source bin/activate
# install some deps
pip install xyz
virtualenv --relocatable .
Now portable_proj can be distributed to other machines.
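For example, syncing and running it on a target machine might look like this (the host and script names are placeholders):
$ rsync -a portable_proj/ user@target:~/portable_proj/
$ ssh user@target 'source ~/portable_proj/bin/activate && python ~/myscript.py'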
Just curious how people are deploying their Django projects in combination with virtualenv.
More specifically, how do you keep your production virtualenv's synched correctly with your development machine?
I use git for scm but I don't have my virtualenv inside the git repo - should I, or is it best to use the pip freeze and then re-create the environment on the server using the freeze output? (If you do this, could you please describe the steps - I am finding very little good documentation on the unfreezing process - is something like pip install -r freeze_output.txt possible?)
I just set something like this up at work using pip, Fabric and git. The flow is basically like this, and borrows heavily from this script:
In our source tree, we maintain a requirements.txt file. We'll maintain this manually.
When we do a new release, the Fabric script creates an archive based on whatever treeish we pass it.
Fabric will find the SHA for what we're deploying with git log -1 --format=format:%h TREEISH. That gives us SHA_OF_THE_RELEASE
Fabric will get the last SHA for our requirements file with git log -1 --format=format:%h SHA_OF_THE_RELEASE requirements.txt. This spits out the short version of the hash, like 1d02afc, which is the SHA for that file for this particular release.
The Fabric script will then look into a directory where our virtualenvs are stored on the remote host(s).
If there is not a directory named 1d02afc, a new virtualenv is created and set up with pip install -E /path/to/venv/1d02afc -r /path/to/requirements.txt
If there is an existing path/to/venv/1d02afc, nothing is done
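In shell terms, those two steps roughly amount to the following (paths are illustrative, and the pip install -E invocation is the one from the step above):
REQ_SHA=$(git log -1 --format=format:%h "$SHA_OF_THE_RELEASE" -- requirements.txt)
if [ ! -d "/path/to/venv/$REQ_SHA" ]; then
    pip install -E "/path/to/venv/$REQ_SHA" -r /path/to/requirements.txt
fi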
The little magic part of this is passing whatever tree-ish you want to git, and having it do the packaging (from Fabric). By using git archive my-branch, git archive 1d02afc or whatever else, I'm guaranteed to get the right packages installed on my remote machines.
I went this route since I really didn't want to have extra virtualenvs floating around if the packages hadn't changed between releases. I also don't like the idea of having the actual packages I depend on in my own source tree.
I use this bootstrap.py: http://github.com/ccnmtl/ccnmtldjango/blob/master/ccnmtldjango/template/bootstrap.py
which expects a directory called 'requirements' that looks something like this: http://github.com/ccnmtl/ccnmtldjango/tree/master/ccnmtldjango/template/requirements/
There's an apps.txt, a libs.txt (which apps.txt includes--I just like to keep Django apps separate from other Python modules) and a src directory which contains the actual tarballs.
When ./bootstrap.py is run, it creates the virtualenv (wiping a previous one if it exists) and installs everything from requirements/apps.txt into it. I do not ever install anything into the virtualenv otherwise. If I want to include a new library, I put the tarball into requirements/src/, add a line to one of the textfiles and re-run ./bootstrap.py.
bootstrap.py and requirements get checked into version control (also a copy of pip.py, so I don't even have to have that installed system-wide anywhere). The virtualenv itself isn't. The scripts that I have that push out to production run ./bootstrap.py on the production server each time I push. (bootstrap.py also goes to some lengths to ensure that it's sticking to Python 2.5, since that's what we have on the production servers (Ubuntu Hardy), and my dev machine (Ubuntu Karmic) defaults to Python 2.6 if you're not careful.)