We have a relatively large number of machines in a cluster that run a certain Python-based software package for computation in an academic research environment. To keep the code base up to date, we use a build server that makes the current code base available in a directory each time we update a dedicated deployment tag on our Mercurial server. Each machine in the cluster runs a daily rsync script that simply synchronises with the deployment directory on the build server (if there is anything to sync) and restarts the process if the code base was updated.
I find this approach a bit dated and slightly over-engineered, and I would like to optimise it in the following way:
Get rid of the build server, as all it actually does is clone the latest code base that has a certain tag attached - it doesn't compile or run any additional checks (such as tests) on the code base at all. This would also reduce some pain for us, as it would be one less server to maintain and worry about.
Instead of having the build server, I would like to pull straight from our Mercurial server, which hosts the code already. This would remove the need to duplicate the code base each time we update the deployment tag.
I have done a bit of reading on how to install/deploy Python-based software with pip (e.g., How to point pip at a Mercurial branch?). It seems to be the right choice, as it supports installing packages straight from a code repository. However, I ran into a few problems that I need help with. My requirements are as follows:
Use Mercurial as a source.
Automated background process to update and install into a custom directory on the file system.
Only pull and update from the repository if there is a new version available.
The following command seems to almost do what I need:
pip install -e hg+https://authkey:anypw@mymercurialserver.hostname.com/Code/package@deployment#egg=package --upgrade --src ~/proj
It pulls the package from the Mercurial server, picks the code base with the tag "deployment" and installs it into proj inside the user's home directory.
The problem, however, is that regardless of whether an update is available or not, pip always uninstalls package and reinstalls it. This makes it difficult to decide whether the process needs to be restarted if nothing actually changed. In addition, pip always gets stuck with the message "hg clone in ./proj/yarely exists with URL..." and asks me: "What to do? (s)witch, (i)gnore, (w)ipe, (b)ackup". This is not ideal, as (1) it would be an automated process without a user prompt, and (2) it should only pull the repository if there was an update in the first place, to reduce traffic on the network and not overload our Mercurial server. I believe that a pull instead of a clone, whenever a local copy of the repository already exists, would be more appropriate and could potentially solve the problem.
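For illustration, here is a rough sketch of the pull-then-update behaviour I have in mind, outside of pip (untested; the repository path and the restart command are placeholders for whatever the cluster machines actually use):

# Rough outline only -- the local repo path and the restart command are placeholders.
import subprocess

REPO = "/home/user/proj/package"   # existing local clone of the Mercurial repository

def hg(*args):
    # Run an hg command against REPO and return its trimmed output.
    return subprocess.check_output(["hg", "-R", REPO, *args], text=True).strip()

before = hg("id", "-i")            # changeset hash currently checked out
hg("pull")                         # fetch new changesets only; no fresh clone
hg("update", "-r", "deployment")   # move the working copy to the deployment tag
after = hg("id", "-i")

if before != after:                # only reinstall and restart when something changed
    subprocess.check_call(["pip", "install", "-e", REPO])
    subprocess.check_call(["systemctl", "restart", "computation.service"])  # placeholder

Since hg pull only transfers new changesets, this keeps network traffic down, and the reinstall/restart only happens when the changeset under the deployment tag has actually moved.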
I wasn't able to find an elegant solution to this problem. Does anyone have a pointer or suggestion on how this could be achieved?
Related
I am responsible for maintaining a Python environment that is frequently deployed into production. To ensure that the library's dependencies work well together, I use pip to install all the required Python packages and, following best practices, I pip freeze all of my dependencies into a requirements.txt file.
The benefit of this approach is that I have a very stable environment that is unlikely to break due to a package issue. However, the drawback is that my environment is static, while these packages are constantly releasing new versions that could improve performance and, most importantly, fix vulnerabilities.
I am wondering what the industry standard is for keeping packages up to date in an easy way, perhaps even in an automated way that detects new updates and tests for any potential issues. For instance, apt has simple apt-get update and apt-get upgrade commands to keep your packages constantly updated and to know when updates are available.
The obvious answer is to just update all packages to the latest official version. My concern is that doing so can break dependencies between packages and leave the environment in a broken state.
Can anyone suggest a similar solution for keeping Python packages up to date while ensuring stability?
Well, I can't say what the industry standard is, but one way I can think of to update the packages periodically is as follows:
Create a requirements.txt file that contains the packages to update.
Create a Python script that contains bash commands to update the packages in requirements.txt.
You can use pip freeze to update requirements.txt.
Create a cron job or use the Python schedule library to trigger a periodic update (sketched below).
For examples, you can refer to:
a cron job on Mac: https://www.jcchouinard.com/python-automation-with-cron-on-mac/
the schedule library in Python: https://www.geeksforgeeks.org/python-schedule-library/
The above can solve the update issue but may not solve the compatibility issue.
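For example, a minimal sketch of the steps above (assuming the third-party schedule library is installed; the requirements path and the weekly 03:00 run are arbitrary placeholders):

# Periodic dependency update -- a sketch only; adjust the path and the schedule to taste.
import subprocess
import time
import schedule  # third-party: pip install schedule

REQUIREMENTS = "/path/to/requirements.txt"  # placeholder

def update_packages():
    # Upgrade everything listed in requirements.txt, then re-freeze the result.
    subprocess.check_call(["pip", "install", "--upgrade", "-r", REQUIREMENTS])
    frozen = subprocess.check_output(["pip", "freeze"], text=True)
    with open(REQUIREMENTS, "w") as f:
        f.write(frozen)

schedule.every().sunday.at("03:00").do(update_packages)

while True:
    schedule.run_pending()
    time.sleep(60)

A cron job would replace the while loop with a single crontab entry that runs the update script directly.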
The important thing is not so much staying current as not having vulnerabilities that may harm your customers and your company in any way.
The industry standard is to use Snyk, Safety, Dependency-Track, Sonatype Lifecycle, etc. to monitor package/module vulnerabilities, together with a Continuous Integration (CI) system that supports this, e.g. Docker containers in Kubernetes.
Deploy the exact version you are running in prod and run vulnerability scans.
(All CI systems make this easy)
(It is important to do this on the running code because requirements.txt most likely never shows everything that will actually be installed.)
The vulnerability scanners (all except OWASP Dependency-Track) cost money if you want to stay current, but in their simplest form they just give you a more refined answer to:
https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=python
You can build your own system based on the Safety source (https://github.com/pyupio/safety) and the CVE list above, and make it fail if any of those versions are found in your environment.
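As a minimal sketch of that idea (the vulnerable-versions mapping below is a made-up placeholder; in practice you would generate it from the CVE list or Safety's database):

# Fail the job if any installed package matches a known-vulnerable version.
# The VULNERABLE mapping is a placeholder -- populate it from CVE/Safety data.
import sys
from importlib.metadata import distributions  # Python 3.8+

VULNERABLE = {
    "examplepkg": {"1.0.0", "1.0.1"},  # hypothetical entries
}

hits = []
for dist in distributions():
    name = (dist.metadata["Name"] or "").lower()
    if dist.version in VULNERABLE.get(name, set()):
        hits.append(name + "==" + dist.version)

if hits:
    print("Vulnerable packages found:", ", ".join(hits))
    sys.exit(1)  # non-zero exit makes the CI step fail

Because it inspects what is actually installed rather than what requirements.txt claims, this also catches transitive dependencies.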
When it comes to requirements.txt listings, we always use >= in favour of ==, and we deploy to test systems every night and run tests on them; if they fail, we manually go through the errors and pin the offending releases.
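For example, a requirements line like somepackage>=1.2 (a hypothetical name) lets the nightly test run pick up new releases automatically, and only becomes a pinned somepackage==1.2.3 once a newer release has been shown to break the tests.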
Generally you build Docker images or similar. (There are other container/VM solutions, but Docker seems to be the most common/popular.) Have apt-get update, apt-get install and pip install -r requirements.txt as part of the Dockerfile, and integrate it into your CI/CD pipeline using GitHub or similar.
That builds the image from scratch using a well-defined process. You then run the VM, including all automated unit and integration tests and flag any problems for the Devs. It's then their job to figure out whether they broke something, whether they need a particular version of some dependency, etc.
Once the automated tests pass, the image binary (including its "frozen" dependencies) gets pushed to a place where humans can interact with it and look for anything weird. Typically there are several such environments variously called things like "test", "integration", and "production". The names, numbers, and details vary from place to place - but that triplet is fairly common. They're typically defined as a sequence and gating between them is done manually. This might be a typical spec:
Development - This is the first environment where code gets pushed after the automated testing passes. Failures and weird idiosyncrasies are common. If local developers need to replicate a problem or experiment with a patch that can't be handled on their local machines, they put the code here. Images graduate from Development when the basic manual and automated "smoke tests" all pass and the Dev team agrees the code is stable.
Integration - This environment belongs to QA and dedicated software testers. They may have their own set of scripts to run, which may be automated, manual or ad hoc. This is also a good environment for load testing, or for internal red team attacks on the security, or for testing / exploration by internal trusted users who are willing to risk the occasional crash in order to exercise the newest up-and-coming features. QA flags any problems or oddities and sends issues back to the devs. Images graduate from Integration when QA agrees the code is production-quality.
Production - This environment is running somewhat older code since everything has to go through Dev & QA before reaching it, but it's likely to be quite robust. The binary images here are the only ones that are accessible to the outside world, and the only ones carefully monitored by the SOC.
Since VM images are only compiled once, just prior to sending the code to Dev, all three environments should have the same issues. The problem then becomes getting code through the environments in sequence and doing all the necessary checks in each before the security patches become too outdated... or before the customers get tired of waiting for the new features / bug-fixes.
There are a lot of details in the process, and many questions & answers about it on Stack Overflow's sister site for network administration.
I have jupyter-notebook running on my own Mac with the calysto-processing library plugged in, so I can run Processing scripts in a notebook in a browser tab. But I am trying to run all of this in binder so that I can share my Processing scripts with students during class. I created a GitHub repository and have it linked to a binder; the binder builds and launches, but the only kernel available is Python 3.
I have read that I can include a bunch of configuration files, but I'm new to these and I don't see any examples that bring in the calysto-processing kernel, so I'm unsure how to proceed.
Screenshot of my binder with the jupyter-notebook running a processing script - but when you click on Kernels, the only kernel it shows is Python:
Any help would be appreciated.
Very good question. Ayman's suggestion is good.
I've just installed calysto_processing and noticed three things are necessary:
installing the calysto_processing package via pip,
running install on the calysto_processing package,
installing Processing.
The first point should be easy with requirements.txt.
I'm unsure what the best option is for the second step (maybe a custom setup.py?).
Step 3 feels the trickiest.
Installing Processing currently isn't supported with apt-get, so a Dockerfile might be the way forward (even though mybinder recommends that only as a last resort).
Even assuming a Dockerfile contained all the steps to manually download/install Processing (and I'm not super experienced with Docker at the moment, btw), it would need to be executed, which would require a windowing system to render the Processing window.
I don't know how well that plays with Docker; it sounds like it's getting into virtual machine territory.
That being said, looking at the source code right here:
Processing is used only to validate the sketch and pull out syntax errors to display them, otherwise
ProcessingJS is used to actually render the Processing code in a <canvas/> element within the Jupyter Notebook.
I'm not sure what the easiest way is to run the current calysto_processing in mybinder as is.
My pragmatic (even hacky if you will) suggestion is to:
fork the project and remove the processing-java dependency (which means you might lose error checking)
install the cloned/tweaked version via pip/requirements.txt (pip can install a package from a github repo)
Update: I have tried the above; you can run a test kernel here.
The source is here and the module is installed from this fork which simply comments out the processing-java part.
In terms of the mybinder configuration it boils down to:
create a binder folder in the repo containing the notebook
add a requirements.txt which points to the tweaked version of calysto_processing, stripped of the processing-java dependency: git+https://github.com/orgicus/calysto_processing.git#hotfix/PJS-only-test
add postBuild file which runs install on the calysto_processing module: python -m calysto_processing install --user
Notes
With this workaround, Java error checking is gone.
Although Processing syntax is used, it's executed as JavaScript and rendered in <canvas/> using ProcessingJS: this means no processing-java libraries, no threads or other Java-specific features (buggy or no 3D), etc. - just basic Processing drawing sketches.
It might be worth looking at replacing ProcessingJS with p5.js and checking out other JS notebooks (e.g. Observable or IJavascript).
I am deploying an existing Django application to a live server. It's a very old site that has not been updated in 3+ years.
I was hired on contract to bring it up to date, which included upgrading the Django version to be current. This broke many things on the site that had to be repaired.
I did a test deployment and it went fine. Now it is time for deployment to live and I am having some issues....
The first thing I was going to do was keep a log of the current Django version on the server, in case of any issues we need to roll back. I tried opening a Python prompt and importing Django to find the version number, and it said Django was not found.
I was looking further and found the version in a pip requirements.txt file.
Then I decided to update the actual Django version on the server. The update went through smoothly. Then I checked the live site, and everything was unchanged (with the old files still in place). Most of the site should have been broken. It was not recognizing any changes in Django.
I am assuming the reason for this might be that the last contractor used virtualenv, and that's why it is not recognizing Django, and why the Django update is not doing anything to the live site?
That is the only reason I could come up with to explain this issue: since there is a pip requirements.txt file, he likely installed Django with pip, which means Python should otherwise recognize the path to Django.
So then I was going to try to find the source path for the virtualenv with the command "lsvirtualenv". But when I do that, even that gives me a "command not found" error.
My only guess is that this was an older version of virtualenv that does not have this command? If that is not the case, I'm not sure what is going on.
Any advice for how I find the information I need to update the package versions on this server with the tools I have access to?
Create your own virtualenv
If all else fails, just recreate the virtualenv from the requirements.txt and go from there.
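In its simplest form that is something along the lines of python3 -m venv /path/to/new_env followed by /path/to/new_env/bin/pip install -r requirements.txt (paths are placeholders), after which you point whatever launches the site at the new environment.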
Find out how the old app was being launched
If you insist on finding the old one, IMO the most direct way is to find out how the production Django app is being run. Look for bash scripts that start it, supervisor entries, etc.
If you find how it starts, then you can pinpoint the environment it is launched in (e.g. which virtualenv).
Find the virtualenv by searching for common files
Other than that, you can use the find or locate command to search for files we know to exist in a virtualenv, like lib/pythonX.Y/site-packages, bin/activate or bin/python.
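If you would rather do that search from Python instead of with find, a rough equivalent (the search roots are guesses; widen them to match the server's layout) might be:

# Look for virtualenv marker files under a few likely locations.
import os

roots = ["/home", "/srv", "/opt", "/var/www"]  # guesses -- adjust for your server
for root in roots:
    for dirpath, dirnames, filenames in os.walk(root):
        if "activate" in filenames and os.path.basename(dirpath) == "bin":
            print("possible virtualenv:", os.path.dirname(dirpath))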
Why not start by checking what processes are actually running, and with what command line, using ps auxf or something of the sort? Then you know if it's nginx+uwsgi or django-devserver or whatever, and maybe even see the virtualenv path, if it's being launched very manually. Then look at the config file of the server you find.
Alternatively, look around using netstat -taupen, for example, to see which processes are listening on which ports. This makes even more sense if there's a reverse proxy like nginx running, and you can see what it's proxying to.
The requirements.txt I'd ignore completely. You'll get the same, but correct information from the virtualenv once you activate it and run pip freeze. The file's superfluous at best, and misleading at worst.
Btw, if this old contractor compiled and installed a custom Python, (s)he might not even have used a virtualenv, while still avoiding the system libraries and PYTHONPATH. Unlikely, but possible.
So I've been working on a project.
I've reached the stage where everything works.
The only thing that's left for me to do is to deploy/distribute the project so I'll be able to install and run it on my own machines with minimal effort. Since this project is for personal use only, I wasn't planning on distributing it and putting it on PyPI.
So what is the best distribution scheme given this?
Is it to install using distutils and setup.py?
My original plan was to pull the entire project from my git repository onto the host machine and automatically install the dependencies (which I understand can be done using pip and a requirements file). Does this make more sense?
I'd appreciate some guidance since I've never done this before.
I have a number of Python "script suites" (as I call them) which I would like to make easy to install for my colleagues. I have looked into pip, and that seems really nice, but under that scheme (as I understand it) I would have to submit a static version and update it on every change.
As it happens I am going to be adding and changing a lot of stuff in my script suites along the way, and whenever someone installs it, I would like them to get the newest version. With pip, that means that on every commit to my repository, I will also have to re-submit a package to the PyPI index. That's a lot of unnecessary work.
Is there any way to provide an easy cross-platform installation (via pip or otherwise) which pulls the files directly from my github repo?
I'm not sure if I understand your problem entirely, but you might want to use pip's editable installs [1].
Here's a brief, artificial example; let's suppose you want to use git as your VCS.
git clone url_to_myrepo.git path/to/local_repository
pip install [--user] -e path/to/local_repository
The installation of the package will reflect the state of your local repository. Therefore there is no need to reinstall the package with pip when the remote repository gets updated. Whenever you pull changes to your local repository, the installation will be up-to-date as well.
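For what it's worth, pip can also install straight from the remote repository without keeping a local clone, with something along the lines of pip install git+https://github.com/youruser/yourrepo.git#egg=yourpackage (URL and egg name are placeholders); colleagues would then simply re-run that command whenever they want to pick up your newest commits.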
[1] http://pip.readthedocs.org/en/latest/reference/pip_install.html#editable-installs