Creating a local python pip repo

I have been tasked with creating and using a local pip repository.
(The reason is that we'll be using Python 2.7 for at least one more year, and we fear that packages or older versions may be removed from the public index.)
I am looking at bandersnatch, and it is not clear to me whether it is an online mirroring tool that I would need to run as a service, or whether it can also be used to take a one-off copy.
I'd prefer the second option (I don't want to complicate the system unnecessarily) and would be satisfied with running an update, say, daily or even weekly.
An alternative approach would be to download only the packages and versions we actually use, based on the requirements.txt file, but this would require running an update every time a developer wants to add or update a package.
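For that requirements-based approach, pip itself can take a one-off snapshot without any extra tooling. A minimal sketch (the directory name is a placeholder; for Python 2.7 packages the commands would have to be run with a Python 2.7 pip):
# on a machine with internet access
pip download -r requirements.txt -d ./pkg-mirror
# on the offline machines, against the shared directory
pip install --no-index --find-links=./pkg-mirror -r requirements.txt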

A way to create a local Python package repository is through Sonatype Nexus. With Nexus you can create several kinds of repositories:
Hosted repo (your own, internal repository)
Proxy repo (a proxy for other repositories)
Group repo (groups and priority-sorts a list of hosted and proxied repos)
For example, you can create a group repo with the following lookup order:
- First, search for the package in your own repo
- If it does not exist there, search for it in the global public repo
This is transparent to your application.
https://help.sonatype.com/repomanager3/formats/pypi-repositories
There is also a Docker image if you want it: https://hub.docker.com/r/sonatype/nexus3
I have used it before for different purposes and find it very mature and complete.
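To make such a group repo the default index for pip, you can point pip at its PyPI endpoint. A hedged sketch (the hostname and repository name are placeholders; the exact path depends on how the repository is named in Nexus):
# ~/.pip/pip.conf (or ~/.config/pip/pip.conf on newer pip)
[global]
index-url = https://nexus.example.com/repository/pypi-group/simple
Alternatively, pass it per call: pip install --index-url https://nexus.example.com/repository/pypi-group/simple <package>.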

A script that generates a simple repository with the N most recent versions of the 4000 most-used packages on PyPI. An advantage is that it can hold multiple versions, as PyPI does. https://gist.github.com/harisankar-krishna-swamy/cac5d1e6c1ae074b39286c1336bff63d

Related

How to have python libraries already installed in python project?

I am working on a python project that requires a few libraries. This project will further be shared with other people.
The problem I have is that I can't use the usual pip install 'library' as part of my code because the project could be shared with offline computers and the work proxy could block the download.
So what I first thought of was installing .whl files and running pip install 'my_file.whl', but this is limited since some .whl files work on some computers but not on others, so this couldn't be the solution to my problem.
I tried sharing my project with another person and had an error with a .whl file working on one computer but not the other.
What I am looking for is to have all the libraries I need already downloaded before sharing my project, so that when the project is shared, the peers can launch it without needing to download the libraries.
Is this possible, or is there something else that can solve my problem?
There are different approaches to the issue here, depending on what the constraints are:
1. Defined Online Dependencies
It is good practice to define the dependencies of your project (not only when it is shared). Python offers different methods for this.
In this scenario every developer has access to a PyPI repository via the network, usually the official main mirrors (i.e., via the internet). New packages need to be pulled individually from there whenever there are changes.
Repository (internet) access is only needed when pulling new packages.
Below are the most common ones:
1.1 requirements.txt
The requirements.txt is a plain text list of required packages and versions, e.g.
# requirements.txt
matplotlib==3.6.2
numpy==1.23.5
scipy==1.9.3
When you check this in along with your source code, users can freely decide how to install it. The simplest (though most clutter-prone) way is to install it into the base Python environment via
pip install -r requirements.txt
You can even generate such a file automatically with pipreqs if you have lost track. The result is usually very good; however, a manual cleanup afterwards is recommended.
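For example (a short sketch; pipreqs scans the imports in the given directory and writes a requirements.txt next to the code):
pip install pipreqs
pipreqs /path/to/project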
Benefits:
Package dependency is clear
Installation is a one line task
Downsides:
Possible conflicts with multiple projects
No guarantee that everyone has the exact same versions if version flexibility is allowed (the default)
1.2 Pipenv
There is a nice and almost complete answer about Pipenv elsewhere on this site. The Pipenv documentation itself is also very good.
In a nutshell: Pipenv allows you to have virtual environments; thus, version conflicts between different projects are gone for good. Also, the Pipfile used to define such an environment allows separation of production and development dependencies.
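An illustrative Pipfile sketch (the packages and Python version are placeholders, not part of the original answer):
# Pipfile
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
requests = "*"

[dev-packages]
pytest = "*"

[requires]
python_version = "3.10"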
Users now only need to run the following commands in the folder with the source code:
pip install pipenv # only needed first time
pipenv install
And then, to activate the virtual environment:
pipenv shell
Benefits:
Separation between projects
Separation of development/testing and production packages
Everyone uses the exact same version of the packages
Configuration is flexible but easy
Downsides:
Users need to activate the environment
1.3 conda environment
If you are using anaconda, a conda environment definition can be also shared as a configuration file. See this SO answer for details.
This scenario is the same as the Pipenv one, but with Anaconda as the package manager. It is recommended not to mix pip and conda.
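A minimal environment.yml sketch (names and versions are placeholders):
# environment.yml
name: myproject
channels:
  - defaults
dependencies:
  - python=3.10
  - numpy=1.23
Users then run conda env create -f environment.yml once and conda activate myproject afterwards.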
1.4 setup.py
If you are implementing a library anyway, you will want to have a look at how to configure its dependencies via the setup.py file.
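A minimal sketch of declaring dependencies that way (the project name and version pins are placeholders):
# setup.py
from setuptools import setup, find_packages

setup(
    name="myproject",
    version="0.1.0",
    packages=find_packages(),
    install_requires=[
        "numpy>=1.20",
        "scipy",
    ],
)
Running pip install . then resolves and installs these dependencies together with the library itself.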
2. Defined local dependencies
In this scenario the developers do not have access to the internet (e.g., they are "air-gapped" in a special network where they cannot communicate with the outside world). In this case all the scenarios from 1. can still be used, but now we need to set up our own mirror/proxy. There are good guides (and even complete off-the-shelf software) out there, depending on the scenario (above) you want to use. Examples are:
Local Pypi mirror [Commercial solution]
Anaconda behind company proxy
Benefits:
Users don't need internet access
Packages on the local proxy can be trusted (cannot be corrupted / deleted anymore)
The clean and flexible scenarios from above can be used for setup
Downsides:
Network connection to the proxy is still required
Maintenance of the proxy is extra effort
3. Turn key environments
Last, but not least, there are solutions to share the complete and installed environment between users/computers.
3.1 Copy virtual-env folders
If (and only if) all users (are forced to) use an identical setup (OS, install paths, user paths, libraries, locales, ...), then one can copy the virtual environments for pipenv (1.2) or conda (1.3) between PCs.
These "pre-compiled" environments are very fragile, as a small change can cause the setup to malfunction. So this is really not recommended.
Benefits:
Can be shared between users without network (e.g. USB stick)
Downsides:
Very fragile
3.2 Virtualisation
The cleanest way to support this is some kind of virtualisation technique (virtual machine, docker container, etc.).
Install python and the dependencies needed and share the complete container.
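A minimal Dockerfile sketch (the base image tag and the main.py entry point are placeholders):
# Dockerfile
FROM python:3.11-slim
WORKDIR /app
# install dependencies first so Docker can cache this layer
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "main.py"]
Build once with docker build -t myproject . and share or run the resulting image.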
Benefits:
Users can just use the provided container
Downsides:
Complex setup
Complex maintenance
Virtualisation layer needed
Code and environment may become convoluted
Note: This answer is compiled from the summary of (mostly my) comments

Automated deployment and update of Python-based Software with pip

We have a relatively large number of machines in a cluster that run a certain Python-based software for computation in an academic research environment. In order to keep the code base up to date, we are using a build server which makes the current code base available in a directory each time we update a dedicated deployment tag on our Mercurial server. Each machine in the cluster runs a daily rsync script that simply synchronises with the deployment directory on the build server (if there is anything to sync) and restarts the process if the code base was updated.
Now, I find this approach a bit dated and slightly overkill, and would like to optimise it in the following way:
Get rid of the build server as all it actually does is clone the latest code base that has a certain tag attached - it doesn't actually compile or do any additional checks (such as testing) on the code base at all. This would also reduce some pain for us as it'd be one less server to maintain and worry about.
Instead of having the build server, I would like to pull straight from our Mercurial server, which hosts the code already. This would reduce the need to duplicate the code base each time we update the deployment tag.
Now I had a bit of a read before on how to install / deploy Python-based software with pip (e.g., How to point pip at a Mercurial branch?). It seems to be the right choice as it supports installing packages straight from a code repository. However, I ran into a few problems that I would need help with. The requirements I have are as follows:
Use Mercurial as a source.
Automated background process to update and install into a custom directory on the file system.
Only pull and update from the repository if there is a new version available.
The following command seems to almost do what I need:
pip install -e hg+https://authkey:anypw@mymercurialserver.hostname.com/Code/package@deployment#egg=package --upgrade --src ~/proj
It pulls the package from the Mercurial server, picks the code base with the tag "deployment" and installs it into proj inside the user's home directory.
The problem, however, is that regardless of whether there is an update available or not, pip always uninstalls the package and reinstalls it. This makes it difficult to decide whether the process needs to be restarted if nothing actually changed. In addition, pip always gets stuck with the message that an hg clone in ./proj/yarely exists with URL... and asks me: What to do? (s)witch, (i)gnore, (w)ipe, (b)ackup. This is not ideal, as (1) it would be an automated process without a user prompt, and (2) it should only pull the repository if there was an update in the first place, to reduce traffic in the network and not overload our Mercurial server. I believe that in this case a pull instead of a clone, if there is already a local copy of the repository, would be more appropriate and potentially solve the problem.
I wasn't able to find an elegant and nice solution to this problem. Does anyone have a pointer or suggestion how this could be achieved?
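One possible direction (not from the original thread, just a hedged sketch with placeholder paths and URL): drive the clone/pull with hg directly and only reinstall and restart when hg incoming reports new changesets, e.g.:
# update.sh (sketch)
REPO=~/proj/package
URL=https://mymercurialserver.hostname.com/Code/package

if [ ! -d "$REPO" ]; then
    hg clone -u deployment "$URL" "$REPO"   # first run: clone and check out the tag
    pip install -e "$REPO"
elif hg -R "$REPO" incoming -q "$URL" > /dev/null; then   # exit status 0 only if new changesets exist
    hg -R "$REPO" pull "$URL"
    hg -R "$REPO" update -r deployment
    pip install -e "$REPO"
    # restart the service here
fi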

Contributing to a repository on GitHub on a new branch

Say someone owns a repository with only one branch, master, hosting code that is compatible with Python 2.7.x. I would like to contribute to that repository with my own changes on a new branch new_branch, to offer a variant of the repository that is compatible with Python 3.
I followed the steps here:
I forked the repository on GitHub on my account
I cloned my fork on my local machine
I created a new branch new_branch locally
I made the relevant changes
I committed and pushed the changes to my own fork on GitHub
I went on the browser to the GitHub page of the official repository, and asked for a pull request
The above worked, but it did a pull request from "my_account:new_branch" to "official_account:master". This is not what I want, since Python 2.7.x and Python 3 are incompatible with each other. What I would like to do is create a PR to a new branch on the official repository (e.g. with the same name "new_branch"). How can I do that? Is this possible at all?
You really don't want to do things this way. But first I'll explain how to do it, then I'll come back to explain why not to.
Using Pull Requests at GitHub has a pretty good overview, in particular the section "Changing the branch range and destination repository." It's easiest if you use a topic branch, and have the upstream owner create a topic branch of the same name; then you just pull down the menu where it says "base: master" and the choice will be right there, and he can just click the "merge" button and have no surprises.
So, why don't you want to do things this way?
First, it doesn't fit the GitHub model. Topic branches that live forever in parallel with the master branch and have multiple forks make things harder to maintain and visualize.
Second, you need both a git URL and an https URL for your code. You need people to be able to share links, pip install from the top of the tree, just clone the repo instead of cloning and then checking out a different branch, etc. This all means your code has to be on the master branch.
Third, if you want people to be able to install your 3.x version off PyPI, find docs at readthedocs, etc., you need a single project with a single source tree. Most such sites have a single latest version, not a latest version for each Python version, and definitely not multiple variations of the same version. (You could instead completely fork the project and create a separate foo3 project. But it's much easier for people to be able to pip install foo than to have them try that, fail, come to SO and ask why it doesn't work, and get told they probably have Python 3 and need to pip install foo3 instead.)
How do you merge two versions into a single package? The porting docs should have the most up-to-date advice, but briefly: If it's at all possible to create a single codebase that runs on both versions, that's ideal; if not, and if you can't make things work by running 2to3 or 3to2 at install time, create a parallel directory for the 3.x code (e.g., a foo3 alongside foo) and pick the appropriate directory at install time. (You can always start with that and gradually work toward a unified codebase.)
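A hedged setup.py sketch of that "pick the directory at install time" idea, using the foo/foo3 names from above (a sketch, not a drop-in recipe):
# setup.py (fragment)
import sys
from setuptools import setup

# serve the package "foo" from the foo3/ directory under Python 3, from foo/ under Python 2
package_dir = {"foo": "foo3"} if sys.version_info[0] >= 3 else {"foo": "foo"}

setup(
    name="foo",
    version="1.0",
    packages=["foo"],
    package_dir=package_dir,
)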

Query yum database for package availability

I am working on consolidating a set of RPM packages into a new, more organized set of Yum repositories. I have already repackaged a subset of them by hand and uploaded them into the repositories, but I have a much larger set that either build and package automatically, or have newer versions that are available from third party sources.
I need to be able to, given a package name (and optionally a list of repository ids), programmatically check whether it is already available, and if not, upload it into the repository. I have played around with repoquery and yum search, but neither seems sufficiently scriptable for my purposes.
I have a similar requirement for one of my scripts. I use the repoquery command to check and see if a particular package/version exists in the remote repository.
Using the command below, you can easily see whether a particular package (and all its versions) exists.
repoquery --repoid=<myrepository_id> --qf="package|%{name}|%{version}|%{release}|%{arch}" <packagename_of_interest>
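A hedged shell sketch of the resulting check-then-upload logic (the repository id and the upload step are placeholders):
# check_and_upload.sh (sketch)
PKG="$1"
REPOID="myrepository_id"

if repoquery --repoid="$REPOID" --qf="%{name}-%{version}-%{release}.%{arch}" "$PKG" | grep -q .; then
    echo "$PKG already present in $REPOID"
else
    echo "$PKG not found -- uploading"
    # copy the RPM into the repository and re-run createrepo here
fi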

python packages: how to depend on the latest version of a separate package

I'm developing a test Django site, which I keep in a Bitbucket repository in order to be able to deploy it easily on a remote server and possibly share development with a friend. I use hg for version control.
The site depends on a 3rd-party app (django-registration), which I needed to customize for my site, so I forked the original app and created a second repository for it (the idea being that this way I can keep up with updates to the original, which wouldn't be possible if I just pasted the code into my main site, plus add my own custom code). (You can see some more details in this question.)
My question is, how do I specify requirements on my setup.py file so that when I install my django site I get the latest version of my fork for the 3rd party app (I use distribute rather than setuptools in case that makes a difference)?
I have tried this:
install_requires = ['django', 'django-registration'],
dependency_links = ['https://myuser#bitbucket.org/myuser/django-registration#egg=django_registration']
but this gets me the latest named version on the original trunk (so not even the tip version)
Using a pip requirements file however works well:
hg+https://myuser#bitbucket.org/myuser/django-registration#egg=django-registration
gets me the latest version from my fork.
Is there a way to get this same behaviour directly from the setup.py file, without having to install first the code for the site, then running pip install -r requirements.txt?
This question is very informative, but seems to suggest I should depend on version 'dev' of the 3rd-party package, which doesn't work (I guess there would have to be a specific version tagged as dev for that).
Also, I'm a complete newbie in packaging / distribute / setuptools, so don't hold back spelling out the steps :)
Maybe I should change the setup.py file in my fork of the 3rd-party app and make sure it mentions a version number. Generally I'm curious to know what a source distribution is, as opposed to simply having my code in a public repository, what a binary distribution would be in my case (an egg file?), and whether that would be any more practical for me when deploying remotely / having my friend deploy on his PC. I would also like to know how to tag a version in my repository for setup.py to refer to it; is it simply a version control tag (hg in my case)? Feel free to comment on any details you think are important for the starter packager :)
Thanks!
Put this:
dependency_links=['https://bitbucket.org/abraneo/django-registration/get/tip.tar.gz#egg=django-registration']
In dependency_links you have to pass a download URL like that one.
"abraneo" is another user who forked this project too; replace his name with yours.
