What's going on with Go applications on PyPI?

Rather by accident, I found myself in a situation in a previous role where the previous admin had apparently installed "Python bindings" for InfluxDB and Docker-Compose, and magically both applications were available on the systems, while I was sure that they were written in Go.
I had a few issues with that:
It's incomprehensible what happens here: there should be some Go binary belonging to the application, but I can't find it by name. I doubt that docker-compose and influxdb have been rewritten in Python just to have one more option available, especially since static docker-compose binaries are available on GitHub for direct download. It doesn't make a lot of sense to me.
Undermining security guidelines set by the organization and best practices for systems administration.
Dependency Confusion
Links to the packages on PyPI:
https://pypi.org/project/influxdb/
https://pypi.org/project/docker-compose/
I haven't looked into Python wheels and packaging beyond Debian packaging before; I just got curious enough to get to the bottom of this strange usage pattern.
Docker-Compose refers to https://github.com/docker/compose, a project consisting of 95.5% Go code according to GitHub, which isn't really helpful since the source package and the wheel package on PyPI look completely different, and at first sight I'm overwhelmed by the amount of Python files. InfluxDB seems to be a better example, but I would really appreciate help from a Python developer or package maintainer explaining to me what is happening there. Thanks.
Edit 2022-09-10:
From the show notes of Security Now 887: https://www.grc.com/sn/sn-887-notes.pdf
a researcher at Checkmarx noted in a technical report they published last week that "A worrying feature in pip/PyPI allows code to automatically run when developers are merely downloading a package." He added that the feature is alarming because "a great deal of the malicious packages we are finding in the wild use this feature of code execution upon installation to achieve higher infection rates."
With my preexisting misconception about some PyPI packages like docker-compose, that sounded alarming to me.
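As far as I understand it now, the "code execution upon installation" simply refers to the fact that installing a source distribution means executing its setup.py as ordinary Python. A minimal sketch of what that implies (the package name and the print statement are made up for illustration):

# setup.py -- executed when pip builds/installs a source distribution
from setuptools import setup

# Any module-level code runs as soon as the file is executed,
# before and independent of the actual install step.
print("this already runs during 'pip install'")

setup(
    name="example-package",  # hypothetical name
    version="0.1",
)

Wheels, on the other hand, are just unpacked; no setup.py runs when a wheel is installed.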
The following article mentions that compiled libraries from C, Rust, Go and others can be bundled in packages, but not whole applications "hidden" as artifacts, which is what I had assumed: https://realpython.com/python-wheels/
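Still, as far as I can tell, nothing technically stops a project from shipping a prebuilt executable inside a wheel as package data and exposing it through a thin Python wrapper. A hedged sketch of how such a setup.py could look (the package, paths and entry point are invented for illustration, not taken from either project):

# setup.py of a hypothetical package bundling a pre-compiled Go binary
from setuptools import setup

setup(
    name="example-go-tool",                          # hypothetical
    version="0.1",
    packages=["example_go_tool"],
    # ship the compiled binary as package data inside the wheel
    package_data={"example_go_tool": ["bin/example-tool"]},
    # expose a console script whose main() locates and execs the binary
    entry_points={
        "console_scripts": [
            "example-tool = example_go_tool.cli:main",
        ],
    },
)

For what it's worth, neither package seems to do this: the influxdb package on PyPI is the pure-Python client library, and the docker-compose package is the original Python implementation of Compose (the Go code on GitHub is the later rewrite).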

Related

Information on managing large, multi-faceted enterprise Python codebases?

I've googled and googled, but have found almost nothing in the way of discussions or best practices in managing larger enterprise codebases in Python. Here, I'm simply soliciting any and all pointers to such information. Here's some background and some of the questions I'm looking to answer.
We're long-time Java developers, who have solved similar problems to those mentioned below largely using well-established Java best practices, as well as Maven, Ant and a Sonatype Nexus repo.
I'm talking internal software only here. We're not looking to distribute anything Python-based. We've got multiple development groups using Python, each developing sharable utility code libraries, final web applications and stand-alone tools, all in pure Python. Each group has its own Github source repository.
How do we manage our shareable code, both within a group and across groups? Do we create eggs (or something similar) and distribute and install them into the Python system? If so, would we store them in our Nexus repo like our Java jars, or is there a more Python-specific method of internal package distribution? Or do we just share raw code, checking out sources from multiple GitHub repos?
If we share raw code, how do we manage getting the Python searchpath right as we bring together code from multiple repositories?
How do we manage package namespaces when we want our packages to all live in a com.ourcompany base namespace? It seems like Python isn't too happy when you bring together source trees with overlapping namespaces.
How do we manage third party package versioning? I've never seen easy_install or pip passed a version number. How do we lock down third party package versions?
Do tools exist to aid in Python code reviews, CI, regression testing, etc.?
We're relative newbies to Python code, so some of these questions may have fairly obvious answers. Still, I find it surprising that I can't find more information on managing larger Python codebases.
What issues will we encounter that I haven't thought to ask about, or don't yet know enough to even know to ask about?
Any valuable pointers will be greatly appreciated.
Well, I won't even try to answer all those (excellent) questions, but here are a few opinionated pointers which will hopefully help (as someone who works in both worlds, though more Java).
Packaging
If so, would we store them in our Nexus repo like our Java jars, or is there a more Python-specific method of internal package distribution? Or, do we just share raw code, checking out sources from multiple GitHub repos?
Packaging in Python is historically a bit of a mess IMHO, though it feels like it's improving. Distutils is the major / native tool here - I've not used it much; it feels slightly scary in places. In general, also check the recommended tools.
Pip has all but won the war of mindshare, especially when installing 3rd party libraries. I've not solved the local library problem myself, (maybe someone else reading has), but if I were, I'd probably opt for Pip with local/network-disk repos e.g. by installing from wheels.
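A minimal sketch of that local/network-disk approach, assuming a shared directory of pre-built wheels (the paths and the library name below are placeholders):

# build a wheel for an internal library and drop it on a shared disk
python setup.py bdist_wheel
cp dist/ourlib-1.0-py2.py3-none-any.whl /mnt/shared/wheels/

# consumers install from that directory; add --no-index to forbid public PyPI
pip install --find-links=/mnt/shared/wheels/ ourlib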
Another option (which can cause all sorts of hassles itself) is to package in your OS's native packager, be it Debian-style apt or by creating RPMs, etc. Of course, Windows not so much.
Versioning etc
How do we manage third party package versioning? I've never seen easy_install or pip passed a version number.
Pip
Pip definitely supports version specifiers. Turns out Easy Install does too. I suppose many people / smaller projects opt for latest-and-greatest, which of course isn't always as "appropriate" in the enterprise...
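For example (the package versions are arbitrary, the specifier syntax is real):

# on the command line
pip install "Django==1.4.2"
pip install "requests>=1.0,<2.0"

# or pinned in a requirements.txt that lives in version control
Django==1.4.2
requests>=1.0,<2.0
# installed with: pip install -r requirements.txt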
Virtualenv
No discussion of versioning and Python would miss a Python2/3 reference, but I'm sure you're aware of all this already.
More important, then, would be to mention virtualenv. It truly frees you from the mess you can get into testing multiple versions, bearing in mind especially that your (*NIX) operating systems typically rely heavily on Python themselves. It's a big subject, so have a look at the docs.
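The day-to-day workflow is only a handful of commands (the environment name is arbitrary):

# create an isolated environment with its own interpreter and site-packages
virtualenv myproject-env

# activate it (on *NIX; on Windows it's Scripts\activate)
source myproject-env/bin/activate

# anything installed now stays inside the environment
pip install -r requirements.txt

# drop back to the system Python
deactivate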
Developer Tooling
Do tools exist to aid in Python code reviews, CI, regression testing,
etc.?
Code Review
Very much so. Most code review tools are multi-language (it's just a formatting issue really), so just pick your favourite enterprise-friendly one, be it Crucible, GitHub's built-in review tooling, Barkeep, Gerrit, or whatever.
CI
For CI you have almost as many options again. Running Python apps is usually less involved than running Java ones, so most CI systems, though often Java-focused, support Python. (FWIW, we use drone.io for Quod Libet.) Jenkins should have no problem doing this, and it seems people have done so with TeamCity.
However, the "original" or "most Pythonic" is probably Buildbot, but I've not used it personally. Looks a lot newer than I remember, and it had quite a lot of support in the Python community I think...
Testing
For testing, though not quite as mature as JUnit / TestNG, check out the de-facto, JUnit-like unittest module in the standard library, but also (nicer?) alternatives like nose.
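A minimal unittest example for flavour (the module and function are invented):

# test_pricing.py -- picked up by "python -m unittest discover"
import unittest

def net_price(gross, vat_rate):
    return gross / (1 + vat_rate)

class NetPriceTest(unittest.TestCase):
    def test_reverses_vat(self):
        self.assertAlmostEqual(net_price(119.0, 0.19), 100.0)

if __name__ == "__main__":
    unittest.main()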
For higher-level (BDD) testing, try something like Lettuce - as the name implies, heavily inspired by Cucumber - or maybe Behave. I've not tried them, but common opinion is they're less mature than Cucumber / JBehave / Concordion / RSpec etc.

What are the use cases for a Python distribution?

I'm developing a distribution for the Python package I'm writing so I can post it on PyPI. It's my first time working with distutils, setuptools, distribute, pip, setup.py and all that and I'm struggling a bit with a learning curve that's quite a bit steeper than I anticipated :)
I was having a little trouble getting some of my test data files included in the tarball by specifying them in the data_files parameter in setup.py, until I came across a different post here that pointed me toward the MANIFEST.in file. Just then I snapped to the notion that what you include in the tarball/zip (using MANIFEST.in) and what gets installed in a user's Python environment when they do easy_install or whatever (based on what you specify in setup.py) are two very different things; in general there is a lot more in the tarball than actually gets installed.
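To make that distinction concrete (file names here are hypothetical): MANIFEST.in controls what goes into the source tarball, while the arguments to setup() control what actually gets installed.

MANIFEST.in (shipped in the sdist, never installed by itself):

include README.rst LICENSE
recursive-include tests *.py
recursive-include docs *.rst

setup.py (only what setup() names ends up installed):

from setuptools import setup, find_packages

setup(
    name="example-package",  # hypothetical
    version="0.1",
    packages=find_packages(exclude=["tests", "tests.*"]),
)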
This immediately triggered a code-smell for me and the realization that there must be more than one use case for a distribution; I had been fixated on the only one I've really participated in, using easy_install or pip to install a library. And then I realized I was developing work product where I had only a partial understanding of the end-users I was developing for.
So my question is this: "What are the use cases for a Python distribution other than installing it in one's Python environment? Who else am I serving with this distribution and what do they care most about?"
Here are some of the working issues I haven't figured out yet that bear on the answer:
Is it a sensible thing to include everything that's under source control (git) in the source distribution? In the age of GitHub, does anyone download a source distribution to get access to the full project source? Or should I just post a link to my GitHub repo? Won't including everything bloat the distribution and make it take longer to download for folks who just want to install it?
I'm going to host the documentation on readthedocs.org. Does it make any sense for me to include HTML versions of the docs in the source distribution?
Does anyone use python setup.py test to run tests on a source distribution? If so, what role are they in and what situation are they in? I don't know if I should bother with making that work and if I do, who to make it work for.
Some things that you might want to include in the source distribution but maybe not install include:
the package's license
a test suite
the documentation (possibly a processed form like HTML in addition to the source)
possibly any additional scripts used to build the source distribution
Quite often this will be the majority or all of what you are managing in version control and possibly a few generated files.
The main reason why you would do this when those files are available online or through version control is so that people know they have the version of the docs or tests that matches the code they're running.
If you only host the most recent version of the docs online, then they might not be useful to someone who has to use an older version for some reason. And the test suite on the tip in version control may not be compatible with the version of the code in the source distribution (e.g. if it tests features added since then). To get the right version of the docs or tests, they would need to comb through version control looking for a tag that corresponds to the source distribution (assuming the developers bothered tagging the tree). Having the files available in the source distribution avoids this problem.
As for people wanting to run the test suite, I have a number of my Python modules packaged in various Linux distributions and occasionally get bug reports related to test failures in their environments. I've also used the test suites of other people's modules when I encounter a bug and want to check whether the external code is behaving as the author expects in my environment.

Tracking global migration to Python 3.x

Python 3.x is looking ever more tempting with cleaned-up syntax (I like it, others may not), new features, and what looks like a gradual progression towards more speed and better multithreading.
But Python 3.x is still held back by lack of 3rd-party support. Important packages like Django, Twisted, etc. are not ported. It's hard to get an overview of where the bottlenecks in the migration are, how far it has come, and whether it's progressing at all. The migration dependencies are also hard to map. Also, projects are probably waiting for Python 3.x to offer some major improvement over 2.x that would justify the effort of porting.
Ideally, there would be a site for tracking this migration overall, with (links to) migration plans and dependencies shown so that people willing to help the migration globally could coordinate their efforts and help specific projects. Perhaps also linking to projects' bug tracking systems for relevant migration-related bugs.
But perhaps I'm just not looking hard enough. Does someone know of any efforts to track global migration to Python 3.x?
(By "global", I mean the universe of open source projects built on Python.)
Update:
There's a poll right now on the Python home page which asks about packages you'd like to see ported to Python 3.x.
George Brandl has made a script that generates a graph of the number of packages supporting Python 3. The link on the CheeseShop front page shows the packages in question: http://pypi.python.org/pypi?%3aaction=browse&c=533&show=all
There is also a (pretty crummy) list of unported packages ordered by how many packages depend on them: http://onpython3yet.com/ Why do I say it's crummy? Well, because it is generated entirely without manual fixing up, resulting in things like listing Python itself as a package. This is to a large extent because people don't know that the "Dependencies" listing isn't a place to list any sort of random dependency; it should be used to list the packages that should be auto-installed when you use easy_install/pip. In the Django world, for example, they don't know that, so you see things like "django-saddle" depending on Django and Python, and hence not being easy_installable.
That said, the list is interesting, and we see that PIL really should get ported.
Now, this is not anything "global"; it's just the packages on PyPI, and those tend to be mostly Python modules, not separate applications. But I think the general trend is visible there anyway.
The Python Package Index (PyPI) allows you to search for Python 3rd-party modules that support Python 3.x. It even has a Python 3 packages link which lists them all.
But that doesn't track individual projects' progress on Python 3 support. It just tells you which projects have achieved it.
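That listing is driven purely by the trove classifiers a package declares in its own metadata, e.g. in setup.py (the package name is a placeholder):

from setuptools import setup

setup(
    name="example-package",  # hypothetical
    version="0.1",
    classifiers=[
        "Programming Language :: Python :: 2",
        "Programming Language :: Python :: 3",
    ],
)

So it only reflects what maintainers have explicitly declared, which presumably makes it lag behind the actual state of porting.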
Something I'd be interested to see is a graph of the total number/percentage of Python 3 packages in PyPI over time (from Python 3 release until present). I don't know if anyone has tracked this, or if the PyPI administrators have enough history data to produce such graphs.

How might I handle development versions of Python packages without relying on SCM?

One issue that comes up during Pinax development is dealing with development versions of external apps. I am trying to come up with a solution that doesn't involve bringing in the version control systems, the reason being that I'd rather not have to install all the possible version control systems on my system (or force that upon contributors) and deal with the problems that might arise during environment creation.
Take this situation (knowing how Pinax works will be beneficial to understanding):
We are beginning development on a new version of Pinax. The previous version has a pip requirements file with explicit versions set. A bug comes in for an external app that we'd like to get resolved. To get that bug fix into Pinax, the current process is to simply make a minor release of the app, assuming we have control of the app. For apps we don't control, we just deal with the release cycle of the app author, or force them to make releases ;-) I am not too fond of constantly making minor releases for bug fixes, as in some cases I'd like to be working on new features for apps as well. Of course, branching the older version and doing backports as we need is what we do.
I'd love to hear some thoughts on this.
Could you handle this using the "==dev" version specifier? If the distribution's page on PyPI includes a link to a .tgz of the current dev version (such as both github and bitbucket provide automatically) and you append "#egg=project_name-dev" to the link, both easy_install and pip will use that .tgz if ==dev is requested.
This doesn't allow you to pin to anything more specific than "most recent tip/head", but in a lot of cases that might be good enough?
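For illustration, the setuptools-era way to wire that up looks roughly like this (project names and URLs are placeholders; dependency_links and the ==dev convention are legacy mechanisms, so treat this as a sketch of the pattern being described rather than current best practice):

# setup.py fragment of a project wanting the development tip of a dependency
from setuptools import setup

setup(
    name="example-project",                      # hypothetical
    version="0.1",
    install_requires=["somedjangoapp==dev"],
    dependency_links=[
        # tarball of the current tip, tagged so it satisfies ==dev
        "http://github.com/someuser/somedjangoapp/tarball/master#egg=somedjangoapp-dev",
    ],
)

With pip, the same tarball URL (including the #egg fragment) can also go straight into a requirements file.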
I meant to mention that the solution I had considered before asking was to put up a Pinax PyPI and make development releases on it. We could put up an instance of chishop. We are already using pip's --find-links to point at pypi.pinaxproject.com for packages we've had to release ourselves.
Most open source distributors (the Debians, Ubuntus, MacPorts, et al.) use some sort of patch management mechanism. So something like: import the base source code for each package as released, as a tarball or as an SCM snapshot. Then manage any necessary modifications on top of it using a patch manager, like quilt or Mercurial Queues. Then bundle up each external package with any applied patches in a consistent format. Or have URLs to the base packages and URLs to the individual patches and have them applied during installation. That's essentially what MacPorts does.
EDIT: To take it one step further, you could then version control the set of patches across all of the external packages and make that available as a unit. That's quite easy to do with Mercurial Queues. Then you've simplified the problem to just publishing one set of patches using one SCM system, with the patches applied locally as above or available for developers to pull and apply to their copies of the base release packages.
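For concreteness, the quilt side of that workflow looks roughly like this (the package and patch names are placeholders):

# inside the unpacked upstream source of an external app
quilt new fix-unicode-crash.patch    # start a new patch on top of the stack
quilt add someapp/views.py           # record the file before editing it
$EDITOR someapp/views.py             # make the local modification
quilt refresh                        # write the diff into patches/

# later, on a fresh copy of the upstream release
quilt push -a                        # re-apply the whole patch series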
EDIT: I am not sure I am reading your question correctly so the following may not answer your question directly.
Something I've considered, but haven't tested, is using pip's freeze bundle feature. Perhaps using that and distributing the bundle with Pinax would work? My only concern would be how different OS's are handled. For example, I've never used pip on Windows, so I wouldn't know how a bundle would interact there.
The full idea I hope to try is creating a paver script that controls management of the bundles, making it easy for users to upgrade to newer versions. This would require a bit of scaffolding though.
One other option may be you keeping a mirror of the apps you don't control, in a consistent vcs, and then distributing your mirrored versions. This would take away the need for "everyone" to have many different programs installed.
Other than that, it seems the only real solution is what you guys are doing, there isn't a hassle-free way that I've been able to find.

Are there any other good alternatives to zc.buildout and/or virtualenv for installing non-python dependencies?

I am a member of a team that is about to launch a beta of a python (Django specifically) based web site and accompanying suite of backend tools. The team itself has doubled in size from 2 to 4 over the past few weeks and we expect continued growth for the next couple of months at least. One issue that has started to plague us is getting everyone up to speed in terms of getting their development environment configured and having all the right eggs installed, etc.
I'm looking for ways to simplify this process and make it less error prone. Both zc.buildout and virtualenv look like they would be good tools for addressing this problem, but both seem to concentrate primarily on the Python-specific issues. We have a couple of small subprojects in other languages (Java and Ruby specifically) as well as numerous Python extensions that have to be compiled natively (lxml, MySQL drivers, etc). In fact, one of the biggest thorns in our side has been getting some of these extensions compiled against appropriate versions of the shared libraries so as to avoid segfaults, malloc errors and all sorts of similar issues. It doesn't help that out of 4 people we have 4 different development environments -- 1 Leopard on PPC, 1 Leopard on Intel, 1 Ubuntu and 1 Windows.
Ultimately what would be ideal would be something that works roughly like this, from the dos/unix prompt:
$ git clone [repository url]
...
$ python setup-env.py
...
that then does what zc.buildout/virtualenv does (copy/symlink the Python interpreter, provide a clean space to install eggs), then installs all required eggs, including any native shared-library dependencies, and installs the Ruby project, the Java project, etc.
Obviously this would be useful for both getting development environments up as well as deploying on staging/production servers.
Ideally I would like for the tool that accomplishes this to be written in/extensible via python, since that is (and always will be) the lingua franca of our team, but I am open to solutions in other languages.
So, my question then is: does anyone have any suggestions for better alternatives or any experiences they can share using one of these solutions to handle larger/broader install bases?
Setuptools may be capable of more of what you're looking for than you realize -- if you need a custom version of lxml to work correctly on MacOS X, for instance, you can put a URL to an appropriate egg inside your setup.py and have setuptools download and install that inside your developers' environments as necessary; it also can be told to download and install a specific version of a dependency from revision control.
That said, I'd lean towards using a scriptably generated virtual environment. It's pretty straightforward to build a kickstart file which installs whichever packages you depend on and then boot virtual machines (or production hardware!) against it, with puppet or similar software doing other administration (adding users, setting up services [where's your database come from?], etc). This comes in particularly handy when your production environment includes multiple machines -- just script the generation of multiple VMs within their handy little sandboxed subnet (I use libvirt+kvm for this; while kvm isn't available on all the platforms you have developers working on, qemu certainly is, or you can do as I do and have a small number of beefy VM hosts shared by multiple developers).
This gets you out of the headaches of supporting N platforms -- you only have a single virtual platform to support -- and means that your deployment process, as defined by the kickstart file and puppet code used for setup, is source-controlled and run through your QA and review processes just like everything else.
I always create a develop.py file at the top level of the project, and also have a packages directory with all of the .tar.gz files from PyPI that I want to install, as well as an unpacked copy of virtualenv that is ready to run right from that file. All of this goes into version control. Every developer can simply check out the trunk, run develop.py, and a few moments later will have a virtual environment ready to use that includes all of our dependencies at exactly the versions the other developers are using. And it works even if PyPI is down, which is very helpful at this point in that service's history.
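A rough sketch of what such a develop.py could do, assuming a packages/ directory of vendored sdists and an unpacked copy of virtualenv in the tree (all paths and names here are placeholders, not the poster's actual script):

# develop.py -- bootstrap a local virtual environment from vendored packages
import os
import subprocess
import sys

ENV_DIR = "env"
PACKAGES_DIR = "packages"                             # vendored .tar.gz files
VIRTUALENV = os.path.join("tools", "virtualenv.py")   # unpacked copy in the repo

def main():
    if not os.path.exists(ENV_DIR):
        # create the isolated environment using the bundled virtualenv copy
        subprocess.check_call([sys.executable, VIRTUALENV, ENV_DIR])
    pip = os.path.join(ENV_DIR, "bin", "pip")          # Scripts\pip.exe on Windows
    # install only from the vendored directory, never from PyPI
    subprocess.check_call([
        pip, "install", "--no-index",
        "--find-links", PACKAGES_DIR,
        "-r", "requirements.txt",
    ])

if __name__ == "__main__":
    main()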
Basically, you're looking for a cross-platform software/package installer (along the lines of apt-get/yum/etc.). I'm not sure something like that exists.
An alternative might be specifying the list of packages that need to be installed via the OS-specific package management system such as Fink or DarwinPorts for Mac OS X and having a script that sets up the build environment for the in-house code?
I have continued to research this issue since I posted the question. It looks like there are some attempts to address some of the needs I outlined, e.g. Minitage and Puppet which take different approaches but both may accomplish what I want -- although Minitage does not explicitly state that it supports Windows. Lacking any better options I will try to make either one of these or just extensive customized use of zc.buildout work for our needs, but I still feel like there must be better options out there.
You might consider creating virtual machine appliances with whatever production OS you are running, and all of the software dependencies pre-built. Code can be edited either remotely, or with a shared folder. It worked pretty well for me in a past life that had a fairly complicated development environment.
Puppet doesn't (easily) support the Win32 world either. If you're looking for a deployment mechanism and not just a "dev setup" tool, you might consider looking into ControlTier (http://open.controltier.com/), which has an open-source cross-platform solution.
Beyond that you're looking at "enterprise" software such as BladeLogic or OpsWare and typically an outrageous pricetag for the functionality offered (my opinion, obviously).
A lot of folks have been aggressively using a combination of Puppet and Capistrano (even non-rails developers) for deployment automation tools to pretty good effect. Downside, again, is that it's expecting a somewhat homogeneous environment.
