I don't know why these packages are always such a pain to install. I have been using NetCDF/HDF5 for a long time now, and it has always been a pure horror trip getting them to install or run properly, whether on Linux or OS X, and whether in C, C++ or now Python. The simple dependency of netcdf4 on hdf5 is a source of great pain for many people, and I really wish the developers of those packages would finally do something about it.
So, the latest concrete problem I am facing is this: I am trying to install netCDF4 for python. I get the following error:
Package hdf5 was not found in the pkg-config search path
Perhaps you should add the directory containing `hdf5.pc'
I tried to install the hdf5 packages using apt-get, including:
libhdf5-serial-dev
libhdf5-serial
libhdf5-7
python-h5py
libhdf5-dev
hdf5-tools
hdf5-helpers
libhdf5-7-dbg
Using pip, I tried:
pip install h5py
which failed miserably to resolve a dependency on Cython, so I installed Cython manually. After that, h5py apparently installed, but I cannot find the file hdf5.pc anywhere.
I am pulling my hair out here. Does anyone know how to work around this problem?
When you mix distribution packages and self-built packages, you are increasing your chance of problems (as you are finding out).
Also, do you want h5py or do you want netcdf-python? I don't think netcdf-python has a dependency on h5py. Rather, netcdf-python binds to the C netcdf library, which in turn depends on the C HDF5 library.
h5py likewise binds to the C HDF5 library.
There is a lot of software involved, it's true. Work your way through step by step and it will make more sense eventually (says the guy who has been doing this for 15 years... it gets easier!)
If you are going to do any parallel programming, you'll need an MPI implementation
HDF5 now provides the foundation for NetCDF4. If you want parallel programming, build HDF5 against your MPI implementation.
Install the C library of NetCDF4
Now the Python bindings can pick up what they need from NetCDF4, HDF5, and MPI.
Yes, it is a lot of software to configure and build. pkg-config can help a lot here! When you see "Package hdf5 was not found in the pkg-config search path", that means you should adjust your PKG_CONFIG_PATH to point at the directory containing the pkg-config (.pc) files. Unfortunately, hdf5 doesn't provide a .pc file, so you'll have to do that part by hand. Oh, and netcdf doesn't provide a pkg-config file either: it provides a script, nc-config, that netcdf-python will use.
Let me provide a concrete example:
MPICH-master installed in /home/robl/soft/mpich-master
HDF5 installed in /home/robl/soft/hdf5-1.8.16
e.g. configured like ../../hdf5-1.8.16/configure --prefix=/home/robl/work/soft/hdf5-1.8.16 CC=/home/robl/work/soft/mpich/bin/mpicc --enable-parallel
NetCDF4 installed in /home/robl/soft/netcdf-master
e.g. configured like ./configure CC=${HOME}/work/soft/mpich/bin/mpicc --prefix=${HOME}/work/soft/netcdf-master CPPFLAGS=-I${HOME}/work/soft/hdf5-1.8.16/include LDFLAGS=-L${HOME}/work/soft/hdf5-1.8.16/lib
Now you have all the prerequisites for netcdf-python.
By the way, http://unidata.github.io/netcdf4-python/ lays out the prerequisites and the necessary configure options.
Don't get hung up on the carping about hdf5.pc. If you have nc-config in your path, it will provide the needed information.
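For example, you can ask nc-config directly what it will report to the build (flag names may vary slightly across netCDF versions; check nc-config --help on yours):
$ nc-config --version
$ nc-config --cflags
$ nc-config --libs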
If you are building for parallel programming, set CC to your MPI compiler; if not, you can skip the "export CC=..." step:
cd netcdf-python
export CC=${HOME}/work/soft/mpich/bin/mpicc
export PATH=${HOME}/work/soft/netcdf-master/bin:${PATH}
python setup.py build
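Once the build succeeds (and the package is installed, e.g. with python setup.py install --user), a quick smoke test is to import the bindings and print their version -- netCDF4 is the module name the package installs:
python -c "import netCDF4; print(netCDF4.__version__)"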
I went through the tutorial on uploading packages to https://test.pypi.org/ and was successful in doing so.
However, $ python setup.py sdist bdist_wheel produces both a .whl file and a .tar.gz file in the dist/ directory. twine allows uploading just the .whl file, just the .tar.gz file, or both. I see that many repositories on https://pypi.org/ have uploaded both formats.
I want to understand what the best practice is. Is one format preferred over the other? If the .whl file is enough for distributing my code, should I upload the .tar.gz file too? Or is there something else I am completely missing here?
Best practice is to provide both.
A "built distribution" (.whl) for users that are able to use that distribution. This saves install time as a "built distribution" is pre-built and can just be dropped into place on the users machine, without any compilation step or without executing setup.py. There may be more than one built distribution for a given release -- once you start including compiled binaries with your distribution, they become platform-specific (see https://pypi.org/project/tensorflow/#files for example)
A "source distribution" (.tar.gz) is essentially a fallback for any user that cannot use your built distribution(s). Source distributions are not "built" meaning they may require compilation to install. At the minimum, they require executing a build-backend (for most projects, this is invoking setup.py with setuptools as the build-backend). Any installer should be able to install from source. In addition, a source distribution makes it easier for users who want to audit your source code (although this is possible with built distributions as well).
For the majority of Python projects, turning a "source distribution" into a "built distribution" results in a single pure-Python wheel (which is indicated by the none-any in a filename like projectname-1.2.3-py2.py3-none-any.whl). There's not much difference between this and the source distribution, but it's still best practice to upload both.
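Concretely, producing and publishing both artifacts is just (assuming twine is installed and your PyPI credentials are configured):
$ python setup.py sdist bdist_wheel
$ twine upload dist/*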
A .tar.gz is a so-called source distribution. It contains the source code of your package and instructions on how to build it, and the target system will perform the build before installing it.
A .whl file (spec details in PEP 427) is a built distribution format, which means that the target system doesn't need to build it any more. Installing a wheel usually just means copying its contents into the right site-packages directory.
The wheel sounds downright superior, because it is. It is still best practice to upload both wheels and a source distribution, because any built distribution format only works for a subset of target systems. For a package that contains only Python code, that subset is "everything" - people still often upload source distributions though, maybe to be forward compatible in case a new standard turns up[1], maybe to anticipate system-specific extensions that would suddenly require source distributions in order to support all platforms, maybe to give users the option to run a custom build with specific build parameters.
A good example package to observe the different cases is numpy, which uploads a whopping 25 wheels to cover the most popular platforms, plus a source distribution. If you install numpy on any of the supported platforms, you'll get a nice short install taking a couple of seconds, where the contents of the wheel are copied over. If you are on an unsupported platform (such as Alpine), a normal computer will probably take at least 20 minutes to build numpy from source before it can be installed, and you need all kinds of system-level dev tools for building C extensions installed. A bit of a pain, but still better than not being able to install it at all.
[1] After all, before wheel there was egg, and the adoption of the wheel format would have been a lot harder than it already was if package managers had decided to only upload egg distributions and no sources.
You can also just upload the .whl file and install it directly with pip, for example (the file name below is just a placeholder):
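pip install yourpackage-1.2.3-py3-none-any.whl  # the file name is only a placeholder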
I have a Python script which uses NumPy and another third-party library. The third-party library is written in Python and has no bindings to other languages. It makes use of Cython, SciPy, NumPy and Matplotlib. Though I only use a small subset of this library, it has no easy replacement (scientific software).
I'd like to use a computing server to run my program, since it takes over 10 hours to finish. Needless to say, there is no Python support on it. So I see two possibilities: precompile my code for Unix, or convert it to C/C++.
What I tried:
shedskin: Doesn't work with unsupported libraries
cx_freeze et al.: Countless errors, it's difficult to make simple programs work
PyInstaller: Doesn't work on openSUSE; it is not able to resolve the dependencies of third-party libraries
Nuitka: I get a memory error
Any suggestions on what to do are welcome.
Anaconda/Miniconda is a perfect fit for this problem. It installs locally into your home directory and installs all the binaries you need (with minimal effort to add extra custom packages). It's designed specifically with the Python science ecosystem (and all its annoying build dependencies) in mind.
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh
export PATH=$PATH:~/miniconda/bin
conda install numpy scipy matplotlib cython
You also get the nice side effect that setting up a new machine takes seconds to minutes rather than minutes to hours.
Once it's set up, it's also compatible with pip (i.e. it puts a local copy of pip beside conda)
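If you want to keep projects isolated on the server, conda environments are cheap to create; a sketch, where sci is just a made-up environment name (source activate was the activation command in conda versions of that era):
conda create --name sci numpy scipy matplotlib cython
source activate sci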
I'm in desperate need of a cross-platform framework, as I have vast numbers of .NET products that I'm trying to port to Linux. I have started to work with Python/PyQt and the standard library, and all was going well until I tried to import non-standard libraries. I'm hearing about pip and easy_install and I'm completely confused about all this.
My products need to ship with everything required to execute them, so in the .NET world I simply package my DLLs (or licensed DLLs) with my product.
As a test bed I'm trying to import this library called requests: https://github.com/kennethreitz/requests
I've got an __init__.py file and the library source in my program directory but it isn't working. Please tell me that there is a simple way to include libraries without needing any kind of extra package installer.
I would suggest you start by familiarizing yourself with Python packages (see the distutils docs). pip is simply a manager that installs packages directly from the internet repository, so that you don't need to go and download them manually. So, for example, as stated under "Installing" on the requests homepage, you simply run pip install requests in a terminal, without manually downloading anything.
Packaging your product is a different story, and the way you do it depends on the target system. On windows, the easiest might be to create an installer using NSIS which will install all dependencies. You might also want to use cx-freeze to pull all the dependencies (including the python interpreter) into a single package.
On Linux, many of the dependencies will already be included in most distributions, so you should just list them as requirements when creating your package (e.g. a deb for Ubuntu). Other dependencies might not be included in the distro's repo, but you can still list them as requirements in setup.py.
I can't really comment on Mac, since I've never used Python on one, but I think it would be similar to the Linux approach.
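Coming back to the original attempt of dropping the requests source into the program directory: that can work for pure-Python libraries, but what has to sit next to your code is the package directory itself (the folder that contains requests' own __init__.py), not the unpacked repository root, and that directory has to be on sys.path. A minimal sketch, where vendor/ is a made-up folder holding the bundled packages:
import os
import sys

# make the bundled libraries importable before anything installed system-wide
sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), "vendor"))

import requests  # now resolved from vendor/requests/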
My goal is to distribute a Python package that has several other widely used Python packages as dependencies. My package depends on well written, Pypi-indexed packages like pandas, scipy and numpy, and specifies in the setup.py that certain versions or higher of these are needed, e.g. "numpy >= 1.5".
I found that it's immensely frustrating and nearly impossible for Unix savvy users who are not experts in Python packaging (even if they know how to write Python) to install a package like mine, even when using what are supposed to be easy to use package managers. I am wondering if there is an alternative to this painful process that someone can offer, or if my experience just reflects the very difficult current state of Python packaging and distribution.
Suppose users download your package onto their system. Most will try to install it "naively", using something like:
$ python setup.py install
since if you google instructions on installing Python packages, this is usually what comes up. This will fail for the vast majority of users, since most do not have root access on their Unix/Linux servers. With more searching, they will discover the "--prefix" option and try:
$ python setup.py install --prefix=/some/local/dir
Since the users are not aware of the intricacies of Python packaging, they will pick an arbitrary directory as an argument to --prefix, e.g. "~/software/mypackage/". It will not be a cleanly curated directory where all other Python packages reside, because again, most users are not aware of these details. If they install another package "myotherpackage", they might pass it "~/software/myotherpackage", and you can imagine how down the road this will lead to frustrating hacking of PYTHONPATH and other complications.
Continuing with the installation process: even though the call to "setup.py install" with "--prefix" appeared to succeed, the package will fail once users try to use it, since one of the dependencies (e.g. pandas, scipy or numpy) might be missing and no package manager was used to pull it in. They will try to install these packages individually. Even if successful, the packages will inevitably not be on the PYTHONPATH due to the non-standard directories given to "--prefix", and patient users will dabble with modifications of their PYTHONPATH to get the dependencies to be visible.
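Concretely, the dance described above ends up looking something like this (the paths are made up, and the exact site-packages subdirectory depends on the Python version):
$ python setup.py install --prefix=~/software/mypackage
$ export PYTHONPATH=~/software/mypackage/lib/python2.7/site-packages:$PYTHONPATH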
At this stage, users might be told by a Python savvy friend that they should use a package manager like "easy_install", the mainstream manager, to install the software and have dependencies taken care of. After installing "easy_install", which might be difficult, they will try:
$ easy_install setup.py
This too will fail, since users again do not typically have permission to install software globally on production Unix servers. With more reading, they will learn about the "--user" option, and try:
$ easy_install setup.py --user
They will get the error:
usage: easy_install [options] requirement_or_url ...
or: easy_install --help
error: option --user not recognized
They will be extremely puzzled why their easy_install does not have the --user option where there are clearly pages online describing the option. They might try to upgrade their easy_install to the latest version and find that it still fails.
If they continue and consult a Python packaging expert, they will discover that there are two versions of easy_install, both named "easy_install" so as to maximize confusion, but one part of "distribute" and the other part of "setuptools". It happens to be that only the "easy_install" of "distribute" supports "--user" and the vast majority of servers/sys admins install "setuptools"'s easy_install and so local installation will not be possible. Keep in mind that these distinctions between "distribute" and "setuptools" are meaningless and hard to understand for people who are not experts in Python package management.
At this point, I would have lost 90% of even the most determined, savvy and patient users who try to install my software package -- and rightfully so! They wanted to install a piece of software that happened to be written in Python, not to become experts in state of the art Python package distribution, and this is far too confusing and complex. They will give up and be frustrated at the time wasted.
The tiny minority of users who continue on and ask more Python experts will be told that they ought to use pip/virtualenv instead of easy_install. Installing pip and virtualenv and figuring out how these tools work and how they are different from the conventional "python setup.py" or "easy_install" calls is in itself time consuming and difficult, and again too much to ask from users who just wanted to install a simple piece of Python software and use it. Even those who pursue this path will be confused as to whether whatever dependencies they installed with easy_install or setup.py install --prefix are still usable with pip/virtualenv or if everything needs to be reinstalled from scratch.
This problem is exacerbated if one or more of the packages in question depends on a different version of Python than the default one. The difficulty of ensuring that your Python package manager is using the Python version you want it to, and that the required dependencies are installed in the relevant Python 2.x directory and not Python 2.y, will be so endlessly frustrating to users that they will certainly give up at that stage.
Is there a simpler way to install Python software that doesn't require users to delve into all of these technical details of Python packages, paths and locations? For example, I am not a big Java user, but I do use some Java tools occasionally, and don't recall ever having to worry about X and Y dependencies of the Java software I was installing, and I have no clue how Java package managing works (and I'm happy that I don't -- I just wanted to use a tool that happened to be written in Java.) My recollection is that if you download a Jar, you just get it and it tends to work.
Is there an equivalent for Python? A way to distribute software in a way that doesn't depend on users having to chase down all these dependencies and versions? A way to perhaps compile all the relevant packages into something self-contained that can just be downloaded and used as a binary?
I would like to emphasize that this frustration happens even with the narrow goal of distributing a package to savvy Unix users, which makes the problem simpler by not worrying about cross platform issues, etc. I assume that the users are Unix savvy, and might even know Python, but just aren't aware (and don't want to be made aware) about the ins and outs of Python packaging and the myriad of internal complications/rivalries of different package managers. A disturbing feature of this issue is that it happens even when all of your Python package dependencies are well-known, well-written and well-maintained Pypi-available packages like Pandas, Scipy and Numpy. It's not like I was relying on some obscure dependencies that are not properly formed packages: rather, I was using the most mainstream packages that many might rely on.
Any help or advice on this will be greatly appreciated. I think Python is a great language with great libraries, but I find it virtually impossible to distribute the software I write in it (once it has dependencies) in a way that is easy for people to install locally and just run. I would like to clarify that the software I'm writing is not a Python library for programmatic use, but software that has executable scripts that users run as individual programs. Thanks.
We also develop software projects that depend on numpy, scipy and other PyPI packages. Hands down, the best tool currently available out there for managing remote installations is zc.buildout. It is very easy to use. You download a bootstrapping script from their website and distribute that with your package. You write a "local deployment" file, normally called buildout.cfg, that explains how to install the package locally. You ship both the bootstrap.py file and buildout.cfg with your package - we use the MANIFEST.in file in our Python packages to force the embedding of these two files in the zip or tar balls distributed via PyPI. When the user unpacks it, they execute two commands:
$ python bootstrap.py # this will download zc.buildout and setuptools
$ ./bin/buildout # this will build and **locally** install your package + deps
The package is compiled and all dependencies are installed locally, which means that the user installing your package doesn't even need root privileges, which is an added feature. The scripts are (normally) placed under ./bin, so the user can just execute them after that. zc.buildout uses setuptools for interaction with PyPI so everything you expect works out of the box.
You can extend zc.buildout quite easily if all that power is not enough - you create so-called "recipes" that can help the user create extra configuration files, download other stuff from the net or instantiate custom programs. The zc.buildout website contains a video tutorial that explains in detail how to use buildout and how to extend it. Our project Bob makes extensive use of buildout for distributing packages for scientific usage. If you would like, please visit the following page that contains detailed instructions for our developers on how they can set up their Python packages so other people can build and install them locally using zc.buildout.
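For illustration, a minimal buildout.cfg might look roughly like this (mypackage is a placeholder name; zc.recipe.egg is the stock recipe that installs eggs and generates the scripts under ./bin):
[buildout]
develop = .
parts = scripts

[scripts]
recipe = zc.recipe.egg
eggs =
    mypackage
    numpy
    scipy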
We're currently working to make it easier for users to get started installing Python software in a platform independent manner (in particular see https://python-packaging-user-guide.readthedocs.org/en/latest/future.html and http://www.python.org/dev/peps/pep-0453/)
For right now, the problem of two competing versions of easy_install has been resolved, with the competing fork "distribute" being merged back into the setuptools main line of development.
The best currently available advice on cross-platform distribution and installation of Python software is captured here: https://packaging.python.org/
I have some small Python programs which depend on several big libraries, such as:
NumPy & SciPy
matplotlib
PyQt
OpenCV
PIL
I'd like to make it easier to install these programs for Windows users. Currently I have two options:
either create huge executable bundles with PyInstaller, py2exe or similar tool,
or write step-by-step manual installation instructions.
Executable bundles are way too big. I always feel like there is some magic happening, which may or may not work the next time I use a different library or a new library version. I dislike the wasted space too. Manual installation is too easy to get wrong; there are too many steps: download this particular interpreter version, download numpy, scipy, pyqt and pil binaries, make sure they are all built for the same Python version and the same platform, install them one after another, download and unpack OpenCV, copy its .pyd file deep inside the Python installation, set up environment variables and file associations... You see, few users will have the patience and self-confidence to do all this.
What I'd like to do: distribute only a small Python source and, probably, an installation script, which fetches and installs all the missing dependencies (correct versions, correct platform, installs them in the right order). That's a trivial task with any Linux package manager, but I just don't know which tools can accomplish it on Windows.
Are there simple tools which can generate Windows installers from a list of URLs of dependencies[1]?
[1] As you may have noticed, most of the libraries I listed are not installable with pip/easy_install, but require running their own installers and modifying some files and environment variables.
npackd exists: http://code.google.com/p/windows-package-manager/ It could be done through there, or by using distribute (Python 3.x) or setuptools (Python 2.x) with easy_install, or possibly pip (I don't know its Windows compatibility). But I would choose npackd, because PyQt has an unusual setup that doesn't play nicely with pip/easy_install (it uses a configure.py instead of setup.py). You would have to create your own npackd repo for some of these packages, though; I forget how much is contributed in total for Python libs with it.
AFAIK there is no tool (and I'd assume you googled), so you must make one yourself.
Fetching the proper library versions seems simple enough -- using Python's ftplib you can fetch the proper installers for every library. How would you know which version is compatible with the user's Python? You can store different lists of download URLs, each for a different Python version (this method came off the top of my head and there is probably a better way; not that it matters much if it's simple and it works).
After you figure out how to make each installer run, you can py2exe your installer script, and even use it to fetch the program itself.
EDIT
Some Considerations
There are a couple of things that popped into my mind just as I posted:
First, some pseudocode (how I would approach it, anyway)
#first, we check modules
missing = []
for name in ("numpy", "scipy", "matplotlib"):   #lather, rinse, repeat for all dependencies
    try:
        __import__(name)
    except ImportError:
        missing.append(name)                    #flag the module for installation

#next we check version compatibility -- note that if a library version you need
#is not backwards-compatible, you're in DLL hell, and there is little we can do.
if "numpy" not in missing:
    import numpy
    if tuple(int(x) for x in numpy.__version__.split(".")[:2]) < (1, 5):
        missing.append("numpy")                 #installed, but too old

#once you have your unavailable dependencies, you download them
import ftplib
#<all your file-downloading of the installers for everything in missing here>

#now you install. sorry I can't help you here.
There are a few things you can do to make your utility reusable --
Put all URL lists, minimum version numbers, required library names, etc. in config files
Write a script which helps you set up an installer
Py2exe the installer-maker-script
Sell it
Even better, release it under GPL so we can all feast upon fruits of your labours.
I have a similar need as you, but in addition I need the packaged application to work on several platforms. I'm currently exploring the currently available solutions, here are a few interesting ones:
Use SnakeBasket, which wraps around pip and adds recursive dependency resolution plus a heuristic to choose the right version when there are conflicts.
Package all dependencies as an egg, but not your sourcecode which will still be editable: https://stackoverflow.com/a/528064/1121352
Package all dependencies in a zip file and import the modules directly on the fly (see the sketch after this list): Cross-platform alternative to py2exe or http://davidf.sjsoft.com/mirrors/mcmillan-inc/install1.html
Using buildout: http://www.buildout.org/en/latest/install.html
Using virtualenv with virtualenv-tools (instead of "relocate")
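To illustrate the zip option from the list above: Python's import machinery can load pure-Python modules straight out of a zip archive (via the built-in zipimport support), so shipping a deps.zip (a made-up archive name) next to your script can be as simple as:
import os
import sys

# pure-Python dependencies only -- C extensions cannot be imported from inside a zip
sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), "deps.zip"))

import somedependency  # hypothetical module bundled inside deps.zip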
If your main problem when freezing your code using PyInstaller or similar is that you end up with a big single file, you can customize the process so that you get several files, one for each dependency, instead of one big executable.
I will update here if I find something that fits the bill.