I put a package on PyPI for the first time ~2 months ago and have made some version updates since then. This week I noticed the download count recording and was surprised to see the package had been downloaded hundreds of times. Over the next few days, I was more surprised to see the download count increasing by sometimes hundreds per day, even though this is a niche statistical test toolbox. In particular, older versions of the package are continuing to be downloaded, sometimes at higher rates than the newest version.
What is going on here?
Is there a bug in PyPI's download counting, or is there an abundance of crawlers grabbing open source code (as mine is)?
This is kind of an old question at this point, but I noticed the same thing about a package I have on PyPI and investigated further. It turns out PyPI keeps reasonably detailed download statistics, including (apparently slightly anonymised) user agents. From that, it was apparent that most people downloading my package were things like "z3c.pypimirror/1.0.15.1" and "pep381client/1.5". (PEP 381 describes a mirroring infrastructure for PyPI.)
I wrote a quick script to tally everything up, first including all of them and then leaving out the most obvious bots, and it turns out that literally 99% of the download activity for my package was caused by mirrorbots: 14,335 downloads total, compared to only 146 downloads with the bots filtered. And that's just leaving out the very obvious ones, so it's probably still an overestimate.
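For anyone curious, the tallying itself is trivial once you have the user agents; here is a minimal sketch of the idea, assuming you've already extracted one user-agent string per recorded download (the sample strings are hypothetical):

from collections import Counter

# One user-agent string per recorded download (hypothetical sample).
user_agents = [
    "z3c.pypimirror/1.0.15.1",
    "pep381client/1.5",
    "pip/1.3.1",
]

# Substrings identifying the most obvious mirror bots.
BOT_MARKERS = ("pypimirror", "pep381client")

def is_bot(agent):
    return any(marker in agent for marker in BOT_MARKERS)

print(Counter("bot" if is_bot(ua) else "other" for ua in user_agents))
# Counter({'bot': 2, 'other': 1})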
It looks like the main reason PyPI needs mirrors is because it has them.
Starting with Cairnarvon's summarizing statement:
"It looks like the main reason PyPI needs mirrors is because it has them."
I would slightly modify this:
It might be more that the way PyPI actually works, and thus has to be mirrored, contributes an additional bit (or two :-) on top of the real traffic.
At the moment I think you MUST interact with the main index to know what to update in your repository. State is not simply accessible through timestamps on some publicly accessible folder hierarchy. So, the bad thing is, rsync is out of the equation. The good thing is, you MAY talk to the index through JSON, OAuth, XML-RPC or HTTP interfaces.
For XML-RPC:
$> python
>>> import xmlrpclib
>>> import pprint
>>> client = xmlrpclib.ServerProxy('http://pypi.python.org/pypi')
>>> client.package_releases('PartitionSets')
['0.1.1']
For JSON, e.g.:
$> curl https://pypi.python.org/pypi/PartitionSets/0.1.1/json
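The same JSON endpoint can of course be read from Python as well; a quick sketch (Python 2, matching the XML-RPC example above):

>>> import json, urllib2
>>> url = 'https://pypi.python.org/pypi/PartitionSets/0.1.1/json'
>>> data = json.load(urllib2.urlopen(url))
>>> data['info']['version']
'0.1.1'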
If there are approx. 30,000 packages hosted [1], with some being downloaded 50,000 to 300,000 times a week [2] (like distribute, pip, requests, paramiko, lxml, boto, redis and others), you really need mirrors, at least from an accessibility perspective. Just imagine what a user does when pip install NeedThisPackage fails: wait? Also, company-wide PyPI mirrors are quite common, acting as proxies for otherwise unrouteable networks. Finally, not to forget the wonderful multi-version checking enabled through virtualenv and friends. These are all IMO legitimate and potentially wonderful uses of packages ...
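As an aside, pointing pip at such a company-wide mirror is a one-liner (the hostname here is made up):

pip install --index-url http://pypi.internal.example.com/simple/ NeedThisPackage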
In the end, you never know what an agent really does with a downloaded package: Have N users really use it or just overwrite it next time ... and after all, IMHO package authors should care more for number and nature of uses, than the pure number of potential users ;-)
Refs: The guesstimated numbers are from https://pypi.python.org/pypi (29303 packages) and http://pypi-ranking.info/week (for the weekly numbers, requested 2013-03-23).
You also have to take into account that virtualenv is getting more popular. If your package is something like a core library that people use in many of their projects, they will usually download it multiple times.
Consider a single user with 5 projects that use your package, each living in its own virtualenv. Using pip to meet the requirements, your package is already downloaded 5 times this way. Then these projects might be set up on different machines, like work, home and laptop computers; in addition there might be a staging and a live server in the case of a web application. Summing this up, you end up with many downloads by a single person.
Just a thought... perhaps your package is simply good. ;)
Hypothesis: CI tools like Travis CI and Appveyor also contribute quite a bit. It might mean that each commit/push leads to a build of a package and the installation of everything in requirements.txt
Related
I've googled and googled, but have found almost nothing in the way of discussions or best practices in managing larger enterprise codebases in Python. Here, I'm simply soliciting any and all pointers to such information. Here's some background and some of the questions I'm looking to answer.
We're long-time Java developers who have solved similar problems to those mentioned below largely using well-established Java best practices, as well as Maven, Ant and a Sonatype Nexus repo.
I'm talking internal software only here. We're not looking to distribute anything Python-based. We've got multiple development groups using Python, each developing sharable utility code libraries, final web applications and stand-alone tools, all in pure Python. Each group has its own Github source repository.
How do we manage our shareable code, both within a group and across groups? Do we create eggs (or something similar) and distribute and install them into the Python system? If so, would we store them in our Nexus repo like our Java jars, or is there a more Python-specific method of internal package distribution? Or do we just share raw code, checking out sources from multiple Github repos?
If we share raw code, how do we manage getting the Python searchpath right as we bring together code from multiple repositories?
How do we manage package namespaces when we want our packages to all live in a com.ourcompany base namespace? It seems like Python isn't too happy when you bring together source trees with overlapping namespaces.
How do we manage third party package versioning? I've never seen easy_install or pip passed a version number. How do we lock down third party package versions?
Do tools exist to aid in Python code reviews, CI, regression testing, etc.?
We're relative newbies to Python code, so some of these questions may have fairly obvious answers. Still, I find it surprising that I can't find more information on managing larger Python codebases.
What issues will we encounter that I haven't thought to ask about, or don't yet know enough to even know to ask about?
Any valuable pointers will be greatly appreciated.
Well, I won't even try to answer all those (excellent) questions, but here are a few opinionated pointers which will hopefully help (as someone who works in both worlds, though more Java).
Packaging
If so, would we store them in our Nexus repo like our Java jars, or is there a more Python-specific method of internal package distribution? Or, do we just share raw code, checking out sources from multiple Github repos?
Packaging in Python is historically a bit of a mess IMHO, though it feels like it's improving. Distutils is the major / native tool here; I've not used it much, and it feels slightly scary in places. In general, also check the recommended tools.
Pip has all but won the war of mindshare, especially when installing 3rd-party libraries. I've not solved the local-library problem myself (maybe someone else reading has), but if I were to, I'd probably opt for Pip with local/network-disk repos, e.g. by installing from wheels.
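As a sketch of what I mean (paths and package names are hypothetical): build wheels into a shared directory once, then have everyone install from it instead of from PyPI:

pip wheel --wheel-dir=/mnt/shared/python-packages ourcompany-utils
pip install --no-index --find-links=/mnt/shared/python-packages ourcompany-utils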
Another option (which can cause all sorts of hassles itself) is to package in your OS's native packager, be it Debian-style apt or by creating RPMs, etc. Of course, Windows not so much.
Versioning etc
How do we manage third party package versioning? I've never seen easy_install or pip passed a version number.
Pip
Pip definitely supports version specifiers. Turns out Easy Install does too. I suppose many people / smaller projects opt for latest-and-greatest, which of course isn't always as "appropriate" in the enterprise...
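For example (the version numbers are arbitrary):

pip install requests==1.1.0
pip install "Django>=1.4,<1.5"
easy_install "requests==1.1.0"

or, more maintainably, pin everything in a requirements file and install it in one go:

# requirements.txt
requests==1.1.0
Django==1.4.5

pip install -r requirements.txt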
Virtualenv
No discussion of versioning and Python would be complete without a Python 2/3 reference, but I'm sure you're aware of all this already.
More important, then, is to mention virtualenv. It truly frees you from the mess you can get into testing multiple versions, bearing in mind especially that your (*NIX) operating systems typically rely heavily on Python themselves. It's a big subject, so have a look at the docs.
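The basic workflow is only a few commands (the env name is made up):

virtualenv ~/envs/myproject            # create an isolated environment
source ~/envs/myproject/bin/activate   # use it in the current shell
pip install -r requirements.txt        # installs into the env, not the system Python
deactivate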
Developer Tooling
Do tools exist to aid in Python code reviews, CI, regression testing, etc.?
Code Review
Very much so. Most code review tools are multi-language (it's just a formatting issue really), so just pick your favourite enterprise-friendly one, be it Crucible, Github's one (Barkeep?), Gerrit, or whatever.
CI
For CI you have almost as many options again. Running Python apps is usually less involved than Java ones, so most CI systems, though often Java-focused, support Python. (FWIW, we use drone.io for Quod Libet.) Jenkins should have no problem doing this, and it seems people have done so with TeamCity.
However, the "original" or "most Pythonic" is probably Buildbot, but I've not used it personally. Looks a lot newer than I remember, and it had quite a lot of support in the Python community I think...
Testing
For testing, though not quite as mature as JUnit / TestNG, check out the de facto, JUnit-like unittest module, but also (nicer?) alternatives like nose.
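A minimal unittest test case, runnable directly or picked up by nose, looks like this:

import unittest

class TestAddition(unittest.TestCase):
    def test_one_plus_one(self):
        self.assertEqual(1 + 1, 2)

if __name__ == '__main__':
    unittest.main()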
For higher-level (BDD) testing, try something like Lettuce (as the name implies, heavily inspired by Cucumber), or maybe Behave. I've not tried them, but common opinion is they're less mature than Cucumber / JBehave / Concordion / RSpec etc.
I uploaded a Python package to PyPI and I would like to track how many "real" downloads it has. Given that my package has, say, 1000 downloads (per day, week, month, doesn't matter), I would like to discard from that amount the number of downloads made from CI servers and the like. I mean, I would like to discard downloads that are not from actual users.
Is there any way to accomplish this?
Thanks!
I could possibly be 7 years late, but there's now a website called pypistats that gives you how many downloads your PyPI package has.
You can use vanity:
To Install:
pip install vanity
To Run:
vanity name-of-package
Note: The Python Package Index is to be moved from the current PyPI site to Warehouse, so these stats may be slightly off during this migration. The Warehouse pre-production site can be found here.
Possible answer depending on your library's popularity...:
When your package makes it into the 360 most popular ones, it will get in this list. If not, it is below that number... :)
Python 3.x is looking ever more tempting with cleaned up syntax (I like it, others may not) new features and what looks like a gradual progression towards more speed and better multithreading.
But Python 3.x is still held back by lack of 3rd-party support. Important packages like Django, Twisted, etc. are not ported. It's hard to get an overview of where the bottlenecks in the migration are, how far it has come, and whether it's progressing at all. The migration dependencies are also hard to map. Also, projects are probably waiting for Python 3.x to offer some major improvement over 2.x that would justify the effort of porting.
Ideally, there would be a site for tracking this migration overall, with (links to) migration plans and dependencies shown so that people willing to help the migration globally could coordinate their efforts and help specific projects. Perhaps also linking to projects' bug tracking systems for relevant migration-related bugs.
But perhaps I'm just not looking hard enough. Does someone know of any efforts to track global migration to Python 3.x?
(By "global", I mean the universe of open source projects built on Python.)
Update:
There's a poll right now on the Python home page which asks about packages you'd like to see ported to Python 3.x.
George Brandl has made a script that generates a graph of the number of packages supporting Python 3.
The link on the CheeseShop front page shows the packages in question: http://pypi.python.org/pypi?%3aaction=browse&c=533&show=all
There is also a (pretty crummy) list of unported packages ordered by how many packages depend on them: http://onpython3yet.com/ Why do I say it's crummy? Well, because it is done entirely without manual fixing up, resulting in things like listing Python as a package. This is to a large extent because people don't know that the "Dependencies" listing isn't a place to just list any sort of random dependencies; it should be used to list the packages that should be auto-installed when you use easy_install/pip. But for example in the Django world, they don't know that, so you see things like "django-saddle" depending on Django and Python, and hence not being easy_installable.
That said, the list is interesting, and we see that PIL really should get ported.
Now, this is not anything "global"; it's just the packages on PyPI, and as such they tend to be mostly Python modules, not separate applications. But I think the trend in general is visible there anyway.
The Python Package Index (PyPI) allows you to search for Python 3rd-party modules that support Python 3.x. It even has a Python 3 packages link which lists them all.
But that doesn't track individual projects' progress on Python 3 support. It just tells you which projects have achieved it.
Something I'd be interested to see is a graph of the total number/percentage of Python 3 packages in PyPI over time (from Python 3 release until present). I don't know if anyone has tracked this, or if the PyPI administrators have enough history data to produce such graphs.
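If anyone wanted to start collecting that data, the old XML-RPC interface should make each sample a couple of lines; a sketch (note that browse returns (name, version) pairs, so it slightly overcounts packages that have several releases classified for Python 3):

>>> import xmlrpclib
>>> client = xmlrpclib.ServerProxy('http://pypi.python.org/pypi')
>>> len(client.browse(['Programming Language :: Python :: 3']))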
The Story
After cleaning up my Dreamhost shared server's home folder from all the cruft accumulated over time, I decided to start afresh and compile/reinstall Python.
All tutorials and snippets I found seemed overly simplistic, assuming (or ignoring) a bunch of dependencies needed by Python to compile all modules correctly. So, starting from http://andrew.io/weblog/2010/02/installing-python-2-6-virtualenv-and-virtualenvwrapper-on-dreamhost/ (so far the best guide I found), I decided to write a set-and-forget Bash script to automate this painful process, including along the way a bunch of other things I am planning to use.
The Script
I am hosting the script on http://bitbucket.org/tmslnz/python-dreamhost-batch/src/
The TODOs
So far it runs fine, and does all it needs to do in about 900 seconds, giving me at the end of the process a fully functional Python / Mercurial / etc... setup without even needing to log out and back in.
I thought this might be of use for others too, but there are a few things that I think it's missing, and I am not quite sure how to go about it, what's the best way to do it, or if this just doesn't make any sense at all.
Check for errors and break
Check for minor version bumps of the packages and give warnings
Check for known dependencies
Use arguments to install only some of the packages instead of commenting out lines
Organise the code in a manner that's easy to update
Optionally make the installers and compiling silent, with error logging to file
Fail-proof .bashrc modification to prevent breaking SSH logins and having to log back in via FTP to fix it
EDIT: The implied question is: can anyone, more bashful than me, offer general advice on the worthiness of the above points or highlight any problems they see with this approach? (see my answer to Ry4an's comment below)
The Gist
I am no UNIX or Bash or compiler expert, and this has been built iteratively, by trial and error. It is somewhat going towards apt-get (well, 1% of it...), but since Dreamhost and others obviously cannot give root access on shared servers, this looks to me like a potentially very useful workaround; particularly so with some community work involved.
One way to streamline this would be to make it work with one of: capistrano/fabric, puppet/chef, jhbuild, or buildout+minitage (and a lot of cmmi tasks). There are some opportunities for factoring in common code, especially with something more high-level than bash. You will run into bootstrapping issues, however, so maybe leave good enough alone.
If you want to look into userland package managers, there is autopackage (bootstraps well), nix (quickstart), and stow (simple but helps with isolation).
Honestly, I would just build packages with a name prefix for all of the pieces and have them install under /opt so that they're out of the way. That way it only takes the download time and a bit of install time to do.
One issue that comes up during Pinax development is dealing with development versions of external apps. I am trying to come up with a solution that doesn't involve bringing in the version control systems. The reason being I'd rather not have to install all the possible version control systems on my system (or force that upon contributors) and deal with the problems that might arise during environment creation.
Take this situation (knowing how Pinax works will be beneficial to understanding):
We are beginning development on a new version of Pinax. The previous version has a pip requirements file with explicit versions set. A bug comes in for an external app that we'd like to get resolved. To get that bug fix into Pinax, the current process is to simply make a minor release of the app, assuming we have control of the app. For apps we don't control, we just deal with the release cycle of the app author or force them to make releases ;-) I am not too fond of constantly making minor releases for bug fixes, as in some cases I'd like to be working on new features for apps as well. Of course, branching the older version is what we do, and then we do backports as we need.
I'd love to hear some thoughts on this.
Could you handle this using the "==dev" version specifier? If the distribution's page on PyPI includes a link to a .tgz of the current dev version (such as both github and bitbucket provide automatically) and you append "#egg=project_name-dev" to the link, both easy_install and pip will use that .tgz if ==dev is requested.
This doesn't allow you to pin to anything more specific than "most recent tip/head", but in a lot of cases that might be good enough?
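Concretely, if the download links on PyPI (or a --find-links page) include something like the following (the project name is hypothetical), then a requirement of project_name==dev resolves to that tarball:

http://github.com/someuser/project_name/tarball/master#egg=project_name-dev

pip install "project_name==dev"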
I meant to mention that the solution I had considered before asking was to put up a Pinax PyPI and make development releases on it. We could put up an instance of chishop. We are already using pip's --find-links to point at pypi.pinaxproject.com for packages we've had to release ourselves.
Most open source distributors (the Debians, Ubuntu's, MacPorts, et al) use some sort of patch management mechanism. So something like: import the base source code for each package as released, as a tar ball, or as a SCM snapshot. Then manage any necessary modifications on top of it using a patch manager, like quilt or Mercurial's Queues. Then bundle up each external package with any applied patches in a consistent format. Or have URLs to the base packages and URLs to the individual patches and have them applied during installation. That's essentially what MacPorts does.
EDIT: To take it one step further, you could then version control the set of patches across all of the external packages and make that available as a unit. That's quite easy to do with Mercurial Queues. Then you've simplified the problem to just publishing one set of patches using one SCM system, with the patches applied locally as above or available for developers to pull and apply to their copies of the base release packages.
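With Mercurial Queues (the bundled mq extension) the day-to-day cycle is roughly this; the patch name is made up:

hg qnew fix-login-redirect.patch    # start a new patch on top of the imported release
# ... edit the external app's sources ...
hg qrefresh                         # capture the edits into the patch
hg qpop -a                          # unapply all patches, back to pristine upstream
hg qpush -a                         # reapply the whole patch series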
EDIT: I am not sure I am reading your question correctly so the following may not answer your question directly.
Something I've considered, but haven't tested, is using pip's freeze and bundle features. Perhaps using that and distributing the bundle with Pinax would work? My only concern would be how different OSes are handled. For example, I've never used pip on Windows, so I wouldn't know how a bundle would interact there.
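For reference, and from memory (so treat the exact bundle syntax as an assumption), the rough shape would be:

pip freeze > requirements.txt
pip bundle Pinax.pybundle -r requirements.txt
pip install Pinax.pybundle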
The full idea I hope to try is creating a paver script that controls management of the bundles, making it easy for users to upgrade to newer versions. This would require a bit of scaffolding though.
One other option may be you keeping a mirror of the apps you don't control, in a consistent vcs, and then distributing your mirrored versions. This would take away the need for "everyone" to have many different programs installed.
Other than that, it seems the only real solution is what you guys are doing, there isn't a hassle-free way that I've been able to find.