How to see users of a python package?

I have made a python package named 'Panclus' but I do not know if anyone is using this package or not. Please tell me how to see how many people have installed my python package. Is it possible to get the name of users?

From the PyPI documentation:

PyPI does not display download statistics for a number of reasons:

- Inefficient to make work with a Content Distribution Network (CDN): Download statistics change constantly. Including them in project pages, which are heavily cached, would require invalidating the cache more often, and reduce the overall effectiveness of the cache.
- Highly inaccurate: A number of things prevent the download counts from being accurate, some of which include:
  - pip's download cache (lowers download counts)
  - Internal or unofficial mirrors (can both raise or lower download counts)
  - Packages not hosted on PyPI (for comparison's sake)
  - Unofficial scripts or attempts at download count inflation (raises download counts)
  - Known historical data quality issues (lowers download counts)
- Not particularly useful: Just because a project has been downloaded a lot doesn't mean it's good; similarly, just because a project hasn't been downloaded a lot doesn't mean it's bad!

The PyPI documentation has more background on download statistics.

Related

Using setuptools, how can I download external data upon installation?

I'd like to create some ridiculously-easy-to-use pip packages for loading common machine-learning datasets in Python. (Yes, some stuff already exists, but I want it to be even simpler.)
What I'd like to achieve is this:
1. User runs pip install dataset
2. pip downloads the dataset, say via wget http://mydata.com/data.tar.gz. Note that the data does not reside in the python package itself, but is downloaded from somewhere else.
3. pip extracts the data from this file and puts it in the directory that the package is installed in. (This isn't ideal, but the datasets are pretty small, so let's assume storing the data here isn't a big deal.)
4. Later, when the user imports my module, the module automatically loads the data from the specific location.
This question is about bullets 2 and 3. Is there a way to do this with setuptools?
As alluded to by Kevin, Python package installs should be completely reproducible, and any potential external-download issues should be pushed to runtime. This therefore shouldn't be handled with setuptools.
Instead, to avoid burdening the user, consider downloading the data in a lazy way, upon load. Example:
import os

DATA_DIR = os.path.join(os.path.dirname(__file__), 'data')  # placeholder location for the data

def download_data(url='http://...'):
    # Download the archive; extract the data to DATA_DIR.
    # Raise an exception if the link is bad, or we can't connect, etc.
    ...

def load_data():
    # Lazily fetch the data on first use; read_data_from_disk is assumed to exist elsewhere.
    if not os.path.exists(DATA_DIR):
        download_data()
    data = read_data_from_disk(DATA_DIR)
    return data
We could then describe download_data in the docs, but the majority of users would never need to bother with it. This is somewhat similar to how the imageio module downloads the decoders it needs at runtime, rather than making the user manage those external downloads themselves.
Note that the data does not reside in the python package itself, but is downloaded from somewhere else.
Please do not do this.
The whole point of Python packaging is to provide a completely deterministic, repeatable, and reusable means of installing exactly the same thing every time. Your proposal has the following problems at a minimum:
The end user might download your package on computer A, stick it on a thumb drive, and then install it on computer B which does not have internet.
The data on the web might change, meaning that two people who install the same exact package get different results.
The website that provides the data might cease to exist or unwisely change the URL, meaning people who still have the package won't be able to use it.
The user could be behind an internet filter, and you might get a useless "this page is blocked" HTML file instead of the dataset you were expecting.
Instead, you should either include your data with the package (using the package_data or data_files arguments to setup()), or provide a separate top-level function in your Python code to download the data manually when the user is ready to do so.
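For the first option, a minimal sketch of such a setup() call (the project name and data path here are only placeholders):

from setuptools import setup, find_packages

setup(
    name='mydataset',                              # hypothetical project name
    version='0.1.0',
    packages=find_packages(),
    # Ship the data files inside the package so installs are self-contained.
    package_data={'mydataset': ['data/*.tar.gz']},
)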
Python packaging guidance is that installing a package should not execute arbitrary Python code, which means you may not be able to download anything during the installation process.
If you want to download additional data, do it after the package is installed: for example, when your package is imported, download the data and cache it somewhere so it is not re-downloaded on every import.
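A rough sketch of that import-time caching, assuming a hypothetical cache directory and the download URL from the question:

import os
import urllib.request

CACHE_DIR = os.path.expanduser('~/.cache/mypackage')  # hypothetical cache location
DATA_URL = 'http://mydata.com/data.tar.gz'            # hypothetical download URL

def get_data_path():
    # Download the data once and reuse the cached copy on later imports.
    path = os.path.join(CACHE_DIR, 'data.tar.gz')
    if not os.path.exists(path):
        os.makedirs(CACHE_DIR, exist_ok=True)
        urllib.request.urlretrieve(DATA_URL, path)
    return path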
This question is rather old, but I want to add that downloading external data at installation time is of course much better than forcing a download of external content at runtime.
The original problem is that you cannot package arbitrary content into a Python package if it exceeds the maximum size limit of the package registry. This size limit effectively breaks up the relationship between the packaged Python code and the data it operates on. Suddenly, things that belong together have to be separated, and the package creator needs to take care of the versioning and availability of the external data. If the size limits are met, everything is installed at installation time and the discussion ends here. I want to stress that data and algorithms belong together and are normally installed at the same time, not at some later date. That's the whole point of package integrity. If you cannot install a package because its external content cannot be downloaded, you want to know that at installation time.
In light of Docker and friends, downloading data at runtime makes a container non-reproducible and forces the external content to be downloaded at every start of the container, unless you additionally mount the download path as a Docker volume. But then you need to know exactly where this content is downloaded, and the user/Dockerfile author has to deal with more unnecessary details. There are further issues with using volumes in that regard.
Moreover, content fetched at runtime cannot be cached automatically by Docker, i.e. it has to be fetched again after every docker build.
Then again, one could argue that you should provide a function or executable script that downloads this external content, and that the user should run it directly after installation. Again, the user of the package needs to know more than necessary, because someone or some committee proclaims that executing Python code or downloading external content at installation time is not "recommended".
But forcing the user to run an extra script directly after installing a package is effectively the same as downloading the content in a post-installation step, just more user-unfriendly. Given how popular machine learning is today and how model sizes keep growing, by this argument there will soon be a lot of scripts to execute just to download the models for a handful of Python package dependencies.
The only time I see a benefit in an extra script is when you can choose between several different versions of the external content to download, because then the user is intentionally involved in that decision.
But coming back to the lazy, on-demand model download at runtime, where the user doesn't need to run an extra script: assume the user builds the container, all tests pass on CI, and the container is pushed to Docker Hub or any other registry and goes into production. Nobody then wants random failures because a successfully installed package intermittently re-downloads content, e.g. after a maintenance task like cleaning up Docker volumes, or when containers are scheduled onto new Kubernetes nodes and the first request to a web app times out because the external content is always fetched at startup. Or the content is not fetched at all, because the external URL is down for maintenance. That's a nightmare!
If reasonably sized Python packages were allowed, the whole problem would be much less of an issue. In contrast, the biggest Ruby gems (i.e. packages in the Ruby ecosystem) are over 700 MB, and downloading external content at installation time is of course allowed there.

Consolidated Download Statistics For Python Packages With pip-install

Is there a way to obtain statistics on how many times a package (every package) has been installed using pip-install or any other package manager?
This question came about when trying to determine the popularity of some python packages I was looking into using for a personal project - I am sure developers would find it useful to be able to obtain such statistics.
You can view the download statistics from the official PyPI data. This data is stored in BigQuery (see https://bigquery.cloud.google.com/dataset/the-psf:pypi)
I also developed a page that makes this information easier to browse; take a look at http://pepy.tech/ :-)
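As a rough illustration, here is a sketch of querying that data with the google-cloud-bigquery client; it assumes the newer bigquery-public-data.pypi.file_downloads public table (the link above points to the older the-psf:pypi dataset) and uses requests as an example project:

from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()  # requires Google Cloud credentials

# Count downloads of a package over the last 30 days.
query = """
    SELECT COUNT(*) AS downloads
    FROM `bigquery-public-data.pypi.file_downloads`
    WHERE file.project = 'requests'
      AND DATE(timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
"""
for row in client.query(query).result():
    print(row.downloads)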
You can get these statistics via vanity:
https://pypi.python.org/pypi/vanity

How to tell the amount of times a pypi package is downloaded by actual users?

I uploaded a python package to PyPI and I would like to track how many "real" downloads it has. Given that my package has, say, 1000 downloads (per day, week, month, doesn't matter), I would like to discard from that amount the number of downloads made from CI servers and so on; that is, I would like to discard downloads that are not from actual users.
Is there any way to accomplish this?
Thanks!
I could possibly be 7 years late, but there's now a website called pypistats that shows how many downloads your PyPI package has.
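As a sketch, the site also exposes a JSON API; assuming the /api/packages/<package>/recent endpoint (check pypistats.org for the current routes), you could pull the recent counts like this:

import json
import urllib.request

package = 'Panclus'  # the package you want stats for
url = f'https://pypistats.org/api/packages/{package.lower()}/recent'

with urllib.request.urlopen(url) as response:
    stats = json.load(response)

# The 'data' field holds counts for the last day, week and month.
print(stats['data'])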
You can use vanity:
To Install:
pip install vanity
To Run:
vanity name-of-package
Note: The Python Package Index is to be moved from the current PyPI site to Warehouse, so these stats may be slightly off during this migration. The Warehouse pre-production site can be found here.
A possible answer, depending on your library's popularity: when your package makes it into the 360 most popular ones, it will appear in this list. If not, it is below that number... :)

What are the use cases for a Python distribution?

I'm developing a distribution for the Python package I'm writing so I can post
it on PyPI. It's my first time working with distutils, setuptools, distribute,
pip, setup.py and all that and I'm struggling a bit with a learning curve
that's quite a bit steeper than I anticipated :)
I was having a little trouble getting some of my test data files to be
included in the tarball by specifying them in the data_files parameter in setup.py until I came across a different post here that pointed me
toward the MANIFEST.in file. Just then I snapped to the notion that what you
include in the tarball/zip (using MANIFEST.in) and what gets installed in a
user's Python environment when they do easy_install or whatever (based on what
you specify in setup.py) are two very different things; in general there being
a lot more in the tarball than actually gets installed.
This immediately triggered a code-smell for me and the realization that there
must be more than one use case for a distribution; I had been fixated on the
only one I've really participated in, using easy_install or pip to install a
library. And then I realized I was developing work product where I had only a
partial understanding of the end-users I was developing for.
So my question is this: "What are the use cases for a Python distribution
other than installing it in one's Python environment? Who else am I serving
with this distribution and what do they care most about?"
Here are some of the working issues I haven't figured out yet that bear on the
answer:
- Is it a sensible thing to include everything that's under source control (git) in the source distribution? In the age of github, does anyone download a source distribution to get access to the full project source? Or should I just post a link to my github repo? Won't including everything bloat the distribution and make it take longer to download for folks who just want to install it?
- I'm going to host the documentation on readthedocs.org. Does it make any sense for me to include HTML versions of the docs in the source distribution?
- Does anyone use python setup.py test to run tests on a source distribution? If so, what role are they in and what situation are they in? I don't know if I should bother with making that work and if I do, who to make it work for.
Some things that you might want to include in the source distribution but maybe not install include:
- the package's license
- a test suite
- the documentation (possibly a processed form like HTML in addition to the source)
- possibly any additional scripts used to build the source distribution
Quite often this will be the majority or all of what you are managing in version control and possibly a few generated files.
The main reason why you would do this when those files are available online or through version control is so that people know they have the version of the docs or tests that matches the code they're running.
If you only host the most recent version of the docs online, then they might not be useful to someone who has to use an older version for some reason. And the test suite on the tip in version control may not be compatible with the version of the code in the source distribution (e.g. if it tests features added since then). To get the right version of the docs or tests, they would need to comb through version control looking for a tag that corresponds to the source distribution (assuming the developers bothered tagging the tree). Having the files available in the source distribution avoids this problem.
As for people wanting to run the test suite, I have a number of my Python modules packaged in various Linux distributions and occasionally get bug reports related to test failures in their environments. I've also used the test suites of other people's modules when I encounter a bug and want to check whether the external code is behaving as the author expects in my environment.

PyPi download counts seem unrealistic

I put a package on PyPI for the first time ~2 months ago, and have made some version updates since then. This week I noticed the download count recording, and was surprised to see the package had been downloaded hundreds of times. Over the next few days, I was more surprised to see the download count increasing by sometimes hundreds per day, even though this is a niche statistical test toolbox. In particular, older versions of the package are continuing to be downloaded, sometimes at higher rates than the newest version.
What is going on here?
Is there a bug in PyPI's download counting, or is there an abundance of crawlers grabbing open-source code (as mine is)?
This is kind of an old question at this point, but I noticed the same thing about a package I have on PyPI and investigated further. It turns out PyPI keeps reasonably detailed download statistics, including (apparently slightly anonymised) user agents. From that, it was apparent that most people downloading my package were things like "z3c.pypimirror/1.0.15.1" and "pep381client/1.5". (PEP 381 describes a mirroring infrastructure for PyPI.)
I wrote a quick script to tally everything up, first including all of them and then leaving out the most obvious bots, and it turns out that literally 99% of the download activity for my package was caused by mirrorbots: 14,335 downloads total, compared to only 146 downloads with the bots filtered. And that's just leaving out the very obvious ones, so it's probably still an overestimate.
It looks like the main reason PyPI needs mirrors is because it has them.
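A rough illustration of that kind of tally, assuming a hypothetical text file with one user-agent string per recorded download:

from collections import Counter

# Hypothetical input: one user-agent string per recorded download, one per line.
with open('useragents.txt') as f:
    agents = [line.strip() for line in f if line.strip()]

counts = Counter(agents)
mirror_bots = ('z3c.pypimirror', 'pep381client')

total = sum(counts.values())
human = sum(n for agent, n in counts.items()
            if not agent.startswith(mirror_bots))
print(f'{total} downloads total, {human} with the obvious mirror bots filtered out')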
Starting with Cairnarvon's summarizing statement:
"It looks like the main reason PyPI needs mirrors is because it has them."
I would slightly modify this:
It might be more that the way PyPI actually works, and thus has to be mirrored, contributes an additional bit (or two :-) to the real traffic.
At the moment I think you MUST interact with the main index to know what to update in your repository. State is not simply accessible through timestamps on some publicly accessible folder hierarchy. So, the bad thing is, rsync is out of the equation. The good thing is, you MAY talk to the index through JSON, OAuth, XML-RPC or HTTP interfaces.
For XML-RPC:
$> python
>>> import xmlrpclib
>>> import pprint
>>> client = xmlrpclib.ServerProxy('http://pypi.python.org/pypi')
>>> client.package_releases('PartitionSets')
['0.1.1']
For JSON, e.g.:
$> curl https://pypi.python.org/pypi/PartitionSets/0.1.1/json
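(Under Python 3, the XML-RPC module is xmlrpc.client rather than xmlrpclib.) The JSON route can also be scripted; a minimal sketch, reusing the example package above:

import json
import urllib.request

# Fetch the release metadata for a package from PyPI's JSON API.
url = 'https://pypi.org/pypi/PartitionSets/json'
with urllib.request.urlopen(url) as response:
    metadata = json.load(response)

print(sorted(metadata['releases']))  # available version numbers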
If there are approx. 30,000 packages hosted [1], with some being downloaded 50,000 to 300,000 times a week [2] (like distribute, pip, requests, paramiko, lxml, boto, redis and others), you really need mirrors, at least from an accessibility perspective. Just imagine what a user does when pip install NeedThisPackage fails: Wait? Also, company-wide PyPI mirrors are quite common, acting as proxies for otherwise unroutable networks. Finally, not to forget the wonderful multi-version checking enabled through virtualenv and friends. These are all, IMO, legitimate and potentially wonderful uses of packages ...
In the end, you never know what an agent really does with a downloaded package: do N users really use it, or is it just overwritten next time ... and after all, IMHO package authors should care more about the number and nature of uses than the pure number of potential users ;-)
Refs: The guesstimated numbers are from https://pypi.python.org/pypi (29303 packages) and http://pypi-ranking.info/week (for the weekly numbers, requested 2013-03-23).
You also have to take into account that virtualenv is getting more popular. If your package is something like a core library that people use in many of their projects, they will usually download it multiple times.
Consider a single user who has 5 projects that use your package, each living in its own virtualenv. Using pip to satisfy the requirements, your package is already downloaded 5 times this way. Then these projects might be set up on different machines (work, home and laptop computers), and in addition there might be a staging and a live server in the case of a web application. Summing this up, you end up with many downloads by a single person.
Just a thought... perhaps your package is simply good. ;)
Hypothesis: CI tools like Travis CI and AppVeyor also contribute quite a bit. Each commit/push may lead to a build of the package and the installation of everything in requirements.txt.
