Problems with computing the entropy of random variables via SciPy (stats)

Problems with computing the entropy of random variables via SciPy (stats) - python

Recently I've been trying to figure out how to calculate the entropy of a random variable X using
sp.stats.entropy()
from the stats package of SciPy, with this random variable X being the returns I obtain from the stock of a specific company ("Company 1") from 1997 to 2012 (this is for a financial data/machine learning assignment). However, the arguments involve inputting the probability values
pk
and so far I'm even struggling with computing the actual empirical probabilities, seeing as I only have the observations of the random variable. I've tried different ways of normalising the data in order to obtain an array of probabilities, but my data contains negative values too, which means that when I try and do
asset1/np.sum(asset1)
where asset1 is the row array of the returns of the stock of "Company 1", I manage to obtain a new array which adds up to 1, but obviously with some negative values, and as we all know, negative probabilities do not exist. Therefore, is there any way of computing the empirical probabilities of my observations occurring again (ideally with the option of choosing specific bins, or for a range of values) on Python?
Furthermore, I've been trying to look for a Python package for countless hours which is solely dedicated to the calculation of random variable entropies, joint entropies, mutual information etc. as an alternative to SciPy's entropy option (simply to compare) but most seem to be outdated (I currently have Python 3.5), hence does anyone know of any good package which is compatible with my current version of Python? I know R seems to have a very compact one.
Any kind of help would be highly appreciated. Thank you very much in advance!
EDIT: stock returns are considered to be RANDOM VARIABLES, as opposed to the stock prices which are processes. Therefore, the entropy can definitely be applied in this context.

For continuous distributions, you are better off using the Kozachenko-Leonenko k-nearest neighbour estimator for entropy (K & L 1987) and the corresponding Kraskov, ..., Grassberger (2004) estimator for mutual information. These circumvent the intermediate step of calculating the probability density function, and estimate the entropy directly from the distances of data point to their k-nearest neighbour.
The basic idea of the Kozachenko-Leonenko estimator is to look at (some function of) the average distance between neighbouring data points. The intuition is that if that distance is large, the dispersion in your data is large and hence the entropy is large. In practice, instead of taking the nearest neighbour distance, one tends to take the k-nearest neighbour distance, which tends to make the estimate more robust.
I have implementations for both on my github:
https://github.com/paulbrodersen/entropy_estimators
The code has only been tested using python 2.7, but I would be surprised if it doesn't run on 3.x.

Related

weighted central tendency in python

I have a list, like the one below:
[1,1,1,1,1,1,1,1,1,10]
I want to calculate the list average, however, arithmetic mean would not be an answer. I want my result to be skewed toward 10 because it is assumed to have much more weight. Geometric mean is not also an option. I wanted to know if there are other measures of central tendency that can be calculated in python using predefined functions. I am preparing my data for a neural network and I shall use various weights to see which one performs better. That is why I am looking for predefined functions so I can change their parameters to create multiple databases for my Neural Network.
Thanks in advance
P.s. The priority is to find the measure. Applying it in python comes next.

From my deep learning practice I don't quite agree that a higher weight should be desired.
Yet regarding your particular question, you can solve it
with straightforward power. High values become much larger, lower values much lower. After computing mean of powered values, compute a respective root, to return to the old scale.
And make sure you use an odd value for power, so that negative numbers retain sign.
import numpy as np
x = np.array([1,1,1,1,1,1,1,1,1,10])
power = 15
powered_mean = np.mean(np.power(x, power))
central_tendency = np.power(powered_mean, 1/power) # root
8.576958985908947

Machine learning : find the closest results to a queried vector

I have thousands of vectors of about 20 features each.
Given one query vector, and a set of potential matches, I would like to be able to select the best N matches.
I have spent a couple of days trying out regression (using SVM), training my model with a data set I have created myself : each vector is the concatenation of the query vector and a result vector, and I give a score (subjectively evaluated) between 0 and 1, 0 for perfect match, 1 for worst match.
I haven't had great results, and I believe one reason could be that it is very hard to subjectively assign these scores. What would be easier on the other hand is to subjectively rank results (score being an unknown function):
score(query, resultA) > score(query, resultB) > score(query, resultC)
So I believe this is more a problem of Learning to rank and I have found various links for Python:
http://fa.bianp.net/blog/2012/learning-to-rank-with-scikit-learn-the-pairwise-transform/
https://gist.github.com/agramfort/2071994
...
but I haven't been able to understand how it works really. I am really confused with all the terminology, pairwise ranking, etc ... (note that I know nothing about machine learning hence my feeling of being a bit lost), etc ... so I don't understand how to apply this to my problem.
Could someone please help me clarify things, point me to the exact category of problem I am trying to solve, and even better how I could implement this in Python (scikit-learn) ?

It seems to me that what you are trying to do is to simply compute the distances between the query and the rest of your data, then return the closest N vectors to your query. This is a search problem.
There is no ordering, you simply measure the distance between your query and "thousands of vectors". Finally, you sort the distances and take the smallest N values. These correspond to the most similar N vectors to your query.
For increased efficiency at making comparisons, you can use KD-Trees or other efficient search structures: http://scikit-learn.org/stable/modules/neighbors.html#kd-tree
Then, take a look at the Wikipedia page on Lp space. Before picking an appropriate metric, you need to think about the data and its representation:
What kind of data are you working with? Where does it come from and what does it represent? Is the feature space comprised of only real numbers or does it contain binary values, categorical values or all of them? Wiki for homogeneous vs heterogeneous data.
For a real valued feature space, the Euclidean distance (L2) is usually the choice metric used, with 20 features you should be fine. Start with this one. Otherwise you might have to think about cityblock distance (L1) or other metrics such as Pearson's correlation, cosine distance, etc.
You might have to do some engineering on the data before you can do anything else.
Are the features on the same scale? e.g. x1 = [0,1], x2 = [0, 100]
If not, then try scaling your features. This is usually a matter of trial and error since some features might be noisy in which case scaling might not help.
To explain this, think about a data set with two features: height and weight. If height is in centimeters (10^3) and weight is in kilograms (10^1), then you should aim to convert the cm to meters so both features weigh equally. This is generally a good idea for feature spaces with a wide range of values, meaning you have a large sample of values for both features. You'd ideally like to have all your features normally distributed, with only a bit of noise - see central limit theorem.
Are all of the features relevant?
If you are working with real valued data, you can use Principal Component Analysis (PCA) to rank the features and keep only the relevant ones.
Otherwise, you can try feature selection http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection
Reducing the dimension of the space increases performance, although it is not critical in your case.
If your data consists of continuous, categorical and binary values, then aim to scale or standardize the data. Use your knowledge about the data to come up with an appropriate representation. This is the bulk of the work and is more or less a black art. Trial and error.
As a side note, metric based methods such as knn and kmeans simply store data. Learning begins where memory ends.

Vector quantization for categorical data

Software for vector quantization usually works only on numerical data. One example of this is Python's scipy.cluster.vq.vq (here), which performs vector quantization. The numerical data requirement also shows up for most clustering software.
Many have pointed out that you can always convert a categorical variable to a set of binary numeric variables. But this becomes awkward when working with big data where an individual categorical variable may have hundreds or thousands of categories.
The obvious alternative is to change the distance function. With mixed data types, the distance from an observation to a "center" or "codebook entry" could be expressed as a two-part sum involving (a) the usual Euclidean calculation for the numeric variables and (b) the sum of inequality indicators for categorical variables, as proposed here on page 125.
Is there any open-source software implementation of vector quantization with such a generalized distance function?

For machine learning and clustering algorithms you can also find useful scikit-learn. To achieve what you want, you can have a look to their implementation of DBSCAN.
In their documentation, you can find:
sklearn.cluster.dbscan(X, eps=0.5, min_samples=5, metric='minkowski', algorithm='auto', leaf_size=30, p=2, random_state=None)
Here X can be either your already computed distance matrix (and passing metric='precomputed') or the standard samples x features matrix, while metric= can be a string (with the identifier of one of the already implemented distance functions) or a callable python function that will compute distances in a pair-wise fashion.
If you can't find the metric you want, you can always program it as a python function:
def mydist(a, b):
return a - b # the metric you want comes here
And call dbscan with metric=mydist. Alternatively, you can calculate your distance matrix previously, and pass it to the clustering algorith.
There are some other clustering algorithms in the same library, have a look at them here.

You cannot "quantize" categorial data.
Recall definitions of quantization (Wiktionary):
To limit the number of possible values of a quantity, or states of a system, by applying the rules of quantum mechanics
To approximate a continuously varying signal by one whose amplitude can only have a set of discrete values
In other words, quantization means converting a continuous variable into a discrete variable. Vector quantization does the same, for multiple variables at the same time.
However, categorial variables already are discrete.
What you seem to be looking for is a prototype-based clustering algorithm for categorial data (maybe STING and COOLCAT? I don't know if they will produce prototypes); but this isn't "vector quantization" anymore.
I believe that very often, frequent itemset mining is actually the best approach to find prototypes/archetypes of categorial data.
As for clustering algorithms that allow other distance functions - there are plenty. ELKI has a lot of such algorithms, and also a tutorial on implementing a custom distance. But this is Java, not Python. I'm pretty sure at least some of the clustering algorithms in scipy to allow custom distances, too.
Now pythons scipy.cluster.vq.vq is really simple code. You do not need a library for that at all. The main job of this function is wrapping a C implementation which runs much faster than python code... if you look at the py_vq version (which is used when the C version cannot be used), is is really simple code... essentially, for every object obs[i] it calls this function:
code[i] = argmin(np.sum((obs[i] - code_book) ** 2, 1))
Now you obviously can't use Euclidean distance with a categorial codebook; but translating this line to whatever similarity you want is not hard.
The harder part usually is constructing the codebook, not using it.

Clustering GPS points with a custom distance function in scipy

I'm curious if it is possible to specify your own distance function between two points for scipy clustering. I have datapoints with 3 values: GPS-lat, GPS-lon, and posix-time. I want to cluster these points using some algorithm: either agglomerative clustering, meanshift, or something else.
The problem is distance between GPS points needs to be calculated with the Haversine formula. And then that distance needs to be weighted appropriately so it is comparable with a distance in seconds for clustering purposes.
Looking at the documentation for scipy I don't see anything that jumps out as a way to specify a custom distance between two points.
Is there another way I should be going about this? I'm curious what the Pythonic thing to do is.

You asked for sklearn, but I don't have a good answer for you there. Basically, you could build a distance matrix the way you like, and many algorithms will process the distance matrix. The problem is that this needs O(n^2) memory.
For my attempts at clustering geodata, I have instead used ELKI (which is Java, not Python). First of all, it includes geodetic distance functions; but it also includes index acceleration for many algorithms and for this distance function.
I have not used an additional attribute such as time. As you already noticed you need to weight them appropriately, as 1 meter does not equal not 1 second. Weights will be very much use case dependant, and heuristic.
Why I'm suggesting ELKI is because they have a nice Tutorial on implementing custom distance functions that then can be used in most algorithms. They can't be used in every algorithm - some don't use distance at all, or are constrained to e.g. Minkowski metrics only. But a lot of algorithms can use arbitrary (even non-metric) distance functions.
There also is a follow-up tutorial on index accelerated distance functions. For my geodata, indexes were tremendously useful, speeding up by a factor of over 100x, and thus enabling be to process 10 times more data.

Fitting data to distributions?

I am not a statistician (more of a researchy web developer) but I've been hearing a lot about scipy and R these days. So out of curiosity I wanted to ask this question (though it might sound silly to the experts around here) because I am not sure of the advances in this area and want to know how people without a sound statistics background approach these problems.
Given a set of real numbers observed from an experiment, let us say they belong to one of the many distributions out there (like Weibull, Erlang, Cauchy, Exponential etc.), are there any automated ways of finding the right distribution and the distribution parameters for the data? Are there any good tutorials that walk me through the process?
Real-world Scenario:
For instance, let us say I initiated a small survey and recorded information about how many people a person talks to every day for say 300 people and I have the following information:
1 10
2 5
3 20
...
...
where X Y tells me that person X talked to Y people during the period of the survey. Now using the information from the 300 people, I want to fit this into a model. The question boils down to are there any automated ways of finding out the right distribution and distribution parameters for this data or if not, is there a good step-by-step procedure to achieve the same?

This is a complicated question, and there are no perfect answers. I'll try to give you an overview of the major concepts, and point you in the direction of some useful reading on the topic.
Assume that you a one dimensional set of data, and you have a finite set of probability distribution functions that you think the data may have been generated from. You can consider each distribution independently, and try to find parameters that are reasonable given your data.
There are two methods for setting parameters for a probability distribution function given data:
Least Squares
Maximum Likelihood
In my experience, Maximum Likelihood has been preferred in recent years, although this may not be the case in every field.
Here's a concrete example of how to estimate parameters in R. Consider a set of random points generated from a Gaussian distribution with mean of 0 and standard deviation of 1:
x = rnorm( n = 100, mean = 0, sd = 1 )
Assume that you know the data were generated using a Gaussian process, but you've forgotten (or never knew!) the parameters for the Gaussian. You'd like to use the data to give you reasonable estimates of the mean and standard deviation. In R, there is a standard library that makes this very straightforward:
library(MASS)
params = fitdistr( x, "normal" )
print( params )
This gave me the following output:
mean sd
-0.17922360 1.01636446
( 0.10163645) ( 0.07186782)
Those are fairly close to the right answer, and the numbers in parentheses are confidence intervals around the parameters. Remember that every time you generate a new set of points, you'll get a new answer for the estimates.
Mathematically, this is using maximum likelihood to estimate both the mean and standard deviation of the Gaussian. Likelihood means (in this case) "probability of data given values of the parameters." Maximum likelihood means "the values of the parameters that maximize the probability of generating my input data." Maximum likelihood estimation is the algorithm for finding the values of the parameters which maximize the probability of generating the input data, and for some distributions it can involve numerical optimization algorithms. In R, most of the work is done by fitdistr, which in certain cases will call optim.
You can extract the log-likelihood from your parameters like this:
print( params$loglik )
[1] -139.5772
It's more common to work with the log-likelihood rather than likelihood to avoid rounding errors. Estimating the joint probability of your data involves multiplying probabilities, which are all less than 1. Even for a small set of data, the joint probability approaches 0 very quickly, and adding the log-probabilities of your data is equivalent to multiplying the probabilities. The likelihood is maximized as the log-likelihood approaches 0, and thus more negative numbers are worse fits to your data.
With computational tools like this, it's easy to estimate parameters for any distribution. Consider this example:
x = x[ x >= 0 ]
distributions = c("normal","exponential")
for ( dist in distributions ) {
print( paste( "fitting parameters for ", dist ) )
params = fitdistr( x, dist )
print( params )
print( summary( params ) )
print( params$loglik )
}
The exponential distribution doesn't generate negative numbers, so I removed them in the first line. The output (which is stochastic) looked like this:
[1] "fitting parameters for normal"
mean sd
0.72021836 0.54079027
(0.07647929) (0.05407903)
Length Class Mode
estimate 2 -none- numeric
sd 2 -none- numeric
n 1 -none- numeric
loglik 1 -none- numeric
[1] -40.21074
[1] "fitting parameters for exponential"
rate
1.388468
(0.196359)
Length Class Mode
estimate 1 -none- numeric
sd 1 -none- numeric
n 1 -none- numeric
loglik 1 -none- numeric
[1] -33.58996
The exponential distribution is actually slightly more likely to have generated this data than the normal distribution, likely because the exponential distribution doesn't have to assign any probability density to negative numbers.
All of these estimation problems get worse when you try to fit your data to more distributions. Distributions with more parameters are more flexible, so they'll fit your data better than distributions with less parameters. Also, some distributions are special cases of other distributions (for example, the Exponential is a special case of the Gamma). Because of this, it's very common to use prior knowledge to constrain your choice models to a subset of all possible models.
One trick to get around some problems in parameter estimation is to generate a lot of data, and leave some of the data out for cross-validation. To cross-validate your fit of parameters to data, leave some of the data out of your estimation procedure, and then measure each model's likelihood on the left-out data.

Take a look at fitdistrplus (http://cran.r-project.org/web/packages/fitdistrplus/index.html).
A couple of quick things to note:
Try the function descdist, which provides a plot of skew vs. kurtosis of the data and also shows some common distributions.
fitdist allows you to fit any distributions you can define in terms of density and cdf.
You can then use gofstat which computes the KS and AD stats which measure distance of the fit from the data.

This is probably a bit more general than you need, but might give you something to go on.
One way to estimate a probability density function from random data is to use an Edgeworth or Butterworth expansion. These approximations use density function properties known as cumulants (the unbiased estimators for which are the k-statistics) and express the density function as a perturbation from a Gaussian distribution.
These both have some rather dire weaknesses such as producing divergent density functions, or even density functions that are negative over some regions. However, some people find them useful for highly clustered data, or as starting points for further estimation, or for piecewise estimated density functions, or as part of a heuristic.
M. G. Kendall and A. Stuart, The advanced theory of statistics, vol. 1,
Charles Griffin, 1963, was the most complete reference I found for this, with a whopping whole page dedicated to the topic; most other texts had a sentence on it at most or listed the expansion in terms of the moments instead of the cumulants which is a bit useless. Good luck finding a copy, though, I had to send my university librarian on a trip to the archives for it... but this was years ago, so maybe the internet will be more helpful today.
The most general form of your question is the topic of a field known as non-parametric density estimation, where given:
data from a random process with an unknown distribution, and
constraints on the underlying process
...you produce a density function that is the most likely to have produced the data. (More realistically, you create a method for computing an approximation to this function at any given point, which you can use for further work, eg. comparing the density functions from two sets of random data to see whether they could have come from the same process).
Personally, though, I have had little luck in using non-parametric density estimation for anything useful, but if you have a steady supply of sanity you should look into it.

I'm not a scientist, but if you were doing it with a pencil an paper, the obvious way would be to make a graph, then compare the graph to one of a known standard-distribution.
Going further with that thought, "comparing" is looking if the curves of a standard-distribution and yours are similar.
Trigonometry, tangents... would be my last thought.
I'm not an expert, just another humble Web Developer =)

You are essentially wanting to compare your real world data to a set of theoretical distributions. There is the function qqnorm() in base R, which will do this for the normal distribution, but I prefer the probplot function in e1071 which allows you to test other distributions. Here is a code snippet that will plot your real data against each one of the theoretical distributions that we paste into the list. We use plyr to go through the list, but there are several other ways to go through the list as well.
library("plyr")
library("e1071")
realData <- rnorm(1000) #Real data is normally distributed
distToTest <- list(qnorm = "qnorm", lognormal = "qlnorm", qexp = "qexp")
#function to test real data against list of distributions above. Output is a jpeg for each distribution.
testDist <- function(x, data){
jpeg(paste(x, ".jpeg", sep = ""))
probplot(data, qdist = x)
dev.off()
}
l_ply(distToTest, function(x) testDist(x, realData))

For what it's worth, it seems like you might want to look at the Poisson distribution.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.