Recently I've been trying to figure out how to calculate the entropy of a random variable X using
sp.stats.entropy()
from the stats package of SciPy, with this random variable X being the returns I obtain from the stock of a specific company ("Company 1") from 1997 to 2012 (this is for a financial data/machine learning assignment). However, the arguments involve inputting the probability values
pk
and so far I'm even struggling with computing the actual empirical probabilities, seeing as I only have the observations of the random variable. I've tried different ways of normalising the data in order to obtain an array of probabilities, but my data contains negative values too, which means that when I try and do
asset1/np.sum(asset1)
where asset1 is the row array of the returns of the stock of "Company 1", I obtain a new array which adds up to 1, but with some negative values, and negative probabilities do not exist. Therefore, is there any way of computing the empirical probabilities of my observations occurring again (ideally with the option of choosing specific bins, or for a range of values) in Python?
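One possible way to get non-negative probabilities is to bin the observations and use the relative bin counts, rather than normalising the returns themselves. The sketch below is only an illustration: the synthetic `asset1` series, the helper name `empirical_entropy` and the bin count are assumptions, not part of the original data.

```python
import numpy as np
from scipy.stats import entropy

def empirical_entropy(returns, bins=20):
    """Histogram-based (plug-in) entropy estimate, in nats."""
    counts, edges = np.histogram(returns, bins=bins)
    probs = counts / counts.sum()          # relative frequencies: >= 0 and sum to 1
    return entropy(probs), probs, edges    # empty bins contribute 0 to the entropy

# Hypothetical stand-in for the real return series.
rng = np.random.default_rng(0)
asset1 = rng.normal(loc=0.0, scale=0.02, size=4000)

h, probs, edges = empirical_entropy(asset1, bins=30)

# Empirical probability that a return falls inside a chosen range.
p_range = np.mean((asset1 >= -0.01) & (asset1 <= 0.01))
```

Note that the value of a histogram-based entropy depends on the number and width of the bins, so estimates are only comparable when the same binning is used.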
Furthermore, I've been trying to look for a Python package for countless hours which is solely dedicated to the calculation of random variable entropies, joint entropies, mutual information etc. as an alternative to SciPy's entropy option (simply to compare) but most seem to be outdated (I currently have Python 3.5), hence does anyone know of any good package which is compatible with my current version of Python? I know R seems to have a very compact one.
Any kind of help would be highly appreciated. Thank you very much in advance!
EDIT: stock returns are considered to be RANDOM VARIABLES, as opposed to the stock prices which are processes. Therefore, the entropy can definitely be applied in this context.
For continuous distributions, you are better off using the Kozachenko-Leonenko k-nearest neighbour estimator for entropy (K & L 1987) and the corresponding Kraskov, ..., Grassberger (2004) estimator for mutual information. These circumvent the intermediate step of calculating the probability density function, and estimate the entropy directly from the distances of the data points to their k-nearest neighbours.
The basic idea of the Kozachenko-Leonenko estimator is to look at (some function of) the average distance between neighbouring data points. The intuition is that if that distance is large, the dispersion in your data is large and hence the entropy is large. In practice, instead of taking the nearest neighbour distance, one tends to take the k-nearest neighbour distance, which tends to make the estimate more robust.
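To make the idea concrete, here is a minimal sketch of a Kozachenko-Leonenko style estimator built on SciPy's k-d tree. It is not the code from the repository linked below; the function name `kl_entropy` and the default `k=3` are assumptions, and it assumes there are no duplicate points (a zero distance would give log(0)).

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def kl_entropy(samples, k=3):
    """k-nearest-neighbour entropy estimate in nats (Euclidean distances)."""
    x = np.asarray(samples, dtype=float)
    if x.ndim == 1:
        x = x[:, None]                      # treat a flat array as n points in 1-D
    n, d = x.shape
    # Ask for k + 1 neighbours because each point's nearest neighbour is itself.
    dist, _ = cKDTree(x).query(x, k=k + 1)
    r = dist[:, k]                          # distance to the k-th true neighbour
    # Log-volume of the d-dimensional unit ball.
    log_unit_ball = (d / 2.0) * np.log(np.pi) - gammaln(d / 2.0 + 1.0)
    return digamma(n) - digamma(k) + log_unit_ball + d * np.mean(np.log(r))

# Check against a case with a known answer: a standard normal.
rng = np.random.default_rng(0)
samples = rng.normal(size=5000)
estimate = kl_entropy(samples, k=3)
true_value = 0.5 * np.log(2.0 * np.pi * np.e)   # differential entropy of N(0, 1)
```

Larger average neighbour distances push `np.mean(np.log(r))` up, which is exactly the "large dispersion means large entropy" intuition described above.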
I have implementations for both on my github:
https://github.com/paulbrodersen/entropy_estimators
The code has only been tested using python 2.7, but I would be surprised if it doesn't run on 3.x.
SO I realise the question I am asking here is large and complex.
A potential solution to variances in sizes of
In all of my searching through statistical forums and posts I haven't come across a scientifically sound method of taking into account the type of data that I am encountering,
but I have thought up a (novel?) potential solution to account perfectly (in my mind) for large and small datasets within the same model.
The proposed method involves using a genetic algorithm to alter two numbers defining a relationship between the size of the dataset making up an implied strike rate and the
percentage of the implied strike to be used, with the target of the model to maximise the homology of the number 1 in two columns of the following csv. (ultra simplified
but hopefully demonstrates the principle)
Example data
Date,PupilName,Unique class,Achieved rank,x,y,x/y,Average xy
12/12/2012,PupilName1,UniqueClass1,1,3000,9610,0.312174818,0.08527
12/12/2012,PupilName2,UniqueClass1,2,300,961,0.312174818,0.08527
12/12/2012,PupilName3,UniqueClass1,3,1,3,0.333333333,0.08527
13/12/2012,PupilName1,UniqueClass2,1,2,3,0.666666667,0.08527
13/12/2012,PupilName2,UniqueClass2,2,0,1,0,0.08527
13/12/2012,PupilName3,UniqueClass2,3,0,5,0,0.08527
13/12/2012,PupilName4,UniqueClass2,4,0,2,0,0.08527
13/12/2012,PupilName5,UniqueClass2,5,0,17,0,0.08527
14/12/2012,PupilName1,UniqueClass3,1,1,2,0.5,0.08527
14/12/2012,PupilName2,UniqueClass3,2,0,1,0,0.08527
14/12/2012,PupilName3,UniqueClass3,3,0,5,0,0.08527
14/12/2012,PupilName4,UniqueClass3,4,0,6,0,0.08527
14/12/2012,PupilName5,UniqueClass3,5,0,12,0,0.08527
15/12/2012,PupilName1,UniqueClass4,1,0,0,0,0.08527
15/12/2012,PupilName2,UniqueClass4,2,1,25,0.04,0.08527
15/12/2012,PupilName3,UniqueClass4,3,1,29,0.034482759,0.08527
15/12/2012,PupilName4,UniqueClass4,4,1,38,0.026315789,0.08527
16/12/2012,PupilName1,UniqueClass5,1,12,24,0.5,0.08527
16/12/2012,PupilName2,UniqueClass5,2,1,2,0.5,0.08527
16/12/2012,PupilName3,UniqueClass5,3,13,59,0.220338983,0.08527
16/12/2012,PupilName4,UniqueClass5,4,28,359,0.077994429,0.08527
16/12/2012,PupilName5,UniqueClass5,5,0,0,0,0.08527
17/12/2012,PupilName1,UniqueClass6,1,0,0,0,0.08527
17/12/2012,PupilName2,UniqueClass6,2,2,200,0.01,0.08527
17/12/2012,PupilName3,UniqueClass6,3,2,254,0.007874016,0.08527
17/12/2012,PupilName4,UniqueClass6,4,2,278,0.007194245,0.08527
17/12/2012,PupilName5,UniqueClass6,5,1,279,0.003584229,0.08527
So I have created a tiny model dataset, which contains some good examples of where my current methods fall short and how I feel a genetic algorithm can be used to fix this. The dataset above contains 6 unique classes. The ultimate objective of the algorithm is to create as high a correspondence as possible between the rank of an adjusted x/y and the achieved rank in column 3 (zero-based indexing). In UniqueClass1 we have two identical x/y values. These are comparatively large x/y values compared with the average (note that the average isn't calculated from this dataset), but common sense says that 3000/9610 is more significant, and therefore more likely to have an achieved rank of 1, than 300/961. So what I want to do is make an adjusted x/y to overcome these differences in dataset sizes using a logarithmic growth relationship defined by the equation:
adjusted xy = ((1-exp(-y*α)) * x/y) + ((1-(1-exp(-y*α))) * Average xy)
Where α is the only dynamic number
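As a quick illustration of how the equation behaves, here is a small sketch. The function name `adjusted_xy`, the value of `ALPHA` and the choice to treat x/y as 0 when y is 0 are assumptions made for the example only.

```python
import math

def adjusted_xy(x, y, alpha, average_xy):
    """Blend x/y with the average; the share of x/y grows with the sample size y."""
    weight = 1.0 - math.exp(-y * alpha)     # share of x/y, between 0 and 1
    ratio = x / y if y > 0 else 0.0         # assumption: no data means no ratio
    return weight * ratio + (1.0 - weight) * average_xy

AVERAGE_XY = 0.08527
ALPHA = 0.001   # arbitrary value, chosen only to show the effect

big = adjusted_xy(3000, 9610, ALPHA, AVERAGE_XY)    # weight is close to 1
small = adjusted_xy(300, 961, ALPHA, AVERAGE_XY)    # weight is roughly 0.62
empty = adjusted_xy(0, 0, ALPHA, AVERAGE_XY)        # weight is 0, falls back to average
```

With this α the two rows that share the same raw x/y are separated, and the row backed by more data keeps a value closer to its raw x/y.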
If I can explain my logic a little and open myself up to (hopefully) constructive criticism: the graph below shows an exponential growth relationship between the size of the dataset and the % of x/y contributing to the adjusted x/y. Essentially what the above equation says is that as the dataset gets larger, the percentage of the original x/y used in the adjusted x/y gets larger. Whatever percentage is left is made up by the average xy. It could hypothetically be 75% x/y and 25% average xy for 300/961, and 95%/5% for 3000/9610, creating an adjusted x/y which clearly demonstrates
To help with understanding: lowering α would produce the following relationship, whereby a larger dataset would be required to achieve the same "% of xy contributed"
Conversely, increasing α would produce the following relationship, whereby a smaller dataset would be required to achieve the same "% of xy contributed"
So I have explained my logic. I am also open to code snippets to help me overcome the problem. I have plans to make a multitude of genetic/evolutionary algorithms in the future and could really use a working example to pick apart and play with in order to help my understanding of how to utilise such abilities of python. If additional detail is required or further clarification about the problem or methods please do ask, I really want to be able to solve this problem and future problems of this nature.
So after much discussion about the methods available to overcome the problem presented here, I have come to the conclusion that the best method would be a genetic algorithm that iterates α in order to maximise the homology/correspondence between the rank of an adjusted x/y and the achieved rank in column 3. It would be greatly appreciated if anyone were able to help in that department.
So to clarify, this post is no longer a discussion about methodology
I am hoping someone can help me produce a genetic algorithm to maximise the homology between the results of the equation
adjusted xy = ((1-exp(-y*α)) * x/y) + ((1-(1-exp(-y*α))) * Average xy)
Where adjusted xy applies to each row of the csv. Maximising homology could be achieved by minimising the difference between the rank of the adjusted xy (where the rank is by each Unique class only) and Achieved rank.
Minimising this value would maximise the homology and essentially solve the problem presented to me of different size datasets. If any more information is required please ask, I check this post about 20 times a day at the moment so should reply rather promptly. Many thanks SMNALLY.
The problem you are facing sounds to me like the "Bias-Variance Dilemma" from a general point of view. In a nutshell, a more precise model favours variance (sensitivity to changes in a single training set), while a more general model favours bias (the model works for many training sets).
May I suggest not focusing on GA, and looking instead at Instance-Based Learning and advanced regression techniques. The Andrew Moore page at CMU is a good entry point.
And particularly those slides.
[EDIT]
After a second reading, here is my second understanding:
You have a set of example data with two related attributes X and Y.
You do not want X/Y to dominate when Y is small, (considered as less representative).
As a consequence you want to "weight" the examples with an adapted value adjusted_xy .
You want adjusted_xy to be related to a third attribute R (rank), in the sense that, per class, adjusted_xy is sorted like R.
To do so you suggest to put it as an optimization problem, searching for PARAMS of a given function F(X,Y,PARAMS)= adjusted_xy .
With the constraint that D=Distance( achieved rank for this class, rank of adjusted_xy for this class ) is minimal.
Your question, at least for me, is in the field of attribute selection/attribute adaptation. (I guess the data set will later be used for supervised learning ).
One problem that I see in your approach (if well understood) is that, at the end, rank will be highly related to adjusted_xy which will bring therefore no interesting supplementary information.
That said, I think you already know how a GA works. You have to:
define the content of the chromosome : this appears to be your alpha parameter.
define an appropriate fitness function
The fitness function for one individual can be a sum of distances over all examples of the dataset.
As you are dealing with real values, other metaheuristics such as Evolution Strategies (ES) or Simulated Annealing may be better suited than GA.
As solving optimization problems is CPU intensive, you might consider C or Java instead of Python (as the fitness function at least will be interpreted and thus cost a lot).
Alternatively I would look at using Y as a weight to some supervised learning algorithm (if supervised learning is the target).
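The chromosome and fitness function described above could be sketched roughly as follows, using a simple (1+1) evolution strategy instead of a full GA. Everything here is illustrative: the function names, the mutation step `sigma`, the number of generations and the use of only two classes from the example data are assumptions.

```python
import math
import random
from collections import defaultdict

AVERAGE_XY = 0.08527
# (class, achieved rank, x, y) rows copied from two classes of the example data.
ROWS = [
    ("UniqueClass1", 1, 3000, 9610), ("UniqueClass1", 2, 300, 961),
    ("UniqueClass1", 3, 1, 3),
    ("UniqueClass5", 1, 12, 24), ("UniqueClass5", 2, 1, 2),
    ("UniqueClass5", 3, 13, 59), ("UniqueClass5", 4, 28, 359),
    ("UniqueClass5", 5, 0, 0),
]

def adjusted_xy(x, y, alpha):
    weight = 1.0 - math.exp(-y * alpha)
    ratio = x / y if y > 0 else 0.0          # assumption: no data means no ratio
    return weight * ratio + (1.0 - weight) * AVERAGE_XY

def fitness(alpha):
    """Sum over all rows of |rank of adjusted xy within its class - achieved rank|."""
    by_class = defaultdict(list)
    for cls, rank, x, y in ROWS:
        by_class[cls].append((adjusted_xy(x, y, alpha), rank))
    total = 0
    for members in by_class.values():
        # Largest adjusted value gets rank 1; the stable sort keeps file order on ties.
        ordered = sorted(members, key=lambda m: -m[0])
        total += sum(abs(pos - rank) for pos, (_, rank) in enumerate(ordered, start=1))
    return total

def one_plus_one_es(start_alpha=1.0, sigma=0.5, generations=200, seed=0):
    """Keep one parent; accept a mutated child whenever it is not worse."""
    rng = random.Random(seed)
    best, best_fit = start_alpha, fitness(start_alpha)
    for _ in range(generations):
        child = best * math.exp(sigma * rng.gauss(0.0, 1.0))  # keeps alpha positive
        child_fit = fitness(child)
        if child_fit <= best_fit:
            best, best_fit = child, child_fit
    return best, best_fit

best_alpha, best_mismatch = one_plus_one_es()
```

Because the fitness is a sum of rank differences, it is piecewise constant in α, so many values of α will share the same score; the search can only report one of them.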
Let's start with the problem: You consider the fact that some features lead to some of your classes a 'strike'. You are taking a subset of your data and try to establish a rule for the strikes. You do establish one but then you notice that the accuracy of your rule depends on the volume of the dataset that was used to establish the 'strike' rate anyway. You are also commenting on the effect of some samples in biasing your 'strike' estimate.
The immediate answer is that it looks like you have a lot of variation in your data, therefore you will in one way or another need to collect more to account for that variation. (That is, variation that is inherent to the problem).
The fact that in some cases the numbers end up in 'unusable cases' could also be down to outliers. That is, measurements that are 'out of bounds' for a number of reasons and which you would have to find a way to either exclude them or re-adjust them. But this depends a lot on the context of the problem.
'Strike rates' on their own will not help, but they are perhaps a step in the right direction. In any case, you cannot compare strike rates if they come from samples of different sizes, as you have found out too. If your problem is purely to determine the size of your sample so that your results conform to some specific accuracy, then I would recommend that you have a look at Statistical Power and how the sample size affects it. But still, to determine the sample size you need to know a bit more about your data, which brings us back to point #1 about the inherent variation.
Therefore, my attempt at an answer is this: If I have understood your question correctly, you are dealing with a classification problem in which you seek to assign a number of items (patients) to a number of classes (types of cancer) on the evidence of some features (existence of genetic markers, or frequency of their appearance or any other quantity anyway) about these items. But some features might not exist for all items, or there is a core group of features but there might be some more that do not appear all the time. The question now is, which classifier do you use to achieve this? Logistic regression was mentioned previously and has not helped. Therefore, what I would suggest is going for a Naive Bayesian Classifier. The classifier can be trained with the datasets you have used to derive the 'strike rates', which will provide the a-priori probabilities. When the classifier is 'running' it will be using the features of new data to construct a likelihood that the patient who provided this data should be assigned to each class.
Perhaps the more common example for such a classifier is the spam-email detectors where the likelihood that an email is spam is judged on the existence of specific words in the email (and a suitable training dataset that provides a good starting point of course).
Now, in terms of trying this out practically (and since your post is tagged with python related tags :) ), I would like to recommend Weka. Weka contains a lot of related functionality including bootstrapping that could potentially help you with those differences in the size of the datasets. Although Weka is Java, bindings exist for it in Python too. I would definitely give it a go, the Weka package, book and community are very helpful.
No. Don't use a genetic algorithm.
The bigger the search space of models and parameters, the better your chances of finding a good fit for your data points. But the less this fit will mean. Especially since for some groups your sample sizes are small and therefore the measurements have a high random component to them. This is why, somewhat counterintuitively, it is often actually harder to find a good model for your data after collecting it than before.
You have taken the question to the programmer's lair. This is not the place for it. We solve puzzles.
This is not a puzzle to find the best line through the dots. You are searching for a model that makes sense and brings understanding on the subject matter. A genetic algorithm is very creative at line-through-dot drawing but will bring you little understanding.
Take the problem back where it belongs and ask the statisticians instead.
A good model should be based on the theory behind the data. It'll have to match the points on the right side of the graph, where (if I understand you right) most of the samples are. It'll be able to explain in hard probabilities how likely the deviations on the left are and tell you if they are significant or not.
If you do want to do some programming, I'd suggest you take the simplest linear model, add some random noise, and do a couple simulation runs for a population like your subjects. See if the data looks like the data you're looking at or if it generally 'looks' different, in which case there really is something nonlinear (and possibly interesting) going on on the left.
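One possible reading of that suggestion, sketched under strong assumptions: take the simplest model there is (every subject shares the same underlying rate, here the average 0.08527 from the question), simulate it for several sample sizes, and look at how much the observed x/y scatters from noise alone. The sample sizes and the number of runs are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
TRUE_RATE = 0.08527     # assumed common rate for every simulated subject
N_RUNS = 2000           # simulated subjects per sample size

def simulated_rates(y):
    """Observed x/y for N_RUNS subjects that each have y trials."""
    x = rng.binomial(n=y, p=TRUE_RATE, size=N_RUNS)
    return x / y

# Standard deviation of the observed rate for each sample size.
spread = {y: simulated_rates(y).std() for y in (3, 30, 300, 3000)}
```

If the scatter in the real data for small y is no larger than what this null simulation produces, the apparent structure on the left of the graph may be nothing more than sampling noise.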
I once tackled a similar problem (as similar as problems like this ever are), in which there were many classes and high variance in features per data point. I personally used a Random Forest classifier (which I wrote in Java). Since your data is highly variant, and therefore hard to model, you could create multiple forests from different random samples of your large dataset and put a control layer on top to classify data against all the forests, then take the best score. I don't write Python, but I found this link
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
which may give you something to play with.
Following Occam's razor, you must select a simpler model for a small dataset and may want to switch to a more complex model as your dataset grows.
There are no [good] statistical tests that show you if a given model, in isolation, is a good predictor of your data. Or rather, a test may tell you that given model fitness is N, but you can never tell what the acceptable value of N is.
Thus, build several models and pick the one with the best tradeoff between predictive power and simplicity using the Akaike information criterion. It has useful properties and is not too hard to understand. :)
There are other tests of course, but AIC should get you started.
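A minimal sketch of model selection with AIC, assuming least-squares fits with Gaussian errors, for which AIC can be written (up to a constant shared by all models) as n*ln(RSS/n) + 2k. The synthetic data, the polynomial candidates and the helper name are made up for the illustration.

```python
import numpy as np

def aic_least_squares(y, y_hat, n_params):
    """AIC for a least-squares fit, up to an additive constant."""
    n = len(y)
    rss = float(np.sum((y - y_hat) ** 2))
    return n * np.log(rss / n) + 2 * n_params

# Hypothetical data: a straight line plus noise.
rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 60)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)

aic = {}
for degree in (0, 1, 3):                     # constant, linear, cubic candidates
    coeffs = np.polyfit(x, y, degree)
    aic[degree] = aic_least_squares(y, np.polyval(coeffs, x), n_params=degree + 1)

best_degree = min(aic, key=aic.get)          # lower AIC is better
```

Only differences in AIC between models fitted to the same data are meaningful; the absolute value says nothing on its own.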
For a simple test, check out p-value
I am not a statistician (more of a researchy web developer) but I've been hearing a lot about scipy and R these days. So out of curiosity I wanted to ask this question (though it might sound silly to the experts around here) because I am not sure of the advances in this area and want to know how people without a sound statistics background approach these problems.
Given a set of real numbers observed from an experiment, let us say they belong to one of the many distributions out there (like Weibull, Erlang, Cauchy, Exponential etc.), are there any automated ways of finding the right distribution and the distribution parameters for the data? Are there any good tutorials that walk me through the process?
Real-world Scenario:
For instance, let us say I initiated a small survey and recorded information about how many people a person talks to every day for say 300 people and I have the following information:
1 10
2 5
3 20
...
...
where X Y tells me that person X talked to Y people during the period of the survey. Now using the information from the 300 people, I want to fit this into a model. The question boils down to are there any automated ways of finding out the right distribution and distribution parameters for this data or if not, is there a good step-by-step procedure to achieve the same?
This is a complicated question, and there are no perfect answers. I'll try to give you an overview of the major concepts, and point you in the direction of some useful reading on the topic.
Assume that you have a one-dimensional set of data, and you have a finite set of probability distribution functions that you think the data may have been generated from. You can consider each distribution independently, and try to find parameters that are reasonable given your data.
There are two methods for setting parameters for a probability distribution function given data:
Least Squares
Maximum Likelihood
In my experience, Maximum Likelihood has been preferred in recent years, although this may not be the case in every field.
Here's a concrete example of how to estimate parameters in R. Consider a set of random points generated from a Gaussian distribution with mean of 0 and standard deviation of 1:
x = rnorm( n = 100, mean = 0, sd = 1 )
Assume that you know the data were generated using a Gaussian process, but you've forgotten (or never knew!) the parameters for the Gaussian. You'd like to use the data to give you reasonable estimates of the mean and standard deviation. In R, there is a standard library that makes this very straightforward:
library(MASS)
params = fitdistr( x, "normal" )
print( params )
This gave me the following output:
      mean          sd
  -0.17922360    1.01636446
 ( 0.10163645)  ( 0.07186782)
Those are fairly close to the right answer, and the numbers in parentheses are the estimated standard errors of the parameter estimates. Remember that every time you generate a new set of points, you'll get a new answer for the estimates.
Mathematically, this is using maximum likelihood to estimate both the mean and standard deviation of the Gaussian. Likelihood means (in this case) "probability of data given values of the parameters." Maximum likelihood means "the values of the parameters that maximize the probability of generating my input data." Maximum likelihood estimation is the algorithm for finding the values of the parameters which maximize the probability of generating the input data, and for some distributions it can involve numerical optimization algorithms. In R, most of the work is done by fitdistr, which in certain cases will call optim.
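For readers working in Python rather than R, a roughly equivalent sketch with SciPy is shown below. It is not a port of fitdistr; the standard errors use the textbook asymptotic formulas for a normal distribution, and the variable names are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(loc=0.0, scale=1.0, size=100)

# Maximum-likelihood estimates of the mean and standard deviation.
mu_hat, sd_hat = stats.norm.fit(x)

# Log-likelihood of the data at the fitted parameters.
loglik = np.sum(stats.norm.logpdf(x, loc=mu_hat, scale=sd_hat))

# Asymptotic standard errors of the two estimates for a normal model.
se_mu = sd_hat / np.sqrt(x.size)
se_sd = sd_hat / np.sqrt(2 * x.size)
```

For the normal distribution the maximum-likelihood estimates have a closed form (the sample mean and the standard deviation with divisor n), so no numerical optimisation is needed here.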
You can extract the log-likelihood from your parameters like this:
print( params$loglik )
[1] -139.5772
It's more common to work with the log-likelihood rather than the likelihood to avoid numerical underflow. Estimating the joint probability of your data involves multiplying probabilities, which are all less than 1. Even for a small set of data, the joint probability approaches 0 very quickly, and adding the log-probabilities of your data is equivalent to multiplying the probabilities. A higher log-likelihood means a better fit; since the values here are negative, the closer to 0 the better, and more negative numbers are worse fits to your data.
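The numerical reason for preferring logs can be shown in a few lines. This is a sketch with made-up data, assuming a standard normal model.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(size=2000)

# Multiplying 2000 densities, each below 0.4, underflows to exactly 0.0.
likelihood = np.prod(stats.norm.pdf(x))

# Summing the log-densities carries the same information and stays finite.
log_likelihood = np.sum(stats.norm.logpdf(x))
```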
With computational tools like this, it's easy to estimate parameters for any distribution. Consider this example:
x = x[ x >= 0 ]
distributions = c("normal","exponential")
for ( dist in distributions ) {
    print( paste( "fitting parameters for ", dist ) )
    params = fitdistr( x, dist )
    print( params )
    print( summary( params ) )
    print( params$loglik )
}
The exponential distribution doesn't generate negative numbers, so I removed them in the first line. The output (which is stochastic) looked like this:
[1] "fitting parameters for normal"
      mean          sd
   0.72021836   0.54079027
  (0.07647929) (0.05407903)
         Length Class  Mode
estimate 2      -none- numeric
sd       2      -none- numeric
n        1      -none- numeric
loglik   1      -none- numeric
[1] -40.21074
[1] "fitting parameters for exponential"
     rate
   1.388468
  (0.196359)
         Length Class  Mode
estimate 1      -none- numeric
sd       1      -none- numeric
n        1      -none- numeric
loglik   1      -none- numeric
[1] -33.58996
The exponential distribution is actually slightly more likely to have generated this data than the normal distribution, likely because the exponential distribution doesn't have to assign any probability density to negative numbers.
All of these estimation problems get worse when you try to fit your data to more distributions. Distributions with more parameters are more flexible, so they'll fit your data better than distributions with fewer parameters. Also, some distributions are special cases of other distributions (for example, the Exponential is a special case of the Gamma). Because of this, it's very common to use prior knowledge to constrain your choice of models to a subset of all possible models.
One trick to get around some problems in parameter estimation is to generate a lot of data, and leave some of the data out for cross-validation. To cross-validate your fit of parameters to data, leave some of the data out of your estimation procedure, and then measure each model's likelihood on the left-out data.
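A rough Python sketch of that hold-out idea, using synthetic exponential data and two candidate distributions. The split sizes, the data and fixing the exponential's location at 0 are all assumptions for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
data = rng.exponential(scale=0.7, size=2000)     # hypothetical observations
train, test = data[:1000], data[1000:]           # fit on one half, score the other

heldout_loglik = {}

# Normal candidate: estimate mean and sd on the training half only.
norm_params = stats.norm.fit(train)
heldout_loglik["normal"] = np.sum(stats.norm.logpdf(test, *norm_params))

# Exponential candidate: location fixed at 0, only the scale is estimated.
expon_params = stats.expon.fit(train, floc=0)
heldout_loglik["exponential"] = np.sum(stats.expon.logpdf(test, *expon_params))

best = max(heldout_loglik, key=heldout_loglik.get)
```

Because the parameters never see the held-out half, a more flexible distribution gets no automatic advantage from having extra parameters.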
Take a look at fitdistrplus (http://cran.r-project.org/web/packages/fitdistrplus/index.html).
A couple of quick things to note:
Try the function descdist, which provides a plot of skew vs. kurtosis of the data and also shows some common distributions.
fitdist allows you to fit any distributions you can define in terms of density and cdf.
You can then use gofstat which computes the KS and AD stats which measure distance of the fit from the data.
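In Python, a comparable distance-of-fit check can be sketched with SciPy's Kolmogorov-Smirnov test. This is not an equivalent of gofstat, only an illustration of the KS statistic; the gamma-distributed sample and the two candidates are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.gamma(shape=2.0, scale=1.5, size=500)    # hypothetical positive data

# Fit each candidate with its location fixed at 0, then measure the KS distance
# between the empirical CDF and the fitted CDF (smaller means closer).
candidates = {"gamma": stats.gamma, "exponential": stats.expon}
ks_distance = {}
for name, dist in candidates.items():
    params = dist.fit(x, floc=0)
    ks_distance[name] = stats.kstest(x, dist.cdf, args=params).statistic
```

The p-value that `kstest` also returns is not reliable when the parameters were estimated from the same data, so only the statistic is used here as a descriptive distance.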
This is probably a bit more general than you need, but might give you something to go on.
One way to estimate a probability density function from random data is to use an Edgeworth or Gram-Charlier expansion. These approximations use density function properties known as cumulants (the unbiased estimators for which are the k-statistics) and express the density function as a perturbation from a Gaussian distribution.
These both have some rather dire weaknesses such as producing divergent density functions, or even density functions that are negative over some regions. However, some people find them useful for highly clustered data, or as starting points for further estimation, or for piecewise estimated density functions, or as part of a heuristic.
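As an illustration of both the idea and the weakness, here is a sketch of a Gram-Charlier type A series truncated after the fourth cumulant (a close relative of the Edgeworth expansion). The function name and the example data are assumptions; `scipy.stats.kstat` supplies the k-statistics.

```python
import numpy as np
from scipy import stats

def gram_charlier_pdf(x, mean, variance, k3, k4):
    """Gaussian density perturbed by third- and fourth-cumulant terms."""
    sd = np.sqrt(variance)
    z = (np.asarray(x, dtype=float) - mean) / sd
    he3 = z**3 - 3.0 * z                     # Hermite polynomial He3
    he4 = z**4 - 6.0 * z**2 + 3.0            # Hermite polynomial He4
    correction = 1.0 + k3 / (6.0 * sd**3) * he3 + k4 / (24.0 * sd**4) * he4
    return stats.norm.pdf(z) / sd * correction

# Estimate the first four cumulants of a skewed sample with k-statistics.
rng = np.random.default_rng(7)
sample = rng.exponential(size=5000)
k1, k2, k3, k4 = (stats.kstat(sample, n) for n in (1, 2, 3, 4))

grid = np.linspace(-2.0, 6.0, 200)
density = gram_charlier_pdf(grid, k1, k2, k3, k4)

# With the exact cumulants of a unit exponential after standardising
# (skewness 2, excess kurtosis 6) the "density" is negative at z = -2.
negative_example = gram_charlier_pdf(-2.0, 0.0, 1.0, 2.0, 6.0)
```

With zero third and fourth cumulants the correction term is 1 and the function reduces to the plain Gaussian density.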
M. G. Kendall and A. Stuart, The advanced theory of statistics, vol. 1, Charles Griffin, 1963, was the most complete reference I found for this, with a whopping whole page dedicated to the topic; most other texts had a sentence on it at most or listed the expansion in terms of the moments instead of the cumulants which is a bit useless. Good luck finding a copy, though, I had to send my university librarian on a trip to the archives for it... but this was years ago, so maybe the internet will be more helpful today.
The most general form of your question is the topic of a field known as non-parametric density estimation, where given:
data from a random process with an unknown distribution, and
constraints on the underlying process
...you produce a density function that is the most likely to have produced the data. (More realistically, you create a method for computing an approximation to this function at any given point, which you can use for further work, e.g. comparing the density functions from two sets of random data to see whether they could have come from the same process.)
Personally, though, I have had little luck in using non-parametric density estimation for anything useful, but if you have a steady supply of sanity you should look into it.
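For anyone who does want to try it, kernel density estimation is probably the most accessible non-parametric method. A minimal sketch with SciPy follows; the bimodal synthetic data and the default bandwidth are assumptions.

```python
import numpy as np
from scipy import stats

# Hypothetical data drawn from a mixture that no single textbook family fits well.
rng = np.random.default_rng(8)
data = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(1.0, 1.0, 700)])

kde = stats.gaussian_kde(data)               # bandwidth chosen by Scott's rule
grid = np.linspace(-5.0, 5.0, 201)
density = kde(grid)                          # approximate density at each grid point

# Sanity check: the estimate should integrate to 1 over the real line.
total_mass = kde.integrate_box_1d(-np.inf, np.inf)
```

Unlike the cumulant expansions mentioned earlier, a Gaussian kernel estimate is never negative, though the result does depend on the bandwidth.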
I'm not a scientist, but if you were doing it with pencil and paper, the obvious way would be to make a graph, then compare the graph to one of a known standard distribution.
Going further with that thought, "comparing" is looking if the curves of a standard-distribution and yours are similar.
Trigonometry, tangents... would be my last thought.
I'm not an expert, just another humble Web Developer =)
You are essentially wanting to compare your real world data to a set of theoretical distributions. There is the function qqnorm() in base R, which will do this for the normal distribution, but I prefer the probplot function in e1071 which allows you to test other distributions. Here is a code snippet that will plot your real data against each one of the theoretical distributions that we paste into the list. We use plyr to go through the list, but there are several other ways to go through the list as well.
library("plyr")
library("e1071")
realData <- rnorm(1000) #Real data is normally distributed
distToTest <- list(qnorm = "qnorm", lognormal = "qlnorm", qexp = "qexp")
#function to test real data against list of distributions above. Output is a jpeg for each distribution.
testDist <- function(x, data){
    jpeg(paste(x, ".jpeg", sep = ""))
    probplot(data, qdist = x)
    dev.off()
}
l_ply(distToTest, function(x) testDist(x, realData))
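A Python counterpart can be sketched with `scipy.stats.probplot`, which returns the quantile pairs and a straight-line fit without requiring a plotting library. The candidate list and the synthetic data are assumptions; the correlation `r` of the probability plot is used here as a rough numeric summary of how straight each plot is.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
real_data = rng.normal(size=1000)            # stand-in data, normally distributed

fit_quality = {}
for name in ("norm", "expon", "uniform"):
    # osm: theoretical quantiles, osr: ordered data, r: correlation of the two.
    (osm, osr), (slope, intercept, r) = stats.probplot(real_data, dist=name)
    fit_quality[name] = r

closest = max(fit_quality, key=fit_quality.get)
```

Passing `plot=matplotlib.pyplot` to `probplot` would draw the same plots as images, if matplotlib is installed.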
For what it's worth, it seems like you might want to look at the Poisson distribution.
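If the Poisson suggestion is worth exploring, a quick check could look like the sketch below. The simulated `contacts` array stands in for the 300 survey answers and is purely hypothetical; for a Poisson model the maximum-likelihood estimate of the rate is the sample mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
contacts = rng.poisson(lam=8.0, size=300)    # hypothetical "people talked to" counts

lam_hat = contacts.mean()                    # maximum-likelihood estimate of the rate
loglik = np.sum(stats.poisson.logpmf(contacts, lam_hat))

# A Poisson variable has variance equal to its mean, so a ratio far from 1
# suggests the Poisson model is a poor description of the counts.
dispersion = contacts.var(ddof=1) / lam_hat
```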