I have two 2-D arrays with the same shape (105,234) named A & B essentially comprised of mean values from other arrays. I am familiar with Python's scipy package, but I can't seem to find a way to test whether or not the two arrays are statistically significantly different at each individual array index. I'm thinking this is just a large 2D paired T-test, but am having difficulty. Any ideas or other packages to use?
If we assume that the underlying variance for each mean at the gridpoints is the same, and the number of observations is the same or is known, then we can use the arrays of means to estimate the standard deviation of the means directly.
Dividing the difference between gridpoints by this standard deviation then gives t-distributed random variables that can be tested directly, i.e. the p-values can be calculated.
Since we are testing many points, we will run into a multiple testing problem (http://en.wikipedia.org/wiki/Multiple_comparisons#Large-scale_multiple_testing), and the p-values should be corrected.
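A minimal sketch of this, assuming you know (or can supply) the common standard deviation sigma of the raw observations and the number of observations n behind each mean; both values below are placeholders, and the Benjamini-Hochberg correction is just one reasonable choice:

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Placeholder inputs: A and B are the (105, 234) arrays of means; sigma and n
# must come from your own data (assumed common std and sample size per gridpoint).
A = np.random.rand(105, 234)
B = np.random.rand(105, 234)
sigma, n = 1.0, 30

se_diff = sigma * np.sqrt(2.0 / n)              # standard error of the difference of two means
t = (A - B) / se_diff                           # elementwise t statistics
p = 2 * stats.t.sf(np.abs(t), 2 * n - 2)        # two-sided p-values, 2n - 2 degrees of freedom

# Correct for the 105 * 234 simultaneous tests (Benjamini-Hochberg FDR).
reject, p_corr, _, _ = multipletests(p.ravel(), alpha=0.05, method='fdr_bh')
significant = reject.reshape(A.shape)           # boolean mask of significantly different gridpoints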
If your question is "Do two-dimensional distributions differ?", see
Numerical Recipes, p. 763
(and ask a follow-up question on how to do that in numpy / scipy).
You might also ask on stats.stackexchange.
I assume that x,y coordinates do not matter and we just have the two huge sets of independent measurements.
One possible approach would be to compute the standard deviation of the mean for each array, multiply it by the Student coefficient (probably around 1.645 for your astronomic number of samples and a 95 % confidence level), and obtain the confidence ranges around the means this way. If the confidence ranges of the two arrays overlap, the difference between them is not significant. Formulas can be found here.
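As a rough sketch of that confidence-range overlap check (with made-up data; the critical value is computed from the t distribution rather than hard-coded):

import numpy as np
from scipy import stats

# Made-up stand-ins for the two flattened arrays of measurements.
a = np.random.rand(105 * 234)
b = np.random.rand(105 * 234)

def confidence_range(x, confidence=0.95):
    # mean +/- critical value * standard error of the mean
    sem = stats.sem(x)
    crit = stats.t.ppf(0.5 + confidence / 2.0, len(x) - 1)
    return x.mean() - crit * sem, x.mean() + crit * sem

lo_a, hi_a = confidence_range(a)
lo_b, hi_b = confidence_range(b)
overlap = not (hi_a < lo_b or hi_b < lo_a)   # True -> difference not significant by this criterion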
Go to MS Excel. If you don't have it, your workplace probably does, and there are alternatives anyway.
Enter the arrays of numbers in an Excel worksheet and run the formula =TTEST(array1, array2, tails, type) in a cell: tails is 1 for a one-tailed test or 2 for a two-tailed test, and type selects a paired (1) or two-sample (2 or 3) test...easy peasy. It's a simple Student's t-test, and TTEST returns the p-value directly, so it's quick for on-the-fly comparison of samples.
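For reference, a rough scipy equivalent of that paired test (the sample numbers below are made up):

import numpy as np
from scipy import stats

# Made-up samples standing in for the two Excel ranges.
array1 = np.array([1.1, 2.3, 2.9, 4.2, 5.0])
array2 = np.array([0.9, 2.0, 3.1, 4.5, 5.3])

t_stat, p_two_tailed = stats.ttest_rel(array1, array2)   # paired test, two-tailed p-value
p_one_tailed = p_two_tailed / 2                          # one-tailed, if the sign of t matches the hypothesis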
I would like to use total variation in Python, but I wasn't able to find an existing implementation.
Assuming that I have an array with a finite number of elements, is the NumPy implementation simply:
import numpy as np
a = np.array([...], dtype=float)
tv = np.sum(np.abs(np.diff(a)))
My main doubt is how to compute the supremum of the total variation across all partitions, and whether the sum of absolute differences of consecutive elements suffices for a finite array of floats.
Edit: My input array represents a piecewise linear function, therefore the supremum over the full set of partitions is indeed the sum of absolute differences between contiguous points.
Yes, that is correct.
I imagine you're confused by the mathy definition on the Wikipedia page for total variation. Have a look at the more practical definition on the Wikipedia page for total variation denoising instead.
For an actual code (even Python) implementation, see e.g. TensorFlow's total_variation(), though this is for one or more (2D, color) images, so the TV is computed along both rows and columns and then added together.
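A small sketch of both cases, using only NumPy (the 2-D part just mirrors the row-plus-column sum that total_variation() computes; the arrays are made up):

import numpy as np

# 1-D case: total variation of a signal sampled at its breakpoints.
a = np.array([0.0, 1.5, 1.0, 3.0])
tv_1d = np.sum(np.abs(np.diff(a)))        # |1.5-0| + |1.0-1.5| + |3.0-1.0| = 4.0

# 2-D (image) analogue: absolute differences along rows plus along columns.
img = np.arange(12, dtype=float).reshape(3, 4)
tv_2d = np.sum(np.abs(np.diff(img, axis=0))) + np.sum(np.abs(np.diff(img, axis=1)))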
I have a matrix A = np.array([[1,1,1],[1,2,3],[4,4,4]]) and I want only the linearly independent rows in my new matrix. The answer might be A_new = np.array([[1,1,1],[1,2,3]]) or A_new = np.array([[1,2,3],[4,4,4]])
Since I have a very large matrix, I need to decompose it into smaller, full-rank matrices of linearly independent rows. Can someone please help?
There are many ways to do this, and which way is best will depend on your needs. And, as you noted in your statement, there isn't even a unique output.
One way to do this would be to use Gram-Schmidt to find an orthogonal basis, where the first k vectors in this basis have the same span as the first k independent rows. If at any step you find a linear dependence, drop that row from your matrix and continue the procedure.
A simple way to do this with numpy would be
Q, R = np.linalg.qr(A.T)
and then drop any rows of A for which the corresponding diagonal entry R[i, i] is zero.
For instance, you could do
A[np.abs(np.diag(R)) >= 1e-10]
While this works perfectly in exact arithmetic, it may not work as well in finite precision. Almost any matrix will be numerically full rank, so you will need some kind of thresholding to decide whether a linear dependence is present. If you use the built-in QR method, you will have to make sure that there is no dependence on columns which you previously dropped.
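Putting the pieces above together for the 3x3 example (the 1e-10 threshold is an arbitrary choice, as noted):

import numpy as np

A = np.array([[1, 1, 1],
              [1, 2, 3],
              [4, 4, 4]], dtype=float)

# QR of the transpose: in exact arithmetic, R[k, k] is zero exactly when
# row k of A is a linear combination of the rows before it.
Q, R = np.linalg.qr(A.T)
independent = np.abs(np.diag(R)) >= 1e-10
A_new = A[independent]                    # [[1, 1, 1], [1, 2, 3]]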
If you need even more stability, you could iteratively solve the least squares problem
A.T[:,dependent_cols] x = A.T[:,col_to_check]
using a stable direct method. If you can solve this exactly (i.e. with zero residual), then A.T[:,col_to_check] is dependent on the previous vectors, with the combination given by x.
Which solver to use may also be dictated by your data type.
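A minimal sketch of that least-squares check with np.linalg.lstsq (the variable names here are only illustrative):

import numpy as np

A = np.array([[1, 1, 1],
              [1, 2, 3],
              [4, 4, 4]], dtype=float)
basis = [0, 1]        # indices of rows already accepted as independent
col_to_check = 2      # candidate row to test

x, residuals, rank, _ = np.linalg.lstsq(A.T[:, basis], A.T[:, col_to_check], rcond=None)
# A (numerically) zero residual means the candidate row is a linear combination
# of the accepted rows, with the coefficients given by x.
dependent = residuals.size > 0 and residuals[0] < 1e-10   # empty residuals (degenerate cases) need separate handling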
I am trying to calculate correlation amongst three columns in a dataset. The dataset is relatively large (4 GB in size). When I calculate correlation among the columns of interest, I get small values like 0.0024, -0.0067 etc. I am not sure this result makes any sense or not. Should I sample the data and then try calculating correlation?
Any thoughts/experience on this topic would be appreciated.
Firstly, make sure you're applying the right formula for correlation. Remember, given vectors x and y, the correlation is ((x - mean(x)) * (y - mean(y))) / (length(x - mean(x)) * length(y - mean(y))), where * represents the dot product and length(v) is the square root of the sum of the squares of the terms in v. (I know that's silly, but noticing a mis-typed formula is a lot easier than redoing a program.)
Do you have a strong hunch that there should be some correlation among these columns? If you don't, then those small values are reasonable. On the other hand, if you're pretty sure that there ought to be a strong correlation, then try sampling a random 100 pairs and either finding the correlation there, or plotting them for visual inspection, which can also show you if there is correlation present.
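For instance, a quick NumPy-only sketch of that spot check (the data here are random placeholders for your columns):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)    # stand-ins for two of your columns
y = rng.normal(size=1_000_000)

idx = rng.choice(len(x), size=100, replace=False)   # random sample of 100 pairs
r = np.corrcoef(x[idx], y[idx])[0, 1]               # Pearson correlation of the sample
print(r)

# For a visual check, a scatter plot of the sampled pairs (matplotlib assumed):
# import matplotlib.pyplot as plt
# plt.scatter(x[idx], y[idx]); plt.show()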
There is nothing special about correlation of large data sets. All you need to do is some simple aggregation.
If you want to improve your numerical precision (remember that floating-point math is lossy), you can use Kahan summation and similar techniques, in particular for values close to 0.
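For illustration, a plain-Python version of Kahan (compensated) summation; in practice math.fsum or an equivalent library routine would usually be the easier choice:

import math

def kahan_sum(values):
    # Compensated summation: carry the low-order bits a naive running sum would drop.
    total = 0.0
    compensation = 0.0
    for v in values:
        y = v - compensation
        t = total + y
        compensation = (t - total) - y   # recovers the part of y that was lost
        total = t
    return total

values = [0.1] * 1000
print(kahan_sum(values), sum(values), math.fsum(values))   # fsum is the correctly rounded reference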
But maybe your data just doesn't have a strong correlation?
Try visualizing a sample!
I need to select 3.7*10^8 unique values from the range [0, 3*10^9] and either obtain them in order or keep them in memory.
To do this, I started working on a simple algorithm where I sample smaller uniform distributions (that fit in memory) in order to indirectly sample the large distribution that really interests me.
The code is available at the following gist https://gist.github.com/legaultmarc/7290ac4bef4edb591d1e
Since I'm having trouble implementing something more robust, I was wondering if you had other ideas to sample unique values from a large discrete uniform. I'm looking for either an algorithm, a module or an idea on how to manage very large lists directly (perhaps using the hard drive instead of memory).
There is an interesting post, Generating sorted random ints without the sort? O(n) which suggests that instead of generating uniform random ints, you can do a running-sum on exponential random deltas, which gives you a uniform random result generated in sorted order.
It's not guaranteed to give exactly the number of samples you want, but should be pretty close, and much faster / lower memory requirements.
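A scaled-down, vectorized sketch of that running-sum idea (for the real problem N = 3e9 and k = 3.7e8, and you would generate the gaps in chunks to bound memory; casting to integers can also produce the occasional duplicate, hence the np.unique):

import numpy as np

rng = np.random.default_rng()

N = 3_000_000      # top of the range [0, N]
k = 370_000        # approximate number of samples wanted

# Gaps between consecutive sorted uniform draws are roughly exponential with
# mean N / k, so a running sum of exponential deltas yields samples already sorted.
gaps = rng.exponential(N / k, size=int(k * 1.05))
positions = np.cumsum(gaps)
samples = np.unique(positions[positions <= N].astype(np.int64))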
Edit: I found a second post, generating sorted random numbers without exponentiation involved?, which suggests tweaking the distribution density as you go to generate an exact number of samples, but I am leery of exactly what this would do to your "uniform" distribution.
Edit2: Another possibility that occurs to me would be to use an inverse cumulative binomial distribution to iteratively split your sample range (predict how many uniformly generated random samples would fall in the lower half of the range, then the remainder must be in the upper half) until the block-size reaches something you can easily hold in memory.
This is a standard sample without replacement. You can't divide the range [0, 3*10^9] into equally binned ranges and sample the same amount in each bin.
Also, 3 billion is relatively large: many "ready to use" codes only handle 32-bit integers, i.e. roughly 2 billion, give or take. Please take a close look at their implementations.
I have data in the form of two equally long arrays (or, equivalently, one array of two-item entries), and I would like to calculate the correlation and the statistical significance represented by the data (which may be tightly correlated, or may have no statistically significant correlation).
I am programming in Python and have scipy and numpy installed. I looked and found Calculating Pearson correlation and significance in Python, but that seems to want the data to be manipulated so it falls into a specified range.
What is the proper way to, I assume, ask scipy or numpy to give me the correlation and statistical significance of two arrays?
If you want to calculate the Pearson correlation coefficient, then scipy.stats.pearsonr is the way to go, although the significance is only meaningful for larger data sets. This function does not require the data to be manipulated to fall into a specified range. The value of the correlation falls in the interval [-1, 1]; perhaps that was the confusion?
If the significance is not terribly important, you can use numpy.corrcoef().
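A minimal example with made-up data:

import numpy as np
from scipy import stats

x = np.array([1.0, 2.1, 2.9, 4.2, 5.1])   # placeholder paired measurements
y = np.array([1.2, 1.9, 3.2, 3.9, 5.0])

r, p_value = stats.pearsonr(x, y)   # correlation in [-1, 1] and its two-sided p-value
r_only = np.corrcoef(x, y)[0, 1]    # same coefficient, without the significance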
You can use the Mahalanobis distance between these two arrays; it does take the correlation between them into account, but note that it gives you a distance measure rather than a correlation. (Mathematically, the Mahalanobis distance is not a true distance function; nevertheless, it can be used as one in certain contexts to great advantage.)
The function is in the scipy package: scipy.spatial.distance.mahalanobis
There's a nice example here
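A small usage sketch (made-up data; mahalanobis() expects the inverse covariance matrix as its third argument):

import numpy as np
from scipy.spatial import distance

data = np.random.randn(100, 2)   # placeholder: rows are observations, columns are the two variables
u, v = data[0], data[1]

VI = np.linalg.inv(np.cov(data, rowvar=False))   # inverse covariance matrix
d = distance.mahalanobis(u, v, VI)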
scipy.spatial.distance.euclidean()
This gives the Euclidean distance between 2 points, 2 np arrays, 2 lists, etc.
import scipy.spatial.distance as spsd
spsd.euclidean(nparray1, nparray2)
You can find more info here http://docs.scipy.org/doc/scipy/reference/spatial.distance.html