I would like to use total variation in Python, but I wasn't able to find an existing implementation.
Assuming that I have an array with a finite number of elements, is the implementation with NumPy simply as:
import numpy as np
a = np.array([...], dtype=float)
tv = np.sum(np.abs(np.diff(a)))
My main doubt is how to compute the supremum of tv over all partitions, and whether the sum of absolute differences alone suffices for a finite array of floats.
Edit: My input array represents a piecewise linear function, therefore the supremum over the full set of partitions is indeed the sum of absolute differences between contiguous points.
Yes, that is correct.
I imagine you're confused by the mathy definition on the Wikipedia page for total variation. Have a look at the more practical definition on the Wikipedia page for total variation denoising instead.
For an actual code (even Python) implementation, see e.g. TensorFlow's tf.image.total_variation(), though this is for one or more (2D, color) images, so the TV is computed along both rows and columns, and the two are then added together.
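As a minimal NumPy sketch of both cases discussed above (the function names are made up for illustration, and the 2D version simply adds the row-wise and column-wise sums of absolute differences, which is an assumption about the convention you want):

```python
import numpy as np

def total_variation_1d(samples):
    """Sum of absolute differences between consecutive samples."""
    a = np.asarray(samples, dtype=float)
    return np.sum(np.abs(np.diff(a)))

def total_variation_2d(image):
    """Row-wise plus column-wise total variation of a 2D array."""
    img = np.asarray(image, dtype=float)
    tv_rows = np.sum(np.abs(np.diff(img, axis=0)))  # differences between rows
    tv_cols = np.sum(np.abs(np.diff(img, axis=1)))  # differences between columns
    return tv_rows + tv_cols

tv_signal = total_variation_1d([1, 3, 2, 2, 5])
tv_image = total_variation_2d([[0, 1], [2, 4]])
```

For a piecewise linear function sampled at its breakpoints, the 1D value is the supremum over all partitions, because refining the partition between breakpoints cannot increase the sum.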
I have a list, like the one below:
[1,1,1,1,1,1,1,1,1,10]
I want to calculate the list average; however, the arithmetic mean would not be an answer. I want my result to be skewed toward 10 because it is assumed to have much more weight. The geometric mean is not an option either. I would like to know whether there are other measures of central tendency that can be calculated in Python using predefined functions. I am preparing my data for a neural network and will try various weights to see which one performs better. That is why I am looking for predefined functions: I can change their parameters to create multiple databases for my neural network.
Thanks in advance
P.s. The priority is to find the measure. Applying it in python comes next.
From my deep learning practice I don't quite agree that a higher weight should be desired.
Yet regarding your particular question, you can solve it with a straightforward power (a power mean). Raising the values to a power makes the high values dominate the lower ones. After computing the mean of the powered values, take the corresponding root to return to the original scale.
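And make sure you use an odd value for power, so that negative numbers retain their sign. (If the mean of the powered values comes out negative, the sign has to be handled separately when taking the root, since NumPy returns nan for a fractional power of a negative float.)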
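import numpy as np
x = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 10], dtype=float)  # float avoids integer overflow for large powers
power = 15
powered_mean = np.mean(np.power(x, power))  # mean of the powered values
central_tendency = np.power(powered_mean, 1 / power)  # root, back to the original scale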
8.576958985908947
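To make the exponent a tunable parameter, the same idea can be wrapped in a small helper. This is only a sketch: power_mean is a hypothetical name, and the signed root is my own addition so that an odd power with a negative mean does not produce nan.

```python
import numpy as np

def power_mean(values, p):
    """Power mean: p-th root of the mean of the p-th powers.

    Intended for odd integer p, or for any p > 0 with non-negative data.
    """
    x = np.asarray(values, dtype=float)
    m = np.mean(x ** p)
    # Take the root of the magnitude and restore the sign afterwards.
    return np.sign(m) * np.abs(m) ** (1.0 / p)

x = [1, 1, 1, 1, 1, 1, 1, 1, 1, 10]
results = {p: power_mean(x, p) for p in (1, 3, 15)}
```

With p = 1 this is the arithmetic mean; larger p pulls the result toward the largest value, so sweeping p gives a family of differently skewed averages.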
I am doing feature scaling on my data, and R and Python are giving me different results. The two disagree on many summary statistics:
Median:
NumPy gives 14.948499999999999 with this code: np.percentile(X[:, 0], 50, interpolation = 'midpoint').
Python's built-in statistics module gives the same answer with the following code: statistics.median(X[:, 0]).
On the other hand, R gives 14.9632 with this code: median(X[, 1]). Interestingly, the summary() function in R gives 14.960 as the median.
A similar difference occurs when computing the mean of this same data. R gives 13.10936 using the built-in mean() function and both Numpy and the Python Statistics package give 13.097945407088607.
Again, the same thing happens when computing the Standard Deviation. R gives 7.390328 and Numpy (with DDOF = 1) gives 7.3927612774052083. With DDOF = 0, Numpy gives 7.3927565984408936.
The IQR also differs. Using the built-in IQR() function in R, the result is 12.3468. Using NumPy with this code: np.percentile(X[:, 0], 75) - np.percentile(X[:, 0], 25), the result is 12.358700000000002.
What is going on here? Why are Python and R always giving different results? It may help to know that my data has 795066 rows and is being treated as an np.array() in Python. The same data is being treated as a matrix in R.
tl;dr there are a few potential differences in algorithms even for such simple summary statistics, but given that you're seeing differences across the board and even in relatively simple computations such as the median, I think the problem is more likely that the values are getting truncated/modified/losing precision somehow in the transfer between platforms.
(This is more of an extended comment than an answer, but it was getting awkwardly long.)
you're unlikely to get much farther without a reproducible example; there are various ways to create examples to test hypotheses for the differences, but it's better if you do so yourself rather than making answerers do it.
how are you transferring data to/from Python/R? Is there some rounding in the representation used in the transfer? (What do you get for max/min, which should be based on a single number with no floating-point computations? How about if you drop one value to get an odd-length vector and take the median?)
medians: I was originally going to say that this could be a function of different ways to define quantile interpolation for an even-length vector, but the definition of the median is somewhat simpler than general quantiles, so I'm not sure. The differences you're reporting above seem way too big to be driven by floating-point computation in this case (since the computation is just an average of two values of similar magnitude).
IQRs: similarly, there are different possible definitions of percentiles/quantiles: see ?quantile in R.
median() vs summary(): R's summary() reports values at reduced precision (often useful for a quick overview); this is a common source of confusion.
mean/sd: there are some possible subtleties in the algorithm here -- for example, R uses extended precision internally when summing to reduce instability; I don't know whether Python does. However, this shouldn't make as big a difference as you're seeing unless the data are a bit weird:
> x <- rnorm(1000000,mean=0,sd=1)
> mean(x)
[1] 0.001386724
> sum(x)/length(x)
[1] 0.001386724
> mean(x)-sum(x)/length(x)
[1] -1.734723e-18
Similarly, there are more- and less-stable ways to compute a variance/standard deviation.
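To separate the two explanations above (algorithm differences versus precision lost in transfer), a rough check on the Python side could look like the following. The data are synthetic, and rounding to one decimal place is only a stand-in assumption for whatever an export step might do to the values.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=13.0, scale=7.4, size=100_000)  # synthetic stand-in data

# Algorithmic effect: NumPy's mean versus an exactly rounded sum.
naive_mean = np.mean(x)
exact_mean = math.fsum(x) / x.size
algo_gap = abs(naive_mean - exact_mean)

# Transfer effect: pretend the values were rounded while being exported.
x_rounded = np.round(x, 1)
transfer_gap = abs(np.median(x) - np.median(x_rounded))

print(f"summation algorithm gap: {algo_gap:.3e}")
print(f"rounding-in-transfer gap: {transfer_gap:.3e}")
```

If the gap you observe between R and Python looks like the second number rather than the first, comparing min, max, and length on both sides is a good next step.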
I am trying to take the inverse of a 365x365 matrix. Some of the values get as large as 365**365 and so they are converted to long numbers. I don't know if the linalg.matrix_power() function can handle long numbers. I know the problem comes from this (because of the error message and because my program works just fine for smaller matrices) but I am not sure if there is a way around this. The code needs to work for a NxN matrix.
Here's my code:
item=0
for i in xlist:
    xtotal.append(arrayit.arrayit(xlist[item],len(xlist)))
    item=item+1
print xtotal
xinverted=numpy.linalg.matrix_power(xtotal,-1)
coeff=numpy.dot(xinverted,ylist)
arrayit.arrayit:
def arrayit(number, length):
    newarray=[]
    import decimal
    i=0
    while i!=(length):
        newarray.insert(0,decimal.Decimal(number**i))
        i=i+1
    return newarray
The program is taking x,y coordinates from a list (list of x's and list of y's) and makes a function.
Thanks!
One thing you might try is the library mpmath, which can do simple matrix algebra and other such problems on arbitrary precision numbers.
A couple of caveats: It will almost certainly be slower than using numpy, and, as Lutzl points out in his answer to this question, the problem may well not be mathematically well defined. Also, you need to decide on the precision you want before you start.
Some brief example code,
from mpmath import mp, matrix
# set the precision - see http://mpmath.org/doc/current/basics.html#setting-the-precision
mp.prec = 5000 # set it to something big at the cost of speed.
# Ideally you'd precalculate what you need.
# a quick trial with 100*100 showed that 5000 works and 500 fails
# see the documentation at http://mpmath.org/doc/current/matrices.html
# where xtotal is the output from arrayit
my_matrix = matrix(xtotal) # I think this should work. If not you'll have to create it and copy
# do the inverse
xinverted = my_matrix**-1
coeff = xinverted*matrix(ylist)
# note that as Lutzl pointed out you really want to use solve instead of calculating the inverse.
# I think this is something like
from mpmath import lu_solve
coeff = lu_solve(my_matrix,matrix(ylist))
I suspect your real problem is with the maths rather than the software, so I doubt this will work fantastically well for you, but it's always possible!
Did you ever hear of Lagrange or Newton interpolation? This would avoid the whole construction of the Vandermonde matrix, but not the potentially large numbers in the coefficients.
As a general observation, you do not want the inverse matrix. You do not need to compute it. What you want is to solve a system of linear equations.
x = numpy.linalg.solve(A, b)
solves the system A*x=b.
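As a small illustration for the interpolation problem in the question (the sample points here are made up, and np.vander is used in place of the asker's arrayit helper because it builds the same highest-power-first rows):

```python
import numpy as np

# Four made-up sample points taken from p(x) = x**3 - 2*x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 0.0, 5.0, 22.0])

A = np.vander(x)               # columns are x**3, x**2, x**1, x**0
coeff = np.linalg.solve(A, y)  # solves A @ coeff = y without forming the inverse

print(coeff)
```

This works well for a handful of points; it does not remove the conditioning problems that appear when the number of points grows.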
You (really) might want to look up the Runge phenomenon. Interpolation with equally spaced sample points becomes increasingly ill-conditioned as the degree grows. Useful results can be obtained for single-digit degrees; larger degrees tend to give wildly oscillating polynomials.
You can often use polynomial regression, i.e., approximating your data set by the best polynomial of some low degree.
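A minimal sketch of such a low-degree fit, using np.polyfit on synthetic data (the quadratic, the noise level, and the choice of degree 2 are all assumptions made for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

# 365 equally spaced points with a little noise around a known quadratic.
x = np.linspace(-1.0, 1.0, 365)
y = 2.0 * x**2 - x + 0.5 + rng.normal(scale=0.01, size=x.size)

degree = 2                                # low degree, chosen by hand
coeff = np.polyfit(x, y, degree)          # least squares, highest power first
residual = y - np.polyval(coeff, x)

print(coeff, np.abs(residual).max())
```

Unlike exact interpolation through all 365 points, the least-squares fit does not pass through every sample, which is what keeps it from oscillating.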
I need to select 3.7*10^8 unique values from the range [0, 3*10^9] and either obtain them in order or keep them in memory.
To do this, I started working on a simple algorithm where I sample smaller uniform distributions (that fit in memory) in order to indirectly sample the large distribution that really interests me.
The code is available at the following gist https://gist.github.com/legaultmarc/7290ac4bef4edb591d1e
Since I'm having trouble implementing something more robust, I was wondering if you had other ideas to sample unique values from a large discrete uniform. I'm looking for either an algorithm, a module or an idea on how to manage very large lists directly (perhaps using the hard drive instead of memory).
There is an interesting post, Generating sorted random ints without the sort? O(n) which suggests that instead of generating uniform random ints, you can do a running-sum on exponential random deltas, which gives you a uniform random result generated in sorted order.
It's not guaranteed to give exactly the number of samples you want, but should be pretty close, and much faster / lower memory requirements.
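A rough sketch of that running-sum idea, processed in chunks so memory stays bounded. The function name, the chunk size, and the way duplicates are dropped after flooring to integers are my own assumptions, not taken from the linked post.

```python
import numpy as np

def sorted_unique_ints(n_target, upper, seed=None, chunk=1_000_000):
    """Yield increasing unique ints in [0, upper]; count is only about n_target."""
    rng = np.random.default_rng(seed)
    mean_gap = (upper + 1) / n_target   # average spacing between samples
    position = 0.0                      # running sum carried across chunks
    last = -1                           # last integer already yielded
    while position < upper + 1:
        points = position + np.cumsum(rng.exponential(mean_gap, size=chunk))
        position = points[-1]
        in_range = points[points < upper + 1]
        values = np.unique(np.floor(in_range).astype(np.int64))
        values = values[values > last]  # drop a duplicate across the chunk boundary
        if values.size:
            last = int(values[-1])
            yield from values.tolist()

sample = list(sorted_unique_ints(1000, 9999, seed=1, chunk=256))
```

Because duplicates are discarded, the final count tends to fall somewhat below n_target, more so when the requested density is high; the generator can be consumed lazily or written to disk instead of being collected in a list.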
Edit: I found a second post, generating sorted random numbers without exponentiation involved? which suggests tweaking the distribution density as you go to generate an exact number of samples, but I am leery of just exactly what this would do to your "uniform" distribution.
Edit2: Another possibility that occurs to me would be to use an inverse cumulative binomial distribution to iteratively split your sample range (predict how many uniformly generated random samples would fall in the lower half of the range, then the remainder must be in the upper half) until the block-size reaches something you can easily hold in memory.
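Here is a sketch of that range-splitting idea. One deliberate change, stated plainly: because the values must be unique (sampling without replacement), the number of samples falling in the lower half follows a hypergeometric distribution rather than a binomial one, so the sketch uses Generator.hypergeometric. The function name and block size are hypothetical.

```python
import numpy as np

def split_sample(lo, hi, k, rng, block=1_000_000):
    """Yield sorted arrays that together hold k unique ints from [lo, hi)."""
    if k == 0:
        return
    size = hi - lo
    if size <= block:
        # Small enough to sample directly and sort in memory.
        picks = rng.choice(size, size=k, replace=False)
        yield np.sort(picks) + lo
        return
    mid = lo + size // 2
    # How many of the k unique samples land in the lower half [lo, mid).
    # Note: NumPy documents an upper limit of 10**9 on the first two
    # arguments, so a very large range needs pre-splitting by other means.
    k_low = int(rng.hypergeometric(mid - lo, hi - mid, k))
    yield from split_sample(lo, mid, k_low, rng, block)
    yield from split_sample(mid, hi, k - k_low, rng, block)

rng = np.random.default_rng(0)
chunks = list(split_sample(0, 10_000, 500, rng, block=100))
result = np.concatenate(chunks)
```

Each yielded block is already sorted and the blocks arrive in increasing order, so they can be streamed to disk one at a time while still producing exactly k values.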
This is standard sampling without replacement. You can't divide the range [0, 3*10^9] into equal bins and sample the same number of values from each bin, because the number of samples that land in each bin is itself random.
Also, 3 billion is relatively large; many "ready to use" codes only handle 32-bit integers, which cover a range of roughly ±2 billion. Please take a close look at their implementations.
I have two 2-D arrays with the same shape (105,234) named A & B essentially comprised of mean values from other arrays. I am familiar with Python's scipy package, but I can't seem to find a way to test whether or not the two arrays are statistically significantly different at each individual array index. I'm thinking this is just a large 2D paired T-test, but am having difficulty. Any ideas or other packages to use?
If we assume that the underlying variance for each mean at the gridpoints is the same, and the number of observations is the same or is known, then we can use the arrays of means to estimate the standard deviation of the means directly.
Dividing the difference between gridpoints by the standard deviation then gives t-distributed random variables that can be tested directly, i.e. the p-value can be calculated.
Because we are testing at many points, we will run into a multiple testing problem (http://en.wikipedia.org/wiki/Multiple_comparisons#Large-scale_multiple_testing), and the p-values should be corrected.
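A sketch of how this could look with NumPy and SciPy, under strong assumptions: A and B are random stand-ins for the real arrays, a single common standard deviation is estimated from the spread of the differences over the whole grid, and a Bonferroni correction is used as one simple choice among several.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
A = rng.normal(size=(105, 234))   # stand-in for the first array of means
B = rng.normal(size=(105, 234))   # stand-in for the second array of means

diff = A - B
sd_diff = diff.std(ddof=1)        # assumes the same variance at every gridpoint
t_stat = diff / sd_diff
dof = diff.size - 1

# Two-sided p-value at each gridpoint.
p_values = 2.0 * stats.t.sf(np.abs(t_stat), dof)

# Bonferroni correction for testing 105 * 234 gridpoints at once.
alpha = 0.05
significant = p_values < alpha / p_values.size
```

If the number of observations behind each mean is known, the standard deviation should be scaled accordingly instead of being estimated from the grid as done here.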
If your question is "Do two-dimensional distributions differ?", see
Numerical Recipes p. 763
(and ask further on how to do that in numpy / scipy).
You might also ask on stats.stackexchange.
I assume that x,y coordinates do not matter and we just have the two huge sets of independent measurements.
One possible approach is to compute the standard deviation of the mean for each array, multiply this value by the Student coefficient (approximately 1.96 for your astronomical number of samples and a two-sided 95% confidence level; 1.645 would be the one-sided value), and obtain the confidence ranges around each mean this way. If the confidence ranges of the two arrays overlap, the difference between them is not significant. Formulas can be found here.
Go to MS Excel. If you don't have it, your workplace probably does, and there are alternatives.
Enter the arrays of numbers in an Excel worksheet. Run the formula in the entry field: =TTEST(array1,array2,tails,type). One tail is 1, two tails is 2, and type selects the test (1 = paired, 2 = equal variance, 3 = unequal variance)... easy peasy. It's a simple Student's t-test, and TTEST returns the probability (p-value) directly, so no t-table is needed to interpret it. It's quick for on-the-fly comparison of samples.
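For readers staying in Python, a rough counterpart of the paired, two-tailed TTEST is scipy.stats.ttest_rel. The sketch below assumes the original observations behind the means are still available as stacks of shape (n_obs, 105, 234); the stacks here are random placeholders, and n_obs = 12 is arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_obs = 12                                   # hypothetical number of observations
stack_a = rng.normal(size=(n_obs, 105, 234))
stack_b = rng.normal(size=(n_obs, 105, 234))

# Paired t-test along the observation axis: one test per gridpoint.
result = stats.ttest_rel(stack_a, stack_b, axis=0)
t_map = result.statistic                     # shape (105, 234)
p_map = result.pvalue                        # shape (105, 234)
```

The resulting p-values still need a multiple-testing correction before individual gridpoints are declared significant.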