Randomizing values accounting for floating point resolution - python

I have an array of values that I'm clipping to be within a certain range. I don't want large numbers of values to be identical though, so I'm adding a small amount of random noise after the operation. I think I need to account for the floating-point resolution for this to work.
Right now I've got code something like this:
import numpy as np
np.minimum(x[:,0:3],topRtBk,x[:,0:3])
np.maximum(x[:,0:3],botLftFrnt,x[:,0:3])
np.add(x[:,0:3],np.random.randn(x.shape[0],3).astype(real_t)*5e-5,x[:,0:3])
where topRtBk and botLftFrnt are the 3D bounding limits (there's another version of this for spheres).
real_t is configurable to np.float32 or np.float64 (other parts of the code are GPU accelerated, and this may be eventually as well).
The 5e-5 is a magic number which is twice np.finfo(np.float32).resolution, and the crux of my question: what's the right value to use here?
I'd like to dither the values by the smallest possible amount while retaining sufficient variation, and I admit that "sufficient" is rather ill-defined. I'm trying to minimize duplicate values, but having some won't kill me.
I guess my question is twofold: is this the right approach to use, and what's a reasonable scale factor for the random numbers?
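For reference, here is a reduced, self-contained version of the code with the noise scale derived from the configured dtype rather than the literal 5e-5. The toy data, the bound arrays, and the factor of 2 * np.finfo(real_t).resolution are just placeholders to experiment with, and np.clip stands in for the minimum/maximum pair above:

import numpy as np

real_t = np.float32  # configurable, as described above

# toy data and bounds standing in for x, botLftFrnt and topRtBk
x = (np.random.rand(10, 3) * 4 - 2).astype(real_t)
botLftFrnt = np.array([-1, -1, -1], dtype=real_t)
topRtBk = np.array([1, 1, 1], dtype=real_t)

# derive the dither scale from the dtype's resolution instead of a magic number
noise_scale = 2 * np.finfo(real_t).resolution

np.clip(x[:, 0:3], botLftFrnt, topRtBk, out=x[:, 0:3])
x[:, 0:3] += np.random.randn(x.shape[0], 3).astype(real_t) * noise_scale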

Related

Approximate maximum of an unknown curve

I have a noisy data set containing several peaks.
I used the scipy.signal.find_peaks function to determine the peaks of the data set, and it works well enough, but since this function finds local maxima, it does not ignore the noise in the data, which causes overshoot. So what I'm determining isn't actually the location of the most likely maximum, but rather the location of an 'outlier'.
Is there another, more exact way to approximate the local maxima?
I'm not sure you can consider those points to be outliers so easily, as they look to be close to where I would expect them. But if you don't think they are a valid approximation, let me suggest three other approaches you can use.
First option
I would construct a physical model of these peaks (a mathematical formula) and do a fitting analysis around each peak. You can, for instance, suppose that the shape of the plot is the sum of some background model (maybe constant, maybe more complicated) plus some Gaussian (or Lorentzian) peaks.
This is what we usually do in physics. Of course it will be more accurate if you bring in knowledge of the underlying processes, which I don't have.
Given a good model, this approach is definitely better than taking the maximum values: even if those points are not outliers, they still carry errors that you want to reduce.
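For example, a minimal sketch of such a fit with scipy.optimize.curve_fit, assuming a constant background plus a single Gaussian; the synthetic x_data/y_data and the initial guesses are placeholders, since the real data isn't shown here:

import numpy as np
from scipy.optimize import curve_fit

def gaussian_with_background(x, amplitude, center, width, background):
    # constant background plus one Gaussian peak
    return background + amplitude * np.exp(-((x - center) ** 2) / (2 * width ** 2))

# synthetic data standing in for a window around one detected peak
rng = np.random.default_rng(0)
x_data = np.linspace(-5, 5, 200)
y_data = gaussian_with_background(x_data, 3.0, 1.2, 0.8, 0.5) + rng.normal(0, 0.2, x_data.size)

# rough initial guess: height, location, width, baseline
p0 = [y_data.max() - y_data.min(), x_data[np.argmax(y_data)], 1.0, y_data.min()]
popt, pcov = curve_fit(gaussian_with_background, x_data, y_data, p0=p0)
peak_location = popt[1]   # the fitted center is far less sensitive to single noisy samples
print(peak_location)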
Second option
But if you want an easier way and just a rough estimate, and you have already found the locations of the three peaks programmatically, you can average a few points around each maximum; the functions np.where and np.argwhere tend to be useful for this kind of thing.
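As an illustration of that rough estimate (shown with synthetic data, since the real set isn't available here), one could average a small window of samples around each index returned by find_peaks; the window size is an arbitrary assumption to tune:

import numpy as np
from scipy.signal import find_peaks

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 500)
y = np.exp(-((x - 5.0) ** 2)) + rng.normal(0, 0.05, x.size)   # one noisy peak

peaks, _ = find_peaks(y, height=0.5, distance=50)
window = 5   # samples averaged on each side of the detected maximum
for p in peaks:
    lo, hi = max(p - window, 0), min(p + window + 1, y.size)
    print(x[p], y[lo:hi].mean())   # location estimate and noise-reduced height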
Third option
The easiest option is taking the value by hand. It may sound unacceptable for academic purposes, and it probably is. Even worse, it is not a programmatic way, and this is SO. But in the end, it depends on why and for what you need those values, and on the confidence interval you need for your measurement.

Finding which points are in a 2D region

I have a very large data set consisting of (x, y) coordinates. I need to know which of these points lie in certain regions of the 2D space. Each region is bounded by 4 lines in the 2D domain (some of the sides are slightly curved).
For smaller data sets I have used a cumbersome for loop to test each individual point for membership of each region. This no longer seems like a good option due to the size of the data set.
Is there a better way to do this?
For example:
If I have a set of points:
(0,1)
(1,2)
(3,7)
(1,4)
(7,5)
and a region bounded by the lines:
y=2
y=5
y=5*sqrt(x) +1
x=2
I want to find a way to identify the point (or points) in that region.
Thanks.
The exact code is on another computer but from memory it was something like:
point_list = []
for i in range(num_po):
    a = 5 * sqrt(points[i, 0]) + 1
    b = 2
    c = 2
    d = 5
    if (points[i, 1] < a) and (points[i, 0] < b) and (points[i, 1] > c) and (points[i, 1] < d):
        point_list.append(points[i])
This isn't the exact code but should give an idea of what I've tried.
If you have a single region (or a small number of them), then it is going to be hard to do much better than checking every point. The check per point can be fast, particularly if you choose the fastest or most discriminating check first (e.g. in your example, perhaps, x > 2).
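As a sketch of how fast that per-point check can be, the loop from the question could be vectorized with NumPy boolean masks (the points array and the strict inequalities are taken from the question's example; whether the boundaries should be inclusive is an assumption):

import numpy as np

points = np.array([[0, 1], [1, 2], [3, 7], [1, 4], [7, 5]], dtype=float)
x, y = points[:, 0], points[:, 1]

# one boolean per point, True when the point lies inside the example region
inside = (y > 2) & (y < 5) & (x < 2) & (y < 5 * np.sqrt(x) + 1)
print(points[inside])   # -> [[1. 4.]]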
If you have many regions, then speed can be gained by using a spatial index (perhaps an R-Tree), which rapidly identifies a small set of candidates that are in the right area. Then each candidate is checked one by one, much as you are checking already. You could choose to index either the points or the regions.
I use the python Rtree package for spatial indexing and find it very effective.
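A small sketch of that second idea with the Rtree package, indexing the points and using the region's axis-aligned bounding box as a coarse first filter before the exact check; the bounding box values and the example points are assumptions based on the question's example region:

import numpy as np
from rtree import index

points = np.array([[0, 1], [1, 2], [3, 7], [1, 4], [7, 5]], dtype=float)

# index each point as a degenerate rectangle (left, bottom, right, top)
idx = index.Index()
for i, (px, py) in enumerate(points):
    idx.insert(i, (px, py, px, py))

# coarse pass: bounding box of the example region (roughly 0 <= x <= 2, 2 <= y <= 5)
candidates = idx.intersection((0.0, 2.0, 2.0, 5.0))

# exact pass: apply the full (curved) boundary test only to the few candidates
inside = [i for i in candidates
          if points[i, 1] > 2 and points[i, 1] < 5
          and points[i, 0] < 2
          and points[i, 1] < 5 * np.sqrt(points[i, 0]) + 1]
print(points[inside])   # -> [[1. 4.]]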
This is called the range searching problem and is a much-studied topic in computational geometry. It is rather involved (your square root makes things nonlinear, hence more difficult). Here is a nice blog post about using SciPy to do computational geometry in Python.
Long comment:
You are not telling us the whole story.
If you have this big set of points (say N of them) and a single set of these curvilinear quadrilaterals (say M of them), and you need to solve the problem once, you cannot avoid exhaustively testing all points against the acceptance area.
Anyway, you can probably preprocess the M regions in such a way that testing a point against the acceptance area takes fewer than M operations (closer to log(M)). But given the small value of M, big savings are unlikely.
Now if you don't just have one acceptance area but many of them to be applied in turn to the same point set, then more sophisticated solutions are possible (namely range searching) that can cut the N comparisons per query down to about log(N), a quite significant improvement.
It may also be that the point set is not completely random and there is some property of the point set that can be exploited.
You should tell us more and show a sample case.

Inverse Matrix (Numpy) int too large to convert to float

I am trying to take the inverse of a 365x365 matrix. Some of the values get as large as 365**365 and so they are converted to long numbers. I don't know if the linalg.matrix_power() function can handle long numbers. I know the problem comes from this (because of the error message and because my program works just fine for smaller matrices), but I am not sure if there is a way around it. The code needs to work for an NxN matrix.
Here's my code:
item=0
for i in xlist:
    xtotal.append(arrayit.arrayit(xlist[item],len(xlist)))
    item=item+1
print xtotal
xinverted=numpy.linalg.matrix_power(xtotal,-1)
coeff=numpy.dot(xinverted,ylist)
arrayit.arrayit:
def arrayit(number, length):
    newarray=[]
    import decimal
    i=0
    while i!=(length):
        newarray.insert(0,decimal.Decimal(number**i))
        i=i+1
    return newarray
The program takes x, y coordinates from two lists (a list of x's and a list of y's) and constructs a function through those points.
Thanks!
One thing you might try is the library mpmath, which can do simple matrix algebra and other such problems on arbitrary precision numbers.
A couple of caveats: It will almost certainly be slower than using numpy, and, as Lutzl points out in his answer to this question, the problem may well not be mathematically well defined. Also, you need to decide on the precision you want before you start.
Some brief example code,
from mpmath import mp, matrix
# set the precision - see http://mpmath.org/doc/current/basics.html#setting-the-precision
mp.prec = 5000 # set it to something big at the cost of speed.
# Ideally you'd precalculate what you need.
# a quick trial with 100*100 showed that 5000 works and 500 fails
# see the documentation at http://mpmath.org/doc/current/matrices.html
# where xtotal is the output from arrayit
my_matrix = matrix(xtotal) # I think this should work. If not you'll have to create it and copy
# do the inverse
xinverted = my_matrix**-1
coeff = xinverted*matrix(ylist)
# note that, as Lutzl pointed out, you really want to use solve instead of calculating the inverse.
# I think this is something like
from mpmath import lu_solve
coeff = lu_solve(my_matrix,matrix(ylist))
I suspect your real problem is with the maths rather than the software, so I doubt this will work fantastically well for you, but it's always possible!
Did you ever hear of Lagrange or Newton interpolation? Either would avoid the whole construction of the Vandermonde matrix, though not the potentially large numbers in the coefficients.
As a general observation, you do not want the inverse matrix. You do not need to compute it. What you want is to solve a system of linear equations.
x = numpy.linalg.solve(A, b)
solves the system A*x=b.
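A short sketch of that advice, using np.vander to build the same kind of matrix that arrayit constructs and solving directly instead of inverting; the sample points are made up for illustration:

import numpy as np

# made-up sample points standing in for xlist and ylist
xlist = np.array([0.0, 1.0, 2.0, 3.0])
ylist = np.array([1.0, 3.0, 11.0, 31.0])

A = np.vander(xlist)               # Vandermonde matrix, highest power first
coeff = np.linalg.solve(A, ylist)  # solves A @ coeff = ylist without forming A**-1
print(coeff)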
You (really) might want to look up the Runge effect. Interpolation with equally spaced sample points is an increasingly ill-conditioned task: useful results can be obtained for single-digit degrees, while larger degrees tend to give wildly oscillating polynomials.
You can often use polynomial regression, i.e., approximating your data set by the best polynomial of some low degree.
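A minimal sketch of that low-degree regression with np.polyfit; the data and the chosen degree are made-up placeholders:

import numpy as np

rng = np.random.default_rng(0)
xlist = np.linspace(0.0, 10.0, 50)
ylist = 3.0 * xlist**2 - 2.0 * xlist + 1.0 + rng.normal(0, 5, xlist.size)

degree = 3                                 # keep this low to avoid wild oscillation
coeff = np.polyfit(xlist, ylist, degree)   # least-squares fit, highest power first
fitted = np.poly1d(coeff)                  # callable polynomial approximation
print(fitted(2.5))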

Sampling from a huge uniform distribution in Python

I need to select 3.7*10^8 unique values from the range [0, 3*10^9] and either obtain them in order or keep them in memory.
To do this, I started working on a simple algorithm where I sample smaller uniform distributions (that fit in memory) in order to indirectly sample the large distribution that really interests me.
The code is available at the following gist https://gist.github.com/legaultmarc/7290ac4bef4edb591d1e
Since I'm having trouble implementing something more robust, I was wondering if you had other ideas to sample unique values from a large discrete uniform. I'm looking for either an algorithm, a module or an idea on how to manage very large lists directly (perhaps using the hard drive instead of memory).
There is an interesting post, Generating sorted random ints without the sort? O(n) which suggests that instead of generating uniform random ints, you can do a running-sum on exponential random deltas, which gives you a uniform random result generated in sorted order.
It's not guaranteed to give exactly the number of samples you want, but it should be pretty close, and it is much faster with much lower memory requirements.
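A rough sketch of that running-sum idea (demonstrated with a smaller sample size; the 5% head-room and the deduplication step are assumptions, and the full 3.7*10^8 sample would still need a few GB of RAM):

import numpy as np

rng = np.random.default_rng(42)

def approx_sorted_sample(k, upper):
    # exponential gaps with mean upper/k, cumulatively summed, land roughly
    # uniformly in [0, upper) and come out already sorted
    gaps = rng.exponential(upper / k, int(k * 1.05))   # 5% head-room
    values = np.cumsum(gaps)
    values = values[values < upper].astype(np.int64)
    return np.unique(values)   # flooring to integers can create the odd duplicate

sample = approx_sorted_sample(1_000_000, 3 * 10**9)
print(sample.size, sample[:5])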
Edit: I found a second post, generating sorted random numbers without exponentiation involved?, which suggests tweaking the distribution density as you go to generate an exact number of samples, but I am leery of exactly what this would do to your "uniform" distribution.
Edit2: Another possibility that occurs to me would be to use an inverse cumulative binomial distribution to iteratively split your sample range (predict how many uniformly generated random samples would fall in the lower half of the range, then the remainder must be in the upper half) until the block-size reaches something you can easily hold in memory.
This is a standard sample without replacement. Note that you cannot just divide the range [0, 3*10^9] into equal bins and sample the same amount from each bin; in a true uniform sample the per-bin counts are themselves random.
Also, 3 billion is relatively large; many "ready to use" implementations only handle 32-bit integers, which top out around 2 billion. Take a close look at their implementations.

python & numpy: forcing the matrix to contain values known to range from x to y?

I use numpy to prototype mathematical code. My mathematics involve only probabilities, on which I perform matrix arithmetic (multiplication, the dot function in numpy). Since I know that all values range from 0 to 1, I wonder if I can force numpy to encode the values (to save memory or gain precision) in 32/64 bits but with the upper boundary fixed at 1?
try1 = numpy.array([1.0,0.2564654646546],dtype='f16')
Can a dtype be forced to range from x to y with the same amount of memory per value?
As far as I know, numpy arrays don't support fixed-point arithmetic and I haven't heard of any plans to add it. If you are interested in playing with that stuff, you could check out MATLAB's fixed-point toolbox, or if you really love mathematics you can cook your own using integer datatypes and keeping track of the 'point'.
The way floating point works is already pretty neat though, and I'm not sure you would gain a heap of precision per bit just from the knowledge that the numbers are in [0,1]. Floating point is similar to scientific notation: increasing the number of bits mainly gives you more "significant digits" rather than (just) a larger range of numbers.
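A toy sketch of the cook-your-own approach, storing values in [0, 1] as scaled uint16 and rescaling after multiplication; the 16-bit scale factor is an arbitrary assumption and this is not a built-in numpy feature:

import numpy as np

# fixed-point scheme for [0, 1]: each value is stored as a uint16 scaled by 2**16 - 1
SCALE = np.uint32(2**16 - 1)

def to_fixed(x):
    return np.round(np.asarray(x, dtype=np.float64) * SCALE).astype(np.uint16)

def to_float(q):
    return q.astype(np.float64) / SCALE

def fixed_multiply(a, b):
    # widen to uint32 so the product cannot overflow, then rescale with rounding
    prod = a.astype(np.uint32) * b.astype(np.uint32)
    return ((prod + SCALE // 2) // SCALE).astype(np.uint16)

p = to_fixed([0.25, 0.5, 1.0])
q = to_fixed([0.5, 0.5, 0.5])
print(to_float(fixed_multiply(p, q)))   # approximately [0.125, 0.25, 0.5]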
I suppose nowadays you can achieve this with:
a = np.linspace(0, 1, number_of_points)
