Trouble finding large jumps between data points in an array - python

I am trying to write a sigma clipping program that calculates the differences between each point in an array and its neighbor, and if the difference is greater than x times the standard deviation of the array, it sets the neighbor equal to the average of the two points closest to it. For example, if I had an array, testarray = np.array([1.01, 2.0, 1.22, 1.005, .996, 0.95]), and wanted to change any points that were more than 2 times deviant from their neighbor, then this function would search through the array and set the 2.0 in the testarray equal to 1.115, the average of 1.01 and 1.22.
def sigmaclip2(array, stand):
    originalDeviation = np.std(array)
    differences = np.abs(np.diff(array))
    for i in range(len(differences)):
        if differences[i] > stand*originalDeviation:
            if array[i+1] != array[-1]:
                array[i+1] = (array[i] + array[i+2]) / 2.0
            else:
                array[i+1] = (array[i] + array[i-1]) / 2.0
        else:
            pass
    return array
This code works for this small testarray, but I am working with a larger data set (~12000 elements). When I try to run it on the larger data set, I get back the same array that I plugged in.
Does anyone know what might be going wrong?
I should note that I have tried some of Python's built-in sigma clipping routines, such as the one from Astropy, but it appears that those cut off any values that are greater than x times the standard deviation of the array. This is not what I want to do. I want to find any large, sudden jumps (often caused by 1 bad value) and set that bad value equal to the average of the 2 points around it, if the bad value is discrepant from its neighbor by more than x times the standard deviation.

In line 6 of your function, array[-1] may be a typo, as it always uses the last element of the array. Are you missing an i? In that case you might need to shift by one, as differences[0] is the diff between array[0] and array[1].
PS I think I would use np.where with slice notation on array to find just the indexes to alter rather than using a normal Python loop. With numpy, a loop is almost always a bad idea.
EDIT
Understood about the edges, but I don't think your code does what you expect. When I run it, it averages array[2] to 1.06 as well as array[1] to 1.115.
If I change line 6 to if array[i+1] != array[i-1]: (array[-1] is the last entry, always 0.95), it still doesn't work properly.
You also have to think about what you want your code to do when you get more than one outlier, e.g. 1.01, 2.0, 2.25, 1.99, 1.22, 1.005, .996, 0.95. To cope with single outliers I would use something like
def sigmaclip3(array, stand):
    cutoff = stand * np.std(array)
    diffs = np.abs(np.diff(array))
    ix = np.where((diffs[:-1] > cutoff) &
                  (diffs[1:] > cutoff))[0] + 1
    array[ix] = (array[ix - 1] + array[ix + 1]) / 2.0
    return array
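As a quick check on the test array from the question (with stand=2), this gives the result the original poster describes:
import numpy as np

testarray = np.array([1.01, 2.0, 1.22, 1.005, .996, 0.95])
print(sigmaclip3(testarray, 2))
# only index 1 is flagged (both neighboring diffs exceed 2*std),
# so the 2.0 becomes (1.01 + 1.22) / 2 = 1.115 and the rest is untouched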

Related

Need Help Trying to Simplify this algorithm to map points on an arbitrarily large 2d plane to unique integers

So like the title says, I need help trying to map points from a 2D plane to a number line in such a way that each point is associated with a unique positive integer. Put another way, I need a function f:ZxZ->Z+ and I need f to be injective. Additionally, I need it to run in a reasonable time.
The way I've thought about doing this is to basically just count points, starting at (1,1) and spiraling outwards.
Below I've written some python code to do this for some point (i,j)
def plot_to_int(i, j):
    a = max(i, j)     # we want to find which "square" we are in
    b = (a - 1)**2    # we can start the count from the last square
    J = abs(j)
    I = abs(i)
    if i > 0 and j > 0:    # the first quadrant
        # we start counting anticlockwise
        if I > J:
            b += J
            # we start from the edge and count up along j
        else:
            b += J + (J - i)
            # when we turn the corner, we add to the count, increasing as i decreases
    elif i < 0 and j > 0:  # the second quadrant
        b += 2*a - 1       # the total count from the first quadrant
        if J > I:
            b += I
        else:
            b += I + (I - J)
    elif i < 0 and j < 0:  # the third quadrant
        b += (2*a - 1)*2   # the count from the first two quadrants
        if I > J:
            b += J
        else:
            b += J + (J - I)
    else:
        b += (2*a - 1)*3
        if J > I:
            b += I
        else:
            b += I + (I - J)
    return b
I'm pretty sure this works, but as you can see it is quite a bulky function. I'm trying to think of some way to simplify this "spiral counting" logic. Or possibly, if there's another counting method that is simpler to code, that would work too.
Here's a half-baked idea:
For every point, calculate f = x + (y-y_min)/(y_max-y_min)
Find the smallest delta d between any given f_n and f_{n+1}. Multiply all the f values by 1/d so that all f values are at least 1 apart.
Take the floor() of all the f values.
This is sort of like a projection onto the x-axis, but it tries to spread out the values so that it preserves uniqueness.
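A rough sketch of that idea, assuming all the points are known up front, y_max > y_min, and no two points end up with the same f value (the function name is just illustrative):
import numpy as np

def project_to_ints(points):
    # points: an (N, 2) array of integer (x, y) coordinates, all known in advance
    pts = np.asarray(points, dtype=float)
    x, y = pts[:, 0], pts[:, 1]
    y_min, y_max = y.min(), y.max()
    # step 1: project onto the x-axis, offset by the normalized y value
    f = x + (y - y_min) / (y_max - y_min)
    # step 2: smallest positive gap between sorted f values, then scale by 1/d
    gaps = np.diff(np.sort(f))
    d = gaps[gaps > 0].min()
    # step 3: floor to get the integer labels
    return np.floor(f / d).astype(int)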
UPDATE:
If you don't know all the data and will need to feed in new data in the future, maybe there's a way to hardcode an arbitrarily large or small constant for y_max and y_min in step 1, and an arbitrary delta d for step 2 according to the boundaries of the data values you expect. Or a way to calculate values for these according to the limits of the floating point arithmetic.

Lognormal pdf generates zero in python

I want to generate random numbers from a lognormal distribution on a background of an exponential distribution as follows:
I have 100 integers (say localities) from 1 to 25. These integers are generated from my own exponential-like distribution.
On these localities I want to distribute N items. But these items have their own lognormal distribution, with some mode (between 1 and 25) and standard deviation (from 1 to 7). My code works like this:
I have an array of localities called vec_of_variable, I know N called N, I know the mode called pref_value and I know the standard deviation called power_of_preference.
First I compute the shape and scale parameters from pref_value and power_of_preference. Then my procedure is as follows:
unique_localities = np.unique(np.array(vec_of_variable))  # all values of localities
res1 = [0 for i in range(len(unique_localities))]
res = [0 for i in range(len(vec_of_variable))]  # this will be the desired output
for i in range(len(res1)):
    res1[i] = stats.lognorm.pdf(unique_localities[i], shape, 0, scale)  # pdfs of the locality values
res1 = np.array([x/min(res1) for x in res1])  # here is the problem, min(res1) could be zero, see text
res1 = np.round(res1)
res1 = np.cumsum(res1)
item = 0
while item < N:
    r = random.uniform(0, max(res1))
    site_pdf_value_vec = [x for x in res1 if x >= r]
    site_pdf_value = min(site_pdf_value_vec)  # this is the value of the locality where I'll place one item
The code continues, but the crucial part is here. Simply put, the lognorm pdf values of the localities are the 'probabilities' that I'll place my item in that locality. This is why I need the pdf values.
PS: This approach is approved by my supervisor, so I do not want to change it.
The problem is that it sometimes happens that min(res1) = 0. Then I divide by zero and res1 becomes an array of infinities. The lognormal pdf for x between 0 and 25 is never zero, but it can be very close. I think the problem is that one of these pdf values is so close to zero that Python rounds it to zero.
My question is, how do I avoid getting zeros in res1 in my code? My idea was to replace the zeros with the smallest positive float in Python, but I don't know that value. Or is there another, more elegant solution?
Thanks for the help.
PS: Someone might suggest not dividing res1 by its minimum, since that step looks superfluous. But it is the check that the minimum of these values is not zero. In other words, every locality must get some 'interval' of 'probability' in which I'll place an item; if one pdf value is zero, its probability is not an interval but a single point.
Compute lognorm.logpdf rather than lognorm.pdf, and then work in log space. This should have better accuracy for the very small probabilities that are being rounded to zero.
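A minimal sketch of that suggestion, reusing the names from the question:
import numpy as np
from scipy import stats

# the log-pdf stays finite even where the pdf itself underflows to zero
log_res1 = stats.lognorm.logpdf(unique_localities, shape, 0, scale)

# the normalization res1 / min(res1) becomes a subtraction in log space
log_ratios = log_res1 - log_res1.min()

# exponentiate only at the end, once the ratios are needed as plain numbers
res1 = np.exp(log_ratios)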

Speeding up distance between all possible pairs in an array

I have an array of x,y,z coordinates of several (~10^10) points (only 5 shown here)
a= [[ 34.45 14.13 2.17]
[ 32.38 24.43 23.12]
[ 33.19 3.28 39.02]
[ 36.34 27.17 31.61]
[ 37.81 29.17 29.94]]
I want to make a new array with only those points which are at least some distance d away from all other points in the list. I wrote the following code using while loops:
import numpy as np
from scipy.spatial import distance

d = 0.1  # or some distance
i = 0
selected_points = []
while i < len(a):
    interdist = []
    j = i + 1
    while j < len(a):
        interdist.append(distance.euclidean(a[i], a[j]))
        j += 1
    if all(dis >= d for dis in interdist):
        selected_points.append(a[i])
    i += 1
This works, but it is taking really long to perform this calculation. I read somewhere that while loops are very slow.
I was wondering if anyone has any suggestions on how to speed up this calculation.
EDIT: While my objective of finding the particles which are at least some distance away from all the others stays the same, I just realized that there is a serious flaw in my code. Let's say I have 3 particles; my code does the following. For the first iteration of i, it calculates the distances 1->2 and 1->3; let's say 1->2 is less than the threshold distance d, so the code throws away particle 1. For the next iteration of i, it only does 2->3, and let's say it finds that this is greater than d, so it keeps particle 2. But this is wrong, since 2 should also be discarded along with particle 1. The solution by @svohara is the correct one!
For big data sets and low-dimensional points (such as your 3-dimensional data), sometimes there is a big benefit to using a spatial indexing method. One popular choice for low-dimensional data is the k-d tree.
The strategy is to index the data set. Then query the index using the same data set, to return the 2-nearest neighbors for each point. The first nearest neighbor is always the point itself (with dist=0), so we really want to know how far away the next closest point is (2nd nearest neighbor). For those points where the 2-NN is > threshold, you have the result.
from scipy.spatial import cKDTree as KDTree
import numpy as np
#a is the big data as numpy array N rows by 3 cols
a = np.random.randn(10**8, 3).astype('float32')
# This will create the index, prepare to wait...
# NOTE: took 7 minutes on my mac laptop with 10^8 rand 3-d numbers
# there are some parameters that could be tweaked for faster indexing,
# and there are implementations (not in scipy) that can construct
# the kd-tree using parallel computing strategies (GPUs, e.g.)
k = KDTree(a)
#ask for the 2-nearest neighbors by querying the index with the
# same points
(dists, idxs) = k.query(a, 2)
# (dists, idxs) = k.query(a, 2, n_jobs=4) # to use more CPUs on query...
#Note: 9 minutes for query on my laptop, 2 minutes with n_jobs=6
# So less than 10 minutes total for 10^8 points.
# If the second NN is > thresh distance, then there is no other point
# in the data set closer.
thresh_d = 0.1 #some threshold, equiv to 'd' in O.P.'s code
d_slice = dists[:, 1] #distances to second NN for each point
res = np.flatnonzero( d_slice >= thresh_d )
Here's a vectorized approach using distance.pdist -
# Store number of pts (number of rows in a)
m = a.shape[0]
# Get the first of pairwise indices formed with the pairs of rows from a
# Simpler version, but a bit slow : idx1,_ = np.triu_indices(m,1)
shifts_arr = np.zeros(m*(m-1)//2, dtype=int)
shifts_arr[np.arange(m-1,1,-1).cumsum()] = 1
idx1 = shifts_arr.cumsum()
# Get the IDs of pairs of rows that are more than "d" apart and thus select
# the rest of the rows using a boolean mask created with np.in1d for the
# entire range of number of rows in a. Index into a to get the selected points.
selected_pts = a[~np.in1d(np.arange(m),idx1[distance.pdist(a) < d])]
For a huge dataset like 10e10, we might have to perform the operations in chunks based on the system memory available.
Your algorithm is quadratic (~10^20 operations). Here is a linear approach if the distribution is nearly random.
Split your space into cubic boxes of side d/sqrt(3), so that each box's diagonal is d. Put each point in its box.
Then, for each box:
if there is just one point, you only have to calculate distances to the points in a small neighborhood of boxes;
otherwise there is nothing to do, since any two points in the same box are already within d of each other.
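A rough sketch of this box approach (the function name and the box bookkeeping are illustrative; a is assumed to be an (N, 3) numpy array as in the question):
import numpy as np
from itertools import product
from collections import defaultdict
from scipy.spatial import distance

def select_isolated(a, d):
    # returns the points of a that are at least d away from every other point
    side = d / np.sqrt(3.0)   # box side chosen so the box diagonal equals d
    boxes = defaultdict(list)
    for idx, p in enumerate(a):
        boxes[tuple(np.floor(p / side).astype(int))].append(idx)

    keep = []
    for key, members in boxes.items():
        if len(members) > 1:
            continue          # any two points sharing a box are already within d
        idx = members[0]
        # points within d can be up to two box indices away, so scan a 5x5x5 block
        neighbors = [j
                     for off in product(range(-2, 3), repeat=3)
                     for j in boxes.get(tuple(k + o for k, o in zip(key, off)), [])
                     if j != idx]
        if all(distance.euclidean(a[idx], a[j]) >= d for j in neighbors):
            keep.append(idx)
    return a[keep]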
Drop the append; it must be really slow. You can have a static vector of distances and use [] to put each number in the right position.
Use min instead of all. You only need to check whether the minimum distance is bigger than d.
Actually, you can break out of the inner loop the moment you find a distance smaller than your limit, and then you can drop both points. That way you do not even have to save any distances (unless you need them later).
Since d(a,b) = d(b,a) you can run the inner loop only over the following points and forget about the distances you have already calculated. If you need them, you can fetch them from the array, which is faster.
From your comment, I believe this would do, if you have no repeated points.
selected_points = []
for p1 in a:
    save_point = True
    for p2 in a:
        # (p1 != p2).any() avoids testing whole numpy rows for truth directly
        if (p1 != p2).any() and distance.euclidean(p1, p2) < d:
            save_point = False
            break
    if save_point:
        selected_points.append(p1)
# selected_points now holds the kept points
In the end I check both (a, b) and (b, a) because you should not modify a list while processing it, but you can be smarter using some additional variables.

can I do fast set difference with floats using numpy if elements are equal up to some tolerance

I have two lists of float numbers, and I want to calculate the set difference between them.
With numpy I originally wrote the following code:
aprows = allpoints.view([('',allpoints.dtype)]*allpoints.shape[1])
rprows = toberemovedpoints.view([('',toberemovedpoints.dtype)]*toberemovedpoints.shape[1])
diff = setdiff1d(aprows, rprows).view(allpoints.dtype).reshape(-1, 2)
This works well for things like integers. In the case of 2D points with float coordinates that are the result of some geometrical calculations, there's a problem of finite precision and rounding errors causing the set difference to miss some equalities. For now I have resorted to the much, much slower:
diff = []
for a in allpoints:
    remove = False
    for p in toberemovedpoints:
        if norm(p - a) < 0.1:
            remove = True
    if not remove:
        diff.append(a)
return array(diff)
But is there a way to write this with numpy and gain back the speed?
Note that I want the remaining points to still have their full precision, so first rounding the numbers and then doing a set difference is probably not the way forward (or is it? :) )
Edited to add a solution based on scipy.spatial.KDTree that seems to work:
def remove_points_fast(allpoints, toberemovedpoints):
    diff = []
    removed = 0
    # prepare a KDTree
    from scipy.spatial import KDTree
    tree = KDTree(toberemovedpoints, leafsize=allpoints.shape[0] + 1)
    for p in allpoints:
        distance, ndx = tree.query([p], k=1)
        if distance < 0.1:
            removed += 1
        else:
            diff.append(p)
    return array(diff), removed
If you want to do this with the matrix form, you have a lot of memory consumption with larger arrays. If that does not matter, then you get the difference matrix by:
diff_array = allpoints[:,None] - toberemovedpoints[None,:]
The resulting array has as many rows as there are points in allpoints and as many columns as there are points in toberemovedpoints, with the coordinates in a third axis. You can then take the absolute value and compare it against the tolerance, which gives you a boolean array. To find which rows have any hits (absolute difference < .1 in any coordinate of any removed point), use numpy.any over the last two axes:
hits = numpy.any(numpy.abs(diff_array) < .1, axis=(1, 2))
Now you have a vector with the same number of items as there were rows in the difference array. You can use that vector to index allpoints (negated, because we want the non-matching points):
return allpoints[~hits]
This is a numpyish way of doing this. But, as I said above, it takes a lot of memory.
If you have larger data, then you are better off doing it point by point. Something like this:
return allpoints[~numpy.array([numpy.any(numpy.abs(a - toberemovedpoints) < .1) for a in allpoints])]
This should perform well in most cases, and the memory use is much lower than with the matrix solution. (For stylistic reasons you may want to use numpy.all instead of numpy.any and turn the comparison around to get rid of the negation.)
(Beware, there may be printing mistakes in the code.)
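Spelled out, that all-based variant might look something like this (same loose per-coordinate test as above, just negated):
import numpy

# keep a point only when every element of |a - toberemovedpoints| is at least .1,
# which is exactly the negation of the numpy.any(... < .1) test above
keep = numpy.array([numpy.all(numpy.abs(a - toberemovedpoints) >= .1) for a in allpoints])
diff = allpoints[keep]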

Code works but too slow

How do I make this next piece of code run faster?
I calculate the distances between a number of points first (no problem), but after that I need to get the mean of the values of all the points in one list that are closer than some distance (in this case 20 m). If that distance is small, this piece of code is fast, but otherwise it is very slow, since I need the indices etc.
The next piece of code does exactly what I want, but it is extremely slow if I take 20 for the value instead of, for example, 6 (because for 20 there are about 100 points close enough, while for 6 there are only 3 or 5 or so).
D = numpy.sqrt((xf[:,None] - xg[None,:])**2 + (yf[:,None] - yg[None,:])**2 + (zf[:,None] - zg[None,:])**2)
dumdic = {}
l1 = []
for i in range(len(xf)):
    dumdic[i] = D[i,:][D[i,:] < 20]  # gets the values where the distance is small enough
    A = []
    for j in range(len(dumdic[i])):
        # for each point in that dummy dictionary, get the index where I need to take
        # the epsilon value, and then add that right epsilon value to A
        A.append(G.epsilon[list(D[i,:]).index(dumdic[i][j])])
    l1.append(numpy.mean(numpy.array(A)))
a1 = numpy.array(l1)
G.epsilon is the array in which we have a measurement value for each point. So from that array I need to take, for each point in the other array, the mean over all the points in this array that are close enough to that other point.
If you need more details, just ask.
After the reply from @gregwittier, this is the better version:
Can anyone one-liner it yet? (A two-liner, since D = ... takes one line.)
It would be more Pythonic, I guess, if I didn't have the l1 = ... and the recasting to a numpy array, but the worst thing now is to kill that for loop, by using an axis argument or so?
D = numpy.sqrt((xf[:,None] - xg[None,:])**2 + (yf[:,None] - yg[None,:])**2 + (zf[:,None] - zg[None,:])**2)
l1 = []
for i in range(len(xf)):
    l1.append(numpy.mean(G.epsilon[D[i,:] < 20]))
a1 = numpy.array(l1)
I think this is what you want.
D2 = (xf[:,None] - xg[None,:])**2 + (yf[:,None] - yg[None,:])**2 + (zf[:,None] - zg[None,:])**2
near = D2 < 20**2
a1 = np.array([G.epsilon[near_row].mean() for near_row in near])
You could squeeze down another line by combining line 2 and 3.
D2 = (xf[:,None] - xg[None,:])**2 + (yf[:,None] - yg[None,:])**2 + (zf[:,None] - zg[None,:])**2
a1 = np.array([G.epsilon[near_row].mean() for near_row in D2 < 20**2])
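If the remaining Python-level loop over near_row also matters, the row means can be computed with an axis argument instead (a sketch; it assumes every row has at least one point within 20, otherwise the count in the denominator is zero):
D2 = (xf[:,None] - xg[None,:])**2 + (yf[:,None] - yg[None,:])**2 + (zf[:,None] - zg[None,:])**2
near = D2 < 20**2
# per-row sum of the selected epsilon values divided by how many were selected
a1 = (near * G.epsilon).sum(axis=1) / near.sum(axis=1)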
Your description in words seems different from what your example code actually does. From the word description, I think you need something like
dist_sq = (xf-xg)**2 + (yf-yg)**2
near = (dist_sq < 20*20)
return dist_sq[near].mean()
I can't understand your example code, so I don't know how to match what it does. Perhaps you will still need to iterate over one of the dimensions (i.e. you might still need the outer for loop from your example).
If you are calculating all the distances between a set of points, it might be a problem of complexity. As the set of points increases, the number of possible combinations increases dramatically.
