I have a random sampling method as below:
import random
import numpy as np

def costum_random_sample(size):
    randomList = []
    counter = 0
    last_n = -1
    while size != counter:
        n = random.random()
        if abs(n - last_n) < 0.05:
            continue
        else:
            randomList.append(n)
            counter += 1
            last_n = n
    return np.array(randomList)
The result is an array like array([0.50146945, 0.17442673, 0.60011469, 0.13501798]). Now I want to change it so that the result comes out in ascending order. Sort() doesn't work here, because it reorders the array after it has been generated, and the relationship between consecutive numbers (each at least 0.05 away from the previous one) is lost. I want to generate the numbers in order in the first place, so that the sequence keeps that property. How can I do that?
If your arrays are shortish, you can simply generate the whole array, sort it, and reject and regenerate it for as long as the constraint is violated.
bad = True
while bad:
    arr = np.sort(np.random.rand(size))
    bad = np.any(np.ediff1d(arr) < 0.05)
If size is too big, the conflicts will be too plentiful, and this will take forever, so only use it if there is a reasonable chance a conformant array will be generated randomly. Note that if size > 20 there is no array that will fit the criteria, turning this into an infinite loop.
Another approach would be to generate and sort the array as above, find the non-conformant element pairs, then nudge the array elements by increasing the distance between the non-conformant pairs and evenly subtracting this difference from other places. This can't get stuck in an infinite loop, but has a bit more math, and bends the uniform distribution (though I couldn't tell you how much).
EDIT After thinking a bit, there's a much better way. Basically, you need a spaced array, where there's a fixed spacer and a little bit of extra randomness between each element:
random start space
[element1]
0.05 spacer
some more space
[element2]
0.05 spacer
some more space
[element3]
random end space
All the space needs to add up to 1. However, some of that space is fixed ((size - 1) * 0.05); so if we take out the fixed spacers, we have our "space budget" to distribute between our start, end and random space. So we generate random space and then rescale it so it sums up to our space budget. Then add in the fixed spacers, and a cumulative sum gives us the final array (plus an extra 1.0 at the end, which we chop off).
space_budget = 1 - (size - 1) * 0.05
space = np.random.rand(size + 1)
space *= space_budget / np.sum(space)
space[1:-1] += 0.05
arr = np.cumsum(space)[:-1]
For size = 21, you get exactly one solution every time, as space_budget is zero. For larger size, you start bursting out of the 0...1 range, as it's mathematically impossible to fit more than 20 spacers of 0.05 into that interval.
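As a quick sanity check of this construction (a usage sketch, assuming numpy is imported as np and picking a concrete size), the result stays inside [0, 1) and every consecutive gap is at least 0.05:

import numpy as np

size = 10
space_budget = 1 - (size - 1) * 0.05
space = np.random.rand(size + 1)
space *= space_budget / np.sum(space)
space[1:-1] += 0.05
arr = np.cumsum(space)[:-1]
assert np.all(np.ediff1d(arr) >= 0.05) and arr[0] >= 0 and arr[-1] < 1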
I've asked a similar question before, but this time it's different.
Since our array contains only two distinct values, we might as well call them 1 and -1, where the 1s are on the left side of the array and the -1s are on the right side:
[1,...,1,1,-1,-1,...,-1]
Both 1 and -1 are present at the same time, and the counts of 1s and -1s are not necessarily equal. Also, both counts are very large.
Then define the boundary between the 1s and the -1s as the index of the -1 closest to the 1s. For example, for the following array:
[1,1,1,-1,-1,-1,-1]
Its boundary is 3.
Now, each number in the array is covered by a device that has to be unlocked before the number inside can be seen.
I want to unlock as few devices covering a 1 as possible, because it takes much longer to reveal a '1' than a '-1'. Overall, I also want to reduce my total time cost as much as possible.
How can I search to get the boundary as quickly as possible?
The problem is very much like the "egg dropping" problem, except that here revealing a 1 has a large fixed cost (say 100) and revealing a -1 has a small cost (say 1).
Let E(n) be the (optimal) expected cost of finding the index of the right-most 1 in an array (or finding that the array is all -1), assuming each possible position of the boundary is equally likely. Define the index of the right-most 1 to be -1 if the array is all -1.
If you choose to look at the array element at index i (0-based), then it is -1 with probability (i+1)/(n+1) and 1 with probability (n-i)/(n+1). If it turns out to be -1 (cost 1 to reveal), the boundary must lie among the first i elements; if it is 1 (cost 100), it must lie among the last n-i-1 elements.
So if you look at array element i, your expected cost for finding the boundary is (1+E(i)) * (i+1)/(n+1) + (100+E(n-i-1)) * (n-i)/(n+1).
Thus E(n) = min((1+E(i)) * (i+1)/(n+1) + (100+E(n-i-1)) * (n-i)/(n+1), i=0..n-1), with E(0) = 0.
For each n, the i that minimizes this expression is the optimal array element to look at first for an array of that length.
I don't think you can solve these equations analytically, but you can solve them with dynamic programming in O(n^2) time.
The solution is going to look like a very skewed binary search for large n. For smaller n, it'll be skewed so much that it will be a traversal from the right.
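For concreteness, here is a minimal dynamic-programming sketch of that recurrence (the function name and the 1/100 costs are illustrative assumptions):

def expected_costs(n_max, cost_one=100, cost_neg=1):
    # E[n] = optimal expected cost for an array of length n; E[0] = 0
    E = [0.0] * (n_max + 1)
    first_probe = [0] * (n_max + 1)   # optimal first index to reveal, per length
    for n in range(1, n_max + 1):
        best_cost, best_i = float('inf'), 0
        for i in range(n):
            # element i is -1 with prob (i+1)/(n+1) and 1 with prob (n-i)/(n+1)
            cost = ((i + 1) * (cost_neg + E[i]) +
                    (n - i) * (cost_one + E[n - i - 1])) / (n + 1)
            if cost < best_cost:
                best_cost, best_i = cost, i
        E[n], first_probe[n] = best_cost, best_i
    return E, first_probe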
If I am right, a strategy to minimize the expected cost is to probe at a fraction of the interval that favors the -1 outcome, in inverse proportion to the cost. So instead of picking the middle index, take the right centile.
But this still corresponds to a logarithmic asymptotic complexity.
There is probably nothing that you can do regarding the worst case.
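A minimal sketch of that skewed bisection (the function name, the reveal callback and the 1:100 cost ratio are illustrative assumptions):

def find_boundary(reveal, n, cost_ratio=100):
    # reveal(i) returns the hidden value (1 or -1) at index i.
    # Returns the index of the left-most -1, or n if the array is all 1s.
    lo, hi = 0, n                      # the boundary lies somewhere in [lo, hi]
    while lo < hi:
        # probe near the right end, splitting the remaining interval
        # in inverse proportion to the cost of the two outcomes
        i = min(hi - 1, lo + (hi - lo) * cost_ratio // (cost_ratio + 1))
        if reveal(i) == -1:
            hi = i                     # the boundary is at or to the left of i
        else:
            lo = i + 1                 # the boundary is to the right of i
    return lo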
I'm trying to solve a problem with splitting a set into subsets.
The input data is a list and an integer.
The task is to divide the set into subsets of N elements each, whose element sums are almost equal. As this is an NP-hard problem, I have tried two approaches:
a) iterate over all possibilities and distribute the work to many machines using mpi4py (for a list of more than 100 elements and 20-element subsets this runs far too long)
b) use mpi4py to send the list to workers with different random seeds, but in this case I potentially evaluate the same partition many times. For instance, for 100 numbers and 5 subsets of 20 elements each, the result after 60 s could easily be beaten by a human simply looking at the table.
Finally, I'm looking for a heuristic algorithm that can run on a distributed system and create N-element subsets of a bigger set whose sums are almost equal.
a = list(range(1, 13))
k = 3
One of the possible solutions:
[1,2,11,12] [3,4,9,10] [5,6,7,8]
because the sums are 26, 26, 26
It is not always possible to make the sums or the numbers of elements exactly equal. The difference between the maximum and minimum number of elements across the subsets can be 0 (if len(a)/k is an integer) or 1.
edit 1:
I am investigating two options: 1. The parent generates all combinations and then feeds them to the parallel algorithm (but this is too slow for me). 2. The parent sends the list, and each node generates its own subsets and calculates the subset sums within a restricted time, then sends its best result back to the parent. The parent receives these results and chooses the one that minimizes the difference between the subset sums. I think the second option has the potential to be faster.
Best regards,
Szczepan
I think you're trying to do something more complicated than necessary - do you actually need an exact solution (global optimum)? Regarding the heuristic solution, I had to do something along these lines in the past so here's my take on it:
Reformulate the problem as follows: you have a vector with a given mean (the 'global mean') and you want to break it into chunks such that the mean of each individual chunk is as close as possible to the 'global mean'.
Just divide it into chunks randomly and then iteratively swap elements between the chunks until you get acceptable results. You can experiment with different ways of doing it; here I'm just reshuffling the elements of the chunks with the minimum and maximum 'chunk-mean'.
In general, the bigger the chunks are, the easier it becomes, because the first random split already gives you a not-so-bad solution (think sample means).
How big is your input list? I tested this with a 100000-element input (uniformly distributed integers). With 50 chunks of 2000 elements you get the result instantly; with 2000 chunks of 50 elements you need to wait <1 min.
import numpy as np

my_numbers = np.random.randint(10000, size=100000)
chunks = 50
iter_limit = 10000
desired_mean = my_numbers.mean()
acceptable_range = 0.1

split = np.array_split(my_numbers, chunks)

for i in range(iter_limit):
    split_means = np.array([array.mean() for array in split])  # this can be optimized, some of the means are known
    current_min = split_means.min()
    current_max = split_means.max()
    mean_diff = split_means.ptp()
    if i % 100 == 0 or mean_diff <= acceptable_range:
        print("Iter: {}, Desired: {}, Min {}, Max {}, Range {}".format(i, desired_mean, current_min, current_max, mean_diff))
    if mean_diff <= acceptable_range:
        print('Acceptable solution found')
        break
    min_index = split_means.argmin()
    max_index = split_means.argmax()
    # pop the larger index first so the smaller one does not shift
    if max_index < min_index:
        merged = np.hstack((split.pop(min_index), split.pop(max_index)))
    else:
        merged = np.hstack((split.pop(max_index), split.pop(min_index)))
    reshuffle_range = mean_diff + 1
    while reshuffle_range > mean_diff:
        # this while just ensures that you're not getting a worse split, either the same or better
        np.random.shuffle(merged)
        modified_arrays = np.array_split(merged, 2)
        reshuffle_range = np.array([array.mean() for array in modified_arrays]).ptp()
    split += modified_arrays
I have an array of x,y,z coordinates of several (~10^10) points (only 5 shown here)
a= [[ 34.45 14.13 2.17]
[ 32.38 24.43 23.12]
[ 33.19 3.28 39.02]
[ 36.34 27.17 31.61]
[ 37.81 29.17 29.94]]
I want to make a new array with only those points which are at least some distance d away from all other points in the list. I wrote code using while loops:
import numpy as np
from scipy.spatial import distance

d = 0.1  # or some distance
i = 0
selected_points = []
while i < len(a):
    interdist = []
    j = i + 1
    while j < len(a):
        interdist.append(distance.euclidean(a[i], a[j]))
        j += 1
    if all(dis >= d for dis in interdist):
        selected_points.append(a[i])
    i += 1
This works, but it is taking really long to perform this calculation. I read somewhere that while loops are very slow.
I was wondering if anyone has any suggestions on how to speed up this calculation.
EDIT: While my objective of finding the particles that are at least some distance away from all the others stays the same, I just realized that there is a serious flaw in my code. Say I have 3 particles. For the first iteration of i, my code calculates the distances 1->2 and 1->3; say 1->2 is less than the threshold distance d, so the code throws away particle 1. For the next iteration of i, it only checks 2->3, and say it finds that this is greater than d, so it keeps particle 2. But this is wrong! Particle 2 should also be discarded along with particle 1. The solution by @svohara is the correct one!
For big data sets and low-dimensional points (such as your 3-dimensional data), sometimes there is a big benefit to using a spatial indexing method. One popular choice for low-dimensional data is the k-d tree.
The strategy is to index the data set. Then query the index using the same data set, to return the 2-nearest neighbors for each point. The first nearest neighbor is always the point itself (with dist=0), so we really want to know how far away the next closest point is (2nd nearest neighbor). For those points where the 2-NN is > threshold, you have the result.
from scipy.spatial import cKDTree as KDTree
import numpy as np
#a is the big data as numpy array N rows by 3 cols
a = np.random.randn(10**8, 3).astype('float32')
# This will create the index, prepare to wait...
# NOTE: took 7 minutes on my mac laptop with 10^8 rand 3-d numbers
# there are some parameters that could be tweaked for faster indexing,
# and there are implementations (not in scipy) that can construct
# the kd-tree using parallel computing strategies (GPUs, e.g.)
k = KDTree(a)
#ask for the 2-nearest neighbors by querying the index with the
# same points
(dists, idxs) = k.query(a, 2)
# (dists, idxs) = k.query(a, 2, n_jobs=4) # to use more CPUs on query...
#Note: 9 minutes for query on my laptop, 2 minutes with n_jobs=6
# So less than 10 minutes total for 10^8 points.
# If the second NN is > thresh distance, then there is no other point
# in the data set closer.
thresh_d = 0.1 #some threshold, equiv to 'd' in O.P.'s code
d_slice = dists[:, 1] #distances to second NN for each point
res = np.flatnonzero( d_slice >= thresh_d )
Here's a vectorized approach using distance.pdist -
import numpy as np
from scipy.spatial import distance

# Store number of pts (number of rows in a)
m = a.shape[0]

# Get the first of the pairwise indices formed with the pairs of rows from a
# Simpler version, but a bit slow : idx1,_ = np.triu_indices(m,1)
shifts_arr = np.zeros(m*(m-1)//2, dtype=int)
shifts_arr[np.arange(m-1,1,-1).cumsum()] = 1
idx1 = shifts_arr.cumsum()

# Get the first indices of the pairs of rows that are less than "d" apart and
# drop those rows, using a boolean mask created with np.in1d over the entire
# range of row indices in a. Index into a to get the selected points.
selected_pts = a[~np.in1d(np.arange(m), idx1[distance.pdist(a) < d])]
For a huge dataset like 10e10, we might have to perform the operations in chunks based on the system memory available.
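One way that chunking could look (a sketch of my own, using scipy's cdist rather than the index trick above; in line with the correction in the question's edit, it keeps a point only if every other point is at least d away, and the chunk size is an assumption to be tuned to the available memory):

from scipy.spatial import distance
import numpy as np

def select_points_chunked(a, d, chunk=10000):
    keep = np.ones(len(a), dtype=bool)
    for start in range(0, len(a), chunk):
        block = a[start:start + chunk]
        dists = distance.cdist(block, a)                    # (len(block), N) pairwise distances
        rows = np.arange(len(block))
        dists[rows, start + rows] = np.inf                  # ignore each point's distance to itself
        keep[start:start + chunk] = dists.min(axis=1) >= d  # nearest other point at least d away
    return a[keep]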
Your algorithm is quadratic (10^20 operations). Here is a linear approach if the distribution is nearly random.
Split your space into cubic boxes with side d/sqrt(3), so that any two points in the same box are within d of each other, and put each point in its box.
Then, for each box: if it contains just one point, you only have to calculate distances to points in a small neighbourhood of boxes; otherwise (two or more points) there is nothing to do, since all of its points are within d of each other and can be discarded.
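A rough sketch of that box idea (an illustration of my own, not the answer's code; it assumes the points are in an (N, 3) numpy array a, and it checks neighbouring boxes up to two cells away, since d is roughly 1.7 box sides):

import numpy as np
from collections import defaultdict
from itertools import product

def select_far_points(a, d):
    side = d / np.sqrt(3)                  # two points in the same box are always within d
    boxes = defaultdict(list)
    for idx, p in enumerate(a):
        boxes[tuple(np.floor(p / side).astype(int))].append(idx)
    keep = []
    for key, members in boxes.items():
        if len(members) > 1:               # several points in one box: all of them are too close
            continue
        idx = members[0]
        # a point within d can be at most 2 boxes away in each direction (d < 2 * side)
        neighbours = (boxes.get((key[0] + dx, key[1] + dy, key[2] + dz), [])
                      for dx, dy, dz in product((-2, -1, 0, 1, 2), repeat=3))
        if all(j == idx or np.linalg.norm(a[j] - a[idx]) >= d
               for box in neighbours for j in box):
            keep.append(idx)
    return a[keep]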
Drop the append, it must be really slow. You can have a static vector of distances and use [] to put the number in the right position.
Use min instead of all. You only need to check whether the minimum distance is bigger than d.
Actually, you can break out of the inner loop the moment you find a distance smaller than your limit, and then you can drop both points. That way you don't even have to save any distances (unless you need them later).
Since d(a,b) = d(b,a), you can run the inner loop only over the following points and forget about the distances you have already calculated. If you need them, you can read them back from the array, which is faster.
From your comment, I believe this would do, if you have no repeated points.
import numpy as np
from scipy.spatial import distance

def select_points(a, d):
    selected_points = []
    for p1 in a:
        save_point = True
        for p2 in a:
            # np.array_equal works for plain lists and numpy rows alike
            if not np.array_equal(p1, p2) and distance.euclidean(p1, p2) < d:
                save_point = False
                break
        if save_point:
            selected_points.append(p1)
    return selected_points
In the end I check both (a, b) and (b, a), because you should not modify a list while processing it, but you could be smarter using some additional variables.
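For illustration, the "additional variables" idea might look like this (a sketch of my own; the name and structure are assumptions): remember which points have already been found to be too close, so each pair is tested only once.

from scipy.spatial import distance

def select_points_once(a, d):
    removed = set()
    for i in range(len(a)):
        for j in range(i + 1, len(a)):
            if distance.euclidean(a[i], a[j]) < d:
                # both members of a too-close pair are discarded
                removed.add(i)
                removed.add(j)
    return [a[i] for i in range(len(a)) if i not in removed]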
I have two lists of float numbers, and I want to calculate the set difference between them.
With numpy I originally wrote the following code:
aprows = allpoints.view([('',allpoints.dtype)]*allpoints.shape[1])
rprows = toberemovedpoints.view([('',toberemovedpoints.dtype)]*toberemovedpoints.shape[1])
diff = setdiff1d(aprows, rprows).view(allpoints.dtype).reshape(-1, 2)
This works well for things like integers. In case of 2d points with float coordinates that are the result of some geometrical calculations, there's a problem of finite precision and rounding errors causing the set difference to miss some equalities. For now I resorted to the much, much slower:
diff = []
for a in allpoints:
    remove = False
    for p in toberemovedpoints:
        if norm(p - a) < 0.1:
            remove = True
    if not remove:
        diff.append(a)
return array(diff)
But is there a way to write this with numpy and gain back the speed?
Note that I want the remaining points to still have their full precision, so first rounding the numbers and then do a set difference probably is not the way forward (or is it? :) )
Edited to add a solution based on scipy's KDTree that seems to work:
def remove_points_fast(allpoints, toberemovedpoints):
    from scipy.spatial import KDTree
    from numpy import array
    diff = []
    removed = 0
    # prepare a KDTree
    tree = KDTree(toberemovedpoints, leafsize=allpoints.shape[0]+1)
    for p in allpoints:
        distance, ndx = tree.query([p], k=1)
        if distance < 0.1:
            removed += 1
        else:
            diff.append(p)
    return array(diff), removed
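As a side note, a sketch of the same idea with a single batched query instead of the Python loop (my own variation, not part of the original edit; same 0.1 tolerance):

from scipy.spatial import cKDTree
import numpy as np

def remove_points_batched(allpoints, toberemovedpoints, tol=0.1):
    tree = cKDTree(toberemovedpoints)
    dists, _ = tree.query(allpoints, k=1)   # nearest to-be-removed point for every point at once
    keep = dists >= tol
    return allpoints[keep], np.count_nonzero(~keep)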
If you want to do this with the matrix form, you have a lot of memory consumption with larger arrays. If that does not matter, then you get the difference matrix by:
diff_array = allpoints[:,None] - toberemovedpoints[None,:]
The resulting array has as many rows as there are points in allpoints, as many columns as there are points in toberemovedpoints, and a trailing axis for the coordinates. Then you can manipulate this any way you want, e.g. take the absolute value and compare it to the tolerance, which gives you a boolean array. To find which rows have any hits (absolute difference < .1), use numpy.any over the last two axes:
hits = numpy.any(numpy.abs(diff_array) < .1, axis=(1, 2))
Now you have a vector which has the same number of items as there were rows in the difference array. You can use that vector to index all points (negation because we wanted the non-matching points):
return allpoints[~hits]
This is a numpyish way of doing this. But, as I said above, it takes a lot of memory.
If you have larger data, then you are better off doing it point by point. Something like this:
return allpoints[~numpy.array([numpy.any(numpy.abs(a - toberemovedpoints) < .1) for a in allpoints])]
This should perform well in most cases, and the memory use is much lower than with the matrix solution. (For stylistic reasons you may want to use numpy.all instead of numpy.any and turn the comparison around to get rid of the negation.)
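For illustration, that stylistic variant might look like this (a sketch, logically equivalent to the line above by De Morgan's law):

return allpoints[numpy.array([numpy.all(numpy.abs(a - toberemovedpoints) >= .1)
                              for a in allpoints])]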
(Beware, there may be printing mistakes in the code.)
How do I make this next piece of code run faster?
I calculate the distances between a number of points first (no problem), but after that I need to get the mean of the values of all the points in one list that are closer than some distance (in this case 20 m). If that distance were small, this piece of code would be fast, but otherwise it is very slow, since I need the indices etc.
The next piece of code does exactly what I want, but it is extremely slow if I take 20 as the value instead of, for example, 6 (because for 20 there are about 100 points close enough, while for 6 there are only 3 or 5 or so):
D = numpy.sqrt((xf[:,None] - xg[None,:])**2 + (yf[:,None] - yg[None,:])**2 + (zf[:,None] - zg[None,:])**2)

dumdic = {}
l1 = []
for i in range(len(xf)):
    dumdic[i] = D[i,:][D[i,:] < 20]  # gets the values where the distance is small enough
    A = []
    for j in range(len(dumdic[i])):
        # for each point in that dummy dictionary, gets the index where I need to take the
        # epsilon value, and then adds the right epsilon value to A
        A.append(G.epsilon[list(D[i,:]).index(dumdic[i][j])])
    l1.append(numpy.mean(numpy.array(A)))
a1 = numpy.array(l1)
G.epsilon is the array in which each point has a measurement value. So, for each point in the other array, I need to take the mean over all points in this array that are close enough to that other point.
If you need more details, just ask
After the reply of @gregwittier, this is the better version:
Can anyone turn it into a one-liner yet? (A two-liner, since D = ... takes one line.)
It would be more pythonic, I guess, if I didn't have the l1 = ... and the recasting to a numpy array, but the main thing now is to kill that for loop, by using an axis argument or so?
D = numpy.sqrt((xf[:,None] - xg[None,:])**2 + (yf[:,None] - yg[None,:])**2 + (zf[:,None] - zg[None,:])**2)

l1 = []
for i in range(len(xf)):
    l1.append(numpy.mean(G.epsilon[D[i,:] < 20]))
a1 = numpy.array(l1)
I think this is what you want.
D2 = (xf[:,None] - xg[None,:])**2 + (yf[:,None] - yg[None,:])**2 + (zf[:,None] - zg[None,:])**2
near = D2 < 20**2
a1 = np.array([G.epsilon[near_row].mean() for near_row in near])
You could squeeze out another line by combining lines 2 and 3.
D2 = (xf[:,None] - xg[None,:])**2 + (yf[:,None] - yg[None,:])**2 + (zf[:,None] - zg[None,:])**2
a1 = np.array([G.epsilon[near_row].mean() for near_row in D2 < 20**2])
Your description in words seems different from what your example code actually does. From the word description, I think you need something like
dist_sq = (xf-xg)**2 + (yf-yg)**2
near = (dist_sq < 20*20)
return dist_sq[near].mean()
I can't understand your example code, so I don't know how to match what it does. Perhaps you will still need to iterate over one of the dimensions (i.e. you might still need the outer for loop from your example).
If you are calculating all the distances between a set of points, it might be a problem of complexity. As the set of points grows, the number of possible combinations increases dramatically.
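One way around the quadratic blow-up would be a KD-tree radius query instead of the full distance matrix. A sketch, assuming xf, yf, zf, xg, yg, zg and G.epsilon are as in the question and the radius is 20:

import numpy as np
from scipy.spatial import cKDTree

g_pts = np.column_stack((xg, yg, zg))
f_pts = np.column_stack((xf, yf, zf))
tree = cKDTree(g_pts)
neighbours = tree.query_ball_point(f_pts, r=20)   # for each f point, indices of g points within 20
a1 = np.array([G.epsilon[idx].mean() if idx else np.nan for idx in neighbours])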