numpy irreversibly change items in a list at random - python

I have a list
a = np.ones(100)
I want to turn 50 random items in that list to 0. Once those elements are 0, I want to turn 25 random elements of the remaining items in the list to 0. Of the remaining 25 ones, I want to turn 13 of the remaining elements to zero at random etc.
I will then run this through a simple loop e.g. (pseudocode) "if item == 1, print red particle, else print blue particle" etc.
This is basically to simulate exponential decay, but I'm struggling to think of an algorithm to do this.
This is not a duplicate of "Numpy: Replace random elements in an array" because once the elements have changed I do not want them to be considered for change again.

You basically need a diminishing number of ones in the array, with the constraint that the ones replaced by zeros are drawn only from the currently remaining ones. One way to solve it is with np.random.choice, using its optional argument replace=False to get unique indices per iteration for assigning zeros, and np.flatnonzero to get the indices of the leftover ones at each iteration.
Hence, the implementation would look something like this -
# Counts of ones to be set as zeros per iteration
counts = np.array([50, 25, 13, 6, 3, 2, 1])
a = np.ones(100, dtype=int)
for c in counts:
    a[np.random.choice(np.flatnonzero(a), c, replace=False)] = 0
Sample run -
In [49]: counts = np.array([50,25,13,6,3,2,1])
    ...: a = np.ones(100,dtype=int)
    ...: for c in counts:
    ...:     a[np.random.choice(np.flatnonzero(a), c, replace=False)] = 0
    ...:     print a.sum() # verify by printing the sum of ones at each iteration
50
25
12
6
3
1
0
To set up the counts array for an input array a of generic length, we can do something like this -
N = int(np.log(len(a))/np.log(2)) # number of iterations
counts = (len(a)*((0.5)**(np.arange(N)+1))).astype(int)
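As a quick sanity check (assuming len(a) == 100, as in the question), this produces six counts rather than the seven handwritten ones above, and the astype(int) truncation rounds 12.5 down to 12, so a few ones survive at the end:
import numpy as np

a = np.ones(100, dtype=int)
N = int(np.log(len(a)) / np.log(2))                    # 6 iterations for len(a) == 100
counts = (len(a) * (0.5 ** (np.arange(N) + 1))).astype(int)
print(counts)        # [50 25 12  6  3  1]
print(counts.sum())  # 97, so 3 ones would remain after the loop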

Here's a simple version. It is an alternative to the other answers and diverges from the stated instructions: it sets each element to zero at random, as is appropriate for exponential decay, rather than removing a deterministic half of the elements at each round.
This prints the entire array each round but alternatives could include the first element, print(a[0]), or the number remaining, print(a.sum()).
import random
import numpy as np

a = np.ones(100)
while any(a):
    print(a)
    for i, _ in enumerate(a):
        if random.random() > 0.5:
            a[i] = 0

The way you want to implement it is to always set exactly 50% of the remaining elements to zero. This means in each epoch you will always decay exactly 50, 25, 12.5, ... elements.
However I don't believe that's how exponential decay processes work.
To my understanding you should, with a probability p, set each individual element to zero. This means for a probability p = 0.5 in each epoch you will on average decay 50, 25, 12.5, ... elements.
This method also simplifies the solution a lot:
import numpy as np

x = np.ones(100)
p = .5  # probability of decay
for i in range(20):
    mask = np.random.choice([True, False], size=len(x), p=[p, 1-p])
    x[mask] = 0
If speed is a factor (but memory consumption isn't), you can also vectorize this operation:
x = np.random.choice([-1, 0], size=(100, 20), p=[p, 1-p])
x = np.cumsum(x, axis=1)
x = x == 0
after which
x[:, n]
is your population after n + 1 epochs and
np.sum(x, axis=0)
is the population size over time.
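A minimal sketch of how the vectorized version might be read out (assuming p = 0.5, 100 particles and 20 epochs as above); column n of the boolean array marks the particles still alive after n + 1 epochs:
import numpy as np

p = 0.5
steps = np.random.choice([-1, 0], size=(100, 20), p=[p, 1 - p])
alive = np.cumsum(steps, axis=1) == 0   # True until a particle's first decay event
population = alive.sum(axis=0)          # population size after each epoch
print(population)                       # on average roughly 50, 25, 12, ...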

Here's one way you could do it. The essential idea is to create an array of indices [0, 1, 2, ..., 99], shuffle that array, and then use slices of decreasing size from that array as the indices of a to be zeroed.
In [75]: a = np.ones(100)
In [76]: sizes = (len(a)*np.power(2.0, [-1, -2, -3, -4, -5, -6, -7]) + 0.5).astype(int)
In [77]: sizes
Out[77]: array([50, 25, 13, 6, 3, 2, 1])
In [78]: indices = np.arange(len(a))
In [79]: np.random.shuffle(indices)
In [80]: start = 0
In [81]: for k in range(len(sizes)):
    ...:     end = start + sizes[k]
    ...:     a[indices[start:end]] = 0
    ...:     print(np.count_nonzero(a))
    ...:     start = end
    ...:
50
25
12
6
3
1
0
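As a rough sketch (not part of the original answer), the same shuffle-and-slice idea can be wrapped into a small helper; decay_in_place is a made-up name for illustration:
import numpy as np

def decay_in_place(a, sizes):
    """Zero out sizes[k] randomly chosen surviving elements of a at step k."""
    indices = np.arange(len(a))
    np.random.shuffle(indices)
    start = 0
    for size in sizes:
        end = start + size
        a[indices[start:end]] = 0
        print(np.count_nonzero(a))   # survivors after this step
        start = end

a = np.ones(100)
sizes = (len(a) * np.power(2.0, [-1, -2, -3, -4, -5, -6, -7]) + 0.5).astype(int)
decay_in_place(a, sizes)   # prints 50, 25, 12, 6, 3, 1, 0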

Related

how to avoid division by zero in 2d numpy array when taking average?

Let's say I have three arrays
A = np.array([[2,2,2],[1,0,0],[1,2,1]])
B = np.array([[2,0,2],[0,1,0],[1,2,1]])
C = np.array([[2,0,1],[0,1,0],[1,1,2]])
A,B,C
(array([[2, 2, 2],
        [1, 0, 0],
        [1, 2, 1]]),
 array([[2, 0, 2],
        [0, 1, 0],
        [1, 2, 1]]),
 array([[2, 0, 1],
        [0, 1, 0],
        [1, 1, 2]]))
When I take the average of C / (A+B), I get nan/inf values with a RuntimeWarning.
The resultant array looks like the following:
np.average(C/(A+B), axis = 1)
array([0.25 , nan, 0.58333333])
I would like to change any inf/nan value to 0.
What I tried so far was
# doesn't work (maybe I'm doing this wrong...)
mask = A+B > 0
np.average(C[mask]/(A[mask]+B[mask]), axis = 1)

# does not work, and not an ideal solution
avg = np.average(C/(A+B), axis = 1)
avg[avg == np.nan] = 0
any help would be appreciated!
Both approaches you tried are valid ways of dealing with it, but each needs a slight change.
Avoiding the division up front, by only calculating the result where it's valid (i.e. where the denominator is non-zero):
Indexing with the boolean mask you defined makes the resulting arrays 1D. So using it means you have to allocate the result array up front and assign into it using that same mask.
mask = A+B > 0
result = np.zeros_like(A, dtype=np.float32)
result[mask] = C[mask]/(A[mask]+B[mask])
It does require the averaging over the second dimension to be done separately, and also masking the result to zero for rows where the division could not be done because of the zeros.
result = result.mean(axis=1)
result[(~mask).any(axis=1)] = 0
To me the main benefit would be avoiding the warning from NumPy, and perhaps in the case of a large number of zeros (in A+B) you could gain a little performance by avoiding that calculation altogether. But overall it seems a lot of effort to me.
Masking invalid values afterwards:
The main takeaway here is that you should never ever compare against np.nan directly since it will always be False. You can check this yourself by looking at the result from np.nan == np.nan. The way to handle this is use the dedicated np.isnan function. Or alternatively negate the np.isfinite function if you also want to catch +/- np.inf values at the same time.
avg = np.average(C/(A+B), axis = 1)
avg[np.isnan(avg)] = 0
# or to include inf
avg[~np.isfinite(avg)] = 0
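A small variation on the same idea (a sketch, not from the original answer), using np.errstate to silence the divide/invalid warnings while the division runs, then zeroing the non-finite entries:
import numpy as np

A = np.array([[2, 2, 2], [1, 0, 0], [1, 2, 1]])
B = np.array([[2, 0, 2], [0, 1, 0], [1, 2, 1]])
C = np.array([[2, 0, 1], [0, 1, 0], [1, 1, 2]])

with np.errstate(divide='ignore', invalid='ignore'):
    avg = np.average(C / (A + B), axis=1)
avg[~np.isfinite(avg)] = 0
print(avg)  # [0.25  0.  0.58333333]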
import numpy as np

a = np.array([1, np.nan])
print(a)  # [ 1. nan]
a = np.nan_to_num(a)
print(a)  # [1. 0.]
https://numpy.org/doc/stable/reference/generated/numpy.nan_to_num.html
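In newer NumPy versions, np.nan_to_num can also replace infinities directly through its posinf and neginf keyword arguments, so nan and +/-inf can all be zeroed in one call (a small sketch, assuming a recent NumPy):
import numpy as np

avg = np.array([0.25, np.nan, np.inf, -np.inf])
print(np.nan_to_num(avg, nan=0.0, posinf=0.0, neginf=0.0))  # [0.25 0. 0. 0.]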
for inf and -inf
from numpy import inf
avg[avg == inf] = 0
avg[avg == -inf] = 0
Or, to avoid producing inf in the first place, only divide where the divisor is non-zero:
np.divide(a, b, where=b.astype(bool))
This is tougher than I thought, as np.mean's where argument doesn't work if it produces empty slices, and np.average's weights have to be 1-D.
# these don't work
# >>> np.mean(div, axis=1, where=mask.all(1, keepdims=True))
# RuntimeWarning: Mean of empty slice.
# RuntimeWarning: invalid value encountered in true_divide
# >>> np.average(div, axis=1, weights=mask.all(1, keepdims=True))
# TypeError: 1D weights expected when shapes of a and weights differ.
import numpy as np
A = np.array([[2,2,2],[1,0,0],[1,2,1]])
B = np.array([[2,0,2],[0,1,0],[1,2,1]])
C = np.array([[2,0,1],[0,1,0],[1,1,2]])
div = np.zeros(C.shape)
AB = A+B # avoid repeated summing
mask = AB > 0 # AB != 0 to include all valid divisors
np.divide(C, AB, where=mask, out=div) # out=None won't initialize unused elements
np.mean(div * mask.all(1, keepdims=True), axis = 1)
Output
array([0.25 , 0. , 0.58333333])

Generate binary random matrix with upper and lower limit on number of ones in each row?

I want to generate a binary matrix of numbers with M rows and N columns. Each row must sum to <=p and >=q. In other words, each row must have at most p and at least q ones.
This is the code I have been using.
import numpy as np

def randbin(M, N, P):
    return np.random.choice([0, 1], size=(M, N), p=[P, 1 - P])

MyMatrix = randbin(200, 7, 0.5)
In one sample output, row 0 came out as all zeros. I noticed that some rows have all zeros and some rows have all ones. How can I modify this to get what I want? Is there an efficient way of achieving this?
You can generate a random number in [q, p] for each row and then set that many random ones in each row. If by efficient you mean vectorized, then yes, there is an efficient way. The trick is to simulate sampling without replacement along one axis (within each row) while handling all rows at once across the other. This can be done with np.argsort. You can select a variable number of indices per row by turning a random count vector into a mask.
def randbin(m, n, p, q):
    # output to assign ones into
    result = np.zeros((m, n), dtype=bool)
    # simulate sampling without replacement within each row
    col_ind = np.argsort(np.random.random(size=(m, n)), axis=1)
    # figure out how many samples to take in each row
    count = np.random.randint(p, q + 1, size=(m, 1))
    # turn it into a mask over col_ind using a clever broadcast
    mask = np.arange(n) < count
    # apply the mask not only to col_ind, but also to the corresponding row_ind
    col_ind = col_ind[mask]
    row_ind = np.broadcast_to(np.arange(m).reshape(-1, 1), (m, n))[mask]
    # set the corresponding elements to 1
    result[row_ind, col_ind] = 1
    return result
The selection is made so that each run of equal values in row_ind is between p and q elements long. The corresponding elements of col_ind are unique and uniformly distributed within each row.
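A quick usage sketch (assuming the randbin above and, say, at least 2 and at most 5 ones per row):
M = randbin(200, 7, 2, 5)
sums = M.sum(axis=1)
print(sums.min(), sums.max())  # every row sum falls between 2 and 5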
An alternative is @Prune's solution below. It requires np.argsort to shuffle each row independently, since np.random.shuffle would only shuffle whole rows and keep each row's contents together:
def randbin(m, n, p, q):
    # make the unique rows
    options = np.arange(n) < np.arange(p, q + 1).reshape(-1, 1)
    # select a random unique row to go into each output row
    selection = np.random.choice(options.shape[0], size=m, replace=True)
    # perform the selection
    result = options[selection]
    # create indices to shuffle each row independently
    col_ind = np.argsort(np.random.random(result.shape), axis=1)
    row_ind = np.arange(m).reshape(-1, 1)
    # perform the shuffle
    result = result[row_ind, col_ind]
    return result
Okay, then: a uniform distribution is easy enough. Let's take the case with between 2 and 5 ones required per row. Use a list of the allowable combinations:
[ [1, 1, 0, 0, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 0, 0],
[1, 1, 1, 1, 1, 0] ]
For each of your rows, choose a random element from these four, and then shuffle it. There is your row.
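A minimal sketch of that recipe (not from the original answer), using np.random.permutation to shuffle each chosen template; the variable names are made up for illustration:
import numpy as np

templates = np.array([
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0],
])

m = 10  # number of rows wanted
rows = [np.random.permutation(templates[np.random.randint(len(templates))])
        for _ in range(m)]
result = np.array(rows)
print(result.sum(axis=1))  # each row contains between 2 and 5 ones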

How to find the nearest neighbour index from one series to another

I have a target array A, which represents isobaric pressure levels in NCEP reanalysis data.
I also have the pressure at which a cloud is observed as a long time series, B.
What I am looking for is a k-nearest-neighbour lookup that returns the indices of those nearest neighbours, something like knnsearch in Matlab, which in Python could look like: indices, distance = knnsearch(A, B, n)
where indices holds the nearest n indices in A for every value in B, and distance is how far each value in B is from those nearest values in A. A and B can be of different lengths; this is the bottleneck I have found with most solutions so far, in that I would have to loop over each value in B to get my indices and distances.
import numpy as np
A = np.array([1000, 925, 850, 700, 600, 500, 400, 300, 250, 200, 150, 100, 70, 50, 30, 20, 10]) # this is a fixed 17-by-1 array
B = np.array([923, 584.2, 605.3, 153.2]) # this can be any n-by-1 array
n = 2
What I would like returned from indices, distance = knnsearch(A, B, n) is this:
indices = [[1, 2],[4, 5] etc...]
where 923 in A is matched to first A[1]=925 and then A[2]=850
and 584.2 in A is matched to first A[4]=600 and then A[5]=500
distance = [[72, 77],[15.8, 84.2] etc...]
where 72 represents the distance between the queried value in B and the nearest value in A, e.g. distance[0, 0] == np.abs(B[0] - A[1])
The only solution I have been able to come up with is:
import numpy as np

def knnsearch(A, B, n):
    indices = np.zeros((len(B), n))
    distances = np.zeros((len(B), n))
    for i in range(len(B)):
        a = A
        for N in range(n):
            dif = np.abs(a - B[i])
            ind = np.argmin(dif)
            indices[i, N] = ind + N
            distances[i, N] = dif[ind + N]
            # remove this neighbour from future consideration
            np.delete(a, ind)
    return indices, distances
array_A = np.array([1000, 925, 850, 700, 600, 500, 400, 300, 250, 200, 150, 100, 70, 50, 30, 20, 10])
array_B = np.array([923, 584.2, 605.3, 153.2])
neighbours = 2
indices, distances = knnsearch(array_A, array_B, neighbours)
print(indices)
print(distances)
returns:
[[ 1. 2.]
[ 4. 5.]
[ 4. 3.]
[10. 11.]]
[[ 2. 73. ]
[ 15.8 84.2]
[ 5.3 94.7]
[ 3.2 53.2]]
There must be a way to remove the for loops, as I need the performance should my A and B arrays contain many thousands of elements with many nearest neighbours...
Please help! Thanks :)
The second loop can easily be vectorized. The most straightforward way to do it is to use np.argsort and select the indices corresponding to the n smallest dif values. However, for large arrays, as only n values should be sorted, it is better to use np.argpartition.
Therefore, the code would look something like this:
def vector_knnsearch(A, B, n):
    indices = np.empty((len(B), n))
    distances = np.empty((len(B), n))
    for i, b in enumerate(B):
        dif = np.abs(A - b)
        min_ind = np.argpartition(dif, n)[:n]     # indices of the n smallest values,
                                                  # but not necessarily sorted
        ind = min_ind[np.argsort(dif[min_ind])]   # sort the output of argpartition just in case
        indices[i, :] = ind
        distances[i, :] = dif[ind]
    return indices, distances
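Called on the arrays from the question (a quick sketch assuming the function above), this gives, for example:
import numpy as np

array_A = np.array([1000, 925, 850, 700, 600, 500, 400, 300,
                    250, 200, 150, 100, 70, 50, 30, 20, 10])
array_B = np.array([923, 584.2, 605.3, 153.2])
indices, distances = vector_knnsearch(array_A, array_B, 2)
print(indices)    # [[ 1.  2.] [ 4.  5.] [ 4.  3.] [10.  9.]]
print(distances)  # [[ 2.  73. ] [15.8 84.2] [ 5.3 94.7] [ 3.2 46.8]]
Note that the last row differs from the loop version above: the second-nearest level to 153.2 is actually 200 (index 9, distance 46.8), which the original loop misses because it assumes the two nearest neighbours sit at consecutive indices.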
As said in the comments, the first loop can also be removed using a meshgrid. However, the extra memory and computation needed to construct the meshgrid make this approach slower for the dimensions I tried (and this will probably get worse for large arrays and end in a MemoryError). In addition, the readability of the code decreases. Overall, this probably makes the approach less pythonic.
def mesh_knnsearch(A, B, n):
    m = len(B)
    rng = np.arange(m).reshape((m, 1))
    Amesh, Bmesh = np.meshgrid(A, B)
    dif = np.abs(Amesh - Bmesh)
    min_ind = np.argpartition(dif, n, axis=1)[:, :n]
    ind = min_ind[rng, np.argsort(dif[rng, min_ind], axis=1)]
    return ind, dif[rng, ind]
Note that it is important to define rng as a 2D array in order to retrieve a[rng[0], ind[0]], a[rng[1], ind[1]], etc. and maintain the dimensions of the array, as opposed to a[:, ind], which retrieves a[:, ind[0]], a[:, ind[1]], etc.

Numpy: get the lowest N elements of an array X, considering only elements whose index is not an element in another array Y

To get the lowest 10 values of an array X I do something like:
lowest10 = np.argsort(X)[:10]
what is the most efficient way, avoiding loops, to filter the results so that I get the lowest 10 values whose index is not an element of another array Y?
So for example if the array Y is:
[2,20,51]
X[2], X[20] and X[51] shouldn't be taken into consideration to compute the lowest 10.
After some benchmarking here is my humble recommendation:
Swapping out appears to be more or less always faster than masking (even if 99% of X is forbidden). So use something along the lines of
swap = X[Y]
X[Y] = np.inf
Sorting is expensive, therefore use argpartition and only sort what's necessary (X now has the forbidden entries set to inf):
lowest10 = np.argpartition(X, 10)[:10]
lowest10 = lowest10[np.argsort(X[lowest10])]
Afterwards, restore the original values with X[Y] = swap.
Here are some benchmarks:
import numpy as np
from timeit import timeit

def swap_out():
    global sol
    swap = X[Y]
    X[Y] = np.inf
    sol = np.argpartition(X, K)[:K]
    sol = sol[np.argsort(X[sol])]
    X[Y] = swap

def app1():
    sidx = X.argsort()
    return sidx[~np.in1d(sidx, Y)][:K]

def app2():
    sidx = np.argpartition(X, range(K+Y.size))
    return sidx[~np.in1d(sidx, Y)][:K]

def app3():
    sidx = np.argpartition(X, K+Y.size)
    return sidx[~np.in1d(sidx, Y)][:K]

K = 10     # number of small elements wanted
N = 10000  # size of X
M = 10     # size of Y
S = 10     # number of repeats in benchmark

X = np.random.random((N,))
Y = np.random.choice(N, (M,))

so = timeit(swap_out, number=S)
print(sol)
print(X[sol])
d1 = timeit(app1, number=S)
print(sol)
print(X[sol])
d2 = timeit(app2, number=S)
print(sol)
print(X[sol])
d3 = timeit(app3, number=S)
print(sol)
print(X[sol])
print('pp', f'{so:8.5f}', ' d1(um)', f'{d1:8.5f}', ' d2', f'{d2:8.5f}', ' d3', f'{d3:8.5f}')
# pp 0.00053  d1(um) 0.00731  d2 0.00313  d3 0.00149
Here's one approach -
sidx = X.argsort()
idx_out = sidx[~np.in1d(sidx, Y)][:10]
Sample run -
# Setup inputs
In [141]: X = np.random.choice(range(60), 60)
In [142]: Y = np.array([2,20,51])
# For testing, let's set the Y positions as 0s and
# we want to see them skipped in o/p
In [143]: X[Y] = 0
# Use proposed approach
In [144]: sidx = X.argsort()
In [145]: X[sidx[~np.in1d(sidx, Y)][:10]]
Out[145]: array([ 0, 2, 4, 5, 5, 9, 9, 10, 12, 14])
# Print the first 13 numbers and skip three 0s and
# that should match up with the output from proposed approach
In [146]: np.sort(X)[:13]
Out[146]: array([ 0, 0, 0, 0, 2, 4, 5, 5, 9, 9, 10, 12, 14])
Alternatively, for performance, we might want to use np.argpartition, like so -
sidx = np.argpartition(X,range(10+Y.size))
idx_out = X[sidx[~np.in1d(sidx, Y)][:10]]
This would be beneficial if the length of X is a much larger number than 10.
If you don't care about the order of elements in that list of 10 indices, for further boost, we can simply pass on the scalar length instead of range array to np.argpartition : np.argpartition(X,10+Y.size).
We can optimize np.in1d with searchsorted to have one more approach (listing next).
Listing below all the discussed approaches in this post -
def app1(X, Y, n=10):
    sidx = X.argsort()
    return sidx[~np.in1d(sidx, Y)][:n]

def app2(X, Y, n=10):
    sidx = np.argpartition(X, range(n+Y.size))
    return sidx[~np.in1d(sidx, Y)][:n]

def app3(X, Y, n=10):
    sidx = np.argpartition(X, n+Y.size)
    return sidx[~np.in1d(sidx, Y)][:n]

def app4(X, Y, n=10):
    n_ext = n+Y.size
    sidx = np.argpartition(X, np.arange(n_ext))[:n_ext]
    ssidx = sidx.argsort()
    mask = np.ones(ssidx.size, dtype=bool)
    search_idx = np.searchsorted(sidx, Y, sorter=ssidx)
    search_idx[search_idx==sidx.size] = 0
    idx = ssidx[search_idx]
    mask[idx[sidx[idx] == Y]] = 0
    return sidx[mask][:n]
You can work on a subset of the original array using numpy.delete():
lowest10 = np.argsort(np.delete(X, Y))[:10]
Since delete builds a new array from the elements that are kept, the extra cost is a single pass over X.
Warning: this solution works on a subset of the original X array (X without the elements indexed by Y), so the end result is the lowest 10 of that subset, and the returned indices refer to the reduced array rather than to the original X.
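If you need indices into the original X rather than into the reduced array, one possible workaround (a sketch, not from the original answer) is to build the array of kept indices explicitly with np.setdiff1d and map the result back through it:
import numpy as np

X = np.random.random(60)
Y = np.array([2, 20, 51])

keep = np.setdiff1d(np.arange(len(X)), Y)   # indices of X not excluded by Y
lowest10 = keep[np.argsort(X[keep])[:10]]   # indices into the original X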

Decrease array size by averaging adjacent values with numpy

I have a large array of thousands of values in numpy. I want to decrease its size by averaging adjacent values.
For example:
a = [2,3,4,8,9,10]
#average down to 2 values here
a = [3,9]
#it averaged 2,3,4 and 8,9,10 together
So, basically, I have n number of elements in array, and I want to tell it to average down to X number of values, and it averages like above.
Is there some way to do that with numpy (already using it for other things, so I'd like to stick with it).
Using reshape and mean, you can average every m adjacent values of a 1D array of size N*m, with N being any positive integer. For example:
import numpy as np
m = 3
a = np.array([2, 3, 4, 8, 9, 10])
b = a.reshape(-1, m).mean(axis=1)
#array([3., 9.])
1) a.reshape(-1, m) creates a 2D view of the array without copying data:
array([[ 2,  3,  4],
       [ 8,  9, 10]])
2) taking the mean along the second axis (axis=1) then calculates the mean value of each row, resulting in:
array([3., 9.])
Try this:
n_averaged_elements = 3
averaged_array = []
a = np.array([2, 3, 4, 8, 9, 10])
for i in range(0, len(a), n_averaged_elements):
    slice_from_index = i
    slice_to_index = slice_from_index + n_averaged_elements
    averaged_array.append(np.mean(a[slice_from_index:slice_to_index]))

>>> averaged_array
[3.0, 9.0]
Looks like a simple non-overlapping moving window average to me, how about:
In [3]: import numpy as np
   ...: a = np.array([2,3,4,8,9,10])
   ...: window_sz = 3
   ...: # make sure the array can be reshaped properly, hence the [:len(a)//window_sz*window_sz] part
   ...: a[:len(a)//window_sz*window_sz].reshape(-1, window_sz).mean(1)
Out[3]:
array([ 3.,  9.])
In this example, I presume that a is the 1D numpy array that needs to be averaged. In the method below, we first find the factors of the length of a, and then choose an appropriate factor as the step size to average the array with.
Here is the code.
import numpy as np
from functools import reduce

'''Function to find the factors of a given number n.'''
def factors(n):
    return list(set(reduce(list.__add__,
                           ([i, n//i] for i in range(1, int(n**0.5) + 1) if n % i == 0))))

a = [2, 3, 4, 8, 9, 10]  # Given array.

'''fac: list of factors of len(a).
In this example, len(a) = 6, so fac = [1, 2, 3, 6].'''
fac = factors(len(a))

'''step: choose an appropriate step size from the list fac.
In this example, we choose one of the middle numbers in fac (3).'''
step = fac[int(len(fac)/3) + 1]

'''avg: initialize an empty array.'''
avg = np.array([])
for i in range(0, len(a), step):
    avg = np.append(avg, np.mean(a[i:i+step]))  # append averaged values to avg

print(avg)  # Prints the final result
# [3. 9.]
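If the array length is not an exact multiple of the desired output size, one possible approach (a sketch, not taken from the answers above) is np.array_split, which tolerates unequal chunk sizes:
import numpy as np

a = np.array([2, 3, 4, 8, 9, 10, 5])  # length 7, not divisible by 2
x = 2                                 # desired number of output values
b = np.array([chunk.mean() for chunk in np.array_split(a, x)])
print(b)  # [4.25 8.  ]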
