Extracting maximum value (with nearby values) in ndarray - python

I am trying to extract the maximum value, together with nearby values, from each row of a NumPy ndarray.
For instance, there is an ndarray N of shape (30000, 1000), and for each row I want to extract the values in a window around the maximum, from (index of max value)-100 to (index of max value)+100.
So I wrote code like this:
for item in N:
    item = item[item.argmax()-100 : item.argmax()+100]
but after doing this, N.shape is still (30000, 1000).
If I want N.shape to be (30000, 200), what code should I run?

One problem with your code is that you do not always get 200 values around the maximum. Imagine the maximum in one row is the second value; then you will only get 101 values. So casting the result back into a NumPy array will not work anyway.
My suggestion: create a new list and append each slice to it, that is,
# Import
import numpy as np
# Random data
N = np.random.random((30000, 1000))
# List for new results
Nnew = []
# Loop
for item in N:
    item = item[max([0, item.argmax()-100]) : item.argmax()+100]
    Nnew.append(item)
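If you want to see why the result cannot simply be cast back to a (30000, 200) array, a small sanity check on the slice lengths (just a sketch, reusing Nnew from above) could look like:
# rows whose maximum sits within 100 places of either edge yield
# slices shorter than 200, so the list is ragged
lengths = {len(row) for row in Nnew}
print(lengths)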

The following is based on the answer from @alexblae. For the list approach you need to consider edge cases, since a slice like arr[-3:4] is empty in most cases. You can also get a NumPy array of fixed width by moving your selection window so it fits within the row when the maximum is close to an edge:
import numpy as np

a = 30000  # number of rows
b = 1000   # number of items in a row
c = 100    # number of items to pick

N = np.random.random((a, b))

# create a list of values close to the max
# lists vary in size if close to an edge
Nnew = []
for row in N:
    idx_max = row.argmax()
    # avoid indexes out of range
    idx0 = max(0, idx_max - c)
    idx1 = min(len(row) - 1, idx_max + c)
    Nnew.append(row[idx0:idx1])

# create an array of items close to the max
# moves the "window" if close to an edge to pick enough items
Nnewer = np.zeros((a, 2 * c))
for i, row in enumerate(N):
    idx_max = row.argmax()
    # avoid indexes out of range
    if idx_max < c:
        idx_max = c
    elif idx_max + c >= b:
        idx_max = b - 1 - c
    Nnewer[i, :] = row[idx_max - c : idx_max + c]
Of course, there are many ways to deal with edge cases; this is just one suggestion.
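If the Python loop over 30000 rows turns out to be too slow, the same window-shifting idea can be vectorized with fancy indexing. This is only a rough sketch building on the variables a, b, c and N defined above, not a drop-in replacement for the loop version:
# vectorized sketch of the "move the window at the edges" approach
idx_max = N.argmax(axis=1)                  # (a,) index of each row's maximum
start = np.clip(idx_max - c, 0, b - 2 * c)  # shift windows that would fall off an edge
cols = start[:, None] + np.arange(2 * c)    # (a, 2c) column indices per row
Nnewest = N[np.arange(a)[:, None], cols]    # gather -> shape (a, 2c)
# note: unlike the loop above, a window at the right edge may include
# the very last element of the row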

Related

Delete certain elements of a numpy array

I have two numpy arrays a and b. I have a function that constructs an array c whose elements are all the possible sums of different elements of a.
import numpy as np

def Sumarray(a):
    n = len(a)
    sumarray = np.array([0])  # add a default zero element
    for k in range(2, n+1):
        full = np.mgrid[k*(slice(n),)]
        nd_triu_idx = full[:, (np.diff(full, axis=0) > 0).all(axis=0)]
        sumarray = np.append(sumarray, a[nd_triu_idx].sum(axis=0))
    return sumarray

a = np.array([1, 2, 6, 8])
c = Sumarray(a)
print(c)
I then perform a subset-sum search between an element of c and b: isSubsetSum returns the indices of the elements of b that, when summed, give that element of c. Let's say that I get
c[0] = b[2] + b[3]
Then I want to remove:
the elements b[2], b[3] (easy bit), and
the elements of a that when summed gave c[0]
As you can see from the function Sumarray, the order of the sums of different elements of a is preserved, so I need to work out some mapping.
The function isSubsetSum is given by
def _isSubsetSum(numbers, n, x, indices):
    if x == 0:
        return True
    if n == 0 and x != 0:
        return False
    # If the last element is greater than x, then ignore it
    if numbers[n - 1] > x:
        return _isSubsetSum(numbers, n - 1, x, indices)
    # else, check if x can be obtained by any of the following
    found = _isSubsetSum(numbers, n - 1, x, indices)
    if found: return True
    indices.insert(0, n - 1)
    found = _isSubsetSum(numbers, n - 1, x - numbers[n - 1], indices)
    if not found: indices.pop(0)
    return found

def isSubsetSum(numbers, x):
    indices = []
    found = _isSubsetSum(numbers, len(numbers), x, indices)
    return indices if found else None
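For reference, a small usage sketch with made-up numbers, just to show what the helper returns:
numbers = [3, 34, 4, 12, 5, 2]
print(isSubsetSum(numbers, 9))   # [2, 4], because numbers[2] + numbers[4] == 9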
Since you are iterating over all possible numbers of terms, you might as well directly generate all possible subsets.
These can be conveniently encoded as the numbers 0, 1, 2, ... by means of their binary representations: 0 means no terms at all, 1 means only the first term, 2 means only the second, 3 means the first and the second, and so on.
Using this scheme it becomes very easy to recover the terms from the sum index, because all we need to do is obtain the binary representation:
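For instance, as a toy illustration of the encoding (not part of the solution below), subset index 5 is 101 in binary and therefore encodes the first and third elements:
idx = 5
print([k for k in range(3) if (idx >> k) & 1])   # [0, 2]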
UPDATE: we can suppress 1-term-sums with a small amount of extra code:
import numpy as np

def find_all_subsums(a, drop_singletons=False):
    n = len(a)
    assert n <= 32  # this gives 4G subsets, and we have to cut somewhere
    # compute the smallest integer type with enough bits
    dt = f"<u{1<<((n-1)>>3).bit_length()}"
    # the numbers 0 to 2^n encode all possible subsets of an n-element
    # set by means of their binary representation: each bit corresponds
    # to one element, and the number k represents the subset consisting
    # of all elements whose bit is set in k
    rng = np.arange(1<<n, dtype=dt)
    if drop_singletons:
        # one-element subsets correspond to powers of two
        rng = np.delete(rng, 1<<np.arange(n))
    # np.unpackbits transforms bytes to their binary representation;
    # given a bit vector b we can compute the corresponding subsum
    # as b dot a, and to do it in bulk we can multiply the matrix of
    # binary rows with a
    return np.unpackbits(rng[..., None].view('u1'),
                         axis=1, count=n, bitorder='little') @ a

def show_terms(a, idx, drop_singletons=False):
    n = len(a)
    if drop_singletons:
        # we must undo the dropping of powers of two to get an index
        # that is easy to translate; one can check that the following
        # formula does the trick
        idx += (idx + idx.bit_length()).bit_length()
    # now we can simply use the binary representation
    return a[np.unpackbits(np.asarray(idx, dtype='<u8')[None].view('u1'),
                           count=n, bitorder='little').view('?')]

example = np.logspace(1, 7, 7, base=3)
ss = find_all_subsums(example, True)
# check every single sum
for i, s in enumerate(ss):
    assert show_terms(example, i, True).sum() == s
# print one example
idx = 77
print(ss[idx], "=", " + ".join(show_terms(example.astype('U'), idx, True)))
Sample run:
2457.0 = 27.0 + 243.0 + 2187.0
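Tying this back to the original question: once you know which sum index matched, show_terms gives you exactly the elements of the input array that produced that sum, so removing them is just a set difference. A small sketch reusing example and idx from above (it assumes the array values are distinct):
terms = show_terms(example, idx, True)      # array([27., 243., 2187.])
remaining = np.setdiff1d(example, terms)    # array([3., 9., 81., 729.])
print(remaining)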

Randomly grow values in a NumPy Array

I have a program that takes some large NumPy arrays and, based on some outside data, grows them by adding one to randomly selected cells until the array's sum is equal to the outside data. A simplified and smaller version looks like:
import numpy as np
my_array = np.random.random_integers(0, 100, [100, 100])
## Just creating a sample version of the array, then getting its sum:
np.sum(my_array)
499097
So, supposing I want to grow the array until its sum is 1,000,000, and that I want to do so by repeatedly selecting a random cell and adding 1 to it until we hit that sum, I'm doing something like:
import random

diff = 1000000 - np.sum(my_array)
counter = 0
while counter < diff:
    row = random.randrange(0, 99)
    col = random.randrange(0, 99)
    coordinate = [row, col]
    my_array[coordinate] += 1
    counter += 1
Here row and col pick a random cell in the array, and that cell is grown by 1. This repeats until the number of times 1 has been added to a random cell equals the difference between the original array's sum and the target sum (1,000,000).
However, when I check the result after running this, the sum is always off. In this case, after running it with the same numbers as above:
np.sum(my_array)
99667203
I can't figure out what is accounting for this massive difference. And is there a more pythonic way to go about this?
my_array[coordinate] does not do what you expect. It is selecting multiple rows and adding 1 to all of those entries. You could simply use my_array[row, col] instead.
You could simply write something like:
for _ in range(1000000 - np.sum(my_array)):
    my_array[random.randrange(0, 99), random.randrange(0, 99)] += 1
(or xrange instead of range if using Python 2.x)
Replace my_array[coordinate] with my_array[row][col]. Your method chose two random integers and added 1 to every entry in the rows corresponding to both integers.
Basically you had a minor misunderstanding of how numpy indexes arrays.
Edit: To make this clearer.
The code posted chose two numbers, say 30 and 45, and added 1 to all 100 entries of row 30 and all 100 entries of row 45.
From this you would expect the total sum to be 100,679,697 = 200*(1,000,000 - 499,097) + 499,097
However, when the random integers are identical (say, 45 and 45), only 1 is added to every entry of row 45, not 2, so in that case the sum only jumps by 100.
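You can see this buffering behaviour in a tiny standalone example (just an illustration of the point above):
import numpy as np
x = np.zeros((3, 3))
x[[1, 1]] += 1        # row 1 is incremented once, not twice
print(x.sum())        # 3.0, not 6.0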
The problem with your original approach is that you are indexing your array with a list, which is interpreted as a sequence of indices into the row dimension, rather than as separate indices into the row/column dimensions (see here).
Try passing a tuple instead of a list:
coord = row, col
my_array[coord] += 1
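As a quick sanity check of the difference between the two kinds of indexing (a small standalone sketch):
import numpy as np
x = np.zeros((3, 3))
x[[1, 2]] += 1      # list index: selects rows 1 and 2, adds 1 to six cells
x[1, 2] += 1        # tuple index: adds 1 to the single cell (1, 2)
print(x.sum())      # 7.0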
A much faster approach would be to find the difference between the sum over the input array and the target value, then generate an array containing the same number of random indices into the array and increment them all in one go, thus avoiding looping in Python:
import numpy as np

def grow_to_target(A, target=1000000, inplace=False):
    if not inplace:
        A = A.copy()
    # how many times do we need to increment A?
    n = target - A.sum()
    # pick n random indices into the flattened array
    idx = np.random.random_integers(0, A.size - 1, n)
    # how many times did we sample each unique index?
    uidx, counts = np.unique(idx, return_counts=True)
    # increment the array counts times at each unique index
    A.flat[uidx] += counts
    return A
For example:
a = np.zeros((100, 100), dtype=np.int)
b = grow_to_target(a)
print(b.sum())
# 1000000
%timeit grow_to_target(a)
# 10 loops, best of 3: 91.5 ms per loop
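If you prefer, an alternative to the unique/counts bookkeeping is np.add.at, which performs unbuffered in-place addition and therefore counts repeated indices correctly. This is only a sketch of that variant (using np.random.randint), not the answer's original code:
def grow_to_target_addat(A, target=1000000, inplace=False):
    if not inplace:
        A = A.copy()
    n = target - A.sum()
    # n random positions in the flattened array, duplicates allowed
    idx = np.random.randint(0, A.size, n)
    flat = A.reshape(-1)   # a view, assuming A is contiguous
    # unlike fancy-indexed +=, np.add.at adds once per occurrence of each index
    np.add.at(flat, idx, 1)
    return A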

Improving the execution time of matrix calculations in Python

I work with a large amount of data, and the execution time of this piece of code is very, very important. The results of each iteration are interdependent, so it is hard to parallelize. It would be awesome if there were a faster way to implement some parts of this code, like:
finding the max element in the matrix and its indices
changing the values in a row/column with the max from another row/column
removing a specific row and column
Filling the weights matrix is pretty fast.
The code does the following:
it contains a list of lists of words word_list, with count elements in it. At the beginning each word is a separate list.
it contains a two-dimensional list (count x count) of float values, weights (a lower triangular matrix; the values for which i <= j are zeros)
in each iteration it does the following:
it finds the two words with the most similar value (the max element in the matrix and its indices)
it merges their row and column, saving the larger value from the two in each cell
it merges the corresponding word lists in word_list. It saves both lists in the one with the smaller index (max_j) and it removes the one with the larger index (max_i).
it stops if the largest value is less than a given THRESHOLD
I might think of a different algorithm to do this task, but I have no ideas for now, and it would be great if there were at least a small performance improvement.
I tried using NumPy but it performed worse.
weights = fill_matrix(count, N, word_list)

while 1:
    # find the max element in the matrix and its indices
    max_element = 0
    for i in range(count):
        max_e = max(weights[i])
        if max_e > max_element:
            max_element = max_e
            max_i = i
            max_j = weights[i].index(max_e)
    if max_element < THRESHOLD:
        break
    # reset the value of the max element
    weights[max_i][max_j] = 0
    # here it is important that max_j is always less than max_i
    # (since it's a lower triangular matrix)
    for j in range(count):
        weights[max_j][j] = max(weights[max_i][j], weights[max_j][j])
    for i in range(count):
        weights[i][max_j] = max(weights[i][max_j], weights[i][max_i])
    # compare the symmetrical elements, set the ones above the diagonal to 0
    for i in range(count):
        for j in range(count):
            if i <= j:
                if weights[i][j] > weights[j][i]:
                    weights[j][i] = weights[i][j]
                weights[i][j] = 0
    # remove the max_i-th column
    for i in range(len(weights)):
        weights[i].pop(max_i)
    # remove the max_i-th row
    weights.pop(max_i)
    new_list = word_list[max_j]
    new_list += word_list[max_i]
    word_list[max_j] = new_list
    # remove the element that was recently merged into a cluster
    word_list.pop(max_i)
    count -= 1
This might help:
def max_ij(A):
    t1 = [max(list(enumerate(row)), key=lambda r: r[1]) for row in A]
    t2 = max(list(enumerate(t1)), key=lambda r: r[1][1])
    i, (j, max_) = t2
    return max_, i, j
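A quick usage sketch on a tiny list-of-lists, just to show what it returns:
weights = [[0.0, 0.0, 0.0],
           [0.7, 0.0, 0.0],
           [0.2, 0.9, 0.0]]
print(max_ij(weights))   # (0.9, 2, 1): the value, its row index, its column index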
It depends on how much work you want to put into it but if you're really concerned about speed you should look into Cython. The quick start tutorial gives a few examples ranging from a 35% speedup to an amazing 150x speedup (with some added effort on your part).

Selecting rows from array under many conditions

I am trying to extract rows from a large numpy array. The columns of the array are obs number, group id (j), time id (t), and some data x_jt.
Here is an example:
import numpy as np
N = 100
T = 100
X = np.vstack((np.array(range(1,N*T+1)),np.repeat(np.array(range(1,N+1)),T), np.tile(np.array(range(1,T+1)),N), np.random.randint(100,size=N*T))).T
If I want to extract all rows from X where group id = 2, I would do
X[np.where(X[:,1] == 2)]
And if I wanted all rows where j = 2 or 3, I could extend that code. However, in my case, I have many group ids (j's) to extract. Specifically, I want to extract all rows where j comes from
samples = np.random.randint(N, size=N) + 1
For example, suppose size = 5 instead of N, and samples = (2,4,5,4,7). What I am after is code that goes through X and selects all rows where j = 2, then j = 4, then j = 5, j = 4, and finally j = 7, and creates a new array with the results. Basically this:
result = []
for j in samples:
    result.extend(X[np.where(X[:,1] == j)])
However, this code is slow when N is large. Do you have any suggestions to speed it up? Thanks!
Without replacement
This could be done with vectorized functions:
import numpy

def contains(X, samples):
    return numpy.vectorize(lambda x: x in samples)(X)

result = X[contains(X[:, 1], set(samples)), :]
With replacement
If you want to do this with replacement, just check off one count per sample until there are no samples left (assuming the order does not matter). This way you at least reduce the number of times you need to iterate over the matrix.
import collections
import itertools

result = []
sample_counts = collections.Counter(samples)
while sum(sample_counts.values()):
    # pick up one of each of the remaining samples and chain their rows
    # together in result
    s = set(key for key, value in sample_counts.items() if value)
    result = itertools.chain(result, X[contains(X[:, 1], s), :])
    sample_counts -= collections.Counter(dict.fromkeys(s, 1))
# create a matrix of the final result
result = numpy.array(list(result))
In that case, the only other way I can think of to speed up what you are already doing is to preallocate the result matrix instead of growing a Python list.
It doesn't do exactly what you are describing, but this type of problem is a good candidate for np.in1d. Something like this should work:
result = X[np.in1d(X[:, 1], samples)]
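One caveat, noted here as a sketch only: np.in1d treats samples as a set, so rows for a j that appears more than once in samples come back only once, and in the order they appear in X rather than in the order of samples. With the small example samples from the question, you can see the difference in row counts:
samples = np.array([2, 4, 5, 4, 7])
looped = np.vstack([X[X[:, 1] == j] for j in samples])
vector = X[np.in1d(X[:, 1], samples)]
print(looped.shape, vector.shape)   # (500, 4) versus (400, 4) when T == 100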

Speed up loop to fill an array with closest values from another array

I have a block of code that I need to optimize as much as possible since I have to run it several thousand times.
For a random float, it finds the closest float in one sub-list of a given array and stores the corresponding float (i.e. the one with the same index) from the other sub-list of that array. It repeats the process until the sum of the stored floats reaches a certain limit.
Here's the MWE to make it clearer:
import numpy as np
# Define array with two sub-lists.
a = [np.random.uniform(0., 100., 10000), np.random.random(10000)]
# Initialize empty final list.
b = []
# Run until the condition is met.
while (sum(b) < 10000):
    # Draw random [0,1) value.
    u = np.random.random()
    # Find closest value in sub-list a[1].
    idx = np.argmin(np.abs(u - a[1]))
    # Store value located in sub-list a[0].
    b.append(a[0][idx])
The code is reasonably simple, but I haven't found a way to speed it up. I tried to adapt the great (and very fast) answer given to a similar question I asked some time ago, to no avail.
OK, here's a slightly left-field suggestion. As I understand it, you are just trying to sample uniformly from the elements in a[0] until you have a list whose sum exceeds some limit.
Although it will be more costly memory-wise, I think you'll probably find it's much faster to generate a large random sample from a[0] first, then take the cumsum and find where it first exceeds your limit.
For example:
import numpy as np
# array of reference float values, equivalent to a[0]
refs = np.random.uniform(0, 100, 10000)
def fast_samp_1(refs, lim=10000, blocksize=10000):
    # sample uniformly from refs
    samp = np.random.choice(refs, size=blocksize, replace=True)
    samp_sum = np.cumsum(samp)
    # find where the cumsum first exceeds your limit
    last = np.searchsorted(samp_sum, lim, side='right')
    return samp[:last + 1]

    # # if it's ok to be just under lim rather than just over then this might
    # # be quicker
    # return samp[samp_sum <= lim]
Of course, if the sum of the sample of blocksize elements is < lim then this will fail to give you a sample whose sum is >= lim. You could check whether this is the case, and append to your sample in a loop if necessary.
def fast_samp_2(refs, lim=10000, blocksize=10000):
    samp = np.random.choice(refs, size=blocksize, replace=True)
    samp_sum = np.cumsum(samp)
    # is the sum of our current block of samples >= lim?
    while samp_sum[-1] < lim:
        # if not, we'll sample another block and try again until it is
        newsamp = np.random.choice(refs, size=blocksize, replace=True)
        samp = np.hstack((samp, newsamp))
        samp_sum = np.hstack((samp_sum, np.cumsum(newsamp) + samp_sum[-1]))
    last = np.searchsorted(samp_sum, lim, side='right')
    return samp[:last + 1]
Note that concatenating arrays is pretty slow, so it would probably be better to make blocksize large enough to be reasonably sure that the sum of a single block will be >= to your limit, without being excessively large.
Update
I've adapted your original function a little bit so that its syntax more closely resembles mine.
def orig_samp(refs, lim=10000):
    # Initialize empty final list.
    b = []
    a1 = np.random.random(10000)
    # Run until the condition is met.
    while (sum(b) < lim):
        # Draw random [0,1) value.
        u = np.random.random()
        # Find closest value in sub-list a[1].
        idx = np.argmin(np.abs(u - a1))
        # Store value located in sub-list a[0].
        b.append(refs[idx])
    return b
Here's some benchmarking data.
%timeit orig_samp(refs, lim=10000)
# 100 loops, best of 3: 11 ms per loop
%timeit fast_samp_2(refs, lim=10000, blocksize=1000)
# 10000 loops, best of 3: 62.9 µs per loop
That's a good two orders of magnitude faster. You can do a bit better by reducing the blocksize a fraction - you basically want it to be comfortably larger than the length of the arrays you're getting out. In this case, you know that on average the output will be about 200 elements long, since the mean of all real numbers between 0 and 100 is 50, and 10000 / 50 = 200.
Update 2
It's easy to get a weighted sample rather than a uniform sample - you can just pass the p= parameter to np.random.choice:
def weighted_fast_samp(refs, weights=None, lim=10000, blocksize=10000):
    samp = np.random.choice(refs, size=blocksize, replace=True, p=weights)
    samp_sum = np.cumsum(samp)
    # is the sum of our current block of samples >= lim?
    while samp_sum[-1] < lim:
        # if not, we'll sample another block and try again until it is
        newsamp = np.random.choice(refs, size=blocksize, replace=True,
                                   p=weights)
        samp = np.hstack((samp, newsamp))
        samp_sum = np.hstack((samp_sum, np.cumsum(newsamp) + samp_sum[-1]))
    last = np.searchsorted(samp_sum, lim, side='right')
    return samp[:last + 1]
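One detail to keep in mind if you feed it something like the question's a[1]: np.random.choice requires the p= weights to be non-negative and to sum to 1, so you would normalise them first. A small usage sketch, reusing refs from above:
raw = np.random.random(10000)        # plays the role of a[1]
weights = raw / raw.sum()            # p= must sum to exactly 1
sample = weighted_fast_samp(refs, weights=weights, lim=10000, blocksize=1000)
print(len(sample), sample.sum())     # the sum should land just past the limit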
Write it in cython. That's going to get you a lot more for a high iteration operation.
http://cython.org/
One obvious optimization: don't recalculate the sum on each iteration, accumulate it instead.
b_sum = 0
while b_sum < 10000:
    ....
    idx = np.argmin(np.abs(u - a[1]))
    add_val = a[0][idx]
    b.append(add_val)
    b_sum += add_val
EDIT:
I think some minor improvement (check it out if you feel like it) may be achieved by pre-referencing sublists before the loop
a_0 = a[0]
a_1 = a[1]
...
while ...:
    ....
    idx = np.argmin(np.abs(u - a_1))
    b.append(a_0[idx])
It may save some on run time - though I don't believe it will matter that much.
Sort your reference array.
That allows log(n) lookups instead of needing to browse the whole list. (using bisect for example to find the closest elements)
For starters, I reverse a[0] and a[1] to simplify the sort:
a = [np.random.random(10000), np.random.uniform(0., 100., 10000)]
# sort both sub-lists by a[0] so that the pairing between them is preserved
order = np.argsort(a[0])
a = [a[0][order], a[1][order]]
Now, a is sorted by order of a[0], meaning if you are looking for the closest value to an arbitrary number, you can start by a bisect:
import bisect

while (sum(b) < 10000):
    # Draw random [0,1) value.
    u = np.random.random()
    # Find closest value in sub-list a[0].
    idx = bisect.bisect(a[0], u)
    # now, the closest value is either at idx or at idx - 1
    if idx == len(a[0]) or (idx != 0 and np.abs(a[0][idx] - u) > np.abs(a[0][idx - 1] - u)):
        idx = idx - 1
    # Store value located in sub-list a[1].
    b.append(a[1][idx])
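If you would rather stay inside NumPy, the same sorted-lookup idea can be expressed with np.searchsorted, which does the bisection in C and handles a whole block of draws at once. This is my own hedged variant of the approach, not the answer above:
import numpy as np

keys = np.random.random(10000)               # plays the role of a[0] (sorted keys)
vals = np.random.uniform(0., 100., 10000)    # plays the role of a[1] (paired values)
order = np.argsort(keys)
keys, vals = keys[order], vals[order]

u = np.random.random(1000)                   # a whole block of random draws
pos = np.searchsorted(keys, u)               # insertion points into the sorted keys
pos = np.clip(pos, 1, len(keys) - 1)         # keep both pos - 1 and pos in range
left_closer = np.abs(u - keys[pos - 1]) < np.abs(u - keys[pos])
nearest = vals[np.where(left_closer, pos - 1, pos)]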
