c=np.array([ 0. , 0.2, 0.22, 0.89, 0.99])
rnd = np.random.uniform(low=0.00, high=1.00, size=12)
I want to see how many elements in c are smaller than each of the 12 random numbers in rnd. It needs to stay in NumPy, without the use of any Python lists, so that it's fast.
The output should be an array of 12 elements, each giving the number of elements in c that are smaller than the corresponding number in rnd.
You can use broadcasting after extending c from a 1D to a 2D array version with None/np.newaxis, performing the comparisons against all elements in a vectorized manner and then summing along the first axis with .sum(0) to get the counts, like so -
(c[:,None] < rnd).sum(0)
You can also use the efficient np.searchsorted (which relies on c being sorted, as it is here), like so -
np.searchsorted(c,rnd)
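As a quick sanity check (a minimal sketch; np.searchsorted only gives the count of smaller elements because c is already sorted), both approaches produce the same counts:
import numpy as np

c = np.array([0., 0.2, 0.22, 0.89, 0.99])
rnd = np.random.uniform(low=0.00, high=1.00, size=12)

counts_bc = (c[:, None] < rnd).sum(0)   # broadcasting + summing over the c axis
counts_ss = np.searchsorted(c, rnd)     # binary search into the sorted c

print(np.array_equal(counts_bc, counts_ss))   # True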
Let's say I have several arrays, where each array is the same length. I am working with binary-valued (values are 0 or 1) arrays which might simplify the problem, so it's okay if the proposed solution makes use of this property.
I want to compute pairwise accuracies between each pair of arrays, where accuracy can be thought of as the proportion of times the elements in two arrays are equal. So here is a simple example where I am using a list of lists format. Let's say A = [[1,1,1], [0,1,0], [1,1,0]]. We would want to output:
1. , 1/3, 2/3
1/3, 1., 2/3
2/3, 2/3, 1.
I can compute this using multiple loops (iterating over each pair of arrays, and over each index). However, are there built-in functionalities or a library (e.g. NumPy) that can help do this more cleanly and efficiently?
You can use broadcasting:
import numpy as np
A = np.array([[1,1,1], [0,1,0], [1,1,0]])
output = A[:, None, :] == A[None, :, :]   # pairwise elementwise comparison, shape (3, 3, 3)
output = output.sum(axis=2) / A.shape[1]  # fraction of matching positions per pair
print(output)
# [[1. 0.33333333 0.66666667]
# [0.33333333 1. 0.66666667]
# [0.66666667 0.66666667 1. ]]
I'd suggest
A = np.array(A)
-1 * np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1, ord=1)/A.shape[1] + 1
that leverages NumPy's linalg.norm.
Pairwise accuracy here seemingly refers to the relative number of coinciding elements between two vectors. In that case, you compute
1 - HammingDistance(v1, v2) / len(v2)
where the Hamming distance counts the (absolute) number of indices of non-equal values. This is emulated by using the 1-norm through ord=1.
However, if you'd prefer to leverage the binary structure of your vectors without invoking the linear algebra in NumPy but merely its broadcasting capability,
A = np.array(A)
-1 * (A[:, None, :] != A).sum(2)/A.shape[1] + 1
will equally do.
Naturally, both code snippets require the lists (i.e. vectors) in your code to have the same length. However, measuring distance (and, in turn, similarity) in a mathematically rigorous way is non-trivial anyway when this is not the case.
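For the running example, here is a minimal sketch (not part of the original answers) confirming that the broadcasting answer and the two snippets above produce the same accuracy matrix:
import numpy as np

A = np.array([[1, 1, 1], [0, 1, 0], [1, 1, 0]])
n = A.shape[1]   # length of each vector

acc_equal = (A[:, None, :] == A[None, :, :]).sum(axis=2) / n
acc_norm = -1 * np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1, ord=1) / n + 1
acc_binary = -1 * (A[:, None, :] != A).sum(2) / n + 1

print(np.allclose(acc_equal, acc_norm) and np.allclose(acc_equal, acc_binary))   # True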
a = (np.random.rand(10) > 0.1).astype(int)
b = np.random.binomial(1, 0.9, 10)
c = np.random.choice([0, 1], 10, p=[0.1, 0.9])  # note: p must be passed by keyword
There are at least 3 different ways in numpy by which I can get an array of 0s and 1s (the ones appearing with a certain probability p, p=0.9 in the example). When I use np.random.seed(1), any single method always returns the same array. However, the methods above create different arrays from each other even with the same seed. Is this happening because they all have different PRNG algorithms, or are some of them just not affected by np.random.seed(1)?
The different approaches use the stream of pseudo random numbers in different ways, which is why they do not result in the same samples, even if you seed the generator the same way prior to each sample.
This may be clearer by considering an additional approach (used for variable b below).
np.random.seed(1)
a = (np.random.rand(10) > 0.1).astype(int)
np.random.seed(1)
b = (np.random.rand(10) < 0.9).astype(int)
Both of these approaches will generate ones with probability 0.9, and they draw the same underlying numbers when calling rand (since they're seeded the same). If, for example, 0.02 is sampled from the call to rand, then the corresponding element in a will be 0 and the corresponding element in b will be 1. This is analogous to the differences you're observing when using the other approaches to generate the samples.
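A minimal sketch makes this concrete: drawing the shared stream once and deriving both arrays from it reproduces a and b exactly, and the two arrays can only differ at positions where the underlying draw falls outside the interval (0.1, 0.9):
import numpy as np

np.random.seed(1)
a = (np.random.rand(10) > 0.1).astype(int)
np.random.seed(1)
b = (np.random.rand(10) < 0.9).astype(int)

np.random.seed(1)
r = np.random.rand(10)                            # the shared underlying draws
print(np.array_equal((r > 0.1).astype(int), a))   # True
print(np.array_equal((r < 0.9).astype(int), b))   # True
print(np.array_equal(a != b, (r <= 0.1) | (r >= 0.9)))   # True: they differ exactly there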
I am generating a series of Gaussian arrays given an x vector of length (1400), and arrays for the sigma, center, and amplitude (amp), all with length (100). I thought the best way to speed this up would be to use numpy and a list comprehension:
g = np.sum([(amp[i]*np.exp(-0.5*(x - (center[i]))**2/(sigma[i])**2)) for i in range(len(center))],axis=0)
Each row is a Gaussian along the vector x, and I then sum down the columns (axis=0) into a single array with the same length as x.
But this doesn't seem to speed things up at all. I think there is a faster way to do this while avoiding the for loop but I can't quite figure out how.
You should use vectorized computation instead of a comprehension so the loops are all performed at C speed.
In order to do so you have to reshape x to be a column vector. For example you could do x = x.reshape((1400,1)).
Then you can operate directly on the arrays, like this:
v = amp * np.exp(-0.5 * (x - center)**2 / sigma**2)
You then obtain an array of shape (1400, 100), which you can sum down to a vector with np.sum(v, axis=1).
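Putting the pieces together, a minimal sketch with made-up data of the shapes from the question (x of length 1400; amp, center, sigma of length 100) might look like this:
import numpy as np

x = np.linspace(-5, 5, 1400)              # hypothetical x grid
amp = np.random.rand(100)                 # hypothetical amplitudes
center = np.random.uniform(-5, 5, 100)    # hypothetical centers
sigma = np.random.uniform(0.1, 1.0, 100)  # hypothetical widths

xc = x.reshape((1400, 1))                             # column vector
v = amp * np.exp(-0.5 * (xc - center)**2 / sigma**2)  # shape (1400, 100) via broadcasting
g = np.sum(v, axis=1)                                 # sum the 100 Gaussians -> length 1400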
You should try to vectorize all the operations. IMHO the most efficient approach is to first convert your input data to NumPy arrays (if they were plain Python lists) and then let NumPy process the computations:
np_amp = np.array(amp)
np_center = np.array(center)
np_sigma = np.array(sigma)
# x must be broadcast as a column (shape (1400, 1)); summing over axis=1
# then collapses the 100 Gaussians into a single length-1400 array
g = np.sum(np_amp * np.exp(-0.5 * (x[:, None] - np_center)**2 / np_sigma**2), axis=1)
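As a sanity check (a minimal sketch with made-up data), the vectorized expression reproduces the original list-comprehension result:
import numpy as np

x = np.linspace(-5, 5, 1400)
amp = np.random.rand(100)
center = np.random.uniform(-5, 5, 100)
sigma = np.random.uniform(0.1, 1.0, 100)

g_loop = np.sum([amp[i] * np.exp(-0.5 * (x - center[i])**2 / sigma[i]**2)
                 for i in range(len(center))], axis=0)
g_vec = np.sum(amp * np.exp(-0.5 * (x[:, None] - center)**2 / sigma**2), axis=1)

print(np.allclose(g_loop, g_vec))   # True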
Is it possible to multiply two matrices together, but using 1 minus the elements of one of them (here, a column vector)? I am multiplying large numpy arrays and don't want to break the arrays apart into multiple instances because I am running out of memory.
For example:
a=np.array([[0.1, 0.2],[0.3, 0.4]])
b=np.array([[0.25],[0.4]])
c = np.matmul(a,b)
except that 1 - b is used in the multiplication:
1 - b == np.array([[0.75], [0.6]])
i.e. 1 - b is needed for the matrix multiplication, but should not stay in memory after the calculation is done.
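Purely as an illustration of what is being asked (a hedged sketch, not an answer taken from the thread): since a @ (1 - b) equals the row sums of a (as a column vector) minus a @ b, the product can be computed without materializing 1 - b at all -
import numpy as np

a = np.array([[0.1, 0.2], [0.3, 0.4]])
b = np.array([[0.25], [0.4]])

direct = np.matmul(a, 1 - b)                                   # builds 1 - b as a temporary
via_identity = a.sum(axis=1, keepdims=True) - np.matmul(a, b)  # never forms 1 - b

print(np.allclose(direct, via_identity))   # True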
I need to find the most frequent element in a numpy array "label", only if those elements lie inside the mask array. Here is the brute force approach:
from collections import Counter

def getlabel(mask, label):
    # get majority label
    assert label.shape == mask.shape
    tmp = []
    for i in range(mask.shape[0]):
        for j in range(mask.shape[1]):
            if mask[i][j] == True:
                tmp.append(label[i][j])
    return Counter(tmp).most_common(1)[0][0]
However, I don't think this is the most elegant or fastest approach. Which other data structures should I use (hashing, a dictionary, etc.)?
Assuming your mask is a boolean array:
import numpy as np
cnt = np.bincount(label[mask].flat)
This gives you a vector with the number of occurrences of the values 0, 1, 2, ..., max(label).
You can then find the most frequent one with
most_frequent = np.argmax(cnt)
And naturally, the number of these elements in your input data is
cnt[most_frequent]
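Putting the above together, a minimal sketch of a vectorized replacement for the original getlabel (assuming a boolean mask and non-negative integer labels) could be:
import numpy as np

def getlabel_fast(mask, label):
    # count occurrences of each label value among the masked elements
    cnt = np.bincount(label[mask])
    # return the most frequent label; cnt[np.argmax(cnt)] would give its count
    return np.argmax(cnt)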
Usually, np.bincount is fast. Let us try it with labels of at most 999 (i.e. 1000 bins) and a 10,000,000-element array in which roughly 8,000,000 values are selected by the mask:
data = np.random.randint(0, 1000, (1000, 10000))
mask = np.random.random((1000, 10000)) < 0.8
# time this section
cnt = np.bincount(data[mask].flat)
On my machine this takes 80 ms. The argmax takes maybe 2 ns/bin, so even if your label integers are a bit scattered, it does not really matter.
This approach is probably the fastest approach if the following conditions hold:
the labels are integers within range 0..N, where N is not much more than the size of the input array
the input data is in a NumPy array
This solution may be applied in some other cases as well, but then it becomes more a question of whether better solutions are available. (See metaperture's answer.) For example, a simple conversion of a Python list into an ndarray is rather costly, and the speed benefit gained by bincount will be lost if the input is a Python list and the amount of data is not large.
The sparsity of labels in the integer space is not a problem per se. Creating and zeroing the output vector is relatively fast, and it is easy and fast to compress back with np.nonzero. However, if the maximum label value is large compared to the size of the input array, then the speed benefit may be lost.
np.bincount is not a general approach. np.bincount will be faster for bounded, low-entropy, discrete distributions. However, it will fail:
if the distribution is unbounded, the memory used is unbounded (it can be arbitrarily large for an arbitrarily small input array; see the sketch after this list)
if the distribution is continuous, the argmax of bincount is not the mode (technically it's the MAP of a KDE, where the KDE is generated using histogram-like methods)
if the distribution has high entropy/dispersal, then the bin-based representation of np.bincount doesn't make sense (won't fail but will just be worse)
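For instance (a minimal sketch of the unbounded-memory point above), a single large label value forces np.bincount to allocate a bin for every integer up to that value:
import numpy as np

small = np.array([3, 1, 3, 10**7])   # only four elements...
cnt = np.bincount(small)             # ...but ten million and one bins get allocated
print(cnt.size)                      # 10000001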
For a general solution, you should do one of:
from collections import Counter

cnt = Counter(l for m, l in zip(mask.flat, label.flat) if m)  # or...
cnt = Counter(label[mask].flat)
# the most frequent element is then cnt.most_common(1)[0][0]
Or:
scipy.stats.mode(label[mask].flat)
In my testing the former is ~20x faster. If you know the distribution is discrete with a relatively low bound and entropy then bincount will be faster.
If the above is not fast enough, a better general approach than bincount is to sample your data:
collections.Counter(np.random.choice(data[mask], 1000)).most_common(1)
scipy.stats.mode(np.random.choice(data[mask], 1000))
Both of the above are an order of magnitude faster than the unsampled versions and converge to the mode quickly for even the most pathological distributions.
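As a minimal sketch (with made-up, heavily skewed integer data so that the mode is well defined), the sampled estimate can be compared against the exact answer:
import numpy as np
from collections import Counter

# hypothetical skewed label data with a clear mode (0), ~80% of it selected by a mask
data = np.random.geometric(0.3, size=(1000, 10000)) - 1
mask = np.random.random((1000, 10000)) < 0.8

exact = np.argmax(np.bincount(data[mask]))
approx = Counter(np.random.choice(data[mask], 1000)).most_common(1)[0][0]
print(exact, approx)   # both will almost always be 0, the most frequent value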