Using NumPy to count numbers in a range - Python

I've been trying to write some code which will count the numbers that fall into a certain range and append a corresponding number to a list. I also need to pull the range boundaries from a cumulative sum (cumsum) of the array.
numbers = []
i = 0
z = np.random.rand(1000)
arraypmf = np.array(pmf)
summation = np.cumsum(z)
while i < 6:
    index = i - 1
    a = np.extract(condition, z)  # I can't figure out how to write the condition.
    length = len(a)
    length * numbers.append(i)

I'm not entirely sure what you're trying to do, but the easiest way to do conditions in numpy is to just apply them to the whole array to get a mask:
mask = (z >= 0.3) & (z < 0.6)
Then you can use, e.g., extract or ma if necessary, but in this case I think you can just rely on the fact that True == 1 and False == 0 and do this:
zm = z * mask
After all, if all you're doing is summing things up, 0 is the same as not there, and you can just replace len with count_nonzero.
For example:
In [588]: z=np.random.rand(10)
In [589]: z
Out[589]:
array([ 0.33335522,  0.66155206,  0.60602815,  0.05755882,  0.03596728,
        0.85610536,  0.06657973,  0.43287193,  0.22596789,  0.62220608])
In [590]: mask = (z >= 0.3) & (z < 0.6)
In [591]: mask
Out[591]: array([ True, False, False, False, False, False, False, True, False, False], dtype=bool)
In [592]: z * mask
Out[592]:
array([ 0.33335522,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.43287193,  0.        ,  0.        ])
In [593]: np.count_nonzero(z * mask)
Out[593]: 2
In [594]: np.extract(mask, z)
Out[594]: array([ 0.33335522, 0.43287193])
In [595]: len(np.extract(mask, z))
Out[595]: 2

Here is another approach to (what I think) you're trying to do:
import numpy as np
z = np.random.rand(1000)
bins = np.asarray([0, .1, .15, 1.])
# This will give the number of values in each range
counts, _ = np.histogram(z, bins)
# This will give the sum of all values in each range
sums, _ = np.histogram(z, bins, weights=z)
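If you also need the mean of the values in each range, you can combine the two histograms (a small sketch building on the counts and sums above; the where= guard avoids dividing by zero in empty bins):
# Per-bin means; empty bins yield 0 instead of a division-by-zero warning
means = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)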

Related

Cleaner way to fill matrix based on other matrix values

Let's suppose we have these two matrices
epsilon = np.asmatrix([
    [1, 2],
    [-1, 2],
    [0, 2]
])
and this one:
step_weights = np.asmatrix(np.random.normal(0, 0.5, np.shape(epsilon)))
I want to populate/update the step_weights matrix based on the epsilon values, that is:
if epsilon[i,j] > 0:
    step_weights[i,j] = np.minimum(1.2 * step_weights[i,j], 50)
elif epsilon[i,j] < 0:
    step_weights[i,j] = np.maximum(0.5 * step_weights[i,j], 10**-6)
This is what I have done:
import numpy as np

def update_steps(self, epsilon):
    for (i, j), epsilon_ij in np.ndenumerate(epsilon):
        if epsilon_ij > 0:
            step_weights[i, j] = np.minimum(1.2 * step_weights[i, j], 50)
        elif epsilon_ij < 0:
            step_weights[i, j] = np.maximum(0.5 * step_weights[i, j], 10**-6)
and that's working fine.
My question is: is there a more efficient/cleaner way to do it, avoiding the for loop? For example exploiting matrix calculus or linear algebra?
Use boolean index arrays:
>>> np.random.seed(0)
>>> step_weights = np.asmatrix(np.random.normal(0, 0.5, np.shape(epsilon)))
>>> step_weights
matrix([[ 0.88202617,  0.2000786 ],
        [ 0.48936899,  1.1204466 ],
        [ 0.933779  , -0.48863894]])
>>> mask = epsilon > 0
>>> step_weights[mask] = np.minimum(step_weights.A[mask] * 1.2, 50)
>>> mask = epsilon < 0
>>> step_weights[mask] = np.maximum(step_weights.A[mask] * 0.5, 10 ** -6)
>>> step_weights
matrix([[ 1.05843141,  0.24009433],
        [ 0.2446845 ,  1.34453592],
        [ 0.933779  , -0.58636673]])
Note: the matrix class is no longer recommended and will be removed in the future; use a regular multidimensional array (ndarray) instead. Regular arrays already support the usual matrix operations, such as the @ operator for matrix multiplication.
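For reference, a minimal sketch of the same update on a plain ndarray (assuming the epsilon from the question; the seed only makes the run reproducible):
import numpy as np

np.random.seed(0)
epsilon = np.array([[1, 2], [-1, 2], [0, 2]])
step_weights = np.random.normal(0, 0.5, epsilon.shape)

# Boolean masks select the entries to grow or shrink, exactly as above
grow = epsilon > 0
shrink = epsilon < 0
step_weights[grow] = np.minimum(step_weights[grow] * 1.2, 50)
step_weights[shrink] = np.maximum(step_weights[shrink] * 0.5, 1e-6)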

Python loop to populate a matrix

I'm trying to populate an array in python more efficiently. I have a 5x3 matrix A that I am transforming into a 3x3 matrix (Z) by calculating z11, z12, ..., z33 independently. The code below works, but it's clunky and I'm hoping to automate this into a loop so that it will take an A matrix of any size (n x m) and transform it into a Z matrix of size (m x m). If someone could help me out I would greatly appreciate it!
import numpy as np

A = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 1, 1],
              [0, 0, -1],
              [0, 0, 1]])
A1 = A[:, 0]
A2 = A[:, 1]
A3 = A[:, 2]
C = np.array([-2, -2, -9, -6, -4])
X = np.array([-4, -4, -8])
z11 = sum(A1*A1) * (C[0]/X[0])
z12 = sum(A1*A2) * (C[0]/X[1])
z13 = sum(A1*A3) * (C[0]/X[2])
z21 = sum(A2*A1) * (C[1]/X[0])
z22 = sum(A2*A2) * (C[1]/X[1])
z23 = sum(A2*A3) * (C[1]/X[2])
z31 = sum(A3*A1) * (C[2]/X[0])
z32 = sum(A3*A2) * (C[2]/X[1])
z33 = sum(A3*A3) * (C[2]/X[2])
Z = np.array([[z11, z12, z13],
              [z21, z22, z23],
              [z31, z32, z33]])
We can use broadcasting to achieve the same result. First increase the dimensionality of A using A[:, None] and then multiply it with A. Since the shape of A[:, None].T is (3, 1, 5) and the shape of A.T is (3, 5), NumPy (intuitively) repeats each array along the dimension where the two shapes don't match and then multiplies elementwise. This way each column of A gets multiplied with every other column (the transposes make sure that columns, not rows, are paired). Then we take the sum along the last axis and scale by C[:m, None]/X to get the desired output.
Use:
m = A.shape[1]
B = A[:, None].T * A.T
Z = np.sum(B, axis=-1).astype(float) * C[:m, None] / X
Output:
>>> Z
array([[0.5  , 0.   , 0.   ],
       [0.   , 1.   , 0.25 ],
       [0.   , 2.25 , 3.375]])
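Equivalently (a sketch, not part of the original answer), the pairwise column products can be written with np.einsum, which reads closer to the z_ij formula:
# Z[i, j] = sum_k A[k, i] * A[k, j], scaled by C[i] / X[j]
Z = np.einsum('ki,kj->ij', A, A) * C[:m, None] / X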

Scatter plot with specific conditions

Suppose that I have this data set
x1 = np.array([0.1,0.3,0.1,0.6,0.4,0.6,0.5,0.9,0.4,0.7])
x2 = np.array([0.1,0.4,0.5,0.9,0.2,0.3,0.6,0.2,0.4,0.6])
c = np.array([ 1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
I want a scatter plot of x1, x2 under the following condition: if c[i] == 1, plot the marker at (x1[i], x2[i]) as a red X, but if c[i] == 0, plot the marker at (x1[i], x2[i]) as a blue O.
Any idea on how to do it?
Since you're using numpy, it is rather easy to get the elements (or indices) where a certain rule (comparison) is satisfied.
You can get an array of boolean values corresponding to some condition applied to the elements of an existing array:
print(c==0)
[False False False False False True True True True True]
Also, you can access only some elements of an array using an array of booleans (the same size as the initial array):
print(x1[c==1])
[0.1 0.3 0.1 0.6 0.4]
or perform more complex operations and/or set values:
x1[x2 < 0.5] = 0
print(x1)
x2[x2 < 0.5] = 10
print(x2)
[0. 0. 0.1 0.6 0. 0. 0.5 0. 0. 0.7]
[10. 10. 0.5 0.9 10. 10. 0.6 10. 10. 0.6]
For even more complicated conditions, there are logical functions for NumPy arrays.
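For instance, a minimal sketch combining two conditions elementwise (the threshold values here are arbitrary):
# True where both conditions hold; & is elementwise, unlike Python's `and`
both = np.logical_and(x1 > 0.2, x2 < 0.5)
# equivalent operator form; each comparison needs its own parentheses
both = (x1 > 0.2) & (x2 < 0.5)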
This approach (NumPy) is significantly faster than using loops, and you should use it whenever possible!
Applying the above, it is easy to solve your problem:
import numpy as np
import matplotlib.pyplot as plt
x1 = np.array([0.1,0.3,0.1,0.6,0.4,0.6,0.5,0.9,0.4,0.7])
x2 = np.array([0.1,0.4,0.5,0.9,0.2,0.3,0.6,0.2,0.4,0.6])
c = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
plt.plot(x1[c==0], x2[c==0], 'bo')
plt.plot(x1[c==1], x2[c==1], 'rx')
I don't know whether this is an efficient way, but you can do this:
i = 0
for (p1, p2) in zip(x1, x2):
    if c[i] == 1:
        plt.scatter(p1, p2, marker='x', color='red')
    else:
        plt.scatter(p1, p2, marker='o', color='blue')
    i += 1
Check other marker shapes here:
https://matplotlib.org/3.1.1/api/markers_api.html

Aggregating 2 NumPy arrays by confidence

I have 2 NumPy arrays containing values in the interval [0, 1].
I want to create a third array containing the most "confident" values, meaning: take, elementwise, the number from whichever array is closer to 1 or 0. Consider the following example:
[0.7,0.12,1,0.5]
[0.1,0.99,0.001,0.49]
so my constructed array would be:
[0.1,0.99,1,0.49]
import numpy as np
A = np.array([0.7,0.12,1,0.5])
B = np.array([0.1,0.99,0.001,0.49])
maxi = np.maximum(A,B)
mini = np.minimum(A,B)
# Find where the maximum is closer to 1 than the minimum is to 0
idx = 1-maxi < mini
maxi*idx + mini*~idx
returns
array([ 0.1 , 0.99, 1. , 0.49])
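The last line can also be written with np.where (a sketch, assuming the same maxi, mini, and idx as above):
np.where(idx, maxi, mini)  # pick maxi where idx is True, else mini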
You can try this:
c = np.array([A[i] if min(1 - A[i], A[i]) < min(1 - B[i], B[i]) else B[i] for i in range(len(A))])
The result is:
array([ 0.1 , 0.99, 1. , 0.49])
Another way of stating your "confidence" measure is to ask which of the two numbers is furthest away from 0.5, that is, which of the two numbers x yields the largest abs(0.5 - x). The following solution constructs a 2D array c with the original arrays as columns. Then we construct and apply a boolean mask based on abs(0.5 - c):
import numpy as np
a = np.array([0.7,0.12,1,0.5])
b = np.array([0.1,0.99,0.001,0.49])
# Combine
c = np.concatenate((a, b)).reshape((2, len(a))).T
# Create mask
b_or_a = np.asarray(np.argmax(np.abs((0.5 - c)), axis=1), dtype=bool)
mask = np.zeros(c.shape, dtype=bool)
mask[:, 0] = ~b_or_a
mask[:, 1] = b_or_a
# Apply mask
d = c[mask]
print(d) # [ 0.1 0.99 1. 0.49]
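A more compact alternative in the same spirit (a sketch, not part of the original answers) keeps whichever element lies further from 0.5:
d = np.where(np.abs(a - 0.5) >= np.abs(b - 0.5), a, b)
print(d)  # [0.1  0.99 1.   0.49]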

Radial profile of 2D matrix with float indexes

I have a 2D data array and I'm trying to get a profile of values about its center in an efficient manner. So the output should be two one-dimensional arrays: one with the values of distances from the center, the other with the mean of all the values in the original 2D that are at that distance from the center.
Each index has a non-integer distance from the center, which prevents me from using some already known solutions for the problem. Allow me to explain.
Consider these matrices
data = np.random.randn(5,5)
L = 2
x = np.arange(-L,L+1,1)*2.5
y = np.arange(-L,L+1,1)*2.5
xx, yy = np.meshgrid(x, y)
r = np.sqrt(xx**2. + yy**2.)
So the matrices are
In [30]: r
Out[30]:
array([[ 7.07106781,  5.59016994,  5.        ,  5.59016994,  7.07106781],
       [ 5.59016994,  3.53553391,  2.5       ,  3.53553391,  5.59016994],
       [ 5.        ,  2.5       ,  0.        ,  2.5       ,  5.        ],
       [ 5.59016994,  3.53553391,  2.5       ,  3.53553391,  5.59016994],
       [ 7.07106781,  5.59016994,  5.        ,  5.59016994,  7.07106781]])
In [31]: data
Out[31]:
array([[ 1.27603322,  1.33635284,  1.93093228,  0.76229675, -0.00956535],
       [ 0.69556071, -1.70829753,  1.19615919, -1.32868665,  0.29679494],
       [ 0.13097791, -1.33302719,  1.48226442, -0.76672223, -1.01836614],
       [ 0.51334771, -0.83863115, -0.41541794,  0.34743342,  0.1199237 ],
       [-1.02042539,  0.90739383, -2.4858624 , -0.07417987,  0.90748933]])
For this case the expected output should be array([ 0. , 2.5 , 3.53553391, 5. , 5.59016994, 7.07106781]) for the index of distances, and a second array of same length with the mean of all the values that are at those corresponding distances: array([ 0.98791323, -0.32496927, 0.37221219, -0.6209728 , 0.27986926, 0.04060628]).
This answer gives a very nice function to compute the profile about any arbitrary point. However, the problem with that approach is that it approximates the distance r by the index distance. So its r for my case would be this:
array([[2, 2, 2, 2, 2],
[2, 1, 1, 1, 2],
[2, 1, 0, 1, 2],
[2, 1, 1, 1, 2],
[2, 2, 2, 2, 2]])
which is a pretty big difference for me, since I'm working with small matrices. This approximation, however, allows him to use np.bincount, which is pretty handy (but won't work for me).
I've been trying to extend this to float distances, like my version of r, but so far no luck: bincount doesn't work with floats, and histogram needs equally-spaced bins, which is not the case here. Any suggestions?
Approach #1
def radial_profile_app1(data, r):
    mid = data.shape[0]//2
    # Normalize squared distances by the smallest nonzero squared distance on
    # this grid, so every unique radius maps to an integer bin id for bincount
    ids = np.rint((r**2)/r[mid-1, mid]**2).astype(int).ravel()
    count = np.bincount(ids)

    R = data.shape[0]//2  # Radial profile radius
    R0 = R + 1
    dists = np.unique(r[:R0, :R0][np.tril(np.ones((R0, R0), dtype=bool))])

    mean_data = (np.bincount(ids, data.ravel())/count)[count != 0]
    return dists, mean_data
For the given sample data -
In [475]: radial_profile_app1(data, r)
Out[475]:
(array([ 0.        ,  2.5       ,  3.53553391,  5.        ,  5.59016994,
         7.07106781]),
 array([ 1.48226442  , -0.3297520425, -0.8820454775, -0.3605795875,
         0.5696863263,  0.2883829525]))
Approach #2
def radial_profile_app2(data, r):
    R = data.shape[0]//2  # Radial profile radius
    range_arr = np.arange(-R, R+1)
    # Integer squared index distances double as bin ids for bincount
    ids = (range_arr[:, None]**2 + range_arr**2).ravel()
    count = np.bincount(ids)

    R0 = R + 1
    dists = np.unique(r[:R0, :R0][np.tril(np.ones((R0, R0), dtype=bool))])

    mean_data = (np.bincount(ids, data.ravel())/count)[count != 0]
    return dists, mean_data
Runtime test -
In [562]: # Setup inputs
...: N = 2001
...: data = np.random.randn(N,N)
...: L = (N-1)//2
...: x = np.arange(-L,L+1,1)*2.5
...: y = np.arange(-L,L+1,1)*2.5
...: xx, yy = np.meshgrid(x, y)
...: r = np.sqrt(xx**2. + yy**2.)
...:
In [563]: out01, out02 = radial_profile_app1(data, r)
...: out11, out12 = radial_profile_app2(data, r)
...:
...: print np.allclose(out01, out11)
...: print np.allclose(out02, out12)
...:
True
True
In [566]: %timeit radial_profile_app1(data, r)
...: %timeit radial_profile_app2(data, r)
...:
10 loops, best of 3: 114 ms per loop
10 loops, best of 3: 91.2 ms per loop
Got what I was expecting with this function:
def radial_prof(data, r):
    uniq = np.unique(r)
    prof = np.array([np.mean(data[r == un]) for un in uniq])
    return uniq, prof
But I'm still not happy with the fact that I had to use a list comprehension (or a Python loop), since it might be slow for very large matrices.
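A loop-free variant of the same idea (a sketch; np.unique's return_inverse gives, for every element of r, the index of its unique value, which can feed np.bincount):
uniq, inv = np.unique(r, return_inverse=True)
# Sum the data values within each distance group, then divide by group sizes
prof = np.bincount(inv.ravel(), weights=data.ravel()) / np.bincount(inv.ravel())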
Here is an indirect sorting approach that should scale well if the batch size and/or the number of bins are large. The sorting is O(n log n); all the histogramming is O(n). I've also added a little unscientific speed test. For the speed test I use flat indexing, but I left the 2D index code in because it's more flexible when dealing with images of different sizes, etc.
import numpy as np

# this need only be run once per batch
def r_to_ind(r, dist_bins="auto"):
    f = np.argsort(r.ravel())
    if dist_bins == "auto":
        rs = r.ravel()[f]
        bins = np.where(np.r_[True, rs[1:] != rs[:-1]])[0]
        dist_bins = rs[bins]
    else:
        bins = np.searchsorted(r.ravel()[f], dist_bins)
    denom = np.diff(np.r_[bins, r.size])
    return f, np.unravel_index(f, r.shape), bins, denom, dist_bins

# this is with adjustable offset
def profile_xy(image, yx, ij, bins, nynx, denom):
    (y, x), (i, j), (ny, nx) = yx, ij, nynx
    return np.add.reduceat(image[i + y - ny//2, j + x - nx//2], bins) / denom

# this is fixed
def profile_xy_no_offset(image, ij, bins, denom):
    return np.add.reduceat(image[ij], bins) / denom

# this is fixed and flat
def profile_xy_no_offset_flat(image, k, bins, denom):
    return np.add.reduceat(image.ravel()[k], bins) / denom
data = np.array([[ 1.27603322,  1.33635284,  1.93093228,  0.76229675, -0.00956535],
                 [ 0.69556071, -1.70829753,  1.19615919, -1.32868665,  0.29679494],
                 [ 0.13097791, -1.33302719,  1.48226442, -0.76672223, -1.01836614],
                 [ 0.51334771, -0.83863115, -0.41541794,  0.34743342,  0.1199237 ],
                 [-1.02042539,  0.90739383, -2.4858624 , -0.07417987,  0.90748933]])
r = np.array([[ 7.07106781,  5.59016994,  5.        ,  5.59016994,  7.07106781],
              [ 5.59016994,  3.53553391,  2.5       ,  3.53553391,  5.59016994],
              [ 5.        ,  2.5       ,  0.        ,  2.5       ,  5.        ],
              [ 5.59016994,  3.53553391,  2.5       ,  3.53553391,  5.59016994],
              [ 7.07106781,  5.59016994,  5.        ,  5.59016994,  7.07106781]])
f, (i, j), bins, denom, dist_bins = r_to_ind(r)
result = profile_xy(data, (2, 2), (i, j), bins, (5, 5), denom)
print(dist_bins)
# [ 0. 2.5 3.53553391 5. 5.59016994 7.07106781]
print(result)
# [ 1.48226442 -0.32975204 -0.88204548 -0.36057959 0.56968633 0.28838295]
#########################
from timeit import timeit

n = 2001
batch = 100
fake = 10
a = np.random.random((fake, n, n))
l = np.linspace(-1, 1, n)**2
r = sum(np.ix_(l, l))

def run_all():
    f, ij, bins, denom, dist_bins = r_to_ind(r)
    for b in range(batch):
        profile_xy_no_offset_flat(a[b % fake], f, bins, denom)

print(timeit(run_all, number=10))
# 47.4157 (for 10 batches of 100 images of size 2001x2001)
# and my computer is slower than Divakar's ;-)
I've made some more benchmarks comparing mine to @Divakar's approach 3, stripping out everything precomputable into a run-once-per-batch function. The general finding: they are similar; mine has a higher upfront cost but is then faster. They only cross over at around 100 pictures per batch.
