Unclear array-as-index notation in numpy - python

I am on my way to understand a vectorized approach to calculating (and plotting) Julia sets. On the web, I found the following code (annotations are mainly mine, based on my growing understanding of the ideas behind the code):
import numpy as np
import matplotlib.pyplot as plt
c = -0.74543+0.11301j # Example value for this picture (Julia set)
n = 512 # Maximum number of iterations
x = np.linspace(-1.5, 1.5, 2000).reshape((1, 2000)) # 1 row, 2000 columns
y = np.linspace(-1.2, 1.2, 1600).reshape((1600, 1)) # 1600 rows, 1 column
z = x + 1j*y # z is an array with 1600 * 2000 complex entries
c = np.full(z.shape, c) # c is a complex number matrix to be added for the iteration
diverge = np.zeros(z.shape) # 1600 * 2000 zeroes (0s), contains divergent iteration counts
m = np.full(z.shape, True) # 1600 * 2000 True, used as a kind of mask (convergent values)
for i in range(0, n): # Do at most n iterations
    z[m] = z[m]**2 + c[m] # Matrix op: Complex iteration for fixed c (Julia set perspective)
    m[np.abs(z) > 2] = False # threshold for convergence of absolute(z) is 2
    diverge[m] = i
plt.imshow(diverge, cmap='magma') # Color map "magma" applied to the iterations for each point
plt.show() # Display image plotted
I don't understand the mechanics of the line
diverge[m] = i
I gather that m is a 1600*2000 element array of Booleans. It seems that m is used as a kind of mask to let stand only those values in diverge[] for which the corresponding element in m is True. Yet I would like to understand this concept in greater detail. The syntax diverge[m] = i seems to imply that an array is used as some sort of generalized "index" to another array (diverge), and I could use some help understanding this concept. (The code runs as expected, I just have problems understanding the working of it.)
Thank you.

Yes, you can use one array to index another, in many, many ways. It is a complex matter, and even though I flatter myself that I understand numpy quite well by now, I still sometimes encounter array indexing that makes me scratch my head a bit before I understand it.
But this case is not a very complex one:
M = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
msk = np.array([[True, False, True],
                [True, True, True],
                [False, True, False]])
M[msk]
Returns array([1, 3, 4, 5, 6, 8]). You can, I am sure, easily understand the logic.
More importantly, such an indexing expression is an l-value, which means that M[msk] can appear on the left-hand side of the =, and the corresponding values of M are then modified. So
M[msk]=0
M
shows
array([[0, 2, 0],
       [0, 0, 0],
       [7, 0, 9]])
Likewise
M = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
A = np.array([[2, 2, 4],
              [4, 6, 6],
              [8, 8, 8]])
msk = np.array([[True, False, True],
                [True, True, True],
                [False, True, False]])
M[msk] = M[msk]+A[msk]
M
Result is
array([[ 3,  2,  7],
       [ 8, 11, 12],
       [ 7, 16,  9]])
So back to your case,
z[m] = z[m]**2 + c[m] # Matrix op: Complex iteration for fixed c (Julia set perspective)
is somewhat of an optimisation. You could also have written just z = z**2 + c, but what would be the point of computing that even where overflow has already occurred? So it computes z = z**2 + c only where there has been no overflow yet.
m[np.abs(z) > 2] = False # threshold for convergence of absolute(z) is 2
np.abs(z) > 2 is a 2D array of True/False values. m is set to False for every "pixel" for which |z| > 2. The other values of m remain unchanged, so they stay False if they were already False. Note that this line is slightly over-complicated: because of the previous line, z no longer changes once its magnitude exceeds 2, so in reality there is no pixel where np.abs(z) <= 2 and yet m is already False. So
m = np.abs(z) <= 2
would have worked as well. And it would not have been slower, since the original version computes that comparison anyway. In fact, it would be faster, since we spare the indexing/assignment operation. On my computer this version runs 1.3 seconds faster than the original (on a 12-second computation, so roughly 10%).
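To make that concrete, here is a minimal sketch of the loop with that simplification (same variables as in the code above; exact timings will of course vary by machine):
for i in range(n):
    z[m] = z[m]**2 + c[m]   # iterate only the not-yet-diverged pixels
    m = np.abs(z) <= 2      # rebuild the mask instead of updating it in place
    diverge[m] = i          # record the latest iteration before divergence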
But the original version has the merit of making the next line easier to understand, because it makes one point clear: m starts with all True values, some of them turn False as the algorithm runs, and none ever become True again.
diverge[m] = i
m is the mask of pixels that have not yet diverged (it starts all True, and as we iterate, more and more values of m become False).
So this line updates diverge to i everywhere no divergence has occurred yet (the variable name is not the most pertinent).
So a pixel whose z value becomes > 2 at iteration 50, i.e. whose m value became False at iteration 50, would have been updated to 0, then 1, then 2, ..., then 48, then 49 by this line, but not to 50, 51, ...
So at the end, what stays in diverge is the last i for which m was still True, that is, the last i for which the algorithm was still "converging", or, with a shift of one, the first i for which it diverges.
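A tiny 1D illustration of that mechanism (the names and numbers are just for this sketch): pixel 0 "diverges" at iteration 2, pixel 1 never does.
import numpy as np
m = np.array([True, True])
diverge = np.zeros(2, dtype=int)
for i in range(5):
    if i == 2:
        m[0] = False       # pixel 0 stops being updated from here on
    diverge[m] = i         # only the still-True pixels receive the current i
print(diverge)             # [1 4]: pixel 0 keeps the last i for which m was True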

Related

What is an efficient way to calculate the mean of values in the bin with maximum frequency for a large number of numpy arrays?

I am looking for an efficient way to do the following calculation on millions of arrays. For the values in each array, I want to calculate the mean of the values in the bin with the highest frequency, as demonstrated below. Some of the arrays might contain nan values, and the other values are floats. The loop over my actual data takes too long to finish.
import numpy as np
array = np.array([np.random.uniform(0, 10) for i in range(800)])
# adding nan values
mask = np.random.choice([1, 0], array.shape, p=[.7, .3]).astype(bool)
array[mask] = np.nan
array = array.reshape(50, 16)
bin_values = np.linspace(0, 10, 21)
f = np.apply_along_axis(lambda a: np.histogram(a, bins=bin_values)[0], 1, array)
bin_start = np.apply_along_axis(lambda a: bin_values[np.argmax(a)], 1, f).reshape(array.shape[0], -1)
bin_end = bin_start + abs(bin_values[1] - bin_values[0])
values = np.zeros(array.shape[0])
for i in range(array.shape[0]):
    values[i] = np.nanmean(array[i][(array[i] >= bin_start[i]) * (array[i] < bin_end[i])])
Also, when I run the above code I get three warnings. The first is 'RuntimeWarning: Mean of empty slice' for the line where I calculate the values variable. I set a condition to skip this line in case all the values are nan, but the warning did not go away; I was wondering what the reason is. The other two warnings are for when the less and greater_equal conditions are not met, which makes sense to me since the values compared might be nan.
The arrays that I want to run this algorithm on are independent, but I am already processing them with 12 separate scripts. Running the code in parallel would be an option, however, for now I am looking to improve the algorithm itself.
The reason I am using a lambda function is to run numpy.histogram over an axis, since the histogram function does not seem to take an axis as an option. I was able to use a mask and remove the loop from the code. The code is now 2 times faster, but I think it can still be improved further.
I can explain what I want to do in more detail with an example, if that clarifies it. Imagine I have 36 numbers which are greater than 0 and smaller than 20. Also, I have bins of equal width 0.5 over the same interval (0.0-0.5, 0.5-1.0, 1.0-1.5, ..., 19.5-20.0). I want to see, if I put the 36 numbers into their corresponding bins, what the mean of the numbers within the most populated bin would be.
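For a single array, the computation I want can be sketched like this (the names and numbers here are just for illustration):
import numpy as np
a = np.array([0.2, 0.3, 0.4, 1.7, 1.8, 9.5])
bin_values = np.linspace(0, 10, 21)               # bins of width 0.5
counts, edges = np.histogram(a, bins=bin_values)
k = np.argmax(counts)                             # index of the fullest bin
in_bin = (a >= edges[k]) & (a < edges[k + 1])
print(a[in_bin].mean())                           # mean of [0.2, 0.3, 0.4], approximately 0.3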
Please post your solution if you can think of a faster algorithm.
import numpy as np
# creating an array to test the algorithm
array = np.array([np.random.uniform(0, 10) for i in range(800,)])
# adding nan values
mask = np.random.choice([1, 0], array.shape, p=[.7, .3]).astype(bool)
array[mask] = np.nan
array = array.reshape(50, 16)
# the algorithm
bin_values=np.linspace(0, 10, 21)
# calculating the frequency of each bin
f = np.apply_along_axis(lambda a: np.histogram(a, bins=bin_values)[0], 1, array)
bin_start = np.apply_along_axis(lambda a: bin_values[np.argmax(a)], 1, f).reshape(array.shape[0], -1)
bin_end = bin_start + (abs(bin_values[1]-bin_values[0]))
# creating a mask to get the mean over the bin with maximum frequency
mask = (array>=bin_start) * (array<bin_end)
mask_nan = np.tile(np.nan, (mask.shape[0], mask.shape[1]))
mask_nan[mask] = 1
v = np.nanmean(array * mask_nan, axis = 1)

How to properly index into an array of changing size due to masking in python

This is a problem I've run into when developing something, and it's a hard question to phrase, so it's best explained with a simple example:
Imagine you have 4 random number generators which generate an array of size 4:
[rng-0, rng-1, rng-2, rng-3]
   |      |      |      |
[val0,  val1,  val2,  val3]
Our goal is to loop through "generations" of arrays populated by these RNGs, and iteratively mask out the RNG which outputted the maximum value.
So an example might be starting out with:
mask = [False, False, False, False], arr = [0, 10, 1, 3], and so we would mask out rng-1.
Then the next iteration could be: mask = [False, True, False, False], arr = [2, 1, 9] (before it gets asked: yes, arr HAS to decrease in size with each rng that is masked out). In this case, it is clear that rng-3 should be masked out (i.e. mask[3] = True), but since arr is now a different size from mask, it is tricky to get the right index when setting the mask (the max of arr is at index 2 of arr, but the corresponding generator is at index 3). This problem grows more and more difficult as more generators get masked out (in my case I'm dealing with a mask of size ~30).
If it helps, here is python version of the example:
rng = np.random.RandomState(42)
mask = np.zeros(10, dtype=bool)  # True if generator is being masked
for _ in range(mask.size):
    arr = rng.randint(100, size=(~mask).sum())
    unadjusted_max_value_idx = arr.argmax()
    adjusted_max_value_idx = unadjusted_max_value_idx + ????
    mask[adjusted_max_value_idx] = True
Any idea a good way to map the index of the max value in the arr to the corresponding index in the mask? (i.e. moving from unadjusted_max_value_idx to adjusted_max_value_idx)
# use a helper list
rng = np.random.RandomState(42)
mask = np.zeros(10, dtype=bool)  # True if generator is being masked
ndxLst = list(range(mask.size))
maskHistory = []
for _ in range(mask.size):
    arr = rng.randint(100, size=(~mask).sum())
    unadjusted_max_value_idx = arr.argmax()
    adjusted_max_value_idx = ndxLst.pop(unadjusted_max_value_idx)
    mask[adjusted_max_value_idx] = True
    maskHistory.append(adjusted_max_value_idx)
print(maskHistory)
print(mask)
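An equivalent numpy-only variant, if I'm not mistaken, maps the index back through np.flatnonzero(~mask), which lists the original positions of the still-unmasked generators in the same order as arr:
import numpy as np
rng = np.random.RandomState(42)
mask = np.zeros(10, dtype=bool)  # True if generator is being masked
for _ in range(mask.size):
    arr = rng.randint(100, size=(~mask).sum())
    alive = np.flatnonzero(~mask)                 # original indices of the unmasked rngs
    adjusted_max_value_idx = alive[arr.argmax()]
    mask[adjusted_max_value_idx] = True
print(mask)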

Python list notation, Numpy array notation: predictions[predictions < 1e-10] = 1e-10

I am trying to find out what operation is applied to a list. I have a list/array named predictions and I am executing the following instruction:
predictions[predictions < 1e-10] = 1e-10
This code snippet is from a Udacity Machine Learning assignment that uses Numpy.
It was used in the following manner:
def logprob(predictions, labels):
    """Log-probability of the true labels in a predicted batch."""
    predictions[predictions < 1e-10] = 1e-10
    return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]
As pointed out by @MosesKoledoye and various others, it is actually a NumPy array. (NumPy is a Python library.)
What does this line do?
As pointed out by @MosesKoledoye, predictions is most likely a numpy array.
A boolean array is then generated by predictions < 1e-10. At all indices where that boolean array is True, the value is changed to 1e-10, i.e. 10⁻¹⁰.
Example:
>>> a = np.array([1,2,3,4,5]) #define array
>>> a < 3 #define boolean array through condition
array([ True, True, False, False, False], dtype=bool)
>>> a[a<3] #select elements using boolean array
array([1, 2])
>>> a[a<3] = -1 #change value of elements which fit condition
>>> a
array([-1, -1, 3, 4, 5])
The reason this might be done in the code could be to prevent division by zero, or to prevent negative numbers from messing things up, by instead inserting a very small number.
All elements of the array, for which the condition (element < 1e-10) is true, are set to 1e-10.
Practically you are setting a minimum value.
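Note that, if I'm not mistaken, the same floor could also be applied without boolean indexing, e.g. with np.maximum or np.clip (these return a new array rather than modifying predictions in place, unlike the original line):
import numpy as np
predictions = np.array([0.5, 1e-12, 0.0, 0.3])
floored = np.maximum(predictions, 1e-10)          # element-wise floor at 1e-10
# or: floored = np.clip(predictions, 1e-10, None)
print(floored)                                    # values: 0.5, 1e-10, 1e-10, 0.3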

Is there a difference in the way we access elements of a list comprehension and the elements of a numpy array

I am working on a genetic algorithm code. I am fairly new to python.
My code snippet is as follows:
import numpy as np
pop_size = 10 # Population size
noi = 2 # Number of Iterations
M = 2 # Number of Phases in the Data
alpha = [np.random.randint(0, 64, size=pop_size)] * M
phi = [np.random.randint(0, 64, size=pop_size)] * M
reduced_tensor = [np.zeros((pop_size, 3, 3))] * M
for n_i in range(noi):
    alpha_en = [(2*np.pi*alpha/63.00) for alpha in alpha]
    phi_en = [(phi/63.00) for phi in phi]
    for i in range(M):
        for j in range(pop_size):
            reduced_tensor[i][j] = [[1, 0, 0],
                                    [0, phi_en[i][j], 0],
                                    [0, 0, 0]]
Here I have a list of numpy arrays. The variable 'alpha' is a list containing two numpy arrays. How do I use list comprehension in this case? I want to create a similar list 'alpha_en' which operates on every element of alpha. How do I do that? I know my current code is wrong, it was just trial and error.
What does 'for alpha in alpha' mean (in the line computing alpha_en)? This line doesn't give any error, but it also doesn't give the desired output. It changes the dimension and value of alpha.
The variable 'reduced_tensor' is a list of arrays of 3x3 matrices, i.e., four dimensions in total. How do I differentiate between indexing a list and indexing a numpy array? I want to perform various operations on a list of matrices, in this case assigning the values of phi_en to one of the elements of the matrix reduced_tensor (as shown in the code). How should I do it efficiently? I think my current code is wrong, if not just confusing.
There is some questionable programming in these 2 lines:
alpha = [np.random.randint(0, 64, size = pop_size)]* M
...
alpha_en = [(2*np.pi*alpha/63.00) for alpha in alpha]
The first makes one array, and then makes a list with M pointers to that same array (M references, not M copies of the random array). If I were to change one element of alpha, I'd change them all. I don't see the point of this type of construction.
The [... for alpha in alpha] works because the 2 uses of alpha are different. At least in newer Pythons, the i in [i*3 for i in range(3)] does not 'leak out' of the comprehension. That said, I would not approve of that variable naming; at the very least it is confusing to readers.
The arrays in alpha_en are separate. Values are derived from the array in alpha, but they are new.
for a in alpha:
    a *= 2
would modify each array in alpha; however, due to how alpha is constructed, this ends up multiplying the same single array M times.
reduced_tensor = [np.zeros((pop_size,3,3))]* M
has the same problem; it's a list of M references to the same 3d array.
reduced_tensor[i][j]
references the i-th element of that list, and the j-th 'row' of that array. I like to use
reduced_tensor[i][j,:,:]
to make it clearer to me and my reader the expected dimensions of the result.
The iteration over M does nothing for you; it just repeats the same assignment M times.
At the root of your problems is that use of list replication.
In [30]: x=[np.arange(3)]*3
In [31]: x
Out[31]: [array([0, 1, 2]), array([0, 1, 2]), array([0, 1, 2])]
In [32]: [id(i) for i in x]
Out[32]: [3036895536, 3036895536, 3036895536]
In [33]: x[0] *= 10
In [34]: x
Out[34]: [array([ 0, 10, 20]), array([ 0, 10, 20]), array([ 0, 10, 20])]
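A sketch of how to build independent arrays instead, assuming independent random populations are what is actually wanted here:
import numpy as np
pop_size, M = 10, 2
# one independent array per phase; modifying one does not touch the others
alpha = [np.random.randint(0, 64, size=pop_size) for _ in range(M)]
# or, often simpler, a single 2D array of shape (M, pop_size)
alpha_2d = np.random.randint(0, 64, size=(M, pop_size))
alpha_en = 2 * np.pi * alpha_2d / 63.0   # operates on the whole array at once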

Numpy: transpose result of advanced indexing

>>> import numpy as np
>>> X = np.arange(27).reshape(3, 3, 3)
>>> x = [0, 1]
>>> X[x, x, :]
array([[ 0,  1,  2],
       [12, 13, 14]])
I need to sum it along the 0 dimension but in the real world the matrix is huge and I would prefer to be summing it along -1 dimension which is faster due to memory layout. Hence I would like the result to be transposed:
array([[ 0, 12],
       [ 1, 13],
       [ 2, 14]])
How do I do that? I would like the result of numpy's "advanced indexing" to be implicitly transposed. Transposing it explicitly with .T at the end is even slower and is not an option.
Update1: in the real world advanced indexing is unavoidable and the subscripts are not guaranteed to be the same.
>>> x = [0, 0, 1]
>>> y = [0, 1, 1]
>>> X[x, y, :]
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [12, 13, 14]])
Update2: To clarify that this is not an XY problem, here is the actual problem:
I have a large matrix X which contains elements x coming from some probability distribution. The probability distribution of an element depends on the neighbourhood of that element. This distribution is unknown, so I follow the Gibbs sampling procedure to build a matrix which has elements from this distribution. In a nutshell, it means that I make some initial guess for matrix X and then keep iterating over the elements of matrix X, updating each element x with a formula that depends on the neighbouring values of x. So, for any element of the matrix I need to get its neighbours (advanced indexing) and perform some operation on them (summation in my example). I have used line_profiler to see that the line which takes most of the time in my code is the one summing an array along dimension 0 rather than -1. Hence I would like to know if there is a way to produce an already-transposed matrix as a result of advanced indexing.
I would like to sum it along the 0 dimension but in the real world the matrix is huge and I would prefer to be summing it along -1 dimension which is faster due to memory layout.
I'm not totally sure what you mean by this. If the underlying array is row-major (the default, i.e. X.flags.c_contiguous == True), then it may be slightly faster to sum it along the 0th dimension. Simply transposing an array using .T or np.transpose() does not, in itself, change how the array is laid out in memory.
For example:
# X is row-major
print(X.flags.c_contiguous)
# True
# Y is just a transposed view of X
Y = X.T
# the indices of the elements in Y are transposed, but their layout in memory
# is the same as in X, therefore Y is column-major rather than row-major
print(Y.flags.c_contiguous)
# False
You can convert from row-major to column-major, for example by using np.asfortranarray(X), but there is no way to perform this conversion without making a full copy of X in memory. Unless you're going to be performing lots of operations over the columns of X then it almost certainly won't be worthwhile doing the conversion.
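For example, a quick check of the layout flags (np.shares_memory needs a reasonably recent numpy version):
X = np.arange(27).reshape(3, 3, 3)   # C-contiguous (row-major) by default
X_f = np.asfortranarray(X)           # column-major copy of the same data
print(X.flags.c_contiguous)          # True
print(X_f.flags.f_contiguous)        # True
print(np.shares_memory(X, X_f))      # False, i.e. the conversion made a full copy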
If you want to store the result of your summation in a column-major array, you could use the out= kwarg to X.sum(), e.g.:
result = np.empty((3, 3), order='F') # Fortran-order, i.e. column-major
X.sum(0, out=result)
In your case the difference between summing over rows vs columns is likely to be very minimal, though - since you are already going to be indexing non-adjacent elements in X you will already be losing the benefit of spatial locality of reference that would normally make summing over rows slightly faster.
For example:
X = np.random.randn(100, 100, 100)
# summing over whole rows is slightly faster than summing over whole columns
%timeit X.sum(0)
# 1000 loops, best of 3: 438 µs per loop
%timeit X.T.sum(0)
# 1000 loops, best of 3: 486 µs per loop
# however, the locality advantage disappears when you are addressing
# non-adjacent elements using fancy indexing
%timeit X[[0, 0, 1], [0, 1, 1], :].sum()
# 100000 loops, best of 3: 4.72 µs per loop
%timeit X.T[[0, 0, 1], [0, 1, 1], :].sum()
# 100000 loops, best of 3: 4.63 µs per loop
Update
@senderle has mentioned in the comments that using numpy v1.6.2 he sees the opposite order for the timings, i.e. X.sum(-1) is faster than X.sum(0) for a row-major array. This seems to be related to the version of numpy he is using: with v1.6.2 I can reproduce the order that he observes, but with two newer versions (v1.8.2 and 1.10.0.dev-8bcb756) I observe the opposite (i.e. X.sum(0) is faster than X.sum(-1) by a small margin). Either way, I don't think changing the memory order of the array is likely to help much for the OP's case.
