Recently, I have been following a tutorial where I came across the following code:
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
Here, y_set is a vector of binary values (0 and 1) and X_set is an array with two columns.
Specifically, I don't understand how to interpret the following line of code:
X_set[y_set == j, 0], X_set[y_set == j, 1]
There are a few things going on here. For now I will drop the loop, but we know that j takes its values from y_set, so it will be either zero or one. First, make the two arrays:
import numpy as np
X_set = np.arange(20).reshape(10, 2)
y_set = np.array([0, 1, 1, 1, 0, 0, 1, 1, 0, 1])
From the above, this code is basically doing:
plt.scatter(filtered_values_in_first_column_of_X_set,
            filtered_values_in_second_column_of_X_set)
y_set is providing the filter. We can get there by building up the steps:
print("Where y_set == 0: Boolean mask.")
print(y_set == 0)
print()
print("All rows of X_set indexed by the Boolean mask")
print(X_set[y_set == 0])
print()
print("2D indexing to get only the first column of the above")
print(X_set[y_set == 0, 0])
print()
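With X_set and y_set as defined above, those steps print the following (output reproduced here for reference):
Where y_set == 0: Boolean mask.
[ True False False False  True  True False False  True False]

All rows of X_set indexed by the Boolean mask
[[ 0  1]
 [ 8  9]
 [10 11]
 [16 17]]

2D indexing to get only the first column of the above
[ 0  8 10 16]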
You can see more on the numpy indexing here. Once you break the steps down, it's not too complicated but I think it was an unnecessarily complex way of achieving this task.
The for loop is so that they could repeat the plot with two different colours depending on whether the values are filtered by y_set being equal to 0 or 1.
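For completeness, here is a self-contained sketch of the tutorial's plot using the toy arrays from above (the colours and labels are taken straight from the question's snippet):
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

X_set = np.arange(20).reshape(10, 2)
y_set = np.array([0, 1, 1, 1, 0, 0, 1, 1, 0, 1])

for i, j in enumerate(np.unique(y_set)):      # j is 0, then 1
    plt.scatter(X_set[y_set == j, 0],         # x values: first column, rows where y_set == j
                X_set[y_set == j, 1],         # y values: second column, rows where y_set == j
                c=ListedColormap(('red', 'green'))(i),
                label=j)
plt.legend()
plt.show()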
I want to generate a binary matrix of numbers with M rows and N columns. Each row must sum to <=p and >=q. In other words, each row must have at most p and at least q ones.
This is the code I have been using.
import numpy as np
def randbin(M, N, P):
    return np.random.choice([0, 1], size=(M, N), p=[P, 1 - P])

MyMatrix = randbin(200, 7, 0.5)
In one run of this, row 0 came out as all zeros. I noticed that some rows have all zeros and some rows have all ones. How can I modify this to get what I want? Is there an efficient way of achieving this?
You can generate a random number in [q, p] for each row and then set that many random ones in that row. If by efficient you mean vectorized, then yes, there is an efficient way. The trick is to simulate sampling without replacement along one axis, but with replacement along the other. This can be done with np.argsort. You can select a variable number of indices per row by turning a random count vector into a mask.
def randbin(m, n, p, q):
    # output array to assign the ones into
    result = np.zeros((m, n), dtype=bool)
    # a random permutation of column indices in every row
    # (i.e. sampling without replacement within each row)
    col_ind = np.argsort(np.random.random(size=(m, n)), axis=1)
    # how many ones to place in each row (here p is the minimum, q the maximum)
    count = np.random.randint(p, q + 1, size=(m, 1))
    # turn the counts into a mask over col_ind using a broadcast
    mask = np.arange(n) < count
    # apply the mask not only to col_ind, but also to the corresponding row_ind
    col_ind = col_ind[mask]
    row_ind = np.broadcast_to(np.arange(m).reshape(-1, 1), (m, n))[mask]
    # set the corresponding elements to 1
    result[row_ind, col_ind] = 1
    return result
The selection is made so that each run of equal values in row_ind is between p and q elements long. The corresponding elements of col_ind are unique and uniformly distributed within each row.
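As a quick sanity check (a sketch; note that with this argument order p acts as the minimum count and q as the maximum):
result = randbin(200, 7, 2, 5)
row_sums = result.sum(axis=1)
print(row_sums.min(), row_sums.max())   # both should fall within [2, 5]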
An alternative is @Prune's solution. It requires np.argsort to shuffle each row independently, since np.random.shuffle would only reorder whole rows rather than shuffling within them:
def randbin(m, n, p, q):
    # make the unique rows
    options = np.arange(n) < np.arange(p, q + 1).reshape(-1, 1)
    # select a random unique row to go into each output row
    selection = np.random.choice(options.shape[0], size=m, replace=True)
    # perform the selection
    result = options[selection]
    # create indices to shuffle each row independently
    col_ind = np.argsort(np.random.random(result.shape), axis=1)
    row_ind = np.arange(m).reshape(-1, 1)
    # perform the shuffle
    result = result[row_ind, col_ind]
    return result
Okay, then: a uniform distribution is easy enough. Let's take the case where each row of 6 must contain between 2 and 5 ones. Use a list of the allowable combinations:
[ [1, 1, 0, 0, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 0, 0],
[1, 1, 1, 1, 1, 0] ]
For each of your rows, choose a random element from these four, and then shuffle it. There is your row.
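A minimal non-vectorized sketch of that recipe (assuming, as in the example, rows of length 6 with between 2 and 5 ones; the function name is just for illustration):
import numpy as np

def randbin_simple(m, n=6, q=2, p=5):
    # the allowable patterns: q to p leading ones, the rest zeros
    options = [[1] * k + [0] * (n - k) for k in range(q, p + 1)]
    rows = []
    for _ in range(m):
        row = np.array(options[np.random.randint(len(options))])  # pick one pattern at random
        np.random.shuffle(row)                                     # spread its ones across the row
        rows.append(row)
    return np.array(rows)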
I'm working on an image classification problem where I got the train labels as a 1-D numpy array, like [1,2,3,2,2,2,4,4,3,1]. I used
train_y = []
for label in train_label:
    if label == 0:
        train_y.append([1,0,0,0])
    elif label == 1:
        train_y.append([0,1,0,0])
    elif label == 2:
        train_y.append([0,0,1,0])
    elif label == 3:
        train_y.append([0,0,0,1])
Also, I need the length of each one-hot vector to equal the number of distinct labels, i.e. len(set(train_labels)).
But this is not a good method, so please recommend a better way to do this.
It's always a good habit to use NumPy for arrays. np.unique() determines the labels present in train_labels. ix is an array of indices: np.nonzero() gives the indices of train_labels where train_labels == unique_tl[iy].
import numpy as np
train_labels = np.array([2,5,8,2,5,8])
unique_tl = np.unique(train_labels)
NL = len(train_labels)  # how many data points, 6
nl = len(unique_tl)     # how many labels, 3
target = np.zeros((NL, nl), dtype=int)
for iy in range(nl):
    ix = np.nonzero(train_labels == unique_tl[iy])
    target[ix, iy] = 1
gives
target
array([[1, 0, 0],
[0, 1, 0],
[0, 0, 1],
[1, 0, 0],
[0, 1, 0],
[0, 0, 1]])
I'll think about a way to eliminate the for loop.
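For example, one way to drop the loop (a sketch using broadcasting, not from the original answer; train_labels and unique_tl as above):
# compare every label (as a column) against every unique label (as a row)
target = (train_labels[:, None] == unique_tl[None, :]).astype(int)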
If [2,5,8] is meant as part of [0,1,2,3,4,5,6,7,8], then you can use this answer
Make a vector of zeros and set only one value to 1:
train_y = []
for label in train_labels:            # assumes labels run from 0 to num_classes - 1
    target = np.zeros(num_classes)    # vector of zeros
    target[label] = 1                 # set only the position for this label
    train_y.append(target)
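If the labels already run from 0 to num_classes - 1, an equivalent idiom (a common NumPy shortcut, not from the original answer) indexes rows of an identity matrix:
train_y = np.eye(num_classes, dtype=int)[train_labels]   # row k of the identity matrix is the one-hot vector for label k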
I have an array:
a = np.array([[1,2,3], [0,0,3], [1,2,0],[0,2,3]])
which looks like:
array([[1, 2, 3],
[0, 0, 3],
[1, 2, 0],
[0, 2, 3]])
I need to calculate pairwise correlations, but without taking zeros into consideration. So, for example, the correlation between the first and second columns should be calculated between the arrays:
array([[1, 2],
[1, 2]])
Problem: the NumPy and pandas methods take the zeros into account, and I can't find a way to make them ignore the zeros.
So I need a faster, preferably built-in, method for this.
I did write my own algorithm, but it runs really slowly on large arrays.
correlations = np.zeros((1000, 1000))
for i, column_i in enumerate(np.transpose(array_data)):
    # j runs over the columns to the right of column i
    for j, column_j in enumerate(np.transpose(array_data[:, i + 1:]), start=i + 1):
        column_i = np.reshape(column_i, (column_i.shape[0], 1))
        column_j = np.reshape(column_j, (column_j.shape[0], 1))
        values = np.concatenate([column_i, column_j], axis=1)
        # keep only the rows where both columns are non-zero
        values = [row for row in values if (row[0] != 0) & (row[1] != 0)]
        values = np.array(values)
        correlation = np.corrcoef(values[:, 0], values[:, 1])[0][1]
        correlations[i, j] = correlation
Actually, I decided to change all the zeros in the data to np.nan:
for i, e_i in enumerate(array_data):
    for j, e_j in enumerate(e_i):
        if e_j == 0:
            array_data[i, j] = np.NaN
and then pandas' .corr() worked fine...
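For reference, the same idea without the explicit double loop (a sketch, assuming array_data is the 2-D data array from the question):
import numpy as np
import pandas as pd

data = array_data.astype(float)            # float dtype so it can hold NaN
data[data == 0] = np.nan                   # mask every zero in one step
correlations = pd.DataFrame(data).corr()   # pairwise correlations, NaNs excluded pair by pair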
Let's say I have a 2D array of (N, N) shape:
import numpy as np
my_array = np.random.random((N, N))
Now I want to do some computations only on some "cells" of this array, for instance the ones inside the central part of the array. To avoid doing computations on cells I'm not interested in, what I usually do here is create a Boolean mask, in this spirit:
my_mask = np.zeros_like(my_array, bool)
my_mask[40:61,40:61] = True
my_array[my_mask] = some_twisted_computations(my_array[my_mask])
But what if some_twisted_computations() involves values of the neighboring cells if they are inside the mask? Performance-wise, would it be a good idea to create an "adjacency array" with a (len(my_mask), 4) shape, storing the indices of the 4-connected neighbor cells in the flat my_array[mask] array that I will use in some_twisted_computations()? If yes, what are the efficient options for computing such an adjacency array? Should I switch to a lower-level language/other data structures?
My real-world arrays' shapes are around (1000, 1000, 1000); the mask concerns only a small subset (~100000) of these values and has a rather complex geometry. I hope my questions make sense...
EDIT: the very dirty and slow solution I've worked out:
wall = mask
i = 0
top_neighbors = []
down_neighbors = []
left_neighbors = []
right_neighbors = []
indices = []
for index, val in np.ndenumerate(wall):
    if not val:
        continue
    indices += [index]
    if wall[index[0] + 1, index[1]]:
        down_neighbors += [(index[0] + 1, index[1])]
    else:
        down_neighbors += [i]
    if wall[index[0] - 1, index[1]]:
        top_neighbors += [(index[0] - 1, index[1])]
    else:
        top_neighbors += [i]
    if wall[index[0], index[1] - 1]:
        left_neighbors += [(index[0], index[1] - 1)]
    else:
        left_neighbors += [i]
    if wall[index[0], index[1] + 1]:
        right_neighbors += [(index[0], index[1] + 1)]
    else:
        right_neighbors += [i]
    i += 1

top_neighbors = [i if type(i) is int else indices.index(i) for i in top_neighbors]
down_neighbors = [i if type(i) is int else indices.index(i) for i in down_neighbors]
left_neighbors = [i if type(i) is int else indices.index(i) for i in left_neighbors]
right_neighbors = [i if type(i) is int else indices.index(i) for i in right_neighbors]
The best answer will probably depend on the nature of the computations you want to do. For example, if they can be expressed as summations over neighboring pixels, then something like np.convolve or scipy.signal.fftconvolve can be a really nice solution.
For your specific question of efficiently generating arrays of neighbor indices, you might try something like this:
x = np.random.rand(100, 100)
mask = x > 0.9
i, j = np.where(mask)
i_neighbors = i[:, np.newaxis] + [0, 0, -1, 1]
j_neighbors = j[:, np.newaxis] + [-1, 1, 0, 0]
# need to do something with the edge cases
# the best choice will depend on your application
# here we'll change out-of-bounds neighbors to the
# central point itself.
i_neighbors = np.clip(i_neighbors, 0, 99)
j_neighbors = np.clip(j_neighbors, 0, 99)
# compute some vectorized result over the neighbors
# as a concrete example, here we'll do a standard deviation
result = x[i_neighbors, j_neighbors].std(axis=1)
The result is an array of values corresponding to the masked region, containing the standard deviation of neighboring values.
Hopefully that approach will work for whatever specific problem you have in mind!
Edit: given the edited question above, here's how my response can be adapted to generate arrays of indices in a vectorized manner:
x = np.random.rand(100, 100)
mask = x > 0.9
i, j = np.where(mask)
i_neighbors = i[:, np.newaxis] + [0, 0, -1, 1]
j_neighbors = j[:, np.newaxis] + [-1, 1, 0, 0]
i_neighbors = np.clip(i_neighbors, 0, 99)
j_neighbors = np.clip(j_neighbors, 0, 99)
indices = np.zeros(x.shape, dtype=int)
indices[mask] = np.arange(len(i))
neighbor_in_mask = mask[i_neighbors, j_neighbors]
neighbors = np.where(neighbor_in_mask,
indices[i_neighbors, j_neighbors],
np.arange(len(i))[:, None])
left_indices, right_indices, top_indices, bottom_indices = neighbors.T
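These arrays hold positions into the flat masked array, so they can be used like this (a small illustration):
values = x[mask]                     # the flat array of masked values
left_values = values[left_indices]   # each point's left neighbour, or the point's own value
                                     # when that neighbour falls outside the mask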
I am working with matrices of (x,y,z) dimensions, and would like to index numerous values from this matrix simultaneously.
i.e. if A[1,1,1] = 5
and A[5,5,5] = 10, I would like
A[[1,1,1], [5,5,5]] to return [5, 10];
however indexing like this seems to return huge chunks of the matrix.
Does anyone know how I can accomplish this? I have a large array of indices (n rows of x, y, z triples) that I need to use to index into A.
Thanks
You are trying to use 1 as the first index three times and 5 as the index into the second dimension (again three times). This gives you the slice A[1, 5, :] repeated three times.
A = np.random.rand(6,6,6);
B = A[[1,1,1], [5,5,5]]
# [[ 0.17135991, 0.80554887, 0.38614418, 0.55439258, 0.66504806, 0.33300839],
# [ 0.17135991, 0.80554887, 0.38614418, 0.55439258, 0.66504806, 0.33300839],
# [ 0.17135991, 0.80554887, 0.38614418, 0.55439258, 0.66504806, 0.33300839]]
B.shape
# (3, 6)
Instead, you will want to specify [1,5] for each axis of your matrix.
A[[1,5], [1,5], [1,5]]   # returns [A[1,1,1], A[5,5,5]], i.e. [5, 10] in the question's example
Advanced indexing works like this:
A[I, J, K][n] == A[I[n], J[n], K[n]]
with A, I, J, and K all arrays. That's not the full, general rule, but it's what the rules simplify down to for what you need.
For example, if you want output[0] == A[0, 0, 0] and output[1] == A[1, 1, 1], then your I, J, and K arrays should look like np.array([0, 1]). Lists also work:
A[[0, 1], [0, 1], [0, 1]]
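So for the question's large array of index triples, a minimal sketch (assuming one (x, y, z) triple per row):
import numpy as np

A = np.random.rand(6, 6, 6)
idx = np.array([[1, 1, 1],
                [5, 5, 5],
                [2, 3, 4]])                   # shape (n, 3): one (x, y, z) triple per row
values = A[idx[:, 0], idx[:, 1], idx[:, 2]]   # shape (n,), equivalent to A[tuple(idx.T)]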