Say I have an array like this:
import numpy as np
arr = np.array([
[1, 1, 3, 3, 1],
[1, 3, 3, 1, 1],
[4, 4, 3, 1, 1],
[4, 4, 1, 1, 1]
])
There are 4 distinct regions: The top left 1s, 3s, 4s and right 1s.
How would I get the paths for the bounds of each region? The coordinates of the vertices of the region, in order.
For example, for the top left 1s, it is (0, 0), (0, 2), (1, 2), (1, 1), (2, 1), (2, 0)
(I ultimately want to end up with something like start at 0, 0. Right 2. Down 1. Right -1. Down 1. Right -1. Down -2., but it's easy to convert, as it's just the difference between adjacent vertices)
I can split it up into regions with scipy.ndimage.label:
from scipy.ndimage import label
regions = {}
# region_value is the number in the region
for region_value in np.unique(arr):
labeled, n_regions = label(arr == region_value)
regions[region_value] = [labeled == i for i in range(1, n_regions + 1)]
Which looks more like this:
{1: [
array([
[ True, True, False, False, False],
[ True, False, False, False, False],
[False, False, False, False, False],
[False, False, False, False, False]
], dtype=bool), # Top left 1s region
array([
[False, False, False, False, True],
[False, False, False, True, True],
[False, False, False, True, True],
[False, False, True, True, True]
], dtype=bool) # Right 1s region
],
3: [
array([
[False, False, True, True, False],
[False, True, True, False, False],
[False, False, True, False, False],
[False, False, False, False, False]
], dtype=bool) # 3s region
],
4: [
array([
[False, False, False, False, False],
[False, False, False, False, False],
[ True, True, False, False, False],
[ True, True, False, False, False]
], dtype=bool) # 4s region
]
}
So how would I convert that into a path?
a pseudo code idea would be to do the following:
scan multi-dim array horizontally and then vertically until you find True value (for second array it is (0,4))
output that as a start coord
since you have been scanning as determined above your first move will be to go right.
repeat until you come back:
move one block in the direction you are facing.
you are now at coord x,y
check values of ul=(x-1, y-1), ur=(x-1, y), ll=(x, y-1), lr=(x,y)
# if any of above is out of bounds, set it as False
if ul is the only True:
if previous move right:
next move is up
else:
next move is left
output previous move
move by one
..similarly for other single True cells..
elif ul and ur only True or ul and ll only True or ll and lr only True or ur and lr only True:
repeat previous move
elif ul and lr only True:
if previous move left:
next move down
elif previous move right:
next move up
elif preivous move down:
next move left:
else:
next move right
output previous move
move one
elif ul, ur, ll only Trues:
if previous move left:
next move down
else:
next move right
output previous move, move by one
...similarly for other 3 True combos...
for the second array it will do the following:
finds True val at 0,4
start at 0,4
only lower-right cell is True, so moves right to 0,5 (previous move is None, so no output)
now only lower-left cell is True, so moves down to 1,5 (previous move right 1 is output)
now both left cells are True, so repeat move (moves down to 2,5)
..repeat until hit 4,5..
only upper-left cell is True, so move left (output down 4)
both upper cells are true, repeat move (move left to 3,4)
both upper cells are true, repeat move (move left to 2,4)
upper right cell only true, so move up (output right -3)
..keep going until back at 0,4..
Try visualising all the possible coord neighbouring cell combos and that will give you a visual idea of the possible flows.
Also note that with this method it should be impossible to be traversing a coord which has all 4 neighbours as False.
Related
My goal is to set some labels in 2d array to zero without using a for loop. Is there a faster numpy way to do this without the for loop? The ideal scenario would be temp_arr[labeled_im not in labels] = 0, but it's not really working the way I'd like it to.
labeled_array = np.array([[1,2,3],
[4,5,6],
[7,8,9]])
labels = [2,4,5,6,8]
temp_arr = np.zeros((labeled_array.shape)).astype(int)
for label in labels:
temp_arr[labeled_array == label] = label
>> temp_arr
[[0 2 0]
[4 5 6]
[0 8 0]]
The for loop gets quite slow when there are a lot of iterations to go through, so it is important to improve the execution time with numpy.
You can use define labels as a set and use temp_arr = np.where(np.isin(labeled_array, labels), labeled_array, 0). Although, the difference for such a small array does not seem to be significant.
import numpy as np
import time
labeled_array = np.array([[1,2,3],
[4,5,6],
[7,8,9]])
labels = [2,4,5,6,8]
start = time.time()
temp_arr_0 = np.zeros((labeled_array.shape)).astype(int)
for label in labels:
temp_arr_0[labeled_array == label] = label
end = time.time()
print(f"Loop takes {end - start}")
start = time.time()
temp_arr_1 = np.where(np.isin(labeled_array, labels), labeled_array, 0)
end = time.time()
print(f"np.where takes {end - start}")
labels = {2,4,5,6,8}
start = time.time()
temp_arr_2 = np.where(np.isin(labeled_array, labels), labeled_array, 0)
end = time.time()
print(f"np.where with set takes {end - start}")
outputs
Loop takes 5.3882598876953125e-05
np.where takes 0.00010514259338378906
np.where with set takes 3.314018249511719e-05
In the case the labels are unique in labels (and memory isn't a concern), here's a way to go.
As the very first step, we convert labels to a ndarray
labels = np.array(labels)
Then, we produce two broadcastable arrays from labeled_array and labels
labeled_row = labeled_array.ravel()[np.newaxis, :]
labels_col = labels[:, np.newaxis]
The above code block produces respectively a row array of shape (1,9)
array([[1, 2, 3, 4, 5, 6, 7, 8, 9]])
and a column array of shape (5,1)
array([[2],
[4],
[5],
[6],
[8]])
Now the two shapes are broadcastable (see this page), so we can perform elementwise comparison, e.g.
mask = labeled_row == labels_col
which returns a (5,9)-shaped boolean mask
array([[False, True, False, False, False, False, False, False, False],
[False, False, False, True, False, False, False, False, False],
[False, False, False, False, True, False, False, False, False],
[False, False, False, False, False, True, False, False, False],
[False, False, False, False, False, False, False, True, False]])
In the case the assumption above is fullfilled, you'll have a number of True values per row equal to the number of times the corresponding label appears in your labeled_array. Nonetheless, you can also have all-False rows, e.g. when a label in labels never appears in your labeled_array.
To find out which labels actually appeared in your labeled_array, you can use np.nonzero on the boolean mask
indices = np.nonzero(mask)
which returns a tuple containing the row and column indices of the non-zero (i.e. True) elements
(array([0, 1, 2, 3, 4], dtype=int64), array([1, 3, 4, 5, 7], dtype=int64))
By construction, the first element of the tuple above tells you which labels actually appeared in your labeled_array, e.g.
appeared_labels = labels[indices[0]]
(note that you can have consecutive elements in appeared_labels if that specific label appeared more than once in your labeled_array).
We can now build and fill the output array:
out = np.zeros(labeled_array.size, dtype=int)
out[indices[1]] = labels[indices[0]]
and bring it back to the original shape
out = out.reshape(*labeled_array.shape)
array([[0, 2, 0],
[4, 5, 6],
[0, 8, 0]])
I was learning K-means clustering. And is quite confused about the working of plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 100, c = 'red', label = 'Cluster 1') what is the purpose of X[y_kmeans == 0, 0], X[y_kmeans == 0, 1] in the code?
Full code here
#k-means
#importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#importing the dataset
dataset = pd.read_csv("mall_customers.csv")
X = dataset.iloc[:,[3,4]].values
#using the elbow method to find the optimal number of clusters
from sklearn.cluster import KMeans
wcss = [] #Within-Cluster Sum of Square
for i in range(1,11):
kmeans = KMeans(n_clusters = i, init = 'k-means++',max_iter = 300,n_init=10,random_state = 0)
kmeans.fit(X)
wcss.append(kmeans.inertia_)
plt.plot(range(1,11),wcss)
plt.title("The elbow method")
plt.xlabel("Number of cluster")
plt.ylabel('Wcss')
plt.show()
#applying kmeans to all dataset
kmeans = KMeans(n_clusters = 5,init = 'k-means++', max_iter=300,n_init=10,random_state=0)
y_kmeans = kmeans.fit_predict(X)
#Visualising the cluster
plt.scatter(X[y_kmeans == 0,0],X[y_kmeans == 0,1],s=100,c = 'red' ,label='Cluster1')
plt.scatter(X[y_kmeans == 1,0],X[y_kmeans == 1,1],s=100,c='blue', label='Cluster2')
plt.scatter(X[y_kmeans == 2,0],X[y_kmeans == 2,1],s=100,c='green',label='Cluster3')
plt.scatter(X[y_kmeans == 3,0],X[y_kmeans == 3,1],s=100, c ='cyan',label = 'CLuster4')
plt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
plt.scatter(kmeans.cluster_centers_[:,0],kmeans.cluster_centers_[:,1],s=300, c = 'yellow', label ='Centroids')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
I have added the output image for reference purpose
elbow graph,
Final cluster image
That's a filter. y_kmeans == 0 selects those elements where y_kmeans[i] is equal to 0. X[y_kmeans == 0, 0] selects the elements of X where the corresponding y_kmeans value is 0 and the second dimension is 0.
Originally answered by tim roberts
X[y_hc ==1,0] here 0 means model is in x plain X[y_hc == 0,1] means model is in y-plain.
Where as 1 refers to the value of [i] or the cluster value.
X[y_kmeans == 0, 0] :
It's a filter that works like a slicing strategy (X[start_rows : end_rows , selected column]). it's like you are selecting samples from your dataset X from a given row number to a specific given row number and after selecting these rows, only use column 0. This will work perfectly if our samples were contiguous unfortunately it is not since we want to select rows based on the cluster made by our model which is contained in y.
Explication below:
Remember y contains the result of your clustering model where we have 5 clusters represented as cluster 0, cluster 1 ... cluster 4.
At first y_kmeans == 0 will select the elements where y==0, meaning elements classified as cluster 0 so y==0 return a list of boolean with True for those elements belonging to cluster 0 and false for other elements. The outcome will now be X[[True, False, etc...],0], the first element in the bracket represents the list of boolean mentioned above and the second element ( the 0 ) represents the column (or feature. Example sepal length for the Iris dataset). Also, remember to make a scatter plot we need two values (x and y) in the case of the iris dataset, X can be the Sepal length and Y the Sepal Width.
So the first line
X[y_kmeans == 0,0],X[y_kmeans == 0,1]
will be evaluated to X[[True, False...],0] and X[[True, False],1] the bolded value here represents the column's value in your original dataset. Each Boolean value is mapped to the corresponding row in your dataset, if the value is True, that row is selected and its columns value (corresponding to the bolded part of the bracket) is selected. So you will have something like this:
x[[False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, True, True, False, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, False, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, False, True, False, False, False, False, True, False,
False, False, False, False, False, True, True, False, False,
False, False, True, False, True, False, True, False, False,
True, True, False, False, False, False, False, True, False,
False, False, False, True, False, False, False, True, False,
False, False, True, False, False, True],0]
Note that the number of rows in your dataset or X must be equal to the number of elements in your y.
I want to apply the NOT operation on whole columns/rows of a boolean Numpy array. Is this possible with Numpy?
matrix = np.array([[False for i in range(3)] for j in range(2)])
# Initial
# [False, False, False]
# [False, False, False]
matrix[:,1].not() # Something like this
# After not operation on column 1
# [False, True, False]
# [False, True, False]
This should do the trick, see here
matrix[:, 1] = np.logical_not(matrix[:, 1])
I have two numpy vector arrays, one contains binary values so either 1 or 0 and the other float values so anything in between 0 and 1.
I want to use the numpy.logical_and operator and have it return true if the binary value is in the range of the float plus or minus 0.2. So i.e. a float of 0.1 would return true, 0.4 false.
How would I tackle this?
I think what you want is np.isclose. In this case implementation would be:
bin_arr = np.random.randint(2, size = 100)
float_arr = np.random.rand(100)
out = np.isclose(bin_arr.astype(float), float_arr, atol = .2)
Note that while logical_and is a ufunc (Universal Function) with extended functionality, np.isclose is not.
Does the question require True if (float_arr less than 0.2) AND (bin_arr > 0). Which does need the use of logical and.
Or True if abs(float_arr - bin_arr) <= 0.2 which doesn't. #Daniel F 's use of isclose() is an elegant answer to this.
# Set up some data
np.random.seed(0) # Make it repeatable.
bin_arr = np.random.randint(2, size = 20)
float_arr = np.random.rand(20)
bin_arr, float_arr
# (array([0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1]),
# array([0.79172504, 0.52889492, 0.56804456, 0.92559664, 0.07103606,
# 0.0871293 , 0.0202184 , 0.83261985, 0.77815675, 0.87001215,
# 0.97861834, 0.79915856, 0.46147936, 0.78052918, 0.11827443,
# 0.63992102, 0.14335329, 0.94466892, 0.52184832, 0.41466194]))
True if (float_arr less than 0.2) AND (bin_arr > 0)`.
np.logical_and( float_arr<=0.2, bin_arr)
# array([False, False, False, False, True, True, True, False, False,
# False, False, False, False, False, False, False, False, False,
# False, False])
True if abs(float_arr - bin_arr) <= 0.2
np.abs(float_arr - bin_arr)<=0.2
# array([False, False, False, False, False, False, False, True, False,
# True, True, False, False, False, True, False, True, False,
# False, False])
In Python, how do I create a numpy array of arbitrary shape filled with all True or all False?
The answer:
numpy.full((2, 2), True)
Explanation:
numpy creates arrays of all ones or all zeros very easily:
e.g. numpy.ones((2, 2)) or numpy.zeros((2, 2))
Since True and False are represented in Python as 1 and 0, respectively, we have only to specify this array should be boolean using the optional dtype parameter and we are done:
numpy.ones((2, 2), dtype=bool)
returns:
array([[ True, True],
[ True, True]], dtype=bool)
UPDATE: 30 October 2013
Since numpy version 1.8, we can use full to achieve the same result with syntax that more clearly shows our intent (as fmonegaglia points out):
numpy.full((2, 2), True, dtype=bool)
UPDATE: 16 January 2017
Since at least numpy version 1.12, full automatically casts to the dtype of the second parameter, so we can just write:
numpy.full((2, 2), True)
numpy.full((2,2), True, dtype=bool)
ones and zeros, which create arrays full of ones and zeros respectively, take an optional dtype parameter:
>>> numpy.ones((2, 2), dtype=bool)
array([[ True, True],
[ True, True]], dtype=bool)
>>> numpy.zeros((2, 2), dtype=bool)
array([[False, False],
[False, False]], dtype=bool)
If it doesn't have to be writeable you can create such an array with np.broadcast_to:
>>> import numpy as np
>>> np.broadcast_to(True, (2, 5))
array([[ True, True, True, True, True],
[ True, True, True, True, True]], dtype=bool)
If you need it writable you can also create an empty array and fill it yourself:
>>> arr = np.empty((2, 5), dtype=bool)
>>> arr.fill(1)
>>> arr
array([[ True, True, True, True, True],
[ True, True, True, True, True]], dtype=bool)
These approaches are only alternative suggestions. In general you should stick with np.full, np.zeros or np.ones like the other answers suggest.
benchmark for Michael Currie's answer
import perfplot
bench_x = perfplot.bench(
n_range= range(1, 200),
setup = lambda n: (n, n),
kernels= [
lambda shape: np.ones(shape, dtype= bool),
lambda shape: np.full(shape, True)
],
labels = ['ones', 'full']
)
bench_x.show()
Quickly ran a timeit to see, if there are any differences between the np.full and np.ones version.
Answer: No
import timeit
n_array, n_test = 1000, 10000
setup = f"import numpy as np; n = {n_array};"
print(f"np.ones: {timeit.timeit('np.ones((n, n), dtype=bool)', number=n_test, setup=setup)}s")
print(f"np.full: {timeit.timeit('np.full((n, n), True)', number=n_test, setup=setup)}s")
Result:
np.ones: 0.38416870904620737s
np.full: 0.38430388597771525s
IMPORTANT
Regarding the post about np.empty (and I cannot comment, as my reputation is too low):
DON'T DO THAT. DON'T USE np.empty to initialize an all-True array
As the array is empty, the memory is not written and there is no guarantee, what your values will be, e.g.
>>> print(np.empty((4,4), dtype=bool))
[[ True True True True]
[ True True True True]
[ True True True True]
[ True True False False]]
>>> a = numpy.full((2,4), True, dtype=bool)
>>> a[1][3]
True
>>> a
array([[ True, True, True, True],
[ True, True, True, True]], dtype=bool)
numpy.full(Size, Scalar Value, Type). There is other arguments as well that can be passed, for documentation on that, check https://docs.scipy.org/doc/numpy/reference/generated/numpy.full.html