One hot encoding a numpy array - python

I'm working on an image classification problem where I got the train labels as a 1-D numpy array, like [1,2,3,2,2,2,4,4,3,1]. I used
train_y = []
for label in train_label:
    if label == 0:
        train_y.append([1,0,0,0])
    elif label == 1:
        train_y.append([0,1,0,0])
    elif label == 2:
        train_y.append([0,0,1,0])
    elif label == 3:
        train_y.append([0,0,0,1])
I also need the length of each one-hot vector to equal the number of distinct labels, i.e. len(set(train_labels)).
The approach above is not a good method, though. Please recommend a better one.

It's always a good habit to use numpy for arrays. np.unique() determines the labels you have in train_labels. ix is an array of indices: np.nonzero() gives the indices of train_labels where train_labels == unique_tl[iy].
import numpy as np
train_labels = np.array([2,5,8,2,5,8])
unique_tl = np.unique(train_labels)
NL = len(train_labels) # how many data , 6
nl = len(unique_tl) # how many labels, 3
target = np.zeros((NL,nl),dtype=int)
for iy in range(nl):
    ix = np.nonzero(train_labels == unique_tl[iy])
    target[ix,iy] = 1
gives
target
array([[1, 0, 0],
[0, 1, 0],
[0, 0, 1],
[1, 0, 0],
[0, 1, 0],
[0, 0, 1]])
I'll think about a possibility to eliminate the for-loop.
If [2,5,8] is meant as part of [0,1,2,3,4,5,6,7,8], then you can use this answer
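One possible way to drop the for-loop, as a minimal sketch rather than the answerer's own follow-up: np.unique(..., return_inverse=True) returns, for every label, its position among the sorted distinct labels, and indexing an identity matrix with those positions yields the one-hot rows. The names classes and inverse are chosen here for illustration.

```python
import numpy as np

train_labels = np.array([2, 5, 8, 2, 5, 8])

# classes: sorted distinct labels
# inverse: for each entry of train_labels, its index into classes
classes, inverse = np.unique(train_labels, return_inverse=True)

# Row k of the identity matrix is the one-hot vector for class k,
# so fancy-indexing with inverse picks one row per sample.
target = np.eye(len(classes), dtype=int)[inverse]
print(target)
```

Each row has length len(classes), which matches the requirement that the one-hot length equal the number of distinct labels.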

Make a vector of zeros and set only one value to 1:
target = np.zeros(num_classes)
target[label] = 1
train_y.append(target)
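The same idea can be applied to all labels at once. This is a sketch that assumes the labels are already integers in the range 0..num_classes-1 and that num_classes is known in advance; the example labels are made up for illustration.

```python
import numpy as np

labels = np.array([1, 2, 3, 2, 0])  # hypothetical labels in 0..num_classes-1
num_classes = 4                     # assumed to be known in advance

# One row of zeros per sample, then set a single 1 per row:
# row i gets its 1 in column labels[i].
train_y = np.zeros((len(labels), num_classes), dtype=int)
train_y[np.arange(len(labels)), labels] = 1
print(train_y)
```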

Related

python function to count nonzero patches in array

For a given array (1 or 2-dimensional) I would like to know, how many "patches" there are of nonzero elements. For example, in the array [0, 0, 1, 1, 0, 1, 0, 0] there are two patches.
I came up with a function for the 1-dimensional case, where I first assume the maximal number of patches and then decrease that number if a neighbor of a nonzero element is nonzero, too.
def count_patches_1D(array):
    patches = np.count_nonzero(array)
    for i in np.nonzero(array)[0][:-1]:
        if (array[i+1] != 0):
            patches -= 1
    return patches
I'm not sure if that method works for two dimensions as well. I haven't come up with a function for that case and I need some help for that.
Edit for clarification:
I would like to count connected patches in the 2-dimensional case, including diagonals. So an array [[1, 0], [1, 1]] would have one patch as well as [[1, 0], [0, 1]].
Also, I am wondering if there is a built-in Python function for this.
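For the 1-dimensional case, the count can also be written without a Python loop. This is a sketch of an alternative formulation, not the asker's code: a patch starts wherever a nonzero element follows a zero (or the start of the array), so counting those transitions counts the patches. The function name is hypothetical.

```python
import numpy as np

def count_patches_1d_vectorized(array):
    """Count runs of consecutive nonzero elements in a 1-D array."""
    mask = np.asarray(array) != 0
    # Prepend False so a patch that begins at index 0 is also detected.
    padded = np.concatenate(([False], mask))
    # A patch starts where the mask goes from False to True.
    starts = padded[1:] & ~padded[:-1]
    return int(np.count_nonzero(starts))

print(count_patches_1d_vectorized([0, 0, 1, 1, 0, 1, 0, 0]))  # 2
```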
The following should work:
import numpy as np
import copy
# create an array
A = np.array(
    [
        [0, 1, 1, 1, 0, 1],
        [0, 0, 1, 0, 0, 0],
        [1, 0, 0, 1, 0, 1],
        [1, 0, 0, 0, 0, 1],
        [0, 0, 1, 0, 0, 1]
    ]
)
def isadjacent(pos, newpos):
    """
    Check whether two coordinates are adjacent
    """
    # check for adjacent columns and rows (diagonals included)
    return np.all(np.abs(np.array(newpos) - np.array(pos)) < 2)
def count_patches(A):
    """
    Count the number of non-zero patches in an array.
    """
    # get non-zero coordinates
    coords = np.nonzero(A)
    # add them to a list
    inipatches = list(zip(*coords))
    # list to contain all patches
    allpatches = []
    while len(inipatches) > 0:
        patch = [inipatches.pop(0)]
        i = 0
        # check for all points adjacent to the points within the current patch
        while True:
            curpatch = patch[i]
            remaining = copy.deepcopy(inipatches)
            for j in range(len(remaining)):
                if isadjacent(curpatch, remaining[j]):
                    patch.append(remaining[j])
                    inipatches.remove(remaining[j])
                    if len(inipatches) == 0:
                        break
            i += 1
            if len(inipatches) == 0 or i >= len(patch):
                # no points remaining, or every point in the patch has been checked
                break
        allpatches.append(patch)
    return len(allpatches)
print(f"Number of patches is {count_patches(A)}")
Number of patches is 5
This should work for arrays with any number of dimensions.
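On the question of a built-in function: the Python standard library has none, but if SciPy is available, scipy.ndimage.label does connected-component labelling. The sketch below assumes SciPy is installed and passes a full-connectivity structuring element so that diagonal neighbours count as connected; without it, label uses only direct (non-diagonal) neighbours.

```python
import numpy as np
from scipy import ndimage

A = np.array(
    [
        [0, 1, 1, 1, 0, 1],
        [0, 0, 1, 0, 0, 0],
        [1, 0, 0, 1, 0, 1],
        [1, 0, 0, 0, 0, 1],
        [0, 0, 1, 0, 0, 1],
    ]
)

# Full connectivity: every neighbour, diagonals included, in any dimension.
structure = ndimage.generate_binary_structure(A.ndim, A.ndim)

# labeled has the same shape as A, with each patch numbered 1..num_patches.
labeled, num_patches = ndimage.label(A, structure=structure)
print(f"Number of patches is {num_patches}")
```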

I was trying to use matrices without libraries but I can't set the values correctly

def create_matrix(xy):
    matrix = []
    matrix_y = []
    x = xy[0]
    y = xy[1]
    for z in range(y):
        matrix_y.append(0)
    for n in range(x):
        matrix.append(matrix_y)
    return matrix
def set_matrix(matrix,xy,set):
    x = xy[0]
    y = xy[1]
    matrix[x][y] = set
    return matrix
index = [4,5]
index_2 = [3,4]
z = create_matrix(index)
z = set_matrix(z,index_2, 12)
print(z)
output:
[[0, 0, 0, 0, 12], [0, 0, 0, 0, 12], [0, 0, 0, 0, 12], [0, 0, 0, 0, 12]]
This code should change only the last element of the last row.
In your for n in range(x): loop you are appending the same matrix_y list every time. Python does not copy the list when you append it; it stores a reference to it. So matrix ends up holding x references to one and the same row.
Move the matrix_y = [] setup (together with the loop that fills it) inside the n loop and you get a separate list for each row.
Comment: Python does not have an explicit pointer concept, but it does use references under the hood. It hides from you when it copies data and when it only copies a reference to that data. That's kind of bad language design, and it tripped you up here. So now you know that references exist, and that most of the time when you "assign a list" you only create another reference to the same list.
Another comment: if you are going to be doing anything serious with matrices, you should really look into numpy. It will be many times faster for numerical computations.
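The shared-row behaviour described above can be checked directly with the is operator. This is a small sketch; create_matrix here is a rewritten, hypothetical variant that builds a fresh list for every row with a list comprehension.

```python
def create_matrix(shape):
    """Return a rows x cols matrix of zeros with independent rows."""
    rows, cols = shape
    # The inner [0] * cols is evaluated once per row, so each row is a new list.
    return [[0] * cols for _ in range(rows)]

# Reproduce the problem: the same list object appended four times.
shared_row = [0] * 5
aliased = [shared_row for _ in range(4)]
print(aliased[0] is aliased[3])  # True: every row is the same object

# Independent rows: changing one cell leaves the other rows alone.
z = create_matrix([4, 5])
print(z[0] is z[3])              # False
z[3][4] = 12
print(z)
```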
You don't need the first loop in create_matrix, so comment it out:
#for z in range(y):
#    matrix_y.append(0)
Change the second loop like this; [0] * y creates a new list of y zeros on each iteration:
for n in range(x):
    matrix.append([0] * y)
Result (only the last cell of the matrix was changed):
z = set_matrix(z,index_2, 12)
print(z)
# [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 12]]

How do I use numpy vectorize to iterate through a two-dimensional vector?

I am trying to use numpy.vectorize to iterate over a (2x5) matrix which contains two vectors representing the x- and y-values of coordinates. The coordinates (x- and y-value) are to be fed to a function returning a (1x1) vector for each iteration. So that in the end, the result should be a (1x5) vector. My problem is that instead of iterating through each element I want the algorithm to iterate through both vectors simultaneously, so it picks up the x- and y-values of the coordinates in parallel to feed it to the function.
data = np.transpose(np.array([[1, 2], [1, 3], [2, 1], [1, -1], [2, -1]]))
th_ = np.array([[1, 1]])
th0_ = -2
def positive(x, th = th_, th0 = th0_):
    if signed_dist(x, th, th0)[0][0] > 0:
        return np.array([[1]])
    elif signed_dist(x, th, th0)[0][0] == 0:
        return np.array([[0]])
    else:
        return np.array([[-1]])
positive_numpy = np.vectorize(positive)
results = positive_numpy(data)
Reading the numpy documentation did not really help, and I want to avoid large workarounds to keep the computation time down. Thankful for any suggestion!
This is a bit of a guess, but looks like your code can be simplified to
data = np.array([[1, 2], [1, 3], [2, 1], [1, -1], [2, -1]]) # (5,2) array
th_ = np.array([[1, 1]])
th0_ = -2
alist = [signed_dist(x, th_, th0_) for x in data]
arr = np.array(alist) # (5,?,?) array
arr = arr[:,0,0] # (5,) array
arr[arr>0] = 1
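If signed_dist can operate on all columns at once, np.sign gives the 1 / 0 / -1 result without np.vectorize or a list comprehension. The question does not show signed_dist, so the definition below is an assumption made for illustration only (th as a (1, d) row vector, x as a (d, n) array of column vectors).

```python
import numpy as np

def signed_dist(x, th, th0):
    # Hypothetical stand-in for the function from the question:
    # signed distance of each column of x from the hyperplane th, th0.
    # th has shape (1, d), x has shape (d, n); the result has shape (1, n).
    return (th @ x + th0) / np.linalg.norm(th)

data = np.transpose(np.array([[1, 2], [1, 3], [2, 1], [1, -1], [2, -1]]))  # (2, 5)
th_ = np.array([[1, 1]])
th0_ = -2

# np.sign maps positive values to 1, zero to 0 and negative values to -1.
results = np.sign(signed_dist(data, th_, th0_)).astype(int)  # (1, 5)
print(results)
```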

What does X_set[y_set == j, 0] mean?

Recently, I have been following a tutorial where I came across the following code:
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
Here, y_set is a vector of binary values (0 and 1) and X_set is an array with two columns.
I am specifically not understanding how to interpret the following line of code
X_set[y_set == j, 0], X_set[y_set == j, 1]
There are a few things going on here. For now, I will drop the loop, but we know that j takes its values from y_set, so it will be either zero or one. First make the two arrays:
import numpy as np
X_set = np.arange(20).reshape(10, 2)
y_set = np.array([0, 1, 1, 1, 0, 0, 1, 1, 0, 1])
From the above, this code is basically doing:
plt.scatter(filtered_values_in_first_column_of X_set,
filtered_values_in_second_column_of X_set)
y_set is providing the filter. We can get there by building up the steps:
print("Where y_set == 0: Boolean mask.")
print(y_set == 0)
print()
print("All rows of X_set indexed by the Boolean mask")
print(X_set[y_set == 0])
print()
print("2D indexing to get only the first column of the above")
print(X_set[y_set == 0, 0])
print()
You can see more on the numpy indexing here. Once you break the steps down, it's not too complicated but I think it was an unnecessarily complex way of achieving this task.
The for loop is so that they could repeat the plot with two different colours depending on whether the values are filtered by y_set being equal to 0 or 1.
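As a compact check of the steps above, the sketch below uses the same example arrays and shows that X_set[mask, 0] is just the first column of the rows selected by the mask. The names mask, first_col and second_col are introduced here for illustration.

```python
import numpy as np

X_set = np.arange(20).reshape(10, 2)
y_set = np.array([0, 1, 1, 1, 0, 0, 1, 1, 0, 1])

mask = y_set == 0          # Boolean array, True at rows 0, 4, 5 and 8
first_col = X_set[mask, 0]   # first column of the selected rows
second_col = X_set[mask, 1]  # second column of the selected rows

print(first_col)   # [ 0  8 10 16]
print(second_col)  # [ 1  9 11 17]

# Same result in two steps: select rows first, then take the column.
print(np.array_equal(first_col, X_set[mask][:, 0]))  # True
```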

Removing duplicates (within a given tolerance) from a Numpy array of vectors

I have an Nx5 array containing N vectors of form 'id', 'x', 'y', 'z' and 'energy'. I need to remove duplicate points (i.e. where x, y, z all match) within a tolerance of say 0.1. Ideally I could create a function where I pass in the array, columns that need to match and a tolerance on the match.
Following this thread on Scipy-user, I can remove duplicates based on a full array using record arrays, but I need to just match part of an array. Moreover this will not match within a certain tolerance.
I could laboriously iterate through with a for loop in Python but is there a better Numponic way?
You might look at scipy.spatial.KDTree.
How big is N?
Added: oops, tree.query_pairs is not in scipy 0.7.1.
When in doubt, use brute force: split the space (here side^3) into little cells,
one point per cell:
""" scatter points to little cells, 1 per cell """
import sys
import numpy as np
side = 100
npercell = 1  # 1: ~ 1/e empty
exec("\n".join(sys.argv[1:]))  # side= ...
N = side**3 * npercell
print("side: %d npercell: %d N: %d" % (side, npercell, N))
np.random.seed(1)
points = np.random.uniform(0, side, size=(N, 3))
cells = np.zeros((side, side, side), dtype=np.uint)
id = 1
# truncate each point to its integer cell; a later point overwrites an earlier one
for p in points.astype(int):
    cells[tuple(p)] = id
    id += 1
cells = cells.flatten()
# A C, an E-flat, and a G walk into a bar.
# The bartender says, "Sorry, but we don't serve minors."
nz = np.nonzero(cells)[0]
print("%d cells have points" % len(nz))
print("first few ids:", cells[nz][:10])
I have finally got a solution that I am happy with. This is a slightly cleaned-up cut and paste from my own code, so there may still be some bugs.
Note that it still uses a 'for' loop. I could use Denis's KDTree idea above, coupled with the rounding, to get the full solution.
import numpy as np
def remove_duplicates(data, dp_tol=None, cols=None, sort_by=None):
    '''
    Removes duplicate vectors from a list of data points
    Parameters:
        data        An MxN array of N vectors of dimension M
        cols        An iterable of the columns that must match
                    in order to constitute a duplicate
                    (default: [1,2,3] for typical Klist data array)
        dp_tol      An iterable of three tolerances or a single
                    tolerance for all dimensions. Uses this to round
                    the values to specified number of decimal places
                    before performing the removal.
                    (default: None)
        sort_by     An iterable of columns to sort by (default: [0])
    Returns:
        MxI Array   An array of I vectors (minus the
                    duplicates)
    EXAMPLES:
    Remove a duplicate
    >>> import wien2k.utils
    >>> import numpy as np
    >>> vecs1 = np.array([[1, 0, 0, 0],
    ...                   [2, 0, 0, 0],
    ...                   [3, 0, 0, 1]])
    >>> remove_duplicates(vecs1)
    array([[1, 0, 0, 0],
           [3, 0, 0, 1]])
    Remove duplicates with a tolerance
    >>> vecs2 = np.array([[1, 0, 0, 0    ],
    ...                   [2, 0, 0, 0.001],
    ...                   [3, 0, 0, 0.02 ],
    ...                   [4, 0, 0, 1    ]])
    >>> remove_duplicates(vecs2, dp_tol=2)
    array([[ 1.  ,  0.  ,  0.  ,  0.  ],
           [ 3.  ,  0.  ,  0.  ,  0.02],
           [ 4.  ,  0.  ,  0.  ,  1.  ]])
    Remove duplicates and sort by k values
    >>> vecs3 = np.array([[1, 0, 0, 0],
    ...                   [2, 0, 0, 2],
    ...                   [3, 0, 0, 0],
    ...                   [4, 0, 0, 1]])
    >>> remove_duplicates(vecs3, sort_by=[3])
    array([[1, 0, 0, 0],
           [4, 0, 0, 1],
           [2, 0, 0, 2]])
    Change the columns that constitute a duplicate
    >>> vecs4 = np.array([[1, 0, 0, 0],
    ...                   [2, 0, 0, 2],
    ...                   [1, 0, 0, 0],
    ...                   [4, 0, 0, 1]])
    >>> remove_duplicates(vecs4, cols=[0])
    array([[1, 0, 0, 0],
           [2, 0, 0, 2],
           [4, 0, 0, 1]])
    '''
    # Deal with the parameters
    if sort_by is None:
        sort_by = [0]
    if cols is None:
        cols = [1,2,3]
    if dp_tol is not None:
        # test to see if already an iterable
        try:
            null = iter(dp_tol)
            tols = np.array(dp_tol)
        except TypeError:
            tols = np.ones_like(cols) * dp_tol
        # Convert to numbers of decimal places
        # Find the 'order' of the axes
    else:
        tols = None
    rnd_data = data.copy()
    # set the tolerances
    if tols is not None:
        for col,tol in zip(cols, tols):
            rnd_data[:,col] = np.around(rnd_data[:,col], decimals=tol)
    # TODO: For now, use a slow Python 'for' loop, try to find a more
    # numponic way later - see: http://stackoverflow.com/questions/2433882/
    sorted_indexes = np.lexsort(tuple([rnd_data[:,col] for col in cols]))
    rnd_data = rnd_data[sorted_indexes]
    unique_kpts = []
    for i in range(len(rnd_data)):
        if i == 0:
            unique_kpts.append(i)
        else:
            # after sorting, duplicates are adjacent: compare with the previous row
            if (rnd_data[i, cols] == rnd_data[i-1, cols]).all():
                continue
            else:
                unique_kpts.append(i)
    rnd_data = rnd_data[unique_kpts]
    # Now sort
    sorted_indexes = np.lexsort(tuple([rnd_data[:,col] for col in sort_by]))
    rnd_data = rnd_data[sorted_indexes]
    return rnd_data
if __name__ == '__main__':
    import doctest
    doctest.testmod()
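The Python 'for' loop in the function above can probably be avoided with np.unique(..., axis=0). The following is a sketch of that idea, not a drop-in replacement: it snaps the selected columns to a grid of spacing tol, so two points closer than tol that fall on opposite sides of a grid boundary are still treated as distinct. The function name and the tol parameter are assumptions made for this example.

```python
import numpy as np

def remove_duplicates_rounded(data, cols=(1, 2, 3), tol=0.1):
    """Keep the first row of each group whose selected columns
    fall into the same grid cell of width tol."""
    # Integer grid coordinates of the selected columns.
    keys = np.round(data[:, list(cols)] / tol).astype(np.int64)
    # Index of the first occurrence of every distinct key row.
    _, first = np.unique(keys, axis=0, return_index=True)
    # Sort the indices so the surviving rows keep their original order.
    return data[np.sort(first)]

data = np.array([[1, 0.00, 0.0, 0.00, 5.0],
                 [2, 0.01, 0.0, 0.02, 6.0],   # within tol of the first row
                 [3, 1.00, 1.0, 1.00, 7.0]])
print(remove_duplicates_rounded(data))
```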
I have not tested this, but if you sort your array along x, then y, then z, this should get you the list of duplicates. You then need to choose which ones to keep.
def find_dup_xyz(sorted_array, x, y, z, tol=0.1):
    # for example, for data = array([id, x, y, z, energy]): x=1, y=2, z=3
    dup_xyz = []
    for i, row in enumerate(sorted_array):
        nx = 1
        # walk forward while the following rows match within the tolerance
        while (i + nx < len(sorted_array)
               and abs(row[x] - sorted_array[i + nx][x]) < tol
               and abs(row[y] - sorted_array[i + nx][y]) < tol
               and abs(row[z] - sorted_array[i + nx][z]) < tol):
            nx += 1
            dup_xyz.append(row)
    return dup_xyz
Also just found this
http://mail.scipy.org/pipermail/scipy-user/2008-April/016504.html
