I have a three-dimensional ndarray of 2D coordinates, for example:
[[[1704 1240]
[1745 1244]
[1972 1290]
[2129 1395]
[1989 1332]]
[[1712 1246]
[1750 1246]
[1964 1286]
[2138 1399]
[1989 1333]]
[[1721 1249]
[1756 1249]
[1955 1283]
[2145 1399]
[1990 1333]]]
The ultimate goal is to remove the point closest to a given point ([1989 1332]) from each "group" of 5 coordinates. My thought was to produce a similarly shaped array of distances, and then use argmin to determine the indices of the values to be removed. However, I am not certain how to apply a function, like one that calculates the distance to a given point, to every element in an ndarray, at least in a NumPythonic way.
List comprehensions are a very inefficient way to deal with numpy arrays. They're an especially poor choice for the distance calculation.
To find the difference between your data and a point, you'd just do data - point. You can then calculate the distance using np.hypot, or if you'd prefer, square it, sum it, and take the square root.
It's a bit easier if you make it an Nx2 array for the purposes of the calculation though.
Basically, you want something like this:
import numpy as np
data = np.array([[[1704, 1240],
[1745, 1244],
[1972, 1290],
[2129, 1395],
[1989, 1332]],
[[1712, 1246],
[1750, 1246],
[1964, 1286],
[2138, 1399],
[1989, 1333]],
[[1721, 1249],
[1756, 1249],
[1955, 1283],
[2145, 1399],
[1990, 1333]]])
point = [1989, 1332]
#-- Calculate distance ------------
# The reshape is to make it a single, Nx2 array to make calling `hypot` easier
dist = data.reshape((-1,2)) - point
dist = np.hypot(*dist.T)
# We can then reshape it back to AxBx1 array, similar to the original shape
dist = dist.reshape(data.shape[0], data.shape[1], 1)
print(dist)
This yields:
array([[[ 299.48121811],
[ 259.38388539],
[ 45.31004304],
[ 153.5219854 ],
[ 0. ]],
[[ 290.04310025],
[ 254.0019685 ],
[ 52.35456045],
[ 163.37074401],
[ 1. ]],
[[ 280.55837182],
[ 247.34186868],
[ 59.6405902 ],
[ 169.77926846],
[ 1.41421356]]])
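If you'd rather avoid the reshape round-trip, the square-it, sum-it, square-root route mentioned above works directly on the 3-D array by reducing over the last axis; a minimal sketch:
diff = data - point
dist = np.sqrt((diff ** 2).sum(axis=-1, keepdims=True))  # shape (3, 5, 1), same values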
Now, removing the closest element is a bit harder than simply getting the closest element.
With numpy, you can use boolean indexing to do this fairly easily.
However, you'll need to worry a bit about the alignment of your axes.
The key is to understand how numpy "broadcasts" operations: arrays are aligned by their trailing axes, so a (3, 5) array and a (3, 1) array broadcast together along the second axis. In this case, we want to broadcast along the middle axis.
Also, -1 can be used as a placeholder for the size of an axis. NumPy will infer the permissible size for that axis from the total number of elements.
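For example:
np.arange(12).reshape(3, -1).shape  # (3, 4) -- numpy infers the 4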
What we'd need to do would look a bit like this:
#-- Remove closest point ---------------------
mask = np.squeeze(dist) != dist.min(axis=1)
filtered = data[mask]
# Once again, let's reshape things back to the original shape...
filtered = filtered.reshape(data.shape[0], -1, data.shape[2])
You could make that a single line, I'm just breaking it down for readability. The key is that dist != something yields a boolean array which you can then use to index the original array.
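As a tiny illustration of that pattern:
a = np.array([3, 1, 4, 1])
a[a != a.min()]  # array([3, 4]) -- note that both tied minima are dropped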
So, putting it all together:
import numpy as np
data = np.array([[[1704, 1240],
[1745, 1244],
[1972, 1290],
[2129, 1395],
[1989, 1332]],
[[1712, 1246],
[1750, 1246],
[1964, 1286],
[2138, 1399],
[1989, 1333]],
[[1721, 1249],
[1756, 1249],
[1955, 1283],
[2145, 1399],
[1990, 1333]]])
point = [1989, 1332]
#-- Calculate distance ------------
# The reshape is to make it a single, Nx2 array to make calling `hypot` easier
dist = data.reshape((-1,2)) - point
dist = np.hypot(*dist.T)
# We can then reshape it back to AxBx1 array, similar to the original shape
dist = dist.reshape(data.shape[0], data.shape[1], 1)
#-- Remove closest point ---------------------
mask = np.squeeze(dist) != dist.min(axis=1)
filtered = data[mask]
# Once again, let's reshape things back to the original shape...
filtered = filtered.reshape(data.shape[0], -1, data.shape[2])
print(filtered)
Yields:
array([[[1704, 1240],
[1745, 1244],
[1972, 1290],
[2129, 1395]],
[[1712, 1246],
[1750, 1246],
[1964, 1286],
[2138, 1399]],
[[1721, 1249],
[1756, 1249],
[1955, 1283],
[2145, 1399]]])
On a side note, if more than one point is equally close, this won't work: the mask will drop every tied point, the groups will no longer all be the same length, and the final reshape will fail (numpy arrays must have the same number of elements along each dimension). You'd need to re-do your grouping in that case.
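If that's a concern, one workaround (an alternative sketch of mine, not part of the answer above) is to build the mask from argmin, which selects exactly one index per group even when there are ties:
closest = np.squeeze(dist).argmin(axis=1)             # one index per group
mask = np.arange(data.shape[1]) != closest[:, None]   # (3, 5) boolean mask
filtered = data[mask].reshape(data.shape[0], -1, data.shape[2])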
If I understand your question correctly, I think you're looking for apply_along_axis. Using numpy's built-in broadcasting, we can simply subtract the point from the array:
>>> a - numpy.array([1989, 1332])
array([[[-285, -92],
[-244, -88],
[ -17, -42],
[ 140, 63],
[ 0, 0]],
[[-277, -86],
[-239, -86],
[ -25, -46],
[ 149, 67],
[ 0, 1]],
[[-268, -83],
[-233, -83],
[ -34, -49],
[ 156, 67],
[ 1, 1]]])
Then we can apply numpy.linalg.norm to it:
>>> dist = a - numpy.array([1989, 1332])
>>> numpy.apply_along_axis(numpy.linalg.norm, 2, dist)
array([[ 299.48121811, 259.38388539, 45.31004304,
153.5219854 , 0. ],
[ 290.04310025, 254.0019685 , 52.35456045,
163.37074401, 1. ],
[ 280.55837182, 247.34186868, 59.6405902 ,
169.77926846, 1.41421356]])
Finally, some boolean mask trickery, along with a couple of reshape calls:
>>> normed = numpy.apply_along_axis(numpy.linalg.norm, 2, dist)
>>> a[normed != normed.min(axis=1).reshape((-1, 1))].reshape((3, 4, 2))
array([[[1704, 1240],
[1745, 1244],
[1972, 1290],
[2129, 1395]],
[[1712, 1246],
[1750, 1246],
[1964, 1286],
[2138, 1399]],
[[1721, 1249],
[1756, 1249],
[1955, 1283],
[2145, 1399]]])
Joe Kington's answer is faster though. Oh well. I'll leave this for posterity.
import numpy as np

def joes(data, point):
    dist = data.reshape((-1, 2)) - point
    dist = np.hypot(*dist.T)
    dist = dist.reshape(data.shape[0], data.shape[1], 1)
    mask = np.squeeze(dist) != dist.min(axis=1)
    return data[mask].reshape((3, 4, 2))

def mine(a, point):
    dist = a - point
    normed = np.apply_along_axis(np.linalg.norm, 2, dist)
    return a[normed != normed.min(axis=1).reshape((-1, 1))].reshape((3, 4, 2))
>>> %timeit mine(data, point)
1000 loops, best of 3: 586 us per loop
>>> %timeit joes(data, point)
10000 loops, best of 3: 48.9 us per loop
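Most of that gap is apply_along_axis looping in Python. For what it's worth, numpy.linalg.norm accepts an axis argument (NumPy 1.8+), which vectorizes the norm and should close most of the difference; a sketch:
def mine_vectorized(a, point):
    normed = np.linalg.norm(a - point, axis=2)
    return a[normed != normed.min(axis=1, keepdims=True)].reshape((3, 4, 2))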
There are multiple ways to do this, but here is one using list comprehensions:
Distance function:
In [35]: from numpy.linalg import norm
In [36]: dist = lambda x,y:norm(x-y)
Input data:
In [39]: GivenMatrix = scipy.rand(3, 5, 2)
In [40]: GivenMatrix
Out[40]:
array([[[ 0.83798666, 0.90294439],
[ 0.8706959 , 0.88397176],
[ 0.91879085, 0.93512921],
[ 0.15989245, 0.57311869],
[ 0.82896003, 0.53589968]],
[[ 0.0207089 , 0.9521768 ],
[ 0.94523963, 0.31079109],
[ 0.41929482, 0.88559614],
[ 0.87885236, 0.45227422],
[ 0.58365369, 0.62095507]],
[[ 0.14757177, 0.86101539],
[ 0.58081214, 0.12632764],
[ 0.89958321, 0.73660852],
[ 0.3408943 , 0.45420989],
[ 0.42656333, 0.42770216]]])
In [41]: q = scipy.rand(2)
In [42]: q
Out[42]: array([ 0.03280889, 0.71057403])
Compute output distances:
In [44]: distances = [[dist(x, q) for x in SubMatrix]
for SubMatrix in GivenMatrix]
In [45]: distances
Out[45]:
[[0.82783910695733931,
0.85564093542511577,
0.91399620574915652,
0.18720096539588818,
0.81508758596405939],
[0.24190557184498068,
0.99617079746515047,
0.42426891258164884,
0.88459501973012633,
0.55808740166908177],
[0.18921712490174292,
0.80103146210692744,
0.86716521557255788,
0.40079819635686459,
0.48482888965287363]]
To rank the results for each submatrix:
In [46]: scipy.argsort(distances)
Out[46]:
array([[3, 4, 0, 1, 2],
[0, 2, 4, 3, 1],
[0, 3, 4, 1, 2]])
As for the deletion, I personally think that's easiest by converting GivenMatrix to a list, then using del:
>>> GivenList = GivenMatrix.tolist()
>>> del GivenList[1][2] # delete third row from the second 5-by-2 submatrix
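Alternatively, numpy.delete does the same thing without leaving NumPy (note it returns a new array rather than modifying in place):
>>> import numpy as np
>>> np.delete(GivenMatrix[1], 2, axis=0)  # drop the third row of the second submatrix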
The situation is I'd like to take the following Python / NumPy code:
# Procure some data:
step = 2  # block size
z = np.zeros((32, 32))
chunks = []
for i in range(0, 32, step):
    for j in range(0, 32, step):
        chunks.append(z[i:i+step, j:j+step])
chunks = np.array(chunks)
chunks.shape  # (256, 2, 2)
And vectorize it / remove the for loops. Is this possible? I don't mind much about ordering of the final array, e.g. 256,2,2 vs 2,2,256, as long as the spatial structure remains the same. That is, blocks of 2x2 from the original array.
Perhaps some magic using :: in addition to regular indexing can do this? Any NumPy masters here?
You may need transpose:
a = np.arange(1024).reshape(32,32)
a.reshape(16,2,16,2).transpose((0,2,1,3)).reshape(-1,2,2)
Output:
array([[[ 0, 1],
[ 32, 33]],
[[ 2, 3],
[ 34, 35]],
[[ 4, 5],
[ 36, 37]],
...,
[[ 986, 987],
[1018, 1019]],
[[ 988, 989],
[1020, 1021]],
[[ 990, 991],
[1022, 1023]]])
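On NumPy 1.20+, numpy.lib.stride_tricks.sliding_window_view offers another route to the same 2x2 blocks (a sketch, same blocking as above):
from numpy.lib.stride_tricks import sliding_window_view

a = np.arange(1024).reshape(32, 32)
# take every 2x2 window, keep only the non-overlapping ones, then flatten the grid
blocks = sliding_window_view(a, (2, 2))[::2, ::2].reshape(-1, 2, 2)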
I have to run the snippet shown below about 200,000 times in a row, and the snippet needs about 0.12585 seconds for 1000 iterations. datapoints has a shape of (3, 2704, 64).
output = []
maximum = 0
for datapoint in datapoints:
    tmp = []
    for data in datapoint:
        maximum = max(data)
        if maximum == 0:
            tmp.append(data)
        else:
            tmp.append(data / maximum)
    output.append(tmp)
I have tried to rewrite it using map(), but this gives me an average of 0.23237 seconds per iteration. This is probably due to the multiple max() and list() calls.
np.asarray(list(map(lambda datapoint: list(map(lambda data: data / max(data) if max(data) > 0 else data, datapoint)), datapoints)))
Is there a possibility to optimize the code again to improve performance?
Well here's a short answer:
def bar(datapoints):
    m = np.amax(datapoints, axis=2)
    m[m == 0] = 1
    return datapoints / m[:, :, np.newaxis]
Here's an explanation of how you might have got there (it's how I did get there!):
Let's start off with some example data:
>>> x = np.array([[[1, 2, 3, 4], [11, -12, 13, -14]], [[26, 27, 28, 29], [0, 0, 0, 0]]])
Now check what you get on your original function:
def foo(datapoints):
    output = []
    maximum = 0
    for datapoint in datapoints:
        tmp = []
        for data in datapoint:
            maximum = max(data)
            if maximum == 0:
                tmp.append(data)
            else:
                tmp.append(data / maximum)
        output.append(tmp)
    return np.array(output)
The result is:
>>> foo(x)
array([[[ 0.25 , 0.5 , 0.75 , 1. ],
[ 0.84615385, -0.92307692, 1. , -1.07692308]],
[[ 0.89655172, 0.93103448, 0.96551724, 1. ],
[ 0. , 0. , 0. , 0. ]]])
Now let's try out amax:
>>> np.amax(x, axis=0)
array([[26, 27, 28, 29],
[11, 0, 13, 0]])
>>> np.amax(x, axis=2)
array([[ 4, 13],
[29, 0]])
Ah ha, looks like axis=2 is what we're after. Now we want to divide the original array by this, but only in the places where the max is non-zero. How do we divide only in some places? The answer is: we divide everywhere, but in some places we divide by 1, so it has no effect. So let's replace the zeros with ones:
>>> m = np.amax(x, axis=2)
>>> m[m == 0] = 1
>>> m
array([[ 4, 13],
[29, 1]])
Finally, let's divide by this, broadcasting back over axis 2 which we took the maximum over earlier:
>>> x / m[:,:,np.newaxis]
array([[[ 0.25 , 0.5 , 0.75 , 1. ],
[ 0.84615385, -0.92307692, 1. , -1.07692308]],
[[ 0.89655172, 0.93103448, 0.96551724, 1. ],
[ 0. , 0. , 0. , 0. ]]])
Putting that all together you get bar() at the top.
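A quick sanity check that the vectorized version matches the original, using the example x from above:
>>> np.allclose(foo(x), bar(x))
True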
Try something like this:
maximum = datapoints.max(axis=2, keepdims=True)
output = np.where(maximum==0, datapoints, datapoints/maximum)
You would see the warning "invalid value encountered in true_divide", but it should work as expected.
Update as #ArthurTacca pointed out:
output = datapoints/np.where(maximum==0, 1, maximum)
will eliminate the warning.
Yes you can definitely speed this up w/ vectorized numpy operations. Here's how I would do it, if I understand what you're trying to do correctly:
import numpy as np
# I use a randomly initialized array here, replace this with your input
arr = np.random.random(size=(3, 2704, 64))
# Find max for 3rd dimension, returns array w/ shape (3, 2704)
max_arr = np.max(arr, axis=2)
# Set up divisor, returns array w/ shape (3, 2704)
divisor = np.where(max_arr == 0, 1, max_arr)
# Use expand_dims to add third dimension, returns array w/ shape (3, 2704, 1)
divisor = np.expand_dims(divisor, axis=2)
# Perform division, shape is (3, 2704, 64)
ans = np.divide(arr, divisor)
From your code, I gather that you intend to scale your data by the max of your 3rd axis, but in the event of it being 0, forgo scaling instead. You seem to also want your output to have the same shape as your input, which explains the way you structured output and tmp. That's why I left the code snippet ending with output in a numpy array, but if you need it in its original form regardless, it's a simple loop to re-arrange your data:
output = []
for i in ans:
    tmp = []
    for j in i:
        tmp.append(list(j))
    output.append(tmp)
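As an aside, ndarray.tolist() builds the same nested-list structure in one call:
output = ans.tolist()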
For future reference, furnish your questions with more detail. It will make it easier for people to participate, and you'll increase the chance of getting your questions answered quickly!
numpy.argsort docs state
Returns:
index_array : ndarray, int
Array of indices that sort a along the specified axis. If a is one-dimensional, a[index_array] yields a sorted a.
How can I apply the result of numpy.argsort for a multidimensional array to get back a sorted array? (NOT just a 1-D or 2-D array; it could be an N-dimensional array where N is known only at runtime)
>>> import numpy as np
>>> np.random.seed(123)
>>> A = np.random.randn(3,2)
>>> A
array([[-1.0856306 , 0.99734545],
[ 0.2829785 , -1.50629471],
[-0.57860025, 1.65143654]])
>>> i=np.argsort(A,axis=-1)
>>> A[i]
array([[[-1.0856306 , 0.99734545],
[ 0.2829785 , -1.50629471]],
[[ 0.2829785 , -1.50629471],
[-1.0856306 , 0.99734545]],
[[-1.0856306 , 0.99734545],
[ 0.2829785 , -1.50629471]]])
For me it's not just a matter of using sort() instead; I have another array B and I want to order B using the results of np.argsort(A) along the appropriate axis. Consider the following example:
>>> A = np.array([[3,2,1],[4,0,6]])
>>> B = np.array([[3,1,4],[1,5,9]])
>>> i = np.argsort(A,axis=-1)
>>> BsortA = ???
# should result in [[4,1,3],[5,1,9]]
# so that corresponding elements of B and sort(A) stay together
It looks like this functionality is already an enhancement request in numpy.
The numpy issue #8708 has a sample implementation of take_along_axis that does what I need; I'm not sure if it's efficient for large arrays but it seems to work.
def take_along_axis(arr, ind, axis):
    """
    ... here means a "pack" of dimensions, possibly empty

    arr: array_like of shape (A..., M, B...)
        source array
    ind: array_like of shape (A..., K..., B...)
        indices to take along each 1d slice of `arr`
    axis: int
        index of the axis with dimension M

    out: array_like of shape (A..., K..., B...)
        out[a..., k..., b...] = arr[a..., ind[a..., k..., b...], b...]
    """
    if axis < 0:
        if axis >= -arr.ndim:
            axis += arr.ndim
        else:
            raise IndexError('axis out of range')
    ind_shape = (1,) * ind.ndim
    ins_ndim = ind.ndim - (arr.ndim - 1)  # inserted dimensions
    dest_dims = list(range(axis)) + [None] + list(range(axis + ins_ndim, ind.ndim))
    # could also call np.ix_ here with some dummy arguments, then throw those results away
    inds = []
    for dim, n in zip(dest_dims, arr.shape):
        if dim is None:
            inds.append(ind)
        else:
            ind_shape_dim = ind_shape[:dim] + (-1,) + ind_shape[dim + 1:]
            inds.append(np.arange(n).reshape(ind_shape_dim))
    return arr[tuple(inds)]
which yields
>>> A = np.array([[3,2,1],[4,0,6]])
>>> B = np.array([[3,1,4],[1,5,9]])
>>> i = A.argsort(axis=-1)
>>> take_along_axis(A,i,axis=-1)
array([[1, 2, 3],
[0, 4, 6]])
>>> take_along_axis(B,i,axis=-1)
array([[4, 1, 3],
[5, 1, 9]])
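Since then, this functionality landed in NumPy itself: as of NumPy 1.15 there is a built-in np.take_along_axis with the same semantics, so on recent versions you can call it directly:
>>> np.take_along_axis(B, i, axis=-1)
array([[4, 1, 3],
       [5, 1, 9]])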
This argsort produces a (3,2) array
In [453]: idx=np.argsort(A,axis=-1)
In [454]: idx
Out[454]:
array([[0, 1],
[1, 0],
[0, 1]], dtype=int32)
As you note, applying this to A to get the equivalent of np.sort(A, axis=-1) isn't obvious. The iterative solution is to sort each row (a 1d case) with:
In [459]: np.array([x[i] for i,x in zip(idx,A)])
Out[459]:
array([[-1.0856306 , 0.99734545],
[-1.50629471, 0.2829785 ],
[-0.57860025, 1.65143654]])
While probably not the fastest, it is probably the clearest solution, and a good starting point for conceptualizing a better solution.
The tuple(inds) from the take solution is:
(array([[0],
[1],
[2]]),
array([[0, 1],
[1, 0],
[0, 1]], dtype=int32))
In [470]: A[_]
Out[470]:
array([[-1.0856306 , 0.99734545],
[-1.50629471, 0.2829785 ],
[-0.57860025, 1.65143654]])
In other words:
In [472]: A[np.arange(3)[:,None], idx]
Out[472]:
array([[-1.0856306 , 0.99734545],
[-1.50629471, 0.2829785 ],
[-0.57860025, 1.65143654]])
The first part is what np.ix_ would construct, but it does not 'like' the 2d idx.
Looks like I explored this topic a couple of years ago
argsort for a multidimensional ndarray
a[np.arange(np.shape(a)[0])[:,np.newaxis], np.argsort(a)]
I tried to explain what is going on. The take function does the same sort of thing, but constructs the indexing tuple for a more general case (dimensions and axis). Generalizing to more dimensions, but still with axis=-1 should be easy.
For the first axis, A[np.argsort(A,axis=0),np.arange(2)] works.
We just need to use advanced indexing to index along all axes with those indices arrays. We can use np.ogrid to create open grids of range arrays along all axes, and then replace only the one for the input axis with the input indices. Finally, index into the data array with those indices for the desired output. Thus, essentially, we would have -
# Inputs : arr, ind, axis
idx = np.ogrid[tuple(map(slice, ind.shape))]
idx[axis] = ind
out = arr[tuple(idx)]
Just to make it functional and do error checks, let's create two functions - one to get those indices and a second one to feed in the data array and simply index with it. The idea with the first function is to get indices that could be re-used to index into any array that supports the necessary number of dimensions and lengths along each axis.
Hence, the implementations would be -
def advindex_allaxes(ind, axis):
    axis = np.core.multiarray.normalize_axis_index(axis, ind.ndim)
    idx = np.ogrid[tuple(map(slice, ind.shape))]
    idx[axis] = ind
    return tuple(idx)

def take_along_axis(arr, ind, axis):
    return arr[advindex_allaxes(ind, axis)]
Sample runs -
In [161]: A = np.array([[3,2,1],[4,0,6]])
In [162]: B = np.array([[3,1,4],[1,5,9]])
In [163]: i = A.argsort(axis=-1)
In [164]: take_along_axis(A,i,axis=-1)
Out[164]:
array([[1, 2, 3],
[0, 4, 6]])
In [165]: take_along_axis(B,i,axis=-1)
Out[165]:
array([[4, 1, 3],
[5, 1, 9]])
Relevant one.
I have a 2d ndarray called weights of shape (npts, nweights). For every column of weights, I wish to randomly shuffle the rows. I want to repeat this process num_shuffles times, and store the collection of shufflings into a 3d ndarray called weights_matrix. Importantly, for each shuffling iteration, the shuffling indices of each column of weights should be the same.
Below appears an explicit naive double-for-loop implementation of this algorithm. Is it possible to avoid the python loops and generate weights_matrix in pure Numpy?
import numpy as np
npts, nweights = 5, 2
weights = np.random.rand(npts*nweights).reshape((npts, nweights))
num_shuffles = 3
weights_matrix = np.zeros((num_shuffles, npts, nweights))
for i in range(num_shuffles):
    indx = np.random.choice(np.arange(npts), npts, replace=False)
    for j in range(nweights):
        weights_matrix[i, :, j] = weights[indx, j]
You can start by filling your 3-D array with copies of the original weights, then perform a simple iteration over slices of that 3-D array, using numpy.random.shuffle to shuffle each 2-D slice in-place.
For every column of weights, I wish to randomly shuffle the rows...the shuffling indices of each column of weights should be the same
is just another way of saying "I want to randomly reorder the rows of a 2D array". numpy.random.shuffle is a numpy-array-capable version of random.shuffle: it will reorder the elements of a container in-place. And that's all you need, since the "elements" of a 2-D numpy array, in that sense, are its rows.
import numpy
weights = numpy.array( [ [ 1, 2, 3 ], [ 4, 5, 6], [ 7, 8, 9 ] ] )
weights_3d = weights[ numpy.newaxis, :, : ].repeat( 10, axis=0 )
for w in weights_3d:
    numpy.random.shuffle( w )  # in-place shuffle of the rows of each slice
print( weights_3d[0, :, :] )
print( weights_3d[1, :, :] )
print( weights_3d[2, :, :] )
Here's a vectorized solution with the idea being borrowed from this post -
weights[np.random.rand(num_shuffles,weights.shape[0]).argsort(1)]
Sample run -
In [28]: weights
Out[28]:
array([[ 0.22508764, 0.8527072 ],
[ 0.31504052, 0.73272155],
[ 0.73370203, 0.54889059],
[ 0.87470619, 0.12394942],
[ 0.20587307, 0.11385946]])
In [29]: num_shuffles = 3
In [30]: weights[np.random.rand(num_shuffles,weights.shape[0]).argsort(1)]
Out[30]:
array([[[ 0.87470619, 0.12394942],
[ 0.20587307, 0.11385946],
[ 0.22508764, 0.8527072 ],
[ 0.31504052, 0.73272155],
[ 0.73370203, 0.54889059]],
[[ 0.87470619, 0.12394942],
[ 0.22508764, 0.8527072 ],
[ 0.73370203, 0.54889059],
[ 0.20587307, 0.11385946],
[ 0.31504052, 0.73272155]],
[[ 0.73370203, 0.54889059],
[ 0.31504052, 0.73272155],
[ 0.22508764, 0.8527072 ],
[ 0.20587307, 0.11385946],
[ 0.87470619, 0.12394942]]])
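The trick works because the argsort of a row of i.i.d. uniform draws is a uniformly random permutation of 0..npts-1, and indexing weights with a (num_shuffles, npts) integer array selects whole rows, giving the (num_shuffles, npts, nweights) result directly. A seeded sketch of the same idea using the newer Generator API (the seed is my own choice, for reproducibility):
rng = np.random.default_rng(42)
perms = rng.random((num_shuffles, weights.shape[0])).argsort(axis=1)
weights_matrix = weights[perms]  # shape: (num_shuffles, npts, nweights)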
from numpy import genfromtxt, linalg, array, append, hstack, vstack
#Euclidean distance function
def euclidean(v1, v2):
    dist = linalg.norm(v1 - v2)
    return dist

#get the .csv files and eliminate heading and unused columns from test
BMUs = genfromtxt('BMU3.csv', delimiter=',')
data = genfromtxt('test.csv', delimiter=',')
data = data[1:, :-2]

i = 0
for obj in data:
    D = 0
    for BMU in BMUs:
        Dist = append(euclidean(obj, BMU[: -2]), BMU[-2:])
        D = hstack(Dist)
    Map = vstack(D)
    #iteration counter
    i += 1
    if not i % 1000:
        print(i, ' of ', len(data))
print(Map)
What I would like to do is:
Take an object from data
Calculate its distance from each BMU (euclidean(obj, BMU[: -2]))
Append to the distance the last two items of the BMU array
create a 2d matrix that contains all the distances plus the last two items of all the BMU from a data object (D = hstack(Dist))
create an array of those matrices with length equal to the number of objects in data. (Map = vstack(D))
The problem here, or at least what I think is the problem, is that hstack and vstack want a tuple of arrays as input, not a single array. It's as if I'm trying to use them the way I use list.append() for lists; sadly, I'm a beginner and have no idea how to do it differently.
Any help would be awesome, thank you in advance :)
First a usage note:
Instead of:
from numpy import genfromtxt, linalg, array, append, hstack, vstack
use
import numpy as np
....
data = np.genfromtxt(....)
....
np.hstack...
Secondly, stay away from np.append. It's too easy to misuse. Use np.concatenate so you get the full flavor of what it is doing.
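A quick illustration of the difference (my own example): without an axis argument, np.append flattens both inputs, which is rarely what you want:
a = np.array([[1, 2], [3, 4]])
np.append(a, [[5, 6]])         # array([1, 2, 3, 4, 5, 6]) -- flattened!
np.concatenate([a, [[5, 6]]])  # array([[1, 2], [3, 4], [5, 6]]) -- stays 2-D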
list append is better for incremental work
alist = []
for ....
    alist.append(....)
arr = np.array(alist)
==================
Without sample arrays (or at least shapes) I'm guessing. But (n,2) arrays sound reasonable. Taking the distance of each pair of 'points' from each other, I can collect the values in a nested list comprehension:
In [121]: data = np.arange(6).reshape(3,2)
In [122]: [[euclidean(d,b) for b in data] for d in data]
Out[122]:
[[0.0, 2.8284271247461903, 5.6568542494923806],
[2.8284271247461903, 0.0, 2.8284271247461903],
[5.6568542494923806, 2.8284271247461903, 0.0]]
and make that an array:
In [123]: np.array([[euclidean(d,b) for b in data] for d in data])
Out[123]:
array([[ 0. , 2.82842712, 5.65685425],
[ 2.82842712, 0. , 2.82842712],
[ 5.65685425, 2.82842712, 0. ]])
The equivalent with nested loops:
alist = []
for d in data:
    sublist = []
    for b in data:
        sublist.append(euclidean(d, b))
    alist.append(sublist)
arr = np.array(alist)
There are ways of doing this without loops, but let's make sure the basic Python looping approach works first.
===============
If I want the difference (along the last axis) between every element (row) in data and every element in bmu (or here data), I can use array broadcasting. The result is a (3,3,2) array:
In [130]: data[None,:,:]-data[:,None,:]
Out[130]:
array([[[ 0, 0],
[ 2, 2],
[ 4, 4]],
[[-2, -2],
[ 0, 0],
[ 2, 2]],
[[-4, -4],
[-2, -2],
[ 0, 0]]])
norm can handle higher-dimensional arrays and takes an axis parameter.
In [132]: np.linalg.norm(data[None,:,:]-data[:,None,:],axis=-1)
Out[132]:
array([[ 0. , 2.82842712, 5.65685425],
[ 2.82842712, 0. , 2.82842712],
[ 5.65685425, 2.82842712, 0. ]])
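As another option, if SciPy is available, scipy.spatial.distance.cdist computes the same pairwise distance matrix directly:
from scipy.spatial.distance import cdist

# (3, 3) matrix of euclidean distances between all pairs of rows
dists = cdist(data, data)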
Thanks to your help, I managed to implement the pseudo-code; here is the final program:
import numpy as np

def euclidean(v1, v2):
    dist = np.linalg.norm(v1 - v2)
    return dist

def makeKNN(dataSet, BMUSet, k, fileOut, test=False):
    # take input files
    BMUs = np.genfromtxt(BMUSet, delimiter=',')
    data = np.genfromtxt(dataSet, delimiter=',')
    final = data[1:, :]
    if test == False:
        data = data[1:, :]
    else:
        data = data[1:, :-2]

    # Calculate all the distances between data and BMUs, then reorder the BMUs by distance
    dist = np.array([[euclidean(d, b[:-2]) for b in BMUs] for d in data])
    BMU_K = np.array([BMUs[np.argsort(d)] for d in dist])

    # mean (sum / k) over the closest k BMUs
    Z = np.array([[np.sum(b[:k].T[5]) / k] for b in BMU_K])

    # error propagation
    Z_err = np.array([[np.sqrt(np.sum(np.power(b[:k].T[5], 2)))] for b in BMU_K])

    # Adding z estimates and errors to the data
    final = np.concatenate((final, Z, Z_err), axis=1)

    # print output file
    np.savetxt(fileOut, final, delimiter=',')
    print('So long, and thanks for all the fish')
Thank you very much and I hope that this code will help someone else in the future :)