Python construct a matrix iterating over arrays - python

from numpy import genfromtxt, linalg, array, append, hstack, vstack
#Euclidean distance function
def euclidean(v1, v2):
    dist = linalg.norm(v1 - v2)
    return dist
#get the .csv files and eliminate heading and unused columns from test
BMUs = genfromtxt('BMU3.csv', delimiter=',')
data = genfromtxt('test.csv', delimiter=',')
data = data[1:, :-2]
i = 0
for obj in data:
    D = 0
    for BMU in BMUs:
        Dist = append(euclidean(obj, BMU[: -2]), BMU[-2:])
        D = hstack(Dist)
    Map = vstack(D)
    #iteration counter
    i += 1
    if not i % 1000:
        print (i, ' of ', len(data))
print (Map)
What I would like to do is:
Take an object from data
Calculate its distance from each BMU (euclidean(obj, BMU[:-2]))
Append to the distance the last two items of the BMU array
Create a 2d matrix that contains all the distances plus the last two items of all the BMUs for one data object (D = hstack(Dist))
Create an array of those matrices with length equal to the number of objects in data (Map = vstack(D))
The problem here, or at least what I think is the problem, is that hstack and vstack expect a tuple of arrays as input, not a single array. I'm trying to use them the way I would use list.append() for lists; sadly, I'm a beginner and have no idea how to do it differently.
Any help would be awesome, thank you in advance :)

First a usage note:
Instead of:
from numpy import genfromtxt, linalg, array, append, hstack, vstack
use
import numpy as np
....
data = np.genfromtxt(....)
....
np.hstack...
Secondly, stay away from np.append. It is too easy to misuse. Use np.concatenate so you get the full flavor of what it is doing.
List append is better for incremental work:
alist = []
for ....
    alist.append(....)
arr = np.array(alist)
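To make the np.append warning concrete, here is a minimal sketch with made-up values (the arrays a and row are illustrative, not taken from the question). It shows np.append flattening its inputs when no axis is given, next to np.concatenate and the list-then-array pattern:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
row = np.array([5, 6])

# Without an axis, np.append flattens both inputs first
flat = np.append(a, row)                              # shape (6,)

# np.concatenate makes the required shapes explicit
stacked = np.concatenate([a, row[None, :]], axis=0)   # shape (3, 2)

# Incremental work: grow a Python list, convert once at the end
rows = [[1, 2], [3, 4]]
rows.append([5, 6])
arr = np.array(rows)                                  # shape (3, 2)

print(flat.shape, stacked.shape, arr.shape)
```

Growing a list and converting once also avoids copying the whole array on every iteration, which is what repeated np.append or np.concatenate calls do.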
==================
Without sample arrays (or at least shapes) I'm guessing. But (n,2) arrays sound reasonable. Taking the distance of each pair of 'points' from each other, I can collect the values in a nested list comprehension:
In [121]: data = np.arange(6).reshape(3,2)
In [122]: [[euclidean(d,b) for b in data] for d in data]
Out[122]:
[[0.0, 2.8284271247461903, 5.6568542494923806],
[2.8284271247461903, 0.0, 2.8284271247461903],
[5.6568542494923806, 2.8284271247461903, 0.0]]
and make that an array:
In [123]: np.array([[euclidean(d,b) for b in data] for d in data])
Out[123]:
array([[ 0. , 2.82842712, 5.65685425],
[ 2.82842712, 0. , 2.82842712],
[ 5.65685425, 2.82842712, 0. ]])
The equivalent with nested loops:
alist = []
for d in data:
    sublist=[]
    for b in data:
        sublist.append(euclidean(d,b))
    alist.append(sublist)
arr = np.array(alist)
There are ways of doing this without loops, but let's make sure the basic Python looping approach works first.
===============
If I want the difference (along the last axis) between every element (row) in data and every element in bmu (or here data), I can use array broadcasting. The result is a (3,3,2) array:
In [130]: data[None,:,:]-data[:,None,:]
Out[130]:
array([[[ 0, 0],
[ 2, 2],
[ 4, 4]],
[[-2, -2],
[ 0, 0],
[ 2, 2]],
[[-4, -4],
[-2, -2],
[ 0, 0]]])
norm can handle larger dimensional arrays and takes an axis parameter.
In [132]: np.linalg.norm(data[None,:,:]-data[:,None,:],axis=-1)
Out[132]:
array([[ 0. , 2.82842712, 5.65685425],
[ 2.82842712, 0. , 2.82842712],
[ 5.65685425, 2.82842712, 0. ]])
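The same broadcasting idea can be carried over to the layout described in the question. The sketch below is only a guess at that layout: the random arrays stand in for the CSV contents, and the names features, labels and labels_b are introduced here for illustration. It builds, for every data row, a block holding the distance to each BMU followed by that BMU's last two columns:

```python
import numpy as np

# Hypothetical stand-ins for the CSV contents: 4 data rows with 3 features,
# 5 BMU rows with 3 features plus 2 trailing columns.
rng = np.random.default_rng(0)
data = rng.random((4, 3))
BMUs = rng.random((5, 5))

features, labels = BMUs[:, :-2], BMUs[:, -2:]

# (4, 1, 3) - (1, 5, 3) broadcasts to (4, 5, 3); the norm collapses the last axis
dist = np.linalg.norm(data[:, None, :] - features[None, :, :], axis=-1)  # (4, 5)

# Repeat the label columns for every data row, then glue them to the distances
labels_b = np.broadcast_to(labels, (data.shape[0],) + labels.shape)      # (4, 5, 2)
Map = np.concatenate([dist[:, :, None], labels_b], axis=-1)              # (4, 5, 3)

print(Map.shape)
```

Map[i] is then the 2d matrix for data object i, with one row per BMU.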

Thanks to your help, I managed to implement the pseudo code. Here is the final program:
import numpy as np
def euclidean(v1, v2):
    dist = np.linalg.norm(v1 - v2)
    return dist
def makeKNN(dataSet, BMUSet, k, fileOut, test=False):
    # take input files
    BMUs = np.genfromtxt(BMUSet, delimiter=',')
    data = np.genfromtxt(dataSet, delimiter=',')
    final = data[1:, :]
    if not test:
        data = data[1:, :]
    else:
        data = data[1:, :-2]
    # Calculate all the distances between data and BMUs, then reorder the BMUs by distance
    dist = np.array([[euclidean(d, b[:-2]) for b in BMUs] for d in data])
    BMU_K = np.array([BMUs[np.argsort(d)] for d in dist])
    # mean over the closest k BMUs
    Z = np.array([[np.sum(b[:k].T[5]) / k] for b in BMU_K])
    # error propagation
    Z_err = np.array([[np.sqrt(np.sum(np.power(b[:k].T[5], 2)))] for b in BMU_K])
    # Adding z estimates and errors to the data
    final = np.concatenate((final, Z, Z_err), axis=1)
    # print output file
    np.savetxt(fileOut, final, delimiter=',')
    print('So long, and thanks for all the fish')
Thank you very much and I hope that this code will help someone else in the future :)
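For larger inputs, the two list comprehensions can probably be replaced by broadcasting and fancy indexing. The following is a rough sketch rather than a drop-in replacement: the random arrays are hypothetical stand-ins for the CSV files, and it assumes, as the program above does, that column 5 of each BMU row holds the value being averaged.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.random((6, 4))    # hypothetical: 6 objects, 4 features
BMUs = rng.random((10, 6))   # hypothetical: 4 features + 2 trailing columns
k = 3

# Pairwise distances between every object and every BMU's feature part
dist = np.linalg.norm(data[:, None, :] - BMUs[None, :, :-2], axis=-1)  # (6, 10)

# Indices of the k closest BMUs for each object
nearest = np.argsort(dist, axis=1)[:, :k]                              # (6, 3)

# Column 5 of those BMUs, one row per object
z = BMUs[nearest, 5]                                                   # (6, 3)

Z = z.mean(axis=1, keepdims=True)                      # mean over the k neighbours
Z_err = np.sqrt((z ** 2).sum(axis=1, keepdims=True))   # same formula as above

print(Z.shape, Z_err.shape)
```

Z and Z_err keep a trailing axis of length 1 so they can be passed to np.concatenate along axis=1 exactly as in the program above.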

Related

Entering values at specific locations in an array in Python

I have a list T2 and an array X containing numpy arrays of different shape. I want to rearrange values in these arrays according to T2. For example, for X[0], the elements should occupy locations according to T2[0] and 0. should be placed for locations not mentioned. Similarly, for X[1], the elements should occupy locations according to T2[1]. I present the expected output.
import numpy as np
T2 = [[0, 3, 4, 5], [1, 2, 3, 4]]
X=np.array([np.array([4.23056174e+02, 3.39165087e+02, 3.98049092e+02, 3.68757486e+02]),
np.array([4.23056174e+02, 3.48895801e+02, 3.48895801e+02, 3.92892424e+02])])
The expected output is
X=array([array([4.23056174e+02, 0, 0, 3.39165087e+02, 3.98049092e+02, 3.68757486e+02]),
array([0, 4.23056174e+02, 3.48895801e+02, 3.48895801e+02, 3.92892424e+02])])
import numpy as np
T2 = ...
X = ...
out = []
for t, x in zip(T2, X):
    temp = np.zeros(max(t) + 1)
    temp[t] = x
    out.append(temp)
out = np.array(out, dtype=object)
out:
array([array([423.056174, 0. , 0. , 339.165087, 398.049092,
368.757486]) ,
array([ 0. , 423.056174, 348.895801, 348.895801, 392.892424])],
dtype=object)

Is there a way to get the top k values per row of a numpy array (Python)?

Given a numpy array of the form below:
x = [[4.,3.,2.,1.,8.],[1.2,3.1,0.,9.2,5.5],[0.2,7.0,4.4,0.2,1.3]]
is there a way to retain the top-3 values in each row and set others to zero in python (without an explicit loop). The result in the case of the example above would be
x = [[4.,3.,0.,0.,8.],[0.,3.1,0.,9.2,5.5],[0.0,7.0,4.4,0.0,1.3]]
Code for one example
import numpy as np
arr = np.array([1.2,3.1,0.,9.2,5.5,3.2])
indexes=arr.argsort()[-3:][::-1]
a = list(range(6))
A=set(indexes); B=set(a)
zero_ind=(B.difference(A))
arr[list(zero_ind)]=0
The output:
array([0. , 0. , 0. , 9.2, 5.5, 3.2])
Above is my sample code (with many lines) for a 1-D numpy array. Looping through each row of a numpy array and performing this same computation repeatedly would be quite expensive. Is there a simpler way?
Here is fully vectorized code that needs nothing beyond numpy. It uses numpy's argpartition to efficiently find the k-th values. See for instance this answer for other use cases.
import numpy
def truncate_top_k(x, k, inplace=False):
    m, n = x.shape
    # get (unsorted) indices of top-k values
    topk_indices = numpy.argpartition(x, -k, axis=1)[:, -k:]
    # get k-th value
    rows, _ = numpy.indices((m, k))
    kth_vals = x[rows, topk_indices].min(axis=1)
    # get boolean mask of values smaller than k-th
    is_smaller_than_kth = x < kth_vals[:, None]
    # replace mask by 0
    if not inplace:
        return numpy.where(is_smaller_than_kth, 0, x)
    x[is_smaller_than_kth] = 0
    return x
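As a quick check, here is a condensed copy of the function above (non-in-place branch only, using np.take_along_axis in place of the numpy.indices step) run on the array from the question. The second call uses a made-up row to show one property of the threshold approach: values tied with the k-th largest are all kept, so more than k entries can survive.

```python
import numpy as np

def truncate_top_k(x, k):
    # condensed copy of the function above, always returning a new array
    topk_indices = np.argpartition(x, -k, axis=1)[:, -k:]
    kth_vals = np.take_along_axis(x, topk_indices, axis=1).min(axis=1)
    return np.where(x < kth_vals[:, None], 0, x)

x = np.array([[4., 3., 2., 1., 8.],
              [1.2, 3.1, 0., 9.2, 5.5],
              [0.2, 7.0, 4.4, 0.2, 1.3]])
result = truncate_top_k(x, 3)
print(result)

# Illustrative row with ties at the k-th value: only the 1. is zeroed for k=2
ties = truncate_top_k(np.array([[1., 2., 2., 2., 3.]]), 2)
print(ties)
```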
Use np.apply_along_axis to apply a function to 1-D slices along a given axis.
import numpy as np
def top_k_values(array):
    indexes = array.argsort()[-3:][::-1]
    A = set(indexes)
    B = set(list(range(array.shape[0])))
    array[list(B.difference(A))]=0
    return array
arr = np.array([[4.,3.,2.,1.,8.],[1.2,3.1,0.,9.2,5.5],[0.2,7.0,4.4,0.2,1.3]])
result = np.apply_along_axis(top_k_values, 1, arr)
print(result)
Output
[[4. 3. 0. 0. 8. ]
[0. 3.1 0. 9.2 5.5]
[0. 7. 4.4 0. 1.3]]
def top_k(arr, k, axis = 0):
    n = arr.shape[axis]
    # indices of top k values in axis
    top_k_idx = np.take(np.argpartition(arr, -k, axis = axis),
                        np.arange(n - k, n),
                        axis = axis)
    out = np.zeros_like(arr) # create zero array
    np.put_along_axis(out, top_k_idx, # put idx values of arr in out
                      np.take_along_axis(arr, top_k_idx, axis = axis),
                      axis = axis)
    return out
This should work for arbitrary axis and k, but does not work in-place. If you want in-place it's a bit simpler:
def top_k(arr, k, axis = 0):
    # indices to remove
    remove_idx = np.take(np.argpartition(arr, -k, axis = axis),
                         np.arange(arr.shape[axis] - k),
                         axis = axis)
    np.put_along_axis(arr, remove_idx, 0, axis = axis) # put 0 in indices
Here is an alternative that uses a list comprehension to loop through your array, applying the keep_top_3 function to each row:
import numpy as np
import heapq
def keep_top_3(arr):
    smallest = heapq.nlargest(3, arr)[-1] # find the top 3 and use the smallest as cut off
    arr[arr < smallest] = 0 # replace anything lower than the cut off with 0
    return arr
x = [[4.,3.,2.,1.,8.],[1.2,3.1,0.,9.2,5.5],[0.2,7.0,4.4,0.2,1.3]]
result = [keep_top_3(np.array(arr)) for arr in x]
I hope this helps :)

Aggregate elements based on position vector

I'm trying to vectorize a very simple operation but can't seem to figure out how.
Given a very large numerical vector (over 1M positions) and another array of size n with a given set of positions, I would like to get back a vector of size n with elements being the average of the values of the first vector as specified by the second
a = np.array([1,2,3,4,5,6,7])
b = np.array([[0,1],[2],[3,5],[4,6]])
c = [1.5,3,5,6]
I need to repeat this operation many times so performance is an issue.
Vanilla python solution:
import numpy as np
import time
a = np.array([1,2,3,4,5,6,7])
b = np.array([[0,1],[2],[3,5],[4,6]])
begin = time.time()
for i in range(100000):
    c = []
    for d in b:
        c.append(np.mean(a[d]))
print(time.time() - begin, c)
# 3.7529971599578857 [1.5, 3.0, 5.0, 6.0]
I'm not sure if this is necessarily faster but you may as well try:
import numpy as np
a = np.array([1, 2, 3, 4, 5, 6, 7])
b = np.array([[0, 1], [2], [3, 5], [4, 6]], dtype=object)  # ragged lists need dtype=object on recent NumPy
# Get the length of each subset of indices
lens = np.fromiter((len(bi) for bi in b), count=len(b), dtype=np.int32)
# Compute reduction indices
reduce_idx = np.roll(np.cumsum(lens), 1)
reduce_idx[0] = 0
# Make flattened array of index lists
idx = np.fromiter((i for bi in b for i in bi), count=lens.sum(), dtype=np.int32)
# Reorder according to indices
a2 = a[idx]
# Sum reordered array at reduction indices and divide by number of indices
c = np.add.reduceat(a2, reduce_idx) / lens
print(c)
# [1.5 3. 5. 6. ]
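A different route to the same numbers, offered as an alternative sketch rather than as part of the answer above, is to label every flattened index with its group number and let np.bincount do the summing. The name group is introduced here for illustration, and b is kept as a plain list of lists:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7])
b = [[0, 1], [2], [3, 5], [4, 6]]   # plain list of index lists

lens = np.array([len(bi) for bi in b])
idx = np.concatenate([np.asarray(bi) for bi in b])   # flattened indices
group = np.repeat(np.arange(len(b)), lens)           # group id of each flattened index

# Sum a[idx] per group, then divide by the group sizes to get means
sums = np.bincount(group, weights=a[idx], minlength=len(b))
c = sums / lens
print(c)
```

If the index lists stay fixed across the many repetitions, lens, idx and group can be computed once and only the last two lines repeated.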

Sort invariant for numpy.argsort with multiple dimensions

numpy.argsort docs state
Returns:
index_array : ndarray, int
Array of indices that sort a along the specified axis. If a is one-dimensional, a[index_array] yields a sorted a.
How can I apply the result of numpy.argsort for a multidimensional array to get back a sorted array? (NOT just a 1-D or 2-D array; it could be an N-dimensional array where N is known only at runtime)
>>> import numpy as np
>>> np.random.seed(123)
>>> A = np.random.randn(3,2)
>>> A
array([[-1.0856306 , 0.99734545],
[ 0.2829785 , -1.50629471],
[-0.57860025, 1.65143654]])
>>> i=np.argsort(A,axis=-1)
>>> A[i]
array([[[-1.0856306 , 0.99734545],
[ 0.2829785 , -1.50629471]],
[[ 0.2829785 , -1.50629471],
[-1.0856306 , 0.99734545]],
[[-1.0856306 , 0.99734545],
[ 0.2829785 , -1.50629471]]])
For me it's not just a matter of using sort() instead; I have another array B and I want to order B using the results of np.argsort(A) along the appropriate axis. Consider the following example:
>>> A = np.array([[3,2,1],[4,0,6]])
>>> B = np.array([[3,1,4],[1,5,9]])
>>> i = np.argsort(A,axis=-1)
>>> BsortA = ???
# should result in [[4,1,3],[5,1,9]]
# so that corresponding elements of B and sort(A) stay together
It looks like this functionality is already an enhancement request in numpy.
The numpy issue #8708 has a sample implementation of take_along_axis that does what I need; I'm not sure if it's efficient for large arrays but it seems to work.
def take_along_axis(arr, ind, axis):
    """
    ... here means a "pack" of dimensions, possibly empty
    arr: array_like of shape (A..., M, B...)
        source array
    ind: array_like of shape (A..., K..., B...)
        indices to take along each 1d slice of `arr`
    axis: int
        index of the axis with dimension M
    out: array_like of shape (A..., K..., B...)
        out[a..., k..., b...] = arr[a..., inds[a..., k..., b...], b...]
    """
    if axis < 0:
        if axis >= -arr.ndim:
            axis += arr.ndim
        else:
            raise IndexError('axis out of range')
    ind_shape = (1,) * ind.ndim
    ins_ndim = ind.ndim - (arr.ndim - 1) #inserted dimensions
    dest_dims = list(range(axis)) + [None] + list(range(axis+ins_ndim, ind.ndim))
    # could also call np.ix_ here with some dummy arguments, then throw those results away
    inds = []
    for dim, n in zip(dest_dims, arr.shape):
        if dim is None:
            inds.append(ind)
        else:
            ind_shape_dim = ind_shape[:dim] + (-1,) + ind_shape[dim+1:]
            inds.append(np.arange(n).reshape(ind_shape_dim))
    return arr[tuple(inds)]
which yields
>>> A = np.array([[3,2,1],[4,0,6]])
>>> B = np.array([[3,1,4],[1,5,9]])
>>> i = A.argsort(axis=-1)
>>> take_along_axis(A,i,axis=-1)
array([[1, 2, 3],
[0, 4, 6]])
>>> take_along_axis(B,i,axis=-1)
array([[4, 1, 3],
[5, 1, 9]])
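NumPy 1.15 and later ship this functionality as np.take_along_axis, so on a current installation the helper probably does not need to be copied. A short sketch using the built-in on the same A and B, plus a made-up 3-D array C to check a non-trailing axis:

```python
import numpy as np

A = np.array([[3, 2, 1], [4, 0, 6]])
B = np.array([[3, 1, 4], [1, 5, 9]])

i = np.argsort(A, axis=-1)
A_sorted = np.take_along_axis(A, i, axis=-1)   # same as np.sort(A, axis=-1)
B_by_A = np.take_along_axis(B, i, axis=-1)     # B reordered to follow sorted A

# Hypothetical 3-D array, sorted along its middle axis
rng = np.random.default_rng(0)
C = rng.random((2, 3, 4))
j = np.argsort(C, axis=1)
C_sorted = np.take_along_axis(C, j, axis=1)

print(A_sorted)
print(B_by_A)
```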
This argsort produces a (3,2) array
In [453]: idx=np.argsort(A,axis=-1)
In [454]: idx
Out[454]:
array([[0, 1],
[1, 0],
[0, 1]], dtype=int32)
As you note, applying this to A to get the equivalent of np.sort(A, axis=-1) isn't obvious. The iterative solution is to sort each row (a 1d case) with:
In [459]: np.array([x[i] for i,x in zip(idx,A)])
Out[459]:
array([[-1.0856306 , 0.99734545],
[-1.50629471, 0.2829785 ],
[-0.57860025, 1.65143654]])
While probably not the fastest, it is probably the clearest solution, and a good starting point for conceptualizing a better solution.
The tuple(inds) from the take solution is:
(array([[0],
[1],
[2]]),
array([[0, 1],
[1, 0],
[0, 1]], dtype=int32))
In [470]: A[_]
Out[470]:
array([[-1.0856306 , 0.99734545],
[-1.50629471, 0.2829785 ],
[-0.57860025, 1.65143654]])
In other words:
In [472]: A[np.arange(3)[:,None], idx]
Out[472]:
array([[-1.0856306 , 0.99734545],
[-1.50629471, 0.2829785 ],
[-0.57860025, 1.65143654]])
The first part is what np.ix_ would construct, but it does not 'like' the 2d idx.
Looks like I explored this topic a couple of years ago
argsort for a multidimensional ndarray
a[np.arange(np.shape(a)[0])[:,np.newaxis], np.argsort(a)]
I tried to explain what is going on. The take function does the same sort of thing, but constructs the indexing tuple for a more general case (dimensions and axis). Generalizing to more dimensions, but still with axis=-1 should be easy.
For the first axis, A[np.argsort(A,axis=0),np.arange(2)] works.
We just need to use advanced indexing to index along all axes with those index arrays. We can use np.ogrid to create open grids of range arrays along all axes and then replace only the entry for the input axis with the input indices. Finally, index into the data array with those indices for the desired output. Thus, essentially, we would have -
# Inputs : arr, ind, axis
idx = list(np.ogrid[tuple(map(slice, ind.shape))])  # list() so one entry can be replaced
idx[axis] = ind
out = arr[tuple(idx)]
Just to make it functional and do error checks, let's create two functions - One to get those indices and second one to feed in the data array and simply index. The idea with the first function is to get the indices that could be re-used for indexing into any arbitrary array which would support the necessary number of dimensions and lengths along each axis.
Hence, the implementations would be -
def advindex_allaxes(ind, axis):
    axis = np.core.multiarray.normalize_axis_index(axis,ind.ndim)
    idx = list(np.ogrid[tuple(map(slice, ind.shape))])  # list() so one entry can be replaced
    idx[axis] = ind
    return tuple(idx)
def take_along_axis(arr, ind, axis):
    return arr[advindex_allaxes(ind, axis)]
Sample runs -
In [161]: A = np.array([[3,2,1],[4,0,6]])
In [162]: B = np.array([[3,1,4],[1,5,9]])
In [163]: i = A.argsort(axis=-1)
In [164]: take_along_axis(A,i,axis=-1)
Out[164]:
array([[1, 2, 3],
[0, 4, 6]])
In [165]: take_along_axis(B,i,axis=-1)
Out[165]:
array([[4, 1, 3],
[5, 1, 9]])
Relevant one.

Mahalanobis distance in python returns matrix instead of distance

This should be a simple question, either I am missing information, or I have mis-coded this.
I am trying to implement the Mahalanobis distance in python, following the formula.
My code is as follows:
a = np.array([[1, 3, 5]])
b = np.array([[4, 5, 6]])
X = np.empty((0,3), float)
X = np.vstack([X, [2,3,4]])
X = np.vstack([X, a])
X = np.vstack([X, b])
n = ((a-b).T)*(np.cov(X)**-1)*(a-b)
dist = np.sqrt(n)
dist returns a 3x3 array but should I not be expecting a single number representing the distance?
dist = array([[ 1.5 , 1.73205081, 1.22474487],
[ 1.73205081 , 2. , 1.41421356],
[ 1.22474487 , 1.41421356, 1. ]])
Wikipedia does not suggest (to me) that it should return a matrix. Googling implementations of Mahalanobis distance in python, I have not found anything to compare it to.
From the wiki page you can see that a and b are column vectors, but in your case they are 1x3 row arrays, so the transposes need to be swapped. You also need matrix multiplication: in numpy, * means element-wise multiplication; for a matrix product use the np.dot function or the .dot method of the array. For your case the answer is:
n = (a-b).dot((np.cov(X)**-1).dot((a-b).T))
dist = np.sqrt(n)
In [54]: n
Out[54]: array([[ 25.]])
In [55]: dist
Out[55]: array([[ 5.]])
EDIT
As @roadrunner66 noticed, you should use the matrix inverse rather than the element-wise inverse. Usually np.linalg.inv works for such cases, but here it raises a singular matrix error, so you need to use np.linalg.pinv:
n = (a-b).dot((np.linalg.pinv(np.cov(X))).dot((a-b).T))
dist = np.sqrt(n)
In [90]: n
Out[90]: array([[ 1.77777778]])
In [91]: dist
Out[91]: array([[ 1.33333333]])
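One further point worth checking: np.cov treats each row as a variable by default. If the rows of X are meant to be observations (an assumption about the asker's intent, not something stated above), rowvar=False is needed, and the result will not necessarily match the numbers printed above. A minimal sketch under that assumption, using 1-D vectors so the quadratic form comes out as a plain scalar:

```python
import numpy as np

# Assumption: rows of X are observations, columns are variables
X = np.array([[2., 3., 4.],
              [1., 3., 5.],
              [4., 5., 6.]])
a = X[1]
b = X[2]

cov = np.cov(X, rowvar=False)                # 3x3 covariance between the columns
# Only 3 observations, so cov is singular: use the pseudo-inverse.
# The explicit rcond is a choice made here to discard near-zero singular values.
cov_inv = np.linalg.pinv(cov, rcond=1e-10)

delta = a - b                                # 1-D vector of shape (3,)
dist = float(np.sqrt(delta @ cov_inv @ delta))
print(dist)
```

With 1-D inputs the @ operator performs the vector-matrix-vector product directly, so no transposes are needed and dist is a single float rather than a 1x1 array.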
