I have to run the snippet shown below about 200000 times in a row, and the snippet takes about 0.12585 seconds per 1000 iterations. datapoints has a shape of (3, 2704, 64).
output = []
maxium = 0
for datapoint in datapoints:
    tmp = []
    for data in datapoint:
        maxium = max(data)
        if maxium == 0:
            tmp.append(data)
        else:
            tmp.append(data / maxium)
    output.append(tmp)
I have tried to rewrite it using map(), but this gives me an average of 0.23237 seconds per 1000 iterations. This is probably due to the multiple max(data) and list() calls.
np.asarray(list(map(lambda datapoint: list(map(lambda data: data / max(data) if max(data) > 0 else data, datapoint)), datapoints)))
Is there a way to optimize the code further to improve performance?
Well here's a short answer:
def bar(datapoints):
    m = np.amax(datapoints, axis=2)
    m[m == 0] = 1
    return datapoints / m[:, :, np.newaxis]
Here's an explanation of how you might have got there (it's how I did get there!):
Let's start off with some example data:
>>> x = np.array([[[1, 2, 3, 4], [11, -12, 13, -14]], [[26, 27, 28, 29], [0, 0, 0, 0]]])
Now check what you get on your original function:
def foo(datapoints):
    output = []
    maxium = 0
    for datapoint in datapoints:
        tmp = []
        for data in datapoint:
            maxium = max(data)
            if maxium == 0:
                tmp.append(data)
            else:
                tmp.append(data / maxium)
        output.append(tmp)
    return np.array(output)
The result is:
>>> foo(x)
array([[[ 0.25 , 0.5 , 0.75 , 1. ],
[ 0.84615385, -0.92307692, 1. , -1.07692308]],
[[ 0.89655172, 0.93103448, 0.96551724, 1. ],
[ 0. , 0. , 0. , 0. ]]])
Now let's try out amax:
>>> np.amax(x, axis=0)
array([[26, 27, 28, 29],
[11, 0, 13, 0]])
>>> np.amax(x, axis=2)
array([[ 4, 13],
[29, 0]])
Ah ha, looks like axis=2 is what we're after. Now we want to divide the original array by this, but only in the places where the max is non-zero. How do we divide only in some places? The answer: we divide everywhere, but in some places we divide by 1, so the division has no effect. So let's replace the zeros with ones:
>>> m = np.amax(x, axis=2)
>>> m[m == 0] = 1
>>> m
array([[ 4, 13],
[29, 1]])
Finally, let's divide by this, broadcasting back over axis 2 which we took the maximum over earlier:
>>> x / m[:,:,np.newaxis]
array([[[ 0.25 , 0.5 , 0.75 , 1. ],
[ 0.84615385, -0.92307692, 1. , -1.07692308]],
[[ 0.89655172, 0.93103448, 0.96551724, 1. ],
[ 0. , 0. , 0. , 0. ]]])
Putting that all together you get bar() at the top.
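As a quick sanity check that the vectorized version matches the loop (a sketch, using foo() and the example x from above):

>>> np.allclose(foo(x), bar(x))
True

Note that both return floats even for integer input, since / is true division.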
Try something like this:
maximum = datapoints.max(axis=2, keepdims=True)
output = np.where(maximum==0, datapoints, datapoints/maximum)
You would see a RuntimeWarning: invalid value encountered in true_divide, but it should work as expected.
Update: as @ArthurTacca pointed out,
output = datapoints/np.where(maximum==0, 1, maximum)
will eliminate the warning.
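Putting the update together, a minimal sketch (with a random stand-in for the real input):

import numpy as np

datapoints = np.random.random(size=(3, 2704, 64))  # stand-in for the real input
maximum = datapoints.max(axis=2, keepdims=True)
output = datapoints / np.where(maximum == 0, 1, maximum)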
Yes, you can definitely speed this up with vectorized numpy operations. Here's how I would do it, if I understand what you're trying to do correctly:
import numpy as np
# I use a randomly initialized array here, replace this with your input
arr = np.random.random(size=(3, 2704, 64))
# Find max for 3rd dimension, returns array w/ shape (3, 2704)
max_arr = np.max(arr, axis=2)
# Set up divisor, returns array w/ shape (3, 2704)
divisor = np.where(max_arr == 0, 1, max_arr)
# Use expand_dims to add third dimension, returns array w/ shape (3, 2704, 1)
divisor = np.expand_dims(divisor, axis=2)
# Perform division, shape is (3, 2704, 64)
ans = np.divide(arr, divisor)
From your code, I gather that you intend to scale your data by the max of your 3rd axis, but in the event of there being 0, forgo scaling instead. You seem to also want your output to have the same shape as your input, which explains the way you structured output and tmp. That's why I left the code snippet ending with output in a numpy array, but if you need it in its original form regardless, it's a simple loop to re-arrange your data:
output = []
for i in ans:
    tmp = []
    for j in i:
        tmp.append(list(j))
    output.append(tmp)
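If plain nested Python lists are all you need, ndarray.tolist() gives essentially the same structure in one call:

output = ans.tolist()  # nested lists, equivalent to the loop above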
For future reference, furnish your questions with more detail. It will make it easier for people to participate, and you'll increase the chance of getting your questions answered quickly!
numpy.argsort docs state
Returns:
index_array : ndarray, int
Array of indices that sort a along the specified axis. If a is one-dimensional, a[index_array] yields a sorted a.
How can I apply the result of numpy.argsort for a multidimensional array to get back a sorted array? (NOT just a 1-D or 2-D array; it could be an N-dimensional array where N is known only at runtime)
>>> import numpy as np
>>> np.random.seed(123)
>>> A = np.random.randn(3,2)
>>> A
array([[-1.0856306 , 0.99734545],
[ 0.2829785 , -1.50629471],
[-0.57860025, 1.65143654]])
>>> i=np.argsort(A,axis=-1)
>>> A[i]
array([[[-1.0856306 , 0.99734545],
[ 0.2829785 , -1.50629471]],
[[ 0.2829785 , -1.50629471],
[-1.0856306 , 0.99734545]],
[[-1.0856306 , 0.99734545],
[ 0.2829785 , -1.50629471]]])
For me it's not just a matter of using sort() instead; I have another array B and I want to order B using the results of np.argsort(A) along the appropriate axis. Consider the following example:
>>> A = np.array([[3,2,1],[4,0,6]])
>>> B = np.array([[3,1,4],[1,5,9]])
>>> i = np.argsort(A,axis=-1)
>>> BsortA = ???
# should result in [[4,1,3],[5,1,9]]
# so that corresponding elements of B and sort(A) stay together
It looks like this functionality is already an enhancement request in numpy.
The numpy issue #8708 has a sample implementation of take_along_axis that does what I need; I'm not sure if it's efficient for large arrays but it seems to work.
def take_along_axis(arr, ind, axis):
    """
    ... here means a "pack" of dimensions, possibly empty
    arr: array_like of shape (A..., M, B...)
        source array
    ind: array_like of shape (A..., K..., B...)
        indices to take along each 1d slice of `arr`
    axis: int
        index of the axis with dimension M
    out: array_like of shape (A..., K..., B...)
        out[a..., k..., b...] = arr[a..., inds[a..., k..., b...], b...]
    """
    if axis < 0:
        if axis >= -arr.ndim:
            axis += arr.ndim
        else:
            raise IndexError('axis out of range')
    ind_shape = (1,) * ind.ndim
    ins_ndim = ind.ndim - (arr.ndim - 1)  # inserted dimensions

    dest_dims = list(range(axis)) + [None] + list(range(axis+ins_ndim, ind.ndim))

    # could also call np.ix_ here with some dummy arguments, then throw those results away
    inds = []
    for dim, n in zip(dest_dims, arr.shape):
        if dim is None:
            inds.append(ind)
        else:
            ind_shape_dim = ind_shape[:dim] + (-1,) + ind_shape[dim+1:]
            inds.append(np.arange(n).reshape(ind_shape_dim))
    return arr[tuple(inds)]
which yields
>>> A = np.array([[3,2,1],[4,0,6]])
>>> B = np.array([[3,1,4],[1,5,9]])
>>> i = A.argsort(axis=-1)
>>> take_along_axis(A,i,axis=-1)
array([[1, 2, 3],
[0, 4, 6]])
>>> take_along_axis(B,i,axis=-1)
array([[4, 1, 3],
[5, 1, 9]])
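Note: since this was written, take_along_axis has landed in NumPy itself (version 1.15+), so on recent versions you can call it directly:

>>> np.take_along_axis(B, i, axis=-1)  # NumPy >= 1.15
array([[4, 1, 3],
       [5, 1, 9]])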
This argsort produces a (3,2) array
In [453]: idx=np.argsort(A,axis=-1)
In [454]: idx
Out[454]:
array([[0, 1],
[1, 0],
[0, 1]], dtype=int32)
As you note, applying this to A to get the equivalent of np.sort(A, axis=-1) isn't obvious. The iterative solution is to sort each row (a 1d case) with:
In [459]: np.array([x[i] for i,x in zip(idx,A)])
Out[459]:
array([[-1.0856306 , 0.99734545],
[-1.50629471, 0.2829785 ],
[-0.57860025, 1.65143654]])
While probably not the fastest, it is probably the clearest solution, and a good starting point for conceptualizing a better solution.
The tuple(inds) from the take solution is:
(array([[0],
[1],
[2]]),
array([[0, 1],
[1, 0],
[0, 1]], dtype=int32))
In [470]: A[_]
Out[470]:
array([[-1.0856306 , 0.99734545],
[-1.50629471, 0.2829785 ],
[-0.57860025, 1.65143654]])
In other words:
In [472]: A[np.arange(3)[:,None], idx]
Out[472]:
array([[-1.0856306 , 0.99734545],
[-1.50629471, 0.2829785 ],
[-0.57860025, 1.65143654]])
The first part is what np.ix_ would construct, but it does not 'like' the 2d idx.
Looks like I explored this topic a couple of years ago
argsort for a multidimensional ndarray
a[np.arange(np.shape(a)[0])[:,np.newaxis], np.argsort(a)]
I tried to explain what is going on. The take function does the same sort of thing, but constructs the indexing tuple for a more general case (dimensions and axis). Generalizing to more dimensions, but still with axis=-1, should be easy.
For the first axis, A[np.argsort(A,axis=0),np.arange(2)] works.
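As a quick hedged check that the first-axis version matches np.sort:

In [473]: np.allclose(A[np.argsort(A,axis=0), np.arange(2)], np.sort(A, axis=0))
Out[473]: True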
We just need to use advanced indexing to index along all axes with those index arrays. We can use np.ogrid to create open grids of range arrays along all axes and then replace only the one for the input axis with the input indices. Finally, index into the data array with those indices for the desired output. Thus, essentially, we would have -
# Inputs : arr, ind, axis
idx = np.ogrid[tuple(map(slice, ind.shape))]
idx[axis] = ind
out = arr[tuple(idx)]
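To see what the open grid looks like, here's a small sketch for a (2, 3)-shaped ind:

In [157]: np.ogrid[tuple(map(slice, (2, 3)))]
Out[157]:
[array([[0],
        [1]]), array([[0, 1, 2]])]

Each range array broadcasts along the other axes, so replacing the entry at position axis with ind reproduces the advanced-indexing tuple seen in the earlier answers.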
Just to make it functional and do error checks, let's create two functions - one to get those indices and a second one to feed in the data array and simply index. The idea with the first function is to get indices that can be re-used for indexing into any array that supports the necessary number of dimensions and lengths along each axis.
Hence, the implementations would be -
def advindex_allaxes(ind, axis):
    axis = np.core.multiarray.normalize_axis_index(axis, ind.ndim)
    idx = np.ogrid[tuple(map(slice, ind.shape))]
    idx[axis] = ind
    return tuple(idx)

def take_along_axis(arr, ind, axis):
    return arr[advindex_allaxes(ind, axis)]
Sample runs -
In [161]: A = np.array([[3,2,1],[4,0,6]])
In [162]: B = np.array([[3,1,4],[1,5,9]])
In [163]: i = A.argsort(axis=-1)
In [164]: take_along_axis(A,i,axis=-1)
Out[164]:
array([[1, 2, 3],
[0, 4, 6]])
In [165]: take_along_axis(B,i,axis=-1)
Out[165]:
array([[4, 1, 3],
[5, 1, 9]])
from numpy import genfromtxt, linalg, array, append, hstack, vstack

#Euclidean distance function
def euclidean(v1, v2):
    dist = linalg.norm(v1 - v2)
    return dist

#get the .csv files and eliminate heading and unused columns from test
BMUs = genfromtxt('BMU3.csv', delimiter=',')
data = genfromtxt('test.csv', delimiter=',')
data = data[1:, :-2]

i = 0
for obj in data:
    D = 0
    for BMU in BMUs:
        Dist = append(euclidean(obj, BMU[:-2]), BMU[-2:])
        D = hstack(Dist)
    Map = vstack(D)
    #iteration counter
    i += 1
    if not i % 1000:
        print(i, ' of ', len(data))
print(Map)
What I would like to do is:
Take an object from data
Calculate distance from BMU (euclidean(obj, BMU[:-2]))
Append to the distance the last two items of the BMU array
create a 2d matrix that contains all the distances plus the last two items of all the BMU from a data object (D = hstack(Dist))
create an array of those matrices with length equal to the number of objects in data. (Map = vstack(D))
The problem here, or at least what I think is the problem, is that hstack and vstack want as input a tuple of arrays, not a single array. It's like I'm trying to use them the way I use list.append() for lists; sadly, I'm a beginner and I have no idea how to do it differently.
Any help would be awesome, thank you in advance :)
First a usage note:
Instead of:
from numpy import genfromtxt, linalg, array, append, hstack, vstack
use
import numpy as np
....
data = np.genfromtxt(....)
....
np.hstack...
Secondly, stay away from np.append. It is too easy to misuse. Use np.concatenate so you get the full flavor of what it is doing.
list append is better for incremental work
alist = []
for ...:
    alist.append(...)
arr = np.array(alist)
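Applied to your loop, a hedged sketch of that pattern (using np.concatenate rather than np.append):

alist = []
for obj in data:
    row = [np.concatenate(([euclidean(obj, BMU[:-2])], BMU[-2:])) for BMU in BMUs]
    alist.append(row)
Map = np.array(alist)  # shape (len(data), len(BMUs), 3)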
==================
Without sample arrays (or at least shapes) I'm guessing. But (n,2) arrays sound reasonable. Taking the distance of each pair of 'points' from each other, I can collect the values in a nested list comprehension:
In [121]: data = np.arange(6).reshape(3,2)
In [122]: [[euclidean(d,b) for b in data] for d in data]
Out[122]:
[[0.0, 2.8284271247461903, 5.6568542494923806],
[2.8284271247461903, 0.0, 2.8284271247461903],
[5.6568542494923806, 2.8284271247461903, 0.0]]
and make that an array:
In [123]: np.array([[euclidean(d,b) for b in data] for d in data])
Out[123]:
array([[ 0. , 2.82842712, 5.65685425],
[ 2.82842712, 0. , 2.82842712],
[ 5.65685425, 2.82842712, 0. ]])
The equivalent with nested loops:
alist = []
for d in data:
    sublist = []
    for b in data:
        sublist.append(euclidean(d, b))
    alist.append(sublist)
arr = np.array(alist)
There are ways of doing this without loops, but let's make sure the basic Python looping approach works first.
===============
If I want the difference (along the last axis) between every element (row) in data and every element in bmu (or here data), I can use array broadcasting. The result is a (3,3,2) array:
In [130]: data[None,:,:]-data[:,None,:]
Out[130]:
array([[[ 0, 0],
[ 2, 2],
[ 4, 4]],
[[-2, -2],
[ 0, 0],
[ 2, 2]],
[[-4, -4],
[-2, -2],
[ 0, 0]]])
norm can handle larger dimensional arrays and takes an axis parameter.
In [132]: np.linalg.norm(data[None,:,:]-data[:,None,:],axis=-1)
Out[132]:
array([[ 0. , 2.82842712, 5.65685425],
[ 2.82842712, 0. , 2.82842712],
[ 5.65685425, 2.82842712, 0. ]])
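Applied to your actual task (a sketch, assuming data and BMUs are loaded as in the question), this computes every object-to-BMU distance in one vectorized call:

In [133]: dists = np.linalg.norm(data[:,None,:] - BMUs[None,:,:-2], axis=-1)

where dists[i, j] is the distance between data[i] and BMUs[j][:-2].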
Thanks to your help, I managed to implement the pseudocode. Here is the final program:
import numpy as np

def euclidean(v1, v2):
    dist = np.linalg.norm(v1 - v2)
    return dist

def makeKNN(dataSet, BMUSet, k, fileOut, test=False):
    # take input files
    BMUs = np.genfromtxt(BMUSet, delimiter=',')
    data = np.genfromtxt(dataSet, delimiter=',')
    final = data[1:, :]
    if test == False:
        data = data[1:, :]
    else:
        data = data[1:, :-2]

    # Calculate all the distances between data and BMUs, then reorder the BMUs with the distance information
    dist = np.array([[euclidean(d, b[:-2]) for b in BMUs] for d in data])
    BMU_K = np.array([BMUs[np.argsort(d)] for d in dist])

    # mean over the closest k BMUs
    Z = np.array([[np.sum(b[:k].T[5]) / k] for b in BMU_K])

    # error propagation
    Z_err = np.array([[np.sqrt(np.sum(np.power(b[:k].T[5], 2)))] for b in BMU_K])

    # Adding z estimates and errors to the data
    final = np.concatenate((final, Z, Z_err), axis=1)

    # print output file
    np.savetxt(fileOut, final, delimiter=',')
    print('So long, and thanks for all the fish')
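A possible invocation, assuming the file names from the question (the k value and output name are just placeholders):

makeKNN('test.csv', 'BMU3.csv', k=5, fileOut='output.csv', test=True)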
Thank you very much and I hope that this code will help someone else in the future :)
I'm trying to calculate the Pearson correlation between every pair of items in my list: the correlations between data[0] and data[1], data[0] and data[2], and data[1] and data[2].
import scipy
from scipy import stats

data = [[1, 2, 4], [9, 5, 1], [8, 3, 3]]

def pearson(x, y):
    series1 = data[x]
    series2 = data[y]
    if x != y:
        return scipy.stats.pearsonr(series1, series2)

h = [pearson(x,y) for x,y in range(0, len(data))]
This returns the error TypeError: 'int' object is not iterable on h. Could someone please explain the error here? Thanks.
range yields a single int per iteration, while you are trying to unpack each value as if it were a tuple. Try itertools.combinations instead:
import scipy
from scipy import stats
from itertools import combinations

data = [[1, 2, 4], [9, 5, 1], [8, 3, 3]]

def pearson(x, y):
    series1 = data[x]
    series2 = data[y]
    if x != y:
        return scipy.stats.pearsonr(series1, series2)

h = [pearson(x, y) for x, y in combinations(range(len(data)), 2)]
Or as @Marius suggested:
h = [stats.pearsonr(data[x], data[y]) for x, y in combinations(range(len(data)), 2)]
Why not use numpy.corrcoef?
import numpy as np
data = [[1, 2, 4], [9, 5, 1], [8, 3, 3]]
Result:
>>> np.corrcoef(data)
array([[ 1. , -0.98198051, -0.75592895],
[-0.98198051, 1. , 0.8660254 ],
[-0.75592895, 0.8660254 , 1. ]])
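The off-diagonal entries are the pairwise correlations, so a single pair is just an index lookup (note corrcoef gives only the coefficient, not the p-value that pearsonr also returns):

>>> r01 = np.corrcoef(data)[0, 1]  # correlation between data[0] and data[1]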
The range() function gives you a single int on each iteration, and you can't unpack an int into a pair of values.
If you want to go through every possible pair of ints in that range, you could try
import itertools
h = [pearson(x,y) for x,y in itertools.product(range(len(data)), repeat=2)]
That will combine all the possibilities in the given range into 2-element tuples.
Remember that, using that function you defined, when x==y you will have None values. To fix that you could use:
import itertools
h = [pearson(x,y) for x,y in itertools.permutations(range(len(data)), 2)]
I have an Nx5 array containing N vectors of form 'id', 'x', 'y', 'z' and 'energy'. I need to remove duplicate points (i.e. where x, y, z all match) within a tolerance of say 0.1. Ideally I could create a function where I pass in the array, columns that need to match and a tolerance on the match.
Following this thread on Scipy-user, I can remove duplicates based on a full array using record arrays, but I need to just match part of an array. Moreover this will not match within a certain tolerance.
I could laboriously iterate through with a for loop in Python, but is there a better Numponic way?
You might look at scipy.spatial.KDTree.
How big is N?
Added: oops, tree.query_pairs is not in scipy 0.7.1.
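With a newer SciPy, a minimal sketch (cKDTree, and the 0.1 tolerance from the question):

import numpy as np
from scipy.spatial import cKDTree

points = np.random.uniform(0, 100, size=(1000, 3))  # stand-in for the x, y, z columns
tree = cKDTree(points)
pairs = tree.query_pairs(r=0.1)  # set of (i, j) index pairs closer than the tolerance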
When in doubt, use brute force: split the space (here side^3) into little cells,
one point per cell:
""" scatter points to little cells, 1 per cell """
from __future__ import division
import sys
import numpy as np
side = 100
npercell = 1 # 1: ~ 1/e empty
exec "\n".join( sys.argv[1:] ) # side= ...
N = side**3 * npercell
print "side: %d npercell: %d N: %d" % (side, npercell, N)
np.random.seed( 1 )
points = np.random.uniform( 0, side, size=(N,3) )
cells = np.zeros( (side,side,side), dtype=np.uint )
id = 1
for p in points.astype(int):
cells[tuple(p)] = id
id += 1
cells = cells.flatten()
# A C, an E-flat, and a G walk into a bar.
# The bartender says, "Sorry, but we don't serve minors."
nz = np.nonzero(cells)[0]
print "%d cells have points" % len(nz)
print "first few ids:", cells[nz][:10]
I have finally got a solution that I am happy with; this is a slightly cleaned-up cut-and-paste from my own code. There may yet be some bugs.
Note that it still uses a 'for' loop. I could use Denis's idea of the KDTree above, coupled with the rounding, to get the full solution.
import numpy as np
def remove_duplicates(data, dp_tol=None, cols=None, sort_by=None):
    '''
    Removes duplicate vectors from a list of data points
    Parameters:
        data        An MxN array of N vectors of dimension M
        cols        An iterable of the columns that must match
                    in order to constitute a duplicate
                    (default: [1,2,3] for typical Klist data array)
        dp_tol      An iterable of three tolerances or a single
                    tolerance for all dimensions. Uses this to round
                    the values to the specified number of decimal
                    places before performing the removal.
                    (default: None)
        sort_by     An iterable of columns to sort by (default: [0])
    Returns:
        MxI Array   An array of I vectors (minus the duplicates)
    EXAMPLES:
    Remove a duplicate
    >>> import numpy as np
    >>> vecs1 = np.array([[1, 0, 0, 0],
    ...                   [2, 0, 0, 0],
    ...                   [3, 0, 0, 1]])
    >>> remove_duplicates(vecs1)
    array([[1, 0, 0, 0],
           [3, 0, 0, 1]])
    Remove duplicates with a tolerance
    >>> vecs2 = np.array([[1, 0, 0, 0    ],
    ...                   [2, 0, 0, 0.001],
    ...                   [3, 0, 0, 0.02 ],
    ...                   [4, 0, 0, 1    ]])
    >>> remove_duplicates(vecs2, dp_tol=2)
    array([[ 1.  ,  0.  ,  0.  ,  0.  ],
           [ 3.  ,  0.  ,  0.  ,  0.02],
           [ 4.  ,  0.  ,  0.  ,  1.  ]])
    Remove duplicates and sort by k values
    >>> vecs3 = np.array([[1, 0, 0, 0],
    ...                   [2, 0, 0, 2],
    ...                   [3, 0, 0, 0],
    ...                   [4, 0, 0, 1]])
    >>> remove_duplicates(vecs3, sort_by=[3])
    array([[1, 0, 0, 0],
           [4, 0, 0, 1],
           [2, 0, 0, 2]])
    Change the columns that constitute a duplicate
    >>> vecs4 = np.array([[1, 0, 0, 0],
    ...                   [2, 0, 0, 2],
    ...                   [1, 0, 0, 0],
    ...                   [4, 0, 0, 1]])
    >>> remove_duplicates(vecs4, cols=[0])
    array([[1, 0, 0, 0],
           [2, 0, 0, 2],
           [4, 0, 0, 1]])
    '''
    # Deal with the parameters
    if sort_by is None:
        sort_by = [0]
    if cols is None:
        cols = [1, 2, 3]
    if dp_tol is not None:
        # test to see if already an iterable
        try:
            null = iter(dp_tol)
            tols = np.array(dp_tol)
        except TypeError:
            tols = np.ones_like(cols) * dp_tol
        # Convert to numbers of decimal places
        # Find the 'order' of the axes
    else:
        tols = None
    rnd_data = data.copy()
    # set the tolerances
    if tols is not None:
        for col, tol in zip(cols, tols):
            rnd_data[:, col] = np.around(rnd_data[:, col], decimals=tol)
    # TODO: For now, use a slow Python 'for' loop, try to find a more
    # numponic way later - see: http://stackoverflow.com/questions/2433882/
    sorted_indexes = np.lexsort(tuple([rnd_data[:, col] for col in cols]))
    rnd_data = rnd_data[sorted_indexes]
    unique_kpts = []
    for i in range(len(rnd_data)):
        if i == 0:
            unique_kpts.append(i)
        else:
            if (rnd_data[i, cols] == rnd_data[i - 1, cols]).all():
                continue
            else:
                unique_kpts.append(i)
    rnd_data = rnd_data[unique_kpts]
    # Now sort
    sorted_indexes = np.lexsort(tuple([rnd_data[:, col] for col in sort_by]))
    rnd_data = rnd_data[sorted_indexes]
    return rnd_data

if __name__ == '__main__':
    import doctest
    doctest.testmod()
I have not tested this, but if you sort your array along x, then y, then z, this should get you the list of duplicates. You then need to choose which to keep.
def find_dup_xyz(anarray, x, y, z):
    # for example with data = array([id, x, y, z, energy]): x=1, y=2, z=3
    # assumes anarray has already been sorted along x, then y, then z
    dup_xyz = []
    for i, row in enumerate(anarray):
        nx = 1
        while (i + nx < len(anarray)
               and abs(row[x] - anarray[i + nx][x]) < 0.1
               and abs(row[y] - anarray[i + nx][y]) < 0.1
               and abs(row[z] - anarray[i + nx][z]) < 0.1):
            dup_xyz.append(anarray[i + nx])  # record the later row of each close pair
            nx += 1
    return dup_xyz
Also just found this
http://mail.scipy.org/pipermail/scipy-user/2008-April/016504.html