I am trying to implement the k-means clustering algorithm for a small project. I came upon this article, which suggests that
K-Means is much faster if you write the update functions using operations on numpy arrays, instead of manually looping over the arrays and updating the values yourself.
I am doing exactly that: iterating over each element of the array to update it. For each element in dataset z, I assign the nearest centroid to the corresponding entry of the cluster array, one element at a time.
for i in range(z):
    clstr[i] = closest_center(data[i], cen)
and my update function is
from math import fabs

def closest_center(x, clist):
    dlist = [fabs(x - i) for i in clist]
    return clist[dlist.index(min(dlist))]
Since I am using a grayscale image, I use the absolute difference as the Euclidean distance.
I noticed that OpenCV has this algorithm too. It takes less than 2 s to execute, while mine takes more than 70 s. May I know what the article is suggesting?
My images are imported as grayscale and are represented as 2D numpy arrays. I further converted them into 1D arrays because those are easier to process.
The list comprehension is likely to slow down execution. I would suggest vectorizing the function closest_center. This is straightforward for 1-dimensional arrays:
import numpy as np

def closest_center(x, clist):
    return clist[np.argmin(np.abs(x - clist))]
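The per-pixel loop itself can also be removed by broadcasting over the whole image at once. The following is only a minimal sketch, assuming data is the flattened 1-D image and cen is a 1-D NumPy array of centroid values; the helper name assign_clusters and the sample values are made up for illustration.

```python
import numpy as np

def assign_clusters(data, cen):
    """Return, for every pixel in 1-D `data`, the value of its nearest centroid."""
    # Distance table of shape (len(data), len(cen)) built by broadcasting
    dist = np.abs(data[:, None] - cen[None, :])
    # Index of the closest centroid for each pixel
    nearest = dist.argmin(axis=1)
    return cen[nearest]

# Hypothetical sample data: six pixel intensities and three centroids
data = np.array([12.0, 200.0, 90.0, 30.0, 180.0, 100.0])
cen = np.array([20.0, 100.0, 190.0])
clstr = assign_clusters(data, cen)
```

The intermediate table holds len(data) * len(cen) values, so this trades memory for speed; with a handful of centroids that is usually acceptable.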
Related
I am trying to get an array generated by applying different functions, all stored in a numpy array, to the same parameter. Is there an efficient way of coding this using numpy?
# func_array - a numpy array of different functions that take the same parameter
# X - parameter for every function in func_array
def apply_all(func_array, X):
    return func_array(X)
# where the return value is an array whose index i holds func_array[i](X)
The only solution I thought of is iterating through func_array, and I wonder if there is a faster way of doing it.
I once had the exact same question, and this is what I was told:
The vectorization speed-up that numpy array operations provide is due to the base data-types defined for the array (say an array of floats, for instance).
When the array elements are objects, this advantage is mostly nullified. Since functions are objects, func_array is an array of objects. Thus any other method will hardly provide any speedup over iteration.
This is what I've learnt. I'm open to more experienced advice.
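In practice this means a plain loop is about as good as it gets. A minimal sketch, assuming every function accepts the same argument and returns a scalar (the name apply_all and the sample functions are illustrative only):

```python
import numpy as np

def apply_all(funcs, x):
    """Call every function in `funcs` on `x` and collect the results in an array."""
    return np.array([f(x) for f in funcs])

# Illustrative functions; any callables taking one argument would do
funcs = [np.sin, np.cos, lambda v: v ** 2]
result = apply_all(funcs, 2.0)
```

Keeping the functions in an ordinary Python list is enough here; storing them in an object array does not make the calls any faster.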
I have a 3-dimensional DataArray (using xarray). I would like to apply a 1-dimensional filter to it along a certain dimension. Specifically, I want to apply the scipy.signal.medfilt() function but, again, it should be 1-dimensional.
So far I've successfully implemented this the following way:
for sample in data_raw.coords["sample"]:
    for experiment in data_raw.coords["experiment"]:
        data_filtered.loc[sample,experiment,:] = signal.medfilt(data_raw.loc[sample,experiment,:], 15)
(My data array has dimensions "sample", "experiment" and "wave_number". This code applies the filter along the "wave_number" dimension.)
The problem with this is that it takes rather long to calculate and my intuition tells me that looping through coordinates like this is an inefficient way to do it. So I'm thinking about using the xarray.apply_ufunc() function, especially since I've used it in a similar fashion in the same code:
xr.apply_ufunc(np.linalg.norm, data, kwargs={"axis": 2}, input_core_dims=[["wave_number"]])
(This calculates the length of the vector along the "wave_number" dimension.)
I originally also had this loop through the coordinates just like the first code here.
The problem is when I try
xr.apply_ufunc(signal.medfilt, data_smooth, kwargs={"kernel_size": 15})
it returns a data array full of zeroes, presumably because it applies a 3D median filter and the data array contains NaN entries. I realize that the problem here is that I need to feed the scipy.signal.medfilt() function a 1D array but unfortunately there is no way to specify an axis along which to apply the filter (unlike numpy.linalg.norm()).
So, how do I apply a 1D median filter without looping through coordinates?
If I understood correctly, you should use it like this:
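xr.apply_ufunc(signal.medfilt, data_smooth, kwargs={"kernel_size": 15}, input_core_dims=[["wave_number"]], output_core_dims=[["wave_number"]], vectorize=True)
With vectorize=True, the input function is wrapped so that it is applied to each 1-D slice along the core dimension. Because medfilt returns an array of the same length as its input, the core dimension also has to be listed in output_core_dims.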
Nonetheless, as stated in the documentation:
This option exists for convenience, but is almost always slower than supplying a pre-vectorized function
because the implementation is essentially a for loop. However I still got faster results than by making my own loops.
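To illustrate what a pre-vectorized function could look like, here is a NumPy-only sketch of a median filter that acts on the last axis of an array of any shape. It is not scipy.signal.medfilt itself but a stand-in written for this example: the name medfilt_last_axis is made up, the zero padding at the edges follows what the medfilt documentation describes, and sliding_window_view requires NumPy 1.20 or newer.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def medfilt_last_axis(arr, kernel_size):
    """Median-filter `arr` along its last axis only (kernel_size must be odd)."""
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")
    half = kernel_size // 2
    # Zero-pad only the last axis so the output keeps the input length
    pad = [(0, 0)] * (arr.ndim - 1) + [(half, half)]
    padded = np.pad(arr, pad, mode="constant", constant_values=0)
    # Shape (..., n, kernel_size): one window per original element
    windows = sliding_window_view(padded, kernel_size, axis=-1)
    return np.median(windows, axis=-1)

# Hypothetical data shaped (sample, experiment, wave_number)
data = np.arange(24, dtype=float).reshape(2, 3, 4)
filtered = medfilt_last_axis(data, 3)
```

Since a function like this already handles the leading dimensions, it should be usable with xr.apply_ufunc without vectorize=True, as long as input_core_dims and output_core_dims name the filtered dimension. Note that np.median propagates NaN; np.nanmedian could be swapped in if NaN entries should be ignored.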
I recently hit some performance bottlenecks with symbolic matrix derivatives in Sympy (specifically, the single line of code evaluating symbolic matrices via substitution using lambdas was taking ~90% of the program's runtime), so I decided to give Theano a go.
Its previous application was evaluating the partial derivatives over the hyperparameters of a Gaussian process, where using a (1, k) dimension matrix of Sympy symbols (MatrixSymbol) worked nicely in terms of iterating over this list and differentiating the matrix on each item.
However, this doesn't carry over into Theano, and the documentation doesn't seem to detail how to do this. Indexing a symbolic vector in Theano returns the Subtensor type, which is invalid for calculating the gradient on.
Below is a simple (but entirely algorithmically incorrect - stripped down to the functionality I'm trying to obtain) version of what I'm attempting to do.
EDIT: I have modified the code sample to include the data as a tensor to be passed into the function as suggested below, and included an alternate attempt at instead using a list of separate scalar tensors as I cannot index the values of a symbolic Theano vector, though also to no avail.
import theano
import numpy as np
# Sample data
data = np.array(10*np.random.rand(5, 3), dtype='int64')
# Not including data as tensor, incorrect/invalid indexing of symbolic vector
l_scales_sym = theano.tensor.dvector('l_scales')
x = theano.tensor.dmatrix('x')
f = x/l_scales_sym
f_eval = theano.function([x, l_scales_sym], f)
df_dl = theano.gradient.jacobian(f.flatten(), l_scales_sym[0])
df_dl_eval = theano.function([x, l_scales_sym], df_dl)
The second last line of the code snippet is where I am trying to get a partial derivative over one of the elements in the list of 'length scale' variables, but this sort of indexing is inapplicable to the symbolic vectors.
Any help would be greatly appreciated!
When using theano, all variables should be defined as theano tensors (or shared variables); otherwise, the variable does not become part of the computational graph. In f = data/l_scales_sym the variable data is a numpy array. Try to also define it as a tensor; it should work.
I am mainly interested in (d1, d2) numpy arrays (matrices), but the question makes sense for arrays with more axes. I have a function f(i, j) and I'd like to initialize an array by applying this function to every pair of indices:
A=np.empty((d1,d2))
for i in range(d1):
    for j in range(d2):
        A[i,j]=f(i,j)
This is readable and works but I am wondering if there is a faster way since my array A will be very large and I have to optimize this bit.
One way is to use np.fromfunction. Your code can be replaced with the line:
np.fromfunction(f, shape=(d1, d2))
np.fromfunction calls f once with whole index arrays rather than once per element, so it should be quite a bit faster than Python for loops for larger arrays, provided f is written in terms of NumPy array operations.
a=np.arange(d1)
b=np.arange(d2)
A=f(a,b)
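Note that this only evaluates f element-wise on the pairs (a[k], b[k]) and requires arrays of equal size; to cover every (i, j) combination, you have to create a meshgrid:
X, Y = np.meshgrid(a, b, indexing="ij")
A = f(X, Y)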
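The approaches above can be compared side by side on a small case. This is a sketch under the assumption that f is built from array-friendly operations; the function f and the sizes d1, d2 below are placeholders chosen for illustration.

```python
import numpy as np

def f(i, j):
    # Illustrative function; works on scalars and on arrays alike
    return 10 * i + j

d1, d2 = 3, 4

# Reference: the explicit double loop
A_loop = np.empty((d1, d2))
for i in range(d1):
    for j in range(d2):
        A_loop[i, j] = f(i, j)

# np.fromfunction passes index arrays of shape (d1, d2) to f
A_from = np.fromfunction(f, (d1, d2))

# Broadcasting: a column of row indices against a row of column indices
A_bcast = f(np.arange(d1)[:, None], np.arange(d2)[None, :])
```

The broadcasting variant avoids materializing the full index grids, which can matter when the array is very large.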
For the sake of speeding up my algorithm that has numpy arrays with tens of thousands of elements, I'm wondering if I can reduce the time used by numpy.delete().
In fact, can I just eliminate it?
I have an algorithm where I've got my array alpha.
And this is what I'm currently doing:
alpha = np.delete(alpha, 0)
beta = sum(alpha)
But why do I need to delete the first element? Is it possible to simply sum up the entire array using all elements except the first one? Will that reduce the time used in the deletion operation?
Avoid np.delete whenever possible. It returns a new array, which means that new memory has to be allocated, and (almost) all the original data has to be copied into the new array. That's slow, so avoid it if possible.
beta = alpha[1:].sum()
should be much faster.
Note also that sum(alpha) is calling the Python builtin function sum. That's not the fastest way to sum items in a NumPy array.
alpha[1:].sum() calls the NumPy array method sum which is much faster.
Note that if you were calling np.delete in a loop, then the code may be deleting more than just the first element from the original alpha. In that case, as Sven Marnach points out, it would be more efficient to compute all the partial sums like this:
np.cumsum(alpha[:0:-1])[::-1]
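A small worked example may make the two expressions concrete. The values in alpha are made up, and the name suffix_sums is only used here for illustration.

```python
import numpy as np

alpha = np.array([3.0, 1.0, 4.0, 1.0, 5.0])

# Sum of everything except the first element; basic slicing creates a view, not a copy
beta = alpha[1:].sum()

# suffix_sums[k] equals alpha[k + 1:].sum(), i.e. the sum left after
# deleting the first k + 1 elements
suffix_sums = np.cumsum(alpha[:0:-1])[::-1]
```

Here beta is 11.0 and suffix_sums is [11., 10., 6., 5.], the same values that repeated np.delete(alpha, 0) followed by a sum would give, without any of the intermediate copies.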