Populate numpy matrix from the difference of two vectors - python

Is it possible to construct a numpy matrix from a function? In this case specifically the function is the absolute difference of two vectors: S[i,j] = abs(A[i] - B[j]). A minimal working example that uses regular python:
import numpy as np
A = np.array([1,3,6])
B = np.array([2,4,6])
S = np.zeros((3,3))
for i,x in enumerate(A):
for j,y in enumerate(B):
S[i,j] = abs(x-y)
Giving:
[[ 1. 3. 5.]
[ 1. 1. 3.]
[ 4. 2. 0.]]
It would be nice to have a construction that looks something like:
def build_matrix(shape, input_function, *args)
where I can pass an input function with it's arguments and retain the speed advantage of numpy.

In addition to what #JoshAdel has suggested, you can also use the outer method of any numpy ufunc to do the broadcasting in the case of two arrays.
In this case, you just want np.subtract.outer(A, B) (Or, rather, the absolute value of it).
While either one is fairly readable for this example, in some cases broadcasting is more useful, while in others using ufunc methods is cleaner.
Either way, it's useful to know both tricks.
E.g.
import numpy as np
A = np.array([1,3,6])
B = np.array([2,4,6])
diff = np.subtract.outer(A, B)
result = np.abs(diff)
Basically, you can use outer, accumulate, reduce, and reduceat with any numpy ufunc such as subtract, multiply, divide, or even things like logical_and, etc.
For example, np.cumsum is equivalent to np.add.accumulate. This means you could implement something like a cumdiv by np.divide.accumulate if you even needed to.

I recommend taking a look into numpy's broadcasting capabilities:
In [6]: np.abs(A[:,np.newaxis] - B)
Out[6]:
array([[1, 3, 5],
[1, 1, 3],
[4, 2, 0]])
http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html
Then you could simply write your function as:
In [7]: def build_matrix(func,args):
...: return func(*args)
...:
In [8]: def f1(A,B):
...: return np.abs(A[:,np.newaxis] - B)
...:
In [9]: build_matrix(f1,(A,B))
Out[9]:
array([[1, 3, 5],
[1, 1, 3],
[4, 2, 0]])
This should also be considerably faster than your solution for larger arrays.

Related

Faster alternative to using the `map` function

I have a function f, for exapmle:
def f(x):
return x**2
and want to obtain an array consisting of f evaluated over an interval, for example the unit interval (0,1). We ca do this as follows:
import numpy as np
X = np.arange(0,1,0.01)
arr = np.array(list(map(f, X)))
However, this last line is very time consuming when the function is complicated (in my case it involves some integrals). Is there a way to do this faster? I am happy to have a non-elegant solution - the focus is on speed.
You could use list comprehension to slightly decrease runtime.
arr = [f(x) for x in range(0, 5)] # range is the interval
This should work. It will only slightly decrease runtime though. You shouldn't be worried about runtime unless you use very large numbers with map().
If f is so complicated that it can't be expressed in terms of compiled array operations, and can only take scalars, I have found that frompyfunc gives the best performance (about 2x compared to an explicit loop)
In [76]: def f(x):
...: return x**2
...:
In [77]: foo = np.frompyfunc(f,1,1)
In [78]: foo(np.arange(4))
Out[78]: array([0, 1, 4, 9], dtype=object)
In [79]: foo(np.arange(4)).astype(int)
Out[79]: array([0, 1, 4, 9])
It returns dtype object, so needs an astype. np.vectorize uses this as well, but is a bit slower. Both generalize to various shapes of input array(s).
For a 1d result fromiter works with map (without the list) part:
In [84]: np.fromiter((f(x) for x in range(4)),int)
Out[84]: array([0, 1, 4, 9])
In [86]: np.fromiter(map(f, range(4)),int)
Out[86]: array([0, 1, 4, 9])
You'll have to do your own timings in a realistic case.
Use operations that operate on entire arrays. For example, with a function that just squares the input (slightly corrected from your example):
def f(x):
return x**2
then you'd just do
arr = f(X)
because NumPy defines operators like ** to operate on entire arrays at once.
Your real function might not be quite as straightforward. You say there are integrals involved; to make whole-array operations work with that, you might have to pass arguments differently or change what you're using to compute the integrals. In general, though, whole-array operations will vastly outperform anything that needs to call Python-level code in a loop.
You could try numpy.vectorize. It's very good way to apply function to list or array
import numpy as np
def foo(x):
return x**2
foo = np.vectorize(foo)
arr = np.arange(10)
In [1]: foo(arr)
Out[1]: array([ 0, 1, 4, 9, 16, 25, 36, 49, 64, 81])

Iterate over numpy with index (numpy equivalent of python enumerate)

I'm trying to create a function that will calculate the lattice distance (number of horizontal and vertical steps) between elements in a multi-dimensional numpy array. For this I need to retrieve the actual numbers from the indexes of each element as I iterate through the array. I want to store those values as numbers that I can run through a distance formula.
For the example array A
A=np.array([[1,2,3],[4,5,6],[7,8,9]])
I'd like to create a loop that iterates through each element and for the first element 1 it would retrieve a=0, b=0 since 1 is at A[0,0], then a=0, b=1 for element 2 as it is located at A[0,1], and so on...
My envisioned output is two numbers (corresponding to the two index values for that element) for each element in the array. So in the example above, it would be the two values that I am assigning to be a and b. I only will need to retrieve these two numbers within the loop (rather than save separately as another data object).
Any thoughts on how to do this would be greatly appreciated!
As I've become more familiar with the numpy and pandas ecosystem, it's become clearer to me that iteration is usually outright wrong due to how slow it is in comparison, and writing to use a vectorized operation is best whenever possible. Though the style is not as obvious/Pythonic at first, I've (anecdotally) gained ridiculous speedups with vectorized operations; more than 1000x in a case of swapping out a form like some row iteration .apply(lambda)
#MSeifert's answer much better provides this and will be significantly more performant on a dataset of any real size
More general Answer by #cs95 covering and comparing alternatives to iteration in Pandas
Original Answer
You can iterate through the values in your array with numpy.ndenumerate to get the indices of the values in your array.
Using the documentation above:
A = np.array([[1,2,3],[4,5,6],[7,8,9]])
for index, values in np.ndenumerate(A):
print(index, values) # operate here
You can do it using np.ndenumerate but generally you don't need to iterate over an array.
You can simply create a meshgrid (or open grid) to get all indices at once and you can then process them (vectorized) much faster.
For example
>>> x, y = np.mgrid[slice(A.shape[0]), slice(A.shape[1])]
>>> x
array([[0, 0, 0],
[1, 1, 1],
[2, 2, 2]])
>>> y
array([[0, 1, 2],
[0, 1, 2],
[0, 1, 2]])
and these can be processed like any other array. So if your function that needs the indices can be vectorized you shouldn't do the manual loop!
For example to calculate the lattice distance for each point to a point say (2, 3):
>>> abs(x - 2) + abs(y - 3)
array([[5, 4, 3],
[4, 3, 2],
[3, 2, 1]])
For distances an ogrid would be faster. Just replace np.mgrid with np.ogrid:
>>> x, y = np.ogrid[slice(A.shape[0]), slice(A.shape[1])]
>>> np.hypot(x - 2, y - 3) # cartesian distance this time! :-)
array([[ 3.60555128, 2.82842712, 2.23606798],
[ 3.16227766, 2.23606798, 1.41421356],
[ 3. , 2. , 1. ]])
Another possible solution:
import numpy as np
A=np.array([[1,2,3],[4,5,6],[7,8,9]])
for _, val in np.ndenumerate(A):
ind = np.argwhere(A==val)
print val, ind
In this case you will obtain the array of indexes if value appears in array not once.

Can Python perform vectorized operations?

I want to implement the following Matlab code in Python:
x=1:100;
y=20*log10(x);
I tried using Numpy to do this:
y = numpy.zeros(x.shape)
for i in range(len(x)):
y[i] = 20*math.log10(x[i])
But this uses a for loop; is there anyway to do a vectorized operation like in Matlab? I know for some simple math such as division and multiplication, it's possible. But what about other more sophisticated operations like logarithm here?
y = numpy.log10(numpy.arange(1, 101)) * 20
In [30]: numpy.arange(1, 10)
Out[30]: array([1, 2, 3, 4, 5, 6, 7, 8, 9])
In [31]: numpy.log10(numpy.arange(1, 10))
Out[31]:
array([ 0. , 0.30103 , 0.47712125, 0.60205999, 0.69897 ,
0.77815125, 0.84509804, 0.90308999, 0.95424251])
In [32]: numpy.log10(numpy.arange(1, 10)) * 20
Out[32]:
array([ 0. , 6.02059991, 9.54242509, 12.04119983,
13.97940009, 15.56302501, 16.9019608 , 18.06179974, 19.08485019])
Yep, there certainly is.
x = numpy.arange(1, 100)
y = 20 * numpy.log10(x)
Numpy has a lot of built-in array operators like log10. If it's not listed in numpy's documentation and you can't generate it from combining built-in methods, then there's no easy way to do it efficiently. You can implement a C-level function to work on numpy arrays and compile that, but this is a lot more work than one or two lines of Python code.
For your case you almost have the right output already:
y = 20*numpy.log10(x)
You may want to take a look at the Numpy documentation. This is a good place to start:
http://docs.scipy.org/doc/numpy/reference/routines.html
And specifically related to your question:
http://docs.scipy.org/doc/numpy/reference/routines.math.html
If you're not trying to do anything complicated, the original code could be implemented this way as well, without requiring the use of numpy, if I'm not mistaken.
>>> import math
>>> x = range(1, 101)
>>> y = [ 20 * math.log10(z) for z in x ]
Apart from performing vectorized operation using numpy standard vectorized functions, you can also make your custom vectorized function using numpy.vectorize. Here is one example:
>>> def myfunc(a, b):
... "Return a-b if a>b, otherwise return a+b"
... if a > b:
... return a - b
... else:
... return a + b
>>>
>>> vfunc = np.vectorize(myfunc)
>>> vfunc([1, 2, 3, 4], 2)
array([3, 4, 1, 2])
As mentioned in documentation, unlike numpy's standard vectorized functions, this won't improve the performance

Indexing NumPy 2D array with another 2D array

I have something like
m = array([[1, 2],
[4, 5],
[7, 8],
[6, 2]])
and
select = array([0,1,0,0])
My target is
result = array([1, 5, 7, 6])
I tried _ix as I read at Simplfy row AND column extraction, numpy, but this did not result in what I wanted.
p.s. Please change the title of this question if you can think of a more precise one.
The numpy way to do this is by using np.choose or fancy indexing/take (see below):
m = array([[1, 2],
[4, 5],
[7, 8],
[6, 2]])
select = array([0,1,0,0])
result = np.choose(select, m.T)
So there is no need for python loops, or anything, with all the speed advantages numpy gives you. m.T is just needed because choose is really more a choise between the two arrays np.choose(select, (m[:,0], m[:1])), but its straight forward to use it like this.
Using fancy indexing:
result = m[np.arange(len(select)), select]
And if speed is very important np.take, which works on a 1D view (its quite a bit faster for some reason, but maybe not for these tiny arrays):
result = m.take(select+np.arange(0, len(select) * m.shape[1], m.shape[1]))
I prefer to use NP.where for indexing tasks of this sort (rather than NP.ix_)
What is not mentioned in the OP is whether the result is selected by location (row/col in the source array) or by some condition (e.g., m >= 5). In any event, the code snippet below covers both scenarios.
Three steps:
create the condition array;
generate an index array by calling NP.where, passing in this
condition array; and
apply this index array against the source array
>>> import numpy as NP
>>> cnd = (m==1) | (m==5) | (m==7) | (m==6)
>>> cnd
matrix([[ True, False],
[False, True],
[ True, False],
[ True, False]], dtype=bool)
>>> # generate the index array/matrix
>>> # by calling NP.where, passing in the condition (cnd)
>>> ndx = NP.where(cnd)
>>> ndx
(matrix([[0, 1, 2, 3]]), matrix([[0, 1, 0, 0]]))
>>> # now apply it against the source array
>>> m[ndx]
matrix([[1, 5, 7, 6]])
The argument passed to NP.where, cnd, is a boolean array, which in this case, is the result from a single expression comprised of compound conditional expressions (first line above)
If constructing such a value filter doesn't apply to your particular use case, that's fine, you just need to generate the actual boolean matrix (the value of cnd) some other way (or create it directly).
What about using python?
result = array([subarray[index] for subarray, index in zip(m, select)])
IMHO, this is simplest variant:
m[np.arange(4), select]
Since the title is referring to indexing a 2D array with another 2D array, the actual general numpy solution can be found here.
In short:
A 2D array of indices of shape (n,m) with arbitrary large dimension m, named inds, is used to access elements of another 2D array of shape (n,k), named B:
# array of index offsets to be added to each row of inds
offset = np.arange(0, inds.size, inds.shape[1])
# numpy.take(B, C) "flattens" arrays B and C and selects elements from B based on indices in C
Result = np.take(B, offset[:,np.newaxis]+inds)
Another solution, which doesn't use np.take and I find more intuitive, is the following:
B[np.expand_dims(np.arange(B.shape[0]), -1), inds]
The advantage of this syntax is that it can be used both for reading elements from B based on inds (like np.take), as well as for assignment.
result = array([m[j][0] if i==0 else m[j][1] for i,j in zip(select, range(0, len(m)))])

Difference between frompyfunc and vectorize in numpy

What is the difference between vectorize and frompyfunc in numpy?
Both seem very similar. What is a typical use case for each of them?
Edit: As JoshAdel indicates, the class vectorize seems to be built upon frompyfunc. (see the source). It is still unclear to me whether frompyfunc may have any use case that is not covered by vectorize...
As JoshAdel points out, vectorize wraps frompyfunc. Vectorize adds extra features:
Copies the docstring from the original function
Allows you to exclude an argument from broadcasting rules.
Returns an array of the correct dtype instead of dtype=object
Edit: After some brief benchmarking, I find that vectorize is significantly slower (~50%) than frompyfunc for large arrays. If performance is critical in your application, benchmark your use-case first.
`
>>> a = numpy.indices((3,3)).sum(0)
>>> print a, a.dtype
[[0 1 2]
[1 2 3]
[2 3 4]] int32
>>> def f(x,y):
"""Returns 2 times x plus y"""
return 2*x+y
>>> f_vectorize = numpy.vectorize(f)
>>> f_frompyfunc = numpy.frompyfunc(f, 2, 1)
>>> f_vectorize.__doc__
'Returns 2 times x plus y'
>>> f_frompyfunc.__doc__
'f (vectorized)(x1, x2[, out])\n\ndynamic ufunc based on a python function'
>>> f_vectorize(a,2)
array([[ 2, 4, 6],
[ 4, 6, 8],
[ 6, 8, 10]])
>>> f_frompyfunc(a,2)
array([[2, 4, 6],
[4, 6, 8],
[6, 8, 10]], dtype=object)
`
I'm not sure what the different use cases for each is, but if you look at the source code (/numpy/lib/function_base.py), you'll see that vectorize wraps frompyfunc. My reading of the code is mostly that vectorize is doing proper handling of the input arguments. There might be particular instances where you would prefer one vs the other, but it would seem that frompyfunc is just a lower level instance of vectorize.
Although both methods provide you a way to build your own ufunc, numpy.frompyfunc method always returns a python object, while you could specify a return type when using numpy.vectorize method

Categories

Resources