I have this array
A = array([[-0.49740509, -0.48618909, -0.49145315],
[-0.48959259, -0.48618909, -0.49145315],
[-0.49740509, -0.47837659, -0.49145315],
...,
[ 0.03079315, -0.01194593, -0.06872366],
[ 0.03054901, -0.01170179, -0.06872366],
[ 0.03079315, -0.01170179, -0.06872366]])
which is a collection of 3D vectors. I was wondering if I could use a vectorized operation to get an array containing the norm of each of my vectors.
I tried norm(A), but it didn't work.
Doing it manually might be fastest (although there's always some neat trick someone posts I didn't think of):
In [75]: from numpy import random, array
In [76]: from numpy.linalg import norm
In [77]: A = random.rand(1000,3)
In [78]: timeit normedA_0 = array([norm(v) for v in A])
100 loops, best of 3: 16.5 ms per loop
In [79]: timeit normedA_1 = array(map(norm, A))
100 loops, best of 3: 16.9 ms per loop
In [80]: timeit normedA_2 = map(norm, A)
100 loops, best of 3: 16.7 ms per loop
In [81]: timeit normedA_4 = (A*A).sum(axis=1)**0.5
10000 loops, best of 3: 46.2 us per loop
This assumes everything's real. Could multiply by the conjugate instead if that's not true.
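For the complex case, a minimal sketch of that conjugate variant (the data here is made up for the example):
import numpy as np

A = np.random.rand(5, 3) + 1j * np.random.rand(5, 3)   # made-up complex data
# multiplying each element by its conjugate makes the squared magnitudes real
normedA = np.sqrt((A * A.conj()).real.sum(axis=1))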
Update: Eric's suggestion of using math.sqrt won't work -- it doesn't handle numpy arrays -- but the idea of using sqrt instead of **0.5 is a good one, so let's test it.
In [114]: timeit normedA_4 = (A*A).sum(axis=1)**0.5
10000 loops, best of 3: 46.2 us per loop
In [115]: from numpy import sqrt
In [116]: timeit normedA_4 = sqrt((A*A).sum(axis=1))
10000 loops, best of 3: 45.8 us per loop
I tried it a few times, and this was the largest difference I saw.
I just had the same problem; maybe I'm late to answer, but this should help others.
You can use the axis argument in the norm function
norm(A, axis=1)
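As a quick sanity check (the axis argument was added in NumPy 1.8, if I recall correctly), this agrees with the manual computation from the answer above:
import numpy as np
from numpy.linalg import norm

A = np.random.rand(1000, 3)
# one Euclidean norm per row, i.e. per 3D vector
assert np.allclose(norm(A, axis=1), np.sqrt((A * A).sum(axis=1)))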
Having never used numpy, I'm going to guess:
normedA = array([norm(v) for v in A])
How about this method? (Note the list comprehension: passing a bare generator to array would give a 0-d object array rather than an array of norms.)
Also you might want to add a [numpy] tag to the post.
Related
I want to compute the output error of a neural network for each input by comparing the output signal with its true output value, so I need two matrices for this task.
I have an output matrix of shape (n, 1), but for the label I just have the index of the neuron that should be activated, so I need a matrix of the same shape with all elements equal to zero except the one whose index equals the label. I could do that with a function, but I wonder whether there is a built-in method in numpy that can do that for me?
You can do that in multiple ways using numpy or the standard library; one way is to create an array of zeros and set the value at the label's index to 1:
import numpy as np

n = len(result)   # result is the network's output vector from the question
a = np.zeros((n,))
a[id] = 1         # id is the label index (note: this name shadows the Python builtin)
It probably is going to be the fastest one as well:
>> %timeit a = np.zeros((n,)); a[id] = 1
1000000 loops, best of 3: 634 ns per loop
Alternatively you can use numpy.pad to pad a [1] array with zeros. But this will almost certainly be slower due to the padding logic. Note the pad widths must sum to n - 1 so the result has length n:
np.lib.pad([1], (id, n - id - 1), 'constant', constant_values=(0))
As expected, it's an order of magnitude slower:
>> %timeit np.lib.pad([1], (id, n - id - 1), 'constant', constant_values=(0))
10000 loops, best of 3: 47.4 µs per loop
And you can try a list comprehension, as suggested in the comments (iterating over range(n) so the full vector is built):
np.matrix([1 if i == id else 0 for i in range(n)])
But it is much slower than the first method as well:
>> %timeit np.matrix([1 if i == id else 0 for i in range(n)])
100000 loops, best of 3: 7.25 µs per loop
Edit:
But in my opinion, if you want to compute the neural network's error, you should just use np.argmax and check whether the prediction was successful or not. That error calculation may give you more noise than useful signal. You can build a confusion matrix if you feel your network is prone to confusing similar classes.
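For illustration, a minimal sketch of that argmax check (outputs and labels are hypothetical names, not from the question):
import numpy as np

outputs = np.random.rand(100, 10)             # (num_samples, num_classes)
labels = np.random.randint(0, 10, size=100)   # true class indices
predictions = np.argmax(outputs, axis=1)      # index of the most activated neuron
accuracy = (predictions == labels).mean()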
A few other methods that also seem to be slower than #umutto's above:
%timeit a = np.zeros((n,)); a[id] = 1 #umutto's method
The slowest run took 45.34 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.53 µs per loop
Boolean construction:
%timeit a = np.arange(n) == id
The slowest run took 13.98 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 3.76 µs per loop
Boolean construction to integer:
%timeit a = (np.arange(n) == id).astype(int)
The slowest run took 15.31 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.47 µs per loop
List construction:
%timeit a = [0]*n; a[id] = 1; a=np.asarray(a)
10000 loops, best of 3: 77.3 µs per loop
Using scipy.sparse (with from scipy import sparse):
%timeit a = sparse.coo_matrix(([1], ([id],[0])), shape=(n,1))
10000 loops, best of 3: 51.1 µs per loop
Now what's actually faster may depend on what's being cached, but it seems like constructing the zero array is probably fastest, especially if you can use np.zeros_like(result) instead of np.zeros(len(result)).
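A minimal zeros_like sketch (result and id are placeholder names standing in for the question's output vector and label index):
import numpy as np

result = np.random.rand(10)   # stand-in for the network's output vector
id = 3                        # stand-in for the label index
a = np.zeros_like(result)     # same shape and dtype as result, no len() needed
a[id] = 1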
One liner:
x = np.identity(n)[id]
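One caveat with that one-liner: np.identity(n) builds the full n-by-n matrix just to read one row. If memory matters, np.eye can build just the needed row (a sketch, with made-up n and id):
import numpy as np

n, id = 10, 3
x = np.identity(n)[id]        # allocates n*n elements
y = np.eye(1, n, id).ravel()  # allocates only n elements (the k=id diagonal of a 1-by-n matrix)
assert np.array_equal(x, y)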
I have a list of floats I get from a machine learning algorithm. All these floats are between 0 and 1:
probs = [proba[0] for proba in self.classifier.predict_proba(x_test)]
probs is my list of floats. The predict_proba() function normally returns a numpy array. It takes about 9 seconds to get the list, and the list finally contains about 60k values.
I would like to scale, or normalize, all the values in the list against the highest value in the list.
Normally, I would do that:
maximum = max(probs)
list_values = [proba / maximum for proba in probs]
But for 60k values, it takes about 2 minutes. I would like to make it shorter.
Do you have any idea how I could achieve better performance?
If you don't mind using an external library, numpy might be worth looking into:
import numpy
probs = numpy.array([proba[0] for proba in self.classifier.predict_proba(x_test)])
maximum = probs.max()
list_values = probs / maximum
Another approach using numpy, potentially faster if your list of probabilities is large, is to convert all your probabilities to a numpy array at once and then operate on it:
import numpy as np
probs = np.asarray(self.classifier.predict_proba(x_test))
list_values = probs[:, 0] / probs.max()
The first line converts all your probabilities to an N x M array (where N is the number of samples and M the number of classes).
The second line selects all the probabilities for the first class ([:, 0] means all rows of column 0, which yields a vector of size N) and divides it by the maximum.
You can potentially extend this to all your probabilities:
all_probs = probs / probs.max()
The above will normalize all your probabilities for all the classes. And later you can access them like all_probs[:, i] where i is the class of interest.
You should use scikit-learn's normalize:
from sklearn.preprocessing import normalize
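A sketch of how that might look here (assuming a 1-D probs array; norm='max' scales each row by its maximum, which matches the goal of dividing by the largest value):
import numpy as np
from sklearn.preprocessing import normalize

probs = np.random.rand(60000)   # stand-in for the real probabilities
scaled = normalize(probs.reshape(1, -1), norm='max').ravel()
# scaled.max() is now 1.0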
If you want your end result to be a numpy.array, it is faster to convert your list to a numpy array beforehand and use array division directly rather than a list comprehension. Example -
import numpy as np
probsnp = np.array([proba[0] for proba in self.classifier.predict_proba(x_test)])
maximum = probsnp.max()
list_values = probsnp / maximum
Examples of timing tests -
In [46]: import numpy.random as ndr
In [47]: probs = ndr.random_sample(1000)
In [48]: probs.shape
Out[48]: (1000,)
In [49]: def func1(probs):
....: maximum = max(probs)
....: probsnew = [i/maximum for i in probs]
....: return probsnew
....:
In [50]: def func2(probs):
....: maximum = probs.max()
....: probsnew = probs/maximum
....: return probsnew
....:
In [51]: %timeit func1(probs)
The slowest run took 229.79 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 279 µs per loop
In [52]: %timeit func1(probs)
1000 loops, best of 3: 278 µs per loop
In [53]: %timeit func2(probs)
The slowest run took 356.45 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 81 µs per loop
In [54]: %timeit func1(probs)
1000 loops, best of 3: 278 µs per loop
In [55]: %timeit func2(probs)
10000 loops, best of 3: 81.5 µs per loop
The numpy method takes only about a third of the time of the list comprehension.
Timing tests with numpy.array() conversion as part of func2 (in above example) -
In [60]: probslist = [p for p in probs]
In [61]: def func2(probs):
....: probsnp = np.array(probs)
....: maxprobs = probsnp.max()
....: probsnew = probsnp/maxprobs
....: return probsnew
....:
In [65]: %timeit func1(probslist)
1000 loops, best of 3: 212 µs per loop
In [66]: %timeit func2(probslist)
10000 loops, best of 3: 198 µs per loop
In [67]: probs = ndr.random_sample(60000)
In [68]: probslist = [p for p in probs]
In [74]: %timeit func1(probslist)
100 loops, best of 3: 11.5 ms per loop
In [75]: %timeit func2(probslist)
100 loops, best of 3: 5.79 ms per loop
In [76]: %timeit func1(probslist)
100 loops, best of 3: 11.4 ms per loop
In [77]: %timeit func2(probslist)
100 loops, best of 3: 5.81 ms per loop
Seems like it's still a little faster to use a numpy array.
How would you vectorize the evaluation of arrays of lambda functions?
Here's an example to understand what I'm talking about. (And even though I'm using numpy arrays, I'm not limiting myself to only using numpy.)
Let's say I have the following numpy arrays.
array1 = np.array(["hello", 9])
array2 = np.array([lambda s: s == "hello", lambda num: num < 10])
(You can store these kinds of objects in a numpy array without numpy throwing an error, believe it or not.) What I want is something akin to the following.
array2 * array1
# Return np.array([True, True]). PS: An explanation of how to `AND` all of
# booleans together quickly would be nice too.
Of course, this seems impractical for arrays of size 2, but for arrays of arbitrary sizes, I'll assume this would yield a performance boost because of all of the low level optimizations.
So, anyone know how to write this weird kind of python code?
The simple answer, of course, is that you can't easily do this with numpy (or with standard Python, for that matter). Numpy doesn't actually vectorize most operations itself, to my knowledge: it uses libraries like BLAS/ATLAS/etc. that do, for certain operations. Even where it does, it does so in C for specific cases: it certainly can't vectorize the execution of arbitrary Python functions.
If you want to involve multiprocessing in this, it is possible, but it depends on your situation. Are your individual function applications time-consuming, making them feasible to send out one-by-one, or do you need a very large number of fast function executions, in which case you'd probably want to send batches of them to each process?
In general, because of what could be argued is poor fundamental design (e.g., the Global Interpreter Lock), it's very difficult in standard Python to get the lightweight parallelization you're hoping for here. There are significantly heavier approaches, like the multiprocessing module or IPython.parallel, but these require some work to use.
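To illustrate the batching idea, here is a minimal multiprocessing sketch (all names are made up for the example). Note that lambdas can't be pickled, so anything sent to worker processes must be a module-level function:
from multiprocessing import Pool

def check_hello(s):
    return s == "hello"

def check_small(num):
    return num < 10

def apply_pair(pair):
    func, arg = pair
    return func(arg)

if __name__ == '__main__':
    funcs = [check_hello, check_small] * 5000   # pretend there are many of these
    args = ["hello", 9] * 5000
    pool = Pool(4)
    # chunksize batches the pairs so each trip to a worker carries many calls
    results = pool.map(apply_pair, list(zip(funcs, args)), chunksize=500)
    pool.close()
    pool.join()
    print(all(results))   # AND all the booleans together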
Alright guys, I have an answer: numpy's vectorize.
Please read the edited section, though. You'll discover that Python actually optimizes this code for you, which defeats the purpose of using numpy arrays in this case. (But using numpy arrays does not decrease the performance.)
What the last test really shows is that Python lists are about as efficient as they could be, so this vectorization procedure is unnecessary. This is why I didn't mark this as the "best answer".
Setup code:
def factory(i): return lambda num: num==i
array1 = list()
for i in range(10000): array1.append(factory(i))
array1 = np.array(array1)
array2 = np.array(xrange(10000))
The "unvectorized" version:
def evaluate(array1, array2):
    return [func(val) for func, val in zip(array1, array2)]
%timeit evaluate(array1, array2)
# 100 loops, best of 3: 10 ms per loop
The vectorized version
def evaluate2(func, b): return func(b)
vec_evaluate = np.vectorize(evaluate2)
%timeit vec_evaluate(array1, array2)
# 100 loops, best of 3: 2.65 ms per loop
EDIT
Okay, I just wanted to paste more benchmarks that I received using the above tests, except with different test cases.
I made a third edit showing what happens if you simply use Python lists. Long story short: you won't regret it much. That test case is at the very bottom.
Test cases only involving integers
In summary, if n is small, then the unvectorized version is better. Otherwise, vectorized is the way to go.
With n = 30
%timeit evaluate(array1, array2)
# 10000 loops, best of 3: 35.7 µs per loop
%timeit vec_evaluate(array1, array2)
# 10000 loops, best of 3: 27.6 µs per loop
With n = 7
%timeit evaluate(array1, array2)
100000 loops, best of 3: 9.93 µs per loop
%timeit vec_evaluate(array1, array2)
10000 loops, best of 3: 21.6 µs per loop
Test cases involving strings
Vectorization wins.
Setup code:
def factory(i): return lambda num: str(num)==str(i)
array1 = list()
for i in range(7):
    array1.append(factory(i))
array1 = np.array(array1)
array2 = np.array(xrange(7))
With n = 10000
%timeit evaluate(array1, array2)
10 loops, best of 3: 36.7 ms per loop
%timeit vec_evaluate(array1, array2)
100 loops, best of 3: 6.57 ms per loop
With n = 7
%timeit evaluate(array1, array2)
10000 loops, best of 3: 28.3 µs per loop
%timeit vec_evaluate(array1, array2)
10000 loops, best of 3: 27.5 µs per loop
Random tests
Just to see how branch prediction played a role. From what I'm seeing, it didn't really change much. Vectorization still usually wins.
Setup code.
from random import random

def factory(i):
    if random() < 0.5:
        return lambda num: str(num) == str(i)
    return lambda num: num == i
When n = 10000
%timeit evaluate(array1, array2)
10 loops, best of 3: 25.7 ms per loop
%timeit vec_evaluate(array1, array2)
100 loops, best of 3: 4.67 ms per loop
When n = 7
%timeit evaluate(array1, array2)
10000 loops, best of 3: 23.1 µs per loop
%timeit vec_evaluate(array1, array2)
10000 loops, best of 3: 23.1 µs per loop
Using python lists instead of numpy arrays
I ran this test to see what happened when I chose not to use the "optimized" numpy arrays, and I received some very surprising results.
The setup code is almost the same, except I'm choosing not to use numpy arrays. I'm also doing this test for only the "random" case.
def factory(i):
    if random() < 0.5:
        return lambda num: str(num) == str(i)
    return lambda num: num == i
array1 = list()
for i in range(10000): array1.append(factory(i))
array2 = range(10000)
And the "unvectorized" version:
%timeit evaluate(array1, array2)
100 loops, best of 3: 4.93 ms per loop
You can see this is actually pretty surprising, because it is almost the same benchmark I got with my random test case using the vectorized evaluate.
%timeit vec_evaluate(array1, array2)
10 loops, best of 3: 19.8 ms per loop
Likewise, if you change these into numpy arrays before using vec_evaluate, you get the same 4.5 ms benchmark.
I know that in Python, the in-place operators use the __iadd__ method. For immutable types, __iadd__ is a workaround using __add__, e.g., like tmp = a + b; a = tmp, but mutable types (like lists) are modified in place, which causes a slight speed boost.
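To make that concrete, a small plain-Python illustration of the mutable/immutable difference:
a = [1]          # mutable: += modifies the list object in place
b = a
a += [2]
print(b)         # [1, 2] -- b sees the change because it is the same object

t = (1,)         # immutable: += rebinds the name to a new tuple
u = t
t += (2,)
print(u)         # (1,) -- u still points at the original object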
However, if I have a NumPy array and I modify its contained immutable types, e.g., integers or floats, there is an even more significant speed boost for the in-place version. How does this work? I did some example benchmarks below:
import numpy as np
def inplace(a, b):
    a += b
    return a

def assignment(a, b):
    a = a + b
    return a
int1 = 1
int2 = 1
list1 = [1]
list2 = [1]
npary1 = np.ones((1000,1000))
npary2 = np.ones((1000,1000))
print('Python integers')
%timeit inplace(int1, 1)
%timeit assignment(int2, 1)
print('\nPython lists')
%timeit inplace(list1, [1])
%timeit assignment(list2, [1])
print('\nNumPy Arrays')
%timeit inplace(npary1, 1)
%timeit assignment(npary2, 1)
What I would expect is a difference similar to that of the Python integers when I use the in-place operators on NumPy arrays; however, the results are completely different:
Python integers
1000000 loops, best of 3: 265 ns per loop
1000000 loops, best of 3: 249 ns per loop
Python lists
1000000 loops, best of 3: 449 ns per loop
1000000 loops, best of 3: 638 ns per loop
NumPy Arrays
100 loops, best of 3: 3.76 ms per loop
100 loops, best of 3: 6.6 ms per loop
Each call to assignment(npary2, 1) requires creating a new one million element array. Consider how much time it takes just to allocate a (1000, 1000)-shaped array of ones:
In [21]: %timeit np.ones((1000, 1000))
100 loops, best of 3: 3.84 ms per loop
Allocating this new temporary array takes about 3.84 ms on my machine, and it is the right order of magnitude to explain the entire difference between inplace(npary1, 1) and assignment(npary2, 1):
In [12]: %timeit inplace(npary1, 1)
1000 loops, best of 3: 1.8 ms per loop
In [13]: %timeit assignment(npary2, 1)
100 loops, best of 3: 4.04 ms per loop
So, given that allocation is a relatively slow process, it makes sense that in-place addition is significantly faster than assignment to a new array.
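The buffer reuse can also be made explicit with the out parameter of NumPy's ufuncs, which (as far as I know) is what += amounts to for arrays:
import numpy as np

a = np.ones((1000, 1000))
np.add(a, 1, out=a)   # explicit form of a += 1: writes into the existing buffer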
NumPy operations on NumPy arrays may be fast, but creation of NumPy arrays is relatively slow. Consider, for example, how much more time it takes to create a NumPy array than a Python list:
In [14]: %timeit list()
10000000 loops, best of 3: 106 ns per loop
In [15]: %timeit np.array([])
1000000 loops, best of 3: 563 ns per loop
This is one reason why it is generally better to use one large NumPy array (allocated once) rather than thousands of small NumPy arrays.
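A minimal sketch of that pattern (the names here are just for illustration):
import numpy as np

n = 10000
# slow pattern: thousands of tiny allocations
many_small = [np.array([i]) for i in range(n)]
# usually better: one big allocation, filled in place
one_big = np.empty(n)
one_big[:] = np.arange(n)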
I have a problem where I have to do the following calculation.
I wanted to avoid the loop version, so I vectorized it.
Why is the loop version actually faster than the vectorized version?
Does anybody have an explanation for this?
Thanks
import numpy as np
from numpy.core.umath_tests import inner1d
num_vertices = 40000
num_pca_dims = 1000
num_vert_coords = 3
a = np.arange(num_vert_coords * num_vertices * num_pca_dims).reshape((num_pca_dims, num_vertices*num_vert_coords)).T
#n-by-3
norms = np.arange(num_vertices * num_vert_coords).reshape(num_vertices,-1)
#Loop version
def slowversion(a, norms):
    res_list = []
    for c_idx in range(a.shape[1]):
        curr_col = a[:, c_idx].reshape(-1, 3)
        res = inner1d(curr_col, norms)
        res_list.append(res)
    res_list_conc = np.column_stack(res_list)
    return res_list_conc
#Fast version
def fastversion(a, norms):
    a_3 = a.reshape(num_vertices, 3, num_pca_dims)
    fast_res = np.sum(a_3 * norms[:, :, None], axis=1)
    return fast_res
res_list_conc = slowversion(a,norms)
fast_res = fastversion(a,norms)
assert np.all(res_list_conc == fast_res)
Your "slow code" is likely doing better because inner1d is a single optimized C++ function that can* make use of your BLAS implementation. Lets look at comparable timings for this operation:
np.allclose(inner1d(a[:,0].reshape(-1,3), norms),
np.sum(a[:,0].reshape(-1,3)*norms,axis=1))
True
%timeit inner1d(a[:,0].reshape(-1,3), norms)
10000 loops, best of 3: 200 µs per loop
%timeit np.sum(a[:,0].reshape(-1,3)*norms,axis=1)
1000 loops, best of 3: 625 µs per loop
%timeit np.einsum('ij,ij->i',a[:,0].reshape(-1,3), norms)
1000 loops, best of 3: 325 µs per loop
Using inner1d is quite a bit faster than the pure numpy operations. Note that einsum is almost twice as fast as the pure numpy expression, and for good reason. As your loop is not that large and most of the FLOPs are in the inner computations, the savings from the inner operation outweigh the cost of looping.
%timeit slowversion(a,norms)
1 loops, best of 3: 991 ms per loop
%timeit fastversion(a,norms)
1 loops, best of 3: 1.28 s per loop
#Thanks to DSM for writing this out
%timeit np.einsum('ijk,ij->ik',a.reshape(num_vertices, num_vert_coords, num_pca_dims), norms)
1 loops, best of 3: 488 ms per loop
Putting this back together, we can see the overall advantage goes to the "slow version"; however, using an einsum implementation, which is fairly optimized for this sort of thing, gives a further speed increase.
*I don't see it right off in the code, but it is clearly threaded.