I want to apply outer addition of multiple vectors/matrices. Let's say four times:
import numpy as np
x = np.arange(100)
B = np.add.outer(x,x)
B = np.add.outer(B,x)
B = np.add.outer(B,x)
I would like best if the number of additions could be a variable, like a=4 --> 4 times the addition. Is this possible?
Approach #1
Here's one with array-initialization -
n = 4 # number of iterations to add outer versions
l = len(x)
out = np.zeros([l]*n,dtype=x.dtype)
for i in range(n):
out += x.reshape(np.insert([1]*(n-1),i,l))
Why this approach and not iterative addition to create new arrays at each iteration?
Iteratively creating a new array at each step would require extra allocations and hence more memory overhead. With array-initialization, we are adding reshaped views of x into an already initialized output array, so the approach stays memory-efficient.
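For reference, here is a minimal sketch (an illustration, not part of the original answer) of the iterative alternative being discussed, generalized to a variable n, where n is assumed to count the vectors being summed. Each np.add.outer call allocates a fresh, larger intermediate array, which is exactly the memory overhead the initialization-based approach avoids -
def chained_outer_add(x, n):
    # naive generalization of the chained np.add.outer calls from the question
    out = x
    for _ in range(n - 1):
        out = np.add.outer(out, x)  # allocates a new array with one more axis each step
    return out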
Alternative #1
We can skip one iteration by initializing the output with a broadcast copy of x. Hence, the changes would be -
out = np.broadcast_to(x,[l]*n).copy()
for i in range(n-1):
    out += x.reshape(np.insert([1]*(n-1),i,l))
Approach #2: With np.add.reduce -
Another way would be with np.add.reduce, which again doesn't create large intermediate arrays; being a reduction method, it might be better suited here, as that's exactly what it's implemented for -
l = len(x); n = 4
np.add.reduce([x.reshape(np.insert([1]*(n-1),i,l)) for i in range(n)])
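To make the broadcasting explicit, here is a small illustration (assuming n = 3 and the x from the question) of the shapes produced by the reshape trick; each view gets one "long" axis and broadcasts against the others -
for i in range(3):
    # prints (100, 1, 1), (1, 100, 1), (1, 1, 100)
    print(x.reshape(np.insert([1]*2, i, len(x))).shape)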
Timings -
In [17]: x = np.arange(100)
In [18]: %%timeit
...: n = 4 # number of iterations to add outer versions
...: l = len(x)
...: out = np.zeros([l]*n,dtype=x.dtype)
...: for i in range(n):
...: out += x.reshape(np.insert([1]*(n-1),i,l))
829 ms ± 28.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [19]: l = len(x); n = 4
In [20]: %timeit np.add.reduce([x.reshape(np.insert([1]*(n-1),i,l)) for i in range(n)])
183 ms ± 2.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I don't think there's a builtin argument to repeat this procedure several times, but you can define a custom function for it fairly easily
def recursive_outer_add(arr, num):
    # num counts how many copies of arr are outer-added together
    if num == 1:
        return arr
    x = np.add.outer(arr, arr)
    for i in range(num - 2):  # the first np.add.outer already combined two copies
        x = np.add.outer(x, arr)
    return x
Just as a warning: the array gets really big really fast
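To make that warning concrete, a rough size check (assuming int64 elements, as np.arange produces on most platforms): outer-adding four length-100 vectors already yields 100**4 elements of 8 bytes each, i.e. about 0.8 GB -
print(100**4 * np.dtype(np.int64).itemsize / 1e9)  # ~0.8 GB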
Short and reasonably fast:
n = 4
l = 10
x = np.arange(l)
sum(np.ix_(*n*(x,)))
from timeit import timeit
timeit(lambda:sum(np.ix_(*n*(x,))),number=1000)
# 0.049082988989539444
We can speed this up a little by going back to front:
timeit(lambda:sum(reversed(np.ix_(*n*(x,)))),number=1000)
# 0.03847671199764591
We can also build our own reversed np.ix_:
from operator import getitem
from itertools import accumulate,chain,repeat
sum(accumulate(chain((x,),repeat((slice(None),None),n-1)),getitem))
timeit(lambda:sum(accumulate(chain((x,),repeat((slice(None),None),n-1)),getitem)),number=1000)
# 0.02427654700295534
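As a quick sanity check (assuming the n = 4 and x = np.arange(l) from above), the accumulate-based sum matches the chained np.add.outer result, since outer addition is symmetric -
ref = np.add.outer(np.add.outer(np.add.outer(x, x), x), x)
alt = sum(accumulate(chain((x,), repeat((slice(None), None), n - 1)), getitem))
print(np.array_equal(ref, alt))  # True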
Basically, I have:
An array giving indexes "I", e.g. (1, 2),
And a list of the same length giving the corresponding number of repetitions "N", e.g. [1, 3]
And I want to create an array containing the indexes I repeated N times, i.e. (1, 2, 2, 2) here, where 1 is repeated one time and 2 is repeated 3 times.
The best solution I've come up with uses the np.repeat and np.concatenate functions:
import numpy as np
list_index = np.arange(2)
list_no_repetition = [1, 3]
result = np.concatenate([np.repeat(index, no_repetition)
for index, no_repetition in zip(list_index, list_no_repetition)])
print(result)
I wonder if there is a "prettier"/"more efficient" solution.
Thank you for your help.
Not sure about prettier, but you could solve it completely with list comprehension:
[x for i,l in zip(list_index, list_no_repetition) for x in [i]*l]
Hello, this is the alternative that I propose:
import numpy as np
list_index = np.arange(2)
list_no_repetition = [1, 3]
result = np.array([])
for i in range(len(list_index)):
tempA=np.empty(list_no_repetition[i])
tempA.fill(list_index[i])
result = np.concatenate([result, tempA])
result
You could also use a dictionary with key as the index and the value as the amount of times repeated. I think that Andreas had it right with the list comprehension.
import numpy as np
repeatdict = {
1:1,
2:3,
3:6
}
result = [x for key, value in repeatdict.items() for x in [key]*value]
print(result)
If by "efficiency" you mean speed, you can use timeit. Here are some results for some arbitrary, larger data.
First, define the functions and data:
# generate some data (list values/indices and number of reps)
N = 1000
li_2 = np.arange(N)
lnr_2 = np.random.randint(low=0, high=10, size=N)
# three functions produce the same result
def by_range(items, rep_cts):
x = np.full(sum(rep_cts), np.nan)
i = 0
for val, reps in zip(items, rep_cts):
x[i:i + reps] = val
i = i + reps
return x
def by_comp(items, reps):
return np.array([val for val, rep in zip(items, reps) for i in range(rep)])
def by_cat(list_index, list_no_repetition):
return np.concatenate([np.repeat(index, no_repetition)
for index, no_repetition in zip(list_index, list_no_repetition)])
About the same speed: first allocating an array and then filling it in, vs. doing a one-line double-for comprehension.
# 820 µs ± 11.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit by_range(li_2, lnr_2)
# 829 µs ± 4.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit by_comp(li_2, lnr_2)
Original method of concatenation is slightly slower:
# 2.19 ms ± 98.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit by_cat(li_2, lnr_2)
Note that the results will differ depending on where/how you run this, and the specific data you're dealing with.
For my class I need to write more optimized math functions using NumPy. The problem is that when using NumPy my solutions are slower than native Python.
A function which cubes all the elements of an array and sums them
Python:
def cube(x):
result = 0
for i in range(len(x)):
result += x[i] ** 3
return result
Mine, using NumPy (15-30% slower):
def cube(x):
it = numpy.nditer([x, None])
for a, b in it:
b[...] = a*a*a
return numpy.sum(it.operands[1])
Some random calculation function
Python:
def calc(x):
m = sum(x) / len(x)
result = 0
for i in range(len(x)):
result += (x[i] - m)**4
return result / len(x)
NumPy (>10x slower):
def calc(x):
m = numpy.mean(x)
result = 0
for i in range(len(x)):
result += numpy.power((x[i] - m), 4)
return result / len(x)
I don't know how to approach this; so far I have tried random functions from NumPy.
To elaborate on what has been said in comments:
NumPy's power comes from being able to do all the looping in fast C/Fortran rather than slow Python looping. For example, if you have an array x and you want to calculate the square of every value in that array, you could do
y = []
for value in x:
y.append(value**2)
or even (with a list comprehension)
y = [value**2 for value in x]
but it will be much faster if you can do all the looping inside numpy with
y = x**2
(assuming x is already a numpy array).
So for your examples, the proper way to do it in numpy would be
1.
def sum_of_cubes(x):
result = 0
for i in range(len(x)):
result += x[i] ** 3
return result
def sum_of_cubes_numpy(x):
return (x**3).sum()
2.
def calc(x):
m = sum(x) / len(x)
result = 0
for i in range(len(x)):
result += (x[i] - m)**4
return result / len(x)
def calc_numpy(x):
m = numpy.mean(x) # or just x.mean()
return numpy.sum((x - m)**4) / len(x)
Note that I've assumed that the input x is already a numpy array, not a regular Python list: if you have a list lst, you can create an array from it with arr = numpy.array(lst).
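As a quick sanity check (on assumed test data), the whole-array versions agree with the loop versions up to floating-point rounding -
import numpy
x = numpy.arange(1000, dtype=float)
assert numpy.isclose(sum_of_cubes(x), sum_of_cubes_numpy(x))
assert numpy.isclose(calc(x), calc_numpy(x))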
In [337]: def cube(x):
...: result = 0
...: for i in range(len(x)):
...: result += x[i] ** 3
...: return result
...:
nditer is not a good numpy iterator, at least not when used in Python-level code. It's really just a stepping stone toward writing compiled code. Its docs need a better disclaimer.
In [338]: def cube1(x):
...: it = numpy.nditer([x, None])
...: for a, b in it:
...: b[...] = a*a*a
...: return numpy.sum(it.operands[1])
...:
In [339]: cube(list(range(10)))
Out[339]: 2025
In [340]: cube1(list(range(10)))
Out[340]: 2025
In [341]: cube1(np.arange(10))
Out[341]: 2025
A more direct numpy iteration:
In [342]: def cube2(x):
...: it = [a*a*a for a in x]
...: return numpy.sum(it)
...:
The better whole-array code: just as sum can work on the whole array, the power operation also applies to the whole array.
In [343]: def cube3(x):
...: return numpy.sum(x**3)
...:
In [344]: cube2(np.arange(10))
Out[344]: 2025
In [345]: cube3(np.arange(10))
Out[345]: 2025
Doing some timings:
The list reference:
In [346]: timeit cube(list(range(1000)))
438 µs ± 9.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The slow nditer:
In [348]: timeit cube1(np.arange(1000))
2.8 ms ± 5.65 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The partial numpy:
In [349]: timeit cube2(np.arange(1000))
520 µs ± 20 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I can improve its time by passing a list instead of an array. Iteration on lists is faster.
In [352]: timeit cube2(list(range(1000)))
229 µs ± 9.53 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
But the time for a 'pure' numpy version blows all of those out of the water:
In [350]: timeit cube3(np.arange(1000))
23.6 µs ± 128 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The general rule is that numpy methods applied to a numpy array are fastest. But if you must loop, it's usually better to use lists.
Sometimes the pure numpy approach creates very large temporary arrays. Then memory management complexities can reduce performance. In such cases a modest number of iterations over a complex task may be best.
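As one sketch of that trade-off (an illustration, not from the original answer): keep the per-step work whole-array but loop over a modest number of chunks, so no temporary gets huge. The chunk size here is an arbitrary assumption -
def sum_of_cubes_chunked(x, chunk=100_000):
    # whole-array numpy inside each chunk, but only a chunk-sized temporary at a time
    return sum(np.sum(x[i:i + chunk] ** 3) for i in range(0, len(x), chunk))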
How can I improve significantly the speed of the following code? Can mapping, numpy, matrix operations be efficiently used and/or something else to omit the for loop?
import time
def func(x):
if x%2 == 0:
return 'even'
else:
return 'odd'
starttime = time.time()
MAX=1000000
y=list(range(MAX))
for n in range(MAX):
y[n]=[n,n**2,func(n)]
print('That took {} seconds'.format(time.time() - starttime))
The following replacement does not improve the speed:
import numpy as np
r = np.array(range(MAX))
str = ['odd', 'even']
result = np.array([r, r ** 2, list(map(lambda x: str[x % 2], r))])
y = result.T
I think you can do it this way; the idea is to use as many numpy built-in functions as possible:
%%timeit
y = np.arange(MAX)
y_2 = y**2
y_str = np.where(y%2==0,'even','odd')
res = np.rec.fromarrays((y,y_2,y_str), names=('y', 'y_2', 'y_str'))
#
# Some examples for working with the record array
res[3]
# (3, 9, 'odd')
res[:3]
# rec.array([(0, 0, 'even'), (1, 1, 'odd'), (2, 4, 'even')],
# dtype=[('y', '<i8'), ('y_2', '<i8'), ('y_str', '<U4')])
res['y_str'][:7]
# array(['even', 'odd', 'even', 'odd', 'even', 'odd', 'even'], dtype='<U4')
res.y_2[:7]
# array([ 0, 1, 4, 9, 16, 25, 36])
I have run several tests, and it is significantly faster.
For large arrays of the same type, numpy is the way to go. But numpy where is slow, so if you just want 'odd' and 'even', you can use np.tile or something like it:
MAX = 1000000
%%timeit
y = np.arange(MAX)
ystr = np.where(y%2==0,'even','odd')
# 14.9 ms ± 61.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
temp = np.array(['even', 'odd'])
ystr = np.tile(temp, MAX//2)
# 4.1 ms ± 112 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So tile is about 3-4x faster.
If you want something more complex, I'd still try to avoid where if speed is important. There's almost always a way because the where logic is usually simple so it's easy to take the logical expression that was inside the where and write it as an expression between numpy arrays. (Also, to be sure, using numpy and where will be much faster than pure Python lists, it's just usually slow relative to other numpy options.)
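As one concrete sketch of that rewriting (an illustration, not part of the original answer; it assumes y = np.arange(MAX) as in the timings above): index a small lookup array with the integer expression that would otherwise sit inside where -
labels = np.array(['even', 'odd'])
ystr = labels[y % 2]  # same strings as np.where(y % 2 == 0, 'even', 'odd')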
The others are fairly obvious:
y = np.arange(MAX)
y2 = y**2
Personally, I'd just stick these together in a list,
result = [y, y2, ystr]
Putting this all together (using tile), I get:
# 6.82 ms ± 84.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The short answer would be you can't.
Let's explore this a bit through examples.
# Original code
y=list(range(MAX))
for n in range(MAX):
y[n]=[n,n**2,func(n)]
# That took 0.86 seconds
This is the result on my machine so we have a baseline for comparison.
Let us make that a single line and shave off some time.
y = [[n, n ** 2, func(n)] for n in range(MAX)]
# That took 0.74 seconds
We are creating a list of lists and the Python interpreter needs to allocate an empty list MAX times.
In case you don't need to change the number of elements after initialization it might be better to use tuples instead of lists.
y = [(n, n ** 2, func(n)) for n in range(MAX)]
# That took 0.43 seconds
This is twice as fast as the original method.
Let's now assume that we can optimize even more by using some special library and then we just need to parse the result to populate the list. To simulate this we can pickle the list to a binary format and then measure the time it takes to load it.
import pickle
b = pickle.dumps([(n, n ** 2, func(n)) for n in range(MAX)])
starttime = time.time()
y = pickle.loads(b)
print('That took {:.2f} seconds'.format(time.time() - starttime))
# That took 0.23 seconds
This is probably close to what is possible to achieve without coding anything in a lower level language like C and creating Python objects from that language.
Alternative approach
If there is no requirement to create exactly the same object as in the original example and if it is enough that we can read y[10] or y[100:1000] we can do something completely different.
class LazyList():
def __init__(self, size):
self.size = size
def __getitem__(self, key):
if isinstance(key, slice):
r = range(self.size)[key]
return [(n, n ** 2, func(n)) for n in r]
return (key, key ** 2, func(key))
starttime = time.time()
y = LazyList(MAX)
print('That took {:.6f} seconds'.format(time.time() - starttime))
# That took 0.000005 seconds
This is multiple orders of magnitude faster. Of course, this is not a list and the results of the computation are not in memory. We created an object that will in some cases act like a list, but not in other cases (e.g. y[MAX*2] will work, even though it shouldn't). Note that with more work, the object can become even more similar to a list and also use a list as its base class.
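Here is a minimal sketch of that extra work; the added __len__ and bounds check are assumptions for illustration, not part of the original example -
class LazyList2(LazyList):
    def __len__(self):
        return self.size
    def __getitem__(self, key):
        if isinstance(key, int):
            if not -self.size <= key < self.size:
                raise IndexError('index out of range')
            key %= self.size  # support negative indices like a real list
        return super().__getitem__(key)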
If the object we got is converted to a list, the process spends the time that was saved by the alternative approach and the result is the same as in one of the previous examples.
y = y[:]
# That took 0.43 seconds
The longer answer is that it depends on the type of the result that is expected.
See below (it takes ~1 sec on my Mac):
import time
MAX = 1000000
starttime = time.time()
y = [[n, n ** 2, 'even' if n % 2 == 0 else 'odd'] for n in range(MAX)]
print('That took {} seconds'.format(time.time() - starttime))
I would recommend running it with multiple processes in parallel:
import time
import multiprocessing
def func(x):
return [x, x ** 2, "even" if x % 2 == 0 else "odd"]
if __name__ == '__main__':
starttime = time.time()
MAX = 1000000
pool = multiprocessing.Pool(10)
y = pool.map(func, range(MAX))
print('That took {} seconds'.format(time.time() - starttime))
Try to tune the number of processes to get the optimal value for your environment. On mine, it took ~0.8 secs with 20 processes while your original snippet took ~1.1 secs.
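One hedged way to do that tuning (reusing the func and MAX defined above) is to time the same workload over a few pool sizes and keep the best -
if __name__ == '__main__':
    for n_proc in (2, 4, 8, 16, 20):
        starttime = time.time()
        with multiprocessing.Pool(n_proc) as pool:
            pool.map(func, range(MAX))
        print(n_proc, 'processes:', round(time.time() - starttime, 2), 'seconds')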
Here is a numpy approach:
import time
import numpy as np
starttime = time.time()
r = np.arange(MAX)
res = [r, r ** 2, np.where(r % 2, 'odd', 'even')]
print('That took {:.4} seconds'.format(time.time() - starttime))
# That took 0.05125 seconds || original function took 1.5s
As Divakar pointed out, how to move on from here depends on what end result you want.
One option would be to have an object array with mixed types:
res = np.array(res, dtype=object).T
print('That took {:.4} seconds'.format(time.time() - starttime))
# That took 0.1863 seconds
res[17]
# array([17, 289, 'odd'], dtype=object)
res[18] + res[17]
# array([35, 613, 'evenodd'], dtype=object) # add for int and str
Unfortunately it is quite expensive to combine the 3 different arrays. It is still way faster than using loops, but depending on your next steps you could maybe make further improvements.
On my computer:
the original loop took about 1.01 seconds
NumPy solution took 10.3 ms
Numba solution took 4.25 ms
from numba import njit
import numpy as np
def f(x):
    y = x ** 2
    z = x % 2
    return y, z
@njit
def g(x):
    y = x ** 2
    z = x % 2
    return y, z
n_max = 1_000_000
x = np.arange(n_max, dtype=int)
NumPy:
%%timeit
y, z = f(x)
10.3 ms ± 296 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And Numba:
y, z = g(x) # don't time first run, which does compile AND execute
%%timeit
y, z = g(x)
4.25 ms ± 85.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Given S being an n x m matrix, as a numpy array, I want to call function f on pairs of (S[i], S[j]) to calculate a particular value of interest, stored in a matrix with dimensions n x n. In my particular case the function f is commutative so f(x,y) = f(y,x).
With all this in mind I am wondering if I can do any tricks to speed this up as much as I can, n can be fairly large.
When I time the function f, it's around a couple of microseconds, which is as expected. It's a pretty straightforward calculation. Below I show you the timings I got, compared with max() and sum() for reference.
In [19]: %timeit sum(s[11])
4.68 µs ± 56.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [20]: %timeit max(s[11])
3.61 µs ± 64.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [21]: %timeit f(s[11], s[21], 50, 10, 1e-5)
1.23 µs ± 7.25 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [22]: %timeit f(s[121], s[321], 50, 10, 1e-5)
1.26 µs ± 31.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
However when I time the overall processing time for a 500x50 sample data (resulting in 500 x 500 /2 = 125K comparisons), the overall time blows up significantly (into minutes). I would have expected something like 0.2-0.3 seconds (1.25E5 * 2E-6 sec/calc).
In [12]: @jit
...: def testf(s, n, m, p):
...: tol = 1e-5
...: sim = np.zeros((n,n))
...: for i in range(n):
...: for j in range(n):
...: if i > j:
...: delta = p[i] - p[j]
...: if delta < 0:
...: res = f(s[i], s[j], m, abs(delta), tol) # <-- max(s[i])
...: else:
...: res = f(s[j], s[i], m, delta, tol) # <-- sum(s[j])
...: sim[i][j] = res
...: sim[j][i] = res
...: return sim
In the code above I have changed the lines where res is assigned so that they call max() and sum() instead (the commented-out parts) for testing, and the code executes approximately 100 times faster, even though those functions are themselves slower than my function f().
Which brings me to my questions:
Can I avoid the double loop to speed this up? Ideally I want to be able to run this for matrices of size n = 1E5. (Comment: since the max and sum functions work considerably faster, my guess is that the for loops aren't the bottleneck here, but it's still good to know if there is a better way.)
What may cause the severe slowdown with my function, if it's not the double for loop?
EDIT
The specifics of the function f were asked about in some comments. It iterates over two arrays and counts the number of values in the two arrays that are "close enough". I removed the comments and changed some variable names, but the logic is as shown below. It was interesting to note that math.isclose(x, y, rel_tol), which is equivalent to the if-statements I have below, makes the code significantly slower, probably due to the library call?
from numba import jit
@jit
def f(arr1, arr2, n, d, rel_tol):
counter = 0
i,j,k = 0,0,0
while (i < n and j < n and k < n):
val = arr1[j] + d
if abs(arr1[i] - arr2[k]) < rel_tol * max(arr1[i], arr2[k]):
counter += 1
i += 1
k += 1
elif abs(val - arr2[k]) < rel_tol * max(val, arr2[k]):
counter += 1
j += 1
k += 1
else:
# increment the index corresponding to the smallest value
if arr1[i] <= arr2[k] and arr1[i] <= val:
if i < n:
i += 1
elif val <= arr1[i] and val <= arr2[k]:
if j < n:
j += 1
else:
k += 1
return counter
I have a sparse matrix in csr format, e.g.:
>>> a = sp.random(3, 3, 0.6, format='csr') # an example
>>> a.toarray() # just to see how it looks like
array([[0.31975333, 0.88437035, 0. ],
[0. , 0. , 0. ],
[0.14013856, 0.56245834, 0.62107962]])
>>> a.data # data array
array([0.31975333, 0.88437035, 0.14013856, 0.56245834, 0.62107962])
For this particular example, I want to get [0, 4] which are the data-array indices of the non-zero diagonal elements 0.31975333 and 0.62107962.
A simple way to do this is the following:
ind = []
seen = set()
for i, val in enumerate(a.data):
if val in a.diagonal() and val not in seen:
ind.append(i)
seen.add(val)
But in practice the matrix is very big, so I don't want to use for loops or convert to a numpy array using the toarray() method. Is there a more efficient way to do it?
Edit: I just realized that the above code gives an incorrect result when there are off-diagonal elements equal to and preceding some of the diagonal elements: it returns the indices of those off-diagonal elements instead. Also, it doesn't return the indices of repeated diagonal elements. For example:
a = np.array([[0.31975333, 0.88437035, 0. ],
[0.62107962, 0.31975333, 0. ],
[0.14013856, 0.56245834, 0.62107962]])
a = sp.csr_matrix(a)
>>> a.data
array([0.31975333, 0.88437035, 0.62107962, 0.31975333, 0.14013856,
0.56245834, 0.62107962])
My code returns ind = [0, 2], but it should be [0, 3, 6].
The code provided by Andras Deak (his get_rowwise function), returns the correct result.
I've found a possibly more efficient solution, though it still loops. However, it loops over the rows of the matrix rather than on the elements themselves. Depending on the sparsity pattern of your matrix this might or might not be faster. This is guaranteed to cost N iterations for a sparse matrix with N rows.
We just loop through each row, fetch the filled column indices via a.indices and a.indptr, and if the diagonal element for the given row is present in the filled values then we compute its index:
import numpy as np
import scipy.sparse as sp
def orig_loopy(a):
ind = []
seen = set()
for i, val in enumerate(a.data):
if val in a.diagonal() and val not in seen:
ind.append(i)
seen.add(val)
return ind
def get_rowwise(a):
datainds = []
indices = a.indices # column indices of filled values
indptr = a.indptr # auxiliary "pointer" to data indices
for irow in range(a.shape[0]):
rowinds = indices[indptr[irow]:indptr[irow+1]] # column indices of the row
if irow in rowinds:
# then we've got a diagonal in this row
# so let's find its index
datainds.append(indptr[irow] + np.flatnonzero(irow == rowinds)[0])
return datainds
a = sp.random(300, 300, 0.6, format='csr')
orig_loopy(a) == get_rowwise(a) # True
For a (300,300)-shaped random input with the same density the original version runs in 3.7 seconds, the new version runs in 5.5 milliseconds.
Method 1
This is a vectorized approach, which generates all nonzero indices first and then gets the positions where the row and column index are the same. This is a bit slow and has a high memory usage.
import numpy as np
import scipy.sparse as sp
import numba as nb
def get_diag_ind_vec(csr_array):
inds=csr_array.nonzero()
return np.array(np.where(inds[0]==inds[1])[0])
Method 2
Loopy approaches are in general no problem regarding performance, as long as you make use of a compiler, e.g. Numba or Cython. I allocated memory for the maximum number of diagonal elements that could occur. If this method uses too much memory it can be easily modified.
@nb.jit()
def get_diag_ind(csr_array):
ind=np.empty(csr_array.shape[0],dtype=np.uint64)
rowPtr=csr_array.indptr
colInd=csr_array.indices
ii=0
for i in range(rowPtr.shape[0]-1):
for j in range(rowPtr[i],rowPtr[i+1]):
if (i==colInd[j]):
ind[ii]=j
ii+=1
return ind[:ii]
Timings
csr_array = sp.random(1000, 1000, 0.5, format='csr')
get_diag_ind_vec(csr_array) -> 8.25ms
get_diag_ind(csr_array) -> 0.65ms (first call excluded)
Here's my solution which seems to be faster than get_rowwise (Andras Deak) and get_diag_ind_vec (max9111) (I do not consider the use of Numba or Cython).
The idea is to set the non-zero diagonal elements of the matrix (or its copy) to some unique value x that is not in the original matrix (I chose the max value + 1), and then simply use np.where(a.data == x) to return the desired indices.
def diag_ind(a):
a = a.copy()
i = a.diagonal() != 0
x = np.max(a.data) + 1
a[i, i] = x
return np.where(a.data == x)
Timing:
A = sp.random(1000, 1000, 0.5, format='csr')
>>> %timeit diag_ind(A)
6.32 ms ± 335 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit get_diag_ind_vec(A)
14.6 ms ± 292 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit get_rowwise(A)
24.3 ms ± 5.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Edit: copying the sparse matrix (in order to preserve the original matrix) is not memory efficient, so a better solution would be to store the diagonal elements and later use them for restoring the original matrix.
def diag_ind2(a):
a_diag = a.diagonal()
i = a_diag != 0
x = np.max(a.data) + 1
a[i, i] = x
ind = np.where(a.data == x)
a[i, i] = a_diag[np.nonzero(a_diag)]
return ind
This is even faster:
>>> %timeit diag_ind2(A)
2.83 ms ± 419 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)