Let $A$ be a csr_matrix representing the connectivity matrix for a graph, where $A_{ij}$ is the weight of an edge. Now I need to invert each non-zero element of the matrix in an efficient way. The way I'm doing this right now is
B = 1.0 / A.toarray()
B[B == np.inf] = 0
This has two downsides:
memory usage increases by converting the csr_matrix to a dense array.
a division by zero happens, producing inf values that then have to be zeroed out.
Are there any suggestions for doing this more efficiently?
One way you could do this is to create a new matrix from the data, indices and indptr of A: B = csr_matrix((1/A.data, A.indices, A.indptr)).
(This assumes that there are no explicitly stored zeros in A, so 1/A.data doesn't result in some values being inf.)
For example,
In [108]: A
Out[108]:
<4x4 sparse matrix of type '<class 'numpy.float64'>'
with 4 stored elements in Compressed Sparse Row format>
In [109]: A.A
Out[109]:
array([[0. , 1. , 2.5, 0. ],
[0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 4. ],
[2. , 0. , 0. , 0. ]])
In [110]: B = csr_matrix((1/A.data, A.indices, A.indptr))
In [111]: B
Out[111]:
<4x4 sparse matrix of type '<class 'numpy.float64'>'
with 4 stored elements in Compressed Sparse Row format>
In [112]: B.A
Out[112]:
array([[0. , 1. , 0.4 , 0. ],
[0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0.25],
[0.5 , 0. , 0. , 0. ]])
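As noted above, this assumes A has no explicitly stored zeros. If it might have some (e.g. from assigning zeros, or from sums that cancel), a minimal defensive sketch is to drop them first with eliminate_zeros; passing shape=A.shape also keeps the column count intact if the last columns happen to be empty:
A = A.copy()         # keep the original intact
A.eliminate_zeros()  # drop explicitly stored zeros in place
B = csr_matrix((1.0 / A.data, A.indices, A.indptr), shape=A.shape)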
csr_matrix has an element-wise power method:
In [598]: M = sparse.csr_matrix([[0,3,2],[.5,0,10]])
In [599]: M
Out[599]:
<2x3 sparse matrix of type '<class 'numpy.float64'>'
with 4 stored elements in Compressed Sparse Row format>
In [600]: M.A
Out[600]:
array([[ 0. , 3. , 2. ],
[ 0.5, 0. , 10. ]])
In [601]: x = M.power(-1)
In [602]: x
Out[602]:
<2x3 sparse matrix of type '<class 'numpy.float64'>'
with 4 stored elements in Compressed Sparse Row format>
In [603]: x.A
Out[603]:
array([[0. , 0.33333333, 0.5 ],
[2. , 0. , 0.1 ]])
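If mutating a copy is acceptable, the same reciprocal can also be taken directly on the stored values without rebuilding the matrix; a minimal sketch (again assuming no explicitly stored zeros):
B = M.copy()
B.data = 1.0 / B.data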
I have a 2D array (a confusion matrix), for example of shape (3,3). The numbers in the array refer to indices into a set of labels.
I know that this array should actually be (5,5) instead of (3,3), to cover all 5 row and column labels. I can find the labels that have been "hit":
import numpy as np
x = np.array([[3, 0, 3],
              [0, 2, 0],
              [2, 3, 3]])
labels = ["a", "b", "c", "d", "e"]
missing_idxs = np.setdiff1d(np.arange(len(labels)), x)  # array([1, 4])
I know that the rows and columns for the missing indices are all zero, so the output I want is this:
y = np.array([[3, 0, 0, 3, 0],
              [0, 0, 0, 0, 0],  # <- Inserted row at index 1, all zeros
              [0, 0, 2, 0, 0],
              [2, 0, 3, 3, 0],
              [0, 0, 0, 0, 0]])  # <- Inserted row at index 4, all zeros
#                 ^        ^
#                 |        |
#     Inserted columns at index 1 and 4, all zeros
I can do that with multiple calls to np.insert in a loop over all missing indices:
def insert_rows_columns_at_slow(arr, indices):
    result = arr.copy()
    for idx in indices:
        result = np.insert(result, idx, np.zeros(result.shape[1]), 0)
        result = np.insert(result, idx, np.zeros(result.shape[0]), 1)
    return result
However, my real array is much bigger, and there may be many more missing indices. Since np.insert re-allocates every time, this is not very efficient.
How can I achieve the same result, but in a more efficient, vectorized way? Bonus points if it works in more than 2 dimensions.
Just another option:
Instead of using the missing indices, use the non-missing indices:
non_missing_idxs = np.intersect1d(np.arange(len(labels)), x)  # array([0, 2, 3])
y = np.zeros((5, 5))
y[non_missing_idxs[:, None], non_missing_idxs] = x
output:
array([[3., 0., 0., 3., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 2., 0., 0.],
[2., 0., 3., 3., 0.],
[0., 0., 0., 0., 0.]])
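Since every entry of x is itself a label index, np.unique(x) yields the same sorted set of non-missing indices; a quick sanity check, reusing x and labels from above:
assert np.array_equal(np.unique(x), np.intersect1d(np.arange(len(labels)), x))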
You can do this by pre-allocating the full result array and filling in the rows and columns of the old array. This works even in multiple dimensions, and the dimension sizes don't have to match:
def insert_at(arr, output_size, indices):
    """
    Insert zeros at specific indices over whole dimensions, e.g. rows and/or columns and/or channels.
    You need to specify indices for each dimension, or leave a dimension untouched by specifying
    `...` for it. The following assertion should hold:
    `assert len(output_size) == len(indices) == len(arr.shape)`

    :param arr: The array to insert zeros into.
    :param output_size: The size of the array after insertion is completed.
    :param indices: The indices where zeros should be inserted, per dimension. For each dimension, you can
                    specify:
                    - an int
                    - a tuple of ints
                    - a generator yielding ints (such as `range`)
                    - Ellipsis (=...)
    :return: An array of shape `output_size` with the content of arr and zeros inserted at the given indices.
    """
    # assert len(output_size) == len(indices) == len(arr.shape)
    result = np.zeros(output_size)
    existing_indices = [np.setdiff1d(np.arange(axis_size), axis_indices, assume_unique=True)
                        for axis_size, axis_indices in zip(output_size, indices)]
    result[np.ix_(*existing_indices)] = arr
    return result
For your use-case, you can use it like this:
def fill_by_label(arr, labels):
    # If this is your only use-case, you can make it more efficient
    # by not computing the missing indices first, just to compute
    # the existing indices again.
    missing_idxs = np.setdiff1d(np.arange(len(labels)), arr)
    return insert_at(arr, output_size=(len(labels), len(labels)),
                     indices=(missing_idxs, missing_idxs))
x = np.array([[3, 0, 3],
              [0, 2, 0],
              [2, 3, 3]])
labels = ["a", "b", "c", "d", "e"]
print(fill_by_label(x, labels))
>> [[3. 0. 0. 3. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 2. 0. 0.]
[2. 0. 3. 3. 0.]
[0. 0. 0. 0. 0.]]
But this is very flexible. You can use it for zero padding:
def zero_pad(arr):
    out_size = np.array(arr.shape) + 2
    indices = (0, out_size[0] - 1), (0, out_size[1] - 1)
    return insert_at(arr, output_size=out_size,
                     indices=indices)
print(zero_pad(x))
>> [[0. 0. 0. 0. 0.]
[0. 3. 0. 3. 0.]
[0. 0. 2. 0. 0.]
[0. 2. 3. 3. 0.]
[0. 0. 0. 0. 0.]]
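For plain zero padding specifically, numpy already ships np.pad, which pads with constant zeros by default; a quick check against zero_pad above:
assert np.array_equal(zero_pad(x), np.pad(x, 1))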
It also works with non-square inputs and outputs:
x = np.ones((3, 4))
print(insert_at(x, (4, 5), (2, 3)))
>>[[1. 1. 1. 0. 1.]
[1. 1. 1. 0. 1.]
[0. 0. 0. 0. 0.]
[1. 1. 1. 0. 1.]]
With a different number of insertions per dimension:
x = np.ones((3, 4))
print(insert_at(x, (4, 6), (1, (2, 4))))
>> [[1. 1. 0. 1. 0. 1.]
[0. 0. 0. 0. 0. 0.]
[1. 1. 0. 1. 0. 1.]
[1. 1. 0. 1. 0. 1.]]
You can use range (or other generators) instead of enumerating every index:
x = np.ones((3, 4))
print(insert_at(x, (4, 6), (1, range(2, 4))))
>>[[1. 1. 0. 0. 1. 1.]
[0. 0. 0. 0. 0. 0.]
[1. 1. 0. 0. 1. 1.]
[1. 1. 0. 0. 1. 1.]]
It works with arbitrary dimensions (as long as you specify indices for every dimension)[1]:
x = np.ones((2, 2, 2))
print(insert_at(x, (3, 3, 3), (0, 0, 0)))
>>>[[[0. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]
[[0. 0. 0.]
[0. 1. 1.]
[0. 1. 1.]]
[[0. 0. 0.]
[0. 1. 1.]
[0. 1. 1.]]]
You can use Ellipsis (=...) to indicate that you don't want to change a dimension[1][2]:
x = np.ones((2, 2))
print(insert_at(x, (2, 4), (..., (0, 1))))
>>[[0. 0. 1. 1.]
[0. 0. 1. 1.]]
[1]: You could automatically detect this based on arr.shape and output_size, and fill it with ... as needed, but I'll leave that up to you if you need it. If you wanted to, you could probably get rid of the output_size parameter instead, but then it gets trickier with passing in generators.
[2]: This is somewhat different from the normal numpy ... semantics, as you need to specify ... for every dimension that you want to keep, i.e. the following does NOT work:
x = np.ones((2, 2, 2))
print(insert_at(x, (2, 2, 3), (..., 0)))
For timing, I ran the insertion of 10 rows and columns into a 90x90 array 100000 times; these are the results:
x = np.random.random(size=(90, 90))
indices = np.arange(10) * 10
def measure_time_fast():
    insert_at(x, (100, 100), (indices, indices))

def measure_time_slow():
    insert_rows_columns_at_slow(x, indices)

if __name__ == '__main__':
    import timeit
    for speed in ("fast", "slow"):
        times = timeit.repeat(f"measure_time_{speed}()", setup=f"from __main__ import measure_time_{speed}", repeat=10, number=10000)
        print(f"Min: {np.min(times) / 10000}, Max: {np.max(times) / 10000}, Mean: {np.mean(times) / 10000} seconds per call")
For the fast version:
Min: 7.336409069976071e-05, Max: 7.7440657400075e-05, Mean: 7.520040466995852e-05 seconds per call
That is about 75 microseconds.
For your slow version:
Min: 0.00028272533010022016, Max: 0.0002923079213000165, Mean: 0.00028581595062998535 seconds per call
That is about 300 microseconds.
The difference gets bigger as the arrays grow. E.g. for inserting 100 rows and columns into a 900x900 array, these are the results (run only 1000 times):
Fast version:
Min: 0.00022916630539984907, Max: 0.0022916630539984908, Mean: 0.0022916630539984908 seconds per call
Slow version:
Min: 0.013766934227399906, Max: 0.13766934227399907, Mean: 0.13766934227399907 seconds per call
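Before reading too much into the timings, it is worth checking that both versions agree; a minimal sketch reusing x and indices from the timing setup (and relying on insert_rows_columns_at_slow returning its result):
expected = insert_rows_columns_at_slow(x, indices)
actual = insert_at(x, (100, 100), (indices, indices))
assert np.array_equal(expected, actual)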
I have an arbitrary row vector "u" and an arbitrary matrix "e" as follows:
import numpy as np

u = np.resize(np.array([8, 3]), [1, 2])
e = np.resize(np.array([[2, 2, 5, 5], [1, 6, 7, 4]]), [4, 2])
np.cov(u,e)
array([[ 12.5, 0. , 0. , -12.5, 7.5],
[ 0. , 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. ],
[-12.5, 0. , 0. , 12.5, -7.5],
[ 7.5, 0. , 0. , -7.5, 4.5]])
The matrix that this returns is 5x5. This is confusing to me because the largest dimension of the inputs is only 4.
Thus, this may be less of a numpy question and more of a math question...not sure...
Please refer to the official numpy documentation (https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.cov.html) and check whether your usage of the numpy.cov function is consistent with what you are trying to achieve, and that you understand what you are trying to do.
When looking at the signature
numpy.cov(m, y=None, rowvar=True, bias=False, ddof=None, fweights=None, aweights=None)
m : array_like
A 1-D or 2-D array containing multiple variables and observations.
Each row of m represents a variable, and each column a single observation of all those variables. Also see rowvar below.
y : array_like, optional
An additional set of variables and observations. y has the same form as that of m.
Note how m and y are combined, as shown in the last example on the page:
>>> x = [-2.1, -1, 4.3]
>>> y = [3, 1.1, 0.12]
>>> X = np.stack((x, y), axis=0)
>>> print(np.cov(X))
[[ 11.71 -4.286 ]
[ -4.286 2.14413333]]
>>> print(np.cov(x, y))
[[ 11.71 -4.286 ]
[ -4.286 2.14413333]]
>>> print(np.cov(x))
11.71
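That last example is exactly what happens here: np.cov(u, e) row-stacks e under u and treats each row of the stack as one variable. u contributes 1 row and e contributes 4, so the stack has 5 variables and the covariance matrix is 5x5. A quick check, reusing u and e from the question:
stacked = np.vstack((u, e))  # shape (5, 2): 5 variables, 2 observations each
assert np.allclose(np.cov(u, e), np.cov(stacked))
print(np.cov(stacked).shape)  # (5, 5)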
Does anyone have experience in creating a sparse matrix whose non-zero values follow a uniform distribution on [-0.5, 0.5] and have zero mean (zero-centered), in Python (e.g. using scipy.sparse)?
I am aware that the scipy.sparse package provides a few methods for creating random sparse matrices, like 'rand' and 'random'. However, I could not achieve what I want with those methods. For example, I tried:
import numpy as np
import scipy.sparse as sp
s = np.random.uniform(-0.5,0.5)
W=sp.random(1024, 1024, density=0.01, format='csc', data_rvs=s)
To specify my idea:
Let's say I want the above-mentioned matrix to be non-sparse, or dense; I would create it by:
dense = np.random.rand(1024, 1024) - 0.5
'np.random.rand(1024,1024)' creates a dense matrix with values uniform in [0,1]. To make it zero mean, I centre the matrix by subtracting 0.5.
However, if I create a sparse matrix, say:
sparse = sp.rand(1024, 1024, density=0.01, format='csc')
the matrix will have non-zero values uniform in [0,1]. However, if I want to centre it, I cannot simply do 'sparse -= 0.5', because that would make all the originally zero entries non-zero after the subtraction.
So, how can I achieve for a sparse matrix the same as in the dense example above?
Thank you for all of your help!
The data_rvs parameter is expecting a "callable" that takes a size. This isn't exactly obvious from the documentation. This can be done with a lambda as follows:
import numpy as np
import scipy.sparse as sp
W = sp.random(1024, 1024, density=0.01, format='csc',
              data_rvs=lambda s: np.random.uniform(-0.5, 0.5, size=s))
Then print(W) gives:
(243, 0) -0.171300809713
(315, 0) 0.0739590145626
(400, 0) 0.188151369316
(440, 0) -0.187384896218
: :
(1016, 0) 0.29262088084
(156, 1) -0.149881296136
(166, 1) -0.490405135834
(191, 1) 0.188167190147
(212, 1) 0.0334533020488
: :
(411, 1) 0.122330200832
(431, 1) -0.0494334160833
(813, 1) -0.0076379249885
(828, 1) 0.462807265425
: :
(840, 1021) 0.456423017883
(12, 1022) -0.47313075329
: :
(563, 1022) -0.477190349161
(655, 1022) -0.460942546313
(673, 1022) 0.0930207181126
(676, 1022) 0.253643616387
: :
(843, 1023) 0.463793903168
(860, 1023) 0.454427252782
For the newbie, the lambda may look odd - this is just an unnamed function. The sp.random function takes an optional argument data_rvs that defaults to None. When specified, it is expected to be a function that takes a size argument and returns that number of random numbers. A simple function to do this would be:
def generate_n_uniform_randoms(n):
    return np.random.uniform(-0.5, 0.5, n)
I don't know the origin of the API, but the shape is not needed as sp.random presumably first figures out which indices will be non-zero, and then it just needs to compute random values for those indices, which is a set of a known size.
The lambda is just syntactic sugar that allows us to define that function inline in terms of some other function call. We could instead write
W = sp.random(1024, 1024, density=0.01, format='csc',
              data_rvs=generate_n_uniform_randoms)
Actually, this can be a "callable" - some object f for which f(n) returns n random variables. This can be a function, but it can also be an object of a class that implements the __call__(self, n) function. For example:
class ufoo(object):
    def __call__(self, n):
        import numpy
        return numpy.random.uniform(-0.5, 0.5, n)

W = sp.random(1024, 1024, density=0.01, format='csc',
              data_rvs=ufoo())
If you need the mean to be exactly zero (within roundoff of course), this can be done by subtracting the mean from the non-zero values, as I mentioned above:
W.data -= np.mean(W.data)
Then:
W.data.mean()
-2.3718641632430623e-18
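After this subtraction the stored values are exactly centered, but note (as the next answer also points out) that the shift can push a few values slightly outside the original [-0.5, 0.5] interval; a quick check:
print(W.data.mean())               # ~0, up to roundoff
print(W.data.min(), W.data.max())  # may fall slightly outside [-0.5, 0.5]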
sparse.random does two things: it distributes the nonzeros randomly, and it generates random uniform values.
In [62]: M = sparse.random(10,10,density=.2, format='csr')
In [63]: M
Out[63]:
<10x10 sparse matrix of type '<class 'numpy.float64'>'
with 20 stored elements in Compressed Sparse Row format>
In [64]: M.data
Out[64]:
array([ 0.42825407, 0.51858978, 0.8084335 , 0.08691635, 0.13210409,
0.61288928, 0.39675205, 0.58242891, 0.5174367 , 0.57859824,
0.48812484, 0.13472883, 0.82992478, 0.70568697, 0.45001632,
0.52147305, 0.72943809, 0.55801913, 0.97018861, 0.83236235])
You can modify the data values cheaply without changing the sparsity distribution:
In [65]: M.data -= 0.5
In [66]: M.A
Out[66]:
array([[ 0. , 0. , 0. , -0.07174593, 0. ,
0. , 0. , 0. , 0. , 0. ],
[ 0.01858978, 0. , 0. , 0.3084335 , -0.41308365,
0. , 0. , 0. , 0. , -0.36789591],
[ 0. , 0. , 0. , 0. , 0.11288928,
-0.10324795, 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0.08242891, 0.0174367 , 0. ],
[ 0. , 0. , 0.07859824, 0. , 0. ,
0. , 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. ,
0. , -0.01187516, 0. , 0. , -0.36527117],
[ 0. , 0. , 0.32992478, 0. , 0. ,
0. , 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0.20568697,
0. , 0. , -0.04998368, 0. , 0. ],
[ 0.02147305, 0. , 0.22943809, 0.05801913, 0. ,
0. , 0. , 0. , 0. , 0. ],
[ 0. , 0.47018861, 0.33236235, 0. , 0. ,
0. , 0. , 0. , 0. , 0. ]])
In [67]: np.mean(M.data)
Out[67]: 0.044118297661574338
Or replacing the nonzero values with a new set of values:
In [69]: M.data = np.random.randint(-5,5,20)
In [70]: M
Out[70]:
<10x10 sparse matrix of type '<class 'numpy.int32'>'
with 20 stored elements in Compressed Sparse Row format>
In [71]: M.A
Out[71]:
array([[ 0, 0, 0, 4, 0, 0, 0, 0, 0, 0],
[-1, 0, 0, 1, 2, 0, 0, 0, 0, -4],
[ 0, 0, 0, 0, 0, 4, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, -5, -5, 0],
[ 0, 0, 2, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, -3, 0, 0, 3],
[ 0, 0, -1, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, -4, 0, 0, -1, 0, 0],
[-1, 0, -5, -2, 0, 0, 0, 0, 0, 0],
[ 0, 3, 1, 0, 0, 0, 0, 0, 0, 0]])
In [72]: M.data
Out[72]:
array([ 4, -1, 1, 2, -4, 0, 4, -5, -5, 2, -3, 3, -1, -4, -1, -1, -5,
-2, 3, 1])
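Putting the two steps together for the original question: let sparse.random lay out the sparsity pattern, then draw the stored values from a uniform distribution on [-0.5, 0.5). That gives zero mean in expectation only; subtract M.data.mean() afterwards if you need it exactly zero. A minimal sketch:
M = sparse.random(1024, 1024, density=0.01, format='csc')
M.data = np.random.uniform(-0.5, 0.5, M.nnz)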
In my opinion, your requirements are still incomplete (see the disadvantage mentioned below).
Here is an implementation of the simple construction outlined above in my comment:
import numpy as np
import scipy.sparse as sp
M, N, NNZ = 5, 5, 10
assert NNZ % 2 == 0
flat_dim = M*N
valuesA = np.random.uniform(-0.5, 0.5, size=NNZ // 2)
valuesB = valuesA * -1
values = np.hstack((valuesA, valuesB))
positions_flat = np.random.choice(flat_dim, size=NNZ, replace=False)
positions_2d = np.unravel_index(positions_flat, (M, N))
mat = sp.coo_matrix((values, (positions_2d[0], positions_2d[1])), shape=(M, N))
print(mat.todense())
print(mat.data.mean())
Output:
[[ 0. 0. 0. 0.0273862 0. ]
[-0.3943963 0. 0. -0.04134932 0. ]
[-0.10121743 0. -0.0273862 0. 0.04134932]
[ 0.3943963 0. 0. 0. 0. ]
[-0.24680983 0. 0.24680983 0.10121743 0. ]]
0.0
Advantages
sparse
zero mean
entries from uniform distribution
Potential disadvantage:
for each value x in the matrix, -x is to be found somewhere as well!
meaning: it's not uniform in a broader, joint-distribution sense
whether that hurts, only you can tell
if yes: the above construction could easily be modified to use any centered values from some distribution, so your problem collapses into this somewhat smaller (but not necessarily much easier) problem
Now in regards to that linked problem: I'm guessing here, but I would not be surprised to see that sampling x values uniformly with the constraint mean(x) = 0 is NP-hard.
Keep in mind that a-posteriori centering of the nonzeros, as recommended in the other answer, changes the underlying distribution (even for simple distributions), in some cases even invalidating bounds (leaving the interval [-0.5, 0.5]).
This means: this question is all about formalizing how important each objective is and balancing them out in some way.
Given an adjacency list:
adj_list = [np.array([0,1]), np.array([0,1,2]), np.array([0,2])]
And an array of indices,
ind_arr = np.array([0,1,2])
Goal:
A = np.zeros((3, 3))
for i in ind_arr:
    A[i, adj_list[i]] = 1.0 / adj_list[i].shape[0]
Currently, I have written:
A[ind_list[:],adj_list[:]] = 1. / len(adj_list[:])
And tried various configurations of indexing within this scaffold.
Here's one approach -
lens = np.array([len(i) for i in adj_list])
col_idx = np.concatenate(adj_list)
out = np.zeros((len(lens), col_idx.max()+1))
row_idx = np.repeat(np.arange(len(lens)), lens)
vals = np.repeat(1.0/lens, lens)
out[row_idx, col_idx] = vals
Sample input, output -
In [494]: adj_list = [np.array([0,2]),np.array([0,1,4])]
In [496]: out
Out[496]:
array([[ 0.5 , 0. , 0.5 , 0. , 0. ],
[ 0.33333333, 0.33333333, 0. , 0. , 0.33333333]])
Sparse matrix as output
Additionally, if you want to save memory and create a sparse matrix instead, that's an easy extension -
In [506]: from scipy.sparse import csr_matrix
In [507]: csr_matrix((vals, (row_idx, col_idx)), shape=(len(lens), col_idx.max()+1))
Out[507]:
<2x5 sparse matrix of type '<type 'numpy.float64'>'
with 5 stored elements in Compressed Sparse Row format>
In [508]: _.toarray()
Out[508]:
array([[ 0.5 , 0. , 0.5 , 0. , 0. ],
[ 0.33333333, 0.33333333, 0. , 0. , 0.33333333]])
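Since lens already encodes how many entries each row holds, you can also skip row_idx and hand scipy the CSR indptr directly; a small variant sketch reusing vals, col_idx and lens from above:
indptr = np.concatenate(([0], np.cumsum(lens)))
out_sparse = csr_matrix((vals, col_idx, indptr),
                        shape=(len(lens), col_idx.max() + 1))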
I don't think you can completely eliminate loops, due to the mixed data types, but you can reduce the nested double for-loop to a single one:
A = np.zeros((2, 3))
for i, arr in enumerate(adj_list):
    arr_size = len(arr)
    A[i, :arr_size] = 1./arr_size
A
# array([[ 0.5 , 0.5 , 0. ],
# [ 0.33333333, 0.33333333, 0.33333333]])
Or, if the numbers in the arrays are actually column positions:
A = np.zeros((2, 3))
for i, arr in enumerate(adj_list):
    A[i, arr] = 1./len(arr)
A
# array([[ 0.5 , 0.5 , 0. ],
# [ 0.33333333, 0.33333333, 0.33333333]])
Another option, using MultiLabelBinarizer from sklearn (though it may not be as efficient):
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
adj_list = [np.array([0,1]),np.array([0,1,2])]
sizes = np.fromiter(map(len, adj_list), dtype=int)
mlb.fit_transform(adj_list)/sizes[:,None]
# array([[ 0.5 , 0.5 , 0. ],
# [ 0.33333333, 0.33333333, 0.33333333]])