I have a matrix in a sparse csr format for example:
from scipy.sparse import csr_matrix
import numpy as np
row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
M = csr_matrix((data, (row, col)), shape=(3, 3))
M.A
array([[1, 0, 2],
       [0, 0, 3],
       [4, 5, 6]])
I am re-ordering the matrix with the index [2,0,1] using the following approach:
order = np.array([2,0,1])
M = M[order,:]
M = M[:,order]
M.A
array([[6, 4, 5],
       [2, 1, 0],
       [3, 0, 0]])
This approach works, but it is not feasible for my real csr_matrix, which has a size of 16580746 x 1672751804 and causes a memory error.
I took another approach like this:
edge_list = zip(row, col, data)
index = dict(zip(order, range(len(order))))
all_coeff = zip(*((index[u], index[v],d) for u,v,d in edge_list if u in index and v in index))
new_row,new_col,new_data = all_coeff
n = len(order)
graph = csr_matrix((new_data, (new_row, new_col)), shape=(n, n))
This also works but falls into the same trap of a memory error for a large sparse matrix. Any suggestions on how to do this efficiently?
I've found using matrix operations to be the most efficient. Here's a function which will permute the rows and/or columns to a specified order. It can be modified to swap two specific rows/columns if you would like.
from scipy import sparse

def permute_sparse_matrix(M, new_row_order=None, new_col_order=None):
    """
    Reorders the rows and/or columns in a scipy sparse matrix
    using the specified array(s) of indexes,
    e.g., [1, 0, 2, 3, ...] would swap the first and second row/col.
    """
    if new_row_order is None and new_col_order is None:
        return M
    new_M = M
    if new_row_order is not None:
        # Build a permutation matrix P with P[i, new_row_order[i]] = 1,
        # so that (P @ M)[i] == M[new_row_order[i]], matching M[new_row_order, :].
        I = sparse.eye(M.shape[0], dtype=M.dtype).tocoo()
        I.col = I.col[new_row_order]
        new_M = I.dot(new_M)
    if new_col_order is not None:
        # Same idea on the right: (M @ P)[:, j] == M[:, new_col_order[j]].
        I = sparse.eye(M.shape[1], dtype=M.dtype).tocoo()
        I.row = I.row[new_col_order]
        new_M = new_M.dot(I)
    return new_M
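As a quick sanity check (my own addition), applying this to the 3x3 example from the question reproduces the fancy-indexing result, while the permutation itself stays sparse:

import numpy as np
from scipy.sparse import csr_matrix

row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
M = csr_matrix((data, (row, col)), shape=(3, 3))

order = np.array([2, 0, 1])
print(permute_sparse_matrix(M, order, order).toarray())
# [[6 4 5]
#  [2 1 0]
#  [3 0 0]]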
Let's think smart.
Instead of reordering the matrix, why don't you work directly on the row and column indexes that you provided at the start?
So, for example, you can remap your row indexes with the inverse permutation (old index i goes to new position np.argsort(order)[i], which is exactly what your index dict computes), from:
[0, 0, 1, 2, 2, 2]
to:
[1, 1, 2, 0, 0, 0]
And your column indexes, from:
[0, 2, 2, 0, 1, 2]
to:
[1, 0, 0, 1, 2, 0]
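A minimal vectorized sketch of this idea (my own code, reusing row, col, data and order from the question), which avoids the Python-level dict and zip of the second approach:

import numpy as np
from scipy.sparse import csr_matrix

inv = np.argsort(order)   # inverse permutation: old index -> new position
new_row = inv[row]        # [1, 1, 2, 0, 0, 0]
new_col = inv[col]        # [1, 0, 0, 1, 2, 0]
M2 = csr_matrix((data, (new_row, new_col)), shape=M.shape)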
I have a NumPy array with each row representing some (x, y, z) coordinate like so:
a = array([[0, 0, 1],
           [1, 1, 2],
           [4, 5, 1],
           [4, 5, 2]])
I also have another NumPy array with unique values of the z-coordinates of that array like so:
b = array([1, 2])
How can I apply a function, let's call it "f", to each of the groups of rows in a which correspond to the values in b? For example, the first value of b is 1 so I would get all rows of a which have a 1 in the z-coordinate. Then, I apply a function to all those values.
In the end, the output would be an array the same shape as b.
I'm trying to vectorize this to make it as fast as possible. Thanks!
Example of an expected output (assuming that f is count()):
c = array([2, 2])
because there are 2 rows in array a which have a z value of 1 in array b and also 2 rows in array a which have a z value of 2 in array b.
A trivial solution would be to iterate over array b like so:
for val in b:
    apply function to a based on val
    append to an array c
My attempt:
I tried doing something like this, but it just returns an empty array.
func(a[a[:, 2]==b])
The problem is that the groups of rows with the same z can have different sizes, so you cannot stack them into one 3D NumPy array, which would let you easily apply a function along the third dimension. One solution is to use a for-loop; another is to use np.split:
a = np.array([[0, 0, 1],
              [1, 1, 2],
              [4, 5, 1],
              [4, 5, 2],
              [4, 3, 1]])
a_sorted = a[a[:,2].argsort()]
inds = np.unique(a_sorted[:,2], return_index=True)[1]
a_split = np.split(a_sorted, inds)[1:]
# [array([[0, 0, 1],
#         [4, 5, 1],
#         [4, 3, 1]]),
#  array([[1, 1, 2],
#         [4, 5, 2]])]
f = np.sum # example of a function
result = list(map(f, a_split))
# [19, 15]
But imho the best solution is to use pandas and groupby as suggested by FBruzzesi. You can then convert the result to a numpy array.
EDIT: For completeness, here are the other two solutions
List comprehension:
b = np.unique(a[:,2])
result = [f(a[a[:,2] == z]) for z in b]
Pandas:
df = pd.DataFrame(a, columns=list('XYZ'))
result = df.groupby(['Z']).apply(lambda x: f(x.values)).tolist()
This is the performance plot I got for a = np.random.randint(0, 100, (n, 3)):
As you can see, approximately up to n = 10^5 the "split solution" is the fastest, but after that the pandas solution performs better.
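For reference, a rough sketch of how such a comparison can be reproduced with timeit (my own harness; the exact crossover point will vary by machine):

import timeit
import numpy as np
import pandas as pd

def split_solution(a, f=np.sum):
    a_sorted = a[a[:, 2].argsort()]
    inds = np.unique(a_sorted[:, 2], return_index=True)[1]
    return list(map(f, np.split(a_sorted, inds)[1:]))

def pandas_solution(a, f=np.sum):
    df = pd.DataFrame(a, columns=list('XYZ'))
    return df.groupby('Z').apply(lambda x: f(x.values)).tolist()

a = np.random.randint(0, 100, (10**5, 3))
print(timeit.timeit(lambda: split_solution(a), number=10))
print(timeit.timeit(lambda: pandas_solution(a), number=10))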
If you are allowed to use pandas:
import pandas as pd
df=pd.DataFrame(a, columns=['x','y','z'])
df.groupby('z').agg(f)
Here f can be any custom function working on grouped data.
Numeric example:
a = np.array([[0, 0, 1],
              [1, 1, 2],
              [4, 5, 1],
              [4, 5, 2]])
df=pd.DataFrame(a, columns=['x','y','z'])
df.groupby('z').size()
z
1 2
2 2
dtype: int64
Note that .size() is the way to count the number of rows per group.
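Since the desired output is an array with the same shape as b, the grouped result can be converted back to NumPy (my own addition):

c = df.groupby('z').size().to_numpy()
# array([2, 2])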
To keep it into pure numpy, maybe this can suit your case:
tmp = np.array([a[a[:,2]==i] for i in b])
tmp
array([[[0, 0, 1],
        [4, 5, 1]],

       [[1, 1, 2],
        [4, 5, 2]]])
which is an array containing each group of rows. Note that this stacks into a regular 3D array only because both groups happen to have the same size; with groups of different sizes, NumPy would produce an object array instead (and recent NumPy versions require dtype=object explicitly for that).
c = np.array([])
for x in np.nditer(b):
    c = np.append(c, np.where((a[:,2] == x))[0].shape[0])
Output:
[2. 2.]
I want to assign 0 to different length slices of a 2d array.
Example:
import numpy as np
arr = np.array([[1,2,3,4],
                [1,2,3,4],
                [1,2,3,4],
                [1,2,3,4]])
idxs = np.array([0,1,2,0])
Given the above array arr and indices idxs, how can you assign to slices of different lengths, such that the result is:
arr = np.array([[0,2,3,4],
                [0,0,3,4],
                [0,0,0,4],
                [0,2,3,4]])
These don't work:
slices = np.array([np.arange(i) for i in idxs])
arr[slices] = 0
arr[:, :idxs] = 0
You can use broadcasted comparison to generate a mask, and index into arr accordingly:
arr[np.arange(arr.shape[1]) <= idxs[:, None]] = 0
print(arr)
array([[0, 2, 3, 4],
       [0, 0, 3, 4],
       [0, 0, 0, 4],
       [0, 2, 3, 4]])
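To see why this works, here is the intermediate mask for idxs = [0, 1, 2, 0]; broadcasting compares the column positions 0..3 against each row's index:

mask = np.arange(arr.shape[1]) <= idxs[:, None]
print(mask)
# [[ True False False False]
#  [ True  True False False]
#  [ True  True  True False]
#  [ True False False False]]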
This does the trick:
import numpy as np

arr = np.array([[1,2,3,4],
                [1,2,3,4],
                [1,2,3,4],
                [1,2,3,4]])
idxs = [0,1,2,0]
for i, j in zip(range(arr.shape[0]), idxs):
    arr[i, :j+1] = 0
import numpy as np

arr = np.array([[1, 2, 3, 4],
                [1, 2, 3, 4],
                [1, 2, 3, 4],
                [1, 2, 3, 4]])
idxs = np.array([0, 1, 2, 0])
for i, idx in enumerate(idxs):
    arr[i, :idx+1] = 0
Here is a sparse solution that may be useful in cases where only a small fraction of places should be zeroed out:
>>> idx = idxs+1
>>> I = idx.cumsum()
>>> cidx = np.ones((I[-1],), int)
>>> cidx[0] = 0
>>> cidx[I[:-1]]-=idx[:-1]
>>> cidx=np.cumsum(cidx)
>>> ridx = np.repeat(np.arange(idx.size), idx)
>>> arr[ridx, cidx]=0
>>> arr
array([[0, 2, 3, 4],
       [0, 0, 3, 4],
       [0, 0, 0, 4],
       [0, 2, 3, 4]])
Explanation: We need to construct the coordinates of the positions we want to put zeros in.
The row indices are easy: we just need to go from 0 to 3 repeating each number to fill the corresponding slice.
The column indices start at zero and most of the time are incremented by 1. So to construct them we use cumsum on an array of mostly ones. Only at the start of each new row do we have to reset; we do that by subtracting the length of the corresponding slice, so as to cancel the ones we have summed in that row.
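Tracing the construction on the example, with idxs = [0, 1, 2, 0]:

# idx  = [1, 2, 3, 1]             slice lengths per row
# I    = [1, 3, 6, 7]             cumulative ends of each row's run
# cidx = [0, 0, 1, -1, 1, 1, -2]  before the final cumsum
# cidx = [0, 0, 1, 0, 1, 2, 0]    after cumsum: the column coordinates
# ridx = [0, 1, 1, 2, 2, 2, 3]    the row coordinates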
I have this 2d array of zeros z, this 1d array of starting points starts, and a 1d array of offsets:
z = np.zeros(35, dtype='i').reshape(5, 7)
starts = np.array([1, 5, 3, 0, 3])
offsets = np.arange(5) + 1
I would like to vectorize this little for loop here, but I seem to be unable to do it.
for i in range(z.shape[0]):
    z[i, starts[i]:] += offsets[i]
The result in this example should look like this:
z
array([[0, 1, 1, 1, 1, 1, 1],
       [0, 0, 0, 0, 0, 2, 2],
       [0, 0, 0, 3, 3, 3, 3],
       [4, 4, 4, 4, 4, 4, 4],
       [0, 0, 0, 5, 5, 5, 5]])
We could use some masking and NumPy broadcasting -
mask = starts[:,None] <= np.arange(z.shape[1])
z[mask] = np.repeat(offsets, mask.sum(1))
We could play a trick of broadcasted multiplication to get the final output -
z = offsets[:,None] * mask
Another way would be to assign values into z from offsets and then zero out the entries where the mask is False, like so -
z[:] = offsets[:,None]
z[~mask] = 0
And another way would be to start with a replicated version of offsets as z and then mask out -
z = np.repeat(offsets,z.shape[1]).reshape(z.shape[0],-1)
z[~mask] = 0
Of course, we would need the shape parameters beforehand.
If z is not initialized as a zeros array, then only the first of the solutions mentioned earlier would be applicable, and it would need to be updated with +=, like so -
z[mask] += np.repeat(offsets, mask.sum(1))
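Putting the first approach together and checking it against the original loop (a quick sanity check of my own):

import numpy as np

z = np.zeros((5, 7), dtype='i')
starts = np.array([1, 5, 3, 0, 3])
offsets = np.arange(5) + 1

mask = starts[:, None] <= np.arange(z.shape[1])
z[mask] = np.repeat(offsets, mask.sum(1))

# reference loop for comparison
z_ref = np.zeros((5, 7), dtype='i')
for i in range(z_ref.shape[0]):
    z_ref[i, starts[i]:] += offsets[i]

assert (z == z_ref).all()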
I would like to scale an array of shape (h, w) by a factor of n, resulting in an array of shape (h*n, w*n), with each value of the original array copied into an n x n block of the result.
Say that I have a 2x2 array:
array([[1, 1],
       [0, 1]])
I would like to scale the array to become 4x4:
array([[1, 1, 1, 1],
       [1, 1, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 1, 1]])
That is, the value of each cell in the original array is copied into 4 corresponding cells in the resulting array. Assuming arbitrary array size and scaling factor, what's the most efficient way to do this?
You should use the Kronecker product, numpy.kron:
Computes the Kronecker product, a composite array made of blocks of the second array scaled by the first
import numpy as np
a = np.array([[1, 1],
              [0, 1]])
n = 2
np.kron(a, np.ones((n,n)))
which gives what you want:
array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [0., 0., 1., 1.],
       [0., 0., 1., 1.]])
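One small caveat: np.ones defaults to float64, which is why the result above is a float array. Passing the original dtype keeps integers as integers:

np.kron(a, np.ones((n, n), dtype=a.dtype))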
You could use repeat:
In [6]: a.repeat(2,axis=0).repeat(2,axis=1)
Out[6]:
array([[1, 1, 1, 1],
       [1, 1, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 1, 1]])
I am not sure if there's a neat way to combine the two operations into one.
scipy.misc.imresize can scale images, and it can be used to scale numpy arrays too (note that imresize was deprecated in SciPy 1.0 and removed in SciPy 1.3, so this approach only works with older SciPy versions):
#!/usr/bin/env python
import numpy as np
import scipy.misc
def scale_array(x, new_size):
    min_el = np.min(x)
    max_el = np.max(x)
    y = scipy.misc.imresize(x, new_size, mode='L', interp='nearest')
    y = y / 255 * (max_el - min_el) + min_el
    return y
x = np.array([[1, 1],
              [0, 1]])
n = 2
new_size = n * np.array(x.shape)
y = scale_array(x, new_size)
print(y)
To scale effectively I use the following approach. It works 5 times faster than repeat and 10 times faster than kron. First, initialise the target array so that the scaled array can be filled in-place, and predefine the slices to save a few cycles:
import numpy as np

K = 2  # scale factor
h, w = a.shape  # dimensions of the source array a
a_x = np.zeros((h * K, w * K), dtype=a.dtype)  # upscaled array
Y = a_x.shape[0]
X = a_x.shape[1]
myslices = []
for y in range(0, K):
    for x in range(0, K):
        s = slice(y, Y, K), slice(x, X, K)
        myslices.append(s)
Now this function will do the scale:
def scale(A, B, slices):  # fill A with B through slices
    for s in slices:
        A[s] = B
Or the same thing simply in one function:
def scale(A, B, k):  # fill A with B scaled by k
    Y = A.shape[0]
    X = A.shape[1]
    for y in range(0, k):
        for x in range(0, k):
            A[y:Y:k, x:X:k] = B
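A quick usage sketch with the 2x2 example from the question (my own check, using the one-function version above):

a = np.array([[1, 1],
              [0, 1]])
K = 2
a_x = np.zeros((a.shape[0] * K, a.shape[1] * K), dtype=a.dtype)
scale(a_x, a, K)
print(a_x)
# [[1 1 1 1]
#  [1 1 1 1]
#  [0 0 1 1]
#  [0 0 1 1]]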