Construct (N+1)-dimensional diagonal matrix from values in N-dimensional array - python

I have an N-dimensional array. I want to expand it to an (N+1)-dimensional array by putting the values of the final dimension in the diagonal.
For example, using explicit looping:
In [197]: M = numpy.arange(5*3).reshape(5, 3)
In [198]: numpy.dstack([numpy.diag(M[i, :]) for i in range(M.shape[0])]).T
Out[198]:
array([[[ 0,  0,  0],
        [ 0,  1,  0],
        [ 0,  0,  2]],

       [[ 3,  0,  0],
        [ 0,  4,  0],
        [ 0,  0,  5]],

       [[ 6,  0,  0],
        [ 0,  7,  0],
        [ 0,  0,  8]],

       [[ 9,  0,  0],
        [ 0, 10,  0],
        [ 0,  0, 11]],

       [[12,  0,  0],
        [ 0, 13,  0],
        [ 0,  0, 14]]])
which is a 5×3×3 array.
My actual arrays are large and I would like to avoid explicit looping (hiding the loop in map instead of a list comprehension gives no performance gain; it is still a loop). Although numpy.diag works for constructing a regular 2-D diagonal matrix, it does not extend to higher dimensions (given a 2-D array, it extracts the diagonal instead). numpy.diagflat puts everything on one big diagonal, producing a 15×15 array that has far more zeros and cannot be reshaped to 5×3×3.
Is there a way to efficiently construct an (N+1)-dimensional diagonal array from the values in an N-dimensional array, without calling diag many times?

Use numpy.diagonal to take a view of the relevant diagonals of a properly-shaped (N+1)-dimensional array, force the view to be writeable with setflags, and write to the view:
expanded = numpy.zeros(M.shape + M.shape[-1:], dtype=M.dtype)
diagonals = numpy.diagonal(expanded, axis1=-2, axis2=-1)
diagonals.setflags(write=True)
diagonals[:] = M
This produces your desired array as expanded.
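If you prefer not to rely on the writability of that diagonal view, a minimal alternative sketch writes the diagonal through integer indexing on the last two axes instead (same result, no setflags call):
import numpy

M = numpy.arange(5 * 3).reshape(5, 3)

# Allocate the (N+1)-dimensional output, then write M onto the diagonal
# of the last two axes with an integer index array; M broadcasts over
# the leading dimensions.
expanded = numpy.zeros(M.shape + M.shape[-1:], dtype=M.dtype)
idx = numpy.arange(M.shape[-1])
expanded[..., idx, idx] = M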

You can use a hard-to-guess-if-you-don't-know feature of the ubiquitous np.einsum: when used as follows, einsum returns a writable view of the generalized diagonal:
>>> import numpy as np
>>> M = np.arange(5*3).reshape(5, 3)
>>>
>>> out = np.zeros((*M.shape, M.shape[-1]), M.dtype)
>>> np.einsum('...jj->...j', out)[...] = M
>>> out
array([[[ 0,  0,  0],
        [ 0,  1,  0],
        [ 0,  0,  2]],

       [[ 3,  0,  0],
        [ 0,  4,  0],
        [ 0,  0,  5]],

       [[ 6,  0,  0],
        [ 0,  7,  0],
        [ 0,  0,  8]],

       [[ 9,  0,  0],
        [ 0, 10,  0],
        [ 0,  0, 11]],

       [[12,  0,  0],
        [ 0, 13,  0],
        [ 0,  0, 14]]])

A general way to turn the last dimension of an N-D array into a diagonal matrix:
We need to reduce the array to 2 dimensions, apply the numpy.diag() function to each row, and then rebuild the result with the original dimensionality + 1.
First, reshape the array to 2 dimensions:
M.reshape(-1, M.shape[-1])
Then use map to apply np.diag to each row, and rebuild the array with an additional dimension using:
result.reshape([*M.shape, M.shape[-1]])
All of this combined gives the following:
result = np.array(list(map(
    np.diag,
    M.reshape(-1, M.shape[-1])
))).reshape([*M.shape, M.shape[-1]])
An example:
shape = np.arange(2,8)
M = np.arange(shape.prod()).reshape(shape)
print(M.shape) # (2, 3, 4, 5, 6, 7)
result = np.array(list(map(np.diag, M.reshape(-1, M.shape[-1])))).reshape([*M.shape, M.shape[-1]])
print(result.shape) # (2, 3, 4, 5, 6, 7, 7)
and result[0, 0, 0, 0, 2] contains the following:
array([[14,  0,  0,  0,  0,  0,  0],
       [ 0, 15,  0,  0,  0,  0,  0],
       [ 0,  0, 16,  0,  0,  0,  0],
       [ 0,  0,  0, 17,  0,  0,  0],
       [ 0,  0,  0,  0, 18,  0,  0],
       [ 0,  0,  0,  0,  0, 19,  0],
       [ 0,  0,  0,  0,  0,  0, 20]])

Related

numpy resize n-dimensional array with padding

I have two arrays, a and b.
a has shape (1, 2, 3, 4)
b has shape (4, 3, 2, 1)
I would like to make them both (4, 3, 3, 4) with the new positions filled with 0's.
I can do:
new_shape = (4, 3, 3, 4)
a = np.resize(a, new_shape)
b = np.resize(b, new_shape)
..but this repeats the elements of each to form the new elements, which does not work for me.
Instead I thought I could do:
a = a.resize(new_shape)
b = b.resize(new_shape)
..which according to the documentation pads with 0's.
But it doesn't work for multi-dimensional arrays, raising error:
ValueError: resize only works on single-segment arrays
So is there a different way to achieve this? ie. same as np.resize but with 0-padding?
NB: I am only looking for pure-numpy solutions.
EDIT: I'm using numpy version 1.20.2
EDIT: I just found out that it works for numbers but not for objects; I forgot to mention that it is an array of objects, not numbers.
The resize method pads with 0s in a flattened sense; the np.resize function pads with repeats.
To illustrate how resize "flattens" before padding:
In [108]: a = np.arange(12).reshape(1,4,3)
In [109]: a
Out[109]:
array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8],
        [ 9, 10, 11]]])
In [110]: a1 = a.copy()
In [111]: a1.resize((2,4,4))
In [112]: a1
Out[112]:
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11],
        [ 0,  0,  0,  0]],

       [[ 0,  0,  0,  0],
        [ 0,  0,  0,  0],
        [ 0,  0,  0,  0],
        [ 0,  0,  0,  0]]])
If instead I make a target array of the right shape, and copy, I can maintain the original multidimensional block:
In [114]: res = np.zeros((2,4,4),a.dtype)
In [115]: res[:a.shape[0],:a.shape[1],:a.shape[2]]=a
In [116]: res
Out[116]:
array([[[ 0,  1,  2,  0],
        [ 3,  4,  5,  0],
        [ 6,  7,  8,  0],
        [ 9, 10, 11,  0]],

       [[ 0,  0,  0,  0],
        [ 0,  0,  0,  0],
        [ 0,  0,  0,  0],
        [ 0,  0,  0,  0]]])
I wrote out the slices explicitly (for clarity). Such a tuple could be created programmatically if needed.
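For example, a minimal sketch that builds the tuple of slices from the source array's shape, so the same copy works for any number of dimensions:
import numpy as np

a = np.arange(12).reshape(1, 4, 3)
new_shape = (2, 4, 4)

# One slice per dimension of `a`, covering the original block in the target.
res = np.zeros(new_shape, a.dtype)
res[tuple(slice(d) for d in a.shape)] = a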

What's the most efficient way to replace some given indices of a NumPy array?

I have three arrays: indices, values, and replace_values. I have to loop over indices, replacing each value at values[indices[i]] with replace_values[i]. What's the fastest possible way to do this? It feels like there should be some way to speed it up using NumPy functions or advanced slicing instead of the normal for loop.
This code works, but is relatively slow:
import numpy as np
# Example indices and values
values = np.zeros([5, 5, 3]).astype(int)
indices = np.array([[0,0], [1,0], [1,3]])
replace_values = np.array([[140, 150, 160], [20, 30, 40], [100, 110, 120]])
print("The old values are:")
print(values)
for i in range(len(indices)):
    values[indices[i][0], indices[i][1]] = replace_values[i]
print("The new values are:")
print(values)
Use zip to separate x and y indices, then cast to tuple and assign:
>>> values[tuple(zip(*indices))] = replace_values
>>> values
array([[[140, 150, 160],
        [  0,   0,   0],
        [  0,   0,   0],
        [  0,   0,   0],
        [  0,   0,   0]],

       [[ 20,  30,  40],
        [  0,   0,   0],
        [  0,   0,   0],
        [100, 110, 120],
        [  0,   0,   0]],

       [[  0,   0,   0],
        [  0,   0,   0],
        [  0,   0,   0],
        [  0,   0,   0],
        [  0,   0,   0]],

       [[  0,   0,   0],
        [  0,   0,   0],
        [  0,   0,   0],
        [  0,   0,   0],
        [  0,   0,   0]],

       [[  0,   0,   0],
        [  0,   0,   0],
        [  0,   0,   0],
        [  0,   0,   0],
        [  0,   0,   0]]])
Where tuple(zip(*indices)) returns:
((0, 1, 1), (0, 0, 3))
As your indices is a np.array itself, you can drop zip and use the transpose, as pointed out by @MadPhysicist:
>>> values[tuple(indices.T)] = replace_values
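Equivalently, since indices is already an array, you can index each axis with one of its columns directly; a small self-contained sketch:
import numpy as np

values = np.zeros((5, 5, 3), dtype=int)
indices = np.array([[0, 0], [1, 0], [1, 3]])
replace_values = np.array([[140, 150, 160], [20, 30, 40], [100, 110, 120]])

# Fancy indexing with one integer array per axis: each row of replace_values
# lands at the corresponding (row, column) position in a single vectorized step.
values[indices[:, 0], indices[:, 1]] = replace_values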

Group by sparse matrix in scipy and return a matrix

There are a few questions on SO dealing with using groupby with sparse matrices. However, the outputs seem to be lists, dictionaries, dataframes and other objects.
I'm working on an NLP problem and would like to keep all the data in sparse scipy matrices during processing to prevent memory errors.
Here's the context:
I have vectorized some documents (sample data here):
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
df = pd.read_csv('groupbysparsematrix.csv')
docs = df['Text'].tolist()
vectorizer = CountVectorizer()
train_X = vectorizer.fit_transform(docs)
print("Dimensions of training set: {0}".format(train_X.shape))
print type(train_X)
Dimensions of training set: (8, 180)
<class 'scipy.sparse.csr.csr_matrix'>
From the original dataframe I use the date, in a day of the year format, to create the groups I would like to sum over:
from scipy import sparse, hstack
df['Date'] = pd.to_datetime(df['Date'])
groups = df['Date'].apply(lambda x: x.strftime('%j'))
groups_X = sparse.csr_matrix(groups.astype(float)).T
train_X_all = sparse.hstack((train_X, groups_X))
print("Dimensions of concatenated set: {0}".format(train_X_all.shape))
Dimensions of concatenated set: (8, 181)
Now I'd like to apply groupby (or a similar function) to find the sum of each token per day. I would like the output to be another sparse scipy matrix.
The output matrix would be 3 x 181 and look something like this:
1, 1, 1, ..., 2, 1, 3
2, 1, 3, ..., 1, 1, 4
0, 0, 0, ..., 1, 2, 5
Where the columns 1 to 180 represent the tokens and column 181 represents the day of the year.
The best way of calculating the sum of selected columns (or rows) of a csr sparse matrix is a matrix product with another sparse matrix that has 1's where you want to sum. In fact csr sum (for a whole row or column) works by matrix product, and indexing rows (or columns) is also done with a product (https://stackoverflow.com/a/39500986/901925).
So I'd group the dates array, and use that information to construct the summing 'mask'.
For sake of discussion, consider this dense array:
In [117]: A
Out[117]:
array([[0, 2, 7, 5, 0, 7, 0, 8, 0, 7],
       [0, 0, 3, 0, 0, 1, 2, 6, 0, 0],
       [0, 0, 0, 0, 2, 0, 5, 0, 0, 0],
       [4, 0, 6, 0, 0, 5, 0, 0, 1, 4],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 7, 0, 8, 1, 0, 9, 0, 2, 4],
       [9, 0, 8, 4, 0, 0, 0, 0, 9, 7],
       [0, 0, 0, 1, 2, 0, 2, 0, 4, 7],
       [3, 0, 1, 0, 0, 0, 0, 0, 0, 2],
       [0, 0, 1, 8, 5, 0, 0, 0, 8, 0]])
Make a sparse copy:
In [118]: M=sparse.csr_matrix(A)
Generate some groups based on the last column; collections.defaultdict is a convenient tool for this:
In [119]: from collections import defaultdict; grps = defaultdict(list)
In [120]: for i,v in enumerate(A[:,-1]):
     ...:     grps[v].append(i)
In [121]: grps
Out[121]: defaultdict(list, {0: [1, 2, 4, 9], 2: [8], 4: [3, 5], 7: [0, 6, 7]})
I can iterate on those groups, collect rows of M, sum across those rows and produce:
In [122]: {k:M[v,:].sum(axis=0) for k, v in grps.items()}
Out[122]:
{0: matrix([[0, 0, 4, 8, 7, 2, 7, 6, 8, 0]], dtype=int32),
2: matrix([[3, 0, 1, 0, 0, 0, 0, 0, 0, 2]], dtype=int32),
4: matrix([[4, 7, 6, 8, 1, 5, 9, 0, 3, 8]], dtype=int32),
7: matrix([[ 9, 2, 15, 10, 2, 7, 2, 8, 13, 21]], dtype=int32)}
In the last column of the sums, note the values 2*4=8 (group 4 has two rows) and 3*7=21 (group 7 has three rows), confirming the grouping.
So there are two tasks: collecting the groups, whether with this defaultdict, with itertools.groupby (which in this case would require sorting), or with pandas groupby; and then collecting the rows and summing them. This dictionary iteration is conceptually simple.
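If you want the per-group sums back as a single sparse matrix rather than a dictionary, a small sketch (toy data standing in for A above) stacks the row sums with sparse.vstack:
from collections import defaultdict
import numpy as np
from scipy import sparse

# Small stand-in for the dense A above; the last column holds the group key.
A = np.array([[1, 0, 2, 7],
              [0, 3, 0, 7],
              [5, 0, 0, 2],
              [0, 4, 1, 2]])
M = sparse.csr_matrix(A)

grps = defaultdict(list)
for i, v in enumerate(A[:, -1]):
    grps[v].append(i)

# One output row per group, in sorted key order; each row sums that group's rows.
summed = sparse.vstack(
    [sparse.csr_matrix(M[rows, :].sum(axis=0)) for _, rows in sorted(grps.items())]
)
print(summed.toarray())
# [[ 5  4  1  4]
#  [ 1  3  2 14]]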
A masking matrix might work like this:
In [141]: mask=np.zeros((10,10),int)
In [142]: for i,v in enumerate(A[:,-1]): # same sort of iteration
     ...:     mask[v,i]=1
     ...:
In [143]: Mask=sparse.csr_matrix(mask)
...
In [145]: Mask.A
Out[145]:
array([[0, 1, 1, 0, 1, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
....
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)
In [146]: (Mask*M).A
Out[146]:
array([[ 0,  0,  4,  8,  7,  2,  7,  6,  8,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 3,  0,  1,  0,  0,  0,  0,  0,  0,  2],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 4,  7,  6,  8,  1,  5,  9,  0,  3,  8],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 9,  2, 15, 10,  2,  7,  2,  8, 13, 21],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0]], dtype=int32)
This Mask*M has the same values as the dictionary rows, but padded with extra all-zero rows. I can isolate the nonzero values with the lil format:
In [147]: (Mask*M).tolil().data
Out[147]:
array([[4, 8, 7, 2, 7, 6, 8], [], [3, 1, 2], [],
[4, 7, 6, 8, 1, 5, 9, 3, 8], [], [],
[9, 2, 15, 10, 2, 7, 2, 8, 13, 21], [], []], dtype=object)
I can construct the Mask matrix directly using the coo sparse style of input:
Mask = sparse.csr_matrix((np.ones(A.shape[0], int),
                          (A[:,-1], np.arange(A.shape[0]))), shape=A.shape)
That should be faster and avoid the memory error (no loop or large dense array).
Here is a trick using LabelBinarizer and matrix multiplication.
from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer(sparse_output=True)
grouped = lb.fit_transform(groups).T.dot(train_X)
grouped is the output sparse matrix of size 3 x 180. And you can find the list of its groups in lb.classes_.
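A minimal, self-contained sketch of the same idea (toy counts and made-up day labels, just to show the shapes involved):
import numpy as np
from scipy import sparse
from sklearn.preprocessing import LabelBinarizer

# 5 documents x 4 tokens, plus a day-of-year label for each document.
train_X = sparse.csr_matrix(np.arange(20).reshape(5, 4))
groups = ["001", "001", "002", "003", "003"]

lb = LabelBinarizer(sparse_output=True)
indicator = lb.fit_transform(groups)   # shape (5, 3): one column per group
grouped = indicator.T.dot(train_X)     # shape (3, 4): token sums per group

print(lb.classes_)        # group labels, in the row order of `grouped`
print(grouped.toarray())
# [[ 4  6  8 10]
#  [ 8  9 10 11]
#  [28 30 32 34]]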

Initializing an n-dimensional matrix elegantly in Python

There have been a couple questions on SO about how to initialize a 2-dimensional matrix, with the answer being something like this:
matrix = [[0 for x in range(10)] for x in range(10)]
Is there any way to generalize this to n dimensions without using for blocks or writing out a really long nested list comprehension?
As integers are immutable you can reduce your code to:
matrix = [[0] * 10 for x in range(10)]
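Be careful not to shorten this further to [[0] * 10] * 10: the outer * repeats the same inner list object, so mutating one row mutates them all. A quick illustration:
rows_shared = [[0] * 3] * 3                   # three references to one list
rows_shared[0][0] = 1
print(rows_shared)                            # [[1, 0, 0], [1, 0, 0], [1, 0, 0]]

rows_distinct = [[0] * 3 for _ in range(3)]   # a new inner list per row
rows_distinct[0][0] = 1
print(rows_distinct)                          # [[1, 0, 0], [0, 0, 0], [0, 0, 0]]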
As @iCodez mentioned in the comments, if NumPy is an option you can simply do:
import numpy as np
matrix = np.zeros((10, 10))
If you really want a matrix, np.zeros and np.ones can quickly create such a 2-dimensional array for instantiating one:
import numpy as np
my_matrix = np.matrix(np.zeros((10,10)))
To generalize to n dimensions, you can't use a matrix, which by definition is 2 dimensional:
n_dimensions = 3
width = 10
n_dimensional_array = np.ones((width,) * n_dimensions)
@brian-putman was faster and better... anyway, this is my solution:
init = lambda x, y: [init(x, y-1) if y>1 else 0 for _ in xrange(x)]
It generates only square matrices of size x, filled with zeros, in y dimensions. Called like this:
init(5, 3)
[[[0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0]],
 [[0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0]],
 [[0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0]],
 [[0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0]],
 [[0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0]]]
I agree that if numpy is an option, it's a much easier way to work with matrices. I highly recommend it.
That being said, this recursive function is a reasonable way to generalize your code to n dimensions. The first parameter is a list or tuple specifying how large each dimension should be (and, indirectly, how many dimensions). The second parameter is the constant value to fill the matrix with (in your example, 0):
def init(sizes, value=0):
    if len(sizes) == 1:
        return [value] * sizes[0]
    else:
        # old code - fixed per comment. This method does not create
        # sizes[0] *new* lists, it just repeats the same list
        # sizes[0] times. This causes unexpected behavior when you
        # try to set an item in a list and all of its siblings get
        # the same change
        # return [init(sizes[1:], value)] * sizes[0]
        # this method works better; it creates a new list each time through
        return [init(sizes[1:], value) for i in xrange(sizes[0])]
matrix = init((2,3,4), 5)
matrix[0][0][0] = 100 # setting value per request in comment
print matrix
>>> [[[100, 5, 5, 5], [5, 5, 5, 5], [5, 5, 5, 5]], [[5, 5, 5, 5], [5, 5, 5, 5], [5, 5, 5, 5]]]
N-dimensional arrays are a little hard to print on a 2D screen, but you can see the structure of matrix a little more easily in the snippet below which I manually indented. It's an array of length 2, containing arrays of length 3, containing arrays of length 4, where every value is set to 5:
[
    [
        [100, 5, 5, 5],
        [5, 5, 5, 5],
        [5, 5, 5, 5]
    ],
    [
        [5, 5, 5, 5],
        [5, 5, 5, 5],
        [5, 5, 5, 5]
    ]
]

stacking unequal matrices in python

Can someone tell me how to join two unequal numpy arrays (one sparse and one dense)? I tried using hstack/vstack but keep getting a dimensionality error.
from scipy import sparse
from scipy.sparse import coo_matrix
m = coo_matrix(X_new)   # a (7395, 50000) sparse array
a = other               # a (7395, 20) dense array
new_tr=scipy.sparse.hstack((m,a))
Please post more context to make your problem reproducible. As posted, your code works for me:
X_new = np.zeros((10,5), int)
X_new[(np.random.randint(0,10,5),np.random.randint(0,5,5))] = np.random.randint(0,10,5)
X_new
#array([[ 0, 0, 7, 0, 0],
# [ 0, 0, 0, 0, 0],
# [ 0, 0, 8, 3, 0],
# [ 0, 0, 0, 0, 0],
# [ 0, 0, 0, 0, 0],
# [ 0, 0, 0, 0, 0],
# [ 0, 0, 8, 0, 0],
# [ 1, 0, 0, 0, 0],
# [ 0, 0, 0, 0, 0],
# [ 0, 0, 0, 0, 0]])
m = coo_matrix(X_new)
m
#<10x5 sparse matrix of type '<type 'numpy.int64'>'
# with 5 stored elements in COOrdinate format>
a = np.matrix(np.random.randint(0,10,(10,2)))
a
#matrix([[2, 1],
# [5, 2],
# [4, 1],
# [1, 4],
# [5, 2],
# [7, 2],
# [6, 3],
# [8, 4],
# [5, 5],
# [7, 4]])
new_tr = sparse.hstack([m,a])
new_tr
#<10x7 sparse matrix of type '<type 'numpy.int64'>'
# with 25 stored elements in COOrdinate format>
new_tr.todense()
#matrix([[ 0, 0, 7, 0, 0, 2, 1],
# [ 0, 0, 0, 0, 0, 5, 2],
# [ 0, 0, 8, 3, 0, 4, 1],
# [ 0, 0, 0, 0, 0, 1, 4],
# [ 0, 0, 0, 0, 0, 5, 2],
# [ 0, 0, 0, 0, 0, 7, 2],
# [ 0, 0, 8, 0, 0, 6, 3],
# [ 1, 0, 0, 0, 0, 8, 4],
# [ 0, 0, 0, 0, 0, 5, 5],
# [ 0, 0, 0, 0, 0, 7, 4]])
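If you do hit a dimensionality error with code like this, the usual culprit is a row-count mismatch: every block passed to sparse.hstack must have the same number of rows. A quick sanity-check sketch (small placeholder shapes standing in for your arrays):
import numpy as np
from scipy import sparse

m = sparse.coo_matrix(np.eye(10))   # stands in for the (7395, 50000) sparse array
a = np.ones((10, 4))                # stands in for the (7395, 20) dense array

# Both blocks must share the row count, or hstack raises a dimension error.
assert m.shape[0] == a.shape[0]
new_tr = sparse.hstack((m, a))
print(new_tr.shape)                 # (10, 14)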
