Set data members directly in a scipy sparse matrix - python

I'm building a large CSR sparse matrix which uses quite a lot of memory even in sparse format, so I want to avoid a copy when I create the matrix. The most efficient way I found is to build the compressed sparse row representation directly. However, the class initializer copies the arrays I pass to it, so I set the data members directly instead. Example:
import numpy as np
from scipy import sparse
m = sparse.csr_matrix((5,5))
m.data = np.arange(5)
m.indices = np.arange(5)
m.indptr = np.arange(6)
This appears to work, but I didn't find it in the documentation. I'd like to know whether it is supported and whether it breaks something I have not tried.
Also, it would be useful to know whether I can use memory-mapped arrays without quirks, or use different integer dtypes for the indices.
Edit:
The accepted answer shows that no copy happens provided the index dtypes are correct. I checked __init__: even when it doesn't copy indices and indptr, it scans both of them twice to find the minimum and maximum values, and otherwise it does nothing more than set the data, indices and indptr members if the inputs are well-formed. So for performance, what I'm doing now is:
# [...] get shape and data from somewhere
m = sparse.csr_matrix(shape, dtype=data.dtype)
indices = np.empty(..., dtype=m.indices.dtype)
indptr = np.empty(..., dtype=m.indptr.dtype)
# [...] fill indices and indptr
m.data = data
m.indices = indices
m.indptr = indptr
# Possibly also do one or both of the following:
m.has_sorted_indices = True
m.has_canonical_format = True
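A minimal, self-contained sketch of this pattern (the shapes and values here are made up for illustration), using np.shares_memory to confirm that the assigned arrays are used as-is rather than copied:
import numpy as np
from scipy import sparse

data = np.arange(5.0)
indices = np.arange(5, dtype=np.int32)
indptr = np.arange(6, dtype=np.int32)

m = sparse.csr_matrix((5, 5))
m.data, m.indices, m.indptr = data, indices, indptr

print(np.shares_memory(data, m.data))  # True -> no copy was made
print(m.toarray().diagonal())          # [0. 1. 2. 3. 4.]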

Here's an example of making a sparse matrix without copying the definition arrays:
In [191]: data=np.arange(5)
...: indices=np.arange(5).astype('int32')
...: indptr=np.arange(6).astype('int32')
In [192]: M = sparse.csr_matrix((data,indices,indptr))
In [193]: data.__array_interface__['data'], M.data.__array_interface__['data']
Out[193]: ((55897168, False), (55897168, False))
In [194]: indices.__array_interface__['data'], M.indices.__array_interface__['data']
Out[194]: ((70189040, False), (70189040, False))
In [195]: indptr.__array_interface__['data'], M.indptr.__array_interface__['data']
Out[195]: ((56184432, False), (56184432, False))
https://github.com/scipy/scipy/blob/v1.4.1/scipy/sparse/compressed.py
I wrote that with the __init__ in the source above in mind. Look also at the check_format method to see what it checks for consistency.
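If you build the attributes by hand, you can run those same consistency checks yourself; check_format is a public method on the compressed sparse formats:
# After assigning data/indices/indptr manually, run the constructor's
# consistency checks; full_check=True also verifies that indices are in bounds.
m.check_format(full_check=True)  # raises a ValueError if anything is inconsistent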


Numpy empty list type inference

Why is the empty list [] being inferred as float type when using np.append?
np.append([1,2,3], [0])
# output: array([1, 2, 3, 0]), dtype = np.int64
np.append([1,2,3], [])
# output: array([1., 2., 3.]), dtype = np.float64
This persists even when using np.array([1,2,3], dtype=np.int32) as arr.
It's not possible to specify a dtype for append, so I am just curious why this happens. Numpy's concatenate does the same thing, and when I try to specify the dtype I get an error:
np.concatenate([[1,2,3], []], dtype=np.int64)
Error:
TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'same_kind'
But if I set the casting rule to unsafe, it works:
np.concatenate([[1,2,3], []], dtype=np.int64, casting='unsafe')
Why is [] considered a float?
np.append is subject to well-defined promotion rules like any Numpy binary operation. It first converts the input operands to Numpy arrays if they are not already (typically with np.array), then applies the promotion rules to find the type of the resulting array and checks that the operation is valid before performing it (here the concatenation). According to the documentation, the array type returned by np.array is "determined as the minimum type required to hold the objects in the sequence". When the list is empty, as in your case, the default type is numpy.float64, as stated in the documentation of np.empty. This arbitrary choice was made long ago and has not been changed since, so as not to break old code. Note that not all Numpy developers seem to agree with the current choice, so this is a matter of debate. For more information, you can read this open issue.
The rule of thumb is to use either existing Numpy arrays or to perform an explicit conversion to a Numpy array using np.array with a fixed dtype parameter (as described in the comments above), for example:
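A small illustration of that rule of thumb (a sketch, not the only way to do it):
import numpy as np

# Convert each operand explicitly so the empty list cannot force float64:
a = np.array([1, 2, 3], dtype=np.int64)
b = np.array([], dtype=np.int64)
print(np.concatenate([a, b]).dtype)  # int64 -- both inputs are already int64
print(np.append(a, b).dtype)         # int64 as well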
Look at the code for np.append (via docs link or ipython):
def append(arr, values, axis=None):
    arr = asanyarray(arr)
    if axis is None:
        if arr.ndim != 1:
            arr = arr.ravel()
        values = ravel(values)
        axis = arr.ndim - 1
    return concatenate((arr, values), axis=axis)
The first argument is turned into an array, if it isn't one already.
You don't specify the axis, so both arr and values are ravelled - turned into 1d arrays. np.ravel is also python code, and does asanyarray(a).ravel(order=order).
So the dtype inference is done by np.asanyarray.
The rest of the action is np.concatenate. It too will convert the inputs to arrays if necessary. The result dtype is the "highest" of the input dtypes.
np.append is a poorly conceived (IMO) alternative way of using np.concatenate. It is not a list append clone.
Also be careful about "empty" arrays:
In [73]: np.array([])
Out[73]: array([], dtype=float64)
In [74]: np.empty((0))
Out[74]: array([], dtype=float64)
In [75]: np.empty((0),int)
Out[75]: array([], dtype=int64)
The common list idiom
alist = []
for i in range(10):
    alist.append(i)
does not translate well into numpy. Build a list of arrays, and do one concatenate/vstack at the end, as sketched below. Don't iterate over "empty" arrays, however they were created.
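A minimal sketch of that pattern (the array contents are arbitrary):
import numpy as np

# Collect pieces in a plain Python list, then concatenate once at the end:
pieces = []
for i in range(10):
    pieces.append(np.array([i, i + 1]))
result = np.concatenate(pieces)  # one allocation; dtype inferred from the pieces
print(result.shape)              # (20,)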

How does memory allocation occur in a numpy array?

import numpy as np
a = np.arange(5)
for i in a:
    print("Id of {} : {} \n".format(i, id(i)))
>>>>
Id of 0 : 2295176255984
Id of 1 : 2295176255696
Id of 2 : 2295176255984
Id of 3 : 2295176255696
Id of 4 : 2295176255984
I want to understand how the elements of a numpy array are allocated in memory, which, judging from this output, I understand is different from Python lists.
Any help is appreciated.
In [68]: arr = np.arange(5)
In [69]: arr
Out[69]: array([0, 1, 2, 3, 4])
One way of viewing the attributes of a numpy array is:
In [70]: arr.__array_interface__
Out[70]:
{'data': (139628245945184, False),
'strides': None,
'descr': [('', '<i8')],
'typestr': '<i8',
'shape': (5,),
'version': 3}
data is something like the id of its data buffer, where the values are actually stored. We can't use this number in other code, but it is useful when checking for things like views. The rest is used to interpret those values.
The memory for arr is a C array, 40 bytes long (5*8), somewhere. Exactly where does not matter to us. Any view of arr will work with the same data buffer. A copy will have its own data buffer.
Iterating on the array is like accessing values one by one:
In [71]: i = arr[1]
In [72]: i
Out[72]: 1
In [73]: type(i)
Out[73]: numpy.int64
This i is not a reference to an element of arr. It is a new object with the same value. It's a lot like a 0d array, with many of the same attributes, including:
In [74]: i.__array_interface__
Out[74]:
{'data': (25251568, False),
'strides': None,
'descr': [('', '<i8')],
'typestr': '<i8',
'shape': (),
'version': 3,
'__ref': array(1)}
This is why you can't make much sense of the ids seen during iteration. It is also why iterating on a numpy array is slower than iterating on a list. We strongly discourage iteration like this.
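To make the point concrete, here is a small sketch contrasting element-by-element iteration with a whole-array operation (the squaring operation itself is arbitrary):
import numpy as np

arr = np.arange(5)

# The loop creates a new numpy scalar object for every element:
squares_slow = np.array([i**2 for i in arr])

# The vectorized form does the same work in one C-level loop:
squares_fast = arr**2

assert (squares_slow == squares_fast).all()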
Contrast that with a list, where elements are stored (in some sort of data buffer) by reference:
In [78]: a,b,c = 100,'b',{}
In [79]: id(a)
Out[79]: 9788064
In [80]: alist=[a,b,c]
In [81]: id(alist[0])
Out[81]: 9788064
The list actually contains a, or, if you prefer, a reference to the same object that the variable a references. Remember, Python is object-oriented all the way down.
In sum, Python lists contain references. Numpy arrays contain values, which the array's own methods access and manipulate. There is an object dtype that does contain references, but let's not go there.
I'm a fan of Code with Mosh. He teaches this kind of thing on his YouTube channel as well as on Udemy. I've purchased his Udemy course on data structures and algorithms, which goes deep into how these things work.
For example, while teaching about arrays, he shows how to build one from scratch so that you understand the underlying concepts.
You can take a look here: https://www.youtube.com/watch?v=BBpAmxU_NQo
If you're only interested in the NumPy array:
First I'll tell you about the differences:
Difference between a NumPy array and a Python list
Numpy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object and tools for working with these arrays. A NumPy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative integers. The number of dimensions is the rank of the array; the shape of an array is a tuple of integers giving the size of the array along each dimension.
The Python core library provides lists. A list is the Python equivalent of an array, but it is resizable and can contain elements of different types.
A common beginner question is: what is the real difference here? The answer is performance. Numpy data structures perform better in:
Size - Numpy data structures take up less space
Performance - they have a need for speed and are faster than lists
Functionality - SciPy and NumPy have optimized functions such as linear algebra operations built-in.
Another key difference is in how they store and make use of memory.
Memory
The main benefits of using NumPy arrays should be smaller memory consumption and better runtime behaviour.
For Python lists - we can conclude from this that for every new element, we need another eight bytes for the reference to the new object; the new integer object itself consumes 28 bytes.
NumPy takes up less space: an integer array of length n needs only a small fixed header plus 8n bytes (for 64-bit integers).
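A rough way to see this yourself (a sketch; exact byte counts vary by Python and NumPy version, and CPython's caching of small ints skews the list estimate somewhat):
import sys
import numpy as np

n = 1000
lst = list(range(n))
arr = np.arange(n)

# List: 8-byte references in the list buffer plus one int object per element.
list_bytes = sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst)
# Array: one small fixed header plus 8 bytes per int64 element.
array_bytes = sys.getsizeof(arr)

print(list_bytes)   # roughly 36 * n
print(array_bytes)  # roughly 100 + 8 * n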
If you are curious and want proof that NumPy really takes less time:
# importing required packages
import numpy
import time
# size of arrays and lists
size = 1000000
# declaring lists
list1 = range(size)
list2 = range(size)
# declaring arrays
array1 = numpy.arange(size)
array2 = numpy.arange(size)
# capturing time before the multiplication of Python lists
initialTime = time.time()
# multiplying elements of both lists and storing the result in another list
resultantList = [(a * b) for a, b in zip(list1, list2)]
# calculating execution time
print("Time taken by Lists to perform multiplication:",
(time.time() - initialTime),
"seconds")
# capturing time before the multiplication of Numpy arrays
initialTime = time.time()
# multiplying elements of both Numpy arrays and storing the result in another Numpy array
resultantArray = array1 * array2
# calculating execution time
print("Time taken by NumPy Arrays to perform multiplication:",
(time.time() - initialTime),
"seconds")
Output:
Time taken by Lists to perform multiplication: 0.15030384063720703 seconds
Time taken by NumPy Arrays to perform multiplication: 0.005921125411987305 seconds
Wait.. There is a very big disadvantage too:
Requires contiguous allocation of memory -
Insertion and deletion can become costly, because the data is stored in contiguous memory locations and inserting or deleting an element requires shifting everything after it.
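A quick sketch showing that numpy's insert cannot work in place, precisely because of this contiguous layout:
import numpy as np

a = np.arange(5)

# np.insert has to allocate a new array and copy everything over;
# repeated insertion is therefore O(n) per call:
b = np.insert(a, 2, 99)
print(b)                       # [ 0  1 99  2  3  4]
print(np.shares_memory(a, b))  # False -- a full copy was made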
If you want to learn more about numpy:
https://www.educba.com/introduction-to-numpy/
You can thank me later!

Why is the length of the array appended in a loop more than the number of iterations?

I ran this code and expected an array of size 10000, as time is a numpy array of length 10000.
freq = np.empty([])
for i, t in enumerate(time):
    freq = np.append(freq, np.sin(t))
print(time.shape)
print(freq.shape)
But this is the output I got
(10000,)
(10001,)
Can someone explain why I am getting this disparity?
It turns out that the function np.empty() returns an uninitialized array of the given shape. np.empty([]) returns a 0-d array holding a single uninitialized value, e.g. array(0.14112001), whatever garbage happened to be at that memory location. You can check this by printing the variable freq before the loop starts.
So when you loop over freq = np.append(freq, np.sin(t)), the appends start from that one-element array, and the extra element remains in the result.
Also, if you just need to create an empty array, do x = np.array([]) or x = [].
You can read more about this numpy.empty function here:
https://numpy.org/doc/1.18/reference/generated/numpy.empty.html
And more about initializing arrays here:
https://www.ibm.com/support/knowledgecenter/SSGH2K_13.1.3/com.ibm.xlc1313.aix.doc/language_ref/aryin.html
I'm not sure if I was clear enough; it's not a straightforward concept, so please let me know.
You should use np.empty(0) instead.
I looked at the source code in numpy (this particular version is the matrix variant, np.matlib.empty, but np.empty handles the shape argument the same way):
def empty(shape, dtype=None, order='C'):
    """Return a new matrix of given shape and type, without initializing entries.

    Parameters
    ----------
    shape : int or tuple of int
        Shape of the empty matrix.
    dtype : data-type, optional
        Desired output data-type.
    order : {'C', 'F'}, optional
        Whether to store multi-dimensional data in row-major
        (C-style) or column-major (Fortran-style) order in
        memory.

    See Also
    --------
    empty_like, zeros

    Notes
    -----
    `empty`, unlike `zeros`, does not set the matrix values to zero,
    and may therefore be marginally faster. On the other hand, it requires
    the user to manually set all the values in the array, and should be
    used with caution.

    Examples
    --------
    >>> import numpy.matlib
    >>> np.matlib.empty((2, 2))  # filled with random data
    matrix([[ 6.76425276e-320, 9.79033856e-307], # random
            [ 7.39337286e-309, 3.22135945e-309]])
    >>> np.matlib.empty((2, 2), dtype=int)
    matrix([[ 6600475,        0], # random
            [ 6586976, 22740995]])
    """
    return ndarray.__new__(matrix, shape, dtype, order=order)
It passes the shape argument straight into ndarray.__new__, so np.empty(0) creates an empty 1-d array, while np.empty([]) creates a 0-d array with a single uninitialized element.
Print np.empty(0) and np.empty([]) to see how they differ; a quick comparison is sketched below.
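A minimal sketch of the difference (the 10-element time values are made up; any input behaves the same):
import numpy as np

print(np.empty(0))         # [] -- a 1-d array with zero elements
print(np.empty(0).shape)   # (0,)
print(np.empty([]).shape)  # () -- a 0-d array holding ONE uninitialized value

# Starting the loop from the 0-element version gives the expected length:
freq = np.empty(0)
for t in np.linspace(0, 1, 10):
    freq = np.append(freq, np.sin(t))
print(freq.shape)          # (10,)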
I think you are trying to replicate a list operation:
freq = []
for i, t in enumerate(time):
    freq.append(np.sin(t))
But neither np.empty nor np.append is an exact clone; the names are similar but the differences are significant.
First:
In [75]: np.empty([])
Out[75]: array(1.)
In [77]: np.empty([]).shape
Out[77]: ()
This is a 1 element, 0d array.
If you look at the code for np.append you'll see that if the 1st argument is not 1d (and axis is not provided) it flattens it (that's documented as well):
In [78]: np.append??
In [82]: np.empty([]).ravel()
Out[82]: array([1.])
In [83]: np.empty([]).ravel().shape
Out[83]: (1,)
It is now a 1d, 1 element array. Append another array to it:
In [84]: np.append(np.empty([]), np.sin(2))
Out[84]: array([1. , 0.90929743])
The result has 2 elements. Repeat that 10000 times and you end up with 10001 values.
np.empty, despite its name, does not produce a [] list equivalent. As others show, np.array([]) sort of does, as would np.empty(0).
np.append is not a list append clone. It is just a cover function for np.concatenate. It's ok for adding an element to a longer array once in a while, but beyond that it has too many pitfalls to be useful. It's especially bad in a loop like this: getting a correct start array is tricky, and it is slow (compared to list append). Actually these problems apply to all uses of concatenate and stack... in a loop.
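For the question's actual task, a sketch of the two idiomatic fixes (np.sin is a ufunc, so no loop is needed at all; the time array here stands in for the question's):
import numpy as np

time = np.linspace(0, 10, 10000)

# Best: let the ufunc handle the whole array in one call.
freq = np.sin(time)
print(freq.shape)  # (10000,)

# If a loop really is required, collect in a list and convert once:
freq = np.array([np.sin(t) for t in time])
print(freq.shape)  # (10000,)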

Numpy view contiguous part of non-contiguous array as dtype of bigger size

I was trying to generate an array of trigrams (i.e. consecutive three-letter combinations) from a very long char array:
# data is actually loaded from a source file
a = np.random.randint(0, 256, 2**28, 'B').view('c')
Since making a copy is not efficient (and creates problems like cache misses), I generated the trigrams directly using stride tricks:
tri = np.lib.stride_tricks.as_strided(a, (len(a) - 2, 3), a.strides * 2)
This generates a trigram list with shape (2**28 - 2, 3), where each row is a trigram. Now I want to convert the trigrams to a list of strings (i.e. S3) so that numpy displays them more "reasonably" (instead of as individual chars).
tri = tri.view('S3')
It gives the exception:
ValueError: To change to a dtype of a different size, the array must be C-contiguous
I understand that, in general, data should be contiguous in order to create a meaningful view, but this data is contiguous "where it should be": each group of three elements is contiguous.
So I'm wondering how to view the contiguous parts of a non-contiguous np.ndarray as a dtype of bigger size. A more "standard" way would be better, but hackish ways are also welcome. It seems that I can set the shape and strides freely with np.lib.stride_tricks.as_strided, but I can't force the dtype, which is the problem here.
EDIT
A non-contiguous array can be made by simple slicing. For example:
np.empty((8, 4), 'uint32')[:, :2].view('uint64')
will throw the same exception as above (while from a memory point of view I should be able to do this). This case is much more common than my example above.
If you have access to a contiguous array from which your non-contiguous one is derived, it should typically be possible to work around this limitation.
For example your trigrams can be obtained like so:
>>> a = np.random.randint(0, 256, 2**28, 'B').view('c')
>>> a
array([b')', b'\xf2', b'\xf7', ..., b'\xf4', b'\xf1', b'z'], dtype='|S1')
>>> np.lib.stride_tricks.as_strided(a[:0].view('S3'), ((2**28)-2,), (1,))
array([b')\xf2\xf7', b'\xf2\xf7\x14', b'\xf7\x14\x1b', ...,
b'\xc9\x14\xf4', b'\x14\xf4\xf1', b'\xf4\xf1z'], dtype='|S3')
In fact, this example demonstrates that all we need for view casting is a contiguous "stub" at the base of the memory buffer; afterwards, since as_strided does not do many checks, we are essentially free to do whatever we like.
It seems we can always get such a stub by slicing down to a size-0 array. For your second example:
>>> X = np.empty((8, 4), 'uint32')[:, :2]
>>> np.lib.stride_tricks.as_strided(X[:0].view(np.uint64), (8, 1), X.strides)
array([[140133325248280],
[ 32],
[ 32083728],
[ 31978800],
[ 0],
[ 29686448],
[ 32],
[ 32362720]], dtype=uint64)
As of numpy 1.23.0, you will be able to do exactly what you want without jumping through extra hoops. I've added PR#20722 to numpy to address pretty much this exact issue. The idea is that if your new dtype is smaller than the current, you can clearly expand a unit or contiguous axis without any problems. If the new dtype is larger, you can shrink a contiguous axis.
With the update, your code runs out of the box:
>>> a = np.random.randint(0, 256, 2**28, 'B').view('c')
>>> a
array([b'\x19', b'\xf9', b'\r', ..., b'\xc3', b'\xa3', b'{'], dtype='|S1')
>>> tri = np.lib.stride_tricks.as_strided(a, (len(a)-2,3), a.strides*2)
>>> tri.view('S3')
array([[b'\x9dB\xeb'],
[b'B\xebU'],
[b'\xebU\xa4'],
...,
[b'-\xcbM'],
[b'\xcbM\x97'],
[b'M\x97o']], dtype='|S3')
The array has to have a unit dimension or be contiguous in the last axis, which is true in your case.
I've also added PR#20694 to introduce slicing to the np.char module. If that PR gets accepted as-is, you will be able to do:
>>> np.char.slice_(a.view(f'U{len(a)}'), step=1, chunksize=3)

How should this mixed scipy.sparse / numpy program be handled?

I am currently trying to use numpy as well as scipy to handle sparse matrices, but in the process of evaluating the sparsity of a matrix I ran into trouble, and I don't know how the following behaviour should be understood:
import numpy as np
import scipy.sparse as sp
a=sp.csc.csc_matrix(np.ones((3,3)))
a
np.count_nonzero(a)
When evaluating a and the nonzero count using the code above, I saw this output in ipython:
Out[9]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
    with 9 stored elements in Compressed Sparse Column format>
Out[10]: 1
I think there is something I don't understand here.
A 3x3 matrix full of ones should have 9 nonzero terms, and that is the answer I get if I use the toarray method from scipy.
Am I using numpy and scipy the wrong way?
The nonzero count is available as an attribute:
In [295]: a=sparse.csr_matrix(np.arange(9).reshape(3,3))
In [296]: a
Out[296]:
<3x3 sparse matrix of type '<class 'numpy.int32'>'
with 8 stored elements in Compressed Sparse Row format>
In [297]: a.nnz
Out[297]: 8
As Warren commented, you can't count on numpy functions working on sparse matrices. Use sparse functions and methods. Sometimes a numpy function is written in a way that invokes the array's own method, in which case the call might work. But that is true only on a case by case basis.
In IPython I make heavy use of a.<tab> to get a list of completions (attributes and methods). I also use function?? to look at the code.
In the case of np.count_nonzero I see no code - it is compiled, and only works on np.ndarray objects.
np.nonzero(a) works. Look at its code, and see that it looks for the array's method: nonzero = a.nonzero
The sparse nonzero method code is:
def nonzero(self):
    ...
    # convert to COOrdinate format
    A = self.tocoo()
    nz_mask = A.data != 0
    return (A.row[nz_mask], A.col[nz_mask])
The A.data != 0 test is there because it is possible to construct a matrix with explicit 0 data elements, particularly if you use the coo (data,(i,j)) format. So apart from that caution, the nnz attribute gives a reliable count.
Doing a.<tab> I also see a.getnnz and a.eliminate_zeros methods, which may be helpful if you are worried about sneaky zeros. For example:
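A sketch of such a sneaky zero, built with the coo format mentioned above (the values are arbitrary):
import numpy as np
from scipy import sparse

# An explicitly stored 0 counts toward nnz but is not a true nonzero:
data = np.array([1, 0, 2])
row = np.array([0, 1, 2])
col = np.array([0, 1, 2])
M = sparse.coo_matrix((data, (row, col)), shape=(3, 3)).tocsr()

print(M.nnz)                     # 3 -- includes the stored zero
print(np.count_nonzero(M.data))  # 2 -- the actual nonzero values
M.eliminate_zeros()
print(M.nnz)                     # 2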
Sometimes it is useful to work directly with the data attributes of a sparse matrix. It's safer to access them than to modify them. But each sparse format has different attributes. In the csr case you can do:
In [306]: np.count_nonzero(a.data)
Out[306]: 8
