Mean of non-zero values in sparse matrix? - python

I'm trying to calculate the mean of non-zero values in each row of a sparse row matrix. Using the matrix's mean method doesn't do it:
>>> from scipy.sparse import csr_matrix
>>> a = csr_matrix([[0, 0, 2], [1, 3, 8]])
>>> a.mean(axis=1)
matrix([[ 0.66666667],
        [ 4.        ]])
The following works but is slow for large matrices:
>>> import numpy as np
>>> b = np.zeros(a.shape[0])
>>> for i in range(a.shape[0]):
...     b[i] = a.getrow(i).data.mean()
...
>>> b
array([ 2., 4.])
Could anyone please tell me if there is a faster method?

With a CSR format matrix, you can do this even more easily:
sums = a.sum(axis=1).A1
counts = np.diff(a.indptr)
averages = sums / counts
Row sums are directly supported, and the structure of the CSR format means that the differences between successive values in the indptr array correspond exactly to the number of nonzero elements in each row.
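Put together as a minimal, self-contained sketch (using the sample matrix from the question; the expected output is shown in a comment):
import numpy as np
from scipy.sparse import csr_matrix

a = csr_matrix([[0, 0, 2], [1, 3, 8]])
sums = a.sum(axis=1).A1        # row sums, flattened to a 1D array
counts = np.diff(a.indptr)     # stored entries per row, straight from the CSR index pointer
averages = sums / counts
print(averages)                # [2. 4.]
Note that counts counts stored entries, so a row with no stored values would divide by zero; a mask or np.errstate can guard against that if such rows can occur.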

This seems like the typical problem where you can use numpy.bincount. For this I made use of three function calls:
(x, y, z) = scipy.sparse.find(a)
returns the rows (x), columns (y) and values (z) of the sparse matrix. For instance, x is array([0, 1, 1, 1]).
numpy.bincount(x) returns, for each row number, how many nonzero elements it has.
numpy.bincount(x, weights=z) returns, for each row, the sum of the nonzero elements.
A final working code:
import numpy
import scipy.sparse
from scipy.sparse import csr_matrix

a = csr_matrix([[0, 0, 2], [1, 3, 8]])
(x, y, z) = scipy.sparse.find(a)
countings = numpy.bincount(x)
sums = numpy.bincount(x, weights=z)
averages = sums / countings
print(averages)
returns:
[ 2. 4.]
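One caveat worth noting: numpy.bincount(x) only counts up to the largest row index present, so if the last row(s) of the matrix contain no nonzero values, countings comes out shorter than the number of rows. Passing minlength keeps the shapes aligned; a small sketch, assuming the same a as above:
countings = numpy.bincount(x, minlength=a.shape[0])
sums = numpy.bincount(x, weights=z, minlength=a.shape[0])
Rows with no nonzero entries will still produce a 0/0 division, so they need to be masked or handled separately.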

I always like summing the values over whatever axis you are interested in and dividing by the count of the nonzero elements in the respective row/column.
Like so:
sp_arr = csr_matrix([[0, 0, 2], [1, 3, 8]])
col_avg = sp_arr.sum(0) / (sp_arr != 0).sum(0)
row_avg = sp_arr.sum(1) / (sp_arr != 0).sum(1)
print(col_avg)
matrix([[ 1., 3., 5.]])
print(row_avg)
matrix([[ 2.],
        [ 4.]])
Basically you are summing the total value of all entries along the given axis and dividing by the sum of the True entries where the matrix != 0 (which is the number of real entries).
I find this approach less complicated and easier than the other options.

A simple method to return a list of average values:
a.sum(axis=0) / a.getnnz(axis=0)
This assumes that you don't have any explicit zeros stored in your matrix.
Change the axis as needed.
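If explicit zeros might be present (for example after in-place assignments), one way to make getnnz reliable again is to prune them first; a small sketch, assuming a is a csr_matrix:
a.eliminate_zeros()                # drop explicitly stored zero entries in place
averages = a.sum(axis=0) / a.getnnz(axis=0)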

Related

Accumulate 2D numpy arrays into a 3D tensor, then average them all element-wise after

Accumulation stage
In the script, the same-sized data matrix X is re-estimated by some model (here just a random number generator (RNG)) and accumulated/saved in a matrix Y over the course of a finite number of trials t.
import numpy as np
from numpy.random import random
import pandas as pd
k = 3 #shape
t = 5 #trials
Y = np.zeros((t,k,k))
for i in range(t):
    X = random((k,k)) # 2D estimate
    X = pd.DataFrame(X)
    Y[i,:,:] = X # 3D tensor
Reduction stage
Afterwards, how do I then apply an element-wise reduction of all accumulated 2d X arrays inside the 3d Y tensor into a single 2d matrix Z that is the same shape as X? An example reduction is the average of all the individual X elements reduced into Z:
Z[0,0] = average of {the first X[0,0], the second X[0,0], ..., the fifth X[0,0]}
I'd prefer no element-by-element loops, if possible. I showed the accumulation stage using numpy arrays because I don't think pandas DataFrames can be 3d tensors, being restricted to 2d inputs only, but can the arithmetic reduction stage (averaging across accumulated arrays) be done as a pandas DataFrame operation?
Is this what you're looking for?
Toy example:
test = np.arange(12).reshape(-1,2,3)
array([[[ 0,  1,  2],
        [ 3,  4,  5]],

       [[ 6,  7,  8],
        [ 9, 10, 11]]])
Solution
np.apply_over_axes(np.mean,test,0).reshape(test.shape[1],test.shape[2])
array([[3., 4., 5.],
       [6., 7., 8.]])
And IIRC I think you're correct: pandas cannot really handle 3D tensors unless you mess about with MultiIndexes, so personally I would rather do this operation in numpy first and then convert the result to a DataFrame. You can convert DataFrames to numpy via to_numpy().
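For what it's worth, when the reduction is a plain element-wise average, the same result should also come from a direct axis reduction, which can then be wrapped in a DataFrame; a minimal sketch, assuming Y is the (t, k, k) tensor from the question:
Z = Y.mean(axis=0)      # element-wise average over the t trials; shape (k, k)
Z_df = pd.DataFrame(Z)  # optional: back to a pandas DataFrame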

Numpy: Grouping/binning values based on associations

Forgive me for a vague title. I honestly don't know which title will suit this question. If you have a better title, let's change it so that it will be apt for the problem at hand.
The problem.
Let's say result is a 2D array and values is a 1D array. values holds some values associated with each element in result. The mapping of an element in values to result is stored in x_mapping and y_mapping. A position in result can be associated with different values. Now, I have to find the sum of the values grouped by associations.
An example for better clarification.
result array:
[[0, 0],
[0, 0],
[0, 0],
[0, 0]]
values array:
[ 1., 2., 3., 4., 5., 6., 7., 8.]
Note: here, result and values have the same number of elements, but that need not be the case. There is no relation between the sizes at all.
x_mapping and y_mapping have mappings from 1D values to 2D result. The sizes of x_mapping, y_mapping and values will be the same.
x_mapping - [0, 1, 0, 0, 0, 0, 0, 0]
y_mapping - [0, 3, 2, 2, 0, 3, 2, 1]
Here, the 1st value (values[0]) has x as 0 and y as 0 (x_mapping[0] and y_mapping[0]) and hence is associated with result[0, 0]. If we are counting the number of associations, then the element value at result[0, 0] will be 2, as the 1st and 5th values are associated with result[0, 0]. If we are taking the sum, then result[0, 0] = values[0] + values[4], which is 6.
Current solution
# Initialisation. No connection with the solution.
result = np.zeros([4,2], dtype=np.int16)
values = np.linspace(start=1, stop=8, num=8)
y_mapping = np.random.randint(low=0, high=values.shape[0], size=values.shape[0])
x_mapping = np.random.randint(low=0, high=values.shape[1], size=values.shape[0])
# Summing the values associated with x,y (current solution.)
for i in range(values.size):
    x = x_mapping[i]
    y = y_mapping[i]
    result[-y, x] = result[-y, x] + values[i]
The result,
[[ 6,  0],
 [ 6,  2],
 [14,  0],
 [ 8,  0]]
Failed solution; But why?
test_result = np.zeros_like(result)
test_result[-y_mapping, x_mapping] = test_result[-y_mapping, x_mapping] + values # solution
To my surprise, elements are overwritten in test_result. Values of test_result:
[[5, 0],
[6, 2],
[7, 0],
[8, 0]]
Question
1. Why, in the second solution, every element is overwritten?
As #Divakar has pointed out in the comment in his answer -
NumPy doesn't assign accumulated/summed values when the indices are repeated in test_result[-y_mapping, x_mapping] =. It randomly assigns from one of the instances.
2. Is there any Numpy way to do this? That is without looping? I'm looking for some speed optimization.
Approach #2 in #Divakar's answer gives me good results. For 23315 associations, for loop took 50 ms while Approach #1 took 1.85 ms. Beating all these, Approach #2 took 668 µs.
Side note
I'm using Numpy version 1.14.3 with Python 3.5.2 on an i7 processor.
Approach #1
The most intuitive one would be with np.add.at for those repeated indices -
np.add.at(result, [-y_mapping, x_mapping], values)
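As a quick self-contained check of this against the example in the question (a sketch; the values array is cast to the result dtype here so the unbuffered in-place add does not hit a casting error):
import numpy as np
result = np.zeros((4, 2), dtype=np.int16)
values = np.linspace(start=1, stop=8, num=8).astype(np.int16)
x_mapping = np.array([0, 1, 0, 0, 0, 0, 0, 0])
y_mapping = np.array([0, 3, 2, 2, 0, 3, 2, 1])
np.add.at(result, [-y_mapping, x_mapping], values)
print(result)   # [[ 6  0] [ 6  2] [14  0] [ 8  0]]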
Approach #2
We need to perform binned summations owing to the possibly repeated nature of the x, y indices. Hence, another way could be to use NumPy's binned summation function, np.bincount, with an implementation like so -
# Get linear index equivalents off the x and y indices into result array
m,n = result.shape
out_dtype = result.dtype
lidx = ((-y_mapping)%m)*n + x_mapping
# Get binned summations off values based on linear index as bins
binned_sums = np.bincount(lidx, values, minlength=m*n)
# Finally add into result array
result += binned_sums.astype(result.dtype).reshape(m,n)
If you are always starting off with a zeros array for result, the last step could be made more performant with -
result = binned_sums.astype(out_dtype).reshape(m,n)
I guess you meant to write
y_mapping = np.random.randint(low=0, high=result.shape[0], size=values.shape[0])
x_mapping = np.random.randint(low=0, high=result.shape[1], size=values.shape[0])
With that correction, the code works for me as expected.

Get list of connected nodes ordered by shortest distance

I have a numpy 2D array that represents distances between nodes on a graph. For a single node, I want to get a list of connected nodes ordered by shortest distance. How can I do that?
# create some data
distances = np.array([[0., 1., 2., 3.], [1.,0.,5.,7.], [2.,5.,0.,4.], [3.,7.,4.,0.]])
# get just the node I care about, 1
closest_to_node = distances[:,1]
print (closest_to_node)
# outputs [ 1. 0. 5. 7.]
I would like to order closest_to_node by distance, but my only way of knowing what node it relates to is the order in the array.
I would like a list that is [1, 0, 2, 3], or even better, [1, 2, 3], since item 1 (value 0) is meaningless in this case.
IIUC you could do -
((distances - closest_to_node[:,None])**2).sum(0).argsort()
Alternatively, with Scipy's cdist -
from scipy.spatial.distance import cdist
idx = cdist(distances, closest_to_node[None]).argsort(0).ravel()
Output for given sample -
In [147]: ((distances - closest_to_node[:,None])**2).sum(0).argsort()
Out[147]: array([1, 0, 2, 3])
In [148]: cdist(distances, closest_to_node[None]).argsort(0).ravel()
Out[148]: array([1, 0, 2, 3])
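As a side note, since closest_to_node already holds the distances from node 1, a plain argsort should give the same ordering, and filtering out the queried node leaves only its neighbours (a sketch):
order = closest_to_node.argsort()    # array([1, 0, 2, 3])
neighbours = order[order != 1]       # array([0, 2, 3]), node 1 itself removed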

How to convert a Numpy 2D array with object dtype to a regular 2D array of floats

As part of a broader program I am working on, I ended up with object arrays with strings, 3D coordinates, etc. all mixed together. I know object arrays might not be favored compared to structured arrays, but I am hoping to get around this without changing a lot of code.
Let's assume every row of my array obj_array (with N rows) has the format:
Single entry/object of obj_array: ['NAME',[10.0,20.0,30.0],....]
Now, I am trying to load this object array and slice out the 3D coordinate chunk. Up to here, everything works fine with simply asking for, let's say:
obj_array[:,[1,2,3]]
However, the result is also an object array, and I run into a problem because I want to form a 2D array of floats of size [N, 3]: N rows, each with the 3 entries of the X, Y, Z coordinates. For now, I am looping over the rows and assigning every row to a row of a destination 2D float array to get around the problem. I am wondering if there is a better way using numpy's array conversion tools? I tried a few things and could not get around it.
Centers = np.zeros([N,3])
for row in range(obj_array.shape[0]):
    Centers[row,:] = obj_array[row,1]
Thanks
Nasty little problem... I have been fooling around with this toy example:
>>> arr = np.array([['one', [1, 2, 3]],['two', [4, 5, 6]]], dtype=np.object)
>>> arr
array([['one', [1, 2, 3]],
       ['two', [4, 5, 6]]], dtype=object)
My first guess was:
>>> np.array(arr[:, 1])
array([[1, 2, 3], [4, 5, 6]], dtype=object)
But that keeps the object dtype, so perhaps then:
>>> np.array(arr[:, 1], dtype=np.float)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: setting an array element with a sequence.
You can normally work around this doing the following:
>>> np.array(arr[:, 1], dtype=[('', np.float)]*3).view(np.float).reshape(-1, 3)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: expected a readable buffer object
Not here though, which was kind of puzzling. Apparently it is the fact that the objects in your array are lists that throws this off, as replacing the lists with tuples works:
>>> np.array([tuple(j) for j in arr[:, 1]],
... dtype=[('', np.float)]*3).view(np.float).reshape(-1, 3)
array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.]])
Since there doesn't seem to be any entirely satisfactory solution, the easiest is probably to go with:
>>> np.array(list(arr[:, 1]), dtype=np.float)
array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.]])
That will not be very efficient, though; it is probably better to go with something like:
>>> np.fromiter((tuple(j) for j in arr[:, 1]), dtype=[('', np.float)]*3,
... count=len(arr)).view(np.float).reshape(-1, 3)
array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.]])
Based on Jaime's toy example I think you can do this very simply using np.vstack():
arr = np.array([['one', [1, 2, 3]],['two', [4, 5, 6]]], dtype=np.object)
float_arr = np.vstack(arr[:, 1]).astype(np.float)
This will work regardless of whether the 'numeric' elements in your object array are 1D numpy arrays, lists or tuples.
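One hedged compatibility note: on NumPy 1.24 and later, the np.object and np.float aliases used throughout these snippets have been removed, so the same calls would be written with the plain builtins:
arr = np.array([['one', [1, 2, 3]], ['two', [4, 5, 6]]], dtype=object)
float_arr = np.vstack(arr[:, 1]).astype(float)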
This works great on your array arr for converting from an object array to an array of floats. Number processing is extremely easy afterwards. Thanks for that last post! I just modified it to handle any DataFrame size:
float_arr = np.vstack(arr[:, :]).astype(np.float)
It is way faster to just convert your object array to a NumPy float array:
arr = np.array(arr, dtype=[('O', np.float)]).astype(np.float) - from there, no looping; index it just as you normally would on a NumPy array. You'd have to do it in chunks, though, with your different datatypes (arr[:, 1], arr[:, 2], etc.). I had the same issue with a NumPy tuple object returned from a C++ DLL function; converting 17M elements takes <2 s.
You may want to use a structured array, so that when you need to access the names and the values independently you can easily do so. In this example, there are two data points:
import numpy as np
x = np.zeros(2, dtype=[('name','S10'), ('value','f4',(3,))])
x[0][0] = 'item1'
x[1][0] = 'item2'
y1 = x['name']
y2 = x['value']
the result:
>>> y1
array(['item1', 'item2'],
      dtype='|S10')
>>> y2
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.]], dtype=float32)
See more details: http://docs.scipy.org/doc/numpy/user/basics.rec.html
This problem usually happens when you have a dataset with mixed types, usually dates in the first column or so.
What I usually do is store the date column in a separate variable and take the rest of the "X matrix of features" into X. So I have dates and X, for instance.
Then I apply the conversion to the X matrix as:
X = np.array(list(X[:,:]), dtype=np.float)
Hope this helps!
For structured arrays, use
from numpy.lib.recfunctions import structured_to_unstructured
structured_to_unstructured(arr).astype(np.float)
See: https://numpy.org/doc/stable/user/basics.rec.html#numpy.lib.recfunctions.structured_to_unstructured
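A minimal self-contained sketch of that approach (the field names here are made up for illustration):
import numpy as np
from numpy.lib.recfunctions import structured_to_unstructured

pts = np.array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)],
               dtype=[('x', 'f4'), ('y', 'f4'), ('z', 'f4')])
coords = structured_to_unstructured(pts).astype(float)  # plain (2, 3) float array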
np.array(list(arr), dtype=np.float) would work to convert all the elements in the array to float at once.

initialize a numpy array

Is there a way to initialize a numpy array of a given shape and add to it? I will explain what I need with a list example. If I want to create a list of objects generated in a loop, I can do:
a = []
for i in range(5):
    a.append(i)
I want to do something similar with a numpy array. I know about vstack, concatenate etc. However, it seems these require two numpy arrays as inputs. What I need is:
big_array = ... # Initially empty. This is where I don't know what to specify
for i in range(5):
    # array i of shape (2,4) is created
    # add it to big_array
big_array should end up with a shape of (10, 4). How can I do this?
EDIT:
I want to add the following clarification. I am aware that I can define big_array = numpy.zeros((10,4)) and then fill it up. However, this requires specifying the size of big_array in advance. I know the size in this case, but what if I do not? When we use the .append function for extending the list in python, we don't need to know its final size in advance. I am wondering if something similar exists for creating a bigger array from smaller arrays, starting with an empty array.
numpy.zeros
Return a new array of given shape and type, filled with zeros.
or
numpy.ones
Return a new array of given shape and type, filled with ones.
or
numpy.empty
Return a new array of given shape and type, without initializing entries.
However, the mentality of constructing an array by appending elements to a list is not much used in numpy, because it's less efficient (numpy datatypes are much closer to the underlying C arrays). Instead, you should preallocate the array to the size you need and then fill in the rows. You can use numpy.append if you must, though.
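A small sketch of the preallocate-then-fill pattern this answer recommends, using the (2, 4) chunks from the question (the chunk contents are just illustrative):
import numpy as np

big_array = np.zeros((10, 4))               # preallocate the full result
for i in range(5):
    chunk = np.full((2, 4), i)              # stand-in for the real (2, 4) array
    big_array[2*i:2*(i+1), :] = chunk       # fill rows in place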
The way I usually do that is by creating a regular list, then append my stuff into it, and finally transform the list to a numpy array as follows :
import numpy as np
big_array = [] # empty regular list
for i in range(5):
    arr = i*np.ones((2,4)) # for instance
    big_array.append(arr)
big_np_array = np.array(big_array) # a numpy array of shape (5, 2, 4); reshape(-1, 4) if (10, 4) is needed
Of course your final object takes twice the space in memory during the creation step, but appending to a Python list is very fast, and so is creation with np.array().
Introduced in numpy 1.8:
numpy.full
Return a new array of given shape and type, filled with fill_value.
Examples:
>>> import numpy as np
>>> np.full((2, 2), np.inf)
array([[ inf,  inf],
       [ inf,  inf]])
>>> np.full((2, 2), 10)
array([[10, 10],
       [10, 10]])
The array analogue of Python's
a = []
for i in range(5):
    a.append(i)
is:
import numpy as np
a = np.empty((0))
for i in range(5):
    a = np.append(a, i)
You do want to avoid explicit loops as much as possible when doing array computing, as that reduces the speed gain from that form of computing. There are multiple ways to initialize a numpy array. If you want it filled with zeros, do as katrielalex said:
big_array = numpy.zeros((10,4))
EDIT: What sort of sequence is it you're making? You should check out the different numpy functions that create arrays, like numpy.linspace(start, stop, num) (equally spaced numbers) or numpy.arange(start, stop, step). Where possible, these functions will make arrays substantially faster than doing the same work in explicit loops.
To initialize a numpy array with a specific matrix:
import numpy as np
mat = np.array([[1, 1, 0, 0, 0],
                [0, 1, 0, 0, 1],
                [1, 0, 0, 1, 1],
                [0, 0, 0, 0, 0],
                [1, 0, 1, 0, 1]])
print(mat.shape)
print(mat)
output:
(5, 5)
[[1 1 0 0 0]
 [0 1 0 0 1]
 [1 0 0 1 1]
 [0 0 0 0 0]
 [1 0 1 0 1]]
For your first array example use,
a = numpy.arange(5)
To initialize big_array, use
big_array = numpy.zeros((10,4))
This assumes you want to initialize with zeros, which is pretty typical, but there are many other ways to initialize an array in numpy.
Edit:
If you don't know the size of big_array in advance, it's generally best to first build a Python list using append, and when you have everything collected in the list, convert this list to a numpy array using numpy.array(mylist). The reason for this is that lists are meant to grow very efficiently and quickly, whereas numpy.concatenate would be very inefficient since numpy arrays don't change size easily. But once everything is collected in a list, and you know the final array size, a numpy array can be efficiently constructed.
numpy.fromiter() is what you are looking for:
big_array = numpy.fromiter(range(5), dtype="int")
It also works with generator expressions, e.g.:
big_array = numpy.fromiter((i*(i+1)//2 for i in range(5)), dtype="int")
If you know the length of the array in advance, you can specify it with the optional 'count' argument.
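For instance, with the generator above, passing count might look like this (a sketch; count just lets fromiter preallocate the output):
big_array = numpy.fromiter((i*(i+1)//2 for i in range(5)), dtype="int", count=5)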
I realize that this is a bit late, but I did not notice any of the other answers mentioning indexing into the empty array:
big_array = numpy.empty((10, 4))
for i in range(5):
    array_i = numpy.random.random((2, 4))
    big_array[2 * i:2 * (i + 1), :] = array_i
This way, you preallocate the entire result array with numpy.empty and fill in the rows as you go using indexed assignment.
It is perfectly safe to preallocate with empty instead of zeros in the example you gave since you are guaranteeing that the entire array will be filled with the chunks you generate.
I'd suggest defining the shape first.
Then iterate over it to insert values.
big_array = np.zeros(shape=(6, 2))
for it in range(6):
    big_array[it] = (it, it) # For example
>>> big_array
array([[ 0.,  0.],
       [ 1.,  1.],
       [ 2.,  2.],
       [ 3.,  3.],
       [ 4.,  4.],
       [ 5.,  5.]])
Whenever you are in the following situation:
a = []
for i in range(5):
    a.append(i)
and you want something similar in numpy, several previous answers have pointed out ways to do it, but as #katrielalex pointed out these methods are not efficient. The efficient way is to build one long Python list and then reshape it the way you want once everything is collected. For example, let's say I am reading some lines from a file, each row has a list of numbers, and I want to build a numpy array of shape (number of lines read, length of vector in each row). Here is how I would do it more efficiently:
long_list = []
counter = 0
with open('filename', 'r') as f:
    for row in f:
        row_list = row.split()
        long_list.extend(row_list)
        counter += 1
# now we have a long list and we are ready to reshape
# (dtype=float converts the string tokens from split() into numbers)
result = np.array(long_list, dtype=float).reshape(counter, len(row_list)) # desired numpy array
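As an aside, if the file really is just whitespace-separated numbers, np.loadtxt should produce the same 2D float array in one call:
result = np.loadtxt('filename')   # shape (number of lines, numbers per line)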
Maybe something like this will fit your needs:
import numpy as np
N = 5
res = []
for i in range(N):
    res.append(np.cumsum(np.ones(shape=(2,4))))
res = np.array(res).reshape((10, 4))
print(res)
Which produces the following output
[[ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]
 [ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]
 [ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]
 [ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]
 [ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]]
If you want to add your items to a multi-dimensional array, here is one solution:
import numpy as np
big_array = np.empty((0, 2, 4)) # empty: length 0 along the first axis, with 2x4 slices
for i in range(5):
    arr = i * np.ones((1, 2, 4)) # for instance
    big_array = np.concatenate((big_array, arr))
See the official numpy documentation for reference.
# https://thispointer.com/create-an-empty-2d-numpy-array-matrix-and-append-rows-or-columns-in-python/
# Create an empty Numpy array with 4 columns or 0 rows
empty_array = np.empty((0, 4), int)
# Append a row to the 2D numpy array
empty_array = np.append(empty_array, np.array([[11, 21, 31, 41]]), axis=0)
# Append 2nd rows to the 2D Numpy array
empty_array = np.append(empty_array, np.array([[15, 25, 35, 45]]), axis=0)
print('2D Numpy array:')
print(empty_array)
Note that each np.array passed in is 2-dimensional.
