NumPy native way to store dictionary of arrays - python

I'm looking to see if there is a more efficient way (i.e. using native NumPy functionality) to achieve what I'm doing currently.
My process is I start with an array a:
a = np.array([[0,2,0,-1],[-0.2,0,-0.1,0],[0,0,-0.1,0],[0,0,0,0]])
array([[ 0. , 2. , 0. , -1. ],
[-0.2, 0. , -0.1, 0. ],
[ 0. , 0. , -0.1, 0. ],
[ 0. , 0. , 0. , 0. ]])
I then filter based on where the values are not equal to 0:
r_indices, c_indicies = np.where(a != 0)
(array([0, 0, 1, 1, 2]), array([1, 3, 0, 2, 2]))
From there, I create a Python dictionary b like so:
b = {i: c_indices[r_indices == i] for i in np.unique(r_indices)}
{
0: array([1, 3]),
1: array([0, 2]),
2: array([2])},
}
I do this because I want to know for a given unique row index r, which column indices are not 0.
My own preference is to try to use NumPy as much as possible to take advantage of speed benefits. However, I'm not sure how else to structure this in NumPy since the values in the dictionary could range from a length of 0 (no values are not zero) to 4 (all values are not zero).
Am I being paranoid about the potential speed benefits?

You can use Pandas in the following way:
import pandas as pd
import numpy as np
if __name__=='__main__':
a = np.array([[0, 2, 0, -1], [-0.2, 0, -0.1, 0], [0, 0, -0.1, 0], [0, 0, 0, 0]])
rows, cols = np.where(a !=0)
x = list(zip(rows, cols))
df = pd.DataFrame.from_records(data=x)
l = df.groupby(0)[1].apply(list)
L = [np.array(a) for a in l.values]
d = dict(zip(np.unique(rows), L))
Output
{0: array([1, 3]), 1: array([0, 2]), 2: array([2])}
As pandas works with numpy under the hood, this code will be much more efficient than the regular list comprehension.
Also, if all you need is a dictionary-like object - you could inhance the performance further by using the l Pandas.GroupBy as:
l.loc[0]
which will result in :
[1, 3]
which is equivalent to the b[0] in your example.
and omitting the last two lines altogether, as Pandas provide a very fast mechanisms for handling large amounts of tabular data, and generally preferable to a plain dict object, if they used for the same thing.
Cheers.

Related

Assign zeros to minimum values in numpy 3d array

I have a numpy array of shape (100, 100, 20) (in python 3)
I want to find for each 'pixel' the 15 channels with minimum values, and make them zeros (meaning: make the array sparse, keep only the 5 highest values).
Example:
input: array = [[1,2,3], [7,6,9], [12,71,3]], num_channles_to_zero = 2
output: [[0,0,3], [0,0,9], [0,71,0]]
How can I do it?
what I have for now:
array = numpy.random.rand(100, 100, 20)
inds = numpy.argsort(array, axis=-1) # also shape (100, 100, 20)
I want to do something like
array[..., inds[..., :15]] = 0
but it doesn't give me what I want
np.argsort outputs indices suitable for the [...]_along_axis functions of numpy. This includes np.put_along_axis:
import numpy as np
array = np.random.rand(100, 100, 20)
print(array[0,0])
#[0.44116124 0.94656705 0.20833932 0.29239585 0.33001399 0.82396784
# 0.35841905 0.20670957 0.41473762 0.01568006 0.1435386 0.75231818
# 0.5532527 0.69366173 0.17247832 0.28939985 0.95098187 0.63648877
# 0.90629116 0.35841627]
inds = np.argsort(array, axis=-1)
np.put_along_axis(array, inds[..., :15], 0, axis=-1)
print(array[0,0])
#[0. 0.94656705 0. 0. 0. 0.82396784
# 0. 0. 0. 0. 0. 0.75231818
# 0. 0. 0. 0. 0.95098187 0.
# 0.90629116 0. ]
As it mentioned in the numpy documentation
From each row, a specific element should be selected. The row index is just [0, 1, 2] and the column index specifies the element to choose for the corresponding row, here [0, 1, 0]. Using both together the task can be solved using advanced indexing:
>>>x = np.array([[1, 2], [3, 4], [5, 6]])
>>>x[[0, 1, 2], [0, 1, 0]]
array([1, 4, 5])
So, for your example:
a = np.array([[1,2,3], [7,6,9], [12,71,3]])
amax = a.argmax(axis=-1)
a[np.arange(a.shape[0]), amax] = 0
a
array([[ 1, 2, 0],
[ 7, 6, 0],
[12, 0, 3]])

Updating a numpy array values based on multiple conditions

I have an array P as shown below:
P
array([[ 0.49530662, 0.32619367, 0.54593724, -0.0224462 ],
[-0.10503237, 0.48607405, 0.28572714, 0.15175049],
[ 0.0286128 , -0.32407902, -0.56598029, -0.26743756],
[ 0.14353725, -0.35624814, 0.25655861, -0.09241335]])
and a vector y:
y
array([0, 0, 1, 0], dtype=int16)
I want to modify another matrix Z which has the same dimension as P, such that Z_ij = y_j when Z_ij < 0.
In the above example, my Z matrix should be
Z = array([[-, -, -, 0],
[0, -, -, -],
[-, 0, 1, 0],
[-, 0, -, 0]])
Where '-' indicates the original Z values. What I thought about is very straightforward implementation which basically iterates through each row of Z and comparing the column values against corresponding Y and P. Do you know any better pythonic/numpy approach?
What you need is np.where. This is how to use it:-
import numpy as np
z = np.array([[ 0.49530662, 0.32619367, 0.54593724, -0.0224462 ],
[-0.10503237, 0.48607405, 0.28572714, 0.15175049],
[ 0.0286128 , -0.32407902, -0.56598029, -0.26743756],
[ 0.14353725, -0.35624814, 0.25655861, -0.09241335]])
y=([0, 0, 1, 0])
result = np.where(z<0,y,z)
#Where z<0, replace it by y
Result
>>> print(result)
[[0.49530662 0.32619367 0.54593724 0. ]
[0. 0.48607405 0.28572714 0.15175049]
[0.0286128 0. 1. 0. ]
[0.14353725 0. 0.25655861 0. ]]

Building NumPy array using values from another array

Consider the following code:
import numpy as np
index_info = np.matrix([[1, 1], [1, 2]])
value = np.matrix([[0.5, 0.5]])
initial = np.zeros((3, 3))
How can I produce a matrix, final, which has the structure of initial with the elements specified by value at the locations specified by index_info WITHOUT a for loop? In this toy example, see below.
final = np.matrix([[0, 0, 0], [0, 0.5, 0.5], [0, 0, 0]])
With a for loop, you can easily loop through all of the index's in index_info and value and use that to populate initial and form final. But is there a way to do so with vectorization (no for loop)?
Convert index_info to a tuple and use it to assign:
>>> initial[(*index_info,)]=value
>>> initial
array([[0. , 0. , 0. ],
[0. , 0.5, 0.5],
[0. , 0. , 0. ]])
Please note that use of the matrix class is discouraged. Use ndarray instead.
You can do this with NumPy's array indexing:
>>> initial = np.zeros((3, 3))
>>> row = np.array([1, 1])
>>> col = np.array([1, 2])
>>> final = np.zeros_like(initial)
>>> final[row, col] = [0.5, 0.5]
>>> final
array([[0. , 0. , 0. ],
[0. , 0.5, 0.5],
[0. , 0. , 0. ]])
This is similar to #PaulPanzer's answer, where he is unpacking row and col from index_info all in one step. In other words:
row, col = (*index_info,)

Interpolate 2D matrix along columns using Python

I am trying to interpolate a 2D numpy matrix with the dimensions (5, 3) to a matrix with the dimensions (7, 3) along the axis 1 (columns). Obviously, the wrong approach would be to randomly insert rows anywhere between the original matrix, see the following example:
Source:
[[0, 1, 1]
[0, 2, 0]
[0, 3, 1]
[0, 4, 0]
[0, 5, 1]]
Target (terrible interpolation -> not wanted!):
[[0, 1, 1]
[0, 1.5, 0.5]
[0, 2, 0]
[0, 3, 1]
[0, 3.5, 0.5]
[0, 4, 0]
[0, 5, 1]]
The correct approach would be to take every row into account and interpolate between all of them to expand the source matrix to a (7, 3) matrix. I am aware of the scipy.interpolate.interp1d or scipy.interpolate.interp2d methods, but could not get it to work with other Stack Overflow posts or websites. I hope to receive any type of tips or tricks.
Update #1: The expected values should be equally spaced.
Update #2:
What I want to do is basically use the separate columns of the original matrix, expand the length of the column to 7 and interpolate between the values of the original column. See the following example:
Source:
[[0, 1, 1]
[0, 2, 0]
[0, 3, 1]
[0, 4, 0]
[0, 5, 1]]
Split into 3 separate Columns:
[0 [1 [1
0 2 0
0 3 1
0 4 0
0] 5] 1]
Expand length to 7 and interpolate between them, example for second column:
[1
1.66
2.33
3
3.66
4.33
5]
It seems like each column can be treated completely independently, but for each column you need to define essentially an "x" coordinate so that you can fit some function "f(x)" from which you generate your output matrix.
Unless the rows in your matrix are associated with some other datastructure (e.g. a vector of timestamps), an obvious set of x values is just the row-number:
x = numpy.arange(0, Source.shape[0])
You can then construct an interpolating function:
fit = scipy.interpolate.interp1d(x, Source, axis=0)
and use that to construct your output matrix:
Target = fit(numpy.linspace(0, Source.shape[0]-1, 7)
which produces:
array([[ 0. , 1. , 1. ],
[ 0. , 1.66666667, 0.33333333],
[ 0. , 2.33333333, 0.33333333],
[ 0. , 3. , 1. ],
[ 0. , 3.66666667, 0.33333333],
[ 0. , 4.33333333, 0.33333333],
[ 0. , 5. , 1. ]])
By default, scipy.interpolate.interp1d uses piecewise-linear interpolation. There are many more exotic options within scipy.interpolate, based on higher order polynomials, etc. Interpolation is a big topic in itself, and unless the rows of your matrix have some particular properties (e.g. being regular samples of a signal with a known frequency range), there may be no "truly correct" way of interpolating. So, to some extent, the choice of interpolation scheme will be somewhat arbitrary.
You can do this as follows:
from scipy.interpolate import interp1d
import numpy as np
a = np.array([[0, 1, 1],
[0, 2, 0],
[0, 3, 1],
[0, 4, 0],
[0, 5, 1]])
x = np.array(range(a.shape[0]))
# define new x range, we need 7 equally spaced values
xnew = np.linspace(x.min(), x.max(), 7)
# apply the interpolation to each column
f = interp1d(x, a, axis=0)
# get final result
print(f(xnew))
This will print
[[ 0. 1. 1. ]
[ 0. 1.66666667 0.33333333]
[ 0. 2.33333333 0.33333333]
[ 0. 3. 1. ]
[ 0. 3.66666667 0.33333333]
[ 0. 4.33333333 0.33333333]
[ 0. 5. 1. ]]

scipy csr_matrix from several vectors represented as list of sets

I have several sparse vectors represented as lists of tuples eg.
[[(22357, 0.6265631775164965),
(31265, 0.3900572375543419),
(44744, 0.4075397480094991),
(47751, 0.5377595092643747)],
[(22354, 0.6265631775164965),
(31261, 0.3900572375543419),
(42344, 0.4075397480094991),
(47751, 0.5377595092643747)],
...
]
And my goal is to compose scipy.sparse.csr_matrix from several millions of vectors like this.
I would like to ask if there exists some simple elegant solution for this kind of conversion without trying to stuck everything to memory.
EDIT:
Just a clarification: My goal is to build the 2d matrix, where each of my sparse vectors represent one row in matrix.
Collecting indices,data into a structured array avoids the integer-double conversion issue. It is also a bit faster than the vstack approach (in limited testing) (With list data like this np.array is faster than np.vstack.)
indptr = np.cumsum([0]+[len(i) for i in vectors])
aa = np.array(vectors,dtype='i,f').flatten()
A = sparse.csr_matrix((aa['f1'], aa['f0'], indptr))
I substituted the list comprehension for map since I'm using Python3.
Indicies in the coo format (data, (i,j)) might be more intuitive
ii = [[i]*len(v) for i,v in enumerate(vectors)])
ii = np.array(ii).flatten()
aa = np.array(vectors,dtype='i,f').flatten()
A2 = sparse.coo_matrix((aa['f1'],(np.array(ii), aa['f0'])))
# A2.tocsr()
Here, ii from the 1st step is the row numbers for each sublist.
[[0, 0, 0, 0],
[1, 1, 1, 1],
[2, 2, 2, 2],
[3, 3, 3, 3],
...]]
This construction method is slower than the csr direct indptr.
For a case where there are differing numbers of entries per row, this approach works (using intertools.chain to flatten lists):
A sample list (no empty rows for now):
In [779]: vectors=[[(1, .12),(3, .234),(6,1.23)],
[(2,.222)],
[(2,.23),(1,.34)]]
row indexes:
In [780]: ii=[[i]*len(v) for i,v in enumerate(vectors)]
In [781]: ii=list(chain(*ii))
column and data values pulled from tuples and flattened
In [782]: jj=[j for j,_ in chain(*vectors)]
In [783]: data=[d for _,d in chain(*vectors)]
In [784]: ii
Out[784]: [0, 0, 0, 1, 2, 2]
In [785]: jj
Out[785]: [1, 3, 6, 2, 2, 1]
In [786]: data
Out[786]: [0.12, 0.234, 1.23, 0.222, 0.23, 0.34]
In [787]: A=sparse.csr_matrix((data,(ii,jj))) # coo style input
In [788]: A.A
Out[788]:
array([[ 0. , 0.12 , 0. , 0.234, 0. , 0. , 1.23 ],
[ 0. , 0. , 0.222, 0. , 0. , 0. , 0. ],
[ 0. , 0.34 , 0.23 , 0. , 0. , 0. , 0. ]])
Consider the following:
import numpy as np
from scipy.sparse import csr_matrix
vectors = [[(22357, 0.6265631775164965),
(31265, 0.3900572375543419),
(44744, 0.4075397480094991),
(47751, 0.5377595092643747)],
[(22354, 0.6265631775164965),
(31261, 0.3900572375543419),
(42344, 0.4075397480094991),
(47751, 0.5377595092643747)]]
indptr = np.cumsum([0] + map(len, vectors))
indices, data = np.vstack(vectors).T
A = csr_matrix((data, indices.astype(int), indptr))
Unfortunately, this way the column indices are converted from integers to doubles and back. This works correctly for up to very large matrices, but is not ideal.

Categories

Resources