import numpy as np
import pandas as pd
catVar = np.array(list('abcbca'))              # categorical independent variable
groupIDs = np.array([10, 10, 20, 20, 30, 30])  # groups (strata)
p = np.array([0.5, 0.5, 0.25, 0.75, 1, 0])     # 'probabilities'
dummies = pd.get_dummies(catVar)               # one-hot encode catVar (columns a, b, c)
_, idx, tags = np.unique(groupIDs, return_index=True, return_inverse=True)
np.add.reduceat((p * dummies.T).T, idx)[tags]  # per-group column sums of the p-scaled dummies, broadcast back to rows
[[ 0.5   0.5   0.  ]
 [ 0.5   0.5   0.  ]
 [ 0.    0.75  0.25]
 [ 0.    0.75  0.25]
 [ 0.    0.    1.  ]
 [ 0.    0.    1.  ]]
In the last two lines of code, I am creating a new table with the per-group sum of products between p and the dummies for each column. Because my data set is roughly 500k x 4k, this calculation takes quite some time, which I am trying to reduce. My question is whether it would be possible to get the same result when I define my dummies as a sparse matrix,
from scipy import sparse
dumSp = sparse.csc_matrix(dummies)
and whether the output of the above calculation could go directly to a sparse matrix as well.
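One way to do both (a sketch, not benchmarked): keep the dummies in CSR format, scale the rows with a sparse diagonal matrix, and replace reduceat with a sparse group-indicator matrix G, so the result stays sparse end to end.

import numpy as np
import pandas as pd
from scipy import sparse

catVar = np.array(list('abcbca'))
groupIDs = np.array([10, 10, 20, 20, 30, 30])
p = np.array([0.5, 0.5, 0.25, 0.75, 1, 0])

dumSp = sparse.csr_matrix(pd.get_dummies(catVar).to_numpy())
weighted = sparse.diags(p) @ dumSp  # row i of the dummies scaled by p[i], still sparse

# group-indicator matrix G: G[i, j] = 1 if row i belongs to group j
_, tags = np.unique(groupIDs, return_inverse=True)
n, g = len(groupIDs), tags.max() + 1
G = sparse.csr_matrix((np.ones(n), (np.arange(n), tags)), shape=(n, g))

result = G @ (G.T @ weighted)  # per-group column sums, broadcast back to the rows
print(result.toarray())        # matches the dense output above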
I have two HxW matrices A and B. I'd like to get an NxHxW matrix C such that C[0]=A, C[-1]=B, and each of the remaining N-2 slices are linearly interpolated between A and B. Is there a single numpy function I can do this with, without needing a for loop?
Just use linspace if you are looking for linear interpolation between just 2 points.
A = np.array([[0, 1],
              [2, 3]])
B = np.array([[ 1,  3],
              [-1, -2]])
C = np.linspace(A, B, 4)  # <- change 4 to N; the endpoints count toward it, giving N-2 interpolated slices between the two points
C
array([[[ 0.        ,  1.        ],    # <-- A matrix is C[0]
        [ 2.        ,  3.        ]],

       [[ 0.33333333,  1.66666667],    # <-- the middle slices hold
        [ 1.        ,  1.33333333]],   #     elementwise equally

       [[ 0.66666667,  2.33333333],    #     spaced values
        [ 0.        , -0.33333333]],

       [[ 1.        ,  3.        ],    # <-- B matrix is C[-1]
        [-1.        , -2.        ]]])
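So for the general case in the question, pass N as the third argument (np.linspace has accepted array endpoints like this since NumPy 1.16):

N = 6
C = np.linspace(A, B, N)  # C.shape == (N, H, W); C[0] is A, C[-1] is B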
I have an arbitrary row vector "u" and an arbitrary matrix "e" as follows:
u = np.resize(np.array([8, 3]), [1, 2])                         # shape (1, 2): 1 variable, 2 observations
e = np.resize(np.array([[2, 2, 5, 5], [1, 6, 7, 4]]), [4, 2])   # shape (4, 2): 4 variables, 2 observations
np.cov(u,e)
array([[ 12.5,   0. ,   0. , -12.5,   7.5],
       [  0. ,   0. ,   0. ,   0. ,   0. ],
       [  0. ,   0. ,   0. ,   0. ,   0. ],
       [-12.5,   0. ,   0. ,  12.5,  -7.5],
       [  7.5,   0. ,   0. ,  -7.5,   4.5]])
The matrix that this returns is 5x5. This is confusing to me because the largest dimension of the inputs is only 4.
Thus, this may be less of a numpy question and more of a math question...not sure...
Please refer to the official numpy documentation (https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.cov.html) and check whether your usage of the numpy.cov function is consistent with what you are trying to achieve.
When looking at the signature
numpy.cov(m, y=None, rowvar=True, bias=False, ddof=None, fweights=None, aweights=None)
m : array_like
A 1-D or 2-D array containing multiple variables and observations.
Each row of m represents a variable, and each column a single observation of all those variables. Also see rowvar below.
y : array_like, optional
An additional set of variables and observations. y has the same form as that of m.
Note how m and y are combined as shown in the last example on the page
>>> x = [-2.1, -1, 4.3]
>>> y = [3, 1.1, 0.12]
>>> X = np.stack((x, y), axis=0)
>>> print(np.cov(X))
[[ 11.71       -4.286     ]
 [ -4.286       2.14413333]]
>>> print(np.cov(x, y))
[[ 11.71       -4.286     ]
 [ -4.286       2.14413333]]
>>> print(np.cov(x))
11.71
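In your example, u supplies one variable (one row) and e supplies four, so np.cov stacks them into five variables with two observations each, which is why the result is 5x5. A quick check, reusing u and e from the question:

stacked = np.vstack((u, e))                        # shape (5, 2): 5 variables, 2 observations
print(np.allclose(np.cov(u, e), np.cov(stacked)))  # True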
I have a square numpy matrix of 0s and 1s, and I have to do different operations depending on the column.
If a column contains all 0s, I have to replace those 0s with 1/number_of_columns (I use matrix.shape[1]); otherwise (if the column doesn't contain all 0s), I have to divide each element by the sum of the column.
In essence, after these operations the sum of each column must be 1.
I tried this, but I get an error in the third line: the index returns a 3-dimensional structure.
a=numpy.nonzero(out_degree)
b=numpy.where(out_degree==0)
graph[:,b]=1/graph.shape[0]
graph[:,a]=graph/out_degree
graph is the numpy matrix and out_degree is a vector that contains the sum of each column.
I have to use numpy without loops to save time.
A start would be:
import numpy as np
np.random.seed(1)
M, N = 5, 4
a = np.random.choice([0, 1, 2], size=(M, N), p=[0.6, 0.2, 0.2]).astype(float)
print(a)
a_inds = np.where(~a.any(axis=0))[0]                             # columns that are all zero
b_inds = np.setdiff1d(np.arange(N), a_inds, assume_unique=True)  # the remaining columns
b_col_sums = np.sum(a[:, b_inds], axis=0)
a[:, a_inds] = 1 / N           # fill the all-zero columns with the constant
a[:, b_inds] /= b_col_sums     # normalize the other columns by their sums
print(a)
Output:
[[ 0.  1.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  1.]
 [ 0.  2.  0.  1.]
 [ 0.  0.  0.  0.]]
[[ 0.25  0.33333333  0.25  0.  ]
 [ 0.25  0.          0.25  0.  ]
 [ 0.25  0.          0.25  0.5 ]
 [ 0.25  0.66666667  0.25  0.5 ]
 [ 0.25  0.          0.25  0.  ]]
This should be easy to read and of medium performance. It's probably not the fastest approach because of all the fancy indexing.
It also does not check for other problematic divide-by-zero cases (not part of your specification)!
Edit: OP is only interested in square arrays, so the following is to be ignored!
You state "In essence, after these operations the sum of each column must be 1" but give the operation "replace these 0 with 1/number_of_the_columns", which is a contradiction: a column of M entries, each equal to 1/N, sums to M/N, not 1. Maybe you need to replace N with M in a[:, a_inds] = 1 / N.
Then you obtain:
[[ 0.2  0.33333333  0.2  0. ]
 [ 0.2  0.          0.2  0. ]
 [ 0.2  0.          0.2  0.5]
 [ 0.2  0.66666667  0.2  0.5]
 [ 0.2  0.          0.2  0. ]]
Alternatively, you can loop over the columns, dividing each by its sum when it has nonzero elements and filling it with the constant otherwise:
for col in range(a.shape[1]):
    if np.any(a[:, col]):
        a[:, col] /= np.sum(a[:, col])
    else:
        a[:, col] = 1 / a.shape[1]
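For completeness, here is a sketch of a fully vectorized variant without fancy indexing (normalize_columns is just an illustrative name; it uses the number of rows for the fill constant, which for a square matrix equals the number of columns and makes every column sum to 1):

def normalize_columns(a):
    col_sums = a.sum(axis=0)
    zero_cols = col_sums == 0
    safe_sums = np.where(zero_cols, 1, col_sums)  # placeholder 1 avoids division by zero
    return np.where(zero_cols, 1.0 / a.shape[0], a / safe_sums)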
I have two lists, one with indexes (distinct, obviously) and one with values. Is it possible to convert them into a fixed-size numpy array efficiently?
indexes = [1,2,6,7]
values = [0.2,0.5,0.6,0.2]
size = 10
What I want to output is:
print(magic_func(indexes,values,size))
array([0, 0.2, 0.5, 0, 0,
       0, 0.6, 0.2, 0, 0])
It's easy in two lines, if you want:
In [1]: import numpy as np
In [2]: arr = np.zeros(size)
In [3]: arr[indexes] = values
In [4]: arr
Out[4]: array([ 0. , 0.2, 0.5, 0. , 0. , 0. , 0.6, 0.2, 0. , 0. ])
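Wrapped up as the magic_func from your question:

In [5]: def magic_func(indexes, values, size):
   ...:     arr = np.zeros(size)     # start from all zeros
   ...:     arr[indexes] = values    # scatter the values at the given positions
   ...:     return arr
In [6]: magic_func(indexes, values, size)
Out[6]: array([ 0. ,  0.2,  0.5,  0. ,  0. ,  0. ,  0.6,  0.2,  0. ,  0. ])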
I am trying to build the following matrix in Python without using a for loop:
A
[[ 0.1  0.2  0.   0.   0. ]
 [ 1.   2.   3.   0.   0. ]
 [ 0.   1.   2.   3.   0. ]
 [ 0.   0.   1.   2.   3. ]
 [ 0.   0.   0.   4.   5. ]]
I tried the fill_diagonal method in NumPy (see matrix B below) but it does not give me the same matrix as shown in matrix A:
B
[[ 1.   0.2  0.   0.   0. ]
 [ 0.   2.   0.   0.   0. ]
 [ 0.   0.   3.   0.   0. ]
 [ 0.   0.   0.   1.   0. ]
 [ 0.   0.   0.   4.   5. ]]
Here is the Python code that I used to construct the matrices:
import numpy as np
import scipy.linalg as sp # maybe use scipy to build diagonal matrix?
#---- build diagonal square array using "for" loop
m = 5
A = np.zeros((m, m))
A[0, 0] = 0.1
A[0, 1] = 0.2
for i in range(1, m-1):
    A[i, i-1] = 1  # subdiagonal
    A[i, i]   = 2  # main diagonal
    A[i, i+1] = 3  # superdiagonal
A[m-1, m-2] = 4
A[m-1, m-1] = 5
print('A \n', A)
#---- build diagonal square array without loop
B = np.zeros((m, m))
B[0, 0] = 0.1
B[0, 1] = 0.2
np.fill_diagonal(B, [1, 2, 3])
B[m-1, m-2] = 4
B[m-1, m-1] = 5
print('B \n', B)
So is there a way to construct a diagonal matrix like the one shown by matrix A without using a for loop?
There are functions for this in scipy.sparse, e.g.:
from scipy.sparse import diags
# constant values 1, 2, 3 along offsets -1 (sub-), 0 (main) and 1 (superdiagonal)
C = diags([1, 2, 3], [-1, 0, 1], shape=(5, 5), dtype=float)
C = C.toarray()
C[0, 0] = 0.1
C[0, 1] = 0.2
C[-1, -2] = 4
C[-1, -1] = 5
Diagonal matrices are generally very sparse, so you could also keep the result as a sparse matrix; depending on the application, this can have large efficiency benefits.
How much sparse matrices gain you depends very much on matrix size. For a 5x5 array it hardly matters, but for larger matrices creating the array can be a lot faster with sparse matrices, as illustrated by the following example with an identity matrix:
%timeit np.eye(3000)
# 100 loops, best of 3: 3.12 ms per loop
%timeit sparse.eye(3000)
# 10000 loops, best of 3: 79.5 µs per loop
But the real strength of the sparse matrix data type is shown when you need to do mathematical operations on arrays that are sparse:
%timeit np.eye(3000).dot(np.eye(3000))
# 1 loops, best of 3: 2.8 s per loop
%timeit sparse.eye(3000).dot(sparse.eye(3000))
# 1000 loops, best of 3: 1.11 ms per loop
Or when you need to work with some very large but sparse array:
np.eye(1E6)
# ValueError: array is too big.
sparse.eye(1E6)
# <1000000x1000000 sparse matrix of type '<type 'numpy.float64'>'
# with 1000000 stored elements (1 diagonals) in DIAgonal format>
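If you'd rather stay in plain NumPy for the small dense case, np.diag with its offset argument gives an equivalent construction (a sketch building the same matrix A as above):

m = 5
C = (np.diag(np.full(m - 1, 1.0), -1)    # subdiagonal
     + np.diag(np.full(m, 2.0))          # main diagonal
     + np.diag(np.full(m - 1, 3.0), 1))  # superdiagonal
C[0, :2] = [0.1, 0.2]
C[-1, -2:] = [4, 5]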
Notice that the number of zeros between consecutive runs of nonzero values is always 3 (or some other constant, whenever you want a banded matrix like this):
In [10]:
import numpy as np
A1 = [0.1, 0.2]
A2 = [1, 2, 3]
A3 = [4, 5]
SPC = [0, 0, 0]  # spacing zeros (or use np.zeros(3))
np.hstack((A1,SPC,A2,SPC,A2,SPC,A2,SPC,A3)).reshape(5,5)
Out[10]:
array([[ 0.1,  0.2,  0. ,  0. ,  0. ],
       [ 1. ,  2. ,  3. ,  0. ,  0. ],
       [ 0. ,  1. ,  2. ,  3. ,  0. ],
       [ 0. ,  0. ,  1. ,  2. ,  3. ],
       [ 0. ,  0. ,  0. ,  4. ,  5. ]])
In [11]:
import itertools  # a more general way of doing it
np.hstack(list(itertools.chain(*[(item, SPC) for item in [A1, A2, A2, A2, A3]]))[:-1]).reshape(5,5)
Out[11]:
array([[ 0.1,  0.2,  0. ,  0. ,  0. ],
       [ 1. ,  2. ,  3. ,  0. ,  0. ],
       [ 0. ,  1. ,  2. ,  3. ,  0. ],
       [ 0. ,  0. ,  1. ,  2. ,  3. ],
       [ 0. ,  0. ,  0. ,  4. ,  5. ]])