PYMC / Theano memory usage with sparse matrices

PYMC / Theano memory usage with sparse matrices - python

I am using the latest version of PYMC3 / Theano to compute the MAP estimate for the following simple logistic regression model:
w ~ N(0, 1)
y ~ Bin(σ(Xw))
My features are strictly categorical, and I am using one-hot encoding to transform the design matrix into a sparse (CSR) matrix of 0's and 1's. The sparse matrix occupies roughly 15MB, but I am seeing memory spikes up to 4GB. Incidentally, this is the size of a dense matrix with the same dimensions.
Below is a simplified version of my code,
from __future__ import division, print_function
import numpy as np
import pymc3 as pm
from scipy import sparse as sp
from scipy.optimize import fmin_l_bfgs_b
from scipy.special import expit
from theano import sparse as S
np.random.seed(0)
# properties of sparse design matrix (taken from the real data)
N = 100000 # number of samples
M = 5000 # number of dimensions
D = 0.002 # matrix density
# fake data
mu0, sd0 = 0.0, 1.0
w = np.random.normal(mu0, sd0, M)
X = sp.random(N, M, density=D, format='csr', data_rvs=np.ones)
y = np.random.binomial(1, expit(X.dot(w)), N)
# estimate memory usage
size = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes + y.nbytes
print('{:.2f} MB of data'.format(size / 1024 ** 2))
# model definition
with pm.Model() as model:
w = pm.Normal('w', mu0, sd=sd0, shape=M)
p = pm.sigmoid(S.dot(X, w.reshape((-1, 1)))).flatten()
pm.Bernoulli('y', p, observed=y)
print(pm.find_MAP(vars=[w], fmin=fmin_l_bfgs_b))
which produces the following memory profile:
Even if the L-BFGS optimizer computed the full dense Hessian (which it doesn't), that would take up only 190MB. I suspect that somewhere along the way, X is automatically converted into a Numpy array to implement some Theano op.
Has anyone successfully worked with sparse matrices in PYMC3? Any ideas where exactly the problem could lie?

Related

Import large .tiff file as sparse matrix

I have a large .tiff file (4.4gB, 79530 x 54980 values) with 1 band. Since only 16% of the values are valid, I was thinking it's better to import the file as sparse matrix, to save RAM. When I first open it as np.array and then transform it into a sparse matrix using csr_matrix(), my kernel already crashes. See code below.
from osgeo import gdal
import numpy as np
from scipy.sparse import csr_matrix
ds = gdal.Open("file.tif")
band = ds.GetRasterBand(1)
array = np.array(band.ReadAsArray())
csr_matrix(array)
Is there a better way to work with this file? In the end I have to make calculations based on the values in the raster. (Unfortunately, due to confidentiality, I cannot attach the relevant file.)

Can you tell where the crash occurs?
band = ds.GetRasterBand(1)
temp = band.ReadAsArray()
array = np.array(temp) # if temp is already an array, you don't need this
csr_matrix(array)
If array is 4.4gB, (79530, 54980)
In [62]: (79530 * 54980) / 1e9
Out[62]: 4.3725594 # 4.4gB makes sense for 1 byte/element
In [63]: (79530 * 54980) * 0.16 # 16% density
Out[63]: 699609504.0 # number of nonzero values
creating csr requires doing np.nonzero(array) to get the indices. That will produce 2 arrays of this 0.7 * 8 Gb size (indices are 8 byte ints). coo format actually requires those 2 arrays plus 0.7 for the nonzero values - about 12 Gb . Converted to csr, the row attribute is reduced to 79530 elements - so about 7 Gb . (corrected for 8 bytes/element)
So at 16% density, the sparse format is, at it's best, is still larger than the dense version.
Memory error when converting matrix to sparse matrix, specified dtype is invalid
is a recent case of a memory error - which occurred in nonzero step.

Assuming you know size of your matrix, you can create an empty sparse matrix, and then set only valid values one-by-one.
from osgeo import gdal
import numpy as np
from scipy.sparse import csr_matrix
ds = gdal.Open("file.tif")
band = ds.GetRasterBand(1)
matrix_size = (1000, 1000) # set you size
matrix = csr_matrix(matrix_size)
# for each valid value
matrix[i, j] = your_value
Edit 1
If you don't know size of your matrix, you should be able to check it like this:
from osgeo import gdal
ds = gdal.Open("file.tif")
width = ds.GetRasterXSize()
height = ds.GetRasterYSize()
matrix_size = (width, height)
Edit 2
I measured metrices suggested in comments (filled to the full). This is how I measured memory usage.
size 500x500
matrix
empty size
full size
filling time
csr_matrix
2856
2992
477.67 s
doc_matrix
726
35807578
3.15 s
lil_matrix
8840
8840
0.54 s
size 1000x1000
matrix
empty size
full size
filling time
csr_matrix
4856
4992
7164.94 s
doc_matrix
726
150578858
12.81 s
lil_matrix
16840
16840
2.19 s
Probably the best solution would be to use lil_matrix

PCA implementation on 3D numpy array

I have a feature set of size 2240*5*16. 2240 are number of samples, 5 represents number of channels and 16 shows # of statistical features extracted such as mean, variance, etc.
Now, I want to apply PCA. However, PCA is applicable on 2D array. I applied the following code:
from sklearn.decomposition import PCA
pca = PCA(n_components=5)
pca.fit(features)
I get the following error.
ValueError: Found array with dim 3. Estimator expected <= 2.
It doesn't support axis argument. As it is only applicable on 2D, how can I utilize it on my case (3D)? Any suggestion, if I want to reduce the dimensions from 2240*5*16 to 2240*5*5, please?

I would just loop over each channel and do PCA separately.
import numpy as np
from sklearn.decomposition import PCA
X = np.random.rand(1000, 5, 10)
X_transform = np.zeros((X.shape[0], 5, 5))
for i in range(X.shape[1]):
pca = PCA(n_components=5)
f = pca.fit_transform(X[:, i, :])
X_transform[:, i, :] = f
print((X_transform.shape))

Python scipy.sparse: how to efficiently set a set of entries to 0?

Let a be a big scipy.sparse matrix and IJ={(i0,j0),(i1,j1),...} a set of positions. How can I efficiently set all the entries in a in positions IJ to 0? Something like a[IJ]=0.
In Mathematica, I would create a new sparse matrix b with background value 1 (instead of 0) and all entries in IJ. Then, I would use a=a*b (entry-wise multiplication). That does not seem to be an option here.
A toy example:
import scipy.sparse as sp
import numpy as np
np.set_printoptions(linewidth=200,edgeitems=5,precision=4)
m=n=10**1;
a=sp.random(m,n,4/m,format='csr'); print(a.toarray())
IJ=np.array([range(0,n,2),range(0,n,2)]); print(IJ) #every second diagonal

You are almost there. To go by your definitions, all you'd need to do is:
a[IJ[0],IJ[1]] = 0
Note that scipy will warn you:
SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
You can read more about that here.

The scipy sparse matrices can't have a non-zero background value. While it it possible to make a "sparse" matrix with lots of non-zero value, the performance (speed & memory) would be far worse than dense matrix multiplication.
A possible work-around is to rewrite every sparse matrix to have a default value of zero. For example, if matrix Y' contains mostly 1, I can replace Y' by I - Y where Y = I - Y' and I is the identity matrix.
import scipy.sparse as sp
import numpy as np
size = (100, 100)
x = np.random.uniform(-1, 1, size=size)
y = sp.random(*size, 0.001, format='csr')
# Z = (I - Y)X = X - YX
z = x - y.multiply(x)
# A = X(I - Y) = X - XY = X - transpose(YX)
a = x - y.multiply(x).T

How to generate a Random Sparse Hermitian Matrix in python?

I would like to generate a Random Sparse Hermitian Matrix of a given shape in python. How can I do it efficiently? Is there any built-in python function for this task?
I have found a solution for the random sparse matrix, but I want the matrix to be Hermitian too. Here is the solution for the random sparse matrix that I found
import numpy as np
import scipy.stats as stats
import scipy.sparse as sparse
import matplotlib.pyplot as plt
np.random.seed((3,14159))
def sprandsym(n, density):
rvs = stats.norm().rvs
X = sparse.random(n, n, density=density, data_rvs=rvs)
upper_X = sparse.triu(X)
result = upper_X + upper_X.T - sparse.diags(X.diagonal())
return result
M = sprandsym(5000, 0.01)
print(repr(M))
# <5000x5000 sparse matrix of type '<class 'numpy.float64'>'
# with 249909 stored elements in Compressed Sparse Row format>
# check that the matrix is symmetric. The difference should have no non-zero elements
assert (M - M.T).nnz == 0
statistic, pval = stats.kstest(M.data, 'norm')
# The null hypothesis is that M.data was drawn from a normal distribution.
# A small p-value (say, below 0.05) would indicate reason to reject the null hypothesis.
# Since `pval` below is > 0.05, kstest gives no reason to reject the hypothesis
# that M.data is normally distributed.
print(statistic, pval)
# 0.0015998040114 0.544538788914
fig, ax = plt.subplots(nrows=2)
ax[0].hist(M.data, normed=True, bins=50)
stats.probplot(M.data, dist='norm', plot=ax[1])
plt.show()

We know that a matrix plus it's hermitian is a hermitian matrix. So to ensure your final matrix B is hermitian, just do
B = A + A.conj().T

Sparse Matrix error in MLPRegressor

Context
I'm running into an error when trying to use sparse matrices as an input to sklearn.neural_network.MLPRegressor. Nominally, this method is able to handle sparse matrices. I think this might be a bug in scikit-learn, but wanted to check on here before I submit an issue.
The Problem
When passing a scipy.sparse input to sklearn.neural_network.MLPRegressor I get:
ValueError: input must be a square array
The error is raised by the matrix_power function within numpy.matrixlab.defmatrix. It seems to occur because matrix_power passes the sparse matrix to numpy.asanyarray (L137), which returns an array of size=1, ndim=0 containing the sparse matrix object. matrix_power then performs some dimension checks (L138-141) to make sure the input is a square matrix, which fail because the array returned by numpy.asanyarray is not square, even though the underlying sparse matrix is square.
As far as I can tell, the problem stems from numpy.asanyarray preventing the dimensions of the sparse matrix being determined. The sparse matrix itself has a size attribute which would allow it to pass the dimension checks, but only if it's not run through asanyarray.
I think this might be a bug, but don't want to dive around filing issues until I've confirmed that I'm not just being an idiot! Please see below, to check.
If it is a bug, where would be the most appropriate place to raise an issue? NumPy? SciPy? or Scikit-Learn?
Minimal Example
Environment
Arch Linux
kernel 4.15.7-1
Python 3.6.4
numpy 1.14.1
scipy 1.0.0
sklearn 0.19.1
Code
import numpy as np
from scipy import sparse
from sklearn import model_selection
from sklearn.preprocessing import StandardScaler, Imputer
from sklearn.neural_network import MLPRegressor
## Generate some synthetic data
def fW(A, B, C):
return A * np.random.normal(.3, .1) + B * np.random.normal(.6, .1)
def fX(A, B, C):
return B * np.random.normal(-1, .1) + A * np.random.normal(-.9, .1) / C
# independent variables
N = int(1e4)
A = np.random.uniform(2, 12, N)
B = np.random.uniform(2, 12, N)
C = np.random.uniform(2, 12, N)
# synthetic data
mW = fW(A, B, C)
mX = fX(A, B, C)
# combine datasets
real = np.vstack([A, B, C]).T
meas = np.vstack([mW, mX]).T
# add noise to meas
meas *= np.random.normal(1, 0.0001, meas.shape)
## Make data sparse
prob_null = 0.2
real[np.random.choice([True, False], real.shape, p=[prob_null, 1-prob_null])] = np.nan
meas[np.random.choice([True, False], meas.shape, p=[prob_null, 1-prob_null])] = np.nan
# NB: problem persists whichever sparse matrix method is used.
real = sparse.csr_matrix(real)
meas = sparse.csr_matrix(meas)
# replace missing values with mean
rmnan = Imputer()
real = rmnan.fit_transform(real)
meas = rmnan.fit_transform(meas)
# split into test/training sets
real_train, real_test, meas_train, meas_test = model_selection.train_test_split(real, meas, test_size=0.3)
# create scalers and apply to data
real_scaler = StandardScaler(with_mean=False)
meas_scaler = StandardScaler(with_mean=False)
real_scaler.fit(real_train)
meas_scaler.fit(meas_train)
treal_train = real_scaler.transform(real_train)
tmeas_train = meas_scaler.transform(meas_train)
treal_test = real_scaler.transform(real_test)
tmeas_test = meas_scaler.transform(meas_test)
nn = MLPRegressor((100,100,10), solver='lbfgs', early_stopping=True, activation='tanh')
nn.fit(tmeas_train, treal_train)
## ERROR RAISED HERE
## The problem:
# the sparse matrix has a shape attribute that would pass the square matrix validation
tmeas_train.shape
# but not after it's been through asanyarray
np.asanyarray(tmeas_train).shape

MLPRegressor.fit() as given in documentation supports sparse matrix for X but not for y
Parameters:
X : array-like or sparse matrix, shape (n_samples, n_features)
The input data.
y : array-like, shape (n_samples,) or (n_samples, n_outputs)
The target values (class labels in classification, real numbers in regression).
I am able to successfully run your code with:
nn.fit(tmeas_train, treal_train.toarray())

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

PYMC / Theano memory usage with sparse matrices - python

Related

Import large .tiff file as sparse matrix

PCA implementation on 3D numpy array

Python scipy.sparse: how to efficiently set a set of entries to 0?

How to generate a Random Sparse Hermitian Matrix in python?

Sparse Matrix error in MLPRegressor

Categories

Resources