I am using the latest version of PyMC3 / Theano to compute the MAP estimate for the following simple logistic regression model:
w ~ N(0, 1)
y ~ Bin(σ(Xw))
My features are strictly categorical, and I am using one-hot encoding to transform the design matrix into a sparse (CSR) matrix of 0's and 1's. The sparse matrix occupies roughly 15MB, but I am seeing memory spikes up to 4GB. Incidentally, this is the size of a dense matrix with the same dimensions.
Below is a simplified version of my code,
from __future__ import division, print_function
import numpy as np
import pymc3 as pm
from scipy import sparse as sp
from scipy.optimize import fmin_l_bfgs_b
from scipy.special import expit
from theano import sparse as S
np.random.seed(0)
# properties of sparse design matrix (taken from the real data)
N = 100000 # number of samples
M = 5000 # number of dimensions
D = 0.002 # matrix density
# fake data
mu0, sd0 = 0.0, 1.0
w = np.random.normal(mu0, sd0, M)
X = sp.random(N, M, density=D, format='csr', data_rvs=np.ones)
y = np.random.binomial(1, expit(X.dot(w)), N)
# estimate memory usage
size = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes + y.nbytes
print('{:.2f} MB of data'.format(size / 1024 ** 2))
# model definition
with pm.Model() as model:
    w = pm.Normal('w', mu0, sd=sd0, shape=M)
    p = pm.sigmoid(S.dot(X, w.reshape((-1, 1)))).flatten()
    pm.Bernoulli('y', p, observed=y)
    print(pm.find_MAP(vars=[w], fmin=fmin_l_bfgs_b))
which produces the following memory profile:
Even if the L-BFGS optimizer computed the full dense Hessian (which it doesn't), that would take up only 190MB. I suspect that somewhere along the way, X is automatically converted into a NumPy array to implement some Theano op.
Has anyone successfully worked with sparse matrices in PyMC3? Any ideas where exactly the problem could lie?
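For reference, the sizes quoted in the question can be checked with a few lines of arithmetic. This is only a sketch and not part of the original code: it assumes float64 values and 32-bit CSR indices, and the helper names are made up for illustration.

```python
N, M, D = 100000, 5000, 0.002  # shape and density from the question

def dense_nbytes(rows, cols, itemsize=8):
    """Bytes needed for a dense array with `itemsize` bytes per element."""
    return rows * cols * itemsize

def csr_nbytes(rows, nnz, itemsize=8, index_size=4):
    """Approximate bytes for CSR storage: data + indices + indptr."""
    return nnz * itemsize + nnz * index_size + (rows + 1) * index_size

nnz = round(N * M * D)  # expected number of stored entries
mib = 1024 ** 2
print('dense X : {:.0f} MiB'.format(dense_nbytes(N, M) / mib))
print('CSR X   : {:.1f} MiB'.format(csr_nbytes(N, nnz) / mib))
print('Hessian : {:.0f} MiB'.format(dense_nbytes(M, M) / mib))
```

Under these assumptions a dense copy of X needs about 4 GB, against roughly 12 MB for the CSR form, which is consistent with the observed spike being a dense copy of X.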
I have a large .tiff file (4.4 GB, 79530 x 54980 values) with 1 band. Since only 16% of the values are valid, I thought it would be better to import the file as a sparse matrix to save RAM. However, when I open it as an np.array and then convert it to a sparse matrix with csr_matrix(), my kernel crashes. See the code below.
from osgeo import gdal
import numpy as np
from scipy.sparse import csr_matrix
ds = gdal.Open("file.tif")
band = ds.GetRasterBand(1)
array = np.array(band.ReadAsArray())
csr_matrix(array)
Is there a better way to work with this file? In the end I have to make calculations based on the values in the raster. (Unfortunately, due to confidentiality, I cannot attach the relevant file.)
Can you tell where the crash occurs?
band = ds.GetRasterBand(1)
temp = band.ReadAsArray()
array = np.array(temp) # if temp is already an array, you don't need this
csr_matrix(array)
If array is 4.4 GB with shape (79530, 54980):
In [62]: (79530 * 54980) / 1e9
Out[62]: 4.3725594 # 4.4 GB makes sense for 1 byte/element
In [63]: (79530 * 54980) * 0.16 # 16% density
Out[63]: 699609504.0 # number of nonzero values
Creating the csr matrix requires calling np.nonzero(array) to get the indices. That produces 2 arrays of 0.7 * 8 GB each (the indices are 8-byte ints), about 5.6 GB apiece. The coo format requires those 2 arrays plus 0.7 GB for the nonzero values, about 12 GB in total. After conversion to csr, the row attribute is reduced to 79530 elements, so about 6.3 GB remain (5.6 GB of column indices plus 0.7 GB of values). (Corrected for 8 bytes/element.)
So at 16% density, the sparse format is, at its best, still larger than the dense version.
"Memory error when converting matrix to sparse matrix, specified dtype is invalid" is a recent case of a memory error that occurred in the nonzero step.
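The estimate above can be reproduced with a short calculation. This is a rough sketch: it assumes 1 byte per raster value (as inferred above), counts only the final arrays and not temporaries, and the helper names are illustrative. The 4-byte index case is an added assumption for comparison, not something stated in the answer.

```python
rows, cols = 79530, 54980   # raster shape from the question
density = 0.16              # fraction of valid cells
value_size = 1              # bytes per value, as inferred above

n_elem = rows * cols
nnz = round(n_elem * density)

def coo_nbytes(nnz, value_size, index_size):
    """COO stores one value plus a row and a column index per entry."""
    return nnz * (value_size + 2 * index_size)

def csr_nbytes(nnz, rows, value_size, index_size):
    """CSR stores one value and one column index per entry, plus indptr."""
    return nnz * (value_size + index_size) + (rows + 1) * index_size

dense_nbytes = n_elem * value_size
for index_size in (8, 4):
    print('index bytes:', index_size,
          '| dense GB: %.2f' % (dense_nbytes / 1e9),
          '| coo GB: %.2f' % (coo_nbytes(nnz, value_size, index_size) / 1e9),
          '| csr GB: %.2f' % (csr_nbytes(nnz, rows, value_size, index_size) / 1e9))
```

With 8-byte indices both sparse layouts come out larger than the dense array. With 4-byte indices the final CSR matrix would be somewhat smaller than dense, but the intermediate index arrays created during conversion can still be the limiting factor.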
Assuming you know the size of your matrix, you can create an empty sparse matrix and then set only the valid values one by one.
from osgeo import gdal
import numpy as np
from scipy.sparse import csr_matrix
ds = gdal.Open("file.tif")
band = ds.GetRasterBand(1)
matrix_size = (1000, 1000)  # set your size as (rows, columns)
matrix = csr_matrix(matrix_size)
# for each valid value (i, j and your_value are placeholders)
matrix[i, j] = your_value
Edit 1
If you don't know the size of your matrix, you should be able to check it like this:
from osgeo import gdal
ds = gdal.Open("file.tif")
width = ds.GetRasterXSize()
height = ds.GetRasterYSize()
matrix_size = (height, width)  # matrix shape is (rows, columns), i.e. (YSize, XSize)
Edit 2
I measured the matrices suggested in the comments (filled completely). This is how I measured memory usage.
size 500x500

| matrix | empty size | full size | filling time |
|---|---|---|---|
| csr_matrix | 2856 | 2992 | 477.67 s |
| dok_matrix | 726 | 35807578 | 3.15 s |
| lil_matrix | 8840 | 8840 | 0.54 s |

size 1000x1000

| matrix | empty size | full size | filling time |
|---|---|---|---|
| csr_matrix | 4856 | 4992 | 7164.94 s |
| dok_matrix | 726 | 150578858 | 12.81 s |
| lil_matrix | 16840 | 16840 | 2.19 s |

Probably the best solution would be to use lil_matrix.
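As an alternative to assigning elements one at a time, the raster could be read in horizontal strips, with each strip converted to sparse form on its own and the pieces stacked at the end. The sketch below is only an illustration under a few assumptions: a small in-memory NumPy array stands in for the raster, invalid cells are encoded as 0, and `read_rows` is a hypothetical callback (with GDAL it would presumably wrap a windowed `band.ReadAsArray` call).

```python
import numpy as np
from scipy import sparse

def raster_to_csr(read_rows, n_rows, block_rows=256):
    """Build a CSR matrix strip by strip so only one strip is dense at a time."""
    blocks = []
    for start in range(0, n_rows, block_rows):
        stop = min(start + block_rows, n_rows)
        strip = read_rows(start, stop)            # dense, shape (stop - start, n_cols)
        blocks.append(sparse.csr_matrix(strip))   # keep only the nonzero cells
    return sparse.vstack(blocks, format='csr')

# Fake raster standing in for the real file: about 16% of the cells are valid.
rng = np.random.RandomState(0)
fake = rng.randint(1, 255, size=(1000, 800)).astype(np.uint8)
fake[rng.rand(1000, 800) > 0.16] = 0

m = raster_to_csr(lambda start, stop: fake[start:stop], fake.shape[0])
print(m.shape, m.nnz)
```

This bounds the peak dense memory by the strip size, but it does not change the earlier observation that at 16% density the sparse result may not be smaller than the dense array.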
I have a feature set of size 2240*5*16, where 2240 is the number of samples, 5 is the number of channels, and 16 is the number of statistical features extracted (mean, variance, etc.).
Now I want to apply PCA. However, PCA only works on 2D arrays. I tried the following code:
from sklearn.decomposition import PCA
pca = PCA(n_components=5)
pca.fit(features)
I get the following error.
ValueError: Found array with dim 3. Estimator expected <= 2.
PCA doesn't support an axis argument. Since it only works on 2D input, how can I use it in my 3D case? Any suggestions on how to reduce the dimensions from 2240*5*16 to 2240*5*5?
I would just loop over each channel and do PCA separately.
import numpy as np
from sklearn.decomposition import PCA
X = np.random.rand(1000, 5, 10)
X_transform = np.zeros((X.shape[0], 5, 5))
for i in range(X.shape[1]):
    # fit a separate PCA on the (samples, features) slice of each channel
    pca = PCA(n_components=5)
    f = pca.fit_transform(X[:, i, :])
    X_transform[:, i, :] = f
print(X_transform.shape)
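A different option, if the channels do not need to be kept separate, is to flatten the channel and feature axes into one and run a single PCA. This is a sketch of a different technique from the per-channel loop above: the result has shape (samples, components) rather than 2240*5*5, and the choice of 25 components is an arbitrary assumption made only to keep the output the same total size.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
features = rng.rand(2240, 5, 16)                  # stand-in for the real feature set

# Merge the channel and feature axes: (2240, 5, 16) -> (2240, 80)
flat = features.reshape(features.shape[0], -1)

pca = PCA(n_components=25)                        # 25 is an arbitrary choice
reduced = pca.fit_transform(flat)
print(flat.shape, reduced.shape)
```

Whether this is preferable depends on the data: the per-channel loop gives each channel its own components, while the flattened version lets components mix information across channels.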
Let a be a big scipy.sparse matrix and IJ={(i0,j0),(i1,j1),...} a set of positions. How can I efficiently set all the entries in a in positions IJ to 0? Something like a[IJ]=0.
In Mathematica, I would create a new sparse matrix b with background value 1 (instead of 0) and all entries in IJ. Then, I would use a=a*b (entry-wise multiplication). That does not seem to be an option here.
A toy example:
import scipy.sparse as sp
import numpy as np
np.set_printoptions(linewidth=200,edgeitems=5,precision=4)
m=n=10**1;
a=sp.random(m,n,4/m,format='csr'); print(a.toarray())
IJ=np.array([range(0,n,2),range(0,n,2)]); print(IJ) #every second diagonal
You are almost there. To go by your definitions, all you'd need to do is:
a[IJ[0],IJ[1]] = 0
Note that scipy will warn you:
SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
You can read more about that here.
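Following the warning's suggestion, one possible pattern is to convert to lil_matrix for the assignment and convert back afterwards. This is a small sketch on a toy matrix; the identity is added only so that the diagonal entries being zeroed actually exist, and the final `eliminate_zeros()` call is a precaution in case explicit zeros remain stored.

```python
import numpy as np
import scipy.sparse as sp

n = 10
a = sp.random(n, n, density=0.4, format='csr', random_state=0) + sp.eye(n, format='csr')
IJ = np.array([range(0, n, 2), range(0, n, 2)])   # every second diagonal entry

a_lil = a.tolil()                # lil_matrix supports cheap structural changes
a_lil[IJ[0], IJ[1]] = 0
a_zeroed = a_lil.tocsr()
a_zeroed.eliminate_zeros()       # drop any explicitly stored zeros

print(a_zeroed.toarray()[IJ[0], IJ[1]])
```

The two conversions cost time and memory proportional to the number of stored entries, so this is mainly worthwhile when many positions are changed at once.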
The scipy sparse matrices can't have a non-zero background value. While it is possible to make a "sparse" matrix with lots of non-zero values, the performance (speed and memory) would be far worse than dense matrix multiplication.
A possible workaround is to rewrite every sparse matrix so that its default value is zero. For example, if matrix Y' contains mostly 1s, I can replace Y' by 1 - Y, where Y = 1 - Y' and 1 is the all-ones matrix (the multiplication here is entry-wise).
import scipy.sparse as sp
import numpy as np
size = (100, 100)
x = np.random.uniform(-1, 1, size=size)
y = sp.random(*size, density=0.001, format='csr')
# Z = (1 - Y) * X = X - Y * X  (entry-wise)
z = x - y.multiply(x)
# A = X * (1 - Y) = X - X * Y; entry-wise multiplication commutes, so A equals Z
a = x - y.multiply(x)
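Applied to the original question, the same trick can be written as a = a - a*b (entry-wise), where b is a sparse 0/1 mask holding a 1 at every position in IJ. The sketch below is one way this might look; `zero_entries` is a made-up helper name, and the mask values are reset to 1 in case IJ contains duplicate positions.

```python
import numpy as np
import scipy.sparse as sp

def zero_entries(a, rows, cols):
    """Return a copy of sparse matrix `a` with the entries at (rows, cols) set to 0."""
    ones = np.ones(len(rows))
    mask = sp.csr_matrix((ones, (rows, cols)), shape=a.shape)
    mask.data[:] = 1                     # duplicates were summed; force back to 1
    out = (a - a.multiply(mask)).tocsr()
    out.eliminate_zeros()                # make sure no explicit zeros are kept
    return out

n = 10
a = sp.random(n, n, density=0.4, format='csr', random_state=0) + sp.eye(n, format='csr')
rows = cols = np.arange(0, n, 2)
b = zero_entries(a, rows, cols)
print(b.toarray()[rows, cols])
```

This never changes the sparsity structure of a in place, so it avoids the SparseEfficiencyWarning at the cost of building one extra sparse matrix.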
I would like to generate a Random Sparse Hermitian Matrix of a given shape in python. How can I do it efficiently? Is there any built-in python function for this task?
I have found a solution for the random sparse matrix, but I want the matrix to be Hermitian too. Here is the solution for the random sparse matrix that I found
import numpy as np
import scipy.stats as stats
import scipy.sparse as sparse
import matplotlib.pyplot as plt
np.random.seed((3,14159))
def sprandsym(n, density):
    rvs = stats.norm().rvs
    X = sparse.random(n, n, density=density, data_rvs=rvs)
    upper_X = sparse.triu(X)
    # mirror the upper triangle; subtract the diagonal so it is not counted twice
    result = upper_X + upper_X.T - sparse.diags(X.diagonal())
    return result
M = sprandsym(5000, 0.01)
print(repr(M))
# <5000x5000 sparse matrix of type '<class 'numpy.float64'>'
# with 249909 stored elements in Compressed Sparse Row format>
# check that the matrix is symmetric. The difference should have no non-zero elements
assert (M - M.T).nnz == 0
statistic, pval = stats.kstest(M.data, 'norm')
# The null hypothesis is that M.data was drawn from a normal distribution.
# A small p-value (say, below 0.05) would indicate reason to reject the null hypothesis.
# Since `pval` below is > 0.05, kstest gives no reason to reject the hypothesis
# that M.data is normally distributed.
print(statistic, pval)
# 0.0015998040114 0.544538788914
fig, ax = plt.subplots(nrows=2)
ax[0].hist(M.data, density=True, bins=50)  # `normed` was removed in newer matplotlib
stats.probplot(M.data, dist='norm', plot=ax[1])
plt.show()
We know that a matrix plus its conjugate transpose is a Hermitian matrix. So to ensure your final matrix B is Hermitian, just do
B = A + A.conj().T
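Put together with a random sparse complex matrix, this might look like the sketch below. It is an illustration under stated assumptions rather than a built-in: `sprandherm` is a made-up name, `sparse.random` is used only to choose the sparsity pattern, and the values are drawn from a complex normal distribution.

```python
import numpy as np
import scipy.sparse as sparse

def sprandherm(n, density, seed=0):
    """Random sparse Hermitian matrix built as A + A^H (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    # Use sparse.random only to pick which positions are stored.
    pattern = sparse.random(n, n, density=density, format='coo', random_state=rng)
    # Fill those positions with complex normal values.
    vals = rng.standard_normal(pattern.nnz) + 1j * rng.standard_normal(pattern.nnz)
    A = sparse.coo_matrix((vals, (pattern.row, pattern.col)), shape=(n, n)).tocsr()
    return A + A.conj().T

B = sprandherm(200, 0.01)
dense = B.toarray()
print(np.array_equal(dense, dense.conj().T))   # Hermitian check on the dense copy
```

Note that the result has roughly twice the requested density, since each stored entry of A also contributes its mirrored conjugate, and that the diagonal is real by construction.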
Context
I'm running into an error when trying to use sparse matrices as an input to sklearn.neural_network.MLPRegressor. Nominally, this method is able to handle sparse matrices. I think this might be a bug in scikit-learn, but wanted to check on here before I submit an issue.
The Problem
When passing a scipy.sparse input to sklearn.neural_network.MLPRegressor I get:
ValueError: input must be a square array
The error is raised by the matrix_power function within numpy.matrixlab.defmatrix. It seems to occur because matrix_power passes the sparse matrix to numpy.asanyarray (L137), which returns an array of size=1, ndim=0 containing the sparse matrix object. matrix_power then performs some dimension checks (L138-141) to make sure the input is a square matrix, which fail because the array returned by numpy.asanyarray is not square, even though the underlying sparse matrix is square.
As far as I can tell, the problem stems from numpy.asanyarray hiding the dimensions of the sparse matrix. The sparse matrix itself has a shape attribute which would allow it to pass the dimension checks, but only if it's not run through asanyarray.
I think this might be a bug, but don't want to dive around filing issues until I've confirmed that I'm not just being an idiot! Please see below, to check.
If it is a bug, where would be the most appropriate place to raise an issue? NumPy? SciPy? or Scikit-Learn?
Minimal Example
Environment
Arch Linux
kernel 4.15.7-1
Python 3.6.4
numpy 1.14.1
scipy 1.0.0
sklearn 0.19.1
Code
import numpy as np
from scipy import sparse
from sklearn import model_selection
from sklearn.preprocessing import StandardScaler, Imputer
from sklearn.neural_network import MLPRegressor
## Generate some synthetic data
def fW(A, B, C):
    return A * np.random.normal(.3, .1) + B * np.random.normal(.6, .1)
def fX(A, B, C):
    return B * np.random.normal(-1, .1) + A * np.random.normal(-.9, .1) / C
# independent variables
N = int(1e4)
A = np.random.uniform(2, 12, N)
B = np.random.uniform(2, 12, N)
C = np.random.uniform(2, 12, N)
# synthetic data
mW = fW(A, B, C)
mX = fX(A, B, C)
# combine datasets
real = np.vstack([A, B, C]).T
meas = np.vstack([mW, mX]).T
# add noise to meas
meas *= np.random.normal(1, 0.0001, meas.shape)
## Make data sparse
prob_null = 0.2
real[np.random.choice([True, False], real.shape, p=[prob_null, 1-prob_null])] = np.nan
meas[np.random.choice([True, False], meas.shape, p=[prob_null, 1-prob_null])] = np.nan
# NB: problem persists whichever sparse matrix method is used.
real = sparse.csr_matrix(real)
meas = sparse.csr_matrix(meas)
# replace missing values with mean
rmnan = Imputer()
real = rmnan.fit_transform(real)
meas = rmnan.fit_transform(meas)
# split into test/training sets
real_train, real_test, meas_train, meas_test = model_selection.train_test_split(real, meas, test_size=0.3)
# create scalers and apply to data
real_scaler = StandardScaler(with_mean=False)
meas_scaler = StandardScaler(with_mean=False)
real_scaler.fit(real_train)
meas_scaler.fit(meas_train)
treal_train = real_scaler.transform(real_train)
tmeas_train = meas_scaler.transform(meas_train)
treal_test = real_scaler.transform(real_test)
tmeas_test = meas_scaler.transform(meas_test)
nn = MLPRegressor((100,100,10), solver='lbfgs', early_stopping=True, activation='tanh')
nn.fit(tmeas_train, treal_train)
## ERROR RAISED HERE
## The problem:
# the sparse matrix has a shape attribute that would pass the square matrix validation
tmeas_train.shape
# but not after it's been through asanyarray
np.asanyarray(tmeas_train).shape
As given in the documentation, MLPRegressor.fit() supports a sparse matrix for X but not for y:
Parameters:
X : array-like or sparse matrix, shape (n_samples, n_features)
The input data.
y : array-like, shape (n_samples,) or (n_samples, n_outputs)
The target values (class labels in classification, real numbers in regression).
I am able to successfully run your code with:
nn.fit(tmeas_train, treal_train.toarray())
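The shape problem described in the question can be reproduced without scikit-learn. The sketch below uses a small random sparse matrix as a stand-in for the targets; it only illustrates why a sparse y confuses NumPy and why `.toarray()` helps, not what MLPRegressor does internally.

```python
import numpy as np
from scipy import sparse

# Small stand-in for the sparse target matrix from the question.
y_sparse = sparse.random(6, 3, density=0.5, format='csr', random_state=0)

# NumPy does not know how to unpack a sparse matrix, so it wraps the
# whole object in a 0-dimensional array of dtype object.
wrapped = np.asanyarray(y_sparse)
print(wrapped.ndim, wrapped.dtype)

# Converting explicitly gives a regular 2D array with the expected shape.
dense = y_sparse.toarray()
print(dense.shape)
```

Since targets are typically small compared with the feature matrix, converting only y to a dense array usually costs little memory.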