MemoryError with large sparse matrices - python

For a project I have built a program that constructs large matrices (here ssp is scipy.sparse, np is numpy; MS and H2 are defined elsewhere in the program):
def ExpandSparse(LNew):
    SpId = ssp.csr_matrix(np.identity(MS))
    Sz = MS**LNew
    HNew = ssp.csr_matrix((Sz, Sz))
    Bulk = dict()
    for i in range(LNew-1):
        for j in range(LNew-1):
            if i == j:
                Bulk[(i, j)] = H2
            else:
                Bulk[(i, j)] = SpId
    Ha = ssp.csr_matrix((8, 8))
    try:
        for i in range(LNew-1):
            for j in range(LNew-2):
                if j < 1:
                    Ha = ssp.csr_matrix(ssp.kron(Bulk[(i, j)], Bulk[(i, j+1)]))
                else:
                    Ha = ssp.csr_matrix(ssp.kron(Ha, Bulk[(i, j+1)]))
            HNew = HNew + Ha
    except MemoryError:
        print('The matrix you tried to build requires too much memory space.')
        return
    return HNew
This does the job; however, it does not work as well as I would have expected. The problem is that it won't allow for really large matrices: when LNew is larger than 13 I get a MemoryError. My experience with numpy suggests that, memory-wise, I should be able to get LNew up to 18 or 19 before hitting this error. Does this have to do with my code, or with the way scipy.sparse.kron() works on these matrices?
Another note that might be important is that I use Windows, not Linux.

After some more reading on how scipy.sparse.kron() works, I noticed that it takes a third argument named format. Its default is None, but when it is set to 'csr' (or another supported sparse format) the whole computation stays in sparse form, which makes it a lot more efficient; it can now build a 2097152 x 2097152 matrix for me. Here LNew is 21.
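For illustration, a minimal sketch of that third argument (MS and H2 here are small stand-ins for the values defined elsewhere in my program):

import numpy as np
import scipy.sparse as ssp

MS = 2                                        # stand-in local dimension
SpId = ssp.identity(MS, format='csr')         # sparse identity, never densified
H2 = ssp.csr_matrix(np.random.rand(MS, MS))   # stand-in for the real H2

# With format='csr' the Kronecker product is returned directly in CSR form,
# which for me was the difference between LNew = 13 and LNew = 21.
Ha = ssp.kron(SpId, H2, format='csr')
print(repr(Ha))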

Related

optimise zarr array processing

I have a list (mylist) of 80 5-D zarr files with the structure (T, F, B, Az, El); each array has shape [24 x 4096 x 2016 x 24 x 8].
I want to extract sliced data and compute a probability along some axes using the following function:
def GetPolarData(mylist, freq, FreqLo, FreqHi):
    '''
    This function takes the list of zarr files (T, F, B, Az, El), opens them,
    and uses the selected frequency range to return an array of
    Azimuth and Elevation probabilities, one per file.
    '''
    ChanIndx = FreqCut(FreqLo, FreqHi, freq)
    if len(ChanIndx) != 0:
        MyData = []
        for i in range(len(mylist)):
            print('Adding file {} : {}'.format(i, mylist[i][32:]))
            try:
                zarrf = xr.open_zarr(mylist[i], group='arr')
                m = zarrf.master.sum(dim=['time', 'baseline'])
                m = m[ChanIndx].sum(dim=['frequency'])
                c = zarrf.counter.sum(dim=['time', 'baseline'])
                c = c[ChanIndx].sum(dim=['frequency'])
                p = m.astype(float) / c.astype(float)
                MyData.append(p)
            except Exception as e:
                print(e)
                continue
    else:
        print("Something went wrong in Frequency selection")
    print("##########################################")
    print("This will be contribution to selected band")
    print("##########################################")
    print(f"Min {np.nanmin(MyData)*100:.3f}% ")
    print(f"Max {np.nanmax(MyData)*100:.3f}% ")
    print(f"Average {np.nanmean(MyData)*100:.3f}% ")
    return(MyData)
If I call the function using the following,
FreqLo = 470.
FreqHi = 854.
MyTVData = np.array(GetPolarData(AllZarrList, Freq, FreqLo, FreqHi))
I find it takes very long to run (over 3 hours) on a machine with 40 cores and 256 GB of RAM.
Is there a way to make this run faster?
Thank you
It seems like you could take advantage of parallelization here: each array is only read once, and they are all processed independently of each other.
XArray and others may do computation in parallel, but for your application using the multiprocessing library could help share the work among the different cores more evenly.
The best tool for finding where the time goes is the profile library (or its faster equivalent, cProfile), which can show the most time-consuming parts of your code. I suggest you run it on a single-process version of your code first: it will be easier to interpret.
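A minimal sketch of that multiprocessing idea, assuming the per-file work from GetPolarData is moved into a top-level helper (process_one is a hypothetical name here) so it can be pickled and mapped across worker processes:

from multiprocessing import Pool

import numpy as np
import xarray as xr

def process_one(args):
    # One file's worth of work: open the zarr group, sum over time/baseline,
    # select the channel indices, sum over frequency, then form the ratio.
    path, ChanIndx = args
    zarrf = xr.open_zarr(path, group='arr')
    m = zarrf.master.sum(dim=['time', 'baseline'])[ChanIndx].sum(dim=['frequency'])
    c = zarrf.counter.sum(dim=['time', 'baseline'])[ChanIndx].sum(dim=['frequency'])
    return (m.astype(float) / c.astype(float)).values

def GetPolarDataParallel(mylist, ChanIndx, nworkers=8):
    # Each file is processed independently, so a simple Pool.map is enough.
    with Pool(nworkers) as pool:
        MyData = pool.map(process_one, [(p, ChanIndx) for p in mylist])
    return np.array(MyData)

Remember to call this from under an if __name__ == '__main__': guard, and profile the single-process version first to confirm that the per-file processing, not something else, is where the time actually goes.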

Numpy (n, 1, m) to (n,m)

I am working on a problem which involves a batch of 19 tokens, each with 400 features. I get the shape (19, 1, 400) when concatenating two vectors of size (1, 200) into the final feature vector. If I squeeze the 1 out I am left with (19,), but I am trying to get (19, 400). I have tried converting to a list, squeezing and ravelling, but nothing has worked.
Is there a way to convert this array to the correct shape?
def attn_output_concat(sample):
    out_h, state_h = get_output_and_state_history(agent.model, sample)
    attns = get_attentions(state_h)
    inner_outputs = get_inner_outputs(state_h)
    if len(attns) != len(inner_outputs):
        print('Length err')
    else:
        tokens = [np.zeros((400))] * largest
        print(tokens.shape)
        for j, (attns_token, inner_token) in enumerate(zip(attns, inner_outputs)):
            tokens[j] = np.concatenate([attns_token, inner_token], axis=1)
        print(np.array(tokens).shape)
        return tokens
The easiest way would be to declare tokens as a NumPy array of shape (19, 400) to start with. That's also more memory- and time-efficient. Here's the relevant portion of your code, revised...
import numpy as np

attns_token = np.zeros(shape=(1, 200))
inner_token = np.zeros(shape=(1, 200))
largest = 19

tokens = np.zeros(shape=(largest, 400))
for j in range(largest):
    tokens[j] = np.concatenate([attns_token, inner_token], axis=1)

print(tokens.shape)
BTW... it makes it difficult for people to help you if you don't include a self-contained and runnable segment of code (which is probably why you haven't gotten a response on this yet). Something like the above snippet is preferred and will help you get better answers, because there's less guessing at what you're trying to accomplish.
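One more note, hedged since I'm guessing at the surrounding shapes: if the concatenation really does give you a proper (19, 1, 400) float array rather than an object array, the singleton axis can be dropped directly:

import numpy as np

arr = np.zeros((19, 1, 400))       # stand-in for np.array(tokens)
flat = np.squeeze(arr, axis=1)     # or equivalently arr.reshape(19, 400)
print(flat.shape)                  # (19, 400)

Getting (19,) from a squeeze usually means the outer array has dtype=object, i.e. the 19 elements have inconsistent shapes, which is exactly what pre-allocating the (19, 400) array avoids.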

Reading BSD500 groundTruth with scipy

I am trying to load a ground truth file with scipy's loadmat; it returns a numpy ndarray of type object (dtype='O').
From that object I manage to access each element, which are also ndarrays, but from that point I am struggling to access either the segmentation or the boundaries image.
I would like to transform this into a list of lists of numerical ndarrays. How can I do that?
Thanks in advance for any help
I found a way to fix my issue.
I do not think it is optimal but it works.
from scipy.io import loadmat

def load_bsd_gt(filename):
    gt = loadmat(filename)
    gt = gt['groundTruth']
    cols = gt.shape[1]
    what = ['Segmentation', 'Boundaries']
    ret = list()
    for i in range(cols):
        j = 0
        tmp = list()
        for w in what:
            tmp.append(gt[0][j][w][0][0][:])
            j += 1
        ret.append(tmp)
    return ret
If someone has a better way to do it, please feel free to add a comment or an answer.
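For comparison, a slightly more direct sketch, assuming the usual BSDS500 layout where gt['groundTruth'][0, i] is the i-th human annotation with 'Segmentation' and 'Boundaries' fields (the same indexing pattern as above, so treat it as a guess rather than a verified implementation):

from scipy.io import loadmat

def load_bsd_gt_alt(filename):
    gt = loadmat(filename)['groundTruth']
    annotations = []
    for i in range(gt.shape[1]):
        entry = gt[0, i]
        seg = entry['Segmentation'][0, 0]   # 2-D array of region labels
        bnd = entry['Boundaries'][0, 0]     # 2-D 0/1 boundary map
        annotations.append([seg, bnd])
    return annotations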

GSVD for python Generalized Singular Value Decomposition

MATLAB has a gsvd function to perform the generalised SVD. Since 2013, I think, there has been a lot of discussion on the GitHub pages about putting it in scipy, and some pages have code I could use, such as here, which is super complicated for a novice like me (to get it running).
I also found LJWilliams' GitHub page with an implementation. That is not much help either, as it has a lot of bugs when transferred to Python 3. I attempted to correct the simple ones, such as the assert and print statements, but it quickly gets complicated.
Can someone help me with a gsvd code for Python or show me how to use the ones that are online?
Also, this is what I get with the LJWilliams implementation once the print and assert statements are corrected. The code looks complicated and I am not sure spending time on it is the best thing to do! Some people have reported issues on the same GitHub page, which I am not sure are fixed or related.
n = 10
m = 6
p = 6
A = np.random.rand(m,n)
B = np.random.rand(p,n)
gsvd(A,B)
File "/home/eghx/agent18/master_thesis/AMfe/amfe/gsvd.py", line 260,
in gsvd
U, V, Z, C, S = csd(Q[0:m,:],Q[m:m+n,:])
File "/home/eghx/agent18/master_thesis/AMfe/amfe/gsvd.py", line 107,
in csd
Q,R = scipy.linalg.qr(S[q:n,m:p])
File
"/home/eghx/anaconda3/lib/python3.5/site-packages/scipy/linalg/decomp_qr.py",
line 141, in qr
overwrite_a=overwrite_a)
File
"/home/eghx/anaconda3/lib/python3.5/site-packages/scipy/linalg/decomp_qr.py",
line 19, in safecall
ret = f(*args, **kwargs)
ValueError: failed to create intent(cache|hide)|optional array-- must
have defined dimensions but got (0,)
If you want to work from the LJWilliams implementation on GitHub, there are a couple of bugs. However, to understand the technique fully, I'd probably recommend having a go at implementing it yourself. I looked up what Octave (the free-software MATLAB equivalent) does, and its "code is a wrapper to the corresponding Lapack dggsvd and zggsvd routines", which is what scipy should do IMHO.
I'll post up the bugs I found, but I'm not going to post the code in full working order, because I'm not sure how that stands with regard to copyright, given the copyrighted MATLAB implementation from which it is translated.
Caveat: I am not an expert on the generalised SVD and have approached this only from the perspective of debugging, not of whether the underlying algorithm is correct. I have had this working on your original random arrays and on the test case already present in the Python file.
Bugs
Setting k
Around line 63, the conditions for setting k and a misunderstanding of numpy.argwhere (particularly in comparison to MATLAB's find) seem to set k wrongly in some circumstances. Change that code to:
if q == 1:
    k = 0
elif m < p:
    k = n
else:
    k = max([0, sum((np.diag(C) <= 1/np.sqrt(2)))])
line 79
S[1,1] should be S[0,0], I think (Python 0-indexed arrays)
lines 83 onwards
The numpy matrix slicing around here seems wrong. I got the code working by changing lines 83-95 to read:
UT, ST, VT = scipy.linalg.svd(slice_matrix(S, i, j))
ST = add_zeros(ST, np.zeros([n-k, r-k]))
if k > 0:
    print('Zeroing elements of S in row indices > r, to be replaced by ST')
    S[0:k, k:r] = 0
S[k:n, k:r] = ST
C[:, j] = np.dot(C[:, j], VT)
V[:, i] = np.dot(V[:, i], UT)
Z[:, j] = np.dot(Z[:, j], VT)
i = np.arange(k, q)
Q, R = scipy.linalg.qr(C[k:q, k:r])
C[i, j] = np.diag(diagf(R))
U[:, k:q] = np.dot(U[:, k:q], Q)
in diagp()
There are two matrix multiplications using X*Y that should be np.dot(X,Y) instead (note * is element-wise multiplication in numpy, not matrix multiplication.)
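To make the distinction concrete (a tiny standalone example, not taken from the gsvd code itself):

import numpy as np

X = np.array([[1., 2.], [3., 4.]])
Y = np.array([[5., 6.], [7., 8.]])

print(X * Y)         # element-wise product
print(np.dot(X, Y))  # true matrix product, which is what the MATLAB code intends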

Python: numpy.corrcoef Memory Error

I was trying to calculate the correlation between a large set of data read from a text file. For extremely large data sets the program gives a memory error. Can anyone please tell me how to correct this problem? Thanks.
The following is my code:
import numpy
from numpy import *
from array import *
from decimal import *
import sys

Threshold = 0.8;
TopMostData = 10;

FileName = sys.argv[1]
File = open(FileName, 'r')

SignalData = numpy.empty((1, 128));
SignalData[:][:] = 0;

for line in File:
    TempLine = line.split();
    TempInt = [float(i) for i in TempLine]
    SignalData = vstack((SignalData, TempInt))
    del TempLine;
    del TempInt;

File.close();

TempData = SignalData;
SignalData = SignalData[1:,:]
SignalData = SignalData[:,65:128]

print "File Read | Data Stored" + " | Total Lines: " + str(len(SignalData))

CorrelationData = numpy.corrcoef(SignalData)
The following is the error:
Traceback (most recent call last):
  File "Corelation.py", line 36, in <module>
    CorrelationData = numpy.corrcoef(SignalData)
  File "/usr/lib/python2.7/dist-packages/numpy/lib/function_base.py", line 1824, in corrcoef
    return c/sqrt(multiply.outer(d, d))
MemoryError
You run out of memory as the comments show. If that happens because you are using 32-bit Python, even the method below will fail. But for the 64-bit Python and not-so-much-RAM situation we can do a lot as calculating the correlations is easily done piecewise, as you only need two lines in the memory simultaneously.
So, you may split your input into, say, 1000 row chunks, and then the resulting 1000 x 1000 matrices are easy to keep in memory. Then you can assemble your result into the big output matrix which is not necessarily in the RAM. I recommend this approach even if you have a lot of RAM, because this is much more memory-friendly. Correlation coefficient calculation is not an operation where fast random accesses would help a lot if the input can be kept in RAM.
Unfortunately, the numpy.corrcoef does not do this automatically, and we'll have to roll our own correlation coefficient calculation. Fortunately, that is not as hard as it sounds.
Something along these lines:
import numpy as np

# number of rows in one chunk
SPLITROWS = 1000

# the big table, which is usually bigger
bigdata = np.random.random((27000, 128))

numrows = bigdata.shape[0]

# subtract means from the input data
bigdata -= np.mean(bigdata, axis=1)[:, None]

# normalize the data
bigdata /= np.sqrt(np.sum(bigdata*bigdata, axis=1))[:, None]

# reserve the resulting table onto HDD
res = np.memmap("/tmp/mydata.dat", 'float64', mode='w+', shape=(numrows, numrows))

for r in range(0, numrows, SPLITROWS):
    for c in range(0, numrows, SPLITROWS):
        r1 = r + SPLITROWS
        c1 = c + SPLITROWS
        chunk1 = bigdata[r:r1]
        chunk2 = bigdata[c:c1]
        res[r:r1, c:c1] = np.dot(chunk1, chunk2.T)
Some notes:
the code above has been tested against np.corrcoef(bigdata)
if you have complex values, you'll need to create a complex output array res and take the complex conjugate of chunk2.T
the code garbles bigdata to maintain performance and minimize memory use; if you need to preserve it, make a copy
The above code takes about 85 s to run on my machine, but the data mostly fits in RAM and I have an SSD. The algorithm is coded so as to avoid overly random access to the HDD, i.e. the access is reasonably sequential. In comparison, the non-memmapped standard version is not significantly faster even if you have a lot of memory. (Actually, it took a lot more time in my case, but I suspect I ran out of my 16 GiB and then there was a lot of swapping going on.)
You can make the actual calculations faster by omitting half of the matrix, because res.T == res. In practice, you can omit all blocks where c > r and then mirror them later on. On the other hand, the performance is most likely limited by the HDD performance, so other optimizations do not necessarily bring much more speed.
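A sketch of that half-matrix variant, reusing bigdata, res, numrows and SPLITROWS from the snippet above:

# Only compute blocks on or above the diagonal, then mirror them,
# assuming bigdata has already been centred and normalised as above.
for r in range(0, numrows, SPLITROWS):
    for c in range(r, numrows, SPLITROWS):
        r1 = min(r + SPLITROWS, numrows)
        c1 = min(c + SPLITROWS, numrows)
        block = np.dot(bigdata[r:r1], bigdata[c:c1].T)
        res[r:r1, c:c1] = block
        if r != c:
            res[c:c1, r:r1] = block.T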
Of course, this approach is easy to make parallel, as the chunk calculations are completely independent. Also the memmapped array can be shared between threads rather easily.
