how do I compress matrix to avoid memory error? - python

Following is the numpy array I have. I need to create a matrix containing zeros, for instance like np.zeros([1,1]).
newEdges =
array([['0', 'Firm'],
['1', 'Firm'],
['2', 'Firm'],
...,
['binA', 'year2017_bin'],
['binA', 'year2017_bin'],
['binA', 'year2017_bin']],
dtype='<U21')
newEdges.shape
#(63673218, 2)
newEdges.size
#127346436
However, based on the size of my matrix (as you can see above, that is, (63673218, 2)), when I run the code to generate the zeros matrix I get a MemoryError.
Here is the full code:
print(newEdges)
unique_Bin = np.unique(newEdges[:,0])
n_unique_Bin = len(unique_Bin)
unique_Bin
n_unique_Bin
#3351248
Q = np.zeros([n_unique_Bin,n_unique_Bin])
--------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-16-581dfaca2eab> in <module>()
----> 1 Q = np.zeros([n_unique_Bin,n_unique_Bin])
MemoryError:
How do I resolve this error? Or, how would I safely convert this huge matrix to a sparse matrix for the further calculation shown below:
for n, employer_employee in enumerate(newEdges):
    #print(employer_employee)
    # copy the array so the original stays intact
    eee = np.copy(newEdges)
    # substitute the current tuple with an empty one to avoid comparing it with itself
    eee[n] = (None, None)
    # get the index for the current employee, the one on the y axis
    employee_index = np.where(employer_employee[0] != unique_Bin)
    # get the indexes where the employees' letters match
    eq_index = np.where(eee[:, 1] == employer_employee[1])[0]
    eq_employee = eee[eq_index, 0]
    # add to the final array Q by index
    for emp in eq_employee:
        #print(np.unique(emp))
        emp_index = np.where(unique_Bin == emp)
        #print(emp)
        Q[employee_index, emp_index] += 1
        # print(Q)
print(Q)
I have 24GB of memory available for this calculation.

Just to point this out: you are trying to create an array which is 3,351,248 x 3,351,248 in size. This is 11,230,863,157,504 entries, over 11.2 trillion! At 8 bytes per float64 entry, a dense array that size would need roughly 90 TB, far more than your 24GB. The fact that you tried to print this makes me think you hadn't realised how big it was. I don't think you will be able to do this without sparse matrices. First of all you should probably make sure you need to do this and see if there is some other way.
Otherwise you can create a sparse matrix using scipy:
import numpy as np
import scipy.sparse
Q = scipy.sparse.csr_matrix((n_unique_Bin, n_unique_Bin), dtype=np.int8)
Then go from there.
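As a follow-up sketch (hedged, with illustrative variable names and a tiny stand-in for newEdges): CSR is awkward to fill entry by entry, so one option is to build the counts in LIL format, which supports item assignment like a dense array, and convert to CSR once it is filled.
import numpy as np
from scipy import sparse

# tiny stand-in for newEdges, just for illustration
newEdges = np.array([['binA', 'year2017_bin'],
                     ['binB', 'year2017_bin'],
                     ['binA', 'year2016_bin']])
unique_Bin = np.unique(newEdges[:, 0])
n_unique_Bin = len(unique_Bin)

# LIL format allows cheap entry-by-entry assignment; only nonzero entries use memory
Q = sparse.lil_matrix((n_unique_Bin, n_unique_Bin), dtype=np.int32)
Q[0, 1] += 1              # increments work the same way as on a dense array
Q_csr = Q.tocsr()         # convert once the counting loop is done
Another option, if the goal is a co-occurrence count, is to map the bins and year labels to integer codes with np.unique(..., return_inverse=True) and let scipy.sparse.coo_matrix sum the duplicate entries for you.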

Related

What is the equivalent of coo_matrix in Matlab?

I'm trying to translate the following lines of code from Python to MATLAB. V, Id, and J are of size (6400,), which in MATLAB are 1-by-6400 row vectors. pts is of size 242.
My Python code
A = coo_matrix((V, (Id, J)), shape=(pts.size, pts.size)).tocsr()
A = A.tobsr(blocksize=(2, 2))
I translated the first line as follows to MATLAB
A = sparse(V,Id,J,242,242);
However, I got the error
Error using sparse
Index into matrix must be an integer.
How can I translate this code to MATLAB?
The MATLAB sparse function has several forms:
S = sparse(A)
S = sparse(m,n)
S = sparse(i,j,v)
S = sparse(i,j,v,m,n)
S = sparse(i,j,v,m,n,nz)
The form you are most likely looking for is the fourth one, S = sparse(i,j,v,m,n), and you will want to call it (for your use case) as:
A = sparse(Id, J, V, 242, 242);
I think your error is that MATLAB wants the i and j indices first, followed by the values, whereas you are passing the values as the first argument.
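As a hedged cross-check on the Python side (tiny made-up vectors, not the original 6400-element ones): both scipy's coo_matrix and MATLAB's sparse sum values that share the same (i, j) index, so the two calls behave the same way; just remember that MATLAB indices are 1-based, so 0-based Python indices need a +1 offset when you move them over.
import numpy as np
from scipy.sparse import coo_matrix

# illustrative stand-ins for Id, J, V
Id = np.array([0, 1, 1])
J = np.array([0, 2, 2])
V = np.array([1.0, 2.0, 3.0])

A = coo_matrix((V, (Id, J)), shape=(3, 3)).tocsr()
print(A.toarray())   # the two entries at (1, 2) are summed to 5.0
# MATLAB equivalent with 1-based indices: A = sparse(Id + 1, J + 1, V, 3, 3);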

Numpy (n, 1, m) to (n,m)

I am working on a problem which involves a batch of 19 tokens, each with 400 features. I get the shape (19, 1, 400) when concatenating two vectors of size (1, 200) into the final feature vector. If I squeeze the 1 out I am left with (19,), but I am trying to get (19, 400). I have tried converting to a list, squeezing and ravelling, but nothing has worked.
Is there a way to convert this array to the correct shape?
def attn_output_concat(sample):
    out_h, state_h = get_output_and_state_history(agent.model, sample)
    attns = get_attentions(state_h)
    inner_outputs = get_inner_outputs(state_h)
    if len(attns) != len(inner_outputs):
        print('Length err')
    else:
        tokens = [np.zeros((400))] * largest
        print(tokens.shape)
        for j, (attns_token, inner_token) in enumerate(zip(attns, inner_outputs)):
            tokens[j] = np.concatenate([attns_token, inner_token], axis=1)
        print(np.array(tokens).shape)
        return tokens
The easiest way would be to declare tokens as a numpy array of shape (19, 400) to start with. That's also more memory- and time-efficient. Here's the relevant portion of your code revised...
import numpy as np

attns_token = np.zeros(shape=(1, 200))
inner_token = np.zeros(shape=(1, 200))
largest = 19
tokens = np.zeros(shape=(largest, 400))
for j in range(largest):
    tokens[j] = np.concatenate([attns_token, inner_token], axis=1)
print(tokens.shape)
BTW... it makes it difficult for people to help you if you don't include a self-contained and runnable segment of code (which is probably why you haven't gotten a response on this yet). Something like the above snippet is preferred and will help you get better answers, because there's less guessing at what you're trying to accomplish.
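As an aside (a minimal sketch, and only valid if the concatenation really produces a regular (19, 1, 400) array rather than a ragged object array): the singleton axis can also be dropped after the fact.
import numpy as np

arr = np.zeros((19, 1, 400))        # stand-in for np.array(tokens)
flat = arr.squeeze(axis=1)          # or arr.reshape(19, 400)
print(flat.shape)                   # (19, 400)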

Multiple Matrix Multiplications with Numpy

I have about 650 csv-based matrices. I plan on loading each one using Numpy as in the following example:
m1 = numpy.loadtxt(open("matrix1.txt", "rb"), delimiter=",", skiprows=1)
There are matrix2.txt, matrix3.txt, ..., matrix650.txt files that I need to process.
My end goal is to multiply each matrix by each other, meaning I don't necessarily have to maintain 650 matrices but rather just 2 (1 ongoing and 1 that I am currently multiplying my ongoing by.)
Here is an example of what I mean with matrices defined from 1 to n: M1, M2, M3, .., Mn.
M1*M2*M3*...*Mn
The dimensions on all the matrices are the same. The matrices are not square. There are 197 rows and 11 columns. None of the matrices are sparse and every cell comes into play.
What is the best/most efficient way to do this in python?
EDIT: I took what was suggested and got it to work by taking the transpose since it isn't a square matrix. As an addendum to the question, is there a way in Numpy to do element by element multiplication?
A Python 3 solution: if "each matrix by each other" actually means just multiplying them in a chain and the matrices have compatible dimensions ((n, m) · (m, o) · (o, p) · ...), which you hint at with "(1 ongoing and 1 that...)", then use (if available):
from functools import partial
fnames = map("matrix{}.txt".format, range(1, 651))
np.linalg.multi_dot(list(map(partial(np.loadtxt, delimiter=',', skiprows=1), fnames)))
or:
from functools import reduce, partial
fnames = map("matrix{}.txt".format, range(1, 651))
matrices = map(partial(np.loadtxt, delimiter=',', skiprows=1), fnames)
res = reduce(np.dot, matrices)
Maps etc. are lazy in Python 3, so with the reduce variant the files are read only as they are needed (multi_dot needs the whole list up front). np.loadtxt doesn't require a pre-opened file; a filename will do.
Doing all the combinations lazily, given that the matrices have the same shape (will do a lot of rereading of data):
from functools import partial
from itertools import starmap, combinations
map_loadtxt = partial(map, partial(np.loadtxt, delimiter=',', skiprows=1))
fname_combs = combinations(map("matrix{}.txt".format, range(1, 651)), 2)
res = list(starmap(np.dot, map(map_loadtxt, fname_combs)))
Using a bit of grouping to reduce reloading of files:
from itertools import groupby, combinations, chain
from functools import partial
from operator import itemgetter
loader = partial(np.loadtxt, delimiter=',', skiprows=1)
fname_pairs = combinations(map("matrix{}.txt".format, range(1, 651)), 2)
groups = groupby(fname_pairs, itemgetter(0))
res = list(chain.from_iterable(
    map(loader(k).dot, map(loader, map(itemgetter(1), g)))
    for k, g in groups
))
Since the matrices are not square, but have the same dimensions, you would have to add transposes before multiplication to match the dimensions. For example either loader(k).T.dot or map(np.transpose, map(loader, ...)).
If on the other hand the question actually was meant to address element wise multiplication, replace np.dot with np.multiply.
Variant 1: nice code, but reads all matrices at once
import itertools
import numpy as np

matrixFileCount = 3
matrices = [np.loadtxt(open("matrix%s.txt" % i), delimiter=",", skiprows=1) for i in range(1, matrixFileCount + 1)]
allC = itertools.combinations([x for x in range(matrixFileCount)], 2)
allCMultiply = [np.dot(matrices[c[0]], matrices[c[1]]) for c in allC]
print(allCMultiply)
Variant 2: only load 2 files at once, nice code but a lot of reloading
allCMultiply = []
fileList = ["matrix%s.txt" % x for x in range(1, matrixFileCount + 1)]
allC = itertools.combinations(fileList, 2)
for c in allC:
    m = [np.loadtxt(open(file), delimiter=",", skiprows=1) for file in c]
    allCMultiply.append(np.dot(m[0], m[1]))
print(allCMultiply)
Variant 3: like the second, but avoids reloading every time; only 2 matrices are in memory at any one point.
Because the combinations created with itertools come out like (1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), you can sometimes avoid loading both matrices of a pair again.
matrixFileCount = 3
allCMultiply = []
mLoaded = {'file': None, 'matrix': None}
fileList = ["matrix%s.txt" % x for x in range(1, matrixFileCount + 1)]
allC = itertools.combinations(fileList, 2)
for c in allC:
    if c[0] == mLoaded['file']:
        # reuse the matrix already in memory, load only the second one
        m = [mLoaded['matrix'], np.loadtxt(open(c[1]), delimiter=",", skiprows=1)]
    else:
        m = [np.loadtxt(open(file), delimiter=",", skiprows=1) for file in c]
        mLoaded = {'file': c[0], 'matrix': m[0]}
    allCMultiply.append(np.dot(m[0], m[1]))
print(allCMultiply)
Performance
If you can load all matrices into memory at once, the first variant is faster than the second, because in the second you reload matrices a lot. The third variant is slower than the first but faster than the second, because it sometimes avoids reloading a matrix.
0.943613052368 (Variant 1: 10 matrices of size 2x2, 1000 executions)
7.75622487068 (Variant 2: 10 matrices of size 2x2, 1000 executions)
4.83783197403 (Variant 3: 10 matrices of size 2x2, 1000 executions)
Kordi's answer loads all of the matrices before doing the multiplication. And that's fine if you know the matrices are going to be small. If you want to conserve memory, however, I'd do the following:
from functools import reduce
import numpy as np

def get_dot_product(fnames):
    assert len(fnames) > 0
    accum_val = np.loadtxt(fnames[0], delimiter=',', skiprows=1)
    return reduce(_product_from_file, fnames[1:], accum_val)

def _product_from_file(running_product, fname):
    return running_product.dot(np.loadtxt(fname, delimiter=',', skiprows=1))
If the matrices are large and irregular in shape (not square), there are also optimization algorithms for determining the optimal associative groupings (i.e., where to put the parentheses), but in most cases I doubt it would be worth the overhead of loading and unloading each file twice, once to figure out the associative groupings and then once to carry it out. NumPy is surprisingly fast even on pretty big matrices.
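For what it's worth (a minimal sketch with stand-in shapes, not the asker's data): np.linalg.multi_dot already performs this kind of ordering optimization for matrices that are in memory, choosing the cheapest parenthesization from the shapes alone.
import numpy as np

A = np.ones((197, 11))
B = np.ones((11, 197))
C = np.ones((197, 11))

# equivalent to A @ B @ C, but the multiplication order is chosen to minimise the work
result = np.linalg.multi_dot([A, B, C])
print(result.shape)   # (197, 11)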
How about a really simple solution avoiding map, reduce and the like? The default numpy array object does element-wise multiplication with the * operator.
import numpy

size = (197, 11)
result = numpy.ones(size)
for i in range(1, 651):
    result *= numpy.loadtxt(open("matrix{}.txt".format(i), "rb"),
                            delimiter=",", skiprows=1)

applying healpy mask to array of maps

I have a series of maps with two different indices, i and j. Let this be indexed like map_series[i][j].
EDIT 1/21: A minimal working example would be something like
map_series=np.array([np.array([np.arange(12) + 0.1*(i+1) + 0.01*(j+1) for j in range(3)]) for i in range(5)])
I'd like to apply the same mask to each map; if map_series were one-dimensional, each of the approaches below would work.
I can imagine a few different ways of applying these maps:
(A) Applying the mask to the whole array:
map_series_ma = hp.ma(map_series)
map_series_ma.mask = predefined_mask
(B1) Applying the mask to each element of the array:
map_series_ma = np.zeros_like(map_series)
for i in range(len(map_series)):
    for j in range(len(map_series[0])):
        temp = hp.ma(map_series[i][j])
        temp.mask = predefined_mask
        map_series_ma[i][j] = temp
(B2) Applying the mask to each element of the array:
map_series_ma = np.zeros_like(map_series)
for i in range(len(map_series)):
    for j in range(len(map_series[0])):
        map_series_ma[i][j] = hp.ma(map_series[i][j])
        map_series_ma[i][j].mask = predefined_mask
(C) Pythonically enumerating the list:
map_series_ma = np.array([hp.ma(map_series[i][j]) for j in range(j_max) for i in range(i_max)])
map_series_ma.mask = predetermined_mask
All of these fail to give my desired output, however.
Upon trying (A) or (C) I get an error after the first step, telling me TypeError: bad number of pixels.
Upon trying (B1) I don't get an error, but none of the elements of map_series_ma have masks; in fact, they do not even appear to be hp.ma objects. Oddly enough, though, when I return temp it does have the appropriate mask.
Upon trying (B2) I get the error
AttributeError: 'numpy.ndarray' object has no attribute 'mask' (which, after looking at my syntax, I totally understand!)
I'm a little confused how to go about this. Both (A) and (B1) seem acceptable to me...
Any help is much appreciated,
Thanks,
Sam
this works for me:
import numpy as np
import healpy as hp
map_series=np.array([np.array([np.arange(12) + 0.1*(i+1) + 0.01*(j+1) for j in range(3)]) for i in range(5)])
map_series_ma = map(lambda x: hp.ma(x), map_series)
pm=[True, True,True,True,True,True,False,False,False,False,False,False]
for m in map_series_ma:
    for mm in m:
        mm.mask = pm
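One hedged caveat worth adding: on Python 3, map() returns a one-shot iterator, so map_series_ma would be exhausted after the masking loop. A list comprehension keeps the masked maps around for later use; a minimal variant of the same idea:
import numpy as np
import healpy as hp

map_series = np.array([np.array([np.arange(12) + 0.1*(i+1) + 0.01*(j+1) for j in range(3)]) for i in range(5)])
pm = [True]*6 + [False]*6

# nested list of masked maps that survives repeated iteration
map_series_ma = [[hp.ma(m) for m in row] for row in map_series]
for row in map_series_ma:
    for mm in row:
        mm.mask = pm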

Can vtk InsertValue take float arguments?

I have a question about InsertValue.
If I understand correctly, it only takes integer arguments. I was wondering if there is a way to have it take float values? Or maybe some other function that does the job of InsertValue but takes float values? I know there is InsertNextValue, but I am not sure it'll be efficient in my case since my array is a very big array (~ 100.000 by 120).
Below is my code; in it I am making the entries of fl integer values to make it work for now, but ideally it would be great if I didn't have to do that.
Thanks in advance :)
import vtk
import math
from vtk import vtkStructuredGrid, vtkPoints, vtkFloatArray, vtkXMLStructuredGridWriter
import scipy.io
import numpy
import os
#loading the matlab files
mats = scipy.io.loadmat('/home/lusine/data/3DDA/donut_for_vtk/20130228_050000_3D_E=1.mat')
#x,y,z coordinate, fl flux values
xx = mats['xvect']
yy = mats['yvect']
zz = mats['zvect']
fl = mats['fluxmesh3d'] #3d matrix
nx = xx.shape[1]
ny = yy.shape[1]
nz = zz.shape[1]
fl = numpy.nan_to_num(fl)
inx = numpy.nonzero(fl)
l = len(inx[1])
grid = vtk.vtkStructuredGrid()
grid.SetDimensions(nx,ny,nz) # sets the dimensions of the grid
pts = vtk.vtkPoints() # represents 3D points, The data model for vtkPoints is an array of vx-vy-vz triplets accessible by (point or cell) id.
pts.SetNumberOfPoints(nx*ny*nz) # Specify the number of points for this object to hold.
p = 0
for i in range(l):
    pts.InsertPoint(p, xx[0][inx[0][i]], yy[0][inx[1][i]], zz[0][inx[2][i]])
    p = p + 1
SetPoint()
grid.SetPoints(pts)
cdata = vtk.vtkFloatArray()
cdata.SetNumberOfComponents(1)
cdata.SetNumberOfTuples((nx-1)*(ny-1)*(nz-1))
cdata.SetName('cellData')
p = 0
for i in range(l-1):
    cdata.InsertValue(p, inx[0][i] + inx[1][i] + inx[2][i])
    p = p + 1
grid.GetCellData().SetScalars(cdata)
pdata = vtk.vtkFloatArray()
pdata.SetNumberOfComponents(1)
#Get the number of tuples (a component group) in the array
pdata.SetNumberOfTuples(nx*ny*nz)
#Sets the array name
pdata.SetName('pointData')
for i in range(l):
    pdata.InsertValue(int(fl[inx[0][i]][inx[1][i]][inx[2][i]]), inx[0][i] + inx[1][i] + inx[2][i])
grid.GetPointData().SetScalars(pdata)
writer = vtk.vtkXMLStructuredGridWriter()
writer.SetFileName('new_grid.vts')
#writer.SetInput(grid)
writer.SetInputData(grid)
writer.Update()
print('end')
The first argument of InsertValue requires an integer because it's the index where the value is going to be inserted. If instead of a vtkFloatArray pdata you had a numpy array called p, this would be the equivalent of your instruction:
pdata.InsertValue(a,b) becomes p[a]=b
p[0.1] wouldn't make sense; a must be an integer!
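In other words (a minimal hedged example, not from the original post): for a vtkFloatArray only the index needs to be an integer; the value you store can be a float.
import vtk

pdata = vtk.vtkFloatArray()
pdata.SetNumberOfComponents(1)

# first argument: integer index; second argument: the stored value, which may be a float
pdata.InsertValue(0, 3.14)
print(pdata.GetValue(0))   # prints 3.14 (up to float32 precision)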
But I am a bit lost on the data. What do you mean that your array is ~ 100.000 by 120... do you have 100.000 points, and each point has a vector of 120 components? In such a case, your pdata should have 120 components, and for each point point_index you call
pdata.SetTuple(point_index, [v0, v1, ..., v119])
or
pdata.SetComponent(point_index, 0, v0)
...
pdata.SetComponent(point_index, 119, v119)
If not, are you sure that you have to index pdata based on fl values? (You have to be sure that fl is an int, 0 <= fl < ntuples, and that you are not going to have holes.) Check if you can do the same thing that you do for cdata (btw, in your code p is always equal to i, so you can just use i).
It's also possible to copy a numpy array directly to vtk, see http://vtk.1045678.n5.nabble.com/vtk-to-numpy-how-to-get-a-vtk-array-tp1244891p1244895.html, but you have to be very careful with the structure of your data.
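A hedged sketch of that numpy-to-VTK route, using the numpy_support helper shipped with VTK's Python bindings (the data below is made up, standing in for the flattened flux values):
import numpy as np
import vtk
from vtk.util import numpy_support

# illustrative flat float array standing in for the flattened flux values
values = np.ascontiguousarray(np.random.rand(10).astype(np.float32))

# deep=True copies the data, so the VTK array stays valid even if the numpy array is freed
pdata = numpy_support.numpy_to_vtk(values, deep=True, array_type=vtk.VTK_FLOAT)
pdata.SetName('pointData')
print(pdata.GetNumberOfTuples())   # 10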
