Multiple Matrix Multiplications with Numpy - python

I have about 650 csv-based matrices. I plan on loading each one using Numpy as in the following example:
m1 = numpy.loadtxt(open("matrix1.txt", "rb"), delimiter=",", skiprows=1)
There are matrix2.txt, matrix3.txt, ..., matrix650.txt files that I need to process.
My end goal is to multiply each matrix by each other, meaning I don't necessarily have to maintain 650 matrices but rather just 2 (1 ongoing and 1 that I am currently multiplying my ongoing by.)
Here is an example of what I mean with matrices defined from 1 to n: M1, M2, M3, .., Mn.
M1*M2*M3*...*Mn
The dimensions of all the matrices are the same. The matrices are not square: there are 197 rows and 11 columns. None of the matrices are sparse and every cell comes into play.
What is the best/most efficient way to do this in python?
EDIT: I took what was suggested and got it to work by taking the transpose since it isn't a square matrix. As an addendum to the question, is there a way in Numpy to do element by element multiplication?

A Python 3 solution: if "each matrix by each other" actually means multiplying them in a chain, and the matrices have compatible dimensions ((n, m) · (m, o) · (o, p) · ...), which you hint at with "(1 ongoing and 1 that...)", then use (if available; note that multi_dot needs a real sequence, hence the list() below):
import numpy as np
from functools import partial
fnames = map("matrix{}.txt".format, range(1, 651))
np.linalg.multi_dot(list(map(partial(np.loadtxt, delimiter=',', skiprows=1), fnames)))
or:
from functools import reduce, partial
fnames = map("matrix{}.txt".format, range(1, 651))
matrices = map(partial(np.loadtxt, delimiter=',', skiprows=1), fnames)
res = reduce(np.dot, matrices)
map objects etc. are lazy in Python 3, so files are read as needed. np.loadtxt doesn't require a pre-opened file; a filename will do.
Doing all the combinations lazily, given that the matrices have the same shape (will do a lot of rereading of data):
from functools import partial
from itertools import starmap, combinations
map_loadtxt = partial(map, partial(np.loadtxt, delimiter=',', skiprows=1))
fname_combs = combinations(map("matrix{}.txt".format, range(1, 651)), 2)
res = list(starmap(np.dot, map(map_loadtxt, fname_combs)))
Using a bit of grouping to reduce reloading of files:
from itertools import groupby, combinations, chain
from functools import partial
from operator import itemgetter
loader = partial(np.loadtxt, delimiter=',', skiprows=1)
fname_pairs = combinations(map("matrix{}.txt".format, range(1, 651)), 2)
groups = groupby(fname_pairs, itemgetter(0))
res = list(chain.from_iterable(
    map(loader(k).dot, map(loader, map(itemgetter(1), g)))
    for k, g in groups
))
Since the matrices are not square, but have the same dimensions, you would have to add transposes before multiplication to match the dimensions. For example either loader(k).T.dot or map(np.transpose, map(loader, ...)).
If, on the other hand, the question was actually about element-wise multiplication, replace np.dot with np.multiply.
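A small sketch of both remarks, with two made-up 197x11 arrays standing in for loaded matrices:
import numpy as np
a = np.random.rand(197, 11)      # stand-ins for two loaded matrices
b = np.random.rand(197, 11)
pair_dot = a.T.dot(b)            # transpose the left operand: (11, 197) @ (197, 11) -> (11, 11)
elementwise = np.multiply(a, b)  # element-wise product, same as a * b, keeps shape (197, 11)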

Variant 1: concise code, but reads all matrices at once
import itertools
import numpy as np
matrixFileCount = 3
matrices = [np.loadtxt("matrix%s.txt" % i, delimiter=",", skiprows=1)
            for i in range(1, matrixFileCount + 1)]
allC = itertools.combinations(range(matrixFileCount), 2)
allCMultiply = [np.dot(matrices[c[0]], matrices[c[1]]) for c in allC]
print(allCMultiply)
Variant 2: only loads 2 files at a time; concise code, but a lot of reloading
allCMultiply = []
fileList = ["matrix%s.txt" % x for x in range(1, matrixFileCount + 1)]
allC = itertools.combinations(fileList, 2)
for c in allC:
    m = [np.loadtxt(f, delimiter=",", skiprows=1) for f in c]
    allCMultiply.append(np.dot(m[0], m[1]))
print(allCMultiply)
Variant 3: like the second, but avoids some reloading while keeping only 2 matrices in memory at a time
Because the combinations created with itertools come out ordered, like (1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), you can sometimes avoid reloading the first matrix of a pair.
matrixFileCount = 3
allCMultiply = []
mLoaded = {'file': None, 'matrix': None}
fileList = ["matrix%s.txt" % x for x in range(1, matrixFileCount + 1)]
allC = itertools.combinations(fileList, 2)
for c in allC:
    if c[0] == mLoaded['file']:
        # the left matrix of this pair is still cached from the previous iteration
        m = [mLoaded['matrix'], np.loadtxt(c[1], delimiter=",", skiprows=1)]
    else:
        m = [np.loadtxt(f, delimiter=",", skiprows=1) for f in c]
        mLoaded = {'file': c[0], 'matrix': m[0]}
    allCMultiply.append(np.dot(m[0], m[1]))
print(allCMultiply)
Performance
If you can load all matrices into memory at once, the first variant is faster than the second, because the second reloads matrices a lot. The third variant is slower than the first but faster than the second, because it sometimes avoids reloading a matrix.
0.943613052368 (Variant 1: 10 matrices of shape (2, 2), 1000 executions)
7.75622487068 (Variant 2: 10 matrices of shape (2, 2), 1000 executions)
4.83783197403 (Variant 3: 10 matrices of shape (2, 2), 1000 executions)
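For what it's worth, a minimal sketch of how such a comparison could be reproduced with timeit, using small random in-memory stand-ins instead of the csv files (so it only times the multiplication part of Variant 1):
import itertools
import timeit
import numpy as np
matrices = [np.random.rand(2, 2) for _ in range(10)]   # stand-ins for 10 loaded files
def variant1_in_memory():
    pairs = itertools.combinations(range(len(matrices)), 2)
    return [np.dot(matrices[i], matrices[j]) for i, j in pairs]
print(timeit.timeit(variant1_in_memory, number=1000))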

Kordi's answer loads all of the matrices before doing the multiplication. And that's fine if you know the matrices are going to be small. If you want to conserve memory, however, I'd do the following:
from functools import reduce
import numpy as np
def get_dot_product(fnames):
    assert len(fnames) > 0
    accum_val = np.loadtxt(fnames[0], delimiter=',', skiprows=1)
    # reduce takes the initial value as a positional argument
    return reduce(_product_from_file, fnames[1:], accum_val)
def _product_from_file(running_product, fname):
    return running_product.dot(np.loadtxt(fname, delimiter=',', skiprows=1))
If the matrices are large and irregular in shape (not square), there are also optimization algorithms for determining the optimal associative grouping (i.e., where to put the parentheses). In most cases, though, I doubt it would be worth the overhead of loading and unloading each file twice, once to figure out the grouping and once to carry it out. NumPy is surprisingly fast even on pretty big matrices.
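For the chained-product reading of the question, numpy already provides such an optimizer: np.linalg.multi_dot chooses the cheapest parenthesization before multiplying. A minimal sketch with random stand-in matrices of compatible (non-square) shapes:
import numpy as np
a = np.random.rand(197, 11)    # stand-ins with compatible, non-square shapes
b = np.random.rand(11, 50)
c = np.random.rand(50, 3)
# multi_dot picks the association order, (a @ b) @ c vs a @ (b @ c),
# that minimizes the number of scalar multiplications
result = np.linalg.multi_dot([a, b, c])
print(result.shape)            # (197, 3)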

How about a really simple solution avoiding map, reduce and the like? NumPy arrays multiply element-wise with the * operator, so a plain loop with an accumulator works:
import numpy
size = (197, 11)
result = numpy.ones(size)
for i in range(1, 651):
    result *= numpy.loadtxt("matrix{}.txt".format(i),
                            delimiter=",", skiprows=1)

Related

Numpy (n, 1, m) to (n,m)

I am working on a problem which involves a batch of 19 tokens each with 400 features. I get the shape (19,1,400) when concatenating two vectors of size (1, 200) into the final feature vector. If I squeeze the 1 out I am left with (19,) but I am trying to get (19,400). I have tried converting to list, squeezing and raveling but nothing has worked.
Is there a way to convert this array to the correct shape?
def attn_output_concat(sample):
    out_h, state_h = get_output_and_state_history(agent.model, sample)
    attns = get_attentions(state_h)
    inner_outputs = get_inner_outputs(state_h)
    if len(attns) != len(inner_outputs):
        print('Length err')
    else:
        tokens = [np.zeros((400))] * largest
        print(tokens.shape)
        for j, (attns_token, inner_token) in enumerate(zip(attns, inner_outputs)):
            tokens[j] = np.concatenate([attns_token, inner_token], axis=1)
        print(np.array(tokens).shape)
        return tokens
The easiest way would be to declare tokens as a NumPy array of shape (19, 400) to start with. That's also more memory- and time-efficient. Here's the relevant portion of your code, revised:
import numpy as np
attns_token = np.zeros(shape=(1, 200))
inner_token = np.zeros(shape=(1, 200))
largest = 19
tokens = np.zeros(shape=(largest, 400))
for j in range(largest):
    tokens[j] = np.concatenate([attns_token, inner_token], axis=1)
print(tokens.shape)
BTW... it makes it difficult for people to help you if you don't include a self-contained and runnable segment of code (which is probably why you haven't gotten a response on this yet). Something like the above snippet is preferred and will help you get better answers, because there's less guessing at what you're trying to accomplish.
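For completeness, if you already have the (19, 1, 400) array from your original approach and just want to drop the singleton axis, a minimal sketch (the array here is a made-up stand-in):
import numpy as np
batch = np.zeros((19, 1, 400))     # stand-in for the concatenated (19, 1, 400) result
flat = np.squeeze(batch, axis=1)   # drops only the length-1 axis -> (19, 400)
# equivalently: flat = batch.reshape(19, 400)
print(flat.shape)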

How do I compress a matrix to avoid a memory error?

Following is the numpy array I have. I need to create a matrix of zeros (as with, for instance, np.zeros([1, 1])).
newEdges =
array([['0', 'Firm'],
['1', 'Firm'],
['2', 'Firm'],
...,
['binA', 'year2017_bin'],
['binA', 'year2017_bin'],
['binA', 'year2017_bin']],
dtype='<U21')
newEdges.shape
#(63673218, 2)
newEdges.size
#127346436
However, given the size of my array (as you can see above, (63673218, 2)), when I run the code to generate the zeros matrix I get a MemoryError.
Here is the full code:
print(newEdges)
unique_Bin = np.unique(newEdges[:,0])
n_unique_Bin = len(unique_Bin)
unique_Bin
n_unique_Bin
#3351248
Q = np.zeros([n_unique_Bin,n_unique_Bin])
--------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-16-581dfaca2eab> in <module>()
----> 1 Q = np.zeros([n_unique_Bin,n_unique_Bin])
MemoryError:
How do I resolve this error? Or, how would I safely convert this huge matrix to a sparse matrix for further calculation done below:
for n, employer_employee in enumerate(newEdges):
    #print(employer_employee)
    #copy the array so the original stays intact
    eee = np.copy(newEdges)
    #substitute the current tuple with an empty one to avoid comparing it with itself
    eee[n] = (None, None)
    #get the index of the current employee, the one on the y axis
    employee_index = np.where(employer_employee[0] != unique_Bin)
    #get the indices where the employee letters match
    eq_index = np.where(eee[:, 1] == employer_employee[1])[0]
    eq_employee = eee[eq_index, 0]
    #add to the final array Q by index
    for emp in eq_employee:
        #print(np.unique(emp))
        emp_index = np.where(unique_Bin == emp)
        #print(emp)
        Q[employee_index, emp_index] += 1
        # print(Q)
print(Q)
I have 24GB left in the memory for this calculation.
Just to point this out: you are trying to create an array which is 3,351,248 x 3,351,248 in size. That is 11,230,863,157,504 entries, i.e. 11.2 trillion! The fact that you tried to print this makes me think you hadn't realised how big it was. I don't think you will be able to do this without sparse matrices. First of all, you should probably make sure you actually need to do this and see if there is some other way.
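For scale, a rough back-of-the-envelope estimate of what a dense float64 array of that size would need (just arithmetic, not something to run against real data):
n = 3351248                   # number of unique bins
entries = n * n               # about 1.12e13 entries
bytes_needed = entries * 8    # 8 bytes per float64
print(bytes_needed / 1e12)    # roughly 90 TB, far beyond the 24 GB of RAM available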
Otherwise, you can create a sparse matrix using SciPy:
import numpy as np
import scipy.sparse
Q = scipy.sparse.csr_matrix((n_unique_Bin, n_unique_Bin), dtype=np.int8)
Then go from there.
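One caveat, based on my reading of your update loop: incrementing individual entries of a csr_matrix (Q[i, j] += 1) is slow and SciPy will warn about it. Formats designed for incremental construction, such as dok_matrix or lil_matrix, handle that pattern better and can be converted to CSR afterwards. A minimal sketch:
import numpy as np
from scipy.sparse import dok_matrix
n_unique_Bin = 3351248
Q = dok_matrix((n_unique_Bin, n_unique_Bin), dtype=np.int8)
# cheap incremental updates while counting co-occurrences
# (note that int8 overflows above 127; use a wider dtype if counts can be larger)
Q[0, 1] += 1
Q[5, 7] += 1
# convert once at the end for fast arithmetic and row slicing
Q_csr = Q.tocsr()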

python sparse matrix creation parallelize to speed up

I am creating a sparse matrix file by extracting the features from an input file. Each row of the input file contains one film ID, followed by some feature IDs and each feature's score.
6729792 4:0.15568 8:0.198796 9:0.279261 13:0.17829 24:0.379707
The first number is the ID of the film; the value to the left of each colon is a feature ID and the value to the right is the score of that feature.
Each line represents one film, and the number of feature:score pairs varies from one film to another.
Here is how I construct my sparse matrix:
import sys
import os
import os.path
import time
import json
import numpy as np
import tables as tb
from Film import Film
import scipy
from scipy.sparse import coo_matrix, csr_matrix, rand
def sparseCreate(self, Debug):
    a = rand(self.total_rows, self.total_columns, format='csr')
    l, m = a.shape[0], a.shape[1]
    f = tb.open_file("sparseFile.h5", 'w')
    filters = tb.Filters(complevel=5, complib='blosc')
    data_matrix = f.create_carray(f.root, 'data', tb.Float32Atom(), shape=(l, m), filters=filters)
    index_film = 0
    input_data = open('input_file.txt', 'r')
    for line in input_data:
        my_line = np.array(line.split())
        id_film = my_line[0]
        my_line = np.core.defchararray.split(my_line[1:], ":")
        self.data_matrix_search_normal[str(id_film)] = index_film
        self.data_matrix_search_reverse[index_film] = str(id_film)
        for element in my_line:
            if int(element[0]) in self.selected_features:
                column = self.index_selected_feature[str(element[0])]
                data_matrix[index_film, column] = float(element[1])
        index_film += 1
    self.selected_matrix = data_matrix
    json.dump(self.data_matrix_search_reverse,
              open(os.path.join(self.output_path, "data_matrix_search_reverse.json"), 'wb'),
              sort_keys=True, indent=4)
    my_films = Film(
        self.selected_matrix, self.data_matrix_search_reverse, self.path_doc, self.output_path)
    x_matrix_unique = self.selected_matrix[:, :]
    r_matrix_unique = np.asarray(x_matrix_unique)
    f.close()
    return my_films
Question:
I feel that this function is too slow on big datasets, and it takes too long to run.
How can I improve and accelerate it, maybe using MapReduce? What is wrong with this function that makes it so slow?
IO + conversions (from str, to str, even converting the same variable to str twice, etc.) + splits + explicit loops. By the way, there is the csv module in the standard library which could be used to parse your input file; you can experiment with it (I suppose you use space as the delimiter). I also see that you convert element[0] to int/str repeatedly, which is bad: you create many temporary objects. If you call this function several times, you may try to reuse some internal objects (the array?). You could also try to implement it in another style, with map or a list comprehension, but experiments are needed...
The general idea of Python code optimization is to avoid executing explicit Python bytecode and to prefer native/C Python functions (for anything). Also try to get rid of as many conversions as possible. And if the input file is yours, you could format it with fixed-length fields; that would let you avoid splitting/parsing entirely (only string indexing).
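As a (hypothetical) sketch of what that advice can look like for this input format, parsing each line with plain string methods instead of numpy string routines and repeated conversions; selected_features is a made-up stand-in and the file name is taken from the code above:
# assumes space-separated lines of the form "film_id feat:score feat:score ..."
selected_features = {4, 8, 9, 13, 24}      # hypothetical set of feature IDs to keep
films = []
with open('input_file.txt') as fh:
    for line in fh:
        parts = line.split()
        film_id = parts[0]
        scores = {}
        for pair in parts[1:]:
            feat, score = pair.split(':')
            feat = int(feat)
            if feat in selected_features:
                scores[feat] = float(score)
        films.append((film_id, scores))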

Numpy: Efficient access to sub-arrays generated by numpy.split

I have the following code that generates a list of sub-arrays based on the split function. Here, I just compare the first value of each tuple and, based on the difference between consecutive values, generate the sub-arrays. So far so good.
import numpy as np
f = np.genfromtxt("d_n_isogro_ms.txt", names=True, dtype=None, usecols=(1,-1))
dm = np.absolute(np.diff(f['mz']))
pos = np.where(dm > 2)[0] + 1
fsplit = np.array_split(f, pos)
This is what the sample input looks like (only an excerpt):
[(270.0332, 472) (271.0376, 1936) (272.0443, 11188) (273.0495, 65874)
(274.0517, 8582) (275.0485, 4081) (276.0523, 659) (286.058, 1078)
(287.0624, 4927) (288.0696, 22481) (289.0757, 84001) (290.078, 13688)
(291.0746, 5402) (430.1533, 13995) (431.1577, 2992) (432.1685, 504)]
<type 'numpy.ndarray'>
The split positions for this particular data are then computed as:
pos = [7, 13]
And here is my sample output:
[array([(270.0332, 472), (271.0376, 1936), (272.0443, 11188),
(273.0495, 65874), (274.0517, 8582), (275.0485, 4081),
(276.0523, 659)], dtype=[('mz', '<f8'), ('I', '<i8')]),
array([(286.058, 1078), (287.0624, 4927), (288.0696, 22481),
(289.0757, 84001), (290.078, 13688), (291.0746, 5402)],
dtype=[('mz', '<f8'), ('I', '<i8')]),
array([(430.1533, 13995),
(431.1577, 2992), (432.1685, 504)],
dtype=[('mz', '<f8'), ('I', '<i8')])]
I would like to perform the weighted average on each of the arrays. Is there an efficient way of doing this with numpy? I basically fail with the indexing. Preferably, I would like to use the dtype to identify weights and numbers.
Maybe one could even do the whole operation on the fly?
Thank you very much for your help in advance.
Best,
Christian
The output of np.array_split is a Python list containing arrays of unequal lengths. The best you can do in that case is a Python loop:
result = [np.average(f_i['mz'], weights=f_i['I']) for f_i in fsplit]
But it is possible to come up with a completely vectorized solution, by using add.reduceat instead of array_split:
dm = np.abs(np.diff(f['mz']))
pos = np.flatnonzero(np.r_[True, dm > 2])
totals = np.add.reduceat(f['mz']*f['I'], pos)
counts = np.add.reduceat(f['I'], pos)
result = totals / counts
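A quick sanity check that the two approaches agree, on a small made-up structured array with the same dtype as in the question:
import numpy as np
# small made-up structured array; big jumps in 'mz' mark the group boundaries
f = np.array([(270.0, 472), (271.0, 1936), (286.1, 1078), (287.1, 4927)],
             dtype=[('mz', '<f8'), ('I', '<i8')])
dm = np.abs(np.diff(f['mz']))
pos = np.flatnonzero(np.r_[True, dm > 2])
# pos includes a leading 0, which array_split does not need, hence pos[1:]
loop_result = [np.average(g['mz'], weights=g['I'])
               for g in np.array_split(f, pos[1:])]
vec_result = np.add.reduceat(f['mz'] * f['I'], pos) / np.add.reduceat(f['I'], pos)
print(np.allclose(loop_result, vec_result))   # expected: True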

Can vtk InsertValue take float arguments?

I have a question about InsertValue
If I understand correctly, it only takes integer arguments. I was wondering if there is a way to have it take float values, or maybe some other function that does the job of InsertValue but takes float values? I know there is InsertNextValue, but I am not sure it will be efficient in my case since my array is very big (~100.000 by 120).
Below is my code. In it I am making the entries of fl integer values to make it work for now, but ideally it would be great if I didn't have to do that.
Thanks in advance :)
import vtk
import math
from vtk import vtkStructuredGrid, vtkPoints, vtkFloatArray, vtkXMLStructuredGridWriter
import scipy.io
import numpy
import os
#loading the matlab files
mats = scipy.io.loadmat('/home/lusine/data/3DDA/donut_for_vtk/20130228_050000_3D_E=1.mat')
#x,y,z coordinates, fl flux values
xx = mats['xvect']
yy = mats['yvect']
zz = mats['zvect']
fl = mats['fluxmesh3d'] #3d matrix
nx = xx.shape[1]
ny = yy.shape[1]
nz = zz.shape[1]
fl = numpy.nan_to_num(fl)
inx = numpy.nonzero(fl)
l = len(inx[1])
grid = vtk.vtkStructuredGrid()
grid.SetDimensions(nx,ny,nz) # sets the dimensions of the grid
pts = vtk.vtkPoints() # represents 3D points; the data model for vtkPoints is an array of vx-vy-vz triplets accessible by (point or cell) id
pts.SetNumberOfPoints(nx*ny*nz) # specify the number of points for this object to hold
p = 0
for i in range(l):
    pts.InsertPoint(p, xx[0][inx[0][i]], yy[0][inx[1][i]], zz[0][inx[2][i]])
    p = p + 1
grid.SetPoints(pts)
cdata = vtk.vtkFloatArray()
cdata.SetNumberOfComponents(1)
cdata.SetNumberOfTuples((nx-1)*(ny-1)*(nz-1))
cdata.SetName('cellData')
p = 0
for i in range(l-1):
    cdata.InsertValue(p, inx[0][i]+inx[1][i]+inx[2][i])
    p = p + 1
grid.GetCellData().SetScalars(cdata)
pdata = vtk.vtkFloatArray()
pdata.SetNumberOfComponents(1)
#Get the number of tuples (a component group) in the array
pdata.SetNumberOfTuples(nx*ny*nz)
#Sets the array name
pdata.SetName('pointData')
for i in range(l):
    pdata.InsertValue(int(fl[inx[0][i]][inx[1][i]][inx[2][i]]), inx[0][i]+inx[1][i]+inx[2][i])
grid.GetPointData().SetScalars(pdata)
writer = vtk.vtkXMLStructuredGridWriter()
writer.SetFileName('new_grid.vts')
#writer.SetInput(grid)
writer.SetInputData(grid)
writer.Update()
print('end')
The first argument of InsertValue requires an integer because it's the index where the value is going to be inserted. If instead of a vtkFloatArray pdata you had a numpy array called p, this would be the equivalent of your instruction:
pdata.InsertValue(a,b) becomes p[a]=b
p[0.1] wouldn't make sense; a must be an integer!
But I am a bit lost on the data. What do you mean that your array is ~100.000 by 120? Do you have 100.000 points, each with a vector of 120 components? In such a case, your pdata should have 120 components, and for each point point_index you call
pdata.SetTuple(point_index, [v0, v1, ..., v119])
or
pdata.SetComponent(point_index, 0, v0)
...
pdata.SetComponent(point_index, 119, v119)
If not, are you sure that you have to index pdata based on fl values? (You have to be sure that fl is an int, 0 <= fl < ntuples, and that you are not going to have holes.) Check whether you can do the same thing that you do for cdata (by the way, in your code p is always equal to i, so you can just use i).
It's also possible to copy a numpy array directly to vtk; see http://vtk.1045678.n5.nabble.com/vtk-to-numpy-how-to-get-a-vtk-array-tp1244891p1244895.html, but you have to be very careful with the structure of your data.
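For reference, that conversion usually goes through vtk.util.numpy_support; a minimal sketch (the array here is a made-up stand-in with one value per grid point, and deep=True copies the data so the numpy array can be freed afterwards):
import numpy
from vtk.util import numpy_support
values = numpy.zeros(10 * 10 * 10, dtype=numpy.float32)   # stand-in: one scalar per point
vtk_array = numpy_support.numpy_to_vtk(values, deep=True)
vtk_array.SetName('pointData')
# grid.GetPointData().SetScalars(vtk_array)   # then attach to the structured grid as before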
