I am working with multiple 2-dimensional NumPy arrays (matrices), and I want to extract some rows or columns from them (the same row or column indices for each of the 3 matrices, each time). I was wondering whether I should use a dictionary or not.
If I do it with a dictionary, then each row of each matrix would be indexed by a word and would be a list of values that interest me. E.g., myDict['word'] would contain [1 5 2 49 0 2].
If I do it with an array myArray, for each i I would have an array contained within myArray[i]. E.g., myArray[5] would contain array([[1 2 4 9 1 23]]).
On these I need to implement basic get operations (get rows or get columns) and some matrix multiplications, but never sorting or insertions.
I know I can do it both ways; my question is mainly about performance. Which do you think would be faster and simpler?
Thanks a lot!
For matrix operations, I strongly recommend NumPy. To justify my choice, I first want to quote Wikipedia:
http://en.wikipedia.org/wiki/NumPy
"... any algorithm that can be expressed primarily as operations on arrays and matrices can run almost as quickly as the equivalent C code."
Besides that, I notice that you want matrix multiplication. NumPy provides that too, and in an efficient way.
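As a minimal sketch of the operations described in the question (the matrix contents, shapes, and index lists below are made up for illustration), the same row or column indices can be pulled from several arrays with integer-array indexing, and `@` performs matrix multiplication:

```python
import numpy as np

# Three hypothetical 4x6 matrices standing in for the asker's data.
rng = np.random.default_rng(0)
matrices = [rng.integers(0, 50, size=(4, 6)) for _ in range(3)]

row_idx = [0, 2]      # rows of interest (assumed)
col_idx = [1, 3, 5]   # columns of interest (assumed)

# Integer-array indexing returns copies of the selected rows / columns.
rows = [m[row_idx, :] for m in matrices]   # each has shape (2, 6)
cols = [m[:, col_idx] for m in matrices]   # each has shape (4, 3)

# Matrix multiplication: (2, 6) @ (6, 4) -> (2, 4)
product = rows[0] @ matrices[1].T
```

If rows really have to be addressed by a word, one option is to keep a small dict that maps each word to its row index and still store the numbers in arrays.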
I have an N x M matrix.
N is large (N >> 10000).
I wonder if there is an algorithm to shuffle all the rows of the matrix to get, for example, 100 matrices. The resulting matrices C must not be identical.
Thoughts?
So, do you want to keep the shape of the matrix and just shuffle the rows or do you want to get subsets of the matrix?
For the first case, I think the permutation function from NumPy could be your choice. Just create a permutation of an index list, as Souin proposed.
For the second case, just use NumPy's choice function (also from the random module) without replacement, if I understood your needs correctly.
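Both cases could be sketched as follows. This is only an illustration: the small matrix, the seed, and the subset size k are assumptions, and it uses the `Generator` API from `np.random.default_rng`.

```python
import numpy as np

rng = np.random.default_rng(42)      # seed chosen only for reproducibility
M = np.arange(20).reshape(10, 2)     # small stand-in for the N x M matrix

# Case 1: same shape, rows shuffled. Each call draws a new row order.
shuffled = [M[rng.permutation(M.shape[0])] for _ in range(3)]

# Case 2: a subset of k distinct rows, sampled without replacement.
k = 4
subset = M[rng.choice(M.shape[0], size=k, replace=False)]
```

Independent random permutations are not guaranteed to differ, although a collision is extremely unlikely for large N. If distinctness is a hard requirement, the drawn permutations would have to be compared explicitly.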
I have a 1-dimensional NumPy array and want to store sparse updates to it.
Say I have an array of length 500000 and want to do 100 updates of 100 elements each. Updates are either additions or plain value changes (I do not think it matters).
What is the best way to do it using NumPy?
I wanted to just store two arrays, indices and values_to_add, and therefore have two objects: one stores the dense matrix and the other just keeps the indices and values to add, so that I can do something like this with the dense matrix:
dense_matrix[indices] += values_to_add
And if I have multiple updates, I just concatenate them.
But this NumPy syntax doesn't work well with repeated indices: the repeats are just ignored.
Updating a pair when an update repeats an index is O(n). I thought of using a dict instead of an array to store the updates, which looks fine from a complexity point of view, but it isn't good NumPy style.
What is the most expressive way to achieve this? I know about scipy sparse objects, but (1) I want pure numpy because (2) I want to understand the most efficient way to implement it.
If you have repeated indices, you could use the at method of the ufunc (here np.add.at). From the documentation:
Performs unbuffered in place operation on operand ‘a’ for elements
specified by ‘indices’. For addition ufunc, this method is equivalent
to a[indices] += b, except that results are accumulated for elements
that are indexed more than once.
Code
import numpy as np

a = np.arange(10)
indices = [0, 2, 2]  # index 2 is repeated on purpose
np.add.at(a, indices, [-44, -55, -55])  # both -55 updates land on a[2]
print(a)
Output
[ -44 1 -108 3 4 5 6 7 8 9]
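To connect this with the "concatenate the updates" idea from the question, here is a small sketch (the batch contents and array length are invented for illustration) that contrasts the buffered `+=` with `np.add.at` on concatenated index and value arrays:

```python
import numpy as np

dense = np.zeros(8, dtype=np.int64)

# Two hypothetical update batches; index 3 appears in both.
idx_1, val_1 = np.array([1, 3]), np.array([10, 20])
idx_2, val_2 = np.array([3, 5]), np.array([30, 40])

# Concatenate the batches, as proposed in the question.
indices = np.concatenate([idx_1, idx_2])
values = np.concatenate([val_1, val_2])

# Buffered fancy-index addition: only one write per repeated index survives.
buffered = dense.copy()
buffered[indices] += values

# Unbuffered addition: contributions to repeated indices are accumulated.
accumulated = dense.copy()
np.add.at(accumulated, indices, values)
```

Here `accumulated[3]` ends up as 50 (20 + 30), while `buffered[3]` holds only one of the two contributions.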
I have two boolean sparse square matrices of c. 80,000 x 80,000 generated from 12 MB of data (and am likely to have orders of magnitude larger matrices when I use GBs of data).
I want to multiply them (which produces a triangular matrix; however, I don't get this, since I don't limit the dot product to yield a triangular matrix).
I am wondering what the best way of multiplying them is (memory-wise and speed-wise) - I am going to do the computation on a m2.4xlarge AWS instance which has >60GB of RAM. I would prefer to keep the calc in RAM for speed reasons.
I appreciate that SciPy has sparse matrices and so does h5py, but have no experience in either.
What's the best option to go for?
Thanks in advance
UPDATE: sparsity of the boolean matrices is <0.6%
If your matrices are relatively empty it might be worthwhile encoding them as a data structure of the non-False values. Say a list of tuples describing the location of the non-False values. Or a dictionary with the tuples as the keys.
If you use e.g. a list of tuples you could use a list comprehension to find the items in the second list that can be multiplied with an element from the first list.
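a = [(0,0), (3,7), (5,2)]  # (row, column) of each True value, et cetera
b = ...  # idem
res = set()
for r, c in a:
    # a[r][c] pairs with every b[j][k] whose row j equals the column c
    res.update([(r, k) for j, k in b if c == j])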
-- EDITED TO SATISFY BELOW COMMENT / DOWNVOTER --
You're asking how to multiply matrices fast and easy.
SOLUTION 1: This is a solved problem: use numpy. All these operations are easy in numpy, and since they are implemented in C, are rather blazingly fast.
http://www.numpy.org/
http://www.scipy.org
also see:
Very large matrices using Python and NumPy
http://docs.scipy.org/doc/scipy/reference/sparse.html
SciPy has sparse matrices and sparse matrix multiplication (NumPy itself only provides dense arrays). It doesn't use much memory, since sparse formats such as CSR and CSC store only the nonzero values plus their index arrays, so memory use is proportional to the number of stored datapoints, plus some overhead. And it will almost certainly be blazingly fast compared to a pure Python solution.
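A minimal sketch of Solution 1 with `scipy.sparse` follows. The matrix size is scaled down, the data is random, and `random_pattern` is a hypothetical helper written only for this example. The pattern is stored as small integers so that the product counts matches, and the boolean result is recovered afterwards.

```python
import numpy as np
from scipy import sparse

def random_pattern(n, nnz, seed):
    """Hypothetical helper: an n x n CSR matrix holding about nnz ones."""
    rng = np.random.default_rng(seed)
    rows = rng.integers(0, n, size=nnz)
    cols = rng.integers(0, n, size=nnz)
    m = sparse.csr_matrix((np.ones(nnz, dtype=np.int32), (rows, cols)), shape=(n, n))
    m.data[:] = 1   # duplicate coordinates were summed; reset them to 1
    return m

n = 500                       # stand-in for 80,000
nnz = int(0.006 * n * n)      # roughly the 0.6% density from the question
A = random_pattern(n, nnz, seed=0)
B = random_pattern(n, nnz, seed=1)

counts = A @ B                # entry (i, j) counts the k with A[i, k] and B[k, j]
C = counts.astype(bool)       # boolean product: True where any such k exists
```

The product is usually denser than its factors, so whether the full-size result fits in RAM depends on the actual data and is worth checking on a sample first.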
SOLUTION 2
Another answer here suggests storing values as tuples of (x, y), presuming value is False unless it exists, then it's true. Alternate to this is a numeric matrix with (x, y, value) tuples.
REGARDLESS: Multiplying these would be Nasty time-wise: find element one, decide which other array element to multiply by, then search the entire dataset for that specific tuple, and if it exists, multiply and insert the result into the result matrix.
SOLUTION 3 ( PREFERRED vs. Solution 2, IMHO )
I would prefer this because it's simpler / faster.
Represent your sparse matrix with a set of dictionaries. Matrix one is a dict with the element at (x, y) and value v being (with x1,y1, x2,y2, etc.):
matrixDictOne = { 'x1:y1' : v1, 'x2:y2': v2, ... }
matrixDictTwo = { 'x1:y1' : v1, 'x2:y2': v2, ... }
Since a Python dict lookup is O(1) on average (a dict is a hash table), it's fast. This does not require searching the entire second matrix's data for element presence before multiplication. It's easy to write the multiply and easy to understand the representations.
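A rough sketch of this idea for boolean matrices is shown below. It departs from the representation above in two ways, both assumptions made for convenience: keys are (row, column) tuples rather than 'x:y' strings, and since every stored value is True, plain sets are used instead of dicts. The second matrix is indexed by row so that matching entries are found without scanning all of it.

```python
from collections import defaultdict

# Boolean sparse matrices as sets of (row, col) positions that are True.
a = {(0, 0), (3, 7), (5, 2)}
b = {(0, 4), (7, 1), (7, 3), (6, 6)}

# Index b by row so the entries in row j are found without a full scan.
b_rows = defaultdict(set)
for j, k in b:
    b_rows[j].add(k)

# Boolean product: (i, k) is True if some j has a[i, j] and b[j, k].
product = set()
for i, j in a:
    for k in b_rows.get(j, ()):
        product.add((i, k))

print(sorted(product))  # [(0, 4), (3, 1), (3, 3)]
```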
SOLUTION 4 (if you are a glutton for punishment)
Code this solution by using a memory-mapped file of the required size. Initialize a file with null values of the required size. Compute the offsets yourself and write to the appropriate locations in the file as you do the multiplication. Linux has a VMM which will page in and out for you with little overhead or work on your part. This is a solution for very, very large matrices that are NOT SPARSE and thus won't fit in memory.
Note this solves the complaint of the below complainer that it won't fit in memory. However, the OP did say sparse, which implies very few actual datapoints spread out in giant arrays, and SciPy handles this natively and thus nicely (lots of people at Fermilab use NumPy / SciPy regularly, and I'm confident the sparse matrix code is well tested).
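Solution 4 could be sketched with `np.memmap`, which handles the offset arithmetic for you. Everything concrete here is an assumption for illustration: the sizes are tiny, the file name is hypothetical, and the inputs are kept in memory, whereas truly large inputs would be memory-mapped as well.

```python
import os
import tempfile
import numpy as np

n, block = 512, 128   # illustrative sizes; real matrices would be far larger
rng = np.random.default_rng(0)
A = rng.random((n, n))
B = rng.random((n, n))

# Back the result with a file on disk; the OS pages it in and out as needed.
path = os.path.join(tempfile.mkdtemp(), "product.dat")  # hypothetical file name
C = np.memmap(path, dtype=np.float64, mode="w+", shape=(n, n))

# Fill the result one block of rows at a time, so each step touches one slice.
for start in range(0, n, block):
    stop = start + block
    C[start:stop, :] = A[start:stop, :] @ B

C.flush()  # push pending writes to the file
```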
I am using SciPy to construct a large, sparse (250k x 250k) co-occurrence matrix using scipy.sparse.lil_matrix. Co-occurrence matrices are symmetric; that is, M[i,j] == M[j,i]. Since it would be highly inefficient (and in my case, impossible) to store all the data twice, I'm currently storing data at the coordinate (i,j) where i is always smaller than j. So in other words, I have a value stored at (2,3) and no value stored at (3,2), even though (3,2) in my model should be equal to (2,3). (See the matrix below for an example.)
My problem is that I need to be able to randomly extract the data corresponding to a given index but, at least the way I'm currently doing it, half the data is in the row and half is in the column, like so:
M =
[1 2 3 4
0 5 6 7
0 0 8 9
0 0 0 10]
So, given the above matrix, I want to be able to do a query like M[1], and get back [2,5,6,7]. I have two questions:
1) Is there a more efficient (preferably built-in) way to do this than first querying the row, and then the column, and then concatenating the two? This is bad because whether I use CSC (column-based) or CSR (row-based) internal representation, one of the two queries is highly inefficient.
2) Am I even using the right part of Scipy? I have seen a few functions in the Scipy library that mention triangular matrices, but they seem to revolve around getting triangular matrices from a full matrix. In my case, (I think) I already have a triangular matrix, and want to manipulate it.
Many thanks.
I would say that you can't have the cake and eat it too: if you want efficient storage, you cannot store full rows (as you say); if you want efficient row access, I'd say that you have to store full rows.
While actual performance depends on your application, you could check whether the following approach works for you:
You use Scipy's sparse matrices for efficient storage.
You automatically symmetrize your matrix (there is a small recipe on StackOverflow, that works at least on regular matrices).
You can then access its rows (or columns); whether this is efficient depends on the implementation of sparse matrices…
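The symmetrization step could look like the sketch below, which reuses the 4x4 example from the question. This is one possible recipe, not necessarily the one referred to above, and it does store every off-diagonal value twice, which is exactly the storage versus access trade-off mentioned earlier.

```python
import numpy as np
from scipy import sparse

# Upper-triangular storage, as in the question's 4x4 example.
upper = sparse.csr_matrix(np.array([
    [1, 2, 3, 4],
    [0, 5, 6, 7],
    [0, 0, 8, 9],
    [0, 0, 0, 10],
]))

# Mirror the upper triangle; subtract the diagonal so it is not counted twice.
diagonal = sparse.diags(upper.diagonal(), dtype=upper.dtype)
full = (upper + upper.T - diagonal).tocsr()

row_1 = full[1].toarray().ravel()   # all values associated with index 1
print(row_1)                        # values 2, 5, 6, 7
```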
I want to create a matrix of size 1234 x 5678, filled with 1 to 5678 in row-major order.
I think you will need to use numpy to hold such a big matrix efficiently, not just for computation. You have ~7e6 items of 4/8 bytes each, which means about 28/56 MB in pure C already, and several times that in Python without an efficient data structure (a list of rows, each row a list).
Now, concerning your question:
import numpy as np
a = np.empty((1234, 5678), dtype=np.int32)
a[:] = np.arange(1, 5679)
You first create an array of the requested size with an explicit integer type (I assume you want 4-byte integers, hence np.int32; the old np.int alias was removed in NumPy 1.24). The 3rd line uses broadcasting so that each row (a[0], a[1], ... a[1233]) is assigned the values of the np.arange call (which gives you an array of [1, ....., 5678]). If you want F storage, that is column major:
a = np.empty((1234, 5678), dtype=np.int32, order='F')
...
The matrix a will take only a tiny amount of memory more than an array in C, and for computation at least, the indexing capabilities of arrays are much better than those of Python lists.
A nitpick: numeric is the name of the old numerical package for python - the recommended name is numpy.
Or just use Numerical Python if you want to do some mathematical operations on the matrix too (like multiplication, ...). Whether it uses row-major order for the matrix layout in memory I can't tell you, but it is covered in its documentation.
Here's a forum post that has some code examples of what you are trying to achieve.