numpy: efficiently reading a large array - python

I have a binary file that contains a dense n*m matrix of 32-bit floats. What's the most efficient way to read it into a Fortran-ordered numpy array?
The file is multi-gigabyte in size. I get to control the format, but it must be compact (i.e. about 4*n*m bytes in length) and must be easy to produce from non-Python code.
edit: It is imperative that the method produces a Fortran-ordered matrix directly (due to the size of the data, I can't afford to create a C-ordered matrix and then transform it into a separate Fortran-ordered copy.)

NumPy provides fromfile() to read binary data.
a = numpy.fromfile("filename", dtype=numpy.float32)
will create a one-dimensional array containing your data. To access it as a two-dimensional Fortran-ordered n x m matrix, you can reshape it:
a = a.reshape((n, m), order="F")
[EDIT: The reshape() actually copies the data in this case (see the comments). To do it without copying, use
a = a.reshape((m, n)).T
Thanks to Joe Kington for pointing this out.]
But to be honest, if your matrix is several gigabytes in size, I would go for an HDF5 tool like h5py or PyTables. Both of the tools have FAQ entries comparing the tool to the other one. I generally prefer h5py, though PyTables seems to be more commonly used (and the scopes of both projects are slightly different).
HDF5 files can be written from most programming languages used in data analysis. The list of interfaces in the linked Wikipedia article is not complete; for example, there is also an R interface. But I don't actually know which language you want to use to write the data...
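To make the no-copy options concrete, here is a minimal sketch. It assumes the file stores the matrix in column-major order (all of column 0, then column 1, and so on). The file name matrix.bin and the tiny n and m are hypothetical, and np.memmap is shown as a swapped-in alternative to fromfile() that maps the file instead of loading it all at once.

```python
import os
import tempfile

import numpy as np

n, m = 3, 4  # hypothetical dimensions; a real file would be far larger
expected = np.arange(n * m, dtype=np.float32).reshape(n, m)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "matrix.bin")  # hypothetical file name
    # Stand-in for the non-Python producer: raw float32 values, column by column.
    expected.ravel(order="F").tofile(path)

    # Option 1: read everything, then reinterpret the buffer without copying.
    flat = np.fromfile(path, dtype=np.float32)
    a = flat.reshape((m, n)).T  # shape (n, m), Fortran-contiguous view of flat

    # Option 2: map the file lazily as a Fortran-ordered (n, m) array.
    mm = np.memmap(path, dtype=np.float32, mode="r", shape=(n, m), order="F")
    mm_equal = bool(np.array_equal(mm, expected))
    mm_f_contiguous = bool(mm.flags.f_contiguous)
    del mm  # release the mapping before the temporary directory is removed

print(a.shape, a.flags.f_contiguous, np.shares_memory(flat, a))
```

With np.memmap, pages are read from disk only when they are touched, which may matter when the matrix does not fit comfortably in RAM.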

Basically Numpy stores the arrays as flat vectors. The multiple dimensions are just an illusion created by different views and strides that the Numpy iterator uses.
For a thorough but easy-to-follow explanation of how NumPy works internally, see the excellent chapter 19 of the book Beautiful Code.
At least NumPy's array() and reshape() accept an order argument: C ('C'), Fortran ('F') or preserved order ('A').
Also see the question How to force numpy array order to fortran style?
An example with the default C indexing (row-major order):
>>> a = np.arange(12).reshape(3,4) # <- C order by default
>>> a
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
>>> a[1]
array([4, 5, 6, 7])
>>> a.strides
(32, 8)
Indexing using Fortran order (column-major order):
>>> a = np.arange(12).reshape(3,4, order='F')
>>> a
array([[ 0, 3, 6, 9],
[ 1, 4, 7, 10],
[ 2, 5, 8, 11]])
>>> a[1]
array([ 1, 4, 7, 10])
>>> a.strides
(8, 24)
The other view
Also, you can always get the other kind of view using the attribute T of an array:
>>> a = np.arange(12).reshape(3,4, order='C')
>>> a.T
array([[ 0, 4, 8],
[ 1, 5, 9],
[ 2, 6, 10],
[ 3, 7, 11]])
>>> a = np.arange(12).reshape(3,4, order='F')
>>> a.T
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11]])
You can also manually set the strides:
>>> a = np.arange(12).reshape(3,4, order='C')
>>> a
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
>>> a.strides
(32, 8)
>>> a.strides = (8, 24)
>>> a
array([[ 0, 3, 6, 9],
[ 1, 4, 7, 10],
[ 2, 5, 8, 11]])
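As a rough sketch of the points above, the following checks that a transpose only swaps strides over the same buffer, and that Fortran-style strides over a C-ordered buffer produce the column-major numbering. It assumes 8-byte integers (the dtype is pinned to int64) and uses as_strided rather than assigning to a.strides, which is my own substitution for the in-place assignment shown above.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

# C-ordered 3x4 array of 8-byte integers: 32 bytes per row step, 8 per column step.
a = np.arange(12, dtype=np.int64).reshape(3, 4)

# The transpose is a view: same buffer, strides swapped.
t = a.T

# Walk the same buffer with Fortran-style strides (read-only to stay safe).
f_view = as_strided(a, shape=(3, 4), strides=(8, 24), writeable=False)
f_ref = np.arange(12, dtype=np.int64).reshape(3, 4, order="F")

print(a.strides, t.strides)           # strides of the array and its transpose
print(np.shares_memory(a, t))         # the transpose owns no data of its own
print(np.array_equal(f_view, f_ref))  # same numbering as order='F'
```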

Related

Accessing chunks at once in a numpy array

Provided a numpy array:
arr = np.array([0,1,2,3,4,5,6,7,8,9,10,11,12])
I wonder how to access chunks of a chosen size with a chosen separation, both concatenated and in slices:
E.g.: obtain chunks of size 3 separated by two values:
arr_chunk_3_sep_2 = np.array([0,1,2,5,6,7,10,11,12])
arr_chunk_3_sep_2_in_slices = np.array([[0,1,2],[5,6,7],[10,11,12]])
What is the most efficient way to do it? If possible, I would like to avoid copying or creating new objects as much as possible. Maybe memoryviews could be of help here?
Approach #1
Here's one with masking -
def slice_grps(a, chunk, sep):
    N = chunk + sep
    return a[np.arange(len(a))%N < chunk]
Sample run -
In [223]: arr
Out[223]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
In [224]: slice_grps(arr, chunk=3, sep=2)
Out[224]: array([ 0, 1, 2, 5, 6, 7, 10, 11, 12])
Approach #2
If the input array is such that the last chunk would have enough runway, we could leverage np.lib.stride_tricks.as_strided, inspired by this post, to select m elements off each block of n elements -
# https://stackoverflow.com/a/51640641/ #Divakar
def skipped_view(a, m, n):
    s = a.strides[0]
    strided = np.lib.stride_tricks.as_strided
    shp = ((a.size+n-1)//n,n)
    return strided(a,shape=shp,strides=(n*s,s), writeable=False)[:,:m]
out = skipped_view(arr,chunk,chunk+sep)
Note that the output is a view into the input array, so there is no extra memory overhead and it is virtually free!
Sample run to make things clear -
In [255]: arr
Out[255]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
In [256]: chunk = 3
In [257]: sep = 2
In [258]: skipped_view(arr,chunk,chunk+sep)
Out[258]:
array([[ 0, 1, 2],
[ 5, 6, 7],
[10, 11, 12]])
# Let's prove that the output is a view indeed
In [259]: np.shares_memory(arr, skipped_view(arr,chunk,chunk+sep))
Out[259]: True
How about a reshape and slice?
In [444]: arr = np.array([0,1,2,3,4,5,6,7,8,9,10,11,12])
In [445]: arr.reshape(-1,5)
...
ValueError: cannot reshape array of size 13 into shape (5)
Ah a problem - your array isn't big enough for this reshape - so we have to pad it:
In [446]: np.concatenate((arr,np.zeros(2,int))).reshape(-1,5)
Out[446]:
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 0, 0]])
In [447]: np.concatenate((arr,np.zeros(2,int))).reshape(-1,5)[:,:-2]
Out[447]:
array([[ 0, 1, 2],
[ 5, 6, 7],
[10, 11, 12]])
as_strided can get away with this by including bytes outside the data buffer. Usually that's seen as a bug, though here it can be an asset - provided you really do throw that garbage away.
Or throwing away the last incomplete line:
In [452]: arr[:-3].reshape(-1,5)[:,:3]
Out[452]:
array([[0, 1, 2],
[5, 6, 7]])
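One possible middle ground between the approaches above is sketched here: a strided view whose row count is computed so that every row is a complete chunk inside the buffer. That keeps a final chunk with no trailing separator, without padding and without touching memory past the end. The helper name complete_chunks_view is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided


def complete_chunks_view(a, chunk, sep):
    """Read-only 2-D view of every complete chunk of a 1-D array."""
    if a.ndim != 1 or len(a) < chunk:
        raise ValueError("need a 1-D array holding at least one full chunk")
    step = chunk + sep
    # The last row starts at (rows - 1) * step and must end within the array.
    rows = (len(a) - chunk) // step + 1
    s = a.strides[0]
    return as_strided(a, shape=(rows, chunk), strides=(step * s, s),
                      writeable=False)


arr = np.arange(13)
view = complete_chunks_view(arr, chunk=3, sep=2)
flat = view.ravel()  # the concatenated form needs a copy: the data is not evenly spaced

print(view)
print(flat)
```

The 2-D result shares memory with arr; only the concatenated 1-D form has to be copied.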

Split Numpy array into equal-length sub-arrays

I have a very huge numpy array like this:
np.array([1, 2, 3, 4, 5, 6, 7 , ... , 12345])
I need to create subgroups of n elements (in the example n = 3) in another array like this:
np.array([[1, 2, 3],[4, 5, 6], [7, 8, 9], [...], [12340, 12341, 12342], [12343, 12344, 12345]])
I did accomplish that using normal python lists, just appending the subgroups to another list. But, I'm having a hard time trying to do that in numpy.
Any ideas how can I do that?
Thanks!
You can use array.reshape(-1, 3), where the -1 means "whatever's left".
>>> array = np.arange(1, 12346)
>>> array
array([ 1, 2, 3, ..., 12343, 12344, 12345])
>>> array.reshape(-1, 3)
array([[ 1, 2, 3],
[ 4, 5, 6],
[ 7, 8, 9],
...,
[12337, 12338, 12339],
[12340, 12341, 12342],
[12343, 12344, 12345]])
You can use np.reshape():
From the documentation (link in title):
numpy.reshape(a, newshape, order='C')
Gives a new shape to an array without changing its data.
Here is an example of how you can apply it to your situation:
>>> import numpy as np
>>> a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 12345])
>>> a.reshape((len(a) // 3, 3))
array([[    1,     2,     3],
       [    4,     5,     6],
       [    7,     8, 12345]])
Note that obviously, the length of the array (len(a)) has to be a multiple of 3 to be able to reshape it into a 2-dimensional numpy array, because they must be rectangular.
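When the length is not guaranteed to be a multiple of the group size, one option is to drop the incomplete tail before reshaping. This is only a sketch: the helper name group_rows is hypothetical, and whether leftover elements should be dropped, padded, or treated as an error depends on your data.

```python
import numpy as np


def group_rows(values, n):
    """Reshape a 1-D array into rows of n, dropping any incomplete tail."""
    usable = (len(values) // n) * n  # largest multiple of n that fits
    return values[:usable].reshape(-1, n)


exact = group_rows(np.arange(1, 12346), 3)  # 12345 elements divide evenly by 3
ragged = group_rows(np.arange(1, 15), 3)    # 14 elements: the last two are dropped

print(exact.shape)
print(ragged)
```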

Trouble reshaping 3-d NumPy array into 2-d NumPy array

I'm working on a problem with image processing, and my data is presented as a 3-dimensional NumPy array, where the (x, y, z) entry is the (x, y) pixel (numerical intensity value) of image z. There are 10000 images and each image is 25x25. Thus, the data matrix is of size 25x25x10000. I am trying to convert this into a 2-dimensional matrix of size 10000x625, where each row is a linearization of the pixels in the image. For example, if the images were 3x3 instead, we would have the following:
1 2 3
4 5 6 ------> [1, 2, 3, 4, 5, 6, 7, 8, 9]
7 8 9
I am attempting to do this by calling data.reshape((10000, 625)), but the data is no longer aligned properly after doing so. I have tried transposing the matrix in valid stages of reshaping, but that does not seem to fix it.
Does anyone know how to fix this?
If you want the data to be aligned you need to do data.reshape((625, 10000)).
If you want a different layout try np.rollaxis:
data_rolled = np.rollaxis(data, 2, 0) # This is Shape (10000, 25, 25)
data_reshaped = data_rolled.reshape(10000, 625) # Now you can do your reshape.
Numpy needs you to know which elements belong together during reshaping, so only "merge" dimensions that belong together.
The problem is that you aren't respecting the standard index order in your reshape call. The data will only be aligned if the two dimensions you want to combine are in the same position in the new array ((25, 25, 10000) -> (625, 10000)).
Then, to get the shape you want, you can transpose. It's easier to visualize with a smaller example -- when you run into problems like this, always try out a smaller example in the REPL if you can.
>>> a = numpy.arange(12)
>>> a = a.reshape(2, 2, 3)
>>> a
array([[[ 0, 1, 2],
[ 3, 4, 5]],
[[ 6, 7, 8],
[ 9, 10, 11]]])
>>> a.reshape(4, 3)
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11]])
>>> a.reshape(4, 3).T
array([[ 0, 3, 6, 9],
[ 1, 4, 7, 10],
[ 2, 5, 8, 11]])
No need to rollaxis!
Notice how the print layout that numpy uses makes this kind of reasoning easier. The differences between the first and the second step are only in the bracket positions; the numbers all stay in the same place, which often helps when you want to think through shape issues.
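Here is a small sketch of both suggestions applied to the layout from the question, using two hypothetical 3x3 images stacked along the last axis in place of the 25x25x10000 data. np.moveaxis is used as the modern counterpart of np.rollaxis.

```python
import numpy as np

# Two hypothetical 3x3 "images", stacked so that data[x, y, z] is pixel (x, y) of image z.
img0 = np.arange(1, 10).reshape(3, 3)
img1 = img0 * 10
data = np.stack([img0, img1], axis=-1)  # shape (3, 3, 2)
n_images = data.shape[-1]

# Option A: move the image axis to the front, then flatten each image.
rows_a = np.moveaxis(data, -1, 0).reshape(n_images, -1)

# Option B: merge the two pixel axes in place, then transpose.
rows_b = data.reshape(-1, n_images).T

print(rows_a)
print(np.array_equal(rows_a, rows_b))
```

Option B merges only the two adjacent pixel axes, so it stays a view of the original data; Option A has to copy because the image axis is moved before the reshape.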

Numpy array, specifying what elements to return

Say I have the following 5x5 numpy array called A
array([[6, 7, 7, 7, 8],
[4, 2, 5, 5, 9],
[1, 2, 4, 7, 4],
[0, 7, 3, 6, 8],
[4, 9, 6, 1, 6]])
and this 5x5 array called F
array([[1,0,0,0,0],
[1,0,0,0,0],
[1,0,0,0,0],
[1,0,0,0,0],
[0,0,0,0,0]])
I've been trying to use np.copyto, but I can't wrap my head around why it is not working or how it works. I get: ValueError: could not broadcast input array from shape (5,5) into shape (2)
Is there an easy way to get only the values of A that have a corresponding 1 in F when F is laid over A? I.e., it would return 6, 4, 1, 0.
You can just do this little trick: A[F==1]
In [8]:
A[F==1]
Out[8]:
array([6, 4, 1, 0])
Check out Boolean indexing
To use np.copyto, make sure that the destination array is allocated with np.empty.
This basically solved my problem.
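For comparison, here is a minimal sketch of both routes on the arrays from the question: Boolean indexing to pull the values out, and np.copyto with its where argument to copy them into a destination of the same shape. The destination is created with np.zeros_like rather than np.empty, which is my own choice so that the positions that are not copied hold a defined value.

```python
import numpy as np

A = np.array([[6, 7, 7, 7, 8],
              [4, 2, 5, 5, 9],
              [1, 2, 4, 7, 4],
              [0, 7, 3, 6, 8],
              [4, 9, 6, 1, 6]])
F = np.array([[1, 0, 0, 0, 0]] * 4 + [[0, 0, 0, 0, 0]])

mask = F.astype(bool)  # same as F == 1 for a 0/1 array

# Boolean indexing returns the selected values as a new 1-D array.
selected = A[mask]

# np.copyto needs a destination that A broadcasts to; here it has the same shape.
dst = np.zeros_like(A)
np.copyto(dst, A, where=mask)

print(selected)
print(dst)
```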

How can I find the dimensions of a matrix in Python?

How can I find the dimensions of a matrix in Python? len(A) returns only one value.
Edit:
close = dataobj.get_data(timestamps, symbols, closefield)
Is (I assume) generating a matrix of integers (less likely strings). I need to find the size of that matrix, so I can run some tests without having to iterate through all of the elements. As far as the data type goes, I assume it's an array of arrays (or list of lists).
The number of rows of a list of lists would be: len(A) and the number of columns len(A[0]) given that all rows have the same number of columns, i.e. all lists in each index are of the same size.
If you are using NumPy arrays, shape can be used.
For example
>>> a = numpy.array([[[1,2,3],[1,2,3]],[[12,3,4],[2,1,3]]])
>>> a
array([[[ 1, 2, 3],
[ 1, 2, 3]],
[[12, 3, 4],
[ 2, 1, 3]]])
>>> a.shape
(2, 2, 3)
As Ayman Farhat mentioned, you can use len(matrix) to get the number of rows and len(matrix[0]), the length of the first row, to get the number of columns:
>>> a=[[1,5,6,8],[1,2,5,9],[7,5,6,2]]
>>> len(a)
3
>>> len(a[0])
4
You can also use NumPy, a library that helps you with matrices:
>>> import numpy
>>> numpy.shape(a)
(3, 4)
To get just the number of dimensions in NumPy:
len(a.shape)
In the first case:
import numpy as np
a = np.array([[[1,2,3],[1,2,3]],[[12,3,4],[2,1,3]]])
print("shape = ",np.shape(a))
print("dimensions = ",len(a.shape))
The output will be:
shape = (2, 2, 3)
dimensions = 3
m = [[1, 1, 1, 0],[0, 5, 0, 1],[2, 1, 3, 10]]
print(len(m),len(m[0]))
Output
3 4
The correct answer is the following:
import numpy
numpy.shape(a)
Suppose you have an array a. To get the dimensions of the array, use shape.
import numpy as np
a = np.array([[3,20,99],[-13,4.5,26],[0,-1,20],[5,78,-19]])
a.shape
The output of this will be
(4, 3)
You can use the following to get the height and width of a NumPy array:
height = arr.shape[0]
width = arr.shape[1]
If your array has multiple dimensions, you can increase the index to access them.
You can simply find the dimensions of a matrix using NumPy:
import numpy as np
x = np.arange(24).reshape((6, 4))
x.ndim
The output will be:
2
This means the matrix is 2-dimensional.
x.shape
Will show you the size of each dimension. The shape for x is equal to:
(6, 4)
A simple way I look at it:
example:
h=np.array([[[[1,2,3],[3,4,5]],[[5,6,7],[7,8,9]],[[9,10,11],[12,13,14]]]])
h.ndim
4
h
array([[[[ 1, 2, 3],
[ 3, 4, 5]],
[[ 5, 6, 7],
[ 7, 8, 9]],
[[ 9, 10, 11],
[12, 13, 14]]]])
If you observe closely, the number of opening square brackets at the beginning gives the number of dimensions of the array.
In the above array, to access 7, the following indexing is used:
h[0,1,1,0]
However if we change the array to 3 dimensions as below,
h=np.array([[[1,2,3],[3,4,5]],[[5,6,7],[7,8,9]],[[9,10,11],[12,13,14]]])
h.ndim
3
h
array([[[ 1, 2, 3],
[ 3, 4, 5]],
[[ 5, 6, 7],
[ 7, 8, 9]],
[[ 9, 10, 11],
[12, 13, 14]]])
To access element 7 in the above array, the index is h[1,1,0]
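The answers above can be summarised in one short sketch. It assumes a rectangular list of lists (every row the same length), which is what len(A[0]) relies on; the variable names are illustrative only.

```python
import numpy as np

nested = [[1, 5, 6, 8], [1, 2, 5, 9], [7, 5, 6, 2]]

# Plain Python: rows and columns of a rectangular list of lists.
rows, cols = len(nested), len(nested[0])

# NumPy: shape gives the size of every dimension, ndim the number of dimensions.
arr = np.asarray(nested)

print(rows, cols)
print(arr.shape, arr.ndim, arr.size)
print(np.shape(nested))  # np.shape also accepts the nested list directly
```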
