My supervisor uses IDL, but I've been using Python as I'm more familiar with it. I'm performing an interpolation and have saved the lower/upper bound values. Is there a quicker way of doing this?
Variables
Inputs:
sed = numpy array [6,221,6900]
it0, it1, iz0, iz1 = numpy arrays [341499]
snapshots (38 of them)
Outputs:
sed1, sed2, sed3, sed4 = numpy arrays [341499, 6900] (one [6900] vector per particle)
MWE
I want to loop over the 38 snapshots and, within that loop, over the 341499 particles, assigning the resulting [6900] numpy arrays as given below.
sed1 = sed[iz0, it0]
sed2 = sed[iz1, it0]
sed3 = sed[iz0, it1]
sed4 = sed[iz1, it1]
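For reference, a small sketch of what this fancy indexing produces (shapes shrunk from the question's so it runs quickly). Indexing with two integer arrays pairs their entries element-wise over all particles at once:
import numpy as np

# Shrunken stand-ins for the real arrays: sed is really [6, 221, 6900] and
# the index arrays hold 341499 entries each.
sed = np.random.random((6, 221, 50))
iz0 = np.random.randint(0, 6, 10)
it0 = np.random.randint(0, 221, 10)

sed1 = sed[iz0, it0]  # pairs iz0[k] with it0[k] for every particle k at once
print(sed1.shape)     # (10, 50): one length-50 vector per particle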
What I've Tried
I cannot initialise an array of the required size, i.e. numpy [38, 341499, 4, 6900], as this gives a memory error (at 8 bytes per float64 that array would be roughly 2.9 TB). This means I can't assign using vectorised [:] operations.
I've tried initialising a numpy object-dtype array of size [38, 341499], but this is very slow.
Related
I could not find a previous post that specifically addresses how to create masks that operate on vectors in a 3D array; the questions and answers I found only cover applying masks to individual elements of a 3D array, or to vectors in a 2D array. So, as the title states, that is exactly what I want to do here: remove all zero vectors from a 3D (x, y, z) array. The only method I can think of is two for loops running over x and (y, :), as shown in the code below. However, this does not work either, because of the error message I get when I try to run it:
'list' object cannot be safely interpreted as an integer
Moreover, even if I somehow get this method to work, I know that a double for loop will make this masking process very time consuming, because eventually I want to apply it to arrays with sizes in the millions. This leads to my main question: what would be the fastest method to accomplish this?
Code:
import numpy as np

data = np.array([[[0,0,0],[1,2,3],[4,5,6],[0,0,0]],
                 [[7,8,9],[0,0,0],[0,0,0],[10,11,12]]], dtype=float)

# This line raises the error quoted above: np.empty expects a shape made of
# integers, not nested lists.
datanonzero = np.empty([[],[]], dtype=float)
for maskclear1 in range(0, 2):
    for maskclear2 in range(0, 4):
        datanonzero[maskclear1, maskclear2, :] = data[~np.all(data[maskclear1, maskclear2, 0:3] == 0, axis=0)]
import numpy as np

data = np.array([[[0,0,0],[1,2,3],[4,5,6],[0,0,0]],
                 [[7,8,9],[0,0,0],[0,0,0],[10,11,12]]], dtype=float)

# Keep only the 3-vectors that are not entirely zero. Note the result is 2D:
# rows can lose different numbers of vectors, so in general it cannot be
# reshaped back into a rectangular (2, n, 3) array.
flatten_data = data.reshape(-1, 3)
datanonzero = flatten_data[~np.all(flatten_data == 0, axis=1)]
I am setting an existing array to zeros using the numpy.zeros_like function as follows:
import numpy as np
x = np.random.rand(3, 3, 3, 3, 3) # Some random data
x = np.zeros_like(x)  # allocates a new zero array with x's shape and dtype
I think the way I am doing it creates a new array of zeros and updates the reference to it. Is there an efficient way to set everything to zero without this allocation? I need it because this code sits in an optimisation routine that gets called quite a few times.
You can do the following:
x[:] = 0
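A quick sketch confirming that this writes into the existing buffer rather than allocating a new array; x.fill(0) does the same thing in place:
import numpy as np

x = np.random.rand(3, 3, 3, 3, 3)
addr = x.ctypes.data          # address of the underlying data buffer

x[:] = 0                      # writes zeros in place; x.fill(0) is equivalent
assert x.ctypes.data == addr  # same buffer, no new allocation
assert not x.any()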
The code is too complicated to paste here, but I have a numpy array shaped (800, 800, 1300), or 1300 matrices shaped (800, 800). This is 5GB.
I pass this array into a function, whereby the function:
1. multiplies each "matrix" in the above array by a float in a (1300,) shaped array,
2. sums the array into one "matrix", shaped (800, 800),
3. and takes the inverse of the matrix.
This program uses 20.2 GB of RAM! Is that possible? I cannot see any memory leaks. I am simply taking numpy arrays and passing them through a function, then saving the resulting arrays.
I'll try to post the code.
import math
import matplotlib.pyplot as plt
import numpy as np
import scipy
import scipy.io
import os
data_file1 = "filename1.npy"
data_file2 = "filename2.npy"
data_file3 = "filename3.npy"
data1 = np.load(data_file1)
data2 = np.load(data_file2)
data3 = np.load(data_file3)
data_total = np.concatenate((data1, data2, data3)) # This array is shape (800,800,1300), around 6 GB.
array1 = np.arange(1300) + 1
vector = np.arange(800) + 1
def function_matrix(data_total, vector):
    # Step 1: scale each (800, 800) matrix by its float. Note this broadcast
    # only works if the 1300-long axis is the leading axis of data_total.
    Multi_matrix = array1[:, None, None] * data_total
    # Step 2: sum the scaled matrices into a single (800, 800) matrix.
    Sum_matrix = np.sum(Multi_matrix, axis=0)
    # Step 3: vector.T @ inv(Sum_matrix) @ vector, via solve rather than an
    # explicit inverse.
    mTCm = np.array([np.dot(vector.T, np.linalg.solve(Sum_matrix, vector))])
    return mTCm
draw_pointsA = np.asarray([[function_matrix(data_total[i], vector[j]) for i in np.arange(0,100)] for j in np.arange(0,100)])
filename = "save_datapoints.npy"
np.save(filename, draw_pointsA)
EDIT 2:
See below. It is actually 12 GB of RAM; the 20.1 GB figure is the virtual size of the process.
This doesn't answer your question, but proposes a way to avoid the problem from the start.
Step 1 is sequential -- you only need 1 matrix loaded at a time.
Change your code to process each matrix independently, as sketched below.
By Step 2 your memory requirement is down to 800 * 800 * sizeof(datum), which is a few megabytes, and you can certainly afford to keep that in memory.
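A minimal sketch of that streaming idea, using the question's shapes; load_matrix is a hypothetical placeholder for however one matrix is actually obtained (a file slice, a generator, etc.):
import numpy as np

def load_matrix(k):
    # Hypothetical stand-in: in practice this would read one matrix from disk.
    return np.random.random((800, 800))

# Steps 1 and 2 fused: scale and accumulate one matrix at a time, so only a
# single (800, 800) matrix plus the running sum (~5 MB each for float64) is
# ever held in memory, instead of the whole multi-GB stack.
weights = np.arange(1300) + 1.0
acc = np.zeros((800, 800))
for k in range(1300):
    acc += weights[k] * load_matrix(k)

# Step 3 on the accumulated matrix.
vector = np.arange(800) + 1.0
result = vector @ np.linalg.solve(acc, vector)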
It sounds like this could be a type issue, i.e. you converted the values in the matrices to a different type. Perhaps you stored the original matrix values as int16 or as single-precision floats, and after multiplying with a float they are stored as a matrix of double values (which need two to four times the space in memory).
You can use the dtype argument to set the value type for the matrix.
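For example (a small sketch; float32 takes half the space of the default float64):
import numpy as np

a64 = np.ones((100, 100))                    # float64 by default
a32 = np.ones((100, 100), dtype=np.float32)  # dtype set explicitly
print(a64.nbytes, a32.nbytes)                # 80000 40000: half the memory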
Other possible reasons could be that some additional matrices are created underway. That's obviously impossible to decode unless you post the code.
A possible solution to your memory problem is to use HDF5 files, and write the matrices to disk. Then you could load the matrix one at a time. This is easy with h5py, as the matrices can be compressed, and/or sliced using numpy/scipy syntax.
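A minimal h5py sketch of that approach ("matrices.h5" and the dataset name are placeholders): write the stack to disk once, then read back one (800, 800) slice at a time instead of holding all 1300 matrices in RAM:
import h5py
import numpy as np

with h5py.File("matrices.h5", "w") as f:
    dset = f.create_dataset("stack", shape=(1300, 800, 800),
                            dtype="float64", compression="gzip")
    for k in range(1300):
        dset[k] = np.random.random((800, 800))  # stand-in for the real data

with h5py.File("matrices.h5", "r") as f:
    one_matrix = f["stack"][0]  # loads a single (800, 800) slice from disk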
I would like to create a numpy array without creating a list first.
At the moment I've got this:
import pandas as pd
import numpy as np
dfa = pd.read_csv('csva.csv')
dfb = pd.read_csv('csvb.csv')
pa = np.array(dfa['location'])
pb = np.array(dfb['location'])
ra = [(pa[i+1] - pa[i]) / float(pa[i]) for i in range(len(pa) - 1)]
rb = [(pb[i+1] - pb[i]) / float(pb[i]) for i in range(len(pb) - 1)]
ra = np.array(ra)
rb = np.array(rb)
Is there an elegant way to do this last step, filling in the numpy arrays, in one go without creating the lists first?
Thanks
You can calculate with vectors in numpy, without the need for lists:
ra = (pa[1:] - pa[:-1]) / pa[:-1]
rb = (pb[1:] - pb[:-1]) / pb[:-1]
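Equivalently, np.diff computes exactly these pairwise differences, so the same thing can be written a little more compactly:
ra = np.diff(pa) / pa[:-1]
rb = np.diff(pb) / pb[:-1]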
The title of your question and what you need to do in your specific case are actually two slightly different things.
To create a numpy array without "casting" a list (or other iterable), you can use one of the several methods defined by numpy itself that return arrays:
np.empty, np.zeros, np.ones, np.full to create arrays of given size with fixed values
np.random.* (where * can be various distributions, like normal, uniform, exponential ...), to create arrays of given size with random values
In general, read this: Array creation routines
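A few of those routines in action (a quick sketch):
import numpy as np

a = np.zeros(5)                   # five 0.0s
b = np.ones((2, 3))               # 2x3 array of 1.0s
c = np.full(4, 7.5)               # four elements, all 7.5
d = np.empty(3)                   # allocated but uninitialized: contents are arbitrary
e = np.random.normal(size=1000)   # 1000 draws from a standard normal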
In your case, you already have numpy arrays (pa and pb) and you don't have to create lists to calculate the new arrays (ra and rb): you can operate directly on the numpy arrays (which is the entire point of numpy: operations on whole arrays are far faster than iterating over each element!). Copied from @Daniel's answer:
ra = (pa[1:] - pa[:-1]) / pa[:-1]
rb = (pb[1:] - pb[:-1]) / pb[:-1]
This will be much faster than your current implementation, not only because you avoid converting a list to an ndarray, but because numpy arrays are orders of magnitude faster for mathematical and batch operations than iterating in pure Python.
numpy.zeros
Return a new array of given shape and type, filled with zeros.
or
numpy.ones
Return a new array of given shape and type, filled with ones.
or
numpy.empty
Return a new array of given shape and type, without initializing entries.
I have a list of several hundred 10x10 arrays that I want to stack together into a single Nx10x10 array. At first I tried a simple
newarray = np.array(mylist)
But that returned with "ValueError: setting an array element with a sequence."
Then I found the online documentation for dstack(), which looked perfect: "...This is a simple way to stack 2D arrays (images) into a single 3D array for processing." Which is exactly what I'm trying to do. However,
newarray = np.dstack(mylist)
tells me "ValueError: array dimensions must agree except for d_0", which is odd because all my arrays are 10x10. I thought maybe the problem was that dstack() expects a tuple instead of a list, but
newarray = np.dstack(tuple(mylist))
produced the same result.
At this point I've spent about two hours searching here and elsewhere to find out what I'm doing wrong and/or how to go about this correctly. I've even tried converting my list of arrays into a list of lists of lists and then back into a 3D array, but that didn't work either (I ended up with lists of lists of arrays, followed by the "setting array element as sequence" error again).
Any help would be appreciated.
newarray = np.dstack(mylist)
should work. For example:
import numpy as np
# Here is a list of five 10x10 arrays:
x = [np.random.random((10,10)) for _ in range(5)]
y = np.dstack(x)
print(y.shape)
# (10, 10, 5)
# To get the shape to be Nx10x10, you could use rollaxis:
y = np.rollaxis(y,-1)
print(y.shape)
# (5, 10, 10)
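On numpy 1.10 or newer, np.stack stacks along a new leading axis by default, so it yields the Nx10x10 shape directly, without the rollaxis step:
y = np.stack(x)
print(y.shape)
# (5, 10, 10)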
np.dstack returns a new array. Thus, using np.dstack requires as much additional memory as the input arrays. If you are tight on memory, an alternative to np.dstack which requires less memory is to allocate space for the final array first, and then pour the input arrays into it one at a time.
For example, if you had 58 arrays of shape (159459, 2380), then you could use
y = np.empty((159459, 2380, 58))
for i in range(58):
    # instantiate the input arrays one at a time
    x = np.random.random((159459, 2380))
    # copy x into y
    y[..., i] = x