How to save numpy masked array to file - python

What is the most efficient way of saving a numpy masked array? Unfortunately numpy.save doesn't work:
import numpy as np
a = np.ma.zeros((500, 500))
np.save('test', a)
This gives a:
NotImplementedError: Not implemented yet, sorry...
One way seems to be using pickle, but that unfortunately is not very efficient (huge file sizes), and not platform-independent. Also, netcdf4 seems to work, but it has a large overhead just to save a simple array.
Has anyone had this problem before? I'm tempted to just do a numpy.save of array.data and another for the mask.

import numpy as np
a = np.ma.zeros((500, 500))
a.dump('test')
then read it with
a = np.load('test', allow_pickle=True)  # allow_pickle=True is required in NumPy >= 1.16.3, since dump() pickles the array

The current accepted answer is somewhat obsolete, and badly inefficient if the array being stored is sparse (it relies on uncompressed pickling of the array).
A better way to save/load a masked array would be to use an npz file:
import numpy as np
# Saving masked array 'arr':
np.savez_compressed('test.npz', data=arr.data, mask=arr.mask)
# Loading array back
with np.load('test.npz') as npz:
    arr = np.ma.MaskedArray(**npz)
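Note that only the data and mask are stored this way; attributes such as fill_value are lost. If the fill value matters, a minimal sketch that carries it along (assuming a scalar fill value) could be:
np.savez_compressed('test.npz', data=arr.data, mask=arr.mask,
                    fill_value=arr.fill_value)
with np.load('test.npz') as npz:
    arr = np.ma.MaskedArray(npz['data'], mask=npz['mask'],
                            fill_value=npz['fill_value'].item())  # .item() unwraps the 0-d array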

If you have a fixed mask that doesn't need to be saved, then you can just save the valid values:
a = np.ma.MaskedArray(values, mask)
np.save('test', a.compressed())
You can then recover it by doing something like:
compressed = np.load('test.npy')  # np.save added the .npy extension
values = np.zeros_like(mask, dtype=compressed.dtype)
np.place(values, ~mask, compressed)
a = np.ma.MaskedArray(values, mask)

A simple way to do it would be to save the data and mask of the masked array separately:
np.save('DIN_WOA09.npy', DIN_woa.data)
np.save('mask_WOA09.npy', DIN_woa.mask)
Then later, you can reconstruct the masked array from the data and mask.
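For example, the reconstruction could look like this (a sketch using the file names above):
data = np.load('DIN_WOA09.npy')
mask = np.load('mask_WOA09.npy')
DIN_woa = np.ma.MaskedArray(data, mask=mask)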

Saving it inside a dictionary will allow you to keep its original format and mask without any trouble. Something like:
b = {}
b['a'] = a
np.save('b', b)
should work fine.
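Loading it back requires allow_pickle=True (the dictionary is pickled under the hood), plus .item() to unwrap the 0-d object array that np.save wraps it in:
b = np.load('b.npy', allow_pickle=True).item()  # recover the dict
a = b['a']  # the masked array, with its mask intact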

Related

Performance in reading dicom files with SimpleITK and PyTorch

I want to load an image directly from memory into Python in PyTorch tensor format.
I modified the GetArrayViewFromImage() function by replacing these lines:
image_memory_view = _GetMemoryViewFromImage(image)
array_view = numpy.asarray(image_memory_view).view(dtype = dtype)
with:
image_memory_view = _GetMemoryViewFromImage(image)
array_view = torch.as_tensor(image_memory_view, dtype = dtype)
In practice it was so slow that I replaced it with:
image_memory_view = _GetMemoryViewFromImage(image)
array_view = numpy.asarray(image_memory_view).view(dtype = dtype)
array_view = torch.as_tensor(array_view)
Now I have two questions:
1. It is much slower, and I don't really know why reading it with numpy and then converting is faster.
2. Even though I add the dtype argument and it returns a tensor with the correct dtype, the values are read wrong (e.g. -1000 in numpy is read as 252, no matter which torch.dtype I choose). This is not a problem when reading with numpy and converting. Why is that happening?
While this does not directly answer your question, I strongly recommend using the torchio package, instead of dealing with these IO issues yourself (torchio uses SimpleITK under the hood).
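For illustration, loading an image with torchio could look roughly like this (a minimal sketch; the file name is hypothetical):
import torchio as tio

# ScalarImage reads the file via SimpleITK and exposes the voxels
# as a torch tensor of shape (channels, W, H, D)
image = tio.ScalarImage('subject_mri.nii.gz')
tensor = image.data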

How to save a list of numpy arrays into a single file and load file back to original form [duplicate]

This question already has answers here:
NumPy save some arrays at once (2 answers)
Closed 3 years ago.
I am currently trying to save a list of numpy arrays into a single file; an example of such a list is below:
import numpy as np
np_list = []
for i in range(10):
    if i % 2 == 0:
        np_list.append(np.random.randn(64))
    else:
        np_list.append(np.random.randn(32, 64))
I can combine all of them into a single file using savez by iterating through the list, but is there any other way? I am trying to save the weights returned by model.get_weights(), which is a list of ndarrays; after retrieving the weights from the saved file I intend to load them into another model using model.set_weights(np_list). Therefore the format of the list must remain the same. Let me know if anyone has an elegant way of doing this.
I would go with np.save and np.load because it's platform-independent, faster than savetxt and works with lists of arrays, for example:
import numpy as np
a = [
    np.arange(100),
    np.arange(200)
]
np.save('a.npy', np.array(a, dtype=object), allow_pickle=True)
b = np.load('a.npy', allow_pickle=True)
See the documentation for np.save and np.load. A longer discussion can be found in the answers to "How to save and load numpy.array() data properly?"
Edit
As @AlexP mentioned, numpy >= v1.24.2 no longer accepts building an array from sub-arrays of different sizes and types without an explicit dtype, which is why the dtype=object cast is necessary.
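For the model use case in the question, the round trip might look like this (model and model2 are assumed to be Keras-style models with get_weights/set_weights):
# Save the ragged list of weight arrays
np.save('weights.npy', np.array(model.get_weights(), dtype=object), allow_pickle=True)
# Load them into another model with the same architecture
weights = list(np.load('weights.npy', allow_pickle=True))
model2.set_weights(weights)  # shapes and order are preserved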

How to effectively store a very large list in python

Question: I have a big 3D image collection that I would like to store in one file. How should I do this effectively?
Background: The dataset has about 1,000 3D MRI images with a size of 256 by 256 by 156. To avoid frequently opening and closing files, I was trying to store all of them in one big list and export it.
So far I have tried reading each MRI in as a 3D numpy array and appending it to a list. When I tried to save the list using numpy.save, it consumed all my memory and exited with "Memory Error".
Here is the code I tried:
import numpy as np
import nibabel as nib
import os
file_list = os.listdir('path/to/files')
data = []  # list to collect each MRI volume
for file in file_list:
    mri = nib.load(os.path.join('path/to/files', file))
    mri_array = np.array(mri.dataobj)
    data.append(mri_array)
np.save('imported.npy', data)
Expected Outcome:
Is there a better way to store such a dataset without consuming too much memory?
The HDF5 file format and NumPy's memmap are the two options I would go to first if you want to jam all your data into one file. Neither option loads all the data into memory.
Python has the h5py package to handle HDF5 files. These have a lot of features, and I would generally lean toward this option. It would look something like this:
import h5py
with h5py.File('data.h5', 'w') as h5file:
    for n, image in enumerate(mri_images):
        h5file[f'image{n}'] = image
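Reading a single image back later touches only that dataset, not the whole file:
with h5py.File('data.h5', 'r') as h5file:
    first = h5file['image0'][...]  # load one image into memory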
memmap works with raw binary files, so it is not really feature-rich at all. This would look something like:
import numpy as np
bin_file = np.memmap('data.bin', mode='w+', dtype=int, shape=(1000, 256, 256, 156))
for n, image in enumerate(mri_images):
    bin_file[n] = image
del bin_file # dumps data to file
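Re-opening the memmap later is cheap, since slices are read from disk on demand:
bin_file = np.memmap('data.bin', mode='r', dtype=int, shape=(1000, 256, 256, 156))
first = np.array(bin_file[0])  # copy just the first volume into memory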

Numpy fromfile function does not work properly

So I'm working on code where I need to write and read files. I'm using Python and numpy, but the numpy fromfile function does not seem to work properly. First I create an array with 500 elements and save it with the savetxt function. I check the file and it is all right, just how I wanted.
import numpy as np
weight = np.zeros(500, float)
np.savetxt("weights.txt", weight, '%.100f')
print(weight[2])
But after I replace the line where I create the array with the one where I read it from a file, a problem occurs: the zeros turn into really small numbers, and I can't guess why. Here is the line where I read from the file:
weight = np.fromfile("weights.txt", float, -1)
Should I write a custom function that turns files into arrays, or is there a way to make it work?
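For what it's worth, np.fromfile with the default sep='' treats the file as binary, so the ASCII text written by savetxt is reinterpreted byte-for-byte as float64 values, which is where the really small numbers come from. The matching reader for savetxt output would be np.loadtxt, or fromfile with an explicit text separator:
weight = np.loadtxt("weights.txt")
# or, sticking with fromfile, tell it the file is text:
weight = np.fromfile("weights.txt", dtype=float, sep="\n")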

Building a huge numpy array using pytables

How can I create a huge numpy array using pytables? I tried this, but it gives me a "ValueError: array is too big." error:
import numpy as np
import tables as tb
ndim = 60000
h5file = tb.open_file('test.h5', mode='w', title="Test Array")  # openFile in PyTables < 3.0
root = h5file.root
h5file.create_array(root, "test", np.zeros((ndim, ndim), dtype=float))
h5file.close()
Piggybacking off of @b1r3k's response: to create an array that you are not going to access all at once (i.e. bring the whole thing into memory), you want to use a CArray (Chunked Array). The idea is that you then fill and access it incrementally:
import numpy as np
import tables as tb
ndim = 60000
h5file = tb.open_file('test.h5', mode='w', title="Test Array")
root = h5file.root
x = h5file.create_carray(root, 'x', tb.Float64Atom(), shape=(ndim, ndim))
x[:100, :100] = np.random.random(size=(100, 100))  # now put in some data
h5file.close()
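Reading a block back later works the same way; only the requested slice is pulled from disk (modern snake_case PyTables names assumed, as above):
with tb.open_file('test.h5', mode='r') as h5file:
    block = h5file.root.x[:100, :100]  # reads just this chunk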
You could try to use the tables.CArray class, as it supports compression, but...
I think the question is more about numpy than pytables, because you are creating the array with numpy before storing it with pytables.
That way you need a lot of RAM to execute np.zeros((ndim, ndim)), and this is probably where the "ValueError: array is too big." exception is raised.
If the matrix/array is not dense, you could use the sparse matrix representations available in scipy: http://docs.scipy.org/doc/scipy/reference/sparse.html
Another solution is to access your array in chunks if you don't need the whole array at once; check out this thread: Very large matrices using Python and NumPy
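As a minimal sketch of the sparse route (assuming most entries stay zero):
import scipy.sparse as sp

# no dense 60000 x 60000 buffer is ever allocated
m = sp.lil_matrix((60000, 60000), dtype=float)
m[0, 0] = 1.0  # set individual entries as needed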
