Performance in reading DICOM files with SimpleITK and PyTorch - python

I want to load an image from memory directly into a PyTorch tensor in Python.
I modified SimpleITK's GetArrayViewFromImage() function by replacing these lines:
image_memory_view = _GetMemoryViewFromImage(image)
array_view = numpy.asarray(image_memory_view).view(dtype = dtype)
with:
image_memory_view = _GetMemoryViewFromImage(image)
array_view = torch.as_tensor(image_memory_view, dtype = dtype)
In practice this was so slow that I replaced it with:
image_memory_view = _GetMemoryViewFromImage(image)
array_view = numpy.asarray(image_memory_view).view(dtype = dtype)
array_view = torch.as_tensor(array_view)
Now I have two questions:
The direct torch.as_tensor() version is much slower, and I don't understand why reading with numpy and then converting is faster.
Even though I pass the dtype argument and get back a tensor with the correct dtype, the values are read incorrectly (e.g. -1000 in numpy is read as 252, no matter which torch.dtype I choose). This does not happen when reading with numpy and converting. Why is that?

While this does not directly answer your question, I strongly recommend using the torchio package, instead of dealing with these IO issues yourself (torchio uses SimpleITK under the hood).
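One possible explanation for both symptoms, assuming _GetMemoryViewFromImage returns a flat buffer of raw bytes (the .view(dtype = dtype) call in the original code suggests it does): numpy.asarray wraps such a buffer without copying and .view() reinterprets groups of bytes as pixels, whereas a conversion that walks the buffer as a Python sequence handles one byte at a time and then casts each byte to the requested dtype. The sketch below reproduces that difference with numpy alone. The pixel values and variable names are made up for illustration, and whether torch.as_tensor takes exactly this path for a memoryview is an assumption, not something verified here.

```python
import numpy as np

# Made-up int16 pixel values standing in for a DICOM slice.
pixels = np.array([-1000, 0, 1000], dtype=np.int16)

# Stand-in for the raw byte buffer: 2 bytes per int16 pixel.
raw = memoryview(pixels.tobytes())

# Per-element walk: every single byte becomes its own value, then gets cast.
per_byte = np.array(list(raw), dtype=np.int16)

# Zero-copy wrap of the bytes, then reinterpretation as int16 pixels.
reinterpreted = np.asarray(raw).view(np.int16)

print(per_byte)       # six values, one per byte
print(reinterpreted)  # three values, one per pixel
```

If this is what happens, 252 (0xFC) would simply be one of the two bytes of the int16 value -1000 (0xFC18), and the per-element walk would also account for the slowdown.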

Related

What to do when the data is too big to be stored in memory

I want to train a neural network using Python (3.6.9) and TensorFlow (2.4.0), but my dataset is too big to be stored in memory.
A bit of context:
My network takes as input a small complex matrix of dimension 64 by 32.
My dataset is stored in the form of a very large ".mat" file generated by a matlab code.
In the mat file, the samples are stored in a large cell array.
I use the h5py library to open the mat file.
Example of python code to load only one sample from the file :
f = h5py.File('dataset.mat', 'r')
refs = f['data'] # array of reference of each sample
sample = f[refs[0]][()].view(np.complex) # load the first sample
Currently, I load only a small part of the dataset that I store in a tensorflow dataset (ds = tf.data.Dataset.from_tensor_slices(datas)).
I would like to take advantage of the possibility offered by the h5py library to be able to load each example individually to load the examples on the fly during network training.
I tried the following approach:
f = h5py.File('dataset.mat', 'r')
refs = f['data'] # array of reference of each sample
ds_index = tf.data.Dataset.range(len(refs))
ds = ds_index.map(lambda i : f[refs[i]][()].view(np.complex))
but, I have the following error :
NotImplementedError: in user code:
<ipython-input-66-6cf802c8359a>:15 __call__ *
return self._f[self._rs[i]]['channel'][()].view(np.complex).astype(np.complex64).T
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py:855 __array__
" a NumPy call, which is not supported".format(self.name))
NotImplementedError: Cannot convert a symbolic Tensor (args_0:0) to a numpy array. This error may indicate that you're trying to pass a Tensor to a NumPy call, which is not supported
Do you know how to fix this error, or is there a better way to load examples on the fly?
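The error arises because Dataset.map traces the lambda into a graph, so i is a symbolic Tensor that cannot be used to index an h5py object. One way around it, sketched here under assumptions, is tf.data.Dataset.from_generator, which calls ordinary Python code for each element. The helper names iter_samples and make_dataset are hypothetical, the file layout is taken from the question as-is, and np.complex128 replaces np.complex (the same type, but the old alias was removed from newer NumPy releases). This has not been run against the asker's file.

```python
import numpy as np


def iter_samples(num_samples, load_sample):
    """Yield one complex64 sample at a time instead of holding them all in memory."""
    for i in range(num_samples):
        yield np.asarray(load_sample(i)).astype(np.complex64)


def make_dataset(path, key="data"):
    """Hypothetical helper: wrap an HDF5-based .mat file in a tf.data.Dataset."""
    # Imported here so that iter_samples can be used without these packages.
    import h5py
    import tensorflow as tf

    f = h5py.File(path, "r")
    refs = f[key]  # array of references, indexed the same way as in the question

    def load_sample(i):
        # Runs as ordinary Python, so i is a plain int, not a symbolic Tensor.
        return f[refs[i]][()].view(np.complex128)

    return tf.data.Dataset.from_generator(
        lambda: iter_samples(len(refs), load_sample),
        output_types=tf.complex64,
    )
```

A call such as ds = make_dataset('dataset.mat').batch(32) would then read samples from disk only as training consumes them. Reading through Python is slower than native TensorFlow ops, so adding .prefetch() may help.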

Is there any way to use arithmetic ops on FITS files in Python?

I'm fairly new to Python, and I have been trying to port a working IDL program to Python, but I'm stuck and keep getting errors. I haven't been able to find a solution yet.
The program requires 4 FITS files in total (img and correctional images dark, flat1, flat2). The operations are as follows:
flat12 = (flat1 + flat2)/2
img1 = (img - dark)/flat12
The said files have dimensions (1024,1024,1). I have resized them to (1024,1024) to be able to even use im_show() function.
I have also tried using cv2.add(), but I get this:
TypeError: Expected Ptr for argument 'src1'
Is there any workaround for this? Thanks in advance.
To read your FITS files use astropy.io.fits: http://docs.astropy.org/en/latest/io/fits/index.html
This will give you Numpy arrays (and FITS headers if needed, there are different ways to do this, as explained in the documentation), so you could do something like:
>>> from astropy.io import fits
>>> img = fits.getdata('image.fits', ext=0) # extension number depends on your FITS files
>>> dark = fits.getdata('dark.fits') # by default it reads the first "data" extension
>>> darksub = img - dark
>>> fits.writeto('out.fits', darksub) # save output
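If your data has an extra dimension of length 1, as the (1024,1024,1) shape suggests, and you want to remove that axis, you can use normal Numpy indexing: img[0] drops a leading axis and img[..., 0] drops a trailing one, and np.squeeze(img) removes length-1 axes wherever they are (check img.shape to see which case applies). For example: darksub = np.squeeze(img) - np.squeeze(dark).
Otherwise the example above will produce and save an image that still has the extra axis.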
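Since fits.getdata returns plain Numpy arrays, the two formulas from the question can be written directly with array arithmetic, with no need for cv2.add. The sketch below is a minimal illustration: the function name calibrate is hypothetical, and the conversion to float64 is an added assumption, made so that subtracting unsigned integer pixels cannot wrap around and the division is not truncated.

```python
import numpy as np


def calibrate(img, dark, flat1, flat2):
    """Apply dark subtraction and flat-field division to image arrays."""
    # Convert to float and drop length-1 axes such as (1024, 1024, 1).
    img, dark, flat1, flat2 = (
        np.squeeze(np.asarray(a, dtype=np.float64))
        for a in (img, dark, flat1, flat2)
    )
    flat12 = (flat1 + flat2) / 2   # average of the two flats
    return (img - dark) / flat12   # calibrated image


# Tiny synthetic arrays standing in for data read with fits.getdata().
img = np.full((2, 2, 1), 10, dtype=np.uint16)
dark = np.full((2, 2, 1), 12, dtype=np.uint16)
flat1 = np.full((2, 2, 1), 2, dtype=np.uint16)
flat2 = np.full((2, 2, 1), 4, dtype=np.uint16)

img1 = calibrate(img, dark, flat1, flat2)
print(img1.shape)
```

The result can be saved with fits.writeto as in the answer above.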

Numpy fromfile function does not work properly

I'm working on some code where I need to write and read files. I'm using Python and numpy, but the numpy fromfile function does not seem to work properly. First I create an array with 500 elements and save it with the savetxt function. I check the file and it is all right, just how I wanted.
import numpy as np
weight = np.zeros(500, float)
np.savetxt("weights.txt", weight, '%.100f')
print(weight[2])
But after I replace the line that creates the array with one that reads it from the file, a problem occurs: the zeros turn into really small numbers. I can't work out why. Here is the line where I read from the file:
weight = np.fromfile("weights.txt", float, -1)
Should I write a custom function that turns files into arrays, or is there a way to make it work?
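A likely cause, given the snippet above: np.fromfile treats the file as raw binary unless a sep argument is given, so the ASCII characters written by savetxt are reinterpreted as the bytes of floats, which shows up as tiny nonsense values. The sketch below shows matching writer and reader pairs. The scratch directory and file paths are only for illustration, and a shorter format string is used than in the question.

```python
import os
import tempfile

import numpy as np

weight = np.zeros(500, float)
folder = tempfile.mkdtemp()  # scratch directory, only for this illustration

# Text round trip: savetxt pairs with loadtxt.
text_path = os.path.join(folder, "weights.txt")
np.savetxt(text_path, weight, fmt="%.18e")
from_text = np.loadtxt(text_path)

# fromfile can also parse text, but only when a separator is given.
from_text_sep = np.fromfile(text_path, dtype=float, sep=" ")

# Binary round trip: tofile pairs with fromfile (no separator).
bin_path = os.path.join(folder, "weights.bin")
weight.tofile(bin_path)
from_binary = np.fromfile(bin_path, dtype=float)

print(from_text[2], from_text_sep[2], from_binary[2])
```

So no custom function should be needed; the reader just has to match the format the file was written in.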

How to save numpy masked array to file

What is the most efficient way of saving a numpy masked array? Unfortunately numpy.save doesn't work:
import numpy as np
a = np.ma.zeros((500, 500))
np.save('test', a)
This gives a:
NotImplementedError: Not implemented yet, sorry...
One way seems to be using pickle, but that unfortunately is not very efficient (huge file sizes), and not platform-independent. Also, netcdf4 seems to work, but it has a large overhead just to save a simple array.
Has anyone had this problem before? I'm tempted to just do one numpy.save of array.data and another of the mask.
import numpy as np
a = np.ma.zeros((500, 500))
a.dump('test')
then read it with
a = np.load('test', allow_pickle=True)  # dump() pickles the array; recent NumPy versions require allow_pickle=True
The current accepted answer is somewhat obsolete, and badly inefficient if the array being stored is sparse (it relies on uncompressed pickling of the array).
A better way to save/load a masked array would be to use an npz file:
import numpy as np
# Saving masked array 'arr':
np.savez_compressed('test.npz', data=arr.data, mask=arr.mask)
# Loading array back
with np.load('test.npz') as npz:
    arr = np.ma.MaskedArray(**npz)
If you have a fixed mask that doesn't need to be saved, then you can just save the valid values:
a = np.ma.MaskedArray(values,mask)
np.save('test', a.compressed())
You can then recover it doing something like:
compressed = np.load('test.npy')  # np.save appends the .npy extension
values = np.zeros_like(mask, dtype=compressed.dtype)
np.place(values, ~mask, compressed)
a = np.ma.MaskedArray(values, mask)
A simple way to do it would be to save the data and mask of the masked array separately:
np.save('DIN_WOA09.npy',DIN_woa.data)
np.save('mask_WOA09.npy',DIN_woa.mask)
Then later, you can reconstruct the masked array from the data and mask.
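The reconstruction step can be sketched as follows. The small example array, the scratch directory and the file names are made up for illustration. np.ma.getmaskarray is used here as an extra precaution, on the assumption that the array might have no masked values, in which case .mask can be a single False rather than a full array.

```python
import os
import tempfile

import numpy as np

# Small made-up masked array standing in for the real data.
arr = np.ma.MaskedArray([1.0, 2.0, 3.0, 4.0], mask=[False, True, False, True])

folder = tempfile.mkdtemp()  # scratch directory, only for this illustration
data_path = os.path.join(folder, "data.npy")
mask_path = os.path.join(folder, "mask.npy")

# Save the underlying values and a full boolean mask separately.
np.save(data_path, arr.data)
np.save(mask_path, np.ma.getmaskarray(arr))

# Later: load both files and rebuild the masked array.
restored = np.ma.MaskedArray(np.load(data_path), mask=np.load(mask_path))
print(restored)
```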
Saving it inside a dictionary will allow you to keep its original format and mask without any trouble. Something like:
b={}
b['a'] = a
np.save('b', b)
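should work fine. Note that the dictionary is stored by pickling, so loading it back takes np.load('b.npy', allow_pickle=True).item().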

Building a huge numpy array using pytables

How can I create a huge numpy array using pytables. I tried this but gives me the "ValueError: array is too big." error:
import numpy as np
import tables as tb
ndim = 60000
h5file = tb.openFile('test.h5', mode='w', title="Test Array")
root = h5file.root
h5file.createArray(root, "test", np.zeros((ndim,ndim), dtype=float))
h5file.close()
Piggybacking off of @b1r3k's response, to create an array that you are not going to access all at once (i.e. bring the whole thing into memory), you want to use a CArray (Chunked Array). The idea is that you would then fill and access it incrementally:
import numpy as np
import tables as tb
ndim = 60000
h5file = tb.open_file('test.h5', mode='w', title="Test Array")  # PyTables 3.x name; 2.x used openFile
root = h5file.root
x = h5file.create_carray(root,'x',tb.Float64Atom(),shape=(ndim,ndim))  # PyTables 2.x: createCArray
x[:100,:100] = np.random.random(size=(100,100)) # Now put in some data
h5file.close()
You could try the tables.CArray class, as it supports compression, but...
I think the question is more about numpy than pytables, because you are creating the array with numpy before storing it with pytables.
That way you need a lot of RAM to execute np.zeros((ndim,ndim)): at 8 bytes per float, a 60000 x 60000 array takes about 28.8 GB. This is probably where the exception "ValueError: array is too big." is raised.
If the matrix/array is not dense, you could use the sparse matrix representation available in scipy: http://docs.scipy.org/doc/scipy/reference/sparse.html
Another solution is to access your array in chunks if you don't need the whole array at once - check out this thread: Very large matrices using Python and NumPy
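As a rough illustration of chunked access without PyTables, the sketch below uses numpy.memmap, which is a different technique from the CArray answer: the array lives in a file on disk and only the blocks being touched are brought into memory. The size is deliberately tiny (1000 instead of 60000), and the path and block size are arbitrary choices for the example.

```python
import os
import tempfile

import numpy as np

ndim = 1000   # small stand-in for the 60000 in the question
block = 100   # number of rows written per step

path = os.path.join(tempfile.mkdtemp(), "test.dat")

# Create a disk-backed array; nothing close to ndim*ndim floats is held in RAM.
big = np.memmap(path, dtype=np.float64, mode="w+", shape=(ndim, ndim))

# Fill it block by block; each block stores the index of its first row.
for start in range(0, ndim, block):
    big[start:start + block, :] = start

big.flush()   # make sure everything is written to disk
del big

# Reopen read-only and touch only the rows that are needed.
readback = np.memmap(path, dtype=np.float64, mode="r", shape=(ndim, ndim))
print(readback[250, 0])
```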
