store/load numpy array from binary files - python

I would like to store and load numpy arrays from binary files. For that purpose, I created two small functions. Each binary file should contain the dimensionality of the given matrix.
def saveArrayToFile(data, fileName):
    with open(fileName, 'w') as file:
        a = array.array('f')
        nSamples, ndim = data.shape
        a.extend([nSamples, ndim])  # write number of elements and dimensions
        a.fromstring(data.tostring())
        a.tofile(file)

def readArrayFromFile(fileName):
    _featDesc = np.fromfile(fileName, 'f')
    _ndesc = int(_featDesc[0])
    _ndim = int(_featDesc[1])
    _featDesc = _featDesc[2:]
    _featDesc = _featDesc.reshape([_ndesc, _ndim])
    return _featDesc, _ndesc, _ndim
An example on how to use the functions is:
myarr=np.array([[7, 4],[3, 9],[1, 3]])
saveArrayToFile(myarr,'myfile.txt')
_featDesc, _ndesc, _ndim = readArrayFromFile('myfile.txt')
However, this raises 'ValueError: total size of new array must be unchanged'. My arrays can be of size MxN or MxM. Any suggestions are more than welcome.
I think the problem might be in the saveArrayToFile function.
Best wishes,
Javier

Use numpy.save (and numpy.load) to dump (retrieve) numpy arrays to (from) a binary file.
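For example, a minimal round trip (a sketch using the myarr from the question; the .npy file name is just an illustration):
import numpy as np
myarr = np.array([[7, 4], [3, 9], [1, 3]])
np.save('myfile.npy', myarr)    # np.save stores dtype and shape along with the data
loaded = np.load('myfile.npy')  # comes back as a (3, 2) integer array, no manual reshaping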

Related

Saving 3 dimensional array to a file in Python

I want to save a 3-dimensional array's values to a txt or csv file in Python.
dCx, dCy
I used:
numpy.savetxt('C:/Users/musa/Desktop/LOCO_All_tests/FODO_Example/AllQ/dCx.csv',dCx,delimiter=',')
numpy.savetxt('C:/Users/musa/Desktop/LOCO_All_tests/FODO_Example/AllQ/dCy.csv',dCy,delimiter=',')
And to load it again:
dCx = numpy.genfromtxt('C:/Users/musa/Desktop/LOCO_All_tests/FODO_Example/AllQ/dCx.csv', delimiter=',')
dCy = numpy.genfromtxt('C:/Users/musa/Desktop/LOCO_All_tests/FODO_Example/AllQ/dCy.csv', delimiter=',')
But I got the error message
"Expected 1D or 2D array, got 3D array instead"
So I wanted to convert the 3D arrays to 2D first, save those to the files, and then convert them back to 3D after loading, for example:
dCx2 = np.array(dCx).reshape(np.array(dCx).shape[0], -1)
dCy2 = np.array(dCy).reshape(np.array(dCy).shape[0], -1)
and after loading them into variables named dCx3 and dCy3, I used:
dCx = np.array(dCx3).reshape(
    np.array(dCx3).shape[0], np.array(dCx3).shape[1] // np.array(dCx).shape[2], np.array(dCx).shape[2])
#dCy = np.array(dCy3).reshape(
#    np.array(dCy3).shape[0], np.array(dCy3).shape[1] // np.array(dCy).shape[2], np.array(dCy).shape[2])
I am looking for a better method for saving the 3D arrays to a file, or a method to convert the 2D arrays back to 3D without having to look up the shape of the original arrays every time, as is done in this line:
np.array(dCy).shape[2], np.array(dCy).shape[2])
Use numpy.save(filepath, data) and data = numpy.load(filepath).
This is a binary file format that works for NumPy arrays of any shape and dtype, so the 3D shape is preserved without any manual reshaping.
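For example (a sketch, assuming dCx is one of the 3D arrays from the question; the file name is just an illustration):
import numpy as np
np.save('dCx.npy', dCx)    # binary .npy file; shape and dtype are stored as well
dCx3 = np.load('dCx.npy')  # returns the original 3D array, no reshaping needed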
Try tofile. It works in my case, but the array will be written as 1D (the shape is not stored).
import numpy as np
arr=np.arange(0,21).reshape(7,3)
arr.tofile('file.txt',sep=',')
arr2=np.fromfile('file.txt',sep=',')
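Note that np.fromfile with sep=',' returns a flat array of floats, so to recover the original layout you still have to reshape (and cast) it yourself; a sketch using the shape from above:
arr2 = arr2.reshape(7, 3).astype(arr.dtype)  # restore the 2D shape and integer dtype manually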

No conversion path for dtype ('<U1') issue

I have a 2D list (Data_set) whose rows each contain a 3D array and a label (0 or 1). I want to create an h5py file with two datasets, one for the 3D arrays and one for the labels. This is my code for doing that:
data = []
label = []
for i in range(len(Data_set)):
    data.append(Data_set[i][0])   # 3d array
    label.append(Data_set[i][1])  # label
data = np.array(data)
label = np.array(label)
dt = np.dtype('int16')
with h5py.File(output_path + 'dataset.h5', 'w') as hf:
    hf.create_dataset('data', dtype=dt, data=data, compression='lzf')
    hf.create_dataset('label', dtype=dt, data=label, compression='lzf')
The content of the 2D list is shown in the image below:
But when I run the code it gives me an error (see the image below).
Please help me solve this problem.
Your labels are not integers, they are strings, and that's a problem for HDF5. Your error message relates to an array consisting of strings of length 1 ('<U1'). See Strings in HDF5 for more details.
You can convert to integers before or after you construct your NumPy array, here are a couple of examples:
label = np.array(label).astype(int)
# or, label = np.array(list(map(int, label)))
Alternatively, since your values are 0 or 1, choosing bool may be more efficient:
label = np.array(label).astype(int).astype(bool)
Also, consider holding meta-data as attributes.
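For instance (a sketch, still inside the with block above; the attribute names are just illustrations):
hf.attrs['n_samples'] = len(label)     # file-level attribute
hf['label'].attrs['classes'] = [0, 1]  # attributes can also hang off a dataset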

Convert 3D numpy Array in Python to 3D Matrix in Matlab [duplicate]

I am looking for a way to pass NumPy arrays to Matlab.
I've managed to do this by storing the array into an image using scipy.misc.imsave and then loading it using imread, but this of course causes the matrix to contain values between 0 and 256 instead of the 'real' values.
Taking the product of this matrix divided by 256, and the maximum value in the original NumPy array gives me the correct matrix, but I feel that this is a bit tedious.
Is there a simpler way?
Sure, just use scipy.io.savemat
As an example:
import numpy as np
import scipy.io
x = np.linspace(0, 2 * np.pi, 100)
y = np.cos(x)
scipy.io.savemat('test.mat', dict(x=x, y=y))
Similarly, there's scipy.io.loadmat.
You then load this in MATLAB with load test.
Alternatively, as @JAB suggested, you could just save things to an ascii tab-delimited file (e.g. numpy.savetxt). However, you'll be limited to 2 dimensions if you go this route. On the other hand, ascii is the universal exchange format. Pretty much anything will handle a delimited text file.
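A minimal sketch of that text-file route (reusing x and y from above; the file name is just an illustration):
np.savetxt('xy.txt', np.column_stack([x, y]), delimiter='\t')  # 2-D data only
xy = np.loadtxt('xy.txt', delimiter='\t')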
A simple solution, without passing data through a file or external libs.
NumPy has a method to transform ndarrays to lists, and MATLAB data types can be defined from lists. So we can transform like this:
np_a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
mat_a = matlab.double(np_a.tolist())
Going from MATLAB to Python requires more attention. There is no built-in function to convert the type directly to lists, but we can access the raw data, which is flat rather than shaped. So we use reshape (to restore the shape) and transpose (because MATLAB and NumPy store data in different orders). It is really important to stress: test it in your project, especially if you are using matrices with more than 2 dimensions. It works for MATLAB 2015a and 2 dims.
np_a = np.array(mat_a._data.tolist())
np_a = np_a.reshape(mat_a.size).transpose()
Here's a solution that avoids iterating in python, or using file IO - at the expense of relying on (ugly) matlab internals:
import numpy as np
import matlab
# This is actually `matlab._internal`, but matlab/__init__.py
# mangles the path making it appear as `_internal`.
# Importing it under a different name would be a bad idea.
from _internal.mlarray_utils import _get_strides, _get_mlsize
def _wrapper__init__(self, arr):
    assert arr.dtype == type(self)._numpy_type
    self._python_type = type(arr.dtype.type().item())
    self._is_complex = np.issubdtype(arr.dtype, np.complexfloating)
    self._size = _get_mlsize(arr.shape)
    self._strides = _get_strides(self._size)[:-1]
    self._start = 0
    if self._is_complex:
        self._real = arr.real.ravel(order='F')
        self._imag = arr.imag.ravel(order='F')
    else:
        self._data = arr.ravel(order='F')
_wrappers = {}
def _define_wrapper(matlab_type, numpy_type):
    t = type(matlab_type.__name__, (matlab_type,), dict(
        __init__=_wrapper__init__,
        _numpy_type=numpy_type
    ))
    # this tricks matlab into accepting our new type
    t.__module__ = matlab_type.__module__
    _wrappers[numpy_type] = t
_define_wrapper(matlab.double, np.double)
_define_wrapper(matlab.single, np.single)
_define_wrapper(matlab.uint8, np.uint8)
_define_wrapper(matlab.int8, np.int8)
_define_wrapper(matlab.uint16, np.uint16)
_define_wrapper(matlab.int16, np.int16)
_define_wrapper(matlab.uint32, np.uint32)
_define_wrapper(matlab.int32, np.int32)
_define_wrapper(matlab.uint64, np.uint64)
_define_wrapper(matlab.int64, np.int64)
_define_wrapper(matlab.logical, np.bool_)
def as_matlab(arr):
    try:
        cls = _wrappers[arr.dtype.type]
    except KeyError:
        raise TypeError("Unsupported data type")
    return cls(arr)
The observations necessary to get here were:
Matlab seems to only look at type(x).__name__ and type(x).__module__ to determine if it understands the type
It seems that any indexable object can be placed in the ._data attribute
Unfortunately, matlab is not using the _data attribute efficiently internally, and is iterating over it one item at a time rather than using the python memoryview protocol :(. So the speed gain is marginal with this approach.
scipy.io.savemat and scipy.io.loadmat do NOT work for MATLAB -v7.3 files. But the good part is that MATLAB -v7.3 files are HDF5 datasets, so they can be read with a number of tools, including Python's h5py.
For python, you will need the h5py extension, which requires HDF5 on your system.
import numpy as np, h5py
f = h5py.File('somefile.mat','r')
data = f.get('data/variable1')
data = np.array(data) # For converting to numpy array
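Note that MATLAB stores arrays column-major, so an array read this way comes back with its axes reversed relative to MATLAB; if that matters, reverse them again on the NumPy side:
data = np.array(data).transpose()  # reverse the axis order to match MATLAB's layout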
Some time ago I faced the same problem and wrote the following scripts to allow easy copy and pasting of arrays back and forth from interactive sessions. Obviously only practical for small arrays, but I found it more convenient than saving/loading through a file every time:
Matlab -> Python
Python -> Matlab
Not sure if it counts as "simpler", but I found a way to move data quite fast from a NumPy array created in a Python script that is called by MATLAB:
dump_reader.py (python source):
import numpy
def matlab_test2():
    np_a = numpy.random.uniform(low=0.0, high=30000.0, size=(1000, 1000))
    return np_a
dump_read.m (matlab script):
clear classes
mod = py.importlib.import_module('dump_reader');
py.importlib.reload(mod);
if count(py.sys.path,'') == 0
    insert(py.sys.path,int32(0),'');
end
tic
A = py.dump_reader.matlab_test2();
toc
shape = cellfun(@int64,cell(A.shape));
ls = py.array.array('d',A.flatten('F').tolist());
p = double(ls);
toc
C = reshape(p,shape);
toc
It relies on the fact that MATLAB's double seems to work efficiently on arrays, compared to cells/matrices. The second trick is to pass the data to MATLAB's double in an efficient way (via Python's native array.array).
P.S. Sorry for necroposting, but I struggled a lot with this and this topic was one of the closest hits. Maybe it helps someone shorten the time of struggling.
P.P.S. tested with Matlab R2016b + python 3.5.4 (64bit)
The python library Darr allows you to save your Python numpy arrays in a self-documenting and widely readable format, consisting of just binary and text files. When saving your array, it will include code to read that array in a variety of languages, including Matlab. So in essence, it is just one line to save your numpy array to disk in Python, and then copy-paste the code from the README.txt to load it into Matlab.
Disclosure: I wrote the library.
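A minimal sketch (assuming Darr's asarray function; check the Darr documentation for the exact call and options):
import darr
import numpy as np
a = np.arange(24).reshape(2, 3, 4)
darr.asarray('myarray.darr', a)  # writes the binary data plus a README that includes MATLAB read code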
From MATLAB R2022a on, matlab.double (and matlab.int8, matlab.uint8, etc.) objects implement the buffer protocol. This means that you can pass them into NumPy array constructors. Construction in the opposite direction (which is the subject of the question here) is supported as well. That is, matlab objects can be constructed from objects that implement the buffer protocol. Thus, for instance, a matlab.double can be constructed from a NumPy double array.
UPDATE: Furthermore, from MATLAB R2022b on, objects that implement the buffer protocol (such as NumPy objects) can be passed directly into MATLAB functions that are called via Python (see the MATLAB Release Notes for R2022b, under the "External Language Interfaces" section). For example:
import matlab.engine
import numpy
eng = matlab.engine.start_matlab()
buf = numpy.array([[1, 2, 3], [4, 5, 6]], dtype='uint16')
# Supported in R2022a and earlier: must initialize a matlab.uint16 from
# the numpy array and pass it to the function
array_as_matlab_uint16 = matlab.uint16(buf)
res = eng.sum(array_as_matlab_uint16, 1, 'native')
print(res)
# Supported as of R2022b: can pass the numpy array
# directly to the function
res = eng.sum(buf, 1, 'native')
print(res)
Let us say you have 2D daily data with shape (365,10) for five years, stacked in a NumPy array np3Darray that will have shape (5,365,10). In Python, save your array:
import scipy.io as sio  # SciPy module to load and save mat-files
m = {}                  # dictionary of variables to save
m['np3Darray'] = np3Darray  # shape (5,365,10)
sio.savemat('file.mat', m)  # save np 3D array
Then in MATLAB, convert the NumPy 3D array to a MATLAB 3D matrix:
load('file.mat','np3Darray')
M3D=permute(np3Darray, [2 3 1]); %Convert numpy array with shape (5,365,10) to MATLAB matrix with shape (365,10,5)
In R2021a and later, you can pass a Python NumPy ndarray to double() and it will convert to a native MATLAB matrix; even when displaying the NumPy array in the console, MATLAB suggests at the bottom "Use double function to convert to a MATLAB array".

How to efficiently write a binary file containing mixed label and image data

The cifar10 tutorial deals with binary files as input. Each record/example in these CIFAR10 data files contains the label (first element) and the image data mixed together. The first answer on this page shows how to write a binary file from a numpy array (which accumulates the label and image data in each row) using ndarray.tofile() as follows:
import numpy as np
images_and_labels_array = np.array([[...], ...], dtype=np.uint8)
images_and_labels_array.tofile("/tmp/images.bin")
This is perfect for me when the maximum number of classes is 256, as the uint8 datatype is sufficient. However, when the maximum number of classes is more than 256, I have to switch to dtype=np.uint16 for the whole images_and_labels_array. The consequence is that the file size simply doubles. I would like to know if there is an efficient way to overcome this. If yes, please provide an example.
When I write binary files I usually just use the Python module struct, which works something like this:
import struct
import numpy as np

image = np.zeros([2, 300, 300], dtype=np.uint8)
label = np.zeros([2, 1], dtype=np.uint16)

with open('data.bin', 'wb') as fo:  # binary mode, since we write packed bytes
    s = image.shape
    for k in range(s[0]):
        # write label as uint16
        fo.write(struct.pack('H', label[k, 0]))
        # write image as uint8
        for i in range(s[1]):
            for j in range(s[2]):
                fo.write(struct.pack('B', image[k, i, j]))
This should result in a binary file of 2*300*300 + 2*1*2 = 180004 bytes (image bytes plus label bytes).
It's probably not the fastest way to get the job done, but for me it has worked sufficiently fast so far. For other datatypes, see the struct documentation.
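A vectorized alternative is to build a NumPy structured array with a uint16 label field and a uint8 image field and write it with tofile(); the image stays at one byte per pixel and only the label costs two bytes per record. A sketch, assuming the same shapes as above (the field and file names are just illustrations):
import numpy as np

record_dtype = np.dtype([('label', np.uint16), ('image', np.uint8, (300, 300))])
records = np.zeros(2, dtype=record_dtype)  # two records, same sizes as above
records['label'] = [0, 1]                  # 2 bytes per label, 1 byte per pixel
records.tofile('data_structured.bin')      # 2 * (2 + 300*300) = 180004 bytes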

Attach a queue to a numpy array in tensorflow for data fetch instead of files?

I have read the CNN tutorial on TensorFlow and I am trying to use the same model for my project.
The problem is now in data reading. I have around 25000 images for training and around 5000 for testing and validation each. The files are in png format and I can read them and convert them into the numpy.ndarray.
The CNN example in the tutorial uses a queue to fetch the records from the file list provided. I tried to create my own binary file by reshaping each image into a 1-D array and prepending a label value to it. So my data looks like this
[[1,12,34,24,53,...,105,234,102],
[12,112,43,24,52,...,115,244,98],
....
]
A single row of the above array has length 22501, where the first element is the label.
I dumped the array to a file using pickle and then tried to read from it using tf.FixedLengthRecordReader, as demonstrated in the example.
I am doing the same things as given in the cifar10_input.py to read the binary file and putting them into the record object.
Now when I read from the files, the labels and the image values come out wrong. I understand the reason to be that pickle also writes extra framing information into the binary file, which changes the fixed record length.
The above example passes the filenames to a queue to fetch the files, and then reads a single record at a time from the queue.
I want to know if I can pass the numpy array as defined above instead of the filenames to some reader and it can fetch records one by one from that array instead of the files.
Probably the easiest way to make your data work with the CNN example code is to make a modified version of read_cifar10() and use it instead:
Write out a binary file containing the contents of your numpy array.
import numpy as np
images_and_labels_array = np.array([[...], ...],  # [[1,12,34,24,53,...,102],
                                                   #  [12,112,43,24,52,...,98],
                                                   #  ...]
                                    dtype=np.uint8)
images_and_labels_array.tofile("/tmp/images.bin")
This file is similar to the format used in CIFAR10 datafiles. You might want to generate multiple files in order to get read parallelism. Note that ndarray.tofile() writes binary data in row-major order with no other metadata; pickling the array will add Python-specific metadata that TensorFlow's parsing routines do not understand.
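For illustration, a short sketch of reading such a file back in plain NumPy; you have to supply the record layout yourself because tofile() stored none of it:
raw = np.fromfile("/tmp/images.bin", dtype=np.uint8).reshape(-1, 22501)
labels, images = raw[:, 0], raw[:, 1:]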
Write a modified version of read_cifar10() that handles your record format.
def read_my_data(filename_queue):
    class ImageRecord(object):
        pass
    result = ImageRecord()

    # Dimensions of the images in the dataset.
    label_bytes = 1
    # Set the following constants as appropriate.
    result.height = IMAGE_HEIGHT
    result.width = IMAGE_WIDTH
    result.depth = IMAGE_DEPTH
    image_bytes = result.height * result.width * result.depth

    # Every record consists of a label followed by the image, with a
    # fixed number of bytes for each.
    record_bytes = label_bytes + image_bytes
    assert record_bytes == 22501  # Based on your question.

    # Read a record, getting filenames from the filename_queue. No
    # header or footer in the binary, so we leave header_bytes
    # and footer_bytes at their default of 0.
    reader = tf.FixedLengthRecordReader(record_bytes=record_bytes)
    result.key, value = reader.read(filename_queue)

    # Convert from a string to a vector of uint8 that is record_bytes long.
    record_bytes = tf.decode_raw(value, tf.uint8)

    # The first bytes represent the label, which we convert from uint8->int32.
    result.label = tf.cast(
        tf.slice(record_bytes, [0], [label_bytes]), tf.int32)

    # The remaining bytes after the label represent the image, which we reshape
    # from [depth * height * width] to [depth, height, width].
    depth_major = tf.reshape(tf.slice(record_bytes, [label_bytes], [image_bytes]),
                             [result.depth, result.height, result.width])

    # Convert from [depth, height, width] to [height, width, depth].
    result.uint8image = tf.transpose(depth_major, [1, 2, 0])

    return result
Modify distorted_inputs() to use your new dataset:
def distorted_inputs(data_dir, batch_size):
    """[...]"""
    filenames = ["/tmp/images.bin"]  # Or a list of filenames if you
                                     # generated multiple files in step 1.
    for f in filenames:
        if not gfile.Exists(f):
            raise ValueError('Failed to find file: ' + f)

    # Create a queue that produces the filenames to read.
    filename_queue = tf.train.string_input_producer(filenames)

    # Read examples from files in the filename queue.
    read_input = read_my_data(filename_queue)
    reshaped_image = tf.cast(read_input.uint8image, tf.float32)

    # [...] (Maybe modify other parameters in here depending on your problem.)
This is intended to be a minimal set of steps, given your starting point. It may be more efficient to do the PNG decoding using TensorFlow ops, but that would be a larger change.
In your question, you specifically asked:
I want to know if I can pass the numpy array as defined above instead of the filenames to some reader and it can fetch records one by one from that array instead of the files.
You can feed the numpy array to a queue directly, but it will be a more invasive change to the cifar10_input.py code than my other answer suggests.
As before, let's assume you have the following array from your question:
import numpy as np
images_and_labels_array = np.array([[...], ...],  # [[1,12,34,24,53,...,102],
                                                   #  [12,112,43,24,52,...,98],
                                                   #  ...]
                                    dtype=np.uint8)
You can then define a queue that contains the entire data as follows:
q = tf.FIFOQueue(capacity=images_and_labels_array.shape[0],
                 dtypes=[tf.uint8, tf.uint8], shapes=[[], [22500]])
enqueue_op = q.enqueue_many([images_and_labels_array[:, 0],
                             images_and_labels_array[:, 1:]])
...then call sess.run(enqueue_op) to populate the queue.
Another—more efficient—approach would be to feed records to the queue, which you could do from a parallel thread (see this answer for more details on how this would work):
# [With q as defined above.]
label_input = tf.placeholder(tf.uint8, shape=[])
image_input = tf.placeholder(tf.uint8, shape=[22500])
enqueue_single_from_feed_op = q.enqueue([label_input, image_input])

# Then, to enqueue a single example `i` from the array.
sess.run(enqueue_single_from_feed_op,
         feed_dict={label_input: images_and_labels_array[i, 0],
                    image_input: images_and_labels_array[i, 1:]})
Alternatively, to enqueue a batch at a time, which will be more efficient:
label_batch_input = tf.placeholder(tf.uint8, shape=[None])
image_batch_input = tf.placeholder(tf.uint8, shape=[None, 22500])
enqueue_batch_from_feed_op = q.enqueue_many([label_batch_input, image_batch_input])

# Then, to enqueue a batch of examples `i` through `j-1` from the array.
sess.run(enqueue_batch_from_feed_op,
         feed_dict={label_batch_input: images_and_labels_array[i:j, 0],
                    image_batch_input: images_and_labels_array[i:j, 1:]})
I want to know if I can pass the numpy array as defined above instead of the filenames to some reader and it can fetch records one by one from that array instead of the files.
tf.py_func, which wraps a Python function and uses it as a TensorFlow operator, might help. Here's an example.
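A minimal sketch of that approach (assuming the images_and_labels_array from above and the TensorFlow 1.x API; the helper name fetch_record is just an illustration):
import numpy as np
import tensorflow as tf

index = tf.placeholder(tf.int64, shape=[])

def fetch_record(i):
    # plain Python/NumPy code: pull one row out of the in-memory array
    row = images_and_labels_array[i]
    return row[0], row[1:]

# label and image are uint8 tensors produced by the wrapped Python function
label, image = tf.py_func(fetch_record, [index], [tf.uint8, tf.uint8])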
However, since you've mentioned that your images are stored in png files, I think the simplest solution would be to replace this:
reader = tf.FixedLengthRecordReader(record_bytes=record_bytes)
result.key, value = reader.read(filename_queue)
with this:
result.key, value = tf.WholeFileReader().read(filename_queue)
value = tf.image.decode_png(value)
