How to add data to H5py data? [duplicate]

Does anyone have an idea for updating HDF5 datasets from h5py?
Assuming we create a dataset like:
import h5py
import numpy
f = h5py.File('myfile.hdf5', 'a')  # newer h5py versions default to read-only mode
dset = f.create_dataset('mydataset', data=numpy.ones((2,2),"=i4"))
new_dset_value=numpy.zeros((3,3),"=i4")
Is it possible to extend the dset to a 3x3 numpy array?

You need to create the dataset with the "extendable" property. It's not possible to change this after the initial creation of the dataset. To do this, you need to use the "maxshape" keyword. A value of None in the maxshape tuple means that that dimension can be of unlimited size. So, if f is an HDF5 file:
dset = f.create_dataset('mydataset', (2,2), maxshape=(None,3))
creates a dataset of size (2,2), which may be extended indefinitely along the first dimension and to 3 along the second. Now, you can extend the dataset with resize:
dset.resize((3,3))
dset[:,:] = numpy.zeros((3,3),"=i4")
The first dimension can be increased as much as you like:
dset.resize((10,3))
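For reference, here is a small self-contained sketch of the whole sequence. It assumes a recent h5py and writes to a throwaway file in a temporary directory; the file name resizable.h5 is only a placeholder. It also checks that values already stored survive the resize, while the newly added cells hold the fill value (0 by default).

```python
import os
import tempfile

import h5py
import numpy as np

# Throwaway location; the file name is only a placeholder.
path = os.path.join(tempfile.mkdtemp(), "resizable.h5")

with h5py.File(path, "w") as f:
    # maxshape must be given at creation time; None means "unlimited".
    dset = f.create_dataset(
        "mydataset",
        data=np.ones((2, 2), dtype="=i4"),
        maxshape=(None, 3),
    )
    dset.resize((3, 3))        # grow both axes within maxshape
    after_resize = dset[...]   # old values kept, new cells hold the fill value

    dset[:, :] = np.full((3, 3), 7, dtype="=i4")  # overwrite the whole dataset
    dset.resize((10, 3))       # the first axis is unlimited
    final_shape = dset.shape

print(after_resize)
print(final_shape)
```

Passing data= together with maxshape= keeps the original question's starting point (a 2x2 array of ones) while still making the dataset resizable.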

Related

What dimensions should my Numpy Array be ? Obspy Traces

I currently have seismic data with 175x events with 3 traces for each event (traces are numpy arrays of seismic data). I have classification labels for whether the seismic data is an earthquake or not for each of those 175 samples. I'm looking to format my data into numpy arrays for modelling. I've tried placing into a dataframe of numpy arrays with each column being a different trace. So columns would be 'Trace one' 'Trace two' 'Trace three'. This did not work. I have tried lots of different methods of arranging the data to use with keras.
I'm now looking to create a numpy matrix for the data to go into and to then use for modelling.
I had thought that the shape may be (175,3,7501) as (#number of events, #number of traces,#number of samples in trace), however I then iterate through and try to add the three traces to the numpy matrix and have failed. I'm used to using dataframes and not numpy for inputting to Keras.
newrow = np.array([[trace_copy_1],[trace_copy_2],[trace_copy_3]])
data = numpy.vstack([data, newrow])
The data shape is (175,3,7510). The newrow shape is (3,1,7510) and does not allow me to add newrow to data.
The form in which I receive the data is in obspy streams and each stream has the 3 trace objects. With each trace object, it holds the trace data in numpy arrays and so I'm having to access and append those to a dataframe for modelling as obviously I can't feed a stream or trace object to keras model.

If I understand your data correctly, you can try one of the following methods:
If your data shape is (175, 3, 7510), define newrow as newrow = np.array([trace_copy_1, trace_copy_2, trace_copy_3]), with each trace_copy_x being a NumPy array of shape (7510,).
Use the reshape function, either numpy.reshape(new_row, (3, 7510)) or new_row.reshape((3, 7510)).
If you're familiar with dataframes, you can still use pandas dataframes by reducing the dimension of your data (for example, you can place the different traces one after another on the same row, something you often see when working with images). Here it could be something like pandas.DataFrame(data.reshape((175, 3*7510))).
In addition, I recommend using numpy.concatenate instead of numpy.vstack (it is more general).
I hope this works.
Cheers
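A minimal sketch of these suggestions combined, using made-up constant traces in place of the real ones (trace_1, trace_2 and trace_3 are placeholders). Note that the new event needs a leading axis of length 1 before it can be concatenated onto the 3-D array.

```python
import numpy as np

n_samples = 7510  # samples per trace, as in the question

# Stand-ins for the existing data and three new 1-D traces (hypothetical values).
data = np.zeros((175, 3, n_samples))
trace_1 = np.full(n_samples, 1.0)
trace_2 = np.full(n_samples, 2.0)
trace_3 = np.full(n_samples, 3.0)

# Stack the 1-D traces into one event of shape (3, n_samples) ...
newrow = np.stack([trace_1, trace_2, trace_3])
# ... then add a leading event axis so it matches data's three dimensions.
data = np.concatenate([data, newrow[np.newaxis, ...]], axis=0)

# Flattened 2-D version for tools that expect one row per event.
flat = data.reshape(data.shape[0], -1)

print(data.shape)
print(flat.shape)
```

The flattened array is what would be passed to pandas.DataFrame if a dataframe is still wanted.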

Thanks for the answers. The way I solved this was to create a NumPy array of the desired shape: (number of events, number of traces per event, number of samples in each trace).
I then created a new row, reshaped it, and added it to the array. Following this, I split the data to remove the original data that was there before I started appending my new data.
data = np.zeros(shape=(175,3,7501))
newrow = np.array([[trace_copy_1],[trace_copy_2],[trace_copy_3]])
newrow = newrow.reshape((1,3,7501))
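If the number of events is known up front, a variation on this is to preallocate the array and fill it by index, so no placeholder rows need to be removed afterwards. The sketch below is only an illustration: the nested list named events is a hypothetical stand-in for the ObsPy streams, with random numbers in place of real trace data.

```python
import numpy as np

n_events, n_traces, n_samples = 175, 3, 7501

# Hypothetical stand-in for the streams: three 1-D arrays per event.
rng = np.random.default_rng(0)
events = [[rng.standard_normal(n_samples) for _ in range(n_traces)]
          for _ in range(n_events)]

data = np.empty((n_events, n_traces, n_samples))
for i, traces in enumerate(events):
    # Each assignment writes one (n_traces, n_samples) block into row i.
    data[i] = np.stack(traces)

print(data.shape)
```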

An error while writing matrix into a raster file with rasterio

The original two datasets are two .tiff image files with the same coordinates, height, width and transform information (let's say data1 and data2). I need to perform simple math on them, but they both have null values, so I first use masked=True:
new1 = data1.read(1,masked=True)
new2 = data2.read(1,masked=True)
Then do the math:
target = new1 - new2
When I get the target, I try codes as below:
target.width
target.height
target.transform
target.crs
They all return the same error:
'MaskedArray' object has no attribute 'xxx'(xxx represents all the attribute above: width, height, etc.)
It seems the target loses all of this information after the math. I need to write the new result into a new raster file; what should I do to solve it?

When using the read method of a dataset, it will return a numpy array (masked in your case).
The numpy array is not a rasterio dataset, so it doesn't have those attributes.
To write it to disk you need to create a new profile (or copy the one from the source dataset) and use rasterio.open to create a new raster file:
import rasterio
profile = data1.profile
band_number = 1
# to update the dtype
profile.update(dtype=target.dtype)
with rasterio.open('raster.tif', 'w', **profile) as dst:
    dst.write(target, band_number)
See the docs for a more detailed example
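To illustrate why the attributes disappear, here is a NumPy-only sketch with tiny made-up arrays standing in for the two bands. The nodata value -9999 is an assumption for illustration; with real rasters it would normally come from the source dataset. The last step shows one common way to turn the masked result back into an ordinary array before writing.

```python
import numpy as np

nodata = -9999.0  # assumed nodata value, purely illustrative

# Stand-ins for data1.read(1, masked=True) and data2.read(1, masked=True).
new1 = np.ma.masked_equal(np.array([[5.0, nodata], [7.0, 9.0]]), nodata)
new2 = np.ma.masked_equal(np.array([[1.0, 2.0], [nodata, 4.0]]), nodata)

target = new1 - new2  # a cell is masked if it is masked in either input

# The result is a (masked) NumPy array, so it carries no raster metadata.
has_raster_attrs = hasattr(target, "transform") or hasattr(target, "crs")

# Replace masked cells by the nodata value to get an ordinary ndarray.
filled = target.filled(nodata)

print(type(target).__name__, has_raster_attrs)
print(filled)
```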

Applying a simple function to CSV and save multiple csv files

I am trying to replicate the data by multiplying every value by a random factor drawn from a range, and to save the results as CSV.
I have created a function "Replicate_Data" which takes the input NumPy array and multiplies it by a random value within a range. What is the best way to create 100 files and save them as P3D1, P4D1 and so on?
def Replicate_Data(data: np.ndarray) -> np.ndarray:
    Rep_factor = random.uniform(-3,7)
    data1 = data * Rep_factor
    return data1
P2D1 = Replicate_Data(P1D1)
np.savetxt("P2D1.csv", P2D1, delimiter="," , dtype = complex)

Here is an example you can use as reference.
I generate toy data named toy, then I make n random values using np.random.uniform and call it randos, then I multiply these two objects to form out using numpy broadcasting. You could also do this multiplication in a loop (the same one you save in, in fact): depending on the size of your input array it could be very memory intensive as I've written it. A more complete answer probably depends on the shape of your input data.
import numpy as np
toy = np.random.random(size=(2,2)) # a toy input array
n = 100 # number of random values
randos = np.random.uniform(-3,7,size=n) # generate 100 uniform randoms
# now multiply all elements in toy by the randoms in randos
out = toy[None,...]*randos[...,None,None] # this depends on the shape.
# this will work only if toy has two dimensions. Otherwise requires modification
# it will take a lot of memory... 100*toy.nbytes worth
# now save in the loop..
for i,o in enumerate(out):
    name = 'P{}D1.csv'.format(i+1)
    np.savetxt(name,o,delimiter=",")
# a second way without the broadcasting (slow, better on memory)
# more like 2*toy.nbytes
#for i,r in enumerate(randos):
#    name = 'P{}D1.csv'.format(i+1)
#    np.savetxt(name,r*toy,delimiter=",")
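The memory-light variant can also be written out as a standalone sketch. Everything specific here is an assumption for illustration: the 4x3 random array stands in for P1D1, the output goes to a temporary directory, and the numbering starts at P2D1 to follow the naming in the question.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(seed=0)  # seeded only to make the sketch repeatable
p1d1 = rng.random((4, 3))            # placeholder for the original P1D1 array
out_dir = tempfile.mkdtemp()         # placeholder output directory
n_files = 100

written = []
for k in range(2, n_files + 2):      # P2D1 ... P101D1
    factor = rng.uniform(-3, 7)      # one random factor per output file
    path = os.path.join(out_dir, f"P{k}D1.csv")
    np.savetxt(path, p1d1 * factor, delimiter=",")
    written.append(path)

print(len(written), os.path.basename(written[0]), os.path.basename(written[-1]))
```

Only one scaled copy of the array exists at any time, so memory use stays close to twice the size of the input.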

Attach a queue to a numpy array in tensorflow for data fetch instead of files?

I have read the CNN Tutorial on the TensorFlow and I am trying to use the same model for my project.
The problem is now in data reading. I have around 25000 images for training and around 5000 for testing and validation each. The files are in png format and I can read them and convert them into the numpy.ndarray.
The CNN example in the tutorials use a queue to fetch the records from the file list provided. I tried to create my own such binary file by reshaping my images into 1-D array and attaching a label value in the front of it. So my data looks like this
[[1,12,34,24,53,...,105,234,102],
[12,112,43,24,52,...,115,244,98],
....
]
A single row of the above array has length 22501, where the first element is the label.
I dumped the array to a file using pickle and then tried to read from the file using tf.FixedLengthRecordReader, as demonstrated in the example.
I am doing the same things as given in the cifar10_input.py to read the binary file and putting them into the record object.
Now when I read from the files the labels and the image values are different. I can understand the reason for this to be that pickle dumps the extra information of braces and brackets also in the binary file and they change the fixed length record size.
The above example uses the filenames and pass it to a queue to fetch the files and then the queue to read a single record from the file.
I want to know if I can pass the numpy array as defined above instead of the filenames to some reader and it can fetch records one by one from that array instead of the files.

Probably the easiest way to make your data work with the CNN example code is to make a modified version of read_cifar10() and use it instead:
Write out a binary file containing the contents of your numpy array.
import numpy as np
images_and_labels_array = np.array([[...], ...],  # [[1,12,34,24,53,...,102],
                                                  #  [12,112,43,24,52,...,98],
                                                  #  ...]
                                   dtype=np.uint8)
images_and_labels_array.tofile("/tmp/images.bin")
This file is similar to the format used in CIFAR10 datafiles. You might want to generate multiple files in order to get read parallelism. Note that ndarray.tofile() writes binary data in row-major order with no other metadata; pickling the array will add Python-specific metadata that TensorFlow's parsing routines do not understand.
Write a modified version of read_cifar10() that handles your record format.
def read_my_data(filename_queue):
    class ImageRecord(object):
        pass
    result = ImageRecord()
    # Dimensions of the images in the dataset.
    label_bytes = 1
    # Set the following constants as appropriate.
    result.height = IMAGE_HEIGHT
    result.width = IMAGE_WIDTH
    result.depth = IMAGE_DEPTH
    image_bytes = result.height * result.width * result.depth
    # Every record consists of a label followed by the image, with a
    # fixed number of bytes for each.
    record_bytes = label_bytes + image_bytes
    assert record_bytes == 22501  # Based on your question.
    # Read a record, getting filenames from the filename_queue. No
    # header or footer in the binary, so we leave header_bytes
    # and footer_bytes at their default of 0.
    reader = tf.FixedLengthRecordReader(record_bytes=record_bytes)
    result.key, value = reader.read(filename_queue)
    # Convert from a string to a vector of uint8 that is record_bytes long.
    record_bytes = tf.decode_raw(value, tf.uint8)
    # The first bytes represent the label, which we convert from uint8->int32.
    result.label = tf.cast(
        tf.slice(record_bytes, [0], [label_bytes]), tf.int32)
    # The remaining bytes after the label represent the image, which we reshape
    # from [depth * height * width] to [depth, height, width].
    depth_major = tf.reshape(tf.slice(record_bytes, [label_bytes], [image_bytes]),
                             [result.depth, result.height, result.width])
    # Convert from [depth, height, width] to [height, width, depth].
    result.uint8image = tf.transpose(depth_major, [1, 2, 0])
    return result
Modify distorted_inputs() to use your new dataset:
def distorted_inputs(data_dir, batch_size):
    """[...]"""
    filenames = ["/tmp/images.bin"]  # Or a list of filenames if you
                                     # generated multiple files in step 1.
    for f in filenames:
        if not gfile.Exists(f):
            raise ValueError('Failed to find file: ' + f)
    # Create a queue that produces the filenames to read.
    filename_queue = tf.train.string_input_producer(filenames)
    # Read examples from files in the filename queue.
    read_input = read_my_data(filename_queue)
    reshaped_image = tf.cast(read_input.uint8image, tf.float32)
    # [...] (Maybe modify other parameters in here depending on your problem.)
This is intended to be a minimal set of steps, given your starting point. It may be more efficient to do the PNG decoding using TensorFlow ops, but that would be a larger change.
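The record layout produced by ndarray.tofile() can be checked without TensorFlow. The sketch below uses four random records as a hypothetical stand-in for images_and_labels_array and a temporary file instead of /tmp/images.bin; it shows that the file is exactly a sequence of 22501-byte records with the label in the first byte.

```python
import os
import tempfile

import numpy as np

record_bytes = 22501  # 1 label byte + 22500 image bytes
rng = np.random.default_rng(0)
# Hypothetical stand-in for images_and_labels_array: 4 records of uint8.
records = rng.integers(0, 256, size=(4, record_bytes), dtype=np.uint8)

path = os.path.join(tempfile.mkdtemp(), "images.bin")
records.tofile(path)  # raw bytes, row-major, no header or other metadata

# The file size is exactly n_records * record_bytes, which is what a
# fixed-length record reader relies on.
file_size = os.path.getsize(path)

restored = np.fromfile(path, dtype=np.uint8).reshape(-1, record_bytes)
labels = restored[:, 0]  # first byte of every record
images = restored[:, 1:]  # remaining 22500 bytes of every record

print(file_size, labels.shape, images.shape)
```

A pickled file would fail the size check, because the pickle header and framing bytes shift every record boundary.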

In your question, you specifically asked:
I want to know if I can pass the numpy array as defined above instead of the filenames to some reader and it can fetch records one by one from that array instead of the files.
You can feed the numpy array to a queue directly, but it will be a more invasive change to the cifar10_input.py code than my other answer suggests.
As before, let's assume you have the following array from your question:
import numpy as np
images_and_labels_array = np.array([[...], ...],  # [[1,12,34,24,53,...,102],
                                                  #  [12,112,43,24,52,...,98],
                                                  #  ...]
                                   dtype=np.uint8)
You can then define a queue that contains the entire data as follows:
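# capacity is a required argument; here it is large enough to hold every record
q = tf.FIFOQueue(capacity=images_and_labels_array.shape[0],
                 dtypes=[tf.uint8, tf.uint8], shapes=[[], [22500]])
enqueue_op = q.enqueue_many([images_and_labels_array[:, 0], images_and_labels_array[:, 1:]])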
...then call sess.run(enqueue_op) to populate the queue.
Another, more efficient, approach would be to feed records to the queue, which you could do from a parallel thread (see this answer for more details on how this would work):
# [With q as defined above.]
label_input = tf.placeholder(tf.uint8, shape=[])
image_input = tf.placeholder(tf.uint8, shape=[22500])
enqueue_single_from_feed_op = q.enqueue([label_input, image_input])
# Then, to enqueue a single example `i` from the array.
sess.run(enqueue_single_from_feed_op,
         feed_dict={label_input: images_and_labels_array[i, 0],
                    image_input: images_and_labels_array[i, 1:]})
Alternatively, to enqueue a batch at a time, which will be more efficient:
label_batch_input = tf.placeholder(tf.uint8, shape=[None])
image_batch_input = tf.placeholder(tf.uint8, shape=[None, 22500])
enqueue_batch_from_feed_op = q.enqueue_many([label_batch_input, image_batch_input])
# Then, to enqueue a batch of examples `i` through `j-1` from the array.
sess.run(enqueue_batch_from_feed_op,
         feed_dict={label_batch_input: images_and_labels_array[i:j, 0],
                    image_batch_input: images_and_labels_array[i:j, 1:]})

I want to know if I can pass the numpy array as defined above instead
of the filenames to some reader and it can fetch records one by one
from that array instead of the files.
tf.py_func, which wraps a Python function and uses it as a TensorFlow operator, might help. Here's an example.
However, since you've mentioned that your images are stored in png files, I think the simplest solution would be to replace this:
reader = tf.FixedLengthRecordReader(record_bytes=record_bytes)
result.key, value = reader.read(filename_queue)
with this:
result.key, value = tf.WholeFileReader().read(filename_queue)
value = tf.image.decode_png(value)

sort/lexsort/bincount/etc. for arrays that don't fit in memory

I'm trying to scale up a library written in numpy so that it can process arrays that don't fit in memory (~10 arrays of 10 billion elements)
hdf5 (h5py) was a temporary solution, but I rely heavily on sorting and indexing (b = a[a>5]), which are both not available in h5py and are a pain to write.
Is there a library that would make these tools available?
Specifically I need basic math, sort, lexsort, argsort, bincount, np.diff, and indexing (both boolean and with the array of indices).

PyTables is designed precisely for this (also based on hdf5). First store your array to disk
import numpy as np
import tables as tb
# Write big numpy array to disk
rows, cols = 80000000, 2
h5file = tb.open_file('test.h5', mode='w', title="Test Array")
root = h5file.root
array_on_disk = h5file.create_carray(root,
'array_on_disk',tb.Float64Atom(),shape=(rows,cols))
# Fill part of the array (the block must match the (1000, cols) shape of the slice)
rand_array = np.random.rand(1000, cols)
array_on_disk[10055:11055] = rand_array
array_on_disk[12020:13020] = 2.*rand_array
h5file.close()
Then perform your computation directly on the array (or part of it) contained in the file
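h5file = tb.open_file('test.h5', mode='r')
print(h5file.root.array_on_disk[10050:10065,0])
# slicing reads the rows into memory as a NumPy array; sort() then sorts
# that in-memory copy, not the data on disk
chunk = h5file.root.array_on_disk[100000:10000000,:]
chunk.sort(axis=0)
h5file.close()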
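The same read-a-slice-then-compute pattern extends to reductions such as bincount. The sketch below is not PyTables: it swaps in numpy.memmap as the disk-backed array so that it runs with NumPy alone, and the sizes, file name and bin count are small made-up values. Sorting a full out-of-core array needs more machinery (for example an external merge sort) and is not shown.

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "values.dat")
n, n_bins, chunk = 1_000_000, 16, 100_000  # small sizes, for illustration only

# Create a disk-backed array and fill it chunk by chunk.
rng = np.random.default_rng(0)
arr = np.memmap(path, dtype=np.int64, mode="w+", shape=(n,))
for start in range(0, n, chunk):
    arr[start:start + chunk] = rng.integers(0, n_bins, size=chunk)
arr.flush()

# Out-of-core bincount: only one chunk is processed at a time and the
# per-chunk counts are accumulated into a small in-memory array.
counts = np.zeros(n_bins, dtype=np.int64)
for start in range(0, n, chunk):
    counts += np.bincount(arr[start:start + chunk], minlength=n_bins)

print(counts.sum())
```

Passing minlength keeps every per-chunk result the same length, so the partial counts can simply be added.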
