Column stack and row stack to existing HDF5 datasets with h5py - Python

I am trying to use Python to column stack and row stack data I have in an HDF5 file with additional data. I am recording images from a camera and saving them to individual files. Then I want to be able to generate a single file with all of the images patched together. Therefore, I would like to be able to make one dataset in a new file and stack together all of the arrays from each image file into the single file.
I know that h5py allows me to use the datasets like NumPy arrays, but I do not know how to tell h5py to save the data to the file again. Below is a very simple example.
My question is how can I column stack the data from the HDF5 file with the second array (arr2) such that arr2 is saved to the file?
(Note: in my actual application, the data in the file will be much larger than in the example, so loading the data into memory, column stacking, and then rewriting the file is out of the question.)
import h5py
import numpy

arr1 = numpy.random.random((2000, 2000))
with h5py.File("Plot0.h5", "w") as f:
    dset = f.create_dataset("Plot", data=arr1)

arr2 = numpy.random.random((2000, 2000))
with h5py.File("Plot0.h5", "r+") as f:
    dset = f["Plot"]
    dset = numpy.column_stack((dset, arr2))
It seems like a trivial issue, but all of my searches have been unsuccessful. Thanks in advance.

After rereading some of the documentation on H5py, I realized my mistake. Here is my new script structure that allows me to stack arrays in the HDF5 file:
import h5py
import numpy

arr1 = numpy.random.random((2000, 2000))
with h5py.File("Plot0.h5", "w") as f:
    # maxshape=(None, None) lets both axes grow later
    dset = f.create_dataset("Plot", data=arr1, maxshape=(None, None))

dsetX, dsetY = 2000, 2000

go = ""
while go == "":
    go = raw_input("Current Size: " + str(dsetX) + " " + str(dsetY) + " Continue?")
    arr2 = numpy.random.random((2000, 2000))
    with h5py.File("Plot0.h5", "r+") as f:
        dset = f["Plot"]
        print len(arr2[:])
        print len(arr2[0][:])
        change = "column"
        dsetX, dsetY = dset.shape
        if change == "column":
            x1 = dsetX
            x2 = len(arr2[:]) + dsetX
            y1 = 0
            y2 = len(arr2[0][:])
        else:
            x1 = 0
            x2 = len(arr2[:])
            y1 = dsetY
            y2 = len(arr2[0][:]) + dsetY
        # datasets are grown with resize(), not by assigning to .shape
        dset.resize((x2, y2))
        print "x1", x1
        print "x2", x2
        print "y1", y1
        print "y2", y2
        print dset.shape
        dset[x1:x2, y1:y2] = arr2
        print arr2
        print "\n"
        print dset[x1:x2, y1:y2]
        dsetX, dsetY = dset.shape
I hope this can help someone else. And of course, better solutions to this problem are welcome.
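For anyone after a more compact version: the approach above boils down to resizing along one axis and writing into the newly exposed region. Here is a minimal sketch of that pattern; the file name, dataset name, and shapes are placeholders invented for this demo:

```python
import h5py
import numpy as np

def append_rows(path, name, arr):
    """Grow a resizable dataset along axis 0 and write arr into the new rows."""
    with h5py.File(path, "r+") as f:
        dset = f[name]
        old_rows = dset.shape[0]
        dset.resize(old_rows + arr.shape[0], axis=0)
        dset[old_rows:] = arr

# create a dataset with unlimited maxshape, then append to it
with h5py.File("demo.h5", "w") as f:
    f.create_dataset("Plot", data=np.zeros((2, 3)), maxshape=(None, None))
append_rows("demo.h5", "Plot", np.ones((2, 3)))

with h5py.File("demo.h5", "r") as f:
    print(f["Plot"].shape)  # (4, 3)
```

The same idea works for column stacking by resizing along axis 1 and slicing into the new columns instead.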

Related

How to automate loading multiple files into numpy arrays using a simple "for" loop?

I usually load my data, which in most cases consists of only two columns, using the np.loadtxt command as follows:
x0, y0 = np.loadtxt('file_0.txt', delimiter='\t', unpack=True)
x1, y1 = np.loadtxt('file_1.txt', delimiter='\t', unpack=True)
.
.
xn, yn = np.loadtxt('file_n.txt', delimiter='\t', unpack=True)
then plot each pair on its own, which is not ideal!
I want to make a simple "for" loop that goes over all the text files in the same directory, loads each file, and plots them all on the same figure.
import os
import numpy as np
import matplotlib.pyplot as plt

# a list of all file names that end with .txt
myfiles = [myfile for myfile in os.listdir() if myfile.endswith(".txt")]

# create a new figure
plt.figure()

# iterate over the file names
for myfile in myfiles:
    # load the x, y columns
    x, y = np.loadtxt(myfile, delimiter='\t', unpack=True)
    # plot the values
    plt.plot(x, y)

# show the figure after iterating over all files and plotting
plt.show()
Load all the files into a dictionary using:
import numpy as np

d = {}
for i in range(n):
    d[i] = np.loadtxt('file_' + str(i) + '.txt', delimiter='\t', unpack=True)
Now, to access the kth file, use d[k] or:
xk, yk = d[k]
Since you have not described the data in the files or the plot you want to create, it's hard to say exactly what to do. But for plotting, you can refer to the Matplotlib or Seaborn libraries.
You can also use glob to get all the files:
from glob import glob
import os
import numpy as np

res = []
file_path = "YOUR PATH"
file_pattern = "file_*.txt"
files_list = glob(os.path.join(file_path, file_pattern))
for f in files_list:
    print(f'----- Loading {f} -----')
    x, y = np.loadtxt(f, delimiter='\t', unpack=True)
    res += [(x, y)]
res will then contain the (x, y) arrays for each file, one entry per file in the order returned by glob.
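To make the approaches above concrete, here is a small end-to-end sketch; the two sample files are generated on the spot just for the demo, so their names and contents are made up:

```python
from glob import glob
import numpy as np

# write two small tab-separated sample files (demo only)
for i in range(2):
    xs = np.arange(5.0)
    np.savetxt("file_%d.txt" % i, np.column_stack((xs, xs * (i + 1))), delimiter="\t")

# load every matching file into a dict keyed by filename
data = {}
for path in sorted(glob("file_*.txt")):
    data[path] = np.loadtxt(path, delimiter="\t", unpack=True)

x, y = data["file_1.txt"]   # the two columns of one file, ready to plot
```

Sorting the glob result keeps the files in a predictable order, which matters if the plot legend should match file numbering.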

How can I make graph using matplotlib with json (or yaml) files?

I am struggling to make a script that can build a graph from inputs in YAML or JSON files. First I tried it with YAML files but failed, so I have now converted my YAML files into JSON files and am trying with those.
What exactly do I want?
I have 400+ JSON files, each containing information about a specific structure (there are 400+ structures, hence 400+ JSON files). I am interested in one value from each JSON file, 'Resolution': 0.00 (a float), because I want to know how many structures have which resolution (so the x axis should be the resolution values that occur in the structures, and the y axis the number of structures with that resolution).
I have never used matplotlib before, so I am failing right at the beginning of the script, at defining the x and y axes and working out how to feed in the information from the JSON files.
import matplotlib.pyplot as plt
import numpy as np
import json

# this is preparation for my inputs, nothing important for my current question
cif_ID = "aaa, bbb"  # there are 400+ IDs
cif_input = cif_ID.split(", ")

for ID in cif_input:
    json_path = "/home/ME/something/something/{}/".format(ID)
    json_file = json_path + "{}.json".format(ID)
    with open(json_file, "r") as f:
        doc = json.load(f)
    resolution = doc['Resolution']
    print(resolution)  # this prints all resolution values (floats) from all 400+ json files
    x = (doc['Resolution'])  # I know this is the wrong way, but I want to show that I tried
Based on your suggested JSON file format, here is some sample code. It first generates sample JSON files in the './data' directory and then reads the resolution values back from those files.
import os
import json
import numpy as np
from matplotlib import pyplot as plt

master_data_dir = './data'
os.makedirs(master_data_dir, exist_ok=True)

# +++++++++++++++++++++++++
# generate sample json files
# +++++++++++++++++++++++++
x1 = {"x": "J1", "Resolution": 17, "y": "51"}
x2 = {"x": "J2", "Resolution": 11, "y": "52"}
x3 = {"x": "J3", "Resolution": 10, "y": "53"}
x4 = {"x": "J4", "Resolution": 10, "y": "54"}
x5 = {"x": "J5", "Resolution": 10, "y": "55"}
x6 = {"x": "J6", "Resolution": 11, "y": "56"}
x7 = {"x": "J7", "Resolution": 13, "y": "57"}
x8 = {"x": "J8", "Resolution": 15, "y": "58"}
x9 = {"x": "J9", "Resolution": 17, "y": "59"}
x10 = {"x": "J10", "Resolution": 15, "y": "60"}
filename = ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10']
for i, j in enumerate([x1, x2, x3, x4, x5, x6, x7, x8, x9, x10]):
    # the with block closes the file, so no explicit f.close() is needed
    with open(master_data_dir + '/' + filename[i] + ".json", 'w') as f:
        json.dump(j, f)

# ++++++++++++++++++++++++++++++++++++++++++++++
# read json data to find the resolution frequencies
# ++++++++++++++++++++++++++++++++++++++++++++++
list_files_in_data_dir = os.listdir(master_data_dir)
resolution_values = []
for json_file_name in list_files_in_data_dir:
    json_file_path = os.path.join(master_data_dir, json_file_name)
    with open(json_file_path, 'r') as f:
        jsondata = json.load(f)
    resolution_values.append(jsondata['Resolution'])

resolution_values = np.array(resolution_values)
unique_resolutions, frequency = np.unique(resolution_values, return_counts=True)
print("Unique Resolutions:", unique_resolutions)
print("Frequency Count:", frequency)

plt.bar(unique_resolutions, frequency)
plt.title("Unique resolutions vs frequency")
plt.xticks(unique_resolutions, fontsize=10)
plt.xlabel('Unique Resolutions', fontsize=10)
plt.ylabel('Frequency', fontsize=10)
plt.show()
The output of this code is a bar plot.
Hope this helps. Feel free to ask any questions.
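If matplotlib's binning machinery feels heavy, the counting step on its own can also be done with collections.Counter. This sketch generates a few throwaway JSON files (names and values invented for the demo) and tallies their 'Resolution' fields:

```python
import json
import os
from collections import Counter

# throwaway demo files, one JSON object each
os.makedirs("json_demo", exist_ok=True)
for name, res in [("a", 1.5), ("b", 2.0), ("c", 1.5)]:
    with open(os.path.join("json_demo", name + ".json"), "w") as f:
        json.dump({"Resolution": res}, f)

# count how many structures share each resolution value
counts = Counter()
for fname in os.listdir("json_demo"):
    with open(os.path.join("json_demo", fname)) as f:
        counts[json.load(f)["Resolution"]] += 1

print(counts)  # Counter({1.5: 2, 2.0: 1})
```

counts.keys() and counts.values() can then be fed straight to plt.bar, giving the same result as the np.unique approach above.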

How do I select a particular column of a Numpy array read from a CSV?

I'm trying:
import numpy as np
housing_data = np.loadtxt('Housing.csv', delimiter=',')
print(housing_data)
print(housing_data.shape)
x1 = housing_data[:,:,0]
x2 = housing_data[:,:,1]
y = housing_data[:,:,2]
print(x1)
print(x2)
print(y)
My data has shape (47, 3) and looks like:
[[2.104e+03 3.000e+00 3.999e+05]
[1.600e+03 3.000e+00 3.299e+05]
[2.400e+03 3.000e+00 3.690e+05]
....
I am trying to set the first column to x1, the second to x2, and the third to y, but my code doesn't appear to work. What am I doing wrong?
I have created a dummy *.csv file with random data. I would do it like this:
import numpy as np
import pandas as pd
# read file using pandas, without header and convert it to numpy arrays
housing_data = pd.read_csv('Housing.csv', header=None).values
# print housing data
print(housing_data)
print(housing_data.shape)
# slice through the data
x1 = housing_data[:,0]
x2 = housing_data[:,1]
y = housing_data[:,2]
print(x1)
print(x2)
print(y)
For selection with NumPy in Python you can use:
# the (2, 2) block in the top-right corner
data[:2, 1:]
# the bottom row
data[2]
# the bottom row, with an explicit column slice
data[2, :]
or with conditions:
data[data > 2]
Maybe you could check your .csv file and datatypes:
data.astype(float)
data = np.arange(3, dtype=np.uint8)
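The underlying point in both answers is that a two-dimensional array takes one index per axis, so there is no third axis to index with [:,:,0]. A minimal sketch using a few rows of stand-in housing data in place of the CSV:

```python
import numpy as np

# stand-in for the CSV contents: area, bedrooms, price
housing_data = np.array([[2104.0, 3.0, 399900.0],
                         [1600.0, 3.0, 329900.0],
                         [2400.0, 3.0, 369000.0]])

x1 = housing_data[:, 0]   # first column  (all rows, column 0)
x2 = housing_data[:, 1]   # second column
y = housing_data[:, 2]    # third column
print(x1)                 # [2104. 1600. 2400.]
```

Each slice is a 1-D view of shape (47,) on the real data, ready for fitting or plotting.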

Python MNIST Digit Data Testing Failure

Hello, I am using Python to read the digit data provided by MNIST into a data structure I can use to train a neural network. I am testing that the data was read properly by creating an image with PIL. The image that is being created is horribly wrong, and I am not sure whether I am using PIL incorrectly or my data structures and methods are not right.
The format of the two data files is described here:
http://yann.lecun.com/exdb/mnist/
Here are the applicable functions:
read_image_data reads the pixel data, organizing it into a list of 2D numpy arrays
import struct
import numpy as np
from PIL import Image  # used later by run_test

def read_image_data():
    fd = open("train-images.idx3-ubyte", "rb")
    images_bin_string = fd.read()
    num_images = struct.unpack(">i", images_bin_string[4:8])[0]
    image_data_bank = []
    uint32_num_bytes = 4
    current_index = 8
    num_rows = struct.unpack(">I",
        images_bin_string[current_index: current_index + uint32_num_bytes])[0]
    num_cols = struct.unpack(">I",
        images_bin_string[current_index + uint32_num_bytes:
                          current_index + uint32_num_bytes * 2])[0]
    current_index += 8
    i = 0
    while i < num_images:
        image_data = np.zeros([num_rows, num_cols])
        for j in range(num_rows - 1):
            for k in range(num_cols - 1):
                image_data[j][k] = images_bin_string[current_index + j * k]
        current_index += num_rows * num_cols
        i += 1
        image_data_bank.append(image_data)
    return image_data_bank
read_label_data reads the corresponding labels into a list
def read_label_data():
    fd = open("train-labels.idx1-ubyte", "rb")
    images_bin_string = fd.read()
    num_images = struct.unpack(">i", images_bin_string[4:8])[0]
    image_data_bank = []
    current_index = 8
    i = 0
    while i < num_images:
        image_data_bank.append(images_bin_string[current_index])
        current_index += 1
        i += 1
    return image_data_bank
collect_data zips the structures together
def collect_data():
    print("Reading image data...")
    image_data = read_image_data()
    print("Reading label data...")
    label_data = read_label_data()
    print("Zipping data sets...")
    all_data = np.array(list(zip(image_data, label_data)))
    return all_data
Lastly, run_test uses PIL to display the pixels from the first 28x28 numpy array created by read_image_data
def run_test(data):
    example = data[0]
    pixel_data = example[0]
    number = example[1]
    print(number)
    im = Image.fromarray(pixel_data)
    im.show()
When I run the script:
Collecting data... Reading image data... Reading label data... Zipping
data sets... 5
I must be messing something up with the PIL library, but I do not know what.
That is a really weird-looking 5. I am guessing that I went wrong somewhere in my organization of the data. The directions did say "Pixels are organized row-wise.", but I think I covered that by having my outer loop be the row loop and the inner one the column loop.
UPDATE
I reversed the order of the row and column indices in the np.arrays in read_image_data and it made no difference.
image_data[k][j] = images_bin_string[current_index + j * k]
UPDATE
I ran a quick test with matplotlib:
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
imgplot = plt.imshow(pixel_data)
plt.show()
Here is what I got from matplotlib
That means it is definitely a problem with my code and not the library. The question is whether it is the way I am passing the pixels to the imaging libraries or how I structured the data. If anyone can find the mistake, I would greatly appreciate it.
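For what it's worth, the likely culprit is the offset arithmetic: with pixels stored row-wise, pixel (j, k) lives at offset j * num_cols + k, whereas the j * k used in the loop above lands on the wrong byte for nearly every pixel (and the range(num_rows - 1) / range(num_cols - 1) loops also drop the last row and column). A small sketch with stand-in bytes instead of the real MNIST file:

```python
import numpy as np

# stand-in for the raw pixel bytes of one 4x3 image, stored row-wise
num_rows, num_cols = 4, 3
raw = bytes(range(num_rows * num_cols))

image = np.zeros((num_rows, num_cols), dtype=np.uint8)
for j in range(num_rows):          # full range, not num_rows - 1
    for k in range(num_cols):
        image[j, k] = raw[j * num_cols + k]   # row-major offset

# the corrected loop is equivalent to a straight reshape of the buffer
same = np.frombuffer(raw, dtype=np.uint8).reshape(num_rows, num_cols)
print((image == same).all())  # True
```

In the reader itself, np.frombuffer over each num_rows * num_cols slice followed by reshape would replace the nested loops entirely.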

having problems with matplotlib and spectroscopy data

I am trying to plot a .dat file from a stellar catalog using this code:
try:
    import pyfits
    noPyfits = False
except ImportError:
    noPyfits = True

import matplotlib.pyplot as plt
import numpy as np

f2 = open('/home/mcditoos/Desktop/Astrophysics_programs/Data_LAFT/ESPECTROS/165401.dat', 'r')
lines = f2.readlines()
f2.close()

x1 = []
y1 = []
for line in lines:
    p = line.split()
    x1.append(float(p[0]))
    y1.append(float(p[1]))

xv = np.array(x1)
yv = np.array(y1)
plt.plot(xv, yv)
plt.show()
However, I get the following error:
x1.append(float(p[0]))
IndexError: list index out of range
Also, I wanted to know if there is any way of making the program open the next .dat file given an input.
I may not fully understand your question, but why don't you use
X, Y = numpy.genfromtxt('yourfile', dtype='str')
X = X.astype('float')
Y = Y.astype('float')
If your file has 2 columns, you can transpose the table with
X, Y = numpy.genfromtxt('yourfile', dtype='str').T
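As for the IndexError in the original loop: float(p[0]) fails when a line splits to an empty list, typically a blank line at the end of the .dat file (a guess, since the file itself isn't shown). Guarding the loop with a length check avoids it; the lines list below is a stand-in for f2.readlines():

```python
import numpy as np

# stand-in for f2.readlines(); the blank line reproduces the IndexError
lines = ["1.0 10.0\n", "2.0 20.0\n", "\n", "3.0 30.0\n"]

x1, y1 = [], []
for line in lines:
    p = line.split()
    if len(p) < 2:        # skip blank or malformed lines
        continue
    x1.append(float(p[0]))
    y1.append(float(p[1]))

xv = np.array(x1)
yv = np.array(y1)
print(xv)  # [1. 2. 3.]
```

For opening the next .dat file on request, the glob-and-loop pattern from the earlier question applies directly: collect the file names with glob('*.dat') and iterate, plotting one file per pass.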
