I have a boatload of images in an HDF5 file that I would like to load and analyse. Each image is 1920x1920 uint16 and loading all of them into memory crashes the computer. I have been told that others work around this by slicing the data, e.g. if the data is 1920x1920x100 (100 images) then they read the first 80 rows of each image, analyse that slice, then move on to the next slice. This I can do without problems, but when I try to create a dataset in the HDF5 file, I get a TypeError: Can't convert element 0 ... to hsize_t
I can recreate the problem with this very simplified code:
import h5py
import numpy as np

with h5py.File('h5file.hdf5','w') as f:
    data = np.random.randint(100, size=(15,15,20))
    data_set = f.create_dataset('data', data, dtype='uint16')
which gives the output:
TypeError: Can't convert element 0 ([[29 50 75...4 50 28 36 13 72]]) to hsize_t
I have also tried omitting the "data_set =" and the "dtype='uint16'", but I still get the same error. The code is then:
with h5py.File('h5file.hdf5','w') as f:
    data = np.random.randint(100, size=(15,15,20))
    f.create_dataset('data', data)
Can anyone give me any hints to what the problem is?
Cheers!
The second parameter of create_dataset is the shape parameter (see the docs), but you pass the entire array. If you want to initialize the dataset with an existing array, you must specify this with the data keyword, like this:
data_set = f.create_dataset('data', data=data, dtype="uint16")
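If the goal is to avoid holding all of the images in memory, one related pattern (a minimal sketch, not part of the original answer; the file name, shapes and slice size are placeholders) is to create the dataset by shape only and then read or write it in row slices:
import h5py
import numpy as np

n_rows, n_cols, n_images = 1920, 1920, 100
slice_rows = 80  # process 80 rows of every image at a time

with h5py.File('h5file.hdf5', 'w') as f:
    # create by shape only; nothing is loaded into memory yet
    f.create_dataset('data', shape=(n_rows, n_cols, n_images), dtype='uint16')

with h5py.File('h5file.hdf5', 'r') as f:
    dset = f['data']
    for start in range(0, n_rows, slice_rows):
        block = dset[start:start + slice_rows, :, :]  # only this slice is read
        # ... analyse block here ...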
Let's say I have an HDF5 dataset with maxshape=(None,1000), chunks=(1,1000).
Then whenever I need to delete a row I just zero it (this happens often):
ds[ix,:] = 0
What is the fastest way to vacuum the zeroed rows and resize the dataset?
Now let's add a twist. I have a dict that resolves symbol names to dataset row indices:
{ name : ds_ix }
What is the fastest way to vacuum and still keep the correct ds_ix for every name?
Did you mean resize the dataset when you asked about resizing the array? If so, use the .resize() method. However, if you aren't removing the last row(s), you will have to rearrange the non-zero data and then resize. (And you really don't need to zero out the row(s), since you are going to overwrite them.)
I can think of 2 approaches to rearrange the data: 1) use slice notation to define FROM and TO indices, or 2) read the dataset into a numpy array, delete the rows, and copy it back. Both involve disk I/O so it's not clear which would be faster without testing. It probably doesn't matter for small datasets and only a few deleted rows. I suspect the second method will be better if you plan to delete a lot of rows from large datasets. However, benchmark tests are required to confirm.
Note: be careful setting the chunk size. Remember that it controls the I/O size, and you will be doing a lot of I/O when you move rows. Setting it too small (or too large) can degrade performance. chunks=(1,1000) is probably too small: the recommended chunk size is 10 KiB to 1 MiB, and a (1,1000) chunk of float32 is only about 4 KiB.
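For illustration only (not part of the original answer; the file name and numbers are placeholders), a chunk shape in that recommended range could be set when the dataset is created:
import h5py

with h5py.File('example_chunks.h5', 'w') as h5f:
    # 64 rows x 1000 float32 columns per chunk = 256,000 bytes, i.e. 250 KiB
    h5f.create_dataset('test', shape=(0, 1000), maxshape=(None, 1000),
                       dtype='float32', chunks=(64, 1000))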
Here are both approaches with a very small dataset.
Create a HDF5 file:
import h5py
import numpy as np

with h5py.File('SO_73353006.h5','w') as h5f:
    a0, a1 = 10, 5
    arr = np.arange(a0*a1).reshape(a0,a1)
    ds = h5f.create_dataset('test', data=arr, maxshape=(None,a1))
Method 1: move data, then resize dataset
with h5py.File('SO_73353006.h5','r+') as h5f:
    idx = 5
    ds = h5f['test']
    #ds[idx,:] = 0  # Not required since we will overwrite the row
    a0 = ds.shape[0]
    ds[idx:a0-1] = ds[idx+1:a0]
    ds.resize(a0-1, axis=0)
Method 2: extract array, delete row and copy data to resized dataset
with h5py.File('SO_73353006.h5','r+') as h5f:
    idx = 5
    ds = h5f['test']
    a0 = ds.shape[0]
    a1 = ds.shape[1]
    # read dataset into array and delete row
    ds_arr = ds[()]
    ds_arr = np.delete(ds_arr, obj=idx, axis=0)
    # Resize dataset and load array
    ds.resize(a0-1, axis=0)  # same as above
    ds[:] = ds_arr[:]
    # Create a new dataset for comparison
    ds2 = h5f.create_dataset('test2', data=ds_arr, maxshape=(None,a1))
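The question's second part, keeping the { name : ds_ix } dict correct after a vacuum, is not covered above; one possible sketch (the dict symbol_to_ix and its contents are hypothetical) is to drop the deleted row's entry and shift every index that points past it:
# hypothetical mapping of symbol name -> dataset row index
symbol_to_ix = {'AAA': 2, 'BBB': 5, 'CCC': 8}

idx = 5  # row that was removed from the dataset

# drop the entry that pointed at the deleted row and shift later indices down by one
symbol_to_ix = {name: ix - 1 if ix > idx else ix
                for name, ix in symbol_to_ix.items()
                if ix != idx}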
I wrote code for point generation which produces a dataframe every second and keeps on generating. Each dataframe has 1000 rows and 7 columns. It was implemented with a while loop, so every iteration generates one dataframe that must be appended to a file. Which file format should I use to manage memory efficiently? Which file format takes less space? Can anyone give me a suggestion? Is it okay to use CSV? If so, what datatype should I prefer? Currently my dataframe has int16 values. Should I append them as-is, or should I convert them into a binary/byte format?
numpy arrays can be stored in binary format. Since you have a single int16 data type, you can create a numpy array and write that. You would have 2 bytes per int16 value, which is fairly good for size. The trick is that you need to know the dimensions of the stored data when you read it later; in this example they are hard-coded. This is a bit fragile: if you change your mind and start using different dimensions later, old data would have to be converted.
Assuming you want to read a bunch of 1000x7 dataframes later, you could do something like the example below. The writer keeps appending 1000x7 int16s and the reader chunks them back into dataframes. If you don't use anything specific to pandas itself, you would be better off just sticking with numpy for all of your operations and skip the demonstrated conversions.
import pandas as pd
import numpy as np

def write_df(filename, df):
    with open(filename, "ab") as fp:
        np.array(df, dtype="int16").tofile(fp)

def read_dfs(filename, dim=(1000,7)):
    """Sequentially reads dataframes from a file formatted as raw int16
    with dimension 1000x7"""
    size = dim[0] * dim[1]
    with open(filename, "rb") as fp:
        while True:
            arr = np.fromfile(fp, dtype="int16", count=size)
            if not len(arr):
                break
            yield pd.DataFrame(arr.reshape(*dim))
import os

# ready for test
test_filename = "test123"
if os.path.exists(test_filename):
    os.remove(test_filename)
df = pd.DataFrame({"a":[1,2,3], "b":[4,5,6]})

# write test file
for _ in range(5):
    write_df(test_filename, df)

# read and verify test file
return_data = [df for df in read_dfs(test_filename, dim=(3,2))]
assert len(return_data) == 5
I have a MATLAB file that is of shape 70x10,000,000 (70 rows, 10,000,000 columns).
What's annoying is that when I run this code, which is supposed to print that chunk of data,
f = h5py.File(filepath, 'r')
item = list(f.items())[0][1]
print(item)
it reshapes it into 10,000,000x70 (10,000,000 rows, 70 columns)
Is there a way to keep the original shape?
h5py returns HDF5 data as Numpy arrays. So, the key to using h5py is using Numpy methods when needed. You can easily transpose an array using np.transpose(). A simple example is provided below. It creates an HDF5 file with 2 datasets: 1) an array with shape (20,5), and 2) the transposed array with shape (5,20). Then it extracts the 2 arrays and uses np.transpose() to switch the row/column order.
import h5py
import numpy as np

with h5py.File('SO_67031436','w') as h5w:
    arr = np.arange(100.).reshape(20,5)
    h5w.create_dataset('ds_1', data=arr)
    h5w.create_dataset('ds_1t', data=np.transpose(arr))

with h5py.File('SO_67031436','r') as h5r:
    for name in h5r:
        print(name, ', shape=', h5r[name].shape)
        arr = np.transpose(h5r[name][:])
        print('transposed shape=', arr.shape)
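Applied to the question's own snippet, a minimal sketch (the variable filepath and the first-item lookup come from the question; the slice size is just an example to keep memory use low) would transpose the data as it is read:
import h5py
import numpy as np

with h5py.File(filepath, 'r') as f:   # filepath as in the question
    item = list(f.items())[0][1]      # stored on disk as (10,000,000, 70)
    chunk = item[:1000, :]            # read a manageable slice first
    arr = np.transpose(chunk)         # back to (70, 1000) row/column order
    print(arr.shape)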
First of all apologies. I am very new to pandas, scikit learn and python. So I am sure I am doing something silly. Let me give a little background.
I am trying to run KNeighborsClassifier from scikit learn (python)
Following is my strategy
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

#Reading the Training set
data = pd.read_csv('Path_TO_File\\Train_Set.csv', sep=',') # reading CSV File
X = data[['Attribute 1','Attribute 2']]
y = data['Target_Column'] # the output is a Dataframe of single column with many rows
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X,y)
Next I try to read Test data
test = pd.read_csv('PATH_TO_FILE\\Test.csv', sep=',')
t = test[['Attribute 1','Attribute 2']]
pred = neigh.predict(t)
actual = test['Target_Column']
Next I try to check the accuracy with the following call, which throws an error.
accuracy=neigh.score(actual,pred)
ERROR: ValueError: could not convert string to float: N
I checked both actual and pred, and they have the following data types and content:
actual
Out[161]:
Target_Column
0 Y
1 N
:
[614 rows x 1 columns]
pred
Out[162]:
array(['Y', 'N', .....'N'], dtype=object)
N.B : pred has 614 values.
I tried to convert the "actual" variable to a 1D array so that I might be able to execute the function; however, I was not successful.
I think I need to do the following two things, but was not able to (even after googling):
1) Convert actual into a 1-dimensional array
2) Transpose that 1-dimensional array, since pred has 614 columns.
Please let me know how to correct the function.
Thanks in advance !
Raj
Thanks Vivek and Thornhale.
Indeed I was doing two things wrong.
1) As pointed out by you guys, I should have been using 1, 0 instead of Y, N.
2) I was giving the wrong parameters to the score function. It should be accuracy=neigh.score(t, actual), where t is the test feature set and actual is the test label information.
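A minimal sketch of the corrected flow (reusing the variable names from the question; the map() call is just one possible way to turn the Y/N labels into 1/0):
# turn the Y/N labels into 1/0 (one possible approach)
y = data['Target_Column'].map({'Y': 1, 'N': 0})
actual = test['Target_Column'].map({'Y': 1, 'N': 0})

neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X, y)

# score expects the test features and the true labels, not (actual, pred)
accuracy = neigh.score(t, actual)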
You could convert your Series (which is what you get when you do test[COLUMN_NAME]) into an array like so:
actual = np.array(test['Target_Column'])
To then reshape a numpy array, you would employ this command:
actual.reshape(1, 614) # <- Could be the other way around as well.
Your main issue though is that your Series needs to be boolean (as in 0,1).
I want to train an SVM to perform a classification of samples. I have a csv file that has 3 columns with headers: feature 1, feature 2, class label, and 20 rows (= number of samples).
Now I quote from the Scikit-Learn documentation
" As other classifiers, SVC, NuSVC and LinearSVC take as input two arrays: an array X of size [n_samples, n_features] holding the training samples, and an array y of class labels (strings or integers), size [n_samples]:"
I understand that I need to obtain two arrays (one 2D and one 1D) in order to feed the data into the SVM. However, I am unable to figure out how to obtain the required arrays from the csv file.
I have tried the following code
import numpy as np
data = np.loadtxt('test.csv', delimiter=',')
print(data)
However it is showing an error
"ValueError: could not convert string to float: ��ࡱ�"
There are no column headers in the csv. Am I making any mistake in calling the function np.loadtxt or should something else be used?
Update:
Here's what my .csv file looks like:
12 122 34
12234 54 23
23 34 23
You passed the param delimiter=',' but your csv was not comma separated.
So the following works:
In [378]:
data = np.loadtxt(path_to_data)
data
Out[378]:
array([[ 1.20000000e+01, 1.22000000e+02, 3.40000000e+01],
[ 1.22340000e+04, 5.40000000e+01, 2.30000000e+01],
[ 2.30000000e+01, 3.40000000e+01, 2.30000000e+01]])
The docs show that by default the delimiter is None and so treats whitespace as the delimiter:
delimiter : str, optional
    The string used to separate values. By default, this is any whitespace.
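Once the file loads, a minimal sketch of building the two arrays described in the scikit-learn docs (assuming, as in the question, that the last column holds the class label; the SVC import is added here for completeness) could look like this:
import numpy as np
from sklearn.svm import SVC

data = np.loadtxt('test.csv')   # whitespace-delimited, as shown above
X = data[:, :-1]                # all but the last column -> shape [n_samples, n_features]
y = data[:, -1]                 # last column as class labels -> shape [n_samples]

clf = SVC()
clf.fit(X, y)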
The issue was with the csv file rather than the loadtxt() function. The format in which I saved it was not producing a proper .csv file (don't know why! Maybe I didn't save it correctly at all). There is a way to verify whether the csv file is saved in the right format: open the .csv file in Notepad. If the data has commas between the values, it is saved properly and loadtxt() will work. If it shows gibberish, create the file again and then check once more.