I'm trying to assimilate a bunch of information into a usable array like this:
for (dirpath, dirnames, filenames) in walk('E:/Machin Lerning/Econ/full_set'):
    ndata.extend(filenames)
for i in ndata:
    currfile = open('E:/Machin Lerning/Econ/full_set/' + str(i), 'r')
    rawdata.append(currfile.read().splitlines())
    currfile.close()
rawdata = numpy.array(rawdata)

for order, file in enumerate(rawdata[:10]):
    for i in rawdata[order]:
        r = i.split(',')
        pdata.append(r)
    fdata.append(pdata)
    pdata = []
fdata = numpy.array(fdata)

plt.figure(1)
plt.plot(fdata[:,1,3])
EDIT: After printing fdata.shape when using the first 10 txt files
for order,file in enumerate(rawdata[:10]):
I see it is (10, 500, 7). But if I do not limit the size of this, and instead say
for order,file in enumerate(rawdata):
Then the fdata.shape is just (447,)
It seems like this happens whenever I increase the number of elements I look through in the rawdata array to above 13... It's not any specific location either - I changed it to
for order,file in enumerate(rawdata[11:24]):
and that worked fine. aaaaahhh
In case it's useful: here's what a sample of what the text files looks like:
20080225,A,31.42,31.79,31.2,31.5,30575
20080225,AA,36.64,38.95,36.48,38.85,225008
20080225,AAPL,118.59,120.17,116.664,119.74,448847
Looks like fdata is an array, and the error is in fdata[:,1,3]. That tries to index fdata with 3 indices: the slice, 1, and 3. If fdata has fewer than 3 dimensions, this produces exactly this error - too many indices.
When you get 'indexing' errors, figure out the shape of the offending array. Don't just guess. Add a debug statement print(fdata.shape).
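For example, a minimal sketch of that kind of check, using a hypothetical stand-in for the question's parsed fdata list (three "files", one of them a row short):

import numpy

# Hypothetical stand-in for the question's fdata before the numpy.array call:
# two "files" with 500 rows of 7 fields, and one ragged file with only 499 rows.
fdata = [[['x'] * 7] * 500, [['x'] * 7] * 500, [['x'] * 7] * 499]

# Report any file whose row count differs from the first file's; a single
# ragged file is enough to turn numpy.array(fdata) into a 1d object array.
for order, pdata in enumerate(fdata):
    if len(pdata) != len(fdata[0]):
        print(order, len(pdata))  # prints: 2 499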
===================
Taking your file sample, as a list of lines:
In [822]: txt=b"""20080225,A,31.42,31.79,31.2,31.5,30575
...: 20080225,AA,36.64,38.95,36.48,38.85,225008
...: 20080225,AAPL,118.59,120.17,116.664,119.74,448847 """
In [823]: txt=txt.splitlines()
In [826]: fdata=[]
In [827]: pdata=[]
read one 'file':
In [828]: for i in txt:
     ...:     r = i.split(b',')
     ...:     pdata.append(r)
     ...: fdata.append(pdata)
     ...:
In [829]: fdata
Out[829]:
[[[b'20080225', b'A', b'31.42', b'31.79', b'31.2', b'31.5', b'30575 '],
....]]]
In [830]: np.array(fdata)
Out[830]:
array([[[b'20080225', b'A', b'31.42', b'31.79', b'31.2', b'31.5',
b'30575 '],
...]]],
dtype='|S8')
In [831]: _.shape
Out[831]: (1, 3, 7)
Read an 'identical' file:
In [832]: for i in txt:
     ...:     r = i.split(b',')
     ...:     pdata.append(r)
     ...: fdata.append(pdata)
In [833]: len(fdata)
Out[833]: 2
In [834]: np.array(fdata).shape
Out[834]: (2, 6, 7)
In [835]: np.array(fdata).dtype
Out[835]: dtype('S8')
Note the dtype - a string of 8 characters. Since one value per line is a string (the ticker symbol), it can't convert the whole thing to numbers.
Now read a slightly different 'file' (one less line, one less value)
In [836]: txt1=b"""20080225,A,31.42,31.79,31.2,31.5,30575
...: 20080225,AA,36.64,38.95,36.48,38.85 """
In [837]: txt1=txt1.splitlines()
In [838]: for i in txt1:
     ...:     r = i.split(b',')
     ...:     pdata.append(r)
     ...: fdata.append(pdata)
In [839]: len(fdata)
Out[839]: 3
In [840]: np.array(fdata).shape
Out[840]: (3, 8)
In [841]: np.array(fdata).dtype
Out[841]: dtype('O')
Now let's add an 'empty' file - no rows, so pdata is []
In [842]: fdata.append([])
In [843]: np.array(fdata).shape
Out[843]: (4,)
In [844]: np.array(fdata).dtype
Out[844]: dtype('O')
Array shape and dtype have totally changed. It can no longer create a uniform 3d array from the lines.
The shape after 10 files, (10, 500, 7), means 10 files, 500 lines each, 7 columns each line. But one or more files of the full 447 is different. My last iteration suggests one is empty.
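A quick way to locate the offending files, as a sketch (it assumes rawdata still holds one list of lines per file and ndata the matching filenames, as in the question's code):

# Sketch: report every file whose line count differs from the first file's.
# Assumes rawdata is the per-file list of line lists and ndata the filenames.
expected = len(rawdata[0])
for order, lines in enumerate(rawdata):
    if len(lines) != expected:
        print(order, ndata[order], len(lines))  # index, filename, line count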
I have a block of text as below. How do I read this into a numpy array?
5.780326E+03 7.261185E+03 7.749190E+03 8.488770E+03 5.406134E+03 2.828410E+03 9.620957E+02 1.0000000E+00
3.097372E+03 3.885160E+03 5.432678E+03 8.060628E+03 2.768457E+03 6.574258E+03 7.268591E+02 2.0000000E+00
2.061429E+03 4.665282E+03 8.214119E+03 3.579380E+03 8.542057E+03 2.089062E+03 8.829263E+02 3.0000000E+00
3.572444E+03 9.920473E+03 3.573251E+03 6.423813E+03 2.469338E+03 4.652253E+03 8.211962E+02 4.0000000E+00
7.460966E+03 7.691966E+03 7.501826E+03 3.414511E+03 8.590221E+03 6.737868E+03 8.586273E+02 5.0000000E+00
3.250046E+03 9.611985E+03 9.195165E+03 1.064800E+03 7.944535E+03 2.685740E+03 8.212849E+02 6.0000000E+00
8.069926E+03 9.208576E+03 4.267749E+03 2.491888E+03 9.036555E+03 5.001732E+03 7.202407E+02 7.0000000E+00
5.691460E+03 3.868344E+03 3.103342E+03 6.567618E+03 7.274860E+03 8.393253E+03 5.628069E+02 8.0000000E+00
2.887292E+03 9.081563E+02 6.955551E+03 6.763133E+03 2.146178E+03 2.033861E+03 9.725472E+02 9.0000000E+00
6.127778E+03 8.065057E+02 7.474341E+03 4.185868E+03 4.516230E+03 8.714840E+03 8.254562E+02 1.0000000E+01
1.594643E+03 6.060956E+03 2.137153E+03 3.505950E+03 7.714227E+03 6.249693E+03 5.724376E+02 1.1000000E+01
5.039059E+03 3.138161E+03 5.570104E+03 4.594189E+03 7.889644E+03 1.891062E+03 7.085753E+02 1.2000000E+01
3.263593E+03 6.085087E+03 7.136061E+03 9.895028E+03 6.139666E+03 6.670919E+03 5.018248E+02 1.3000000E+01
9.954830E+03 6.777074E+03 3.013747E+03 3.638458E+03 4.357685E+03 1.876539E+03 5.969378E+02 1.4000000E+01
9.920853E+03 3.414156E+03 5.534430E+03 2.011815E+03 7.791122E+03 3.893439E+03 5.229754E+02 1.5000000E+01
5.447470E+03 7.184321E+03 1.382575E+03 9.134295E+03 7.883753E+02 9.160537E+03 7.521197E+02 1.6000000E+01
3.344917E+03 8.151884E+03 3.596052E+03 3.953284E+03 7.456115E+03 7.749632E+03 9.773521E+02 1.7000000E+01
6.310496E+03 1.472792E+03 1.812452E+03 9.535100E+03 1.581263E+03 3.649150E+03 6.562440E+02 1.8000000E+01
I am trying to use native numpy methods to speed up the data reading. I am reading a couple of GBs of data from a custom file format. I am able to seek to the area where a block of text as shown above will appear. Doing regular Python string operations on this is always possible; however, I wanted to know if there are any native numpy methods to read a fixed-width format.
I tried using np.frombuffer with dtype=float, which did not work. It seems to read if I use dtype='S15'; however, the values show up as bytes and not numbers.
In [294]: txt = """5.780326E+03 7.261185E+03 7.749190E+03 8.488770E+03 5.406134E+03 2.828410E+03 9.620957E+02 1.0000000E+00
     ...: 3.097372E+03 3.885160E+03 5.432678E+03 8.060628E+03 2.768457E+03 6.574258E+03 7.268591E+02 2.0000000E+00
     ...: 2.061429E+03 4.665282E+03 8.214119E+03 3.579380E+03 8.542057E+03 2.089062E+03 8.829263E+02 3.0000000E+00
     ...: """
With this copy-n-paste I'm assuming your block is a multiline string.
Treating it like a csv file.
In [296]: np.loadtxt(txt.splitlines())
Out[296]:
array([[5.780326e+03, 7.261185e+03, 7.749190e+03, 8.488770e+03,
5.406134e+03, 2.828410e+03, 9.620957e+02, 1.000000e+00],
[3.097372e+03, 3.885160e+03, 5.432678e+03, 8.060628e+03,
2.768457e+03, 6.574258e+03, 7.268591e+02, 2.000000e+00],
[2.061429e+03, 4.665282e+03, 8.214119e+03, 3.579380e+03,
8.542057e+03, 2.089062e+03, 8.829263e+02, 3.000000e+00]])
There's a lot going on under the covers, so this isn't particularly fast. pandas has a faster csv reader.
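For comparison, a sketch of the pandas route (assuming the same txt string as above; sep=r'\s+' treats any run of spaces as one delimiter):

import io
import pandas as pd

# pandas' csv reader on the same whitespace-delimited block
df = pd.read_csv(io.StringIO(txt), sep=r'\s+', header=None)
arr = df.to_numpy()   # (3, 8) float64 array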
fromstring works, but returns 1d. You can reshape the result
In [299]: np.fromstring(txt, sep=' ')
Out[299]:
array([5.780326e+03, 7.261185e+03, 7.749190e+03, 8.488770e+03,
5.406134e+03, 2.828410e+03, 9.620957e+02, 1.000000e+00,
3.097372e+03, 3.885160e+03, 5.432678e+03, 8.060628e+03,
2.768457e+03, 6.574258e+03, 7.268591e+02, 2.000000e+00,
2.061429e+03, 4.665282e+03, 8.214119e+03, 3.579380e+03,
8.542057e+03, 2.089062e+03, 8.829263e+02, 3.000000e+00])
This is a string, not a buffer, so frombuffer is wrong.
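The reshape step mentioned above, as a one-liner (assuming 8 columns per row; note that np.fromstring's text mode is deprecated in newer numpy releases):

arr = np.fromstring(txt, sep=' ').reshape(-1, 8)   # (3, 8)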
This list comprehension works:
np.array([row.strip().split(' ') for row in txt.strip().splitlines()], float)
I had to add strip to clear out excess blanks that produced empty lists or strings.
At least with this small sample, the list comprehension isn't that much slower than the fromstring, and still a lot better than the more general loadtxt.
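A rough way to check those relative timings yourself, as a sketch (the data is a made-up stand-in and the absolute numbers depend on the machine):

import timeit

setup = "import numpy as np; txt = ' 1.0E+00 2.0E+00\\n 3.0E+00 4.0E+00\\n' * 1000"
for stmt in [
    "np.fromstring(txt, sep=' ')",
    "np.array([r.split() for r in txt.strip().splitlines()], float)",
    "np.loadtxt(txt.splitlines())",
]:
    print(stmt, timeit.timeit(stmt, setup=setup, number=10))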
You could use several string operations to convert the data to a form which is convertible to float. Such as:
import numpy as np

with open('data.txt', 'r') as f:
    data = f.readlines()

result = []
for line in data:
    splitted_data = line.split(' ')
    splitted_data = [item for item in splitted_data if item]
    splitted_data = [item.replace('E+', 'e') for item in splitted_data]
    result.append(splitted_data)

result = np.array(result, dtype='float64')
Where data.txt is the data you pasted in your question. (Note that Python's float() already parses 'E+03' notation, so the replace('E+', 'e') step is not strictly necessary.)
I just did a regular python split and assigned the dtype to np.float32
>>> y = np.array(x.split(), dtype=np.float32)
>>> y
array([ 5.78032617e+03, 7.26118506e+03, 7.74918994e+03,
8.48876953e+03, 5.40613379e+03, 2.82840991e+03,
9.62095703e+02, 1.00000000e+00, 3.09737207e+03,
3.88515991e+03, 5.43267822e+03, 8.06062793e+03,
2.76845703e+03, 6.57425781e+03, 7.26859070e+02,
2.00000000e+00, 2.06142896e+03, 4.66528223e+03,
8.21411914e+03, 3.57937988e+03, 8.54205664e+03,
2.08906201e+03, 8.82926270e+02, 3.00000000e+00], dtype=float32)
P.S. I copied a chunk of your sample data and assigned it to variable “x”
OK, this doesn't rely on blank spaces or use split() (except to split the lines), and maintains the shape of the array, but it does still use non-NumPy Python.
>>> n=15
>>> x=' 5.780326E+03 7.261185E+03 7.749190E+03 8.488770E+03 5.406134E+03 2.828410E+03 9.620957E+02 1.0000000E+00\n 3.097372E+03 3.885160E+03 5.432678E+03 8.060628E+03 2.768457E+03 6.574258E+03 7.268591E+02 2.0000000E+00\n 2.061429E+03 4.665282E+03 8.214119E+03 3.579380E+03 8.542057E+03 2.089062E+03 8.829263E+02 3.0000000E+00\n 3.572444E+03 9.920473E+03 3.573251E+03 6.423813E+03 2.469338E+03 4.652253E+03 8.211962E+02 4.0000000E+00\n 7.460966E+03 7.691966E+03 7.501826E+03 3.414511E+03 8.590221E+03 6.737868E+03 8.586273E+02 5.0000000E+00\n 3.250046E+03 9.611985E+03 9.195165E+03 1.064800E+03 7.944535E+03 2.685740E+03 8.212849E+02 6.0000000E+00\n 8.069926E+03 9.208576E+03 4.267749E+03 2.491888E+03 9.036555E+03 5.001732E+03 7.202407E+02 7.0000000E+00\n 5.691460E+03 3.868344E+03 3.103342E+03 6.567618E+03 7.274860E+03 8.393253E+03 5.628069E+02 8.0000000E+00\n 2.887292E+03 9.081563E+02 6.955551E+03 6.763133E+03 2.146178E+03 2.033861E+03 9.725472E+02 9.0000000E+00\n 6.127778E+03 8.065057E+02 7.474341E+03 4.185868E+03 4.516230E+03 8.714840E+03 8.254562E+02 1.0000000E+01\n 1.594643E+03 6.060956E+03 2.137153E+03 3.505950E+03 7.714227E+03 6.249693E+03 5.724376E+02 1.1000000E+01\n 5.039059E+03 3.138161E+03 5.570104E+03 4.594189E+03 7.889644E+03 1.891062E+03 7.085753E+02 1.2000000E+01\n 3.263593E+03 6.085087E+03 7.136061E+03 9.895028E+03 6.139666E+03 6.670919E+03 5.018248E+02 1.3000000E+01\n 9.954830E+03 6.777074E+03 3.013747E+03 3.638458E+03 4.357685E+03 1.876539E+03 5.969378E+02 1.4000000E+01\n 9.920853E+03 3.414156E+03 5.534430E+03 2.011815E+03 7.791122E+03 3.893439E+03 5.229754E+02 1.5000000E+01\n 5.447470E+03 7.184321E+03 1.382575E+03 9.134295E+03 7.883753E+02 9.160537E+03 7.521197E+02 1.6000000E+01\n 3.344917E+03 8.151884E+03 3.596052E+03 3.953284E+03 7.456115E+03 7.749632E+03 9.773521E+02 1.7000000E+01\n 6.310496E+03 1.472792E+03 1.812452E+03 9.535100E+03 1.581263E+03 3.649150E+03 6.562440E+02 1.8000000E+01'
>>> s=np.array([[y[i:i+n] for i in range(0, len(y) - n + 1, n)] for y in x.splitlines()], dtype=np.float32)
>>> s
array([[ 5.78032617e+03, 7.26118506e+03, 7.74918994e+03,
8.48876953e+03, 5.40613379e+03, 2.82840991e+03,
9.62095703e+02, 1.00000000e+00],
[ 3.09737207e+03, 3.88515991e+03, 5.43267822e+03,
8.06062793e+03, 2.76845703e+03, 6.57425781e+03,
7.26859070e+02, 2.00000000e+00],
[ 2.06142896e+03, 4.66528223e+03, 8.21411914e+03,
3.57937988e+03, 8.54205664e+03, 2.08906201e+03,
8.82926270e+02, 3.00000000e+00],
[ 3.57244409e+03, 9.92047266e+03, 3.57325098e+03,
6.42381299e+03, 2.46933789e+03, 4.65225293e+03,
8.21196228e+02, 4.00000000e+00],
[ 7.46096582e+03, 7.69196582e+03, 7.50182617e+03,
3.41451099e+03, 8.59022070e+03, 6.73786816e+03,
8.58627319e+02, 5.00000000e+00],
[ 3.25004590e+03, 9.61198535e+03, 9.19516504e+03,
1.06480005e+03, 7.94453516e+03, 2.68573999e+03,
8.21284912e+02, 6.00000000e+00],
[ 8.06992578e+03, 9.20857617e+03, 4.26774902e+03,
2.49188794e+03, 9.03655469e+03, 5.00173193e+03,
7.20240723e+02, 7.00000000e+00],
[ 5.69145996e+03, 3.86834399e+03, 3.10334204e+03,
6.56761816e+03, 7.27485986e+03, 8.39325293e+03,
5.62806885e+02, 8.00000000e+00],
[ 2.88729199e+03, 9.08156311e+02, 6.95555078e+03,
6.76313281e+03, 2.14617798e+03, 2.03386096e+03,
9.72547180e+02, 9.00000000e+00],
[ 6.12777783e+03, 8.06505676e+02, 7.47434082e+03,
4.18586816e+03, 4.51622998e+03, 8.71483984e+03,
8.25456177e+02, 1.00000000e+01],
[ 1.59464294e+03, 6.06095605e+03, 2.13715308e+03,
3.50594995e+03, 7.71422705e+03, 6.24969287e+03,
5.72437622e+02, 1.10000000e+01],
[ 5.03905908e+03, 3.13816089e+03, 5.57010400e+03,
4.59418896e+03, 7.88964404e+03, 1.89106201e+03,
7.08575317e+02, 1.20000000e+01],
[ 3.26359302e+03, 6.08508691e+03, 7.13606104e+03,
9.89502832e+03, 6.13966602e+03, 6.67091895e+03,
5.01824799e+02, 1.30000000e+01],
[ 9.95483008e+03, 6.77707422e+03, 3.01374707e+03,
3.63845801e+03, 4.35768506e+03, 1.87653894e+03,
5.96937805e+02, 1.40000000e+01],
[ 9.92085254e+03, 3.41415601e+03, 5.53443018e+03,
2.01181494e+03, 7.79112207e+03, 3.89343896e+03,
5.22975403e+02, 1.50000000e+01],
[ 5.44747021e+03, 7.18432080e+03, 1.38257495e+03,
9.13429492e+03, 7.88375305e+02, 9.16053711e+03,
7.52119690e+02, 1.60000000e+01],
[ 3.34491699e+03, 8.15188379e+03, 3.59605200e+03,
3.95328394e+03, 7.45611523e+03, 7.74963184e+03,
9.77352112e+02, 1.70000000e+01],
[ 6.31049609e+03, 1.47279199e+03, 1.81245203e+03,
9.53509961e+03, 1.58126294e+03, 3.64914990e+03,
6.56244019e+02, 1.80000000e+01]], dtype=float32)
Thanks to @hpaulj's comments. Here's the answer I ended up with.
data = np.genfromtxt(f, delimiter=[15]*8, max_rows=18)
More explanation
Since I am reading this from a custom file format, I will post how I'm doing the whole thing as well.
I do some initial processing of the file to identify the positions where the blocks of text reside, and end up with an array of 'locations' I can seek to. Then I use the above method to read each 'block' of text.
data = np.array([])
r = 18  # rows per block
c = 8   # columns per block
w = 15  # width of a column

with open('mycustomfile.xyz') as f:
    for location in locations:
        f.seek(location)
        data = np.append(data, np.genfromtxt(f, delimiter=[w]*c, max_rows=r))

data = data.reshape((r*len(locations), c))
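Since np.append copies the whole array on every call, collecting the per-block arrays in a list and stacking once at the end is usually faster when there are many blocks; a sketch of that variant (same locations, w, c, and r as above):

# Sketch: collect blocks in a list and stack once, instead of np.append per block.
blocks = []
with open('mycustomfile.xyz') as f:
    for location in locations:
        f.seek(location)
        blocks.append(np.genfromtxt(f, delimiter=[w]*c, max_rows=r))
data = np.vstack(blocks)   # shape (r*len(locations), c)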
If you want an array with dtype=float you have to convert your strings to float beforehand.
import numpy as np
string_list = ["1", "0.1", "1.345e003"]
array = np.array([float(string) for string in string_list])
array.dtype
I have some code which creates an x variable (frames) and a y variable (pixel intensity) in an infinite loop until the program ends. I would like to append these values every loop into a txt.file so that I can later work with the data. The data comes out as numpy arrays.
Say, for example, after 5 loops (5 frames) I get these values:
1 2 3 4 5 (x values)
0 0 8 0 0 (y values)
I would like it to append these into a file every loop so I get after closing the program this:
1, 0
2, 0
3, 8
4, 0
5, 0
What would be the fastest way to implement this?
So far I have tried np.savetxt('data.txt', x), but this only saves the last value in the loop and doesn't add the data each loop. Is there a way to change this function, or another function I could use, that adds the data to the txt document?
First I will zip the values into (x, y) coordinate form and put them into a list so it is easier to append them to a text file. In your program you won't need to do this, since you will have already generated the x and y within the loop.
x = [1, 2, 3, 4 ,5] #(x values)
y = [0, 0, 8, 0, 0] #(y values)
coordinate = list(zip(x,y))
print(coordinate)
So I used the zip function to store the sample results as (x_n, y_n) in a list for later.
Within the loop itself you can use the following to append each coordinate to the text file:
for element in coordinate:  # you wouldn't need this loop since you are already in one
    file1 = open("file.txt", "a")
    file1.write(f"{element}\n")
    file1.close()
Output (contents of file.txt):
(1, 0)
(2, 0)
(3, 8)
(4, 0)
(5, 0)
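Opening and closing the file on every iteration is wasteful; a sketch of the same idea with the file opened once in append mode (the context manager closes it even on error):

# Sketch: open once, append every pair; the file is closed automatically on exit.
with open("file.txt", "a") as file1:
    for element in coordinate:
        file1.write(f"{element}\n")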
You can do something like this -- it is not complete because it will just append to the old file. The other issue with this is that it will not actually write the file until you close it. If you really need the file saved each time in the loop, then another solution is required.
import numpy as np
variable_to_ID_file = 3.
file_name_str = 'Foo_Var{0:.0f}.txt'.format(variable_to_ID_file)
# Need code here to delete old file if it is there
# with proper error checking, or write a blank file, then open it in append mode.
f_id = open(file_name_str, 'a')
for ii in range(4):
    # Pick the delimiter you desire. I prefer tab '\t'.
    np.savetxt(f_id, np.column_stack((ii, 3*ii/4)), delimiter=', ', newline='\n')
f_id.close()
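For the "delete old file" placeholder above, one simple option is to truncate the file once before opening it in append mode (a sketch; an os.remove call guarded by os.path.exists would work as well):

# Sketch: create the file empty (or empty an existing one) before appending.
open(file_name_str, 'w').close()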
If you do not need to write the file for each step in the loop, I recommend this option. It requires the Numpy arrays to be the same size.
import numpy as np
array1 = np.arange(1,5,1)
array2 = np.zeros(array1.size)
variable_to_ID_file = 3.
file_name_str = 'Foo_Var{0:.0f}.txt'.format(variable_to_ID_file)
# Pick the delimiter you desire. I prefer tab '\t'.
np.savetxt(file_name_str, np.column_stack((array1, array2)), delimiter=', ')
I have one big file saved using numpy in append mode; it contains maybe 5000 arrays, each with shape e.g. [1, 224, 224, 3], written like this:
filepath = 'hello'
for some loop:
    ...
    with open(filepath, 'ab') as f:
        np.save(f, ndarray)
I need to load the data in the file, maybe all arrays, or maybe in some generating mode, like reading the first 100, then the next 100, and so on. Is there any method to do this properly? For now, I only know that if I use np.load, I get one array each time, and I don't know how to read, say, arrays 100 to 199.
loading arrays saved using numpy.save in append mode
This question talks about something related, but it doesn't seem to be what I want.
One solution, although ugly and only able to get all arrays in the file at once (and thus risking an out-of-memory error), is the following:
a = []
with open(filepath, 'rb') as f:
    while True:
        try:
            a.append(np.load(f))
        except Exception:
            break
np.stack(a)
This is more of a hack (given your situation).
Anyway, here is the one that created the files with np.save in append mode:
import numpy as np

numpy_arrays = [np.array([1, 2, 3]), np.array([0, 9])]
print(numpy_arrays[0], numpy_arrays[1])
print(type(numpy_arrays[0]), type(numpy_arrays[1]))

for numpy_array in numpy_arrays:
    with open("./my-numpy-arrays.bin", 'ab') as f:
        np.save(f, numpy_array)
[1 2 3] [0 9]
<class 'numpy.ndarray'> <class 'numpy.ndarray'>
... and here is the code that catches the end-of-file error (and any other errors) while looping through:
with open("./my-numpy-arrays.bin", 'rb') as f:
    while True:
        try:
            numpy_array = np.load(f)
            print(numpy_array)
        except Exception:
            break
[1 2 3]
[0 9]
Not very pretty but ... it works.
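A lazier variant, as a sketch (assuming the file layout from the question): wrap the repeated np.load in a generator and use itertools.islice to pull out only arrays 100-199 without keeping everything in memory at once. Note that the earlier arrays are still deserialized and discarded, since the format has no index to seek by.

import itertools
import numpy as np

def load_arrays(path):
    # Yield arrays one at a time from a file written with repeated np.save.
    with open(path, 'rb') as f:
        while True:
            try:
                yield np.load(f)
            except Exception:  # end of file (or a truncated record)
                return

# arrays 100..199 only; earlier ones are read and discarded, not stored
batch = list(itertools.islice(load_arrays(filepath), 100, 200))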
I have a list of lists of lists in the following format:
[ [[a1_1, a1_2, a1_3, a1_4], [b1_1, b1_2, b1_3, b1_4]],
[[a2_1, a2_2, a2_3, a2_4], [b2_1, b2_2, b2_3, b2_4]],
:
:
[[a10_1, a10_2, a10_3, a10_4], [b10_1, b10_2, b10_3, b10_4]] ]
Other than iterating over each element and adding it to a new structure, is there an elegant way to accomplish the following:
Restructure the list to:
[ [[ a1_1, b1_1], [a1_2, b1_2], [a1_3, b1_3], [a1_4, b1_4]],
[[ a2_1, b2_1], [a2_2, b2_2], [a2_3, b2_3], [a2_4, b2_4]],
:
:
[[ a10_1, b10_1], [a10_2, b10_2], [a10_3, b10_3], [a10_4, b10_4]] ]
Then convert the above list of lists of lists to a numpy array with shape 10 x 4 x 2. Thanks!
You can use transpose here:
import numpy as np
ar = np.array(data)
and then:
ar.transpose((0,2,1))
or equivalent:
ar.transpose(0,2,1)
If I write strings into the variables, and then use your sample data, I get:
>>> ar
array([[['a_1_1', 'a_1_2', 'a_1_3', 'a_1_4'],
['b_1_1', 'b_1_2', 'b_1_3', 'b_1_4']],
[['a_2_1', 'a_2_2', 'a_2_3', 'a_2_4'],
['b_2_1', 'b_2_2', 'b_2_3', 'b_2_4']],
[['a_10_1', 'a_10_2', 'a_10_3', 'a_10_4'],
['b_10_1', 'b_10_2', 'b_10_3', 'b_10_4']]],
dtype='<U6')
>>> ar.transpose((0,2,1))
array([[['a_1_1', 'b_1_1'],
['a_1_2', 'b_1_2'],
['a_1_3', 'b_1_3'],
['a_1_4', 'b_1_4']],
[['a_2_1', 'b_2_1'],
['a_2_2', 'b_2_2'],
['a_2_3', 'b_2_3'],
['a_2_4', 'b_2_4']],
[['a_10_1', 'b_10_1'],
['a_10_2', 'b_10_2'],
['a_10_3', 'b_10_3'],
['a_10_4', 'b_10_4']]],
dtype='<U6')
transpose takes as input an array and a tuple of axes. It rearranges the axes such that (when we give it (0,2,1)) the old first (0) axis is the new first axis, the old third (2) axis is the new second axis, and the old second (1) axis is the new third axis.
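In shape terms, a minimal numeric sketch of the same call:

import numpy as np

a = np.zeros((10, 2, 4))      # 10 pairs of 4-element rows, as in the question
b = a.transpose((0, 2, 1))    # axis 0 stays put, axes 1 and 2 swap
print(b.shape)                # -> (10, 4, 2)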
If you already have a list, you should be able to accomplish this relatively painlessly, just use the zip transpose idiom on the sublists:
arr = np.array([list(zip(*sub)) for sub in my_list])
So, using only 3 rows...
In [1]: data = [ [['a1_1', 'a1_2', 'a1_3', 'a1_4'], ['b1_1', 'b1_2', 'b1_3', 'b1_4']],
...: [['a2_1', 'a2_2', 'a2_3', 'a2_4'], ['b2_1', 'b2_2', 'b2_3', 'b2_4']],
...: [['a10_1', 'a10_2', 'a10_3', 'a10_4'], ['b10_1', 'b10_2', 'b10_3', 'b10_4']] ]
In [2]: [list(zip(*sub)) for sub in data]
Out[2]:
[[('a1_1', 'b1_1'), ('a1_2', 'b1_2'), ('a1_3', 'b1_3'), ('a1_4', 'b1_4')],
[('a2_1', 'b2_1'), ('a2_2', 'b2_2'), ('a2_3', 'b2_3'), ('a2_4', 'b2_4')],
[('a10_1', 'b10_1'), ('a10_2', 'b10_2'), ('a10_3', 'b10_3'), ('a10_4', 'b10_4')]]
In [3]: import numpy as np
In [4]: np.array([list(zip(*sub)) for sub in data])
Out[4]:
array([[['a1_1', 'b1_1'],
['a1_2', 'b1_2'],
['a1_3', 'b1_3'],
['a1_4', 'b1_4']],
[['a2_1', 'b2_1'],
['a2_2', 'b2_2'],
['a2_3', 'b2_3'],
['a2_4', 'b2_4']],
[['a10_1', 'b10_1'],
['a10_2', 'b10_2'],
['a10_3', 'b10_3'],
['a10_4', 'b10_4']]],
dtype='<U5')
In [5]: np.array([list(zip(*sub)) for sub in data]).shape
Out[5]: (3, 4, 2)
I have an h5 file that contains 62 different attributes. I would like to access the data range of each one of them.
To explain more, here is what I'm doing:
import h5py
the_file = h5py.File("myfile.h5","r")
data = the_file["data"]
att = data.keys()
The previous code gives me a list of attributes: "U", "T", "H", ... etc.
Let's say I want to know the minimum and maximum values of "U". How can I do that?
This is the output of running "h5dump -H":
HDF5 "myfile.h5" {
GROUP "/" {
GROUP "data" {
ATTRIBUTE "datafield_names" {
DATATYPE H5T_STRING {
STRSIZE 8;
STRPAD H5T_STR_SPACEPAD;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SIMPLE { ( 62 ) / ( 62 ) }
}
ATTRIBUTE "dimensions" {
DATATYPE H5T_STD_I32BE
DATASPACE SIMPLE { ( 4 ) / ( 4 ) }
}
ATTRIBUTE "time_variables" {
DATATYPE H5T_IEEE_F64BE
DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
}
DATASET "Temperature" {
DATATYPE H5T_IEEE_F64BE
DATASPACE SIMPLE { ( 256, 512, 1024 ) / ( 256, 512, 1024 ) }
}
It might be a difference in terminology, but hdf5 attributes are accessed via the attrs attribute of a Dataset object. I call what you have variables or datasets. Anyway...
I'm guessing from your description that the attributes are just arrays; you should be able to do the following to get the data for each attribute and then calculate the min and max like for any numpy array:
attr_data = data["U"][:] # gets a copy of the array
min = attr_data.min()
max = attr_data.max()
So if you want the min/max of each attribute, you can do a for loop over the attribute names, or you could use:

for attr_name, attr_value in data.items():
    min = attr_value[:].min()
    max = attr_value[:].max()
Edit to answer your first comment:
h5py's objects can be used like python dictionaries. So when you use 'keys()' you are not actually getting data, you are getting the name (or key) of that data. For example, if you run the_file.keys() you will get a list of every hdf5 dataset in the root path of that hdf5 file. If you continue along a path you will end up with the dataset that holds the actual binary data. So for example, you might start with (in an interpreter at first):
the_file = h5py.File("myfile.h5", "r")
print(the_file.keys())
# this will result in a list of keys, maybe ["raw_data", "meta_data"] or something
print(the_file["raw_data"].keys())
# this will result in another list of keys, maybe ["temperature", "humidity"]
# eventually you'll get to the dataset that actually has the data or attributes you are looking for
# think of this process as going through a directory structure or a path to get to a file (or a dataset/variable in this case)
the_data_var = the_file["raw_data"]["temperature"]
the_data_array = the_data_var[:]
print(the_data_var.attrs.keys())
# this will result in a list of attribute names/keys
an_attr_of_the_data = the_data_var.attrs["measurement_time"][:]
# So now you have "the_data_array" which is a numpy array and "an_attr_of_the_data" which is whatever it happened to be
# you can get the min/max of the data as before
print(the_data_array.min())
print(the_data_array.max())
Edit 2 - Why do people format their hdf files this way? It defeats the purpose.
I think you may have to talk to the person who made this file if possible. If you made it, then you'll be able to answer my questions for yourself. First, are you sure that in your original example data.keys() returned "U", "T", etc.? Unless h5py is doing something magical, or you didn't provide all of the output of the h5dump, that could not have been your output. I'll explain what the h5dump is telling me, but please try to understand what I am doing and not just copy and paste into your terminal.
# Get a handle to the "data" Group
data = the_file["data"]
# As you can see from the dump, this data group has 3 attributes and 1 dataset.
# The names of the attributes are "datafield_names", "dimensions", "time_variables".
# This should result in a list of those names:
print(data.attrs.keys())
# The name of the dataset is "Temperature" and should be the only item in the list returned by:
print(data.keys())
As you can see from the h5dump, there are 62 datafield_names (strings), 4 dimensions (32-bit integers, I think), and 2 time_variables (64-bit floats). It also tells me that Temperature is a 3-dimensional array, 256 x 512 x 1024 (64-bit floats). Do you see where I'm getting this information? Now comes the hard part, you will need to determine how the datafield_names match up with the Temperature array. This was done by the person who made the file, so you'll have to figure out what each row/column in the Temperature array means. My first guess would be that each row in the Temperature array is one of the datafield_names, maybe 2 more for each time? But this doesn't work since there are too many rows in the array. Maybe the dimensions fit in there some how? Lastly here is how you get each of those pieces of information (continuing from before):
# Get the temperature array (I can't remember if the 3 sets of colons are required, but try it and if not just use one)
temp_array = data["Temperature"][:, :, :]
# Get all of the datafield_names (list of strings of length 62)
datafields = data.attrs["datafield_names"][:]
# Get all of the dimensions (list of integers of length 4)
dims = data.attrs["dimensions"][:]
# Get all of the time variables (list of floats of length 2)
time_variables = data.attrs["time_variables"]
# If you want the min/max of the entire temperature array this should work:
print(temp_array.min())
print(temp_array.max())
# If you knew that row 0 of the array had the temperatures you wanted to analyze
# then this would work, but it all depends on how the creator organized the data/file:
print(temp_array[0].min())
print(temp_array[0].max())
I'm sorry I can't be of more help, but without actually having the file and knowing what each field means this is about all I can do. Try to understand how I used h5py to read the information. Try to understand how I translated the header information (h5dump output) into information that I could actually use with h5py. If you know how the data is organized in the array you should be able to do what you want. Good luck, I'll help more if I can.
Since h5py arrays are closely related to numpy arrays, you can use the numpy.min and numpy.max functions to do this:
maxItem = numpy.max(data['U'][:]) # Find the max of item 'U'
minItem = numpy.min(data['H'][:]) # Find the min of item 'H'
Note the ':', it is needed to convert the data to a numpy array.
You can call min and max (along axis 0, i.e. per column) on the DataFrame:
In [1]: df = pd.DataFrame([[1, 6], [5, 2], [4, 3]], columns=list('UT'))
In [2]: df
Out[2]:
U T
0 1 6
1 5 2
2 4 3
In [3]: df.min(0)
Out[3]:
U 1
T 2
In [4]: df.max(0)
Out[4]:
U 5
T 6
Did you mean data.attrs rather than data itself? If so,
import h5py
with h5py.File("myfile.h5", "w") as the_file:
    dset = the_file.create_dataset('MyDataset', (100, 100), 'i')
    dset.attrs['U'] = (0, 1, 2, 3)
    dset.attrs['T'] = (2, 3, 4, 5)

with h5py.File("myfile.h5", "r") as the_file:
    data = the_file["MyDataset"]
    print({key: (min(value), max(value)) for key, value in data.attrs.items()})
yields
{'U': (0, 3), 'T': (2, 5)}