How to use function like matlab 'fread' in python? - python

This is a .dat file.
In Matlab, I can use this code to read.
lonlatfile='NOM_ITG_2288_2288(0E0N)_LE.dat';
f=fopen(lonlatfile,'r');
lat_fy=fread(f,[2288*2288,1],'float32');
lon_fy=fread(f,[2288*2288,1],'float32')+86.5;
lon=reshape(lon_fy,2288,2288);
lat=reshape(lat_fy,2288,2288);
Here are some results of Matlab:
matalab
How to do in python to get the same result?
PS: My code is this:
def fromfileskip(fid,shape,counts,skip,dtype):
"""
fid : file object, Should be open binary file.
shape : tuple of ints, This is the desired shape of each data block.
For a 2d array with xdim,ydim = 3000,2000 and xdim = fastest
dimension, then shape = (2000,3000).
counts : int, Number of times to read a data block.
skip : int, Number of bytes to skip between reads.
dtype : np.dtype object, Type of each binary element.
"""
data = np.zeros((counts,) + shape)
for c in range(counts):
block = np.fromfile(fid,dtype=np.float32,count=np.product(shape))
data[c] = block.reshape(shape)
fid.seek( fid.tell() + skip)
return data
fid = open(r'NOM_ITG_2288_2288(0E0N)_LE.dat','rb')
data = fromfileskip(fid,(2288,2288),1,0,np.float32)
loncenter = 86.5 #Footpoint of FY2E
latcenter = 0
lon2e = data+loncenter
lat2e = data+latcenter
Lon = lon2e.reshape(2288,2288)
Lat = lat2e.reshape(2288,2288)
But, the result is different from that of Matlab.

You should be able to translate the code directly into Python with little change:
lonlatfile = 'NOM_ITG_2288_2288(0E0N)_LE.dat'
with open(lonlatfile, 'rb') as f:
lat_fy = np.fromfile(f, count=2288*2288, dtype='float32')
lon_fy = np.fromfile(f, count=2288*2288, dtype='float32')+86.5
lon = lon_ft.reshape([2288, 2288], order='F');
lat = lat_ft.reshape([2288, 2288], order='F');
Normally the numpy reshape would be transposed compared to the MATLAB result, due to different index orders. The order='F' part makes sure the final output has the same layout as the MATLAB version. It is optional, if you remember the different index order you can leave that off.
The with open() as f: opens the file in a safe manner, making sure it is closed again when you are done even if the program has an error or is cancelled for whatever reason. Strictly speaking it is not needed, but you really should always use it when opening a file.

Related

Behavior of Python numpy tofile and fromfile when saving int16

After appending data to an empty array and verifying it's shape, I use tofile to save it. When I read it back with fromfile, the shape is much larger (4x).
# Create empty array
rx_data = np.empty((0), dtype='int16')
# File name to write data
file_name = "Test_file.bin"
# Generate two sinewaves and add to rx_data array in interleaved fashion
for i in range(100):
I = np.sin(2*3.14*(i/100))
rx_data = np.append(rx_data, I)
Q = np.cos(2*3.14*(i/100))
rx_data = np.append(rx_data, Q)
print(rx_data.shape)
# Write array to .bin file
rx_data.tofile(file_name)
# Read back data from the .file
s_interleaved = np.fromfile(file_name, dtype=np.int16)
print(s_interleaved.shape)
The above code returns (200) before the array is saved, but return 800 when the array is read back in. Why does this happen?
The problem is that rx_data has a float64 data type before you save it. This is because I and Q are float64 arrays, so when you use np.append, the type will be promoted to be compatible with the float64 values.
Also, populating an array using np.append is an anti-pattern in numpy. It's common to do in standard python, but with numpy, it is often better to create an array of the shape and data type you need, then to fill in the values in the for loop. This is better because you only need to create the array once. With np.append, you create a new copy every time you call it.
rx_data = np.empty(200, dtype="float64")
for i in range(200):
rx_data[i] = np.sin(...)
Here is code that works, because it uses float64. But this pattern should be avoided in general. Prefer pre-allocating an array and replacing the values in the for loop.
# Create empty array
rx_data = np.empty((0), dtype='int16')
# File name to write data
file_name = "Test_file.bin"
# Generate two sinewaves and add to rx_data array in interleaved fashion
for i in range(100):
I = np.sin(2*3.14*(i/100))
rx_data = np.append(rx_data, I)
Q = np.cos(2*3.14*(i/100))
rx_data = np.append(rx_data, Q)
print(rx_data.shape)
# Write array to .bin file
print("rx_data data type:", rx_data.dtype)
rx_data.tofile(file_name)
# Read back data from the .file
s_interleaved = np.fromfile(file_name, dtype=np.float64)
print(s_interleaved.shape)
# Test that the arrays are equal.
np.allclose(rx_data, s_interleaved) # True

How to read and extract values from a binary file using python code?

I am relatively new to python. As part of my astronomy project work, I have to deal with binary files (which of course is again new to me). I was given a binary file and a python code which reads data from the binary file. I was then asked by my professor to understand how the code works on the binary file. I spent couple of days trying to figure out, but nothing helped. Can anyone here help me with the code?
# Read the binary opacity file
f = open(file, "r")
# read file dimension sizes
a = np.fromfile(f, dtype=np.int32, count=16)
NX, NY, NZ = a[1], a[4], a[7]
# read the time and time step
time, time_step = np.fromfile(f, dtype=np.float64, count=2)
# number of iterations
nite = np.fromfile(f, dtype=np.int32, count=1)
# radius array
trash = np.fromfile(f, dtype=np.float64, count=1)
rad = np.fromfile(f, dtype=np.float64, count=a[1])
# phi array
trash = np.fromfile(f, dtype=np.float64, count=1)
phi = np.fromfile(f, dtype=np.float64, count=a[4])
# close the file
f.close()
The binary file as far as I know contains several parameters (eg: radius, phi, sound speed, radiation energy) and its many values. The above code extract the values 2 parameters- radius and phi from the binary file. Both radius and phi have more than 100 values. The program works, but I am not able to understand how it works. Any help would be appreciated.
The binary file is essentially just a long list of continuous data; you need to tell np.fromfile() both where to look and what type of data to expect.
Perhaps it's easiest to understand if you create your own file:
import numpy as np
with open('numpy_testfile', 'w+') as f:
## we create a "header" line, which collects the lengths of all relevant arrays
## you can then use this header line to tell np.fromfile() *how long* the arrays are
dimensions=np.array([0,10,0,0,10,0,3,10],dtype=np.int32)
dimensions.tofile(f) ## write to file
a=np.arange(0,10,1) ## some fake data, length 10
a.tofile(f) ## write to file
print(a.dtype)
b=np.arange(30,40,1) ## more fake data, length 10
b.tofile(f) ## write to file
print(b.dtype)
## more interesting data, this time it's of type float, length 3
c=np.array([3.14,4.22,55.0],dtype=np.float64)
c.tofile(f) ## write to file
print(c.dtype)
a.tofile(f) ## just for fun, let's write "a" again
with open('numpy_testfile', 'r+b') as f:
### what's important to know about this step is that
# numpy is "seeking" the file automatically, i.e. it is considering
# the first count=8, than the next count=10, and so on
# as "continuous data"
dim=np.fromfile(f,dtype=np.int32,count=8)
print(dim) ## our header line: [ 0 10 0 0 10 0 3 10]
a=np.fromfile(f,dtype=np.int64,count=dim[1])## read the dim[1]=10 numbers
b=np.fromfile(f,dtype=np.int64,count=dim[4])## and the next 10
## now it's dim[6]=3, and the dtype is float 10
c=np.fromfile(f,dtype=np.float64,count=dim[6] )#count=30)
## read "the rest", unspecified length, let's hope it's all int64 actually!
d=np.fromfile(f,dtype=np.int64)
print(a)
print(b)
print(c)
print(d)
Addendum: the numpy documentation is quite explicit when it comes to discouraging the use of np.tofile() and np.fromfile():
Do not rely on the combination of tofile and fromfile for data storage, as the binary files generated are are not platform independent. In particular, no byte-order or data-type information is saved. Data can be stored in the platform independent .npy format using save and load instead.
Personal side note: if you spent a couple of days to understand this code, don't feel discouraged of learning python; we all start somewhere. I'd suggest to be honest about the obstacles you've hit to your Professor (if this comes up in conversation), as she/he should be able to correctly assert "where you're at" when it comes to programming. :-)
from astropy.io import ascii
data = ascii.read('/directory/filename')
column1data = data[nameofcolumn1]
column2data = data[nameofcolumn2]
ect.
column1data is now an array of all the values under that header
I use this method to import SourceExtractor dat files which are in the ASCII format.
I believe this a more elegant way to import data from ascii files.

Writing and reading a row array (nx1) to a binary file in Python with struct pack

I'm having a lot of trouble writing to and reading from a binary file when working with a nx1 row vector that has been written to a binary file using struct.pack. The file structure looks like this (given an argument data that is of type numpy.array) :
test.file
--------
[format_code = 3] : 4 bytes (the code 3 means a vector) - fid.write(struct.pack('i',3))
[rows] : 4 bytes (fid.write(struct.pack('i',sz[0])) where sz = data.shape
[cols] : 4 bytes (fid.write(struct.pack('i',sz[1]))
[data] : type double = 8 bytes * (rows * cols)
Unfortunately, since these files are mostly written in MATLAB, where I have a working class that reads and writes these fields, I can't only write the amount of rows (I need columns as well even if a column does only = 1).
I've tried a few ways to pack data, none of which have worked when trying to unpack it (assume I've opened my file denoted by fid in 'rb'/'wb' and have done some error checking):
# write data
sz = data.shape
datalen=8*sz[0]*sz[1]
fid.write(struct.pack('i',3)) # format code
fid.write(struct.pack('i',sz[0])) # rows
fid.write(struct.pack('i',sz[1])) # columns
### write attempt ###
for i in xrange(sz[0]):
for j in xrange(sz[1]):
fid.write(struct.pack('d',float(data[i][j]))) # write in 'c' convention, so we transpose
### read attempt ###
format_code = struct.unpack('i',fid.read(struct.calcsize('i')))[0]
rows = struct.unpack('i',fid.read(struct.calcsize('i')))[0]
cols = struct.unpack('i',fid.read(struct.calcsize('i')))[0]
out_datalen = 8 * rows * cols # size of structure
output_data=numpy.array(struct.unpack('%dd' % out_datalen,fid.read(datalen)),dtype=float)
So far, when reading, my output has just seemingly been multiplied by random things. I don't know whats happening.
I found another similar question, and so I wrote my data as such:
fid.write(struct.pack('%sd' % len(data), *data))
However, when reading it back using:
numpy.array(struct.unpack('%sd' % out_datalen,fid.read(datalen)),dtype=float)
I get nothing in my array.
Similarly, just doing:
fid.write(struct.pack('%dd' % datalen, *data))
and reading it back with:
numpy.array(struct.unpack('%dd' % out_datalen,fid.read(datalen)),dtype=float)
also gives me an empty array. How can I fix this?

Reading several arrays in a binary file with numpy

I'm trying to read a binary file which is composed by several matrices of float numbers separated by a single int. The code in Matlab to achieve this is the following:
fid1=fopen(fname1,'r');
for i=1:xx
Rstart= fread(fid1,1,'int32'); #read blank at the begining
ZZ1 = fread(fid1,[Nx Ny],'real*4'); #read z
Rend = fread(fid1,1,'int32'); #read blank at the end
end
As you can see, each matrix size is Nx by Ny. Rstart and Rend are just dummy values. ZZ1 is the matrix I'm interested in.
I am trying to do the same in python, doing the following:
Rstart = np.fromfile(fname1,dtype='int32',count=1)
ZZ1 = np.fromfile(fname1,dtype='float32',count=Ny1*Nx1).reshape(Ny1,Nx1)
Rend = np.fromfile(fname1,dtype='int32',count=1)
Then, I have to iterate to read the subsequent matrices, but the function np.fromfile doesn't retain the pointer in the file.
Another option:
with open(fname1,'r') as f:
ZZ1=np.memmap(f, dtype='float32', mode='r', offset = 4,shape=(Ny1,Nx1))
plt.pcolor(ZZ1)
This works fine for the first array, but doesn't read the next matrices. Any idea how can I do this?
I searched for similar questions but didn't find a suitable answer.
Thanks
The cleanest way to read all your matrices in a single vectorized statement is to use a struct array:
dtype = [('start', np.int32), ('ZZ', np.float32, (Ny1, Nx1)), ('end', np.int32)]
with open(fname1, 'rb') as fh:
data = np.fromfile(fh, dtype)
print(data['ZZ'])
There are 2 solutions for this problem.
The first one:
for i in range(x):
ZZ1=np.memmap(fname1, dtype='float32', mode='r', offset = 4+8*i+(Nx1*Ny1)*4*i,shape=(Ny1,Nx1))
Where i is the array you want to get.
The second one:
fid=open('fname','rb')
for i in range(x):
Rstart = np.fromfile(fid,dtype='int32',count=1)
ZZ1 = np.fromfile(fid,dtype='float32',count=Ny1*Nx1).reshape(Ny1,Nx1)
Rend = np.fromfile(fid,dtype='int32',count=1)
So as morningsun points out, np.fromfile can receive a file object as an argument and keep track of the pointer. Notice that you must open the file in binary mode 'rb'.

Accessing data range with h5py

I have an h5 file that contains 62 different attributes. I would like to access the data range of each one of them.
to explain more here what I'm doing
import h5py
the_file = h5py.File("myfile.h5","r")
data = the_file["data"]
att = data.keys()
the previous code gives me a list of attributes "U","T","H",.....etc
lets say I want to know what is the minimum and maximum value of "U". how can I do that ?
this is the output of running "h5dump -H"
HDF5 "myfile.h5" {
GROUP "/" {
GROUP "data" {
ATTRIBUTE "datafield_names" {
DATATYPE H5T_STRING {
STRSIZE 8;
STRPAD H5T_STR_SPACEPAD;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SIMPLE { ( 62 ) / ( 62 ) }
}
ATTRIBUTE "dimensions" {
DATATYPE H5T_STD_I32BE
DATASPACE SIMPLE { ( 4 ) / ( 4 ) }
}
ATTRIBUTE "time_variables" {
DATATYPE H5T_IEEE_F64BE
DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
}
DATASET "Temperature" {
DATATYPE H5T_IEEE_F64BE
DATASPACE SIMPLE { ( 256, 512, 1024 ) / ( 256, 512, 1024 ) }
}
It might be a difference in terminology, but hdf5 attributes are access via the attrs attribute of a Dataset object. I call what you have variables or datasets. Anyway...
I'm guessing by your description that the attributes are just arrays, you should be able to do the following to get the data for each attribute and then calculate the min and max like any numpy array:
attr_data = data["U"][:] # gets a copy of the array
min = attr_data.min()
max = attr_data.max()
So if you wanted the min/max of each attribute you can just do a for loop over the attribute names or you could use
for attr_name,attr_value in data.items():
min = attr_value[:].min()
Edit to answer your first comment:
h5py's objects can be used like python dictionaries. So when you use 'keys()' you are not actually getting data, you are getting the name (or key) of that data. For example, if you run the_file.keys() you will get a list of every hdf5 dataset in the root path of that hdf5 file. If you continue along a path you will end up with the dataset that holds the actual binary data. So for example, you might start with (in an interpreter at first):
the_file = h5py.File("myfile.h5","r")
print the_file.keys()
# this will result in a list of keys maybe ["raw_data","meta_data"] or something
print the_file["raw_data"].keys()
# this will result in another list of keys maybe ["temperature","humidity"]
# eventually you'll get to the dataset that actually has the data or attributes you are looking for
# think of this process as going through a directory structure or a path to get to a file (or a dataset/variable in this case)
the_data_var = the_file["raw_data"]["temperature"]
the_data_array = the_data_var[:]
print the_data_var.attrs.keys()
# this will result in a list of attribute names/keys
an_attr_of_the_data = data_var.attrs["measurement_time"][:]
# So now you have "the_data_array" which is a numpy array and "an_attr_of_the_data" which is whatever it happened to be
# you can get the min/max of the data by doing like before
print the_data_array.min()
print the_data_array.max()
Edit 2 - Why do people format their hdf files this way? It defeats the purpose.
I think you may have to talk to the person who made this file if possible. If you made it, then you'll be able to answer my questions for yourself. First, are you sure that in your original example data.keys() returned "U","T",etc.? Unless h5py is doing something magical or if you didn't provide all of the output of the h5dump, that could not have been your output. I'll explain what the h5dump is telling me, but please try to understand what I am doing and not just copy and paste into your terminal.
# Get a handle to the "data" Group
data = the_file["data"]
# As you can see from the dump this data group has 3 attributes and 1 dataset
# The name of the attributes are "datafield_names","dimensions","time_variables"
# This should result in a list of those names:
print data.attrs.keys()
# The name of the dataset is "Temperature" and should be the only item in the list returned by:
print data.keys()
As you can see from the h5dump, there are 62 datafield_names (strings), 4 dimensions (32-bit integers, I think), and 2 time_variables (64-bit floats). It also tells me that Temperature is a 3-dimensional array, 256 x 512 x 1024 (64-bit floats). Do you see where I'm getting this information? Now comes the hard part, you will need to determine how the datafield_names match up with the Temperature array. This was done by the person who made the file, so you'll have to figure out what each row/column in the Temperature array means. My first guess would be that each row in the Temperature array is one of the datafield_names, maybe 2 more for each time? But this doesn't work since there are too many rows in the array. Maybe the dimensions fit in there some how? Lastly here is how you get each of those pieces of information (continuing from before):
# Get the temperature array (I can't remember if the 3 sets of colons is required, but try it and if not just use one)
temp_array = data["Temperature"][:,:,:]
# Get all of the datafield_names (list of strings of length 62)
datafields = data.attrs["datafield_names"][:]
# Get all of the dimensions (list of integers of length 4)
dims = data.attrs["dimensions"][:]
# Get all of the time variables (list of floats of length 2)
time_variables = data.attrs["time_variables"]
# If you want the min/max of the entire temperature array this should work:
print temp_array.min()
print temp_array.max()
# If you knew that row 0 of the array had the temperatures you wanted to analyze
# then this would work, but it all depends on how the creator organized the data/file:
print temp_array[0].min()
print temp_array[1].max()
I'm sorry I can't be of more help, but without actually having the file and knowing what each field means this is about all I can do. Try to understand how I used h5py to read the information. Try to understand how I translated the header information (h5dump output) into information that I could actually use with h5py. If you know how the data is organized in the array you should be able to do what you want. Good luck, I'll help more if I can.
Since h5py arrays are closely related to numpy arrays, you can use the numpy.min and numpy.max functions to do this:
maxItem = numpy.max(data['U'][:]) # Find the max of item 'U'
minItem = numpy.min(data['H'][:]) # Find the min of item 'H'
Note the ':', it is needed to convert the data to a numpy array.
You can call min and max (row-wise) on the DataFrame:
In [1]: df = pd.DataFrame([[1, 6], [5, 2], [4, 3]], columns=list('UT'))
In [2]: df
Out[2]:
U T
0 1 6
1 5 2
2 4 3
In [3]: df.min(0)
Out[3]:
U 1
T 2
In [4]: df.max(0)
Out[4]:
U 5
T 6
Did you mean data.attrs rather than data itself? If so,
import h5py
with h5py.File("myfile.h5", "w") as the_file:
dset = the_file.create_dataset('MyDataset', (100, 100), 'i')
dset.attrs['U'] = (0,1,2,3)
dset.attrs['T'] = (2,3,4,5)
with h5py.File("myfile.h5", "r") as the_file:
data = the_file["MyDataset"]
print({key:(min(value), max(value)) for key, value in data.attrs.items()})
yields
{u'U': (0, 3), u'T': (2, 5)}

Categories

Resources