Python: Reading Fortran Binary file using numpy or scipy - python

I am trying to read a fortran file with headers as integers and then the actual data as 32 bit floats. Using numpy's fromfile('mydatafile', dtype=np.float32) it reads in the whole file as float32 but I need the headers to be in int32 for my output file. Using scipy's FortranFile it reads the headers:
f = FortranFile('mydatafile', 'r')
headers = f.read_ints(dtype=np.int32)
but when I do:
data = f.read_reals(dtype=np.float32)
it returns an empty array. I know it shouldn't be empty because using numpy's fromfile it reads all of the data. Oddly enough the scipy method worked for other files in my dataset, but not this one. Perhaps i'm not understanding the difference between each of the two read methods with numpy and scipy. Is there a way to isolate the headers (dtype=np.int32) and data (dtype=np.float32) when reading in the file with either method?

np.fromfile takes a "count" argument, which specifies how many items to read. If you know the number of integers in the header in advance, a simple way to do what you want without any type conversions would just be to read the header as integers, followed by the rest of the file as floats:
with open('filepath','r') as f:
header = np.fromfile(f, dtype=np.int, count=number_of_integers)
data = np.fromfile(f, dtype=np.float32)

#DavidTrevelyan has an quite okay way. Another way is to use the fortranfile package in combination with struct. Neither way is ideal, but then neither is scipy's FortranFile.
At least this way you can read mixed-type data. Here's an example:
from fortranfile import FortranFile
from struct import unpack
with FortranFile(to_open) as fh:
dat = fh.readRecord()
val_list = unpack('=4i20d'.format(ln), dat)
You can install it using pip install fortranfile. struct is standard, the (un)pack format is here.

Related

Reading binary file. Translate matlab to python

I'm going to translate the working matlab code for reading the binary file to python code. Is there an equivalent for
% open the file for reading
fid=fopen (filename,'rb','ieee-le');
% first read the signature
tmp=fread(fid,2,'char');
% read sizes
rows=fread(fid,1,'ushort');
cols=fread(fid,1,'ushort');
there's the struct module to do that, specifically the unpack function which accepts a buffer, but you'll have to read the required size from the input using struct.calcsize
import struct
endian = "<" # little endian
with open(filename,'rb') as f:
tmp = struct.unpack(f"{endian}cc",f.read(struct.calcsize("cc")))
tmp_int = [int.from_bytes(x,byteorder="little") for x in tmp]
rows = struct.unpack(f"{endian}H",f.read(struct.calcsize("H")))[0]
cols = struct.unpack(f"{endian}H",f.read(struct.calcsize("H")))[0]
you might want to use the struct.Struct class for reading the rest of the data in chunks, as it is going to be faster than decoding numbers one at a time.
ie:
data = []
reader = struct.Struct(endian + "i"*cols) # i for integer
row_size = reader.size
for row_number in range(rows):
row = reader.unpack(f.read(row_size))
data.append(row)
Edit: corrected the answer, and added an example for larger chuncks.
Edit2: okay, more improvement, assuming we are reading 1 GB file of shorts, storing it as python int makes no sense and will most likely give an out of memory error (or system will freeze), the proper way to do it is using numpy
import numpy as np
data = np.fromfile(f,dtype=endian+'H').reshape(cols,rows) # ushorts
this way it'll have the same space in memory as it did on disk.

Convert huge csv to hdf5 format

I downloaded IBM's Airline Reporting Carrier On-Time Performance Dataset; the uncompressed CSV is 84 GB. I want to run an analysis, similar to Flying high with Vaex, with the vaex libary.
I tried to convert the CSV to a hdf5 file, to make it readable for the vaex libary:
import time
import vaex
start=time.time()
df = vaex.from_csv(r"D:\airline.csv", convert=True, chunk_size=1000000)
end=time.time()
print("Time:",(end-start),"Seconds")
I always get an error when running the code:
RuntimeError: Dirty entry flush destroy failed (file write failed: time = Fri Sep 30 17:58:55 2022
, filename = 'D:\airline.csv_chunk_8.hdf5', file descriptor = 7, errno = 22, error message = 'Invalid argument', buf = 0000021EA8C6B128, total write size = 2040, bytes this sub-write = 2040, bytes actually written = 18446744073709551615, offset = 221133661).
Second run, I get this error:
RuntimeError: Unable to flush file's cached information (file write failed: time = Fri Sep 30 20:18:19 2022
, filename = 'D:\airline.csv_chunk_18.hdf5', file descriptor = 7, errno = 22, error message = 'Invalid argument', buf = 000002504659B828, total write size = 2048, bytes this sub-write = 2048, bytes actually written = 18446744073709551615, offset = 348515307)
Is there an alternative way to convert the CSV to hdf5 without Python? For example, a downloadable software which can do this job?
I'm not familiar with vaex, so can't help with usage and functions. However, I can read error messages. :-)
It reports "bytes written" with a huge number (18_446_744_073_709_551_615), much larger than the 84GB CSV. Some possible explanations:
you ran out of disk
you ran out of memory, or
had some other error
To diagnose, try testing with a small csv file and see if vaex.from_csv() works as expected. I suggest the lax_to_jfk.csv file.
Regarding your question, is there an alternative way to convert a csv to hdf5?, why not use Python?
Are you more comfortable with other languages? If so, you can install HDF5 and write your code with their C or Fortran API.
OTOH, if you are familiar with Python, there are other packages you can use to read the CSV file and create the HDF5 file.
Python packages to read the CSV
Personally, I like NumPy's genfromtxt() to read the CSV (You can also use loadtxt() to read the CSV, if you don't have missing values and don't need the field names.) However, I think you will run into memory problems reading a 84GB file. That said, you can use the skip_header and max_rows parameters with genfromtxt() to read and load a subset of lines. Alternately you can use csv.DictReader(). It reads a line at a time. So, you avoid memory issues, but it could be very slow loading the HDF5 file.
Python packages to create the HDF5 file
I have used both h5py and pytables (aka tables) to create and read HDF5 files. Once you load the CSV data to a NumPy array, it's a snap to create the HDF5 dataset.
Here is a very simple example that reads the lax_to_jfk.csv data and loads to a HDF5 file.
csv_name = 'lax_to_jfk'
rec_arr = np.genfromtxt(csv_name+'.csv', delimiter=',',
dtype=None, names=True, encoding='bytes')
with h5py.File(csv_name+'.h5', 'w') as h5f:
h5f.create_dataset(csv_name,data=rec_arr)
Update:
After posting this example, I decided to test with a larger file (airline_2m.csv). It's 861 MB, and has 2M rows. I discovered the code above doesn't work. However, it's not because of the number of rows. The problem is the columns (field names). Turns out the data isn't as clean; there are 109 field names on row 1, and some rows have 111 columns of data. As a result, the auto-generated dtype doesn't have a matching field. While investigating this, I also discovered many rows only have the values for first 56 fields. In other words, fields 57-111 are not very useful. One solution to this is to add the usecols=() parameter. Code below reflects this modification, and works with this test file. (I have not tried testing with your large file airline.csv. Given it's size likely you will need to read and load incrementally.)
csv_name = 'airline_2m'
rec_arr = np.genfromtxt(csv_name+'.csv', delimiter=',',
dtype=None, names=True, encoding='bytes') #,
usecols=(i for i in range(56)) )
with h5py.File(csv_name+'.h5', 'w') as h5f:
h5f.create_dataset(csv_name,data=rec_arr)
I tried reproducing your example. I believe the problem you are facing is quite common when dealing with CSVs. The schema is not known.
Sometimes there are "mixed types" and pandas (used underneath vaex's read_csv or from_csv ) casts those columns as dtype object.
Vaex does not really support such mixed dtypes, and requires each column to be of a single uniform type (kind of a like a database).
So how to go around this? Well, the best way I can think of is to use the dtype argument to explicitly specify the types of all columns (or those that you suspect or know to have mixed types). I know this file has like 100+ columns and that's annoying.. but that is also kind of the price to pay when using a format such as CSV...
Another thing i noticed is the encoding.. using pure pandas.read_csv failed at some point because of encoding and requires one to add encoding="ISO-8859-1". This is also supported by vaex.open (since the args are just passed down to pandas).
In fact if you want to do manually what vaex.open does automatically for you (given that this CSV file might not be as clean as one would hope), do something like (this is pseudo code but I hope close to the real thing)
# Iterate over the file in chunks
for i, df_tmp in enumerate(pd.read_csv(file, chunksize=11_000_000, encoding="ISO-8859-1", dtype=dtype)):
# Assert or check or do whatever needs doing to ensure column types are as they should be
# Pass the data to vaex (this does not take extra RAM):
df_vaex = vaex.from_pandas(df_tmp)
# Export this chunk into HDF5
# df_vaex.export_hdf5(f'chunk_{i}.hdf5')
# When the above loop finishes, just concat and export the data to a single file if needed (gives some performance benefit).
df = vaex.open('chunk*.hdf5')
df.export_hdf5('converted.hdf5', progress='rich')
I've seen potentially much better/faster way of doing this with vaex, but it is not released yet (i saw it in the code repo on github), so I will not go into it, but if you can install from source, and want me to elaborate further feel free to drop a comment.
Hope this at least gives some ideas on how to move forward.
EDIT:
In last couple of versions of vaex core, vaex.open() opens all CSV files lazily, so then just export to hdf5/arrow directly, it will do it in one go. Check the docs for more details: https://vaex.io/docs/guides/io.html#Text-based-file-formats

Storing multiple GeoTiffs in HDF5 file in Python

I want to store multiple GeoTiff files in one HDF5 file to use it for further analysis since the function I am supposed to use can just deal with HDF5 (so basically like a raster stack in R but stored in a HDF5). I have to use Python. I am relatively new to HDF5 format (and geoanalysis in Python generally) and don't really know how to approach this issue. Especially keeping the geolocation/projection inforation seems tricky to me. So far I tried:
import h5py
import rasterio
r1 = rasterio.open("filename.tif")
r2 = rasterio.open("filename2.tif")
with h5py.File('path/test.h5', 'w') as hdf:
hdf.create_dataset('GeoTiff1', data=r1)
hdf.create_dataset('GeoTiff2', data=r2)
Yielding the following errror:
TypeError: Object dtype dtype('O') has no native HDF5 equivalent
I am pretty sure this not at all the correct approach and I'm happy about any suggestions.
What you can try is to do this:
import numpy as np
spec_dtype = h5py.special_dtype(vlen=np.dtype('float64'))
Just make a spec_dtype variable with float64 type then apply this to create_dataset:
with h5py.File('path/test.h5', 'w') as hdf:
hdf.create_dataset('GeoTiff1', data=r1,, dtype=spec_dtype)
hdf.create_dataset('GeoTiff2', data=r2,, dtype=spec_dtype)
Apply these and hopefully it will work.
Using HDFql in Python, your use-case could be solved as follows:
import HDFql
HDFql.execute("SHOW FILE SIZE filename.tif, filename2.tif")
HDFql.cursor_next()
HDFql.execute("CREATE DATASET path/test.h5 GeoTiff1 AS OPAQUE(%d) VALUES FROM BINARY FILE filename.tif" % HDFql.cursor_get_bigint())
HDFql.cursor_next()
HDFql.execute("CREATE DATASET path/test.h5 GeoTiff2 AS OPAQUE(%d) VALUES FROM BINARY FILE filename2.tif" % HDFql.cursor_get_bigint())

Import 64bit floating point value from binary file into python

Background:
Good day, I need to extract information from a binary file produced by an equipment. The equipment comes with a matlab function to import binary file. From what i understand from the manual, the binary file contains a 64-bit floating point value between 0 and 1.
function phase = importData(folder, qUnit);
fileName = sprintf('%s\\%s.PH', folder, qUnit);
file = fopen(fileName, 'rb');
fseek(file, 0, 'bof');
phase = fread(file, inf, 'float64');
Question: The matlab function works fine, but I wish to import the data into python. May I know how can this be done on Python? I did some research myself and tried something like this at the bottom. But when I do print(fileContent) to check the imported data, python simply stops responding in windows.
with open('sample.PH', mode='rb') as binary_file:
fileContent = binary_file.read()
You can't read binary files like that into a useful form. You should use numpy.fromfile. This will give you a numerical vector with the data, similar to what you would get in MATLAB. You don't even need to open it (although you can if you want). Just give it a filename and it will automatically open, read, then close the file for you.
import numpy as np
file_content = np.fromfile('sample.PH', np.float64)
Edit: here is how you can do multiple repeated values:
import numpy as np
file_content = np.fromfile('sample.PH', [('data', np.float32),
('time', np.float64)])
(I put the last line on two lines for clarity, it could be one line if you prefer, or 10 for that matter, python doesn't care)
This will give you the equivalent of a MATLAB struct where one field is the data and the other field is the time. You can then access the data using file_content['data'] and the time using file_content['time']. There is more information at the link I provided above.

get specific content from file python

I have a file test.txt which has an array:
array = [3,5,6,7,9,6,4,3,2,1,3,4,5,6,7,8,5,3,3,44,5,6,6,7]
Now what I want to do is get the content of array and perform some calculations with the array. But the problem is when I do open("test.txt") it outputs the content as the string. Actually the array is very big, and if I do a loop it might not be efficient. Is there any way to get the content without splitting , ? Any new ideas?
I recommend that you save the file as json instead, and read it in with the json module. Either that, or make it a .py file, and import it as python. A .txt file that looks like a python assignment is kind of odd.
Does your text file need to look like python syntax? A list of comma separated values would be the usual way to provide data:
1,2,3,4,5
Then you could read/write with the csv module or the numpy functions mentioned above. There's a lot of documentation about how to read csv data in efficiently. Once you had your csv reader data object set up, data could be stored with something like:
data = [ map( float, row) for row in csvreader]
If you want to store a python-like expression in a file, store only the expression (i.e. without array =) and parse it using ast.literal_eval().
However, consider using a different format such as JSON. Depending on the calculations you might also want to consider using a format where you do not need to load all data into memory at once.
Must the array be saved as a string? Could you use a pickle file and save it as a Python list?
If not, could you try lazy evaluation? Maybe only process sections of the array as needed.
Possibly, if there are calculations on the entire array that you must always do, it might be a good idea to pre-compute those results and store them in the txt file either in addition to the list or instead of the list.
You could also use numpy to load the data from the file using numpy.genfromtxt or numpy.loadtxt. Both are pretty fast and both have the ability to do the recasting on load. If the array is already loaded though, you can use numpy to convert it to an array of floats, and that is really fast.
import numpy as np
a = np.array(["1", "2", "3", "4"])
a = a.astype(np.float)
You could write a parser. They are very straightforward. And much much faster than regular expressions, please don't do that. Not that anyone suggested it.
# open up the file (r = read-only, b = binary)
stream = open("file_full_of_numbers.txt", "rb")
prefix = '' # end of the last chunk
full_number_list = []
# get a chunk of the file at a time
while True:
# just a small 1k chunk
buffer = stream.read(1024)
# no more data is left in the file
if '' == buffer:
break
# delemit this chunk of data by a comma
split_result = buffer.split(",")
# append the end of the last chunk to the first number
split_result[0] = prefix + split_result[0]
# save the end of the buffer (a partial number perhaps) for the next loop
prefix = split_result[-1]
# only work with full results, so skip the last one
numbers = split_result[0:-1]
# do something with the numbers we got (like save it into a full list)
full_number_list += numbers
# now full_number_list contains all the numbers in text format
You'll also have to add some logic to use the prefix when the buffer is blank. But I'll leave that code up to you.
OK, so the following methods ARE dangerous. Since they are used to attack systems by injecting code into them, used them at your own risk.
array = eval(open("test.txt", 'r').read().strip('array = '))
execfile('test.txt') # this is the fastest but most dangerous.
Safer methods.
import ast
array = ast.literal_eval(open("test.txt", 'r').read().strip('array = ')).
...
array = [float(value) for value in open('test.txt', 'r').read().strip('array = [').strip('\n]').split(',')]
The eassiest way to serialize python objects so you can load them later is to use pickle. Assuming you dont want a human readable format since this adds major head, either-wise, csv is fast and json is flexible.
import pickle
import random
array = random.sample(range(10**3), 20)
pickle.dump(array, open('test.obj', 'wb'))
loaded_array = pickle.load(open('test.obj', 'rb'))
assert array == loaded_array
pickle does have some overhead and if you need to serialize large objects you can specify the compression ratio, the default is 0 no compression, you can set it to pickle.HIGHEST_PROTOCOL pickle.dump(array, open('test.obj', 'wb'), pickle.HIGHEST_PROTOCOL)
If you are working with large numerical or scientific data sets then use numpy.tofile/numpy.fromfile or scipy.io.savemat/scipy.io.loadmat they have little overhead, but again only if you are already using numpy/scipy.
good luck.

Categories

Resources