Reading binary data in python - python

Firstly, before this question gets marked as a duplicate: I'm aware others have asked similar questions, but there doesn't seem to be a clear explanation. I'm trying to read a binary file into a 2D array (documented well here http://nsidc.org/data/docs/daac/nsidc0051_gsfc_seaice.gd.html).
The header is a 300 byte array.
So far, I have:
import struct
with open("nt_197912_n07_v1.1_n.bin", mode='rb') as file:
    filecontent = file.read()
x = struct.unpack("iiii", filecontent[:300])
This throws an error about the length of the string argument.

Reading the Data (Short Answer)
After you have determined the size of the grid (n_rows x n_cols = 448 x 304) from your header (see below), you can simply read the data using numpy.frombuffer.
import numpy as np
#...
#Get data from Numpy buffer
dt = np.dtype(('>u1', (n_rows, n_cols)))
x = np.frombuffer(filecontent[300:], dt) #we know the data starts from idx 300 onwards
#Remove unnecessary dimension that numpy gave us
x = x[0,:,:]
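The '>u1' specifies the format of the data, in this case unsigned integers of size 1 byte. The '>' marks big-endian byte order, which makes no difference for single-byte values but is kept for consistency with the file documentation.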
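As a self-contained sketch of the same idea, the snippet below builds a synthetic stand-in for the file (an all-zero 300-byte header followed by 448 x 304 one-byte values) and reads the grid in one step. The synthetic `filecontent` and the `header_size` name are assumptions for illustration only; with the real file, `filecontent` would come from `file.read()`.

```python
import numpy as np

n_rows, n_cols, header_size = 448, 304, 300

# Synthetic stand-in for the real file: zeroed header + repeating 0..255 pattern.
payload = (np.arange(n_rows * n_cols) % 256).astype(np.uint8).tobytes()
filecontent = bytes(header_size) + payload

# offset skips the header; count limits the read to exactly one grid.
x = np.frombuffer(filecontent, dtype=np.uint8,
                  count=n_rows * n_cols, offset=header_size)
x = x.reshape(n_rows, n_cols)

print(x.shape, x.dtype)
```

Using `offset` and `reshape` avoids both the slice copy of `filecontent[300:]` and the extra leading dimension that the sub-array dtype produces.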
Plotting this with matplotlib.pyplot
import matplotlib.pyplot as plt
#...
plt.imshow(x, extent=[0,3,-3,3], aspect="auto")
plt.show()
The extent= option simply specifies the axis values, you can change these to lat/lon for example (parsed from your header)
Explanation of Error from .unpack()
From the docs for struct.unpack(fmt, string):
The string must contain exactly the amount of data required by the format (len(string) must equal calcsize(fmt))
You can determine the size specified in the format string (fmt) by looking at the Format Characters section.
Your fmt in struct.unpack("iiii",filecontent[:300]) specifies 4 int types (you can also write 4i instead of iiii for simplicity), each of which has size 4, requiring a string of length 16.
Your string (filecontent[:300]) is of length 300, whilst your fmt is asking for a string of length 16, hence the error.
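The size mismatch can be checked directly with struct.calcsize. This is a minimal sketch: the `header` variable is a zero-filled stand-in for `filecontent[:300]`, and the `<` prefix is added here (it is not in the question) so the sizes are the standard ones regardless of platform.

```python
import struct

fmt = "<4i"  # four 4-byte ints; "<" selects standard sizes, little-endian
required = struct.calcsize(fmt)
print(required)  # number of bytes the format expects

header = bytes(300)  # stand-in for filecontent[:300]

try:
    struct.unpack(fmt, header)  # 300 bytes offered, `required` bytes expected
except struct.error as exc:
    print("unpack failed:", exc)

# Slicing to exactly calcsize(fmt) bytes makes the call succeed.
values = struct.unpack(fmt, header[:required])
print(values)
```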
Example Usage of .unpack()
As an example, reading your supplied document I extracted the first 21*6 bytes, which has format:
a 21-element array of 6-byte character strings that contain information such as polar stereographic grid characteristics
With:
x = struct.unpack("6s"*21, filecontent[:126])
This returns a tuple of 21 elements. Note the whitespace padding in some elements to meet the 6-byte requirement.
>> print x
# ('00255\x00', ' 304\x00', ' 448\x00', '1.799\x00', '39.43\x00', '45.00\x00', '558.4\x00', '154.0\x00', '234.0\x00', '
# SMMR\x00', '07 cn\x00', ' 336\x00', ' 0000\x00', ' 0034\x00', ' 364\x00', ' 0000\x00', ' 0046\x00', ' 1979\x00', ' 33
# 6\x00', ' 000\x00', '00250\x00')
Notes:
The first argument fmt, "6s"*21 is a string with 6s repeated 21
times. Each format-character 6s represents one string of 6-bytes
(see below), this will match the required format specified in your
document.
The number 126 in filecontent[:126] is calculated as 6*21 = 126.
Note that for the s (string) specifier, the preceding number does
not mean to repeat the format character 6 times (as it would
normally for other format characters). Instead, it specifies the size
of the string. s represents a 1-byte string, whilst 6s represents
a 6-byte string.
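The difference between a count and a string size can be seen in a small sketch. The `raw` bytes below are a made-up 6-byte field in the style of the header output above, not data read from the actual file.

```python
import struct

raw = b"  304\x00"  # hypothetical 6-byte header field

# "6s": ONE item, a 6-byte string.
(as_string,) = struct.unpack("6s", raw)

# "6B": SIX items, one unsigned byte each (the usual repeat-count meaning).
as_bytes = struct.unpack("6B", raw)

print(as_string)      # one bytes object of length 6
print(len(as_bytes))  # six separate integers

# Turning the string field into a number: drop the NUL, decode, convert.
n_cols = int(as_string.rstrip(b"\x00").decode("ascii"))
print(n_cols)
```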
More Extensive Solution for Header Reading (Long)
Because the layout of the binary data must be specified manually, doing this in source code may be tedious. You can consider using a configuration file (such as an .ini file) instead.
This function will read the header and store it in a dictionary, where the structure is given by an .ini file:
# use configparser for Python 3.x
import struct
import ConfigParser

def read_header(data, config_file):
    """
    Read binary data specified by an INI file which specifies the structure
    """
    with open(config_file) as fd:
        # Init the config class
        conf = ConfigParser.ConfigParser()
        conf.readfp(fd)
    # preallocate dictionary to store data
    header = {}
    # Iterate over the key-value pairs under the
    # 'structure' section
    for key in conf.options('structure'):
        # determine the string properties
        start_idx, end_idx = [int(x) for x in conf.get('structure', key).split(',')]
        start_idx -= 1  # remember python is zero indexed!
        strLength = end_idx - start_idx
        # Get the data
        header[key] = struct.unpack("%is" % strLength, data[start_idx:end_idx])
        # Format the data
        header[key] = [x.strip() for x in header[key]]
        header[key] = [x.replace('\x00', '') for x in header[key]]
    # Unmap from list-type
    # use .items() for Python 3.x
    header = {k: v[0] for k, v in header.iteritems()}
    return header
An example .ini file is shown below. The key is the name to use when storing the data, and the value is a comma-separated pair of numbers, the first being the starting index and the second being the ending index. These values were taken from Table 1 in your document.
[structure]
missing_data: 1, 6
n_cols: 7, 12
n_rows: 13, 18
latitude_enclosed: 25, 30
This function can be used as follows:
header = read_header(filecontent, 'headerStructure.ini')
n_cols = int(header['n_cols'])
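For Python 3, where struct.unpack returns bytes rather than str, the same idea might look like the sketch below. To keep it self-contained it uses a plain dictionary instead of an .ini file; the `FIELDS` layout, the `parse_header` name and the `sample` header bytes are all assumptions made up for illustration.

```python
import struct

# Hypothetical layout: name -> (first_byte, last_byte), 1-based and inclusive,
# mirroring the start/end pairs of the .ini file.
FIELDS = {"missing_data": (1, 6), "n_cols": (7, 12), "n_rows": (13, 18)}

def parse_header(data, fields):
    """Extract fixed-width text fields from a bytes header."""
    header = {}
    for name, (first, last) in fields.items():
        raw = data[first - 1:last]  # convert to a zero-based slice
        (value,) = struct.unpack("%ds" % len(raw), raw)
        # Remove NUL padding, decode to str, trim surrounding spaces.
        header[name] = value.replace(b"\x00", b"").decode("ascii").strip()
    return header

# Made-up 300-byte header containing three 6-byte fields.
sample = b"00255\x00  304\x00  448\x00".ljust(300, b"\x00")
header = parse_header(sample, FIELDS)
print(header)
```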

Related

Reading a line with scientific numbers (like 0.4E-03)

I would like to process the following line (output of a Fortran program) from a file, with Python:
74 0.4131493371345440E-03 -0.4592776407685850E-03 -0.1725046324754540
and obtain an array such as:
[74,0.4131493371345440e-3,-0.4592776407685850E-03,-0.1725046324754540]
My previous attempts do not work. In particular, if I do the following :
with open(filename,"r") as myfile:
    line=np.array(re.findall(r"[-+]?\d*\.*\d+",myfile.readline())).astype(float)
I have the following error :
ValueError: could not convert string to float: 'E-03'
Steps:
Get list of strings (str.split(' '))
Get rid of "\n" (del arr[-1])
Turn list of strings into numbers (Converting a string (with scientific notation) to an int in Python)
Code:
import decimal # you may also leave this out and use `float` instead of `decimal.Decimal()`
arr = "74 0.4131493371345440E-03 -0.4592776407685850E-03 -0.1725046324754540 \n"
arr = arr.split(' ')
del arr[-1]
arr = [decimal.Decimal(x) for x in arr]
# do your np stuff
Result:
>>> print(arr)
[Decimal('74'), Decimal('0.0004131493371345440'), Decimal('-0.0004592776407685850'), Decimal('-0.1725046324754540')]
PS:
I don't know if you wrote the file that gives the output in the first place, but if you did, you could just think about outputting an array of float() / decimal.Decimal() from that file instead.
@ant.kr Here is a possible solution:
import numpy as np
# Initial data
a = "74 0.4131493371345440E-03 -0.4592776407685850E-03 -0.1725046324754540 \n"
# Given the structure of the initial data, we can proceed as follows:
# - split the initial string at each space; this produces a list whose
# last element is "\n"
# - drop that last element, convert each remaining element to a float,
# and store the results in a numpy array.
line = np.array([float(i) for i in a.split(" ")[:-1]])
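A slightly more forgiving variant, sketched below, calls split() with no argument. That splits on any run of whitespace and discards the trailing newline, so no element has to be removed by hand. float() already understands Fortran-style exponents such as E-03, which is why no regular expression is needed; the regex in the question fails because its pattern has no exponent part.

```python
line = "74 0.4131493371345440E-03 -0.4592776407685850E-03 -0.1725046324754540 \n"

# split() without arguments handles repeated spaces, tabs and the newline.
values = [float(token) for token in line.split()]

print(values)
```

If a numpy array is wanted, `np.array(values)` converts the list directly.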

R readBin vs. Python struct

I am attempting to read a binary file using Python. Someone else has read in the data with R using the following code:
x <- readBin(webpage, numeric(), n=6e8, size = 4, endian = "little")
myPoints <- data.frame("tmax" = x[1:(length(x)/4)],
"nmax" = x[(length(x)/4 + 1):(2*(length(x)/4))],
"tmin" = x[(2*length(x)/4 + 1):(3*(length(x)/4))],
"nmin" = x[(3*length(x)/4 + 1):(length(x))])
With Python, I am trying the following code:
import struct
with open('file','rb') as f:
    val = f.read(16)
    while val != '':
        print(struct.unpack('4f', val))
        val = f.read(16)
I am coming to slightly different results. For example, the first row in R returns 4 columns as -999.9, 0, -999.0, 0. Python returns -999.0 for all four columns (images below).
Python output:
R output:
I know that they are slicing by the length of the file with some of the [] code, but I do not know how exactly to do this in Python, nor do I understand quite why they do this. Basically, I want to recreate what R is doing in Python.
I can provide more of either code base if needed. I did not want to overwhelm with code that was not necessary.
Deducing from the R code, the binary file first contains a certain number of tmax values, then the same number of nmax values, then the tmin values and finally the nmin values. The code reads the entire file and then chops it up into the 4 parts (tmax, nmax, etc.) using slicing.
To do the same in python:
import struct
# Read the entire file into memory first, so we can count the number
# of bytes before parsing them. This is not very memory efficient,
# but it is the easiest way. The R code as posted uses even more
# memory: it always takes 6e8 * 4 bytes (~2.2 GiB), no matter how
# small the file may be.
#
with open('data.bin', 'rb') as f:
    data = f.read()
# Calculate the number of points in the file. This is
# file-size / 16, because there are 4 numeric()'s per
# point, and they are 4 bytes each.
#
num = len(data) // 16
# Now that we know how many there are, we take all tmax numbers first,
# then all nmax's, tmin's and lastly all nmin's.
# First generate a format string, because it depends on the number of
# points in the file. It will look like: "<fffff"
# ('<' means little endian, matching endian = "little" in the R code).
#
format_string = '<' + 'f' * num
# Then, for cleaner code, calculate the chunk size of the bytes we need
# to slice off each time.
#
n = num * 4 # 4-byte floats
# Note that python interprets slicing indices differently
# than R, so no "+1" is needed here as it is in the R code.
#
tmax = struct.unpack(format_string, data[:n])
nmax = struct.unpack(format_string, data[n:2*n])
tmin = struct.unpack(format_string, data[2*n:3*n])
nmin = struct.unpack(format_string, data[3*n:4*n])
print("tmax", tmax)
print("nmax", nmax)
print("tmin", tmin)
print("nmin", nmin)
If the goal is to have this data structured as a list of points(?) like (tmax,nmax,tmin,nmin), then append this to the code:
print()
print("Points:")
# Combine ("zip") all 4 lists into a list of (tmax,nmax,tmin,nmin) points.
# Python has a function to do this at once: zip()
# enumerate() supplies the running index.
#
for i, point in enumerate(zip(tmax, nmax, tmin, nmin)):
    print(i, ":", point)
Here's a less memory-hungry way to do the same. It is possibly a bit faster too (but that is difficult for me to check).
My computer did not have sufficient memory to run the first program with those huge files. This one does run, but I still needed to create a list of only the tmax values first (the first 1/4 of the file), then print it, and then delete the list in order to have enough memory for the nmax, tmin and nmin values.
This one, too, says the nmin values inside the 2018 file are all -999.0. If that doesn't make sense, could you check what the R code makes of it? I suspect that it is just what's in the file. The other possibility is, of course, that I got it all wrong (which I doubt). However, I tried the 2017 file too, and that one does not have this problem: in each of tmax, nmax, tmin and nmin, around 37% of the values are -999.0.
Anyway, here's the second code:
import os
import struct
# load_data()
# data_store : object to append() data items (floats) to
# num : number of floats to read and store
# datafile : opened binary file object to read float data from
#
def load_data(data_store, num, datafile):
    for i in range(num):
        data = datafile.read(4) # process one float (=4 bytes) at a time
        item = struct.unpack("<f", data)[0] # '<' means little endian
        data_store.append(item)
# save_list() saves a list of float's as strings to a file
#
def save_list(filename, datalist):
    output = open(filename, "wt")
    for item in datalist:
        output.write(str(item) + '\n')
    output.close()
#### MAIN ####
datafile = open('data.bin','rb')
# Get file size so we can calculate number of points without reading
# the (large) file entirely into memory.
#
file_info = os.stat(datafile.fileno())
# Calculate number of points, i.e. number of each tmax's, nmax's,
# tmin's, nmin's. A point is 4 floats of 4 bytes each, hence number
# of points = file-size / (4*4)
#
num = int(file_info.st_size / 16)
tmax_list = list()
load_data(tmax_list, num, datafile)
save_list("tmax.txt", tmax_list)
del tmax_list # huge list, save memory
nmax_list = list()
load_data(nmax_list, num, datafile)
save_list("nmax.txt", nmax_list)
del nmax_list # huge list, save memory
tmin_list = list()
load_data(tmin_list, num, datafile)
save_list("tmin.txt", tmin_list)
del tmin_list # huge list, save memory
nmin_list = list()
load_data(nmin_list, num, datafile)
save_list("nmin.txt", nmin_list)
del nmin_list # huge list, save memory
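If numpy is available, the same four-way split can probably be done more compactly. The sketch below is not part of the answer above: it writes a tiny synthetic file (three points instead of millions) to a temporary directory so it can run on its own, then reads it back as little-endian 4-byte floats and splits it into four equal blocks. The file name and the sample values are made up.

```python
import os
import tempfile

import numpy as np

# Synthetic data laid out like the real file: all tmax, then nmax, tmin, nmin.
expected = np.array([[1.5, 2.5, 3.5],           # tmax
                     [10.0, 20.0, 30.0],        # nmax
                     [-999.0, -1.0, -2.0],      # tmin
                     [0.0, 1.0, 2.0]],          # nmin
                    dtype="<f4")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    expected.tofile(path)  # row-major: block after block

    # "<f4" = little-endian 4-byte float, like size = 4, endian = "little" in R.
    values = np.fromfile(path, dtype="<f4")

# Four equal blocks; -1 lets numpy work out the number of points.
tmax, nmax, tmin, nmin = values.reshape(4, -1)
print(tmax, nmax, tmin, nmin)
```

A float32 array needs 4 bytes per value, far less than a Python list of floats, which may be enough to make the per-block deletion unnecessary.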

Writing and reading a row array (nx1) to a binary file in Python with struct pack

I'm having a lot of trouble writing to and reading from a binary file when working with an nx1 vector that has been written to a binary file using struct.pack. The file structure looks like this (given an argument data that is of type numpy.array):
test.file
--------
[format_code = 3] : 4 bytes (the code 3 means a vector) - fid.write(struct.pack('i',3))
[rows] : 4 bytes (fid.write(struct.pack('i',sz[0])) where sz = data.shape
[cols] : 4 bytes (fid.write(struct.pack('i',sz[1]))
[data] : type double = 8 bytes * (rows * cols)
Unfortunately, since these files are mostly written in MATLAB, where I have a working class that reads and writes these fields, I can't only write the amount of rows (I need columns as well even if a column does only = 1).
I've tried a few ways to pack data, none of which have worked when trying to unpack it (assume I've opened my file denoted by fid in 'rb'/'wb' and have done some error checking):
# write data
sz = data.shape
datalen=8*sz[0]*sz[1]
fid.write(struct.pack('i',3)) # format code
fid.write(struct.pack('i',sz[0])) # rows
fid.write(struct.pack('i',sz[1])) # columns
### write attempt ###
for i in xrange(sz[0]):
    for j in xrange(sz[1]):
        fid.write(struct.pack('d',float(data[i][j]))) # write in 'c' convention, so we transpose
### read attempt ###
format_code = struct.unpack('i',fid.read(struct.calcsize('i')))[0]
rows = struct.unpack('i',fid.read(struct.calcsize('i')))[0]
cols = struct.unpack('i',fid.read(struct.calcsize('i')))[0]
out_datalen = 8 * rows * cols # size of structure
output_data=numpy.array(struct.unpack('%dd' % out_datalen,fid.read(datalen)),dtype=float)
So far, when reading, my output has just seemingly been multiplied by random things. I don't know what's happening.
I found another similar question, and so I wrote my data as such:
fid.write(struct.pack('%sd' % len(data), *data))
However, when reading it back using:
numpy.array(struct.unpack('%sd' % out_datalen,fid.read(datalen)),dtype=float)
I get nothing in my array.
Similarly, just doing:
fid.write(struct.pack('%dd' % datalen, *data))
and reading it back with:
numpy.array(struct.unpack('%dd' % out_datalen,fid.read(datalen)),dtype=float)
also gives me an empty array. How can I fix this?
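One likely cause, offered as a hedged guess rather than a confirmed diagnosis: the number in front of d in a struct format is a count of doubles, not a count of bytes. A format built from 8 * rows * cols therefore asks for eight times more data than was written. The round trip below is a minimal sketch under that assumption. The function names are hypothetical, an in-memory io.BytesIO stands in for the file, and the explicit `<` (little-endian, standard sizes) is an assumption that would need to match whatever the MATLAB side expects.

```python
import io
import struct

import numpy as np

def write_matrix(fid, data):
    """Write format code 3, rows, cols, then the values in row-major order."""
    rows, cols = data.shape
    fid.write(struct.pack("<3i", 3, rows, cols))
    # Count = number of doubles (rows * cols), not number of bytes.
    fid.write(struct.pack("<%dd" % (rows * cols), *data.ravel()))

def read_matrix(fid):
    """Read back what write_matrix wrote."""
    format_code, rows, cols = struct.unpack("<3i", fid.read(12))
    count = rows * cols
    values = struct.unpack("<%dd" % count, fid.read(8 * count))  # 8 bytes each
    return format_code, np.array(values, dtype=float).reshape(rows, cols)

data = np.array([[1.0], [2.5], [-3.0]])  # an nx1 vector, n = 3

buf = io.BytesIO()  # stands in for the real file
write_matrix(buf, data)
buf.seek(0)
code, out = read_matrix(buf)
print(code, out.ravel())
```

For an nx1 vector, row-major and column-major order coincide; for a general matrix shared with MATLAB the element order would need separate care.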

Reading a binary file using np.fromfile()

I have a binary file that has numerous sections. Each section has its own pattern (i.e. the placement of integers, floats, and strings).
The pattern of each section is known. However, the number of times that pattern occurs within the section is unknown. Each record is in between two same integers. These integers indicate the size of the record. The section name is in between two integer record length variables: 8 and 8. Also within each section, there are multiple records (which are known).
Header
---------------------
Known header pattern
---------------------
8 Section One 8
---------------------
Section One pattern repeating i times
---------------------
8 Section Two 8
---------------------
Section Two pattern repeating j times
---------------------
8 Section Three 8
---------------------
Section Three pattern repeating k times
---------------------
Here was my approach:
Loop through and read each record using f.read(record_length). If the record is 8 bytes long, convert it to a string; this will be the section name.
Then I call: np.fromfile(file,dtype=section_pattern,count=n)
I am calling np.fromfile for each section.
The issue I am having is twofold:
How do I determine n for each section without doing a first pass read?
Reading each record to find a section name seems rather inefficient. Is there a more efficient way to accomplish this?
The section names are always between two integer record variables: 8 and 8.
Here is some sample code. Note that in this case I do not have to specify count, since the OES section is the last section:
with open('m13.op2', "rb") as f:
    filesize = os.fstat(f.fileno()).st_size
    f.seek(108,1) # skip header
    while True:
        rec_len_1 = unpack_int(f.read(4))
        record_bytes = f.read(rec_len_1)
        rec_len_2 = unpack_int(f.read(4))
        record_num = record_num + 1
        if rec_len_1==8:
            tablename = unpack_string(record_bytes).strip()
            if tablename == 'OES':
                OES = [
                    # Top keys
                    ('1','i4',1),('op2key7','i4',1),('2','i4',1),
                    ('3','i4',1),('op2key8','i4',1),('4','i4',1),
                    ('5','i4',1),('op2key9','i4',1),('6','i4',1),
                    # Record 2 -- IDENT
                    ('7','i4',1),('IDENT','i4',1),('8','i4',1),
                    ('9','i4',1),
                    ('acode','i4',1),
                    ('tcode','i4',1),
                    ('element_type','i4',1),
                    ('subcase','i4',1),
                    ('LSDVMN','i4',1), # Load set number
                    ('UNDEF(2)','i4',2), # Undefined
                    ('LOADSET','i4',1), # Load set number or zero or random code identification number
                    ('FCODE','i4',1), # Format code
                    ('NUMWDE(C)','i4',1), # Number of words per entry in DATA record
                    ('SCODE(C)','i4',1), # Stress/strain code
                    ('UNDEF(11)','i4',11), # Undefined
                    ('THERMAL(C)','i4',1), # =1 for heat transfer and 0 otherwise
                    ('UNDEF(27)','i4',27), # Undefined
                    ('TITLE(32)','S1',32*4), # Title
                    ('SUBTITL(32)','S1',32*4), # Subtitle
                    ('LABEL(32)','S1',32*4), # Label
                    ('10','i4',1),
                    # Record 3 -- Data
                    ('11','i4',1),('KEY1','i4',1),('12','i4',1),
                    ('13','i4',1),('KEY2','i4',1),('14','i4',1),
                    ('15','i4',1),('KEY3','i4',1),('16','i4',1),
                    ('17','i4',1),('KEY4','i4',1),('18','i4',1),
                    ('19','i4',1),
                    ('EKEY','i4',1), #Element key = 10*EID+Device Code. EID = (Element key)//10
                    ('FD1','f4',1),
                    ('EX1','f4',1),
                    ('EY1','f4',1),
                    ('EXY1','f4',1),
                    ('EA1','f4',1),
                    ('EMJRP1','f4',1),
                    ('EMNRP1','f4',1),
                    ('EMAX1','f4',1),
                    ('FD2','f4',1),
                    ('EX2','f4',1),
                    ('EY2','f4',1),
                    ('EXY2','f4',1),
                    ('EA2','f4',1),
                    ('EMJRP2','f4',1),
                    ('EMNRP2','f4',1),
                    ('EMAX2','f4',1),
                    ('20','i4',1)]
                nparr = np.fromfile(f,dtype=OES)
        if f.tell() == filesize:
            break
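The record framing described above (a length integer, the payload, then the same length integer again) can be sketched as a small generator. This only illustrates the framing and does not answer how to find n without a first pass. The `iter_records` and `frame` helpers, the sample section names and the assumption of little-endian 4-byte markers are all hypothetical, and an in-memory stream stands in for the .op2 file.

```python
import io
import struct

def iter_records(stream):
    """Yield the payload of each length-framed record until the stream ends."""
    while True:
        head = stream.read(4)
        if len(head) < 4:  # end of stream
            return
        (length,) = struct.unpack("<i", head)
        payload = stream.read(length)
        (tail,) = struct.unpack("<i", stream.read(4))
        if tail != length:
            raise ValueError("record markers disagree: %d != %d" % (length, tail))
        yield payload

def frame(payload):
    """Wrap a payload in matching leading and trailing length markers."""
    marker = struct.pack("<i", len(payload))
    return marker + payload + marker

# Made-up stream: a section name, one 12-byte data record, another section name.
stream = io.BytesIO(
    frame(b"OES     ") + frame(struct.pack("<3i", 1, 2, 3)) + frame(b"OUG     ")
)

records = list(iter_records(stream))
# In this sample only the section names are 8 bytes long.
names = [r.decode("ascii").strip() for r in records if len(r) == 8]
print(names)
```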

Python format print with a list

What is the most Pythonic way to produce my output? Let me illustrate the behavior I'm trying to achieve.
For a project of mine I'm building a function that takes different parameters and prints the output in columns.
Example of the list it receives:
[('Field', 'Integer', 'Hex'),
 ('Machine;', 332, '0x14c'),
 ('NumberOfSections;', 9, '0x9'),
 ('Time Date Stamp;', 4, '0x4'),
 ('PointerToSymbolTable;', 126976, '0x1f000')]
** The number of items per tuple can differ (only 3 items per tuple now, but it can be 4 for another list, or any number) **
The output should be something like this
Field                 Integer Hex
-------------------------------------------------------------------------------
Machine;              332     0x14c
NumberOfSections;     9       0x9
Time Date Stamp;      4       0x4
PointerToSymbolTable; 126976  0x1f000
For working purposes I created a list which only contains the header fields:
This isn't necessary but it made it a little bit easier trying stuff out
Header field is ['Field', 'Integer', 'Hex']
The first tuple in the list declares the so called "Header fields" as shown in the list example.
For this case there are only 3 items, but this can differ from time to time. So I tried to calculate the size of items with:
length_container_header = len(container[0])
This variable can be used to correctly build up the output.
Building the header "print" I would build something like this.
print("{:21} {:7} {:7}".format(header_field[0], header_field[1], header_field[2]))
Now this is a manual version of how it should be. As you noticed, the header field "Field" is shorter than
PointerToSymbolTable in the list. I wrote this function to determine the longest item for each position in the list:
container_lenght_list = []
local_l = 0
for field in range(0, lenght_container_body):
    for item in container[1:]:
        if len(str(item[field])) > local_l:
            local_l = len(str(item[field]))
        else:
            continue
    container_lenght_list.append(str(local_l))
    local_l = 0
This produces a list along the lines of [21, 7, 7] in this case.
Creating the format string can be done pretty simply:
formatstring = ""
for line in lst:
    formatstring+= "{:" + str(line) +"}"
Which produces string:
{:21}{:7}{:7}
This is the part where I run into trouble: how can I produce the last part of the format string?
I tried a nested for loop in the format() function, but I ended up with all sorts of errors. I think it can be done with a
for loop, I just can't figure out how. If someone could push me in the right direction for the header print I would be very grateful. Once I have figured out how to print the header I can pretty much figure out the rest. I hope I explained it well enough.
With Kind Regards,
You can use * to unpack argument list:
container = [
    ('Field', 'Integer', 'Hex'),
    ('Machine;', 332, '0x14c'),
    ('NumberOfSections;', 9, '0x9'),
    ('Time Date Stamp;', 4, '0x4'),
    ('PointerToSymbolTable;', 126976, '0x1f000')
]
lengths = [
    max(len(str(row[i])) for row in container) for i in range(len(container[0]))
] # => [21, 7, 7]
# OR lengths = [max(map(len, map(str, x))) for x in zip(*container)]
fmt = ' '.join('{:<%d}' % l for l in lengths)
# => '{:<21} {:<7} {:<7}' # < for left-align
print(fmt.format(*container[0])) # header
print('-' * (sum(lengths) + len(lengths) - 1)) # separator
for row in container[1:]:
    print(fmt.format(*row)) # <------- unpacking argument list
    # similar to print(fmt.format(row[0], row[1], row[2]))
output:
Field                 Integer Hex
-------------------------------------
Machine;              332     0x14c
NumberOfSections;     9       0x9
Time Date Stamp;      4       0x4
PointerToSymbolTable; 126976  0x1f000
Formatting data in tabular form requires four important steps:
Determine the field layout, i.e. whether the data is represented row-wise or column-wise. Depending on the decision, you might need to transpose the data using zip.
Determine the field sizes. Unless you want to hard-code the field sizes (not recommended), you should determine the maximum field size from the data, allowing customized padding between fields. Generally this requires reading the data and determining the maximum length of each field: [len(max(map(str, field), key = len)) + pad for field in zip(*data)]
Extract the header row. This is easy, as it only requires indexing the 0th row, i.e. data[0].
Format the data. This requires some understanding of Python format strings.
Implementation
class FormatTable(object):
    def __init__(self, data, pad = 2):
        self.data = data
        self.pad = pad
        self.header = data[0]
        # widest entry in each column, plus padding
        self.field_size = [len(max(map(str, field), key = len)) + pad
                           for field in zip(*data)]
        self.format = ''.join('{{:<{}}}'.format(s) for s in self.field_size)
    def __iter__(self):
        yield ''.join(self.format.format(*self.header))
        yield '-'*(sum(self.field_size) + self.pad * len(self.header))
        # body rows follow the header
        for row in self.data[1:]:
            yield ''.join(self.format.format(*row))
Demo
for row in FormatTable(data):
    print(row)
Field                  Integer  Hex
-----------------------------------------------
Machine;               332      0x14c
NumberOfSections;      9        0x9
Time Date Stamp;       4        0x4
PointerToSymbolTable;  126976   0x1f000
I don't know if it is "Pythonic", but you can use pandas to format your output.
import pandas as pd
data = [('Field', 'Integer', 'Hex'),
        ('Machine;', 332, '0x14c'),
        ('NumberOfSections;', 9, '0x9'),
        ('Time Date Stamp;', 4, '0x4'),
        ('PointerToSymbolTable;', 126976, '0x1f000')]
s = pd.DataFrame(data[1:], columns=data[0])
print(s.to_string(index=False))
Result:
Field Integer Hex
Machine; 332 0x14c
NumberOfSections; 9 0x9
Time Date Stamp; 4 0x4
PointerToSymbolTable; 126976 0x1f000
