Reading a binary file using np.fromfile() in Python
I have a binary file that contains numerous sections. Each section has its own pattern (i.e. the placement of integers, floats, and strings).
The pattern of each section is known; however, the number of times that pattern occurs within the section is unknown. Each record is framed by a pair of identical integers that give the size of the record in bytes. A section name is a record framed by the record-length values 8 and 8. Also, within each section there are multiple records, whose layouts are known.
Header
---------------------
Known header pattern
---------------------
8 Section One 8
---------------------
Section One pattern repeating i times
---------------------
8 Section Two 8
---------------------
Section Two pattern repeating j times
---------------------
8 Section Three 8
---------------------
Section Three pattern repeating k times
---------------------
Here was my approach:
Loop through the file and read each record with f.read(record_length); if the record is 8 bytes long, convert it to a string -- this is the section name.
Then I call:
np.fromfile(file, dtype=section_pattern, count=n)
I am calling np.fromfile once for each section.
The issue I am having is twofold:
How do I determine n for each section without doing a first-pass read?
Reading every record just to find a section name seems rather inefficient. Is there a more efficient way to accomplish this?
The section names are always framed by the two record-length integers 8 and 8.
Here is some sample code; note that in this case I do not have to specify count, since the OES section is the last section:
with open('m13.op2', "rb") as f:
    filesize = os.fstat(f.fileno()).st_size
    f.seek(108, 1)  # skip header
    record_num = 0  # initialize the record counter
    while True:
        # unpack_int / unpack_string are helper functions defined elsewhere
        rec_len_1 = unpack_int(f.read(4))   # leading record-length marker
        record_bytes = f.read(rec_len_1)
        rec_len_2 = unpack_int(f.read(4))   # trailing record-length marker
        record_num = record_num + 1
        if rec_len_1 == 8:
            tablename = unpack_string(record_bytes).strip()
            if tablename == 'OES':
                OES = [
                    # Top keys
                    ('1', 'i4', 1), ('op2key7', 'i4', 1), ('2', 'i4', 1),
                    ('3', 'i4', 1), ('op2key8', 'i4', 1), ('4', 'i4', 1),
                    ('5', 'i4', 1), ('op2key9', 'i4', 1), ('6', 'i4', 1),
                    # Record 2 -- IDENT
                    ('7', 'i4', 1), ('IDENT', 'i4', 1), ('8', 'i4', 1),
                    ('9', 'i4', 1),
                    ('acode', 'i4', 1),
                    ('tcode', 'i4', 1),
                    ('element_type', 'i4', 1),
                    ('subcase', 'i4', 1),
                    ('LSDVMN', 'i4', 1),      # Load set number
                    ('UNDEF(2)', 'i4', 2),    # Undefined
                    ('LOADSET', 'i4', 1),     # Load set number or zero or random code identification number
                    ('FCODE', 'i4', 1),       # Format code
                    ('NUMWDE(C)', 'i4', 1),   # Number of words per entry in DATA record
                    ('SCODE(C)', 'i4', 1),    # Stress/strain code
                    ('UNDEF(11)', 'i4', 11),  # Undefined
                    ('THERMAL(C)', 'i4', 1),  # =1 for heat transfer and 0 otherwise
                    ('UNDEF(27)', 'i4', 27),  # Undefined
                    ('TITLE(32)', 'S1', 32*4),    # Title
                    ('SUBTITL(32)', 'S1', 32*4),  # Subtitle
                    ('LABEL(32)', 'S1', 32*4),    # Label
                    ('10', 'i4', 1),
                    # Record 3 -- Data
                    ('11', 'i4', 1), ('KEY1', 'i4', 1), ('12', 'i4', 1),
                    ('13', 'i4', 1), ('KEY2', 'i4', 1), ('14', 'i4', 1),
                    ('15', 'i4', 1), ('KEY3', 'i4', 1), ('16', 'i4', 1),
                    ('17', 'i4', 1), ('KEY4', 'i4', 1), ('18', 'i4', 1),
                    ('19', 'i4', 1),
                    ('EKEY', 'i4', 1),  # Element key = 10*EID + Device Code. EID = (Element key)//10
                    ('FD1', 'f4', 1),
                    ('EX1', 'f4', 1),
                    ('EY1', 'f4', 1),
                    ('EXY1', 'f4', 1),
                    ('EA1', 'f4', 1),
                    ('EMJRP1', 'f4', 1),
                    ('EMNRP1', 'f4', 1),
                    ('EMAX1', 'f4', 1),
                    ('FD2', 'f4', 1),
                    ('EX2', 'f4', 1),
                    ('EY2', 'f4', 1),
                    ('EXY2', 'f4', 1),
                    ('EA2', 'f4', 1),
                    ('EMJRP2', 'f4', 1),
                    ('EMNRP2', 'f4', 1),
                    ('EMAX2', 'f4', 1),
                    ('20', 'i4', 1)]
                nparr = np.fromfile(f, dtype=OES)
        if f.tell() == filesize:
            break
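One idea I am considering for determining count without unpacking every record: seek over the payloads of the current section, recording only where the next 8-byte section-name record starts, then divide the section's byte span by the itemsize of the pattern. Below is a minimal sketch, assuming each section is a whole number of pattern repeats, that the pattern (like OES above) includes the framing length integers as fields, and reusing the unpack_int helper from above:

def section_count(f, section_dtype):
    """Return (count, start_offset) for the section beginning at the current
    file position, by skipping over records instead of unpacking them."""
    start = f.tell()
    while True:
        here = f.tell()
        raw = f.read(4)
        if len(raw) < 4:            # end of file reached
            end = here
            break
        rec_len = unpack_int(raw)
        if rec_len == 8:            # next 8-byte section-name record found
            end = here
            break
        f.seek(rec_len + 4, 1)      # skip payload plus trailing length marker
    f.seek(start)                   # rewind so np.fromfile starts at the section
    return (end - start) // np.dtype(section_dtype).itemsize, start

The section would then be read with n, _ = section_count(f, OES) followed by np.fromfile(f, dtype=OES, count=n). I have not verified this against the full file layout.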
Related
Comparing PDF files with varying degrees of strictness
I have two folders, each including ca. 100 PDF files resulting from different runs of the same PDF generation program. After performing some changes to this program, the resulting PDF should always stay equal and nothing should break the layout, the fonts, any potential graphs and so on. This is why I would like to check for visual equality while ignoring any metadata that might have changed due to running the program at different times.
My first approach was based on this post and attempted to compare the hashes of each file:

h1 = hashlib.sha1()
h2 = hashlib.sha1()
with open(fileName1, "rb") as file:
    chunk = 0
    while chunk != b'':
        chunk = file.read(1024)
        h1.update(chunk)
with open(fileName2, "rb") as file:
    chunk = 0
    while chunk != b'':
        chunk = file.read(1024)
        h2.update(chunk)
return (h1.hexdigest() == h2.hexdigest())

This always returns "False". I assume that this is due to different time-dependent metadata, which is why I would like to ignore it. I've already found a way to set the modification and creation date to "None":

pdf1 = pdfrw.PdfReader(fileName1)
pdf1.Info.ModDate = pdf1.Info.CreationDate = None
pdfrw.PdfWriter().write(fileName1, pdf1)

pdf2 = pdfrw.PdfReader(fileName2)
pdf2.Info.ModDate = pdf2.Info.CreationDate = None
pdfrw.PdfWriter().write(fileName2, pdf2)

Looping through all files in each folder and running the second method before the first curiously sometimes results in a return value of "True" and sometimes in a return value of "False".
Thanks to the kind help of @jorj-mckie (see answer below), I have the following method checking for xref equality:

doc1 = fitz.open(fileName1)
xrefs1 = doc1.xref_length()  # cross reference table 1
doc2 = fitz.open(fileName2)
xrefs2 = doc2.xref_length()  # cross reference table 2

if (xrefs1 != xrefs2):
    print("Files are not equal")
    return False

for xref in range(1, xrefs1):  # loop over objects, index 0 must be skipped
    # compare the PDF object definition sources
    if (doc1.xref_object(xref) != doc2.xref_object(xref)):
        print(f"Files differ at xref {xref}.")
        return False
    if doc1.xref_is_stream(xref):  # compare binary streams
        stream1 = doc1.xref_stream_raw(xref)  # read binary stream
        try:
            stream2 = doc2.xref_stream_raw(xref)  # read binary stream
        except:  # stream extraction doc2 did not work!
            print(f"stream discrepancy at xref {xref}")
            return False
        if (stream1 != stream2):
            print(f"stream discrepancy at xref {xref}")
            return False
return True

and xref equality without metadata:

doc1 = fitz.open(fileName1)
xrefs1 = doc1.xref_length()  # cross reference table 1
doc2 = fitz.open(fileName2)
xrefs2 = doc2.xref_length()  # cross reference table 2

info1 = doc1.xref_get_key(-1, "Info")  # extract the info object
info2 = doc2.xref_get_key(-1, "Info")
if (info1 != info2):
    print("Unequal info objects")
    return False

if (info1[0] == "xref"):  # is there metadata at all?
    info_xref1 = int(info1[1].split()[0])  # xref of info object doc1
    info_xref2 = int(info2[1].split()[0])  # xref of info object doc2
else:
    info_xref1 = 0

for xref in range(1, xrefs1):  # loop over objects, index 0 must be skipped
    # compare the PDF object definition sources
    if (xref != info_xref1):
        if (doc1.xref_object(xref) != doc2.xref_object(xref)):
            print(f"Files differ at xref {xref}.")
            return False
    if doc1.xref_is_stream(xref):  # compare binary streams
        stream1 = doc1.xref_stream_raw(xref)  # read binary stream
        try:
            stream2 = doc2.xref_stream_raw(xref)  # read binary stream
        except:  # stream extraction doc2 did not work!
            print(f"stream discrepancy at xref {xref}")
            return False
        if (stream1 != stream2):
            print(f"stream discrepancy at xref {xref}")
            return False
return True

If I run the last two functions on my PDF files, whose timestamps have already been set to "None" (see above), I end up with some equality checks resulting in a "True" return value and others resulting in "False".
I'm using the reportlab library to generate the PDFs. Do I just have to live with the fact that some PDFs will always have a different internal structure, resulting in different hashes, even if the files look exactly the same? I would be very happy to learn that this is not the case and there is indeed a way to check for equality without actually having to export all pages to images first.
I think you should use PyMuPDF for PDF handling - it has all batteries included for your task (and many more!).
First thing to clarify: what type of equality are you looking for? "Just the number of pages must be equal and the pages should look the same pairwise" is very different from "all objects and streams must be identical, with the exception of the PDF /ID". Both comparison types are possible with PyMuPDF.
To do the latter comparison, loop through both object number tables and compare them pairwise:

import sys
import fitz  # import package PyMuPDF

doc1 = fitz.open("file1.pdf")
xrefs1 = doc1.xref_length()  # cross reference table 1
doc2 = fitz.open("file2.pdf")
xrefs2 = doc2.xref_length()  # cross reference table 2

if xrefs1 != xrefs2:
    sys.exit("Files are not equal")  # quick exit

for xref in range(1, xrefs1):  # loop over objects, index 0 must be skipped
    # compare the PDF object definition sources
    if doc1.xref_object(xref) != doc2.xref_object(xref):
        sys.exit(f"Files differ at xref {xref}.")
    if doc1.xref_is_stream(xref):  # compare binary streams
        stream1 = doc1.xref_stream_raw(xref)  # read binary stream
        try:
            stream2 = doc2.xref_stream_raw(xref)  # read binary stream
        except:  # stream extraction doc2 did not work!
            sys.exit(f"stream discrepancy at xref {xref}")
        if stream1 != stream2:
            sys.exit(f"stream discrepancy at xref {xref}")

sys.exit("Files are equal!")

This still is a rather strict equality check: for example, if any date or time in the document metadata has changed, you would report inequality even if the rest is equal. But there is help: determine the xref of the metadata and exclude it from the above loop:

info1 = doc1.xref_get_key(-1, "Info")  # extract the info object
info2 = doc2.xref_get_key(-1, "Info")
if info1 != info2:
    sys.exit("Unequal info objects")

if info1[0] == "xref":  # is there metadata at all?
    info_xref1 = int(info1[1].split()[0])  # xref of info object doc1
    info_xref2 = int(info2[1].split()[0])  # xref of info object doc2
    # make another equality check here
    # in the above loop, skip the object comparison if xref == info_xref1
else:
    info_xref1 = 0  # 0 is never an xref number, so can safely be used in the loop
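For the first comparison type mentioned above (pages should look the same pairwise), a minimal sketch could render each page and compare the raw pixels. This assumes a recent PyMuPDF where the snake_case names page_count and get_pixmap() are available (older releases spell them pageCount / getPixmap):

import sys
import fitz  # PyMuPDF

doc1 = fitz.open("file1.pdf")
doc2 = fitz.open("file2.pdf")

if doc1.page_count != doc2.page_count:
    sys.exit("Different number of pages")

for pno in range(doc1.page_count):
    # render both pages at the default resolution and compare raw pixel data
    pix1 = doc1[pno].get_pixmap()
    pix2 = doc2[pno].get_pixmap()
    if pix1.samples != pix2.samples:
        sys.exit(f"Page {pno} looks different")

sys.exit("All pages look identical!")

Rendering at the default resolution keeps this fast; raising the zoom via a Matrix would make the check stricter at the cost of speed.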
Command-line / GUI PDF differs have been around a long time, and many PDF difference tools are available. This cross-platform one (https://github.com/vslavik/diff-pdf) ships as both a CLI and an executable GUI, so best of both worlds. By default, its only output is its return code, which is 0 if there are no differences and 1 if the two PDFs differ. If given the --output-diff option, it produces a PDF file with visually highlighted differences.
Others built more specifically for cross-platform Python tend to treat text differences separately, so you could try https://github.com/JoshData/pdf-diff, or for graphical comparison there is https://github.com/bgeron/diff-pdf-visually.
So, by way of example with the dual-purpose diff-pdf, you can quickly parse a folder, run blind pairwise comparisons to collect a true/false report, and then do a final one-by-one visual comparison by shelling out to:

diff-pdf --view a.pdf b.pdf

Note this is version 0.4, but 0.5 is available.
Sadly, if all 100 files are similar by a simple compare, then all of them need text testing, so you need a fast binary-test batch file to run approximately 4,950 (99x100/2) fast tests:

test 1.pdf 2.pdf report
test 1.pdf 3.pdf report
...
test 1.pdf 100.pdf report
test 2.pdf 3.pdf report
test 2.pdf 4.pdf report
...
test 98.pdf 99.pdf report
test 98.pdf 100.pdf report
test 99.pdf 100.pdf report

Then filter the matching pairs out and visually inspect the much smaller number reported as not matched. So if 49 = 30 = 1 and 60 = 45 = 25 = 2, but not the others, then only 1 and 2 need a closer look. Of course there will likely be more, and you can get a second opinion on those too. If you know a likely page number that changes, you can exclusively test images of, say, the 3rd page that has a date or other identifying feature.
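If it helps, here is a minimal sketch of that batch idea in Python rather than a shell script, relying only on diff-pdf's documented exit code (0 = no differences, 1 = the PDFs differ). The folder names and the same-filename pairing are assumptions:

import subprocess
from pathlib import Path

folder_a = Path("run_a")   # assumed folder names for the two program runs
folder_b = Path("run_b")

mismatches = []
for pdf_a in sorted(folder_a.glob("*.pdf")):
    pdf_b = folder_b / pdf_a.name                       # compare files with the same name
    result = subprocess.run(["diff-pdf", str(pdf_a), str(pdf_b)])
    if result.returncode != 0:                          # 0 = identical, 1 = different
        mismatches.append(pdf_a.name)

print("Pairs needing a visual check:", mismatches)

The mismatched pairs can then be opened one by one with diff-pdf --view for the visual inspection step.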
WUnderground, Extraction of Extremes Today
As a contributor to WUnderground it is not a problem to read, via an API call, the JSON output file with today's values for my station. That JSON file has a series of numbered 'bins', with the series growing with time from 00:00. Each numbered 'bin' contains an equivalent dataset reporting values. At the end of the day there are a few hundred 'bins' in the JSON file.
To avoid setting up a local database, finding an actual survey of Extremes_Today requires periodically scanning the latest JSON file from bin 0 till the latest added bin in a recursive way. That means reading each numbered bin, extracting and evaluating its values, and jumping to the next bin, until the last bin has been reached and processed.
I am trying the two approaches below in a Python script: these two script segments should just check and report that a bin exists. The script lines up to 442 do other jobs (including a complete read-out of bin 0 for references) and already run without error.

# Line 442 = In WU JSON-Output Today's Data find & process next Bin upto/incl. last Bin
# Example call-string for ToDay-info = https://api.weather.com/v2/pws/observations/all/1day?stationId=KMAHANOV10&format=json&units=m&apiKey=yourApiKey
# Extracting contents of the JSON-file by the scriptlines below
# page = urllib.urlopen('https://api.weather.com/v2/pws/observations/all/1day?stationId=KMAHANOV10&format=json&units=m&apiKey=yourApiKey')
# content_test = page.read()
# obj_test2 = json.loads(content_test)
# Extraction of a value is like
# Epochcheck = obj_test2['observations'][Bin]['epoch']
# 'epoch' is present as an element in all bins of the JSON-file (with trend related to the number of the bin)
# and is therefore chosen as the key for scan & search. If not found, that bin does not exist = passed last present bin.
# Bin [0] has earlier been processed separately => initial contents at 00:00 = references for Extremes-search
# GENERAL setup of the scanning function:
# Bin = 0
# while 'epoch' exists
#     Read 'epoch' & translate to CET/LocalTime
#     Compare values of Extremes in that bin with earlier Extremes
#     if hi_value higher than hiExtreme => new hiExtreme & adapt HiTime (= translated 'epoch')
#     if low_value lower than LowExtreme => new lowExtreme & adapt LowTime (= translated 'epoch')
#     Bin = Bin + 1

# Approach1
Bin = 0
Epochcheck = obj_test2['observations'][0]['epoch']
try:
    Epochcheck = obj_test2['observations'][Bin]['epoch']
    print(Bin)
    Bin += 1
except NameError:
    Epochcheck = None

# Approach2
Bin = 0
Epochcheck = obj_test2['observations'][0]['epoch']
While Epochcheck is not None:
    Epochcheck = obj_test2['observations'][Bin]['epoch']
    Print(Bin)
    Bin += 1

Approach1 does not throw an error, but it steps out at Bin = 1. Approach2 reports a syntax error:

File "/home/pi/domoticz/scripts/python/URL_JSON_WU_to_HWA_Start01a_0186.py", line 476
    While Epochcheck is not None:
                     ^
SyntaxError: invalid syntax

Apparently the check line with dynamically variable contents for Bin cannot be set up in this way: the dynamic setting of the variable Bin must be inserted/described in a different way.

Epochcheck = obj_test2['observations'][Bin]['epoch']

What is the appropriate way in Python to perform such JSON scanning using a dynamic variable [Bin]? Or is there a simpler way to scan & extract a series of bins in a JSON file?
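For reference, a minimal sketch of the scanning loop described in the comments above, iterating directly over the list of observations instead of probing indices; the hi_value/low_value key names ('metric', 'tempHigh', 'tempLow') are placeholders and not verified WUnderground field names:

# Assumes obj_test2 has already been produced by json.loads() as shown above.
observations = obj_test2['observations']

hiExtreme = lowExtreme = None
hiTime = lowTime = None

for Bin, obs in enumerate(observations):        # handles any number of bins, no index probing needed
    epoch = obs['epoch']
    hi_value = obs.get('metric', {}).get('tempHigh')   # placeholder field names
    low_value = obs.get('metric', {}).get('tempLow')
    if hi_value is not None and (hiExtreme is None or hi_value > hiExtreme):
        hiExtreme, hiTime = hi_value, epoch
    if low_value is not None and (lowExtreme is None or low_value < lowExtreme):
        lowExtreme, lowTime = low_value, epoch

print("High:", hiExtreme, "at epoch", hiTime)
print("Low:", lowExtreme, "at epoch", lowTime)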
Biopython: adding a section in the middle of a sequence and having features aligned
I want to add a section of sequence in the middle of a previous sequence (in a GenBank file) and have all features still indexed against the old sequence. For example:

previous sequence: ATAGCCATTGAATGTGTGTGTGTCCTAGAGGGCCTAAAA
feature: misc_feature complement(20..27)
         /gene="Py_ori+A"

I add TTTTTT at position 10.

new sequence: ATAGCCATTGTTTTTTAAGTGTGTGTGTCCTAGAGGGCCTAAAA
feature: misc_feature complement(26..33)
         /gene="Py_ori+A"

The indexes of the feature changed because the feature must still refer to the segment TGTCCTA. I want to save the new sequence in a new GenBank file.
Is there any Biopython function or method that can add a segment of sequence in the middle of the old sequence and add the length of the added segment to the indexes of the features that come after the added segment?
TL;DR: call + on your sliced segments (e.g. a + b). As long as you don't slice into a feature you should be OK.
The long version: Biopython supports feature joining. It is done simply by calling a + b on the respective SeqRecord objects (the features are part of the SeqRecord object, not the Seq class). There is a quirk to be aware of regarding slicing a sequence with features: if you happen to slice into a feature, the feature will not be present in the resulting SeqRecord. I've tried to illustrate the behaviour in the following code.

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation

# THIS IS OK
a = SeqRecord(
    Seq('ACGTA'),
    id='a',
    features=[
        SeqFeature(FeatureLocation(2, 4, 1), id='f1')
    ]
)

b = SeqRecord(
    Seq('ACGTA'),
    id='b',
    features=[
        SeqFeature(FeatureLocation(2, 4, 1), id='f2')
    ]
)

c = a + b

print('seq a')
print(a.seq)
print(a.features)

print('\nseq b')
print(b.seq)
print(b.features)

print("\n two distinct features joined in seq c")
print(c.seq)
print(c.features)
print("notice how the second feature has now indices (7,9), instead of 2,4\n")

# BEWARE
# slicing into the feature will remove the feature !
print("\nsliced feature removed")
d = a[:3]
print(d.seq)
print(d.features)
# Seq('ACG')
# []

# However slicing around the feature will preserve it
print("\nslicing out of the feature will preserve it")
e = c[1:6]
print(e.seq)
print(e.features)

OUTPUT

seq a
ACGTA
[SeqFeature(FeatureLocation(ExactPosition(2), ExactPosition(4), strand=1), id='f1')]

seq b
ACGTA
[SeqFeature(FeatureLocation(ExactPosition(2), ExactPosition(4), strand=1), id='f2')]

 two distinct features joined in seq c
ACGTAACGTA
[SeqFeature(FeatureLocation(ExactPosition(2), ExactPosition(4), strand=1), id='f1'), SeqFeature(FeatureLocation(ExactPosition(7), ExactPosition(9), strand=1), id='f2')]
notice how the second feature has now indices (7,9), instead of 2,4

sliced feature removed
ACG
[]

slicing out of the feature will preserve it
CGTAA
[SeqFeature(FeatureLocation(ExactPosition(1), ExactPosition(3), strand=1), id='f1')]
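Applied to the original question, a rough sketch could look like the following. The file names and the insertion position are assumptions, and writing GenBank output may additionally require record.annotations['molecule_type'] to be present in recent Biopython versions (it is, if the input was parsed from GenBank):

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

record = SeqIO.read("input.gb", "genbank")       # assumed input file name
insert = SeqRecord(Seq("TTTTTT"), id=record.id)  # the segment to insert

pos = 10                                         # insert after base 10 (0-based slice index)
new_record = record[:pos] + insert + record[pos:]

# Features located entirely after the insertion point are shifted automatically;
# features that the slice cuts through are dropped (see the caveat above).
new_record.id = record.id
new_record.annotations = record.annotations      # carry over annotations for GenBank output

SeqIO.write(new_record, "output.gb", "genbank")  # assumed output file name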
R readBin vs. Python struct
I am attempting to read a binary file using Python. Someone else has read in the data with R using the following code:

x <- readBin(webpage, numeric(), n=6e8, size = 4, endian = "little")
myPoints <- data.frame("tmax" = x[1:(length(x)/4)],
                       "nmax" = x[(length(x)/4 + 1):(2*(length(x)/4))],
                       "tmin" = x[(2*length(x)/4 + 1):(3*(length(x)/4))],
                       "nmin" = x[(3*length(x)/4 + 1):(length(x))])

With Python, I am trying the following code:

import struct

with open('file', 'rb') as f:
    val = f.read(16)
    while val != '':
        print(struct.unpack('4f', val))
        val = f.read(16)

I am coming to slightly different results. For example, the first row in R returns 4 columns as -999.9, 0, -999.0, 0. Python returns -999.0 for all four columns (images below).

Python output:
R output:

I know that they are slicing by the length of the file with some of the [] code, but I do not know how exactly to do this in Python, nor do I understand quite why they do this. Basically, I want to recreate what R is doing in Python. I can provide more of either code base if needed; I did not want to overwhelm with code that was not necessary.
Deducing from the R code, the binary file first contains a certain number of tmax's, then the same number of nmax's, then tmin's and nmin's. What the code does is read the entire file, which is then chopped up into the 4 parts (tmax's, nmax's, etc.) using slicing. To do the same in Python:

import struct

# Read entire file into memory first. This is done so we can count
# number of bytes before parsing the bytes. It is not a very memory
# efficient way, but it's the easiest. The R-code as posted wastes even
# more memory: it always takes 6e8 * 4 bytes (~ 2.2Gb) of memory no
# matter how small the file may be.
#
data = open('data.bin', 'rb').read()

# Calculate number of points in the file. This is
# file-size / 16, because there are 4 numeric()'s per
# point, and they are 4 bytes each.
#
num = int(len(data) / 16)

# Now we know how many there are, we take all tmax numbers first, then
# all nmax's, tmin's and lastly all nmin's.

# First generate a format string because it depends on the number of points
# there are in the file. It will look like: "fffff"
#
format_string = 'f' * num

# Then, for cleaner code, calculate chunk size of the bytes we need to
# slice off each time.
#
n = num * 4  # 4-byte floats

# Note that python has different interpretation of slicing indices
# than R, so no "+1" is needed here as it is in the R code.
#
tmax = struct.unpack(format_string, data[:n])
nmax = struct.unpack(format_string, data[n:2*n])
tmin = struct.unpack(format_string, data[2*n:3*n])
nmin = struct.unpack(format_string, data[3*n:])

print("tmax", tmax)
print("nmax", nmax)
print("tmin", tmin)
print("nmin", nmin)

If the goal is to have this data structured as a list of points(?) like (tmax, nmax, tmin, nmin), then append this to the code:

print()
print("Points:")

# Combine ("zip") all 4 lists into a list of (tmax,nmax,tmin,nmin) points.
# Python has a function to do this at once: zip()
#
i = 0
for point in zip(tmax, nmax, tmin, nmin):
    print(i, ":", point)
    i += 1
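Since the surrounding discussion is about numpy anyway, an equivalent sketch with numpy.fromfile avoids struct and the hand-built format string; the little-endian 4-byte float dtype matches the size = 4, endian = "little" arguments of readBin:

import numpy as np

# Read the whole file as little-endian 4-byte floats in one call.
x = np.fromfile('data.bin', dtype='<f4')

# The file holds the four blocks back to back, so reshape into 4 rows:
# row 0 = tmax, row 1 = nmax, row 2 = tmin, row 3 = nmin.
tmax, nmax, tmin, nmin = x.reshape(4, -1)

# Column-wise view of the points, one (tmax, nmax, tmin, nmin) per row,
# matching the layout of the R data.frame.
points = np.column_stack([tmax, nmax, tmin, nmin])
print(points[:5])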
Here's a less memory-hungry way to do the same. It is possibly a bit faster too (but that is difficult for me to check). My computer did not have sufficient memory to run the first program with those huge files. This one does, but I still needed to create a list of only tmax's first (the first 1/4 of the file), then print it, and then delete the list in order to have enough memory for the nmax's, tmin's and nmin's.
But this one too says the nmin's inside the 2018 file are all -999.0. If that doesn't make sense, could you check what the R code makes of it? I suspect that it is just what's in the file. The other possibility is of course that I got it all wrong (which I doubt). However, I tried the 2017 file too, and that one does not have such a problem: all of tmax, nmax, tmin, nmin have around 37% -999.0's.
Anyway, here's the second code:

import os
import struct

# load_data()
#   data_store : object to append() data items (floats) to
#   num        : number of floats to read and store
#   datafile   : opened binary file object to read float data from
#
def load_data(data_store, num, datafile):
    for i in range(num):
        data = datafile.read(4)              # process one float (=4 bytes) at a time
        item = struct.unpack("<f", data)[0]  # '<' means little endian
        data_store.append(item)

# save_list() saves a list of floats as strings to a file
#
def save_list(filename, datalist):
    output = open(filename, "wt")
    for item in datalist:
        output.write(str(item) + '\n')
    output.close()

#### MAIN ####

datafile = open('data.bin', 'rb')

# Get file size so we can calculate number of points without reading
# the (large) file entirely into memory.
#
file_info = os.stat(datafile.fileno())

# Calculate number of points, i.e. number of each tmax's, nmax's,
# tmin's, nmin's. A point is 4 floats of 4 bytes each, hence number
# of points = file-size / (4*4)
#
num = int(file_info.st_size / 16)

tmax_list = list()
load_data(tmax_list, num, datafile)
save_list("tmax.txt", tmax_list)
del tmax_list  # huge list, save memory

nmax_list = list()
load_data(nmax_list, num, datafile)
save_list("nmax.txt", nmax_list)
del nmax_list  # huge list, save memory

tmin_list = list()
load_data(tmin_list, num, datafile)
save_list("tmin.txt", tmin_list)
del tmin_list  # huge list, save memory

nmin_list = list()
load_data(nmin_list, num, datafile)
save_list("nmin.txt", nmin_list)
del nmin_list  # huge list, save memory
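If memory is the main constraint, another option worth mentioning is numpy.memmap, which lets the operating system page the file in on demand instead of holding Python lists in RAM; a minimal sketch (the file name is an assumption):

import numpy as np

# Map the file as little-endian 4-byte floats without loading it into RAM.
x = np.memmap('data.bin', dtype='<f4', mode='r')

num = x.size // 4          # number of points in each block
tmax = x[:num]
nmax = x[num:2*num]
tmin = x[2*num:3*num]
nmin = x[3*num:]

# Only the slices that are actually touched get read from disk.
print("share of -999.0 in nmin:", np.mean(nmin == -999.0))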
Reading binary data in python
Firstly, before this question gets marked as a duplicate: I'm aware others have asked similar questions, but there doesn't seem to be a clear explanation. I'm trying to read a binary file into a 2D array (documented well here: http://nsidc.org/data/docs/daac/nsidc0051_gsfc_seaice.gd.html). The header is a 300-byte array. So far, I have:

import struct

with open("nt_197912_n07_v1.1_n.bin", mode='rb') as file:
    filecontent = file.read()
    x = struct.unpack("iiii", filecontent[:300])

This throws up an error about the string argument length.
Reading the Data (Short Answer)

After you have determined the size of the grid (n_rows x n_cols = 448 x 304) from your header (see below), you can simply read the data using numpy.frombuffer.

import numpy as np
# ...

# Get data from Numpy buffer
dt = np.dtype(('>u1', (n_rows, n_cols)))
x = np.frombuffer(filecontent[300:], dt)  # we know the data starts from idx 300 onwards

# Remove unnecessary dimension that numpy gave us
x = x[0,:,:]

The '>u1' specifies the format of the data, in this case unsigned integers of size 1 byte, in big-endian format.

Plotting this with matplotlib.pyplot:

import matplotlib.pyplot as plt
# ...
plt.imshow(x, extent=[0,3,-3,3], aspect="auto")
plt.show()

The extent= option simply specifies the axis values; you can change these to lat/lon, for example (parsed from your header).

Explanation of Error from .unpack()

From the docs for struct.unpack(fmt, string):

The string must contain exactly the amount of data required by the format (len(string) must equal calcsize(fmt))

You can determine the size specified in the format string (fmt) by looking at the Format Characters section. Your fmt in struct.unpack("iiii", filecontent[:300]) specifies 4 int types (you can also use 4i = iiii for simplicity), each of which has size 4, requiring a string of length 16. Your string (filecontent[:300]) is of length 300, whilst your fmt is asking for a string of length 16, hence the error.

Example Usage of .unpack()

As an example, reading your supplied document I extracted the first 21*6 bytes, which have the format:

a 21-element array of 6-byte character strings that contain information such as polar stereographic grid characteristics

With:

x = struct.unpack("6s"*21, filecontent[:126])

This returns a tuple of 21 elements. Note the whitespace padding in some elements to meet the 6-byte requirement.

>> print x
# ('00255\x00', ' 304\x00', ' 448\x00', '1.799\x00', '39.43\x00', '45.00\x00', '558.4\x00', '154.0\x00', '234.0\x00', ' SMMR\x00', '07 cn\x00', ' 336\x00', ' 0000\x00', ' 0034\x00', ' 364\x00', ' 0000\x00', ' 0046\x00', ' 1979\x00', ' 336\x00', ' 000\x00', '00250\x00')

Notes:
The first argument fmt, "6s"*21, is a string with 6s repeated 21 times. Each format character 6s represents one string of 6 bytes (see below); this will match the required format specified in your document. The number 126 in filecontent[:126] is calculated as 6*21 = 126.
Note that for the s (string) specifier, the preceding number does not mean to repeat the format character 6 times (as it would normally for other format characters). Instead, it specifies the size of the string: s represents a 1-byte string, whilst 6s represents a 6-byte string.

More Extensive Solution for Header Reading (Long)

Because the binary data must be manually specified, this may be tedious to do in source code. You can consider using some configuration file (like a .ini file). This function will read the header and store it in a dictionary, where the structure is given by a .ini file:

# use configparser for Python 3x
import ConfigParser

def read_header(data, config_file):
    """
    Read binary data specified by a INI file which specifies the structure
    """
    with open(config_file) as fd:
        # Init the config class
        conf = ConfigParser.ConfigParser()
        conf.readfp(fd)

        # preallocate dictionary to store data
        header = {}

        # Iterate over the key-value pairs under the
        # 'structure' section
        for key in conf.options('structure'):
            # determine the string properties
            start_idx, end_idx = [int(x) for x in conf.get('structure', key).split(',')]
            start_idx -= 1  # remember python is zero indexed!
            strLength = end_idx - start_idx

            # Get the data
            header[key] = struct.unpack("%is" % strLength, data[start_idx:end_idx])

            # Format the data
            header[key] = [x.strip() for x in header[key]]
            header[key] = [x.replace('\x00', '') for x in header[key]]

    # Unmap from list-type
    # use .items() for Python 3x
    header = {k: v[0] for k, v in header.iteritems()}

    return header

An example .ini file is below. The key is the name to use when storing the data, and the value is a comma-separated pair of numbers, the first being the starting index and the second being the ending index. These values were taken from Table 1 in your document.

[structure]
missing_data: 1, 6
n_cols: 7, 12
n_rows: 13, 18
latitude_enclosed: 25, 30

This function can be used as follows:

header = read_header(filecontent, 'headerStructure.ini')
n_cols = int(header['n_cols'])