Reading direct access binary file format in Python

Background:
A binary file is read on a Linux machine using the following Fortran code:
      parameter(nx=720, ny=360, nday=365)
c
      dimension tmax(nx,ny,nday),nmax(nx,ny,nday)
      dimension tmin(nx,ny,nday),nmin(nx,ny,nday)
c
      open(10,
     &file='FILE',
     &access='direct',recl=nx*ny*4)
c
      do k=1,nday
        read(10,rec=(k-1)*4+1)((tmax(i,j,k),i=1,nx),j=1,ny)
        read(10,rec=(k-1)*4+2)((nmax(i,j,k),i=1,nx),j=1,ny)
        read(10,rec=(k-1)*4+3)((tmin(i,j,k),i=1,nx),j=1,ny)
        read(10,rec=(k-1)*4+4)((nmin(i,j,k),i=1,nx),j=1,ny)
      end do
File Details:
options little_endian
title global daily analysis (grid box mean, the grid shown is the center of the grid box)
undef -999.0
xdef 720 linear 0.25 0.50
ydef 360 linear -89.75 0.50
zdef 1 linear 1 1
tdef 365 linear 01jan2015 1dy
vars 4
tmax 1 00 daily maximum temperature (C)
nmax 1 00 number of reports for maximum temperature (C)
tmin 1 00 daily minimum temperature (C)
nmin 1 00 number of reports for minimum temperature (C)
ENDVARS
Attempts at a solution:
I am trying to parse this into arrays in Python using the following code (purposely leaving out two of the variables, nmax and nmin):
import gzip
import numpy

tmax, tmin, x = [], [], 1
with gzip.open("/FILE.gz", "rb") as infile:
    data = numpy.frombuffer(infile.read(), dtype=numpy.dtype('<f4'), count=-1)
while x <= len(data) / 4:
    tmax.append(data[(x - 1) * 4])
    tmin.append(data[(x - 1) * 4 + 2])
    x += 1
data_full = zip(tmax, tmin)
When testing some records, the data does not seem to line up with sample records read from the file with the Fortran code above. I have also tried dtype=numpy.float32 with no success. It seems as though I am reading the file in correctly in terms of the number of observations, though. I was using struct before I learned the file was created with Fortran, and that was not working either.
There are similar questions out there, some of which have answers that I have tried adapting with no luck.
UPDATE
I am a little bit closer after trying out this code:
# Define numpy variables and empty arrays
nx = 720    # number of lon
ny = 360    # number of lat
nday = 0    # iterates up to 364 (or 365 for a leap year)
tmax = numpy.empty([0], dtype='<f', order='F')
tmin = numpy.empty([0], dtype='<f', order='F')

# Parse the data into numpy arrays, shifting records as the date increments
while nday < 365:
    tmax = numpy.append(tmax, data[(nx*ny)*nday:(nx*ny)*(nday + 1)].reshape((nx,ny), order='F'))
    tmin = numpy.append(tmin, data[(nx*ny)*(nday + 2):(nx*ny)*(nday + 3)].reshape((nx,ny), order='F'))
    nday += 1
I get the correct data for the first day, but for the second day I get all zeros, on the third day the max is lower than the min, and so on.

While the exact format of Fortran binary files is compiler dependent, in all cases I'm aware of, direct access files (files opened with access='direct' as in this question) do not have any record markers between records. Each record is of a fixed size, as given by the recl= specifier in the OPEN statement. That is, record N starts at offset (N - 1) * RECL bytes in the file.
One portability gotcha is that the unit of recl= is in terms of file storage units. For most compilers, the file storage unit is an 8-bit octet (as recommended in recent versions of the Fortran standard), but for the Intel Fortran compiler, recl= is in units of 32 bits; there is a command-line option -assume byterecl which can be used to make Intel Fortran match most other compilers.
So in the example given here, and assuming an 8-bit file storage unit, your recl would be 1036800 bytes.
Further, looking at the code, it seems to assume the arrays are of a 4-byte type (e.g. integer or single precision real). So if it's single precision real, and the file has been created as little endian, then the numpy dtype <f4 that you have used seems to be the correct choice.
Now, getting back to the Intel Fortran compiler gotcha: if the file has been created by ifort without -assume byterecl, then the data you want will be in the first quarter of each record, with the rest being padding (all zeros, or maybe even random data?). Then you'll have to do some extra gymnastics to extract the correct data in Python and not the padding. It should be easy to check this from the size of the file: is it nx * ny * 4 * nday * 4 bytes, or nx * ny * 4 * nday * 4 * 4 bytes?
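To make the record layout concrete, here is a minimal sketch of reading one day in Python, assuming 8-bit file storage units (so recl is nx*ny*4 bytes), little-endian 4-byte reals, and the record numbering from the Fortran loop above; the plain file name 'FILE' is a placeholder for the decompressed file:
import os
import numpy as np

nx, ny, nday = 720, 360, 365
recl = nx * ny * 4      # record length in bytes (assuming byte-sized storage units)

# Quick check for the ifort recl gotcha described above:
# True if recl was in bytes; the file is 4x larger if ifort counted 32-bit units.
print(os.path.getsize('FILE') == recl * 4 * nday)

def read_record(f, rec):
    # Direct access record `rec` (1-based, as in Fortran) starts at (rec - 1) * recl bytes.
    f.seek((rec - 1) * recl)
    return np.frombuffer(f.read(recl), dtype='<f4').reshape((ny, nx))

with open('FILE', 'rb') as f:
    k = 1                               # first day
    tmax_day1 = read_record(f, (k - 1) * 4 + 1)
    tmin_day1 = read_record(f, (k - 1) * 4 + 3)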

After the update in my question, I realized I had an error in how I was looping. I of course spotted this about 10 minutes after issuing a bounty, oh well.
The error was using the day counter to index the records. That advances by only one record per day instead of four, so the slices do not move far enough through the file; hence some mins came out higher than maxes. The new code is:
nday = 0
rm = 0    # record mover: index of the current record in the file
while nday < 365:
    tmax = numpy.append(tmax, data[(nx*ny)*rm:(nx*ny)*(rm + 1)].reshape((nx,ny), order='F'))
    rm = rm + 2
    tmin = numpy.append(tmin, data[(nx*ny)*rm:(nx*ny)*(rm + 1)].reshape((nx,ny), order='F'))
    rm = rm + 2
    nday += 1
This uses a record mover (or rm, as I call it) to advance through the records by the appropriate amount: two records per variable, four per day. That was all it needed.
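As a side note, a minimal sketch of a loop-free alternative, assuming the whole little-endian float32 buffer data from the question has been read and that recl was in bytes (no ifort padding): the file is just nday blocks of 4 records of ny*nx values, so one reshape exposes every variable.
import numpy as np

nx, ny, nday = 720, 360, 365
# data is the flat <f4 array read from the decompressed file, length nday * 4 * ny * nx
records = data.reshape((nday, 4, ny, nx))       # day, record within day, lat, lon
tmax = records[:, 0]                            # record 1 of each day
nmax = records[:, 1]
tmin = records[:, 2]
nmin = records[:, 3]
# Each (ny, nx) slice is the transpose of the question's reshape((nx, ny), order='F')
tmax = np.where(tmax == -999.0, np.nan, tmax)   # mask the undef value from the .ctl file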

Related

How to calculate the icclim indicator tn10p over a period of < 365 days of the year

The goal is to calculate the climate indicator tn10p (percentage of days when Tmin < 10th percentile) with the icclim package (link). Alternatively, I tried the same indicator from the xclim package (here). I want to calculate the indicator for a specific time period, e.g. '1960-12-01' to '1961-01-31', which can span two different years and is <= 12 months.
1 - open the xarray dataset (3-hourly data)
t2m = xa.open_dataset('filepath.nc', decode_cf = True, decode_coords = "all").sel(time=slice('1960-12-01', '1961-01-31'))
2 - calculate minimum daily temperature values
t2m_min = t2m.t2m.resample(time='1D').min(keep_attrs = True)
3.1 - With Icclim:
icclim_tn10p = icclim._generated_api.tn10p(in_files=t2m_min, slice_mode=['season',([12,1])])
3.2 - With xClim:
t2m_min_q10 = percentile_doy(arr = t2m_min, window=5, per=10).sel(percentiles=10)
xclim_tn10p = xclim.indicators.atmos.tn10p(tasmin = t2m_min, t10 = t2m_min_q10)
In both cases, 3.1 and 3.2, I get the following ValueError:
ValueError: conflicting sizes for dimension 'dayofyear': length 61 on <this-array> and length 365 on {'longitude': 'longitude', 'latitude': 'latitude', 'dayofyear': 'dayofyear', 'percentiles': 'percentiles'}
I believe that the problem is the percentile_doy function (link), which only seems to work with 365 or 366 calendar days. Any suggestions on how to solve this?
It seems to be related to this xclim issue.
The following approach works with the icclim library:
1 - Compute tn10p for the full years with slice_mode='month'
2 - Select the time range from the output array
However, this process decreases performance.
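A minimal sketch of that workaround, reusing the call pattern from the question; opening the full surrounding years, the exact slice bounds, and selecting the period with .sel on the output are assumptions to be checked against the icclim documentation:
import xarray as xa
import icclim

# Open enough data to cover the full years touched by the target period
# (assumption: the day-of-year percentiles need whole calendar years).
t2m = xa.open_dataset('filepath.nc', decode_cf=True, decode_coords="all").sel(
    time=slice('1960-01-01', '1961-12-31'))
t2m_min = t2m.t2m.resample(time='1D').min(keep_attrs=True)

# Compute tn10p month by month over the full years...
tn10p_monthly = icclim._generated_api.tn10p(in_files=t2m_min, slice_mode='month')

# ...then keep only the period of interest from the output array.
tn10p_dec_jan = tn10p_monthly.sel(time=slice('1960-12-01', '1961-01-31'))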

Plotting categorical data by row from a dataframe

I have some data showing a machine's performance. One of the columns is for when the pipe it makes fails a particular quality check, causing the machine to automatically cut the pipe. Depending on the machine and the way it's set up, this happens around 1% of the time, and I am trying to make a plot that shows the failure rate against time - my theory is that the longer some of the tools have been in use, the more failures they produce.
Here is an example of the Excel file the machine makes every 24 hours.
The column "Cut Event" is the one I am interested in. In the snip, the "/" symbol indicates no cut was made; when a cut is made, the cell in that column will say "speed", "ovality" or "thickness" as a reason (in German). What I want to do is go through the dataframe and only capture rows that have a failure, i.e. not a forward slash.
Here is what I have from reading through SO and other tutorials. The machine "speaks" German, by the way, hence the longer words:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#fig = plt.gcf()
df = pd.read_excel("W03 tool with cuts and dates.xlsx",
                   dtype=object)
df = df[['Time','Cut_Event']]
df['Cut_Event'].loc[df['Cut_Event'] == 'Geschwindigkeitsschwankung'] = 'Speed Cut Event'
df['Cut_Event'].loc[df['Cut_Event'] == 'Kugelfehler'] = 'Kugel Cut Event'
df['Cut_Event'].loc[df['Cut_Event'] == '/'] = 'No Cut Event'
print (df)
What I am stuck on is passing these events over to be plotted. My Python learning so far has covered plotting every value in a numerical column, rather than just specific events of categorical data, and I am getting errors as a result. I tried seaborn but got nowhere.
All help is genuinely appreciated.
edit: Adding the dataset
Datum   WKZ_code  Time      Rad_t1  Not Important  Cut_Event
10 Sep  W03       00:00:00  100     250            /
10 Sep  W03       00:00:01  100     250            /
10 Sep  W03       00:00:02  100     250            /
10 Sep  W03       00:00:03  100     250            /
10 Sep  W03       00:00:04  100     250            /
10 Sep  W03       00:00:00  100     250            Speed Cut
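A minimal sketch of one way to filter the cut events and plot how often they occur over time, using the columns from the question; the datetime format and the hourly resampling are assumptions:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_excel("W03 tool with cuts and dates.xlsx", dtype=object)
df = df[['Time', 'Cut_Event']]

# Keep only the rows where a cut actually happened (anything other than "/")
cuts = df[df['Cut_Event'] != '/'].copy()

# Count cut events per hour and plot that rate over time
cuts['Time'] = pd.to_datetime(cuts['Time'], format='%H:%M:%S')
cuts_per_hour = cuts.set_index('Time').resample('1H')['Cut_Event'].count()

cuts_per_hour.plot(kind='bar')
plt.ylabel('Number of cut events')
plt.show()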

ParserError: Expected 2 fields in line 32, saw 4

I'm having trouble parsing a txt file (see here: File)
Here's my code:
import pandas as pd
objectname = r"path"
df = pd.read_csv(objectname, engine = 'python', sep='\t', header=None)
Unfortunately it does not work. Since this question has been asked several times, I tried lots of proposed solutions (most of them can be found here: Possible solutions)
However, nothing did the trick for me. For instance, when I use
sep='delimiter'
The dataframe is created but everything ends up in a single column.
When I use
error_bad_lines=False
The rows I'm interested in are simply skipped.
The only way it works is when I first open the txt file, copy the content, paste it into Google Sheets, save the file as CSV and then read that into a dataframe.
I guess another workaround would be to use
df = pd.read_csv(objectname, engine='python', sep='delimiter', header=None)
in combination with the split function.
Is there any suggestion how to make this work without the need to convert the file or to use the split function? I'm using Python 3 and Windows 10.
Any help is appreciated.
Your file has tab separators but is not a TSV. The file is a mixture of metadata, followed by a "standard" TSV, followed by more metadata. Therefore, I found tackling the metadata as a separate task from loading the data to be useful.
Here's what I did to extract the metadata lines:
with open('example.txt', 'r') as file_handle:
    file_content = file_handle.read().split('\n')

for index, line in enumerate(file_content):
    if index < 21 or index > 37:
        print(index, line.split('\t'))
Note that the line numbers marking where the metadata stops and resumes (21 and 37 in my example) are specific to the file. I've provided the trimmed data I used below (based on your linked file).
Separately, I loaded the TSV into Pandas using
import pandas as pd

df = pd.read_csv('example.txt', engine='python',
                 sep='\t', error_bad_lines=False, header=None,
                 skiprows=list(range(21)) + list(range(37, 89)))
Again, I skipped the metadata at the start of the file and at the end of the file.
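If the metadata length varies between files, a small variation (an assumption beyond the original answer) is to locate the data block by its markers instead of hard-coding the line numbers; the marker strings are taken from the example file below:
import pandas as pd

with open('example.txt', 'r') as fh:
    lines = fh.read().split('\n')

# Data starts on the line after "XYDATA" and ends at the "##### Extended Information" line.
start = next(i for i, line in enumerate(lines) if line.startswith('XYDATA')) + 1
stop = next(i for i, line in enumerate(lines) if line.startswith('#####'))

df = pd.read_csv('example.txt', engine='python', sep='\t', header=None,
                 skiprows=list(range(start)) + list(range(stop, len(lines))))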
Here's the file I experimented with. I've trimmed the extra data to reduce line count.
TITLE Test123
DATA TYPE
ORIGIN JASCO
OWNER
DATE 19/03/28
TIME 16:39:44
SPECTROMETER/DATA SYSTEM
LOCALE 1031
RESOLUTION
DELTAX -0,5
XUNITS NANOMETERS
YUNITS CD [mdeg]
Y2UNITS HT [V]
Y3UNITS ABSORBANCE
FIRSTX 300,0000
LASTX 190,0000
NPOINTS 221
FIRSTY -0,78961
MAXY 37,26262
MINY -53,38971
XYDATA
300,0000 -0,789606 182,198 -0,0205245
299,5000 -0,691644 182,461 -0,0181217
299,0000 -0,700976 182,801 -0,0136756
298,5000 -0,614708 182,799 -0,0131957
298,0000 -0,422611 182,783 -0,0130073
195,0000 26,6231 997,498 4,7258
194,5000 -17,3049 997,574 4,6864
194,0000 16,0387 997,765 4,63967
193,5000 -14,4049 997,967 4,58593
193,0000 -0,277261 998,025 4,52411
192,5000 -29,6098 998,047 4,45244
192,0000 -11,5786 998,097 4,36608
191,5000 34,0505 998,282 4,27376
191,0000 28,2325 998,314 4,1701
190,5000 -13,232 998,336 4,05036
190,0000 -47,023 998,419 3,91883
##### Extended Information
[Comments]
Sample name X
Comment
User
Division
Company RWTH Aachen
[Detailed Information]
Creation date 28.03.2019 16:39
Data array type Linear data array * 3
Horizontal axis Wavelength [nm]
Vertical axis(1) CD [mdeg]
Vertical axis(2) HT [V]
Vertical axis(3) Abs
Start 300 nm
End 190 nm
Data interval 0,5 nm
Data points 221
[Measurement Information]
Instrument name CD-Photometer
Model name J-1100
Serial No. A001361635
Detector Standard PMT
Lock-in amp. X mode
HT volt Auto
Accessory PTC-514
Accessory S/N A000161648
Temperature 18.63 C
Control sonsor Holder
Monitor sensor Holder
Measurement date 28.03.2019 16:39
Overload detect 203
Photometric mode CD, HT, Abs
Measure range 300 - 190 nm
Data pitch 0.5 nm
CD scale 2000 mdeg/1.0 dOD
FL scale 200 mdeg/1.0 dOD
D.I.T. 0.5 sec
Bandwidth 1.00 nm
Start mode Immediately
Scanning speed 200 nm/min
Baseline correction Baseline
Shutter control Auto
Accumulations 3

Create a sampling rate array

I have a df with 40 000 000 points that looks like this:
            A
0           0.50
1           0.90
2           5.94
...
40 000 000  84.53
As the data does not have any time information, I am trying to add a time array to the df, but every time I do it I get a MemoryError. The sampling rate is 60 kHz.
I tried shrinking the data by slicing it instead of taking all 40 000 000 points. I checked, and the important data for me lies between 20000001:40000000. I have also tried taking fewer data points, e.g. 20 000, but still, whenever I create the Time array I get the memory error.
N = N.iloc[20000001:40000000]    # lock to the data range of interest
N = N[0:len(N):1000]             # slice in increments of 1000
N['Time'] = np.arange(0, len(N), 1/60000)
How could I create a Time array without killing my memory? Am I doing something wrong?
You may write a generator of floats, similar to xrange (in Python 2) or range (Python 3). Those lack float support, so we write one ourselves:
def frange(end_number, fraction):
    end_idx = end_number * fraction
    idx = 0
    while idx < end_idx:
        yield float(idx) / fraction
        idx += 1

a = frange(20, 3)
print([i for i in a])  # see how it works

b = frange(40000000, 60000)  # no memory error
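As a side note, and this is an assumption about the intent: if the goal is simply one timestamp per remaining row at a 60 kHz sampling rate, the time column only needs len(N) values (not len(N) * 60000, which is what np.arange(0, len(N), 1/60000) produces), so it fits in memory directly:
import numpy as np

fs = 60000    # sampling rate in Hz, from the question

# N is the (sliced) dataframe from the question; one time value per row.
N['Time'] = np.arange(len(N)) / fs

# If N was sliced in steps of 1000, consecutive rows are 1000 samples apart:
# N['Time'] = np.arange(len(N)) * 1000 / fs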

Add two Pandas Series or DataFrame objects in-place?

I have a dataset where we record the electrical power demand from each individual appliance in the home. The dataset is quite large (2 years of data; 1 sample every 6 seconds; 50 appliances). The data is in a compressed HDF file.
We need to add the power demand for every appliance to get the total aggregate power demand over time. Each individual meter might have a different start and end time.
The naive approach (using a simple model of our data) is to do something like this:
import numpy as np
import pandas as pd

LENGTH = 2**25
N = 30
cumulator = pd.Series()
for i in range(N):
    # change the index for each new_entry to mimic the fact
    # that our appliance meters have different start and end times
    new_entry = pd.Series(1, index=np.arange(i, LENGTH + i))
    cumulator = cumulator.add(new_entry, fill_value=0)
This works fine for small amounts of data. It also works OK with large amounts of data as long as every new_entry has exactly the same index.
But, with large amounts of data, where each new_entry has a different start and end index, Python quickly gobbles up all the available RAM. I suspect this is a memory fragmentation issue. If I use multiprocessing to fire up a new process for each meter (to load the meter's data from disk, load the cumulator from disk, do the addition in memory, then save the cumulator back to disk, and exit the process) then we have fine memory behaviour but, of course, all that disk IO slows us down a lot.
So, I think what I want is an in-place Pandas add function. The plan would be to initialise cumulator to have an index which is the union of all the meters' indices, then allocate memory once for that cumulator. Hence no more fragmentation issues.
I have tried two approaches but neither is satisfactory.
I tried using numpy.add to allow me to set the out argument:
# Allocate enough space for the cumulator
cumulator = pd.Series(0, index=np.arange(0, LENGTH + N))
for i in range(N):
    new_entry = pd.Series(1, index=np.arange(i, LENGTH + i))
    cumulator, aligned_new_entry = cumulator.align(new_entry, copy=False, fill_value=0)
    del new_entry
    np.add(cumulator.values, aligned_new_entry.values, out=cumulator.values)
    del aligned_new_entry
But this gobbles up all my RAM too and doesn't seem to do the addition. If I change the penultimate line to cumulator.values = np.add(cumulator.values, aligned_new_entry.values, out=cumulator.values) then I get an error about not being able to assign to cumulator.values.
This second approach appears to have the correct memory behaviour but is far too slow to run:
for i in range(N):
    new_entry = pd.Series(1, index=np.arange(i, LENGTH + i))
    for index in cumulator.index:
        try:
            cumulator[index] += new_entry[index]
        except KeyError:
            pass
I suppose I could write this function in Cython. But I'd rather not have to do that.
So: is there any way to do an 'inplace add' in Pandas?
Update
In response to comments below, here is a toy example of our meter data and the sum we want. All values are watts.
time meter1 meter2 meter3 sum
09:00:00 10 10
09:00:06 10 20 30
09:00:12 10 20 30
09:00:18 10 20 30 50
09:00:24 10 20 30 50
09:00:30 10 30 40
If you want to see more details then here's the file format description of our data logger, and here's the 4TByte archive of our entire dataset.
After messing around a lot with multiprocessing, I think I've found a fairly simple and efficient way to do an in-place add without using multiprocessing:
import numpy as np
import pandas as pd

LENGTH = 2**26
N = 10
DTYPE = np.int

# Allocate memory *once* for a Series which will hold our cumulator
cumulator = pd.Series(0, index=np.arange(0, N+LENGTH), dtype=DTYPE)

# Get a numpy array from the Series' buffer
cumulator_arr = np.frombuffer(cumulator.data, dtype=DTYPE)

# Create lots of dummy data. Each new_entry has a different start
# and end index.
for i in range(N):
    new_entry = pd.Series(1, index=np.arange(i, LENGTH+i), dtype=DTYPE)
    aligned_new_entry = np.pad(new_entry.values, pad_width=((i, N-i)),
                               mode='constant', constant_values=((0, 0)))
    # np.pad could be replaced by new_entry.reindex(index, fill_value=0)
    # but np.pad is faster and more memory efficient than reindex
    del new_entry
    np.add(cumulator_arr, aligned_new_entry, out=cumulator_arr)
    del aligned_new_entry

del cumulator_arr
print cumulator.head(N*2)
which prints:
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 10
11 10
12 10
13 10
14 10
15 10
16 10
17 10
18 10
19 10
Assuming that your dataframe looks something like:
df.index.names == ['time']
df.columns == ['meter1', 'meter2', ..., 'meterN']
then all you need to do is:
df['total'] = df.fillna(0).sum(axis=1)
(Note that fillna(0, inplace=True) returns None, so it cannot be chained with .sum().)
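For completeness, a minimal sketch of getting separate per-meter readings into that wide layout before summing, assuming each meter is available as a pandas Series indexed by time (the meter1/meter2/meter3 names are placeholders):
import pandas as pd

# meter1, meter2, meter3 are Series indexed by timestamp, possibly with
# different start and end times; concat aligns them on the union of the indices.
df = pd.concat({'meter1': meter1, 'meter2': meter2, 'meter3': meter3}, axis=1)

# Missing readings become NaN after alignment; treat them as 0 W and sum row-wise.
df['total'] = df.fillna(0).sum(axis=1)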
