Plotting a large text file containing a matrix with gnuplot/matplotlib - python

For debugging purposes my program writes out Armadillo-based matrices in a raw ASCII format into text files, i.e. complex numbers are written as (1, 1). The resulting files are larger than 3 GByte.
I would like to "plot" those matrices (representing fields) such that I can look at different points within the field for debugging. What would be the best way of doing that?
When directly plotting my file with gnuplot using
plot "matrix_file.txt" matrix with image
I get the response
warning: matrix contains missing or undefined values
Warning: empty cb range [0:0], adjusting to [-1:1]
I could also use matplotlib, iterating over each row in the file and converting the values into appropriate Python values, but I assume reading the full file that way will be rather time-consuming.
Thus, are there other reasonably fast options for plotting my matrix, or is there a way to tell gnuplot how to treat my complex numbers properly?
A part of the first line looks like
(0.0000000000000000e+00,0.0000000000000000e+00) (8.6305562282169946e-07,6.0526580514090297e-07) (1.2822974500623326e-05,1.1477679031930141e-05) (5.8656372718492336e-05,6.6626342814082442e-05) (1.6183121649896915e-04,2.3519364967920469e-04) (3.2919257507746272e-04,6.2745022681547850e-04) (5.3056616247733281e-04,1.3949688132772061e-03) (6.7714688179733437e-04,2.7240206117506108e-03) (6.0083005524875425e-04,4.8217990806492588e-03) (3.6759450038482363e-05,7.8957232784174231e-03) (-1.3887302495780910e-03,1.2126758313515496e-02) (-4.1629396217170980e-03,1.7638346107957101e-02) (-8.8831593853181175e-03,2.4463072133103888e-02) (-1.6244140097742808e-02,3.2509486873735290e-02) (-2.7017231109227786e-02,4.1531431496659221e-02) (-4.2022691198292300e-02,5.1101686500864850e-02) (-6.2097364532786636e-02,6.0590740956970250e-02) (-8.8060067117896060e-02,6.9150058884242055e-02) (-1.2067637255414780e-01,7.5697648270160053e-02) (-1.6062285417043359e-01,7.8902435158400494e-02) (-2.0844826713055306e-01,7.7163461035715558e-02) (-2.6452596415873003e-01,6.8580842184681204e-02) (-3.2898869195273894e-01,5.0918234150147214e-02) (-4.0163477687695504e-01,2.1561405580661022e-02) (-4.8179470918233597e-01,-2.2515842273449008e-02) (-5.6815035401912617e-01,-8.4759639628930100e-02) (-6.5850621484774385e-01,-1.6899215347429869e-01) (-7.4952345707877654e-01,-2.7928561041518252e-01) (-8.3644196044174313e-01,-4.1972419090890900e-01) (-9.1283160402230334e-01,-5.9403043419268908e-01) (-9.7042844114238713e-01,-8.0504703287094281e-01) (-9.9912107865273936e-01,-1.0540865412492695e+00) (-9.8715384989307420e-01,-1.3401890190155983e+00) (-9.2160320921981831e-01,-1.6593576679224276e+00) (-7.8916051033438095e-01,-2.0038702251062159e+00) (-5.7721850912406181e-01,-2.3617835609973805e+00) (-2.7521347260072193e-01,-2.7167550691449942e+00)
Ideally, I would like to be able to choose if I plot only the real part, the imaginary part or the abs()-value.

Here is a gnuplot-only version.
Actually, I haven't yet seen a gnuplot example of how to plot complex numbers from a datafile.
The idea here is to split the data into columns at the characters ( and , and ) via:
set datafile separator '(,)'
Then you can address the i-th real and imaginary parts via column(3*i-1) and column(3*i), respectively.
This creates a new dataset by plotting the data many times in a double loop, which is fine for small data. My guess, however, is that this solution becomes pretty slow for large datasets, especially when plotting from a file. If you have your data in a datablock (instead of a file) it might be faster; check gnuplot: load datafile 1:1 into datablock. In general, it may be more efficient to use another tool, e.g. Python or awk, to prepare the data.
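For example, a small Python/matplotlib sketch for this file layout (whitespace-separated "(re,im)" pairs, one matrix row per line; the function and file names are just illustrative) might look like this:
import numpy as np
import matplotlib.pyplot as plt

def read_complex_matrix(path):
    # parse "(re,im) (re,im) ..." lines into a 2-D complex array
    rows = []
    with open(path) as fh:
        for line in fh:
            pairs = line.replace('(', '').split(')')
            rows.append([complex(*map(float, p.split(',')))
                         for p in pairs if p.strip()])
    return np.array(rows)

m = read_complex_matrix('matrix_file.txt')
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (title, part) in zip(axes, [('Re', m.real), ('Im', m.imag), ('abs', np.abs(m))]):
    ax.imshow(part, origin='lower')
    ax.set_title(title)
plt.show()
Note that this still reads the whole file into memory; for a >3 GB file you would probably want to read and downsample only every n-th line/value before plotting.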
Just a thought: if you have approx. 3e9 bytes of data and (according to your example) approx. 48-50 bytes per datapoint, and if you want to plot it as a square graph, then each side would be sqrt(3e9/50) ≈ 7746 pixels. I doubt you have a display which can show this all at once.
Edit:
The modified version below now uses set print to a datablock and is much faster than the original version (which used a double loop of plot ... every ...). The speed improvement is already visible with my small example data. Good luck with your huge dataset ;-).
Just for reference and comparison, the old version is listed here:
# create a new datablock with row,col,Real,Imag,Abs
# using plot ...with table (pretty slow and inefficient)
set table $Data2
set datafile separator '(,)' # now, split your data at these characters
myReal(i) = column(3*i-1)
myImag(i) = column(3*i)
myAbs(i) = sqrt(myReal(i)**2 + myImag(i)**2)
plot for [row=0:rowMax-1] for [col=1:colMax] $Data u (row):(col):(myReal(col)):(myImag(col)):(myAbs(col)) every ::row::row w table
set datafile separator whitespace # set separator back to whitespace
unset table
Code: (modified using set print)
### plotting complex numbers
reset session
$Data <<EOD
(0.1,0.1) (0.2,1.2) (0.3,2.3) (0.4,3.4) (0.5,4.5)
(1.1,0.1) (1.2,1.2) (1.3,2.3) (1.4,3.4) (1.5,4.5)
(2.1,0.1) (2.2,1.2) (2.3,2.3) (2.4,3.4) (2.5,4.5)
(3.1,0.1) (3.2,1.2) (3.3,2.3) (3.4,3.4) (3.5,4.5)
(4.1,0.1) (4.2,1.2) (4.3,2.3) (4.4,3.4) (4.5,4.5)
(5.1,0.1) (5.2,1.2) (5.3,2.3) (5.4,3.4) (5.5,4.5)
(6.1,0.1) (6.2,1.2) (6.3,2.3) (6.4,3.4) (6.5,4.5)
(7.1,0.1) (7.2,1.2) (7.3,2.3) (7.4,3.4) (7.5,4.5)
EOD
stats $Data u 0 nooutput # get number of columns and rows, separator is whitespace
colMax = STATS_columns
rowMax = STATS_records
# create a new datablock with row,col,Real,Imag,Abs
# using print to datablock
set print $Data2
myCmplx(row,col) = word($Data[row+1],col)
myReal(row,col) = (s=myCmplx(row,col),s[2:strstrt(s,',')-1])
myImag(row,col) = (s=myCmplx(row,col),s[strstrt(s,',')+1:strlen(s)-1])
myAbs(row,col) = sqrt(myReal(row,col)**2 + myImag(row,col)**2)
do for [row=0:rowMax-1] {
    do for [col=1:colMax] {
        print sprintf("%d %d %s %s %g", row, col, myReal(row,col), myImag(row,col), myAbs(row,col))
    }
}
set print
set key box opaque
set multiplot layout 2,2
plot $Data2 u 1:2:3 w image ti "Real part"
plot $Data2 u 1:2:4 w image ti "Imaginary part"
set origin 0.25,0
plot $Data2 u 1:2:5 w image ti "Absolute value"
unset multiplot
### end of code
Result:

Maybe this is not what you asked for, but I think it is neat to plot directly from your code, and it is simple to change what you want to show (abs(x), real(x), ...). Here is a simple snippet to plot an Armadillo matrix as an image in gnuplot (Linux):
#include <armadillo>
using namespace std;
using namespace arma;
void plot_image(mat& x, FILE* cmd_pipe)
{
    fputs("set nokey; set yrange [*:*] reverse\n", cmd_pipe);
    fputs("plot '-' matrix with image\n", cmd_pipe);
    for(uword r=0; r<x.n_rows; r++){
        for(uword c=0; c<x.n_cols; c++){
            string str = to_string(x(r,c)) + " ";
            fputs(str.c_str(), cmd_pipe);
        }
        fputs("\n", cmd_pipe);
    }
    fputs("e\n", cmd_pipe);
}
int main()
{
    FILE* gnuplot_pipe = popen("gnuplot -persist", "w");
    mat x = {{1,2,3,4,5},
             {2,2,3,4,5},
             {3,3,3,4,5},
             {4,4,4,4,5},
             {5,5,9,9,9}};
    plot_image(x, gnuplot_pipe);
    pclose(gnuplot_pipe);  // flush the stream and close the pipe
    return 0;
}
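To try it out, compiling with something like g++ plot_image.cpp -o plot_image -larmadillo should work on Linux, assuming Armadillo and gnuplot are installed (the file name and exact flags are just an example).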
The output is:

Related

Data transfer problems to an array and slowness when access data compared to matlab

I'm trying to port code from MATLAB to Python; my main problem is reading the file and transposing the data into arrays.
In matlab:
[filename,pathname,~] = uigetfile('*.out');
data{1} = importdata(fullfile(pathname,filename), '\t', 8);
unit = data{1}.colheaders;
title = strsplit(char(data{1}.textdata(7,1)));
In python:
import tkinter.filedialog
import numpy as np
def openfile():
    file_path = tkinter.filedialog.askopenfile(mode='r', filetypes=[('', '.out')])
    data = np.loadtxt(file_path, delimiter='\t', skiprows=8)
    nrows, ncols = np.shape(data)
    return data, nrows, ncols

data, nrows, ncols = openfile()
print(data[0:5][0])
But when I try to access the first column (the time vector) and then print it, I get a row instead. Even if I invert the indices from [0:5][0] to [0][0:5] I get a similar result.
Another problem is that reading the file takes much longer than in MATLAB.
Below is a sample of the data I'm trying to access in Python.
#
Predictions were generated on 07-Jun-2021 at 07:36:56 using OpenFAST, compiled as a 64-bit application using double precision at commit v2.5.0
linked with NWTC Subroutine Library; ElastoDyn; InflowWind; AeroDyn; ServoDyn; HydroDyn; MoorDyn (v1.01.02F, 8-Apr-2016)
Description from the FAST input file: IEA 15 MW offshore reference model on UMaine VolturnUS-S semi-submersible floating platform
Time NcIMUTVxs NcIMUTVys NcIMUTVzs NcIMUTAxs NcIMUTAys NcIMUTAzs NcIMURVxs NcIMURVys NcIMURVzs NcIMURAxs NcIMURAys NcIMURAzs
(s) (m/s) (m/s) (m/s) (m/s^2) (m/s^2) (m/s^2) (deg/s) (deg/s) (deg/s) (deg/s^2) (deg/s^2) (deg/s^2)
0.0000 0.000E+00 0.000E+00 0.000E+00 -7.319E-01 -3.911E-01 -1.344E+00 0.000E+00 0.000E+00 0.000E+00 4.008E+00 -1.493E+01 4.163E-01
0.0250 -1.818E-02 -9.621E-03 -3.261E-02 -6.358E-01 -3.754E-01 -1.210E+00 9.613E-02 -3.609E-01 9.976E-03 3.542E+00 -1.345E+01 3.672E-01
0.0500 -3.140E-02 -1.845E-02 -5.898E-02 -5.513E-01 -3.181E-01 -9.064E-01 1.709E-01 -6.537E-01 1.772E-02 2.361E+00 -9.933E+00 2.434E-01
0.0750 -4.459E-02 -2.540E-02 -7.653E-02 -3.923E-01 -2.385E-01 -4.594E-01 2.103E-01 -8.428E-01 2.174E-02 7.456E-01 -4.845E+00 7.446E-02
0.1000 -5.177E-02 -3.032E-02 -8.156E-02 -2.350E-01 -1.594E-01 5.288E-02 2.078E-01 -8.920E-01 2.140E-02 -9.449E-01 9.618E-01 -1.022E-01
numpy.loadtxt is, in general, not very efficient (numpy's save/load work best with binary formats). Plus, your code as-is doesn't work for me, because the delimiter is not really a tab but rather multiple spaces, and I don't think that's supported by numpy.
In your position I would either use raw python (and then convert to numpy array) or pandas (probably slower but more robust).
Ignoring the tkinter part and just supposing the file name to be data.txt, the first solution would look like:
import numpy as np

data = []
with open('data.txt') as fp:
    for i, line in enumerate(fp):
        if i >= 8:
            data.append([float(x) for x in line.split()])
data = np.asarray(data)
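As a side note on the indexing problem from the question: with a 2-D numpy array, rows and columns are selected in a single subscript, so:
time_vec = data[:, 0]   # the whole first column (the time vector)
head = data[0:5, 0]     # first five entries of that column
# data[0:5][0] first slices rows 0..4 and then takes the first *row*
# of that slice, which is why a whole line is printed instead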
The second solution with pandas would be:
import pandas as pd
df = pd.read_csv('data.txt', skiprows=7, delimiter=' ', skipinitialspace=True)
data = df.values
The results are equivalent, but slightly different: Python's split() automatically trims whitespace at the beginning and end of a line, and it treats any run of whitespace (one space, multiple spaces, a tab, etc.) as a single separator. The conversion to float works for the example you provided, and all of the first 8 rows are skipped. The pandas version also handles multiple spaces, but I don't think it would work with tabs, and we have to explicitly tell it to ignore the whitespace at the beginning of each line. We also skip only 7 lines there, not 8, because pandas by default expects the column names in the first row it reads. So in this particular case we get a dataframe with the column names
['(s)', '(m/s)', '(m/s).1', '(m/s).2', '(m/s^2)', '(m/s^2).1',
'(m/s^2).2', '(deg/s)', '(deg/s).1', '(deg/s).2', '(deg/s^2)',
'(deg/s^2).1', '(deg/s^2).2']
But that doesn't matter anyway, because when we take .values in the end, only numeric values are kept.
Perhaps the more important difference is this: if there is an invalid value somewhere (say, a string), the plain-Python code raises an exception when trying to convert it to float, while the pandas solution happily accepts it, creates a column of "object" type (i.e. "anything" type), and does not convert even the valid entries of that column to float.
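A tiny illustration of that pandas behaviour (made-up two-column data, just to show the dtype difference):
import pandas as pd
from io import StringIO

good = pd.read_csv(StringIO("a b\n1.0 2.0\n3.0 4.0"), delimiter=' ')
bad = pd.read_csv(StringIO("a b\n1.0 2.0\noops 4.0"), delimiter=' ')
print(good['a'].dtype)  # float64
print(bad['a'].dtype)   # object -- the single bad entry keeps the whole column unconverted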

Converting pixels into wavelength using 2 FITS files

I am new to Python and FITS image files, and as such I am running into issues. I have two FITS files; the first FITS file is pixels/counts and the second FITS file (calibration file) is pixels/wavelength. I need to convert pixels/counts into wavelength/counts. Once this is done, I need to output wavelength/counts as a new FITS file for further analysis. So far I have managed to read the required data into arrays, as shown in the code below.
import numpy as np
from astropy.io import fits
# read the images
image_file = ("run_1.fits")
image_calibration = ("cali_1.fits")
hdr = fits.getheader(image_file)
hdr_c = fits.getheader(image_calibration)
# print headers
sp = fits.open(image_file)
print('\n\nHeader of the spectrum :\n\n', sp[0].header, '\n\n')
sp_c = fits.open(image_calibration)
print('\n\nHeader of the spectrum :\n\n', sp_c[0].header, '\n\n')
# generation of arrays with the wavelengths and counts
count = np.array(sp[0].data)
wave = np.array(sp_c[0].data)
I do not understand how to save two separate arrays into one FITS file. I tried an alternative approach by creating a list, as shown in this code:
file_list = fits.open(image_file)
calibration_list = fits.open(image_calibration)
image_data = file_list[0].data
calibration_data = calibration_list[0].data
# make a list to hold images
img_list = []
img_list.append(image_data)
img_list.append(calibration_data)
# list to numpy array
img_array = np.array(img_list)
# save the array as fits - image cube
fits.writeto('mycube.fits', img_array)
However I could only save as a cube, which is not correct because I just need wavelength and counts data. Also, I lost all the headers in the newly created FITS file. To say I am lost is an understatement! Could someone point me in the right direction please? Thank you.
I am still working on this problem. I have now managed (I think) to produce a FITS file containing the wavelength and counts using this website:
https://www.mubdirahman.com/assets/lecture-3---numerical-manipulation-ii.pdf
This is my code:
# Making a Primary HDU (required):
primaryhdu = fits.PrimaryHDU(flux)  # makes a default header
# or, if you have a header you've created: primaryhdu = fits.PrimaryHDU(arr1, header=head1)
# If you have additional extensions:
secondhdu = fits.ImageHDU(wave)
# Making a new HDU List:
hdulist1 = fits.HDUList([primaryhdu, secondhdu])
# Writing the file:
hdulist1.writeto("filename.fits", overwrite=True)
image = ("filename.fits")
hdr = fits.open(image)
image_data = hdr[0].data
wave_data = hdr[1].data
I am sure this is not the correct format for wavelength/counts. I need both wavelength and counts to be contained in hdr[0].data
If you are working with spectral data, it might be useful to look into specutils which is designed for common tasks associated with reading/writing/manipulating spectra.
It's common to store spectral data in FITS files using tables, rather than images. For example you can create a table containing wavelength, flux, and counts columns, and include the associated units in the column metadata.
The docs include an example on how to create a generic "FITS table" writer with wavelength and flux columns. You could start from this example and modify it to suit your exact needs (which can vary quite a bit from case to case, which is probably why a "generic" FITS writer is not built-in).
You might also be able to use the fits-wcs1d format.
If you prefer not to use specutils, that example still might be useful as it demonstrates how to create an Astropy Table from your data and output it to a well-formatted FITS file.
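If you prefer to assemble the file by hand, a minimal sketch with an Astropy Table (the column names and the unit are assumptions, adjust them to your data) could be:
from astropy.table import Table
import astropy.units as u

# wave and count are the 1-D arrays from the question
t = Table([wave, count], names=('wavelength', 'counts'))
t['wavelength'].unit = u.AA  # assumed unit
t.write('spectrum.fits', format='fits', overwrite=True)

# reading it back: the table ends up in HDU 1 (HDU 0 is an empty primary HDU)
t2 = Table.read('spectrum.fits')
print(t2['wavelength'], t2['counts'])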

How to read and extract values from a binary file using python code?

I am relatively new to Python. As part of my astronomy project work, I have to deal with binary files (which of course is also new to me). I was given a binary file and Python code which reads data from it. I was then asked by my professor to understand how the code works on the binary file. I spent a couple of days trying to figure it out, but nothing helped. Can anyone here help me with the code?
# Read the binary opacity file
f = open(file, "rb")  # binary mode: np.fromfile() reads raw bytes
# read file dimension sizes
a = np.fromfile(f, dtype=np.int32, count=16)
NX, NY, NZ = a[1], a[4], a[7]
# read the time and time step
time, time_step = np.fromfile(f, dtype=np.float64, count=2)
# number of iterations
nite = np.fromfile(f, dtype=np.int32, count=1)
# radius array
trash = np.fromfile(f, dtype=np.float64, count=1)
rad = np.fromfile(f, dtype=np.float64, count=a[1])
# phi array
trash = np.fromfile(f, dtype=np.float64, count=1)
phi = np.fromfile(f, dtype=np.float64, count=a[4])
# close the file
f.close()
The binary file, as far as I know, contains several parameters (e.g. radius, phi, sound speed, radiation energy) and their many values. The above code extracts the values of two parameters, radius and phi, from the binary file. Both radius and phi have more than 100 values. The program works, but I am not able to understand how. Any help would be appreciated.
The binary file is essentially just a long list of continuous data; you need to tell np.fromfile() both where to look and what type of data to expect.
Perhaps it's easiest to understand if you create your own file:
import numpy as np
with open('numpy_testfile', 'wb') as f:
    ## we create a "header" line, which collects the lengths of all relevant arrays
    ## you can then use this header line to tell np.fromfile() *how long* the arrays are
    dimensions = np.array([0,10,0,0,10,0,3,10], dtype=np.int32)
    dimensions.tofile(f)  ## write to file
    a = np.arange(0,10,1)  ## some fake data, length 10
    a.tofile(f)  ## write to file
    print(a.dtype)
    b = np.arange(30,40,1)  ## more fake data, length 10
    b.tofile(f)  ## write to file
    print(b.dtype)
    ## more interesting data, this time of type float, length 3
    c = np.array([3.14,4.22,55.0], dtype=np.float64)
    c.tofile(f)  ## write to file
    print(c.dtype)
    a.tofile(f)  ## just for fun, let's write "a" again
with open('numpy_testfile', 'rb') as f:
    ## what's important to know about this step is that numpy is "seeking" the
    ## file automatically, i.e. it considers the first count=8 items, then the
    ## next count=10, and so on, as "continuous data"
    dim = np.fromfile(f, dtype=np.int32, count=8)
    print(dim)  ## our header line: [ 0 10  0  0 10  0  3 10]
    a = np.fromfile(f, dtype=np.int64, count=dim[1])  ## read the dim[1]=10 numbers
    b = np.fromfile(f, dtype=np.int64, count=dim[4])  ## and the next 10
    ## now it's dim[6]=3 values, and the dtype is float64
    c = np.fromfile(f, dtype=np.float64, count=dim[6])
    ## read "the rest", unspecified length; let's hope it's all int64 actually!
    d = np.fromfile(f, dtype=np.int64)
    print(a)
    print(b)
    print(c)
    print(d)
Addendum: the numpy documentation is quite explicit when it comes to discouraging the use of np.tofile() and np.fromfile():
Do not rely on the combination of tofile and fromfile for data storage, as the binary files generated are not platform independent. In particular, no byte-order or data-type information is saved. Data can be stored in the platform independent .npy format using save and load instead.
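For completeness, a minimal sketch of that recommended route:
import numpy as np

a = np.arange(10)
np.save('a.npy', a)       # dtype and shape are stored in the file header
a2 = np.load('a.npy')     # no count/dtype bookkeeping needed on read
np.savez('fields.npz', rad=a, phi=a * 0.1)  # several named arrays in one file
data = np.load('fields.npz')
print(data['rad'], data['phi'])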
Personal side note: if you spent a couple of days understanding this code, don't feel discouraged about learning Python; we all start somewhere. I'd suggest being honest with your professor about the obstacles you've hit (if this comes up in conversation), as she/he should be able to correctly assess "where you're at" when it comes to programming. :-)
from astropy.io import ascii
data = ascii.read('/directory/filename')
column1data = data['name_of_column_1']
column2data = data['name_of_column_2']
# etc.
column1data is now an array of all the values under that header.
I use this method to import SourceExtractor .dat files, which are in ASCII format.
I believe this is a more elegant way to import data from ASCII files.

appending an index to laspy file (.las)

I have two files, one an esri shapefile (.shp), the other a point cloud (.las).
Using laspy and shapefile modules I've managed to find which points of the .las file fall within specific polygons of the shapefile. What I now wish to do is to add an index number that enables identification between the two datasets. So e.g. all points that fall within polygon 231 should get number 231.
The problem is that so far I'm unable to append anything to the list of points when writing the .las file. The piece of code where I'm trying to do this is:
outFile1 = laspy.file.File("laswrite2.las", mode = "w",header = inFile.header)
outFile1.points = truepoints
outFile1.points.append(indexfromshp)
outFile1.close()
The error I'm getting now is: AttributeError: 'numpy.ndarray' object has no attribute 'append'. I've tried multiple things already including np.append but I'm really at a loss here as to how to add anything to the las file.
Any help is much appreciated!
There are several ways to do this.
LAS files have a classification field; you could store the indexes in this field:
las_file = laspy.file.File("las.las", mode="rw")
las_file.classification = indexfromshp
However, if the LAS file has version <= 1.2, the classification field can only store values in the range [0, 31], but you can use the 'user_data' field, which can hold values in the range [0, 255].
Or, if you need to store values higher than 255 or you need a separate field, you can define a new dimension (see laspy's doc on how to add extra dimensions).
Your code should then be close to something like this:
outFile1 = laspy.file.File("laswrite2.las", mode="w", header=inFile.header)
# copy fields
for dimension in inFile.point_format:
    dat = inFile.reader.get_dimension(dimension.name)
    outFile1.writer.set_dimension(dimension.name, dat)
outFile1.define_new_dimension(
    name="index_from_shape",
    data_type=7,  # 7 = uint64_t
    description="Index of corresponding polygon from shape file"
)
outFile1.index_from_shape = indexfromshp
outFile1.close()
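If that works, reading the index back should just mirror the write above (untested sketch):
las = laspy.file.File("laswrite2.las", mode="r")
idx = las.index_from_shape  # one index per point, as a numpy array
las.close()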

Converting an imperative algorithm into functional style

I wrote a simple procedure to calculate the average of the test coverage of some specific packages in a Java project. The raw data in a huge html file is like this:
<body>
package pkg1 <line_coverage>11/111,<branch_coverage>44/444<end>
package pkg2 <line_coverage>22/222,<branch_coverage>55/555<end>
package pkg3 <line_coverage>33/333,<branch_coverage>66/666<end>
...
</body>
Given the specified packages "pkg1" and "pkg3", for example, the average line coverage is:
(11+33)/(111+333)
and average branch coverage is:
(44+66)/(444+666)
I wrote the following procedure to get the result and it works well. But how would one implement this calculation in a functional style? Something like "(x,y) for x in ... for b in ... if ...". I know a little Erlang, Haskell and Clojure, so solutions in these languages are also appreciated. Thanks a lot!
from __future__ import division
import re
datafile = ('abc', 'd>11/23d>34/89d', 'e>25/65e>13/25e', 'f>36/92f>19/76')
core_pkgs = ('d', 'f')
covered_lines, total_lines, covered_branches, total_branches = 0, 0, 0, 0
for line in datafile:
for pkg in core_pkgs:
ptn = re.compile('.*'+pkg+'.*'+'>(\d+)/(\d+).*>(\d+)/(\d+).*')
match = ptn.match(line)
if match is not None:
cvln, tlln, cvbh, tlbh = match.groups()
covered_lines += int(cvln)
total_lines += int(tlln)
covered_branches += int(cvbh)
total_branches += int(tlbh)
print 'Line coverage:', '{:.2%}'.format(covered_lines / total_lines)
print 'Branch coverage:', '{:.2%}'.format(covered_branches/total_branches)
Down below you can find my Haskell solution. I will try to explain the important points I went through as I wrote it.
First you will find that I created a data structure for coverage data. It's generally a good idea to create data structures to represent whatever data you want to handle. This is in part because it makes it easier to design your code when you can think in terms of whatever you are designing – closely related to functional programming philosophies, and in part because it can eliminate a few bugs where you think you are doing something but are in actuality doing something else.
Related to the point before: The first thing I do is to convert the string-represented data into my own data structure. When you are doing functional programming, you are often doing things in "sweeps." You don't have a single function that converts data to your format, filters out the unwanted data and summarises the result. You have three different functions for each of those tasks, and you do them one at a time!
This is because functions are very composable, i.e. if you have three different ones, you can stick them together to form a single one if you want to. If you start with a single one, it is very difficult to take it apart to form three different ones.
The actual workings of the conversion function are quite uninteresting unless you are specifically doing Haskell. All it does is try to match each string against a regex, and if that succeeds, it adds the coverage data to the resulting list.
Again, mad composition is about to happen. I don't create a function to loop over a list of coverages and sum them up. I create a single function to sum two coverages, because I know I can use it together with the specialised fold loop (which is sort of like a for loop on steroids) to summarise all coverages in a list. There's no need for me to reinvent the wheel and create a loop myself.
Besides, my sumCoverage function works with a lot of specialised loops, so I don't have to write a ton of functions; I just stick my single function into a ton of pre-made library functions!
In the main function you will see what I mean by programming in "sweeps" or "passes" over the data. First I convert it to the internal format, then I filter out the unwanted data, then I summarise the remaining data. These are completely independent computations. That's functional programming.
You will also notice that I use two specialised loops there, filter and fold. This means that I don't have to write any loops myself, I just stick in a function to those standard library loops and let those take it from there.
import Data.Maybe (catMaybes)
import Data.List (foldl')
import Text.Printf (printf)
import Text.Regex (matchRegex, mkRegex)
corePkgs = ["d", "f"]
stats = [
    "d>11/23d>34/89d",
    "e>25/65e>13/25e",
    "f>36/92f>19/76"
    ]
format = mkRegex ".*(\\w+).*>([0-9]+)/([0-9]+).*>([0-9]+)/([0-9]+).*"
-- It might be a good idea to define a datatype for coverage data.
-- A bit of coverage data is defined as the name of the package it
-- came from, the lines covered, the total amount of lines, the
-- branches covered and the total amount of branches.
data Coverage = Coverage String Int Int Int Int
-- Then we need a way to convert the string data into a list of
-- coverage data. We do this by regex. We try to match on each
-- string in the list, and then we choose to keep only the successful
-- matches. Returned is a list of coverage data that was represented
-- by the strings.
convert :: [String] -> [Coverage]
convert = catMaybes . map match
  where match line = do
          [name, cl, tl, cb, tb] <- matchRegex format line
          return $ Coverage name (read cl) (read tl) (read cb) (read tb)
-- We need a way to summarise two coverage data bits. This can of course also
-- be used to summarise entire lists of coverage data, by folding over it.
sumCoverage (Coverage nameA clA tlA cbA tbA) (Coverage nameB clB tlB cbB tbB) =
    Coverage (nameA ++ nameB ++ ",") (clA + clB) (tlA + tlB) (cbA + cbB) (tbA + tbB)
main = do
  -- First we need to convert the strings to coverage data
  let coverageData = convert stats
      -- Then we want to filter out only the relevant data
      relevantData = filter (\(Coverage name _ _ _ _) -> name `elem` corePkgs) coverageData
      -- Then we need to summarise it, but we are only interested in the numbers
      Coverage _ cl tl cb tb = foldl' sumCoverage (Coverage "" 0 0 0 0) relevantData
  -- So we can finally print them!
  printf "Line coverage: %.2f\n" (fromIntegral cl / fromIntegral tl :: Double)
  printf "Branch coverage: %.2f\n" (fromIntegral cb / fromIntegral tb :: Double)
Here are some quickly-hacked, untested ideas applied to your code:
import numpy as np
import re

datafile = ('abc', 'd>11/23d>34/89d', 'e>25/65e>13/25e', 'f>36/92f>19/76')
core_pkgs = ('d', 'f')
covered_lines, total_lines, covered_branches, total_branches = 0, 0, 0, 0
for pkg in core_pkgs:
    ptn = re.compile(r'.*' + pkg + r'.*>(\d+)/(\d+).*>(\d+)/(\d+).*')
    matches = map(ptn.match, datafile)
    statsList = [map(int, match.groups()) for match in matches if match]
    # statsList is a list of [cvln, tlln, cvbh, tlbh]
    stats = np.array(statsList)
    covered_lines, total_lines, covered_branches, total_branches = stats.sum(axis=0)
Well, as you can see I haven't bothered to finish off the remaining loop, but I think the point is made by now. There's certainly more than one way to do this; I elected to show off map() (which some will say makes this less efficient, and it probably does), as well as NumPy to get the (admittedly light) math done.
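For comparison, here is a sketch of a more complete functional-style version in plain Python 3 (3.8+ for the assignment expression), keeping the sample data from the question:
import re

datafile = ('abc', 'd>11/23d>34/89d', 'e>25/65e>13/25e', 'f>36/92f>19/76')
core_pkgs = ('d', 'f')

def extract(line, pkg):
    # four integers (cvln, tlln, cvbh, tlbh), or None if the package is absent
    m = re.match(r'.*' + pkg + r'.*>(\d+)/(\d+).*>(\d+)/(\d+).*', line)
    return tuple(map(int, m.groups())) if m else None

# sweep 1: extract and filter out non-matches; sweep 2: sum column-wise
rows = [s for line in datafile for pkg in core_pkgs if (s := extract(line, pkg))]
cl, tl, cb, tb = (sum(col) for col in zip(*rows))
print('Line coverage: {:.2%}'.format(cl / tl))
print('Branch coverage: {:.2%}'.format(cb / tb))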
This is the corresponding Clojure solution:
(defn extract-data
  "extract 4 integers from a string line according to a package name"
  [pkg line]
  (map read-string
       (rest (first
              (re-seq
               (re-pattern
                (str pkg ".*>(\\d+)/(\\d+).*>(\\d+)/(\\d+)"))
               line)))))

(defn scan-lines-by-pkg
  "scan all string lines and extract all data as integer sequences
   according to package names"
  [pkgs lines]
  (filter seq (for [pkg pkgs
                    line lines]
                (extract-data pkg line))))

(defn sum-data
  "add all data in valid lines together"
  [pkgs lines]
  (apply map + (scan-lines-by-pkg pkgs lines)))

(defn get-percent
  [covered all]
  (str (format "%.2f" (float (/ (* covered 100) all))) "%"))

(defn get-cov
  [pkgs lines]
  {:line-cov (apply get-percent (take 2 (sum-data pkgs lines)))
   :branch-cov (apply get-percent (drop 2 (sum-data pkgs lines)))})

(get-cov ["d" "f"] ["abc" "d>11/23d>34/89d" "e>25/65e>13/25e" "f>36/92f>19/76"])
