Detect and alter strings in PDFs - python

I want to be able to detect a pattern in a PDF and somehow flag it.
For instance, in this PDF, there's the string *2. I want to be able to parse the PDF, detect all instances of *[integer], and do something to call attention to the matches (like highlight them yellow or add a symbol in the margin).
I would prefer to do this in Python, but I'm open to other languages. So far, I've been able to use pyPdf to read the PDF's text. I can use a regex to detect the pattern. But I haven't been able to figure out how to flag the match and re-save the PDF.

Either people are not interested, or Python's not capable, so here's a solution in Perl :-). Seriously, as noted above, you don't need to "alter strings": PDF annotations are the solution for you. I had a small project with annotations not long ago, and some of this code comes from there. But my content parser was not universal, and you don't need full-blown parsing (meaning the ability to alter content and write it back), so I resorted to an external tool. The PDF library I use is somewhat low-level, but I don't mind. It also means you're expected to have proper knowledge of PDF internals to understand what's going on. Otherwise, just use the tool.
Here's an example of marking, say, all gerunds in the OP's file with a command:
perl pdf_hl.pl -f westlaw.pdf -p '\S*ing'
The code (the comment inside is worth reading, too):
use strict;
use warnings;
use XML::Simple;
use CAM::PDF;
use Getopt::Long;
use Regexp::Assemble;
#####################################################################
#
# This is PDF highlight mark-up tool.
# Though fully functional, it's still a prototype proof-of-concept.
# Please don't feed it with non-pdf files or patterns like '\d*'
# (because you probably want '\d+', don't you?).
#
# Requires muPDF-tools installed and in the PATH, plus some CPAN modules.
#
# ToDo:
# - error handling is primitive if any.
# - cropped files (CropBox) are processed incorrectly. Fix it.
# - of course there can be other useful parameters.
# - allow loading them from file.
# - allow searching across lines (e.g. for multi-word patterns)
# and certainly across "spans" within a line (see mudraw output).
# - multi-color mark-up, not just yellow.
# - control over output file name.
# - compress output (use cleanoutput method instead of output,
# plus more robust (think compressed object streams) compressors
# may be useful).
# - file list processing.
# - annotations are not just colorful marks on the page, their
# dictionaries can contain all sorts of useful information, which may
# be extracted automatically further up the food chain i.e. by
# whoever consumes these files (date, time, author, comments, actual
# text below, etc., etc., plus think of customized appearance streams,
# placing them on layers, etc.).
# - ???
#
# Most complexity in the code comes from adding appearance
# dictionary (AP). You can safely delete it, because most viewers don't
# need AP for standard annotations. Ironically, muPDF-viewer wants it
# (otherwise highlight placement is not 100% correct), and since I relied
on muPDF-tools, I thought it proper to create PDFs consumable by
# their viewer... Firefox wants AP too, btw.
#
#####################################################################
my ($file, $csv);
my ($c_flag, $w_flag) = (0, 1);
GetOptions('-f=s' => \$file, '-p=s' => \$csv,
           '-c!'  => \$c_flag, '-w!' => \$w_flag)
    and defined($file)
    and defined($csv)
    or die "\nUsage: perl $0 -f FILE -p LIST -c -w\n\n",
           "\t-f\t\tFILE\t PDF file to annotate\n",
           "\t-p\t\tLIST\t comma-separated patterns\n",
           "\t-c or -noc\t\t be case sensitive (default = no)\n",
           "\t-w or -now\t\t whole words only (default = yes)\n";

my $re = Regexp::Assemble->new
    ->add(split(',', $csv))
    ->anchor_word($w_flag)
    ->flags($c_flag ? '' : 'i')
    ->re;

my $xml  = qx/mudraw -ttt $file/;
my $tree = XMLin($xml, ForceArray => [qw/page block line span char/]);
my $pdf  = CAM::PDF->new($file);
sub __num_nodes_list {
    my $precision = shift;
    [ map {CAM::PDF::Node->new('number', sprintf("%.${precision}f", $_))} @_ ]
}
sub add_highlight {
    my ($idx, $x1, $y1, $x2, $y2) = @_;
    my $p = $pdf->getPage($idx);
    # mirror vertically to get to normal cartesian plane
    my ($X1, $Y1, $X2, $Y2) = $pdf->getPageDimensions($idx);
    ($x1, $y1, $x2, $y2) = ($X1 + $x1, $Y2 - $y2, $X1 + $x2, $Y2 - $y1);
    # corner radius
    my $r = 2;
    # AP appearance stream
    my $s = "/GS0 gs 1 1 0 rg 1 1 0 RG\n";
    $s .= "1 j @{[sprintf '%.0f', $r * 2]} w\n";
    $s .= "0 0 @{[sprintf '%.1f', $x2 - $x1]} ";
    $s .= "@{[sprintf '%.1f', $y2 - $y1]} re B\n";
    my $highlight = CAM::PDF::Node->new('dictionary', {
        Subtype => CAM::PDF::Node->new('label', 'Highlight'),
        Rect => CAM::PDF::Node->new('array',
            __num_nodes_list(1, $x1 - $r, $y1 - $r, $x2 + $r * 2, $y2 + $r * 2)),
        QuadPoints => CAM::PDF::Node->new('array',
            __num_nodes_list(1, $x1, $y2, $x2, $y2, $x1, $y1, $x2, $y1)),
        BS => CAM::PDF::Node->new('dictionary', {
            S => CAM::PDF::Node->new('label', 'S'),
            W => CAM::PDF::Node->new('number', 0),
        }),
        Border => CAM::PDF::Node->new('array',
            __num_nodes_list(0, 0, 0, 0)),
        C => CAM::PDF::Node->new('array',
            __num_nodes_list(0, 1, 1, 0)),
        AP => CAM::PDF::Node->new('dictionary', {
            N => CAM::PDF::Node->new('reference',
                $pdf->appendObject(undef,
                    CAM::PDF::Node->new('object',
                        CAM::PDF::Node->new('dictionary', {
                            Subtype => CAM::PDF::Node->new('label', 'Form'),
                            BBox => CAM::PDF::Node->new('array',
                                __num_nodes_list(1, -$r, -$r, $x2 - $x1 + $r * 2,
                                    $y2 - $y1 + $r * 2)),
                            Resources => CAM::PDF::Node->new('dictionary', {
                                ExtGState => CAM::PDF::Node->new('dictionary', {
                                    GS0 => CAM::PDF::Node->new('dictionary', {
                                        BM => CAM::PDF::Node->new('label', 'Multiply'),
                                    }),
                                }),
                            }),
                            StreamData => CAM::PDF::Node->new('stream', $s),
                            Length => CAM::PDF::Node->new('number', length $s),
                        }),
                    ),
                    0),
            ),
        }),
    });
    $p->{Annots} ||= CAM::PDF::Node->new('array', []);
    push @{$pdf->getValue($p->{Annots})}, $highlight;
    $pdf->{changes}->{$p->{Type}->{objnum}} = 1
}
my $page_index = 1;
for my $page (@{$tree->{page}}) {
    for my $block (@{$page->{block}}) {
        for my $line (@{$block->{line}}) {
            for my $span (@{$line->{span}}) {
                my $string = join '', map {$_->{c}} @{$span->{char}};
                while ($string =~ /$re/g) {
                    my ($x1, $y1) =
                        split ' ', $span->{char}->[$-[0]]->{bbox};
                    my (undef, undef, $x2, $y2) =
                        split ' ', $span->{char}->[$+[0] - 1]->{bbox};
                    add_highlight($page_index, $x1, $y1, $x2, $y2)
                }
            }
        }
    }
    $page_index ++
}
$pdf->output($file =~ s/(.{4}$)/++$1/r);
__END__
P.S. I tagged the question with 'perl', to maybe get some feedback (code corrections, etc.) from the community.

This is non-trivial. The problem is that PDF files are not meant to be "updated" at any granularity below a page. You basically have to parse the page, adjust the PostScript-like content stream, and then write it back out. I don't think pyPdf has support for doing what you want.
If "all" you want to do is add highlighting, you can probably just use the annotation dictionary. See the PDF specification for more information.
You might be able to do this using PyPDF2, but I haven't looked into it closely.
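To make the annotation-dictionary route concrete in Python: below is a minimal, hedged sketch assuming the modern pypdf API (the successor of PyPDF2). The coordinates are placeholders; a real run would take them from a text-extraction step that reports positions (e.g. mudraw, as in the Perl answer above).
from pypdf import PdfReader, PdfWriter
from pypdf.generic import (ArrayObject, DictionaryObject, FloatObject,
                           NameObject)

def make_highlight(x1, y1, x2, y2):
    # Bare /Highlight annotation dictionary (no /AP appearance stream).
    return DictionaryObject({
        NameObject("/Type"): NameObject("/Annot"),
        NameObject("/Subtype"): NameObject("/Highlight"),
        NameObject("/Rect"): ArrayObject(
            [FloatObject(v) for v in (x1, y1, x2, y2)]),
        # QuadPoints order: top-left, top-right, bottom-left, bottom-right.
        NameObject("/QuadPoints"): ArrayObject(
            [FloatObject(v) for v in (x1, y2, x2, y2, x1, y1, x2, y1)]),
        NameObject("/C"): ArrayObject(
            [FloatObject(v) for v in (1, 1, 0)]),  # yellow
    })

reader = PdfReader("westlaw.pdf")
writer = PdfWriter()
writer.append(reader)
# Placeholder coordinates; in practice they come from the match positions.
writer.add_annotation(page_number=0,
                      annotation=make_highlight(100, 700, 160, 712))
with open("westlaw_highlighted.pdf", "wb") as fh:
    writer.write(fh)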

Related

How to loop through a vector in TCL?

I am new to Tcl, and I couldn't figure out how to translate this code from Python to Tcl:
import numpy as np
g0 = 7.88e12
Eox = np.array([155473, 15573, 1553, 1557473, 5473, 473, 1573, 19553])
E1 = 0.55e6
m = 0.7
fot= 1
D = float(input("rad dose"))
Fy = np.array([(abs(Eox)/(abs(Eox)+E1))**m])
Not = np.array([g0*D*Fy*fot])
Not[Not>6.8e18] = 6.8e18
Nit = (1.7e4*D)+1e10
if Nit > 5e12:
    Nit = 5e12
print(Not)
print(Nit)
Tcl doesn't process lists of values the way numpy does (it's closer to standard Python in that regard) without an extension. I remember there being such an extension, but I've not used it and I can't remember the name right now. So I'll use standard Tcl. The closest analogue of a numpy array in standard Tcl is a list of values (though that's got more in common with a Python list or tuple than anything else).
I'll translate these four lines for you:
Eox = np.array([155473, 15573, 1553, 1557473, 5473, 473, 1573, 19553])
Fy = np.array([(abs(Eox)/(abs(Eox)+E1))**m])
Not = np.array([g0*D*Fy*fot])
Not[Not>6.8e18] = 6.8e18
They become:
set Eox {155473 15573 1553 1557473 5473 473 1573 19553}
set Fy [lmap EoxVal $Eox {expr {
    (abs($EoxVal) / (abs($EoxVal) + $E1)) ** $m
}}]
set Not [lmap FyVal $Fy {expr {
    $g0 * $D * $FyVal * $fot
}}]
set Not [lmap NotVal $Not {expr {
    ($NotVal > 6.8e18) ? 6.8e18 : $NotVal
}}]
The use of lmap is doing the same thing that numpy's doing behind the scenes for you on those arrays. The last two commands can be combined though:
set Not [lmap FyVal $Fy {expr {
    min($g0 * $D * $FyVal * $fot, 6.8e18)
}}]
(Also, I've split each of the lmap/expr commands over multiple lines for increased clarity. You can write more compact code, but it's harder to read. Clear code is a very good plan in any code you intend to keep around for longer than 5 minutes.)
The other lines of code are pure basic stuff. set, puts, gets and expr will do what you need. And the suspiciously obvious if.

how to intersect multiple files by several columns

I have spent a lot of time on this; any help would be appreciated.
I have two files, shown below. What I want to do is search for every item of f1_col1 and f1_col2 separately inside f2_col3; if an item exists, save it and add the f2_col1 value from its matching row as a new column in the new df.
f1:(two columns)
f1_col1,f1_col2
kctd,Malat1
Gas5,Snhg6
f2:(three columns)
f2_col1,f2_col2,f2_col3
chr7,snRNA,Gas5
chr1,protein_coding,Malat1
chr2,TEC,Snhg6
chr1,TEC,kctd
So based on the two files mentioned the desired output should be:
new_df:
f1_col1,f1_col2,f2_col1,f2_col1
kctd,Malat1,chr1,chr1
Gas5,Snhg6,chr7,chr2
note: f2_col2 is not important.
I do not have a strong programming background and found this very difficult. I have checked multiple sources but have not been able to develop a solution; any help is appreciated. Thanks
Based on 1 possible interpretation of your requirements and the 1 sunny-day example you provided where every key field always matches on every line, this MAY be what you're trying to do:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
    if ( FNR == 1 ) {
        hdr = $1
    }
    map[$3] = $1
    next
}
{ print $0, ( FNR>1 ? map[$1] OFS map[$2] : hdr OFS hdr ) }
$ awk -f tst.awk f2 f1
f1_col1,f1_col2,f2_col1,f2_col1
kctd,Malat1,chr1,chr1
Gas5,Snhg6,chr7,chr2
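If awk is unfamiliar, the same idea reads naturally in Python: build a dictionary from f2 (f2_col3 -> f2_col1), then look up both columns of each f1 row. Here is a hedged, minimal sketch using only the stdlib csv module (file names as in the question):
import csv

# Build the lookup table from f2: f2_col3 -> f2_col1.
with open('f2') as fh:
    rows = list(csv.reader(fh))
head, body = rows[0], rows[1:]
lookup = {r[2]: r[0] for r in body}

# Extend each f1 row with the looked-up f2_col1 values.
with open('f1') as fh:
    f1 = list(csv.reader(fh))
print(','.join(f1[0] + [head[0], head[0]]))
for r in f1[1:]:
    print(','.join(r + [lookup.get(r[0], ''), lookup.get(r[1], '')]))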

How to implement the slowly updating side inputs in python

I am attempting to implement the slowly updating global window side inputs example from the documentation in Python (the docs only give it in Java), and I am stuck on what the AfterProcessingTime.pastFirstElementInPane() equivalent in Python is. For the map I've done something like this:
class ApiKeys(beam.DoFn):
    def process(self, elm) -> Iterable[Dict[str, str]]:
        yield TimestampedValue(
            {"<api_key_1>": "<account_id_1>", "<api_key_2>": "<account_id_2>",},
            elm,
        )

map = beam.pvalue.AsSingleton(
    p
    | "trigger pipeline" >> beam.Create([None])
    | "define schedule"
    >> beam.Map(
        lambda _: (
            0,  # would be timestamp.Timestamp.now() in production
            20,  # would be timestamp.MAX_TIMESTAMP in production
            1,  # would be around 1 hour or so in production
        )
    )
    | "GenSequence"
    >> PeriodicSequence()
    | "ApplyWindowing"
    >> beam.WindowInto(
        beam.window.GlobalWindows(),
        trigger=Repeatedly(Always(), AfterProcessingTime(???)),
        accumulation_mode=AccumulationMode.DISCARDING,
    )
    | "api_keys" >> beam.ParDo(ApiKeys())
)
I am hoping to use this as a Dict[str, str] input to a downstream function that will have windows of 60 seconds, merging with this one that I hope to update on an hourly basis.
The point is to run this on google cloud dataflow (where we currently just re-release it to update the api_keys).
I've pasted the java example from the documentation below for convenience sake:
public static void sideInputPatterns() {
  // This pipeline uses View.asSingleton for a placeholder external service.
  // Run in debug mode to see the output.
  Pipeline p = Pipeline.create();

  // Create a side input that updates each second.
  PCollectionView<Map<String, String>> map =
      p.apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(5L)))
          .apply(
              Window.<Long>into(new GlobalWindows())
                  .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
                  .discardingFiredPanes())
          .apply(
              ParDo.of(
                  new DoFn<Long, Map<String, String>>() {
                    @ProcessElement
                    public void process(
                        @Element Long input, OutputReceiver<Map<String, String>> o) {
                      // Replace map with test data from the placeholder external service.
                      // Add external reads here.
                      o.output(PlaceholderExternalService.readTestData());
                    }
                  }))
          .apply(View.asSingleton());

  // Consume side input. GenerateSequence generates test data.
  // Use a real source (like PubSubIO or KafkaIO) in production.
  p.apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1L)))
      .apply(Window.into(FixedWindows.of(Duration.standardSeconds(1))))
      .apply(Sum.longsGlobally().withoutDefaults())
      .apply(
          ParDo.of(
              new DoFn<Long, KV<Long, Long>>() {
                @ProcessElement
                public void process(ProcessContext c) {
                  Map<String, String> keyMap = c.sideInput(map);
                  c.outputWithTimestamp(KV.of(1L, c.element()), Instant.now());
                  LOG.debug(
                      "Value is {}, key A is {}, and key B is {}.",
                      c.element(),
                      keyMap.get("Key_A"),
                      keyMap.get("Key_B"));
                }
              })
          .withSideInputs(map));
}

/** Placeholder class that represents an external service generating test data. */
public static class PlaceholderExternalService {
  public static Map<String, String> readTestData() {
    Map<String, String> map = new HashMap<>();
    Instant now = Instant.now();
    DateTimeFormatter dtf = DateTimeFormat.forPattern("HH:MM:SS");
    map.put("Key_A", now.minus(Duration.standardSeconds(30)).toString(dtf));
    map.put("Key_B", now.minus(Duration.standardSeconds(30)).toString());
    return map;
  }
}
Any ideas as to how to emulate this example would be enormously appreciated, I've spent literally days on this issue now :(
Update #2, based on @AlexanderMoraes
So, I've tried changing it according to my understanding of your suggestions:
main_window_size = 5
trigger_interval = 30

side_input = beam.pvalue.AsSingleton(
    p
    | "trigger pipeline" >> beam.Create([None])
    | "define schedule"
    >> beam.Map(
        lambda _: (
            0,  # timestamp.Timestamp.now().__float__(),
            60,  # timestamp.Timestamp.now().__float__() + 30.0,
            trigger_interval,  # fire_interval
        )
    )
    | "GenSequence" >> PeriodicSequence()
    | "api_keys" >> beam.ParDo(ApiKeys())
    | "window"
    >> beam.WindowInto(
        beam.window.GlobalWindows(),
        trigger=Repeatedly(AfterProcessingTime(window_size)),
        accumulation_mode=AccumulationMode.DISCARDING,
    )
)
But when combining this with another pipeline whose windowing is set to something smaller than trigger_interval, I am unable to use the dictionary as a singleton, because for some reason the elements are duplicated:
ValueError: PCollection of size 2 with more than one element accessed as a singleton view. First two elements encountered are "{'<api_key_1>': '<account_id_1>', '<api_key_2>': '<account_id_2>'}", "{'<api_key_1>': '<account_id_1>', '<api_key_2>': '<account_id_2>'}". [while running 'Pair with AccountIDs']
Is there some way to clarify that the singleton output should ignore whatever came before it?
The title of the question, "slowly updating side inputs", refers to a documentation pattern that already has a Python version of the code. However, the code you provided is from "updating global window side inputs", which only has a Java version. So I will address the second one.
You are not able to reproduce AfterProcessingTime.pastFirstElementInPane() within Python. That call is used to configure triggers, which determine when to emit the results of each window (referred to as a pane). In your case, this particular call creates a trigger that fires when the current processing time passes the processing time at which the trigger saw the first element in a pane (see here). In Python, the same effect is achieved using AfterWatermark and AfterProcessingTime().
Below are two pieces of code, one in Java and one in Python, so you can compare the two usages. Both examples set a time-based trigger which emits results one minute after the first element of the window has been processed. The accumulation mode is also set to not accumulate results (Java: discardingFiredPanes(); Python: accumulation_mode=AccumulationMode.DISCARDING).
1- Java:
PCollection<String> pc = ...;
pc.apply(Window.<String>into(FixedWindows.of(1, TimeUnit.MINUTES))
    .triggering(AfterProcessingTime.pastFirstElementInPane()
        .plusDelayOf(Duration.standardMinutes(1)))
    .discardingFiredPanes());
2- Python: the trigger configuration is the same as described in point 1
pcollection | WindowInto(
    FixedWindows(1 * 60),
    trigger=AfterProcessingTime(1 * 60),
    accumulation_mode=AccumulationMode.DISCARDING)
The examples above were taken from the documentation.
AsIter() instead of AsSingleton() worked for me.
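For illustration, here is a hedged sketch of that workaround (main_input and api_keys_pcoll are placeholder names, not from the answer above): with AsIter the side input arrives as an iterable of every pane fired so far, and the consumer picks one element itself, so the more-than-one-element error goes away. Note that the ordering of panes in the iterable is not guaranteed.
import apache_beam as beam

def pair_with_account(elm, api_key_maps):
    # AsIter yields one dict per fired pane; keep the last one seen.
    latest = None
    for keys in api_key_maps:
        latest = keys
    return elm, latest

paired = (
    main_input  # assumed upstream PCollection
    | "Pair with AccountIDs" >> beam.Map(
        pair_with_account,
        api_key_maps=beam.pvalue.AsIter(api_keys_pcoll))
)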

Understanding Perl Script for GPIB Interface

I am new to Perl and Python. As part of my job, I have been asked to convert a Perl script to Python. The purpose of this script is to automate the task of the magnum tester and parametric analyzer. Can any of you explain what the get_gpib_status function is trying to do? My specific questions are:
What does if(/Error/) mean in Perl?
What does
chomp;
s/\+//g;
$value = $_;
$foundError = 0;
mean in Perl? And what is the Python equivalent of the get_gpib_status function?
Any kind of help is highly appreciated. Thanks in advance. The script is shown below.
BEGIN {unshift(@INC, "." , ".." ,
"\\Micron\\Nextest\\perl_modules");}
use runcli;
# Enable input from perl script as nextest cli command. Runcli is the
# command that you'll use to communicate with the tester.
use getHost;
# To latch in module/"library" into current script; here the getHost.pm
# is loaded, used once on nextest system
#open FILE,">","iv.txt" or die $!;
# Make file ready for reading from FILE
$k=148;
# Time period T = 38ns corresponds to data value of 140
#$i=0;
while($k<156)
{
$i=3;
while($i<4)
{
#$logfile = "vpasscp8stg_SS_vppmode"."stg"."$i"."freq"."$k".".txt";
# Give the name to the logfile
#open(LOG,">$logfile") or die $!;
# Makes the file ready for reading from LOG
#******************* SUBS ****************************
if($i==3)
{
runcli("gpibinit;");
runcli("gpibsend(0x1,\":PAGE:CHAN:MODE SWEEP\")");
# PAGE, CHANnel, MODE, Set the mode to sweep
runcli("gpibsend(0x1,\":PAGE:CHAN:SMU1:VNAME 'Vout'\")");
# Source Monitor Unit, voltage name Vout
runcli("gpibsend(0x1,\":PAGE:CHAN:SMU1:INAME 'Iout'\")");
# current name Iout
runcli("gpibsend(0x1,\":PAGE:CHAN:SMU1:MODE V\")");
# voltage output node
runcli("gpibsend(0x1,\":PAGE:CHAN:SMU1:FUNCTION VAR1\")");
# function Variable
runcli("gpibsend(0x1,\":PAGE:CHAN:SMU2:VNAME 'Vcc'\")");
runcli("gpibsend(0x1,\":PAGE:CHAN:SMU2:INAME 'Icc'\")");
runcli("gpibsend(0x1,\":PAGE:CHAN:SMU2:MODE V\")");
runcli("gpibsend(0x1,\":PAGE:CHAN:SMU2:FUNCTION CONSTANT\")");
# function constant
runcli("gpibsend(0x1,\":PAGE:CHAN:SMU3:VNAME 'Vpp'\")");
runcli("gpibsend(0x1,\":PAGE:CHAN:SMU3:INAME 'Ipp'\")");
runcli("gpibsend(0x1,\":PAGE:CHAN:SMU3:MODE V\")");
runcli("gpibsend(0x1,\":PAGE:CHAN:SMU3:FUNCTION CONSTANT\")");
#runcli("gpibsend(0x1,\":PAGE:CHAN:SMU3:DIS\")");
runcli("gpibsend(0x1,\":PAGE:CHAN:SMU4:DIS\")");
runcli("gpibsend(0x1,\":PAGE:CHAN:VSU1:DIS\")");
# Voltage Source Unit DISabled
runcli("gpibsend(0x1,\":PAGE:CHAN:VSU2:DIS\")");
runcli("gpibsend(0x1,\":PAGE:DISP:LIST:SELECT 'Vout'\")");
# DISPlay LIST
runcli("gpibsend(0x1,\":PAGE:DISP:LIST:SELECT 'Iout'\")");
runcli("gpibsend(0x1,\":PAGE:DISP:LIST:SELECT 'Vcc'\")");
runcli("gpibsend(0x1,\":PAGE:DISP:LIST:SELECT 'Icc'\")");
runcli("gpibsend(0x1,\":PAGE:MEAS:SWEEP:VAR1:MODE SINGLE\")");
# Single Stair Sweep
runcli("gpibsend(0x1,\":PAGE:MEAS:SWEEP:VAR1:SPACING LINEAR\")");
# The sweep is incremented (decremented) by the
# stepsize until the stop value is reached.
runcli("gpibsend(0x1,\":PAGE:MEAS:SWEEP:VAR1:START 2.8\")");
# Setting the sweep range of Vout
runcli("gpibsend(0x1,\":PAGE:MEAS:SWEEP:VAR1:STOP 18\")");
runcli("gpibsend(0x1,\":PAGE:MEAS:SWEEP:VAR1:STEP 0.1\")");
runcli("gpibsend(0x1,\":PAGE:MEAS:SWEEP:VAR1:COMPLIANCE 0.05\")");
# Compliance: meaning the stable state of voltage, on the parametric
# analyzer
runcli("gpibsend(0x1,\":PAGE:MEAS:SWEEP:VAR1:PCOMPLIANCE:STATE 0\")");
# PCOMPLIANCE: Might be the state before the stable state
runcli("gpibsend(0x1,\":PAGE:MEAS:DEL 2\")");
# Delay
runcli("gpibsend(0x1,\":PAGE:MEAS:HTIM 50\")");
# Hold Time
runcli("gpibsend(0x1,\":PAGE:MEAS:SWEEP:CONS:SMU2:SOURCE 3.3\")");
# Setting the values for VCC
runcli("gpibsend(0x1,\":PAGE:MEAS:SWEEP:CONS:SMU2:COMP 0.1\")");
runcli("gpibsend(0x1,\":PAGE:MEAS:SWEEP:CONS:SMU3:SOURCE 12\")");
# Setting the values for VPP
runcli("gpibsend(0x1,\":PAGE:MEAS:SWEEP:CONS:SMU3:COMP 0.1\")");
runcli("gpibsend(0x1,\":PAGE:SCON:SING\")");
sleep(2);
runcli("ctst");
runcli("stst");
sleep(2);
runcli("pu;rs");
runcli("B16R_vpasscp_vpp.txt()");
runcli("regaccess(static_load,0x9,0x9,$k)");
# Using the Cregs 0x9 to modulate the frequency
runcli("adputr(0xcf,0x03)");
runcli("rs");
poll_4156c();
}
sub poll_4156c{
runcli("gpibsend(0x1,\":STAT:OPER:COND?\");");
# This command returns the present status of the Operation
# Status "CONDITION" register. Reading this register does not clear it.
runcli("gpibreceive(0x1);");
while((get_gpib_status() > 0)&&($foundError < 1) ){
sleep(3);
runcli("gpibsend(0x1,\":STAT:OPER:COND?\")");
runcli("gpibreceive(0x1);");
}
}#end poll_4156c subroutine
sub get_gpib_status{
# get file info
$host_meas = getHost();
# Retrieve the nextest station detail; will return something like
# "mav2pt - 0014"
$file_meas = $host_meas."_temp.cli";
# Define the file_meas as the nextest cli temporary file. Contains
# all the text as displayed on Nextest CLI
open(STATUS, "$file_meas" ) || die("Can't open logfile: $!");
print "\nSTATUS received from GPIB:";
while(<STATUS>)
{
if(/Error/){
runcli("gpibinit;");
$foundError = 1;
}
else
{
chomp;
s/\+//g;
$value = $_;
$foundError = 0;
}
} # End of while(<STATUS>) loop.
close(STATUS);
#print "value = $value";
return($value);
}#End of get_gpib_status subroutine.
$i=$i+1;
}
$k=$k+8;
}
Ad 1. /Error/ is a regular expression matching any string that contains "Error" as a substring; in this context (if(/Error/)) it is evaluated as a boolean expression, i.e. it is true if a match is found in the $_ variable.
Ad 2. chomp removes the trailing newline character from its argument or, in the absence of an argument, from the $_ variable; s/\+//g; replaces any + signs found in $_ with nothing (i.e. removes them). As you might have already noticed, many Perl constructs operate on $_ if not told otherwise.
get_gpib_status is defined in your script: sub get_gpib_status is the beginning of the definition (in Perl, functions are defined with the sub keyword, short for "subroutine").
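As for the Python equivalent of get_gpib_status, here is a rough, hedged sketch. It assumes the same "<host>_temp.cli" file convention and that runcli() and a get_host() port of getHost() exist on the Python side; both are specific to the Nextest environment, so treat them as placeholders.
found_error = 0

def get_gpib_status():
    global found_error
    value = None
    file_meas = get_host() + "_temp.cli"  # hypothetical port of getHost()
    print("\nSTATUS received from GPIB:")
    with open(file_meas) as status:
        for line in status:
            if "Error" in line:  # Perl: if (/Error/)
                runcli("gpibinit;")  # runcli() assumed available in Python
                found_error = 1
            else:
                # Perl: chomp; s/\+//g; $value = $_;
                value = line.rstrip("\n").replace("+", "")
                found_error = 0
    return value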
And finally, I offer my sincere condolences. Giving the task of rewriting a program from one language to another to someone without prior experience in either language is probably not the stupidest managerial decision I've heard of, but it's high on the list.

Converting an imperative algorithm into functional style

I wrote a simple procedure to calculate the average test coverage of some specific packages in a Java project. The raw data in a huge HTML file looks like this:
<body>
package pkg1 <line_coverage>11/111,<branch_coverage>44/444<end>
package pkg2 <line_coverage>22/222,<branch_coverage>55/555<end>
package pkg3 <line_coverage>33/333,<branch_coverage>66/666<end>
...
</body>
Given the specified packages "pkg1" and "pkg3", for example, the average line coverage is:
(11+33)/(111+333)
and average branch coverage is:
(44+66)/(444+666)
I wrote the following procedure to get the result and it works well. But how can this calculation be implemented in a functional style, something like "(x,y) for x in ... for b in ... if ..."? I know a little Erlang, Haskell and Clojure, so solutions in these languages are also appreciated. Thanks a lot!
from __future__ import division
import re
datafile = ('abc', 'd>11/23d>34/89d', 'e>25/65e>13/25e', 'f>36/92f>19/76')
core_pkgs = ('d', 'f')
covered_lines, total_lines, covered_branches, total_branches = 0, 0, 0, 0
for line in datafile:
    for pkg in core_pkgs:
        ptn = re.compile('.*'+pkg+'.*'+'>(\d+)/(\d+).*>(\d+)/(\d+).*')
        match = ptn.match(line)
        if match is not None:
            cvln, tlln, cvbh, tlbh = match.groups()
            covered_lines += int(cvln)
            total_lines += int(tlln)
            covered_branches += int(cvbh)
            total_branches += int(tlbh)
print 'Line coverage:', '{:.2%}'.format(covered_lines / total_lines)
print 'Branch coverage:', '{:.2%}'.format(covered_branches / total_branches)
Down below you can find my Haskell solution. I will try to explain the important points I went through as I wrote it.
First you will find that I created a data structure for coverage data. It's generally a good idea to create data structures to represent whatever data you want to handle. This is in part because it makes it easier to design your code when you can think in terms of whatever you are designing – closely related to functional programming philosophies, and in part because it can eliminate a few bugs where you think you are doing something but are in actuality doing something else.
Related to the point before: The first thing I do is to convert the string-represented data into my own data structure. When you are doing functional programming, you are often doing things in "sweeps." You don't have a single function that converts data to your format, filters out the unwanted data and summarises the result. You have three different functions for each of those tasks, and you do them one at a time!
This is because functions are very composable, i.e. if you have three different ones, you can stick them together to form a single one if you want to. If you start with a single one, it is very difficult to take it apart to form three different ones.
The actual workings of the conversion function are quite uninteresting unless you are specifically doing Haskell. All it does is try to match each string with a regex, and if it succeeds, it adds the coverage data to the resulting list.
Again, mad composition is about to happen. I don't create a function to loop over a list of coverages and sum them up. I create a single function to sum two coverages, because I know I can use it together with the specialised fold loop (which is sort of like a for loop on steroids) to summarise all coverages in a list. There's no need for me to reinvent the wheel and create a loop myself.
Besides, my sumCoverages function works with a lot of specialised loops, so I don't have to write a ton of functions, I just stick my single function into a ton of pre-made library functions!
In the main function you will see what I mean by programming in "sweeps" or "passes" over the data. First I convert it to the internal format, then I filter out the unwanted data, then I summarise the remaining data. These are completely independent computations. That's functional programming.
You will also notice that I use two specialised loops there, filter and fold. This means that I don't have to write any loops myself, I just stick in a function to those standard library loops and let those take it from there.
import Data.Maybe (catMaybes)
import Data.List (foldl')
import Text.Printf (printf)
import Text.Regex (matchRegex, mkRegex)

corePkgs = ["d", "f"]

stats = [
    "d>11/23d>34/89d",
    "e>25/65e>13/25e",
    "f>36/92f>19/76"
    ]

format = mkRegex ".*(\\w+).*>([0-9]+)/([0-9]+).*>([0-9]+)/([0-9]+).*"

-- It might be a good idea to define a datatype for coverage data.
-- A bit of coverage data is defined as the name of the package it
-- came from, the lines covered, the total amount of lines, the
-- branches covered and the total amount of branches.
data Coverage = Coverage String Int Int Int Int

-- Then we need a way to convert the string data into a list of
-- coverage data. We do this by regex. We try to match on each
-- string in the list, and then we choose to keep only the successful
-- matches. Returned is a list of coverage data that was represented
-- by the strings.
convert :: [String] -> [Coverage]
convert = catMaybes . map match
  where match line = do
          [name, cl, tl, cb, tb] <- matchRegex format line
          return $ Coverage name (read cl) (read tl) (read cb) (read tb)

-- We need a way to summarise two coverage data bits. This can of course also
-- be used to summarise entire lists of coverage data, by folding over it.
sumCoverage (Coverage nameA clA tlA cbA tbA) (Coverage nameB clB tlB cbB tbB) =
    Coverage (nameA ++ nameB ++ ",") (clA + clB) (tlA + tlB) (cbA + cbB) (tbA + tbB)

main = do
    -- First we need to convert the strings to coverage data
    let coverageData = convert stats
        -- Then we want to filter out only the relevant data
        relevantData = filter (\(Coverage name _ _ _ _) -> name `elem` corePkgs) coverageData
        -- Then we need to summarise it, but we are only interested in the numbers
        Coverage _ cl tl cb tb = foldl' sumCoverage (Coverage "" 0 0 0 0) relevantData
    -- So we can finally print them!
    printf "Line coverage: %.2f\n" (fromIntegral cl / fromIntegral tl :: Double)
    printf "Branch coverage: %.2f\n" (fromIntegral cb / fromIntegral tb :: Double)
Here are some quickly-hacked, untested ideas applied to your code:
import numpy as np
import re

datafile = ('abc', 'd>11/23d>34/89d', 'e>25/65e>13/25e', 'f>36/92f>19/76')
core_pkgs = ('d', 'f')
covered_lines, total_lines, covered_branches, total_branches = 0, 0, 0, 0
for pkg in core_pkgs:
    ptn = re.compile('.*'+pkg+'.*'+'>(\d+)/(\d+).*>(\d+)/(\d+).*')
    matches = map(ptn.match, datafile)
    statsList = [map(int, match.groups()) for match in matches if match]
    # statsList is a list of [cvln, tlln, cvbh, tlbh]
    stats = np.array(statsList)
    covered_lines, total_lines, covered_branches, total_branches = stats.sum(axis=0)
Well, as you can see I haven't bothered to finish off the remaining loop, but I think the point is made by now. There's certainly a lot more than one way to do this; I elected to show off map() (which some will say makes this less efficient, and it probably does), as well as NumPy to get the (admittedly light) math done.
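For a plain-Python version of the same thing, here is a hedged sketch of the three "sweeps" the Haskell answer describes (convert, filter, summarise), using only the stdlib; the parse helper and its regex are our own invention (Python 3):
import re
from functools import reduce

datafile = ('abc', 'd>11/23d>34/89d', 'e>25/65e>13/25e', 'f>36/92f>19/76')
core_pkgs = ('d', 'f')
fmt = re.compile(r'(\w+)>(\d+)/(\d+).*>(\d+)/(\d+)')

def parse(line):
    # Sweep 1: convert a raw line into (name, (cvln, tlln, cvbh, tlbh)).
    m = fmt.match(line)
    return m and (m.group(1), tuple(int(g) for g in m.groups()[1:]))

parsed = filter(None, map(parse, datafile))
# Sweep 2: keep only the core packages.
relevant = (nums for name, nums in parsed if name in core_pkgs)
# Sweep 3: summarise by element-wise addition.
cl, tl, cb, tb = reduce(lambda a, b: tuple(x + y for x, y in zip(a, b)),
                        relevant, (0, 0, 0, 0))
print('Line coverage: {:.2%}'.format(cl / tl))
print('Branch coverage: {:.2%}'.format(cb / tb))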
This is the corresponding Clojure solution:
(defn extract-data
  "extract 4 integers from a string line according to a package name"
  [pkg line]
  (map read-string
       (rest (first
              (re-seq
               (re-pattern
                (str pkg ".*>(\\d+)/(\\d+).*>(\\d+)/(\\d+)"))
               line)))))

(defn scan-lines-by-pkg
  "scan all string lines and extract all data as integer sequences
   according to package names"
  [pkgs lines]
  (filter seq (for [pkg pkgs
                    line lines]
                (extract-data pkg line))))

(defn sum-data
  "add all data in valid lines together"
  [pkgs lines]
  (apply map + (scan-lines-by-pkg pkgs lines)))

(defn get-percent
  [covered all]
  (str (format "%.2f" (float (/ (* covered 100) all))) "%"))

(defn get-cov
  [pkgs lines]
  {:line-cov   (apply get-percent (take 2 (sum-data pkgs lines)))
   :branch-cov (apply get-percent (drop 2 (sum-data pkgs lines)))})

(get-cov ["d" "f"] ["abc" "d>11/23d>34/89d" "e>25/65e>13/25e" "f>36/92f>19/76"])
