I am new to both Perl and Python. As part of my job I am currently asked to convert a Perl script to Python. The purpose of this script is to automate a Magnum tester and a parametric analyzer. Can any of you tell what the get_gpib_status function is trying to do? My specific questions are:
What does if(/Error/) mean in Perl?
What is meant by
chomp;
s/\+//g;
$value = $_;
$foundError = 0;
in Perl? And what is the Python equivalent of the get_gpib_status function?
Any kind of help is highly appreciated. Thanks in advance. The script is shown below.
BEGIN {unshift(@INC, "." , ".." , "\\Micron\\Nextest\\perl_modules");}
use runcli;
# Enable input from the Perl script as a Nextest CLI command; runcli is the
# command used to communicate with the tester
use getHost;
# Latch in a module/"library" into the current script; here getHost.pm
# is loaded (used on Nextest systems)
#open FILE,">","iv.txt" or die $!;
# Open iv.txt for writing via FILE
$k=148;
# Time period T = 38ns corresponds to data value of 140
#$i=0;
while($k<156)
{
$i=3;
while($i<4)
{
#$logfile = "vpasscp8stg_SS_vppmode"."stg"."$i"."freq"."$k".".txt";
# Give the name to the logfile
#open(LOG,">$logfile") or die $!;
# Open the logfile for writing via LOG
#******************* SUBS ****************************
if($i==3)
{
runcli("gpibinit;");
runcli("gpibsend(0x1,\":PAGE:CHAN:MODE SWEEP\")");
# PAGE, CHANnel, MODE, Set the mode to sweep
runcli("gpibsend(0x1,\":PAGE:CHAN:SMU1:VNAME 'Vout'\")");
# Source Monitor Unit, voltage name Vout
runcli("gpibsend(0x1,\":PAGE:CHAN:SMU1:INAME 'Iout'\")");
# current name Iout
runcli("gpibsend(0x1,\":PAGE:CHAN:SMU1:MODE V\")");
# voltage output mode
runcli("gpibsend(0x1,\":PAGE:CHAN:SMU1:FUNCTION VAR1\")");
# function Variable
runcli("gpibsend(0x1,\":PAGE:CHAN:SMU2:VNAME 'Vcc'\")");
runcli("gpibsend(0x1,\":PAGE:CHAN:SMU2:INAME 'Icc'\")");
runcli("gpibsend(0x1,\":PAGE:CHAN:SMU2:MODE V\")");
runcli("gpibsend(0x1,\":PAGE:CHAN:SMU2:FUNCTION CONSTANT\")");
# function constant
runcli("gpibsend(0x1,\":PAGE:CHAN:SMU3:VNAME 'Vpp'\")");
runcli("gpibsend(0x1,\":PAGE:CHAN:SMU3:INAME 'Ipp'\")");
runcli("gpibsend(0x1,\":PAGE:CHAN:SMU3:MODE V\")");
runcli("gpibsend(0x1,\":PAGE:CHAN:SMU3:FUNCTION CONSTANT\")");
#runcli("gpibsend(0x1,\":PAGE:CHAN:SMU3:DIS\")");
runcli("gpibsend(0x1,\":PAGE:CHAN:SMU4:DIS\")");
runcli("gpibsend(0x1,\":PAGE:CHAN:VSU1:DIS\")");
# Voltage Source Unit DISabled
runcli("gpibsend(0x1,\":PAGE:CHAN:VSU2:DIS\")");
runcli("gpibsend(0x1,\":PAGE:DISP:LIST:SELECT 'Vout'\")");
# DISPlay LIST
runcli("gpibsend(0x1,\":PAGE:DISP:LIST:SELECT 'Iout'\")");
runcli("gpibsend(0x1,\":PAGE:DISP:LIST:SELECT 'Vcc'\")");
runcli("gpibsend(0x1,\":PAGE:DISP:LIST:SELECT 'Icc'\")");
runcli("gpibsend(0x1,\":PAGE:MEAS:SWEEP:VAR1:MODE SINGLE\")");
# Single Stair Sweep
runcli("gpibsend(0x1,\":PAGE:MEAS:SWEEP:VAR1:SPACING LINEAR\")");
# The sweep is incremented (decremented) by the
# stepsize until the stop value is reached.
runcli("gpibsend(0x1,\":PAGE:MEAS:SWEEP:VAR1:START 2.8\")");
# Setting the sweep range of Vout
runcli("gpibsend(0x1,\":PAGE:MEAS:SWEEP:VAR1:STOP 18\")");
runcli("gpibsend(0x1,\":PAGE:MEAS:SWEEP:VAR1:STEP 0.1\")");
runcli("gpibsend(0x1,\":PAGE:MEAS:SWEEP:VAR1:COMPLIANCE 0.05\")");
# Compliance: the current limit (here 50 mA) enforced by the parametric
# analyzer during the voltage sweep
runcli("gpibsend(0x1,\":PAGE:MEAS:SWEEP:VAR1:PCOMPLIANCE:STATE 0\")");
# PCOMPLIANCE: power compliance, disabled here
runcli("gpibsend(0x1,\":PAGE:MEAS:DEL 2\")");
# Delay
runcli("gpibsend(0x1,\":PAGE:MEAS:HTIM 50\")");
# Hold Time
runcli("gpibsend(0x1,\":PAGE:MEAS:SWEEP:CONS:SMU2:SOURCE 3.3\")");
# Setting the values for VCC
runcli("gpibsend(0x1,\":PAGE:MEAS:SWEEP:CONS:SMU2:COMP 0.1\")");
runcli("gpibsend(0x1,\":PAGE:MEAS:SWEEP:CONS:SMU3:SOURCE 12\")");
# Setting the values for VPP
runcli("gpibsend(0x1,\":PAGE:MEAS:SWEEP:CONS:SMU3:COMP 0.1\")");
runcli("gpibsend(0x1,\":PAGE:SCON:SING\")");
sleep(2);
runcli("ctst");
runcli("stst");
sleep(2);
runcli("pu;rs");
runcli("B16R_vpasscp_vpp.txt()");
runcli("regaccess(static_load,0x9,0x9,$k)");
# Using the Cregs 0x9 to modulate the frequency
runcli("adputr(0xcf,0x03)");
runcli("rs");
poll_4156c();
}
sub poll_4156c{
runcli("gpibsend(0x1,\":STAT:OPER:COND?\");");
# This command returns the present status of the Operation
# Status "CONDITION" register. Reading this register does not clear it.
runcli("gpibreceive(0x1);");
while((get_gpib_status() > 0)&&($foundError < 1) ){
sleep(3);
runcli("gpibsend(0x1,\":STAT:OPER:COND?\")");
runcli("gpibreceive(0x1);");
}
}#end poll_4156c subroutine
sub get_gpib_status{
# get file info
$host_meas = getHost();
# Retrieve the Nextest station detail; returns something like mav2pt-0014
$file_meas = $host_meas."_temp.cli";
# file_meas is the Nextest CLI temporary file; it contains
# all the text as displayed on the Nextest CLI
open(STATUS, "$file_meas" ) || die("Can't open logfile: $!");
print "\nSTATUS received from GPIB:";
while(<STATUS>)
{
if(/Error/){
runcli("gpibinit;");
$foundError = 1;
}
else
{
chomp;
s/\+//g;
$value = $_;
$foundError = 0;
}
} # End of while(<STATUS>) loop.
close(STATUS);
#print "value = $value";
return($value);
}#End of get_gpib_status subroutine.
$i=$i+1;
}
$k=$k+8;
}
Ad 1. /Error/ is a regular expression that matches any string containing "Error" as a substring; in this context (if(/Error/)) it is evaluated as a boolean expression, i.e. it is true if a match is found in the $_ variable.
Ad 2. chomp removes the trailing newline character from its argument or, if no argument is given, from the $_ variable; s/\+//g; replaces every + sign found in $_ with nothing (i.e. removes them). As you may have noticed, many Perl constructs operate on $_ unless told otherwise.
get_gpib_status is defined in your script: sub get_gpib_status is the start of its definition (in Perl, functions are defined with the sub keyword, short for "subroutine").
And finally, my sincere condolences. Giving the task of rewriting a program from one language to another to someone with no prior experience in either language is probably not the stupidest managerial decision I've heard of, but it is high on the list.
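As for the last question: a rough Python equivalent of get_gpib_status could look like the sketch below. The "<host>_temp.cli" filename convention is taken from your script; instead of Perl globals, this version returns the value and the error flag as a pair.

```python
import re

def get_gpib_status(host):
    """Rough Python equivalent of the Perl get_gpib_status sub.

    Reads the Nextest CLI temp file "<host>_temp.cli" line by line.
    On a line containing "Error" it flags the error (the Perl version
    also re-inits the bus here via runcli("gpibinit;")); otherwise it
    strips the trailing newline (Perl's chomp), removes every "+"
    (Perl's s/\\+//g) and keeps the line as the status value.
    """
    value = None
    found_error = False
    with open(host + "_temp.cli") as status:
        for line in status:
            if re.search(r"Error", line):     # Perl: if (/Error/)
                found_error = True
            else:
                line = line.rstrip("\n")      # chomp;
                line = line.replace("+", "")  # s/\+//g;
                value = line                  # $value = $_;
                found_error = False           # $foundError = 0;
    return value, found_error
```

The polling loop would then call this until the returned value is no longer positive or an error is flagged.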
For debugging purposes my program writes out Armadillo-based matrices in a raw ASCII format into text files, i.e. complex numbers are written as (1, 1). The resulting files are larger than 3 GByte.
I would like to "plot" those matrices (representing fields) such that I can look at different points within the field for debugging. What would be the best way of doing that?
When directly plotting my file with gnuplot using
plot "matrix_file.txt" matrix with image
I get the response
warning: matrix contains missing or undefined values
Warning: empty cb range [0:0], adjusting to [-1:1]
I could also use Matplotlib, iterate over each row in the file and convert the values into appropriate Python values, but I assume reading the full file that way will be rather time-consuming.
Thus, are there other reasonable fast options for plotting my matrix, or is there a way to tell gnuplot how to treat my complex numbers properly?
A part of the first line looks like
(0.0000000000000000e+00,0.0000000000000000e+00) (8.6305562282169946e-07,6.0526580514090297e-07) (1.2822974500623326e-05,1.1477679031930141e-05) (5.8656372718492336e-05,6.6626342814082442e-05) (1.6183121649896915e-04,2.3519364967920469e-04) (3.2919257507746272e-04,6.2745022681547850e-04) (5.3056616247733281e-04,1.3949688132772061e-03) (6.7714688179733437e-04,2.7240206117506108e-03) (6.0083005524875425e-04,4.8217990806492588e-03) (3.6759450038482363e-05,7.8957232784174231e-03) (-1.3887302495780910e-03,1.2126758313515496e-02) (-4.1629396217170980e-03,1.7638346107957101e-02) (-8.8831593853181175e-03,2.4463072133103888e-02) (-1.6244140097742808e-02,3.2509486873735290e-02) (-2.7017231109227786e-02,4.1531431496659221e-02) (-4.2022691198292300e-02,5.1101686500864850e-02) (-6.2097364532786636e-02,6.0590740956970250e-02) (-8.8060067117896060e-02,6.9150058884242055e-02) (-1.2067637255414780e-01,7.5697648270160053e-02) (-1.6062285417043359e-01,7.8902435158400494e-02) (-2.0844826713055306e-01,7.7163461035715558e-02) (-2.6452596415873003e-01,6.8580842184681204e-02) (-3.2898869195273894e-01,5.0918234150147214e-02) (-4.0163477687695504e-01,2.1561405580661022e-02) (-4.8179470918233597e-01,-2.2515842273449008e-02) (-5.6815035401912617e-01,-8.4759639628930100e-02) (-6.5850621484774385e-01,-1.6899215347429869e-01) (-7.4952345707877654e-01,-2.7928561041518252e-01) (-8.3644196044174313e-01,-4.1972419090890900e-01) (-9.1283160402230334e-01,-5.9403043419268908e-01) (-9.7042844114238713e-01,-8.0504703287094281e-01) (-9.9912107865273936e-01,-1.0540865412492695e+00) (-9.8715384989307420e-01,-1.3401890190155983e+00) (-9.2160320921981831e-01,-1.6593576679224276e+00) (-7.8916051033438095e-01,-2.0038702251062159e+00) (-5.7721850912406181e-01,-2.3617835609973805e+00) (-2.7521347260072193e-01,-2.7167550691449942e+00)
Ideally, I would like to be able to choose if I plot only the real part, the imaginary part or the abs()-value.
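For reference, the Matplotlib route I mentioned would start with something like this plain-Python parser of the format above (a sketch; the resulting rows of complex values could then be turned into real/imag/abs arrays for imshow):

```python
import re

# one "(re,im)" pair; tolerates scientific notation and negative signs
PAIR = re.compile(r"\(([^,]+),([^)]+)\)")

def read_complex_matrix(path):
    """Parse lines of "(re,im) (re,im) ..." into a list of rows of complex."""
    rows = []
    with open(path) as f:
        for line in f:
            pairs = PAIR.findall(line)
            if pairs:
                rows.append([complex(float(r), float(i)) for r, i in pairs])
    return rows

def view(rows, which="abs"):
    """Select which component to plot: real part, imaginary part or modulus."""
    fn = {"real": lambda z: z.real,
          "imag": lambda z: z.imag,
          "abs": abs}[which]
    return [[fn(z) for z in row] for row in rows]
```

The output of view() is a plain nested list, so it can be handed directly to matplotlib.pyplot.imshow (or converted with numpy.array first).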
Here is a gnuplot only version.
Actually, I haven't yet seen a gnuplot example of how to plot complex numbers from a datafile.
Here, the idea is to split the data into columns at the characters ( and , and ) via:
set datafile separator '(,)'
Then you can address the i-th real and imaginary parts in a column via column(3*i-1) and column(3*i), respectively.
You are creating a new dataset by plotting the data many times in a double loop, which is OK for small data. However, my guess is that this solution might become pretty slow for large datasets, especially when plotting from a file. I assume it might be faster if you have your data in a datablock (instead of a file); check gnuplot: load datafile 1:1 into datablock. In general, it may be more efficient to use another tool, e.g. Python or awk, to prepare the data.
Just a thought: if you have approx. 3e9 Bytes of data and (according to your example) approx. 48-50 Bytes per datapoint and if you want to plot it as a square graph, then the number of pixels on a side would be sqrt(3e9/50)=7746 pixels. I doubt that you have a display which can display this at once.
Edit:
The modified version below now uses set print to a datablock and is much faster than the original version (which used a double loop of plot ... every ...). The speed improvement is already visible with my small data example. Good luck with your huge dataset ;-).
Just for reference and comparison, the old version listed again here:
# create a new datablock with row,col,Real,Imag,Abs
# using plot ...with table (pretty slow and inefficient)
set table $Data2
set datafile separator '(,)' # now, split your data at these characters
myReal(i) = column(3*i-1)
myImag(i) = column(3*i)
myAbs(i) = sqrt(myReal(i)**2 + myImag(i)**2)
plot for [row=0:rowMax-1] for [col=1:colMax] $Data u (row):(col):(myReal(col)):(myImag(col)):(myAbs(col)) every ::row::row w table
set datafile separator whitespace # set separator back to whitespace
unset table
Code: (modified using set print)
### plotting complex numbers
reset session
$Data <<EOD
(0.1,0.1) (0.2,1.2) (0.3,2.3) (0.4,3.4) (0.5,4.5)
(1.1,0.1) (1.2,1.2) (1.3,2.3) (1.4,3.4) (1.5,4.5)
(2.1,0.1) (2.2,1.2) (2.3,2.3) (2.4,3.4) (2.5,4.5)
(3.1,0.1) (3.2,1.2) (3.3,2.3) (3.4,3.4) (3.5,4.5)
(4.1,0.1) (4.2,1.2) (4.3,2.3) (4.4,3.4) (4.5,4.5)
(5.1,0.1) (5.2,1.2) (5.3,2.3) (5.4,3.4) (5.5,4.5)
(6.1,0.1) (6.2,1.2) (6.3,2.3) (6.4,3.4) (6.5,4.5)
(7.1,0.1) (7.2,1.2) (7.3,2.3) (7.4,3.4) (7.5,4.5)
EOD
stats $Data u 0 nooutput # get number of columns and rows, separator is whitespace
colMax = STATS_columns
rowMax = STATS_records
# create a new datablock with row,col,Real,Imag,Abs
# using print to datablock
set print $Data2
myCmplx(row,col) = word($Data[row+1],col)
myReal(row,col) = (s=myCmplx(row,col),s[2:strstrt(s,',')-1])
myImag(row,col) = (s=myCmplx(row,col),s[strstrt(s,',')+1:strlen(s)-1])
myAbs(row,col) = sqrt(myReal(row,col)**2 + myImag(row,col)**2)
do for [row=0:rowMax-1] {
do for [col=1:colMax] {
print sprintf("%d %d %s %s %g",row,col,myReal(row,col),myImag(row,col),myAbs(row,col))
}
}
set print
set key box opaque
set multiplot layout 2,2
plot $Data2 u 1:2:3 w image ti "Real part"
plot $Data2 u 1:2:4 w image ti "Imaginary part"
set origin 0.25,0
plot $Data2 u 1:2:5 w image ti "Absolute value"
unset multiplot
### end of code
Result:
Maybe not what you asked for, but I think it is neat to plot directly from your code, and it is simple to modify what you want to show (abs(x), real(x), ...). Here is a simple snippet to plot an Armadillo matrix as an image in gnuplot (Linux):
#include <armadillo>
using namespace std;
using namespace arma;
void plot_image(mat& x, FILE* cmd_pipe)
{
    fputs("set nokey; set yrange [*:*] reverse\n", cmd_pipe);
    fputs("plot '-' matrix with image\n", cmd_pipe);
    for(uword r=0; r<x.n_rows; r++){
        for(uword c=0; c<x.n_cols; c++){
            string str = to_string(x(r,c)) + " ";
            fputs(str.c_str(), cmd_pipe);
        }
        fputs("\n", cmd_pipe);
    }
    fputs("e\n", cmd_pipe);
}
int main()
{
    FILE* gnuplot_pipe = popen("gnuplot -persist", "w");
    mat x={{1,2,3,4,5},
           {2,2,3,4,5},
           {3,3,3,4,5},
           {4,4,4,4,5},
           {5,5,9,9,9}};
    plot_image(x, gnuplot_pipe);
    pclose(gnuplot_pipe); // flush and close the pipe
    return 0;
}
The output is:
I am attempting to implement the slowly updating global window side inputs example from the documentation in Python instead of Java, and I am somewhat stuck on what the AfterProcessingTime.pastFirstElementInPane() equivalent in Python is. For the map I've done something like this:
class ApiKeys(beam.DoFn):
def process(self, elm) -> Iterable[Dict[str, str]]:
yield TimestampedValue(
{"<api_key_1>": "<account_id_1>", "<api_key_2>": "<account_id_2>",},
elm,
)
map = beam.pvalue.AsSingleton(
p
| "trigger pipeline" >> beam.Create([None])
| "define schedule"
>> beam.Map(
lambda _: (
0, # would be timestamp.Timestamp.now() in production
20, # would be timestamp.MAX_TIMESTAMP in production
1, # would be around 1 hour or so in production
)
)
| "GenSequence"
>> PeriodicSequence()
| "ApplyWindowing"
>> beam.WindowInto(
beam.window.GlobalWindows(),
trigger=Repeatedly(Always(), AfterProcessingTime(???)),
accumulation_mode=AccumulationMode.DISCARDING,
)
| "api_keys" >> beam.ParDo(ApiKeys())
)
I am hoping to use this as a Dict[str, str] input to a downstream function that will have windows of 60 seconds, merging with this one that I hope to update on an hourly basis.
The point is to run this on google cloud dataflow (where we currently just re-release it to update the api_keys).
I've pasted the java example from the documentation below for convenience sake:
public static void sideInputPatterns() {
// This pipeline uses View.asSingleton for a placeholder external service.
// Run in debug mode to see the output.
Pipeline p = Pipeline.create();
// Create a side input that updates each second.
PCollectionView<Map<String, String>> map =
p.apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(5L)))
.apply(
Window.<Long>into(new GlobalWindows())
.triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
.discardingFiredPanes())
.apply(
ParDo.of(
new DoFn<Long, Map<String, String>>() {
@ProcessElement
public void process(
@Element Long input, OutputReceiver<Map<String, String>> o) {
// Replace map with test data from the placeholder external service.
// Add external reads here.
o.output(PlaceholderExternalService.readTestData());
}
}))
.apply(View.asSingleton());
// Consume side input. GenerateSequence generates test data.
// Use a real source (like PubSubIO or KafkaIO) in production.
p.apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1L)))
.apply(Window.into(FixedWindows.of(Duration.standardSeconds(1))))
.apply(Sum.longsGlobally().withoutDefaults())
.apply(
ParDo.of(
new DoFn<Long, KV<Long, Long>>() {
@ProcessElement
public void process(ProcessContext c) {
Map<String, String> keyMap = c.sideInput(map);
c.outputWithTimestamp(KV.of(1L, c.element()), Instant.now());
LOG.debug(
"Value is {}, key A is {}, and key B is {}.",
c.element(),
keyMap.get("Key_A"),
keyMap.get("Key_B"));
}
})
.withSideInputs(map));
}
/** Placeholder class that represents an external service generating test data. */
public static class PlaceholderExternalService {
public static Map<String, String> readTestData() {
Map<String, String> map = new HashMap<>();
Instant now = Instant.now();
DateTimeFormatter dtf = DateTimeFormat.forPattern("HH:mm:ss");
map.put("Key_A", now.minus(Duration.standardSeconds(30)).toString(dtf));
map.put("Key_B", now.minus(Duration.standardSeconds(30)).toString());
return map;
}
}
Any ideas as to how to emulate this example would be enormously appreciated, I've spent literally days on this issue now :(
Update #2 based on @AlexanderMoraes
So, I've tried changing it according to my understanding of your suggestions:
main_window_size = 5
trigger_interval = 30
side_input = beam.pvalue.AsSingleton(
p
| "trigger pipeline" >> beam.Create([None])
| "define schedule"
>> beam.Map(
lambda _: (
0, # timestamp.Timestamp.now().__float__(),
60, # timestamp.Timestamp.now().__float__() + 30.0,
trigger_interval, # fire_interval
)
)
| "GenSequence" >> PeriodicSequence()
| "api_keys" >> beam.ParDo(ApiKeys())
| "window"
>> beam.WindowInto(
beam.window.GlobalWindows(),
trigger=Repeatedly(AfterProcessingTime(window_size)),
accumulation_mode=AccumulationMode.DISCARDING,
)
)
But when combining this with another pipeline with windowing set to something smaller than trigger_interval I am unable to use the dictionary as a singleton because for some reason they are duplicated:
ValueError: PCollection of size 2 with more than one element accessed as a singleton view. First two elements encountered are "{'<api_key_1>': '<account_id_1>', '<api_key_2>': '<account_id_2>'}", "{'<api_key_1>': '<account_id_1>', '<api_key_2>': '<account_id_2>'}". [while running 'Pair with AccountIDs']
Is there some way to clarify that the singleton output should ignore whatever came before it?
The title of the question, "slowly updating side inputs", refers to the documentation, which already has a Python version of the code. However, the code you provided is from "updating global window side inputs", which only has a Java version. So I will be addressing the second one.
You are not able to reproduce AfterProcessingTime.pastFirstElementInPane() within Python. This function is used to fire triggers, which determine when to emit the results of each window (referred to as a pane). In your case, this particular call creates a trigger that fires when the current processing time passes the processing time at which the trigger saw the first element in a pane. In Python this is achieved using AfterWatermark and AfterProcessingTime().
Below are two pieces of code, one in Java and one in Python, so you can understand more about each one's usage. Both examples set a time-based trigger which emits results one minute after the first element of the window has been processed. Also, the accumulation mode is set to not accumulate results (Java: discardingFiredPanes(); Python: accumulation_mode=AccumulationMode.DISCARDING).
1- Java:
PCollection<String> pc = ...;
pc.apply(Window.<String>into(FixedWindows.of(1, TimeUnit.MINUTES))
.triggering(AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardMinutes(1)))
.discardingFiredPanes());
2- Python: the trigger configuration is the same as described in point 1
pcollection | WindowInto(
FixedWindows(1 * 60),
trigger=AfterProcessingTime(1 * 60),
accumulation_mode=AccumulationMode.DISCARDING)
The examples above were taken from the documentation.
AsIter() instead of AsSingleton() worked for me.
From a C file I'd like to parse switch parts to be able to identify 2 things :
a switch with only 2 cases :
switch(Data)
{
case 0: value = 10 ; break;
case 1 : value = 20 ;break;
default:
somevar = false;
value = 0 ;
----
break;
}
==> for instance would print "section with 2 case"
a switch with many (unlimited) cases :
switch(Data)
{
case Constant1 : value = 10 ; break;
case constant2 : value = 20 ;break;
case constant3 : value = 30 ;break;
case constant4 : value = 40 ;break;
default:
somevar = false;
value = 0 ;
----
break;
}
==> would print "section with case : Constant1, Constant2, Constant3, Constant4"
To do that, I've done the following :
original_file = open(original_file, "r")
regex_case = re.compile(r'.*case.*:')  # compile once, outside the loop
for line in original_file:
    line_nb += 1
    found_case = regex_case.search(line)
    if found_case:
        # relying on line numbers is somewhat unreliable, as the C file
        # may have the break on an additional line
        cases_dict[line_nb] = found_case.group()
bool_or_enum(cases_dict)
which would need bool_or_enum to test for all the required results:
def bool_or_enum(in_dict={}):
sorted_dict = sorted(in_dict.items(), key=operator.itemgetter(0))
for index, item in enumerate(sorted_dict):
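To make the intent concrete, here is a minimal regex-only sketch of the classification I am after (a hypothetical helper, not my final code; it ignores comments, nesting and multi-line case bodies):

```python
import re

# a case label is either an identifier (enum constant) or an integer literal
CASE = re.compile(r"\bcase\s+([A-Za-z_]\w*|\d+)\s*:")

def classify_switch(src):
    """Return a summary line for the case labels found in one switch body."""
    labels = CASE.findall(src)
    if len(labels) == 2 and all(l.isdigit() for l in labels):
        # two numeric cases, e.g. case 0 / case 1
        return "section with 2 case"
    return "section with case : " + ", ".join(labels)
```

This is only an approximation of what a real parser would do, but it shows the two kinds of output I want.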
According to the comments, I searched and found that two solutions are available:
by using pycparser
pros: a Python package, free and open source
cons: not really easy to start with; needs additional tools to preprocess the files (gcc, llvm, etc.).
by using an external tool: Understand from SciTools
This tool is usable from its GUI: you build a complete project to parse, so you can have call graphs, metrics, code checking, etc. For this question I've been using the API, which is available as docs and examples.
pros:
the parsing is entirely done by the tool from the GUI
I have many source files, and re-parsing a complete directory is a "push-button" solution
cons:
not free
not open source
I usually prefer open source projects, but in this case Understand was the only workable solution. The license is not that expensive, and above all, I chose it because I could parse files that couldn't be compiled because their dependencies (libs and header files) weren't available.
Here is the code I've used :
import understand
understand_file="C:\\Users\\dlewin\\myproject.udb"
#Create a list with all the cases from the file
def find_cases(file):
returnList = []
for lexeme in file.lexer(False,8,False,True):
if ( lexeme.text() == "case" ) :
returnList.append(lexeme.line_end()) #line nb
returnList.append(lexeme.text()) #found a case
return returnList
def find_identifiers(file):
returnList = []
# Open the file lexer with macros expanded and inactive code removed
for lexeme in file.lexer(False,8,False,True):
if(lexeme.token() == "Identifier"):
returnList.append(lexeme.line_end()) #line nb
returnList.append(lexeme.text()) #identifier found
return returnList
db = understand.open(understand_file ) # Open Database
file = db.lookup("mysourcefile.cpp","file")[0]
print (file.longname())
liste_idents = find_identifiers(file)
liste_cases = find_cases(file)
I want to be able to detect a pattern in a PDF and somehow flag it.
For instance, in this PDF, there's the string *2. I want to be able to parse the PDF, detect all instances of *[integer], and do something to call attention to the matches (like highlight them yellow or add a symbol in the margin).
I would prefer to do this in Python, but I'm open to other languages. So far, I've been able to use pyPdf to read the PDF's text. I can use a regex to detect the pattern. But I haven't been able to figure out how to flag the match and re-save the PDF.
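For illustration, the detection part on the extracted text is simple enough; something like this (hypothetical helper names):

```python
import re

# "*" followed by one or more digits, e.g. "*2" or "*13"
STAR_INT = re.compile(r"\*\d+")

def find_flags(text):
    """Return every *<integer> marker with its character offset in the text."""
    return [(m.group(), m.start()) for m in STAR_INT.finditer(text)]
```

The hard part, as the answers below discuss, is mapping those offsets back to page coordinates and writing the highlight into the PDF.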
Either people are not interested, or Python's not capable, so here's a solution in Perl :-). Seriously, as noted above, you don't need to "alter strings": PDF annotations are the solution for you. I had a small project with annotations not long ago; some of this code is from there. But my content parser was not universal, and you don't need full-blown parsing (meaning the ability to alter content and write it back), so I resorted to an external tool. The PDF library I use is somewhat low-level, but I don't mind. It also means one is expected to have proper knowledge of PDF internals to understand what's going on. Otherwise, just use the tool.
Here's a shot of marking e.g. all gerunds in OP's file with a command
perl pdf_hl.pl -f westlaw.pdf -p '\S*ing'
The code (the comments inside are worth reading, too):
use strict;
use warnings;
use XML::Simple;
use CAM::PDF;
use Getopt::Long;
use Regexp::Assemble;
#####################################################################
#
# This is PDF highlight mark-up tool.
# Though fully functional, it's still a prototype proof-of-concept.
# Please don't feed it with non-pdf files or patterns like '\d*'
# (because you probably want '\d+', don't you?).
#
# Requires muPDF-tools installed and in the PATH, plus some CPAN modules.
#
# ToDo:
# - error handling is primitive if any.
# - cropped files (CropBox) are processed incorrectly. Fix it.
# - of course there can be other useful parameters.
# - allow loading them from file.
# - allow searching across lines (e.g. for multi-word patterns)
# and certainly across "spans" within a line (see mudraw output).
# - multi-color mark-up, not just yellow.
# - control over output file name.
# - compress output (use cleanoutput method instead of output,
# plus more robust (think compressed object streams) compressors
# may be useful).
# - file list processing.
# - annotations are not just colorful marks on the page, their
# dictionaries can contain all sorts of useful information, which may
# be extracted automatically further up the food chain i.e. by
# whoever consumes these files (date, time, author, comments, actual
# text below, etc., etc., plus think of customized appearance streams,
# placing them on layers, etc..
# - ???
#
# Most complexity in the code comes from adding appearance
# dictionary (AP). You can safely delete it, because most viewers don't
# need AP for standard annotations. Ironically, muPDF-viewer wants it
# (otherwise highlight placement is not 100% correct), and since I relied
# on muPDF-tools, I thought it be proper to create PDFs consumable by
# their viewer... Firefox wants AP too, btw.
#
#####################################################################
my ($file, $csv);
my ($c_flag, $w_flag) = (0, 1);
GetOptions('-f=s' => \$file, '-p=s' => \$csv,
'-c!' => \$c_flag, '-w!' => \$w_flag)
and defined($file)
and defined($csv)
or die "\nUsage: perl $0 -f FILE -p LIST -c -w\n\n",
"\t-f\t\tFILE\t PDF file to annotate\n",
"\t-p\t\tLIST\t comma-separated patterns\n",
"\t-c or -noc\t\t be case sensitive (default = no)\n",
"\t-w or -now\t\t whole words only (default = yes)\n";
my $re = Regexp::Assemble->new
->add(split(',', $csv))
->anchor_word($w_flag)
->flags($c_flag ? '' : 'i')
->re;
my $xml = qx/mudraw -ttt $file/;
my $tree = XMLin($xml, ForceArray => [qw/page block line span char/]);
my $pdf = CAM::PDF->new($file);
sub __num_nodes_list {
my $precision = shift;
[ map {CAM::PDF::Node->new('number', sprintf("%.${precision}f", $_))} @_ ]
}
sub add_highlight {
my ($idx, $x1, $y1, $x2, $y2) = @_;
my $p = $pdf->getPage($idx);
# mirror vertically to get to normal cartesian plane
my ($X1, $Y1, $X2, $Y2) = $pdf->getPageDimensions($idx);
($x1, $y1, $x2, $y2) = ($X1 + $x1, $Y2 - $y2, $X1 + $x2, $Y2 - $y1);
# corner radius
my $r = 2;
# AP appearance stream
my $s = "/GS0 gs 1 1 0 rg 1 1 0 RG\n";
$s .= "1 j @{[sprintf '%.0f', $r * 2]} w\n";
$s .= "0 0 @{[sprintf '%.1f', $x2 - $x1]} ";
$s .= "@{[sprintf '%.1f', $y2 - $y1]} re B\n";
my $highlight = CAM::PDF::Node->new('dictionary', {
Subtype => CAM::PDF::Node->new('label', 'Highlight'),
Rect => CAM::PDF::Node->new('array',
__num_nodes_list(1, $x1 - $r, $y1 - $r, $x2 + $r * 2, $y2 + $r * 2)),
QuadPoints => CAM::PDF::Node->new('array',
__num_nodes_list(1, $x1, $y2, $x2, $y2, $x1, $y1, $x2, $y1)),
BS => CAM::PDF::Node->new('dictionary', {
S => CAM::PDF::Node->new('label', 'S'),
W => CAM::PDF::Node->new('number', 0),
}),
Border => CAM::PDF::Node->new('array',
__num_nodes_list(0, 0, 0, 0)),
C => CAM::PDF::Node->new('array',
__num_nodes_list(0, 1, 1, 0)),
AP => CAM::PDF::Node->new('dictionary', {
N => CAM::PDF::Node->new('reference',
$pdf->appendObject(undef,
CAM::PDF::Node->new('object',
CAM::PDF::Node->new('dictionary', {
Subtype => CAM::PDF::Node->new('label', 'Form'),
BBox => CAM::PDF::Node->new('array',
__num_nodes_list(1, -$r, -$r, $x2 - $x1 + $r * 2,
$y2 - $y1 + $r * 2)),
Resources => CAM::PDF::Node->new('dictionary', {
ExtGState => CAM::PDF::Node->new('dictionary', {
GS0 => CAM::PDF::Node->new('dictionary', {
BM => CAM::PDF::Node->new('label',
'Multiply'),
}),
}),
}),
StreamData => CAM::PDF::Node->new('stream', $s),
Length => CAM::PDF::Node->new('number', length $s),
}),
),
0),
),
}),
});
$p->{Annots} ||= CAM::PDF::Node->new('array', []);
push @{$pdf->getValue($p->{Annots})}, $highlight;
$pdf->{changes}->{$p->{Type}->{objnum}} = 1
}
my $page_index = 1;
for my $page (@{$tree->{page}}) {
for my $block (@{$page->{block}}) {
for my $line (@{$block->{line}}) {
for my $span (@{$line->{span}}) {
my $string = join '', map {$_->{c}} @{$span->{char}};
while ($string =~ /$re/g) {
my ($x1, $y1) =
split ' ', $span->{char}->[$-[0]]->{bbox};
my (undef, undef, $x2, $y2) =
split ' ', $span->{char}->[$+[0] - 1]->{bbox};
add_highlight($page_index, $x1, $y1, $x2, $y2)
}
}
}
}
$page_index ++
}
$pdf->output($file =~ s/(.{4}$)/++$1/r);
__END__
P.s. I tagged the question with 'Perl' to maybe get some feedback (code corrections, etc.) from the community.
This is non-trivial. The problem is that PDF files are not meant to be "updated" at anything less than a page. You basically have to parse the page, adjust the rendered content, and then write it back out. I don't think pyPdf has support for doing what you want.
If "all" you want to do is to add highlighting you can probably just use the annotation dictionary. See the PDF specification for more information.
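As a sketch of what the annotation route involves: a highlight annotation is just a dictionary object appended to the page's /Annots array. The fields below follow the PDF specification (section 12.5); actually serializing them into the file is the part that needs a library (or the kind of low-level surgery shown in the Perl answer).

```python
def highlight_annotation(x1, y1, x2, y2):
    """Build the key/value pairs of a PDF highlight annotation.

    Coordinates are in default user space (origin at the bottom-left of
    the page). This is only an illustration of the required entries; a
    PDF library must write the dictionary into the page's /Annots array.
    """
    return {
        "/Type": "/Annot",
        "/Subtype": "/Highlight",
        "/Rect": [x1, y1, x2, y2],
        # One quadrilateral per highlighted region: the four corners in
        # the order upper-left, upper-right, lower-left, lower-right.
        "/QuadPoints": [x1, y2, x2, y2, x1, y1, x2, y1],
        "/C": [1, 1, 0],  # annotation colour in RGB: yellow
    }
```

Most viewers will render this without an appearance stream, though (as the Perl answer notes) some want an /AP entry as well.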
You might be able to do this using pyPDF2 but I haven't looked into it closely.
I am struggling with an SQL command issued from my python script. Here is what I have tried so far, the first example works fine but the rest do not.
#working SQL = "SELECT ST_Distance(ST_Transform(ST_GeomFromText(%s, 4326),27700),ST_Transform(ST_GeomFromText(%s, 4326),27700));"
#newPointSQL = "SELECT ST_ClosestPoint(ST_GeomFromText(%s),ST_GeomFromText(%s));"
#newPointSQL = "SELECT ST_As_Text(ST_ClosestPoint(ST_GeomFromText(%s), ST_GeomFromText(%s)));"
#newPointSQL = "SELECT ST_AsText(ST_ClosestPoint(ST_GeomFromEWKT(%s), ST_GeomFromText(%s)));"
#newPointSQL = "SELECT ST_AsText(ST_Line_Interpolate_Point(ST_GeomFromText(%s),ST_Line_Locate_Point(ST_GeomFromText(%s),ST_GeomFromText(%s))));"
newPointData = (correctionPathLine, pointToCorrect)  # i.e. ( MULTILINESTRING((-3.16427109855617 55.9273798550064,-3.16462372283029 55.9273883602162)), POINT(-3.164667 55.92739) )
My data is picked up ok because the first sql is successfull when executed. The problem is when I use the ST_ClosestPoint function.
Can anyone spot a misuse anywhere? Am I using ST_ClosestPoint in a wrong way?
In the last example, I did modify my data (in case someone notices) to run it but it still would not execute.
I don't know what kind of geometries you are dealing with, but I had the same trouble before with MultiLineStrings: I realized that when a MultiLineString can't be merged, the function ST_Line_Locate_Point doesn't work (you can find out whether a MultiLineString can be merged using the ST_LineMerge function). I've made a pl/pgSQL function based on an old mailing-list post, with some performance tweaks added. It only works with MultiLineStrings and LineStrings (but can easily be modified to work with Polygons). First it checks whether the geometry has only one dimension; if it does, you can use the old ST_Line_Interpolate_Point and ST_Line_Locate_Point combination. If not, you have to do the same for each LineString in the MultiLineString. I've also added an ST_LineMerge for pre-1.5 compatibility:
CREATE OR REPLACE FUNCTION ST_MultiLine_Nearest_Point(amultiline geometry,apoint geometry)
RETURNS geometry AS
$BODY$
DECLARE
mindistance float8;
adistance float8;
nearestlinestring geometry;
nearestpoint geometry;
simplifiedline geometry;
line geometry;
BEGIN
  simplifiedline := ST_LineMerge(amultiline);
  IF ST_NumGeometries(simplifiedline) <= 1 THEN
    nearestpoint := ST_Line_Interpolate_Point(simplifiedline, ST_Line_Locate_Point(simplifiedline, apoint));
    RETURN nearestpoint;
  END IF;
  -- Change mindistance according to your projection; it should be stupidly big
  mindistance := 100000;
  FOR line IN SELECT (ST_Dump(simplifiedline)).geom AS geom LOOP
    adistance := ST_Distance(apoint, line);
    IF adistance < mindistance THEN
      mindistance := adistance;
      nearestlinestring := line;
    END IF;
  END LOOP;
  RETURN ST_Line_Interpolate_Point(nearestlinestring, ST_Line_Locate_Point(nearestlinestring, apoint));
END;
$BODY$
LANGUAGE 'plpgsql' IMMUTABLE STRICT;
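For illustration, the same nearest-linestring idea in plain Python (plain coordinate tuples standing in for geometries; no PostGIS involved):

```python
def closest_point_on_segment(p, a, b):
    """Project point p onto segment a-b and clamp to the segment ends."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0:
        return a  # degenerate segment
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len2))
    return (ax + t * dx, ay + t * dy)

def closest_point_on_multiline(p, multiline):
    """multiline is a list of linestrings, each a list of (x, y) points.

    Mirrors the pl/pgSQL above: examine every member linestring, keep the
    candidate point with the smallest squared distance to p.
    """
    best, best_d2 = None, float("inf")
    for line in multiline:
        for a, b in zip(line, line[1:]):
            q = closest_point_on_segment(p, a, b)
            d2 = (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2
            if d2 < best_d2:
                best, best_d2 = q, d2
    return best
```

This is only a sketch of the algorithm; in the database you would of course stay with the PostGIS functions, which handle projections and geometry types properly.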
UPDATE:
As noted by @Nicklas Avén, ST_ClosestPoint() should work; ST_ClosestPoint was added in 1.5.