How can I parse a C header file with Perl? - python

I have a header file in which there is a large struct. I need to read this structure using some program and make some operations on each member of the structure and write them back.
For example I have some structure like
const BYTE Some_Idx[] = {
4,7,10,15,17,19,24,29,
31,32,35,45,49,51,52,54,
55,58,60,64,65,66,67,69,
70,72,76,77,81,82,83,85,
88,93,94,95,97,99,102,103,
105,106,113,115,122,124,125,126,
129,131,137,139,140,149,151,152,
153,155,158,159,160,163,165,169,
174,175,181,182,183,189,190,193,
197,201,204,206,208,210,211,212,
213,214,215,217,218,219,220,223,
225,228,230,234,236,237,240,241,
242,247,249};
Now, I need to read this and apply some operation on each of the member variable and create a new structure with different order, something like:
const BYTE Some_Idx_Mod_mul_2[] = {
8,14,20, ...
...
484,494,498};
Is there any Perl library already available for this? If not Perl, something else like Python is also OK.
Can somebody please help!!!

Keeping your data lying around in a header makes it trickier to get at using other programs like Perl. Another approach you might consider is to keep this data in a database or another file and regenerate your header file as-needed, maybe even as part of your build system. The reason for this is that generating C is much easier than parsing C, it's trivial to write a script that parses a text file and makes a header for you, and such a script could even be invoked from your build system.
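To illustrate the "generate rather than parse" idea, here is a minimal Python sketch. It assumes the numbers live in a plain text file separated by whitespace or commas; the helper names parse_values and render_header are made up for this example and are not from any library.

```python
def parse_values(text):
    """Read integers separated by whitespace and/or commas."""
    return [int(tok) for tok in text.replace(',', ' ').split()]


def render_header(name, values, per_line=8):
    """Format the values as a C array definition, per_line numbers per row."""
    rows = [','.join(str(v) for v in values[i:i + per_line])
            for i in range(0, len(values), per_line)]
    return 'const BYTE {}[] = {{\n{}}};\n'.format(name, ',\n'.join(rows))


# Hypothetical usage: double every value and emit the new array.
values = parse_values('4 7 10 15')
print(render_header('Some_Idx_Mod_mul_2', [v * 2 for v in values]))
```

A build step could run a script like this and redirect its output into the header, so the text file stays the single source of truth.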
Assuming that you want to keep your data in a C header file, you will need one of two things to solve this problem:
A quick one-off script that parses exactly (or close to exactly) the input you describe.
A general, well-written script that can parse arbitrary C and works on many different headers.
The first case seems more common than the second to me, but it's hard to tell from your question if this is better solved by a script that needs to parse arbitrary C or a script that needs to parse this specific file. For code that works on your specific case, the following works for me on your input:
#!/usr/bin/perl -w
use strict;
open FILE, "<header.h" or die $!;
my @file = <FILE>;
close FILE or die $!;
my $in_block = 0;
my $regex = 'Some_Idx\[\]';
my @byte_entries;
foreach my $line (@file) {
    chomp $line;
    # Opening line of the array: collect any digits that follow the brace
    if ( $line =~ /$regex.*\{(.*)/ ) {
        $in_block = 1;
        my @digits = @{ match_digits($1) };
        push @byte_entries, @digits;
        next;
    }
    # Inside the array: collect every number on the line
    if ( $in_block ) {
        my @digits = @{ match_digits($line) };
        push @byte_entries, @digits;
    }
    # A closing brace ends the array
    if ( $line =~ /\}/ ) {
        $in_block = 0;
    }
}
print "const BYTE Some_Idx_Mod_mul_2[] = {\n";
print join ",", map { $_ * 2 } @byte_entries;
print "};\n";
# Return a reference to the list of numbers found in the given text
sub match_digits {
    my $text = shift;
    my @digits;
    while ( $text =~ /(\d+),*/g ) {
        push @digits, $1;
    }
    return \@digits;
}
Parsing arbitrary C is a little tricky and not worth it for many applications, but maybe you need to actually do this. One trick is to let GCC do the parsing for you and read in GCC's parse tree using a CPAN module named GCC::TranslationUnit.
Here's the GCC command to compile the code, assuming you have a single file named test.c:
gcc -fdump-translation-unit -c test.c
Here's the Perl code to read in the parse tree:
use GCC::TranslationUnit;
# echo '#include <stdio.h>' > stdio.c
# gcc -fdump-translation-unit -c stdio.c
$node = GCC::TranslationUnit::Parser->parsefile('stdio.c.tu')->root;
# list every function/variable name
while($node) {
if($node->isa('GCC::Node::function_decl') or
$node->isa('GCC::Node::var_decl')) {
printf "%s declared in %s\n",
$node->name->identifier, $node->source;
}
} continue {
$node = $node->chain;
}

Sorry if this is a stupid question, but why worry about parsing the file at all? Why not write a C program that #includes the header, processes it as required, and then spits out the source for the modified header? I'm sure this would be simpler than the Perl/Python solutions, and it would be much more reliable because the header would be parsed by the C compiler's parser.

You don't really provide much information about how what is to be modified should be determined, but to address your specific example:
$ perl -pi.bak -we'if ( /const BYTE Some_Idx/ .. /;/ ) { s/(\d+)/$1 * 2/ge; s/Some_Idx/Some_Idx_Mod_mul_2/g; }' header.h
Breaking that down, -p says loop through input files, putting each line in $_, running the supplied code, then printing $_. -i.bak enables in-place editing, renaming each original file with a .bak suffix and printing to a new file named whatever the original was. -w enables warnings. -e'....' supplies the code to be run for each input line. header.h is the only input file.
In the perl code, if ( /const BYTE Some_Idx/ .. /;/ ) checks that we are in a range of lines beginning with a line matching /const BYTE Some_Idx/ and ending with a line matching /;/.
s/.../.../g does a substitution as many times as possible. /(\d+)/ matches a series of digits. The /e flag says the result ($1 * 2) is code that should be evaluated to produce a replacement string, instead of simply a replacement string. $1 is the digits that should be replaced.
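Since the question also allows Python, the same range logic can be sketched there. This is only an illustration: the function name double_array and its default arguments are invented for the example. The digits are doubled only after the opening brace on the declaration line, so the 2 in the new array name is not doubled as well.

```python
import re


def double_array(lines, name='Some_Idx', new_name='Some_Idx_Mod_mul_2'):
    """Rename the array and double every number between its declaration and ';'."""
    def double(text):
        return re.sub(r'\d+', lambda m: str(int(m.group()) * 2), text)

    out = []
    in_range = False
    for line in lines:
        starting = False
        if not in_range and re.search(r'const BYTE ' + re.escape(name) + r'\b', line):
            in_range = starting = True
        if in_range:
            ends_here = ';' in line
            if starting:
                # Rename left of the brace, double only what follows it.
                head, brace, tail = line.partition('{')
                line = head.replace(name, new_name) + brace + double(tail)
            else:
                line = double(line)
            if ends_here:
                in_range = False
        out.append(line)
    return out


sample = ['const BYTE Some_Idx[] = {\n', '4,7,\n', '10};\n']
print(''.join(double_array(sample)), end='')
```

Lines outside the range pass through untouched, which mirrors what -p does in the one-liner.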

If all you need to do is to modify structs, you can directly use regex to split and apply changes to each value in the struct, looking for the declaration and the ending }; to know when to stop.
If you really need a more general solution, you could use a parser generator such as PyParsing.

There is a Perl module called Parse::RecDescent which is a very powerful recursive descent parser generator. It comes with a bunch of examples. One of them is a grammar that can parse C.
Now, I don't think this matters in your case, but the recursive descent parsers using Parse::RecDescent are algorithmically slower (O(n^2), I think) than tools like Parse::Yapp or Parse::EYapp. I haven't checked whether Parse::EYapp comes with such a C-parser example, but if so, that's the tool I'd recommend learning.

Python solution (not full, just a hint ;)) Sorry if any mistakes - not tested
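import re
text = open('your file.c').read()
# group 1: everything up to the opening brace, group 2: the values, group 3: the closing "};"
patt = r'(?is)(.*?{)(.*?)(}\s*;)'
m = re.search(patt, text)
g1, g2, g3 = m.group(1), m.group(2), m.group(3)
# int() ignores the surrounding whitespace and newlines; convert back to str for join()
g2 = [str(int(i) * 2) for i in g2.split(',')]
out = open('your file 2.c', 'w')
out.write(g1 + ','.join(g2) + g3)
out.close()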

There is a really useful Perl module called Convert::Binary::C that parses C header files and converts structs from/to Perl data structures.

You could always use pack / unpack, to read, and write the data.
#! /usr/bin/env perl
use strict;
use warnings;
use autodie;
my @data;
{
    open( my $file, '<', 'Some_Idx.bin' );
    local $/ = \1; # read one byte at a time
    while( my $byte = <$file> ){
        push @data, unpack('C',$byte);
    }
    close( $file );
}
print join(',', @data), "\n";
{
    open( my $file, '>', 'Some_Idx_Mod_mul_2.bin' );
    # You have two options (use one of them, not both)
    for my $byte( @data ){
        print $file pack 'C', $byte * 2;
    }
    # or
    # print $file pack 'C*', map { $_ * 2 } @data;
    close( $file );
}

For the GCC::TranslationUnit example see hparse.pl from http://gist.github.com/395160
which will make it into C::DynaLib, and the not yet written Ctypes also.
This parses functions for FFI's, and not bare structs contrary to Convert::Binary::C.
hparse will only add structs if used as func args.

Related

Romanize generic Japanese in commandline

I would like to transliterate generic Japanese, including Kanji, by the standard Hepburn system on the bash command line.
I've evaluated several options, but
Google Translator (available via Translate Shell) is only accurate at Hiragana / Katakana
KAKASI delivers ASCII, but no transliteration (so Toukyou instead of Tōkyō)
So I would like to parse the output of http://nihongo.j-talk.com
The output is in div.outputwrap or div.output
If it's futile to do this purely with Bash tools (curl / jq?), how could I reach this with Python / BeautifulSoup?
Sorry for giving no snippet, I have no clue how to POST data to a website AND use the result if there is no API.
Looking at the HTML source of http://nihongo.j-talk.com, I've made a guess at the API.
Here are the steps:
1) Send a Japanese string to the server by wget and obtain the result in index.html.
2) Parse the index.html and extract Romaji strings.
Here is the sample code:
#!/bin/bash
string="日本語は、主に日本で使われている言語である。日本では法規によって「公用語」として規定されているわけではないが、各種法令(裁判所法第74条、会社計算規則第57条、特許法施行規則第2条など)において日本語を用いることが定められるなど事実上の公用語となっており、学校教育の「国語」でも教えられる。"
uniqid="46a7e5f7e7c7d8a7d9636ecb077da485479b66bc"
wget -N --post-data "uniqid=$uniqid&Submit='Translate Now'&kanji_parts=standard&kanji=$string&converter=spaced&kana_output=romaji" http://nihongo.j-talk.com/ > /dev/null 2>&1
perl -e '
$file = "index.html";
open(FH, $file) or die "$file: $!\n";
while (<FH>) {
if (/<div id=.spaced. class=.romaji.>(.+)/) {
($str = $1) =~ s/<.*?>//g;
$str =~ s/\&\#(\d+);/&utfconv($1)/eg;
print $str, "\n";
}
}
# numeric character reference (code point up to U+07FF) to 2-byte UTF-8
sub utfconv {
$utf16 = shift;
my $upper = ($utf16 >> 6) & 0b0001_1111 | 0b1100_0000;
my $lower = $utf16 & 0b0011_1111 | 0b1000_0000;
pack("C2", $upper, $lower);
}'
Some comments:
- I wrote the parser in Perl only because I am familiar with it; you can modify it or port it to another language, as long as it reads the index.html file.
- The uniqid string is what I picked out of the site's HTML source. If it doesn't work, check which value is currently embedded in the HTML source.
Hope this helps.
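Since the question asked about Python, here is a rough sketch of the same two steps using only the standard library. The form field names are copied from the wget call above, and the div pattern from the Perl parser; whether the site still accepts them is an assumption I have not verified. The function names fetch_page and extract_romaji are made up for this example.

```python
import html
import re
import urllib.parse
import urllib.request


def fetch_page(japanese, uniqid):
    """POST the form like the wget call does (assumed fields, untested against the live site)."""
    fields = {
        'uniqid': uniqid,
        'Submit': 'Translate Now',
        'kanji_parts': 'standard',
        'kanji': japanese,
        'converter': 'spaced',
        'kana_output': 'romaji',
    }
    data = urllib.parse.urlencode(fields).encode('utf-8')
    with urllib.request.urlopen('http://nihongo.j-talk.com/', data) as resp:
        return resp.read().decode('utf-8', errors='replace')


def extract_romaji(page):
    """Return the romaji lines with tags removed and character references decoded."""
    results = []
    for line in page.splitlines():
        m = re.search(r'<div id=.spaced. class=.romaji.>(.+)', line)
        if m:
            text = re.sub(r'<.*?>', '', m.group(1))  # drop remaining tags
            results.append(html.unescape(text))      # e.g. &#333; becomes ō
    return results
```

html.unescape handles every numeric character reference, so no hand-written UTF-8 conversion is needed. Calling extract_romaji(fetch_page(text, uniqid)) would chain the two steps.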

Trying to read DBM file

I have a stripped down real-time Linux box that interfaces with some hardware.
The configuration files are *.dbm files and I cannot access them. They seem to be some sort of key-value database but every library I have tried has come up empty.
I have tried the DBM reading libraries from Perl, Python, and Ruby with no luck. Any guidance on these files would be great, I have not seen them before.
This is what happens when I cat one file out.
DBMFILE Aug 31 2004,�
,jy �
�~���"��+�K&��gB��7JJ�
,��GLOBA.PB_COMBI�SMSI���
JG]
,��BUS_DP
PC �
'
xLokalT
J��
,��SL_DP
PC!�
��
#,��PLC_PARAMJPf,��PROJEKT�PROFIBUS new network1.00022.02.2012J,��KBL_HEADER:�JJp,��KBLJ��,��ALI-SETUPB ����
������������������JJ,,��OBJ-DEFJJ��,��ALI_CLIENTTJJ�
,��ALI_SERVERJ J\r�����2, �� ST_OV_00Boolean0Integer8 0Integer16
0Integer32
0Unsigned8
0Unsigned32Floating-Point0igned16
Octet String Jo� ,��DESCRIPT �ABB OyABB Drives RPBA-01ABBSlave1***reserved***�
�
%
Edit: to show what I've already tried (each attempt only produced empty objects, with no key-value pairs):
perl -
#!/usr/bin/perl -w
use strict;
use DB_File;
use GDBM_File;
my ($filename, %hash, $flags, $mode, $DB_HASH) = @ARGV;
tie %hash, 'DB_File', [$filename, $flags, $mode, $DB_HASH]
or die "Cannot open $filename: $!\n";
while ( my($key, $value) = each %hash ) {
print "$key = $value\n";
}
# these unties happen automatically at program exit
untie %hash;
which returns nothing
python -
db = dbm.open('file', 'c')
ruby -
db = DBM.open('file', 666, DBM::CREATRW)
Every one of these returned empty. I assume they use the same low level library. Some history/context on DBM files would be great as there seems to be some different versions.
**Edit
running file on it returns
$ file abb12mb_uncontrolledsynch_ppo2_1slave.dbm
abb12mb_uncontrolledsynch_ppo2_1slave.dbm: data
and running strings outputs
$ strings abb12mb_uncontrolledsynch_ppo2_1slave.dbm
DBMFILE
Aug 31 2004
GLOBAL
PB_COMBI
SMSI
BUS_DP
Lokal
SL_DP
PLC_PARAM
PROJEKT
PROFIBUS new network
1 .000
22.02.2012
KBL_HEADER
ALI-SETUP
OBJ-DEF
ALI_CLIENT
ALI_SERVER
ST_OV_0
Boolean
Integer8
Integer16
Integer32
Unsigned8
Unsigned16
Unsigned32
Floating-Point
Octet String
DESCRIPT
ABB Oy
ABB Drives RPBA-01
ABBSlave1
***reserved***
Just to make my comment clear, you should try using the default options for DB_File, like this
use strict;
use warnings;
use DB_File;
my ($filename) = @ARGV;
tie my %dbm, 'DB_File', $filename or die qq{Cannot open DBM file "$filename": $!};
print "$_\n" for keys %dbm;
From the documentation for Perl's dbmopen function:
[This function has been largely superseded by the tie function.]
You probably want to try tying it with DB_File.
use DB_File;
tie %hash, 'DB_File', $filename, $flags, $mode, $DB_HASH;
Then your data is in %hash.
Might also be interesting to run file against the file to see what it actually is.
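On the Python side, the standard library has a similar sniffing helper, dbm.whichdb. A small sketch (the wrapper describe_dbm and the sample bytes are made up for illustration): it returns None when the file cannot be read, an empty string when the format is not recognized, and otherwise the name of the module that should open it.

```python
import dbm
import os
import tempfile


def describe_dbm(path):
    """Summarise what dbm.whichdb thinks about the file at path."""
    kind = dbm.whichdb(path)
    if kind is None:
        return 'unreadable or missing'
    if kind == '':
        return 'not a recognized dbm format'
    return kind


# Demo on a throwaway file that starts like the dump in the question.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, 'sample.dbm')
    with open(sample, 'wb') as fh:
        fh.write(b'DBMFILE Aug 31 2004,\x00\x01\x02\x03')
    print(describe_dbm(sample))
```

If the real file also comes back unrecognized, that would fit with file reporting plain "data": the .dbm extension may just be a vendor-specific format rather than one of the Unix dbm variants.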

more efficient way to extract lines from a file whose first column matches another file

I have two files, target and clean.
Target has some 1055772 lines, each of which has 3000 columns, tab separated. (size is 7.5G)
Clean is slightly shorter at 806535. Clean only has one column, that matches the format of the first column of Target. (size is 13M)
I want to extract the lines of target that has a matching first column to a line in clean.
I wrote a grep-based loop to do this but it's painfully slow. Speedups will be rewarded with upvotes and/or smileys.
clean="/path/to/clean"
target="/path/to/target"
oFile="/output/file"
head -1 "$target" > "$oFile"
cat "$clean" | while read snp; do
    echo "$snp"
    grep "$snp" "$target" >> "$oFile"
done
$ head $clean
1_111_A_G
1_123_T_A
1_456_A_G
1_7892_C_G
Edit: Wrote a simple python script to do it.
clean_variants_file = "/scratch2/vyp-scratch2/cian/UCLex_August2014/clean_variants"
allChr_file = "/scratch2/vyp-scratch2/cian/UCLex_August2014/allChr_snpStats"
outfile = open("/scratch2/vyp-scratch2/cian/UCLex_August2014/results.tab","w")
clean_variant_dict = {}
for line in open(clean_variants_file):
    clean_variant_dict[line.strip()] = 0
for line in open(allChr_file):
    ll = line.strip().split("\t")
    id_ = ll[0]
    if id_ in clean_variant_dict:
        outfile.write(line)
outfile.close()
This Perl solution uses quite a lot of memory (because the entire clean file is loaded into a hash), but it reads each file only once. Each line of the clean file is stored as a hash key, and every line of the target file is then checked against that hash. Note that this code is not thoroughly tested, but seems to work on a limited set of data.
use strict;
use warnings;
my ($clean, $target) = @ARGV;
open my $fh, "<", $clean or die "Cannot open file '$clean': $!";
my %seen;
while (<$fh>) {
chomp;
$seen{$_}++;
}
open $fh, "<", $target
or die "Cannot open file '$target': $!"; # reuse file handle
while (<$fh>) {
my ($first) = /^([^\t]*)/;
print if $seen{$first};
}
If your target file is proper tab separated CSV data, you could use Text::CSV_XS which reportedly is very fast.
python solution:
with open('/path/to/clean', 'r') as fin:
    keys = set(fin.read().splitlines())
with open('/path/to/target', 'r') as fin, open('/output/file', 'w') as fout:
    for line in fin:
        if line[:line.index('\t')] in keys:
            fout.write(line)
Using a perl one-liner:
perl -F'\t' -lane '
BEGIN{ local @ARGV = pop; @s{<>} = () }
print if exists $s{"$F[0]\n"}
' target clean
Switches:
-F: Alternate pattern for -a switch
-l: Enable line ending processing
-a: Splits each line (on whitespace by default, or on the -F pattern) and loads the fields into the array @F
-n: Creates a while(<>){...} loop for each “line” in your input file.
-e: Tells perl to execute the code on command line.
Or as a perl script:
use strict;
use warnings;
die "Usage: $0 target clean\n" if @ARGV != 2;
my %s = do {
local @ARGV = pop;
map {$_ => 1} (<>)
};
while (<>) {
my ($f) = split /\t/;
print if $s{"$f\n"}
}
For fun, I thought I would convert a solution or two into Perl6.
Note: These are probably going to be slower than the originals until Rakudo/NQP gets more optimizations, which really only started in earnest fairly recently at the time of posting.
First is TLP's Perl5 answer converted nearly one-to-one into Perl6.
#! /usr/bin/env perl6
# I have a link named perl6 aliased to Rakudo on MoarVM-jit
use v6;
multi sub MAIN ( Str $clean, Str $target ){ # same as the Perl5 version
MAIN( :$clean, :$target ); # call the named version
}
multi sub MAIN ( Str :$clean!, Str :$target! ){ # using named arguments
note "Processing clean file";
my %seen := SetHash.new;
for open( $clean, :r ).lines -> $line {
next unless $line.chars; # skip empty lines
%seen{$line}++;
}
note "Processing target file";
for open( $target, :r ).lines -> $line {
$line ~~ /^ $<first> = <-[\t]>+ /;
say $line if %seen{$<first>.Str};
}
}
I used MAIN subroutines so that you will get a Usage message if you don't give it the correct arguments.
I also used a SetHash instead of a regular Hash to reduce memory use since we don't need to know how many we have found, only that they were found.
Next I tried to combine all of the lines in the clean file into one regex.
This is similar to the sed and grep answer from Cyrus, except instead of many regexes there is only one.
I didn't want to change the subroutine that I had already written, so I added one that is differentiated by adding --single-regex or -s to the command line. ( All of the examples are in the same file )
multi sub MAIN ( Str :$clean!, Str :$target!, Bool :single-regex(:s($))! ){
note "Processing clean file";
my $regex;
{
my @regex = open( $clean, :r ).lines.grep(*.chars);
$regex = /^ [ | @regex ] /;
} # throw away @regex
note "Processing target file";
for open( $target, :r ).lines -> $line {
say $line if $line ~~ $regex;
}
}
I will say that I took quite a bit longer to write this than it would have taken me to write it in Perl5. Most of the time was taken up searching for some idioms online, and looking over the source files for Rakudo. I don't think it would take much effort to get better at Perl6 than Perl5.

Make math on numbers being in the specific part of a file

I have many files containing :
data: numbers that I have to use/manipulate, formatted in a specific way, specified in the following,
rows that I need just as they are (configurations of the software use these files).
The files most of time are huge, many millions of rows, and can't be handled fast enough with bash. I have made a script that checks each line to see if it's data, writing them to another file (without calculations), but it's very slow (many thousand rows per second).
The data is formatted in a way like this:
text
text
(
($data $data $data)
($data $data $data)
($data $data $data)
)
text
text
(
($data $data $data)
($data $data $data)
)
text
( text )
( text )
(text text)
I have to make another file, using $data, that should be the results of some operation with it.
The portions of file that contains numbers can be distinguished by the presence of this occurrence:
(
(
and the same:
)
)
at the end.
I've made before a C++ program that makes the operation I want, but for files containing columns of numbers only. I don't know how to ignore the text that I don't have to modify and handle the way the data is formatted.
Where do I have to look to solve my problem smartly?
Which should be the best way to handle data files, formatted in different ways, and make math with them? Maybe Python?
Are you sure that the shell isn't fast enough? Maybe your bash just needs improved. :)
It appears that you want to print every line after a line with just a ( until you get to a closing ). So...
#!/usr/bin/ksh
print=0
while read
do
if [[ "$REPLY" == ')' ]]
then
print=0
elif [[ "$print" == 1 ]]
then
echo "${REPLY//[()]/}"
elif [[ "$REPLY" == '(' ]]
then
print=1
fi
done
exit 0
And, with your provided test data:
danny@machine:~$ ./test.sh < file
$data $data $data
$data $data $data
$data $data $data
$data $data $data
$data $data $data
I'll bet you'll find that to be roughly as fast as anything else you would write. If I was going to be using this often, I'd be inclined to add several more error checks - but if your data is well-formed, this will work fine.
Alternatively, you could just use sed.
danny@machine:~$ sed -n '/^($/,/^)$/{/^[()]$/d;s/[()]//gp}' file
$data $data $data
$data $data $data
$data $data $data
$data $data $data
$data $data $data
performance note edit:
I was comparing python implementations below, so I thought I'd test these as well. The sed solution runs about identically to the fastest python implementation on the same data - less than one second (0.9 seconds) to filter ~80K lines. The bash version takes 42.5 seconds to do it. However, just replacing #!/bin/bash with #!/usr/bin/ksh above (which is ksh93, on Ubuntu 13.10) and making no other changes to the script reduces runtime down to 10.5 seconds. Still slower than python or sed, but that's part of why I hate scripting in bash.
I also updated both solutions to remove the opening and closing parens, to be more consistent with the other answers.
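The question also asks for the untouched rows to be kept and for some math to be applied to the data. A hedged Python sketch of that variant follows. It assumes each data row looks like "(a b c)" with plain numbers, and the name transform and the op callback are invented for this example.

```python
def transform(lines, op):
    """Yield every line; inside a '(' ... ')' block, apply op to each number."""
    in_block = False
    for line in lines:
        stripped = line.strip()
        if stripped == ')':
            in_block = False
            yield line
        elif in_block:
            # A data row such as "(1 2 3)": convert, apply op, re-wrap.
            nums = [op(float(tok)) for tok in stripped.strip('()').split()]
            yield '(' + ' '.join(format(n, 'g') for n in nums) + ')\n'
        else:
            if stripped == '(':
                in_block = True
            yield line  # text and configuration rows pass through unchanged


sample = ['text\n', '(\n', '(1 2 3)\n', ')\n', '( text )\n']
print(''.join(transform(sample, lambda x: x * 2)), end='')
```

Because it is a generator over lines, it can be fed an open file object directly and never holds the whole file in memory.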
Here is something which should perform well on huge data, and it's using Python 3:
#!/usr/bin/python3
import mmap
fi = open('so23434490in.txt', 'rb')
m = mmap.mmap(fi.fileno(), 0, access=mmap.ACCESS_READ)
fo = open('so23434490out.txt', 'wb')
p2 = 0
while True:
    p1 = m.find(b'(\n(', p2)
    if p1 == -1:
        break
    p2 = m.find(b')\n)', p1)
    if p2 == -1:
        break  # unmatched opening sequence!
    data = m[p1+3:p2]
    data = data.replace(b'(', b'').replace(b')', b'')
    # Now decide: either do some computation on that data in Python
    for line in data.split(b'\n'):
        cols = list(map(float, line.split()))
        # perform some operation on cols
    # Or simply write out the data to use it as input for your C++ code
    fo.write(data)
    fo.write(b'\n')
fo.close()
m.close()
fi.close()
This uses mmap to map the file into memory. Then you can access it easily without having to worry about reading it in. It also is very efficient, since it can avoid unneccessary copying (from the page cache to the application heap).
I guess we need a perl solution, too.
#!/usr/bin/perl
my $p=0;
while(<STDIN>){
if( /^\)\s*$/ ){
$p = 0;
}
elsif( $p ){
s/[()]//g;
print;
}
elsif( /^\(\s*$/ ){
$p = 1;
}
}
On my system, this runs slightly slower than the fastest python implementation from above (while also doing the parenthesis removal), and about the same as
sed -n '/^($/,/^)$/{/^[()]$/d;s/[()]//gp}'
Using C provides much better speed than bash/ksh or C++(or Python, even though saying that stings). I created a text file containing 18 million lines containing the example text duplicated 1 million times. On my laptop, this C program works with the file in 1 second, while the Python version takes 5 seconds, and running the bash version under ksh(because it's faster than bash) with the edits mentioned in that answer's comments takes 1 minute 20 seconds(a.k.a 80 seconds). Note that this C program doesn't check for errors at all except for the non-existent file. Here it is:
#include <string.h>
#include <stdio.h>
#define BUFSZ 1024
// I highly doubt there are lines longer than 1024 characters
int main()
{
int is_area=0;
char line[BUFSZ];
FILE* f;
if ((f = fopen("out.txt", "r")) != NULL)
{
while (fgets(line, BUFSZ, f))
{
if (line[0] == ')') is_area=0;
else if (is_area) fputs(line, stdout); // NO NEWLINE!
else if (strcmp(line, "(\n") == 0) is_area=1;
}
}
else
{
fprintf(stderr, "THE SKY IS FALLING!!!\n");
return 1;
}
return 0;
}
If the fact it's completely unsafe freaks you out, here's a C++ version, which took 2 seconds:
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
// ^ FYI, the above is a bad idea, but I'm trying to preserve clarity
int main()
{
ifstream in("out.txt");
string line;
bool is_area(false);
while (getline(in, line))
{
if (line[0] == ')') is_area = false;
else if (is_area) cout << line << '\n';
else if(line == "(") is_area = true;
}
return 0;
}
EDIT: As MvG pointed out in the comments, I wasn't benching the Python version fairly. It doesn't take 24 seconds as I originally stated, but 5 instead.

Simple command line handling equivalent of Perl in Python

I have done some basic Perl coding but never something in python. I would like to do the equivalent of sending the file to be read from in the command line option. This file is tab delimited, so split each column and then be able to perform some operation in those columns.
The perl code for doing this is
#!/usr/bin/perl
use warnings;
use strict;
while(<>) {
chomp;
my #H = split /\t/;
my $col = $H[22];
if($H[30] eq "Good") {
some operation in col...
}
else {
do something else
}
}
What would be the python equivalent of this task?
Edit: I need the H[22] column to be a unicode character. How do I make col variable to be so?
#!/usr/bin/python
# file: process_columns.py
import fileinput
for line in fileinput.input():
    cols = line.split('\t')
    # do something with the columns
The snippet above can be used this way
./process_columns.py < data
or just
./process_columns.py data
Related to: Python equivalent of Perl's while (<>) {...}?
#!/usr/bin/env python
import fileinput
for line in fileinput.input():
    line = line.rstrip("\r\n")  # equiv of chomp
    H = line.split('\t')
    if H[30]=='Good':
        # some operation in col
        # first - what do you get from this?
        print repr(H[22])
        # second - what do you get from this?
        print unicode(H[22], "Latin-1")
    else:
        # do something else
        pass  # only necessary if no code in this section
Edit: at a guess, you are reading a byte-string and must properly encode it to a unicode string; the proper way to do this depends on what format the file is saved in and what your localization settings are. Also see Character reading from file in Python
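The code above is Python 2. In Python 3 the same idea could be sketched as follows: read the line as bytes, split on tabs, and decode only the column you need. The helper name column_as_text is made up, and the encoding is a guess you would replace with whatever the file actually uses.

```python
def column_as_text(raw_line, index, encoding='utf-8'):
    """Decode one tab-separated column of a raw byte line into a str."""
    fields = raw_line.rstrip(b'\r\n').split(b'\t')
    return fields[index].decode(encoding)


# Hypothetical line with a Latin-1 encoded second column.
raw = 'id\tcaf\u00e9\tGood\n'.encode('latin-1')
print(column_as_text(raw, 1, 'latin-1'))
```

If every column shares one encoding, opening the file with open(path, encoding=...) and splitting the resulting str is simpler still.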
