I have one program written in C++ that outputs the data from several different types of arrays. For simplicity, I'm using ints and just writing them out one at a time to figure this out.
I need to be able to read the file in on Python, but clearly am missing something. I'm having trouble translating the concepts from C++ over to Python.
This is the C++ I have that's working - it writes out two numbers to a file and then reads that file back in (yes, I have to use the ostream.write() and istream.read() functions - that's how the library I'm using does it at the base level, and I can't change it).
int main(int argc, char **argv) {
    std::ofstream fout;
    std::ifstream fin;
    int outval1 = 1234;
    int outval2 = 5678;

    fout.open("out.txt");
    fout.write(reinterpret_cast<const char*>(&outval1), sizeof(int));
    fout.write(reinterpret_cast<const char*>(&outval2), sizeof(int));
    fout.close();

    int inval;
    fin.open("out.txt");
    while (fin.read(reinterpret_cast<char*>(&inval), sizeof(int))) {
        std::cout << inval << std::endl;
    }
    fin.close();

    return 0;
}
This is what I have on the Python side, but I know it's not correct. I don't think I should need to read it in as binary, but that's the only way it's working so far.
with open("out.txt", "rb") as f:
while (byte := f.read(1)):
print(byte)
In the simple case you have provided, it is easy to write Python code that reads out 1234 and 5678 (assuming sizeof(int) is 4 bytes) using int.from_bytes. You do need to open the file in binary mode, since the file contains raw bytes rather than text.
import sys

with open("out.txt", "rb") as f:
    while (byte := f.read(4)):
        print(int.from_bytes(byte, sys.byteorder))
To deal with floats, you may want to try struct.unpack:
import struct

with open("out.txt", "rb") as f:
    data = f.read(4)
    print(struct.unpack("f", data)[0])  # interpret 4 bytes as a C float
I am currently working in a Yocto Linux build and am trying to interface with a hardware block on an FPGA. This block imitates an SD card with a FAT16 file system on it, containing a single file (cam.raw). This file represents the shared memory space between the FPGA and the Linux system. As such, I want to be able to write data from the Linux system to this memory and get back any changes the FPGA might make (currently the FPGA simply takes part of the data from the memory space and adds 6 to the LSB of a 32-bit word, so if I write 0x40302010 I should get back 0x40302016 when I read the data back). However, due to some caching somewhere, while I can write the data to the FPGA, I cannot immediately get back the result.
I am currently doing something like this (using Python because it's easy):
% mount /dev/mmcblk1 /memstick
% python
>>> import mmap
>>> import os
>>> f = os.open("/memstick/cam.raw", os.O_RDWR | os.O_DIRECT)
>>> m = mmap.mmap(f, 0)
>>> for i in xrange(1024):
...     m[i] = chr(i % 256)
...
>>> m.flush() # Make sure data goes from Linux to FPGA
>>> hex(ord(m[0])) # Should be 0x6
'0x0'
I can confirm with dd that the data is changed (though I frequently run into buffering issues with that too), and using the tools for the FPGA (SignalTap/ChipScope) that I am indeed getting the correct answer (i.e., the first 32-bit word in this case is 0x03020106). However, something, whether it's Python or Linux or both, is buffering the file and not reading from the "SD card" (FPGA) again, serving the file data from memory instead. I need to shut this off completely so that all reads result in reads from the FPGA, but I'm not sure where the buffering is taking place or how to do that.
Any insight would be appreciated! (Note: I can use mmap.flush() to push any data I write from Python out to the FPGA, but I need something like a reverse flush to have it re-read the file data into the mmap!)
Update:
As suggested in the comments, the mmap approach might not be the best one for what I need. However, I have now tried both Python and C versions using basic I/O functions (os.read/os.write in Python, read/write in C) with the O_DIRECT flag. For most of these operations, I end up getting errno 22. Still looking into this...
After doing some digging, I found out what I was doing wrong with the O_DIRECT flag. In my C and Python versions, I wasn't using memalign to create the buffer and wasn't doing block-sized reads/writes. This post has a good explanation:
How can I read a file with read() and O_DIRECT in C++ on Linux?
So, in order to achieve what I am doing, this C program works as a basic example:
#define _GNU_SOURCE   /* needed for O_DIRECT */
#include <stdio.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>   /* read, write, lseek */
#include <malloc.h>   /* memalign */

#define BLKSIZE 512

int main() {
    int fd;
    int x;
    char* buf;

    fd = open("/home/root/sd/fpga/cam.raw", O_RDWR | O_SYNC | O_DIRECT);
    if (fd < 0) {   /* open() returns -1 on failure */
        printf("Oh noes, no file!\n");
        return -1;
    }
    printf("%d %d\n", fd, errno);

    /* O_DIRECT needs a block-aligned buffer and block-sized transfers */
    buf = (char*) memalign(BLKSIZE, BLKSIZE*2);
    if (!buf) {
        printf("Oh noes, no buf!\n");
        return -1;
    }

    x = read(fd, buf, BLKSIZE);
    printf("%d %d %x %x %x %x\n", x, errno, buf[0], buf[1], buf[2], buf[3]);

    lseek(fd, 0, 0);
    buf[0] = '1';
    buf[1] = '2';
    buf[2] = '3';
    buf[3] = '4';
    x = write(fd, buf, BLKSIZE);
    printf("%d %d\n", fd, errno);

    lseek(fd, 0, 0);
    x = read(fd, buf, BLKSIZE);
    printf("%d %d %x %x %x %x\n", x, errno, buf[0], buf[1], buf[2], buf[3]);
    return 0;
}
This will work for my purposes; I didn't look into how to do proper memory alignment to use Python's os.read/os.write functions in a similar way.
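For anyone who hits the same wall: one plausible way to get a suitably aligned buffer in Python is to read into an anonymous mmap, which the kernel page-aligns. This is only a sketch under assumptions (Python 3 on Linux, a 512-byte logical block size, and the same cam.raw path as the C example), untested against the actual FPGA:

import mmap
import os

BLKSIZE = 512
PATH = "/home/root/sd/fpga/cam.raw"  # same file as the C example

fd = os.open(PATH, os.O_RDWR | os.O_SYNC | os.O_DIRECT)

# An anonymous mmap is page-aligned (4096 bytes), which also satisfies the
# 512-byte alignment O_DIRECT wants; a plain os.read() buffer would not be.
buf = mmap.mmap(-1, BLKSIZE)

n = os.readv(fd, [buf])           # direct read from the device into buf
print(n, buf[:4].hex())

buf[0:4] = b"1234"                # modify the first 32-bit word
os.lseek(fd, 0, os.SEEK_SET)
os.writev(fd, [buf])              # direct write back to the device

os.lseek(fd, 0, os.SEEK_SET)
os.readv(fd, [buf])               # re-read to pick up the FPGA's changes
print(buf[:4].hex())

os.close(fd)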
I have two files, target and clean.
Target has some 1055772 lines, each of which has 3000 tab-separated columns (size is 7.5G).
Clean is slightly shorter at 806535 lines. It has only one column, which matches the format of the first column of Target (size is 13M).
I want to extract the lines of target whose first column matches a line in clean.
I wrote a grep-based loop to do this, but it's painfully slow. Speedups will be rewarded with upvotes and/or smileys.
clean = "/path/to/clean"
target = "/path/to/target"
oFile = "/output/file"
head -1 $target > $oFile
cat $clean | while read snp; do
echo $snp
grep $snp $target >> $oFile
done
$ head $clean
1_111_A_G
1_123_T_A
1_456_A_G
1_7892_C_G
Edit: Wrote a simple python script to do it.
clean_variants_file = "/scratch2/vyp-scratch2/cian/UCLex_August2014/clean_variants"
allChr_file = "/scratch2/vyp-scratch2/cian/UCLex_August2014/allChr_snpStats"
outfile = open("/scratch2/vyp-scratch2/cian/UCLex_August2014/results.tab", "w")

clean_variant_dict = {}
for line in open(clean_variants_file):
    clean_variant_dict[line.strip()] = 0

for line in open(allChr_file):
    ll = line.strip().split("\t")
    id_ = ll[0]
    if id_ in clean_variant_dict:
        outfile.write(line)

outfile.close()
This Perl solution would use quite a lot of memory (because we load the entire clean file into a hash), but it only needs one pass over each file instead of one grep per key. It uses a hash for the lookup, where each clean line is stored as a key. Note that this code is not thoroughly tested, but it seems to work on a limited set of data.
use strict;
use warnings;

my ($clean, $target) = @ARGV;

open my $fh, "<", $clean or die "Cannot open file '$clean': $!";

my %seen;
while (<$fh>) {
    chomp;
    $seen{$_}++;
}

open $fh, "<", $target
    or die "Cannot open file '$target': $!";    # reuse file handle

while (<$fh>) {
    my ($first) = /^([^\t]*)/;
    print if $seen{$first};
}
If your target file is proper tab-separated CSV data, you could use Text::CSV_XS, which reportedly is very fast.
python solution:
with open('/path/to/clean', 'r') as fin:
    keys = set(fin.read().splitlines())

with open('/path/to/target', 'r') as fin, open('/output/file', 'w') as fout:
    for line in fin:
        if line[:line.index('\t')] in keys:
            fout.write(line)
Using a perl one-liner:
perl -F'\t' -lane '
    BEGIN{ local @ARGV = pop; @s{<>} = () }
    print if exists $s{"$F[0]\n"}
' target clean
Switches:
-F: Alternate pattern for -a switch
-l: Enable line ending processing
-a: Splits the line on the -F pattern (here a tab) and loads the fields into the array @F
-n: Creates a while(<>){...} loop for each “line” in your input file.
-e: Tells perl to execute the code on command line.
Or as a perl script:
use strict;
use warnings;

die "Usage: $0 target clean\n" if @ARGV != 2;

my %s = do {
    local @ARGV = pop;
    map {$_ => 1} (<>)
};

while (<>) {
    my ($f) = split /\t/;
    print if $s{"$f\n"}
}
For fun, I thought I would convert a solution or two into Perl6.
Note: These are probably going to be slower than the originals until Rakudo/NQP gets more optimizations, which really only started in earnest fairly recently at the time of posting.
First is TLP's Perl5 answer converted nearly one-to-one into Perl6.
#! /usr/bin/env perl6
# I have a link named perl6 aliased to Rakudo on MoarVM-jit
use v6;

multi sub MAIN ( Str $clean, Str $target ){ # same as the Perl5 version
    MAIN( :$clean, :$target ); # call the named version
}

multi sub MAIN ( Str :$clean!, Str :$target! ){ # using named arguments
    note "Processing clean file";
    my %seen := SetHash.new;
    for open( $clean, :r ).lines -> $line {
        next unless $line.chars; # skip empty lines
        %seen{$line}++;
    }

    note "Processing target file";
    for open( $target, :r ).lines -> $line {
        $line ~~ /^ $<first> = <-[\t]>+ /;
        say $line if %seen{$<first>.Str};
    }
}
I used MAIN subroutines so that you will get a Usage message if you don't give it the correct arguments.
I also used a SetHash instead of a regular Hash to reduce memory use since we don't need to know how many we have found, only that they were found.
Next I tried to combine all of the lines in the clean file into one regex.
This is similar to the sed and grep answer from Cyrus, except instead of many regexes there is only one.
I didn't want to change the subroutine that I had already written, so I added another that is selected by passing --single-regex or -s on the command line. (All of the examples are in the same file.)
multi sub MAIN ( Str :$clean!, Str :$target!, Bool :single-regex(:s($))! ){
    note "Processing clean file";
    my $regex;
    {
        my @regex = open( $clean, :r ).lines.grep(*.chars);
        $regex = /^ [ | @regex ] /;
    } # throw away @regex

    note "Processing target file";
    for open( $target, :r ).lines -> $line {
        say $line if $line ~~ $regex;
    }
}
I will say that I took quite a bit longer to write this than it would have taken me to write it in Perl5. Most of the time was taken up searching for some idioms online, and looking over the source files for Rakudo. I don't think it would take much effort to get better at Perl6 than Perl5.
I have a large text file that I am extracting URLs from. If I run:
import re

with open('file.in', 'r') as fh:
    for match in re.findall(r'http://matchthis\.com', fh.read()):
        print match
it runs in a second or so user time and gets the URLs I was wanting, but if I run either of these:
var regex = /http:\/\/matchthis\.com/g;

fs.readFile('file.in', 'ascii', function(err, data) {
    while(match = regex.exec(data))
        console.log(match);
});
OR
fs.readFile('file.in', 'ascii', function(err, data) {
    var matches = data.match(/http:\/\/matchthis\.com/g);
    for (var i = 0; i < matches.length; ++i) {
        console.log(matches[i]);
    }
});
I get:
FATAL ERROR: CALL_AND_RETRY_0 Allocation failed - process out of memory
What is happening with the node.js regex engine? Is there any way I can modify things such that they work in node?
EDIT: The error appears to be fs-centric, as this also produces the error:
fs.readFile('file.in', 'ascii', function(err, data) {
});
file.in is around 800MB.
You should process the file line by line using the streaming file interface. Something like this:
var fs = require('fs');
var byline = require('byline');

var input = fs.createReadStream('tmp.txt');
var lines = input.pipe(byline.createStream());

lines.on('readable', function() {
    var line = lines.read().toString('ascii');
    var matches = line.match(/http:\/\/matchthis\.com/g) || []; // match() returns null when a line has no match
    for (var i = 0; i < matches.length; ++i) {
        console.log(matches[i]);
    }
});
In this example, I'm using the byline module to split the stream into lines so that you won't miss matches by getting partial chunks of lines per .read() call.
To elaborate more, what you were doing is allocating ~800MB of RAM as a Buffer (outside of V8's heap) and then converting that to an ASCII string (and thus transferring it into V8's heap), which will take at least 800MB and likely more depending on V8's internal optimizations. I believe V8 stores strings as UCS2 or UTF16, which means each character will be 2 bytes (given ASCII input) so your string would really be about 1600MB.
Node's max allocated heap space is 1.4GB, so by trying to create such a large string, you cause V8 to throw an exception.
Python does not have this problem because it does not have a maximum heap size and will chew through all of your RAM. As others have pointed out, you should also avoid fh.read() in Python since that will copy all the file data into RAM as a string instead of streaming it line by line with an iterator.
Given that both programs are trying to read the entire 1400000-line file into memory, I'd suggest it is a difference between how Node and Python handle large strings. Try doing a line-by-line search and the problem should disappear.
For example, in Python you can do this:
import re

with open('file.in', 'r') as file:
    for line in file:
        for match in re.findall(r'http://matchthis\.com', line):
            print match
I have a header file in which there is a large struct. I need to read this structure with some program, perform some operations on each member, and write it back.
For example, I have a structure like:
const BYTE Some_Idx[] = {
4,7,10,15,17,19,24,29,
31,32,35,45,49,51,52,54,
55,58,60,64,65,66,67,69,
70,72,76,77,81,82,83,85,
88,93,94,95,97,99,102,103,
105,106,113,115,122,124,125,126,
129,131,137,139,140,149,151,152,
153,155,158,159,160,163,165,169,
174,175,181,182,183,189,190,193,
197,201,204,206,208,210,211,212,
213,214,215,217,218,219,220,223,
225,228,230,234,236,237,240,241,
242,247,249};
Now, I need to read this, apply some operation to each member variable, and create a new structure with a different name and order, something like:
const BYTE Some_Idx_Mod_mul_2[] = {
8,14,20, ...
...
484,494,498};
Is there any Perl library already available for this? If not Perl, something else like Python is also OK.
Can somebody please help!!!
Keeping your data lying around in a header makes it trickier to get at from other programs like Perl. Another approach you might consider is to keep this data in a database or another file and regenerate your header file as needed, maybe even as part of your build system. The reason for this is that generating C is much easier than parsing C: it's trivial to write a script that parses a text file and makes a header for you, and such a script could even be invoked from your build system.
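As a concrete illustration of the generate-rather-than-parse idea, a minimal Python sketch might look like this (the file names values.txt and some_idx.h are made up for illustration, and the formatting is only a guess at your conventions):

# gen_header.py - regenerate the header from a plain list of numbers
with open("values.txt") as f:
    values = [int(tok) for line in f for tok in line.replace(",", " ").split()]

with open("some_idx.h", "w") as out:
    out.write("const BYTE Some_Idx[] = {\n")
    for i in range(0, len(values), 8):          # 8 values per line, like the original
        out.write("    " + ",".join(str(v) for v in values[i:i + 8]) + ",\n")
    out.write("};\n")

Your build system could run this script whenever values.txt changes, so the header never has to be parsed at all.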
Assuming that you want to keep your data in a C header file, you will need one of two things to solve this problem:
a quick one-off script to parse exactly (or close to exactly) the input you describe.
a general, well-written script that can parse arbitrary C and work generally on lots of different headers.
The first case seems more common than the second to me, but it's hard to tell from your question if this is better solved by a script that needs to parse arbitrary C or a script that needs to parse this specific file. For code that works on your specific case, the following works for me on your input:
#!/usr/bin/perl -w

use strict;

open FILE, "<header.h" or die $!;
my @file = <FILE>;
close FILE or die $!;

my $in_block = 0;
my $regex = 'Some_Idx\[\]';
my $byte_line = '';
my @byte_entries;

foreach my $line (@file) {
    chomp $line;

    if ( $line =~ /$regex.*\{(.*)/ ) {
        $in_block = 1;
        my @digits = @{ match_digits($1) };
        push @byte_entries, @digits;
        next;
    }

    if ( $in_block ) {
        my @digits = @{ match_digits($line) };
        push @byte_entries, @digits;
    }

    if ( $line =~ /\}/ ) {
        $in_block = 0;
    }
}

print "const BYTE Some_Idx_Mod_mul_2[] = {\n";
print join ",", map { $_ * 2 } @byte_entries;
print "};\n";

sub match_digits {
    my $text = shift;
    my @digits;

    while ( $text =~ /(\d+),*/g ) {
        push @digits, $1;
    }
    return \@digits;
}
Parsing arbitrary C is a little tricky and not worth it for many applications, but maybe you need to actually do this. One trick is to let GCC do the parsing for you and read in GCC's parse tree using a CPAN module named GCC::TranslationUnit.
Here's the GCC command to compile the code, assuming you have a single file named test.c:
gcc -fdump-translation-unit -c test.c
Here's the Perl code to read in the parse tree:
use GCC::TranslationUnit;

# echo '#include <stdio.h>' > stdio.c
# gcc -fdump-translation-unit -c stdio.c
$node = GCC::TranslationUnit::Parser->parsefile('stdio.c.tu')->root;

# list every function/variable name
while($node) {
    if($node->isa('GCC::Node::function_decl') or
       $node->isa('GCC::Node::var_decl')) {
        printf "%s declared in %s\n",
            $node->name->identifier, $node->source;
    }
} continue {
    $node = $node->chain;
}
Sorry if this is a stupid question, but why worry about parsing the file at all? Why not write a C program that #includes the header, processes it as required, and then spits out the source for the modified header? I'm sure this would be simpler than the Perl/Python solutions, and it would be much more reliable because the header would be parsed by the C compiler's parser.
You don't really provide much information about how the modifications should be determined, but to address your specific example:
$ perl -pi.bak -we'if ( /const BYTE Some_Idx/ .. /;/ ) { s/Some_Idx/Some_Idx_Mod_mul_2/g; s/(\d+)/$1 * 2/ge; }' header.h
Breaking that down, -p says loop through input files, putting each line in $_, running the supplied code, then printing $_. -i.bak enables in-place editing, renaming each original file with a .bak suffix and printing to a new file named whatever the original was. -w enables warnings. -e'....' supplies the code to be run for each input line. header.h is the only input file.
In the perl code, if ( /const BYTE Some_Idx/ .. /;/ ) checks that we are in a range of lines beginning with a line matching /const BYTE Some_Idx/ and ending with a line matching /;/.
s/.../.../g does a substitution as many times as possible. /(\d+)/ matches a series of digits. The /e flag says the result ($1 * 2) is code that should be evaluated to produce a replacement string, instead of simply a replacement string. $1 is the digits that should be replaced.
If all you need to do is to modify structs, you can directly use regex to split and apply changes to each value in the struct, looking for the declaration and the ending }; to know when to stop.
If you really need a more general solution you could use a parser generator, like PyParsing
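For example, a pyparsing grammar for the array above might look roughly like this (a sketch only, written from memory and not tested, so treat the exact API details as assumptions):

from pyparsing import Word, nums, delimitedList, Suppress, Regex

header_text = open("header.h").read()  # hypothetical file name

# Grammar for: const BYTE <name>[] = { <comma-separated integers> };
integer = Word(nums)
decl    = Suppress(Regex(r"const\s+BYTE\s+\w+\[\]\s*=\s*\{"))
block   = decl + delimitedList(integer)("values") + Suppress("}") + Suppress(";")

parsed  = block.searchString(header_text)[0]
doubled = [int(v) * 2 for v in parsed["values"]]
print(doubled[:8])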
There is a Perl module called Parse::RecDescent which is a very powerful recursive descent parser generator. It comes with a bunch of examples. One of them is a grammar that can parse C.
Now, I don't think this matters in your case, but the recursive descent parsers using Parse::RecDescent are algorithmically slower (O(n^2), I think) than tools like Parse::Yapp or Parse::EYapp. I haven't checked whether Parse::EYapp comes with such a C-parser example, but if so, that's the tool I'd recommend learning.
Python solution (not full, just a hint ;)). Sorry if there are any mistakes - not tested.
import re

text = open('your file.c').read()

# capture everything up to and including "{", the values, and the closing "};"
patt = r'(?is)(.*?{)(.*?)(}\s*;)'
m = re.search(patt, text)
g1, g2, g3 = m.group(1), m.group(2), m.group(3)

g2 = [str(int(i) * 2) for i in g2.split(',')]

out = open('your file 2.c', 'w')
out.write(g1 + ','.join(g2) + g3)
out.close()
There is a really useful Perl module called Convert::Binary::C that parses C header files and converts structs from/to Perl data structures.
You could always use pack/unpack to read and write the data.
#! /usr/bin/env perl
use strict;
use warnings;
use autodie;

my @data;

{
    open( my $file, '<', 'Some_Idx.bin' );

    local $/ = \1; # read one byte at a time

    while( my $byte = <$file> ){
        push @data, unpack('C', $byte);
    }
    close( $file );
}

print join(',', @data), "\n";

{
    open( my $file, '>', 'Some_Idx_Mod_mul_2.bin' );

    # You have two options
    for my $byte( @data ){
        print $file pack 'C', $byte * 2;
    }
    # or
    print $file pack 'C*', map { $_ * 2 } @data;

    close( $file );
}
For the GCC::TranslationUnit example, see hparse.pl from http://gist.github.com/395160, which will make it into C::DynaLib, and into the not-yet-written Ctypes as well.
It parses functions for FFIs, not bare structs, unlike Convert::Binary::C.
hparse will only pick up structs if they are used as function arguments.