Why can't node handle this regex but python can?

Why can't node handle this regex but python can? - python

I have a large text file that I am extracting URLs from. If I run:
import re
with open ('file.in', 'r') as fh:
for match in re.findall(r'http://matchthis\.com', fh.read()):
print match
it runs in a second or so user time and gets the URLs I was wanting, but if I run either of these:
var regex = /http:\/\/matchthis\.com/g;
fs.readFile('file.in', 'ascii', function(err, data) {
while(match = regex.exec(data))
console.log(match);
});
OR
fs.readFile('file.in', 'ascii', function(err, data) {
var matches = data.match(/http:\/\/matchthis\.com/g);
for (var i = 0; i < matches.length; ++i) {
console.log(matches[i]);
}
});
I get:
FATAL ERROR: CALL_AND_RETRY_0 Allocation failed - process out of memory
What is happening with the node.js regex engine? Is there any way I can modify things such that they work in node?
EDIT: The error appears to be fs centric as this also produces the error:
fs.readFile('file.in', 'ascii', function(err, data) {
});
file.in is around 800MB.

You should process the file line by line using the streaming file interface. Something like this:
var fs = require('fs');
var byline = require('byline');
var input = fs.createReadStream('tmp.txt');
var lines = input.pipe(byline.createStream());
lines.on('readable', function(){
var line = lines.read().toString('ascii');
var matches = line.match(/http:\/\/matchthis\.com/g);
for (var i = 0; i < matches.length; ++i) {
console.log(matches[i]);
}
});
In this example, I'm using the byline module to split the stream into lines so that you won't miss matches by getting partial chunks of lines per .read() call.
To elaborate more, what you were doing is allocating ~800MB of RAM as a Buffer (outside of V8's heap) and then converting that to an ASCII string (and thus transferring it into V8's heap), which will take at least 800MB and likely more depending on V8's internal optimizations. I believe V8 stores strings as UCS2 or UTF16, which means each character will be 2 bytes (given ASCII input) so your string would really be about 1600MB.
Node's max allocated heap space is 1.4GB, so by trying to create such a large string, you cause V8 to throw an exception.
Python does not have this problem because it does not have a maximum heap size and will chew through all of your RAM. As others have pointed out, you should also avoid fh.read() in Python since that will copy all the file data into RAM as a string instead of streaming it line by line with an iterator.

Given that both programs are trying to read the entire 1400000 file into memory, I'd suggest it would be a difference between how Node and Python handle large strings. Try doing a line by line search and the problem should disappear.
For example, in Python you can do this:
import re
with open ('file.in', 'r') as file:
for line in file:
for match in re.findall(r'http://matchthis\.com', line):
print match

Related

Reaching a limit when executing python scripts from node.js [duplicate]

My Node & Python backend is running just fine, but I now encountered an issue where if a JSON I'm sending from Python back no Node is too long, it gets split into two and my JSON.parse at the Node side fails.
How should I fix this? For example, the first batch clips at
... [1137.6962355826706, -100.78015825640887], [773.3834338399517, -198
and the second one has the remaining few entries
.201506231888], [-87276.575065248, -60597.8827676457], [793.1850250453127,
-192.1674702207991], [1139.4465453979683, -100.56741252031816],
[780.498416769341, -196.04064849430705]]}
Do I have to create some logic on the Node side for long JSONs or is this some sort of a buffering issue I'm having on my Python side that I can overcome with proper settings? Here's all I'm doing on the python side:
outPoints, _ = cv2.projectPoints(inPoints, np.asarray(rvec),
np.asarray(tvec), np.asarray(camera_matrix), np.asarray(dist_coeffs))
# flatten the output to get rid of double brackets per result before JSONifying
flattened = [val for sublist in outPoints for val in sublist]
print(json.dumps({'testdata':np.asarray(flattened).tolist()}))
sys.stdout.flush()
And on the Node side:
// Handle python data from print() function
pythonProcess.stdout.on('data', function (data){
try {
// If JSON handle the data
console.log(JSON.parse(data.toString()));
} catch (e) {
// Otherwise treat as a log entry
console.log(data.toString());
}
});

The emitted data is chunked, so if you want to parse a JSON you will need to join all the chunks, and on end perform JSON.parse.
By default, pipes for stdin, stdout, and stderr are established
between the parent Node.js process and the spawned child. These pipes
have limited (and platform-specific) capacity. If the child process
writes to stdout in excess of that limit without the output being
captured, the child process will block waiting for the pipe buffer to
accept more data.
In linux each chunk is limited to 65536 bytes.
In Linux versions before 2.6.11, the capacity of a pipe was the same
as the system page size (e.g., 4096 bytes on i386). Since Linux
2.6.11, the pipe capacity is 65536 bytes.
let result = '';
pythonProcess.stdout.on('data', data => {
result += data.toString();
// Or Buffer.concat if you prefer.
});
pythonProcess.stdout.on('end', () => {
try {
// If JSON handle the data
console.log(JSON.parse(result));
} catch (e) {
// Otherwise treat as a log entry
console.log(result);
}
});

Sending JSON from Python to Node via child_process gets truncated if too long, how to fix?

My Node & Python backend is running just fine, but I now encountered an issue where if a JSON I'm sending from Python back no Node is too long, it gets split into two and my JSON.parse at the Node side fails.
How should I fix this? For example, the first batch clips at
... [1137.6962355826706, -100.78015825640887], [773.3834338399517, -198
and the second one has the remaining few entries
.201506231888], [-87276.575065248, -60597.8827676457], [793.1850250453127,
-192.1674702207991], [1139.4465453979683, -100.56741252031816],
[780.498416769341, -196.04064849430705]]}
Do I have to create some logic on the Node side for long JSONs or is this some sort of a buffering issue I'm having on my Python side that I can overcome with proper settings? Here's all I'm doing on the python side:
outPoints, _ = cv2.projectPoints(inPoints, np.asarray(rvec),
np.asarray(tvec), np.asarray(camera_matrix), np.asarray(dist_coeffs))
# flatten the output to get rid of double brackets per result before JSONifying
flattened = [val for sublist in outPoints for val in sublist]
print(json.dumps({'testdata':np.asarray(flattened).tolist()}))
sys.stdout.flush()
And on the Node side:
// Handle python data from print() function
pythonProcess.stdout.on('data', function (data){
try {
// If JSON handle the data
console.log(JSON.parse(data.toString()));
} catch (e) {
// Otherwise treat as a log entry
console.log(data.toString());
}
});

The emitted data is chunked, so if you want to parse a JSON you will need to join all the chunks, and on end perform JSON.parse.
By default, pipes for stdin, stdout, and stderr are established
between the parent Node.js process and the spawned child. These pipes
have limited (and platform-specific) capacity. If the child process
writes to stdout in excess of that limit without the output being
captured, the child process will block waiting for the pipe buffer to
accept more data.
In linux each chunk is limited to 65536 bytes.
In Linux versions before 2.6.11, the capacity of a pipe was the same
as the system page size (e.g., 4096 bytes on i386). Since Linux
2.6.11, the pipe capacity is 65536 bytes.
let result = '';
pythonProcess.stdout.on('data', data => {
result += data.toString();
// Or Buffer.concat if you prefer.
});
pythonProcess.stdout.on('end', () => {
try {
// If JSON handle the data
console.log(JSON.parse(result));
} catch (e) {
// Otherwise treat as a log entry
console.log(result);
}
});

How to write program to do file transfer based on based omniORBpy

I'm now writing a Corba project to do file transfering between client and server.
But I face trouble when I want to upload file from the client to the server.
The IDL I defined is:
interface SecretMessage
{
string send_file(in string file_name, in string file_obj);
};
And I implemented the uploading function in the client code:
f = open('SB.docx', 'rb')
data = ''
for piece in read_in_chunks(f):
data += piece
result = mo.send_file('2.docx', data)
If the file is a plain txt file, there is no problem.
But if the file is a, like jpg, doc, or others except txt, then it does work.
It gives me the error:
omniORB.CORBA.BAD_PARAM: CORBA.BAD_PARAM(omniORB.BAD_PARAM_WrongPythonType, CORBA.COMPLETED_NO)
Where is the problem?

I think it is because by default omniORB wants to see ASCII data for strings. Try changing your IDL to this
interface SecretMessage
{
typedef sequence<octet> OctetSequence;
string send_file(in string file_name, in OctetSequence file_obj);
};
You can keep your Python client code the same because in the IDL to Python mapping, octet sequences map to Python strings.

Garbage in file after truncate(0) in Python

Assume there is a file test.txt containing a string 'test'.
Now, consider the following Python code:
f = open('test', 'r+')
f.read()
f.truncate(0)
f.write('passed')
f.flush();
Now I expect test.txt to contain 'passed' now, however there are additionally some strange symbols!
Update: flush after truncate does not help.

Yeah, that's true that truncate() doesn't move the position, but said that, is simple as death:
f.read()
f.seek(0)
f.truncate(0)
f.close()
this is perfectly working ;)

This is because truncate doesn't change the stream position.
When you read() the file, you move the position to the end. So successive writes will write to file from that position. However, when you call flush(), it seems not only it tries to write the buffer to the file, but also does some error checking and fixes the current file position. When Flush() is called after the truncate(0), writes nothing (buffer is empty), then checks the file size and places the position at the first applicable place (which is 0).
UPDATE
Python's file function are NOT just wrappers around the C standard library equivalents, but knowing the C functions helps knowing what is happening more precisely.
From the ftruncate man page:
The value of the seek pointer is not modified by a call to ftruncate().
From the fflush man page:
If stream points to an input stream or an update stream into which the most recent operation was input, that stream is flushed if it is seekable and is not already at end-of-file. Flushing an input stream discards any buffered input and adjusts the file pointer such that the next input operation accesses the byte after the last one read.
This means if you put flush before truncate it has no effect. I checked and it was so.
But for putting flush after truncate:
If stream points to an output stream or an update stream in which the most recent operation was not input, fflush() causes any unwritten data for that stream to be written to the file, and the st_ctime and st_mtime fields of the underlying file are marked for update.
The man page doesn't mention the seek pointer when explaining output streams with last operation not being input. (Here our last operation is truncate)
UPDATE 2
I found something in python source code: Python-3.2.2\Modules\_io\fileio.c:837
#ifdef HAVE_FTRUNCATE
static PyObject *
fileio_truncate(fileio *self, PyObject *args)
{
PyObject *posobj = NULL; /* the new size wanted by the user */
#ifndef MS_WINDOWS
Py_off_t pos;
#endif
...
#ifdef MS_WINDOWS
/* MS _chsize doesn't work if newsize doesn't fit in 32 bits,
so don't even try using it. */
{
PyObject *oldposobj, *tempposobj;
HANDLE hFile;
////// THIS LINE //////////////////////////////////////////////////////////////
/* we save the file pointer position */
oldposobj = portable_lseek(fd, NULL, 1);
if (oldposobj == NULL) {
Py_DECREF(posobj);
return NULL;
}
/* we then move to the truncation position */
...
/* Truncate. Note that this may grow the file! */
...
////// AND THIS LINE //////////////////////////////////////////////////////////
/* we restore the file pointer position in any case */
tempposobj = portable_lseek(fd, oldposobj, 0);
Py_DECREF(oldposobj);
if (tempposobj == NULL) {
Py_DECREF(posobj);
return NULL;
}
Py_DECREF(tempposobj);
}
#else
...
#endif /* HAVE_FTRUNCATE */
Look at the two lines I indicated (///// This Line /////). If your platform is Windows, then it's saving the position and returning it back after the truncate.
To my surprise, most of the flush functions inside the Python 3.2.2 functions either did nothing or did not call fflush C function at all. The 3.2.2 truncate part was also very undocumented. However, I did find something interesting in Python 2.7.2 sources. First, I found this in Python-2.7.2\Objects\fileobject.c:812 in truncate implementation:
/* Get current file position. If the file happens to be open for
* update and the last operation was an input operation, C doesn't
* define what the later fflush() will do, but we promise truncate()
* won't change the current position (and fflush() *does* change it
* then at least on Windows). The easiest thing is to capture
* current pos now and seek back to it at the end.
*/
So to summarize all, I think this is a fully platform dependent thing. I checked on default Python 3.2.2 for Windows x64 and got the same results as you. Don't know what happens on *nixes.

If anyone is in the same boat as mine, here is my problem with solution:
I have a program that is always ON i.e. it doesn't stop, keeps on polling the data and writes to a log file
The problem is, i want to split the main file as soon as it reaches the 10 MB mark, therefore, i wrote the below program.
I found the solution as well to the problem, where truncate was writing null values to the file causing further problem.
Below is an illustration on how i solved this issue.
f1 = open('client.log','w')
nowTime = datetime.datetime.now().time()
f1.write(os.urandom(1024*1024*15)) #Adding random values worth 15 MB
if (int(os.path.getsize('client.log') / 1048576) > 10): #checking if file size is 10 MB and above
print 'File size limit Exceeded, needs trimming'
dst = 'client_'+ str(randint(0, 999999)) + '.log'
copyfile('client.log', dst) #Copying file to another one
print 'Copied content to ' + str(dst)
print 'Erasing current file'
f1.truncate(0) #Truncating data, this works fine but puts the counter at the last
f1.seek(0) #very important to use after truncate so that new data begins from 0
print 'File truncated successfully'
f1.write('This is fresh content') #Dummy content
f1.close()
print 'All Job Processed'

Truncate doesn't change the file position.
Note also that even if the file is opened in read+write you cannot just switch between the two types of operation (a seek operation is required to be able to switch from read to write or vice versa).

I expect the following is the code you meant to write:
open('test.txt').read()
open('test.txt', 'w').write('passed')

It depends. If you want to keep the file open and access it without closing it then flush will force writing to the file. If you're closing the file right after flush then no you don't need it because close will flush for you. That's my understanding from the docs

How can I parse a C header file with Perl?

I have a header file in which there is a large struct. I need to read this structure using some program and make some operations on each member of the structure and write them back.
For example I have some structure like
const BYTE Some_Idx[] = {
4,7,10,15,17,19,24,29,
31,32,35,45,49,51,52,54,
55,58,60,64,65,66,67,69,
70,72,76,77,81,82,83,85,
88,93,94,95,97,99,102,103,
105,106,113,115,122,124,125,126,
129,131,137,139,140,149,151,152,
153,155,158,159,160,163,165,169,
174,175,181,182,183,189,190,193,
197,201,204,206,208,210,211,212,
213,214,215,217,218,219,220,223,
225,228,230,234,236,237,240,241,
242,247,249};
Now, I need to read this and apply some operation on each of the member variable and create a new structure with different order, something like:
const BYTE Some_Idx_Mod_mul_2[] = {
8,14,20, ...
...
484,494,498};
Is there any Perl library already available for this? If not Perl, something else like Python is also OK.
Can somebody please help!!!

Keeping your data lying around in a header makes it trickier to get at using other programs like Perl. Another approach you might consider is to keep this data in a database or another file and regenerate your header file as-needed, maybe even as part of your build system. The reason for this is that generating C is much easier than parsing C, it's trivial to write a script that parses a text file and makes a header for you, and such a script could even be invoked from your build system.
Assuming that you want to keep your data in a C header file, you will need one of two things to solve this problem:
a quick one-off script to parse exactly (or close to exactly) the input you describe.
a general, well-written script that can parse arbitrary C and work generally on to lots of different headers.
The first case seems more common than the second to me, but it's hard to tell from your question if this is better solved by a script that needs to parse arbitrary C or a script that needs to parse this specific file. For code that works on your specific case, the following works for me on your input:
#!/usr/bin/perl -w
use strict;
open FILE, "<header.h" or die $!;
my #file = <FILE>;
close FILE or die $!;
my $in_block = 0;
my $regex = 'Some_Idx\[\]';
my $byte_line = '';
my #byte_entries;
foreach my $line (#file) {
chomp $line;
if ( $line =~ /$regex.*\{(.*)/ ) {
$in_block = 1;
my #digits = #{ match_digits($1) };
push #digits, #byte_entries;
next;
}
if ( $in_block ) {
my #digits = #{ match_digits($line) };
push #byte_entries, #digits;
}
if ( $line =~ /\}/ ) {
$in_block = 0;
}
}
print "const BYTE Some_Idx_Mod_mul_2[] = {\n";
print join ",", map { $_ * 2 } #byte_entries;
print "};\n";
sub match_digits {
my $text = shift;
my #digits;
while ( $text =~ /(\d+),*/g ) {
push #digits, $1;
}
return \#digits;
}
Parsing arbitrary C is a little tricky and not worth it for many applications, but maybe you need to actually do this. One trick is to let GCC do the parsing for you and read in GCC's parse tree using a CPAN module named GCC::TranslationUnit.
Here's the GCC command to compile the code, assuming you have a single file named test.c:
gcc -fdump-translation-unit -c test.c
Here's the Perl code to read in the parse tree:
use GCC::TranslationUnit;
# echo '#include <stdio.h>' > stdio.c
# gcc -fdump-translation-unit -c stdio.c
$node = GCC::TranslationUnit::Parser->parsefile('stdio.c.tu')->root;
# list every function/variable name
while($node) {
if($node->isa('GCC::Node::function_decl') or
$node->isa('GCC::Node::var_decl')) {
printf "%s declared in %s\n",
$node->name->identifier, $node->source;
}
} continue {
$node = $node->chain;
}

Sorry if this is a stupid question, but why worry about parsing the file at all? Why not write a C program that #includes the header, processes it as required and then spits out the source for the modified header. I'm sure this would be simpler than the Perl/Python solutions, and it would be much more reliable because the header would be being parsed by the C compilers parser.

You don't really provide much information about how what is to be modified should be determined, but to address your specific example:
$ perl -pi.bak -we'if ( /const BYTE Some_Idx/ .. /;/ ) { s/Some_Idx/Some_Idx_Mod_mul_2/g; s/(\d+)/$1 * 2/ge; }' header.h
Breaking that down, -p says loop through input files, putting each line in $_, running the supplied code, then printing $_. -i.bak enables in-place editing, renaming each original file with a .bak suffix and printing to a new file named whatever the original was. -w enables warnings. -e'....' supplies the code to be run for each input line. header.h is the only input file.
In the perl code, if ( /const BYTE Some_Idx/ .. /;/ ) checks that we are in a range of lines beginning with a line matching /const BYTE Some_Idx/ and ending with a line matching /;/.
s/.../.../g does a substitution as many times as possible. /(\d+)/ matches a series of digits. The /e flag says the result ($1 * 2) is code that should be evaluated to produce a replacement string, instead of simply a replacement string. $1 is the digits that should be replaced.

If all you need to do is to modify structs, you can directly use regex to split and apply changes to each value in the struct, looking for the declaration and the ending }; to know when to stop.
If you really need a more general solution you could use a parser generator, like PyParsing

There is a Perl module called Parse::RecDescent which is a very powerful recursive descent parser generator. It comes with a bunch of examples. One of them is a grammar that can parse C.
Now, I don't think this matters in your case, but the recursive descent parsers using Parse::RecDescent are algorithmically slower (O(n^2), I think) than tools like Parse::Yapp or Parse::EYapp. I haven't checked whether Parse::EYapp comes with such a C-parser example, but if so, that's the tool I'd recommend learning.

Python solution (not full, just a hint ;)) Sorry if any mistakes - not tested
import re
text = open('your file.c').read()
patt = r'(?is)(.*?{)(.*?)(}\s*;)'
m = re.search(patt, text)
g1, g2, g3 = m.group(1), m.group(2), m.group(3)
g2 = [int(i) * 2 for i in g2.split(',')
out = open('your file 2.c', 'w')
out.write(g1, ','.join(g2), g3)
out.close()

There is a really useful Perl module called Convert::Binary::C that parses C header files and converts structs from/to Perl data structures.

You could always use pack / unpack, to read, and write the data.
#! /usr/bin/env perl
use strict;
use warnings;
use autodie;
my #data;
{
open( my $file, '<', 'Some_Idx.bin' );
local $/ = \1; # read one byte at a time
while( my $byte = <$file> ){
push #data, unpack('C',$byte);
}
close( $file );
}
print join(',', #data), "\n";
{
open( my $file, '>', 'Some_Idx_Mod_mul_2.bin' );
# You have two options
for my $byte( #data ){
print $file pack 'C', $byte * 2;
}
# or
print $file pack 'C*', map { $_ * 2 } #data;
close( $file );
}

For the GCC::TranslationUnit example see hparse.pl from http://gist.github.com/395160
which will make it into C::DynaLib, and the not yet written Ctypes also.
This parses functions for FFI's, and not bare structs contrary to Convert::Binary::C.
hparse will only add structs if used as func args.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Why can't node handle this regex but python can? - python

Related

Reaching a limit when executing python scripts from node.js [duplicate]

Sending JSON from Python to Node via child_process gets truncated if too long, how to fix?

How to write program to do file transfer based on based omniORBpy

Garbage in file after truncate(0) in Python

How can I parse a C header file with Perl?

Categories

Resources