I have been looking to speed up a basic Python function that takes a line of text and checks it for a substring. The Python program is as follows:
import time

def fun(line):
    l = line.split(" ", 10)
    if 'TTAGGG' in l[9]:
        pass  # Do nothing

line = "FCC2CCMACXX:4:1105:10758:14389# 81 chrM 1 32 10S90M = 16151 16062 CATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTTTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCTTAGGGGATAGCATTG bbb^Wcbbbbccbbbcbccbba]WQG^bbcdcb_^_c_^`ccdddeeeeeffggggiiiiihiiiiihiiihihiiiihghhiihgfgfgeeeeebbb NM:i:1 AS:i:85 XS:i:65 RG:Z:1_DB31"

time0 = time.time()
for i in range(10000):
    fun(line)
print time.time() - time0
I wanted to see if I could use some of Rust's high-level features to gain some performance, but the code runs considerably slower. The Rust conversion is:
extern crate regex;
extern crate time;

use regex::Regex;

fn main() {
    let line = "FCC2CCMACXX:4:1105:10758:14389# 81 chrM 1 32 10S90M = 16151 16062 CATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTTTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCTTAGGGGATAGCATTG bbb^Wcbbbbccbbbcbccbba]WQG^bbcdcb_^_c_^`ccdddeeeeeffggggiiiiihiiiiihiiihihiiiihghhiihgfgfgeeeeebbb NM:i:1 AS:i:85 XS:i:65 RG:Z:1_DB31";
    let substring: &str = "TTAGGG";
    let time0: f64 = time::precise_time_s();
    for _ in 0..10000 {
        fun(line, substring);
    }
    let time1: f64 = time::precise_time_s();
    let elapsed: f64 = time1 - time0;
    println!("{}", elapsed);
}

fn fun(line: &str, substring: &str) {
    let l: Vec<&str> = line.split(" ")
                           .enumerate()
                           .filter(|&(i, _)| i == 9)
                           .map(|(_, e)| e)
                           .collect();
    let re = Regex::new(substring).unwrap();
    if re.is_match(&l[0]) {
        // Do nothing
    }
}
On my machine, Python times this at 0.0065s vs Rust's 1.3946s.
Checking some basic timings, the line.split() part of the code takes around 1s, and the regex step around 0.4s. Can this really be right, or is there an issue with how I'm timing this?
As a baseline, I ran your Python program with Python 2.7.6. Over 10 runs, it had a mean time of 12.2ms with a standard deviation of 443μs. I don't know how you got the very good time of 6.5ms.
Running your Rust code with Rust 1.4.0-dev (febdc3b20), without optimizations, I got a mean of 958ms and a standard deviation of 33ms.
Running your code with optimizations (cargo run --release), I got a mean of 34.6ms and standard deviation of 495μs. Always do benchmarking in release mode.
There are further optimizations you can do:
Compiling the regex once, outside of the timing loop:
fn main() {
    // ...
    let substring = "TTAGGG";
    let re = Regex::new(substring).unwrap();
    // ...
    for _ in 0..10000 {
        fun(line, &re);
    }
    // ...
}

fn fun(line: &str, re: &Regex) {
    // ...
}
Produces an average of 10.4ms with a standard deviation of 678μs.
Switching to a substring match:
fn fun(line: &str, substring: &str) {
    // ...
    if l[0].contains(substring) {
        // Do nothing
    }
}
Has a mean of 8.7ms and a standard deviation of 334μs.
And finally, if you look at just the one result instead of collecting everything into a vector:
fn fun(line: &str, substring: &str) {
    let col = line.split(" ").nth(9);
    if col.map(|c| c.contains(substring)).unwrap_or(false) {
        // Do nothing
    }
}
Has a mean of 6.30ms and standard deviation of 114μs.
A direct translation of the Python would be
extern crate time;

fn fun(line: &str) {
    let mut l = line.split(" ");
    if l.nth(9).unwrap().contains("TTAGGG") {
        // do nothing
    }
}

fn main() {
    let line = "FCC2CCMACXX:4:1105:10758:14389# 81 chrM 1 32 10S90M = 16151 16062 CATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTTTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCTTAGGGGATAGCATTG bbb^Wcbbbbccbbbcbccbba]WQG^bbcdcb_^_c_^`ccdddeeeeeffggggiiiiihiiiiihiiihihiiiihghhiihgfgfgeeeeebbb NM:i:1 AS:i:85 XS:i:65 RG:Z:1_DB31";
    let time0 = time::precise_time_s();
    for _ in 0..10000 {
        fun(line);
    }
    println!("{}", time::precise_time_s() - time0);
}
Using cargo run --release on stable (1.2.0), I get about 0.0267s, compared to about 0.0240s for Python (CPython, 2.7.10). Given that Python's in operator on strings is just a C routine, this is reasonable.
Impressively, on beta (1.3.0) and nightly (1.4.0) this decreases to about just 0.0122, or about twice the speed of CPython!
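If you want to sanity-check the CPython side yourself, a minimal timeit snippet for just the in operator might look like the following (a sketch; the shortened line here merely stands in for the full record, and timings will of course vary by machine):

import timeit

# Time only the substring check, 10000 iterations, mirroring the
# loop in the question.
line = "CATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGG"
print(timeit.timeit("'TTAGGG' in line",
                    setup="from __main__ import line",
                    number=10000))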
I have scala code and python code that are attempting the same task (2021 advent of code day 1 https://adventofcode.com/2021/day/1).
The Python returns the correct solution, the Scala does not. I ran diff on both of the outputs and have determined that my Scala code is incorrectly evaluating the following pairs:
1001 > 992 -> false
996 > 1007 -> true
1012 > 977 -> false
The following is my Python code:
import pandas as pd

data = pd.read_csv("01_input.csv", header=None)
incr = 0
prevval = 99999
for index, row in data.iterrows():
    if index != 0:
        if row[0] > prevval:
            print(f"{index}-{row[0]}-{prevval}")
            incr += 1
        prevval = row[0]
    prevval = row[0]
print(incr)
and here is my Scala code:
import scala.io.Source; // library to read input file in

object advent_of_code_2021_01 {
  def main(args: Array[String]): Unit = {
    val lines = Source.fromFile("01_input.csv").getLines().toList; // file as immutable list
    var increases = 0;
    for (i <- 1 until lines.length) { // iterate over list by index
      if (lines(i) > lines(i-1)) {
        increases += 1;
        println(s"$i-${lines(i)}-${lines(i-1)}")
      }
    }
    println(increases);
  }
}
I do not understand what is causing this issue on these particular values. In the shell, Scala evaluates them correctly, but I do not know where to even begin with this. Is there some behavior I need to know about that I'm not accounting for? Am I just doing something stupid? Any help is appreciated, thank you.
As @Edward Peters https://stackoverflow.com/users/6016064/edward-peters correctly identified, my problem was that I was doing string comparisons, not numerical comparisons, so I needed to convert my values to Int and not String. I did this with the very simple .toInt and it fixed all my issues.
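The same pitfall is easy to reproduce in Python, which also compares strings lexicographically; a two-line illustration:

# Strings compare character by character, so "996" > "1007" because
# '9' > '1'; converting to int gives the numeric comparison intended.
print("996" > "1007")            # True  -- string comparison
print(int("996") > int("1007"))  # False -- numeric comparison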
Fixed Scala code:

import scala.io.Source; // library to read input file in

object advent_of_code_2021_01 {
  def main(args: Array[String]): Unit = {
    val lines = Source.fromFile("01_input.csv").getLines().toList; // file as immutable list
    var increases = 0;
    for (i <- 1 until lines.length) { // iterate over list by index
      if (lines(i).toInt > lines(i-1).toInt) { // evaluate
        increases += 1; // increment when true
        println(s"$i-${lines(i)}-${lines(i-1)}") // debug line
      }
    }
    println(increases); // result
  }
}
I have a file.cc that contains an array of double values, as seen here:
double values[][4] = {
    { 0.1234, +0.5678, 0.1222, 0.9683 },
    { 0.1631, +0.4678, 0.2122, 0.6643 },
    { 0.1332, +0.5678, 0.1322, 0.1683 },
    { 0.1636, +0.7678, 0.7122, 0.6283 }
    // ... continues
};
How can I export these values to a Python list?
I cannot touch these files because they belong to an external library that is subject to modification. To be precise, I want to be able to update the library without affecting my code.
This is pretty much answered in this other SO post, but I will add a bit here. You need to define a type and then use the in_dll method.
From your example I made a .so with those values in values. I hope you know how big the array is, or can find out from other variables in the library; otherwise this is a segfault waiting to happen.
import ctypes

lib = ctypes.CDLL('so.so')
da = ctypes.c_double * 4 * 4   # a ctypes type matching double values[4][4]
da.in_dll(lib, "values")[0][0]
# 0.1234
da.in_dll(lib, "values")[0][1]
# 0.5678
da.in_dll(lib, "values")[0][2]
# 0.1222
From here I would just loop over them reading into a list.
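For completeness, a minimal sketch of that loop, assuming the 4x4 shape shown in the question (in practice the real bounds must come from the library):

import ctypes

lib = ctypes.CDLL('so.so')
da = ctypes.c_double * 4 * 4          # assumed shape: 4 rows of 4 doubles
values = da.in_dll(lib, "values")
# Copy the C array into a plain Python list of lists
matrix = [[values[i][j] for j in range(4)] for i in range(4)]
print(matrix[0])                      # [0.1234, 0.5678, 0.1222, 0.9683]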
How about using a temporary file? Write the matrix to it from C and read it back in Python.
In file.cc, write a function to save the matrix to a file.
int save_to_file(double matrix[][4], int row) {
    int i, j;
    FILE *fp;
    fp = fopen("tmp", "w");
    for (i = 0; i < row; i++)
        for (j = 0; j < 4; j++) {
            fprintf(fp, "%f", matrix[i][j]);
            if (j == 3)
                fprintf(fp, "\n");
            else
                fprintf(fp, " ");
        }
    fclose(fp);
    return 0;
}
and read it back with a Python script like this:
tmp = open('tmp')
L = []
for line in tmp:
    newline = []
    t = line.split(' ')
    for string in t:
        newline.append(float(string))
    L.append(newline)
tmp.close()

for row in L:
    for number in row:
        print "%.4f" % number
    print " "
I would like to use Perl and/or Python to implement the following JavaScript pseudocode:
var c = 0;

function timedCount()
{
    c = c + 1;
    print("c=" + c);
    if (c < 10) {
        var t;
        t = window.setTimeout("timedCount()", 100);
    }
}

// main:
timedCount();
print("after timedCount()");

var i = 0;
for (i = 0; i < 5; i++) {
    print("i=" + i);
    wait(500); // wait 500 ms
}
Now, this is a particularly unlucky example to choose as a basis, but I simply couldn't think of any other language to provide it in :) Basically, there is a 'main loop' and an auxiliary 'loop' (timedCount), which count at different rates: main with a 500 ms period (implemented through a wait), timedCount with a 100 ms period (implemented via setTimeout). However, JavaScript is essentially single-threaded, not multi-threaded, and so there is no real sleep/wait/pause or similar (see JavaScript Sleep Function - ozzu.com), which is why the above is, well, pseudocode ;)
By moving the main part to yet another setInterval function, however, we can get a version of the code which can be pasted and ran in a browser shell like JavaScript Shell 1.4 (but not in a terminal shell like EnvJS/Rhino):
var c = 0;
var i = 0;

function timedCount()
{
    c = c + 1;
    print("c=" + c);
    if (c < 10) {
        var t;
        t = window.setTimeout("timedCount()", 100);
    }
}

function mainCount() // 'main' loop
{
    i = i + 1;
    print("i=" + i);
    if (i < 5) {
        var t;
        t = window.setTimeout("mainCount()", 500);
    }
}

// main:
mainCount();
timedCount();
print("after timedCount()");
... which results in something like this output:
i=1
c=1
after timedCount()
c=2
c=3
c=4
c=5
c=6
i=2
c=7
c=8
c=9
c=10
i=3
i=4
i=5
... that is, the main counts and auxiliary counts are 'interleaved'/'threaded'/'interspersed', with roughly one main count for every five auxiliary counts, as anticipated.
And now the main question - what is the recommended way of doing this in Perl and Python, respectively?
Additionally, do either Python or Perl offer facilities to implement the above with microsecond timing resolution in cross-platform manner?
Many thanks for any answers,
Cheers!
The simplest and most general way I can think of to do this in Python is to use Twisted, an event-based networking engine.
from twisted.internet import reactor
from twisted.internet import task

c, i = 0, 0

def timedCount():
    global c
    c += 1
    print 'c =', c

def mainCount():
    global i
    i += 1
    print 'i =', i

c_loop = task.LoopingCall(timedCount)
i_loop = task.LoopingCall(mainCount)
c_loop.start(0.1)
i_loop.start(0.5)
reactor.run()
Twisted has a highly efficient and stable event-loop implementation called the reactor. This makes it single-threaded and a close analogue to the JavaScript in your example above. The reason I'd use it for periodic tasks like yours is that it makes it easy to add as many schedules, with whatever periods, as you like.
It also offers more tools for scheduling task calls you might find interesting.
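For instance, to mimic the JavaScript more closely (it stops counting at c=10 and i=5), each LoopingCall can stop itself from inside its own callback, and reactor.callLater can shut the reactor down; a sketch, where the 3-second shutdown delay is an arbitrary choice:

from twisted.internet import reactor, task

c, i = 0, 0

def timedCount():
    global c
    c += 1
    print 'c =', c
    if c >= 10:
        c_loop.stop()   # stop this periodic call after ten ticks

def mainCount():
    global i
    i += 1
    print 'i =', i
    if i >= 5:
        i_loop.stop()

c_loop = task.LoopingCall(timedCount)
i_loop = task.LoopingCall(mainCount)
c_loop.start(0.1)
i_loop.start(0.5)
reactor.callLater(3, reactor.stop)  # shut everything down after 3 s
reactor.run()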
A simple Python implementation using the standard library's threading.Timer:
from threading import Timer

def timed_count(n=0):
    n += 1
    print 'c=%d' % n
    if n < 10:
        Timer(.1, timed_count, args=[n]).start()

def main_count(n=0):
    n += 1
    print 'i=%d' % n
    if n < 5:
        Timer(.5, main_count, args=[n]).start()

main_count()
timed_count()
print 'after timed_count()'
Alternatively, you can't go wrong using an asynchronous library like twisted (demonstrated in this answer) or gevent (there are quite a few more out there).
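A rough gevent equivalent, for comparison (a sketch only: gevent.sleep yields to the event loop, so the two greenlets interleave much like the timers above):

import gevent

def timed_count():
    for c in range(1, 11):
        print 'c=%d' % c
        gevent.sleep(.1)   # yields to the other greenlet

def main_count():
    for i in range(1, 6):
        print 'i=%d' % i
        gevent.sleep(.5)

gevent.joinall([gevent.spawn(main_count),
                gevent.spawn(timed_count)])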
For Perl's default capabilities, in "How do I sleep for a millisecond in Perl?" it is stated that:
sleep has a resolution of one second
select accepts a floating-point timeout in seconds, so the fractional part gives sub-second resolution
For greater resolution, one can use the Time::HiRes module, for instance its usleep().
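(For the Python side of the same sub-question: time.sleep() already accepts a float, so sub-millisecond requests are possible, though the achievable resolution is OS-dependent rather than guaranteed. A tiny check:

import time

t0 = time.time()
time.sleep(0.0005)        # request 500 microseconds
print time.time() - t0    # actual elapsed time depends on the OS timer
)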
If using only default Perl capabilities, the sole way to achieve this 'threaded' counting seems to be to fork the script and let each fork act as a 'thread' doing its own count. I saw this approach in "Perl - How to call an event after a time delay"; below is a modified version reflecting the OP:
#!/usr/bin/env perl
use strict;

my $pid;
my $c = 0;
my $i = 0;

sub mainCount()
{
    print "mainCount\n";
    while ($i < 5) {
        $i = $i + 1;
        print("i=" . $i . "\n");
        select(undef, undef, undef, 0.5); # sleep 500 ms
    }
};

sub timedCount()
{
    print "timedCount\n";
    while ($c < 10) {
        $c = $c + 1;
        print("c=" . $c . "\n");
        select(undef, undef, undef, 0.1); # sleep 100 ms
    }
};

# main:
die "cant fork $!\n" unless defined($pid = fork());
if ($pid) {
    mainCount();
} else {
    timedCount();
}
Here's another Perl example, without fork, using Time::HiRes with usleep (for main) and setitimer (for the auxiliary count). However, it seems the setitimer needs to be retriggered, and even then it seems to just run through the commands without actually waiting:
#!/usr/bin/env perl
use strict;
use warnings;
use Time::HiRes qw(usleep ITIMER_VIRTUAL setitimer);

my $c = 0;
my $i = 0;

sub mainCount()
{
    print "mainCount\n";
    while ($i < 5) {
        $i = $i + 1;
        print("i=" . $i . "\n");
        #~ select(undef, undef, undef, 0.5); # sleep 500 ms
        usleep(500000);
    }
};

my $tstart = 0;
sub timedCount()
{
    #~ print "timedCount\n";
    if ($c < 10) {
        $c = $c + 1;
        print("c=" . $c . "\n");
        # if we want to loop with VTALRM - must have these *continuously*
        if ($tstart == 0) {
            #~ $tstart = 1; # kills the looping
            $SIG{VTALRM} = &timedCount;
            setitimer(ITIMER_VIRTUAL, 0.1, 0.1);
        }
    }
};

# main:
$SIG{VTALRM} = &timedCount;
setitimer(ITIMER_VIRTUAL, 0.1, 0.1);
mainCount();
EDIT: Here is an even simpler example with setitimer, which I cannot get to time out correctly (regardless of ITIMER_VIRTUAL or ITIMER_REAL); it simply runs as fast as possible:
use strict;
use warnings;
use Time::HiRes qw ( setitimer ITIMER_VIRTUAL ITIMER_REAL time );

sub ax() {
    print time, "\n";
    # re-initialize
    $SIG{VTALRM} = &ax;
    #~ $SIG{ALRM} = &ax;
}

$SIG{VTALRM} = &ax;
setitimer(ITIMER_VIRTUAL, 1e6, 1e6);
#~ $SIG{ALRM} = &ax;
#~ setitimer(ITIMER_REAL, 1e6, 1e6);
I have a data file of almost 9 million lines (soon to be more than 500 million lines) and I'm looking for the fastest way to read it in. The five aligned columns are padded and separated by spaces, so I know where on each line to look for the two fields that I want.
My Python routine takes 45 secs:
import sys, time

start = time.time()
filename = 'test.txt'  # space-delimited, aligned columns
trans = []
numax = 0
for line in open(filename, 'r'):
    nu = float(line[-23:-11]); S = float(line[-10:-1])
    if nu > numax: numax = nu
    trans.append((nu, S))
end = time.time()
print len(trans), 'transitions read in %.1f secs' % (end - start)
print 'numax =', numax
whereas the routine I've come up with in C is a more pleasing 4 secs:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BPL 47
#define FILENAME "test.txt"
#define NTRANS 8858226

int main(void) {
    size_t num;
    unsigned long i;
    char buf[BPL];
    char *sp;
    double *nu, *S;
    double numax;
    FILE *fp;
    time_t start, end;

    nu = (double *)malloc(NTRANS * sizeof(double));
    S = (double *)malloc(NTRANS * sizeof(double));

    start = time(NULL);
    if ((fp = fopen(FILENAME, "rb")) != NULL) {
        i = 0;
        numax = 0.;
        do {
            if (i == NTRANS) { break; }
            num = fread(buf, 1, BPL, fp);
            buf[BPL-1] = '\0';
            sp = &buf[BPL-10]; S[i] = atof(sp);
            buf[BPL-11] = '\0';
            sp = &buf[BPL-23]; nu[i] = atof(sp);
            if (nu[i] > numax) { numax = nu[i]; }
            ++i;
        } while (num == BPL);
        fclose(fp);
        end = time(NULL);
        fprintf(stdout, "%d lines read; numax = %12.6f\n", (int)i, numax);
        fprintf(stdout, "that took %.1f secs\n", difftime(end, start));
    } else {
        fprintf(stderr, "Error opening file %s\n", FILENAME);
        free(nu); free(S);
        return EXIT_FAILURE;
    }
    free(nu); free(S);
    return EXIT_SUCCESS;
}
Solutions in Fortran, C++ and Java take intermediate amounts of time (27 secs, 20 secs, 8 secs).
My question is: have I made any outrageous blunders in the above (particularly the C-code)? And is there any way to speed up the Python routine? I quickly realised that storing my data in an array of tuples was better than instantiating a class for each entry.
Some points:
Your C routine is cheating; it is being tipped off with the filesize, and is pre-allocating ...
Python: consider using array.array('d') ... one each for S and nu. Then try pre-allocation.
Python: write your routine as a function and call it -- accessing function-local variables is rather faster than accessing module-global variables. A sketch combining these two points follows.
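A minimal sketch of those last two points together, assuming the NTRANS row count taken from the C version in the question:

import array, time

NTRANS = 8858226  # assumption: known row count, as in the C version

def read_transitions(filename='test.txt'):
    # Pre-allocate one flat array of doubles per column; array
    # repetition avoids building a huge temporary Python list.
    nu = array.array('d', [0.0]) * NTRANS
    S = array.array('d', [0.0]) * NTRANS
    numax = 0.0
    i = 0
    for line in open(filename, 'r'):
        nu[i] = float(line[-23:-11])
        S[i] = float(line[-10:-1])
        if nu[i] > numax:
            numax = nu[i]
        i += 1
    return nu, S, numax, i

start = time.time()
nu, S, numax, n = read_transitions()
print n, 'transitions read in %.1f secs' % (time.time() - start)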
An approach that could probably be applied to the C, C++ and Python versions is to memory-map the file. The most significant benefit is that it reduces the double-handling of data as it is copied from one buffer to another. In many cases there is also a benefit from the reduced number of I/O system calls.
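In Python, a minimal sketch of this using the standard mmap module (reusing the same fixed-width slices as the question's code):

import mmap, time

start = time.time()
trans = []
numax = 0.0
with open('test.txt', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    line = mm.readline()           # reads out of the mapping, no extra copy
    while line:
        nu = float(line[-23:-11])
        S = float(line[-10:-1])
        if nu > numax:
            numax = nu
        trans.append((nu, S))
        line = mm.readline()
    mm.close()
print len(trans), 'transitions read in %.1f secs' % (time.time() - start)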
In the C implementation, you could try swapping the fopen()/fread()/fclose() library functions for the lower-level system calls open()/read()/close(). A speedup may come from the fact that fread() does a lot of buffering, whereas read() does not.
Additionally, calling read() less often with bigger chunks reduces the number of system calls, so you have less switching between userspace and kernelspace. When you issue a read() system call (whether or not it was invoked via the fread() library function), the kernel reads the data from disk and then copies it to userspace. The copying part becomes expensive if you issue the system call very often. Reading in larger chunks means fewer context switches and less copying.
Keep in mind, though, that read() isn't guaranteed to return a block of the exact number of bytes you wanted. That is why a reliable and proper implementation always checks the return value of read().
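The same idea is easy to express in Python with os.read on a raw file descriptor; a sketch, where the 1 MiB chunk size is an arbitrary choice, with explicit handling of short reads and partial trailing lines:

import os

CHUNK = 1 << 20  # 1 MiB per read(); arbitrary, tune to taste
fd = os.open('test.txt', os.O_RDONLY)
leftover = b''
try:
    while True:
        chunk = os.read(fd, CHUNK)   # may return fewer bytes than asked
        if not chunk:
            break                    # EOF
        lines = (leftover + chunk).split(b'\n')
        leftover = lines.pop()       # carry any partial last line over
        for line in lines:
            pass                     # parse the fixed-width fields here
finally:
    os.close(fd)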
You have the 1 and the BPL arguments the wrong way around in fread() (the way you have it, it could read a partial line, which you don't test for). You should also be testing the return value of fread() before you try and use the returned data.
You might be able to speed the C version up a bit by reading more than one line at a time:
#define LINES_PER_READ 1000
char buf[LINES_PER_READ][BPL];

/* ... */

while (i < NTRANS && (num = fread(buf, BPL, LINES_PER_READ, fp)) > 0) {
    int line;
    for (line = 0; i < NTRANS && line < num; line++) {
        buf[line][BPL-1] = '\0';
        sp = &buf[line][BPL-10]; S[i] = atof(sp);
        buf[line][BPL-11] = '\0';
        sp = &buf[line][BPL-23]; nu[i] = atof(sp);
        if (nu[i] > numax) { numax = nu[i]; }
        ++i;
    }
}
On systems supporting posix_fadvise(), you should also do this upfront, after opening the file:
posix_fadvise(fileno(fp), 0, 0, POSIX_FADV_SEQUENTIAL);
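If you ever port the reader to Python 3 on a POSIX system, the same hint is available there too (Python 3.3+, Unix only); a two-line sketch:

import os

fd = os.open('test.txt', os.O_RDONLY)
# Tell the kernel we intend to read the file sequentially
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)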
Another possible speed-up, given the number of times you need to do it, is to use pointers to S and nu instead of indexing into arrays, e.g.,
double *pS = S, *pnu = nu;
...
*pS++ = atof(sp);
*pnu++ = atof(sp);
...
Also, since you are always converting from char to double at the same locations in buf, pre-compute the addresses outside of your loop instead of computing them each time in the loop.
This concerns single-threaded C code in a Python extension module; in particular, the ahocorasick module (easy_install ahocorasick).
I isolated the problem to a trivial example:
import ahocorasick
t = ahocorasick.KeywordTree()
t.add("a")
When I run it in gdb, all is fine, and the same happens when I enter these instructions into the Python CLI. However, when I run the script normally, I get a segfault.
To make it even weirder, the line that causes the segfault (identified by core dump analysis) is a regular int incrementation (see the bottom of the function body).
I'm completely stuck at this point; what can I do?
int
aho_corasick_addstring(aho_corasick_t *in, unsigned char *string, size_t n)
{
    aho_corasick_t *g = in;
    aho_corasick_state_t *state, *s = NULL;
    int j = 0;

    state = g->zerostate;

    // As long as we have transitions follow them
    while (j != n &&
           (s = aho_corasick_goto_get(state, *(string+j))) != FAIL)
    {
        state = s;
        ++j;
    }

    if (j == n) {
        /* dyoo: added so that if a keyword ends up in a prefix
           of another, we still mark that as a match. */
        aho_corasick_output(s) = j;
        return 0;
    }

    while (j != n)
    {
        // Create new state
        if ((s = xalloc(sizeof(aho_corasick_state_t))) == NULL)
            return -1;
        s->id = g->newstate++;
        debug(printf("allocating state %d\n", s->id)); /* debug */
        s->depth = state->depth + 1;

        /* FIXME: check the error return value of
           aho_corasick_goto_initialize. */
        aho_corasick_goto_initialize(s);

        // Create transition
        aho_corasick_goto_set(state, *(string+j), s);
        debug(printf("%u -> %c -> %u\n", state->id, *(string+j), s->id));
        state = s;
        aho_corasick_output(s) = 0;
        aho_corasick_fail(s) = NULL;
        ++j; // <--- HERE!
    }
    aho_corasick_output(s) = n;
    return 0;
}
There are other tools you can use that will find faults that do not necessarily crash the program.
valgrind, electric fence, purify, coverity, and lint-like tools may be able to help you.
You might need to build your own Python in some cases for this to be usable. Also, for memory-corruption issues, there is (or was; I haven't built extensions in a while) a way to make Python use direct memory allocation instead of its own allocator.
Have you tried translating that while loop to a for loop? Maybe there's some subtle misunderstanding with the ++j that will disappear if you use something more intuitive.