Python faster than Rust with PyO3 - python

I am trying to speed up some Python code using Rust bindings with PyO3.
I have implemented the following function in both Python and Rust:
def _play_action(state, action):
    temp = state.copy()
    i1, j1, i2, j2 = action
    h1 = abs(temp[i1][j1])
    h2 = abs(temp[i2][j2])
    if temp[i1][j1] < 0:
        temp[i2][j2] = -(h1 + h2)
    else:
        temp[i2][j2] = h1 + h2
    temp[i1][j1] = 0
    return temp
#[pyfunction]
fn play_action(state: [[i32; 9]; 9], action: [usize; 4]) -> [[i32; 9]; 9] {
    let mut s = state.clone();
    let h1 = s[action[0]][action[1]];
    let h2 = s[action[2]][action[3]];
    s[action[0]][action[1]] = 0;
    s[action[2]][action[3]] = h1.signum() * (h1 + h2).abs();
    s
}
And to my great surprise, the Python version is faster. Any idea why?

This is probably caused by the overhead of the communication between Python and Rust. The data you're passing is small, so I assume you're calling play_action many times, and every call pays the conversion cost again. A better approach would be to batch your calls:
#[pyfunction]
fn play_actions(data: Vec<([[i32; 9]; 9], [usize; 4])>) -> Vec<[[i32; 9]; 9]> {
    data.into_iter()
        .map(|(state, action)| play_action(state, action))
        .collect::<Vec<_>>()
}
fn play_action(state: [[i32; 9]; 9], action: [usize; 4]) -> [[i32; 9]; 9] {
    let mut s = state.clone();
    let h1 = s[action[0]][action[1]];
    let h2 = s[action[2]][action[3]];
    s[action[0]][action[1]] = 0;
    s[action[2]][action[3]] = h1.signum() * (h1 + h2).abs();
    s
}
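On the Python side, batching means collecting the (state, action) pairs first and crossing the language boundary once. The sketch below only illustrates that calling pattern: play_action and play_actions here are pure-Python stand-ins (an assumption, so that the snippet runs without a compiled extension), and make_state is a hypothetical helper. In a real setup play_actions would be imported from the PyO3 module.

```python
def play_action(state, action):
    """Pure-Python stand-in mirroring the question's _play_action."""
    temp = [row[:] for row in state]  # copy every row so the input stays untouched
    i1, j1, i2, j2 = action
    h1 = abs(temp[i1][j1])
    h2 = abs(temp[i2][j2])
    temp[i2][j2] = -(h1 + h2) if temp[i1][j1] < 0 else h1 + h2
    temp[i1][j1] = 0
    return temp


def play_actions(data):
    """Stand-in for the batched Rust function: one call, many pairs."""
    return [play_action(state, action) for state, action in data]


def make_state(a, b):
    """Hypothetical helper: 9x9 board with two non-zero cells."""
    state = [[0] * 9 for _ in range(9)]
    state[0][0], state[0][1] = a, b
    return state


# Build the whole batch first, then make a single call.
batch = [(make_state(a, b), (0, 0, 0, 1)) for a, b in [(1, 2), (-1, 2), (3, -4)]]
results = play_actions(batch)
merged = [r[0][1] for r in results]
print(merged)
```

Whether this pays off depends on how many calls can be grouped; with only a handful of pairs per batch the conversion cost may still dominate.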

If you are calling a function written in Rust from Python, there has to be a conversion from Python objects to Rust data structures. The time that this takes is overhead.
Since your function is pretty small, it could easily be that this overhead overwhelms the runtime of the function itself.
I would encourage you to profile your Python code (using the cProfile module) before trying to make it faster. Profiling, and the insight into the behavior of your code that it provides, can enable significant performance gains.
Here is a link to the first of a series of articles that I've written about Python profiling.
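As a rough illustration of the profiling step, here is a minimal sketch using cProfile and pstats from the standard library. The workload function and the pure-Python play_action are assumptions made up for this example; substitute whatever loop actually drives your code.

```python
import cProfile
import io
import pstats


def play_action(state, action):
    """Pure-Python stand-in for the function being investigated."""
    temp = [row[:] for row in state]
    i1, j1, i2, j2 = action
    h1 = abs(temp[i1][j1])
    h2 = abs(temp[i2][j2])
    temp[i2][j2] = -(h1 + h2) if temp[i1][j1] < 0 else h1 + h2
    temp[i1][j1] = 0
    return temp


def workload(n=10_000):
    """Hypothetical driver: calls the small function many times."""
    state = [[1] * 9 for _ in range(9)]
    for _ in range(n):
        play_action(state, (0, 0, 0, 1))


def profile_workload():
    """Run the workload under cProfile and return the text report."""
    profiler = cProfile.Profile()
    profiler.enable()
    workload()
    profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(5)  # top 5 entries only
    return out.getvalue()


report = profile_workload()
print(report)
```

The report shows call counts and cumulative time per function, which is usually enough to tell whether a small, frequently called function is really where the time goes.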
If you do a lot of number crunching, see if your problem is a good fit for NumPy.
If a relatively small function takes up a lot of the execution time because it is called very often with the same arguments, try using the functools.cache decorator.
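A minimal sketch of the caching idea, assuming the board can be represented as a tuple of tuples: functools.cache needs hashable arguments, so a list of lists would not work directly. The representation and the example values are assumptions for illustration only.

```python
from functools import cache


@cache
def play_action(state, action):
    """Cached variant; state is a tuple of tuples, action a tuple of four ints."""
    i1, j1, i2, j2 = action
    rows = [list(row) for row in state]
    h1 = abs(rows[i1][j1])
    h2 = abs(rows[i2][j2])
    rows[i2][j2] = -(h1 + h2) if rows[i1][j1] < 0 else h1 + h2
    rows[i1][j1] = 0
    return tuple(tuple(row) for row in rows)


# Example board: all zeros except two cells.
base = [[0] * 9 for _ in range(9)]
base[0][0], base[0][1] = -2, 3
state = tuple(tuple(row) for row in base)

first = play_action(state, (0, 0, 0, 1))
second = play_action(state, (0, 0, 0, 1))  # same arguments: served from the cache
info = play_action.cache_info()
print(info)
```

Caching only helps if the same (state, action) pairs really recur; if nearly every call sees a new state, the hashing cost is added on top for no benefit.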
Keep in mind that a better algorithm generally beats optimizations.

Related

Assigned a complex value in cupy RawKernel

I am a beginner learning how to exploit the GPU for parallel computation using Python and CuPy. I would like to write code that simulates some problems in physics, which requires complex numbers, but I don't know how to handle them. Although there are examples in CuPy's official documentation, they only mention including the complex.cuh header and how to declare a complex variable. I can't find any example of how to assign a complex number correctly, or of how to call the functions in complex.cuh to do calculations.
I am stuck at line 11 of this code. I want to build a complex value equal to x[tId_x] + j*y[tId_y], where j is the imaginary unit. I tried several ways and none of them worked, so I left this one here.
import cupy as cp
import time
add_kernel = cp.RawKernel(r'''
#include <cupy/complex.cuh>
extern "C" __global__
void test(double* x, double* y, complex<float>* z){
    int tId_x = blockDim.x*blockIdx.x + threadIdx.x;
    int tId_y = blockDim.y*blockIdx.y + threadIdx.y;
    complex<float>* value = complex(x[tId_x],y[tId_y]);
    z[tId_x*blockDim.y*gridDim.y+tId_y] = value;
}''',"test")
x = cp.random.rand(1,8,4096,dtype = cp.float32)
y = cp.random.rand(1,8,4096,dtype = cp.float32)
z = cp.zeros((4096,4096), dtype = cp.complex64)
t1 = time.time()
add_kernel((128,128),(32,32),(x,y,z))
print(time.time()-t1)
What is the proper way to assign a complex number in the RawKernel?
Thank you for answering this question!
@plaeonix, thank you very much for your hint. I found the answer.
This line:
complex<float>* value = complex(x[tId_x],y[tId_y]);
should be replaced with:
complex<float> value = complex<float>(x[tId_x],y[tId_y]);
Then the assignment of a complex number works.

Why is the R substr function so much slower than slicing in Python?

While attempting to optimize a script I was writing, I stumbled across something that has me confused. In R, if I want to grab one character out of a string at a specific location, it seems like the substr function is my best bet. In Python, it makes sense to perform the same operation using slice notation. What confuses me is just how different the speeds of these two operations are. I wrote two test scripts to showcase this:
In Python:
import time
test = "testtesttesttesttesttesttest"
startTime = time.time()
result = list()
for i in range(100000):
    result.append(test[19])
stopTime = time.time()
print(stopTime - startTime)
In R:
library(microbenchmark)
test = "testtesttesttesttesttesttest"
testWithSubstr = function() {
  result = character(123456)
  for(i in 1:123456) {
    result[i] = substr(test, 20, 20)
  }
}
print(microbenchmark(testWithSubstr()))
(edited to use a for loop instead of sapply for a better comparison and to use microbenchmark)
The Python code runs over 10 times faster, despite (to my knowledge) doing essentially the same thing. Why is this?
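For a fairer comparison, note that the two scripts above do not measure the same thing: the Python loop runs 100000 iterations and is timed once with time.time(), while the R loop runs 123456 iterations and microbenchmark repeats it many times. A possible way to time the Python side with repeated runs is the timeit module; the function name index_loop and the repeat count are assumptions for this sketch.

```python
import timeit

test = "testtesttesttesttesttesttest"


def index_loop(n=123456):
    """Same work as the R loop: pick the 20th character n times."""
    result = []
    for _ in range(n):
        result.append(test[19])  # index 19 in Python == position 20 in R
    return result


# Run the whole loop 5 times and keep the fastest run.
timings = timeit.repeat(index_loop, number=1, repeat=5)
best = min(timings)
print(f"best of 5: {best:.4f} s")
```

Taking the minimum over several runs reduces noise from other processes, which makes the number easier to compare with the microbenchmark output.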

F# library or .Net Numerics equivalent to Python Numpy function

I have the following Python NumPy function; it takes X, an array with an arbitrary number of columns and rows, and outputs a Y value predicted by a least squares fit.
What is the Math.Net equivalent for such a function?
Here is the Python code:
newdataX = np.ones([dataX.shape[0],dataX.shape[1]+1])
newdataX[:,0:dataX.shape[1]]=dataX
# build and save the model
self.model_coefs, residuals, rank, s = np.linalg.lstsq(newdataX, dataY)
I think you are looking for the functions on this page: http://numerics.mathdotnet.com/api/MathNet.Numerics.LinearRegression/MultipleRegression.htm
You have a few options to solve it:
Normal equations: MultipleRegression.NormalEquations(x, y)
QR decomposition: MultipleRegression.QR(x, y)
SVD: MultipleRegression.SVD(x, y)
Normal equations are faster but less numerically stable, while SVD is the most numerically stable but the slowest.
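To make the first option concrete, here is a small plain-Python sketch of what a normal-equations solve does with the question's setup (a column of ones appended after the data columns). It is only an illustration of the technique, not how Math.NET or NumPy implement it; the function name fit_normal_equations and the sample data are assumptions.

```python
def fit_normal_equations(rows, ys):
    """Least squares via the normal equations (X^T X) beta = X^T y.

    A constant 1.0 is appended to every row, like the ones column in the
    question, so the last coefficient is the intercept.
    """
    X = [list(r) + [1.0] for r in rows]
    n = len(X[0])
    # Build A = X^T X and b = X^T y.
    A = [[sum(x[i] * x[j] for x in X) for j in range(n)] for i in range(n)]
    b = [sum(x[i] * y for x, y in zip(X, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    coefs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(A[i][j] * coefs[j] for j in range(i + 1, n))
        coefs[i] = (b[i] - tail) / A[i][i]
    return coefs


# Points on the line y = 2x + 1.
coefs = fit_normal_equations([[0.0], [1.0], [2.0], [3.0]], [1.0, 3.0, 5.0, 7.0])
print(coefs)
```

Forming X^T X squares the condition number of the problem, which is the reason this route is less stable than QR or SVD on ill-conditioned data.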
You can call NumPy from .NET using pythonnet (the C# code below is copied from GitHub):
The only "funky" part right now with pythonnet is passing NumPy arrays. It is possible to convert them to Python lists at the interface, though this reduces performance in some situations.
https://github.com/pythonnet/pythonnet/tree/develop
static void Main(string[] args)
{
    using (Py.GIL()) {
        dynamic np = Py.Import("numpy");
        dynamic sin = np.sin;
        Console.WriteLine(np.cos(np.pi*2));
        Console.WriteLine(sin(5));
        double c = np.cos(5) + sin(5);
        Console.WriteLine(c);
        dynamic a = np.array(new List<float> { 1, 2, 3 });
        dynamic b = np.array(new List<float> { 6, 5, 4 }, Py.kw("dtype", np.int32));
        Console.WriteLine(a.dtype);
        Console.WriteLine(b.dtype);
        Console.WriteLine(a * b);
        Console.ReadKey();
    }
}
outputs:
1.0
-0.958924274663
-0.6752620892
float64
int32
[ 6. 10. 12.]
Here is an example using F#, posted on GitHub:
https://github.com/pythonnet/pythonnet/issues/112
open Python.Runtime
open FSharp.Interop.Dynamic
open System.Collections.Generic
[<EntryPoint>]
let main argv =
    //set up for garbage collection?
    use gil = Py.GIL()
    //-----
    //NUMPY
    //import numpy
    let np = Py.Import("numpy")
    //call a numpy function dynamically
    let sinResult = np?sin(5)
    //make a python list the hard way
    let list = new Python.Runtime.PyList()
    list.Append( new PyFloat(4.0) )
    list.Append( new PyFloat(5.0) )
    //run the python list through np.array dynamically
    let a = np?array( list )
    let sumA = np?sum(a)
    //again, but use a keyword to change the type
    let b = np?array( list, Py.kw("dtype", np?int32 ) )
    let sumAB = np?add(a,b)
    let SeqToPyFloat ( aSeq : float seq ) =
        let list = new Python.Runtime.PyList()
        aSeq |> Seq.iter( fun x -> list.Append( new PyFloat(x)))
        list
    //Worth making some convenience functions (see below for why)
    let a2 = np?array( [|1.0;2.0;3.0|] |> SeqToPyFloat )
    //--------------------
    //Problematic cases: these run but don't give good results
    //make a np.array from a generic list
    let list2 = [|1;2;3|] |> ResizeArray
    let c = np?array( list2 )
    printfn "%A" c //gives type not value in debugger
    //make a np.array from an array
    let d = np?array( [|1;2;3|] )
    printfn "%A" d //gives type not value in debugger
    //use a np.array in a function
    let sumD = np?sum(d) //gives type not value in debugger
    //let sumCD = np?add(d,d) // this will crash
    //can't use primitive f# operators on the np.arrays without throwing an exception; seems
    //to work in c# https://github.com/tonyroberts/pythonnet //develop branch
    //let e = d + 1
    //-----
    //NLTK
    //import nltk
    let nltk = Py.Import("nltk")
    let sentence = "I am happy"
    let tokens = nltk?word_tokenize(sentence)
    let tags = nltk?pos_tag(tokens)
    let taggedWords = nltk?corpus?brown?tagged_words()
    let taggedWordsNews = nltk?corpus?brown?tagged_words(Py.kw("categories", "news") )
    printfn "%A" taggedWordsNews
    let tlp = nltk?sem?logic?LogicParser(Py.kw("type_check",true))
    let parsed = tlp?parse("walk(angus)")
    printfn "%A" parsed?argument
    0 // return an integer exit code

Numba function slower than C++ and loop re-order further slows down x10

The following code simulates extracting binary words from different locations within a set of images.
The Numba-wrapped function, wordcalc in the code below, has 2 problems:
1. It is 3 times slower compared to a similar implementation in C++.
2. Most strangely, if you switch the order of the "ibase" and "ibit" for-loops, speed drops by a factor of 10 (!). This does not happen in the C++ implementation, which remains unaffected.
I'm using Numba 0.18.2 from WinPython 2.7
What could be causing this?
import numpy as np
import numba as nb
imDim = 80
numInsts = 10**4
numInstsSub = 10**4/4
bitsNum = 13
Xs = np.random.rand(numInsts, imDim**2)
iInstInds = np.array(range(numInsts)[::4])
baseInds = np.arange(imDim**2 - imDim*20 + 1)
ofst1 = np.random.randint(0, imDim*20, bitsNum)
ofst2 = np.random.randint(0, imDim*20, bitsNum)
@nb.jit(nopython=True)
def wordcalc(Xs, iInstInds, baseInds, ofst, bitsNum, newXz):
    count = 0
    for i in iInstInds:
        Xi = Xs[i]
        for ibit in range(bitsNum):
            for ibase in range(baseInds.shape[0]):
                # compare two pixels and store the result as bit number ibit
                u = Xi[baseInds[ibase] + ofst[0, ibit]] > Xi[baseInds[ibase] + ofst[1, ibit]]
                newXz[count, ibase] = newXz[count, ibase] | np.uint16(u * (2**ibit))
        count += 1
    return newXz
ret = wordcalc(Xs, iInstInds, baseInds, np.array([ofst1, ofst2]), bitsNum, np.zeros((iInstInds.size, baseInds.size), dtype=np.uint16))
I get 4x speed-up by changing from np.uint16(u * (2**ibit)) to np.uint16(u << ibit); i.e. replace the power of 2 with a bitshift, which should be equivalent (for integers).
It seems reasonably likely that your C++ compiler might be making this substitution itself.
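The equivalence of the two expressions can be checked in plain Python. This sketch only demonstrates that multiplying by a power of two and shifting give the same word for boolean bits; it says nothing about how Numba compiles either form, and the helper names are made up for the example.

```python
def word_from_bits_pow(bits):
    """Pack booleans into an integer using u * (2**ibit)."""
    word = 0
    for ibit, u in enumerate(bits):
        word |= u * (2 ** ibit)
    return word


def word_from_bits_shift(bits):
    """Pack booleans into an integer using u << ibit."""
    word = 0
    for ibit, u in enumerate(bits):
        word |= u << ibit
    return word


bits = [True, False, True, True]  # bits 0, 2 and 3 set: 1 + 4 + 8
pow_word = word_from_bits_pow(bits)
shift_word = word_from_bits_shift(bits)
print(pow_word, shift_word)
```

Both helpers return the same value for every 13-bit pattern, so the rewrite changes only how the bit is computed, not the result.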
Swapping the order of the two loops makes only a small difference for me, both for your original version (5%) and for my optimized version (15%), so I don't think I can make a useful comment on that.
If you really want to compare Numba and C++, you can look at the compiled Numba function by setting os.environ['NUMBA_DUMP_ASSEMBLY']='1' before you import Numba. (That's clearly quite involved, though.)
For reference, I'm using Numba 0.19.1.

Rewriting this python function in c++ seems to make it run a lot slower. Is this reasonable?

I'm implementing some graph traversal functions in Python, but I need better performance, so I decided to try rewriting the functions in C++. However, they seem to run slower. I'm a C++ beginner, so I'm not sure if this is expected behavior.
The following Python function implements a breadth-first search on an unweighted graph. Its objective is to visit every vertex once and measure how many hops away each vertex is from the source.
graph is dict {vertex : set(neighbor1, neighbor2 ... , neighbor n) }
return is dict {vertex : distance_to_source}
def shortest_path_lengths(graph,source):
    seen={}
    level=0
    nextlevel={source}
    while nextlevel:
        thislevel=nextlevel
        nextlevel=set()
        for v in thislevel:
            if v not in seen:
                seen[v]=level
                nextlevel.update(graph[v])
        level=level+1
    return seen
And runs:
%timeit seen = shortest_path_lengths(G,0)
10 loops, best of 3: 79.7 ms per loop
For my c++ implementation:
graph is map< long vertex, set < long > vertex neighbors >
return is map < long vertex ,int distance_from_source >
map<long,int> spl(graph G, long source)
{
    int level = 0;
    map<long, int> seen;
    set<long> nextlevel;
    set<long> thislevel;
    nextlevel.insert(source);
    while (! nextlevel.empty())
    {
        thislevel = nextlevel;
        nextlevel.clear();
        for (auto it = thislevel.begin(); it != thislevel.end(); ++it)
        {
            if (! seen.count(*it))
            {
                seen[*it] = level;
                //cout << G[*it];
                nextlevel.insert(G[*it].begin(), G[*it].end());
            }
        }
        level++;
    }
    return seen;
}
and I measure its execution time with:
clock_t begin = clock();
seen = spl(graph1,0);
clock_t end = clock();
double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
printf("\nTime taken: %.8fs\n", elapsed_secs);
and get output:
Time taken: 0.38512900s
which is almost 5 times slower than the Python implementation for the same graph. Seeing as I'm a beginner in C++, I'm not really sure if I'm measuring time wrong, implementing something wrong, or if this is even expected behavior.
EDIT:
After converting the maps into unordered_maps, using the -O2 -flto compile parameters, and passing the graph by const reference, the run time of the C++ function for a graph of 65k nodes drops to 0.09 s, which is still a bit slower than Python's 0.08 s for the same graph.
On a bigger graph of 75k nodes (but over twice as many edges), C++ falls further behind at 0.3 s to Python's 0.2 s.
EDIT2:
After changing the nested set inside the map to unordered_set as well, and changing the thislevel/nextlevel sets to unordered_sets too, the C++ code beats the Python code on the smaller graph (0.063 s to 0.081 s) but only matches it on the bigger one (0.2 s to 0.2 s).
EDIT3:
On an even bigger graph (85k nodes, over 1.5m edges), Python needs 0.9 s for the operation, while the C++ code needs 0.75 s.
The first thing that jumped out at me is that you used a hash map in Python, which is what dictionaries are, and a tree-based map in C++, which is what std::map is (std::set is tree-based as well). The C++ equivalent of a Python dict is unordered_map.
thislevel = nextlevel;
In C++, this makes a full copy of the set. You should rather use pointers to the sets and swap the pointers instead of copying the sets.
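The difference is easy to see from the Python side, where the same statement only rebinds a name. The sketch below is a plain-Python illustration of reference binding versus an explicit copy; the variable copied is added for the example and stands for what the C++ assignment effectively does.

```python
nextlevel = {1, 2, 3}

# Python: assignment binds a second name to the same set object, no copy.
thislevel = nextlevel
same_object = thislevel is nextlevel

# Rebinding nextlevel to a new empty set leaves thislevel's contents alone.
nextlevel = set()

# What the C++ assignment does instead: build a second set element by element.
copied = set(thislevel)

print(same_object, thislevel, nextlevel, copied is thislevel)
```

So the Python loop pays a constant cost per level for this step, while the C++ version as written pays a cost proportional to the size of the frontier.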
