I have M ~ 200k points with the X,Y coordinates of cities packed in an Mx2 numpy array.
The intent is, for each city, to compute the top N closest cities and return their indices and distances to that city in MxN numpy matrices.
Numba does a great job speeding up my serial Python code on the CPU and making it multithreaded using the prange generator: on a 16-core SSE 4.2/AVX machine, a call to compute the 30 closest cities completes in 6 min 26 s while saturating all the cores:
import numpy as np
import numba
from numba import prange

@numba.njit(fastmath=True)
def get_closest(n, N_CLOSEST, coords, i):
    dist = np.empty(n, np.float32)
    for j in range(n):
        dist[j] = (coords[i,0] - coords[j,0])**2 + (coords[i,1] - coords[j,1])**2
    indices = np.argsort(dist)[1:N_CLOSEST+1]
    return indices, dist[indices]

@numba.njit(fastmath=True, parallel=True)
def get_top_T_closest_cities(coords, T):
    n = len(coords)
    N_CLOSEST = min(T, n-1)
    closest_distances = np.empty((n, N_CLOSEST), np.float32)
    closest_cities = np.empty((n, N_CLOSEST), np.int32)
    for i in prange(n):
        closest_cities[i,:], closest_distances[i,:] = get_closest(n, N_CLOSEST, coords, i)
    return closest_cities, closest_distances

closest_cities, closest_distances = get_top_T_closest_cities(data, 30)
Note: I wanted to use
mrange=np.arange(N_CLOSEST)
and
indices=np.argpartition(dist,mrange)
later to save some cycles, but unfortunately Numba did not yet support np.argpartition.
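For reference, a plain-NumPy sketch of what that partial-sort variant would look like outside Numba (the helper name get_closest_np is mine): argpartition only guarantees that the N_CLOSEST+1 smallest values come first, so a small argsort over just those entries restores the ordering.

import numpy as np

def get_closest_np(coords, i, N_CLOSEST):
    # squared distances from city i to all cities
    d = ((coords - coords[i])**2).sum(axis=1)
    # unordered indices of the N_CLOSEST+1 smallest distances (includes city i itself)
    part = np.argpartition(d, N_CLOSEST)[:N_CLOSEST + 1]
    # order those few entries, then drop the zero-distance self entry
    indices = part[np.argsort(d[part])][1:]
    return indices, d[indices]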
Then I decided to put my newly bought RTX 2070 to some good use and try offloading these computations, which are very parallel by nature, to the GPU, using Numba again and possibly CuPy.
After some thinking I came up with a relatively dumb rewrite, where a GPU kernel handling one city at a time was called consecutively for each of the M cities. In that kernel, each GPU thread computed its distance and saved it to a specific location in the dist array. All arrays were allocated on the device to minimize PCI data transfer:
import cupy as cp
from numba import cuda

def get_top_T_closest_cities_gpu(coords, T):
    n = len(coords)
    N_CLOSEST = min(T, n-1)
    closest_distances = cp.empty((n, N_CLOSEST), cp.float32)
    closest_cities = cp.empty((n, N_CLOSEST), cp.int32)
    device_coords = cp.asarray(coords)
    dist = cp.ndarray(n, cp.float32)
    for i in range(n):
        if (i % 1000) == 0:
            print(i)
        closest_cities[i,:], closest_distances[i,:] = get_closest(n, N_CLOSEST, device_coords, i, dist)
    return cp.asnumpy(closest_cities), cp.asnumpy(closest_distances)

@cuda.jit()
def get_distances(coords, i, n, dist):
    stride = cuda.gridsize(1)
    start = cuda.grid(1)
    for j in range(start, n, stride):
        dist[j] = (coords[i,0] - coords[j,0])**2 + (coords[i,1] - coords[j,1])**2

def get_closest(n, N_CLOSEST, coords, i, dist):
    get_distances[512, 512](coords, i, n, dist)
    indices = cp.argsort(dist)[1:N_CLOSEST+1]
    return indices, dist[indices]
Now, computing the results on the GPU took almost the same ~6 minutes, but GPU load was barely 1% (yes, I made sure the results returned by the CPU and GPU versions are the same). I played a bit with the block size and did not see any significant change. Interestingly, cp.argsort and get_distances were consuming approximately equal shares of the processing time.
I feel like it has something to do with streams. How do I properly initialize a lot of them? That would be the most straightforward way to reuse my code: process not 1 city at a time but, say, 16, or whatever my compute capability allows, ideally maybe 1000.
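To illustrate what I mean, here is a rough, untested sketch of that batched variant (the names get_distances_batch/get_closest_batch are mine; it assumes a CuPy version whose arrays expose __cuda_array_interface__ so Numba kernels can consume them, and cp.take_along_axis from newer CuPy releases): one 2D-grid kernel fills a (B, n) block of squared distances, and a single cp.argsort over axis 1 then handles B cities per launch instead of one.

import cupy as cp
from numba import cuda

@cuda.jit
def get_distances_batch(coords, first, n, dist):
    # 2D grid: axis 0 -> city within the batch, axis 1 -> candidate city
    bi, j = cuda.grid(2)
    if bi < dist.shape[0] and j < n:
        i = first + bi
        dx = coords[i, 0] - coords[j, 0]
        dy = coords[i, 1] - coords[j, 1]
        dist[bi, j] = dx * dx + dy * dy

def get_closest_batch(device_coords, first, B, n, N_CLOSEST):
    dist = cp.empty((B, n), cp.float32)
    tpb = (4, 256)
    blocks = ((B + tpb[0] - 1) // tpb[0], (n + tpb[1] - 1) // tpb[1])
    get_distances_batch[blocks, tpb](device_coords, first, n, dist)
    indices = cp.argsort(dist, axis=1)[:, 1:N_CLOSEST + 1]
    return indices, cp.take_along_axis(dist, indices, axis=1)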
What would those of you experienced in GPU coding with Numba/CuPy recommend to fully exploit the power of the GPU in my case?
Advice from pure C++ CUDA apologists would also be very welcome, as I haven't seen a comparison of a native CUDA solution with a Numba/CuPy CUDA solution.
Python version: ['3.6.3 |Anaconda, Inc']
platform: AMD64
system: Windows-10-10.0.14393-SP0
Numba version: 0.41.0
You seem to be interested in an answer based on CUDA C++, so I'll provide one. It should be straightforward to convert this kernel to an equivalent numba cuda jit kernel; the translation process is fairly mechanical.
An outline of the method I chose is as follows:
assign one city per thread (the reference city for that thread)
each thread traverses through all the cities computing the pairwise distance to its reference city
as each distance is computed, it is checked to see whether it is within the current "closest cities". The approach here is to keep a running list for each thread: as a thread computes a new distance, it checks whether that distance is less than the current maximum distance in the running list, and if so, the current maximum distance/city is replaced with the one just computed (see the short Python sketch after this outline)
The list of "closest cities" along with their distances is maintained in shared memory.
At the completion of computing all distances, each thread then sorts and writes out the values in its shared buffer to global memory
There are several "optimizations" to this basic plan. One is that each warp reads 32 consecutive values at a time, and then uses a shuffle operation to pass the values around among threads, so as to reduce global memory traffic.
A single kernel call performs the entire operation for all cities.
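Before the full kernel, here is the per-thread running list from the outline in isolation, written as plain Python purely for clarity (illustrative only, not part of the CUDA code):

def running_closest(ref, cities, nclosest):
    # per-thread state: nclosest slots, initialized to "infinitely far"
    dist = [float("inf")] * nclosest
    city = [-1] * nclosest
    for cid, (x, y) in enumerate(cities):
        if cid == ref:
            continue
        d = (cities[ref][0] - x)**2 + (cities[ref][1] - y)**2
        m = max(range(nclosest), key=lambda k: dist[k])  # the find_max step
        if d < dist[m]:                                   # replace current maximum
            dist[m], city[m] = d, cid
    return sorted(zip(dist, city))                        # final sort-and-store step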
The performance of this kernel will vary quite a bit depending on the GPU you run it on. For example, on a Tesla P100 it runs in about 0.5 s. On a Tesla K40 it takes ~6 s, and on a Quadro K2000 ~40 s.
Here is a full test case on the Tesla P100:
$ cat t364.cu
#include <iostream>
#include <math.h>
#include <stdlib.h>
#include <algorithm>
#include <vector>
#include <thrust/sort.h>
#include <thrust/host_vector.h>
const int ncities = 200064; // should be divisible by nTPB
const int nclosest = 32; // maximum 48
const int nTPB = 128; // should be multiple of 32
const float FLOAT_MAX = 999999.0f;
__device__ void init_shared(float *d, int n){
int idx = threadIdx.x;
for (int i = 0; i < n; i++){
d[idx] = FLOAT_MAX;
idx += nTPB;}
}
__device__ int find_max(float *d, int n){
int idx = threadIdx.x;
int max = 0;
float maxd = d[idx];
for (int i = 1; i < n; i++){
idx += nTPB;
float next = d[idx];
if (maxd < next){
max = i;
maxd = next;}}
return max;
}
__device__ float compute_sqdist(float2 c1, float2 c2){
return (c1.x - c2.x)*(c1.x - c2.x) + (c1.y - c2.y)*(c1.y - c2.y);
}
__device__ void sort_and_store_sqrt(int idx, float *closest_dist, int *closest_id, float *dist, int *city, int n, int n_cities)
{
for (int i = n-1; i >= 0; i--){
int max = find_max(dist, n);
closest_dist[idx + i*n_cities] = sqrtf(dist[max*nTPB+threadIdx.x]);
closest_id[idx + i*n_cities] = city[max*nTPB+threadIdx.x];
dist[max*nTPB+threadIdx.x] = 0.0f;}
};
__device__ void shuffle(int &city_id, float2 &next_city_xy, int my_warp_lane){
int src_lane = (my_warp_lane+1)&31;
city_id = __shfl_sync(0xFFFFFFFF, city_id, src_lane);
next_city_xy.x = __shfl_sync(0xFFFFFFFF, next_city_xy.x, src_lane);
next_city_xy.y = __shfl_sync(0xFFFFFFFF, next_city_xy.y, src_lane);
}
__global__ void k(const float2 * __restrict__ cities, float * __restrict__ closest_dist, int * __restrict__ closest_id, const int n_cities){
__shared__ float dist[nclosest*nTPB];
__shared__ int city[nclosest*nTPB];
int idx = threadIdx.x+blockDim.x*blockIdx.x;
int my_warp_lane = (threadIdx.x & 31);
init_shared(dist, nclosest);
float2 my_city_xy = cities[idx];
float last_max = FLOAT_MAX;
for (int i = 0; i < n_cities; i+=32){
int city_id = i+my_warp_lane;
float2 next_city_xy = cities[city_id];
for (int j = 0; j < 32; j++){
if (j > 0) shuffle(city_id, next_city_xy, my_warp_lane);
if (idx != city_id){
float my_dist = compute_sqdist(my_city_xy, next_city_xy);
if (my_dist < last_max){
int max = find_max(dist, nclosest);
last_max = dist[max*nTPB+threadIdx.x];
if (my_dist < last_max) {
dist[max*nTPB+threadIdx.x] = my_dist;
city[max*nTPB+threadIdx.x] = city_id;}}}}}
sort_and_store_sqrt(idx, closest_dist, closest_id, dist, city, nclosest, n_cities);
}
void on_cpu(float2 *cities, float *dists, int *ids){
thrust::host_vector<int> my_ids(ncities);
thrust::host_vector<float> my_dists(ncities);
for (int i = 0; i < ncities; i++){
for (int j = 0; j < ncities; j++){
my_ids[j] = j;
if (i != j) my_dists[j] = sqrtf((cities[i].x - cities[j].x)*(cities[i].x - cities[j].x) + (cities[i].y - cities[j].y)*(cities[i].y - cities[j].y));
else my_dists[j] = FLOAT_MAX;}
thrust::sort_by_key(my_dists.begin(), my_dists.end(), my_ids.begin());
for (int j = 0; j < nclosest; j++){
dists[i+j*ncities] = my_dists[j];
ids[i+j*ncities] = my_ids[j];}
}
}
int main(){
float2 *h_cities, *d_cities;
float *h_closest_dist, *d_closest_dist, *h_closest_dist_cpu;
int *h_closest_id, *d_closest_id, *h_closest_id_cpu;
cudaMalloc(&d_cities, ncities*sizeof(float2));
cudaMalloc(&d_closest_dist, nclosest*ncities*sizeof(float));
cudaMalloc(&d_closest_id, nclosest*ncities*sizeof(int));
h_cities = new float2[ncities];
h_closest_dist = new float[ncities*nclosest];
h_closest_id = new int[ncities*nclosest];
h_closest_dist_cpu = new float[ncities*nclosest];
h_closest_id_cpu = new int[ncities*nclosest];
for (int i = 0; i < ncities; i++){
h_cities[i].x = (rand()/(float)RAND_MAX)*10.0f;
h_cities[i].y = (rand()/(float)RAND_MAX)*10.0f;}
cudaMemcpy(d_cities, h_cities, ncities*sizeof(float2), cudaMemcpyHostToDevice);
k<<<ncities/nTPB, nTPB>>>(d_cities, d_closest_dist, d_closest_id, ncities);
cudaMemcpy(h_closest_dist, d_closest_dist, ncities*nclosest*sizeof(float),cudaMemcpyDeviceToHost);
cudaMemcpy(h_closest_id, d_closest_id, ncities*nclosest*sizeof(int),cudaMemcpyDeviceToHost);
for (int i = 0; i < nclosest; i++){
for (int j = 0; j < 5; j++) std::cout << h_closest_id[j+i*ncities] << "," << h_closest_dist[j+i*ncities] << " ";
std::cout << std::endl;}
if (ncities < 5000){
on_cpu(h_cities, h_closest_dist_cpu, h_closest_id_cpu);
for (int i = 0; i < ncities*nclosest; i++)
if (h_closest_id_cpu[i] != h_closest_id[i]) {std::cout << "mismatch at: " << i << " was: " << h_closest_id[i] << " should be: " << h_closest_id_cpu[i] << std::endl; return -1;}
}
return 0;
}
$ nvcc -o t364 t364.cu
$ nvprof ./t364
==31871== NVPROF is profiling process 31871, command: ./t364
33581,0.00466936 116487,0.0163371 121419,0.0119542 138585,0.00741395 187892,0.0205568
56138,0.0106946 105637,0.0195111 175565,0.0132018 121719,0.0090809 198333,0.0269794
36458,0.014851 6942,0.0244329 67920,0.013367 140919,0.014739 91533,0.0348142
48152,0.0221216 146867,0.0250192 14656,0.020469 163149,0.0247639 162442,0.0354359
3681,0.0226946 127671,0.0265515 32841,0.0219539 109520,0.0359874 21346,0.0424094
20903,0.0313963 3075,0.0279635 79787,0.0220388 106161,0.0365807 24186,0.0453916
191226,0.0316818 4168,0.0297535 126726,0.0246147 154598,0.0374218 62106,0.0459001
185573,0.0349628 76270,0.030849 122878,0.0249695 124897,0.0447656 38124,0.0463704
71252,0.037517 18879,0.0350544 112410,0.0299296 102006,0.0462593 12361,0.0464608
172641,0.0399721 134288,0.035077 39252,0.031506 164570,0.0470057 81105,0.0502873
18960,0.0433545 53266,0.0360357 195467,0.0334281 36715,0.0470069 153375,0.0508115
163568,0.0442928 95544,0.0361058 151839,0.0398384 114041,0.0490263 8524,0.0511531
182326,0.047519 59919,0.0362906 90810,0.0408069 52013,0.0494515 16793,0.0525569
169860,0.048417 77412,0.036694 12065,0.0414496 137863,0.0494703 197500,0.0537481
40158,0.0492621 34592,0.0377113 54812,0.041594 58792,0.049532 70501,0.0541114
121444,0.0501154 102800,0.0414865 96238,0.0433548 34323,0.0531493 161527,0.0551868
118564,0.0509228 82832,0.0449587 167965,0.0439488 104475,0.0534779 66968,0.0572788
60136,0.0528873 88318,0.0455667 118562,0.0462537 129853,0.0535594 7544,0.0588875
95975,0.0587857 65792,0.0479467 124929,0.046828 116360,0.0537344 37341,0.0594454
62542,0.0592229 57399,0.0509665 186583,0.0470843 47571,0.0541045 81275,0.0596965
11499,0.0607943 28314,0.0512354 23730,0.0518801 176089,0.0543222 562,0.06527
131594,0.0611795 23120,0.0515408 25933,0.0547776 117474,0.0557752 194499,0.0657885
23959,0.0623019 137283,0.0533193 173000,0.0577509 157537,0.0566689 193091,0.0666895
5660,0.0629772 15498,0.0555821 161025,0.0596721 123148,0.0589929 147928,0.0672529
51033,0.063036 15456,0.0565314 145859,0.0601408 3012,0.0601779 107646,0.0687241
26790,0.0643055 99659,0.0576798 33153,0.0603556 48388,0.0617377 47566,0.0689055
178826,0.0655352 143209,0.058413 101960,0.0604994 22146,0.0620504 91118,0.0705487
32108,0.0672866 172089,0.058676 17885,0.0638383 137318,0.0624543 138223,0.0716578
38125,0.0678566 42847,0.0609811 10879,0.0681518 154360,0.0633921 96195,0.0723226
96583,0.0683073 169703,0.0621889 100839,0.0721133 28595,0.0661302 20422,0.0731882
98329,0.0683718 50432,0.0636618 84204,0.0733909 181919,0.066552 75375,0.0736715
75814,0.0694582 161714,0.0674298 89049,0.0749184 151275,0.0679411 37849,0.0739173
==31871== Profiling application: ./t364
==31871== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 95.66% 459.88ms 1 459.88ms 459.88ms 459.88ms k(float2 const *, float*, int*, int)
4.30% 20.681ms 2 10.340ms 10.255ms 10.425ms [CUDA memcpy DtoH]
0.04% 195.55us 1 195.55us 195.55us 195.55us [CUDA memcpy HtoD]
API calls: 58.18% 482.42ms 3 160.81ms 472.46us 470.80ms cudaMemcpy
41.05% 340.38ms 3 113.46ms 322.83us 339.73ms cudaMalloc
0.55% 4.5490ms 384 11.846us 339ns 503.79us cuDeviceGetAttribute
0.16% 1.3131ms 4 328.27us 208.85us 513.66us cuDeviceTotalMem
0.05% 407.13us 4 101.78us 93.796us 118.27us cuDeviceGetName
0.01% 61.719us 1 61.719us 61.719us 61.719us cudaLaunchKernel
0.00% 23.965us 4 5.9910us 3.9830us 8.9510us cuDeviceGetPCIBusId
0.00% 6.8440us 8 855ns 390ns 1.8150us cuDeviceGet
0.00% 3.7490us 3 1.2490us 339ns 2.1300us cuDeviceGetCount
0.00% 2.0260us 4 506ns 397ns 735ns cuDeviceGetUuid
$
CUDA 10.0, Tesla P100, CentOS 7
The code includes the option of verification, but it verifies the results against a CPU-based naive code, which takes quite a bit longer. So I've restricted verification to smaller test cases, e.g. 4096 cities.
Here's a "translated" approximation of the above code using numba cuda. It seems to run about 2x slower than the CUDA C++ version, which shouldn't really be the case. (One of the contributing factors seems to be the use of the sqrt function - it is having a significant performance impact in the numba code but no perf impact in the CUDA C++ code. It may be using a double-precision realization under the hood in numba CUDA.) Anyway it gives a reference for how it could be realized in numba:
import numpy as np
import numba as nb
import math
from numba import cuda, float32, int32

# number of cities, should be divisible by threadblock size
N = 200064
# number of "closest" cities
TN = 32
# number of threads per block
threadsperblock = 128
# shared memory size of each array in elements
smSize = TN*threadsperblock

@cuda.jit('int32(float32[:])', device=True)
def find_max(dist):
    idx = cuda.threadIdx.x
    mymax = 0
    maxd = dist[idx]
    for i in range(TN-1):
        idx += threadsperblock
        mynext = dist[idx]
        if maxd < mynext:
            mymax = i+1
            maxd = mynext
    return mymax

@cuda.jit('void(float32[:], float32[:], float32[:], int32[:], int32)')
def k(citiesx, citiesy, td, ti, nc):
    dist = cuda.shared.array(smSize, float32)
    city = cuda.shared.array(smSize, int32)
    idx = cuda.grid(1)
    tid = cuda.threadIdx.x
    wl = tid & 31
    my_city_x = citiesx[idx]
    my_city_y = citiesy[idx]
    last_max = 99999.0
    for i in range(TN):
        dist[tid+i*threadsperblock] = 99999.0
    for i in range(0, nc, 32):
        city_id = i+wl
        next_city_x = citiesx[city_id]
        next_city_y = citiesy[city_id]
        for j in range(32):
            if j > 0:
                src_lane = (wl+1) & 31
                city_id = cuda.shfl_sync(0xFFFFFFFF, city_id, src_lane)
                next_city_x = cuda.shfl_sync(0xFFFFFFFF, next_city_x, src_lane)
                next_city_y = cuda.shfl_sync(0xFFFFFFFF, next_city_y, src_lane)
            if idx != city_id:
                my_dist = (next_city_x - my_city_x)*(next_city_x - my_city_x) + (next_city_y - my_city_y)*(next_city_y - my_city_y)
                if my_dist < last_max:
                    mymax = find_max(dist)
                    last_max = dist[mymax*threadsperblock+tid]
                    if my_dist < last_max:
                        dist[mymax*threadsperblock+tid] = my_dist
                        city[mymax*threadsperblock+tid] = city_id
    for i in range(TN):
        mymax = find_max(dist)
        td[idx+i*nc] = math.sqrt(dist[mymax*threadsperblock+tid])
        ti[idx+i*nc] = city[mymax*threadsperblock+tid]
        dist[mymax*threadsperblock+tid] = 0

# perform test
cx = np.random.rand(N).astype(np.float32)
cy = np.random.rand(N).astype(np.float32)
td = np.zeros(N*TN, dtype=np.float32)
ti = np.zeros(N*TN, dtype=np.int32)
d_cx = cuda.to_device(cx)
d_cy = cuda.to_device(cy)
d_td = cuda.device_array_like(td)
d_ti = cuda.device_array_like(ti)
k[N//threadsperblock, threadsperblock](d_cx, d_cy, d_td, d_ti, N)
d_td.copy_to_host(td)
d_ti.copy_to_host(ti)
for i in range(TN):
    for j in range(1):
        print(ti[i*N+j])
        print(td[i*N+j])
I haven't included validation in this numba cuda variant, so it may contain defects.
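Regarding the sqrt remark above, one cheap way to test that hypothesis (a sketch of mine, not something I have profiled here): write squared distances from the kernel and apply the square root on the host, then compare kernel timings.

import numpy as np

# In the kernel's output loop, write the squared distance:
#     td[idx+i*nc] = dist[mymax*threadsperblock+tid]   # instead of math.sqrt(...)
# Then, after d_td.copy_to_host(td), take one vectorized square root on the host:
td = np.sqrt(td)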
EDIT: By switching the above code examples from 128 threads per block to 64 threads per block, the codes will run noticeably faster. This is due to the increase in theoretical and achieved occupancy that this allows, given that shared memory is the occupancy limiter here.
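A rough back-of-the-envelope calculation behind that edit (my numbers, assuming float32/int32 shared arrays and a 48 KB-per-SM shared memory limit as an example; the actual limit varies by GPU):

def smem_bytes_per_block(threadsperblock, TN=32):
    # one float32 and one int32 shared-memory slot per (thread, list entry) pair
    return TN * threadsperblock * (4 + 4)

for tpb in (128, 64):
    per_block = smem_bytes_per_block(tpb)
    print(tpb, per_block, 48 * 1024 // per_block)  # 128 -> 32768 B -> 1 block/SM
                                                   #  64 -> 16384 B -> 3 blocks/SM

With 64 threads per block, three blocks (six warps) can be resident per SM instead of one block (four warps), which is where the occupancy gain comes from.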
As pointed out in the comments by max9111, a better algorithm exists. Here's a comparison (on a very slow GPU) between the above python code and the tree-based algorithm (borrowing from the answer here):
$ cat t27.py
import numpy as np
import numba as nb
import math
from numba import cuda, float32, int32
from scipy import spatial
import time

# Create some coordinates and indices
# It is assumed that the coordinates are unique (only one entry per hydrant)
ncities = 200064
Coords = np.random.rand(ncities*2).reshape(ncities,2)
Coords *= 100
Indices = np.arange(ncities)  # Indices
nnear = 33

def get_indices_of_nearest_neighbours(Coords, Indices):
    tree = spatial.cKDTree(Coords)
    # k=nnear because the first entry is the nearest neighbour
    # of a point with itself
    res = tree.query(Coords, k=nnear)[1][:,1:]
    return Indices[res]

start = time.time()
a = get_indices_of_nearest_neighbours(Coords, Indices)
end = time.time()
print (a.shape[0],a.shape[1])
print
print(end - start)

# number of cities, should be divisible by threadblock size
N = ncities
# number of "closest" cities
TN = nnear - 1
# number of threads per block
threadsperblock = 64
# shared memory size of each array in elements
smSize = TN*threadsperblock

@cuda.jit('int32(float32[:])', device=True)
def find_max(dist):
    idx = cuda.threadIdx.x
    mymax = 0
    maxd = dist[idx]
    for i in range(TN-1):
        idx += threadsperblock
        mynext = dist[idx]
        if maxd < mynext:
            mymax = i+1
            maxd = mynext
    return mymax

@cuda.jit('void(float32[:], float32[:], float32[:], int32[:], int32)')
def k(citiesx, citiesy, td, ti, nc):
    dist = cuda.shared.array(smSize, float32)
    city = cuda.shared.array(smSize, int32)
    idx = cuda.grid(1)
    tid = cuda.threadIdx.x
    wl = tid & 31
    my_city_x = citiesx[idx]
    my_city_y = citiesy[idx]
    last_max = 99999.0
    for i in range(TN):
        dist[tid+i*threadsperblock] = 99999.0
    for i in xrange(0, nc, 32):
        city_id = i+wl
        next_city_x = citiesx[city_id]
        next_city_y = citiesy[city_id]
        for j in range(32):
            if j > 0:
                src_lane = (wl+1) & 31
                city_id = cuda.shfl_sync(0xFFFFFFFF, city_id, src_lane)
                next_city_x = cuda.shfl_sync(0xFFFFFFFF, next_city_x, src_lane)
                next_city_y = cuda.shfl_sync(0xFFFFFFFF, next_city_y, src_lane)
            if idx != city_id:
                my_dist = (next_city_x - my_city_x)*(next_city_x - my_city_x) + (next_city_y - my_city_y)*(next_city_y - my_city_y)
                if my_dist < last_max:
                    mymax = find_max(dist)
                    last_max = dist[mymax*threadsperblock+tid]
                    if my_dist < last_max:
                        dist[mymax*threadsperblock+tid] = my_dist
                        city[mymax*threadsperblock+tid] = city_id
    for i in range(TN):
        mymax = find_max(dist)
        td[idx+i*nc] = math.sqrt(dist[mymax*threadsperblock+tid])
        ti[idx+i*nc] = city[mymax*threadsperblock+tid]
        dist[mymax*threadsperblock+tid] = 0

# perform test
cx = np.zeros(N).astype(np.float32)
cy = np.zeros(N).astype(np.float32)
for i in range(N):
    cx[i] = Coords[i,0]
    cy[i] = Coords[i,1]
td = np.zeros(N*TN, dtype=np.float32)
ti = np.zeros(N*TN, dtype=np.int32)
start = time.time()
d_cx = cuda.to_device(cx)
d_cy = cuda.to_device(cy)
d_td = cuda.device_array_like(td)
d_ti = cuda.device_array_like(ti)
k[N/threadsperblock, threadsperblock](d_cx, d_cy, d_td, d_ti, N)
d_td.copy_to_host(td)
d_ti.copy_to_host(ti)
end = time.time()
print(a)
print
print(end - start)
print(np.flip(np.transpose(ti.reshape(nnear-1,ncities)), 1))
$ python t27.py
(200064, 32)
1.32144594193
[[177220 25281 159413 ..., 133366 45179 27508]
[ 56956 163534 90723 ..., 135081 33025 167104]
[ 88708 42958 162851 ..., 115964 77180 31684]
...,
[186540 52500 124911 ..., 157102 83407 38152]
[151053 144837 34487 ..., 171148 37887 12591]
[ 13820 199968 88667 ..., 7241 172376 51969]]
44.1796119213
[[177220 25281 159413 ..., 133366 45179 27508]
[ 56956 163534 90723 ..., 135081 33025 167104]
[ 88708 42958 162851 ..., 115964 77180 31684]
...,
[186540 52500 124911 ..., 157102 83407 38152]
[151053 144837 34487 ..., 171148 37887 12591]
[ 13820 199968 88667 ..., 7241 172376 51969]]
$
On this very slow Quadro K2000 GPU, the CPU/scipy-based kd-tree algorithm is much faster than this GPU implementation. However, at ~1.3s, the scipy implementation is still a bit slower than the brute-force CUDA C++ code running on a Tesla P100, and faster GPUs exist today. A possible explanation for this disparity is given here. As pointed out there, the brute force approach is easily parallelizable, whereas the kd-tree approach may be non-trivial to parallelize. In any event, a faster solution yet might be a GPU-based kd-tree implementation.
I'm trying to apply parallelism to the following algorithm. This should be easily parallelizable, as the calculations are independent for the first three dimensions (b, i, j).
import numpy as np
cimport numpy as np
from cython.parallel import parallel, prange

ctypedef np.float64_t FLOAT64

def nb_forward(np.ndarray[FLOAT64, ndim=4] inputv, np.ndarray[FLOAT64, ndim=4] kernels, np.ndarray[FLOAT64, ndim=1] bias, tuple stride):
    cdef unsigned int kernel_size0 = kernels.shape[1], kernel_size1 = kernels.shape[2], \
        stride0 = stride[0], stride1 = stride[1], \
        num_dim = kernels.shape[0], \
        num_filters = kernels.shape[3], \
        batch_size = inputv.shape[0]

    cdef unsigned int out_size0 = (inputv.shape[1] - kernel_size0) / stride0 + 1, \
        out_size1 = (inputv.shape[2] - kernel_size1) / stride1 + 1

    cdef double[:, :, :, :] out = np.empty(shape=[batch_size, out_size0, out_size1, num_filters], dtype=np.float64)
    cdef unsigned int b, i, j, m, kw, kh, n
    cdef unsigned int iin, jin
    cdef double acc

    with nogil, parallel():
        for b in prange(batch_size):
            for i in range(out_size0):
                for j in range(out_size1):
                    iin = i*stride0
                    jin = j*stride1
                    for n in range(num_filters):
                        acc = 0.
                        for kw in range(kernel_size0):
                            for kh in range(kernel_size1):
                                for m in range(num_dim):
                                    acc += inputv[b, iin + kw, jin + kh, m] * kernels[m, kw, kh, n]
                        out[b, i, j, n] = acc + bias[n]
    return out
Error:
Cannot read reduction variable in loop body
Initially I tried to parallelize only at the level of b, since parallelizing over b, i, j works at the pixel level and I do not know whether it is worth generating that many threads. But I have not succeeded.
I tried to use a temporary array out_batch, but being a numpy array it gave me a lot of problems, ending in:
Error: malloc problems
I have also tried using typed memoryviews (double[:, :, :]) instead of numpy arrays, but that gives:
Error: Memoryview slices can only be shared in parallel sections
Does anyone have an idea? Is there any way to apply nogil at the level of b, i, j (or only b) and then compact the data?
Obviously the variable acc is shared between all threads, and thus race conditions can arise; Cython rightly doesn't let this code compile.
The variable acc shouldn't be shared between the threads, but be private to a thread. However, to my limited knowledge, there is no way to do that with Cython yet (not sure what happened with this proposal).
A usual workaround is to allocate a large enough working array tmp and to accumulate the value for the i-th thread in tmp[i]. Often enough (but not always) the arrays already at hand can be used for this purpose, as in your case: replacing acc with out[b, i, j, n] gives:
for n in range(num_filters):
    out[b, i, j, n] = 0.
    for kw in range(kernel_size0):
        for kh in range(kernel_size1):
            for m in range(num_dim):
                out[b, i, j, n] += inputv[b, iin + kw, jin + kh, m] * kernels[m, kw, kh, n]
    out[b, i, j, n] += bias[n]
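The same pattern carries over to Numba, which appeared earlier in this thread; as an illustrative sketch of mine (not the asker's Cython module): accumulating directly into the output slot keeps every parallel iteration writing to its own memory, so there is no shared reduction variable.

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def nb_forward_numba(inputv, kernels, bias, stride0, stride1):
    num_dim, k0, k1, num_filters = kernels.shape
    batch_size = inputv.shape[0]
    out0 = (inputv.shape[1] - k0) // stride0 + 1
    out1 = (inputv.shape[2] - k1) // stride1 + 1
    out = np.zeros((batch_size, out0, out1, num_filters))
    for b in prange(batch_size):          # parallel over the batch only
        for i in range(out0):
            for j in range(out1):
                for n in range(num_filters):
                    # accumulate into out[b, i, j, n]: private to this iteration
                    for kw in range(k0):
                        for kh in range(k1):
                            for m in range(num_dim):
                                out[b, i, j, n] += inputv[b, i*stride0 + kw, j*stride1 + kh, m] * kernels[m, kw, kh, n]
                    out[b, i, j, n] += bias[n]
    return out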
I'm trying to convert a Python module to Cython; it does a lot of serialization and deserialization work.
Currently I have to do this:
import struct
from libc.stdint cimport (
    int32_t,
    int64_t,
)

cpdef bytes write_int(int32_t i):
    return struct.pack("!i", i)

cpdef bytes write_long(int64_t i):
    return struct.pack("!q", i)

cdef bytes write_double(double val):
    return struct.pack("!d", val)

cdef bytes write_string(bytes val):
    cdef int32_t length = len(val)
    cdef str fmt
    fmt = "!i%ds" % length
    return struct.pack(fmt, length, val)
Is there an equivalent of struct.pack and struct.unpack in the C standard library? What's the best way to do things like this in Cython?
I looked at the modules (this and this) and just translated the code to Cython and removed the PyObject parts. In theory this should work, but some parts (like the float parts) I have no way of rigorously testing:
Some imports:
from cpython.array cimport array, clone
from libc.string cimport memcmp, memcpy
from libc.math cimport frexp, ldexp
from libc.stdint cimport int32_t, int64_t
Save some code with a fused type. It's technically not a stable feature, but it works flawlessly for me:
ctypedef fused integer:
    int32_t
    int64_t
This part tests the machine's endianness. It works for me, but that's hardly a complete test suite. OTOH, it looks about right.
cdef enum float_format_type:
    unknown_format,
    ieee_big_endian_format,
    ieee_little_endian_format

# Set-up
cdef array stringtemplate = array('B')
cdef float_format_type double_format
cdef double x = 9006104071832581.0

if sizeof(double) == 8:
    if memcmp(&x, b"\x43\x3f\xff\x01\x02\x03\x04\x05", 8) == 0:
        double_format = ieee_big_endian_format
    elif memcmp(&x, b"\x05\x04\x03\x02\x01\xff\x3f\x43", 8) == 0:
        double_format = ieee_little_endian_format
    else:
        double_format = unknown_format
else:
    double_format = unknown_format
(The stringtemplate is used to be able to make bytes objects quickly)
This part's simple:
cdef void _write_integer(integer x, char* output):
    cdef int i
    for i in range(sizeof(integer)-1, -1, -1):
        output[i] = <char>x
        x >>= 8

cpdef bytes write_int(int32_t i):
    cdef array output = clone(stringtemplate, sizeof(int32_t), False)
    _write_integer(i, output.data.as_chars)
    return output.data.as_chars[:sizeof(int32_t)]

cpdef bytes write_long(int64_t i):
    cdef array output = clone(stringtemplate, sizeof(int64_t), False)
    _write_integer(i, output.data.as_chars)
    return output.data.as_chars[:sizeof(int64_t)]
The array is similar to malloc but it's garbage collected :).
This part I mostly have no idea about. My "tests" passed, but it's mostly hope:
cdef void _write_double(double x, char* output):
    cdef:
        unsigned char sign
        int e
        double f
        unsigned int fhi, flo, i
        char *s

    if double_format == unknown_format or True:
        if x < 0:
            sign = 1
            x = -x
        else:
            sign = 0

        f = frexp(x, &e)

        # Normalize f to be in the range [1.0, 2.0)
        if 0.5 <= f < 1.0:
            f *= 2.0
            e -= 1
        elif f == 0.0:
            e = 0
        else:
            raise SystemError("frexp() result out of range")

        if e >= 1024:
            raise OverflowError("float too large to pack with d format")
        elif e < -1022:
            # Gradual underflow
            f = ldexp(f, 1022 + e)
            e = 0
        elif not (e == 0 and f == 0.0):
            e += 1023
            f -= 1.0  # Get rid of leading 1

        # fhi receives the high 28 bits; flo the low 24 bits (== 52 bits)
        f *= 2.0 ** 28
        fhi = <unsigned int>f  # Truncate
        assert fhi < 268435456
        f -= <double>fhi
        f *= 2.0 ** 24
        flo = <unsigned int>(f + 0.5)  # Round
        assert flo <= 16777216

        if flo >> 24:
            # The carry propagated out of a string of 24 1 bits.
            flo = 0
            fhi += 1
            if fhi >> 28:
                # And it also propagated out of the next 28 bits.
                fhi = 0
                e += 1
                if e >= 2047:
                    raise OverflowError("float too large to pack with d format")

        output[0] = (sign << 7) | (e >> 4)
        output[1] = <unsigned char> (((e & 0xF) << 4) | (fhi >> 24))
        output[2] = 0xFF & (fhi >> 16)
        output[3] = 0xFF & (fhi >> 8)
        output[4] = 0xFF & fhi
        output[5] = 0xFF & (flo >> 16)
        output[6] = 0xFF & (flo >> 8)
        output[7] = 0xFF & flo
    else:
        s = <char*>&x
        if double_format == ieee_little_endian_format:
            for i in range(8):
                output[i] = s[7-i]
        else:
            for i in range(8):
                output[i] = s[i]
If you can understand how it works, be sure to check it yourself.
Then we wrap it as before:
cdef bytes write_double(double x):
    cdef array output = clone(stringtemplate, sizeof(double), False)
    _write_double(x, output.data.as_chars)
    return output.data.as_chars[:sizeof(double)]
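A quick sanity check one could run after compiling (my sketch; it assumes write_double is declared cpdef, or otherwise re-exported, so it is callable from Python, and a hypothetical module name fastpack):

import random
import struct
import fastpack  # hypothetical name of the compiled module

for _ in range(100000):
    x = random.uniform(-1e308, 1e308)
    # the hand-rolled packer should match struct.pack's big-endian double
    assert fastpack.write_double(x) == struct.pack("!d", x), x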
The string one is actually really simple, and explains why I set it up as I did above:
cdef bytes write_string(bytes val):
    cdef:
        int32_t int_length = sizeof(int32_t)
        int32_t input_length = len(val)
        array output = clone(stringtemplate, int_length + input_length, True)
    _write_integer(input_length, output.data.as_chars)
    memcpy(output.data.as_chars + int_length, <char*>val, input_length)
    return output.data.as_chars[:int_length + input_length]
If you're only packing one type of data per command (e.g. a group of ints, then a group of floats, etc.), you can use array.array() for faster results, either via Python or Cython.
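For instance, a pure-Python sketch of that array.array() route for a group of ints, assuming you want network (big-endian) byte order like the "!i" format above:

import sys
from array import array

values = [1, 2, 3, 4]
a = array('i', values)        # C int; check a.itemsize == 4 on your platform
if sys.byteorder == 'little':
    a.byteswap()              # flip to big-endian (network order)
payload = a.tobytes()         # one buffer, no per-value struct.pack calls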
Source:
Serialize a group of integers using Cython