PyTorch C++ extension: How to index tensor and update it? - python

I'm creating a PyTorch C++ extension and after much research I can't figure out how to index a tensor and update its values. I found out how to iterate over a tensor's entries using the data_ptr() method, but that's not applicable to my use case.
Given is a matrix M, a list of lists (blocks) of index pairs P and a function f: dtype(M)^2 -> dtype(M)^2 that takes two values and spits out two new values.
I'm trying to implement the following pseudo code:
for each block B in P:
    for each row R in M:
        for each index-pair (i,j) in B:
            M[R,i], M[R,j] = f(M[R,i], M[R,j])
Ultimately, this code is going to run on the GPU using CUDA, but since I don't have any experience with that, I wanted to write a pure C++ program first and then convert it.
Can anyone suggest how to do this or how to convert the algorithm to do something equivalent?

What I wanted to do can be done using the
tensor.accessor<scalar_dtype, num_dimensions>()
method. If executing on the GPU, instead use tensor.packed_accessor64<scalar_dtype, num_dimensions, torch::RestrictPtrTraits>()
or
tensor.packed_accessor32<scalar_dtype, num_dimensions, torch::RestrictPtrTraits>() (depending on the size of your tensor; the 32-bit variant only fits tensors that can be indexed with 32-bit integers).
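auto matrix = torch::rand({10, 8});    // float32 tensor on the CPU
auto num_rows = matrix.size(0);
auto a = matrix.accessor<float, 2>();  // scalar type and rank must match the tensor
for (int64_t i = 0; i < num_rows; ++i) {
    auto x = a[i][some_index];         // read one entry (some_index is a placeholder)
    auto new_x = some_function(x);     // some_function is a placeholder as well
    a[i][some_index] = new_x;          // write the result back in place
}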
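Before porting the loop to C++ or CUDA, it can help to have a small reference implementation to compare results against. The following is only a sketch of the pseudocode from the question in plain Python, using nested lists instead of tensors; the names apply_blocks and sum_diff, and the sample data, are made up for illustration.

```python
def apply_blocks(M, P, f):
    """Apply f to pairs of columns in every row of M, block by block, in place."""
    for block in P:
        for row in M:
            for i, j in block:
                row[i], row[j] = f(row[i], row[j])
    return M


def sum_diff(a, b):
    # Hypothetical f: any function mapping two values to two values would do.
    return a + b, a - b


M = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
P = [[(0, 1), (2, 3)], [(1, 2)]]  # two blocks of index pairs

apply_blocks(M, P, sum_diff)
print(M)
```

Each row is updated independently of the others, which is why the row loop is the natural candidate for parallelisation once the code moves to the GPU.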

Related

Assigned a complex value in cupy RawKernel

I am a beginner learning how to exploit the GPU for parallel computation using Python and CuPy. I would like to implement code that simulates some problems in physics and requires complex numbers, but I don't know how to handle them. Although there are examples in CuPy's official documentation, they only mention including the complex.cuh library and how to declare a complex variable. I can't find any example of how to assign a complex number correctly, or of how to call the functions in the complex.cuh library to do calculations.
I am stuck at the line of this code that constructs the complex value. I want to make a complex number equal to x[tId_x]+j*y[tId_y], where j is the imaginary unit. I tried several ways and none of them worked, so I left this one here.
import cupy as cp
import time
add_kernel = cp.RawKernel(r'''
#include <cupy/complex.cuh>
extern "C" __global__
void test(double* x, double* y, complex<float>* z){
    int tId_x = blockDim.x*blockIdx.x + threadIdx.x;
    int tId_y = blockDim.y*blockIdx.y + threadIdx.y;
    complex<float>* value = complex(x[tId_x],y[tId_y]);
    z[tId_x*blockDim.y*gridDim.y+tId_y] = value;
}''',"test")
x = cp.random.rand(1,8,4096,dtype = cp.float32)
y = cp.random.rand(1,8,4096,dtype = cp.float32)
z = cp.zeros((4096,4096), dtype = cp.complex64)
t1 = time.time()
add_kernel((128,128),(32,32),(x,y,z))
print(time.time()-t1)
What is the proper way to assign a complex number in the RawKernel?
Thank you for answering this question!
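#plaeonix, thank you very much for your hint. I found the answer.
This line:
complex<float>* value = complex(x[tId_x],y[tId_y])
should be replaced with:
complex<float> value = complex<float>(x[tId_x],y[tId_y])
The original declared a pointer to complex<float> instead of a value and left out the template argument on the constructor. With both fixed, the assignment of a complex number works.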
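To see which element of z each thread writes, the kernel's index arithmetic can be mirrored on the host. This is only a pure-Python sketch, not CuPy API; the helper name flat_index is made up, and the grid and block sizes are the ones used in the launch above.

```python
def flat_index(block_idx, thread_idx, block_dim, grid_dim):
    """Mirror the kernel's index arithmetic for one thread; arguments are (x, y) tuples."""
    t_x = block_dim[0] * block_idx[0] + thread_idx[0]
    t_y = block_dim[1] * block_idx[1] + thread_idx[1]
    width = block_dim[1] * grid_dim[1]  # total number of threads along y
    return t_x, t_y, t_x * width + t_y


grid, block = (128, 128), (32, 32)

# The last thread of the last block should land on the last element of z.
t_x, t_y, idx = flat_index((127, 127), (31, 31), block, grid)
print(t_x, t_y, idx)

# Python's built-in complex(real, imag) plays the role of complex<float>(re, im).
value = complex(0.25, 0.5)
print(value.real, value.imag)
```

With a (128,128) grid of (32,32) blocks, the flat index runs over exactly the 4096*4096 elements of z. Separately, note that x and y are created as float32 while the kernel signature declares double*, so the pointer types in the kernel probably need to be float* for the values to be read correctly.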

Access Cell Data from Paraview Programmable Filter

I need to create a programmable filter using Paraview.
The idea is to create a vector called Speed that is equal to the speed in the non-rotating part and equal to the speed plus the rotational speed in the rotating part.
The problem is that I can't access the value of the speed in each single cell.
input0 = inputs[0]
radius=3
Speed1=input0.PointData["U"]
K=vtk.vtkDoubleArray()
X=input0.PointData["X"]
Y=input0.PointData["Y"]
Z=input0.PointData["Z"]
pdi = self.GetInput()
numPts = pdi.GetNumberOfPoints()
for i in range(0, numPts):
    if X.getvalue(i)^2+Y.getvalue(i)^2<radius:
        temp=U.getvalue(i)
    else:
        temp=U.getvalue(i)+rot
    Speed.InsertNextValue(1)
output.PointData.append(Speed, "Speed")
The problem is that X.getvalue(i) is not working.
The correct syntax is
X[i]
The documentation can be found here
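As a rough illustration of element access with X[i], here is the selection logic from the question written against plain Python lists, which stand in for the point-data arrays. The function name combined_speed and the sample values are hypothetical. The sketch also assumes that the radius test was meant to compare squared distances, and it uses ** because ^ is bitwise XOR in Python.

```python
def combined_speed(xs, ys, us, radius, rot):
    """Return u inside the circle of the given radius, and u + rot outside it."""
    speed = []
    for i in range(len(us)):
        # Assumption: compare the squared distance with the squared radius.
        if xs[i] ** 2 + ys[i] ** 2 < radius ** 2:
            speed.append(us[i])
        else:
            speed.append(us[i] + rot)
    return speed


# Hypothetical point data: three points with a scalar speed each.
xs = [0.0, 1.0, 4.0]
ys = [0.0, 2.0, 0.0]
us = [10.0, 20.0, 30.0]

result = combined_speed(xs, ys, us, radius=3, rot=5.0)
print(result)
```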

How to index an array value in a MATLAB-Function in Simulink?

I'm using a MATLAB Function block in Simulink to call a Python script that does some calculations from the input values. The Python script returns a string to the MATLAB function, which I split into an array. The split string always has to be a cell array with 6 variable strings:
dataStringArray = '[[-5.01 0.09785429][-8.01 0.01284927]...' '10.0' '20.0' '80.0' '80.0' '50.0'
To call functions like strsplit or the Python script itself with a specific m-file, I'm using the coder.extrinsic('*') method.
Now I want to index a specific value, for example with dataStringArray(3) to get '20.0', and define it as an output value of the MATLAB function, but this doesn't work! I tried to predefine dataStringArray with dataStringArray = cell(1,6); but I always get the same 4 errors:
Subscripting into an mxArray is not supported.
Function 'MATLAB Function' (#23.1671.1689), line 42, column 24:
"dataStringArray(3)"
2x Errors occurred during parsing of MATLAB function 'MATLAB Function'
Error in port widths or dimensions. Output port 1 of 's_function_Matlab/MATLAB Function/constIn5' is a one dimensional vector with 1 elements.
What am I doing wrong?
SAMPLE CODE
The commented code behind the output definitions is what I need:
function [dataArrayOutput, constOut1, constOut2, constOut3, constOut4, constOut5] = fcn(dataArrayInput, constIn1, constIn2, constIn3, constIn4, constIn5)
coder.extrinsic('strsplit');
% Python-Script String Output
pythonScriptOutputString = '[[-5.01 0.088068861]; [-4.96 0.0]]|10.0|20.0|80.0|80.0|50.0';
dataStringArray = strsplit(pythonScriptOutputString, '|');
% Outputs
dataArrayOutput = dataArrayInput; % str2num(char((dataStringArray(1))));
constOut1 = constIn1; % str2double(dataStringArray(2));
constOut2 = constIn2; % str2double(dataStringArray(3));
constOut3 = constIn3; % str2double(dataStringArray(4));
constOut4 = constIn4; % str2double(dataStringArray(5));
constOut5 = constIn5; % str2double(dataStringArray(6));
SOLUTION 1
Cell arrays are not supported in MATLAB Function blocks; only the native Simulink data types are possible.
A workaround is to define the whole code as a normal function and execute it from the MATLAB Function block, declared as extrinsic. It's important to initialize the output variables with a known type and size before executing the extrinsic function.
SOLUTION 2
Another solution is to use the strfind function, which returns a double matrix with the positions of the separator character. With that, you can return just the range of character positions that you need. In this case, your whole code stays inside the MATLAB Function block.
function [dataArrayOutput, constOut1, constOut2, constOut3, constOut4, constOut5] = fcn(dataArrayInput, constIn1, constIn2, constIn3, constIn4, constIn5)
coder.extrinsic('strsplit', 'str2num');
% Python-Script String Output
pythonScriptOutputString = '[[-5.01 0.088068861]; [-4.96 0.0]; [-1.01 7.088068861]]|10.0|20.0|80.0|80.0|50.0';
dataStringArray = strfind(pythonScriptOutputString,'|');
% preallocate
dataArrayOutput = zeros(3, 2);
constOut1 = 0;
constOut2 = 0;
constOut3 = 0;
constOut4 = 0;
constOut5 = 0;
% Outputs
dataArrayOutput = str2num(pythonScriptOutputString(1:dataStringArray(1)-1));
constOut1 = str2num(pythonScriptOutputString(dataStringArray(1)+1:dataStringArray(2)-1));
constOut2 = str2num(pythonScriptOutputString(dataStringArray(2)+1:dataStringArray(3)-1));
constOut3 = str2num(pythonScriptOutputString(dataStringArray(3)+1:dataStringArray(4)-1));
constOut4 = str2num(pythonScriptOutputString(dataStringArray(4)+1:dataStringArray(5)-1));
constOut5 = str2num(pythonScriptOutputString(dataStringArray(5)+1:end));
When using an extrinsic function, the data type returned is mxArray, which you cannot index into, as the error message says. To work around this problem, you first need to initialise the variable(s) of interest so that the result is cast to the right data type (e.g. double). See Working with mxArrays in the documentation for examples of how to do that.
The second part of the error message concerns port dimensions. Without seeing the code of the function, the Simulink model and how the inputs/outputs of the function are defined, it's difficult to tell what's going on, but you need to make sure you have the correct size and data type defined in the Ports and Data manager.

Rewriting this python function in c++ seems to make it run a lot slower. Is this reasonable?

I'm implementing some graph traversal functions in python but I need better performance so I decided to try to rewrite the functions in c++, but they seem to run slower. I'm a c++ beginner so I'm not sure if this is expected behavior.
The following Python function implements a Breadth-First Search on an unweighted graph. Its objective is to visit every vertex once and measure how many hops away each vertex is from the source.
graph is dict {vertex : set(neighbor1, neighbor2 ... , neighbor n) }
return is dict {vertex : distance_to_source}
def shortest_path_lengths(graph,source):
    seen={}
    level=0
    nextlevel={source}
    while nextlevel:
        thislevel=nextlevel
        nextlevel=set()
        for v in thislevel:
            if v not in seen:
                seen[v]=level
                nextlevel.update(graph[v])
        level=level+1
    return seen
And runs:
%timeit seen = shortest_path_lengths(G,0)
10 loops, best of 3: 79.7 ms per loop
For my c++ implementation:
graph is map< long vertex, set < long > vertex neighbors >
return is map < long vertex ,int distance_from_source >
map<long,int> spl(graph G, long source)
{
    int level = 0;
    map<long, int> seen;
    set<long> nextlevel;
    set<long> thislevel;
    nextlevel.insert(source);
    while (! nextlevel.empty())
    {
        thislevel = nextlevel;
        nextlevel.clear();
        for (auto it = thislevel.begin(); it != thislevel.end(); ++it)
        {
            if (! seen.count(*it))
            {
                seen[*it] = level;
                //cout << G[*it];
                nextlevel.insert(G[*it].begin(), G[*it].end());
            }
        }
        level++;
    }
    return seen;
}
and I measure its execution time with:
clock_t begin = clock();
seen = spl(graph1,0);
clock_t end = clock();
double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
printf("\nTime taken: %.8fs\n", elapsed_secs);
and get output:
Time taken: 0.38512900s
which is almost 5 times slower than the python implementation for the same graph. Seeing as I'm a beginner in c++, I'm not really sure if I'm measuring time wrong, I'm implementing something wrong, or even if this is expected behavior.
EDIT:
After converting maps into unordered_maps, using the -O2 -flto compile parameters, and passing the graph by const reference, the run time of the C++ function for a graph of size 65k drops to 0.09s, which is still a bit slower than Python's 0.08s for the same graph.
On a bigger graph of 75k nodes (but over twice as many edges), c++ falls further behind at 0.3s to python's 0.2s
EDIT2:
After changing the nested set inside the map to unordered_set as well, and changing the thislevel/nextlevel sets also to unordered_sets, the c++ code beats the python code on the smaller graph ( 0.063 to 0.081 sec) but only matches it on the bigger one (0.2 to 0.2)
EDIT3:
On an even bigger graph (85k nodes, over 1.5m edges), python needs 0.9sec for the operation, while the C++ code needs 0.75s
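The first thing that jumped out at me is that you used hash-based containers in Python, which is what dictionaries and sets are, and tree-based containers in C++, which is what std::map and std::set are. The C++ equivalents are unordered_map and unordered_set.
thislevel = nextlevel;
In C++, this makes a full copy of the set, whereas in Python the same statement only rebinds a reference. You should rather use pointers to the sets and swap the pointers instead of copying the sets; std::swap(thislevel, nextlevel) followed by nextlevel.clear() has the same effect without the copy.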
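When comparing a C++ port against the Python original, a tiny graph whose distances can be checked by hand is a useful first test before timing anything. The sketch below repeats the Python function from the question and runs it on a made-up 5-vertex graph; the graph itself is purely illustrative.

```python
def shortest_path_lengths(graph, source):
    """Breadth-first search returning {vertex: hops from source}."""
    seen = {}
    level = 0
    nextlevel = {source}
    while nextlevel:
        thislevel = nextlevel      # rebinds a reference, no copy is made
        nextlevel = set()
        for v in thislevel:
            if v not in seen:
                seen[v] = level
                nextlevel.update(graph[v])
        level = level + 1
    return seen


# Hypothetical graph: 0 connects to 1 and 2, both connect to 3, and 3 connects to 4.
G = {0: {1, 2}, 1: {0, 3}, 2: {0, 3}, 3: {1, 2, 4}, 4: {3}}

distances = shortest_path_lengths(G, 0)
print(distances)
```

Feeding the same graph to the C++ function and comparing the two results separates correctness problems from performance problems.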

Python nest list performance choice

I am trying to understand if there is an advantage in space/time/programming to storing data from a signal processing system as nested list in either :
1) data[channel][sample]
2) data[sample][channel]
I can code processing for both, though I personally find 1) easier to write and index than 2).
However, 2) is the more common way my local group programs and stores the data (either in Excel/CSV or from the data gathering systems). While it is easy to transpose
dataA = map(list, zip(*dataB))
I was wondering if there are any storage or performance - or even - module compatibility issues with 1 over 2?
with 1) I can loop like this
for R in dataA :
    for C in R :
        process_channel(C)
matplotlib.loglog(dataA[0], dataA[i])
where dataA[0] is time or frequency and i is some other channel to plot
with 2)
for R in dataB :
    for C in R :
        process_sample(C)
matplotlib.loglog([j[0] for j in dataB],[k[i] for k in dataB])
This looks worse in programming style. Maybe I am missing a list method that makes this easier? I have also developed code that uses dicts, but this really breaks with general use, so I am less inclined to continue using dicts. The dict storage is
dataC = [{'f':0.1,'chnl1':100.0},{'f':0.2,'chnl1':110.0}]
or some such. It seems that option 2 is the better integrated one. However, I am trying to understand how best to code with option 2) when you wish to process over channels and then samples. Just transpose the matrix first, do the work in option 1) space, and transpose the results back?
import numpy
def smoothing(d, s) :
    # transpose [sample][channel] into [channel][sample]
    td = map(list, zip(*d))
    nd=[]
    for row in td :
        col = []
        # average consecutive blocks of s samples
        for i in xrange(0,len(row)-s,s) :
            col.append(sum(row[i:i+s])/s)
        nd.append(col)
    # transpose the result back to [sample][channel]
    nd = numpy.transpose(nd)
    return nd
dataA = smoothing(dataA, smooth_factor)
while this construction works - transposing back and forth all the time looks - um - inefficient.
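One way to keep the data in [sample][channel] layout and still work per channel is to transpose once with zip(*data) and process the resulting columns. The following is a minimal Python 3 sketch of that idea; the helper names columns and block_means and the sample values are made up. Unlike the loop above, block_means also keeps the last full block.

```python
def columns(data_by_sample):
    """Transpose [sample][channel] rows into one tuple per channel."""
    return list(zip(*data_by_sample))


def block_means(values, step):
    """Average consecutive, non-overlapping blocks of `step` values."""
    return [sum(values[i:i + step]) / step
            for i in range(0, len(values) - step + 1, step)]


# Hypothetical data: 4 samples, each [frequency, channel 1].
dataB = [[0.1, 100.0], [0.2, 110.0], [0.3, 120.0], [0.4, 130.0]]

freq, chnl1 = columns(dataB)          # column access without index loops
smoothed = [block_means(ch, 2) for ch in columns(dataB)]
print(freq)
print(smoothed)
```

The unpacked columns can be passed straight to a plotting call, which avoids the two list comprehensions needed with option 2). If the smoothed result is needed in [sample][channel] layout again, a second zip(*smoothed) transposes it back.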
