Is there a smart way to parallelize complex functions on ndarray? - python

There is a wide range of possibilities in Python to give your code a performance boost (e.g. broadcasting, or packages like numba). But as far as I know, these methods rely on the code being basic, in the sense that e.g. numpy.ndarray or functions of numpy.linalg are used.
In my particular case I'm using statsmodels ThetaModel to forecast (many!) time series which are grouped in an ndarray.
Is there any smart way to boost code performance/parallelize the code?
For the moment I'm using list comprehension.
(Simplified) Working Example
import numpy as np
from statsmodels.tsa.forecasting.theta import ThetaModel

def thetaForecast(series):
    model = ThetaModel(series, period=50, deseasonalize=True, use_test=False).fit()
    forecast = model.forecast(steps=len(series))
    return forecast

data = np.random.randn(500, 10)  # 10 time series, each of length 500 (dimensions reduced here for simplification)
dataForecast = np.array([thetaForecast(col) for col in data.transpose()])
Just in case it plays a role: my function thetaForecast actually needs more than one parameter, in contrast to this slightly simplified version.
PS: I'm not an experienced stackoverflow user. Tips for how to improve my question are welcome :)

Have you tried using multiprocessing? Unless ThetaModel.forecast() releases the GIL (which it could, if it's implemented in C or Fortran), multiprocessing is the main way you can parallelize this.
Or of course you could reimplement forecast() yourself in Numba, C, C++, or Fortran and release the GIL yourself; then you could use multiple threads in a single process.
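For illustration, here is a minimal sketch of the multiprocessing route (the thetaForecast from the question, fanned out over the columns with Pool.map; each column and each returned forecast gets pickled, so the benefit depends on how long a single fit takes relative to that overhead):

import numpy as np
from multiprocessing import Pool
from statsmodels.tsa.forecasting.theta import ThetaModel

def thetaForecast(series):
    model = ThetaModel(series, period=50, deseasonalize=True, use_test=False).fit()
    return model.forecast(steps=len(series))

if __name__ == "__main__":
    data = np.random.randn(500, 10)
    with Pool() as pool:  # one worker process per CPU core by default
        dataForecast = np.array(pool.map(thetaForecast, data.transpose()))

If thetaForecast needs extra parameters, functools.partial (or Pool.starmap) can bind them before mapping.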

Related

How to track the "calling chain" from numpy to C implementation?

I have read the tutorial and API guide of Numpy, and from this helpful documentation I learned how to extend Numpy with my own C code and how to call Numpy functions from C.
However, what I really want to know is: how can I track the calling chain from Python code down to the C implementation? In other words, how can I find out which part of the C implementation corresponds to this simple numpy array addition?
x = np.array([1, 2, 3])
y = np.array([1, 2, 3])
print(x + y)
Can I use a tool like gdb to track its stack frames step by step?
Or can I recognize the corresponding code directly from the variable naming policy? (For example, if I want to know the code for addition, I could search for a function named something like PyNumpyArrayAdd(...).)
[EDIT] I found a very useful video about how to locate the C implementation of these basic C-implemented functions and operator overloads like "+" and "-".
https://www.youtube.com/watch?v=mTWpBf1zewc
Got this from Andras Deak via the Numpy mailing list.
[EDIT2] There is another way to track all the functions called in Numpy, using gdb. It is very heavy-handed, because it will display every Numpy function that gets called, including trivial ones, and it may take some time.
First you need to download/clone the Numpy repository into your own workspace and then compile it with the -g option, which attaches debug information for debugging.
Then you open a terminal in the "path/to/numpy-main" directory where Numpy's setup.py lies, and run gdb.
If you want to know which functions in Numpy's C implementation are called by this single Python statement:
y = np.exp(x)
you can set breakpoints on all the functions implemented by Numpy using the gdb Python script provided by the first answer here:
Can gdb set break at every function inside a directory?
Once you load this script via source somename.py, you can run this command in gdb: rbreak-dir numpy/core/src
And you can set commands for each breakpoint:
commands 1-5004
> silent
> bt 1
> c
> end
(here 1-5004 is the range of breakpoints that you want to run the commands on)
Once a breakpoint is hit, these commands run silently, print the first frame of the backtrace (i.e. the info of the current function), and then continue. In this way, you can track all the functions called in Numpy.
[screenshot of the resulting gdb backtrace output from my working environment]
Hope my trials can help future readers.
However, what I really want to know is: how can I track the calling chain from Python code down to the C implementation? In other words, how can I find out which part of the C implementation corresponds to this simple numpy array addition?
AFAIK, there are two main ways to do that: using a debugger, or tracking the function down in the code (typically by reading the wrapping part or by searching for keywords in numpy/core/src/XXX/). Numpy has different kinds of functions. Some focus on the CPython interaction part (e.g. type checking, array creation, generic iterators, etc.) and some focus on the computing part (doing the computation efficiently). Depending on what you want, different files need to be inspected. core/src/umath/loops.c.src is the place to go for the core computing functions doing basic independent math operations.
Can I use a tool like gdb to track its stack frames step by step?
Using a debugger is the common way to go unless you are familiar with the Numpy code base. You can try to find the Numpy entry-point function by reading the wrapper code, but I think that is a bit difficult, as this part of the code is not very readable (much of it is generated, presumably to ease development and avoid mistakes). The hard part with GDB is finding the first entry point of the function in Numpy: the CPython interpreter function calls are hard to track, as there are many of them (sometimes called recursively), and the call stack is quite big and far from clear (i.e. there is no obvious information about the actual statement/expression being executed). That being said, AFAIR, the entry point is often something like PyArray_XXX or array_XXX. You can also break on the first function that executes code of the Numpy library.
Or can I recognize the corresponding code directly from the variable naming policy?
Some functions have a standardized name, typically PyArray_XXX. That being said, the core computing functions generally do not: they have names generated by a template system that parses comments and annotations and generates code from them. For adding two arrays, the main computing function should be, for example, #TYPE#_add#isa#, where #TYPE# is e.g. INT or LONG depending on your target platform. There is a special version (i.e. a specialization) for floating-point numbers that uses an optimized pair-wise summation for the sake of accuracy. This kind of naming convention is quite common, though, so you can search for _add in the code, or for a begin repeat section with add as a kind parameter.
Related post: Numpy argmax source

Python slow on for-loops and hundreds of attribute lookups. Use Numba?

I am working on a simple showcase SPH (smoothed particle hydrodynamics, not relevant here though) implementation in Python. The code works, but the execution is kinda sluggish. I often have to compare individual particles with a certain number of neighbours. In an earlier implementation I kept all particle positions and all distances-to-each-existing-particle in large numpy arrays -> up to a certain point this was pretty fast. But it was visually unappealing and O(n**2). Now I want it clean and simple with classes + a kdTree to speed up the neighbour search.
This all happens in my global Simulation class. Additionally, there's a class called "particle" that contains all the individual information. I create hundreds of instances beforehand and loop through them.
def calculate_density(self):
    # Using scipy's advanced nearest-neighbour search magic
    tree = scipy.spatial.KDTree(self.particle_positions)
    # Here we go... loop through all existing particles, set attributes...
    for particle in self.my_particles:
        # get the indices of the nearest neighbours
        particle.index_neighbours = tree.query_ball_point(particle.position, self.h, p=2)
        # now loop through the list of neighbours and perform some additional math
        particle.density = 0
        for neighbour in particle.index_neighbours:
            r = np.linalg.norm(particle.position - self.my_particles[neighbour].position)
            particle.density += particle.mass * (315 / (64 * math.pi * self.h**9)) * (self.h**2 - r**2)**3
I timed 0.2717630863189697 s for only 216 particles.
Now I wonder: what can I do to speed it up?
Most tools online like Numba show how they speed up math-heavy individual functions. I don't know which to choose. On a side note, I cannot even get Numba to work in this case; I get a looong error message. And I had hoped it would be as simple as slapping @jit in front of the function.
I know it's the loops with the attribute calls that crush my performance anyway - not the math or the neighbour search. Sadly I am a novice to programming and I liked the clean approach I got to work here :( Any thoughts?
These kinds of loop-intensive calculations are slow in Python. In these cases, the first thing you want to do is to see if you can vectorize the operations and get rid of the loops. Then the actual calculations are done in C or Fortran libraries and you get a lot of speed-up. If you can do it, this is usually the way to go, since it is much easier to maintain your code.
Some operations, however, are just inherently loop-intensive. In these cases using Cython will help you a lot - you can usually expect a 60X+ speed-up when you cythonize your loop. I have also had similar experiences with numba - when my function became complicated, it failed to make it faster, so usually I just use Cython.
Coding in Cython is not too bad - much easier than actually coding in C, because you can access numpy arrays easily via memoryviews. Another advantage is that it is pretty easy to parallelize the loop with OpenMP, which can give you an additional 4X+ speed-up (depending, of course, on the number of cores in your machine), so your code can end up hundreds of times faster.
One issue is that to get the optimal speed, you have to remove all the Python calls inside your loop, which means you cannot call numpy/scipy functions. So you would have to convert the tree.query_ball_point and np.linalg.norm parts to Cython for optimal speed.
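As an illustration of the vectorization route mentioned above, here is a minimal sketch (not the asker's actual code; positions and masses are assumed to live in flat numpy arrays) that replaces the per-particle attribute lookups with whole-array operations over each neighbour list:

import numpy as np
import scipy.spatial

def calculate_density(positions, masses, h):
    # positions: (n, dim) array, masses: (n,) array
    tree = scipy.spatial.KDTree(positions)
    neighbours = tree.query_ball_point(positions, h, p=2)  # one batched query
    coeff = 315.0 / (64.0 * np.pi * h**9)
    density = np.empty(len(positions))
    for i, idx in enumerate(neighbours):
        # squared distances to all neighbours in one vectorized step
        r2 = np.sum((positions[idx] - positions[i])**2, axis=1)
        density[i] = masses[i] * coeff * np.sum((h**2 - r2)**3)
    return density

The outer loop remains, but the per-neighbour math is now a handful of array operations, which is usually where most of the time went.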

Truly vectorized routines in python?

Are there really good methods in Python to vectorize operations on matrix-like data constructs/containers? What are the corresponding data constructs used?
(I have observed and read that pandas and numpy element-wise operations using vectorize or applymap (this may also be the case for apply/apply along axis for rows/columns) are not much of a speed improvement over for loops.
Given that, when trying to use them, you sometimes have to mess with the specifics of the datatypes, when it is usually a little bit easier in for loops, what are the benefits? Readability?)
Are there ways to achieve a gap of performance similar to what happens in Matlab when comparing for loops and vectorized operations?
(Note this is not to bash numpy or pandas - they are great, and whole-matrix operations are fine; it is just that when you have to do element-wise operations, things become slow.)
EDIT, to explain the context:
I was only wondering because I have received, more than once, answers mentioning the fact that apply and so on are actually similar to for loops. That's why I was wondering whether there were similar functions implemented in such a way that they would perform better. The actual problems were varied; they just had to be element-wise, actually, not "doing the sum, product, whatever of the whole matrix". I did a lot of comparisons with differential outputs, sometimes based on other matrices, so I had to use complex functions for this. But since the matrices are huge and the implementation depended on for-loop-like mechanisms, in the end I felt that my program would not work well on a larger dataset. Hence my question. But I was not looking for review, only knowledge.
You need to provide a specific example.
Normal per-element MATLAB or Python functions cannot be vectorized in general. The whole point of vectorizing, in both MATLAB and Python, is to off-load the operation onto the much faster underlying C or Fortran libraries that are designed to work on arrays of uniform data. This cannot be done on functions that operate on scalars, either in MATLAB or Python.
For functions that operate on arrays or matrices as a whole (such as mathematical operators, sum, square, etc), MATLAB and Python behave the same. In fact they use most of the same underlying C and Fortran libraries to do their calculations.
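To make this concrete, here is a minimal illustration (with a made-up element-wise operation) of what off-loading the loop onto the array library looks like:

import numpy as np

x = np.random.rand(1_000_000)

# loop version: every element passes through the Python interpreter
y_loop = np.empty_like(x)
for i in range(x.size):
    y_loop[i] = x[i]**2 + 1.0

# vectorized version: one expression, the loop runs inside NumPy's C code
y_vec = x**2 + 1.0

Both produce the same array, but the vectorized form is typically orders of magnitude faster, because the per-element dispatch happens in compiled code.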
So you need to show the actual operation you want to do, and then we can see if there is a way to vectorize it.
If it is working code and you just want to improve its performance, then the Code Review Stack Exchange site is probably a better choice.

Python 3: Parallel diagonalization of multiple matrices

I am trying to improve the performance of some code of mine that first constructs a 4x4 matrix depending on two indices, diagonalizes this matrix, and then stores the eigenvectors of each diagonalization in a 4-dimensional array. At the moment I am just going through all the indices serially and then storing the eigenvectors in their place in the 4-dimensional array. Now I am wondering whether it is possible to parallelize this a little bit by using threading or something similar, such that each thread would diagonalize one matrix and then store it in its place. The problem I have is: what are my limitations in doing this? Would I run into problems when different threads want to write into the resulting 4-dim. array at the same time, and do I have to use a lock in order to prevent this? I am sorry if this question is trivial, but by searching I was not able to find anything related, and my knowledge about threading is very limited. A minimal example would be
import numpy as np
from numpy.linalg import eigh as eigh2

L = 10  # system size; defined elsewhere in the full code
spectrum = np.zeros([L//2, L//2, 4, 4], complex)
for i in range(0, L//2):
    for j in range(0, L//2):
        k = [-(2 * i * 2 * np.pi / L), -(2 * j * 2 * np.pi / L)]
        H = np.ones([4, 4], complex)
        energies, states = eigh2(H)
        spectrum[i, j, :, :] = states
Note that, for the sake of brevity, I have exchanged the function that constructs the matrix in dependence of k for a constant matrix.
I would really appreciate any help or pointers to resources on how I could implement some parallelization. Is threading a realistic way of improving the performance?
The short answer is that yes, you probably need locks—but if you can reorganize your problem, that may be a lot better than locking.
The long answer is a bit involved, especially since I don't know how much you already know.
In general, threading doesn't do much good in CPython for CPU-bound code, because of the Global Interpreter Lock, which prevents any threads from interpreting a line (actually, bytecode) of Python if another thread is in the middle of doing so. However, NumPy has code that specifically releases the GIL in certain places to allow threading to work better, so if you're CPU-bound within low-level NumPy algorithms, threading actually can work. The docs are not always clear about which functions do this and which don't, so you may have to test it yourself just to find out if parallelizing will help here. (A quick&dirty way to do this is to hack up a version of your code that just does the computations without storing them anywhere, run it across N threads, and see how many cores are busy while you do it.)
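For instance, a quick-and-dirty scaffold in that spirit might look like this (np.linalg.eigh stands in for your real computation; results are thrown away on purpose):

import threading
import numpy as np

def burn(n_iter):
    H = np.ones((4, 4), complex)
    for _ in range(n_iter):
        np.linalg.eigh(H)  # result deliberately discarded

threads = [threading.Thread(target=burn, args=(50_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

Run it and watch a core monitor (top, htop, Task Manager); if only one core is busy, threads will not help for this workload.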
Now, in general, in CPython, locks aren't necessary around certain kinds of operations, including __setitem__ on simple types—but that's because of that same GIL, so it isn't going to help you here. If you have multiple operations all trying to write to the same array, they will need a lock around that array.
But there may be a better way around this. If you can find a way to divide the array into smaller arrays, only one of which is being modified at any given time, you don't need any locks. Or, if you can have the threads return smaller arrays that can be assembled by a single master thread into the final answer, instead of working in-place in the first place, that also works.
But before you go doing that… in some cases, NumPy (or, rather, one of the libraries it's using) is already auto-parallelizing things for you, or could be if you built it differently. Or it could be SIMD-vectorizing things in a way that actually gives more speedup than threading, which you could end up breaking. And so on.
So, make sure you have a properly-optimized NumPy with all the optional prereqs installed before you try anything. Then make sure it's only using one core as-is. Then build a test scaffolding so you can compare different implementations. And then you can try out each lock-based, non-sharing, and non-mutating algorithm you can come up with to see if the parallelism helps more than the extra stuff hurts.
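If the test says threading helps, here is a minimal sketch of the non-mutating variant described above: each worker diagonalizes one row of matrices and returns it, and a single master thread assembles the 4-dimensional array, so no locks are needed (the constant H stands in for the k-dependent matrix, as in the question):

import numpy as np
from concurrent.futures import ThreadPoolExecutor

L = 10  # system size, as in the question

def diagonalize_row(i):
    row = np.empty((L//2, 4, 4), complex)
    for j in range(L//2):
        H = np.ones((4, 4), complex)  # stand-in for the k-dependent matrix
        energies, states = np.linalg.eigh(H)
        row[j] = states
    return i, row

spectrum = np.empty((L//2, L//2, 4, 4), complex)
with ThreadPoolExecutor() as pool:
    for i, row in pool.map(diagonalize_row, range(L//2)):
        spectrum[i] = row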

Will multiprocessing be a good solution for this operation?

while True:
    Number = len(SomeList)
    OtherList = array([None] * Number)
    for i in xrange(Number):
        OtherList[i] = (Numpy array calculation only using the i_th element of the arrays Array_1, Array_2, and Array_3.)
The 'Number' elements of OtherList and the other arrays can be calculated separately.
However, since the program is time-dependent, we cannot proceed further until all 'Number' elements have been processed.
Will multiprocessing be a good solution for this operation?
I need to speed up this process as much as possible.
If so, please suggest how the code should look.
It is possible to use numpy arrays with multiprocessing but you shouldn't do it yet.
Read A beginners guide to using Python for performance computing and its Cython version: Speeding up Python (NumPy, Cython, and Weave).
Without knowing the specific calculations or the sizes of the arrays, here are generic guidelines, in no particular order:
measure the performance of your code. Find the hot spots. Your code might spend longer loading input data than on all the calculations. Set your goal; define what trade-offs are acceptable
check with automated tests that you get the expected results
check whether you could use optimized libraries to solve your problem
make sure the algorithm has adequate time complexity: an O(n) algorithm in pure Python can be faster than an O(n**2) algorithm in C for large n
use slicing and vectorized (automatic looping) calculations that replace the explicit loops of the Python-only solution
rewrite the places that need optimization using weave, f2py, cython or similar. Provide type information. Explore compiler options. Decide whether the speedup is worth keeping the C extensions
minimize allocation and data copying. Make it cache-friendly
explore whether multiple threads might be useful in your case, e.g., cython.parallel.prange(). Release the GIL
compare with the multiprocessing approach. The link above contains an example of how to compute different slices of an array in parallel
iterate
Since you have a while True clause there, I will assume you will run a lot of iterations, so the potential gains will eventually outweigh the slowdown from spawning the multiprocessing pool. I will also assume you have more than one logical core on your machine, for obvious reasons. Then the question becomes whether the cost of serializing the inputs and deserializing the results is offset by the gains.
The best way to know if there is anything to be gained, in my experience, is to try it out. I would suggest that:
You pass any constant inputs at start time. Thus, if any of Array_1, Array_2, and Array_3 never changes, pass it as args when calling Process(). This way you reduce the amount of data that needs to be pickled and passed on via IPC (which is what multiprocessing does). See the sketch below.
You use a work queue and add tasks to it as soon as they are available. This way, you can make sure there is always more work waiting when a process is done with a task.
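Here is a minimal sketch of that first suggestion (compute_item and the arithmetic inside it are placeholders, since the question elides the actual calculation): the constant arrays are shipped to each worker once, at pool start-up, rather than with every task:

import numpy as np
from multiprocessing import Pool

_arrays = None

def _init(a1, a2, a3):
    # runs once in each worker process; the arrays are pickled once per worker
    global _arrays
    _arrays = (a1, a2, a3)

def compute_item(i):
    a1, a2, a3 = _arrays
    return a1[i] * a2[i] + a3[i]  # placeholder for the real calculation

if __name__ == "__main__":
    Array_1, Array_2, Array_3 = (np.random.rand(1000) for _ in range(3))
    with Pool(initializer=_init, initargs=(Array_1, Array_2, Array_3)) as pool:
        OtherList = np.array(pool.map(compute_item, range(len(Array_1))))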
