How to track the "calling chain" from numpy to C implementation? - python

I have read the Numpy tutorial and API guide, and from that helpful documentation I learned how to extend Numpy with my own C code and how to call Numpy functions from C.
However, what I really want to know is: how can I track the calling chain from Python code to the C implementation? In other words, how can I find out which part of the C implementation corresponds to this simple Numpy array addition?
x = np.array([1, 2, 3])
y = np.array([1, 2, 3])
print(x + y)
Can I use some tools like gdb to track its stack frame step by step?
Or can I directly recognize the corresponding codes from variable naming policy? (like if I want to know the code about addition, I can search for something like function PyNumpyArrayAdd(...) )
[EDIT] I found a very useful video on how to locate the C implementation of basic C-implemented functions and operator overloads such as "+" and "-".
https://www.youtube.com/watch?v=mTWpBf1zewc
Got this from Andras Deak via Numpy mailing-list.
[EDIT2] There is another way to track all the functions called in Numpy, using gdb. It is very heavy because it displays every Numpy function that is called, including the trivial ones, and it might take some time.
First you need to download or clone the Numpy repository to your own workspace and compile it with the -g option, which attaches debug information.
Then you open a terminal in the "path/to/numpy-main" directory where the setup.py of Numpy lies, and then run gdb.
If you want to know what functions in Numpy's C implementation are called in this single python statement:
y = np.exp(x)
you can set breakpoints on all the functions implemented by Numpy using this gdb python script provided by the first answer here:
Can gdb set break at every function inside a directory?
Once you load this python script by source somename.py, you can run this command in gdb: rbreak-dir numpy/core/src
And you can set commands for each breakpoint:
commands 1-5004
> silent
> bt 1
> c
> end
(here 1-5004 is the range of the breakpoints that you want to run commands on)
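Typing the commands block by hand gets tedious when the breakpoint range changes between runs. As a small sketch, the block can be generated as text and saved to a file that gdb loads with source. The helper name below is made up for illustration; only the gdb commands themselves (commands, silent, bt 1, c, end) come from the steps above.

```python
def make_trace_commands(first, last):
    """Build a gdb 'commands' block that prints one backtrace frame and continues."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    lines = [
        f"commands {first}-{last}",
        "silent",  # suppress gdb's usual breakpoint announcement
        "bt 1",    # show only the innermost frame, i.e. the function just entered
        "c",       # continue to the next breakpoint
        "end",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Redirect this output to a file and load it in gdb with: source <that file>
    print(make_trace_commands(1, 5004), end="")
```

The range should match the breakpoint numbers that rbreak-dir actually created in your session, which will differ between Numpy versions.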
Once a breakpoint is hit, these commands run: gdb prints the innermost frame of the backtrace (that is, the function you are currently in) and then continues. In this way you can track all the functions called in Numpy. This is a picture from my own working environment (I took a snapshot since there are rules preventing copying any byte from my work computer):
I hope my experiments help future readers.

However, what I really want to know is: how can I track the calling chain from Python code to the C implementation? In other words, how can I find out which part of the C implementation corresponds to this simple Numpy array addition?
AFAIK, there are two main ways to do that: using a debugger, or tracing the function through the source code (typically by reading the wrapper code or by searching for keywords in numpy/core/src/XXX/). Numpy has different kinds of functions. Some focus on the CPython interaction (e.g. type checking, array creation, generic iterators) and some focus on the computation itself (doing the work efficiently). Which files you need to inspect depends on what you are after. core/src/umath/loops.c.src is the place to look for the core computing functions that perform basic independent math operations.
Can I use some tools like gdb to track its stack frame step by step?
Using a debugger is the usual approach unless you are already familiar with the Numpy code. You can try to find the Numpy entry point by reading the wrapper code, but I think that is a bit difficult because this part of the code is not very readable (many related parts are generated, certainly to ease development or avoid mistakes). The hard part with GDB is finding the first entry point into Numpy: the CPython interpreter calls are hard to follow because there are many of them (sometimes called recursively), and the call stack is large and far from clear (i.e. there is no clear information about the actual statement or expression being executed). That being said, AFAIR, the entry point is often named something like PyArray_XXX or array_XXX. You can also look for the first function on the stack that executes code from the Numpy library.
Or can I directly recognize the corresponding codes from variable naming policy?
Some functions have a standardized name, typically PyArray_XXX. That being said, core computing functions generally do not. Their names are generated by a template system that parses comments and annotations and generates code from them. For adding two arrays, the main computing function should be, for example, @TYPE@_add@isa@, where @TYPE@ is either INT or LONG depending on your target platform. There is a special version (i.e. a specialization) for floating-point numbers that uses an optimized pairwise summation for the sake of accuracy. This kind of naming convention is quite common, though, so you can search for _add in the code, or for a begin repeat section with add as a kind parameter.
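The keyword search suggested above can be scripted. The following is a minimal sketch using only the standard library; the function name, the file suffixes, and the checkout path are assumptions for illustration, and a plain grep would do the same job.

```python
import os
import re


def find_mentions(root, pattern, suffixes=(".c", ".h", ".src")):
    """Yield (path, line_number, line) for source lines matching a regex."""
    regex = re.compile(pattern)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(suffixes):
                continue  # skip files that are not C sources or templates
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as fh:
                for lineno, line in enumerate(fh, start=1):
                    if regex.search(line):
                        yield path, lineno, line.rstrip("\n")


if __name__ == "__main__":
    # Assumes the current directory is a Numpy checkout; adjust the path as needed.
    for path, lineno, line in find_mentions("numpy/core/src", r"_add\b"):
        print(f"{path}:{lineno}: {line}")
```

The pattern _add\b matches names that end in _add (including template names followed by a suffix marker) while skipping longer identifiers such as _address.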
Related post: Numpy argmax source

Related

Chapel-Python integration questions

I'm trying to see if I can use Chapel for writing parallel code for use in a Python-based climate model:
https://github.com/CliMT/climt
I don't have any experience with Chapel, but it seems very promising for my use-case. I had a few questions about how to integrate Chapel code into my current workflow:
1. I know you can build importable .so files, but can the compilation stop when the Cython file is generated? I could then include it in the distribution and use standard setuptools to compile my project on Travis.
2. Can I pass numpy arrays to a Python extension written in Chapel?
3. If the answer to 2 is yes, and my computation is embarrassingly parallel in one dimension of the array, is there an elegant way to express this parallelism in Chapel?
4. If I write Chapel code that works on multiple nodes and compile it to a Python extension, how do I run it? Can I use a command along the lines of mpirun python my_code.py?
Unfortunately not currently. However, we do leave the generated .pxd and .py(x) files in the directory with the .so, so you could make use of those in the meanwhile (this wasn't a feature request we've considered, so if you felt motivated, definitely feel free to open an issue on our Github page: https://github.com/chapel-lang/chapel/issues).
For reference, we do this because the Cython compilation command is rather tricky. I had thought we printed the Cython command used with the chpl compilation flag --print-commands, but that doesn't look to be the case (I'll make an issue for that).
You can pass 1 dimensional numpy arrays of known primitive types to Chapel from Python. We're hoping to add support for other numpy arrays soon (hopefully in 1.21, slated for March 2020)
This is definitely doable on arrays in Chapel - I would recommend using a forall loop when traversing this dimension of your array for your computation, which will divide the indices in that dimension into a number of tasks determined by Chapel. (For those not familiar with forall loops, this link gives a good overview of the concept)
For example:
forall x in arr.domain.dim(1) {
  // traverses the first dimension of arr's domain in parallel
  ...
}
If you compile your Chapel library into a Python extension with multilocale settings, you can specify the number of locales (nodes) needed using the numlocales argument to the extension's chpl_setup function. Doing so will take care of distributing the Chapel code for you when you run your Python program.
For example, you could write:
import MyChplLib
MyChplLib.chpl_setup(4)
...
to run your program with 4 locales (nodes).
I should probably mention that as of the 1.20 release, we don't have support for array arguments in multilocale libraries. We're still figuring out priorities for the 1.21 release, so feedback on how fast you want that would be super helpful!

Where is the @cupy.fuse cupy Python decorator documented?

I've seen some demos of @cupy.fuse, which is nothing short of a miracle for GPU programming using Numpy syntax. The major problem with cupy is that each operation, such as an add, is a full kernel launch followed by a kernel free. So a series of adds and multiplies, for example, pays a lot of kernel overhead. (This is why one might be better off using numba @jit.)
@cupy.fuse() appears to fix this by merging all the operations inside the function into a single kernel, dramatically lowering the launch and free costs.
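The idea behind fusion can be illustrated without a GPU. The sketch below is plain Python, not CuPy: it contrasts running each elementwise operation as its own pass (with a temporary result in between) against a single pass that applies the whole expression per element. The function names are made up for illustration, and counting passes merely stands in for counting kernel launches.

```python
def unfused_axpy(a, x, y):
    """Compute a*x + y as two separate elementwise passes."""
    passes = 0
    tmp = [a * xi for xi in x]                 # pass 1: multiply, writes a temporary
    passes += 1
    out = [ti + yi for ti, yi in zip(tmp, y)]  # pass 2: add
    passes += 1
    return out, passes


def fused_axpy(a, x, y):
    """Compute a*x + y in one pass, with no temporary list."""
    out = [a * xi + yi for xi, yi in zip(x, y)]
    return out, 1


if __name__ == "__main__":
    x, y = [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]
    print(unfused_axpy(2.0, x, y))
    print(fused_axpy(2.0, x, y))
```

Both functions return the same values; only the number of passes over the data differs, which is the cost that fusing is meant to cut.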
But I cannot find any documentation of this other than the demos and the source code for cupy.fusion class.
Questions I have include:
Will cupy.fuse aggressively inline any Python functions called inside the function the decorator is applied to, thereby rolling them into the same kernel?
This enhancement log hints at this, but doesn't say whether composed functions end up in the same kernel or are simply allowed when the called functions are also decorated:
https://github.com/cupy/cupy/pull/1350
If so, do I need to decorate those functions with @fuse? I'm thinking that might impair the inlining rather than aid it, since it might render those functions into a non-fusable (maybe non-Python) form.
If not, could I get automatic inlining by first decorating the function with @numba.jit and then decorating with @fuse? Or would the @jit again render the resulting Python in a non-fusable form?
What breaks @fuse? What are the pitfalls? Is @fuse experimental and not likely to be maintained?
references:
https://gist.github.com/unnonouno/877f314870d1e3a2f3f45d84de78d56c
https://www.slideshare.net/pfi/automatically-fusing-functions-on-cupy
https://github.com/cupy/cupy/blob/master/cupy/core/fusion.py
https://docs-cupy.chainer.org/en/stable/overview.html
https://github.com/cupy/cupy/blob/master/cupy/manipulation/tiling.py
(SOME) ANSWERS: I have found answers to some of these questions, which I'm posting here.
Questions:
Fusing kernels is such a huge advance that I don't understand when I would ever not want to use @fuse. Isn't it always better? When is it a bad idea?
Answer: Fuse does not support many useful operations yet. For example, z = cupy.empty_like(x) does not work, nor does referring to globals. Hence it simply cannot be applied universally.
I'm wondering about its composability. Will @fuse inline the functions it finds within the decorated function?
Answer: Looking at timings and NVVM markings, it looks like it does pull in subroutines and fuse them into the kernel. So dividing things into subroutines rather than monolithic code will work with fuse.
I see that a bug fix in the release notes says that it can now handle calling other functions decorated with @fuse. But this does not say whether their kernels are fused or remain separate.
Answer: Looking at NVVM output, it appears they are joined. It's hard to say whether there is some residual overhead, but the timing doesn't show the significant overhead that two separate kernels would incur. The key thing is that it now works. As of cupy 4.1 you could not call a fused function from a fused function because the return types were wrong, but since 5.1 you can. However, you do not need to decorate those functions. It just works whether you do or not.
Why isn't it documented?
Answer: It appears to have some bugs and some incomplete functionality. The code also advises that its API is subject to change.
However this is basically a miracle function when it can be used, easily improving speed by an order of magnitude on small to medium size arrays. So it would be nice if even this alpha version were documented.

Seasonal-Trend-Loess Method for Time Series in Python

Does anyone know if there is a Python-based procedure to decompose time series utilizing STL (Seasonal-Trend-Loess) method?
I saw references to a wrapper program to call the stl function in R, but I found that to be unstable and cumbersome from the environment set-up perspective (Python and R together). Also, the link was 4 years old.
Can someone point out something more recent (e.g. sklearn, scipy, etc.)?
I haven't tried STLDecompose but I took a peek at it and I believe it uses a general purpose loess smoother. This is hard to do right and tends to be inefficient. See the defunct STL-Java repo.
The pyloess package provides a python wrapper to the same underlying Fortran that is used by the original R version. You definitely don't need to go through a bridge to R to get this same functionality! This package is not actively maintained and I've occasionally had trouble getting it to build on some platforms (thus the fork here). But once built, it does work and is the fastest one you're likely to find. I've been tempted to modify it to include some new features, but just can't bring myself to modify the Fortran (which is pre-processed RATFOR - very assembly-language like Fortran, and I can't find a RATFOR preprocessor anywhere).
I wrote a native Java implementation, stl-decomp-4j, that can be called from python using the pyjnius package. This started as a direct port of the original Fortran, refactored to a more modern programming style. I then extended it to allow quadratic loess interpolation and to support post-decomposition smoothing of the seasonal component, features that are described in the original paper but that were not put into the Fortran/R implementation. (They apparently are in the S-plus implementation, but few of us have access to that.) The key to making this efficient is that the loess smoothing simplifies when the points are equidistant and the point-by-point smoothing is done by simply modifying the weights that one is using to do the interpolation.
The stl-decomp-4j examples include one Jupyter notebook demonstrating how to call this package from python. I should probably formalize that as a python package but haven't had time. Quite willing to accept pull requests. ;-)
I'd love to see a direct port of this approach to python/numpy. Another thing on my "if I had some spare time" list.
Here you can find an example of Seasonal-Trend decomposition using LOESS (STL), from statsmodels.
Basically, it works this way:
from statsmodels.tsa.seasonal import STL
stl = STL(TimeSeries, seasonal=13)
res = stl.fit()
fig = res.plot()
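To make the terms trend, seasonal, and residual concrete, here is a dependency-free sketch of a classical additive decomposition with a centered moving average. It is deliberately not STL: there is no loess smoothing and no robustness iteration, so it is only meant to show what the three components of a decomposition are. The function name is made up for illustration.

```python
def classical_decompose(series, period):
    """Additive decomposition: series = trend + seasonal + residual.

    The trend is a centered moving average over one period; it is None near
    the ends of the series, where the window does not fit.
    """
    n = len(series)
    half = period // 2
    if period < 2 or n < 2 * period:
        raise ValueError("need period >= 2 and at least two full periods")

    trend = [None] * n
    for i in range(half, n - half):
        window = series[i - half:i + half + 1]
        if period % 2 == 0:
            # Even period: the window has period + 1 points, so the two end
            # points get half weight to keep the average centered.
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        else:
            total = sum(window)
        trend[i] = total / period

    # Average the detrended values for each position within the period.
    phase_means = []
    for phase in range(period):
        vals = [series[i] - trend[i]
                for i in range(phase, n, period) if trend[i] is not None]
        phase_means.append(sum(vals) / len(vals))
    offset = sum(phase_means) / period  # center the seasonal part on zero
    seasonal = [phase_means[i % period] - offset for i in range(n)]

    residual = [None if trend[i] is None else series[i] - trend[i] - seasonal[i]
                for i in range(n)]
    return trend, seasonal, residual


if __name__ == "__main__":
    pattern = [1.0, -1.0, 2.0, -2.0]
    data = [0.5 * i + pattern[i % 4] for i in range(24)]
    trend, seasonal, residual = classical_decompose(data, 4)
    print(trend[2:6], seasonal[:4])
```

On a series built from a straight line plus a repeating pattern, this recovers both parts; STL differs mainly in using local regression for the smoothing steps, which lets the seasonal shape change over time.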
There is indeed:
https://github.com/jrmontag/STLDecompose
In the repo you will find a jupyter notebook for usage of the package.
RSTL is a Python port of R's STL: https://github.com/ericist/rstl. It works pretty well except it is 3~5 times slower than R's STL according to the author.
If you just want to get a lowess trend line, you can use Statsmodels' lowess function:
https://www.statsmodels.org/dev/generated/statsmodels.nonparametric.smoothers_lowess.lowess.html.

Automatically generate data for unit testing in Python

I have a module to test; it includes a series of functions and simple classes.
I am wondering whether there have been any attempts (i.e. a package) to automatically:
1) Generate Python code from an initial Python file containing function definitions.
2) Have this code be a list of calls to the functions with random/parametric data as parameters.
It is technically feasible using inspect and Python metaclasses, though usually limited to numerical functions (numpy arrays), because strings (e.g. URL input) would be impossible to generate (only parametrized).
EDIT: By random, I obviously mean "parametric random".
Suppose we have
def f(x1, x2, x3)
For all xi of f:
    if type(xi) == array1D -> run these tests: empty array, zeros array, negative array (random), positive array (random), high values, low values, integer array, real number array, ordered array, equally spaced array, ...
    if type(xi) == int -> test zero, 1, 2, 3, 4, random values, negative values
Do people think such a project is possible using inspect and metaclasses (limited to numpy/numerical items)?
Suppose you have a very large library; things can be done in the background.
You might be thinking of fuzz testing, where a bunch of garbage data is submitted to a function to see if anything makes it behave badly. It sounds like the Hypothesis library will let you generate different test cases based on some parameters.
I spent some time searching, and it seems this kind of project does not really exist (to my knowledge).
Technically, this is a mix of packages (and issues):
Hypothesis: data generation for inputs, running the code to find crashes/errors
(without the invariant part of Hypothesis)
Jedi: static analysis of code and inference of types
Type inference is a difficult issue in Python (in general)
implementing type inference
If the type is a number or an array of numbers:
boundaries exist and typical usage is clearly defined
If the type is a string: inference is pretty difficult without human guessing.
The same goes for other types; guessing the context is important
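As a rough feasibility sketch of the inspect-based idea discussed in this thread, the snippet below reads a function's type annotations and calls it with every combination of a few boundary values per type. It is not Hypothesis and not an existing package: the BOUNDARY_VALUES table, the helper names, and the sample function are all made up for illustration, and it assumes every parameter carries a plain annotation such as int or list.

```python
import inspect
import itertools

# Hypothetical boundary values per annotated type; extend as needed.
BOUNDARY_VALUES = {
    int: [0, 1, -1, 2**31 - 1],
    float: [0.0, -1.5, 1e308],
    list: [[], [0, 0, 0], [-3, -2, -1], [1, 2, 3]],
}


def generate_cases(func):
    """Return every combination of boundary values for func's parameters."""
    pools = []
    for name, param in inspect.signature(func).parameters.items():
        if param.annotation not in BOUNDARY_VALUES:
            raise TypeError(f"no boundary values known for parameter {name!r}")
        pools.append(BOUNDARY_VALUES[param.annotation])
    return list(itertools.product(*pools))


def run_cases(func):
    """Call func on each generated case and collect the ones that raise."""
    failures = []
    for args in generate_cases(func):
        try:
            func(*args)
        except Exception as exc:  # any crash is worth reporting
            failures.append((args, exc))
    return failures


def mean_scaled(values: list, factor: int):
    """Sample function under test; it breaks on an empty list."""
    return factor * sum(values) / len(values)


if __name__ == "__main__":
    for args, exc in run_cases(mean_scaled):
        print(args, "->", type(exc).__name__)
```

This only detects crashes. Checking that results are correct still needs human-written expectations, which is the part the answers above describe as hard to infer automatically.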

Optimizing Python Code [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 11 years ago.
I've been working on one of the coding challenges on InterviewStreet.com and I've run into a bit of an efficiency problem. Can anyone suggest where I might change the code to make it faster and more efficient?
Here's the code
Here's the problem statement if you're interested
If your question is about optimising Python code generally (which I think it should be ;) then there are all sorts of interesting things you can do, but first:
You probably shouldn't be obsessively optimising python code! If you're using the fastest algorithm for the problem you're trying to solve and python doesn't do it fast enough you should probably be using a different language.
That said, there are several approaches you can take (because sometimes, you really do want to make python code faster):
Profile (do this first!)
There are lots of ways of profiling python code, but there are two that I'll mention: cProfile (or profile) module, and PyCallGraph.
cProfile
This is what you should actually use, though interpreting the results can be a bit daunting.
It works by recording when each function is entered or exited, and what the calling function was (and tracking exceptions).
You can run a function in cProfile like this:
import cProfile
cProfile.run('myFunction()', 'myFunction.profile')
Then to view the results:
import pstats
stats = pstats.Stats('myFunction.profile')
stats.strip_dirs().sort_stats('time').print_stats()
This will show you in which functions most of the time is spent.
PyCallGraph
PyCallGraph provides the prettiest and maybe the easiest way of profiling Python programs, and it's a good introduction to understanding where the time in your program is spent. However, it adds significant execution overhead.
To run pycallgraph:
pycallgraph graphviz ./myprogram.py
Simple! You get a png graph image as output (perhaps after a while...)
Use Libraries
If you're trying to do something in python that a module already exists for (maybe even in the standard library), then use that module instead!
Many of the standard library modules are written in C, and they can execute many times faster than equivalent Python implementations of, say, bisection search.
Make the Interpreter do as Much of Your Work as You Can
The interpreter will do some things for you, like looping. Really? Yes! You can use the built-in map, reduce, and filter functions (they are functions, not keywords) to significantly speed up tight loops:
consider:
for x in xrange(0, 100):
    doSomethingWithX(x)
vs:
map(doSomethingWithX, xrange(0,100))
Well obviously this could be faster because the interpreter only has to deal with a single statement, rather than two, but that's a bit vague... in fact, this is faster for two reasons:
all flow control (have we finished looping yet...) is done in the interpreter
the doSomethingWithX function name is only resolved once
In the for loop, each time around the loop Python has to check exactly where the doSomethingWithX function is! Even with caching this is a bit of an overhead.
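The snippets above are Python 2. In Python 3, range replaces xrange and map returns a lazy iterator, so it has to be consumed (for example with list) before any work happens. The following sketch lets you measure both forms on your own interpreter; square is a stand-in for doSomethingWithX, and no particular speedup is implied.

```python
import timeit


def square(x):
    return x * x


def with_loop(n):
    out = []
    for x in range(n):
        out.append(square(x))
    return out


def with_map(n):
    # list() forces the lazy map object to actually run
    return list(map(square, range(n)))


if __name__ == "__main__":
    assert with_loop(100) == with_map(100)
    for fn in (with_loop, with_map):
        seconds = timeit.timeit(lambda: fn(1000), number=200)
        print(f"{fn.__name__}: {seconds:.4f}s for 200 runs")
```

Measure before committing to either style, since the difference depends on the interpreter version and on how much work the called function does.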
Remember that Python is an Interpreted Language
(Note that this section really is about tiny tiny optimisations that you shouldn't let affect your normal, readable coding style!)
If you come from a background of programming in a compiled language, like C or Fortran, then some things about the performance of different Python statements might be surprising:
try:ing is cheap, ifing is expensive
If you have code like this:
if somethingcrazy_happened:
    uhOhBetterDoSomething()
else:
    doWhatWeNormallyDo()
And doWhatWeNormallyDo() would throw an exception if something crazy had happened, then it would be faster to arrange your code like this:
try:
    doWhatWeNormallyDo()
except SomethingCrazy:
    uhOhBetterDoSomething()
Why? Well, the interpreter can dive straight in and start doing what you normally do. In the first case the interpreter has to do a symbol lookup each time the if statement is executed, because the name could refer to something different from the last time the statement was executed! (And a name lookup, especially if somethingcrazy_happened is global, can be nontrivial.)
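A concrete way to compare the two styles is a dictionary lookup. In CPython, entering a try block costs very little, while raising and catching an exception is comparatively expensive, so the try version tends to pay off only when the exceptional path is rare. The function names below are made up for illustration, and the timings are for you to measure rather than a promised result.

```python
import timeit


def get_lbyl(table, key, default=None):
    """Look before you leap: test first, then index."""
    if key in table:
        return table[key]
    return default


def get_eafp(table, key, default=None):
    """Easier to ask forgiveness: index directly, handle the rare miss."""
    try:
        return table[key]
    except KeyError:
        return default


if __name__ == "__main__":
    table = {i: i * i for i in range(1000)}
    for fn in (get_lbyl, get_eafp):
        hit = timeit.timeit(lambda: fn(table, 500), number=200_000)
        miss = timeit.timeit(lambda: fn(table, -1), number=200_000)
        print(f"{fn.__name__}: hit {hit:.3f}s, miss {miss:.3f}s")
```

Both functions return the same results; the comparison of the hit and miss timings shows where each style spends its time.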
You mean Who??
Because of cost of name lookups it can also be better to cache global values within functions, and bake-in simple boolean tests into functions like this:
Unoptimised function:
def foo():
    if condition_that_rarely_changes:
        doSomething()
    else:
        doSomethingElse()
Optimised approach, instead of using a variable, exploit the fact that the interpreter is doing a name lookup on the function anyway!
When the condition becomes true:
foo = doSomething # now foo() calls doSomething()
When the condition becomes false:
foo = doSomethingElse # now foo() calls doSomethingElse()
PyPy
PyPy is a python implementation written in python. Surely that means it will run code infinitely slower? Well, no. PyPy actually uses a Just-In-Time compiler (JIT) to run python programs.
If you don't use any external libraries (or the ones you do use are compatible with PyPy), then this is an extremely easy way to (almost certainly) speed up repetitive tasks in your program.
Basically the JIT can generate code that will do what the python interpreter would, but much faster, since it is generated for a single case, rather than having to deal with every possible legal python expression.
Where to look Next
Of course, the first place you should have looked was to improve your algorithms and data structures, and to consider things like caching, or even whether you need to be doing so much in the first place, but anyway:
This page of the python.org wiki provides lots of information about how to speed up python code, though some of it is a bit out of date.
Here's the BDFL himself on the subject of optimising loops.
There are quite a few things, even from my own limited experience that I've missed out, but this answer was long enough already!
This is all based on my own recent experiences with some python code that just wasn't fast enough, and I'd like to stress again that I don't really think any of what I've suggested is actually a good idea, sometimes though, you have to....
First off, profile your code so you know where the problems lie. There are many examples of how to do this, here's one: https://codereview.stackexchange.com/questions/3393/im-trying-to-understand-how-to-make-my-application-more-efficient
You do a lot of indexed access as in:
for pair in range(i-1, j):
    if coordinates[pair][0] >= 0 and coordinates[pair][1] >= 0:
Which could be written more plainly as:
for coord in coordinates[i-1:j]:
    if coord[0] >= 0 and coord[1] >= 0:
List comprehensions are cool and "pythonic", but this code would probably run faster if you didn't create 4 lists:
N = int(raw_input())
coordinates = []
coordinates = [raw_input() for i in xrange(N)]
coordinates = [pair.split(" ") for pair in coordinates]
coordinates = [[int(pair[0]), int(pair[1])] for pair in coordinates]
I would instead roll all those together into one simple loop or if you're really dead set on list comprehensions, encapsulate the multiple transformations into a function which operates on the raw_input().
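One way to roll those transformations into a single loop is sketched below in Python 3 (the snippet above is Python 2, where raw_input and xrange exist). The function name is made up, and it takes an iterable of lines so it can be fed from input(), a file, or a list in tests.

```python
def parse_coordinates(lines):
    """Turn lines such as "3 -4" into [x, y] integer pairs in a single pass."""
    coordinates = []
    for line in lines:
        x, y = line.split()  # split on any whitespace
        coordinates.append([int(x), int(y)])
    return coordinates


if __name__ == "__main__":
    sample = ["1 2", "-3 4", "0 0"]
    print(parse_coordinates(sample))
```

Reading from standard input then becomes parse_coordinates(input() for _ in range(N)), which builds one list instead of four.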
This answer shows how I locate code to optimize.
Suppose there is some line of code you could replace, and it is costing, say, 40% of the time.
Then it resides on the call stack 40% of the time.
If you take 10 samples of the call stack, it will appear on 4 of them, give or take.
It really doesn't matter how many samples show it.
If it appears on two or more, and if you can replace it, you will save whatever time it costs.
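The arithmetic behind this can be checked with a simple binomial model. This is a sketch that assumes the stack samples are independent and that the line is on the stack a fixed fraction of the time; the function name is made up for illustration.

```python
from math import comb


def prob_seen_at_least(k, samples, fraction):
    """Probability that a line on the stack `fraction` of the time shows up
    in at least k of `samples` independent stack samples."""
    # Sum the probabilities of seeing it fewer than k times, then invert.
    miss = sum(comb(samples, j) * fraction**j * (1 - fraction)**(samples - j)
               for j in range(k))
    return 1 - miss


if __name__ == "__main__":
    # A line costing 40% of the time, 10 samples, seen on two or more of them
    print(round(prob_seen_at_least(2, 10, 0.4), 4))  # prints 0.9536
```

So with only 10 samples, a 40% hotspot shows up on two or more of them about 95% of the time under this model, which is why a handful of samples is usually enough to find it.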
Most of the interview street problems seem to be tested in a way that will verify that you have found an algorithm with the right big O complexity rather than that you have coded the solution in the most optimal way possible.
In other words if you are failing some of the test cases due to running out of time the problem is likely that you need to figure out a solution with lower algorithmic complexity rather than micro-optimize the algorithm you have. This is why they generally state that N can be quite large.
