tf.reciprocal and tf.inv seem to be equivalent. Is there any difference? They are implemented as separate TF ops and also have separate gradient implementations, which also seem equivalent.
They mean the same thing. In fact, tf.inv was renamed to tf.reciprocal, and tf.inv is no longer exposed at the top level of the module in the latest versions (though both still exist in gen_math_ops.py).
From the migration documentation:
Many functions have been renamed to match NumPy. This was done to make the transition between NumPy and TensorFlow as easy as possible. There are still numerous cases where functions do not match, so this is far from a hard and fast rule, but we have removed several commonly noticed inconsistencies.
tf.inv
should be renamed to tf.reciprocal
This was done to avoid confusion with NumPy's matrix inverse np.inv
There you can also see several more functions that were renamed, like tf.mul and tf.neg.
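For a quick sanity check that the two names compute the same elementwise 1/x, here is a minimal sketch, assuming TF 2.x with eager execution, where the op now lives under tf.math (older 1.x code called tf.reciprocal or tf.inv directly):

import tensorflow as tf

x = tf.constant([2.0, 4.0, 8.0])
print(tf.math.reciprocal(x))  # elementwise 1/x -> [0.5, 0.25, 0.125]
# Note this is NOT a matrix inverse; that is tf.linalg.inv
# (np.linalg.inv on the NumPy side), hence the rename.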
Aside from the additional functionality that trunc possesses as a ufunc, is there any difference between these two functions? I expected fix to be defined using trunc as is fairly common throughout NumPy, but it is actually defined using floor and ceil.
If the ufunc functionality is the only difference, why does fix exist? (Or at least, why does fix not just wrap trunc?)
There is no difference in the results. Although the two functions are implemented differently internally, they accept the same input types and produce identical results in both value and type. There is no good reason to have both functions, particularly when one may run more slowly and/or take more resources than the other, and both must be maintained for as long as NumPy survives. Quite frankly, this is one problem with Python and many other design-by-committee projects, and you see debates over this same trunc/fix issue in other language and library development projects such as Julia.
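A quick check makes the equivalence concrete:

import numpy as np

a = np.array([-1.7, -0.2, 0.2, 1.7])
print(np.trunc(a))  # [-1. -0.  0.  1.]
print(np.fix(a))    # [-1. -0.  0.  1.]  (same values, same dtype)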
The NumPy documentation on np.random.permutation suggests that all new code use np.random.default_rng() from the Random Generator package. I see in the documentation that the Random Generator package has standardized the generation of a wide variety of random distributions around a BitGenerator, instead of using the Mersenne Twister, which I'm vaguely familiar with.
I see one downside, what used to be a single line of code to do simple permutations:
np.random.permutation(10)
turns into two lines of code now, which feels a little awkward for such a simple task:
rng = np.random.default_rng()
rng.permutation(10)
Why is this new approach an improvement over the previous approach?
And why wouldn't existing methods like np.random.permutation just wrap this new preferred method?
Is there a good reason not to use this new method as a one-liner np.random.default_rng().permutation(10), assuming it's not being called at high volumes?
Is there an argument for switching existing code to this method?
Some context:
Does numpy.random.seed() always give the same random number every time?
NumPy: Decide on new PRNG BitGenerator default
To your questions, in a logical order:
And why wouldn't existing methods like np.random.permutation just wrap this new preferred method?
Probably because of backwards-compatibility concerns. Even if the "top-level" API would not change, its internals would change significantly enough to be deemed a break in compatibility.
Why is this new approach an improvement over the previous approach?
"By default, Generator uses bits provided by PCG64 which has better statistical properties than the legacy MT19937 used in RandomState." (source). The PCG64 docstring provides more technical detail.
Is there a good reason not to use this new method as a one-liner np.random.default_rng().permutation(10), assuming it's not being called at high volumes?
I very much agree that it's a slightly awkward added line of code if it's done at the module-start. I would only point out that the NumPy docs do directly use this form in docstring examples, such as:
n = np.random.default_rng().standard_exponential((3, 8000))
The slight difference would be that one is instantiating a class at module load/import time, whereas in your form it might come later. But that should be a minuscule difference (again, assuming it's only used once or a handful of times). If you look at the default_rng(seed) source, when called with None, it just returns Generator(PCG64(seed)) after a few quick checks on seed.
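To make that concrete, here are both forms side by side; the seed argument is optional and only matters for reproducibility:

import numpy as np

perm = np.random.default_rng().permutation(10)  # one-liner, fine for occasional use

rng = np.random.default_rng(seed=42)  # reusable, seeded generator
perm2 = rng.permutation(10)
perm3 = rng.permutation(10)  # advances the same stream, so perm2 != perm3 in general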
Is there an argument for switching existing code to this method?
Going to pass on this one since I don't have anywhere near the depth of technical knowledge to give a good comparison of the algorithms, and also because it depends on some other variables, such as whether you're concerned about keeping your downstream code compatible with older versions of NumPy, where default_rng() simply doesn't exist.
I've seen some demos of @cupy.fuse, which is nothing short of a miracle for GPU programming using NumPy syntax. The major problem with CuPy is that each operation, like an add, is a full kernel launch followed by a kernel free, so a series of adds and multiplies, for example, pays a lot of kernel overhead. (This is why one might be better off using numba @jit.)
@cupy.fuse() appears to fix this by merging all the operations inside the function into a single kernel, dramatically lowering the launch and free costs.
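To make the idea concrete, here is a minimal sketch of the decorator as shown in the demos; the function body compiles into one kernel instead of one launch per operation:

import cupy as cp

@cp.fuse()
def fused_madd(x, y, z):
    # the multiply and add execute in a single fused kernel
    return x * y + z

a = cp.arange(1024, dtype=cp.float32)
out = fused_madd(a, a, a)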
But I cannot find any documentation of this other than the demos and the source code for cupy.fusion class.
Questions I have include:
Will cupy.fuse aggressively inline any Python functions called inside the function the decorator is applied to, thereby rolling them into the same kernel?
This enhancement log hints at this, but doesn't say whether composed functions end up in the same kernel or are simply allowed when the called functions are also decorated:
https://github.com/cupy/cupy/pull/1350
If so, do I need to decorate those functions with @fuse? I'm thinking that might impair the inlining rather than aid it, since it might render those functions into a non-fusable (maybe non-Python) form.
If not, could I get automatic inlining by first decorating the function with @numba.jit and then decorating with @fuse? Or would the @jit again render the resulting Python into a non-fusable form?
What breaks @fuse? What are the pitfalls? Is @fuse experimental and not likely to be maintained?
references:
https://gist.github.com/unnonouno/877f314870d1e3a2f3f45d84de78d56c
https://www.slideshare.net/pfi/automatically-fusing-functions-on-cupy
https://github.com/cupy/cupy/blob/master/cupy/core/fusion.py
https://docs-cupy.chainer.org/en/stable/overview.html
https://github.com/cupy/cupy/blob/master/cupy/manipulation/tiling.py
(SOME) ANSWERS: I have found answers to some of these questions, which I'm posting here.
Question: Fusing kernels is such a huge advance that I don't understand when I would ever not want to use @fuse. Isn't it always better? When is it a bad idea?
Answer: Fuse does not support many useful operations yet. For example, z = cupy.empty_like(x) does not work, nor does referring to globals. Hence it simply cannot be applied universally.
Question: I'm wondering about its composability: will @fuse inline the functions it finds within the decorated function?
Answer: Looking at timings and NVVM markings, it looks like it does pull in subroutines and fuse them into the kernel. So dividing things into subroutines rather than monolithic code will work with fuse.
Question: I see that a bug fix in the release notes says that it can now handle calling other functions decorated with @fuse. But it does not say whether their kernels are fused or remain separate.
Answer: Looking at the NVVM output, it appears they are joined. It's hard to say if there is some residual overhead, but the timing doesn't show significant overhead that would indicate two separate kernels. The key thing is that it now works: as of CuPy 4.1 you could not call a fused function from a fused function, as the return types were wrong, but since 5.1 you can. However, you do not need to decorate those called functions; it just works whether you do or not.
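As a sketch of the pattern that now works (CuPy >= 5.1, per the answer above), the inner helper needs no decorator of its own:

import cupy as cp

def inner(x, y):  # plain Python helper; fuse traces through it
    return x * y

@cp.fuse()
def outer(x, y, z):
    return inner(x, y) + z  # inlined into the same fused kernel

a = cp.arange(1024, dtype=cp.float32)
out = outer(a, a, a)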
Question: Why isn't it documented?
Answer: It appears to have some bugs and some incomplete functionality. The code also advises that the API for it is subject to change.
However, this is basically a miracle function when it can be used, easily improving speed by an order of magnitude on small to medium-size arrays. So it would be nice if even this alpha version were documented.
I implemented a physics simulation in Python (most of the heavy lifting is done in numerical libraries anyway, so performance is good enough).
Now that the project has grown a bit, I have added extra functionality via parameters that do not change during the simulation. With that comes the necessity to have the program do one thing or another based on their values, i.e., quite a few if-else branches scattered around the code.
My question is simple: does Python implement some form of branch prediction? Am I going to hurt performance significantly, or is the interpreter smart enough to see that some parameters never change? If there is a constant if-else inside a function that is called a million times, is the conditional evaluated every time, or does some magic happen? When there is no easy way to remove the conditional altogether, is there a way to give the interpreter some hints to favour or emulate branch prediction?
You could in theory benefit here from some JIT functionality that may observe the control flow over time and could effectively suppress never-taken branches by rearranging the code. Some of the Python interpreters contain JIT compilers (I think PyPy does in newer versions, maybe Jython as well), and may be able to do this optimization, but that of course depends on the actual code.
However, the main form of branch prediction is done in hardware and is unrelated to the software or language constructs used (in Python's case, quite a few levels of abstraction above). This mechanism eventually observes these conditional code paths as branches and may be able to learn them if they are indeed statically determined. However, like any prediction mechanism, it has limited capacity, and since your code is presumably large, it may not be able to accommodate predictions for all of these branches. Hardware branch prediction is still considered quite good, so chances are that the critical branches will predict well.
Finally, if you really want to optimize your code, you can convert some of these conditions into constants (assigning an argument a constant value instead of parsing it from the command line), or hide a condition completely behind something like __debug__. This way you won't have to worry about predicting them, but can restore the capability with minimal work if you need it in the future.
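As a sketch of that last point: a module-level constant is still tested on every call, but an if __debug__: block is compiled out entirely when Python runs with the -O flag (USE_DRAG is a hypothetical simulation flag here):

USE_DRAG = True  # hypothetical parameter, fixed for the whole run

def step(state):
    if USE_DRAG:  # evaluated on every call; cheap, but not free
        state = state * 0.99
    if __debug__:
        # this whole block is removed by the compiler under `python -O`
        assert state is not None
    return state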
Dear stackoverflow community!
In a previous Stack Overflow question, I mentioned that Python's np.fft.fftn() routine seems somehow slow compared to MATLAB, provided that the data cubes are rather big (grids of dimension 512x512x1921, datatype float; see Comparatively slow python numpy 3D Fourier Transformation). I think that MATLAB uses the FFTW library and could therefore be faster (~5 s compared to ~185 s, measured with time.time()), so it was suggested that I try pyFFTW for a time reduction.
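For reference, here is a minimal sketch of the drop-in interface that was suggested, assuming pyFFTW could actually be installed (pyfftw.interfaces.numpy_fft mirrors the np.fft API):

import numpy as np
import pyfftw

pyfftw.interfaces.cache.enable()  # reuse FFTW plans between calls
a = np.random.rand(64, 64, 128)
out = pyfftw.interfaces.numpy_fft.fftn(a, threads=4)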
The problem now is that at my workplace Python packages are provided via Anaconda for a large number of computers, and the pyFFTW package cannot easily be integrated into that setup. There is also a problem that long datatypes are not recognized, so compilation does not work at all, and pyFFTW conflicts with the internal FFTW implementation. Even if it were somehow installed, it would be overridden by the next system update.
I'm however not sure whether the different algorithm alone would explain the difference in computation time. As already written in the previous question, I really need these FFTs for my work.
Another issue concerns the striding of the output array of np.fft.fftn(), which is switched automatically to Fortran (column-major) order, the opposite of NumPy's default C order. This causes low performance when operating on the output together with C-strided grids (see Python numpy.fft changes strides).
So as a follow-up to my original questions, I want to ask you:
(MAIN) What other reasons might there be for Python to be so much slower? What can be done about it? I'd like to stay with Python if possible and would rather not switch to MATLAB just because of such things...
(SIDE) Is there any keyword to preserve striding? Using scipy is not a good option, and copying the array to a new one to get the strides right also seems an unnecessarily complicated step requiring additional computation time.
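For concreteness, the copy workaround mentioned above, which restores C order at the cost of an extra copy:

import numpy as np

a = np.random.rand(64, 64, 128)
out = np.fft.fftn(a)
print(out.strides)                 # may come back Fortran-ordered
out_c = np.ascontiguousarray(out)  # explicit copy back to C order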
Thanks for the help!