I am writing a simple implementation of AlexNet. I tried using both tf.nn.conv2d and tf.layers.conv2d, and it turns out that the loss dropped faster with tf.nn.conv2d, even though the structure is exactly the same. Does anyone have an explanation for that?
If you follow the chain of function calls, you will find that tf.layers.conv2d() makes calls to tf.nn.conv2d(), so no matter which one you use, tf.nn.conv2d() ends up being called; it is just faster if you call it yourself. You can use traceback.print_stack() to verify this for yourself.
NOTE: This does not mean that they are one and the same. Select the function based on your needs, as there are various other tasks undertaken by tf.layers.conv2d().
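To make the difference concrete, here is a rough TF 1.x sketch (the shapes and initializers are placeholders I picked, not from your code) of the same convolution expressed through both APIs. Among the extra work tf.layers.conv2d() does is creating and initializing the kernel and bias variables for you, while tf.nn.conv2d() expects you to manage the filter yourself:

import tensorflow as tf

# The same convolution through both APIs (TF 1.x style).
x = tf.placeholder(tf.float32, [None, 224, 224, 3])

# High-level API: kernel and bias variables are created and initialized for you.
y_layers = tf.layers.conv2d(x, filters=96, kernel_size=11, strides=4,
                            padding='same', activation=tf.nn.relu)

# Low-level API: you create and manage the filter and bias explicitly.
w = tf.get_variable('w', [11, 11, 3, 96],
                    initializer=tf.truncated_normal_initializer(stddev=0.01))
b = tf.get_variable('b', [96], initializer=tf.zeros_initializer())
y_nn = tf.nn.relu(tf.nn.conv2d(x, w, strides=[1, 4, 4, 1], padding='SAME') + b)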
map is defined in Python as
map(function, iterable, ...)
As can be seen, the function is the first parameter; the same goes for filter and reduce.
But when I check functions like sorted, they are defined as
sorted(iterable, key=None, reverse=False)
key being the function that can be used while sorting. I don't know Python well enough to say whether there are other examples like sorted, but for starters this seems a little unorganized. Since I am coming from a C++/D background, where I can almost always tell where the function parameter goes in the standard library, this is a bit unorthodox for me.
Is there any historical or hierarchical reason why the function parameter is expected in different orders?
The actual signature of map is:
map(function, iterable, ...)
It can take more than one iterable, so making the function the first argument is the most sensible design.
You can argue about filter; there is no one "correct" way to design it, but making it use the same order as map makes sense.
sorted doesn't require a key function, so it makes no sense to put it first.
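A quick illustration of the point (my own toy example): map's function must come first because any number of iterables can follow it, while sorted only needs an iterable and takes the optional key as a keyword.

# map accepts any number of iterables after the function,
# so the function has to come first.
sums = map(lambda a, b: a + b, [1, 2, 3], [10, 20, 30])
print(list(sums))    # [11, 22, 33]

# sorted needs only an iterable; the key function is optional,
# so it is a keyword argument rather than the first positional parameter.
print(sorted(["banana", "Apple", "cherry"], key=str.lower))    # ['Apple', 'banana', 'cherry']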
Anyone can contribute a module to the Python ecosystem. The lineage of any particular module is fairly unique (though I'm sure there are general families that share common lineages). While there will be attempts to standardise and agree on common conventions, there is a limit to what is possible.
As a result, some modules will have a set of paradigms that will differ vastly from other modules - they will have different focuses, and you just can't standardise down to the level you're looking for.
That being said, if you wanted to make that a priority, there's nothing stopping you from recasting all the non-standard things you find into a new suite of open source libraries and encouraging people to adopt them as the standard.
The design of map() is simply different; the purpose of a function will largely determine the parameters you pass to it. map() executes a specified function for each item in an iterable, which is a very different purpose from that of sorted().
I've seen some demos of @cupy.fuse, which is nothing short of a miracle for GPU programming using NumPy syntax. The major problem with CuPy is that each operation, like an add, is a full kernel launch followed by a kernel free, so a series of adds and multiplies, for example, pays a lot of kernel overhead. (This is why one might be better off using numba @jit.)
@cupy.fuse() appears to fix this by merging all the operations inside the function into a single kernel, dramatically lowering the launch and free costs.
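For concreteness, a minimal sketch of the idea (my own toy example, not from the demos): without the decorator, each arithmetic operation below would launch its own kernel; with it, the first call compiles them into one fused kernel.

import cupy

@cupy.fuse()
def saxpy_like(a, x, y):
    # a * x and + y are compiled into a single kernel instead of two launches
    return a * x + y

x = cupy.arange(8, dtype=cupy.float32)
y = cupy.ones_like(x)
print(saxpy_like(2.0, x, y))    # [ 1.  3.  5.  7.  9. 11. 13. 15.]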
But I cannot find any documentation of this other than the demos and the source code for cupy.fusion class.
Questions I have include:
Will cupy.fuse aggressively inline any Python functions called inside the function the decorator is applied to, thereby rolling them into the same kernel?
This enhancement log hints at it but doesn't say whether composed functions end up in the same kernel or are simply allowed when the called functions are also decorated:
https://github.com/cupy/cupy/pull/1350
If so, do I need to decorate those functions with @fuse? I'm thinking that might impair the inlining rather than aid it, since it might render those functions into a non-fusable (maybe non-Python) form.
If not, could I get automatic inlining by first decorating the function with @numba.jit and then decorating with @fuse? Or would the @jit again render the resulting Python into a non-fusable form?
What breaks @fuse? What are the pitfalls? Is @fuse experimental and not likely to be maintained?
references:
https://gist.github.com/unnonouno/877f314870d1e3a2f3f45d84de78d56c
https://www.slideshare.net/pfi/automatically-fusing-functions-on-cupy
https://github.com/cupy/cupy/blob/master/cupy/core/fusion.py
https://docs-cupy.chainer.org/en/stable/overview.html
https://github.com/cupy/cupy/blob/master/cupy/manipulation/tiling.py
(SOME) ANSWERS: I have found answers to some of these questions, which I'm posting here.
Fusing kernels is such a huge advance that I don't understand when I would ever not want to use @fuse. Isn't it always better? When is it a bad idea?
Answer: Fuse does not support many useful operations yet. For example, z = cupy.empty_like(x) does not work, nor does referring to globals. Hence it simply cannot be applied universally.
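A small sketch of the practical consequence (my own example): keep allocations outside the fused function and leave only the elementwise math inside.

import cupy

@cupy.fuse()
def scaled_sum(x, y):
    return 2.0 * x + y    # pure elementwise ops: fine inside @cupy.fuse

x = cupy.arange(8, dtype=cupy.float32)
y = cupy.empty_like(x)    # allocate OUTSIDE the fused function instead
y.fill(1.0)
print(scaled_sum(x, y))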
I'm wondering about its composability: will @fuse inline the functions it finds within the decorated function?
Answer: Looking at timings and NVVM markings, it looks like it does pull in subroutines and fuse them into the kernel, so dividing things into subroutines rather than monolithic code will work with fuse.
I see that a bug fix in the release notes says it can now handle calling other functions decorated with @fuse, but it does not say whether their kernels are fused or remain separate.
Answer: Looking at NVVM output, it appears they are joined. It's hard to say whether there is some residual overhead, but the timing doesn't show the significant overhead that would indicate two separate kernels. The key thing is that it now works. As of CuPy 4.1 you could not call a fused function from a fused function, as the return types were wrong, but since 5.1 you can. However, you do not need to decorate those functions; it works whether you do or not.
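A hedged sketch of what that composition looks like (my own example, assuming CuPy >= 5.1): an undecorated helper called from inside a fused function appears to be traced and fused into the same kernel.

import cupy

def helper(x, y):
    return x * y + 1.0    # plain Python function, no decorator needed

@cupy.fuse()
def outer(x, y):
    return helper(x, y) - y    # helper appears to be traced into this kernel

a = cupy.arange(8, dtype=cupy.float32)
b = cupy.full(8, 2.0, dtype=cupy.float32)
print(outer(a, b))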
Why isn't it documented?
Answer: It appears to have some bugs and some incomplete functionality. The code also warns that its API is subject to change.
However, this is basically a miracle function when it can be used, easily improving speed by an order of magnitude on small to medium-sized arrays, so it would be nice if even this alpha version were documented.
I'm running a dask graph that looks something like this:
dask.bag.from_delayed(...).pluck(FEATURE_NAME).map(map_func).map_partitions(part_func)
And I'm getting errors during the execution of part_func, which turns out to be receiving generators instead of the bag items that map_func returns.
This felt like a graph optimization issue; I did find lazify_task and figured it has something to do with the problem, as well as the reify graph nodes (for which I couldn't find any documentation).
While adding a values = list(values) line at the beginning of part_func seems to solve the issue at hand and gets my graph going, I feel like I may be missing something here about the internal implementation, optimization and/or approach towards building a graph.
Yes, your understanding is correct that partitions within a dask bag are generally finite generators rather than lists. This allows them to operate in less memory.
If you want to always interact with lists then you can, as you suggest, call list on the input, or else insert a map_partitions(list) call between your operations.
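A minimal sketch of that workaround (my own toy pipeline, not your graph): materialize each partition as a list at the top of the partition-level function.

import dask.bag as db

def part_func(values):
    values = list(values)    # partitions may arrive as generators
    return [v * 2 for v in values]

bag = db.from_sequence(range(10), npartitions=2)
result = bag.map(lambda x: x + 1).map_partitions(part_func).compute()
print(result)    # [2, 4, 6, ..., 20]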
Optimizations like lazify_task and reify are generally considered internal and can change at any time. I don't recommend building applications that depend on them. This is also partially why they have not been prioritized for documentation.
I am trying to register a Python function and its gradient as a TensorFlow operation.
I found many useful examples, e.g.:
Write Custom Python-Based Gradient Function for an Operation? (without C++ Implementation)
https://programtalk.com/python-examples/tensorflow.python.framework.function.Defun/
Nonetheless I would like to register attributes in the operation and use these attributes in the gradient definition by calling op.get_attr('attr_name').
Is this possible without going down to a C++ implementation?
Could you give me an example?
Unfortunately I don't believe it is possible to add attributes without using a C++ implementation of the operation. One feature that may help though is that you can define 'private' attributes by prepending an underscore to the start. I'm not sure if this is well documented or what the long-term guarantees are, but you can try setting '_my_attr_name' and you should be able to retrieve it later.
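A rough sketch of what I mean, relying on private TF 1.x internals that may change without notice (Operation._set_attr and the attr_value_pb2 protobuf usage here are my assumptions, not a documented API):

import tensorflow as tf
from tensorflow.core.framework import attr_value_pb2

x = tf.constant([1.0, 2.0, 3.0])
y = tf.square(x)

# Attach an underscore-prefixed ('private') attribute to the op and read it back.
y.op._set_attr('_my_attr_name', attr_value_pb2.AttrValue(s=b'hello'))
print(y.op.get_attr('_my_attr_name'))    # b'hello'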
I'm using scipy.optimize.brute(), but I noticed that it only uses one of my cores. One big advantage of a grid search is that all iterations of the solution algorithm are independent of each other.
Given that that's the case, why is brute() not implemented to run on multiple cores? If there is no good reason, is there a quick way to extend it or make it work, or does it make more sense to write the whole routine from scratch?
scipy.optimize.brute takes an arbitrary Python function. There is no guarantee this function is threadsafe. Even if it is, Python's global interpreter lock means that unless the function bypasses the GIL in C, it can't be run on more than one core anyway.
If you want to parallelize your brute-force search, you should write it yourself. You may have to write some Cython or C to get around the GIL.
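If your objective is picklable, a process pool is one way to hand-roll this in pure Python, since separate processes are not constrained by the GIL; a rough sketch with a made-up objective:

import itertools
import numpy as np
from multiprocessing import Pool

def objective(params):
    a, b = params
    return (a - 1.0) ** 2 + (b + 2.0) ** 2    # hypothetical objective

if __name__ == "__main__":
    # Build the full grid up front, then evaluate the points in parallel.
    grid = list(itertools.product(np.linspace(-2, 2, 41), np.linspace(-4, 0, 41)))
    with Pool() as pool:
        values = pool.map(objective, grid)
    print("best params:", grid[int(np.argmin(values))])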
Do you have scikit-learn installed? With a bit of refactoring you could use sklearn.grid_search.GridSearchCV, which supports multiprocessing via joblib.
You would need to wrap your local optimization function as an object that exposes the generic scikit-learn estimator interface, including a .score(...) method (or you could pass in a separate scoring function to the GridSearchCV constructor via the scoring= kwarg).
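A rough sketch of that wrapping, with a made-up objective and dummy data (note that in current scikit-learn the class lives in sklearn.model_selection rather than sklearn.grid_search):

import numpy as np
from sklearn.base import BaseEstimator
from sklearn.model_selection import GridSearchCV

def my_objective(a, b):
    return (a - 1.0) ** 2 + (b + 2.0) ** 2    # hypothetical objective to minimize

class ObjectiveWrapper(BaseEstimator):
    def __init__(self, a=0.0, b=0.0):
        self.a = a
        self.b = b

    def fit(self, X, y=None):
        return self    # nothing to fit; the grid point is fixed by the params

    def score(self, X, y=None):
        return -my_objective(self.a, self.b)    # GridSearchCV maximizes the score

X_dummy = np.zeros((4, 1))    # placeholder data; the objective ignores it
search = GridSearchCV(ObjectiveWrapper(),
                      {"a": np.linspace(-2, 2, 21), "b": np.linspace(-4, 0, 21)},
                      cv=2, n_jobs=-1)
search.fit(X_dummy)
print(search.best_params_)    # close to a=1, b=-2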