I want to parallelize the numpy.bincount function using the apply_ufunc API of xarray and the following code is what I've tried:
import numpy as np
import xarray as xr
da = xr.DataArray(np.random.rand(2, 16, 32),
                  dims=['time', 'y', 'x'],
                  coords={'time': np.array(['2019-04-18', '2019-04-19'],
                                           dtype='datetime64'),
                          'y': np.arange(16), 'x': np.arange(32)})
f = xr.DataArray(da.data.reshape((2,512)),dims=['time','idx'])
x = da.x.values
y = da.y.values
r = np.sqrt(x[np.newaxis,:]**2 + y[:,np.newaxis]**2)
nbins = 4
if x.max() > y.max():
    ri = np.linspace(0., y.max(), nbins)
else:
    ri = np.linspace(0., x.max(), nbins)
ridx = np.digitize(np.ravel(r), ri)
func = lambda a, b: np.bincount(a, weights=b)
xr.apply_ufunc(func, xr.DataArray(ridx,dims=['idx']), f)
but I get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-203-974a8f0a89e8> in <module>()
12
13 func = lambda a, b: np.bincount(a, weights=b)
---> 14 xr.apply_ufunc(func, xr.DataArray(ridx,dims=['idx']), f)
~/anaconda/envs/uptodate/lib/python3.6/site-packages/xarray/core/computation.py in apply_ufunc(func, *args, **kwargs)
979 signature=signature,
980 join=join,
--> 981 exclude_dims=exclude_dims)
982 elif any(isinstance(a, Variable) for a in args):
983 return variables_ufunc(*args)
~/anaconda/envs/uptodate/lib/python3.6/site-packages/xarray/core/computation.py in apply_dataarray_ufunc(func, *args, **kwargs)
208
209 data_vars = [getattr(a, 'variable', a) for a in args]
--> 210 result_var = func(*data_vars)
211
212 if signature.num_outputs > 1:
~/anaconda/envs/uptodate/lib/python3.6/site-packages/xarray/core/computation.py in apply_variable_ufunc(func, *args, **kwargs)
558 raise ValueError('unknown setting for dask array handling in '
559 'apply_ufunc: {}'.format(dask))
--> 560 result_data = func(*input_data)
561
562 if signature.num_outputs == 1:
<ipython-input-203-974a8f0a89e8> in <lambda>(a, b)
11 ridx = np.digitize(np.ravel(r), ri)
12
---> 13 func = lambda a, b: np.bincount(a, weights=b)
14 xr.apply_ufunc(func, xr.DataArray(ridx,dims=['idx']), f)
ValueError: object too deep for desired array
I am kind of lost as to where this error comes from; any help would be greatly appreciated.
The immediate problem is that np.bincount only accepts 1D arrays, so passing a 2D weights array raises the "object too deep for desired array" error. np.apply_along_axis can handle the iteration over 1D slices for you, but it only iterates over 1D slices of the first argument to the applied function and not any of the others. If I understand your use-case correctly, you actually want to iterate over 1D slices of the weights (weights in the np.bincount signature), not the integer array (x in the np.bincount signature).
One way to work around this is to write a thin wrapper function around np.bincount that simply switches the order of the arguments:
def wrapped_bincount(weights, x):
    return np.bincount(x, weights=weights)
We can then use np.apply_along_axis with this function for your use-case:
def apply_bincount_along_axis(x, weights, axis=-1):
    return np.apply_along_axis(wrapped_bincount, axis, weights, x)
Finally, we can wrap this new function for use with xarray via apply_ufunc, noting that it can be automatically parallelized with dask. Also note that we do not need to provide an axis argument, because xarray will automatically move the input core dimension dim to the last position in the weights array before applying the function:
def xbincount(x, weights):
    if len(x.dims) != 1:
        raise ValueError('x must be one-dimensional')
    dim, = x.dims
    nbins = int(x.max()) + 1
    return xr.apply_ufunc(apply_bincount_along_axis, x, weights,
                          input_core_dims=[[dim], [dim]],
                          output_core_dims=[['bin']], dask='parallelized',
                          output_dtypes=[float],  # plain float; np.float is deprecated
                          output_sizes={'bin': nbins})
Applying this function to your example (wrapping ridx in a DataArray first, since xbincount expects one) then looks like:
ridx_da = xr.DataArray(ridx, dims=['idx'])
xbincount(ridx_da, f)
<xarray.DataArray (time: 2, bin: 5)>
array([[ 0. , 7.934821, 34.066872, 51.118065, 152.769169],
[ 0. , 11.692989, 33.262936, 44.993856, 157.642972]])
Dimensions without coordinates: time, bin
As desired it also works with dask arrays:
xbincount(ridx_da, f.chunk({'time': 1}))
<xarray.DataArray (time: 2, bin: 5)>
dask.array<shape=(2, 5), dtype=float64, chunksize=(1, 5)>
Dimensions without coordinates: time, bin
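As a quick sanity check, the result can be compared against plain np.bincount applied to each time step separately (a small sketch reusing ridx, ridx_da and f from above):
expected = np.stack([np.bincount(ridx, weights=f.values[t])
                     for t in range(f.sizes['time'])])
print(np.allclose(xbincount(ridx_da, f).values, expected))  # expected output: True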
I know this is a bit late, but here is an alternative for computing bincount with multiple sets of weights. Please refer to @spencerkclark's answer for information about parallelizing the function.
A warning before using this: the function bincount_2d_SLOW below is only meant to demonstrate the idea. Do not use it directly in your code; it is very slow!
I will explain at the end why the idea behind the function can greatly speed up your code relative to the solution posted by @spencerkclark, but only if you are computing the bincount multiple times using the same set of groups.
The idea of the code is that while we can't use np.bincount with 2d weights, we can convert 2d weights into 1d data that is directly usable by np.bincount.
The way we do this is:
1. Repeat the grouping column along the 2nd dimension of the weights, so the grouping and the weights have the same shape.
2. Offset the grouping values along the 2nd dimension, so that each set of weights gets its own unique grouping values. This way we group along each set of weights separately.
3. Flatten the data so it is 1D. Now we can run np.bincount.
4. Finally, reshape the result.
def bincount_2d_SLOW(x, weights=None):
    if weights is None:
        return np.bincount(x)
    if len(weights.shape) == 1:
        return np.bincount(x, weights=weights)
    n_groups = x.max() + 1
    n_dims = weights.shape[1]
    # Expand x to the same number of dimensions as weights
    repeated_x = np.tile(x, (n_dims, 1)).T
    # Offset the group ids in each column so bincount works separately along each dimension
    repeated_x = repeated_x + n_groups * np.arange(n_dims)
    # Flatten
    repeated_x = repeated_x.flatten()
    # Compute bincount
    return np.bincount(repeated_x, weights=weights.flatten()).reshape((n_dims, n_groups)).T
Here is why the idea of this function can speed up your code: if you are computing the bincount many times using the same set of groups, you can pre-compute the tiled-and-flattened groupings, and suddenly the code is incredibly fast. Here is an alternative function (I also added an option to specify n_groups, which can speed up the code even more):
def bincount_2d(x, weights=None, n_groups=None):
    if weights is None:
        return np.bincount(x)
    if len(weights.shape) == 1:
        return np.bincount(x, weights=weights)
    n_dims = weights.shape[1]
    if n_groups is None:
        n_groups = (x.max() + 1) // n_dims
    return np.bincount(x, weights=weights.flatten()).reshape((n_dims, n_groups)).T
In some testing I did, bincount_2d_SLOW is about 1/3 slower than apply_bincount_along_axis. But bincount_2d was about 2x faster than apply_bincount_along_axis when I didn't specify n_groups, and when I did specify n_groups, it was about 3x faster.
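For concreteness, here is a rough sketch of how the pre-computation might look when the same grouping is reused for many bincount calls (raw_groups and tiled_groups are just illustrative names, not part of the functions above):
raw_groups = np.repeat(np.arange(5), 200)  # 1000 samples split into 5 groups
weights = np.random.rand(1000, 3)          # three sets of weights
n_groups = raw_groups.max() + 1
n_dims = weights.shape[1]
# pre-compute the tiled-and-flattened grouping once ...
tiled_groups = (np.tile(raw_groups, (n_dims, 1)).T
                + n_groups * np.arange(n_dims)).flatten()
# ... then reuse it for every new set of weights that shares the same groups
counts = bincount_2d(tiled_groups, weights=weights, n_groups=n_groups)
print(counts.shape)  # (n_groups, n_dims), here (5, 3)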
I was reading about attention and came across this equation:
import einops
from fancy_einsum import einsum
import torch
x = torch.rand((200, 10, 768))
y = torch.rand((20, 768, 64))
res = einsum("batch query_pos d_model, n_heads d_model d_head -> batch query_pos n_heads d_head", x, y)
And I am not able to understand the underlying operations that produce the result res.
I thought it might be matmul and tried this:
import torch
x_ = x.unsqueeze(dim = 2).unsqueeze(dim = 2)
y_ = torch.broadcast_to(y, (1, 1, 20, 768, 64))
res2 = x_ @ y_
res2 = res2.squeeze(dim = -2)
(res == res2).all() # Prints False
But that does not seem to be right.
Any help regarding this is greatly appreciated.
Whenever you use einsum, it is best to think about the meaning of the dimensions. Basically we perform a multiplication between the two inputs in this case. The signature passed to einsum shows which dimensions will be preserved and which ones will be "summed away". I simplified the signature to single letters here:
res = einsum("b q m, n m h -> b q n h", x, y)
We can read from this that both x and y have three dimensions. Furthermore, both have a dimension called m, and it does not appear in the output, so we can conclude that it gets "summed away". For each entry of the output we then have the following formula (for simplicity I reused the dimension names as indices): for every b, q, n, h,
res[b, q, n, h] = \sum_{m} x[b, q, m] \cdot y[n, m, h]
Doing this with anything other than einsum is usually more cumbersome. First we need to reorder and unsqueeze the dimensions so that they are compatible for broadcasting, after which we can do the following (shapes annotated in the comment):
# (b, q, m, 1, 1) * (m, n, h) -> (b, q, m, n, h)
product = x[:, :, :, None, None] * y.permute([1, 0, 2])
Due to the broadcasting rules, the second (y-) term will implicitly get the required leading dummy dimensions.
Then we can "sum away" the dimension m:
res = product.sum(dim=2) # (b,q,n,h)
So you can interpret that as a matrix multiplication if you want, or also just a scalar product, but of course with many "batch"-dimensions.
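As a side note, the matmul attempt in the question appears to be equivalent already; the exact equality check is just too strict for floating-point results computed in a different order. Comparing with torch.allclose instead should succeed (a small sketch reusing x, y and res from the question):
x_ = x.unsqueeze(dim=2).unsqueeze(dim=2)         # (batch, query_pos, 1, 1, d_model)
y_ = torch.broadcast_to(y, (1, 1, 20, 768, 64))  # (1, 1, n_heads, d_model, d_head)
res2 = (x_ @ y_).squeeze(dim=-2)                 # (batch, query_pos, n_heads, d_head)
print(torch.allclose(res, res2, atol=1e-5))      # likely True, unlike (res == res2).all()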
Goal
I would like to parallelize a loop with dask that uses a library function inside the loop. This function, mhw.detect(), calculates some statistics on a slice of a numpy array. None of the slices of the array depend on the other slices, so I was hoping that dask could be used to compute them in parallel and store them all in the same output array.
Code
The flow of the code I am working on is:
import numpy as np
import marineHeatWaves as mhw
from dask import delayed
# Create fake input data
lat_size, long_size = 100, 100
data = np.random.random_integers(0, 30, size=(10_000, long_size, lat_size)) # size = (time, longitude, latitude)
time = np.arange(730_000, 740_000) # time in ordinal days
# Initialize an empty array to hold the output
output_array = np.empty(data.shape)
# loop through each pixel in the data array
for idx_lat in range(lat_size):
    for idx_long in range(long_size):
        # Extract a slice of data
        data_slice = data[:, idx_lat, idx_long]
        # Use the library function to calculate the stats for the pixel
        # `library_output` is a dictionary that has a numpy array inside it
        _, library_output = delayed(mhw.detect)(time, data_slice)
        # Update the output array with the calculated values from the library
        output_array[:, idx_lat, idx_long] = library_output['seas']
Previous efforts
When I run this code I get the error TypeError: Delayed objects of unspecified length are not iterable. Another Stack Overflow post discusses this issue and resolves it by converting the output of the delayed function to a delayed object. However, because I didn't create the output object myself, I am not sure if I can convert it to a delayed object.
I've also tried wrapping the last line in da.from_delayed(), as in output_array[:, idx_lat, idx_long] = da.from_delayed(library_output['seas']), and initializing output_array with da.empty(data.shape). I get the same error, though, since I think the code doesn't make it past the line with the library function, delayed(mhw.detect)(time, data_slice).
Is it possible to parallelize this? Is this approach of asking dask to compute all the slices in parallel and put them together in an output array even a reasonable approach?
Full Traceback
TypeError Traceback (most recent call last)
/home/rwegener/mhw-ocetrac-census/notebooks/ejoliver_subset_MUR.ipynb Cell 44' in <cell line: 10>()
13 data_slice = data[:, idx_lat, idx_long]
14 # Use the library function to calculate the stats for the pixel
---> 15 _, point_clim = delayed(mhw.detect)(time_ordinal, data_slice)
16 # Update the output array with the calculated values from the library
17 output_array[:, idx_lat, idx_long] = point_clim['seas']
File ~/.conda/envs/dask/lib/python3.10/site-packages/dask/delayed.py:581, in Delayed.__iter__(self)
579 def __iter__(self):
580 if self._length is None:
--> 581 raise TypeError("Delayed objects of unspecified length are not iterable")
582 for i in range(self._length):
583 yield self[i]
TypeError: Delayed objects of unspecified length are not iterable
Update
Using .apply_along_axis() as suggested:
# Create fake input data
lat_size, long_size = 100, 100
data = np.random.randint(0, 30, size=(10_000, long_size, lat_size)) # size = (time, longitude, latitude)
data = dask.array.from_array(data, chunks=(-1, 100, 100))
time = np.arange(730_000, 740_000) # time in ordinal days
# Initialize an empty array to hold the output
output_array = np.empty(data.shape)
# define a wrapper to rearrange arguments
def func1d(arr, time, shape=(10000,)):
    print(arr.shape)
    return mhw.detect(time, arr)
res = dask.array.apply_along_axis(func1d, 0, data, time=time)
With the output:
(1,)
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
/homes/metogra/rwegener/mhw-ocetrac-census/notebooks/ejoliver_subset_MUR.ipynb Cell 48' in <cell line: 15>()
12 print(arr.shape)
13 return mhw.detect(time, arr)
---> 15 res = dask.array.apply_along_axis(func1d, 0, data, time=time)
File ~/.conda/envs/dask/lib/python3.10/site-packages/dask/array/routines.py:508, in apply_along_axis(func1d, axis, arr, dtype, shape, *args, **kwargs)
506 if shape is None or dtype is None:
507 test_data = np.ones((1,), dtype=arr.dtype)
--> 508 test_result = np.array(func1d(test_data, *args, **kwargs))
509 if shape is None:
510 shape = test_result.shape
/homes/metogra/rwegener/mhw-ocetrac-census/notebooks/ejoliver_subset_MUR.ipynb Cell 48' in func1d(arr, time, shape)
11 def func1d(arr, time, shape=(10000,)):
12 print(arr.shape)
---> 13 return mhw.detect(time, arr)
File ~/.conda/envs/dask/lib/python3.10/site-packages/marineHeatWaves-0.28-py3.10.egg/marineHeatWaves.py:280, in detect(t, temp, climatologyPeriod, pctile, windowHalfWidth, smoothPercentile, smoothPercentileWidth, minDuration, joinAcrossGaps, maxGap, maxPadLength, coldSpells, alternateClimatology, Ly)
278 tt = tt[tt>=0] # Reject indices "before" the first element
279 tt = tt[tt<TClim] # Reject indices "after" the last element
--> 280 thresh_climYear[d-1] = np.nanpercentile(tempClim[tt.astype(int)], pctile)
281 seas_climYear[d-1] = np.nanmean(tempClim[tt.astype(int)])
282 # Special case for Feb 29
IndexError: index 115 is out of bounds for axis 0 with size 1
Rather than using delayed, this seems like a good case for dask.array.
You can create the dask array by partitioning the input numpy array:
data = dask.array.from_array(data, chunks=(-1, 10, 10))
Now you can call mhw.detect using dask.array.map_blocks alongside np.apply_along_axis within each block:
# define a wrapper to rearrange arguments
def func1d(arr, time):
    return mhw.detect(time, arr)

def block_func(block, **kwargs):
    return np.apply_along_axis(func1d, 0, block, **kwargs)
res = data.map_blocks(block_func, meta=data, time=time)
res = res.compute()
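One detail to watch, judging from how mhw.detect is used in the question: it returns two objects, and the values you want live under the 'seas' key of the second one, so the wrapper may need to unpack that so each 1D input slice maps back to a 1D numpy array (a sketch assuming the same call signature as in the question):
def func1d(arr, time):
    # mhw.detect returns (mhws, clim); only clim['seas'], a 1D array the same
    # length as `time`, is needed here
    _, clim = mhw.detect(time, arr)
    return clim['seas']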
The map_blocks answer above works great! Additionally, apply_along_axis() was suggested and discussed in comments. I was able to get that method to work, but in order for it to function properly you need to use both the dtype and shape inputs to da.apply_along_axis(). If these aren't supplied the function can't figure out the shape of the data it should pass as an argument.
So, another solution:
import dask.array as da
# Create fake input data
lat_size, long_size = 100, 100
data = da.random.random_integers(0, 30, size=(1_000, long_size, lat_size), chunks=(-1, 10, 10)) # size = (time, longitude, latitude)
time = np.arange(730_000, 731_000) # time in ordinal days
# define a wrapper to rearrange arguments
def func1d(arr, time):
    return mhw.detect(time, arr)
result = da.apply_along_axis(func1d, 0, data, time=time, dtype=data.dtype, shape=(1000,))
result.compute()
I need help computing a mathematical expression using only numpy operations. The expression I want to compute is the following (in the notation of the loop below):
s = \sum_{i=1}^{N} \sum_{j=1}^{N} \prod_{k=1}^{S} f(x_{i,k}, x_{j,k})
where x is an (N, S) array and f is a numpy function that can work with broadcastable arrays (e.g. np.maximum, np.sum, np.prod, ...). If that is of importance, in my case f is a symmetric function.
So far my code looks like this:
s = 0
for xp in x:  # Loop over N...
    s += np.sum(np.prod(f(xp, x), axis=1))
This still has a loop over N that I'd like to get rid of.
Typically N is "large" (around 30k) but S is small (less than 20), so if anyone can find a trick that only loops over S, this would still be a major improvement.
I believe the problem would be easy if I could replicate the array N times, but an array of size (32768, 32768, 20) requires about 150 GB of RAM that I don't have. A (32768, 32768) array does fit in memory, though I would appreciate a solution that does not allocate such an array.
Maybe a use of np.einsum with well-chosen arrays is possible?
Thanks for your replies; if any information is missing, let me know!
Have a nice day!
Edit 1 :
The forms of f I'm interested in include (for now): f(x, y) = |x - y|, f(x, y) = |x - y|^2, and f(x, y) = 2 - max(x, y).
Your loop is already quite efficient. Some possible alternatives are:
Method-1 (looping over S)
import numpy as np
def f(x, y):
    return np.abs(x - y)

N = 200
S = 20
x_data = np.random.rand(N, S)  # (i, s)
y_data = np.random.rand(N, S)  # (i', s)

product = f(np.broadcast_to(x_data[:, 0][..., None], (N, N)),
            np.broadcast_to(y_data[:, 0][..., None], (N, N)).T)
for i in range(1, S):
    product *= f(np.broadcast_to(x_data[:, i][..., None], (N, N)),
                 np.broadcast_to(y_data[:, i][..., None], (N, N)).T)
total = np.sum(product)
Method-2 (dispatching S number of blocks)
import numpy as np
def f(x, y):
    x1 = np.broadcast_to(x[:, None, ...], (x.shape[0], y.shape[0], x.shape[1]))
    y1 = np.broadcast_to(y[None, ...], (x.shape[0], y.shape[0], x.shape[1]))
    return np.abs(x1 - y1)

def f1(x1, y1):
    return np.abs(x1 - y1)

N = 5000
S = 20
x_data = np.random.rand(N, S)  # (i, s)
y_data = np.random.rand(N, S)  # (i', s)

def fun_new(x_data1, y_data1):
    s = 0
    pp = np.split(x_data1, S, axis=0)
    for xp in pp:
        s += np.sum(np.prod(f(xp, y_data1), axis=2))
    return s

def fun_op(x_data1, y_data1):
    s = 0
    for xp in x_data1:  # Loop over N...
        s += np.sum(np.prod(f1(xp, y_data1), axis=1))
    return s

fun_new(x_data, y_data)
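As a quick sanity check, the block-wise version should agree with the original loop up to floating-point round-off:
print(np.isclose(fun_new(x_data, y_data), fun_op(x_data, y_data)))  # expected: True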
I need to write a function that takes two 2D NumPy matrices as arguments and returns one 2D NumPy array: the product of the two input matrices.
This function will perform the operation Z=X×Y, where X and Y are the function arguments. Remember how to perform matrix-matrix multiplication:
First, you need to make sure the matrix dimensions line up. For computing X×Y, this means the number of columns of X (first matrix) should match the number of rows of Y (second matrix). These are referred to as the "inner dimensions"--matrix dimensions are usually cited as "rows by columns", so the second dimension of the first operand X is on the "inside" of the operation; same with the first dimension of the second operand, Y. If the operation were instead Y×X, you would need to make sure that the number of columns of Y matches the number of rows of X. If these numbers don't match, you should return None.
Second, you'll need to create your output matrix, Z. The dimensions of this matrix will be the "outer dimensions" of the two operands: if we're computing X×Y, then Z's dimensions will have the same number of rows as X (the first matrix), and the same number of columns as Y (the second matrix).
Third, you'll need to compute pairwise dot products. If the operation is X×Y, then these dot products will be between the ith row of X with the jth column of Y. This resulting dot product will then go in Z[i][j]. So first, you'll find the dot product of row 0 of X with column 0 of Y, and put that in Z[0][0]. Then you'll find the dot product of row 0 of X with column 1 of Y, and put that in Z[0][1]. And so on, until all rows and columns of X and Y have been dot-product-ed with each other.
You can use numpy.array, but no functions associated with computing matrix products (and definitely not the @ operator).
You CAN use numpy.dot, but ONLY to multiply vectors, since it can also be used to multiply full matrices.
import numpy as np

def mm_multiply(X, Y):
    X = [[1,2], [2,1]]
    Y = [[2,3], [3,3]]
    X = np.array(X)
    Y = np.array(Y)
    [I, J] = X.shape
    [K, H] = Y.shape
    Z = np.zeros((I, H))
    for i in range(I):
        for j in range(H):
            for k in range(K):
                Z[i,j] += X[i,k]*Y[k,j]
    print(Z)
One error my submission gives is that mv_multiply is not defined, but that comes from the previous problem and should go away once the code is correct. The other error I'm getting is below. Any help will be much appreciated! Thanks in advance!
TypeError Traceback (most recent call last)
<ipython-input-38-1b8bf5d47d82> in <module>
4 A = np.random.random((48, 683))
5 B = np.random.random((683, 58))
----> 6 np.testing.assert_allclose(mm_multiply(A, B), A @ B)
/opt/conda/lib/python3.7/site-packages/numpy/testing/_private/utils.py in assert_allclose(actual, desired, rtol, atol, equal_nan, err_msg, verbose)
1491 header = 'Not equal to tolerance rtol=%g, atol=%g' % (rtol, atol)
1492 assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
-> 1493 verbose=verbose, header=header, equal_nan=equal_nan)
1494
1495
/opt/conda/lib/python3.7/site-packages/numpy/testing/_private/utils.py in assert_array_compare(comparison, x, y, err_msg, verbose, header, precision, equal_nan, equal_inf)
779 return
780
--> 781 val = comparison(x, y)
782
783 if isinstance(val, bool):
/opt/conda/lib/python3.7/site-packages/numpy/testing/_private/utils.py in compare(x, y)
1486 def compare(x, y):
1487 return np.core.numeric.isclose(x, y, rtol=rtol, atol=atol,
-> 1488 equal_nan=equal_nan)
1489
1490 actual, desired = np.asanyarray(actual), np.asanyarray(desired)
/opt/conda/lib/python3.7/site-packages/numpy/core/numeric.py in isclose(a, b, rtol, atol, equal_nan)
2519 y = array(y, dtype=dt, copy=False, subok=True)
2520
-> 2521 xfin = isfinite(x)
2522 yfin = isfinite(y)
2523 if all(xfin) and all(yfin):
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
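Judging from the code and the traceback, the TypeError most likely happens because mm_multiply overwrites its arguments with hardcoded 2x2 lists and prints Z instead of returning it, so the test receives None and np.isfinite fails on it. A minimal corrected sketch, keeping the triple-loop approach described in the assignment:
import numpy as np

def mm_multiply(X, Y):
    X = np.array(X)
    Y = np.array(Y)
    # inner dimensions must match; otherwise return None, as the assignment asks
    if X.shape[1] != Y.shape[0]:
        return None
    I, K = X.shape
    _, J = Y.shape
    Z = np.zeros((I, J))
    for i in range(I):
        for j in range(J):
            for k in range(K):
                Z[i, j] += X[i, k] * Y[k, j]
    return Z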
I need to use this loss function for a CNN. The list_distances and list_residual arguments are output tensors from hidden layers that are needed to compute the loss, but when I execute the code it gives me back this error:
TypeError: Tensor objects are only iterable when eager execution is enabled. To iterate over this tensor use tf.map_fn.
Is there another way to iterate over tensors without using the construct for x in X, converting them to numpy arrays, or using the backend functions of Keras?
# (assumed, not shown in the original snippet: numpy as np, an FFT module as fp,
#  scipy.stats as scistats, and the Keras backend as K)
def DBL(y_true, y_pred, list_distances, list_residual, l=0.65):
    prob_dist = []
    Li = []
    # mean of the images' power spectrum
    S = np.sum([np.power(np.abs(fp.fft2(residual)), 2)
                for residual in list_residual], axis=0) / K.shape(list_residual)[0]
    # log-ratio between the geometric and arithmetic mean of S
    R = np.log10(scistats.gmean(S) / np.mean(S))
    for c_i, dis_i in enumerate(list_distances):
        prob_dist.append([
            np.exp(-dis_i) / sum([np.exp(-dis_j) if c_j != c_i else 0
                                  for c_j, dis_j in enumerate(list_distances)])
        ])
    for count, _ in enumerate(prob_dist):
        Li.append(
            -1 * np.log10(sum([p_j for c_j, p_j in enumerate(prob_dist[count])
                               if y_pred[count] == 1 and count != c_j])))
    L0 = np.sum(Li)
    return L0 - l * R
You need to define a custom function to feed into tf.map_fn() - see the TensorFlow docs.
Mapper functions map (funnily enough) the existing object (tensor) into a new one using a function you define.
They apply the custom function to every element in the object, without all the mucking about with for loops.
For instance (non tested code, may not run - on my phone atm):
def custom(a):
    b = a + 1
    return b

original = np.array([2, 2, 2])
mapped = tf.map_fn(custom, original)
# mapped == [3, 3, 3] ... hopefully
Tensorflow examples all use lambda functions, so you might need to define your functions like that if the above doesn’t work. Tensorflow example:
elems = np.array([1, 2, 3, 4, 5, 6])
squares = map_fn(lambda x: x * x, elems)
# squares == [1, 4, 9, 16, 25, 36]
Edit:
As an aside, map functions are much easier to parallelise than for loops - each element of an object is assumed to be processed independently of the others - so you can see a performance uplift by using them.
Edit 2:
For the "reduce sum, but not on this index" part, I would heavily recommend you start looking back at matrix operations... As mentioned, map functions work element-wise - they are not aware of other elements. A reduce function is what you want, but even they are finiky when you try and do "not this index" sums... also tensorflow is built around matrix ops... Not the MapReduce paradigm.
Something along these lines might help:
sess = tf.Session()
var = np.ones([3, 3, 3]) * 5
zero_identity = tf.linalg.set_diag(
var, tf.zeros(var.shape[0:-1], dtype=tf.float64)
)
exp_one = tf.exp(var)
exp_two = tf.exp(zero_identity)
summed = tf.reduce_sum(exp_two, axis = [0,1])
final = exp_one / summed
print("input matrix: \n", var, "\n")
print("Identities of the matrix to Zero: \n", zero_identity.eval(session=sess), "\n")
print("Exponential Values numerator: \n", exp_one.eval(session=sess), "\n")
print("Exponential Values to Sum: \n", exp_two.eval(session=sess), "\n")
print("Summed values for zero identity matrix\n ... along axis [0,1]: \n", summed.eval(session=sess), "\n")
print("Output:\n", final.eval(session=sess), "\n")