Is this a Python/Numpy bug or a subtle gotcha?

Consider the following two implementations of the same piece of code. I would have thought they were identical, but they are not.
Is this a Python/Numpy bug or a subtle gotcha? If the latter, what rule would make it obvious why it does not work as expected?
I was working with multiple arrays of data and had to process each array item by item, with each array manipulated by a table depending on its metadata.
In the real-world example 'n' stands for multiple factors and offsets, but the following code still demonstrates the issue: I was getting the wrong result in all but one case.
import numpy as np
# Change the following line to True to show different behaviour
NEEDS_BUGS = False # Changeme
# Create some data
data = np.linspace(0, 1, 10)
print(data)
# Create an array of vector functions each of which does a different operation on a set of data
vfuncd = dict()
# Two implementations
if NEEDS_BUGS:
    # Let's do this in a loop because we like loops - However WARNING this does not work!!
    for n in range(10):
        vfuncd[n] = np.vectorize(lambda x: x * n)
else:
    # Unwrap the loop - NOTE: Spoiler - this works
    vfuncd[0] = np.vectorize(lambda x: x * 0)
    vfuncd[1] = np.vectorize(lambda x: x * 1)
    vfuncd[2] = np.vectorize(lambda x: x * 2)
    vfuncd[3] = np.vectorize(lambda x: x * 3)
    vfuncd[4] = np.vectorize(lambda x: x * 4)
    vfuncd[5] = np.vectorize(lambda x: x * 5)
    vfuncd[6] = np.vectorize(lambda x: x * 6)
    vfuncd[7] = np.vectorize(lambda x: x * 7)
    vfuncd[8] = np.vectorize(lambda x: x * 8)
    vfuncd[9] = np.vectorize(lambda x: x * 9)
# Prove we have multiple different vectorised functions
for k, vfunc in vfuncd.items():
    print(k, vfunc)
# Do the work
res = {k: vfuncd[k](data) for k in vfuncd.keys()}
# Show the result
for k, r in res.items():
    print(k, r)

I don't know exactly what you're trying to achieve, or whether it's a bad idea (in terms of np.vectorize), but the issue you're facing comes from the way Python creates closures. Quoting from an answer to the linked question:
Scoping in Python is lexical. A closure will always
remember the name and scope of the variable, not the object it's
pointing to. Since all the functions in your example are created in
the same scope and use the same variable name, they always refer to
the same variable.
In other words, when you make that closure over n, you're not actually closing over the value of n, just the name. So when n changes, the value seen by your closure also changes. This is quite unexpected to me, but others find it natural.
Here is one fix using partial:
from functools import partial
.
.
.
def func(x, n):
    return x * n
for n in range(10):
    vfuncd[n] = np.vectorize(partial(func, n=n))
Or another using a factory method:
def func_factory(n):
    return lambda x: x * n
for n in range(10):
    vfuncd[n] = np.vectorize(func_factory(n))
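The late-binding behaviour described above can be seen without NumPy at all. The following is a minimal sketch using plain lambdas; the names funcs, after_loop, after_rebind, fixed and the factory parameter m are made up for this illustration and are not part of the original code.

```python
# Every lambda created in the loop refers to the *name* n, not to the
# value n had when the lambda was created.
funcs = {}
for n in range(3):
    funcs[n] = lambda x: x * n   # n is looked up when the lambda is called

after_loop = [funcs[k](10) for k in range(3)]    # n is 2 at this point

n = 5                                            # rebind the shared name
after_rebind = [funcs[k](10) for k in range(3)]  # all lambdas now see 5

# A factory gives each lambda its own scope, so each one keeps its own m.
def func_factory(m):
    return lambda x: x * m

fixed = {k: func_factory(k) for k in range(3)}
fixed_results = [fixed[k](10) for k in range(3)]

print(after_loop)     # [20, 20, 20]
print(after_rebind)   # [50, 50, 50]
print(fixed_results)  # [0, 10, 20]
```

Rebinding n after the loop changes what every loop-created lambda returns, while the factory-built functions are unaffected.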

It seems that the Python variable n itself, rather than its current value, is bound to the vectorized expression:
for n in range(10):
    vfuncd[n] = np.vectorize(lambda x: x * n)
Passing n as a default argument fixes it, because the default is evaluated when each lambda is created and so gives every function its own object to bind to:
for n in range(10):
    vfuncd[n] = np.vectorize(lambda x, n=n: x * n)
The original form also has implications for performance, as I assume the value of the Python variable has to be fetched again on every call.

In [13]: data = np.linspace(0,1,11)
Since the data array can be multiplied with a simple:
In [14]: data*3
Out[14]: array([0. , 0.3, 0.6, 0.9, 1.2, 1.5, 1.8, 2.1, 2.4, 2.7, 3. ])
we don't need the complication of np.vectorize to see the closure issue. A simple lambda is enough.
In [15]: vfuncd = {}
    ...: for n in range(3):
    ...:     vfuncd[n] = lambda x:x*n
    ...:
In [16]: vfuncd
Out[16]:
{0: <function __main__.<lambda>(x)>,
1: <function __main__.<lambda>(x)>,
2: <function __main__.<lambda>(x)>}
In [17]: {k:v(data) for k,v in vfuncd.items()}
Out[17]:
{0: array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ]),
1: array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ]),
2: array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ])}
We won't get the closure problem if we use a proper numpy "vectorization":
In [18]: data * np.arange(3)[:,None]
Out[18]:
array([[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ],
[0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ]])
Or a simple iteration if we need a dictionary:
In [20]: {k:data*k for k in range(3)}
Out[20]:
{0: array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
1: array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),
2: array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ])}
np.vectorize has a speed disclaimer. But it is justified where the function only takes scalar inputs, and we want the flexibility of numpy broadcasting - i.e. for 2 or more arguments.
Creating multiple vectorize is clearly an 'anti-pattern'. I'd rather see one vectorize with the appropriate arguments:
In [25]: f = np.vectorize(lambda x,n: x*n)
In [26]: {n: f(data,n) for n in range(3)}
Out[26]:
{0: array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
1: array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),
2: array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ])}
That f can also produce the array Out[18] (but is slower):
In [27]: f(data, np.arange(3)[:,None])
Out[27]:
array([[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ],
[0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ]])

Related

How to mask row in Tensorflow without for loop

I want to create a custom Layer for a Tensorflow model but the logic I have uses a for loop, which Tensorflow doesn't like. How can I modify my code to remove the for loop but still achieve the same result?
class CustomMask(tf.keras.layers.Layer):
    def call(self, inputs):
        mask = tf.where(inputs[:, 0] < 0.5, 1, 0)
        for i,m in enumerate(mask):
            if m:
                inputs = inputs[i, 1:].assign(tf.zeros(4, dtype=tf.float32))
            else:
                first = tf.where(inputs[:, 1] >= 0.5, 0, 1)
                assign = tf.multiply(tf.cast(first, tf.float32), inputs[:, 2])
                inputs = inputs[:, 2].assign(assign)
                third = tf.where(inputs[:, 1] >= 0.5, 1, 0)
                assign = tf.multiply(tf.cast(third, tf.float32), inputs[:, 1])
                inputs = inputs[:, 1].assign(assign)
        return inputs
Example input Tensor:
<tf.Variable 'Variable:0' shape=(3, 5) dtype=float32, numpy=
array([[0.8, 0.7, 0.2, 0.6, 0.9],
[0.8, 0.4, 0.8, 0.3, 0.7],
[0.3, 0.2, 0.4, 0.3, 0.8]], dtype=float32)>
Corresponding output:
<tf.Variable 'UnreadVariable' shape=(3, 5) dtype=float32, numpy=
array([[0.8, 0.7, 0. , 0.6, 0.9],
[0.8, 0. , 0.8, 0.3, 0.7],
[0.3, 0. , 0. , 0. , 0. ]], dtype=float32)>
EDIT:
The layer should take an array of shape (batch_size, 5). If the first value of a row is less than 0.5, set the rest of the row to 0. Otherwise, if the 2nd element is above 0.5, set the 3rd element to 0, and if the 3rd element is greater than 0.5, set the 2nd element to 0.

Here is a version without any for loop; ask in the comments if it doesn't solve your issue (tested in Colab):
import tensorflow as tf
mask1 = tf.convert_to_tensor([0.0,1.0,1.0,1.0,1.0])
mask2 = tf.convert_to_tensor([0.0,0.0,1.0,0.0,0.0])
mask3 = tf.convert_to_tensor([0.0,1.0,0.0,0.0,0.0])
def masking(x):
    mask = tf.ones(x.shape, tf.float32)
    cond1 = tf.cast(x[0] < 0.5, tf.float32)
    x = tf.multiply(x, tf.subtract(mask, tf.multiply(mask1, cond1)))
    cond2 = tf.cast(x[1] > 0.5, tf.float32)
    x = tf.multiply(x, tf.subtract(mask, tf.multiply(mask2, cond2)))
    cond3 = tf.cast(x[2] > 0.5, tf.float32)
    x = tf.multiply(x, tf.subtract(mask, tf.multiply(mask3, cond3)))
    return x
inputs = tf.convert_to_tensor([[0.8, 0.7, 0.2, 0.6, 0.9],
[0.8, 0.4, 0.8, 0.3, 0.7],
[0.3, 0.2, 0.4, 0.3, 0.8]])
res = tf.vectorized_map(masking, inputs)
print (res)
tf.Tensor(
[[0.8 0.7 0. 0.6 0.9]
[0.8 0. 0.8 0.3 0.7]
[0.3 0. 0. 0. 0. ]], shape=(3, 5), dtype=float32)
I timed it with
%timeit tf.map_fn(masking, inputs)
%timeit tf.vectorized_map(masking, inputs)
and tf.vectorized_map(masking, inputs) gets faster as the batch size increases.
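For comparison, the same three rules can also be expressed as boolean masks over whole columns. The sketch below uses NumPy rather than TensorFlow, so it only illustrates the logic and is not a drop-in Keras layer; the function name mask_rows is made up for this example. It applies the rules in the same order as masking above, so each condition sees the result of the previous step.

```python
import numpy as np

def mask_rows(x):
    """Apply the three masking rules to every row of a (batch, 5) array."""
    out = np.array(x, dtype=np.float32)  # work on a copy
    # Rule 1: first value below 0.5 -> zero the rest of the row.
    out[out[:, 0] < 0.5, 1:] = 0.0
    # Rule 2: second value above 0.5 -> zero the third value.
    out[out[:, 1] > 0.5, 2] = 0.0
    # Rule 3: third value above 0.5 -> zero the second value.
    out[out[:, 2] > 0.5, 1] = 0.0
    return out

inputs = np.array([[0.8, 0.7, 0.2, 0.6, 0.9],
                   [0.8, 0.4, 0.8, 0.3, 0.7],
                   [0.3, 0.2, 0.4, 0.3, 0.8]], dtype=np.float32)
result = mask_rows(inputs)
print(result)
```

On the example input this gives the same rows as the tf.vectorized_map output shown above.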

NumPy - generate multiple intervals

I have an array like this:
[[0.13, 0.19],
[0.25, 0.6 ],
[0.7 , 0.89]]
I want, given the above array, to create a result like this:
[[0, 0.12],
[0.13, 0.19],
[0.20, 0.24],
[0.25, 0.60],
[0.61, 0.69],
[0.70, 0.89],
[0.90, 1]]
That is, I want to build the complete matrix of intervals, given a set of predefined intervals.

This isn't specific to numpy, but maybe it will point you in the correct direction.
Basically, you need to know where to start, where to end, and the 'resolution' (for lack of a better word), i.e. how far apart the gaps are. With that you can loop through the existing intervals and fill in the others. You'll want to watch the edge cases where the intervals are already filled in, such as one starting at 0, or adjacent ones like [0.6, 0.8], [0.9, 0.95], so you don't fill those in twice. This might look something like:
def fill_intervals(existing_intervals, start=0, end=1.0, inc=0.01):
    l2 = []
    for i in existing_intervals:
        # Fill the gap before this interval, if there is one
        if start < i[0]:
            l2.append([start, i[0] - inc])
        l2.append(i)
        start = i[1] + inc
    # Fill the gap after the last interval, if there is one
    if start < end:
        l2.append([start, end])
    return l2
l = [
[0.13, 0.19],
[0.25, 0.6 ],
[0.7 , 0.89]
]
fill_intervals(l)
Returning:
[[0, 0.12],
[0.13, 0.19],
[0.2, 0.24],
[0.25, 0.6],
[0.61, 0.69],
[0.7, 0.89],
[0.9, 1.0]]

You can duplicate items and then make it quite close:
arr = np.array([[0.13, 0.19], [0.25, 0.6 ], [0.7 , 0.89]])
consecutive = np.r_[0, np.repeat(arr, 2), 1]
intervals = consecutive.reshape(-1, 2)
intervals:
array([[0. , 0.13], # required: [0, 0.12]
[0.13, 0.19], # OK
[0.19, 0.25], # required: [0.20, 0.24]
[0.25, 0.6 ], # OK
[0.6 , 0.7 ], # required: [0.61, 0.69]
[0.7 , 0.89], # OK
[0.89, 1. ]])# required: [0.9, 1]
It seems you need to fix alternate intervals so just do:
intervals[2::2,0] = intervals[2::2,0] + 0.01
intervals[:-1:2,1] = intervals[:-1:2,1] - 0.01
intervals:
array([[0. , 0.12],
[0.13, 0.19],
[0.2 , 0.24],
[0.25, 0.6 ],
[0.61, 0.69],
[0.7 , 0.89],
[0.9 , 1. ]])

You can use linspace to create your intervals
import numpy as np
>>> np.linspace(0, 1, num=3, endpoint=False)
array([0. , 0.33333333, 0.66666667])

Extract arrays based on positions indicated in another array

I have the data below as an example:
import numpy as np
data=[np.array([[0.9,0.6,0.5,0.4,0.7],[0.8,0.0,0.0,0.8,0.2],
[0.9,0.0,0.4,0.4,0.3],[0.9,0.6,0.3,0.2,0.5],[0.8,0.0,0.3,0.1,0.5]]),
np.array([[0.9,0.0,0.2,0.4,0.3],[0.0,0.2,0.4,0.0,0.0],
[0.0,0.0,0.0,0.2,0.0],[0.5,0.0,0.3,0.6,0.8],[0.5,0.6,0.9,0.0,0.0]])]
and I want to extract the relevant data based on these positions below:
positions_non_zero=[np.array([2,3,4]),np.array([1,4])]
the desired output should be this:
[array([[0.9, 0. , 0.4, 0.4, 0.3],
[0.9, 0.6, 0.3, 0.2, 0.5],
[0.8, 0. , 0.3, 0.1, 0.5]]),
array([[0. , 0.2, 0.4, 0. , 0. ],
[0.5, 0.6, 0.9, 0. , 0. ]])]
The reason is this:
The problem with my code is that only the np.array([1,4]) is taken under consideration.
My code:
df_class11=[]
for n in data:
    def data_target(df_class_target):
        for z in df_class_target:
            x_classA=[n[i] for i in z]
            x_classA=np.vstack(x_classA)
        return x_classA
    df_class11.append(data_target(positions_non_zero))
df_class11
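The desired output described above pairs the first array with the first set of positions and the second array with the second set. One possible sketch, assuming that pairing is what is intended, walks both lists together with zip and uses integer-array indexing to pick the rows; the name selected is made up for this example.

```python
import numpy as np

data = [np.array([[0.9, 0.6, 0.5, 0.4, 0.7], [0.8, 0.0, 0.0, 0.8, 0.2],
                  [0.9, 0.0, 0.4, 0.4, 0.3], [0.9, 0.6, 0.3, 0.2, 0.5],
                  [0.8, 0.0, 0.3, 0.1, 0.5]]),
        np.array([[0.9, 0.0, 0.2, 0.4, 0.3], [0.0, 0.2, 0.4, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 0.2, 0.0], [0.5, 0.0, 0.3, 0.6, 0.8],
                  [0.5, 0.6, 0.9, 0.0, 0.0]])]
positions_non_zero = [np.array([2, 3, 4]), np.array([1, 4])]

# Pair each array with its own row positions; arr[pos] selects those rows.
selected = [arr[pos] for arr, pos in zip(data, positions_non_zero)]

for block in selected:
    print(block)
```

In the nested-loop version, the inner loop overwrites x_classA for every entry of positions_non_zero and returns only the last one, which would explain why only np.array([1,4]) appears to be used.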

Slicing 2D arrays using indices from arrays in python

I'm working with slices of a 2D numpy array. To select the slices, I have the indices stored in arrays. For example, I have:
mat = np.zeros([xdim,ydim], float)
xmin = np.array([...]) # Array of minimum indices in x
xmax = np.array([...]) # Array of maximum indices in x
ymin = np.array([...]) # Array of minimum indices in y
ymax = np.array([...]) # Array of maximum indices in y
value = np.array([...]) # Values
Where ... just denotes some integer numbers previously calculated. All arrays are well-defined and have lengths of ~265000. What I want to do is something like:
mat[xmin:xmax, ymin:ymax] += value
In such a way that for the first elements I would have:
mat[xmin[0]:xmax[0], ymin[0]:ymax[0]] += value[0]
mat[xmin[1]:xmax[1], ymin[1]:ymax[1]] += value[1]
and so on, for the ~265000 elements of the array. Unfortunately what I just wrote is not working, and it is throwing the error: IndexError: invalid slice.
I've been trying to use np.meshgrid as suggested here: NumPy: use 2D index array from argmin in a 3D slice, but it hasn't worked for me yet. Besides, I'm looking for a pythonic way to do so, avoiding the for loops.
Any help will be much appreciated!
Thanks!

I don't think there is a satisfactory way of vectorizing your problem without resorting to Cython or the like. Let me outline what a pure numpy solution could look like, which should make clear why this is probably not a very good approach.
First, let's look at a 1D case. There's not much you can do with a bunch of slices in numpy, so the first task is to expand them into individual indices. Say that your arrays were:
mat = np.zeros((10,))
x_min = np.array([2, 5, 3, 1])
x_max = np.array([5, 9, 8, 7])
value = np.array([0.2, 0.6, 0.1, 0.9])
Then the following code expands the slice limits into lists of (possibly repeating) indices and values, joins them together with bincount, and adds them to the original mat:
x_len = x_max - x_min
x_cum_len = np.cumsum(x_len)
x_idx = np.arange(x_cum_len[-1])
x_idx[x_len[0]:] -= np.repeat(x_cum_len[:-1], x_len[1:])
x_idx += np.repeat(x_min, x_len)
x_val = np.repeat(value, x_len)
x_cumval = np.bincount(x_idx, weights=x_val)
mat[:len(x_cumval)] += x_cumval
>>> mat
array([ 0. , 0.9, 1.1, 1.2, 1.2, 1.6, 1.6, 0.7, 0.6, 0. ])
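As a sanity check, the 1D index expansion can be compared against the plain loop it replaces. The sketch below wraps the steps above in a function and runs both versions on the same inputs; the function names add_slices_1d and add_slices_1d_loop are made up for this illustration.

```python
import numpy as np

def add_slices_1d(size, x_min, x_max, value):
    """Index-expansion version, following the steps shown above."""
    mat = np.zeros(size)
    x_len = x_max - x_min
    x_cum_len = np.cumsum(x_len)
    # One running index per covered element, restarted for each slice
    x_idx = np.arange(x_cum_len[-1])
    x_idx[x_len[0]:] -= np.repeat(x_cum_len[:-1], x_len[1:])
    x_idx += np.repeat(x_min, x_len)
    x_val = np.repeat(value, x_len)
    # bincount sums the weights of repeated indices
    x_cumval = np.bincount(x_idx, weights=x_val)
    mat[:len(x_cumval)] += x_cumval
    return mat

def add_slices_1d_loop(size, x_min, x_max, value):
    """Reference version: one slice update per (min, max, value) triple."""
    mat = np.zeros(size)
    for lo, hi, v in zip(x_min, x_max, value):
        mat[lo:hi] += v
    return mat

x_min = np.array([2, 5, 3, 1])
x_max = np.array([5, 9, 8, 7])
value = np.array([0.2, 0.6, 0.1, 0.9])

vectorized = add_slices_1d(10, x_min, x_max, value)
looped = add_slices_1d_loop(10, x_min, x_max, value)
print(np.allclose(vectorized, looped))
```

Both functions should agree on any set of valid slices, which makes the loop version a convenient reference when extending the expansion to more dimensions.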
It is possible to expand this to your 2D case, although it is anything but trivial, and things start getting hard to follow:
mat = np.zeros((10, 10))
x_min = np.array([2, 5, 3, 1])
x_max = np.array([5, 9, 8, 7])
y_min = np.array([1, 7, 2, 6])
y_max = np.array([6, 8, 6, 9])
value = np.array([0.2, 0.6, 0.1, 0.9])
x_len = x_max - x_min
y_len = y_max - y_min
total_len = x_len * y_len
x_cum_len = np.cumsum(x_len)
x_idx = np.arange(x_cum_len[-1])
x_idx[x_len[0]:] -= np.repeat(x_cum_len[:-1], x_len[1:])
x_idx += np.repeat(x_min, x_len)
x_val = np.repeat(value, x_len)
y_min_ = np.repeat(y_min, x_len)
y_len_ = np.repeat(y_len, x_len)
y_cum_len = np.cumsum(y_len_)
y_idx = np.arange(y_cum_len[-1])
y_idx[y_len_[0]:] -= np.repeat(y_cum_len[:-1], y_len_[1:])
y_idx += np.repeat(y_min_, y_len_)
x_idx_ = np.repeat(x_idx, y_len_)
xy_val = np.repeat(x_val, y_len_)
xy_idx = np.ravel_multi_index((x_idx_, y_idx), dims=mat.shape)
xy_cumval = np.bincount(xy_idx, weights=xy_val)
mat.ravel()[:len(xy_cumval)] += xy_cumval
Which produces:
>>> mat
array([[ 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. , 0. , 0.9, 0.9, 0.9, 0. ],
[ 0. , 0.2, 0.2, 0.2, 0.2, 0.2, 0.9, 0.9, 0.9, 0. ],
[ 0. , 0.2, 0.3, 0.3, 0.3, 0.3, 0.9, 0.9, 0.9, 0. ],
[ 0. , 0.2, 0.3, 0.3, 0.3, 0.3, 0.9, 0.9, 0.9, 0. ],
[ 0. , 0. , 0.1, 0.1, 0.1, 0.1, 0.9, 1.5, 0.9, 0. ],
[ 0. , 0. , 0.1, 0.1, 0.1, 0.1, 0.9, 1.5, 0.9, 0. ],
[ 0. , 0. , 0.1, 0.1, 0.1, 0.1, 0. , 0.6, 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.6, 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ]])
But if you have 265,000 two dimensional slices of arbitrary size, then the indexing arrays are going to get into the many millions of items really fast. Having to handle reading and writing so much data can negate the speed improvements that come with using numpy. Frankly, I doubt this is a good option at all, if nothing else because of how cryptic your code is going to become.

What is the equivalent of "zip()" in Python's numpy?

I am trying to do the following but with numpy arrays:
x = [(0.1, 1.), (0.1, 2.), (0.1, 3.), (0.1, 4.), (0.1, 5.)]
normal_result = zip(*x)
This should give a result of:
normal_result = [(0.1, 0.1, 0.1, 0.1, 0.1), (1., 2., 3., 4., 5.)]
But if the input vector is a numpy array:
y = np.array(x)
numpy_result = zip(*y)
print type(numpy_result)
It (expectedly) returns a:
<type 'list'>
The issue is that I will need to transform the result back into a numpy array after this.
What I would like to know is whether there is an efficient numpy function that avoids these back-and-forth transformations.

You can just transpose it...
>>> a = np.array([(0.1, 1.), (0.1, 2.), (0.1, 3.), (0.1, 4.), (0.1, 5.)])
>>> a
array([[ 0.1, 1. ],
[ 0.1, 2. ],
[ 0.1, 3. ],
[ 0.1, 4. ],
[ 0.1, 5. ]])
>>> a.T
array([[ 0.1, 0.1, 0.1, 0.1, 0.1],
[ 1. , 2. , 3. , 4. , 5. ]])
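To make the equivalence concrete, here is a small sketch comparing the zip-based round trip with the transpose on the data from the question; the names via_zip and via_transpose are made up for this example.

```python
import numpy as np

x = [(0.1, 1.), (0.1, 2.), (0.1, 3.), (0.1, 4.), (0.1, 5.)]
y = np.array(x)                       # shape (5, 2)

# Round trip through Python tuples, then back to an array
via_zip = np.array(list(zip(*x)))     # shape (2, 5)

# Transpose: same values, no conversion to Python objects
via_transpose = y.T                   # shape (2, 5)

print(np.array_equal(via_zip, via_transpose))  # True
print(np.shares_memory(y, via_transpose))      # True: .T is a view of y
```

Because .T returns a view rather than a copy, writing to via_transpose also changes y; call .copy() on it if an independent array is needed.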

Try using dstack:
>>> from numpy import *
>>> a = array([[1,2],[3,4]]) # shapes of a and b can only differ in the 3rd dimension (if present)
>>> b = array([[5,6],[7,8]])
>>> dstack((a,b)) # stack arrays along a third axis (depth wise)
array([[[1, 5],
[2, 6]],
[[3, 7],
[4, 8]]])
so in your case it would be:
x = [(0.1, 1.), (0.1, 2.), (0.1, 3.), (0.1, 4.), (0.1, 5.)]
y = np.array(x)
>>> np.dstack(y)
array([[[ 0.1, 0.1, 0.1, 0.1, 0.1],
[ 1. , 2. , 3. , 4. , 5. ]]])
