Extract arrays based on positions indicated in another array - python

I have the data below as an example:
import numpy as np
data=[np.array([[0.9,0.6,0.5,0.4,0.7],[0.8,0.0,0.0,0.8,0.2],
[0.9,0.0,0.4,0.4,0.3],[0.9,0.6,0.3,0.2,0.5],[0.8,0.0,0.3,0.1,0.5]]),
np.array([[0.9,0.0,0.2,0.4,0.3],[0.0,0.2,0.4,0.0,0.0],
[0.0,0.0,0.0,0.2,0.0],[0.5,0.0,0.3,0.6,0.8],[0.5,0.6,0.9,0.0,0.0]])]
and I want to extract the relevant data based on these positions below:
positions_non_zero=[np.array([2,3,4]),np.array([1,4])]
the desired output should be this:
[array([[0.9, 0. , 0.4, 0.4, 0.3],
[0.9, 0.6, 0.3, 0.2, 0.5],
[0.8, 0. , 0.3, 0.1, 0.5]]),
array([[0. , 0.2, 0.4, 0. , 0. ],
[0.5, 0.6, 0.9, 0. , 0. ]])]
The problem with my code below is that only np.array([1,4]) is taken into consideration.
My code:
df_class11 = []
for n in data:
    def data_target(df_class_target):
        for z in df_class_target:
            x_classA = [n[i] for i in z]
            x_classA = np.vstack(x_classA)
        return x_classA
    df_class11.append(data_target(positions_non_zero))
df_class11
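One way to get the desired output is to pair each data array with its own positions array and use integer-array (fancy) indexing, so that both position arrays are taken into account; a minimal sketch, assuming data and positions_non_zero as defined above:
# Each data array is indexed only by its own positions array.
extracted = [d[p] for d, p in zip(data, positions_non_zero)]
print(extracted)  # matches the desired output above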

Related

NumPy - generate multiple intervals

I have an array like this:
[[0.13, 0.19],
[0.25, 0.6 ],
[0.7 , 0.89]]
I want, given the above array, to create a result like this:
[[0, 0.12],
[0.13, 0.19],
[0.20, 0.24],
[0.25, 0.60],
[0.61, 0.69],
[0.70, 0.89],
[0.90, 1]]
Namely, given a set of pre-defined intervals, I want to create a total matrix of intervals that also covers the gaps between them.
This isn't specific to numpy, but maybe it will point you in the correct direction.
Basically, you need to know where to start, end, and the 'resolution' (for lack of a better word) — how far apart the gaps are. With that you can loop through the existing intervals and fill in the others. You'll want to watch the edge cases where the intervals are already filled in — like one starting at 0, or [0.6, 0.8], [0.9, 0.95] — so you don't fill those in twice. This might look something like:
def fill_intervals(existing_intervals, start=0, end=1.0, inc=0.01):
    l2 = []
    for i in existing_intervals:
        if start < i[0]:
            l2.append([start, i[0] - inc])
        l2.append(i)
        start = i[1] + inc
    if start < end:
        l2.append([start, end])
    return l2
l = [
    [0.13, 0.19],
    [0.25, 0.6 ],
    [0.7 , 0.89]
]
fill_intervals(l)
Returning:
[[0, 0.12],
[0.13, 0.19],
[0.2, 0.24],
[0.25, 0.6],
[0.61, 0.69],
[0.7, 0.89],
[0.9, 1.0]]
You can duplicate the items and then get quite close:
arr = np.array([[0.13, 0.19], [0.25, 0.6 ], [0.7 , 0.89]])
consecutive = np.r_[0, np.repeat(arr, 2), 1]
intervals = consecutive.reshape(-1, 2)
intervals:
array([[0. , 0.13], # required: [0, 0.12]
[0.13, 0.19], # OK
[0.19, 0.25], # required: [0.20, 0.24]
[0.25, 0.6 ], # OK
[0.6 , 0.7 ], # required: [0.61, 0.69]
[0.7 , 0.89], # OK
[0.89, 1. ]])# required: [0.9, 1]
It seems you need to fix alternate intervals so just do:
intervals[2::2,0] = intervals[2::2,0] + 0.01
intervals[:-1:2,1] = intervals[:-1:2,1] - 0.01
intervals:
array([[0. , 0.12],
[0.13, 0.19],
[0.2 , 0.24],
[0.25, 0.6 ],
[0.61, 0.69],
[0.7 , 0.89],
[0.9 , 1. ]])
You can use linspace to create your intervals
import numpy as np
>>> np.linspace(0, 1, num=3, endpoint=False)
array([0. , 0.33333333, 0.66666667])

Is this a Python/Numpy bug or a subtle gotcha?

Consider the following two implementations of the same piece of code. I would have thought they were identical, but they are not.
Is this a Python/Numpy bug or a subtle gotcha? If the latter, what rule would make it obvious why it does not work as expected?
I was working with multiple arrays of data and having to process each array item by item, with each array manipulated by a table depending on its metadata.
In the real-world example 'n' involves multiple factors and offsets, but the following code still demonstrates the issue: I was getting the wrong result in all but one case.
import numpy as np
# Change the following line to True to show different behaviour
NEEDS_BUGS = False  # Changeme
# Create some data
data = np.linspace(0, 1, 10)
print(data)
# Create an array of vector functions each of which does a different operation on a set of data
vfuncd = dict()
# Two implementations
if NEEDS_BUGS:
    # Lets do this in a loop because we like loops - However WARNING this does not work!!
    for n in range(10):
        vfuncd[n] = np.vectorize(lambda x: x * n)
else:
    # Unwrap the loop - NOTE: Spoiler - this works
    vfuncd[0] = np.vectorize(lambda x: x * 0)
    vfuncd[1] = np.vectorize(lambda x: x * 1)
    vfuncd[2] = np.vectorize(lambda x: x * 2)
    vfuncd[3] = np.vectorize(lambda x: x * 3)
    vfuncd[4] = np.vectorize(lambda x: x * 4)
    vfuncd[5] = np.vectorize(lambda x: x * 5)
    vfuncd[6] = np.vectorize(lambda x: x * 6)
    vfuncd[7] = np.vectorize(lambda x: x * 7)
    vfuncd[8] = np.vectorize(lambda x: x * 8)
    vfuncd[9] = np.vectorize(lambda x: x * 9)
# Prove we have multiple different vectorised functions
for k, vfunc in vfuncd.items():
    print(k, vfunc)
# Do the work
res = {k: vfuncd[k](data) for k in vfuncd.keys()}
# Show the result
for k, r in res.items():
    print(k, r)
I don't know exactly what you're trying to achieve, or whether it's a bad idea or not (in terms of np.vectorize), but the issue you're facing comes from the way Python makes closures. Quoting from an answer to the linked question:
Scoping in Python is lexical. A closure will always
remember the name and scope of the variable, not the object it's
pointing to. Since all the functions in your example are created in
the same scope and use the same variable name, they always refer to
the same variable.
In other words, when you make that closure over n, you're not actually closing over the state of n, just the name. So when n changes, the value in your closure also changes. This is quite unexpected to me, but others find it natural.
Here is one fix using partial:
from functools import partial
.
.
.
def func(x, n):
    return x * n

for n in range(10):
    vfuncd[n] = np.vectorize(partial(func, n=n))
Or another using a factory method:
def func_factory(n):
    return lambda x: x * n

for n in range(10):
    vfuncd[n] = np.vectorize(func_factory(n))
It seems that the Python variable n is bound to the vectorized expression:
for n in range(10):
    vfuncd[n] = np.vectorize(lambda x: x * n)
Binding the current value of n at the moment each lambda is created fixes it, for example via a default argument:
for n in range(10):
    vfuncd[n] = np.vectorize(lambda x, n=n: x * n)
In fact this has performance implications as well, as I assume the value of the free Python variable would otherwise have to be fetched repeatedly.
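As a quick check that each vectorized function now keeps its own n, a small sketch with a short array:
vfuncd = {n: np.vectorize(lambda x, n=n: x * n) for n in range(3)}
print(vfuncd[2](np.array([0.0, 0.5, 1.0])))  # [0. 1. 2.], rather than every function using the last n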
In [13]: data = np.linspace(0,1,11)
Since the data array can be multiplied with a simple:
In [14]: data*3
Out[14]: array([0. , 0.3, 0.6, 0.9, 1.2, 1.5, 1.8, 2.1, 2.4, 2.7, 3. ])
we don't need the complication of np.vectorize to see the closure issue. A simple lambda is enough.
In [15]: vfuncd = {}
    ...: for n in range(3):
    ...:     vfuncd[n] = lambda x: x*n
    ...:
In [16]: vfuncd
Out[16]:
{0: <function __main__.<lambda>(x)>,
1: <function __main__.<lambda>(x)>,
2: <function __main__.<lambda>(x)>}
In [17]: {k:v(data) for k,v in vfuncd.items()}
Out[17]:
{0: array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ]),
1: array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ]),
2: array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ])}
We won't get the closure problem if we use a proper numpy "vectorization":
In [18]: data * np.arange(3)[:,None]
Out[18]:
array([[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ],
[0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ]])
Or a simple iteration if we need a dictionary:
In [20]: {k:data*k for k in range(3)}
Out[20]:
{0: array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
1: array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),
2: array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ])}
np.vectorize has a speed disclaimer. But it is justified where the function only takes scalar inputs, and we want the flexibility of numpy broadcasting - i.e. for 2 or more arguments.
Creating multiple vectorized functions is clearly an 'anti-pattern'. I'd rather see one vectorize with the appropriate arguments:
In [25]: f = np.vectorize(lambda x,n: x*n)
In [26]: {n: f(data,n) for n in range(3)}
Out[26]:
{0: array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
1: array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),
2: array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ])}
That f can also produce the array Out[18] (but is slower):
In [27]: f(data, np.arange(3)[:,None])
Out[27]:
array([[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ],
[0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ]])

Two dimensional array slicing

I have a portion of the Viterbi algorithm that I want to manipulate, and I need to understand the slicing part of this code:
import numpy as np
A = np.array([[0.6, 0.2, 0.2], [0.5, 0.3, 0.2], [0.4, 0.1, 0.5]])
pi = np.array([0.5, 0.2, 0.3])
O = np.array([[0.7, 0.1, 0.2], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]])
states = UP, DOWN, UNCHANGED = 0, 1, 2
observations = [UP, UP, DOWN]
alpha = np.zeros((len(observations), len(states))) # time steps x states
alpha[:,:] = float('-inf')
backpointers = np.zeros((len(observations), len(states)), 'int')
***alpha[0, :] = pi * O[:,UP]***
If, in the last line, I print out O[:,UP], I believe it should give me:
[0.7, 0.1, 0.2]
instead, it gives me:
O[:,UP]
Out[15]: array([ 0.7, 0.1, 0.3])
I tried to look into this: Understanding Python's slice notation.
I couldn't understand why it changes the last element of the array.
Also, I ran this:
O[:,UNCHANGED]
Out[17]: array([ 0.2, 0.3, 0.4])
I'm still a newbie in Python, and I need some help.
You are mixing up the notation for columns and rows.
You print O[:,UP] which gives you all the rows and just the "UP"th column (index 0).
Your O is:
array([[ 0.7, 0.1, 0.2],
[ 0.1, 0.6, 0.3],
[ 0.3, 0.3, 0.4]])
And O[:,0] is
#        ↓ this column
array([[ 0.7, 0.1, 0.2],
       [ 0.1, 0.6, 0.3],
       [ 0.3, 0.3, 0.4]])
where O[0,:] would be
array([[ 0.7, 0.1, 0.2],   # ← this row
       [ 0.1, 0.6, 0.3],
       [ 0.3, 0.3, 0.4]])
And just to make the last part clear, O[:,UNCHANGED] is O[:,2], which is here:
#                  ↓ this column
array([[ 0.7, 0.1, 0.2],
       [ 0.1, 0.6, 0.3],
       [ 0.3, 0.3, 0.4]])
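To see the difference concretely, here is a small runnable sketch using the O from the question, printing the row and column selections side by side:
import numpy as np

O = np.array([[0.7, 0.1, 0.2], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]])
UP, DOWN, UNCHANGED = 0, 1, 2

print(O[UP, :])         # first row     -> [0.7 0.1 0.2]
print(O[:, UP])         # first column  -> [0.7 0.1 0.3]
print(O[:, UNCHANGED])  # third column  -> [0.2 0.3 0.4]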

TensorFlow - dense vector to one-hot

Suppose I have the following tensor:
T = [[0.1, 0.3, 0.7],
[0.2, 0.5, 0.3],
[0.1, 0.1, 0.8]]
I want to transform this into a one-hot tensor, such that in each row the index holding the maximum value gets set to 1 and all the others get set to zero, like this:
T_onehot = [[0, 0, 1],
[0, 1, 0],
[0, 0, 1]]
I know there's tf.argmax to get the indices of the largest elements in the tensor, but is there any method which allows me to do what I want to do in one step?
I don't know if there's a way to do this in one step, but there's a one_hot function in tensorflow:
import tensorflow as tf
T = tf.constant([[0.1, 0.3, 0.7], [0.2, 0.5, 0.3], [0.1, 0.1, 0.8]])
T_onehot = tf.one_hot(tf.argmax(T, 1), T.shape[1])
tf.InteractiveSession()
print(T_onehot.eval())
# [[ 0. 0. 1.]
# [ 0. 1. 0.]
# [ 0. 0. 1.]]
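As a side note, on TensorFlow 2.x with eager execution the same combination works without a session; a minimal sketch, assuming the same T as above:
import tensorflow as tf

T = tf.constant([[0.1, 0.3, 0.7], [0.2, 0.5, 0.3], [0.1, 0.1, 0.8]])
# argmax over axis 1 picks the largest entry in each row;
# one_hot turns those row indices back into 0/1 rows.
T_onehot = tf.one_hot(tf.argmax(T, axis=1), T.shape[1])
print(T_onehot.numpy())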

Slicing 2D arrays using indices from arrays in python

I'm working with slices of a 2D numpy array. To select the slices, I have the indices stored in arrays. For example, I have:
mat = np.zeros([xdim,ydim], float)
xmin = np.array([...]) # Array of minimum indices in x
xmax = np.array([...]) # Array of maximum indices in x
ymin = np.array([...]) # Array of minimum indices in y
ymax = np.array([...]) # Array of maximum indices in y
value = np.array([...]) # Values
Where ... just denotes some integer numbers previously calculated. All arrays are well-defined and have lengths of ~265000. What I want to do is something like:
mat[xmin:xmax, ymin:ymax] += value
In such a way that for the first elements I would have:
mat[xmin[0]:xmax[0], ymin[0]:ymax[0]] += value[0]
mat[xmin[1]:xmax[1], ymin[1]:ymax[1]] += value[1]
and so on, for the ~265000 elements of the array. Unfortunately what I just wrote is not working, and it is throwing the error: IndexError: invalid slice.
I've been trying to use np.meshgrid as suggested here: NumPy: use 2D index array from argmin in a 3D slice, but it hasn't worked for me yet. Besides, I'm looking for a pythonic way to do so, avoiding the for loops.
Any help will be much appreciated!
Thanks!
I don't think there is a satisfactory way of vectorizing your problem without resorting to Cython or the like. Let me outline what a pure numpy solution could look like, which should make clear why this is probably not a very good approach.
First, let's look at a 1D case. There's not much you can do with a bunch of slices in numpy, so the first task is to expand them into individual indices. Say that your arrays were:
mat = np.zeros((10,))
x_min = np.array([2, 5, 3, 1])
x_max = np.array([5, 9, 8, 7])
value = np.array([0.2, 0.6, 0.1, 0.9])
Then the following code expands the slice limits into lists of (possibly repeating) indices and values, joins them together with bincount, and adds them to the original mat:
x_len = x_max - x_min                                      # length of each slice
x_cum_len = np.cumsum(x_len)                               # cumulative slice lengths
x_idx = np.arange(x_cum_len[-1])                           # one entry per covered position
x_idx[x_len[0]:] -= np.repeat(x_cum_len[:-1], x_len[1:])   # restart the count at each slice
x_idx += np.repeat(x_min, x_len)                           # shift offsets to absolute indices
x_val = np.repeat(value, x_len)                            # matching value for every index
x_cumval = np.bincount(x_idx, weights=x_val)               # sum values landing on the same index
mat[:len(x_cumval)] += x_cumval
>>> mat
array([ 0. , 0.9, 1.1, 1.2, 1.2, 1.6, 1.6, 0.7, 0.6, 0. ])
It is possible to expand this to your 2D case, although it is anything but trivial, and things start getting hard to follow:
mat = np.zeros((10, 10))
x_min = np.array([2, 5, 3, 1])
x_max = np.array([5, 9, 8, 7])
y_min = np.array([1, 7, 2, 6])
y_max = np.array([6, 8, 6, 9])
value = np.array([0.2, 0.6, 0.1, 0.9])
x_len = x_max - x_min
y_len = y_max - y_min
total_len = x_len * y_len
x_cum_len = np.cumsum(x_len)
x_idx = np.arange(x_cum_len[-1])
x_idx[x_len[0]:] -= np.repeat(x_cum_len[:-1], x_len[1:])
x_idx += np.repeat(x_min, x_len)
x_val = np.repeat(value, x_len)
y_min_ = np.repeat(y_min, x_len)
y_len_ = np.repeat(y_len, x_len)
y_cum_len = np.cumsum(y_len_)
y_idx = np.arange(y_cum_len[-1])
y_idx[y_len_[0]:] -= np.repeat(y_cum_len[:-1], y_len_[1:])
y_idx += np.repeat(y_min_, y_len_)
x_idx_ = np.repeat(x_idx, y_len_)
xy_val = np.repeat(x_val, y_len_)
xy_idx = np.ravel_multi_index((x_idx_, y_idx), dims=mat.shape)
xy_cumval = np.bincount(xy_idx, weights=xy_val)
mat.ravel()[:len(xy_cumval)] += xy_cumval
Which produces:
>>> mat
array([[ 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. , 0. , 0.9, 0.9, 0.9, 0. ],
[ 0. , 0.2, 0.2, 0.2, 0.2, 0.2, 0.9, 0.9, 0.9, 0. ],
[ 0. , 0.2, 0.3, 0.3, 0.3, 0.3, 0.9, 0.9, 0.9, 0. ],
[ 0. , 0.2, 0.3, 0.3, 0.3, 0.3, 0.9, 0.9, 0.9, 0. ],
[ 0. , 0. , 0.1, 0.1, 0.1, 0.1, 0.9, 1.5, 0.9, 0. ],
[ 0. , 0. , 0.1, 0.1, 0.1, 0.1, 0.9, 1.5, 0.9, 0. ],
[ 0. , 0. , 0.1, 0.1, 0.1, 0.1, 0. , 0.6, 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.6, 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ]])
But if you have 265,000 two dimensional slices of arbitrary size, then the indexing arrays are going to get into the many millions of items really fast. Having to handle reading and writing so much data can negate the speed improvements that come with using numpy. Frankly, I doubt this is a good option at all, if nothing else because of how cryptic your code is going to become.
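For comparison, the plain per-slice loop the question describes (and wants to avoid) is at least short and easy to verify; a minimal sketch, assuming the mat, xmin, xmax, ymin, ymax and value arrays from the question:
# Straightforward loop over the ~265,000 slices; slow in pure Python,
# but it states the intent directly and is easy to check against.
for i in range(len(value)):
    mat[xmin[i]:xmax[i], ymin[i]:ymax[i]] += value[i]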
