Make numpy matrix with insufficient length of data - python

I have some data, say a list of 10 numbers, and I have to convert that list to a matrix of shape (3, 4). What would be the best way to do so if I want the data to fill by columns/rows and the unfilled spots to take some default value like -1?
Eg:
data = [0,4,1,3,2,5,9,6,7,8]
>>> output
array([[ 0,  4,  1,  3],
       [ 2,  5,  9,  6],
       [ 7,  8, -1, -1]])
What I thought of doing is
data += [-1]*(row*col - len(data))
output = np.array(data).reshape((row, col))
Is there a simpler method that allows me to achieve the same result without having to modify the original data or sending in data + [-1]*remaining to the np.array function?

I'm sure there are various ways of doing this. My first inclination is to make an output array filled with the 'fill' value, and copy the data to it. Since the fill is 'ragged', not a full column or row, I'd start out 1d and reshape to the final shape.
In [730]: row,col = 3,4
In [731]: data = [0,4,1,3,2,5,9,6,7,8]
In [732]: output=np.zeros(row*col,dtype=int)-1
In [733]: output[:len(data)]=data
In [734]: output = output.reshape(3,4)
In [735]: output
Out[735]:
array([[ 0,  4,  1,  3],
       [ 2,  5,  9,  6],
       [ 7,  8, -1, -1]])
Regardless of whether data starts as a list or a 1d array, it will have to be copied to output. Since the total number of elements changes, we can't just reshape it.
This isn't that different from your approach of adding the extra values via [-1]*n.
There is a pad function, but it works on whole columns or rows, and internally it is quite complex because it is written for general cases.

Use np.ndarray.flat to index into the flattened version of the array.
data = [0, 4, 1, 3, 2, 5, 9, 6, 7, 8]
default_value = -1
desired_shape = (3, 4)
output = default_value * np.ones(desired_shape)
output.flat[:len(data)] = data
# output is now:
# array([[ 0.,  4.,  1.,  3.],
#        [ 2.,  5.,  9.,  6.],
#        [ 7.,  8., -1., -1.]])
As hpaulj says, the extra copy is really hard to avoid.
If you are reading data from a file somehow, you could read it into the flattened array directly, either using flat, or by reshaping the array afterward. Then the data gets directly loaded into the array with the desired shape.
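For instance, here is a minimal sketch of that idea, assuming the values sit in a plain text file with one number per line (the filename data.txt is made up for illustration):
import numpy as np

row, col = 3, 4
output = np.full(row * col, -1, dtype=int)    # 1d buffer pre-filled with the default value
values = np.loadtxt('data.txt', dtype=int)    # read whatever data is available as a flat array
output[:len(values)] = values                 # copy the data into the front of the buffer
output = output.reshape(row, col)             # the reshape is just a view, no extra copy here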

I checked the given solutions based on speed. The tests were done using IPython 4.2.0 with Python 3.5.2|Anaconda 4.1.1 (64-bit).
The data array starts with 100,000 elements. The new dimensions are 15,000 x 15,000.
M. Klugerford's solution (increasing the data and reshaping):
%timeit data = [x for x in range(100000)]; col=15000; row=15000; data+= [-1]*(row*col-len(data)); output = np.array(data).reshape((row, col))
1 loop, best of 3: 38.8 s per loop
Psidom's solution (using np.pad):
%timeit import numpy as np; data = [x for x in range(100000)]; col=15000; row=15000; np.pad(data, (0, row * col - len(data)), 'constant', constant_values = -1).reshape(row, col)
1 loop, best of 3: 20.4 s per loop
Praveen's solution (using np.ndarray.flat):
%timeit import numpy as np; data = [x for x in range(100000)]; col=15000; row=15000; output = -1 * np.ones((col, row)); output.flat[:len(data)] = data
1 loop, best of 3: 12.2 s per loop
hpaulj's solution (create the output first, copy later; the best solution so far!!):
%timeit import numpy as np; data = [x for x in range(100000)]; col=15000; row=15000; output=np.zeros(row*col,dtype=int)-1; output[:len(data)]=data; output = output.reshape(col, row)
1 loop, best of 3: 6.28 s per loop

Here is one option using numpy.pad: pad the data with -1 at the end of the array and then reshape it:
import numpy as np
data = [0,4,1,3,2,5,9,6,7,8]
row, col = 3, 4
np.pad(data, (0, row * col - len(data)), 'constant', constant_values = -1).reshape(row, col)
# array([[ 0,  4,  1,  3],
#        [ 2,  5,  9,  6],
#        [ 7,  8, -1, -1]])

Related

How to reshape matrix with numpy without using explicit values for argument?

I am trying to create a function and calculate the inner product using numpy.
I got the function to work however I am using explicit numbers in my np.reshape function and I need to make it to use based on input.
My code looks like this:
import numpy as np
X = np.array([[1,2],[3,4]])
Z = np.array([[1,4],[2,5],[3,6]])
n, m = X.shape[0], Z.shape[0]  # defined here so the snippet runs standalone
# Calculating S
def calculate_S(X, n, m):
    assert n == X.shape[0]
    n, d1 = X.shape
    m, d2 = X.shape
    S = np.diag(np.inner(X, X))
    return S
S = calculate_S(X, n, m)
S = S.reshape(2, 1)
print(S)
output:
---------------------------------
[[ 5]
[25]]
So the output is correct however instead of specifying 2,1 I need that those values to be automatically placed there based on the shape of my matrix.
How do I do that?
In [163]: X = np.array([[1,2],[3,4]])
In [164]: np.inner(X,X)
Out[164]:
array([[ 5, 11],
       [11, 25]])
In [165]: np.diag(np.inner(X,X))
Out[165]: array([ 5, 25])
reshape with -1 gets around having to specify the 2:
In [166]: np.diag(np.inner(X,X)).reshape(-1,1)
Out[166]:
array([[ 5],
       [25]])
Another way of adding a dimension:
In [167]: np.diag(np.inner(X,X))[:,None]
Out[167]:
array([[ 5],
       [25]])
You can get the "diagonal" directly with:
In [175]: np.einsum('ij,ij->i',X,X)
Out[175]: array([ 5, 25])
Another option:
In [177]: (X[:,None,:] @ X[:,:,None])[:,0,:]
Out[177]:
array([[ 5],
       [25]])

How is one supposed to use the arange function when indexing an ndarray?

Let's say I want to select a value from a different column for each row. Then, I might do something like this:
a = np.arange(12).reshape(3, 4)
columns = np.array([1, 2, 0])
a[np.arange(a.shape[0]), columns]
It seems a bit 'ugly' to me to need to specify the entire range; moreover, even the arange call takes time:
%timeit np.arange(int(1e6))
1.03 ms ± 15.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Is there a way to avoid using arange?
Generalizing the above question; how would one go about selecting not single values, but different adjacent sets of columns (each set of equal size) for each row? I would like to avoid creating many manual aranges, like so:
rows = np.array([0, 2])
start_values = np.array([0, 1])
window_length = 3
column_ranges = np.array(list(map(lambda j: np.arange(j, j + window_length), start_values)))
Right now, the only way I see to use the above column ranges is to index like so:
a[rows, :][:, column_ranges][np.arange(len(rows)), np.arange(len(rows)), :]
Ideally, I'd like to use a notation like a[:, columns] instead of a[np.arange(a.shape[0]), columns], and a[:, columns:columns + window_length] instead of a[rows, :][:, column_ranges][np.arange(len(rows)), np.arange(len(rows)), :].
We can get sliding windows and then index those with the start indices along the rows and cols for our desired output. To get those windows, we can leverage scikit-image's view_as_windows, which is itself based on np.lib.stride_tricks.as_strided.
from skimage.util.shape import view_as_windows

def windows_per_row_vas(arr, rows, cols, W):
    w = view_as_windows(arr, (1, W))[..., 0, :]
    return w[rows, cols]
If you want to get your hands dirty with a crude implementation using np.lib.stride_tricks.as_strided -
def windows_per_row_strided(arr, rows, cols, W):
    strided = np.lib.stride_tricks.as_strided
    m, n = arr.shape
    s0, s1 = arr.strides
    windows = strided(arr, shape=(m, n-W+1, W), strides=(s0, s1, s1))
    return windows[rows, cols]
Why use views/strided?
Because the windows are simply views into the input, there is no memory overhead. It's only at the final step, when getting the output, that we need extra memory to hold the required slices, which are needed anyway.
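As a quick sanity check (a small sketch assuming the 3x4 array a from the sample run below), np.shares_memory confirms that the strided windows are views rather than copies:
import numpy as np

a = np.arange(12).reshape(3, 4)
strided = np.lib.stride_tricks.as_strided
s0, s1 = a.strides
W = 3
windows = strided(a, shape=(a.shape[0], a.shape[1] - W + 1, W), strides=(s0, s1, s1))
print(np.shares_memory(a, windows))  # True: nothing is copied until we index into the windows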
Sample run -
In [9]: a
Out[9]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
In [10]: rows = np.array([0, 2])
...: start_values = np.array([0, 1])
...: window_length = 3
In [11]: windows_per_row_strided(a, rows, start_values, window_length)
Out[11]:
array([[ 0,  1,  2],
       [ 9, 10, 11]])
In [29]: windows_per_row_vas(a, rows, start_values, window_length)
Out[29]:
array([[ 0,  1,  2],
       [ 9, 10, 11]])

For a given condition, get indices of values in 2D tensor A, use those to index a 3D tensor B

For a given 2D tensor I want to retrieve all indices where the value is 1. I expected to be able to simply use torch.nonzero(a == 1).squeeze(), which would return tensor([1, 3, 2]). However, instead, torch.nonzero(a == 1) returns a 2D tensor (that's okay), with two values per row (that's not what I expected). The returned indices should then be used to index the second dimension (index 1) of a 3D tensor, again returning a 2D tensor.
import torch
a = torch.Tensor([[12, 1, 0, 0],
                  [4, 9, 21, 1],
                  [10, 2, 1, 0]])
b = torch.rand(3, 4, 8)
print('a_size', a.size())
# a_size torch.Size([3, 4])
print('b_size', b.size())
# b_size torch.Size([3, 4, 8])
idxs = torch.nonzero(a == 1)
print('idxs_size', idxs.size())
# idxs_size torch.Size([3, 2])
print(b.gather(1, idxs))
Evidently, this does not work, leading to a RuntimeError:
RuntimeError: invalid argument 4: Index tensor must have same
dimensions as input tensor at
C:\w\1\s\windows\pytorch\aten\src\TH/generic/THTensorEvenMoreMath.cpp:453
It seems that idxs is not what I expect it to be, nor can I use it the way I thought. idxs is
tensor([[0, 1],
        [1, 3],
        [2, 2]])
but reading through the documentation I don't understand why I also get back the row indices in the resulting tensor. Now, I know I can get the correct idxs by slicing idxs[:, 1] but then still, I cannot use those values as indices for the 3D tensor because the same error as before is raised. Is it possible to use the 1D tensor of indices to select items across a given dimension?
You could simply slice idxs and pass its columns as the indices, as in:
In [193]: idxs = torch.nonzero(a == 1)
In [194]: c = b[idxs[:, 0], idxs[:, 1]]
In [195]: c
Out[195]:
tensor([[0.3411, 0.3944, 0.8108, 0.3986, 0.3917, 0.1176, 0.6252, 0.4885],
        [0.5698, 0.3140, 0.6525, 0.7724, 0.3751, 0.3376, 0.5425, 0.1062],
        [0.7780, 0.4572, 0.5645, 0.5759, 0.5957, 0.2750, 0.6429, 0.1029]])
Alternatively, an even simpler & my preferred approach would be to just use torch.where() and then directly index into the tensor b as in:
In [196]: b[torch.where(a == 1)]
Out[196]:
tensor([[0.3411, 0.3944, 0.8108, 0.3986, 0.3917, 0.1176, 0.6252, 0.4885],
        [0.5698, 0.3140, 0.6525, 0.7724, 0.3751, 0.3376, 0.5425, 0.1062],
        [0.7780, 0.4572, 0.5645, 0.5759, 0.5957, 0.2750, 0.6429, 0.1029]])
A bit more explanation about the above approach of using torch.where(): it works based on the concept of advanced indexing, which kicks in when we index into a tensor with a tuple of sequence objects such as a tuple of tensors, a tuple of lists, or a tuple of tuples.
# some input tensor
In [207]: a
Out[207]:
tensor([[12.,  1.,  0.,  0.],
        [ 4.,  9., 21.,  1.],
        [10.,  2.,  1.,  0.]])
For basic slicing, we would need a tuple of integer indices:
In [212]: a[(1, 2)]
Out[212]: tensor(21.)
To achieve the same using advanced indexing, we would need a tuple of sequence objects:
# adv. indexing using a tuple of lists
In [213]: a[([1,], [2,])]
Out[213]: tensor([21.])
# adv. indexing using a tuple of tuples
In [215]: a[((1,), (2,))]
Out[215]: tensor([21.])
# adv. indexing using a tuple of tensors
In [214]: a[(torch.tensor([1,]), torch.tensor([2,]))]
Out[214]: tensor([21.])
And the dimension of the returned tensor would always be one dimension less than the dimension of the input tensor.
Assuming that b's three dimensions are batch_size x sequence_length x features (b x s x feats), the expected results can be achieved as follows.
import torch
a = torch.Tensor([[12, 1, 0, 0],
                  [4, 9, 21, 1],
                  [10, 2, 1, 0]])
b = torch.rand(3, 4, 8)
print(b.size())
# b x s x feats
idxs = torch.nonzero(a == 1)[:, 1]
print(idxs.size())
# b
c = b[torch.arange(b.size(0)), idxs]
print(c.size())
# b x feats
import torch
a = torch.Tensor([[12, 1, 0, 0],
                  [4, 9, 21, 1],
                  [10, 2, 1, 0]])
b = torch.rand(3, 4, 8)
print('a_size', a.size())
# a_size torch.Size([3, 4])
print('b_size', b.size())
# b_size torch.Size([3, 4, 8])
#idxs = torch.nonzero(a == 1, as_tuple=True)
idxs = torch.nonzero(a == 1)
#print('idxs_size', idxs.size())
print(torch.index_select(b,1,idxs[:,1]))
As a supplement to @kmario23's solution, you can still achieve the same result with:
b[torch.nonzero(a==1,as_tuple=True)]

How to find the median of different-sized lists

I have a list of numbers which I want to sort into bins and find the median of each bin. If the bins all had the same number of data points, this would be easy to do reasonably efficiently using numpy arrays:
import numpy as np
indices=np.array([0,1,0,1,1,2,3,3,3,2,0,2])
length=np.max(indices)+1
data = np.arange(len(indices))
binned = np.array([data[indices == i] for i in range(length)])
The binned data (in the array binned) is then
array([[ 0,  2, 10],
       [ 1,  3,  4],
       [ 5,  9, 11],
       [ 6,  7,  8]])
The median of each bin is:
np.median(binned, axis=1)
Result:
array([2., 3., 9., 7.])
However, if the list is such that there are different numbers of points in each bin (or no points in some of the bins), I can't create a numpy array or use np.median and instead have to do the heavy lifting in a for loop:
indices=np.array([0,1,1,1,3,1,1,0,0,0,3])
data = np.arange(len(indices))
The binned data is
[data[indices == i] for i in range(length)]
[array([0, 7, 8, 9]),
 array([1, 2, 3, 5, 6]),
 array([], dtype=int64),
 array([ 4, 10])]
But I can't take a median of the list of arrays. Instead, I can do
[np.median(data[indices == i]) for i in range(length)]
and get
[7.5, 3.0, nan, 7.0]
But that for loop is pretty slow. (I have a few million data points and tens or hundreds of thousands of bins in my real data.)
Is there a way to do this that avoids heavy reliance on for loops (or even gets rid of for loops altogether)?
Just put your two columns in a pandas DataFrame and you can easily compute your medians by grouping by 'indices'. Let's see it in practice:
import numpy as np, pandas as pd
indices = [0,1,1,1,3,1,1,0,0,0,3]
data = np.arange(len(indices))
df = pd.DataFrame({"indices": indices, "data": data}) # Your DataFrame
df.head() # Take a look
   indices  data
0        0     0
1        1     1
2        1     2
3        1     3
4        3     4
medians = df.groupby("indices").median()  # median for each value of `indices`
medians
         data
indices
0         7.5
1         3.0
3         7.0
# Finding indices with no data point
desired_indices = pd.Series([0, 1, 10, -5, 2])
is_in_index = desired_indices.isin(medians.index)
has_no_data = desired_indices[~ is_in_index]
has_no_data
2    10
3    -5
4     2
dtype: int64
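If you also want an explicit NaN for bins with no data points, matching the list-comprehension output above, one possible follow-up (a minimal sketch, not part of the original answer) is to reindex the grouped result over the full range of bin indices:
import numpy as np, pandas as pd

indices = [0,1,1,1,3,1,1,0,0,0,3]
data = np.arange(len(indices))
medians = pd.DataFrame({"indices": indices, "data": data}).groupby("indices").median()
length = max(indices) + 1                                   # total number of bins, including empty ones
full = medians.reindex(range(length))["data"].to_numpy()    # missing bins become NaN
print(full)  # [7.5 3.  nan 7. ]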

numpy sum antidiagonals of array

Given a numpy ndarray, I would like to take the first two axes, and replace them with a new axis, which is the sum of their antidiagonals.
In particular, suppose I have variables x,y,z,..., and the entries of my array represent the probability
array[i,j,k,...] = P(x=i, y=j, z=k, ...)
I would like to obtain
new_array[l,k,...] = P(x+y=l, z=k, ...) = sum_i P(x=i, y=l-i, z=k, ...)
i.e., new_array[l,k,...] is the sum of all array[i,j,k,...] such that i+j=l.
What is the most efficient and/or cleanest way to do this in numpy?
EDIT to add:
On the recommendation of @hpaulj, here is the obvious iterative solution:
import numpy
array = numpy.arange(30).reshape((2,3,5))
array = array / float(array.sum())  # make it a probability
new_array = numpy.zeros([array.shape[0] + array.shape[1] - 1] + list(array.shape[2:]))
for i in range(array.shape[0]):
    for j in range(array.shape[1]):
        new_array[i+j,...] += array[i,j,...]
new_array.sum()  # == 1
There is a trace function that gives the sum of a diagonal. You can specify the offset and 2 axes (0 and 1 are the defaults). And to get the antidiagonal, you just need to flip one dimension. np.flipud does that, though it's just [::-1,...] indexing.
Putting those together,
np.array([np.trace(np.flipud(array),offset=k) for k in range(-1,3)])
matches your new_array.
It still loops over the possible values of l (4 in this case). trace itself is compiled.
In this small case, it's actually slower than your double loop (2x3 steps). Even if I move the flipud out of the inner loop, it is still slower. I don't know how this scales for larger arrays.
Part of the problem with vectorizing this even further is the fact that each diagonal has a different length.
In [331]: %%timeit
array1 = array[::-1]
np.array([np.trace(array1,offset=k) for k in range(-1,3)])
.....:
10000 loops, best of 3: 87.4 µs per loop
In [332]: %%timeit
new_array = np.zeros([array.shape[0] + array.shape[1] - 1] + list(array.shape[2:]))
for i in range(2):
    for j in range(3):
        new_array[i+j] += array[i,j]
.....:
10000 loops, best of 3: 43.5 µs per loop
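Not part of the original answer, but one way to sidestep the variable diagonal lengths entirely is to accumulate with np.add.at on the flattened i+j index; a rough sketch (not benchmarked here):
import numpy as np

array = np.arange(30).reshape((2,3,5))
array = array / array.sum()                    # make it a probability
i, j = np.indices(array.shape[:2])             # index grids over the first two axes
out_shape = (array.shape[0] + array.shape[1] - 1,) + array.shape[2:]
new_array = np.zeros(out_shape)
# unbuffered accumulation: every array[i,j,...] is added into new_array[i+j,...]
np.add.at(new_array, (i + j).ravel(), array.reshape(-1, *array.shape[2:]))
print(new_array.sum())  # == 1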
scipy.sparse has a dia format, which stores the values of nonzero diagonals. It stores a padded array of values, along with the offsets.
array([[12,  0,  0,  0],
       [ 8, 13,  0,  0],
       [ 4,  9, 14,  0],
       [ 0,  5, 10, 15],
       [ 0,  1,  6, 11],
       [ 0,  0,  2,  7],
       [ 0,  0,  0,  3]])
array([-3, -2, -1,  0,  1,  2,  3])
While that's a way of getting around the issue of variable diagonal lengths, I don't think it helps in this case where you just need their sums.
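For reference, those two arrays are the .data and .offsets attributes of a dia matrix; a minimal sketch, assuming the dense example was np.arange(16).reshape(4, 4):
import numpy as np
from scipy import sparse

dense = np.arange(16).reshape(4, 4)
dia = sparse.dia_matrix(dense)   # convert the dense matrix to DIAgonal storage
print(dia.data)      # padded values of the stored diagonals
print(dia.offsets)   # their offsets relative to the main diagonal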
