Representing a ragged array in numpy by padding - python

I have a 1-dimensional numpy array scores of scores associated with some objects. These objects belong to some disjoint groups, and all the scores of the items in the first group are first, followed by the scores of the items in the second group, etc.
I'd like to create a 2-dimensional array where each row corresponds to a group, and each entry is the score of one of its items. If all the groups are of the same size I can just do:
scores.reshape((numGroups, groupSize))
Unfortunately, my groups may be of varying size. I understand that numpy doesn't support ragged arrays, but it is fine for me if the resulting array simply pads each row with a specified value to make all rows the same length.
To make this concrete, suppose I have set A with 3 items, B with 2 items, and C with 4 items.
scores = numpy.array([f(a[0]), f(a[1]), f(a[2]), f(b[0]), f(b[1]),
                      f(c[0]), f(c[1]), f(c[2]), f(c[3])])
rowStarts = numpy.array([0, 3, 5])
paddingValue = -1.0
scoresByGroup = groupIntoRows(scores, rowStarts, paddingValue)
The desired value of scoresByGroup would be:
[[f(a[0]), f(a[1]), f(a[2]), -1.0],
 [f(b[0]), f(b[1]), -1.0, -1.0],
 [f(c[0]), f(c[1]), f(c[2]), f(c[3])]]
Is there some numpy function or composition of functions I can use to create groupIntoRows?
Background:
This operation will be used in calculating the loss for a minibatch for a gradient descent algorithm in Theano, so that's why I need to keep it as a composition of numpy functions if possible, rather than falling back on native Python.
It's fine to assume there is some known maximum row size
The original objects being scored are vectors and the scoring function is a matrix multiplication, which is why we flatten things out in the first place. It would be possible to pad everything to the maximum item set size before doing the matrix multiplication, but the biggest set is over ten times bigger than the average set size, so this is undesirable for speed reasons.

Try this:
import numpy as np

scores = np.random.rand(9)
row_starts = np.array([0, 3, 5])
# the group starts plus the total length, so adjacent differences give each group's size
row_ends = np.concatenate((row_starts, [len(scores)]))
lens = np.diff(row_ends)
# how many padding values each row needs to reach the length of the longest group
pad_len = np.max(lens) - lens
# positions (each group's end) at which to insert the padding, repeated as needed
where_to_pad = np.repeat(row_ends[1:], pad_len)
padding_value = -1.0
padded_scores = np.insert(scores, where_to_pad,
                          padding_value).reshape(-1, np.max(lens))
>>> padded_scores
array([[ 0.05878244,  0.40804443,  0.35640463, -1.        ],
       [ 0.39365072,  0.85313545, -1.        , -1.        ],
       [ 0.133687  ,  0.73651147,  0.98531828,  0.78940163]])

Related

Finding mode of unique array combination in the rows of 2d numpy array

I have a 2d numpy array from which I'm trying to return the mode row along axis=0. However, I would like to return the most frequent unique row combination, and not the three per-column modes, which is what scipy stats mode does. The desired output in the example below would be [9,9,9], because that's the most common unique row. Thanks
import numpy as np
from scipy import stats

arr1 = np.array([[2,3,4],[2,1,5],[1,2,3],[2,4,4],[2,8,2],[2,3,1],[9,9,9],[9,9,9]])
stats.mode(arr1, axis=0)
output:
ModeResult(mode=array([[2, 3, 4]]), count=array([[5, 2, 2]]))
You could use the numpy unique function and return the counts:
unique_arr1, count = np.unique(arr1, axis=0, return_counts=True)
unique_arr1[np.argmax(count)]
output:
array([9, 9, 9])
np.unique returns the unique rows in sorted order, which guarantees that the last one is the maximum, so you could simply do:
out = np.unique(arr1,axis=0)[-1]
However, I do not know what purpose you have in mind, but it is worth mentioning that keeping the counts lets you verify the result, or account for multiple rows that share the same count.
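If you do want to guard against ties, a small self-contained sketch of that idea using the arr1 from the question (it keeps every row whose count equals the maximum):
import numpy as np

arr1 = np.array([[2,3,4],[2,1,5],[1,2,3],[2,4,4],[2,8,2],[2,3,1],[9,9,9],[9,9,9]])
unique_arr1, count = np.unique(arr1, axis=0, return_counts=True)
# keep every row whose count ties the maximum, in case the mode is not unique
modes = unique_arr1[count == count.max()]
print(modes)  # [[9 9 9]]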
Update
Given the additional information that this is for images (which can be big), and, most importantly, that the channel values of each pixel (uint8 or uint16) can be packed together into a single int32 or int64, you can do the following (assuming uint8 pixel values):
# pack the three channel values of each pixel into a single integer so that
# np.unique runs on 1D data instead of on rows
pixel, count = np.unique(np.dot(arr, np.array([2**16, 2**8, 1])), return_counts=True)
pixel = pixel[np.argmax(count)]
# unpack the packed integer back into its three channel bytes
b, g, r = np.ndarray((3,), buffer=pixel, dtype=np.uint8)
This could result in a big speedup.
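For completeness, here is a minimal self-contained sketch of the packing idea, using bit shifts rather than the buffer trick to unpack; the (N, 3) uint8 array arr is made-up stand-in data, and the channel order depends on your images:
import numpy as np

# made-up stand-in for an image reshaped to (N, 3) uint8 channel triples
arr = np.random.randint(0, 256, size=(100000, 3), dtype=np.uint8)

# pack each pixel's three channels into one integer so np.unique runs on scalars
packed = (arr[:, 0].astype(np.int64) << 16) | (arr[:, 1].astype(np.int64) << 8) | arr[:, 2]
values, counts = np.unique(packed, return_counts=True)
most_common = values[np.argmax(counts)]

# unpack the most common packed value back into its three channels
c0, c1, c2 = (most_common >> 16) & 0xFF, (most_common >> 8) & 0xFF, most_common & 0xFF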

3d array to matrix multiplication

I have a matrix called vec with two columns, vec[:,0] and vec[:,1]. P contains two matrices, P[0,:,:] and P[1,:,:]. I want to multiply P[0,:,:] with the first column of vec and multiply P[1,:,:] with the second column of vec. However, the operation P@vec also gives me the matrix product of P[0,:,:] with the second column of vec and the matrix product of P[1,:,:] with the first column of vec, which slows down my code.
Is it possible to directly compute only the matching pairs (matrix 1 with column 1, matrix 2 with column 2) without the "off" products?
import numpy as np
P = np.arange(50).reshape(2, 5, 5)
vec = np.arange(10).reshape(5, 2)
have = P@vec
want = np.column_stack((have[0,:,0], have[1,:,1]))
have, want
There is a very powerful function in numpy called np.einsum. It can perform all kinds of tensor contractions, axis reorderings and matrix multiplications. For your example you could use
res = np.einsum('nij,jn->in', P, vec)
after which res is exactly like want.
How does this work:
You give the np.einsum function both your arrays as well as a signature (that 'nij,jn->in' string) that tells the function how to multiply the arrays. In short, you want the third axis of the P tensor to be contracted with the first axis of vec, so you give both the same index j in the signature string and leave it out of the part after the ->. Indices that appear on both sides of the -> are not summed but carried through to the output, which is what happens here with the n and i indices.
A more complete explanation of this very powerful function with many examples of how to use it can be found at the corresponding numpy documentation.
@/matmul handles batches nicely, but the rule is that for 3d arrays the first dimension is the batch, and the dot is done on the last 2 dimensions, with the usual "last of A with the second-to-last of B" pairing.
It took a bit of reading to decipher your description, but it appears that you want the first dimension of P to be the batch, and the last dimension of vec to be the batch. That means vec needs to be transformed to a (2,5,1) to work with the (2,5,5) P.
In [176]: P@vec.T[:,:,None]
Out[176]:
array([[[ 60],
[ 160],
[ 260],
[ 360],
[ 460]],
[[ 695],
[ 820],
[ 945],
[1070],
[1195]]])
The result is (2,5,1). We can squeeze out the last dimension to get (2,5), but apparently you want a (5,2):
In [179]: (P@vec.T[:,:,None])[...,0].T
Out[179]:
array([[ 60, 695],
[ 160, 820],
[ 260, 945],
[ 360, 1070],
[ 460, 1195]])
np.einsum('nij,jn->in', P, vec) does effectively the same thing, with n as the batch dimension that is 'carried through' to the result and a sum of products over the shared j dimension.
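A quick sanity check with the arrays from the question, confirming that the batched matmul reshaping and the einsum call both reproduce want (a minimal sketch):
import numpy as np

P = np.arange(50).reshape(2, 5, 5)
vec = np.arange(10).reshape(5, 2)
have = P @ vec
want = np.column_stack((have[0, :, 0], have[1, :, 1]))

batched = (P @ vec.T[:, :, None])[..., 0].T
via_einsum = np.einsum('nij,jn->in', P, vec)
print(np.array_equal(batched, want), np.array_equal(via_einsum, want))  # True True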

How to multiply a 3D matrix with a 2D matrix efficiently in numpy

I have two multidimensional arrays, which I want to multiply with each other. One has the shape N,N,3 and the other has the shape N,N.
Let me set the stage:
I have an array of atom positions of the shape N,3:
atom_positions = [[x1, y1, z1],
                  [x2, y2, z2],
                  [x3, y3, z3],
                  ...
                 ]
From these I calculate an upper triangular matrix of distance vectors so that the resulting N,N,3 matrix contains all unique pair distance vectors r_ij of the vectors inside atom_positions:
pair_distance_vectors = [[[0,0,0], [x2-x1,y2-y1,z2-z1], [x3-x1,y3-y1,z3-z1], ...],
                         [[0,0,0], [0,0,0],             [x3-x2,y3-y2,z3-z2], ...],
                         ...
                        ]
Now I want to normalize each of these pair distance vectors. For that I want to use my N,N pair_distances array, which contains the length of every vector inside pair_distance_vectors.
The formula for a single vector is:
r_ij/|r_ij|
I want to do that by doing a matrix multiplication, where every entry in the N,N array becomes a scalar by which a vector inside the N,N,3 array is multiplied. I'm pretty sure that this can be achieved somehow with numpy by using numpy.dot() or a different function, but I just can't find the answer myself. Also, I'm afraid if I do find a transformation which allows for this, that my maths will be faulty.
Here's some demonstration code, which achieves what I want in a very inefficient fashion:
import numpy as np
pair_distance_vectors = np.ones(shape=(2,2,3))
pair_distances = np.array(((1,2),(3,4)))
normalized_pair_distance_vectors = np.zeros(shape=(2,2,3))
for i, vec_list in enumerate(pair_distance_vectors):
    for j, vec in enumerate(vec_list):
        normalized_pair_distance_vectors[i, j] = vec * pair_distances[i, j]
print(normalized_pair_distance_vectors)
Thanks in advance.
EDIT: Maybe this is clearer:
distance_vectors = [[[x11,y11,z11],[x12,y12,z12],[x13,y13,z13],...],
                    [[x21,y21,z21],[x22,y22,z22],[x23,y23,z23],...],
                    ... ]
distance_matrix = [[r_11,r_12,r_13,...],
                   [r_21,r_22,r_23,...],
                   ... ]
norm_distance_vectors = some_operation(distance_vectors, distance_matrix)
norm_distance_vectors = [[r_11*[x11,y11,z11],r_12*[x12,y12,z12],r_13*[x13,y13,z13],...],
                         [r_21*[x21,y21,z21],r_22*[x22,y22,z22],r_23*[x23,y23,z23],...],
                         ... ]
You won't need a loop. The trick is to expand your pair_distances in the third dimension by repeating it m times (m being the dimensionality of your vectors, here 3) and then divide the two arrays element-wise (works for any m-dimensional vectors; just replace 3 with m):
pair_distances = np.repeat(pair_distances[:, :, None], 3, axis=2)
normalized_pair_distance_vectors = np.nan_to_num(pair_distance_vectors / pair_distances)
Output for your example inputs:
[[[1. 1. 1. ]
[0.5 0.5 0.5 ]]
[[0.33333333 0.33333333 0.33333333]
[0.25 0.25 0.25 ]]]
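As a side note, plain broadcasting can do the same division without materializing the repeated array (a minimal sketch reusing the example inputs; np.nan_to_num is kept for the zero-length vectors that occur in the real data):
import numpy as np

pair_distance_vectors = np.ones(shape=(2, 2, 3))
pair_distances = np.array(((1, 2), (3, 4)), dtype=float)

# a trailing axis of length 1 lets NumPy broadcast the (N, N) lengths over the (N, N, 3) vectors
normalized = np.nan_to_num(pair_distance_vectors / pair_distances[:, :, None])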

np.bincount for 1 line, vectorized multidimensional averaging

I am trying to vectorize an operation using numpy. I use it in a python script that I have profiled, and I found this operation to be the bottleneck, so it needs to be optimized since I will run it many times.
The operation works on a data set of two parts. The first is a large set (n) of 1D vectors of different lengths (with maximum length Lmax) whose elements are integers from 1 to maxvalue. The set of vectors is arranged in a 2D array, data, of size (num_samples, Lmax), with the trailing elements in each row zeroed. The second part is a set of scalar floats, one associated with each vector, that I have computed and which depends on the vector's length and the integer value at each position. The set of scalars is made into a 1D array, Y, of size num_samples.
The desired operation is to form the average of Y over the n samples, as a function of (value, position along length, length).
This entire operation can be vectorized in matlab with the accumarray function, by using three 2D arrays of the same size as data whose elements are the corresponding value, position, and length indices of the desired final array:
sz_Y = num_samples;
sz_len = Lmax
sz_pos = Lmax
sz_val = maxvalue
ind_len = repmat( 1:sz_len ,1 ,sz_samples);
ind_pos = repmat( 1:sz_pos ,sz_samples,1 );
ind_val = data
ind_Y = repmat((1:sz_Y)',1 ,Lmax );
copiedY=Y(ind_Y);
mask = data>0;
finalarr=accumarray({ind_val(mask),ind_pos(mask),ind_len(mask)},copiedY(mask), [sz_val sz_pos sz_len])/sz_val;
I was hoping to emulate this implementation with np.bincount. However, np.bincount differs from accumarray in two relevant ways:
both arguments must be 1D arrays of the same size, and
there is no option to choose the shape of the output array.
In the above usage of accumarray, the list of indices, {ind_val(mask),ind_pos(mask),ind_len(mask)}, is a cell array of index vectors used as index tuples, while np.bincount takes a single 1D array of non-negative integers, as far as I understand. I expect np.ravel may be useful, but I am not sure how to use it here to do what I want. I am coming to python from matlab and some things do not translate directly, e.g. MATLAB's colon operator, which ravels in the opposite (column-major) order to np.ravel. So my question is how I might use np.bincount, or any other numpy method, to achieve an efficient python implementation of this operation.
EDIT: To avoid wasting time: for these multi-dimensional index problems with complicated index manipulation, is the recommended route to just use cython to implement the loops explicitly?
EDIT2: An alternative Python implementation I just came up with.
Here is a heavy ram solution:
First precalculate:
Using index units for length (i.e., length 1 = index 0), make a 4D bool array of size (num_samples, Lmax+1, Lmax+1, maxvalue+1), holding where the conditions are satisfied for each value in Y.
ALLcond = np.zeros((num_samples, Lmax+1, Lmax+1, maxvalue+1), dtype='bool')
for l in range(Lmax+1):
    for i in range(Lmax+1):
        for v in range(maxvalue+1):
            ALLcond[:, l, i, v] = (data[:, i] == v) & (Lvec == l)
Where Lvec=[len(row) for row in data]. Then get the indices for these using np.where and initialize a 4D float array into which you will assign the values of Y:
[ind_Y, ind_len, ind_pos, ind_val] = np.where(ALLcond)
Yval = np.zeros(np.shape(ALLcond), dtype='float')
Now in the loop in which I have to perform the operation, I compute it with the two lines:
Yval[ind_Y, ind_len, ind_pos, ind_val] = Y[ind_Y]
Y_avg = Yval.sum(axis=0) / num_samples
This gives a factor of 4 or so speed-up over the direct loop implementation. I was expecting more. Perhaps this is a more tangible implementation for Python heads to digest. Any faster suggestions are welcome :)
One way is to convert the 3 "indices" to a linear index and then apply bincount. Numpy's ravel_multi_index is essentially the same as MATLAB's sub2ind. So the ported code could be something like:
shape = (Lmax+1, Lmax+1, maxvalue+1)
posvec = np.arange(1, Lmax+1)
ind_len = np.tile(Lvec[:,None], [1, Lmax])
ind_pos = np.tile(posvec, [n, 1])
ind_val = data
Y_copied = np.tile(Y[:,None], [1, Lmax])
mask = posvec <= Lvec[:,None] # fill-value independent
lin_idx = np.ravel_multi_index((ind_len[mask], ind_pos[mask], ind_val[mask]), shape)
Y_avg = np.bincount(lin_idx, weights=Y_copied[mask], minlength=np.prod(shape)) / n
Y_avg.shape = shape
This is assuming data has shape (n, Lmax), Lvec is Numpy array, etc. You may need to adapt the code a little to get rid of off-by-one errors.
One could argue that the tile operations are not very efficient and not very "numpythonic". Something with broadcast_arrays could be nice, but I think I prefer this way:
shape = (Lmax+1, Lmax+1, maxvalue+1)
posvec = np.arange(1, Lmax+1)
mask = posvec <= Lvec[:,None]  # fill-value independent
len_idx = np.repeat(Lvec, Lvec)
pos_idx = np.broadcast_to(posvec, data.shape)[mask]
val_idx = data[mask]
Y_copied = np.repeat(Y, Lvec)
lin_idx = np.ravel_multi_index((len_idx, pos_idx, val_idx), shape)
Y_avg = np.bincount(lin_idx, weights=Y_copied, minlength=np.prod(shape)) / n
Y_avg.shape = shape
Note broadcast_to was added in Numpy 1.10.0.
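To see the accumarray emulation in isolation, here is a tiny standalone example of the ravel_multi_index + bincount pattern on made-up toy indices, not the arrays from the question:
import numpy as np

# accumulate the weights w into a (2, 3) output at positions (i, j)
i = np.array([0, 0, 1, 1, 1])
j = np.array([0, 2, 1, 1, 2])
w = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

shape = (2, 3)
lin_idx = np.ravel_multi_index((i, j), shape)
acc = np.bincount(lin_idx, weights=w, minlength=np.prod(shape)).reshape(shape)
# acc == [[1., 0., 2.],
#         [0., 7., 5.]]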

sklearn: How to get column indexes deleted by model.transform from original dataframe

I'm trying to apply feature selection. The problem is that using the whole dataframe raises a memory error, so I've decided to cut my dataframe down to be able to apply the following feature selection:
# these are the original dataframes
X_full = df_train[df_train.columns[0:size]] # 76000(rows)*300(cols)
y_full = df_train[[len(df_train.columns)-1]] # 76000(rows)*1(col)
y_full contains 0s and 1s, and fewer than 5% of the values are 1. All the other columns contain only numbers, but we don't know what they mean.
# this is how I reduce the number of rows to 10%
test_frac = 0.10
count = len(X_full)
X = X_full.iloc[-int(count*test_frac):]
y = y_full.iloc[-int(count*test_frac):]
# then I use a linear model penalized with the L1 norm to reduce the dimensionality of the data
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)
print "X_new.shape", X_new.shape
print X_new
The problem is that I need the list of columns which were deleted, so that I can remove them from the original dataframe as well. How can I do that?
Sounds like you're looking for SelectFromModel.get_support(). Per the docs, it can return either (1) a boolean array with length equal to the number of all input features, or (2) the integer indices of the selected features:
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.
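A minimal sketch of how that maps back to the dataframe columns, assuming X and X_full are the pandas DataFrames from the question and model is the fitted SelectFromModel:
# boolean mask over the original feature columns: True means the feature was retained
kept_mask = model.get_support()

# column labels that SelectFromModel dropped
dropped_columns = X.columns[~kept_mask]

# remove the same columns from the original, full dataframe
X_full_reduced = X_full.drop(dropped_columns, axis=1)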
