Remove numpy concat from algo - python

I have a function called gen_data which makes a single pass through a list and constructs a 3D array. I then iterate across a list of lists, applying gen_data to each, and concatenate the results together.
fst = lambda x: x[0]
snd = lambda x: x[1]

def gen_data(data, p=0, batch_size=BATCH_SIZE, n_session=N_SESSION, return_target=True):
    x = np.zeros((batch_size, SEQ_LENGTH, vocab_size))
    y = np.zeros(batch_size)
    for n in range(batch_size):
        ptr = n
        for i in range(SEQ_LENGTH):
            x[n, i, char_to_ix[data[p + ptr + i]]] = 1.
        if return_target:
            y[n] = char_to_ix[data[p + ptr + SEQ_LENGTH]]
    return x, np.array(y, dtype='int32')

def batch_data(data):
    nest = [gen_data(datum) for datum in data]
    x = np.concatenate(list(map(fst, nest)))  # list(...) needed on Python 3, where map is lazy
    y = np.concatenate(list(map(snd, nest)))
    return (x, y)
What is the best way to combine these functions so I do not need to make multiple passes back through the data to concatenate the results?
To clarify, the goal would be to remove the need to zip/concat/splat/list-comp in general: to be able to initialize the x tensor to the correct dimensions and then iterate across each datum/SEQ_LENGTH/batch_size in a single pass.

Without testing things, here are a few quick fixes:
def gen_data(data, p=0, batch_size=BATCH_SIZE, n_session=N_SESSION, return_target=True):
    x = np.zeros((batch_size, SEQ_LENGTH, vocab_size))
    y = np.zeros(batch_size, dtype=int)  # initialize to the desired type
    for n in range(batch_size):
        ptr = n
        for i in range(SEQ_LENGTH):
            x[n, i, char_to_ix[data[p + ptr + i]]] = 1.
        if return_target:
            y[n] = char_to_ix[data[p + ptr + SEQ_LENGTH]]
    return x, y  # y is already an array; no need for np.array(y, dtype='int32')
nest = [gen_data(datum) for datum in data] produces, I think, [(x0, y0), (x1, y1), ...], where each x is 3d (n, m, y) and each y is 1d (n).
x = np.concatenate([n[0] for n in nest]) (I like this format over mapping) looks OK to me. Compared to all the list comprehension operations, concatenate is relatively cheap. Look at the guts of np.vstack, etc., to see how those use comprehensions along with concatenate.
A small example:
In [515]: def gen():
   .....:     return np.arange(8).reshape(2,4), np.arange(1,3)
   .....:
In [516]: gen()
Out[516]:
(array([[0, 1, 2, 3],
        [4, 5, 6, 7]]), array([1, 2]))
In [517]: nest=[gen() for _ in range(3)]
In [518]: nest
Out[518]:
[(array([[0, 1, 2, 3],
         [4, 5, 6, 7]]), array([1, 2])),
 (array([[0, 1, 2, 3],
         [4, 5, 6, 7]]), array([1, 2])),
 (array([[0, 1, 2, 3],
         [4, 5, 6, 7]]), array([1, 2]))]
In [519]: np.concatenate([x[0] for x in nest])
Out[519]:
array([[0, 1, 2, 3],
       [4, 5, 6, 7],
       [0, 1, 2, 3],
       [4, 5, 6, 7],
       [0, 1, 2, 3],
       [4, 5, 6, 7]])
In [520]: np.concatenate([x[1] for x in nest])
Out[520]: array([1, 2, 1, 2, 1, 2])
zip(*nest) effectively does a 'transpose' on a nested list, so the arrays could be constructed with:
In [532]: nest1=zip(*nest)
In [533]: np.concatenate(nest1[0])
Out[533]:
array([[0, 1, 2, 3],
       [4, 5, 6, 7],
       [0, 1, 2, 3],
       [4, 5, 6, 7],
       [0, 1, 2, 3],
       [4, 5, 6, 7]])
In [534]: np.concatenate(nest1[1])
Out[534]: array([1, 2, 1, 2, 1, 2])
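Note that in Python 3 zip returns a lazy iterator, so indexing nest1[0] as above no longer works; materialize it first:

nest1 = list(zip(*nest))  # Python 3: zip is lazy
x = np.concatenate(nest1[0])
y = np.concatenate(nest1[1])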
Still requires concatenates.
Since nest is a list of tuples, it could serve as input to a structured array:
In [524]: arr=np.array(nest,dtype=[('x','(2,4)int'),('y','(2,)int')])
In [525]: arr['x']
Out[525]:
array([[[0, 1, 2, 3],
        [4, 5, 6, 7]],

       [[0, 1, 2, 3],
        [4, 5, 6, 7]],

       [[0, 1, 2, 3],
        [4, 5, 6, 7]]])
In [526]: arr['y']
Out[526]:
array([[1, 2],
       [1, 2],
       [1, 2]])
Another possibility is to initialize x and y, and iterate. But you are already doing this in gen_data. The only new thing is that I'd be assigning larger blocks.
x = ...
y = ...
for i in range(...):
    x[i, ...], y[i] = gen(data[i])
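Applied to your case, a minimal sketch (assuming BATCH_SIZE, SEQ_LENGTH, vocab_size and gen_data are defined as in the question): preallocate the full arrays once and assign each gen_data result into its block of rows, so no concatenate is needed:

n = len(data)
x = np.zeros((n * BATCH_SIZE, SEQ_LENGTH, vocab_size))
y = np.zeros(n * BATCH_SIZE, dtype='int32')
for i, datum in enumerate(data):
    s = i * BATCH_SIZE  # start row of this datum's block
    x[s:s + BATCH_SIZE], y[s:s + BATCH_SIZE] = gen_data(datum)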
I like the comprehension solutions better, but I won't speculate on speeds.
In terms of speed, I think it's the low-level iteration in gen_data that is the time consumer. Concatenating larger blocks is relatively fast.
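If you want to check that for your own sizes, a quick sketch (the block shapes here are made up):

import numpy as np
import timeit

# 100 blocks of a made-up shape, standing in for gen_data results
blocks = [np.zeros((50, 20, 100)) for _ in range(100)]
print(timeit.timeit(lambda: np.concatenate(blocks), number=10))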
Another idea - since you are iterating over the rows of arrays within gen_data, how about passing views to that function, and iterating over those?
def gen_data(data, x=None, y=None):
    # accept arrays or make our own
    if x is None:
        x = np.zeros((3, 4), int)
    if y is None:
        y = np.zeros(3, int)
    for n in range(3):
        x[n, ...] = np.arange(4) + n
        y[n] = n
    return x, y
With no inputs, it generates arrays as before:
In [543]: gen_data(None)
Out[543]:
(array([[0, 1, 2, 3],
        [1, 2, 3, 4],
        [2, 3, 4, 5]]),
 array([0, 1, 2]))
Or initialize a pair, and iterate over views:
In [544]: x,y = np.zeros((9,4),int),np.zeros(9,int)
In [546]: for i in range(0,9,3):
   .....:     gen_data(None,x[i:i+3,...],y[i:i+3])
   .....:
In [547]: x
Out[547]:
array([[0, 1, 2, 3],
       [1, 2, 3, 4],
       [2, 3, 4, 5],
       [0, 1, 2, 3],
       [1, 2, 3, 4],
       [2, 3, 4, 5],
       [0, 1, 2, 3],
       [1, 2, 3, 4],
       [2, 3, 4, 5]])
In [548]: y
Out[548]: array([0, 1, 2, 0, 1, 2, 0, 1, 2])

Related

Shuffling two 2D tensors in PyTorch and maintaining same order correlation

Is it possible to shuffle two 2D tensors in PyTorch by their rows, but maintain the same order for both? I know you can shuffle a 2D tensor by rows with the following code:
a=a[torch.randperm(a.size()[0])]
To elaborate:
If I had 2 tensors
a = torch.tensor([[1, 1, 1, 1, 1],
                  [2, 2, 2, 2, 2],
                  [3, 3, 3, 3, 3]])
b = torch.tensor([[4, 4, 4, 4, 4],
                  [5, 5, 5, 5, 5],
                  [6, 6, 6, 6, 6]])
and ran them through some function/block of code that shuffles them randomly while maintaining the correlation, it would produce something like the following:
a = torch.tensor([[2, 2, 2, 2, 2],
                  [1, 1, 1, 1, 1],
                  [3, 3, 3, 3, 3]])
b = torch.tensor([[5, 5, 5, 5, 5],
                  [4, 4, 4, 4, 4],
                  [6, 6, 6, 6, 6]])
My current solution is converting to lists and using the random.shuffle() function, like below:
a_list = a.tolist()
b_list = b.tolist()
temp_list = list(zip(a_list, b_list))
random.shuffle(temp_list)  # shuffle
a_temp, b_temp = zip(*temp_list)
a_list, b_list = list(a_temp), list(b_temp)
# convert back to tensors
a = torch.tensor(a_list)
b = torch.tensor(b_list)
This takes quite a while and was wondering if there is a better way.
You mean
indices = torch.randperm(a.size()[0])
a = a[indices]
b = b[indices]
?
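As a runnable sketch of that idea: draw one permutation and index both tensors with it, which keeps the rows aligned:

import torch

a = torch.tensor([[1, 1, 1, 1, 1],
                  [2, 2, 2, 2, 2],
                  [3, 3, 3, 3, 3]])
b = torch.tensor([[4, 4, 4, 4, 4],
                  [5, 5, 5, 5, 5],
                  [6, 6, 6, 6, 6]])

indices = torch.randperm(a.size(0))  # one shared permutation
a = a[indices]
b = b[indices]  # same row order as a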

numpy.concatenate float64(101,1) and float64(101,)

I'm a MATLAB user who recently converted to Python. I am running a for loop that cuts a longer signal into individual trials, normalizes them to 100% of the trial, and then would like to have the trials listed horizontally in a single variable. My code is
RHipFE = np.empty([101, 1])
newlength = 101
for i in range(0, len(R0X)-1, 2):
    iHipFE = redataf.RHipFE[R0X[i]:R0X[i+1]]
    x = np.arange(0, len(iHipFE), 1)
    new_x = np.linspace(x.min(), x.max(), newlength)
    iHipFEn = interpolate.interp1d(x, iHipFE)(new_x)
    RHipFE = np.concatenate((RHipFE, iHipFEn), axis=1)
When I run this, I get the error "ValueError: all the input arrays must have same number of dimensions". Which I assume is because RHipFE is (101,1) while iHipFEn is (101,). Is the best solution to make iHipFEn (101,1)? If so, how does one do this in the above for loop?
Generally it's faster to collect arrays in a list, and use some form of concatenate once. List append is faster than concatenate:
In [51]: alist = []
In [52]: for i in range(3):
    ...:     alist.append(np.arange(i,i+5))
    ...:
In [53]: alist
Out[53]: [array([0, 1, 2, 3, 4]), array([1, 2, 3, 4, 5]), array([2, 3, 4, 5, 6])]
Various ways of joining:
In [54]: np.vstack(alist)
Out[54]:
array([[0, 1, 2, 3, 4],
       [1, 2, 3, 4, 5],
       [2, 3, 4, 5, 6]])
In [55]: np.column_stack(alist)
Out[55]:
array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4],
       [3, 4, 5],
       [4, 5, 6]])
In [56]: np.stack(alist, axis=1)
Out[56]:
array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4],
       [3, 4, 5],
       [4, 5, 6]])
In [57]: np.array(alist)
Out[57]:
array([[0, 1, 2, 3, 4],
       [1, 2, 3, 4, 5],
       [2, 3, 4, 5, 6]])
Internally, vstack, column_stack, stack expand the dimension of the components, and concatenate on the appropriate axis:
In [58]: np.concatenate([l[:,None] for l in alist],axis=1)
Out[58]:
array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4],
       [3, 4, 5],
       [4, 5, 6]])
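Applied to the question's loop, a sketch along these lines (with synthetic stand-ins for redataf.RHipFE and R0X, since those aren't shown): collect each (101,) trial in a list and stack once at the end. column_stack turns the 1-D trials into columns, which also sidesteps the (101,) vs (101,1) mismatch:

import numpy as np
from scipy import interpolate

newlength = 101
# synthetic segments standing in for redataf.RHipFE[R0X[i]:R0X[i+1]]
segments = [np.sin(np.linspace(0, np.pi, n)) for n in (87, 113, 95)]

trials = []
for seg in segments:
    x = np.arange(len(seg))
    new_x = np.linspace(x.min(), x.max(), newlength)
    trials.append(interpolate.interp1d(x, seg)(new_x))  # shape (101,)

RHipFE = np.column_stack(trials)  # shape (101, n_trials)

(The direct fix for the original error is iHipFEn[:, None], which turns a (101,) array into (101, 1) so it can be concatenated on axis 1.)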

Numpy: Imposing row dependent maximum on array

Suppose I have the following array:
a = [[1, 4, 2, 3],
     [3, 1, 5, 4],
     [4, 3, 1, 2]]
What I'd like to do is impose a maximum value on the array, but have that maximum vary by row. For instance if I wanted to limit the 1st and 3rd row to a maximum value of 3, and the 2nd row to a value of 4, I could create something like:
[[1, 3, 2, 3],
 [3, 1, 4, 4],
 [3, 3, 1, 2]]
Is there any better way than just looping over each row individually and setting it with 'nonzero'?
With numpy.clip (using the method version here):
a.clip(max=np.array([3, 4, 3])[:, None]) # np.clip(a, ...)
# array([[1, 3, 2, 3],
# [3, 1, 4, 4],
# [3, 3, 1, 2]])
Generalized:
def clip_2d_rows(a, maxs):
    maxs = np.asanyarray(maxs)
    if maxs.ndim == 1:
        maxs = maxs[:, np.newaxis]
    return np.clip(a, a_min=None, a_max=maxs)
You might be safer using the module-level function (np.clip) rather than the method (np.ndarray.clip). The former takes a_max as a parameter, while the latter names its parameter max, shadowing the builtin max, which is never a great idea.
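For example, with the array from the question:

a = np.array([[1, 4, 2, 3],
              [3, 1, 5, 4],
              [4, 3, 1, 2]])
clip_2d_rows(a, [3, 4, 3])
# array([[1, 3, 2, 3],
#        [3, 1, 4, 4],
#        [3, 3, 1, 2]])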
With masking -
In [50]: row_lims = np.array([3,4,3])
In [51]: np.where(a > row_lims[:,None], row_lims[:,None], a)
Out[51]:
array([[1, 3, 2, 3],
       [3, 1, 4, 4],
       [3, 3, 1, 2]])
With
>>> a
array([[1, 4, 2, 3],
       [3, 1, 5, 4],
       [4, 3, 1, 2]])
Say you have
>>> maxs = np.array([[3],[4],[3]])
>>> maxs
array([[3],
       [4],
       [3]])
What about doing
>>> a.clip(max=maxs)
array([[1, 3, 2, 3],
       [3, 1, 4, 4],
       [3, 3, 1, 2]])

How to replace each array element by 4 copies in Python?

How do I use numpy/Python array routines to do this?
E.g. if I have the array [[1, 2, 3, 4]], the output should be
[[1, 1, 2, 2],
 [1, 1, 2, 2],
 [3, 3, 4, 4],
 [3, 3, 4, 4]]
Thus, the output has double the row and column dimensions, and each element from the original array is repeated to fill a 2x2 block (four copies in total).
What I have so far is this
def operation(mat, step=2):
    result = np.array(mat, copy=True)
    result[::2, ::2] = mat
    return result
This gives me array
[[ 98.+0.j   0.+0.j  40.+0.j   0.+0.j]
 [  0.+0.j   0.+0.j   0.+0.j   0.+0.j]
 [ 29.+0.j   0.+0.j  54.+0.j   0.+0.j]
 [  0.+0.j   0.+0.j   0.+0.j   0.+0.j]]
for the input
[[98 40]
 [29 54]]
The array will always be of even dimensions.
Use np.repeat():
In [9]: A = np.array([[1, 2, 3, 4]])
In [10]: np.repeat(np.repeat(A, 2).reshape(2, 4), 2, 0)
Out[10]:
array([[1, 1, 2, 2],
       [1, 1, 2, 2],
       [3, 3, 4, 4],
       [3, 3, 4, 4]])
Explanation:
First off, you can repeat the array items:
In [30]: np.repeat(A, 3)
Out[30]: array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])
Then all you need is to reshape the result (based on your expected result this can be different):
In [32]: np.repeat(A, 3).reshape(2, 3*2)
Out[32]:
array([[1, 1, 1, 2, 2, 2],
       [3, 3, 3, 4, 4, 4]])
And now you should repeat the result along the first axis:
In [34]: np.repeat(np.repeat(A, 3).reshape(2, 3*2), 3, 0)
Out[34]:
array([[1, 1, 1, 2, 2, 2],
       [1, 1, 1, 2, 2, 2],
       [1, 1, 1, 2, 2, 2],
       [3, 3, 3, 4, 4, 4],
       [3, 3, 3, 4, 4, 4],
       [3, 3, 3, 4, 4, 4]])
Another approach could be with np.kron -
np.kron(a.reshape(-1,2),np.ones((2,2),dtype=int))
Basically, we reshape the input array into a 2D array, keeping the second axis of length 2. Then np.kron replicates the elements along both rows and columns, by a factor of 2 each, using that np.ones((2,2), dtype=int) array.
Sample run -
In [45]: a
Out[45]: array([7, 5, 4, 2, 8, 6])
In [46]: np.kron(a.reshape(-1,2),np.ones((2,2),dtype=int))
Out[46]:
array([[7, 7, 5, 5],
       [7, 7, 5, 5],
       [4, 4, 2, 2],
       [4, 4, 2, 2],
       [8, 8, 6, 6],
       [8, 8, 6, 6]])
If you would like to have 4 rows, use a.reshape(2,-1) instead.
A numpy solution is better, but you could also use plain iteration:
a = [[1, 2, 3, 4]]
v = iter(a[0])
b = []
for i in v:
    n = next(v)
    [b.append([i for k in range(2)] + [n for k in range(2)]) for j in range(2)]
print b
>>> [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]

Identify vectors with same value in one column with numpy in python

I have a large 2D array of vectors. I want to split this array into several arrays according to one of the vectors' elements or dimensions: I would like to receive one such small array for each run of consecutively identical values along that column. For example, considering the third dimension or column:
orig = np.array([[1, 2, 3],
                 [3, 4, 3],
                 [5, 6, 4],
                 [7, 8, 4],
                 [9, 0, 4],
                 [8, 7, 3],
                 [6, 5, 3]])
I want to turn this into three arrays consisting of rows 1,2 and 3,4,5 and 6,7:
>>> a
array([[1, 2, 3],
       [3, 4, 3]])
>>> b
array([[5, 6, 4],
       [7, 8, 4],
       [9, 0, 4]])
>>> c
array([[8, 7, 3],
       [6, 5, 3]])
I'm new to python and numpy. Any help would be greatly appreciated.
Regards
Mat
Edit: I reformatted the arrays to clarify the problem
Using np.split (outputs shown for the reformatted orig above):
>>> a, b, c = np.split(orig, np.where(orig[:-1, 2] != orig[1:, 2])[0]+1)
>>> a
array([[1, 2, 3],
       [3, 4, 3]])
>>> b
array([[5, 6, 4],
       [7, 8, 4],
       [9, 0, 4]])
>>> c
array([[8, 7, 3],
       [6, 5, 3]])
Nothing fancy here, but this good old-fashioned loop should do the trick
import numpy as np

a = np.array([[1, 2, 3],
              [1, 2, 3],
              [1, 2, 4],
              [1, 2, 4],
              [1, 2, 4],
              [1, 2, 3],
              [1, 2, 3]])
groups = []
rows = a[0]
prev = a[0][-1]  # here I assume that the grouping is based on the last column; change the index accordingly if that is not the case
for row in a[1:]:
    if row[-1] == prev:
        rows = np.vstack((rows, row))
    else:
        groups.append(rows)
        rows = [row]
        prev = row[-1]
groups.append(rows)
print groups
## [array([[1, 2, 3],
##         [1, 2, 3]]),
##  array([[1, 2, 4],
##         [1, 2, 4],
##         [1, 2, 4]]),
##  array([[1, 2, 3],
##         [1, 2, 3]])]
If a looks like this:
array([[1, 1, 2, 3],
       [2, 1, 2, 3],
       [3, 1, 2, 4],
       [4, 1, 2, 4],
       [5, 1, 2, 4],
       [6, 1, 2, 3],
       [7, 1, 2, 3]])
then this
col = a[:, -1]
indices = np.where(col[:-1] != col[1:])[0] + 1
indices = np.concatenate(([0], indices, [len(a)]))
res = [a[start:end] for start, end in zip(indices[:-1], indices[1:])]
print(res)
results in:
[array([[1, 1, 2, 3],
       [2, 1, 2, 3]]), array([[3, 1, 2, 4],
       [4, 1, 2, 4],
       [5, 1, 2, 4]]), array([[6, 1, 2, 3],
       [7, 1, 2, 3]])]
Update: np.split() is much nicer. No need to add the first and last indices:
col = a[:, -1]
indices = np.where(col[:-1] != col[1:])[0] + 1
res = np.split(a, indices)
