In-place insertion into a list (or array) - Python

I'm running a script in Python, where I need to insert new numbers into an array (or list) at certain index locations. The problem is that obviously as I insert new numbers, the index locations are invalidated. Is there a clever way to insert the new values at the index locations all at once? Or is the only solution to increment the index number (first value of the pair) as I add?
Sample test code snippets:
original_list = [0, 1, 2, 3, 4, 5, 6, 7]
insertion_indices = [1, 4, 5]
new_numbers = [8, 9, 10]
pairs = [(insertion_indices[i], new_numbers[i]) for i in range(len(insertion_indices))]
for pair in pairs:
    original_list.insert(pair[0], pair[1])
Results in:
[0, 8, 1, 2, 9, 10, 3, 4, 5, 6, 7]
whereas I want:
[0, 8, 1, 2, 3, 9, 4, 10, 5, 6, 7]

Insert those values in backwards order. Like so:
original_list = [0, 1, 2, 3, 4, 5, 6, 7]
insertion_indices = [1, 4, 5]
new_numbers = [8, 9, 10]
new = sorted(zip(insertion_indices, new_numbers), reverse=True)
for i, x in new:
    original_list.insert(i, x)
The reason this works is based on the following observation:
Inserting a value at the beginning of the list offsets the indexes of all other values by 1, whereas inserting a value at the end leaves the other indexes unchanged. As a consequence, if you start by inserting the value with the largest index (10) and continue "backwards", you never have to update any indexes.

Since the question is tagged NumPy and the input is described as a list/array, you can simply use the built-in numpy.insert -
np.insert(original_list, insertion_indices, new_numbers)
To roll out the same idea as a custom implementation (mostly for performance), we could use a boolean mask, like so -
def insert_numbers(original_list, insertion_indices, new_numbers):
    # Length of output array
    n = len(original_list) + len(insertion_indices)

    # Setup mask array to select between new and old numbers
    mask = np.ones(n, dtype=bool)
    mask[insertion_indices + np.arange(len(insertion_indices))] = 0

    # Setup output array for assigning values from old and new lists/arrays
    # by using mask and its inverted version
    out = np.empty(n, dtype=int)
    out[mask] = original_list
    out[~mask] = new_numbers
    return out
For list output, append .tolist().
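To see why the mask positions are insertion_indices + np.arange(len(insertion_indices)): each inserted value lands at its original target index plus the number of new values already placed before it. A small walkthrough on the sample data (a sketch, not part of the original answer):
import numpy as np

original_list = [0, 1, 2, 3, 4, 5, 6, 7]
insertion_indices = np.array([1, 4, 5])
new_numbers = [8, 9, 10]

n = len(original_list) + len(insertion_indices)    # 11
new_positions = insertion_indices + np.arange(3)   # [1, 5, 7]

mask = np.ones(n, dtype=bool)
mask[new_positions] = False      # False marks the slots for the new numbers

out = np.empty(n, dtype=int)
out[mask] = original_list        # old values fill the True slots in order
out[~mask] = new_numbers         # new values fill the False slots in order
print(out.tolist())              # [0, 8, 1, 2, 3, 9, 4, 10, 5, 6, 7]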
Sample run -
In [83]: original_list = [0, 1, 2, 3, 4, 5, 6, 7]
...: insertion_indices = [1, 4, 5]
...: new_numbers = [8, 9, 10]
...:
In [85]: np.insert(original_list, insertion_indices, new_numbers)
Out[85]: array([ 0, 8, 1, 2, 3, 9, 4, 10, 5, 6, 7])
In [86]: np.insert(original_list, insertion_indices, new_numbers).tolist()
Out[86]: [0, 8, 1, 2, 3, 9, 4, 10, 5, 6, 7]
Runtime test on a 10000x scaled dataset -
In [184]: original_list = range(70000)
...: insertion_indices = np.sort(np.random.choice(len(original_list), 30000, replace=0)).tolist()
...: new_numbers = np.random.randint(0,10, len(insertion_indices)).tolist()
...: out1 = np.insert(original_list, insertion_indices, new_numbers)
...: out2 = insert_numbers(original_list, insertion_indices, new_numbers)
...: print np.allclose(out1, out2)
True
In [185]: %timeit np.insert(original_list, insertion_indices, new_numbers)
100 loops, best of 3: 5.37 ms per loop
In [186]: %timeit insert_numbers(original_list, insertion_indices, new_numbers)
100 loops, best of 3: 4.8 ms per loop
Let's test out with arrays as inputs -
In [190]: original_list = np.arange(70000)
...: insertion_indices = np.sort(np.random.choice(len(original_list), 30000, replace=0))
...: new_numbers = np.random.randint(0,10, len(insertion_indices))
...: out1 = np.insert(original_list, insertion_indices, new_numbers)
...: out2 = insert_numbers(original_list, insertion_indices, new_numbers)
...: print np.allclose(out1, out2)
True
In [191]: %timeit np.insert(original_list, insertion_indices, new_numbers)
1000 loops, best of 3: 1.48 ms per loop
In [192]: %timeit insert_numbers(original_list, insertion_indices, new_numbers)
1000 loops, best of 3: 1.07 ms per loop
The performance shoots up further because there's no overhead from converting the list inputs to arrays.

Add this before your for loop:
for i in range(len(insertion_indices)):
    insertion_indices[i] += i
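Put together with the question's data (shifting the indices before pairing them up), a quick sketch:
original_list = [0, 1, 2, 3, 4, 5, 6, 7]
insertion_indices = [1, 4, 5]
new_numbers = [8, 9, 10]

# Shift each index by the number of insertions that will land before it.
for i in range(len(insertion_indices)):
    insertion_indices[i] += i

for index, number in zip(insertion_indices, new_numbers):
    original_list.insert(index, number)

print(original_list)  # [0, 8, 1, 2, 3, 9, 4, 10, 5, 6, 7]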

Increment each insertion index by the number of values already inserted before it, i.e. offset the i-th index by i:
original_list = [0, 1, 2, 3, 4, 5, 6, 7]
insertion_indices = [1, 4, 5]
new_numbers = [8, 9, 10]
for i in range(len(insertion_indices)):
    original_list.insert(insertion_indices[i] + i, new_numbers[i])
print(original_list)
Output
[0, 8, 1, 2, 3, 9, 4, 10, 5, 6, 7]
#Required list
[0, 8, 1, 2, 3, 9, 4, 10, 5, 6, 7]

Less elegant, but it works too: store the pairs in a NumPy ndarray, so the indices of the pairs not yet inserted can be incremented in place after each insert:
import numpy as np
original_list = [0, 1, 2, 3, 4, 5, 6, 7]
insertion_indices = [1, 4, 5]
new_numbers = [8, 9, 10]
pairs = np.array([[insertion_indices[i], new_numbers[i]] for i in range(len(insertion_indices))])
for pair in pairs:
    original_list.insert(pair[0], pair[1])
    pairs[:, 0] += 1
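A quick check on the sample data (not part of the original answer) confirms the result; the in-place increment modifies pairs itself, so each later iteration reads an already-shifted index:
print(original_list)  # [0, 8, 1, 2, 3, 9, 4, 10, 5, 6, 7]
print(pairs[:, 0])    # [4 7 8] - every stored index was bumped once per insert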


How to create a numpy array from 2 lists

I have counts of integer frequencies that I am trying to get into an array. L1 holds the integers from 1 to 9 (but only those that actually occur), which I want to use as the array index. L2 holds the frequency of each integer, which I want entered in the array.
L1 = [1,3,4,5,6,7,8,9] #no twos occurred in the data so 2 is not in L1
L2 = [6,7,1,2,8,4,2,1]
The output I want to get is: A1 = [[6,0,7],[1,2,8],[4,2,1]]
I feel like I'm missing something but this is my last attempt:
for num in L1 and count in L2:
    a1[:num] = L2[:count]
Make the lists into arrays for ease of use:
In [286]: L1 = np.array([1,3,4,5,6,7,8,9])
...: L2 = np.array([6,7,1,2,8,4,2,1])
Make a place to put values:
In [287]: res = np.zeros(10,int)
In [288]: res[L1]
Out[288]: array([0, 0, 0, 0, 0, 0, 0, 0])
In [289]: res[L1]=L2
In [290]: res
Out[290]: array([0, 6, 0, 7, 1, 2, 8, 4, 2, 1])
oops, offset a bit.
In [291]: res = np.zeros(10,int)
In [292]: res[L1-1]=L2
In [293]: res
Out[293]: array([6, 0, 7, 1, 2, 8, 4, 2, 1, 0])
correct the initial size, and reshape:
In [294]: res = np.zeros(9,int)
In [295]: res[L1-1]=L2
In [296]: res.reshape(3,3)
Out[296]:
array([[6, 0, 7],
[1, 2, 8],
[4, 2, 1]])
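Putting the final steps together as one snippet (a sketch of the approach above):
import numpy as np

L1 = np.array([1, 3, 4, 5, 6, 7, 8, 9])
L2 = np.array([6, 7, 1, 2, 8, 4, 2, 1])

# One slot per possible integer 1..9: scatter the counts in, then reshape.
res = np.zeros(9, dtype=int)
res[L1 - 1] = L2
A1 = res.reshape(3, 3)
print(A1)
# [[6 0 7]
#  [1 2 8]
#  [4 2 1]]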

Removing duplicate sets of items in a sequence

I have a list, for example data = [0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5, 6], and I need to remove sets of items from it (max. length of a set k = 3), but only when the sets follow each other. data includes three such cases: [4, 4], [5, 8, 5, 8], and [1, 5, 6, 1, 5, 6], so the cleaned-up list should look like [0, 4, 2, 5, 8, 7, 1, 5, 6].
I tried this code and it works:
data = np.array([0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5, 6])
for k in range(1, 3):
    kth_difference = data[k:] - data[:-k]
    ids = np.where(kth_difference)
    data = data[ids]
But if I change the input list to something like data = [0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5] (breaking the last set), the new output list is [0, 4, 2, 5, 8, 7, 1, 5], which has lost the 6 at the end.
What is a solution for this task? How can I make it work for any k?
You added a numpy tag, so let's use that to our advantage. Start with an array:
data = np.array([0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5, 6])
It's easy to make a mask of elements up to length n that follow each other:
mask_1 = data[1:] == data[:-1]
mask_2 = data[2:] == data[:-2]
mask_3 = data[3:] == data[:-3]
The first mask has ones at each location where the next element is the same. The second mask will have a one wherever an element is the same as something two elements ahead, so you need to find runs of 2 elements at a time. The same applies to the third mask. Filtering of the mask needs to take into account that you want to include the possibility of partial matches at the end. You can effectively extend the mask with k-1 elements to accomplish this:
delta = np.diff(np.r_[False, mask_3, np.ones(2, dtype=bool), False].view(np.int8))
edges = np.flatnonzero(delta).reshape(-1, 2)
lengths = edges[:, 1] - edges[:, 0]
delta[edges[lengths < 3, :]] = 0
mask = delta[:-3].cumsum(dtype=np.int8).view(bool)
In this arrangement, mask masks the duplicated three elements that constitute a duplicated group. It may contain fewer than three elements if the replicated portion is truncated. That ensures that you get to keep all the elements of partial duplicates.
For this exercise, I will assume that you don't have strange overlaps between different levels. I.e., each part of the array that belongs to a repeated segment belongs to exactly one possible repeated segment. Otherwise, the mask processing becomes much more complex.
Here is a function to wrap all this together:
def clean_mask(mask, k):
    delta = np.diff(np.r_[False, mask, np.ones(k - 1, bool), False].view(np.int8))
    edges = np.flatnonzero(delta).reshape(-1, 2)
    lengths = edges[:, 1] - edges[:, 0]
    delta[edges[lengths < k, :]] = 0
    return delta[:-k].cumsum(dtype=np.int8).view(bool)

def dedup(data, kmax):
    data = np.asarray(data)
    kmax = min(kmax, data.size // 2)
    remove = np.zeros(data.shape, dtype=bool)
    for k in range(kmax, 0, -1):
        remove[k:] |= clean_mask(data[k:] == data[:-k], k)
    return data[~remove]
Outputs for the two test cases you show in the question:
>>> dedup([0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5, 6], 3)
array([0, 4, 2, 5, 8, 7, 1, 5, 6])
>>> dedup([0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5], 3)
array([0, 4, 2, 5, 8, 7, 1, 5, 6])
Timing
A quick benchmark shows that the numpy solution is also much faster than pure python:
for n in range(2, 7):
    x = np.random.randint(0, 10, 10**n)
    y = list(x)
    %timeit dedup(x, 3)
    %timeit remdup(y)
Results:
# 100 elements
dedup: 332 µs ± 5.01 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
remdup: 36.9 ms ± 152 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# 1000 elements
dedup: 412 µs ± 5.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
remdup: > 1 minute
Caveats
This solution makes no attempt to cover corner cases. For example: data = [2, 2, 2, 2, 2, 2, 2] or similar, where multiple levels of k can overlap.
Here is an attempt, which is also my very first time using break and for/else:
def remdup(l):
    while True:
        for (i, j) in ((i, j) for i in range(0, len(l)) for j in range(i + 1, len(l) + 1)):
            if l[i:j] == l[j:j + (j - i)]:
                l = l[:j] + l[j + j - i:]
                break  # restart
        else:  # if no duplicate was found
            break  # halt
    return l
print(remdup([0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5, 6]))
# [0, 4, 2, 5, 8, 7, 1, 5, 6]
How this works:
iterate over all substrings l[i:j]:
if l[i:j] is a duplicate of the next substring of the same length, l[j:j+(j-i)]:
remove l[j:j+(j-i)]
break the iteration and restart it because the list has changed
if no duplicate was found, return the list
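For instance, on the sample input the scan finds and removes three duplicated blocks, restarting after each one (a small trace, with indices referring to the list as it shrinks):
l = [0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5, 6]
# 1st pass: i=1, j=2 -> l[1:2] == l[2:3] == [4]         -> drop l[2:3]
# 2nd pass: i=3, j=5 -> l[3:5] == l[5:7] == [5, 8]      -> drop l[5:7]
# 3rd pass: i=6, j=9 -> l[6:9] == l[9:12] == [1, 5, 6]  -> drop l[9:12]
# 4th pass: no duplicate found -> return [0, 4, 2, 5, 8, 7, 1, 5, 6]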
I recommend avoiding break and for/else; they're ugly and make for deceptive code.

Cycling Slicing in Python

I came up with this question while trying to apply a Caesar cipher to a matrix, with different shift values for each row, i.e. given a matrix X
array([[1, 0, 8],
[5, 1, 4],
[2, 1, 1]])
with shift values of S = array([0, 1, 1]), the output needs to be
array([[1, 0, 8],
[1, 4, 5],
[1, 1, 2]])
This is easy to implement by the following code:
Y = []
for i in range(X.shape[0]):
    if S[i] > 0:
        Y.append(X[i, S[i]:].tolist() + X[i, :S[i]].tolist())
    else:
        Y.append(X[i, :].tolist())
Y = np.array(Y)
This is a left-cycle-shift. I wonder how to do this in a more efficient way using numpy arrays?
Update: This example applies the shift to the columns of a matrix. Suppose that we have a 3D array
array([[[8, 1, 8],
[8, 6, 2],
[5, 3, 7]],
[[4, 1, 0],
[5, 9, 5],
[5, 1, 7]],
[[9, 8, 6],
[5, 1, 0],
[5, 5, 4]]])
Then, the cyclic right shift of S = array([0, 0, 1]) over the columns leads to
array([[[8, 1, 7],
[8, 6, 8],
[5, 3, 2]],
[[4, 1, 7],
[5, 9, 0],
[5, 1, 5]],
[[9, 8, 4],
[5, 1, 6],
[5, 5, 0]]])
Approach #1 : Use modulus to implement the cyclic pattern and get the new column indices and then simply use advanced-indexing to extract the elements, giving us a vectorized solution, like so -
def cyclic_slice(X, S):
    m, n = X.shape
    idx = np.mod(np.arange(n) + S[:, None], n)
    return X[np.arange(m)[:, None], idx]
Approach #2 : We can also leverage the power of strides for further speedup. The idea would be to concatenate the sliced off portion from the start and append it at the end, then create sliding windows of lengths same as the number of cols and finally index into the appropriate window numbers to get the same rolled over effect. The implementation would be like so -
def cyclic_slice_strided(X, S):
    X2 = np.column_stack((X, X[:, :-1]))
    s0, s1 = X2.strides
    strided = np.lib.stride_tricks.as_strided

    m, n1 = X.shape
    n2 = X2.shape[1]
    X2_3D = strided(X2, shape=(m, n2 - n1 + 1, n1), strides=(s0, s1, s1))
    return X2_3D[np.arange(len(S)), S]
Sample run -
In [34]: X
Out[34]:
array([[1, 0, 8],
[5, 1, 4],
[2, 1, 1]])
In [35]: S
Out[35]: array([0, 1, 1])
In [36]: cyclic_slice(X, S)
Out[36]:
array([[1, 0, 8],
[1, 4, 5],
[1, 1, 2]])
Runtime test -
In [75]: X = np.random.rand(10000,100)
...: S = np.random.randint(0,100,(10000))
# #Moses Koledoye's soln
In [76]: %%timeit
...: Y = []
...: for i, x in zip(S, X):
...:     Y.append(np.roll(x, -i))
10 loops, best of 3: 108 ms per loop
In [77]: %timeit cyclic_slice(X, S)
100 loops, best of 3: 14.1 ms per loop
In [78]: %timeit cyclic_slice_strided(X, S)
100 loops, best of 3: 4.3 ms per loop
Adaptation for the 3D case
Adapting approach #1 for the 3D case, we would have -
shift = 'left'
axis = 1 # axis along which S is to be used (axis=1 for rows)
n = X.shape[axis]
if shift == 'left':
    Sa = S
else:
    Sa = -S
# For rows
idx = np.mod(np.arange(n)[:,None] + Sa,n)
out = X[:,idx, np.arange(len(S))]
# For columns
idx = np.mod(Sa[:,None] + np.arange(n),n)
out = X[:,np.arange(len(S))[:,None], idx]
# For axis=0
idx = np.mod(np.arange(n)[:,None] + Sa,n)
out = X[idx, np.arange(len(S))]
There could be a way to have a generic solution for a generic axis, but I will keep it to this point.
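As a rough sketch of what such a generic version could look like (not part of the original answer; cyclic_slice_generic, pair_axis and shift_axis are names invented here), np.take_along_axis (NumPy 1.15+) lets you express the same modulus-index trick for any pairing of axes:
import numpy as np

def cyclic_slice_generic(X, S, shift_axis, pair_axis):
    # Left-cycle every 1-D lane of X taken along `shift_axis` by the amount
    # in S, where S holds one shift per index along `pair_axis`.
    # (Pass -S for a right shift.)
    X = np.asarray(X)
    S = np.asarray(S)
    shift_axis %= X.ndim
    pair_axis %= X.ndim
    n = X.shape[shift_axis]

    # (len(S), n) table of source positions: row k realises a left shift by S[k].
    idx = (np.arange(n) + S[:, None]) % n

    # Reshape so len(S) sits on pair_axis and n on shift_axis; the remaining
    # axes are size 1 and broadcast inside take_along_axis.
    if pair_axis > shift_axis:
        idx = idx.T
    shape = [1] * X.ndim
    shape[pair_axis] = len(S)
    shape[shift_axis] = n
    idx = idx.reshape(shape)
    return np.take_along_axis(X, idx, axis=shift_axis)

# 2D rows, matching cyclic_slice above:
X = np.array([[1, 0, 8], [5, 1, 4], [2, 1, 1]])
S = np.array([0, 1, 1])
print(cyclic_slice_generic(X, S, shift_axis=1, pair_axis=0))
# [[1 0 8]
#  [1 4 5]
#  [1 1 2]]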
You could shift each row using np.roll and use the new rows to build the output array:
Y = []
for i, x in zip(S, X):
    Y.append(np.roll(x, -i))
print(np.array(Y))
array([[1, 0, 8],
[1, 4, 5],
[1, 1, 2]])
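Equivalently, as a one-liner (same idea, just written as a comprehension):
Y = np.array([np.roll(x, -i) for i, x in zip(S, X)])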

How to select value from array that is closest to value in array using vectorization?

I have an array of values that I want to replace with values from an array of choices, based on which choice is linearly closest.
The catch is that the size of choices is defined at runtime.
import numpy as np
a = np.array([[0, 0, 0], [4, 4, 4], [9, 9, 9]])
choices = np.array([1, 5, 10])
If choices were static in size, I would simply use np.where:
d = np.where(np.abs(a - choices[0]) > np.abs(a - choices[1]),
np.where(np.abs(a - choices[0]) > np.abs(a - choices[2]), choices[0], choices[2]),
np.where(np.abs(a - choices[1]) > np.abs(a - choices[2]), choices[1], choices[2]))
To get the output:
>>> d
[[1, 1, 1], [5, 5, 5], [10, 10, 10]]
Is there a way to do this more dynamically while still preserving the vectorization?
Subtract choices from a, find the index of the minimum of the result, substitute.
a = np.array([[0, 0, 0], [4, 4, 4], [9, 9, 9]])
choices = np.array([1, 5, 10])
b = a[:,:,None] - choices
np.absolute(b,b)
i = np.argmin(b, axis = -1)
a = choices[i]
print a
>>>
[[ 1 1 1]
[ 5 5 5]
[10 10 10]]
a = np.array([[0, 3, 0], [4, 8, 4], [9, 1, 9]])
choices = np.array([1, 5, 10])
b = a[:,:,None] - choices
np.absolute(b,b)
i = np.argmin(b, axis = -1)
a = choices[i]
print a
>>>
[[ 1 1 1]
[ 5 10 5]
[10 1 10]]
>>>
The extra dimension was added to a so that each element of choices would be subtracted from each element of a. choices was broadcast against a in the third dimension; this link has a decent graphic. b.shape is (3,3,3). EricsBroadcastingDoc is a pretty good explanation and has a graphic 3-d example at the end.
For the second example:
>>> print b
[[[ 1 5 10]
[ 2 2 7]
[ 1 5 10]]
[[ 3 1 6]
[ 7 3 2]
[ 3 1 6]]
[[ 8 4 1]
[ 0 4 9]
[ 8 4 1]]]
>>> print i
[[0 0 0]
[1 2 1]
[2 0 2]]
>>>
The final assignment uses an Index Array or Integer Array Indexing.
In the second example, notice that there was a tie for element a[0,1]: either one or five could have been substituted.
To explain wwii's excellent answer in a little more detail:
The idea is to create a new dimension which does the job of comparing each element of a to each element in choices using numpy broadcasting. This is easily done for an arbitrary number of dimensions in a using the ellipsis syntax:
>>> b = np.abs(a[..., np.newaxis] - choices)
>>> b
array([[[ 1, 5, 10],
[ 1, 5, 10],
[ 1, 5, 10]],
[[ 3, 1, 6],
[ 3, 1, 6],
[ 3, 1, 6]],
[[ 8, 4, 1],
[ 8, 4, 1],
[ 8, 4, 1]]])
Taking argmin along the axis you just created (the last axis, with label -1) gives you the desired index in choices that you want to substitute:
>>> np.argmin(b, axis=-1)
array([[0, 0, 0],
[1, 1, 1],
[2, 2, 2]])
Which finally allows you to choose those elements from choices:
>>> d = choices[np.argmin(b, axis=-1)]
>>> d
array([[ 1, 1, 1],
[ 5, 5, 5],
[10, 10, 10]])
For a non-symmetric shape:
Let's say a had shape (2, 5):
>>> a = np.arange(10).reshape((2, 5))
>>> a
array([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])
Then you'd get:
>>> b = np.abs(a[..., np.newaxis] - choices)
>>> b
array([[[ 1, 5, 10],
[ 0, 4, 9],
[ 1, 3, 8],
[ 2, 2, 7],
[ 3, 1, 6]],
[[ 4, 0, 5],
[ 5, 1, 4],
[ 6, 2, 3],
[ 7, 3, 2],
[ 8, 4, 1]]])
This is hard to read, but what it's saying is, b has shape:
>>> b.shape
(2, 5, 3)
The first two dimensions came from the shape of a, which is also (2, 5). The last dimension is the one you just created. To get a better idea:
>>> b[:, :, 0] # = abs(a - 1)
array([[1, 0, 1, 2, 3],
[4, 5, 6, 7, 8]])
>>> b[:, :, 1] # = abs(a - 5)
array([[5, 4, 3, 2, 1],
[0, 1, 2, 3, 4]])
>>> b[:, :, 2] # = abs(a - 10)
array([[10, 9, 8, 7, 6],
[ 5, 4, 3, 2, 1]])
Note how b[:, :, i] is the absolute difference between a and choices[i], for each i = 0, 1, 2.
Hope that helps explain this a little more clearly.
I love broadcasting and would have gone that way myself too. But, with large arrays, I would like to suggest another approach with np.searchsorted that keeps it memory efficient and thus achieves performance benefits, like so -
def searchsorted_app(a, choices):
    lidx = np.searchsorted(choices, a, 'left').clip(max=choices.size - 1)
    ridx = (np.searchsorted(choices, a, 'right') - 1).clip(min=0)
    cl = np.take(choices, lidx)  # Or choices[lidx]
    cr = np.take(choices, ridx)  # Or choices[ridx]
    mask = np.abs(a - cl) > np.abs(a - cr)
    cl[mask] = cr[mask]
    return cl
Please note that if the elements in choices are not sorted, we need to pass the additional argument sorter to np.searchsorted.
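For instance (a sketch, not part of the original answer, reusing a, choices and searchsorted_app from above): either sort choices once up front, or keep it unsorted and hand np.searchsorted a sorter permutation, mapping the returned positions back through it:
import numpy as np

# Option 1: sort once, then reuse searchsorted_app as-is.
out = searchsorted_app(a, np.sort(choices))

# Option 2: keep choices unsorted and pass a sorter.
order = np.argsort(choices)
lidx = np.searchsorted(choices, a, 'left', sorter=order).clip(max=choices.size - 1)
ridx = (np.searchsorted(choices, a, 'right', sorter=order) - 1).clip(min=0)
cl = choices[order[lidx]]           # nearest candidate from the left
cr = choices[order[ridx]]           # nearest candidate from the right
mask = np.abs(a - cl) > np.abs(a - cr)
cl[mask] = cr[mask]                 # cl now holds the closest choice per element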
Runtime test -
In [160]: # Setup inputs
...: a = np.random.rand(100,100)
...: choices = np.sort(np.random.rand(100))
...:
In [161]: def broadcasting_app(a, choices): # @wwii's solution
     ...:     return choices[np.argmin(np.abs(a[:,:,None] - choices),-1)]
     ...:
In [162]: np.allclose(broadcasting_app(a,choices),searchsorted_app(a,choices))
Out[162]: True
In [163]: %timeit broadcasting_app(a, choices)
100 loops, best of 3: 9.3 ms per loop
In [164]: %timeit searchsorted_app(a, choices)
1000 loops, best of 3: 1.78 ms per loop
Related post : Find elements of array one nearest to elements of array two

Replace elements of a multidimensional numpy array according to a rule

Let's say we have a numpy array:
import numpy as np
arr = np.array([[ 5, 9],[14, 23],[26, 4],[ 5, 26]])
I want to replace each element with its number of occurrences,
unique0, counts0 = np.unique(arr.flatten(), return_counts=True)
print (unique0, counts0)
(array([ 4, 5, 9, 14, 23, 26]), array([1, 2, 1, 1, 1, 2]))
so 4 should be replaced by 1, 5 by 2, etc to get:
[[ 2, 1],[1, 1],[2, 1],[ 2, 2]]
Is there any way to achieve this in numpy?
Use the other optional argument return_inverse with np.unique to tag all elements based on their uniqueness and then map those with the counts to give us our desired output, like so -
_, idx, counts0 = np.unique(arr, return_counts=True,return_inverse=True)
out = counts0[idx].reshape(arr.shape)
Sample run -
In [100]: arr
Out[100]:
array([[ 5, 9],
[14, 23],
[26, 4],
[ 5, 26]])
In [101]: _, idx, counts0 = np.unique(arr, return_counts=True,return_inverse=True)
In [102]: counts0[idx].reshape(arr.shape)
Out[102]:
array([[2, 1],
[1, 1],
[2, 1],
[2, 2]])
This is an alternative solution, since @Divakar's answer does not work in NumPy versions < 1.9 (return_counts was added in 1.9):
In [1]: import numpy as np
In [2]: arr = np.array([[ 5, 9],[14, 23],[26, 4],[ 5, 26]])
In [3]: np.bincount(arr.flatten())[arr]
Out[3]:
array([[2, 1],
[1, 1],
[2, 1],
[2, 2]])
To test for speed (with 10000 random integers):
def replace_unique(arr):
    _, idx, counts0 = np.unique(arr, return_counts=True, return_inverse=True)
    return counts0[idx].reshape(arr.shape)

def replace_bincount(arr):
    return np.bincount(arr.flatten())[arr]
arr = np.random.randint(1, 31, size=[10000, 2])
%timeit -n 1000 replace_bincount(arr)
# 1000 loops, best of 3: 68.3 µs per loop
%timeit -n 1000 replace_unique(arr)
# 1000 loops, best of 3: 922 µs per loop
so the bincount method is ~14 times faster than the unique method.
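A caveat worth noting (not covered in the answers above): np.bincount only accepts non-negative integers and allocates a count array of length max(arr) + 1, whereas the np.unique route also handles negative, float or very large values. A quick consistency check using the two functions defined above:
arr = np.random.randint(1, 31, size=(10000, 2))
assert (replace_bincount(arr) == replace_unique(arr)).all()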
