What would be the fastest way to break a list of random numbers into sets of two, alternately flipping every pair? For example:
pleatedTuple=(0, 1, 3, 2, 4, 5, 7, 6, 8, 9)
What I want in one operation:
flatPairs=[[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
Items will be random single digits; I only made them sequential for readability. I need to do thousands of these in a run, so speed is a priority. Python 3.6.4.
Thank you for any ideas; I'm stumped by this one.
Option 1
As long as this is pairs we're talking about, let's try a list comprehension:
flatPairs = [
    [x, y] if i % 2 == 0 else [y, x]
    for i, (x, y) in enumerate(zip(pleatedTuple[::2], pleatedTuple[1::2]))
]
You can also build this from scratch using a loop:
flatPairs = []
for i, (x, y) in enumerate(zip(pleatedTuple[::2], pleatedTuple[1::2])):
    if i % 2 == 0:
        flatPairs.append([x, y])
    else:
        flatPairs.append([y, x])
print(flatPairs)
[[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
Option 2
Use Ned Batchelder's chunking subroutine chunks and flip every alternate sublist:
# https://stackoverflow.com/a/312464/4909087
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]
Call chunks and exhaust the returned generator to get a list of pairs:
flatPairs = list(chunks(pleatedTuple, n=2))
Now, reverse every other pair with a loop.
for i in range(1, len(flatPairs), 2):
    flatPairs[i] = flatPairs[i][::-1]
print(flatPairs)
[(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]
Note that in this case, the result is a list of tuples.
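If you need lists like the flatPairs in the question rather than tuples, one extra pass converts them (a small addition, not part of the original answer):
flatPairs = [list(p) for p in flatPairs]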
Performance
(of my answers only)
I'm interested in performance, so I've decided to time my answers:
# Setup
pleatedTuple = tuple(range(100000))
# List comp
21.1 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Loop
20.8 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# chunks
26 ms ± 2.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
For more speed, you can replace the chunks generator with a faster alternative:
flatPairs = list(zip(pleatedTuple[::2], pleatedTuple[1::2]))
And then reverse with a loop as required. This brings the time down considerably:
13.1 ms ± 994 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
A 2x speedup, phew! Beware though, this isn't nearly as memory efficient as the generator would be...
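For reference, here are the two faster steps put together as one sketch (the pairs stay tuples, with every other one reversed in place):
flatPairs = list(zip(pleatedTuple[::2], pleatedTuple[1::2]))
for i in range(1, len(flatPairs), 2):
    flatPairs[i] = flatPairs[i][::-1]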
You can use the standard grouping idiom and zip it with the length:
>>> by_pairs_index = zip(range(len(pleatedTuple)), *[iter(pleatedTuple)]*2)
>>> [[b, a] if i%2 else [a,b] for i,a,b in by_pairs_index]
[[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
If performance is critical, you may consider other approaches.
You can use zip with list slicing:
pleatedTuple=(0, 1, 3, 2, 4, 5, 7, 6, 8, 9)
new_data = [
    list(pleatedTuple[i:i+2][::-1]) if c % 2 != 0 else list(pleatedTuple[i:i+2])
    for i, c in zip(range(0, len(pleatedTuple), 2), range(len(pleatedTuple)))
]
Output:
[[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
Option 1
Using map and reversed and slice assignment.
p = list(map(list, zip(pleatedTuple[::2], pleatedTuple[1::2])))
p[1::2] = map(list, map(reversed, p[1::2]))
p
[[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
Slight variation
p = list(map(list, zip(pleatedTuple[::2], pleatedTuple[1::2])))
p[1::2] = (x[::-1] for x in p[1::2])
p
[[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
Option 2
def weird(p):
    return [[p[2 * i + i % 2], p[2 * i + (i + 1) % 2]] for i in range(len(p) // 2)]
weird(pleatedTuple)
[[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
More generic
def weird(p, k):
    return [list(p[i*k:(i+1)*k][::(i-1)%2*2-1]) for i in range(len(p) // k)]
weird(pleatedTuple, 2)
[[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
weird(pleatedTuple * 3, 3)
[[0, 1, 3],
[5, 4, 2],
[7, 6, 8],
[1, 0, 9],
[3, 2, 4],
[6, 7, 5],
[8, 9, 0],
[2, 3, 1],
[4, 5, 7],
[9, 8, 6]]
You can do this in numpy:
>>> import numpy as np
>>> pleatedTuple = (0, 1, 3, 2, 4, 5, 7, 6, 8, 9)
>>> pleatedArray = np.array(pleatedTuple)
>>> flat2D = pleatedArray.reshape(5, 2)
>>> flat2D[1::2] = np.flip(flat2D[1::2], axis=1)
Of course this is probably going to waste as much time converting between tuples and arrays as it saves by doing a tiny loop in numpy instead of Python. (From a quick test, it takes about twice as long as Coldspeed's Option 2 at the example size, and doesn't catch up until the tuples get much, much longer; and you have a bunch of little tuples, not a few giant ones.)
But if you're concerned with speed, the obvious thing to do is put all of these thousands of pleated tuples into one giant numpy array and do them all at once, and then it will probably be a lot faster. (Still, we're probably talking about saving milliseconds for thousands of these.)
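A minimal sketch of that batching idea, assuming all the pleated tuples share the same even length so they can be stacked into one array (the count of 1000 is just illustrative):
import numpy as np

batch = np.array([pleatedTuple] * 1000)        # shape (1000, 10): one pleated tuple per row
pairs = batch.reshape(batch.shape[0], -1, 2)   # shape (1000, 5, 2): rows of pairs
pairs[:, 1::2] = pairs[:, 1::2, ::-1].copy()   # flip every other pair in every row at once
                                               # (.copy() avoids writing into the view being read)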
What I use is:
pleatedTuple = (0, 1, 3, 2, 4, 5, 7, 6, 8, 9)
for i in range(0, len(pleatedTuple), 2):
    print(pleatedTuple[i:i+2])
Is this what you are looking for?
(0, 1)
(3, 2)
(4, 5)
(7, 6)
(8, 9)
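If you also need the alternate pairs flipped, as the question asks, a small extension of this loop could look like this (a sketch, not part of the original reply):
flatPairs = []
for idx, i in enumerate(range(0, len(pleatedTuple), 2)):
    pair = pleatedTuple[i:i+2]
    flatPairs.append(list(pair[::-1]) if idx % 2 else list(pair))
print(flatPairs)   # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]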
Related
Short intro
I have two paired lists of 2D numpy arrays (see below) - paired in the sense that index 0 in array1 corresponds to index 0 in array2. For each of the pairs I want to get all the combinations of all rows in the 2D numpy arrays, like answered by Divakar here.
Array example
import numpy as np

arr1 = [
    np.vstack([[1, 6, 3, 9], [8, 5, 6, 7]]),
    np.vstack([[1, 6, 3, 9]]),
    np.vstack([[1, 6, 3, 9], [8, 5, 6, 7], [8, 5, 6, 7]])
]
arr2 = [
    np.vstack([[8, 8, 8, 8]]),
    np.vstack([[8, 8, 8, 8]]),
    np.vstack([[1, 6, 3, 9], [8, 5, 6, 7], [8, 5, 6, 7]])
]
Working code
Note that, unlike the linked answer, my number of columns is fixed (always 4), hence I replaced the use of shape with the hardcoded value 4 (or 8 in np.zeros).
def merge(a1, a2):
    # From: https://stackoverflow.com/questions/47143712/combination-of-all-rows-in-two-numpy-arrays
    m1 = a1.shape[0]
    m2 = a2.shape[0]
    out = np.zeros((m1, m2, 8), dtype=int)
    out[:, :, :4] = a1[:, None, :]
    out[:, :, 4:] = a2
    out.shape = (m1 * m2, -1)
    return out
total = np.concatenate([merge(arr1[i], arr2[i]) for i in range(len(arr1))])
print(total)
Question
While the above works fine, it looks inefficient to me as it:
involves looping through the arrays
"appends" (in list list comprehsion) to the total array, requiring it to allocate memory each time
creates multiple zero arrays (in the merge function), whereas I could create an empty one at the start? related to the point above
I perform this operation thousands of times on arrays with millions of elements, so any suggestions on how to transform this code into something more efficient?
To be honest, this seems pretty hard to optimize. Each step in the loop has a different size, so there likely isn't any purely vectorized way of doing these things. You can try pre-allocating the memory and writing in place, rather than allocating many pieces and finally concatenating the results, but I'd bet that doesn't help you much (unless you are under such constrained conditions that you don't have enough RAM to store everything twice, of course).
Feel free to try the following approach on your larger data, but I'd be surprised if you get any significant speedup (or even that you don't get slower results!).
# Use scalar product to get the final size
result = np.zeros((np.dot([len(x) for x in arr1], [len(x) for x in arr2]), 8), dtype=int)
start = 0
for a1, a2 in zip(arr1, arr2):
    end = start + len(a1) * len(a2)
    result[start:end, :4] = np.repeat(a1, len(a2), axis=0)
    result[start:end, 4:] = np.tile(a2, (len(a1), 1))
    start = end
This is what I wanted to see - the list and the merge results:
In [60]: arr1
Out[60]:
[array([[1, 6, 3, 9],
[8, 5, 6, 7]]),
array([[1, 6, 3, 9]]),
array([[1, 6, 3, 9],
[8, 5, 6, 7],
[8, 5, 6, 7]])]
In [61]: arr2
Out[61]:
[array([[8, 8, 8, 8]]),
array([[8, 8, 8, 8]]),
array([[1, 6, 3, 9],
[8, 5, 6, 7],
[8, 5, 6, 7]])]
In [63]: merge(arr1[0],arr2[0]) # a (2,4) with (1,4) => (2,8)
Out[63]:
array([[1, 6, 3, 9, 8, 8, 8, 8],
[8, 5, 6, 7, 8, 8, 8, 8]])
In [64]: merge(arr1[1],arr2[1]) # a (1,4) with (1,4) => (1,8)
Out[64]: array([[1, 6, 3, 9, 8, 8, 8, 8]])
In [65]: merge(arr1[2],arr2[2]) # a (3,4) with (3,4) => (9,8)
Out[65]:
array([[1, 6, 3, 9, 1, 6, 3, 9],
[1, 6, 3, 9, 8, 5, 6, 7],
[1, 6, 3, 9, 8, 5, 6, 7],
[8, 5, 6, 7, 1, 6, 3, 9],
[8, 5, 6, 7, 8, 5, 6, 7],
[8, 5, 6, 7, 8, 5, 6, 7],
[8, 5, 6, 7, 1, 6, 3, 9],
[8, 5, 6, 7, 8, 5, 6, 7],
[8, 5, 6, 7, 8, 5, 6, 7]])
And total is (12, 8), combining all the "rows".
The list comprehension is, more cleanly stated:
[merge(a,b) for a,b in zip(arr1,arr2)]
The lists, while the same length, have arrays with different numbers of rows, and the merge is also different.
People often ask about building an array iteratively, and we consistently say: collect the results in a list, and do one concatenate-like construction at the end. The equivalent loop is:
In [70]: alist = []
    ...: for a, b in zip(arr1, arr2):
    ...:     alist.append(merge(a, b))
This is usually competitive with predefining the total array, and assigning rows. And in your case to get the final shape of total you'd have to iterate through the lists and record the number of rows, etc.
Unless the computation is trivial, the iteration mechanism is a minor part of the total time. I'm pretty sure that here, it's calling merge 3 times that's taking most of the time. For a task like this I wouldn't worry too much about memory use, including the creation of the zeros. You have to, in one way or other use memory for a (12,8) final result. Building that from a (2,8),(1,8), and (9,8) isn't a big issue.
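For completeness, a minimal sketch of the "predefine the total array" route described above (it reuses the merge function from the question; the row count still has to be gathered first):
rows = sum(a.shape[0] * b.shape[0] for a, b in zip(arr1, arr2))  # final number of rows
total = np.zeros((rows, 8), dtype=int)
start = 0
for a, b in zip(arr1, arr2):
    block = merge(a, b)
    total[start:start + block.shape[0]] = block
    start += block.shape[0]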
The list comprehension with concatenate and without:
In [72]: timeit total = np.concatenate([merge(a,b) for a,b in zip(arr1,arr2)])
22.4 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [73]: timeit [merge(a,b) for a,b in zip(arr1,arr2)]
16.3 µs ± 25.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Calling merge 3 times with any of the pairs takes about the same time.
Oh, another thing, don't try to 'reuse' the out array across merge calls. When accumulating results like this in a list, reuse of the arrays is dangerous. Each merge call must return its own array, not a "recycled" one.
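A tiny illustration of that pitfall (a hypothetical buffer, not code from the answer): every list entry ends up referencing the same object, so later writes clobber earlier "results".
import numpy as np

buf = np.zeros(3, dtype=int)
results = []
for k in range(3):
    buf[:] = k             # write into the shared buffer in place
    results.append(buf)    # appends a reference, not a copy
print(results)             # [array([2, 2, 2]), array([2, 2, 2]), array([2, 2, 2])]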
I have a DataFrame like below:
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5],
                   "list": [[2, 51, 6, 8, 3], [19, 2, 11, 9], [6, 8, 3, 9, 10, 11], [4, 5], [8, 3, 9, 6]]})
I want to filter this DataFrame such that it only contains the rows in which X is a subsequence of the list column (so the order of the elements in X is the same as in the list and they are not interleaved by other elements in the list).
For example, if X = [6, 8, 3], I want the output to look like this:
id list
1 [2, 51, 6, 8, 3]
3 [6, 8, 3, 9, 10, 11]
I know I can check if a list is a subsequence of another list with the following function (found on How to check subsequence exists in a list?):
def x_in_y(query, base):
    l = len(query)
    for i in range(len(base) - l + 1):
        if base[i:i+l] == query:
            return True
    return False
I have two questions:
Question 1:
How to apply this to a Pandas DataFrame column like in my example?
Question 2:
Is this the most efficient way to do this? And if not, what would be? The function does not look that elegant/Pythonic, and I have to apply it to a very big DataFrame of about 200K rows.
[Note: the elements of the lists in the list column are unique, should that help to optimize things]
Here is a solution calling the function on the column:
df = df[df.list.map(lambda x: x_in_y(X, x))]
#alternative
#df = df[df.list.apply(lambda x: x_in_y(X, x))]
print (df)
id list
0 1 [2, 51, 6, 8, 3]
2 3 [6, 8, 3, 9, 10, 11]
Performance is really good on the sample data; the best test, though, is on your real data:
#200k rows
df = pd.concat([df] * 40000, ignore_index=True)
print (df)
X = [6, 8, 3]
x = to_string([6, 8, 3])
In [166]: %timeit df.list.map(lambda x: x_in_y(X, x))
214 ms ± 6.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [167]: %timeit df['list'].map(to_string).str.contains(x)
413 ms ± 4.41 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [168]: %timeit df["list"].apply(has_subsequence, subseq=X)
5.2 s ± 420 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [169]: %timeit df.list.apply(lambda y: ''.join(map(str,X)) in ''.join(map(str,y)))
573 ms ± 116 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using the numpy rolling-window technique:
import numpy as np
def rolling_window(a, size):
    a = np.array(a)
    shape = a.shape[:-1] + (a.shape[-1] - size + 1, size)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

def has_subsequence(a, subseq):
    return (rolling_window(a, len(subseq)) == subseq).all(axis=1).any()
mask = df["list"].apply(has_subsequence, subseq=[6,8,3])
df[mask]
Explanation:
rolling_window creates a view into the array with the given shape and strides:
>>> rolling_window([1,2,3,4], 2)
np.array([[1,2], [2,3], [3,4]])
Then we compare the result with our target X
>>> np.array([[1,2], [2,3], [3,4]]) == [2,3]
np.array([[False, False], [True, True], [False, False]])
Then we ask numpy to return True wherever all items along axis 1 are True.
>>> np.array([[False, False], [True, True], [False, False]]).all(axis=1)
np.array([False, True, False])
And finally return True if there are any True in the array.
>>> np.array([False, True, False]).any()
True
You can try this:
import pandas as pd
def to_string(l):
    return '-' + '-'.join(map(str, l)) + '-'
X = to_string([6, 8, 3])
df = pd.DataFrame({"id": [1, 2, 3, 4, 5], "list": [[2, 51, 6, 8, 3], [19, 2, 11, 9], [6, 8, 3, 9, 10, 11], [4, 5], [8, 3, 9, 6]]})
df[df['list'].map(to_string).str.contains(X)]
# id list
# 0 1 [2, 51, 6, 8, 3]
# 2 3 [6, 8, 3, 9, 10, 11]
In my opinion, it is important that you add delimiters to the start and end of the string. Otherwise, you will have problems with lists such as [666, 8, 3].
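A quick illustration of that false positive, using the [666, 8, 3] example (to_string as defined above):
''.join(map(str, [6, 8, 3])) in ''.join(map(str, [666, 8, 3]))   # True: '683' is found inside '66683'
to_string([6, 8, 3]) in to_string([666, 8, 3])                   # False: '-6-8-3-' is not in '-666-8-3-'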
You can try this:
x = [6, 8, 3]
df = df.loc[df.list.apply(lambda y: ''.join(map(str,x)) in ''.join(map(str,y)))]
OUTPUT:
id list
0 1 [2, 51, 6, 8, 3]
2 3 [6, 8, 3, 9, 10, 11]
I have a list, for example data = [0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5, 6], and I need to remove sets of items from it (max. length of a set is k = 3), but only when the sets immediately follow each other. data includes three such cases: [4, 4], [5, 8, 5, 8], and [1, 5, 6, 1, 5, 6], so the cleaned-up list should look like [0, 4, 2, 5, 8, 7, 1, 5, 6].
I tried this code and it works:
import numpy as np

data = np.array([0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5, 6])
for k in range(1, 3):
    kth_difference = data[k:] - data[:-k]
    ids = np.where(kth_difference)
    data = data[ids]
But if I change the input list to something like data = [0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5] (breaking the last set), the new output list is [0, 4, 2, 5, 8, 7, 1, 5], which loses the 6 at the end.
What would be a solution for this task? How can it be made to work for any k?
You added a numpy tag, so let's use that to our advantage. Start with an array:
data = np.array([0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5, 6])
It's easy to make a mask of elements up to length n that follow each other:
mask_1 = data[1:] == data[:-1]
mask_2 = data[2:] == data[:-2]
mask_3 = data[3:] == data[:-3]
The first mask has ones at each location where the next element is the same. The second mask will have a one wherever an element is the same as something two elements ahead, so you need to find runs of 2 elements at a time. The same applies to the third mask. Filtering of the mask needs to take into account that you want to include the possibility of partial matches at the end. You can effectively extend the mask with k-1 elements to accomplish this:
delta = np.diff(np.r_[False, mask_3, np.ones(2, dtype=bool), False].view(np.int8))
edges = np.flatnonzero(delta).reshape(-1, 2)
lengths = edges[:, 1] - edges[:, 0]
delta[edges[lengths < 3, :]] = 0
mask = delta[:-3].cumsum(dtype=np.int8).view(bool)
In this arrangement, mask masks the duplicated three elements that constitute a duplicated group. It may contain fewer than three elements if the replicated portion is truncated. That ensures that you get to keep all the elements of partial duplicates.
For this exercise, I will assume that you don't have strange overlaps between different levels. I.e., each part of the array that belongs to a repeated segment belongs to exactly one possible repeated segment. Otherwise, the mask processing becomes much more complex.
Here is a function to wrap all this together:
def clean_mask(mask, k):
    delta = np.diff(np.r_[False, mask, np.ones(k - 1, bool), False].view(np.int8))
    edges = np.flatnonzero(delta).reshape(-1, 2)
    lengths = edges[:, 1] - edges[:, 0]
    delta[edges[lengths < k, :]] = 0
    return delta[:-k].cumsum(dtype=np.int8).view(bool)
def dedup(data, kmax):
    data = np.asarray(data)
    kmax = min(kmax, data.size // 2)
    remove = np.zeros(data.shape, dtype=bool)
    for k in range(kmax, 0, -1):
        remove[k:] |= clean_mask(data[k:] == data[:-k], k)
    return data[~remove]
Outputs for the two test cases you show in the question:
>>> dedup([0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5, 6], 3)
array([0, 4, 2, 5, 8, 7, 1, 5, 6])
>>> dedup([0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5], 3)
array([0, 4, 2, 5, 8, 7, 1, 5, 6])
Timing
A quick benchmark shows that the numpy solution is also much faster than pure python:
for n in range(2, 7):
    x = np.random.randint(0, 10, 10**n)
    y = list(x)
    %timeit dedup(x, 3)
    %timeit remdup(y)
Results:
# 100 elements
dedup: 332 µs ± 5.01 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
remdup: 36.9 ms ± 152 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# 1000 elements
dedup: 412 µs ± 5.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
remdup: > 1 minute
Caveats
This solution makes no attempt to cover corner cases. For example: data = [2, 2, 2, 2, 2, 2, 2] or similar, where multiple levels of k can overlap.
Here is an attempt, which is also my very first time using break and for/else:
def remdup(l):
    while True:
        for (i, j) in ((i, j) for i in range(0, len(l)) for j in range(i+1, len(l)+1)):
            if l[i:j] == l[j:j+(j-i)]:
                l = l[:j] + l[j+j-i:]
                break  # restart
        else:  # if no duplicate was found
            break  # halt
    return l
print(remdup([0, 4, 4, 2, 5, 8, 5, 8, 7, 1, 5, 6, 1, 5, 6]))
# [0, 4, 2, 5, 8, 7, 1, 5, 6]
How this works:
iterate over all substrings l[i:j]:
if l[i:j] is a duplicate of the next substring of the same length, l[j:j+j-i]:
remove l[j:j+j-i]
break the iteration and restart it, because the list has changed
if no duplicate was found, return the list
I recommend avoiding break and for/else; they're ugly and can make code deceptive.
I am looking for some built-in NumPy module or some vectorized approach to get such n by n matrices for n>1. The only key thing here is that the last element of a given row serves as the first element of the following row.
n = 2
# array([[1, 2],
# [2, 3]])
n = 3
# array([[1, 2, 3],
# [3, 4, 5],
# [5, 6, 7]])
n = 4
# array([[1, 2, 3, 4],
# [4, 5, 6, 7],
# [7, 8, 9, 10],
# [10, 11, 12, 13]])
My attempt uses a list comprehension; the same can be written with an extended for loop as well.
import numpy as np
n = 4
arr = np.array([[(n-1)*j+i for i in range(1, n+1)] for j in range(n)])
# array([[ 1, 2, 3, 4],
# [ 4, 5, 6, 7],
# [ 7, 8, 9, 10],
# [10, 11, 12, 13]])
If you want to keep things simple (and somewhat readable), this should do it:
def ranged_mat(n):
    out = np.arange(1, n ** 2 + 1).reshape(n, n)
    out -= np.arange(n).reshape(n, 1)
    return out
Simply build all numbers from 1 to n², reshape them to the desired block shape, then subtract the row number from each row.
This does the same as Divakar's ranged_mat_v2, but I like being explicit with intermediate array shapes. Not everyone is an expert at NumPy's broadcasting rules.
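To make the broadcasting concrete, here is the n = 4 case spelled out:
out = np.arange(1, 17).reshape(4, 4)  # [[ 1,  2,  3,  4], [ 5,  6,  7,  8], [ 9, 10, 11, 12], [13, 14, 15, 16]]
out -= np.arange(4).reshape(4, 1)     # subtract 0 from row 0, 1 from row 1, 2 from row 2, 3 from row 3
# out -> [[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10], [10, 11, 12, 13]]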
Approach #1
We can leverage scikit-image's view_as_windows, which is built on np.lib.stride_tricks.as_strided, to get sliding windows. Additionally, it accepts a step argument, which fits this problem perfectly. Hence, the implementation would be -
from skimage.util.shape import view_as_windows
def ranged_mat(n):
    r = np.arange(1, n*(n-1)+2)
    return view_as_windows(r, n, step=n-1)
Sample runs -
In [270]: ranged_mat(2)
Out[270]:
array([[1, 2],
[2, 3]])
In [271]: ranged_mat(3)
Out[271]:
array([[1, 2, 3],
[3, 4, 5],
[5, 6, 7]])
In [272]: ranged_mat(4)
Out[272]:
array([[ 1, 2, 3, 4],
[ 4, 5, 6, 7],
[ 7, 8, 9, 10],
[10, 11, 12, 13]])
Approach #2
Another with outer-broadcasted-addition -
def ranged_mat_v2(n):
    r = np.arange(n)
    return (n-1)*r[:,None] + r + 1
Approach #3
We can also use the numexpr module, which supports multi-core processing, to achieve better efficiency for large n -
import numexpr as ne
def ranged_mat_v3(n):
    r = np.arange(n)
    r2d = (n-1)*r[:,None]
    return ne.evaluate('r2d+r+1')
Making use of slicing gives us a more memory-efficient one -
def ranged_mat_v4(n):
    r = np.arange(n+1)
    r0 = r[1:]
    r1 = r[:-1,None]*(n-1)
    return ne.evaluate('r0+r1')
Timings -
In [423]: %timeit ranged_mat(10000)
273 ms ± 3.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [424]: %timeit ranged_mat_v2(10000)
316 ms ± 2.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [425]: %timeit ranged_mat_v3(10000)
176 ms ± 85.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [426]: %timeit ranged_mat_v4(10000)
154 ms ± 82.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
There is also np.fromfunction, as below.
def func(n):
    # dtype=int keeps the output integral (np.fromfunction defaults to float indices)
    return np.fromfunction(lambda r, c: (n-1)*r + 1 + c, shape=(n, n), dtype=int)
It takes a function which calculates an array from the index values.
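A quick usage check (the output shown assumes the dtype=int argument added above; without it, np.fromfunction returns floats):
func(4)
# array([[ 1,  2,  3,  4],
#        [ 4,  5,  6,  7],
#        [ 7,  8,  9, 10],
#        [10, 11, 12, 13]])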
We can use NumPy strides for this:
def as_strides(n):
    m = n**2 - (n-1)
    a = np.arange(1, m+1)
    s = a.strides[0]
    return np.lib.stride_tricks.as_strided(a, shape=(n, n), strides=((n-1)*s, s))
as_strides(2)
array([[1, 2],
[2, 3]])
as_strides(3)
array([[1, 2, 3],
[3, 4, 5],
[5, 6, 7]])
as_strides(4)
array([[ 1, 2, 3, 4],
[ 4, 5, 6, 7],
[ 7, 8, 9, 10],
[10, 11, 12, 13]])
I'm running a script in Python, where I need to insert new numbers into an array (or list) at certain index locations. The problem is that obviously as I insert new numbers, the index locations are invalidated. Is there a clever way to insert the new values at the index locations all at once? Or is the only solution to increment the index number (first value of the pair) as I add?
Sample test code snippets:
original_list = [0, 1, 2, 3, 4, 5, 6, 7]
insertion_indices = [1, 4, 5]
new_numbers = [8, 9, 10]
pairs = [(insertion_indices[i], new_numbers[i]) for i in range(len(insertion_indices))]
for pair in pairs:
    original_list.insert(pair[0], pair[1])
Results in:
[0, 8, 1, 2, 9, 10, 3, 4, 5, 6, 7]
whereas I want:
[0, 8, 1, 2, 3, 9, 4, 10, 5, 6, 7]
Insert those values in backwards order. Like so:
original_list = [0, 1, 2, 3, 4, 5, 6, 7]
insertion_indices = [1, 4, 5]
new_numbers = [8, 9, 10]
new = sorted(zip(insertion_indices, new_numbers), reverse=True)

for i, x in new:
    original_list.insert(i, x)
The reason this works is based on the following observation: inserting a value at the beginning of the list offsets the indexes of all following values by 1, while inserting a value at the end leaves the other indexes unchanged. As a consequence, if you start by inserting the value with the largest insertion index (here index 5, value 10) and continue "backwards", you never have to update any indexes.
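Spelled out with the question's numbers, inserting at the largest index first:
original_list = [0, 1, 2, 3, 4, 5, 6, 7]
original_list.insert(5, 10)   # indices 0..4 untouched
original_list.insert(4, 9)    # indices 0..3 untouched
original_list.insert(1, 8)
# original_list == [0, 8, 1, 2, 3, 9, 4, 10, 5, 6, 7]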
Since this is tagged NumPy and the input is mentioned as a list/array, you can simply use the built-in numpy.insert -
np.insert(original_list, insertion_indices, new_numbers)
To roll out the theory as a custom made one (mostly for performance), we could use mask, like so -
def insert_numbers(original_list, insertion_indices, new_numbers):
    # Length of output array
    n = len(original_list) + len(insertion_indices)

    # Setup mask array to select between new and old numbers
    mask = np.ones(n, dtype=bool)
    mask[insertion_indices + np.arange(len(insertion_indices))] = 0

    # Setup output array for assigning values from old and new lists/arrays
    # by using mask and its inverted version
    out = np.empty(n, dtype=int)
    out[mask] = original_list
    out[~mask] = new_numbers
    return out
For list output, append .tolist().
Sample run -
In [83]: original_list = [0, 1, 2, 3, 4, 5, 6, 7]
...: insertion_indices = [1, 4, 5]
...: new_numbers = [8, 9, 10]
...:
In [85]: np.insert(original_list, insertion_indices, new_numbers)
Out[85]: array([ 0, 8, 1, 2, 3, 9, 4, 10, 5, 6, 7])
In [86]: np.insert(original_list, insertion_indices, new_numbers).tolist()
Out[86]: [0, 8, 1, 2, 3, 9, 4, 10, 5, 6, 7]
Runtime test on a 10000x scaled dataset -
In [184]: original_list = range(70000)
...: insertion_indices = np.sort(np.random.choice(len(original_list), 30000, replace=0)).tolist()
...: new_numbers = np.random.randint(0,10, len(insertion_indices)).tolist()
...: out1 = np.insert(original_list, insertion_indices, new_numbers)
...: out2 = insert_numbers(original_list, insertion_indices, new_numbers)
...: print np.allclose(out1, out2)
True
In [185]: %timeit np.insert(original_list, insertion_indices, new_numbers)
100 loops, best of 3: 5.37 ms per loop
In [186]: %timeit insert_numbers(original_list, insertion_indices, new_numbers)
100 loops, best of 3: 4.8 ms per loop
Let's test out with arrays as inputs -
In [190]: original_list = np.arange(70000)
...: insertion_indices = np.sort(np.random.choice(len(original_list), 30000, replace=0))
...: new_numbers = np.random.randint(0,10, len(insertion_indices))
...: out1 = np.insert(original_list, insertion_indices, new_numbers)
...: out2 = insert_numbers(original_list, insertion_indices, new_numbers)
...: print np.allclose(out1, out2)
True
In [191]: %timeit np.insert(original_list, insertion_indices, new_numbers)
1000 loops, best of 3: 1.48 ms per loop
In [192]: %timeit insert_numbers(original_list, insertion_indices, new_numbers)
1000 loops, best of 3: 1.07 ms per loop
The performance just shoots up, because there's no runtime overhead on conversion to list.
Add this before your for loop:
for i in range(len(insertion_indices)):
    insertion_indices[i] += i
Increment the required index by 1 after every insert
original_list = [0, 1, 2, 3, 4, 5, 6, 7]
insertion_indices = [1, 4, 5]
new_numbers = [8, 9, 10]
for i in range(len(insertion_indices)):
    original_list.insert(insertion_indices[i]+i, new_numbers[i])
print(original_list)
Output
[0, 8, 1, 2, 3, 9, 4, 10, 5, 6, 7]
#Required list
[0, 8, 1, 2, 3, 9, 4, 10, 5, 6, 7]
Less elegant, but it works too: use a NumPy ndarray and increment the indices each time:
import numpy as np
original_list = [0, 1, 2, 3, 4, 5, 6, 7]
insertion_indices = [1, 4, 5]
new_numbers = [8, 9, 10]
pairs = np.array([[insertion_indices[i], new_numbers[i]] for i in range(len(insertion_indices))])
for pair in pairs:
    original_list.insert(pair[0], pair[1])
    pairs[:, 0] += 1