The problem is about reversing a list A of size N in groups of K. For example, if A = [1,2,3,4,5] and K = 3:
Output = [3,2,1,5,4]
The error I get when I run this is "list index out of range" on line 4.
def reverseInGroups(A,N,K):
    arr1 = []
    for i in range(K):
        arr1.append(A[(N-i)%K]) #line 4
    for j in range(N-K):
        arr1.append(A[N-j-1])
    return arr1
This will implement what you are trying to achieve:
def reverseInGroups(A, K):
    N = len(A)
    arr1 = []
    for i in range(0, N, K):
        arr1.extend(A[i : i+K][::-1])
    return arr1
print(reverseInGroups([1,2,3,4,5], 3))
Interestingly, the code in the question actually works in the example case, but it is not general. It happens to work when N = 2*K - 1 (and where it does not work, the elements come out in the wrong order rather than raising an IndexError).
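You can check the first loop's index pattern directly (a quick standalone probe, not part of the original code):

```python
# Index pattern produced by the first loop, A[(N - i) % K] for i in range(K)
print([(5 - i) % 3 for i in range(3)])  # N=5, K=3 -> [2, 1, 0], a reversed first group
print([(6 - i) % 3 for i in range(3)])  # N=6, K=3 -> [0, 2, 1], scrambled rather than reversed
```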
I can't reproduce your 'list index out of range' error, but your logic is faulty:
def reverseInGroups(A, N, K):
    arr1 = []
    for i in range(K):
        arr1.append(A[(N-i)%K]) #line 4
    for j in range(N-K):
        arr1.append(A[N-j-1])
    return arr1
print(reverseInGroups([1,2,3,4,5],5, 3)) # works, others get wrong result
print(reverseInGroups([1,2,3,4,5,6],6, 3)) # wrong result: [1, 3, 2, 6, 5, 4]
prints:
[3, 2, 1, 5, 4] # correct
[1, 3, 2, 6, 5, 4] # wrong
You can fix this and make it smaller by packing it into a list comprehension:
def revv(L, k):
    return [w for i in (L[s:s+k][::-1] for s in range(0, len(L), k)) for w in i]

for gr in range(2, 8):
    print(gr, revv([1,2,3,4,5,6,7,8,9,10,11], gr))
to get:
2 [2, 1, 4, 3, 6, 5, 8, 7, 10, 9, 11]
3 [3, 2, 1, 6, 5, 4, 9, 8, 7, 11, 10]
4 [4, 3, 2, 1, 8, 7, 6, 5, 11, 10, 9]
5 [5, 4, 3, 2, 1, 10, 9, 8, 7, 6, 11]
6 [6, 5, 4, 3, 2, 1, 11, 10, 9, 8, 7]
7 [7, 6, 5, 4, 3, 2, 1, 11, 10, 9, 8]
You can also try this:
def reverse(l, n):
    result = []
    for i in range(0, len(l), n):
        for item in reversed(l[i:i+n]):
            result.append(item)
    return result

print(reverse([1, 2, 3, 4, 5], 3))  # [3, 2, 1, 5, 4]
You can reverse the array up to index K, reverse the remaining part, and concatenate the two. (Note this matches the grouped reversal only when the remainder A[K:] is a single group, i.e. N <= 2*K, which holds for the example input.)
def reverseInGroups(A, N, K):
    return A[:K][::-1] + A[K:][::-1]

A = [1, 2, 3, 4, 5]
N = 5
K = 3
res = reverseInGroups(A, N, K)
print(res)
Let's say I have the following list of lists:
x = [[1, 2, 3, 4, 5, 6, 7], # sequence 1
[6, 5, 10, 11], # sequence 2
[9, 8, 2, 3, 4, 5], # sequence 3
[12, 12, 6, 5], # sequence 4
[5, 8, 3, 4, 2], # sequence 5
[1, 5], # sequence 6
[2, 8, 8, 3, 5, 9, 1, 4, 12, 5, 6], # sequence 7
[7, 1, 7, 3, 4, 1, 2], # sequence 8
[9, 4, 12, 12, 6, 5, 1], # sequence 9
]
Essentially, for any list that contains the target number 5 (i.e., target=5) anywhere within the list, what are the top N=2 most frequently observed subsequences with length M=4?
So, the conditions are:
if target doesn't exist in the list then we ignore that list completely
if the list length is less than M then we ignore the list completely
if the list is exactly length M but target is not in the Mth position then we ignore it (but we count it if target is in the Mth position)
if the list length, L, is longer than M and target is in position i = M (or i = M+1, i = M+2, ..., i = L), then we count the subsequence of length M where target is in the final position in the subsequence
So, using our list-of-lists example, we'd count the following subsequences:
subseqs = [[2, 3, 4, 5], # taken from sequence 1
[2, 3, 4, 5], # taken from sequence 3
[12, 12, 6, 5], # taken from sequence 4
[8, 8, 3, 5], # taken from sequence 7
[1, 4, 12, 5], # taken from sequence 7
[12, 12, 6, 5], # taken from sequence 9
]
Of course, what we want are the top N=2 subsequences by frequency. So, [2, 3, 4, 5] and [12, 12, 6, 5] are the top two most frequent sequences by count. If N=3 then all of the subsequences (subseqs) would be returned since there is a tie for third.
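A naive reference implementation of the rules above (far too slow at my real scale, but it pins down the spec; the helper name is my own):

```python
from collections import Counter

x = [[1, 2, 3, 4, 5, 6, 7], [6, 5, 10, 11], [9, 8, 2, 3, 4, 5],
     [12, 12, 6, 5], [5, 8, 3, 4, 2], [1, 5],
     [2, 8, 8, 3, 5, 9, 1, 4, 12, 5, 6], [7, 1, 7, 3, 4, 1, 2],
     [9, 4, 12, 12, 6, 5, 1]]

def count_subseqs(lists, target, M):
    """Count length-M subsequences that end exactly at an occurrence of target."""
    c = Counter()
    for lst in lists:
        if len(lst) < M or target not in lst:
            continue  # rules 1 and 2: ignore lists without target or shorter than M
        for i, v in enumerate(lst):
            if v == target and i >= M - 1:  # rules 3 and 4: need M elements ending here
                c[tuple(lst[i - M + 1: i + 1])] += 1
    return c

counts = count_subseqs(x, target=5, M=4)
```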
This is super simplified but, in reality, my actual list-of-lists
consists of a few billion lists of positive integers (between 1 and 10,000)
each list can be as short as 1 element or as long as 500 elements
N and M can be as small as 1 or as big as 100
My questions are:
Is there an efficient data structure that would allow for fast queries assuming that N and M will always be less than 100?
Are there efficient algorithms or relevant area of research for performing this kind of analysis for various combinations of N and M?
Here is an idea, based on a generalized suffix tree structure. Your list of lists can be seen as a list of strings, where the alphabet would consist of integers (so about 10k characters in the alphabet with the info you provided).
The construction of a generalized suffix tree is done in linear time w.r.t. the string length, so this should not be an issue since, in any case, you will have to go through your lists at some point.
First, store all your strings in the suffix tree. This requires 2 small adaptations of the structure.
You need to keep a counter of the number of occurrences of a certain suffix, since your goal is ultimately to find the most common subsequence respecting certain properties.
Then, you also want to have a lookup table from (i, d) (where i is the integer you're looking for, the target, and d is the depth in your tree, the M) to the set of nodes of your suffix tree that are labeled with the 'letter' i (your alphabet is not made of chars, but of integers) and located at depth d. This lookup table can be built by traversing your suffix tree (BFS or DFS). You can even possibly store only the node that corresponds to the highest counter value.
From there, for some query (target, M), you would first look in your lookup table, and then find the node in the tree with the highest counter value. This would correspond to the most frequently encountered 'suffix' (or subsequence) in the list of lists.
The implementation is quite complex, since the generalized suffix tree is not a trivial structure (at all), and implementing it correctly, with modifications, would not be a small feat. But I think that this would allow for a very efficient query time.
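To make the lookup-table idea concrete without implementing a full suffix tree, here is a simplified sketch (my own, with hypothetical helper names) that materializes the same (value, depth) -> counts mapping with plain dicts; a real generalized suffix tree shares storage between overlapping suffixes instead of materializing each one, which is what makes it viable at scale:

```python
from collections import Counter, defaultdict

def build_index(lists, max_depth=100):
    # index[(value, m)] -> Counter of length-m subsequences ending in value
    index = defaultdict(Counter)
    for lst in lists:
        for i, v in enumerate(lst):
            # every window of length 1..max_depth that ends at position i
            for m in range(1, min(i + 1, max_depth) + 1):
                index[(v, m)][tuple(lst[i - m + 1: i + 1])] += 1
    return index

def query(index, target, m, n):
    # top-n most frequent length-m subsequences ending in target
    return index[(target, m)].most_common(n)

lists = [[1, 2, 3, 4, 5, 6, 7], [9, 8, 2, 3, 4, 5], [12, 12, 6, 5],
         [2, 8, 8, 3, 5, 9, 1, 4, 12, 5, 6], [9, 4, 12, 12, 6, 5, 1]]
idx = build_index(lists)
```

Queries are then O(1) lookups plus the `most_common` cost, at the price of heavy preprocessing and memory; the suffix tree buys back the memory.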
For a suffix tree implementation, I would recommend reading only the original papers until you get a deep, real understanding of them (like this or that; sc*-h*b can be your friend), and not the online 'explanations', which are riddled with approximations and mistakes (even this post can help form a first idea, but will misdirect you at some point if your goal is a correct implementation).
To answer your first question: you can put all the lists in a single array, padding the rows with zeroes to a fixed length so the array becomes something you can work with. From an answer here:
x = [[1, 2, 3, 4, 5, 6, 7], # sequence 1
[6, 5, 10, 11], # sequence 2
[9, 8, 2, 3, 4, 5], # sequence 3
[12, 12, 6, 5], # sequence 4
[5, 8, 3, 4, 2], # sequence 5
[1, 5], # sequence 6
[2, 8, 8, 3, 5, 9, 1, 4, 12, 5, 6], # sequence 7
[7, 1, 7, 3, 4, 1, 2], # sequence 8
[9, 4, 12, 12, 6, 5, 1], # sequence 9
]
import numpy as np

lens = np.fromiter(map(len, x), int)
n1, n2 = len(lens), lens.max()
arr = np.zeros((n1, n2), dtype=int)
mask = np.arange(n2) < lens[:, None]
arr[mask] = np.concatenate(x)
arr
>> [[ 1 2 3 4 5 6 7 0 0 0 0]
[ 6 5 10 11 0 0 0 0 0 0 0]
[ 9 8 2 3 4 5 0 0 0 0 0]
[12 12 6 5 0 0 0 0 0 0 0]
[ 5 8 3 4 2 0 0 0 0 0 0]
[ 1 5 0 0 0 0 0 0 0 0 0]
[ 2 8 8 3 5 9 1 4 12 5 6]
[ 7 1 7 3 4 1 2 0 0 0 0]
[ 9 4 12 12 6 5 1 0 0 0 0]]
For the second question: use np.where to find the positions matching your condition. Then broadcast the row and column indices, adding a dimension so each match expands to the target and the M-1 elements preceding it:
M = 4
target = 5
r, c = np.where(arr[:, M-1:] == target)
arr[r[:, None], c[:, None] + np.arange(M)]
>>array([[ 2, 3, 4, 5],
[ 2, 3, 4, 5],
[12, 12, 6, 5],
[ 8, 8, 3, 5],
[ 1, 4, 12, 5],
[12, 12, 6, 5]])
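To finish the top-N part from here, you can group identical rows with np.unique(..., axis=0) and sort by count (a sketch of my own; N_top is the top-N count, distinct from the target value, and the tie rule from the question is not handled):

```python
import numpy as np

# Rows extracted by the broadcasting step above (copied from the result)
subseqs = np.array([[2, 3, 4, 5],
                    [2, 3, 4, 5],
                    [12, 12, 6, 5],
                    [8, 8, 3, 5],
                    [1, 4, 12, 5],
                    [12, 12, 6, 5]])

N_top = 2
rows, counts = np.unique(subseqs, axis=0, return_counts=True)  # group identical rows
order = np.argsort(-counts, kind="stable")                     # most frequent first
top = rows[order[:N_top]]
print(top.tolist())  # [[2, 3, 4, 5], [12, 12, 6, 5]]
```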
There are two parts to your question:
To generate the subsequences you want, you can use a generator:
def gen_m(lst, m, val):
    '''
    lst = sub_list to parse
    m = length required
    val = target value
    '''
    found = 0                                   # search start for index()
    for _ in range(lst.count(val)):             # visit every occurrence of val
        found = lst.index(val, found) + 1       # index just past the next occurrence
        if found >= m:                          # need m elements ending at the occurrence
            yield tuple(lst[found - m: found])  # yield the length-m slice as a hashable tuple
Then, using another generator, you can create a Counter of your sub lists:
from collections import Counter
target = 5
req_len = 4
# the yielded sub_lists need to be tuples to be hashable for the Counter
counter = Counter(sub_tup for lst in x for sub_tup in gen_m(lst, req_len, target))
Then, create a generator to check the counter object to return the N count required:
req_N = 2
def gen_common(counter, n):
    s = set()
    for i, (item, count) in enumerate(counter.most_common()):
        if i < n or count in s:
            yield item
        else:
            return
        s.add(count)
result = list(gen_common(counter, req_N))
Results where N == 2:
[(2, 3, 4, 5), (12, 12, 6, 5)]
Results where N == 3:
[(2, 3, 4, 5), (12, 12, 6, 5), (8, 8, 3, 5), (1, 4, 12, 5)]
With a larger sample:
x = [[1, 2, 3, 4, 5, 6, 7],
[6, 5, 10, 11],
[9, 8, 2, 3, 4, 5],
[12, 12, 6, 5],
[5, 8, 3, 4, 2],
[1, 5],
[2, 8, 8, 3, 5, 9, 1, 4, 12, 5, 6],
[7, 1, 7, 3, 4, 1, 2],
[9, 4, 12, 12, 6, 5, 1],
[9, 4, 12, 12, 6, 5, 1],
[9, 4, 2, 3, 4, 5, 1],
[9, 4, 8, 8, 3, 5, 1],
[9, 4, 7, 8, 9, 5, 1],
[9, 4, 1, 2, 2, 5, 1],
[9, 4, 12, 12, 6, 5, 1],
[9, 4, 12, 12, 6, 5, 1],
[9, 4, 1, 4, 12, 5],
[9, 1, 4, 12, 5, 1]
]
Where Counter is now:
Counter({(12, 12, 6, 5): 5, (2, 3, 4, 5): 3, (1, 4, 12, 5): 3, (8, 8, 3, 5): 2, (7, 8, 9, 5): 1, (1, 2, 2, 5): 1})
You can get results such as these:
for i in range(6):
    # testing req_N from 0 to 5
    print(list(gen_common(counter, i)))
# req_N = 0: []
# req_N = 1: [(12, 12, 6, 5)]
# req_N = 2: [(12, 12, 6, 5), (2, 3, 4, 5), (1, 4, 12, 5)]
# req_N = 3: [(12, 12, 6, 5), (2, 3, 4, 5), (1, 4, 12, 5)]
# req_N = 4: [(12, 12, 6, 5), (2, 3, 4, 5), (1, 4, 12, 5), (8, 8, 3, 5)]
# req_N = 5: [(12, 12, 6, 5), (2, 3, 4, 5), (1, 4, 12, 5), (8, 8, 3, 5), (7, 8, 9, 5), (1, 2, 2, 5)]
Since N, M, and target are not fixed, I assume the data arrives in chunks of lists. Here is an approach that runs in roughly O(N + M) time per chunk (where N is the number of lists in the chunk and M is the total number of elements):
from collections import Counter

def get_seq(x, M, target):
    index_for_length_m = M - 1
    for lst in (l for l in x if len(l) >= M):
        for i in (i for i, v in enumerate(lst[index_for_length_m:], start=index_for_length_m) if v == target):
            # convert to str to be hashable
            yield str(lst[i - index_for_length_m: i + 1])

def process_chunk(x, M, N, target):
    return Counter(get_seq(x, M, target)).most_common(N)
with your example:
process_chunk(x, M, 2, target)
output:
[('[2, 3, 4, 5]', 2), ('[12, 12, 6, 5]', 2)]
The performance:
%timeit process_chunk(x, M, 2, target)
# 25 µs ± 713 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)