Input 1:
test_list = [1, 1, 3, 4, 4, 4, 5, 6, 6, 7, 8, 8, 6]
Output:
[3, 5, 7, 6]
Explanation: since (1, 1), (4, 4, 4), (6, 6) and (8, 8) occur as consecutive runs, those values are dropped, so the 6 from the (6, 6) run is not added. The last 6, however, follows an 8 rather than another 6, so it is not part of a consecutive run and is kept in the final iteration.
Input 2:
test_list = [1, 1, 3, 4, 4, 4, 5, 4, 6, 6, 7, 8, 8, 6]
Output:
[3, 5, 4, 7, 6]
** Likewise, for the second input the run 4, 4, 4 is not valid, but the single 4 between 5 and 6 is valid.
Any suggestion for producing the expected output?
(I am looking for a somewhat elaborated algorithm.)
You can use itertools.groupby to group adjacent identical values, then keep only the values whose group has length 1.
>>> from itertools import groupby
>>> test_list = [1, 1, 3, 4, 4, 4, 5, 6, 6, 7, 8, 8, 6]
>>> [k for k, g in groupby(test_list) if len(list(g)) == 1]
[3, 5, 7, 6]
>>> test_list = [1, 1, 3, 4, 4, 4, 5, 4, 6, 6, 7, 8, 8, 6]
>>> [k for k, g in groupby(test_list) if len(list(g)) == 1]
[3, 5, 4, 7, 6]
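An equivalent formulation that avoids materializing each group as a list (a minor variation on the above, not required for correctness):
>>> [k for k, g in groupby(test_list) if sum(1 for _ in g) == 1]
[3, 5, 4, 7, 6]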
First of all, you need to know that reassigning i inside your for loop does not change the iteration.
You can check it by running this code:
for i in range(5):
    print(i)
    i = 2
This code prints 0 1 2 3 4, not 0 2 2 2 2 as you might expect.
Going back to your question: I would use groupby from itertools, but since you specified you don't want to use it, I would do something like this:
res_list = []
if test_list[0] != test_list[1]:  # <-- check if the first element belongs in the result
    res_list.append(test_list[0])
for i in range(len(test_list) - 2):  # walk the input list without its first and last element
    if test_list[i + 1] == test_list[i + 2] or test_list[i + 1] == test_list[i]:
        continue
    else:
        res_list.append(test_list[i + 1])
if test_list[-2] != test_list[-1]:  # <-- check if the last element belongs in the result
    res_list.append(test_list[-1])
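For completeness, here is the same index-based idea consolidated into a function that also guards against very short lists (my own sketch, not part of the original answer):
def singles(lst):
    # Keep only elements that differ from both of their neighbours.
    if len(lst) < 2:
        return list(lst)
    res = []
    for i, v in enumerate(lst):
        prev_same = i > 0 and lst[i - 1] == v
        next_same = i < len(lst) - 1 and lst[i + 1] == v
        if not prev_same and not next_same:
            res.append(v)
    return res
>>> singles([1, 1, 3, 4, 4, 4, 5, 6, 6, 7, 8, 8, 6])
[3, 5, 7, 6]
>>> singles([1, 1, 3, 4, 4, 4, 5, 4, 6, 6, 7, 8, 8, 6])
[3, 5, 4, 7, 6]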
Let's say I have the following list of lists:
x = [[1, 2, 3, 4, 5, 6, 7], # sequence 1
[6, 5, 10, 11], # sequence 2
[9, 8, 2, 3, 4, 5], # sequence 3
[12, 12, 6, 5], # sequence 4
[5, 8, 3, 4, 2], # sequence 5
[1, 5], # sequence 6
[2, 8, 8, 3, 5, 9, 1, 4, 12, 5, 6], # sequence 7
[7, 1, 7, 3, 4, 1, 2], # sequence 8
[9, 4, 12, 12, 6, 5, 1], # sequence 9
]
Essentially, for any list that contains the target number 5 (i.e., target=5) anywhere within the list, what are the top N=2 most frequently observed subsequences with length M=4?
So, the conditions are:
if target doesn't exist in the list then we ignore that list completely
if the list length is less than M then we ignore the list completely
if the list is exactly length M but target is not in the Mth position then we ignore it (but we count it if target is in the Mth position)
if the list length, L, is longer than M and target is in the i = M position (or the i = M+1 position, or the i = M+2 position, ..., the i = L position), then we count the subsequence of length M where target is in the final position of the subsequence
So, using our list-of-lists example, we'd count the following subsequences:
subseqs = [[2, 3, 4, 5], # taken from sequence 1
[2, 3, 4, 5], # taken from sequence 3
[12, 12, 6, 5], # taken from sequence 4
[8, 8, 3, 5], # taken from sequence 7
[1, 4, 12, 5], # taken from sequence 7
[12, 12, 6, 5], # taken from sequence 9
]
Of course, what we want are the top N=2 subsequences by frequency. So, [2, 3, 4, 5] and [12, 12, 6, 5] are the top two most frequent sequences by count. If N=3 then all of the subsequences (subseqs) would be returned since there is a tie for third.
This is super simplified but, in reality, my actual list-of-lists
consists of a few billion lists of positive integers (between 1 and 10,000)
each list can be as short as 1 element or as long as 500 elements
N and M can be as small as 1 or as big as 100
My questions are:
Is there an efficient data structure that would allow for fast queries assuming that N and M will always be less than 100?
Are there efficient algorithms or relevant area of research for performing this kind of analysis for various combinations of N and M?
Here is an idea, based on a generalized suffix tree structure. Your list of lists can be seen as a list of strings, where the alphabet would consist of integers (so about 10k characters in the alphabet with the info you provided).
The construction of a generalized suffix tree is done in linear time w.r.t. the string length, so this should not be an issue since, in any case, you will have to go through your lists at some point.
First, store all your strings in the suffix tree. This requires 2 small adaptations of the structure.
You need to keep a counter of the number of occurrences of a certain suffix, since your goal is ultimately to find the most common subsequence respecting certain properties.
Then, you also want to have a lookup table from (i, d) (where i is the integer you're looking for, the target, and d is the depth in your tree, the M) to the set of nodes of your suffix tree that are labeled with the 'letter' i (your alphabet is not made of chars, but of integers) and located at depth d. This lookup table can be built by traversing your suffix tree (BFS or DFS). You can even possibly store only the node that corresponds to the highest counter value.
From there, for some query (target, M), you would first look in your lookup table, and then find the node in the tree with the highest counter value. This would correspond to the most frequently encountered 'suffix' (or subsequence) in the list of lists.
The implementation is quite complex, since the generalized suffix tree is not a trivial structure (at all), and implementing it correctly, with modifications, would not be a small feat. But I think that this would allow for a very efficient query time.
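To make the counting idea concrete without a full generalized suffix tree, here is a toy sketch of the same lookup principle (my own simplification, not the structure described above): a counted trie over reversed windows, where a (target, M) query reduces to collecting counters at depth M below the child labeled target.
class Node:
    __slots__ = ("children", "count")
    def __init__(self):
        self.children = {}
        self.count = 0

def build(lists, max_m):
    """Index every window of up to max_m elements, reversed, so a window
    ending at position i becomes a root path starting with seq[i]."""
    root = Node()
    for seq in lists:
        for i in range(len(seq)):
            node = root
            for j in range(i, max(i - max_m, -1), -1):
                node = node.children.setdefault(seq[j], Node())
                node.count += 1
    return root

def query(root, target, m, n):
    """Top-n most frequent length-m windows ending in target."""
    start = root.children.get(target)
    if start is None:
        return []
    found = []
    stack = [(start, 1, (target,))]
    while stack:
        node, depth, path = stack.pop()
        if depth == m:
            found.append((node.count, path))
            continue
        for value, child in node.children.items():
            stack.append((child, depth + 1, path + (value,)))
    found.sort(reverse=True)
    # paths were built back-to-front, so reverse them for output
    return [(list(reversed(p)), c) for c, p in found[:n]]
>>> root = build(x, 100)
>>> query(root, target=5, m=4, n=2)
[([12, 12, 6, 5], 2), ([2, 3, 4, 5], 2)]
Unlike a true suffix tree, building this trie costs O(total elements × max_m) time and memory, which is why the suffix-tree route matters at the scale of billions of lists.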
For a suffix tree implementation, I would recommend reading only the original papers (like this or that; sc*-h*b can be your friend) until you get a deep and real understanding of the matter, and not the online 'explanations' of it, which are riddled with approximations and mistakes (even this post can help you get a first idea, but will misdirect you at some point if your goal is to implement a correct version).
To answer your first question: you can put all the lists in an array, fixing the length by padding with zeros so that the array becomes something you can work with. From an answer here:
x = [[1, 2, 3, 4, 5, 6, 7], # sequence 1
[6, 5, 10, 11], # sequence 2
[9, 8, 2, 3, 4, 5], # sequence 3
[12, 12, 6, 5], # sequence 4
[5, 8, 3, 4, 2], # sequence 5
[1, 5], # sequence 6
[2, 8, 8, 3, 5, 9, 1, 4, 12, 5, 6], # sequence 7
[7, 1, 7, 3, 4, 1, 2], # sequence 8
[9, 4, 12, 12, 6, 5, 1], # sequence 9
]
import numpy as np

lens = np.fromiter(map(len, x), dtype=int)
n1, n2 = len(lens), lens.max()
arr = np.zeros((n1, n2), dtype=int)
mask = np.arange(n2) < lens[:, None]
arr[mask] = np.concatenate(x)
arr
>> [[ 1 2 3 4 5 6 7 0 0 0 0]
[ 6 5 10 11 0 0 0 0 0 0 0]
[ 9 8 2 3 4 5 0 0 0 0 0]
[12 12 6 5 0 0 0 0 0 0 0]
[ 5 8 3 4 2 0 0 0 0 0 0]
[ 1 5 0 0 0 0 0 0 0 0 0]
[ 2 8 8 3 5 9 1 4 12 5 6]
[ 7 1 7 3 4 1 2 0 0 0 0]
[ 9 4 12 12 6 5 1 0 0 0 0]]
For the second question: use np.where to find the positions matching your condition. Then you can broadcast the row and column indices by adding dimensions, to include each 5 and the M-1 elements preceding it:
M = 4  # window length
N = 5  # note: here N is the target value, not the top-N count from the question
r, c = np.where(arr[:, M-1:] == N)
arr[r[:, None], (c[:, None] + np.arange(M))]
>>array([[ 2, 3, 4, 5],
[ 2, 3, 4, 5],
[12, 12, 6, 5],
[ 8, 8, 3, 5],
[ 1, 4, 12, 5],
[12, 12, 6, 5]])
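This answers the extraction part; to get the top-N counts, one possible follow-up (my addition, not part of the original answer) is np.unique with axis=0:
windows = arr[r[:, None], c[:, None] + np.arange(M)]
uniq, counts = np.unique(windows, axis=0, return_counts=True)
order = np.argsort(counts)[::-1]
uniq[order[:2]]  # the two most frequent windows
>> [[12 12  6  5]
   [ 2  3  4  5]]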
There are two parts to your question:
To generate the sub sequences you wanted, you can use a generator to help you:
def gen_m(lst, m, val):
    '''
    lst = sub_list to parse
    m = length required
    val = target value
    '''
    found = 0  # starts with 0 index
    # scan every occurrence of val; counting only over lst[m-1:] would
    # mispair occurrences earlier than index m-1 with later windows
    for _ in range(lst.count(val)):
        found = lst.index(val, found) + 1  # set and find the next index of val
        if found >= m:  # only occurrences with m elements up to and including val
            yield tuple(lst[found - m: found])  # yield the sliced sub_list of m length as a tuple
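For instance, on sequence 7 the generator yields both windows ending in 5 (a quick sanity check, not in the original answer):
>>> list(gen_m([2, 8, 8, 3, 5, 9, 1, 4, 12, 5, 6], 4, 5))
[(8, 8, 3, 5), (1, 4, 12, 5)]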
Then, using another generator, you can create a Counter of your sub lists:
from collections import Counter
target = 5
req_len = 4
# the yielded sub_lists need to be tuples to be hashable for the Counter
counter = Counter(sub_tup for lst in x for sub_tup in gen_m(lst, req_len, target))
Then, create a generator to check the counter object to return the N count required:
req_N = 2
def gen_common(counter, n):
    s = set()
    for i, (item, count) in enumerate(counter.most_common()):
        if i < n or count in s:  # keep the top n, plus anything tied with them
            yield item
        else:
            return
        s.add(count)
result = list(gen_common(counter, req_N))
Results where N == 2:
[(2, 3, 4, 5), (12, 12, 6, 5)]
Results where N == 3:
[(2, 3, 4, 5), (12, 12, 6, 5), (8, 8, 3, 5), (1, 4, 12, 5)]
With a larger sample:
x = [[1, 2, 3, 4, 5, 6, 7],
[6, 5, 10, 11],
[9, 8, 2, 3, 4, 5],
[12, 12, 6, 5],
[5, 8, 3, 4, 2],
[1, 5],
[2, 8, 8, 3, 5, 9, 1, 4, 12, 5, 6],
[7, 1, 7, 3, 4, 1, 2],
[9, 4, 12, 12, 6, 5, 1],
[9, 4, 12, 12, 6, 5, 1],
[9, 4, 2, 3, 4, 5, 1],
[9, 4, 8, 8, 3, 5, 1],
[9, 4, 7, 8, 9, 5, 1],
[9, 4, 1, 2, 2, 5, 1],
[9, 4, 12, 12, 6, 5, 1],
[9, 4, 12, 12, 6, 5, 1],
[9, 4, 1, 4, 12, 5],
[9, 1, 4, 12, 5, 1]
]
Where Counter is now:
Counter({(12, 12, 6, 5): 5, (2, 3, 4, 5): 3, (1, 4, 12, 5): 3, (8, 8, 3, 5): 2, (7, 8, 9, 5): 1, (1, 2, 2, 5): 1})
You can get results such as these:
for i in range(6):
    # testing req_N from 0 to 5
    list(gen_common(counter, i))
# req_N = 0: []
# req_N = 1: [(12, 12, 6, 5)]
# req_N = 2: [(12, 12, 6, 5), (2, 3, 4, 5), (1, 4, 12, 5)]
# req_N = 3: [(12, 12, 6, 5), (2, 3, 4, 5), (1, 4, 12, 5)]
# req_N = 4: [(12, 12, 6, 5), (2, 3, 4, 5), (1, 4, 12, 5), (8, 8, 3, 5)]
# req_N = 5: [(12, 12, 6, 5), (2, 3, 4, 5), (1, 4, 12, 5), (8, 8, 3, 5), (7, 8, 9, 5), (1, 2, 2, 5)]
Since N, M and the target are not fixed, I assume the data arrives in chunks of lists of lists. Here is an approach with O(N + M) time complexity (where N is the number of lists in a chunk and M is the total number of elements):
from collections import Counter

def get_seq(x, M, target):
    index_for_length_m = M - 1
    for lst in (l for l in x if len(l) >= M):
        for i in (i for i, val in enumerate(lst[index_for_length_m:], start=index_for_length_m) if val == target):
            # convert to str to be hashable
            yield str(lst[i - index_for_length_m: i + 1])

def process_chunk(x, M, N, target):
    return Counter(get_seq(x, M, target)).most_common(N)
With your example (using M = 4 and target = 5 from above):
process_chunk(x, M, 2, target)
output:
[('[2, 3, 4, 5]', 2), ('[12, 12, 6, 5]', 2)]
The performance:
%timeit process_chunk(x, M, 2, target)
# 25 µs ± 713 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
I am trying to figure out how to append elements from a list, in a round-robin pattern, into nested lists.
For example:
members = [1, 2, 3, 4, 5, 6, 7, 8, 9]
no_of_teams = int(input('no. of teams? '))
teams = [ [ ] for _ in range(no_of_teams)]
So that my output will end up looking like this:
no. of teams? 2
teams = [ [1, 3, 5, 7, 9], [2, 4, 6, 8]]
if the user enters 3 then it will look like this:
teams = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
and for 7 it looks like
teams = [ [1, 8], [2, 9], [3], [4], [5], [6], [7] ]
A good way is to use slicing:
number = int(input(...))
members = list(range(1, 10))
chunks = len(members) // number
teams = [members[i * chunks:(i + 1) * chunks]
         for i in range(number)]
Note that this produces contiguous chunks (and drops any remainder when the division is uneven) rather than the interleaved teams you describe.
You could also use step size instead:
teams = [members[i::number]
for i in range(number)]
This will yield your desired output:
in each iteration, we start the slice at the next item in the list
the slice then advances in steps of n
So, if n is 3, the 1st iteration gives a slice containing indexes 0, 3, 6, ... since the step size is 3
The 2nd iteration gives indexes 1, 4, 7, ...
The 3rd iteration gives indexes 2, 5, 8, ...
Iteration stops after the 3rd, since n dictates this too.
You can slice it into the correct number of sublists:
slice_size = len(members) // no_of_teams
teams = []
for i in range(no_of_teams):
    # note: leftover members (when the division is uneven) are not assigned
    teams.append(members[i * slice_size: i * slice_size + slice_size])
Given there are n teams, we can use list comprehension to construct n slices, such that the i-th element is assigned to the i mod n-th team:
teams = [ members[i::n] for i in range(n) ]
For example:
>>> n = 1
>>> [ members[i::n] for i in range(n) ]
[[1, 2, 3, 4, 5, 6, 7, 8, 9]]
>>> n = 2
>>> [ members[i::n] for i in range(n) ]
[[1, 3, 5, 7, 9], [2, 4, 6, 8]]
>>> n = 3
>>> [ members[i::n] for i in range(n) ]
[[1, 4, 7], [2, 5, 8], [3, 6, 9]]
You can slice the list:
teams = [members[i::no_of_teams] for i in range(no_of_teams)]
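Note that extended slicing handles uneven splits naturally: with no_of_teams = 7 the expression above yields [[1, 8], [2, 9], [3], [4], [5], [6], [7]], matching the desired output.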
I have the following array, [1, 4, 7, 9, 2, 10, 5, 8], and I need to separate it into 3 different arrays: one for values between 0 and 3, another for 3 to 6, and another for 6 to 25. The result must be something like this:
array1 = [1, 2]
array2 = [4, 5]
array3 = [7, 9, 10, 8]
Any idea how to do this simply?
First, define your "pole" numbers.
Second, generate your intervals from those "pole" numbers.
Third, define as many lists as there are intervals.
Then, for each interval, scan the list and append each item to the relevant list if it belongs to the interval.
code:
source = [1, 4, 7, 9, 2, 10, 5, 8]
poles = (0, 3, 6, 25)
intervals = [(poles[i], poles[i+1]) for i in range(len(poles)-1)]
# will generate: intervals = [(0, 3), (3, 6), (6, 25)]
output = [list() for _ in range(len(intervals))]
for out, (start, stop) in zip(output, intervals):
    for s in source:
        if start <= s < stop:
            out.append(s)
print(output)
result:
[[1, 2], [4, 5], [7, 9, 10, 8]]
This solution has the advantage of being adaptable to more than 3 lists/intervals by adding more "pole" numbers.
EDIT: There's a nice & fast solution (O(N log N)) if the order within the output lists doesn't matter:
first sort the input list
then generate the sliced sub-lists using bisect, which returns the insertion position of a provided number (left & right)
like this:
import bisect
source = sorted([1, 4, 7, 9, 2, 10, 5, 8])
poles = (0, 3, 6, 25)
# bisect_left on both bounds keeps the intervals half-open ([start, stop)),
# consistent with the start <= s < stop test above
output = [source[bisect.bisect_left(source, poles[i]):bisect.bisect_left(source, poles[i+1])]
          for i in range(len(poles)-1)]
print(output)
result:
[[1, 2], [4, 5], [7, 8, 9, 10]]
You can do that in a very simple way using a combination of a for loop and the range function:
lists = ([], [], [])
for element in [1, 4, 7, 9, 2, 10, 5, 8]:
    if element in range(0, 3):
        lists[0].append(element)
    elif element in range(3, 6):
        lists[1].append(element)
    elif element in range(6, 25):
        lists[2].append(element)
array1, array2, array3 = lists
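A related variant using bisect on the "pole" numbers from the earlier answer (my own sketch; it preserves the input order and duplicates, unlike the set-based approach below):
import bisect

source = [1, 4, 7, 9, 2, 10, 5, 8]
poles = (0, 3, 6, 25)
buckets = [[] for _ in range(len(poles) - 1)]
for s in source:
    # bisect_right(poles, s) - 1 is the index of the half-open interval containing s
    k = bisect.bisect_right(poles, s) - 1
    if 0 <= k < len(buckets):
        buckets[k].append(s)
print(buckets)
# [[1, 2], [4, 5], [7, 9, 10, 8]]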
"One-line" solution using set.intersection(*others) and range(start, stop[, step]) functions:
l = [1, 4, 7, 9, 2, 10, 5, 8]
l1, l2, l3 = (list(set(l).intersection(range(3))), list(set(l).intersection(range(3,6))), list(set(l).intersection(range(6,25))))
print(l1)
print(l2)
print(l3)
The output:
[1, 2]
[4, 5]
[8, 9, 10, 7]
https://docs.python.org/3/library/stdtypes.html?highlight=intersection#set.intersection
Sort a list of vectors (e.g. a list of lists, or an array of arrays of integers) so that vectors sharing the largest number of common integers end up adjacent. Every component is counted only once and paired with only a single number.
An example.
Input
[
[ 4, 6, 2, 2, 10 ],
[ 5, 20, 2, 7, 9 ], # 1 component is common with previous
[ 5, 4, 2, 10, 9 ], # 3 ...
[ 9, 6, 3, 3, 0 ], # 1 ...
[ 5, 7, 2, 9, 5 ], # 1 ...
[ 9, 3, 6, 7, 0 ] # 2 ...
]
Output (the total of adjacent common-element counts was 1+3+1+1+2 and became 2+3+3+1+4).
[
[ 4, 6, 2, 2, 10 ],
[ 5, 4, 2, 10, 9 ], # 2 components are common with previous
[ 5, 20, 2, 7, 9 ], # 3 ...
[ 5, 7, 2, 9, 5 ], # 3 ...
[ 9, 6, 3, 3, 0 ], # 1
[ 9, 3, 6, 7, 0 ] # 4 ...
]
My current 'sorting-of-sorting' solution (Python) does not work properly:
from copy import copy
from operator import itemgetter

def _near_vectors( vectors_ ):
    """
    Return value - changed order of indexes.
    """
    vectors = copy( vectors_ )
    # Sort each vector
    for i in range( len( vectors ) ):
        vectors[ i ] = sorted( vectors[ i ] )
    # Save info about indexes
    ind = [ ( i, vectors[ i ] ) for i in range( len( vectors ) ) ]
    sort_info = sorted( ind, key = itemgetter( 1 ) )
    return [ v[ 0 ] for v in sort_info ]
An example where it fails:
Input:
[
[0, 1, 2, 3],
[0, 1, 2, 4],
[4, 5, 6],
[4, 5, 13],
[5, 8, 9, 17],
[5, 12, 13],
[7, 8, 9],
[7, 10, 11],
[7, 11, 14, 15],
[7, 14, 15, 16]
]
Output: the same list, which is incorrect; [5, 12, 13] must come just after [4, 5, 13].
It's a useful algorithm for many things, for example to pull together in time the tasks that share common components (components being integer indexes). Maybe somebody has already solved this case?
This is the travelling salesman problem without the requirement of returning to the starting position. To avoid a negative metric, costs should be expressed as the number of elements not in common between adjacent lists; this gives you a triangle inequality so you can use metric TSP methods.
You could implement a TSP solver yourself, but it'd probably make more sense to use an existing one; for example, Google's or-tools has Python bindings and example code for how to use them to solve TSP instances.
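To illustrate the reduction (my own sketch using a greedy nearest-neighbour heuristic, not a proper TSP solver like or-tools, so the resulting tour is only approximate):
from collections import Counter

def cost(a, b):
    # Number of elements NOT in common between two lists, in the multiset sense.
    common = sum((Counter(a) & Counter(b)).values())
    return len(a) + len(b) - 2 * common

def greedy_order(vectors):
    # Start at vector 0 and repeatedly append the remaining vector
    # with the lowest cost to the current end of the chain.
    order = [0]
    remaining = set(range(1, len(vectors)))
    while remaining:
        last = vectors[order[-1]]
        nxt = min(remaining, key=lambda j: cost(last, vectors[j]))
        order.append(nxt)
        remaining.remove(nxt)
    return [vectors[i] for i in order]
A real solver (or a 2-opt pass on top of this) would typically find a better ordering, but this shows how the pairwise cost plugs in.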
The algorithm is correct for what it does:
# Sort each vector
for i in range( len( vectors ) ):
    vectors[ i ] = sorted( vectors[ i ] )
# Output
[
[0, 1, 2, 3],
[0, 1, 2, 4],
[4, 5, 6],
[4, 5, 13],
[5, 8, 9, 17],
[5, 12, 13],
[7, 8, 9],
[7, 10, 11],
[7, 11, 14, 15],
[7, 14, 15, 16]
]
# Save info about indexes
ind = [ ( i, vectors[ i ] ) for i in range( len( vectors ) ) ]
# Output convert to list with tuples (index,list[i])
[(0, [0, 1, 2, 3]),
(1, [0, 1, 2, 4]),
(2, [4, 5, 6]),
(3, [4, 5, 13]),
(4, [5, 8, 9, 17]),
(5, [5, 12, 13]),
(6, [7, 8, 9]),
(7, [7, 10, 11]),
(8, [7, 11, 14, 15]),
(9, [7, 14, 15, 16])]
sort_info = sorted( ind, key = itemgetter( 1 ) )
# Output: sorted by list, comparing the first elements of each list
# first, then the second ones, etc. For example, [4, 5, 13] comes
# before [5, 8, 9, 17] because their first elements satisfy 4 < 5.
# Then [5, 8, 9, 17] comes before [5, 12, 13] because the second
# elements satisfy 8 < 12, etc.
[(0, [0, 1, 2, 3]),
(1, [0, 1, 2, 4]),
(2, [4, 5, 6]),
(3, [4, 5, 13]),
(4, [5, 8, 9, 17]),
(5, [5, 12, 13]),
(6, [7, 8, 9]),
(7, [7, 10, 11]),
(8, [7, 11, 14, 15]),
(9, [7, 14, 15, 16])
]
The result of my current function _near_vectors() can be improved.
The idea: improve the result of the simple sorting solution (which only sorts partially) with a function that moves each element to a more valuable position; this improves the result considerably, moving vectors that sit in a 'vacuum' to the right positions. That is, it calculates the current value of the adjacent components and checks whether the element can be moved to another position (next to other adjacent components) where the links between the newly formed groups are more valuable.
from collections import Counter

def _match( v1, v2 ):
    """
    The number of identical elements in v1 and v2, where v1 and v2 are lists.
    """
    res = len( list( ( Counter( v1 ) & Counter( v2 ) ).elements() ) )
    return res
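For instance, on the second and third example vectors (a quick check, not in the original post):
>>> _match( [ 5, 20, 2, 7, 9 ], [ 5, 4, 2, 10, 9 ] )
3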
def _mind( vectors, vi, i ):
    "Calculate value of position i for vector vi in vectors."
    def pos( i ):
        return 0 <= i < len( vectors )
    if isinstance( vi, int ):
        if pos( vi ):
            v = vectors[ vi ][ 1 ]
        else:
            return 0
    else:
        v = vi
    if pos( i ):
        return _match( vectors[ i ][ 1 ], v )
    else:
        return 0
def _near_vectors2( vectors ):
    i = 0
    for v in vectors:
        max_val = 0
        new_pos = None
        for k in range( len( vectors ) ):
            if k != i:
                v = vectors[ i ][ 1 ]  # current vector
                # Calc. sum of links, move if links are improved
                next_k = k + 1 if i < k else k - 1
                cur_mind = ( _mind( vectors, v, k )
                             + _mind( vectors, v, next_k )
                             - _mind( vectors, k, next_k )
                             - _mind( vectors, v, i + 1 )
                             - _mind( vectors, v, i - 1 )
                             + _mind( vectors, i - 1, i + 1 ) )
                if cur_mind > max_val:
                    max_val = cur_mind
                    new_pos = k
        # move element
        if new_pos is not None:
            vectors.insert( new_pos, vectors.pop( i ) )
        i += 1
    return vectors
So, to make it work we can call:
near_indexes = _near_vectors2( near_vectors( vects ) )
Each element of the output list consists of two components: the changed index and the source vector.
Remark: neither function works properly without its paired counterpart. The result is probably not 'absolutely' sorted, but it is largely sorted.