Merging arrays based on duplicate values in another array in Python?

I've organized my data into 3 lists. The first one simply contains floating-point numbers, some of which are duplicates. The second and third lists contain 1D arrays of variable length.
The first list is sorted and all lists contain the same number of elements.
The overall format is this:
a = [1.0, 1.5, 1.5, 2 , 2]
b = [arr([1 2 3 4 10]), arr([4 8 10 11 5 6 12]), arr([1 5 7]), arr([70 1 2]), arr([1])]
c = [arr([3 4 8]), arr([5 6 12]), arr([6 7 10 123 14]), arr([70 1 2]), arr([1 5 10 4])]
I'm trying to find a way to merge the arrays in lists b and c if their corresponding float number is the same in the list a. For the example above, the desired result would be:
a = [1.0, 1.5, 2]
b = [arr([1 2 3 4 10]), arr([4 8 10 11 5 6 12 1 5 7]), arr([70 1 2 1])]
c = [arr([3 4 8]), arr([5 6 12 6 7 10 123 14]), arr([70 1 2 1 5 10 4])]
How would I go about doing this? Does it have something to do with zip?

Since a is sorted, I would use itertools.groupby. Similar to @MadPhysicist's answer, but iterating over the zip of the lists:
import numpy as np
from itertools import groupby
arr = np.array
a = [1.0, 1.5, 1.5, 2 , 2]
b = [arr([1, 2, 3, 4, 10]), arr([4, 8, 10, 11, 5, 6, 12]), arr([1, 5, 7]), arr([70, 1, 2]), arr([1])]
c = [arr([3, 4, 8]), arr([5, 6, 12]), arr([6, 7, 10, 123, 14]), arr([70, 1, 2]), arr([1, 5, 10, 4])]
res_a, res_b, res_c = [], [], []
for k, g in groupby(zip(a, b, c), key=lambda x: x[0]):
    g = list(g)
    res_a.append(k)
    res_b.append(np.concatenate([x[1] for x in g]))
    res_c.append(np.concatenate([x[2] for x in g]))
...which outputs res_a, res_b and res_c as:
[1.0, 1.5, 2]
[array([ 1, 2, 3, 4, 10]), array([ 4, 8, 10, 11, 5, 6, 12, 1, 5, 7]), array([70, 1, 2, 1])]
[array([3, 4, 8]), array([ 5, 6, 12, 6, 7, 10, 123, 14]), array([70, 1, 2, 1, 5, 10, 4])]
Alternatively in case a is not sorted, you can use defaultdict:
import numpy as np
from collections import defaultdict
arr = np.array
a = [1.0, 1.5, 1.5, 2 , 2]
b = [arr([1, 2, 3, 4, 10]), arr([4, 8, 10, 11, 5, 6, 12]), arr([1, 5, 7]), arr([70, 1, 2]), arr([1])]
c = [arr([3, 4, 8]), arr([5, 6, 12]), arr([6, 7, 10, 123, 14]), arr([70, 1, 2]), arr([1, 5, 10, 4])]
res_a, res_b, res_c = [], [], []
d = defaultdict(list)
for x, y, z in zip(a, b, c):
    d[x].append([y, z])
for k, v in d.items():
    res_a.append(k)
    res_b.append(np.concatenate([x[0] for x in v]))
    res_c.append(np.concatenate([x[1] for x in v]))
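One difference from the groupby version: the defaultdict result follows first-insertion order (guaranteed for dicts since Python 3.7), not sorted order. A minimal sketch, assuming you want the grouped results re-sorted by key afterwards:
order = sorted(range(len(res_a)), key=res_a.__getitem__)  # indices that sort res_a
res_a = [res_a[i] for i in order]
res_b = [res_b[i] for i in order]
res_c = [res_c[i] for i in order]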

Since a is sorted, you could use itertools.groupby on the range of indices in your list, keyed by a:
import numpy as np
from itertools import groupby
result_a = []
result_b = []
result_c = []
for key, group in groupby(range(len(a)), key=a.__getitem__):
    group = list(group)
    index = slice(group[0], group[-1] + 1)
    result_a.append(key)
    result_b.append(np.concatenate(b[index]))
    result_c.append(np.concatenate(c[index]))
group is an iterator, so you need to consume it to get the actual indices it represents. Each group contains all the indices that correspond to the same value in a.
slice(...) is what gets passed to list.__getitem__ any time there is a : in the indexing expression. index is equivalent to group[0]:group[-1] + 1. This slices out the portion of the list that corresponds to each key in a.
Finally, np.concatenate just merges your arrays together in batches.
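For instance, a tiny illustration of the slice-plus-concatenate step on the example's b list (the two arrays sharing the key 1.5):
part = b[1:3]                  # [arr([4, 8, 10, 11, 5, 6, 12]), arr([1, 5, 7])]
merged = np.concatenate(part)  # array([ 4,  8, 10, 11,  5,  6, 12,  1,  5,  7])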
If you wanted to do this without doing list(group), you could consume the iterator in other ways, without keeping the values around. For example, you could get groupby to do it for you:
from itertools import groupby
result_a = []
result_b = []
result_c = []
prev = None
for key, group in groupby(range(len(a)), key=a.__getitem__):
    index = next(group)
    result_a.append(key)
    if prev is not None:
        result_b.append(np.concatenate(b[prev:index]))
        result_c.append(np.concatenate(c[prev:index]))
    prev = index
if prev is not None:
    result_b.append(np.concatenate(b[prev:]))
    result_c.append(np.concatenate(c[prev:]))
At that point, you wouldn't even really need to use groupby since it wouldn't be much more work to keep track of everything yourself:
result_a = []
result_b = []
result_c = []
k = None
for i, n in enumerate(a):
    if n == k:
        continue
    result_a.append(n)
    if k is not None:
        result_b.append(np.concatenate(b[prev:i]))
        result_c.append(np.concatenate(c[prev:i]))
    k = n
    prev = i
if k is not None:
    result_b.append(np.concatenate(b[prev:]))
    result_c.append(np.concatenate(c[prev:]))

EDIT: the solutions above from @Austin and @MadPhysicist are better, so prefer them; mine reinvents the wheel, which is not the Pythonic way.
I think modifying the original arrays is dangerous, so even though this approach uses twice as much memory, it is safer to iterate and build new lists this way.
What's happening:
- iterate over a and search for occurrences of the current value elsewhere in a (we exclude the current index with remove(i))
- if there are no duplicates, just copy b and c over as usual
- if there are, merge into temp lists, then append those to a1, b1 and c1; block the value so a duplicate won't trigger another merge (the if at the start of the loop checks whether a value is blocked)
- return the new lists
I didn't bother with NumPy arrays, though I used np.where since it is a bit faster than a list comprehension. Feel free to edit the data formats etc.; mine are kept simple for demonstration purposes.
import numpy as np
a = [1.0, 1.5, 1.5, 2, 2]
b = [[1, 2, 3, 4, 10], [4, 8, 10, 11, 5, 6, 12], [1, 5, 7], [70, 1, 2], [1]]
c = [[3, 4, 8], [5, 6, 12], [6, 7, 10, 123, 14], [70, 1, 2], [1, 5, 10, 4]]
def function(list1, list2, list3):
    a1 = []
    b1 = []
    c1 = []
    merged_list = []
    # to preserve the original index we use enumerate
    for i, item in enumerate(list1):
        # to avoid merging twice we just skip values from a we already checked
        if item not in merged_list:
            list_without_elem = np.array(list1)
            ixs = np.where(list_without_elem == item)[0].tolist()
            ixs.remove(i)  # removing our original index
            # if empty, append to the new lists as usual since we don't need a merge
            if not ixs:
                a1.append(item)
                b1.append(list2[i])
                c1.append(list3[i])
                merged_list.append(item)
            else:
                temp1 = [*list2[i]]  # temp b and c prefilled with the first b and c
                temp2 = [*list3[i]]
                for ix in ixs:
                    temp1.extend(list2[ix])
                    temp2.extend(list3[ix])
                a1.append(item)
                b1.append(temp1)
                c1.append(temp2)
                merged_list.append(item)
    print(a1)
    print(b1)
    print(c1)

function(a, b, c)
# example output
# [1.0, 1.5, 2]
# [[1, 2, 3, 4, 10], [4, 8, 10, 11, 5, 6, 12, 1, 5, 7], [70, 1, 2, 1]]
# [[3, 4, 8], [5, 6, 12, 6, 7, 10, 123, 14], [70, 1, 2, 1, 5, 10, 4]]


How to get the unselected population in python random module

So, I know I can get a random list from a population using the random module,
l = [0, 1, 2, 3, 4, 8, 9]
print(random.sample(l, 3))
# [1, 3, 2]
But how do I get the list of the unselected ones? Do I need to remove them manually from the list? Or is there a method to get them too?
Edit: The list l in the example doesn't contain the same item multiple times, but when it does, I wouldn't want an item removed more times than it was selected in the sample.
l = [0, 1, 2, 3, 4, 8, 9]
s1 = set(random.sample(l, 3))
s2 = set(l).difference(s1)
>>> s1
{0, 3, 8}
>>> s2
{1, 2, 4, 9}
Update: same items multiple times
You can shuffle your list first and then partition the population in two:
l = [7, 4, 5, 4, 5, 9, 8, 6, 6, 6, 9, 8, 6, 3, 8]
pop = l[:]
random.shuffle(pop)
pop1, pop2 = pop[:3], pop[3:]
>>> pop1
[8, 4, 9]
>>> pop2
[7, 6, 8, 6, 5, 6, 9, 6, 5, 8, 4, 3]
Because your list can contain the same item multiple times, you can switch to the approach below:
import random
l = [0, 1, 2, 3, 4, 8 ,9]
random.shuffle(l)
selected = l[:3]
unselected = l[3:]
print(selected)
# [4, 0, 1]
print(unselected)
# [8, 2, 3, 9]
If you want to keep track of duplicates, you could count the items of each type and compare the population count to the sample count.
If you don't care about the order of items in the population, you could do it like this:
from collections import Counter
import random
population = [1, 1, 2, 2, 9, 7, 9]
sample = random.sample(population, 3)
pop_count = Counter(population)
samp_count = Counter(sample)
unsampled = [
    k
    for k in pop_count
    for i in range(pop_count[k] - samp_count[k])
]
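As a quick sanity check (a sketch; ignoring order, the sample plus the unsampled remainder should reconstitute the population):
assert Counter(sample) + Counter(unsampled) == Counter(population)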
If you care about the order in the population, you could do something like this:
check = sample.copy()
unsampled = []
for val in population:
    if val in check:
        check.remove(val)
    else:
        unsampled.append(val)
Or there's this weird list comprehension (not recommended):
check = sample.copy()
unsampled = [
    x
    for x in population
    if x not in check or check.remove(x)
]
The if clause here uses two tricks:
- both parts of the test will be falsy if x is in check (list.remove() always returns None), and
- remove() will only be called if the first part fails, i.e., if x is in check.
Basically, if (and only if) x is in check, it will bomb through and check the next condition, which will also be False (None), but will have the side effect of removing one copy of x from check.
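As a quick worked example (values chosen for the demo, with a fixed sample instead of a random one):
population = [1, 1, 2, 2, 9, 7, 9]
sample = [1, 2, 9]
check = sample.copy()
unsampled = [x for x in population if x not in check or check.remove(x)]
# unsampled == [1, 2, 9, 7]; exactly one copy of each sampled value was consumed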
You can do it with set difference (note that this drops duplicates and order, so it suits lists without repeated items):
import random
l = [0, 1, 2, 3, 4, 8, 9]
rand = random.sample(l, 3)
rest = list(set(l) - set(rand))
print(f"initial list: {l}")
print(f"random list: {rand}")
print(f"rest list: {rest}")
Result:
initial list: [0, 1, 2, 3, 4, 8, 9]
random list: [2, 9, 0]
rest list: [8, 1, 3, 4]

Adding values from one array depending on occurrences of values in another array

Can't really wrap my mind around this problem I'm having:
say I have 2 arrays
A = [2, 7, 4, 3, 9, 4, 2, 6]
B = [1, 1, 1, 4, 4, 7, 7, 7]
what I'm trying to do is: if a value is repeated in array B (like how 1 is repeated 3 times), the corresponding values in array A are added up and appended to another array (say C)
so C would look like (from above two arrays):
C = [13, 12, 12]
Also, a side note: the application I'd be using this code for uses timestamps from a database as array B (so once a day has passed, that value in the array obviously won't be repeated)
Any help is appreciated!!
Here is a solution without pandas, using only itertools.groupby:
from itertools import groupby
C = [sum(a for a, _ in g) for _, g in groupby(zip(A, B), key=lambda x: x[1])]
yields:
[13, 12, 12]
I would use pandas for this
Say you put those arrays in a DataFrame. This does the job:
import pandas as pd

df = pd.DataFrame(
    {
        'A': [2, 7, 4, 3, 9, 4, 2, 6],
        'B': [1, 1, 1, 4, 4, 7, 7, 7]
    }
)
df.groupby('B').sum()
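For reference, with the example data the grouped sum should come out roughly as:
    A
B
1  13
4  12
7  12
and df.groupby('B').sum()['A'].tolist() then gives [13, 12, 12].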
If you want a pure Python solution, you can use itertools.groupby:
from itertools import groupby
A = [2, 7, 4, 3, 9, 4, 2, 6]
B = [1, 1, 1, 4, 4, 7, 7, 7]
out = []
for _, g in groupby(zip(A, B), lambda k: k[1]):
    out.append(sum(v for v, _ in g))
print(out)
Prints:
[13, 12, 12]

How to multiply each element of two lists in python? [duplicate]

How do I multiply lists together in Python using a function? This is what I have:
list = [1, 2, 3, 4]
def list_multiplication(list, value):
    mylist = []
    for item in list:
        for place in value:
            mylist.append(item*value)
    return mylist
So I want to use this to multiply list*list (1*1, 2*2, 3*3, 4*4)
So the output would be 1, 4, 9, and 16. How would I do this in python where the 2nd list could be anything?
Thanks
My favorite way is mapping the mul operator over the two lists:
from operator import mul
mul(2, 5)
#>>> 10
mul(3, 6)
#>>> 18
map(mul, [1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
#>>> <map object at 0x7fc424916f50>
map, at least in Python 3, returns an iterator. Hence if you want a list you should convert it to one:
list(map(mul, [1, 2, 3, 4, 5], [6, 7, 8, 9, 10]))
#>>> [6, 14, 24, 36, 50]
But by then it might make more sense to use a list comprehension over the zip'd lists.
[a*b for a, b in zip([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])]
#>>> [6, 14, 24, 36, 50]
To explain the last one, zip([a,b,c], [x,y,z]) gives (an iterator that yields) [(a,x),(b,y),(c,z)].
The for a, b in "unpacks" each (m,n) pair into the variables a and b, and a*b multiplies them.
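A quick illustration of that unpacking (demo values only):
list(zip(['a', 'b', 'c'], ['x', 'y', 'z']))
#>>> [('a', 'x'), ('b', 'y'), ('c', 'z')]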
You can use a list comprehension:
>>> t = [1, 2, 3, 4]
>>> [i**2 for i in t]
[1, 4, 9, 16]
Note that 1*1, 2*2, etc is the same as squaring the number.
If you need to multiply two lists, consider zip():
>>> L1 = [1, 2, 3, 4]
>>> L2 = [1, 2, 3, 4]
>>> [i*j for i, j in zip(L1, L2)]
[1, 4, 9, 16]
If you have two lists A and B of the same length, easiest is to zip them:
>>> A = [1, 2, 3, 4]
>>> B = [5, 6, 7, 8]
>>> [a*b for a, b in zip(A, B)]
[5, 12, 21, 32]
Take a look at zip on its own to understand how that works (in Python 3, zip returns an iterator, so wrap it in list() to see the pairs):
>>> list(zip(A, B))
[(1, 5), (2, 6), (3, 7), (4, 8)]
zip() would do:
[a*b for a,b in zip(lista,listb)]
zip is probably the way to go, as suggested by the other answers. That said, here's an alternative beginner approach.
# create data
size = 20
a = [i+1 for i in range(size)]
b = [val*2 for val in a]
a
>> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
b
>> [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40]
def multiply_list_elems(list_one, list_two):
    """ non-efficient method """
    res = []  # initialize empty list to append results to
    if len(list_one) == len(list_two):  # check that both lists have the same number of elements
        print("\n list one: \n", list_one, "\n list two: \n", list_two, "\n")
        for idx in range(len(list_one)):  # for each index
            res.append(list_one[idx] * list_two[idx])  # multiply the ith elements pairwise
    return res

def efficient_multiplier(list_one, list_two):
    """ efficient method """
    return [list_one[idx] * list_two[idx] for idx in range(len(list_one)) if len(list_one) == len(list_two)]

print(multiply_list_elems(a, b))
print(efficient_multiplier(a, b))
both give:
>> [2, 8, 18, 32, 50, 72, 98, 128, 162, 200, 242, 288, 338, 392, 450, 512, 578, 648, 722, 800]
Yet another approach is using numpy, as suggested here.
Use numpy for this.
>>> import numpy as np
>>> list = [1,2,3,4]
>>> np.multiply(list, list)
array([ 1, 4, 9, 16])
If you prefer python lists:
>>> np.multiply(list, list).tolist()
[1, 4, 9, 16]
Additionally, this also works for element-wise multiplication with a scalar.
>>> np.multiply(list, 2)
array([2, 4, 6, 8])

How to replace consecutive values in a list using another list as a reference?

I have a list like this:
list_target = [4, 5, 6, 7, 12, 13, 14]
list_primer = [3, 11]
So list_target consists of blocks of consecutive values, between which are jumps in values (like from 7 to 12). list_primer consists of values at the beginning of those blocks. Elements in list_primer are generated in another process.
My question is: for each element of list_primer, how can I identify the block in list_target and replace their values with what I want? For example, if I choose to replace the values in the first block with 1 and the second with 0, the outcome looks like:
list_target_result = [1, 1, 1, 1, 0, 0, 0]
Here's a simple algorithm which solves your task by looping through both lists beginning to end:
list_target = [4, 5, 6, 7, 12, 13, 14]
list_primer = [3, 11]
block_values = [1, 0]
result = []
for i, primer in enumerate(list_primer):
    for j, target in enumerate(list_target):
        if target == primer+1:
            primer += 1
            result.append(block_values[i])
        else:
            continue
print(result)
[1, 1, 1, 1, 0, 0, 0]
Note that you might run into trouble if not all blocks have a respective primer, depending on your use case.
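If you want to fail loudly in that case, a cheap guard (a sketch, assuming every element of list_target should receive a value) is:
assert len(result) == len(list_target), "some block in list_target has no matching primer"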
Modifying a method to find groups of strictly increasing numbers in a list:
from itertools import cycle, groupby

def group_seq(l, list_primer):
    " Find groups which are strictly increasing or equal to the next list_primer value "
    temp_list = cycle(l)
    temp_primer = cycle(list_primer)
    next(temp_list)
    groups = groupby(l, key=lambda j: (j + 1 == next(temp_list)) or (j == next(temp_primer)))
    for k, v in groups:
        if k:
            yield tuple(v) + (next((next(groups)[1])),)
Use group_seq to find strictly increasing blocks in list_target
list_target = [4, 5, 6, 7, 12, 13, 14]
list_primer = [3, 11]
block_values = [1, 0]
result = []
for k, v in zip(block_values, group_seq(list_target, list_primer)):
    result.extend([k]*len(v))  # k is a value from block_values;
                               # v is a block of strictly increasing numbers,
                               # i.e. group_seq(list_target, list_primer) creates sublists
                               # [(4, 5, 6, 7), (12, 13, 14)]
print(result)
Out: [1, 1, 1, 1, 0, 0, 0]
Here's a solution using numpy.
import numpy as np
list_target = np.array([4, 5, 6, 7, 12, 13, 14])
list_primer = np.array([3, 11])
values = [1, 0]
ix = np.searchsorted(list_target, list_primer)
# [0,4]
blocks = np.split(list_target, ix)[1:]
# [array([4, 5, 6, 7]), array([12, 13, 14])]
res = np.concatenate([np.full(s.size, values[i]) for i,s in enumerate(blocks)])
# array([1, 1, 1, 1, 0, 0, 0])
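An equivalent, arguably simpler final step (a sketch reusing the blocks list above) is np.repeat, which repeats each value by its block length:
res = np.repeat(values, [blk.size for blk in blocks])
# array([1, 1, 1, 1, 0, 0, 0])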
Here is a solution that works in O(n), where n=len(list_target). It assumes that your list_target list is consecutive in the way you described (increments by exactly one within block, increments of more than one between blocks).
It builds a dictionary with the potential primer of each block (the value one below the block's first element) as key and the lower and upper indices of that block within list_target as values. Access to that dict is then O(1).
list_target = [4, 5, 6, 7, 12, 13, 14]
list_primer = [3, 11]
block_dict = dict()
lower_idx = 0
upper_idx = 0
for i, val in enumerate(list_target):  # runs in O(n)
    upper_idx = i + 1
    if i == len(list_target) - 1:  # for the last block in the list
        block_dict[list_target[lower_idx] - 1] = (lower_idx, upper_idx)
        break
    if list_target[i + 1] - list_target[i] != 1:  # if the increment is more than one, save the current block to the dict and reset the lower index
        block_dict[list_target[lower_idx] - 1] = (lower_idx, upper_idx)
        lower_idx = i + 1
Here are the results:
print(block_dict) # quick checks
>>>> {3: (0,4), 11: (4,7)}
for p in list_primer:  # printing the corresponding blocks
    lower, upper = block_dict[p]  # dict access in O(1)
    print(list_target[lower:upper])
>>>> [4, 5, 6, 7]
[12, 13, 14]
# getting the indices for first primer marked as in your original question:
list_target_result = [0] * len(list_target)
lower_ex, upper_ex = block_dict[3]
list_target_result[lower_ex: upper_ex] = [1]*(upper_ex-lower_ex)
print(list_target_result)
>>>> [1, 1, 1, 1, 0, 0, 0]

Find Top N Most Frequent Sequence of Numbers in List of Lists

Let's say I have the following list of lists:
x = [[1, 2, 3, 4, 5, 6, 7], # sequence 1
[6, 5, 10, 11], # sequence 2
[9, 8, 2, 3, 4, 5], # sequence 3
[12, 12, 6, 5], # sequence 4
[5, 8, 3, 4, 2], # sequence 5
[1, 5], # sequence 6
[2, 8, 8, 3, 5, 9, 1, 4, 12, 5, 6], # sequence 7
[7, 1, 7, 3, 4, 1, 2], # sequence 8
[9, 4, 12, 12, 6, 5, 1], # sequence 9
]
Essentially, for any list that contains the target number 5 (i.e., target=5) anywhere within the list, what are the top N=2 most frequently observed subsequences with length M=4?
So, the conditions are:
if target doesn't exist in the list then we ignore that list completely
if the list length is less than M then we ignore the list completely
if the list is exactly length M but target is not in the Mth position then we ignore it (but we count it if target is in the Mth position)
if the list length, L, is longer than M and target is in the i=M position (or the i=M+1 position, or the i=M+2 position, ..., or the i=L position), then we count the subsequence of length M where target is in the final position of the subsequence
So, using our list-of-lists example, we'd count the following subsequences:
subseqs = [[2, 3, 4, 5], # taken from sequence 1
[2, 3, 4, 5], # taken from sequence 3
[12, 12, 6, 5], # taken from sequence 4
[8, 8, 3, 5], # taken from sequence 7
[1, 4, 12, 5], # taken from sequence 7
[12, 12, 6, 5], # taken from sequence 9
]
Of course, what we want are the top N=2 subsequences by frequency. So, [2, 3, 4, 5] and [12, 12, 6, 5] are the top two most frequent sequences by count. If N=3 then all of the subsequences (subseqs) would be returned since there is a tie for third.
This is super simplified but, in reality, my actual list-of-lists:
- consists of a few billion lists of positive integers (between 1 and 10,000)
- each list can be as short as 1 element or as long as 500 elements
- N and M can be as small as 1 or as big as 100
My questions are:
Is there an efficient data structure that would allow for fast queries assuming that N and M will always be less than 100?
Are there efficient algorithms or relevant area of research for performing this kind of analysis for various combinations of N and M?
Here is an idea, based on a generalized suffix tree structure. Your list of lists can be seen as a list of strings, where the alphabet would consist of integers (so about 10k characters in the alphabet with the info you provided).
The construction of a generalized suffix tree is done in linear time w.r.t. the string length, so this should not be an issue since, in any case, you will have to go through your lists at some point.
First, store all your strings in the suffix tree. This requires 2 small adaptations of the structure.
You need to keep a counter of the number of occurrences of a certain suffix, since your goal is ultimately to find the most common subsequence respecting certain properties.
Then, you also want to have a lookup table from (i, d) (where i is the integer you're looking for, the target, and d is the depth in your tree, the M) to the set of nodes of your suffix tree that are labeled with the 'letter' i (your alphabet is not made of chars, but of integers), located at a depth d. This lookup table can be built by traversing your suffix tree (BFS or DFS). You can even possibly store only the node that corresponds to the highest counter value.
From there, for some query (target, M), you would first look in your lookup table, and then find the node in the tree with the highest counter value. This would correspond to the most frequently encountered 'suffix' (or subsequence) in the list of lists.
The implementation is quite complex, since the generalized suffix tree is not a trivial structure (at all), and implementing it correctly, with modifications, would not be a small feat. But I think that this would allow for a very efficient query time.
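A full generalized suffix tree won't fit in a short snippet, but here is a rough sketch of the counting idea alone, using a plain hash map over the length-M windows that end in target (this is explicitly not a suffix tree; all names are illustrative):
from collections import Counter

def most_common_window(lists, M, target, N):
    # count every length-M window whose last element equals target
    counts = Counter(
        tuple(lst[i - M + 1 : i + 1])
        for lst in lists
        if len(lst) >= M
        for i, val in enumerate(lst)
        if val == target and i >= M - 1
    )
    return counts.most_common(N)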
For a suffix tree implementation, I would recommend that you read only the original papers until you get a deep and real understanding of them (like this or that; sc*-h*b can be your friend), and not the online 'explanations' of it, which are riddled with approximations and mistakes (even this post can help to get a first idea, but will misdirect you at some point if your goal is to implement a correct version).
To answer your first question: you can put all the lists in an array, fixing the length by padding with zeros, so the array becomes something you can work with. From an answer here:
x = [[1, 2, 3, 4, 5, 6, 7], # sequence 1
[6, 5, 10, 11], # sequence 2
[9, 8, 2, 3, 4, 5], # sequence 3
[12, 12, 6, 5], # sequence 4
[5, 8, 3, 4, 2], # sequence 5
[1, 5], # sequence 6
[2, 8, 8, 3, 5, 9, 1, 4, 12, 5, 6], # sequence 7
[7, 1, 7, 3, 4, 1, 2], # sequence 8
[9, 4, 12, 12, 6, 5, 1], # sequence 9
]
lens = np.fromiter(map(len, x), dtype=int)
n1, n2 = len(lens), lens.max()
arr = np.zeros((n1, n2), dtype=int)
mask = np.arange(n2) < lens[:,None]
arr[mask] = np.concatenate(x)
arr
>> [[ 1 2 3 4 5 6 7 0 0 0 0]
[ 6 5 10 11 0 0 0 0 0 0 0]
[ 9 8 2 3 4 5 0 0 0 0 0]
[12 12 6 5 0 0 0 0 0 0 0]
[ 5 8 3 4 2 0 0 0 0 0 0]
[ 1 5 0 0 0 0 0 0 0 0 0]
[ 2 8 8 3 5 9 1 4 12 5 6]
[ 7 1 7 3 4 1 2 0 0 0 0]
[ 9 4 12 12 6 5 1 0 0 0 0]]
For the second question: use np.where to find the different positions matching your condition. Then you can broadcast the row and column indices by adding dimensions to include the 5s and the preceding 4 elements:
M = 4
target = 5
r, c = np.where(arr[:, M-1:] == target)
arr[r[:,None], (c[:,None] + np.arange(M))]
>>array([[ 2, 3, 4, 5],
[ 2, 3, 4, 5],
[12, 12, 6, 5],
[ 8, 8, 3, 5],
[ 1, 4, 12, 5],
[12, 12, 6, 5]])
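To finish the top-N part in NumPy (a sketch continuing from the arrays above; uniq, cnt and top are illustrative names):
subseqs = arr[r[:,None], c[:,None] + np.arange(M)]
uniq, cnt = np.unique(subseqs, axis=0, return_counts=True)
top = uniq[np.argsort(cnt)[::-1][:2]]  # the N=2 most frequent rows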
There are two parts to your question:
To generate the subsequences you want, you can use a generator to help you:
def gen_m(lst, m, val):
    '''
    lst = sub_list to parse
    m = length required
    val = target value
    '''
    found = 0  # search position
    for i in range(lst[m-1:].count(val)):  # repeat for each occurrence of val at index m-1 or later
        found = lst.index(val, max(found, m-1)) + 1  # find the next such index of val
        yield tuple(lst[found-m: found])  # yield the length-m window ending at val, as a tuple
Then, using another generator, you can create a Counter of your subsequences:
from collections import Counter
target = 5
req_len = 4
# the yielded sub_lists need to be tuples to be hashable for the Counter
counter = Counter(sub_tup for lst in x for sub_tup in gen_m(lst, req_len, target))
Then, create a generator that checks the counter object and returns the required top-N items, keeping ties:
req_N = 2
def gen_common(counter, n):
    s = set()
    for i, (item, count) in enumerate(counter.most_common()):
        if i < n or count in s:
            yield item
        else:
            return
        s.add(count)
result = list(gen_common(counter, req_N))
Results where N == 2:
[(2, 3, 4, 5), (12, 12, 6, 5)]
Results where N == 3:
[(2, 3, 4, 5), (12, 12, 6, 5), (8, 8, 3, 5), (1, 4, 12, 5)]
With a larger sample:
x = [[1, 2, 3, 4, 5, 6, 7],
[6, 5, 10, 11],
[9, 8, 2, 3, 4, 5],
[12, 12, 6, 5],
[5, 8, 3, 4, 2],
[1, 5],
[2, 8, 8, 3, 5, 9, 1, 4, 12, 5, 6],
[7, 1, 7, 3, 4, 1, 2],
[9, 4, 12, 12, 6, 5, 1],
[9, 4, 12, 12, 6, 5, 1],
[9, 4, 2, 3, 4, 5, 1],
[9, 4, 8, 8, 3, 5, 1],
[9, 4, 7, 8, 9, 5, 1],
[9, 4, 1, 2, 2, 5, 1],
[9, 4, 12, 12, 6, 5, 1],
[9, 4, 12, 12, 6, 5, 1],
[9, 4, 1, 4, 12, 5],
[9, 1, 4, 12, 5, 1]
]
Where Counter is now:
Counter({(12, 12, 6, 5): 5, (2, 3, 4, 5): 3, (1, 4, 12, 5): 3, (8, 8, 3, 5): 2, (7, 8, 9, 5): 1, (1, 2, 2, 5): 1})
You can get results such as these:
for i in range(6):
    # testing req_N from 0 to 5
    print(list(gen_common(counter, i)))
# req_N = 0: []
# req_N = 1: [(12, 12, 6, 5)]
# req_N = 2: [(12, 12, 6, 5), (2, 3, 4, 5), (1, 4, 12, 5)]
# req_N = 3: [(12, 12, 6, 5), (2, 3, 4, 5), (1, 4, 12, 5)]
# req_N = 4: [(12, 12, 6, 5), (2, 3, 4, 5), (1, 4, 12, 5), (8, 8, 3, 5)]
# req_N = 5: [(12, 12, 6, 5), (2, 3, 4, 5), (1, 4, 12, 5), (8, 8, 3, 5), (7, 8, 9, 5), (1, 2, 2, 5)]
Since there is not just one N, M and target, I assume the lists arrive in chunks. Here is an approach with O(N + M) time complexity (where N is the number of lists in a chunk and M is the total number of elements):
from collections import Counter

def get_seq(x, M, target):
    index_for_length_m = M - 1
    for lst in (l for l in x if len(l) >= M):
        for i in [j for j, val in enumerate(lst[index_for_length_m:], start=index_for_length_m) if val == target]:
            # convert to str to be hashable
            yield str(lst[i - index_for_length_m : i + 1])

def process_chunk(x, M, N, target):
    return Counter(get_seq(x, M, target)).most_common(N)
With your example:
process_chunk(x, M=4, N=2, target=5)
output:
[('[2, 3, 4, 5]', 2), ('[12, 12, 6, 5]', 2)]
The performance:
%timeit process_chunk(x, M, 2, target)
# 25 µs ± 713 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
