Fastest way to check if a sequence contains a non-consecutive subsequence?

Let's say that two lists of elements are given, A and B. I'm interested in checking if A contains all the elements of B. Specifically, the elements must appear in the same order and they do not need to be consecutive. If this is the case, we say that B is a subsequence of A.
Here are some examples:
A = [4, 2, 8, 2, 7, 0, 1, 5, 3]
B = [2, 2, 1, 3]
is_subsequence(A, B) # True
A = [4, 2, 8, 2, 7, 0, 1, 5, 3]
B = [2, 8, 2]
is_subsequence(A, B) # True
A = [4, 2, 8, 2, 7, 0, 1, 5, 3]
B = [2, 1, 6]
is_subsequence(A, B) # False
A = [4, 2, 8, 2, 7, 0, 1, 5, 3]
B = [2, 7, 2]
is_subsequence(A, B) # False
I found a very elegant way to solve this problem (see this answer):
def is_subsequence(A, B):
    it = iter(A)
    return all(x in it for x in B)
I am now wondering how this solution behaves with possibly very very large inputs. Let's say that my lists contain billions of numbers.
What's the complexity of the code above? What's its worst case? I have tried to test it with very large random inputs, but its speed mostly depends on the automatically generated input.
Most importantly, are there more efficient solutions? Why are these solutions more efficient than this one?

The code you found creates an iterator for A; you can think of this as a simple pointer to the next position in A to look at, and the in operator moves that pointer forward across A until a match is found. The iterator can be tested repeatedly, but it only ever moves forward: when you run several in containment tests against the same iterator, it cannot go backwards, so each test can only compare the values still to be visited against the left-hand operand.
Given your last example, with B = [2, 7, 2], what happens is this:
it = iter(A) creates an iterator object for the A list, and stores 0 as the next position to look at.
The all() function tests each element of an iterable and returns False as soon as it finds a falsy one; otherwise it keeps going until every element has been tested. Here the tests are repeated x in it calls, where x is set to each value in B in turn.
x is first set to 2, and so 2 in it is tested.
it is set to next look at A[0]. That's 4, not equal to 2, so the internal position counter is incremented to 1.
A[1] is 2, and that's equal, so 2 in it returns True at this point, but not before incrementing the 'next position to look at' counter to 2.
2 in it was true, so all() continues on.
The next value in B is 7, so 7 in it is tested.
it is set to next look at A[2]. That's 8, not 7. The position counter is incremented to 3.
it is set to next look at A[3]. That's 2, not 7. The position counter is incremented to 4.
it is set to next look at A[4]. That's 7, equal to 7. The position counter is incremented to 5 and True is returned.
7 in it was true, so all() continues on.
The next value in B is 2, so 2 in it is tested.
it is set to next look at A[5]. That's 0, not 2. The position counter is incremented to 6.
it is set to next look at A[6]. That's 1, not 2. The position counter is incremented to 7.
it is set to next look at A[7]. That's 5, not 2. The position counter is incremented to 8.
it is set to next look at A[8]. That's 3, not 2. The position counter is incremented to 9.
There is no A[9] because there are not that many elements in A, and so False is returned.
2 in it was False, so all() ends by returning False.
You could verify this with an iterator with a side effect you can observe; here I used print() to write out what the next value is for a given input:
>>> A = [4, 2, 8, 2, 7, 0, 1, 5, 3]
>>> B = [2, 7, 2]
>>> with_sideeffect = lambda name, iterable: (
...     print(f"{name}[{idx}] = {value}") or value
...     for idx, value in enumerate(iterable)
... )
>>> is_subsequence(with_sideeffect("> A", A), with_sideeffect("< B", B))
< B[0] = 2
> A[0] = 4
> A[1] = 2
< B[1] = 7
> A[2] = 8
> A[3] = 2
> A[4] = 7
< B[2] = 2
> A[5] = 0
> A[6] = 1
> A[7] = 5
> A[8] = 3
False
Your problem requires that you test every element of B in order; there are no shortcuts here. You also must scan through A to test for the elements of B being present, in the right order. You can only declare victory when all elements of B have been found (a partial scan of A), and defeat when all elements of A have been scanned and the current value of B you are testing for has not been found.
So, assuming B is never longer than A, the best case is where all K elements of B are equal to the first K elements of A. The worst case is any case where not all of the elements of B are present in A in order, which requires a full scan through A. It doesn't matter how many elements B has; if you are testing element K out of K, you have already scanned part-way through A and must complete the scan to find that the last element is missing.
So with N elements in A and K elements in B, the best case takes O(K) time and the worst case takes O(N) time.
There is no asymptotically faster algorithm for a single check on plain lists, so all you can hope for is lowering the constant factor (the time taken to complete each of the N steps). Here that would mean a faster way to scan through A as you search for the elements of B. I am not aware of a better way to do this than the method you already found.
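To make the walkthrough above concrete, here is a minimal sketch of the same scan written as explicit loops. The name is_subsequence_explicit is made up for this illustration; it is meant to behave like the all(x in it for x in B) version, not to be faster than it.

```python
def is_subsequence_explicit(A, B):
    """Return True if B is a (not necessarily contiguous) subsequence of A."""
    it = iter(A)
    for wanted in B:
        # Advance through A until `wanted` is found; the iterator never rewinds.
        for candidate in it:
            if candidate == wanted:
                break
        else:
            # A was exhausted before `wanted` turned up.
            return False
    return True


A = [4, 2, 8, 2, 7, 0, 1, 5, 3]
print(is_subsequence_explicit(A, [2, 2, 1, 3]))  # True
print(is_subsequence_explicit(A, [2, 7, 2]))     # False
```

Each element of A is visited at most once across all the inner loops, which is where the O(N) worst case comes from.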

Related

find elements in list that fulfill condition WHICH needs the previous element

So I want to test if a list contains an element which fulfills a condition for which the previous element is needed. E.g.:
liste = [1,3,5,2,6,4,7,1,3,5,2,3,4,7]
And now I want to test whether two numbers occur consecutively in the list (e.g. find(liste, 3, 4) would give TRUE if 3 comes directly before 4 in the array liste, otherwise FALSE).
What gives me problems is that a number can occur multiple times in the array, and I need to test every occurrence. Any ideas?
FYI: I have implemented it in JavaScript but now want it in Python. In JavaScript I use:
!!liste.find((element, idx) => idx > 0 && liste[idx-1] == 3 && element == 4)
But I have trouble translating that into Python...

You could do the following zip + any:
liste = [1, 3, 5, 2, 6, 4, 7, 1, 3, 5, 2, 3, 4, 7]
def find(lst, first, second):
    return any((first, second) == pair for pair in zip(lst, lst[1:]))
print(find(liste, 3, 4))
Output
True

zip(liste, liste[1:])
will give you a pairwise iterator on every item and its predecessor.
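The slice lst[1:] copies almost the whole list. As a sketch of the same pairwise idea without that copy, itertools.islice can supply the shifted view lazily. The function name find_no_copy is made up for this example.

```python
from itertools import islice


def find_no_copy(lst, first, second):
    # islice(lst, 1, None) walks lst starting at index 1 without building lst[1:].
    return any(a == first and b == second
               for a, b in zip(lst, islice(lst, 1, None)))


liste = [1, 3, 5, 2, 6, 4, 7, 1, 3, 5, 2, 3, 4, 7]
print(find_no_copy(liste, 3, 4))  # True
print(find_no_copy(liste, 4, 3))  # False
```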

Python - easy way to "comparison" map one array to another

I have an array a = [1, 2, 3, 4, 5, 6] and b = [1, 3, 5] and I'd like to map a such that for every element in a that's between an element in b it will get mapped to the index of b that is the upper range that a is contained in. Not the best explanation in words but here's an example
a = 1 -> 0 because a <= first element of b
a = 2 -> 1 because b[0] < 2 <= b[1] and b[1] = 3
a = 3 -> 1
a = 4 -> 2 because b[1] < 4 <= b[2]
So the final product I want is f(a, b) = [0, 1, 1, 2, 2, 2]
I know I can just loop and solve for it but I was wondering if there is a clever, fast (vectorized) way to do this in pandas/numpy

Use python's bisect module:
from bisect import bisect_left
a = [1, 2, 3, 4, 5, 6]
b = [1, 3, 5]
def f(_a, _b):
    return [bisect_left(_b, i) for i in _a]
print(f(a, b))
bisect — Array bisection algorithm
This module provides support for maintaining a list in sorted order without having to sort the list after each insertion. For long lists of items with expensive comparison operations, this can be an improvement over the more common approach. The module is called bisect because it uses a basic bisection algorithm to do its work. The source code may be most useful as a working example of the algorithm (the boundary conditions are already right!).
The following functions are provided:
bisect.bisect_left(a, x, lo=0, hi=len(a))
Locate the insertion point for x in a to maintain sorted order. The parameters lo and hi may be used to specify a subset of the list which should be considered; by default the entire list is used. If x is already present in a, the insertion point will be before (to the left of) any existing entries.
The return value is suitable for use as the first parameter to list.insert() assuming that a is already sorted.
The returned insertion point i partitions the array a into two halves so that all(val < x for val in a[lo:i]) for the left side and all(val >= x for val in a[i:hi]) for the right side.
Reference:
https://docs.python.org/3/library/bisect.html

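bisect is faster; this solution assumes b is sorted, and clamps the result so that values above the last element of b map to the last index:
from bisect import bisect_left
a = [1, 2, 3, 4, 5, 6]
b = [1, 3, 5]
inds = [min(bisect_left(b, x), len(b) - 1) for x in a]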
returns
[0, 1, 1, 2, 2, 2]
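The two bisect answers differ for values larger than every element of b: plain bisect_left returns len(b) for them, while the clamped version maps them to the last index, which is what the expected output in the question shows. A small side-by-side sketch, using the a and b from the question:

```python
from bisect import bisect_left

a = [1, 2, 3, 4, 5, 6]
b = [1, 3, 5]

# Raw insertion points: 6 is larger than every element of b, so it maps to len(b).
raw = [bisect_left(b, x) for x in a]
# Clamp to the last valid index of b to reproduce the output asked for.
clamped = [min(i, len(b) - 1) for i in raw]

print(raw)      # [0, 1, 1, 2, 2, 3]
print(clamped)  # [0, 1, 1, 2, 2, 2]
```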

Cannot iterate through two lists at once python

I am having a problem with my python code, but I am not sure what it is. I am creating a program that creates a table with all possible combinations of four digits provided the digits do not repeat, which I know is successful. Then, I create another table and attempt to add to this secondary table all of the values which use the same numbers in a different order (so I do not have, say, 1234, 4321, 3241, 3214, 1324, 2413, etc. on this table.) However, this does not seem to be working, as the second table has only one value. What have I done wrong? My code is below. Oh, and I know that the one value comes from appending the 1 at the top.
combolisttwo = list()
combolisttwo.append(1)
combolist = {(a, b, c, d) for a in {1, 2, 3, 4, 5, 6, 7, 8, 9, 0} for b in {1, 2, 3, 4, 5, 6, 7, 8, 9, 0} for c in {1, 2, 3, 4, 5, 6, 7, 8, 9, 0} for d in {1, 2, 3, 4, 5, 6, 7, 8, 9, 0} if a != b and a != c and a != d and b != c and b != d and c!=d}
for i in combolist:
    x = 0
    letternums = str(i)
    letters = list(letternums)
    for g in letters:
        n = 0
        hits = 0
        nonhits = 0
        letterstwo = str(combolisttwo[n])
        if g == letterstwo[n]:
            hits = hits + 1
        if g != letterstwo[n]:
            nonhits = nonhits + 1
        if hits == 4:
            break
        if hits + nonhits == 4:
            combolisttwo.append(i)
            break
x = len(combolisttwo)
print (x)

All possible combinations of four digits provided the digits do not repeat:
import itertools as IT
combolist = list(IT.combinations(range(10), 4))
Then, I create another table and attempt to add to this secondary table all of the values which use the same numbers in a different order (so I do not have, say, 1234, 4321, 3241, 3214, 1324, 2413, etc. on this table.):
combolist2 = [item for combo in combolist
              for item in IT.permutations(combo, len(combo))]
Useful references:
combinations -- for enumerating collections of elements without replacement
permutations -- for enumerating collections of elements in all possible orders
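As a quick sanity check on the two itertools functions, the sketch below counts both collections for the digits 0-9 and confirms that every ordering collapses to one of the combinations. The variable names combos and perms are chosen for this example only, and math.comb and math.perm need Python 3.8 or later.

```python
import itertools as IT
from math import comb, perm

# Order-insensitive selections of 4 distinct digits.
combos = list(IT.combinations(range(10), 4))
# Order-sensitive arrangements of 4 distinct digits.
perms = list(IT.permutations(range(10), 4))

print(len(combos), comb(10, 4))  # 210 210
print(len(perms), perm(10, 4))   # 5040 5040

# Sorting the digits of any arrangement yields one of the combinations.
print({tuple(sorted(p)) for p in perms} == set(combos))  # True
```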

This code is pretty confused ;-) For example, you have n = 0 in your inner loop, but never set n to anything else. For another, you have x = 0, but never use x. Etc.
Using itertools is really best, but if you're trying to learn how to do these things yourself, that's fine. For a start, change your:
letters = list(letternums)
to
letters = list(letternums)
print(letters)
break
I bet you'll be surprised at what you see! The elements of your combolist are tuples, so when you do letternums = str(i) you get a string with a mix of digits, spaces, parentheses and commas. I don't think you're expecting anything but digits.
Your letterstwo is the string "1" (always, because you never change n). But it doesn't much matter, because you set hits and nonhits to 0 every time your for g in letters loop iterates. So hits and nonhits can never be bigger than 1.
Which answers your literal question ;-) combolisttwo.append(i) is never executed because hits + nonhits == 4 is never true. That's why combolisttwo remains at its initial value ([1]).
Put some calls to print() in your code? That will help you see what's going wrong.
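If the goal is to do the filtering by hand rather than with itertools.combinations, one possible approach (a sketch, not a repair of the loop in the question) is to compare the digits as sorted tuples and remember which ones have been seen. Here permutations(range(10), 4) stands in for the question's combolist, and the helper name seen is made up.

```python
from itertools import permutations

# Stand-in for the question's combolist: all 4-digit tuples without repeats.
candidates = permutations(range(10), 4)

seen = set()          # sorted digit tuples already represented
combolisttwo = []     # one representative per set of digits

for digits in candidates:
    key = tuple(sorted(digits))   # the same digits in any order give the same key
    if key not in seen:
        seen.add(key)
        combolisttwo.append(digits)

print(len(combolisttwo))  # 210
```

Working on the tuples directly avoids the str(i) conversion that introduced the parentheses and commas.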

How can I verify if one list is a subset of another?

I need to verify if a list is a subset of another - a boolean return is all I seek.
Is testing equality on the smaller list after an intersection the fastest way to do this? Performance is of utmost importance given the number of datasets that need to be compared.
Adding further facts based on discussions:
Will either of the lists be the same for many tests? It does as one of them is a static lookup table.
Does it need to be a list? It does not - the static lookup table can be anything that performs best. The dynamic one is a dict from which we extract the keys to perform a static lookup on.
What would be the optimal solution given the scenario?

>>> a = [1, 3, 5]
>>> b = [1, 3, 5, 8]
>>> c = [3, 5, 9]
>>> set(a) <= set(b)
True
>>> set(c) <= set(b)
False
>>> a = ['yes', 'no', 'hmm']
>>> b = ['yes', 'no', 'hmm', 'well']
>>> c = ['sorry', 'no', 'hmm']
>>>
>>> set(a) <= set(b)
True
>>> set(c) <= set(b)
False

Use set.issubset
Example:
a = {1,2}
b = {1,2,3}
a.issubset(b) # True
a = {1,2,4}
b = {1,2,3}
a.issubset(b) # False
The performant function Python provides for this is set.issubset. It does have a few restrictions that make it unclear if it's the answer to your question, however.
A list may contain items multiple times and has a specific order. A set does not. Additionally, sets only work on hashable objects.
Are you asking about subset or subsequence (which means you'll want a string search algorithm)? Will either of the lists be the same for many tests? What are the datatypes contained in the list? And for that matter, does it need to be a list?
Your other post intersect a dict and list made the types clearer and did get a recommendation to use dictionary key views for their set-like functionality. In that case it was known to work because dictionary keys behave like a set (so much so that before we had sets in Python we used dictionaries). One wonders how the issue got less specific in three hours.
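For the scenario described in the question (a static lookup table and a dynamic dict), the key-view idea might look like the sketch below. The contents of lookup, record and other are invented for illustration; the point is only that dict.keys() supports set comparisons such as <= directly, without building a new set from the keys.

```python
# Hypothetical static lookup table, built once and reused for every test.
lookup = frozenset({"id", "name", "email", "age"})

# Hypothetical dynamic dicts whose keys need to be checked.
record = {"id": 7, "name": "Ada"}
other = {"id": 8, "nickname": "Al"}

# Key views behave like sets, so <= is a subset test.
print(record.keys() <= lookup)  # True
print(other.keys() <= lookup)   # False
```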

one = [1, 2, 3]
two = [9, 8, 5, 3, 2, 1]
all(x in two for x in one)
Explanation: Generator creating booleans by looping through list one checking if that item is in list two. all() returns True if every item is truthy, else False.
Another advantage is that all() returns False at the first missing element rather than processing every item. Each x in two test scans the list two, though, so converting two to a set first helps when the lists are long.

Assuming the items are hashable
>>> from collections import Counter
>>> not Counter([1, 2]) - Counter([1])
False
>>> not Counter([1, 2]) - Counter([1, 2])
True
>>> not Counter([1, 2, 2]) - Counter([1, 2])
False
If you don't care about duplicate items eg. [1, 2, 2] and [1, 2] then just use:
>>> set([1, 2, 2]).issubset([1, 2])
True
Is testing equality on the smaller list after an intersection the fastest way to do this?
.issubset will be the fastest way to do it. Checking the length before testing issubset will not improve speed because you still have O(N + M) items to iterate through and check.

One more solution would be to use a intersection.
one = [1, 2, 3]
two = [9, 8, 5, 3, 2, 1]
set(one).intersection(set(two)) == set(one)
If one is a subset of two, the intersection of the two sets equals set(one).
(OR)
one = [1, 2, 3]
two = [9, 8, 5, 3, 2, 1]
set(one) & (set(two)) == set(one)

Set theory is inappropriate for lists since duplicates will result in wrong answers using set theory.
For example:
a = [1, 3, 3, 3, 5]
b = [1, 3, 3, 4, 5]
set(b) > set(a)
is misleading. It evaluates to True, but only because the set comparison looks at 1, 3, 5 versus 1, 3, 4, 5; a contains three 3s and b only two, so a is not contained in b. You must take duplicates into account.
Instead, count the occurrences of each item and check that the count in the containing list is greater than or equal to the count in the other. This is not expensive: building the counts is linear in the lengths of the lists, with no O(N^2) comparisons and no sorting.
#!/usr/bin/env python3
from collections import Counter
def containedInFirst(a, b):
    # Count how often each item occurs in each list.
    a_count = Counter(a)
    b_count = Counter(b)
    for key in b_count:
        # An item of b that a lacks entirely means b is not contained in a.
        if key not in a_count:
            return False
        # So does an item that b holds more often than a.
        if b_count[key] > a_count[key]:
            return False
    return True
a = [1, 3, 3, 3, 5]
b = [1, 3, 3, 4, 5]
print("b in a:", containedInFirst(a, b))
a = [1, 3, 3, 3, 4, 4, 5]
b = [1, 3, 3, 4, 5]
print("b in a:", containedInFirst(a, b))
Then running this you get:
$ python contained.py
b in a: False
b in a: True

one = [1, 2, 3]
two = [9, 8, 5, 3, 2, 1]
set(x in two for x in one) == set([True])
If every element of one is in two, the generator (x in two for x in one) yields only True values, so set(x in two for x in one) has exactly one element, True. Note that an empty one yields an empty set, so this test reports False for it.

Pardon me if I am late to the party. ;)
To check if one set A is subset of set B, Python has A.issubset(B) and A <= B. It works on set only and works great BUT the complexity of internal implementation is unknown. Reference: https://docs.python.org/2/library/sets.html#set-objects
I came up with an algorithm to check if list A is a subset of list B with following remarks.
To reduce the complexity of finding a subset, I find it appropriate to sort both lists first before comparing elements.
It lets me break out of the loop when the element of the second list, B[j], is greater than the element of the first list, A[i].
last_index_j is used to resume the loop over list B where it last left off. This avoids restarting the comparisons from index 0 of list B in subsequent iterations, which would be unnecessary.
Complexity will be O(n ln n) each for sorting both lists and O(n) for checking for subset.
O(n ln n) + O(n ln n) + O(n) = O(n ln n).
The code has lots of print statements to show what's going on at each iteration of the loop. These are meant for understanding only.
Check if one list is subset of another list
is_subset = True
A = [9, 3, 11, 1, 7, 2]
B = [11, 4, 6, 2, 15, 1, 9, 8, 5, 3]
print(A, B)
# skip checking if list A has elements more than list B
if len(A) > len(B):
    is_subset = False
else:
    # complexity of sorting using quicksort or merge sort: O(n ln n)
    # use best sorting algorithm available to minimize complexity
    A.sort()
    B.sort()
    print(A, B)
    # complexity: O(n^2)
    # for a in A:
    #     if a not in B:
    #         is_subset = False
    #         break
    # complexity: O(n)
    is_found = False
    last_index_j = 0
    for i in range(len(A)):
        for j in range(last_index_j, len(B)):
            is_found = False
            print("i=" + str(i) + ", j=" + str(j) + ", " + str(A[i]) + "==" + str(B[j]) + "?")
            if B[j] <= A[i]:
                if A[i] == B[j]:
                    is_found = True
                last_index_j = j
            else:
                is_found = False
                break
            if is_found:
                print("Found: " + str(A[i]))
                last_index_j = last_index_j + 1
                break
            else:
                print("Not found: " + str(A[i]))
        if is_found == False:
            is_subset = False
            break
print("subset" if is_subset else "not subset")
Output
[9, 3, 11, 1, 7, 2] [11, 4, 6, 2, 15, 1, 9, 8, 5, 3]
[1, 2, 3, 7, 9, 11] [1, 2, 3, 4, 5, 6, 8, 9, 11, 15]
i=0, j=0, 1==1?
Found: 1
i=1, j=1, 2==2?
Found: 2
i=2, j=2, 3==3?
Found: 3
i=3, j=3, 7==4?
Not found: 7
i=3, j=4, 7==5?
Not found: 7
i=3, j=5, 7==6?
Not found: 7
i=3, j=6, 7==8?
not subset

The code below checks whether a given set is a "proper subset" of another set:
def is_proper_subset(subset, superset):
    return all(x in superset for x in subset) and len(subset) < len(superset)

In Python 3.5 and later you can use [*some_set][index] to get an element out of a set. It is a much slower solution than the other methods.
one = [1, 2, 3]
two = [9, 8, 5, 3, 2, 1]
result = set(x in two for x in one)
[*result][0] == True
or just with len and set, where a and b are lists and the test is whether every element of b is in a:
len(set(a+b)) == len(set(a))

Here is how I check whether one list is a sublist of another; in my case the order matters and the elements must be adjacent.
def is_subset(list_long, list_short):
    short_length = len(list_short)
    subset_list = []
    # Collect every window of list_long that is as long as list_short.
    for i in range(len(list_long) - short_length + 1):
        subset_list.append(list_long[i:i + short_length])
    return list_short in subset_list
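The function above builds every window before testing any of them. A variant that compares windows one at a time and stops at the first match could look like this sketch; the name contains_run is invented here, and it is meant to return the same results as is_subset.

```python
def contains_run(list_long, list_short):
    """Return True if list_short occurs in list_long as a contiguous run."""
    n = len(list_short)
    # any() stops at the first matching window instead of collecting them all.
    return any(list_long[i:i + n] == list_short
               for i in range(len(list_long) - n + 1))


print(contains_run([1, 2, 3, 4], [2, 3]))  # True
print(contains_run([1, 2, 3, 4], [2, 4]))  # False
```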

Most of the solutions consider that the lists do not have duplicates. In case your lists do have duplicates you can try this:
def isSubList(subList, mlist):
    uniqueElements = set(subList)
    for e in uniqueElements:
        if subList.count(e) > mlist.count(e):
            return False
    # It is sublist
    return True
It ensures the sublist never has different elements than list or a greater amount of a common element.
lst=[1,2,2,3,4]
sl1=[2,2,3]
sl2=[2,2,2]
sl3=[2,5]
print(isSubList(sl1,lst)) # True
print(isSubList(sl2,lst)) # False
print(isSubList(sl3,lst)) # False

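Since no one has considered comparing two strings, here's my proposal.
You may of course want to check that the pipe ("|") is not part of either list and maybe choose another character automatically, but you get the idea.
Using an empty string as separator is not a solution since the numbers can have several digits ([12,3] != [1,23]). For the same reason each item is wrapped in separators on both sides, so that 2 does not match inside 12.
def issublist(l1, l2):
    # Build "|a|b|c|" style strings so items only match on whole boundaries.
    s1 = '|' + ''.join(f'{i}|' for i in l1)
    s2 = '|' + ''.join(f'{i}|' for i in l2)
    return s1 in s2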

If you are asking whether listA is itself an element of listB (for example [1, 2] in [[1, 2], 3]), then:
listA in listB
Note that this does not test whether the elements of listA appear in listB. If you are asking whether each element of listA occurs at least as many times in listB, try:
all(listA.count(item) <= listB.count(item) for item in listA)

Extract elements of list at odd positions

So I want to create a list which is a sublist of some existing list.
For example,
L = [1, 2, 3, 4, 5, 6, 7], I want to create a sublist li such that li contains all the elements in L at odd positions.
While I can do it by
L = [1, 2, 3, 4, 5, 6, 7]
li = []
count = 0
for i in L:
    if count % 2 == 1:
        li.append(i)
    count += 1
But I want to know if there is another way to do the same efficiently and in fewer number of steps.

Solution
Yes, you can:
l = L[1::2]
And this is all. The result will contain the elements placed on the following positions (0-based, so first element is at position 0, second at 1 etc.):
1, 3, 5
so the result (actual numbers) will be:
2, 4, 6
Explanation
The [1::2] at the end is just a notation for list slicing. Usually it is in the following form:
some_list[start:stop:step]
If we omitted start, the default (0) would be used, so the first element (at position 0, because the indexes are 0-based) would be selected. In this case start is 1, so the second element is selected first.
Because stop is omitted, the default is used (the end of the list). So the list is iterated from the second element to the end.
We also provided third argument (step) which is 2. Which means that one element will be selected, the next will be skipped, and so on...
So, to sum up, in this case [1::2] means:
1. take the second element (which, by the way, has an odd index),
2. skip one element (because we have step=2, so we are skipping one, as opposed to step=1, which is the default),
3. take the next element,
4. repeat steps 2-3 until the end of the list is reached.
EDIT: @PreetKukreti gave a link for another explanation on Python's list slicing notation. See here: Explain Python's slice notation
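A few concrete slices may help tie the start, stop and step description together. This is only an illustration using the list from the question; the slice(...) object in the last line is the built-in equivalent of the bracket notation.

```python
L = [1, 2, 3, 4, 5, 6, 7]

print(L[1::2])    # [2, 4, 6]     start at index 1, run to the end, step 2
print(L[::2])     # [1, 3, 5, 7]  default start of 0, so even indexes
print(L[1:5:2])   # [2, 4]        stop before index 5

# The bracket notation is shorthand for a slice object.
print(L[slice(1, None, 2)] == L[1::2])  # True
```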
Extras - replacing counter with enumerate()
In your code, you explicitly create and increase the counter. In Python this is not necessary, as you can enumerate through some iterable using enumerate():
for count, i in enumerate(L):
    if count % 2 == 1:
        l.append(i)
The above serves exactly the same purpose as the code you were using:
count = 0
for i in L:
    if count % 2 == 1:
        l.append(i)
    count += 1
More on emulating for loops with counter in Python: Accessing the index in Python 'for' loops

For the odd positions, you probably want:
>>> list_ = list(range(10))
>>> print(list_[1::2])
[1, 3, 5, 7, 9]

I like List comprehensions because of their Math (Set) syntax. So how about this:
L = [1, 2, 3, 4, 5, 6, 7]
odd_numbers = [y for x,y in enumerate(L) if x%2 != 0]
even_numbers = [y for x,y in enumerate(L) if x%2 == 0]
Basically, if you enumerate over a list, you'll get the index x and the value y. What I'm doing here is putting the value y into the output list (even or odd) and using the index x to find out if that point is odd (x%2 != 0).

You can also use itertools.islice if you don't need to create a list but just want to iterate over the odd/even elements
import itertools
L = [1, 2, 3, 4, 5, 6, 7]
li = itertools.islice(L, 1, len(L), 2)

You can make use of bitwise AND operator &:
>>> x = [1, 2, 3, 4, 5, 6, 7]
>>> y = [i for i in x if i&1]
>>> y
[1, 3, 5, 7]
This will give you the odd elements in the list. Now to extract the elements at odd indices you just need to change the above a bit:
>>> x = [10, 20, 30, 40, 50, 60, 70]
>>> y = [j for i, j in enumerate(x) if i&1]
>>> y
[20, 40, 60]
Explanation
The bitwise AND operator is used with 1, and it works because an odd number written in binary must have 1 as its last (least significant) digit. Let's check:
23 = 1 * (2**4) + 0 * (2**3) + 1 * (2**2) + 1 * (2**1) + 1 * (2**0) = 10111
14 = 1 * (2**3) + 1 * (2**2) + 1 * (2**1) + 0 * (2**0) = 1110
AND operation with 1 will only return 1 (1 in binary will also have last digit 1), iff the value is odd.
Check the Python Bitwise Operator page for more.
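The binary expansions above can be checked directly in Python with the built-in bin(). The short sketch below prints the last-bit test for the two example numbers and compares n & 1 with n % 2 over a small range, negative numbers included.

```python
for n in (23, 14):
    # bin() shows the binary digits; n & 1 keeps only the last one.
    print(n, bin(n), n & 1)
# 23 0b10111 1
# 14 0b1110 0

# For Python integers, n & 1 and n % 2 agree, also for negative values.
print(all((n & 1) == (n % 2) for n in range(-10, 11)))  # True
```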
P.S: You can tactically use this method if you want to select odd and even columns in a dataframe. Let's say x and y coordinates of facial key-points are given as columns x1, y1, x2, etc... To normalize the x and y coordinates with width and height values of each image you can simply perform:
for i in range(df.shape[1]):
    if i&1:
        df.iloc[:, i] /= heights
    else:
        df.iloc[:, i] /= widths
This is not exactly related to the question but for data scientists and computer vision engineers this method could be useful.
