Python - easy way to "comparison" map one array to another

I have an array a = [1, 2, 3, 4, 5, 6] and b = [1, 3, 5], and I'd like to map a such that each element of a is mapped to the index of the element of b that forms the upper end of the range containing it. That's not the best explanation in words, so here's an example:
a = 1 -> 0 because a <= first element of b
a = 2 -> 1 because b[0] < 2 <= b[1] and b[1] = 3
a = 3 -> 1
a = 4 -> 2 because b[1] < 4 <= b[2]
So the final product I want is f(a, b) = [0, 1, 1, 2, 2, 2]
I know I can just loop and solve for it but I was wondering if there is a clever, fast (vectorized) way to do this in pandas/numpy

Use python's bisect module:
from bisect import bisect_left
a = [1, 2, 3, 4, 5, 6]
b = [1, 3, 5]
def f(_a, _b):
    return [bisect_left(_b, i) for i in _a]
print(f(a, b))
bisect — Array bisection algorithm
This module provides support for maintaining a list in sorted order without having to sort the list after each insertion. For long lists of items with expensive comparison operations, this can be an improvement over the more common approach. The module is called bisect because it uses a basic bisection algorithm to do its work. The source code may be most useful as a working example of the algorithm (the boundary conditions are already right!).
The following functions are provided:
bisect.bisect_left(a, x, lo=0, hi=len(a))
Locate the insertion point for x in a to maintain sorted order. The parameters lo and hi may be used to specify a subset of the list which should be considered; by default the entire list is used. If x is already present in a, the insertion point will be before (to the left of) any existing entries.
The return value is suitable for use as the first parameter to list.insert() assuming that a is already sorted.
The returned insertion point i partitions the array a into two halves so that all(val < x for val in a[lo:i]) for the left side and all(val >= x for val in a[i:hi]) for the right side.
Reference:
https://docs.python.org/3/library/bisect.html

bisect is faster; the solution assumes the lists are sorted:
a = [1, 2, 3, 4, 5, 6]
b = [1, 3, 5]
from bisect import bisect_left
inds = [min(bisect_left(b, x), len(b) - 1) for x in a]
returns
[0, 1, 1, 2, 2, 2]
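Since the question explicitly asks for a vectorized NumPy approach: np.searchsorted is the array analogue of bisect_left, and clipping the result reproduces the min(..., len(b) - 1) trick above. A minimal sketch:
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])
b = np.array([1, 3, 5])
# searchsorted is a vectorized bisect_left; clip so values past b[-1] map to the last index
inds = np.minimum(np.searchsorted(b, a, side='left'), len(b) - 1)
print(inds)  # [0 1 1 2 2 2]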

Related

Need a "sortorder" funtion [duplicate]

I have a numerical list:
myList = [1, 2, 3, 100, 5]
Now I sort this list to obtain [1, 2, 3, 5, 100]. What I want is the indices of the elements from the original list in the sorted order, i.e. [0, 1, 2, 4, 3] --- ala MATLAB's sort function that returns both values and indices.
If you are using numpy, you have the argsort() function available:
>>> import numpy
>>> numpy.argsort(myList)
array([0, 1, 2, 4, 3])
http://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html
This returns the arguments that would sort the array or list.
Something like the following:
>>> myList = [1, 2, 3, 100, 5]
>>> [i[0] for i in sorted(enumerate(myList), key=lambda x:x[1])]
[0, 1, 2, 4, 3]
enumerate(myList) gives you (index, value) tuples:
[(0, 1), (1, 2), (2, 3), (3, 100), (4, 5)]
You sort the list by passing it to sorted and specifying a function to extract the sort key (the second element of each tuple; that's what the lambda is for). Finally, the original index of each sorted element is extracted using the [i[0] for i in ...] list comprehension.
myList = [1, 2, 3, 100, 5]
sorted(range(len(myList)),key=myList.__getitem__)
[0, 1, 2, 4, 3]
I did a quick performance check on these with perfplot (a project of mine) and found that it's hard to recommend anything else but
np.argsort(x)
(the timing plot, on a log scale, is omitted here)
Code to reproduce the plot:
import perfplot
import numpy as np

def sorted_enumerate(seq):
    return [i for (v, i) in sorted((v, i) for (i, v) in enumerate(seq))]

def sorted_enumerate_key(seq):
    return [x for x, y in sorted(enumerate(seq), key=lambda x: x[1])]

def sorted_range(seq):
    return sorted(range(len(seq)), key=seq.__getitem__)

b = perfplot.bench(
    setup=np.random.rand,
    kernels=[sorted_enumerate, sorted_enumerate_key, sorted_range, np.argsort],
    n_range=[2 ** k for k in range(15)],
    xlabel="len(x)",
)
b.save("out.png")
The answers with enumerate are nice, but I personally don't like the lambda used to sort by the value. The following just reverses the index and the value, and sorts that. So it'll first sort by value, then by index.
sorted((e,i) for i,e in enumerate(myList))
Updated answer with enumerate and itemgetter:
sorted(enumerate(a), key=lambda x: x[1])
# [(0, 1), (1, 2), (2, 3), (4, 5), (3, 100)]
Zip the lists together: the first element in the tuple will be the index, the second is the value (then sort it using the second value of the tuple, x[1]; x is the tuple).
Or using itemgetter from the operator module:
from operator import itemgetter
sorted(enumerate(a), key=itemgetter(1))
Essentially you need to do an argsort, what implementation you need depends if you want to use external libraries (e.g. NumPy) or if you want to stay pure-Python without dependencies.
The question you need to ask yourself is: Do you want the
indices that would sort the array/list
indices that the elements would have in the sorted array/list
Unfortunately the example in the question doesn't make it clear what is desired because both will give the same result:
>>> arr = np.array([1, 2, 3, 100, 5])
>>> np.argsort(np.argsort(arr))
array([0, 1, 2, 4, 3], dtype=int64)
>>> np.argsort(arr)
array([0, 1, 2, 4, 3], dtype=int64)
Choosing the argsort implementation
If you have NumPy at your disposal you can simply use the function numpy.argsort or method numpy.ndarray.argsort.
An implementation without NumPy was mentioned in some other answers already, so I'll just recap the fastest solution according to the benchmark answer here
def argsort(l):
    return sorted(range(len(l)), key=l.__getitem__)
Getting the indices that would sort the array/list
To get the indices that would sort the array/list you can simply call argsort on the array or list. I'm using the NumPy versions here but the Python implementation should give the same results
>>> arr = np.array([3, 1, 2, 4])
>>> np.argsort(arr)
array([1, 2, 0, 3], dtype=int64)
The result contains the indices that are needed to get the sorted array.
Since the sorted array would be [1, 2, 3, 4] the argsorted array contains the indices of these elements in the original.
The smallest value is 1 and it is at index 1 in the original so the first element of the result is 1.
The 2 is at index 2 in the original so the second element of the result is 2.
The 3 is at index 0 in the original so the third element of the result is 0.
The largest value is 4 and it is at index 3 in the original, so the last element of the result is 3.
Getting the indices that the elements would have in the sorted array/list
In this case you would need to apply argsort twice:
>>> arr = np.array([3, 1, 2, 4])
>>> np.argsort(np.argsort(arr))
array([2, 0, 1, 3], dtype=int64)
In this case:
the first element of the original is 3, which is the third-smallest value, so it would have index 2 in the sorted array/list, so the first element is 2.
the second element of the original is 1, which is the smallest value so it would have index 0 in the sorted array/list so the second element is 0.
the third element of the original is 2, which is the second-smallest value so it would have index 1 in the sorted array/list so the third element is 1.
the fourth element of the original is 4 which is the largest value so it would have index 3 in the sorted array/list so the last element is 3.
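As an aside (not part of the answer above): if the second argsort matters for performance, the same ranks can be computed by inverting the first permutation with an O(n) assignment instead of a second O(n log n) sort. A minimal sketch:
import numpy as np

arr = np.array([3, 1, 2, 4])
perm = np.argsort(arr)
ranks = np.empty_like(perm)
ranks[perm] = np.arange(len(perm))  # invert the permutation in O(n)
print(ranks)  # [2 0 1 3], same as np.argsort(np.argsort(arr))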
If you do not want to use numpy,
sorted(range(len(seq)), key=seq.__getitem__)
is fastest, as demonstrated here.
The other answers are WRONG.
Running argsort once is not the solution.
For example, the following code:
import numpy as np
x = [3,1,2]
np.argsort(x)
yields array([1, 2, 0], dtype=int64) which is not what we want.
The answer should be to run argsort twice:
import numpy as np
x = [3,1,2]
np.argsort(np.argsort(x))
gives array([2, 0, 1], dtype=int64) as expected.
The easiest way is to use the NumPy package for that purpose:
import numpy
s = numpy.array([2, 3, 1, 4, 5])
sort_index = numpy.argsort(s)
print(sort_index)
But if you want your code to use only basic Python:
s = [2, 3, 1, 4, 5]
li = []
for i in range(len(s)):
    li.append([s[i], i])
li.sort()
sort_index = []
for x in li:
    sort_index.append(x[1])
print(sort_index)
We will create another array of indexes from 0 to n-1
Then zip this to the original array and then sort it on the basis of the original values
ar = [1,2,3,4,5]
new_ar = list(zip(ar,[i for i in range(len(ar))]))
new_ar.sort()
s = [2, 3, 1, 4, 5]
print([sorted(s, reverse=False).index(val) for val in s])
For a list with duplicate elements, it will return the rank without ties, e.g.
s = [2, 2, 1, 4, 5]
print([sorted(s, reverse=False).index(val) for val in s])
returns
[1, 1, 0, 3, 4]
For the index:
import numpy as np

S = [11, 2, 44, 55, 66, 0, 10, 3, 33]
r = np.argsort(S)
# output: array([5, 1, 7, 6, 0, 8, 2, 3, 4])
argsort returns the indices of S in sorted order.
For the value:
np.sort(S)
# output: array([ 0,  2,  3, 10, 11, 33, 44, 55, 66])
Try this, it worked for me, cheers!
First, convert your list by adding an index to each item:
myList = [1, 2, 3, 100, 5]
becomes
myList = [[0, 1], [1, 2], [2, 3], [3, 100], [4, 5]]
Next:
sorted(myList, key=lambda k: k[1])
Result:
[[0, 1], [1, 2], [2, 3], [4, 5], [3, 100]]
A variant on RustyRob's answer (which is already the most performant pure Python solution) that may be superior when the collection you're sorting either:
Isn't a sequence (e.g. it's a set, and there's a legitimate reason to want the indices corresponding to how far an iterator must be advanced to reach the item), or
Is a sequence without O(1) indexing (among Python's included batteries, collections.deque is a notable example of this)
Case #1 is unlikely to be useful, but case #2 is more likely to be meaningful. In either case, you have two choices:
Convert to a list/tuple and use the converted version, or
Use a trick to assign keys based on iteration order
This answer provides the solution to #2. Note that it's not guaranteed to work by the language standard; the language says each key will be computed once, but not the order they will be computed in. On every version of CPython, the reference interpreter, to date, it's precomputed in order from beginning to end, so this works, but be aware it's not guaranteed. In any event, the code is:
sizediterable = ...
sorted_indices = sorted(range(len(sizediterable)), key=lambda _, it=iter(sizediterable): next(it))
All that does is provide a key function that ignores the value it's given (an index) and instead provides the next item from an iterator preconstructed from the original container (cached as a defaulted argument to allow it to function as a one-liner). As a result, for something like a large collections.deque, where using its .__getitem__ involves O(n) work (and therefore computing all the keys would involve O(n²) work), sequential iteration remains O(1), so generating the keys remains just O(n).
If you need something guaranteed to work by the language standard, using built-in types, Roman's solution will have the same algorithmic efficiency as this solution (as neither of them rely on the algorithmic efficiency of indexing the original container).
To be clear, for the suggested use case with collections.deque, the deque would have to be quite large for this to matter; deques have a fairly large constant divisor for indexing, so only truly huge ones would have an issue. Of course, by the same token, the cost of sorting is pretty minimal if the inputs are small/cheap to compare, so if your inputs are large enough that efficient sorting matters, they're large enough for efficient indexing to matter too.
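To make the trick concrete, here is a small usage sketch with collections.deque, relying on CPython's in-order key evaluation described above:
from collections import deque

d = deque([30, 10, 20])
# each key call ignores the index it is given and pulls the next item from the iterator
indices = sorted(range(len(d)), key=lambda _, it=iter(d): next(it))
print(indices)  # [1, 2, 0]: d[1] = 10 is smallest, then d[2] = 20, then d[0] = 30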

Fastest way to check if a sequence contains a non-consecutive subsequence?

Let's say that two lists of elements are given, A and B. I'm interested in checking if A contains all the elements of B. Specifically, the elements must appear in the same order and they do not need to be consecutive. If this is the case, we say that B is a subsequence of A.
Here are some examples:
A = [4, 2, 8, 2, 7, 0, 1, 5, 3]
B = [2, 2, 1, 3]
is_subsequence(A, B) # True
A = [4, 2, 8, 2, 7, 0, 1, 5, 3]
B = [2, 8, 2]
is_subsequence(A, B) # True
A = [4, 2, 8, 2, 7, 0, 1, 5, 3]
B = [2, 1, 6]
is_subsequence(A, B) # False
A = [4, 2, 8, 2, 7, 0, 1, 5, 3]
B = [2, 7, 2]
is_subsequence(A, B) # False
I found a very elegant way to solve this problem (see this answer):
def is_subsequence(A, B):
    it = iter(A)
    return all(x in it for x in B)
I am now wondering how this solution behaves with possibly very very large inputs. Let's say that my lists contain billions of numbers.
What's the complexity of the code above? What's its worst case? I have tried to test it with very large random inputs, but its speed mostly depends on the automatically generated input.
Most importantly, are there more efficient solutions? Why are these solutions more efficient than this one?
The code you found creates an iterator for A; you can see this as a simple pointer to the next position in A to look at, and each in test moves the pointer forward across A until a match is found. The iterator can be used multiple times, but it only ever moves forward; when you run in containment tests against a single iterator multiple times, the iterator can't go backwards, so it can only test whether the values still to be visited are equal to the left-hand operand.
Given your last example, with B = [2, 7, 2], what happens is this:
it = iter(A) creates an iterator object for the A list, and stores 0 as the next position to look at.
The all() function tests each element of an iterable and returns False early if such a result is found; otherwise it keeps testing every element. Here the repeated tests are x in it calls, where x is set to each value in B in turn.
x is first set to 2, and so 2 in it is tested.
it is set to next look at A[0]. That's 4, not equal to 2, so the internal position counter is incremented to 1.
A[1] is 2, and that's equal, so 2 in it returns True at this point, but not before incrementing the 'next position to look at' counter to 2.
2 in it was true, so all() continues on.
The next value in B is 7, so 7 in it is tested.
it is set to next look at A[2]. That's 8, not 7. The position counter is incremented to 3.
it is set to next look at A[3]. That's 2, not 7. The position counter is incremented to 4.
it is set to next look at A[4]. That's 7, equal to 7. The position counter is incremented to 5 and True is returned.
7 in it was true, so all() continues on.
The next value in B is 2, so 2 in it is tested.
it is set to next look at A[5]. That's 0, not 2. The position counter is incremented to 6.
it is set to next look at A[6]. That's 1, not 2. The position counter is incremented to 7.
it is set to next look at A[7]. That's 5, not 2. The position counter is incremented to 8.
it is set to next look at A[8]. That's 3, not 2. The position counter is incremented to 9.
There is no A[9] because there are not that many elements in A, and so False is returned.
2 in it was False, so all() ends by returning False.
You could verify this with an iterator with a side effect you can observe; here I used print() to write out what the next value is for a given input:
>>> A = [4, 2, 8, 2, 7, 0, 1, 5, 3]
>>> B = [2, 7, 2]
>>> with_sideeffect = lambda name, iterable: (
...     print(f"{name}[{idx}] = {value}") or value
...     for idx, value in enumerate(iterable)
... )
>>> is_subsequence(with_sideeffect(" > A", A), with_sideeffect("< B", B))
< B[0] = 2
> A[0] = 4
> A[1] = 2
< B[1] = 7
> A[2] = 8
> A[3] = 2
> A[4] = 7
< B[2] = 2
> A[5] = 0
> A[6] = 1
> A[7] = 5
> A[8] = 3
False
Your problem requires that you test every element of B consecutively, there are no shortcuts here. You also must scan through A to test for the elements of B being present, in the right order. You can only declare victory when all elements of B have been found (partial scan), and defeat when all elements in A have been scanned and the current value in B you are testing for is not found.
So, assuming the size of B is always smaller than the size of A, the best case scenario is the one where all K elements of B are equal to the first K elements of A. The worst case is any case where not all of the elements of B are present in A, which requires a full scan through A. It doesn't matter how many elements of B are present in A: if you are testing element K of K, you have already scanned part-way through A and must complete that scan to find that the last element is missing.
So the best case with N elements in A and K elements in B, takes O(K) time. The worst case, using the same definitions of N and K, takes O(N) time.
There is no faster algorithm to test for this condition, so all you can hope for is lowering your constant times (the time taken to complete each of the N steps). Here that'd be a faster way to scan through A as you search for the elements of B. I am not aware of a better way to do this than by using the method you already found.
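For reference, here is a minimal explicit-pointer sketch that is equivalent to the all(x in it for x in B) trick; it makes the single forward scan (and hence the O(N) worst case) visible:
def is_subsequence_explicit(A, B):
    i = 0  # pointer into A; only ever moves forward
    for x in B:
        while i < len(A) and A[i] != x:
            i += 1
        if i == len(A):
            return False  # ran off the end of A without finding x
        i += 1  # consume the matched element
    return True

print(is_subsequence_explicit([4, 2, 8, 2, 7, 0, 1, 5, 3], [2, 2, 1, 3]))  # True
print(is_subsequence_explicit([4, 2, 8, 2, 7, 0, 1, 5, 3], [2, 7, 2]))     # False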

Getting first n unique elements from Python list

I have a python list where elements can repeat.
>>> a = [1,2,2,3,3,4,5,6]
I want to get the first n unique elements from the list.
So, in this case, if i want the first 5 unique elements, they would be:
[1,2,3,4,5]
I have come up with a solution using generators:
def iterate(itr, upper=5):
    count = 0
    for index, element in enumerate(itr):
        if index == 0:
            count += 1
            yield element
        elif element not in itr[:index] and count < upper:
            count += 1
            yield element
In use:
>>> i = iterate(a, 5)
>>> [e for e in i]
[1,2,3,4,5]
I have doubts on this being the most optimal solution. Is there an alternative strategy that I can implement to write it in a more pythonic and efficient way?
I would use a set to remember what was seen and return from the generator when you have seen enough:
a = [1, 2, 2, 3, 3, 4, 5, 6]
def get_unique_N(iterable, N):
    """Yields (in order) the first N unique elements of iterable.
    Might yield fewer if the data is too short."""
    seen = set()
    for e in iterable:
        if e in seen:
            continue
        seen.add(e)
        yield e
        if len(seen) == N:
            return
k = get_unique_N([1, 2, 2, 3, 3, 4, 5, 6], 4)
print(list(k))
Output:
[1, 2, 3, 4]
According to PEP 479 you should return from generators, not raise StopIteration - thanks to @khelwood & @iBug for pointing that out in the comments - one never stops learning.
In 3.6 you get a DeprecationWarning; from 3.7 on it raises a RuntimeError (see the transition plan in PEP 479 if you are still using raise StopIteration).
Your solution using elif element not in itr[:index] and count<upper: uses O(k) lookups - with k being the length of the slice - using a set reduces this to O(1) lookups but uses more memory, because the set has to be kept as well. It is a speed vs. memory tradeoff - which is better is application/data dependent.
Consider [1, 2, 3, 4, 4, 4, 4, 5] vs [1] * 1000 + [2] * 1000 + [3] * 1000 + [4] * 1000 + [5] * 1000 + [6]:
For 6 uniques (in the longer list):
the slice solution would have lookups of O(1)+O(2)+...+O(5001)
mine would have 5001*O(1) lookups plus the memory for set({1, 2, 3, 4, 5, 6})
You can adapt the popular itertools unique_everseen recipe:
def unique_everseen_limit(iterable, limit=5):
    seen = set()
    seen_add = seen.add
    for element in iterable:
        if element not in seen:
            seen_add(element)
            yield element
        if len(seen) == limit:
            break
a = [1,2,2,3,3,4,5,6]
res = list(unique_everseen_limit(a)) # [1, 2, 3, 4, 5]
Alternatively, as suggested by @Chris_Rands, you can use itertools.islice to extract a fixed number of values from an unbounded generator:
from itertools import islice
def unique_everseen(iterable):
    seen = set()
    seen_add = seen.add
    for element in iterable:
        if element not in seen:
            seen_add(element)
            yield element
res = list(islice(unique_everseen(a), 5)) # [1, 2, 3, 4, 5]
Note the unique_everseen recipe is available in 3rd party libraries via more_itertools.unique_everseen or toolz.unique, so you could use:
from itertools import islice
from more_itertools import unique_everseen
from toolz import unique
res = list(islice(unique_everseen(a), 5)) # [1, 2, 3, 4, 5]
res = list(islice(unique(a), 5)) # [1, 2, 3, 4, 5]
If your objects are hashable (ints are hashable), you can write a utility function using the fromkeys method of the collections.OrderedDict class (or, starting from Python 3.7, a plain dict, since dicts became officially ordered), like
from collections import OrderedDict
def nub(iterable):
    """Returns unique elements preserving order."""
    return OrderedDict.fromkeys(iterable).keys()
and then the implementation of iterate can be simplified to
from itertools import islice
def iterate(itr, upper=5):
    return islice(nub(itr), upper)
or if you want always a list as an output
def iterate(itr, upper=5):
    return list(nub(itr))[:upper]
Improvements
As @Chris_Rands mentioned, this solution walks through the entire collection; we can improve this by writing the nub utility as a generator, like others already did:
def nub(iterable):
    seen = set()
    add_seen = seen.add
    for element in iterable:
        if element in seen:
            continue
        yield element
        add_seen(element)
Here is a Pythonic approach using itertools.takewhile():
In [95]: from itertools import takewhile
In [96]: seen = set()
In [97]: set(takewhile(lambda x: seen.add(x) or len(seen) <= 4, a))
Out[97]: {1, 2, 3, 4}
You can use OrderedDict or, since Python 3.7, an ordinary dict, since they are implemented to preserve the insertion order. Note that this won't work with sets.
N = 3
a = [1, 2, 2, 3, 3, 3, 4]
d = {x: True for x in a}
list(d.keys())[:N]
There are really amazing answers for this question, which are fast, compact and brilliant! The reason I am putting this code here is that I believe there are plenty of cases when you don't care about a 1 microsecond time loss, nor do you want additional libraries in your code, for one-time solving of a simple task.
a = [1,2,2,3,3,4,5,6]
res = []
for x in a:
    if x not in res:  # yes, not optimal, but doesn't need an additional dict
        res.append(x)
        if len(res) == 5:
            break
print(res)
Assuming the elements are ordered as shown, this is an opportunity to have fun with the groupby function in itertools:
from itertools import groupby, islice
def first_unique(data, upper):
    return islice((key for (key, _) in groupby(data)), 0, upper)
a = [1, 2, 2, 3, 3, 4, 5, 6]
print(list(first_unique(a, 5)))
Updated to use islice instead of enumerate per @juanpa.arrivillaga. You don't even need a set to keep track of duplicates.
Using set with sorted + key:
sorted(set(a), key=list(a).index)[:5]
Out[136]: [1, 2, 3, 4, 5]
Given
import itertools as it
a = [1, 2, 2, 3, 3, 4, 5, 6]
Code
A simple list comprehension (similar to @cdlane's answer).
[k for k, _ in it.groupby(a)][:5]
# [1, 2, 3, 4, 5]
Alternatively, in Python 3.6+:
list(dict.fromkeys(a))[:5]
# [1, 2, 3, 4, 5]
Profiling Analysis
Solutions
Which solution is the fastest? There are two clear favorite answers (and 3 solutions) that captured most of the votes.
The solution by Patrick Artner - denoted as PA.
The first solution by jpp - denoted as jpp1
The second solution by jpp - denoted as jpp2
This is because these claim to run in O(N) while others here run in O(N^2), or do not guarantee the order of the returned list.
Experiment setup
For this experiment 3 variables were considered.
N elements. The number of first N elements the function is searching for.
List length. The longer the list the further the algorithm has to look to find the last element.
Repeat limit. How many times an element can repeat before the next element occurs in the list. This is uniformly distributed between 1 and the repeat limit.
The assumptions for data generation were as follows. How strict these are depends on the algorithm used; they are more a note on how the data was generated than a limitation on the algorithms themselves.
The elements never occur again after its repeated sequence first appears in the list.
The elements are numeric and increasing.
The elements are of type int.
So in a list of [1,1,1,2,2,3,4 ....] 1,2,3 would never appear again. The next element after 4 would be 5, but there could be a random number of 4s up to the repeat limit before we see 5.
A new dataset was created for each combination of variables and re-generated 20 times. The Python timeit function was used to profile the algorithms 50 times on each dataset. The mean time of the 20x50=1000 runs (for each combination) is reported here. Since the algorithms are generators, their outputs were converted to a list to get the execution time.
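The harness itself is not shown in the answer; a rough sketch of the kind of setup described (all names hypothetical, with get_unique_N being Patrick Artner's function from above) might look like:
import random
import timeit

def make_data(n_unique, repeat_limit):
    # increasing ints, each repeated a uniform-random number of times,
    # matching the data-generation assumptions listed above
    data = []
    for value in range(n_unique):
        data.extend([value] * random.randint(1, repeat_limit))
    return data

dataset = make_data(n_unique=200, repeat_limit=100)
# profile one kernel 50 times on this dataset
elapsed = timeit.timeit(lambda: list(get_unique_N(dataset, 5)), number=50)
print(elapsed / 50)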
Results
As expected, the more elements are searched for, the longer it takes. This graph shows that the execution time is indeed O(N), as claimed by the authors (the straight line supports this).
Fig 1. Varying the first N elements searched for.
All three solutions do not consume additional computation time beyond what is required. The image below shows what happens when the list itself is limited in size, rather than the number of elements searched for. Lists of length 10k, with elements repeating a maximum of 100 times (and thus on average repeating 50 times), would on average run out of unique elements by 200 (10000/50). If any of these graphs showed an increase in computation time beyond 200, this would be a cause for concern.
Fig 2. The effect of first N elements chosen > number of unique elements.
The figure below again shows that processing time increases (at a rate of O(N)) the more data the algorithm has to sift through. The rate of increase is the same as when first N elements were varied. This is because stepping through the list is the common execution block in both, and the execution block that ultimately decides how fast the algorithm is.
Fig 3. Varying the repeat limit.
Conclusion
The 2nd solution posted by jpp is the fastest solution of the 3 in all cases. It is only slightly faster than the solution posted by Patrick Artner, and is almost twice as fast as jpp's first solution.
Why not use something like this?
>>> a = [1, 2, 2, 3, 3, 4, 5, 6]
>>> list(set(a))[:5]
[1, 2, 3, 4, 5]
Example list:
a = [1, 2, 2, 3, 3, 4, 5, 6]
The function returns all unique items, or the requested count of unique items, from the list.
1st argument: the list to work with; 2nd argument (optional): the count of unique items (by default None, meaning all unique elements are returned).
def unique_elements(lst, number_of_elements=None):
    return list(dict.fromkeys(lst))[:number_of_elements]
Here is an example of how it works. The list name is "a", and we need to get 2 unique elements:
print(unique_elements(a, 2))
Output:
[1, 2]
a = [1,2,2,3,3,4,5,6]
from collections import defaultdict

def function(lis, n):
    dic = defaultdict(int)
    sol = set()
    for i in lis:
        try:
            if dic[i]:
                pass
            else:
                sol.add(i)
                dic[i] = 1
                if len(sol) >= n:
                    break
        except KeyError:
            pass
    return list(sol)

print(function(a, 3))
output
[1, 2, 3]

Python arranging a list to include duplicates

I have a list in Python that is similar to:
x = [1,2,2,3,3,3,4,4]
Is there a way using pandas or some other list comprehension to make the list appear like this, similar to a queue system:
x = [1,2,3,4,2,3,4,3]
It is possible, by using cumcount:
import pandas as pd

s = pd.Series(x)
s.index = s.groupby(s).cumcount()
s.sort_index()
Out[11]:
0 1
0 2
0 3
0 4
1 2
1 3
1 4
2 3
dtype: int64
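To get a plain list back from the Series above, sort the index with a stable sort so that rows with equal index keep their original order; a sketch, assuming the same x as in the question:
import pandas as pd

x = [1, 2, 2, 3, 3, 3, 4, 4]
s = pd.Series(x)
s.index = s.groupby(s).cumcount()
print(s.sort_index(kind='mergesort').tolist())  # [1, 2, 3, 4, 2, 3, 4, 3]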
If you split your list into one separate list for each value (groupby), you can then use the itertools recipe roundrobin to get this behavior:
from itertools import groupby
from more_itertools import roundrobin  # or copy the roundrobin recipe from the itertools docs

x = [1, 2, 2, 3, 3, 3, 4, 4]
# each group must be materialized, since advancing groupby invalidates earlier groups
list(roundrobin(*(list(g) for _, g in groupby(x))))  # [1, 2, 3, 4, 2, 3, 4, 3]
If I'm understanding you correctly, you want to retain all duplicates, but then have the list arranged in an order where you create what are in essence separate lists of unique values, but they're all concatenated into a single list, in order.
I don't think this is possible in a listcomp, and nothing's occurring to me for getting it done easily/quickly in pandas.
But the straightforward algorithm is:
Create a different list for each set of unique values: for i in x: if i is not in list1, add it to list1; else if it's not in list2, add it to list2; else if it's not in list3, add it to list3; and so on. There's certainly a way to do this with recursion, if it's an unpredictable number of lists.
Evaluate the lists based on their values, to determine the order in which you want to have them listed in the final list. It's unclear from your post exactly what order you want them to be in. Querying by the value in the 0th position could be one way. Evaluating the entire lists as >= each other is another way.
Once you have that set of lists and their orders, it's straightforward to concatenate them in order, in the final list.
Essentially what you want is the pattern: the order in which we first encounter unique numbers while traversing the list x. For example, if x = [4,3,1,3,5] then the pattern is 4 3 1 5, and this now helps us fill x again, such that the output will be [4,3,1,5,3].
from collections import defaultdict

x = [1, 2, 2, 3, 3, 3, 4, 4]
counts_dict = defaultdict(int)
for p in x:
    counts_dict[p] += 1

i = 0
while i < len(x):
    for p, cnt in counts_dict.items():
        if i < len(x):
            if cnt > 0:
                x[i] = p
                counts_dict[p] -= 1
                i += 1
            else:
                continue
        else:
            # we have placed all the 'p'
            break
print(x)  # [1, 2, 3, 4, 2, 3, 4, 3]
Note: Python 3.6+ dicts preserve insertion order, and I am assuming that you are using Python 3.6+.
This is what I thought of doing at first, but it fails in some cases:
x = [3, 7, 7, 7, 4]
i = 1
while i < len(x):
    if x[i] == x[i - 1]:
        x.append(x.pop(i))
        i = max(1, i - 1)
    else:
        i += 1
print(x)

# x = [1,2,2,3,3,3,4,4] -> output [1, 2, 3, 4, 2, 3, 4, 3]
# x = [2,2,3,3,3,4,4]   -> output [2, 3, 4, 2, 3, 4, 3]
# x = [3,7,1,7,4]       -> output [3, 7, 1, 7, 4]
# x = [3,7,7,7,4]       -> never terminates (times out)

How can I verify if one list is a subset of another?

I need to verify if a list is a subset of another - a boolean return is all I seek.
Is testing equality on the smaller list after an intersection the fastest way to do this? Performance is of utmost importance given the number of datasets that need to be compared.
Adding further facts based on discussions:
Will either of the lists be the same for many tests? It does as one of them is a static lookup table.
Does it need to be a list? It does not - the static lookup table can be anything that performs best. The dynamic one is a dict from which we extract the keys to perform a static lookup on.
What would be the optimal solution given the scenario?
>>> a = [1, 3, 5]
>>> b = [1, 3, 5, 8]
>>> c = [3, 5, 9]
>>> set(a) <= set(b)
True
>>> set(c) <= set(b)
False
>>> a = ['yes', 'no', 'hmm']
>>> b = ['yes', 'no', 'hmm', 'well']
>>> c = ['sorry', 'no', 'hmm']
>>>
>>> set(a) <= set(b)
True
>>> set(c) <= set(b)
False
Use set.issubset
Example:
a = {1,2}
b = {1,2,3}
a.issubset(b) # True
a = {1,2,4}
b = {1,2,3}
a.issubset(b) # False
The performant function Python provides for this is set.issubset. It does have a few restrictions that make it unclear if it's the answer to your question, however.
A list may contain items multiple times and has a specific order. A set does not. Additionally, sets only work on hashable objects.
Are you asking about subset or subsequence (which means you'll want a string search algorithm)? Will either of the lists be the same for many tests? What are the datatypes contained in the list? And for that matter, does it need to be a list?
Your other post intersect a dict and list made the types clearer and did get a recommendation to use dictionary key views for their set-like functionality. In that case it was known to work because dictionary keys behave like a set (so much so that before we had sets in Python we used dictionaries). One wonders how the issue got less specific in three hours.
one = [1, 2, 3]
two = [9, 8, 5, 3, 2, 1]
all(x in two for x in one)
Explanation: the generator creates booleans by looping through list one, checking whether each item is in list two. all() returns True if every item is truthy, else False.
There is also the advantage that all returns False on the first missing element, rather than having to process every item.
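As a side note (not from the original answer): each x in two is a linear scan of the list, so for large inputs a one-time set conversion makes the overall check roughly O(len(one) + len(two)); a minimal sketch:
one = [1, 2, 3]
two = [9, 8, 5, 3, 2, 1]

two_set = set(two)  # one O(len(two)) pass to build the set
print(all(x in two_set for x in one))  # True, with O(1) membership tests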
Assuming the items are hashable
>>> from collections import Counter
>>> not Counter([1, 2]) - Counter([1])
False
>>> not Counter([1, 2]) - Counter([1, 2])
True
>>> not Counter([1, 2, 2]) - Counter([1, 2])
False
If you don't care about duplicate items eg. [1, 2, 2] and [1, 2] then just use:
>>> set([1, 2, 2]).issubset([1, 2])
True
Is testing equality on the smaller list after an intersection the fastest way to do this?
.issubset will be the fastest way to do it. Checking the length before testing issubset will not improve speed because you still have O(N + M) items to iterate through and check.
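Given the facts added to the question (one list is a static lookup table, the other comes from a dict), a minimal sketch of this setup (names hypothetical): build the set once and compare the dict's keys against it directly, since dict views are already set-like:
static_table = [1, 3, 5, 7, 9]        # hypothetical static lookup table
static_set = frozenset(static_table)  # built once, reused for every test

def keys_in_table(dynamic):
    # no intermediate copy of the keys is needed
    return dynamic.keys() <= static_set

print(keys_in_table({1: 'a', 5: 'b'}))  # True
print(keys_in_table({1: 'a', 4: 'b'}))  # False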
One more solution would be to use an intersection.
one = [1, 2, 3]
two = [9, 8, 5, 3, 2, 1]
set(one).intersection(set(two)) == set(one)
If one is a subset of two, the intersection of the sets will equal set one.
(OR)
one = [1, 2, 3]
two = [9, 8, 5, 3, 2, 1]
set(one) & (set(two)) == set(one)
Set theory is inappropriate for lists since duplicates will result in wrong answers using set theory.
For example:
a = [1, 3, 3, 3, 5]
b = [1, 3, 3, 4, 5]
set(b) > set(a)
has no meaning here. Yes, it gives an answer, but that answer is not correct, since set theory is only comparing 1,3,5 versus 1,3,4,5. You must include all duplicates.
Instead you must count each occurrence of each item and do a greater than equal to check. This is not very expensive, because it is not using O(N^2) operations and does not require quick sort.
#!/usr/bin/env python
from collections import Counter

def containedInFirst(a, b):
    a_count = Counter(a)
    b_count = Counter(b)
    for key in b_count:
        if key not in a_count:
            return False
        if b_count[key] > a_count[key]:
            return False
    return True

a = [1, 3, 3, 3, 5]
b = [1, 3, 3, 4, 5]
print("b in a: ", containedInFirst(a, b))

a = [1, 3, 3, 3, 4, 4, 5]
b = [1, 3, 3, 4, 5]
print("b in a: ", containedInFirst(a, b))
Then running this you get:
$ python contained.py
b in a: False
b in a: True
one = [1, 2, 3]
two = [9, 8, 5, 3, 2, 1]
set(x in two for x in one) == set([True])
If every element of list one is in list two:
(x in two for x in one) generates only True values,
so set(x in two for x in one) contains only one element, True.
Pardon me if I am late to the party. ;)
To check if one set A is a subset of set B, Python has A.issubset(B) and A <= B. It works on sets only and works great, BUT the complexity of the internal implementation is unknown. Reference: https://docs.python.org/2/library/sets.html#set-objects
I came up with an algorithm to check if list A is a subset of list B, with the following remarks.
To reduce the complexity of finding the subset, I find it appropriate to sort both lists before comparing elements.
It helped me to break out of the inner loop when the value of the element of the second list B[j] is greater than the value of the element of the first list A[i].
last_index_j is used to start the loop over list B where it last left off. It helps avoid starting comparisons from the beginning of list B (which is, as you might guess, unnecessary in subsequent iterations).
Complexity will be O(n ln n) for sorting each list and O(n) for checking the subset:
O(n ln n) + O(n ln n) + O(n) = O(n ln n).
The code has lots of print statements to show what's going on at each iteration of the loop. These are meant for understanding only.
Check if one list is a subset of another list:
is_subset = True
A = [9, 3, 11, 1, 7, 2]
B = [11, 4, 6, 2, 15, 1, 9, 8, 5, 3]
print(A, B)

# skip checking if list A has more elements than list B
if len(A) > len(B):
    is_subset = False
else:
    # complexity of sorting using quicksort or merge sort: O(n ln n)
    # use the best sorting algorithm available to minimize complexity
    A.sort()
    B.sort()
    print(A, B)

    # complexity: O(n^2)
    # for a in A:
    #     if a not in B:
    #         is_subset = False
    #         break

    # complexity: O(n)
    is_found = False
    last_index_j = 0
    for i in range(len(A)):
        for j in range(last_index_j, len(B)):
            is_found = False
            print("i=" + str(i) + ", j=" + str(j) + ", " + str(A[i]) + "==" + str(B[j]) + "?")
            if B[j] <= A[i]:
                if A[i] == B[j]:
                    is_found = True
                    last_index_j = j
            else:
                is_found = False
                break
            if is_found:
                print("Found: " + str(A[i]))
                last_index_j = last_index_j + 1
                break
            else:
                print("Not found: " + str(A[i]))
        if is_found == False:
            is_subset = False
            break

print("subset") if is_subset else print("not subset")
Output
[9, 3, 11, 1, 7, 2] [11, 4, 6, 2, 15, 1, 9, 8, 5, 3]
[1, 2, 3, 7, 9, 11] [1, 2, 3, 4, 5, 6, 8, 9, 11, 15]
i=0, j=0, 1==1?
Found: 1
i=1, j=1, 2==1?
Not found: 2
i=1, j=2, 2==2?
Found: 2
i=2, j=3, 3==3?
Found: 3
i=3, j=4, 7==4?
Not found: 7
i=3, j=5, 7==5?
Not found: 7
i=3, j=6, 7==6?
Not found: 7
i=3, j=7, 7==8?
not subset
The code below checks whether a given set is a "proper subset" of another set:
def is_proper_subset(set, superset):
    return all(x in superset for x in set) and len(set) < len(superset)
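A quick usage check of the function above:
print(is_proper_subset({1, 2}, {1, 2, 3}))     # True
print(is_proper_subset({1, 2, 3}, {1, 2, 3}))  # False: equal sets are not proper subsets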
In Python 3.5+ you can use [*set()][index] to get an element. It is a much slower solution than the other methods.
one = [1, 2, 3]
two = [9, 8, 5, 3, 2, 1]
result = set(x in two for x in one)
[*result][0] == True
or just with len and set
len(set(a+b)) == len(set(a))
Here is how I know if one list is a subset of another one, the sequence matters to me in my case.
def is_subset(list_long, list_short):
    short_length = len(list_short)
    subset_list = []
    for i in range(len(list_long) - short_length + 1):
        subset_list.append(list_long[i:i + short_length])
    if list_short in subset_list:
        return True
    else:
        return False
Most of the solutions consider that the lists do not have duplicates. In case your lists do have duplicates you can try this:
def isSubList(subList, mlist):
    uniqueElements = set(subList)
    for e in uniqueElements:
        if subList.count(e) > mlist.count(e):
            return False
    # it is a sublist
    return True
It ensures the sublist never has different elements than list or a greater amount of a common element.
lst=[1,2,2,3,4]
sl1=[2,2,3]
sl2=[2,2,2]
sl3=[2,5]
print(isSubList(sl1,lst)) # True
print(isSubList(sl2,lst)) # False
print(isSubList(sl3,lst)) # False
Since no one has considered comparing two strings, here's my proposal.
You may of course want to check that the pipe ("|") is not part of either list, and maybe automatically choose another char, but you get the idea.
Using an empty string as the separator is not a solution, since the numbers can have several digits ([12, 3] != [1, 23]). Delimiting both ends guards against partial matches such as [1] in [11]:
def issublist(l1, l2):
    s1 = '|' + '|'.join(str(i) for i in l1) + '|'
    s2 = '|' + '|'.join(str(i) for i in l2) + '|'
    return s1 in s2
If you are asking whether one list appears as a single element inside another list, then:
>>> if listA in listB: return True
If you are asking whether each element in listA occurs at least as many times in listB as it does in listA, try:
all(listA.count(item) <= listB.count(item) for item in listA)
