Is there an operator to remove elements from a List based on the content of a Set?
What I want to do is already possible by doing this:
words = ["hello", "you", "how", "are", "you", "today", "hello"]
my_set = {"you", "are"}
new_list = [w for w in words if w not in my_set]
# ["hello", "how", "today", "hello"]
What bothers me about this list comprehension is that for huge collections it looks less efficient to me than the - operator that can be used between two sets. In the list comprehension, the iteration happens in Python, whereas with the operator, the iteration happens in C and is more low-level, hence faster.
So is there some way of computing a difference between a List and a Set in a shorter/cleaner/more efficient way than using a list comprehension, like for example:
# I know this is not possible, but does something approaching exist?
new_list = words - my_set
TL;DR
I'm looking for a way to remove all elements present in a Set from a List, one that is either:
cleaner (with a built-in perhaps)
and/or more efficient
than what I know can be done with list comprehensions.
Unfortunately, the only answer for this is: No, there is no built-in way, implemented in native code, for this kind of operation.
What bothers me about this list comprehension is that for huge collections it looks less efficient to me than the - operator that can be used between two sets.
I think what’s important here is the “looks” part. Yes, list comprehensions run more within Python than a set difference, but I assume that most of your application actually runs within Python (otherwise you should probably be programming in C instead). So you should consider whether it really matters much. Iterating over a list is fast in Python, and a membership test on a set is also super fast (constant time, and implemented in native code). And if you look at list comprehensions, they are also very fast. So it probably won’t matter much.
In the list comprehension, the iteration happens in Python, whereas with the operator, the iteration happens in C and is more low-level, hence faster.
It is true that native operations are faster, but they are also more specialized and limited, allowing for less flexibility. For sets, a difference is pretty easy: the set difference is a mathematical concept and is very clearly defined.
But when talking about a “list difference” or a “list and set difference” (or, more generalized, a “list and iterable difference”?) it becomes a lot less clear. There are a lot of open questions:
How are duplicates handled? If there are two X in the original list and only one X in the subtrahend, should both X disappear from the list? Should only one disappear? If so, which one, and why?
How is order handled? Should the order be kept as in the original list? Does the order of the elements in the subtrahend have any impact?
What if we want to subtract members based on some other condition than equality? For sets, it’s clear that they always work on the equality (and hash value) of the members. Lists don’t, so lists are by design a lot more flexible. With list comprehensions, we can easily have any kind of condition to remove elements from a list; with a “list difference” we would be restricted to equality, and that might actually be a rare situation if you think about it.
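To make those questions concrete, compare the two plausible semantics on the question's own data; note how they disagree on duplicates and order:
words = ["hello", "you", "how", "are", "you", "today", "hello"]
my_set = {"you", "are"}
# A set difference drops duplicates and the original order:
list(set(words) - my_set)              # e.g. ['how', 'today', 'hello'] -- one 'hello', arbitrary order
# The list comprehension keeps both:
[w for w in words if w not in my_set]  # ['hello', 'how', 'today', 'hello']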
If you need to calculate differences, you are probably more likely to be using a set in the first place (or even some ordered set). And when filtering lists, it may be rare that you actually need the filtered result as a list; it is often more natural to use a generator expression (or the Python 3 filter() function) and work with that lazily, without ever creating the filtered list in memory.
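For instance, a minimal sketch of that lazy style, reusing the question's data; no intermediate list is ever built:
words = ["hello", "you", "how", "are", "you", "today", "hello"]
my_set = {"you", "are"}
# Elements are filtered on the fly as the loop consumes them.
for w in (w for w in words if w not in my_set):
    print(w)  # stand-in for whatever per-element work you actually do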
What I’m trying to say is that the use case for a list difference is not as clear-cut as that for a set difference, and if there is a use case, it is likely a very rare one. In general, I don’t think it’s worth adding complexity to the Python implementation for this, especially when the in-Python alternative, a list comprehension, is already as fast as it is.
First things first: are you prematurely worrying about an optimisation problem that isn't really an issue? I have to have lists with at least 10,000,000 elements before this operation even takes tenths of a second.
If you're working with large data sets then you may find it advantageous to move to using numpy.
import random
import timeit
r = range(10000000)
setup = """
import numpy as np
l = list({!r})
s = set(l)
to_remove = {!r}
n = np.array(l)
n_remove = np.array(list(to_remove))
""".format(r, set(random.sample(r, 3)))
list_filter = "[x for x in l if x not in to_remove]"
set_filter = "s - to_remove"
np_filter = "n[np.in1d(n, n_remove, invert=True)]"  # np.isin is the modern equivalent of np.in1d
n = 1
l_time = timeit.timeit(list_filter, setup, number=n)
print("lists:", l_time)
s_time = timeit.timeit(set_filter, setup, number=n)
print("sets:", s_time)
n_time = timeit.timeit(np_filter, setup, number=n)
print("numpy:", n_time)
returns the following results -- numpy is a few times faster than the set difference and an order of magnitude faster than the list comprehension.
lists: 0.8743789765043315
sets: 0.20703006886620656
numpy: 0.06197169088128707
I agree with poke. Here is my reasoning:
The easiest way to do it would be using a filter:
words = ["hello", "you", "how", "are", "you", "today", "hello"]
my_set = {"you", "are"}
new_list = filter(lambda w: w not in my_set, words)  # note: in Python 3, filter() returns a lazy iterator, not a list
And using Dunes' solution, I get these times:
lists: 0.87401028
sets: 0.55103887
numpy: 0.16134396
filter: 0.00000886 WOW, beats numpy by several orders of magnitude!!!
But wait, this is a flawed comparison: the comprehension and the set difference build their result strictly, whereas numpy returns an array and filter returns a lazy iterator, so neither of the latter produces an actual Python list at all.
If I run Dunes' solution but produce the actual lists, I get:
lists: 0.86804159
sets: 0.56945663
numpy: 1.19315723
filter: 1.68792561
Now numpy is somewhat more efficient than a simple filter, but neither beats the list comprehension, which was the first and most intuitive solution.
I would definitely use a filter over the comprehension, unless I needed to use the filtered result more than once (although I could tee it).
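For reference, a minimal sketch of the tee idea. Note that itertools.tee buffers everything one consumer has seen but the other hasn't, so it only saves memory if the consumers stay roughly in step:
from itertools import tee

words = ["hello", "you", "how", "are", "you", "today", "hello"]
my_set = {"you", "are"}
pass1, pass2 = tee(filter(lambda w: w not in my_set, words))
matches = list(pass1)          # one consumer materializes the items
count = sum(1 for _ in pass2)  # another consumer only counts them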
Related
According to my research there are two easy ways to remove duplicates from a list:
a = list(dict.fromkeys(a))
and
a = list(set(a))
Is one of them more efficient than the other?
Definitely the second one is more efficient: sets were more or less created for that purpose, and you skip the overhead of creating a dict, which is considerably heavier.
Performance-wise, it definitely depends on what the payload actually is.
import timeit
import random
input_data = [random.choice(range(100)) for i in range(1000)]
from_keys = timeit.timeit('list(dict.fromkeys(input_data))', number=10000, globals={'input_data': input_data})
from_set = timeit.timeit('list(set(input_data))', number=10000, globals={'input_data': input_data})
print(f"From keys performance: {from_keys:.3f}")
print(f"From set performance: {from_set:.3f}")
Prints:
From keys performance: 0.230
From set performance: 0.140
That said, this doesn't mean it will always be almost twice as fast; in absolute terms the difference is barely visible. Try it yourself with different random data.
The second option is better not only because it's faster, but also because it shows the programmer's intention more clearly. set() is designed specifically to model mathematical sets, in which elements cannot be duplicated, so it fits this purpose and the intent is clear to the reader. dict(), on the other hand, is for storing key-value pairs and says nothing about the intention.
In case we have a list such as a = [1,16,2,3,4,5,6,8,10,3,9,15,7]:
If we use a = list(set(a)), set() drops the duplicates but may also reorder the elements; the new list might come out as [1,2,3,4,5,6,7,8,9,10,15,16] (the iteration order of a set is arbitrary and implementation-dependent). If we use a = list(dict.fromkeys(a)) instead, dict.fromkeys() drops the duplicates while keeping the elements in their original order: [1,16,2,3,4,5,6,8,10,9,15,7].
To sum things up: if you're looking for a way to drop duplicates from a list without caring about its order, set() is what you're looking for; but if keeping the order of the list is required, use dict.fromkeys().
CAUTION: since Python 3.7, the keys of a dict preserve insertion order.
So the first form that uses
list(dict.fromkeys(a)) # preserves order!!
preserves the order while using the set will potentially (and probably) change the order of the elements of the list 'a'.
I have a Python list of objects that could be pretty long. At particular times, I'm interested in all of the elements in the list that have a certain attribute, say flag, that evaluates to False. To do so, I've been using a list comprehension, like this:
objList = list()
# ... populate list
[x for x in objList if not x.flag]
Which seems to work well. After forming the sublist, I have a few different operations that I might need to do:
Subscript the sublist to get the element at index ind.
Calculate the length of the sublist (i.e. the number of elements that have flag == False).
Search the sublist for the first instance of a particular object (i.e. using the list's .index() method).
I've implemented these using the naive approach of just forming the sublist and then using its methods to get at the data I want. I'm wondering if there are more efficient ways to go about these. #1 and #3 at least seem like they could be optimized, because in #1 I only need the first ind + 1 matching elements of the sublist, not necessarily the entire result set, and in #3 I only need to search through the sublist until I find a matching element.
Is there a good Pythonic way to do this? I'm guessing I might be able to use the () syntax in some way to get a generator instead of creating the entire list, but I haven't happened upon the right way yet. I obviously could write loops manually, but I'm looking for something as elegant as the comprehension-based method.
If you need to do any of these operations more than a couple of times, the overhead of the other methods will be higher, and just building the list once is the best way. It's also probably the clearest, so if memory isn't a problem, then I'd recommend just going with it.
If memory/speed is a problem, then there are alternatives - note that speed-wise, these might actually be slower, depending on the common case for your software.
For your scenarios:
from itertools import islice

def nth(iterable, n, default=None):
    # The nth() recipe from the itertools docs: returns the nth item or a default.
    return next(islice(iterable, n, None), default)

# value = sublist[ind]
value = nth((x for x in objList if not x.flag), ind)

# value = len(sublist)
value = sum(not x.flag for x in objList)

# value = sublist.index(target)
value = next(i for i, x in enumerate(y for y in objList if not y.flag) if x == target)
Using generator expressions and the nth() recipe from the itertools docs (inlined above). Note that a generator expression passed alongside other arguments must be parenthesized, and that the index search stops as soon as it finds a match.
I'm going to assume you might do any of these three things, and you might do them more than once.
In that case, what you want is basically to write a lazily evaluated list class. It would keep two pieces of data, a real list cache of evaluated items, and a generator of the rest. You could then do ll[10] and it would evaluate up to the 10th item, ll.index('spam') and it would evaluate until it finds 'spam', and then len(ll) and it would evaluate the rest of the list, all the while caching in the real list what it sees so nothing is done more than once.
Constructing it would look like this:
LazyList(x for x in obj_list if not x.flag)
But nothing would actually be computed until you actually start using it as above.
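Here is a minimal sketch of such a class (illustrative only; error handling and the rest of the list interface are left out):
class LazyList:
    # Minimal sketch: caches items pulled from a generator as they are
    # first needed, so each item is evaluated at most once.
    def __init__(self, iterable):
        self._iter = iter(iterable)
        self._cache = []

    def _fill_to(self, n):
        # Pull items from the generator until the cache holds n + 1 items.
        while len(self._cache) <= n:
            try:
                self._cache.append(next(self._iter))
            except StopIteration:
                raise IndexError("LazyList index out of range") from None

    def __getitem__(self, i):
        self._fill_to(i)
        return self._cache[i]

    def __len__(self):
        # Forces evaluation of the rest of the generator.
        self._cache.extend(self._iter)
        return len(self._cache)

    def index(self, value):
        # Evaluates only until a match is found.
        i = 0
        while True:
            try:
                if self[i] == value:
                    return i
            except IndexError:
                raise ValueError(f"{value!r} is not in list") from None
            i += 1
Indexing ll[10] then evaluates items up to index 10, ll.index(value) evaluates until it finds a match, and len(ll) evaluates everything, exactly as described above.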
Since you commented that your objList can change, if you don't also need to index or search objList itself, then you might be better off just storing two different lists, one with .flag = True and one with .flag = False. Then you can use the second list directly instead of constructing it with a list comprehension each time.
If this works in your situation, it is likely the most efficient way to do it.
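A minimal sketch of that bookkeeping (source_objects is a hypothetical stand-in for wherever your objects currently come from):
flagged, unflagged = [], []
for x in source_objects:
    (flagged if x.flag else unflagged).append(x)
# unflagged now plays the role of the filtered sublist and supports
# indexing, len() and .index() directly, with no per-query work.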
Here's a example of what I want to do
spam_list = ["We", "are", "the", "knights", "who", "say", "Ni"]
spam_order = [0,1,2,4,5,6,3]
spam_list.magical_sort(spam_order)
print(spam_list)
["We", "are", "the", "who", "say", "Ni", "knights"]
I can do it with enumerate, list and so on, but I would like to directly affect spam_list, like list.sort() and not copy it like sorted()
Edit : pushed a string example to avoid confusion between indices and values of spam_list
Edit: it turned out this is a duplicate of Python sort parallel arrays in place?. Well, I can't bring myself to delete so much effort for the sake of SO consistency.
You could try:
spam_list = [spam_list[i] for i in spam_order]
You can give a special key to the sort function:
order = dict(zip(spam_list, spam_order))
spam_list.sort(key=order.get)
Edit: As @ninjagecko points out in his answer, this is not really efficient, as it copies both lists to create the dictionary for the lookup (and it requires the list elements to be unique and hashable). However, with the modified example given by the OP, this is the only way, because one has to build some index. The upside is that, at least for the strings, the values will not be copied, so the overhead is just that of the dictionary itself.
but I would like to directly affect spam_list, like list.sort() and not copy it like sorted()
There is ONLY ONE SOLUTION, that does exactly what you ask. Every single other solution is implicitly making a copy of one or both lists (or turning it into a dict, etc.). What you are asking for is a method which sorts two lists in-place, using O(1) extra space, using one list as the keys of the other. I personally would just accept the extra space complexity, but if you really wanted to, you could do this:
(edit: it may be the case that the original poster doesn't really care about .sort because it's efficient, but rather because it modifies state; in general this is a dangerous thing to want and non-low-level languages attempt to avoid this and even ban it, but the solutions which use slice assignment will achieve "in-place" semantics)
Create a custom dictionary subclass (effectively a Zip class) which is backed by both lists you are sorting.
Indexing myZip[i] -> results in the tuple (list1[i],list2[i])
Assignment myZip[i]=(x1,x2) -> dispatches into list1[i]=x1, list2[i]=x2.
Use that to do myZip(spam_list,spam_order).sort(), and now both spam_list and spam_order are sorted in-place
Example:
#!/usr/bin/python3
class LiveZip(list):
    def __init__(self, list1, list2):
        self.list1 = list1
        self.list2 = list2

    def __len__(self):
        return len(self.list1)

    def __getitem__(self, i):
        # Reading index i yields the pair of corresponding elements.
        return (self.list1[i], self.list2[i])

    def __setitem__(self, i, pair):
        # Writing index i dispatches into both underlying lists.
        x1, x2 = pair
        self.list1[i] = x1
        self.list2[i] = x2
spam_list = ["We", "are", "the", "knights", "who", "say", "Ni"]
spam_order = [0,1,2,4,5,6,3]
#spam_list.magical_sort(spam_order)
proxy = LiveZip(spam_order, spam_list)
Now let's see if it works...
#proxy.sort()
#fail --> oops, the internal implementation is not meant to be subclassed! lame
# It turns out that the python [].sort method does NOT work without passing in
# a list to the constructor (i.e. the internal implementation does not use the
# public interface), so you HAVE to implement your own sort if you want to not
# use any extra space. This is kind of dumb. But the approach above means you can
# just use any standard textbook in-place sorting algorithm:
def myInPlaceSort(x):
    ...  # replace with any textbook in-place sorting algorithm; see the sketch below
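For illustration, a minimal sketch of one such textbook algorithm (insertion sort); it touches the sequence only through len(), x[i] and x[i] = ..., which is exactly the interface LiveZip provides:
def myInPlaceSort(x):
    # In-place insertion sort; each swap goes through __getitem__ /
    # __setitem__, so it reorders both underlying lists together.
    for i in range(1, len(x)):
        j = i
        while j > 0 and x[j - 1] > x[j]:
            x[j - 1], x[j] = x[j], x[j - 1]
            j -= 1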
NOW it works:
myInPlaceSort(proxy)
print(spam_list)
Unfortunately there is no way to just sort one list in O(1) space without sorting the other; if you don't want to sort both lists, you might as well do your original approach which constructs a dummy list.
You can however do the following:
spam_list.sort(key=lambda x:x)
but if the key or cmp function makes any reference to a collection (e.g. if you pass in dict.__getitem__ of a dict you had to construct), this is no better than your original O(N)-space approach, unless you already happen to have such a dictionary lying around.
Turns out this is a duplicate question of Python sort parallel arrays in place?, but that question also had no correct answers except this one, which is equivalent to mine but without the sample code. Unless you are writing incredibly optimized or specialized code, I'd just use your original solution, which is equivalent in space complexity to the other solutions.
edit2:
As senderle pointed out, the OP doesn't want a sort at all, but rather wishes, I think, to apply a permutation. To achieve this, you can and SHOULD simply use the indexing that other answers suggest, [spam_list[i] for i in spam_order], but an explicit or implicit copy must still be made, because you still need the intermediate data. (Unrelated, and for the record: applying the inverse permutation is, I think, the inverse of parallel-sorting with the identity, and you can use one to get the other, though sorting is less time-efficient. _, spam_order_inverse = parallelSort(spam_order, range(N)), then sort by spam_order_inverse. I leave the above discussion about sorting up for the record.)
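As an aside, you don't need a sort to obtain the inverse permutation; a one-pass sketch:
spam_order = [0, 1, 2, 4, 5, 6, 3]
# inverse[p] answers: "which position does element p move to?"
inverse = [0] * len(spam_order)
for i, p in enumerate(spam_order):
    inverse[p] = i
# inverse == [0, 1, 2, 6, 3, 4, 5]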
edit3:
It is possible, however, to achieve an in-place permutation in O(#cycles) space, but with terrible time efficiency. Every permutation can be decomposed into disjoint permutations applied in parallel on subsets of the indices; these subsets are called cycles or orbits, and the period of each is equal to its size. You thus take a leap of faith and do as follows:
Create a temp variable.
For index i=0...N:
Put x_i into temp, assign NULL to x_i
Swap temp with x_p(i)
Swap temp with x_p(p(i))
...
Swap temp with x_p(..p(i)..), which is x_i
Put a "do not repeat" marker on the smallest element you visited larger than i
Whenever you encounter a "do not repeat" marker, perform the loop again but
without swapping, moving the marker to the smallest element larger than i
To avoid having to perform the loop again, use a bloom filter
This will run in O(N^2) time and O(#cycles) space without a bloom filter, or ~O(N) time and O(#cycles + bloomfilter_space) space if you use one.
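Here is a minimal sketch of the cycle-walking idea, using a simpler variant that avoids the markers and bloom filter entirely by processing each cycle only from its smallest index (O(1) extra space, worst-case O(N^2) time):
def permute_in_place(values, perm):
    # Rearranges values so that afterwards
    # values == [old_values[perm[i]] for i in range(len(values))].
    n = len(values)
    for start in range(n):
        # Process each cycle exactly once: only from its smallest index.
        j = perm[start]
        is_smallest = True
        while j != start:
            if j < start:
                is_smallest = False
                break
            j = perm[j]
        if not is_smallest:
            continue
        # Walk the cycle, pulling each element into place.
        temp = values[start]
        j = start
        while perm[j] != start:
            values[j] = values[perm[j]]
            j = perm[j]
        values[j] = temp

spam_list = ["We", "are", "the", "knights", "who", "say", "Ni"]
permute_in_place(spam_list, [0, 1, 2, 4, 5, 6, 3])
print(spam_list)  # ['We', 'are', 'the', 'who', 'say', 'Ni', 'knights']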
If the issue is specifically in-placeness and not memory usage per se -- if you want this to have side effects, in other words -- then you could use slice assignment. Stealing from Peter Collingridge:
other_spam_list = spam_list
spam_list[:] = [spam_list[i] for i in spam_order]
assert other_spam_list == spam_list
It seems you might even be able to do this with a generator expression! But I suspect this still implicitly creates a new sequence of some sort -- probably a tuple. If it didn't, I think it would exhibit wrong behavior; but I tested it, and its behavior seemed correct.
spam_list[:] = (spam_list[i] for i in spam_order)
Aha! See this excellent answer by the inimitable Sven Marnach -- generator slice assignment does indeed generate an implicit tuple. Which means it's safe, but not as memory efficient as you might think. Still, tuples are more memory efficient than lists, so the generator expression is preferable from that perspective.
map(lambda x: spam_list[x], spam_order)  # note: in Python 3 this is a lazy map object, not a list, and it does not modify spam_list in place
If you actually don't care about the efficiency at all, and just want in-place semantics (which is a bit odd, because there are entire programming languages dedicated to avoiding in-place semantics), then you can do this:
def modifyList(toModify, newList):
    toModify[:] = newList

def permuteAndUpdate(toPermute, permutation):
    modifyList(toPermute, [toPermute[i] for i in permutation])
permuteAndUpdate(spam_list, spam_order)
print(spam_list)
# ['We', 'are', 'the', 'who', 'say', 'Ni', 'knights']
Credit goes to senderle for recognizing that this is what the OP may actually be after; he should feel free to copy this answer into his own. You should not accept this answer unless you really prefer it over his.
You may use numpy:
import numpy as np
spam_list = np.array(spam_list)[spam_order].tolist()
(Using .tolist() rather than list() gives you plain Python objects back instead of numpy scalars.)
When you do something like "test" in a where a is a list does python do a sequential search on the list or does it create a hash table representation to optimize the lookup? In the application I need this for I'll be doing a lot of lookups on the list so would it be best to do something like b = set(a) and then "test" in b? Also note that the list of values I'll have won't have duplicate data and I don't actually care about the order it's in; I just need to be able to check for the existence of a value.
Also note that the list of values I'll have won't have duplicate data and I don't actually care about the order it's in; I just need to be able to check for the existence of a value.
Don't use a list, use a set() instead. It has exactly the properties you want, including a blazing fast in test.
I've seen speedups of 20x and higher in places (mostly heavy number crunching) where one list was changed for a set.
"test" in a with a list a will do a linear search. Setting up a hash table on the fly would be much more expensive than a linear search. "test" in b on the other hand will do an amoirtised O(1) hash look-up.
In the case you describe, there doesn't seem to be a reason to use a list over a set.
I think it would be better to go with the set implementation. Sets have O(1) average lookup time, whereas lists take O(n); and even if lists were also O(1), you would lose nothing by switching to sets.
Further, sets don't allow duplicate values. This will also make your program slightly more memory efficient.
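If you want to see the difference yourself, a small timeit sketch (the exact numbers will vary by machine):
import timeit

data_list = list(range(1000000))
data_set = set(data_list)
# Worst case for the list: the element is near the end, so each test scans O(n) items.
print(timeit.timeit('999999 in data_list', globals=globals(), number=100))
print(timeit.timeit('999999 in data_set', globals=globals(), number=100))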
Lists and tuples seem to have the same lookup time, and using "in" is slow for large data:
>>> import time
>>> t = list(range(0, 1000000))
>>> a=time.time();x = [b in t for b in range(100234,101234)];print(time.time()-a)
1.66235494614
>>> t = tuple(range(0, 1000000))
>>> a=time.time();x = [b in t for b in range(100234,101234)];print(time.time()-a)
1.6594209671
Here is a much better solution: Most efficient way for a lookup/search in a huge list (python)
It's super fast -- but note that bisect requires the list to be sorted (here it is, since it came from range()):
>>> from bisect import bisect_left
>>> t = list(range(0, 1000000))
>>> a=time.time();x = [t[bisect_left(t,b)]==b for b in range(100234,101234)];print(time.time()-a)
0.0054759979248
I have two very large lists and to loop through it once takes at least a second and I need to do it 200,000 times. What's the fastest way to remove duplicates in two lists to form one?
This is the fastest way I can think of:
import itertools
output_list = list(set(itertools.chain(first_list, second_list)))
Slight update: As jcd points out, depending on your application, you probably don't need to convert the result back to a list. Since a set is iterable by itself, you might be able to just use it directly:
output_set = set(itertools.chain(first_list, second_list))
for item in output_set:
# do something
Beware though that any solution involving the use of set() will probably reorder the elements in your list, so there's no guarantee that elements will be in any particular order. That said, since you're combining two lists, it's hard to come up with a good reason why you would need a particular ordering over them anyway, so this is probably not something you need to worry about.
I'd recommend something like this:
def combine_lists(list1, list2):
    s = set(list1)
    s.update(list2)
    return list(s)
This eliminates the problem of creating a monster list of the concatenation of the first two.
Depending on what you're doing with the output, don't bother to convert back to a list. If ordering is important, you might need some sort of decorate/sort/undecorate shenanigans around this.
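For example, if ordering does matter, a minimal sketch that keeps the first occurrence of each element, relying on dict keys preserving insertion order (guaranteed since Python 3.7):
from itertools import chain

def combine_lists_ordered(list1, list2):
    # Drops duplicates while preserving first-occurrence order.
    return list(dict.fromkeys(chain(list1, list2)))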
As Daniel states, a set cannot contain duplicate entries - so concatenate the lists:
list1 + list2
Then convert the new list to a set:
set(list1 + list2)
Then back to a list:
list(set(list1 + list2))
result = list(set(list1).union(set(list2)))
That's how I'd do it. I am not so sure about performance, but it is certainly better than doing it by hand. (As an aside, union() accepts any iterable, so the inner set(list2) isn't strictly necessary.)