When you do something like "test" in a where a is a list does python do a sequential search on the list or does it create a hash table representation to optimize the lookup? In the application I need this for I'll be doing a lot of lookups on the list so would it be best to do something like b = set(a) and then "test" in b? Also note that the list of values I'll have won't have duplicate data and I don't actually care about the order it's in; I just need to be able to check for the existence of a value.
Don't use a list, use a set() instead. It has exactly the properties you want, including a blazing fast in test.
I've seen speedups of 20x and higher in places (mostly heavy number crunching) where one list was changed for a set.
"test" in a with a list a will do a linear search. Setting up a hash table on the fly would be much more expensive than a linear search. "test" in b on the other hand will do an amoirtised O(1) hash look-up.
In the case you describe, there doesn't seem to be a reason to use a list over a set.
I think it would be better to go with the set implementation. Sets have O(1) average lookup time, whereas lists take O(n). But even if list lookup were also O(1), you would lose nothing by switching to sets.
Further, sets don't allow duplicate values. This will make your program slightly more memory efficient as well.
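A minimal sketch of what that looks like (the list values here are just illustrative):
a = ["hello", "test", "world"]   # your existing list
b = set(a)                       # one-time O(n) conversion
print("test" in b)               # average O(1) hash lookup -> True
print("nope" in b)               # False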
Lists and tuples seem to take about the same time, and using "in" is slow for large data:
>>> import time
>>> t = list(range(0, 1000000))
>>> a=time.time();x = [b in t for b in range(100234,101234)];print(time.time()-a)
1.66235494614
>>> t = tuple(range(0, 1000000))
>>> a=time.time();x = [b in t for b in range(100234,101234)];print(time.time()-a)
1.6594209671
Here is a much better solution, using bisect (see "Most efficient way for a lookup/search in a huge list (python)").
It's super fast; note that bisect requires the list to be sorted (the list here already is):
>>> from bisect import bisect_left
>>> t = list(range(0, 1000000))
>>> a=time.time();x = [t[bisect_left(t,b)]==b for b in range(100234,101234)];print(time.time()-a)
0.0054759979248
According to my research there are two easy ways to remove duplicates from a list:
a = list(dict.fromkeys(a))
and
a = list(set(a))
Is one of them more efficient than the other?
Definitely the second one is more efficient: sets are made for exactly this purpose, and you skip the overhead of creating a dict, which is heavier.
Performance-wise, it definitely depends on what the payload actually is.
import timeit
import random
input_data = [random.choice(range(100)) for i in range(1000)]
from_keys = timeit.timeit('list(dict.fromkeys(input_data))', number=10000, globals={'input_data': input_data})
from_set = timeit.timeit('list(set(input_data))', number=10000, globals={'input_data': input_data})
print(f"From keys performance: {from_keys:.3f}")
print(f"From set performance: {from_set:.3f}")
Prints:
From keys performance: 0.230
From set performance: 0.140
That doesn't mean it's almost twice as fast across the board, though; in absolute terms the difference is barely visible. Try it for yourself with different random data.
The second way is better not only because it's faster, but also because it shows the intention of the programmer better. set() is designed specifically to model mathematical sets, in which elements cannot be duplicated, so it fits this purpose and the intent is clear to the reader. dict(), on the other hand, is for storing key-value pairs and says nothing about the intent.
In case we have a list a = [1,16,2,3,4,5,6,8,10,3,9,15,7] and we use a = list(set(a)), the set() call will drop the duplicates and may also reorder our list; the new list could look like [1,2,3,4,5,6,7,8,9,10,15,16]. If we use a = list(dict.fromkeys(a)) instead, dict.fromkeys() will drop the duplicates and keep the elements in their original order: [1,16,2,3,4,5,6,8,10,9,15,7].
To sum up: if you're looking for a way to drop duplicates from a list without caring about its order, then set() is what you're looking for; but if keeping the order of the list is required, you can use dict.fromkeys().
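For instance, a small sketch of both approaches on that list (the exact ordering of the set version is not guaranteed):
a = [1, 16, 2, 3, 4, 5, 6, 8, 10, 3, 9, 15, 7]
print(list(set(a)))            # duplicates dropped, order not guaranteed
print(list(dict.fromkeys(a)))  # duplicates dropped, original order kept:
                               # [1, 16, 2, 3, 4, 5, 6, 8, 10, 9, 15, 7]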
CAUTION: since Python 3.7 the keys of a dict are ordered.
So the first form that uses
list(dict.fromkeys(a)) # preserves order!!
preserves the order, while using a set will potentially (and probably) change the order of the elements of the list a.
I quite often use set() to remove duplicates from lists. After doing so, I always directly change it back to a list.
a = [0,0,0,1,2,3,4,5]
b = list(set(a))
Why does set() return a set object, instead of simply a list?
type(set(a)) == set # is true
Is there a use for set items that I have failed to understand?
Yes, sets have many uses. They have lots of nice operations documented here which lists don't have. One very useful difference is that membership testing (x in a) can be much faster than for a list.
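For example, a quick sketch of a few of those operations (the values are just illustrative):
a = {"apple", "banana", "cherry"}
b = {"banana", "durian"}
print("apple" in a)   # membership test, average O(1)
print(a | b)          # union
print(a & b)          # intersection
print(a - b)          # difference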
Okay, by doubles you mean duplicates? set() will always return a set because it is a data structure in Python, like lists. When you call set() you are creating a set object.
The rest of the information about sets you can find here:
https://docs.python.org/2/library/sets.html
As already mentioned, I won't go into why set does not return a list, but as you stated:
I quite often use set() to remove duplicates from lists. After doing so, I always directly change it back to a list.
You could use OrderedDict if you really hate converting it back to a list:
source_list = [0,0,0,1,2,3,4,5]
from collections import OrderedDict
print(OrderedDict((x, True) for x in source_list).keys())
OUTPUT:
odict_keys([0, 1, 2, 3, 4, 5])
As said before, for certain operations a set is faster than a list. The Python wiki has a TimeComplexity page which gives the speed of operations for the various data types. Note that if you have few elements in your list or set, you will most probably not notice the difference, but with more elements it becomes more important.
Notice, for example, that for in-place removal a list is O(n), meaning a 10-times-longer list needs 10 times more time, while s.difference_update(t) (where s is a set and t is a set with one element to be removed from s) is O(1), i.e. independent of the number of elements of s.
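A tiny sketch of the two removal styles (just showing the API, not a benchmark):
s = {1, 2, 3, 4}
s.difference_update({3})   # cost depends on the argument size, not on len(s)
print(s)                   # {1, 2, 4}

l = [1, 2, 3, 4]
l.remove(3)                # O(n): scans for 3, then shifts the remaining elements
print(l)                   # [1, 2, 4]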
I am solving a problem in which I need a list of zeroes, and after that I have to update some values in it. I have two options in mind: either simply make a list of zeroes and then update the values, or create a dictionary and then update the values.
List method:
l = [0] * n
Dictionary method:
d = {}
for i in range(n):
    d[i] = 0
Now, the complexity to build the dictionary is O(n), and updating a key is O(1). But I don't know how Python builds the list of zeroes with the above method.
Let's assume n is a large number: which of the above methods will be better for this task, and how is the list method implemented in Python? Also, why is the above list method faster than a list comprehension for creating a list of zeroes?
Access and updates, once you have pre-allocated your sequence, will be roughly the same for both.
Pick a data structure that makes sense for your application. In this case I suggest a list, because it more naturally fits a "sequence indexed by integers".
The reason [0]*n is fast is that it can make a list of the correct size in one go, rather than constantly expanding the list as more elements are added.
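A minimal sketch of that approach (n and the updated index are arbitrary):
n = 10
l = [0] * n   # allocated at its final size in one step
l[3] = 42     # O(1) update by index
print(l)      # [0, 0, 0, 42, 0, 0, 0, 0, 0, 0]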
collections.defaultdict may be a better solution if you expect that a lot of elements will keep their initial value and never change during your updates (and if you don't somehow rely on KeyErrors). Just:
import collections
d = collections.defaultdict(int)
assert d[42] == 0
d[43] = 1
# ...
Another thing to consider is array.array. You can use it if you want to store only elements (counts) of one type. It should be a little faster and more memory efficient than a list:
import array
l = array.array('L', [0]) * n
# use as list
After running a test using timeit:
import timeit
timeit.repeat("[0]*1000", number=1000000)
#[4.489016328923801, 4.459866205812087, 4.477892545204176]
timeit.repeat("""d={}
for i in range(1000):
d[i]=0""", number=1000000)
#[77.77789647192793, 77.88324065372811, 77.7300221235187]
timeit.repeat("""x={};x.fromkeys(range(1000),0)""", number=1000000)
#[53.62738158027423, 53.87422525293914, 53.50821399216625]
As you can see, there is a HUGE difference between the first two methods, and the third one is better but still not as fast as the list! The reason is that creating a list of a specified size is much faster than creating a dictionary and expanding it over iteration.
I think in this situation you should just use a list, unless you want to access some data without using an index.
A Python list is an array. It starts with a specific size; when it needs to store more items than its current capacity can hold, it copies everything to a new, larger array, and the copying is O(k), where k is the current size of the list. This process can happen many times before the list reaches a size bigger than or equal to n. However, [0]*n creates the array at the right size (which is n) in one step, so it's faster than growing the list to the right size from the beginning.
As for creating it with a list comprehension, if you mean something like [0 for i in range(n)], I think it suffers from the same resizing overhead and so it is slower.
A Python dictionary is an implementation of a hash table; it uses a hash function to calculate the hash value of the key when you insert a new key-value pair. Executing the hash function is itself comparatively expensive, and the dictionary also has to deal with situations like collisions, which makes it even slower. Thus, creating the zeroes with a dictionary should be the slowest, in theory.
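If you want to observe the list resizing yourself, here is a small sketch; it relies on CPython's over-allocation strategy (an implementation detail), so the exact sizes will vary between versions:
import sys

l = []
last = sys.getsizeof(l)
for i in range(32):
    l.append(0)
    size = sys.getsizeof(l)
    if size != last:
        # the allocated size only jumps occasionally, so appends are amortised O(1)
        print("len =", len(l), "allocated =", size, "bytes")
        last = size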
Is there an operator to remove elements from a List based on the content of a Set?
What I want to do is already possible by doing this:
words = ["hello", "you", "how", "are", "you", "today", "hello"]
my_set = {"you", "are"}
new_list = [w for w in words if w not in my_set]
# ["hello", "how", "today", "hello"]
What bothers me with this list comprehension is that for huge collections, it looks less effective to me than the - operator that can be used between two sets. Because in the list comprehension, the iteration happens in Python, whereas with the operator, the iteration happens in C and is more low-level, hence faster.
So is there some way of computing a difference between a List and a Set in a shorter/cleaner/more efficient way than using a list comprehension, like for example:
# I know this is not possible, but does something approaching exist?
new_list = words - my_set
TL;DR
I'm looking for a way to remove all elements present in a Set from a List, that is either:
cleaner (with a built-in perhaps)
and/or more efficient
than what I know can be done with list comprehensions.
Unfortunately, the only answer for this is: No, there is no built-in way, implemented in native code, for this kind of operation.
What bothers me with this list comprehension is that for huge collections, it looks less effective to me than the - operator that can be used between two sets.
I think what’s important here is the “looks” part. Yes, list comprehensions run more within Python than a set difference, but I assume that most of your application actually runs within Python (otherwise you should probably be programming in C instead). So you should consider whether it really matters much. Iterating over a list is fast in Python, and a membership test on a set is also super fast (constant time, and implemented in native code). And if you look at list comprehensions, they are also very fast. So it probably won’t matter much.
Because in the list comprehension, the iteration happens in Python, whereas with the operator, the iteration happens in C and is more low-level, hence faster.
It is true that native operations are faster, but they are also more specialized, limited and allow for less flexibility. For sets, a difference is pretty easy. The set difference is a mathematical concept and is very clearly defined.
But when talking about a “list difference” or a “list and set difference” (or a more generalized “list and iterable difference”?) it becomes a lot more unclear. There are a lot of open questions:
How are duplicates handled? If there are two X in the original list and only one X in the subtrahend, should both X disappear from the list? Should only one disappear? If so, which one, and why?
How is order handled? Should the order be kept as in the original list? Does the order of the elements in the subtrahend have any impact?
What if we want to subtract members based on some other condition than equality? For sets, it’s clear that they always work on the equality (and hash value) of the members. Lists don’t, so lists are by design a lot more flexible. With list comprehensions, we can easily have any kind of condition to remove elements from a list; with a “list difference” we would be restricted to equality, and that might actually be a rare situation if you think about it.
It’s maybe more likely to use a set if you need to calculate differences (or even some ordered set). And for filtering lists, it might also be a rare case that you want to end up with a filtered list, so it might be more common to use a generator expression (or the Python 3 filter() function) and work with that later without having to create that filtered list in memory.
What I’m trying to say is that the use case for a list difference is not as clear as for a set difference. And if there is a use case, it is probably a rare one. In general, I don’t think it’s worth adding complexity to the Python implementation for this, especially when the in-Python alternative, e.g. a list comprehension, is already as fast as it is.
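For completeness, a small sketch of the lazy alternative mentioned above, a generator expression, using the question's data:
words = ["hello", "you", "how", "are", "you", "today", "hello"]
my_set = {"you", "are"}

lazy = (w for w in words if w not in my_set)   # nothing is built yet
print(list(lazy))                              # ['hello', 'how', 'today', 'hello']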
First things first, are you prematurely worrying about an optimisation problem that isn't really an issue? I have to have lists with at least 10,000,000 elements before this operation even gets into the range of taking tenths of a second.
If you're working with large data sets then you may find it advantageous to move to using numpy.
import random
import timeit
r = range(10000000)
setup = """
import numpy as np
l = list({!r})
s = set(l)
to_remove = {!r}
n = np.array(l)
n_remove = np.array(list(to_remove))
""".format(r, set(random.sample(r, 3)))
list_filter = "[x for x in l if x not in to_remove]"
set_filter = "s - to_remove"
np_filter = "n[np.in1d(n, n_remove, invert=True)]"
n = 1
l_time = timeit.timeit(list_filter, setup, number=n)
print("lists:", l_time)
s_time = timeit.timeit(set_filter, setup, number=n)
print("sets:", s_time)
n_time = timeit.timeit(np_filter, setup, number=n)
print("numpy:", n_time)
returns the following results -- with numpy an order of magnitude faster than using sets.
lists: 0.8743789765043315
sets: 0.20703006886620656
numpy: 0.06197169088128707
I agree with poke. Here is my reasoning:
The easiest way to do it would be using a filter:
words = ["hello", "you", "how", "are", "you", "today", "hello"]
my_set = {"you", "are"}
new_list = filter(lambda w: w not in my_set, words)
And using Dunes' solution, I get these times:
lists: 0.87401028
sets: 0.55103887
numpy: 0.16134396
filter: 0.00000886 WOW, beats numpy by several orders of magnitude!!!
But wait, we are making a flawed comparison, because we are comparing the cost of eagerly building an actual list (the comprehension and the set difference) against results that are not built as lists at all (the numpy array and the lazy filter iterator).
If I run Dunes' solution but produce the actual lists, I get:
lists: 0.86804159
sets: 0.56945663
numpy: 1.19315723
filter: 1.68792561
Now numpy is slightly more efficient than using a simple filter, but neither is better than the list comprehension, which was the first and most intuitive solution.
I would definitely use a filter over the comprehension, except if I need to use the filtered list more than once (although I could tee it).
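For what it's worth, a small sketch of that tee() idea (note that tee() buffers items internally, so for two full passes building a list is usually just as good):
from itertools import tee

words = ["hello", "you", "how", "are", "you", "today", "hello"]
my_set = {"you", "are"}

first_pass, second_pass = tee(w for w in words if w not in my_set)
print(list(first_pass))    # ['hello', 'how', 'today', 'hello']
print(list(second_pass))   # same elements again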
Suppose I have a set of words. For any given word, I would like to find if it is already in the set. What is some efficient data structure and/or algorithm for implementing that?
For example, is the following way using a hash table a good way?
first store the set of words, by using some hash function and a hash table.
given a query word, calculate its hash value and see if it is in the hash table.
In Python, is there already some data structure and/or algorithm which can be used to implement the way you recommend?
Thanks!
Python has sets. For example:
>>> foo = set()
>>> foo.add('word')
>>> 'word' in foo
True
>>> 'bar' in foo
False
mywords = set(["this", "is", "a", "test"])
"test" in mywords # => True
"snorkle" in mywords # => False
Yes, Python has a native dictionary data structure that is implemented using a hash table, and so the in operator executes in O(1) average time on dictionaries. Per Allen Downey in Think Python:
The in operator uses different algorithms for lists and dictionaries. For lists, it uses a search algorithm, as in Section 8.6. As the list gets longer, the search time gets longer in direct proportion. For dictionaries, Python uses an algorithm called a hashtable that has a remarkable property: the in operator takes about the same amount of time no matter how many items there are in a dictionary.
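A quick illustration of that (arbitrary values):
d = {"test": 1, "hello": 2}
print("test" in d)       # True: "in" checks the keys via the hash table
print("nope" in d)       # False
print(1 in d.values())   # True, but this is a linear scan over the values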
Alternatively, if you're building a large set of words over time and the words aren't too long, consider using a trie.
http://en.wikipedia.org/wiki/Trie
https://pypi.python.org/pypi/PyTrie
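In case a concrete example helps, here is a minimal hand-rolled trie sketch using nested dicts (this is not the PyTrie API; the class and method names are illustrative only):
class Trie:
    _END = object()   # sentinel marking the end of a complete word

    def __init__(self):
        self.root = {}

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[Trie._END] = True

    def __contains__(self, word):
        node = self.root
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return Trie._END in node

words = Trie()
words.add("test")
print("test" in words)   # True
print("te" in words)     # False: only a prefix, not a stored word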