I have a problem trying to check if an element is part of a set in Python. (My set contains about 600K tuples of string.)
I'm searching for a solution that uses the benefit of the in operator to check whether a value is an element of one of the tuples in the set.
I've found a solution like:
# S set of tuples, I'm checking if v is the second element of a tuple
any( y == v for (_, y) in S )
but this has a O(n) complexity.
The Python documentation says that the average complexity of the in operator is O(1).
EDIT
My problem is: how can I check whether an element is the first/second/... element of at least one tuple in the set, with the speed of the in operator?
The complexity of a containment test depends on the object type, not the operator, because the operation is delegated to the container. Testing containment in a list is O(n), containment in a set is O(1).
However, you are not testing containment in a set, you are testing containment in a pile of tuples (where the container for the tuples can't help). Without further processing, you can't do better than O(n) here.
You could create and maintain separate data structures, for example, where you track the individual values contained in your tuples as well as the tuples themselves, then test against those separate data structures. That'd increase the memory requirements, but lower the computational cost.
You'd amortise the cost of keeping that structure up-to-date over the lifetime of your program (only increasing the constant cost of building the data structure slightly), and in return you get O(1) operations on your containment test. Only do this if you need to do this test multiple times, for different values.
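For example, here is a minimal sketch of that idea, keeping a counter of how many tuples currently use each second element (the names S, add_pair, remove_pair and second_counts are illustrative, not from the question):

from collections import Counter

S = set()                  # the tuples themselves
second_counts = Counter()  # how many tuples currently use each second element

def add_pair(x, y):
    if (x, y) not in S:
        S.add((x, y))
        second_counts[y] += 1

def remove_pair(x, y):
    if (x, y) in S:
        S.remove((x, y))
        second_counts[y] -= 1
        if not second_counts[y]:
            del second_counts[y]

add_pair(1, 'luca')
v = 'luca'
print(v in second_counts)   # True, average O(1)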
average complexity of the IN operator is O(1)
That's correct for membership checks in sets, or in any container that uses a hash table to store its items, like a dictionary.
And it's a completely different operation from the following in:
for (_, y) in S
That in is just a part of the for loop syntax.
Also, if you want to get the tuples that contain a particular string, you could use a list comprehension rather than any:
[item for item in S if my_str in item]
If you want to take advantage of the set's membership checking, you would need sets instead of tuples; but since sets are not hashable, you can't put them inside another set. In that case, you can use frozenset() instead.
And if you just want to check the existence of an item that meets a certain criterion, you can use the following generator expression within any:
any(my_str in item for item in S)
After all, since your set could just as well be a dictionary, you can create a dictionary instead of a set and then check membership with my_str in my_dict. Your dictionary would look something like: {'luca': 1, 'mario': 2, 'franco': 3}
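For instance, a small sketch of that approach, assuming the tuples look like (1, 'luca') with the string in second position, as in the question:

S = {(1, 'luca'), (2, 'mario'), (3, 'franco')}   # stand-in for the real data
my_dict = {name: num for (num, name) in S}

my_str = 'luca'
print(my_str in my_dict)   # True, average O(1)
print(my_dict[my_str])     # 1 -- the associated first element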
Answering the question as posed (note: this is not how you usually want to solve it, because it's guaranteed O(n) behavior; the in operator does not have guaranteed O(1) complexity, and in this case it never does).
You can use the in operator by mapping away the extraneous values from each tuple. Done with C-level built-ins, this will run faster than your any expression for large enough inputs, but the difference is small (maybe a 10% speed up for sufficiently large inputs where the value isn't there):
# At top of file
from future_builtins import map # Only on Py2, to get lazy map
from operator import itemgetter
v in map(itemgetter(1), S)
This works because the in operator is implemented for arbitrary iterators as a lazy check similar to any, pulling a value at a time, comparing to v, and short-circuiting out if it finds a hit.
Like I said, this is O(n); in the real world, if you might do this test more than once, you'd probably want to make a set of the target only and reuse it, or a dict mapping the target to the associated values if needed, to get O(1) checks. The other answers already cover this adequately.
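For example, a one-time preprocessing sketch reusing itemgetter, only worthwhile if the set of tuples doesn't change between queries (the example data is a stand-in):

from operator import itemgetter

S = {(1, 'luca'), (2, 'mario'), (3, 'franco')}   # stand-in for the real 600K tuples
v = 'mario'

seconds = set(map(itemgetter(1), S))   # built once: O(n)
print(v in seconds)                    # every later check: O(1) on average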
Related
I am writing a Python program to remove duplicates from a list. My code is the following:
some_values_list = [2,2,4,7,7,8]
unique_values_list = []
for i in some_values_list:
    if i not in unique_values_list:
        unique_values_list.append(i)
print(unique_values_list)
This code works fine. However, an alternative solution is given and I am trying to interpret it (as I am still a beginner in Python). Specifically, I do not understand the added value or benefit of creating an empty set - how does that make the code clearer or more efficient? Isn't it enough to create an empty list as I have done in the first example?
The code for the alternative solution is the following:
a = [10,20,30,20,10,50,60,40,80,50,40]
dup_items = set()
uniq_items = []
for x in a:
    if x not in dup_items:
        uniq_items.append(x)
        dup_items.add(x)
print(dup_items)
This code also throws an error: TypeError: set() missing 1 required positional argument: 'items' (it is from a website of Python exercises with an answer key, so it is supposed to be correct).
Determining if an item is present in a set is generally faster than determining if it is present in a list of the same size. Why? Because for a set (at least, for a hash table, which is how CPython sets are implemented) we don't need to traverse the entire collection of elements to check if a particular value is present (whereas we do for a list). Rather, we usually just need to check at most one element. A more precise way to frame this is to say that containment tests for lists take "linear time" (i.e. time proportional to the size of the list), whereas containment tests in sets take "constant time" (i.e. the runtime does not depend on the size of the set).
Looking up an element in a list takes O(N) time (you could find an element in logarithmic time, but only if the list were sorted, which is not your case). So if you use the same list both to keep the unique elements and to look up newly added ones, your whole algorithm runs in O(N²) time (N elements, O(N) average lookup each). A set is a hash set in Python, so lookups in it take O(1) on average. Thus, if you use an auxiliary set to keep track of unique elements already found, your whole algorithm will take only O(N) time on average, one order better.
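If you want to see the gap for yourself, a rough timing sketch along these lines (illustrative sizes and names, assuming CPython) makes it obvious:

import timeit

setup = "data = list(range(100000)); s = set(data)"
# Worst case for the list: the element sits at the end, so the scan is full length.
print(timeit.timeit("99999 in data", setup, number=1000))   # linear scan
print(timeit.timeit("99999 in s", setup, number=1000))      # hash lookup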
In most cases sets are faster than lists. One of those cases is when you look for an item using the in keyword. The reason sets are faster is that they are implemented with a hash table.
So, in short, if x not in dup_items in the second code snippet runs faster than if i not in unique_values_list.
If you want to check the time complexity of different Python data structures and operations, you can check this link.
I also think your code is inefficient in that, for each item, you search the list of unique values found so far, whereas the second snippet looks the item up in a set, which tends to be smaller. That is not always the case, though: if the list is all unique items, the two containers are the same size.
Hope that clarifies it.
I was using a dictionary as a lookup table but I started to wonder if a list would be better for my application -- the number of entries in my lookup table wasn't that big. I know lists use C arrays under the hood which made me conclude that lookup in a list with just a few items would be better than in a dictionary (accessing a few elements in an array is faster than computing a hash).
I decided to profile the alternatives but the results surprised me. List lookup was only better with a single element! See the following figure (log-log plot):
So here comes the question: Why do list lookups perform so poorly? What am I missing?
On a side question, something else that called my attention was a little "discontinuity" in the dict lookup time after approximately 1000 entries. I plotted the dict lookup time alone to show it.
p.s.1 I know about O(n) vs O(1) amortized time for arrays and hash tables, but it is usually the case that for a small number of elements iterating over an array is better than to use a hash table.
p.s.2 Here is the code I used to compare the dict and list lookup times:
import timeit
lengths = [2 ** i for i in xrange(15)]
list_time = []
dict_time = []
for l in lengths:
    list_time.append(timeit.timeit('%i in d' % (l/2), 'd=range(%i)' % l))
    dict_time.append(timeit.timeit('%i in d' % (l/2),
                                   'd=dict.fromkeys(range(%i))' % l))
    print l, list_time[-1], dict_time[-1]
p.s.3 Using Python 2.7.13
I know lists use C arrays under the hood which made me conclude that lookup in a list with just a few items would be better than in a dictionary (accessing a few elements in an array is faster than computing a hash).
Accessing a few array elements is cheap, sure, but computing == is surprisingly heavyweight in Python. See that spike in your second graph? That's the cost of computing == for two ints right there.
Your list lookups need to compute == a lot more than your dict lookups do.
Meanwhile, computing hashes might be a pretty heavyweight operation for a lot of objects, but for all ints involved here, they just hash to themselves. (-1 would hash to -2, and large integers (technically longs) would hash to smaller integers, but that doesn't apply here.)
Dict lookup isn't really that bad in Python, especially when your keys are just a consecutive range of ints. All ints here hash to themselves, and Python uses a custom open addressing scheme instead of chaining, so all your keys end up nearly as contiguous in memory as if you'd used a list (which is to say, the pointers to the keys end up in a contiguous range of PyDictEntrys). The lookup procedure is fast, and in your test cases, it always hits the right key on the first probe.
Okay, back to the spike in graph 2. The spike in the lookup times at 1024 entries in the second graph is because for all smaller sizes, the integers you were looking for were all <= 256, so they all fell within the range of CPython's small integer cache. The reference implementation of Python keeps canonical integer objects for all integers from -5 to 256, inclusive. For these integers, Python was able to use a quick pointer comparison to avoid going through the (surprisingly heavyweight) process of computing ==. For larger integers, the argument to in was no longer the same object as the matching integer in the dict, and Python had to go through the whole == process.
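You can observe the cache boundary directly; note this is a CPython implementation detail, not something the language guarantees, and the variable names are just for illustration:

# Ints from -5 to 256 are cached by CPython, so equal values are the same
# object and the membership test can succeed on the identity check alone.
a, b = int('256'), int('256')
print(a is b)   # True  -- both names point at the one cached 256 object
c, d = int('257'), int('257')
print(c is d)   # False -- two distinct objects, so == has to run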
The short answer is that lists use linear search and dicts use amortized O(1) search.
In addition, dict searches can skip an equality test either when 1) hash values don't match or 2) when there is an identity match. Lists only benefit from the identity-implies equality optimization.
Back in 2008, I gave a talk on this subject where you'll find all the details: https://www.youtube.com/watch?v=hYUsssClE94
Roughly the logic for searching lists is:
for element in s:
    if element is target:
        # fast check for identity implies equality
        return True
    if element == target:
        # slower check for actual equality
        return True
return False
For dicts the logic is roughly:
h = hash(target)
for i in probe_sequence(h, len(table)):
    element = key_table[i]
    if element is UNUSED:
        raise KeyError(target)
    if element is target:
        # fast path for identity implies equality
        return value_table[i]
    if h != h_table[i]:
        # unequal hashes implies unequal keys
        continue
    if element == target:
        # slower check for actual equality
        return value_table[i]
Dictionary hash tables are typically between one-third and two-thirds full, so they tend to have few collisions (few trips around the loop shown above) regardless of size. Also, the hash value check prevents needless slow equality checks (the chance of a wasted equality check is about 1 in 2**64).
If your timing focuses on integers, there are some other effects at play as well. The hash of an int is the int itself, so hashing is very fast. It also means that if you're storing consecutive integers, there tend to be no collisions at all.
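A quick illustration of both points (again, CPython-specific behaviour):

print([hash(n) for n in range(5)])   # [0, 1, 2, 3, 4]: ints hash to themselves,
                                     # so consecutive keys land in distinct slots
print(hash(-1))                      # -2, the one exception: -1 is reserved as an
                                     # error code in CPython's C-level hash API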
You say "accessing a few elements in an array is faster than computing a hash".
A simple hashing rule for strings might be just a sum of the character codes (with a modulo at the end). That is a branch-free pass over the string, and it can compare favorably with character-by-character comparison, especially when two strings share a long common prefix.
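A toy sketch of such a rule (this is not Python's actual string hash, just an illustration of the idea):

def toy_hash(s, table_size=1024):
    # One pass over the string, independent of whatever it is compared against.
    return sum(ord(ch) for ch in s) % table_size

# Equality, by contrast, may walk a long shared prefix before finding a difference.
print(toy_hash("foobar"), toy_hash("foobaz"))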
I have a dict with 50,000,000 keys (strings) mapped to a count of that key (which is a subset of one with billions).
I also have a series of objects with a class set member containing a few thousand strings that may or may not be in the dict keys.
I need the fastest way to find the intersection of each of these sets.
Right now, I do it like this code snippet below:
for block in self.blocks:
    # a block is a python object containing the set in the thousands range
    # block.get_kmers() returns the set
    count = sum([kmerCounts[x] for x in block.get_kmers().intersection(kmerCounts)])
    # kmerCounts is the dict mapping millions of strings to ints
From my tests so far, this takes about 15 seconds per iteration. Since I have around 20,000 of these blocks, I am looking at half a week just to do this. And that is for the 50,000,000 items, not the billions I need to handle...
(And yes I should probably do this in another language, but I also need it done fast and I am not very good at non-python languages).
There's no need to do a full intersection; you just want the matching elements from the big dictionary, if they exist. If an element doesn't exist, you can substitute 0 and it will have no effect on the sum. There's also no need to convert the input of sum to a list.
count = sum(kmerCounts.get(x, 0) for x in block.get_kmers())
Remove the square brackets around your list comprehension to turn it into a generator expression:
sum(kmerCounts[x] for x in block.get_kmers().intersection(kmerCounts))
That will save you some time and some memory, which may in turn reduce swapping, if you're experiencing that.
There is a lower bound to how much you can optimize here. Switching to another language may ultimately be your only option.
I've got a list of dictionaries, and I'm looking for a unique list of values for one of the keys.
This is what I came up with, but I can't help but wonder if it's efficient, time- and/or memory-wise:
list(set([d['key'] for d in my_list]))
Is there a better way?
This:
list(set([d['key'] for d in my_list]))
… constructs a list of all values, then constructs a set of just the unique values, then constructs a list out of the set.
Let's say you had 10000 items, of which 1000 are unique. You've reduced final storage from 10000 items to 1000, which is great—but you've increased peak storage from 10000 to 11000 (because there clearly has to be a time when the entire list and almost the entire set are both in memory simultaneously).
There are two very simple ways to avoid this.
First (as long as you've got Python 2.4 or later) use a generator expression instead of a list comprehension. In most cases, including this one, that's just a matter of removing the square brackets or turning them into parentheses:
list(set(d['key'] for d in my_list))
Or, even more simply (with Python 2.7 or later), just construct the set directly by using a set comprehension instead of a list comprehension:
list({d['key'] for d in my_list})
If you're stuck with Python 2.3 or earlier, you'll have to write an explicit loop. And with 2.2 or earlier, there are no sets, so you'll have to fake it with a dict mapping each key to None or similar.
Beyond space, what about time? Well, clearly you have to traverse the entire list of 10000 dictionaries, and do an O(1) dict lookup for each one.
The original version does a list.append (actually a slightly faster internal equivalent) for each of those steps, and then the set conversion is a traversal of a list of the same size with a set.add for each one, and then the list conversion is a traversal of a smaller set with a list.append for each one. So, it's O(N), which is clearly optimal algorithmically, and only worse by a smallish multiplier than just iterating the list and doing nothing.
The set version skips over the list.appends, and only iterates once instead of twice. So, it's also O(N), but with an even smaller multiplier. And the savings in memory management (if N is big enough to matter) may help as well.
Here's an example of what I want to do:
spam_list = ["We", "are", "the", "knights", "who", "say", "Ni"]
spam_order = [0,1,2,4,5,6,3]
spam_list.magical_sort(spam_order)
print(spam_list)
["We", "are", "the", "who", "say", "Ni", "knights"]
I can do it with enumerate, list, and so on, but I would like to modify spam_list directly in place, like list.sort() does, rather than making a copy as sorted() does.
Edit: pushed a string example to avoid confusion between the indices and the values of spam_list.
Edit: it turned out this is a duplicate of Python sort parallel arrays in place?. Well, I can't delete this much effort just for the sake of SO consistency.
You could try:
spam_list = [spam_list[i] for i in spam_order]
You can give a special key to the sort function:
order = dict(zip(spam_list, spam_order))
spam_list.sort(key=order.get)
Edit: As #ninjagecko points out in his answer, this is not really efficient, as it copies both lists to create the dictionary for the lookup. However, with the modified example given by the OP, this is the only way, because one has to build some index. The upside is that, at least for the strings, the values will not be copied, so the overhead is just that of the dictionary itself.
but I would like to directly affect spam_list, like list.sort() and not copy it like sorted()
There is ONLY ONE SOLUTION that does exactly what you ask. Every single other solution is implicitly making a copy of one or both lists (or turning it into a dict, etc.). What you are asking for is a method which sorts two lists in-place, using O(1) extra space, using one list as the keys of the other. I personally would just accept the extra space complexity, but if you really wanted to, you could do this:
(edit: it may be the case that the original poster doesn't really care about .sort because it's efficient, but rather because it modifies state; in general this is a dangerous thing to want and non-low-level languages attempt to avoid this and even ban it, but the solutions which use slice assignment will achieve "in-place" semantics)
Create a custom dictionary subclass (effectively a Zip class) which is backed by both lists you are sorting.
Indexing myZip[i] -> results in the tuple (list1[i],list2[i])
Assignment myZip[i]=(x1,x2) -> dispatches into list1[i]=x1, list2[i]=x2.
Use that to do myZip(spam_list,spam_order).sort(), and now both spam_list and spam_order are sorted in-place
Example:
#!/usr/bin/python3
class LiveZip(list):
    def __init__(self, list1, list2):
        self.list1 = list1
        self.list2 = list2

    def __len__(self):
        return len(self.list1)

    def __getitem__(self, i):
        return (self.list1[i], self.list2[i])

    def __setitem__(self, i, tuple):
        x1, x2 = tuple
        self.list1[i] = x1
        self.list2[i] = x2

spam_list = ["We", "are", "the", "knights", "who", "say", "Ni"]
spam_order = [0,1,2,4,5,6,3]

#spam_list.magical_sort(spam_order)
proxy = LiveZip(spam_order, spam_list)
Now let's see if it works...
#proxy.sort()
#fail --> oops, the internal implementation is not meant to be subclassed! lame

# It turns out that the python [].sort method does NOT work without passing in
# a list to the constructor (i.e. the internal implementation does not use the
# public interface), so you HAVE to implement your own sort if you want to not
# use any extra space. This is kind of dumb. But the approach above means you
# can just use any standard textbook in-place sorting algorithm:

def myInPlaceSort(x):
    # insertion sort, chosen only because it is short; any in-place textbook
    # algorithm works, since it only needs __len__, __getitem__ and __setitem__
    for i in range(1, len(x)):
        j = i
        while j > 0 and x[j - 1] > x[j]:
            x[j - 1], x[j] = x[j], x[j - 1]
            j -= 1
NOW it works:
myInPlaceSort(proxy)
print(spam_list)
Unfortunately there is no way to just sort one list in O(1) space without sorting the other; if you don't want to sort both lists, you might as well do your original approach which constructs a dummy list.
You can however do the following:
spam_list.sort(key=lambda x:x)
but if the key or cmp functions make any references to any collection (e.g. if you pass in a dict.__getitem__ of a dict you had to construct), this is no better than your original O(N)-space approach, unless you already happened to have such a dictionary lying around.
Turns out this is a duplicate question of Python sort parallel arrays in place?, but that question also had no correct answers except this one, which is equivalent to mine but without the sample code. Unless you are writing incredibly optimized or specialized code, I'd just use your original solution, which is equivalent in space complexity to the other solutions.
edit2:
As senderle pointed out, the OP doesn't want a sort at all, but rather wishes, I think, to apply a permutation. To achieve this, you can and SHOULD simply use the indexing that other answers suggest, [spam_list[i] for i in spam_order], but an explicit or implicit copy must still be made because you still need the intermediate data. (Unrelated and for the record: applying the inverse permutation is, I think, the inverse of parallel sorting with the identity, and you can use one to get the other, though sorting is less time-efficient. _, spam_order_inverse = parallelSort(spam_order, range(N)), then sort by spam_order_inverse. I leave the above discussion about sorting up for the record.)
edit3:
It is possible, however, to achieve an in-place permutation in O(#cycles) space but with terrible time efficiency. Every permutation can be decomposed into disjoint permutations applied in parallel on subsets. These subsets are called cycles or orbits. The period is equal to their size. You thus take a leap of faith and do as follows:
Create a temp variable.
For index i = 0...N:
    Put x_i into temp, assign NULL to x_i
    Swap temp with x_p(i)
    Swap temp with x_p(p(i))
    ...
    Swap temp with x_p(..p(i)..), which is x_i
    Put a "do not repeat" marker on the smallest element you visited larger than i
Whenever you encounter a "do not repeat" marker, perform the loop again but without swapping, moving the marker to the smallest element larger than i.
To avoid having to perform the loop again, use a bloom filter.
This will run in O(N^2) time and O(#cycles) space without a bloom filter, or ~O(N) time and O(#cycles + bloomfilter_space) space if you use one.
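For the curious, here is a minimal sketch of the cycle-walking idea for the pull-style permutation the OP wants (values[i] ends up holding the old values[perm[i]]); the function name is illustrative, and it skips the marker/bloom-filter bookkeeping by restarting each cycle only from its smallest index, which keeps O(1) extra space at the cost of O(N^2) worst-case time:

def apply_permutation_in_place(values, perm):
    # Afterwards values[i] holds the old values[perm[i]], matching the question.
    n = len(values)
    for start in range(n):
        # First walk the cycle to see whether `start` is its smallest index;
        # if not, the cycle is handled when the loop reaches that smaller index.
        j = perm[start]
        while j != start and j > start:
            j = perm[j]
        if j != start:
            continue
        # Walk the cycle again, pulling each element into place.
        tmp = values[start]
        i = start
        while perm[i] != start:
            values[i] = values[perm[i]]
            i = perm[i]
        values[i] = tmp

spam_list = ["We", "are", "the", "knights", "who", "say", "Ni"]
spam_order = [0, 1, 2, 4, 5, 6, 3]
apply_permutation_in_place(spam_list, spam_order)
print(spam_list)   # ['We', 'are', 'the', 'who', 'say', 'Ni', 'knights']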
If the issue is specifically in-placeness and not memory usage per se -- if you want this to have side effects, in other words -- then you could use slice assignment. Stealing from Peter Collingridge:
other_spam_list = spam_list
spam_list[:] = [spam_list[i] for i in spam_order]
assert other_spam_list == spam_list
It seems you might even be able to do this with a generator expression! But I suspect this still implicitly creates a new sequence of some sort -- probably a tuple. If it didn't, I think it would exhibit wrong behavior; but I tested it, and its behavior seemed correct.
spam_list[:] = (spam_list[i] for i in spam_order)
Aha! See this excellent answer by the inimitable Sven Marnach -- generator slice assignment does indeed generate an implicit tuple. Which means it's safe, but not as memory efficient as you might think. Still, tuples are more memory efficient than lists, so the generator expression is preferable from that perspective.
map(lambda x:spam_list[x], spam_order)
If you actually don't care about the efficiency at all, and just want in-place semantics (which is a bit odd, because there are entire programming languages dedicated to avoiding in-place semantics), then you can do this:
def modifyList(toModify, newList):
    toModify[:] = newList

def permuteAndUpdate(toPermute, permutation):
    modifyList(toPermute, [toPermute[i] for i in permutation])

permuteAndUpdate(spam_list, spam_order)
print(spam_list)
# ['We', 'are', 'the', 'who', 'say', 'Ni', 'knights']
Credit goes to senderle for recognizing that this is what the OP may actually be after; he should feel free to copy this answer into his own. Should not accept this answer unless you really prefer it over his.
You may use numpy.
import numpy as np
spam_list = list(np.array(spam_list)[spam_order])