A data structure for 1:1 mappings in Python?

I have a problem which requires a reversible 1:1 mapping of keys to values.
That means sometimes I want to find the value given a key, but at other times I want to find the key given the value. Both keys and values are guaranteed unique.
x = D[y]
y == D.inverse[x]
The obvious solution is to simply invert the dictionary every time I want a reverse lookup. Inverting a dictionary is easy (there's a recipe here), but for a large dictionary it can be very slow.
The other alternative is to make a new class which unites two dictionaries, one for each kind of lookup. That would most likely be fast but would use up twice as much memory as a single dict.
So is there a better structure I can use?
My application requires that this be very fast and use as little memory as possible.
The structure must be mutable, and it's strongly desirable that mutating the object should not make it slower (e.g. by forcing a complete re-index).
We can guarantee that either the key or the value (or both) will be an integer.
It's likely that the structure will need to store thousands or possibly millions of items.
Keys & values are guaranteed to be unique, i.e. len(set(x)) == len(x) for x in [D.keys(), D.values()]

The other alternative is to make a new class which unites two dictionaries, one for each kind of lookup. That would most likely be fast but would use up twice as much memory as a single dict.
Not really. Have you measured that? Since both dictionaries would use references to the same objects as keys and values, the memory spent would be just the second dictionary's structure (a hash table of references). That's a lot less than twice as much, and it doesn't depend on how big the stored objects are.
What I mean is that the actual data wouldn't be copied. So you'd spend little extra memory.
Example:
a = "some really really big text spending a lot of memory"
number_to_text = {1: a}
text_to_number = {a: 1}
Only a single copy of the "really big" string exists, so you end up spending just a little more memory. That's generally affordable.
I can't imagine a solution where you'd have the key lookup speed when looking by value, if you don't spend at least enough memory to store a reverse lookup hash table (which is exactly what's being done in your "unite two dicts" solution).
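For illustration, here is a minimal sketch of that two-dict approach (the class and method names are my own, not from the answer or any library); both tables hold references to the same key and value objects, so the extra cost is one more hash table, not a second copy of the data:
class BidirectionalMap:
    def __init__(self):
        self._forward = {}   # key -> value
        self._backward = {}  # value -> key

    def set(self, key, value):
        # Remove any stale pairings first so the mapping stays strictly 1:1.
        if key in self._forward:
            del self._backward[self._forward[key]]
        if value in self._backward:
            del self._forward[self._backward[value]]
        self._forward[key] = value
        self._backward[value] = key

    def by_key(self, key):
        return self._forward[key]

    def by_value(self, value):
        return self._backward[value]

    def delete_key(self, key):
        value = self._forward.pop(key)
        del self._backward[value]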

class TwoWay:
    def __init__(self):
        # Both directions live in the same dict, so keys and values
        # must never collide with each other.
        self.d = {}
    def add(self, k, v):
        self.d[k] = v
        self.d[v] = k
    def remove(self, k):
        # Popping k yields its partner; popping the partner removes the reverse entry.
        self.d.pop(self.d.pop(k))
    def get(self, k):
        return self.d[k]

The other alternative is to make a new class which unites two dictionaries, one for each kind of lookup. That would most likely use up twice as much memory as a single dict.
Not really, since they would just be holding two references to the same data. In my mind, this is not a bad solution.
Have you considered an in-memory database lookup? I am not sure how it will compare in speed, but lookups in relational databases can be very fast.

Here is my own solution to this problem: http://github.com/spenthil/pymathmap/blob/master/pymathmap.py
The goal is to make it as transparent to the user as possible. The only significant attribute it introduces is partner.
OneToOneDict subclasses dict - I know that isn't generally recommended, but I think I have the common use cases covered. The backend is pretty simple: it (dict1) keeps a weakref to a 'partner' OneToOneDict (dict2), which is its inverse. When dict1 is modified, dict2 is updated accordingly, and vice versa.
From the docstring:
>>> dict1 = OneToOneDict()
>>> dict2 = OneToOneDict()
>>> dict1.partner = dict2
>>> assert(dict1 is dict2.partner)
>>> assert(dict2 is dict1.partner)
>>> dict1['one'] = '1'
>>> dict2['2'] = '1'
>>> dict1['one'] = 'wow'
>>> assert(dict1 == dict((v,k) for k,v in dict2.items()))
>>> dict1['one'] = '1'
>>> assert(dict1 == dict((v,k) for k,v in dict2.items()))
>>> dict1.update({'three': '3', 'four': '4'})
>>> assert(dict1 == dict((v,k) for k,v in dict2.items()))
>>> dict3 = OneToOneDict({'4':'four'})
>>> assert(dict3.partner is None)
>>> assert(dict3 == {'4':'four'})
>>> dict1.partner = dict3
>>> assert(dict1.partner is not dict2)
>>> assert(dict2.partner is None)
>>> assert(dict1.partner is dict3)
>>> assert(dict3.partner is dict1)
>>> dict1.setdefault('five', '5')
>>> dict1['five']
'5'
>>> dict1.setdefault('five', '0')
>>> dict1['five']
'5'
When I get some free time, I intend to make a version that doesn't store things twice. No clue when that'll be though :)

Assuming that you have a key with which you look up a more complex mutable object, just make the key a property of that object. It does seem you might be better off thinking about the data model a bit.

"We can guarantee that either the key or the value (or both) will be an integer"
That's weirdly written -- "key or the value (or both)" doesn't feel right. Either they're all integers, or they're not all integers.
It sounds like they're all integers.
Or, it sounds like you're thinking of replacing the target object with an integer value so you only have one copy referenced by an integer. This is a false economy. Just keep the target object. All Python objects are -- in effect -- references. Very little actual copying gets done.
Let's pretend that you simply have two integers and can do a lookup on either one of the pair. One way to do this is to use heap queues or the bisect module to maintain ordered lists of integer key-value tuples.
See http://docs.python.org/library/heapq.html#module-heapq
See http://docs.python.org/library/bisect.html#module-bisect
You have one heapq of (key, value) tuples. Or, if your underlying object is more complex, of (key, object) tuples.
You have another heapq of (value, key) tuples. Or, if your underlying object is more complex, of (otherkey, object) tuples.
An "insert" becomes two inserts, one to each heapq-structured list.
A key lookup is in one queue; a value lookup is in the other queue. Do the lookups using bisect(list,item).
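One possible reading of this answer, sketched with bisect (the class and method names here are illustrative, not from the answer): keep two sorted lists and use bisect for O(log n) lookups, at the cost of O(n) inserts.
import bisect

class SortedTwoWay:
    def __init__(self):
        self._by_key = []    # sorted list of (key, value) tuples
        self._by_value = []  # sorted list of (value, key) tuples

    def insert(self, key, value):
        # An "insert" becomes two inserts, one into each sorted list.
        bisect.insort(self._by_key, (key, value))
        bisect.insort(self._by_value, (value, key))

    def lookup_by_key(self, key):
        # (key,) sorts just before any (key, value) tuple, so bisect_left
        # lands on the matching entry if it exists.
        i = bisect.bisect_left(self._by_key, (key,))
        if i < len(self._by_key) and self._by_key[i][0] == key:
            return self._by_key[i][1]
        raise KeyError(key)

    def lookup_by_value(self, value):
        i = bisect.bisect_left(self._by_value, (value,))
        if i < len(self._by_value) and self._by_value[i][0] == value:
            return self._by_value[i][1]
        raise KeyError(value)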

It so happens that I find myself asking this question all the time (yesterday in particular). I agree with the approach of making two dictionaries. Do some benchmarking to see how much memory it's taking. I've never needed to make it mutable, but here's how I abstract it, if it's of any use:
class BiDict(list):
    def __init__(self, *pairs):
        super(BiDict, self).__init__()
        self._first_access = {}
        self._second_access = {}
        for pair in pairs:
            self._first_access[pair[0]] = pair[1]
            self._second_access[pair[1]] = pair[0]
            self.append(pair)
    def _get_by_first(self, key):
        return self._first_access[key]
    def _get_by_second(self, key):
        return self._second_access[key]
    # You'll have to do some overrides to make it mutable
    # Methods such as append, __add__, __del__, __iadd__
    # to name a few will have to maintain ._*_access

class Constants(BiDict):
    # An implementation expecting an integer and a string
    get_by_name = BiDict._get_by_second
    get_by_number = BiDict._get_by_first

t = Constants(
    (1, 'foo'),
    (5, 'bar'),
    (8, 'baz'),
)
>>> print t.get_by_number(5)
bar
>>> print t.get_by_name('baz')
8
>>> print t
[(1, 'foo'), (5, 'bar'), (8, 'baz')]

How about using sqlite? Just create a :memory: database with a two-column table. You can even add indexes, then query by either one. Wrap it in a class if it's something you're going to use a lot.
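A quick sketch of that idea (the table and column names are made up for illustration, and this is plain functions rather than the suggested class wrapper); sqlite3 is in the standard library, and a UNIQUE constraint on each column gives indexed lookups in both directions:
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mapping (k INTEGER PRIMARY KEY, v TEXT UNIQUE)")

def put(key, value):
    conn.execute("INSERT OR REPLACE INTO mapping (k, v) VALUES (?, ?)", (key, value))

def get_value(key):
    row = conn.execute("SELECT v FROM mapping WHERE k = ?", (key,)).fetchone()
    return row[0] if row else None

def get_key(value):
    row = conn.execute("SELECT k FROM mapping WHERE v = ?", (value,)).fetchone()
    return row[0] if row else None

put(1, "foo")
assert get_value(1) == "foo"
assert get_key("foo") == 1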

Related

Why does {}.values() == {}.values() return False? [duplicate]

With Python 3:
>>> from collections import OrderedDict
>>> d1 = OrderedDict([('foo', 'bar')])
>>> d2 = OrderedDict([('foo', 'bar')])
I wanted to check for equality:
>>> d1 == d2
True
>>> d1.keys() == d2.keys()
True
But:
>>> d1.values() == d2.values()
False
Do you know why values are not equal?
I've tested this with Python 3.4 and 3.5.
Following this question, I posted on the Python-Ideas mailing list to get more details:
https://mail.python.org/pipermail/python-ideas/2015-December/037472.html
In Python 3, dict.keys() and dict.values() return special view classes - respectively a collections.abc.KeysView and a collections.abc.ValuesView. The first one inherits its __eq__ method from Set; the second uses the default object.__eq__, which tests object identity.
In Python 3, d1.values() and d2.values() are collections.abc.ValuesView objects:
>>> d1.values()
ValuesView(OrderedDict([('foo', 'bar')]))
Don't compare them as objects; convert them to lists and then compare those:
>>> list(d1.values()) == list(d2.values())
True
Investigating why comparing keys works: in CPython's _collections_abc.py, KeysView inherits from Set while ValuesView does not:
class KeysView(MappingView, Set):
class ValuesView(MappingView):
Tracing for __eq__ in ValuesView and its parents:
MappingView ==> Sized ==> ABCMeta ==> type ==> object.
__eq__ is implemented only in object and not overridden.
On the other hand, KeysView inherits __eq__ directly from Set.
Unfortunately, both current answers don't address why this is but focus on how this is done. That mailing list discussion was amazing, so I'll sum things up:
For odict.keys/dict.keys and odict.items/dict.items:
odict.keys (subclass of dict.keys) supports comparison due to its conformance to collections.abc.Set (it's a set-like object). This is possible due to the fact that keys inside a dictionary (ordered or not) are guaranteed to be unique and hashable.
odict.items (subclass of dict.items) also supports comparison, for the same reason .keys does. An items view is allowed to do this since it raises the appropriate error if one of the items (specifically, the second element, representing the value) is not hashable; uniqueness is guaranteed, though (due to keys being unique):
>>> od = OrderedDict({'a': []})
>>> set() & od.items()
TypeErrorTraceback (most recent call last)
<ipython-input-41-a5ec053d0eda> in <module>()
----> 1 set() & od.items()
TypeError: unhashable type: 'list'
For both of these views (keys, items), the comparison uses a simple function called all_contained_in (pretty readable) that uses the objects' __contains__ method to check for membership of the elements in the views involved.
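For instance (a quick interactive illustration of the set-like behaviour, not taken from the answers above), key views support set operations and set-style equality directly:
>>> {'a': 1, 'b': 2}.keys() & {'b': 3, 'c': 4}.keys()
{'b'}
>>> {'a': 1}.keys() == {'a': 2}.keys()
True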
Now, about odict.values/dict.values:
As noticed, odict.values (subclass of dict.values [shocker]) doesn't compare like a set-like object. This is because the values of a valuesview cannot be represented as a set, the reasons are two-fold:
Most importantly, the view might contain duplicates which cannot be dropped.
The view might contain non-hashable objects (which, on its own, isn't sufficient to not treat the view as set-like).
As stated in a comment by @user2357112 and by @abarnett in the mailing list, odict.values/dict.values is a multiset, a generalization of sets that allows multiple instances of its elements.
Trying to compare these is not as trivial as comparing keys or items due to the inherent duplication, the ordering and the fact that you probably need to take into consideration the keys that correspond to those values. Should dict_values that look like this:
>>> {1:1, 2:1, 3:2}.values()
dict_values([1, 1, 2])
>>> {1:1, 2:1, 10:2}.values()
dict_values([1, 1, 2])
actually be equal even though the keys those values correspond to aren't the same? Maybe? Maybe not? It isn't straightforward either way and will lead to inevitable confusion.
The point, though, is that comparing these isn't as trivial as it is for keys and items. To sum up, another comment from @abarnett on the mailing list:
If you're thinking we could define what multisets should do, despite not having a standard multiset type or an ABC for them, and apply that to values views, the next question is how to do that in better than quadratic time for non-hashable values. (And you can't assume ordering here, either.) Would having a values view hang for 30 seconds and then come back with the answer you intuitively wanted instead of giving the wrong answer in 20 millis be an improvement? (Either way, you're going to learn the same lesson: don't compare values views. I'd rather learn that in 20 millis.)

Initialising a large dict with unknown keys? Is there a better way than this?

So I have a list of around 75000 tuples that I want to push into a dict. It seems like after around 20,000 entries, the whole program slows down and I assume this is because the dict is being dynamically resized as it is filled.
The key value used for the dict is in a different position in the tuple depending on the data, so I can't just extract the keys from the list of tuples into list x and invoke d.fromkeys(x) to pre-initialise the large dict. I've tried to put together a solution, but after the dict is evaluated by ast.literal_eval, all I get is a single {'None': 'None'} :/
My solution (which doesn't work):
d_frame = '{'+('\'None\': \'None\',' * 100000)+'}'
d = ast.literal_eval(d_frame)
Is there a builtin method for something like this?
Thanks,
EDIT: I realise the stupidity of my idea.. Obviously you can't have identical keys in a dictionary.... :/
Just to clarify, I have a list of tuples with data like this:
(assembly,strand,start_pos,end_pos,read_count)
key_format : assembly_strand_val ( where val = start_pos or end_pos depending on other factors )
Because I don't know the key until I evaluate each tuple, I can't initialise the dict with known keys, so I was just wondering if I can create an empty dict and then add to it. It doesn't make sense to evaluate each tuple just to build a list, then create a dict, then repeat the tuple evaluation...
EDIT: I realized where the bottleneck was. With each tuple, I was checking to see if the relevant key already existed in the dict, but I was using:
if key not in dict.keys():
    dict[key] = foo
I didn't realise this builds a list of keys every time, and it could be replaced with the far more economical
if key not in dict:
    dict[key] = foo
Changing this resulted in a staggering increase in speed....
So I have a list of around 75000 tuples that I want to push into a dict.
Just call dict on the list. Like this:
>>> list_o_tuples = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
>>> d = dict(list_o_tuples)
>>> d
{1: 'a', 2: 'b', 3: 'c', 4: 'd'}
The key value used for the dict is in a different position in the tuple depending on the data
That isn't demonstrated at all by the code you showed us, but if you can write an expression or a function for pulling out the key, you can use it in a dict comprehension, or in a generator expression you can pass to the dict function.
For example, let's say the first element of the tuple is the index of the actual key element. Then you'd write this:
d = {tup[tup[0]]: tup for tup in list_o_tuples}
It seems like after around 20,000 entries, the whole program slows down and I assume this is because the dict is being dynamically resized as it is filled.
That seems unlikely. Yes, the dict is resized, but it's done so exponentially, and it's still blazingly fast at sizes well beyond 20000. Profile your program to see where it's actually slowing down. My guess would either be that you're doing some quadratic work to create or pull out the values, or you're generating huge amounts of storage and causing swapping, neither of which have anything to do with inserting the values into the dict.
At any rate, if you really do want to "pre-fill" the dict, you can always do this:
d = dict.fromkeys(range(100000))
for i, tup in enumerate(list_o_tuples):
    del d[i]
    d[tup[0]] = tup[1]
Then the dict never has to resize. (Obviously if your keys overlap the ints from 0-99999 you'll want to use different filler keys, but the same idea will work.)
But I'm willing to bet this makes absolutely no difference to your performance.
I've tried put together a solution, but after the dict is evaluated by ast.literal_eval, all I get is a single {'None': 'None'}
That's because you're creating a dict with 100K copies of the same key. You can't have duplicate keys in a dict, so of course you end up with just one item.
However, this is a red herring. Creating a string to eval is almost never the answer. Your big mess of code is effectively just a slower, less memory-efficient, and harder-to-read version of this:
d = {'None': 'None' for _ in range(100000)}
Or, if you prefer:
d = dict([('None', 'None')] * 100000)

More efficient way to get unique first occurrence from a Python dict

I have a very large file I'm parsing, getting a key and value from each line. I want to keep only the first key and value for each distinct value. That is, I'm removing the duplicate values.
So it would look like:
{
A:1
B:2
C:3
D:2
E:2
F:3
G:1
}
and it would output:
{E:2,F:3,G:1}
It's a bit confusing because I don't really care what the key is. So E in the above could be replaced with B or D, F could be replaced with C, and G could be replaced with A.
Here is the best way I have found to do it but it is extremely slow as the file gets larger.
mapp = {}
value_holder = []
for i in mydict:
    if mydict[i] not in value_holder:
        mapp[i] = mydict[i]
        value_holder.append(mydict[i])
Must look through value_holder every time :( Is there a faster way to do this?
Yes, a trivial change makes it much faster:
value_holder = set()
(Well, you also have to change the append to add. But still pretty simple.)
Using a set instead of a list means each lookup is O(1) instead of O(N), so the whole operation is O(N) instead of O(N^2). In other words, if you have 10,000 lines, you're doing 10,000 hash lookups instead of 50,000,000 comparisons.
One caveat with this solution—and all of the others posted—is that it requires the values to be hashable. If they're not hashable, but they are comparable, you can still get O(NlogN) instead of O(N^2) by using a sorted set (e.g., from the blist library). If they're neither hashable nor sortable… well, you'll probably want to find some way to generate something hashable (or sortable) to use as a "first check", and then only walk the "first check" matches for actual matches, which will get you to O(NM), where M is the average number of hash collisions.
You might want to look at how unique_everseen is implemented in the itertools recipes in the standard library documentation.
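A simplified paraphrase of that recipe (written from memory, so check the itertools documentation for the canonical version); with key=operator.itemgetter(1) it keeps only the first (key, value) pair seen for each value:
from operator import itemgetter

def unique_everseen(iterable, key=None):
    # Yield unique elements, preserving order; remember all elements ever seen.
    seen = set()
    for element in iterable:
        k = element if key is None else key(element)
        if k not in seen:
            seen.add(k)
            yield element

# Keep only the first (key, value) pair seen for each value:
mapp = dict(unique_everseen(mydict.items(), key=itemgetter(1)))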
Note that dictionaries don't actually have an order, so there's no way to pick the "first" duplicate; you'll just get one arbitrarily. In which case, there's another way to do this:
inverted = {v:k for k, v in d.iteritems()}
reverted = {v:k for k, v in inverted.iteritems()}
(This is effectively a form of the decorate-process-undecorate idiom without any processing.)
But instead of building up the dict and then filtering it, you can make things better (simpler, and faster, and more memory-efficient, and order-preserving) by filtering as you read. Basically, keep the set alongside the dict as you go along. For example, instead of this:
mydict = {}
for line in f:
    k, v = line.split(None, 1)
    mydict[k] = v
mapp = {}
value_holder = set()
for i in mydict:
    if mydict[i] not in value_holder:
        mapp[i] = mydict[i]
        value_holder.add(mydict[i])
Just do this:
mapp = {}
value_holder = set()
for line in f:
    k, v = line.split(None, 1)
    if v not in value_holder:
        mapp[k] = v
        value_holder.add(v)
In fact, you may want to consider writing a one_to_one_dict that wraps this up (or search PyPI modules and ActiveState recipes to see if someone has already written it for you), so then you can just write:
mapp = one_to_one_dict()
for line in f:
    k, v = line.split(None, 1)
    mapp[k] = v
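A minimal sketch of what such a one_to_one_dict wrapper might look like (the name comes from the answer above, but this implementation is my own assumption: a dict subclass that silently ignores keys whose value has already been seen):
class one_to_one_dict(dict):
    def __init__(self):
        super(one_to_one_dict, self).__init__()
        self._seen_values = set()

    def __setitem__(self, key, value):
        # Keep only the first key seen for each value.
        # Note: dict literals and update() bypass __setitem__ in CPython,
        # so only plain item assignment is covered by this sketch.
        if value not in self._seen_values:
            self._seen_values.add(value)
            super(one_to_one_dict, self).__setitem__(key, value)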
I'm not completely clear on exactly what you're doing, but set is a great way to remove duplicates. For example:
>>> k = [1,3,4,4,5,4,3,2,2,3,3,4,5]
>>> set(k)
set([1, 2, 3, 4, 5])
>>> list(set(k))
[1, 2, 3, 4, 5]
Though it depends a bit on the structure of the input that you're loading, there might be a way to simply use set so that you don't have to iterate through the entire object every time to see if there are any matching keys--instead run it through set once.
The first way to speed this up, as others have mentioned, is a using a set to record seen values, as checking for membership on a set is much faster.
We can also make this a lot shorter with a dict comprehension:
seen = set()
new_mapp = {k: v for k, v in mapp.items() if v not in seen and not seen.add(v)}
The if clause requires a little explanation: we only keep key/value pairs whose value we haven't seen before, and we use and a little bit hackishly to add each unseen value to the set as a side effect. As set.add() returns None, not seen.add(v) is always true and does not affect the outcome.
As always, in 2.x, use dict.iteritems() over dict.items().
Using a set instead of a list would speed you up considerably ...
You said you are reading from a very large file and want to keep only the first occurrence of a key. I originally assumed this meant you care about the order in which the key/value pairs occurs in the very large file. This code will do that and will be fast.
values_seen = set()
mapp = {}
with open("large_file.txt") as f:
    for line in f:
        key, value = line.split()
        if value not in values_seen:
            values_seen.add(value)
            mapp[key] = value
You were using a list to keep track of what values your code had seen. Searching through a list is very slow: it gets slower the larger the list gets. A set is much faster because lookups are close to constant time (they don't get much slower, or maybe not slower at all, as the set gets larger). (A dict also works the way a set works.)
Part of your problem is that dicts do not preserve any sort of logical ordering when they are iterated through. They use hash tables to index items (see this great article). So there's no real concept of "first occurrence of value" in this sort of data structure. The right way to do this would probably be a list of key-value pairs. e.g.:
kv_pairs = [(k1,v1),(k2,v2),...]
or, because the file is so large, it would be better to use the excellent file iteration python provides to retrieve the k/v pairs:
def kv_iter(f):
    # f being the file descriptor
    for line in f:
        yield ...  # (whatever logic you use to get k, v values from a line)
value_holder is a great candidate for a set. You are really just testing membership in value_holder, and because values are unique, they can be indexed efficiently with the same kind of hashing used for keys. So it would end up a bit like this:
mapp = {}
value_holder = set()
for k, v in kv_iter(f):
    if v not in value_holder:
        mapp[k] = v
        value_holder.add(v)

What's the idiomatic way to fake __hash__() for dicts?

EDIT: as #BrenBarn pointed out, the original didn't make sense.
Given a list of dicts (courtesy of csv.DictReader--they all have str keys and values) it'd be nice to remove duplicates by stuffing them all in a set, but this can't be done directly since dict isn't hashable. Some existing questions touch on how to fake __hash__() for sets/dicts but don't address which way should be preferred.
# i. concise but ugly round trip
filtered = [eval(x) for x in {repr(d) for d in pile_o_dicts}]
# ii. wordy but avoids round trip
filtered = []
keys = set()
for d in pile_o_dicts:
    key = str(d)
    if key not in keys:
        keys.add(key)
        filtered.append(d)
# iii. introducing another class for this seems Java-like?
filtered = {hashable_dict(x) for x in pile_o_dicts}
# iv. something else entirely
In the spirit of the Zen of Python what's the "obvious way to do it"?
Based on your example code, I take your question to be something slightly different from what you literally say. You don't actually want to override __hash__() -- you just want to filter out duplicates in linear time, right? So you need to ensure the following for each dictionary: 1) every key-value pair is represented, and 2) they are represented in a stable order. You could use a sorted tuple of key-value pairs, but instead, I would suggest using frozenset. frozensets are hashable, and they avoid the overhead of sorting, which should improve performance (as this answer seems to confirm). The downside is that they take up more memory than tuples, so there is a space/time tradeoff here.
Also, your code uses sets to do the filtering, but that doesn't make a lot of sense. There's no need for that ugly eval step if you use a dictionary:
filtered = {frozenset(d.iteritems()):d for d in pile_o_dicts}.values()
Or in Python 3, assuming you want a list rather than a dictionary view:
filtered = list({frozenset(d.items()):d for d in pile_o_dicts}.values())
These are both a bit clunky. For readability, consider breaking it into two lines:
dict_o_dicts = {frozenset(d.iteritems()):d for d in pile_o_dicts}
filtered = dict_o_dicts.values()
The alternative is an ordered tuple of tuples:
filtered = {tuple(sorted(d.iteritems())):d for d in pile_o_dicts}.values()
And a final note: don't use repr for this. Dictionaries that evaluate as equal can have different representations:
>>> d1 = {str(i):str(i) for i in range(300)}
>>> d2 = {str(i):str(i) for i in range(299, -1, -1)}
>>> d1 == d2
True
>>> repr(d1) == repr(d2)
False
The artfully named pile_o_dicts can be converted to a canonical form by sorting their items lists:
groups = {}
for d in pile_o_dicts:
    k = tuple(sorted(d.items()))
    groups.setdefault(k, []).append(d)
This will group identical dictionaries together.
FWIW, the technique of using sorted(d.items()) is currently used in the standard library for functools.lru_cache() in order to recognize function calls that have the same keyword arguments. IOW, this technique is tried and true :-)
If the dicts all have the same keys, you can use a namedtuple
>>> from collections import namedtuple
>>> nt = namedtuple('nt', pile_o_dicts[0])
>>> set(nt(**d) for d in pile_o_dicts)

Python: remove lots of items from a list

I am in the final stretch of a project I have been working on. Everything is running smoothly but I have a bottleneck that I am having trouble working around.
I have a list of tuples. The list ranges in length from say 40,000 - 1,000,000 records. Now I have a dictionary where each and every (value, key) is a tuple in the list.
So, I might have
myList = [(20000, 11), (16000, 4), (14000, 9)...]
myDict = {11:20000, 9:14000, ...}
I want to remove each (v, k) tuple from the list.
Currently I am doing:
for k, v in myDict.iteritems():
    myList.remove((v, k))
Removing 838 tuples from the list containing 20,000 tuples takes anywhere from 3 - 4 seconds. I will most likely be removing more like 10,000 tuples from a list of 1,000,000 so I need this to be faster.
Is there a better way to do this?
I can provide code used to test, plus pickled data from the actual application if needed.
You'll have to measure, but I can imagine this to be more performant:
myList = filter(lambda x: myDict.get(x[1], None) != x[0], myList)
because the lookup happens in the dict, which is more suited for this kind of thing. Note, though, that this will create a new list before removing the old one; so there's a memory tradeoff. If that's an issue, rethinking your container type as jkp suggest might be in order.
Edit: Be careful, though, if None is actually in your list -- you'd have to use a different "placeholder."
To remove about 10,000 tuples from a list of about 1,000,000, if the values are hashable, the fastest approach should be:
totoss = set((v,k) for (k,v) in myDict.iteritems())
myList[:] = [x for x in myList if x not in totoss]
The preparation of the set is a small one-time cost, which saves doing tuple unpacking and repacking, or tuple indexing, a lot of times. Assigning to myList[:] instead of assigning to myList is also semantically important (in case there are any other references to myList around, it's not enough to rebind just the name -- you really want to rebind the contents!-).
I don't have your test data around to do the time measurement myself, alas!, but let me know how it plays out on your test data!
If the values are not hashable (e.g. they're sub-lists, for example), fastest is probably:
sentinel = object()
myList[:] = [x for x in myList if myDict.get(x[1], sentinel) != x[0]]
or maybe (shouldn't make a big difference either way, but I suspect the previous one is better -- indexing is cheaper than unpacking and repacking):
sentinel = object()
myList[:] = [(a, b) for (a, b) in myList if myDict.get(b, sentinel) != a]
In these two variants the sentinel idiom is used to ward against values of None (which is not a problem for the preferred set-based approach -- if values are hashable!) as it's going to be way cheaper than if a not in myDict or myDict[a] != b (which requires two indexings into myDict).
Every time you call myList.remove, Python has to scan over the entire list to search for that item and remove it. In the worst case scenario, every item you look for would be at the end of the list each time.
Have you tried doing the "inverse" operation of:
newMyList = [(v, k) for (v, k) in myList if k not in myDict]
But I'm really not sure how well that would scale, either, since you would be making a copy of the original list -- could potentially be a lot of memory usage there.
Probably the best alternative here is to wait for Alex Martelli to post some mind-blowingly intuitive, simple, and efficient approach.
[(i, j) for i, j in myList if myDict.get(j) != i]
Try something like this:
myListSet = set(myList)
myDictSet = set(zip(myDict.values(), myDict.keys()))
myList = list(myListSet - myDictSet)
This will convert myList to a set, will swap the keys/values in myDict and put them into a set, and will then find the difference, turn it back into a list, and assign it back to myList. :)
The problem looks to me to be the fact that you are using a list as the container you are trying to remove from, and a list has no index on its contents. So finding each item in the list is a linear operation (O(n)): it has to iterate over the whole list until it finds a match.
If you could swap the list for some other container (set?) which uses a hash() of each item to order them, then each match could be performed much quicker.
The following code shows how you could do this using a combination of ideas offered by myself and Nick on this thread:
list_set = set(original_list)
dict_set = set(zip(original_dict.values(), original_dict.keys()))
difference_set = list_set - dict_set  # keep this a set so the membership test below stays O(1)
final_list = []
for item in original_list:
    if item in difference_set:
        final_list.append(item)
[i for i in myList if i not in list(zip(myDict.values(), myDict.keys()))]
A list containing a million 2-tuples is not large on most machines running Python. However if you absolutely must do the removal in situ, here is a clean way of doing it properly:
def filter_by_dict(my_list, my_dict):
    sentinel = object()
    for i in xrange(len(my_list) - 1, -1, -1):
        key = my_list[i][1]
        if my_dict.get(key, sentinel) is not sentinel:
            del my_list[i]
Update: Actually, each del costs O(n), shuffling the list pointers down using C's memmove(), so if there are d dels it's O(n*d), not O(n**2). Note that (1) the OP suggests that d approx == 0.01 * n and (2) the O(n*d) effort is copying one pointer to somewhere else in memory ... so this method could in fact be somewhat faster than a quick glance would indicate. Benchmarks, anyone?
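A rough timeit sketch for anyone who wants to try that benchmark (the data here is synthetic, filter_by_dict is the function above, and Python 2 is assumed to match its xrange; the list-comprehension variant rebuilds the list instead of deleting in place):
import random
import timeit

def make_data(n, d):
    # n list entries of (value, key); d of them also appear in the dict as key -> value.
    my_list = [(i * 10, i) for i in range(n)]
    my_dict = dict((k, v) for (v, k) in random.sample(my_list, d))
    return my_list, my_dict

def rebuild(my_list, my_dict):
    # Non-mutating alternative: build a new filtered list.
    return [(v, k) for (v, k) in my_list if my_dict.get(k) != v]

my_list, my_dict = make_data(1000000, 10000)
print(timeit.timeit(lambda: rebuild(my_list, my_dict), number=1))
# The list() copy keeps the original data intact between runs; its cost is included in the timing.
print(timeit.timeit(lambda: filter_by_dict(list(my_list), my_dict), number=1))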
What are you going to do with the list after you have removed the items that are in the dict? Is it possible to piggy-back the dict-filtering onto the next step?
