Trim an ordered dict to the last x items - python

I'm trying to trim an ordered dict to the last x items.
I have the following code, which works but doesn't seem very pythonic.
Is there a better way of doing this?
import collections
d = collections.OrderedDict()
# SNIP: POPULATE DICT HERE!
d = collections.OrderedDict(d.items()[-3:])

This works a bit faster:
for _ in range(len(d) - x): d.popitem(last=False)
Not really sure how pythonic it is though.
Benefits include not having to create a new OrderedDict object, and not having to look at keys or items.

If you wish to trim the dictionary in place, then you can pop the offending items:
for k in d.keys()[:-3]:
    d.pop(k)
(On Python 3, you'll need to convert .keys() to a list.)
If you're wishing to create a new OrderedDict, then it's not clear quite what is "unpythonic" about your current approach.
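For reference, here's a runnable sketch of both approaches on Python 3 (a minimal example; on 3.7+ a plain dict would also preserve insertion order):

import collections

d = collections.OrderedDict((k, str(k)) for k in range(10))
x = 3

# In place: pop from the front until only x items remain.
while len(d) > x:
    d.popitem(last=False)

# Or build a new OrderedDict holding only the last x pairs, via a bounded deque,
# which avoids slicing the items view.
d2 = collections.OrderedDict(collections.deque(d.items(), maxlen=x))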

Related

Faster way to filter a list of dictionaries

I have a large list of dicts (200,000+) and need to filter those dicts based on a key many times (~11,000). What is the fastest way to do this?
I am retrieving a list of dicts (olist), roughly 225,000 dicts, and am trying to filter those dicts based on a single key ('type'). Currently, I build a list of all 'type' values present in the dicts and then iterate over it, filtering the dicts for every 'type'. My problem is that the initial 'type' filter takes ~0.3s, which means the whole run would take almost an hour. Threading gets me down to just over 10 minutes, but I would like to be closer to half that. Below are the relevant snippets of my code; is there a faster way of doing this (either a faster filter or a more effective algorithm)?
tLim = threading.BoundedSemaphore(500)
...
olist = _get_co_(h) ## this returns a list of ~225,000 dictionaries
idlist = list(set([d['type'] for d in olist])) ## returns list of ~11,000
for i in idlist:
    t = Thread(target=_typeData_, args=(i, olist, cData))
    threads.append(t)
def _typeData_(i, olist, cData):
    tLim.acquire()
    tList = list(filter(lambda x: x['type'] == i, olist))  ## takes ~0.3s
    # do stuff with tList ## takes ~0.01s
Please note, I've looked at generator expressions, but it seems like having to store and recall the results might be worse? I haven't tried it though, and I'm not sure how I would implement it...
Also, increasing the semaphore does not improve time much, if at all.
You could group the dictionaries by type so you can avoid the filter later on:
from collections import defaultdict
id_groups = defaultdict(list)
for dct in olist:
    id_groups[dct['type']].append(dct)
Now you don't need to filter at all; just iterate over id_groups and you'll get a list of all the dictionaries of each type:
for i, tList in id_groups.items():
    # i and tList are identical to the variables in your "_typeData_" function.
    # do something with tList
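For a sense of the difference: the original approach scans all ~225,000 dicts once per type (~11,000 full scans), while the grouping pass touches each dict exactly once. A tiny self-contained check (sample data invented for illustration):

from collections import defaultdict

olist = [
    {'type': 'a', 'val': 1},
    {'type': 'b', 'val': 2},
    {'type': 'a', 'val': 3},
]

id_groups = defaultdict(list)
for dct in olist:
    id_groups[dct['type']].append(dct)

print(dict(id_groups))
# {'a': [{'type': 'a', 'val': 1}, {'type': 'a', 'val': 3}],
#  'b': [{'type': 'b', 'val': 2}]}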

More efficient way to get unique first occurrence from a Python dict

I have a very large file I'm parsing, getting a key and value from each line. I want to keep only the first key for each distinct value; that is, I'm removing the duplicate values.
So it would look like:
{
    A: 1,
    B: 2,
    C: 3,
    D: 2,
    E: 2,
    F: 3,
    G: 1
}
and it would output:
{E:2,F:3,G:1}
It's a bit confusing because I don't really care what the key is. So E in the above could be replaced with B or D, F could be replaced with C, and G could be replaced with A.
Here is the best way I have found to do it but it is extremely slow as the file gets larger.
mapp = {}
value_holder = []
for i in mydict:
    if mydict[i] not in value_holder:
        mapp[i] = mydict[i]
        value_holder.append(mydict[i])
I must look through value_holder every time :( Is there a faster way to do this?
Yes, a trivial change makes it much faster:
value_holder = set()
(Well, you also have to change the append to add. But still pretty simple.)
Using a set instead of a list means each lookup is O(1) instead of O(N), so the whole operation is O(N) instead of O(N^2). In other words, if you have 10,000 lines, you're doing 10,000 hash lookups instead of 50,000,000 comparisons.
One caveat with this solution—and all of the others posted—is that it requires the values to be hashable. If they're not hashable, but they are comparable, you can still get O(NlogN) instead of O(N^2) by using a sorted set (e.g., from the blist library). If they're neither hashable nor sortable… well, you'll probably want to find some way to generate something hashable (or sortable) to use as a "first check", and then only walk the "first check" matches for actual matches, which will get you to O(NM), where M is the average number of hash collisions.
You might want to look at how unique_everseen is implemented in the itertools recipes in the standard library documentation.
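The documented recipe is short; simplified (without its key parameter) it looks roughly like this:

def unique_everseen(iterable):
    # Yield each element only the first time it is seen.
    seen = set()
    for element in iterable:
        if element not in seen:
            seen.add(element)
            yield element

list(unique_everseen([1, 2, 1, 3, 2]))  # [1, 2, 3]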
Note that dictionaries don't actually have an order, so there's no way to pick the "first" duplicate; you'll just get one arbitrarily. In which case, there's another way to do this:
inverted = {v:k for k, v in d.iteritems()}
reverted = {v:k for k, v in inverted.iteritems()}
(This is effectively a form of the decorate-process-undecorate idiom without any processing.)
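For instance, on a small sample (Python 3 syntax; the .iteritems() above is the Python 2 spelling):

d = {'A': 1, 'B': 2, 'C': 3, 'D': 2}
inverted = {v: k for k, v in d.items()}         # duplicates collapse; one key per value survives
reverted = {v: k for k, v in inverted.items()}  # e.g. {'A': 1, 'D': 2, 'C': 3}

Which key survives for a duplicated value is whichever the dict yields last, which is why the result is described as arbitrary.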
But instead of building up the dict and then filtering it, you can make things better (simpler, and faster, and more memory-efficient, and order-preserving) by filtering as you read. Basically, keep the set alongside the dict as you go along. For example, instead of this:
mydict = {}
for line in f:
    k, v = line.split(None, 1)
    mydict[k] = v

mapp = {}
value_holder = set()
for i in mydict:
    if mydict[i] not in value_holder:
        mapp[i] = mydict[i]
        value_holder.add(mydict[i])
Just do this:
mapp = {}
value_holder = set()
for line in f:
    k, v = line.split(None, 1)
    if v not in value_holder:
        mapp[k] = v
        value_holder.add(v)
In fact, you may want to consider writing a one_to_one_dict that wraps this up (or search PyPI modules and ActiveState recipes to see if someone has already written it for you), so then you can just write:
mapp = one_to_one_dict()
for line in f:
    k, v = line.split(None, 1)
    mapp[k] = v
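A minimal sketch of what such a wrapper might look like (the class name comes from the suggestion above; the exact behavior here, silently dropping assignments whose value is already stored, is an assumption):

class one_to_one_dict(dict):
    # A dict that keeps only the first key seen for each distinct value.
    # Note: only plain d[k] = v assignments go through __setitem__;
    # update() and setdefault() would need overriding too.
    def __init__(self):
        super(one_to_one_dict, self).__init__()
        self._seen_values = set()

    def __setitem__(self, key, value):
        if value not in self._seen_values:
            self._seen_values.add(value)
            super(one_to_one_dict, self).__setitem__(key, value)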
I'm not completely clear on exactly what you're doing, but set is a great way to remove duplicates. For example:
>>> k = [1,3,4,4,5,4,3,2,2,3,3,4,5]
>>> set(k)
set([1, 2, 3, 4, 5])
>>> list(set(k))
[1, 2, 3, 4, 5]
Though it depends a bit on the structure of the input that you're loading, there might be a way to simply use set so that you don't have to iterate through the entire object every time to see if there are any matching keys; instead, run it through set once.
The first way to speed this up, as others have mentioned, is using a set to record seen values, as checking for membership in a set is much faster.
We can also make this a lot shorter with a dict comprehension:
seen = set()
new_mapp = {k: v for k, v in mapp.items() if not (v in seen or seen.add(v))}
The if clause requires a little explanation: we only keep key/value pairs where we haven't seen the value before, and we use or a little hackishly so that any unseen value gets added to the set as a side effect. As set.add() returns None (which is falsy), it will not affect the outcome.
As always, in 2.x, use dict.iteritems() over dict.items().
Using a set instead of a list would speed you up considerably ...
You said you are reading from a very large file and want to keep only the first occurrence of a key. I originally assumed this meant you care about the order in which the key/value pairs occurs in the very large file. This code will do that and will be fast.
values_seen = set()
mapp = {}
with open("large_file.txt") as f:
    for line in f:
        key, value = line.split()
        if value not in values_seen:
            values_seen.add(value)
            mapp[key] = value
You were using a list to keep track of what values your code had seen. Searching through a list is very slow: it gets slower as the list gets larger. A set is much faster, because lookups take close to constant time (they don't get appreciably slower as the set grows). (A dict also works the way a set works.)
Part of your problem is that dicts do not preserve any sort of logical ordering when they are iterated through. They use hash tables to index items (see this great article). So there's no real concept of "first occurrence of a value" in this sort of data structure. The right way to do this would probably be a list of key-value pairs, e.g.:
kv_pairs = [(k1,v1),(k2,v2),...]
or, because the file is so large, it would be better to use the excellent file iteration Python provides to retrieve the k/v pairs:
def kv_iter(f):
    # f being the file descriptor
    for line in f:
        yield ...  # (whatever logic you use to get k, v values from a line)
value_holder is a great candidate for a set: you are really just testing membership in value_holder. Because the values are unique, they can be indexed efficiently using the same hashing approach. So it would end up a bit like this:
mapp = {}
value_holder = set()
for k, v in kv_iter(f):
    if v not in value_holder:
        mapp[k] = v
        value_holder.add(v)

What's the idiomatic way to fake __hash__() for dicts?

EDIT: as #BrenBarn pointed out, the original didn't make sense.
Given a list of dicts (courtesy of csv.DictReader--they all have str keys and values) it'd be nice to remove duplicates by stuffing them all in a set, but this can't be done directly since dict isn't hashable. Some existing questions touch on how to fake __hash__() for sets/dicts but don't address which way should be preferred.
# i. concise but ugly round trip
filtered = [eval(x) for x in {repr(d) for d in pile_o_dicts}]

# ii. wordy but avoids round trip
filtered = []
keys = set()
for d in pile_o_dicts:
    key = str(d)
    if key not in keys:
        keys.add(key)
        filtered.append(d)

# iii. introducing another class for this seems Java-like?
filtered = {hashable_dict(x) for x in pile_o_dicts}

# iv. something else entirely
In the spirit of the Zen of Python, what's the "obvious way to do it"?
Based on your example code, I take your question to be something slightly different from what you literally say. You don't actually want to override __hash__() -- you just want to filter out duplicates in linear time, right? So you need to ensure the following for each dictionary: 1) every key-value pair is represented, and 2) they are represented in a stable order. You could use a sorted tuple of key-value pairs, but instead, I would suggest using frozenset. frozensets are hashable, and they avoid the overhead of sorting, which should improve performance (as this answer seems to confirm). The downside is that they take up more memory than tuples, so there is a space/time tradeoff here.
Also, your code uses sets to do the filtering, but that doesn't make a lot of sense. There's no need for that ugly eval step if you use a dictionary:
filtered = {frozenset(d.iteritems()):d for d in pile_o_dicts}.values()
Or in Python 3, assuming you want a list rather than a dictionary view:
filtered = list({frozenset(d.items()):d for d in pile_o_dicts}.values())
These are both a bit clunky. For readability, consider breaking it into two lines:
dict_o_dicts = {frozenset(d.iteritems()):d for d in pile_o_dicts}
filtered = dict_o_dicts.values()
The alternative is an ordered tuple of tuples:
filtered = {tuple(sorted(d.iteritems())):d for d in pile_o_dicts}.values()
And a final note: don't use repr for this. Dictionaries that evaluate as equal can have different representations:
>>> d1 = {str(i):str(i) for i in range(300)}
>>> d2 = {str(i):str(i) for i in range(299, -1, -1)}
>>> d1 == d2
True
>>> repr(d1) == repr(d2)
False
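For completeness, a tiny end-to-end run of the frozenset approach (sample dicts invented; Python 3 syntax):

pile_o_dicts = [{'a': 1, 'b': 2}, {'b': 2, 'a': 1}, {'a': 3}]
filtered = list({frozenset(d.items()): d for d in pile_o_dicts}.values())
# two dicts survive: {'a': 1, 'b': 2} and {'a': 3}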
The artfully named pile_o_dicts can be converted to a canonical form by sorting their items lists:
groups = {}
for d in pile_o_dicts:
    k = tuple(sorted(d.items()))
    groups.setdefault(k, []).append(d)
This will group identical dictionaries together.
FWIW, the technique of using sorted(d.items()) is currently used in the standard library for functools.lru_cache() in order to recognize function calls that have the same keyword arguments. IOW, this technique is tried and true :-)
If the dicts all have the same keys, you can use a namedtuple:
>>> from collections import namedtuple
>>> nt = namedtuple('nt', pile_o_dicts[0])
>>> set(nt(**d) for d in pile_o_dicts)
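A quick end-to-end illustration (sample data invented; this assumes every dict has exactly the same keys):

from collections import namedtuple

pile_o_dicts = [
    {'a': '1', 'b': '2'},
    {'a': '1', 'b': '2'},  # duplicate
    {'a': '3', 'b': '4'},
]
nt = namedtuple('nt', pile_o_dicts[0])       # field names taken from the first dict's keys
unique = set(nt(**d) for d in pile_o_dicts)  # two unique rows remain
filtered = [t._asdict() for t in unique]     # back to dicts if needed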

How to print dictionary's values from left to right?

I have a dictionary:
a = {"w1": "wer", "w2": "qaz", "w3": "edc"}
When I try to print its values, they are printed from right to left:
>>> for item in a.values():
...     print item,
edc qaz wer
I want them to be printed from left to right:
wer qaz edc
How can I do it?
You can't. Dictionaries don't have any order you can use, so there's no concept of "left to right" with regards to dictionary literals. Decide on a sorting, and stick with it.
You can use collections.OrderedDict (Python 2.7 or newer; there's an ActiveState recipe somewhere which provides this functionality for Python 2.4 or newer, I think) to store your items. Of course, you'll need to insert the items into the dictionary in the proper order: the {} syntax will no longer work, nor will passing key=value to the constructor, because, as others have mentioned, those rely on regular dictionaries, which have no concept of order.
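For example (a minimal sketch; the values come out in whatever order you insert them):

from collections import OrderedDict

a = OrderedDict()
a["w1"] = "wer"
a["w2"] = "qaz"
a["w3"] = "edc"

for item in a.values():
    print item,
# prints: wer qaz edc

(print item, is the Python 2 syntax used in the question; on Python 3, use print(item, end=' ').)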
Assuming you want them in alphabetical order of the keys, you can do something like this:
a = {"w1": "wer", "w2": "qaz", "w3": "edc"} # your dictionary
keylist = a.keys()  # list of keys, in this case ["w3", "w2", "w1"]
keylist.sort()      # sort alphabetically in place,
                    # changing keylist to ["w1", "w2", "w3"]
for key in keylist:
    print a[key]    # access dictionary in order of sorted keys
As @IgnacioVazquez-Abrams mentioned, there is no such thing as order in dictionaries, but you can achieve a similar effect by using the ordered dict odict from http://pypi.python.org/pypi/odict
Also check out PEP 372 for more discussion and odict patches.
Dictionaries use hash values to associate values. The only way to sort a dictionary would look something like:
d = {}
x = [k for k in d]
# sort x here
y = []
for z in x:
    y.append(d[z])
I haven't done any real work in python in a while, so I may be a little rusty. Please correct me if I am mistaken.

How to get value on a certain index, in a python list?

I have a list which looks something like this
List = [q1,a1,q2,a2,q3,a3]
I need the final code to be something like this
dictionary = {q1:a1,q2:a2,q3:a3}
If only I could get values at a certain index, e.g. List[0], I could accomplish this. Is there any way I can do that?
Python dictionaries can be constructed using the dict class, given an iterable of (key, value) tuples. We can use the range builtin with a step of 2 to pair each even-indexed item with the odd-indexed item that follows it, and pass the result to dict, so the values organize themselves into key/value pairs in the final result:
dictionary = dict([(List[i], List[i+1]) for i in range(0, len(List), 2)])
Using extended slice notation:
dictionary = dict(zip(List[0::2], List[1::2]))
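A quick check with placeholder values (names invented for illustration):

List = ['q1', 'a1', 'q2', 'a2', 'q3', 'a3']
dictionary = dict(zip(List[0::2], List[1::2]))
# {'q1': 'a1', 'q2': 'a2', 'q3': 'a3'}

List[0::2] takes every second item starting at index 0 (the keys), and List[1::2] takes every second item starting at index 1 (the values).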
The range-based answer is simpler, but there's another approach possible using the itertools package:
from itertools import izip
dictionary = dict(izip(*[iter(List)] * 2))
Breaking this down (edit: tested this time):
# Create instance of iterator wrapped around List
# which will consume items one at a time when called.
iter(List)
# Put reference to iterator into list and duplicate it so
# there are two references to the *same* iterator.
[iter(List)] * 2
# Pass each item in the list as a separate argument to the
# izip() function. This uses the special * syntax that takes
# a sequence and spreads it across a number of positional arguments.
izip(* [iter(List)] * 2)
# Use regular dict() constructor, same as in the answer by zzzeeek
dict(izip(* [iter(List)] * 2))
Edit: much thanks to Chris Lutz' sharp eyes for the double correction.
d = {}
for i in range(0, len(List), 2):
    d[List[i]] = List[i+1]
You've mentioned in the comments that you have duplicate entries. We can work with this. Take your favorite method of generating the list of tuples, and expand it into a for loop:
from itertools import izip
dictionary = {}
for k, v in izip(List[::2], List[1::2]):
    if k not in dictionary:
        dictionary[k] = set()
    dictionary[k].add(v)
Or we could use collections.defaultdict so we don't have to check if a key is already initialized:
from itertools import izip
from collections import defaultdict
dictionary = defaultdict(set)
for k, v in izip(List[::2], List[1::2]):
    dictionary[k].add(v)
We'll end with a dictionary where all the values are sets, and the sets contain the values from the list. This still may not be appropriate, because sets, like dictionaries, cannot hold duplicates, so if you need a single key to hold two of the same value, you'll need to change it to a tuple or a list. But this should get you started.
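If duplicate values per key matter, the same pattern works with lists (a sketch; izip above is the Python 2 spelling, and the builtin zip works on both 2 and 3):

from collections import defaultdict

List = ['q1', 'a1', 'q1', 'a1', 'q2', 'a2']
dictionary = defaultdict(list)
for k, v in zip(List[::2], List[1::2]):
    dictionary[k].append(v)  # keeps duplicates and their order
# defaultdict(<class 'list'>, {'q1': ['a1', 'a1'], 'q2': ['a2']})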
