I was writing some Python code that involved something like this:
values = {}
for element in iterable:
    values.setdefault(element.name, []).append(element)
Because I could sort the input beforehand, I also implemented it like this:
values = {}
cur_name = None
cur_list = None
for element in iterable:
    if element.name != cur_name:
        values[cur_name] = cur_list
        cur_name = element.name
        cur_list = []
    cur_list.append(element)
if cur_list:
    values[cur_name] = cur_list
del values[None]
Here the input is already sorted by element.name.
The second approach was much faster than the first approach, and it also used less memory.
What's the reason for this?
Or have I made some sort of mistake in the second approach?
Every time round the loop your original code creates a list that mostly just gets thrown away. It also makes multiple dictionary lookups: looking up the method setdefault is a dictionary lookup, and then the method itself does a dictionary lookup to see whether the key is already set and, if it isn't, does another to store the value. The .name and .append() lookups are also dictionary lookups, but they are still present in the revised code.
for element in iterable:
    values.setdefault(element.name, []).append(element)
The revised code only touches the dictionary when the name changes, so it removes two dictionary lookups and a method call from every iteration. That's why it's faster.
As for the memory use, when the list grows it may sometimes have to copy the data but can avoid that if the memory block can just be expanded. My guess would be that creating all of those unused temporary lists may be fragmenting the memory more and forcing more copies. In other words Python isn't actually using more memory, but it may have more allocated but unused memory.
When you feel a need for setdefault consider using collections.defaultdict instead. That avoids creating the list except when it's needed:
from collections import defaultdict
values = defaultdict(list)
for element in iterable:
    values[element.name].append(element)
That will probably still be slower than your second code because it doesn't take advantage of your knowledge that names are all grouped, but for the general case it is better than setdefault.
Another way would be to use itertools.groupby. Something like this:
from itertools import groupby
from operator import attrgetter
values = {name: list(group)
          for name, group in groupby(iterable, attrgetter('name'))}
That takes advantage of the ordering and simplifies everything down to a single dictionary comprehension.
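If you want to compare the approaches yourself, here is a rough benchmark sketch; the Element type and the data are made up for illustration, and the absolute timings will depend on your machine and Python version.
import timeit
from collections import defaultdict, namedtuple
from itertools import groupby
from operator import attrgetter

Element = namedtuple('Element', 'name')
# 100 groups of 100 elements each, already sorted by name (hypothetical data).
iterable = [Element('name%03d' % i) for i in range(100) for _ in range(100)]

def with_setdefault():
    values = {}
    for element in iterable:
        values.setdefault(element.name, []).append(element)
    return values

def with_defaultdict():
    values = defaultdict(list)
    for element in iterable:
        values[element.name].append(element)
    return values

def with_groupby():
    return {name: list(group)
            for name, group in groupby(iterable, attrgetter('name'))}

for fn in (with_setdefault, with_defaultdict, with_groupby):
    print(fn.__name__, timeit.timeit(fn, number=100))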
I can think of a couple of reasons why the second approach is faster.
values.setdefault(element.name, []).append(element)
Here you're creating an empty list for each element, even if you're never going to use it. You're also calling the setdefault method for every element, and that amounts to one hash table lookup and a possible hash table write, plus the cost of calling the method itself, which is not insignificant in python. Finally, as the others have pointed out after I posted this answer, you're looking up the setdefault attribute once for each element, even though it always references the same method.
In the second example you avoid all of these inefficiencies. You're only creating as many lists as you need, and you do it all without calling any method but the required list.append, interspersed with a smattering of dictionary assignments. You're also in effect replacing the hash table lookup with a simple comparison (element.name != cur_name), and that's another improvement.
I expect you also get cache benefits, since you're not jumping all over the place when adding items to lists (which would cause lots of cache misses), but work on one list at a time. This way, the relevant memory is probably in a cache layer very near the CPU so that the process is faster. This effect should not be underestimated -- getting data from RAM is two orders of magnitude (or ~100 times) slower than reading it from L1 cache (source).
Of course the sorting adds a little time, but Python has one of the best and most optimized sorting algorithms in the world, all coded in C, so it doesn't outweigh the benefits listed above.
I don't know why the second solution is more memory efficient, though. As Jiri points out it might be the unnecessary lists, but my understanding is that these should be collected immediately by the garbage collector, so they should only increase the memory usage by a tiny amount -- the size of a single empty list. Maybe the garbage collector is lazier than I thought.
Your first version has two inefficient parts:
calling and dereferencing values.setdefault in a loop. You can assign values_setdefault = values.setdefault before the loop, and it can speed things up a bit (see the sketch after this list).
as the other answer suggested, creating a new empty list for every element in your list is slow and memory inefficient. I do not know of a way to avoid that while still using setdefault.
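A minimal sketch of that first micro-optimization, assuming the same values and iterable names as the question; hoisting the attribute lookup out of the loop usually buys only a small constant factor.
values = {}
values_setdefault = values.setdefault   # look up the bound method once, outside the loop
for element in iterable:
    values_setdefault(element.name, []).append(element)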
I'm trying to take a list (orig_list below), and return a list (new_list below) which:
does not contain duplicate items (i.e. contains only unique elements)
is sorted in reverse order
Here is what I have so far, which seems... I'm going to say "weird," though I'm sure there is a better way to say that. I'm mostly put off by using list() twice for what seems pretty straightforward, and then I'm wondering about the efficiency of this approach.
new_list = list(reversed(sorted(list(set(orig_list)))))
Question #1 (SO-style question):
Are the following propositions correct?
There is no more efficient way to get unique elements of a list than converting the list to a set and back.
Since sets are unordered in Python one must (1) convert to a set before removing duplicate items because otherwise you'd lose the sort anyway, and (2) you have to convert back to a list before you sort.
Using list(reversed()) is programmatically equivalent to using list.sort(reverse=True).
Question #2 (bonus):
Are there any ways to achieve the same result in fewer Os, or using a less verbose approach? If so, what is an / are some example(s)?
sorted(set(orig_list), reverse=True)
Shortest in code, more efficient, same result.
Depending on the size, it may or may not be faster to sort first then dedupe in linear time as user2864740 suggests in comments. (The biggest drawback to that approach is it would be entirely in Python, while the above line executes mostly in native code.)
Your questions:
You do not need to convert from set to list and back. sorted accepts any iterable, so set qualifies, and spits out a list, so no post-conversion needed.
reversed(sorted(x)) is not equivalent to sorted(x, reverse=True). You get the same result, but more slowly: a sort takes the same time whether forward or reverse, so reversed just adds an extra pass that is not needed if you sort into the proper order from the start.
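A quick sketch that checks the "same result" part and compares the cost; the data is made up and timings will vary, but the extra pass is visible.
import random
import timeit

data = [3, 1, 2, 3, 1]
assert list(reversed(sorted(data))) == sorted(data, reverse=True)   # identical output

big = list(range(100000))
random.shuffle(big)
print(timeit.timeit(lambda: list(reversed(sorted(big))), number=100))  # sort, then an extra reversal/copy pass
print(timeit.timeit(lambda: sorted(big, reverse=True), number=100))    # sort directly into the wanted order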
You've got a few mildly wasteful steps in here, but your proposition is largely correct. The only real improvements to be made are to get rid of all the unnecessary temporary lists:
new_list = sorted(set(orig_list), reverse=True)
sorted already converts its input to a list (so no need to listify before passing to sorted), and you can have it directly produce the output list sorted in reverse (so no need to produce a list only to make a copy of it in reverse).
The only conceivable improvement on big-O time is if you know the data is already sorted, in which case you can avoid O(n log n) sorting, and uniqify without losing the existing sorted order by using itertools.groupby:
import itertools
new_list = [key for key, grp in itertools.groupby(orig_list)]
If orig_list is sorted in forward order, you can make the result of this reversed at essentially no cost by changing itertools.groupby(orig_list) to itertools.groupby(reversed(orig_list)).
The groupby solution isn't really practical for initially unsorted inputs, because if duplicates are even remotely common, removing them via uniquification as an O(n) step is almost always worth it, since it reduces the n in the more costly O(n log n) sorting step. groupby is also a relatively slow tool; the nature of the implementation, using a bunch of temporary iterators for each group, internal caching of values, etc., means that it's a slower O(n) in practice than the O(n) uniquification via set, with its primary advantage being the streaming aspect (making it scale to data sets streamed from disk or the network and back without storing anything for the long term, where set must pull everything into memory).
The other reason to use sorted+groupby would be if your data wasn't hashable, but was comparable; in that case, set isn't an option, so the only choice is sorting and grouping.
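A small sketch of that fallback, using lists (comparable but unhashable) as made-up data:
from itertools import groupby

rows = [[2, 'b'], [1, 'a'], [2, 'b'], [1, 'a']]   # lists can't go into a set
unique_desc = [key for key, _ in groupby(sorted(rows, reverse=True))]
print(unique_desc)   # [[2, 'b'], [1, 'a']] -- deduplicated and reverse-sorted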
My teacher says that tuples are faster than lists because tuples are immutable, but I don't understand the reason.
I personally think that tuples are faster than lists because tuples are hashable and lists are not hashable.
Please tell me if I am right or wrong.
No, being hashable has nothing to do with being faster.
As for "in order to access an element from a collection that is hashable, it requires constant time":
You're getting things backwards. The time to look up a hashable element in a collection that uses a hash table (like a set) is constant. But that's about the elements being hashable, not the collection; it's about the collection using a hash table instead of an array, and it's about looking elements up by value instead of by index.
Looking up a value in an array by index—whether the value or the array is hashable or not—takes constant time. Searching an array by value takes linear time. (Unless, e.g., it's sorted and you search by bisecting.)
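A tiny sketch to make that distinction concrete; the sizes and the number of repetitions are arbitrary and the timings will vary.
import timeit

data_list = list(range(100000))
data_set = set(data_list)

print(timeit.timeit(lambda: data_list[99999], number=1000))     # index access into the array: constant time
print(timeit.timeit(lambda: 99999 in data_list, number=1000))   # search by value: linear scan of the array
print(timeit.timeit(lambda: 99999 in data_set, number=1000))    # hash lookup: roughly constant time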
Your teacher is only partly right—but then they may have been simplifying things to avoid getting into gory details.
There are three reasons why tuples are faster than lists for some operations.
But it's worth noting that these are usually pretty small differences, and usually hard to predict.1 Almost always, you just want to use whichever one makes more sense, and if you occasionally do find a bottleneck where a few % would make a difference, pull it out and timeit both versions and see.
First, there are some operations that are optimized differently for the two types. Of course this is different for different implementations and even different versions of the same implementation, but a few examples from CPython 3.7:
When sorting a list of tuples, there's a special unsafe_tuple_compare that isn't applied to lists.
When comparing two lists for == or !=, there's a special is test to short-circuit the comparison, which sometimes speeds things up a lot, but otherwise slows things down a little. Benchmarking a whole mess of code showed that this was worth doing for lists, but not for tuples.
Mutability generally doesn't enter into it for these choices; it's more about how the two types are typically used (lists are often homogeneously-typed but arbitrary-length, while tuples are often heterogeneously-typed and consistent-length). However, it's not quite irrelevant—e.g., the fact that a list can be made to contain itself (because they're mutable) and a tuple can't (because they aren't) prevents at least one minor optimization from being applied to lists.2
Second, two equal tuple constants in the same compilation unit can be merged into the same value. And at least CPython and PyPy usually do so. Which can speed some things up (if nothing else, you get better cache locality when there's less data to cache, but sometimes it means bigger savings, like being able to use is tests).
And this one is about mutability: the compiler is only allowed to merge equal values if it knows they're immutable.
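Whether merging actually happens is an implementation detail, but you can often observe it with is; this sketch typically prints (True, False) on CPython, though the result is not guaranteed across implementations or versions.
def constants():
    a = (1, 2, 3)
    b = (1, 2, 3)     # equal, immutable constant: the compiler may reuse the same object
    c = [1, 2, 3]
    d = [1, 2, 3]     # equal but mutable: a fresh list must be built each time
    return (a is b), (c is d)

print(constants())    # typically (True, False) on CPython; implementation-dependent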
Third, lists of the same size are bigger. Allocating more memory, using more cache lines, etc. slows things down a little.
And this one is also about mutability. A list has to have room to grow on the end; otherwise, calling append N times would take N**2 time. But tuples don't have to append.
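You can see the size difference directly with sys.getsizeof (a sketch; the exact byte counts depend on platform, build, and Python version):
import sys

t = (1, 2, 3)
l = [1, 2, 3]
print(sys.getsizeof(t))   # tuple: header plus exactly three item slots
print(sys.getsizeof(l))   # list: bigger header (it tracks capacity) plus a separately allocated item array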
1. There are a handful of cases that come up often enough in certain kinds of problems that some people who deal with those problems all the time learn them and remember them. Occasionally, you'll see an answer on an optimization question on Stack Overflow where someone chimes in, "this would probably be about 3% faster with a tuple instead of a list", and they're usually right.
2. Also, I could imagine a case where a JIT compiler like the one in PyPy could speed things up better with a tuple. If you run the same code a million times in a row with the same values, you're going to get a million copies of the same answer—unless the value changes. If the value is a tuple of two objects, PyPy can add guards to see if either of those objects changes, and otherwise just reuse the last value. If it's a list of two objects, PyPy would have to add guards to the two objects and the list, which is 50% more checking. Whether this actually happens, I have no idea; every time I try to trace through how a PyPy optimization works and generalize from there, I turn out to be wrong, and I just end up concluding that Armin Rigo is a wizard.
I am new-ish to Python and I am finding that I am writing the same pattern of code over and over again:
def foo(list):
    results = []
    for n in list:
        # do some or a lot of processing on n and possibly other variables
        nprime = operation(n)
        results.append(nprime)
    return results
I am thinking in particular about the creation of the empty list followed by the append call. Is there a more Pythonic way to express this pattern? append might not have the best performance characteristics, but I am not sure how else I would approach it in Python.
I often know exactly the length of my output, so calling append each time seems like it might be causing memory fragmentation, or performance problems, but I am also wondering if that is just my old C ways tripping me up. I am writing a lot of text parsing code that isn't super performance sensitive on any particular loop or piece because all of the performance is really contained in gensim or NLTK code and is in much more capable hands than mine.
Is there a better/more pythonic pattern for doing this type of operation?
First, a list comprehension may be all you need (if all the processing mentioned in your comment occurs in operation).
def foo(list):
    return [operation(n) for n in list]
If a list comprehension will not work in your situation, consider whether foo really needs to build the list and could be a generator instead.
def foo(list):
    for n in list:
        # Processing...
        yield operation(n)
In this case, you can iterate over the sequence, and each value is calculated on demand:
for x in foo(myList):
    ...
or you can let the caller decide if a full list is needed:
results = list(foo(myList))
If neither of the above is suitable, then building up the return list in the body of the loop as you are now is perfectly reasonable.
[..] so calling append each time seems like it might be causing memory fragmentation, or performance problems, but I am also wondering if that is just my old C ways tripping me up.
If you are worried about this, don't. Python over-allocates whenever a list has to be resized (lists are dynamically resized based on their length) precisely so that appends are amortized O(1). Whether you call list.append manually or build the list with a list comprehension (which internally also uses .append), the effect, memory-wise, is similar.
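You can watch the over-allocation happen with sys.getsizeof (a sketch; the exact growth pattern is a CPython implementation detail and changes between versions):
import sys

lst = []
last = sys.getsizeof(lst)
print('len  0 ->', last, 'bytes')
for i in range(32):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last:
        # the size jumps in steps: extra slots are grabbed up front so most appends are O(1)
        print('len %2d -> %d bytes' % (len(lst), size))
        last = size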
The list comprehension just performs a bit better (speed-wise); it is optimized for creating lists with specialized byte-code instructions that help it (mainly LIST_APPEND, which calls the list's append directly in C).
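If you're curious, dis will show the specialized instruction; opcodes differ between CPython versions, so take the exact output with a grain of salt.
import dis

# The comprehension's loop body uses a LIST_APPEND instruction instead of looking up
# and calling the .append method on every iteration. (operation and data are just names
# here; dis only compiles the string, it never runs it.)
dis.dis('[operation(n) for n in data]')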
Of course, if memory usage is of concern, you could always opt for the generator approach as highlighted in chepner's answer to lazily produce your results.
In the end, for loops are still great. They might seem clunky in comparison to comprehensions and maps but they still offer a recognizable and readable way to achieve a goal. for loops deserve our love too.
Forgive me for asking in such a general way, as I'm sure their performance depends on how one uses them, but in my case collections.deque was way slower than collections.defaultdict when I wanted to verify the existence of a value.
I used the spelling correction from Peter Norvig in order to verify a user's input against a small set of words. As I had no use for a dictionary with word frequencies I used a simple list instead of defaultdict at first, but replaced it with deque as soon as I noticed that a single word lookup took about 25 seconds.
Surprisingly, that wasn't faster than using a list so I returned to using defaultdict which returned results almost instantaneously.
Can someone explain this difference in performance to me?
Thanks in advance
PS: If one of you wants to reproduce what I was talking about, change the following lines in Norvig's script.
-NWORDS = train(words(file('big.txt').read()))
+NWORDS = collections.deque(words(file('big.txt').read()))
-return max(candidates, key=NWORDS.get)
+return candidates
These three data structures aren't interchangeable, they serve very different purposes and have very different characteristics:
Lists are dynamic arrays, you use them to store items sequentially for fast random access, use as stack (adding and removing at the end) or just storing something and later iterating over it in the same order.
Deques are sequences too, but they are meant for adding and removing elements at both ends rather than for random access or stack-like growth.
Dictionaries (providing a default value is just a relatively simple and convenient, but for this question irrelevant, extension) are hash tables; they associate fully-featured keys (instead of an index) with values and provide very fast access to a value by its key and, necessarily, very fast checks for key existence. They don't maintain order and require the keys to be hashable, but well, you can't make an omelette without breaking eggs.
All of these properties are important; keep them in mind whenever you choose one over the other. What breaks your neck in this particular case is a combination of the last property of dictionaries and the number of possible corrections that have to be checked. Some simple combinatorics will give you a concrete formula for the number of edits this code generates for a given word, but anyone who has mispredicted such things often enough will know it's going to be a surprisingly large number even for average words.
For each of these edits, there is a check edit in NWORDS to weed out edits that result in unknown words. Not a big problem in Norvig's program, since in checks (key existence checks) are, as mentioned before, very fast. But you swapped the dictionary for a sequence (a deque)! For sequences, in has to iterate over the whole sequence and compare each item with the value searched for (it can stop when it finds a match, but since few of the edits are known words sitting at the beginning of the deque, it usually still searches all or most of the deque). Since there are quite a few words and the test is done for each edit generated, you end up spending 99% of your time doing a linear search in a sequence where you could just hash a string and compare it once (or at most - in case of collisions - a few times).
If you don't need weights, you can conceptually use bogus values you never look at and still get the performance boost of an O(1) in check. In practice, you should just use a set, which uses pretty much the same algorithms as the dictionaries and just cuts away the part where it stores the value (it was actually first implemented like that; I don't know how far the two have diverged since sets were re-implemented in a dedicated, separate C module).
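Applied to Norvig's script, the minimal change would be to build NWORDS as a set rather than a deque. This is just a sketch in modern Python (the original used the Python 2 file() builtin), and words() is assumed to be the helper from that script.
with open('big.txt') as f:
    NWORDS = set(words(f.read()))   # "edit in NWORDS" is now an O(1) hash lookup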
I'm writing an application in Python (2.6) that requires me to use a dictionary as a data store.
I am curious as to whether or not it is more memory efficient to have one large dictionary, or to break that down into many (much) smaller dictionaries, then have an "index" dictionary that contains a reference to all the smaller dictionaries.
I know there is a lot of overhead in general with lists and dictionaries. I read somewhere that Python internally allocates enough space to hold the dictionary/list's number of items to the power of 2.
I'm new enough to Python that I'm not sure if there are other unexpected internal complexities/surprises like that, not apparent to the average user, that I should take into consideration.
One of the difficulties is knowing how the power-of-2 system counts "items". Is each key:value pair counted as 1 item? That seems important to know, because if you have a 100-item monolithic dictionary then space for 100^2 items would be allocated. If you have 100 single-item dictionaries (1 key:value pair each) then each dictionary would only allocate 1^2 items (aka no extra allocation)?
Any clearly laid out information would be very helpful!
Three suggestions:
Use one dictionary.
It's easier, it's more straightforward, and someone else has already optimized this problem for you. Until you've actually measured your code and traced a performance problem to this part of it, you have no reason not to do the simple, straightforward thing.
Optimize later.
If you are really worried about performance, then abstract the problem: make a class to wrap whatever lookup mechanism you end up using, and write your code to use this class. You can change the implementation later if you find you need some other data structure for greater performance.
Read up on hash tables.
Dictionaries are hash tables, and if you are worried about their time or space overhead, you should read up on how they're implemented. This is basic computer science. The short of it is that hash tables are:
average case O(1) lookup time
O(n) space (Expect about 2n, depending on various parameters)
I do not know where you read that they were O(n^2) space, but if they were, then they would not be in widespread, practical use as they are in most languages today. There are two advantages to these nice properties of hash tables:
O(1) lookup time implies that you will not pay a cost in lookup time for having a larger dictionary, as lookup time doesn't depend on size.
O(n) space implies that you don't gain much of anything from breaking your dictionary up into smaller pieces. Space scales linearly with number of elements, so lots of small dictionaries will not take up significantly less space than one large one or vice versa. This would not be true if they were O(n^2) space, but lucky for you, they're not.
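A rough sketch to convince yourself; the per-dictionary overhead varies across Python versions and platforms, but the direction of the result should hold.
import sys

keys = ['key%05d' % i for i in range(10000)]

one_big = {k: None for k in keys}
many_small = {k: {k: None} for k in keys}          # an "index" dict of 10000 one-entry dicts

small_total = sys.getsizeof(many_small) + sum(sys.getsizeof(d) for d in many_small.values())
print(sys.getsizeof(one_big))    # one hash table
print(small_total)               # outer table plus 10000 tiny tables' fixed overhead: typically much more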
Here are some more resources that might help:
The Wikipedia article on Hash Tables gives a great listing of the various lookup and allocation schemes used in hashtables.
The GNU Scheme documentation has a nice discussion of how much space you can expect hashtables to take up, including a formal discussion of why "the amount of space used by the hash table is proportional to the number of associations in the table". This might interest you.
Here are some things you might consider if you find you actually need to optimize your dictionary implementation:
Here is the C source code for Python's dictionaries, in case you want ALL the details. There's copious documentation in here:
dictobject.h
dictobject.c
Here is a python implementation of that, in case you don't like reading C.
(Thanks to Ben Peterson)
The Java Hashtable class docs talk a bit about how load factors work, and how they affect the space your hash takes up. Note there's a tradeoff between your load factor and how frequently you need to rehash. Rehashes can be costly.
If you're using Python, you really shouldn't be worrying about this sort of thing in the first place. Just build your data structure the way it best suits your needs, not the computer's.
This smacks of premature optimization, not performance improvement. Profile your code if something is actually bottlenecking, but until then, just let Python do what it does and focus on the actual programming task, and not the underlying mechanics.
"Simple" is generally better than "clever", especially if you have no tested reason to go beyond "simple". And anyway "Memory efficient" is an ambiguous term, and there are tradeoffs, when you consider persisting, serializing, cacheing, swapping, and a whole bunch of other stuff that someone else has already thought through so that in most cases you don't need to.
Think "Simplest way to handle it properly" optimize much later.
Premature optimization bla bla, don't do it bla bla.
I think you're mistaken about what the power-of-two extra allocation does. I think it's just a multiplier of two: x*2, not x^2.
I've seen this question a few times on various python mailing lists.
With regards to memory, here's a paraphrased version of one such discussion (the post in question wanted to store hundreds of millions integers):
A set() is more space efficient than a dict(), if you just want to test for membership
gmpy has a bitvector type class for storing dense sets of integers
Dicts are kept between 30% and 50% empty, and an entry takes about ~12 bytes (though the true amount will vary by platform a bit).
So, the fewer objects you have, the less memory you're going to be using, and the fewer lookups you're going to do (since you'd have to look up in the index, then do a second lookup in the actual value dictionary).
Like others said, profile to see your bottlenecks. Keeping a membership set() and a value dict() might be faster, but you'll be using more memory.
I'd also suggest reposting this to a python specific list, such as comp.lang.python, which is full of much more knowledgeable people than myself who would give you all sorts of useful information.
If your dictionary is so big that it does not fit into memory, you might want to have a look at ZODB, a very mature object database for Python.
The 'root' of the db has the same interface as a dictionary, and you don't need to load the whole data structure into memory at once; e.g., you can iterate over only a portion of the structure by providing start and end keys.
It also provides transactions and versioning.
Honestly, you won't be able to tell the difference either way, in terms of either performance or memory usage. Unless you're dealing with tens of millions of items or more, the performance or memory impact is just noise.
From the way you worded your second sentence, it sounds like the one big dictionary is your first inclination, and matches more closely with the problem you're trying to solve. If that's true, go with that. What you'll find about Python is that the solutions that everyone considers 'right' nearly always turn out to be those that are as clear and simple as possible.
Oftentimes, dictionaries of dictionaries are useful for reasons other than performance; i.e., they allow you to store context information about the data without having extra fields on the objects themselves, and they make querying subsets of the data faster.
In terms of memory usage, it would stand to reason that one large dictionary will use less RAM than multiple smaller ones. Remember, if you're nesting dictionaries, each additional layer of nesting will roughly double the number of dictionaries you need to allocate.
In terms of query speed, multiple dicts will take longer due to the increased number of lookups required.
So I think the only way to answer this question is for you to profile your own code. However, my suggestion is to use the method that makes your code the cleanest and easiest to maintain. Of all the features of Python, dictionaries are probably the most heavily tweaked for optimal performance.