Python: sorted insert into list

This is a common task when building a list incrementally: once the container is sorted, subsequent inserts should place values efficiently at the correct location so that the container stays sorted, and reading it out into a standard list is O(n). To be perfectly clear: I am looking for a call to compiled O(log n) inserts into what amounts to a list, as I would expect from the ordered set I'd get from C++'s std::set (where I'd have to explicitly specify std::unordered_set to get the default Python behavior).
An OrderedSet (the missing Python type) would accomplish this task. Is there a way to get this effect in Python such that it is as efficient within the container as it would be expected to be in a general-purpose compiled language?

import bisect
mylist = [1, 2, 5]
# insort finds the insertion point in O(log n); the insert itself shifts elements, so it is O(n)
bisect.insort(mylist, 4)
print(mylist)
# [1, 2, 4, 5]
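If truly sub-linear inserts matter, one option outside the standard library is the sortedcontainers package; a minimal sketch, assuming it is installed via pip install sortedcontainers:

from sortedcontainers import SortedList

sl = SortedList([1, 2, 5])
sl.add(4)        # kept in sorted order on insert
print(list(sl))
# [1, 2, 4, 5]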

Related

Retrieve first key-value pair from a dictionary in Python without using list, iter

Suppose I have a huge dictionary my_dict with a complex structure.
my_dict = {'complex_key': ('complex', 'values')}
If I want to see its first key-value pair (to understand what's inside), currently I use:
list(my_dict.items())[0]
However, this duplicates all the keys in memory. It is also inconvenient, because pdb.set_trace() does not execute expressions starting with list. It is possible to use an iterator:
next(iter(my_dict.items()))
However, it's inconvenient, because I cannot access the nth element easily.
Is there any other easy way to access key-value pairs of dict_items()?
In Python 2.7 this expression used to work:
my_dict.items()[0]
Update: I ended up using:
tuple(my_dict.items())[0]
This approach at least overcomes the pdb.set_trace() limitation. It also allows easy access to the nth element and does not require any imports like from itertools import islice.
The reason my_dict.items()[0] worked in Python 2.7 and not in Python 3 is that in Python 2 it returned a list, while in Python 3 it returns a dictionary view.
To get the same behavior, you have to wrap it in list() or tuple().
The most memory-efficient way would be to create a tuple of the keys.
keys = tuple(my_dict.keys())
my_dict[keys[0]]
However, whenever you change the dictionary, you'd have to update/recreate keys.
Another thing to note is that order is not guaranteed before Python 3.7, though once keys is created, its order is.
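For completeness, a minimal sketch of the islice approach mentioned above, which reaches the nth item lazily without copying every key:

from itertools import islice

my_dict = {'complex_key': ('complex', 'values')}
first_item = next(islice(my_dict.items(), 0, None))  # first key-value pair
# nth item: next(islice(my_dict.items(), n, None))
print(first_item)
# ('complex_key', ('complex', 'values'))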

Why does Python's set() return a set item instead of a list

I quite often use set() to remove duplicates from lists. After doing so, I always directly change it back to a list.
a = [0,0,0,1,2,3,4,5]
b = list(set(a))
Why does set() return a set item, instead of simply a list?
type(set(a)) == set # is true
Is there a use for set items that I have failed to understand?
Yes, sets have many uses. They have lots of nice operations (documented in the set type docs) which lists don't have. One very useful difference is that membership testing (x in a) can be much faster than for a list.
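A rough micro-benchmark sketch of that difference (timings are machine-dependent and only indicative):

import timeit

setup = "s = set(range(100000)); l = list(range(100000))"
print(timeit.timeit("99999 in s", setup=setup, number=10000))  # roughly constant time
print(timeit.timeit("99999 in l", setup=setup, number=10000))  # scans the whole list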
Okay, by "doubles" you mean duplicates? And set() will always return a set, because it is a data structure in Python just like lists; when you call set() you are creating a set object.
The rest of the information about sets you can find here:
https://docs.python.org/2/library/sets.html
As already mentioned, I won't go into why set does not return a list, but as you stated:
I quite often use set() to remove doubles from lists. After doing so, I always directly change it back to a list.
You could use OrderedDict if you really hate going back to changing it to a list:
source_list = [0,0,0,1,2,3,4,5]
from collections import OrderedDict
print(OrderedDict((x, True) for x in source_list).keys())
OUTPUT:
odict_keys([0, 1, 2, 3, 4, 5])
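Note that on Python 3.7+ plain dicts preserve insertion order, so the same order-preserving deduplication works without any import; a minimal sketch:

source_list = [0, 0, 0, 1, 2, 3, 4, 5]
deduped = list(dict.fromkeys(source_list))  # keys keep first-seen order
print(deduped)
# [0, 1, 2, 3, 4, 5]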
As said before, for certain operations, using a set instead of a list is faster. The Python wiki has a page, TimeComplexity, in which the speed of operations on various data types is given. Note that if you have few elements in your list or set, you will most probably not notice a difference, but with more elements it becomes more important.
Notice, for example, that in-place removal from a list is O(n), meaning that a 10-times-longer list will need 10 times more time, while for a set, s.difference_update(t) (where s is a set and t is a set with one element to be removed from s) takes O(1) time, i.e. it is independent of the number of elements of s.
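A minimal sketch of that in-place removal:

s = {0, 1, 2, 3, 4, 5}
s.difference_update({3})  # removes 3 without scanning every element
print(s)
# {0, 1, 2, 4, 5}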

Modify list and dictionary during iteration, why does it fail on dict?

Let's consider this code which iterates over a list while removing an item each iteration:
x = list(range(5))
for i in x:
    print(i)
    x.pop()
It will print 0, 1, 2. Only the first three elements are printed since the last two elements in the list were removed by the first two iterations.
But if you try something similar on a dict:
y = {i: i for i in range(5)}
for i in y:
    print(i)
    y.pop(i)
It will print 0, then raise RuntimeError: dictionary changed size during iteration, because we are removing a key from the dictionary while iterating over it.
Of course, modifying a list during iteration is bad. But why is a RuntimeError not raised as in the case of dictionary? Is there any good reason for this behaviour?
I think the reason is simple. Lists are ordered; dicts (prior to Python 3.6/3.7) and sets are not. So modifying a list as you iterate may not be best practice, but it leads to consistent, reproducible, and guaranteed behaviour.
You could use this; for example, let's say you wanted to split a list with an even number of elements in half and reverse the second half:
>>> lst = [0,1,2,3]
>>> lst2 = [lst.pop() for _ in lst]
>>> lst, lst2
([0, 1], [3, 2])
Of course, there are much better and more intuitive ways to perform this operation, but the point is it works.
By contrast, the behaviour for dicts and sets is totally implementation specific since the iteration order may change depending on the hashing.
You get a RuntimeError with collections.OrderedDict, presumably for consistency with the dict behaviour. I don't think any change in the dict behaviour is likely after Python 3.6 (where dicts are guaranteed to maintain insertion order), since it would break backward compatibility for no real use cases.
Note that collections.deque also raises a RuntimeError in this case, despite being ordered.
It wouldn't have been possible to add such a check to lists without breaking backward compatibility. For dicts, there was no such issue.
In the old, pre-iterators design, for loops worked by calling the sequence element retrieval hook with increasing integer indices until it raised IndexError. (I would say __getitem__, but this was back before type/class unification, so C types didn't have __getitem__.) len isn't even involved in this design, and there is nowhere to check for modification.
When iterators were introduced, the dict iterator had the size change check from the very first commit that introduced iterators to the language. Dicts weren't iterable at all before that, so there was no backward compatibility to break. Lists still went through the old iteration protocol, though.
When list.__iter__ was introduced, it was purely a speed optimization, not intended to be a behavioral change, and adding a modification check would have broken backward compatibility with existing code that relied on the old behavior.
The dictionary keeps insertion order with an additional level of indirection, which causes hiccups when iterating while keys are removed and re-inserted, thereby changing the order and internal pointers of the dictionary.
And this problem is not fixed by iterating over d.keys() instead of d, since in Python 3, d.keys() returns a dynamic view of the keys in the dict, which results in the same problem. Instead, iterate over list(d), as this produces a list from the keys of the dictionary that will not change during iteration.
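A minimal sketch of that pattern, deleting keys safely by snapshotting them first:

y = {i: i for i in range(5)}
for k in list(y):  # list(y) copies the keys, so mutating y is safe
    if k % 2 == 0:
        y.pop(k)
print(y)
# {1: 1, 3: 3}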

List vs dictionary to store zeroes in python

I am solving a problem in which I need a list of zeroes, and after that I have to update some values in the list. I have two options in mind: either simply make a list of zeroes and then update the values, or create a dictionary and then update values.
List method :
l=[0]*n
Dictionary method :
d = {}
for i in range(n):
    d[i] = 0
Now, the complexity of building the dictionary is O(n), and updating a key is O(1). But I don't know how Python builds the list of zeroes using the above method.
Let's assume n is a large number. Which of the above methods will be better for this task, and how is the list method implemented in Python? Also, why is the above list method faster than the list comprehension method for creating a list of zeroes?
Access and update, once you have pre-allocated your sequence, will be roughly the same.
Pick a data structure that makes sense for your application. In this case I suggest a list, because it more naturally fits "a sequence indexed by integers".
The reason [0]*n is fast is that it can make a list of the correct size in one go, rather than constantly expanding the list as more elements are added.
collections.defaultdict may be a better solution if you expect that a lot of elements will keep their initial value during your updates (and if you don't rely on KeyError somehow). Just:
import collections
d = collections.defaultdict(int)
assert d[42] == 0
d[43] = 1
# ...
Another thing to consider is array.array. You can use it if you want to store only elements (counts) of one type. It should be a little faster and more memory-efficient than a list:
import array
l = array.array('L', [0]) * n  # 'L' = unsigned long; multiplication pre-fills n zeroes
# use as list
After running a test using timeit:
import timeit
timeit.repeat("[0]*1000", number=1000000)
#[4.489016328923801, 4.459866205812087, 4.477892545204176]
timeit.repeat("""d={}
for i in range(1000):
d[i]=0""", number=1000000)
#[77.77789647192793, 77.88324065372811, 77.7300221235187]
timeit.repeat("""x={};x.fromkeys(range(1000),0)""", number=1000000)
#[53.62738158027423, 53.87422525293914, 53.50821399216625]
As you can see, there is a HUGE difference between the first two methods, and the third one is better, but still not as fast as the list! The reason is that creating a list with a specified size is far faster than creating a dictionary and expanding it over iteration.
I think in this situation you should just use a list, unless you want to access some data without using an index.
A Python list is an array. It initializes with a specific size; when it needs to store more items than its size can hold, it copies everything to a new array, and the copying is O(k), where k is the size of the list at that point. This process can happen many times before the list reaches a size greater than or equal to n. However, [0]*n creates the array with the right size (n) in one go, so it is faster than growing the list step by step.
As for creation by list comprehension, if you mean something like [0 for i in range(n)], I think it suffers from the same repeated resizing and so is slower.
A Python dictionary is an implementation of a hash table, and it uses a hash function to calculate the hash value of the key when you insert a new key-value pair. The execution of the hash function itself is comparatively expensive, and the dictionary also deals with other situations like collisions, which makes it even slower. Thus, creating the zeroes with a dictionary should be the slowest, in theory.
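A rough sketch of the memory side of the same trade-off (sys.getsizeof reports the container only, and the exact numbers are CPython-specific):

import sys

n = 1000
l = [0] * n
d = {i: 0 for i in range(n)}
print(sys.getsizeof(l))  # compact array of pointers
print(sys.getsizeof(d))  # hash-table slots, typically several times larger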

Python equivalent to java.util.SortedSet?

Does anybody know if Python has an equivalent to Java's SortedSet interface?
Here's what I'm looking for: let's say I have an object of type foo, and I know how to compare two objects of type foo to see whether foo1 is "greater than" or "less than" foo2. I want a way of storing many objects of type foo in a list L, so that whenever I traverse the list L, I get the objects in order, according to the comparison method I define.
Edit:
I guess I can use a dictionary or a list and sort() it every time I modify it, but is this the best way?
Take a look at BTrees. It looks like you need one of them. As far as I understand, you need a structure that supports relatively cheap insertion of elements and cheap sorted traversal (or even no explicit sorting step at all). BTrees offer that.
I have experience with ZODB.BTrees, and they scale to thousands and millions of elements.
You can use insort from the bisect module to insert new elements efficiently in an already sorted list:
from bisect import insort
items = [1,5,7,9]
insort(items, 3)
insort(items, 10)
print(items)  # -> [1, 3, 5, 7, 9, 10]
Note that this does not directly correspond to SortedSet, because it uses a list. If you insert the same item more than once you will have duplicates in the list.
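If set-like semantics are needed, one way is to check before inserting; a minimal sketch with a hypothetical helper (insort_unique is not part of the bisect module):

from bisect import bisect_left

def insort_unique(sorted_list, value):
    # hypothetical helper: insert only if value is not already present
    i = bisect_left(sorted_list, value)
    if i == len(sorted_list) or sorted_list[i] != value:
        sorted_list.insert(i, value)

items = [1, 3, 5]
insort_unique(items, 3)  # ignored: already present
insort_unique(items, 4)
print(items)  # -> [1, 3, 4, 5]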
If you're looking for an efficient container type for Python implemented using something like a balanced search tree (a red-black tree, for example), then it's not part of the standard library.
I was able to find this, though:
http://www.brpreiss.com/books/opus7/
The source code is available here:
http://www.brpreiss.com/books/opus7/public/Opus7-1.0.tar.gz
I don't know how the source code is licensed, and I haven't used it myself, but it would be a good place to start looking if you're not interested in rolling your own container classes.
There's PyAVL, which is a C module implementing an AVL tree.
Also, this thread might be useful to you. It contains a lot of suggestions on how to use the bisect module to enhance the existing Python dictionary to do what you're asking.
Of course, using insort() that way would be pretty expensive for insertion and deletion, so consider it carefully for your application. Implementing an appropriate data structure would probably be a better approach.
In any case, to understand whether you should keep the data structure sorted or sort it when you iterate over it, you'll have to know whether you intend to insert a lot or iterate a lot. Keeping the data structure sorted makes sense if you modify its content relatively infrequently but iterate over it a lot. Conversely, if you insert and delete members all the time but iterate over the collection relatively infrequently, sorting the collection of keys before iterating will be faster. There is no one correct approach.
Similar to blist.sortedlist, the sortedcontainers module provides sorted list, sorted set, and sorted dict data types. It uses a modified B-tree in the underlying implementation and is faster than blist in most cases.
The sortedcontainers module is pure-Python so installation is easy:
pip install sortedcontainers
Then for example:
from sortedcontainers import SortedList, SortedDict, SortedSet
help(SortedList)
The sortedcontainers module has 100% test coverage and hours of stress testing. There's a pretty comprehensive performance comparison that lists most of the options you'd consider for this.
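For the SortedSet question specifically, a minimal usage sketch:

from sortedcontainers import SortedSet

ss = SortedSet([5, 1, 3])
ss.add(2)
ss.add(3)        # already present: a set keeps one copy
print(list(ss))
# [1, 2, 3, 5]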
If you only need the keys, and no associated value, Python offers sets:
s = set(a_list)
for k in sorted(s):
    print(k)
However, you'll be sorting the set each time you do this.
If that is too much overhead, you may want to look at heap queues (the heapq module). They may not be as elegant and "Pythonic", but maybe they suit your needs.
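A minimal heapq sketch, assuming you only need the smallest element to come out first (the underlying list itself is not kept fully sorted):

import heapq

h = []
for v in [5, 1, 7, 3]:
    heapq.heappush(h, v)  # O(log n) per push
print([heapq.heappop(h) for _ in range(len(h))])
# [1, 3, 5, 7]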
Use blist.sortedlist from the blist package.
from blist import sortedlist
z = sortedlist([2, 3, 5, 7, 11])
z.add(6)
z.add(3)
z.add(10)
print(z)
This will output:
sortedlist([2, 3, 3, 5, 6, 7, 10, 11])
The resulting object can be used just like a Python list.
>>> len(z)
8
>>> [2 * x for x in z]
[4, 6, 6, 10, 12, 14, 20, 22]
Do you have the possibility of using Jython? I only mention it because using TreeMap, TreeSet, etc. is trivial there. Also, if you're coming from a Java background and you want to head in a Pythonic direction, Jython is wonderful for making the transition easier, though I recognise that use of TreeSet in this case would not be part of such a "transition".
For Jython superusers I have a question myself: the blist package can't be imported because it relies on a C extension. Would there be any advantage to using blist instead of TreeSet? Can we generally assume the JVM uses algorithms that are essentially as good as those in CPython?
