Anyone know this Python data structure?

The Python class has six requirements, as listed below:
Close to O(1) performance for as many of the following four operations as possible.
Maintaining sorted order while inserting an object into the container.
Ability to peek at last value (the largest value) contained in the object.
Allowing for pops on both sides (getting the smallest or largest values).
Capability of getting the total size or number of objects being stored.
Being a ready made solution like the code in Python's standard library.
What follows is left here for historical reasons (to help the curious and to show that research was conducted).
After looking through Python's Standard Library (specifically the section on Data Types), I still have not found a class that fulfills the requirements of a fragmentation table. collections.deque is close to what is required, but it does not support keeping its data sorted. It provides:
Efficient append and pops on either side of a deque with O(1) performance.
Pops on both sides for the data contained within the object.
Getting the total size or count of objects contained within.
Implementing an inefficient solution using lists would be trivial, but finding a class that performs well would be far more desirable. In a growing memory simulation with no upper limit, such a class could keep indexes of empty (deleted) cells and keep fragmentation levels down. The bisect module may help:
Helps keep an array in sorted order while inserting new objects into the array.
Ready made solution for keeping lists sorted as objects are added.
Would allow executing array[-1] to peek at last value in the array.
The final candidate, which failed to fully satisfy the requirements and appeared least promising, was the heapq module. While it supports what look like efficient insertions and ensures that array[0] is the smallest value, the array is not always in a fully sorted state. Nothing else was found to be nearly as helpful.
Does anyone know of a class or data structure in Python that comes close to these six requirements?

Your requirements seem to be:
O(1) pop from each end
Efficient len
Sorted order
Peek at last value
for which you can use a deque with a custom insert method which rotates the deque, appends to one end, and unrotates.
>>> from collections import deque
>>> import bisect
>>> class FunkyDeque(deque):
...     def _insert(self, index, value):
...         self.rotate(-index)
...         self.appendleft(value)
...         self.rotate(index)
...
...     def insert(self, value):
...         self._insert(bisect.bisect_left(self, value), value)
...
...     def __init__(self, iterable):
...         super(FunkyDeque, self).__init__(sorted(iterable))
...
>>> foo = FunkyDeque([3,2,1])
>>> foo
deque([1, 2, 3])
>>> foo.insert(2.5)
>>> foo
deque([1, 2, 2.5, 3])
Notice that requirements 1, 2, and 4 all follow directly from the fact that the underlying data structure is a deque, and requirement 3 holds because of the way data is inserted. (Note of course that you could bypass the sorting requirement by calling e.g. _insert, but that's beside the point.)

Many thanks go out to katrielalex for providing the inspiration that led to the following Python class:
import collections
import bisect

class FastTable:
    def __init__(self):
        self.__deque = collections.deque()

    def __len__(self):
        return len(self.__deque)

    def head(self):
        return self.__deque.popleft()

    def tail(self):
        return self.__deque.pop()

    def peek(self):
        return self.__deque[-1]

    def insert(self, obj):
        index = bisect.bisect_left(self.__deque, obj)
        self.__deque.rotate(-index)
        self.__deque.appendleft(obj)
        self.__deque.rotate(index)
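For example, a hypothetical interactive session (the behaviour follows from the deque and bisect calls above):
>>> table = FastTable()
>>> for value in (3, 1, 2):
...     table.insert(value)
...
>>> table.peek()
3
>>> table.head()
1
>>> table.tail()
3
>>> len(table)
1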

blist.sortedlist
Close to O(1) performance for as many of the following four operations as possible.
Maintaining sorted order while inserting an object into the container.
Ability to peek at last value (the largest value) contained in the object.
Allowing for pops on both sides (getting the smallest or largest values).
Capability of getting the total size or number of objects being stored.
Being a ready made solution like the code in Python's standard library.
It's a B+ Tree.
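A sketch of how it meets them, assuming the third-party blist package is installed (pip install blist):
>>> from blist import sortedlist
>>> t = sortedlist([3, 1, 2])
>>> t.add(2.5)    # stays sorted as items are inserted
>>> t[-1]         # peek at the largest value
3
>>> t.pop(0)      # pop the smallest value
1
>>> t.pop()       # pop the largest value
3
>>> len(t)
2
The operations are O(log n) rather than strictly O(1), but for most workloads that is close enough.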


Is there any usage of self-referential lists or circular reference in list, eg. appending a list to itself

So if I have a list a and append a to it, I will get a list that contains its own reference.
>>> a = [1,2]
>>> a.append(a)
>>> a
[1, 2, [...]]
>>> a[-1][-1][-1]
[1, 2, [...]]
And this basically results in seemingly infinite recursions.
And not only in lists, dictionaries as well:
>>> b = {'a':1,'b':2}
>>> b['c'] = b
>>> b
{'a': 1, 'b': 2, 'c': {...}}
It might seem like a good way to store the list in its own last element and modify the other elements through it, but that wouldn't work as hoped: any change is seen through every recursive reference, because they are all the same object.
I get why this happens, i.e. due to their mutability. However, I am interested in actual use-cases of this behavior. Can somebody enlighten me?
The use case is that Python is a dynamically typed language, where anything can reference anything, including itself.
List elements are references to other objects, just like variable names, attributes, and the keys and values in dictionaries. The references are not typed; variables and lists are not restricted to referencing only, say, integers or floating point values. Every reference can reference any valid Python object. (Python is also strongly typed, in that the objects have a specific type that won't just change: strings remain strings, lists stay lists.)
So, because Python is dynamically typed, the following:
foo = []
# ...
foo = False
is valid, because foo isn't restricted to a specific type of object, and the same goes for Python list objects.
The moment your language allows this, you have to account for recursive structures, because containers are allowed to reference themselves, directly or indirectly. The list representation takes this into account by not blowing up when you ask for a string representation; it instead shows a [...] entry when there is a circular reference. This happens not just for direct references either; you can create an indirect reference too:
>>> foo = []
>>> bar = []
>>> foo.append(bar)
>>> bar.append(foo)
>>> foo
[[[...]]]
foo is the outermost pair of brackets plus the [...] entry; bar is the pair of brackets in the middle.
There are plenty of practical situations where you'd want a self-referencing (circular) structure. The built-in OrderedDict object uses a circular linked list to track item order, for example. This is not normally easily visible as there is a C-optimised version of the type, but we can force the Python interpreter to use the pure-Python version (you want to use a fresh interpreter, this is kind-of hackish):
>>> import sys
>>> class ImportFailedModule:
... def __getattr__(self, name):
... raise ImportError
...
>>> sys.modules["_collections"] = ImportFailedModule() # block the extension module from being loaded
>>> del sys.modules["collections"] # force a re-import
>>> from collections import OrderedDict
Now we have a pure-Python version we can introspect:
>>> od = OrderedDict()
>>> vars(od)
{'_OrderedDict__hardroot': <collections._Link object at 0x10a854e00>, '_OrderedDict__root': <weakproxy at 0x10a861130 to _Link at 0x10a854e00>, '_OrderedDict__map': {}}
Because this ordered dict is empty, the root references itself:
>>> od._OrderedDict__root.next is od._OrderedDict__root
True
just like a list can reference itself. Add a key or two and the linked list grows, but remains linked to itself, eventually:
>>> od["foo"] = "bar"
>>> od._OrderedDict__root.next is od._OrderedDict__root
False
>>> od._OrderedDict__root.next.next is od._OrderedDict__root
True
>>> od["spam"] = 42
>>> od._OrderedDict__root.next.next is od._OrderedDict__root
False
>>> od._OrderedDict__root.next.next.next is od._OrderedDict__root
True
The circular linked list makes it easy to alter the key ordering without having to rebuild the whole underlying hash table.
However, I am interested in actual use-cases of this behavior. Can somebody enlighten me?
I don't think there are many useful use cases for this. The reason it is allowed is that there could be some actual use cases, and forbidding it would make the performance of these containers worse or increase their memory usage.
Python is dynamically typed and you can add any Python object to a list. That means one would need to take special precautions to forbid adding a list to itself. This is different from (most) statically typed languages, where the type system can often rule it out.
So in order to forbid such recursive data structures, one would need to check, on every addition/insertion/mutation, whether the newly added object already participates in a higher layer of the data structure. That means in the worst case checking everywhere the newly added element could participate in a recursive data structure. The problem is that the same list can be referenced in multiple places, can already be part of multiple data structures, and data structures such as list/dict can be (almost) arbitrarily deep. Such detection would be either slow (e.g. a linear search) or would take quite a bit of memory (a lookup table). So it's cheaper to simply allow it.
The reason why Python detects this when printing is that you don't want the interpreter entering an infinite loop or hitting a RecursionError or stack overflow. That's why for some operations like printing (but also deepcopy) Python temporarily creates a lookup structure to detect these recursive data structures and handle them appropriately.
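For example, both repr and copy.deepcopy handle a self-referential list gracefully:
>>> import copy
>>> a = [1, 2]
>>> a.append(a)
>>> a                    # repr detects the cycle instead of recursing forever
[1, 2, [...]]
>>> b = copy.deepcopy(a) # deepcopy keeps a memo of objects it has seen
>>> b[2] is b            # the copy is self-referential too
True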
Consider building a state machine that parses a string of digits and checks whether the number is divisible by 25. You could model each node as a list with ten outgoing edges, one per digit, with some edges pointing back to the node itself:
def canDiv25(s):
    # Four states: n0 = start/other, n1 = last digit was 0 or 5,
    # n2 = last digit was 2 or 7, n1g = accepting (ends in 00, 25, 50 or 75).
    n0, n1, n1g, n2 = [], [], [], []
    n0.extend((n1, n0, n2, n0, n0, n1, n0, n2, n0, n0))
    n1.extend((n1g, n0, n2, n0, n0, n1, n0, n2, n0, n0))
    n1g.extend(n1)   # n1g transitions like n1, and n1 already refers to n1g
    n2.extend((n1, n0, n2, n0, n0, n1g, n0, n2, n0, n0))
    cn = n0
    for c in s:
        cn = cn[int(c)]   # follow the edge for this digit
    return cn is n1g

for i in range(144):
    print("%d %d" % (i, canDiv25(str(i))), end='\t')
While this state machine by itself has little practical use, it shows what is possible. Alternatively, you could have a simple adventure game where each room is represented as a dictionary: you can go, for example, NORTH, but that room of course has a back link to SOUTH. Also, game developers sometimes simulate a tricky path in a dungeon by making the NORTH exit of a room point back to the room itself, as sketched below.
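A minimal sketch of such rooms (the room names and keys are illustrative):
cave = {"name": "cave"}
tunnel = {"name": "tunnel"}
cave["NORTH"] = tunnel
tunnel["SOUTH"] = cave     # back link: an indirect circular reference
tunnel["NORTH"] = tunnel   # the "tricky path" that leads back into the same room
room = cave
room = room["NORTH"]       # now in the tunnel
room = room["NORTH"]       # still in the tunnel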
A very simple application of this would be a circular linked list where the last node in a list references the first node. These are useful for creating infinite resources, state machines or graphs in general.
def to_circular_list(items):
    head, *tail = items
    first = {"elem": head}
    current = first
    for item in tail:
        current['next'] = {"elem": item}
        current = current['next']
    current['next'] = first
    return first

to_circular_list([1, 2, 3, 4])
If it's not obvious how that relates to having a self-referencing object, think about what would happen if you only called to_circular_list([1]), you would end up with a data structure that looks like
item = {
    "elem": 1,
    "next": item
}
If the language didn't support this kind of direct self referencing, it would be impossible to use circular linked lists and many other concepts that rely on self references as a tool in Python.
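For instance, a consumer can walk the circle built by to_circular_list above indefinitely (a small demo):
node = to_circular_list([1, 2, 3])
for _ in range(7):
    print(node["elem"], end=" ")   # prints: 1 2 3 1 2 3 1
    node = node["next"]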
The reason this is possible is simply because the syntax of Python doesn't prohibit it, much in the way any C or C++ object can contain a reference to itself. An example might be: https://www.geeksforgeeks.org/self-referential-structures/
As @MSeifert said, you will generally get a RecursionError at some point if you're trying to access the list repeatedly from itself. Code that uses this pattern, like the following:
a = [1, 2]
a.append(a)

def loop(l):
    for item in l:
        if isinstance(item, list):
            loop(item)   # descends into the nested list (here, back into a itself)
        else:
            print(item)

will eventually crash with a RecursionError without some sort of termination condition. (Note that print(a) itself does not crash: as shown above, the repr machinery detects the cycle and prints [...].) However:
a = [1, 2]
while True:
    for item in a:
        print(item)
will run infinitely with the same expected output as the above. Very few recursive problems don't unravel into a simple while loop. For an example of recursive problems that do require a self-referential structure, look up Ackermann's function: http://mathworld.wolfram.com/AckermannFunction.html. This function could be modified to use a self-referential list.
There is certainly precedent for self-referential containers or tree structures, particularly in math, but on a computer they are all limited by the size of the call stack and CPU time, making it impractical to investigate them without some sort of constraint.

Ordered data structure with fixed size and pop/push simultaneous operation

What Python data structure do you recommend with these requirements:
Fixed size defined at init
Ordered data
Adding data at the beginning of the structure removes the data at its end (as in a queue)
Adding data at the beginning returns the data dropped from the end
Any data can be accessed, but not removed
The structure can be cleared
I looked at structures like lists, but they do not provide what I want. deque seems great, but I don't specifically need to add or remove elements at both sides.
I would like something like that:
last_object = my_struct.add(new_first_object)
Any ideas?
collections.deque initialized with maxlen is exactly what you need, it can do the operations you need in O(1), and the "removing at the end" is taken care of:
If maxlen is not specified or is None, deques may grow to an arbitrary length. Otherwise, the deque is bounded to the specified maximum length. Once a bounded length deque is full, when new items are added, a corresponding number of items are discarded from the opposite end.
just use .append and nothing else.
as per the docs, it also supports "peeking" at [0] or [-1] in O(1) (for the "Adding the data at the beginning return the data at it's end" requirement)
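For example (standard library behaviour, shown interactively):
>>> from collections import deque
>>> d = deque(maxlen=3)
>>> for i in (1, 2, 3, 4):
...     d.append(i)
...
>>> d                # the 1 was discarded from the opposite end
deque([2, 3, 4], maxlen=3)
>>> d[0], d[-1]      # O(1) peeks at both ends
(2, 4)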
If you really don't want any other methods to exist on your class (e.g. so IDEs don't auto-complete anything other than your add), you can wrap the deque in your own custom class that has just an add method calling the deque's append.
example:
from collections import deque

class MyCollection:
    def __init__(self, maxlen):
        self.d = deque(maxlen=maxlen)

    def add(self, new_first_object):
        # Only a full deque evicts its oldest element on append.
        result = self.d[0] if len(self.d) == self.d.maxlen else None
        self.d.append(new_first_object)
        return result

my_struct = MyCollection(3)
my_struct.add(1)
my_struct.add(2)
my_struct.add(3)  # my_struct is now full
print(my_struct.add(4))
print(my_struct.add(5))
print(my_struct.add(6))
Output:
1
2
3

What is the overhead of using a dictionary instead of a list?

I have a situation in one of my projects where I can use either lists or dictionaries, and I am having a hard time picking which one to use.
I am analyzing a large number of items (>400k), and I will have a (>400k)-entry list or dictionary which I will use very frequently (get/set/update).
In my particular situation, using a dictionary feels more convenient than a list if I don't think about performance at all. However, I know I could write the same thing using lists.
Should I go for readability and use dictionaries, or may dictionaries add too much overhead, dramatically decreasing my performance in terms of both memory and time?
I know this question is a bit too-broad. But I wanted to ask it before I start building all my logic after having this decision done.
My situation in a nutshell:
I have values for keys 0, 1, ..., n. For now, keys will always be integers from 0 to n, which I can keep in a list.
However, I can think of some situations that might arise in the future where I will need to keep some items for keys which are not integers, or integers which are not consecutive.
So the question is: if using dictionaries instead of lists from the start wouldn't add much memory/time cost, I will go with dictionaries. However, I am not sure whether a >400k-entry dictionary vs. a >400k-entry list makes a big difference in terms of performance.
In direct answer to your question: dictionaries have significantly more overhead than lists:
Each item consumes memory for both key and value, in contrast to only values for lists.
Adding or removing an item requires consulting a hash table.
Despite the fact that Python dictionaries are extremely well-designed and surprisingly fast, if you have an algorithm that can use direct index, you will save space and time.
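A quick sketch for checking both costs on your own machine (absolute numbers vary by machine and Python version):
import sys
import timeit

setup = "L = list(range(400000)); D = dict.fromkeys(range(400000))"
print(timeit.timeit("L[200000]", setup=setup))       # direct index into the list
print(timeit.timeit("D[200000]", setup=setup))       # hash-table lookup
print(sys.getsizeof(list(range(400000))))            # shallow size of the list
print(sys.getsizeof(dict.fromkeys(range(400000))))   # shallow size of the dict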
However, from the sound of your question and subsequent discussion, it sounds like your needs may change over time and you have some uncertainty ("I can think of some situations that might arise in the future where I will need to keep some items for keys which are not integers")
If this is the case, I suggest creating a hybrid data structure of your own so that as your needs evolve you can address the efficiency of storage in an isolated place while allowing your application to use simple, readable code to store and retrieve objects.
For example, here is a Python 3 class called maybelist that is derived from a list, but detects the presence of non-numeric keys, storing exceptions in a dictionary while providing mappings for some common list operations:
import itertools

class maybelist(list):
    def __init__(self, *args):
        super().__init__(*args)
        self._extras = dict()

    def __setitem__(self, index, val):
        try:
            super().__setitem__(index, val)
            return
        except TypeError:
            # Index is not an integer, store in dict
            self._extras[index] = val
            return
        except IndexError:
            pass
        # Out-of-range integer index: pad with None, then append.
        distance = index - super().__len__()
        if distance > 0:
            # Put 'None' in empty slots if need be
            self.extend((None,) * distance)
        self.append(val)

    def __getitem__(self, index):
        try:
            return super().__getitem__(index)
        except TypeError:
            return self._extras[index]

    def __str__(self):
        return str([item for item in self])

    def __len__(self):
        return super().__len__() + len(self._extras)

    def __iter__(self):
        for item in itertools.chain(super().__iter__(), self._extras):
            yield item
So, you could treat it like an array, and have it auto expand:
>>> x = maybelist()
>>> x[0] = 'first'
>>> x[1] = 'second'
>>> x[10] = 'eleventh'
>>> print(x)
['first', 'second', None, None, None, None, None, None, None, None, 'eleventh']
>>> print(x[10])
eleventh
Or you could add items with non-numeric keys if they were present:
>>> x['unexpected'] = 'something else'
>>> print(x['unexpected'])
something else
And yet have the object appear to behave properly if you access it using iterators or other methods of your choosing:
>>> print(x)
['first', 'second', None, None, None, None, None, None, None, None, 'eleventh', 'unexpected']
>>> print(len(x))
12
This is just an example, and you would need to tailor such a class to meet the needs of your application. For example, the resulting object does not strictly behave like a list (x[len(x)-1] is not the last item, for example). However, your application may not need such strict adherence, and if you are careful and plan properly, you can create an object which both provides highly optimized storage while leaving room for evolving data structure needs in the future.
dict uses a lot more memory than a list. Probably not enough to be a concern if the computer isn't very busy. There are exceptions, of course: if it's a web server handling 100 connections per second, you may want to consider saving memory at the expense of readability (Python 2 shown; exact sizes vary by version):
>>> import sys
>>> L = range(400000)
>>> sys.getsizeof(L)
3200072 # ~3 megabytes
>>> D = dict(zip(range(400000), range(400000)))
>>> sys.getsizeof(D)
25166104 # ~25 megabytes
Lists are what they seem: a list of values. In a dictionary, you have an 'index' of words, and for each of them a definition.
Dictionaries share a lot with lists, but their properties differ because they map keys to values. That means you use a dictionary when:
You have to retrieve things based on some identifier, like names, addresses, or anything that can be a key.
You don't need things to be in order. Dictionaries historically had no notion of order (CPython 3.7+ preserves insertion order, but dicts still cannot be reordered or sorted in place), so use a list when ordering matters.
You are going to be adding and removing elements and their keys.
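A quick illustration of the key-based access that motivates a dict (hypothetical data):
ages = {"alice": 31, "bob": 27}
ages["carol"] = 40        # add a key/value pair
del ages["bob"]           # remove by key
print(ages["alice"])      # 31: O(1) lookup by identifier, not by position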
Efficiency constraints are discussed in the Stack Overflow posts Link1 & Link2.
Go for a dictionary, as you have doubts regarding future keys; also, there are no memory constraints to worry about.
Not exactly a spot-on answer to your not-so-clear question, but here are my thoughts:
You said
I am analyzing large number of items (>400k)
In that case, I'd advise you to use generators and/or process your data in chunks.
A better option would be to put your data, which are key-value pairs, in Redis and take out chunks of it at a time. Redis can handle your volume of data very easily.
You could write a script that processes one chunk at a time and, using the concurrent.futures module, parallelizes the chunk processing.
Something like this:
from concurrent import futures

def chunk_processor(data):
    """
    Process your list data here
    """
    pass

def parallelizer(map_func, your_data_list, n_workers=3):
    with futures.ThreadPoolExecutor(max_workers=n_workers) as executor:
        for result in executor.map(map_func, your_data_list):
            # Do whatever with your result
            pass

# Take out chunks of your data from Redis here
chunk_of_list = get_next_chunk_from_redis()

# Your processing starts here
parallelizer(chunk_processor, chunk_of_list)
Again, something better could be done, but I'm presenting you one of the ways to go about it.

LFU cache implementation in python

I have implemented an LFU cache in Python with the help of the priority queue implementation given at
https://docs.python.org/2/library/heapq.html#priority-queue-implementation-notes
I have given the code at the end of the post.
But I feel that the code has some serious problems:
1. To give a scenario, suppose only one page is continuously getting visited (say 50 times). But this code will always mark the already-added node as "removed" and push a fresh node onto the heap. So basically it will have 50 different nodes for the same page, increasing the heap size enormously.
2. This question is almost identical to Q1 of the telephonic interview at
http://www.geeksforgeeks.org/flipkart-interview-set-2-sde-2/
and the person mentioned that a doubly linked list can give better efficiency than a heap. Can anyone explain to me how?
from llist import dllist
import sys
from heapq import heappush, heappop

# Python 2 code, as in the original question (raw_input, print statement, has_key).
class LFUCache:
    REMOVED = "<removed-task>"

    def __init__(self, cache_size):
        self.cache_size = cache_size
        # Instance attributes: class-level mutables would be shared
        # between every LFUCache instance.
        self.heap = []
        self.cache_map = {}

    def get_page_content(self, page_no):
        if self.cache_map.has_key(page_no):
            self.update_frequency_of_page_in_cache(page_no)
        else:
            self.add_page_in_cache(page_no)
        return self.cache_map[page_no][2]

    def add_page_in_cache(self, page_no):
        if len(self.cache_map) == self.cache_size:
            self.delete_page_from_cache()
        heap_node = [1, page_no, "content of page " + str(page_no)]
        heappush(self.heap, heap_node)
        self.cache_map[page_no] = heap_node

    def delete_page_from_cache(self):
        while self.heap:
            count, page_no, page_content = heappop(self.heap)
            if page_content is not self.REMOVED:
                del self.cache_map[page_no]
                return

    def update_frequency_of_page_in_cache(self, page_no):
        heap_node = self.cache_map[page_no]
        heap_node[2] = self.REMOVED   # mark the stale heap entry as dead
        count = heap_node[0]
        heap_node = [count + 1, page_no, "content of page " + str(page_no)]
        heappush(self.heap, heap_node)
        self.cache_map[page_no] = heap_node

def main():
    cache_size = int(raw_input("Enter cache size "))
    cache = LFUCache(cache_size)
    while 1:
        page_no = int(raw_input("Enter page no needed "))
        print cache.get_page_content(page_no)
        print cache.heap, cache.cache_map, "\n"

if __name__ == "__main__":
    main()
Efficiency is a tricky thing. In real-world applications, it's often a good idea to use the simplest and easiest algorithm, and only start to optimize when that's measurably slow. And then you optimize by doing profiling to figure out where the code is slow.
If you are using CPython, it gets especially tricky, as even an inefficient algorithm implemented in C can beat an efficient algorithm implemented in Python due to the large constant factors; e.g. a double-linked list implemented in Python tends to be a lot slower than simply using the normal Python list, even for cases where in theory it should be faster.
Simple algorithm:
For an LFU, the simplest algorithm is to use a dictionary that maps keys to (item, frequency) objects, and update the frequency on each access. This makes access very fast (O(1)), but pruning the cache is slower as you need to sort by frequency to cut off the least-used elements. For certain usage characteristics, this is actually faster than other "smarter" solutions, though.
You can optimize for this pattern by not simply pruning your LFU cache to the maximum length, but to prune it to, say, 50% of the maximum length when it grows too large. That means your prune operation is called infrequently, so it can be inefficient compared to the read operation.
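A minimal sketch of that simple algorithm, including the prune-to-half trick (the class and method names here are mine):
class SimpleLFU:
    def __init__(self, max_size):
        self.max_size = max_size
        self.data = {}                  # key -> [value, frequency]

    def get(self, key):
        entry = self.data[key]          # raises KeyError on a miss
        entry[1] += 1                   # O(1) frequency update on access
        return entry[0]

    def put(self, key, value):
        if len(self.data) >= self.max_size:
            self._prune()
        self.data[key] = [value, 0]

    def _prune(self):
        # Called rarely; sort by frequency and keep the most-used half.
        by_freq = sorted(self.data.items(), key=lambda kv: kv[1][1], reverse=True)
        self.data = dict(by_freq[: self.max_size // 2])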
Using a heap:
In (1), you used a heap because that's an efficient way of storing a priority queue. But you are not implementing a priority queue. The resulting algorithm is optimized for pruning, but not access: You can easily find the n smallest elements, but it's not quite as obvious how to update the priority of an existing element. In theory, you'd have to rebalance the heap after every access, which is highly inefficient.
To avoid that, you added a trick by keeping elements around even after they are deleted. But this trades space for time.
If you don't want to spend that space, you could update the frequencies in-place and simply rebalance the heap before pruning the cache. You regain fast access times at the expense of slower pruning time, like the simple algorithm above. (I doubt there is any speed difference between the two, but I have not measured this.)
Using a double-linked list:
The double-linked list mentioned in (2) takes advantage of the nature of the possible changes here: An element is either added as the lowest priority (0 accesses), or an existing element's priority is incremented exactly by 1. You can use these attributes to your advantage if you design your data structures like this:
You have a double-linked list of elements which is ordered by the frequency of the elements. In addition, you have a dictionary that maps items to elements within that list.
Accessing an element then means:
Either it's not in the dictionary, that is, it's a new item, in which case you can simply append it to the end of the double-linked list (O(1))
or it's in the dictionary, in which case you increment the frequency in the element and move it leftwards through the double-linked list until the list is ordered again (O(n) worst-case, but usually closer to O(1)).
To prune the cache, you simply cut off n elements from the end of the list (O(n)).
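A rough sketch of that doubly-linked-list design (the class and method names are mine; pruning, which would walk from the tail deleting nodes and their dictionary entries, is omitted for brevity):
class _Node:
    __slots__ = ("key", "value", "freq", "prev", "next")
    def __init__(self, key, value):
        self.key, self.value, self.freq = key, value, 0
        self.prev = self.next = None

class LinkedLFU:
    """Linked list kept ordered by frequency; the head is the most used."""
    def __init__(self):
        self.head = self.tail = None
        self.map = {}                        # item key -> list node

    def access(self, key, value=None):
        node = self.map.get(key)
        if node is None:
            # New item: append at the low-frequency end, O(1).
            node = self.map[key] = _Node(key, value)
            if self.tail is None:
                self.head = self.tail = node
            else:
                node.prev = self.tail
                self.tail.next = node
                self.tail = node
        else:
            node.freq += 1
            # Bubble the node leftwards until the list is ordered again.
            while node.prev is not None and node.prev.freq < node.freq:
                self._swap_with_prev(node)
        return node.value

    def _swap_with_prev(self, node):
        p, nxt = node.prev, node.next
        p.next = nxt                         # p takes node's old place
        if nxt is not None:
            nxt.prev = p
        else:
            self.tail = p
        node.prev = p.prev                   # node takes p's old place
        if p.prev is not None:
            p.prev.next = node
        else:
            self.head = node
        p.prev = node
        node.next = p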

O(1) indexable deque of integers in Python

What are my options? I need to call a lot of appends (to the right end) and poplefts (from the left end, naturally), but also to read from the middle of the storage, which will steadily grow by the nature of the algorithm. I would like all these operations to be O(1).
I could implement it easily enough in C as a circularly-addressed array (i.e. a ring buffer) that grows automatically when full; but what about Python? Pointers to other languages are appreciated too (I realize the "collections" tag is more Java-oriented and would appreciate the comparison, but as a secondary goal).
I come from a Lisp background and was amazed to learn that in Python, removing the head element of a list is an O(n) operation. A deque could be an answer, except the documentation says access is O(n) in the middle. Is there anything else, pre-built?
You can get an amortized O(1) data structure by using two Python lists, one holding the left half of the deque and the other holding the right half. The left half is stored reversed, so the left end of the deque is at the back of that list. Something like this:
class mydeque(object):
    def __init__(self):
        self.left = []
        self.right = []

    def pushleft(self, v):
        self.left.append(v)

    def pushright(self, v):
        self.right.append(v)

    def popleft(self):
        if not self.left:
            self.__fill_left()
        return self.left.pop()

    def popright(self):
        if not self.right:
            self.__fill_right()
        return self.right.pop()

    def __len__(self):
        return len(self.left) + len(self.right)

    def __getitem__(self, i):
        if i >= len(self.left):
            return self.right[i - len(self.left)]
        else:
            return self.left[-(i + 1)]

    def __fill_right(self):
        # Move half the other list over, rounding up so that a single
        # remaining element still moves.
        x = (len(self.left) + 1) // 2
        self.right.extend(self.left[0:x])
        self.right.reverse()
        del self.left[0:x]

    def __fill_left(self):
        x = (len(self.right) + 1) // 2
        self.left.extend(self.right[0:x])
        self.left.reverse()
        del self.right[0:x]
I'm not 100% sure if the interaction between this code and the amortized performance of python's lists actually result in O(1) for each operation, but my gut says so.
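A quick interactive check of the class above (expected output shown):
>>> dq = mydeque()
>>> for v in (1, 2, 3):
...     dq.pushright(v)
...
>>> dq.pushleft(0)
>>> [dq[i] for i in range(len(dq))]
[0, 1, 2, 3]
>>> dq.popleft(), dq.popright()
(0, 3)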
Accessing the middle of a Lisp list is also O(n).
Python lists are array lists, which is why popping the head is expensive (popping the tail is constant time).
What you are looking for is an array with (amortised) constant-time deletions at the head; that basically means you are going to have to build a data structure on top of list that uses lazy deletion, and is able to recycle lazily-deleted slots when the queue is empty.
Alternatively, use a hashtable and a couple of integers to keep track of the current contiguous range of keys.
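A minimal sketch of that hashtable idea (the class name and method set are mine):
class IndexableQueue:
    def __init__(self):
        self.items = {}
        self.head = 0        # key of the current first element
        self.tail = 0        # key one past the current last element

    def append(self, v):     # O(1) amortized
        self.items[self.tail] = v
        self.tail += 1

    def popleft(self):       # O(1) amortized; the slot is truly freed
        v = self.items.pop(self.head)
        self.head += 1
        return v

    def __getitem__(self, i):  # O(1) read from the middle
        return self.items[self.head + i]

    def __len__(self):
        return self.tail - self.head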
Python's Queue module may help you, although I'm not sure whether access is O(1).
