Python custom ordered data structure

I am wondering whether Python has a data structure that lets me impose a custom internal ordering policy. I am aware of OrderedDict and the like, but they do not explicitly provide what I am asking for. For example, OrderedDict only guarantees insertion order.
What I would really like is what C++ provides through a comparison object: for example, in std::set<Type,Compare,Allocator>, Compare is a parameter that defines the internal ordering of the data structure. Usually, or probably always, it is a binary predicate that is evaluated for a pair of elements belonging to the data structure.
Is there something similar in Python? Do you know any workaround?

SortedSet and the other containers in the sortedcontainers package support a key function:
>>> SortedSet([-3, 1, 4, 1], key=abs)
SortedSet([1, -3, 4], key=<built-in function abs>)
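If the ordering is easier to express as a C++-style comparator (a function of two elements) than as a key, the standard library's functools.cmp_to_key can wrap it. The following is only a sketch: by_abs_then_value is a hypothetical comparator made up for illustration, and it sorts a plain list rather than maintaining a sorted container.

```python
from functools import cmp_to_key

def by_abs_then_value(a, b):
    """Hypothetical comparator: negative if a sorts first, positive if b does, 0 if tied."""
    if abs(a) != abs(b):
        return abs(a) - abs(b)
    return a - b  # tie-break so that -1 sorts before 1

values = [-3, 1, 4, -1]
ordered = sorted(values, key=cmp_to_key(by_abs_then_value))
print(ordered)  # [-1, 1, -3, 4]
```

The object returned by cmp_to_key is an ordinary key function, so it should be usable anywhere a key= argument is accepted.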

Related

Access the nth element of an object in Python

Here's the json
{u'skeleton_horde': 2, u'baby_dragon': 3, u'valkyrie': 5, u'chr_witch': 1, u'lightning': 1, u'order_volley': 6, u'building_inferno': 3, u'battle_ram': 2}
I'm trying to make the list look like this
skeleton_horde baby_dragon valkyrie lightning order_volley building_inferno
Here's the python
print(x['right']['troops'])
There's surprisingly no documentation on how to get the n element of an object (not array). I tried:
print(x['right']['troops'][1])
but it doesn't work.

First you want to extract the keys:
x['right']['troops']
Then you want to join them with spaces interspersed
' '.join(x['right']['troops'])
This may come out in a different order than what you have, though, since Python dictionaries were unordered before Python 3.7.

You want to use dict.keys() to get a list* of the keys of the dict:
print(list(x['right']['troops'].keys()))
*It's actually a view, in Python 3. It would be a list in Python 2.

There's no way of getting the nth item in a dictionary (perhaps you've conflated Python dicts with JavaScript objects) for the simple reason that they are unordered.
There is however a type of dictionary that does maintain the order of its keys, aptly named OrderedDict.
Solution
As another commenter pointed out, there is a solution to your problem, but it still won't give you the keys in the order of definition:
' '.join(obj['right']['troops'])
Note
In CPython 3.6, dictionaries do preserve insertion order as an implementation detail, and from Python 3.7 on this is guaranteed by the language. I'm not sure whether you even need to order the keys in this case, but it's good to know. Props to @ScottColby for pointing this out to me!
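As a minimal sketch of getting the nth key, assuming Python 3.7 or later (where insertion order is guaranteed): nth_key is a hypothetical helper, and the data is copied from the question.

```python
from itertools import islice

# Sample data copied from the question.
troops = {'skeleton_horde': 2, 'baby_dragon': 3, 'valkyrie': 5, 'chr_witch': 1,
          'lightning': 1, 'order_volley': 6, 'building_inferno': 3, 'battle_ram': 2}

def nth_key(d, n):
    """Return the n-th key (0-based) in the dict's iteration order."""
    # islice skips the first n keys without building a full list.
    return next(islice(d, n, None))

print(nth_key(troops, 1))  # baby_dragon
print(' '.join(troops))    # all keys, space-separated, in insertion order
```

Note that nth_key raises StopIteration when n is out of range; list(d)[n] is a simpler alternative if an IndexError is preferred.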

What is `set` definition? AKA what does `set` do?

I just finished LearnPythonTheHardWay as my intro to programming and set my mind on a sudoku related project. I've been reading through the code of a Sudoku Generator that was uploaded here to learn some things, and I ran into the line available = set(range(1,10)). I read that as available = set([1, 2, 3, 4, 5, 6, 7, 8, 9]) but I'm not sure what set is.
I tried googling python set, looked through the code to see if set had been defined anywhere, and now I'm coming to you.
Thanks.

set is a built-in type. From the documentation:
A set object is an unordered collection of distinct hashable objects. Common uses include membership testing, removing duplicates from a sequence, and computing mathematical operations such as intersection, union, difference, and symmetric difference.
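The uses named in that description can be sketched in a few lines. The variable used_in_row is hypothetical, chosen to match the sudoku context of the question.

```python
available = set(range(1, 10))   # the digits 1 through 9
used_in_row = {3, 5, 9}         # hypothetical digits already placed

candidates = available - used_in_row   # difference
print(sorted(candidates))              # [1, 2, 4, 6, 7, 8]
print(5 in available)                  # True (membership test)
print(sorted(set([4, 4, 2, 2])))       # [2, 4] (duplicates removed)
print(sorted({1, 2, 3} & {2, 3, 4}))   # [2, 3] (intersection)
```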

A set in Python is the collection used to mimic the mathematical notion of set. To put it very succinctly, a set is a list of unique objects, that is, it cannot contain duplicates, which a list can do.

A set is kind of like an unordered list, with unique elements. Documentation exists though, so I'm not sure why you couldn't find it:
https://docs.python.org/2/library/stdtypes.html#set

To make it easy to understand, let's take a list:
a = [1,2,3,4,5,5,5,6,7,7,9]
print(list(set(a)))
The output will be:
[1, 2, 3, 4, 5, 6, 7, 9]
You can eliminate repeated numbers using set.
For more uses of set, refer to the docs.
Thanks to my friend here who reminded me about the lack of order.
In case the list 'a' was like this:
a = [7,7,5,5,5,1,2,3,4,6,9]
print(list(set(a)))
it will still print the output as
[1, 2, 3, 4, 5, 6, 7, 9]
A set does not preserve the order in which elements were added, and the order it prints in is an implementation detail that should not be relied on.
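If the first-seen order of the elements matters, one common workaround is to deduplicate through dict keys instead of a set. This sketch assumes Python 3.7 or later, where dicts keep insertion order.

```python
a = [7, 7, 5, 5, 5, 1, 2, 3, 4, 6, 9]

# dict keys are unique and, on Python 3.7+, keep insertion order.
unique_in_order = list(dict.fromkeys(a))
print(unique_in_order)  # [7, 5, 1, 2, 3, 4, 6, 9]
```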

Data structures with Python

Python has a lot of convenient data structures (lists, tuples, dicts, sets, etc.) which can be used to make other 'conventional' data structures (e.g., I can use a Python list to create a stack and a collections.deque to make a queue, dicts to make trees and graphs, etc.).
There are even third-party data structures that can be used for specific tasks (for instance the structures in Pandas, pytables, etc).
So, if I know how to use lists, dicts, sets, etc, should I be able to implement any arbitrary data structure if I know what it is supposed to accomplish?
In other words, what kind of data structures can the Python data structures not be used for?
Thanks

For some simple data structures (eg. a stack), you can just use the builtin list to get your job done. With more complex structures (eg. a bloom filter), you'll have to implement them yourself using the primitives the language supports.
You should really use the builtins if they serve your purpose, since they have been debugged and optimised by a horde of people over a long time. Doing it from scratch by yourself will probably produce an inferior data structure. Whether you're using Python, C++, C#, Java, whatever, you should always look to the built-in data structures first. They will generally be implemented using the same system primitives you would have to use doing it yourself, but with the advantage of having been tried and tested.
Combinations of these data structures (and maybe some of the functions from helper modules such as heapq and bisect) are generally sufficient to implement most richer structures that may be needed in real-life programming; however, that's not invariably the case.
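As one illustration of combining a builtin with a helper module, a plain list plus heapq behaves as a priority queue. The task names below are made up for the example.

```python
import heapq

tasks = []  # a plain list used as a binary min-heap
heapq.heappush(tasks, (2, 'write tests'))
heapq.heappush(tasks, (1, 'fix bug'))
heapq.heappush(tasks, (3, 'update docs'))

done = []
while tasks:
    priority, name = heapq.heappop(tasks)  # always pops the smallest priority
    done.append(name)

print(done)  # ['fix bug', 'write tests', 'update docs']
```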
Only when the provided data structures do not allow you to accomplish what you need, and there isn't an alternative and reliable library available to you, should you be looking at building something from scratch (or extending what's provided).
Let's say that you need something more than the rich Python library provides. Consider the fact that an object's attributes (and items in collections) are essentially "pointers" to other objects (without pointer arithmetic), i.e., "reseatable references", in Python just like in Java. In Python, you normally use a None value in an attribute or item to represent what NULL would mean in C++ or null would mean in Java.
So, for example, you could implement binary trees via, e.g.:
class Node(object):
    __slots__ = 'data', 'left', 'right'

    def __init__(self, data=None, left=None, right=None):
        self.data = data
        self.left = left
        self.right = right
plus methods or functions for traversal and similar operations (the __slots__ class attribute is optional -- mostly a memory optimization, to avoid each Node instance carrying its own __dict__, which would be substantially larger than the three needed attributes/references).
Other examples of data structures that may best be represented by dedicated Python classes, rather than by direct composition of other existing Python structures, include tries (see e.g. here) and graphs (see e.g. here).
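To make the "methods or functions for traversal" concrete, here is a minimal sketch built on the Node class from this answer. The insert and in_order functions are hypothetical additions, and the tree is an unbalanced binary search tree, so it is an illustration rather than a production container.

```python
class Node(object):
    __slots__ = 'data', 'left', 'right'

    def __init__(self, data=None, left=None, right=None):
        self.data = data
        self.left = left
        self.right = right

def insert(root, value):
    """Insert value into a binary search tree and return the (possibly new) root."""
    if root is None:
        return Node(value)
    if value < root.data:
        root.left = insert(root.left, value)
    else:
        root.right = insert(root.right, value)
    return root

def in_order(root):
    """Yield values in ascending order: left subtree, node, right subtree."""
    if root is not None:
        yield from in_order(root.left)
        yield root.data
        yield from in_order(root.right)

root = None
for v in [5, 2, 8, 1, 9]:
    root = insert(root, v)

print(list(in_order(root)))  # [1, 2, 5, 8, 9]
```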

You can use the Python data structures to do anything you like. The entire programming language Lisp (now people use either Common Lisp or Scheme) is built around the linked list data structure, and Lisp programmers can build any data structure they choose.
That said, there are sometimes data structures for which the Python data structures are not the best option. For instance, if you want to build a splay tree, you should either roll your own or use an open-source project like pysplay. If the built-in data structures solve your problem, use them. Otherwise, look beyond the built-in data structures. As always, use the best tool for the job.

Given that all data structures exist in memory, and memory is effectively just a list (array)... there is no data structure that couldn't be expressed in terms of the basic Python data structures (with appropriate code to interact with them).

It is important to realize that Python can represent hierarchical structures which are combinations of list (or tuple) and dict. For example, list-of-dict or dict-of-list or dict-of-dict are common simple structures. These are so common, that in my own practice, I append the data type to the variable name, like 'parameters_lod'. But these can go on to arbitrary depth.
The _lod datatype can be easily converted into a pandas DataFrame, for example, or any database table. In fact, some realizations of big-data tables use the _lod structure, sometimes omitting the commas between each dict and omitting the surrounding list brackets []. This makes it easy to append to a file of such lines. AWS offers tables that are dict syntax.
A _lod can be easily converted to a _dod if there is a field that is unique and can be used to index the table. An important difference between _lod and _dod is that the _lod can have multiple entries for the same keyfield, whereas a dict is required to have only one. Thus, it is more general to start with the _lod as the primary basic table structure so duplicates are allowed until the table is inspected to combine those entries.
If the lod is turned into dod, it is preferred to keep the entire dict intact, and not remove the item that is used for the keyfield.
a_lod = [
    {'k': 'sam', 'a': 1, 'b': 2, 'c': 3},
    {'k': 'sue', 'a': 4, 'b': 5, 'c': 6},
    {'k': 'joe', 'a': 7, 'b': 8, 'c': 9}
]
a_dod = {
    'sam': {'k': 'sam', 'a': 1, 'b': 2, 'c': 3},
    'sue': {'k': 'sue', 'a': 4, 'b': 5, 'c': 6},
    'joe': {'k': 'joe', 'a': 7, 'b': 8, 'c': 9}
}
Thus, the dict key is added but the records are unchanged. We find this is a good practice so the underlying dicts are unchanged.
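The lod-to-dod conversion described here can be sketched with a dict comprehension. This assumes the 'k' field is unique; if two records shared the same 'k', the later one would silently replace the earlier one.

```python
a_lod = [
    {'k': 'sam', 'a': 1, 'b': 2, 'c': 3},
    {'k': 'sue', 'a': 4, 'b': 5, 'c': 6},
    {'k': 'joe', 'a': 7, 'b': 8, 'c': 9},
]

# Index each record by its 'k' field; the record itself is left untouched.
a_dod = {record['k']: record for record in a_lod}

print(a_dod['sue']['b'])         # 5
print(a_dod['sam'] is a_lod[0])  # True: same dict object, not a copy
```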
The pandas DataFrame.append() function is very slow (it was deprecated in pandas 1.4 and removed in pandas 2.0). Therefore, you should not construct a dataframe one record at a time using this syntax:
df = df.append(record)
Instead, build it as a lod and then convert to a df, as follows.
df = pd.DataFrame(lod)
This is much faster, as the other method will get slower and slower as the df grows, because the whole df is copied each time.
It has become important in our development to emphasize the use of _lod and avoid field names in each record that are not consistent, so they can be easily converted to Dataframe. So we avoid using key fields in dicts like 'sam':(data) and use {'name':'sam', 'dataname': (arbitrary data)} instead.
The most elegant thing about python structures is the fact that the default is to work with references to the data rather than values. This must be understood because modifying data in a reference will modify the larger structure.
If you want to make a copy, then you need to use .copy(), and sometimes copy.deepcopy(), or .copy(deep=True) when using Pandas. Then the data structure will be copied; otherwise, a variable name is just a reference.
Further, we discourage using the dol structure, and instead prefer the lodol or dodol. This is because it is best to have each data item identified with a label, which also allows additional fields to be added.
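The difference between a reference, a shallow copy, and a deep copy can be seen in a short sketch using the standard copy module. The record contents are made up for the example.

```python
import copy

original = [{'k': 'sam', 'a': 1}]

alias = original                 # another name for the same list
shallow = copy.copy(original)    # new list, but the same inner dicts
deep = copy.deepcopy(original)   # new list and new inner dicts

original[0]['a'] = 99

print(alias[0]['a'])    # 99
print(shallow[0]['a'])  # 99 (the inner dict is shared)
print(deep[0]['a'])     # 1
```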

Storing an array of integers with Django

I've been trying to store an array of integers in a field of a Django model. Based on this reply, I've been trying to do so using a CommaSeparatedIntegerField, however this has proved less intuitive than the name would imply.
If I have a comma-separated list of integers (list = [12,23,31]), and I store it in a CommaSeparatedIntegerField, it comes back as a string (retrieved_list outputs u'[1,2,3]'). I cannot simply retrieve my integers: for instance, int(retrieved_list[1]) outputs 1 whereas list[1] would output 23.
So, do I have to do the parsing by hand, or is there any other solution? And how exactly does a CommaSeparatedIntegerField differ from a CharField? Seems to me like they behave pretty much the same...

eval was suggested in the accepted answer. Avoid the temptation; it's just not safe.
See: Python: make eval safe
There is a literal_eval function that could be used the same way:
>>> from ast import literal_eval
>>> literal_eval("[1,2,3,4]")
[1, 2, 3, 4]

The only difference between this field type and a CharField is that it validates that the data is in the proper format - only digits separated by ','. You can view the source for the class at http://docs.nullpobug.com/django/trunk/django.db.models.fields-pysrc.html#CommaSeparatedIntegerField (expand the '+').
You can turn such a string into an actual Python list value (not array - list is the proper term for Python) by using eval.
>>> eval("[1,2,3,4]")
[1, 2, 3, 4]
EDIT: eval has safety concerns, however, and a better method is to use literal_eval, as suggested by Alvin in another answer.
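Parsing by hand is also short and avoids evaluating anything. The sketch below is not tied to Django: parse_int_list is a hypothetical helper that accepts the stored string with or without surrounding brackets, and the join line shows one way to build the comma-separated string in the first place.

```python
def parse_int_list(text):
    """Hypothetical helper: turn '12,23,31' or '[12,23,31]' into a list of ints."""
    cleaned = text.strip().strip('[]')
    if not cleaned:
        return []
    # int() tolerates surrounding whitespace, so '12, 23' also works.
    return [int(part) for part in cleaned.split(',')]

serialized = ','.join(str(n) for n in [12, 23, 31])

print(serialized)                      # 12,23,31
print(parse_int_list(serialized))      # [12, 23, 31]
print(parse_int_list('[12, 23, 31]'))  # [12, 23, 31]
```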

Python equivalent to java.util.SortedSet?

Does anybody know if Python has an equivalent to Java's SortedSet interface?
Here's what I'm looking for: let's say I have an object of type foo, and I know how to compare two objects of type foo to see whether foo1 is "greater than" or "less than" foo2. I want a way of storing many objects of type foo in a list L, so that whenever I traverse the list L, I get the objects in order, according to the comparison method I define.
Edit:
I guess I can use a dictionary or a list and sort() it every time I modify it, but is this the best way?

Take a look at BTrees. It looks like you need one of them. As far as I understand, you need a structure that supports relatively cheap insertion of elements and a cheap sorting operation (or no sorting at all). BTrees offer that.
I've experience with ZODB.BTrees, and they scale to thousands and millions of elements.

You can use insort from the bisect module to insert new elements efficiently in an already sorted list:
from bisect import insort
items = [1,5,7,9]
insort(items, 3)
insort(items, 10)
print(items)  # -> [1, 3, 5, 7, 9, 10]
Note that this does not directly correspond to SortedSet, because it uses a list. If you insert the same item more than once you will have duplicates in the list.
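To get closer to set semantics with the bisect module, the position can be checked before inserting. SortedUniqueList below is a hypothetical sketch, not a library class; insertion into the underlying list is still O(n).

```python
from bisect import bisect_left

class SortedUniqueList:
    """Hypothetical sketch: a sorted list that ignores duplicate inserts."""

    def __init__(self, iterable=()):
        self._items = []
        for item in iterable:
            self.add(item)

    def add(self, item):
        i = bisect_left(self._items, item)
        # Insert only if the item is not already at its sorted position.
        if i == len(self._items) or self._items[i] != item:
            self._items.insert(i, item)

    def __contains__(self, item):
        i = bisect_left(self._items, item)
        return i < len(self._items) and self._items[i] == item

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)

s = SortedUniqueList([5, 1, 9, 7])
s.add(3)
s.add(5)  # duplicate, ignored
print(list(s))  # [1, 3, 5, 7, 9]
```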

If you're looking for an implementation of an efficient container type for Python implemented using something like a balanced search tree (A Red-Black tree for example) then it's not part of the standard library.
I was able to find this, though:
http://www.brpreiss.com/books/opus7/
The source code is available here:
http://www.brpreiss.com/books/opus7/public/Opus7-1.0.tar.gz
I don't know how the source code is licensed, and I haven't used it myself, but it would be a good place to start looking if you're not interested in rolling your own container classes.
There's PyAVL which is a C module implementing an AVL tree.
Also, this thread might be useful to you. It contains a lot of suggestions on how to use the bisect module to enhance the existing Python dictionary to do what you're asking.
Of course, using insort() that way would be pretty expensive for insertion and deletion, so consider it carefully for your application. Implementing an appropriate data structure would probably be a better approach.
In any case, to understand whether you should keep the data structure sorted or sort it when you iterate over it you'll have to know whether you intend to insert a lot or iterate a lot. Keeping the data structure sorted makes sense if you modify its content relatively infrequently but iterate over it a lot. Conversely, if you insert and delete members all the time but iterate over the collection relatively infrequently, sorting the collection of keys before iterating will be faster. There is no one correct approach.

Similar to blist.sortedlist, the sortedcontainers module provides a sorted list, sorted set, and sorted dict data type. It uses a modified B-tree in the underlying implementation and is faster than blist in most cases.
The sortedcontainers module is pure-Python so installation is easy:
pip install sortedcontainers
Then for example:
from sortedcontainers import SortedList, SortedDict, SortedSet
help(SortedList)
The sortedcontainers module has 100% test coverage and hours of stress testing. There's a pretty comprehensive performance comparison that lists most of the options you'd consider for this.

If you only need the keys, and no associated value, Python offers sets:
s = set(a_list)
for k in sorted(s):
    print(k)
However, you'll be sorting the set each time you do this.
If that is too much overhead, you may want to look at heap queues (the heapq module). They may not be as elegant and "Pythonic", but maybe they suit your needs.
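For sorted() to work on your own objects, the class has to define its comparison. A minimal sketch, assuming a hypothetical Foo type ordered by a rank attribute, uses functools.total_ordering to fill in the remaining comparison methods:

```python
from functools import total_ordering

@total_ordering
class Foo:
    """Hypothetical type ordered by its 'rank' attribute."""

    def __init__(self, name, rank):
        self.name = name
        self.rank = rank

    def __eq__(self, other):
        if not isinstance(other, Foo):
            return NotImplemented
        return self.rank == other.rank

    def __lt__(self, other):
        if not isinstance(other, Foo):
            return NotImplemented
        return self.rank < other.rank

    def __hash__(self):
        return hash(self.rank)  # keep hashing consistent with __eq__

foos = {Foo('c', 3), Foo('a', 1), Foo('b', 2)}
names = [f.name for f in sorted(foos)]
print(names)  # ['a', 'b', 'c']
```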

Use blist.sortedlist from the blist package.
from blist import sortedlist
z = sortedlist([2, 3, 5, 7, 11])
z.add(6)
z.add(3)
z.add(10)
print(z)
This will output:
sortedlist([2, 3, 3, 5, 6, 7, 10, 11])
The resulting object can be used just like a python list.
>>> len(z)
8
>>> [2 * x for x in z]
[4, 6, 6, 10, 12, 14, 20, 22]

Do you have the possibility of using Jython? I just mention it because using TreeMap, TreeSet, etc. is trivial. Also if you're coming from a Java background and you want to head in a Pythonic direction Jython is wonderful for making the transition easier. Though I recognise that use of TreeSet in this case would not be part of such a "transition".
For Jython superusers I have a question myself: the blist package can't be imported because it uses a C file which must be imported. But would there be any advantage of using blist instead of TreeSet? Can we generally assume the JVM uses algorithms which are essentially as good as those of CPython stuff?
