Data Structures in Python

All the books I've read on data structures so far seem to use C/C++, and make heavy use of the "manual" pointer control that they offer. Since Python hides that sort of memory management and garbage collection from the user, is it even possible to implement efficient data structures in this language, and is there any reason to do so instead of using the built-ins?

Python gives you some powerful, highly optimized data structures, both as built-ins and as part of a few modules in the standard library (lists and dicts, of course, but also tuples, sets, arrays in module array, and some other containers in module collections).
Combinations of these data structures (and maybe some of the functions from helper modules such as heapq and bisect) are generally sufficient to implement most richer structures that may be needed in real-life programming; however, that's not invariably the case.
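As a rough illustration of combining a plain list with the heapq and bisect helper modules mentioned above, here is a minimal sketch. The task names and scores are made up for the example; only the standard-library calls are real.

```python
import bisect
import heapq

# Priority queue: heapq keeps the smallest item at index 0 of a plain list.
tasks = []
heapq.heappush(tasks, (2, "write tests"))
heapq.heappush(tasks, (1, "fix bug"))
heapq.heappush(tasks, (3, "update docs"))
first = heapq.heappop(tasks)  # the entry with the lowest priority number

# Sorted sequence: bisect.insort inserts while keeping the list ordered.
scores = [10, 20, 40]
bisect.insort(scores, 30)
position = bisect.bisect_left(scores, 40)  # index where 40 sits in the sorted list
```

Both structures are ordinary lists underneath, so they can still be sliced, copied, and iterated like any other list.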
When you need something more than the rich library provides, consider the fact that an object's attributes (and items in collections) are essentially "pointers" to other objects (without pointer arithmetic), i.e., "reseatable references", in Python just like in Java. In Python, you normally use a None value in an attribute or item to represent what NULL would mean in C++ or null would mean in Java.
So, for example, you could implement binary trees via, e.g.:
class Node(object):
    __slots__ = 'payload', 'left', 'right'
    def __init__(self, payload=None, left=None, right=None):
        self.payload = payload
        self.left = left
        self.right = right
plus methods or functions for traversal and similar operations (the __slots__ class attribute is optional -- mostly a memory optimization, to avoid each Node instance carrying its own __dict__, which would be substantially larger than the three needed attributes/references).
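To make the "plus methods or functions for traversal" part concrete, here is one possible sketch. The insert and in_order helpers are hypothetical additions, not part of the answer above, and they assume the tree is used as a binary search tree.

```python
class Node(object):
    __slots__ = 'payload', 'left', 'right'

    def __init__(self, payload=None, left=None, right=None):
        self.payload = payload
        self.left = left
        self.right = right


def insert(node, value):
    """Insert value into a binary search tree and return the (possibly new) root."""
    if node is None:
        return Node(value)
    if value < node.payload:
        node.left = insert(node.left, value)
    else:
        node.right = insert(node.right, value)
    return node


def in_order(node):
    """Yield payloads in sorted order; None plays the role of a NULL pointer."""
    if node is not None:
        yield from in_order(node.left)
        yield node.payload
        yield from in_order(node.right)


root = None
for value in [5, 2, 8, 1, 9]:
    root = insert(root, value)
ordered = list(in_order(root))
```

The references in left and right are reseated by plain assignment, which is the only "pointer" operation this structure needs.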
Other examples of data structures that may best be represented by dedicated Python classes, rather than by direct composition of other existing Python structures, include tries (see e.g. here) and graphs (see e.g. here).

For some simple data structures (e.g. a stack), you can just use the built-in list to get the job done. With more complex structures (e.g. a Bloom filter), you'll have to implement them yourself using the primitives the language supports.
You should use the built-ins if they serve your purpose, since they have been debugged and optimised by a horde of people over a long time. Doing it from scratch yourself will probably produce an inferior data structure.
If, however, you need something that's not available as a primitive, or if the primitive doesn't perform well enough, you'll have to implement your own type.
The details like pointer management etc. are just implementation talk and don't really limit the capabilities of the language itself.
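The two cases in this answer might look roughly like the sketch below: a stack that is just a list, and a toy Bloom filter built from a bytearray and hashlib. The BloomFilter class, its default sizes, and the salting scheme are assumptions chosen for brevity, not a tuned implementation.

```python
import hashlib

# A stack needs nothing beyond the built-in list.
stack = []
stack.append("a")
stack.append("b")
top = stack.pop()  # last in, first out


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)  # one byte per "bit", for simplicity

    def _positions(self, item):
        # Derive several positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
seen.add("python")
```

After this, `"python" in seen` is guaranteed to be true, while a lookup for an item that was never added is only probably false.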

C/C++ data structure books are only attempting to teach you the underlying principles behind the various structures - they are generally not advising you to actually go out and re-invent the wheel by building your own library of stacks and lists.
Whether you're using Python, C++, C#, Java, whatever, you should always look to the built in data structures first. They will generally be implemented using the same system primitives you would have to use doing it yourself, but with the advantage of having been tried and tested.
Only when the provided data structures do not allow you to accomplish what you need, and there isn't an alternative and reliable library available to you, should you be looking at building something from scratch (or extending what's provided).

How Python handles objects at a low level isn't too strange anyway. This document should disambiguate it a tad; it's basically all the pointer logic you're already familiar with.

With Python you have access to a vast assortment of library modules written and debugged by other people. Odds are very good that somewhere out there, there is a module that does at least part of what you want, and odds are even good that it might be implemented in C for performance.
For example, if you need to do matrix math, you can use NumPy, which was written in C and Fortran.
Python is slow enough that you won't be happy if you try to write some sort of really compute-intensive code (example, a Fast Fourier Transform) in native Python. On the other hand, you can get a C-coded Fourier Transform as part of SciPy, and just use it.
I have never had a situation where I wanted to solve a problem in Python and said "darn, I just can't express the data structure I need."
If you are a pioneer, and you are doing something in Python for which there just isn't any library module out there, then you can try writing it in pure Python. If it is fast enough, you are done. If it is too slow, you can profile it, figure out where the slow parts are, and rewrite them in C using the Python C API. I have never needed to do this yet.

It's not possible to implement something like a C++ vector in Python, since you don't have array primitives the way C/C++ do. However, anything more complicated can be implemented (efficiently) on top of the built-in containers, including, but not limited to: linked lists, hash tables, multisets, bloom filters, etc.

Related

Is Python's flexibility in the types of list elements a consequence of dynamic typing?

I am new to Python with some experience in C++. (Unfortunately, with just two sample points, any pair of features are either uncorrelated or perfectly correlated.) In Python, the elements in the same list can have any types. In C++, the STL containers hold homogeneous types. (I suppose it is possible to mimic the flexibility in Python lists with a vector of void pointers.) The C++ STL facilitates generic programming, but Python lists have far more genericity. What causes this contrast? Is it difficult to design a language with a static type system and have something like a Python list?
More generally, I often have to resist the urge to think "Python has feature A and C++ has feature B and therefore Python does X and C++ does Y." Are there good choices of languages that provide good comparison and contrast with these two so that I can understand what features of programming languages are correlated and what features are orthogonal issues? Does a formal education in computer science teach the analogy of linguistics to programming languages? (If so how can I learn that?)

In Python, the elements in the same list can have any types. In C++, the STL containers hold homogeneous types. [...] The C++ STL facilitates generic programming, but Python lists have far more genericity. What causes this contrast? Is it difficult to design a language with a static type system and have something like a Python list?
The contrast is that Python uses duck typing, so it doesn't really care what you put into the list. On the other hand, C++ templates will figure out what types are being passed, instantiate the "skeleton" code with those types, then type-check the instantiated code.
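A small sketch of what duck typing means for a list in practice. The Duck and Robot classes and the speak method are invented for this illustration.

```python
class Duck:
    def speak(self):
        return "Quack"


class Robot:
    def speak(self):
        return "Beep"


# The list holds unrelated types; nothing is checked when they are stored.
things = [Duck(), Robot(), 42]

results = []
for thing in things:
    try:
        results.append(thing.speak())  # works for any object with speak()
    except AttributeError:
        # The mismatch only surfaces at run time, when the call is attempted.
        results.append("no speak() on " + type(thing).__name__)
```

The integer is accepted into the list without complaint; the problem appears only when the loop reaches it, which is the trade-off against a statically checked container.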
It is possible to have a heterogeneous vector/list in C++. std::vector<void*> is one possibility you mentioned. With C++17 and above, a more type-safe alternative is std::vector<std::any> (more type-safe because std::any stores type information).
However, generally there are approaches which are more type-safe and less error-prone. If all you care about is your data, collecting different types of data under a single container, then you can consider the functional programming approach which is to use sum types or tagged unions (e.g. std::vector<std::variant<x, y, z>>). If you care about the element's data members and member functions, then you can consider the OOP approach which uses class-based/inheritance-based polymorphism.
Are there good choices of languages that provide good comparison and contrast with these two so that I can understand what features of programming languages are correlated and what features are orthogonal issues?
I will try to provide an objective answer to this.
Rarely will you see a PL stuff itself with every single feature. It's just unfeasible to maintain and apply in industry. PL maintainers and committees will make decisions and some feature proposals will be kicked out. Different PLs are built differently (e.g. some use the actor model for better concurrency, some use garbage collectors, some are designed more functionally). So just learn and expose yourself to different PLs. Choose from different paradigms (object-oriented, functional, concurrent).
C++ and Python are a good first step. Learning C++, you get an understanding of pointers, OOP, and types. Learning Python, you get a feel for how convenient and quick programming can be. Both languages are not without their weaknesses. While coding in Python, you might've realised: "Oh, I don't need to worry about pointers and memory leaks since it's all done under the hood!". But a few moments later: "Aiya, I returned the wrong type in this branch, why didn't Python warn me earlier?"

Use APIs for sorting or algorithm?

In a programming language like Python, which will be more efficient: using a sorting algorithm like merge sort to sort an array, or using a built-in API like sort()? If algorithms are independent of programming languages, then what is the advantage of algorithms over built-in methods or APIs?

Why to use public APIs:
The built-in methods were written and reviewed by many very experienced coders, and a lot of effort was invested in making them as efficient as possible.
Since the built-in methods are public APIs, they are constantly used, which means you get massive "free" testing. Issues are much more likely to be detected in public APIs than in private ones, and once something is discovered, it will be fixed for you.
Don't reinvent the wheel. Someone already programmed it for you, use it. If your profiler says there is a problem, think about replacing it. Not before.
Why to use custom made methods:
That said, the public APIs handle the general case. If you need something very specific for your scenario, you might find a solution that is more efficient, but it will take you quite some time to actually do better than the already optimized general-purpose public API.
tl;dr: Use public APIs unless you:
Need it and can afford a lot of time to replace it.
Know what you are doing pretty well.
Intend to maintain it and do robust testing for it.

The libraries normally use well-tested and carefully optimized algorithms. For example, Python uses Timsort, which:
is a stable sort (the order of elements that compare equal is preserved)
in the worst case takes O(n log n) comparisons to sort an array of n elements
in the best case (when the input is already sorted) runs in linear time
Unless you have special requirements that tell you a particular sort algorithm will give the best result for your data sets, you can use the standard library implementation.
The other reason to build a sort by hand, is evidently for academic purposes...
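For comparison, here is a textbook merge sort next to the built-in sorted(), applied to records whose keys contain ties so that stability is visible. The merge_sort function and the sample records are illustrative only; in CPython the built-in version will normally also be much faster, because it runs in C.

```python
def merge_sort(items, key=lambda x: x):
    """Textbook top-down merge sort; stable because ties take the left item."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = merge_sort(items[:mid], key)
    right = merge_sort(items[mid:], key)
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if key(right[j]) < key(left[i]):
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])
            i += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


records = [("bob", 25), ("alice", 30), ("carol", 25), ("dave", 30)]
by_age_builtin = sorted(records, key=lambda r: r[1])
by_age_manual = merge_sort(records, key=lambda r: r[1])
```

Both results keep "bob" ahead of "carol" and "alice" ahead of "dave", since records with equal ages retain their original relative order.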

Breaking up functions into passive (algorithm) and active (execution) objects

Summary
What are the pros and cons of splitting pure functions into passive objects that describe the algorithms and active objects that can execute those algorithms? Note that the situation is greatly simplified by the fact that the functions have no side effects.
Detail
The portion of the code I'm writing (in Python 3) will largely adhere to functional programming.
There is some (immutable) data. There are some algorithms. And I need to apply those algorithms to the data, and get the result.
The algorithms could be represented as regular functions, which will be transformed using standard operations (e.g., I may compose two functions, then freeze some parameters using functools.partial, then pass the resulting function to another function as an argument). Many of the lower-level functions would be memoized for performance reasons.
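The plain-function approach described above could look something like this sketch. The compose helper and the power and scale functions are hypothetical; functools.partial, functools.reduce and functools.lru_cache are the standard-library pieces.

```python
from functools import lru_cache, partial, reduce


def compose(*funcs):
    """compose(f, g, h)(x) == f(g(h(x)))."""
    return reduce(lambda f, g: lambda *args: f(g(*args)), funcs)


@lru_cache(maxsize=None)
def power(base, exponent):
    # Memoized low-level function; its arguments must be hashable.
    return base ** exponent


def scale(factor, value):
    return factor * value


square = partial(power, exponent=2)  # freeze one parameter
double = partial(scale, 2)
double_of_square = compose(double, square)

result = double_of_square(3)  # square first, then double
```

Because the data is immutable and the functions are pure, memoizing with lru_cache is safe here; that would not hold for functions with side effects.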
But an idea occurred to me that perhaps I should instead represent algorithms as passive objects. Such objects wouldn't be able to execute anything themselves. When I'm ready to execute, I'll feed the algorithm object and all the inputs it expects into a special "computation" object. This would match my mental model of algorithms far better, but I'm concerned that I might be missing some problems with this approach.
Algorithm objects could be implemented in a variety of ways; perhaps even multiple implementations could be allowed. Let's say my algorithms are instances of an abstract class Algorithm; then its subclasses could represent:
strings of text in a domain-specific language that I'll create
some kind of execution trees that I'll construct
even regular Python functions
I have never done this before, so I wanted to get some feedback on this idea. Does it offer any real design advantages, apart from my subjective feeling that it's more "natural"? Does it lead to any problems?

I don't think the design offers any major advantage or disadvantage.
Assuming that any computation object can run any Algorithm, then your class Algorithm presumably is going to have a function called something like execute that knows how to run the algorithm. Name that function __call__, and now your Algorithm class is exactly like a Python callable object (including functions).
For your strings of DSL code: under your design you'd represent them as a subclass of Algorithm that overrides execute to run an interpreter. Under the other design you'd just do something like:
def createDSLAlgorithm(code):
    def coderunner(*args, **kwargs):
        return DSLInterpreter().interpret(code, *args, **kwargs)
    return coderunner
And similarly to create a function that, when called, will execute a specified expression tree.
Of course I might be missing something that you're planning to put into your Algorithm design that's not possible for functions. Not all Python functions have mutable attributes, for example. But since user-defined functions can be closures, can have attributes, and any object can "behave like a function" just by implementing __call__, I suspect it's different names for the same thing.
Choosing your own names, of course, is a small advantage if it aids code readability. And it might feel a bit more natural to attach attributes to "objects" than it does to attach them to "functions", if your computation objects are going to interrogate certain known attributes of Algorithms in order to help decide what to do when computing them (for example whether or not to memoize).
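The equivalence argued in this answer can be sketched as follows. The Algorithm class and its memoize attribute are hypothetical stand-ins for whatever the real design would contain.

```python
class Algorithm:
    """Hypothetical passive description of a computation."""

    def __init__(self, func, memoize=False):
        self.func = func
        self.memoize = memoize  # attribute a computation object could inspect
        self._cache = {}

    def __call__(self, *args):
        # Naming the entry point __call__ makes instances ordinary callables.
        if not self.memoize:
            return self.func(*args)
        if args not in self._cache:
            self._cache[args] = self.func(*args)
        return self._cache[args]


add = Algorithm(lambda a, b: a + b, memoize=True)
total = add(2, 3)                          # used exactly like a function
mapped = list(map(add, [1, 2], [10, 20]))  # accepted wherever a callable is
```

Once `__call__` exists, code such as map cannot tell the instance apart from a function, so the remaining difference is mainly naming and where the attributes live.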

library for transforming a node tree

I'd like to be able to express a general transformation of one tree into another without writing a bunch of repetitive spaghetti code. Are there any libraries to help with this problem? My target language is Python, but I'll look at other languages as long as it's feasible to port to Python.
Example: I'd like to transform this node tree: (please excuse the S-expressions)
(A (B) (C) (D))
Into this one:
(C (B) (D))
As long as the parent is A and the second ancestor is C, regardless of context (there may be more parents or ancestors). I'd like to express this transformation in a simple, concise, and re-usable way. Of course this example is very specific. Please try to address the general case.
Edit: RefactoringNG is the kind of thing I'm looking for, although it introduces an entirely new grammar to solve the problem, which I'd like to avoid. I'm still looking for more and/or better examples.
Background:
I'm able to convert python and cheetah (don't ask!) files into tokenized tree representations, and in turn convert those into lxml trees. I plan to then re-organize the tree and write-out the results in order to implement automated refactoring. XSLT seems to be the standard tool to rewrite XML, but the syntax is terrible (in my opinion, obviously) and nobody at our shop would understand it.
I could write some functions which simply use the lxml methods (.xpath and such) to implement my refactorings, but I'm worried that I will wind up with a bunch of purpose-built spaghetti code which can't be re-used.

Let's try this in Python code. I've used strings for the leaves, but this will work with any objects.
def lift_middle_child(in_tree):
    (A, (B,), (C,), (D,)) = in_tree
    return (C, (B,), (D,))
print(lift_middle_child(('A', ('B',), ('C',), ('D',))))  # could use lists too
This sort of tree transformation is generally better performed in a functional style - if you create a bunch of these functions, you can explicitly compose them, or create a composition function to work with them in a point-free style.
Because you've used s-expressions, I assume you're comfortable representing trees as nested lists (or the equivalent - unless I'm mistaken, lxml nodes are iterable in that way). Obviously, this example relies on a known input structure, but your question implies that. You can write more flexible functions, and still compose them, as long as they have this uniform interface.
Here's the code in action: http://ideone.com/02Uv0i
Now, here's a function to reverse children, and using that and the above function, one to lift and reverse:
from functools import reduce  # a built-in in Python 2, lives in functools in Python 3
def compose2(a, b):  # might want to get this from the functional library
    return lambda *x: a(b(*x))
def compose(*funcs):  # compose(a,b,c) = a(b(c(x))) - you might want to reverse that
    return reduce(compose2, funcs)
def reverse_children(in_tree):
    return in_tree[0:1] + in_tree[1:][::-1]  # slightly cryptic, but works for anything subscriptable
lift_and_reverse = compose(reverse_children, lift_middle_child)  # right most function applied first - if you find this confusing, reverse order in compose function.
print(lift_and_reverse(('A', ('B',), ('C',), ('D',))))
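The functions in this answer assume a known input shape. One way to apply a rule "regardless of context", as the question asks, is a small bottom-up walker that tries the rule at every node. This is only a sketch: the transform helper is hypothetical, and this variant of lift_middle_child checks the pattern before rewriting instead of assuming it.

```python
def transform(tree, rule):
    """Apply rule bottom-up to every node of a nested-tuple tree."""
    head, children = tree[0], tree[1:]
    rebuilt = (head,) + tuple(transform(child, rule) for child in children)
    return rule(rebuilt)


def lift_middle_child(tree):
    """Rewrite (A, b, (C,), d) as (C, b, d); leave other nodes unchanged."""
    if len(tree) == 4 and tree[0] == 'A' and tree[2] == ('C',):
        _, first, (middle,), last = tree
        return (middle, first, last)
    return tree


source = ('root', ('A', ('B',), ('C',), ('D',)), ('E',))
result = transform(source, lift_middle_child)
```

Rules written this way share one interface (tree in, tree out), so they can still be chained with a compose function like the one above.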

What you really want, IMHO, is a program transformation system, which allows you to parse and transform code using patterns expressed in the surface syntax of the source code (and even the target language) to express the rewrites directly.
You will find that even if you can get your hands on an XML representation of the Python tree, the effort to write an XSLT/XPath transformation is more than you expect; trees representing real code are messier than you'd expect, XSLT isn't that convenient a notation, and it cannot directly express common conditions on trees that you'd like to check (e.g., that two subtrees are the same). A final complication with XML: assume the tree has been transformed. How do you regenerate the source code syntax from which it came? You need some kind of prettyprinter.
A general problem regardless of how the code is represented is that without information about scopes and types (where you can get it), writing correct transformations is pretty hard. After all, if you are going to transform python into a language that uses different operators for string concat and arithmetic (unlike Java which uses "+" for both), you need to be able to decide which operator to generate. So you need type information to decide. Python is arguably typeless, but in practice most expressions involve variables which have only one type for their entire lifetime. So you'll also need flow analysis to compute types.
Our DMS Software Reengineering Toolkit has all of these capabilities (parsing, flow analysis, pattern matching/rewriting, prettyprinting), and robust parsers for many languages including Python. (While it has flow analysis capability instantiated for C, COBOL, Java, this is not instantiated for Python. But then, you said you wanted to do the transformation regardless of context).
To express your rewrite in DMS on Python syntax close to your example (which isn't Python?)
domain Python;
rule revise_arguments(f:IDENTIFIER,A:expression,B:expression,
C:expression,D:expression):primary->primary
= " \f(\A,(\B),(\C),(\D)) "
-> " \f(\C,(\B),(\D)) ";
The notation above is the DMS rule-rewriting language (RSL). The "..." are metaquotes that separate Python syntax (inside those quotes, DMS knows it is Python because of the domain notation declaration) from the DMS RSL language. The \n inside the meta quote refers to the syntax variable placeholders of the named nonterminal type defined in the rule parameter list. Yes, (...) inside the metaquotes are Python ( ) ... they exist in the syntax trees as far as DMS is concerned, because they, like the rest of the language, are just syntax.
The above rule looks a bit odd because I'm trying to follow your example as closely as possible, and from an expression language point of view, your example is odd precisely because it does have unusual parentheses.
With this rule, DMS could parse Python (using its Python parser) like
foobar(2+3,(x-y),(p),(baz()))
build an AST, match the (parsed-to-AST) rule against that AST, rewrite it to another AST corresponding to:
foobar(p,(x-y),(baz()))
and then prettyprint the surface syntax (valid) python back out.
If you intended your example to be a transformation on LISP code, you'd
need a LISP grammar for DMS (not hard to build, but we don't have much
call for this), and write corresponding surface syntax:
domain Lisp;
rule revise_form(A:form,B:form, C:form, D:form):form->form
= " (\A,(\B),(\C),(\D)) "
-> " (\C,(\B),(\D)) ";
You can get a better feel for this by looking at Algebra as a DMS domain.
If your goal is to implement all this in Python... I don't have much help.
DMS is a pretty big system, and it would be a lot of effort to replicate.

How do we use sin,cos,tan generically (including user-defined types) in Python?

Edit: Let me try to reword and improve my question. The old version is attached at the bottom.
What I am looking for is a way to express and use free functions in a type-generic way. Examples:
abs(x) # maps to x.__abs__()
next(x) # maps to x.__next__() at least in Python 3
-x # maps to x.__neg__()
In these cases the functions have been designed in a way that allows users with user-defined types to customize their behaviour by delegating the work to a non-static method call. This is nice. It allows us to write functions that don't really care about the exact parameter types as long as they "feel" like objects that model a certain concept.
Counter examples: Functions that can't be easily used generically:
math.exp # only for reals
cmath.exp # takes complex numbers
Suppose, I want to write a generic function that applies exp on a list of number-like objects. What exp function should I use? How do I select the correct one?
def listexp(lst):
    return [math.exp(x) for x in lst]
Obviously, this won't work for lists of complex numbers even though there is an exp for complex numbers (in cmath). And it also won't work for any user-defined number-like type which might offer its own special exp function.
So, what I'm looking for is a way to deal with this on both sides -- ideally without special casing a lot of things. As a writer of some generic function that does not care about the exact types of parameters, I want to use the correct mathematical functions that are specific to the types involved without having to deal with this explicitly. As a writer of a user-defined type, I would like to expose special mathematical functions that have been augmented to deal with additional data stored in those objects (similar to the imaginary part of complex numbers).
What is the preferred pattern/protocol/idiom for doing that? I did not yet test numpy. But I downloaded its source code. As far as I know, it offers a sin function for arrays. Unfortunately, I haven't found its implementation yet in the source code. But it would be interesting to see how they managed to pick the right sin function for the right type of numbers the array currently stores.
In C++ I would have relied on function overloading and ADL (argument-dependent lookup). With C++ being statically typed, it should come as no surprise that this (name lookup, overload resolution) is handled completely at compile-time. I suppose, I could emulate this at runtime with Python and the reflective tools Python has to offer. But I also know that trying to import a coding style into another language might be a bad idea and not very idiomatic in the new language. So, if you have a different idea for an approach, I'm all ears.
I guess, somewhere at some point I need to manually do some type-dependent dispatching in an extensible way. Maybe write a module "tgmath" (type generic math) that supports real and complex numbers and allows others to register their types and special-case functions... Opinions? What do the Python masters say about this?
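One way such an extensible "tgmath"-style registry could be sketched today is with functools.singledispatch from the standard library (available in Python 3.4 and later, so not an option when this question was first asked). The Dual class is a hypothetical user-defined type standing in for an automatic-differentiation number.

```python
import cmath
import math
from functools import singledispatch


@singledispatch
def exp(x):
    # Fallback for anything math.exp accepts (int, float, objects with __float__).
    return math.exp(x)


@exp.register(complex)
def _(x):
    return cmath.exp(x)


class Dual:
    """Hypothetical user-defined number carrying a value and a derivative."""

    def __init__(self, value, deriv):
        self.value = value
        self.deriv = deriv


@exp.register(Dual)
def _(x):
    # d/dx exp(u) = exp(u) * u'
    e = math.exp(x.value)
    return Dual(e, e * x.deriv)


def listexp(lst):
    return [exp(x) for x in lst]


results = listexp([0.0, 0j, Dual(0.0, 1.0)])
```

Dispatch is on the type of the first argument only, and third parties can register their own types without editing the module that defines exp.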
TIA
Edit: Apparently, I'm not the only one who is interested in generic functions and type-dependent overloading. There is PEP 3124, but it has been in draft state for 4 years.
Old version of the question:
I have a strong background in Java and C++ and just recently started learning Python. What I'm wondering about is: How do we extend mathematical functions (at least their names) so they work on other user-defined types? Do these kinds of functions offer any kind of extension point/hook I can leverage (similar to the iterator protocol where next(obj) actually delegates to obj.__next__, etc) ?
In C++ I would have simply overloaded the function with the new parameter type and have the compiler figure out which of the functions was meant using the argument expressions' static types. But since Python is a very dynamic language there is no such thing as overloading. What is the preferred Python way of doing this?
Also, when I write custom functions, I would like to avoid long chains of
if isinstance(arg,someClass):
    suchandsuch
elif ...
What are the patterns I could use to make the code look prettier and more Pythonish?
I guess, I'm basically trying to deal with the lack of function overloading in Python. At least in C++ overloading and argument-dependent lookup is an important part of good C++ style.
Is it possible to make
x = udt(something) # object of user-defined type that represents a number
y = sin(x) # how do I make this invoke custom type-specific code for sin?
t = abs(x) # works because abs delegates to __abs__() which I defined.
work? I know I could make sin a non-static method of the class. But then I lose genericity because for every other kind of number-like object it's sin(x) and not x.sin().
Adding a __float__ method is not acceptable since I keep additional information in the object such as derivatives for "automatic differentiation".
TIA
Edit: If you're curious about what the code looks like, check this out. In an ideal world I would be able to use sin/cos/sqrt in a type-generic way. I consider these functions part of the object's interface even if they are "free functions". In __somefunction I did not qualify the functions with math. nor __main__.. It just works because I manually fall back on math.sin (etc) in my custom functions via the decorator. But I consider this to be an ugly hack.

you can do this, but it works backwards. you implement __float__() in your new type and then sin() will work with your class.
in other words, you don't adapt sine to work on other types; you adapt those types so that they work with sine.
this is better because it forces consistency. if there is no obvious mapping from your object to a float then there probably isn't a reasonable interpretation of sin() for that type.
[sorry if i missed the "__float__ won't work" part earlier; perhaps you added that in response to this? anyway, for convincing proof that what you want isn't possible, python has the cmath library to add sin() etc for complex numbers...]
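A minimal sketch of the __float__ approach from this answer. The Angle class is hypothetical; the behaviour relied on is that math.sin converts a non-float argument through its __float__ method.

```python
import math


class Angle:
    """Hypothetical type with an obvious mapping to a float (radians)."""

    def __init__(self, degrees):
        self.degrees = degrees

    def __float__(self):
        return math.radians(self.degrees)


value = math.sin(Angle(90))  # math.sin converts its argument via __float__
```

Note that the result is a plain float, so any extra information carried by the object (such as the derivatives mentioned in the question) is lost, which is why the asker rules this approach out.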

If you want the return type of math.sin() to be your user-defined type, you appear to be out of luck. Python's math library is basically a thin wrapper around a fast native IEEE 754 floating point math library. If you want to be internally consistent and duck-typed, you can at least put the extensibility shim that python is missing into your own code.
import math
def sin(x):
    try:
        return x.__sin__()
    except AttributeError:
        return math.sin(x)
Now you can import this sin function and use it indiscriminately wherever you used math.sin previously. It's not quite as pretty as having math.sin pick up your duck-typing automatically but at least it can be consistent within your codebase.
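Here is how the shim above might be used with a user-defined type. The Dual class is a hypothetical automatic-differentiation number, and __sin__ is purely a convention of this codebase, not a protocol that Python itself recognizes.

```python
import math


def sin(x):
    try:
        return x.__sin__()
    except AttributeError:
        return math.sin(x)


class Dual:
    """Hypothetical number type carrying a value and its derivative."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __sin__(self):
        # Chain rule: d/dx sin(u) = cos(u) * u'
        return Dual(math.sin(self.value), math.cos(self.value) * self.deriv)


plain = sin(0.0)            # float has no __sin__, so this falls back to math.sin
dual = sin(Dual(0.0, 1.0))  # dispatches to Dual.__sin__
```

One caveat with the try/except form: an AttributeError raised inside a buggy __sin__ method would also trigger the fallback, so looking the method up with getattr before calling it may be a safer variant.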

Define your own versions in a module. This is what's done in cmath for complex numbers and in numpy for arrays.

Typically the answer to questions like this is "you don't" or "use duck typing". Can you provide a little more detail about what you want to do? Have you looked at the remainder of the protocol methods for numeric types?
http://docs.python.org/reference/datamodel.html#emulating-numeric-types

Ideally, you will derive your user-defined numeric types from a native Python type, and the math functions will just work. When that isn't possible, perhaps you can define __int__() or __float__() or __complex__() or __long__() on the object so it knows how to convert itself to a type the math functions can handle.
When that isn't feasible, for example if you wish to take a sin() of an object that stores x and y displacement rather than an angle, you will need to provide either your own equivalents of such functions (usually as a method of the class) or a function such as to_angle() to convert the object's internal representation to the one needed by Python.
Finally, it is possible to provide your own math module that replaces the built-in math functions with your own varieties, so if you want to allow math on your classes without any syntax changes to the expressions, it can be done in that fashion, although it is tricky and can reduce performance, since you'll be doing (e.g.) a fair bit of preprocessing in Python before calling the native implementations.
