I have always considered Python to be a highly object-oriented programming language. Recently I've been using numpy a lot, and I'm beginning to wonder why a number of things are implemented there only as functions, and not as methods of the numpy.array (or ndarray) object.
Given an array a, for example, you can do
a = np.array([1, 2, 3])
np.sum(a)
>>> 6
a.sum()
>>> 6
which seems just fine, but there are a lot of calls that do not work the same way, as in:
np.amax(a)
>>> 3
a.amax()
>>> AttributeError: 'numpy.ndarray' object has no attribute 'amax'
I find this confusing, unintuitive and I do not see any reason behind it. There might be a good one, though; maybe someone can just enlighten me.
When numpy was introduced as a successor to Numeric, a lot of things that were just functions and not methods were added as methods to the ndarray type. At this particular time, having the ability to subclass the array type was a new feature. It was thought that making some of these common functions methods would allow subclasses to do the right thing more easily. For example, having .sum() as a method is useful to allow the masked array type to ignore the masked values; you can write generic code that will work for both plain ndarrays and masked arrays without any branching.
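For example, here's a quick sketch of that point: because .sum() is a method, one generic function works unchanged on both a plain ndarray and a masked array, with the masked array ignoring its masked values:

```python
import numpy as np

def total(arr):
    # Generic code: works for any array-like type that provides .sum()
    return arr.sum()

a = np.array([1, 2, 100])
m = np.ma.masked_array([1, 2, 100], mask=[False, False, True])

print(total(a))  # 103 -- the plain ndarray sums everything
print(total(m))  # 3   -- the masked array ignores the masked value
```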
Of course, you can't get rid of the functions. A nice feature of the functions is that they will accept any object that can be coerced into an ndarray, like a list of numbers, which would not have all of the ndarray methods. And the specific list of functions that were added as methods can't be all-inclusive; that's not good OO design, either. There's no need for all of the trig functions to be added as methods, for example.
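To illustrate: the function form happily coerces a plain list, while the method form requires an actual ndarray:

```python
import numpy as np

print(np.sum([1, 2, 3]))  # 6 -- a plain list is coerced to an ndarray first
# [1, 2, 3].sum()         # AttributeError: lists have no .sum() method
```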
The current list of methods was mostly chosen early in numpy's development, picking those that we thought were going to be useful as methods, either for subclassing or for notational convenience. With more experience under our belt, we have mostly come to the opinion that we added too many methods (seriously, there's no need for .ptp() to be a method) and that subclassing ndarray is typically a bad idea, for reasons that I won't go into here.
So take the list of methods as mostly a subset of the list of functions that are available, with np.amin() and np.amax() being slight renames of the .min() and .max() methods to avoid shadowing the min() and max() builtins.
Related
Aside from the additional functionality that trunc possesses as a ufunc, is there any difference between these two functions? I expected fix to be defined using trunc as is fairly common throughout NumPy, but it is actually defined using floor and ceil.
If the ufunc functionality is the only difference, why does fix exist? (Or at least, why does fix not just wrap trunc?)
There is no difference in the results. Though implemented differently internally, the two functions produce the same results, in both value and type, over the same set of input types. There is no good reason to have both, particularly when one works more slowly and/or takes more resources than the other, and both must be maintained for as long as numpy survives. Quite frankly, this is one problem with Python and many other design-by-committee projects, and you see debates over this trunc/fix issue in other language and library development projects, like Julia.
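A quick check of that claim; the two functions agree element-wise in both value and dtype:

```python
import numpy as np

x = np.array([-2.7, -0.5, 0.5, 2.7])

print(np.trunc(x))  # [-2. -0.  0.  2.]
print(np.fix(x))    # [-2. -0.  0.  2.]
print(np.trunc(x).dtype == np.fix(x).dtype)  # True
```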
I'm writing an interface to matplotlib, which requires that lists of floats are treated as corresponding to a colour map, but other types of input are treated as specifying a particular colour.
To do this, I planned to use matplotlib.colors.colorConverter, which is an instance of a class that converts the other types of input to matplotlib RGBA colour tuples. However, it will also convert floats to a grayscale colour map. This conflicts with the existing functionality of the package I'm working on and I think that would be undesirable.
My question is: is it appropriate/Pythonic to use an isinstance() check prior to using colorConverter to make sure that I don't incorrectly handle lists of floats? Is there a better way that I haven't thought of?
I've read that I should generally code to an interface, but in this case, the interface has functionality that differs from what is required.
It's a little subjective, but I'd say: in general it's not a good idea, but here, where you're distinguishing between a container and an instance of a class, it is appropriate (especially when, say, those classes may themselves be iterable, like tuples or strings, and doing it the duck-typing way would get quite tricky).
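As a rough sketch of what such a check might look like (the function name and return convention here are hypothetical, not part of matplotlib's API):

```python
def to_rgba_or_colormap(value):
    # Hypothetical dispatcher: a list/tuple of floats means "colour-map values",
    # anything else is handed off as a single colour specification.
    # Note a str is iterable too, so duck typing alone can't distinguish these.
    if isinstance(value, (list, tuple)) and all(isinstance(v, float) for v in value):
        return ('colormap', value)
    return ('color', value)

print(to_rgba_or_colormap([0.1, 0.5, 0.9]))  # treated as colour-map values
print(to_rgba_or_colormap('red'))            # treated as a single colour
```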
Aside: coding to an interface is generally recommended, but it's far more applicable to the Java-style static languages than Python, where interfaces don't really exist, unless you count abstract base classes and the abc module etc. (much deeper discussion in What's the Python version for “Code against an interface, not an object”?)
Hard to say without more details, but it sounds like you're closer to building a facade here than anything, and as such you should be free to use your own (neater / tighter / different) API, insulating users from your underlying implementation (Matplotlib).
Why not write two separate functions, one that treats its input as a color map, and another that treats its input as a color? This would be the simplest way to deal with the problem, and would both avoid surprises, and leave you room to expand functionality in the future.
I'm working on a project which requires handling matrices of custom C structures, with some C functions implementing operations over these structures.
So far, we're proceeding as follows:
Build python wrapper classes around the C structures using ctypes
Override __and__ and __xor__ for those objects, calling the appropriate underlying C functions
Build numpy arrays containing instances of these classes
Now we are facing some performance issues, and I feel this is not the proper way to handle things: we have a C library natively implementing expensive operations on the data type, and numpy natively implementing expensive operations on matrices, but (I guess) in this configuration every single operation will be proxied by the Python wrapper class.
Is there a way to implement this with numpy that makes the operations fully native? I've read there are utilities for wrapping ctypes types into numpy arrays (see here), but what about the operator overloading?
We're not bound to using ctypes, but we would love to be able to still use python (which I believe has big advantages over C in terms of code maintainability...)
Does anyone know if this is possible, and how? Would you suggest other different solutions?
Thanks a lot!
To elaborate somewhat on the above comment (arrays of structs versus struct of arrays):
import numpy as np
#create an array of structs
array_of_structs = np.zeros((3,3), dtype=[('a', np.float64), ('b',np.int32)])
#we may as well think of it as a struct of arrays
struct_of_arrays = array_of_structs['a'], array_of_structs['b']
#this should print 8: the 'b' field starts 8 bytes after the 'a' field
print(array_of_structs['b'].ctypes.data - array_of_structs['a'].ctypes.data)
Note that working with a 'struct of arrays' approach requires a rather different coding style. That is, we largely have to let go of the OOP way of organizing our code. Naturally, you may still choose to place all operations relating to a particular object in a single namespace, but such a data layout encourages a more vectorized, or numpythonic, coding style. Rather than passing single objects around between functions, we take the whole collection or matrix of objects to be the smallest atomic construct that we manipulate.
Of course, the same data may also be accessed in an OOP manner, where it is desirable. But that will not do you much good as far as numpy is concerned. Note that for performance reasons, it is often also preferable to maintain each attribute of your object as a separate contiguous array, since placing all attributes contiguously is cache-inefficient for operations which act on only a subset of attributes of the object.
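To make the contrast concrete, the struct-of-arrays view lets one native numpy operation act on a single attribute of every "object" at once, instead of looping over objects in Python:

```python
import numpy as np

array_of_structs = np.zeros((3, 3), dtype=[('a', np.float64), ('b', np.int32)])

# Vectorized: one native numpy operation touches the 'a' field of every struct
array_of_structs['a'] += 1.5

# The equivalent (slow) OOP-style loop over individual "objects" would be:
# for struct in array_of_structs.flat:
#     struct['a'] += 1.5

print(array_of_structs['a'])
```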
Are there better ways to apply string operations to ndarrays rather than iterating over them? I would like to use a "vectorized" operation, but I can only think of using map (example shown) or list comprehensions.
import numpy

Arr = numpy.rec.fromrecords(list(zip(range(5), 'as far as i know'.split())),
                            names='name, strings')
print(''.join(map(lambda x: x[0].upper() + '.', Arr['strings'])))
=> A.F.A.I.K.
For instance, in the R language, string operations are also vectorized:
> (string <- unlist(strsplit("as far as i know"," ")))
[1] "as" "far" "as" "i" "know"
> paste(sprintf("%s.",toupper(substr(string,1,1))),collapse="")
[1] "A.F.A.I.K."
Yes, recent NumPy has vectorized string operations, in the numpy.char module. E.g., when you want to find all strings starting with a B in an array of strings, that's
>>> y = np.asarray("B-PER O O B-LOC I-LOC O B-ORG".split())
>>> y
array(['B-PER', 'O', 'O', 'B-LOC', 'I-LOC', 'O', 'B-ORG'],
dtype='|S5')
>>> np.char.startswith(y, 'B')
array([ True, False, False, True, False, False, True], dtype=bool)
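The same module can reproduce the A.F.A.I.K. example from the question; casting to a one-character string dtype keeps just each word's first letter:

```python
import numpy as np

words = np.array('as far as i know'.split())
initials = np.char.upper(words.astype('<U1'))  # first letter of each word
print(''.join(np.char.add(initials, '.')))     # A.F.A.I.K.
```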
Update: See Larsman's answer to this question: Numpy recently added a numpy.char module for basic string operations.
Short answer: numpy doesn't provide vectorized string operations. The idiomatic way is to do something like (where Arr is your numpy array):
print('.'.join(item.upper() for item in Arr['strings']))
Long answer, here's why numpy doesn't provide vectorized string operations: (and a good bit of rambling in between)
One size does not fit all when it comes to data structures.
Your question probably seems odd to people coming from a non-domain-specific programming language, but it makes a lot of sense to people coming from a domain-specific language.
Python gives you a wide variety of choices of data structures. Some data structures are better at some tasks than others.
First off, numpy arrays aren't the default "hold-all" container in Python. Python's builtin containers are very good at what they're designed for. Often, a list or a dict is what you want.
Numpy's ndarrays are for homogeneous data.
In a nutshell, numpy doesn't have vectorized string operations.
ndarrays are a specialized container focused on storing N-dimensional, homogeneous groups of items in the minimum amount of memory possible. The emphasis is really on minimizing memory usage (I'm biased, because that's mostly what I need them for, but it's a useful way to think of it). Vectorized mathematical operations are just a nice side effect of having things stored in a contiguous block of memory.
Strings are usually of different lengths.
E.g. ['Dog', 'Cat', 'Horse']. Numpy takes the database-like approach of requiring you to define a length for your strings, but the simple fact that strings aren't expected to be a fixed length has a lot of implications.
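You can see that fixed-length behaviour directly: numpy picks a width from the longest input string, and anything longer assigned later is silently truncated to fit:

```python
import numpy as np

a = np.array(['Dog', 'Cat', 'Horse'])
print(a.dtype)  # <U5 -- every element gets 5 characters, the longest needed

a[0] = 'Elephant'
print(a[0])     # Eleph -- silently truncated to the fixed length
```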
Most useful string operations return variable length strings. (e.g. '.'.join(...) in your example)
Those that don't (e.g. upper, etc) you can mimic with other operations if you want to. (E.g. upper is roughly (x.view(np.uint8) - 32).view('S1'). I don't recommend that you do that, but you can...)
As a basic example: 'A' + 'B' yields 'AB'. 'AB' is not the same length as 'A' or 'B'. Numpy deals with other things that do this (e.g. np.uint8(4) + np.float64(3.4)), but strings are much more flexible in length than numbers. ("Upcasting" and "downcasting" rules for numbers are pretty simple.)
Another reason numpy doesn't do it is that the focus is on numerical operations. 'A'**2 has no particular definition in python (You can certainly make a string class that does, but what should it be?). String arrays are second class citizens in numpy. They exist, but most operations aren't defined for them.
Python is already really good at handling string processing
The other (and really, the main) reason numpy doesn't try to offer string operations is that python is already really good at it.
Lists are fantastic flexible containers. Python has a huge set of very nice, very fast string operations. List comprehensions and generator expressions are fairly fast, and they don't suffer any overhead from trying to guess what the type or size of the returned item should be, as they don't care. (They just store a pointer to it.)
Also, iterating over numpy arrays in Python is slower than iterating over a list or tuple in Python, but for string operations, you're really best off just using the normal list/generator expressions (e.g. print('.'.join(item.upper() for item in Arr['strings'])) in your example). Better yet, don't use numpy arrays to store strings in the first place. It makes sense if you have a single column of a structured array with strings, but that's about it. Python gives you very rich and flexible data structures. Numpy arrays aren't the be-all and end-all; they're a specialized case, not a generalized case.
Learn Python, not just Numpy
I'm not trying to be cheeky here, but working with numpy arrays is very similar to a lot of things in Matlab or R or IDL, etc.
It's a familiar paradigm, and anyone's first instinct is to try to apply that same paradigm to the rest of the language.
Python is a lot more than just numpy. It's a multi-paradigm language, so it's easy to stick to the paradigms that you're already used to. Try to learn to "think in python" as well as just "thinking in numpy". Numpy provides a specific paradigm to python, but there's a lot more there, and some paradigms are a better fit for some tasks than others.
Part of this is becoming familiar with the strengths and weaknesses of different data containers (lists vs dicts vs tuples, etc), as well as different programming paradigms (e.g. object-oriented vs functional vs procedural, etc).
All in all, python has several different types of specialized data structures. This makes it somewhat different from domain-specific languages like R or Matlab, which have a few types of data structures, but focus on doing everything with one specific structure. (My experience with R is limited, so I may be wrong there, but that's my impression of it, anyway. It's certainly true of Matlab, anyway.)
At any rate, I'm not trying to rant here, but it took me quite a while to stop writing Fortran in Matlab, and it took me even longer to stop writing Matlab in Python. This rambling answer is very short on concrete examples, but hopefully it makes at least a little bit of sense, and helps somewhat.
Edit: Let me try to reword and improve my question. The old version is attached at the bottom.
What I am looking for is a way to express and use free functions in a type-generic way. Examples:
abs(x) # maps to x.__abs__()
next(x) # maps to x.__next__() at least in Python 3
-x # maps to x.__neg__()
In these cases the functions have been designed in a way that allows users with user-defined types to customize their behaviour by delegating the work to a non-static method call. This is nice. It allows us to write functions that don't really care about the exact parameter types as long as they "feel" like objects that model a certain concept.
Counter examples: Functions that can't be easily used generically:
math.exp # only for reals
cmath.exp # takes complex numbers
Suppose I want to write a generic function that applies exp to a list of number-like objects. Which exp function should I use? How do I select the correct one?
def listexp(lst):
return [math.exp(x) for x in lst]
Obviously, this won't work for lists of complex numbers even though there is an exp for complex numbers (in cmath). And it also won't work for any user-defined number-like type which might offer its own special exp function.
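To make the failure concrete: math.exp rejects complex input outright, while cmath.exp accepts it (but always returns a complex):

```python
import math
import cmath

print(cmath.exp(1j * math.pi))  # close to -1, per Euler's identity

try:
    math.exp(1j)
except TypeError as e:
    print('math.exp rejects complex input:', e)
```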
So, what I'm looking for is a way to deal with this on both sides, ideally without special-casing a lot of things. As the writer of some generic function that does not care about the exact types of its parameters, I want to use the correct mathematical function specific to the types involved, without having to deal with this explicitly. As the writer of a user-defined type, I would like to expose special mathematical functions that have been augmented to deal with additional data stored in those objects (similar to the imaginary part of complex numbers).
What is the preferred pattern/protocol/idiom for doing that? I did not yet test numpy. But I downloaded its source code. As far as I know, it offers a sin function for arrays. Unfortunately, I haven't found its implementation yet in the source code. But it would be interesting to see how they managed to pick the right sin function for the right type of numbers the array currently stores.
In C++ I would have relied on function overloading and ADL (argument-dependent lookup). With C++ being statically typed, it should come as no surprise that this (name lookup, overload resolution) is handled completely at compile-time. I suppose, I could emulate this at runtime with Python and the reflective tools Python has to offer. But I also know that trying to import a coding style into another language might be a bad idea and not very idiomatic in the new language. So, if you have a different idea for an approach, I'm all ears.
I guess, somewhere at some point, I need to do some type-dependent dispatching manually, in an extensible way. Maybe write a module "tgmath" (type-generic math) that comes with support for real and complex numbers and allows others to register their types and special-case functions... Opinions? What do the Python masters say about this?
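For what it's worth, functools.singledispatch (added in Python 3.4, after this question was written) is one way to sketch such a "tgmath" module: the generic function dispatches on the type of its first argument, and user-defined types register their own implementations. The Dual type below is a toy illustration, not a real library:

```python
import functools
import math
import cmath

@functools.singledispatch
def exp(x):
    # Default: anything coercible to float goes through math.exp
    return math.exp(x)

@exp.register(complex)
def _(x):
    return cmath.exp(x)

class Dual:
    """Toy number-like type carrying a derivative (hypothetical example)."""
    def __init__(self, value, deriv):
        self.value, self.deriv = value, deriv

@exp.register(Dual)
def _(x):
    e = math.exp(x.value)
    return Dual(e, e * x.deriv)  # chain rule

def listexp(lst):
    # Now works for mixed number-like types in one list
    return [exp(x) for x in lst]

print(listexp([1.0, 1j]))
```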
TIA
Edit: Apparently, I'm not the only one who is interested in generic functions and type-dependent overloading. There is PEP 3124, but it has been in draft state for four years.
Old version of the question:
I have a strong background in Java and C++ and just recently started learning Python. What I'm wondering about is: How do we extend mathematical functions (at least their names) so they work on other user-defined types? Do these kinds of functions offer any kind of extension point/hook I can leverage (similar to the iterator protocol where next(obj) actually delegates to obj.__next__, etc) ?
In C++ I would have simply overloaded the function with the new parameter type and have the compiler figure out which of the functions was meant using the argument expressions' static types. But since Python is a very dynamic language there is no such thing as overloading. What is the preferred Python way of doing this?
Also, when I write custom functions, I would like to avoid long chains of
if isinstance(arg, someClass):
    suchandsuch
elif ...
What are the patterns I could use to make the code look prettier and more Pythonish?
I guess, I'm basically trying to deal with the lack of function overloading in Python. At least in C++ overloading and argument-dependent lookup is an important part of good C++ style.
Is it possible to make
x = udt(something) # object of user-defined type that represents a number
y = sin(x) # how do I make this invoke custom type-specific code for sin?
t = abs(x) # works because abs delegates to __abs__() which I defined.
work? I know I could make sin a non-static method of the class. But then I lose genericity because for every other kind of number-like object it's sin(x) and not x.sin().
Adding a __float__ method is not acceptable since I keep additional information in the object such as derivatives for "automatic differentiation".
TIA
Edit: If you're curious about what the code looks like, check this out. In an ideal world I would be able to use sin/cos/sqrt in a type-generic way. I consider these functions part of the object's interface even if they are "free functions". In __somefunction I did not qualify the functions with math. or __main__.; it just works because I manually fall back on math.sin (etc.) in my custom functions via the decorator. But I consider this to be an ugly hack.
You can do this, but it works backwards: you implement __float__() in your new type, and then sin() will work with your class.
In other words, you don't adapt sine to work on other types; you adapt those types so that they work with sine.
This is better because it forces consistency: if there is no obvious mapping from your object to a float, then there probably isn't a reasonable interpretation of sin() for that type.
[Sorry if I missed the "__float__ won't work" part earlier; perhaps you added that in response to this? Anyway, for convincing proof that what you want isn't possible: Python has the cmath library to add sin() etc. for complex numbers...]
If you want the return type of math.sin() to be your user-defined type, you appear to be out of luck. Python's math library is basically a thin wrapper around a fast native IEEE 754 floating point math library. If you want to be internally consistent and duck-typed, you can at least put the extensibility shim that python is missing into your own code.
import math

def sin(x):
    try:
        return x.__sin__()
    except AttributeError:
        return math.sin(x)
Now you can import this sin function and use it indiscriminately wherever you used math.sin previously. It's not quite as pretty as having math.sin pick up your duck-typing automatically but at least it can be consistent within your codebase.
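For example, with a (hypothetical) type that opts into that __sin__ hook; the shim is restated here so the snippet runs on its own:

```python
import math

def sin(x):
    # Shim: prefer the object's own __sin__, fall back to math.sin
    try:
        return x.__sin__()
    except AttributeError:
        return math.sin(x)

class Angle:
    # Hypothetical type opting in to the shim's __sin__ protocol
    def __init__(self, radians):
        self.radians = radians
    def __sin__(self):
        return math.sin(self.radians)

print(sin(Angle(math.pi / 2)))  # dispatched to Angle.__sin__
print(sin(0.0))                 # fell back to math.sin
```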
Define your own versions in a module. This is what's done in cmath for complex number and in numpy for arrays.
Typically the answer to questions like this is "you don't" or "use duck typing". Can you provide a little more detail about what you want to do? Have you looked at the remainder of the protocol methods for numeric types?
http://docs.python.org/reference/datamodel.html#emulating-numeric-types
Ideally, you will derive your user-defined numeric types from a native Python type, and the math functions will just work. When that isn't possible, perhaps you can define __int__() or __float__() or __complex__() or __long__() on the object so it knows how to convert itself to a type the math functions can handle.
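For instance, a minimal sketch: a class defining __float__ works with the math functions automatically, because they coerce their argument through it (Celsius is just an illustrative name):

```python
import math

class Celsius:
    # Illustrative numeric wrapper that knows how to become a float
    def __init__(self, degrees):
        self.degrees = degrees
    def __float__(self):
        return float(self.degrees)

t = Celsius(0)
print(math.sin(t))  # 0.0 -- math.sin calls __float__ to convert the object
```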
When that isn't feasible, for example if you wish to take a sin() of an object that stores x and y displacement rather than an angle, you will need to provide either your own equivalents of such functions (usually as a method of the class) or a function such as to_angle() to convert the object's internal representation to the one needed by Python.
Finally, it is possible to provide your own math module that replaces the built-in math functions with your own varieties, so if you want to allow math on your classes without any syntax changes to the expressions, it can be done in that fashion, although it is tricky and can reduce performance, since you'll be doing (e.g.) a fair bit of preprocessing in Python before calling the native implementations.