Applying string operations to numpy arrays? - python

Are there better ways to apply string operations to ndarrays rather than iterating over them? I would like to use a "vectorized" operation, but I can only think of using map (example shown) or list comprehensions.
import numpy

Arr = numpy.rec.fromrecords(zip(range(5), 'as far as i know'.split()),
                            names='name, strings')
print ''.join(map(lambda x: x[0].upper() + '.', Arr['strings']))
=> A.F.A.I.K.
For instance, in the R language, string operations are also vectorized:
> (string <- unlist(strsplit("as far as i know"," ")))
[1] "as" "far" "as" "i" "know"
> paste(sprintf("%s.",toupper(substr(string,1,1))),collapse="")
[1] "A.F.A.I.K."

Yes, recent NumPy has vectorized string operations, in the numpy.char module. E.g., to find all strings starting with 'B' in an array of strings:
>>> y = np.asarray("B-PER O O B-LOC I-LOC O B-ORG".split())
>>> y
array(['B-PER', 'O', 'O', 'B-LOC', 'I-LOC', 'O', 'B-ORG'],
dtype='|S5')
>>> np.char.startswith(y, 'B')
array([ True, False, False, True, False, False, True], dtype=bool)
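The question's A.F.A.I.K. example can be written with np.char as well. A small sketch (not part of the original answer), assuming a NumPy recent enough to ship the np.char routines used here:
import numpy as np

arr = np.array('as far as i know'.split())
# Cast to a one-character dtype to keep only the first letter of each word,
# then uppercase and append a '.' with the vectorized np.char routines.
initials = np.char.add(np.char.upper(arr.astype('U1')), '.')
print(''.join(initials))   # A.F.A.I.K.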

Update: See Larsman's answer to this question: Numpy recently added a numpy.char module for basic string operations.
Short answer: Numpy doesn't provide vectorized string operations. The idiomatic way is to do something like (where Arr is your numpy array):
print ''.join(item[0].upper() + '.' for item in Arr['strings'])
Long answer: here's why numpy doesn't provide vectorized string operations (with a good bit of rambling in between).
One size does not fit all when it comes to data structures.
Your question probably seems odd to people coming from a non-domain-specific programming language, but it makes a lot of sense to people coming from a domain-specific language.
Python gives you a wide variety of choices of data structures. Some data structures are better at some tasks than others.
First off, numpy arrays aren't the default "hold-all" container in python. Python's builtin containers are very good at what they're designed for. Often, a list or a dict is what you want.
Numpy's ndarrays are for homogeneous data.
In a nutshell, numpy doesn't have vectorized string operations.
ndarrays are a specialized container focusing on storing N-dimensional homogeneous groups of items in the minimum amount of memory possible. The emphasis is really on minimizing memory usage (I'm biased, because that's mostly what I need them for, but it's a useful way to think of it.). Vectorized mathematical operations are just a nice side effect of having things stored in a contiguous block of memory.
Strings are usually of different lengths.
E.g. ['Dog', 'Cat', 'Horse']. Numpy takes the database-like approach of requiring you to define a length for your strings, but the simple fact that strings aren't expected to be a fixed length has a lot of implications.
Most useful string operations return variable length strings. (e.g. '.'.join(...) in your example)
Those that don't (e.g. upper, etc) you can mimic with other operations if you want to. (E.g. upper is roughly (x.view(np.uint8) - 32).view('S1'). I don't recommend that you do that, but you can...)
As a basic example: 'A' + 'B' yields 'AB'. 'AB' is not the same length as 'A' or 'B'. Numpy deals with other things that do this (e.g. np.uint8(4) + np.float(3.4)), but strings are much more flexible in length than numbers. ("Upcasting" and "downcasting" rules for numbers are pretty simple.)
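To make that contrast concrete, here is a tiny sketch (np.float64 is used below in place of the older np.float spelling):
import numpy as np

# Numeric "upcasting" follows a small, fixed set of rules:
print(np.uint8(4) + np.float64(3.4))                  # 7.4, a float64
# String concatenation, by contrast, produces a result whose length depends
# on the inputs, so the output dtype has to grow to fit:
print(np.char.add(np.array(['A']), np.array(['B'])))  # ['AB'], dtype '<U2'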
Another reason numpy doesn't do it is that the focus is on numerical operations. 'A'**2 has no particular definition in python (You can certainly make a string class that does, but what should it be?). String arrays are second class citizens in numpy. They exist, but most operations aren't defined for them.
Python is already really good at handling string processing
The other (and really, the main) reason numpy doesn't try to offer string operations is that python is already really good at it.
Lists are fantastic flexible containers. Python has a huge set of very nice, very fast string operations. List comprehensions and generator expressions are fairly fast, and they don't suffer any overhead from trying to guess what the type or size of the returned item should be, as they don't care. (They just store a pointer to it.)
Also, iterating over numpy arrays in python is slower than iterating over a list or tuple, so for string operations you're really best off just using normal list/generator expressions (e.g. print ''.join(item[0].upper() + '.' for item in Arr['strings']) in your example). Better yet, don't use numpy arrays to store strings in the first place. It makes sense if you have a single string column in a structured array, but that's about it. Python gives you very rich and flexible data structures. Numpy arrays aren't the be-all and end-all; they're a specialized case, not a general one.
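A rough, machine-dependent timing sketch of the iteration point above (not a rigorous benchmark; exact numbers will vary):
import numpy as np
from timeit import timeit

words = 'as far as i know'.split() * 20000
arr = np.array(words)

# Iterating over the numpy string array has to box each element back into a
# Python string, so the plain list is typically at least as fast here.
print(timeit(lambda: [w.upper() for w in words], number=10))
print(timeit(lambda: [w.upper() for w in arr], number=10))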
Keep in mind, too, that most of what you'd want to do with a numpy array of strings can be done just as easily (and usually more readably) with a plain list of strings.
Learn Python, not just Numpy
I'm not trying to be cheeky here, but working with numpy arrays is very similar to a lot of things in Matlab or R or IDL, etc.
It's a familiar paradigm, and anyone's first instinct is to try to apply that same paradigm to the rest of the language.
Python is a lot more than just numpy. It's a multi-paradigm language, so it's easy to stick to the paradigms that you're already used to. Try to learn to "think in python" as well as just "thinking in numpy". Numpy provides a specific paradigm to python, but there's a lot more there, and some paradigms are a better fit for some tasks than others.
Part of this is becoming familiar with the strengths and weaknesses of different data containers (lists vs dicts vs tuples, etc), as well as different programming paradigms (e.g. object-oriented vs functional vs procedural, etc).
All in all, python has several different types of specialized data structures. This makes it somewhat different from domain-specific languages like R or Matlab, which have a few types of data structures but focus on doing everything with one specific structure. (My experience with R is limited, so I may be wrong there, but that's my impression of it. It's certainly true of Matlab, anyway.)
At any rate, I'm not trying to rant here, but it took me quite a while to stop writing Fortran in Matlab, and it took me even longer to stop writing Matlab in python. This rambling answer is very short on concrete examples, but hopefully it makes at least a little bit of sense, and helps somewhat.

Related

Can whole-line NumPy array indexing operations be cythonized?

In the Cython documentation under Efficient Indexing, in the gotchas part, it says that:
This efficient indexing only affects certain index operations, namely
those with exactly ndim number of typed integer indices.
Does this mean that operations like
f[:, w] = something
are not optimized?
It probably meant "optimized [compared to pure Python code]". There are different kinds of slicing, and most of them are already really fast in Python; there just is not much you can speed up. For example, if you use f[:,w] you'll get a view of the array f. It involves a bit of overhead because a "view" has to be created, but it's really fast already because it (excluding certain advanced indexing operations) is just a memory view.
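A quick illustration of the view behaviour in plain NumPy (assuming w is a plain integer index, as in the question):
import numpy as np

f = np.arange(12.0).reshape(3, 4)
col = f[:, 1]     # basic slicing with an integer index: a view, no copy
col[:] = 0        # writes straight through to f's memory
print(f[0, 1])    # 0.0 -- the original array was modified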
However, what Cython can speed up significantly is accessing single elements of an array. That is a really inefficient operation in Python code because the element has to be "boxed" as a Python object when accessed. Cython can avoid this "boxing" when "exactly ndim number of typed integer indices" are used.
So it's not like f[:,w] isn't optimized. It is already optimized by numpy. Cython can't improve (much) there.

Truly vectorized routines in python?

Are there really good methods in Python to vectorize operations on matrix-like data constructs/containers? What data constructs are used for them?
(I have observed and read that element-wise operations in pandas and numpy using vectorize or applymap (and probably also apply/apply-along-axis for rows/columns) are not much of a speed improvement over for loops.
Given that using them sometimes means wrestling with the specifics of the datatypes, when it is usually a little easier in for loops, what are the benefits? Readability?)
Are there ways to achieve a performance gap similar to what you see in Matlab when comparing for loops and vectorized operations?
(Note that this is not to bash numpy or pandas; they are great, and whole-matrix operations are fine. It is just that element-wise operations become slow.)
EDIT to explain the context:
I was only wondering because I have received more than one answer mentioning the fact that apply and the like are actually similar to for loops. That's why I was wondering whether there were similar functions implemented in such a way that they would perform better. The actual problems were varied; they just had to be element-wise rather than "doing the sum, product, or whatever of the whole matrix". I did a lot of comparisons with differential outputs, sometimes based on other matrices, so I had to use complex functions for this. But since the matrices are huge and the implementation depended on "for loop like" mechanisms, in the end I felt that my program would not work well on a larger dataset. Hence my question. But I was not looking for review, only knowledge.
You need to provide a specific example.
Normal per-element MATLAB or Python functions cannot be vectorized in general. The whole point of vectorizing, in both MATLAB and Python, is to off-load the operation onto the much faster underlying C or Fortran libraries that are designed to work on arrays of uniform data. This cannot be done on functions that operate on scalars, either in MATLAB or Python.
For functions that operate on arrays or matrices as a whole (such as mathematical operators, sum, square, etc), MATLAB and Python behave the same. In fact they use most of the same underlying C and Fortran libraries to do their calculations.
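As a minimal sketch of the contrast between a per-element Python loop and a whole-array operation (the two totals agree up to floating-point rounding):
import numpy as np

x = np.random.rand(1000000)

# Per-element Python loop: every iteration goes through the interpreter.
total_loop = 0.0
for v in x:
    total_loop += v * v

# Whole-array operation: a single call into compiled C/Fortran code.
total_vec = np.dot(x, x)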
So you need to show the actual operation you want to do, and then we can see if there is a way to vectorize it.
If it is working code and you just want to improve its performance, then Code Review stack exchange site is probably a better choice.

Numpy vectorisation of python object array

Just a short question that I can't find the answer to before I head off for the day.
When I do something like this:
v1 = float_list_python = ... # <some list of floats>
v2 = float_array_NumPy = ... # <some numpy.ndarray of floats>
# I guess they don't have to be floats -
# but some object that also has a native
# object in C, so that numpy can just use
# that
If I want to multiply these vectors by a scalar, my understanding has always been that the python list is a list of object references, so looping through the list to do the multiplication must fetch the locations of all the floats, and then fetch the floats themselves in order to do it - which is one of the reasons it's slow.
If I do the same thing in NumPy, then, well, I'm not sure what happens. There are a number of things I imagine could happen:
It splits the multiplication up across the cores.
It vectorises the multiplications (as well?)
The documentation I've found suggests that many of the primitives in numpy take advantage of the first option there whenever they can (I don't have a computer on hand at the moment that I can test it on). And my intuition tells me that number 2 should happen whenever it's possible.
So my question is: if I create a NumPy array of python objects, will it still at least perform operations on the list in parallel? I know that if you create an array of objects that have native C types, it will actually create a contiguous array in memory of the actual objects, and that if you create a numpy array of python objects it will create an array of references, but I don't see why this would rule out parallel operations on said list, and I cannot find anywhere that explicitly states that.
EDIT: I feel there's a bit of confusion over what I'm asking. I understand what vectorisation is, I understand that it is a compiler optimisation, and not something you necessarily program in (though aligning the data so that it's contiguous in memory is important). On the grounds of vectorisation, all I wanted to know was whether or not numpy uses it. If I do something like np_array1 * np_array2, does the underlying library call use vectorisation (presuming that dtype is a compatible type)?
For the splitting up over the cores, all I mean there is: if I again do something like np_array1 * np_array2, but this time with dtype=object, would it divide that work up among the cores?
numpy is fast because it performs numeric operations like this in fast compiled C code. In contrast the list operation operates at the interpreted Python level (streamlined as much as possible with Python bytecodes etc).
A numpy array of numeric type stores those numbers in a data buffer. At least in the simple cases this is just a block of bytes that C code can step through efficiently. The array also has shape and strides information that allows multidimensional access.
When you multiply the array by a scalar, it, in effect, calls a C function titled something like 'multiply_array_by_scalar', which does the multiplication in fast compiled code. So this kind of numpy operation is fast (compared to Python list code) regardless of the number of cores or other multi-processing/threading enhancements.
Arrays of objects do not have any special speed advantage (compared to lists), at least not at this time.
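A rough timing sketch of that point (exact numbers depend on the machine and NumPy version):
import numpy as np
from timeit import timeit

a_float = np.arange(100000, dtype=float)
a_obj = a_float.astype(object)   # same numbers, stored as Python objects

# The object array has to call Python-level multiplication per element,
# so it loses most of numpy's usual speed advantage.
print(timeit(lambda: a_float * 2.5, number=100))
print(timeit(lambda: a_obj * 2.5, number=100))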
Look at my answer to a question about creating an array of arrays, https://stackoverflow.com/a/28284526/901925
I had to use iteration to initialize the values.
Have you done any time experiments? For example, construct an array, say of shape (1000,2). Use tolist() to create an equivalent list of lists. And make a similar array of objects, with each object being a (2,) array or list (how much work did that take?). Now do something simple like len(x) for each of those sublists.
@hpaulj provided a good answer to your question. In general, from reading your question it occurred to me that you do not actually understand what "vectorization" does under the hood. This writeup is a pretty decent explanation of vectorization and how it enables faster computations - http://quantess.net/2013/09/30/vectorization-magic-for-your-computations/
With regard to point 1 - distributing computations across multiple cores - this is not always the case with Numpy. However, there are libraries like numexpr that enable multithreaded, highly efficient Numpy array computations with support for several basic logical and arithmetic operators. Numexpr can be used to turbocharge critical computations when used in conjunction with Numpy, as it avoids replicating large arrays in memory for vectorization routines (as is the case for Numpy) and can use all cores on your system to perform computations.
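A minimal numexpr sketch, assuming the numexpr package is installed:
import numpy as np
import numexpr as ne   # third-party package: pip install numexpr

a = np.random.rand(1000000)
b = np.random.rand(1000000)

# numexpr parses the expression string, evaluates it in cache-sized chunks
# (optionally on several threads), and avoids large temporary arrays.
result = ne.evaluate("a * b + 2.5 * a")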

Python Arrays vs Lists

What is the main advantage of using the array module instead of lists?
The arrays will take less space.
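A quick sketch of that space difference (exact numbers vary by Python version and platform):
import sys
from array import array

values = [float(i) for i in range(1000)]
packed = array('d', values)   # 'd' = C double, 8 bytes per element

# The array stores raw doubles in one buffer; the list stores pointers to
# separately allocated float objects.
print(sys.getsizeof(packed))
print(sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values))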
I've never used the array module; numpy provides the same benefits plus many, many more.
Arrays are very similar to lists "except that the type of objects stored in them is constrained. The type is specified at object creation time by using a type code, which is a single character."
http://docs.python.org/library/array.html
So if you know your array will only contain objects of a certain type, then go ahead (if performance is crucial); if not, just use a list.
Arrays can be very useful if you want to strictly type the data in your collection. Performance aside, it can be quite convenient to be sure of the data types contained within your data structure. However, arrays don't feel very 'pythonic' to me (though I must stress this is only personal preference). Unless you are really concerned with the type of data within the collection, I would stick with lists, as they afford you a great deal of flexibility. The very slight memory optimisations gained from arrays over lists are insignificant unless you have an extremely large amount of data.

Better to use a tuple or a numpy array for storing coordinates?

I'm porting a C++ scientific application to python, and as I'm new to python, some questions have come to mind:
1) I'm defining a class that will contain the coordinates (x,y). These values will be accessed several times, but they will only be read after the class instantiation. Is it better to use a tuple or a numpy array, both memory- and access-time-wise?
2) In some cases, these coordinates will be used to build a complex number, evaluated on a complex function, and the real part of this function will be used. Assuming that there is no way to separate the real and complex parts of this function, and that the real part will have to be used at the end, maybe it is better to use complex numbers directly to store (x,y)? How bad is the overhead of the transformation from complex to real in python? The code in C++ does a lot of these transformations, and this is a big slowdown in that code.
3) Also, some coordinate transformations will have to be performed; the x and y values will be accessed separately, the transformation done, and the result returned. The coordinate transformations are defined in the complex plane, so is it still faster to use the x and y components directly than to rely on complex variables?
Thank you
In terms of memory consumption, numpy arrays are more compact than Python tuples.
A numpy array uses a single contiguous block of memory. All elements of the numpy array must be of a declared type (e.g. 32-bit or 64-bit float.) A Python tuple does not necessarily use a contiguous block of memory, and the elements of the tuple can be arbitrary Python objects, which generally consume more memory than numpy numeric types.
So this issue is a hands-down win for numpy, (assuming the elements of the array can be stored as a numpy numeric type).
On the issue of speed, I think the choice boils down to the question, "Can you vectorize your code?"
That is, can you express your calculations as operations done on entire arrays element-wise.
If the code can be vectorized, then numpy will most likely be faster than Python tuples. (The only case I could imagine where it might not be is if you had many very small tuples. In this case the overhead of forming the numpy arrays and the one-time cost of importing numpy might drown out the benefit of vectorization.)
An example of code that could not be vectorized would be if your calculation involved looking at, say, the first complex number in an array z, doing a calculation which produces an integer index idx, then retrieving z[idx], doing a calculation on that number, which produces the next index idx2, then retrieving z[idx2], etc. This type of calculation might not be vectorizable. In this case, you might as well use Python tuples, since you won't be able to leverage numpy's strength.
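For instance, a sketch of that kind of index-chasing loop, where each step needs the previous result and so cannot be expressed as one whole-array operation:
import numpy as np

z = np.random.rand(1000) + 1j * np.random.rand(1000)

idx = 0
visited = []
for _ in range(10):
    # The next index depends on the value we just looked up.
    idx = int(abs(z[idx]) * len(z)) % len(z)
    visited.append(idx)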
I wouldn't worry about the speed of accessing the real/imaginary parts of a complex number. My guess is the issue of vectorization will most likely determine which method is faster. (Though, by the way, numpy can transform an array of complex numbers to their real parts simply by striding over the complex array, skipping every other float, and viewing the result as floats. Moreover, the syntax is dead simple: If z is a complex numpy array, then z.real is the real parts as a float numpy array. This should be far faster than the pure Python approach of using a list comprehension of attribute lookups: [z.real for z in zlist].)
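For example, a small sketch of the .real access described above:
import numpy as np

z = np.array([1 + 2j, 3 - 4j, 0.5 + 0j])
print(z.real)                    # [1.   3.   0.5] -- a strided view, no copy

zlist = [1 + 2j, 3 - 4j, 0.5 + 0j]
print([c.real for c in zlist])   # the pure-Python per-element equivalent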
Just out of curiosity, what is your reason for porting the C++ code to Python?
A numpy array with an extra dimension is tighter in memory use, and at least as fast, as a numpy array of tuples; complex numbers are at least as good or even better, including for your third question. BTW, you may have noticed that -- while questions asked later than yours were getting answers aplenty -- yours was lying fallow: part of the reason is no doubt that asking three questions within one question turns responders off. Why not just ask one question per question? It's not as if you get charged for questions or anything, you know...!-)
