Is there a faster/smarter way to perform operations on every element of a numpy array? What I specifically have is a list of datetime objects like, e.g.:
hh = np.array( [ dt.date(2000, 1, 1), dt.date(2001, 1, 1) ] )
To get a list of of years from that I do at the moment:
years = np.array( [ x.year for x in hh ] )
Is there a smarter way to do this? I'm thinking something like
hh.year
which obviously doesn't work.
I have a script in which I need different variations of a (much longer) array constantly (year, month, hours...). Of course I could always just define a separate array for everything but like there should be a more elegant solution.
If you evaluate a python expression for each element, it doesn't matter whether the iteration will be done in C++ or Python. What will have weight is the python-complexity of the evaluated (in-loop) expression. This means: If your (in-loop) expression takes 1 microsec (a very simple script), it will be significantly harder than the difference between using a python iteration or a C++ iteration (you have a "marshalling" between C++ and PyObjects, and that applies to python functions as well).
For that reason, calling vectorize is -under the hoods- done in Python: what is called inside is python code. The idea behind vectorize is not performance, but code readability and ease of iteration: vectorize performs introspection (of function's parameters) and serves well for N-dimensional iterations (i.e. a lambda x,y: x+y automagically serves to iterate in two dimensions).
So: no, there's no "fast" way to iterate python code. The final speed that matters is the speed of your inner python code.
Edit: your -desired- hh.year looks like hh*.year equivalent in groovy, but even there under the hoods is the same as an in-code iteration. Comprehensions are the fastest (and equivalent) way in python. The real pity is being forced to:
years = np.array( [ x.year for x in hh ] )
(which forces you to create another provably-huge-sized) instead of letting you use any type of iterator:
years = np.array( x.year for x in hh )
Edit (suggestion by #Jaime): You can't construct array with that function from an iterator. For that, you must use:
np.fromiter(x.year for x in hh, dtype=int, count=len(x))
which lets you save the time and memory of building an intermediate array. This exact approach works for any sequence to avoid the inner-list creation (this one would be your case) but does not work with other types of generators, for future cases you'd need.
You can use numpy.vectorize.
Doing some benchmarking, performance is pretty similar (vectorize slightly slower than a list comprehension), and in my opinion numpy.vectorize(lambda j: j.year)(hh) (or something similar) doesn't look super elegant.
Related
Suppose that you have two NumPy arrays, a and b, and you want to test whether any value of a is greater than the corresponding value of b.
Now you could calculate a boolean array and call its any method:
(a > b).any()
This will do all the looping internally, which is good, but it suffers from the need to perform the comparison on all the pairs even if, say, the very first result evaluates as True.
Alternatively, you could do an explicit loop over scalar comparisons. An example implementation in the case where a and b are the same shape (so broadcasting is not required) might look like:
any(ai > bi for ai, bi in zip(a.flatten(), b.flatten()))
This will benefit from the ability to stop processing after the first True result is encountered, but with all the costs associated with an explicit loop in Python (albeit inside a comprehension).
Is there any way, either in NumPy itself or in an external library, that you could pass in a description of the operation that you wish to perform, rather than the result of that operation, and then have it perform the operation internally (in optimised low-level code) inside an "any" loop that can be broken out from?
One could imagine hypothetically some kind of interface like:
from array_operations import GreaterThan, Any
expression1 = GreaterThan('x', 'y')
expression2 = Any(expression1)
print(expression2.evaluate(x=a, y=b))
If such a thing exists, clearly it could have other uses beyond efficient evaluation of all and any, in terms of being able to create functions dynamically.
Is there anything like this?
One way to solve this is with delayed/deferred/lazy evaluation. The C++ community uses something called "expression templates" to achieve this; you can find an accessible overview here: http://courses.csail.mit.edu/18.337/2015/projects/TylerOlsen/18337_tjolsen_ExpressionTemplates.pdf
In Python the easiest way to do this is using Numba. You basically just write the function you need in Python using for loops, then you decorate it with #numba.njit and it's done. Like this:
#numba.njit
def any_greater(a, b):
for ai, bi in zip(a.flatten(), b.flatten()):
if ai > bi:
return True
return False
There is/was a NumPy enhancement proposal that could help your use case, but I don't think it has been implemented: https://docs.scipy.org/doc/numpy-1.13.0/neps/deferred-ufunc-evaluation.html
I am looking to parallelise a function which takes multiple 1-dimensional ranges (which are of the form np.linspace(x,y,t)) of numerical input values (this is variable, but lets say it takes five), creates a mesh out of these ranges, and then evaluates some (5-dimensional) cost function for this over this mesh. In its current form it looks something like this:
def func_5d(a,b,c,d,e):
return a + b + c + d + e
def range_search(a_range, b_range, c_range, d_range, e_range):
mesh = itertools.product(a_range, b_range, c_range, d_range, e_range)
func_eval = map(lambda x: (func_5d(np.array(x)), x), mesh)
return func_eval
So, here I would be looking to parallelise the function range_search using dask. Ideally, this would be done by creating a dask mesh, which could then be chunked, and then mapped through to our cost function using either multi-threading or multi-core processing. Looking through the dask documentation, it does not appear that dask.array contains any suitable mechanism to achieve this. There is a dask.array.meshgrid function, extended from the numpy library, but this does not support chunking. Additionally, dask.array does not seem to contain a paralellised map function. However, there is one in dask.bag. But the documentation seems to suggest that dask.bag is used only as a module to carry out preliminary processing of raw data (in formats such as CSV, JSON, etc). Dask.bag objects do also have a method called product() which seems to imitate the itertools.product; however this only takes one other dask.bag object as an argument. So meshing 5 arrays required this method called to be stacked (4 times), which aside from being hideously ugly, is also inefficent when the number of inputs is variable.
From here, I don't really know where to go. I have worked through the Jupyter Notebooks that dask have put together, but they do not seem to hold an answer to my question. Any suggestions on the best approach to paralellising functions of the above form would be much appreciated.
I would use Numpy Slicing for this
a[:, None, None] + b[None, :, None] + c[None, None, :]
You will want to make sure that your input vectors are chunked finely enough that the products of them will still fit comfortably in memory.
The related question How do I merge two python iterators? works well for two independent iterators. However, I haven't been able to find or think of the tools necessary for merging two iterators where one is recursive and takes the other as an input. I have iterator stuff that is a simple list. Then I have iterator theta that takes a function func and yields x, func(x), func(func(x)), where one of the inputs to func is an element of stuff. I've solved this with mutable state as follows:
theta = some_initial_theta
for thing in stuff:
theta = update_theta(theta, thing)
return theta
A concrete example in this format:
def update_theta(theta, thing):
return thing * 2 + theta
stuff = [100, 200, 300, 400]
def my_iteration():
theta = 0
for thing in stuff:
theta = update_theta(theta, thing)
print(theta)
# This prints 2000
I'm sure there's an elegant way of doing this without the mutable state and the for loop. A simple zip doesn't do it for me because the theta iterator uses its previous element as an input to the next element.
One elegant way of expressing theta is using the iterate method available in the more_itertools package:
iterate(lambda theta: update_theta(theta, thing), some_initial_theta)
However, the problem with this is that thing will be fixed throughout the iteration. It would be possible to deal with this by passing in the entire list stuff and then return the remainder of it from the update_theta method:
iterate(lambda theta: update_theta(theta[0], theta[1]), (some_initial_theta, stuff))
However, I'd really rather not modify the update_theta method to take an entire list it's not interested in and deal with the mechanics of returning the tail of that list. While it's programmatically not difficult, it's poor separation of concerns. update_theta shouldn't know anything about or care about the entire list stuff.
As Peter Wood suggests in the comments, this is exactly what the built-in function reduce does:
result = reduce(update_theta, stuff, some_initial_theta)
In Python 3, reduce has been moved to functools.reduce, so you'd need to import that:
from functools import reduce
If you want an iterator of all the intermediate values, Python 3 provides itertools.accumulate. There's no argument to specify an initial value, so you'd need to put the initial value in the iterator:
from itertools import accumulate, chain
result_iterator = accumulate(chain([some_initial_theta], stuff), update_theta)
Python 2 doesn't have itertools.accumulate, but you could copy the equivalent code from the Python 3 documentation. There's no easy way to formulate it in terms of the Python 2 standard tools, which is why people wanted it added to Python 3 in the first place.
Iterations is better or lambda functions are better in respect with time processing or memory usage and other things?
for example :
x = [10, 20, 30]
for y in x:
if y>10 or y<20:
print y
this is better or lambda function?
I want the answer with respect to the time processing, memory usage or any other comparisons.
Iterators and lambdas are two completely different things. A lambda is a simple inline function and an iterator is an object which returns successive objects. There are two major problems with your example: You are testing x instead of y and all values in x will pass for y>10 or y<20. So, correcting those, your example could be written using an iterator and a lambda like this:
for value in filter(lamdba y: y < 10 or y > 20, x):
print(value)
There are several ways you could do this, but in terms of performance it depends on what data you're processing, how you're processing it and how much of it you're processing. See http://wiki.python.org/moin/PythonSpeed/PerformanceTips for a useful guide.
For your case the classic loop is obviously better since you don't want to create a new list or generator.
Not creating such an object makes it more memory-efficient and not calling a function for each element makes it more performant.
I find the list comprehension notation much easier to read than the functional notation, especially as the complexity of the expression to be mapped increases. In addition, the list comprehension executes much faster than the solution using map and lambda. This is because calling a lambda function creates a new stack frame while the expression in the list comprehension is evaluated without creating a new stack frame. >> http://python-history.blogspot.com/2010/06/from-list-comprehensions-to-generator.html
In other words, whether you have a choice between a lambda and a loop/comprehension/generator, use the latter. I guess the most pythonic way to write your example would be something like
print [y for y in x if y < 20]
I know Ruby very well. I believe that I may need to learn Python presently. For those who know both, what concepts are similar between the two, and what are different?
I'm looking for a list similar to a primer I wrote for Learning Lua for JavaScripters: simple things like whitespace significance and looping constructs; the name of nil in Python, and what values are considered "truthy"; is it idiomatic to use the equivalent of map and each, or are mumble somethingaboutlistcomprehensions mumble the norm?
If I get a good variety of answers I'm happy to aggregate them into a community wiki. Or else you all can fight and crib from each other to try to create the one true comprehensive list.
Edit: To be clear, my goal is "proper" and idiomatic Python. If there is a Python equivalent of inject, but nobody uses it because there is a better/different way to achieve the common functionality of iterating a list and accumulating a result along the way, I want to know how you do things. Perhaps I'll update this question with a list of common goals, how you achieve them in Ruby, and ask what the equivalent is in Python.
Here are some key differences to me:
Ruby has blocks; Python does not.
Python has functions; Ruby does not. In Python, you can take any function or method and pass it to another function. In Ruby, everything is a method, and methods can't be directly passed. Instead, you have to wrap them in Proc's to pass them.
Ruby and Python both support closures, but in different ways. In Python, you can define a function inside another function. The inner function has read access to variables from the outer function, but not write access. In Ruby, you define closures using blocks. The closures have full read and write access to variables from the outer scope.
Python has list comprehensions, which are pretty expressive. For example, if you have a list of numbers, you can write
[x*x for x in values if x > 15]
to get a new list of the squares of all values greater than 15. In Ruby, you'd have to write the following:
values.select {|v| v > 15}.map {|v| v * v}
The Ruby code doesn't feel as compact. It's also not as efficient since it first converts the values array into a shorter intermediate array containing the values greater than 15. Then, it takes the intermediate array and generates a final array containing the squares of the intermediates. The intermediate array is then thrown out. So, Ruby ends up with 3 arrays in memory during the computation; Python only needs the input list and the resulting list.
Python also supplies similar map comprehensions.
Python supports tuples; Ruby doesn't. In Ruby, you have to use arrays to simulate tuples.
Ruby supports switch/case statements; Python does not.
Ruby supports the standard expr ? val1 : val2 ternary operator; Python does not.
Ruby supports only single inheritance. If you need to mimic multiple inheritance, you can define modules and use mix-ins to pull the module methods into classes. Python supports multiple inheritance rather than module mix-ins.
Python supports only single-line lambda functions. Ruby blocks, which are kind of/sort of lambda functions, can be arbitrarily big. Because of this, Ruby code is typically written in a more functional style than Python code. For example, to loop over a list in Ruby, you typically do
collection.each do |value|
...
end
The block works very much like a function being passed to collection.each. If you were to do the same thing in Python, you'd have to define a named inner function and then pass that to the collection each method (if list supported this method):
def some_operation(value):
...
collection.each(some_operation)
That doesn't flow very nicely. So, typically the following non-functional approach would be used in Python:
for value in collection:
...
Using resources in a safe way is quite different between the two languages. Here, the problem is that you want to allocate some resource (open a file, obtain a database cursor, etc), perform some arbitrary operation on it, and then close it in a safe manner even if an exception occurs.
In Ruby, because blocks are so easy to use (see #9), you would typically code this pattern as a method that takes a block for the arbitrary operation to perform on the resource.
In Python, passing in a function for the arbitrary action is a little clunkier since you have to write a named, inner function (see #9). Instead, Python uses a with statement for safe resource handling. See How do I correctly clean up a Python object? for more details.
I, like you, looked for inject and other functional methods when learning Python. I was disappointed to find that they weren't all there, or that Python favored an imperative approach. That said, most of the constructs are there if you look. In some cases, a library will make things nicer.
A couple of highlights for me:
The functional programming patterns you know from Ruby are available in Python. They just look a little different. For example, there's a map function:
def f(x):
return x + 1
map(f, [1, 2, 3]) # => [2, 3, 4]
Similarly, there is a reduce function to fold over lists, etc.
That said, Python lacks blocks and doesn't have a streamlined syntax for chaining or composing functions. (For a nice way of doing this without blocks, check out Haskell's rich syntax.)
For one reason or another, the Python community seems to prefer imperative iteration for things that would, in Ruby, be done without mutation. For example, folds (i.e., inject), are often done with an imperative for loop instead of reduce:
running_total = 0
for n in [1, 2, 3]:
running_total = running_total + n
This isn't just a convention, it's also reinforced by the Python maintainers. For example, the Python 3 release notes explicitly favor for loops over reduce:
Use functools.reduce() if you really need it; however, 99 percent of the time an explicit for loop is more readable.
List comprehensions are a terse way to express complex functional operations (similar to Haskell's list monad). These aren't available in Ruby and may help in some scenarios. For example, a brute-force one-liner to find all the palindromes in a string (assuming you have a function p() that returns true for palindromes) looks like this:
s = 'string-with-palindromes-like-abbalabba'
l = len(s)
[s[x:y] for x in range(l) for y in range(x,l+1) if p(s[x:y])]
Methods in Python can be treated as context-free functions in many cases, which is something you'll have to get used to from Ruby but can be quite powerful.
In case this helps, I wrote up more thoughts here in 2011: The 'ugliness' of Python. They may need updating in light of today's focus on ML.
My suggestion: Don't try to learn the differences. Learn how to approach the problem in Python. Just like there's a Ruby approach to each problem (that works very well givin the limitations and strengths of the language), there's a Python approach to the problem. they are both different. To get the best out of each language, you really should learn the language itself, and not just the "translation" from one to the other.
Now, with that said, the difference will help you adapt faster and make 1 off modifications to a Python program. And that's fine for a start to get writing. But try to learn from other projects the why behind the architecture and design decisions rather than the how behind the semantics of the language...
I know little Ruby, but here are a few bullet points about the things you mentioned:
nil, the value indicating lack of a value, would be None (note that you check for it like x is None or x is not None, not with == - or by coercion to boolean, see next point).
None, zero-esque numbers (0, 0.0, 0j (complex number)) and empty collections ([], {}, set(), the empty string "", etc.) are considered falsy, everything else is considered truthy.
For side effects, (for-)loop explicitly. For generating a new bunch of stuff without side-effects, use list comprehensions (or their relatives - generator expressions for lazy one-time iterators, dict/set comprehensions for the said collections).
Concerning looping: You have for, which operates on an iterable(! no counting), and while, which does what you would expect. The fromer is far more powerful, thanks to the extensive support for iterators. Not only nearly everything that can be an iterator instead of a list is an iterator (at least in Python 3 - in Python 2, you have both and the default is a list, sadly). The are numerous tools for working with iterators - zip iterates any number of iterables in parallel, enumerate gives you (index, item) (on any iterable, not just on lists), even slicing abritary (possibly large or infinite) iterables! I found that these make many many looping tasks much simpler. Needless to say, they integrate just fine with list comprehensions, generator expressions, etc.
In Ruby, instance variables and methods are completely unrelated, except when you explicitly relate them with attr_accessor or something like that.
In Python, methods are just a special class of attribute: one that is executable.
So for example:
>>> class foo:
... x = 5
... def y(): pass
...
>>> f = foo()
>>> type(f.x)
<type 'int'>
>>> type(f.y)
<type 'instancemethod'>
That difference has a lot of implications, like for example that referring to f.x refers to the method object, rather than calling it. Also, as you can see, f.x is public by default, whereas in Ruby, instance variables are private by default.