List of objects or parallel arrays of properties? - python

The question is, basically: what would be more preferable, both performance-wise and design-wise - to have a list of objects of a Python class or to have several lists of numerical properties?
I am writing some sort of a scientific simulation which involves a rather large system of interacting particles. For simplicity, let's say we have a set of balls bouncing inside a box so each ball has a number of numerical properties, like x-y-z-coordinates, diameter, mass, velocity vector and so on. How to store the system better? Two major options I can think of are:
to make a class "Ball" with those properties and some methods, then store a list of objects of the class, e. g. [b1, b2, b3, ...bn, ...], where for each bn we can access bn.x, bn.y, bn.mass and so on;
to make an array of numbers for each property, then for each i-th "ball" we can access it's 'x' coordinate as xs[i], 'y' coordinate as ys[i], 'mass' as masses[i] and so on;
To me it seems that the first option represents a better design. The second option looks somewhat uglier, but might be better in terms of performance, and it could be easier to use it with numpy and scipy, which I try to use as much as I can.
I am still not sure if Python will be fast enough, so it may be necessary to rewrite it in C++ or something, after initial prototyping in Python. Would the choice of data representation be different for C/C++? What about a hybrid approach, e.g. Python with C++ extension?
Update: I never expected any performance gain from parallel arrays per se, but in a mixed environment like Python + Numpy (or whatever SlowScriptingLanguage + FastNativeLibrary) using them may (or may not?) let you move more work out of you slow scripting code and into the fast native library.

Having an object for each ball in this example is certainly better design. Parallel arrays are really a workaround for languages that do not support proper objects. I wouldn't use them in a language with OO capabilities unless it's a tiny case that fits within a function (and maybe not even then) or if I've run out of every other optimization option and the profiler shows that property access is the culprit. This applies twice as much to Python as to C++, as the former places a large emphasis on readability and elegance.

I agree that parallel arrays are almost always a bad idea, but don't forget that you can use views into a numpy array when you're setting things, up, though... (Yes, I know this is effectively using parallel arrays, but I think it's the best option in this case...)
This is great if you know the number of "balls" you're going to create beforehand, as you can allocate an array for the coordinates, and store a view into that array for each ball object.
You have to be a bit careful to do operations in-place on the coords array, but it makes updating coordinates for numerous "balls" much, much, much faster.
For example...
import numpy as np
class Ball(object):
def __init__(self, coords):
self.coords = coords
def _set_coord(self, i, value):
self.coords[i] = value
x = property(lambda self: self.coords[0],
lambda self, value: self._set_coord(0, value))
y = property(lambda self: self.coords[1],
lambda self, value: self._set_coord(1, value))
def move(self, dx, dy):
self.x += dx
self.y += dy
def main():
n_balls = 10
n_dims = 2
coords = np.zeros((n_balls, n_dims))
balls = [Ball(coords[i,:]) for i in range(n_balls)]
# Just to illustrate that that the coords are updating
ball = balls[0]
# Random walk by updating coords array
print 'Moving all the balls randomly by updating coords'
for step in xrange(5):
# Add a random value to all coordinates
coords += 0.5 - np.random.random((n_balls, n_dims))
# Display the coords for a particular ball and the
# corresponding row of the coords array
print ' Value of ball.x, ball.y:', ball.x, ball.y
print ' Value of coords[0,:]:', coords[0,:]
# Move an individual ball object
print 'Moving a ball individually through Ball.move()'
ball.move(0.5, 0.5)
print ' Value of ball.x, ball.y:', ball.x, ball.y
print ' Value of coords[0,:]:', coords[0,:]
main()
Just to illustrate, this outputs something like:
Moving all the balls randomly by updating coords
Value of ball.x, ball.y: -0.125713650677 0.301692195466
Value of coords[0,:]: [-0.12571365 0.3016922 ]
Value of ball.x, ball.y: -0.304516863495 -0.0447543559805
Value of coords[0,:]: [-0.30451686 -0.04475436]
Value of ball.x, ball.y: -0.171589457954 0.334844443821
Value of coords[0,:]: [-0.17158946 0.33484444]
Value of ball.x, ball.y: -0.0452864552743 -0.0297552313656
Value of coords[0,:]: [-0.04528646 -0.02975523]
Value of ball.x, ball.y: -0.163829876915 0.0153203173857
Value of coords[0,:]: [-0.16382988 0.01532032]
Moving a ball individually through Ball.move()
Value of ball.x, ball.y: 0.336170123085 0.515320317386
Value of coords[0,:]: [ 0.33617012 0.51532032]
The advantage here is that updating a single numpy array is going to be much, much faster than iterating through all of your ball objects, but you retain a more object-oriented approach.
Just my thoughts on it, anyway..
EDIT: To give some idea of the speed difference, with 1,000,000 balls:
In [104]: %timeit coords[:,0] += 1.0
100 loops, best of 3: 11.8 ms per loop
In [105]: %timeit [item.x + 1.0 for item in balls]
1 loops, best of 3: 1.69 s per loop
So, updating the coordinates directly using numpy is roughly 2 orders of magnitude faster when using a large number of balls. (the difference is smaller when using 10 balls, as per the example, roughly a factor of 2x, rather than 150x)

I think it depends on what you're going to be doing with them, and how often you're going to be working with (all attributes of one particle) vs (one attribute of all particles). The former is better suited to the object approach; the latter is better suited to the array approach.
I was facing a similar problem (although in a different domain) a couple of years ago. The project got deprioritized before I actually implemented this phase, but I was leaning towards a hybrid approach, where in addition to the Ball class I would have an Ensemble class. The Ensemble would not be a list or other simple container of Balls, but would have its own attributes (which would be arrays) and its own methods. Whether the Ensemble is created from the Balls, or the Balls from the Ensemble, depends on how you're going to construct them.
One of my coworkers was arguing for a solution where the fundamental object was an Ensemble which might contain only one Ball, so that no calling code would ever have to know whether you were operating on just one Ball (do you ever do that for your application?) or on many.

Will you be having any forces between the balls (hard sphere/collision, gravity, electromagnetic)? I'm guessing so. Will you be having a large enough number of balls to want to use Barnes-Hut simulation ideas? If so, then you should definitely use the Ball class idea so that you can store them easily in octrees or something else along those lines. Also, using the Barnes-Hut simulation will cut down the complexity of the simulation to O(N log N) from O(N^2).
Really though, if you don't have forces between the balls or aren't using many balls, you don't need the possible speed gains from using parallel arrays and should go with the Ball class idea for that as well.

Related

Algorithm Design - When to use a dictionary vs a list to track values

For the following problem, I used a dictionary to track values while the provided answer used a list. Is there a quick way to determine the most efficient data structures for problems like these?
A robot moves in a plane starting from the original point (0,0). The
robot can move toward UP, DOWN, LEFT and RIGHT with a given steps. The
trace of robot movement is shown as the following: UP 5 DOWN 3 LEFT 3
RIGHT 2.­ The numbers after the direction are steps. Please write a
program to compute the distance from current position after a sequence
of movement and original point. If the distance is a float, then just
print the nearest integer. Example: If the following tuples are given
as input to the program: UP 5 DOWN 3 LEFT 3 RIGHT 2 Then, the output
of the program should be: 2
My answer uses a dictionary (origin["y"] for y and origin["x"] for x):
direction = 0
steps = 0
command = (direction, steps)
command_list = []
origin = {"x": 0, "y": 0}
while direction is not '':
direction = input("Direction (U, D, L, R):")
steps = input("Number of steps:")
command = (direction, steps)
command_list.append(command)
print(command_list)
while len(command_list) > 0:
current = command_list[-1]
if current[0] == 'U':
origin["y"] += int(current[1])
elif current[0] == 'D':
origin["y"] -= int(current[1])
elif current[0] == 'L':
origin["x"] -= int(current[1])
elif current[0] == 'R':
origin["x"] += int(current[1])
command_list.pop()
distance = ((origin["x"])**2 + (origin["y"])**2)**0.5
print(distance)
The provided answer uses a list (pos[0] for y, and pos[1] for x):
import math
pos = [0,0]
while True:
s = raw_input()
if not s:
break
movement = s.split(" ")
direction = movement[0]
steps = int(movement[1])
if direction=="UP":
pos[0]+=steps
elif direction=="DOWN":
pos[0]-=steps
elif direction=="LEFT":
pos[1]-=steps
elif direction=="RIGHT":
pos[1]+=steps
else:
pass
print int(round(math.sqrt(pos[1]**2+pos[0]**2)))
I'll offer a few points on your question because I strongly disagree with the close recommendations. There's much in your question that's not opinion.
In general, your choice of dictionary wasn't appropriate. For a toy program like this it doesn't make much difference, but I assume you're interested in best practice for serious programs. In production software, you wouldn't make this choice. Why?
Error prone-ness. A typo in future code, e.g. origin["t"] = 3 when you meant origin["y"] = 3 is a nasty bug, maybe difficult to find. t = 3 is more likely to cause a "fast failure." (In a statically typed language like C++ or Java, it's a sure compile-time error.)
Space overhead. A simple scalar variable requires essentially no space beyond the value itself. An array has a fixed overhead for the "dope vector" that tracks its location, current, and maximum size. A dictionary requires yet more extra space for open addressing, unused hash buckets, and fill tracking.
Speed.
Accessing a scalar variable is very fast: just a few processor instructions.
Accessing a tuple or array element when you know its index is also very fast, though not as fast as variable access. Extra instructions are needed to check array bounds. Adding one element to an array may take O(current array size) to copy current contents into a larger block of memory. The advantage of tuples and arrays is that you can access elements quickly based on a computed integer index. Scalar variables don't do this. Choose an array/tuple when you need integer index access. Favor tuples when you know the exact size and it's unlikely to change. Their immutability tends to make code more understandable (and thread safe).
Accessing a dictionary element is still more expensive because a hash value must be computed and buckets traversed with possible collision resolution. Adding a single element can also trigger a table reorganization, which is O(table size) with constant factor much bigger than list reorganization because all the elements must be rehashed. The big advantage of dictionaries is that accessing all stored pairs is likely to take the same amount of time. You should choose a dict only when you need that capability: to store a "map" from keys to values.
Conclude from all the above that the best choice for your origin coordinates would have been simple variables. If you later enhance the program in a way that requires passing (x, y) pairs to/from methods, then you'd consider a Point class.

Fast filtering of a list by attribute of each element (of type scikit-image.RegionProperties)

I have a large (>200,000) list of objects (of type RegionProperties, produced by skimage.measure.regionprops). The attributes of each object can be accessed with [] or .. For example:
my_list = skimage.measure.regionprops(...)
my_list[0].area
gets the area.
I want to filter this list to extract elements which have area > 300 to then act on them. I have tried:
# list comprehension
selection = [x for x in my_list if x.area > 300]
for foo in selection:
...
# filter (with predefined function rather than lambda, for speed)
def my_condition(x)
return(x.area > 300)
selection = filter(my_condition, my_list)
for foo in selection:
...
# generator
def filter_by_area(x):
for el in x:
if el.area > 300: yield el
for foo in filter_by_area(prop):
...
I find that generator ~ filter > comprehension in terms of speed but only marginally (4.15s, 4.16s, 4.3s). I have to repeat such a filter thousands of times, resulting in hours of CPU time just filtering a list. This simple operation if currently the bottleneck of the whole image analysis process.
Is there another way of doing this? Possibly involving C, or some peculiarity of RegionProperties objects?
Or maybe a completely different algorithm? I thought about eroding the image to make small particles disappear and only keep large ones, but the measurements have to be done on the non-eroded image and finding the correspondance between the two is long too.
Thank you very much in advance for any pointer!
As suggested by Mr. F, I tried isolating the filtering part by doing some dumb operation in the loop:
selection = [x for x in my_list if x.area > 300]
for foo in selection:
a = 1 + 1
this resulted in exactly the same times as before, even though I was extracting a few properties of the particles in the loop before. This pushed me to look more into how the area property of particles, on which I am doing the filtering, is extracted.
It turns out that skimage.measure.regionprops just prepares the data to compute the properties, it does not compute them explicitly. Extracting one property (such as area) triggers the computation of all the properties needed to get to the extracted property. It turns out that the area is computed as the first moment of the particle image, which, in turns, triggers the computation of all the moments, which triggers other computations, etc. So just doing x.area is not just extracting a pre-computed value but actually computing plenty of stuff.
There is a simpler solution to compute the area. For the record, I do it this way
numpy.sum(x._label_image[x._slice] == x.label)
So my problem is actually very specific to scikit-image RegionProperties objects. By using the formula above to compute the area, instead of using x.area, I get the filtering time down from 4.3s to ~1s.
Thanks for Mr. F for the comment which prompted me to go on this exploration of the code of scikit-image and solve my performance problem (the whole image processing routine when from several days to several hours!).
PS: by the way, with this code, it seems list comprehension gets a (very small) edge over the other two methods. And it's clearer so that's perfect!

Getting Keys Within Range/Finding Nearest Neighbor From Dictionary Keys Stored As Tuples

I have a dictionary which has coordinates as keys. They are by default in 3 dimensions, like dictionary[(x,y,z)]=values, but may be in any dimension, so the code can't be hard coded for 3.
I need to find if there are other values within a certain radius of a new coordinate, and I ideally need to do it without having to import any plugins such as numpy.
My initial thought was to split the input into a cube and check no points match, but obviously that is limited to integer coordinates, and would grow exponentially slower (radius of 5 would require 729x the processing), and with my initial code taking at least a minute for relatively small values, I can't really afford this.
I heard finding the nearest neighbor may be the best way, and ideally, cutting down the keys used to a range of +- a certain amount would be good, but I don't know how you'd do that when there's more the one point being used.Here's how I'd do it with my current knowledge:
dimensions = 3
minimumDistance = 0.9
#example dictionary + input
dictionary[(0,0,0)]=[]
dictionary[(0,0,1)]=[]
keyToAdd = [0,1,1]
closestMatch = 2**1000
tooClose = False
for keys in dictionary:
#calculate distance to new point
originalCoordinates = str(split( dictionary[keys], "," ) ).replace("(","").replace(")","")
for i in range(dimensions):
distanceToPoint = #do pythagors with originalCoordinates and keyToAdd
#if you want the overall closest match
if distanceToPoint < closestMatch:
closestMatch = distanceToPoint
#if you want to just check it's not within that radius
if distanceToPoint < minimumDistance:
tooClose = True
break
However, performing calculations this way may still run very slow (it must do this to millions of values). I've searched the problem, but most people seem to have simpler sets of data to do this to. If anyone can offer any tips I'd be grateful.
You say you need to determine IF there are any keys within a given radius of a particular point. Thus, you only need to scan the keys, computing the distance of each to the point until you find one within the specified radius. (And if you do comparisons to the square of the radius, you can avoid the square roots needed for the actual distance.)
One optimization would be to sort the keys based on their "Manhattan distance" from the point (that is, add the component offsets), since the Euclidean distance will never be less than this. This would avoid some of the more expensive calculations (though I don't think you need and trigonometry).
If, as you suggest later in the question, you need to handle multiple points, you can obviously process each individually, or you could find the center of those points and sort based on that.

How's this for collision detection?

I have to write a collide method inside a Rectangle class that takes another Rectangle object as a parameter and returns True if it collides with the rectangle performing the method and False if it doesn't. My solution was to use a for loop that iterates through every value of x and y in one rectangle to see if it falls within the other, but I suspect there might be more efficient or elegant ways to do it. This is the method (I think all the names are pretty self explanatory, just ask if anything isn't clear):
def collide(self,target):
result = False
for x in range(self.x,self.x+self.width):
if x in range(target.get_x(),target.get_x()+target.get_width()):
result = True
for y in range(self.y,self.y+self.height):
if y in range(target.get_y(),target.get_y()+target.get_height()):
result = True
return result
Thanks in advance!
The problem of collision detection is a well-known one, so I thought rather than speculate I might search for a working algorithm using a well-known search engine. It turns out that good literature on rectangle overlap is less easy to come by than you might think. Before we move on to that, perhaps I can comment on your use of constructs like
if x in range(target.get_x(),target.get_x()+target.get_width()):
It is to Python's credit that such an obvious expression of your idea actually succeeds as intended. What you may not realize is that (in Python 2, anyway) each use of range() creates a list (in Python 3 it creates a generator and iterates over that instead; if you don't know what that means please just accept that it's little better in computational terms). What I suspect you may have meant is
if target.get_x() <= x < target.get_x()+target.get_width():
(I am using open interval testing to reflect your use of range())This has the merit of replacing N equality comparisons with two chained comparisons. By a relatively simple mathematical operation (subtracting target.get_x() from each term in the comparison) we transform this into
if 0 <= x-target.get_x() < target.get_width():
Do not overlook the value of eliminating such redundant method calls, though it's often simpler to save evaluated expressions by assignment for future reference.
Of course, after that scrutiny we have to look with renewed vigor at
for x in range(self.x,self.x+self.width):
This sets a lower and an upper bound on x, and the inequality you wrote has to be false for all values of x. Delving beyond the code into the purpose of the algorithm, however, is worth doing. Because any lit creation the inner test might have done is now duplicated many times over (by the width of the object, to be precise). I take the liberty of paraphrasing
for x in range(self.x,self.x+self.width):
if x in range(target.get_x(),target.get_x()+target.get_width()):
result = True
into pseudocode: "if any x between self.x and self.x+self.width lies between the target's x and the target's x+width, then the objects are colliding". In other words, whether two ranges overlap. But you sure are doing a lot of work to find that out.
Also, just because two objects collide in the x dimension doesn't mean they collide in space. In fact, if they do not also collide in the y dimension then the objects are disjoint, otherwise you would assess these rectangles as colliding:
+----+
| |
| |
+----+
+----+
| |
| |
+----+
So you want to know if they collide in BOTH dimensions, not just one. Ideally one would define a one-dimensional collision detection (which by now we just about have ...) and then apply in both dimensions. I also hope that those accessor functions can be replaced by simple attribute access, and my code is from now on going to assume that's the case.
Having gone this far, it's probably time to take a quick look at the principles in this YouTube video, which makes the geometry relatively clear but doesn't express the formula at all well. It explains the principles quite well as long as you are using the same coordinate system. Basically two objects A and B overlap horizontally if A's left side is between B's left and right sides. They also overlap if B's right is between A's left and right. Both conditions might be true, but in Python you should think about using the keyword or to avoid unnecessary comparisons.
So let's define a one-dimensional overlap function:
def oned_ol(aleft, aright, bleft, bright):
return (aleft <= bright < aright) or (bleft <= aright < bright)
I'm going to cheat and use this for both dimensions, since the inside of my function doesn't know which dimension's data I cam calling it with. If I am correct, the following formulation should do:
def rect_overlap(self, target):
return oned_ol(self.x, self.x+self.width, target.x, target.x+target.width) \
and oned_ol(self.y, self.y+self.height, target.y, target.y+target.height
If you insist on using those accessor methods you will have to re-cast the code to include them. I've done sketchy testing on the 1-D overlap function, and none at all on rect_overlap, so please let me know - caveat lector. Two things emerge.
A superficial examination of code can lead to "optimization" of a hopelessly inefficient algorithm, so sometimes it's better to return to first principles and look more carefully at your algorithm.
If you use expressions as arguments to a function they are available by name inside the function body without the need to make an explicit assignment.
def collide(self, target):
# self left of target?
if x + self.width < target.x:
return False
# self right of target?
if x > target.x + target.width :
return False
# self above target?
if y + self.height < target.y:
return False
# self below target?
if y > target.y + target.height:
return False
return True
Something like that (depends on your coord system, i.e. y positive up or down)

python: representing square grid that wraps on itself (cylinder)

I am modeling something that occurs on a square grid that wraps on itself (i.e., if you walk up past the highest point, you end up at the lowest point, like a cylinder; if you walk to the right, you just hit the boundary). I need to keep track of the location of various agents, the amount of resources at different points, and calculate the direction that agents will be moving in based on certain rules.
What's the best way to model this?
Should I make a class that represents points, which has methods to return neighboring points in each direction? If so, I would probably need to make it hashable so that I can use it as keys for the dictionary that contains the full grid (I assume such grid should be a dictionary?)
Or should I make a class that describes the whole grid and not expose individual points as independent objects?
Or should I just use regular (x, y) tuples and have methods elsewhere that allow to look up neighbors?
A lot of what I'll need to model is not yet clearly defined. Furthermore, I expect the geometry of the surface might possibly change one day (e.g., it could wrap on itself in both directions).
EDIT: One additional question: should I attach the information about the quantity of resources to each Point instance; or should I have a separate class that contains a map of resources indexed by Point?
If you want a hashable Point class without too much work, subclass tuple and add your own neighbor methods.
class Point(tuple):
def r_neighbor(self):
return Point((self[0] + 1, self[1]))
def l_neighbor(self):
[...]
x = Point((10, 11))
print x
print x.r_neighbor()
The tuple constructor wants an iterable, hence the double-parens in Point((10, 11)); if you want to avoid that, you can always override __new__ (overriding __init__ is pointless because tuples are immutable):
def __new__(self, x, y):
return super(Point, self).__new__(self, (x, y))
This might also be the place to apply modular arithmetic -- though that will really depend on what you are doing:
def __new__(self, x, y, gridsize=100):
return super(Point, self).__new__(self, (x % gridsize, y % gridsize))
or to enable arbitrary dimension grids, and go back to using tuples in __new__:
def __new__(self, tup, gridsize=100):
return super(Point, self).__new__(self, (x % gridsize for x in tup))
Regarding your question about resources: since Point is an immutable class, it's a poor place to store information about resources that might change. A defaultdict would be handy; you wouldn't have to initialize it.
from collections import defaultdict
grid = defaultdict(list)
p = Point((10, 13))
grid[(10, 13)] = [2, 3, 4]
print grid[p] # prints [2, 3, 4]
print grid[p.r_neighbor] # no KeyError; prints []
If you want more flexibility, you could use a dict instead of a list in defaultdict; but defaultdict(defaultdict) won't work; you have to create a new defaultdict factory function.
def intdict():
return defaultdict(int)
grid = defaultdict(intdict)
or more concisely
grid = defaultdict(lambda: defaultdict(int))
then
p = Point((10, 13))
grid[(10, 13)]["coins"] = 50
print grid[p]["coins"] # prints 50
print grid[p.r_neighbor]["coins"] # prints 0; again, no KeyError
I need to keep track of the location
of various agents, the amount of
resources at different points, and
calculate the direction that agents
will be moving in based on certain
rules.
Sounds like a graph to to me, though I try to see a graph in every problem. All the operations you mentioned (move around, store resources, find out where to move to) are very common on graphs. You would also be able to easily change the topology, from a cylinder to a torus or in any other way.
The only issue is that it takes more space than other representations.
On the plus side you can use a graph library to create the graph and probably even some graph algorithms to calculate where agents go.

Categories

Resources