Good evening,
I am currently working on a first year university project to simulate continuum percolation. This involves randomly distributing some discs/spheres/hyperspheres across a square/cube/hypercube in n dimensional space and finding a cluster of connected particles that spans the boundaries.
To speed up what is essentially collision detection between all these particles, which groups them into connected clusters, I have decided to use spatial partitioning so that my program scales well with the number of particles. This requires dividing the n-dimensional space into evenly sized boxes/cubes/hypercubes and placing each particle in the relevant box. The collision check then needs fewer comparisons, since only particles lying in the boxes/cubes/hypercubes adjacent to the one containing the new particle need to be checked. All the detail has been worked out algorithmically.
However, it seemed like a good idea to use an ndarray which has "dimension" equal to that of the space being studied. Then each "point" in the ndarray would itself contain an array of particle objects. It would be easy to look at the objects in the ndarray existing in coordinates around that of the new particle and cycle through the arrays contained in those which would themselves contain the other particles against which the check must be done. I then found out that ndarray can only contain objects of a fixed size, which these arrays of particles are not since they grow as particles are randomly added to the system.
Would a normal numpy array of array of array (etc.) be the only solution, or do structures similar to ndarray but able to accommodate objects of variable size exist? Ndarray seemed great because it is part of numpy, which is written in the compiled language C, so it would be fast. Furthermore, an ndarray would not require any loops to construct, as I believe an array of arrays of arrays (etc.) would (NB: the dimensionality of the space and the increments of spatial division are not constant, as particles of different radii can be added, meaning a change in the size of the spatial division squares/cubes/hypercubes).
Speed is very important in this program and it would be a shame to see the algorithmically good optimisations I have found be ruined by bad implementation!
Have you considered using a kd-tree instead? kd-trees support fast enumeration of the neighbours of a point by splitting the space (much like you suggested with the multidimensional arrays).
As a nice bonus, there's already a decent kd-tree implementation in SciPy, the companion project to NumPy: scipy.spatial.KDTree.
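As a rough sketch of how the kd-tree suggestion could be applied to the clustering step: the code below assumes all particles share one radius (the question mentions mixed radii, which this does not handle), and the function name `cluster_labels` is made up for illustration. It uses `cKDTree.query_pairs` to find touching particles and `scipy.sparse.csgraph.connected_components` to group them; it works unchanged in any number of dimensions.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components


def cluster_labels(centres, radius):
    """Group equal-radius particles into connected clusters.

    centres: (n_particles, n_dims) array of particle centres.
    radius:  common particle radius (an assumption of this sketch).
    """
    centres = np.asarray(centres, dtype=float)
    n = len(centres)
    tree = cKDTree(centres)
    # Two particles overlap when their centres are at most 2 * radius apart.
    pairs = tree.query_pairs(r=2.0 * radius, output_type='ndarray')
    # Build the contact graph: one edge per overlapping pair.
    graph = coo_matrix((np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])),
                       shape=(n, n))
    n_clusters, labels = connected_components(graph, directed=False)
    return n_clusters, labels


# Three discs in a chain plus one isolated disc.
centres = [[0.0, 0.0], [0.15, 0.0], [0.3, 0.0], [0.9, 0.9]]
n_clusters, labels = cluster_labels(centres, radius=0.1)
```

Checking whether a cluster spans the box would then be a matter of comparing the minimum and maximum coordinates of the particles that share a label against the box boundaries.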
The Problem
I'm a physics graduate research assistant, and I'm trying to build a very ambitious Python code that, boiled down to the basics, evaluates a line integral in a background described by very large arrays of data.
I'm generating large sets of data that are essentially t x n x n arrays, with n and t on the order of 100. They represent the time-varying temperatures and velocities of a 2 dimensional space. I need to collect many of these grids, then randomly choose a dataset and calculate a numerical integral dependent on the grid data along some random path through the plane (essentially 3 separate grids: x-velocity, y-velocity, and temperature, as the vector information is important). The end goal is gross amounts of statistics on the integral values for given datasets.
Visual example of time-varying background.
That essentially entails being able to sample the background at particular points in time and space, say like (t, x, y). The data begins as a big ol' table of points, with each row organized as ['time','xpos','ypos','temp','xvel','yvel'], with an entry for each (xpos, ypos) point in each timestep time, and I can massage it how I like.
The issue is that I need to sample thousands of random paths in many different backgrounds, so time is a big concern. Barring the time required to generate the grids, the biggest holdup is being able to access the data points on the fly.
I previously built a proof of concept of this project in Mathematica, which is much friendlier to the analytic mindset that I'm approaching the project from. In that prototype, I read in a collection of 10 backgrounds and used Mathematica's ListInterpolation[] function to generate a continuous function that represented the discrete grid data. I stored these 10 interpolating functions in an array and randomly called them when calculating my numerical integrals.
This method worked well enough, after some massaging, for 10 datasets, but as the project moves forward that may rapidly expand to, say, 10000 datasets. Eventually, the project is likely to be set up to generate the datasets on the fly, saving each for future analysis if necessary, but that is some time off. By then, it will be running on a large cluster machine that should be able to parallelize a lot of this.
In the meantime, I would like to generate some number of datasets and be able to sample from them at will in whatever is likely to be the fastest process. The data must be interpolated to be continuous, but beyond that I can be very flexible with how I want to do this. My plan was to do something similar to the above, but I was hoping to find a way to generate these interpolating functions for each dataset ahead of time, then save them to some file. The code would then randomly select a background, load its interpolating function, and evaluate the line integral.
Initial Research
While hunting to see if someone else had already asked a similar question, I came across this:
Fast interpolation of grid data
The OP seemed interested in just getting back a tighter grid rather than a callable function, which might be useful to me if all else fails, but the solutions also seemed to rely on methods that are limited by the size of my datasets.
I've been Googling about for interpolation packages that could get at what I want. The only things I've found that seem to fit the bill are:
Scipy griddata()
Scipy interpn()
Numpy interp()
Attempts at a Solution
I have one sample dataset (I would make it available, but it's about 200MB or so), and I'm trying to generate and store an interpolating function for the temperature grid. Even just this step is proving pretty troubling for me, since I'm not very fluent in Python. I found that it was slightly faster to load the data through pandas, cut to the sections I'm interested in, and then stick this in a numpy array.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.interpolate import griddata
# Load grid data from file
gridData = pd.read_fwf('Backgrounds\\viscous_14_moments_evo.dat', header=None, names=['time','xpos','ypos','temp','xvel','yvel'])
# Set grid parameters
# nGridSpaces is total number of grid spaces / bins.
# Will be data-dependent in the future.
nGridSpaces = 27225
# Number of timesteps is gridData's time column divided by number of grid spaces.
NT = int(len(gridData['time'])/nGridSpaces)
From here, I've tried to use Scipy's interpnd() and griddata() functions, to no avail. I believe I'm just not understanding how it wants to take the data. I think that my issue with both is trying to corral the (t, x, y) "points" corresponding to the temperature values into a useable form.
The main thrust of my efforts has been trying to get them into Numpy's meshgrid(), but I believe that maybe I'm hitting the upper limit of the size of data Numpy will take for this sort of thing.
# Lists of points individually
tList=np.ndarray.flatten(pd.DataFrame(gridData[['time']]).to_numpy())
xList=np.ndarray.flatten(pd.DataFrame(gridData[['xpos']]).to_numpy())
yList=np.ndarray.flatten(pd.DataFrame(gridData[['ypos']]).to_numpy())
# 3D grid of points
points = np.meshgrid(tList, xList, yList)
# List of temperature values
tempValues=np.ndarray.flatten(pd.DataFrame(gridData[['temp']]).to_numpy())
# Interpolate and spit out a value for a point somewhere central-ish as a check
point = np.array([1,80,80])
griddata(points, tempValues, point)
This returns a value error on the line calling meshgrid():
ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
The Questions
First off... What limitations are there on the sizes of datasets I'm using? I wasn't able to find anything in Numpy's documentation about maximum sizes.
Next... Does this even sound like a smart plan? Can anyone think of a smarter framework to get where I want to go?
What are the biggest impacts on speed when handling these types of large arrays, and how can I mitigate them?
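One possible direction, offered as a sketch rather than a definitive answer: `np.meshgrid` builds full coordinate arrays whose size is the product of the input lengths, so passing three full-length columns asks for `len(column)**3` elements, which is why it fails. If the rows really do form a complete regular (t, x, y) grid (an assumption here), the value column can instead be reshaped into a 3D array and handed to `scipy.interpolate.RegularGridInterpolator`. The helper name `build_interpolator` and the synthetic data are made up for illustration.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator


def build_interpolator(time, xpos, ypos, values):
    """Turn flat table columns into a callable f([[t, x, y], ...]).

    Assumes every (time, xpos, ypos) combination appears exactly once,
    i.e. the table is a complete regular grid in any row order.
    """
    time, xpos, ypos, values = map(np.asarray, (time, xpos, ypos, values))
    # Sort rows by time, then xpos, then ypos (last key is the primary one).
    order = np.lexsort((ypos, xpos, time))
    t_axis = np.unique(time)
    x_axis = np.unique(xpos)
    y_axis = np.unique(ypos)
    grid = values[order].reshape(len(t_axis), len(x_axis), len(y_axis))
    return RegularGridInterpolator((t_axis, x_axis, y_axis), grid)


# Synthetic stand-in for the real table: a field that is linear in t, x, y.
t = np.linspace(0.0, 1.0, 5)
x = np.linspace(-2.0, 2.0, 9)
y = np.linspace(-2.0, 2.0, 9)
T, X, Y = np.meshgrid(t, x, y, indexing='ij')
rng = np.random.default_rng(0)
shuffle = rng.permutation(T.size)  # row order should not matter
temp = build_interpolator(T.ravel()[shuffle], X.ravel()[shuffle],
                          Y.ravel()[shuffle], (T + 2 * X - Y).ravel()[shuffle])

sample = temp([[0.3, 0.1, -0.7]])  # linear interpolation at (t, x, y)
```

With the real table, the columns could be passed as `gridData['time'].to_numpy()` and so on. The reshaped 3D array can also be stored with `np.save` and the interpolator rebuilt after loading, which may be simpler than saving the interpolating function itself.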
I have a very large data set comprised of (x,y) coordinates. I need to know which of these points are in certain regions of the 2D space. These regions are bounded by 4 lines in the 2D domain (some of the sides are slightly curved).
For smaller datasets I have used a cumbersome for loop to test each individual point for membership of each region. This doesn't seem like a good option any more due to the size of data set.
Is there a better way to do this?
For example:
If I have a set of points:
(0,1)
(1,2)
(3,7)
(1,4)
(7,5)
and a region bounded by the lines:
y=2
y=5
y=5*sqrt(x) +1
x=2
I want to find a way to identify the point (or points) in that region.
Thanks.
The exact code is on another computer but from memory it was something like:
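from math import sqrt

point_list = []
for i in range(num_po):
    a = 5*sqrt(points[i,0]) + 1   # curved boundary y = 5*sqrt(x) + 1
    b = 2                         # boundary x = 2
    c = 2                         # boundary y = 2
    d = 5                         # boundary y = 5
    if (points[i,1] < a) and (points[i,0] < b) and (points[i,1] > c) and (points[i,1] < d):
        point_list.append(points[i])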
This isn't the exact code but should give an idea of what I've tried.
If you have a single (or small number) of regions, then it is going to be hard to do much better than to check every point. The check per point can be fast, particularly if you choose the fastest or most discriminating check first (eg in your example, perhaps, x > 2).
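For a single region, the per-point check can also be written as one NumPy boolean mask instead of a Python loop. This is a minimal sketch using the example points and boundaries from the question; it assumes x is non-negative so that the square root is defined, and the variable names are only illustrative.

```python
import numpy as np

# Example points from the question, one (x, y) pair per row.
points = np.array([(0, 1), (1, 2), (3, 7), (1, 4), (7, 5)], dtype=float)
x, y = points[:, 0], points[:, 1]

# Each comparison is evaluated for all points at once; & combines them.
inside = (x < 2) & (y > 2) & (y < 5) & (y < 5 * np.sqrt(x) + 1)

selected = points[inside]
print(selected)  # → [[1. 4.]]
```

Every point is still tested, but the loop runs in compiled code, which is usually much faster for large arrays.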
If you have many regions, then speed can be gained by using a spatial index (perhaps an R-Tree), which rapidly identifies a small set of candidates that are in the right area. Then each candidate is checked one by one, much as you are checking already. You could choose to index either the points or the regions.
I use the python Rtree package for spatial indexing and find it very effective.
This is called the range searching problem and is a much-studied problem in computational geometry. The topic is rather involved (with your square root making things nonlinear hence more difficult). Here is a nice blog post about using SciPy to do computational geometry in Python.
Long comment:
You are not telling us the whole story.
If you have this big set of points (say N of them) and one set of these curvilinear quadrilaterals (say M of them) and you need to solve the problem once, you cannot avoid exhaustively testing all points against the acceptance area.
Anyway, you can probably preprocess the M regions in such a way that testing a point against the acceptance area takes less than M operations (closer to Log(M)). But due to the small value of M, big savings are unlikely.
Now if you don't just have one acceptance area but many of them to be applied in turn on the same point set, then more sophisticated solutions are possible (namely range searching), that can trade N comparisons to about Log(N) of them, a quite significant improvement.
It may also be that the point set is not completely random and there is some property of the point set that can be exploited.
You should tell us more and show a sample case.
I have two arrays, array A with ~1M lines and array B with ~400K lines. Each contains, among other things, coordinates of a point. For each point in array A, I need to find how many points in array B are within a certain distance of it. How do I avoid naively comparing everything to everything? Based on its speed at the start, running naively would take 10+ days on my machine. That required nested loops, but the arrays are too large to construct a distance matrix (400G entries!)
I thought the way would be to check only a limited set of B coordinates against each A coordinates. However, I haven't determined an easy way of doing that. That is, what's the easiest/quickest way to make a selection that doesn't require checking all the values in B (which is exactly the same task I'm trying to avoid)?
EDIT: I should've mentioned these aren't 2D (or nD) Cartesian, but spherical surface (lat/long), and distance is great-circle distance.
I cannot give a full answer right now, but here are some hints to get you started. It will be much more efficient to organise the points in B in a kd-tree. You can use the class scipy.spatial.KDTree to do this easily, and you can use its query_ball_point() method to request the points within a given distance.
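Since the question's edit says the coordinates are latitude/longitude with great-circle distance, here is one hedged way the kd-tree hint could be adapted: map every point to a 3D unit vector and convert the angular radius to the equivalent straight-line (chord) length, `2 * sin(angle / 2)`. The function names and the choice of degrees as input units are assumptions made for this sketch.

```python
import numpy as np
from scipy.spatial import cKDTree


def to_unit_vectors(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to points on the unit sphere."""
    lat = np.radians(lat_deg)
    lon = np.radians(lon_deg)
    return np.column_stack((np.cos(lat) * np.cos(lon),
                            np.cos(lat) * np.sin(lon),
                            np.sin(lat)))


def count_within(lat_a, lon_a, lat_b, lon_b, max_angle_deg):
    """For each point in A, count points in B within max_angle_deg."""
    # A great-circle angle corresponds to a chord of length 2*sin(angle/2).
    chord = 2.0 * np.sin(np.radians(max_angle_deg) / 2.0)
    tree = cKDTree(to_unit_vectors(lat_b, lon_b))
    hits = tree.query_ball_point(to_unit_vectors(lat_a, lon_a), r=chord)
    return np.array([len(h) for h in hits])


# Two points in A, four in B, search radius of 2 degrees.
counts = count_within([0.0, 45.0], [0.0, 90.0],
                      [0.0, 0.0, 1.0, 45.5], [1.0, 10.0, 0.0, 90.0],
                      max_angle_deg=2.0)
```

To work in kilometres instead, the angular radius would be the distance divided by the sphere's radius (converted from radians to degrees for this function).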
Here is one possible implementation of the cross match between list of points on the sphere using k-d tree.
http://code.google.com/p/astrolibpy/source/browse/my_utils/match_lists.py
Another way is to use healpy module and their get_neighbors method.
I am working on an FEM project using SciPy. My problem is that the assembly of the sparse matrices is too slow. I compute the contribution of every element in small dense matrices (one for each element). To assemble the global matrices I loop over all the small dense matrices and set the matrix entries the following way:
[i,j] = someList[k][l]
Mglobal[i,j] = Mglobal[i,j] + Mlocal[k,l]
Mglobal is a lil_matrix of appropriate size, and someList maps the indexing variables.
Of course this is rather slow and consumes most of the matrix assembly time. Is there a better way to assemble a large sparse matrix from many small dense matrices? I tried scipy.weave but it doesn't seem to work with sparse matrices.
I posted my response to the SciPy mailing list; Stack Overflow is a bit easier to access so I will post it here as well, albeit in a slightly improved version.
The trick is to use the IJV storage format. This is a trio of arrays where the first contains row indices, the second contains column indices, and the third holds the values of the matrix at those locations. This is the best way to build finite element matrices (or any sparse matrix, in my opinion) as access to this format is really fast (just filling in an array).
In SciPy this is called coo_matrix; the class takes the three arrays as an argument. It is really only useful for converting to another format (CSR or CSC) for fast linear algebra.
For finite elements, you can estimate the size of the three arrays by something like
size = number_of_elements * number_of_basis_functions**2
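so if you have 2D quadratics you would do number_of_elements * 36, for example. This approach is convenient because if you have local matrices you definitely have the global numbers and entry values: exactly what you need for building the three IJV arrays. Entries that share the same (row, column) pair are summed when the COO matrix is converted to CSR or CSC, which is exactly the accumulation that assembly needs. Overestimating the size is fine: unused slots can be left as zeros, which do not change the matrix, and explicitly stored zeros can be dropped afterwards with eliminate_zeros().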
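A minimal sketch of the IJV assembly described above, using a toy 1D mesh of two-node elements that all share the same local matrix. The mesh, the local matrix, and names such as `connectivity` are illustrative assumptions, not taken from the question.

```python
import numpy as np
from scipy.sparse import coo_matrix

n_elements = 3
n_basis = 2                      # basis functions (nodes) per element
n_nodes = n_elements + 1

# connectivity[e] lists the global node numbers of element e.
connectivity = np.array([[e, e + 1] for e in range(n_elements)])
local = np.array([[1.0, -1.0],
                  [-1.0, 1.0]])  # same local matrix for every element

# Preallocate the three IJV arrays.
size = n_elements * n_basis**2
rows = np.empty(size, dtype=np.int64)
cols = np.empty(size, dtype=np.int64)
vals = np.empty(size)

pos = 0
for nodes in connectivity:
    for k in range(n_basis):
        for l in range(n_basis):
            rows[pos] = nodes[k]
            cols[pos] = nodes[l]
            vals[pos] = local[k, l]
            pos += 1

# Duplicate (row, col) entries are summed during the conversion to CSR.
K = coo_matrix((vals, (rows, cols)), shape=(n_nodes, n_nodes)).tocsr()
```

The interior diagonal entries of `K` come out as 2 because each interior node receives a contribution from two elements. The fill loop could itself be vectorized with `np.repeat` and `np.tile` on the connectivity array, but the explicit loop keeps the mapping to the question's `Mlocal[k,l]` visible.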
I'm porting a C++ scientific application to Python, and as I'm new to Python, some questions come to mind:
1) I'm defining a class that will contain the coordinates (x,y). These values will be accessed many times, but they will only be read after the class is instantiated. Is it better to use a tuple or a numpy array, in terms of both memory and access time?
2) In some cases, these coordinates will be used to build a complex number, which is passed to a complex function, and the real part of the result will be used. Assuming that there is no way to separate the real and imaginary parts of this function, and that the real part will have to be used in the end, is it perhaps better to store (x,y) directly as a complex number? How bad is the overhead of converting from complex to real in Python? The C++ code does a lot of these conversions, and they are a big slowdown in that code.
3) Some coordinate transformations will also have to be performed: the x and y values are accessed separately, the transformation is done, and the result is returned. The coordinate transformations are defined in the complex plane, so is it still faster to use the components x and y directly than to rely on complex variables?
Thank you
In terms of memory consumption, numpy arrays are more compact than Python tuples.
A numpy array uses a single contiguous block of memory. All elements of the numpy array must be of a declared type (e.g. 32-bit or 64-bit float.) A Python tuple does not necessarily use a contiguous block of memory, and the elements of the tuple can be arbitrary Python objects, which generally consume more memory than numpy numeric types.
So this issue is a hands-down win for numpy, (assuming the elements of the array can be stored as a numpy numeric type).
On the issue of speed, I think the choice boils down to the question, "Can you vectorize your code?"
That is, can you express your calculations as operations done on entire arrays element-wise.
If the code can be vectorized, then numpy will most likely be faster than Python tuples. (The only case I could imagine where it might not be, is if you had many very small tuples. In this case the overhead of forming the numpy arrays and one-time cost of importing numpy might drown-out the benefit of vectorization.)
An example of code that could not be vectorized would be if your calculation involved looking at, say, the first complex number in an array z, doing a calculation which produces an integer index idx, then retrieving z[idx], doing a calculation on that number, which produces the next index idx2, then retrieving z[idx2], etc. This type of calculation might not be vectorizable. In this case, you might as well use Python tuples, since you won't be able to leverage numpy's strength.
I wouldn't worry about the speed of accessing the real/imaginary parts of a complex number. My guess is the issue of vectorization will most likely determine which method is faster. (Though, by the way, numpy can transform an array of complex numbers to their real parts simply by striding over the complex array, skipping every other float, and viewing the result as floats. Moreover, the syntax is dead simple: If z is a complex numpy array, then z.real is the real parts as a float numpy array. This should be far faster than the pure Python approach of using a list comprehension of attribute lookups: [z.real for z in zlist].)
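To make the point about `z.real` concrete, here is a small sketch. The coordinate values and the choice of `np.exp` as the complex function are arbitrary placeholders, not something from the question.

```python
import numpy as np

# Store the (x, y) coordinates of many points as one complex array.
x = np.array([1.0, 3.0, 0.5])
y = np.array([2.0, -4.0, 0.0])
z = x + 1j * y

# Vectorized evaluation of a complex function, then its real part.
w = np.exp(z)
real_part = w.real

# z.real is a view onto the same memory, not a copy.
shares = np.shares_memory(z, z.real)
```

Because `z.real` and `z.imag` are views, reading the components separately for a coordinate transformation does not copy the data.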
Just out of curiosity, what is your reason for porting the C++ code to Python?
A numpy array with an extra dimension is tighter in memory use, and at least as fast, as a numpy array of tuples; complex numbers are at least as good or even better, including for your third question. BTW, you may have noticed that -- while questions asked later than yours were getting answers aplenty -- yours was lying fallow: part of the reason is no doubt that asking three questions within a question turns responders off. Why not just ask one question per question? It's not as if you get charged for questions or anything, you know...!-)