Mapping arrays with same values but different orders - python

I have two arrays of coordinates from two separate files from a CFD calculation. One is a mesh file which contains the connectivity information and the other is the results file.
My problem is that the coordinates from each file are not in the same order. What I would like to be able to do is order ALL the arrays from the results file to be in the same order as the mesh file.
My idea would be to find the matching values of xyz coordinates and create a mapping such that the rest of the result arrays can be ordered.
I was thinking something like:
mapping = np.empty(len(co_mesh))
for i, coord in enumerate(co_mesh):
    for j in range(len(co_res)):
        if (coord[0] == co_res[j, 0]) and (coord[1] == co_res[j, 1]) and (coord[2] == co_res[j, 2]):
            mapping[i] = j
where co_mesh, co_res are arrays containing the x,y,z coords.
The problem is that I suspect this loop will take a long time. At the moment I'm only looping over around 70000 points but in future this could increase to 1 million or more.
Is there a faster way to write this in Python?
I'm using Python 2.6.5.
Ben
For those who are interested this is what I am currently using:
mesh_coords = zip(xm_list, ym_list, zm_list, range(len(x_po)))
res_coords = zip(xr_list, yr_list, zr_list, range(len(x)))
mesh_coords = sorted(mesh_coords, key=lambda c: (c[0], c[1], c[2]))
res_coords = sorted(res_coords, key=lambda c: (c[0], c[1], c[2]))
mapping = zip(np.array(mesh_coords)[:, -1], np.array(res_coords)[:, -1])
mapping = sorted(mapping, key=lambda c: c[0])

How about sorting the coordinate vectors in both files along x, then y, then z?
You can do this efficiently and quickly if you use numpy arrays for the vectors.
Update:
If you don't have the node ids of the nodes in the result mesh, but the coordinates are identical, do the following:
Add a numbering as additional information to your vectors. Sort both meshes by x, y, z, attach the (now unsorted) numbering of the mesh to the result mesh, and sort the result mesh along that numbering. The result mesh is then in exactly the same order as the original mesh.
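To make this concrete, here is a minimal sketch using numpy.lexsort, assuming co_mesh and co_res are (n, 3) numpy arrays that contain exactly the same points (bit-identical coordinates, as the question assumes) in different orders; the random test data is only illustrative:
import numpy as np

# illustrative data: the same points in two different orders
co_mesh = np.random.rand(1000, 3)
co_res = co_mesh[np.random.permutation(len(co_mesh))]

# lexsort uses the last key as the primary key, so pass (z, y, x) to sort by x, then y, then z
order_mesh = np.lexsort((co_mesh[:, 2], co_mesh[:, 1], co_mesh[:, 0]))
order_res = np.lexsort((co_res[:, 2], co_res[:, 1], co_res[:, 0]))

# mapping[i] is the row of co_res that holds the same point as row i of co_mesh
mapping = np.empty(len(co_mesh), dtype=int)
mapping[order_mesh] = order_res

assert np.allclose(co_mesh, co_res[mapping])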

Related

Faster way to find indices in an Array using points which get from two arrays combination in Python

I have two arrays, A and B, which contain instances from DATA. These two arrays are then used to index into another array called Distance.
I need a fast way to:
find all point combinations between A and B,
look up the distance of each combination in Distance
For example:
DATA = [0,1,...100]
A = [0,1,2]
B = [6,7,8]
Distance = [100x100] # contains the pairwise distance of all instances from DATA
# need a function to combine A and B
points_combination=[[0,6],[0,7],[0,8],[1,6],[1,7],[1,8],[2,6],[2,7],[2,8]]
# need a function to refer points_combination with Distance, so that I can get this results
distance_points=[0.346, 0.270, 0.314, 0.339, 0.241, 0.283, 0.304, 0.294, 0.254]
I have already tried to solve it myself, but it is very slow when dealing with large data.
Here's the code I tried:
import numpy as np

def function(pair_distances, k, clusters):
    list_distance = []
    cluster_qty = k
    for cluster_id in range(cluster_qty):
        all_clusters = clusters[:]  # List of all instances ID on their own cluster
        in_cluster = all_clusters.pop(cluster_id)  # List of instances ID inside the cluster
        not_in_cluster = all_clusters  # List of instances ID outside the cluster
        # combine A and B array into a points to refer to Distance array
        list_dist_id = np.array(np.meshgrid(in_cluster, np.concatenate(not_in_cluster))).T.reshape(-1, 2)
        temp_dist = 9999999
        for instance in range(len(list_dist_id)):
            # basically refer the distance value from the pair_distances array
            temp_dist = min(temp_dist, pair_distances[list_dist_id[instance][0], list_dist_id[instance][1]])
        list_distance.append(temp_dist)
    return list_distance
Notice that the nested loop is what makes this so time-consuming.
This is my first time asking in this forum, so please let me know if you need more information.
The first part (points_combination) is extensively covered in this post already:
Cartesian product of x and y array points into single array of 2D points
The second part (distance_points): the algorithm linking points_combination to distance_points does not seem to be provided. It would be helpful if you could supply small sample data sets showing how to go from your data sets to distance_points.
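For what it's worth, here is a hedged sketch of how both parts could be vectorized with numpy, using the small example arrays from the question and a random stand-in for the real Distance matrix:
import numpy as np

A = np.array([0, 1, 2])
B = np.array([6, 7, 8])
Distance = np.random.rand(100, 100)  # stand-in for the real 100x100 pairwise distance matrix

# Cartesian product of A and B as an (len(A)*len(B), 2) array of index pairs
points_combination = np.array(np.meshgrid(A, B)).T.reshape(-1, 2)

# fancy indexing looks up every pair's distance in one vectorized step
distance_points = Distance[points_combination[:, 0], points_combination[:, 1]]

# the minimum distance for the group, without the inner Python loop
print(distance_points.min())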

Why is my dict being overwritten in this loop in python?

I have a dict, coords_dict, in a strange format, which is currently being used to store a set of Cartesian coordinate points (x, y, z). The structure of the dict (which is unfortunately out of my control) is as follows.
The keys of the dict are a series of z values of a plane, and each entry consists of a single-element list, which itself is a list of lists containing the coordinate points. For example, two elements in the dict can be specified as
coords_dict['3.5']=[[[1.62,2.22,3.50],[4.54,5.24,3.50]]]
coords_dict['5.0']=[[[0.33,6.74,5.00],[2.54,12.64,5.00]]]
So, I now want to apply a translational shift to all coordinate points in this dict by some shift vector [-1,-1,-1], i.e. I want all x, y, and z coordinates to be 1 less than they were before (rounded to 2 decimal places). I want to assign the result of this translation to a new dictionary, coords_dict_translated, while also updating the dict keys to match the new z locations of all points.
My attempt at a solution is below
import numpy as np

shift_vector = [-1, -1, -1]
coords_dict_translated = {}
for key, plane in coords_dict.items():  # iterate over dictionary, keys represent each plane
    key = str(float(key) + shift_vector[2])  # the new key should match the z location
    #print(key)
    for point_index in range(0, len(plane[0])):  # loop over points in this plane
        plane[0][point_index] = list(np.around(np.array(plane[0][point_index])
                                               + np.array(shift_vector), decimals=2))  # add shift vector to all points
    coords_dict_translated[key] = plane
However, I notice that if I do this, the original values of coords_dict also change. I want coords_dict to stay the same and to get back a completely new and entirely separate dict. I am not quite sure where the issue lies; I have tried using for key,plane in list(coords_dict.items()): as well, but this did not work. Why does this loop change the values of the original dictionary?
When you iterate over the dictionary in the for loop, plane is just a reference to the list objects stored inside coords_dict:
for key,plane in coords_dict.items(): #iterate over dictionary, k are keys representing each plane
If you don't want to change the original items, make a copy and modify that instead of modifying plane directly:
import copy

for key, plane in coords_dict.items():
    key = str(float(key) + shift_vector[2])  # the new key should match the z location
    #print(key)
    c = copy.deepcopy(plane)  # deep copy, so the nested lists in coords_dict stay untouched
    for point_index in range(0, len(plane[0])):  # loop over points in this plane
        c[0][point_index] = list(np.around(np.array(plane[0][point_index])
                                           + np.array(shift_vector), decimals=2))  # add shift vector to all points
    coords_dict_translated[key] = c
The most likely issue here is that you have a list that is referenced from two different variables. This can happen even when using .copy(), because you have a nested structure (as you do here).
If this is the problem, you can overcome it by making sure you take a (deep) copy of the lists you want to update independently. copy.deepcopy recursively makes copies of lists within lists etc., which avoids double references to the lower-level lists.
(comment made into answer).
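As a quick illustration of the aliasing problem (with made-up numbers): a shallow copy such as plane[:] still shares the inner lists with the original, while copy.deepcopy duplicates every level:
import copy

plane = [[[1.62, 2.22, 3.50], [4.54, 5.24, 3.50]]]

shallow = plane[:]             # new outer list, but the nested lists are still shared
deep = copy.deepcopy(plane)    # every level is duplicated

shallow[0][0][0] = 99.0
print(plane[0][0][0])          # 99.0 -- the original was changed through the shallow copy
print(deep[0][0][0])           # 1.62 -- the deep copy is independent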

Most Efficient Way to Automate Grouping of List Entries

Background: I have a very large list of 3D Cartesian coordinates, and I need to process this list to group the coordinates by their Z coordinate (i.e. all coordinates in that plane). Currently, I manually create groups from the list using a loop for each Z coordinate, but with dozens of possible Z coordinates (I was previously handling only 2-3 planes) this becomes impractical. I know how to group lists based on like elements, of course, but I am looking for a method to automate this process for n possible values of Z.
Question: What's the most efficient way to automate the process of grouping list elements of the same Z coordinate and then create a unique list for each plane?
Code Snippet:
I'm just using a simple list comprehension to group individual planes:
newlist=[x for x in coordinates_xyz if insert_possible_Z in x]
I'm looking for it to automatically make a new unique list for every Z plane in the data set.
Data Format:
((x1,y1,0), (x2,y2,0), ... (xn,yn,0), (xn+1,yn+1,50), (xn+2,yn+2,50), ... (x2n+1,y2n+1,100), (x2n+2,y2n+2,100)) etc.
I want to automatically get all coordinates where Z=0, Z=50, Z=100, etc. Note that the Z values shown (increments of 50) are an example only; the actual data can have any value.
Notes: My data is imported either from a file or generated in lists by a separate module. This is necessary for the interface with another program (that I have not written).
The most efficient way to group elements by Z and make a list of them so grouped is to not make a list.
itertools.groupby does the grouping you want without the overhead of creating new lists.
Python generators take a little getting used to when you aren't familiar with the general mechanism. The official generator documentation is a good starting point for learning why they are useful.
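A minimal sketch of that approach (the coordinate tuples here are made up, and note that groupby only merges adjacent elements, so the data must be sorted by Z first if it is not already grouped):
from itertools import groupby
from operator import itemgetter

coordinates_xyz = [(1, 2, 0), (3, 4, 0), (5, 6, 50), (7, 8, 50), (9, 10, 100)]

for z, points in groupby(sorted(coordinates_xyz, key=itemgetter(2)), key=itemgetter(2)):
    plane = list(points)   # only materialize a list if you actually need one
    print(z, plane)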
If I am interpreting this correctly, you have a set of coordinates C = (X,Y,Z) with a discrete number of Z values. If this is the case, why not use a dictionary that associates a list of coordinates with each Z value as its key?
Your data structure would look something like:
z_ordered = {}
z_ordered[3] = [(x1,y1,z1),(x2,y2,z2),(x3,y3,z3)]
Where each list associated with a key has the same Z-value.
Of course, if your Z-values are continuous, you may need to modify this, say by making the key only the whole number associated with a Z-value, so you are binning in increments of 1.
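A hedged sketch of that dictionary idea using collections.defaultdict, assuming coordinates_xyz is the flat sequence of (x, y, z) tuples from the question:
from collections import defaultdict

z_ordered = defaultdict(list)
for x, y, z in coordinates_xyz:
    z_ordered[z].append((x, y, z))   # one list of points per distinct Z value

# z_ordered[0] now holds every point in the Z=0 plane, z_ordered[50] the Z=50 plane, etc.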
So this is the simple solution I came up with:
groups = []
No_Planes = ...  # number of planes
dz = ...         # Z spacing variable here
for i in range(No_Planes):
    newlist = [x for x in coordinates_xyz if i * dz in x]
    groups.append(newlist)
This lets me manipulate any plane within my data set simply with groups[i], and I can also change the spacing. It is also an extension of my existing code: as I realised after reading #msw's response about itertools, looping over my current method was staring me in the face, and it is far simpler than I imagined!

GPS co-ordinate search --R-trees

I have a list of lists in the form of
[ [x1,.....,x8], [x1,.......,x8], ..., [x1,.....,x8] ]. The number of lists in that list can go up to a million. Each list has 4 GPS co-ordinates which give the four corner points of a rectangle (it is assumed that each segment is in the form of a rectangle).
Problem: Given a new point, I need to determine which segment the point falls in, and create a new one if it falls in none of them. I am not uploading the data into MySQL as of now; it comes in as a simple text file. I find out the co-ordinates from the text file for any given car.
What I tried: I am thinking of using R-trees to find all points which are near to the given point (near == 200 meters maximum). But even among R-trees there seem to be too many options: R, R*, Hilbert.
Q1. Which one should be opted for?
Q2. Is there a better option than R-trees? Can something be done by searching faster within the list?
Thanks a lot.
[ {a1:[........]},{a2:[.......]},{a3:[.........]},.... ,{a20:[.....]}] .
Isn't the problem "find whether a given point falls within a certain rectangle in 2D space"?
That could be separated dimensionally, couldn't it? Give each rectangle an ID, then separate into lists of one-dimensional ranges ((id, x0, x1), (id, y0, y1)) and find all the ranges in both dimensions the point falls in. (I'm fairly sure there are very efficient algorithms for this. Heck, you could even leverage, say, sqlite already.) Then just intersect the ID sets you get and you should find all rectangles the point falls in, if any. (Of course you can exit early if either of the single dimensional queries returns no result.)
Not sure if this'd be faster or smarter than R-trees or other spatial indexes though. Hope this helps anyway.
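A minimal brute-force sketch of the dimension-splitting intersection idea, with hypothetical rectangles stored as (id, x0, x1, y0, y1); a point lies in a rectangle exactly when it lies in both its x-range and its y-range, so the two ID sets are intersected (an indexed version would replace the linear scans with sorted range lookups):
rects = [(0, 0.0, 10.0, 0.0, 5.0), (1, 8.0, 20.0, 3.0, 9.0)]

def rectangles_containing(px, py, rects):
    x_hits = set(rid for rid, x0, x1, y0, y1 in rects if x0 <= px <= x1)
    y_hits = set(rid for rid, x0, x1, y0, y1 in rects if y0 <= py <= y1)
    return x_hits & y_hits

print(rectangles_containing(9.0, 4.0, rects))   # both example rectangles contain this point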
import random as ra

# my_data will hold tuples of gps readings under the key of (row, col).
# Knowing that the size of the row and col is 10, it will give an overall
# grid coverage. Another dict could translate row/col coordinates into
# some more useful region names.
my_data = {}

def get_region(x, y, region_size=10):
    """Build a tuple of row/col based on the values provided and the region
    square dimension. It's for demonstration only and it uses a rather naive
    calculation of coordinate / grid cell size."""
    row = int(x / region_size)
    col = int(y / region_size)
    return (row, col)

# make some examples and build my_data
for loop in range(10000):
    # simulate some readings
    x = ra.choice(range(100))
    y = ra.choice(range(100))
    my_coord = get_region(x, y)
    if my_data.get(my_coord):
        my_data[my_coord].append((x, y))
    else:
        my_data[my_coord] = [(x, y)]

print my_data
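As a possible follow-up (not part of the original answer): once my_data is bucketed this way, a query for a new reading only needs to look at its own grid cell and, if the search radius can cross a cell border, the neighbouring cells. A sketch using the get_region and my_data defined above:
def nearby_points(x, y, data, region_size=10):
    row, col = get_region(x, y, region_size)
    candidates = []
    for r in (row - 1, row, row + 1):
        for c in (col - 1, col, col + 1):
            candidates.extend(data.get((r, c), []))
    return candidates

print nearby_points(42, 17, my_data)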

numpy arrays: filling and extracting data quickly

See important clarification at bottom of this question.
I am using numpy to speed up some processing of longitude/latitude coordinates. Unfortunately, my numpy "optimizations" made my code run about 5x more slowly than it ran without using numpy.
The bottleneck seems to be in filling the numpy array with my data, and then extracting out that data after I have done the mathematical transformations. To fill the array I basically have a loop like:
point_list = GetMyPoints()  # returns a long list of (lon, lat) coordinate pairs
n = len(point_list)
point_buffer = numpy.empty((n, 2), numpy.float32)
for point_index in xrange(0, n):
    point_buffer[point_index] = point_list[point_index]
That loop, just filling in the numpy array before even operating on it, is extremely slow, much slower than the entire computation was without numpy. (That is, it's not just the slowness of the python loop itself, but apparently some huge overhead in actually transferring each small block of data from python to numpy.) There is similar slowness on the other end; after I have processed the numpy arrays, I access each modified coordinate pair in a loop, again as
some_python_tuple = point_buffer[ index ]
Again that loop to pull the data out is much slower than the entire original computation without numpy. So, how do I actually fill the numpy array and extract data from the numpy array in a way that doesn't defeat the purpose of using numpy in the first place?
I am reading the data from a shape file using a C library that hands me the data as a regular python list. I understand that if the library handed me the coordinates already in a numpy array there would be no "filling" of the numpy array necessary. But unfortunately the starting point for me with the data is as a regular python list. And more to the point, in general I want to understand how you quickly fill a numpy array with data from within python.
Clarification
The loop shown above is actually oversimplified. I wrote it that way in this question because I wanted to focus on the problem I was seeing of trying to fill a numpy array slowly in a loop. I now understand that doing that is just slow.
In my actual application what I have is a shape file of coordinate points, and I have an API to retrieve the points for a given object. There are something like 200,000 objects. So I repeatedly call a function GetShapeCoords( i ) to get the coords for object i. This returns a list of lists, where each sublist is a list of lon/lat pairs, and the reason it's a list of lists is that some of the objects are multi-part (i.e., multi-polygon). Then, in my original code, as I read in each object's points, I was doing a transformation on each point by calling a regular python function, and then plotting the transformed points using PIL. The whole thing took about 20 seconds to draw all 200,000 polygons. Not terrible, but much room for improvement. I noticed that at least half of those 20 seconds were spent doing the transformation logic, so I thought I'd do that in numpy. And my original implementation was just to read in the objects one at a time, and keep appending all the points from the sublists into one big numpy array, which I then could do the math stuff on in numpy.
So, I now understand that simply passing a whole python list to numpy is the right way to set up a big array. But in my case I only read one object at a time. So one thing I could do is keep appending points together in a big python list of lists of lists. And then when I've compiled some large number of objects' points in this way (say, 10000 objects), I could simply assign that monster list to numpy.
So my question now is three parts:
(a) Is it true that numpy can take that big, irregularly shaped, list of lists of lists, and slurp it okay and quickly?
(b) I then want to be able to transform all the points in the leaves of that monster tree. What is the expression to get numpy to, for instance, "go into each sublist, and then into each subsublist, and then for each coordinate pair you find in those subsublists multiply the first (lon coordinate) by 0.5"? Can I do that?
(c) Finally, I need to get those transformed coordinates back out in order to plot them.
Winston's answer below seems to give some hint at how I might do this all using itertools. What I want to do is pretty much like what Winston does, flattening the list out. But I can't quite just flatten it out. When I go to draw the data, I need to be able to know when one polygon stops and the next starts. So, I think I could make it work if there were a way to quickly mark the end of each polygon (i.e., each subsublist) with a special coordinate pair like (-1000, -1000) or something like that. Then I could flatten with itertools as in Winston's answer, and then do the transforms in numpy. Then I need to actually draw from point to point using PIL, and here I think I'd need to reassign the modified numpy array back to a python list, and then iterate through that list in a regular python loop to do the drawing. Does that seem like my best option short of just writing a C module to handle all the reading and drawing for me in one step?
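For what it's worth, here is a rough sketch of the sentinel idea described above, with made-up data and a made-up sentinel value; the answers below show a cleaner index-based alternative:
import numpy as np

SENTINEL = (-1000.0, -1000.0)            # assumed never to occur as a real coordinate

objects = [[[(0.0, 0.0), (1.0, 2.0)], [(3.0, 4.0), (5.0, 6.0)]]]   # list of lists of lists

flat = []
for obj in objects:
    for poly in obj:
        flat.extend(poly)
        flat.append(SENTINEL)            # mark the end of each polygon

data = np.array(flat, dtype=np.float32)
real = ~np.all(data == SENTINEL, axis=1) # rows that are actual points, not markers

data[real, 0] *= 0.5                     # transform only the real points

# when drawing, start a new polygon each time a sentinel row is reached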
You describe your data as being "lists of lists of lists of coordinates". From this I'm guessing your extraction looks like this:
for x in points:
    for y in x:
        for z in y:
            # z is a tuple with GPS coordinates
Do this:
# initially, points is a list of lists of lists
points = itertools.chain.from_iterable(points)
# now points is an iterable producing lists
points = itertools.chain.from_iterable(points)
# now points is an iterable producing coordinates
points = itertools.chain.from_iterable(points)
# now points is an iterable producing individual floating points values
data = numpy.fromiter(points, float)
# data is a numpy array containing all the coordinates
data = data.reshape( data.size/2,2)
# data has now been reshaped to be an nx2 array
itertools and numpy.fromiter are both implemented in C and really efficient. As a result, this should do the transformation very quickly.
The second part of your question doesn't really indicate what you want to do with the data. Indexing a numpy array is slower than indexing Python lists; you get speed by performing operations on the data en masse. Without knowing more about what you are doing with that data, it's hard to suggest how to fix it.
UPDATE:
I've gone ahead and done everything using itertools and numpy. I am not responsible for any brain damage resulting from attempting to understand this code.
# firstly, we use imap to call GetMyPoints a bunch of times
objects = itertools.imap(GetMyPoints, xrange(100))
# next, we use itertools.chain to flatten it into all of the polygons
polygons = itertools.chain.from_iterable(objects)
# tee gives us two iterators over the polygons
polygons_a, polygons_b = itertools.tee(polygons)
# the lengths will be the length of each polygon
polygon_lengths = itertools.imap(len, polygons_a)
# for the actual points, we'll flatten the polygons into points
points = itertools.chain.from_iterable(polygons_b)
# then we'll flatten the points into values
values = itertools.chain.from_iterable(points)
# package all of that into a numpy array
all_points = numpy.fromiter(values, float)
# reshape the numpy array so we have two values for each coordinate
all_points = all_points.reshape(all_points.size // 2, 2)
# produce an iterator of lengths, but put a zero in front
polygon_positions = itertools.chain([0], polygon_lengths)
# produce another numpy array from this
# however, we take the cumulative sum
# so that each index will be the starting index of a polygon
polygon_positions = numpy.cumsum( numpy.fromiter(polygon_positions, int) )
# now for the transformation
# multiply the first coordinate of every point by *.5
all_points[:,0] *= .5
# now to get it out
# polygon_positions is all of the starting positions
# polygon_positions[1:] is the same, but shifted one forward,
# thus it gives us the end of each slice
# slice makes these all slice objects
slices = itertools.starmap(slice, itertools.izip(polygon_positions, polygon_positions[1:]))
# polygons produces an iterator which uses the slices to fetch
# each polygon
polygons = itertools.imap(all_points.__getitem__, slices)
# just iterate over the polygon normally
# each one will be a slice of the numpy array
for polygon in polygons:
    draw_polygon(polygon)
You might find it best to deal with a single polygon at a time. Convert each polygon into a numpy array and do the vector operations on that. You'll probably get a significant speed advantage just doing that. Putting all of your data into numpy might be a little difficult.
This is more difficult than most numpy work because of your oddly shaped data. Numpy pretty much assumes a world of uniformly shaped data.
The point of using numpy arrays is to avoid for loops as much as possible. Writing for loops yourself results in slow code, but with numpy arrays you can use predefined vectorized functions which are much faster (and easier!).
So for the conversion of a list to an array you can use:
point_buffer = np.array(point_list)
If the list contains elements like (lat, lon), then this will be converted to an array with two columns.
With that numpy array you can easily manipulate all elements at once. For example, to multiply the first element of each coordinate pair by 0.5 as in your question, you can do simply (assuming that the first elements are eg in the first column):
point_buffer[:, 0] *= 0.5
This will be faster:
numpy.array(point_buffer, dtype=numpy.float32)
Modify the array, not the list. It would obviously be better to avoid creating the list in the first place, if possible.
Edit 1: profiling
Here is some test code that demonstrates just how efficiently numpy converts lists to arrays (it's good), and that my list-to-buffer idea is only comparable to what numpy does, not better.
import timeit
setup = '''
import numpy
import itertools
import struct
big_list = numpy.random.random((10000,2)).tolist()'''
old_way = '''
a = numpy.empty(( len(big_list), 2), numpy.float32)
for i,e in enumerate(big_list):
    a[i] = e
'''
normal_way = '''
a = numpy.array(big_list, dtype=numpy.float32)
'''
iter_way = '''
chain = itertools.chain.from_iterable(big_list)
a = numpy.fromiter(chain, dtype=numpy.float32)
'''
my_way = '''
chain = itertools.chain.from_iterable(big_list)
buffer = struct.pack('f'*len(big_list)*2,*chain)
a = numpy.frombuffer(buffer, numpy.float32)
'''
for way in [old_way, normal_way, iter_way, my_way]:
    print timeit.Timer(way, setup).timeit(1)
results:
0.22445492374
0.00450378469941
0.00523579114088
0.00451488946237
Edit 2: Regarding the hierarchical nature of the data
If I understand correctly that the data is always a list of lists of lists (object - polygon - coordinate), then this is the approach I'd take: reduce the data to the lowest dimension that creates a square array (2D in this case) and track the indices of the higher-level branches with a separate array. This is essentially an implementation of Winston's idea of using numpy.fromiter with an itertools chain object. The only added idea is the branch indexing.
import numpy, itertools

# hierarchical list of lists of coord pairs
polys = [numpy.random.random((n, 2)).tolist() for n in [5, 7, 12, 6]]

# get the indices of the polygons:
lengs = numpy.array([0] + [len(l) for l in polys])
p_idxs = numpy.add.accumulate(lengs)

# convert the flattened list to an array:
chain = itertools.chain.from_iterable
a = numpy.fromiter(chain(chain(polys)), dtype=numpy.float32).reshape(lengs.sum(), 2)

# transform the coords
a *= .5

# get a transformed polygon (using the indices)
def get_poly(n):
    i0 = p_idxs[n]
    i1 = p_idxs[n + 1]
    return a[i0:i1]

print 'poly2', get_poly(2)
print 'poly0', get_poly(0)
