Most Efficient Way to Automate Grouping of List Entries

Background: I have a very large list of 3D Cartesian coordinates, and I need to process this list to group the coordinates by their Z coordinate (i.e. all coordinates in the same plane). Currently, I manually create groups from the list using a loop for each Z coordinate, but now that there are dozens of possible Z coordinates (I was previously handling only 2-3 planes) this becomes impractical. I know how to group lists based on like elements, of course, but I am looking for a method to automate this process for n possible values of Z.
Question: What's the most efficient way to automate the process of grouping list elements with the same Z coordinate and then create a unique list for each plane?
Code Snippet:
I'm just using a simple list comprehension to group individual planes:
newlist=[x for x in coordinates_xyz if insert_possible_Z in x]
I'm looking for it to automatically make a new unique list for every Z plane in the data set.
Data Format:
((x1,y1,0), (x2,y2,0), ... (xn,yn,0), (xn+1,yn+1,50), (xn+2,yn+2,50), ... (x2n+1,y2n+1,100), (x2n+2,y2n+2,100), ...)
I want to automatically get all coordinates where Z=0, Z=50, Z=100, etc. Note that the value of Z (increments of 50) is an example only; the actual data can have any value.
Notes: My data is imported either from a file or generated by a separate module in lists. This is necessary to interface with another program (that I have not written).

The most efficient way to group elements by Z and make a list of them so grouped is to not make a list.
itertools.groupby does the grouping you want without the overhead of creating new lists, provided the data is already sorted by Z: groupby only merges consecutive items that share a key.
Python generators take a little getting used to when you aren't familiar with the general mechanism. The official generator documentation is a good starting point for learning why they are useful.
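As a minimal sketch of the groupby idea, assuming coordinates_xyz is a sequence of (x, y, z) tuples as described in the question (the sample values below are made up for illustration): the data is sorted by Z first, because groupby only groups consecutive items, and each group is then collected into a dict keyed by Z.

```python
from itertools import groupby
from operator import itemgetter

# Made-up sample data in the question's format: (x, y, z) tuples.
coordinates_xyz = ((1, 2, 0), (3, 4, 50), (5, 6, 0), (7, 8, 100), (9, 10, 50))

get_z = itemgetter(2)  # picks the Z component of a point

# groupby only groups *consecutive* items, so sort by Z first.
planes = {
    z: list(points)
    for z, points in groupby(sorted(coordinates_xyz, key=get_z), key=get_z)
}

for z, points in planes.items():
    print(z, points)
```

If the lists themselves are not needed, the groups can be consumed one at a time inside the loop over groupby instead of being stored in planes.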

If I am interpreting this correctly, you have a set of coordinates C = (X,Y,Z) with a discrete number of Z values. If this is the case, why not use a dictionary to associate a list of the coordinates with the associated Z value as a key?
Your data structure would look something like:
z_ordered = {}
z_ordered[3] = [(x1,y1,z1),(x2,y2,z2),(x3,y3,z3)]
Where each list associated with a key has the same Z-value.
Of course, if your Z-values are continuous, you may need to modify this, say by making the key only the whole number associated with a Z-value, so you are binning in increments of 1.
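A small sketch of how such a dictionary could be filled automatically. The helper name group_by_z and its z_key parameter are hypothetical, chosen only for this illustration; z_key turns a raw Z value into the dictionary key, which also covers the binning idea for continuous Z values.

```python
import math
from collections import defaultdict

def group_by_z(points, z_key=lambda z: z):
    """Group (x, y, z) points into a dict keyed by z_key(z)."""
    groups = defaultdict(list)
    for point in points:
        groups[z_key(point[2])].append(point)
    return dict(groups)

# Discrete planes: the key is the Z value itself (made-up sample data).
coordinates_xyz = [(1, 2, 0), (3, 4, 50), (5, 6, 0), (7, 8, 100)]
z_ordered = group_by_z(coordinates_xyz)
print(z_ordered[0])    # all points in the Z = 0 plane

# Continuous Z: bin in increments of 1 by flooring the Z value.
binned = group_by_z([(0, 0, 0.2), (1, 1, 0.9), (2, 2, 1.5)], z_key=math.floor)
print(binned)
```

This makes a single pass over the data regardless of how many planes there are, and it does not require the input to be sorted.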

So this is the simple solution I came up with:
groups=[]
No_Planes=#Number of planes
dz=#Z spacing variable here
for i in range(No_Planes):
    #compare the Z component only; "i*dz in x" would also match an x or y equal to i*dz
    newlist=[x for x in coordinates_xyz if x[2]==i*dz]
    groups.append(newlist)
This lets me manipulate any plane within my data set simply with groups[i]. I can also change my spacing. It is also an extension of my existing code: after reading @msw's response about itertools, I realised that looping over my current method was staring me in the face, and was far simpler than I had imagined!

Related

Why is my dict being overwritten in this loop in python?

I have a dict, coords_dict, in a strange format. Which is currently being used to store a set of Cartesian coordinate points (x,y,z). The structure of the dict (which is unfortunately out of my control) is as follows.
The keys of the dict are a series of z values of a plane, and each entry consists of a single element list, which itself is a list of lists containing the coordinate points. For example, two elements in the dict can be specified as
coords_dict['3.5']=[[[1.62,2.22,3.50],[4.54,5.24,3.50]]]
coords_dict['5.0']=[[[0.33,6.74,5.00],[2.54,12.64,5.00]]]
So, I now want to apply some translational shift to all coordinate points in this dict by some shift vector [-1,-1,-1], i.e. I want all x, y, and z coordinates to be 1 less than they were before (rounded to 2 decimal places). And I want to assign the result of this translation to a new dictionary, coords_dict_translated, while also updating the dict keys to match the z locations of all points.
My attempt at a solution is below:
import numpy as np
shift_vector=[-1,-1,-1]
coords_dict_translated={}
for key,plane in coords_dict.items(): #iterate over dictionary, key is the key representing each plane
    key=str(float(key)+shift_vector[2]) #the new key should match the z location
    #print(key)
    for point_index in range(0,len(plane[0])): #loop over points in this plane
        plane[0][point_index]=list(np.around(np.array(plane[0][point_index])
            +np.array(shift_vector),decimals=2)) #add shift vector to all points
    coords_dict_translated[key]=plane
However, I notice that if I do this, the original values of coords_dict also change. I want coords_dict to stay the same, and to get back a completely new and entirely separate dict. I am not quite sure where the issue lies. I have tried using for key,plane in list(coords_dict.items()): as well, but this did not work. Why does this loop change the values of the original dictionary?

When you iterate over the dictionary in the for loop, plane refers to the very same list object that is stored in coords_dict, so assigning to plane[0][point_index] modifies the original:
for key,plane in coords_dict.items(): #iterate over dictionary, key is the key representing each plane
If you don't want to change the items, make a copy and modify that instead of modifying plane directly:
import copy
for key,plane in coords_dict.items():
    key=str(float(key)+shift_vector[2]) #the new key should match the z location
    c = copy.deepcopy(plane) #independent copy, so coords_dict is left untouched
    for point_index in range(0,len(plane[0])): #loop over points in this plane
        c[0][point_index]=list(np.around(np.array(plane[0][point_index])
            +np.array(shift_vector),decimals=2)) #add shift vector to all points
    coords_dict_translated[key] = c

The most likely issue here is that you have a list that is being referenced from two different variables. This can happen even when using .copy() if you have a nested structure (as you do here), because .copy() only copies the outermost container.
If this is the problem, you can probably overcome it by making sure you take a (deep) copy of the lists you want to update independently. copy.deepcopy recursively makes copies of lists within lists etc., which avoids double references to lower-level lists.
(comment made into answer).
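The difference between a shallow and a deep copy can be seen on a one-entry version of the dict from the question. The second half is an alternative sketch, not the asker's code: it builds the translated dict out of freshly created lists, using plain round() instead of np.around, so the original is never touched and no copy is needed.

```python
import copy

coords_dict = {'3.5': [[[1.62, 2.22, 3.50], [4.54, 5.24, 3.50]]]}

shallow = coords_dict.copy()        # new outer dict, but the same inner lists
deep = copy.deepcopy(coords_dict)   # every nested list is duplicated

print(shallow['3.5'] is coords_dict['3.5'])   # the inner list is shared
print(deep['3.5'] is coords_dict['3.5'])      # the inner list is a separate object

# Alternative: create new lists while translating, leaving coords_dict alone.
shift_vector = [-1, -1, -1]
coords_dict_translated = {}
for key, plane in coords_dict.items():
    new_key = str(float(key) + shift_vector[2])   # key follows the new z location
    new_points = [
        [round(c + s, 2) for c, s in zip(point, shift_vector)]
        for point in plane[0]
    ]
    coords_dict_translated[new_key] = [new_points]  # keep the single-element outer list

print(coords_dict_translated)
print(coords_dict)   # unchanged
```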

Data Structure for fast insertion and random access in already sorted data

p = random_point(a,b)
#random_point() returns a tuple/named-tuple (x,y)
#0<x<a 0<y<b
if centers.validates(p):
    centers.insert(p)
#centers is the data structure to store points
In the centers data structure all x and y coordinates are stored in two separate sorted(ascending) lists, one for x and other for y. Each node in x points to the corresponding y, and vice versa, so that they can be separately sorted and still hold the pair property: centers.get_x_of(y) and centers.get_y_of(x)
Properties that I require in data structure:
Fast Insertion, in already sorted data (preferably log n)
Random access
Sort x and y separately, without losing pair property
Initially I thought of using simple Lists, and using Binary search to get the index for inserting any new element. But I found, that, it can be improved using self balancing trees like AVL or B-trees. I could make two trees each for x and y, with each node having an additional pointer that could point from x-tree node to y-tree node.
But I don't know how to build random access functionality in these trees. The function centers.validate() tries to insert x & y, and runs some checks with the neighboring elements, which requires random access:
def validate(p):
    indices = get_index(p)
    #returns a named tuple of indices to insert x and y, Eg: (3,7)
    condition1 = func(x_list[indices.x-1], p.x) and func(x_list[indices.x+1], p.x)
    condition2 = func(y_list[indices.y-1], p.y) and func(y_list[indices.y+1], p.y)
    #func is some mathematical condition on neighboring elements of x and y
    return condition1 and condition2
In the above function I need to access neighboring elements of the x & y data structure. I think implementing this in trees would complicate it. Is there any combination of data structures that can achieve this? I am writing this in Python (if that helps).

Use a class with two dicts, each mapping a value to its partner in the other dict: one maps each x to its y, the other maps each y to its x. For each dict, the class also needs to maintain a list holding the current sorted order of that dict's keys. To insert, run a binary search (or another efficient search) over the order list: look up the midpoint key, compare it against the new value, and repeat until the insertion position is found.
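A rough sketch of that layout using the standard bisect module. The class name Centers, the neighbors method and the sample points are hypothetical, and x and y values are assumed to be unique. Note the trade-off: bisect finds the position in O(log n), but inserting into a Python list still shifts elements, so insertion itself is O(n); in exchange, index-based random access stays O(1).

```python
import bisect

class Centers:
    """Two separately sorted lists plus dicts that keep the (x, y) pairing."""

    def __init__(self):
        self.x_list = []      # all x values, ascending
        self.y_list = []      # all y values, ascending
        self._y_of_x = {}     # x -> paired y
        self._x_of_y = {}     # y -> paired x

    def get_index(self, p):
        """Indices at which p's x and y would be inserted."""
        x, y = p
        return (bisect.bisect_left(self.x_list, x),
                bisect.bisect_left(self.y_list, y))

    def neighbors(self, p):
        """Would-be neighbors ((x_prev, x_next), (y_prev, y_next)); None at the ends."""
        ix, iy = self.get_index(p)

        def around(lst, i):
            # Before insertion, the neighbors sit at i-1 and i.
            prev = lst[i - 1] if i > 0 else None
            nxt = lst[i] if i < len(lst) else None
            return prev, nxt

        return around(self.x_list, ix), around(self.y_list, iy)

    def insert(self, p):
        x, y = p
        bisect.insort(self.x_list, x)
        bisect.insort(self.y_list, y)
        self._y_of_x[x] = y
        self._x_of_y[y] = x

    def get_y_of(self, x):
        return self._y_of_x[x]

    def get_x_of(self, y):
        return self._x_of_y[y]

centers = Centers()
for p in [(5, 1), (2, 9), (8, 4)]:
    centers.insert(p)

print(centers.x_list, centers.y_list)
print(centers.neighbors((6, 3)))
print(centers.get_y_of(2), centers.get_x_of(4))
```

A validate method along the lines of the question could call neighbors(p) and apply its own condition to the returned values before deciding whether to insert.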

Mapping arrays with same values but different orders

I have two arrays of coordinates from two separate files from a CFD calculation. One is a mesh file which contains the connectivity information and the other is the results file.
My problem is that the coordinates from each file are not in the same order. What I would like to be able to do is order ALL the arrays from the results file to be in the same order as the mesh file.
My idea would be to find the matching values of xyz coordinates and create a mapping such that the rest of the result arrays can be ordered.
I was thinking something like:
mapping = np.empty(len(co_mesh))
for i,coord in enumerate(co_mesh):
    for j in range(len(co_res)):
        if (coord[0]==co_res[j,0]) and (coord[1]==co_res[j,1]) and (coord[2]==co_res[j,2]):
            mapping[i] = j
where co_mesh, co_res are arrays containing the x,y,z coords.
The problem is that I suspect this loop will take a long time. At the moment I'm only looping over around 70000 points but in future this could increase to 1 million or more.
Is there a faster way to write this in Python?
I'm using Python 2.6.5.
Ben
For those who are interested this is what I am currently using:
mesh_coords = zip(xm_list,ym_list,zm_list,range(len(x_po)))
res_coords = zip(xr_list,yr_list,zr_list,range(len(x)))
mesh_coords = sorted(mesh_coords , key = lambda x:(x[0],x[1],x[2]))
res_coords = sorted(res_coords , key = lambda x:(x[0],x[1],x[2]))
mapping = zip(np.array(listym)[:,-1],np.array(listyr)[:,-1])
mapping = sorted(mapping , key = lambda x:(x[0]))

How about sorting the coordinate vectors in both files by x, then y, and then z?
You can do this efficiently and quickly if you use numpy arrays for the vectors.
Update:
If you don't have the node ids of the nodes in the result mesh, but the coordinates are the same, do the following:
Add a running index as an extra column to your coordinate vectors. Sort both sets of coordinates by x, y, z; the two sorted sets then line up row by row. Attach the (now shuffled) index column of the mesh to the sorted result coordinates and sort the result coordinates by that column. The result coordinates are then in exactly the same order as the original mesh.
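The sort-and-align idea can be sketched in plain Python as follows. The function name build_mapping and the sample coordinates are made up for illustration, and the sketch assumes that both files contain exactly the same coordinate values (bit-for-bit equal floats) with no duplicates. It replaces the O(n^2) double loop with two O(n log n) sorts; with NumPy arrays, np.lexsort could play the role of sorted here.

```python
def build_mapping(co_mesh, co_res):
    """Return a list `mapping` such that co_res[mapping[i]] == co_mesh[i]."""
    # Indices of each set in (x, y, z) sorted order.
    order_mesh = sorted(range(len(co_mesh)), key=lambda i: tuple(co_mesh[i]))
    order_res = sorted(range(len(co_res)), key=lambda j: tuple(co_res[j]))

    # The k-th smallest mesh point equals the k-th smallest result point,
    # so their original indices can be paired up directly.
    mapping = [None] * len(co_mesh)
    for i, j in zip(order_mesh, order_res):
        mapping[i] = j
    return mapping

# Made-up sample: the same four points in two different orders.
co_mesh = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 2.0)]
co_res = [(1.0, 1.0, 2.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 0.0, 0.0)]

mapping = build_mapping(co_mesh, co_res)
print(mapping)

# Any other result array can be reordered the same way.
pressure_res = [40.0, 10.0, 30.0, 20.0]          # hypothetical result values
pressure_mesh_order = [pressure_res[j] for j in mapping]
print(pressure_mesh_order)
```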

Plotting two objects using a 4-item list

I have this simulator (gravitation) I've been working on, and I've dissected the equations, math, etc. and it's totally legitimate. However, when I animate the thing I get weird behavior. I'd rather not bore everyone with the entire script because it's sorta lengthy, but the method I'm calling in line.set under the animate(i) function returns a list of four values, which are the positions of my two particles in Cartesian (x,y) coordinates. For example my list looks like:
[1.2, 3.2, 4.5, 5.1]
where the first index is the x-position of the first particle, the second index is its y-position, and likewise the last two elements (indices 2 and 3) correspond to the second particle.
My question is whether the line.set_data(force.updatePosition(dt)) should be working the way I think it does, i.e. plotting the first particle with indices 0 and 1 and particle two with indices 2 and 3, or am I missing the point? The plotting works, the particles show up, but they get weird, non-sensical movement.
If it's completely necessary here is the script in its entirety...again it's long-ish that's why I didn't post it directly. Also, it's pretty messy as I'm still fighting with it and haven't cleaned it up yet.
Tl;DR Should line.set_data() be able to plot two separate objects if it is fed a list with 4 items?
def init():
    line.set_data([], [])
    return line,
def animate(i):
    line.set_data(force.updatePosition(dt))
    return line,

The docs say:
Definition: l.set_data(self, *args)
Docstring:
Set the x and y data
ACCEPTS: 2D array (rows are x, y) or two 1D arrays
So I imagine you want to give it two lists:
line.set_data([x1, x2], [y1, y2])
But it seems that force.updatePosition already returns a list of two lists ([pos1]+[pos2]), so you can maybe try:
line.set_data(np.transpose(force.updatePosition(dt)))
In my opinion you might be better off keeping all this info in arrays and removing half the lines of your code, since you write every line two or four times, once for each element.
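If updatePosition really returns a flat list [x1, y1, x2, y2] as the question describes (rather than two [x, y] pairs), the x and y values can be separated with slicing before they are handed to set_data. The helper name split_positions is hypothetical; the matplotlib call is shown only as a comment because line and force come from the asker's script.

```python
def split_positions(flat):
    """Split [x1, y1, x2, y2, ...] into ([x1, x2, ...], [y1, y2, ...])."""
    xs = list(flat[0::2])   # every even index holds an x value
    ys = list(flat[1::2])   # every odd index holds a y value
    return xs, ys

xs, ys = split_positions([1.2, 3.2, 4.5, 5.1])
print(xs, ys)

# Inside animate(i) this could then be used as (not run here):
#     xs, ys = split_positions(force.updatePosition(dt))
#     line.set_data(xs, ys)
```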

create an array from a txt file

I'm new to Python and I have a problem.
I have some measured data saved in a txt file.
The data is separated with tabs and has this structure:
0 0 -11.007001 -14.222319 2.336769
I always have 32 data points per simulation (0,1,2,...,31) and I have 300 simulations (0,1,2,...,299), so the data is sorted first by the simulation number and then by the data point number.
The first column is the simulation number, the second column is the data point number and the other 3 columns are the x,y,z coordinates.
I would like to create a 3D array: the first dimension should be the simulation number, the second the number of the data point and the third the three coordinates.
I have already started a bit, and here is what I have so far:
## read file
coords = [x.split('\t') for x in
          open(f,'r').read().replace('\r','')[:-1].split('\n')]
## extract the information you want
simnum = [int(x[0]) for x in coords]
npts = [int(x[1]) for x in coords]
xyz = array([map(float,x[2:]) for x in coords])
But I don't know how to combine these 2 lists and this one array.
In the end I would like to have something like this:
array = [simnum][num_dat_point][xyz]
thanks for your help.
I hope you understand my problem, it's my first posting in a python forum, so if I did anything wrong, I'm sorry about this.
thanks again

You can combine them with the zip function, like so:
for sim, datapoint, (x, y, z) in zip(simnum, npts, xyz):
    # do your thing
Or you could avoid list comprehensions altogether and just iterate over the lines of the file:
for line in open(fname):
    lst = line.split('\t')
    sim, datapoint = int(lst[0]), int(lst[1])
    x, y, z = [float(i) for i in lst[2:]]
    # do your thing
To split every line into its fields you could (and should) simply do the following:
coords = [x.split('\t') for x in open(fname)]

This seems like a good opportunity to use itertools.groupby.
import itertools
import csv
file = open("data.txt")
reader = csv.reader(file, delimiter='\t')
result = []
for simnumberStr, rows in itertools.groupby(reader, key=lambda t: t[0]):
    simData = []
    for row in rows:
        simData.append([float(v) for v in row[2:]])
    result.append(simData)
file.close()
This will create a 3-dimensional list named 'result'. The first index is the simulation number, and the second index is the data index within that simulation. The value is a list of floats containing the x, y, and z coordinates.
Note that this assumes the data is already sorted on simulation number and data number.

According to the zen of python, flat is better than nested. I'd just use a dict.
import csv
f = csv.reader(open('thefile.csv'), delimiter='\t',
               quoting=csv.QUOTE_NONNUMERIC)
result = {}
for simn, dpoint, c1, c2, c3 in f:
    result[simn, dpoint] = c1, c2, c3
# pretty-prints the result:
from pprint import pprint
pprint(result)

You could be using many different kinds of containers for your purposes, but none of them has array as an unqualified name -- Python has a module array which you can import from the standard library, but the array.array type is too limited for your purposes (1-D only and with elementary types as contents); there's a popular third-party extension known as numpy, which does have a powerful numpy.array type, which you could use if you have downloaded and installed the extension -- but as you never even once mention numpy I doubt that's what you mean; the relevant builtin types are list and dict. I'll assume you want any container whatsoever -- but if you could learn to use precise terminology in the future, that will substantially help you AND anybody who's trying to help you (say list when you mean list, array only when you DO mean array, "container" when you're uncertain about what container to use, and so forth).
I suggest you look at the csv module in the standard library for a more robust way to reading your data, but that's a separate issue. Let's start from when you have the coords list of lists of 5 strings each, each sublist with strings representing two ints followed by three floats. Two more key aspects need to be specified...
One key aspect you don't tell us about: is the list sorted in some significant way? is there, in particular, some significant order you want to keep? As you don't even mention either issue, I will have to assume one way or another, and I'll assume that there isn't any guaranteed nor meaningful order; but, no repetition (each pair of simulation/datapoint numbers is not allowed to occur more than once).
Second key aspect: are there the same number of datapoints per simulation, in increasing order (0, 1, 2, ...), or is that not necessarily the case (and btw, are the simulations themselves numbered 0, 1, 2, ...)? Again, no clue from you on this indispensable part of the specs -- note how many assumptions you're forcing would-be helpers to make by just not telling us about such obviously crucial aspects. Don't let people who want to help you stumble in the dark: rather, learn to ask questions the smart way -- this will save untold amounts of time to yourself AND would-be helpers, and give you higher-quality and more relevant help, so, why not do it? Anyway, forced to make yet another assumption, I'll have to assume nothing at all is known about the simulation numbers nor about the numbers of datapoints in each simulation.
With these assumptions dict emerges as the only sensible structure to use for the outer container: a dictionary whose key is a tuple with two items, simulation number then datapoint number within the simulation. The values may as well be tuple, too (with three floats each), since it does appear that you have exactly 3 coordinates per line.
With all of these assumptions...:
def make_container(coords):
    result = dict()
    for s, d, x, y, z in coords:
        key = int(s), int(d)
        value = float(x), float(y), float(z)
        result[key] = value
    return result
It's always best, and fastest, to have all significant code within def statements (i.e. as functions to be called, possibly with appropriate arguments), so I'm presenting it this way. make_container returns a dictionary which you can address with the simulation number and datapoint number; for example,
d = make_container(coords)
print d[0, 0]
will print the x, y, z for dp 0 of sim 0, assuming one exists (you would get an error if such a sim/dp combination did not exist). dicts have many useful methods, e.g. changing the print statement above to
print d.get((0, 0))
(yes, you do need double parentheses here -- inner ones to make a tuple, outer ones to call get with that tuple as its single argument), you'd see None, rather than get an exception, if there was no such sim/dp combination as (0, 0).
If you can edit your question to make your specs more precise (perhaps including some indication of ways you plan to use the resulting container, as well as the various key aspects I've listed above), I might well be able to fine-tune this advice to match your need and circumstances much better (and so might every other responder, regarding their own advice!), so I strongly recommend you do so -- thanks in advance for helping us help you!-)

Essentially, the difficulty is what happens if different simulations have different numbers of points.
You will therefore need to dimension an array to the appropriate size first.
t should be an array of at least (max(simnum)+1) x (max(npts)+1) x 3, since both indices start at 0.
To eliminate confusion you should initialise it with not-a-number;
this will allow you to see missing points.
Then use something like
for x in coords:
    # columns 0 and 1 are the indices; columns 2 to 4 hold x, y, z
    t[int(x[0])][int(x[1])][0]=float(x[2])
    t[int(x[0])][int(x[1])][1]=float(x[3])
    t[int(x[0])][int(x[1])][2]=float(x[4])
Is this what you meant?
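A self-contained sketch of this pre-allocation approach using nested lists. The function name build_table is hypothetical, and of the sample rows only the first comes from the question; the other two are made up to show a missing point. Every slot starts as NaN, so a sim/point combination that never appears in the file stays recognisable.

```python
import math

def build_table(coords):
    """Build t[sim][point] = [x, y, z] from rows of 5 strings; gaps stay NaN."""
    rows = [(int(r[0]), int(r[1]), [float(v) for v in r[2:5]]) for r in coords]
    n_sims = max(s for s, _, _ in rows) + 1   # indices start at 0
    n_pts = max(d for _, d, _ in rows) + 1
    nan = float('nan')
    t = [[[nan, nan, nan] for _ in range(n_pts)] for _ in range(n_sims)]
    for s, d, xyz in rows:
        t[s][d] = xyz
    return t

coords = [
    ['0', '0', '-11.007001', '-14.222319', '2.336769'],
    ['0', '1', '1.0', '2.0', '3.0'],
    ['1', '0', '4.0', '5.0', '6.0'],   # simulation 1 has no data point 1
]
t = build_table(coords)
print(t[0][0])
print(math.isnan(t[1][1][0]))   # the missing point is still NaN
```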

First I'd point out that your first data point appears to be an index, and wonder if the data is therefore important or not, but whichever :-)
import re

# compile once: two integer columns followed by three float columns
mch = re.compile(r'^(\d+)\s+(\d+)\s+([-\d\.]+)\s+([-\d\.]+)\s+([-\d\.]+)$')

def parse(line):
    m = mch.match(line)
    if m:
        l = m.groups()
        # a list comprehension gives a real list on both Python 2 and 3
        (idx,data,xyz) = (int(l[0]),int(l[1]), [float(v) for v in l[2:]])
        return (idx, data, xyz)
    return None

finaldata = []
file = open("data.txt",'r')
for line in file:
    r = parse(line)
    if r is not None:
        finaldata.append(r)
file.close()
Final data should have output along the lines of:
[(0, 0, [-11.007001000000001, -14.222319000000001, 2.3367689999999999]),
(1, 0, [-11.007001000000001, -14.222319000000001, 2.3367689999999999]),
(2, 0, [-11.007001000000001, -14.222319000000001, 2.3367689999999999]),
(3, 0, [-11.007001000000001, -14.222319000000001, 2.3367689999999999]),
(4, 0, [-11.007001000000001, -14.222319000000001, 2.3367689999999999])]
This should be pretty robust about dealing w/ the whitespace issues (tabs spaces whatnot)...
I also wonder how big your data files are, mine are usually large so being able to process them in chunks or groups become more important... Anyway this will work in python 2.6.

Are you sure a 3d array is what you want? It seems more likely that you want a 2d array, where the simulation number is one dimension, the data point is the second, and then the value stored at that location is the coordinates.
This code will give you that.
data = {}
for coord in coords:
    sim, point = int(coord[0]), int(coord[1])  # fields are strings; convert so data[7][13] works
    if sim not in data:
        data[sim] = {}
    data[sim][point] = (float(coord[2]), float(coord[3]), float(coord[4]))
To get the coordinates at simulation 7, data point 13, just do data[7][13]
