Sum attributes of duplicate coordinates in python - python

I am going through my coordinates data and I see some duplicate coordinates with different parameters due to certain preprocessing. I want to be able to merge the attributes corresponding to the matched coordinates and get the simplified results. To clarify what I mean here is an example:
X = [1.0, 2.0, 3.0, 2.0]
Y = [8.0, 3.0, 4.0, 3.0]
A = [13, 16, 20, 8]
The above data is read as follows: point (1.0, 8.0) has a value of 13 and (2.0, 3.0) has a value of 16. Notice that the second point and fourth point have the same coordinates but different attribute values. I want to be able to remove the duplicates from the lists of coordinates and sum the attributes so the results would be new lists:
New_X = [1.0, 2.0, 3.0]
New_Y = [8.0, 3.0, 4.0]
New_A = [13, 24, 20]
24 is the sum of 16 and 8 from the second and fourth points with the same coordinates, therefore one point is kept and the values are summed.
I am not sure how to do this. I thought of using nested for loops over zips of the coordinates, but I am not sure how to formulate them to sum the attributes.
Any help is appreciated!

I think that maintaining 3 lists is a bit awkward. Something like:
D = dict()
for x, y, a in zip(X, Y, A):
    D[(x, y)] = D.get((x, y), 0) + a
would put everything together in one place.
If you'd prefer to decompose it back into 3 lists:
newX, newY, newA = [], [], []  # the output lists must exist before appending
for (x, y), a in D.items():
    newX.append(x)
    newY.append(y)
    newA.append(a)
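For reference, the two steps above (accumulate in a dictionary, then split back into lists) can be sketched as one self-contained function. The name merge_duplicates is hypothetical, and the sketch assumes Python 3.7+, where dicts preserve insertion order, so the first-seen order of the coordinates is kept.

```python
from collections import defaultdict

def merge_duplicates(X, Y, A):
    """Sum the attributes of points sharing the same (x, y) coordinates."""
    totals = defaultdict(int)  # missing keys start at 0
    for x, y, a in zip(X, Y, A):
        totals[(x, y)] += a
    # Iterating a dict yields its keys in insertion order.
    new_x = [x for x, _ in totals]
    new_y = [y for _, y in totals]
    new_a = list(totals.values())
    return new_x, new_y, new_a

X = [1.0, 2.0, 3.0, 2.0]
Y = [8.0, 3.0, 4.0, 3.0]
A = [13, 16, 20, 8]
New_X, New_Y, New_A = merge_duplicates(X, Y, A)
print(New_X, New_Y, New_A)
```

Exact equality of float keys is fine here because duplicates come from identical values; coordinates that differ by rounding noise would need to be rounded first.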

Another option here is to use itertools.groupby. But since this only groups consecutive keys, you'll have to first sort your coordinates.
First we can zip them together to create tuples of the form (x, y, a). Then sort these by the (x, y) coordinates:
sc = sorted(zip(X, Y, A), key=lambda P: (P[0], P[1])) # sorted coordinates
print(sc)
#[(1.0, 8.0, 13), (2.0, 3.0, 16), (2.0, 3.0, 8), (3.0, 4.0, 20)]
Now we can groupby the coordinates and sum the values:
from itertools import groupby
print([(*a, sum(c[2] for c in b)) for a, b in groupby(sc, key=lambda P: (P[0], P[1]))])
#[(1.0, 8.0, 13), (2.0, 3.0, 24), (3.0, 4.0, 20)]
And since zip is its own inverse, you can get New_X, New_Y, and New_A via:
New_X, New_Y, New_A = zip(
*((*a, sum(c[2] for c in b)) for a, b in groupby(sc, key=lambda P: (P[0], P[1])))
)
print(New_X)
print(New_Y)
print(New_A)
#(1.0, 2.0, 3.0)
#(8.0, 3.0, 4.0)
#(13, 24, 20)
Of course, you can do this all in one line but I broke it up into pieces so that it's easier to understand.
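The point about sorting first can be checked directly. The sketch below (the helper name summed is made up for illustration) runs the same groupby on the unsorted and on the sorted points; without sorting, the two (2.0, 3.0) points are not adjacent, so they end up in separate groups.

```python
from itertools import groupby

X = [1.0, 2.0, 3.0, 2.0]
Y = [8.0, 3.0, 4.0, 3.0]
A = [13, 16, 20, 8]

def summed(points):
    """Group consecutive points by (x, y) and sum their third field."""
    return [(*key, sum(p[2] for p in grp))
            for key, grp in groupby(points, key=lambda p: (p[0], p[1]))]

unsorted_result = summed(zip(X, Y, A))       # duplicates stay separate
sorted_result = summed(sorted(zip(X, Y, A)))  # duplicates are merged
print(unsorted_result)
print(sorted_result)
```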

You could put the (x, y) coordinates in a dictionary:
totals = {}  # not named `dict`, which would shadow the built-in
for i in range(len(X)):  # len(X) == len(Y)
    if (X[i], Y[i]) not in totals:
        totals[(X[i], Y[i])] = A[i]
    else:
        totals[(X[i], Y[i])] += A[i]

You can use a dictionary:
d = {}
for i in range(len(X)):
    tup = (X[i], Y[i])
    if tup in d:
        d[tup] += A[i]
    else:
        d[tup] = A[i]
New_X = []
New_Y = []
New_A = []
for key in d.keys():
    New_X.append(key[0])
    New_Y.append(key[1])
    New_A.append(d[key])

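Try this list comprehension, which sums the attributes for each distinct coordinate pair (it rescans the input once per pair, so it suits small inputs):
keys = list(dict.fromkeys(zip(X, Y)))  # distinct (x, y) pairs in first-seen order
New_X, New_Y, New_A = zip(*[(x, y, sum(a for xi, yi, a in zip(X, Y, A) if (xi, yi) == (x, y))) for x, y in keys])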

A dict seems like a more appropriate data structure here. This will build one.
from collections import Counter
D = sum((Counter({(x, y): a}) for x, y, a in zip(X, Y, A)), Counter())
print(D)
#Counter({(2.0, 3.0): 24, (3.0, 4.0): 20, (1.0, 8.0): 13})
You can unpack these back into separate lists using:
New_X, New_Y, New_A = map(list, zip(*[(x,y,a) for (x,y),a in D.items()]))
print(New_X)
print(New_Y)
print(New_A)
#[1.0, 2.0, 3.0]
#[8.0, 3.0, 4.0]
#[13, 24, 20]
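If the coordinates are already held in NumPy arrays, a vectorized variant is possible. This is only a sketch of an alternative, not something proposed in the answers above: it uses np.unique with axis=0 to find the distinct rows and np.bincount to sum the attributes per row. Note that the result comes back sorted by coordinate rather than in first-seen order, and that the sums are floats.

```python
import numpy as np

X = [1.0, 2.0, 3.0, 2.0]
Y = [8.0, 3.0, 4.0, 3.0]
A = [13, 16, 20, 8]

coords = np.column_stack((X, Y))
# unique_rows: the distinct (x, y) rows, sorted;
# inverse: for each input row, the index of its row in unique_rows
unique_rows, inverse = np.unique(coords, axis=0, return_inverse=True)
# ravel() guards against NumPy versions that return a 2-D inverse here
sums = np.bincount(inverse.ravel(), weights=A)

New_X = unique_rows[:, 0].tolist()
New_Y = unique_rows[:, 1].tolist()
New_A = sums.tolist()
print(New_X, New_Y, New_A)
```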

Related

How to append and pair coordinate values in nested for loop

I am finding the distance between pairs of random points. I then duplicate the points in a 3 x 3 pattern, so that the same points are seen again after a certain distance, which is done with a nested for loop. I am trying to find the distances between the newly created points from the for loop.
I tried using append within the loop to store the points, which gives me the distances, but it only gives me 24 distances when there should be many more between 9 copies of 4 points.
Am I not implementing append correctly to account for the additional distances?
Code
import numpy as np
import matplotlib.pyplot as plt
import random
import math
dist = []
#scale of the plot
scalevalue = 10
x = [random.uniform(1, 10) for n in range(4)]
y = [random.uniform(1, 10) for n in range(4)]
tiles = np.linspace(-scalevalue, scalevalue, 3)
for i in tiles:
    for j in tiles:
        bg_tile = plt.scatter(x + i,y + j, c="black", s=3)
        dist.append(i)
        dist.append(j)
        pairs = list(zip(x + i,y + j))
plt.show()
def distance(x, y):
    return math.sqrt((x[0]-x[1])**2 + (y[0]-y[1])**2)
for i in range(len(pairs)):
    for j in range(i+1,len(pairs)):
        dist.append(distance(pairs[i],pairs[j]))
print(dist)
Run your code:
x (and y) is a list of numbers (4):
In [553]: x
Out[553]: [8.699962201099193, 3.1643082386096975, 5.245385542599207, 3.0412506367299033]
tiles is an array:
In [554]: tiles
Out[554]: array([-10., 0., 10.])
Now the first loop, without the plot, and appending one (i, j) tuple per step rather than i and j separately. This better separates the i values from the j ones:
In [558]: dist=[]
     ...: for i in tiles:
     ...:     for j in tiles:
     ...:         dist.append((i,j))
     ...:         pairs = list(zip(x + i,y + j))
In [559]: dist
Out[559]:
[(-10.0, -10.0), # that just reflects how you iterate on tiles
(-10.0, 0.0),
(-10.0, 10.0),
(0.0, -10.0),
(0.0, 0.0),
(0.0, 10.0),
(10.0, -10.0),
(10.0, 0.0),
(10.0, 10.0)]
That flat list you show in the comment confuses those values. Why are you doing this?
pairs ends up with the last i,j values; earlier iterations are thrown away.
In [560]: pairs
Out[560]:
[(18.699962201099193, 18.63063210113664),
(13.164308238609697, 12.329695190243902),
(15.245385542599207, 16.685778921185936),
(13.041250636729902, 15.89730196643608)]
So the first column is:
In [561]: i
Out[561]: 10.0
In [562]: x+i
Out[562]: array([18.6999622 , 13.16430824, 15.24538554, 13.04125064])
x is a list, but i is np.float64, so the addition is array addition (list 'addition' is join).
With that last value of pairs:
In [567]: alist = []
     ...: for i in range(len(pairs)):
     ...:     for j in range(i+1,len(pairs)):
     ...:         alist.append(distance(pairs[i],pairs[j]))
     ...:
In [568]: alist
Out[568]:
[0.8374876734992962,
1.442060937629651,
2.8568926932380996,
1.664725810930718,
2.9755013255616056,
3.1987125977481807]
The iteration computes one value for each of the 6 combinations of these 4 pairs:
In [574]: distance(pairs[0],pairs[1])
Out[574]: 0.8374876734992962
Those 6 values (different in my case because of different random numbers) have nothing to do with the tile values that you previously accumulated in dist.
If I make a 2d array from pairs:
In [575]: arr = np.array(pairs); arr
Out[575]:
array([[18.6999622 , 18.6306321 ],
[13.16430824, 12.32969519],
[15.24538554, 16.68577892],
[13.04125064, 15.89730197]])
I can replicate the distance with:
In [576]: (arr[:,1]-arr[:,0])**2
Out[576]: array([4.80666276e-03, 6.96578941e-01, 2.07473309e+00, 8.15702920e+00])
In [577]: np.sqrt(np.sum(_[:2]))
Out[577]: 0.8374876734992962
I don't know what the significance of this is. pairs is just the x, y values with 10 added:
In [579]: np.column_stack((x,y))+10
Out[579]:
array([[18.6999622 , 18.6306321 ],
[13.16430824, 12.32969519],
[15.24538554, 16.68577892],
[13.04125064, 15.89730197]])
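Putting the observations above together, here is a minimal sketch of what the question seems to be after, under two assumptions: the points from every tile should be kept rather than overwritten, and the intended metric is the ordinary Euclidean distance between two (x, y) points (which is not what the question's distance function computes). The name all_points is introduced here for illustration.

```python
import itertools
import math
import random

random.seed(0)  # fixed seed only so the sketch is reproducible
scalevalue = 10
x = [random.uniform(1, 10) for _ in range(4)]
y = [random.uniform(1, 10) for _ in range(4)]
tiles = [-scalevalue, 0, scalevalue]

# Keep the shifted points from every (i, j) tile instead of overwriting them.
all_points = []
for i in tiles:
    for j in tiles:
        all_points.extend((xk + i, yk + j) for xk, yk in zip(x, y))

def distance(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

# One distance per unordered pair of points.
dists = [distance(p, q) for p, q in itertools.combinations(all_points, 2)]
print(len(all_points), len(dists))  # 36 630
```

With 9 tiles of 4 points there are 36 points and 36 * 35 / 2 = 630 distinct pairs.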

Fastest way to find the nearest pairs between two numpy arrays without duplicates

Given two large numpy arrays A and B with a different number of rows (len(B) > len(A)) but the same number of columns (A.shape[1] == B.shape[1] == 3), I want to know the fastest way to get a subset C of B that has the minimum total distance (the sum of all pairwise distances) to A without duplicates (every pair must be unique). This means C should have the same shape as A.
Below is my code, but there are two main issues:
I cannot tell whether this gives the minimum total distance.
In reality I have a much more expensive distance-calculating function than np.linalg.norm (it needs to take care of periodic boundary conditions). I think this is definitely not the fastest way to go, since the code below calls the distance-calculating function one pair at a time. There is a significant overhead when I call the more expensive distance-calculating function, and it will run forever. Any suggestions?
import numpy as np
from operator import itemgetter
import random
import time
A = 100.*np.random.rand(1000, 3)
B = A.copy()
for (i,j), _ in np.ndenumerate(B):
    B[i,j] += np.random.rand()
B = np.vstack([B, 100.*np.random.rand(500, 3)])
def calc_dist(x, y):
    return np.linalg.norm(x - y)
t0 = time.time()
taken = []
for rowi in A:
    res = min(((k, calc_dist(rowi, rowj)) for k, rowj in enumerate(B)
               if k not in taken), key=itemgetter(1))
    taken.append(res[0])
C = B[taken]
print(A.shape, B.shape, C.shape)
>>> (1000, 3) (1500, 3) (1000, 3)
print(time.time() - t0)
>>> 12.406389951705933
Edit: for those who are interested in the expensive distance-calculating function, it uses the ase package (can be installed by pip install ase)
from ase.geometry import find_mic
def calc_mic_dist(x, y):
    return find_mic(np.array([x]) - np.array([y]),
                    cell=np.array([[50., 0.0, 0.0],
                                   [25., 45., 0.0],
                                   [0.0, 0.0, 100.]]))[1][0]
If you're OK with calculating the whole N² distances, which isn't that expensive for the sizes you've given, scipy.optimize has a function that will solve this directly.
import scipy.optimize
cost = np.linalg.norm(A[:, np.newaxis, :] - B, axis=2)
_, indexes = scipy.optimize.linear_sum_assignment(cost)
C = B[indexes]
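The first issue in the question (whether the greedy loop reaches the minimum total distance) can be checked on a tiny case. The sketch below uses made-up 2-D points chosen so that the greedy row-by-row choice is worse than the assignment found by scipy.optimize.linear_sum_assignment; the variable names are illustrative only.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Made-up points: 2 rows in A, 3 candidate rows in B.
A = np.array([[0.0, 0.0], [-1.0, 0.0]])
B = np.array([[-0.4, 0.0], [0.5, 0.0], [5.0, 0.0]])
cost = np.linalg.norm(A[:, np.newaxis, :] - B, axis=2)  # shape (2, 3)

# Greedy: each row of A takes its nearest row of B that is still free.
taken = []
for row in cost:
    free = [k for k in range(len(B)) if k not in taken]
    taken.append(min(free, key=lambda k: row[k]))
greedy_total = cost[np.arange(len(A)), taken].sum()

# Optimal one-to-one assignment over the full cost matrix.
rows, cols = linear_sum_assignment(cost)
optimal_total = cost[rows, cols].sum()

print(taken, greedy_total)            # greedy pairs A[0] with B[0]
print(cols.tolist(), optimal_total)   # optimal pairs A[0] with B[1]
```

Here the greedy total is about 1.9 while the optimal total is about 1.1, so the greedy loop does not guarantee the minimum in general.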
Using the power of numpy broadcasting and vectorization:
The find_mic function in ase.geometry can handle 2-D numpy arrays.
from ase.geometry import find_mic
def calc_mic_dist(x, y):
    return find_mic(x - y,
                    cell=np.array([[50., 0.0, 0.0],
                                   [25., 45., 0.0],
                                   [0.0, 0.0, 100.]]))[1]
Test:
x = np.random.randn(1,3)
y = np.random.randn(5,3)
print (calc_mic_dist(x,y).shape)
# It is a distance metric, so it should be symmetric:
assert np.allclose(calc_mic_dist(x,y), calc_mic_dist(y,x))
Output:
(5,)
As you can see, the metric is calculated between the single row of x and every row of y, because x - y in numpy does the magic of broadcasting.
Solution:
def calc_mic_dist(x, y):
    return find_mic(x - y,
                    cell=np.array([[50., 0.0, 0.0],
                                   [25., 45., 0.0],
                                   [0.0, 0.0, 100.]]))[1]
t0 = time.time()
A = 100.*np.random.rand(1000, 3)
B = 100.*np.random.rand(5000, 3)
selected = [np.argmin(calc_mic_dist(a, B)) for a in A]
C = B[selected]
print (A.shape, B.shape, C.shape)
print (f"Time: {time.time()-t0}")
Output:
(1000, 3) (5000, 3) (1000, 3)
Time: 9.817562341690063
This takes around 10 seconds on Google Colab.
Testing:
We know that calc_mic_dist(x, x) == 0, so if A is a subset of B then C should be exactly A.
A = 100.*np.random.rand(1000, 3)
B = np.vstack([100.*np.random.rand(500, 3), A, 100.*np.random.rand(500, 3)])
selected = [np.argmin(calc_mic_dist(a, B)) for a in A]
C = B[selected]
print (A.shape, B.shape, C.shape)
print (np.allclose(A,C))
Output:
(1000, 3) (2000, 3) (1000, 3)
True
Edit 1: Avoid duplicates
Once a vector in B is selected, it cannot be selected again for other rows of A.
This can be achieved by removing the selected vector from B once it is selected, so that it does not appear again as a possible candidate for the following rows of A.
A = 100.*np.random.rand(1000, 3)
B = np.vstack([100.*np.random.rand(500, 3), A, 100.*np.random.rand(500, 3)])
B_ = B.copy()
C = np.zeros_like(A)
for i, a in enumerate(A):
    s = np.argmin(calc_mic_dist(a, B_))
    C[i] = B_[s]
    # Remove the paired vector so it cannot be selected again
    B_ = np.delete(B_, (s), axis=0)
print (A.shape, B.shape, C.shape)
print (np.allclose(A,C))
Output:
(1000, 3) (2000, 3) (1000, 3)
True
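A possible variation on the loop above, sketched under assumptions: instead of copying B with np.delete on every iteration, keep a boolean mask of rows that are still available and set the distances of used rows to infinity. This also records the original indices into B. The function calc_dist below is a plain Euclidean stand-in so the sketch runs without ase; any function returning one distance per row of B could be swapped in. Like the original loop, this is still a greedy pairing, not a globally optimal one.

```python
import numpy as np

def calc_dist(a, B):
    """Stand-in metric: Euclidean distance from one row to every row of B."""
    return np.linalg.norm(B - a, axis=1)

rng = np.random.default_rng(0)
A = 100.0 * rng.random((200, 3))
B = np.vstack([100.0 * rng.random((100, 3)), A, 100.0 * rng.random((100, 3))])

available = np.ones(len(B), dtype=bool)  # True while a row of B is unpaired
selected = np.empty(len(A), dtype=int)   # index into B chosen for each row of A
for i, a in enumerate(A):
    d = calc_dist(a, B)
    d[~available] = np.inf   # rows already paired can never win the argmin
    s = np.argmin(d)
    selected[i] = s
    available[s] = False

C = B[selected]
print(A.shape, B.shape, C.shape)
print(np.allclose(A, C))
```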

Cutting geological boreholes (csv data file) to extract some value using python

It's my first question/post so first of all I would like to say THANKS for all great ideas and solutions that I have found in this place.
I have a problem with a pretty simple task: I've got a csv file with geophysical measurements of soil/rock electrical resistivity in a group of boreholes. I have to find the rho value at some cutoff level, e.g. 5 meters. I have the measurement number (m_nr), which is also the layer number, the x and y coordinates, the ordinate ("o", in meters above sea level), the resistivity (rho), the layer depth (h) and the layer thickness (d). The value of rho I'm looking for is in the first row of each borehole that meets the condition h >= cutoff. I'm using Python 3.6 and this is how my code looks:
import csv
file = open('measurement.csv', newline='')
file = csv.reader(file, delimiter=';', quotechar='|')
measurements = list(file)
result = []
cutoff=5
for m_nr, x, y, ordinate, rho, h, d in measurements:
    m_nr = int(m_nr)
    x = int(x)
    y = int(y)
    o = float(ordinate)
    rho = float(rho)
    h = float(h)
    d = float(d)
    if h >= cutoff:
        result.append([x, y, m_nr, o-cutoff, rho, h, d])
and some output:
[[20456, 10234, 4, 90.0, 2356.0, 7.0, 2.25],
[20456, 10234, 5, 90.0, 24563.0, 15.0, 8.0],
[20456, 10234, 6, 90.0, 250.0, 21.0, 6.0],
[10122, 15678, 3, 108.0, 245.0, 6.0, 2.0],
[10122, 15678, 4, 108.0, 2356.0, 7.0, 1.0],
[10122, 15678, 5, 108.0, 24563.0, 15.0, 8.0],
[30111, 34444, 2, 75.0, 4686.0, 12.0, 11.0],
[30111, 34444, 3, 75.0, 245.0, 16.0, 4.0],
[30111, 34444, 4, 75.0, 2356.0, 28.0, 12.0]]
That's just a test file, and I expect that in the near future I will have hundreds of boreholes, so the efficiency of the code matters... For each borehole (a distinct pair of x, y), only the first row in the list is the one that I need. I don't know how to extract it from my results, and that's where I'm asking for your help.
Regards,
Matsu
I'll just go over several things.
It's cleaner to open the file using a with statement so you don't have to worry about closing it
You can use the DictReader class to make the data accessible more easily.
Don't do list(file), just iterate over the reader directly. That way you don't have to load the whole thing into memory.
You can keep track of the x, y values and skip the rest after you find a match.
Result:
import csv
with open('measurement.csv', newline='') as file:
    fieldnames = ['m_nr', 'x', 'y', 'ord', 'rho', 'h', 'd']
    # the file in the question is semicolon-delimited
    reader = csv.DictReader(file, fieldnames=fieldnames, delimiter=';', quotechar='|')
    result = []
    last_xy = None
    cutoff=5
    for line in reader:
        xy = int(line['x']), int(line['y'])
        if xy == last_xy:
            continue # skip processing since we already have a match
        h = float(line['h'])
        if h >= cutoff:
            result.append(line)
            last_xy = xy # if we find a match, save the xy
Finally, if the goal is to put the result into a new CSV file, I'd just have an output file open for writing at the same time and write out the results instead of appending them to a list. That way you never need to have more than a few lines in memory at a time.
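The streaming idea in the last paragraph could look roughly like the sketch below. The sample rows are made up (they only follow the question's column order m_nr;x;y;o;rho;h;d), io.StringIO stands in for the real input and output files, and the function name write_cutoff_rows is hypothetical. It tracks finished boreholes in a set, which also works if the rows of one borehole are not contiguous.

```python
import csv
import io

# Made-up sample rows in the question's column order: m_nr;x;y;o;rho;h;d
SAMPLE = """\
1;20456;10234;95.0;120.0;2.0;2.0
2;20456;10234;95.0;2356.0;7.0;5.0
3;20456;10234;95.0;250.0;21.0;14.0
1;10122;15678;113.0;80.0;4.0;4.0
2;10122;15678;113.0;245.0;6.0;2.0
"""
FIELDS = ['m_nr', 'x', 'y', 'o', 'rho', 'h', 'd']

def write_cutoff_rows(src, dst, cutoff):
    """Copy the first row with h >= cutoff of each borehole from src to dst."""
    reader = csv.DictReader(src, fieldnames=FIELDS, delimiter=';')
    writer = csv.DictWriter(dst, fieldnames=FIELDS, delimiter=';',
                            lineterminator='\n')
    done = set()  # boreholes (x, y) that already have their row written
    for row in reader:
        xy = (row['x'], row['y'])
        if xy not in done and float(row['h']) >= cutoff:
            writer.writerow(row)
            done.add(xy)

out = io.StringIO()  # stand-in for an output file opened with newline=''
write_cutoff_rows(io.StringIO(SAMPLE), out, cutoff=5)
print(out.getvalue())
```

Only one input row is held in memory at a time, plus the set of borehole coordinates already handled.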

Python: numpy shape to coordinate

I have a Python script that adjusts the coordinates of triangles towards the centre of gravity of each triangle.
This works just fine; however, to generate a workable output (I need to write a text file which can be imported by other software, Abaqus), I want to write a coordinate list to a text file.
But I can't get this to work properly.
I think I will first need to create a list or tuple from the numpy array.
However, this doesn't work correctly.
There's still an array per coordinate in this list.
How can I fix this?
The script I currently have is shown below.
newcoords = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
newelems = [[0, 1, 2], [3, 4, 5]]
import numpy as np
#define triangles
triangles = np.array([[newcoords[e] for e in newelem] for newelem in newelems])
#find centroid of each triangle
CM = np.mean(triangles,axis=1)
#find vector from each point in triangle pointing towards centroid
point_to_CM_vectors = CM[:,np.newaxis] - triangles
#calculate similar triangles 1% smaller
new_triangle = triangles + 0.01*point_to_CM_vectors
newcoord = []
newcoord.append(list(zip(*new_triangle)))
print 'newcoord =', newcoord
#generate output
fout = open('_PartInput3.inp','w')
print >> fout, '*Node-new_triangle'
for i,x in enumerate(newcoord):
    print >> fout, i+1, ',', x[0], ',', x[1]
fout.close()
The coordinate list in the output file '_PartInput3.inp' should look like the following:
*Node-new_triangle
1, 0.00333333, 0.00333333
2, 0.99333333, 0.00333333
3, 0.00333333, 0.99333333
4, 0.00333333, 0.00666667
5, 0.99333333, 0.99666667
6, 0.00333333, 0.99666667
Thanks in advance for any help!
#generate output
fout = open('_PartInput3.inp','w')
fout.write('*Node-new_triangle\n')
s = new_triangle.shape
for i, x in enumerate(new_triangle.reshape(s[0]*s[1], 2)):
    fout.write("{}, {}, {}\n".format(i+1, x[0], x[1]))
fout.close()
or better
#generate output
with open('_PartInput3.inp','w') as fout:
    fout.write('*Node-new_triangle\n')
    s = new_triangle.shape
    for i, x in enumerate(new_triangle.reshape(s[0]*s[1], 2)):
        fout.write("{}, {}, {}\n".format(i+1, x[0], x[1]))
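To see that the reshape produces the node list from the question, here is a self-contained sketch that rebuilds new_triangle from the question's data and writes to an io.StringIO instead of a real file. The fixed 8-decimal format is an assumption made to match the expected output shown in the question; reshape(-1, 2) is equivalent to reshape(s[0]*s[1], 2).

```python
import io
import numpy as np

newcoords = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
             [0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
newelems = [[0, 1, 2], [3, 4, 5]]

triangles = np.array([[newcoords[e] for e in elem] for elem in newelems])
centroids = triangles.mean(axis=1)
# Move every vertex 1% of the way towards its triangle's centroid.
new_triangle = triangles + 0.01 * (centroids[:, np.newaxis] - triangles)

nodes = new_triangle.reshape(-1, 2)  # one (x, y) row per node
fout = io.StringIO()                 # stand-in for open('_PartInput3.inp', 'w')
fout.write('*Node-new_triangle\n')
for i, (x, y) in enumerate(nodes, start=1):
    fout.write('{}, {:.8f}, {:.8f}\n'.format(i, x, y))

lines = fout.getvalue().splitlines()
print('\n'.join(lines))
```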

Python sublist for a condition

I have 3 lists x, y, z and I plot them with:
ax.plot3D(x, y, z, linestyle = 'None', marker = 'o').
What is the easiest way to only plot the points where x > 0.5?
(my problem is how to define a sublist under a condition without making a for loop on that list).
I'm not sure why you're avoiding looping over a list, and I'm assuming that you want the related points in the other lists removed as well.
>>> x = [0.0, 0.4, 0.6, 1.0]
>>> y = [0.0, 2.2, 1.5, 1.6]
>>> z = [0.0, 9.1, 1.0, 0.9]
>>> zip(x,y,z)
[(0.0, 0.0, 0.0), (0.4, 2.2, 9.1), (0.6, 1.5, 1.0), (1.0, 1.6, 0.9)]
>>> [item for item in zip(x,y,z) if item[0] > 0.5]
[(0.6, 1.5, 1.0), (1.0, 1.6, 0.9)]
Separating the list into its constituent lists will require looping over the list somehow.
It's impossible to verify a condition on every element of a list without iterating over it at least once. You could use numpy here for easy access to the elements that pass the condition:
import numpy
x = [0.0, 0.4, 0.6, 1.0]
y = [0.0, 2.2, 1.5, 1.6]
z = [0.0, 9.1, 1.0, 0.9]
res = numpy.array([[x[i], y[i], z[i]] for i in range(len(x)) if x[i] > 0.5])
ax.plot3D(res[:,0], res[:,1], res[:,2], linestyle='None', marker='o')
A simple list comprehension won't be enough to remove the (x, y, z) tuples where x <= 0.5; you'll have to do a little more. I use operator.itemgetter for the second part:
from operator import itemgetter
result = [(a, b, c) for a,b,c in zip(x,y,z) if a > 0.5] # first, remove the unwanted triplets
x = list(map(itemgetter(0), result)) # then grab the x, y, z parts from the new list
y = list(map(itemgetter(1), result))
z = list(map(itemgetter(2), result))
ax.plot3D(x, y, z, linestyle='None', marker='o')
EDIT:
Following and extending @shenshei's advice, we can achieve it with a one-liner:
ax.plot3D(
    *zip(*[(a, b, c) for a,b,c in zip(x,y,z) if a > 0.5]),
    linestyle='None',
    marker='o'
)
Reposting my comment as an answer, as suggested by @StephenPaulger. You can do this with a generator expression and a couple of calls to the built-in zip():
x = [0.0, 0.4, 0.6, 1.0]
y = [0.0, 2.2, 1.5, 1.6]
z = [0.0, 9.1, 1.0, 0.9]
points = (point for point in zip(x, y, z) if point[0] > 0.5)
x, y, z = zip(*points)
You could also use a list comprehension for points if you want to, but - assuming Python 3, where zip() no longer precomputes a full list when called - that might hurt your memory usage and speed, especially if the number of points is large.
Probably using numpy would provide the cleanest approach. However, you will need to have lists/arrays x, y, and z as numpy arrays. So, first convert these lists to numpy arrays:
import numpy as np
x = np.asarray(x)
y = np.asarray(y)
z = np.asarray(z)
Now compute an array of indices of elements that satisfy your condition:
idx = np.where(x > 0.5)
NOTE: Alternatively, you could compute a boolean mask: idx=x>0.5 (this will not change the use of idx in the next ax.plot3D statement).
Use these indices to select only those specific points in x, y, and z that satisfy desired condition:
ax.plot3D(x[idx], y[idx], z[idx], linestyle = 'None', marker = 'o')
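The selection step can be tried without any plotting. The sketch below reuses the sample values from the other answers and shows that the boolean mask and the np.where indices pick the same elements; the names mask, xs, ys and zs are introduced here for illustration.

```python
import numpy as np

x = np.asarray([0.0, 0.4, 0.6, 1.0])
y = np.asarray([0.0, 2.2, 1.5, 1.6])
z = np.asarray([0.0, 9.1, 1.0, 0.9])

mask = x > 0.5        # boolean array: [False, False, True, True]
idx = np.where(mask)  # tuple holding the integer indices [2, 3]

# Both forms select the same points from each array.
xs, ys, zs = x[mask], y[mask], z[mask]
print(xs, ys, zs)
print(np.array_equal(x[idx], xs))
```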
I don't want to steal lvc's thunder, but here's a variant on their answer:
>>> x = [0.1, 0.6, 0.2, 0.8, 0.9]
>>> y = [0.3, 0.1, 0.9, 0.5, 0.8]
>>> z = [0.9, 0.2, 0.7, 0.4, 0.3]
>>>
>>> a, b, c = zip(*filter(lambda t: t[0] > 0.5, zip(x, y, z)))
>>> print a, "\n", b, "\n", c
(0.6, 0.8, 0.9)
(0.1, 0.5, 0.8)
(0.2, 0.4, 0.3)
>>> ax.plot3D(a, b, c, linestyle = 'None', marker = 'o')
