Why is Strassen's algorithm slower than the usual matrix multiplication?

I'm trying to figure out how to multiply matrices really fast in Python without using NumPy. For this reason, I've implemented the Strassen algorithm and compared it with standard multiplication using nested loops. I only compare square matrices of size NxN where N is 2^k. Surprisingly, my Strassen algorithm is about 3.5 times slower than the standard one (I use perf_counter from the time module to measure time). As an example, the average timings are approximately as follows:
matrix size | Strassen   | standard mult
16x16       | 0.006 sec  | 0.002 sec
32x32       | 0.036 sec  | 0.013 sec
64x64       | 0.26 sec   | 0.07 sec
128x128     | 1.69 sec   | 0.49 sec
1024x1024   | 771.42 sec | 221.09 sec
I generated the matrix elements with randint(1, 9), but I started timing only after the test matrices were created and stopped before printing the result with loops, so I only timed my functions. I have also seen other posts about the same problem, such as this one:
Why is Strassen matrix multiplication so much slower than standard matrix multiplication?
but I can't say it was very helpful for me.
Also, it would be really cool to optimize my current algorithm by changing something in it instead of writing a new one.
Strassen algorithm
from time import perf_counter
def submatrices(n, matrix):  # divide the matrix into four blocks
    A = [[j for j in matrix[i][:int(n / 2)]] for i in range(int(n / 2))]
    B = [[j for j in matrix[i][int(n / 2):]] for i in range(int(n / 2))]
    C = [[j for j in matrix[i][:int(n / 2)]] for i in range(int(n / 2), n)]
    D = [[j for j in matrix[i][int(n / 2):]] for i in range(int(n / 2), n)]
    return [A, B, C, D]
def addition(n, matrix1, matrix2):  # element-wise addition
    res = [[matrix1[i][j] + matrix2[i][j] for j in range(n)] for i in range(n)]
    return res
def subtraction(n, matrix1, matrix2):  # element-wise subtraction
    res = [[matrix1[i][j] - matrix2[i][j] for j in range(n)] for i in range(n)]
    return res
def strassen(n, matrix1, matrix2):
    if n == 2:  # the last step of the algorithm is just standard multiplication
        xy = [[0] * n for i in range(n)]
        for i in range(n):
            for j in range(n):
                for x in range(n):
                    xy[i][j] += matrix1[i][x] * matrix2[x][j]
    else:
        A, B, C, D = submatrices(n, matrix1)  # divide the original matrix1
        E, F, G, H = submatrices(n, matrix2)  # divide the original matrix2
        n = int(n / 2)  # the matrix size is changed now
        p1 = strassen(n, A, subtraction(n, F, H))
        p2 = strassen(n, addition(n, A, B), H)
        p3 = strassen(n, addition(n, C, D), E)
        p4 = strassen(n, D, subtraction(n, G, E))
        p5 = strassen(n, addition(n, A, D), addition(n, E, H))
        p6 = strassen(n, subtraction(n, B, D), addition(n, G, H))
        p7 = strassen(n, subtraction(n, A, C), addition(n, E, F))
        xy1 = addition(n, addition(n, p5, p6), subtraction(n, p4, p2))  # making new blocks of the matrix
        xy2 = addition(n, p1, p2)
        xy3 = addition(n, p3, p4)
        xy4 = subtraction(n, addition(n, p1, p5), addition(n, p3, p7))
        xy = [xy1[i] + xy2[i] for i in range(n)] + [xy3[i] + xy4[i] for i in range(n)]  # assembling the matrix from blocks
    return xy
t_start = perf_counter()  # time only the multiplication itself
result = strassen(n, matrix1, matrix2)
print(f'Time: {perf_counter() - t_start} sec')
# printing result
for row in result:
    print(*row)
Standard multiplication
def multiply(n, matrix1, matrix2):
    res = [[0] * n for i in range(n)]
    for i in range(n):
        for j in range(n):
            for x in range(n):
                res[i][j] += matrix1[i][x] * matrix2[x][j]
    return res
t_start = perf_counter()  # time only the multiplication itself
result = multiply(n, matrix1, matrix2)
print(f'Time: {perf_counter() - t_start} sec')
for row in result:
    print(*row)

Both are horribly inefficient (certainly at least 3-4 orders of magnitude slower than optimized implementations). The standard implementation of Python is the CPython interpreter, which is clearly not designed for this kind of computation. It is mainly meant to execute glue code calling C functions such as those of BLAS libraries. In practice, accessing a list with lst[i][j] causes many function calls, memory indirections, object allocations and destructions, etc. All these overheads are pretty huge compared to the same operation in a natively compiled language (like C/C++), and they are also hard to track without a good understanding of the interpreter (another Python implementation will certainly give completely different results). One issue with the Strassen implementation is recursion: recursion is pretty slow with CPython (you can write a naive recursive Fibonacci implementation to easily check that). Another big issue is the creation of the many temporary lists of lists, which should be very slow too (a lot of objects need to be allocated).
If you want to compare such computationally intensive algorithms, please use a natively compiled language. The outcome of such a comparison in Python will be irrelevant in other languages and very dependent on the details of the target Python implementation. Note that Strassen tends to be efficient only for huge matrices (generally much bigger than 1024x1024).
Besides, a first step could be to use NumPy to create views of sub-matrices with a much smaller overhead. The operations on matrices should also be far faster and even simpler to implement.
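As a rough sketch of that last suggestion, assuming NumPy is acceptable after all: the quadrants become slice views instead of copied lists, and the recursion stops at a cutoff where it falls back to NumPy's own product. The `leaf` parameter is a hypothetical name and its default is untuned; the function also assumes square inputs whose size is a power of two, as in the question.

```python
import numpy as np

def strassen(a, b, leaf=64):
    """Strassen product of two square matrices (size assumed to be a power of two)."""
    n = a.shape[0]
    if n <= leaf:                      # small blocks: use the ordinary product
        return a @ b
    h = n // 2
    # Slicing a NumPy array yields views, so no elements are copied here.
    a11, a12, a21, a22 = a[:h, :h], a[:h, h:], a[h:, :h], a[h:, h:]
    b11, b12, b21, b22 = b[:h, :h], b[:h, h:], b[h:, :h], b[h:, h:]
    # Same seven products as p1..p7 in the question.
    p1 = strassen(a11, b12 - b22, leaf)
    p2 = strassen(a11 + a12, b22, leaf)
    p3 = strassen(a21 + a22, b11, leaf)
    p4 = strassen(a22, b21 - b11, leaf)
    p5 = strassen(a11 + a22, b11 + b22, leaf)
    p6 = strassen(a12 - a22, b21 + b22, leaf)
    p7 = strassen(a11 - a21, b11 + b12, leaf)
    c = np.empty((n, n), dtype=np.result_type(a, b))
    c[:h, :h] = p5 + p4 - p2 + p6
    c[:h, h:] = p1 + p2
    c[h:, :h] = p3 + p4
    c[h:, h:] = p1 + p5 - p3 - p7
    return c
```

Whether this ever beats a plain `a @ b` depends on the matrix size and the BLAS library behind NumPy, so it is worth timing both before drawing conclusions.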

Related

How can I enforce PBC when simulating a 2D Reaction-Diffusion system?

I'm having some problems attempting to implement periodic boundary conditions (PBC) on a reaction-diffusion system simulated in Python using 2D NumPy arrays. I'll try to explain using pseudocode and attach code showing how I'm currently handling the boundaries.
import numpy as np
N = 100
# I define a 2D array for each of my species in the reaction-diffusion system
a = np.zeros((N, N), dtype=np.float64)
b = np.zeros((N, N), dtype=np.float64)
.
.
.
n = np.zeros((N, N), dtype=np.float64)
# And also copies to update at each time step
a_copy = np.zeros((N, N), dtype=np.float64)
b_copy = np.zeros((N, N), dtype=np.float64)
.
.
.
n_copy = np.zeros((N, N), dtype=np.float64)
# I calculate the laplacian using the following function
@jit(nopython=True, fastmath=True)
def laplacian_numba(field, dh2_inv, out):
    """
    Compute the laplacian of an array using a 5 point stencil.
    """
    for i in range(1, N - 1):
        for j in range(1, N - 1):
            out[i, j] = (
                field[i + 1, j]
                + field[i - 1, j]
                + field[i, j + 1]
                + field[i, j - 1]
                - 4 * field[i, j]
            ) * dh2_inv
    return out
# I have a main loop to update the functions using an explicit method
@jit(nopython=True, fastmath=True)
def update(a, b, ..., n):
    # Compute the laplacian of only the diffusing variables
    laplacian_numba(a, dh2_inv, out=lap_a)
    laplacian_numba(b, dh2_inv, out=lap_b)
    # Update the copy arrays
    a_copy = a[1:-1, 1:-1] + dt * (ODE stuff + lap_a * a[1:-1, 1:-1])
    ...
    # Finally I enforce the boundary conditions on the system
    # If I'm not mistaken, these are reflecting boundary conditions, not periodic
    # And this is where I'm lost as to how to implement the periodicity
    a_copy[0,:] = a_copy[1,:]
    a_copy[-1,:] = a_copy[-2,:]
    a_copy[:,0] = a_copy[:,1]
    a_copy[:,-1] = a_copy[:,-2]
    # Update the previous and next timestep arrays
    a, a_copy = a_copy, a
    b, b_copy = b_copy, b
    return a, b, ..., n
The above is a very crude pseudocode of how I implemented the system, and how I update at each timestep and enforce the boundary conditions, which if I'm not mistaken are just reflecting the edges back onto the main grid. Here is my main question, what would I need to change in order to make the edges periodic and not reflecting? As I come from a biochemistry background, PDEs are not my forte, but I'm making an effort since this is a key objective in my thesis and would appreciate any help or guidance.
Thanks in advance to anyone who takes the time to read this! And I apologize for any formatting mistakes I could've made.

With periodic boundary conditions, there isn't really a boundary; your coordinate space just wraps around modulo N. So there is no need to add special first/last rows/columns at all, and no need to explicitly enforce the boundary condition either.
But you do have to make sure that all reads outside the matrix bounds are also wrapped around properly. For example, for your Laplacian you could do something like this:
@jit(nopython=True, fastmath=True)
def laplacian_numba(field, dh2_inv, out):
    """
    Compute the laplacian of an array using a 5 point stencil.
    """
    # Note the new range: 0, ..., N-1
    for i in range(N):
        for j in range(N):
            out[i, j] = (
                field[(i + 1) % N, j]
                + field[(i - 1) % N, j]
                + field[i, (j + 1) % N]
                + field[i, (j - 1) % N]
                - 4 * field[i, j]
            ) * dh2_inv
    return out
Incidentally, the same could be written in a oneliner using four np.roll calls, but I don't know if it'd be faster than your numba approach.
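For reference, here is a minimal sketch of that np.roll variant. The function name `laplacian_roll` is made up for illustration, and no claim is made about its speed relative to the numba loop.

```python
import numpy as np

def laplacian_roll(field, dh2_inv):
    """5-point Laplacian with periodic wrap-around, built from shifted copies."""
    return (
        np.roll(field, -1, axis=0)    # field[(i + 1) % N, j]
        + np.roll(field, 1, axis=0)   # field[(i - 1) % N, j]
        + np.roll(field, -1, axis=1)  # field[i, (j + 1) % N]
        + np.roll(field, 1, axis=1)   # field[i, (j - 1) % N]
        - 4 * field
    ) * dh2_inv
```

Each np.roll call allocates a full copy of the array, which is one reason the explicit loop may still be preferable inside a jitted function.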

How can I improve performance in my forward substitution method for lower triangle matrices?

I tried implementing the forward substitution method, a solving process to solve the problem Lx = b with L being a lower triangle matrix and x,b as vectors.
This was an easy task:
def tri_solve(L, b):
    n = len(b)
    x = np.zeros(n)
    x[0] = b[0] / L[0, 0]
    for i in range(1, n):
        comp = 0
        for k in range(0, i):
            index = L[i, k]
            preSolution = x[k]
            comp = comp + index * preSolution
        x[i] = 1 / L[i, i] * (b[i] - comp)
    return x
I compared my calculation times for matrices of different sizes several times with linalg.solve from the SciPy module, and it turns out that SciPy is much faster. This makes sense to some extent, since SciPy is written in C and C++, but I still expected similar or better calculation times for matrices up to 10x10. Beginning with 6x6 matrices, linalg.solve becomes slightly faster on average.
Is there a way to improve my rather simple solution?

You could try solve_triangular
If you want to accelerate your code, what you could do is to vectorize the inner loop.
def tri_solve(L, b):
    n = len(b)
    x = np.zeros(n)
    x[0] = b[0] / L[0, 0]
    for i in range(1, n):
        comp = np.sum(L[i, :i] * x[:i])
        x[i] = 1 / L[i, i] * (b[i] - comp)
    return x
Edit: How to use it
You have to pass as first argument a square lower triangular matrix and as second argument you can pass a 1D array
N = 20
A = np.tril(np.random.randn(N, N))
b = np.random.randn(N)
assert np.allclose(np.linalg.solve(A, b), tri_solve(A, b))
Of course, this is a naive implementation and is not numerically stable; you can't use it to solve very large or ill-conditioned systems.
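For the solve_triangular suggestion, a small sketch assuming SciPy is installed. The test matrix is made up for illustration, with a boosted diagonal so the system is well conditioned.

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
n = 20
# Lower-triangular test matrix; adding n to the diagonal keeps it well conditioned.
L = np.tril(rng.standard_normal((n, n))) + n * np.eye(n)
b = rng.standard_normal(n)

# lower=True tells SciPy that L is lower triangular (forward substitution).
x = solve_triangular(L, b, lower=True)
print(np.allclose(L @ x, b))
```

Unlike linalg.solve, this routine does not factorize the matrix first, since it can rely on the triangular structure.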

Numpy - create an almost zero matrix with row from other matrix

I have a square matrix A and I want to create a matrix Z whose elements are zero everywhere except for the i'th row, and the i'th row is the j'th row of matrix A.
I am aware of two ways to accomplish this. The first one is fairly straightforward and seems to be the most effective performance-wise:
def do_this(mx: np.array, i: int, j: int):
    Z = np.zeros_like(mx)
    Z[i, :] = mx[j, :]
    return Z
The other way, less straightforward and seemingly much less efficient, is to prepare a matrix ref_mx beforehand, which is a zero matrix of the same shape as A but with a 1 in its (i, j) position, and then to calculate Z as ref_mx @ A.
def do_this_other_way(mx: np.array, ref_mx: np.array):
    return ref_mx @ mx
I decided to benchmark both approaches:
from time import time
import numpy as np
n = 20
num_iters = 5000
A = np.random.rand(n, n)
i, j = 5, 10
t = time()
for _ in range(num_iters):
    Z = do_this(A, i, j)
print((time() - t) / num_iters)
ref_mx = np.zeros_like(A)
ref_mx[i, j] = 1
t = time()
for _ in range(num_iters):
    Z = do_this_other_way(A, ref_mx)
print((time() - t) / num_iters)
However, when A is relatively small (on my laptop it means that A's size is less than 40), do_this_other_way wins, and when A has size like 20, it wins by an order of magnitude.
That's it: I have doubts that I am doing it the most effective way possible in numpy. Is it possible to do it better without resorting to writing your own low-level implementation of do_this?

python multiprocessing for euclidean distance loop [duplicate]

I have two points in 3D space:
a = (ax, ay, az)
b = (bx, by, bz)
I want to calculate the distance between them:
dist = sqrt((ax-bx)^2 + (ay-by)^2 + (az-bz)^2)
How do I do this with NumPy? I have:
import numpy
a = numpy.array((ax, ay, az))
b = numpy.array((bx, by, bz))

Use numpy.linalg.norm:
dist = numpy.linalg.norm(a-b)
This works because the Euclidean distance is the l2 norm, and the default value of the ord parameter in numpy.linalg.norm (None) gives the 2-norm for vectors.
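A quick check of that equivalence, using made-up coordinates in place of the question's ax, ay, az and so on:

```python
import numpy as np

a = np.array((1.0, 2.0, 3.0))
b = np.array((4.0, 5.0, 6.0))

dist_norm = np.linalg.norm(a - b)            # 2-norm of the difference vector
dist_manual = np.sqrt(np.sum((a - b) ** 2))  # the formula from the question
print(dist_norm, dist_manual)
```

Both expressions evaluate the square root of 27 here, since each coordinate differs by 3.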
For more theory, see Introduction to Data Mining:

Use scipy.spatial.distance.euclidean:
from scipy.spatial import distance
a = (1, 2, 3)
b = (4, 5, 6)
dst = distance.euclidean(a, b)

For anyone interested in computing multiple distances at once, I've done a little comparison using perfplot (a small project of mine).
The first advice is to organize your data such that the arrays have dimension (3, n) (and are C-contiguous obviously). If adding happens in the contiguous first dimension, things are faster, and it doesn't matter too much if you use sqrt-sum with axis=0, linalg.norm with axis=0, or
a_min_b = a - b
numpy.sqrt(numpy.einsum('ij,ij->j', a_min_b, a_min_b))
which is, by a slight margin, the fastest variant. (That actually holds true for just one row as well.)
The variants where you sum up over the second axis, axis=1, are all substantially slower.
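To make the recommended layout concrete, here is a small sketch with random data (the sizes are arbitrary). It only checks that the three axis=0 variants agree; it does not reproduce the timings.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
a = rng.random((3, n))  # one point per column, C-contiguous
b = rng.random((3, n))

diff = a - b
d_einsum = np.sqrt(np.einsum("ij,ij->j", diff, diff))  # sum over the first axis
d_norm = np.linalg.norm(diff, axis=0)
d_sqrt_sum = np.sqrt(np.sum(diff ** 2, axis=0))
print(np.allclose(d_einsum, d_norm), np.allclose(d_einsum, d_sqrt_sum))
```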
Code to reproduce the plot:
import numpy
import perfplot
from scipy.spatial import distance
def linalg_norm(data):
    a, b = data[0]
    return numpy.linalg.norm(a - b, axis=1)
def linalg_norm_T(data):
    a, b = data[1]
    return numpy.linalg.norm(a - b, axis=0)
def sqrt_sum(data):
    a, b = data[0]
    return numpy.sqrt(numpy.sum((a - b) ** 2, axis=1))
def sqrt_sum_T(data):
    a, b = data[1]
    return numpy.sqrt(numpy.sum((a - b) ** 2, axis=0))
def scipy_distance(data):
    a, b = data[0]
    return list(map(distance.euclidean, a, b))
def sqrt_einsum(data):
    a, b = data[0]
    a_min_b = a - b
    return numpy.sqrt(numpy.einsum("ij,ij->i", a_min_b, a_min_b))
def sqrt_einsum_T(data):
    a, b = data[1]
    a_min_b = a - b
    return numpy.sqrt(numpy.einsum("ij,ij->j", a_min_b, a_min_b))
def setup(n):
    a = numpy.random.rand(n, 3)
    b = numpy.random.rand(n, 3)
    out0 = numpy.array([a, b])
    out1 = numpy.array([a.T, b.T])
    return out0, out1
b = perfplot.bench(
    setup=setup,
    n_range=[2 ** k for k in range(22)],
    kernels=[
        linalg_norm,
        linalg_norm_T,
        scipy_distance,
        sqrt_sum,
        sqrt_sum_T,
        sqrt_einsum,
        sqrt_einsum_T,
    ],
    xlabel="len(x), len(y)",
)
b.save("norm.png")

I want to expound on the simple answer with various performance notes. np.linalg.norm will do perhaps more than you need:
dist = numpy.linalg.norm(a-b)
Firstly - this function is designed to work over an array of points and return all of the values, e.g. to compare the distance from pA to the set of points sP:
sP = np.array(points)  # one point per row
pA = point
distances = np.linalg.norm(sP - pA, ord=2, axis=1)  # 'distances' is an array
Remember several things:
Python function calls are expensive.
[Regular] Python doesn't cache name lookups.
So
def distance(pointA, pointB):
    dist = np.linalg.norm(pointA - pointB)
    return dist
isn't as innocent as it looks.
>>> dis.dis(distance)
2 0 LOAD_GLOBAL 0 (np)
2 LOAD_ATTR 1 (linalg)
4 LOAD_ATTR 2 (norm)
6 LOAD_FAST 0 (pointA)
8 LOAD_FAST 1 (pointB)
10 BINARY_SUBTRACT
12 CALL_FUNCTION 1
14 STORE_FAST 2 (dist)
3 16 LOAD_FAST 2 (dist)
18 RETURN_VALUE
Firstly - every time we call it, we have to do a global lookup for "np", an attribute lookup for "linalg" and an attribute lookup for "norm", and the overhead of merely calling the function can equate to dozens of Python instructions.
Lastly, we wasted two operations storing the result and reloading it for the return...
First pass at improvement: make the lookup faster, skip the store
def distance(pointA, pointB, _norm=np.linalg.norm):
    return _norm(pointA - pointB)
We get the far more streamlined:
>>> dis.dis(distance)
2 0 LOAD_FAST 2 (_norm)
2 LOAD_FAST 0 (pointA)
4 LOAD_FAST 1 (pointB)
6 BINARY_SUBTRACT
8 CALL_FUNCTION 1
10 RETURN_VALUE
The function call overhead still amounts to some work, though. And you'll want to do benchmarks to determine whether you might be better doing the math yourself:
def distance(pointA, pointB):
    return (
        ((pointA.x - pointB.x) ** 2) +
        ((pointA.y - pointB.y) ** 2) +
        ((pointA.z - pointB.z) ** 2)
    ) ** 0.5  # fast sqrt
On some platforms, **0.5 is faster than math.sqrt. Your mileage may vary.
Advanced performance notes
Why are you calculating distance? If the sole purpose is to display it,
print("The target is %.2fm away" % (distance(a, b)))
move along. But if you're comparing distances, doing range checks, etc., I'd like to add some useful performance observations.
Let’s take two cases: sorting by distance or culling a list to items that meet a range constraint.
# Ultra naive implementations. Hold onto your hat.
def sort_things_by_distance(origin, things):
    return sorted(things, key=lambda thing: distance(origin, thing))
def in_range(origin, range, things):
    things_in_range = []
    for thing in things:
        if distance(origin, thing) <= range:
            things_in_range.append(thing)
    return things_in_range
The first thing we need to remember is that we are using Pythagoras to calculate the distance (dist = sqrt(x^2 + y^2 + z^2)) so we're making a lot of sqrt calls. Math 101:
dist = root ( x^2 + y^2 + z^2 )
:.
dist^2 = x^2 + y^2 + z^2
and
sq(N) < sq(M) iff M > N
and
sq(N) > sq(M) iff N > M
and
sq(N) = sq(M) iff N == M
In short: until we actually require the distance in a unit of X rather than X^2, we can eliminate the hardest part of the calculations.
# Still naive, but much faster.
def distance_sq(left, right):
    """ Returns the square of the distance between left and right. """
    return (
        ((left.x - right.x) ** 2) +
        ((left.y - right.y) ** 2) +
        ((left.z - right.z) ** 2)
    )
def sort_things_by_distance(origin, things):
    return sorted(things, key=lambda thing: distance_sq(origin, thing))
def in_range(origin, range, things):
    things_in_range = []
    # Remember that sqrt(N)**2 == N, so if we square
    # range, we don't need to root the distances.
    range_sq = range**2
    for thing in things:
        if distance_sq(origin, thing) <= range_sq:
            things_in_range.append(thing)
    return things_in_range
Great, both functions no longer do any expensive square roots. That'll be much faster, but before you go further, check yourself: why did sort_things_by_distance need a "naive" disclaimer both times above? Answer at the very bottom (*a1).
We can improve in_range by converting it to a generator:
def in_range(origin, range, things):
    range_sq = range**2
    yield from (thing for thing in things
                if distance_sq(origin, thing) <= range_sq)
This especially has benefits if you are doing something like:
if any(in_range(origin, max_dist, things)):
    ...
But if the very next thing you are going to do requires a distance,
for nearby in in_range(origin, walking_distance, hotdog_stands):
    print("%s %.2fm" % (nearby.name, distance(origin, nearby)))
consider yielding tuples:
def in_range_with_dist_sq(origin, range, things):
    range_sq = range**2
    for thing in things:
        dist_sq = distance_sq(origin, thing)
        if dist_sq <= range_sq: yield (thing, dist_sq)
This can be especially useful if you might chain range checks ('find things that are near X and within Nm of Y', since you don't have to calculate the distance again).
But what about if we're searching a really large list of things and we anticipate a lot of them not being worth consideration?
There is actually a very simple optimization:
def in_range_all_the_things(origin, range, things):
    range_sq = range**2
    for thing in things:
        dist_sq = (origin.x - thing.x) ** 2
        if dist_sq <= range_sq:
            dist_sq += (origin.y - thing.y) ** 2
            if dist_sq <= range_sq:
                dist_sq += (origin.z - thing.z) ** 2
                if dist_sq <= range_sq:
                    yield thing
Whether this is useful will depend on the size of 'things'.
def in_range_all_the_things(origin, range, things):
    range_sq = range**2
    if len(things) >= 4096:
        for thing in things:
            dist_sq = (origin.x - thing.x) ** 2
            if dist_sq <= range_sq:
                dist_sq += (origin.y - thing.y) ** 2
                if dist_sq <= range_sq:
                    dist_sq += (origin.z - thing.z) ** 2
                    if dist_sq <= range_sq:
                        yield thing
    elif len(things) > 32:
        for thing in things:
            dist_sq = (origin.x - thing.x) ** 2
            if dist_sq <= range_sq:
                dist_sq += (origin.y - thing.y) ** 2 + (origin.z - thing.z) ** 2
                if dist_sq <= range_sq:
                    yield thing
    else:
        ... just calculate distance and range-check it ...
And again, consider yielding the dist_sq. Our hotdog example then becomes:
# Chaining generators
info = in_range_with_dist_sq(origin, walking_distance, hotdog_stands)
info = ((stand, dist_sq**0.5) for stand, dist_sq in info)
for stand, dist in info:
    print("%s %.2fm" % (stand, dist))
(*a1: sort_things_by_distance's sort key calls distance_sq for every single item, and that innocent looking key is a lambda, which is a second function that has to be invoked...)
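The squared-distance trick carries over to NumPy arrays as well. The sketch below is not part of the answer above: it assumes the points are stored as rows of an (n, 3) array rather than as objects with x, y, z attributes, and the name `in_range_mask` is hypothetical.

```python
import numpy as np

def in_range_mask(origin, max_dist, points):
    """Boolean mask of the rows of `points` within `max_dist` of `origin`."""
    diff = points - origin
    dist_sq = np.einsum("ij,ij->i", diff, diff)  # squared distances, no sqrt
    return dist_sq <= max_dist ** 2

origin = np.array([0.0, 0.0, 0.0])
points = np.array([[1.0, 0.0, 0.0], [3.0, 4.0, 0.0], [0.0, 0.0, 10.0]])
mask = in_range_mask(origin, 5.0, points)
nearby = points[mask]  # the rows that passed the range check
print(nearby)
```

Here the whole range check runs in a handful of array operations, so the per-call and per-lookup overheads discussed above are paid once instead of once per point.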

Another instance of this problem solving method:
def dist(x, y):
    return numpy.sqrt(numpy.sum((x - y) ** 2))
a = numpy.array((xa,ya,za))
b = numpy.array((xb,yb,zb))
dist_a_b = dist(a,b)

Starting Python 3.8, the math module directly provides the dist function, which returns the euclidean distance between two points (given as tuples or lists of coordinates):
from math import dist
dist((1, 2, 6), (-2, 3, 2)) # 5.0990195135927845
And if you're working with lists:
dist([1, 2, 6], [-2, 3, 2]) # 5.0990195135927845

It can be done like the following. I don't know how fast it is, but it's not using NumPy.
from math import sqrt
a = (1, 2, 3) # Data point 1
b = (4, 5, 6) # Data point 2
print(sqrt(sum((x - y) ** 2 for x, y in zip(a, b))))

A nice one-liner:
dist = numpy.linalg.norm(a-b)
However, if speed is a concern I would recommend experimenting on your machine. I've found that using math library's sqrt with the ** operator for the square is much faster on my machine than the one-liner NumPy solution.
I ran my tests using this simple program:
#!/usr/bin/python
import math
import numpy
from random import uniform
def fastest_calc_dist(p1, p2):
    return math.sqrt((p2[0] - p1[0]) ** 2 +
                     (p2[1] - p1[1]) ** 2 +
                     (p2[2] - p1[2]) ** 2)
def math_calc_dist(p1, p2):
    return math.sqrt(math.pow((p2[0] - p1[0]), 2) +
                     math.pow((p2[1] - p1[1]), 2) +
                     math.pow((p2[2] - p1[2]), 2))
def numpy_calc_dist(p1, p2):
    return numpy.linalg.norm(numpy.array(p1) - numpy.array(p2))
TOTAL_LOCATIONS = 1000
p1 = dict()
p2 = dict()
for i in range(0, TOTAL_LOCATIONS):
    p1[i] = (uniform(0, 1000), uniform(0, 1000), uniform(0, 1000))
    p2[i] = (uniform(0, 1000), uniform(0, 1000), uniform(0, 1000))
total_dist = 0
for i in range(0, TOTAL_LOCATIONS):
    for j in range(0, TOTAL_LOCATIONS):
        dist = fastest_calc_dist(p1[i], p2[j])  # change this line for testing
        total_dist += dist
print(total_dist)
On my machine, math_calc_dist runs much faster than numpy_calc_dist: 1.5 seconds versus 23.5 seconds.
To get a measurable difference between fastest_calc_dist and math_calc_dist I had to up TOTAL_LOCATIONS to 6000. Then fastest_calc_dist takes ~50 seconds while math_calc_dist takes ~60 seconds.
You can also experiment with numpy.sqrt and numpy.square though both were slower than the math alternatives on my machine.
My tests were run with Python 2.6.6.

I find a 'dist' function in matplotlib.mlab, but I don't think it's handy enough.
I'm posting it here just for reference.
import numpy as np
import matplotlib as plt
a = np.array([1, 2, 3])
b = np.array([2, 3, 4])
# Distance between a and b
dis = plt.mlab.dist(a, b)

You can just subtract the vectors and then innerproduct.
Following your example,
a = numpy.array((xa, ya, za))
b = numpy.array((xb, yb, zb))
tmp = a - b
sum_squared = numpy.dot(tmp.T, tmp)
result = numpy.sqrt(sum_squared)

I like np.dot (dot product):
a = np.array((xa, ya, za))
b = np.array((xb, yb, zb))
distance = np.dot(a - b, a - b) ** 0.5

With Python 3.8, it's very easy.
https://docs.python.org/3/library/math.html#math.dist
math.dist(p, q)
Return the Euclidean distance between two points p and q, each given
as a sequence (or iterable) of coordinates. The two points must have
the same dimension.
Roughly equivalent to:
sqrt(sum((px - qx) ** 2.0 for px, qx in zip(p, q)))

Having a and b as you defined them, you can use also:
distance = np.sqrt(np.sum((a-b)**2))

Since Python 3.8
Since Python 3.8 the math module includes the function math.dist().
See here https://docs.python.org/3.8/library/math.html#math.dist.
math.dist(p1, p2)
Return the Euclidean distance between two points p1 and p2,
each given as a sequence (or iterable) of coordinates.
import math
print( math.dist( (0,0), (1,1) )) # sqrt(2) -> 1.4142
print( math.dist( (0,0,0), (1,1,1) )) # sqrt(3) -> 1.7321

Here's some concise code for Euclidean distance in Python given two points represented as lists in Python.
def distance(v1, v2):
    return sum([(x - y) ** 2 for (x, y) in zip(v1, v2)]) ** 0.5

import math
dist = math.hypot(math.hypot(xa-xb, ya-yb), za-zb)

Calculate the Euclidean distance for multidimensional space:
import math
x = [1, 2, 6]
y = [-2, 3, 2]
dist = math.sqrt(sum([(xi-yi)**2 for xi,yi in zip(x, y)]))
5.0990195135927845

import numpy as np
from scipy.spatial import distance
input_arr = np.array([[0,3,0],[2,0,0],[0,1,3],[0,1,2],[-1,0,1],[1,1,1]])
test_case = np.array([0,0,0])
dst=[]
for i in range(0, 6):
    temp = distance.euclidean(test_case, input_arr[i])
    dst.append(temp)
print(dst)

You can easily use the formula
distance = np.sqrt(np.sum(np.square(a-b)))
which does actually nothing more than using Pythagoras' theorem to calculate the distance, by adding the squares of Δx, Δy and Δz and rooting the result.

import numpy as np
# any two python array as two points
a = [0, 0]
b = [3, 4]
First convert the lists to NumPy arrays and compute: print(np.linalg.norm(np.array(a) - np.array(b))). The second method works directly on the Python lists: print(np.linalg.norm(np.subtract(a, b)))

The other answers work for floating point numbers, but do not correctly compute the distance for integer dtypes, which are subject to overflow and underflow. Note that even scipy.spatial.distance.euclidean has this issue:
>>> a1 = np.array([1], dtype='uint8')
>>> a2 = np.array([2], dtype='uint8')
>>> a1 - a2
array([255], dtype=uint8)
>>> np.linalg.norm(a1 - a2)
255.0
>>> from scipy.spatial import distance
>>> distance.euclidean(a1, a2)
255.0
This is common, since many image libraries represent an image as an ndarray with dtype="uint8". This means that if you have a greyscale image which consists of very dark grey pixels (say all the pixels have color #000001) and you're diffing it against a black image (#000000), you can end up with x-y consisting of 255 in all cells, which registers as the two images being very far apart from each other. For unsigned integer types (e.g. uint8), you can safely compute the distance in numpy as:
np.linalg.norm(np.maximum(x, y) - np.minimum(x, y))
For signed integer types, you can cast to a float first:
np.linalg.norm(x.astype("float") - y.astype("float"))
For image data specifically, you can use opencv's norm method:
import cv2
cv2.norm(x, y, cv2.NORM_L2)
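A short sketch comparing the wrapped and the safe computations on made-up uint8 data (the values are arbitrary and chosen so that one element wraps around):

```python
import numpy as np

x = np.array([1, 200, 3], dtype="uint8")
y = np.array([2, 100, 3], dtype="uint8")

wrapped = np.linalg.norm(x - y)  # x - y wraps around: 1 - 2 becomes 255
safe_unsigned = np.linalg.norm(np.maximum(x, y) - np.minimum(x, y))
safe_float = np.linalg.norm(x.astype("float") - y.astype("float"))
print(wrapped, safe_unsigned, safe_float)
```

The two safe variants agree with each other, while the wrapped result is far larger than the true distance.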

First find the difference of the two matrices. Then multiply it element-wise with itself using NumPy's multiply function, sum the elements of the result, and finally take the square root of the sum.
def findEuclideanDistance(a, b):
    euclidean_distance = a - b
    euclidean_distance = np.sum(np.multiply(euclidean_distance, euclidean_distance))
    euclidean_distance = np.sqrt(euclidean_distance)
    return euclidean_distance

What's the best way to do this with NumPy, or with Python in general?
Well, the best way would be the safest and also the fastest.
I would suggest using hypot for reliable results, since the chances of underflow and overflow are much smaller than when writing your own square-root calculation.
Let's compare math.hypot (which accepts more than two arguments from Python 3.8 on) and np.hypot against vanilla np.sqrt(np.sum((np.array([i, j, k])) ** 2))
i, j, k = 1e+200, 1e+200, 1e+200
math.hypot(i, j, k)
# 1.7320508075688773e+200
np.sqrt(np.sum((np.array([i, j, k])) ** 2))
# RuntimeWarning: overflow encountered in square
Speed-wise, math.hypot looks better
%%timeit
math.hypot(i, j, k)
# 100 ns ± 1.05 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%%timeit
np.sqrt(np.sum((np.array([i, j, k])) ** 2))
# 6.41 µs ± 33.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Underflow
i, j = 1e-200, 1e-200
np.sqrt(i**2+j**2)
# 0.0
Overflow
i, j = 1e+200, 1e+200
np.sqrt(i**2+j**2)
# inf
No Underflow
i, j = 1e-200, 1e-200
np.hypot(i, j)
# 1.414213562373095e-200
No Overflow
i, j = 1e+200, 1e+200
np.hypot(i, j)
# 1.414213562373095e+200
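The same contrast can be reproduced with the standard library alone. This sketch uses plain multiplication for the naive version, because float multiplication quietly overflows to inf whereas 1e200 ** 2 raises OverflowError in pure Python.

```python
import math

big, tiny = 1e200, 1e-200

naive_big = math.sqrt(big * big + big * big)       # big * big overflows to inf
naive_tiny = math.sqrt(tiny * tiny + tiny * tiny)  # tiny * tiny underflows to 0.0
safe_big = math.hypot(big, big)                    # stays finite
safe_tiny = math.hypot(tiny, tiny)                 # stays non-zero
print(naive_big, naive_tiny, safe_big, safe_tiny)
```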

The fastest solution I could come up with for a large number of distances is using numexpr. On my machine it is faster than using numpy einsum:
import numexpr as ne
import numpy as np
a_min_b = a - b  # a and b are assumed to be (n, 3) arrays of points
np.sqrt(ne.evaluate("sum((a_min_b)**2,axis=1)"))

If you want something more explicit you can easily write the formula like this:
np.sqrt(np.sum((a-b)**2))
Even with arrays of 10_000_000 elements this still runs at 0.1s on my machine.

Using combinations or another trick to iterate though 3 different arrays?

consider my code
a,b,c = np.loadtxt ('test.dat', dtype='double', unpack=True)
a,b, and c are the same array length.
for i in range(len(a)):
    q[i] = 3*10**5*c[i]/100
    x[i] = q[i]*math.sin(a)*math.cos(b)
    y[i] = q[i]*math.sin(a)*math.sin(b)
    z[i] = q[i]*math.cos(a)
I am trying to find all the combinations for the difference between 2 points in x,y,z to iterate this equation (xi-xj)+(yi-yj)+(zi-zj) = r
I use this combination code
for combinations in it.combinations(x,2):
xdist = (combinations[0] - combinations[1])
for combinations in it.combinations(y,2):
ydist = (combinations[0] - combinations[1])
for combinations in it.combinations(z,2):
zdist = (combinations[0] - combinations[1])
r = (xdist + ydist +zdist)
This takes a long time for python for a large file I have and I am wondering if there is a faster way to get my array for r preferably using a nested loop?
Such as
if i in range(?):
if j in range(?):

Since you're apparently using numpy, let's actually use numpy; it'll be much faster. It's almost always faster and usually easier to read if you avoid python loops entirely when working with numpy, and use its vectorized array operations instead.
a, b, c = np.loadtxt('test.dat', dtype='double', unpack=True)
q = 3e5 * c / 100 # why not just 3e3 * c?
x = q * np.sin(a) * np.cos(b)
y = q * np.sin(a) * np.sin(b)
z = q * np.cos(a)
Now, your example code after this doesn't do what you probably want it to do - notice how you just say xdist = ... each time? You're overwriting that variable and not doing anything with it. I'm going to assume you want the squared euclidean distance between each pair of points, though, and make a matrix dists with dists[i, j] equal to the distance between the ith and jth points.
The easy way, if you have scipy available:
# stack the points into a num_pts x 3 matrix
pts = np.hstack([thing.reshape((-1, 1)) for thing in (x, y, z)])
# get squared euclidean distances in a matrix
dists = scipy.spatial.distance.squareform(scipy.spatial.distance.pdist(pts, 'sqeuclidean'))
If your list is enormous, it's more memory-efficient to not use squareform, but then it's in a condensed format that's a little harder to find specific pairs of distances with.
Slightly harder, if you can't / don't want to use scipy:
pts = np.hstack([thing.reshape((-1, 1)) for thing in (x, y, z)])
sqnorms = np.sum(pts ** 2, axis=1)
dists = sqnorms.reshape((-1, 1)) - 2 * np.dot(pts, pts.T) + sqnorms
which basically implements the formula (a - b)^2 = a^2 - 2 a b + b^2, but all vector-like.
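As a sanity check of that expansion, here is a tiny sketch on three made-up points, compared against a brute-force double loop:

```python
import numpy as np

pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 3.0, 4.0]])

# Vectorized: |p_i|^2 - 2 p_i . p_j + |p_j|^2, broadcast into an n x n matrix.
sqnorms = np.sum(pts ** 2, axis=1)
dists = sqnorms.reshape((-1, 1)) - 2 * np.dot(pts, pts.T) + sqnorms

# Brute-force reference for comparison.
n = len(pts)
brute = np.array([[np.sum((pts[i] - pts[j]) ** 2) for j in range(n)]
                  for i in range(n)])
print(np.allclose(dists, brute))
```

With less friendly inputs the expanded form can produce tiny negative values from rounding, so clipping at zero before taking square roots is a common precaution.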

Apologies for not posting a full solution, but in Python 2 you should avoid nesting calls to range(), as it builds a new list every time it gets called (in Python 3, range() returns a lazy range object, so this concern no longer applies). You are better off either calling range() once and storing the result, or using a loop counter instead.
For example, instead of:
max = 50
for number in range(0, max):
    doSomething(number)
...you would do:
max = 50
current = 0
while current < max:
    doSomething(current)
    current += 1

Well, the complexity of your calculation is pretty high. Also, you need to have huge amounts of memory if you want to store all r values in a single list. Often, you don't need a list and a generator might be enough for what you want to do with the values.
Consider this code:
def calculate(x, y, z):
    for xi, xj in combinations(x, 2):
        for yi, yj in combinations(y, 2):
            for zi, zj in combinations(z, 2):
                yield (xi - xj) + (yi - yj) + (zi - zj)
This returns a generator that computes only one value each time you call the generator's next() method.
gen = calculate(xrange(10), xrange(10, 20), xrange(20, 30))
gen.next() # returns -3
gen.next() # returns -4 and so on
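The snippet above is written for Python 2 (xrange and gen.next()). A Python 3 version of the same generator might look like this, with range and the built-in next() as the only changes:

```python
from itertools import combinations

def calculate(x, y, z):
    # Lazily yield one value per combination of pairs from x, y and z.
    for xi, xj in combinations(x, 2):
        for yi, yj in combinations(y, 2):
            for zi, zj in combinations(z, 2):
                yield (xi - xj) + (yi - yj) + (zi - zj)

gen = calculate(range(10), range(10, 20), range(20, 30))
first = next(gen)   # (0 - 1) + (10 - 11) + (20 - 21)
second = next(gen)  # (0 - 1) + (10 - 11) + (20 - 22)
print(first, second)
```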
