I'm working on a tight-binding model for graphene using PythTB. I want to incorporate spinful elements in the calculation. The Hamiltonian for the Rashba hopping terms has the Pauli spin matrix vector crossed with the site hopping vector.
Initially I created a list of matrices and crossed that with the vector. Unfortunately this did not yield the correct result (I think that after the vector cross product was taken, the cross product of the matrices was taken as well).
Next, I declared 3 symbols 's_x', 's_y', and 's_z' and used those instead of the matrices in my pauli spin matrix vector. After taking the cross product I received the correct result. The problem I am having is that I cannot substitute a matrix into the variable symbols I added in. Is it possible to do this? Or will I need to take the cross product manually?
Here is some of my code:
from __future__ import print_function
from pythtb import * # import TB model class
from sympy import symbols
import numpy as np
import matplotlib.pyplot as plt
# create list of pauli spin matrices
sx = [[0., 1.],[1., 0.]]
sy = [[0., -1.j],[1.j, 0.]]
sz = [[1., 0.],[0., -1.]]
Id = [[1., 0.], [0., 1.]]
s_pauli = np.zeros((4, 2, 2), dtype=complex)
s_pauli = [Id, sx, sy, sz]
# create s_pauli without identity matrix
s_pau = np.zeros((3, 2, 2), dtype=complex)
s_pau = [ s_x, s_y, s_z]
ab00 = [ 0.5, 0.28867513, 0.]
sig_x_ab00 = np.cross( s_pau, ab00)
If I print sig_x_ab00[2] (which is the only one I'm currently interested in), then I get:
0.288675134594813*s_x - 0.5*s_y
After obtaining that, I wanted to substitute s_pauli[1] for s_x and s_pauli[2] for s_y by doing the following command:
sig_x_ab00_ = sig_x_ab00.subs(s_x, s_pauli[1])
And I get the following error output:
AttributeError: 'numpy.ndarray' object has no attribute 'subs'
Is what I am doing at all valid? Or is there a better way to go about this?
Any input is much appreciated!
Thanks!
Let's run your code, looking at each step. Don't make assumptions.
I'm using an isympy interactive environment; that's IPython with sympy enhancements. I also imported numpy as np.
In [4]: ab00 = [ 0.5, 0.28867513, 0.]
In [5]: s_pauli
Out[5]:
[[[1.0, 0.0], [0.0, 1.0]],
[[0.0, 1.0], [1.0, 0.0]],
[[0.0, (-0-1j)], [1j, 0.0]],
[[1.0, 0.0], [0.0, -1.0]]]
This is a list. The previous np.zeros(...) expression does nothing. In Python we don't set the 'type' of a variable.
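A minimal sketch of the difference, assuming the goal was a complex (4, 2, 2) array: plain assignment rebinds the name to the list, while slice assignment copies the values into the preallocated array. The names rebound and filled are made up for this illustration.

```python
import numpy as np

sx = [[0., 1.], [1., 0.]]
sy = [[0., -1.j], [1.j, 0.]]
sz = [[1., 0.], [0., -1.]]
Id = [[1., 0.], [0., 1.]]

# Plain assignment: the zeros array is discarded and the name now refers to the list.
rebound = np.zeros((4, 2, 2), dtype=complex)
rebound = [Id, sx, sy, sz]
print(type(rebound))

# Slice assignment: the values are copied into the existing complex array.
filled = np.zeros((4, 2, 2), dtype=complex)
filled[:] = [Id, sx, sy, sz]
print(type(filled), filled.dtype, filled.shape)
```

If an array is what you want, np.array([Id, sx, sy, sz], dtype=complex) gets there in one step.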
We can make an array from this list:
In [6]: np.array(s_pauli)
s_pauli[1] works because it is just list indexing.
And the added symbols:
In [11]: s_x, s_y, s_z = symbols('s_x s_y s_z')
In [12]: s_x
Out[12]: sₓ
In [13]: s_pau = [ s_x, s_y, s_z]
Again, s_pau is a list, not an array. When used in cross it will be turned into an array:
In [14]: np.array(s_pau)
Out[14]: array([s_x, s_y, s_z], dtype=object)
Note that is an object dtype array, which is still very much like a list. Some basic math works, because math like multiply and add are defined for the symbols. But transcendentals like np.log and np.sin don't work on such arrays.
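The same behavior can be seen without sympy. The sketch below uses fractions.Fraction as a stand-in for the symbols (an assumption made only to keep the example dependency-free): arithmetic is delegated to each element, while np.sin has nothing to call.

```python
from fractions import Fraction

import numpy as np

# An object array simply holds references to Python objects.
obj = np.array([Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)], dtype=object)

# Multiply and add are forwarded to each element's own operators.
scaled = obj * 2 + 1
print(scaled)

# np.sin looks for a .sin() method on each element, and Fraction has none.
try:
    np.sin(obj)
    sin_worked = True
except (TypeError, AttributeError) as err:
    sin_worked = False
    print("np.sin failed:", err)
```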
cross just uses multiply and addition, so it works with these object arrays:
In [15]: sig = np.cross( s_pau, ab00)
In [16]: sig
Out[16]: array([-0.28867513*s_z, 0.5*s_z, 0.28867513*s_x - 0.5*s_y], dtype=object)
sig is a numpy array. It is not a sympy expression, and does not have a subs method. Again, it pays to pay close attention to what is happening.
The elements of the array are sympy expressions:
In [17]: sig[2]
Out[17]: 0.28867513⋅sₓ - 0.5⋅s_y
In [20]: s2 = sig[2]
subs with a scalar value works:
In [22]: s2.subs(s_x, 1)
Out[22]: 0.28867513 - 0.5⋅s_y
but not with a list:
In [23]: s2.subs(s_x, s_pauli[1])
Out[23]: 0.28867513⋅sₓ - 0.5⋅s_y
However if I make sympy matrix from it:
In [24]: s_pauli[1]
Out[24]: [[0.0, 1.0], [1.0, 0.0]]
In [25]: Matrix(s_pauli[1])
Out[25]:
⎡0.0 1.0⎤
⎢ ⎥
⎣1.0 0.0⎦
In [26]: s2.subs(s_x, Out[25])
Out[26]:
⎡ 0 0.28867513⎤
-0.5⋅s_y + ⎢ ⎥
⎣0.28867513 0 ⎦
The substitution does work.
In general, mixing sympy and numpy is hit-or-miss; some things work, almost more by accident than by design. Others don't. sympy.lambdify is the most reliable way of making a function that will work with numpy arrays.
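As a rough sketch of the lambdify route, assuming the expression is the sig[2] shown earlier: the expression is linear in the symbols, so passing 2x2 numpy arrays in place of the symbols gives the intended matrix. For expressions containing products of symbols this would compute element-wise products, not matrix products, so treat it as an illustration for this specific case.

```python
import numpy as np
from sympy import lambdify, symbols

s_x, s_y, s_z = symbols('s_x s_y s_z')
expr = 0.28867513*s_x - 0.5*s_y  # same form as sig[2] above

# Pauli matrices as numpy arrays
sx = np.array([[0., 1.], [1., 0.]], dtype=complex)
sy = np.array([[0., -1.j], [1.j, 0.]])
sz = np.array([[1., 0.], [0., -1.]], dtype=complex)

# Build a numpy-aware function of the three symbols and call it with arrays.
f = lambdify((s_x, s_y, s_z), expr, 'numpy')
result = f(sx, sy, sz)
print(result)
```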
In this case I suspect you'd be better off using a sympy version of cross, and doing the sympy.Matrix substitutions.
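If the symbolic step is not needed at all, another option (swapping sympy for plain numpy, which is not what the paragraph above proposes) is to write the cross product out by components, with each component being a 2x2 matrix. The helper name pauli_cross is hypothetical.

```python
import numpy as np

sx = np.array([[0., 1.], [1., 0.]], dtype=complex)
sy = np.array([[0., -1.j], [1.j, 0.]])
sz = np.array([[1., 0.], [0., -1.]], dtype=complex)

def pauli_cross(d):
    """Return (sigma x d) as an array of shape (3, 2, 2) for a 3-vector d."""
    dx, dy, dz = d
    return np.array([
        sy * dz - sz * dy,  # x component
        sz * dx - sx * dz,  # y component
        sx * dy - sy * dx,  # z component
    ])

ab00 = [0.5, 0.28867513, 0.]
sig_x_ab00 = pauli_cross(ab00)
print(sig_x_ab00[2])  # 0.28867513*sx - 0.5*sy as a 2x2 matrix
```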
System
OS: Windows 10 (x64), Build 1909
Python Version: 3.8.10
Numpy Version: 1.21.2
Question
Given two 2D (N, 3) Numpy arrays of (x, y, z) floating-point data points, what is the Pythonic (vectorized) way to find the indices in one array where points are equal to the points in the other array?
(NOTE: My question differs in that I need this to work with real-world data sets where the two data sets may differ by floating point error. Please read on below for details.)
History
Very similar questions have been asked many times:
how to find indices of a 2d numpy array occuring in another 2d array
test for membership in a 2d numpy array
Get indices of intersecting rows of Numpy 2d Array
Find indices of rows of numpy 2d array in another 2D array
Indices of intersecting rows of Numpy 2d Array
Find indices of rows of numpy 2d array in another 2D array
Previous Attempts
SO Post 1 provides a working list comprehension solution, but I am looking for a solution that will scale well to large data sets (i.e. millions of points):
Code 1:
import numpy as np
if __name__ == "__main__":
    big_array = np.array(
        [
            [1.0, 2.0, 1.2],
            [5.0, 3.0, 0.12],
            [-1.0, 14.0, 0.0],
            [-9.0, 0.0, 13.0],
        ]
    )
    small_array = np.array(
        [
            [5.0, 3.0, 0.12],
            [-9.0, 0.0, 13.0],
        ]
    )
    inds = [
        ndx
        for ndx, barr in enumerate(big_array)
        for sarr in small_array
        if all(sarr == barr)
    ]
    print(inds)
Output 1:
[1, 3]
Attempting the solution of SO Post 3 (similar to SO Post 2), but using floats does not work (and I suspect something using np.isclose will be needed):
Code 3:
import numpy as np
if __name__ == "__main__":
    big_array = np.array(
        [
            [1.0, 2.0, 1.2],
            [5.0, 3.0, 0.12],
            [-1.0, 14.0, 0.0],
            [-9.0, 0.0, 13.0],
        ],
        dtype=float,
    )
    small_array = np.array(
        [
            [5.0, 3.0, 0.12],
            [-9.0, 0.0, 13.0],
        ],
        dtype=float,
    )
    inds = np.nonzero(
        np.in1d(big_array.view("f,f").reshape(-1), small_array.view("f,f").reshape(-1))
    )[0]
    print(inds)
Output 3:
[ 3 4 5 8 9 10 11]
My Attempt
I have tried numpy.isin with np.all and np.argwhere
inds = np.argwhere(np.all(np.isin(big_array, small_array), axis=1)).reshape(-1)
which works (and is, I argue, much more readable and understandable; i.e. pythonic), but will not work for real-world data sets containing floating-point errors:
import numpy as np
if __name__ == "__main__":
    big_array = np.array(
        [
            [1.0, 2.0, 1.2],
            [5.0, 3.0, 0.12],
            [-1.0, 14.0, 0.0],
            [-9.0, 0.0, 13.0],
        ],
        dtype=float,
    )
    small_array = np.array(
        [
            [5.0, 3.0, 0.12],
            [-9.0, 0.0, 13.0],
        ],
        dtype=float,
    )
    small_array_fpe = np.array(
        [
            [5.0 + 1e-9, 3.0 + 1e-9, 0.12 + 1e-9],
            [-9.0 + 1e-9, 0.0 + 1e-9, 13.0 + 1e-9],
        ],
        dtype=float,
    )
    inds_no_fpe = np.argwhere(np.all(np.isin(big_array, small_array), axis=1)).reshape(-1)
    inds_with_fpe = np.argwhere(
        np.all(np.isin(big_array, small_array_fpe), axis=1)
    ).reshape(-1)
    print(f"No Floating Point Error: {inds_no_fpe}")
    print(f"With Floating Point Error: {inds_with_fpe}")
    print(f"Are 5.0 and 5.0+1e-9 close?: {np.isclose(5.0, 5.0 + 1e-9)}")
Output:
No Floating Point Error: [1 3]
With Floating Point Error: []
Are 5.0 and 5.0+1e-9 close?: True
How can I make my above solution work (on data sets with floating point error) by incorporating np.isclose? Alternative solutions are welcome.
NOTE: Since small_array is a subset of big_array, using np.isclose directly doesn't work because the shapes won't broadcast:
np.isclose(big_array, small_array_fpe)
yields
ValueError: operands could not be broadcast together with shapes (4,3) (2,3)
Update
Currently, the only working solution I have is
inds_with_fpe = [
    ndx
    for ndx, barr in enumerate(big_array)
    for sarr in small_array_fpe
    if np.all(np.isclose(sarr, barr))
]
As @Michael Anderson already mentioned, this can be implemented using a kd-tree. In comparison to your answer, this solution uses an absolute error. Whether this is acceptable depends on the problem.
Example
import numpy as np
from scipy import spatial
def find_nearest(big_array, small_array, tolerance):
    tree_big = spatial.cKDTree(big_array)
    tree_small = spatial.cKDTree(small_array)
    return tree_small.query_ball_tree(tree_big, r=tolerance)
Timings
big_array=np.random.rand(100_000,3)
small_array=np.random.rand(1_000,3)
big_array[1000:2000]=small_array
%timeit find_nearest(big_array,small_array,1e-9) #find all pairs within a distance of 1e-9
#55.7 ms ± 830 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#A. Hendry
%timeit np.argwhere(np.isclose(small_array, big_array[:, None, :]).all(-1).any(-1)).reshape(-1)
#3.24 s ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
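The kd-tree query above returns one list of neighbor indices per point of small_array rather than a flat index array. A minimal sketch of getting the same kind of output as the np.argwhere version, assuming a Euclidean absolute tolerance is acceptable (the function name matching_indices is made up here):

```python
import numpy as np
from scipy import spatial

def matching_indices(big_array, small_array, tolerance):
    """Sorted indices of rows in big_array within `tolerance` of any row in small_array."""
    tree_big = spatial.cKDTree(big_array)
    # One list of big_array indices per query point in small_array.
    hits = tree_big.query_ball_point(small_array, r=tolerance)
    flat = [i for per_point in hits for i in per_point]
    return np.unique(np.array(flat, dtype=int))

big_array = np.array([[1.0, 2.0, 1.2], [5.0, 3.0, 0.12],
                      [-1.0, 14.0, 0.0], [-9.0, 0.0, 13.0]])
small_array_fpe = np.array([[5.0, 3.0, 0.12], [-9.0, 0.0, 13.0]]) + 1e-9

inds = matching_indices(big_array, small_array_fpe, 1e-8)
print(inds)
```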
I'm not going to give any code, but I've dealt with problems similar to this on a large scale. I suspect that to get decent performance with either of these approaches you'll need to implement the core in C (you might get away with using numba).
If both your arrays are huge there are a few approaches that can work.
Primarily these boil down to building a structure that can be used to find the nearest neighbor of a point from one of the arrays, and then querying it for each point in the other data set.
To do this I've previously used a Kd Tree approach, and a grid based approach.
The basis of the grid based approach is
find the 3D extents of your first array.
split this region into L*N*M bins.
For each input point in the second array, find its bin. Any point that matches it will be in that bin.
The edge cases you need to handle are
if the point falls on the edge of a bin, or close enough to the boundary of a bin that points considered equal to it might fall in the other bin - then you need to search more than one bin for its "equal".
if the point falls outside all the bins, but close to the edge, points "equal" to it might fall in a nearby bin.
The downside is that this is bad for data that is not uniformly distributed.
The upside is that it is relatively simple. Expected run time for uniform data is n1 * n2 / (L*N*M) (compared to n1*n2). Typically you select L,N,M such that this becomes O(n log(n)). You also get some further uplift from sorting the second array to improve reuse of the bins. It is also relatively easy to parallelize (both the binning and searching)
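The grid idea can be sketched in pure Python plus numpy as follows. This is only an illustration of the binning and neighbor-bin search, not the C implementation alluded to above, and it will be slow for large inputs. The function name grid_match, the bins_per_axis parameter, and the per-axis absolute tolerance are all assumptions of this sketch. The cell size is never smaller than the tolerance, so searching the 27 surrounding bins covers both edge cases.

```python
import itertools
from collections import defaultdict

import numpy as np

def grid_match(big, small, tol, bins_per_axis=64):
    """Indices of rows in `big` that lie within `tol` (per axis) of some row in `small`."""
    lo = big.min(axis=0)
    extent = big.max(axis=0) - lo
    # Per-axis cell size; at least tol so a match is always in an adjacent bin.
    cell = np.maximum(extent / bins_per_axis, tol)

    # Bin every point of `big` by its integer cell coordinates.
    grid = defaultdict(list)
    keys = np.floor((big - lo) / cell).astype(np.int64)
    for i, key in enumerate(map(tuple, keys)):
        grid[key].append(i)

    offsets = list(itertools.product((-1, 0, 1), repeat=3))
    found = set()
    for p in small:
        base = np.floor((p - lo) / cell).astype(np.int64)
        # Check the point's own bin and its 26 neighbors.
        for off in offsets:
            for i in grid.get(tuple(base + off), ()):
                if np.all(np.abs(big[i] - p) <= tol):
                    found.add(i)
    return np.array(sorted(found), dtype=int)

big_array = np.array([[1.0, 2.0, 1.2], [5.0, 3.0, 0.12],
                      [-1.0, 14.0, 0.0], [-9.0, 0.0, 13.0]])
small_array_fpe = np.array([[5.0, 3.0, 0.12], [-9.0, 0.0, 13.0]]) + 1e-9

inds = grid_match(big_array, small_array_fpe, tol=1e-8)
print(inds)
```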
The K-d Tree approach is similar. IIRC it gives O(n log(n)) behavior, but it is trickier to implement, and the building of the data structure is tricky to parallelize. It tends not to be as cache friendly, which can mean that although its asymptotic run-time is better than the grid based approach, it can run slower in practice. However, it does give better guarantees for non-uniformly distributed data.
Credit to @AndrasDeak for this answer
The following code snippet
inds_with_fpe = np.argwhere(
    np.isclose(small_array_fpe, big_array[:, None, :]).all(-1).any(-1)
).reshape(-1)
will make the code work. The corresponding output is now:
No Floating Point Error: [1 3]
With Floating Point Error: [1 3]
Are 5.0 and 5.0+1e-9 close?: True
None in the above creates a new axis (same as np.newaxis). This changes the shape of big_array to (4, 1, 3), which adheres to broadcasting rules and permits np.isclose to run. That is, big_array is now a set of 4 1 x 3 points, and since its middle axis has length 1, small_array_fpe (shape (2, 3), treated as (1, 2, 3)) can be broadcast against it so that every point of one array is compared element-wise with every point of the other.
The result is a (4, 2, 3) boolean array; every point of big_array is compared element-wise to every point of small_array_fpe, and the components that are close (within a specific tolerance) are marked True. Since all is called as an object method rather than a numpy function, its first argument is the axis rather than the input array. Hence, -1 in the above calls means "the last axis of the array".
We first reduce the (4, 2, 3) array with all over the last axis (i.e. all (x, y, z) components are equal), which yields a 4 x 2 array. Reducing that with any over the last axis marks each row of big_array that matches some point of small_array_fpe, yielding a boolean array of shape (4,).
argwhere returns indices grouped by element, so its shape is normally (number nonzero items, num dims of input array), hence we flatten it into a 1d array with reshape(-1).
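The shapes described above can be checked step by step. This sketch just splits the one-liner into named intermediates (the names close, rows_equal and in_small are made up for illustration):

```python
import numpy as np

big_array = np.array([[1.0, 2.0, 1.2], [5.0, 3.0, 0.12],
                      [-1.0, 14.0, 0.0], [-9.0, 0.0, 13.0]])
small_array_fpe = np.array([[5.0, 3.0, 0.12], [-9.0, 0.0, 13.0]]) + 1e-9

# (4, 1, 3) against (2, 3) broadcasts to (4, 2, 3): every pair of points, per component.
close = np.isclose(small_array_fpe, big_array[:, None, :])
print(close.shape)

# all over the last axis: does a pair of points agree in x, y and z?
rows_equal = close.all(-1)
print(rows_equal.shape)

# any over the last axis: does a big_array row match some small_array_fpe row?
in_small = rows_equal.any(-1)
print(in_small.shape)

# argwhere gives shape (n_matches, 1); reshape(-1) flattens it.
inds = np.argwhere(in_small).reshape(-1)
print(inds)
```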
Unfortunately, this requires memory proportional to the product of the number of points in the two arrays, since every point of big_array is checked against every point of small_array_fpe. For example, searching for 10,000 points in a set of another 10,000 points builds intermediate arrays of shape (10000, 10000, 3). For 64-bit floating point data, each floating-point temporary of that shape requires
Memory = 10000 * 10000 * 3 * 8 bytes = 2.4 GB of RAM!
If anyone can devise a solution with a faster run time and reasonable amount of memory, that would be fantastic!
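One possible way to cap the memory, offered as a sketch rather than a tuned solution: run the same comparison on blocks of big_array so the temporaries are limited to (chunk, len(small), 3). The run time is still proportional to the product of the two sizes. The function name and the chunk parameter are assumptions of this example.

```python
import numpy as np

def close_rows_chunked(big, small, chunk=1024):
    """Indices of rows in `big` that are np.isclose to some row in `small`."""
    mask = np.empty(len(big), dtype=bool)
    for start in range(0, len(big), chunk):
        block = big[start:start + chunk]
        # Same broadcasted comparison as before, but only for this block.
        mask[start:start + chunk] = (
            np.isclose(small, block[:, None, :]).all(-1).any(-1)
        )
    return np.flatnonzero(mask)

big_array = np.array([[1.0, 2.0, 1.2], [5.0, 3.0, 0.12],
                      [-1.0, 14.0, 0.0], [-9.0, 0.0, 13.0]])
small_array_fpe = np.array([[5.0, 3.0, 0.12], [-9.0, 0.0, 13.0]]) + 1e-9

inds = close_rows_chunked(big_array, small_array_fpe, chunk=3)
print(inds)
```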
FYI:
from timeit import timeit
import numpy as np
big_array = np.array(
[
[1.0, 2.0, 1.2],
[5.0, 3.0, 0.12],
[-1.0, 14.0, 0.0],
[-9.0, 0.0, 13.0],
],
dtype=float,
)
small_array = np.array(
[
[5.0 + 1e-9, 3.0 + 1e-9, 0.12 + 1e-9],
[10.0, 2.0, 5.8],
[-9.0 + 1e-9, 0.0 + 1e-9, 13.0 + 1e-9],
],
dtype=float,
)
def approach01():
    return [
        ndx
        for ndx, barr in enumerate(big_array)
        for sarr in small_array
        if np.all(np.isclose(sarr, barr))
    ]
def approach02():
    return np.argwhere(
        np.isclose(small_array, big_array[:, None, :]).all(-1).any(-1)
    ).reshape(-1)
if __name__ == "__main__":
    time01 = timeit(
        "approach01()",
        number=10000,
        setup="import numpy as np; from __main__ import approach01",
    )
    time02 = timeit(
        "approach02()",
        number=10000,
        setup="import numpy as np; from __main__ import approach02",
    )
    print(f"Approach 1 (List Comprehension): {time01}")
    print(f"Approach 2 (Vectorized): {time02}")
Output:
Approach 1 (List Comprehension): 8.1180582
Approach 2 (Vectorized): 0.9656997
Say I have a matrix W of shape MxN and a long array of indices z of shape Mx1.
Now, assume I'd like to sum up the elements of each row in W, excluding the element at the index given for that row in z.
1-d example:
import numpy as np
W = np.array([1.0, 2.0, 8.0])
z = 2
np.sum(np.delete(W,z))
MxN example and desired output:
import numpy as np
W = np.array([[1.0,2.0,8.0], [5.0,15.0,3.0]])
z = np.array([0,2]).reshape(2,1)
# desired output
# [10. 20.]
I tried to use np.delete and axis=1 with no success
I managed to get around it using tricks like:
W = np.array([[1.0,2.0,8.0], [5.0,15.0,3.0]])
z = np.array([0,2])
W[np.arange(z.shape[0]), z]=0
print(np.sum(W, axis=1))
# [10. 20.]
but I'm wondering if there's a more elegant way.
Using broadcasting to get the mask to simulate deletion and then sum-reduce -
(W*(z != np.arange(W.shape[-1]))).sum(-1)
Sample runs -
For 2D case :
In [61]: W = np.array([[1.0,2.0,8.0], [5.0,15.0,3.0]])
...: z = np.array([0,2]).reshape(2,1)
In [62]: (W*(z != np.arange(W.shape[-1]))).sum(-1)
Out[62]: array([10., 20.])
Works just as well for the 1D case :
In [59]: W = np.array([1.0, 2.0, 8.0])
...: z = 2
In [60]: (W*(z != np.arange(W.shape[-1]))).sum(-1)
Out[60]: 3.0
With np.einsum for the sum-reduction (2D case) -
In [53]: np.einsum('ij,ij->i',W,z != np.arange(W.shape[1]))
Out[53]: array([10., 20.])
Summing and then subtracting the z-indexed values for 2D case -
In [134]: W.sum(1) - np.take_along_axis(W,z,axis=1).squeeze(1)
Out[134]: array([10., 20.])
Extend to handle both 2D and 1D cases -
W.sum(-1)-np.take_along_axis(W,np.atleast_1d(z),axis=-1).squeeze(-1)
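To see what the broadcasted comparison produces, the sketch below prints the mask and checks the result against a row-by-row np.delete. It also uses np.where instead of multiplication, which is a small variation not taken from the answer above: multiplying by the mask turns an excluded inf or nan into nan, while np.where simply replaces it with 0.

```python
import numpy as np

W = np.array([[1.0, 2.0, 8.0], [5.0, 15.0, 3.0]])
z = np.array([0, 2]).reshape(2, 1)

# (2, 1) compared with (3,) broadcasts to (2, 3): True where the element is kept.
keep = z != np.arange(W.shape[-1])
print(keep)

# Replace excluded elements by 0 and sum each row.
masked_sum = np.where(keep, W, 0.0).sum(-1)
print(masked_sum)

# Reference: delete the element row by row.
reference = np.array([np.delete(row, idx).sum() for row, idx in zip(W, z.ravel())])
print(reference)
```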
@Divaka's answers are pretty good. I'll just give another perspective on your question. If you need masking to ignore certain indices and are doing multiple operations on the array, you should use a numpy masked array (np.ma.array) instead of a regular np.array. Masked arrays exist precisely for the purpose of ignoring certain indices.
See the masked array documentation for more info.
z = np.array([0,2]).reshape(2,1)
W_ma = np.ma.array(W, mask=z == np.arange(W.shape[-1]))
In [36]: W_ma
Out[36]:
masked_array(
data=[[--, 2.0, 8.0],
[5.0, 15.0, --]],
mask=[[ True, False, False],
[False, False, True]],
fill_value=1e+20)
From this W_ma masked array, you may do almost all operations the same as np.array. For sum
W_ma.sum(1)
Out[44]:
masked_array(data=[10.0, 20.0],
mask=[False, False],
fill_value=1e+20)
To turn masked array to regular array, you may use compressed, filled, or compress_rowcols
In [46]: W_ma.sum(1).compressed()
Out[46]: array([10., 20.])
Note: I emphasize that a masked array is useful when you do multiple operations while ignoring indices. If you only need to do one or two such operations, there is no point in using a masked array.
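As a small illustration of the "multiple operations" case, the same mask can be reused for several reductions. The choice of sum, mean and max here is only an example.

```python
import numpy as np

W = np.array([[1.0, 2.0, 8.0], [5.0, 15.0, 3.0]])
z = np.array([0, 2]).reshape(2, 1)

# Build the mask once; True marks the elements to ignore.
W_ma = np.ma.array(W, mask=z == np.arange(W.shape[-1]))

# Each reduction skips the masked element of every row.
row_sum = W_ma.sum(axis=1)
row_mean = W_ma.mean(axis=1)
row_max = W_ma.max(axis=1)

print(row_sum.compressed())
print(row_mean.compressed())
print(row_max.compressed())
```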
I have a matrix of vectors where each row is a vector. I want to take the mean of all the vectors, then calculate the cosine distance between each vector and this mean, returning an array of distances.
>>> x = arange(1,10).reshape(3,3)
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
>>> m = x.mean(0)
array([4., 5., 6.])
The cosine values are as follows
>>> from scipy.spatial.distance import cosine
>>> cosine([1,2,3], [4,5,6])
0.0253681538029239
>>> cosine([4,5,6], [4,5,6])
0.0
>>> cosine([7,8,9], [4,5,6])
0.001809107314273195
Therefore I want to write a function f such that
>>> f(x, m)
array([0.0253681538029239, 0.0, 0.001809107314273195])
(Or the transpose of such an array. It doesn't matter.)
What is the most efficient, most numpythonic way to write f? It seems like the trick is to get the proper broadcast over the cosine function, but I haven't figured out how to do this. The following doesn't work.
>>> from numpy import frompyfunc
>>> f = frompyfunc(cosine, 2, 1)
>>> f(x, m)
array([[0.0, 0.0, 0.0],
[0.0, 0.0, 0.0],
[0.0, 0.0, 0.0]], dtype=object)
(It looks like here numpy is applying cosine element-wise instead of row-wise.)
Is there a way to do this without writing a for-loop?
It looks like this is possible with apply_along_axis.
>>> from numpy import apply_along_axis
>>> from functools import partial
>>> g = partial(cosine, m)
>>> apply_along_axis(g, 1, x)
array([0.02536815, 0. , 0.00180911])
Is this the most efficient way?
You need to reshape your mean array to be 2D.
>>> from scipy.spatial.distance import cdist
>>> cdist(x, m.reshape(1, -1), metric='cosine')
array([[2.53681538e-02],
[2.22044605e-16],
[1.80910731e-03]])
Guess the trick would be to use cdist that works on 2D arrays in a vectorized manner to get us those cosine distances. So, one way would be -
In [59]: from scipy.spatial.distance import cdist
In [61]: cdist(x,x.mean(0,keepdims=True),'cosine')
Out[61]:
array([[2.53681538e-02],
[2.22044605e-16],
[1.80910731e-03]])
That keepdims keeps the mean 2D, which makes it compatible with the cdist input requirements.
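For comparison, the cosine distance can also be written directly from its definition, 1 - (x . m) / (|x| |m|), with plain numpy broadcasting. This is a sketch of an alternative to cdist, not something proposed in the answers above, and the function name cosine_to_vector is made up.

```python
import numpy as np

def cosine_to_vector(x, m):
    """Cosine distance between each row of x and the single vector m."""
    dots = x @ m                                          # shape (n_rows,)
    norms = np.linalg.norm(x, axis=1) * np.linalg.norm(m)
    return 1.0 - dots / norms

x = np.arange(1, 10).reshape(3, 3)
m = x.mean(0)
distances = cosine_to_vector(x, m)
print(distances)
```

The result is a 1D array of length 3 rather than the (3, 1) column that cdist returns.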
Part of my code inverts a matrix (really an ndarray) using numpy.linalg.inv. However, this frequently errors out as follows:
numpy.linalg.linalg.LinAlgError: Singular matrix
That would be fine if the matrix was actually singular. But that doesn't seem to be the case.
For example, I'm printing the matrix before trying to invert it. So right before the error it prints this:
[[ 0.76400334 0.22660491]
[ 0.22660491 0.06721147]]
... and then returns the above singularity error when it tries to invert that matrix. But from what I can tell this matrix is invertible. Numpy even seems to agree when asked later.
>>> numpy.linalg.inv([[0.76400334, 0.22660491], [0.22660491, 0.06721147]])
array([[ 2.88436275e+07, -9.72469076e+07],
[ -9.72469076e+07, 3.27870046e+08]])
Here's the exact code snippet:
print(np.dot(np.transpose(X), X))
print(np.linalg.inv(np.dot(np.transpose(X), X)))
The first line prints the matrix above; the second line fails.
So what distinguishes the two actions above? Why does the stand-alone code work even though it errors out in my script?
EDIT: Per Colonel Beauvel's request, if I do
try:
    print(np.dot(np.transpose(X), X))
    z = np.linalg.inv(np.dot(np.transpose(X), X))
except np.linalg.LinAlgError:
    z = "whoops"
print(z)
it outputs
[[ 0.01328185 0.1092696 ]
[ 0.1092696 0.89895982]]
whoops
but trying this on its own I get
>>> numpy.linalg.inv([[0.01328185, 0.1092696], [0.1092696, 0.89895982]])
array([[ 2.24677775e+08, -2.73098420e+07],
[ -2.73098420e+07, 3.31954382e+06]])
It's a matter of printing precision. The IEEE 754 doubles, that you're most likely using, have about 16 decimal digits of precision and you need to write out 17 to preserve the binary value.
Here's a small example. First create a singular matrix:
In [1]: import numpy as np
In [2]: np.random.seed(0)
In [3]: a, b, c = np.random.rand(3)
In [4]: d = b*c / a
In [5]: X = np.array([[a, b],[c, d]])
Print and try to invert it:
In [6]: X
Out[6]:
array([[ 0.5488135 , 0.71518937],
[ 0.60276338, 0.78549444]])
In [7]: np.linalg.inv(X)
LinAlgError: Singular matrix
Try to invert the printed matrix:
In [8]: Y = np.array([[ 0.5488135 , 0.71518937],
...: [ 0.60276338, 0.78549444]])
In [9]: np.linalg.inv(Y)
Out[9]:
array([[-85805775.2940297 , 78125795.99532071],
[ 65844615.19517545, -59951242.76033063]])
Success!
Increase printing precision and try again:
In [10]: np.set_printoptions(precision=17)
In [11]: X
Out[11]:
array([[ 0.54881350392732475, 0.71518936637241948],
[ 0.60276337607164387, 0.78549444195576024]])
In [12]: Z = np.array([[ 0.54881350392732475, 0.71518936637241948],
...: [ 0.60276337607164387, 0.78549444195576024]])
In [13]: np.linalg.inv(Z)
LinAlgError: Singular matrix
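The same effect can be seen on a single float. In this sketch (the value 0.1 + 0.2 is just a convenient example), 8 significant digits do not round-trip to the original double, while 17 do:

```python
x = 0.1 + 0.2

short_text = f"{x:.8g}"    # what a low-precision print shows
full_text = f"{x:.17g}"    # 17 significant digits

# Parsing the short text gives a different double; the full text gives x back.
print(short_text, float(short_text) == x)
print(full_text, float(full_text) == x)
```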
I just compute the determinant:
In [130]: m = np.array([[ 0.76400334, 0.22660491],[ 0.22660491,0.06721147]])
In [131]: np.linalg.det(m)
Out[131]: 2.3302017068132921e-09
# which is in fact, for a 2x2 matrix, 0.76400334*0.06721147 - 0.22660491*0.22660491
Which is already quite close to 0.
If a matrix m can be inverted, mathematically you can compute the adjugate (classical adjoint) and divide by the determinant to get the inverse matrix.
Numerically if the determinant is too small, this can entail the kind of error you have ...
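One way to check for this before inverting is the condition number. np.linalg.cond is not mentioned in the posts above; it is brought in here as a diagnostic, and the thresholds in the comments are rough rules of thumb, not hard limits. The sketch compares a matrix that is singular by construction with its rounded printout (values copied from the other answer).

```python
import numpy as np

# A matrix whose second column is a multiple of the first, up to rounding.
np.random.seed(0)
a, b, c = np.random.rand(3)
d = b * c / a
X = np.array([[a, b], [c, d]])

# The same matrix as it appears when printed with 8 digits.
Y = np.array([[0.5488135, 0.71518937],
              [0.60276338, 0.78549444]])

cond_X = np.linalg.cond(X)
cond_Y = np.linalg.cond(Y)

# A condition number near 1/eps (about 1e16) or inf means the inverse is meaningless.
print("cond(X) =", cond_X)
# Rounding the entries makes the matrix merely ill-conditioned instead of singular.
print("cond(Y) =", cond_Y)
```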
I have two one-dimensional numpy matrices:
[[ 0.69 0.41]] and [[ 0.81818182 0.18181818]]
I want to multiply these two to get the result
[[0.883, 0.117]] (the result is normalized)
If I use np.dot I get ValueError: matrices are not aligned
Does anybody have an idea what I am doing wrong?
EDIT
I solved it in a kind of hacky way, but it worked for me, regardless of if there is a better solution or not.
new_matrix = np.matrix([ a[0,0] * b[0,0], a[0,1] * b[0,1] ])
It seems you want to do element-wise math. Numpy arrays do this by default.
In [1]: import numpy as np
In [2]: a = np.matrix([.69,.41])
In [3]: b = np.matrix([ 0.81818182, 0.18181818])
In [4]: np.asarray(a) * np.asarray(b)
Out[4]: array([[ 0.56454546, 0.07454545]])
In [5]: np.matrix(_)
Out[5]: matrix([[ 0.56454546, 0.07454545]])
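The expected result in the question, [[0.883, 0.117]], is the element-wise product divided by its sum. A minimal sketch with plain arrays, assuming that "normalized" means the entries should sum to 1:

```python
import numpy as np

a = np.array([0.69, 0.41])
b = np.array([0.81818182, 0.18181818])

# Element-wise product, then rescale so the entries sum to 1.
product = a * b
normalized = product / product.sum()
print(np.round(normalized, 3))
```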