Eigenvector values using Numpy - python

I have a matrix of ternary values (2 observations, 11 variables) for which I calculate the eigenvectors using np.linalg.eig() from Numpy. The matrix is (0 values are not used for this example):
v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11
1 1 1 1 1 1 1 1 1 -1 -1
1 1 1 1 1 1 1 1 1 -1 -1
Result of the eigenvector from largest eigenvalue:
[ 0.33333333 0. 0.33333333 0. 0.33333333 0.33333333
0.33333333 0.33333333 0.33333333 0.33333333 0.33333333]
I am not sure about the order of these coefficients. Are they following the order of the variables expressed in the matrix (i.e. first 0.33333333 is weight coefficient of v1, 0.0 is weight coefficient of v2, etc...)?
Last part of my code is:
# Matrix with rounded values
Mtx = np.matrix.round(Mtx,3)
# Cross product of Mtx
Mtx_CrossProduct = (Mtx.T).dot(Mtx)
# Calculation of eigenvectors
eigen_Value, eigen_Vector = np.linalg.eig(Mtx_CrossProduct)
eigen_Vector = np.absolute(eigen_Vector)
# Listing (eigenvalue, eigenvector) and sorting of eigenvalues to get PC1
eig_pairs = [(np.absolute(eigen_Value[i]), eigen_Vector[i,:]) for i in range(len(eigen_Value))]
eig_pairs.sort(key=lambda tup: tup[0],reverse=True)
# Getting largest eigenvector
eig_Vector_Main = np.zeros((11,))
for i in range(len(eig_pairs)):
eig_Vector_Main[i] = eig_pairs[i][1][0]

The dimensions of each vector are the same as the dimensions of your original matrix (i.e. they follow the order as you say).
I've not figured out exactly what you're doing with your lambda and 'standard' python list but you can probably do the same thing more elegantly and quickly by sticking to numpy i.e.
eigen_Value, eigen_Vector = np.linalg.eig(Mtx_CrossProduct)
eigen_Vector = np.absolute(eigen_Vector)
ix = np.argsort(eigen_Value)[::-1] # reverse sorted index
eig_Vector_Main = eigen_Vector[ix]

Related

Manipulating a distance matrix for intersection over time intervals

I've created distance matrices for time steps at every 0.1 seconds for intervals of 60 seconds. The matrices look so for each time step with distance values populating them:
time = 0.1
a1 b2 c3 d4
a1 0 5.4 9.1 10.1
b2 5.4 0 5.0 3.2
c3 9.1 5.0 0 6.6
d4 10.1 3.2 6.6 0
time = 0.2
a1 b2 c3 d4
a1 0 2.4 9.1 12.1
b2 2.4 0 6.7 3.6
c3 9.1 6.7 0 9.6
d4 12.1 3.6 9.6 0
The goal is to generate an adjacency matrix, or neighbor list at the end of each 60 second interval (examining 600 dataframes) for neighbors that maintain a distance threshold the entire minute (in each distance matrix examined).
For example, if the distance limit is d=10, then for this 0.2 second sample it would return a list of [a1, b2, c3] since for that 0.2 second interval, they all maintained a distance less than 10.
I was wondering if there is a semi-efficient or clever way to do this with pandas and python.
stack your dataframes along a 3rd dimension, then apply your threshold to get boolean values, then use numpy.logical_and.reduce to apply the "and" along your third dimension.
eg if dfs is a list of your dataframes then do
threshold = 10
stacked = np.stack(dfs, axis=2)
result = np.logical_and.reduce(stacked < threshold, axis=2)
You can then put result inside a dataframe with index and column names if you wish.
IIUC, since you have a symmetric matrix, you can use numpy to create boolean masks and filter the indices (or columns) with it. Since it's symmetric, it suffices to analyze either the upper triangle or the lower triangle (I chose lower triangle). Then among the numbers in the lower triangle, build a mask that returns False for the rows that contain a value greater than d.
import numpy as np
def get_neighbor_indices(df, d):
less_than_d = np.tril(df.lt(d).to_numpy())
upper_triangle_dummy = np.triu(np.ones(df.shape)==1)
msk = (less_than_d | upper_triangle_dummy).all(axis=1)
return df.index[msk].tolist()
>>> get_neighbor_indices(df1, 10)
['a1', 'b2', 'c3']
>>> get_neighbor_indices(df2, 10)
['a1', 'b2', 'c3']

calculate cosine similarity for all columns in a group by in a dataframe

I have a dataframe df: where APer columns range from 0-60
ID FID APerc0 ... APerc60
0 X 0.2 ... 0.5
1 Z 0.1 ... 0.3
2 Y 0.4 ... 0.9
3 X 0.2 ... 0.3
4 Z 0.9 ... 0.1
5 Z 0.1 ... 0.2
6 Y 0.8 ... 0.3
7 W 0.5 ... 0.4
8 X 0.6 ... 0.3
I want to calculate the cosine similarity of the values for all APerc columns between each row. So the result for the above should be:
ID CosSim
1 0,2,4 0.997
2 1,8,7 0.514
1 3,5,6 0.925
I know how to generate cosine similarity for the whole df:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(df)
But I want to find similarity between each ID and group them together(or create separate df). How to do it fast for big dataset?
One possible solution could be get the particular rows you want to use for cosine similarity computation and do the following.
Here, combinations is basically the list pair of row index which you want to consider for computation.
cos = nn.CosineSimilarity(dim=0)
for i in range(len(combinations)):
row1 = df.loc[combinations[i][0], 2:62]
row2 = df.loc[combinations[i][1], 2:62]
sim = cos(row1, row2)
print(sim)
The result you can use in the way you want.
create a function for calculation, then df.apply(cosine_similarity_function()), one said that using apply function may perform hundreds times faster than row by row.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html

The inverse of some matrices are different between Python and Excel. Which results I should consider?

I tested two 3x3 matrix to know the inverse in Python and Excel, but the results are different. Which I should consider as the correct or best result?
These are the matrix I tested:
Matrix 1:
1 0 0
1 2 0
1 2 3
Matrix 2:
1 0 0
4 5 0
7 8 9
The Matrix 1 inverse is the same in Python and Excel, but Matrix 2 inverse is different.
In Excel I use the MINVERSE(matrix) function, and in Python np.linalg.inv(matrix) (from Numpy library)
I can't post images yet, so I can't show the results from Excel :c
This is the code I use in Python:
# Matrix 1
A = np.array([[1,0,0],
[1,2,0],
[1,2,3]])
Ainv = np.linalg.inv(A)
print(Ainv)
Result:
[[ 1. 0. 0. ]
[-0.5 0.5 0. ]
[ 0. -0.33333333 0.33333333]]
# (This is the same in Excel)
# Matrix 2
B = np.array([[1,0,0],
[4,5,0],
[7,8,9]])
Binv = np.linalg.inv(B)
print(Binv)
Result:
[[ 1.00000000e+00 0.00000000e+00 -6.16790569e-18]
[-8.00000000e-01 2.00000000e-01 1.23358114e-17]
[-6.66666667e-02 -1.77777778e-01 1.11111111e-01]]
# (This is different in Excel)

how to group data in bins of 1 degree latitude?

I have some geographical data (global) as arrays:
latitude: lats = ([34.5,34.2,67.8,-24,...])
wind speed: u = ([2.2,2.5,6,-3,-0.5,...])
I would like to have a statement how the wind speed depends on latitude. Therefore I would like to bin the data in latitude bins of 1 degree.
latbins = np.linspace(lats.min(),lat.(max),180)
How can I calculate which wind speeds are falling in which bin. I read about pandas.groupby. Is that an option?
The numpy function np.digitize does this task.
Here one example that categorises each value in a bin:
import numpy as np
import math
# Generate random lats
lats = np.arange(0,10) - 0.5
print("{:20s}: {}".format("Lats", lats))
# Lats : [-0.5 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5]
# Generate bins spaced by 1 from the minus to max values of lats
bins = np.arange(math.floor(lats.min()), math.ceil(lats.max()) +1, 1)
print("{:20s}: {}".format("Bins", bins))
# Bins : [-1 0 1 2 3 4 5 6 7 8 9]
lats_bins = np.digitize(lats, bins)
print("{:20s}: {}".format("Lats in bins", lats_bins))
# Lats in bins : [ 1 2 3 4 5 6 7 8 9 10]
As suggested by #High Performance Mark in the comments, since you want to split in bins with 1 degree, you can use the floor to extract the floor of each lats (note: this method introduces negative index bins if there are negative values):
lats_bins_floor = np.floor(lats)
# lats_bins_floor = lats_bins_floor + abs(min(lats_bins_floor))
print("{:20s}: {}".format("Lats in bins (floor)", lats_bins_floor))
# Lats in bins (floor): [-1. 0. 1. 2. 3. 4. 5. 6. 7. 8.]

Efficient, large-scale competition scoring in Python

Consider a large dataframe of scores S containing entries like the following. Each row represents a contest between a subset of the participants A, B, C and D.
A B C D
0.1 0.3 0.8 1
1 0.2 NaN NaN
0.7 NaN 2 0.5
NaN 4 0.6 0.8
The way to read the matrix above is: looking at the first row, the participant A scored 0.1 in that round, B scored 0.3, and so forth.
I need to build a triangular matrix C where C[X,Y] stores how much better participant X was than participant Y. More specifically, C[X,Y] would hold the mean % difference in score between X and Y.
From the example above:
C[A,B] = 100 * ((0.1 - 0.3)/0.3 + (1 - 0.2)/0.2) = 33%
My matrix S is huge, so I am hoping to take advantage of JIT (Numba?) or built-in methods in numpy or pandas. I certainly want to avoid having a nested loop, since S has millions of rows.
Does an efficient algorithm for the above have a name?
Let's look at a NumPy based solution and thus let's assume that the input data is in an array named a. Now, the number of pairwise combinations for 4 such variables would be 4*3/2 = 6. We can generate the IDs corresponding to such combinations with np.triu_indices(). Then, we index into the columns of a with those indices. We perform the subtractions and divisions and simply add the columns ignoring the NaN affected results with np.nansum() for the desired output.
Thus, we would have an implementation like so -
R,C = np.triu_indices(a.shape[1],1)
out = 100*np.nansum((a[:,R] - a[:,C])/a[:,C],0)
Sample run -
In [121]: a
Out[121]:
array([[ 0.1, 0.3, 0.8, 1. ],
[ 1. , 0.2, nan, nan],
[ 0.7, nan, 2. , 0.5],
[ nan, 4. , 0.6, 0.8]])
In [122]: out
Out[122]:
array([ 333.33333333, -152.5 , -50. , 504.16666667,
330. , 255. ])
In [123]: 100 * ((0.1 - 0.3)/0.3 + (1 - 0.2)/0.2) # Sample's first o/p elem
Out[123]: 333.33333333333337
If you need the output as (4,4) array, we can use Scipy's squareform -
In [124]: from scipy.spatial.distance import squareform
In [125]: out2D = squareform(out)
Let's convert to a pandas dataframe for a good visual feedback -
In [126]: pd.DataFrame(out2D,index=list('ABCD'),columns=list('ABCD'))
Out[126]:
A B C D
A 0.000000 333.333333 -152.500000 -50
B 333.333333 0.000000 504.166667 330
C -152.500000 504.166667 0.000000 255
D -50.000000 330.000000 255.000000 0
Let's compute [B,C] manually and check back -
In [127]: 100 * ((0.3 - 0.8)/0.8 + (4 - 0.6)/0.6)
Out[127]: 504.1666666666667

Categories

Resources