More efficient way to merge columns in pandas - python

My code calculates the Euclidean distance between all points in a set of samples I have. What I want to know is whether, in general, this is the most efficient way to perform some operation between all elements in a set and then plot them, for instance to make a correlation matrix.
The index of samples is used to initialize the dataframe and provide labels. The 3D coordinates are then provided as tuples in three_D_coordinate_tuple_list, but these could easily be any measurement, and the variable distance could be any operation. I'm curious about finding a more efficient alternative to building each column and then merging them again using pandas or numpy. Am I clogging up any memory with my solution? How can I make this cleaner?
def euclidean_distance_matrix_maker(three_D_coordinate_tuple_list, index_of_samples):
    # three_D_coordinate_tuple_list: list of (x, y, z) tuples
    # index_of_samples: well_id or index as series or list
    n = len(three_D_coordinate_tuple_list)
    distance_matrix_df = pd.DataFrame(index_of_samples)
    for i in range(0, n):
        column = []
        # iterates through all elements and calculates the distance vs this element
        for j in range(0, n):
            distance = euclidean_dist_threeD_for_tuples(three_D_coordinate_tuple_list[i],
                                                        three_D_coordinate_tuple_list[j])
            column.append(distance)
        # the column of euclidean distances is wrapped in a data frame (overwriting the
        # previous one), then appended column-wise to the output matrix with concat
        new_column = pd.DataFrame(column)
        distance_matrix_df = pd.concat([distance_matrix_df, new_column], axis=1)
    distance_matrix_df = distance_matrix_df.set_index(distance_matrix_df.iloc[:, 0])
    distance_matrix_df = distance_matrix_df.iloc[:, 1:]
    distance_matrix_df.columns = distance_matrix_df.index
    return distance_matrix_df

Setup
import numpy as np
x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
scipy.spatial.distance_matrix
from scipy.spatial import distance_matrix
distance_matrix(x, x)
array([[ 0. , 5.19615242, 10.39230485],
[ 5.19615242, 0. , 5.19615242],
[10.39230485, 5.19615242, 0. ]])
Numpy
from scipy.spatial.distance import squareform
i, j = np.triu_indices(len(x), 1)
((x[i] - x[j]) ** 2).sum(-1) ** .5
array([ 5.19615242, 10.39230485, 5.19615242])
Which we can make into a square form with squareform
squareform(((x[i] - x[j]) ** 2).sum(-1) ** .5)
array([[ 0. , 5.19615242, 10.39230485],
[ 5.19615242, 0. , 5.19615242],
[10.39230485, 5.19615242, 0. ]])
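If you want the labelled DataFrame the question builds, you can wrap either result directly. A minimal sketch, assuming index_of_samples is a list of labels matching the rows of the coordinate array (the labels here are only illustrative):
import numpy as np
import pandas as pd
from scipy.spatial import distance_matrix

# hypothetical labels and coordinates standing in for index_of_samples /
# three_D_coordinate_tuple_list from the question
index_of_samples = ['s1', 's2', 's3']
x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

dist_df = pd.DataFrame(distance_matrix(x, x),
                       index=index_of_samples, columns=index_of_samples)
print(dist_df)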

Related

NumPy array with largest value on diagonal and other values shuffled

I am trying to create a square NumPy (or PyTorch, since PyTorch code can be turned into NumPy with minimal effort) matrix which has the following property: given a set of values, the diagonal element in each row holds the largest value and the other values are randomly shuffled across the remaining positions.
For example, if I have [1, 2, 3, 4], a possible desired output is:
[[4, 3, 1, 2],
[1, 4, 3, 2],
[2, 1, 4, 3],
[2, 3, 1, 4]]
There can be (several) other possible outputs, as long as the diagonal elements are the largest value (4 in this case) and the off-diagonal elements in each row contain the other values but shuffled.
A hacky/inefficient way of doing this could be first creating a square matrix (4x4) of zeros and putting the largest value (4) in all the diagonal positions, and then traversing the matrix row by row, where for each row i, populate the elements except index i with shuffled remaining values (shuffled versions of [1, 2, 3]). This would be very slow as the matrix size increases. Is there a cleaner/faster/Pythonic way of doing it? Thank you.
First you can generate a randomized array on the first axis with np.random.shuffle(); then I've used a (not so easy to understand) mathematical trick to shift each row:
import numpy as np
from numpy.fft import fft, ifft
# First create your randomized array with np.random.shuffle()
x = np.array([[1,2,3,4],
[2,4,3,1],
[4,1,2,3],
[2,3,1,4]])
# We use np.where to determine in which column each 4 sits.
_, s = np.where(x == 4)
# We compute the left shift that needs to be applied to each row in order to get each 4 on the diagonal
s = s - np.r_[0:x.shape[0]]
# And here is the trick: we can use the fast Fourier transform to left-shift each row by a given value:
L = np.real(ifft(fft(x, axis=1) * np.exp(2*1j*np.pi/x.shape[1] * s[:, None] * np.r_[0:x.shape[1]][None, :]), axis=1).round())
# Note that we could also use a right shift; we simply have to negate the exponent of our exponential:
# np.exp(-2*1j*np.pi...
And we obtain the following matrix:
[[4. 1. 2. 3.]
[2. 4. 1. 3.]
[2. 3. 4. 1.]
[3. 2. 1. 4.]]
No hidden for loop, only pure linear algebra stuff.
To give you an idea, it takes only a few milliseconds for a 1000x1000 matrix on my computer and ~20 s for a 10000x10000 matrix.
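For completeness, here is a sketch of the same pipeline with the initial shuffle included, using plain index arithmetic instead of the FFT for the per-row left shift (this assumes NumPy >= 1.20 for Generator.permuted; the names are illustrative):
import numpy as np

vals = np.array([1, 2, 3, 4])   # value set; the diagonal will hold vals.max()
n = len(vals)
rng = np.random.default_rng()

# one independent shuffle of vals per row
x = rng.permuted(np.tile(vals, (n, 1)), axis=1)

# column of the maximum in each row, and the left shift that puts it on the diagonal
_, s = np.where(x == vals.max())
shift = s - np.arange(n)

# gather with modular column indices: out[i, j] = x[i, (j + shift[i]) % n]
cols = (np.arange(n)[None, :] + shift[:, None]) % n
out = x[np.arange(n)[:, None], cols]
print(out)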

Difference between scipy.linalg.expm versus hand-coded one

I was trying to implement the matrix exponential function as in scipy.linalg.expm. I gained inspiration from kaityo256's github repository. I thus wrote down the following.
from scipy.linalg import expm
from scipy.linalg import eigh
from scipy.linalg import inv
from math import exp as math_exp
from numpy import array, zeros
from numpy.random import random_sample
from numpy.testing import assert_allclose

def diag2sqr(x):
    '''Makes a square matrix from a diagonal one.

    Takes a 1d matrix. Determines its data type.
    Finds out the shape of the 1d matrix.
    Makes an empty square matrix with both
    dimensions equal to the largest (nonzero) dimension of
    the 1d matrix. It then fills the elements of the
    1d matrix into the diagonal slots of the empty
    square one.

    Parameters
    ----------
    x : ndarray
        ndarray to be converted to a square ndarray

    Returns
    -------
    xsqr : ndarray
        ndarray whose diagonal is the same as that of x,
        all other elements are zero,
        dtype same as that of x
    '''
    x_flat = x.ravel()
    # Making the empty matrix
    xsqr = zeros((x_flat.shape[0], x_flat.shape[0]), dtype=x.dtype)
    for i in range(x_flat.shape[0]):
        # filling up the ith diagonal element
        xsqr[i, i] = x_flat[i]
    print('xsqr', xsqr)
    return xsqr

def kaityo_expm(x):
    '''Exponentiates an ndarray (kaityo).

    Exponentiates an ndarray in the most naive way.

    Parameters
    ----------
    x : ndarray
        The ndarray to be exponentiated

    Returns
    -------
    kexpm : ndarray
        x after exponentiating
    '''
    # Find eigenvalues and eigenvectors;
    # the eigenvectors are composed to form a unitary
    rx, ux = eigh(x)
    # Inverse of the unitary
    ux_inv = inv(ux)
    # Constructing the diagonal matrix exp(Lambda)
    # tx = diag([array([math_exp(i) for i in rx]).ravel()])
    # tx = array([math_exp(i) for i in rx])
    tx = diag2sqr(array([math_exp(i) for i in rx]))
    kexpm1 = tx @ ux_inv
    kexpm = ux @ kexpm1
    return kexpm
Afterwards, I tried to test the above code versus scipy.linalg.expm.
x = random_sample((10, 10))
assert_allclose(expm(x), kaityo_expm(x))
This leads to the following output.
AssertionError:
Not equal to tolerance rtol=1e-07, atol=0
Mismatch: 100%
Max absolute difference: 7.04655733
Max relative difference: 0.59875635
x: array([[18.032424, 16.224408, 12.432163, 16.614248, 12.85653 , 13.705387,
15.096966, 10.577946, 18.399573, 17.938062],
[16.352809, 17.525898, 12.79079 , 16.295562, 13.512996, 14.407979,...
y: array([[18.649103, 13.157682, 11.264763, 16.099163, 15.2293 , 17.854499,
11.691586, 13.412066, 15.023189, 15.598455],
[13.157682, 13.612502, 9.628261, 12.659313, 13.559437, 13.382417,..
Obviously, the two implementations differ.
The questions are as follows:
Is it acceptable for them to differ?
Is my implementation wrong?
If my implementation is wrong, how do I fix it?
If my implementation is correct, when is it safe to use scipy.linalg.expm?
I have seen the following questions:
Matrix exponentiation with scipy: expm, expm2 and expm3
From a mathematical point of view, the exponential of a matrix is defined through the Taylor series of the exponential:
exp(A) = I + A + A^2/2! + A^3/3! + ... = sum_{k>=0} A^k / k!
If A is a diagonal square matrix, this simply exponentiates the diagonal entries:
exp(A) = diag(e^{a_11}, e^{a_22}, ..., e^{a_nn})
The problem arises when A is a generic square matrix, so before taking the exponential you need to diagonalize it using its eigenvalues and eigenvectors:
A = U * Lambda * U^{-1}
with U the matrix of eigenvectors and Lambda the matrix with the eigenvalues on the diagonal. At this point we have the exponential of a matrix:
exp(A) = U * exp(Lambda) * U^{-1}
Now let's implement this result in a simple script:
>>> import numpy as np
>>> import scipy.linalg as ln
>>> A = [[2/3, -4/3, 2],
[5/6, 4/3, -2],
[5/6, -2/3, 0]]
>>> A = np.matrix(A)
>>> print(A)
[[ 0.66666667 -1.33333333 2. ]
[ 0.83333333 1.33333333 -2. ]
[ 0.83333333 -0.66666667 0. ]]
>>> eigvalue, eigvectors = np.linalg.eig(A)
>>> print("eigvalue: ", eigvalue)
>>> print("eigvectors:")
>>> print(eigvectors)
eigvalue: [ 1. -1. 2.]
eigvectors:
[[ 0.81649658 0.27216553 0.87287156]
[ 0.40824829 -0.68041382 -0.21821789]
[ 0.40824829 -0.68041382 0.43643578]]
>>> e_Lambda = np.eye(np.size(A, 0))*(np.exp(eigvalue))
>>> print(e_Lambda)
[[2.71828183 0. 0. ]
[0. 0.36787944 0. ]
[0. 0. 7.3890561 ]]
>>> e_A = eigvectors*e_Lambda*eigvectors.I
>>> print(e_A)
[[ 2.3265481 -6.22769903 7.01116649]
[ 0.97933433 4.27520659 -3.51559341]
[ 0.97933433 -3.11384951 3.87346269]]
>>> e_A2 = ln.expm(A)
>>> print(e_A2)
[[ 2.3265481 -6.22769903 7.01116649]
[ 0.97933433 4.27520659 -3.51559341]
[ 0.97933433 -3.11384951 3.87346269]]
>>> np.testing.assert_allclose(e_A, e_A2)
>>> print(e_A - e_A2)
[[-1.77635684e-15 1.77635684e-15 -8.88178420e-16]
[ 4.44089210e-16 -1.77635684e-15 8.88178420e-16]
[-2.22044605e-16 0.00000000e+00 4.44089210e-16]]
We see that the result is basically the same, so I think it's safe to use scipy.linalg.expm for matrix exponentiation.
I created a repo with the notebook for further testing.
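For reference, the same eigendecomposition recipe packaged as a small function; this is only a sketch assuming the input matrix is diagonalizable, and the name expm_via_eig is illustrative:
import numpy as np
from scipy.linalg import expm

def expm_via_eig(A):
    # exp(A) = U exp(Lambda) U^{-1}, valid when A is diagonalizable
    eigvals, U = np.linalg.eig(A)
    return (U * np.exp(eigvals)) @ np.linalg.inv(U)

A = np.array([[2/3, -4/3, 2],
              [5/6, 4/3, -2],
              [5/6, -2/3, 0]])
# agrees with scipy's expm for this matrix
np.testing.assert_allclose(expm_via_eig(A), expm(A), atol=1e-12)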

Second-order cooccurrence of terms in texts

Basically, I want to reimplement this video.
Given a corpus of documents, I want to find the terms that are most similar to each other.
I was able to generate a cooccurrence matrix using this SO thread and use the video to generate an association matrix. Next, I would like to generate a second-order cooccurrence matrix.
Problem statement: Consider a matrix where the rows of the matrix correspond to a term and the entries in the rows correspond to the top k terms similar to that term. Say, k = 4, and we have n terms in our dictionary, then the matrix M has n rows and 4 columns.
HAVE:
M = [[18,34,54,65], # Term IDs similar to Term t_0
[18,12,54,65], # Term IDs similar to Term t_1
...
[21,43,55,78]] # Term IDs similar to Term t_n.
So, M contains for each term ID the most similar term IDs. Now, I would like to check how many of those similar terms match. In the example of M above, it seems that term t_0 and term t_1 are quite similar, because three out of four terms match, whereas terms t_0 and t_n are not similar, because no terms match. Let's write M as a series of lists.
M = [list_0, # Term IDs similar to Term t_0
list_1, # Term IDs similar to Term t_1
...
list_n] # Term IDs similar to Term t_n.
WANT:
C = [[f(list_0, list_0), f(list_0, list_1), ..., f(list_0, list_n)],
[f(list_1, list_0), f(list_1, list_1), ..., f(list_1, list_n)],
...
[f(list_n, list_0), f(list_n, list_1), ..., f(list_n, list_n)]]
I'd like to find the matrix C that has as its elements a function f applied to the lists of M. f(a,b) measures the degree of similarity between two lists a and b. Going with the example above, the degree of similarity between t_0 and t_1 should be high, whereas the degree of similarity of t_0 and t_n should be low.
My questions:
What is a good choice for comparing the ordering of two lists? That is, what is a good choice for function f?
Is there a transformation already available that takes as an input a matrix like M and produces a matrix like C? Preferably a python package?
Thank you, r0f1
In fact, cosine similarity might not be too bad in this case. The problem is, that you don't want to use the index vectors (i.e. [18,34,54,65] and so on in your case), but you want vectors of length n that are zero everywhere except for the values in your index vector. Luckily, you don't have to create those vectors explicitly, but you can just count how many indices the two index vectors have in common:
def f(u, v):
    return len(set(u).intersection(set(v)))
Here, I omitted a constant normalization factor k. There are some more elaborate things that one could do (for example the TF-IDF kernel), but I would stay with this for the start.
In order to run this efficiently using numpy, you would want to do two things:
Convert f to a ufunc, i.e. a numpy vectorized function. You can do that by uf = np.frompyfunc(f, 2, 1) (assuming that you did import numpy as np at some point).
Store M as a 1d array of lists (basically what you state in your second code listing). That's a little more tricky, because numpy is trying to be smart here, but you want something else. So here is how to do that:
n = len(M)
Marray = np.empty(n, dtype='O')  # dtype='O' allows you to have elements of type list
for i in range(n):
    Marray[i] = M[i]
Now, Marray contains essentially what you described in your second code listing. You can then use the new ufunc's outer method to get your similarity matrix. Here is how all of that would work together for your M from the example (assuming n=3):
M = [[18, 34, 54, 65],
[18, 12, 54, 65],
[21, 43, 55, 78]]
n = len(M) # i.e. 3
uf = np.frompyfunc(f, 2, 1)
Marray = np.empty(n, dtype='O')
for i in range(n):
    Marray[i] = M[i]
similarities = uf.outer(Marray, Marray).astype('d')  # convert to float instead of object dtype
print(similarities)
# array([[4., 3., 0.],
# [3., 4., 0.],
# [0., 0., 4.]])
I hope that answers your questions.
You asked two questions, one somewhat open-ended (the first one) and another that has a definitive answer, so I will start with the second one:
Is there a transformation already available that takes as an input a
matrix like M and produces a matrix like C? Preferably, a python
package?
The answer is yes: the scipy.spatial.distance module contains a function that takes a matrix like M and produces a matrix like C. The following example shows how to use it:
import numpy as np
from scipy.spatial.distance import pdist, squareform
# initial data
M = [[18, 34, 54, 65],
[18, 12, 54, 65],
[21, 43, 55, 78]]
# convert to numpy array
arr = np.array(M)
result = squareform(pdist(arr, metric='euclidean'))
print(result)
Output
[[ 0. 22. 16.1245155 ]
[22. 0. 33.76388603]
[16.1245155 33.76388603 0. ]]
As seen in the example above, pdist takes the M matrix and generates a C matrix. Note that the output of pdist is a condensed distance matrix, so you need to convert it to square form using squareform. Now onto the other question:
What is a good choice for comparing the ordering of two lists? That
is, what is a good choice for function f?
Given that order does matter in your particular case, I suggest you look at rank correlation coefficients such as Kendall or Spearman; both are provided in the scipy.stats package, along with a whole bunch of other coefficients. Usage example:
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import kendalltau, spearmanr
# distance function
kendall = lambda x, y : kendalltau(x, y)[0]
spearman = lambda x, y : spearmanr(x, y)[0]
# initial data
M = [[18, 34, 54, 65],
[18, 12, 54, 65],
[21, 43, 55, 78]]
# convert to numpy array
arr = np.array(M)
# compute kendall C and convert to square form
kendall_result = 1 - squareform(pdist(arr, kendall)) # subtract 1 because you want a similarity
print(kendall_result)
print()
# compute spearman C and convert to square form
spearman_result = 1 - squareform(pdist(arr, spearman)) # subtract 1 because you want a similarity
print(spearman_result)
print()
Output
[[1. 0.33333333 0. ]
[0.33333333 1. 0.33333333]
[0. 0.33333333 1. ]]
[[1. 0.2 0. ]
[0.2 1. 0.2]
[0. 0.2 1. ]]
If those do not fit your needs you can take a look at the Hamming distance, for example:
import numpy as np
from scipy.spatial.distance import pdist, squareform
# initial data
M = [[18, 34, 54, 65],
[18, 12, 54, 65],
[21, 43, 55, 78]]
# convert to numpy array
arr = np.array(M)
# compute match_rank C and convert to square form
result = 1 - squareform(pdist(arr, 'hamming'))
print(result)
Output
[[1. 0.75 0. ]
[0.75 1. 0. ]
[0. 0. 1. ]]
In the end the choice of the similarity function will depend on your final application, so you will need to try out different functions and see the ones that fit your needs. Both scipy.spatial.distance and scipy.stats provide a plethora of distance and coefficient functions you can try out.
Further
The following paper contains a section on list similarity
I would suggest cosine similarity, as every list is a vector.
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity([list0], [list1])  # sklearn expects 2D inputs: one row per vector

Calculating cosine similarity of columns of a python matrix

I have a numpy matrix say A as below
array([[1, 2, 3],
[1, 2, 2]])
I want to find the cosine similarity matrix of this matrix, where the cosine similarity is computed between the columns.
Now the cosine similarity of two vectors is just their dot product divided by the product of their L2 norms.
But I don't want to iterate for each column in a loop and do it.
So I first tried this:
from scipy.spatial import distance
cos=distance.cdist(a.T,a.T,'cosine')
Here I am taking the transpose, as otherwise it would compute the cosine over rows (observations); I want it over columns.
However, I am not sure this is the right answer. The doc of this function says it gives 1 - cosine_similarity. So should I then do:
cos = 1 - distance.cdist(a.T, a.T, 'cosine')
Please advise.
II)
Also, what if I try doing something like this:
cos=(np.dot(a.T,a))/(np.linalg.norm(a, axis=0, keepdims=True))*(np.linalg.norm(a, axis=0, keepdims=True))
It won't work; there is some problem in getting the right L2 norm for the right column. Any idea how we can implement this without a function?
Try this:
a = np.array([[1, 2, 3], [1, 2, 2]])
n = np.linalg.norm(a, axis=0).reshape(1, a.shape[1])
a.T.dot(a) / n.T.dot(n)
array([[ 1. , 1. , 0.98058068],
[ 1. , 1. , 0.98058068],
[ 0.98058068, 0.98058068, 1. ]])
This assignment for n would have also worked.
np.linalg.norm(a, axis=0)[None, :]
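If scikit-learn is available, the same column-wise similarity matrix can be cross-checked with cosine_similarity (a quick sketch, assuming a as defined above):
from sklearn.metrics.pairwise import cosine_similarity
# rows of a.T are the columns of a, so this yields the same 3x3 matrix as above
cosine_similarity(a.T)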

Correlate a single time series with a large number of time series

I have a large number (M) of time series, each with N time points, stored in an MxN matrix. Then I also have a separate time series with N time points that I would like to correlate with all the time series in the matrix.
An easy solution is to go through the matrix row by row and run numpy.corrcoef. However, I was wondering if there is a faster or more concise way to do this?
Let's use the Pearson correlation formula:
r = sum_i (X_i - mean(X)) * (Y_i - mean(Y)) / sqrt( sum_i (X_i - mean(X))^2 * sum_i (Y_i - mean(Y))^2 )
You can implement this with X as the M x N array and Y as the other, separate time-series array of N elements to be correlated with X. So, writing X and Y as A and B respectively, a vectorized implementation would look something like this:
import numpy as np
# Rowwise mean of input arrays & subtract from input arrays themselves
A_mA = A - A.mean(1)[:,None]
B_mB = B - B.mean()
# Sum of squares across rows
ssA = (A_mA**2).sum(1)
ssB = (B_mB**2).sum()
# Finally get corr coeff
out = np.dot(A_mA,B_mB.T).ravel()/np.sqrt(ssA*ssB)
# OR out = np.einsum('ij,j->i',A_mA,B_mB)/np.sqrt(ssA*ssB)
Verify results -
In [115]: A
Out[115]:
array([[ 0.1001229 , 0.77201334, 0.19108671, 0.83574124],
[ 0.23873773, 0.14254842, 0.1878178 , 0.32542199],
[ 0.62674274, 0.42252403, 0.52145288, 0.75656695],
[ 0.24917321, 0.73416177, 0.40779406, 0.58225605],
[ 0.91376553, 0.37977182, 0.38417424, 0.16035635]])
In [116]: B
Out[116]: array([ 0.18675642, 0.3073746 , 0.32381341, 0.01424491])
In [117]: out
Out[117]: array([-0.39788555, -0.95916359, -0.93824771, 0.02198139, 0.23052277])
In [118]: np.corrcoef(A[0],B), np.corrcoef(A[1],B), np.corrcoef(A[2],B)
Out[118]:
(array([[ 1. , -0.39788555],
[-0.39788555, 1. ]]),
array([[ 1. , -0.95916359],
[-0.95916359, 1. ]]),
array([[ 1. , -0.93824771],
[-0.93824771, 1. ]]))
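For a self-contained check of the same formula on hypothetical random data, with a row-by-row np.corrcoef loop as the reference:
import numpy as np

rng = np.random.default_rng(0)
M, N = 5, 4
A = rng.random((M, N))   # M time series of length N
B = rng.random(N)        # the single series to correlate with every row of A

A_mA = A - A.mean(1)[:, None]
B_mB = B - B.mean()
out = A_mA @ B_mB / np.sqrt((A_mA ** 2).sum(1) * (B_mB ** 2).sum())

# reference: row-by-row np.corrcoef
expected = np.array([np.corrcoef(row, B)[0, 1] for row in A])
np.testing.assert_allclose(out, expected)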
