The problem
Create a higher dimensional NumPy array with zeros on the new dimensions
Details
Analyzing the last dimension, the result is similar to this:
(not actual code, just a didactic example)
a.shape = (100,2,10)
a[0,0,0]=1
a[0,0,1]=2
...
a[0,0,9]=10
b.shape = (100,2,10,10)
b[0,0,0,:]=[0,0,0,0,0,0,0,0,0,1]
b[0,0,1,:]=[0,0,0,0,0,0,0,0,2,1]
b[0,0,2,:]=[0,0,0,0,0,0,0,3,2,1]
...
b[0,0,9,:]=[10,9,8,7,6,5,4,3,2,1]
a -> b
The objective is to transform a into b. The new dimension is not simply filled with zeros: it has a sequential composition built from the original array.
Simpler problem for better understanding
Another way to visualize is using lower-dimensional arrays:
We have this:
a = [1,2]
And I want this:
b = [[0,1],[2,1]]
Using NumPy arrays and avoiding long for loops.
2d to 3d case
We have this:
a = [[1,2,3],[4,5,6],[7,8,9]]
And I want this:
b[0] = [[0,0,1],[0,2,1],[3,2,1]]
b[1] = [[0,0,4],[0,5,4],[6,5,4]]
b[2] = [[0,0,7],[0,8,7],[9,8,7]]
I suspect that for the 4-dimensional problem a single for loop with 10 iterations would be enough.
Try something like this, staying within the framework of NumPy:
import numpy as np
# create transformation tensors
N = a.shape[-1]
sigma1 = np.zeros((N,N,N))
sigma2 = np.zeros((N,N,N))
E = np.ones((N,N))
for i in range(N):
    sigma1[..., i] = np.diag(np.diag(E, N - 1 - i), N - 1 - i)
    sigma2[N - 1 - i, N - 1 - i:, i] = 1
b1 = np.tensordot(a, sigma1, axes=([-1],[0]))
b2 = np.tensordot(a, sigma2, axes=([-1],[0]))
where sigma1 and sigma2 are the transformation tensors with which you can transform the data along the last dimension of a as you want (the two versions mentioned in the question and comments). Here the loop is only used to create the transformation tensors.
For a = [[1,2,3],[4,5,6],[7,8,9]], the first algorithm gives:
[[[0. 0. 1.] [0. 1. 2.] [1. 2. 3.]]
[[0. 0. 4.] [0. 4. 5.] [4. 5. 6.]]
[[0. 0. 7.] [0. 7. 8.] [7. 8. 9.]]]
and the second algorithm gives:
[[[0. 0. 1.] [0. 2. 1.] [3. 2. 1.]]
 [[0. 0. 4.] [0. 5. 4.] [6. 5. 4.]]
 [[0. 0. 7.] [0. 8. 7.] [9. 8. 7.]]]
Try to avoid lists and Python-level loops when using NumPy, as they slow down execution.
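As a loop-free alternative, the same reversed, right-aligned layout can also be built with broadcasting alone. This is only a sketch, assuming the reversed variant from the 2d-to-3d example; the mask construction below is my own, not taken from the answer above:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
N = a.shape[-1]

# rev[..., j] holds a[..., N-1-j]; mask[i, j] keeps positions where i + j >= N - 1,
# i.e. row i of the new axis exposes the last i+1 reversed values, right-aligned
rev = a[..., ::-1]
mask = np.add.outer(np.arange(N), np.arange(N)) >= N - 1
b = rev[..., None, :] * mask

print(b[0])
# [[0 0 1]
#  [0 2 1]
#  [3 2 1]]
```

The same expression works unchanged for the (100, 2, 10) case, since the broadcast only touches the trailing axes.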
I was able to solve the problem but there is probably a more efficient way to do it:
a = np.array([[1,2,3],[4,5,6],[7,8,9]]) #two dim case
a = np.array([[[1,2,3],[4,5,6],[7,8,9]],[[1,2,3],[4,5,6],[7,8,9]],[[1,2,3],[4,5,6],[7,8,9]]])# three dim case
def increase_dim(arr):
    stack_list = [arr]
    for i in range(1, arr.shape[-1]):
        # pad with zeros on the left, then slice out the shifted window
        padded = np.append(np.zeros(arr.shape), arr, axis=-1)
        shifted = np.delete(padded, np.s_[-i:], axis=-1)
        stack_list.append(np.delete(shifted, np.s_[:arr.shape[-1] - i], axis=-1))
    return np.stack(stack_list, axis=-1)

b = increase_dim(a)
I hope this helps clarify the question.
I'm working on an animated bar plot to show how the number frequencies of rolling a six-sided die converge the more often you roll. I'd like to show the frequencies after each iteration, and for that I need to collect the list of frequencies at each iteration into another list. Here's the code so far:
import numpy as np
import numpy.random as rd
rd.seed(23)
n_samples = 3
freqs = np.zeros(6)
frequencies = []
for roll in range(n_samples):
    x = rd.randint(0, 6)
    freqs[x] += 1
    print(freqs)
    frequencies.append(freqs)
print()
for x in frequencies:
    print(x)
Output:
[0. 0. 0. 1. 0. 0.]
[1. 0. 0. 1. 0. 0.]
[1. 1. 0. 1. 0. 0.]
[1. 1. 0. 1. 0. 0.]
[1. 1. 0. 1. 0. 0.]
[1. 1. 0. 1. 0. 0.]
Desired output:
[0. 0. 0. 1. 0. 0.]
[1. 0. 0. 1. 0. 0.]
[1. 1. 0. 1. 0. 0.]
[0. 0. 0. 1. 0. 0.]
[1. 0. 0. 1. 0. 0.]
[1. 1. 0. 1. 0. 0.]
The first three lists indeed show the frequencies after each iteration. However, when I append the list to frequencies, in the end it just contains the final frequencies repeated, as seen in the last three lists. This one has me stumped, and I'm rather new to Python. How would one collect each intermediate list, as in the first three lines of the output? Thanks in advance!
You can fix it by changing only frequencies.append(freqs) to frequencies.append(freqs.copy()). That appends a copy of freqs that is independent of the original, so later changes to freqs won't affect the copies already stored.
import numpy as np
import numpy.random as rd
rd.seed(23)
n_samples = 3
freqs = np.zeros(6)
frequencies = []
for roll in range(n_samples):
    x = rd.randint(0, 6)
    freqs[x] += 1
    print(freqs)
    frequencies.append(freqs.copy())
    print(frequencies)
print()
for x in frequencies:
    print(x)
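The underlying behavior is easy to see in isolation: the list stores a reference to the array, not a snapshot of its values. A minimal illustration (the names here are made up for the demo):

```python
import numpy as np

a = np.zeros(3)
snapshots = [a]          # the list stores a reference to the same array
a[0] = 1
print(snapshots[0])      # the stored entry changed too

snapshots = [a.copy()]   # an independent copy is stored instead
a[0] = 2
print(snapshots[0])      # unaffected by the later change
```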
Python is keeping track of freqs as a single object, and its value gets changed even after it gets appended, because the list stores a reference rather than a copy.
However, here is a quick and dirty workaround:
import numpy as np
import numpy.random as rd
rd.seed(23)
n_samples = 3
freqs = np.zeros(6)
frequencies = []
for roll in range(n_samples):
    x = rd.randint(0, 6)
    freqs[x] += 1
    # build an independent copy of freqs for this iteration
    freqs_copy = []
    for item in freqs:
        freqs_copy.append(item)
    print(freqs_copy)
    frequencies.append(freqs_copy)
print()
for x in frequencies:
    print(x)
The idea is to make a copy of "freqs" that is independent of the original "freqs". In the code above, "freqs_copy" is unique to each iteration.
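For completeness, the whole frequency history can also be produced without any Python-level loop, by one-hot encoding the rolls and taking a cumulative sum. This is a vectorized sketch, not from the question itself; it assumes drawing all rolls up front is acceptable:

```python
import numpy as np
import numpy.random as rd

rd.seed(23)
n_samples = 3

rolls = rd.randint(0, 6, size=n_samples)   # draw all rolls at once
one_hot = np.eye(6)[rolls]                 # one row per roll, with a 1 in the rolled slot
frequencies = np.cumsum(one_hot, axis=0)   # running counts after each roll

for row in frequencies:
    print(row)
```

Because each row is a fresh array produced by cumsum, the aliasing problem cannot occur here at all.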
I have this Python code to calculate the distances between different coordinate points.
IDs,X,Y,Z
0-20,193.722,175.733,0.0998975
0-21,192.895,176.727,0.0998975
7-22,187.065,178.285,0.0998975
0-23,192.296,178.648,0.0998975
7-24,189.421,179.012,0.0998975
8-25,179.755,179.347,0.0998975
8-26,180.436,179.288,0.0998975
7-27,186.453,179.2,0.0998975
8-28,178.899,180.92,0.0998975
The code works perfectly, but the number of coordinates I now have is very large (~50,000), so I need to optimise this code; otherwise it is impossible to run. Could someone suggest a more memory-efficient way of doing this? Thanks for any suggestion.
#!/usr/bin/env python
import pandas as pd
import scipy.spatial as spsp
df_1 =pd.read_csv('Spots.csv', sep=',')
coords = df_1[['X', 'Y', 'Z']].to_numpy()
distances = spsp.distance_matrix(coords, coords)
df_1['dist'] = distances.tolist()
# CREATES columns d0, d1, d2, d3
dist_cols = df_1['IDs']
df_1[dist_cols] = df_1['dist'].apply(pd.Series)
df_1.to_csv("results_Spots.csv")
There are a couple of ways to save space. The first is to only store the upper triangle of your matrix and make sure that your indices always reflect that. The second is only to store the values that meet your threshold. This can be done collectively by using sparse matrices, which support most of the operations you will likely need, and will only store the elements you need.
To store half the data, preprocess your indices when you access your matrix. So for your matrix, access index [i, j] like this:
def getitem(dist, i, j):
    # only the upper triangle is stored, so order the indices first
    if i > j:
        i, j = j, i
    return dist[i, j]
scipy.sparse supports a number of sparse matrix formats: BSR, Coordinate, CSR, CSC, Diagonal, DOK, LIL. According to the usage reference, the easiest way to construct a matrix is using DOK or LIL format. I will show the latter for simplicity, although the former may be more efficient. I will leave it up to the reader to benchmark different options once a basic functioning approach has been shown. Remember to convert to CSR or CSC format when doing matrix math.
We will sacrifice speed for spatial efficiency by constructing one row at a time:
import numpy as np
import scipy.sparse

N = coords.shape[0]
threshold = 2
threshold2 = threshold**2 # minor optimization to save on some square roots
distances = scipy.sparse.lil_matrix((N, N))
for i in range(N):
    # compute squared distances to all later points
    d2 = np.sum(np.square(coords[i + 1:, :] - coords[i]), axis=1)
    # threshold on the squared distance
    mask = np.flatnonzero(d2 <= threshold2)
    # apply: only compute the square root where necessary
    distances[i, mask + i + 1] = np.sqrt(d2[mask])
For your toy example, we find that there are only four elements that actually pass the threshold, making the storage very efficient:
>>> distances.nnz
4
>>> distances.toarray()
array([[0. , 1.29304486, 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 1.1008038 , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0.68355102, 0. , 1.79082802],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ]])
Using the result from scipy.spatial.distance_matrix confirms that these numbers are in fact accurate.
If you want to fill the matrix (effectively doubling the storage, which should not be prohibitive), you should probably move away from LIL format before doing so. Simply add the transpose to the original matrix to fill it out.
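Symmetrizing can then look like this; a minimal sketch on a hypothetical 3×3 upper-triangular matrix, not the full dataset:

```python
import scipy.sparse

# toy upper-triangular distance matrix with a zero diagonal
upper = scipy.sparse.lil_matrix((3, 3))
upper[0, 1] = 1.5
upper[1, 2] = 2.5

sym = upper.tocsr()   # leave LIL format before doing matrix math
sym = sym + sym.T     # the diagonal is zero, so nothing is double-counted

print(sym.toarray())
```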
The approach shown here addresses your storage concerns, but you can improve the efficiency of the entire computation using spatial sorting and other geospatial techniques. For example, you could use scipy.spatial.KDTree or the similar scipy.spatial.cKDTree to arrange your dataset and query neighbors within a specific threshold directly and efficiently.
For example, the following would replace the matrix construction shown here with what is likely a more efficient method:
tree = scipy.spatial.KDTree(coords)
distances = tree.sparse_distance_matrix(tree, threshold)
Your code asks for the point-to-point distances of a ~50000 × ~50000 matrix. The result will be very big if you really want to store it, and the matrix is dense, since each point has a positive distance to every other point.
I recommend revisiting your business requirements. Do you really need to calculate all these distances upfront and store them in a file on disk? Sometimes it is better to do the required calculations on the fly; scipy.spatial is fast, perhaps not much slower than reading a precalculated value.
EDIT (based on comment):
You can filter the calculated distances by a threshold (here, 5.0 for illustration) and then look up the IDs in the DataFrame:
import numpy as np
import pandas as pd
import scipy.spatial as spsp
df_1 =pd.read_csv('Spots.csv', sep=',')
coords = df_1[['X', 'Y', 'Z']].to_numpy()
distances = spsp.distance_matrix(coords, coords)
adj_5 = np.argwhere(distances[:] < 5.0)
pd.DataFrame(zip(df_1['IDs'][adj_5[:,0]].values,
df_1['IDs'][adj_5[:,1]].values),
columns=['from', 'to'])
I have a list of x, y points like in the picture above. In code it looks like this:
np.array([[1.3,2.1],[1.5,2.2],[3.1,4.8]])
now I would like to set a grid of which I can set the start, the number of columns and rows as well as the row and columns size, and then count the number of points in each cell.
In this example, [0,0] has 1 point in it, [1,0] has 1, [2,0] has 3, [0,1] has 4, and so on.
I know it would probably be trivial to do by hand, even without NumPy, but I need it to be as fast as possible, since I will have to process a ton of data this way.
What's a good way to do this? Basically, create a 2D histogram of the points? And more importantly, how can I do it as fast as possible?
I think numpy.histogram2d is the best option.
a = np.array([[1.3,2.1],[1.5,2.2],[3.1,4.8]])
H, _, _ = np.histogram2d(a[:, 0], a[:, 1], bins=(range(6), range(6)))
print(H)
# [[0. 0. 0. 0. 0.]
# [0. 0. 2. 0. 0.]
# [0. 0. 0. 0. 0.]
# [0. 0. 0. 0. 1.]
# [0. 0. 0. 0. 0.]]
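Since the grid origin, the number of cells, and the cell size should all be configurable, explicit bin edges can be passed instead of ranges. A sketch with made-up grid parameters (x0, dx, etc. are placeholders to adapt):

```python
import numpy as np

a = np.array([[1.3, 2.1], [1.5, 2.2], [3.1, 4.8]])

x0, y0 = 0.0, 0.0       # grid origin (assumed values)
ncols, nrows = 5, 5     # number of cells per axis
dx, dy = 1.0, 1.0       # cell size

x_edges = x0 + dx * np.arange(ncols + 1)
y_edges = y0 + dy * np.arange(nrows + 1)
H, _, _ = np.histogram2d(a[:, 0], a[:, 1], bins=(x_edges, y_edges))

print(H)   # H[i, j] counts points with x in cell i and y in cell j
```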
When you know the number of dimensions of your lattice ahead of time, it is straightforward to use meshgrid to evaluate a function over a mesh.
from pylab import *
lattice_points = linspace(0,3,4)
xs,ys = meshgrid(lattice_points,lattice_points)
zs = xs+ys # <- stand-in function, to be replaced by something more interesting
print(zs)
Produces
[[ 0. 1. 2. 3.]
[ 1. 2. 3. 4.]
[ 2. 3. 4. 5.]
[ 3. 4. 5. 6.]]
But I would like to have a version of something similar, for which the number of dimensions is determined during runtime, or is passed as a parameter.
from pylab import *
#np.vectorize
def fn(listOfVars):
    return sum(listOfVars)  # <- stand-in function, to be replaced
                            #    by something more interesting
n_vars = 2
lattice_points = linspace(0,3,4)
indices = meshgrid(*(n_vars*[lattice_points])) # this works fine
zs = fn(indices) # <-- this line is wrong, but I don't
# know what would work instead
print(zs)
Produces
[[[ 0. 1. 2. 3.]
[ 0. 1. 2. 3.]
[ 0. 1. 2. 3.]
[ 0. 1. 2. 3.]]
[[ 0. 0. 0. 0.]
[ 1. 1. 1. 1.]
[ 2. 2. 2. 2.]
[ 3. 3. 3. 3.]]]
But I want it to produce the same result as above.
There is probably a solution where you can find the indices of each dimension and use itertools.product to generate all possible combinations of indices, etc., but is there not a nice Pythonic way of doing this?
Joe Kington and user2357112 have helped me to see the error in my ways. For those of you that would like to see a complete solution:
from pylab import *
## 2D "preknown case" (for testing / to compare output)
lattice_points = linspace(0,3,4)
xs,ys = meshgrid(lattice_points,lattice_points)
zs = xs+ys
print('2-D Case')
print(zs)
## 3D "preknown case" (for testing / to compare output)
lattice_points = linspace(0,3,4)
ws,xs,ys = meshgrid(lattice_points,lattice_points,lattice_points)
zs = ws+xs+ys
print('3-D Case')
print(zs)
## Solution, thanks to comments from Joe Kington and user2357112
def fn(listOfVars):
    return sum(listOfVars)
n_vars = 3 ## can change to 2 or 3 to compare to example cases above
lattice_points = linspace(0,3,4)
indices = meshgrid(*(n_vars*[lattice_points]))
zs = np.apply_along_axis(fn,0,indices)
print('adaptable n-D Case')
print(zs)
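One more note: when the function is itself a NumPy reduction, apply_along_axis (which loops in Python) can be replaced by operating on the stacked mesh directly. A sketch for the sum case:

```python
import numpy as np

n_vars = 3
lattice_points = np.linspace(0, 3, 4)
indices = np.meshgrid(*(n_vars * [lattice_points]))

# summing the stacked coordinate arrays is equivalent to
# np.apply_along_axis(sum, 0, indices), but fully vectorized
zs = np.sum(indices, axis=0)
print(zs.shape)   # (4, 4, 4)
```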