I am trying to understand the difference between np.zeros((1, n)) and np.zeros(n)
row_vector = np.zeros((1, n))
vector = np.zeros(n)
print('Shape of row_vector: {0}'.format(row_vector.shape))
print('Shape of vector: {0}'.format(vector.shape))
The output is:
Shape of row_vector: (1, 10)
Shape of vector: (10,)
Contents of row_vector:
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
Contents of vector - Note the number of brackets compared to row_vector:
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Secondly, if I have to add them, how to do that?
It simply boils down to the number of dimensions of the array or tensor. The row_vector is a 2-dimensional array, while vector is a 1-dimensional array.
You can easily verify this by inspecting the ndim attribute,
e.g.
>>> row_vector.ndim
2
>>> vector.ndim
1
This additional dimension is very useful when working with tensor focused libraries such as TensorFlow and PyTorch.
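For instance, here is a minimal sketch of moving between the two representations (using the names from the question, with n = 10 as in its output):

import numpy as np

n = 10
vector = np.zeros(n)             # shape (n,)
row_vector = np.zeros((1, n))    # shape (1, n)

as_2d = vector[np.newaxis, :]    # (n,)   -> (1, n)
as_1d = row_vector.ravel()       # (1, n) -> (n,)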
The difference is that one is a 1D array and the other is 2D. You can see this by printing the x.ndim attribute or len(x.shape).
That being said, for many practical purposes there is no difference. Broadcasting aligns dimensions on the right edge, so you can add the two arrays directly using the + operator:
s = vector + row_vector
The result will be 2D, with shape (1, n).
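A quick sketch of that broadcast (assuming n = 10, as in the question):

import numpy as np

n = 10
vector = np.zeros(n)            # shape (n,)
row_vector = np.zeros((1, n))   # shape (1, n)

s = vector + row_vector         # (n,) broadcasts against (1, n)
print(s.shape)                  # (1, 10)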
There would be a big difference if the shapes were (n,) and (n, 1). Broadcasting would again align the shapes on the right, which would make the sum an outer sum of shape (n, n).
You can perform an outer sum on the existing arrays by accessing a view with a unit axis using np.newaxis:
vector[:, np.newaxis] + row_vector
Or, transposing the row vector instead:
vector + row_vector.T
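A quick shape check of the outer sum (same arrays as in the sketch above):

outer = vector[:, np.newaxis] + row_vector   # (n, 1) + (1, n)
print(outer.shape)                           # (10, 10)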
I have a 2D array of zeros with some positive integers at (1,6) and (2,7):
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
And I want to filter the array by a custom kernel:
[[1 0 1]
[0 1 0]
[0 1 0]]
I want to filter the array with this kernel, and when 2 or 3 of the ones in this kernel are multiplied by a positive integer, I want it to return the coordinates of the ones that were multiplied by 0.
I know from image analysis that it's easy to convolve a 2D array by a kernel but it doesn't yield the intermediate results. On the above 2D array, it would return (1,8) and (3,7).
Are there some package functions that I can use to make this process simple and easy, or will I have to implement it myself?
As always, all help is appreciated
Here is a numpy implementation to start with; you can probably improve its performance by modifying it.
Here, num_ones holds the lower and upper bounds on the number of kernel ones that must land on non-zero values, corresponding to when 2 or 3 of the ones in this kernel are multiplied by a positive integer.
import numpy as np

a = np.array([[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],
              [0.,0.,0.,0.,0.,0.,2.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],
              [0.,0.,0.,0.,0.,0.,0.,2.,0.,0.,0.,0.,0.,0.,0.,0.,0.],
              [0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.]])
kernel = np.array([[1.,0.,1.],
                   [0.,1.,0.],
                   [0.,1.,0.]])
sub_shape = kernel.shape
#threshold (lower and upper bounds) on the number of kernel ones hitting non-zero values
num_ones = [2,3]
#divide the matrix into sub_matrices of kernel size
view_shape = tuple(np.subtract(a.shape, sub_shape) + 1) + sub_shape
strides = a.strides + a.strides
sub_matrices = np.lib.stride_tricks.as_strided(a,view_shape,strides)
#convert non-zero elements to 1 (dummy representation);
#note this writes through the view and modifies a in place
sub_matrices[sub_matrices>0.] = 1.
#do the convolution (sliding-window sum with the kernel)
m = np.einsum('ij,klij->kl',kernel,sub_matrices)
#find sub_matrices that satisfy the non-zero elements' condition
filt = np.argwhere(np.logical_and(m>=num_ones[0], m<=num_ones[1]))
#for each sub_matrix, find the zero elements located at non-zero elements of the kernel
output = []
for [i,j] in filt:
    output.append(np.argwhere((sub_matrices[i,j,:,:]==0)*kernel) + [i, j])
output is a list of index arrays: each array holds the indices where your condition is met for the kernel application at location [i,j] of your image. If you wish to aggregate them all, you can stack all the arrays and take the unique rows (see the sketch below). I am not sure how you would like the output to be in case of multiple occurrences.
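A minimal sketch of that aggregation step (assuming output is non-empty):

#stack all per-window index arrays and keep the unique coordinate rows
all_hits = np.unique(np.vstack(output), axis=0)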
output:
output =
[[1 8]
[3 7]]
UPDATE: regarding einsum:
To learn about einsum, I would recommend this post: Understanding NumPy's einsum
sub_matrices is a 4-dimensional array. sub_matrices[k,l,:,:] is the sub-matrix of a starting at position [k,l] with the shape of kernel (we later changed all of its non-zero values to 1 for our purpose).
m = np.einsum('ij,klij->kl',kernel,sub_matrices) multiplies dimensions i and j of kernel into the last two dimensions i and j of the sub_matrices array (in other words, it element-wise multiplies kernel with the sub-matrices sub_matrices[k,l,:,:]) and sums all elements into m[k,l]. This is the 2D convolution (strictly speaking, cross-correlation, since the kernel is not flipped) of kernel with a.
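To the "package functions" part of the question: the same window sums m can be obtained without the manual striding. A minimal sketch using scipy (assuming the a and kernel arrays from above; correlate2d with mode='valid' computes exactly these sliding window sums):

from scipy.signal import correlate2d

m2 = correlate2d((a > 0).astype(float), kernel, mode='valid')
#m2 should match the einsum result m above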
I am trying to implement a custom distance metric for clustering. The code snippet looks like:
import numpy as np
from sklearn.cluster import KMeans, DBSCAN, MeanShift
def distance(x, y):
    # print(x, y) -> This x and y aren't one-hot vectors and is the source of this question
    match_count = 0.
    for xi, yi in zip(x, y):
        if float(xi) == 1. and xi == yi:
            match_count += 1
    return match_count

def custom_metric(x, y):
    # x, y are two vectors
    # distance(.,.) calculates count of elements when both xi and yi are True
    return distance(x, y)

vectorized_text = np.stack([[1, 0, 0, 1] * 100,
                            [1, 1, 1, 0] * 100,
                            [0, 1, 1, 0] * 100,
                            [0, 0, 0, 1] * 100] * 100)

dbscan = DBSCAN(min_samples=2, metric=custom_metric, eps=3, p=1).fit(vectorized_text)
The vectorized_text is a one-hot encoded feature matrix of size n_sample x n_features. But when custom_metric is called, one of x or y turns out to be a real-valued vector while the other remains one-hot. Both x and y should have been one-hot vectors. This causes custom_metric to return wrong results at run time, so the clustering is not correct.
Example of x and y in distance(x, y) method:
x = [0.5 0.5 0.5 ... 0.5 0.5]
y = [0. 0. 0. 1. 0. 0. ... 1. 0.]
Both should have been one-hot vectors.
Does anyone have an idea to go about this situation?
First of all, your distance is wrong.
Distances must return small values for similar vectors. You have defined a similarity, not a distance.
Secondly, naive Python code such as zip will perform extremely poorly. Python just does not optimize such code well; it will do all the work in the slow interpreter. Python speed is only okay if you vectorize everything. And in fact, this code can be vectorized trivially, and then it likely won't even matter whether your inputs are binary or float data. What you are computing in a very complicated fashion is nothing but the dot product of two vectors, isn't it?
Thus, your distance should probably look like this:
def distance(x, y):
    return x.shape[0] - np.dot(x, y)
Or whatever distance transformation you intend to use.
Now for your actual problem: my guess is that sklearn tries to accelerate your distance with a ball tree. That won't help much because of the poor performance of Python interpreter callbacks (in fact, you should probably precompute the entire distance matrix in one vectorised operation - something like dist = dim - X.dot(X.transpose())? Do the math yourself to figure out the equation). Other languages such as Java (e.g., the ELKI tool) are much better to extend this way, because of the way the hotspot JIT compiler can optimize and inline such calls everywhere.
To test the hypothesis that the sklearn ball tree is the cause of the odd values you are observing, try setting algorithm="brute" (see the documentation) to disable the ball tree. But in the end, you'll want to either precompute the entire distance matrix (if you can afford the O(n²) cost; see the sketch below), or switch to a different programming language (implementing your distance in Cython, for example, helps, but even there you'll likely see the data suddenly being numpy float arrays).
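A minimal sketch of the precomputed route (assuming the vectorized_text from the question and the dot-product-based distance above):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.stack([[1, 0, 0, 1] * 100,
              [1, 1, 1, 0] * 100,
              [0, 1, 1, 0] * 100,
              [0, 0, 0, 1] * 100] * 100)

#pairwise distances in one vectorised operation:
#dist[i, j] = n_features - <X[i], X[j]>
dist = X.shape[1] - X @ X.T

#note: with this definition the self-distance is not 0, which echoes
#the point above that this is a similarity dressed up as a distance
dbscan = DBSCAN(min_samples=2, metric="precomputed", eps=3).fit(dist)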
I don't get your question. If I have:
x = [1, 0, 1]
y = [0, 0, 1]
and I use:
def distance(x, y):
    # print(x, y) -> This x and y aren't one-hot vectors and is the source of this question
    match_count = 0.
    for xi, yi in zip(x, y):
        if float(xi) == 1. and xi == yi:
            match_count += 1
    return match_count
print(distance(x, y))
1.0
and on top of that, if you print x and y now:
x
[1, 0, 1]
y
[0, 0, 1]
so it is working?
I reproduced your code and I do get your error. I explain it better here:
He has a vectorized_text variable (an np.stack) which simulates a one-hot-encoded feature set (it only contains 0s and 1s). In the DBSCAN model, he uses a custom_metric function to calculate the distance. When the model is run, the custom metric function is expected to receive pairs of observations as they are, i.e. one-hot-encoded values; but instead, when those values are printed inside the distance function, only one of them is passed as-is, and the other one appears to be a list of real values, as he described in the question:
x = [0.5 0.5 0.5 ... 0.5 0.5]
y = [0. 0. 0. 1. 0. 0. ... 1. 0.]
Anyway, when I pass plain lists to fit, the function obtains the values as they are:
from sklearn.cluster import KMeans, DBSCAN, MeanShift

x = [1, 0, 1]
y = [0, 0, 1]
feature_set = [x * 5] * 5

def distance(x, y):
    # Printing here the values. Should be 0s and 1s
    print(x, y)
    match_count = 0.
    for xi, yi in zip(x, y):
        if float(xi) == 1. and xi == yi:
            match_count += 1
    return match_count

def custom_metric(x, y):
    # x, y are two vectors
    # distance(.,.) calculates count of elements when both xi and yi are True
    return distance(x, y)

dbscan = DBSCAN(min_samples=2, metric=custom_metric, eps=3, p=1).fit(feature_set)
Result:
[1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1.] ... [1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1.]
[1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1.] ... [1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1.]
I suggest using a pandas DataFrame or some other type of input and seeing if it works.
I would like to calculate the gradients of the output of a neural network with respect to the input. I have the following tensors:
Input: (num_timesteps, features)
Output: (num_timesteps, 1)
For the gradients from the inputs to the entire output vector I can use the following:
tf.gradients(Output, Input)
Since I would like to compute the gradients for every single time step, I would like to calculate
tf.gradients(Output[i], Input)
for every i.
What is the best way to do that?
First up, I suppose you mean the gradient of Output with respect to the Input.
Now, the result of both of these calls:
dO = tf.gradients(Output, Input)
dO_i = tf.gradients(Output[i], Input) (for any valid i)
will be a list with a single element - a tensor with the same shape as Input, namely a [num_timesteps, features] matrix. Also, the sum of all the matrices dO_i (over all valid i) is exactly the matrix dO.
With this in mind, back to your question. In many cases, individual rows from the Input are independent, meaning that Output[i] is calculated only from Input[i] and doesn't know other inputs (typical case: batch processing without batchnorm). If that is your case, then dO is going to give you all individual components dO_i at once.
This is because each dO_i matrix is going to look like this:
[[ 0. 0. 0.]
[ 0. 0. 0.]
...
[ 0. 0. 0.]
[ xxx xxx xxx] <- i-th row
[ 0. 0. 0.]
...
[ 0. 0. 0.]]
All rows are going to be 0, except for the i-th one. So just by computing one matrix dO, you can easily get every dO_i. This is very efficient.
However, if that's not your case and all Output[i] depend on all inputs, there's no way to extract individual dO_i just from their sum. You have no choice other than to calculate each gradient separately: just iterate over i and execute tf.gradients, as in the sketch below.
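A minimal sketch of that loop (TF1-style graph mode; Output, Input and num_timesteps are the names from the question):

import tensorflow as tf

#one tf.gradients call per output element; each call returns a
#[num_timesteps, features] tensor holding d Output[i] / d Input
per_step_grads = [tf.gradients(Output[i], Input)[0]
                  for i in range(num_timesteps)]

#stack into a single (num_timesteps, num_timesteps, features) tensor
dO_all = tf.stack(per_step_grads)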
I have computed a Jaccard similarity matrix with Python. I want to cluster from highest similarities to lowest, however, no matter what linkage function I use it produces the same dendrogram! I have a feeling that the function assumes my matrix contains original data, but I have already computed the similarity matrix. Is there any way to pass this similarity matrix through to the dendrogram so it plots correctly? Or am I going to have to output the matrix and simply do it with R? Passing through the original raw data is not possible, as I am computing similarities of words. Thanks for the help!
Here is some code:
import scipy.cluster.hierarchy as hcluster
from matplotlib.pyplot import show

SimMatrix = [[0.        , 0.09259259, 0.125     , 0.        , 0.08571429],
             [0.09259259, 0.        , 0.05555556, 0.        , 0.05128205],
             [0.125     , 0.05555556, 0.        , 0.03571429, 0.05882353],
             [0.        , 0.        , 0.03571429, 0.        , 0.        ],
             [0.08571429, 0.05128205, 0.05882353, 0.        , 0.        ]]

linkage = hcluster.complete(SimMatrix)  #doesn't matter what linkage...
dendro = hcluster.dendrogram(linkage)   #same plot for all types?
show()
If you run this code, you will see a dendrogram that is completely backwards. No matter what linkage type I use, it produces the same dendrogram. This intuitively cannot be correct!
Here's the solution. It turns out the SimMatrix first needs to be converted into a condensed distance vector (the flattened upper or lower triangle of the square matrix, excluding the diagonal).
You can see this in the code below:
import scipy.spatial.distance as ssd

#convert the square similarity matrix to a condensed vector
distVec = ssd.squareform(SimMatrix)
#turn similarities into distances before clustering
linkage = hcluster.linkage(1 - distVec)
dendro = hcluster.dendrogram(linkage)
show()
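As a quick follow-up sketch (assuming the distVec from above): with condensed distances, different linkage methods should now actually produce different trees.

#compare a few linkage methods on the condensed distance vector
for method in ("single", "complete", "average"):
    Z = hcluster.linkage(1 - distVec, method=method)
    print(method, Z[:, 2])  #merge heights for each method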