Python: bin sets of pairs of interleaved arrays

I have a set of pairs of numpy arrays. Each array in a pair is the same length, but arrays in different pairs have different lengths. An example of a pair of arrays from this set is:
Time: [5,8,12,17,100,121,136,156,200]
Score: [3,4,5,-10,-90,-80,-70,-40,10]
Another pair is:
Time: [6,7,9,15,199]
Score: [5,6,7,-11,-130]
I need to take an average (or perform binning) of all of these pairs based on the time. i.e. the time should be divided into intervals of 10 and the corresponding score(s) for each interval need to be averaged.
Thus, for the above 2 pairs, I want the following result:
Time: [1-10,11-20,21-30,31-40,41-50,...,191-200]
Score: [(3+4+5+6+7)/5, (5-10-11)/3, ...]
(times 12, 15 and 17 all fall in the 11-20 interval, so three scores are averaged there)
How can I do this? Is there a simpler way than binning every pair individually and then averaging? In other words, how do you bin one array based on the bins of another? That is, for an individual pair of arrays, how can I bin the time array into intervals of 10 and then use that result to bin the corresponding score array consistently?

You can use scipy.stats.binned_statistic. It is a generalization of a histogram: a histogram divides the space into bins and returns the count of points in each bin, while this function computes the sum, mean, median, or another statistic of the values that fall within each bin.
from scipy import stats
import numpy as np

T1 = [5, 8, 12, 17, 100, 121, 136, 156, 200]
S1 = [3, 4, 5, -10, -90, -80, -70, -40, 10]
T2 = [6, 7, 9, 15, 199]
S2 = [5, 6, 7, -11, -130]

# Merge all times and scores; binned_statistic pairs them positionally,
# so the merged arrays do not need to be sorted
Time = T1 + T2
Score = S1 + S2

# 20 bins of width 10 covering the range (0, 200]
output = stats.binned_statistic(Time, Score, statistic='mean', range=(0, 200), bins=20)
averages = output[0]

# Empty bins produce NaN; replace them with 0
print(np.nan_to_num(averages, nan=0.0))
# Output of this code:
# [  5.          -5.33333333   0.           0.           0.
#    0.           0.           0.           0.           0.
#  -90.           0.         -80.         -70.           0.
#  -40.           0.           0.           0.         -60.        ]
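The second element returned by binned_statistic contains the bin edges, which can be turned into interval labels like the ones in the question. A small sketch (the label format is just one possible choice):
bin_edges = output[1]  # 21 edges: 0, 10, 20, ..., 200
labels = ['%d-%d' % (lo + 1, hi) for lo, hi in zip(bin_edges[:-1], bin_edges[1:])]
# labels == ['1-10', '11-20', ..., '191-200']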
For more information, see the scipy.stats.binned_statistic documentation.

Related

Cross-Correlation for specific lags in python

I am currently using the correlate function from scipy.signal.
corrs = signal.correlate(x, y, mode="full")
where x and y are time-series arrays of the same length N. This call returns the cross-correlation of x and y at every possible lag (an array of length 2*N - 1).
Is there any other efficient way to compute the cross-correlations between x and y but only for certain lags, i.e. a function that only returns the first 10 lags?
Example:
x = [0., -0.11386133, 0.04422655, 0.03187104]
y = [0., 0.0970805, -0.02822892, -0.0661678]
corrs = [0., 0.00753395, 0.00028781, -0.01441102, 0.00339385, 0.00309406, 0.]
and I want a subset of corrs: for example, I want corrs[2:5].
Thanks in advance!
I tried implementing the cross-correlation as defined in the scipy docs, but it is way slower.
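One direct alternative (a sketch, not an existing scipy function) is to evaluate each requested lag as a dot product over the overlapping parts of the two arrays. For real-valued inputs this reproduces signal.correlate(x, y, mode="full") at the corresponding output indices and costs O(N) per lag, which pays off when only a few lags are needed. correlate_at below is a hypothetical helper:
import numpy as np

def correlate_at(x, y, indices):
    # Returns the values that signal.correlate(x, y, mode="full") would
    # place at the given indices (0 .. 2N-2), assuming real-valued
    # x and y of equal length N
    x, y = np.asarray(x), np.asarray(y)
    N = len(x)
    out = []
    for k in indices:
        s = k - (N - 1)  # shift of x relative to y at this output index
        if s >= 0:
            out.append(np.dot(x[s:], y[:N - s]))
        else:
            out.append(np.dot(x[:N + s], y[-s:]))
    return np.array(out)

# correlate_at(x, y, range(2, 5)) reproduces corrs[2:5] from the example.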

Different results using affinity propagation "precomputed" distance matrix

I am working with two-dimensional data:
X = array([[5.40310335, 0.        ],
           [6.86136114, 6.56225717],
           [0.        , 0.        ],
           ...,
           [5.88838732, 0.        ],
           [6.0003473 , 0.        ],
           [6.25971331, 0.        ]])
Looking for clusters using Euclidean distance, I run affinity propagation from scikit-learn on this raw data as follows:
af = AffinityPropagation(damping=.9, max_iter=300, random_state=0).fit(X)
obtaining 9 clusters as a result.
I understand that when you want to use another distance, you have to pass in the negative distance matrix and use affinity='precomputed', as follows:
af_c = AffinityPropagation(damping=.9, max_iter=300,
                           affinity='precomputed', random_state=0).fit(distM)
As distM I use the negative Euclidean distance matrix, calculated as follows:
distM = -distance_matrix(X, X)
np.fill_diagonal(distM, np.median(distM))
filling the diagonal with the median, since that is also the default preference value in the method.
Using this I get 34 clusters as a result, whereas I would expect 9, as when working with the default distance. I don't know whether I am misinterpreting how the distance matrix should be passed in, or whether the library does something different when everything is left at its defaults.
I would appreciate any help.
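A likely source of the discrepancy, judging from the scikit-learn documentation (which states that affinity='euclidean' uses the negative squared Euclidean distance): the precomputed matrix should contain negative squared distances to reproduce the default behavior. A minimal sketch, reusing X from the question:
import numpy as np
from scipy.spatial import distance_matrix
from sklearn.cluster import AffinityPropagation

# Negative SQUARED Euclidean distances, matching what affinity='euclidean'
# computes internally according to the scikit-learn docs
distM = -distance_matrix(X, X) ** 2
np.fill_diagonal(distM, np.median(distM))  # median as the preference

af_c = AffinityPropagation(damping=.9, max_iter=300,
                           affinity='precomputed', random_state=0).fit(distM)
With squared distances the precomputed run should agree much more closely with the default one; any remaining difference would come from how the preference (the diagonal) is set.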

How to plot a weighted graph of k-Neighbors in Python

I have created a weighted graph of k-Neighbors using scikit-learn, and I'm wondering if there is any way to plot it as a graph.
Here is the result of the computation, in the form of an array, which I want to plot:
array([[0.        , 2.08243189, 0.        , 3.42661108],
       [2.08243189, 0.        , 3.27141008, 0.        ],
       [0.        , 3.27141008, 0.        , 1.57294787],
       [0.        , 3.29779083, 1.57294787, 0.        ]])
I just need to get some visualization of data, that's all I need.
More details about the array:
Each row represents a node and each column represents the weight of connectivity of that node with the other nodes.
For example: second column of first row (2.08243189) is the weight of connectivity from first node to second node.
Another example: second row, second column (0): the weight of connectivity from node 2 to itself.
The numbers represent Euclidean distances.
Are you talking about something simple like this where the size of the point gives a visual indication of the relative weight compared to the other values? Assume the array is named ar:
import matplotlib.pyplot as plt

# Marker area grows exponentially with the weight, so heavier
# connections stand out clearly
for i in range(len(ar)):
    for j in range(len(ar)):
        v = ar[i, j]
        plt.scatter(i + 1, j + 1, lw=0, s=10 ** v)
plt.grid(True)
plt.xlabel('Row')
plt.ylabel('Column')
ticks = list(range(1, 1 + len(ar)))
plt.xticks(ticks)
plt.yticks(ticks)
plt.show()
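If a proper node-and-edge drawing is wanted instead of a scatter view, networkx (not used in the original answer) can build a graph directly from a weighted adjacency matrix. A minimal sketch, assuming the matrix above is a numpy array named ar:
import networkx as nx
import matplotlib.pyplot as plt

# Zero entries mean "no edge"; nonzero entries become weighted edges.
# (from_numpy_array is named from_numpy_matrix in older networkx releases.)
G = nx.from_numpy_array(ar)
pos = nx.spring_layout(G, seed=0)
weights = [d['weight'] for _, _, d in G.edges(data=True)]
nx.draw(G, pos, with_labels=True, width=weights)
plt.show()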

Tensorflow: Gradient Calculation from Input to Output

I would like to calculate the gradients of the output of a neural network with respect to the input. I have the following tensors:
Input: (num_timesteps, features)
Output: (num_timesteps, 1)
For the gradients from the inputs to the entire output vector I can use the following:
tf.gradients(Output, Input)
Since I would like to compute the gradients for every single time sample, I would like to calculate
tf.gradients(Output[i], Input)
for every i.
What is the best way to do that?
First up, I suppose you mean the gradient of Output with respect to the Input.
Now, the result of both of these calls:
dO = tf.gradients(Output, Input)
dO_i = tf.gradients(Output[i], Input)  # for any valid i
will be a list with a single element: a tensor with the same shape as Input, namely a [num_timesteps, features] matrix. Also, the sum of all matrices dO_i (over all valid i) is exactly the matrix dO.
With this in mind, back to your question. In many cases, individual rows of the Input are independent, meaning that Output[i] is calculated from Input[i] only and does not depend on the other inputs (a typical case: batch processing without batchnorm). If that is your case, then dO gives you all the individual components dO_i at once.
This is because each dO_i matrix is going to look like this:
[[ 0.   0.   0. ]
 [ 0.   0.   0. ]
 ...
 [ 0.   0.   0. ]
 [ xxx  xxx  xxx ]   <- i-th row
 [ 0.   0.   0. ]
 ...
 [ 0.   0.   0. ]]
All rows are going to be 0, except for the i-th one. So just by computing one matrix dO, you can easily get every dO_i. This is very efficient.
However, if that's not your case and every Output[i] depends on all inputs, there is no way to extract the individual dO_i from their sum alone. You then have no choice other than to calculate each gradient separately: just iterate over i and execute tf.gradients.
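A minimal sketch of that loop, assuming TF1-style graph mode as implied by the question (tf.gradients does not work under eager execution) and that num_timesteps is available as a Python integer:
# One gradient op per output time step; each entry has the shape of Input
dO_list = [tf.gradients(Output[i], Input)[0] for i in range(num_timesteps)]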

Dendrogram through scipy given a similarity matrix

I have computed a Jaccard similarity matrix in Python. I want to cluster from highest similarities to lowest; however, no matter what linkage function I use, it produces the same dendrogram! I have a feeling that the function assumes my matrix contains original data, but I have already computed the similarity matrix. Is there any way to pass this similarity matrix to the dendrogram so it plots correctly? Or will I have to output the matrix and simply do it in R? Passing in the original raw data is not possible, as I am computing similarities of words. Thanks for the help!
Here is some code:
SimMatrix = [[0.        , 0.09259259, 0.125     , 0.        , 0.08571429],
             [0.09259259, 0.        , 0.05555556, 0.        , 0.05128205],
             [0.125     , 0.05555556, 0.        , 0.03571429, 0.05882353],
             [0.        , 0.        , 0.03571429, 0.        , 0.        ],
             [0.08571429, 0.05128205, 0.05882353, 0.        , 0.        ]]
import scipy.cluster.hierarchy as hcluster
from matplotlib.pyplot import show

linkage = hcluster.complete(SimMatrix)  # doesn't matter what linkage...
dendro = hcluster.dendrogram(linkage)   # same plot for all types?
show()
If you run this code, you will see a dendrogram that is completely backwards. No matter what linkage type I use, it produces the same dendrogram. This intuitively can not be correct!
Here's the solution. It turns out that the square SimMatrix first needs to be converted into a condensed form (a flat vector holding the upper triangle of the matrix, excluding the zero diagonal).
You can see this in the code below:
import scipy.spatial.distance as ssd
import scipy.cluster.hierarchy as hcluster
from matplotlib.pyplot import show

# squareform condenses the square matrix into a vector of its off-diagonal
# entries; these are similarities, so 1 - simVec converts them to distances
simVec = ssd.squareform(SimMatrix)
linkage = hcluster.linkage(1 - simVec)
dendro = hcluster.dendrogram(linkage)
show()
