I am new to DTW and was trying to apply it to a dataset with ~700,000 rows and 9 features. I have two arrays (matrices) of the form:
[
[0 1 0 0 0 0 0 0 0],
[0 0 0 0 1 0 0 0 0],
...
[0 0 0 0 0 0 1 0 0],
[0 0 1 0 0 0 0 0 0],
]
I have explored the fastdtw and dtaidistance packages. fastdtw is able to produce an output distance for the above matrices in around 5 minutes. In addition, I want to visualize the results and apply hierarchical clustering, but I didn't find any functions in fastdtw for visualizing the path/results or for clustering.
dtaidistance does provide these functions, but it takes too long to run (for the same two series above, it was still running after 15-20 minutes). Is there any way to handle this? Or can I do clustering and visualization with the results of fastdtw?
I would really appreciate some help regarding this.
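For what it's worth, scipy can do the clustering and dendrogram steps from any precomputed pairwise distances, so the distances coming out of fastdtw can be fed straight into scipy.cluster.hierarchy. A minimal sketch (the dtw_distance below is a plain, slow DTW standing in for fastdtw.fastdtw, and the toy one-hot series are made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def dtw_distance(a, b):
    # plain O(n*m) DTW; fastdtw.fastdtw(a, b, dist=euclidean)[0]
    # returns the same kind of scalar, only approximately and faster
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# toy stand-ins for the one-hot sequences in the question
rng = np.random.default_rng(0)
series = [np.eye(9)[rng.integers(0, 9, size=20)] for _ in range(4)]

# condensed pairwise distance vector, in the order scipy's linkage expects
k = len(series)
condensed = np.array([dtw_distance(series[i], series[j])
                      for i in range(k) for j in range(i + 1, k)])

Z = linkage(condensed, method='average')          # hierarchical clustering
labels = fcluster(Z, t=2, criterion='maxclust')   # cut into 2 flat clusters
```

scipy.cluster.hierarchy.dendrogram(Z) then plots the tree, which covers the visualization side; only the warping-path plot itself is specific to dtaidistance.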
I have a collection of one-hot vectors (in numpy)
[[0 0 0 ... 0 0 0] [0 1 0 ... 0 0 0] [0 1 0 ... 0 0 0] ... [0 0 0 ... 0 0 0] [0 0 0 ... 1 0 0]]
My goal is to find the optimal path that reaches all of the vectors, starting from the first vector (which is all 0's), minimizing the number of steps. The path does not need to be continuous (i.e. if each vector has only one 1, the number of steps can just be the number of non-zero vectors).
Is there any existing method that optimizes this? It's kind of like a shortest path problem.
I am currently working on getting the metrics (classification report, confusion matrix) for a DL problem and I have stumbled across a problem.
My y_true is something like [1 0 0 0 1 0 0 1 0 0] (multilabel), with the ones indicating the correct labels (e.g. RED, BLUE, GREEN).
My model's output layer uses a sigmoid activation, so every score lands between 0 and 1. So far so good.
Now, the way I proceed is by getting the max value/score of the prediction with torch.max(outputs), then its index/position in the array, and then one-hot encoding it so that it resembles y_true.
My question is: if y_true has 2 or more ones (labels), then even if I predict one of them correctly, my one-hot encoding will always be counted as wrong (because [1 0 0 0 1 0 0 1 0 0] != [1 0 0 0 0 0 0 0 0 0]). I could take more than one score, but then I don't know how many labels y_true has each time.
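To make the mismatch concrete (toy numbers below, not actual model outputs):

```python
import numpy as np

y_true = np.array([1, 0, 0, 0, 1, 0, 0, 1, 0, 0])  # three active labels
scores = np.array([0.9, 0.1, 0.2, 0.1, 0.6, 0.2,
                   0.1, 0.7, 0.1, 0.3])             # made-up sigmoid outputs

# keeping only the single max score, as described above
y_pred = np.zeros_like(y_true)
y_pred[scores.argmax()] = 1                          # [1 0 0 0 0 0 0 0 0 0]

print((y_pred == y_true).all())   # False: the exact-match comparison fails
                                  # even though the top prediction is a true label
```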
What is the right way to proceed with this?
I'm trying to produce a confusion matrix for 2 binary images. These are extracted (using binary thresholding) from 2 bands in a GeoTiff image, although I think this information should be irrelevant.
import rasterio
import cv2
from sklearn.metrics import confusion_matrix

dataset = rasterio.open('NDBI.tif')
VH_26Jun2015 = dataset.read(1)
VH_30Sep2015 = dataset.read(3)
GND_Truth = dataset.read(7)
VH_diff = VH_26Jun2015 - VH_30Sep2015
ret, th1 = cv2.threshold(VH_diff, 0.02, 255, cv2.THRESH_BINARY)
print(confusion_matrix(GND_Truth, th1))
Error 1: I used the code above and ran into the problem mentioned here ValueError: multilabel-indicator is not supported for confusion matrix
I tried the argmax(axis=1) solution mentioned in that question and elsewhere, but it produced a 1983x1983 matrix. (This Error 1 is probably the same one the person in the question above ran into.)
print(confusion_matrix(GND_Truth.argmax(axis=1),th1.argmax(axis=1)))
Output:
[[8 2 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
I checked the contents of the GND_Truth and th1 and verified that they are binary.
numpy.unique(GND_Truth)
Output:
array([0., 1.], dtype=float32)
Error 2: I then tried ravel() instead, to flatten my binary images before passing them to confusion_matrix as shown below, but that resulted in a 3x3 matrix, whereas I'm expecting a 2x2 matrix.
print(confusion_matrix(GND_Truth.ravel().astype(int),th1.ravel().astype(int)))
Output:
[[16552434 0 2055509]
[ 6230317 0 1531602]
[ 0 0 0]]
Converting the data with astype(int) did not really make a difference. Can you please suggest what might be causing these 2 errors?
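For what it's worth, one plausible culprit for the 3x3 matrix: cv2.threshold with THRESH_BINARY writes 255 (not 1) into th1, so confusion_matrix sees three distinct class values {0, 1, 255}. Mapping 255 back to 1 before flattening gives the expected 2x2 matrix; a sketch with toy arrays standing in for the real rasters:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# toy stand-ins for the real rasters
GND_Truth = np.array([[0., 1.], [1., 0.]], dtype=np.float32)
th1 = np.array([[0., 255.], [255., 255.]], dtype=np.float32)  # THRESH_BINARY output

cm = confusion_matrix(GND_Truth.ravel().astype(int),
                      (th1 > 0).ravel().astype(int))  # map 255 -> 1
print(cm)
```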
I'm trying to write Python code to determine the number of possible permutations of a matrix where neighbouring elements can only be adjacent integer numbers. I also wish to know how many times each total set of numbers appears (by that I mean the same count of each integer across matrices, but not necessarily in the same arrangement).
Forgive me if I'm not being clear, or if my terminology isn't ideal! Consider a 5 x 5 zero matrix. This is an acceptable permutation, as every element is adjacent to an identical number.
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
25 x 0, 0 x 1, 0 x 2
The elements within the matrix can be changed to 1 or 2. Changing any of the elements to 1 would also be an acceptable permutation, as the 1 would be surrounded by an adjacent integer, 0. For example, changing the central [2,2] element of the matrix:
0 0 0 0 0
0 0 0 0 0
0 0 1 0 0
0 0 0 0 0
0 0 0 0 0
24 x 0, 1 x 1, 0 x 2
However, changing the [2,2] element in the centre to a 2 would mean that all of the elements surrounding it would have to switch to 1, as 2 is not adjacent to 0.
0 0 0 0 0
0 1 1 1 0
0 1 2 1 0
0 1 1 1 0
0 0 0 0 0
16 x 0, 8 x 1, 1 x 2
I want to know how many permutations are possible from that zeroed 5x5 matrix by changing the elements to 1 and 2, whilst keeping neighbouring elements as adjacent integers. In other words, any permutations where 0 and 2 are adjacent are not allowed.
I also wish to know how many matrices contain a given number of each integer. For example, both of the matrices below are 24 x 0, 1 x 1, 0 x 2. Over all permutations, I'd like to know how many correspond to this frequency of integers.
0 0 0 0 0    0 0 0 0 0
0 0 0 0 0    0 0 0 0 0
0 0 1 0 0    1 0 0 0 0
0 0 0 0 0    0 0 0 0 0
0 0 0 0 0    0 0 0 0 0
Again, sorry if I'm not being clear or my nomenclature is poor! Thanks for your time - I'd really appreciate some help with this, and any words or guidance would be kindly received.
Thanks,
Sam
First, what you're calling a permutation isn't one.
Secondly, your problem is that a naive brute-force solution would look at 3^25 = 847,288,609,443 possible combinations. (Somewhat fewer are valid, but probably still in the hundreds of billions.)
The right way to solve this is dynamic programming. For your basic problem, calculate, for each row index i from 0 to 4 and for each possible row you could place there, how many valid matrices could end in that row.
Add up the counts for all possible final rows, and you'll have your answer.
For the more detailed count, you need to split the tally, row by row, by the cumulative counts of each value you could have reached. But otherwise it is the same.
The straightforward version should require tens of thousands of operations. The detailed version might require millions. But this will be massively better than the hundreds of billions of operations that the naive recursive version takes.
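That row-by-row dynamic program can be sketched as follows (values in {0, 1, 2}; "adjacent" taken to mean the values differ by at most 1, both within a row and between vertically neighbouring cells):

```python
from itertools import product

def count_matrices(n, values=(0, 1, 2)):
    # all rows whose horizontal neighbours differ by at most 1
    rows = [r for r in product(values, repeat=n)
            if all(abs(a - b) <= 1 for a, b in zip(r, r[1:]))]

    def compatible(r1, r2):
        # vertically neighbouring cells must also differ by at most 1
        return all(abs(a - b) <= 1 for a, b in zip(r1, r2))

    # counts[r] = number of valid partial matrices whose last row is r
    counts = {r: 1 for r in rows}
    for _ in range(n - 1):
        counts = {r2: sum(c for r1, c in counts.items() if compatible(r1, r2))
                  for r2 in rows}
    return sum(counts.values())

print(count_matrices(5))
```

For the detailed count, the same pass works with counts keyed on (row, number of 1s so far, number of 2s so far) instead of just the row.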
Just look for some simpler rules:
1s can be distributed arbitrarily in the array, since the matrix so far consists only of 0s. 2s can be distributed arbitrarily as well, since only neighbouring elements must be either 1 or 2.
Thus there are f(x) = n! / x! possibilities to distribute 1s and 2s over the matrix.
So the total number of possible permutations is 2 * sum(x = 1 to n*n) f(x).
Calculating the number of possible permutations with a fixed number of 1s can easily be solved by simply calculating f(x).
The number of matrices with a fixed number of 2s and 1s is a bit more tricky. Here you can only rely on the fact that all mirrored versions of the matrix yield the same number of 1s and 2s and are valid. Apart from using that fact, you can only brute-force search for correct solutions.
I'm trying to write a function that will check for undirected percolation in a numpy array. In this case, undirected percolation occurs when there is some kind of path that the liquid can follow (the liquid can travel up, down, and sideways, but not diagonally). Below is an example of an array that could be given to us.
1 0 1 1 0
1 0 0 0 1
1 0 1 0 0
1 1 1 0 0
1 0 1 0 1
The result of percolation in this scenario is below.
1 0 1 1 0
1 0 0 0 0
1 0 1 0 0
1 1 1 0 0
1 0 1 0 0
In the scenario above, the liquid could follow a path and everything with a 1 currently would refill except for the 1's in positions [1,4] and [4,4].
The function I'm trying to write starts at the top of the array and checks to see if it's a 1. If it's a 1, it writes it to a new array. What I want it to do next is check the positions above, below, left, and right of the 1 that has just been assigned.
What I currently have is below.
def flow_from(sites, full, i, j):
    n = len(sites)
    if j >= 0 and j < n and i >= 0 and i < n:  # check that the position is in array bounds
        if sites[i, j] == 0:
            full[i, j] = 0
        else:
            full[i, j] = 1
            flow_from(sites, full, i, j + 1)
            flow_from(sites, full, i, j - 1)
            flow_from(sites, full, i + 1, j)
            flow_from(sites, full, i - 1, j)
In this case, sites is the original matrix (the first matrix shown above), full is the matrix being filled in with the flow result (the second matrix shown), and i and j are used to step through it.
Whenever I run this, I get an error that says "RuntimeError: maximum recursion depth exceeded in comparison." I looked into this and I don't think I need to adjust my recursion limit, but I have a feeling there's something blatantly obvious with my code that I just can't see. Any pointers?
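One cause of the blown recursion depth: flow_from(sites, full, i, j + 1) immediately calls flow_from(sites, full, i, j) back again, and nothing records that a cell was already handled, so the two cells ping-pong forever. Guarding on cells that are already marked full is one fix (a sketch, assuming sites and full are numpy arrays and the liquid enters along the top row):

```python
import numpy as np

def flow_from(sites, full, i, j):
    n = len(sites)
    if 0 <= i < n and 0 <= j < n:
        # stop at blocked cells, and at cells already marked full --
        # the second condition is what breaks the infinite mutual recursion
        if sites[i, j] == 1 and full[i, j] == 0:
            full[i, j] = 1
            flow_from(sites, full, i, j + 1)
            flow_from(sites, full, i, j - 1)
            flow_from(sites, full, i + 1, j)
            flow_from(sites, full, i - 1, j)

sites = np.array([[1, 0, 1, 1, 0],
                  [1, 0, 0, 0, 1],
                  [1, 0, 1, 0, 0],
                  [1, 1, 1, 0, 0],
                  [1, 0, 1, 0, 1]])
full = np.zeros_like(sites)
for j in range(len(sites)):          # liquid enters along the top row
    flow_from(sites, full, 0, j)
print(full)
```

Run on the example grid from the question, this reproduces the expected result, with the isolated 1's at [1,4] and [4,4] left unfilled.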
Forget about your code block. This is a known problem with a known solution in the scipy library. Adapting the code from this answer, and assuming your data is in an array named A:
import numpy as np
from scipy import ndimage

# Identify the clusters and their sizes
lw, num = ndimage.label(A)
area = ndimage.sum(A, lw, index=np.arange(lw.max() + 1))
print(A)
print(lw)
print(area)
This gives:
[[1 0 1 1 0]
[1 0 0 0 1]
[1 0 1 0 0]
[1 1 1 0 0]
[1 0 1 0 1]]
[[1 0 2 2 0]
[1 0 0 0 3]
[1 0 1 0 0]
[1 1 1 0 0]
[1 0 1 0 4]]
[ 0. 9. 2. 1. 1.]
That is, it's labeled all the "clusters" for you and identified their sizes! From here you can see that the clusters labeled 3 and 4 have size 1, which is what you want to filter away. This is a much more powerful approach, because now you can filter by any size.
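Filtering away the size-1 clusters can then be done by indexing area with the label image; a sketch using the same example array:

```python
import numpy as np
from scipy import ndimage

A = np.array([[1, 0, 1, 1, 0],
              [1, 0, 0, 0, 1],
              [1, 0, 1, 0, 0],
              [1, 1, 1, 0, 0],
              [1, 0, 1, 0, 1]])

lw, num = ndimage.label(A)                      # 4-connected cluster labels
area = ndimage.sum(A, lw, index=np.arange(lw.max() + 1))

# area[lw] broadcasts each cell's cluster size back onto the grid, so
# cells in size-1 clusters (and the background) get zeroed out
filtered = np.where(area[lw] > 1, A, 0)
print(filtered)
```

For this input, filtered matches the result matrix shown in the question above, with the two single-cell clusters removed.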