I'm trying to cluster time series, and I want to use sklearn's OPTICS. The documentation says the input array X should have shape (n_samples, n_features), but my array is of the form (n_samples, n_time_stamps, n_features). See the example in code further down.
My question is how I can use the Fit-function from OPTICS with a time series. I know that people have used OPTICS and DBSCAN with time series. I just can't figure out how they have implemented it. Any help will be much appreciated.
[[[t0_0, x0_0], [t0_1, x0_1], ... [t0_n_timestamps, x0_n_timestamps]],
 [[t1_0, x1_0], [t1_1, x1_1], ... [t1_n_timestamps, x1_n_timestamps]],
 ...
 [[t_n_samples_0, x_n_samples_0], [t_n_samples_1, x_n_samples_1], ... [t_n_samples_n_timestamps, x_n_samples_n_timestamps]]]
Given the following np.array as an input:
data = np.array([
[["00:00", 7], ["00:01", 37], ["00:02", 3]],
[["00:00", 27], ["00:01", 137], ["00:02", 33]],
[["00:00", 14], ["00:01", 17], ["00:02", 12]],
[["00:00", 15], ["00:01", 123], ["00:02", 11]],
[["00:00", 16], ["00:01", 12], ["00:02", 92]],
[["00:00", 17], ["00:01", 23], ["00:02", 22]],
[["00:00", 18], ["00:01", 23], ["00:02", 112]],
[["00:00", 100], ["00:01", 200], ["00:02", 301]],
[["00:00", 101], ["00:01", 201], ["00:02", 302]],
[["00:00", 102], ["00:01", 203], ["00:02", 303]],
[["00:00", 104], ["00:01", 207], ["00:02", 304]]])
I will proceed as follows:
import numpy as np
import pandas as pd
from sklearn.cluster import OPTICS

# save shape info in three separate variables
x, y, z = data.shape
# idea from https://stackoverflow.com/a/36235454/5050691
output_arr = np.column_stack((np.repeat(np.arange(x), y), data.reshape(x * y, -1)))
# create a df out of the arr
df = pd.DataFrame(output_arr)
# rename for understandability
df = df.rename(columns={0: 'index', 1: 'time', 2: 'value'})
# Change the orientation between rows and columns so that rows
# that contain time info become columns
df = df.pivot(index="index", columns="time", values="value")
df = df.rename_axis(None, axis=1).reset_index()
# get columns that refer to specific interval of time series
temporal_accessors = ["00:00", "00:01", "00:02"]
# extract the data used for clustering; cast to float because the mixed
# string/int input array made every value a string
data_for_clustering = df[temporal_accessors].to_numpy().astype(float)
# a set of exemplary params
params = {
"xi": 0.05,
"metric": "euclidean",
"min_samples": 3
}
clusterer = OPTICS(**params)
fitted = clusterer.fit(data_for_clustering)
cluster_labels = fitted.labels_
df["cluster"] = cluster_labels
# Note: density-based algorithms have a notion of a "noise cluster", which
# sklearn marks with the label -1. That's why cluster labels start at -1 for
# density-based clustering, and at 0 otherwise.
For the given data and the presented choice of params, you'll get the following clusters: [0 0 1 0 0 0 0 0 1 1 1]
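As a side note (my own suggestion, not the only way to do this): if you'd rather cluster on a distance between whole series instead of treating each timestamp as an independent feature, OPTICS also accepts metric="precomputed". Here is a minimal sketch with a plain Euclidean series-to-series distance; any other distance, e.g. DTW, could be plugged into the matrix instead:

import numpy as np
from sklearn.cluster import OPTICS

# one row of values per sample, in the original order of data
series = np.array([[7, 37, 3], [27, 137, 33], [14, 17, 12], [15, 123, 11],
                   [16, 12, 92], [17, 23, 22], [18, 23, 112],
                   [100, 200, 301], [101, 201, 302], [102, 203, 303],
                   [104, 207, 304]], dtype=float)

# pairwise distance matrix between whole series; Euclidean here, but any
# series-to-series distance could be used to fill it
diffs = series[:, None, :] - series[None, :, :]
dist_matrix = np.sqrt((diffs ** 2).sum(axis=-1))

clusterer = OPTICS(min_samples=3, metric="precomputed")
labels = clusterer.fit(dist_matrix).labels_
print(labels)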
The ignored_columns parameter (see link) lets you specify features that should be ignored when building a model.
When I build a simple ML model and analyze the feature importance, I can see that h2o ignores the column I specified during the training process. As shown below, column c is not used during training.
import pandas as pd
import h2o
from h2o.estimators import H2ODeepLearningEstimator
from h2o.grid.grid_search import H2OGridSearch
from h2o.estimators.random_forest import H2ORandomForestEstimator
h2o.init()
x = pd.DataFrame([[0, 1, 4], [5, 1, 6], [15, 2, 0], [25, 5, 32],
                  [35, 11, 89], [45, 15, 1], [55, 34, 3], [60, 35, 4]],
                 columns=['a', 'b', 'c'])
y = pd.DataFrame([4, 5, 20, 14, 32, 22, 38, 43], columns=['label'])
hf = h2o.H2OFrame(pd.concat([x, y], axis="columns"))
X = hf.col_names[:-1]
y = hf.col_names[-1]
model = H2ORandomForestEstimator(ignored_columns=['c'])
model.train(y=y, training_frame=hf)
model.varimp(use_pandas=True)
variable relative_importance scaled_importance percentage
0 b 33876.328125 1.000000 0.540893
1 a 28753.998047 0.848793 0.459107
However, when I turn on grid search for hyperparameter tuning, it does not seem to work.
params = {'max_depth': list(range(7, 16)), 'sample_rate': [0.8], }
criteria = {'strategy': 'RandomDiscrete', 'max_models': 4}
grid = H2OGridSearch(model=H2ORandomForestEstimator(ignored_columns=['c']),
                     search_criteria=criteria,
                     hyper_params=params)
grid.train(y=y, training_frame=hf)
best_model = grid.get_grid(sort_by='rmse', decreasing=False)[0]
best_model.varimp(use_pandas=True)
variable relative_importance scaled_importance percentage
0 a 33525.109375 1.000000 0.516545
1 b 23314.916016 0.695446 0.359230
2 c 8062.515137 0.240492 0.124225
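A possible workaround (my own suggestion, not verified against every H2O version) is to sidestep ignored_columns in the grid and instead pass the allowed feature columns explicitly via the x argument of grid.train, which also keeps c out of training:

# hypothetical workaround: list the features to use instead of the one to ignore
features = [c for c in hf.col_names[:-1] if c != 'c']

grid = H2OGridSearch(model=H2ORandomForestEstimator(),
                     search_criteria=criteria,
                     hyper_params=params)
grid.train(x=features, y=y, training_frame=hf)
grid.get_grid(sort_by='rmse', decreasing=False)[0].varimp(use_pandas=True)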
Suppose I have an array called view:
array([[[[ 7, 9],
[10, 11]],
[[19, 18],
[20, 16]]],
[[[24, 5],
[ 6, 10]],
[[18, 11],
[45, 12]]]])
As you may know from max-pooling, this is a view of the original input, and the kernel size is 2x2:
[[ 7,  9],     [[19, 18],
 [10, 11]]      [20, 16]]    ...
The goal is to find both the max values and their indices. However, argmax only works on a single axis, so I need to flatten view, i.e. use flatten = view.reshape(2, 2, 4):
array([[[ 7, 9, 10, 11], [19, 18, 20, 16]],
[[24, 5, 6, 10], [18, 11, 45, 12]]])
Now, with the help I get from my previous question, I can find indices of max using inds = flatten.argmax(-1):
array([[3, 2],
[0, 2]])
and values of max:
i, j = np.indices(flatten.shape[:-1])
flatten[i, j, inds]
>>> array([[11, 20],
[24, 45]])
The problem
The problem arises when I flatten the view array. Since view is a view of the original array, i.e. view = as_strided(original, newshape, newstrides), view and original share the same data. However, reshape breaks this, so any change made through the reshaped array is not reflected in original. This is problematic during backpropagation.
My question
Given the array view and the indices inds, I'd like to change the max values in view to 1000 without using reshape or any other operation that breaks the 'bond' between view and original. Thanks for any help!
Reproducible example
import numpy as np
from numpy.lib.stride_tricks import as_strided
# the hard-coded strides assume float64 (8 bytes per element)
original = np.array([[[7, 9, 19, 18], [10, 11, 20, 16]],
                     [[24, 5, 18, 11], [6, 10, 45, 12]]], dtype=np.float64)
view = as_strided(original, shape=(2, 1, 2, 2, 2), strides=(64, 32*2, 8*2, 32, 8))
I'd like to change max values of each kernel in view to, say, 1000, that can be reflected on original, i.e. if I run view[0,0,0,0,0]=1000, then the first element of both view and original are 1000.
How about this:
import numpy as np
view = np.array(
[[[[ 7, 9],
[10, 11]],
[[19, 18],
[20, 16]]],
[[[24, 5],
[ 6, 10]],
[[18, 11],
[45, 12]]]]
)
# Getting the indices of the max values
max0 = view.max(-2)
idx2 = view.argmax(-2)
idx2 = idx2.reshape(-1, idx2.shape[1])
idx3 = max0.argmax(-1).flatten()
idx2 = idx2[np.arange(idx3.size), idx3]
idx0 = np.arange(view.shape[0]).repeat(view.shape[1])
idx1 = np.arange(view.shape[1]).reshape(1, -1).repeat(view.shape[0], 0).flatten()
# Replacing the maximal values with 1000
view[idx0, idx1, idx2, idx3] = 1000
print(f'view = \n{view}')
output:
view =
[[[[ 7 9]
[ 10 1000]]
[[ 19 18]
[1000 16]]]
[[[1000 5]
[ 6 10]]
[[ 18 11]
[1000 12]]]]
Basically, idx{n} is the index of the maximal value in the last two dimensions for every matrix contained in the first two dimensions.
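A side note (my own addition, building on the reproducible example rather than the copy used above): reading indices out of a flattened copy is harmless; only the final write has to go through the real strided view so it lands in original's buffer. A minimal sketch:

import numpy as np
from numpy.lib.stride_tricks import as_strided

original = np.array([[[7, 9, 19, 18], [10, 11, 20, 16]],
                     [[24, 5, 18, 11], [6, 10, 45, 12]]], dtype=np.float64)
view = as_strided(original, shape=(2, 1, 2, 2, 2), strides=(64, 64, 16, 32, 8))

# reshape returns a *copy* here, which is fine for locating the maxima
flat = view.reshape(2, 1, 2, 4)
inds = flat.argmax(-1)                    # flat index of each window's max
r, c = np.unravel_index(inds, (2, 2))     # back to 2x2 in-window coordinates
i, j, w = np.indices(inds.shape)
# fancy assignment into the strided view writes through to original
view[i, j, w, r, c] = 1000
print(original)   # the max of every 2x2 window is now 1000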
So let's assume my body has the following extrinsic orientation in coordinate system A:
A = [20,30,40] # extrinsic xyz in degrees
And the following Orientation in the Coordinate System B:
B = [10, 25, 50]
So the transformation from A to B is:
T = [-10, -5, 10]
So that:
B = A + T
Now I want to do the same using scipy.Rotation:
from scipy.spatial.transform import Rotation
A = Rotation.from_euler('xyz', [20, 30, 40], degrees=True)
B = Rotation.from_euler('xyz', [10, 25, 50], degrees=True)
T = Rotation.from_euler('xyz', [-10, -5, 10], degrees=True)
result = A * T # This seems to be wrong?
print(result.as_euler('xyz', degrees=True)) # Output: [14.02609598 21.61478378 48.20912092]
Where is my mistake here? What am I doing wrong? I need to use scipy's Rotation because I will apply the same rotation, given in Euler angles, to quaternions too.
The transformation from A to B is incorrect. You need to be careful when working with rotations in 3D, as rotations about different axes do not commute with each other.
According to your understanding, the T in the following code should get the object to [0, 0, 0]. But it doesn't.
A = Rotation.from_euler('xyz', [20, 30, 40], degrees=True)
T = Rotation.from_euler('xyz', [-20, -30, -40], degrees=True)
result = A * T
print(result.as_euler('xyz', degrees=True))
# output: [-25.4441408 5.14593816 -4.1802616 ]
However, if you reverse the order of the rotations, you go to [0, 0, 0] as expected.
A = Rotation.from_euler('xyz', [20, 30, 40], degrees=True)
T = Rotation.from_euler('zyx', [-40, -30, -20], degrees=True)
result = A * T
print(result.as_euler('xyz', degrees=True))
# output: [ 4.77083202e-15 0.00000000e+00 -1.98784668e-15] practically [0,0,0]
The correct transformation from A to B will be T = [-14.74053552, -1.237896, 10.10094351]. Refer to the following.
A = Rotation.from_euler('xyz', [20, 30, 40], degrees=True)
AToZero = Rotation.from_euler('zyx', [-40, -30, -20], degrees=True)
ZeroToB = Rotation.from_euler('xyz', [10, 25, 50], degrees=True)
T = AToZero*ZeroToB
print(T.as_euler('xyz', degrees=True))
# output: [-14.74053552 -1.237896 10.10094351]
result = A * T
print(result.as_euler('xyz', degrees=True))
# output: [10. 25. 50.]
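As a side note (my own shorthand, not part of the original answer), the same relative rotation can be composed directly with Rotation.inv(), and the result carries over to quaternions unchanged:

from scipy.spatial.transform import Rotation

A = Rotation.from_euler('xyz', [20, 30, 40], degrees=True)
B = Rotation.from_euler('xyz', [10, 25, 50], degrees=True)

# T = A^-1 * B, so that A * T == B; equivalent to AToZero * ZeroToB above
T = A.inv() * B
print(T.as_euler('xyz', degrees=True))        # [-14.74053552  -1.237896   10.10094351]
print((A * T).as_euler('xyz', degrees=True))  # [10. 25. 50.]
print(T.as_quat())                            # same transformation as a quaternion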
I have a 3d array of values,
vals = np.array([
[
[10, 20, 30],
[40, 50, 60],
],
[
[15, 25, 35],
[45, 55, 65],
],
])
and a corresponding 3d array of coordinates
coords = np.array([
[
[0,1],
[0,2],
[1,1]
],
[
[0,0],
[1,1],
[1,2]
]
])
Each inner-most array of coords represents (x,y) coordinates corresponding to one of the 2d arrays within vals. For example, the coordinate [0,1] in coords corresponds to the value 20 and the coordinate [1,2] in coords corresponds to the value 65.
How do I use coords to subset vals in this manner?
I can solve this specific example like so
np.array([
vals[0][coords[0][:, 0], coords[0][:, 1]],
vals[1][coords[1][:, 0], coords[1][:, 1]]
])
array([[20, 30, 50],
[15, 55, 65]])
but obviously I'd like a more dynamic solution.
Funny how writing my questions always seems to lead me to an answer... Staring at the answer matrix,
array([[20, 30, 50],
[15, 55, 65]])
I asked myself, "how would I reproduce this matrix from raw index values?". For example, to extract the value 20, I know I can do
vals[0, 0, 1]
If I wanted to extract the first row of values in the answer, [20, 30, 50], I should do
vals[[0,0,0], [0,0,1], [1,2,1]]
Then to get the full answer matrix, I should do
vals[[[0,0,0],[1,1,1]], [[0,0,1],[0,1,1]], [[1,2,1],[0,1,2]]]
From here, I set my focus on producing those three index matrices. They can be constructed as follows:
i1 = np.arange(coords.shape[0])[:, None].repeat(coords.shape[1], axis=1)
i2 = coords[:,:,0]
i3 = coords[:,:,1]
# Thus the generalized solution
vals[i1, i2, i3]
This answer is extremely similar to the advanced indexing solution mentioned by @Psidom in the comments, but perhaps less elegant.
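For what it's worth, the repeat can also be avoided by letting broadcasting expand the leading index; this is just my own variation on the same advanced-indexing idea:

import numpy as np

vals = np.array([[[10, 20, 30], [40, 50, 60]],
                 [[15, 25, 35], [45, 55, 65]]])
coords = np.array([[[0, 1], [0, 2], [1, 1]],
                   [[0, 0], [1, 1], [1, 2]]])

# a (2, 1) column of batch indices broadcasts against the (2, 3) coordinate arrays
i1 = np.arange(coords.shape[0])[:, None]
print(vals[i1, coords[..., 0], coords[..., 1]])
# [[20 30 50]
#  [15 55 65]]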
I have a matrix M:
M = [[10, 1000],
[11, 200],
[15, 800],
[20, 5000],
[28, 100],
[32, 3000],
[35, 3500],
[38, 100],
[50, 5000],
[51, 100],
[55, 2000],
[58, 3000],
[66, 4000],
[90, 5000]]
And a matrix R:
[[10 20]
[32 35]
[50 66]
[90 90]]
I want to use the values in column 0 of matrix R as start value of a slice and the value in column 1 as end of a slice.
I want to calculate the sum between and including the ranges of these slices from the right column in matrix M.
Basically doing
M[0:4][:,1].sum() # Upper index +1 as I need upper bound including
M[5:7][:,1].sum() # Upper index +1 as I need upper bound including
and so on. 0 is the index of 10 and 3 is the index of 20. 5 would be the index of 32, 6 the index of 35.
I'm stuck on how to turn the start/end values from matrix R into indices of column 0 of matrix M, and then calculate the sum over each index range, including the upper and lower bounds.
Expected output:
[[10, 20, 7000], # 7000 = 1000+200+800+5000
[32, 35, 6500], # 6500 = 3000+3500
[50, 66, 14100], # 14100 = 5000+100+2000+3000+4000
[90, 90, 5000]] # 5000 = just 5000 as upper=lower boundary
Update: I can get the indices now using searchsorted. Now I just need to sum column 1 of matrix M between each start and end index.
start_indices = [0,5,8,13]
end_indices = [3,6,12,13]
Wondering if there is a more efficient way than applying a for loop?
EDIT: Found the answer here. Numpy sum of values in subarrays between pairs of indices
Use searchsorted to determine the correct indices and add.reduceat to perform the summation:
>>> idx = M[:, 0].searchsorted(R) + (0, 1)
>>> idx = idx.ravel()[:-1] if idx[-1, 1] == M.shape[0] else idx.ravel()
>>> result = np.add.reduceat(M[:, 1], idx)[::2]
>>> result
array([ 7000, 6500, 14100, 5000])
Details:
- Since you want to include the upper boundaries but Python excludes them, we have to add 1.
- reduceat cannot handle len(arg0) as an index, so we have to special-case that.
- reduceat computes all stretches between consecutive boundaries, so we have to discard every other one.
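Put together as a runnable snippet (with M and R converted to numpy arrays, which the answer above assumes; the final column_stack is just to reproduce the expected output layout from the question):

import numpy as np

M = np.array([[10, 1000], [11, 200], [15, 800], [20, 5000], [28, 100],
              [32, 3000], [35, 3500], [38, 100], [50, 5000], [51, 100],
              [55, 2000], [58, 3000], [66, 4000], [90, 5000]])
R = np.array([[10, 20], [32, 35], [50, 66], [90, 90]])

idx = M[:, 0].searchsorted(R) + (0, 1)   # +1 makes the upper bounds inclusive
idx = idx.ravel()[:-1] if idx[-1, 1] == M.shape[0] else idx.ravel()
sums = np.add.reduceat(M[:, 1], idx)[::2]

print(np.column_stack((R, sums)))
# [[   10    20  7000]
#  [   32    35  6500]
#  [   50    66 14100]
#  [   90    90  5000]]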
I think it would be better to show an example of the output you are expecting. If what you want to calculate using M[0:4][:,1].sum() is the sum 1000 + 200 + 800 + 5000, then this code might help:
import numpy as np
M = np.matrix([[10, 1000],
[11, 200],
[15, 800],
[20, 5000],
[28, 100],
[32, 3000],
[35, 3500],
[38, 100],
[50, 5000],
[51, 100],
[55, 2000],
[58, 3000],
[66, 4000],
[90, 5000]])
print(M[0:4][:,1].sum())