I have a matrix M:
M = [[10, 1000],
[11, 200],
[15, 800],
[20, 5000],
[28, 100],
[32, 3000],
[35, 3500],
[38, 100],
[50, 5000],
[51, 100],
[55, 2000],
[58, 3000],
[66, 4000],
[90, 5000]]
And a matrix R:
[[10 20]
[32 35]
[50 66]
[90 90]]
I want to use the values in column 0 of matrix R as the start of a slice and the values in column 1 as its end.
For each slice, I want to sum the values in the right column of matrix M, including both bounds.
Basically doing
M[0:4][:,1].sum() # upper index + 1 because the upper bound must be included
M[5:7][:,1].sum() # upper index + 1 because the upper bound must be included
and so on. 0 is the index of 10 and 3 is the index of 20; 5 is the index of 32 and 6 is the index of 35.
I'm stuck on how to convert the start/end values from matrix R into indices into column 0 of matrix M, and then how to calculate the sum over each index range, including the lower and upper bounds.
Expected output:
[[10, 20, 7000], # 7000 = 1000+200+800+5000
[32, 35, 6500], # 6500 = 3000+3500
[50, 66, 14100], # 14100 = 5000+100+2000+3000+4000
[90, 90, 5000]] # 5000 = just 5000 as upper=lower boundary
Update: I can get the indices now using searchsorted. Now I just need to sum column 1 of matrix M between each start and end index.
start_indices = [0,5,8,13]
end_indices = [3,6,12,13]
Wondering if there is a more efficient way than applying a for loop?
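For reference, the loop-based baseline described in the update could look like the minimal sketch below. It assumes M and R are NumPy arrays, that column 0 of M is sorted, and that every boundary in R actually occurs in column 0 of M; the name loop_sums is made up for illustration.

```python
import numpy as np

M = np.array([[10, 1000], [11, 200], [15, 800], [20, 5000], [28, 100],
              [32, 3000], [35, 3500], [38, 100], [50, 5000], [51, 100],
              [55, 2000], [58, 3000], [66, 4000], [90, 5000]])
R = np.array([[10, 20], [32, 35], [50, 66], [90, 90]])

# Positions of the range boundaries in the (sorted) first column of M
start_indices = np.searchsorted(M[:, 0], R[:, 0])
end_indices = np.searchsorted(M[:, 0], R[:, 1])

# Plain Python loop: e + 1 because the upper bound must be included
loop_sums = [int(M[s:e + 1, 1].sum()) for s, e in zip(start_indices, end_indices)]
print(loop_sums)  # [7000, 6500, 14100, 5000]
```

This is the version a vectorized solution would replace; it is fine for a handful of ranges but runs one Python-level iteration per range.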
EDIT: Found the answer in the question "Numpy sum of values in subarrays between pairs of indices".
Use searchsorted to determine the correct indices and add.reduceat to perform the summation:
>>> idx = M[:, 0].searchsorted(R) + (0, 1)
>>> idx = idx.ravel()[:-1] if idx[-1, 1] == M.shape[0] else idx.ravel()
>>> result = np.add.reduceat(M[:, 1], idx)[::2]
>>> result
array([ 7000, 6500, 14100, 5000])
Details:
Since you want to include the upper boundaries but Python slicing excludes them, we have to add 1.
reduceat cannot handle len(arg0) as an index, so we have to special-case that.
reduceat computes all stretches between consecutive boundaries, so we have to discard every other one.
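As a possible alternative to reduceat, the same sums can be obtained from a prefix sum (cumulative sum), which needs neither the special case for the last index nor the discarding step. This is a different technique from the one in the answer, shown only as a sketch; it assumes column 0 of M is sorted, and the names lo, hi and prefix are illustrative.

```python
import numpy as np

M = np.array([[10, 1000], [11, 200], [15, 800], [20, 5000], [28, 100],
              [32, 3000], [35, 3500], [38, 100], [50, 5000], [51, 100],
              [55, 2000], [58, 3000], [66, 4000], [90, 5000]])
R = np.array([[10, 20], [32, 35], [50, 66], [90, 90]])

keys, values = M[:, 0], M[:, 1]

# side="left" gives the first row >= start,
# side="right" gives one past the last row <= end (inclusive upper bound)
lo = np.searchsorted(keys, R[:, 0], side="left")
hi = np.searchsorted(keys, R[:, 1], side="right")

# prefix[k] is the sum of the first k values, so prefix[hi] - prefix[lo]
# is the sum of values[lo:hi]
prefix = np.concatenate(([0], np.cumsum(values)))
result = np.column_stack((R, prefix[hi] - prefix[lo]))
print(result)
```

Because prefix has one more entry than values, hi may equal len(values) without any extra handling.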
I think it would be better to show an example of the output you are expecting. If what you want to calculate with M[0:4][:,1].sum() is the sum 1000 + 200 + 800 + 5000, then this code might help:
import numpy as np
M = np.matrix([[10, 1000],
[11, 200],
[15, 800],
[20, 5000],
[28, 100],
[32, 3000],
[35, 3500],
[38, 100],
[50, 5000],
[51, 100],
[55, 2000],
[58, 3000],
[66, 4000],
[90, 5000]])
print(M[0:4][:,1].sum())
I'm writing a script to reduce the number of colors in a list by finding clusters. The problem I seem to run into is that the clusters will have different dimensions. Here is my starting point, after the original list of 6 colors has already been separated into 3 clusters:
import numpy
a = numpy.array([
[12, 44, 52],
[27, 0, 71],
[81, 99, 92]
])
b = numpy.array([
[ 12, 13, 93],
[128, 128, 128]
])
c = numpy.array([
[ 57, 14, 255]
])
clusters = numpy.array([a,b,c])
print(numpy.min(clusters, axis=1))
However now the function numpy.min() starts to throw an error - I suspect it's because of the differently sized arrays.
The cluster arrays will always have the shape (x, 3) (x number of colors, 3 components). I want to get an array with the minimums of all components of the colors in one cluster (n, 3) (n is number of clusters) - so array([12, 0, 52], [12, 13, 93], [57, 14, 255]) in this case.
Is there a way to do this? As I mentioned it works as long as all clusters have multiple values.
Since your arrays a, b and c don't have an equal shape, you can't put them in the same array (at least if you don't pad with some value). You could calculate the minimum first and then generate an array from these minima:
numpy.array([arr.min(axis=0) for arr in (a, b, c)])
Which gives you:
array([[ 12, 0, 52],
[ 12, 13, 93],
[ 57, 14, 255]])
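If the list comprehension becomes a bottleneck with many clusters, one possible loop-free variant is to stack all colors into a single (total, 3) array and reduce each block with numpy.minimum.reduceat. This is only a sketch: the names stacked and offsets are illustrative, and it assumes no cluster is empty.

```python
import numpy

a = numpy.array([[12, 44, 52], [27, 0, 71], [81, 99, 92]])
b = numpy.array([[12, 13, 93], [128, 128, 128]])
c = numpy.array([[57, 14, 255]])
clusters = [a, b, c]

# One tall array holding every color, cluster after cluster
stacked = numpy.vstack(clusters)

# Row index at which each cluster starts inside the stacked array: [0, 3, 5]
offsets = numpy.cumsum([0] + [len(arr) for arr in clusters[:-1]])

# Column-wise minimum over each block of rows
minima = numpy.minimum.reduceat(stacked, offsets, axis=0)
print(minima)
```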
I'm trying to cluster time series, and I want to use Sklearn OPTICS. The documentation says that the input X should have dimensions (n_samples, n_features). My array has the form (n_samples, n_time_stamps, n_features). An example is shown further down.
My question is how I can use the fit function of OPTICS with a time series. I know that people have used OPTICS and DBSCAN with time series; I just can't figure out how they implemented it. Any help will be much appreciated.
[[[t00, x00], [t01, x01], ... [t0_n_timestamps, x0_n_timestamps]],
[[t10, x10], [t11, x11], ... [t1_n_timestamps, x1_n_timestamps]],
.
.
.
[[t_n_samples_0, x_n_samples_0], [t_n_samples_1, x_n_samples_1], ... [t_n_samples_n_timestamps, x_n_samples_n_timestamps]]]
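Given the following np.array as an input:
# imports needed by the snippets below
import numpy as np
import pandas as pd
from sklearn.cluster import OPTICS
data = np.array([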
[["00:00", 7], ["00:01", 37], ["00:02", 3]],
[["00:00", 27], ["00:01", 137], ["00:02", 33]],
[["00:00", 14], ["00:01", 17], ["00:02", 12]],
[["00:00", 15], ["00:01", 123], ["00:02", 11]],
[["00:00", 16], ["00:01", 12], ["00:02", 92]],
[["00:00", 17], ["00:01", 23], ["00:02", 22]],
[["00:00", 18], ["00:01", 23], ["00:02", 112]],
[["00:00", 100], ["00:01", 200], ["00:02", 301]],
[["00:00", 101], ["00:01", 201], ["00:02", 302]],
[["00:00", 102], ["00:01", 203], ["00:02", 303]],
[["00:00", 104], ["00:01", 207], ["00:02", 304]]])
I will proceed as follows:
# save shape info in three separate variables
x, y, z = data.shape
# idea from https://stackoverflow.com/a/36235454/5050691
output_arr = np.column_stack((np.repeat(np.arange(x), y), data.reshape(x * y, -1)))
# create a df out of the arr
df = pd.DataFrame(output_arr)
# rename for understandability
df = df.rename(columns={0: 'index', 1: 'time', 2: 'value'})
# Change the orientation between rows and columns so that rows
# that contain time info become columns
df = df.pivot(index="index", columns="time", values="value")
# assign the result, otherwise this line has no effect
df = df.rename_axis(None, axis=1).reset_index()
# get columns that refer to specific interval of time series
temporal_accessors = ["00:00", "00:01", "00:02"]
# extract data that will be used to carry out clustering
data_for_clustering = df[temporal_accessors].to_numpy()
# a set of exemplary params
params = {
"xi": 0.05,
"metric": "euclidean",
"min_samples": 3
}
clusterer = OPTICS(**params)
fitted = clusterer.fit(data_for_clustering)
cluster_labels = fitted.labels_
df["cluster"] = cluster_labels
# Note: density-based algorithms have a notion of a "noise cluster", which is marked with
# -1 by sklearn algorithms. That's why starting index is -1 for density based clustering,
# and 0 otherwise.
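For the given data and the presented choice of params, you'll get the following cluster labels: [0 0 1 0 0 0 0 0 1 1 1]. Note that the sample index ends up stored as a string, so the pivoted rows are ordered lexicographically ("0", "1", "10", "2", ...), and the labels follow that order.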
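When every series shares the same timestamps, the pandas round trip may not be needed: the value columns can be flattened directly into the (n_samples, n_features) layout that OPTICS expects. The sketch below uses only a few of the rows above and builds the array with dtype=object (an assumption made here so that strings and numbers can be mixed safely); the name X is illustrative.

```python
import numpy as np

data = np.array([
    [["00:00", 7], ["00:01", 37], ["00:02", 3]],
    [["00:00", 27], ["00:01", 137], ["00:02", 33]],
    [["00:00", 100], ["00:01", 200], ["00:02", 301]],
    [["00:00", 101], ["00:01", 201], ["00:02", 302]],
], dtype=object)

n_samples, n_timestamps, _ = data.shape

# Drop the timestamp column (position 0) and flatten what is left,
# so that each time series becomes one row of features
X = data[:, :, 1:].astype(float).reshape(n_samples, -1)
print(X.shape)  # (4, 3)
print(X[0])     # [ 7. 37.  3.]
```

X keeps the original sample order, and it could then be passed to OPTICS(...).fit(X) in the same way as data_for_clustering above.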
My code combines values from two matrices and lists them side by side. It works as I need.
We are trying to remove the pairs where the 2 values are identical. This is easier to see in the example below.
my code
import os
import numpy as np
import sys
b=np.array([[13,14,15],
[22,23,24],
[31,32,33]])
#print(b)
d=np.array([100,200,300,400,500])
b[-1,:] = d[:b.shape[1]] # last row
b[:-1,-1] = d[b.shape[1]:]
val1 = np.hstack(b[::-1])
val2 = np.hstack([d[i:i+b.shape[1]] for i in range(b.shape[0])])
res = zip(val1, val2)
for i, j in res:
    l=[i, j]
    print(l)
my output
[100, 100]
[200, 200]
[300, 300]
[22, 200]
[23, 300]
[500, 400]
[13, 300]
[14, 400]
[400, 500]
I need to remove the pairs in my output that contain the same number twice, as you can see in the required output below.
The matrices will not always be the same, and the identical pairs will not always occur at the same iterations.
required output
[22, 200]
[23, 300]
[500, 400]
[13, 300]
[14, 400]
[400, 500]
Find where the values are different and only concatenate those values.
>>> # using val1 and val2 from the question
>>> mask = np.where(val1!=val2)
>>> mask
(array([3, 4, 5, 6, 7, 8], dtype=int64),)
>>> np.vstack((val1[mask],val2[mask]))
array([[ 22, 23, 500, 13, 14, 400],
[200, 300, 400, 300, 400, 500]])
>>> np.vstack((val1[mask],val2[mask])).T
array([[ 22, 200],
[ 23, 300],
[500, 400],
[ 13, 300],
[ 14, 400],
[400, 500]])
>>>
It is as simple as comparing the two arrays and using the result as a boolean index:
np.stack([val1, val2], axis=1)[val1 != val2]
I have an Nx2 matrix such as:
M = [[10, 1000],
[11, 200],
[15, 800],
[20, 5000],
[28, 100],
[32, 3000],
[35, 3500],
[38, 100],
[50, 5000],
[51, 100],
[55, 2000],
[58, 3000],
[66, 4000],
[90, 5000]]
I need to create a Nx3 matrix, that reflects the relationship of the rows from the first matrix in the following way:
Use the right column to identify candidates for range boundaries, the condition is value >= 1000
This condition applied to the matrix:
[[10, 1000],
[20, 5000],
[32, 3000],
[35, 3500],
[50, 5000],
[55, 2000],
[58, 3000],
[66, 4000],
[90, 5000],]
So far I came up with "M[M[:,1]>=1000]", which works. For this new matrix I now want to find the points in the first column whose distance to the next point is <= 10, and use these as range boundaries.
What I came up with so far: np.diff(M[:,0]) <= 10 (applied to the filtered matrix), which returns:
[True, False, True, False, True, True, True, False]
This is where I'm stuck. I want to use this condition to define lower and upper boundary of a range. For example:
[[10, 1000], #<- Range 1 start
[20, 5000], #<- Range 1 end (as 32 would be 12 points away)
[32, 3000], #<- Range 2 start
[35, 3500], #<- Range 2 end
[50, 5000], #<- Range 3 start
[55, 2000], #<- Range 3 cont (as 55 is only 5 points away)
[58, 3000], #<- Range 3 cont
[66, 4000], #<- Range 3 end
[90, 5000]] #<- Range 4 start and end (as there is no point +-10)
Lastly, referring back to the very first matrix, I want to add the right-column values together for each range within (and including) the boundaries.
So I have the four ranges which define start and stop for boundaries.
Range 1: Start 10, end 20
Range 2: Start 32, end 35
Range 3: Start 50, end 66
Range 4: Start 90, end 90
The resulting matrix would look like this, where column 0 is the start boundary, column 1 the end boundary and column 2 the added values from matrix M from the right column in between start and end.
[[10, 20, 7000], # 7000 = 1000+200+800+5000
[32, 35, 6500], # 6500 = 3000+3500
[50, 66, 14100], # 14100 = 5000+100+2000+3000+4000
[90, 90, 5000]] # 5000 = just 5000 as upper=lower boundary
I got stuck at the second step, after I get the true/false values for the range boundaries. How to create the ranges from the boolean values, and then how to add the values together within these ranges, is unclear to me. I would appreciate any suggestions. Also, I'm not sure about my approach; maybe there is a better way to get from the first to the last matrix, perhaps skipping a step?
EDIT
So, I came a bit further with the middle step, and I can now return the start and end values of the range:
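# M here is the filtered matrix, i.e. M[M[:,1]>=1000]
start_diffs = np.diff(M[:,0]) > 10
# the first row always starts a range
start_indexes = np.insert(start_diffs, 0, True)
end_diffs = np.diff(M[:,0]) > 10
# the last row always ends a range; np.insert(end_diffs, -1, True) would place
# the True before the last element, which only works if that element is True
end_indexes = np.append(end_diffs, True)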
start_values = M[:,0][start_indexes]
end_values = M[:,0][end_indexes]
print(np.array([start_values, end_values]).T)
Returns:
[[10 20]
[32 35]
[50 66]
[90 90]]
What is missing is somehow using these ranges now to calculate the sums from matrix M in the right column.
If you are open to using pandas, here's a solution that seems a bit over-thought in retrospect, but works:
# Imports needed by the snippets below
import numpy as np
import pandas as pd
# Initial array
M = np.array([[10, 1000],
[11, 200],
[15, 800],
[20, 5000],
[28, 100],
[32, 3000],
[35, 3500],
[38, 100],
[50, 5000],
[51, 100],
[55, 2000],
[58, 3000],
[66, 4000],
[90, 5000]])
# Build a DataFrame with default integer index and column labels
df = pd.DataFrame(M)
# Get a subset of rows that represent potential interval edges
subset = df[df[1] >= 1000].copy()
# If a row is the first row in a new range, flag it with 1.
# Then cumulatively sum these 1s. This labels each row with a
# unique integer, one per range
subset[2] = (subset[0].diff() > 10).astype(int).cumsum()
# Get the start and end values of each range
edges = subset.groupby(2).agg({0: ['first', 'last']})
edges
0
first last
2
0 10 20
1 32 35
2 50 66
3 90 90
# Build a pandas IntervalIndex out of these interval edges
tups = list(edges.itertuples(index=False, name=None))
idx = pd.IntervalIndex.from_tuples(tups, closed='both')
# Build a Series that maps each interval to a unique range number
mapping = pd.Series(range(len(idx)), index=idx)
# Apply this mapping to create a new column of the original df
# In pandas >= 0.25, idx.contains(i) returns one boolean per interval, hence .any()
df[2] = [mapping.loc[i] if idx.contains(i).any() else None for i in df[0]]
df
0 1 2
0 10 1000 0.0
1 11 200 0.0
2 15 800 0.0
3 20 5000 0.0
4 28 100 NaN
5 32 3000 1.0
6 35 3500 1.0
7 38 100 NaN
8 50 5000 2.0
9 51 100 2.0
10 55 2000 2.0
11 58 3000 2.0
12 66 4000 2.0
13 90 5000 3.0
# Group by this new column, get edges of each interval,
# sum values, and get the underlying numpy array
df.groupby(2).agg({0: ['first', 'last'], 1: 'sum'}).values
array([[ 10, 20, 7000],
[ 32, 35, 6500],
[ 50, 66, 14100],
[ 90, 90, 5000]])
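For comparison, the same labeling idea (assign each row a range number, then sum per label) can be sketched in NumPy alone. This is an assumption-laden sketch rather than a drop-in replacement: it assumes column 0 of M is sorted, and the names cand, gaps, label and inside are illustrative.

```python
import numpy as np

M = np.array([[10, 1000], [11, 200], [15, 800], [20, 5000], [28, 100],
              [32, 3000], [35, 3500], [38, 100], [50, 5000], [51, 100],
              [55, 2000], [58, 3000], [66, 4000], [90, 5000]])
keys, values = M[:, 0], M[:, 1]

# Candidate boundary points: rows whose value is >= 1000
cand = keys[values >= 1000]

# A gap > 10 between neighbouring candidates separates two ranges
gaps = np.diff(cand) > 10
starts = cand[np.insert(gaps, 0, True)]  # first candidate always starts a range
ends = cand[np.append(gaps, True)]       # last candidate always ends a range

# Label every row of M with the range whose start is the last one <= key,
# then keep only rows that do not lie beyond that range's end
label = np.searchsorted(starts, keys, side="right") - 1
inside = (label >= 0) & (keys <= ends[label])

# Sum the values per label; bincount returns floats, so cast back to int
sums = np.bincount(label[inside], weights=values[inside],
                   minlength=len(starts)).astype(int)
result = np.column_stack((starts, ends, sums))
print(result)
```

np.bincount with weights plays the role of the final groupby(...).sum() here, and rows such as 28 and 38 are dropped by the inside mask.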
Given an interval of time:
a = [20,40]
I need to convert it into equal intervals:
a = [[20,30],[30,40]]
I tried this code:
v1 = a[0]; v2 = a[1]
d.append(v1)
val = abs(v1-v2)
n = int(val/2)
for i in range(n):
    v1 += n
    d.append(v1)
print d
Can anyone suggest code to do this? It would be helpful.
I can point out a few incorrect things of what you've tried, instead of writing out the code for you.
for i in range(n):
    v1 += n
    d.append(v1)
Remember, from your example n is set to 10. So when you say for i in range(n), you will be iterating through your for loop 10 times.
And if you look at the way you append to d, this will not append a smaller list to an overall list. It will keep appending all the numbers to just one list.
I'm guessing this is the output you are currently getting: [20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120].
With what I said, give it another shot :-)
You can solve the first part by calculating an interval and using a loop to create a tuple in each iteration using that interval. This will give an answer corresponding to the desired result you listed.
However, notice that in your example the end of the previous tuple is the same as the start of the next. If you don't want them to intersect, then you need to use similar logic to the first, but add 1 at the right time.
Here is some code, the first using a for loop and the second using a post test loop:
def interval_divide(min, max, intervals):
    assert intervals != 0
    interval_size = round((max - min) / intervals)
    result = []
    start = min
    for start in range(min, max, interval_size):
        end = start + interval_size
        result.append([start, end])
    return result
a = [20, 40]
print("1 intervals", interval_divide(a[0], a[1], 1))
print("2 intervals", interval_divide(a[0], a[1], 2))
print("3 intervals", interval_divide(a[0], a[1], 3))
print("4 intervals", interval_divide(a[0], a[1], 4))
def interval_divide2(min, max, intervals):
    assert intervals != 0
    interval_size = round((max - min) / intervals)
    result = []
    start = min
    end = min + interval_size
    while True:
        result.append([start, end])
        start = end + 1
        end = end + interval_size
        if len(result) == intervals:
            break
    return result
print("-----")
print("1 intervals", interval_divide2(a[0], a[1], 1))
print("2 intervals", interval_divide2(a[0], a[1], 2))
print("3 intervals", interval_divide2(a[0], a[1], 3))
print("4 intervals", interval_divide2(a[0], a[1], 4))
The results will be as follows:
$ python3 intervals.py
1 intervals [[20, 40]]
2 intervals [[20, 30], [30, 40]]
3 intervals [[20, 27], [27, 34], [34, 41]]
4 intervals [[20, 25], [25, 30], [30, 35], [35, 40]]
-----
1 intervals [[20, 40]]
2 intervals [[20, 30], [31, 40]]
3 intervals [[20, 27], [28, 34], [35, 41]]
4 intervals [[20, 25], [26, 30], [31, 35], [36, 40]]
Note that with three intervals the last interval doesn't end at 40. This is because 20 cannot be divided by 3 without a remainder, so it's not possible for all the intervals to have the same integer size.
We can still improve our answer, though, by removing the rounding when we calculate the interval size (while still reporting the result as integers):
def interval_divide(min, max, intervals):
    assert intervals != 0
    interval_size = (max - min) / intervals
    result = []
    start = min
    end = start + interval_size
    while True:
        result.append([int(start), int(end)])
        start = end
        end = end + interval_size
        if len(result) == intervals:
            break
    return result
def interval_divide2(min, max, intervals):
    assert intervals != 0
    interval_size = (max - min) / intervals
    result = []
    start = min
    end = min + interval_size
    while True:
        result.append([int(start), int(end)])
        start = end + 1
        end = end + interval_size
        if len(result) == intervals:
            break
    return result
The new answers are:
1 intervals [[20, 40]]
2 intervals [[20, 30], [30, 40]]
3 intervals [[20, 26], [26, 33], [33, 40]]
4 intervals [[20, 25], [25, 30], [30, 35], [35, 40]]
-----
1 intervals [[20, 40]]
2 intervals [[20, 30], [31, 40]]
3 intervals [[20, 26], [27, 33], [34, 40]]
4 intervals [[20, 25], [26, 30], [31, 35], [36, 40]]
The three intervals are still not fully equal, but pretty close without displaying the answer using decimal places.
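If integer boundaries are all that is needed, one way to sidestep floating-point accumulation entirely is to compute each boundary with integer arithmetic. The following is a sketch, not part of the answer above: interval_divide_int is a made-up name, and it assumes integer inputs with lo <= hi. The interval widths then differ by at most one.

```python
def interval_divide_int(lo, hi, intervals):
    assert intervals > 0
    # The i-th boundary is lo + floor((hi - lo) * i / intervals),
    # computed with integer floor division only
    bounds = [lo + (hi - lo) * i // intervals for i in range(intervals + 1)]
    # Pair up consecutive boundaries; neighbours share an endpoint
    return [[bounds[i], bounds[i + 1]] for i in range(intervals)]


print(interval_divide_int(20, 40, 2))  # [[20, 30], [30, 40]]
print(interval_divide_int(20, 40, 3))  # [[20, 26], [26, 33], [33, 40]]
```

The last boundary is always exactly hi, since (hi - lo) * intervals // intervals equals hi - lo.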
Use this
import numpy as np
points = np.linspace(20, 40, num=2+1)
intervals = np.array([points[:-1], points[1:]]).transpose()
print(intervals)
and get a np.array:
[[20. 30.]
[30. 40.]]
Of course, for
points = np.linspace(10, 40, num=6+1)
intervals = np.array([points[:-1], points[1:]]).transpose()
print(intervals)
we have
[[10. 15.]
[15. 20.]
[20. 25.]
[25. 30.]
[30. 35.]
[35. 40.]]