Does KNeighborsClassifier compare lists with different sizes? - python

I have to use scikit-learn's KNeighborsClassifier to compare time series using a user-defined function in Python.
knn = KNeighborsClassifier(n_neighbors=1,weights='distance',metric='pyfunc',func=dtw_dist)
The problem is that KNeighborsClassifier doesn't seem to support my training data. They are time series, so they are lists with different sizes. KNeighborsClassifier gives me this error message when I try to use the fit method (knn.fit(X,Y)):
ValueError: data type not understood
It seems KNeighborsClassifier only supports same-size training sets (only time series with the same length would be accepted, but that is not my case), yet my teacher told me to use KNeighborsClassifier. So I don't know what to do...
Any ideas?

Two (or one...) options as far as I can tell:
Precompute the distances (this does not seem to be directly supported by KNeighborsClassifier; other algorithms do support it, e.g., Spectral Clustering).
Convert your data to be square by padding with NaNs, and handle these accordingly in your custom distance function.
'Square' your data using NaNs
So, option 2 it is.
Say we have the following data, where every row represents a time series:
import numpy as np
series = [
    [1, 2, 3, 4],
    [1, 2, 3],
    [1],
    [1, 2, 3, 4, 5, 6, 7, 8],
]
We simply make the data square by adding nans:
def make_square(jagged):
    # Careful: this mutates the series list of lists
    max_cols = max(map(len, jagged))
    for row in jagged:
        row.extend([None] * (max_cols - len(row)))
    return np.array(jagged, dtype=float)
make_square(series)
array([[ 1., 2., 3., 4., nan, nan, nan, nan],
[ 1., 2., 3., nan, nan, nan, nan, nan],
[ 1., nan, nan, nan, nan, nan, nan, nan],
[ 1., 2., 3., 4., 5., 6., 7., 8.]])
Now the data 'fits' into the algorithm. You just have to adapt your distance function to account for the NaNs.
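For instance, a minimal sketch of such an adaptation (assuming dtw_dist is the DTW function from the question) could simply strip the NaN padding before delegating to it:
def nan_padded_dist(row1, row2):
    # Hypothetical wrapper: drop the NaN padding so it does not affect the distance
    s1 = row1[~np.isnan(row1)]
    s2 = row2[~np.isnan(row2)]
    return dtw_dist(s1, s2)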
Precompute and use a cache function
Oh we can probably do option 1 too (assuming you have N time series):
Precompute the distances into a (N, N) distance matrix D
Create a (N, 1) matrix that is just a range between [0, N) (i.e., the index of the series in the distance matrix)
Create a distance function wrapper
Use this wrapper as the distance function.
wrapper function:
def wrapper(row1, row2):
    # might have to fiddle a bit here, but I think this retrieves the indices
    i1, i2 = row1[0], row2[0]
    return D[i1, i2]
OK, I hope it's clear.
Complete example
#!/usr/bin/env python2.7
# encoding: utf-8
'''
'''
from mlpy import dtw_std  # I don't know if you are using this one: it doesn't matter.
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
# Example data
series = [
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [1, 2, 3],
    [1],
    [1, 2, 3, 4, 5, 6, 7, 8],
    [1, 2, 5, 6, 7, 8],
    [1, 2, 4, 5, 6, 7, 8],
]
# I don't know... these seemed to make sense to me!
y = np.array([0, 0, 0, 0, 1, 2, 2, 2])
# Compute the distance matrix
N = len(series)
D = np.zeros((N, N))
for i in range(N):
    for j in range(i + 1, N):
        D[i, j] = dtw_std(series[i], series[j])
        D[j, i] = D[i, j]
print D
# Create the fake data matrix: just the indices of the timeseries
X = np.arange(N).reshape((N, 1))
# Create the wrapper function that returns the correct distance
def wrapper(row1, row2):
    # cast to int to prevent warnings: sklearn converts our integer indices to floats
    i1, i2 = int(row1[0]), int(row2[0])
    return D[i1, i2]
# Only the ball_tree algorithm seems to accept a custom function
knn = KNeighborsClassifier(weights='distance', algorithm='ball_tree', metric='pyfunc', func=wrapper)
knn.fit(X, y)
print knn.kneighbors(X[0])
# (array([[ 0., 0., 0., 1., 6.]]), array([[1, 2, 0, 3, 4]]))
print knn.predict(X)
# [0 0 0 0 1 2 2 2]

Related

Numpy: Finding minimum and maximum values from associations through binning

Prerequisite
This is a question derived from this post. So, some of the introduction of the problem will be similar to that post.
Problem
Let's say result is a 2D array and values is a 1D array. values holds some values associated with each element in result. The mapping of an element in values to result is stored in x_mapping and y_mapping. A position in result can be associated with different values. Now, I have to find the minimum and maximum of the values grouped by associations.
An example for better clarification.
min_result array:
[[0, 0],
[0, 0],
[0, 0],
[0, 0]]
max_result array:
[[0, 0],
[0, 0],
[0, 0],
[0, 0]]
values array:
[ 1., 2., 3., 4., 5., 6., 7., 8.]
Note: here the result arrays and values have the same number of elements, but that might not be the case; there is no relation between the sizes at all.
x_mapping and y_mapping have mappings from 1D values to 2D result(both min and max). The sizes of x_mapping, y_mapping and values will be the same.
x_mapping - [0, 1, 0, 0, 0, 0, 0, 0]
y_mapping - [0, 3, 2, 2, 0, 3, 2, 1]
Here, the 1st value (values[0]) and the 5th value (values[4]) have x as 0 and y as 0 (x_mapping[0] and y_mapping[0]) and hence are associated with result[0, 0]. If we compute the minimum and maximum from this group, we will have 1 and 5 as results respectively. So, min_result[0, 0] will have 1 and max_result[0, 0] will have 5.
Note that if there is no association at all then the default value for result will be zero.
Current working solution
x_mapping = np.array([0, 1, 0, 0, 0, 0, 0, 0])
y_mapping = np.array([0, 3, 2, 2, 0, 3, 2, 1])
values = np.array([ 1., 2., 3., 4., 5., 6., 7., 8.], dtype=np.float32)
max_result = np.zeros([4, 2], dtype=np.float32)
min_result = np.zeros([4, 2], dtype=np.float32)
min_result[-y_mapping, x_mapping] = values # randomly initialising from values
for i in range(values.size):
    x = x_mapping[i]
    y = y_mapping[i]
    # maximum
    if values[i] > max_result[-y, x]:
        max_result[-y, x] = values[i]
    # minimum
    if values[i] < min_result[-y, x]:
        min_result[-y, x] = values[i]
min_result:
[[1., 0.],
 [6., 2.],
 [3., 0.],
 [8., 0.]]
max_result:
[[5., 0.],
 [6., 2.],
 [7., 0.],
 [8., 0.]]
Failed solutions
#1
min_result = np.zeros([4, 2], dtype=np.float32)
np.minimum.reduceat(values, [-y_mapping, x_mapping], out=min_result)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-17-126de899a90e> in <module>()
1 min_result = np.zeros([4, 2], dtype=np.float32)
----> 2 np.minimum.reduceat(values, [-y_mapping, x_mapping], out=min_result)
ValueError: object too deep for desired array
#2
min_result = np.zeros([4, 2], dtype=np.float32)
np.minimum.reduceat(values, lidx, out= min_result)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-24-07e8c75ccaa5> in <module>()
1 min_result = np.zeros([4, 2], dtype=np.float32)
----> 2 np.minimum.reduceat(values, lidx, out= min_result)
ValueError: operands could not be broadcast together with remapped shapes [original->remapped]: (4,2)->(4,) (8,)->() (8,)->(8,)
#3
lidx = ((-y_mapping) % 4) * 2 + x_mapping #from mentioned post
min_result = np.zeros([8], dtype=np.float32)
np.minimum.reduceat(values, lidx, out= min_result).reshape(4,2)
[[1., 4.],
[5., 5.],
[1., 3.],
[5., 7.]]
Question
How to use np.minimum.reduceat and np.maximum.reduceat for solving this problem? I'm looking for a solution that is optimised for runtime.
Side note
I'm using Numpy version 1.14.3 with Python 3.5.2
Approach #1
Again, the most intuitive approach would be with numpy.ufunc.at.
Now, since these reductions are performed against the existing values, we need to initialize the output with max values for the minimum reductions and min values for the maximum ones. Hence, the implementation would be -
min_result[-y_mapping, x_mapping] = values.max()
max_result[-y_mapping, x_mapping] = values.min()
np.minimum.at(min_result, [-y_mapping, x_mapping], values)
np.maximum.at(max_result, [-y_mapping, x_mapping], values)
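As a minimal, self-contained run of this approach (using the arrays from the question; seeding only the mapped cells is my own assumption so that unmapped cells keep the default 0):
import numpy as np

x_mapping = np.array([0, 1, 0, 0, 0, 0, 0, 0])
y_mapping = np.array([0, 3, 2, 2, 0, 3, 2, 1])
values = np.array([1., 2., 3., 4., 5., 6., 7., 8.], dtype=np.float32)

min_result = np.zeros([4, 2], dtype=np.float32)
max_result = np.zeros([4, 2], dtype=np.float32)

# Seed only the mapped cells so the reductions start from a neutral value,
# while unmapped cells keep the default 0
min_result[-y_mapping, x_mapping] = values.max()
max_result[-y_mapping, x_mapping] = values.min()

np.minimum.at(min_result, (-y_mapping, x_mapping), values)
np.maximum.at(max_result, (-y_mapping, x_mapping), values)
# min_result -> [[1., 0.], [6., 2.], [3., 0.], [8., 0.]]
# max_result -> [[5., 0.], [6., 2.], [7., 0.], [8., 0.]]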
Approach #2
To leverage np.ufunc.reduceat, we need to sort data -
m, n = max_result.shape
out_dtype = max_result.dtype
lidx = ((-y_mapping) % m) * n + x_mapping
sidx = lidx.argsort()
idx = lidx[sidx]
val = values[sidx]
m_idx = np.flatnonzero(np.r_[True, idx[:-1] != idx[1:]])
unq_ids = idx[m_idx]
# allocate the outputs (zeros, matching the question's default for unmapped cells)
max_result_out = np.zeros((m, n), dtype=out_dtype)
min_result_out = np.zeros((m, n), dtype=out_dtype)
max_result_out.flat[unq_ids] = np.maximum.reduceat(val, m_idx)
min_result_out.flat[unq_ids] = np.minimum.reduceat(val, m_idx)

Python : Mapping values to other values without gap

I have the following question: is there some kind of method in numpy or scipy which I can use to turn a given unsorted array like this
a = np.array([0,0,1,1,4,4,4,4,5,1891,7]) #could be any number here
into something where the numbers are mapped so that there is no gap between the values and they stay in the same order as before?
[0,0,1,1,2,2,2,2,3,5,4]
EDIT
Is it furthermore possible to swap/shuffle the numbers after the mapping, so that
[0,0,1,1,2,2,2,2,3,5,4]
becomes something like:
[0,0,3,3,5,5,5,5,4,1,2]
Edit: I'm not sure what the etiquette is here (should this be a separate answer?), but this is actually directly obtainable from np.unique.
>>> u, indices = np.unique(a, return_inverse=True)
>>> indices
array([0, 0, 1, 1, 2, 2, 2, 2, 3, 5, 4])
Original answer: This isn't too hard to do in plain python by building a dictionary of what index each value of the array would map to:
x = np.sort(np.unique(a))
index_dict = {j: i for i, j in enumerate(x)}
[index_dict[i] for i in a]
It seems you need to dense-rank your array, in which case use scipy.stats.rankdata:
from scipy.stats import rankdata
rankdata(a, 'dense')-1
# array([ 0., 0., 1., 1., 2., 2., 2., 2., 3., 5., 4.])
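For the follow-up "shuffle" step from the EDIT, a hedged sketch: once the values are densely ranked, any permutation of the labels can be applied by lookup (the permutation below is just one choice that happens to reproduce the example output):
import numpy as np

a = np.array([0, 0, 1, 1, 4, 4, 4, 4, 5, 1891, 7])
_, dense = np.unique(a, return_inverse=True)   # [0 0 1 1 2 2 2 2 3 5 4]

perm = np.array([0, 3, 5, 4, 2, 1])            # an arbitrary relabelling of the 6 labels
print(perm[dense])                             # [0 0 3 3 5 5 5 5 4 1 2]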

Convert a Pandas DataFrame to a multidimensional ndarray

I have a DataFrame with columns for the x, y, z coordinates and the value at this position and I want to convert this to a 3-dimensional ndarray.
To make things more complicated, not all values exist in the DataFrame (these can just be replaced by NaN in the ndarray).
Just a simple example:
df = pd.DataFrame({'x': [1, 2, 1, 3, 1, 2, 3, 1, 2],
'y': [1, 1, 2, 2, 1, 1, 1, 2, 2],
'z': [1, 1, 1, 1, 2, 2, 2, 2, 2],
'value': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
Should result in the ndarray:
array([[[ 1., 2., nan],
[ 3., nan, 4.]],
[[ 5., 6., 7.],
[ 8., 9., nan]]])
For two dimensions, this is easy:
array = df.pivot_table(index="y", columns="x", values="value").as_matrix()
However, this method can not be applied to three or more dimensions.
Could you give me some suggestions?
Bonus points if this also works for more than three dimensions, handles multiple defined values (by taking the average) and ensures that all x, y, z coordinates are consecutive (by inserting rows/columns of NaN when a coordinate is missing).
EDIT: Some more explanations:
I read data from a CSV file which has columns for the x, y, z coordinates, optionally the frequency, and the measurement value at this point and frequency. Then I round the coordinates to a specified precision (e.g. 0.1 m) and want to get an ndarray which contains the averaged measurement values at each (rounded) coordinate. The indices of the values do not need to coincide with the locations; however, they need to be in the correct order.
EDIT: I just ran a quick performance test:
jakevdp's solution takes 1.598 s, Divakar's solution takes 7.405 s, JohnE's solution takes 7.867 s and Wen's solution takes 6.286 s to complete.
You can use a groupby followed by the approach from Transform Pandas DataFrame with n-level hierarchical index into n-D Numpy array:
grouped = df.groupby(['z', 'y', 'x'])['value'].mean()
# create an empty array of NaN of the right dimensions
shape = tuple(map(len, grouped.index.levels))
arr = np.full(shape, np.nan)
# fill it using Numpy's advanced indexing
arr[grouped.index.labels] = grouped.values.flat
print(arr)
# [[[ 1. 2. nan]
# [ 3. nan 4.]]
#
# [[ 5. 6. 7.]
# [ 8. 9. nan]]]
Here's one NumPy approach -
def dataframe_to_array_averaged(df):
    arr = df[['z', 'y', 'x']].values
    arr -= arr.min(0)
    out_shp = arr.max(0) + 1
    L = np.prod(out_shp)
    val = df['value'].values
    ids = np.ravel_multi_index(arr.T, out_shp)
    avgs = np.bincount(ids, val, minlength=L) / np.bincount(ids, minlength=L)
    return avgs.reshape(out_shp)
Note that this shows a warning, because places with no x,y,z triplets have zero counts and hence the average values there would be 0/0 = NaN; but since that's the expected output for those places, you can ignore the warning. To avoid the warning, we can employ indexing, as discussed in the second method (Alternative method).
Sample run -
In [106]: df
Out[106]:
value x y z
0 1 1 1 1 # <=== this is repeated
1 2 2 1 1
2 3 1 2 1
3 4 3 2 1
4 5 1 1 2
5 6 2 1 2
6 7 3 1 2
7 8 1 2 2
8 9 2 2 2
9 4 1 1 1 # <=== this is repeated
In [107]: dataframe_to_array_averaged(df)
__main__:42: RuntimeWarning: invalid value encountered in divide
Out[107]:
array([[[ 2.5, 2. , nan],
[ 3. , nan, 4. ]],
[[ 5. , 6. , 7. ],
[ 8. , 9. , nan]]])
Alternative method
To avoid the warning, an alternative way would be like so -
out = np.full(out_shp, np.nan)
sums = np.bincount(ids, val)
unq_ids, count = np.unique(ids, return_counts=True)
out.flat[unq_ids] = sums[unq_ids]
out.flat[unq_ids] /= count
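Put together, a hypothetical warning-free variant of the function above (the name is mine, not from the answer) might look like this -
def dataframe_to_array_averaged_no_warning(df):
    # Same logic as dataframe_to_array_averaged, but averaging only the cells that occur
    arr = df[['z', 'y', 'x']].values
    arr -= arr.min(0)
    out_shp = tuple(arr.max(0) + 1)
    val = df['value'].values
    ids = np.ravel_multi_index(arr.T, out_shp)
    out = np.full(out_shp, np.nan)
    sums = np.bincount(ids, val)
    unq_ids, count = np.unique(ids, return_counts=True)
    out.flat[unq_ids] = sums[unq_ids] / count
    return out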
Another solution is to use the xarray package:
import pandas as pd
import xarray as xr
df = pd.DataFrame({'x': [1, 2, 1, 3, 1, 2, 3, 1, 2],
'y': [1, 1, 2, 2, 1, 1, 1, 2, 2],
'z': [1, 1, 1, 1, 2, 2, 2, 2, 2],
'value': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
df = pd.pivot_table(df, values='value', index=['x', 'y', 'z'])
xrTensor = xr.DataArray(df).unstack("dim_0")
array = xrTensor.values[0].T
print(array)
Output:
array([[[ 1., 2., nan],
[ 3., nan, 4.]],
[[ 5., 6., 7.],
[ 8., 9., nan]]])
Note that the xrTensor object is very handy, since xarray's DataArrays contain the labels, so you may just go on with that object rather than pulling out the ndarray:
print(xrTensor)
Output:
<xarray.DataArray (dim_1: 1, x: 3, y: 2, z: 2)>
array([[[[ 1., 5.],
[ 3., 8.]],
[[ 2., 6.],
[nan, 9.]],
[[nan, 7.],
[ 4., nan]]]])
Coordinates:
* dim_1 (dim_1) object 'value'
* x (x) int64 1 2 3
* y (y) int64 1 2
* z (z) int64 1 2
We can use stack:
np.reshape(df.groupby(['z', 'y', 'x'])['value'].mean().unstack([1,2]).stack([0,1],dropna=False).values,(2,2,3))
Out[451]:
array([[[ 1., 2., nan],
[ 3., nan, 4.]],
[[ 5., 6., 7.],
[ 8., 9., nan]]])

How to remove na and count values NxK arrays in numpy in a vectorized way

My situation: I have a pandas DataFrame and, for each row, I have to compute the following.
1) Get the first value, NaN excluded (df.apply(lambda x: x.dropna().iloc[0]))
2) Get the last value, NaN excluded (df.apply(lambda x: x.dropna().iloc[-1]))
3) Count the non-NaN values (df.apply(lambda x: len(x.dropna())))
Sample case and expected output :
x = np.array([[1,2,np.nan], [4,5,6], [np.nan, 8,9]])
1) [1, 4, 8]
2) [2, 6, 9]
3) [2, 3, 2]
And I need to keep it optimized. So I turned to numpy and looked for a way to apply y = x[~numpy.isnan(x)] on an NxK array as a first step. Then, I would use what was shown here (Vectorized way of accessing row specific elements in a numpy array) for 1) and 2), but I am still empty-handed for 3).
Here's one way -
In [756]: x
Out[756]:
array([[ 1., 2., nan],
[ 4., 5., 6.],
[ nan, 8., 9.]])
In [768]: m = ~np.isnan(x)
In [769]: first_idx = m.argmax(1)
In [770]: last_idx = m.shape[1] - m[:,::-1].argmax(1) - 1
In [771]: x[np.arange(len(first_idx)), first_idx]
Out[771]: array([ 1., 4., 8.])
In [772]: x[np.arange(len(last_idx)), last_idx]
Out[772]: array([ 2., 6., 9.])
In [773]: m.sum(1)
Out[773]: array([2, 3, 2])
Alternatively, we could make use of cumulative-summation to get those indices, like so -
In [787]: c = m.cumsum(1)
In [788]: first_idx = (c==1).argmax(1)
In [789]: last_idx = c.argmax(1)
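For convenience, the steps above can be collected into one small helper (a sketch assembled from the answer's own code; the function name is mine):
import numpy as np

def first_last_count(x):
    # first non-NaN, last non-NaN and non-NaN count per row
    m = ~np.isnan(x)
    first_idx = m.argmax(1)
    last_idx = m.shape[1] - m[:, ::-1].argmax(1) - 1
    rows = np.arange(x.shape[0])
    return x[rows, first_idx], x[rows, last_idx], m.sum(1)

x = np.array([[1, 2, np.nan], [4, 5, 6], [np.nan, 8, 9]])
print(first_last_count(x))
# (array([1., 4., 8.]), array([2., 6., 9.]), array([2, 3, 2]))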

Python: Counting identical rows in an array (without any imports)

For example, given:
import numpy as np
data = np.array(
[[0, 0, 0],
[0, 1, 1],
[1, 0, 1],
[1, 0, 1],
[0, 1, 1],
[0, 0, 0]])
I want to get a 3-dimensional array, looking like:
result = array([[[ 2., 0.],
[ 0., 2.]],
[[ 0., 2.],
[ 0., 0.]]])
One way is:
for row in data:
    newArray[row[0]][row[1]][row[2]] += 1
What I'm trying to do is the following:
for i in dimension1:
    for j in dimension2:
        for k in dimension3:
            result[i,j,k] = (data[data[data[:,0]==i, 1]==j, 2]==k).sum()
This doesn't seem to work, and I would like to achieve the desired result by sticking to my implementation rather than the one mentioned at the beginning (or using any extra imports, e.g. Counter).
Thanks.
You can also use numpy.histogramdd for this:
>>> np.histogramdd(data, bins=(2, 2, 2))[0]
array([[[ 2., 0.],
[ 0., 2.]],
[[ 0., 2.],
[ 0., 0.]]])
The problem is that data[data[data[:,0]==i, 1]==j, 2]==k is not what you expect it to be.
Let's take this apart for the case (i,j,k) == (0,0,0)
data[:,0]==0 is [True, True, False, False, True, True], and data[data[:,0]==0] correctly gives us the lines where the first number is 0.
Now from those lines we get the lines where the second number is 0: data[data[:,0]==0, 1]==0, which gives us [True, False, False, True]. And this is the problem. Because if we take those indices from data, i.e., data[data[data[:,0]==0, 1]==0] we do not get the rows where the first and second number are 0, but the 0th and 3rd row instead:
In [51]: data[data[data[:,0]==0, 1]==0]
Out[51]: array([[0, 0, 0],
[1, 0, 1]])
And if we now filter for the rows where the third number is 0, we get the wrong result w.r.t. the original data.
And that's why your approach does not work. For better methods, see the other answers.
You can do something like the following
#Get output dimension and construct output array.
>>> dshape = tuple(data.max(axis=0)+1)
>>> dshape
(2, 2, 2)
>>> out = np.zeros(dshape)
If you have numpy 1.8+:
np.add.at(out, tuple(data.T), 1)  # plain fancy-index += would not accumulate repeated rows
Else:
#Get indices and unique the resulting array
>>> inds = np.ravel_multi_index(data.T, dshape)
>>> inds, inverse = np.unique(inds, return_inverse=True)
>>> values = np.bincount(inverse)
>>> values
array([2, 2, 2])
>>> out.flat[inds] = values
>>> out
array([[[ 2., 0.],
[ 0., 2.]],
[[ 0., 2.],
[ 0., 0.]]])
NumPy versions before 1.8 do not have the add.at attribute and the top code will not work without it. As ravel_multi_index may not be the fastest algorithm ever, you can look into taking the unique rows of a numpy array; see the sketch below. In effect, these two operations should be equivalent.
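A hedged sketch of that unique-rows route (assuming NumPy >= 1.13 for the axis argument of np.unique, and reusing data and dshape from above):
rows, counts = np.unique(data, axis=0, return_counts=True)
out = np.zeros(dshape)
out[tuple(rows.T)] = counts   # scatter the per-row counts into the 3D array
This should reproduce the same result as the histogramdd and ravel_multi_index versions.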
Don't fear the imports. They're what make Python awesome.
This assumes that you already have the result matrix.
import numpy as np
data = np.array(
[[0, 0, 0],
[0, 1, 1],
[1, 0, 1],
[1, 0, 1],
[0, 1, 1],
[0, 0, 0]]
)
result = np.zeros((2,2,2))
# range of each dim, aka allowable values for each dim
dim_ranges = list(zip(np.zeros(result.ndim), np.array(result.shape)-1))
dim_ranges
# Out[]:
# [(0.0, 1), (0.0, 1), (0.0, 1)]
# Multidimentional histogram will effectively "count" along each dim
sums,_ = np.histogramdd(data,bins=result.shape,range=dim_ranges)
result += sums
result
# Out[]:
# array([[[ 2., 0.],
# [ 0., 2.]],
#
# [[ 0., 2.],
# [ 0., 0.]]])
This solution solves for any "result" ndarray, no matter what the shape. Additionally, it works fine even if your "data" ndarray has indices which are out-of-bounds for your result matrix.
