kNN feature should be passed through as a list - python

My data looks like this:
sample1 = [[1, 0, 3, 5, 0, 9], 0, 1.5, 0]
sample2 = [[0, 4, 0, 6, 2, 0], 2, 1.9, 1]
sample3 = [[9, 7, 6, 0, 0, 0], 0, 1.3, 1]
paul = pd.DataFrame(data=[sample1, sample2, sample3], columns=['list', 'cat', 'metr', 'target'])
On this data, a scikit-learn kNN regression with a specific distance function should be run.
The distance function is:
def my_distance(X,Y,**kwargs):
    if len(X)>1:
        x = X
        y = Y
        all_minima = []
        for k in range(len(x)):
            one_minimum = min(x[k],y[k])
            all_minima.append(one_minimum)
        sum_all_minima=sum(all_minima)
        distance = (sum(x)+sum(y)-sum_all_minima) * kwargs["Para_list"]
    elif X.dtype=='int64':
        x = X
        y = Y
        if x == y and x != -1:
            distance = 0
        elif x == -1 or y == -1 or x is None or y is None:
            distance = kwargs["Para_minus1"] * 1
        else:
            distance = kwargs["Para_nominal"] * 1
    else:
        x = X
        y = Y
        if x == y:
            distance = 0
        elif x == -1 or y == -1 or x is None or y is None:
            distance = kwargs["Para_minus1"] * 1
        else:
            distance = abs(x-y) * kwargs["Para_metrisch"]
    return distance
It should then be registered as a valid distance function with
DistanceMetric.get_metric('pyfunc',func=my_distance)
If I'm right, the scikit-learn code should look like this:
train, test = train_test_split(paul, test_size=0.3)
# x_train should contain only the independent variables; the others are dropped:
x_train = train.drop('target', axis=1)
y_train = train['target']
x_test = test.drop('target', axis=1)
y_test = test['target']
knn = KNeighborsRegressor(n_neighbors=2,
                          algorithm='ball_tree',
                          metric=my_distance,
                          metric_params={"Para_list": 2,
                                         "Para_minus1": 3,
                                         "Para_metrisch": 2,
                                         "Para_nominal": 4})
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
I get
ValueError: setting an array element with a sequence.
I guess scikit can not handle a single feature item as list? Is there a way to make that happen?

I guess scikit can not handle a single feature item as list? Is there a way to make that happen?
No, there is no way I know of to make this happen. You need to convert this feature into a 2D matrix and concatenate it with the other 1D features, so that the data forms a single numeric feature array. This is standard sklearn behavior.
Unless you have some very narrow use case, making a 2D array from the list feature is totally fine. I assume all lists have the same length.
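A minimal sketch of that conversion, assuming every list has length 6 as in the question's samples. The names list_block and other_block are only illustrative. Each row of X is then a flat numeric vector, so a callable metric would receive two 1D arrays and could slice the first six entries back out as the list part:

```python
import numpy as np
import pandas as pd

sample1 = [[1, 0, 3, 5, 0, 9], 0, 1.5, 0]
sample2 = [[0, 4, 0, 6, 2, 0], 2, 1.9, 1]
sample3 = [[9, 7, 6, 0, 0, 0], 0, 1.3, 1]
paul = pd.DataFrame(data=[sample1, sample2, sample3],
                    columns=['list', 'cat', 'metr', 'target'])

# Expand the list feature into an (n_samples, 6) matrix (assumes equal-length lists)
list_block = np.array(paul['list'].tolist(), dtype=float)

# The remaining scalar features as an (n_samples, 2) matrix
other_block = paul[['cat', 'metr']].to_numpy(dtype=float)

# One flat numeric row per sample: columns 0-5 are the list, 6 is 'cat', 7 is 'metr'
X = np.hstack([list_block, other_block])
y = paul['target'].to_numpy()

print(X.shape)
```

Note that casting everything to float means a dtype check such as X.dtype=='int64' inside the metric no longer distinguishes the categorical column; selecting it by column position is one way around that.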

Related

Speed up multiplication of two dense tensors

I want to perform element wise multiplication between two tensors, where most of the elements are zero.
For two example tensors:
test1 = np.zeros((2, 3, 5, 6))
test1[0, 0, :, 2] = 4
test1[0, 1, [2, 4], 1] = 7
test1[0, 2, 2, :] = 2
test1[1, 0, 4, 1:3] = 5
test1[1, :, 0, 1] = 3
and,
test2 = np.zeros((5, 6, 4, 7))
test2[2, 2, 2, 4] = 4
test2[0, 1, :, 1] = 3
test2[4, 3, 2, :] = 6
test2[1, 0, 3, 1:3] = 1
test2[3, :, 0, 1] = 2
the calculation I need is:
result = test1[..., None, None] * test2[None, None, ...]
In the actual use case I am coding for, the tensors can have more dimensions and much longer lengths in some of the dimensions, so while the multiplication is reasonably quick, I would like to utilise the fact that most of the elements are zero.
My first thought was to make a sparse representation of each tensor.
coords1 = np.nonzero(test1)
shape1 = test1.shape
test1_squished = test1[coords1]
coords1 = np.array(coords1)
coords2 = np.nonzero(test2)
shape2 = test2.shape
test2_squished = test2[coords2]
coords2 = np.array(coords2)
Here there is enough information to perform the multiplication, by comparing the coordinates along the equal axes and multiplying if they are the same.
I have a function for adding a new axis,
def new_axis(coords, shape, axis):
    new_coords = np.zeros((len(coords)+1, len(coords[0])))
    new_index = np.delete(np.arange(0, len(coords)+1), axis)
    new_coords[new_index] = coords
    coords = new_coords
    new_shape = np.zeros(len(new_coords), dtype=int)
    new_shape[new_index] = shape
    new_shape[axis] = 1
    new_shape = np.array(new_shape)
    return coords, new_shape
and for performing the multiplication,
def multiply(coords1, shape1, array1, coords2, shape2, array2): #all inputs should be numpy arrays
    if np.array_equal( shape1, shape2 ):
        index1 = np.nonzero( ( coords1.T[:, None, :] == coords2.T ).all(-1).any(-1) )[0]
        index2 = np.nonzero( ( coords2.T[:, None, :] == coords1.T ).all(-1).any(-1) )[0]
        array = array1[index1] * array2[index2]
        coords = ( coords1.T[index1] ).T
        shape = shape1
    else:
        if len(shape1) == len(shape2):
            equal_index = np.nonzero( ( shape1 == shape2 ) )[0]
            not_equal_index = np.nonzero( ~( shape1 == shape2 ) )[0]
            if np.logical_or( ( shape1[not_equal_index] == 1 ), ( shape2[not_equal_index] == 1 ) ).all():
                #if where not equal, one of them = 1 -> can broadcast
                # compare dimensions with same length, if equal then multiply corresponding elements
                multiply_index1 = np.nonzero(
                    ( coords1[equal_index].T[:, None, :] == coords2[equal_index].T ).all(-1).any(-1)
                )[0]
                # would like vectorised version of below
                array = []
                coords = []
                for index in multiply_index1:
                    multiply_index2 = np.nonzero( ( (coords2[equal_index]).T == (coords1[equal_index]).T[index] ).all(-1) )[0]
                    array.append( array1[index] * array2[multiply_index2] )
                    temp = np.zeros((len(shape1), len(multiply_index2)))
                    temp[not_equal_index] = ((coords1[not_equal_index].T[index]).T + (coords2[not_equal_index].T[multiply_index2])).T
                    if len(multiply_index2)==1:
                        temp[equal_index] = coords1[equal_index].T[index].T[:, None]
                    else:
                        temp[equal_index] = np.repeat( coords1[equal_index].T[index].T[:, None], len(multiply_index2), axis=-1)
                    coords.append(temp)
                array = np.concatenate(array)
                coords = np.concatenate(coords, axis=-1)
                shape = shape1
                shape[np.where(shape==1)] = shape2[np.where(shape==1)]
            else:
                print("error")
        else:
            print("error")
    return array, coords, shape
However the multiply function is very inefficient and so I lose any gain of going to the sparse representation.
Is there an elegant vectorised approach to the multiply function? Or is there a better solution than this sparse tensor idea?
Thanks in advance.
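One possible way to vectorise the coordinate comparison described above is to collapse the shared axes into a single integer key per nonzero and match the keys with a sort plus np.searchsorted. This is only a sketch: the helper name sparse_multiply is hypothetical, and it assumes the shared axes are the trailing axes of the first tensor and the leading axes of the second, as in the example.

```python
import numpy as np

def sparse_multiply(coords1, vals1, coords2, vals2, shared_shape):
    """Pair up nonzeros whose shared-axis coordinates agree (hypothetical helper)."""
    k = len(shared_shape)
    # One integer key per nonzero, built from the shared coordinates only
    key1 = np.ravel_multi_index(tuple(coords1[-k:]), shared_shape)
    key2 = np.ravel_multi_index(tuple(coords2[:k]), shared_shape)
    # Sort the second tensor's keys so equal keys form contiguous runs
    order2 = np.argsort(key2, kind='stable')
    key2_sorted = key2[order2]
    left = np.searchsorted(key2_sorted, key1, side='left')
    right = np.searchsorted(key2_sorted, key1, side='right')
    counts = right - left                      # matches per nonzero of tensor 1
    # Expand to one entry per matching (nonzero1, nonzero2) pair
    idx1 = np.repeat(np.arange(len(key1)), counts)
    run_start = np.cumsum(counts) - counts
    within = np.arange(counts.sum()) - np.repeat(run_start, counts)
    idx2 = order2[np.repeat(left, counts) + within]
    values = vals1[idx1] * vals2[idx2]
    coords = np.vstack([coords1[:, idx1], coords2[k:, idx2]])
    return coords, values

test1 = np.zeros((2, 3, 5, 6))
test1[0, 0, :, 2] = 4
test1[0, 1, [2, 4], 1] = 7
test1[0, 2, 2, :] = 2
test1[1, 0, 4, 1:3] = 5
test1[1, :, 0, 1] = 3
test2 = np.zeros((5, 6, 4, 7))
test2[2, 2, 2, 4] = 4
test2[0, 1, :, 1] = 3
test2[4, 3, 2, :] = 6
test2[1, 0, 3, 1:3] = 1
test2[3, :, 0, 1] = 2

coords1 = np.array(np.nonzero(test1))
coords2 = np.array(np.nonzero(test2))
coords, values = sparse_multiply(coords1, test1[tuple(coords1)],
                                 coords2, test2[tuple(coords2)], (5, 6))

# Scatter back to dense form only to compare against the broadcast product
dense = np.zeros(test1.shape + test2.shape[2:])
dense[tuple(coords)] = values
print(np.array_equal(dense, test1[..., None, None] * test2[None, None, ...]))
```

Whether this beats the dense product depends on how sparse the inputs really are; the sort costs O(m log m) in the number of nonzeros of the second tensor, and the output grows with the number of matching pairs.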

Value at a given index in a NumPy array depends on values at higher indexes in another NumPy array

I have two 1D NumPy arrays x = [x[0], x[1], ..., x[n-1]] and y = [y[0], y[1], ..., y[n-1]]. The array x is known, and I need to determine the values for array y. For every index in np.arange(n), the value of y[index] depends on x[index] and on x[index + 1: ]. My code is this:
import numpy as np
n = 5
q = 0.5
x = np.array([1, 2, 0, 1, 0])
y = np.empty(n, dtype=int)
for index in np.arange(n):
    if (x[index] != 0) and (np.any(x[index + 1:] == 0)):
        y[index] = np.random.choice([0,1], 1, p=(1-q, q))
    else:
        y[index] = 0
print(y)
The problem with the for loop is that the size of n in my experiment can become very large. Is there any vectorized way to do this?
Randomly generate the array y with the full shape.
Generate a bool array indicating where to set zeros.
Use np.where to set zeros.
Try this,
import numpy as np
n = 5
q = 0.5
x = np.array([1, 2, 0, 1, 0])
y = np.random.choice([0, 1], n, p=(1-q, q))
condition = (x != 0) & (x[::-1].cumprod() == 0)[::-1] # equivalent to the posted one
y = np.where(condition, y, 0)
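One caveat with the cumprod-based condition: an integer product can wrap around to 0 on long arrays (64 factors of 2 already overflow int64 to exactly 0), which would wrongly report a zero further along. A sketch of a variant that accumulates booleans instead, assuming the same x and q as above (the name zero_at_or_after is only illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
q = 0.5
x = np.array([1, 2, 0, 1, 0])
n = len(x)

is_zero = (x == 0)
# zero_at_or_after[i] is True when x[i:] contains at least one zero
zero_at_or_after = np.logical_or.accumulate(is_zero[::-1])[::-1]

# x[i] != 0 rules out position i itself, so this means "a zero exists in x[i+1:]"
condition = (x != 0) & zero_at_or_after

# Draw all candidates at once, then zero out the positions that fail the condition
candidates = rng.choice([0, 1], size=n, p=(1 - q, q))
y = np.where(condition, candidates, 0)
print(condition, y)
```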

Optimize non-trivial function on tensors

I am looking for a way to speed up a specific operation on tensors in PyTorch. Since it is a general operation on matrices, I am open to answers in NumPy as well.
Let's say I have a tensor with values from 0 to N-1 (N=4) where each value repeats the same number of times (R=2).
import torch
x = torch.Tensor([0, 0, 1, 1, 2, 2, 3, 3])
In this case, it is sorted, but any permutation of x is also in the set of considered tensors X.
I am getting an input tensor with values from 0 to N-1 but without any constraints on the repetition.
z = torch.tensor([3, 2, 3, 0, 2, 3, 1, 2])
And I would like to find an efficient implementation of foo such that y = foo(z). y should be some permutation of x (from the set X) that tries to do as few changes in z as possible (in terms of Hamming distance), for example
y = torch.tensor([3, 2, 3, 0, 2, 0, 1, 1])
The trivial solution is to keep counting the number of elements with the same value, but it is extremely inefficient to process elements one-by-one for larger tensors:
def foo(z):
    R = 2
    N = 4
    counters = [0] * N
    # first, we replace extra elements with -1
    y = []
    for elem in z:
        if counters[elem] < R:
            counters[elem] += 1
            y.append(elem)
        else:
            y.append(-1)
    y = torch.tensor(y)
    assert torch.equal(y, torch.tensor([3, 2, 3, 0, 2, -1, 1, -1]))
    # second, we replace -1 by "unfilled" counters
    for i in range(len(y)):
        if y[i] == -1:
            first_unfilled = [n for n in range(N) if counters[n] < R][0]
            counters[first_unfilled] += 1
            y[i] = first_unfilled
    return y
assert torch.equal(y, foo(z))
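Since NumPy answers are welcome, here is a rough loop-free sketch of the same two passes. It assumes len(z) == N * R and that all values lie in 0..N-1; the name foo_vectorised is hypothetical. The idea is to compute, for every element, how many equal values precede it, keep the first R of each value, and fill the remaining slots with the under-represented values in ascending order, which mirrors the "first unfilled" rule above.

```python
import numpy as np

def foo_vectorised(z, N, R):
    """Loop-free variant of foo (sketch; assumes len(z) == N * R)."""
    z = np.asarray(z)
    # Stable sort groups equal values while preserving their original order
    order = np.argsort(z, kind='stable')
    counts = np.bincount(z, minlength=N)
    group_start = np.cumsum(counts) - counts
    # Occurrence number of each element within its group (0, 1, 2, ...)
    rank_sorted = np.arange(len(z)) - np.repeat(group_start, counts)
    rank = np.empty(len(z), dtype=np.int64)
    rank[order] = rank_sorted
    keep = rank < R                                  # first R occurrences survive
    missing = R - np.minimum(counts, R)              # slots still open per value
    fill = np.repeat(np.arange(N), missing)          # ascending, like "first unfilled"
    y = z.copy()
    y[~keep] = fill
    return y

z = np.array([3, 2, 3, 0, 2, 3, 1, 2])
y = foo_vectorised(z, N=4, R=2)
print(y)
```

The building blocks used here (bincount, cumsum, repeat) have PyTorch counterparts, so a port should be possible, though that has not been checked here.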

ValueError: setting an array element with a sequence for generating a weighted data set?

This is the code I'm trying to run to generate a data set with 3 different sample populations, where one class is weighted by a combined Gaussian distribution with 2 sets of means and covariances -- hence the addition of the two multivariate normal rvs functions to feed into the indices of the 'blank' data set. Not sure what I can do to combine them without making it into a sequence?
N_valid = 10000
def generate_data_from_gmm(N, pdf_params, fig_ax=None):
    # Determine dimensionality from mixture PDF parameters
    n = pdf_params['mu'].shape[1]
    print(n)
    # Determine number of classes/mixture components
    C = len(pdf_params['priors'])
    # Output samples and labels
    X = np.zeros([N, n])
    labels = np.zeros(N)
    # Decide randomly which samples will come from each component u_i ~ Uniform(0, 1) for i = 1, ..., N (or 0, ... , N-1 in code)
    u = np.random.rand(N)
    # Determine the thresholds based on the mixture weights/priors for the GMM, which need to sum up to 1
    thresholds = np.cumsum(pdf_params['priors'])
    thresholds = np.insert(thresholds, 0, 0) # For intervals of classes
    marker_shapes = 'ox+*.' # Accommodates up to C=5
    marker_colors = 'brgmy'
    Y = np.array(range(1, C+1))
    for y in Y:
        # Get randomly sampled indices for this component
        indices = np.argwhere((thresholds[y-1] <= u) & (u <= thresholds[y]))[:, 0]
        # No. of samples in this component
        Ny = len(indices)
        labels[indices] = y * np.ones(Ny) - 1
        if n == 1:
            X[indices, 0] = norm.rvs(pdf_params['mu'][y-1], pdf_params['Sigma'][y-1], Ny)
        else:
            X[indices, :] = (multivariate_normal.rvs(pdf_params['mu'][y-1], pdf_params['Sigma'][y-1], Ny) + multivariate_normal.rvs(pdf_params['mu'][y], pdf_params['Sigma'][y], Ny))
gmm_pdf = {}
# Likelihood of each distribution to be selected AND class priors!!!
gmm_pdf['priors'] = np.array([0.65, 0.35])
gmm_pdf['mu'] = np.array([[3, 0],
[0, 3],
[2, 2]]) # Gaussian distributions means
gmm_pdf['Sigma'] = np.array([[[2, 0],
[0, 1]],
[[1, 0],
[0, 2]],
[1,0],
[0,1]]) # Gaussian distributions covariance matrices
This specifically happens in this line:
X[indices, :] = (multivariate_normal.rvs(pdf_params['mu'][y-1], pdf_params['Sigma'][y-1], Ny)
+ multivariate_normal.rvs(pdf_params['mu'][y], pdf_params['Sigma'][y], Ny))
Any ideas?
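One thing worth checking first: the third covariance matrix in gmm_pdf['Sigma'] is missing its outer pair of brackets, so the nested list is ragged and NumPy cannot build a regular (3, 2, 2) array from it. That is a plausible source of this error, although it has not been verified against the exact library versions used. A sketch with the brackets restored, using NumPy's own Generator.multivariate_normal as a stand-in for scipy's multivariate_normal.rvs:

```python
import numpy as np

mu = np.array([[3, 0],
               [0, 3],
               [2, 2]])
Sigma = np.array([[[2, 0],
                   [0, 1]],
                  [[1, 0],
                   [0, 2]],
                  [[1, 0],      # third matrix now wrapped in its own brackets
                   [0, 1]]])
print(Sigma.shape)  # one 2x2 covariance matrix per mean

rng = np.random.default_rng(0)
Ny = 5
# Each call returns an (Ny, 2) array, so the sum is (Ny, 2) as well
draws = (rng.multivariate_normal(mu[0], Sigma[0], Ny)
         + rng.multivariate_normal(mu[1], Sigma[1], Ny))

X = np.zeros((10, 2))
indices = np.arange(Ny)
X[indices, :] = draws   # shapes match, so the assignment goes through
```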

Numpy: Combine several arrays based on an indices array

I have 2 arrays of different sizes m and n, for instance:
x = np.asarray([100, 200])
y = np.asarray([300, 400, 500])
I also have an integer array of size m+n, for instance:
indices = np.asarray([1, 1, 0, 1 , 0])
I'd like to combine x and y into an array z of size m+n, in this case:
expected_z = np.asarray([300, 400, 100, 500, 200])
In detail:
The 1st value of indices is 1, so the 1st value of z should come from y. Therefore 300.
The 2nd value of indices is 1, so the 2nd value of z should also come from y. Therefore 400
The 3rd value of indices is 0, so the 3rd value of z should this time come from x. Therefore 100
...
How could I do that efficiently in NumPy?
Thanks in advance!
Make an output array and use boolean indexing to assign x and y into the correct slots of the output:
z = np.empty(len(x)+len(y), dtype=x.dtype)
z[indices==0] = x
z[indices==1] = y
out will be your desired output:
out = indices.copy()
out[np.where(indices==0)[0]] = x
out[np.where(indices==1)[0]] = y
or as the above answer suggested, simply do:
out = indices.copy()
out[indices==0] = x
out[indices==1] = y
I hope this helps:
x = np.asarray([100, 200])
y = np.asarray([300, 400, 500])
indices = np.asarray([1, 1, 0, 1 , 0])
expected_z = np.asarray([])
x_indice = 0
y_indice = 0
for i in range(0,len(indices)):
    if indices[i] == 0:
        expected_z = np.insert(expected_z,i,x[x_indice])
        x_indice += 1
    else:
        expected_z = np.insert(expected_z,i,y[y_indice])
        y_indice += 1
expected_z
and the output is:
output : array([300., 400., 100., 500., 200.])
P.S. Always make sure that len(indices) == len(x) + len(y), and that:
the number of values taken from y (the 1s in indices) equals len(y)
the number of values taken from x (the 0s in indices) equals len(x)
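For more than two source arrays, the boolean-indexing idea can be generalised with a stable argsort. The sketch below uses a hypothetical helper named interleave and also performs the length checks mentioned in the P.S.; it assumes indices holds integers in 0..k-1, where k is the number of arrays.

```python
import numpy as np

def interleave(arrays, indices):
    """Merge arrays so that z[i] comes from arrays[indices[i]] (hypothetical helper)."""
    indices = np.asarray(indices)
    stacked = np.concatenate(arrays)
    # Each array must supply exactly as many values as indices requests from it
    wanted = np.bincount(indices, minlength=len(arrays))
    if not np.array_equal(wanted, [len(a) for a in arrays]):
        raise ValueError("indices does not match the lengths of the arrays")
    # Stable sort lists the slots for array 0 first, then array 1, and so on,
    # which is exactly the order of the values in `stacked`
    order = np.argsort(indices, kind='stable')
    z = np.empty_like(stacked)
    z[order] = stacked
    return z

x = np.asarray([100, 200])
y = np.asarray([300, 400, 500])
indices = np.asarray([1, 1, 0, 1, 0])
z = interleave([x, y], indices)
print(z)
```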
