Access data points in scikit-learn KNeighborsRegressor - python

Question
After fitting the data with neigh.fit(), I would like to access these data points. How do I do this?
Details
>>> samples = [[0., 0., 0.], [0., .5, 0.], [1., 1., .5]]
>>> samplesy = [80, 60, 40]
>>> from sklearn import neighbors
>>> neigh = neighbors.KNeighborsRegressor(n_neighbors=1)
>>> neigh.fit(samples, samplesy)
>>> print(neigh.kneighbors([[1., 1., 1.]]))
(array([[ 0.5]]), array([[2]]))
So from this I learned that the closest data point is 'samples[2]'.
However, in the case where I no longer have access to the variable 'samples', is there a way to get at the data point stored in 'neigh'? Maybe something like 'neigh[2]'? The data points have to be saved somewhere inside the model of 'neigh', right?
Why
I would like to access the data points of the 5 closest neighbors and calculate a cluster center from them. Then I want to calculate the distance from this cluster center to the new data point, to get an idea of how far the new data point is from the original data.

The data used to fit the model are stored in neigh._fit_X:
>>> neigh._fit_X
array([[ 0. ,  0. ,  0. ],
       [ 0. ,  0.5,  0. ],
       [ 1. ,  1. ,  0.5]])
However: The leading underscore of the variable name should be a signal to you that this is supposed to be somewhat of a private attribute. You shouldn't expect for this data to behave in any particular way, or even to exist in future versions of the library. Use it at your own risk.
A better way might be to just keep track of the input data on your own.
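For the use case described under "Why", here is a minimal sketch of that approach, keeping our own copy of the training data (with k=3 rather than 5, since the toy data only has three samples):
import numpy as np
from sklearn import neighbors

samples = np.asarray([[0., 0., 0.], [0., .5, 0.], [1., 1., .5]])
samplesy = [80, 60, 40]

neigh = neighbors.KNeighborsRegressor(n_neighbors=1)
neigh.fit(samples, samplesy)

# Indices of the k nearest neighbors of a new point.
new_point = np.array([[1., 1., 1.]])
dist, ind = neigh.kneighbors(new_point, n_neighbors=3)

# Look the neighbors up in our own copy of the training data,
# compute their cluster center, and measure the offset of the new point.
cluster_center = samples[ind[0]].mean(axis=0)
offset = np.linalg.norm(new_point[0] - cluster_center)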

Related

how to fix memory error while using np.r_

I have a list with 482000 entries. The structure of the array is like this:
X_docs = [array([0., 0., 0., ..., 0., 0., 0.]),
          array([0.60205999, 0.60205999, 0.47712125, ..., 0., 0., 0.])]
Each array has 5000 entries, so in the end the matrix is 482000 x 5000.
Then I need to apply np.r_ over it like this:
np.r_[X_docs]
When it reaches this line it raises this error:
MemoryError
I don't know how to fix this. Is there some size limitation on the numpy side?
I have 32 GB of RAM. I even tried to run it on AWS SageMaker (free tier), and it still raises the error.
Update 1
This is the whole code before reaching the np part:
corpus = load_corpus(args.input)
n_vocab, docs = len(corpus['vocab']), corpus['docs']
corpus.clear()  # save memory
doc_keys = list(docs.keys())  # list() so we can delete from docs while iterating
X_docs = []
for k in doc_keys:
    X_docs.append(vecnorm(doc2vec(docs[k], n_vocab), 'logmax1', 0))
    del docs[k]
X_docs = np.r_[X_docs]
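For scale: 482000 x 5000 float64 values take about 19.3 GB, and np.r_ materializes that full matrix while the list of row arrays is still alive, so peak usage roughly doubles and overshoots 32 GB. One way around this, sketched below under the assumption that docs, n_vocab, vecnorm and doc2vec are as in the question, is to preallocate the result once and fill it row by row:
import numpy as np

# One full-size array, allocated once; float32 halves the footprint
# (about 9.7 GB instead of 19.3 GB) if the extra precision isn't needed.
X_docs = np.zeros((len(docs), n_vocab), dtype=np.float32)

for i, k in enumerate(list(docs.keys())):
    X_docs[i] = vecnorm(doc2vec(docs[k], n_vocab), 'logmax1', 0)
    del docs[k]  # free each document as soon as it has been vectorized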

python - numpy - many matrices multiplying many vectors

I have a set of many matrices, each corresponding to a vector. I want to multiply each matrix by its vector efficiently. I know I can put all the matrices in one big block-diagonal matrix and multiply it by one big combined vector.
I want to know if there is a way to use numpy.dot to multiply all of them in an efficient way.
I have tried to use numpy.stack and then numpy.dot, but I can't get only the wanted vectors.
To be more specific. My matrices look like:
R_stack = np.stack((R, R2, R3))
which is
array([[[-0.60653066,  1.64872127],
        [ 0.60653066, -1.64872127]],
       [[-0.36787944,  2.71828183],
        [ 0.36787944, -2.71828183]],
       [[-0.22313016,  4.48168907],
        [ 0.22313016, -4.48168907]]])
and my vectors look like:
p_stack = np.stack((p0, p0_2, p0_3))
which is
array([[[0.73105858],
        [0.26894142]],
       [[0.88079708],
        [0.11920292]],
       [[0.95257413],
        [0.04742587]]])
I want to multiply the following: R*p0, R2*p0_2, R3*p0_3.
When I do the dot:
np.dot(R_stack, p_stack)[:,:,:,0]
I get
array([[[ 0.        , -0.33769804, -0.49957337],
        [ 0.        ,  0.33769804,  0.49957337]],
       [[ 0.46211716,  0.        , -0.22151555],
        [-0.46211716,  0.        ,  0.22151555]],
       [[ 1.04219061,  0.33769804,  0.        ],
        [-1.04219061, -0.33769804,  0.        ]]])
The 3 vectors I'm interested in are the 3 [0,0] vectors on the diagonal. How can I get them?
You are almost there. You need to add a diagonal index on the 1st and 3rd dimensions, like so:
np.dot(R_stack, p_stack)[np.arange(3),:,np.arange(3),0]
Every row in the result will correspond to one of your desired vectors:
array([[-3.48805945e-09,  3.48805945e-09],
       [-5.02509157e-09,  5.02509157e-09],
       [-1.48245199e-08,  1.48245199e-08]])
Another way I found is to use numpy.diagonal
np.diagonal(np.dot(R_stack, p_stack)[:,:,:,0], axis1=0, axis2=2)
which gives a vector in each column:
array([[0., 0., 0.],
       [0., 0., 0.]])
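For what it's worth, np.dot computes the product of every matrix with every vector, and both answers above then pick out the diagonal. np.matmul and np.einsum broadcast over the leading stack axis instead, so they compute only the three wanted products in the first place; a short sketch using the arrays defined above:
# matmul treats the first axis as a stack and multiplies pairwise,
# so the (3, 2, 3, 1) cross product is never formed.
products = np.matmul(R_stack, p_stack)[:, :, 0]  # shape (3, 2), one vector per row

# The same computation spelled out with einsum:
products = np.einsum('nij,njk->nik', R_stack, p_stack)[:, :, 0]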

Most appropriate variable types to deal with coordinates in python

I am writing python code that uses Cartesian methods and trigonometry to move, resize and rotate shapes on a plane, and to track and report these shenanigans.
It will not be computationally intensive - typically a user-instruction would lead to a single move/rotate/resize operation.
I would like to know what is the most appropriate variable type to use for the shape coordinate and dimension pairs, and why.
The types I have considered are
x = 10
y = -15
list_coords = [x, y]
tuple_coords = (x, y)
import numpy as np
array_coords = np.array([x, y])
import cmath as cm
complex_coords = complex(x, y)
If you know of other good options, please also tell me about them.
Thanks!
Short answer, Tuple
From "What's the difference between lists and tuples?" thread,
Tuples are heterogeneous data structures (i.e., their entries have
different meanings), while lists are homogeneous sequences.
Tuples have structure, lists have order.
Using this distinction makes code more explicit and understandable.
As tuples consist of heterogeneous entities, rather than an ordered sequence of homogeneous entities, a tuple is a great way to deal with coordinate systems. Coordinate operations like addition and subtraction are also fairly simple with tuples.
Example:
import operator
a = (1,2,3)
b = (5,6,7)
c = tuple(map(operator.add, a, b))
Also, tuples are immutable. This seems inconvenient at first, but using immutable data like this with functional programming techniques has substantial advantages.
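Since you asked for other good options: collections.namedtuple keeps all the tuple properties above while adding named field access. A small sketch (the Point name is just illustrative):
from collections import namedtuple

Point = namedtuple('Point', ['x', 'y'])

a = Point(1, 2)
b = Point(5, 6)
c = Point(a.x + b.x, a.y + b.y)  # still immutable, still unpacks like a tuple
print(c)  # Point(x=6, y=8)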
Lots of options. Consider a polygon. In most GIS programs the first and last point are repeated to form closure, as in the polygon 'a' below using numpy
import numpy as np
a = np.array([[0., 0.], [0., 1000.], [1000., 1000.], [1000., 0.], [ 0., 0.]])
a
array([[    0.,     0.],
       [    0.,  1000.],
       [ 1000.,  1000.],
       [ 1000.,     0.],
       [    0.,     0.]])
The dtype for the above is a simple float64. You can convert it to a structured array by assigning an appropriate data type as follows:
b = np.zeros((a.shape[0]), dtype=[('Xs', '<f8'), ('Ys', '<f8')])
b['Xs'] = a[:,0]; b['Ys'] = a[:,1]
b
array([(0.0, 0.0), (0.0, 1000.0), (1000.0, 1000.0), (1000.0, 0.0), (0.0, 0.0)],
      dtype=[('Xs', '<f8'), ('Ys', '<f8')])
You can go one step further and produce a 'recarray' if you prefer to use object.property notation with your objects.
c = b.view(np.recarray)
With the standard array with the uniform dtype, you can access the X values using slicing, with the structured array you add the ability to slice by column name, and finally, with the recarray you can use object.property notation.
args = [a[:,0], b['Xs'], c.Xs] # ---- get the X coordinates
print('{}\n{}\n{}'.format(*args))
[ 0. 0. 1000. 1000. 0.]
[ 0. 0. 1000. 1000. 0.]
[ 0. 0. 1000. 1000. 0.]
You can get a polygon centroid from the unique points in the array:
np.mean(a[:-1], axis=0)
array([ 500., 500.])
In fact, it is easy to get the unique points from an array, given the right form:
np.unique(b)
array([(0.0, 0.0), (0.0, 1000.0), (1000.0, 0.0), (1000.0, 1000.0)],
      dtype=[('Xs', '<f8'), ('Ys', '<f8')])
You may have noticed that I have been switching back and forth between conventional ndarrays, those with named fields and recarrays. That is because you can use the same data and just view it in different ways if you like.

What does the command "preprocessing.scale" do in terms of math?

I have read the manual on the scikit-learn website and I still don't know what the mathematical formula behind this command is.
>>> from sklearn import preprocessing
>>> import numpy as np
>>> X = np.array([[ 1., -1.,  2.],
...               [ 2.,  0.,  0.],
...               [ 0.,  1., -1.]])
>>> X_scaled = preprocessing.scale(X)
>>> X_scaled
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])
Center to the mean and component wise scale to unit variance.
This means that the mean value along each column (axis 0, the default) is subtracted from X, and the result is divided by the standard deviation along that same axis.
Andrey's formula in the comments is correct - I'd just add that numpy and scikit-learn use the population formula for calculating the standard deviation, not the sample standard deviation, which is the default in other languages like R. So numpy and scikit-learn divide the sum of squares by n, instead of n-1.
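As a quick check of that formula (note that np.std likewise defaults to the population formula, ddof=0):
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# (X - column mean) / column std reproduces preprocessing.scale(X)
manual = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(manual, preprocessing.scale(X)))  # True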

Theano: Operate on nonzero elements of sparse matrix

I'm trying to take the exp of nonzero elements in a sparse theano variable. I have the current code:
A = T.matrix("Some matrix with many zeros")
A_sparse = theano.sparse.csc_from_dense(A)
I'm trying to do something that's equivalent to the following numpy syntax:
mask = (A_sparse != 0)
A_sparse[mask] = np.exp(A_sparse[mask])
but Theano doesn't support != masks yet. (And (A_sparse > 0) | (A_sparse < 0) doesn't seem to work either.)
How can I achieve this?
The support for sparse matrices in Theano is incomplete, so some things are tricky to achieve. You can use theano.sparse.structured_exp(A_sparse) in this particular case, but I'll try to answer your question more generally below.
Comparison
In Theano one would normally use the comparison operators described here: http://deeplearning.net/software/theano/library/tensor/basic.html
For example, instead of A != 0, one would write T.neq(A, 0). With sparse matrices one has to use the comparison operators in theano.sparse. Both operands have to be sparse matrices, and the result is also a sparse matrix:
mask = theano.sparse.neq(A_sparse, theano.sparse.sp_zeros_like(A_sparse))
Modifying a Subtensor
In order to modify part of a matrix, one can use theano.tensor.set_subtensor. With dense matrices this would work:
indices = mask.nonzero()
A = T.set_subtensor(A[indices], T.exp(A[indices]))
Notice that Theano doesn't have a separate boolean type; the mask is just zeros and ones, so nonzero() has to be called first to get the indices of the nonzero elements. Furthermore, set_subtensor is not implemented for sparse matrices.
Operating on Nonzero Sparse Elements
Theano provides sparse operations that are said to be structured and operate only on the nonzero elements. See:
http://deeplearning.net/software/theano/tutorial/sparse.html#structured-operation
More precisely, they operate on the data attribute of a sparse matrix, independent of the indices of the elements. Such operations are straightforward to implement. Note that the structured operations will operate on all the values in the data array, also those that are explicitly set to zero.
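A minimal sketch of the structured_exp route mentioned at the top, assuming the theano.sparse API as documented (dense_from_sparse is used here only to make the result easy to inspect):
import theano
import theano.sparse
import theano.tensor as T

A = T.matrix("A")
A_sparse = theano.sparse.csc_from_dense(A)

# exp is applied only to the stored values; the sparsity pattern
# (and hence the implicit zeros) is left untouched.
A_exp = theano.sparse.structured_exp(A_sparse)

f = theano.function([A], theano.sparse.dense_from_sparse(A_exp))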
Here's a way of doing this with the scipy.sparse module. I don't know how Theano implements its sparse matrices, but it's likely to be based on similar ideas (since it uses names like csc).
In [223]: from scipy import sparse

In [224]: A = sparse.csc_matrix([[1., 0, 0, 2, 0], [0, 0, 3, 0, 0], [0, 1, 1, 2, 0]])
In [225]: A.A
Out[225]:
array([[ 1.,  0.,  0.,  2.,  0.],
       [ 0.,  0.,  3.,  0.,  0.],
       [ 0.,  1.,  1.,  2.,  0.]])
In [226]: A.data
Out[226]: array([ 1., 1., 3., 1., 2., 2.])
In [227]: A.data[:]=np.exp(A.data)
In [228]: A.A
Out[228]:
array([[  2.71828183,   0.        ,   0.        ,   7.3890561 ,   0.        ],
       [  0.        ,   0.        ,  20.08553692,   0.        ,   0.        ],
       [  0.        ,   2.71828183,   2.71828183,   7.3890561 ,   0.        ]])
The main attributes of the csc format are data, indices, and indptr. It's possible for data to contain some zero values if you fiddle with it after creation, but a freshly created matrix shouldn't have any.
The matrix also has a nonzero method modeled on the numpy one. In practice it converts the matrix to coo format, filters out any zero values, and returns the row and col attributes:
In [229]: A.nonzero()
Out[229]: (array([0, 0, 1, 2, 2, 2]), array([0, 3, 2, 1, 2, 3]))
And the csc format allows indexing just like a dense numpy array:
In [230]: A[A.nonzero()]
Out[230]:
matrix([[  2.71828183,   7.3890561 ,  20.08553692,   2.71828183,
           2.71828183,   7.3890561 ]])
T.where works:
A_sparse = T.where(A_sparse == 0, 0, T.exp(A_sparse))
(Seppo Envari's answer seems faster though, so I'll accept his answer.)
