sklearn OPTICS and precomputed cosine matrix yields no clusters - python

I am trying to use sklearn.cluster.OPTICS to cluster an already computed similarity (distance) matrix filled with normalized cosine distances (0.0 to 1.0), but no matter what I pass for max_eps and eps I don't get any clusters out.
Later on I would need to run OPTICS on a similarity matrix of more than 129'000 x 129'000 items, hopefully relying on Dask to keep the memory footprint low.
I am extracting fasttext vectors for a small number of words (each vector has 300 dimensions) and use dask-distance to create a similarity matrix from the vectors.
The result is a matrix that looks like this:
sim == [[0. 0.56742118 0.42776633 0.42344265 0.84878847 0.87984235
0.87468601 0.95224451 0.89341788 0.80922083]
[0.56742118 0. 0.59779273 0.62900345 0.83004028 0.87549904
0.887784 0.8591598 0.80752158 0.80960947]
[0.42776633 0.59779273 0. 0.45120935 0.79292425 0.78556189
0.82378645 0.93107747 0.83290157 0.85349163]
[0.42344265 0.62900345 0.45120935 0. 0.81379353 0.83985011
0.8441614 0.89824009 0.77074847 0.81297649]
[0.84878847 0.83004028 0.79292425 0.81379353 0. 0.15328565
0.36656755 0.79393195 0.76615941 0.83415538]
[0.87984235 0.87549904 0.78556189 0.83985011 0.15328565 0.
0.36000894 0.7792588 0.77379052 0.83737352]
[0.87468601 0.887784 0.82378645 0.8441614 0.36656755 0.36000894
0. 0.82404421 0.86144969 0.87628284]
[0.95224451 0.8591598 0.93107747 0.89824009 0.79393195 0.7792588
0.82404421 0. 0.521453 0.5784272 ]
[0.89341788 0.80752158 0.83290157 0.77074847 0.76615941 0.77379052
0.86144969 0.521453 0. 0.629014 ]
[0.80922083 0.80960947 0.85349163 0.81297649 0.83415538 0.83737352
0.87628284 0.5784272 0.629014 0. ]]
which looks like something I could cluster using a threshold of 0.8, for example:
from dask import array as da
import dask_distance
import logging
import numpy as np
from sklearn.cluster import OPTICS
from collections import defaultdict
log = logging.warning
np.set_printoptions(suppress=True)
if __name__ == "__main__":
    array = np.load("vectors.npy")
    vectors = da.from_array(array)
    sim = dask_distance.cosine(vectors, vectors)
    sim = sim.clip(0.0, 1.0)
    m = np.max(sim)
    c = OPTICS(eps=-1, cluster_method="dbscan", metric="precomputed", algorithm="brute")
    clusters = c.fit(sim)
    words = [
        "icecream",
        "cake",
        "cream",
        "ice",
        "dog",
        "cat",
        "animal",
        "car",
        "truck",
        "bus",
    ]
    cs = defaultdict(list)
    for index, c in enumerate(clusters.labels_):
        cs[c].append(words[index])
    for v in cs.values():
        log(v)
    log(clusters.labels_)
which prints
['icecream', 'cake', 'cream', 'ice', 'dog', 'cat', 'animal', 'car', 'truck', 'bus']
[-1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
but I was expecting there to be several clusters.
I have tried many different values for all the supported parameters in OPTICS but have not been able to get anything usable, or even more than one cluster.
I am using the following versions:
python -V
Python 3.7.3
sklearn.__version__
'0.21.3'
dask.__version__
'2.3.0'
numpy.__version__
'1.17.0'
Here is how it looks when using sklearn's DBSCAN instead:
...
sim = sim.astype(np.float32)
c = DBSCAN(eps=0.7, min_samples=1, metric="precomputed", n_jobs=-1)
clusters = c.fit(sim)
...
yields
['icecream', 'cake', 'cream', 'ice']
['dog', 'cat', 'animal']
['car', 'truck', 'bus']
[0 0 0 0 1 1 1 2 2 2]
Which is exactly right, but it has a much higher memory footprint (OPTICS apparently only needs to calculate half of the matrix).

Have you tried to estimate how much memory a 129000x129000 matrix needs - and how long it will take you to compute that and work with that?!? I strongly doubt that dask will be that helpful here in scaling this. You will need to use some indexing approach to avoid any O(n²) cost in the first place. Cutting O(n²) by a factor of k with k nodes just doesn't get you far enough to be scalable.
When you use "precomputed", you have already computed the full distance matrix. Neither OPTICS nor DBSCAN will compute it again (nor just the lower half of it); they will only iterate over this huge matrix, because they cannot make any assumptions about it, not even that it is symmetric.
Why do you think eps=-1 is right? What about min_samples with OPTICS? If you don't choose the same parameters, you of course won't get similar results from OPTICS and DBSCAN.
The result found by OPTICS with your parameters is correct: at eps=-1 no points are neighbors, and with the default min_samples=5 there are hence no clusters; all points should be labeled -1.
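For reference, here is a minimal sketch of parameter choices that roughly mirror the DBSCAN run shown earlier; it is my own illustration rather than code from the thread, sim is assumed to be the dense NumPy distance matrix printed above, and the specific values of max_eps, eps and min_samples are assumptions, not recommendations (note that OPTICS requires min_samples >= 2, so the DBSCAN value of 1 cannot be copied directly).
import numpy as np
from sklearn.cluster import OPTICS

sim = np.asarray(sim)                   # the 10x10 cosine-distance matrix from the question
c = OPTICS(
    max_eps=0.8,                        # stop neighborhood expansion at this distance
    eps=0.7,                            # extraction threshold for cluster_method="dbscan"
    min_samples=2,                      # OPTICS requires min_samples >= 2
    cluster_method="dbscan",
    metric="precomputed",
)
print(c.fit(sim).labels_)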

Related

How to use sklearn's Matrix factorization to predict new users' recommendation scores

I'm trying to apply sklearn.decomposition.NMF to a matrix R that contains data on how users rated items, in order to predict user ratings for items they have not yet seen.
The matrix's rows are users, its columns are items, and the values are scores, with a score of 0 meaning that the user has not rated that item yet.
With the code below I have only managed to get two matrices that, when multiplied together, give the original matrix back.
import numpy
R = numpy.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
])
from sklearn.decomposition import NMF
model = NMF(n_components=4)
A = model.fit_transform(R)
B = model.components_
n = numpy.dot(A, B)
print(n)
The problem is that the model does not predict new values in place of the 0's (which would be the predicted scores) but instead recreates the matrix as it was.
How do I get the model to predict user scores in place of my original matrix's zeros?
That is what is supposed to happen.
However, in most cases you are not going to have a number of components so close to the number of products and/or customers.
So, for instance, consider 2 components:
model = NMF(n_components=2)
A = model.fit_transform(R)
B = model.components_
R_estimated = np.dot(A, B)
print(np.sum(R-R_estimated))
-1.678873127048393
R_estimated
array([[5.2558264 , 1.99313836, 0. , 1.45512772],
[3.50429478, 1.32891458, 0. , 0.9701988 ],
[1.31294288, 0.94415991, 1.94956896, 3.94609389],
[0.98129195, 0.72179987, 1.52759811, 3.0788454 ],
[0. , 0.65008935, 2.84003662, 5.21894555]])
You can see that in this case many of the previous zeros are now other numbers you could use. Here is a bit of context: https://en.wikipedia.org/wiki/Matrix_factorization_(recommender_systems).
How to select n_components?
I think the question above is answered, but the complete procedure could look something like the following.
For that we need to know which values in R are real and which ones we want to predict; in many cases the 0's in R are those new cases/scenarios.
It is common to update R with the averages for products or customers and then compute the decomposition, selecting the ideal n_components with one or more criteria that measure the benefit on a test sample:
1. Create R_with_Averages.
2. Model selection:
   2.1) Split R_with_Averages into test and training sets.
   2.2) Compare different n_components (from 1 up to some arbitrary number) using a metric that only considers the real evaluations in R.
   2.3) Select the best model --> best n_components.
3. Predict with the best model (a rough sketch of this procedure follows below).
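For illustration, here is a minimal, self-contained sketch of that procedure; it is my own interpretation, not code from the answer, and the 20% hold-out fraction, the item-average fill, the candidate range for n_components and the RMSE metric are all assumptions.
import numpy as np
from sklearn.decomposition import NMF

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)

known = R > 0                                    # 0 means "not rated yet"
rng = np.random.RandomState(0)

# Hold out roughly 20% of the known ratings as a test set
test_mask = known & (rng.rand(*R.shape) < 0.2)
train_mask = known & ~test_mask

# "R_with_Averages": fill the unknown and held-out cells with the item (column) average
item_means = np.where(train_mask, R, 0).sum(axis=0) / np.maximum(train_mask.sum(axis=0), 1)
R_train = R.copy()
R_train[~train_mask] = np.take(item_means, np.where(~train_mask)[1])

best = None
for k in range(1, 4):                            # candidate n_components
    model = NMF(n_components=k, init="random", random_state=0, max_iter=500)
    R_hat = model.fit_transform(R_train) @ model.components_
    rmse = np.sqrt(np.mean((R[test_mask] - R_hat[test_mask]) ** 2))  # real ratings only
    if best is None or rmse < best[1]:
        best = (k, rmse)
print("best n_components:", best[0])
On a matrix this tiny the hold-out set is only a couple of ratings, so the numbers themselves mean little; the point is only the shape of the procedure.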
Perhaps good to see:
Sarwar, B. M., Karypis, G., Konstan, J. A., and Riedl, J. (2000). Application of Dimensionality Reduction in Recommender System - A Case Study. In ACM WebKDD'00 (Web-mining for E-Commerce Workshop). This gives you an overall view.
http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/ - an example with very similar code.
sklearn's implementation of NMF does not seem to support missing values (NaNs; here the 0 values basically represent unknown ratings corresponding to new users); refer to this issue. However, we can use surprise's NMF implementation, as shown in the following code:
import numpy as np
import pandas as pd
from surprise import NMF, Dataset, Reader
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)
R[R==0] = np.nan
print(R)
# [[ 5. 3. nan 1.]
# [ 4. nan nan 1.]
# [ 1. 1. nan 5.]
# [ 1. nan nan 4.]
# [nan 1. 5. 4.]]
df = pd.DataFrame(data=R, index=range(R.shape[0]), columns=range(R.shape[1]))
df = pd.melt(df.reset_index(), id_vars='index', var_name='items', value_name='ratings').dropna(axis=0)
reader = Reader(rating_scale=(0, 5))
data = Dataset.load_from_df(df[['index', 'items', 'ratings']], reader)
k = 2
algo = NMF(n_factors=k)
trainset = data.build_full_trainset()
algo.fit(trainset)
predictions = algo.test(trainset.build_testset()) # predict the known ratings
R_hat = np.zeros_like(R)
for uid, iid, true_r, est, _ in predictions:
    R_hat[uid, iid] = est
predictions = algo.test(trainset.build_anti_testset()) # predict the unknown ratings
for uid, iid, true_r, est, _ in predictions:
    R_hat[uid, iid] = est
print(R_hat)
# [[4.40762528 2.62138084 3.48176319 0.91649316]
# [3.52973408 2.10913555 2.95701406 0.89922637]
# [0.94977826 0.81254138 4.98449755 4.34497549]
# [0.89442186 0.73041578 4.09958967 3.50951819]
# [1.33811051 0.99007556 4.37795636 3.53113236]]
surprise's NMF implementation is based on the [NMF:2014] paper.
Note that here the optimization is performed over the known ratings only, so the predicted values for the known ratings end up close to the true ratings (while, as expected, the predicted values for the unknown ratings are not in general close to 0).
Again, as usual, we can find the number of factors k using cross-validation.
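As a rough sketch of that cross-validation (my own addition, reusing the data object built above; the candidate values of n_factors and cv=3 are arbitrary, and on a toy matrix this small the scores are not very meaningful):
from surprise import NMF
from surprise.model_selection import cross_validate

for k in (1, 2, 3):
    # 3-fold cross-validated RMSE for each candidate number of latent factors
    scores = cross_validate(NMF(n_factors=k), data, measures=["RMSE"], cv=3, verbose=False)
    print(k, scores["test_rmse"].mean())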

PCA analysis considering N-less relevant components

I am trying to learn the basics of PCA analysis in Python using scikit libraries (in particular sklearn.decomposition and sklearn.preprocessing). The goal is to import data from images into a matrix X (each row is a sample, each column is a feature), then standardize X, use PCA to extract principal components (the 2 most important, the 6 most important, ..., the 6 least important), project X on these principal components, reverse the previous transformation and plot the result in order to see the difference with respect to the original image/images.
Now let's say that I do not want to consider the 2, 3, 4... most important principal components, but instead the N least relevant components, say N=6.
How should the analysis be done?
I mean I can't simply standardize then call PCA().fit_transform and then revert back with inverse_transform() to plot the results.
At the moment I am doing something like this:
X_std = StandardScaler().fit_transform(X) # standardize original data
pca = PCA()
model = pca.fit(X_std) # create model with all components
Xprime = model.components_[range(dim-6, dim, 1),:] # get last 6 PC
And then I stop, because I know I should call transform(), but I do not understand how to do it... I tried several times without being successful.
Can someone tell me whether the previous steps are correct and point out the direction to follow?
Thank you very much
EDIT: currently I have adapted this solution as suggested by the first answer to my question:
model = PCA().fit(X_std)
model2pc = model
model2pc.components_[range(2, img_count, 1), :] = 0
Xp_2pc = model2pc.transform(X_std)
Xr_2pc = model2pc.inverse_transform(Xp_2pc)
And then I do the same for 6 PCs, 60 PCs, and the last 6 PCs. What I have noticed is that this is very time consuming. I would like to get a model that directly extracts the principal components I need (without zeroing out the others) and then perform transform() and inverse_transform() with that model.
If you want to ignore all but the last 6 principal components, you can just zero out the ones you don't want to keep.
N = 6
X_std = StandardScaler().fit_transform(X)
pca = PCA()
model = pca.fit(X_std) # create model with all components
model.components_[:-N] = 0
Then, to remove all but the last N components from the data, just do a forward and inverse transform of the data:
Xprime = model.inverse_transform(model.transform(X_std))
Here is an example:
>>> X = np.random.rand(18).reshape(6, 3)
>>> model = PCA().fit(X)
A round-trip transform should give back the original data:
>>> X
array([[0.16594796, 0.02366958, 0.8403745 ],
[0.25219425, 0.22879029, 0.07950927],
[0.69636084, 0.4410933 , 0.97431828],
[0.50121079, 0.44835563, 0.95236146],
[0.6793044 , 0.53847562, 0.27882302],
[0.32886931, 0.0643043 , 0.10597973]])
>>> model.inverse_transform(model.transform(X))
array([[0.16594796, 0.02366958, 0.8403745 ],
[0.25219425, 0.22879029, 0.07950927],
[0.69636084, 0.4410933 , 0.97431828],
[0.50121079, 0.44835563, 0.95236146],
[0.6793044 , 0.53847562, 0.27882302],
[0.32886931, 0.0643043 , 0.10597973]])
Now zero out the first principal component:
>>> model.components_
array([[ 0.22969899, 0.21209762, 0.94986998],
[-0.67830467, -0.66500728, 0.31251894],
[ 0.69795497, -0.71608653, -0.0088847 ]])
>>> model.components_[:-2] = 0
>>> model.components_
array([[ 0. , 0. , 0. ],
[-0.67830467, -0.66500728, 0.31251894],
[ 0.69795497, -0.71608653, -0.0088847 ]])
The round-trip transform now gives a different result since we've removed the first principal component (which contains the greatest amount of variance):
>>> model.inverse_transform(model.transform(X))
array([[ 0.12742811, -0.01189858, 0.68108405],
[ 0.36513945, 0.33308073, 0.54656949],
[ 0.58029482, 0.33392119, 0.49435263],
[ 0.39987803, 0.35478779, 0.53332196],
[ 0.71114004, 0.56787176, 0.41047233],
[ 0.44000711, 0.16692583, 0.56556581]])
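As a side note (my own sketch, not part of the answer): if you prefer not to zero out components or copy the fitted model, you can project onto the chosen axes directly, using the pca and X_std from the answer's code above; this reproduces what transform()/inverse_transform() do internally, assuming the PCA was fitted with whiten=False (the default).
N = 6
W = pca.components_[-N:]               # the N least important principal axes, shape (N, n_features)
X_proj = (X_std - pca.mean_) @ W.T     # coordinates of the data along those axes
X_rec = X_proj @ W + pca.mean_         # back in the (standardized) feature space
X_rec should match what model.inverse_transform(model.transform(X_std)) gives after zeroing all but the last N rows of components_.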

How can I find what data is clustering in K-Shape?

I wrote this code:
import numpy
import matplotlib.pyplot as plt
from tslearn.clustering import KShape
from tslearn.datasets import CachedDatasets
from tslearn.preprocessing import TimeSeriesScalerMeanVariance

ks = KShape(n_clusters=3, n_init=10, verbose=True, random_state=seed)
y_pred = ks.fit_predict(data)

plt.figure(figsize=(16,9))
for yi in range(3):
    plt.subplot(3, 1, 1 + yi)
    for xx in stack_data[y_pred == yi]:
        plt.plot(xx.ravel(), "k-", alpha=.2)
    plt.title("Cluster %d" % (yi + 1))
plt.tight_layout()
plt.show()
I want to divide the data using KShape's clustering. The plot is shown, but I cannot tell which data ended up in each of the 3 clusters.
The data comes in order of kinds A, B, C, D, so I want to show labels on the plot, or otherwise get the result of the clustering. I searched KShape's documentation (http://tslearn.readthedocs.io/en/latest/auto_examples/plot_kshape.html), but I could not find the information to do what I want. How should I do it?
Why there is no perfect solution
K-Shape initializes randomly, and without setting a seed you might get different clusters and centroids on every run. There is no deterministic way to know a priori whether a given class is completely described by a given centroid, but you can proceed offline, in a fuzzy way, by checking to which cluster a given class is mostly assigned.
Also any given class, A for instance, could contain elements that are part of two clusters in the space of the features you are considering.
Suppose you have 3 classes but your dataset is best described (for example by maximal average density) by 4 clusters: you'd surely have some points of at least one class that go in the 4th cluster.
Alternatively, suppose your classes do not line up with the centroids produced by the distance metric you are using. Consider an obvious example: you have 3 classes, numbers from 0 to 100, from 100 to 1000 and from 1000 to 1100, but your dataset contains numbers from 0 to 150 and from 950 to 1100. A clustering algorithm would find its optimum with 2 clusters and put the points of class A in either one of the two.
Once you have determined that, for example, class A goes mostly to cluster 1, class B to cluster 2 etc... you can proceed to assign that cluster to the given class.
A possible fuzzy approach
We will proceed to determining the clusters classes by assigning the best fitted class to the cluster that contains most of its points:
Simple example: classes that actually fit clusters
For this example we use one of tslearn.datasets. This code is partially taken from this K-Shape example on tslearn.
import numpy as np
import matplotlib.pyplot as plt
from tslearn.clustering import KShape
from tslearn.datasets import CachedDatasets
from tslearn.preprocessing import TimeSeriesScalerMeanVariance
from seaborn import heatmap
We set the seed, for code reproducibility:
seed = 0
np.random.seed(seed)
Firstly we prepare the dataset, selecting the first classes_number=3 classes:
classes_number = 3
X_train, y_train, X_test, y_test = CachedDatasets().load_dataset("Trace")
mask = y_train <= classes_number
X_train, y_train = X_train[mask], y_train[mask] # Keep first 3 classes
X_train = TimeSeriesScalerMeanVariance().fit_transform(X_train) # Rescale to zero mean and unit variance
sz = X_train.shape[1]
Now we find the clusters, with clusters_number=3:
# K-Shape clustering
clusters_number = 3
ks = KShape(n_clusters=clusters_number, verbose=False, random_state=seed)
y_pred = ks.fit_predict(X_train)
We now count the elements of each class that are assigned to each cluster, adding 0 padding where no element of a given class was assigned to a given cluster (surely there is a more pythonic way to do this, but I've yet to find it):
data = [np.unique(y_pred[y_train==i+1], return_counts=True) for i in range(classes_number)]
>>>[(array([2]), array([26])),
(array([0]), array([21])),
(array([1]), array([22]))]
Adding the padding:
padded_data = np.array([[
    data[j][1][data[j][0] == i][0] if np.any(data[j][0] == i) else 0
    for i in range(clusters_number)
] for j in range(classes_number)])
>>> array([[ 0, 0, 26],
[21, 0, 0],
[ 0, 22, 0]])
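As an aside (my own alternative, not from the answer): the same class-vs-cluster count matrix can be built more directly with np.add.at, assuming y_train holds the integer labels 1..classes_number and y_pred holds cluster labels 0..clusters_number-1.
# Build the class-vs-cluster count matrix in one shot
counts = np.zeros((classes_number, clusters_number), dtype=int)
np.add.at(counts, ((y_train - 1).astype(int), y_pred), 1)  # counts[i, j] = samples of class i+1 in cluster j
counts should equal padded_data above.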
Normalising the obtained matrix:
normalized_data = padded_data / np.sum(padded_data, axis=-1)[:, np.newaxis]
>>> array([[0., 0., 1.],
[1., 0., 0.],
[0., 1., 0.]])
We can visualise the obtained matrix using seaborn.heatmap:
xticklabels = ["Cluster n. %s" % (1+i) for i in range(clusters_number)]
yticklabels = ["Class n. %s" % (1+i) for i in range(classes_number)]
heatmap(
    normalized_data,
    cbar=False,
    square=True,
    annot=True,
    cmap="YlGnBu",
    xticklabels=xticklabels,
    yticklabels=yticklabels)
plt.yticks(rotation=0)
The resulting heatmap shows that, in this optimal situation, every cluster contains exactly one class, so with absolute precision we obtain:
classes_clusters = np.argmax(normalized_data, axis=1)
>>> array([2, 0, 1])
Second example: classes that do not overlap with clusters
For simplicity's sake, to simulate classes that do not overlap completely with the clusters I'm just going to shuffle part of the labels, but there is a vast range of examples: most clustering problems end up with classes that do not exactly coincide with a cluster.
tmp = y_train[:20]
np.random.shuffle(tmp)
y_train[:20] = tmp
Now, when we execute the script again, we get quite a different heatmap, but we are still able to determine the class-to-cluster assignments:
classes_clusters = np.argmax(normalized_data, axis=1)
>>> array([2, 0, 1])
Third example: classes that do not exist in the dataset
Suppose we were led to believe that the dataset contains 4 classes. After running with different values of k we would find that the best number of clusters for our current dataset is k=3: how would we proceed to assign the classes to the clusters? Which class could be thrown away?
We simulate such a situation by arbitrarily assigning a fourth class to part of our labels:
y_train[:20] = 4
Running our script again produces the corresponding heatmap. Clearly the 4th class has to go; we can proceed by thresholding on the mean variance:
threshold = np.mean(np.var(normalized_data, axis=1))
result = np.argmax(normalized_data[np.var(normalized_data, axis=1)>threshold], axis=1)
And we obtain yet again:
array([2, 0, 1])
I hope this explanation has cleared most of your doubts!

In python, what is a good way to match expected values to real values?

Given a dictionary of ideal x,y locations, I have a list of unordered real x,y locations that are close to the ideal locations, and I need to assign each of them to the corresponding ideal-location dictionary key. Sometimes I get no data at all, (0,0), for a given location.
An example dataset is:
idealLoc = {1: (907, 1026),
            2: (892, 1152),
            3: (921, 1364),
            4: (969, 1020),
            5: (949, 1220),
            6: (951, 1404),
            'No_Data': (0, 0)}

realLoc = [[892., 1152.],
           [969., 1021.],
           [906., 1026.],
           [949., 1220.],
           [951., 1404.],
           [0., 0.]]
The output would be a new dictionary with the real locations assigned to the correct dictionary key from idealLoc. I have considered the brute force approach (scan the whole list n times for each best match), but I was wondering if there is a more elegant/efficient way?
Edit: Below is the "brute" force method
Dest = {}
dp = 6
for (y, x) in realLoc:
    for key, (r, c) in idealLoc.items():
        if x > c - dp and x < c + dp and y > r - dp and y < r + dp:
            Dest[key] = [y, x]
            break
K-d trees are an efficient way to partition data in order to perform fast nearest-neighbour searches. You can use scipy.spatial.cKDTree to solve your problem like this:
import numpy as np
from scipy.spatial import cKDTree
# convert inputs to numpy arrays
ilabels, ilocs = (np.array(vv) for vv in zip(*idealLoc.items()))
rlocs = np.array(realLoc)
# construct a K-d tree that partitions the "ideal" points
tree = cKDTree(ilocs)
# query the tree with the real coordinates to find the nearest "ideal" neighbour
# for each "real" point
dist, idx = tree.query(rlocs, k=1)
# get the corresponding labels and coordinates
print(ilabels[idx])
# ['2' '4' '1' '5' '6' 'No_Data']
print(ilocs[idx])
# [[ 892 1152]
# [ 969 1020]
# [ 907 1026]
# [ 949 1220]
# [ 951 1404]
# [ 0 0]]
By default cKDTree uses the Euclidean norm as the distance metric, but you could also specify the Manhattan norm, max norm etc. by passing the p= keyword argument to tree.query().
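For example (using the tree and rlocs defined above), querying with the Manhattan norm would look like this:
dist, idx = tree.query(rlocs, k=1, p=1)  # p=1 selects the Manhattan (city-block) norm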
There is also the scipy.interpolate.NearestNDInterpolator class, which is basically just a convenience wrapper around scipy.spatial.cKDTree.
Assuming you want to use Euclidean distance, you can use scipy.spatial.distance.cdist to calculate the distance matrix and then choose the nearest point.
import numpy
from scipy.spatial import distance
ideal = numpy.array(list(idealLoc.values()))
real = numpy.array(realLoc)
dist = distance.cdist(ideal, real)
nearest_indexes = dist.argmin(axis=0)
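To turn those indices back into dictionary keys, a small completion of my own (it relies on list(idealLoc.keys()) and list(idealLoc.values()) iterating in the same order, which is guaranteed for a given dict):
# Map each real location back to its nearest ideal key
keys = list(idealLoc.keys())
Dest = {keys[i]: realLoc[j] for j, i in enumerate(nearest_indexes)}
print(Dest)
# should give something like {2: [892.0, 1152.0], 4: [969.0, 1021.0], 1: [906.0, 1026.0], ...}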

Numpy's 'linalg.solve' and 'linalg.lstsq' not giving same answer as Matlab's '\' or mldivide

I'm trying to implement the least squares curve fitting algorithm on Python, having already written it on Matlab. However, I'm having trouble getting the right transform matrix, and the problem seems to be happening at the solve step. (Edit: My transform matrix is incredibly accurate with Matlab, but completely off with Python.)
I've looked at numerous sources online, and they all indicate that to translate Matlab's 'mldivide', you have to use 'np.linalg.solve' if the matrix is square and nonsingular, and 'np.linalg.lstsq' otherwise. Yet my results are not matching up.
What is the problem? If it has to do with the implementation of the functions, what is the correct translation of mldivide in numpy?
I have attached both versions of the code below. They are essentially the exact same implementation, with exception to the solve part.
Matlab code:
%% Least Squares Fit
clear, clc, close all
% Calibration Data
scr_calib_pts = [0,0; 900,900; -900,900; 900,-900; -900,-900];
cam_calib_pts = [-1,-1; 44,44; -46,44; 44,-46; -46,-46];
cam_x = cam_calib_pts(:,1);
cam_y = cam_calib_pts(:,2);
% Least Squares Fitting
A_matrix = [];
for i = 1:length(cam_x)
    A_matrix = [A_matrix;1, cam_x(i), cam_y(i), ...
                cam_x(i)*cam_y(i), cam_x(i)^2, cam_y(i)^2];
end
A_star = A_matrix'*A_matrix
B_star = A_matrix'*scr_calib_pts
transform_matrix = mldivide(A_star,B_star)
% Testing Data
test_scr_vec = [200,400; 1600,400; -1520,1740; 1300,-1800; -20,-1600];
test_cam_vec = [10,20; 80,20; -76,87; 65,-90; -1,-80];
test_cam_x = test_cam_vec(:,1);
test_cam_y = test_cam_vec(:,2);
% Coefficients for Transform
coefficients = [];
for i = 1:length(test_cam_x)
    coefficients = [coefficients;1, test_cam_x(i), test_cam_y(i), ...
                    test_cam_x(i)*test_cam_y(i), test_cam_x(i)^2, test_cam_y(i)^2];
end
% Mapped Points
results = coefficients*transform_matrix;
% Plotting
test_scr_x = test_scr_vec(:,1)';
test_scr_y = test_scr_vec(:,2)';
results_x = results(:,1)';
results_y = results(:,2)';
figure
hold on
load seamount
s = 50;
scatter(test_scr_x, test_scr_y, s, 'r')
scatter(results_x, results_y, s)
Python code:
# Least Squares fit
import numpy as np
import matplotlib.pyplot as plt
# Calibration data
camera_vectors = np.array([[-1,-1], [44,44], [-46,44], [44,-46], [-46,-46]])
screen_vectors = np.array([[0,0], [900,900], [-900,900], [900,-900], [-900,-900]])
# Separate axes
cam_x = np.array([i[0] for i in camera_vectors])
cam_y = np.array([i[1] for i in camera_vectors])
# Initiate least squares implementation
A_matrix = []
for i in range(len(cam_x)):
    new_row = [1, cam_x[i], cam_y[i],
               cam_x[i]*cam_y[i], cam_x[i]**2, cam_y[i]**2]
    A_matrix.append(new_row)
A_matrix = np.array(A_matrix)
A_star = np.transpose(A_matrix).dot(A_matrix)
B_star = np.transpose(A_matrix).dot(screen_vectors)
print A_star
print B_star
try:
    # Solve version (Implemented)
    transform_matrix = np.linalg.solve(A_star, B_star)
    print "Solve version"
    print transform_matrix
except:
    # Least squares version (implemented)
    transform_matrix = np.linalg.lstsq(A_star, B_star)[0]
    print "Least Squares Version"
    print transform_matrix
# Test data
test_cam_vec = np.array([[10,20], [80,20], [-76,87], [65,-90], [-1,-80]])
test_scr_vec = np.array([[200,400], [1600,400], [-1520,1740], [1300,-1800], [-20,-1600]])
# Coefficients of quadratic equation
test_cam_x = np.array([i[0] for i in test_cam_vec])
test_cam_y = np.array([i[1] for i in test_cam_vec])
coefficients = []
for i in range(len(test_cam_x)):
    new_row = [1, test_cam_x[i], test_cam_y[i],
               test_cam_x[i]*test_cam_y[i], test_cam_x[i]**2, test_cam_y[i]**2]
    coefficients.append(new_row)
coefficients = np.array(coefficients)
# Transform camera coordinates to screen coordinates
results = coefficients.dot(transform_matrix)
# Plot points
results_x = [i[0] for i in results]
results_y = [i[1] for i in results]
actual_x = [i[0] for i in test_scr_vec]
actual_y = [i[1] for i in test_scr_vec]
plt.plot(results_x, results_y, 'gs', actual_x, actual_y, 'ro')
plt.show()
Edit (in accordance with a suggestion):
# Transform matrix with linalg.solve
[[ 2.00000000e+01 2.00000000e+01]
[ -5.32857143e+01 7.31428571e+01]
[ 7.32857143e+01 -5.31428571e+01]
[ -1.15404203e-17 9.76497106e-18]
[ -3.66428571e+01 3.65714286e+01]
[ 3.66428571e+01 -3.65714286e+01]]
# Transform matrix with linalg.lstsq:
[[ 2.00000000e+01 2.00000000e+01]
[ 1.20000000e+01 8.00000000e+00]
[ 8.00000000e+00 1.20000000e+01]
[ 1.79196935e-15 2.33146835e-15]
[ -4.00000000e+00 4.00000000e+00]
[ 4.00000000e+00 -4.00000000e+00]]
% Transform matrix with mldivide:
20.0000 20.0000
19.9998 0.0002
0.0002 19.9998
0 0
-0.0001 0.0001
0.0001 -0.0001
The interesting thing is that you will get quite different results with np.linalg.lstsq and np.linalg.solve.
x1 = np.linalg.lstsq(A_star, B_star)[0]
x2 = np.linalg.solve(A_star, B_star)
Both should offer a solution for the equation Ax = B. However, these give two quite different arrays:
In [37]: x1
Out[37]:
array([[ 2.00000000e+01, 2.00000000e+01],
[ 1.20000000e+01, 7.99999999e+00],
[ 8.00000001e+00, 1.20000000e+01],
[ -1.15359111e-15, 7.94503352e-16],
[ -4.00000001e+00, 3.99999999e+00],
[ 4.00000001e+00, -3.99999999e+00]]
In [39]: x2
Out[39]:
array([[ 2.00000000e+01, 2.00000000e+01],
[ -4.42857143e+00, 2.43809524e+01],
[ 2.44285714e+01, -4.38095238e+00],
[ -2.88620104e-18, 1.33158696e-18],
[ -1.22142857e+01, 1.21904762e+01],
[ 1.22142857e+01, -1.21904762e+01]])
Both should give an accurate (down to the calculation precision) solution to the group of linear equations, and for a non-singular matrix there is exactly one solution.
Something must be then wrong. Let us see if this both candidates could be solutions to the original equation:
In [41]: A_star.dot(x1)
Out[41]:
array([[ -1.11249392e-08, 9.86256055e-09],
[ 1.62000000e+05, -1.65891834e-09],
[ 0.00000000e+00, 1.62000000e+05],
[ -1.62000000e+05, -1.62000000e+05],
[ -3.24000000e+05, 4.47034836e-08],
[ 5.21540642e-08, -3.24000000e+05]])
In [42]: A_star.dot(x2)
Out[42]:
array([[ -1.45519152e-11, 1.45519152e-11],
[ 1.62000000e+05, -1.45519152e-11],
[ 0.00000000e+00, 1.62000000e+05],
[ -1.62000000e+05, -1.62000000e+05],
[ -3.24000000e+05, 0.00000000e+00],
[ 2.98023224e-08, -3.24000000e+05]])
They seem to give the same solution, which is essentially the same as B_star as it should be. This leads us towards the explanation. With simple linear algebra we could predict that A . (x1-x2) should be very close to zero:
In [43]: A_star.dot(x1-x2)
Out[43]:
array([[ -1.11176632e-08, 9.85164661e-09],
[ -1.06228981e-09, -1.60071068e-09],
[ 0.00000000e+00, -2.03726813e-10],
[ -6.72298484e-09, 4.94765118e-09],
[ 5.96046448e-08, 5.96046448e-08],
[ 2.98023224e-08, 5.96046448e-08]])
And it indeed is. So, it seems that there is a non-trivial solution for the equation Ax = 0, the solution being x = x1-x2, which means that the matrix is singular and there are thus an infinite number of different solutions for Ax=B.
The problem is thus not in NumPy or Matlab, it is in the matrix itself.
However, in the case of this matrix the situation is a bit tricky: A_star seems to be singular by the definition above (Ax = 0 for some x ≠ 0), yet its determinant is non-zero, so formally it is not singular.
A_star is thus an example of a matrix that is numerically unstable without being exactly singular. The solve method solves the system directly (conceptually, multiplication by the inverse), which is a bad choice in this case, as the inverse contains very large and very small values; this makes the computation prone to round-off errors. This can be seen by looking at the condition number of the matrix:
In [65]: cond(A_star)
Out[65]: 1.3817810855559592e+17
This is a very high condition number, and the matrix is ill-conditioned.
In this case the use of an inverse to solve the problem is a bad approach. The least squares approach gives better results, as you may see. However, the better solution is to rescale the input values so that x and x^2 are in the same range. One very good scaling is to scale everything between -1 and 1.
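A minimal sketch of that rescaling idea (my own, reusing the variable names from the Python code above; dividing by the maximum absolute value is just one of many possible scalings):
# Rescale camera coordinates to roughly [-1, 1] before building the design matrix
scale_x = np.abs(cam_x).max()
scale_y = np.abs(cam_y).max()
cam_xs = cam_x.astype(float) / scale_x
cam_ys = cam_y.astype(float) / scale_y
A_matrix = np.column_stack((np.ones_like(cam_xs), cam_xs, cam_ys,
                            cam_xs * cam_ys, cam_xs ** 2, cam_ys ** 2))
# Build A_star/B_star from this A_matrix as before; remember to apply the same
# scaling to the test points before multiplying by the transform matrix.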
One thing you might consider is to try to use NumPy's indexing capabilities. For example:
cam_x = np.array([i[0] for i in camera_vectors])
is equivalent to:
cam_x = camera_vectors[:,0]
and you may construct your array A this way:
A_matrix = np.column_stack((np.ones_like(cam_x), cam_x, cam_y, cam_x*cam_y, cam_x**2, cam_y**2))
No need to create lists of lists or any loops.
The matrix A_matrix is a 5 by 6 matrix (5 calibration points, 6 basis terms), so A_star = A_matrix'*A_matrix is a singular matrix. As a result there is no unique solution, and the results of both programs are correct. This corresponds to the original problem being under-determined, as opposed to over-determined.
