EDIT: I added prefix/suffix values to the interval arrays to make them the same length as their corresponding data arrays, as per user1319128's suggestion, and indeed interp does the job. His solution was workable and good; I just couldn't see it because I was tired and stupid.
I am sure this is a fairly mundane application, but so far I have failed to find or come up with a way to do this within numpy. Maybe my brain just needs a rest; anyway, here is the problem, with an example and the solution requirements.
So I have two arrays of different lengths, and I want to apply a common set of time intervals to both of them, so that the result is versions of these arrays that are all the same length, with values that relate to each other at the same row (if that makes sense). In the example below I have named this functionality "apply_timeintervals_to_array". The example code:
import numpy as np
from colorsys import hsv_to_rgb
num_xy = 20
num_colors = 12
#xy = np.random.rand(num_xy, 2) * 1080
xy = np.array([[ 687.32758344, 956.05651214],
[ 226.97671414, 698.48071588],
[ 648.59878864, 175.4882185 ],
[ 859.56600997, 487.25205922],
[ 794.43015178, 16.46114312],
[ 884.7166732 , 634.59100322],
[ 878.94218682, 835.12886098],
[ 965.47135726, 542.09202328],
[ 114.61867445, 601.74092126],
[ 134.02663822, 334.27221884],
[ 940.6589034 , 245.43354493],
[ 285.87902276, 550.32600784],
[ 785.00104142, 993.19960822],
[1040.49576307, 486.24009511],
[ 165.59409198, 156.79786175],
[1043.54280058, 313.09073855],
[ 645.62878826, 100.81909068],
[ 625.78003257, 252.17917611],
[1056.77009875, 793.02218098],
[ 2.93152052, 596.9795026 ]])
xy_deltas = np.sum((xy[1:] - xy[:-1])**2, axis=-1)
xy_ti = np.concatenate(([0.0],
(xy_deltas) / np.sum(xy_deltas)))
colors_ti = np.concatenate((np.linspace(0, 1, num_colors),
[1.0]))
common_ti = np.unique(np.sort(np.concatenate((xy_ti,
colors_ti))))
common_colors = (np.array(tuple(hsv_to_rgb(t, 0.9, 0.9) for t
in np.concatenate(([0.0],
common_ti,
[1.0]))))
* 255).astype(int)[1:-1]
common_xy = apply_timeintervals_to_array(common_ti, xy)
So one could then use the common arrays for additional computations or for rendering.
The question is: what could accomplish the "apply_timeintervals_to_array" functionality, or, alternatively, is there a better way to generate the same data?
I hope this is clear enough, let me know if it isn't. Thank you in advance.
I think numpy.interp should meet your expectations. For example, if I have a 2D array of length 20 and would like to interpolate at different common_ti values, whose length is 30, the code would be as follows.
xy = np.arange(0,400,10).reshape(20,2)
xy_ti = np.arange(20)/19
common_ti = np.linspace(0,1,30)
x=np.interp(common_ti,xy_ti,xy[:,0]) # interpolate the first column
y=np.interp(common_ti,xy_ti,xy[:,1]) #interpolate the second column
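Wrapping this column-wise interpolation up gives one possible shape for the helper named in the question. This is only a sketch under a couple of assumptions: it takes the source array's own time values as an extra argument (the question's call passes only two arguments), and it assumes those time values increase monotonically, e.g. a np.cumsum of the normalized deltas rather than the raw fractions.
def apply_timeintervals_to_array(common_ti, arr_ti, arr):
    # Interpolate every column of `arr` (sampled at `arr_ti`) onto `common_ti`.
    # `arr_ti` must be monotonically increasing, with one entry per row of `arr`.
    return np.column_stack([np.interp(common_ti, arr_ti, arr[:, col])
                            for col in range(arr.shape[1])])

# hypothetical call, assuming cumulative (monotonically increasing) time values
common_xy = apply_timeintervals_to_array(common_ti, np.cumsum(xy_ti), xy)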
I am trying to use sklearn.cluster.OPTICS to cluster an already computed similarity (distance) matrix filled with normalized cosine distances (0.0 to 1.0),
but no matter what I pass for max_eps and eps, I don't get any clusters out.
Later on I would need to run OPTICS on a similarity matrix of more than 129,000 x 129,000 items, hopefully relying on Dask to keep the memory footprint low.
I am extracting fastText vectors for a small number of words (each vector has 300 dimensions) and use dask-distance to create a similarity matrix from the vectors.
The result is a matrix looking like this:
sim == [[0. 0.56742118 0.42776633 0.42344265 0.84878847 0.87984235
0.87468601 0.95224451 0.89341788 0.80922083]
[0.56742118 0. 0.59779273 0.62900345 0.83004028 0.87549904
0.887784 0.8591598 0.80752158 0.80960947]
[0.42776633 0.59779273 0. 0.45120935 0.79292425 0.78556189
0.82378645 0.93107747 0.83290157 0.85349163]
[0.42344265 0.62900345 0.45120935 0. 0.81379353 0.83985011
0.8441614 0.89824009 0.77074847 0.81297649]
[0.84878847 0.83004028 0.79292425 0.81379353 0. 0.15328565
0.36656755 0.79393195 0.76615941 0.83415538]
[0.87984235 0.87549904 0.78556189 0.83985011 0.15328565 0.
0.36000894 0.7792588 0.77379052 0.83737352]
[0.87468601 0.887784 0.82378645 0.8441614 0.36656755 0.36000894
0. 0.82404421 0.86144969 0.87628284]
[0.95224451 0.8591598 0.93107747 0.89824009 0.79393195 0.7792588
0.82404421 0. 0.521453 0.5784272 ]
[0.89341788 0.80752158 0.83290157 0.77074847 0.76615941 0.77379052
0.86144969 0.521453 0. 0.629014 ]
[0.80922083 0.80960947 0.85349163 0.81297649 0.83415538 0.83737352
0.87628284 0.5784272 0.629014 0. ]]
which looks like something I could cluster using a threshold of 0.8, for example:
from dask import array as da
import dask_distance
import logging
import numpy as np
from sklearn.cluster import OPTICS
from collections import defaultdict
log = logging.warning
np.set_printoptions(suppress=True)
if __name__ == "__main__":
    array = np.load("vectors.npy")
    vectors = da.from_array(array)
    sim = dask_distance.cosine(vectors, vectors)
    sim = sim.clip(0.0, 1.0)
    m = np.max(sim)
    c = OPTICS(eps=-1, cluster_method="dbscan", metric="precomputed", algorithm="brute")
    clusters = c.fit(sim)
    words = [
        "icecream",
        "cake",
        "cream",
        "ice",
        "dog",
        "cat",
        "animal",
        "car",
        "truck",
        "bus",
    ]
    cs = defaultdict(list)
    for index, c in enumerate(clusters.labels_):
        cs[c].append(words[index])
    for v in cs.values():
        log(v)
    log(clusters.labels_)
which prints
['icecream', 'cake', 'cream', 'ice', 'dog', 'cat', 'animal', 'car', 'truck', 'bus']
[-1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
but I was expecting there to be several clusters.
I have tried many different values for all the supported parameters in OPTICS, but have not been able to get anything usable, or even more than one cluster.
I am using the following versions:
python -V
Python 3.7.3
sklearn.__version__
'0.21.3'
dask.__version__
'2.3.0'
numpy.__version__
'1.17.0'
Here is how it looks when using sklearn's DBSCAN instead:
...
sim = sim.astype(np.float32)
c = DBSCAN(eps=0.7, min_samples=1, metric="precomputed", n_jobs=-1)
clusters = c.fit(sim)
...
yields
['icecream', 'cake', 'cream', 'ice']
['dog', 'cat', 'animal']
['car', 'truck', 'bus']
[0 0 0 0 1 1 1 2 2 2]
Which is exactly right, but has a much higher memory footprint (OPTICS apparently only needs to calculate half of the matrix).
Have you tried to estimate how much memory a 129000x129000 matrix needs - and how long it will take you to compute that and work with that?!? I strongly doubt that dask will be that helpful here in scaling this. You will need to use some indexing approach to avoid any O(n²) cost in the first place. Cutting O(n²) by a factor of k with k nodes just doesn't get you far enough to be scalable.
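For scale, a rough back-of-the-envelope estimate of the dense matrix alone (not counting intermediates):
n = 129_000
print(n * n * 8 / 1e9)   # ~133 GB as dense float64, roughly half that as float32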
When you use "precomputed", you have already computed the full distance matrix. Neither OPTICS nor DBSCAN will compute this again (nor just the lower half of it) - they will only iterate over this huge matrix, because they cannot make any assumptions about it: not even that it is symmetric.
Why do you think eps=-1 is right? And what about min_samples with OPTICS? If you don't choose the same parameters, you of course won't get similar results from OPTICS and DBSCAN.
The result found by OPTICS with your parameters is correct. At eps=-1 no points are neighbors, and with the default min_samples=5 there are hence no clusters; all points should be labeled -1.
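For completeness, here is a hedged sketch (not verified against the asker's exact vectors) of parameter choices that mirror the DBSCAN run above; with a precomputed matrix the knobs that matter are max_eps, min_samples and, for cluster_method="dbscan", eps:
from sklearn.cluster import OPTICS

c = OPTICS(min_samples=2, max_eps=0.8, cluster_method="dbscan", eps=0.7,
           metric="precomputed")
labels = c.fit(sim).labels_   # expected to resemble the DBSCAN labels above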
I am trying to learn the basics of PCA analysis in Python using scikit-learn (in particular sklearn.decomposition and sklearn.preprocessing). The goal is to import data from images into a matrix X (each row is a sample, each column is a feature), standardize X, use PCA to extract principal components (the 2 most important, the 6 most important, ..., the 6 least important), project X onto these principal components, reverse the transformation, and plot the result in order to see the difference with respect to the original image(s).
Now let's say that I do not want to consider the 2, 3, 4, ... most important principal components, but instead the N least relevant components, say N = 6.
How should the analysis be done?
I mean I can't simply standardize then call PCA().fit_transform and then revert back with inverse_transform() to plot the results.
At the moment I am doing something like this:
X_std = StandardScaler().fit_transform(X) # standardize original data
pca = PCA()
model = pca.fit(X_std) # create model with all components
Xprime = model.components_[range(dim-6, dim, 1),:] # get last 6 PC
And then I stop, because I know I should call transform(), but I do not understand how to do it... I tried several times without being successful.
Can someone tell me whether the previous steps are correct and point out the direction to follow?
Thank you very much
EDIT: I have currently adopted the following solution, as suggested by the first answer to my question:
model = PCA().fit(X_std)
model2pc = model  # note: this is just another reference to the same fitted PCA object, not a copy
model2pc.components_[range(2, img_count, 1), :] = 0
Xp_2pc = model2pc.transform(X_std)
Xr_2pc = model2pc.inverse_transform(Xp_2pc)
And then I do the same for 6 PCs, 60 PCs, and the last 6 PCs. What I have noticed is that this is very time consuming. I would like a model that directly extracts the principal components I need (without zeroing out the others), and then to perform transform() and inverse_transform() with that model.
If you want to ignore all but the last 6 principal components, you can just zero out the ones you don't want to keep.
N = 6
X_std = StandardScaler().fit_transform(X)
pca = PCA()
model = pca.fit(X_std) # create model with all components
model.components_[:-N] = 0
Then, to remove all but the last N components from the data, just do a forward and inverse transform of the data:
Xprime = model.inverse_transform(model.transform(X_std))
Here is an example:
>>> X = np.random.rand(18).reshape(6, 3)
>>> model = PCA().fit(X)
A round-trip transform should give back the original data:
>>> X
array([[0.16594796, 0.02366958, 0.8403745 ],
[0.25219425, 0.22879029, 0.07950927],
[0.69636084, 0.4410933 , 0.97431828],
[0.50121079, 0.44835563, 0.95236146],
[0.6793044 , 0.53847562, 0.27882302],
[0.32886931, 0.0643043 , 0.10597973]])
>>> model.inverse_transform(model.transform(X))
array([[0.16594796, 0.02366958, 0.8403745 ],
[0.25219425, 0.22879029, 0.07950927],
[0.69636084, 0.4410933 , 0.97431828],
[0.50121079, 0.44835563, 0.95236146],
[0.6793044 , 0.53847562, 0.27882302],
[0.32886931, 0.0643043 , 0.10597973]])
Now zero out the first principal component:
>>> model.components_
array([[ 0.22969899, 0.21209762, 0.94986998],
[-0.67830467, -0.66500728, 0.31251894],
[ 0.69795497, -0.71608653, -0.0088847 ]])
>>> model.components_[:-2] = 0
>>> model.components_
array([[ 0. , 0. , 0. ],
[-0.67830467, -0.66500728, 0.31251894],
[ 0.69795497, -0.71608653, -0.0088847 ]])
The round-trip transform now gives a different result since we've removed the first principal component (which contains the greatest amount of variance):
>>> model.inverse_transform(model.transform(X))
array([[ 0.12742811, -0.01189858, 0.68108405],
[ 0.36513945, 0.33308073, 0.54656949],
[ 0.58029482, 0.33392119, 0.49435263],
[ 0.39987803, 0.35478779, 0.53332196],
[ 0.71114004, 0.56787176, 0.41047233],
[ 0.44000711, 0.16692583, 0.56556581]])
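Regarding the edit in the question (getting at the wanted components directly, without zeroing out the rest): since transform is just a centering followed by a projection onto the rows of components_, you can slice those rows and do the two matrix products yourself. A sketch, assuming the model and X_std from the start of this answer and whiten=False (the default):
N = 6
W = model.components_[-N:]                   # last N principal axes, shape (N, n_features)
Xp_lastN = (X_std - model.mean_).dot(W.T)    # project onto only those axes
Xr_lastN = Xp_lastN.dot(W) + model.mean_     # reconstruct in the original (standardized) space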
The page https://pypi.python.org/pypi/fancyimpute has the line
# Instead of solving the nuclear norm objective directly, instead
# induce sparsity using singular value thresholding
X_filled_softimpute = SoftImpute().complete(X_incomplete_normalized)
which kind of suggests that I need to normalize the input data. However, I did not find any details on what exactly is meant by that. Do I have to normalize my data beforehand, and what exactly is expected?
Yes, you should definitely normalize the data. Consider the following example:
from fancyimpute import SoftImpute
import numpy as np
v=np.random.normal(100,0.5,(5,3))
v[2,1:3]=np.nan
v[0,0]=np.nan
v[3,0]=np.nan
SoftImpute().complete(v)
The result is
array([[ 81.78428587, 99.69638878, 100.67626769],
[ 99.82026281, 100.09077899, 99.50273223],
[ 99.70946085, 70.98619873, 69.57668189],
[ 81.82898539, 99.66269922, 100.95263318],
[ 99.14285815, 100.10809651, 99.73870089]])
Note that the values at the places where I put NaN are completely off. However, if you instead run
from fancyimpute import SoftImpute
import numpy as np
v=np.random.normal(0,1,(5,3))
v[2,1:3]=np.nan
v[0,0]=np.nan
v[3,0]=np.nan
SoftImpute().complete(v)
(same code as before, the only difference is that v is normalized) you get the following reasonable result:
array([[ 0.07705556, -0.53449412, -0.20081351],
[ 0.9709198 , -1.19890962, -0.25176222],
[ 0.41839224, -0.11786451, 0.03231515],
[ 0.21374759, -0.66986997, 0.78565414],
[ 0.30004524, 1.28055845, 0.58625942]])
Thus, when you are using SoftImpute, don't forget to normalize your data (you can do that by making the mean of every column 0 and the standard deviation 1).
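A sketch of the full normalize / impute / de-normalize round trip, using the same (older) .complete() API as above; the column statistics are computed with the NaN-aware numpy functions so the missing entries don't bias them:
mu = np.nanmean(v, axis=0)              # per-column mean, ignoring NaNs
sigma = np.nanstd(v, axis=0)            # per-column std, ignoring NaNs
v_normalized = (v - mu) / sigma
filled = SoftImpute().complete(v_normalized)
filled = filled * sigma + mu            # map the imputed values back to the original scale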
I'm trying to implement the least squares curve fitting algorithm in Python, having already written it in Matlab. However, I'm having trouble getting the right transform matrix, and the problem seems to be happening at the solve step. (Edit: my transform matrix is incredibly accurate with Matlab, but completely off with Python.)
I've looked at numerous sources online, and they all indicate that to translate Matlab's 'mldivide', you have to use 'np.linalg.solve' if the matrix is square and nonsingular, and 'np.linalg.lstsq' otherwise. Yet my results are not matching up.
What is the problem? If it has to do with the implementation of the functions, what is the correct translation of mldivide in numpy?
I have attached both versions of the code below. They are essentially the same implementation, with the exception of the solve part.
Matlab code:
%% Least Squares Fit
clear, clc, close all
% Calibration Data
scr_calib_pts = [0,0; 900,900; -900,900; 900,-900; -900,-900];
cam_calib_pts = [-1,-1; 44,44; -46,44; 44,-46; -46,-46];
cam_x = cam_calib_pts(:,1);
cam_y = cam_calib_pts(:,2);
% Least Squares Fitting
A_matrix = [];
for i = 1:length(cam_x)
    A_matrix = [A_matrix; 1, cam_x(i), cam_y(i), ...
                cam_x(i)*cam_y(i), cam_x(i)^2, cam_y(i)^2];
end
A_star = A_matrix'*A_matrix
B_star = A_matrix'*scr_calib_pts
transform_matrix = mldivide(A_star,B_star)
% Testing Data
test_scr_vec = [200,400; 1600,400; -1520,1740; 1300,-1800; -20,-1600];
test_cam_vec = [10,20; 80,20; -76,87; 65,-90; -1,-80];
test_cam_x = test_cam_vec(:,1);
test_cam_y = test_cam_vec(:,2);
% Coefficients for Transform
coefficients = [];
for i = 1:length(test_cam_x)
    coefficients = [coefficients; 1, test_cam_x(i), test_cam_y(i), ...
                    test_cam_x(i)*test_cam_y(i), test_cam_x(i)^2, test_cam_y(i)^2];
end
% Mapped Points
results = coefficients*transform_matrix;
% Plotting
test_scr_x = test_scr_vec(:,1)';
test_scr_y = test_scr_vec(:,2)';
results_x = results(:,1)';
results_y = results(:,2)';
figure
hold on
load seamount
s = 50;
scatter(test_scr_x, test_scr_y, s, 'r')
scatter(results_x, results_y, s)
Python code:
# Least Squares fit
import numpy as np
import matplotlib.pyplot as plt
# Calibration data
camera_vectors = np.array([[-1,-1], [44,44], [-46,44], [44,-46], [-46,-46]])
screen_vectors = np.array([[0,0], [900,900], [-900,900], [900,-900], [-900,-900]])
# Separate axes
cam_x = np.array([i[0] for i in camera_vectors])
cam_y = np.array([i[1] for i in camera_vectors])
# Initiate least squares implementation
A_matrix = []
for i in range(len(cam_x)):
    new_row = [1, cam_x[i], cam_y[i],
               cam_x[i]*cam_y[i], cam_x[i]**2, cam_y[i]**2]
    A_matrix.append(new_row)
A_matrix = np.array(A_matrix)
A_star = np.transpose(A_matrix).dot(A_matrix)
B_star = np.transpose(A_matrix).dot(screen_vectors)
print A_star
print B_star
try:
    # Solve version (Implemented)
    transform_matrix = np.linalg.solve(A_star, B_star)
    print "Solve version"
    print transform_matrix
except:
    # Least squares version (implemented)
    transform_matrix = np.linalg.lstsq(A_star, B_star)[0]
    print "Least Squares Version"
    print transform_matrix
# Test data
test_cam_vec = np.array([[10,20], [80,20], [-76,87], [65,-90], [-1,-80]])
test_scr_vec = np.array([[200,400], [1600,400], [-1520,1740], [1300,-1800], [-20,-1600]])
# Coefficients of quadratic equation
test_cam_x = np.array([i[0] for i in test_cam_vec])
test_cam_y = np.array([i[1] for i in test_cam_vec])
coefficients = []
for i in range(len(test_cam_x)):
    new_row = [1, test_cam_x[i], test_cam_y[i],
               test_cam_x[i]*test_cam_y[i], test_cam_x[i]**2, test_cam_y[i]**2]
    coefficients.append(new_row)
coefficients = np.array(coefficients)
# Transform camera coordinates to screen coordinates
results = coefficients.dot(transform_matrix)
# Plot points
results_x = [i[0] for i in results]
results_y = [i[1] for i in results]
actual_x = [i[0] for i in test_scr_vec]
actual_y = [i[1] for i in test_scr_vec]
plt.plot(results_x, results_y, 'gs', actual_x, actual_y, 'ro')
plt.show()
Edit (in accordance with a suggestion):
# Transform matrix with linalg.solve
[[ 2.00000000e+01 2.00000000e+01]
[ -5.32857143e+01 7.31428571e+01]
[ 7.32857143e+01 -5.31428571e+01]
[ -1.15404203e-17 9.76497106e-18]
[ -3.66428571e+01 3.65714286e+01]
[ 3.66428571e+01 -3.65714286e+01]]
# Transform matrix with linalg.lstsq:
[[ 2.00000000e+01 2.00000000e+01]
[ 1.20000000e+01 8.00000000e+00]
[ 8.00000000e+00 1.20000000e+01]
[ 1.79196935e-15 2.33146835e-15]
[ -4.00000000e+00 4.00000000e+00]
[ 4.00000000e+00 -4.00000000e+00]]
% Transform matrix with mldivide:
20.0000 20.0000
19.9998 0.0002
0.0002 19.9998
0 0
-0.0001 0.0001
0.0001 -0.0001
The interesting thing is that you will get quite different results with np.linalg.lstsq and np.linalg.solve.
x1 = np.linalg.lstsq(A_star, B_star)[0]
x2 = np.linalg.solve(A_star, B_star)
Both should offer a solution for the equation Ax = B. However, these give two quite different arrays:
In [37]: x1
Out[37]:
array([[ 2.00000000e+01, 2.00000000e+01],
[ 1.20000000e+01, 7.99999999e+00],
[ 8.00000001e+00, 1.20000000e+01],
[ -1.15359111e-15, 7.94503352e-16],
[ -4.00000001e+00, 3.99999999e+00],
[ 4.00000001e+00, -3.99999999e+00]]
In [39]: x2
Out[39]:
array([[ 2.00000000e+01, 2.00000000e+01],
[ -4.42857143e+00, 2.43809524e+01],
[ 2.44285714e+01, -4.38095238e+00],
[ -2.88620104e-18, 1.33158696e-18],
[ -1.22142857e+01, 1.21904762e+01],
[ 1.22142857e+01, -1.21904762e+01]])
Both should give an accurate (down to the calculation precision) solution to the system of linear equations, and for a non-singular matrix there is exactly one solution.
Something must then be wrong. Let us see whether both candidates could be solutions to the original equation:
In [41]: A_star.dot(x1)
Out[41]:
array([[ -1.11249392e-08, 9.86256055e-09],
[ 1.62000000e+05, -1.65891834e-09],
[ 0.00000000e+00, 1.62000000e+05],
[ -1.62000000e+05, -1.62000000e+05],
[ -3.24000000e+05, 4.47034836e-08],
[ 5.21540642e-08, -3.24000000e+05]])
In [42]: A_star.dot(x2)
Out[42]:
array([[ -1.45519152e-11, 1.45519152e-11],
[ 1.62000000e+05, -1.45519152e-11],
[ 0.00000000e+00, 1.62000000e+05],
[ -1.62000000e+05, -1.62000000e+05],
[ -3.24000000e+05, 0.00000000e+00],
[ 2.98023224e-08, -3.24000000e+05]])
They seem to give the same result, which is essentially the same as B_star, as it should be. This leads us towards the explanation. With simple linear algebra we can predict that A . (x1 - x2) should be very close to zero:
In [43]: A_star.dot(x1-x2)
Out[43]:
array([[ -1.11176632e-08, 9.85164661e-09],
[ -1.06228981e-09, -1.60071068e-09],
[ 0.00000000e+00, -2.03726813e-10],
[ -6.72298484e-09, 4.94765118e-09],
[ 5.96046448e-08, 5.96046448e-08],
[ 2.98023224e-08, 5.96046448e-08]])
And indeed it is. So it seems that there is a non-trivial solution to the equation Ax = 0, namely x = x1 - x2, which means that the matrix is singular and there is thus an infinite number of different solutions to Ax = B.
The problem is thus not in NumPy or Matlab, it is in the matrix itself.
However, in the case of this matrix the situation is a bit tricky. A_star seems to be singular by the definition above (Ax = 0 for some x ≠ 0). On the other hand, its determinant is non-zero, and it is not singular.
In this case A_star is an example of a matrix which is numerically unstable while not singular. The solve method handles it essentially by multiplying with the inverse. This is a bad choice here, as the inverse contains both very large and very small values, which makes the multiplication prone to round-off errors. This can be seen by looking at the condition number of the matrix:
In [65]: cond(A_star)
Out[65]: 1.3817810855559592e+17
This is a very high condition number, and the matrix is ill-conditioned.
In this case the use of an inverse to solve the problem is a bad approach. The least squares approach gives better results, as you may see. However, the better solution is to rescale the input values so that x and x^2 are in the same range. One very good scaling is to scale everything between -1 and 1.
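As an illustration of that rescaling, here is a sketch (not from the original post): dividing the camera coordinates by their largest absolute value keeps x, y, x*y, x**2 and y**2 all within [-1, 1]; the same scale factor must of course be applied to the test points before using the fitted transform.
scale = np.max(np.abs(camera_vectors))
cam_x_s = cam_x / float(scale)
cam_y_s = cam_y / float(scale)
A_scaled = np.column_stack((np.ones_like(cam_x_s), cam_x_s, cam_y_s,
                            cam_x_s*cam_y_s, cam_x_s**2, cam_y_s**2))
A_star_s = A_scaled.T.dot(A_scaled)
B_star_s = A_scaled.T.dot(screen_vectors)
transform_matrix_s = np.linalg.lstsq(A_star_s, B_star_s)[0]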
One thing you might consider is to try to use NumPy's indexing capabilities. For example:
cam_x = np.array([i[0] for i in camera_vectors])
is equivalent to:
cam_x = camera_vectors[:,0]
and you may construct your array A this way:
A_matrix = np.column_stack((np.ones_like(cam_x), cam_x, cam_y, cam_x*cam_y, cam_x**2, cam_y**2))
No need to create lists of lists or any loops.
The matrix A_matrix is a 5 by 6 matrix (5 calibration points, 6 basis terms), so A_star is a singular matrix. As a result there is no unique solution, and the results of both programs are correct. This corresponds to the original problem being under-determined, as opposed to over-determined.
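This is easy to check numerically: with only 5 calibration points and 6 basis terms, A_matrix has rank at most 5, and so does the 6 x 6 matrix A_star = A_matrix.T.dot(A_matrix), so any non-zero determinant reported for it is just round-off noise. A quick sanity check:
print(np.linalg.matrix_rank(A_matrix))   # at most 5
print(np.linalg.matrix_rank(A_star))     # also at most 5, i.e. less than 6, so A_star is singular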