I have a 3-D array of shape (3, 60000, 10) that needs to be 2-D so that I can visualize it when clustering.
I was planning to apply k-means clustering from scikit-learn to the 3-D array, and read that it only accepts 2-D input. Is there a right way to do this? I was planning on reshaping it to (60000, 30), but wanted clarification before I go ahead.
How I read it is that you have 10 features, each consisting of 3D data. Do you intend to cluster all 10 features? If so, reshape it so that you have 600000 x 3 points (assuming you want to separate them in space). For example:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt, numpy as np
# 3x points
data = np.random.rand(100, 3, 10) + np.arange(10) # add arbitrary offset for "difference" in real data
data = np.moveaxis(data, -1, 1).reshape(-1, 3)
n_clus = 10 # cluster in 10 --> fill in with your goal in mind
km = KMeans(n_clusters = n_clus).fit(data)
fig, ax = plt.subplots(subplot_kw = dict(projection = '3d'))
colors = plt.cm.tab20(np.linspace(0, 1, n_clus))
ax.scatter(*data.T, c = colors[km.labels_])
fig.show()
Yields
(60000, 30) is probably not a great idea. K-means clustering uses a distance metric to define clusters, normally Euclidean distance, but as you increase the number of variables in the second dimension you run into the curse of dimensionality, where the results of clustering stop making sense.
You can of course try (60000, 30) and see if it works, but if it doesn't, you'll need to reduce dimensionality, for example by doing a PCA and clustering on the principal components.
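The PCA-then-cluster suggestion could look roughly like the sketch below. The array size, the number of components (3) and the number of clusters (4) are placeholders chosen for illustration, not values taken from the question.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
arr = rng.random((3, 600, 10))   # small stand-in for the (3, 60000, 10) array

# Bring the sample axis to the front, then flatten the rest: (600, 30).
flat = np.moveaxis(arr, 1, 0).reshape(arr.shape[1], -1)

# Reduce to a few principal components before clustering (3 is arbitrary).
reduced = PCA(n_components=3).fit_transform(flat)

# Cluster in the reduced space (4 clusters is arbitrary).
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(reduced)
print(flat.shape, reduced.shape, labels.shape)
```

With three components the result can also be plotted directly on a 3D scatter chart, which addresses the visualization goal.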
EDIT
I'll probably try and explain what I mean by dimensionality and the issues it causes since there appears to be some confusion.
A 2d array of size (100, 2) is 2-dimensional data, i.e. it's 100 observations of 2 variables. The trend line between those points would be a 1d object (a line) and you can plot it on a 2d plane. Similarly, a (100, 3) array is 3-dimensional, with the trend being a 2d plane, and you can plot those points on a 3d chart.
Then a (100, 100) array is 100-dimensional. A trend would be a 99-dimensional hyperplane, which you cannot visualise even in principle. Now let's see what issues this causes. Let's define a simple function calculating Euclidean distance:
def distance(x, y):
    return sum((i - j)**2 for i, j in zip(x, y))**0.5
The function takes two iterables as arguments and calculates Euclidean distance between those. Now let's try with something simple:
v1 = (1, 1)
v2 = (2, 2)
v3 = (100, 100)
v4 = (120, 120)
>> distance(v1, v2)
Out: 1.4142135623730951
>> distance(v1, v3)
Out: 140.0071426749364
>> distance(v1, v4)
Out: 168.2914139223983
If we make these tuples 3 dimensional keeping the same values in all dimensions, distances become respectively: 1.73, 171.47, 206.11.
Now for the fun part - let's add a bunch of dimensions filled with "1"s:
v1 = [1, 1, 1] + list(1 for i in range(47))
v2 = [2, 2, 2] + list(1 for i in range(47))
v2 = [100, 100, 100] + list(1 for i in range(47))
v4 = [120, 120, 120] + list(1 for i in range(47))
>>> distance(v1, v2)
171.47302994931886
>>> distance(v1, v3)
175.16278143486988
>>> distance(v1, v4)
206.11404610069638
So here we increased dimensions without adding additional information to separate the variables, and suddenly what appeared as two distinct clusters are not so well defined any more; in fact v1, v2 and v3 appear more like they belong together, with v4 being an outsider.
This will also happen in most cases, unless the higher dimensions continue the pattern of the first three, i.e. (1, 1, 1...), (2, 2, 2,..), (100, 100, 100...), (120, 120, 120,...). But in most cases you will see distances shrink and clusters become indistinguishable.
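A separate, minimal sketch of the same effect, using random noise dimensions rather than the constant padding above. The cluster positions, noise scale and seed are assumptions made for illustration. It compares the typical distance between two clusters with the typical distance inside one cluster, before and after adding uninformative dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

def contrast(n_noise_dims, n_points=200):
    """Ratio of mean between-cluster to mean within-cluster distance."""
    # Two clusters separated by 5 units along each of 3 informative axes.
    a = rng.normal(0.0, 1.0, size=(n_points, 3))
    b = rng.normal(5.0, 1.0, size=(n_points, 3))
    # Pad both clusters with uninformative noise dimensions.
    a = np.hstack([a, rng.normal(0.0, 1.0, size=(n_points, n_noise_dims))])
    b = np.hstack([b, rng.normal(0.0, 1.0, size=(n_points, n_noise_dims))])
    between = np.linalg.norm(a - b, axis=1).mean()
    within = np.linalg.norm(a[:-1] - a[1:], axis=1).mean()
    return between / within

ratio_low = contrast(0)       # only the 3 informative dimensions
ratio_high = contrast(1000)   # 3 informative + 1000 noise dimensions
print(ratio_low, ratio_high)
```

The ratio should be well above 1 with only the informative dimensions and close to 1 once the noise dimensions dominate, which is when a distance-based method such as k-means has little left to separate the clusters with.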
Let's consider the following data:
import numpy as np
from sklearn.linear_model import LogisticRegression
x=np.linspace(0,2*np.pi,80)
x = x.reshape(-1,1)
y = np.sin(x)+np.random.normal(0,0.4,80)
y[y<1/2] = 0
y[y>1/2] = 1
clf=LogisticRegression(solver="saga", max_iter = 1000)
I want to fit a logistic regression where y is the dependent variable and x is the independent variable. But when I call:
clf.fit(x,y)
I see the error
'y should be a 1d array, got an array of shape (80, 80) instead'.
I tried to reshape the data by using
y=y.reshape(-1,1)
But I end up with an array of length 6400! (How come?)
Could you please give me a hand with performing this regression?
Change the order of your operations:
First generate x and y as 1-D arrays:
x = np.linspace(0, 2*np.pi, 8)
y = np.sin(x) + np.random.normal(0, 0.4, 8)
Then (after y was generated) reshape x:
x = x.reshape(-1, 1)
Edit following a comment as of 2022-02-20
The source of the problem in the original code is as follows:
x = np.linspace(0,2*np.pi,80) - generates a 1-D array.
x = x.reshape(-1,1) - reshapes it into a 2-D array with one column and as many rows as needed.
y = np.sin(x) + np.random.normal(0,0.4,80) - operates on a column array of shape (80, 1) and a 1-D array of shape (80,), which is treated here as a single-row array. Broadcasting makes the result y a 2-D array of shape (80, 80).
The attempt to reshape y then gives a single-column array with 6400 rows.
The proper solution is that both x and y should initially be 1-D (single row) arrays, and my code does just this.
Then both arrays can be reshaped.
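The shapes described above can be checked directly. This is a small sketch using the same sizes as the question:

```python
import numpy as np

x = np.linspace(0, 2 * np.pi, 80)
noise = np.random.normal(0, 0.4, 80)

# Reshaping x first: (80, 1) + (80,) broadcasts to (80, 80).
y_wrong = np.sin(x.reshape(-1, 1)) + noise
print(y_wrong.shape)                  # (80, 80)
print(y_wrong.reshape(-1, 1).shape)   # (6400, 1)

# Generating y from the 1-D x first keeps it 1-D.
y_right = np.sin(x) + noise
X = x.reshape(-1, 1)
print(X.shape, y_right.shape)         # (80, 1) (80,)
```

scikit-learn estimators generally expect X with shape (n_samples, n_features) and y with shape (n_samples,), which is what the last line produces.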
I encountered this error and tried solving it via reshape, but it didn't work:
ValueError: y should be a 1d array, got an array of shape () instead.
Actually, this was happening because of the wrong placement of the [] brackets around np.argmax. Below are the wrong code and the correct code; notice the position of [] around np.argmax in both snippets.
Wrong Code
ax[i,j].set_title("Predicted Watch : "+str(le.inverse_transform([pred_digits[prop_class[count]]])) +"\n"+"Actual Watch : "+str(le.inverse_transform(np.argmax([y_test[prop_class[count]]])).reshape(-1,1)))
Correct Code
ax[i,j].set_title("Predicted Watch :"+str(le.inverse_transform([pred_digits[prop_class[count]]]))+"\n"+"Actual Watch : "+str(le.inverse_transform([np.argmax(y_test[prop_class[count]])])))
I am trying to increase the dimensionality of my initial array:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
x = 10*rng.rand(50)
y = np.sin(x) + 0.1*rng.rand(50)
poly = PolynomialFeatures(7, include_bias=False)
poly.fit_transform(x[:,np.newaxis])
First, I know np.newaxis is creating an additional column. Why is this necessary?
Now I will train a linear regression on the updated x data (poly):
test_x = np.linspace(0,10,1000)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
# train with increased dimension(x=poly) with its target
model.fit(poly,y)
# testing
test_y = model.predict(x_test)
When I run this it gives me ValueError: Expected 2D array, got scalar array instead: on the model.fit(poly,y) line. I've already added a dimension to poly, so what is happening?
Also, what's the difference between x[:,np.newaxis] and x[:,None]?
In [55]: x=10*np.random.rand(5)
In [56]: x
Out[56]: array([6.47634068, 6.25520837, 7.58822106, 4.65466951, 2.35783624])
In [57]: x.shape
Out[57]: (5,)
newaxis does not add a column, it adds a dimension:
In [58]: x1 = x[:,np.newaxis]
In [59]: x1
Out[59]:
array([[6.47634068],
[6.25520837],
[7.58822106],
[4.65466951],
[2.35783624]])
In [60]: x1.shape
Out[60]: (5, 1)
np.newaxis has the value of None, so both work the same.
In [61]: x[:,None].shape
Out[61]: (5, 1)
One is a little clearer to human readers, the other a little easier to type.
https://www.numpy.org/devdocs/reference/constants.html
Whether x or x1 works depends on the expectations of the learning code. Some learning code expects inputs of the shape (samples, features). It could assume that a (50,) shape array is 50 samples, 1 feature, or 1 case, 50 features. But it's better if you tell exactly what you mean.
Look at the docs:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html#sklearn.preprocessing.PolynomialFeatures.fit_transform
poly.fit_transform
X : numpy array of shape [n_samples, n_features]
Sure looks like fit_transform expects a 2d input.
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.fit
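X is supposed to be 2d, with shape (n_samples, n_features); y can be 1d with shape (n_samples,) or 2d with shape (n_samples, n_targets).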
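As for the ValueError in the question: in the posted code, poly is the PolynomialFeatures transformer object itself, and the array returned by fit_transform is never stored. A minimal sketch of one way to wire it up follows. The names X_poly, x_test and y_pred, and the RandomState seed, are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)   # seed chosen arbitrarily
x = 10 * rng.rand(50)
y = np.sin(x) + 0.1 * rng.rand(50)

poly = PolynomialFeatures(7, include_bias=False)
# Keep the transformed array; `poly` itself is only the transformer.
X_poly = poly.fit_transform(x[:, np.newaxis])   # shape (50, 7)

model = LinearRegression()
model.fit(X_poly, y)   # pass the array, not the transformer

# Test data needs the same 2d shape and the same transformation.
x_test = np.linspace(0, 10, 1000)
y_pred = model.predict(poly.transform(x_test[:, np.newaxis]))
print(X_poly.shape, y_pred.shape)   # (50, 7) (1000,)
```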
I am trying to use Linear Discriminant Analysis from the scikit-learn library to perform dimensionality reduction on my data, which has more than 200 features. But I could not find an inverse_transform function in the LDA class.
I just wanted to ask: how can I reconstruct the original data from a point in the LDA domain?
Edit based on @bogatron's and @kazemakase's answers:
I think the term "original data" was wrong, and I should have used "original coordinates" or "original space" instead. I know that without all the principal components we can't reconstruct the original data, but when we build the shape space we project the data down to a lower dimension with the help of PCA. PCA tries to explain the data with only 2 or 3 components that capture most of the variance of the data, and if we reconstruct the data based on them, it should show us the parts of the shape that cause this separation.
I checked the source code of the scikit-learn LDA again and noticed that the eigenvectors are stored in the scalings_ variable. When we use the svd solver, it's not possible to invert the eigenvector (scalings_) matrix, but when I tried the pseudo-inverse of the matrix, I could reconstruct the shape.
Here are two images, reconstructed from the points [ 4.28, 0.52] and [0, 0] respectively:
I think it would be great if someone could explain the mathematical limitations of the LDA inverse transform in depth.
The inverse of the LDA does not necessarily make sense because it loses a lot of information.
For comparison, consider the PCA. Here we get a coefficient matrix that is used to transform the data. We can do dimensionality reduction by stripping rows from the matrix. To get the inverse transform, we first invert the full matrix and then remove the columns corresponding to the removed rows.
The LDA does not give us a full matrix. We only get a reduced matrix that cannot be directly inverted. It is possible to take the pseudo inverse, but this is much less efficient than if we had the full matrix at our disposal.
Consider a simple example:
C = np.ones((3, 3)) + np.eye(3) # full transform matrix
U = C[:2, :] # dimensionality reduction matrix
V1 = np.linalg.inv(C)[:, :2] # PCA-style reconstruction matrix
print(V1)
#array([[ 0.75, -0.25],
# [-0.25, 0.75],
# [-0.25, -0.25]])
V2 = np.linalg.pinv(U) # LDA-style reconstruction matrix
print(V2)
#array([[ 0.63636364, -0.36363636],
# [-0.36363636, 0.63636364],
# [ 0.09090909, 0.09090909]])
If we have the full matrix we get a different inverse transform (V1) than if we simply take the pseudo-inverse of the reduced transform (V2). That is because in the second case we lost all information about the discarded components.
You have been warned. If you still want to do the inverse LDA transform, here is a function:
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.utils.validation import check_is_fitted
from sklearn.utils import check_array, check_X_y
import numpy as np
def inverse_transform(lda, x):
    if lda.solver == 'lsqr':
        raise NotImplementedError("(inverse) transform not implemented for 'lsqr' "
                                  "solver (use 'svd' or 'eigen').")
    check_is_fitted(lda, ['xbar_', 'scalings_'], all_or_any=any)
    inv = np.linalg.pinv(lda.scalings_)
    x = check_array(x)
    if lda.solver == 'svd':
        x_back = np.dot(x, inv) + lda.xbar_
    elif lda.solver == 'eigen':
        x_back = np.dot(x, inv)
    return x_back
iris = datasets.load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names
lda = LinearDiscriminantAnalysis()
Z = lda.fit(X, y).transform(X)
Xr = inverse_transform(lda, Z)
# plot first two dimensions of original and reconstructed data
plt.plot(X[:, 0], X[:, 1], '.', label='original')
plt.plot(Xr[:, 0], Xr[:, 1], '.', label='reconstructed')
plt.legend()
You see, the result of the inverse transform does not have much to do with the original data (well, it's possible to guess the direction of the projection). A considerable part of the variation is gone for good.
There is no inverse transform because in general, you can not return from the lower dimensional feature space to your original coordinate space.
Think of it like looking at your 2-dimensional shadow projected on a wall. You can't get back to your 3-dimensional geometry from a single shadow because information is lost during the projection.
To address your comment regarding PCA, consider a data set of 10 random 3-dimensional vectors:
In [1]: import numpy as np
In [2]: from sklearn.decomposition import PCA
In [3]: X = np.random.rand(30).reshape(10, 3)
Now, what happens if we apply the Principal Components Transformation (PCT) and apply dimensionality reduction by keeping only the top 2 (out of 3) PCs, then apply the inverse transform?
In [4]: pca = PCA(n_components=2)
In [5]: pca.fit(X)
Out[5]:
PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)
In [6]: Y = pca.transform(X)
In [7]: X.shape
Out[7]: (10, 3)
In [8]: Y.shape
Out[8]: (10, 2)
In [9]: XX = pca.inverse_transform(Y)
In [10]: X[0]
Out[10]: array([ 0.95780971, 0.23739785, 0.06678655])
In [11]: XX[0]
Out[11]: array([ 0.87931369, 0.34958407, -0.01145125])
Obviously, the inverse transform did not reconstruct the original data. The reason is that by dropping the lowest PC, we lost information. Next, let's see what happens if we retain all PCs (i.e., we do not apply any dimensionality reduction):
In [12]: pca2 = PCA(n_components=3)
In [13]: pca2.fit(X)
Out[13]:
PCA(copy=True, iterated_power='auto', n_components=3, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)
In [14]: Y = pca2.transform(X)
In [15]: XX = pca2.inverse_transform(Y)
In [16]: X[0]
Out[16]: array([ 0.95780971, 0.23739785, 0.06678655])
In [17]: XX[0]
Out[17]: array([ 0.95780971, 0.23739785, 0.06678655])
In this case, we were able to reconstruct the original data because we didn't throw away any information (since we retained all the PCs).
The situation with LDA is even worse because the maximum number of components that can be retained is not 200 (the number of features for your input data); rather, the maximum number of components you can retain is n_classes - 1. So if, for example, you were doing a binary classification problem (2 classes), the LDA transform would be going from 200 input dimensions down to just a single dimension.
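The n_classes - 1 limit is easy to see on a small dataset. As a quick sketch on the iris data (4 features, 3 classes), used here only as a stand-in for the 200-feature data in the question:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)    # 150 samples, 4 features, 3 classes
lda = LinearDiscriminantAnalysis()   # default solver is 'svd'
Z = lda.fit(X, y).transform(X)
print(X.shape, Z.shape)   # (150, 4) (150, 2)
```

Four input features are reduced to two components because there are three classes, regardless of how many features go in.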
What I want to do is rather simple, but I haven't found a straightforward approach thus far:
I have a 3D rectilinear grid with float values (that is, 3 coordinate axes, as 1D numpy arrays, for the centers of the grid cells, and a 3D numpy array of the corresponding shape with a value for each cell center) and I want to interpolate (or you may call it subsample) this entire array to a subsampled array (e.g. by a size factor of 5) with linear interpolation.
All the approaches I've seen so far involve 2D and then 1D interpolation, or VTK tricks which I'd rather not use (portability).
Could someone suggest an approach that would be the equivalent of taking 5x5x5 cells at a time in the 3D array, averaging them, and returning an array 5 times smaller in each direction?
Thank you in advance for any suggestions
EDIT:
Here's what the data looks like: 'd' is a 3D array representing a 3D grid of cells. Each cell has a scalar float value (pressure in my case), and 'x', 'y' and 'z' are three 1D arrays containing the spatial coordinates of the cell centers (see the shapes and what the 'x' array looks like):
In [42]: x.shape
Out[42]: (181L,)
In [43]: y.shape
Out[43]: (181L,)
In [44]: z.shape
Out[44]: (421L,)
In [45]: d.shape
Out[45]: (181L, 181L, 421L)
In [46]: x
Out[46]:
array([-0.410607 , -0.3927568 , -0.37780656, -0.36527296, -0.35475321,
-0.34591168, -0.33846866, -0.33219107, -0.32688467, -0.3223876 ,
...
0.34591168, 0.35475321, 0.36527296, 0.37780656, 0.3927568 ,
0.410607 ])
What I want to do is create a 3D array with, let's say, a shape of 90x90x210 (roughly downsizing by a factor of 2) by first subsampling the coordinates from the axes of the arrays with the above dimensions and then 'interpolating' the 3D data onto that array. I'm not sure whether 'interpolating' is the right term though. Downsampling? Averaging?
Here's a 2D slice of the data:
Here is an example of 3D interpolation on an irregular grid using scipy.interpolate.griddata.
import numpy as np
import scipy.interpolate as interpolate
import matplotlib.pyplot as plt
def func(x, y, z):
    return x ** 2 + y ** 2 + z ** 2
# Nx, Ny, Nz = 181, 181, 421
Nx, Ny, Nz = 18, 18, 42
subsample = 2
Mx, My, Mz = Nx // subsample, Ny // subsample, Nz // subsample
# Define irregularly spaced arrays
x = np.random.random(Nx)
y = np.random.random(Ny)
z = np.random.random(Nz)
# Compute the matrix D of shape (Nx, Ny, Nz).
# D could be experimental data, but here I'll define it using func
# D[i,j,k] is associated with location (x[i], y[j], z[k])
X_irregular, Y_irregular, Z_irregular = (
    x[:, None, None], y[None, :, None], z[None, None, :])
D = func(X_irregular, Y_irregular, Z_irregular)
# Create a uniformly spaced grid
xi = np.linspace(x.min(), x.max(), Mx)
yi = np.linspace(y.min(), y.max(), My)
zi = np.linspace(z.min(), z.max(), Mz)  # use the z range, not y
X_uniform, Y_uniform, Z_uniform = (
    xi[:, None, None], yi[None, :, None], zi[None, None, :])
# To use griddata, I need 1D-arrays for x, y, z of length
# len(D.ravel()) = Nx*Ny*Nz.
# To do this, I broadcast up my *_irregular arrays to each be
# of shape (Nx, Ny, Nz)
# and then use ravel() to make them 1D-arrays
X_irregular, Y_irregular, Z_irregular = np.broadcast_arrays(
    X_irregular, Y_irregular, Z_irregular)
D_interpolated = interpolate.griddata(
    (X_irregular.ravel(), Y_irregular.ravel(), Z_irregular.ravel()),
    D.ravel(),
    (X_uniform, Y_uniform, Z_uniform),
    method='linear')
print(D_interpolated.shape)
# (9, 9, 21) with the reduced sizes above; (90, 90, 210) for the full-size grid
# Make plots
fig, ax = plt.subplots(2)
# Choose a z value in the uniform z-grid
# Let's take the middle value
zindex = Mz // 2
z_crosssection = zi[zindex]
# Plot a cross-section of the raw irregularly spaced data
X_irr, Y_irr = np.meshgrid(sorted(x), sorted(y))
# find the value in the irregular z-grid closest to z_crosssection
z_near_cross = z[(np.abs(z - z_crosssection)).argmin()]
ax[0].contourf(X_irr, Y_irr, func(X_irr, Y_irr, z_near_cross))
ax[0].scatter(X_irr, Y_irr, c='white', s=20)
ax[0].set_title('Cross-section of irregular data')
ax[0].set_xlim(x.min(), x.max())
ax[0].set_ylim(y.min(), y.max())
# Plot a cross-section of the Interpolated uniformly spaced data
X_unif, Y_unif = np.meshgrid(xi, yi)
ax[1].contourf(X_unif, Y_unif, D_interpolated[:, :, zindex])
ax[1].scatter(X_unif, Y_unif, c='white', s=20)
ax[1].set_title('Cross-section of downsampled and interpolated data')
ax[1].set_xlim(x.min(), x.max())
ax[1].set_ylim(y.min(), y.max())
plt.show()
In short: doing interpolation in each dimension separately is the right way to go.
You can simply average every 5x5x5 cube and return the results. However, if your data is supposed to be continuous, you should understand that is not good subsampling practice, as it will likely induce aliasing. (Also, you can't reasonably call it "interpolation"!)
Good resampling filters need to be wider than the resampling factor in order to avoid aliasing. Since you are downsampling, you should also realize that your resampling filter needs to be scaled according to the destination resolution, not the original resolution -- in order to interpolate properly, it will likely need to be 4 or 5 times as wide as your 5x5x5 cube. This is a lot of samples -- 20*20*20 is way more than 5*5*5...
So, the reason why practical implementations of resampling typically filter each dimension separately is that it is more efficient. By taking 3 passes, you can evaluate your filter using far fewer multiply/accumulate operations per output sample.
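For reference, the plain 5x5x5 block averaging mentioned above can be sketched with a reshape, and the same result can be obtained one axis at a time, which illustrates the separability argument. The function names and the array size are placeholders, and this is the simple box average, not a properly widened anti-aliasing filter.

```python
import numpy as np

def block_mean(a, factor):
    """Average non-overlapping factor**3 cubes of a 3-D array."""
    # Trim each axis to a multiple of `factor` so the reshape is exact.
    nx, ny, nz = (s // factor for s in a.shape)
    a = a[:nx * factor, :ny * factor, :nz * factor]
    blocks = a.reshape(nx, factor, ny, factor, nz, factor)
    return blocks.mean(axis=(1, 3, 5))

def block_mean_separable(a, factor):
    """Same result, computed one axis at a time (three 1-D passes)."""
    for axis in range(3):
        n = a.shape[axis] // factor
        a = np.take(a, np.arange(n * factor), axis=axis)
        shape = a.shape[:axis] + (n, factor) + a.shape[axis + 1:]
        a = a.reshape(shape).mean(axis=axis + 1)
    return a

d = np.random.rand(20, 20, 45)   # stand-in for the pressure array
small = block_mean(d, 5)
print(small.shape)   # (4, 4, 9)
print(np.allclose(small, block_mean_separable(d, 5)))   # True
```

The coordinate arrays x, y and z would need to be reduced the same way (for example by averaging each group of 5 cell centers) so that they still line up with the downsampled data.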