How to convert 2-dimensional data into 3-D (Non-linear data)? - python

I have created synthetic data like this:
from sklearn.datasets import make_classification

X, Y = make_classification(n_features=2, n_samples=100, n_redundant=0, n_informative=1,
                           n_clusters_per_class=1, class_sep=0.001, weights=[0.8, 0.2],
                           n_classes=2, random_state=42)
df = CreateDataFrame(X, Y, ['X1', 'X2'])  # CreateDataFrame is a helper that builds a labelled DataFrame, not a library function
The data has two classes and is non-linearly separable. I now want to map this 2-D data into a 3-D space so that I can draw the decision boundary between the classes. Can anyone help me?

To convert this 2-D data into 3-D you have to add another feature. The extra feature can be created in either of the following ways (see the sketch after this list):
1) Add an extra feature while creating the random data by setting n_features=3. This adds the extra dimension at creation time, but it is a genuinely new feature rather than one derived from the existing two.
2) Create the third feature by applying an arithmetic operation to the existing features (e.g. the average of the two features, or a quadratic combination).
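As a minimal sketch of option 2 (the particular quadratic combination is just an illustration, not the only choice), you could derive the third feature arithmetically from the two existing ones; a term like X1**2 + X2**2 often makes radially separated classes linearly separable in 3-D:
import numpy as np

X3 = X[:, 0] ** 2 + X[:, 1] ** 2      # new feature derived from the existing two
X_3d = np.column_stack([X, X3])       # shape (100, 3); plot or fit the classifier in this 3-D space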

Related

How can I initialize K means clustering on a data matrix with 569 rows (samples), and 30 columns (features)?

I'm having trouble understanding how to begin my solution. I have a matrix with 569 rows, each representing a single sample of my data, and 30 columns representing the features of each sample. My intuition is to plot each individual row, and see what the clusters (if any) look like, but I can't figure out how to do more than 2 rows on a single scatter plot.
I've spent several hours looking through tutorials, but have not been able to understand how to apply it to my data. I know a scatter plot takes 2 vectors as a parameter, so how could I possibly plot all 569 samples to cluster them? Am I missing something fundamental here?
import matplotlib.pyplot as plt

# our_data is a 2-dimensional matrix of size 569 x 30
plt.scatter(our_data[0, :], our_data[1, :], s=40)
My goal is to start k means clustering on the 569 samples.
Since you have a 30-dimensional feature space, it is difficult to plot such data directly in 2-D (i.e. on a canvas). In such cases one usually applies a dimensionality reduction technique first; this can also help you understand the data's structure. You can try, e.g., PCA (principal component analysis) first:
# your_matrix.shape = (569, 30)
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
projected_data = pca.fit_transform(your_matrix)
plt.scatter(projected_data[:, 0], projected_data[:, 1])  # often very helpful for understanding the data's structure
plt.show()
You can also look at other (including non-linear) dimensionality reduction techniques, such as t-SNE.
After that you can apply k-means (or another clustering algorithm) to the original data, or apply k-means to the projected data.
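For instance (a sketch only; n_clusters=2 is an assumption here, pick whatever suits your problem), you could cluster the original matrix and visualise the resulting labels on the PCA projection:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, random_state=0)
labels = kmeans.fit_predict(your_matrix)        # or fit on projected_data instead
plt.scatter(projected_data[:, 0], projected_data[:, 1], c=labels, s=40)
plt.show()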
If by initialize you mean picking the k initial cluster centers, one common way of doing so is k-means++ (described here), which was developed to avoid poor clusterings.
It essentially entails semi-randomly choosing centers based on a probability distribution over distances from a first center that is chosen completely at random.
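A rough sketch of that seeding idea (an illustration only, not the library's implementation; sklearn's KMeans already uses init='k-means++' by default):
import numpy as np

def kmeans_pp_init(X, k, rng=np.random.default_rng(0)):
    centers = [X[rng.integers(len(X))]]                                    # first center: uniformly at random
    for _ in range(k - 1):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()                                              # farther points are more likely to be picked
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)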

Hidden Markov Model python

I have a time series of the position of a particle over time, and I want to estimate the model parameters of two HMMs from this data (one for the x axis, the other for the y axis). I'm using the hmmlearn library; however, it is not clear to me how I should proceed. The tutorial states that this is the third way to use the library, but when I use the code below:
from hmmlearn import hmm

remodel = hmm.GaussianHMM(n_components=3, covariance_type="full", n_iter=100)
remodel.fit(X)
Z2 = remodel.predict(X)
where X is the list of x-axis values, it returns
ValueError: Expected 2D array, got 1D array instead
What should I add to my data in order to turn it 2D?
Caveat emptor: My understanding of HMM and this lib are based on a few minutes of googling and Wikipedia. That said:
To train an HMM model, you need a number of observes samples, each of which is a vector of features. For example, in the Wikipedia example of Alice predicting the weather at Bob's house based on what he did each day, Alice gets a number of samples (what Bob tells her each day), each of which has one feature (Bob's reported activity that day). It would be entirely possible for Bob to give Alice multiple features for a given day (what he did, and what his outfit was, for instance).
To learn/fit an HMM model, then, you need a series of samples, each of which is a vector of features. This is why the fit function expects a two-dimensional input. From the docs, X is expected to be "array-like, shape (n_samples, n_features)". In your case, the position of the particle is the only feature, and each observation is a sample, so your input should be array-like of shape (n_samples, 1) (a single column). Right now it is presumably a 1-D array of shape (n_samples,) (what you get from something like np.array([1, 2, 3])). So just reshape:
remodel.fit(np.asarray(X).reshape(-1, 1))  # np.asarray handles the case where X is a plain Python list
For me, the reshape method didn't work; I used numpy's np.column_stack instead. I suggest you insert X = np.column_stack([X[:]]) before fitting the model; it should work out.
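Putting the pieces together, a minimal end-to-end sketch might look like this (x_positions is a placeholder for your actual series; hmmlearn expects an array of shape (n_samples, n_features)):
import numpy as np
from hmmlearn import hmm

x_positions = np.random.randn(500).cumsum()     # placeholder data, not the asker's series
X = x_positions.reshape(-1, 1)                  # one sample per time step, one feature

remodel = hmm.GaussianHMM(n_components=3, covariance_type="full", n_iter=100)
remodel.fit(X)
hidden_states = remodel.predict(X)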

Feature agglomeration: How to retrieve the features that make up the clusters?

I am using the scikit-learn's feature agglomeration to use a hierarchical clustering procedure on features rather than on the observations.
This is my code:
from sklearn import cluster
import pandas as pd
#load the data
df = pd.read_csv('C:/Documents/data.csv')
agglo = cluster.FeatureAgglomeration(n_clusters=5)
agglo.fit(df)
df_reduced = agglo.transform(df)
My original df had the shape (990, 15), after using feature agglomeration, df_reduced now has (990, 5).
How do I now find out how the original 15 features have been clustered together? In other words, which original features from df make up each of the 5 new features in df_reduced?
How the features within each cluster are combined during transform is determined by the way you set up the agglomeration (the default pooling is the mean). The reduced feature set simply consists of the n_clusters cluster centers, which are n_samples-dimensional vectors. For certain applications you might consider computing the centers manually with a different definition of the cluster center (e.g. the median instead of the mean, to reduce the influence of outliers):
import numpy as np

n_features = 15
n_clusters = agglo.n_clusters
feature_identifiers = np.arange(n_features)
feature_groups = [feature_identifiers[agglo.labels_ == i] for i in range(n_clusters)]
new_features = [df.loc[:, df.keys()[group]].mean(axis=1) for group in feature_groups]
Don't forget to standardize the features beforehand (for example with sklearn's StandardScaler). Otherwise you may end up grouping features by their scale rather than by similar behaviour.
Hope that helps!
Haven't tested the code. Let me know if there are problems.
After fitting the clusterer, agglo.labels_ contains an array telling, for each feature in the original dataset, which cluster in the reduced dataset it belongs to.
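For example (a small sketch using the fitted agglo and the df from the question), you can list which original columns ended up in each cluster:
for cluster_id in range(agglo.n_clusters):
    members = df.columns[agglo.labels_ == cluster_id]
    print(cluster_id, list(members))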

How do I use my dataset in Sklearn clustering?

I am trying to adapt the Sklearn example here to use my own dataset, which is a 1000 row, 4 column matrix of integers. I cannot see how to replace one of the SKlearn datasets with mine. I.e. what do I replace
noisy_circles = datasets.make_circles(n_samples=n_samples, factor=.5, noise=.05)
with?
The datasets.make_circles function creates a toy dataset with a very clear pattern. The data it returns is a tuple containing an X array of features (n x 2 dimensions) and a y array of labels (length n).
To pass your data into the clustering script, you just need to put it into a similar format and use that in place of the value returned by make_circles.
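For instance (a sketch; the file name is hypothetical, and y can simply be None since you have no ground-truth labels):
import numpy as np

X = np.loadtxt('my_data.csv', delimiter=',')    # your 1000 x 4 integer matrix
my_dataset = (X, None)
# ...then use my_dataset wherever the example script uses noisy_circles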
Load your data as a 2 dimensional numpy array. Read the documentation of numpy and scipy to learn how to do so depending on the file format you have at hand.
Before running the clustering algorithm you might want to preprocess the data with a one-hot encoder if the integers represent category assignments rather than quantities.
If they represent quantities, you might want to preprocess with StandardScaler.
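Both options in one place, as a sketch (which one applies depends on what your integers mean):
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X_encoded = OneHotEncoder().fit_transform(X)    # if the integers are category IDs
X_scaled = StandardScaler().fit_transform(X)    # if they are quantities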

Extracting PCA components with sklearn

I am using sklearn's PCA for dimensionality reduction on a large set of images. Once the PCA is fitted, I would like to see what the components look like.
One can do so by looking at the components_ attribute. Not realizing that was available, I did something else instead:
each_component = np.eye(total_components)
component_im_array = pca.inverse_transform(each_component)
for i in range(num_components):
    component_im = component_im_array[i, :].reshape(height, width)
    # do something with component_im
In other words, I create an image in the PCA space that has all features but 1 set to 0. By inversely transforming them, I should then get the image in the original space which, once transformed, can be expressed solely with that PCA component.
The following image shows the results. On the left is the component calculated using my method; on the right is pca.components_[i] directly. Additionally, with my method most of the images are very similar (though not identical), while the images obtained from components_ are very different, as I would have expected.
Is there a conceptual problem in my method? Clearly the components from pca.components_[i] are correct (or at least more correct) than the ones I'm getting. Thanks!
Components and the inverse transform are two different things. The inverse transform maps data from the component (score) space back to the original image space:
from sklearn.decomposition import PCA

# Create a PCA model with two principal components
pca = PCA(2)
pca.fit(data)

# Get the scores by transforming the original data
scores = pca.transform(data)

# Reconstruct from the 2-dimensional scores
reconstruct = pca.inverse_transform(scores)

# The residual is the amount not explained by the first two components
residual = data - reconstruct
Thus you are inverse-transforming the original data, not the components, which is why the results are completely different. You almost never inverse_transform the original data itself. pca.components_ contains the actual vectors representing the underlying axes used to project the data into the PCA space.
The difference between grabbing the components_ and doing an inverse_transform on the identity matrix is that the latter adds in the empirical mean of each feature. I.e.:
def inverse_transform(self, X):
    return np.dot(X, self.components_) + self.mean_
where self.mean_ was estimated from the training set.
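A quick sanity check of that relationship (a sketch; it assumes pca was fitted with the default whiten=False):
import numpy as np

identity = np.eye(pca.n_components_)
recovered = pca.inverse_transform(identity) - pca.mean_
print(np.allclose(recovered, pca.components_))   # expected: True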
