How do I use my dataset in Sklearn clustering? - python

I am trying to adapt the sklearn clustering example here to use my own dataset, which is a 1000-row, 4-column matrix of integers. I cannot see how to replace one of the sklearn toy datasets with mine, i.e. what do I replace
noisy_circles = datasets.make_circles(n_samples=n_samples, factor=.5, noise=.05)
with?

The datasets.make_circles function creates a toy dataset with a very clear pattern. The data it returns is a tuple containing an X array of features (n x 2 dimensions) and a y array of labels (length n).
To pass your data into the clustering script, you just need to put it into a similar format and use that in place of the value returned by make_circles.

Load your data as a 2-dimensional numpy array. Read the numpy and scipy documentation to learn how to do so, depending on the file format you have at hand.
Before running the clustering algorithm you might want to preprocess the data with a one-hot encoder if the integers represent category assignments rather than quantities.
If they represent quantities, you might want to preprocess with StandardScaler, as sketched below.
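For example, here is a minimal sketch of what the replacement could look like, assuming the 1000 x 4 integer matrix lives in a CSV file (the file name and the quantity-vs-category choice are hypothetical):
import numpy as np
from sklearn.preprocessing import StandardScaler

# load the 1000 x 4 integer matrix; adjust the path/delimiter to your file
X = np.loadtxt("my_data.csv", delimiter=",", dtype=int)

# if the integers are quantities, put the columns on a comparable scale
X = StandardScaler().fit_transform(X)

# use this tuple wherever the example expects the value returned by make_circles;
# in that example the y labels are not used by the clustering itself
noisy_circles = (X, None)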

Related

Does PCA transform preserve the sorting of the data?

I am working with a huge number of n-dimensional arrays in python. All the arrays are stored in a dictionary, so each array is uniquely identified by a key.
I would like to visualize all the arrays in 2D, so I have performed a PCA:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# standardize data before applying PCA
dict_data_std = StandardScaler().fit_transform(np.array(list(dict_data.values())))
pca = PCA(n_components=2)
data_post_pca = pca.fit_transform(dict_data_std)
My problem is: does PCA transform preserve the order of the data? So, does the first array of dict_data get mapped to the first (2D) array of data_post_pca?
I need a 100% certain answer.
Not necessarily. You can think of PCA with one eigenvector as a weighted linear combination of the inputs. Depending on the decomposition, it could do anything from reordering the weights to reversing them. However, if your intuition holds, it should produce a weighting similar to what you expect.
Once you take more eigenvectors it becomes a set of linear combinations instead of a single one. - https://www.reddit.com/r/MachineLearning/comments/24ywyc/does_pca_preserve_the_order_in_data/
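Separately, if the practical concern is keeping track of which dictionary key corresponds to which projected row, a minimal sketch (assuming dict_data maps keys to equal-length 1-D arrays; the names keys and key_to_point are just illustrative) is to freeze the key order before stacking:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

keys = list(dict_data)                                  # freeze the key order
data = np.array([dict_data[k] for k in keys])           # row i belongs to keys[i]

data_std = StandardScaler().fit_transform(data)
data_2d = PCA(n_components=2).fit_transform(data_std)   # rows keep the same order

# row i of data_2d is the 2-D projection of dict_data[keys[i]]
key_to_point = dict(zip(keys, data_2d))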

How to convert 2-dimensional data into 3-D (Non-linear data)?

I have created synthetic data like this:
from sklearn.datasets import make_classification

X, Y = make_classification(n_features=2, n_samples=100, n_redundant=0, n_informative=1,
                           n_clusters_per_class=1, class_sep=0.001, weights=[0.8, 0.2],
                           n_classes=2, random_state=42)
df = CreateDataFrame(X, Y, ['X1', 'X2'])
It has two classes and the data is non-linear. Now I want to map this 2-D data into a 3-D space so I can draw the decision boundary between the classes. Can anyone help me?
To convert this 2-D data into 3-D you have to add another feature. This can be done in either of the following ways:
1) Add an extra feature while creating the random data by setting n_features=3. This adds the extra dimension at creation time, but it is a genuinely new feature rather than one derived from the existing two.
2) Create the third feature by applying an arithmetic operation to the existing features (e.g. the average of the two, or some non-linear combination), as sketched below.
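A minimal sketch of option 2, assuming X comes from the make_classification call above; the particular derived feature used here (the squared distance from the origin, a common trick for non-linear boundaries) is just one illustrative choice:
import numpy as np
from sklearn.datasets import make_classification

X, Y = make_classification(n_features=2, n_samples=100, n_redundant=0, n_informative=1,
                           n_clusters_per_class=1, class_sep=0.001, weights=[0.8, 0.2],
                           n_classes=2, random_state=42)

# derive a third feature from the first two, e.g. x1^2 + x2^2
third = (X[:, 0] ** 2 + X[:, 1] ** 2).reshape(-1, 1)
X_3d = np.hstack([X, third])    # shape (100, 3), ready for a 3-D scatter plot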

Feature agglomeration: How to retrieve the features that make up the clusters?

I am using scikit-learn's FeatureAgglomeration to run a hierarchical clustering procedure on features rather than on the observations.
This is my code:
from sklearn import cluster
import pandas as pd
#load the data
df = pd.read_csv('C:/Documents/data.csv')
agglo = cluster.FeatureAgglomeration(n_clusters=5)
agglo.fit(df)
df_reduced = agglo.transform(df)
My original df had the shape (990, 15), after using feature agglomeration, df_reduced now has (990, 5).
How do now find out how the original 15 features have been clustered together? In other words, what original features from df make up each of the 5 new features in df_reduced?
How the features within each cluster are combined during transform is determined by how you set up the hierarchical clustering. The reduced feature set simply consists of the n_clusters cluster centers (which are n_samples-dimensional vectors). For certain applications you might want to compute the centers manually using a different definition of cluster center (e.g. the median instead of the mean, to reduce the influence of outliers).
import numpy as np

n_features = 15
n_clusters = 5
feature_identifier = np.arange(n_features)
feature_groups = [feature_identifier[agglo.labels_ == i] for i in range(n_clusters)]
new_features = [df.loc[:, df.columns[group]].mean(axis=1) for group in feature_groups]
Don't forget to standardize the features beforehand (for example using sklearn's scaler). Otherwise you are rather grouping the scales of the quantities than clustering similar behavior.
Hope that helps!
Haven't tested the code. Let me know if there are problems.
After fitting, agglo.labels_ is an array that tells, for each feature in the original dataset, which cluster in the reduced dataset it belongs to.
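As a quick usage example (a sketch assuming the df and agglo from the question), you can print which original column names ended up in each of the 5 clusters:
for cluster_id in range(5):
    members = df.columns[agglo.labels_ == cluster_id]
    print(cluster_id, list(members))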

How to get the top N frequent words in each cluster? Sklearn

I have a text corpus that contains 1000+ articles, each on a separate line. I used hierarchical clustering with sklearn in Python to produce clusters of related articles. This is the code I used to do the clustering.
Note: X, is a sparse NumPy 2D array with rows corresponding to documents and columns corresponding to terms
# Agglomerative Clustering
from sklearn.cluster import AgglomerativeClustering
model = AgglomerativeClustering(affinity="euclidean", linkage="complete", n_clusters=3)
model.fit(X.toarray())
clustering = model.labels_
print (clustering)
I specify the number of clusters = 3 as the level at which to cut off the tree, to get a flat clustering like k-means.
My question is : How to get the top N frequent words in each cluster? so that I can suggest a topic for each cluster.
Thanks
One option is to convert X from the sparse numpy array to a pandas dataframe. The rows will still correspond to documents, and the columns to words. If you have a list of your vocabulary in order of your array columns (used as your_word_list below) you could try something like this:
import pandas as pd

X = pd.DataFrame(X.toarray(), columns=your_word_list)  # columns argument is optional
X['Cluster'] = clustering  # add a column with each document's cluster number
word_frequencies_by_cluster = X.groupby('Cluster').sum()

# to get a sorted word list for one cluster, in this case cluster 1
print(word_frequencies_by_cluster.loc[1, :].sort_values(ascending=False))
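To get the top N words for every cluster at once, here is a sketch building on the dataframe above (N=10 is an arbitrary choice):
N = 10
top_words = {cluster_id: row.nlargest(N).index.tolist()
             for cluster_id, row in word_frequencies_by_cluster.iterrows()}
print(top_words)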
As a side note, you may want to look into algorithms (e.g. LDA) and distance metrics (cosine) that are more commonly used for natural language processing. If you are looking to extract topics, there is a nice sklearn tutorial on topic modeling.

Understanding scikitlearn PCA.transform function in Python

So I'm currently working on a project that involves the use of Principal Component Analysis, or PCA, and I'm attempting to learn it on the fly. Luckily, Python has a very convenient module in sklearn.decomposition that seems to do most of the work for you. Before I really start to use it, though, I'm trying to figure out exactly what it's doing.
The dataframe I've been testing on looks like this:
0 1
0 1 2
1 3 1
2 4 6
3 5 3
And when I call PCA.fit() and then view the components I get:
array([[ 0.5172843 , 0.85581362],
[ 0.85581362, -0.5172843 ]])
From my rather limited knowledge of PCA, I kind of grasp how this was calculated, but where I get lost is when I then call PCA.transform. This is the output it gives me:
array([[-2.0197033 , -1.40829634],
[-1.84094831, 0.8206152 ],
[ 2.95540408, -0.9099927 ],
[ 0.90524753, 1.49767383]])
Could someone potentially walk me through how it takes the original dataframe and components and transforms it into this new array? I'd like to be able to understand the exact calculations it's doing so that when I scale up I'll have a better sense of what's going on. Thanks!
When you call fit, PCA computes some vectors that you can project your data onto in order to reduce its dimension. Since each row of your data is 2-dimensional, there will be a maximum of 2 vectors onto which data can be projected, and each of those vectors will be 2-dimensional. Each row of PCA.components_ is a single vector onto which things get projected, and it will have the same size as the number of columns in your training data. Since you did a full PCA you get 2 such vectors, so you get a 2x2 matrix. The first of those vectors will maximize the variance of the projected data. The second will maximize the variance of what's left after the first projection. Typically one passes a value of n_components that's less than the dimension of the input data, so that you get back fewer rows and you have a wide but not tall components_ array.
When you call transform you're asking sklearn to actually do the projection. That is, you are asking it to project each row of your data into the vector space that was learned when fit was called. For each row of the data you pass to transform you'll have 1 row in the output and the number of columns in that row will be the number of vectors that were learned in the fit phase. In other words, the number of columns will be equal to the value of n_components you passed to the constructor.
Typically one uses PCA when the source data has lots of columns and you want to reduce the number of columns while preserving as much information as possible. Suppose you had a data set with 100 rows and each row had 500 columns. If you constructed a PCA like PCA(n_components = 10) and then called fit you'd find that components_ has 10 rows, one for each of the components you requested, and 500 columns as that's the input dimension. If you then called transform all 100 rows of your data would be projected into this 10-dimensional space so the output would have 100 rows (1 for each in the input) but only 10 columns thus reducing the dimension of your data.
The short answer to how this is done is that PCA computes a Singular Value Decomposition and then keeps only some of the columns of one of those matrices. Wikipedia has much more information on the actual linear algebra behind this - it's a bit long for a StackOverflow answer.
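For the concrete numbers in the question, here is a minimal sketch of what transform does: subtract the per-column mean, then project onto the rows of components_. (The signs of the components can differ between sklearn versions, but the manual projection always matches pca.transform for the fitted model.)
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1., 2.],
              [3., 1.],
              [4., 6.],
              [5., 3.]])

pca = PCA(n_components=2)
projected = pca.fit_transform(X)          # same result as pca.fit(X) then pca.transform(X)

# transform = center the data, then project it onto the principal components
manual = (X - pca.mean_) @ pca.components_.T

print(np.allclose(projected, manual))     # True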
