My dataset shape is (248857, 11)
This is how it looks before StandardScaler. I scaled the features first because clustering algorithms such as K-means need feature scaling before the data is fed to the algorithm.
After
I ran K-Means with three clusters and I am trying to find a way to visualize them.
I came across t-SNE as a solution, but I am stuck.
This is how I implemented it:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

# save the cluster labels into a variable l and drop them from the features
l = df_scale['clusters']
d = df_scale.drop("clusters", axis=1)
standardized_data = StandardScaler().fit_transform(d)

# pick the first 100,000 points for t-SNE
data_points = standardized_data[0:100000, :]
labels_80 = l[0:100000]

model = TSNE(n_components=2, random_state=0)
tsne_data = model.fit_transform(data_points)

# create a new data frame to help plot the result
tsne_data = np.vstack((tsne_data.T, labels_80)).T
tsne_df = pd.DataFrame(data=tsne_data,
                       columns=("Dimension1", "Dimension2", "Clusters"))

# plot the t-SNE result
sns.FacetGrid(tsne_df, hue="Clusters", size=6).map(
    plt.scatter, 'Dimension1', 'Dimension2').add_legend()
plt.show()
As you can see, it is not that good. How can I visualize this better?
It seems you need to tune the perplexity hyperparameter, which is:
a tunable parameter that says (loosely) how to balance attention between local and global aspects of your data. The parameter is, in a sense, a guess about the number of close neighbors each point has. The perplexity value has a complex effect on the resulting pictures.
Read more about it in this post and more specifically, here.
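For illustration, a minimal sketch of such a sweep, reusing data_points and labels_80 from the question; it runs on a small subsample because a full 100,000-point t-SNE per perplexity value would be very slow:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

sample = data_points[:5000]
sample_labels = np.asarray(labels_80)[:5000]

for perplexity in (5, 30, 50, 100):
    # re-embed with a different perplexity and inspect how the clusters separate
    embedding = TSNE(n_components=2, perplexity=perplexity,
                     random_state=0).fit_transform(sample)
    plt.figure(figsize=(5, 5))
    plt.scatter(embedding[:, 0], embedding[:, 1], c=sample_labels, s=2)
    plt.title('perplexity = {}'.format(perplexity))
plt.show()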
I read a paper whose retrieval system is based on SIFT descriptors and fast approximate k-means clustering. I installed pyflann. If I am not mistaken, the following commands only find the indices of the data points closest to a given sample (for example, here, the indices of the 5 nearest points in dataset for each point in testset):
from pyflann import *
from numpy import *
from numpy.random import *
dataset = rand(10000, 128)
testset = rand(1000, 128)
flann = FLANN()
result, dists = flann.nn(dataset, testset, 5, algorithm="kmeans",
                         branching=32, iterations=7, checks=16)
I went through the user manual; however, I could not find how to do k-means clustering with FLANN, nor how to fit the test set based on the cluster centers. With scikit-learn we can use k-means++ clustering and then fit the dataset with the model:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=100, init='k-means++', random_state=0, verbose=0)
kmeans.fit(dataset)
and later we can assign labels to the test set by using a KDTree, for example:
from sklearn.neighbors import KDTree  # scipy.spatial.KDTree works the same way here

kdt = KDTree(kmeans.cluster_centers_)
Q = testset  # query
kdt_dist, kdt_idx = kdt.query(Q, k=1)  # knn
test_labels = kdt_idx  # knn=1 labels
Could someone please help me use the same procedure with FLANN? (I mean clustering the dataset (finding the cluster centers and quantizing the features) and then quantizing the test set based on the cluster centers found in the previous step.)
You won't be able to do the best variations with FLANN, because those use two indexes at the same time and are ugly to implement.
You can, however, build a new index on the centers for every iteration. But unless you have k > 1000, it probably will not help much.
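If the goal is only the assignment (quantization) step, the nn() call from the question can be pointed at the scikit-learn cluster centers instead of building a KDTree; a minimal, untested sketch reusing kmeans and testset from above:

import numpy as np
from pyflann import FLANN

flann = FLANN()
centers = np.ascontiguousarray(kmeans.cluster_centers_)
# knn=1 against the centers: the index of the nearest center is the quantized label
test_labels, dists = flann.nn(centers, testset, 1)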
I am using scikit-learn's feature agglomeration to apply a hierarchical clustering procedure to features rather than to observations.
This is my code:
from sklearn import cluster
import pandas as pd
#load the data
df = pd.read_csv('C:/Documents/data.csv')
agglo = cluster.FeatureAgglomeration(n_clusters=5)
agglo.fit(df)
df_reduced = agglo.transform(df)
My original df had the shape (990, 15); after feature agglomeration, df_reduced has shape (990, 5).
How do I now find out how the original 15 features were clustered together? In other words, which original features from df make up each of the 5 new features in df_reduced?
Which features end up in each cluster is determined by the hierarchical clustering; how the features within each cluster are combined during transform is set by the pooling function (np.mean by default). The reduced feature set simply consists of the n_clusters cluster centers (each an n_samples-dimensional vector). For certain applications you might compute the centers manually with a different definition of the cluster center (e.g. the median instead of the mean, to reduce the influence of outliers).
import numpy as np

n_features = 15
n_clusters = agglo.n_clusters
feature_identifier = np.arange(n_features)
feature_groups = [feature_identifier[agglo.labels_ == i] for i in range(n_clusters)]
# the mean over the columns in each group reproduces the pooled (reduced) features
new_features = [df.loc[:, df.keys()[group]].mean(axis=1) for group in feature_groups]
Don't forget to standardize the features beforehand (for example using sklearn's StandardScaler). Otherwise you are grouping the scales of the quantities rather than clustering similar behavior.
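For example, an untested sketch of that standardization step, reusing df from the question and keeping the column names:

from sklearn import cluster
from sklearn.preprocessing import StandardScaler
import pandas as pd

# scale each feature to zero mean and unit variance before agglomerating
df_std = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
agglo = cluster.FeatureAgglomeration(n_clusters=5)
agglo.fit(df_std)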
Hope that helps!
Haven't tested the code. Let me know if there are problems.
After fitting, agglo.labels_ contains an array that tells which cluster in the reduced dataset each feature in the original dataset was assigned to.
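A small, untested sketch of reading that grouping off labels_, reusing df and agglo from above:

# list which original columns ended up in each reduced feature
for cluster_id in range(agglo.n_clusters):
    members = df.columns[agglo.labels_ == cluster_id]
    print(cluster_id, list(members))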
I would like to run k-means clustering with more than 3 features. I've tried it with two features and I am wondering how to provide more than 3 features to sklearn.cluster.KMeans.
Here's my code and the dataframe from which I'd like to select features. I have multiple dataframes as input and I have to provide them as features.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# currently two features are selected
# I'd like to combine more than 3 features and provide them to the dataset
df_features = pd.merge(df_max[['id', 'max']],
                       df_var[['id', 'variance']], on='id', how='left')
cols = list(df_features.loc[:, 'max':'variance'])
X = df_features.as_matrix(columns=cols)

kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
centroid = kmeans.cluster_centers_
labels = kmeans.labels_

colors = ["g.", "r.", "c."]
for i in range(len(X)):
    print("coordinate:", X[i], "label:", labels[i])
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize=10)
plt.scatter(centroid[:, 0], centroid[:, 1], marker="x", s=150, linewidths=5, zorder=10)
plt.show()
Generally you wouldn't want id to be a feature because, unless you have good reason to believe otherwise, it does not correlate with anything.
As long as you feed a valid matrix X to kmeans.fit(X), it will run the K-means algorithm for you regardless of the number of features in X. Though if you have a huge number of features, it may take longer to finish.
The problem is then how to construct X. As you have shown in your example, you can simply merge dataframes, select the wanted columns, and extract the feature matrix with a .as_matrix() call (.to_numpy() in newer pandas). If you have more dataframes and columns, I guess you just merge more and select more.
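For illustration, a sketch of merging a third feature in; df_mean and its 'mean' column are hypothetical stand-ins for any additional dataframe:

import pandas as pd
from sklearn.cluster import KMeans

df_features = (df_max[['id', 'max']]
               .merge(df_var[['id', 'variance']], on='id', how='left')
               .merge(df_mean[['id', 'mean']], on='id', how='left'))

# any number of feature columns works; .to_numpy() replaces .as_matrix() in newer pandas
X = df_features[['max', 'variance', 'mean']].to_numpy()

kmeans = KMeans(n_clusters=3, random_state=0).fit(X)
labels = kmeans.labels_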
Feature selection and dimensionality reduction may come in handy once you have more than enough features in your dataset. Read more about them when you have time.
P.S. Why scipy in the title?
I'm still learning PyMC3, but I cannot find anything on the following problem in the docs. Consider the Bayesian Structural Time Series (BSTS) model from this question with no seasonality. This can be modeled in PyMC3 as follows:
import pymc3, numpy, matplotlib.pyplot
# generate some test data
t = numpy.linspace(0,2*numpy.pi,100)
y_full = numpy.cos(5*t)
y_train = y_full[:90]
y_test = y_full[90:]
# specify the model
with pymc3.Model() as model:
    grw = pymc3.GaussianRandomWalk('grw', mu=0, sd=1, shape=y_train.size)
    y = pymc3.Normal('y', mu=grw, sd=1, observed=y_train)
    trace = pymc3.sample(1000)
y_mean_pred = pymc3.sample_ppc(trace, samples=1000, model=model)['y'].mean(axis=0)
fig = matplotlib.pyplot.figure(dpi=100)
ax = fig.add_subplot(111)
ax.plot(t,y_full,c='b')
ax.plot(t[:90],y_mean_pred,c='r')
matplotlib.pyplot.show()
Now I would like to predict the behavior for the next 10 time steps, i.e., y_test. I would also like to include credible regions over this area to produce a Bayesian cone, e.g., see here. Unfortunately the mechanism for producing the cones in the aforementioned link is a little vague. In a more conventional AR model one could learn the mean regression coefficients and manually extend the mean curve. However, in this BSTS model there is no obvious way to do this. Alternatively, if there were regressors, I could use a theano.shared variable and update it with a finer/extended grid to impute and extrapolate with sample_ppc, but that's not really an option in this setting. Perhaps sample_ppc is a red herring here, but it's unclear how else to proceed. Any help would be welcome.
I think the following works. However, it's super clunky and requires that I know how far in advance I want to predict before I train (in particular it precludes streaming usage or simple EDA). I suspect there is a better way and I would much rather accept a better solution from someone with more PyMC3 experience.
import numpy, pymc3, matplotlib.pyplot, seaborn
# generate some data
t = numpy.linspace(0,2*numpy.pi,100)
y_full = numpy.cos(5*t)
# mask the data that I want to predict (requires knowledge
# that one might not always have at training time).
cutoff_idx = 80
y_obs = numpy.ma.MaskedArray(y_full,numpy.arange(t.size)>cutoff_idx)
# specify and train the model, used the masked array to supply only
# the observed data
with pymc3.Model() as model:
    grw = pymc3.GaussianRandomWalk('grw', mu=0, sd=1, shape=y_obs.size)
    y = pymc3.Normal('y', mu=grw, sd=1, observed=y_obs)
    trace = pymc3.sample(5000)
y_pred = pymc3.sample_ppc(trace, samples=20000, model=model)['y']
y_pred_mean = y_pred.mean(axis=0)
# compute percentiles
dfp = numpy.percentile(y_pred, [2.5, 25, 50, 75, 97.5], axis=0)
# plot actual data and summary posterior information
pal = seaborn.color_palette('Purples')
fig = matplotlib.pyplot.figure(dpi=100)
ax = fig.add_subplot(111)
ax.plot(t,y_full,c='g',label='true value',alpha=0.5)
ax.plot(t,y_pred_mean,c=pal[5],label='posterior mean',alpha=0.5)
ax.plot(t,dfp[2,:],alpha=0.75,color=pal[3],label='posterior median')
ax.fill_between(t,dfp[0,:],dfp[4,:],alpha=0.5,color=pal[1],label='CR 95%')
ax.fill_between(t,dfp[1,:],dfp[3,:],alpha=0.4,color=pal[2],label='CR 50%')
ax.axvline(x=t[cutoff_idx],linestyle='--',color='r',alpha=0.25)
ax.legend()
matplotlib.pyplot.show()
This outputs the following, which seems like a really bad prediction, but at least the code is supplying out-of-sample values.
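For reference, the forecast portion of the cone can be pulled out of the same arrays; a small sketch using the variables defined above:

# slice out just the out-of-sample region (indices after the cutoff)
forecast_mean = y_pred_mean[cutoff_idx + 1:]
forecast_lo, forecast_hi = dfp[0, cutoff_idx + 1:], dfp[4, cutoff_idx + 1:]
print(forecast_mean[:5])
print(forecast_lo[:5], forecast_hi[:5])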
I have training data in the form of images taken by a PiCamera on a Raspberry Pi RC car while I drive it between two lane lines.
Each image is labelled with the Left and Right motor controls. I've plotted them in the graph below.
I am using ConvNets to do the regression with Keras and TensorFlow as backend.
It's clearly visible that the regression would be much simpler if I could remove the training samples that lie to the left of the visible regression line.
The code for loading the images and labels is very simple and is below:
import glob

filenames = glob.glob("../data/*.jpg")
labels = []
images = []
for filename in filenames:
    # Timestamp-LeftMotorControl-RightMotorControl.jpg
    filename = filename.replace('.jpg', '')
    parts = filename.split('-')
    if float(parts[1]) == 0. or float(parts[2]) == 0.:
        continue
    images.append(filename)
    labels.append([float(parts[1]), float(parts[2])])
Firstly, is there a good approach for removing data that falls outside the visible regression from the training samples?
Also, I have a different approach in mind: create 100 data bins with edges from 0 to 1, then take 50 samples from each bin so that my data is balanced.
Is there a numpy way to put the data into bins so that I don't need to do that with a custom function?
Answering the first question:
1. Do the linear regression fit on the whole dataset.
2. Remove the points with the largest residuals.
3. Repeat 1 and 2 until all remaining residuals are comfortably small (see the sketch below).
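A minimal numpy sketch of that loop, assuming the left and right motor values from labels are treated as the x/y pairs; keep_fraction and max_iter are illustrative knobs, not part of the answer:

import numpy as np

def trim_outliers(x, y, keep_fraction=0.95, max_iter=10):
    # repeatedly fit y ~ a*x + b by least squares and drop the worst residuals
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    for _ in range(max_iter):
        A = np.column_stack([x, np.ones_like(x)])
        coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
        residuals = np.abs(A @ coeffs - y)
        keep = residuals <= np.quantile(residuals, keep_fraction)
        if keep.all():
            break
        x, y = x[keep], y[keep]
    return x, y

left, right = np.asarray(labels).T  # labels built in the loading loop above
left_clean, right_clean = trim_outliers(left, right)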