Clustering vectors with similar patterns - python

Say that I have many vectors, some of them are:
a: [1,2,3,4,3,2,1,0,0,0,0,0]
b: [5,5,5,5,5,10,20,30,5,10]
c: [1,2,3,2,1,0,0,0,0,0,0,0]
We can see a similar pattern between vectors a and c.
My question is whether it is possible to assign these two to the same cluster and assign b to another cluster.
I'd rather not use algorithms like KMeans, because the actual values are not interesting, only the patterns are.
Any advice is welcome, especially solutions in Python.
Thanks

You may want to use a Support Vector Classifier, as it produces boundaries between clusters based on the patterns (generalized directions) between points in the clusters, rather than the naive distance between points (as KMeans and Spectral Clustering do). You will, however, have to construct the labels Y yourself, as SVC is a supervised method. Here is an example:
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

a = [1,2,3,4,3,2,1,0,0,0,0,0]
b = [5,5,5,5,5,10,20,30,5,10]
c = [1,2,3,2,1,0,0,0,0,0,0,0]
d = [100,2,300,4,100,0,0,0,0,0,0,0]

vectors = [a, b, c]

# Vectors have different lengths. Pad them with zeros to get equal dimensions.
L = max(len(elem) for elem in vectors)
imputed = []
for elem in vectors:
    l = len(elem)
    imputed.append(elem + [0]*(L-l))
print(imputed)

X = np.array(imputed)
print(X)

Y = np.array([0, 1, 0])

clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
clf.fit(X, Y)
print(clf.predict(np.array([d])))
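If you would rather stay fully unsupervised, one option (a sketch, not validated on your real data) is agglomerative clustering with cosine distance, which compares the shape of the vectors while ignoring their overall magnitude. Note the distance keyword is metric in recent scikit-learn releases and affinity in older ones:
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Zero-padded versions of the vectors above
a = [1,2,3,4,3,2,1,0,0,0,0,0]
b = [5,5,5,5,5,10,20,30,5,10,0,0]
c = [1,2,3,2,1,0,0,0,0,0,0,0]
X = np.array([a, b, c], dtype=float)

# Cosine distance ignores scale, so vectors with the same shape but
# different magnitudes land in the same cluster.
agg = AgglomerativeClustering(n_clusters=2, metric="cosine", linkage="average")
print(agg.fit_predict(X))  # a and c share a label, e.g. [0, 1, 0]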

Related

Plot boundary lines between classes in python based on multidimensional data?

I am trying to plot the boundary lines of the Iris data set using LDA in sklearn (Python), based on this documentation.
For two-dimensional data, we can easily plot the lines using LDA.coef_ and LDA.intercept_.
But for multidimensional data that has been reduced to two components, LDA.coef_ and LDA.intercept_ have many dimensions, and I don't know how to use them to plot the boundary lines in the 2D reduced-dimension plot.
I've tried plotting using only the first two elements of LDA.coef_ and LDA.intercept_, but it didn't work.
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
iris = datasets.load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names
lda = LinearDiscriminantAnalysis(n_components=2)
X_r2 = lda.fit(X, y).transform(X)
x = np.array([-10,10])
y_hyperplane = -1*(lda.intercept_[0]+x*lda.coef_[0][0])/lda.coef_[0][1]
plt.figure()
colors = ['navy', 'turquoise', 'darkorange']
lw = 2
plt.plot(x,y_hyperplane,'k')
for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(X_r2[y == i, 0], X_r2[y == i, 1], alpha=.8, color=color,
                lw=lw,
                label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('LDA of IRIS dataset')
plt.show()
The boundary line produced by lda.coef_[0] and lda.intercept_[0] is not one that plausibly separates two of the classes.
I've also tried using np.meshgrid to draw the class regions, but I get an error like this:
ValueError: X has 2 features per sample; expecting 4
because predict expects the 4 features of the original data, not the 2D points from the meshgrid.
Linear discriminant analysis (LDA) can be used as a classifier or for dimensionality reduction.
LDA for dimensionality reduction
Dimensionality reduction techniques reduce the number of features. The Iris dataset has 4 features; let's use LDA to reduce it to 2 features so that we can visualise it.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = iris.data
y = iris.target

sc = StandardScaler()
X = sc.fit_transform(X)

lda = LinearDiscriminantAnalysis(n_components=2)
lda_object = lda.fit(X, y)
X = lda_object.transform(X)

for l,c,m in zip(np.unique(y),['r','g','b'],['s','x','o']):
    plt.scatter(X[y==l,0],
                X[y==l,1],
                c=c, marker=m, label=l, edgecolors='black')
Output: a scatter plot of the three Iris classes in the 2D LDA space.
LDA for multi class classification
LDA does multi-class classification using one-vs-rest. If you have 3 classes you will get 3 hyperplanes (decision boundaries), one for each class. If there are n features, each hyperplane is represented by n weights (coefficients) and 1 intercept. In general:
coef_ : shape of (n_classes, n_features)
intercept_ : shape of (n_classes,)
Sample, documented inline
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(13)
# Generate 3 linearly separable clusters with 2 features each
X = np.array([[0,0]]*25 + [[0,10]]*25 + [[10,10]]*25, dtype=float)
X += np.random.randn(*X.shape)  # Gaussian noise around each cluster center
y = np.array([0]*25+[1]*25+[2]*25)
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()
lda_object = lda.fit(X, y)
# Plot the hyperplanes
for l,c,m in zip(np.unique(y),['r','g','b'],['s','x','o']):
    plt.scatter(X[y==l,0],
                X[y==l,1],
                c=c, marker=m, label=l, edgecolors='black')

x1 = np.array([np.min(X[:,0], axis=0), np.max(X[:,0], axis=0)])
for i, c in enumerate(['r','g','b']):
    b, w1, w2 = lda.intercept_[i], lda.coef_[i][0], lda.coef_[i][1]
    y1 = -(b+x1*w1)/w2
    plt.plot(x1,y1,c=c)
As you can see, each decision boundary separates one class from the rest (follow the color of the decision boundary).
Your case
Your dataset has 4 features, so you cannot visualise the data or the decision boundary directly (human visualisation is limited to 3D). One approach is to use LDA to reduce the dimensions to 2D, and then fit a second LDA on those 2D features to classify them; a sketch follows.
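A minimal sketch of that two-stage approach, assuming the same Iris setup as in the question (colours chosen to match the question's plot):
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Stage 1: LDA as dimensionality reduction, 4 features -> 2
X_2d = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

# Stage 2: LDA as a classifier in the reduced space, so coef_ and
# intercept_ live in the same 2D space that we plot in
lda_2d = LinearDiscriminantAnalysis().fit(X_2d, y)

x1 = np.array([X_2d[:, 0].min(), X_2d[:, 0].max()])
for i, c in enumerate(['navy', 'turquoise', 'darkorange']):
    plt.scatter(X_2d[y == i, 0], X_2d[y == i, 1], color=c, alpha=.8)
    b, w1, w2 = lda_2d.intercept_[i], lda_2d.coef_[i][0], lda_2d.coef_[i][1]
    plt.plot(x1, -(b + x1 * w1) / w2, color=c)  # one-vs-rest boundary
plt.title('LDA of IRIS dataset (2D LDA space)')
plt.show()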

How to rank features correctly from PCA's eigenvectors

My goal is to rank the features of a supervised machine-learning dataset by their contributions to the principal components, following this answer.
I set up an experiment in which I construct a dataset containing 3 informative, 3 redundant, and 3 noise features, in that order. I then find the index of the largest component on each principal axis.
However, this method gives me a really poor ranking, and I don't know what mistake I have made. Many thanks for helping. Here is my code:
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
# Make a dataset with 3 informative, 3 redundant, and 3 noise features, in that order
X, _ = make_classification(n_samples=20, n_features=9, n_informative=3,
                           n_redundant=3, random_state=0, shuffle=False)
cols = ['I_'+str(i) for i in range(3)]
cols += ['R_'+str(i) for i in range(3)]
cols += ['N_'+str(i) for i in range(3)]
dfX = pd.DataFrame(X, columns=cols)

# Rank each feature by the largest-magnitude component on each principal axis
model = PCA().fit(dfX)
_ = model.transform(dfX)
n_pcs = model.components_.shape[0]
most_important = [np.abs(model.components_[i]).argmax() for i in range(n_pcs)]
most_important_names = [dfX.columns[most_important[i]] for i in range(n_pcs)]
rank = {'PC{}'.format(i): most_important_names[i] for i in range(n_pcs)}
rank outputs:
{'PC0': 'R_1',
'PC1': 'I_1',
'PC2': 'N_1',
'PC3': 'N_0',
'PC4': 'N_2',
'PC5': 'I_2',
'PC6': 'R_1',
'PC7': 'R_0',
'PC8': 'R_2'}
I am expecting the informative features I_x to be ranked in the top 3.
PCA's ranking criterion is the variance of each column. If you would like a simple ranking, you can output the variance of each column, e.g. via VarianceThreshold:
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold()
selector.fit_transform(dfX)
print(selector.variances_)
# outputs [1.57412087 1.08363799 1.11752334 0.58501874 2.2983772 0.2857617
# 1.09782539 0.98715471 0.93262548]
From this you can see the variance of each column; the informative columns (I_0, I_1, I_2) all have fairly high variance. Note, however, that the redundant column at index 4 (R_1) actually has the largest variance of all, which is exactly why the per-component argmax in your code picks R_1 for PC0.
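If you do want a ranking derived from the PCA itself, one common alternative (a sketch, reusing model and dfX from the question's code) is to weight each feature's absolute loadings by each component's explained variance ratio and sum over components, instead of taking one argmax per component:
import numpy as np

# importance[j] = sum_i |components_[i, j]| * explained_variance_ratio_[i]
importance = np.abs(model.components_).T @ model.explained_variance_ratio_
ranking = dfX.columns[np.argsort(importance)[::-1]]
print(list(ranking))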

How can I increase the accuracy of my Linear Regression model? (machine learning with Python)

I have a machine-learning project in Python using the scikit-learn library. I have two separate datasets for training and testing, and I am trying to do linear regression. I use the code block shown below:
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
from pylab import rcParams
import urllib
import sklearn
from sklearn.linear_model import LinearRegression
df =pd.read_csv("TrainingData.csv")
df2=pd.read_csv("TestingData.csv")
df['Development_platform']= ["".join("%03d" % ord(c) for c in s) for s in df['Development_platform']]
df['Language_Type']= ["".join("%03d" % ord(c) for c in s) for s in df['Language_Type']]
df2['Development_platform']= ["".join("%03d" % ord(c) for c in s) for s in df2['Development_platform']]
df2['Language_Type']= ["".join("%03d" % ord(c) for c in s) for s in df2['Language_Type']]
X_train = df[['AFP','Development_platform','Language_Type','Resource_Level']]
Y_train = df['Effort']
X_test=df2[['AFP','Development_platform','Language_Type','Resource_Level']]
Y_test=df2['Effort']
lr = LinearRegression().fit(X_train, Y_train)
print("lr.coef_: {}".format(lr.coef_))
print("lr.intercept_: {}".format(lr.intercept_))
print("Training set score: {:.2f}".format(lr.score(X_train, Y_train)))
print("Test set score: {:.7f}".format(lr.score(X_test, Y_test)))
My results are:
lr.coef_: [ 2.32088001e+00 2.07441948e-12 -4.73338567e-05 6.79658129e+02]
lr.intercept_: 2166.186033098048
Training set score: 0.63
Test set score: 0.5732999
What do you suggest? How can I increase my accuracy (by adding code, parameters, etc.)?
My datasets is here: https://yadi.sk/d/JJmhzfj-3QCV4V
I'll elaborate a bit on @GeorgiKaradjov's answer with some examples. Your question is very broad, and there are multiple ways to gain improvements. In the end, having domain knowledge (context) will give you the best possible chance of getting improvements.
Normalise your data, i.e., shift it to have a mean of zero, and a spread of 1 standard deviation
Turn categorical data into variables via, e.g., OneHotEncoding
Do feature engineering:
Are my features collinear?
Do any of my features have cross terms/higher-order terms?
Regularisation of the features to reduce possible overfitting
Look at alternative models given the underlying features and the aim of the project
1) Normalise data

from sklearn.preprocessing import StandardScaler
std = StandardScaler()
# StandardScaler expects 2D input, so reshape the combined column
afp = np.append(X_train['AFP'].values, X_test['AFP'].values).reshape(-1, 1)
std.fit(afp)
X_train[['AFP']] = std.transform(X_train[['AFP']])
X_test[['AFP']] = std.transform(X_test[['AFP']])
Gives
0 0.752395
1 0.008489
2 -0.381637
3 -0.020588
4 0.171446
Name: AFP, dtype: float64
2) Categorical Feature Encoding

def feature_engineering(df):
    dev_plat = pd.get_dummies(df['Development_platform'], prefix='dev_plat')
    df[dev_plat.columns] = dev_plat
    df = df.drop('Development_platform', axis=1)

    lang_type = pd.get_dummies(df['Language_Type'], prefix='lang_type')
    df[lang_type.columns] = lang_type
    df = df.drop('Language_Type', axis=1)

    resource_level = pd.get_dummies(df['Resource_Level'], prefix='resource_level')
    df[resource_level.columns] = resource_level
    df = df.drop('Resource_Level', axis=1)

    return df

X_train = feature_engineering(X_train)
X_train.head(5)
Gives
AFP dev_plat_077070 dev_plat_077082 dev_plat_077117108116105 dev_plat_080067 lang_type_051071076 lang_type_052071076 lang_type_065112071 resource_level_1 resource_level_2 resource_level_4
0 0.752395 1 0 0 0 1 0 0 1 0 0
1 0.008489 0 0 1 0 0 1 0 1 0 0
2 -0.381637 0 0 1 0 0 1 0 1 0 0
3 -0.020588 0 0 1 0 1 0 0 1 0 0
3a) Feature engineering: collinearity
import seaborn as sns
import numpy as np

corr = X_train.corr()
# np.bool was removed from recent NumPy versions; the builtin bool works everywhere
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool), cmap=sns.diverging_palette(220, 10, as_cmap=True), square=True)
You expect the red diagonal (y = x) because every feature is perfectly correlated with itself. However, any other strongly red or blue cells indicate a correlation/anti-correlation that requires more investigation. For example, resource_level_1 and resource_level_4 might be strongly anti-correlated, in the sense that having a 1 makes a 4 less likely, and so on. Regression assumes that the parameters used are independent of one another.
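As a hedged follow-up, you can also list the most correlated pairs numerically, since the heatmap becomes hard to read with many dummy columns (this reuses corr from the snippet above):
import numpy as np

# Mask the diagonal, then flatten the matrix into (pair -> |correlation|)
off_diag = corr.where(~np.eye(len(corr), dtype=bool))
pairs = off_diag.abs().unstack().dropna().sort_values(ascending=False)
print(pairs.head(5))  # each pair appears twice, as (A, B) and (B, A)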
3b) Feature engineering: higher-order terms
Maybe your model is too simple; you could consider adding higher-order and cross terms:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(2, interaction_only=True)
output_nparray = poly.fit_transform(df)
# Build readable column names like 'AFP^1xResource_Level^1' from the exponent matrix
target_feature_names = [
    'x'.join('{}^{}'.format(col, power) for col, power in zip(df.columns, powers) if power != 0)
    for powers in poly.powers_
]
output_df = pd.DataFrame(output_nparray, columns=target_feature_names)
I had a quick try at this; I don't think the higher-order terms help much. It's also possible your data is non-linear: a quick logarithm of the Y output gives a worse fit, suggesting the relationship is linear. You could also look at the actuals, but I was too lazy....
4) Regularisation
Try using sklearn's RidgeCV (ridge regression with built-in selection of alpha) and playing with the alphas:
lr = RidgeCV(alphas=np.arange(70,100,0.1), fit_intercept=True)
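For completeness, a minimal sketch with the import and fit, assuming the X_train/Y_train from above (the alpha range is the one quoted here, not tuned):
from sklearn.linear_model import RidgeCV
import numpy as np

ridge = RidgeCV(alphas=np.arange(70, 100, 0.1), fit_intercept=True)
ridge.fit(X_train, Y_train)
print("Chosen alpha:", ridge.alpha_)
print("Test set score: {:.2f}".format(ridge.score(X_test, Y_test)))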
5) Alternative models
Sometimes linear regression is not well suited. For example, Random Forest regressors can perform very well and are usually insensitive to data being standardised or to features being categorical/continuous. Other models include XGBoost and Lasso (linear regression with L1 regularisation).
lr = RandomForestRegressor(n_estimators=100)
Putting it all together
I got carried away and started looking at your problem, but couldn't improve it too much without knowing all the context of the features:
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
from pylab import rcParams
import urllib
import sklearn
from sklearn.linear_model import RidgeCV, LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import GridSearchCV
def feature_engineering(df):
    dev_plat = pd.get_dummies(df['Development_platform'], prefix='dev_plat')
    df[dev_plat.columns] = dev_plat
    df = df.drop('Development_platform', axis=1)

    lang_type = pd.get_dummies(df['Language_Type'], prefix='lang_type')
    df[lang_type.columns] = lang_type
    df = df.drop('Language_Type', axis=1)

    resource_level = pd.get_dummies(df['Resource_Level'], prefix='resource_level')
    df[resource_level.columns] = resource_level
    df = df.drop('Resource_Level', axis=1)

    return df
df = pd.read_csv("TrainingData.csv")
df2 = pd.read_csv("TestingData.csv")
df['Development_platform']= ["".join("%03d" % ord(c) for c in s) for s in df['Development_platform']]
df['Language_Type']= ["".join("%03d" % ord(c) for c in s) for s in df['Language_Type']]
df2['Development_platform']= ["".join("%03d" % ord(c) for c in s) for s in df2['Development_platform']]
df2['Language_Type']= ["".join("%03d" % ord(c) for c in s) for s in df2['Language_Type']]
X_train = df[['AFP','Development_platform','Language_Type','Resource_Level']]
Y_train = df['Effort']
X_test = df2[['AFP','Development_platform','Language_Type','Resource_Level']]
Y_test = df2['Effort']
std = StandardScaler()
# StandardScaler expects 2D input, so reshape the combined column
afp = np.append(X_train['AFP'].values, X_test['AFP'].values).reshape(-1, 1)
std.fit(afp)
X_train[['AFP']] = std.transform(X_train[['AFP']])
X_test[['AFP']] = std.transform(X_test[['AFP']])
X_train = feature_engineering(X_train)
X_test = feature_engineering(X_test)
lr = RandomForestRegressor(n_estimators=50)
lr.fit(X_train, Y_train)
y_pred = lr.predict(X_test)  # needed for the plot below
print("Training set score: {:.2f}".format(lr.score(X_train, Y_train)))
print("Test set score: {:.2f}".format(lr.score(X_test, Y_test)))

fig = plt.figure()
ax = fig.add_subplot(111)
ax.errorbar(Y_test, y_pred, fmt='o')
ax.errorbar([1, Y_test.max()], [1, Y_test.max()])
Resulting in:
Training set score: 0.90
Test set score: 0.61
You can look at the importance of the variables (higher value, more important).
Importance
AFP 0.882295
dev_plat_077070 0.020817
dev_plat_077082 0.001162
dev_plat_077117108116105 0.016334
dev_plat_080067 0.004077
lang_type_051071076 0.012458
lang_type_052071076 0.021195
lang_type_065112071 0.001118
resource_level_1 0.012644
resource_level_2 0.006673
resource_level_4 0.021227
You could also start looking at the hyperparameters to get further improvements, e.g. with GridSearchCV (a sketch follows): http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
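A hedged sketch of a small grid search over the forest's hyperparameters (the grid values are illustrative, not tuned):
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [None, 5, 10]}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=5)
search.fit(X_train, Y_train)
print(search.best_params_)
print("Best CV score: {:.2f}".format(search.best_score_))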
Here are some tips:
Data preparation (exploration) is one of the most important steps in a machine-learning project, and you need to start with it.
Did you clean your data? If not, start with that step!
As said in this tutorial:
There are no shortcuts for data exploration. If you are in a state of mind that machine learning can sail you away from every data storm, trust me, it won't. After some point of time, you'll realize that you are struggling at improving model's accuracy. In such situation, data exploration techniques will come to your rescue.
Here are some steps for data exploration:
missing values treatment
outlier removal
feature engineering
After that, try to perform univariate and bivariate analysis with your features.
Use one-hot encoding to transform your categorical features into numeric ones; this is what you need according to what we talked about in the comments.
Here is a tutorial on how to deal with categorical variables; one-hot encoding from scikit-learn is the best technique for your problem.
Using an ASCII representation is not best practice for handling categorical features.
You can find more about data exploration here.
Follow the suggestions I gave you, and thank me later.
Normalize your data.
Depending on the type of input features, you can extract different features from them (feature combinations are possible too).
If your data is not linearly separable, you won't be able to predict it well. You may need another model, e.g. logistic regression, SVR, or a neural network; a hedged SVR sketch follows.
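A minimal sketch of trying SVR on the same features, assuming the X_train/Y_train built in the question (kernel and C are illustrative, not tuned):
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale inside a pipeline, since SVR is sensitive to feature scales
svr = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=100.0))
svr.fit(X_train, Y_train)
print("SVR test score: {:.2f}".format(svr.score(X_test, Y_test)))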

Determining a threshold value for a bimodal distribution via KMeans clustering

I'd like to find a threshold value for a bimodal distribution. For example, a bimodal distribution could look like the following:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(45)
n = 1000; b = n//10; i = np.random.randint(0,2,n)
x = i*np.random.normal(-2.0,0.8,n) + (1-i)*np.random.normal(2.0,0.8,n)
_ = plt.hist(x,bins=b)
An attempt to find the cluster centers did not work, as I wasn't sure how the matrix h should be formatted:
from sklearn.cluster import KMeans
h = np.histogram(x,bins=b)
h = np.vstack((0.5*(h[1][:-1]+h[1][1:]),h[0])).T # because h[0] and h[1] have different sizes.
kmeans = KMeans(n_clusters=2).fit(h)
I would expect to be able to find the cluster centers around -2 and 2. The threshold value would then be the midpoint of the two cluster centers.
Your question is a bit confusing to me, so please let me know if I've interpreted it incorrectly. I think you are basically trying to do 1D k-means, and are introducing frequency as a second dimension to get KMeans to work, but would really just be happy with [-2, 2] as the output centers instead of [(-2, y1), (2, y2)].
To do 1D k-means you can just reshape your data into n vectors of length 1 (similar question: Scikit-learn: How to run KMeans on a one-dimensional array?)
code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

np.random.seed(45)
n = 1000
b = n//10
i = np.random.randint(0,2,n)
x = i*np.random.normal(-2.0,0.8,n) + (1-i)*np.random.normal(2.0,0.8,n)
_ = plt.hist(x,bins=b)

# There is no need to build the histogram matrix h; cluster the raw
# samples directly by reshaping them into n vectors of length 1.
kmeans = KMeans(n_clusters=2).fit(x.reshape(n, 1))
print(kmeans.cluster_centers_)
output:
[[-1.9896414]
[ 2.0176039]]
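Following the question's own definition, the threshold is then just the midpoint of the two 1D centers:
# Midpoint of the two cluster centers; approximately 0 for this example
threshold = kmeans.cluster_centers_.ravel().mean()
print(threshold)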

Select 5 data points closest to SVM hyperplane

I have written Python code using Sklearn to cluster my dataset:
af = AffinityPropagation().fit(X)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_
n_clusters_= len(cluster_centers_indices)
I am exploring the use of query-by-clustering, and so I form an initial training dataset by:
td_title = []
td_abstract = []
td_y = []
for each in centers:
    td_title.append(title[each])
    td_abstract.append(abstract[each])
    td_y.append(y[each])
I then train my model (an SVM) on it by:
clf = svm.SVC()
clf.fit(X, data_y)
I wish to write a function that, given the centres, the model, the X values, and the Y values, will append the 5 data points the model is most unsure about, i.e. the data points closest to the hyperplane. How can I do this?
The first steps of your process aren't entirely clear to me, but here's a suggestion for "Select(ing) 5 data points closest to SVM hyperplane". The scikit documentation defines decision_function as the distance of the samples to the separating hyperplane. The method returns an array which can be sorted with argsort to find the "top/bottom N samples".
Following this basic scikit example, define a function closestN to return the samples closest to the hyperplane.
import numpy as np

def closestN(X_array, n):
    # clf is the fitted SVC from the snippet above
    # signed distances of the samples to the separating hyperplane
    dists = clf.decision_function(X_array)
    # absolute distance to the hyperplane
    absdists = np.abs(dists)
    # indices of the n samples nearest the hyperplane
    return absdists.argsort()[:n]
Add these two lines to the scikit example to see the function implemented:
closest_samples = closestN(X, 5)
plt.scatter(X[closest_samples][:, 0], X[closest_samples][:, 1], color='yellow')
(Figures: the original scatter, and the same scatter with the five closest samples highlighted in yellow.)
If you need to append the samples to some list, you could somelist.append(closestN(X, 5)). If you needed the sample values you could do something like somelist.append(X[closestN(X, 5)]).
closestN(X, 5)
array([ 1, 20, 14, 31, 24])
X[closestN(X, 5)]
array([[-1.02126202, 0.2408932 ],
[ 0.95144703, 0.57998206],
[-0.46722079, -0.53064123],
[ 1.18685372, 0.2737174 ],
[ 0.38610215, 1.78725972]])
