How to do Constrained Linear Regression - scikit learn? - python

I am trying to carry out linear regression subject to some constraints in order to get a certain prediction.
I want the model to predict the first half linearly, and to keep the second half of the prediction within a very narrow range around the last value of the first half (using constraints), similar to the green line in the figure.
The full code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
pd.options.mode.chained_assignment = None # default='warn'
data = [5.269, 5.346, 5.375, 5.482, 5.519, 5.57, 5.593999999999999, 5.627000000000001, 5.724, 5.818, 5.792999999999999, 5.817, 5.8389999999999995, 5.882000000000001, 5.92, 6.025, 6.064, 6.111000000000001, 6.1160000000000005, 6.138, 6.247000000000001, 6.279, 6.332000000000001, 6.3389999999999995, 6.3420000000000005, 6.412999999999999, 6.442, 6.519, 6.596, 6.603, 6.627999999999999, 6.76, 6.837000000000001, 6.781000000000001, 6.8260000000000005, 6.849, 6.875, 6.982, 7.018, 7.042000000000001, 7.068, 7.091, 7.204, 7.228, 7.261, 7.3420000000000005, 7.414, 7.44, 7.516, 7.542000000000001, 7.627000000000001, 7.667000000000001, 7.821000000000001, 7.792999999999999, 7.756, 7.871, 8.006, 8.078, 7.916, 7.974, 8.074, 8.119, 8.228, 7.976, 8.045, 8.312999999999999, 8.335, 8.388, 8.437999999999999, 8.456, 8.227, 8.266, 8.277999999999999, 8.289, 8.299, 8.318, 8.332, 8.34, 8.349, 8.36, 8.363999999999999, 8.368, 8.282, 8.283999999999999]
time = range(1,85,1)
x=int(0.7*len(data))
df = pd.DataFrame(list(zip(*[time, data])))
df.columns = ['time', 'data']
# print df
x=int(0.7*len(df))
train = df[:x]
valid = df[x:]
models = []
names = []
tr_x_ax = []
va_x_ax = []
pr_x_ax = []
tr_y_ax = []
va_y_ax = []
pr_y_ax = []
time_model = []
models.append(('LR', LinearRegression()))
for name, model in models:
    x_train = df.iloc[:, 0][:x].values
    y_train = df.iloc[:, 1][:x].values
    x_valid = df.iloc[:, 0][x:].values
    y_valid = df.iloc[:, 1][x:].values
    model = LinearRegression()
    # poly = PolynomialFeatures(5)
    x_train = x_train.reshape(-1, 1)
    y_train = y_train.reshape(-1, 1)
    x_valid = x_valid.reshape(-1, 1)
    y_valid = y_valid.reshape(-1, 1)
    # model.fit(x_train,y_train)
    model.fit(x_train, y_train.ravel())
    # score = model.score(x_train,y_train.ravel())
    # print('score', score)
    preds = model.predict(x_valid)
    tr_x_ax.extend(train['data'])
    va_x_ax.extend(valid['data'])
    pr_x_ax.extend(preds)
    valid['Predictions'] = preds
    valid.index = df[x:].index
    train.index = df[:x].index
plt.figure(figsize=(5,5))
# plt.plot(train['data'],label='data')
# plt.plot(valid[['Close', 'Predictions']])
x = valid['data']
# print x
# plt.plot(valid['data'],label='validation')
plt.plot(valid['Predictions'],label='Predictions before',color='orange')
y =range(0,58)
y1 =range(58,84)
for index, item in enumerate(pr_x_ax):
    if index > 13:
        pr_x_ax[index] = pr_x_ax[13]
pr_x_ax = list([float(i) for i in pr_x_ax])
va_x_ax = list([float(i) for i in va_x_ax])
tr_x_ax = list([float(i) for i in tr_x_ax])
plt.plot(y,tr_x_ax, label='train' , color='red', linewidth=2)
plt.plot(y1,va_x_ax, label='validation1' , color='blue', linewidth=2)
plt.plot(y1,pr_x_ax, label='Predictions after' , color='green', linewidth=2)
plt.xlabel("time")
plt.ylabel("data")
plt.xticks(rotation=45)
plt.legend()
plt.show()
If you see this figure:
Label 'Predictions before': the model predicted this without any constraints (I don't need this result).
Label 'Predictions after': the model predicted this within a constraint, but the constraint is applied after the model has predicted, and all values are set equal to the last value at index = 71, item 8.56.
I used the for loop for index, item in enumerate(pr_x_ax): in line 64, so the curve is a straight line from time 71 to 85 s, as you can see, in order to show how I need the model to work.
Could I build a model that gives the same result instead of using the for loop?
I would appreciate your suggestions.

I expect that by drawing the green line in your question you really want the trained model to predict a linear, horizontal turn to the right, while the currently trained model draws just the straight orange line.
It is true for any trained model of any algorithm and type that, in order to learn some unusual change in behavior, the model needs to see at least some samples of that unusual change, or at least some hidden pattern in the observed data should point to such a change.
In other words, for your model to learn the right turn of the green line, it should have points with that right turn in the training data set. But you take for training just the first (leftmost) 70% of the data with train = df[:int(0.7 * len(df))], and that training data has no such right turns; it just looks close to one straight line.
So you need to re-sample your data into training and validation sets in a different way: take 70% of the samples randomly from the whole range of X and let the rest go to validation, so that samples containing the right turn are also included in your training data.
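As a minimal sketch of that re-sampling idea, using sklearn's train_test_split on illustrative placeholder data (the complete code below uses a random boolean mask instead):
import numpy as np, pandas as pd
from sklearn.model_selection import train_test_split

# placeholder frame standing in for the question's df
df_demo = pd.DataFrame({'time': range(1, 85), 'data': np.linspace(5.3, 8.3, 84)})
# shuffled 70/30 split, so samples from the whole time range land in training
train_rnd, valid_rnd = train_test_split(df_demo, train_size=0.7, shuffle=True, random_state=0)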
The second thing is that a LinearRegression model always models its predictions with one single straight line, and that line can't have right turns. In order to have right turns, you need a more complex model.
One way for a model to have a right turn is to be piecewise linear, i.e. composed of several joined straight lines. I didn't find ready-made piecewise linear models inside sklearn, only in other pip packages. So I decided to implement my own simple class PieceWiseLinearRegression, which uses np.piecewise() and scipy.optimize.curve_fit() to model a piecewise linear function.
The next picture shows the results of applying the two things mentioned above (code goes afterwards): re-sampling the dataset in a different way and modeling a piecewise linear function. Your current linear model LR still makes a prediction with just one straight blue line, while my piecewise linear PWLR2, the orange line, consists of two segments and correctly predicts the right turn:
To see the PWLR2 graph clearly on its own, I made the next picture too:
My class PieceWiseLinearRegression accepts just one argument n when the object is created - the number of linear segments to be used for prediction. For the picture above, n = 2 was used.
import sys, numpy as np, pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
np.random.seed(0)
class PieceWiseLinearRegression:
    @classmethod
    def nargs_func(cls, f, n):
        return eval('lambda ' + ', '.join([f'a{i}' for i in range(n)]) + ': f(' + ', '.join([f'a{i}' for i in range(n)]) + ')', locals())

    @classmethod
    def piecewise_linear(cls, n):
        condlist = lambda xs, xa: [(lambda x: (
            (xs[i] <= x if i > 0 else np.full_like(x, True, dtype = np.bool_)) &
            (x < xs[i + 1] if i < n - 1 else np.full_like(x, True, dtype = np.bool_))
        ))(xa) for i in range(n)]
        funclist = lambda xs, ys: [(lambda i: (
            lambda x: (
                (x - xs[i]) * (ys[i + 1] - ys[i]) / (
                    (xs[i + 1] - xs[i]) if abs(xs[i + 1] - xs[i]) > 10 ** -7 else 10 ** -7 * (-1, 1)[xs[i + 1] - xs[i] >= 0]
                ) + ys[i]
            )
        ))(j) for j in range(n)]
        def f(x, *pargs):
            assert len(pargs) == (n + 1) * 2, (n, pargs)
            xs, ys = pargs[0::2], pargs[1::2]
            xa = x.ravel().astype(np.float64)
            ya = np.piecewise(x = xa, condlist = condlist(xs, xa), funclist = funclist(xs, ys)).ravel()
            #print('xs', xs, 'ys', ys, 'xa', xa, 'ya', ya)
            return ya
        return cls.nargs_func(f, 1 + (n + 1) * 2)

    def __init__(self, n):
        self.n = n
        self.f = self.piecewise_linear(self.n)

    def fit(self, x, y):
        from scipy import optimize
        self.p, self.e = optimize.curve_fit(self.f, x, y, p0 = [j for i in range(self.n + 1) for j in (np.amin(x) + i * (np.amax(x) - np.amin(x)) / self.n, 1)])
        #print('p', self.p)

    def predict(self, x):
        return self.f(x, *self.p)
data = [5.269, 5.346, 5.375, 5.482, 5.519, 5.57, 5.593999999999999, 5.627000000000001, 5.724, 5.818, 5.792999999999999, 5.817, 5.8389999999999995, 5.882000000000001, 5.92, 6.025, 6.064, 6.111000000000001, 6.1160000000000005, 6.138, 6.247000000000001, 6.279, 6.332000000000001, 6.3389999999999995, 6.3420000000000005, 6.412999999999999, 6.442, 6.519, 6.596, 6.603, 6.627999999999999, 6.76, 6.837000000000001, 6.781000000000001, 6.8260000000000005, 6.849, 6.875, 6.982, 7.018, 7.042000000000001, 7.068, 7.091, 7.204, 7.228, 7.261, 7.3420000000000005, 7.414, 7.44, 7.516, 7.542000000000001, 7.627000000000001, 7.667000000000001, 7.821000000000001, 7.792999999999999, 7.756, 7.871, 8.006, 8.078, 7.916, 7.974, 8.074, 8.119, 8.228, 7.976, 8.045, 8.312999999999999, 8.335, 8.388, 8.437999999999999, 8.456, 8.227, 8.266, 8.277999999999999, 8.289, 8.299, 8.318, 8.332, 8.34, 8.349, 8.36, 8.363999999999999, 8.368, 8.282, 8.283999999999999]
time = list(range(1, 85))
df = pd.DataFrame(list(zip(time, data)), columns = ['time', 'data'])
choose_train = np.random.uniform(size = (len(df),)) < 0.8
choose_valid = ~choose_train
x_all = df.iloc[:, 0].values
y_all = df.iloc[:, 1].values
x_train = df.iloc[:, 0][choose_train].values
y_train = df.iloc[:, 1][choose_train].values
x_valid = df.iloc[:, 0][choose_valid].values
y_valid = df.iloc[:, 1][choose_valid].values
x_all_lin = np.linspace(np.amin(x_all), np.amax(x_all), 500)
models = []
models.append(('LR', LinearRegression()))
models.append(('PWLR2', PieceWiseLinearRegression(2)))
for imodel, (name, model) in enumerate(models):
    model.fit(x_train[:, None], y_train)
    x_all_lin_pred = model.predict(x_all_lin[:, None])
    plt.plot(x_all_lin, x_all_lin_pred, label = f'pred {name}')
plt.plot(x_train, y_train, label='train')
plt.plot(x_valid, y_valid, label='valid')
plt.xlabel('time')
plt.ylabel('data')
plt.legend()
plt.show()

Related

Problems with hurdle and truncated count model from statsmodels

I want to use a hurdle model from statsmodels (https://www.statsmodels.org/dev/examples/notebooks/generated/count_hurdle.html).
Unfortunately, when following the manual, I am running into several problems:
I am only able to load the discrete_model and count_model modules from discrete; hence, the following imports work:
import statsmodels.discrete.discrete_model
import statsmodels.discrete.count_model
However, there is no module named 'statsmodels.discrete.truncated_model'
Furthermore, I am not able to run
get_diagnostic()
with a Poisson regression. I am running the same code as on the statsmodels website (see below), but it throws the following error:
'PoissonResults' object has no attribute 'get_diagnostic'
Thanks for your help!
# imports needed to run the snippet below
import numpy as np
from statsmodels.discrete.discrete_model import Poisson

np.random.seed(987456348)
# large sample to get strong results
nobs = 5000
x = np.column_stack((np.ones(nobs), np.linspace(0, 1, nobs)))
mu0 = np.exp(0.5 *2 * x.sum(1))
y = np.random.poisson(mu0, size=nobs)
print(np.bincount(y))
y_ = y
indices = np.arange(len(y))
mask = mask0 = y > 0
for _ in range(10):
    print(mask.sum())
    indices = mask  # indices[mask]
    if not np.any(mask):
        break
    mu_ = np.exp(0.5 * x[indices].sum(1))
    y[indices] = y_ = np.random.poisson(mu_, size=len(mu_))
    np.place(y, mask, y_)
    mask = np.logical_and(mask0, y == 0)
np.bincount(y)
mod_p = Poisson(y, x)
res_p = mod_p.fit()
print(res_p.summary())
dia_p = res_p.get_diagnostic()
dia_p.plot_probs();
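One thing worth checking: the linked notebook lives under the statsmodels dev documentation (note the /dev/ in the URL), so statsmodels.discrete.truncated_model and PoissonResults.get_diagnostic may simply not exist in an older installed release. A quick, hedged first check:
import statsmodels
print(statsmodels.__version__)
# if this version predates the features shown in the dev notebook, upgrading may help:
#   python -m pip install --upgrade statsmodels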

Python optimization of prediction of random forest regressor

I have built a random forest regressor to predict the elasticity of a certain object based on color, material, size and other features.
The model works fine and I can predict the expected elasticity given certain inputs.
Eventually, I want to be able to find the lowest elasticity with certain constraints. The inputs have limited possibilities, i.e., material can only be plastic or textile.
I would like to have a smart solution in which I don't have to brute force and try all the possible combinations and find the one with lowest elasticity. I have found that surrogate models can be used for this but I don't understand how to apply this concept to my problem. For example, what is the objective function I should optimize in my case? I thought of passing the .predict() of the random forest but I'm not sure this is the correct way.
To summarize, I'd like a solution that, given certain conditions, tells me the best set of features to obtain the lowest elasticity. For example, if I'm looking for the lowest elasticity when the object is made of plastic, I'd like to receive the set of other features that gives the lowest elasticity in that case. Or simply, which feature I should tune to improve the performance.
import numpy as np
from scipy.optimize import minimize
import random
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(X_train,y_train)
material = [0,1]
size= list(range(1, 45))
color= list(range(1, 500))
def objective(x):
    material = x[0]
    size = x[1]
    color = x[2]
    return model.predict([[material, size, color]])
# initial guesses
n = 3
x0 = np.zeros(n)
x0[0] = random.choice(material)
x0[1] = random.choice(size)
x0[2] = random.choice(color)
# optimize
b = (None,None)
bnds = (b, b, b, b, b)
solution = minimize(objective, x0, method='nelder-mead',
options={'xtol': 1e-8, 'disp': True})
x = solution.x
print('Final Objective: ' + str(objective(x)))
This is one solution, if I understood you correctly:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from scipy.optimize import differential_evolution
model = None
def objective(x):
    material = x[0]
    size = x[1]
    color = x[2]
    return model.predict([[material, size, color]])
# define input data
material = np.random.choice([0,1], 10); material = np.expand_dims(material, 1)
size = np.arange(10); size = np.expand_dims(size, 1)
color = np.arange(20, 30); color = np.expand_dims(color, 1)
input = np.concatenate((material, size, color), 1) # shape = (10, 3)
# define output = elasticity between [0, 1] i.e. 0-100%
elasticity = np.array([0.51135295, 0.54830051, 0.42198349, 0.72614775, 0.55087905,
0.99819945, 0.3175208 , 0.78232872, 0.11621277, 0.32219236])
# model and minimize
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(input, elasticity)
limits = ((0, 1), (0, 10), (20, 30))
res = differential_evolution(objective, limits, maxiter = 10000, seed = 11111)
min_y = model.predict([res.x])[0]
print("min_elasticity ==", round(min_y, 5))
The output is minimal elasticity based on the limits
min_elasticity == 0.19029
These are random data, so the RandomForestRegressor perhaps doesn't do the best job here.
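As a hedged follow-up sketch for the "lowest elasticity when the object is made of plastic" case: rather than letting the optimizer treat the categorical input as continuous, the material can be fixed inside the objective and only the remaining features optimized. This reuses model and differential_evolution from the snippet above and assumes material == 0 encodes plastic:
def objective_plastic(x):
    size, color = x
    # material is held fixed at 0 (assumed to mean plastic) while size and color vary
    return model.predict([[0, size, color]])[0]

res_plastic = differential_evolution(objective_plastic, ((0, 10), (20, 30)),
                                     maxiter=10000, seed=11111)
print("best size/color:", res_plastic.x, "min elasticity for plastic:", res_plastic.fun)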

My neural network only returns 0, 1 or other constant value

I'm trying to create a network, that would help predict stock prices the following day. My input data are: open, high, low and close stock values, volume, index values, a few technical indicators and exchange rate; the output is closing price from the next day. I'm using data uploaded from Excel file.
I wrote a program, which I will paste below, but it doesn't seem to be working correctly. The network always returns 1, 0 or some other constant value (between 0 and 1).
I took the following steps so far:
tried to normalise the data like so: X_norm = X/(10 ** d) where d is the smallest number for which this condition is met: abs(X_norm) < 1. I did that for the whole set in Excel before dividing it into training and test (see the sketch after this list).
shuffled the data before dividing it into training/test, so that learning examples are not from consecutive days
running the network on a smaller data set and on an example data set (I generated random numbers, did some simple math on them to produce an output, and tried running the network with that)
changing the amount of hidden neurons
changing the number of iterations (up to 1000, which was a lot for my computer considering the data set, so I didn't try more because it would take too much time)
changing the learning rate.
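As an illustrative sketch of the normalisation described in the first point above (example values are mine, not from the original spreadsheet): divide by the smallest power of ten that pushes every value below 1 in absolute value.
import numpy as np

X = np.array([3.2, 57.0, 812.5])                       # example raw values
d = int(np.ceil(np.log10(np.max(np.abs(X)) + 1e-12)))  # smallest d with abs(X / 10**d) < 1
X_norm = X / 10 ** d
print(d, X_norm)                                       # 3 [0.0032 0.057  0.8125]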
No matter what steps I took the outcome was always the same. I think my problem could be that I don't have a bias, but perhaps I also have other mistakes in my code that are contributing to this error.
My program:
import numpy as np
import pandas as pd
df = pd.read_excel(r"path", sheet_name="DATA", index_col=0, header=0)
df = df.to_numpy()
np.random.shuffle(df)
X_data = df[:, 0:15]
X_data = X_data.reshape(1000, 1, 15)
print(f"X_data: {X_data}")
Y_data = df[:, 15]
Y_data = Y_data.reshape(1000, 1, 1)
print(f"Y_data: {Y_data}")
X = X_data[0:801]
x_test = X_data[801:]
y = Y_data[0:801]
y_test = Y_data[801:]
print(f"X_train: {X}")
print(f"x_test: {x_test}")
print(f"Y_train: {y}")
print(f"y_test: {y_test}")
rate = 0.2
class NeuralNetwork:
    def __init__(self):
        self.input_neurons = 15
        self.hidden1_neurons = 10
        self.hidden2_neurons = 5
        self.output_neuron = 1
        self.input_to_hidden1_w = (np.random.random((self.input_neurons, self.hidden1_neurons)))      # 15x10
        self.hidden1_to_hidden2_w = (np.random.random((self.hidden1_neurons, self.hidden2_neurons)))  # 10x5
        self.hidden2_to_output_w = (np.random.random((self.hidden2_neurons, self.output_neuron)))     # 5x1

    def activation(self, x):
        sigmoid = 1/(1+np.exp(-x))
        return sigmoid

    def activation_d(self, x):
        derivative = x * (1 - x)
        return derivative

    def feed_forward(self, X):
        self.z1 = np.dot(X, self.input_to_hidden1_w)
        self.z1_a = self.activation(self.z1)
        self.z2 = np.dot(self.z1_a, self.hidden1_to_hidden2_w)
        self.z2_a = self.activation(self.z2)
        self.z3 = np.dot(self.z2_a, self.hidden2_to_output_w)
        output = self.activation(self.z3)
        return output

    def backward(self, X, y, rate, output):
        error = y - output
        z3_error_delta = error * self.activation_d(output)
        z2_error = np.dot(z3_error_delta, np.transpose(self.hidden2_to_output_w))
        z2_error_delta = z2_error * self.activation_d(self.z2)
        z1_error = np.dot(z2_error_delta, np.transpose(self.hidden1_to_hidden2_w))
        z1_error_delta = z1_error * self.activation_d(self.z1)
        self.input_to_hidden1_w += rate * np.dot(np.transpose(X), z1_error_delta)
        self.hidden1_to_hidden2_w += rate * np.dot(np.transpose(self.z1), z2_error_delta)
        self.hidden2_to_output_w += rate * np.dot(np.transpose(self.z2), z3_error_delta)

    def train(self, X, y):
        output = self.feed_forward(X)
        self.backward(X, y, rate, output)

    def save_weights(self):
        np.savetxt("w1.txt", self.input_to_hidden1_w, fmt="%s")
        np.savetxt("w2.txt", self.hidden1_to_hidden2_w, fmt="%s")
        np.savetxt("w3.txt", self.hidden2_to_output_w, fmt="%s")

    def check(self, x_test, y_test):
        self.feed_forward(x_test)
        np.mean(np.square((y_test - self.feed_forward(x_test))))
Net = NeuralNetwork()
for l in range(100):
for i, pattern in enumerate(X):
for j, outcome in enumerate(y):
print(f"#: {l}")
print(f'''
# {str(l)}
# {str(X[i])}
# {str(y[j])}''')
print(f"Predicted output: {Net.feed_forward(X[i])}")
Net.train(X[i], y[j])
print(f"Error training: {(np.mean(np.square(y - Net.feed_forward(X))))}")
Net.save_weights()
for i, pattern in enumerate(x_test):
for j, outcome in enumerate(y_test):
Net.check(x_test[i], y_test[j])
print(f"Error test: {(np.mean(np.square(y_test - Net.feed_forward(x_test))))}")

sample X examples from each class label

I have a dataset (numpy vectors) with 50 classes and 9000 training examples.
x_train=(9000,2048)
y_train=(9000,) # Classes are strings
classes=list(set(y_train))
I would like to build a sub-dataset such that each class has 5 examples,
which means I get 5*50 = 250 training examples. Hence my sub-dataset will take this form:
sub_train_data=(250,2048)
sub_train_labels=(250,)
Remark: we take 5 examples randomly from each class (total number of classes = 50).
Thank you
Here is a solution for that problem :
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt
def balanced_sample_maker(X, y, sample_size, random_seed=42):
    uniq_levels = np.unique(y)
    uniq_counts = {level: sum(y == level) for level in uniq_levels}

    if not random_seed is None:
        np.random.seed(random_seed)

    # find observation index of each class levels
    groupby_levels = {}
    for ii, level in enumerate(uniq_levels):
        obs_idx = [idx for idx, val in enumerate(y) if val == level]
        groupby_levels[level] = obs_idx
    # oversampling on observations of each label
    balanced_copy_idx = []
    for gb_level, gb_idx in groupby_levels.items():
        over_sample_idx = np.random.choice(gb_idx, size=sample_size, replace=True).tolist()
        balanced_copy_idx += over_sample_idx
    np.random.shuffle(balanced_copy_idx)

    data_train = X[balanced_copy_idx]
    labels_train = y[balanced_copy_idx]
    if ((len(data_train)) == (sample_size*len(uniq_levels))):
        print('number of sampled example ', sample_size*len(uniq_levels), 'number of sample per class ', sample_size, ' #classes: ', len(list(set(uniq_levels))))
    else:
        print('number of samples is wrong ')

    labels, values = zip(*Counter(labels_train).items())
    print('number of classes ', len(list(set(labels_train))))
    check = all(x == values[0] for x in values)
    print(check)
    if check == True:
        print('Good all classes have the same number of examples')
    else:
        print('Repeat again your sampling your classes are not balanced')
    indexes = np.arange(len(labels))
    width = 0.5
    plt.bar(indexes, values, width)
    plt.xticks(indexes + width * 0.5, labels)
    plt.show()
    return data_train, labels_train
X_train,y_train=balanced_sample_maker(X,y,10)
inspired by Scikit-learn balanced subsampling
Pure numpy solution:
def sample(X, y, samples):
    unique_ys = np.unique(y, axis=0)
    result = []
    for unique_y in unique_ys:
        val_indices = np.argwhere(y == unique_y).flatten()
        random_samples = np.random.choice(val_indices, samples, replace=False)
        result.append(X[random_samples])
    return np.concatenate(result)
I usually use a trick from scikit-learn for this. I use the StratifiedShuffleSplit function. So if I have to select 1/n fraction of my train set, I divide the data into n folds and set the proportion of test data (test_size) as 1-1/n. Here is an example where I use only 1/10 of my data.
from sklearn.model_selection import StratifiedShuffleSplit

sp = StratifiedShuffleSplit(n_splits=1, test_size=0.9, random_state=seed)
for train_index, _ in sp.split(x_train, y_train):
    x_train, y_train = x_train[train_index], y_train[train_index]
You can use dataframe as input (as in my case), and use simple code below:
col = target
nsamples = min(t4m[col].value_counts().values)
res = pd.DataFrame()
for val in t4m[col].unique():
    t = t4m.loc[t4m[col] == val].sample(nsamples)
    res = pd.concat([res, t], ignore_index=True).sample(frac=1)
col is the name of your column with classes. The code finds the minority class, takes a sample of that size from each class, and shuffles the resulting dataframe.
Then you can convert the result back to an np.array.
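For example (reusing res and col from the snippet above, and assuming the remaining columns of the frame are the features):
sub_train_labels = res[col].to_numpy()
sub_train_data = res.drop(columns=[col]).to_numpy()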

Bad K-means with Gradient Descent using TensorFlow

While learning TensorFlow, I'm working on implementing k-means clustering with it. I am following a TensorFlow tutorial which first introduces k-means and then introduces gradient descent optimization.
We first generate samples:
def create_samples(n_clusters, n_samples_per_cluster, n_features, embiggen_factor, seed):
    np.random.seed(seed)
    slices = []
    centroids = []
    # Create samples for each cluster
    for i in range(n_clusters):
        samples = tf.random_normal((n_samples_per_cluster, n_features),
                                   mean=0.0, stddev=5.0, dtype=tf.float32, seed=seed, name="cluster_{}".format(i))
        current_centroid = (np.random.random((1, n_features)) * embiggen_factor) - (embiggen_factor/2)
        centroids.append(current_centroid)
        samples += current_centroid
        slices.append(samples)
    # Create a big "samples" dataset
    samples = tf.concat(0, slices, name='samples')
    centroids = tf.concat(0, centroids, name='centroids')
    return centroids, samples
Then define the two functions assign and update (plus a euclidean distance helper) as usual:
def assign(data, centroids):
    # Explanations here: http://learningtensorflow.com/lesson6/
    expanded_vectors = tf.expand_dims(samples, 0)
    expanded_centroids = tf.expand_dims(centroids, 1)
    # nice trick here: use 'sub' "pairwisely" (thats why we just used "expand")
    distances = tf.reduce_sum(tf.square(
        tf.sub(expanded_vectors, expanded_centroids)), 2)
    mins = tf.argmin(distances, 0)
    nearest_indices = mins
    return nearest_indices

def update(data, nearest_indices, n_clusters):
    # Updates the centroid to be the mean of all samples associated with it.
    nearest_indices = tf.to_int32(nearest_indices)
    partitions = tf.dynamic_partition(samples, nearest_indices, n_clusters)
    new_centroids = tf.concat(0, [tf.expand_dims(tf.reduce_mean(partition, 0), 0) for partition in partitions])
    return new_centroids

def euclidian_distance(x, y):
    sqd = tf.squared_difference(tf.cast(x, "float32"), tf.cast(y, "float32"))
    sumsqd = tf.reduce_sum(sqd)
    sqrtsumsqd = tf.sqrt(sumsqd)
    return sqrtsumsqd
Then define the TensorFlow model to run:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt  # needed for the plots below
nclusters = 3
nsamplespercluster = 500
nfeatures = 2
embiggenfactor = 70
seed = 700
np.random.seed(seed)
ocentroids, samples = create_samples(nclusters, nsamplespercluster, nfeatures, embiggenfactor, seed)
X = tf.placeholder("float32", [nclusters*nsamplespercluster, 2])
# chosing random sample points as initial centroids.
centroids = tf.Variable([samples[i] for i in np.random.choice(range(nclusters*nsamplespercluster), nclusters)])#, [10.,10.]])
#   mean=0.0, stddev=150, dtype=tf.float32, seed=seed))  # dangling fragment, apparently from an alternative tf.random_normal initialisation
nearest_indices = assign(X, centroids)
new_centroids = update(X, nearest_indices, nclusters)
# Our error is defined as the square of the differences between centroid
error = euclidian_distance(centroids, new_centroids)
# The Gradient Descent Optimizer
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(error)
model = tf.initialize_all_variables()
with tf.Session() as session:
    data = session.run(samples)
    session.run(model)
    epsilon = 0.08
    err = float("inf")
    count = 0
    while err > epsilon:
        _, err = session.run([train_op, error], {X: data})
        print(err)
        clustering = session.run(nearest_indices)
        centers = session.run(centroids)
        count += 1
        # Plot every 100 iterations to see progress
        if (count % 100) == 0:
            print(count)
            plt.figure()
            plt.scatter(data[:,0], data[:,1], c=clustering)
            plt.scatter(centers[:,0], centers[:,1], s=300, c="orange", marker="x", linewidth=5)
    print("%d iterations" % count)
    plt.figure()
    plt.scatter(data[:,0], data[:,1], c=clustering)
    plt.scatter(centers[:,0], centers[:,1], s=300, c="orange", marker="x", linewidth=5)
This is actually working (it runs), but the result is disappointing:
After around 1600 iterations the result is quite bad. I don't even understand how some points can be "lost" (i.e. clustered as a color they are so far away from). To my mind k-means should converge really fast in such a case, but here it is not even converging to a good solution. Is it due to gradient descent? (I don't see how it could be, but...)
Thanks for any advice!
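For comparison, here is a minimal sketch (not from the original tutorial, reusing centroids, new_centroids, samples and X defined above, old-style TF API as in the question) of the classic k-means update, which assigns the freshly computed means directly to the centroid variable instead of taking a gradient step on the centroid-movement error:
update_op = tf.assign(centroids, new_centroids)   # hard re-assignment instead of a gradient step

with tf.Session() as session:
    session.run(tf.initialize_all_variables())
    data = session.run(samples)
    for _ in range(20):   # a handful of iterations is usually enough for k-means
        # note: assign()/update() above use the global samples tensor rather than
        # their data argument, so feeding X here only matters once that is fixed
        session.run(update_op, {X: data})
    centers = session.run(centroids)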
pltrdy
