Unexpected result with polyfit in Python

Permit me first to introduce the background of the question:
I came across a quiz which gave me a data set and a logistic equation:
y = 3000 / (1 + a*e^(-k*t))
Then it asked whether that model can be linearized, and if so, to use the linear model to estimate the values of a and k.
I tried to linearize it like this:
ln(3000/y - 1) = ln(a) - k*t
And coded it in Python:
import math
import numpy as np
import matplotlib.pyplot as plt

t = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
y = np.array([43.65, 109.86, 187.21, 312.67, 496.58, 707.65, 960.25, 1238.75, 1560, 1824.29, 2199, 2438.89, 2737.71])
yAss = np.log(3000/y - 1)
cof = np.polyfit(t, yAss, deg = 1)
a = math.e**(cof[0])
k = -cof[1]
yAfter = 3000 / (1 + a*math.e**(-k*t))

sizeScalar = 10
fig = plt.figure(figsize = (sizeScalar*1.1, sizeScalar))
plt.plot(t, y, 'o', markersize = sizeScalar*0.75)
plt.plot(t, yAfter, 'r-')
plt.grid(True)
plt.show()
And got this plot, which is obviously incorrect:
Then, by coincidence, I changed some parts of the code:
t = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
y = np.array([43.65, 109.86, 187.21, 312.67, 496.58, 707.65, 960.25, 1238.75, 1560, 1824.29, 2199, 2438.89, 2737.71])
yAss = np.log(3000/y - 1)
cof = np.polyfit(t, yAss, deg = 1)
a = math.e**(-cof[1]) #<<<===============here. Before: a = math.e**(cof[0])
k = cof[0] #<<<==========================and here. Before: k = -cof[1]
temp = 3000 / (1 + a*math.e**(-k*t))
yAfter = []
for itera in temp: #<<<=======================add this
    yAfter.append(3000 - itera)

sizeScalar = 10
fig = plt.figure(figsize = (sizeScalar*1.1, sizeScalar))
plt.plot(t, y, 'o', markersize = sizeScalar*0.75)
plt.plot(t, yAfter, 'r-')
plt.grid(True)
plt.show()
And received a curve that seems to be right?
But how could that be? I think cof[0] IS beta (that is, ln a), and cof[1] IS -k. If so, my previous code should merely have had some conceptual mistake. But by changing the order and sign of the coefficients, I got a result that fits well?! Is it pure coincidence?
And what might be the right answer to the quiz?

np.polyfit returns the highest-degree coefficient first: https://docs.scipy.org/doc/numpy/reference/generated/numpy.polyfit.html#numpy.polyfit
So cof[0] is the slope (-k) and cof[1] is the intercept (ln a). Use:
k = -cof[0]
a = np.exp(cof[1])
Also, you can use NumPy's exponential:
yAfter = 3000/(1 + a*np.exp(-k*t))
Your second version working is not a coincidence, by the way: it uses a' = e^(-cof[1]) = 1/a and k' = cof[0] = -k, and the expression 3000 - 3000/(1 + a'*e^(-k'*t)) simplifies algebraically to exactly 3000/(1 + a*e^(-k*t)), so the swapped coefficients plus the "3000 - temp" step reproduce the correct model.
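Putting it together, a minimal corrected version of the question's script (same data; only the coefficient order and sign are fixed):
import numpy as np
import matplotlib.pyplot as plt

t = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
y = np.array([43.65, 109.86, 187.21, 312.67, 496.58, 707.65, 960.25,
              1238.75, 1560, 1824.29, 2199, 2438.89, 2737.71])

# linearization: ln(3000/y - 1) = ln(a) - k*t, and polyfit returns [slope, intercept]
cof = np.polyfit(t, np.log(3000/y - 1), deg=1)
k = -cof[0]
a = np.exp(cof[1])

yAfter = 3000 / (1 + a*np.exp(-k*t))
plt.plot(t, y, 'o')
plt.plot(t, yAfter, 'r-')
plt.grid(True)
plt.show()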

Related

Weighted network between cluster centroids

The following code plots vectors derived from two sets of data points. I also compute and plot the centroids of these points using k-means clustering.
I'm hoping to build some form of adjacency matrix to plot the network between the clusters, one that also accounts for the number of vectors between each cluster, so displaying the weight.
I was thinking the diagonal values of the adjacency matrix could indicate the number of vectors within the same cluster, while the off-diagonal values could indicate the number of vectors between different clusters, while also considering the direction?
I'm hoping to produce an output similar to the one below, where the nodes are the cluster centroids. The diameter of a node should indicate the number of vectors in the same cluster, and the line thickness the number of vectors between the two clusters.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

fig, ax = plt.subplots(figsize = (6,6))
df = pd.DataFrame(np.random.randint(-80,80,size=(500, 4)), columns=list('ABCD'))
A = df['A']
B = df['B']
C = df['C']
D = df['D']
Y_sklearn = df[['A','B','C','D']].values
ax.quiver(A, B, (C-A), (D-B), angles = 'xy', scale_units = 'xy', scale = 1, alpha = 0.5)
model = KMeans(n_clusters = 20)
model.fit(Y_sklearn)
cluster_centers = model.cluster_centers_
plt.scatter(cluster_centers[:, 0], cluster_centers[:, 1],
            color = 'black', s = 100,
            alpha = 0.7, zorder = 2)
plt.scatter(Y_sklearn[:,0], Y_sklearn[:,1], color = 'blue', alpha = 0.2)
plt.scatter(Y_sklearn[:,2], Y_sklearn[:,3], color = 'red', alpha = 0.2)
Edit 2:
If I fix the data to get the intended network below, the following plots a total of 12 vectors: two groups of 5 overlap, while two are unique.
df = pd.DataFrame({
    'A' : [5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 5, 4],
    'B' : [5, 5, 5, 5, 5, 2, 2, 2, 2, 2, 5, 2],
    'C' : [7, 7, 7, 7, 7, 3, 3, 3, 3, 3, 3, 7],
    'D' : [7, 7, 7, 7, 7, 4, 4, 4, 4, 4, 4, 7],
})
fig,ax = plt.subplots()
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
A = df['A']
B = df['B']
C = df['C']
D = df['D']
ax.quiver(A, B, (C-A), (D-B), angles = 'xy', scale_units = 'xy', scale = 1, alpha = 0.5)
If I just plot the scatter with cluster centroids, it should look like the following:
Y_sklearn = df[['A','B','C','D']].values
model = KMeans(n_clusters = 4)
model.fit(Y_sklearn)
cluster_centers = model.cluster_centers_
plt.scatter(cluster_centers[:, 0], cluster_centers[:, 1],
            color = 'black', s = 100,
            alpha = 0.7, zorder = 2)
plt.scatter(cluster_centers[:, 2], cluster_centers[:, 3],
            color = 'black', s = 100,
            alpha = 0.7, zorder = 2)
This all works fine. The next step is where I'm having trouble. If I plot the network manually, it should look something like this. The thicker lines display 5 vectors between centroids, while the thinner lines display 1 vector.
The updated code produces the following network. The (5,5) - (7,7) line is correct, but I'm not getting the other lines that should replicate something similar to the network above.
from collections import Counter
import networkx as nx

fig,ax = plt.subplots()
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)

def kmeans(arr,num_clusters):
    model = KMeans(n_clusters = num_clusters)
    model.fit(arr)
    cluster_centers = model.cluster_centers_
    all_labels = model.labels_
    mem_count = Counter(all_labels)
    return cluster_centers,all_labels,mem_count

nclusters_1,nclusters_2 = 2,2
points = df[['A','B','C','D']].values
cluster_one = kmeans(points[:,:2],nclusters_1)
cluster_two = kmeans(points[:,2:],nclusters_2)

# find connections between clusters
all_combs = [[n1,n2] for n1 in range(nclusters_1) for n2 in range(nclusters_2)]
num_connections = {}
for item in all_combs:
    l1,l2 = cluster_one[1],cluster_two[1]
    mask1 = np.where(l1==item[0])[0]
    mask2 = np.where(l2==item[1])[0]
    num_common = len(list(set(mask1).intersection(mask2)))
    num_connections[(item[0],item[1]+nclusters_1)] = num_common

G = nx.Graph()
node_sizes = {}
node_colors = {}
for k,v in num_connections.items():
    # the number of points in the two clusters
    s1,s2 = cluster_one[2][k[0]],cluster_two[2][k[1]-nclusters_1]
    G.add_node(k[0],pos=points[:,:2][k[0]])
    G.add_node(k[1],pos=points[:,2:][k[1]])
    G.add_edge(k[0],k[1],color='k',weight=v/3)
    node_sizes[k[0]] = s1; node_sizes[k[1]] = s2
    node_colors[k[0]] = 'k'; node_colors[k[1]] = 'k'

edges = G.edges()
d = dict(G.degree)
pos = nx.get_node_attributes(G,'pos')
weights = [G[u][v]['weight'] for u,v in edges]
nx.draw(G,pos,edges=edges,
        node_color=[node_colors[v] for v in d.keys()],
        nodelist=d.keys(),
        width=weights,
        node_size=[node_sizes[v]*20 for v in d.keys()])
Before solving the problem, I guess the KMeans clustering should be performed on the first two columns of the df and on the other two columns separately. If you apply KMeans clustering directly to the entire df, the connections between the two clusterings (A,B) and (C,D) will be trivial.
If it is possible to use networkx in your project, here is how you can achieve what you are looking for. First, prepare the data:
import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def kmeans(arr,num_clusters):
    model = KMeans(n_clusters = num_clusters)
    model.fit(arr)
    cluster_centers = model.cluster_centers_
    all_labels = model.labels_
    mem_count = Counter(all_labels)
    return cluster_centers,all_labels,mem_count

df = pd.DataFrame(np.random.randint(-80,80,size=(200, 4)), columns=list('ABCD'))
# tail
A,B = df['A'],df['B']
# end
C,D = df['C'],df['D']

nclusters_1,nclusters_2 = 20,20
points = df[['A','B','C','D']].values
cluster_one = kmeans(points[:,:2],nclusters_1)
cluster_two = kmeans(points[:,2:],nclusters_2)

# find connections between clusters
all_combs = [[n1,n2] for n1 in range(nclusters_1) for n2 in range(nclusters_2)]
num_connections = {}
for item in all_combs:
    l1,l2 = cluster_one[1],cluster_two[1]
    mask1 = np.where(l1==item[0])[0]
    mask2 = np.where(l2==item[1])[0]
    num_common = len(list(set(mask1).intersection(mask2)))
    num_connections[(item[0],item[1]+nclusters_1)] = num_common
As you can see in the above code, I performed KMeans clustering on (A,B) and (C,D) separately, and you can use different numbers of clusters for the two sets of points via nclusters_1 and nclusters_2. After the data is prepared, we can visualize it using networkx:
# plot the graph
import networkx as nx

G = nx.Graph()
node_sizes = {}
node_colors = {}
for k,v in num_connections.items():
    # the number of points in the two clusters
    s1,s2 = cluster_one[2][k[0]],cluster_two[2][k[1]-nclusters_1]
    G.add_node(k[0],pos=cluster_one[0][k[0]])
    G.add_node(k[1],pos=cluster_two[0][k[1]-nclusters_1])
    G.add_edge(k[0],k[1],color='k',weight=v/3)
    node_sizes[k[0]] = s1; node_sizes[k[1]] = s2
    node_colors[k[0]] = 'navy'; node_colors[k[1]] = 'darkviolet'

edges = G.edges()
d = dict(G.degree)
pos = nx.get_node_attributes(G,'pos')
weights = [G[u][v]['weight'] for u,v in edges]
nx.draw(G,pos,edges=edges,
        node_color=[node_colors[v] for v in d.keys()],
        nodelist=d.keys(),
        width=weights,
        node_size=[node_sizes[v]*20 for v in d.keys()])
The output figure looks like this:
In this figure, the actual centroid coordinates are used as node positions; the (A,B) nodes are colored navy and the (C,D) nodes dark violet. If you want to scale up the node size, just use a different number than 20 in this line:
node_size=[node_sizes[v]*20 for v in d.keys()]
In order to adjust the width of the edges, you can use a number other than 3 in this line:
G.add_edge(k[0],k[1],color='k',weight=v/3)
UPDATE
In the script above, the keys of num_connections represent pairs of graph nodes, and the corresponding values represent the numbers of connections between them. In order to extract an adjacency matrix from num_connections, you can try:
adj_mat = np.zeros((nclusters_1,nclusters_2))
for k,v in num_connections.items():
    entry = (k[0],k[1]-nclusters_1)
    adj_mat[entry] = v
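As a quick sanity check (my own addition, assuming the variables from the snippets above): each vector has exactly one tail cluster and one head cluster, so the entries of adj_mat must sum to the total number of vectors:
# every vector is counted in exactly one (tail cluster, head cluster) cell
assert adj_mat.sum() == len(points)
print(adj_mat.astype(int))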
UPDATE 2
To resolve the issue in OP's updated post:
add pos=nx.get_node_attributes(G,'pos') to fix the node positions
change G.add_node(k[0],pos=points[:,:2][k[0]]) and G.add_node(k[1],pos=points[:,2:][k[1]]) to G.add_node(k[0],pos=cluster_one[0][k[0]]) and G.add_node(k[1],pos=cluster_two[0][k[1]-nclusters_1])
With these two modifications, the full script becomes:
df = pd.DataFrame({
    'A' : [5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 5, 4],
    'B' : [5, 5, 5, 5, 5, 2, 2, 2, 2, 2, 5, 2],
    'C' : [7, 7, 7, 7, 7, 3, 3, 3, 3, 3, 3, 7],
    'D' : [7, 7, 7, 7, 7, 4, 4, 4, 4, 4, 4, 7],
})

fig,ax = plt.subplots()
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
A = df['A']
B = df['B']
C = df['C']
D = df['D']

def kmeans(arr,num_clusters):
    model = KMeans(n_clusters = num_clusters)
    model.fit(arr)
    cluster_centers = model.cluster_centers_
    all_labels = model.labels_
    mem_count = Counter(all_labels)
    return cluster_centers,all_labels,mem_count

nclusters_1,nclusters_2 = 2,2
points = df[['A','B','C','D']].values
cluster_one = kmeans(points[:,:2],nclusters_1)
cluster_two = kmeans(points[:,2:],nclusters_2)

# find connections between clusters
all_combs = [[n1,n2] for n1 in range(nclusters_1) for n2 in range(nclusters_2)]
num_connections = {}
for item in all_combs:
    l1,l2 = cluster_one[1],cluster_two[1]
    mask1 = np.where(l1==item[0])[0]
    mask2 = np.where(l2==item[1])[0]
    num_common = len(list(set(mask1).intersection(mask2)))
    num_connections[(item[0],item[1]+nclusters_1)] = num_common

G = nx.Graph()
node_sizes = {}
node_colors = {}
for k,v in num_connections.items():
    # the number of points in the two clusters
    s1,s2 = cluster_one[2][k[0]],cluster_two[2][k[1]-nclusters_1]
    G.add_node(k[0],pos=cluster_one[0][k[0]])
    G.add_node(k[1],pos=cluster_two[0][k[1]-nclusters_1])
    G.add_edge(k[0],k[1],color='k',weight=v/3)
    node_sizes[k[0]] = s1; node_sizes[k[1]] = s2
    node_colors[k[0]] = 'k'; node_colors[k[1]] = 'k'

edges = G.edges()
d = dict(G.degree)
pos = nx.get_node_attributes(G,'pos')
weights = [G[u][v]['weight'] for u,v in edges]
nx.draw(G,pos,edges=edges,
        node_color=[node_colors[v] for v in d.keys()],
        nodelist=d.keys(),
        width=weights,
        node_size=[node_sizes[v]*20 for v in d.keys()])

Python using row index as variable input to equation within numpy array

I can't figure out how to do this in Python without creating a for loop. I'm hoping you can teach me the simpler way.
I trimmed out the relevant stuff. I'm doing a polyfit and then want to use the a and b coefficients, coeff[0] and coeff[1], to update an array and solve the relevant y's like y = ax + b.
I can brute-force it, and have included two methods here, but they're both clunky.
import numpy as np

raw = [0, 3, 6, 8, 11, 15]
coeff = np.polyfit(np.arange(0, len(raw)), raw[:], 1) # fits slope of values in raw
fit = np.zeros(shape=(len(raw), 2))
fit[:,0] = np.arange(0, fit.shape[0]) # index column, used as the "x" variable
fit[:,1] = fit[:,0]*coeff[0] + fit[:,0]*coeff[1] # meant to be y = a*x + b, in column [1]

## Alternate method with the for loop
for_fit = np.zeros(len(raw))
for i in range(0, len(raw)):
    for_fit[i] = i*coeff[0] + i*coeff[1]
I tried to make it a little bit cleaner. The main issue I saw is that you did not use the formula y = ax + b but rather y = ax + bx.
import numpy as np
raw = [0, 3, 6, 8, 11, 15]
x = np.arange(0, len(raw))
coeff = np.polyfit(x, raw[:], 1)
y = x*coeff[0] + coeff[1]
To visualise the result we can use:
import matplotlib.pyplot as plt
plt.plot(x, raw, 'bo')
plt.plot(x, y, 'r')
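As an aside (my addition, not part of the original answer): np.polyval evaluates a fitted polynomial directly from the polyfit coefficients, which avoids writing the formula out by hand:
y = np.polyval(coeff, x)  # identical to x*coeff[0] + coeff[1] for deg = 1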
EDIT
Are you looking for something like this?
y_arr = np.empty((10, len(x)))
for i in range(10):
    ...
    y_arr[i] = y

Add list to Matplotlib

I am making a randomly generated world and I'm starting off with basic graphing, trying to get something similar to Perlin noise. I did everything, but the last (important) thing that I've written didn't work.
import math
import random
import matplotlib.pyplot as plt

print('ur seed')
a = input()
seed = int(a)
b = (math.cos(seed) * 100)
c = round(b)
# print(c)
for i in range(10):
    z = (random.randint(-1, 2))
    change = (z + c)
    gener = []
    gener.append(change)
    time = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    #print(gener)
    #print(change)
plt.ylabel('generated')
plt.xlabel('time')
# Here I wanna add them to the graph and it is erroring a lot
plt.scatter(time, gener)
plt.title('graph')
plt.show()
The problem is that you're setting gener to [] inside the loop, not outside of it. Also, you don't need the time variable inside the loop either.
Change:
for i in range(10):
    z = (random.randint(-1, 2))
    change = (z + c)
    gener = []
    gener.append(change)
    time = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
to
gener = []
for i in range(10):
    z = (random.randint(-1, 2))
    change = (z + c)
    gener.append(change)
time = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
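Equivalently (my addition, same behaviour as the corrected loop), the list can be built with a comprehension, and time generated with range:
gener = [random.randint(-1, 2) + c for _ in range(10)]
time = list(range(10))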

Piecewise regression Python

Hi, I'm trying to figure out how to fit those values with a piecewise linear function. I have read this question but I can't get any further (How to apply piecewise linear fit in Python?). That example shows how to implement a piecewise function for a 2-segment case, but I need to do it for a three-segment case, as in the figure.
I have written this code:
from scipy import optimize
import matplotlib.pyplot as plt
import numpy as np

x1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21], dtype=float)
y1 = np.array([5, 7, 9, 11, 13, 15, 28.92, 42.81, 56.7, 70.59, 84.47, 98.36, 112.25, 126.14, 140.03, 145, 147, 149, 151, 153, 155])
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], dtype=float)
y = np.array([5, 7, 9, 11, 13, 15, 28.92, 42.81, 56.7, 70.59, 84.47, 98.36, 112.25, 126.14, 140.03])

def piecewise(x, x0, x1, y0, y1, k0, k1, k2):
    return np.piecewise(x, [x <= x0, (x >= x1)],
                        [lambda x: k0*x + y0 - k0*x0,
                         lambda x: k1*(x - (x1 + x0)) - y1,
                         lambda x: k2*x + y1 - k2*x1])

p, e = optimize.curve_fit(piecewise, x1, y1)
xd = np.linspace(0, 15, 100)
plt.figure()
plt.plot(x1, y1, "o")
plt.plot(xd, piecewise(xd, *p))
but this is the output:
Any suggestions? I believe that the problem is in the return np.piecewise(...) line, in particular in the second lambda.
EDIT 1:
If I try the solution provided by A.L. on different data, I don't get good results.
I get this result:
with
x=[ 16.01690476, 16.13801587, 14.63628571, 15.32664399,
15.8145 , 15.71507143, 15.56107143, 15.553 ,
15.08734524, 14.97275 , 15.51958333, 16.61981859,
16.36589286, 14.78708333, 14.41565476, 13.47763158,
13.42412281, 12.95551378, 13.66601504, 13.63315789,
13.21463659, 13.53464286, 14.60130952, 14.7774881 ,
13.04319048, 12.53385965, 12.65745614, 13.90535714,
14.82412281, 14.6565 , 15.09541667, 13.41434524,
13.66033333, 14.57964286, 13.55416667, 13.43041667,
13.01137566, 12.76429825, 11.55241667, 11.0634881 ,
10.92729762, 11.21625 , 10.72092857, 11.80380952,
12.55233333, 12.11307143, 11.78892857, 12.45458333,
11.05539286, 10.69214286, 10.32566667, 11.3439881 ,
9.69563492, 10.72535714, 10.26180272, 7.77272727,
6.37704082, 8.49666667, 8.5389881 , 5.68547619,
7.00616667, 8.22015873, 10.20315476, 15.35736842,
12.25158333, 11.09622153, 10.4118254 , 9.8602381 ,
10.16727273, 15.10858333, 13.82215539, 12.44719298,
10.92341667, 11.44565476, 11.43333333, 10.5045 ,
11.14357143, 10.37625 , 8.93421769, 9.48444444,
10.43483333, 10.8659881 , 10.96166667, 10.12872619,
9.64663265, 9.29979762, 9.67173469, 8.978322 ,
9.10419501, 9.45411565, 10.46411565, 7.95739229,
8.72616667, 7.03892857, 7.32547619, 7.56441667,
6.61022676, 9.09014739, 10.78141667, 10.85918367,
11.11665476, 10.141 , 9.17760771, 8.27968254,
11.02625 , 12.34809524, 11.17807018, 11.25416667,
11.29236905, 9.28357143, 9.77033333, 11.52086168,
9.8625 , 12.60281955, 12.42785714, 12.11902256,
13.1 , 13.02791667, 13.87779449, 15.09857143,
13.93935185, 13.69821429, 13.39880952, 12.45692982,
12.76921053, 13.23708333, 13.71666667, 15.39807143,
15.27916667, 14.66464286, 13.38694444, 10.97555556,
10.02191667, 11.99608333, 14.26325 , 15.40991667,
15.12908333, 15.76265476, 12.12763158, 15.01641667,
14.39602381, 12.98532143, 14.98807018, 18.30547619,
16.7564966 , 16.82982143, 19.8487013 , 19.18600907]
and
y=[ 2.36846863, 2.73722628, 2.77177583, 2.63930636, 2.80864749,
2.57066667, 2.65277287, 2.57162347, 2.76295667, 2.79835391,
2.60431154, 2.17326401, 2.67740698, 2.47138153, 2.49882574,
2.60987338, 2.69935565, 2.60755362, 2.77702029, 2.62996942,
2.45959517, 2.52750434, 2.73833005, 2.52009 , 2.80933226,
1.63807085, 2.49230099, 2.55441614, 3.19256506, 2.52609288,
1.02931596, 2.40266963, 2.3306463 , 2.69094276, 2.60779985,
2.48351648, 2.45131766, 2.40526763, 2.03952569, 1.86217009,
1.79971848, 1.91772218, 1.85895421, 2.32725731, 2.28189713,
2.11835833, 2.09636517, 2.2230303 , 1.85863317, 1.77550406,
1.68862391, 1.79187765, 1.70887476, 1.81911193, 1.74802483,
1.65776432, 1.58012849, 1.67781494, 1.62451541, 1.60555884,
1.56172214, 1.60083809, 1.65256994, 2.74794704, 2.27089627,
1.80364982, 1.51412482, 1.77738757, 1.56979564, 2.46538633,
2.37679625, 2.40389294, 2.04165763, 1.82086407, 1.90609219,
1.87480978, 1.8877854 , 1.76080074, 1.68369028, 1.57419297,
1.66470126, 1.74522552, 1.72459756, 1.65510503, 1.72131148,
1.6254417 , 1.57091907, 1.68755268, 1.70307911, 1.59445121,
1.74393783, 1.72913779, 1.66883237, 1.59859545, 1.62335831,
1.73378184, 1.62621588, 1.79532164, 1.78289992, 1.79475101,
1.7826266 , 1.68778918, 1.64484127, 1.62332696, 1.75372393,
1.99038021, 1.87268137, 1.86124502, 1.82435911, 1.62927102,
1.66443723, 1.86743516, 1.62745098, 2.20200312, 2.09641026,
2.26649111, 2.63271605, 2.18050721, 2.57138433, 2.51833359,
2.74684184, 2.57209998, 2.63762019, 2.30027877, 2.28471286,
2.40323668, 2.37103313, 2.16414489, 1.01027109, 2.64181007,
2.45467765, 2.05773672, 1.73624917, 2.05233688, 2.70820669,
2.65594222, 2.67445635, 2.37212985, 2.48221803, 2.77655216,
2.62839879, 2.26481307, 2.58005799, 2.1188172 , 2.14017268,
2.16459571, 1.95083406, 1.46224418]
Fitting a piecewise linear function is a nonlinear optimization problem which may have local optima. The result you see is probably one of the local optima where your optimization algorithm gets stuck.
One way to solve this problem is to repeat your optimization algorithm with different initial values and take the best fit. I used the mean absolute error (MAE) to compare the different fits against each other:
perr = np.sum(np.abs(y1-piecewise(x1, *p)))
I also changed your piecewise function because it was a bit confusing to me, but it is still a piecewise function as before.
Furthermore, I think you forgot to extend the x and xd arrays to the value 21 (that's why the green line ends early).
from scipy import optimize
import matplotlib.pyplot as plt
import numpy as np

def piecewise(x, x0, x1, y0, y1, k0, k1, k2):
    return np.piecewise(x, [x <= x0, np.logical_and(x0 < x, x <= x1), x > x1],
                        [lambda x: k0*x + y0,
                         lambda x: k1*(x - x0) + y1 + k0*x0,
                         lambda x: k2*(x - x1) + y0 + y1 + k0*x0 + k1*(x1 - x0)])

x1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21], dtype=float)
y1 = np.array([5, 7, 9, 11, 13, 15, 28.92, 42.81, 56.7, 70.59, 84.47, 98.36, 112.25, 126.14, 140.03, 145, 147, 149, 151, 153, 155])

perr_min = np.inf
p_best = None
for n in range(100):
    k = np.random.rand(7)*20
    p, e = optimize.curve_fit(piecewise, x1, y1, p0=k)
    perr = np.sum(np.abs(y1 - piecewise(x1, *p)))
    if(perr < perr_min):
        perr_min = perr
        p_best = p

xd = np.linspace(0, 21, 100)
plt.figure()
plt.plot(x1, y1, "o")
y_out = piecewise(xd, *p_best)
plt.plot(xd, y_out)
plt.show()
this gives me the result below, with
p = [ 6.34259491, 15.00000023, 2.97272604, 7.05498314, 2.00751828, 13.88881542, 1.99960597]
Edit 1
You edited your question, and this is the answer to the edited one.
(Sorry, I am new at Stack Overflow and not sure if I should have posted another answer instead.)
In your second dataset you added noise to the data. In my opinion there are two kinds of noise: Gaussian noise, which places the points close to the underlying piecewise line, and outlier noise, which places points far away from the original underlying line.
Under the hood, the optimization algorithm you use minimizes the following with respect to p:
E = sum(square(y - piecewise(x, p)))
http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html#scipy.optimize.curve_fit
The Gaussian noise is not very problematic. The optimization you use indirectly assumes this Gaussian noise (by minimizing the least-squares error) and fits the line as well as possible. The real problem comes in with the outliers.
The problem is that the outliers are far away from the original function. Even if the optimization tries the optimal parameters, the energy function E will not be minimal, because the outliers are far away from the original function, and this distance is even squared, so it shifts the minimum of E far away from the true parameters of your function.
So what's the solution?
Get rid of the outliers.
An automated approach for that is RANSAC: https://en.wikipedia.org/wiki/RANSAC
In brief: you choose a random subset of the original data and hope the subset has no outliers. You fit your function to the subset and discard the points which are far away from the fitted function. If enough points survive this step, you take all the surviving points and repeat the fit. The error on this "inlier" set is a measure of the quality of your fit. Then you repeat the whole process and take the best final fit.
I adjusted my script accordingly:
from scipy import optimize
import matplotlib.pyplot as plt
import numpy as np

def piecewise(x, x0, x1, y0, y1, k0, k1, k2):
    return np.piecewise(x, [x <= x0, np.logical_and(x0 < x, x <= x1), x > x1],
                        [lambda x: k0*x + y0,
                         lambda x: k1*(x - x0) + y1 + k0*x0,
                         lambda x: k2*(x - x1) + y0 + y1 + k0*x0 + k1*(x1 - x0)])

x = np.array(x)
y = np.array(y)
x1 = x
y1 = y

perr_min = np.inf
p_best = None
for n in range(100):
    idx = np.random.choice(np.arange(len(x)), 10, replace=False)
    x_sample = x[idx]
    y_sample = y[idx]
    k = np.random.rand(7)*20
    try:
        p, e = optimize.curve_fit(piecewise, x_sample, y_sample, p0=k)
        each_error = np.abs(y - piecewise(x, *p))
        x_inlier = x[each_error < 1]
        y_inlier = y[each_error < 1]
        if(x_inlier.shape[0] < 0.8 * x.shape[0]):
            continue
        p_inlier, e_inlier = optimize.curve_fit(piecewise, x_inlier, y_inlier, p0=p)
        perr = np.sum(np.abs(y - piecewise(x, *p_inlier)))
        if(perr < perr_min):
            perr_min = perr
            p_best = p_inlier
    except RuntimeError:
        pass

xd = np.linspace(0, 21, 100)
plt.figure()
plt.plot(x, y, "o")
y_out = piecewise(xd, *p_best)
plt.plot(xd, y_out)
print(p_best)
plt.show()
With 100 repetitions I get the following result:
The piecewise-regression python library can fit models with different numbers of breakpoints.
First of all, for demonstration purposes generate some data with 2 breakpoints:
import numpy as np

gradients = [2.5, 12, 2]
constant = 0
breakpoints = [6, 15]
n_points = 100
np.random.seed(1)
xx = np.linspace(0, 25, n_points)
yy = constant + gradients[0]*xx + np.random.normal(size=n_points)*10
for bp_n in range(len(breakpoints)):
    yy += (gradients[bp_n+1] - gradients[bp_n]) * np.maximum(xx - breakpoints[bp_n], 0)
To fit and plot the model:
import piecewise_regression
import matplotlib.pyplot as plt
pw_fit = piecewise_regression.Fit(xx, yy, n_breakpoints=2)
pw_fit.plot()
plt.xlabel("x")
plt.ylabel("y")
plt.show()
It also gives you a statistical analysis:
pw_fit.summary()
It won't work well with the data you provided in your edit, because there are outliers that dominate the error cost function. This will be an issue whichever method you use to fit the data; you need to decide how to handle the outliers in this instance.
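One possible way to down-weight the outliers (my own sketch, not part of the piecewise-regression library or the answers above) is to fit the three-segment model with scipy's least_squares, which supports robust loss functions such as 'soft_l1'; the initial guess below is a rough one based on the parameters reported earlier:
import numpy as np
from scipy.optimize import least_squares

# three-segment model, same parametrization as in the earlier answer
def piecewise(x, x0, x1, y0, y1, k0, k1, k2):
    return np.piecewise(x, [x <= x0, np.logical_and(x0 < x, x <= x1), x > x1],
                        [lambda x: k0*x + y0,
                         lambda x: k1*(x - x0) + y1 + k0*x0,
                         lambda x: k2*(x - x1) + y0 + y1 + k0*x0 + k1*(x1 - x0)])

def residuals(p, x, y):
    return y - piecewise(x, *p)

x1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21], dtype=float)
y1 = np.array([5, 7, 9, 11, 13, 15, 28.92, 42.81, 56.7, 70.59, 84.47, 98.36, 112.25, 126.14, 140.03, 145, 147, 149, 151, 153, 155])

p0 = np.array([6.0, 15.0, 3.0, 7.0, 2.0, 14.0, 2.0])  # rough start near the earlier best fit
fit = least_squares(residuals, p0, args=(x1, y1), loss='soft_l1')
print(fit.x)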

Minimize objective function using lmfit.minimize in Python

I am having a problem with the lmfit.minimize minimization procedure. Actually, I could not create a correct objective function for my problem.
Problem definition
My function: y_m = a_m1*x_1**2 + a_m2*x_2**2 + ... + a_mN*x_N**2, where the x_n are unknowns and the a_mn are coefficients, with n = 1..N and m = 1..M.
In my case, N=5 for x1,..,x5 and M=3 for y1, y2, y3.
I need to find the optimum x1, x2, ..., x5 so that they satisfy the y values.
My question:
I get the error: ValueError: operands could not be broadcast together with shapes (3,) (3,5).
Did I create the objective function of my problem properly in Python?
My code:
My code:
import numpy as np
from lmfit import Parameters, minimize

def func(x, a):
    return np.dot(a, x**2)

def residual(pars, a, y):
    vals = pars.valuesdict()
    x = vals['x']
    model = func(x, a)
    return y - model

def main():
    # simple one: a(M,N) = a(3,5)
    a = np.array([[0, 0, 1, 1, 1],
                  [1, 0, 1, 0, 1],
                  [0, 1, 0, 1, 0]])
    # true values of x
    x_true = np.array([10, 13, 5, 8, 40])
    # data without noise
    y = func(x_true, a)

    # a priori x0
    x0 = np.array([2, 3, 1, 4, 20])
    fit_params = Parameters()
    fit_params.add('x', value=x0)
    out = minimize(residual, fit_params, args=(a, y))
    print(out)

if __name__ == '__main__':
    main()
Directly using scipy.optimize.minimize(), the code below solves this problem. Note that with more points y_m you will tend to get the same result as x_true; otherwise more than one solution exists. You can minimize the effect of the ill-constrained optimization by adding boundaries (see the bounds parameter used below).
import numpy as np
from scipy.optimize import minimize

def residual(x, a, y):
    s = ((y - a.dot(x**2))**2).sum()
    return s

def main():
    M = 3
    N = 5
    a = np.random.random((M, N))
    x_true = np.array([10, 13, 5, 8, 40])
    y = a.dot(x_true**2)
    x0 = np.array([2, 3, 1, 4, 20])
    bounds = [[0, None] for x in x0]
    out = minimize(residual, x0=x0, args=(a, y), method='L-BFGS-B', bounds=bounds)
    print(out.x)
If M>=N you could also use scipy.optimize.leastsq for this task:
import numpy as np
from scipy.optimize import leastsq

def residual(x, a, y):
    return y - a.dot(x**2)

def main():
    M = 5
    N = 5
    a = np.random.random((M, N))
    x_true = np.array([10, 13, 5, 8, 40])
    y = a.dot(x_true**2)
    x0 = np.array([2, 3, 1, 4, 20])
    out = leastsq(residual, x0=x0, args=(a, y))
    print(out[0])
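For reference, a hedged sketch of how the same setup could stay inside lmfit itself: as far as I know, an lmfit Parameter holds a scalar, so passing the whole array as value=x0 is what triggers the broadcast error; the unknowns have to be added as separate named parameters (the names x1..x5 below are my own choice), and with only 3 residuals for 5 parameters a scalar minimizer such as 'nelder' is used instead of the default leastsq:
import numpy as np
from lmfit import Parameters, minimize

def residual(pars, a, y):
    # rebuild the x vector from the five scalar parameters
    x = np.array([pars['x%d' % i].value for i in range(1, 6)])
    return y - a.dot(x**2)

a = np.array([[0, 0, 1, 1, 1],
              [1, 0, 1, 0, 1],
              [0, 1, 0, 1, 0]])
x_true = np.array([10, 13, 5, 8, 40])
y = a.dot(x_true**2)

fit_params = Parameters()
for i, v in enumerate([2, 3, 1, 4, 20], start=1):
    fit_params.add('x%d' % i, value=float(v), min=0)

out = minimize(residual, fit_params, args=(a, y), method='nelder')
print([out.params['x%d' % i].value for i in range(1, 6)])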
