I had the following problem on a test and my code didn't give me the exact results I needed, but I can't really find what went wrong.
We discovered a new comet whose elliptical orbit can be represented in a Cartesian $(x, y)$ coordinate system by the equation
$$ay^2 + bxy + cx + dy + e = x^2$$
Use a SciPy routine to solve the linear least squares problem to determine the orbital parameters a, b, c, d, e, given the following observations of the comet's position:
Observation    x       y
1              1.02    0.39
2              0.95    0.32
3              0.87    0.27
4              0.77    0.22
5              0.67    0.18
6              0.56    0.15
7              0.44    0.13
8              0.30    0.12
9              0.16    0.13
10             0.01    0.15
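Since the equation is linear in the unknowns a–e, each observation $(x_i, y_i)$ supplies one row of an overdetermined linear system $Ap = b$:
$$\begin{pmatrix} y_1^2 & x_1 y_1 & x_1 & y_1 & 1 \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ y_{10}^2 & x_{10} y_{10} & x_{10} & y_{10} & 1 \end{pmatrix} \begin{pmatrix} a \\ b \\ c \\ d \\ e \end{pmatrix} = \begin{pmatrix} x_1^2 \\ \vdots \\ x_{10}^2 \end{pmatrix}$$
which is exactly the system the code below assembles.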
I wrote this code, but when I plot the equation using matplotlib's contour the curve doesn't match the data points.
import numpy as np
from scipy import linalg

def fit(x, y):
    n = np.shape(x)[0]
    # one row [y^2, x*y, x, y, 1] per observation; right-hand side is x^2
    A = np.array([y**2, x * y, x, y, np.ones(n)]).T
    b = x**2
    return linalg.lstsq(A, b)[0]
obx = np.array([1.02, 0.95, 0.87, 0.77, 0.67, 0.56, 0.44, 0.3, 0.16, 0.01])
oby = np.array([0.39, 0.32, 0.27, 0.22, 0.18, 0.15, 0.13, 0.12, 0.13, 0.15])
fit(obx, oby)
Does somebody know what I am doing wrong here? Should I maybe use curve_fit instead of lstsq, or is the mistake in my plotting code?
Some follow-up clarification: the code I wrote gave this output for the constants a through e:
array([-2.63562548, 0.14364618, 0.55144696, 3.22294034, -0.43289427])
I plotted the result with this code:
obx = np.array([1.02, 0.95, 0.87, 0.77, 0.67, 0.56, 0.44, 0.3, 0.16, 0.01])
oby = np.array([0.39, 0.32, 0.27, 0.22, 0.18, 0.15, 0.13, 0.12, 0.13, 0.15])
def data_plot(x, y, a, b, c, d, e):
    def f(x, y):
        return a * y**2 + b * x * y + c * x + d * y + e
    plt.close()
    size = 100
    xrang = np.linspace(0, 0.5, size)
    yrang = np.linspace(0, 90, size)
    X, Y = np.meshgrid(xrang, yrang)
    F = f(X, Y)
    G = X**2
    plt.contour((F - G), [0])
    plt.scatter(x, y)
    plt.xlim([-0.5, 1.5])
    plt.ylim([0, 0.5])
    plt.xlabel('x-coordinate')
    plt.ylabel('y-coordinate')
    plt.show()
    return None

data_plot(obx, oby, -2.63562548, 0.14364618, 0.55144696, 3.22294034, -0.43289427)
which gives an obviously wrong result: the contour does not line up with the data points.
I think it's a mistake in your plotting code. I plotted this in a different manner, and it agrees with the initial points.
from sympy import plot_implicit, symbols, Eq, And
from sympy.plotting.plot import List2DSeries
import numpy as np
obx = np.array([1.02, 0.95, 0.87, 0.77, 0.67, 0.56, 0.44, 0.3, 0.16, 0.01])
oby = np.array([0.39, 0.32, 0.27, 0.22, 0.18, 0.15, 0.13, 0.12, 0.13, 0.15])
x, y = symbols('x y')
a, b, c, d, e = -2.63562548, 0.14364618, 0.55144696, 3.22294034, -0.43289427
p1 = plot_implicit(Eq(a*y**2+ b*x*y + c*x + d*y + e, x**2), (x, -2, 2), (y, -2, 2), line_color='red')
p1.append(List2DSeries(obx, oby))
p1.show()
(Blue is the initial points; red is the least-squares fit.)
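For completeness, the likely culprit in the original matplotlib version: plt.contour was called with only a Z array, so the zero contour was drawn against array indices (0–99) rather than data coordinates, and the grid ranges (x in 0–0.5, y in 0–90) did not cover the data anyway. A minimal corrected sketch, reusing the fitted constants from above:

import numpy as np
import matplotlib.pyplot as plt

obx = np.array([1.02, 0.95, 0.87, 0.77, 0.67, 0.56, 0.44, 0.3, 0.16, 0.01])
oby = np.array([0.39, 0.32, 0.27, 0.22, 0.18, 0.15, 0.13, 0.12, 0.13, 0.15])
a, b, c, d, e = -2.63562548, 0.14364618, 0.55144696, 3.22294034, -0.43289427

# build a grid that covers the plotting window
X, Y = np.meshgrid(np.linspace(-0.5, 1.5, 200), np.linspace(0, 0.5, 200))
# pass X and Y so the zero contour is drawn in data coordinates
plt.contour(X, Y, a*Y**2 + b*X*Y + c*X + d*Y + e - X**2, [0])
plt.scatter(obx, oby)
plt.show()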
For a particular gene scoring system I would like to set up a rudimentary plot such that new sample values that are entered immediately gravitate, based on multiple gene measurements, towards either a healthy or unhealthy group within the plot. Let's presume we have 5 people, each having 6 genes measured.
Import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.DataFrame(np.array([[A, 1, 1.2, 1.4, 2, 2],
                            [B, 1.5, 1, 1.4, 1.3, 1.2],
                            [C, 1, 1.2, 1.6, 2, 1.4],
                            [D, 1.7, 1.5, 1.5, 1.5, 1.4],
                            [E, 1.6, 1.9, 1.8, 3, 2.5],
                            [F, 2, 2.2, 1.9, 2, 2]]),
                  columns=['Gene', 'Healthy 1', 'Healthy 2', 'Healthy 3', 'Unhealthy 1', 'Unhealthy 2'])
This creates the following table:
Gene  Healthy 1  Healthy 2  Healthy 3  Unhealthy 1  Unhealthy 2
A     1.0        1.2        1.4        2.0          2.0
B     1.5        1.0        1.4        1.3          1.2
C     1.0        1.2        1.6        2.0          1.4
D     1.7        1.5        1.5        1.5          1.4
E     1.6        1.9        1.8        3.0          2.5
F     2.0        2.2        1.9        2.0          2.0
The X and Y coordinates of each sample are then calculated by adding the contributions of the genes together, each contribution being its weight times the measured value. The first four genes (A–D) contribute towards the Y value, whilst genes E and F determine the X value. wA–wF are the weights associated with their gene A–F counterparts. For example, with the weights below, sample Healthy 1 gets Y = .15·1.0 + .25·1.5 + .35·1.0 + .45·1.7 = 1.64 and X = .50·1.6 + .60·2.0 = 2.0.
wA = .15
wB = .25
wC = .35
wD = .45
wE = .50
wF = .60
n = 0
for n in range(5):
    y1 = df.iat[0, n]
    y2 = df.iat[1, n]
    y3 = df.iat[2, n]
    y4 = df.iat[3, n]
    TrueY = wA*y1 + wB*y2 + wC*y3 + wD*y4
    x1 = df.iat[4, n]
    x2 = df.iat[5, n]
    TrueX = (wE*x1 + wF*x2)
    result = (TrueX, TrueY)
    n += 1
    label = f"({TrueX},{TrueY})"
    plt.scatter(TrueX, TrueY, alpha=0.5)
    plt.annotate(label, (TrueX, TrueY), textcoords="offset points", xytext=(0,10), ha='center')
We thus calculate all the coordinates and plot them.
What I would now like to do is find out how I can optimize the wA–wF weights such that the healthy samples are pushed towards the origin of the plot, let's say (0,0), whilst the unhealthy samples are pushed towards a reasonable opposite point, let's say (1,1). I've looked into K-means/SVM, but as a novice coder/biochemist I was thoroughly overwhelmed and would appreciate any help available.
Here's an example using scipy.optimize combined with your code. (Since your code contains some syntax and type errors, I had to make small corrections.)
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.DataFrame(np.array([[1, 1.2, 1.4, 2, 2],
                            [1.5, 1, 1.4, 1.3, 1.2],
                            [1, 1.2, 1.6, 2, 1.4],
                            [1.7, 1.5, 1.5, 1.5, 1.4],
                            [1.6, 1.9, 1.8, 3, 2.5],
                            [2, 2.2, 1.9, 2, 2]]),
                  columns=['Healthy 1', 'Healthy 2', 'Healthy 3', 'Unhealthy 1', 'Unhealthy 2'],
                  index=['A', 'B', 'C', 'D', 'E', 'F'])
wA = .15
wB = .25
wC = .35
wD = .45
wE = .50
wF = .60
from scipy.optimize import minimize
# use your given weights as the initial guess
w0 = np.array([wA, wB, wC, wD, wE, wF])
# the objective function to be minimized
# - it computes the squared distances of the samples to (0,0) resp. (1,1)
def fun(w):
    weighted = df.values * w[:, None]  # multiply all sample values by their weight
    y = sum(weighted[:4])   # compute all 5 "TrueY" coordinates
    x = sum(weighted[4:])   # compute all 5 "TrueX" coordinates
    y[3:] -= 1              # measure the "Unhealthy" y against the target y = 1
    x[3:] -= 1              # measure the "Unhealthy" x against the target x = 1
    return sum(x**2 + y**2) # sum of squared distances
res = minimize(fun, w0)
print(res)
# assign the optimized weights back to your parameters
wA, wB, wC, wD, wE, wF = res.x
# this is mostly your unchanged code
for n in range(5):
    y1 = df.iat[0, n]
    y2 = df.iat[1, n]
    y3 = df.iat[2, n]
    y4 = df.iat[3, n]
    TrueY = wA*y1 + wB*y2 + wC*y3 + wD*y4
    x1 = df.iat[4, n]
    x2 = df.iat[5, n]
    TrueX = (wE*x1 + wF*x2)
    result = (TrueX, TrueY)
    label = f"({TrueX:.3f},{TrueY:.3f})"
    plt.scatter(TrueX, TrueY, alpha=0.5)
    plt.annotate(label, (TrueX, TrueY), textcoords="offset points", xytext=(0,10), ha='center')
plt.savefig("mygraph.png")
This yields the parameters [ 1.21773653, 0.22185886, -0.39377451, -0.76513658, 0.86984207, -0.73166533] as the solution array. With these weights, the healthy samples cluster around (0,0) and the unhealthy samples around (1,1).
You may want to experiment with other optimization methods - see scipy.optimize.minimize.
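For example, switching to the derivative-free Nelder–Mead simplex (the default for this unconstrained problem is BFGS) is a one-argument change:

# same objective and starting point, different algorithm
res = minimize(fun, w0, method='Nelder-Mead')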
I am using a numpy arange.
[In] test = np.arange(0.01, 0.2, 0.02)
[In] test
[Out] array([0.01, 0.03, 0.05, 0.07, 0.09, 0.11, 0.13, 0.15, 0.17, 0.19])
But then, if I iterate over this array, it iterates over slightly smaller values.
[In] for t in test:
.... print(t)
[Out]
0.01
0.03
0.049999999999999996
0.06999999999999999
0.08999999999999998
0.10999999999999997
0.12999999999999998
0.15
0.16999999999999998
0.18999999999999997
Why is this happening?
To avoid this, I have been rounding the values, but is that the best way to solve the problem?
for t in test:
    print(round(t, 2))
I think the floating-point nature of the numbers, mentioned in the comments, is the issue.
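The underlying cause: 0.01, 0.02, and 0.05 have no exact binary (IEEE-754 double) representation, so each computed element is the nearest representable double, which can land just below the decimal value you expect. Two lines make this visible:

print(f"{0.05:.20f}")   # 0.05000000000000000278 -- even the literal is approximate
print(0.01 + 2 * 0.02)  # 0.049999999999999996   -- the arithmetic rounds to a different nearby double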
If you would rather not leave it that way, I suggest you multiply your numbers by 100 and work with integers:
test = np.arange(1, 20, 2)
print(test)
for t in test:
    print(t / 100)
This gives me the following output:
[ 1 3 5 7 9 11 13 15 17 19]
0.01
0.03
0.05
0.07
0.09
0.11
0.13
0.15
0.17
0.19
Alternatively you can also try the following:
test = np.arange(1, 20, 2) / 100
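In the same spirit as the rounding you are already doing, you can also round the whole array once up front instead of rounding each element inside the loop:

test = np.round(np.arange(0.01, 0.2, 0.02), 2)
print(test)  # [0.01 0.03 0.05 0.07 0.09 0.11 0.13 0.15 0.17 0.19]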
Did you try:
test = np.arange(0.01, 0.2, 0.02, dtype=np.float32)
I am looking to plot data captured at 240 Hz (x-axis) against data captured at 60 Hz (y-axis). The x-axis data has 4 times as many points as the y-axis data, and I would like 4 points on the x-axis to be plotted for 1 point on the y-axis, so that the resulting graph looks like a step.
My lists:
Y axis: [0.0, 0.001, 0.003, 0.2, 0.4, 0.5, 0.7, 0.88, 0.9, 1.0]
X axis: np.arange(1, 40)  # numpy
Any ideas how to combine the 4 excess points into one in the graph?
You can use numpy.repeat to duplicate each data point in your series as many times as you want. For your specific example:
from matplotlib import pyplot as plt
import numpy as np
fig, ax = plt.subplots()
X = np.arange(1,41)
Y = np.array([0.0, 0.001, 0.003, 0.2, 0.4, 0.5, 0.7, 0.88, 0.9, 1.0])
Y2 = np.repeat(Y,4)
print(Y2)
ax.plot(X,Y2)
plt.show()
Gives the following output for Y2:
[0. 0. 0. 0. 0.001 0.001 0.001 0.001 0.003 0.003 0.003 0.003
0.2 0.2 0.2 0.2 0.4 0.4 0.4 0.4 0.5 0.5 0.5 0.5
0.7 0.7 0.7 0.7 0.88 0.88 0.88 0.88 0.9 0.9 0.9 0.9
1. 1. 1. 1. ]
And the following figure:
You can also do the opposite with
X2 = X[::4]
ax.plot(X2, Y)
In which case you get this figure:
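As a side note, matplotlib can also draw the staircase directly from the original 10 points via the drawstyle keyword (or the equivalent Axes.step helper), without repeating any data:

# hold each value flat until the next sample arrives
ax.plot(X[::4], Y, drawstyle='steps-post')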
I have "reference population" (say, v=np.random.rand(100)) and I want to compute percentile ranks for a given set (say, np.array([0.3, 0.5, 0.7])).
It is easy to compute one by one:
def percentile_rank(x):
    return (v < x).sum() / len(v)

percentile_rank(0.4)
=> 0.4
(Actually, there is an out-of-the-box scipy.stats.percentileofscore, but it does not work on vectors.)
np.vectorize(percentile_rank)(np.array([0.3, 0.5, 0.7]))
=> [ 0.33 0.48 0.71]
This produces the expected results, but I have a feeling that there should be a built-in for this.
I can also cheat:
pd.concat([pd.Series([0.3, 0.5, 0.7]),pd.Series(v)],ignore_index=True).rank(pct=True).loc[0:2]
0 0.330097
1 0.485437
2 0.718447
This is bad on two counts:
I don't want the test data [0.3, 0.5, 0.7] to be a part of the ranking.
I don't want to waste time computing ranks for the reference population.
So, what is the idiomatic way to accomplish this?
Setup:
In [62]: v=np.random.rand(100)
In [63]: x=np.array([0.3, 0.4, 0.7])
Using Numpy broadcasting:
In [64]: (v<x[:,None]).mean(axis=1)
Out[64]: array([ 0.18, 0.28, 0.6 ])
Check:
In [67]: percentile_rank(0.3)
Out[67]: 0.17999999999999999
In [68]: percentile_rank(0.4)
Out[68]: 0.28000000000000003
In [69]: percentile_rank(0.7)
Out[69]: 0.59999999999999998
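A variant worth noting (not from the original answers): if v is large or queried repeatedly, sort it once and use np.searchsorted, which gives the same counts as (v < x).sum() without materializing the len(x) × len(v) comparison matrix that broadcasting builds. Using the setup above:

vs = np.sort(v)
# side='left': index of the first element >= query == number of elements < query
ranks = np.searchsorted(vs, x, side='left') / len(v)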
I think pd.cut can do that
s=pd.Series([-np.inf,0.3, 0.5, 0.7])
pd.cut(v,s,right=False).value_counts().cumsum()/len(v)
Out[702]:
[-inf, 0.3) 0.37
[0.3, 0.5) 0.54
[0.5, 0.7) 0.71
dtype: float64
Result from your function
np.vectorize(percentile_rank)(np.array([0.3, 0.5, 0.7]))
Out[696]: array([0.37, 0.54, 0.71])
You can use quantile (note that this goes the other way around: it maps the fractions 0.3, 0.5, 0.7 to values drawn from v, rather than computing the ranks of the values 0.3, 0.5, 0.7):
np.random.seed(123)
v=np.random.rand(100)
s = pd.Series(v)
arr = np.array([0.3,0.5,0.7])
s.quantile(arr)
Output:
0.3 0.352177
0.5 0.506130
0.7 0.644875
dtype: float64
I know I am a little late to the party, but wanted to add that pandas has another way to get what you are after with Series.rank. Just use the pct=True option.
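A minimal illustration of that option (this ranks the elements of v against each other; external query points would still need to be inserted first, as in the concat trick above):

# percentile rank (in (0, 1]) of every element of the reference population
pr = pd.Series(v).rank(pct=True)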
I have a Pandas dataframe with two of the columns containing x,y coordinates that I plot as below:
plt.figure(figsize=(10,5))
plt.scatter(df.x, df.y, s=1, marker = ".")
plt.xlim(-1.5, 1.5)
plt.ylim(0, 2)
plt.xticks(np.arange(-1.5, 1.6, 0.1))
plt.yticks(np.arange(0, 2.1, 0.1))
plt.grid(True)
plt.show()
I want to split the x and y axes every 0.1 units to give 600 bins (30x20). Then I want to know how many of my points are in each bin and the indices of these points so I can look them up in my dataframe. I basically want to create 600 new dataframes, one for each bin.
This is what I've tried so far:
df[(df.x >= -0.1) & (df.x < 0) & (df.y >= 0.7) & (df.y < 0.8)]
This will give me part of the dataframe contained within the square (-0.1 ≤ x < 0) & (0.7 ≤ y < 0.8). I want a way to create 600 of these.
I would use the cut function to create the bins and then group by them and count.
import numpy as np
import pandas as pd

# create fake data with bounds for x and y
df = pd.DataFrame({'x': np.random.rand(1000) * 3 - 1.5,
                   'y': np.random.rand(1000) * 2})

# bin the data into equally spaced groups
x_cut = pd.cut(df.x, np.linspace(-1.5, 1.5, 31), right=False)
y_cut = pd.cut(df.y, np.linspace(0, 2, 21), right=False)

# group and count
df.groupby([x_cut, y_cut]).count()
Output
x y
x y
[-1.5, -1.4) [0, 0.1) 3.0 3.0
[0.1, 0.2) 1.0 1.0
[0.2, 0.3) 3.0 3.0
[0.3, 0.4) NaN NaN
[0.4, 0.5) 1.0 1.0
[0.5, 0.6) 3.0 3.0
[0.6, 0.7) 1.0 1.0
[0.7, 0.8) 2.0 2.0
[0.8, 0.9) 2.0 2.0
[0.9, 1) 1.0 1.0
[1, 1.1) 2.0 2.0
[1.1, 1.2) 1.0 1.0
[1.2, 1.3) 2.0 2.0
[1.3, 1.4) 3.0 3.0
[1.4, 1.5) 2.0 2.0
[1.5, 1.6) 3.0 3.0
[1.6, 1.7) 3.0 3.0
[1.7, 1.8) 1.0 1.0
[1.8, 1.9) 1.0 1.0
[1.9, 2) 1.0 1.0
[-1.4, -1.3) [0, 0.1) NaN NaN
[0.1, 0.2) NaN NaN
[0.2, 0.3) 2.0 2.0
And to completely answer your question, you can add the categories to the original dataframe as columns and then do your searching from there, like this.
# add new columns
df['x_cut'] = x_cut
df['y_cut'] = y_cut
print(df.head(15))
x y x_cut y_cut
0 1.239743 1.348838 [1.2, 1.3) [1.3, 1.4)
1 -0.539468 0.349576 [-0.6, -0.5) [0.3, 0.4)
2 0.406346 1.922738 [0.4, 0.5) [1.9, 2)
3 -0.779597 0.104891 [-0.8, -0.7) [0.1, 0.2)
4 1.379920 0.317418 [1.3, 1.4) [0.3, 0.4)
5 0.075020 0.748397 [0, 0.1) [0.7, 0.8)
6 -1.227913 0.735301 [-1.3, -1.2) [0.7, 0.8)
7 -0.866753 0.386308 [-0.9, -0.8) [0.3, 0.4)
8 -1.004893 1.120654 [-1.1, -1) [1.1, 1.2)
9 0.007665 0.865248 [0, 0.1) [0.8, 0.9)
10 -1.072368 0.155731 [-1.1, -1) [0.1, 0.2)
11 0.819917 1.528905 [0.8, 0.9) [1.5, 1.6)
12 0.628310 1.022167 [0.6, 0.7) [1, 1.1)
13 1.002999 0.122493 [1, 1.1) [0.1, 0.2)
14 0.032624 0.426623 [0, 0.1) [0.4, 0.5)
And then to get the combination that you described above, df[(df.x >= -0.1) & (df.x < 0) & (df.y >= 0.7) & (df.y < 0.8)], you can set the index to x_cut and y_cut and do some hierarchical index selection.
df = df.set_index(['x_cut', 'y_cut'])
df.loc[[('[-0.1, 0)', '[0.7, 0.8)')]]
Output
x y
x_cut y_cut
[-0.1, 0) [0.7, 0.8) -0.043397 0.702029
[0.7, 0.8) -0.032508 0.799284
[0.7, 0.8) -0.036608 0.709394
[0.7, 0.8) -0.025254 0.741085
One of many ways to do it.
# floor each coordinate to its 0.1 grid line and collect the (x, y) cell label per row
bins = (df // .1 * .1).round(1).stack().groupby(level=0).apply(tuple)
# one sub-dataframe per occupied bin, keyed by cell label
dict_of_df = {name: group for name, group in df.groupby(bins)}
You can get the dataframe of counts with
df.groupby(bins).size().unstack()
You could transform your coordinates into their respective bin indices (0–29 for x, 0–19 for y) and increment a matrix of zeros:
import numpy as np

shape = [30, 20]
bins = np.zeros(shape, dtype=int)

# map x into integer bin indices 0..29 (clip so the maximum lands in the last bin)
xmin = np.min(df.x)
xmax = np.max(df.x)
xwidth = xmax - xmin
xind = (((df.x - xmin) / xwidth) * shape[0]).astype(int).clip(0, shape[0] - 1)

# likewise for y, into indices 0..19
ymin = np.min(df.y)
ymax = np.max(df.y)
ywidth = ymax - ymin
yind = (((df.y - ymin) / ywidth) * shape[1]).astype(int).clip(0, shape[1] - 1)

for ind in zip(xind, yind):
    bins[ind] += 1
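For what it's worth, NumPy can produce the same 30×20 count matrix in a single call; a sketch using the fixed 0.1 grid from the question rather than the data's min/max:

# counts has shape (30, 20); xedges/yedges are the bin boundaries
counts, xedges, yedges = np.histogram2d(df.x, df.y, bins=[30, 20],
                                        range=[[-1.5, 1.5], [0, 2]])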