So I have:
import math
import numpy as np
import matplotlib.pyplot as plt

t = [0.0, 3.0, 5.0, 7.2, 10.0, 13.0, 15.0, 20.0, 25.0, 30.0, 35.0]
U = [12.5, 10.0, 7.6, 6.0, 4.4, 3.1, 2.5, 1.5, 1.0, 0.5, 0.3]
U_0 = 12.5
y = []
for number in U:
    y.append(math.log(number / U_0, math.e))  # natural log of U/U_0
(m, b) = np.polyfit(t, y, 1)
yp = np.polyval([m, b], t)
plt.plot(t, yp)
plt.show()
So by doing this I get a linear regression fit with m = -0.1071 and b = 0.0347.
How do I get the deviation or error for the m value?
I would like it in the form m = -0.1071 * (1 ± error).
(m is k and b is n in y = kx + n.)
import numpy as np
import statsmodels.api as sm
import math

U = [12.5, 10.0, 7.6, 6.0, 4.4, 3.1, 2.5, 1.5, 1.0, 0.5, 0.3]
U_0 = 12.5
y = []
for number in U:
    y.append(math.log(number / U_0, math.e))
y = np.array(y)
t = np.array([0.0, 3.0, 5.0, 7.2, 10.0, 13.0, 15.0, 20.0, 25.0, 30.0, 35.0])
t = sm.add_constant(t, prepend=False)  # append a column of ones for the intercept
model = sm.OLS(y, t)
result = model.fit()
print(result.summary())
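If you only want the slope and its uncertainty rather than the full summary table, the fitted result exposes the estimates and their standard errors directly via result.params and result.bse (a short continuation of the snippet above; with prepend=False the slope comes first, then the constant):

m, b = result.params          # slope and intercept (column order here: [t, const])
m_err, b_err = result.bse     # standard errors of the estimates
print("m = %.4f +/- %.4f" % (m, m_err))
print("b = %.4f +/- %.4f" % (b, b_err))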
You can use scipy.stats.linregress:

from scipy import stats

m, b, r_value, p_value, std_err = stats.linregress(t, y)

The quality of the linear regression is given by the correlation coefficient r_value, with r_value = 1.0 for a perfect correlation.
Note that std_err is the standard error of the estimated gradient (the slope m), not of the regression as a whole.
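Another option, since the question already uses np.polyfit: passing cov=True makes polyfit also return the covariance matrix of the fitted coefficients, and the square roots of its diagonal give the 1-sigma uncertainties of m and b. A minimal sketch with the data from the question:

import numpy as np

t = np.array([0.0, 3.0, 5.0, 7.2, 10.0, 13.0, 15.0, 20.0, 25.0, 30.0, 35.0])
U = np.array([12.5, 10.0, 7.6, 6.0, 4.4, 3.1, 2.5, 1.5, 1.0, 0.5, 0.3])
y = np.log(U / U[0])

(m, b), cov = np.polyfit(t, y, 1, cov=True)   # cov is the 2x2 covariance of (m, b)
m_err, b_err = np.sqrt(np.diag(cov))
print("m = %.4f +/- %.4f" % (m, m_err))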
I'm reading this article.
In the "Covariance matrix & SVD" section,
there are two \sigmas, which are \sigma_1 and \sigma_2.
Those values are 14.4 and 0.19, respectively.
How can I get these values?
I already calculated the covariance matrix with Numpy:
import numpy as np
a = np.array([[2.9, -1.5, 0.1, -1.0, 2.1, -4.0, -2.0, 2.2, 0.2, 2.0, 1.5, -2.5],
[4.0, -0.9, 0.0, -1.0, 3.0, -5.0, -3.5, 2.6, 1.0, 3.5, 1.0, -4.7]])
cov_mat = (a.shape[1] - 1) * np.cov(a)
print(cov_mat)
# b = np.std(a, axis=1)**0.5
b = (a.shape[1] - 1) * np.std(a, axis=1)**0.5
# b = np.std(cov_mat, axis=1)
# b = np.std(cov_mat, axis=1)**0.5
print(b)
The result is:
[[ 53.46 73.42]
[ 73.42 107.16]]
[15.98102431 19.0154037 ]
No matter what I do, I can't get 14.4 and 0.19.
Are they just wrong values?
Please help me. Thank you in advance.
Don't know why you "un-sampled" your covariance, but the original np.cov output is what you want to take the eigenvalues of:
np.linalg.eigvalsh(np.cov(a))
Out[]: array([ 0.19403958, 14.4077786 ])
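For reference, the same two numbers also fall out of the SVD of the mean-centered data, which is presumably what the article's \sigma_1 and \sigma_2 refer to: if s are the singular values of the centered data matrix, then s**2 / (n - 1) equals the eigenvalues of np.cov(a). A minimal sketch with your data:

import numpy as np

a = np.array([[2.9, -1.5, 0.1, -1.0, 2.1, -4.0, -2.0, 2.2, 0.2, 2.0, 1.5, -2.5],
              [4.0, -0.9, 0.0, -1.0, 3.0, -5.0, -3.5, 2.6, 1.0, 3.5, 1.0, -4.7]])

n = a.shape[1]
a_centered = a - a.mean(axis=1, keepdims=True)   # center each row (variable)

eigvals = np.linalg.eigvalsh(np.cov(a))          # ascending: [~0.194, ~14.408]
s = np.linalg.svd(a_centered, compute_uv=False)  # singular values, descending
print(eigvals)
print(s**2 / (n - 1))                            # same values, descending order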
Let's say I have 2 arrays like these:
x1 = [ 1.2, 1.8, 2.3, 4.5, 20.0]
y1 = [10.3, 11.8, 12.3, 11.5, 11.5]
and other two that represent the same function but sampled in different values
x2 = [ 0.2, 1.8, 5.3, 15.5, 17.2, 18.3, 20.0]
y2 = [10.3, 11.8, 12.3, 12.5, 15.2, 10.3, 10.0]
is there a way with numpy to merge x1 and x2 and, based on the result, also merge the related values of y without explicitly looping over the arrays (for example by averaging the y values, or taking the max for that interval)?
I don't know if you can find something in numpy, but here is a solution using pandas instead. (Pandas uses numpy behind the scenes, so there isn't much data conversion.)
import numpy as np
import pandas as pd
x1 = np.asarray([ 1.2, 1.8, 2.3, 4.5, 20.0])
y1 = np.asarray([10.3, 11.8, 12.3, 11.5, 11.5])
x2 = np.asarray([ 0.2, 1.8, 5.3, 15.5, 17.2, 18.3, 20.0])
y2 = np.asarray([10.3, 11.8, 12.3, 12.5, 15.2, 10.3, 10.0])
c1 = pd.DataFrame({'x': x1, 'y': y1})
c2 = pd.DataFrame({'x': x2, 'y': y2})
c = pd.concat([c1, c2]).groupby('x').mean().reset_index()
x = c['x'].values
y = c['y'].values
# Result:
x = array([ 0.2, 1.2, 1.8, 2.3, 4.5, 5.3, 15.5, 17.2, 18.3, 20. ])
y = array([10.3 , 10.3, 11.8, 12.3, 11.5, 12.3, 12.5, 15.2, 10.3, 10.75])
Here I concatenate the two vectors and do a groupby operation to group the equal values of 'x'. For these "groups" I then take the mean(). reset_index() will then move the index 'x' back to a column. To get the result back as a numpy array I use .values. (Use .to_numpy() for pandas version 0.24.0 and higher.)
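The question also mentions taking the max for an interval instead of the average; the same pipeline works, you only swap mean() for max():

c = pd.concat([c1, c2]).groupby('x').max().reset_index()
x = c['x'].values
y = c['y'].values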
How about using numpy.hstack followed by sorting with numpy.sort?
In [101]: x1_arr = np.array(x1)
In [102]: x2_arr = np.array(x2)
In [103]: y1_arr = np.array(y1)
In [104]: y2_arr = np.array(y2)
In [111]: np.sort(np.hstack((x1_arr, x2_arr)))
Out[111]:
array([ 0.2, 1.2, 1.8, 1.8, 2.3, 4.5, 5.3, 15.5, 17.2, 18.3, 20. ,
20. ])
In [112]: np.sort(np.hstack((y1_arr, y2_arr)))
Out[112]:
array([10. , 10.3, 10.3, 10.3, 11.5, 11.5, 11.8, 11.8, 12.3, 12.3, 12.5,
15.2])
If you want to get rid of the duplicates, you can apply numpy.unique on top of the above results.
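For example, applied to the merged x values, the duplicated 1.8 and 20.0 collapse into single entries:

In [113]: np.unique(np.hstack((x1_arr, x2_arr)))
Out[113]: array([ 0.2,  1.2,  1.8,  2.3,  4.5,  5.3, 15.5, 17.2, 18.3, 20. ])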
I'd propose a solution based on the accepted answer of this question:
import numpy as np
import pylab as plt
x1 = [1.2, 1.8, 2.3, 4.5, 20.0]
y1 = [10.3, 11.8, 12.3, 11.5, 11.5]
x2 = [0.2, 1.8, 5.3, 15.5, 17.2, 18.3, 20.0]
y2 = [10.3, 11.8, 12.3, 12.5, 15.2, 10.3, 10.0]
# create a merged and sorted x array
x = np.concatenate((x1, x2))
ids = x.argsort(kind='mergesort')
x = x[ids]
# find unique values
flag = np.ones_like(x, dtype=bool)
np.not_equal(x[1:], x[:-1], out=flag[1:])
# discard duplicated values
x = x[flag]
# merge, sort and select values for y
y = np.concatenate((y1, y2))[ids][flag]
plt.plot(x, y, marker='s', color='b', ls='-.')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
This is the result:
x = [ 0.2 1.2 1.8 2.3 4.5 5.3 15.5 17.2 18.3 20. ]
y = [10.3 10.3 11.8 12.3 11.5 12.3 12.5 15.2 10.3 11.5]
As you notice, this code keeps only one value of y if several are available for the same x: in this way, the code is faster.
Bonus solution: the following solution is based on a loop and mainly standard Python functions and objects (not numpy), so I know it may not be acceptable; nevertheless, it is very concise and elegant and it handles multiple values for y, so I decided to include it here as a plus:
x = sorted(set(x1 + x2))
y = np.nanmean([[d.get(i, np.nan) for i in x]
for d in map(lambda a: dict(zip(*a)), ((x1, y1), (x2, y2)))], axis=0)
In this case, you get the following results:
x = [0.2, 1.2, 1.8, 2.3, 4.5, 5.3, 15.5, 17.2, 18.3, 20.0]
y = [10.3 10.3 11.8 12.3 11.5 12.3 12.5 15.2 10.3 10.75]
I define a simple computational graph involving a variable. When I change the value of the variable, it has the expected influence on the output of the computational graph (so everything works as expected):
import tensorflow as tf

s = tf.Session()
x = tf.placeholder(tf.float32)
c = tf.Variable([1.0, 1.0, 1.0], tf.float32)
y = x + c
c = tf.assign(c, [3.0, 3.0, 3.0])
s.run(c)
print('Y1:', s.run(y, {x: [10.0, 20.0, 30.0]}))
c = tf.assign(c, [2.0, 2.0, 2.0])
s.run(c)
print('Y2:', s.run(y, {x: [10.0, 20.0, 30.0]}))
When I call this code I get:
Y1: [ 13. 23. 33.]
Y2: [ 12. 22. 32.]
So, the values after Y1 and Y2 are different, as expected, because they are calculated with different values of c.
The problems start if I assign a value to the variable c before I define how it is involved in the calculation of y. In that case I cannot assign a new value to c.
s = tf.Session()
x = tf.placeholder(tf.float32)
c = tf.Variable([1.0, 1.0, 1.0], tf.float32)
c = tf.assign(c, [4.0, 4.0, 4.0])  # this is the line that causes problems
y = x + c
c = tf.assign(c, [3.0, 3.0, 3.0])
s.run(c)
print('Y1:', s.run(y, {x: [10.0, 20.0, 30.0]}))
c = tf.assign(c, [2.0, 2.0, 2.0])
s.run(c)
print('Y2:', s.run(y, {x: [10.0, 20.0, 30.0]}))
As the output I get:
Y1: [ 14. 24. 34.]
Y2: [ 14. 24. 34.]
As you can see, each time I calculate y, I get results involving the old values of c. Why is that?
With TensorFlow, always keep in mind that you're building a computation graph. In your first code snippet, you basically define y = tf.placeholder(tf.float32) + tf.Variable([1.0, 1.0, 1.0], tf.float32). In your second example, you define y = tf.placeholder(tf.float32) + tf.assign(tf.Variable([1.0, 1.0, 1.0], tf.float32), [4.0, 4.0, 4.0]).
So, no matter which value you assign to c, the computation graph contains the assign operation and will always assign [4.0, 4.0, 4.0] to it before computing the sum.
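A minimal sketch of the workaround (TF1-style, same session API as in the question, not taken from the original snippets): keep the variable handle separate from the assign operations, so y is built on the variable itself and the assignments are run explicitly when needed:

import tensorflow as tf

s = tf.Session()
x = tf.placeholder(tf.float32)
c = tf.Variable([1.0, 1.0, 1.0], tf.float32)
y = x + c                                   # y reads the variable, not an assign op

assign_3 = tf.assign(c, [3.0, 3.0, 3.0])    # keep assign ops as separate handles
assign_2 = tf.assign(c, [2.0, 2.0, 2.0])

s.run(tf.global_variables_initializer())
s.run(assign_3)
print('Y1:', s.run(y, {x: [10.0, 20.0, 30.0]}))   # computed with c = [3, 3, 3]
s.run(assign_2)
print('Y2:', s.run(y, {x: [10.0, 20.0, 30.0]}))   # computed with c = [2, 2, 2]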
I think this is because you define the add operation y = x + c right after c = tf.assign(c, [4.0, 4.0, 4.0]). So each time you evaluate y, that c = tf.assign(c, [4.0, 4.0, 4.0]) op is executed as well; the other assign operations are also executed, but they don't affect the final result.
I have been trying to test the numpy.gradient function recently. However, its behavior is a little bit strange to me. I created an array with random values and then applied numpy.gradient to it, but the values seem crazy and irrelevant. But when using numpy.diff the values are correct.
So, after viewing the documentation of numpy.gradient, I see that it uses distance=1 over the desired dimension.
This is what I mean:
import numpy as np

a = np.array([10, 15, 13, 24, 15, 36, 17, 28, 39])
np.gradient(a)
"""
Got this: array([ 5. , 1.5, 4.5, 1. , 6. , 1. , -4. , 11. , 11. ])
"""
np.diff(a)
"""
Got this: array([ 5, -2, 11, -9, 21, -19, 11, 11])
"""
I don't understand how the values in the first result came about. If the default distance is supposed to be 1, then I should have gotten the same results as numpy.diff.
Could anyone explain what distance means here? Is it relative to the array index or to the value in the array? If it depends on the value, does that mean numpy.gradient cannot be used with images, since the values of neighboring pixels have no fixed differences?
import numpy as np
import matplotlib.pyplot as plt

# load image
img = np.array([[21.0, 20.0, 22.0, 24.0, 18.0, 11.0, 23.0],
                [21.0, 20.0, 22.0, 24.0, 18.0, 11.0, 23.0],
                [21.0, 20.0, 22.0, 24.0, 18.0, 11.0, 23.0],
                [21.0, 20.0, 22.0, 99.0, 18.0, 11.0, 23.0],
                [21.0, 20.0, 22.0, 24.0, 18.0, 11.0, 23.0],
                [21.0, 20.0, 22.0, 24.0, 18.0, 11.0, 23.0],
                [21.0, 20.0, 22.0, 24.0, 18.0, 11.0, 23.0]])
print("image =", img)

# compute gradient of image
gx, gy = np.gradient(img)
print("gx =", gx)
print("gy =", gy)
# plotting
plt.close("all")
plt.figure()
plt.suptitle("Image, and its gradient along each axis")
ax = plt.subplot("131")
ax.axis("off")
ax.imshow(img)
ax.set_title("image")
ax = plt.subplot("132")
ax.axis("off")
ax.imshow(gx)
ax.set_title("gx")
ax = plt.subplot("133")
ax.axis("off")
ax.imshow(gy)
ax.set_title("gy")
plt.show()
Central differences in the interior and first differences at the boundaries.
(15 - 10) / 1
(13 - 10) / 2
(24 - 15) / 2
...
(39 - 28) / 1
For the boundary points, np.gradient uses the formulas
f'(x) = [f(x+h)-f(x)]/h for the left endpoint, and
f'(x) = [f(x)-f(x-h)]/h for the right endpoint.
For the interior points, it uses the formula
f'(x) = [f(x+h)-f(x-h)]/(2h)
The second approach is more accurate - O(h^2) vs O(h). Thus at the second data point, np.gradient estimates the derivative as (13-10)/2 = 1.5.
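To make this concrete, here is a small sketch (not part of the original answer) that reproduces np.gradient by hand with the formulas above: one-sided differences at the two ends, central differences with spacing 1 in the interior.

import numpy as np

a = np.array([10, 15, 13, 24, 15, 36, 17, 28, 39], dtype=float)

manual = np.empty_like(a)
manual[0] = a[1] - a[0]                  # forward difference at the left end
manual[-1] = a[-1] - a[-2]               # backward difference at the right end
manual[1:-1] = (a[2:] - a[:-2]) / 2.0    # central differences in the interior

print(manual)             # [ 5.   1.5  4.5  1.   6.   1.  -4.  11.  11. ]
print(np.gradient(a))     # same values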
I made a video explaining the mathematics: https://www.youtube.com/watch?v=NvP7iZhXqJQ
I have two arrays with the following values:
>>> x = [24.0, 13.0, 12.0, 22.0, 21.0, 10.0, 9.0, 12.0, 7.0, 14.0, 18.0,
... 1.0, 18.0, 15.0, 13.0, 13.0, 12.0, 19.0, 13.0]
>>> y = [10.0, 9.0, 22.0, 7.0, 4.0, 7.0, 56.0, 5.0, 24.0, 25.0, 11.0, 2.0,
... 9.0, 1.0, 9.0, 12.0, 9.0, 4.0, 2.0]
I used the scipy library to calculate r-squared:
>>> from scipy.interpolate import polyfit
>>> p1 = polyfit(x, y, 1)
When I run the code below:
>>> yfit = p1[0] * x + p1[1]
>>> yfit
array([], dtype=float64)
The yfit array is empty. I don't understand why.
The problem is you are performing scalar addition with an empty list.
The reason you have an empty list is that you perform scalar multiplication with a Python list rather than with a numpy.array. The scalar is truncated to an integer, 0, which creates a zero-length list.
We'll explore this below, but to fix it you just need your data in numpy arrays instead of in lists. Either create it originally, or convert the lists to arrays:
>>> import numpy
>>> x = numpy.array([24.0, 13.0, 12.0, 22.0, 21.0, 10.0, 9.0, 12.0, 7.0, 14.0,
...                  18.0, 1.0, 18.0, 15.0, 13.0, 13.0, 12.0, 19.0, 13.0])
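For completeness, a minimal sketch of the whole computation with both inputs as numpy arrays (using numpy.polyfit directly here, instead of the scipy import from the question):

import numpy as np

x = np.array([24.0, 13.0, 12.0, 22.0, 21.0, 10.0, 9.0, 12.0, 7.0, 14.0,
              18.0, 1.0, 18.0, 15.0, 13.0, 13.0, 12.0, 19.0, 13.0])
y = np.array([10.0, 9.0, 22.0, 7.0, 4.0, 7.0, 56.0, 5.0, 24.0, 25.0, 11.0,
              2.0, 9.0, 1.0, 9.0, 12.0, 9.0, 4.0, 2.0])

p1 = np.polyfit(x, y, 1)
yfit = p1[0] * x + p1[1]    # element-wise, now that x is an ndarray
# equivalently: yfit = np.polyval(p1, x)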
An explanation of what was going on follows:
Let's unpack the expression yfit = p1[0] * x + p1[1].
The component parts are:
>>> p1[0]
-0.58791208791208893
p1[0] isn't a float however, it's a numpy data type:
>>> type(p1[0])
<class 'numpy.float64'>
x is as given above.
>>> p1[1]
20.230769230769241
Similar to p1[0], the type of p1[1] is also numpy.float64:
>>> type(p1[0])
<class 'numpy.float64'>
Multiplying a list by a non-integer truncates the number to an integer, so p1[0], which is -0.58791208791208893, becomes 0:
>>> p1[0] * x
[]
as
>>> 0 * [1, 2, 3]
[]
Finally you are adding the empty list to p1[1], which is a numpy.float64.
This doesn't try to append the value to the empty list. It performs scalar addition, i.e. it adds 20.230769230769241 to each entry in the list.
However, since the list is empty there is no effect, other than returning an empty numpy array with dtype float64:
>>> [] + p1[1]
array([], dtype=float64)
An example of a scalar addition having an effect:
>>> [10, 20, 30] + p1[1]
array([ 30.23076923, 40.23076923, 50.23076923])