Python Linear Regression Error

I have two arrays with the following values:
>>> x = [24.0, 13.0, 12.0, 22.0, 21.0, 10.0, 9.0, 12.0, 7.0, 14.0, 18.0,
... 1.0, 18.0, 15.0, 13.0, 13.0, 12.0, 19.0, 13.0]
>>> y = [10.0, 9.0, 22.0, 7.0, 4.0, 7.0, 56.0, 5.0, 24.0, 25.0, 11.0, 2.0,
... 9.0, 1.0, 9.0, 12.0, 9.0, 4.0, 2.0]
I used the scipy library to calculate r-squared:
>>> from scipy import polyfit
>>> p1 = polyfit(x, y, 1)
When I run the code below:
>>> yfit = p1[0] * x + p1[1]
>>> yfit
array([], dtype=float64)
The yfit array is empty. I don't understand why.

The problem is that you are performing scalar addition with an empty list.
The reason you have an empty list is that you multiplied a numpy scalar by a Python list rather than by a numpy.array. The scalar is truncated to the integer 0, and repeating a list 0 times produces a zero-length list.
We'll explore this below, but the fix is simply to keep your data in numpy arrays instead of lists. Either create it that way originally, or convert the lists to arrays:
>>> x = numpy.array([24.0, 13.0, 12.0, 22.0, 21.0, 10.0, 9.0, 12.0, 7.0, 14.0,
... 18.0, 1.0, 18.0, 15.0, 13.0, 13.0, 12.0, 19.0, 13.0])
>>> y = numpy.array(y)
An explanation of what was going on follows:
Let's unpack the expression yfit = p1[0] * x + p1[1].
The component parts are:
>>> p1[0]
-0.58791208791208893
p1[0] isn't a plain Python float, however; it's a numpy scalar type:
>>> type(p1[0])
<class 'numpy.float64'>
x is as given above.
>>> p1[1]
20.230769230769241
Similar to p1[0], the type of p1[1] is also numpy.float64:
>>> type(p1[1])
<class 'numpy.float64'>
Multiplying a Python list by a numpy.float64 truncates the float to an integer before doing list repetition, so p1[0], which is -0.58791208791208893, becomes 0:
>>> p1[0] * x
[]
just as
>>> 0 * [1, 2, 3]
[]
Finally, you are adding the empty list to p1[1], which is a numpy.float64.
This doesn't try to append the value to the empty list. It performs scalar addition, i.e. it adds 20.230769230769241 to each entry in the list.
However, since the list is empty, there is no effect other than returning an empty numpy array with dtype float64:
>>> [] + p1[1]
array([], dtype=float64)
An example of a scalar addition having an effect:
>>> [10, 20, 30] + p1[1]
array([ 30.23076923, 40.23076923, 50.23076923])
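Putting it all together, a minimal fixed version of the original snippet (using numpy.polyfit, the function that scipy historically re-exported):
import numpy as np

x = np.array([24.0, 13.0, 12.0, 22.0, 21.0, 10.0, 9.0, 12.0, 7.0, 14.0, 18.0,
              1.0, 18.0, 15.0, 13.0, 13.0, 12.0, 19.0, 13.0])
y = np.array([10.0, 9.0, 22.0, 7.0, 4.0, 7.0, 56.0, 5.0, 24.0, 25.0, 11.0, 2.0,
              9.0, 1.0, 9.0, 12.0, 9.0, 4.0, 2.0])

p1 = np.polyfit(x, y, 1)     # slope and intercept of the least-squares line
yfit = p1[0] * x + p1[1]     # elementwise, now that x is an ndarray
# equivalently: yfit = np.polyval(p1, x)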

Related

Get an array of corresponding values in a reference array from very big input array

I have the following array:
table = np.array([
[1.0, 1.0, 3.0, 5.0],
[1.0, 2.0, 5.0, 3.0],
...
[2.0, 5.0, 2.0, 1.0],
[8.0, 9.0, 7.0, 2.0]])
Let's name the different columns respectively by ['a', 'b', 'm', 'n'].
"table" is my my reference table where I want to extract 'm' and 'n' given 'a' and 'b' contained in a list we will call 'my_list'. In that list, we allow duplicate pairs (a, b).
N.B.: here 'list' can be read as 'array' (not in the Python sense).
It is easy to do this with a for loop, but 'my_list' can contain more than 100000 pairs (a, b), so a for loop is not fast enough for my work.
How can I do it with numpy functions or pandas functions in a few lines (1 to 3 lines)?
An example of what I want: Given the following list
my_list = np.array([
[1.0, 2.0],
[1.0, 2.0],
[8.0, 9.0]])
I want to have the following result:
results = np.array([
[5.0, 3.0],
[5.0, 3.0],
[7.0, 2.0]])
Thank you in advance
Edit 1: equivalence with for loop
Here is the equivalent with a for loop (the simplest way, without dichotomous search):
result = []
for x in my_list:
    for y in table:
        if (x[0] == y[0]) and (x[1] == y[1]):
            result.append([y[2], y[3]])
            break
print(result)
One possible approach using pandas is to perform an inner merge:
import pandas as pd

pd.DataFrame(table).merge(pd.DataFrame(my_list))[[2, 3]].to_numpy()
array([[5., 3.],
[5., 3.],
[7., 2.]])
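If you'd rather stay in numpy, a broadcasting sketch along these lines should also work, assuming every pair in my_list actually occurs in table; note that it builds a len(my_list) x len(table) boolean mask, so memory can become a concern with 100000+ pairs:
import numpy as np

# compare every (a, b) query against the first two columns of every table row
matches = (my_list[:, None, :] == table[None, :, :2]).all(axis=2)
# index of the first matching table row for each query
idx = matches.argmax(axis=1)
results = table[idx, 2:]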

Python How to Decompress a dictionary

I have a dictionary with:
inds = [0, 3, 7, 3, 3, 5, 1]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
d = {'inds': inds, 'vals': vals}
print(d) gives: {'inds': [0, 3, 7, 3, 3, 5, 1], 'vals': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]}
As you can see, the inds (keys) are not ordered, there are dupes, and some are missing: the range is 0 to 7, but only the distinct integers 0, 1, 3, 5, 7 appear. I want to write a function that takes the dictionary d and decompresses it into the full vector shown below. For any repeated index (3 in this case) I'd like to sum the corresponding values, and for the missing indices I want 0.0.
# ind: 0 1 2 3* 4 5 6 7
x == [1.0, 7.0, 0.0, 11.0, 0.0, 6.0, 0.0, 3.0]
Trying to write a function that returns me a final list... something like this:
def decompressor(d, n=None):
    final_list = []
    for i in final_list:
        final_list.append()
    return final_list
# final_list.index: 0 1 2 3* 4 5 6 7
# final_list = [1.0, 7.0, 0.0, 11.0, 0.0, 6.0, 0.0, 3.0]
Try it:
xyz = [0.0 for x in range(max(inds) + 1)]
for i in range(len(inds)):
    xyz[inds[i]] += vals[i]
Since xyz is initialised with zeroes, the += covers both the first and the repeated occurrences of an index.
Some things are still not clear to me, but supposing you want a list whose length is the maximum index in your inds list plus one, you can do something like this:
inds = [0, 3, 7, 3, 3, 5, 1]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
# initialize a list of zeroes with length max index + 1
res = [0.0] * (max(inds) + 1)
# [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
# loop over indexes and values in pairs
for i, v in zip(inds, vals):
    # add the value to the corresponding index
    res[i] += v
print(res)
# [1.0, 7.0, 0.0, 11.0, 0.0, 6.0, 0.0, 3.0]
inds = [0, 3, 7, 3, 3, 5, 1]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
First you have to initialise the dictionary, with keys ranging from the min to the max value in the inds list:
max_id = max(inds)
min_id = min(inds)
my_dict = {}
i = min_id
while i <= max_id:
    my_dict[i] = 0.0
    i = i + 1
for i in range(len(inds)):
    my_dict[inds[i]] += vals[i]
# my_dict == {0: 1.0, 1: 7.0, 2: 0.0, 3: 11.0, 4: 0.0, 5: 6.0, 6: 0.0, 7: 3.0}
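If numpy is acceptable, the whole decompression collapses into a single call: np.bincount sums the values at repeated indices and leaves missing indices at zero. A sketch (not taken from the answers above):
import numpy as np

def decompressor(d, n=None):
    inds = np.asarray(d['inds'])
    vals = np.asarray(d['vals'])
    if n is None:
        n = inds.max() + 1  # assume the vector ends at the largest index
    # bincount sums vals at duplicate indices; untouched indices stay 0.0
    return np.bincount(inds, weights=vals, minlength=n).tolist()

decompressor(d)
# [1.0, 7.0, 0.0, 11.0, 0.0, 6.0, 0.0, 3.0]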

sum elements of list under conditions of second list

I'm trying to add up certain elements of two related lists. I'll give an example so you can see what I'm talking about, and at the end I include the code I have. It works, but I want to optimize it; otherwise I have to write lots of things by hand. Apologies if the question is not interesting.
list1 = [4.0, 8.0, 14.0, 20.0, 22.0, 26.0, 28.0, 30.0, 32.0, 34.0, 36.0, 38.0, 40.0]
list2 = [2.1, 1.8, 9.5, 5., 5.4, 6.7, 3.3, 5.3, 8.8, 9.4, 5., 9.3, 3.1]
List1 corresponds to time, and I want to cluster everything in bins of 10 units of time. From list1 I can see that the first and second elements belong to the range 0-10, so I need to add their corresponding values in list2. The third and fourth elements of list1 belong to the range (10 < time <= 20), so I add their corresponding elements of list2; for the third range I add the following four elements of list2, and so on. In the end I would like to create two new lists:
list3 = [10., 20., 30., 40.]
list4 = [3.9, 14.5, 20.7, 35.6]
The code I wrote is the following:
import numpy

list1 = [4.0, 8.0, 14.0, 20.0, 22.0, 26.0, 28.0, 30.0, 32.0, 34.0, 36.0, 38.0, 40.0]
list2 = [2.1, 1.8, 9.5, 5., 5.4, 6.7, 3.3, 5.3, 8.8, 9.4, 5., 9.3, 3.1]
list3 = numpy.arange(10., 50., 10.)
a = [[] for i in range(4)]
for i, j in enumerate(list1):
    if 0. <= j <= 10.:
        a[0].append(list2[i])
    elif 10. < j <= 20.:
        a[1].append(list2[i])
    elif 20. < j <= 30.:
        a[2].append(list2[i])
    elif 30. < j <= 40.:
        a[3].append(list2[i])
list4 = [sum(i) for i in a]
It works; however, the real list1 is a few orders of magnitude larger, and I don't want to write all the ifs (and sublists) by hand. Any suggestions will be appreciated.
First of all, if we are talking about huge datasets, I would use numpy, pandas, or another tool designed for this. In my experience, plain Python is not designed to work with more than 10M elements (unless there is structure in the data you can exploit).
We can do this with numpy as follows:
import numpy as np

# convert the lists to arrays
l1 = np.array(list1)
l2 = np.array(list2)
# determine the "group" of each value
g = (l1 - 0.00001) // 10
# create a boolean mask that flags where the group changes
flag = np.concatenate(([True], g[1:] != g[:-1]))
# determine the indices of the changes
inv_idx, = flag.nonzero()
# calculate the sum per subrange
result = np.add.reduceat(l2, inv_idx)
For your sample output, this gives:
>>> result
array([ 3.9, 14.5, 20.7, 35.6])
The 0.00001 is used to push a 20.0 down to 19.99999 and thus assign it to group 1 instead of group 2. The advantages of this approach are that (a) it works for an arbitrary number of groups and (b) a fixed number of sweeps is done over the list, so it scales linearly with the number of elements. Note that it relies on list1 being sorted, so that equal groups are contiguous.
If you transform your lists into numpy arrays, there is an easy way to extract values from a 1D array based on another one:
import numpy
list1 = numpy.array([4.0, 8.0, 14.0, 20.0, 22.0, 26.0, 28.0, 30.0, 32.0, 34.0, 36.0, 38.0, 40.0])
list2 = numpy.array([2.1, 1.8, 9.5, 5., 5.4, 6.7, 3.3, 5.3, 8.8, 9.4, 5., 9.3, 3.1])
step = 10
step = 10
r, s = range(0, 50, 10), []
for i in r:
    s.append(numpy.sum(list2[(list1 > i) & (list1 <= i + step)]))
print(list(r)[1:], s[:-1])
# [10, 20, 30, 40] [3.9, 14.5, 20.7, 35.6]
Edit
In one line:
s = [numpy.sum(list2[(list1 > i) & (list1 <= i + step)]) for i in r]
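A sketch of another numpy option that avoids both the hand-written bounds and the epsilon trick, assuming the bins are the right-closed intervals (0, 10], (10, 20], ... from the question (it also works if list1 is unsorted):
import numpy as np

t = np.array(list1)
v = np.array(list2)
# map each time to its bin number: (0, 10] -> 1, (10, 20] -> 2, ...
groups = np.ceil(t / 10).astype(int)
# sum the values falling into each bin; drop the always-empty bin 0
list4 = np.bincount(groups, weights=v)[1:]
# array([ 3.9, 14.5, 20.7, 35.6])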

Equating the lengths of the arrays in an array of arrays

Given an array of arrays with different lengths, is there a cleaner (shorter) way to equate the lengths of the arrays, filling the shorter ones with zeros, than the following code?
a = [[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 1.0], [5.0, 5.0, 5.0, 5.0], [1.0, 1.0]]
max_len = 0
for x in a:
    if len(x) > max_len:
        max_len = len(x)
print(max_len)
new = []
for x in a:
    if len(x) < max_len:
        x.extend([0.0] * (max_len - len(x)))
    new.append(x)
print(new)
You can find the length of the largest list within a using either:
len(max(a, key=len))
or
max(map(len, a))
and also use a list comprehension to construct a new list:
>>> a = [[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 1.0], [5.0, 5.0, 5.0, 5.0], [1.0, 1.0]]
>>> m = len(max(a, key=len))
>>> new = [x + [0]*(m - len(x)) for x in a]
>>> new
[[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 1.0, 0], [5.0, 5.0, 5.0, 5.0], [1.0, 1.0, 0, 0]]
In: b = [i + [0.] * (max(map(len, a)) - len(i)) for i in a]
In: b
Out:
[[1.0, 2.0, 3.0, 4.0],
 [2.0, 3.0, 1.0, 0.0],
 [5.0, 5.0, 5.0, 5.0],
 [1.0, 1.0, 0.0, 0.0]]
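A standard-library alternative worth mentioning (a sketch, not from the answers above): itertools.zip_longest pads while transposing, so transposing twice pads every row to the maximum length:
from itertools import zip_longest

a = [[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 1.0], [5.0, 5.0, 5.0, 5.0], [1.0, 1.0]]
# transpose with zero-padding, then transpose back
new = [list(row) for row in zip(*zip_longest(*a, fillvalue=0.0))]
# [[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 1.0, 0.0], [5.0, 5.0, 5.0, 5.0], [1.0, 1.0, 0.0, 0.0]]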

Can I use numpy gradient function with images

I have been trying out the numpy.gradient function recently, but its behavior is a little strange to me. I created an array of random values and applied numpy.gradient to it, but the values seemed crazy and irrelevant. When using numpy.diff, however, the values were correct.
So, after viewing the documentation of numpy.gradient, I see that it uses distance=1 over the desired dimension.
This is what I mean:
import numpy as np

a = np.array([10, 15, 13, 24, 15, 36, 17, 28, 39])
np.gradient(a)
"""
Got this: array([ 5. , 1.5, 4.5, 1. , 6. , 1. , -4. , 11. , 11. ])
"""
np.diff(a)
"""
Got this: array([ 5, -2, 11, -9, 21, -19, 11, 11])
"""
I don't understand where the values in the first result came from. If the default distance is supposed to be 1, then I should have gotten the same results as numpy.diff.
Could anyone explain what distance means here? Is it relative to the array index or to the value in the array? If it depends on the value, does that mean that numpy.gradient cannot be used with images, since the values of neighboring pixels have no fixed value differences?
import numpy as np
import matplotlib.pyplot as plt

# load image
img = np.array([[21.0, 20.0, 22.0, 24.0, 18.0, 11.0, 23.0],
                [21.0, 20.0, 22.0, 24.0, 18.0, 11.0, 23.0],
                [21.0, 20.0, 22.0, 24.0, 18.0, 11.0, 23.0],
                [21.0, 20.0, 22.0, 99.0, 18.0, 11.0, 23.0],
                [21.0, 20.0, 22.0, 24.0, 18.0, 11.0, 23.0],
                [21.0, 20.0, 22.0, 24.0, 18.0, 11.0, 23.0],
                [21.0, 20.0, 22.0, 24.0, 18.0, 11.0, 23.0]])
print("image =", img)
# compute gradient of image
gx, gy = np.gradient(img)
print("gx =", gx)
print("gy =", gy)
# plotting
plt.close("all")
plt.figure()
plt.suptitle("Image, and its gradient along each axis")
ax = plt.subplot(131)
ax.axis("off")
ax.imshow(img)
ax.set_title("image")
ax = plt.subplot(132)
ax.axis("off")
ax.imshow(gx)
ax.set_title("gx")
ax = plt.subplot(133)
ax.axis("off")
ax.imshow(gy)
ax.set_title("gy")
plt.show()
Central differences in the interior and one-sided first differences at the boundaries:
15 - 10         # = 5,   left boundary
(13 - 10) / 2   # = 1.5, interior
(24 - 15) / 2   # = 4.5, interior
...
39 - 28         # = 11,  right boundary
For the boundary points, np.gradient uses the formulas
f'(x) = [f(x+h)-f(x)]/h for the left endpoint, and
f'(x) = [f(x)-f(x-h)]/h for the right endpoint.
For the interior points, it uses the formula
f'(x) = [f(x+h)-f(x-h)]/(2h)
The second approach is more accurate: O(h^2) vs O(h). Thus at the second data point, np.gradient estimates the derivative as (13 - 10)/2 = 1.5.
I made a video explaining the mathematics: https://www.youtube.com/watch?v=NvP7iZhXqJQ
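A quick sanity check of these formulas against np.gradient on the array from the question:
import numpy as np

a = np.array([10, 15, 13, 24, 15, 36, 17, 28, 39])
g = np.gradient(a)
# one-sided differences at the boundaries
assert g[0] == a[1] - a[0]           # 5.0
assert g[-1] == a[-1] - a[-2]        # 11.0
# central difference in the interior, e.g. at the second point
assert g[1] == (a[2] - a[0]) / 2     # 1.5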
