I need to find the polynomial function of degree 29 that exactly fits thirty data points. We can be sure that such a function exists. However, the error of numpy.polyfit increases dramatically after only three points.
import numpy as np
y = [126, 34, 78, 120, 83, 62, 104, 6, 70, 142, 147, 63, 35, 126, 9, 84, 7, 122, 93, 29, 95, 141, 42, 102, 38, 96, 130, 83, 138, 148]
print(len(y))
x = np.arange(len(y))
f = np.polyfit(x,y,30)
def eval_polynom(f, x):
    res = 0
    for i in range(len(f)):
        res += f[i] * x**(len(f)-i-1)
    return res
for i in range(len(y)):
    print(y[i], " -- ", eval_polynom(f, x[i]))
My data points are (x,y) with x = [0,1,2,3,4,...,29]
The output is
126 -- 125.941598976
34 -- 34.7366402172
78 -- 73.703669116
120 -- 134.514176467
83 -- 51.6471546864
62 -- 105.143046704
104 -- 70.1470309453
6 -- 13.808372367
70 -- 347.425617622
142 -- -1281.11122538
...
Is there a way to get the exact polynomial function such that the error is 0?
There's almost certainly an integer overflow issue (due to large exponents) in your eval_polynom function, because the values in x are all integers. Try to replace
res += f[i] * x**(len(f)-i-1)
with
res += f[i] * float(x)**(len(f)-i-1)
You'll probably end up with values that still don't perfectly match, but remember that floating point operations are inherently inaccurate. Even more so if numbers become large, as is the case here.
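To illustrate the overflow described above, here is a minimal sketch. It assumes the x values are stored as 64-bit integers (the dtype is set explicitly here, since the default integer width depends on the platform), and the variable names are only illustrative.

```python
import numpy as np

# Fixed-width integers wrap around silently when an array is raised to a large power.
x = np.array([29], dtype=np.int64)
wrapped = int((x ** 29)[0])     # computed in 64-bit integer arithmetic
exact = 29 ** 29                # Python ints have arbitrary precision
as_float = float(x[0]) ** 29    # double precision: inexact, but the right magnitude

print(wrapped == exact)                       # False: the int64 result has overflowed
print(abs(as_float - exact) / exact < 1e-12)  # True: the float result is close
```

Converting to float avoids the wrap-around, but it does not make the degree-29 fit itself well conditioned.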
Plot legend: y is green, the polynomial is red, the error is blue; this is a degree-140 polynomial.
I need to find the polynomial function of degree 29 that exactly fits thirty data points. We can be sure that such a function exists
Why are you sure of this? I tried some tweaks and visualizations, and I think your data points can't be fit by such a polynomial.
I've tried Chebyshev polynomials, which do better, but they still can't fit these values, even with a degree-140 polynomial.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from numpy.polynomial.chebyshev import chebfit,chebval
%matplotlib inline
y = [126, 34, 78, 120, 83, 62, 104, 6, 70, 142, 147, 63, 35, 126, 9, 84, 7, 122, 93, 29, 95, 141, 42, 102, 38, 96, 130, 83, 138, 148]
print(len(y))
x = np.arange(len(y))
c = chebfit(x, y, 30)
p = []
for i in np.arange(len(y)):
    p.append(chebval(i, c))
df = pd.DataFrame(data={'x': x, 'y': y, 'p': p})
df['diff'] = df['y'] - df['p']
sns.pointplot(x = 'x', y = 'y', data=df, color='green')
sns.pointplot(x = 'x', y = 'p', data=df, color='red')
sns.pointplot(x = 'x', y = 'diff', data=df, color='blue')
While not exact, you get much better results if you use NumPy's polyval:
import numpy as np
y = [126, 34, 78, 120, 83, 62, 104, 6, 70, 142, 147, 63, 35, 126, 9, 84, 7, 122, 93, 29, 95, 141, 42, 102, 38, 96, 130, 83, 138, 148]
x = np.arange(len(y))
f = np.polyfit(x ,y, 30)
for i in range(len(y)):
    print(y[i], " -- ", np.polyval(f, x[i]))
which gives
(126, ' -- ', 125.94427340268774)
(34, ' -- ', 34.674505165214924)
(78, ' -- ', 73.961360153890183)
(120, ' -- ', 133.96863767482208)
(83, ' -- ', 52.113307162099574)
(62, ' -- ', 105.65069882437891)
(104, ' -- ', 68.588480573695762)
(6, ' -- ', 14.814788499822299)
(70, ' -- ', 76.373263353880958)
(142, ' -- ', 149.39793233756134)
...
Note that you should be using a degree 29 polynomial to fit 30 points.
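On the original question of whether an error of exactly 0 is attainable: one option is to avoid floating point altogether. The following is a minimal sketch, not what numpy.polyfit does, that evaluates the Lagrange form of the interpolating polynomial with exact rational arithmetic from the standard library. The function name lagrange_eval is made up for this example.

```python
from fractions import Fraction

def lagrange_eval(xs, ys, t):
    """Evaluate the polynomial through the points (xs, ys) at t, using exact rationals."""
    t = Fraction(t)
    total = Fraction(0)
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = Fraction(yi)
        for j, xj in enumerate(xs):
            if j != i:
                # Lagrange basis factor (t - x_j) / (x_i - x_j), kept as an exact fraction
                term *= (t - xj) / Fraction(xi - xj)
        total += term
    return total

y = [126, 34, 78, 120, 83, 62, 104, 6, 70, 142, 147, 63, 35, 126, 9, 84, 7, 122,
     93, 29, 95, 141, 42, 102, 38, 96, 130, 83, 138, 148]
x = list(range(len(y)))

# The interpolant reproduces every data point with zero error.
print(all(lagrange_eval(x, y, xi) == yi for xi, yi in zip(x, y)))  # True
```

Because every intermediate value is a Fraction, there is no rounding at all. The price is speed, which is negligible for thirty points.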
I have a function which returns a multidimensional array of k clusters. My algorithm works for the most part, but I need it to return a categorical array instead of a multidimensional array. Here is my code:
import numpy as np
import pandas as pd
import random
from bokeh.sampledata.iris import flowers
from typing import List, Tuple
def get_closest(data_point: np.ndarray, centroids: np.ndarray):
    """
    Takes a data_point and a nd.array of multiple centroids and returns the index of the centroid closest to data_point
    by computing the euclidean distance for each centroid and picking the closest.
    """
    N = centroids.shape[0]
    dist = np.empty(N)
    for i, c in enumerate(centroids):
        dist[i] = np.linalg.norm(c - data_point)
    index_min = np.argmin(dist)
    return index_min
# Use these centroids in the first iteration of your algorithm if "Random Centroids" is set to False in the Dashboard
DEFAULT_CENTROIDS = np.array([[5.664705882352942, 3.0352941176470587, 3.3352941176470585, 1.0176470588235293],
[5.446153846153847, 3.2538461538461543, 2.9538461538461536, 0.8846153846153846],
[5.906666666666667, 2.933333333333333, 4.1000000000000005, 1.3866666666666667],
[5.992307692307692, 3.0230769230769234, 4.076923076923077, 1.3461538461538463],
[5.747619047619048, 3.0714285714285716, 3.6238095238095243, 1.1380952380952383],
[6.161538461538462, 3.030769230769231, 4.484615384615385, 1.5307692307692309],
[6.294117647058823, 2.9764705882352938, 4.494117647058823, 1.4],
[5.853846153846154, 3.215384615384615, 3.730769230769231, 1.2076923076923078],
[5.52857142857143, 3.142857142857143, 3.107142857142857, 1.007142857142857],
[5.828571428571429, 2.9357142857142855, 3.664285714285714, 1.1]])
def k_means(data_np: np.ndarray, k:int=3, n_iter:int=500, random_initialization=False) -> Tuple[np.ndarray, int]:
    """
    :param data: your data, a numpy array with shape (n_entries, n_features)
    :param k: The number of clusters to compute
    :param n_iter: The maximal number of iterations
    :param random_initialization: If False, DEFAULT_CENTROIDS are used as the centroids of the first iteration.
    :return: A tuple (cluster_indices: A numpy array of cluster_indices,
                      n_iterations: the number of iterations it took until the algorithm terminated)
    """
    # Initialize the algorithm by assigning random cluster labels to each entry in your dataset
    k=k+1
    centroids = data_np[random.sample(range(len(data_np)), k)]
    labels = np.array([np.argmin([(el - c) ** 2 for c in centroids]) for el in data_np])
    clustering = []
    for k in range(k):
        clustering.append(data_np[labels == k])
    # Implement K-Means with a while loop, which terminates either if the centroids don't move anymore, or
    # if the number of iterations exceeds n_iter
    counter = 0
    while counter < n_iter:
        # Compute the new centroids, if random_initialization is false use DEFAULT_CENTROIDS in the first iteration
        # if you use DEFAULT_CENTROIDS, make sure to only pick the k first entries from them.
        if random_initialization is False and counter == 0:
            centroids = DEFAULT_CENTROIDS[random.sample(range(len(DEFAULT_CENTROIDS)), k)]
        # Update the cluster labels using get_closest
        labels = np.array([get_closest(el, centroids) for el in data_np])
        clustering = []
        for i in range(k):
            clustering.append(np.where(labels == i)[0])
        counter += 1
        new_centroids = np.zeros_like(centroids)
        for i in range(k):
            if len(clustering[i]) > 0:
                new_centroids[i] = data_np[clustering[i]].mean(axis=0)
            else:
                new_centroids[i] = centroids[i]
        # if the centroids didn't move, exit the while loop
        if clustering is not None and (centroids == new_centroids).sum() == 0:
            break
        else:
            centroids = new_centroids
        pass
    # return the final cluster labels and the number of iterations it took
    return clustering, counter
# read and store the dataset
data: pd.DataFrame = flowers.copy(deep=True)
data = data.drop(['species'], axis=1)
data_np = np.asarray(data)
clustering, counter = k_means(data_np,4,500,False)
So clustering looks like so
clustering
[array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 57,
98], dtype=int64),
array([60, 93], dtype=int64),
array([ 50, 51, 52, 53, 54, 55, 56, 58, 61, 62, 63, 65, 66,
67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
80, 81, 82, 83, 86, 87, 89, 90, 91, 92, 94, 95, 96,
97, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110,
111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123,
124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136,
137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149],
dtype=int64),
array([59, 64, 84, 85, 88], dtype=int64)]
However, what I'm looking for is an array like
clustering
array([1, 3, 2, ..., 4, 1, 4], dtype=int64)
Also, the while loop always terminates after 1 iteration, which shouldn't be the case.
counter
1
EDIT1:
The code continues as follows.
def callback(attr, old, new):
    # recompute the clustering and update the colors of the data points based on the result
    k = slider_k.value_throttled
    init = select_init.value
    clustering_new, counter_new = k_means(data_np,k,500,init)
    pass
# Create the dashboard
# 1. A Select widget to choose between random initialization or using the DEFAULT_CENTROIDS on top
select_init = Select(title='Random Centroids',value='False',options=['True','False'])
# 2. A Slider to choose a k between 2 and 10 (k being the number of clusters)
slider_k = Slider(start=2,end=10,value=3,step=1,title='k')
# 4. Connect both widgets to the callback
select_init.on_change('value',callback)
slider_k.on_change('value_throttled',callback)
# 3. A ColumnDataSource to hold the data and the color of each point you need
source = ColumnDataSource(dict(petal_length=data['petal_length'],sepal_length=data['sepal_length'],petal_width=data['petal_width'],clustering=clustering))
# 4. Two plots displaying the dataset based on the following table, have a look at the images
# in the handout if this confuses you.
#
# Axis/Plot Plot1 Plot2
# X Petal length Petal width
# Y Sepal length Petal length
#
# Use a categorical color mapping, such as Spectral10, have a look at this section of the bokeh docs:
# https://docs.bokeh.org/en/latest/docs/user_guide/categorical.html#filling
plot1 = figure(plot_width=100,plot_height=100,title='Scatterplot of flowers distribution by petal length and sepal length')
plot1.yaxis.axis_label = 'Sepal length'
plot1.xaxis.axis_label = 'Petal length'
scatter1 = plot1.scatter(x='petal_length',y='sepal_length',source=source,fill_color=factor_cmap('clustering', palette=Spectral10, factors=clustering))
plot2 = figure(plot_width=100,plot_height=100,title='Scatterplot of flowers distribution by petal width and petal length')
plot2.yaxis.axis_label = 'Petal length'
plot2.xaxis.axis_label = 'Petal width'
scatter2 = plot2.scatter(x='petal_width',y='petal_length',source=source,fill_color=factor_cmap('clustering', palette=Spectral10, factors=clustering))
# 5. A Div displaying the currently number of iterations it took the algorithm to update the plot.
div = Div(text='Number of iterations: ')
Thus the end result should look like so
I'm not sure I understand what you need.
If clustering is a list of arrays where the ith array contains the indices of the samples that belong to the ith cluster, and you need to convert it to a single vector of length number_of_samples that gives the cluster each sample belongs to, you can do it like this:
def to_classes(clustering):
    # Get number of samples (you can pass it directly to the function)
    num_samples = sum(x.shape[0] for x in clustering)
    indices = np.empty((num_samples,), dtype=int)  # An uninitialized integer array of the correct size
    for ith, cluster in enumerate(clustering):
        # use the sample indices of this cluster to assign its cluster index
        indices[cluster] = ith
    return indices
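As a quick check of the conversion above, here is a small self-contained sketch. The six-sample clustering is made-up example data, and the helper is restated with an integer dtype so the labels print as integers.

```python
import numpy as np

def to_classes(clustering):
    """Turn a list of per-cluster sample-index arrays into one label per sample."""
    num_samples = sum(len(c) for c in clustering)
    labels = np.empty(num_samples, dtype=int)
    for cluster_id, sample_indices in enumerate(clustering):
        labels[sample_indices] = cluster_id
    return labels

# Hypothetical clustering of six samples into three clusters
clustering = [np.array([0, 3]), np.array([1, 4, 5]), np.array([2])]
labels = to_classes(clustering)
print(labels)  # [0 1 2 0 1 1]
```

Since k_means in the question already builds a labels array with get_closest, returning that array instead of clustering would presumably give the same vector without a conversion step.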
The loop exits after a single iteration because the break condition is wrong. I think what you actually want is:
# note the !=
if clustering is not None and (centroids != new_centroids).sum() == 0:
    break
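The same stopping test can also be written with an explicit tolerance, which is sometimes preferred because exact equality of floating-point centroids can be fragile. This is only a sketch; the helper name centroids_converged and the tol parameter are assumptions, not part of the question's code.

```python
import numpy as np

def centroids_converged(old, new, tol=0.0):
    """Return True when no centroid coordinate moved by more than tol."""
    return bool(np.all(np.abs(new - old) <= tol))

old = np.array([[1.0, 2.0], [3.0, 4.0]])
moved = old + np.array([[0.0, 0.5], [0.0, 0.0]])

print(centroids_converged(old, old.copy()))      # True
print(centroids_converged(old, moved))           # False
print(centroids_converged(old, moved, tol=1.0))  # True
```

With tol=0.0 this is equivalent to the corrected `!=` condition above.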
Description: I have a sample: sample = [100, 86, 51, 100, 95, 100, 12, 61, 0, 0, 12, 86, 0, 52, 62, 76, 91, 91, 62, 91, 65, 91, 9, 83, 67, 58, 56]. I need to calculate third central moment of this sample.
My approach:
I'm making a table with top row being unique values from the sample and bottom row - frequency of each value from the top row:
table = dict(Counter(sample))
Then I'm calculating empirical k-th central moment with this formula:
def empirical_central_moment(table: dict, k):
    mean = sum([value * frequency for value, frequency in table.items()]) / sum(list(table.values()))
    N = sum(list(table.values()))
    return sum([(value - mean)**k * frequency / N for value, frequency in table.items()])
Program:
from collections import Counter
def empirical_central_moment(table: dict, k):
    mean = sum([value * frequency for value, frequency in table.items()]) / sum(list(table.values()))
    N = sum(list(table.values()))
    return sum([(value - mean)**k * frequency / N for value, frequency in table.items()])
sample = [100, 86, 51, 100, 95, 100, 12, 61, 0, 0, 12, 86, 0, 52, 62, 76, 91, 91, 62, 91, 65, 91, 9, 83, 67, 58, 56]
table = dict(Counter(sample))
print(empirical_central_moment(table, 3))
Problem: Instead of the desired -545.33983 ... I'm getting -26721.65147589292, and I just can't wrap my head around why the result is wrong. I'd appreciate any help; thanks in advance.
Your answer is correct; I'm not sure what other answer you might be looking for. In general, unless the purpose of this code is to practice implementing the logic yourself, you don't need to reinvent the wheel. It is faster and safer to do something as simple as:
import scipy.stats
sample = [100, 86, 51, 100, 95, 100, 12, 61, 0, 0, 12, 86, 0, 52, 62, 76, 91, 91, 62, 91, 65, 91, 9, 83, 67, 58, 56]
print(scipy.stats.moment(sample, moment=3, axis=0, nan_policy='propagate'))
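As an independent cross-check, the third central moment can also be computed straight from its definition, m_3 = (1/N) * sum((x_i - mean)^3). The sketch below assumes this population (biased) form is the intended one, which is what the question's function implements.

```python
import numpy as np

sample = [100, 86, 51, 100, 95, 100, 12, 61, 0, 0, 12, 86, 0, 52, 62, 76, 91, 91,
          62, 91, 65, 91, 9, 83, 67, 58, 56]

a = np.asarray(sample, dtype=float)
# Mean of the cubed deviations from the sample mean
m3 = np.mean((a - a.mean()) ** 3)
print(m3)
```

Up to floating-point rounding, this should agree with the value the question's function reports.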
I have multiple 5x5 arrays which are contained within one large array - the overarching shape is: 5 x 5 x 29. I want to sum every 5 x 5 array to produce one single array, instead of 29 single arrays.
I know that you can do something along the lines of:
new_data = data1[:,:,0] + data1[:,:,1] + ... + data1[:,:,28]
However, this gets very cumbersome for large arrays. Is there an easier way to do this?
Assuming you are using NumPy, you should be able to do this with:
In [13]: data1 = np.arange(100).reshape(5, 5, 4) # For example
In [14]: data1[:,:,0] + data1[:,:,1] + data1[:,:,2] + data1[:,:,3] # Bad way
Out[14]:
array([[ 6, 22, 38, 54, 70],
[ 86, 102, 118, 134, 150],
[166, 182, 198, 214, 230],
[246, 262, 278, 294, 310],
[326, 342, 358, 374, 390]])
In [15]: data1.sum(axis=2) # Good way
Out[15]:
array([[ 6, 22, 38, 54, 70],
[ 86, 102, 118, 134, 150],
[166, 182, 198, 214, 230],
[246, 262, 278, 294, 310],
[326, 342, 358, 374, 390]])
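For the 5 x 5 x 29 shape in the question, the same idea can be sketched as follows. The random data stands in for the real array, and the explicit loop is included only to confirm that both approaches give the same result.

```python
import numpy as np

rng = np.random.default_rng(0)
data1 = rng.integers(0, 10, size=(5, 5, 29))  # stand-in for the real data

# Sum over the last axis: 29 slices of shape (5, 5) collapse into one (5, 5) array
summed = data1.sum(axis=2)

# Equivalent explicit loop, for comparison only
looped = np.zeros((5, 5), dtype=data1.dtype)
for i in range(data1.shape[2]):
    looped += data1[:, :, i]

print(summed.shape)                    # (5, 5)
print(np.array_equal(summed, looped))  # True
```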
If you have a list of arrays, use a for loop:
new_data = np.zeros((5, 5))
for i in range(29):
    new_data += data1[:,:,i]
If you have a tensor or some other N-D array, you should review NumPy's ndarray documentation.
You can use a for loop, like this:
import numpy as np
new_data = np.zeros((5, 5))
for i in range(29):
    new_data += data1[:,:,i]
I am trying to use Python to compute a local pixel color average, but my output is not that at all.
Image:
Output:
Code:
image = cv2.imread('perspective.jpeg')
for i in range(image.shape[1]):
    for j in range(image.shape[0]):
        up = image[min(j + 1, image.shape[0]-1), i]
        down = image[max(j - 1, 0), i]
        right = image[j, min(i + 1, image.shape[1]-1)]
        left = image[j, max(i - 1, 0)]
        average = (up + down + left + right + image[j, i]) / 5
        image[j, i] = average
The issue you are observing is due to integer overflow while computing the average. The pixels are of type np.uint8, and adding them together produces a result that is also np.uint8, which is not large enough to hold the sum.
The solution is to cast the pixels to a larger data type before adding them, then cast the final value back to np.uint8 before storing it in the result image.
In fact, casting only one of the values (say up) to a larger data type suffices, as the rest are automatically promoted during the addition.
The corrected code may look like this:
import cv2
import numpy as np
image = cv2.imread('perspective.jpeg')
for i in range(image.shape[1]):
    for j in range(image.shape[0]):
        up = np.float32(image[min(j + 1, image.shape[0]-1), i])
        down = image[max(j - 1, 0), i]
        right = image[j, min(i + 1, image.shape[1]-1)]
        left = image[j, max(i - 1, 0)]
        average = (up + down + left + right + image[j, i]) / 5
        image[j, i] = np.uint8(average)
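The same five-point average can also be sketched without explicit loops, using only NumPy. Two assumptions to note: the function name cross_average is made up for this example, and unlike the loop above it reads every neighbor from the original image rather than from pixels that were already overwritten.

```python
import numpy as np

def cross_average(image: np.ndarray) -> np.ndarray:
    """Average each pixel with its up/down/left/right neighbors, replicating the border."""
    img = image.astype(np.float32)  # widen first so the sum cannot overflow uint8
    pad = [(1, 1), (1, 1)] + [(0, 0)] * (img.ndim - 2)  # pad rows and columns only
    p = np.pad(img, pad, mode="edge")
    total = (p[1:-1, 1:-1] + p[:-2, 1:-1] + p[2:, 1:-1]
             + p[1:-1, :-2] + p[1:-1, 2:])
    return (total / 5).astype(np.uint8)

bright = np.full((4, 4, 3), 200, dtype=np.uint8)
print(np.all(cross_average(bright) == 200))  # True: no wrap-around

spot = np.zeros((3, 3), dtype=np.uint8)
spot[1, 1] = 250
print(cross_average(spot)[1, 1], cross_average(spot)[0, 0])  # 50 0
```

The edge padding plays the role of the min/max index clamping in the loop version.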
You can easily do this with filter2D as shown in the example below. It will work on any number of channels.
im = np.random.randint(0, 256, (5, 5), np.uint8)
kernel = np.array([[0, 1./5, 0], [1./5, 1./5, 1./5], [0, 1./5, 0]])
filt = cv2.filter2D(im, cv2.CV_8U, kernel)
For example:
im
array([[ 14, 127, 221, 74, 2],
[132, 251, 88, 19, 215],
[183, 140, 17, 60, 76],
[208, 144, 182, 11, 64],
[183, 89, 217, 131, 23]], dtype=uint8)
filt
array([[106, 173, 120, 67, 116],
[166, 148, 119, 91, 66],
[161, 147, 97, 37, 95],
[172, 153, 114, 90, 37],
[155, 155, 160, 79, 83]], dtype=uint8)
You can choose the border type; I've used the default.
I have a timeseries with various downcasts. My question is how do I slice a pandas dataframe (or in this case the array, just to keep it simple) to get the data and its indexes of the descending bits of the timeseries?
import matplotlib.pyplot as plt
import numpy as np
b = np.asarray([ 1.3068586 , 1.59882279, 2.11291473, 2.64699527,
3.23948166, 3.81979878, 4.37630243, 4.97740025,
5.59247254, 6.18671493, 6.77414586, 7.43078595,
8.02243495, 8.59612224, 9.22302662, 9.83263379,
10.43125902, 11.0956864 , 11.61107838, 12.09616684,
12.63973254, 12.49437955, 11.6433792 , 10.61083269,
9.50534291, 8.47418827, 7.40571742, 6.56611512,
5.66963658, 4.89748187, 4.10543794, 3.44828054,
2.76866318, 2.24306623, 1.68034463, 1.26568186,
1.44548443, 2.01225076, 2.60715524, 3.21968562,
3.8622007 , 4.57035958, 5.14021305, 5.77879484,
6.42776897, 7.09397923, 7.71722028, 8.30860725,
8.96652218, 9.66157193, 10.23469208, 10.79889453,
10.5788411 , 9.38270646, 7.82070643, 6.74893389,
5.68200335, 4.73429009, 3.78358222, 3.05924946,
2.30428171, 1.78052369, 1.27897065, 1.16840532,
1.59452726, 2.13085096, 2.70989933, 3.3396291 ,
3.97318058, 4.62429262, 5.23997774, 5.91232803,
6.5906609 , 7.21099657, 7.82936331, 8.49636247,
9.15634983, 9.76450244, 10.39680729, 11.04659976,
11.69287237, 12.35692643, 12.99957563, 13.66228386,
14.31806385, 14.91871927, 15.57212978, 16.22288287,
16.84697357, 17.50502002, 18.15907842, 18.83068151,
19.50945548, 20.18020639, 20.84441358, 21.52792846,
22.17933087, 22.84614545, 23.51212887, 24.18308399,
24.8552263 , 25.51709528, 26.18724379, 26.84531493,
27.50690265, 28.16610365, 28.83394822, 29.49621179,
30.15118676, 30.8019521 , 31.46714114, 32.1213546 ,
32.79366952, 33.45233007, 34.12158193, 34.77502197,
35.4532211 , 36.11018053, 36.76540453, 37.41746323])
plt.plot(-b)
plt.show()
You can just set the points with non-negative diffs to NaN and then plot:
bb = pd.Series(-b)
bb[bb.diff().ge(0)] = np.nan
bb.plot()
To get the indexes of descending values, use:
bb.index[bb.diff().lt(0)]
Int64Index([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
14, 15, 16, 17, 18, 19, 20, 37, 38, 39, 40, 41, 42,
43, 44, 45, 46, 47, 48, 49, 50, 51, 65, 66, 67, 68,
69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81,
82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94,
95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107,
108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119],
dtype='int64')
Create a second DataFrame in which everything is shifted by one index, then subtract the two term by term. You should get what you want (only the rows with a negative diff).
Here:
from pandas import DataFrame, concat
df = DataFrame(b)
df = concat([df.shift(1),df],axis = 1)
df.columns = ['t-1','t']
df.reset_index()
df = df.drop(df.index[0])
df['diff'] = df['t']-df['t-1']
res = df[df['diff']<0]
There is also an easy numpy-only solution (the question is tagged pandas but the code uses only numpy) using np.where. You want the points where the graph is descending which means the data is ascending.
# the indices where the data is ascending.
ix, = np.where(np.diff(b) > 0)
# the values
c = b[ix]
Note that this will give you the first value in each ascending pair of consecutive values, while the pandas-based solution gives the second one. To get the same indices just add 1 to ix.
s = pd.Series(b)
assert np.all(s[s.diff() > 0].index == ix + 1)
assert np.all(s[s.diff() > 0] == b[ix + 1])
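If each descending stretch is needed as its own slice rather than as one flat index array, the indices can be grouped into contiguous runs. The following is a sketch under that assumption; the function name descending_runs and the short series are made up for illustration, and for the question's plot it would be applied to -b.

```python
import numpy as np

def descending_runs(values):
    """Return a list of index arrays, one per contiguous descending stretch."""
    values = np.asarray(values)
    # Indices whose value is lower than the previous one
    idx = np.where(np.diff(values) < 0)[0] + 1
    if idx.size == 0:
        return []
    # Start a new run wherever consecutive indices are not adjacent
    breaks = np.where(np.diff(idx) != 1)[0] + 1
    return np.split(idx, breaks)

series = np.array([0.0, -1.0, -2.0, -1.5, -0.5, -1.0, -3.0, -2.0])
runs = descending_runs(series)
print([r.tolist() for r in runs])  # [[1, 2], [5, 6]]
```

Indexing with a run, for example series[runs[0]], then gives the data of that stretch.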