On Kaggle I have found algorithms used for signal denoising, such as Golay filters, spline functions, autoregressive modelling, or KNeighborsRegressor itself.
Link:
https://www.kaggle.com/residentmario/denoising-algorithms
How exactly does it work? I cannot find any article explaining its use for signal denoising. What kind of algorithm is it? I would like to understand how it works.
It is a supervised learning algorithm. The algorithm is first trained on known data and tries to learn a function that best represents that data, so that an output can be produced for a previously unseen input.
Put simply, it determines the value for a previously unseen input as an average of the k nearest points it has previously seen. A better, more detailed explanation can be found here:
https://towardsdatascience.com/the-basics-knn-for-classification-and-regression-c1e8a6c955
In the Kaggle code:
the time vector is:
df.index.values[:, np.newaxis]
and the signal vector is:
df.iloc[:, 0]
It appears the author of the Kaggle notebook first uses the data to train the model - see below:
## define the KNN model
clf = KNeighborsRegressor(n_neighbors=100, weights='uniform')
## train the model
clf.fit(df.index.values[:, np.newaxis],
        df.iloc[:, 0])
This gives him a function that represents the relationship between time and the signal value. He then passes the time vector back into the model to reproduce the signal.
y_pred = clf.predict(df.index.values[:, np.newaxis])
This new signal represents the model's best interpretation of the original signal. As you can see from the link I posted above, you can adjust certain parameters to get a 'cleaner' signal, but doing so can also degrade the original signal.
One thing to note is that, used the way it is in the Kaggle notebook, this method only works for that one signal: since the input is time, it cannot be used to predict future values:
y_pred = clf.predict(df.index.values[:, np.newaxis] + 400000)
ax = pd.Series(df.iloc[:, 0]).plot(color='lightgray')
pd.Series(y_pred).plot(color='black', ax=ax, figsize=(12, 8))
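To make the idea concrete, here is a minimal, self-contained sketch of KNeighborsRegressor used for denoising on synthetic data; the signal, noise level and n_neighbors below are illustrative choices, not taken from the Kaggle notebook.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

t = np.arange(2000)                                   # "time" index
clean = np.sin(2 * np.pi * t / 500)                   # underlying signal
noisy = clean + np.random.normal(scale=0.3, size=t.size)

knn = KNeighborsRegressor(n_neighbors=100, weights='uniform')
knn.fit(t[:, np.newaxis], noisy)                      # X must be 2-D
denoised = knn.predict(t[:, np.newaxis])              # average of the 100 nearest time points
Increasing n_neighbors smooths the output more aggressively (a cleaner but potentially distorted signal), while decreasing it keeps the prediction closer to the noisy input.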
I have followed an example of applying scikit-learn's machine learning to facial recognition.
https://scikit-learn.org/stable/auto_examples/applications/plot_face_recognition.html#sphx-glr-auto-examples-applications-plot-face-recognition-py
I have been able to adapt the example to my own data successfully. However, I am lost on one point:
after preparing the data and training the model, you ultimately end up with the line:
Y_pred = clf.predict(X_test_pca)
This produces a vector of predictions, one per face.
What I can't figure out is how to get any confidence measurement to correspond with that.
The classification method is a forced choice, so each face passed in MUST be classified as one of the known faces, even if it isn't remotely close.
How can I get a number per face that will reflect how well the result matches the known face?
It seems like you are looking for the .predict_proba() method of the scikit-learn estimators. It returns the probabilities of possible outcomes instead of a single prediction.
The example you are referring to uses an SVC. It is a little special with regard to this function, as the documentation states:
The model need to have probability information computed at training time: fit with attribute probability set to True.
So, if you are using the same model as in the example, instantiate it with:
SVC(kernel='rbf', class_weight='balanced', probability=True)
and use .predict_proba() instead of .predict():
y_pred = clf.predict_proba(X_test_pca)
This returns an array of shape (n_samples, n_classes), i.e. the probabilities for each class for each sample. The probabilities for class k can then be accessed as y_pred[:, k], for example.
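As a rough sketch of how you might turn this into a per-face confidence score (clf, X_test_pca and target_names are assumed to come from the linked face-recognition example):
import numpy as np

proba = clf.predict_proba(X_test_pca)            # shape (n_samples, n_classes)
best = np.argmax(proba, axis=1)                  # column of the most likely class
y_pred = clf.classes_[best]                      # predicted label per face
confidence = proba[np.arange(len(proba)), best]  # probability of that prediction

for label, conf in zip(y_pred, confidence):
    print(target_names[label], round(conf, 2))
A low maximum probability is then a hint that the face does not match any known person well, even though the classifier is still forced to pick one.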
I am using the scikit-learn implementation of Gaussian Process Regression, and I want to fit single points one at a time instead of fitting a whole set of points. The resulting alpha coefficients should nevertheless remain the same, e.g.
gpr2 = GaussianProcessRegressor()
for i in range(x.shape[0]):
    gpr2.fit(x[i], y[i])
should be the same as
gpr = GaussianProcessRegressor().fit(x, y)
But when accessing gpr2.alpha_ and gpr.alpha_, they are not the same. Why is that?
Indeed, I am working on a project where new data points keep arriving. I don't want to append to the x, y arrays and fit on the whole dataset again each time, as that is very time-intensive. If x is of size n, I end up with
n + (n-1) + (n-2) + ... + 1 ∈ O(n^2) fittings.
Considering that the fitting itself is quadratic (correct me if I'm wrong), the overall run-time complexity should be O(n^3). It would be more optimal if I could do a single fitting on the n points:
1 + 1 + ... + 1 = n ∈ O(n)
What you refer to is actually called online learning or incremental learning; it is in itself a huge sub-field in machine learning, and is not available out-of-the-box for all scikit-learn models. Quoting from the relevant documentation:
Although not all algorithms can learn incrementally (i.e. without seeing all the instances at once), all estimators implementing the partial_fit API are candidates. Actually, the ability to learn incrementally from a mini-batch of instances (sometimes called “online learning”) is key to out-of-core learning as it guarantees that at any given time there will be only a small amount of instances in the main memory.
Following this excerpt in the linked document above, there is a complete list of all scikit-learn models currently supporting incremental learning, from where you can see that GaussianProcessRegressor is not one of them.
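For completeness, this is roughly what incremental learning looks like with an estimator that does implement partial_fit (SGDRegressor here, on made-up data; it is only a sketch of the API, not a substitute for a Gaussian process):
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
model = SGDRegressor()

# new points arrive in mini-batches; the model is updated without
# ever refitting on the whole accumulated data set
for _ in range(10):
    X_batch = rng.normal(size=(50, 3))
    y_batch = X_batch @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
    model.partial_fit(X_batch, y_batch)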
Although sklearn.gaussian_process.GaussianProcessRegressor does not directly implement incremental learning, it is not necessary to fully retrain your model from scratch.
To fully understand how this works, you should understand the GPR fundamentals. The key idea is that training a GPR model mainly consists of optimising the kernel parameters to maximise the log-marginal likelihood by default (equivalently, minimising its negative). When using the same kernel on similar data, these parameters can be reused. Since the optimiser has a stopping condition based on convergence, reoptimisation can be sped up by initialising the parameters with pre-trained values (a so-called warm start).
Below is an example based on the one in the sklearn docs.
from time import time
from sklearn.datasets import make_friedman2
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct, WhiteKernel
X, y = make_friedman2(n_samples=1000, noise=0.1, random_state=0)
kernel = DotProduct() + WhiteKernel()
start = time()
gpr = GaussianProcessRegressor(kernel=kernel,
                               random_state=0).fit(X, y)
print(f'Time: {time()-start:.3f}')
# Time: 4.851
print(gpr.score(X, y))
# 0.3530096529277589
# the kernel is copied into the regressor, so we need to
# retrieve the trained parameters from the fitted model
kernel.set_params(**(gpr.kernel_.get_params()))
# use slightly different data
X, y = make_friedman2(n_samples=1000, noise=0.1, random_state=1)
# fit again, but with the kernel initialised at the previously
# optimised parameters (the warm start described above)
start = time()
gpr2 = GaussianProcessRegressor(kernel=kernel,
                                random_state=0).fit(X, y)
print(f'Time: {time()-start:.3f}')
# Time: 1.661
print(gpr2.score(X, y))
# 0.38599549162834046
You can see that retraining can be done in significantly less time than training from scratch. Although this is not fully incremental learning, it can help speed up training in a setting with streaming data.
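The same trick can be applied repeatedly as new batches arrive. A sketch (get_next_batch() is a hypothetical data source; as in the example above, each fit uses only the newest batch, with the kernel warm-started from the previous fit):
kernel = DotProduct() + WhiteKernel()
gpr = None
for X_batch, y_batch in get_next_batch():
    if gpr is not None:
        # warm-start: reuse the previously optimised hyperparameters
        kernel.set_params(**gpr.kernel_.get_params())
    gpr = GaussianProcessRegressor(kernel=kernel, random_state=0)
    gpr.fit(X_batch, y_batch)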
I have my own precomputed data for running AP (Affinity Propagation) or K-means in Python. However, when I go to run predict() - I would like to train and test on the data to see whether the clusterings have a good accuracy against the classes or clusters - Python tells me that predict() is not available for "precomputed" data.
Is there another way to run a train/test split on clustered data in Python?
Most clustering algorithms, including AP, have no well-defined way to "predict" on new data. K-means is one of the few cases simple enough to allow a "prediction" consistent with the initial clusters.
Now sklearn has this oddity of trying to squeeze everything into a supervised API. Clustering algorithms have a fit(X, y) method, but ignore y, and are supposed to have a predict method even though the algorithms don't have such a capability.
For affinity propagation, someone at some point decided to add a predict based on k-means: it always predicts the nearest center. Computing the mean is only possible with coordinate data, and hence the method fails with metric='precomputed'.
If you want to replicate this behavior, compute the distances to all cluster centers and choose the argmin, that's all. You can't easily fit this into the sklearn API with "precomputed" metrics. You could require the user to pass a distance vector to all "training" examples for the precomputed metric, but only a few of them are actually needed...
In my opinion, I'd rather remove this method altogether:
It is not in any published research on affinity propagation that I know of.
Affinity propagation is based on concepts of similarity ("affinity"), not on distances or means.
This predict will not return the same results as the labels assigned by AP itself, because AP labels points using a "propagated responsibility" rather than the nearest "center". (The current sklearn implementation may be losing this information...)
Clustering methods don't have a consistent predict anyway - it's not a requirement to have this.
If you want to do this kind of prediction, just pass the cluster centers to a nearest neighbor classifier. That is what is re-implemented here, a hidden NN classifier. So you get more flexibility if you make prediction a second (classification) step.
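A minimal sketch of that manual approach (everything here is an assumption: ap is the fitted AffinityPropagation model and D_new holds precomputed distances from each new point to every training point, shape (n_new, n_train); if what you precomputed are similarities rather than distances, take the argmax instead):
import numpy as np

exemplars = ap.cluster_centers_indices_             # indices of the AP exemplars
new_labels = np.argmin(D_new[:, exemplars], axis=1)  # nearest exemplar per new point
Note that only the columns for the exemplars are needed, which is exactly why requiring a full distance vector to all training points would be wasteful.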
Note that in clustering it is not common to do a train-test split, because you don't use the labels anyway and rely only on unsupervised evaluation methods (if at all, because these have their own array of issues). You cannot reliably do "hyperparameter optimization" here; you have to choose parameters based on experience and on humans looking at the data.
I have strain and temperature data, and I have read this article:
https://www.idtools.com.au/principal-component-regression-python-2/
I'm trying to build a model and predict the strain from the temperature.
I got the following results, and the cross-validation score is negative.
I have the data set here
http://www.mediafire.com/file/r7dg7i9dacvpl2j/curve_fitting_ahmed.xlsx/file
My question is: do these cross-validation results make sense?
My code is the following; the input is a pandas DataFrame.
def pca_analysis(temperature, strain):
    # Import the libraries
    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler
    from sklearn import linear_model
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import r2_score, mean_squared_error

    # Import data
    print("process data")
    T1 = temperature['T1'].tolist()
    W_A1 = strain[0]
    N = len(T1)
    xData = np.reshape(T1, (N, 1))
    yData = np.reshape(W_A1, (N, 1))

    # Define the PCA object and standardise the data
    pca = PCA()
    Xstd = StandardScaler().fit_transform(xData)

    # Run PCA, producing the reduced variable Xreg, and select the first components
    Xreg = pca.fit_transform(Xstd)[:, :2]

    ''' Step 2: regression on selected principal components'''
    # Create linear regression object
    regr = linear_model.LinearRegression()
    # Fit
    regr.fit(Xreg, W_A1)
    # Calibration
    y_c = regr.predict(Xreg)
    # Cross-validation
    y_cv = cross_val_predict(regr, Xreg, W_A1, cv=10)

    # Calculate scores for calibration and cross-validation
    score_c = r2_score(W_A1, y_c)
    score_cv = r2_score(W_A1, y_cv)
    # Calculate mean squared error for calibration and cross-validation
    mse_c = mean_squared_error(W_A1, y_c)
    mse_cv = mean_squared_error(W_A1, y_cv)
    print(mse_c)
    print(mse_cv)
    print(score_c)
    print(score_cv)

    # Regression plot
    z = np.polyfit(W_A1, y_c, 1)
    with plt.style.context(('ggplot')):
        fig, ax = plt.subplots(figsize=(9, 5))
        ax.scatter(W_A1, y_c, c='red', s=0.4, edgecolors='k')
        ax.plot(W_A1, z[1] + z[0] * yData, c='blue', linewidth=1)
        ax.plot(W_A1, W_A1, color='green', linewidth=1)
        plt.title('$R^{2}$ (CV): ' + str(score_cv))
        plt.xlabel('Measured $^{\circ}$Strain')
        plt.ylabel('Predicted $^{\circ}$Strain')
        plt.show()
Here is the result of the PCR.
How would I improve the prediction using this data?
According to the scikit-learn documentation, the value given by r2_score can be negative if your model is arbitrarily worse than a constant model that always predicts the mean. Obviously this is not what one wants from ML; you expect to do better than that baseline.
The first thing I would note is that your data seems like it may be quite nonlinear, in which case PCA struggles to improve model performance.
One potential substitute for PCA which can account for essentially any nonlinearity in the data is an autoencoder used to preprocess the data (there are good articles on these online). Autoencoders can capture nonlinearities if you use non-linear activation functions on some of the hidden layers, which may help your model's performance. There are many articles around the web that explain this; let me know if you want some resources should you choose to pursue this route.
The next thing I would note is that r2_score is not really the best way to measure error; mean squared error is much more common, especially for linear regression. So, if you want to keep the model this simple, I would simply ignore the r2_score and move on from there. That being said, linear regression is not equipped to solve very complex problems due to its simplicity, and judging by the picture you provided, it is pretty clear that a linear fit is very rough on this dataset.
I would be interested to know the difference in mean squared error between the PCA and non-PCA data. The PCA version should have less error than the normal, non-PCA version. If it does not, then either your data is badly non-linear (maybe?) or there is an error in your code (I looked over it and nothing is immediately obviously wrong). For linear regression, mean squared error is almost the unanimous error function of choice, and it is remarkably effective. Hope this answers your question; leave a comment if anything is unclear and I will try to clarify as best I can.
Also, while answering your question, I came across another question that I believe explains your problem pretty well (it uses some math, so be prepared). Most notably, there are situations where the R^2 score is an appropriate measure for your model, but given your results, I would say it is probably a poor choice for this data.
Update: Given the values you get for the mean squared error, my first guess would be that PCA is either 1) not working because of the nature of the data, or 2) implemented incorrectly. While I am not an expert in the libraries you are using, I would make sure that you transform all of the data in the same way, i.e. make sure that PCA-transformed vectors are always being compared with PCA-transformed vectors.
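A rough sketch of that comparison, using a Pipeline so every cross-validation fold applies exactly the same scaling/PCA transform (xData and W_A1 are the arrays from your question; n_components=1 because there is a single input feature):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

plain = make_pipeline(StandardScaler(), LinearRegression())
with_pca = make_pipeline(StandardScaler(), PCA(n_components=1), LinearRegression())

for name, model in [("no PCA", plain), ("PCA", with_pca)]:
    mse = -cross_val_score(model, xData, W_A1, cv=10,
                           scoring='neg_mean_squared_error').mean()
    print(name, mse)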
To move beyond linear regression, I would look into building a simple neural network or an SVR (the latter might be a little trickier). Both methods are proven to work well on complex data and are very adaptable. There are tons of resources online for both, and giving implementation specifics for either is probably out of the scope of this question (you might have to ask a separate one about that).
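For reference, a minimal sketch of the SVR suggestion on the same arrays (the kernel and hyperparameters are illustrative defaults, not tuned for this dataset):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error, r2_score

svr = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=10, epsilon=0.1))
y_cv = cross_val_predict(svr, xData, W_A1, cv=10)
print(mean_squared_error(W_A1, y_cv), r2_score(W_A1, y_cv))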
I have a bunch of sentences and I want to cluster them using scikit-learn spectral clustering. I've run the code and get results with no problem, but every time I run it I get different results. I know this is a problem with the initialization, but I don't know how to fix it. This is the part of my code that runs on the sentences:
vectorizer = TfidfVectorizer(norm='l2',sublinear_tf=True,tokenizer=tokenize,stop_words='english',charset_error="ignore",ngram_range=(1, 5),min_df=1)
X = vectorizer.fit_transform(data)
# connectivity matrix for structured Ward
connectivity = kneighbors_graph(X, n_neighbors=5)
# make connectivity symmetric
connectivity = 0.5 * (connectivity + connectivity.T)
distances = euclidean_distances(X)
spectral = cluster.SpectralClustering(n_clusters=number_of_k,eigen_solver='arpack',affinity="nearest_neighbors",assign_labels="discretize")
spectral.fit(X)
Data is a list of sentences. Every time the code runs, my clustering results differ. How can I get consistent results using spectral clustering? I also have the same problem with KMeans. This is my code for KMeans:
vectorizer = TfidfVectorizer(sublinear_tf=True,stop_words='english',charset_error="ignore")
X_data = vectorizer.fit_transform(data)
km = KMeans(n_clusters=number_of_k, init='k-means++', max_iter=100, n_init=1,verbose=0)
km.fit(X_data)
I appreciate your help.
When using k-means, you want to set the random_state parameter in KMeans (see the documentation). Set this to either an int or a RandomState instance.
km = KMeans(n_clusters=number_of_k, init='k-means++',
max_iter=100, n_init=1, verbose=0, random_state=3425)
km.fit(X_data)
This is important because k-means is not a deterministic algorithm. It usually starts with some randomized initialization procedure, and this randomness means that different runs will start at different points. Seeding the pseudo-random number generator ensures that this randomness will always be the same for identical seeds.
I'm not sure about the spectral clustering example though. From the documentation on the random_state parameter: "A pseudo random number generator used for the initialization of the lobpcg eigen vectors decomposition when eigen_solver == 'amg' and by the K-Means initialization." OP's code doesn't seem to be contained in those cases, though setting the parameter might be worth a shot.
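If you do want to try it, the seed is passed the same way as for KMeans; whether it removes all run-to-run variation depends on which eigen_solver and label-assignment path is actually taken (a sketch using the variables from the question):
spectral = cluster.SpectralClustering(n_clusters=number_of_k,
                                      eigen_solver='arpack',
                                      affinity="nearest_neighbors",
                                      assign_labels="discretize",
                                      random_state=3425)
spectral.fit(X)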
As the others already noted, k-means is usually implemented with randomized initialization. It is intentional that you can get different results.
The algorithm is only a heuristic. It may yield suboptimal results. Running it multiple times gives you a better chance of finding a good result.
In my opinion, when the results vary greatly from run to run, it indicates that the data just does not cluster well with k-means at all. Your results are not much better than random in such a case. If the data is really suited to k-means clustering, the results will be rather stable! If they vary, the clusters may not have the same size, or may not be well separated, and other algorithms may yield better results.
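One rough way to check this stability is to run k-means with a few different seeds and compare the resulting labelings (a sketch; variable names follow the question's code):
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

labelings = [KMeans(n_clusters=number_of_k, n_init=1, random_state=s).fit_predict(X_data)
             for s in range(5)]
for labels in labelings[1:]:
    print(adjusted_rand_score(labelings[0], labels))  # values near 1.0 mean stable clusters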
I had a similar issue, but I wanted a data set from another distribution to be clustered the same way as the original data set. For example, all color images of the original data set were in cluster 0 and all gray images were in cluster 1; for another data set, I wanted color images and gray images to end up in cluster 0 and cluster 1 as well.
Here is the code I borrowed from a Kaggler - in addition to setting random_state to a seed, you reuse the k-means model returned by KMeans to cluster the other data set. This works reasonably well. However, I can't find an official scikit-learn document saying this.
# reference - https://www.kaggle.com/kmader/normalizing-brightfield-stained-and-fluorescence
import numpy as np
from sklearn.cluster import KMeans

seed = 42

def create_color_clusters(img_df, cluster_count=2, cluster_maker=None):
    # fit a new model only when none is passed in; otherwise reuse the fitted one
    if cluster_maker is None:
        cluster_maker = KMeans(cluster_count, random_state=seed)
        cluster_maker.fit(img_df[['Green', 'Red-Green', 'Red-Green-Sd']])
    # assign each image to its nearest cluster center
    img_df['cluster-id'] = np.argmin(
        cluster_maker.transform(img_df[['Green', 'Red-Green', 'Red-Green-Sd']]), -1)
    return img_df, cluster_maker

# Now K-Means your images `img_df` into two clusters
img_df, cluster_maker = create_color_clusters(img_df, 2)
# Cluster another set of images using the same k-means model
another_img_df, _ = create_color_clusters(another_img_df, 2, cluster_maker)
However, even setting random_state to an int seed cannot ensure that the same data will always be grouped the same way across machines. The same data may be clustered as group 0 on one machine and as group 1 on another. But at least with the same K-Means model (cluster_maker in my code) we make sure that data from another distribution will be clustered in the same way as the original data set.
Typically, when running algorithms with many local minima, it's common to take a stochastic approach and run the algorithm many times with different initial states. This gives you multiple results, and the one with the lowest error is usually chosen as the best result.
When I use K-Means I always run it several times and use the best result.
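In scikit-learn this restart-and-keep-the-best strategy is built in via the n_init parameter: k-means is run n_init times with different centroid seeds and the run with the lowest inertia (within-cluster sum of squared distances) is kept. A sketch using the variables from the question:
from sklearn.cluster import KMeans

km = KMeans(n_clusters=number_of_k, init='k-means++', n_init=10)
km.fit(X_data)
print(km.inertia_)   # error of the best of the 10 runs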