In a dataset like this (y is the angle, x is the data-point index):
How do I find the weighted average of each "band" (in this case roughly 0.1 and -90 something) while ignoring potential random stray points?
I was thinking of an FFT, but that might not be the right approach.
Perhaps transforming the data into something like a normal-distribution plot and finding the peaks?
Solve Using KMeans
Step 1. Generate Data
from random import randint
import numpy as np
from matplotlib import pyplot as plt
def gen_pts(mean_, std_, n):
    """Generate Gaussian-distributed random data.

    mean: mean_
    standard deviation: std_
    number of points: n
    """
    return np.random.normal(loc=mean_, scale=std_, size=n)
# Number of groups of horizontal blobs
n_groups = 20
# Generate a random count for each group
counts = [randint(100, 200) for _ in range(n_groups)]
# Generate random mean for each group (i.e. 0 or -90)
means = [np.random.choice([0, -90]) for _ in range(n_groups)]
# All the groups
data = [gen_pts(mean_, 5, n) for mean_, n in zip(means, counts)]
# Concatenate groups into 1D array
X = np.concatenate(data, axis=0)
# Show Data
plt.plot(X)
plt.show()
Step 2. Find Cluster Centers
# KMeans comes from scikit-learn
from sklearn.cluster import KMeans

# Reshape 1D data so it's suitable for the KMeans model
X = X.reshape(-1, 1)
# Get a model for two clusters
kmeans = KMeans(n_clusters=2, init='k-means++', max_iter=300, n_init=10, random_state=0)
# Fit Data to model
pred_y = kmeans.fit_predict(X)
# Cluster Centers
centers = kmeans.cluster_centers_
print(*centers)
# Output: [-89.79165334] [-0.07875314]
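The question also asks about ignoring random stray points. A minimal sketch of one way to do that with the model above (my addition, reusing the X, kmeans and pred_y defined in the answer's code) is to drop points that lie far from their assigned center before recomputing each band's mean:

# Sketch: per-band means after discarding points far from their assigned center.
# Assumes X (shape (n, 1)), kmeans and pred_y from the answer's code above.
band_means = []
for k in range(2):
    members = X[pred_y == k, 0]                           # points assigned to cluster k
    center = kmeans.cluster_centers_[k, 0]
    keep = np.abs(members - center) < 3 * members.std()   # crude 3-sigma outlier filter
    band_means.append(members[keep].mean())
print(band_means)  # roughly [-90, 0] for the generated data (cluster order may vary)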
I have 136 numbers whose distribution is an overlapping mixture of 8 Gaussian distributions. I want to find the mean and variance of each Gaussian. Can you find any mistakes in my code?
file = open("1.txt", 'r')  # data in 1.txt looks like: 0,0,0,0,0,0,1,0,0,1,4,4,6,14,25,43,71,93,123,194...
y = [int(i) for i in file.read().split(',')]  # the data as a list of integers
x = list(range(1, len(y) + 1))  # x values
z = list(zip(x, y))  # z elements are (1, 0), (2, 0), ...
So, through the process above, I obtained a list z whose elements are the 136 points (x, y) on the xy plane, with the given data as the y values.
Now I want to obtain each Gaussian distribution's mean and variance. The basic assumption is that the given data consists of 8 overlapping Gaussian distributions.
import numpy as np
from sklearn.mixture import GaussianMixture
data = np.array(z).reshape(-1,1)
model = GaussianMixture(n_components=8).fit(data)
print(model.means_)
file.close()
Actually, I don't know how to write the code so that it prints the 8 means and variances... Can anyone help me?
You can use this; I have made some sample code for your visualizations:
import numpy as np
from sklearn.mixture import GaussianMixture
import scipy.stats
import matplotlib.pyplot as plt
%matplotlib inline
#Sample data
x = [0,0,0,0,0,0,1,0,0,1,4,4,6,14,25,43,71,93,123,194]
num_components = 2
#Fit a model onto the data
data = np.array(x).reshape(-1,1)
model = GaussianMixture(n_components=num_components).fit(data)
#Get list of means and variances
mu = np.abs(model.means_.flatten())
sd = np.sqrt(np.abs(model.covariances_.flatten()))
#Plotting
extend_window = 50  # for zooming into or out of the graph; the higher it is, the more zoomed out
x_values = np.arange(data.min()-extend_window, data.max()+extend_window, 0.1) #For plotting smooth graphs
plt.plot(data, np.zeros(data.shape), linestyle='None', markersize = 10.0, marker='o') #plot the data on x axis
#plot the different distributions (in this case 2 of them)
for i in range(num_components):
    y_values = scipy.stats.norm(mu[i], sd[i])
    plt.plot(x_values, y_values.pdf(x_values))
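To also answer the "print the 8 means and variances" part of the original question: with the fitted model above (and num_components set to however many components you fit), a minimal sketch would be:

# Sketch: print the mean and variance of each fitted Gaussian component.
# Assumes `model` is the fitted GaussianMixture from the code above (1-D data,
# default covariance_type='full', so covariances_ has shape (n_components, 1, 1)).
for k in range(num_components):
    mean_k = model.means_[k, 0]
    var_k = model.covariances_[k, 0, 0]
    print(f"component {k}: mean = {mean_k:.3f}, variance = {var_k:.3f}")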
I'm working on estimating cloud displacement for wind-energy purposes using RGB GOES satellite images. I am following the methodology from the paper "An Automated Technique for Obtaining Cloud Motion From Geosynchronous Satellite Data Using Cross Correlation", but I don't know if this is a good way to compute it. The code basically computes the cross correlation via the Fourier transform to calculate the cloud displacement between the roi_a and roi_b images.
import numpy as np
import cv2 as cv
import matplotlib.pyplot as plt
img_a = cv.imread('2019.1117.1940.goes-16.rgb.tif', 0)
img_b = cv.imread('2019.1117.1950.goes-16.rgb.tif', 0)
roi_a = img_a[700:900, 1900:2100]
roi_b = img_b[700:900, 1900:2100]
def Fou(image):
    fft_roi = np.fft.fft2(image)
    return fft_roi

def inv_Fou(C_w):
    c_t = np.fft.ifft2(C_w)
    c_t = np.abs(c_t)
    return c_t
#Step 1: gets the FFT
G_t0 = Fou(roi_a)##t_0
fft_roiA_conj = np.conj(G_t0) #Conjugate
G_t1 = Fou(roi_b)##t_1
#Step 2: Compute C(m, v)
prod = np.dot(fft_roiA_conj, G_t1)
#Step 3: Perform the inverse FFT
inv = inv_Fou(prod)
plt.imshow(inv, cmap='gray')
plt.title('C (m,v) --> Cov(p,q)')
plt.xticks([])
plt.yticks([])
plt.show()
#Step 4: Compute cross correlation coefficient and the maximum cross correlation coefficient
def rms(sigma):
    "Compute the standard deviation of an image"
    rms = np.std(sigma)
    return rms
R_t = inv / (rms(roi_a) * rms(roi_b))
This is the first time I've used the FFT on images, so I have some questions about it:
I don't apply fftshift; can this affect the result?
What is the difference between using np.dot in step 2 and a simple '*', like prod = fft_roiA_conj * G_t1?
How do I interpret the image result (C(m, v) -> Cov(p, q)) from step 3?
How can I obtain p' and q' (the x and y positions of the maximum cross-correlation coefficient) from R_t?
1 - fftshift is a circular rotation. If you have a two-sided signal, the correlation you compute is shifted (circularly); what matters is that you map your indices to displacements correctly, with or without fftshift.
2 - numpy.dot is the matrix product (equivalent to the @ operator in recent Python versions), while the * operator does element-wise multiplication. In my understanding you want the element-wise product at step 2.
3 - Once you correct step 2, you will have an image such that inv[i, j] is the correlation of the image roi_a with the image roi_b rolled by i rows and j columns.
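As a concrete illustration of point 2 (my sketch, not code from the question or the paper), the corrected step 2 on the question's variables would be:

# Sketch of the corrected step 2: element-wise product instead of np.dot.
# Reuses fft_roiA_conj, G_t1 and inv_Fou from the question's code.
prod = fft_roiA_conj * G_t1   # element-wise product in the frequency domain
inv = inv_Fou(prod)           # inverse FFT yields the circular cross-correlation surface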
To answer the last question I will work out an example.
I will use the image scipy.misc.face; it is an RGB image, so it provides three channels (matrices) that are highly correlated.
import scipy.misc
import numpy as np
import matplotlib.pyplot as plt
f = scipy.misc.face()
plt.figure(figsize=(12, 4))
plt.subplot(131), plt.imshow(f[:,:, 0])
plt.subplot(132), plt.imshow(f[:,:, 1])
plt.subplot(133), plt.imshow(f[:,:, 2])
The function img_corr combines the three steps of the cross correlation (for images of the same size). Notice that I use rfft2 and irfft2; these are FFTs for real data, which take advantage of symmetry in the frequency domain.
def img_corr(foi_a, foi_b):
    return np.fft.irfft2(np.fft.rfft2(foi_a) * np.conj(np.fft.rfft2(foi_b)))
C = img_corr(f[:,:,1], f[:,:,2])
plt.figure(figsize=(12, 4))
plt.subplot(121), plt.imshow(C), plt.title('FFT indices')
plt.subplot(122), plt.imshow(np.fft.fftshift(C, (0, 1))), plt.title('fftshifted version')
To retrieve the position
# argmax returns the index into the flattened vector of all pixels
best_corr = np.argmax(C)
# unravel_index gives the corresponding 2D index
best_pos = np.unravel_index(best_corr, C.shape)
# this gives the positions as a (signed) fraction of the image size
relative_pos = [np.fft.fftfreq(size)[index] for index, size in zip(best_pos, C.shape)]
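If pixel displacements are more convenient than fractions of the image size (my addition, under the same setup), the same fftfreq mapping gives signed shifts directly:

# Sketch: convert the argmax position into signed pixel displacements (rows, cols).
# Reuses best_pos and C from the code above; indices past the midpoint map to negative shifts.
pixel_shift = [int(round(np.fft.fftfreq(size)[index] * size))
               for index, size in zip(best_pos, C.shape)]
print(pixel_shift)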
I hope this completes the answer.
I wrote a Python script that uses scikit-learn to fit Gaussian Processes to some data.
IN SHORT: the problem I am facing is that while the Gaussian Processes seem to learn the training dataset very well, the predictions for the testing dataset are off, and it seems to me there is a normalization problem behind this.
IN DETAIL: my training dataset is a set of 1500 time series. Each time series has 50 time components. The mapping learnt by the Gaussian Processes is between a set of three coordinates x,y,z (which represent the parameters of my model) and one time series. In other words, there is a 1:1 mapping between x,y,z and one time series, and the GPs learn this mapping. The idea is that, by giving to the trained GPs new coordinates, they should be able to give me the predicted time series associated to those coordinates.
Here is my code:
from __future__ import division
import numpy as np
from matplotlib import pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
coordinates_training = np.loadtxt(...) # read coordinates training x, y, z from file
coordinates_testing = np.loadtxt(..) # read coordinates testing x, y, z from file
# z-score of the coordinates for the training and testing data.
# Note I am using the mean and std of the training dataset ALSO to normalize the testing dataset
mean_coords_training = np.zeros(3)
std_coords_training = np.zeros(3)
for i in range(3):
    mean_coords_training[i] = coordinates_training[:, i].mean()
    std_coords_training[i] = coordinates_training[:, i].std()
    coordinates_training[:, i] = (coordinates_training[:, i] - mean_coords_training[i]) / std_coords_training[i]
    coordinates_testing[:, i] = (coordinates_testing[:, i] - mean_coords_training[i]) / std_coords_training[i]
time_series_training = np.loadtxt(...)# reading time series of training data from file
number_of_time_components = np.shape(time_series_training)[1] # 100 time components
# z_score of the time series
mean_time_series_training = np.zeros(number_of_time_components)
std_time_series_training = np.zeros(number_of_time_components)
for i in range(number_of_time_components):
    mean_time_series_training[i] = time_series_training[:, i].mean()
    std_time_series_training[i] = time_series_training[:, i].std()
    time_series_training[:, i] = (time_series_training[:, i] - mean_time_series_training[i]) / std_time_series_training[i]
time_series_testing = np.loadtxt(...)# reading test data from file
# the number of time components is the same for training and testing dataset
# z-score of testing data, again using mean and std of training data
for i in range(number_of_time_components):
    time_series_testing[:, i] = (time_series_testing[:, i] - mean_time_series_training[i]) / std_time_series_training[i]
# GPs
pred_time_series_training = np.zeros((np.shape(time_series_training)))
pred_time_series_testing = np.zeros((np.shape(time_series_testing)))
# Instantiate a Gaussian Process model
kernel = 1.0 * Matern(nu=1.5)
gp = GaussianProcessRegressor(kernel=kernel)
for i in range(number_of_time_components):
    print("time component", i)
    # Fit to data using Maximum Likelihood Estimation of the parameters
    gp.fit(coordinates_training, time_series_training[:, i])
    # Make the prediction for training and testing coordinates (ask for the std as well)
    y_pred_train, sigma_train = gp.predict(coordinates_training, return_std=True)
    y_pred_test, sigma_test = gp.predict(coordinates_testing, return_std=True)
    # Undo the z-score of the target
    pred_time_series_training[:, i] = y_pred_train * std_time_series_training[i] + mean_time_series_training[i]
    pred_time_series_testing[:, i] = y_pred_test * std_time_series_training[i] + mean_time_series_training[i]
# plot training
fig, ax = plt.subplots(5, figsize=(10,20))
for i in range(5):
    ax[i].plot(time_series_training[100 * i], color='blue', label='Original training')
    ax[i].plot(pred_time_series_training[100 * i], color='black', label='GP predicted - training')
# plot testing
fig, ax = plt.subplots(5, figsize=(10,20))
for i in range(5):
    ax[i].plot(time_series_testing[100 * i], color='blue', label='Original testing')
    ax[i].plot(pred_time_series_testing[100 * i], color='black', label='GP predicted - testing')
Here are examples of performance on the training data.
Here are examples of performance on the testing data.
First, you should use the sklearn preprocessing tools to treat your data:
from sklearn.preprocessing import StandardScaler
There are other useful tools for organizing the data, but this specific one is for normalizing it.
Second, you should normalize the training set and the test set with the same parameters. The model fits the "geometry" of the data to define its parameters; if you apply the model to data on a different scale, it is like using the wrong system of units.
scale = StandardScaler()
training_set = scale.fit_transform(data_train)
test_set = scale.transform(data_test)
This applies the same transformation to both sets.
And finally, you need to normalize the features, not the target; that is, normalize the X entries, not the Y output. Normalization helps the model find the answer faster by changing the topology of the objective function during optimization; the output does not affect this.
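A minimal sketch of that advice applied to the question's GP setup, assuming coordinates_training, coordinates_testing and a raw (un-normalized) time_series_training as loaded in the question: scale the inputs only, using statistics from the training set, and leave the target alone.

# Sketch: scale only the input coordinates; statistics come from the training set.
# Assumes coordinates_training, coordinates_testing and a raw (un-normalized)
# time_series_training array, as loaded in the question's code.
from sklearn.preprocessing import StandardScaler
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

scaler = StandardScaler()
X_train = scaler.fit_transform(coordinates_training)  # fit + transform on training inputs
X_test = scaler.transform(coordinates_testing)        # same transformation on test inputs

gp = GaussianProcessRegressor(kernel=1.0 * Matern(nu=1.5))
gp.fit(X_train, time_series_training[:, 0])           # target (first time component) left unscaled
y_pred_test = gp.predict(X_test)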
I hope this answers your question.
Hi,
I have a dataset from different cohorts and I want to bicluster them with the sklearn function Spectral Biclustering.
As you can see in the link above, this approach uses a kind of normalization to calculate the SVD.
Is it necessary to normalize the data before biclustering, e.g. with StandardScaler (zero mean and a standard deviation of one)? The function above already applies a kind of normalization.
Is that enough, or do I have to normalize the data beforehand, e.g. when it comes from different distributions?
I am getting different results with and without StandardScaler, and I cannot find information in the original paper on whether it is necessary or not.
You can find the code and an example of my dataset below. This is real data, so I do not know the ground truth. At the end I calculate the consensus score to compare the two biclusterings. Unfortunately the clusters are not the same.
I also tried it with artificial data (see the example in the last link), and there the results are the same, but not with the real data.
So how do I know which approach is the right one?
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.cluster import SpectralBiclustering  # sklearn.cluster.bicluster in older scikit-learn versions
from sklearn.metrics import consensus_score
from sklearn.preprocessing import StandardScaler
n_clusters = (4, 4)
data_org = pd.read_csv('raw_data_biclustering.csv', sep=',', index_col=0)
# scale data & transform to dataframe
data_scaled = StandardScaler().fit_transform(data_org)
data_scaled = pd.DataFrame(data_scaled, columns=data_org.columns, index=data_org.index)
# plot original clusters
plt.imshow(data_scaled, aspect='auto', vmin=-3, vmax=5)
plt.title("Original dataset")
plt.show()
data_type = ['none_scaled', 'scaled']
data_all = [data_org, data_scaled]
models_all = []
for name, data in zip(data_type, data_all):
    # spectral biclustering on the dataset
    model = SpectralBiclustering(n_clusters=n_clusters, method='bistochastic',
                                 svd_method='randomized', n_jobs=-1,
                                 random_state=0)
    model.fit(data)

    newOrder_row = [list(r) for r in zip(model.row_labels_, data.index)]
    newOrder_row.sort(key=lambda k: (k[0], k[1]), reverse=False)
    order_row = [i[1] for i in newOrder_row]

    newOrder_col = [list(c) for c in zip(model.column_labels_, [int(x) for x in data.keys()])]
    newOrder_col.sort(key=lambda k: (k[0], k[1]), reverse=False)
    order_col = [i[1] for i in newOrder_col]

    # reorder the data matrix
    X_plot = data_scaled.copy()
    X_plot = X_plot.reindex(order_row)            # rows
    X_plot = X_plot[[str(x) for x in order_col]]  # columns

    # use clustermap without clustering
    cm = sns.clustermap(X_plot, method=None, metric=None, cmap='viridis',
                        row_cluster=False, row_colors=None,
                        col_cluster=False, col_colors=None,
                        yticklabels=1, xticklabels=1,
                        standard_scale=None, z_score=None, robust=False,
                        vmin=-3, vmax=5)
    ax = cm.ax_heatmap

    # set labelsize smaller
    cm_ax = plt.gcf().axes[-2]
    cm_ax.tick_params(labelsize=5.5)

    # plot lines for the different clusters
    hor_lines = [sum(item) for item in model.biclusters_[0]]
    hor_lines = list(np.cumsum(hor_lines[::n_clusters[1]]))
    ver_lines = [sum(item) for item in model.biclusters_[1]]
    ver_lines = list(np.cumsum(ver_lines[:n_clusters[0]]))
    for pp in range(len(hor_lines) - 1):
        cm.ax_heatmap.hlines(hor_lines[pp], 0, X_plot.shape[1], colors='r')
    for pp in range(len(ver_lines) - 1):
        cm.ax_heatmap.vlines(ver_lines[pp], 0, X_plot.shape[0], colors='r')

    # title
    title = name + ' - ' + str(n_clusters[1]) + '-' + str(n_clusters[0])
    plt.title(title)
    cm.savefig(title, dpi=300)
    plt.show()

    # save models
    models_all.append(model)
# compare models
score = consensus_score(models_all[0].biclusters_, models_all[1].biclusters_)
print("consensus score between: {:.1f}".format(score))
I'd like to find a threshold value for a bimodal distribution. For example, a bimodal distribution could look like the following:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(45)
n = 1000; b = n//10; i = np.random.randint(0,2,n)
x = i*np.random.normal(-2.0,0.8,n) + (1-i)*np.random.normal(2.0,0.8,n)
_ = plt.hist(x,bins=b)
An attempt to find the cluster centers did not work, as I wasn't sure how the matrix, h, should be formatted:
from sklearn.cluster import KMeans
h = np.histogram(x,bins=b)
h = np.vstack((0.5*(h[1][:-1]+h[1][1:]),h[0])).T # because h[0] and h[1] have different sizes.
kmeans = KMeans(n_clusters=2).fit(h)
I would expect to be able to find the cluster centers around -2 and 2. The threshold value would then be the midpoint of the two cluster centers.
Your question is a bit confusing to me, so please let me know if I've interpreted it incorrectly. I think you are basically trying to do 1D k-means, introducing frequency as a second dimension to get KMeans to work, but would really just be happy with [-2, 2] as the output for the centers instead of [(-2, y1), (2, y2)].
To do 1D k-means you can just reshape your data into n vectors of length 1 (similar question: Scikit-learn: How to run KMeans on a one-dimensional array?)
code:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(45)
n = 1000;
b = n//10;
i = np.random.randint(0,2,n)
x = i*np.random.normal(-2.0,0.8,n) + (1-i)*np.random.normal(2.0,0.8,n)
_ = plt.hist(x,bins=b)
from sklearn.cluster import KMeans
h = np.histogram(x,bins=b)
h = np.vstack((0.5*(h[1][:-1]+h[1][1:]),h[0])).T # because h[0] and h[1] have different sizes.
kmeans = KMeans(n_clusters=2).fit(x.reshape(n,1))
print(kmeans.cluster_centers_)
output:
[[-1.9896414]
[ 2.0176039]]
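The question also mentions taking the threshold as the midpoint of the two cluster centers; that last step is not shown above, but a minimal sketch (reusing the fitted kmeans) would be:

# Sketch: threshold = midpoint of the two 1D cluster centers.
# Assumes `kmeans` is the fitted model from the code above.
centers = sorted(kmeans.cluster_centers_.flatten())
threshold = 0.5 * (centers[0] + centers[1])
print(threshold)  # close to 0 for this example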