Suppose I have a column of data whose values range from -1.23 to +2.56. What I want is to add 10% random white noise to my data. I'm not sure how to do this in Python; please help me with the code.
Add independent Gaussian (normal) randomness to your values.
Technically it need not be Gaussian. White noise is called that because it has a flat spectrum, meaning it is composed of all frequencies in equal proportions. The Wiener-Khinchin theorem shows that this is mathematically equivalent to having zero serial correlation. Many people believe white noise must be Gaussian, but independence alone is sufficient to yield a flat spectrum.
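A minimal sketch of one common reading of "10% noise", assuming it means the noise standard deviation is 10% of the data's standard deviation (scaling by the data range, 2.56 - (-1.23), is another reasonable choice); the uniform stand-in data is illustrative:

import numpy as np

rng = np.random.default_rng(0)
data = rng.uniform(-1.23, 2.56, size=1000)   # stand-in for your column

# noise std = 10% of the data's std; use 0.1 * (data.max() - data.min())
# instead if "10%" should be relative to the range
noise = rng.normal(0.0, 0.1 * data.std(), size=data.shape)
noisy = data + noise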
I have an intensity vs. velocity spectrum, and my aim is to find the RMS noise in the spectrum, excluding the channels where the peak is present.
After some research, I learned that the RMS noise is the same as the standard deviation of the spectrum, and that the signal-to-noise ratio is the average of the signal divided by that same standard deviation. Can anybody please tell me if I am wrong here?
This is how I coded it in Python (data is a NumPy array):
def Average(data):
    return sum(data) / len(data)
average = Average(data)
print("Average of the list =", average)
standardDev = data.std()
print('The standard deviation is',standardDev)
SNR = average/standardDev
print('SNR = ',SNR)
My original data points are:
x-axis (velocity):
[-5.99999993e+04 -4.99999993e+04 -3.99999993e+04 -2.99999993e+04
-1.99999993e+04 -9.99999934e+03 6.65010004e-04 1.00000007e+04
2.00000007e+04 3.00000007e+04 4.00000007e+04 5.00000007e+04
6.00000007e+04 7.00000007e+04 8.00000007e+04 9.00000007e+04
1.00000001e+05 1.10000001e+05 1.20000001e+05 1.30000001e+05
1.40000001e+05]
y-axis (data):
[ 0.00056511 -0.00098584 -0.00325616 -0.00101042 0.00168894 -0.00097406
-0.00134408 0.00128847 -0.00111633 -0.00151621 0.00299326 0.00916455
0.00960554 0.00317363 0.00311124 -0.00080881 0.00215932 0.00596419
-0.00192256 -0.00190138 -0.00013216]
If I want to measure the standard deviation excluding the channels where the line is present, should I exclude values from y[10] to y[14] and then calculate the standard deviation?
Yes. Since you want to determine properties of the noise, you should exclude the points that do not constitute noise. If those are points number 10 to 14, exclude them.
Then compute the average of the remaining y-values (intensity). However, from your data and the fitting function, a * exp(-(x-c)**2 / w), one might infer that the theoretical value of this mean is just zero. If so, the average is only a means of validating your experiment / theory ("we've obtained almost zero, as expected") and you can use 0 as the true average value. Then the noise level would amount to the square root of the second moment, E(Y^2).
You should compare the stddev from your code with the square root of the second moment; they should be similar to each other, so similar that it should not matter which of them you choose as the noise value.
The SNR (signal-to-noise ratio) part of your derivation is wrong. The signal is the amplitude of the Gaussian obtained from the fit. You divide it by the noise level (either the square root of the second moment, or the stddev). To my eye, you should obtain a value between 2 and about 10.
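A minimal sketch of that computation on the data above, assuming the line occupies channels 10 to 14 and using an illustrative amplitude in place of the one from your fit:

import numpy as np

y = np.array([0.00056511, -0.00098584, -0.00325616, -0.00101042, 0.00168894,
              -0.00097406, -0.00134408, 0.00128847, -0.00111633, -0.00151621,
              0.00299326, 0.00916455, 0.00960554, 0.00317363, 0.00311124,
              -0.00080881, 0.00215932, 0.00596419, -0.00192256, -0.00190138,
              -0.00013216])

mask = np.ones(y.size, dtype=bool)
mask[10:15] = False                          # drop channels 10..14 (the line)

noise_rms = np.sqrt(np.mean(y[mask] ** 2))   # sqrt of the second moment, E(Y^2)
noise_std = y[mask].std()                    # should be very close to noise_rms

a = 0.0096                                   # Gaussian amplitude from the fit (illustrative)
print("SNR =", a / noise_rms)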
Finally, remember that this is a public forum and that some people reading it may be puzzled by the question and answer: both are based on the previous question, Fitting data to a Gaussian profile, which should have been referred to in the question itself.
If this is a university assignment and you are working on real experimental data, remember the purpose. Imagine yourself as a scientist who has to convince others that this is a real signal, say, from aliens, not just an erratic result of Mother Nature tossing dice at random. That is the primary purpose of the signal-to-noise ratio.
After several manipulations of some data, I find myself with an array of values that I want to approximate with a sum of Gaussian curves.
As you can see in the picture, I have a curve (red) and I'd like to find a given number of Gaussians that approximate that curve in order to describe it in another system: e.g. I want to find the four green Gaussians, each described by mean and variance parameters (something like N(m, s)).
The red curve is described by an array of 25 values.
===EDIT===
Here I post all the curves I need to describe, with noise removed, in order to give more info about the kind of data we are talking about:
Intensity curves: these are 11 float arrays of variable length.
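One possible approach, sketched with scipy.optimize.curve_fit on a synthetic 25-point curve standing in for the red one; the number of Gaussians (n = 4) and the initial guesses are assumptions to tune:

import numpy as np
from scipy.optimize import curve_fit

def gauss_sum(x, *params):
    # params packed as [a1, m1, s1, a2, m2, s2, ...]
    y = np.zeros_like(x)
    for a, m, s in zip(params[0::3], params[1::3], params[2::3]):
        y += a * np.exp(-(x - m) ** 2 / (2 * s ** 2))
    return y

x = np.linspace(0, 24, 25)
y = gauss_sum(x, 1.0, 5, 1.5, 0.7, 12, 2.0, 0.5, 18, 1.5, 0.3, 22, 1.0)  # stand-in curve

n = 4                                        # assumed number of Gaussians
p0 = []
for m in np.linspace(x.min(), x.max(), n):   # spread the initial means across x
    p0 += [y.max(), m, (x.max() - x.min()) / (2 * n)]

popt, _ = curve_fit(gauss_sum, x, y, p0=p0)
means, sigmas = popt[1::3], popt[2::3]       # the N(m, s) parameters per Gaussian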
I have a 2D probability map (please correct me if I use any term incorrectly). Something like this:
Here yellow is a high value and violet is zero. Please ignore the red cross. It is represented as a matrix in numpy/pytorch.
As you can see, this example has two clusters. How can I find those clusters, including the mode coordinates (matrix indices) and the accumulated probability mass corresponding to each cluster? The number of clusters can vary between probability maps and needs to be determined automatically.
I believe something like mean-shift should work, but I am new to this field, so I don't know the best way to do it. I found sklearn.cluster.MeanShift, but it needs sampled points as input, which seems very expensive for images of roughly 512 x 512. Can I do it without sampling first?
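Not mean-shift, but one sampling-free sketch that works if the clusters are well-separated blobs: threshold the map and label connected components with scipy.ndimage. The 5% threshold and the synthetic two-blob map here are assumptions:

import numpy as np
from scipy import ndimage

# synthetic 2D probability map with two blobs, standing in for the real one
yy, xx = np.mgrid[0:512, 0:512]
p = np.exp(-((yy - 100) ** 2 + (xx - 150) ** 2) / (2 * 15 ** 2))
p += np.exp(-((yy - 350) ** 2 + (xx - 400) ** 2) / (2 * 25 ** 2))
p /= p.sum()

labels, n = ndimage.label(p > 0.05 * p.max())                       # tunable threshold
modes = ndimage.maximum_position(p, labels, index=range(1, n + 1))  # (row, col) per cluster
mass = ndimage.sum(p, labels, index=range(1, n + 1))                # probability mass per cluster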
I am trying to implement an algorithm in python, but I am not sure when I should use fftshift(fft(fftshift(x))) and when only fft(x) (from numpy). Is there a rule of thumb based on the shape of input data?
I am using fftshift instead of ifftshift due to the even number of values in the vector x.
It really just depends on what you want. The DFT (and hence the FFT) is periodic in the frequency domain with period equal to 2pi.
The fft() function returns the approximation of the DFT with omega (radians/s) running from 0 to 2pi (i.e. 0 to fs, where fs is the sampling frequency). All fftshift() does is swap the two halves of the fft() output vector, so the output of fftshift(fft()) runs from -pi to pi (-fs/2 to fs/2).
Usually, people like to plot a good approximation of the DTFT (or maybe even the CTFT) using the FFT, so they zero-pad the input with a large number of zeros (the fft() function can do this for you via its n argument) and then use fftshift() to plot between -pi and pi.
In other words, use fftshift(fft()) for plotting, and fft() for the math!
fft(fftshift(x)) rotates the input vector so that the phase of the complex FFT result is relative to the center of the original data window. If the input waveform is not exactly integer-periodic in the FFT width, phase relative to the center of the original window of data may make more sense than phase relative to some averaging of the discontinuous beginning and end. fft(fftshift(x)) also has the property that the imaginary component of a result will always be positive for a positive zero crossing at the center of the window of any antisymmetric waveform component.
fftshift(fft(y)) rotates the FFT results so that the DC bin is in the center of the result, halfway between -Fs/2 and Fs/2, which is a common spectrum display format.
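A minimal sketch of the plotting convention described above, with an assumed 10 Hz sine sampled at fs = 100 Hz:

import numpy as np
import matplotlib.pyplot as plt

fs = 100.0
t = np.arange(0, 1, 1 / fs)
x = np.sin(2 * np.pi * 10 * t)

X = np.fft.fft(x, n=1024)              # zero-pad via the n argument
f = np.fft.fftfreq(1024, d=1 / fs)     # bin frequencies, positive half then negative half

plt.plot(np.fft.fftshift(f), np.abs(np.fft.fftshift(X)))  # DC centered: -fs/2 .. fs/2
plt.xlabel("frequency (Hz)")
plt.show()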
I'm using mean-shift clustering to remove unwanted noise from my input data.
The data can be found here. Here is what I have tried so far:
import numpy as np
from sklearn.cluster import MeanShift

data = np.loadtxt('model.txt', unpack=True)
## data shape is (3, 500): one row per coordinate
ms = MeanShift()
ms.fit(data.T)  # sklearn expects (n_samples, n_features), i.e. (500, 3)
After trying some different bandwidth values, I am getting only one cluster, but the outliers and noise shown in the picture are supposed to be in a different cluster.
When I decrease the bandwidth a little more, I end up with this ..., which again is not what I was looking for.
Can anyone help me with this?
You can remove outliers before using mean shift.
Statistical removal
For example, fix a number of neighbors to analyze for each point (e.g. 50) and a standard deviation multiplier (e.g. 1). All points whose mean distance to their neighbors is more than 1 standard deviation above the global mean distance will be marked as outliers and removed. This technique is used in libpcl, in the class pcl::StatisticalOutlierRemoval, and a tutorial can be found here.
Deterministic removal (radius based)
A simpler technique consists of specifying a radius R and a minimum number of neighbors N. All points that have fewer than N neighbors within a radius of R will be marked as outliers and removed. This technique is also used in libpcl, in the class pcl::RadiusOutlierRemoval, and a tutorial can be found here.
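A sketch of the radius-based idea in Python with scipy's cKDTree; the values of r and min_neighbors are placeholders to tune for your data:

import numpy as np
from scipy.spatial import cKDTree

def radius_outlier_removal(points, r=1.0, min_neighbors=5):
    """Keep points that have at least min_neighbors others within radius r."""
    tree = cKDTree(points)
    neighbor_lists = tree.query_ball_point(points, r)
    # count neighbors within r for every point, excluding the point itself
    counts = np.array([len(nb) - 1 for nb in neighbor_lists])
    return points[counts >= min_neighbors]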
Mean-shift is not meant to remove low-density areas.
It tries to move all data to the most dense areas.
If there is one single most dense point, then everything should move there, and you get only one cluster.
Try a different method. Maybe remove the outliers first.
Set its cluster_all parameter to False (cluster_all : bool, default=True):
If true, then all points are clustered, even those orphans that are not within any kernel. Orphans are assigned to the nearest kernel. If false, then orphans are given cluster label -1.
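A minimal sketch of that option; the bandwidth value and the random stand-in data are assumptions:

import numpy as np
from sklearn.cluster import MeanShift

rng = np.random.default_rng(0)
points = rng.normal(size=(500, 3))                # stand-in for the 500 x 3 point cloud

ms = MeanShift(bandwidth=0.7, cluster_all=False)  # bandwidth must be tuned
labels = ms.fit_predict(points)
noise = points[labels == -1]                      # orphans outside every kernel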