I have a simulated signal which is displayed as an histogram. I want to emulate the real measured signal using a convolution with a Gaussian with a specific width, since in the real experiment a detector has a certain uncertainty in the measured channels.
I have tried to do a convolution using np.convolve as well as scipy.signal.convolve but can't seem to get the filtering correctly. Not only the expected shape is off, which would be a slightly smeared version of the histogram and the x-axis e.g. energy scale is off aswell.
I tried defining my Gaussian with a width of 20 keV as:
gauss = np.random.normal(0, 20000, len(coincidence['esum']))
hist_gauss = plt.hist(gauss, bins=100)[0]
where len(coincidence['esum']) is the length of my coincidencedataframe column.This column I bin using:
counts = plt.hist(coincidence['esum'], bins=100)[0]
Besides this approach to generate a suitable Gaussian I tried scipy.signal.gaussian(50, 30000) which unfortunately generates a parabolic looking curve and does not exhibit the characteristic tails.
I tried doing the convolution using both coincidence['esum'] and counts with the both Gaussian approaches. Note that when doing a simple convolution with the standard example according to Finding the convolution of two histograms it works without problems.
Would anyone know how to do such a convolution in python? I exported the column of coincidende['esum'] that I use for my histogram to a pastebin, in case anyone is interested and wants to recreate it with the specific data https://pastebin.com/WFiSBFa6
As you may be aware, doing the convolution of the two histograms with the same bin size will give the histogram of the result of adding each element of one of the samples with each elements of the other of the samples.
I cannot see exactly what you are doing. One important thing that you seem to not be doing is to make sure that the bins of the histograms have the same width, and you have to take care of the position of the edges of the second bin.
In code we have
def hist_of_addition(A, B, bins=10, plot=False):
A_heights, A_edges = np.histogram(A, bins=bins)
# make sure the histogram is equally spaced
assert(np.allclose(np.diff(A_edges), A_edges[1] - A_edges[0]))
# make sure to use the same interval
step = A_edges[1] - A_edges[0]
# specify parameters to make sure the histogram of B will
# have the same bin size as the histogram of A
nBbin = int(np.ceil((np.max(B) - np.min(B))/step))
left = np.min(B)
B_heights, B_edges = np.histogram(B, range=(left, left + step * nBbin), bins=nBbin)
# check that the bins for the second histogram matches the first
assert(np.allclose(np.diff(B_edges), step))
C_heights = np.convolve(A_heights, B_heights)
C_edges = B_edges[0] + A_edges[0] + np.arange(0, len(C_heights) + 1) * step
if plot:
plt.figure(figsize=(12, 4))
plt.subplot(131)
plt.bar(A_edges[:-1], A_heights, step)
plt.title('A')
plt.subplot(132)
plt.bar(B_edges[:-1], B_heights, step)
plt.title('B')
plt.subplot(133)
plt.bar(C_edges[:-1], C_heights, step)
plt.title('A+B')
return C_edges, C_heights
Then
A = -np.cos(np.random.rand(10**6))
B = np.random.normal(1.5, 0.025, 10**5)
hist_of_addition(A, B, bins=100, plot=True);
Gives
Related
I have a set of points [x1,y1][x2,y2]...[xn,yn]. I need to display them using the kernel density Estimation in a 2D image. How to perform this? I was referring the following code and it's bit confusing. Looking for a simple explanation.
https://jakevdp.github.io/PythonDataScienceHandbook/05.13-kernel-density-estimation.html
img = np.zeros((height, width), np.uint8)
circles_xy =[[524,290][234,180]...[432,30]]
kde = KernelDensity(bandwidth=1.0, kernel='gaussian')
kde.fit(circles_xy)
I would continue on the same path by drawing the contours of the PDF of the kernel density estimate. However, this might not give the information you need, because the values of the PDF are not very informative. Instead, I would rather compute the minimum volume level set. From a given probability level, the minimum level set is the domain containing that fraction of the distribution. If we consider a domain defined by a given value of the PDF, this must correspond to an unknown PDF value. The problem of finding this PDF value is done by inversion.
Based on a given sample, the natural idea is to compute an approximate distribution based on kernel smoothing, just like you did. Then, for any distribution in OpenTURNS, the computeMinimumVolumeLevelSetWithThreshold method computes the required level set and the corresponding PDF value.
Let's see how it goes in practice. In order to get an interesting example, I create a 2D distribution from a mixture of two gaussian distributions.
import openturns as ot
# Create a gaussian
corr = ot.CorrelationMatrix(2)
corr[0, 1] = 0.2
copula = ot.NormalCopula(corr)
x1 = ot.Normal(-1., 1)
x2 = ot.Normal(2, 1)
x_funk = ot.ComposedDistribution([x1, x2], copula)
# Create a second gaussian
x1 = ot.Normal(1.,1)
x2 = ot.Normal(-2,1)
x_punk = ot.ComposedDistribution([x1, x2], copula)
# Mix the distributions
mixture = ot.Mixture([x_funk, x_punk], [0.5,1.])
# Generate the sample
sample = mixture.getSample(500)
This is where your problem starts. Creating the bivariate kernel smoothing from multidimensional Scott's rule only requires two lines.
factory = ot.KernelSmoothing()
distribution = factory.build(sample)
It would be straightforward just to plot the contours of this estimated distribution.
distribution.drawPDF()
produces:
This shows the shape of the distribution. However, the contours of the PDF do not convey much information on the initial sample.
The inversion to compute the minimum volume level set requires an initial sample which is generated from Monte-Carlo method when the dimension is greater than 1. The default sample size (close to 16 000) is OK, but I usually set it up by myself just to make sure that I understand what I do.
ot.ResourceMap.SetAsUnsignedInteger(
"Distribution-MinimumVolumeLevelSetSamplingSize", 1000
)
alpha = 0.9
levelSet, threshold = distribution.computeMinimumVolumeLevelSetWithThreshold(alpha)
The threshold variable contains the solution of the problem, i.e. the PDF value which corresponds to the minimum volume level set.
The final step is to plot the sample and the corresponding minimum volume level set.
def drawLevelSetContour2D(
distribution, numberOfPointsInXAxis, alpha, threshold, sample
):
"""
Compute the minimum volume LevelSet of measure equal to alpha and get the
corresponding density value (named threshold).
Draw a contour plot for the distribution, where the PDF is equal to threshold.
"""
sampleSize = sample.getSize()
X1min = sample[:, 0].getMin()[0]
X1max = sample[:, 0].getMax()[0]
X2min = sample[:, 1].getMin()[0]
X2max = sample[:, 1].getMax()[0]
xx = ot.Box([numberOfPointsInXAxis], ot.Interval([X1min], [X1max])).generate()
yy = ot.Box([numberOfPointsInXAxis], ot.Interval([X2min], [X2max])).generate()
xy = ot.Box(
[numberOfPointsInXAxis, numberOfPointsInXAxis],
ot.Interval([X1min, X2min], [X1max, X2max]),
).generate()
data = distribution.computePDF(xy)
graph = ot.Graph("", "X1", "X2", True, "topright")
labels = ["%.2f%%" % (100 * alpha)]
contour = ot.Contour(xx, yy, data, ot.Point([threshold]), ot.Description(labels))
contour.setColor("black")
graph.setTitle(
"%.2f%% of the distribution, sample size = %d" % (100 * alpha, sampleSize)
)
graph.add(contour)
cloud = ot.Cloud(sample)
graph.add(cloud)
return graph
We finally plot the contours of the level set with 50 points in each axis.
numberOfPointsInXAxis = 50
drawLevelSetContour2D(mixture, numberOfPointsInXAxis, alpha, threshold, sample)
The following figure plots the sample along with the contour of the domain which contains 90% of the population estimated from the kernel smoothing distribution. Any point outside of this region can be considered as an outlier, although we might use the higher alpha=0.95 value for this purpose.
The full example is detailed in Minimum volume level set. An application of this to stochastic processes is done in othdrplot. The ideas used here are detailed in : Rob J Hyndman and Han Lin Shang. Rainbow plots , bagplots and boxplots for functional data. Journal of Computational and Graphical Statistics, 19:29-45, 2009.
I want to create a uint16 image of gaussian noise with a defined mean and standard deviation.
I've tried using numpy's random.normal for this but it returns a float64 array:
mu = 10
sigma = 100
shape = (1024,1024)
gauss_img = np.random.normal(mu, sigma, shape)
print(gauss_img.dtype)
>>> dtype('float64')
Is there a way to convert gauss_img to a uint16 array while preserving the original mean and standard deviation? Or is there another approach entirely to creating a uint16 noise image?
EDIT: As was mentioned in the comments, np.random.normal will inevitably sample negative values given a sd > mean, which is a problem for converting to uint16.
So I think I need a different method that will create an unsigned gaussian image directly.
So I think this is close to what you're looking for.
Import libraries and spoof some skewed data. Here, since the input is of unknown origin, I created skewed data using np.expm1(np.random.normal()). You could use skewnorm().rvs() as well, but that's kind of cheating since that's also the lib you'll use to characterize it.
I flatten the raw samples to make plotting histograms easier.
import numpy as np
from scipy.stats import skewnorm
# generate dummy raw starting data
# smaller shape just for simplicity
shape = (100, 100)
raw_skewed = np.maximum(0.0, np.expm1(np.random.normal(2, 0.75, shape))).astype('uint16')
# flatten to look at histograms and compare distributions
raw_skewed = raw_skewed.reshape((-1))
Now find the params that characterize your skewed data, and use those to create a new distribution to sample from that hopefully matches your original data well.
These two lines of code are really what you're after I think.
# find params
a, loc, scale = skewnorm.fit(raw_skewed)
# mimick orig distribution with skewnorm
new_samples = skewnorm(a, loc, scale).rvs(10000).astype('uint16')
Now plot the distributions of each to compare.
plt.hist(raw_skewed, bins=np.linspace(0, 60, 30), hatch='\\', label='raw skewed')
plt.hist(new_samples, bins=np.linspace(0, 60, 30), alpha=0.65, color='green', label='mimic skewed dist')
plt.legend()
The histograms are pretty close. If that looks good enough, reshape your new data to the desired shape.
# final result
new_samples.reshape(shape)
Now... here's where I think it probably falls short. Take a look at the heatmap of each. The original distribution had a longer tail to the right (more outliers that skewnorm() didn't characterize).
This plots a heatmap of each.
# plot heatmaps of each
fig = plt.figure(2, figsize=(18,9))
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)
im1 = ax1.imshow(raw_skewed.reshape(shape), vmin=0, vmax=120)
ax1.set_title("raw data - mean: {:3.2f}, std dev: {:3.2f}".format(np.mean(raw_skewed), np.std(raw_skewed)), fontsize=20)
im2 = ax2.imshow(new_samples.reshape(shape), vmin=0, vmax=120)
ax2.set_title("mimicked data - mean: {:3.2f}, std dev: {:3.2f}".format(np.mean(new_samples), np.std(new_samples)), fontsize=20)
plt.tight_layout()
# add colorbar
fig.subplots_adjust(right=0.85)
cbar_ax = fig.add_axes([0.88, 0.1, 0.08, 0.8]) # [left, bottom, width, height]
fig.colorbar(im1, cax=cbar_ax)
Looking at it... you can see occasional flecks of yellow indicating very high values in the original distribution that didn't make it into the output. This also shows up in the higher std dev of the input data (see titles in each heatmap, but again, as in comments to original question... mean & std don't really characterize the distributions since they're not normal... but they're in as a relative comparison).
But... that's just the problem it has with the very specific skewed sample i created to get started. There's hopefully enough here to mess around with and tune until it suits your needs and your specific dataset. Good luck!
With that mean and sigma you are bound to sample some negative values. So i guess the option could be that you find the most negative value, after sampling, and add its absolute value to all the samples. After that convert to uint as suggested in the comments. But ofcourse you loose the mean this way.
If you have a range of uint16 numbers to sample from, then you should check out this post.
This way you could use scipy.stats.truncnorm to generate a gaussian of unsigned integers.
I have a set of values that I'd like to plot the gaussian kernel density estimation of, however there are two problems that I'm having:
I only have the values of bars not the values themselves
I am plotting onto a categorical axis
Here's the plot I've generated so far:
The order of the y axis is actually relevant since it is representative of the phylogeny of each bacterial species.
I'd like to add a gaussian kde overlay for each color, but so far I haven't been able to leverage seaborn or scipy to do this.
Here's the code for the above grouped bar plot using python and matplotlib:
enterN = len(color1_plotting_values)
fig, ax = plt.subplots(figsize=(20,30))
ind = np.arange(N) # the x locations for the groups
width = .5 # the width of the bars
p1 = ax.barh(Species_Ordering.Species.values, color1_plotting_values, width, label='Color1', log=True)
p2 = ax.barh(Species_Ordering.Species.values, color2_plotting_values, width, label='Color2', log=True)
for b in p2:
b.xy = (b.xy[0], b.xy[1]+width)
Thanks!
How to plot a "KDE" starting from a histogram
The protocol for kernel density estimation requires the underlying data. You could come up with a new method that uses the empirical pdf (ie the histogram) instead, but then it wouldn't be a KDE distribution.
Not all hope is lost, though. You can get a good approximation of a KDE distribution by first taking samples from the histogram, and then using KDE on those samples. Here's a complete working example:
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as sts
n = 100000
# generate some random multimodal histogram data
samples = np.concatenate([np.random.normal(np.random.randint(-8, 8), size=n)*np.random.uniform(.4, 2) for i in range(4)])
h,e = np.histogram(samples, bins=100, density=True)
x = np.linspace(e.min(), e.max())
# plot the histogram
plt.figure(figsize=(8,6))
plt.bar(e[:-1], h, width=np.diff(e), ec='k', align='edge', label='histogram')
# plot the real KDE
kde = sts.gaussian_kde(samples)
plt.plot(x, kde.pdf(x), c='C1', lw=8, label='KDE')
# resample the histogram and find the KDE.
resamples = np.random.choice((e[:-1] + e[1:])/2, size=n*5, p=h/h.sum())
rkde = sts.gaussian_kde(resamples)
# plot the KDE
plt.plot(x, rkde.pdf(x), '--', c='C3', lw=4, label='resampled KDE')
plt.title('n = %d' % n)
plt.legend()
plt.show()
Output:
The red dashed line and the orange line nearly completely overlap in the plot, showing that the real KDE and the KDE calculated by resampling the histogram are in excellent agreement.
If your histograms are really noisy (like what you get if you set n = 10 in the above code), you should be a bit cautious when using the resampled KDE for anything other than plotting purposes:
Overall the agreement between the real and resampled KDEs is still good, but the deviations are noticeable.
Munge your categorial data into an appropriate form
Since you haven't posted your actual data I can't give you detailed advice. I think your best bet will be to just number your categories in order, then use that number as the "x" value of each bar in the histogram.
I have stated my reservations to applying a KDE to OP's categorical data in my comments above. Basically, as the phylogenetic distance between species does not obey the triangle inequality, there cannot be a valid kernel that could be used for kernel density estimation. However, there are other density estimation methods that do not require the construction of a kernel. One such method is k-nearest neighbour inverse distance weighting, which only requires non-negative distances which need not satisfy the triangle inequality (nor even need to be symmetric, I think). The following outlines this approach:
import numpy as np
#--------------------------------------------------------------------------------
# simulate data
total_classes = 10
sample_values = np.random.rand(total_classes)
distance_matrix = 100 * np.random.rand(total_classes, total_classes)
# Distances to the values itself are zero; hence remove diagonal.
distance_matrix -= np.diag(np.diag(distance_matrix))
# --------------------------------------------------------------------------------
# For each sample, compute an average based on the values of the k-nearest neighbors.
# Weigh each sample value by the inverse of the corresponding distance.
# Apply a regularizer to the distance matrix.
# This limits the influence of values with very small distances.
# In particular, this affects how the value of the sample itself (which has distance 0)
# is weighted w.r.t. other values.
regularizer = 1.
distance_matrix += regularizer
# Set number of neighbours to "interpolate" over.
k = 3
# Compute average based on sample value itself and k neighbouring values weighted by the inverse distance.
# The following assumes that the value of distance_matrix[ii, jj] corresponds to the distance from ii to jj.
for ii in range(total_classes):
# determine neighbours
indices = np.argsort(distance_matrix[ii, :])[:k+1] # +1 to include the value of the sample itself
# compute weights
distances = distance_matrix[ii, indices]
weights = 1. / distances
weights /= np.sum(weights) # weights need to sum to 1
# compute weighted average
values = sample_values[indices]
new_sample_values[ii] = np.sum(values * weights)
print(new_sample_values)
THE EASY WAY
For now, I am skipping any philosophical argument about the validity of using Kernel density in such settings. Will come around that later.
An easy way to do this is using scikit-learn KernelDensity:
import numpy as np
import pandas as pd
from sklearn.neighbors import KernelDensity
from sklearn import preprocessing
ds=pd.read_csv('data-by-State.csv')
Y=ds.loc[:,'State'].values # State is AL, AK, AZ, etc...
# With categorical data we need some label encoding here...
le = preprocessing.LabelEncoder()
le.fit(Y) # le.classes_ would be ['AL', 'AK', 'AZ',...
y=le.transform(Y) # y would be [0, 2, 3, ..., 6, 7, 9]
y=y[:, np.newaxis] # preparing for kde
kde = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(y)
# You can control the bandwidth so the KDE function performs better
# To find the optimum bandwidth for your data you can try Crossvalidation
x=np.linspace(0,5,100)[:, np.newaxis] # let's get some x values to plot on
log_dens=kde.score_samples(x)
dens=np.exp(log_dens) # these are the density function values
array([0.06625658, 0.06661817, 0.06676005, 0.06669403, 0.06643584,
0.06600488, 0.0654239 , 0.06471854, 0.06391682, 0.06304861,
0.06214499, 0.06123764, 0.06035818, 0.05953754, 0.05880534,
0.05818931, 0.05771472, 0.05740393, 0.057276 , 0.05734634,
0.05762648, 0.05812393, 0.05884214, 0.05978051, 0.06093455,
..............
0.11885574, 0.11883695, 0.11881434, 0.11878766, 0.11875657,
0.11872066, 0.11867943, 0.11863229, 0.11857859, 0.1185176 ,
0.11844852, 0.11837051, 0.11828267, 0.11818407, 0.11807377])
And these values are all you need to plot your Kernel Density over your histogram. Capito?
Now, on the theoretical side, if X is a categorical(*), unordered variable with c possible values, then for 0 ≤ h < 1
is a valid kernel. For an ordered X,
where |x1-x2|should be understood as how many levels apart x1 and x2 are. As h tends to zero, both of these become indicators and return a relative frequency counting. h is oftentimes referred to as bandwidth.
(*) No distance needs to be defined on the variable space. Doesn't need to be a metric space.
Devroye, Luc and Gábor Lugosi (2001). Combinatorial Methods in Density Estimation. Berlin: Springer-Verlag.
This question has a lot of useful answers on how to get a moving average.
I have tried the two methods of numpy convolution and numpy cumsum and both worked fine on an example dataset, but produced a shorter array on my real data.
The data are spaced by 0.01. The example dataset has a length of 50, the real data tens of thousands. So it must be something about the window size that is causing the problem and I don't quite understand what is going on in the functions.
This is how I define the functions:
def smoothMAcum(depth,temp, scale): # Moving average by cumsum, scale = window size in m
dz = np.diff(depth)
N = int(scale/dz[0])
cumsum = np.cumsum(np.insert(temp, 0, 0))
smoothed=(cumsum[N:] - cumsum[:-N]) / N
return smoothed
def smoothMAconv(depth,temp, scale): # Moving average by numpy convolution
dz = np.diff(depth)
N = int(scale/dz[0])
smoothed=np.convolve(temp, np.ones((N,))/N, mode='valid')
return smoothed
Then I implement it:
scale = 5.
smooth = smoothMAconv(dep,data, scale)
but print len(dep), len(smooth)
returns 81071 80572
and the same happens if I use the other function.
How can I get the smooth array of the same length as the data?
And why did it work on the small dataset? Even if I try different scales (and use the same for the example and for the data), the result in the example has the same length as the original data, but not in the real application.
I considered an effect of nan values, but if I have a nan in the example, it doesn't make a difference.
So where is the problem, if possible to tell without the full dataset?
The second of your approaches is easy to modify to preserve the length, because numpy.convolve supports the parameter mode='same'.
np.convolve(temp, np.ones((N,))/N, mode='same')
This is made possible by zero-padding the data set temp on both sides, -
which will inevitably have some effect at the boundaries unless your data happens to be 0 near the boundaries. Example:
N = 10
x = np.linspace(0, 2, 100)
y = x**2 + np.random.uniform(size=x.shape)
y_smooth = np.convolve(y, np.ones((N,))/N, mode='same')
plt.plot(x, y, 'r.')
plt.plot(x, y_smooth)
plt.show()
The boundary effect of zero-padding is very visible at the right end, where the data points are about 4-5 but are padded by 0.
To reduce this undesired effect, use numpy.pad for more intelligent padding; reverting to mode='valid' for convolution. The pad width must be such that in total N-1 elements are added, where N is the size of moving window.
y_padded = np.pad(y, (N//2, N-1-N//2), mode='edge')
y_smooth = np.convolve(y_padded, np.ones((N,))/N, mode='valid')
Padding by edge values of an array looks much better.
I am trying to come up with a generalised way in Python to identify pitch rotations occurring during a set of planned spacecraft manoeuvres. You could think of it as a particular case of a shift detection problem.
Let's consider the solar_elevation_angle variable in my set of measurements, identifying the elevation angle of the sun measured from the spacecraft's instrument. For those who might want to play with the data, I saved the solar_elevation_angle.txt file here.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import gridspec
from scipy.signal import argrelmax
from scipy.ndimage.filters import gaussian_filter1d
solar_elevation_angle = np.loadtxt("solar_elevation_angle.txt", dtype=np.float32)
fig, ax = plt.subplots()
ax.set_title('Solar elevation angle')
ax.set_xlabel('Scanline')
ax.set_ylabel('Solar elevation angle [deg]')
ax.plot(solar_elevation_angle)
plt.show()
The scanline is my time dimension. The four points where the slope changes identify the spacecraft pitch rotations.
As you can see, the solar elevation angle evolution outside the spacecraft manoeuvres regions is pretty much linear as a function of time, and that should always be the case for this particular spacecraft (except for major failures).
Note that during each spacecraft manoeuvre, the slope change is obviously continuous, although discretised in my set of angle values. That means: for each manoeuvre, it does not really make sense to try to locate a single scanline where a manoeuvre has taken place. My goal is rather to identify, for each manoeuvre, a "representative" scanline in the range of scanlines defining the interval of time where the manoeuvre occurred (e.g. middle value, or left boundary).
Once I get a set of "representative" scanline indexes where all manoeuvres have taken place, I could then use those indexes for rough estimations of manoeuvres durations, or to automatically place labels on the plot.
My solution so far has been to:
Compute the 2nd derivative of the solar elevation angle using
np.gradient.
Compute absolute value and clipping of resulting
curve. The clipping is necessary because of what I assume to be
discretisation noise in the linear segments, which would then severely affect the identification of the "real" local maxima in point 4.
Apply smoothing to the resulting curve, to get rid of multiple peaks. I'm using scipy's 1d gaussian filter with a trial-and-error sigma value for that.
Identify local maxima.
Here's my code:
fig = plt.figure(figsize=(8,12))
gs = gridspec.GridSpec(5, 1)
ax0 = plt.subplot(gs[0])
ax0.set_title('Solar elevation angle')
ax0.plot(solar_elevation_angle)
solar_elevation_angle_1stdev = np.gradient(solar_elevation_angle)
ax1 = plt.subplot(gs[1])
ax1.set_title('1st derivative')
ax1.plot(solar_elevation_angle_1stdev)
solar_elevation_angle_2nddev = np.gradient(solar_elevation_angle_1stdev)
ax2 = plt.subplot(gs[2])
ax2.set_title('2nd derivative')
ax2.plot(solar_elevation_angle_2nddev)
solar_elevation_angle_2nddev_clipped = np.clip(np.abs(np.gradient(solar_elevation_angle_2nddev)), 0.0001, 2)
ax3 = plt.subplot(gs[3])
ax3.set_title('absolute value + clipping')
ax3.plot(solar_elevation_angle_2nddev_clipped)
smoothed_signal = gaussian_filter1d(solar_elevation_angle_2nddev_clipped, 20)
ax4 = plt.subplot(gs[4])
ax4.set_title('Smoothing applied')
ax4.plot(smoothed_signal)
plt.tight_layout()
plt.show()
I can then easily identify the local maxima by using scipy's argrelmax function:
max_idx = argrelmax(smoothed_signal)[0]
print(max_idx)
# [ 689 1019 2356 2685]
Which correctly identifies the scanline indexes I was looking for:
fig, ax = plt.subplots()
ax.set_title('Solar elevation angle')
ax.set_xlabel('Scanline')
ax.set_ylabel('Solar elevation angle [deg]')
ax.plot(solar_elevation_angle)
ax.scatter(max_idx, solar_elevation_angle[max_idx], marker='x', color='red')
plt.show()
My question is: Is there a better way to approach this problem?
I find that having to manually specify the clipping threshold values to get rid of the noise and the sigma in the gaussian filter weakens this approach considerably, preventing it to be applied to other similar cases.
First improvement would be to use a Savitzky-Golay filter to find the derivative in a less noisy way. For example, it can fit a parabola (in the sense of least squares) to each data slice of certain size, and then take the second derivative of that parabola. The result is much nicer than just taking 2nd order difference with gradient. Here it is with window size 101:
savgol_filter(solar_elevation_angle, window_length=window, polyorder=2, deriv=2)
Second, instead of looking for points of maximum with argrelmax it is better to look for places where the second derivative is large; for example, at least half its maximal size. This will of course return many indexes, but we can then look at the gaps between those indexes to identify where each peak begins and ends. The midpoint of the peak is then easily found.
Here is the complete code. The only parameter is window size, which is set to 101. The approach is robust; the size 21 or 201 gives essentially the same outcome (it must be odd).
from scipy.signal import savgol_filter
window = 101
der2 = savgol_filter(solar_elevation_angle, window_length=window, polyorder=2, deriv=2)
max_der2 = np.max(np.abs(der2))
large = np.where(np.abs(der2) > max_der2/2)[0]
gaps = np.diff(large) > window
begins = np.insert(large[1:][gaps], 0, large[0])
ends = np.append(large[:-1][gaps], large[-1])
changes = ((begins+ends)/2).astype(np.int)
plt.plot(solar_elevation_angle)
plt.plot(changes, solar_elevation_angle[changes], 'ro')
plt.show()
The fuss with insert and append is because the first index with large derivative should qualify as "peak begins" and the last such index should qualify as "peak ends", even though they don't have a suitable gap next to them (the gap is infinite).
Piecewise linear fit
This is an alternative (not necessarily better) approach, which does not use derivatives: fit a smoothing spline of degree 1 (i.e., a piecewise linear curve), and notice where its knots are.
First, normalize the data (which I call y instead of solar_elevation_angle) to have standard deviation 1.
y /= np.std(y)
The first step is to build a piecewise linear curve that deviates from the data by at most the given threshold, arbitrarily set to 0.1 (no units here because y was normalized). This is done by calling UnivariateSpline repeatedly, starting with a large smoothing parameter and gradually reducing it until the curve fits. (Unfortunately, one can't simply pass in the desired uniform error bound).
from scipy.interpolate import UnivariateSpline
threshold = 0.1
m = y.size
x = np.arange(m)
s = m
max_error = 1
while max_error > threshold:
spl = UnivariateSpline(x, y, k=1, s=s)
interp_y = spl(x)
max_error = np.max(np.abs(interp_y - y))
s /= 2
knots = spl.get_knots()
values = spl(knots)
So far we found the knots, and noted the values of the spline at those knots. But not all of these knots are really important. To test the importance of each knot, I remove it and interpolate without it. If the new interpolant is substantially different from the old (doubling the error), the knot is considered important and is added to the list of found slope changes.
ts = knots.size
idx = np.arange(ts)
changes = []
for j in range(1, ts-1):
spl = UnivariateSpline(knots[idx != j], values[idx != j], k=1, s=0)
if np.max(np.abs(spl(x) - interp_y)) > 2*threshold:
changes.append(knots[j])
plt.plot(y)
plt.plot(changes, y[np.array(changes, dtype=int)], 'ro')
plt.show()
Ideally, one would fit piecewise linear functions to given data, increasing the number of knots until adding one more does not bring "substantial" improvement. The above is a crude approximation of that with SciPy tools, but far from best possible. I don't know of any off-the-shelf piecewise linear model selection tool in Python.