I have a couple of data sets with clusters of peaks that look like the following:
You can see that the main features here are clusters of peaks, each cluster having three peaks. I would like to find the x values of those local peaks, but I am running into a few problems. My current code is as follows:
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import argrelmax

def rounddown(x):
    # Round down to the nearest multiple of 10 (used for the y-axis ticks).
    return int(np.floor(x / 10.0)) * 10

# Load the spectrum: one column of pixel positions, one of intensities.
pixel, value = np.loadtxt('voltage152_4.txt', unpack=True, skiprows=0)

ax = plt.axes()
ax.plot(pixel, value, '-')
ax.axis([0, np.max(pixel), np.min(value), np.max(value) + 1])

# Candidate maxima: points larger than their 5 neighbours on either side.
maxTemp = argrelmax(value, order=5)

# Keep only the maxima that rise above the noise floor.
maxes = []
for maxi in maxTemp[0]:
    if value[maxi] > 40:
        maxes.append(maxi)

ax.plot(maxes, value[maxes], 'ro')

plt.yticks(np.arange(rounddown(value.min()), value.max(), 10))
plt.savefig("spectrum1.pdf")
plt.show()
Which works relatively well, but still isn't perfect. Some peaks labeled: The main problem here is that my signal isn't smooth, so a few maxima that aren't actually my relevant peaks are getting picked up. You can see this in the stray maxima about halfway down a cluster, as well as in peaks that have two maxima where in reality there should be one. Near the center of the plot there are also some high frequency maxima; I was picking those up, so I added the check in the loop that only considers values above a certain point.
I am afraid that smoothing the curve will actually make me lose some of the clustered peaks that I want, since in some of my other datasets the peaks are even closer together. Maybe my fears are unfounded, though, and I am just misunderstanding how smoothing works. Any help would be appreciated.
Does anyone have a solution for how to pick out only "prominent" peaks? That is, only those peaks that are quite large compared to the others?
Starting with SciPy version 1.1.0 you may also use the function scipy.signal.find_peaks which allows you to select detected peaks based on their topographic prominence. This function is often easier to use than find_peaks_cwt. You'll have to play around a little bit to find the optimal lower bound to pass as a value to prominence but e.g. find_peaks(..., prominence=5) will ignore the unwanted peaks in your example. This should bring you reasonably close to your goal. If that's not enough you might do your own peak selection based upon peak properties like the left_/right_bases which are optionally returned.
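For reference, a minimal sketch of that approach, assuming the pixel and value arrays from the question's code (prominence=5 is only a starting guess to tune):

from scipy.signal import find_peaks

# Keep only peaks whose topographic prominence exceeds 5.
peaks, properties = find_peaks(value, prominence=5)

print(pixel[peaks])                # x positions of the prominent peaks
print(properties["prominences"])   # prominence of each detected peak
# properties["left_bases"] and properties["right_bases"] are also available
# if you want to do your own selection on top of this.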
I'd also recommend scipy.signal.find_peaks for what you're looking for. The other, older SciPy alternative, find_peaks_cwt, is quite complicated to use.
It will basically do what you're looking for in a single line. Apart from the prominence parameter that lagru mentioned, for your data either the threshold or height parameters might also do what you need.
height = 40 would filter to get all the peaks you like.
Prominence can be a bit hard to wrap your head around in terms of what exactly it does.
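For instance, a minimal sketch of the height-based filter, again assuming the value and pixel arrays from the question's code (height=40 mirrors the hard-coded cutoff in the original loop):

from scipy.signal import find_peaks

# Keep only peaks higher than 40.
peaks, properties = find_peaks(value, height=40)
print(pixel[peaks])                 # x positions of the peaks
print(properties["peak_heights"])   # corresponding heights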
I have surface data Z over an [X,Y] mesh. In general Z = 0, but there will be peaks which stick up above this flat background, and these peaks will have roughly elliptical cross sections. These are diffraction intensity peaks, if anyone is curious. I would like to measure the elliptical cross section at about half the peak's maximum value.
So typically with diffraction, if it's a peak y = f(x), we want to look at the Full Width at Half Max (FWHM), which can be done by finding the peak's maximum, then intersecting the peak at that value and measuring the width. No problem.
Here I want to perform the analogous operation, but at higher dimension. If the peak had a circular cross section, then I could just do the FWHM = diameter of cross section. However, these peaks are elliptical, so I want to slice the peak at its half max and then fit an ellipse to the cross section. That way I can get the major and minor axes, inclination angle, and goodness of fit, all of which contain relevant information that a simple FWHM number would not provide.
I can hack together a way to do this, but it's slow and messy, so it feels like there must be a better way to do this. So my question really just comes down to, has anyone done this kind of problem before, and if so, are there any modules that I could use to perform the calculation quickly and with a simple, clean code?
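One possible route, sketched roughly below, is to extract the half-maximum contour and fit an ellipse to it. The sketch assumes scikit-image is available and that the array Z contains a single peak of interest (for several peaks you would slice out a window around each peak first); it is an illustration, not a polished solution.

import numpy as np
from skimage import measure

# Z is the 2-D intensity array over the [X, Y] mesh (from the question).
half_max = Z.max() / 2.0

# Contour(s) of the surface at half maximum; keep the longest one,
# assuming it traces the peak of interest.
contours = measure.find_contours(Z, half_max)
contour = max(contours, key=len)              # (row, col) points on the contour

# Least-squares ellipse fit to the half-max cross section.
ellipse = measure.EllipseModel()
if ellipse.estimate(contour):
    rc, cc, a, b, theta = ellipse.params      # centre (row, col), semi-axes, rotation
    residuals = ellipse.residuals(contour)    # per-point misfit, for goodness of fit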
I have a series of many thousands of (1D) spectra corresponding to different repetitions of an experiment. For each repetition, the same data has been recorded by two different instruments, so I have two very similar spectra, each consisting of a few hundred individual peaks/events. The instruments have different resolutions, precisions and likely detection efficiencies, so each pair of spectra is non-identical but similar: looking at them closely by eye, one can confidently match many peaks between the two. I want to be able to automatically and reliably match the two spectra for each pair, i.e. confidently say which peak corresponds to which. This will likely involve 'throwing away' some data which can't be confidently matched (e.g. only one of the two instruments detects an event).
I've attached an image of what the data look like over an entire spectrum and zoomed into a relatively sparse region. The red spectrum has essentially already been peak-found, such that it is 0 everywhere apart from where a real event is. I have used scipy.signal.find_peaks() on the blue trace, and plotted the found peaks, which seems to work well.
Now I just need to find a reliable method to match peaks between the spectra. I have tried matching peaks by just pairing the peaks which are closest to each other - however this runs into significant issues due to some peaks not being present in both spectra. I could add constraints about how close peaks must be to be matched but I think there are probably better ways out there. There are also issues arising from the red trace being a lower resolution than the blue. I expect there are pattern finding algorithms/python packages out there that would be best suited for this - but this is far from my area of expertise so I don't really know where to start. Thanks in advance.
Zoom in of a relatively sparse region of an example pair of spectra:
An entire example pair of spectra, showing some very dense regions :
Example code to plot the spectra:
from scipy.signal import find_peaks
import matplotlib.pyplot as plt

# spectra1_list, spectra2_list and spectra2_axis hold the recorded data (not shown).
for i in range(0, 10):
    spectra1 = spectra1_list[i]
    spectra2 = spectra2_list[i]
    fig, ax1 = plt.subplots(1, 1, figsize=(12, 8))
    # Peak-find the blue trace; the height/prominence bounds were chosen by eye.
    peaks, properties = find_peaks(spectra1, height=(6, None), threshold=(None, None),
                                   distance=2, prominence=(5, None))
    plt.plot(spectra1)
    plt.plot(spectra2_axis, spectra2, color='red')
    plt.plot(peaks, spectra1[peaks], "x")
    plt.show()
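As a baseline for the closest-peak matching mentioned in the question, one option is a one-to-one assignment with a distance cutoff. The sketch below uses scipy.optimize.linear_sum_assignment; the helper name, the example peak positions and the tolerance are illustrative assumptions, not part of the original code.

import numpy as np
from scipy.optimize import linear_sum_assignment

def match_peaks(pos_a, pos_b, max_separation):
    # Pairwise distances between peak positions from the two instruments.
    cost = np.abs(pos_a[:, None] - pos_b[None, :])
    # Optimal one-to-one assignment minimising the total distance.
    rows, cols = linear_sum_assignment(cost)
    # Discard pairs that are further apart than the tolerance.
    keep = cost[rows, cols] <= max_separation
    return rows[keep], cols[keep]

# Hypothetical peak positions from the two instruments:
idx_a, idx_b = match_peaks(np.array([10.0, 52.1, 80.3]),
                           np.array([10.4, 51.8, 65.0, 80.0]),
                           max_separation=1.0)
# idx_a/idx_b index the matched peaks; unmatched peaks are simply dropped.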
Deep learning perspective: you could train a pair of neural networks using cycle loss - mapping from signal A to signal B, and back again should bring you to the initial point on your signal.
A good start would be to read about CycleGAN, which uses this idea to change the style of images.
Admittedly this would be a bit of a research project and might take some time until it works robustly.
I'm referencing this question and this documentation in trying to turn a set of points (the purple dots in the image below) into an interpolated grid.
As you can see, the image has missing spots where dots should be. I'd like to figure out where those are.
import numpy as np
from scipy import interpolate

CIRCLES_X = 25  # There should be 25 circles going across
CIRCLES_Y = 10  # There should be 10 circles going down

points = []
values = []

# Points range from 0-800 ish X, 0-300 ish Y.
# detected_dots is assumed to hold the dot centres found earlier (not shown here).
for point in detected_dots:
    points.append([point.x, point.y])
    values.append(1)  # Not sure what this should be

grid_x, grid_y = np.mgrid[0:CIRCLES_Y, 0:CIRCLES_X]
grid = interpolate.griddata(points, values, (grid_x, grid_y), method='linear')
print(grid)
Whenever I print out the result of the grid, I get nan for all of my values.
Where am I going wrong? Is my problem even a correct use case for interpolate.griddata?
First, your uncertain points are mainly at an edge, so it's actually extrapolation. Second, the interpolation methods built into scipy deal with continuous functions defined on the entire plane and approximate them with polynomials, while yours is discrete (1 or 0), somewhat periodic rather than polynomial, and only defined on a discrete "grid" of points.
So you have to invent some algorithm to inter/extrapolate your specific kind of function. Whether you'll be able to reuse an existing one - from scipy or elsewhere - is up to you.
One possible way is to replace it with some function (continuous or not) defined everywhere, then calculate that approximation in the missing points - whether as one step as scipy.interpolate non-class functions do or as two separate steps.
e.g. you can use a 3-D parabola with peaks in your dots and troughs exactly between them. Or just with ones in the dots and 0's in the blanks and hope the resulting approximation in the grid's points is good enough to give a meaningful result (random overswings are likely). Then you can use scipy.interpolate.RegularGridInterpolator for both inter- and extrapolation.
Or as a harmonic function, in which case what you're seeking is a Fourier transform.
Another possible way is to go straight for a discrete solution rather than try to shoehorn the methods of continuous mathematical analysis into your case: design a (probably entirely custom) algorithm that'll try to figure out the "shape" and "dimensions" of your "grid of dots" and then simply fill in the blanks. I'm not sure whether it is possible to plug it into scipy.interpolate's harness as a selectable algorithm in addition to the built-in ones.
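As an illustration of that discrete route, a rough sketch (the function name, the tolerance and the assumption that the outermost dots were all detected are mine):

import numpy as np
from scipy.spatial import cKDTree

def find_missing_dots(dots, n_x=25, n_y=10, tol=0.5):
    # dots: (N, 2) array of detected (x, y) dot centres.
    x_min, y_min = dots.min(axis=0)
    x_max, y_max = dots.max(axis=0)
    # Ideal lattice, assuming the outermost rows/columns of dots were detected.
    xs = np.linspace(x_min, x_max, n_x)
    ys = np.linspace(y_min, y_max, n_y)
    lattice = np.array([(x, y) for y in ys for x in xs])
    # Lattice sites with no detected dot within tol * spacing are the missing ones.
    spacing = xs[1] - xs[0]
    dist, _ = cKDTree(dots).query(lattice)
    return lattice[dist > tol * spacing]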
And last but not least: you didn't specify whether the "missing" points are points where the value is unknown or an actual part of the data, i.e. data that are simply incorrect. If it's the latter, simple interpolation is not applicable at all, as it assumes that all the data are strictly correct. Then it's a related but different problem: you can approximate the data, but then you have to somehow "throw away the irregularities" (terms of a higher order of smallness beyond some point).
The following code gives me a very nice violinplot (and boxplot within).
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
foo = np.random.rand(100)
sns.violinplot(foo)
plt.boxplot(foo)
plt.show()
So far so good. However, when I look at foo, the variable does not contain any negative values. The seaborn plot seems misleading here. The normal matplotlib boxplot gives something closer to what I would expect.
How can I make violinplots with a better fit (not showing false negative values)?
As the comments note, this is a consequence (I'm not sure I'd call it an "artifact") of the assumptions underlying gaussian KDE. As has been mentioned, this is somewhat unavoidable, and if your data don't meet those assumptions, you might be better off just using a boxplot, which shows only points that exist in the actual data.
However, in your response you ask about whether it could be fit "tighter", which could mean a few things.
One answer might be to change the bandwidth of the smoothing kernel. You do that with the bw argument, which is actually a scale factor; the bandwidth that will be used is bw * data.std():
data = np.random.rand(100)
sns.violinplot(y=data, bw=.1)
Another answer might be to truncate the violin at the extremes of the datapoints. The KDE will still be fit with densities that extend past the bounds of your data, but the tails will not be shown. You do that with the cut parameter, which specifies how many units of bandwidth past the extreme values the density should be drawn. To truncate, set it to 0:
sns.violinplot(y=data, cut=0)
By the way, the API for violinplot is going to change in 0.6, and I'm using the development version here, but both the bw and cut arguments exist in the current released version and behave more or less the same way.
My software should judge spectrum bands, and given the location of the bands, find the peak point and width of the bands.
I learned to take the projection of the image and to find the width of each peak.
But I need a better way to find the projection.
The method I used reduces a 1600-pixel-wide image (e.g. 1600x40) to a 1600-point sequence. Ideally I would want to reduce the image to a 10000-point sequence using the same image.
I want a longer sequence because 1600 points provide too low a resolution: a single point makes a large difference to the measurement (there is a 4% difference if a band is judged at 18 rather than 19).
How do I get a longer projection from the same image?
Code I used: https://stackoverflow.com/a/9771560/604511
from PIL import Image
import numpy as np
from scipy.optimize import leastsq  # used later for the peak fitting (not shown)

# Load the picture with PIL, process if needed
pic = np.asarray(Image.open("band2.png"))

# Average the colour channels, then sum the pixel values along the vertical axis
pic_avg = pic.mean(axis=2)
projection = pic_avg.sum(axis=0)

# Normalise, then shift the minimum to zero for a nice fit
projection /= projection.mean()
projection -= projection.min()
What you want to do is called interpolation. Scipy has an interpolate module, with a whole bunch of different functions for differing situations, take a look here, or specifically for images here.
Here is a recently asked question that has some example code, and a graph that shows what happens.
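For concreteness, upsampling the 1600-point projection to 10000 points might look roughly like this (a sketch that assumes the projection array from the question's code; any of the scipy.interpolate functions would do):

import numpy as np
from scipy.interpolate import CubicSpline

x = np.arange(projection.shape[0])        # original 1600 sample positions
x_fine = np.linspace(0, x[-1], 10000)     # denser 10000-point grid
projection_fine = CubicSpline(x, projection)(x_fine)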
But it is really important to realise that interpolating will not make your data more accurate, so it will not help you in this situation.
If you want more accurate results, you need more accurate data. There is no other way. You need to start with a higher resolution image. (If you resample or interpolate, your results will actually be less accurate!)
Update - as the question has changed
@Hooked has made a nice point. Another way to think about it is that instead of immediately averaging (which does throw away the variance in the data), you can produce 40 graphs (like the lower one in your posted image), one from each horizontal row in your spectrum image. All these graphs are going to be pretty similar, but with some variation in peak position, height and width. You should calculate the position, height and width of each peak in each of these 40 graphs, then combine this data (matching peaks across the 40 graphs) and use the appropriate variance as an error estimate (for peak position, height and width), by using the central limit theorem. That way you can get the most out of your data. However, this assumes some independence between the rows of the spectrogram, which may or may not be the case.
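A rough sketch of that row-by-row idea (it assumes the 40 x 1600 pic_avg array from the code in @Hooked's answer below, a single well-separated peak per row, and a placeholder prominence threshold):

import numpy as np
from scipy.signal import find_peaks, peak_widths

positions, heights, widths = [], [], []
for row in pic_avg:                                              # one graph per horizontal row
    peaks, props = find_peaks(row, prominence=row.max() * 0.2)   # placeholder threshold
    w = peak_widths(row, peaks, rel_height=0.5)[0]               # FWHM of each peak, in pixels
    positions.append(peaks)
    heights.append(row[peaks])
    widths.append(w)

# With one peak per row, the spread across rows gives an error estimate, e.g.:
# position_err = np.std([p[0] for p in positions]) / np.sqrt(len(positions))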
I'd like to offer some more detail to @fraxel's answer (too long for a comment). He's right that you can't get any more information than what you put in, but I think it needs some elaboration...
You are projecting your data from 1600x40 -> 1600 which seems like you are throwing some data away. While technically correct, the whole point of a projection is to bring higher dimensional data to a lower dimension. This only makes sense if...
Your data can be adequately represented in the lower dimension. Correct me if I'm wrong, but it looks like your data is indeed one-dimensional: the vertical axis is a measure of the variability of that particular point on the x-axis (wavelength?).
Given that the projection makes sense, how can we best summarize the data for each particular wavelength point? In my previous answer, you can see I took the average for each point. In the absence of other information about the particular properties of the system, this is a reasonable first-order approximation.
You can keep more of the information if you like. Below I've plotted the variance along the y-axis. This tells me that your measurements have more variability when the signal is higher, and low variability when the signal is lower (which seems useful!):
What you need to do then, is decide what you are going to do with those extra 40 pixels of data before the projection. They mean something physically, and your job as a researcher is to interpret and project that data in a meaningful way!
The code to produce the image is below, the spec. data was taken from the screencap of your original post:
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt

# Load the picture with PIL, process if needed
pic = np.asarray(Image.open("spec2.png"))

# Average the colour channels
pic_avg = pic.mean(axis=2)

# Sum the pixel values along the vertical axis to get the projection
projection = pic_avg.sum(axis=0)

# Compute the variance along the vertical axis
variance = pic_avg.var(axis=0)

scale = 1 / 40.0
X_val = range(projection.shape[0])

plt.errorbar(X_val, projection * scale, yerr=variance * scale)
plt.imshow(pic, origin='lower', alpha=0.8)
plt.axis('tight')
plt.show()