I have a series of many thousands of (1D) spectra corresponding to different repetitions of an experiment. For each repetition, the same data has been recorded by two different instruments, so I have two very similar spectra, each consisting of a few hundred individual peaks/events. The instruments have different resolutions, precisions and likely detection efficiencies, so each pair of spectra is non-identical but similar; looking at them closely by eye, one can confidently match many peaks between the two. I want to be able to automatically and reliably match the two spectra for each pair, i.e. confidently say which peak corresponds to which. This will likely involve 'throwing away' some data which can't be confidently matched (e.g. events detected by only one of the two instruments).
I've attached an image of what the data look like over an entire spectrum, and zoomed into a relatively sparse region. The red spectrum has essentially already been peak-found, such that it is 0 everywhere apart from where a real event is. I have used scipy.signal.find_peaks() on the blue trace and plotted the found peaks, which seems to work well.
Now I just need a reliable method to match peaks between the spectra. I have tried simply pairing the peaks which are closest to each other; however, this runs into significant issues because some peaks are not present in both spectra. I could add constraints on how close peaks must be in order to be matched, but I suspect there are better ways out there. There are also issues arising from the red trace having a lower resolution than the blue. I expect there are pattern-matching algorithms/Python packages best suited for this, but this is far from my area of expertise, so I don't really know where to start. Thanks in advance.
Zoom-in of a relatively sparse region of an example pair of spectra:
An entire example pair of spectra, showing some very dense regions:
Example code to plot the spectra:
import matplotlib.pyplot as plt
from scipy.signal import find_peaks

for i in range(10):
    spectra1 = spectra1_list[i]
    spectra2 = spectra2_list[i]
    fig, ax1 = plt.subplots(1, 1, figsize=(12, 8))
    # Peak-find the blue trace
    peaks, properties = find_peaks(spectra1, height=(6, None), distance=2, prominence=(5, None))
    ax1.plot(spectra1)
    ax1.plot(spectra2_axis, spectra2, color='red')
    ax1.plot(peaks, spectra1[peaks], "x")
    plt.show()
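A sketch of one standard approach to the matching step (not from the original post): treat it as an assignment problem and solve it globally with scipy.optimize.linear_sum_assignment, then discard pairs whose separation exceeds a cutoff. Unlike greedy nearest-neighbour pairing, this cannot match one peak twice, and the cutoff drops events seen by only one instrument. The peak positions and the cutoff value below are made up for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_peaks(pos_a, pos_b, max_dist=3.0):
    """Globally match peak positions in pos_a to those in pos_b.

    Pairs whose distance exceeds max_dist are discarded, which drops
    events detected by only one instrument."""
    cost = np.abs(pos_a[:, None] - pos_b[None, :])
    rows, cols = linear_sum_assignment(cost)
    keep = cost[rows, cols] <= max_dist
    return rows[keep], cols[keep]

# Synthetic example: b has an extra peak at 21 with no real partner
a = np.array([10.0, 20.0, 35.0, 50.0])
b = np.array([10.5, 19.0, 21.0, 49.0])
ia, ib = match_peaks(a, b)
# a[2]=35 and b[2]=21 get paired by the assignment but rejected by the cutoff
print(list(zip(a[ia], b[ib])))
```

In practice the cutoff should reflect the relative calibration and resolution of the two instruments, and the positions would come from find_peaks on each trace.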
Deep learning perspective: you could train a pair of neural networks with a cycle-consistency loss: mapping from signal A to signal B and back again should return you to the initial point on your signal.
A good starting point would be to read about CycleGAN, which uses this idea to change the style of images.
Admittedly, this would be a bit of a research project and might take some time before it works robustly.
I have a dataset with data of varying quality. There are A-grade, B-grade, C-grade and D-grade data, with A-grade being the best and D-grade having the most scatter and uncertainty.
The data come from a quasi-periodic signal and, taking all the data into consideration, they cover just one cycle. If we only take into account the best data (A and B graded, in green and yellow), we don't cover the whole cycle, but we are sure that we are only using the best data points.
After computing a periodogram to determine the period of the signal, for both the whole sample and the A and B graded data only, I ended up with the following results: 5893 days and 4733 days respectively.
Using those values I fit the data to a sine curve, shown in the following plot:
Plot with the data
In the attached file the green points are the best ones and the red ones are the worst.
As you can see, the data only cover one cycle, and the red points (the worst ones) are crucial to cover that cycle, but they are not as good in quality. So I would like to know whether the curve fit is better with those points included or not.
I was trying to use the R² parameter, but I have read that it only works properly for linear functions...
How can I quantify which of those fits is better?
Thank you
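A possible way to quantify the comparison (not part of the original question): since R² is unreliable for nonlinear fits, information criteria such as the AIC, computed from the residual sum of squares, are a common choice; the caveat is that such comparisons are only meaningful between models fitted to the same data set. The sketch below uses synthetic data with a true period of 5000 and compares two fixed candidate periods, standing in for the 5893 d and 4733 d values.

```python
import numpy as np
from scipy.optimize import curve_fit

def aic(y, y_fit, n_params):
    # Akaike information criterion from the residual sum of squares
    # (assumes Gaussian errors); lower is better
    n = len(y)
    rss = np.sum((y - y_fit) ** 2)
    return n * np.log(rss / n) + 2 * n_params

def make_sine(P):
    # Sine model with the period P held fixed
    return lambda t, A, phi, c: A * np.sin(2 * np.pi * t / P + phi) + c

# Synthetic quasi-periodic data, a stand-in for the real time series
rng = np.random.default_rng(0)
t = np.linspace(0, 6000, 80)
y = make_sine(5000)(t, 1.0, 0.3, 0.0) + rng.normal(0, 0.1, t.size)

scores = {}
for P in (5000, 4000):          # two candidate periods to compare
    f = make_sine(P)
    popt, _ = curve_fit(f, t, y, p0=[1.0, 0.0, 0.0])
    scores[P] = aic(y, f(t, *popt), n_params=3)
print(scores)  # the fit with the lower AIC is preferred
```

If per-point uncertainties are available (e.g. larger for the C/D-grade points), a weighted fit via the `sigma` argument of curve_fit, followed by a reduced chi-square, would use the grade information more directly.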
I have two sets of points, one from an analysis and another that I will use for the results of post-processing on the analysis data.
The analysis data, in black, is scattered.
The points used for results are red.
Here are the two sets on the same plot:
The problem I have is this: I will be interpolating onto the red points, but as you can see there are red points which fall inside voids in the black data set. Interpolation produces non-zero values at those points, but it is essential that these values be zero in the final data set.
I have been thinking of several strategies for getting those values to zero. Here are several in no particular order:
Find a convex hull whose vertices are all black data points and whose interior contains only red data points, with the hull's area maximised while still meeting those two criteria.
This has proven to be fairly difficult to implement, mostly due to having to select which black data points should be excluded from the iterative search for a convex hull.
Add an extra dimension to the data sets with a single value, like 1 or 0, so both can be part of the same data set yet still be distinguishable. Use a kNN (nearest neighbour) algorithm to pick out the red points in the voids. The basic idea is that red points in voids will have n (6?) nearest neighbours which are all in their own set. Red data points separated from a void only by its boundary will have a different proportion, and red points at least one step removed from a boundary will have almost all black neighbours. The existing algorithms I have seen for this approach return indices or array masks, either of which would be a good solution. I have not tried implementing this yet.
Manually extract boundary points from the SolidWorks model that was used to create the black data set. No, on so many levels. This would have to be done manually, z-level by z-level, and the pictures I have shown represent only a small portion of the actual, full set.
Manually create masks by making several refinements to a subset of red data points that I visually confirm to be of interest. Also no, not unless I have run out of options.
If this is a problem with a clear solution, then I am not seeing it. I'm hoping that proposed solution 2 will be the one, because that actually looks like it would be the most fun to implement and see in action. Either way, like the title says, I'm still looking for direction on strategies to solve this problem. About the only thing I'm sure of is that Python is the right tool.
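Proposed solution 2 can be sketched with scipy.spatial.cKDTree: pool both point sets, label them 0 (black) / 1 (red), and flag each red point whose k nearest neighbours in the pooled cloud are all red. The geometry below is a toy stand-in (a coarse black grid with a circular void, and a finer offset red grid); the k and threshold values are guesses that would need tuning on the real data.

```python
import numpy as np
from scipy.spatial import cKDTree

def red_points_in_voids(black, red, k=6, frac=0.9):
    """Flag red points whose k nearest neighbours in the pooled
    black+red cloud are (almost) all red, i.e. points sitting
    inside a void of the black data set.  Returns a boolean mask."""
    pts = np.vstack([black, red])
    labels = np.r_[np.zeros(len(black)), np.ones(len(red))]
    tree = cKDTree(pts)
    # k + 1 neighbours: the closest one is the query point itself
    _, idx = tree.query(red, k=k + 1)
    red_frac = labels[idx[:, 1:]].mean(axis=1)
    return red_frac >= frac

# Toy data: a coarse black grid with a circular void carved out
bx, by = np.meshgrid(np.arange(-2, 2.01, 0.2), np.arange(-2, 2.01, 0.2))
black = np.c_[bx.ravel(), by.ravel()]
black = black[np.hypot(black[:, 0], black[:, 1]) > 0.7]

# A finer, offset red grid covering the same area
rx, ry = np.meshgrid(np.linspace(-1.95, 1.95, 40), np.linspace(-1.95, 1.95, 40))
red = np.c_[rx.ravel(), ry.ravel()]

mask = red_points_in_voids(black, red)
r = np.hypot(red[:, 0], red[:, 1])
print(mask[r < 0.3].all(), mask[r > 1.0].any())
```

The returned mask can then be used to force the interpolated field values to zero at the flagged points. Points right on the void boundary will have intermediate neighbour fractions, which is where the frac threshold matters.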
EDIT:
The analysis data contains x, y, z, and 3 electric field component values, Ex, Ey, and Ez. The voids in the black data set are inside of metal and hence have no change in electric potential, or put another way, the electric field values are all exactly zero.
This image shows a single z-layer, using linear interpolation of the Ex component with scipy's griddata. The black oval is a rough indicator of the void boundary for the central racetrack-shaped void. You can see that there is red and blue (for + and - E field in the x direction) inside the oval; it should be zero (light green in this plot). The finished data will be used to track a beam of charged particles, and the tracking software can only tell that a path has crossed into solid metal, and discard that path, if the electric potential there remains constant.
If electric field exists in the void the particle tracking software doesn't know that some structure is there and bad things happen.
You might be able to solve this with the machine-learning technique called a "Support Vector Machine". Assign the 0 and 1 classifications as you mentioned, and then run this through the libsvm algorithm. You should be able to use the resulting model to classify and identify the points you need to zero out, and do so programmatically.
I realize that there is a learning curve for SVM and the libsvm implementation. If this is outside your effort budget, my apologies.
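One concrete reading of this, using scikit-learn's SVC (which wraps libsvm): if you can obtain 0/1 labels for a subset of points, whether hand-labelled or bootstrapped from the kNN idea in the question, an SVM can generalise the void boundary to the rest. Everything below is synthetic: the labels come from a made-up circular void, and C and gamma are guesses.

```python
import numpy as np
from sklearn.svm import SVC

# Toy stand-in: random points in the plane, labelled 1 if inside a
# circular void of radius 0.7 (in practice these labels would come
# from a hand-labelled subset or from the kNN strategy)
rng = np.random.default_rng(0)
pts = rng.uniform(-2, 2, (2000, 2))
labels = (np.hypot(pts[:, 0], pts[:, 1]) < 0.7).astype(int)

# Train libsvm (via scikit-learn's wrapper) on half the points...
clf = SVC(kernel="rbf", C=10.0, gamma=2.0).fit(pts[:1000], labels[:1000])

# ...and let the model decide which of the remaining points to zero out
pred = clf.predict(pts[1000:])
accuracy = (pred == labels[1000:]).mean()
print(accuracy)
```

The points predicted as class 1 are the ones whose field values would be forced to zero.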
I am doing a project about image processing, and I asked about how to solve a very large overdetermined system of linear equations here. Before I can figure out a better way to accomplish the task, I just split the image into four equal parts and solve the systems of equations separately. The result is shown in the attached file.
The image represents the surface height of a pictured object. You can think of the two axes as the x and y axis, and the z-axis is the axis coming out of the screen. I solved the very large systems of equations to get z(x,y), which is displayed in this intensity plot. I have the following questions:
The lower-left part is not shown because when I solved the equations for that region, the calculated intensity plot was affected by some extreme values. One or two pixels have an intensity (which represents the height) as high as 60, and because of the colourbar scaling, the rest of the image (whose height ranges only from -15 to 9) appears largely the same colour. I am still figuring out why those one or two pixels give such abnormal results, but if I do get abnormal results like these, how can I eliminate/ignore them so the rest of the image can be seen properly?
I am using the imshow() in matplotlib. I also tried using a 3D plot, with the z-axis representing the surface height, but the result is not good. Are there any other visualization tools that can display the results in a nice way (preferably showing it in a 3D way) given that I have obtained z(x,y) for many pairs of (x,y)?
The four separate parts are clearly visible. Are there any ways to merge the separate parts together? First, I am thinking of sharing the central column and row: e.g. the top-left region spans from column 0 to 250, and the top-right region spans from column 250 to the right. In this case, the values in col=250 will be calculated twice, and the values from each region will almost certainly differ slightly from each other. How should I reconcile the two slightly different values to combine the regions? Just take the average of the two, do something related to curve fitting to merge the two regions, or what? Or should I stick to col=0 to 250, then col=251 to the rightmost?
thanks
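On question 1, a common fix (not from the original post) is to set the colour limits from robust percentiles of the data rather than the raw min/max, so one or two absurd pixels cannot swamp the colourbar. The height map below is synthetic, with two planted outliers like the ones described.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")          # headless backend for this sketch
import matplotlib.pyplot as plt

# Toy height map with two absurd outlier pixels
rng = np.random.default_rng(1)
z = rng.normal(0, 3, (100, 100))
z[10, 10] = 60.0
z[40, 70] = 55.0

# Robust colour limits: ignore the extreme tails when scaling
vmin, vmax = np.percentile(z, [1, 99])
plt.imshow(z, vmin=vmin, vmax=vmax)
plt.colorbar()
plt.savefig("heights.png")
print(vmin, vmax)
```

Alternatives in the same spirit: np.clip the data itself, or mask the offending pixels with a numpy masked array so imshow leaves them blank.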
About point 2: you could try hill shading. See the matplotlib example and/or the Novitsky blog.
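For reference, hill shading is built into matplotlib via matplotlib.colors.LightSource; the sketch below applies it to a made-up surface standing in for z(x, y). The azimuth/elevation and exaggeration values are arbitrary.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")          # headless backend for this sketch
import matplotlib.pyplot as plt
from matplotlib.colors import LightSource

# Toy surface standing in for the solved z(x, y)
y, x = np.mgrid[0:100, 0:100]
z = 5 * np.sin(x / 10.0) * np.cos(y / 15.0)

# Illuminate from the north-west at 45 degrees elevation
ls = LightSource(azdeg=315, altdeg=45)
rgb = ls.shade(z, cmap=plt.cm.viridis, vert_exag=2, blend_mode="overlay")

plt.imshow(rgb, origin="lower")
plt.savefig("shaded.png")
print(rgb.shape)
```

The shaded RGBA image conveys relief without needing a full 3D plot, which tends to hide detail for dense height maps.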
My software should judge spectrum bands: given the locations of the bands, find the peak point and the width of each band.
I learned to take the projection of the image and to find width of each peak.
But I need a better way to find the projection.
The method I used reduces a 1600-pixel-wide image (e.g. 1600x40) to a 1600-point sequence. Ideally I would want to reduce the same image to a 10000-point sequence.
I want a longer sequence because 1600 points give too low a resolution: a single point causes a large difference to the measure (there is a 4% difference if a band is judged at 18 versus 19).
How do I get a longer projection from the same image?
Code I used: https://stackoverflow.com/a/9771560/604511
from PIL import Image
import numpy as np
from scipy.optimize import leastsq

# Load the picture with PIL, process if needed
pic = np.asarray(Image.open("band2.png"))
# Average over the colour channels, then project along the vertical axis
pic_avg = pic.mean(axis=2)
projection = pic_avg.sum(axis=0)
# Normalise, and set the minimum value to zero for a nice fit
projection /= projection.mean()
projection -= projection.min()
What you want to do is called interpolation. Scipy has an interpolate module with a whole bunch of different functions for differing situations; take a look here, or specifically for images here.
Here is a recently asked question that has some example code, and a graph that shows what happens.
But it is really important to realise that interpolation will not make your data more accurate, so it will not help you in this situation.
If you want more accurate results, you need more accurate data; there is no other way. You need to start with a higher-resolution image. (If you resample or interpolate, your results will actually be less accurate!)
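For completeness, here is what the resampling itself would look like with scipy.interpolate (a toy 1600-point projection standing in for the real one); as stated above, this smooths the curve onto 10000 points but adds no real information.

```python
import numpy as np
from scipy.interpolate import interp1d

# A 1600-point projection (toy Gaussian band standing in for real data)
x = np.arange(1600)
projection = np.exp(-0.5 * ((x - 800) / 30.0) ** 2)

# Resample onto a 10000-point axis with cubic interpolation.
# The curve gets denser, not more accurate.
x_new = np.linspace(0, 1599, 10000)
projection_new = interp1d(x, projection, kind="cubic")(x_new)
print(projection_new.shape)
```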
Update - as the question has changed
@Hooked has made a nice point. Another way to think about it is that instead of immediately averaging (which throws away the variance in the data), you can produce 40 graphs (like the lower one in your posted image), one from each horizontal row in your spectrum image. All these graphs will be pretty similar, but with some variation in peak position, height and width. You should calculate the position, height and width of each peak in each of these 40 graphs, then combine this data (matching peaks across the 40 graphs) and, by the central limit theorem, use the appropriate variance as an error estimate for peak position, height and width. That way you can get the most out of your data. However, this assumes some independence between the rows of the spectrogram, which may or may not be the case.
I'd like to offer some more detail to @fraxel's answer (too long for a comment). He's right that you can't get any more information than what you put in, but I think it needs some elaboration...
You are projecting your data from 1600x40 -> 1600 which seems like you are throwing some data away. While technically correct, the whole point of a projection is to bring higher dimensional data to a lower dimension. This only makes sense if...
Your data can be adequately represented in the lower dimension. Correct me if I'm wrong, but it looks like your data are indeed one-dimensional: the vertical axis is a measure of the variability of that particular point on the x-axis (wavelength?).
Given that the projection makes sense, how can we best summarize the data for each particular wavelength point? In my previous answer, you can see I took the average for each point. In the absence of other information about the particular properties of the system, this is a reasonable first-order approximation.
You can keep more of the information if you like. Below I've plotted the variance along the y-axis. This tells me that your measurements have more variability when the signal is higher, and low variability when the signal is lower (which seems useful!):
What you need to do then, is decide what you are going to do with those extra 40 pixels of data before the projection. They mean something physically, and your job as a researcher is to interpret and project that data in a meaningful way!
The code to produce the image is below, the spec. data was taken from the screencap of your original post:
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import leastsq

# Load the picture with PIL, process if needed
pic = np.asarray(Image.open("spec2.png"))
# Average over the colour channels, then project along the vertical axis
pic_avg = pic.mean(axis=2)
projection = pic_avg.sum(axis=0)
# Compute the variance along the vertical axis
variance = pic_avg.var(axis=0)

scale = 1 / 40.
X_val = range(projection.shape[0])
plt.errorbar(X_val, projection * scale, yerr=variance * scale)
plt.imshow(pic, origin='lower', alpha=.8)
plt.axis('tight')
plt.show()
I know that this problem has been solved before, but I've had great difficulty finding any literature describing the algorithms used to process this sort of data. I'm essentially doing some edge finding on a set of 2D data. I want to be able to find a couple of points on an eye diagram (generally used to qualify high-speed communications systems), and as I have had no experience with image processing I am struggling to write efficient methods.
As you can probably see, these diagrams are so called because they resemble the human eye. They can vary a great deal in the thickness, slope, and noise, depending on the signal and the system under test. The measurements that are normally taken are jitter (the horizontal thickness of the crossing region) and eye height (measured at either some specified percentage of the width or the maximum possible point). I know this can best be done with image processing instead of a more linear approach, as my attempts so far take several seconds just to find the left side of the first crossing. Any ideas of how I should go about this in Python? I'm already using NumPy to do some of the processing.
Here's some example data; it is formatted as a 1D array with associated x-axis data. For this particular example, it should be split up every 666 points (2 * int((1.0 / 2.5e9) / 1.2e-12)), since the signal rate was 2.5 Gb/s and the time between points was 1.2 ps.
Thanks!
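A numerical sketch of the folding and measurement steps described above (synthetic data, not the linked dataset): split the record into 666-sample segments, stack them, then read off the eye height as the vertical opening at a sampling column and the jitter as the spread of the transition positions. The noise levels and edge model are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
ui = 666               # samples per fold, from the question's 2 * int(...)
n = 500                # number of folded segments (traces)

# Synthetic folded traces: a random bit on each side of the fold centre,
# a jittered transition point, and additive amplitude noise
traces = np.empty((n, ui))
edges = []
for i in range(n):
    a, b = rng.choice([-1.0, 1.0], 2)              # bits left/right of centre
    edge = ui // 2 + int(round(rng.normal(0, 3)))  # jittered crossing position
    traces[i, :edge] = a
    traces[i, edge:] = b
    if a != b:
        edges.append(edge)
traces += rng.normal(0, 0.05, traces.shape)

# Eye height: vertical opening at the centre of the left half-bit
col = traces[:, ui // 4]
eye_height = col[col > 0].min() - col[col < 0].max()

# Jitter: spread of the transition positions at the fold centre
jitter = np.std(edges)
print(eye_height, jitter)
```

On real data the fold boundaries would come from the known symbol rate, and the crossing positions from interpolated zero crossings rather than exact step indices.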
Have you tried OpenCV (Open Computer Vision)? It's widely used and has a Python binding.
Not to be a PITA, but are you sure you wouldn't be better off with a numerical approach? All the tools I've seen for eye-diagram analysis go the numerical route; I haven't seen a single one that analyzes the image itself.
You say your algorithm is painfully slow on that dataset -- my next question would be why. Are you looking at an oversampled dataset? (I'm guessing you are.) And if so, have you tried decimating the signal first? That would at the very least give you fewer samples for your algorithm to wade through.
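Decimation is a one-liner with scipy.signal.decimate, which low-pass filters before downsampling to avoid aliasing; the waveform below is a made-up oversampled square wave standing in for the real record.

```python
import numpy as np
from scipy.signal import decimate

# Toy oversampled waveform: 666 samples per unit interval is far more
# than needed just to locate the edges
t = np.arange(666 * 20)
rng = np.random.default_rng(0)
x = np.sign(np.sin(2 * np.pi * t / 666)) + rng.normal(0, 0.05, t.size)

# Keep every 6th sample, with anti-alias filtering applied first
y = decimate(x, 6)
print(x.size, y.size)
```

Running the edge-finding code on the decimated record gives it a sixth as many samples to wade through, at the cost of some timing resolution.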
Just going down your route for a moment: if you read those images into memory as they are, wouldn't it be pretty easy to do two flood fills (starting at the centre and at the middle of the left edge) that include all the "white" data? If the fill routine recorded the maximum and minimum height at each column, and the maximum horizontal extent, then you would have all you need.
In other words, I think you're over-thinking this. Edge detection is used in complex "natural" scenes where the edges are unclear. Here your edges are so completely obvious that you don't need to enhance them.
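The flood-fill idea can be sketched with scipy.ndimage.label on a binary image: labelling the connected white regions is equivalent to flood filling each one, and column-wise sums of the region containing the eye opening give the per-column heights and the horizontal extent. The "eye" image below is synthetic (two sinusoidal trace bands).

```python
import numpy as np
from scipy import ndimage

# Synthetic binary eye image: two sinusoidal trace bands crossing,
# leaving open eye regions between the crossings
h, w = 100, 200
y, x = np.mgrid[0:h, 0:w]
upper = 50 + 35 * np.sin(2 * np.pi * x / 200)
lower = 50 - 35 * np.sin(2 * np.pi * x / 200)
white = (np.abs(y - upper) >= 4) & (np.abs(y - lower) >= 4)

# Label the connected white regions (equivalent to flood filling each)
labels, n = ndimage.label(white)

# The eye opening is the white region containing a point inside it
eye = labels == labels[50, 50]
cols = np.where(eye.any(axis=0))[0]      # horizontal extent of the opening
heights = eye.sum(axis=0)                # vertical opening per column
print(n, cols.min(), cols.max(), heights.max())
```

From `heights` and `cols` one can read off the maximum eye height and, from where the opening pinches shut, the extent of the crossing regions, which is what the jitter measurement needs.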