I have added a link to the data set here. The first script produces a line graph using signal output data. My next step was to identify the peaks present on the line graph. The second script has an algorithm to identify all the peaks present on the line graph. However, it is too sensitive: it classifies even the slightest bumps on the graph as peaks. I do not want this; I only wish to identify the conspicuous (large) bumps as peaks. How do I modify the second script to do this? [Line Graph]
import matplotlib.pyplot as plt
import numpy as np

X = np.zeros((10, 4096))
Y = np.zeros((10, 4096))
n = 0
for line in open('data_set2.txt', 'r'):
    values = [float(s) for s in line.split()]
    X[n, 0] = values[0] - 1566518691968
    for m in range(4096):
        Y[n, m] = values[m + 1]
    n = n + 1
plt.plot(Y[1, 0:4095])
plt.show()
b = (X[1:]-X[:-1])[:-1]
c = (X[:-1]-X[1:])[1:]
minima = np.where(np.bitwise_and(b<0, c<0))[0]+1
maxima = np.where(np.bitwise_and(b>0, c>0))[0]+1
all_peaks = np.where((b*c)>0)[0]+1
del b,c
print(minima)
print(maxima)
print(all_peaks)
I wish you had attached the data set so I could try my solution before posting it here. What I think you're doing is looking for all the points that are higher than the point before and the point after them, which is why you're ending up with too many points. What your code should look for, though, is a number of the highest peaks. It does not matter whether the point is a peak, and the peak's height itself does not matter either; what matters is the "uniqueness" of the peak, or how much higher it is than the average point. Imagine removing the highest three peaks in your example and zooming in: you will find a new set of peaks that look much higher than the rest, and so on.
The tricky thing for you is to find the number of those peaks, which depends on how sensitive you want your code to be.
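One rough way to make that idea concrete (my sketch, not a specific named algorithm): keep only the maxima your code already finds that stand out from the signal's average level by some multiple of its standard deviation, and treat that multiple as the sensitivity knob.
import numpy as np

def prominent_peaks(y, maxima, k=3.0):
    # y: 1-D signal; maxima: indices returned by the question's peak finder;
    # k: sensitivity (a larger k keeps only the most conspicuous peaks)
    threshold = y.mean() + k * y.std()
    return [i for i in maxima if y[i] > threshold]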
There are some packages for identifying peaks. SciPy provides the scipy.signal.find_peaks function (https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.find_peaks.html). There are also peak-identification functions for MATLAB, Octave, and others.
All of them require some quantitative criterion for peak identification. None will work with just "conspicuous (large) bumps".
UPD. So, if you want to write the code yourself, you have to choose some filtering function. A few obvious ones are:
Peaks whose value is greater than x;
Peaks that are greater than x% of the maximum peak's value;
The x largest peaks (where x < total number of peaks).
By varying the parameter x you can arrive at a solution that satisfies your criterion of "conspicuous (large) bumps".
P. S. There might be other filtering functions, but it looks like in your case the width and shape of a peak do not matter.
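As a concrete example, here is a minimal sketch with scipy.signal.find_peaks (it assumes y is the row you plot in your first script, and both thresholds are placeholders playing the role of x above):
import numpy as np
from scipy.signal import find_peaks

y = Y[1]  # the row plotted in the question's first script

# keep only peaks that are tall in absolute terms and that also stand out
# from their surroundings; tune both thresholds to taste
peaks, props = find_peaks(y, height=np.mean(y) + 3*np.std(y), prominence=np.std(y))
print(peaks)                  # indices of the conspicuous peaks
print(props["peak_heights"])  # their heights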
I am unaware of the algorithms suggested in the other answers; if I were you I'd check them out, because their way should be more efficient. However, if you want to be lazy, here is a solution:
First I gather all the peaks (the same peaks you get):
x_range = range(4096)
peaks = []
for line in open('data_set2.txt', 'r'):
    values = [float(s) for s in line.split()]
    X[n, 0] = values[0] - 1566518691968
    for m in range(4096):
        Y[n, m] = values[m + 1]
        # only work with the plotted row starting from the third value
        if n != 1 or m in (0, 1):
            continue
        # if a point is higher than the point before and point after
        if Y[n, m - 2] < Y[n, m - 1] > Y[n, m]:
            peaks.append((x_range[m - 1], Y[n, m - 1]))
    n = n + 1
plt.plot(Y[1, 0:4095])
plt.show()
Then I loop through the peaks in chunks of 100 (assuming no two peaks of interest occur within a 100-point range, otherwise one of them will be discarded) and find each chunk's maximum. If the segment maximum is at least some percentage of the absolute maximum, it is included.
import sys

max_ = max(peaks, key=lambda x: x[1])[1]
# If the graph does not have peaks
if max_ < np.mean(Y) * 10:  # Tweak point 1
    print("No peaks")
    sys.exit()
highest = []
sensitivity = 0.2  # Tweak point 2
for i in range(0, len(peaks), 100):
    try:
        segment = peaks[i: i + 100]
    except IndexError:
        segment = peaks[i:]
    finally:
        segment_max = max(segment, key=lambda x: x[1])[1]
        if segment_max >= max_ * sensitivity:
            highest.append(segment_max)
print(highest)
Related
I have radar data for a vehicle moving away from the radar; the radar outputs a .csv file. Once the radar detects something, the amplitude column switches from 0 to 1 and starts outputting values, which can then be plotted. For example, here:
When the column for Distance/Amplitude goes from 0 to a number, it can be inferred that the target has been seen by the radar. Plotting the row of the first such instance gives this blue wave.
If we plot the rows below it, we get this:
The radar was placed at the back, so the target was moving away from it. The x-axis represents distance in multiples of 0.077 m, so for the first blue wave the distance the radar registers is 37 * 0.077 m. I was wondering whether there is a way to get a range of values from the .csv file that takes the two peaks into account. For example: how could I get the top two peaks from the blue wave, get their x-axis coordinates, compute a median point between them, and then track that for the orange wave, which is the second row below the first one?
I have attached below the .csv file.
https://drive.google.com/file/d/1IJOebiXuScjLPytemulcXph7ZB1X65wU/view?usp=sharing
I have an algorithm that gets the index of the first hit and the last hit, i.e. when the values switch from 0 to a value and from a value back to 0; these allow me to catch when the radar detects a target. This was helpful while I was using the values given directly by the radar, like the distance and amplitude values, but now that I need a whole row I don't know how to proceed. I don't know whether Pandas or NumPy has tools I can use to deal with this.
There are a few ways to get peaks, and thus to get the two peak positions. Take the derivative of the data set: the points where the derivative crosses the x-axis are the peaks and valleys of the original data. While doing that, you can also grab the indices of those peaks and valleys. From there, you can iterate through those points in the original data to get the two maximum values and their indices.
It would look something like this:
import matplotlib.pyplot as plt
import numpy as np

# My data set (example is a crazy cosine wave)
x = np.linspace(1, 100, 1000)
y = np.cos(x*0.5)*((x - 50)**3)

# Find first derivative:
m = np.diff(y)/np.diff(x)

# Get indices of peaks and valleys
c = len(m)
indices = []
for i in range(1, c):
    prev_val = m[i-1]
    val = m[i]
    if prev_val < 0 and val >= 0:
        indices.append(i)
    elif prev_val > 0 and val <= 0:
        indices.append(i)

# Get the values, positions, and indices of the two highest peaks
max_list = [0, 0]
index_list = [0, 0]
pos_list = [0, 0]
for index in indices:
    val = y[index]
    if val > max_list[0]:
        # demote the previous largest peak to second place
        max_list[1], index_list[1], pos_list[1] = max_list[0], index_list[0], pos_list[0]
        max_list[0] = val
        index_list[0] = index
        pos_list[0] = x[index]
    elif val > max_list[1]:
        max_list[1] = val
        index_list[1] = index
        pos_list[1] = x[index]

print('Two peak indices:', index_list)
print('Two peak values:', max_list)
print('Two peak x-positions:', pos_list)

average_pos = (pos_list[0] + pos_list[1])/2
print('Average x-position:', average_pos)

plt.plot(x, y)
plt.show()
I have some random test data in a 2D array of shape (500,2) as such:
xy = np.random.randint(low=0.1, high=1000, size=[500, 2])
From this array, I first select 10 random samples. To select the 11th sample, I would like to pick the sample that is furthest away from the original 10 selected samples collectively; I am using the Euclidean distance to do this. I need to keep doing this until a certain number have been picked. Here is my attempt at doing this.
# Function to get the distance between samples
def get_dist(a, b):
    return np.sqrt(np.sum(np.square(a - b)))

# Set up variables and empty lists for the selected sample and starting samples
n_xy_to_select = 120
selected_xy = []
starting = []

# This selects 10 random samples and appends them to selected_xy
for i in range(10):
    idx = np.random.randint(len(xy))
    starting_10 = xy[idx, :]
    selected_xy.append(starting_10)
    starting.append(starting_10)
    xy = np.delete(xy, idx, axis=0)
starting = np.asarray(starting)

# This performs the selection based on the distances
for i in range(n_xy_to_select - 1):
    # Set up an empty array dists
    dists = np.zeros(len(xy))
    for selected_xy_ in selected_xy:
        # Get the distance between each already selected sample, and every other unselected sample
        dists_ = np.array([get_dist(selected_xy_, xy_) for xy_ in xy])
        # Apply some kind of penalty function - this is the key
        dists_[dists_ < 90] -= 25000
        # Sum dists_ onto dists
        dists += dists_
    # Select the largest one
    dist_max_idx = np.argmax(dists)
    selected_xy.append(xy[dist_max_idx])
    xy = np.delete(xy, dist_max_idx, axis=0)
The key to this is this line - the penalty function
dists_[dists_ < 90] -= 25000
This penalty function exists to prevent the code from just picking a ring of samples at the edge of the space, by artificially shortening values that are close together.
However, this eventually breaks down, and the selection starts clustering, as shown in the image. You can clearly see that there are much better selections that the code can make before any kind of clustering is necessary. I feel that a kind of decaying exponential function would be best for this, but I do not know how to implement it.
So my question is; how would I change the current penalty function to get what I'm looking for?
From your question, I understand that what you are looking for are Periodic Boundary Conditions (PBC), meaning that a point at the left edge of your space is considered to be right next to a point at the right edge. Thus, the maximal distance you can get along one axis is half of the box (i.e. the distance between the edge and the center).
To take the PBC into account you need to compute the distance along each axis and, whenever it exceeds half of the box, wrap it around the boundary.
For example, if you have a point with x1 = 100 and a second one with x2 = 900, using the PBC they are 200 units apart: 1000 - |x1 - x2|. In the general case, given two coordinates and the box size L, you end up with:
delta = |x1 - x2|; if delta > L/2, then delta = L - delta
In your case (L = 1000) this becomes:
delta_x[delta_x > 500] = 1000 - delta_x[delta_x > 500]
To wrap it up, I rewrote your code using a new distance function (note that I removed some unnecessary for loops):
import numpy as np

def distance(p, arr, box=1000):
    half_box = box / 2
    delta_x = np.abs(p[0] - arr[:, 0])
    delta_y = np.abs(p[1] - arr[:, 1])
    # wrap distances larger than half the box around the periodic boundary
    delta_x[delta_x > half_box] = box - delta_x[delta_x > half_box]
    delta_y[delta_y > half_box] = box - delta_y[delta_y > half_box]
    return np.sqrt(delta_x**2 + delta_y**2)

xy = np.random.randint(low=0.1, high=1000, size=[500, 2])
idx = np.random.randint(500, size=10)
selected_xy = list(xy[idx])
_initial_selected = xy[idx]
xy = np.delete(xy, idx, axis=0)
n_xy_to_select = 120

for i in range(n_xy_to_select - 1):
    # Set up an empty array dists
    dists = np.zeros(len(xy))
    for selected_xy_ in selected_xy:
        # Compute the distance taking into account the PBC
        dists_ = distance(selected_xy_, xy)
        dists += dists_
    # Select the largest one
    dist_max_idx = np.argmax(dists)
    selected_xy.append(xy[dist_max_idx])
    xy = np.delete(xy, dist_max_idx, axis=0)
And indeed it creates clusters; this is normal, as you will tend to create clusters of points that are at the maximal distance from each other. More than that, due to the boundary conditions, the maximal distance between two points along one axis is 500. The maximal distance between two clusters is thus also 500, and as you can see in the image, that is the case.
Moreover, picking more points will start to draw lines connecting the different clusters, starting from the central one, as you can see here:
What I was looking for is called 'Furthest Point Sampling'. I have done some more research into the solution, and the Python code used to perform it is found here: https://minibatchai.com/ai/2021/08/07/FPS.html
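For reference, here is a minimal sketch of the farthest point sampling idea (my own greedy summary, not the code from the linked post):
import numpy as np

def farthest_point_sampling(points, n_samples, seed=None):
    # Greedy FPS: repeatedly add the point whose distance to the
    # already-selected set is largest.
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    selected = [int(rng.integers(len(points)))]  # random starting point
    # distance of every point to the current selected set
    dists = np.linalg.norm(points - points[selected[0]], axis=1)
    for _ in range(n_samples - 1):
        next_idx = int(np.argmax(dists))         # farthest from the set
        selected.append(next_idx)
        # distance to the set is the minimum over all selected points
        dists = np.minimum(dists, np.linalg.norm(points - points[next_idx], axis=1))
    return points[selected]

xy = np.random.randint(low=0, high=1000, size=[500, 2])
sample = farthest_point_sampling(xy, 120, seed=0)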
I have come across the following problem. I have created a program that estimates the trajectory of a camera recording video along the x and y axes. This approach is very common in warp stabilizers. I want to find the range of values that represents the most stable video footage and get those values. Here is the graph of the trajectory; the graph is based on a numpy array. I guess the best idea would be to pick the part where the values change most slowly, but I am not sure how to do it.
The following code will find the longest range of values that, for both X and Y, is stable within a certain threshold. This threshold limits the change in X and Y.
You can use it to tune your result.
If you want a more stable section, choose a lower epsilon.
This will result in a shorter range.
import numpy as np

X = 150*np.random.rand(270)  # mock data as I do not have yours
Y = 150*np.random.rand(270)  # Replace X and Y with your X and Y data
epsilon = 80  # threshold

# get indexes where the difference is smaller than the threshold for X and Y
valid_values = np.logical_and(np.abs(np.diff(X)) < epsilon, np.abs(np.diff(Y)) < epsilon)

cumulative_valid_values = []
count = 0
# find longest range of values that satisfy the threshold
for idx, value in enumerate(valid_values):
    if value:
        count = count + 1
    else:
        count = 0
    cumulative_valid_values.append(count)

# Calculate start and end of largest stable range
end_of_range = cumulative_valid_values.index(max(cumulative_valid_values)) + 1
start_of_range = end_of_range - max(cumulative_valid_values) + 1
print("Largest stable range is: ", start_of_range, " - ", end_of_range)
I have this distribution of points (allPoints, which is a list of lists: [[x1,y1][x2,y2][x3,y3][x4,y4]...[xn,yn]]):
From which I'd like to select points, randomly.
in Python I would do something like:
from random import *
point = choice(allPoints)
Except, I need the random pick not to be biased by the existing density. For instance, here, "choice" would tend to pick a point in the upper-left part of the plot.
How can I, in Python, get rid of this bias?
I've tried dividing the space into portions of size "div" and then sampling within one portion, but in many cases no points exist there at all and the while loop never finds a solution:
def column(matrix, i):
    return [row[i] for row in matrix]

div = 10
min_x, max_x = min(column(allPoints, 0)), max(column(allPoints, 0))
min_y, max_y = min(column(allPoints, 1)), max(column(allPoints, 1))
zone_x_min = randint(1, div-1) * (max_x - min_x) / div + min_x
zone_x_max = zone_x_min + (max_x - min_x) / div
zone_y_min = randint(1, div-1) * (max_y - min_y) / div + min_y
zone_y_max = zone_y_min + (max_y - min_y) / div
p = choice(allPoints)
cont = True
while cont == True:
    if (p[0] > zone_x_min and p[0] < zone_x_max) and (p[1] > zone_y_min and p[1] < zone_y_max):
        cont = False
    else:
        p = choice(allPoints)
what would be a correct, inexpensive (if possible) solution to this problem?
If it weren't ridiculous, I think something like this would work for me, in theory:
p = [uniform(min_x,max_x),uniform(min_y,max_y)]
while p not in allPoints:
    p = [uniform(min_x, max_x), uniform(min_y, max_y)]
The question is a little ill-formed, but here's a stab.
The idea is to use a gaussian kernel density estimate, then sample from your data with weights equal to the inverse of the pdf at each point.
This is not statistically justifiable in any real sense.
import numpy as np
from scipy import stats
#random data
x = np.random.normal(size = 200)
y = np.random.normal(size = 200)
#estimate the density
kernel = stats.gaussian_kde(np.vstack([x,y]))
#calculate the inverse of pdf for each point, and normalise to sum to 1
pvector = 1/kernel.pdf(np.vstack([x,y]))/sum(1/kernel.pdf(np.vstack([x,y])))
#get a vector of indices based on your weights
np.random.choice(range(len(x)), size = 10, replace = True, p = pvector)
I believe you want to randomly select a datum point from your graph. That is, one of the little black dots.
Compute a centroid, or pick a point like (1.0, 70).
Compute the distance from each point to the centroid and let that be the probability of your choice of that point.
That is, if distance(P, C) is 100 and distance(Q, C) is 1, then let P be 100x more likely to be chosen. All points are eligible to win, but the crowded ones are individually less likely (though they make up for it in volume).
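A minimal sketch of that weighting (my code; allPoints from the question is replaced here by stand-in data):
import numpy as np

pts = np.random.rand(500, 2) * [4.5, 70]       # stand-in for allPoints
centroid = pts.mean(axis=0)                    # or a hand-picked point like (1.0, 70)
dists = np.linalg.norm(pts - centroid, axis=1)
weights = dists / dists.sum()                  # probability proportional to distance

idx = np.random.choice(len(pts), p=weights)
point = pts[idx]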
If I understand your initial attempt correctly, I believe there is a simple adjustment you can make to make this work.
Randomly generate an x value (0,4.5), and a y value (0,70).
Then loop through allPoints to find the closest dot.
This has the downside that large empty areas all converge to a single point. A way to mitigate (not remove) this problem would be to give your random point a range: if no dot exists within that range, randomly generate a new random point.
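A minimal sketch of that adjustment (my code; the 0-4.5 and 0-70 bounds are read off the question's plot, and allPoints is again replaced by stand-in data):
import numpy as np

pts = np.random.rand(500, 2) * [4.5, 70]        # stand-in for allPoints
max_range = 0.5                                 # "range" around the random point

while True:
    rand_xy = np.array([np.random.uniform(0, 4.5), np.random.uniform(0, 70)])
    dists = np.linalg.norm(pts - rand_xy, axis=1)
    if dists.min() <= max_range:                # a dot exists within range
        point = pts[np.argmin(dists)]
        break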
Assuming you want your selected points to be visually spread, I can think of at least one "efficient/easy" method (a sketch follows the list):
Choose a random point (with random.choice for example) ;
remove from your initial set any point that is "close"*;
repeat until there is no point left in your set.
*This requires that you know from the beginning how dense you want your sample to be.
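A sketch of this thinning approach (my code; min_dist is the "close" radius from the footnote above and encodes how dense you want the sample to be):
import random

def spread_sample(points, min_dist):
    # points: list of [x, y] pairs; returns a visually spread subset
    remaining = [tuple(p) for p in points]
    chosen = []
    while remaining:
        p = random.choice(remaining)            # 1. pick a random point
        chosen.append(p)
        # 2. drop every point closer than min_dist (including p itself)
        remaining = [q for q in remaining
                     if (q[0] - p[0])**2 + (q[1] - p[1])**2 >= min_dist**2]
    return chosen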
Does anyone know a good method to calculate the empirical/sample covariogram, if possible in Python?
This is a screenshot of a book which contains a good definition of a covariogram:
If I understood it correctly, for a given lag/width h, I'm supposed to get all the pairs of points that are separated by h (or less than h), multiply their values, and for each of these points calculate its mean, which in this case is defined as m(x_i). However, according to the definition of m(x_i), if I want to compute m(x1), I need to obtain the average of the values located within distance h from x1. This looks like a very intensive computation.
First of all, am I understanding this correctly? If so, what is a good way to compute this assuming a two-dimensional space? I tried to code this in Python (using numpy and pandas), but it takes a couple of seconds and I'm not even sure it is correct, which is why I will refrain from posting the code here. Here is another attempt at a very naive implementation:
import numpy as np
from scipy.spatial.distance import pdist, squareform

distances = squareform(pdist(np.array(coordinates)))  # coordinates is a nx2 array
z = np.array(z)  # z are the values
cutoff = np.max(distances)/3.0  # somewhat arbitrary cutoff
width = cutoff/15.0
widths = np.arange(0, cutoff + width, width)
Z = []
Cov = []
for w in np.arange(len(widths)-1):  # for each width
    # for each pairwise distance
    for i in np.arange(distances.shape[0]):
        for j in np.arange(distances.shape[1]):
            if distances[i, j] <= widths[w+1] and distances[i, j] > widths[w]:
                m1 = []
                m2 = []
                # when a distance is within a given width, calculate the means of
                # the points involved
                for x in np.arange(distances.shape[1]):
                    if distances[i, x] <= widths[w+1] and distances[i, x] > widths[w]:
                        m1.append(z[x])
                for y in np.arange(distances.shape[1]):
                    if distances[j, y] <= widths[w+1] and distances[j, y] > widths[w]:
                        m2.append(z[y])
                mean_m1 = np.array(m1).mean()
                mean_m2 = np.array(m2).mean()
                Z.append(z[i]*z[j] - mean_m1*mean_m2)
    Z_mean = np.array(Z).mean()  # calculate covariogram for width w
    Cov.append(Z_mean)  # collect covariances for all widths
However, now I have confirmed that there is an error in my code. I know that because I used the variogram to calculate the covariogram (covariogram(h) = covariogram(0) - variogram(h)) and I get a different plot:
And it is supposed to look like this:
Finally, if you know a Python/R/MATLAB library to calculate empirical covariograms, let me know. At least, that way I can verify what I did.
One could use scipy.cov, but if one does the calculation directly (which is very easy), there are more ways to speed this up.
First, make some fake data that has some spatial correlations. I'll do this by first making the spatial correlation map, and then generating random data points from it, where the data is positioned according to the underlying map and also takes on the values of the underlying map.
Edit 1:
I changed the data point generator so positions are purely random, but z-values are proportional to the spatial map. And I changed the map so that the left and right sides are shifted relative to each other, to create negative correlation at large h.
from numpy import *
import random
import math
import matplotlib.pyplot as plt

S = 1000
N = 900

# first, make some fake data, with correlations on two spatial scales
# density map
x = linspace(0, 2*pi, S)
sx = sin(3*x)*sin(10*x)
density = .8 * abs(outer(sx, sx))
density[:, :S//2] += .2

# make a point cloud motivated by this density
random.seed(10)  # so this can be repeated
points = []
while len(points) < N:
    v, ix, iy = random.random(), random.randint(0, S-1), random.randint(0, S-1)
    if True:  # v<density[ix,iy]:
        points.append([ix, iy, density[ix, iy]])
locations = array(points).transpose()
print(locations.shape)

plt.imshow(density, alpha=.3, origin='lower')
plt.plot(locations[1, :], locations[0, :], '.k')
plt.xlim((0, S))
plt.ylim((0, S))
plt.show()

# build these into the main data: all pairs into distances and z0 z1 values
L = locations
m = array([[math.sqrt((L[0, i]-L[0, j])**2 + (L[1, i]-L[1, j])**2), L[2, i], L[2, j]]
           for i in range(N) for j in range(N) if i > j])
Which gives:
The above is just the simulated data, and I made no attempt to optimize its production, etc. I assume this is where the OP starts, with the task below, since the data already exists in a real situation.
Now calculate the "covariogram" (which is much easier than generating the fake data, btw). The idea here is to sort all the pairs and associated values by h, and then index into these using ihvals. That is, summing up to index ihval is the sum over N(h) in the equation, since this includes all pairs with hs below the desired values.
Edit 2:
As suggested in the comments below, N(h) is now only the pairs that are between h-dh and h, rather than all pairs between 0 and h (where dh is the spacing of the h-values in hvals -- i.e., S/1000 was used below).
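To make the target explicit, my reading of what the loop below computes for each bin (a reconstruction from the code, not necessarily the book's exact notation) is

\hat{C}(h) \approx \frac{1}{N(h)} \sum_{(i,j) \in N(h)} z_i z_j - \Big(\frac{1}{N(h)} \sum_{(i,j) \in N(h)} z_i\Big) \Big(\frac{1}{N(h)} \sum_{(i,j) \in N(h)} z_j\Big)

where N(h) counts the pairs whose separation falls between h-dh and h.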
# now do the real calculations for the covariogram
# sort by h and give clear names
i = argsort(m[:, 0])  # h sorting
h = m[i, 0]
zh = m[i, 1]
zsh = m[i, 2]
zz = zh*zsh
hvals = linspace(0, S, 1000)  # the values of h to use (S should be in the units of distance, here I just used ints)
ihvals = searchsorted(h, hvals)
result = []
for i, ihval in enumerate(ihvals[1:]):
    start, stop = ihvals[i], ihval  # consecutive bin edges: pairs with h between hvals[i] and hvals[i+1]
    N = stop - start
    if N > 0:
        mnh = sum(zh[start:stop])/N
        mph = sum(zsh[start:stop])/N
        szz = sum(zz[start:stop])/N
        C = szz - mnh*mph
        result.append([h[ihval], C])
result = array(result)
plt.plot(result[:, 0], result[:, 1])
plt.grid()
plt.show()
which looks reasonable to me, as one can see bumps or troughs at the expected h values, but I haven't done a careful check.
The main speedup here over scipy.cov is that one can precalculate all of the products, zz. Otherwise, one would feed zh and zsh into cov for every new h, and all the products would be recalculated. The calculation could be sped up even more by doing partial sums, i.e. from ihvals[n-1] to ihvals[n] at each step n, but I doubt that will be necessary.
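For reference, a sketch of that partial-sums idea (my addition; it assumes h, zh, zsh, zz and ihvals are defined as in the script above, which uses from numpy import *):
# cumulative sums are computed once; each bin's sum is then a difference of
# two cumulative values instead of a fresh sum over the slice
czh = concatenate(([0.0], cumsum(zh)))
czsh = concatenate(([0.0], cumsum(zsh)))
czz = concatenate(([0.0], cumsum(zz)))

result = []
for start, stop in zip(ihvals[:-1], ihvals[1:]):
    Nb = stop - start
    if Nb > 0:
        mnh = (czh[stop] - czh[start]) / Nb
        mph = (czsh[stop] - czsh[start]) / Nb
        szz = (czz[stop] - czz[start]) / Nb
        result.append([h[stop - 1], szz - mnh*mph])  # last h in the bin
result = array(result)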