G'day programmers and math enthusiasts.
Recently I have been exploring how CAS graphing calculators function; in particular, how they are able to draw level curves and hence contours for multivariable functions.
Just a couple of notes before I ask my question:
I am using Python's Pygame library purely for the window and graphics. Of course there are better options out there but I really wanted to keep my code as primitive as I am comfortable with, in an effort to learn the most.
Yes, yes. I know about matplotlib! God, have I seen 100 different suggestions for using other supporting libraries. And while they are definitely stunning and robust tools, I am really trying to build up my knowledge from the foundations here, so that one day I may even be able to contribute to libraries like them.
My ultimate goal is to get plots looking as smooth as this:
Mathematica Contour Plot Circle E.g.
What I currently do is:
Evaluate the function over a 500x500 grid of points and keep the points where it is approximately equal to 0, within some error tolerance (mine is 0.01). This gives me a rough approximation of the level curve at f(x,y)=0.
Then I use a dodgy distance function to find each point's closest neighbour, and draw an anti-aliased line between the two.
The results of both of these steps can be seen here:
First Evaluating Valid Grid Points
Then Drawing Lines to Closest Points
For obvious reasons I've got gaps in the graph where connecting each point to its closest neighbour keeps the curve discontinuous. Alas! I thought of another janky workaround: on top of finding the closest point, it actually looks for the next closest point that hasn't already been visited. This idea came closer, but still doesn't seem anywhere near efficient. Here are my results after implementing that:
Slightly Smarter Point Connecting
My question is, how is this sort of thing typically implemented in graphing calculators? Have I been going about this all wrong? Any ideas or suggestions would be greatly appreciated :)
(I haven't included any code, mainly because it's not super clear, and also not particularly relevant to the problem).
Also, if anyone has some hardcore math answers to suggest, don't be afraid to suggest them; I've got a healthy background in coding and mathematics (especially numerical and computational methods), so I'm hoping I'll be able to cope with them.
so you are evaluating the equation for every x and y point on your plane. then you check if the absolute value of the result is < 0.01 and if so, you draw the point.
a better way to decide if a point should be drawn is to check if one of the following is true (see the sketch after the list):
(a) if the point is zero
(b) if the point is positive and has at least one negative neighbor
(c) if the point is negative and has at least one positive neighbor
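a minimal numpy sketch of that neighbour test (the unit circle in f is just a stand-in for whatever function you are plotting, and the grid range is a placeholder):

import numpy as np

def f(x, y):
    # stand-in implicit function: the unit circle
    return x**2 + y**2 - 1

xs = np.linspace(-2, 2, 500)
ys = np.linspace(-2, 2, 500)
Z = f(xs[None, :], ys[:, None])     # f sampled on the 500x500 grid

sign = np.sign(Z)
edge = Z == 0                        # rule (a): exact zeros
# rules (b) and (c): any 4-neighbour with a different sign
edge[:-1, :] |= sign[:-1, :] != sign[1:, :]
edge[1:,  :] |= sign[1:,  :] != sign[:-1, :]
edge[:, :-1] |= sign[:, :-1] != sign[:, 1:]
edge[:,  1:] |= sign[:,  1:] != sign[:, :-1]
# draw every pixel where edge is True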
there are 3 problems with this:
it doesn't support any kind of anti-aliasing, so the result will not look as smooth as you would want
you can't make thicker lines (more than 1 pixel)
it misses places where the zero line only touches the grid without crossing it (the function is positive on both sides instead of positive on one side and negative on the other), so no sign change is detected
this second solution may fix those problems, but I made it up and haven't tested it, so it may or may not work:
you assign the sampled values to the corners of each pixel and then estimate each pixel's distance to the zero line from its corners. this is the algorithm for finding the distance:
def distance(tl, tr, bl, br):  # values at the 4 corners
    avg = abs((tl + tr + bl + br) / 4)   # absolute value of the average
    m = min(map(abs, (tl, tr, bl, br)))  # smallest absolute corner value
    if m == 0:  # special case: a corner sits exactly on the curve
        return 0.0
    return avg / m  # estimated distance to the zero line, assuming the trend continues
this returns the estimated distance to the zero line. you can now draw the pixel: e.g. if you want a 5-pixel line, then if the result is < 4 you draw the pixel at full colour, elif the result is < 5 you draw it with an opacity of 5 - distance (times 255 if you are using pygame's alpha option)
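as a rough sketch of that drawing rule (untested like the rest; H, W, Z and screen are placeholder names: screen a pygame surface with a white background, Z an (H+1) x (W+1) grid of corner samples):

LINE_W = 5   # desired line width in pixels

for y in range(H):
    for x in range(W):
        d = distance(Z[y][x], Z[y][x + 1], Z[y + 1][x], Z[y + 1][x + 1])
        if d < LINE_W - 1:
            screen.set_at((x, y), (0, 0, 0))       # full colour
        elif d < LINE_W:
            g = int(255 * (d - (LINE_W - 1)))      # fade toward the white background
            screen.set_at((x, y), (g, g, g))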
this solution assumes that the function is somewhat linear.
just try it, in the worst case it doesn't work...
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.30.1319&rep=rep1&type=pdf
This 21-page paper has everything I need to draw implicit curves accurately and smoothly. It even covers optimisation methods and handles bifurcation points of implicit functions. I would highly recommend it to anyone with questions similar to my own above.
Thanks to everyone who had recommendations and clarifying questions, they all helped lead me to this resource.
I am trying to reduce the number of data points for a 3D curve; currently I have 20000 points and I would like to reduce this to around 2000 without losing much information.
I am doing this in Python.
As a simple example, think of a spiral on the surface of a cylinder.
Are there any built-in functions that will do this?
I've tried using the Ramer–Douglas–Peucker algorithm to simplify the line, but due to the nature of the curve, every ignored data point makes the final plot undershoot. See the picture of a 2D example: orange is what rdp produces, green is what I want.
I would like the output of the program to be an array of ~2000 coordinates that still represent the shape of the 3D curve. They don't necessarily have to be original coordinates; I want some points to overshoot and others to undershoot.
Thank you for your help
UPDATE:
In the end I chose to do something quite involved, but it gave me exactly what I wanted. I started by using the rdp algorithm to reduce the number of points. With this new information I then fit a straight line of best fit to the spread of original points between each pair of reduced points:
i.e. if the algorithm 'ignored' 13 points, I fit the line from point 0 to point 14, and did the same for the next segment where the algorithm had skipped, for example, 7 points, so I fit from 14 to 22, etc.
Having those lines of best fit, I found the points where the lines intersected or, if the lines did not intersect, the closest point on each of the lines to the other line.
Due to the nature of my problem, I did not need my data to be continuous, so 2000 "discontinuous" segments were not a problem.
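For reference, a condensed 2D sketch of that pipeline; it assumes the PyPI rdp package and numpy, a least-squares fit of y on x will misbehave on near-vertical runs, and the 3D case would need a parametric fit instead:

import numpy as np
from rdp import rdp   # pip install rdp (assumed; any RDP implementation works)

def segment_lines(points, epsilon):
    # indices of the vertices RDP keeps, then a least-squares line through
    # all original points spanned by each consecutive pair of kept vertices
    mask = rdp(points, epsilon=epsilon, return_mask=True)
    kept = np.flatnonzero(mask)
    lines = []
    for a, b in zip(kept[:-1], kept[1:]):
        seg = points[a:b + 1]
        slope, intercept = np.polyfit(seg[:, 0], seg[:, 1], 1)
        lines.append((slope, intercept))
    return lines

def intersect(l1, l2):
    # meeting point of y = m1*x + c1 and y = m2*x + c2, or None if parallel
    (m1, c1), (m2, c2) = l1, l2
    if np.isclose(m1, m2):
        return None
    x = (c2 - c1) / (m1 - m2)
    return x, m1 * x + c1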
Thank you very much for your help!
I am not a mathematician but I am pretty sure my problem could be solved with a bit (maybe a lot?) of good maths.
Let me explain the problem with a picture.
I have a network (GIS data) which is composed of many linear segments.
Occasionally a curve is present within these segments, and I need a reasonable method to detect such curves more or less automatically.
Given that I have the coordinates of my segments and of the curves (the green dots in the picture), would you recommend a reasonable way to detect these curves?
I am not sure, but it could be similar to the opposite of what is asked in this other SO question; I don't actually have a function whose second derivative I can calculate, only line segments (and curves) defined by vertices...
Assuming you can easily list out the points in a segment and iterate over them, and that a segment is "mostly" linear, you can take the end-points of a segment and interpolate a line between them.
Next, check if each point of the segment lies on the interpolated line and add a margin of error.
You can then assume that several adjacent points of the segment that do not lie on the interpolated line make up a curve.
You may need to implement other checks:
Are the end-points part of a straight segment -- i.e. check that the segment does not end in a curve
Does the segment bend and should the segment be treated as two segments?
Can two curves be adjacent to one another without a point between them that's on the line?
To get started with python, I'd write the function is_on_line and loop over all the points, calling it each time to see if the point is on the line.
Excuse the verbose code (it makes lots of assumptions about data structures and could be done in one loop), but this should help you break the problem apart to get started:
import math

def is_on_line(endpoint_1_x, endpoint_1_y,
               endpoint_2_x, endpoint_2_y,
               coord_x, coord_y, error_margin):
    # perpendicular distance from the point to the line through the endpoints
    dx = endpoint_2_x - endpoint_1_x
    dy = endpoint_2_y - endpoint_1_y
    dist = abs(dy * (coord_x - endpoint_1_x)
               - dx * (coord_y - endpoint_1_y)) / math.hypot(dx, dy)
    return dist <= error_margin

points_on_line = []
for point in segment:
    points_on_line.append((point, is_on_line(
        endpoint_1_x=segment[0].x,
        endpoint_1_y=segment[0].y,
        endpoint_2_x=segment[-1].x,
        endpoint_2_y=segment[-1].y,
        coord_x=point.x,
        coord_y=point.y,
        error_margin=0.1,
    )))

for point, on_line in points_on_line:
    # figure out where your curves are
    ...
I know that this problem has been solved before, but I've had great difficulty finding any literature describing the algorithms used to process this sort of data. I'm essentially doing some edge finding on a set of 2D data. I want to be able to find a couple of points on an eye diagram (generally used to qualify high-speed communications systems), and as I have had no experience with image processing I am struggling to write efficient methods.
As you can probably see, these diagrams are so called because they resemble the human eye. They can vary a great deal in thickness, slope, and noise, depending on the signal and the system under test. The measurements that are normally taken are jitter (the horizontal thickness of the crossing region) and eye height (measured either at some specified percentage of the width or at the maximum possible point). I know this can best be done with image processing rather than a more linear approach, as my attempts so far take several seconds just to find the left side of the first crossing. Any ideas of how I should go about this in Python? I'm already using NumPy to do some of the processing.
Here's some example data; it is formatted as a 1D array with associated x-axis data. For this particular example, it should be split up every 666 points (2 * int((1.0 / 2.5e9) / 1.2e-12)), since the rate of the signal was 2.5 Gb/s and the time between points was 1.2 ps.
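For illustration, that splitting could be a single reshape (numpy assumed; data stands in for the 1D array described above):

import numpy as np

samples_per_trace = 2 * int((1.0 / 2.5e9) / 1.2e-12)   # = 666 points per sweep
n = (len(data) // samples_per_trace) * samples_per_trace
traces = np.asarray(data[:n]).reshape(-1, samples_per_trace)
# each row of traces is one two-unit-interval sweep; overlaying rows forms the eye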
Thanks!
Have you tried OpenCV (Open Computer Vision)? It's widely used and has a Python binding.
Not to be a PITA, but are you sure you wouldn't be better off with a numerical approach? All the tools I've seen for eye-diagram analysis go the numerical route; I haven't seen a single one that analyzes the image itself.
You say your algorithm is painfully slow on that dataset -- my next question would be why. Are you looking at an oversampled dataset? (I'm guessing you are.) And if so, have you tried decimating the signal first? That would at the very least give you fewer samples for your algorithm to wade through.
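For instance, assuming SciPy is available (the post only mentions NumPy), decimation is a one-liner; samples and the factor of 10 are placeholders:

from scipy.signal import decimate

# low-pass filters, then keeps every 10th sample; choose the factor so the
# features you measure (crossings, transitions) stay well resolved
reduced = decimate(samples, 10)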
just going down your route for a moment: if you read those images into memory as they are, wouldn't it be pretty easy to do two flood fills (starting at the centre and at the middle of the left edge) that include all "white" data? if the fill routine recorded the maximum and minimum height at each column, and the maximum horizontal extent, then you'd have all you need.
in other words, i think you're over-thinking this. edge detection is used in complex "natural" scenes where the edges are unclear. here your edges are so completely obvious that you don't need to enhance them.
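a rough sketch of that idea, assuming the image is already a 2D boolean numpy array with True for the "white" data; scipy's connected-component labelling stands in for a hand-rolled flood fill, and all the names are mine:

import numpy as np
from scipy import ndimage

def fill_extents(img, seed):
    # img: 2D bool array, True where the trace is
    # seed: (row, col) known to lie inside the region to fill
    labels, _ = ndimage.label(img)
    region = labels == labels[seed]              # the flood-filled component
    rows, cols = np.nonzero(region)
    top = {c: rows[cols == c].min() for c in np.unique(cols)}
    bottom = {c: rows[cols == c].max() for c in np.unique(cols)}
    # per-column min/max heights, plus the horizontal extent
    return top, bottom, cols.min(), cols.max()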
I have a large collection of rectangles, all of the same size. I am generating random points that should not fall in these rectangles, so what I wish to do is test if the generated point lies in one of the rectangles, and if it does, generate a new point.
Using R-trees seems to work, but they are really meant for rectangles rather than points. I could use a modified version of an R-tree algorithm that works with points too, but I'd rather not reinvent the wheel if there is already a better solution. I'm not very familiar with data structures, so maybe there already exists some structure that works for my problem?
In summary, basically what I'm asking is whether anyone knows of a good algorithm, that works in Python, that can be used to check if a point lies in any rectangle in a given set of rectangles.
edit: This is in 2D and the rectangles are not rotated.
This Reddit thread addresses your problem:
I have a set of rectangles, and need to determine whether a point is contained within any of them. What are some good data structures to do this, with fast lookup being important?
If your universe is integer, or if the level of precision is well known and not too high, you can use abelsson's suggestion from the thread: an O(1) lookup using coloring:
As usual you can trade space for time... here is an O(1) lookup with a very low constant. Init: create a bitmap large enough to envelop all rectangles with sufficient precision, and initialize it to black. Color all pixels containing any rectangle white. O(1) lookup: is the point (x, y) white? If so, a rectangle was hit.
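As a sketch of that colouring trick, with a boolean numpy grid standing in for the bitmap (the resolution RES, the (x0, y0, x1, y1) rectangle format, and coordinates lying in [0, width] x [0, height] are all my assumptions):

import numpy as np

RES = 10   # grid cells per coordinate unit; raise for more precision

def build_bitmap(rects, width, height):
    # "color white" every cell touched by any rectangle
    grid = np.zeros((int(height * RES), int(width * RES)), dtype=bool)
    for x0, y0, x1, y1 in rects:
        grid[int(y0 * RES):int(np.ceil(y1 * RES)),
             int(x0 * RES):int(np.ceil(x1 * RES))] = True
    return grid

def point_hits(grid, x, y):
    return bool(grid[int(y * RES), int(x * RES)])   # the O(1) lookup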
I recommend you go to that post and read ModernRonin's answer in full; it is the top answer there. I pasted it here:
First, the micro problem. You have an arbitrarily rotated rectangle, and a point. Is the point inside the rectangle?
There are many ways to do this. But the best, I think, is using the 2d vector cross product. First, make sure the points of the rectangle are stored in clockwise order. Then do the vector cross product with 1) the vector formed by the two points of the side and 2) a vector from the first point of the side to the test point. Check the sign of the result - positive is inside (to the right of) the side, negative is outside. If it's inside all four sides, it's inside the rectangle. Or equivalently, if it's outside any of the sides, it's outside the rectangle. More explanation here.
This method will take 3 subtracts per vector, times 2 vectors per side, plus one cross product per side, which is three multiplies and two adds. 11 flops per side, 44 flops per rectangle.
If you don't like the cross product, then you could do something like: figure out the inscribed and circumscribed circles for each rectangle, and check if the point is inside the inscribed one. If so, it's in the rectangle as well. If not, check if it's outside the circumscribed circle. If so, it's outside the rectangle as well. If it falls between the two circles, you're f****d and you have to check it the hard way.
Finding if a point is inside a circle in 2d takes two subtractions and two squarings (= multiplies), and then you compare distance squared to avoid having to do a square root. That's 4 flops, times two circles is 8 flops - but sometimes you still won't know. Also this assumes that you don't pay any CPU time to compute the circumscribed or inscribed circles, which may or may not be true depending on how much pre-computation you're willing to do on your rectangle set.
In any event, it's probably not a great idea to test the point against every rectangle, especially if you have a hundred million of them.
Which brings us to the macro problem: how to avoid testing the point against every single rectangle in the set? In 2D, this is probably a quad-tree problem. In 3d, what generic_handle said - an octree. Off the top of my head, I would probably implement it as a B+ tree. It's tempting to use d = 5, so that each node can have up to 4 children, since that maps so nicely onto the quad-tree abstraction. But if the set of rectangles is too big to fit into main memory (not very likely these days), then having nodes the same size as disk blocks is probably the way to go.
Watch out for annoying degenerate cases, like some data set that has ten thousand nearly identical rectangles with centers at the same exact point. :P
Why is this problem important? It's useful in computer graphics, to check if a ray intersects a polygon. I.e., did that sniper rifle shot you just made hit the person you were shooting at? It's also used in real-time map software, like say GPS units. GPS tells you the coordinates you're at, but the map software has to find where that point is in a huge amount of map data, and do it several times per second.
Again, credit to ModernRonin...
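A minimal sketch of the cross-product test described above (the function name and corner format are mine; corners must be supplied in clockwise order, and the sign convention flips between y-up maths axes and y-down screen axes):

def point_in_rotated_rect(corners, p):
    # corners: list of 4 (x, y) tuples in clockwise order (y-up axes); p: (x, y)
    px, py = p
    for (x1, y1), (x2, y2) in zip(corners, corners[1:] + corners[:1]):
        # 2D cross product of the side vector and the corner-to-point vector
        cross = (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1)
        if cross > 0:   # point lies to the left of this side: outside
            return False
    return True

# usage: unit square, clockwise with y up
square = [(0, 0), (0, 1), (1, 1), (1, 0)]
print(point_in_rotated_rect(square, (0.5, 0.5)))   # True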
For rectangles that are aligned with the axes, you only need two points (four numbers) to identify the rectangle - conventionally, the bottom-left and top-right corners. To establish whether a given point (Xtest, Ytest) overlaps a rectangle (XBL, YBL, XTR, YTR), test both:
Xtest >= XBL && Xtest <= XTR
Ytest >= YBL && Ytest <= YTR
Clearly, for a large enough set of points to test, this could be fairly time consuming. The question, then, is how to optimize the testing.
Clearly, one optimization is to establish the minimum and maximum X and Y values for the box surrounding all the rectangles (the bounding box): a swift test on this shows whether there is any need to look further.
Xtest >= Xmin && Xtest <= Xmax
Ytest >= Ymin && Ytest <= Ymax
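In Python, the per-rectangle test plus the bounding-box pre-test might look like this sketch (names mirror the notation above; rects is assumed to be a list of (XBL, YBL, XTR, YTR) tuples):

def in_rect(xtest, ytest, xbl, ybl, xtr, ytr):
    return xbl <= xtest <= xtr and ybl <= ytest <= ytr

def hits_any(xtest, ytest, rects, bbox):
    # bbox = (Xmin, Ymin, Xmax, Ymax) over all rectangles: the swift pre-test
    if not in_rect(xtest, ytest, *bbox):
        return False
    return any(in_rect(xtest, ytest, *r) for r in rects)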
Depending on how much of the total surface area is covered with rectangles, you might be able to find non-overlapping sub-areas that contain rectangles, and you could then avoid searching those sub-areas that cannot contain a rectangle overlapping the point - again saving comparisons during the search at the cost of pre-computing suitable data structures. If the set of rectangles is sparse enough, there may be no useful grouping, in which case this degenerates into the brute-force search. Equally, if the set of rectangles is so dense that there are no sub-ranges in the bounding box that can be split up without breaking rectangles, the pre-computation buys you nothing.
However, you could also arbitrarily break the bounding area up into, say, quarters (halving in each direction). You would then use a list of boxes that would include more boxes than the original set (two or four boxes for each box that overlapped one of the arbitrary boundaries). The advantage is that you can then eliminate three of the four quarters from the search, reducing the total amount of searching to be done - at the expense of auxiliary storage.
So, there are space-time trade-offs, as ever. And pre-computation versus search trade-offs. If you are unlucky, the pre-computation achieves nothing (for example, there are two boxes only, and they don't overlap on either axis). On the other hand, it could achieve considerable search-time benefit.
I suggest you take a look at BSP trees (and possibly quadtrees or octrees; links are available on that page as well). They are used to partition the whole space recursively and allow you to quickly determine, for a given point, which rectangles you need to check at all.
At one extreme you have just one huge partition and need to check all rectangles; at the other, your partitions get so small that they shrink to the size of single rectangles. Of course, the more fine-grained the partition, the longer you need to walk down the tree to find the rectangles you want to check.
However, you can freely decide how many rectangles per partition are acceptable to check for a point, and then build the corresponding structure.
Pay attention to overlapping rectangles, though. Since the BSP tree needs to be precomputed anyway, you may as well remove overlaps during that step, so you get clean partitions.
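To make the partitioning idea concrete, here is a bare-bones quadtree sketch; every name in it is mine, rectangles are assumed to be axis-aligned (x0, y0, x1, y1) tuples, and a production version would need tuning and the overlap handling noted above:

MAX_RECTS = 8   # leaves split once they hold more rectangles than this

class QuadTree:
    def __init__(self, x0, y0, x1, y1, depth=0):
        self.bounds = (x0, y0, x1, y1)
        self.rects = []
        self.children = None
        self.depth = depth

    def _overlaps(self, r):
        x0, y0, x1, y1 = self.bounds
        return r[0] < x1 and r[2] > x0 and r[1] < y1 and r[3] > y0

    def insert(self, r):
        if not self._overlaps(r):
            return
        if self.children is None:
            self.rects.append(r)
            # depth cap guards against the degenerate near-identical case
            if len(self.rects) > MAX_RECTS and self.depth < 10:
                self._split()
        else:
            for c in self.children:
                c.insert(r)

    def _split(self):
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        d = self.depth + 1
        self.children = [QuadTree(x0, y0, mx, my, d), QuadTree(mx, y0, x1, my, d),
                         QuadTree(x0, my, mx, y1, d), QuadTree(mx, my, x1, y1, d)]
        for r in self.rects:
            for c in self.children:
                c.insert(r)
        self.rects = []

    def query(self, x, y):
        # walk down to the leaf containing (x, y), then brute-force its rectangles
        if self.children is not None:
            x0, y0, x1, y1 = self.bounds
            mx, my = (x0 + x1) / 2, (y0 + y1) / 2
            i = (2 if y >= my else 0) + (1 if x >= mx else 0)
            return self.children[i].query(x, y)
        return any(r[0] <= x <= r[2] and r[1] <= y <= r[3] for r in self.rects)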
Your R-tree approach is the best one I know of (it's the approach I would choose over quadtrees, B+ trees, or BSP trees, as R-trees seem convenient to build in your case). Caveat: I'm no expert, even though I remember a few things from my senior-year university algorithmics class!
Why not try this? It seems rather light on both computation and memory.
Consider the projections of all the rectangles onto the base line of your space. Denote that set of line intervals as
{[Rl1, Rr1], [Rl2, Rr2],..., [Rln, Rrn]}, ordered by increasing left coordinates.
Now suppose your point is (x, y). Start a search at the left of this set and continue until you reach a line interval that contains the point x.
If none does, your point (x, y) is outside all rectangles.
If some do, say [Rlk, Rrk], ..., [Rlh, Rrh] (k <= h), then just check whether y is within the vertical extent of any of those rectangles.
Done.
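A sketch of that scan (the (left, bottom, right, top) rectangle format and all names are my assumptions; bisect on the pre-sorted left edges bounds the scan):

import bisect

def make_index(rects):
    # rects: (left, bottom, right, top) tuples, sorted by left edge
    srects = sorted(rects, key=lambda r: r[0])
    lefts = [r[0] for r in srects]
    return srects, lefts

def point_in_any(x, y, srects, lefts):
    # only rectangles whose left edge is <= x can contain the point at all
    for left, bottom, right, top in srects[:bisect.bisect_right(lefts, x)]:
        if x <= right and bottom <= y <= top:
            return True
    return False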
Good luck.
John Doner